Office Hours — When should you use traditional ML (tabular models, gradient boosting) instead of jumping straight to LLMs for a new feature?

The instinct to reach for an LLM is understandable. They’re powerful, well-marketed, and sitting in your API budget. But they’re also expensive, slow, and prone to hallucination in ways that XGBoost, LightGBM, and scikit-learn models just aren’t. The decision comes down to three concrete questions: Is your input structured or unstructured? Do you need interpretability or speed? Can you afford 50-500ms latency and per-token costs?

Structured data stays tabular

If your feature receives structured inputs (customer attributes, transaction history, numeric features, categorical variables, time series), gradient boosting is almost always the right first move. A well-tuned LightGBM model on a customer churn dataset will typically outperform an LLM on both accuracy and latency, often by a wide margin. XGBoost’s built-in feature importance tells you which inputs drive the model’s predictions. An LLM’s explanation of its own output is a guess wrapped in confident prose.
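To make the interpretability point concrete, here is a toy sketch of gradient boosting on tabular rows, written with only the standard library. It is not how LightGBM or XGBoost work internally (they use deeper trees, histograms, and regularization); it fits one-split stumps to residuals under squared loss and reports each feature's share of the total error reduction. The feature names and the synthetic churn label are made up for illustration.

```python
import random

def fit_stump(X, residuals):
    """Find the single split that most reduces squared error on the residuals."""
    n, total = len(X), sum(residuals)
    best = None  # (gain, feature, threshold, left_value, right_value)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # no valid split between equal values
            right_sum = total - left_sum
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, lo, left_sum / k, right_sum / (n - k))
    return best

def fit_boosted_stumps(X, y, rounds=50, learning_rate=0.1):
    """Return (model, importances); importances are normalized gain per feature."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, gains = [], [0.0] * len(X[0])
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        best = fit_stump(X, residuals)
        if best is None:
            break  # every feature is constant, nothing left to split
        gain, j, thr, left, right = best
        gains[j] += gain
        step = (j, thr, learning_rate * left, learning_rate * right)
        stumps.append(step)
        preds = [p + (step[2] if row[j] <= thr else step[3])
                 for p, row in zip(preds, X)]
    total_gain = sum(gains) or 1.0
    return {"base": base, "stumps": stumps}, [g / total_gain for g in gains]

def predict(model, row):
    """Sum the base value and every stump's shrunken contribution."""
    score = model["base"]
    for j, thr, left, right in model["stumps"]:
        score += left if row[j] <= thr else right
    return score

# Synthetic data: churn depends only on months_inactive (an assumption of this demo).
rng = random.Random(0)
FEATURES = ["months_inactive", "support_tickets", "random_noise"]  # hypothetical names
X = [[rng.randint(0, 12), rng.randint(0, 5), rng.random()] for _ in range(200)]
y = [1.0 if row[0] >= 4 else 0.0 for row in X]

model, importances = fit_boosted_stumps(X, y)
for name, share in zip(FEATURES, importances):
    print(f"{name}: {share:.2f}")
```

The importance scores here are global: they describe which features the model relies on overall, not why one specific row got its score.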

The benchmark from Daily Signal 2026-06-02 is instructive: when comparing LLMs to traditional classifiers on text, the conclusion wasn’t “LLMs win everywhere.” It was “reach for classical text classifiers when simpler approaches suffice.” That framing applies even more strongly to tabular data. You have 20 features describing a user? Tabular model. You have 200 hand-crafted features? Definitely tabular. You have customer behavior across 18 months encoded as structured rows? Not even a question.

Unstructured or fuzzy inputs point toward LLMs

The case for LLMs strengthens when your input is messy, free-form, or requires semantic understanding. Classifying whether a customer support ticket is billing, technical, or product feedback? An LLM can handle that in a single pass without feature engineering. Extracting structured data from PDFs with variable layouts? A vision model plus an LLM beats traditional OCR pipelines, which Daily Signal 2026-06-03 documented struggling across 93 real documents.

But even here, the constraints matter. If you’re building a simple text classifier (spam/not spam, sentiment: positive/negative/neutral, intent: buy/complain/ask), a fine-tuned smaller model, like the Mistral Small 3.1 mentioned in Daily Signal 2026-06-05, will likely beat GPT-5.5 on cost and latency. Daily Signal 2026-06-02 addresses this directly: “LLMs vs. Traditional Classifiers: A Practical Comparison” concluded that classical approaches suffice for well-bounded classification problems.
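For a sense of how little machinery a well-bounded classifier can need, here is a minimal multinomial naive Bayes sketch for spam/not spam using only the standard library. It swaps in naive Bayes as one example of a classical approach; it is not the method from the cited comparison, and the four training messages are invented for illustration. A real system would want far more data and proper tokenization.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns word and label counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}

def classify(model, text):
    """Pick the label with the highest log-probability (Laplace smoothing)."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(model["words"][label].values())
        for word in text.lower().split():
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            count = model["words"][label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training messages, purely for illustration.
training = [
    ("win a free prize now", "spam"),
    ("claim your free money today", "spam"),
    ("meeting moved to friday afternoon", "not_spam"),
    ("please review the attached report", "not_spam"),
]
spam_model = train_naive_bayes(training)
print(classify(spam_model, "free prize today"))
print(classify(spam_model, "review the report on friday"))
```

Training is a single pass of counting and inference is a handful of additions, which is why this kind of model is so cheap to serve.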

Cost and latency are non-negotiable constraints

An LLM call costs roughly $0.00015 to $0.003 per 1,000 tokens, depending on the model and pricing tier. XGBoost inference on a single row is measured in milliseconds and costs next to nothing beyond hosting once the model is trained. If your feature powers a real-time dashboard that refreshes 10,000 times per day, or a search ranking system that needs <50ms response time, tabular models aren’t optional. They’re mandatory.
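If you have a latency budget like the <50ms figure above, it is worth measuring against it rather than guessing. The sketch below is one way to do that with the standard library: it times a prediction function row by row and compares the 95th percentile to the budget. The `toy_predict` function is a hypothetical stand-in for your real model call, and using p95 rather than the mean is a design assumption, not a rule.

```python
import statistics
import time

def measure_latency_ms(predict, rows, warmup=10):
    """Time predict(row) for each row and return p50/p95/max in milliseconds."""
    for row in rows[:warmup]:
        predict(row)  # warm caches before timing
    samples = []
    for row in rows:
        start = time.perf_counter()
        predict(row)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }

def within_budget(stats, budget_ms=50.0):
    """True if the 95th-percentile latency fits the budget."""
    return stats["p95"] <= budget_ms

def toy_predict(row):
    """Hypothetical stand-in for a real model: a fixed linear score."""
    weights = (0.4, -0.2, 0.1)
    return sum(w * x for w, x in zip(weights, row)) > 0

rows = [(i % 7, i % 3, i % 5) for i in range(1000)]
stats = measure_latency_ms(toy_predict, rows)
print(stats, "fits 50ms budget:", within_budget(stats))
```

Run the same harness against a network call to an LLM API and the tail latency tends to tell the story before the invoice does.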

A concrete example: a fintech platform needed to classify whether a transaction was fraudulent. The team started with Claude Opus 4.8 because fraud detection sounded “smart.” They hit two problems immediately. First, 500ms of added latency introduced unacceptable friction in checkout flows. Second, at 50 million daily transactions, LLM costs ballooned to $15,000 per day. They switched to LightGBM trained on transaction history, dropped latency to 2ms per prediction, and cut costs to $200/month. The accuracy actually improved because the model learned feature interactions from real fraud patterns instead of trying to reason from first principles.

Here’s a rough cost comparison for a fraud-detection workload of the same volume, this time priced on a cheaper model:

LLM approach (GPT-5.4 Instant):
  - 50M daily transactions
  - ~500 tokens per call (transaction details + context)
  - $0.05 per 1M tokens (public pricing)
  - Daily cost: ~$1,250
  - Latency: 400-600ms per prediction
  - Monthly: ~$37,500

LightGBM approach:
  - Training: ~2 hours on GPU, one-time cost
  - Hosting: $50/month on modest compute
  - Latency: <1ms per prediction
  - Monthly: ~$50

The gap isn’t subtle. It’s a 750x cost difference.
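The arithmetic behind the comparison above is simple enough to keep in a small helper so you can plug in your own volumes and prices. This sketch reuses the figures from the comparison (50M calls per day, ~500 tokens per call, $0.05 per 1M tokens, $50/month hosting) and assumes a 30-day month; the function name is made up for illustration.

```python
def llm_cost_usd(calls_per_day, tokens_per_call, usd_per_million_tokens, days=30):
    """Return (daily_cost, monthly_cost) for a flat per-token price."""
    daily_tokens = calls_per_day * tokens_per_call
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost, daily_cost * days

# Figures taken from the comparison above; 30 days per month is an assumption.
daily, monthly = llm_cost_usd(
    calls_per_day=50_000_000,
    tokens_per_call=500,
    usd_per_million_tokens=0.05,
)
tabular_monthly = 50.0  # hosting cost from the LightGBM column

print(f"LLM daily:   ${daily:,.0f}")
print(f"LLM monthly: ${monthly:,.0f}")
print(f"Ratio:       {monthly / tabular_monthly:,.0f}x")
```

Note that this ignores the one-time training cost on the LightGBM side and any retry or context overhead on the LLM side, so treat the ratio as an order-of-magnitude estimate.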

When you actually need the LLM layer

There are real reasons to build with LLMs. If your feature requires understanding nuance, context, or multi-step reasoning, you need them. If the problem domain is too new or weird for labeled training data, LLMs can bootstrap where traditional ML would require months of annotation. If you’re building something conversational or agentic where the model needs to reason about unknown unknowns, that’s your signal to go generative.

A recommendation engine for an e-commerce site? Tabular models (collaborative filtering, gradient boosting on user-item features). A chatbot that helps users understand their bill? LLM all the way. A system that flags suspicious account activity? Gradient boosting on temporal patterns and anomaly scores. A tool that writes summaries of customer conversations? LLM.
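The pattern in those examples can be sketched as a rule of thumb in code. This is a simplification under stated assumptions, not a definitive decision procedure: the `FeatureProfile` fields, the `recommend` function, and the 50ms cutoff for LLM latency are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class FeatureProfile:
    structured_input: bool          # rows and columns vs. free-form text or documents
    needs_language_reasoning: bool  # nuance, generation, multi-step reasoning
    latency_budget_ms: float        # how long a single prediction may take
    has_labeled_data: bool = True   # enough labels to train a classical model

def recommend(profile, llm_min_latency_ms=50.0):
    """Return "traditional_ml" or "llm" using the heuristics in this column."""
    if profile.structured_input and not profile.needs_language_reasoning:
        return "traditional_ml"
    if profile.latency_budget_ms < llm_min_latency_ms:
        return "traditional_ml"  # an LLM call will not fit the budget
    if profile.needs_language_reasoning or not profile.has_labeled_data:
        return "llm"
    return "traditional_ml"  # well-bounded text task with labels available

# The four examples from the paragraph above, with assumed latency budgets.
examples = {
    "recommendation engine": FeatureProfile(True, False, 100.0),
    "billing chatbot": FeatureProfile(False, True, 3000.0),
    "suspicious activity flags": FeatureProfile(True, False, 20.0),
    "conversation summaries": FeatureProfile(False, True, 10000.0),
}
for name, profile in examples.items():
    print(f"{name}: {recommend(profile)}")
```

The last branch covers the simple text classifiers discussed earlier: unstructured input, but bounded labels and enough training data that a small classical model is the better starting point.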

The mistake is defaulting to LLMs because they feel like the future. They’re the future for some problems. For the structured, measurable, interpretable problems that make up most of production ML, traditional models are still faster, cheaper, and more reliable. And if 80% of AI projects fail at scale (Daily Signal 2026-06-04), part of the reason is teams picked the wrong tool for the job because it sounded cooler.

Bottom line: Use tabular models as your default for structured data, real-time constraints, and problems where interpretability matters. Reach for LLMs only when your input is unstructured or the problem requires reasoning, and you can absorb 50-500ms latency and per-token costs. Start simple; add complexity only when the simpler approach demonstrably fails.

Question via Hacker News