Office Hours — How do you structure your LLM training infrastructure to avoid getting locked into a single provider or model?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
This is the vendor lock-in question everyone asks but almost nobody solves cleanly. The honest answer: you can’t fully avoid it, but you can make switching materially cheaper than it looks today.
The Real Constraint
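The problem isn’t the model API. OpenAI, Anthropic, Google, and most open-source serving stacks now expose broadly similar chat-style interfaces (messages in, text and token counts out), even though request fields differ in the details. You can swap Claude Opus 4.8 for GPT-5.5 in a morning if your code is clean.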
The real lock-in lives in three places: application logic tuned to one model’s behavior, evaluation infrastructure built around specific model outputs, and operational workflows that assume a particular provider’s tooling (caching, rate limits, pricing tiers, data retention policies).
You hit this wall hard when a model changes behavior between versions, when pricing shifts, or when a provider throttles features (like Anthropic did with Claude Fable 5). Suddenly your entire evaluation rig reports false negatives. Or your cost assumptions evaporate.
Abstraction Layer, Not Abstraction Fantasies
Start with a clean provider abstraction. Don’t build a fake “LLMClient” that tries to hide all provider differences behind a generic interface. That’s a lie. Providers are different, and pretending they’re not means you’ll hit the differences in production.
Instead, build a thin adapter layer that:
- Routes calls through a provider selector you can toggle without redeploying.
- Normalizes inputs and outputs to a canonical format your application understands.
- Logs all provider-specific behavior (model versions, token counts, latency, failure modes) in a central store.
- Never hardcodes provider assumptions into business logic.
Here’s a concrete pattern:
```python
# providers/interface.py
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ModelResponse:
    content: str
    model: str
    input_tokens: int
    output_tokens: int
    provider: str
    latency_ms: float
    metadata: dict  # Provider-specific details


class LLMProvider(Protocol):
    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        ...
```
```python
# providers/openai_adapter.py
import time

from openai import OpenAI

from .interface import ModelResponse


class OpenAIAdapter:
    def __init__(self, model: str = "gpt-5.5"):
        self.client = OpenAI()  # Reads OPENAI_API_KEY from the environment
        self.model = model

    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        start = time.perf_counter()
        response = self.client.chat.completions.create(
            model=self.model,
            # Chat Completions has no `system` keyword; the system prompt
            # travels as the first message instead.
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            temperature=kwargs.get("temperature", 0.7),
            # Some newer OpenAI models expect `max_completion_tokens` instead;
            # check the API reference for the model you target.
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        latency = (time.perf_counter() - start) * 1000
        return ModelResponse(
            content=response.choices[0].message.content or "",
            model=self.model,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            provider="openai",
            latency_ms=latency,
            metadata={"finish_reason": response.choices[0].finish_reason},
        )
```
```python
# providers/claude_adapter.py
import time

from anthropic import Anthropic

from .interface import ModelResponse


class ClaudeAdapter:
    def __init__(self, model: str = "claude-opus-4-8"):
        self.client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
        self.model = model

    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        start = time.perf_counter()
        response = self.client.messages.create(
            model=self.model,
            system=system,  # The Messages API takes the system prompt as a top-level field
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get("temperature", 0.7),
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        latency = (time.perf_counter() - start) * 1000
        return ModelResponse(
            content=response.content[0].text,
            model=self.model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            provider="anthropic",
            latency_ms=latency,
            metadata={"stop_reason": response.stop_reason},
        )
```
```python
# router.py
from providers.claude_adapter import ClaudeAdapter
from providers.interface import ModelResponse
from providers.openai_adapter import OpenAIAdapter


class ProviderRouter:
    def __init__(self, config: dict):
        self.adapters = {
            "openai": OpenAIAdapter(config.get("openai_model", "gpt-5.5")),
            "anthropic": ClaudeAdapter(config.get("anthropic_model", "claude-opus-4-8")),
        }
        self.primary = config.get("primary_provider", "openai")
        self.fallback = config.get("fallback_provider", "anthropic")

    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        # A per-call `provider` kwarg overrides the configured primary.
        provider = kwargs.pop("provider", None) or self.primary
        try:
            return self.adapters[provider].complete(prompt, system, **kwargs)
        except Exception:
            # Retry once on the fallback; if the fallback itself failed, re-raise.
            if provider != self.fallback:
                return self.adapters[self.fallback].complete(prompt, system, **kwargs)
            raise
```
This is deliberately boring. It doesn’t hide provider differences. It just makes them swappable and observable.
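To make the "toggle without redeploying" point concrete, here is a minimal sketch rather than a production router. It assumes the selector lives in a small JSON file that is re-read on every call. The file name, the `read_primary` helper, and the `StubAdapter` class are all hypothetical stand-ins so the example runs without API keys.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StubResponse:
    content: str
    provider: str


class StubAdapter:
    """Hypothetical stand-in for a real adapter; can be told to fail."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail

    def complete(self, prompt: str, system: str) -> StubResponse:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return StubResponse(content=f"[{self.name}] {prompt}", provider=self.name)


def read_primary(config_path: Path, default: str = "openai") -> str:
    """Re-read the selector on every call so an edit applies without a redeploy."""
    try:
        return json.loads(config_path.read_text()).get("primary_provider", default)
    except (OSError, json.JSONDecodeError):
        return default


def complete(adapters: dict, config_path: Path, fallback: str, prompt: str, system: str):
    primary = read_primary(config_path)
    try:
        return adapters[primary].complete(prompt, system)
    except Exception:
        if primary == fallback:
            raise
        return adapters[fallback].complete(prompt, system)


adapters = {"openai": StubAdapter("openai"), "anthropic": StubAdapter("anthropic")}
config_path = Path(tempfile.mkdtemp()) / "llm_routing.json"  # hypothetical file name

# 1. Primary is openai.
config_path.write_text(json.dumps({"primary_provider": "openai"}))
first = complete(adapters, config_path, "openai", "ping", "be brief")

# 2. Flip the selector on disk; the next call picks it up.
config_path.write_text(json.dumps({"primary_provider": "anthropic"}))
second = complete(adapters, config_path, "openai", "ping", "be brief")

# 3. Simulate an anthropic outage; the call lands on the fallback.
adapters["anthropic"].fail = True
third = complete(adapters, config_path, "openai", "ping", "be brief")

print(first.provider, second.provider, third.provider)  # openai anthropic openai
```

Reading a file on every request is the simplest thing that works. A feature-flag service or a cached read with a short TTL would serve the same purpose at higher traffic.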
Evaluation Isolation
The second lock-in vector is evaluation. If you build your test suite around GPT-5.5’s specific behavior, switching models becomes terrifying.
Separate evaluation into three buckets: objective correctness, cost, and model-specific behavior.
```python
# evals/correctness.py
def eval_code_generation(output: str, expected_behavior: str) -> bool:
    """Test if generated code actually works. Provider-agnostic."""
    # `run_tests` is your own sandboxed test harness (not shown here).
    test_result = run_tests(output)
    return test_result.passed
```
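```python
# evals/cost.py
from providers.interface import ModelResponse

# Placeholder rates in USD per million tokens; substitute your provider's current pricing.
INPUT_USD_PER_MTOK = 0.002
OUTPUT_USD_PER_MTOK = 0.006


def eval_cost(response: ModelResponse, threshold_usd: float = 0.01) -> bool:
    """Did we stay in budget?"""
    cost = (
        response.input_tokens * INPUT_USD_PER_MTOK
        + response.output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000
    return cost < threshold_usd
```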
```python
# evals/quality_by_provider.py
from providers.interface import ModelResponse

PROVIDER_EXPECTATIONS = {
    "openai": {"latency_p99_ms": 2500, "hallucination_rate": 0.08},
    "anthropic": {"latency_p99_ms": 3200, "hallucination_rate": 0.06},
}


def eval_provider_characteristics(response: ModelResponse) -> dict:
    """Log what we observe about this provider's behavior.

    Use for comparative analysis, not hard pass/fail.
    """
    expectations = PROVIDER_EXPECTATIONS[response.provider]
    return {
        "faster_than_expectation": response.latency_ms < expectations["latency_p99_ms"],
        # `measure_hallucination` is your own scoring function (not shown here).
        "observed_hallucination_rate": measure_hallucination(response.content),
    }
```
Run objective correctness tests against all providers. Track cost separately. Keep provider-specific behavior observations in a time-series database so you can spot when a model’s characteristics change.
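Running one fixed set of correctness checks against every provider, then reporting each provider's pass rate relative to a baseline, might look like the sketch below. Everything in it is an assumption for illustration: the `pass_rate` and `deltas_vs_baseline` helpers, the two toy cases, and the canned outputs that stand in for real provider calls.

```python
from typing import Callable, Dict, List, Tuple

# A case pairs a prompt with a provider-agnostic check on the raw output.
EvalCase = Tuple[str, Callable[[str], bool]]
Generate = Callable[[str], str]


def pass_rate(generate: Generate, cases: List[EvalCase]) -> float:
    """Fraction of cases whose check passes for one provider."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def deltas_vs_baseline(
    providers: Dict[str, Generate], cases: List[EvalCase], baseline: str
) -> Dict[str, float]:
    """Pass-rate difference of every provider relative to the baseline."""
    rates = {name: pass_rate(fn, cases) for name, fn in providers.items()}
    return {name: rate - rates[baseline] for name, rate in rates.items()}


CASES: List[EvalCase] = [
    ("What is 2 + 2? Reply with a number.", lambda out: out.strip() == "4"),
    ("Name the capital of France.", lambda out: "paris" in out.lower()),
]

# Canned outputs standing in for real provider calls (made up for the demo).
CANNED = {
    "provider_a": ["4", "Paris"],
    "provider_b": ["four", "The capital is Paris."],
}


def make_stub(name: str) -> Generate:
    answers = dict(zip((prompt for prompt, _ in CASES), CANNED[name]))
    return lambda prompt: answers[prompt]


providers = {name: make_stub(name) for name in CANNED}
deltas = deltas_vs_baseline(providers, CASES, baseline="provider_a")
print(deltas)  # {'provider_a': 0.0, 'provider_b': -0.5}
```

Because the checks only look at outputs, the same case list works unchanged when a new provider or model version is added to the `providers` dict.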
Operational Telemetry
This is where most teams fail. They don’t instrument deeply enough to notice lock-in until it’s expensive.
Log everything:
- Which provider was called and why (was it primary or fallback?).
- Token counts and actual cost per request.
- Latency percentiles by provider and model version.
- Error rates and failure modes (rate limit, timeout, content policy, silent refusal).
- Quality deltas when you switch providers on a fixed evaluation set.
Use this data to answer: “If we switched to provider X right now, what would break?”
```python
# monitoring/provider_metrics.py
from dataclasses import dataclass
import json


@dataclass
class ProviderMetrics:
    provider: str
```
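The fields in the "log everything" list can be rolled up per provider with very little code. What follows is a minimal sketch, assuming hypothetical names (`RequestLog`, `summarize`, `percentile`) and made-up sample numbers. It is separate from the `ProviderMetrics` listing above and not a reconstruction of it.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestLog:
    provider: str
    model: str
    latency_ms: float
    cost_usd: float
    used_fallback: bool
    error: Optional[str] = None  # e.g. "rate_limit", "timeout", "content_policy"


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(logs: List[RequestLog]) -> Dict[str, dict]:
    """Group request logs by provider and compute basic health numbers."""
    grouped = defaultdict(list)
    for log in logs:
        grouped[log.provider].append(log)

    summary = {}
    for provider, rows in grouped.items():
        latencies = sorted(r.latency_ms for r in rows)
        summary[provider] = {
            "requests": len(rows),
            "error_rate": sum(r.error is not None for r in rows) / len(rows),
            "errors_by_type": dict(Counter(r.error for r in rows if r.error)),
            "fallback_rate": sum(r.used_fallback for r in rows) / len(rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "total_cost_usd": sum(r.cost_usd for r in rows),
        }
    return summary


# Sample data is invented for the demo.
logs = [
    RequestLog("openai", "gpt-5.5", 100.0, 0.001, False),
    RequestLog("openai", "gpt-5.5", 200.0, 0.001, False),
    RequestLog("openai", "gpt-5.5", 300.0, 0.0, False, error="rate_limit"),
    RequestLog("anthropic", "claude-opus-4-8", 400.0, 0.002, True),
]
report = summarize(logs)
print(report["openai"]["p50_ms"], report["openai"]["errors_by_type"])
```

In practice the rows would come from the central store the adapters write to, and you would add the model version to the grouping key so a silent version bump shows up as its own series.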
*Question via [Hacker News](https://news.ycombinator.com/item?id=47168402)*