Office Hours — How do you structure your LLM training infrastructure to avoid getting locked into a single provider or model?

How do you structure your LLM training infrastructure to avoid getting locked into a single provider or model?

This is the vendor lock-in question everyone asks but almost nobody solves cleanly. The honest answer: you can’t fully avoid it, but you can make switching materially cheaper than it looks today.

The Real Constraint

The problem isn’t the model API. OpenAI, Anthropic, Google, and most open-source serving stacks now expose broadly similar chat-style interfaces: a list of messages goes in, generated text and token usage come out. You can swap Claude Opus 4.8 for GPT-5.5 in a morning if your code is clean.

The real lock-in lives in three places: application logic tuned to one model’s behavior, evaluation infrastructure built around specific model outputs, and operational workflows that assume a particular provider’s tooling (caching, rate limits, pricing tiers, data retention policies).

You hit this wall hard when a model changes behavior between versions, when pricing shifts, or when a provider throttles features (like Anthropic did with Claude Fable 5). Suddenly your entire evaluation rig reports false negatives. Or your cost assumptions evaporate.

Abstraction Layer, Not Abstraction Fantasies

Start with a clean provider abstraction. Don’t build a fake “LLMClient” that tries to hide all provider differences behind a generic interface. That’s a lie. Providers are different, and pretending they’re not means you’ll hit the differences in production.

Instead, build a thin adapter layer that:

Routes calls through a provider selector you can toggle without redeploying.
Normalizes inputs and outputs to a canonical format your application understands.
Logs all provider-specific behavior (model versions, token counts, latency, failure modes) in a central store.
Never hardcodes provider assumptions into business logic.

Here’s a concrete pattern:

# providers/interface.py
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ModelResponse:
    content: str
    model: str
    input_tokens: int
    output_tokens: int
    provider: str
    latency_ms: float
    metadata: dict  # Provider-specific details

class LLMProvider(Protocol):
    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        ...

# providers/openai_adapter.py
import time

from openai import OpenAI
from .interface import ModelResponse

class OpenAIAdapter:
    def __init__(self, model: str = "gpt-5.5"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        start = time.perf_counter()
        # Chat Completions has no top-level `system` argument; the system
        # prompt is sent as the first message instead.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            temperature=kwargs.get("temperature", 0.7),
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        latency = (time.perf_counter() - start) * 1000

        return ModelResponse(
            # content can be None (for example on a tool call), so normalize to ""
            content=response.choices[0].message.content or "",
            model=self.model,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            provider="openai",
            latency_ms=latency,
            metadata={"finish_reason": response.choices[0].finish_reason},
        )

# providers/claude_adapter.py
import time

from anthropic import Anthropic
from .interface import ModelResponse

class ClaudeAdapter:
    def __init__(self, model: str = "claude-opus-4-8"):
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model

    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        start = time.perf_counter()
        # Unlike Chat Completions, the Messages API takes the system prompt as
        # a top-level argument and requires max_tokens.
        response = self.client.messages.create(
            model=self.model,
            system=system,
            messages=[{"role": "user", "content": prompt}],
            temperature=kwargs.get("temperature", 0.7),
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        latency = (time.perf_counter() - start) * 1000

        return ModelResponse(
            # Assumes the first content block is text, which holds for plain
            # completions without tool use.
            content=response.content[0].text,
            model=self.model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            provider="anthropic",
            latency_ms=latency,
            metadata={"stop_reason": response.stop_reason},
        )

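# router.py
import logging

# Assumes router.py sits next to the providers/ package.
from providers.interface import ModelResponse
from providers.openai_adapter import OpenAIAdapter
from providers.claude_adapter import ClaudeAdapter

logger = logging.getLogger(__name__)

class ProviderRouter:
    def __init__(self, config: dict):
        self.adapters = {
            "openai": OpenAIAdapter(config.get("openai_model", "gpt-5.5")),
            "anthropic": ClaudeAdapter(config.get("anthropic_model", "claude-opus-4-8")),
        }
        self.primary = config.get("primary_provider", "openai")
        self.fallback = config.get("fallback_provider", "anthropic")

    def complete(self, prompt: str, system: str, **kwargs) -> ModelResponse:
        # A per-call `provider` keyword overrides the configured primary.
        provider = kwargs.pop("provider", None) or self.primary

        try:
            return self.adapters[provider].complete(prompt, system, **kwargs)
        except Exception:
            if provider == self.fallback:
                raise  # the fallback itself failed; nothing left to try
            # Log the failover so it is visible rather than silent.
            logger.warning("provider %s failed, falling back to %s",
                           provider, self.fallback, exc_info=True)
            return self.adapters[self.fallback].complete(prompt, system, **kwargs)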

This is deliberately boring. It doesn’t hide provider differences. It just makes them swappable and observable.
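
The router above reads its provider choice once, at construction time. To make the "toggle without redeploying" point concrete, here is a minimal sketch of a selector that re-reads a small JSON file whenever the file changes. The class name `ReloadableSelector`, the file layout, and the key names are assumptions for illustration (the keys mirror the `config` dict the router already uses); a real deployment might use a feature-flag service or a config store instead.

```python
import json
from pathlib import Path
from typing import Optional


class ReloadableSelector:
    """Provider selector backed by a small JSON file (hypothetical layout).

    The file is re-read only when its modification time changes, so editing it
    changes routing for the next call without a restart or redeploy.
    """

    def __init__(self, path: str, default_primary: str = "openai"):
        self.path = Path(path)
        self.default_primary = default_primary
        self._mtime_ns: Optional[int] = None
        self._routing: dict = {}

    def _refresh(self) -> None:
        try:
            mtime_ns = self.path.stat().st_mtime_ns
        except FileNotFoundError:
            return  # no file yet: keep whatever routing was last loaded
        if mtime_ns == self._mtime_ns:
            return  # unchanged since the last successful read
        try:
            routing = json.loads(self.path.read_text())
        except json.JSONDecodeError:
            return  # half-written or invalid file: keep the last good routing
        if isinstance(routing, dict):
            self._routing, self._mtime_ns = routing, mtime_ns

    def primary(self) -> str:
        self._refresh()
        return self._routing.get("primary_provider", self.default_primary)

    def fallback(self) -> Optional[str]:
        self._refresh()
        return self._routing.get("fallback_provider")
```

Because `ProviderRouter.complete` already accepts a per-call `provider` keyword, the selector can be wired in with `router.complete(prompt, system, provider=selector.primary())` and no change to the router itself.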

Evaluation Isolation

The second lock-in vector is evaluation. If you build your test suite around GPT-5.5’s specific behavior, switching models becomes terrifying.

Separate evaluation into three buckets: objective correctness, cost, and model-specific behavior.

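# evals/correctness.py
def eval_code_generation(output: str, expected_behavior: str) -> bool:
    """Test if generated code actually works.
    Provider-agnostic."""
    # run_tests is a project-specific helper that is not shown here; it is
    # expected to execute the generated code (ideally sandboxed) and report
    # the outcome. expected_behavior is kept for callers but unused here.
    test_result = run_tests(output)
    return test_result.passed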

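# evals/cost.py
from providers.interface import ModelResponse

# Placeholder rates in USD per million tokens; use each provider's real price sheet.
INPUT_USD_PER_MTOK, OUTPUT_USD_PER_MTOK = 0.002, 0.006

def eval_cost(response: ModelResponse, threshold_usd: float = 0.01) -> bool:
    """Did we stay in budget?"""
    cost = (response.input_tokens * INPUT_USD_PER_MTOK
            + response.output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return cost < threshold_usd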

# evals/quality_by_provider.py
from providers.interface import ModelResponse

# Hand-set baselines to compare against, not measured guarantees.
PROVIDER_EXPECTATIONS = {
    "openai": {"latency_p99_ms": 2500, "hallucination_rate": 0.08},
    "anthropic": {"latency_p99_ms": 3200, "hallucination_rate": 0.06},
}

def eval_provider_characteristics(response: ModelResponse) -> dict:
    """Log what we observe about this provider's behavior.
    Use for comparative analysis, not hard pass/fail."""
    expectations = PROVIDER_EXPECTATIONS[response.provider]
    return {
        "faster_than_expectation": response.latency_ms < expectations["latency_p99_ms"],
        # measure_hallucination is a project-specific scorer that is not shown here.
        "observed_hallucination_rate": measure_hallucination(response.content),
    }

Run objective correctness tests against all providers. Track cost separately. Keep provider-specific behavior observations in a time-series database so you can spot when a model’s characteristics change.
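
To show what running the same objective checks against every provider while tracking cost separately might look like, here is a minimal sketch. Everything in it is an assumption for illustration: `Response` is a trimmed stand-in for `ModelResponse`, the two stub providers return canned text instead of calling an API, and the price table holds placeholder numbers rather than any real price sheet.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    """Trimmed stand-in for ModelResponse, with only the fields used here."""
    content: str
    provider: str
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per million tokens. NOT real price-sheet values.
PRICES_USD_PER_MTOK = {
    "stub_a": {"input": 1.0, "output": 3.0},
    "stub_b": {"input": 2.0, "output": 4.0},
}


def cost_usd(r: Response) -> float:
    price = PRICES_USD_PER_MTOK[r.provider]
    return (r.input_tokens * price["input"]
            + r.output_tokens * price["output"]) / 1_000_000


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # objective check, independent of provider


def run_suite(providers: dict, cases: list) -> dict:
    """Run every case against every provider; report pass rate and total cost."""
    report = {}
    for name, call in providers.items():
        passed, total_cost = 0, 0.0
        for case in cases:
            response = call(case.prompt)
            passed += int(case.check(response.content))
            total_cost += cost_usd(response)
        report[name] = {"pass_rate": passed / len(cases), "cost_usd": total_cost}
    return report


# Canned stubs standing in for real adapters, so the sketch runs offline.
def stub_a(prompt: str) -> Response:
    return Response("4" if "2 + 2" in prompt else "unsure", "stub_a", 10, 2)


def stub_b(prompt: str) -> Response:
    return Response("4" if "2 + 2" in prompt else "Paris", "stub_b", 10, 5)


CASES = [
    Case("What is 2 + 2?", lambda out: out.strip() == "4"),
    Case("Capital of France?", lambda out: "paris" in out.lower()),
]

report = run_suite({"stub_a": stub_a, "stub_b": stub_b}, CASES)
```

The checks never mention a provider, so the same `CASES` list stays valid when a provider is added or replaced; only the price table and the adapters change.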

Operational Telemetry

This is where most teams fail. They don’t instrument deeply enough to notice lock-in until it’s expensive.

Log everything:

Which provider was called and why (was it primary or fallback?).
Token counts and actual cost per request.
Latency percentiles by provider and model version.
Error rates and failure modes (rate limit, timeout, content policy, silent refusal).
Quality deltas when you switch providers on a fixed evaluation set.

Use this data to answer: “If we switched to provider X right now, what would break?”

# monitoring/provider_metrics.py
from dataclasses import dataclass
import json

@dataclass
class ProviderMetrics:
    provider: str
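
One way the fields from the list above could be captured per call and rolled up by provider and model version is sketched below. The `CallRecord` shape, the error labels, and the nearest-rank percentile are assumptions chosen for illustration; they are not taken from the listing above, and a production setup would more likely push these values into a metrics system than summarize them in memory.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    """One row per provider call (hypothetical shape)."""
    provider: str
    model: str
    latency_ms: float
    used_fallback: bool = False
    error: Optional[str] = None  # e.g. "rate_limit", "timeout"; None on success


def percentile(sorted_values: list, pct: float) -> Optional[float]:
    """Nearest-rank percentile of an already sorted list; None if empty."""
    if not sorted_values:
        return None
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def summarize(records: list) -> dict:
    """Group records by (provider, model) and compute rates and latencies."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.provider, r.model)].append(r)

    summary = {}
    for key, rs in groups.items():
        # Latency percentiles use successful calls only.
        ok_latencies = sorted(r.latency_ms for r in rs if r.error is None)
        errors = Counter(r.error for r in rs if r.error is not None)
        summary[key] = {
            "calls": len(rs),
            "error_rate": sum(errors.values()) / len(rs),
            "errors_by_kind": dict(errors),
            "fallback_rate": sum(r.used_fallback for r in rs) / len(rs),
            "p50_ms": percentile(ok_latencies, 50),
            "p95_ms": percentile(ok_latencies, 95),
        }
    return summary
```

Keying the summary on the model string as well as the provider means a version bump shows up as a new row, which makes a change in latency or error mix easier to spot.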
    

*Question via [Hacker News](https://news.ycombinator.com/item?id=47168402)*