CI/CD for AI Applications
CI/CD for AI applications: prompt versioning, eval pipelines in CI, canary model deployments, rollback strategies, and why traditional software delivery breaks for LLM-powered systems.
Traditional CI/CD assumes deterministic software. Run the tests, get a pass/fail, ship the artifact. LLM-powered applications break every assumption in that pipeline. The “code” includes natural language prompts that change behavior without changing any source file. The “dependencies” include external models that providers update or deprecate without notice — GPT-4 and GPT-4o were retired in February 2026, breaking applications that referenced them. The “tests” require subjective evaluation of natural language output, where the same input can produce different correct answers.
This post covers the full CI/CD lifecycle for AI applications: versioning prompts alongside code, running evaluations in CI that catch regressions before production, deploying model changes safely with canary patterns, and rolling back when things go wrong. The patterns here apply whether the application is a RAG pipeline, an agent workflow, or a single prompt behind an API endpoint.
Table of Contents
- Why Traditional CI/CD Breaks
- Prompt Versioning
- Evaluation Pipelines in CI
- Building an Eval Suite
- Model Migration as a First-Class CI Event
- Canary Deployments for AI
- Rollback Strategies
- The Config vs Code Boundary
- Pipeline Architecture
- Cost Management in CI
- Summary
- Further Reading
Why Traditional CI/CD Breaks
Standard CI/CD pipelines are built around three properties: determinism (same input → same output), fast feedback (tests complete in minutes), and binary outcomes (pass or fail). LLM applications violate all three.
Non-determinism. Even with temperature=0, LLM outputs vary across API calls due to batching, quantization differences, and infrastructure routing. A prompt that produces valid JSON 99.2% of the time will occasionally emit malformed output that no unit test predicted.
Slow feedback. Running an eval suite against a live model takes minutes to hours depending on dataset size. A 500-example eval against a frontier model such as GPT-5.4 costs real money and wall-clock time. The tight feedback loop developers expect from git push → green checkmark stretches considerably.
Continuous outputs. A code change either compiles or doesn’t. An LLM output exists on a quality spectrum. “The response is 80% as good as before” doesn’t map to pass/fail without explicit thresholds.
Traditional CI/CD assumes binary test outcomes; AI pipelines require scored evaluations with configurable thresholds.
The dependency problem is unique to AI applications. When Anthropic ships Claude Opus 4.7 or OpenAI releases GPT-5.5, the application’s behavior changes without any code commit. Model deprecations force migration on the provider’s timeline, not the team’s. A CI/CD system for AI must treat model versions as dependencies that trigger pipeline runs.
Prompt Versioning
Prompts are code. They change application behavior, they have bugs, and they need version history. But they don’t fit neatly into existing version control patterns.
Storage Options
In-repo (recommended for most teams). Store prompts as files in the repository — YAML, Markdown, or a custom format. They get the full Git history, code review, and branch-based workflows.
# prompts/summarizer/v3.yaml
id: summarizer
version: 3
model: claude-sonnet-4.6
temperature: 0.3
max_tokens: 1024
system: |
  You are a document summarizer. Produce a structured summary
  with: key_points (list), sentiment (positive/negative/neutral),
  and action_items (list). Output valid JSON only.
user_template: |
  Summarize the following document:
  {{document}}
metadata:
  author: eng-team
  created: 2026-04-15
  eval_baseline: summarizer-eval-v3
  min_score: 0.82
Prompt management platforms. Tools like Braintrust, Humanloop, and Promptfoo offer web UIs for editing prompts, built-in versioning, and A/B testing. The tradeoff: prompts live outside the code repository, which breaks the single-source-of-truth property that makes CI/CD reliable.
Hybrid approach. Store the canonical prompt in the repo; sync it to a management platform for non-engineer access. The repo version is authoritative. CI rejects deployments if the platform version drifts from the repo version.
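The drift check in the hybrid approach can be as simple as comparing fingerprints of the two copies. The sketch below is one possible shape, not a platform API: `prompt_fingerprint` and `has_drifted` are hypothetical helper names, and fetching the platform copy is left to whatever client the platform provides.

```python
import hashlib


def prompt_fingerprint(system: str, user_template: str) -> str:
    """Hash the prompt text after normalizing line endings and trailing whitespace."""
    combined = (system + "\n---\n" + user_template).replace("\r\n", "\n")
    normalized = "\n".join(line.rstrip() for line in combined.split("\n")).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_drifted(repo_prompt: dict, platform_prompt: dict) -> bool:
    """True when the platform copy no longer matches the authoritative repo copy."""
    repo_hash = prompt_fingerprint(
        repo_prompt["system"], repo_prompt.get("user_template", "")
    )
    platform_hash = prompt_fingerprint(
        platform_prompt["system"], platform_prompt.get("user_template", "")
    )
    return repo_hash != platform_hash
```

Normalizing whitespace before hashing avoids false alarms from editors that rewrite line endings; a CI step would exit non-zero when `has_drifted` returns True.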
Version Semantics
Semantic versioning maps poorly to prompts. A one-word change to a system prompt can alter behavior more than a major code refactoring. A practical scheme:
| Change Type | Example | Version Bump | CI Requirement |
|---|---|---|---|
| Formatting, typos | Fix whitespace in template | Patch (v3.0.1) | Standard tests |
| Behavioral tuning | Adjust tone instructions | Minor (v3.1.0) | Full eval suite |
| Model change | claude-sonnet-4.6 → claude-opus-4.7 | Major (v4.0.0) | Full eval + human review |
| Schema change | New output field added | Major (v4.0.0) | Full eval + integration tests |
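Part of this table can be enforced mechanically. The following is a rough sketch, assuming prompt configs are loaded as dicts shaped like the YAML above; `required_bump` is a hypothetical helper and `output_schema` is an assumed field name that the example config does not define. It only recognizes whitespace-only edits as patch-level, since a script cannot tell a typo fix from a behavioral change; those still need a reviewer's judgment.

```python
def required_bump(old: dict, new: dict) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for a prompt config change."""
    if old == new:
        return "none"
    # A different model or output schema changes the contract with callers.
    if old.get("model") != new.get("model"):
        return "major"
    if old.get("output_schema") != new.get("output_schema"):
        return "major"

    def squash(text) -> str:
        # Collapse all runs of whitespace so formatting-only edits compare equal.
        return " ".join((text or "").split())

    text_fields = ("system", "user_template")
    ignored = text_fields + ("version", "metadata")
    text_same = all(squash(old.get(f)) == squash(new.get(f)) for f in text_fields)
    other_same = all(
        old.get(key) == new.get(key)
        for key in set(old) | set(new)
        if key not in ignored
    )
    return "patch" if text_same and other_same else "minor"
```

A CI job could compare the result with the version bump actually declared in the PR and fail when the declared bump is smaller than the required one.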
Prompt Diffing
Standard git diff shows text changes but not behavioral changes. A useful addition to PR workflows: run the eval suite against both the old and new prompt versions and post the comparison as a PR comment.
# scripts/prompt_diff_eval.py
from pathlib import Path
import yaml
def load_prompt(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())
def run_comparison(old_path: str, new_path: str, eval_dataset: str) -> dict:
    old_prompt = load_prompt(old_path)
    new_prompt = load_prompt(new_path)
    results = {
        "old_version": old_prompt["version"],
        "new_version": new_prompt["version"],
        "model_changed": old_prompt["model"] != new_prompt["model"],
        "scores": {}
    }
    # Run the same eval dataset against both versions.
    # run_eval_suite is the project's eval entry point (not shown here).
    for label, prompt in [("old", old_prompt), ("new", new_prompt)]:
        scores = run_eval_suite(prompt, eval_dataset)
        results["scores"][label] = {
            "accuracy": scores.mean_accuracy,
            "latency_p50_ms": scores.p50_latency,
            "cost_per_1k": scores.cost_per_1k_calls,
            "failure_rate": scores.failure_rate,
        }
    return results
A prompt change triggers both text diffing and behavioral comparison against the eval suite.
Evaluation Pipelines in CI
Evals are the test suite for AI applications. They replace (or supplement) traditional assertions with scored assessments of LLM output quality.
Eval Types
Deterministic evals check structural properties without calling an LLM. These are fast, cheap, and should run on every commit.
import json
import re
import jsonschema
def test_json_validity(output: str) -> bool:
    """Does the output parse as valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
def test_schema_compliance(output: str, schema: dict) -> bool:
    """Does the JSON match the expected schema?"""
    try:
        jsonschema.validate(json.loads(output), schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        # Output that does not parse fails the schema check as well
        return False
def test_no_pii_leakage(output: str) -> bool:
    """Check for common PII patterns in output."""
    patterns = [
        r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # Email
    ]
    return not any(re.search(p, output) for p in patterns)
LLM-as-judge evals use a model to assess output quality. More expensive, slower, but capable of evaluating subjective properties like helpfulness, coherence, and factual grounding.
import json
from anthropic import AsyncAnthropic
client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
# Literal braces in the JSON example are doubled so str.format leaves them alone.
JUDGE_PROMPT = """Rate the following summary on a 1-5 scale for each criterion:
- Completeness: Does it capture all key points from the source?
- Accuracy: Are all claims supported by the source document?
- Conciseness: Is it free of unnecessary repetition or filler?
Source document:
{source}
Summary to evaluate:
{summary}
Respond with JSON: {{"completeness": N, "accuracy": N, "conciseness": N}}"""
async def llm_judge(source: str, summary: str) -> dict:
    response = await client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary)
        }]
    )
    return json.loads(response.content[0].text)
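Comparative evals present a judge with two outputs and ask which is better. This is more robust than absolute scoring because a judge's 1-5 ratings drift in calibration across runs and model versions, while a relative preference is more stable. Pairwise judges do show positional bias, so each comparison should be run in both orders and only consistent verdicts counted.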
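One common way to control for positional bias in a pairwise judge is to ask twice with the order swapped. A minimal sketch, assuming a hypothetical `judge(first, second)` callable that returns "first", "second", or "tie" (in practice it would wrap an LLM call):

```python
from typing import Callable

# Hypothetical judge signature: judge(first_output, second_output) -> verdict
Judge = Callable[[str, str], str]


def compare_pair(output_a: str, output_b: str, judge: Judge) -> str:
    """Ask the judge twice with positions swapped; return 'a', 'b', or 'tie'."""
    forward = judge(output_a, output_b)   # A shown first
    backward = judge(output_b, output_a)  # B shown first
    a_wins = (forward == "first") + (backward == "second")
    b_wins = (forward == "second") + (backward == "first")
    if a_wins == 2:
        return "a"
    if b_wins == 2:
        return "b"
    # Any disagreement between the two orderings is treated as a tie.
    return "tie"
```

A judge that always prefers whichever output it sees first produces a tie under this scheme instead of a spurious win, at the price of doubling the judge calls.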
Human evals remain the gold standard for subjective quality. They don’t run in CI — they run asynchronously, with results fed back into threshold calibration.
Eval Dataset Design
An eval dataset is a collection of (input, expected_output, metadata) tuples. The expected_output might be a reference answer, a set of required facts, or just the input itself (for judge-based evals where the source material is the reference).
Practical sizing:
| Eval Tier | Dataset Size | Run Frequency | Typical Cost (GPT-5.4) | Purpose |
|---|---|---|---|---|
| Smoke | 20-50 examples | Every commit | $0.50-2 | Catch catastrophic regressions |
| Standard | 200-500 examples | Every PR | $5-25 | Measure quality metrics |
| Full | 1000+ examples | Pre-release | $25-100+ | Comprehensive quality gate |
The smoke tier is critical. Twenty well-chosen examples that cover edge cases (empty input, adversarial input, multilingual content, very long documents) catch most regressions in under 60 seconds.
Three-tier eval strategy balances speed and coverage across the development lifecycle.
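Choosing the smoke subset can itself be scripted so that every edge-case category stays covered as the dataset grows. The sketch below assumes each case is a dict with an `id` and a `metadata["tags"]` list; the tag names and the `select_smoke_cases` helper are illustrative, not part of any framework.

```python
def select_smoke_cases(cases: list[dict], required_tags: list[str],
                       limit: int = 20) -> list[dict]:
    """Pick up to `limit` cases, covering every required tag before filling the rest."""
    chosen: list[dict] = []
    seen_ids: set[str] = set()

    def tags_of(case: dict) -> list[str]:
        return case["metadata"].get("tags", [])

    # First pass: make sure each required edge-case tag has at least one case.
    for tag in required_tags:
        if any(tag in tags_of(c) for c in chosen):
            continue
        for case in cases:
            if tag in tags_of(case) and case["id"] not in seen_ids:
                chosen.append(case)
                seen_ids.add(case["id"])
                break
    # Second pass: fill the remaining slots in dataset order.
    for case in cases:
        if len(chosen) >= limit:
            break
        if case["id"] not in seen_ids:
            chosen.append(case)
            seen_ids.add(case["id"])
    return chosen[:limit]
```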
CI Integration
A GitHub Actions workflow that runs evals on prompt or model config changes:
# .github/workflows/ai-eval.yml
name: AI Eval Pipeline
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'model_config/**'
      - 'src/ai/**'
jobs:
  smoke-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-eval.txt
      - name: Run smoke evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m eval.runner \
            --tier smoke \
            --output results/smoke.json
      - name: Check thresholds
        run: |
          python -m eval.threshold_check \
            --results results/smoke.json \
            --config eval/thresholds.yaml
  standard-eval:
    needs: smoke-eval
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-eval.txt
      - name: Run standard evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m eval.runner \
            --tier standard \
            --output results/standard.json \
            --compare-with main
      - name: Post PR comment
        uses: actions/github-script@v7
        with:
          script: |
            // formatEvalResults is a project helper (not shown) that renders the results as Markdown
            const results = require('./results/standard.json');
            const body = formatEvalResults(results);
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body
            });
Threshold Configuration
Thresholds convert continuous scores into pass/fail gates. Store them in a config file, not hardcoded in the pipeline:
# eval/thresholds.yaml
summarizer:
  accuracy:
    min: 0.85
    regression_tolerance: 0.03  # allow 3% drop from baseline
  completeness:
    min: 0.80
    regression_tolerance: 0.05
  json_validity:
    min: 0.99  # structural check, very high bar
  latency_p95_ms:
    max: 3000
  cost_per_call_usd:
    max: 0.05
classifier:
  precision:
    min: 0.90
  recall:
    min: 0.85
  f1:
    min: 0.87
Regression tolerance matters more than absolute thresholds. An eval score dropping from 0.91 to 0.88 on a prompt change is a stronger signal than the score being 0.88 on a fresh eval — the delta indicates the change made things worse.
Building an Eval Suite
The eval runner is the core abstraction. It takes a prompt config, a dataset, and a set of evaluators, and produces scored results.
# eval/runner.py
import asyncio
import json
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
@dataclass
class EvalCase:
    id: str
    input: dict
    expected: dict | None
    metadata: dict
@dataclass
class EvalResult:
    case_id: str
    output: str
    scores: dict[str, float]
    latency_ms: float
    tokens_used: int
    cost_usd: float
    error: str | None = None
class EvalRunner:
    # _init_client and _call_model are provider-specific helpers (not shown).
    def __init__(self, prompt_config: dict, evaluators: list[Callable]):
        self.prompt_config = prompt_config
        self.evaluators = evaluators
        self.client = self._init_client(prompt_config["model"])
    async def run(self, dataset: list[EvalCase],
                  concurrency: int = 10) -> list[EvalResult]:
        semaphore = asyncio.Semaphore(concurrency)
        async def run_one(case: EvalCase) -> EvalResult:
            async with semaphore:
                start = time.monotonic()
                try:
                    output = await self._call_model(case.input)
                    latency = (time.monotonic() - start) * 1000
                    scores = {}
                    # Each evaluator is an async callable returning {metric: score}
                    for evaluator in self.evaluators:
                        score = await evaluator(
                            input=case.input,
                            output=output,
                            expected=case.expected
                        )
                        scores.update(score)
                    # NOTE: _last_token_count and _last_cost are shared across
                    # concurrent tasks and can be overwritten by another call;
                    # returning usage from _call_model itself would be safer.
                    return EvalResult(
                        case_id=case.id,
                        output=output,
                        scores=scores,
                        latency_ms=latency,
                        tokens_used=self._last_token_count,
                        cost_usd=self._last_cost,
                    )
                except Exception as e:
                    return EvalResult(
                        case_id=case.id,
                        output="",
                        scores={},
                        latency_ms=(time.monotonic() - start) * 1000,
                        tokens_used=0,
                        cost_usd=0.0,
                        error=str(e),
                    )
        return await asyncio.gather(*[run_one(c) for c in dataset])
Eval Storage and Trending
Raw eval results should be stored persistently — not just as CI artifacts that expire after 90 days. A simple approach: write results to a SQLite database checked into the repo (for small teams) or to a dedicated store.
# eval/storage.py
import json
import sqlite3
from datetime import datetime, timezone
def store_results(db_path: str, run_id: str, commit_sha: str,
                  prompt_version: str, results: list) -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS eval_runs (
                run_id TEXT PRIMARY KEY,
                commit_sha TEXT,
                prompt_version TEXT,
                timestamp TEXT,
                summary JSON
            )
        """)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS eval_cases (
                run_id TEXT,
                case_id TEXT,
                scores JSON,
                latency_ms REAL,
                cost_usd REAL,
                error TEXT,
                FOREIGN KEY (run_id) REFERENCES eval_runs(run_id)
            )
        """)
        # compute_summary aggregates per-case results (defined elsewhere)
        summary = compute_summary(results)
        conn.execute(
            "INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?)",
            (run_id, commit_sha, prompt_version,
             datetime.now(timezone.utc).isoformat(), json.dumps(summary))
        )
        for r in results:
            conn.execute(
                "INSERT INTO eval_cases VALUES (?, ?, ?, ?, ?, ?)",
                (run_id, r.case_id, json.dumps(r.scores),
                 r.latency_ms, r.cost_usd, r.error)
            )
        conn.commit()
    finally:
        conn.close()
Trending eval scores over time catches slow degradation that individual threshold checks miss. A 0.5% accuracy drop per week compounds to roughly a 6% regression over a 13-week quarter, even though no single week would trip a 3% regression tolerance.
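One simple way to detect this kind of drift is to fit a line through recent scores and alert on the slope. A sketch under stated assumptions: `scores` holds one aggregate score per week in chronological order (for example, read from the stored run summaries), and the 0.003-per-week default is an illustrative value rather than a recommendation.

```python
def weekly_slope(scores: list[float]) -> float:
    """Least-squares slope of score versus week index (score units per week)."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def is_degrading(scores: list[float], max_drop_per_week: float = 0.003) -> bool:
    """Flag a sustained downward trend even when each run passes its threshold."""
    return weekly_slope(scores) < -max_drop_per_week
```

Fitting a slope rather than comparing the first and last points makes the check less sensitive to a single noisy run.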
Model Migration as a First-Class CI Event
Model deprecations and new releases are external events that should trigger CI pipelines. When OpenAI retired GPT-4 and GPT-4o in February 2026, teams that didn’t have migration pipelines scrambled.
Dependency Tracking
Treat model versions like package dependencies. A lockfile-like config:
# model_config/models.yaml
models:
  primary:
    provider: anthropic
    model: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    deprecated: false
  reasoning:
    provider: openai
    model: gpt-5.4-thinking
    fallback: gpt-5.4
    deprecated: false
  judge:
    provider: anthropic
    model: claude-opus-4.7
    fallback: claude-opus-4.6
    deprecated: false
deprecation_watch:
  # Models to monitor for deprecation announcements
  - gpt-5.2  # being phased out
  - claude-haiku-4.5  # current but watch for successor
Migration Pipeline
When evaluating a model migration (e.g., upgrading from Claude Sonnet 4.6 to Claude Opus 4.7), the pipeline should:
- Run the full eval suite against the current model (baseline)
- Run the same suite against the candidate model
- Compare scores with statistical significance testing
- Generate a migration report
- Require human approval for major model changes
Model migrations follow a controlled comparison pipeline with human approval.
# scripts/model_migration.py
import numpy as np
from scipy import stats
def compare_model_runs(baseline_scores: list[float],
                       candidate_scores: list[float],
                       alpha: float = 0.05) -> dict:
    """Paired statistical comparison of two model runs."""
    # Paired t-test (same eval cases, different models)
    t_stat, p_value = stats.ttest_rel(baseline_scores, candidate_scores)
    # Convert NumPy scalars to plain Python types so the report is JSON-serializable
    baseline_mean = float(np.mean(baseline_scores))
    candidate_mean = float(np.mean(candidate_scores))
    delta = candidate_mean - baseline_mean
    # p_value is NaN when both runs score identically; NaN < alpha is False,
    # so that case is reported as NO_CHANGE rather than REGRESSION.
    significant = bool(p_value < alpha)
    return {
        "baseline_mean": round(baseline_mean, 4),
        "candidate_mean": round(candidate_mean, 4),
        "delta": round(delta, 4),
        "p_value": round(float(p_value), 4),
        "significant": significant,
        "recommendation": (
            "NO_CHANGE" if not significant
            else "UPGRADE" if delta > 0
            else "REGRESSION"
        )
    }
Scheduled Model Checks
A cron-triggered CI job that runs weekly against all configured models catches upstream changes:
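# .github/workflows/model-health.yml
name: Model Health Check
on:
  schedule:
    - cron: '0 6 * * 1'  # Every Monday at 6am UTC
  workflow_dispatch:
jobs:
  health-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model_role: [primary, reasoning, judge]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-eval.txt
      - name: Run health eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m eval.health_check \
            --role ${{ matrix.model_role }} \
            --baseline latest \
            --alert-on-regression
      - name: Alert on drift
        if: failure()
        uses: slackapi/slack-github-action@v1
        env:
          # Secret name is an assumption; v1 of the action reads the webhook from the environment
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
        with:
          payload: |
            {"text": "Model health regression detected for ${{ matrix.model_role }}"}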
Canary Deployments for AI
Canary deployments route a small percentage of traffic to a new configuration before full rollout. For AI applications, “new configuration” can mean a new prompt version, a new model, or both.
Traffic Splitting Strategies
Random split. Route 5-10% of requests to the canary. Simple but requires enough traffic volume for statistical significance.
User-based split. Assign users to canary/control groups deterministically (hash of user ID). Ensures consistent experience per user and enables paired analysis.
Capability-based split. Route specific query types to the canary. Useful when the change targets a particular capability (e.g., improved code generation).
# src/routing/canary.py
import hashlib
import random
from dataclasses import dataclass
@dataclass
class CanaryConfig:
    enabled: bool
    canary_percent: float  # 0.0 to 1.0
    canary_prompt_version: str
    canary_model: str
    control_prompt_version: str
    control_model: str
    sticky_by_user: bool = True
def route_request(user_id: str, config: CanaryConfig) -> str:
    """Returns 'canary' or 'control'."""
    if not config.enabled:
        return "control"
    if config.sticky_by_user:
        # Deterministic routing based on user ID
        hash_val = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
        bucket = (hash_val % 1000) / 1000.0
    else:
        bucket = random.random()
    return "canary" if bucket < config.canary_percent else "control"
Canary Metrics
The canary needs automated evaluation on live traffic. Key metrics to track:
| Metric | Collection Method | Alert Threshold |
|---|---|---|
| Error rate | Response status codes, parsing failures | >2x control rate |
| Latency P50/P95 | Timing middleware | >30% increase |
| User satisfaction | Thumbs up/down, follow-up queries | >10% decrease |
| Token usage | API response metadata | >25% increase (cost signal) |
| Structured output validity | Schema validation | >1% failure rate increase |
Canary routing splits traffic between control and candidate configurations with automated metric comparison.
Promotion and Rollback Criteria
Automated canary analysis after a defined observation window (typically 1-24 hours depending on traffic volume):
# src/routing/canary_analysis.py
def analyze_canary(control_metrics: dict, canary_metrics: dict,
                   min_sample_size: int = 100) -> dict:
    if canary_metrics["sample_size"] < min_sample_size:
        return {"decision": "WAIT", "reason": "insufficient samples"}
    checks = []
    # Error rate comparison (floor the control rate to avoid dividing by zero)
    error_ratio = (canary_metrics["error_rate"] /
                   max(control_metrics["error_rate"], 0.001))
    checks.append({
        "metric": "error_rate",
        "control": control_metrics["error_rate"],
        "canary": canary_metrics["error_rate"],
        "pass": error_ratio < 2.0
    })
    # Latency comparison (floor at 1 ms for the same reason)
    latency_ratio = (canary_metrics["p95_latency_ms"] /
                     max(control_metrics["p95_latency_ms"], 1.0))
    checks.append({
        "metric": "p95_latency",
        "control": control_metrics["p95_latency_ms"],
        "canary": canary_metrics["p95_latency_ms"],
        "pass": latency_ratio < 1.3
    })
    # Quality score (from online eval sampling)
    if "quality_score" in canary_metrics:
        quality_delta = (canary_metrics["quality_score"] -
                         control_metrics["quality_score"])
        checks.append({
            "metric": "quality_score",
            "control": control_metrics["quality_score"],
            "canary": canary_metrics["quality_score"],
            "pass": quality_delta > -0.03  # tolerate a drop of up to 0.03
        })
    all_pass = all(c["pass"] for c in checks)
    return {
        "decision": "PROMOTE" if all_pass else "ROLLBACK",
        "checks": checks,
        "sample_sizes": {
            "control": control_metrics["sample_size"],
            "canary": canary_metrics["sample_size"]
        }
    }
Rollback Strategies
Rollback for AI applications has dimensions that traditional deployments don’t.
What To Roll Back
| Rollback Target | Trigger | Speed | Complexity |
|---|---|---|---|
| Prompt version | Quality regression detected | Seconds (config change) | Low |
| Model version | Model behaving differently post-update | Seconds (config change) | Low |
| Both prompt + model | Major deployment failure | Seconds (config change) | Low |
| Code + prompt + model | System-level failure | Minutes (redeploy) | Medium |
| Feature flag off | Any of the above | Seconds | Very low |
The key insight: if prompts and model configurations are externalized from application code, most rollbacks don’t require a code deployment. A config store update (or feature flag toggle) reverts the AI behavior instantly.
Failure classification determines the rollback mechanism, with most AI-specific regressions resolved via config changes.
Implementing Fast Rollback
Feature flags are the fastest rollback mechanism. LaunchDarkly, Flipt, or even a simple Redis key can gate prompt versions:
# src/prompts/loader.py
from pathlib import Path
import yaml
class PromptLoader:
    def __init__(self, feature_flags, prompt_dir: str = "prompts"):
        self.flags = feature_flags
        self.prompt_dir = Path(prompt_dir)
    def get_prompt(self, prompt_id: str) -> dict:
        # Feature flag determines which version to use
        version = self.flags.get_string(
            f"prompt_version_{prompt_id}",
            default=self._get_default_version(prompt_id)
        )
        return self._load_prompt(prompt_id, version)
    def _load_prompt(self, prompt_id: str, version: str) -> dict:
        # read_text opens and closes the file, so no handle is leaked
        path = self.prompt_dir / prompt_id / f"{version}.yaml"
        return yaml.safe_load(path.read_text())
    def _get_default_version(self, prompt_id: str) -> str:
        """Read default version from manifest."""
        manifest = yaml.safe_load(
            (self.prompt_dir / prompt_id / "manifest.yaml").read_text()
        )
        return manifest["default_version"]
Rollback Testing
Include rollback scenarios in the CI pipeline. A rollback that hasn’t been tested is a rollback that might not work:
# eval/rollback_test.yaml
scenarios:
  - name: prompt_version_rollback
    steps:
      - deploy: summarizer/v4
      - verify: eval_passes
      - rollback_to: summarizer/v3
      - verify: eval_passes
      - assert: rollback_time_seconds < 5
  - name: model_fallback
    steps:
      - configure: model=claude-opus-4.7
      - simulate: provider_outage
      - verify: fallback_to=claude-opus-4.6
      - verify: eval_passes_with_fallback_model
The Config vs Code Boundary
A recurring architectural question: what belongs in code (deployed via standard CI/CD) and what belongs in configuration (changeable without deployment)?
Recommended Separation
| Layer | Storage | Change Process | Rollback Speed |
|---|---|---|---|
| System prompts | Config (versioned YAML) | PR → eval → merge | Seconds |
| Model selection | Config (model manifest) | PR → eval → merge | Seconds |
| Temperature, max_tokens | Config | PR → merge | Seconds |
| Output parsing logic | Code | PR → test → deploy | Minutes |
| Eval datasets | Repo (data files) | PR → merge | N/A |
| Eval thresholds | Config | PR → merge | Seconds |
| Routing/orchestration logic | Code | PR → test → deploy | Minutes |
The principle: anything that affects LLM behavior but doesn’t require code changes should be externalized as config. This keeps the rollback surface for AI-specific changes fast (seconds, not deployment cycles).
Config Validation
Externalized config needs its own validation in CI:
# scripts/validate_config.py
import yaml
import jsonschema
from pathlib import Path
PROMPT_SCHEMA = {
    "type": "object",
    "required": ["id", "version", "model", "system"],
    "properties": {
        "id": {"type": "string"},
        "version": {"type": "integer", "minimum": 1},
        "model": {
            "type": "string",
            "enum": [
                "claude-opus-4.7", "claude-opus-4.6",
                "claude-sonnet-4.6", "claude-haiku-4.5",
                "gpt-5.5", "gpt-5.4", "gpt-5.4-thinking",
                "gpt-4.1-nano",
                "gemini-3.1-pro", "gemini-3.1-flash-lite",
            ]
        },
        "temperature": {"type": "number", "minimum": 0, "maximum": 2},
        "max_tokens": {"type": "integer", "minimum": 1, "maximum": 128000},
        "system": {"type": "string", "minLength": 10},
    }
}
def validate_all_prompts(prompt_dir: str = "prompts"):
    errors = []
    validated = 0
    for path in Path(prompt_dir).rglob("*.yaml"):
        if path.name == "manifest.yaml":
            continue
        try:
            config = yaml.safe_load(path.read_text())
            jsonschema.validate(config, PROMPT_SCHEMA)
            validated += 1
        except Exception as e:
            errors.append(f"{path}: {e}")
    if errors:
        for e in errors:
            print(f"FAIL: {e}")
        raise SystemExit(1)
    # Count only the files actually validated (manifests are skipped above)
    print(f"Validated {validated} prompt configs")
Pipeline Architecture
Putting it all together, a complete CI/CD pipeline for an AI application has more stages than a traditional pipeline.
The complete pipeline adds eval tiers and canary stages between traditional CI steps and full deployment.
Stage Details
Stage 1: Validate (every push, <30s). Config schema validation, prompt template syntax checking, Python type checking and linting, dependency compatibility verification.
Stage 2: Unit tests (every push, <2min). Output parser tests with fixture data, routing logic tests, tool/function schema validation, mock-based integration tests (no LLM calls).
Stage 3: Smoke eval (every push, <2min). 20-50 real LLM calls against critical paths, structural checks (JSON validity, schema compliance), latency sanity checks.
Stage 4: Standard eval (every PR, 5-15min). 200-500 LLM calls across the full eval dataset, LLM-as-judge scoring, comparison against baseline (previous commit on main), regression detection with statistical significance.
Stage 5: Canary deploy (pre-production). 5-10% traffic split, automated metric collection, promotion/rollback decision after observation window.
Stage 6: Full promotion. Update production config to the new versions, run a final smoke eval against production, tag the release.
Branch Strategy
A branching model that works for AI applications:
main ─── always deployable, eval-gated
├── feature/* ─── new prompts, model changes, code changes
├── eval/* ─── eval dataset additions/modifications
└── hotfix/* ─── emergency prompt or model rollbacks
The eval/* branches are worth calling out. Eval dataset changes should go through code review — a poorly constructed eval case can either mask regressions or cause false failures. But eval changes shouldn’t trigger full eval runs (circular dependency). They should trigger a meta-eval: verify the new cases are well-formed and that existing thresholds still make sense.
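The well-formedness half of that meta-eval needs no LLM calls. A minimal sketch, assuming cases are loaded as dicts with the same fields as the `EvalCase` dataclass shown earlier (`id`, `input`, `expected`, `metadata`); `validate_eval_cases` is a hypothetical helper name.

```python
def validate_eval_cases(cases: list[dict]) -> list[str]:
    """Return well-formedness problems for an eval dataset (empty list means OK)."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case #{index}>")
        # `expected` may legitimately be absent or None for judge-based evals.
        for field in ("id", "input", "metadata"):
            if field not in case:
                problems.append(f"{label}: missing required field '{field}'")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case["id"])
        if "input" in case and not isinstance(case["input"], dict):
            problems.append(f"{label}: input must be a mapping")
    return problems
```

Checking that existing thresholds still make sense after a dataset change is a separate step that does require scoring the new cases.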
Cost Management in CI
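Running LLM evals in CI costs real money. Consider a team pushing 20 PRs per day with a 500-case eval suite on Claude Sonnet 4.6. If each standard run lands in the $5-25 range quoted in the tier table above (that estimate is for GPT-5.4, so actual figures will differ), that is on the order of $100-500 per day before any reruns.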
Cost Reduction Strategies
Cache eval outputs for unchanged prompts. If the prompt version and model haven’t changed, reuse the previous eval results. Only re-run when something relevant changes.
# eval/cache.py
import hashlib
import json
from pathlib import Path
def eval_cache_key(prompt_config: dict, eval_case: dict) -> str:
    """Deterministic cache key for an eval case."""
    content = json.dumps({
        "prompt": prompt_config,
        "case": eval_case,
    }, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()
def get_cached_result(cache_dir: str, key: str) -> dict | None:
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    return None
Use cheaper models where a model is needed. JSON validity and schema compliance are deterministic checks that need no judge model at all. For lightweight judged checks, GPT-4.1 Nano or Claude Haiku 4.5 work fine at a fraction of the cost of a frontier model.
Tiered eval triggers. Not every commit needs the full suite. Use path-based triggers:
| Files Changed | Eval Tier |
|---|---|
| src/ai/prompts/** | Full standard eval |
| src/ai/parsers/** | Smoke eval + unit tests |
| src/frontend/** | Unit tests only |
| model_config/** | Full standard eval + migration comparison |
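The mapping in this table can also live in a small script, which is handy when a single workflow decides at runtime which tier to run. The sketch below copies the path patterns from the table; the tier labels and the `select_tier` helper are illustrative names, and files matching no pattern default to unit tests only.

```python
from fnmatch import fnmatch

# (pattern, tier) pairs taken from the table; labels are illustrative.
TIER_RULES = [
    ("model_config/**", "standard+migration"),
    ("src/ai/prompts/**", "standard"),
    ("src/ai/parsers/**", "smoke"),
    ("src/frontend/**", "unit"),
]
TIER_RANK = {"unit": 0, "smoke": 1, "standard": 2, "standard+migration": 3}


def select_tier(changed_files: list[str]) -> str:
    """Return the most expensive tier required by any changed file."""
    best = "unit"
    for path in changed_files:
        for pattern, tier in TIER_RULES:
            # fnmatch's * also matches "/", so ** covers nested directories.
            if fnmatch(path, pattern):
                if TIER_RANK[tier] > TIER_RANK[best]:
                    best = tier
                break
    return best
```

The list of changed files would typically come from `git diff --name-only` against the merge base.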
Parallelize with concurrency controls. Run eval cases concurrently (bounded by API rate limits) to reduce wall-clock time without increasing cost.
Route eval effort based on what changed, using cheaper models for structural checks.
Budget Guardrails
Set hard limits on CI eval spending:
# eval/budget.yaml
daily_limit_usd: 50.00
per_run_limit_usd: 10.00
per_case_limit_usd: 0.10
alerts:
  - threshold: 0.80  # 80% of daily limit
    channel: "#ai-platform"
  - threshold: 0.95
    channel: "#ai-platform-urgent"
    action: pause_non_critical_evals
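Enforcing these limits means the eval runner has to track spend as it goes. A minimal sketch of such a guard, assuming the limits are loaded from the YAML above into a dict and that today's prior spend comes from wherever run costs are stored; `BudgetGuard` and `BudgetExceeded` are hypothetical names.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recording a cost would break one of the configured limits."""


class BudgetGuard:
    """Tracks spend for one CI run against per-case, per-run, and daily limits."""

    def __init__(self, limits: dict, spent_today_usd: float = 0.0):
        self.limits = limits
        self.spent_today = spent_today_usd
        self.spent_this_run = 0.0

    def record(self, case_cost_usd: float) -> None:
        """Add one case's cost, raising as soon as any limit is exceeded."""
        if case_cost_usd > self.limits["per_case_limit_usd"]:
            raise BudgetExceeded(
                f"case cost ${case_cost_usd:.2f} exceeds the per-case limit"
            )
        self.spent_this_run += case_cost_usd
        self.spent_today += case_cost_usd
        if self.spent_this_run > self.limits["per_run_limit_usd"]:
            raise BudgetExceeded("per-run limit exceeded")
        if self.spent_today > self.limits["daily_limit_usd"]:
            raise BudgetExceeded("daily limit exceeded")

    def daily_fraction(self) -> float:
        """Fraction of the daily limit used, for comparison with alert thresholds."""
        return self.spent_today / self.limits["daily_limit_usd"]
```

The runner would call `record` after each case and compare `daily_fraction()` with the alert thresholds to decide when to notify a channel or pause non-critical evals.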
Summary
CI/CD for AI applications requires extending traditional pipelines with three new capabilities: prompt versioning (treat prompts as versioned, reviewable, diffable artifacts), eval pipelines (replace binary tests with scored evaluations at multiple tiers), and model-aware deployment (canary routing, automated comparison, fast config-based rollback).
The practical implementation:
- Store prompts as versioned YAML in the repo. Validate schemas in CI.
- Build a three-tier eval suite: smoke (every commit, 20 cases, <60s), standard (every PR, 200-500 cases, 5-15min), full (pre-release, 1000+ cases).
- Use both deterministic evaluators (JSON validity, schema compliance, regex checks) and LLM-as-judge evaluators (quality, completeness, accuracy). Run structural evals on cheap models.
- Track model versions as dependencies. Run comparative evals when migrating models. Schedule weekly health checks against production models.
- Deploy with canary routing. Measure error rate, latency, quality score, and cost on live traffic before full promotion.
- Externalize prompts and model configs from application code so rollbacks take seconds, not deployment cycles.
- Gate evals by what changed. Frontend-only changes don’t need LLM eval runs. Control CI costs with caching, tiered triggers, and budget limits.
The teams that ship AI applications reliably are the ones that invested in this infrastructure early. Every week spent building eval pipelines pays back in regressions caught before users see them.
Further Reading
- Promptfoo — Open-source LLM eval framework with CI integration, model comparison, and red-teaming capabilities
- Braintrust — Eval and prompt management platform with logging, scoring, and dataset management for production AI
- OpenAI Evals — OpenAI’s framework for evaluating LLMs, including dataset formats and eval patterns
- DeepChecks LLM Evaluation — Open-source validation framework with LLM-specific checks for production monitoring
- Langfuse — Open-source LLM observability platform with trace-based eval integration and prompt management
- LMSYS Chatbot Arena — Crowdsourced LLM comparison platform; the methodology behind comparative/arena-style evaluation
- Anthropic’s guide to prompt engineering — Practical patterns for prompt design that inform how prompts should be versioned and tested
- Humanloop — Prompt management and evaluation platform with version control and CI/CD integration patterns