CI/CD for AI Applications

Published 2026-05-12 · Deep Dives · Tags: deep-dive, reference, architecture

CI/CD for AI applications: prompt versioning, eval pipelines in CI, canary model deployments, rollback strategies, and why traditional software delivery breaks for LLM-powered systems.

Traditional CI/CD assumes deterministic software. Run the tests, get a pass/fail, ship the artifact. LLM-powered applications break every assumption in that pipeline. The “code” includes natural language prompts that change behavior without changing any source file. The “dependencies” include external models that providers update or deprecate without notice — GPT-4 and GPT-4o were retired in February 2026, breaking applications that referenced them. The “tests” require subjective evaluation of natural language output, where the same input can produce different correct answers.

This post covers the full CI/CD lifecycle for AI applications: versioning prompts alongside code, running evaluations in CI that catch regressions before production, deploying model changes safely with canary patterns, and rolling back when things go wrong. The patterns here apply whether the application is a RAG pipeline, an agent workflow, or a single prompt behind an API endpoint.

Why Traditional CI/CD Breaks

Standard CI/CD pipelines are built around three properties: determinism (same input → same output), fast feedback (tests complete in minutes), and binary outcomes (pass or fail). LLM applications violate all three.

Non-determinism. Even with temperature=0, LLM outputs vary across API calls due to batching, quantization differences, and infrastructure routing. A prompt that produces valid JSON 99.2% of the time will occasionally emit malformed output that no unit test predicted.

Slow feedback. Running an eval suite against a live model takes minutes to hours depending on dataset size. A 500-example eval against GPT-5.4 costs real money and wall-clock time. The tight feedback loop developers expect from git push → green checkmark stretches considerably.

Continuous outputs. A code change either compiles or doesn’t. An LLM output exists on a quality spectrum. “The response is 80% as good as before” doesn’t map to pass/fail without explicit thresholds.

Diagram: Traditional CI/CD assumes binary test outcomes; AI pipelines require scored evaluations with configurable thresholds.

The dependency problem is unique to AI applications. When Anthropic ships Claude Opus 4.7 or OpenAI releases GPT-5.5, the application’s behavior changes without any code commit. Model deprecations force migration on the provider’s timeline, not the team’s. A CI/CD system for AI must treat model versions as dependencies that trigger pipeline runs.

Prompt Versioning

Prompts are code. They change application behavior, they have bugs, and they need version history. But they don’t fit neatly into existing version control patterns.

Storage Options

In-repo (recommended for most teams). Store prompts as files in the repository — YAML, Markdown, or a custom format. They get the full Git history, code review, and branch-based workflows.

# prompts/summarizer/v3.yaml
id: summarizer
version: 3
model: claude-sonnet-4.6
temperature: 0.3
max_tokens: 1024
system: |
  You are a document summarizer. Produce a structured summary
  with: key_points (list), sentiment (positive/negative/neutral),
  and action_items (list). Output valid JSON only.
user_template: |
  Summarize the following document:
  
  {{document}}
metadata:
  author: eng-team
  created: 2026-04-15
  eval_baseline: summarizer-eval-v3
  min_score: 0.82

Prompt management platforms. Tools like Braintrust, Humanloop, and Promptfoo offer web UIs for editing prompts, built-in versioning, and A/B testing. The tradeoff: prompts live outside the code repository, which breaks the single-source-of-truth property that makes CI/CD reliable.

Hybrid approach. Store the canonical prompt in the repo; sync it to a management platform for non-engineer access. The repo version is authoritative. CI rejects deployments if the platform version drifts from the repo version.
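
The drift check in the hybrid approach can be as simple as a field-by-field comparison plus a content hash. The following is a minimal sketch, assuming the platform's copy of the prompt has already been fetched into a dict (that fetch is platform-specific and not shown). The names `prompt_fingerprint`, `check_drift`, and `BEHAVIORAL_FIELDS` are hypothetical, and the field list is taken from the YAML example above.

```python
import hashlib
import json

# Only fields that affect model behavior take part in the comparison.
# This list is an assumption based on the YAML prompt example.
BEHAVIORAL_FIELDS = ("model", "temperature", "max_tokens", "system", "user_template")

def prompt_fingerprint(prompt: dict) -> str:
    """Stable SHA-256 over the behavior-affecting fields of a prompt."""
    subset = {k: prompt.get(k) for k in BEHAVIORAL_FIELDS}
    canonical = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_drift(repo_prompt: dict, platform_prompt: dict) -> list:
    """Return the names of fields where the platform copy differs from the repo."""
    return [
        k for k in BEHAVIORAL_FIELDS
        if repo_prompt.get(k) != platform_prompt.get(k)
    ]

repo = {"model": "claude-sonnet-4.6", "temperature": 0.3, "system": "Summarize."}
platform = dict(repo, temperature=0.7)  # someone edited the copy in the web UI

drifted = check_drift(repo, platform)
if drifted:
    print(f"Drift detected in: {', '.join(drifted)}")
```

In CI, a non-empty `drifted` list would fail the deployment job. The fingerprint is useful for logging which exact prompt content was live at a given time.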

Version Semantics

Semantic versioning maps poorly to prompts. A one-word change to a system prompt can alter behavior more than a major code refactoring. A practical scheme:

| Change Type | Example | Version Bump | CI Requirement |
|---|---|---|---|
| Formatting, typos | Fix whitespace in template | Patch (v3.0.1) | Standard tests |
| Behavioral tuning | Adjust tone instructions | Minor (v3.1.0) | Full eval suite |
| Model change | claude-sonnet-4.6 → claude-opus-4.7 | Major (v4.0.0) | Full eval + human review |
| Schema change | New output field added | Major (v4.0.0) | Full eval + integration tests |
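
Part of this scheme can be enforced mechanically. The sketch below classifies a change by comparing two prompt configs. It is deliberately conservative: a typo fix cannot be told apart from a behavioral edit automatically, so any non-whitespace text change is treated as minor. The `output_schema` field is a hypothetical addition that does not appear in the YAML example, and `classify_prompt_change` is an illustrative name.

```python
def classify_prompt_change(old: dict, new: dict) -> str:
    """Map a prompt change to 'major', 'minor', 'patch', or 'none'."""
    # Model or output-schema changes are always major.
    # 'output_schema' is a hypothetical field used for illustration.
    if old.get("model") != new.get("model"):
        return "major"
    if old.get("output_schema") != new.get("output_schema"):
        return "major"

    text_fields = ("system", "user_template")
    sampling_fields = ("temperature", "max_tokens")

    def normalized(prompt: dict) -> list:
        # Collapse every run of whitespace so layout-only edits compare equal.
        return [" ".join(str(prompt.get(f, "")).split()) for f in text_fields]

    if normalized(old) != normalized(new):
        return "minor"
    if any(old.get(k) != new.get(k) for k in sampling_fields):
        return "minor"
    if any(old.get(f) != new.get(f) for f in text_fields):
        return "patch"  # whitespace-only difference
    return "none"

base = {"model": "claude-sonnet-4.6", "system": "Be brief."}
print(classify_prompt_change(base, dict(base, system="Be  brief.\n")))
print(classify_prompt_change(base, dict(base, system="Be terse.")))
print(classify_prompt_change(base, dict(base, model="claude-opus-4.7")))
```

A CI job could compare the classification against the version bump declared in the PR and reject bumps that are too small.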

Prompt Diffing

Standard git diff shows text changes but not behavioral changes. A useful addition to PR workflows: run the eval suite against both the old and new prompt versions and post the comparison as a PR comment.

# scripts/prompt_diff_eval.py
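import yaml
from pathlib import Path

# run_eval_suite is assumed to come from the project's eval tooling; it is not
# defined in this snippet. It is expected to return an object exposing
# mean_accuracy, p50_latency, cost_per_1k_calls, and failure_rate.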

def load_prompt(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())

def run_comparison(old_path: str, new_path: str, eval_dataset: str):
    old_prompt = load_prompt(old_path)
    new_prompt = load_prompt(new_path)
    
    results = {
        "old_version": old_prompt["version"],
        "new_version": new_prompt["version"],
        "model_changed": old_prompt["model"] != new_prompt["model"],
        "scores": {}
    }
    
    # Run eval dataset against both versions
    for label, prompt in [("old", old_prompt), ("new", new_prompt)]:
        scores = run_eval_suite(prompt, eval_dataset)
        results["scores"][label] = {
            "accuracy": scores.mean_accuracy,
            "latency_p50_ms": scores.p50_latency,
            "cost_per_1k": scores.cost_per_1k_calls,
            "failure_rate": scores.failure_rate,
        }
    
    return results

Diagram: A prompt change triggers both text diffing and behavioral comparison against the eval suite.
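
To post the comparison as a PR comment, the results dict needs to be rendered as text. Here is one possible formatter, assuming the dict shape produced by `run_comparison` above (`old_version`, `new_version`, `model_changed`, and numeric `scores` for `old` and `new`). The function name and the example numbers are illustrative only.

```python
def format_comparison(results: dict) -> str:
    """Render run_comparison-style results as a Markdown table."""
    old, new = results["scores"]["old"], results["scores"]["new"]
    title = f"Prompt v{results['old_version']} vs v{results['new_version']}"
    if results["model_changed"]:
        title += " (model changed)"
    lines = [
        title,
        "",
        "| Metric | Old | New | Delta |",
        "|---|---|---|---|",
    ]
    for metric in old:
        delta = new[metric] - old[metric]
        lines.append(
            f"| {metric} | {old[metric]:.3f} | {new[metric]:.3f} | {delta:+.3f} |"
        )
    return "\n".join(lines)

# Made-up numbers, purely to show the output shape.
example = {
    "old_version": 3,
    "new_version": 4,
    "model_changed": True,
    "scores": {
        "old": {"accuracy": 0.91, "failure_rate": 0.02},
        "new": {"accuracy": 0.88, "failure_rate": 0.01},
    },
}
comment = format_comparison(example)
print(comment)
```

The signed delta column makes regressions easy to spot during review without opening the raw results.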

Evaluation Pipelines in CI

Evals are the test suite for AI applications. They replace (or supplement) traditional assertions with scored assessments of LLM output quality.

Eval Types

Deterministic evals check structural properties without calling an LLM. These are fast, cheap, and should run on every commit.

import json
import re

import jsonschema

def test_json_validity(output: str) -> bool:
    """Does the output parse as valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def test_schema_compliance(output: str, schema: dict) -> bool:
    """Does the output parse as JSON and match the expected schema?"""
    try:
        jsonschema.validate(json.loads(output), schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        # Malformed JSON counts as a schema failure rather than a crash
        return False

# Compiled once; extend with patterns relevant to your data
PII_PATTERNS = [
    re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),  # US SSN
    re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),  # Email
]

def test_no_pii_leakage(output: str) -> bool:
    """Check for common PII patterns in output."""
    return not any(p.search(output) for p in PII_PATTERNS)

LLM-as-judge evals use a model to assess output quality. More expensive, slower, but capable of evaluating subjective properties like helpfulness, coherence, and factual grounding.

import json

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Literal braces in the JSON example are doubled so str.format leaves them alone
JUDGE_PROMPT = """Rate the following summary on a 1-5 scale for each criterion:
- Completeness: Does it capture all key points from the source?
- Accuracy: Are all claims supported by the source document?
- Conciseness: Is it free of unnecessary repetition or filler?

Source document:
{source}

Summary to evaluate:
{summary}

Respond with JSON: {{"completeness": N, "accuracy": N, "conciseness": N}}"""

async def llm_judge(source: str, summary: str) -> dict:
    response = await client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary)
        }]
    )
    # Raises json.JSONDecodeError if the judge strays from the requested format
    return json.loads(response.content[0].text)

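Comparative evals present a judge with two outputs and ask which is better. This is often more robust than absolute scoring, because absolute scores from LLM judges suffer from scale calibration drift. Pairwise judges have their own weakness, positional bias, so each comparison should be run in both presentation orders.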
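
Positional bias in pairwise judging can be reduced by asking the judge twice with the order swapped and only accepting a verdict that survives the swap. The sketch below assumes a judge is any callable that takes two outputs and returns "first" or "second". The stand-in judges are toys for illustration, where a real judge would call an LLM, and all names here are hypothetical.

```python
from typing import Callable

# A judge takes (first_output, second_output) and returns "first" or "second".
Judge = Callable[[str, str], str]

def debiased_compare(judge: Judge, a: str, b: str) -> str:
    """Return 'a', 'b', or 'tie' after judging in both presentation orders."""
    forward = judge(a, b)    # a is shown first
    backward = judge(b, a)   # b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with position: treat as no preference

# Stand-in judges (assumptions for the example, not real evaluators).
def always_first(x: str, y: str) -> str:
    return "first"  # maximally position-biased

def prefers_longer(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"

print(debiased_compare(always_first, "x", "yy"))
print(debiased_compare(prefers_longer, "short", "much longer"))
```

The swap doubles judge cost per comparison, which is one reason comparative evals usually sit in the standard or full tier rather than the smoke tier.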

Human evals remain the gold standard for subjective quality. They don’t run in CI — they run asynchronously, with results fed back into threshold calibration.

Eval Dataset Design

An eval dataset is a collection of (input, expected_output, metadata) tuples. The expected_output might be a reference answer, a set of required facts, or just the input itself (for judge-based evals where the source material is the reference).

Practical sizing:

| Eval Tier | Dataset Size | Run Frequency | Typical Cost (GPT-5.4) | Purpose |
|---|---|---|---|---|
| Smoke | 20-50 examples | Every commit | $0.50-2 | Catch catastrophic regressions |
| Standard | 200-500 examples | Every PR | $5-25 | Measure quality metrics |
| Full | 1000+ examples | Pre-release | $25-100+ | Comprehensive quality gate |

The smoke tier is critical. Twenty well-chosen examples that cover edge cases (empty input, adversarial input, multilingual content, very long documents) catch most regressions in under 60 seconds.

Diagram: Three-tier eval strategy balances speed and coverage across the development lifecycle.

CI Integration

A GitHub Actions workflow that runs evals on prompt or model config changes:

# .github/workflows/ai-eval.yml
name: AI Eval Pipeline
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'model_config/**'
      - 'src/ai/**'

jobs:
  smoke-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-eval.txt
      - name: Run smoke evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m eval.runner \
            --tier smoke \
            --output results/smoke.json
      - name: Check thresholds
        run: |
          python -m eval.threshold_check \
            --results results/smoke.json \
            --config eval/thresholds.yaml

  standard-eval:
    needs: smoke-eval
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the results comment
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-eval.txt
      - name: Run standard evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m eval.runner \
            --tier standard \
            --output results/standard.json \
            --compare-with main
      - name: Post PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results/standard.json');
            // Minimal formatting: dump the results JSON. Swap in a table
            // renderer once the results schema settles.
            const body = 'Standard eval results\n\n' +
              JSON.stringify(results, null, 2);
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body
            });

Threshold Configuration

Thresholds convert continuous scores into pass/fail gates. Store them in a config file, not hardcoded in the pipeline:

# eval/thresholds.yaml
summarizer:
  accuracy:
    min: 0.85
    regression_tolerance: 0.03  # allow a 0.03 absolute drop from baseline
  completeness:
    min: 0.80
    regression_tolerance: 0.05
  json_validity:
    min: 0.99  # structural check, very high bar
  latency_p95_ms:
    max: 3000
  cost_per_call_usd:
    max: 0.05

classifier:
  precision:
    min: 0.90
  recall:
    min: 0.85
  f1:
    min: 0.87

Regression tolerance matters more than absolute thresholds. An eval score dropping from 0.91 to 0.88 on a prompt change is a stronger signal than the score being 0.88 on a fresh eval — the delta indicates the change made things worse.
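
A threshold check that applies both kinds of gate might look like the sketch below. It assumes scores and baselines arrive as flat `{metric: value}` dicts and that rules follow the shape of `eval/thresholds.yaml`. The function name is hypothetical, and the tolerance is read as an absolute drop, matching the 0.91 to 0.88 example.

```python
def check_thresholds(scores: dict, rules: dict, baseline: dict = None) -> list:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    baseline = baseline or {}
    for metric, rule in rules.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
            continue
        if "min" in rule and value < rule["min"]:
            failures.append(f"{metric}: {value:.3f} is below min {rule['min']}")
        if "max" in rule and value > rule["max"]:
            failures.append(f"{metric}: {value:.3f} is above max {rule['max']}")
        tolerance = rule.get("regression_tolerance")
        if tolerance is not None and metric in baseline:
            drop = baseline[metric] - value
            # Small epsilon so a drop of exactly the tolerance is not
            # rejected because of floating-point rounding.
            if drop > tolerance + 1e-9:
                failures.append(
                    f"{metric}: dropped {drop:.3f} from baseline "
                    f"(tolerance {tolerance})"
                )
    return failures

RULES = {
    "accuracy": {"min": 0.85, "regression_tolerance": 0.03},
    "latency_p95_ms": {"max": 3000},
}
current = {"accuracy": 0.88, "latency_p95_ms": 2500}

# Above the absolute minimum, but a 0.05 drop from baseline still fails.
print(check_thresholds(current, RULES, baseline={"accuracy": 0.93}))
```

Returning a list of failures instead of a boolean lets the CI step print every violated gate in one run.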

Building an Eval Suite

The eval runner is the core abstraction. It takes a prompt config, a dataset, and a set of evaluators, and produces scored results.

# eval/runner.py
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class EvalCase:
    id: str
    input: dict
    expected: dict | None
    metadata: dict

@dataclass
class EvalResult:
    case_id: str
    output: str
    scores: dict[str, float]
    latency_ms: float
    tokens_used: int
    cost_usd: float
    error: str | None = None

# Evaluators are async callables that return a {metric_name: score} dict
Evaluator = Callable[..., Awaitable[dict[str, float]]]

class EvalRunner:
    def __init__(self, prompt_config: dict, evaluators: list[Evaluator]):
        self.prompt_config = prompt_config
        self.evaluators = evaluators
        self.client = self._init_client(prompt_config["model"])

    def _init_client(self, model: str):
        """Build the provider client for `model` (provider-specific, not shown)."""
        raise NotImplementedError

    async def _call_model(self, case_input: dict) -> tuple[str, int, float]:
        """Render the prompt and call the model (provider-specific, not shown).

        Returns (output_text, tokens_used, cost_usd) so that per-call usage
        never lives in shared state that concurrent tasks could overwrite.
        """
        raise NotImplementedError

    async def run(self, dataset: list[EvalCase],
                  concurrency: int = 10) -> list[EvalResult]:
        # Bound in-flight requests to stay under provider rate limits
        semaphore = asyncio.Semaphore(concurrency)

        async def run_one(case: EvalCase) -> EvalResult:
            async with semaphore:
                start = time.monotonic()
                try:
                    output, tokens, cost = await self._call_model(case.input)
                    latency = (time.monotonic() - start) * 1000

                    scores = {}
                    for evaluator in self.evaluators:
                        score = await evaluator(
                            input=case.input,
                            output=output,
                            expected=case.expected
                        )
                        scores.update(score)

                    return EvalResult(
                        case_id=case.id,
                        output=output,
                        scores=scores,
                        latency_ms=latency,
                        tokens_used=tokens,
                        cost_usd=cost,
                    )
                except Exception as e:
                    # A failed case is recorded, not raised, so one bad
                    # example cannot abort the whole run
                    return EvalResult(
                        case_id=case.id,
                        output="",
                        scores={},
                        latency_ms=(time.monotonic() - start) * 1000,
                        tokens_used=0,
                        cost_usd=0,
                        error=str(e),
                    )

        return await asyncio.gather(*[run_one(c) for c in dataset])

Raw eval results should be stored persistently — not just as CI artifacts that expire after 90 days. A simple approach: write results to a SQLite database checked into the repo (for small teams) or to a dedicated store.

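# eval/storage.py
import json
import sqlite3
from datetime import datetime, timezone

def compute_summary(results: list) -> dict:
    """Minimal summary: case counts plus the mean of each score.

    Only cases that completed without error contribute to the means.
    """
    ok = [r for r in results if r.error is None]
    totals: dict[str, list[float]] = {}
    for r in ok:
        for name, value in r.scores.items():
            totals.setdefault(name, []).append(value)
    return {
        "cases": len(results),
        "errors": len(results) - len(ok),
        "mean_scores": {n: sum(v) / len(v) for n, v in totals.items()},
    }

def store_results(db_path: str, run_id: str, commit_sha: str,
                  prompt_version: str, results: list) -> None:
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("""
                CREATE TABLE IF NOT EXISTS eval_runs (
                    run_id TEXT PRIMARY KEY,
                    commit_sha TEXT,
                    prompt_version TEXT,
                    timestamp TEXT,
                    summary JSON
                )
            """)
            conn.execute("""
                CREATE TABLE IF NOT EXISTS eval_cases (
                    run_id TEXT,
                    case_id TEXT,
                    scores JSON,
                    latency_ms REAL,
                    cost_usd REAL,
                    error TEXT,
                    FOREIGN KEY (run_id) REFERENCES eval_runs(run_id)
                )
            """)

            summary = compute_summary(results)
            conn.execute(
                "INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?)",
                (run_id, commit_sha, prompt_version,
                 datetime.now(timezone.utc).isoformat(), json.dumps(summary))
            )
            conn.executemany(
                "INSERT INTO eval_cases VALUES (?, ?, ?, ?, ?, ?)",
                [(run_id, r.case_id, json.dumps(r.scores),
                  r.latency_ms, r.cost_usd, r.error) for r in results]
            )
    finally:
        conn.close()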

Trending eval scores over time catches slow degradation that individual threshold checks miss. A 0.5% accuracy drop per week compounds into a meaningful regression over a quarter.
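
To make the compounding concrete, the sketch below fits a least-squares line through a series of per-run scores and reports the cumulative drift. It reads "0.5% per week" as 0.005 absolute points, which is an assumption, and the weekly series is synthetic. Each step would pass a 0.03 regression tolerance, yet the quarter's total drift is about 0.06.

```python
def score_slope(scores: list) -> float:
    """Least-squares slope of score against run index (change per run)."""
    n = len(scores)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator

# Synthetic series: 13 weekly runs, each 0.005 lower than the previous one.
weekly_accuracy = [0.900 - 0.005 * week for week in range(13)]

slope = score_slope(weekly_accuracy)
total_drift = slope * (len(weekly_accuracy) - 1)
largest_step = max(
    a - b for a, b in zip(weekly_accuracy, weekly_accuracy[1:])
)

print(f"slope per week: {slope:+.4f}")
print(f"drift over the series: {total_drift:+.3f}")
print(f"largest single-week drop: {largest_step:.3f}")
```

An alert on the fitted slope, or on the delta against a baseline pinned at the last release instead of the previous commit, catches this pattern where per-change tolerances do not.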

Model Migration as a First-Class CI Event

Model deprecations and new releases are external events that should trigger CI pipelines. When OpenAI retired GPT-4 and GPT-4o in February 2026, teams that didn’t have migration pipelines scrambled.

Dependency Tracking

Treat model versions like package dependencies. A lockfile-like config:

# model_config/models.yaml
models:
  primary:
    provider: anthropic
    model: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    deprecated: false
    
  reasoning:
    provider: openai
    model: gpt-5.4-thinking
    fallback: gpt-5.4
    deprecated: false
    
  judge:
    provider: anthropic
    model: claude-opus-4.7
    fallback: claude-opus-4.6
    deprecated: false

deprecation_watch:
  # Models to monitor for deprecation announcements
  - gpt-5.2  # being phased out
  - claude-haiku-4.5  # current but watch for successor
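
A config like this is only useful if something reads it. The following sketch is a lint pass over the parsed file that flags roles needing attention. It assumes the YAML has already been loaded into a dict with the structure shown above. The function name and the specific checks are illustrative, not an established convention.

```python
def models_needing_attention(config: dict) -> list:
    """List roles that are deprecated, watched, or lack a usable fallback."""
    watch = set(config.get("deprecation_watch", []))
    findings = []
    for role, entry in config.get("models", {}).items():
        model = entry["model"]
        fallback = entry.get("fallback")
        if entry.get("deprecated"):
            findings.append(f"{role}: {model} is marked deprecated")
        if model in watch:
            findings.append(f"{role}: {model} is on the deprecation watch list")
        if fallback is None or fallback == model:
            findings.append(f"{role}: no distinct fallback configured")
        elif fallback in watch:
            findings.append(
                f"{role}: fallback {fallback} is on the deprecation watch list"
            )
    return findings

# Same content as model_config/models.yaml, inlined to keep the example
# self-contained.
config = {
    "models": {
        "primary": {"provider": "anthropic", "model": "claude-sonnet-4.6",
                    "fallback": "claude-haiku-4.5", "deprecated": False},
        "reasoning": {"provider": "openai", "model": "gpt-5.4-thinking",
                      "fallback": "gpt-5.4", "deprecated": False},
        "judge": {"provider": "anthropic", "model": "claude-opus-4.7",
                  "fallback": "claude-opus-4.6", "deprecated": False},
    },
    "deprecation_watch": ["gpt-5.2", "claude-haiku-4.5"],
}

for finding in models_needing_attention(config):
    print(finding)
```

Run against the example config, the only finding is that the primary role's fallback sits on the watch list, which is the kind of second-order exposure that is easy to miss by eye.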

Migration Pipeline

When evaluating a model migration (e.g., upgrading from Claude Sonnet 4.6 to Claude Opus 4.7), the pipeline should:

  1. Run the full eval suite against the current model (baseline)
  2. Run the same suite against the candidate model
  3. Compare scores with statistical significance testing
  4. Generate a migration report
  5. Require human approval for major model changes

Diagram: Model migrations follow a controlled comparison pipeline with human approval.

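# scripts/model_migration.py
import math

import numpy as np
from scipy import stats

def compare_model_runs(baseline_scores: list[float],
                       candidate_scores: list[float],
                       alpha: float = 0.05) -> dict:
    """Paired statistical comparison of two model runs."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("paired comparison needs scores for the same eval cases")

    # Paired t-test (same eval cases, different models)
    t_stat, p_value = stats.ttest_rel(baseline_scores, candidate_scores)
    # Zero variance in the paired differences (e.g. identical runs) yields a
    # NaN p-value; treat that as "no detectable difference"
    p_value = 1.0 if math.isnan(p_value) else float(p_value)

    # Cast NumPy scalars to plain floats so the report is JSON-serializable
    baseline_mean = float(np.mean(baseline_scores))
    candidate_mean = float(np.mean(candidate_scores))
    delta = candidate_mean - baseline_mean
    significant = p_value < alpha

    return {
        "baseline_mean": round(baseline_mean, 4),
        "candidate_mean": round(candidate_mean, 4),
        "delta": round(delta, 4),
        "p_value": round(p_value, 4),
        "significant": significant,
        "recommendation": (
            "NO_CHANGE" if not significant
            else "UPGRADE" if delta > 0
            else "REGRESSION"
        )
    }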

Scheduled Model Checks

A cron-triggered CI job that runs weekly against all configured models catches upstream changes:

# .github/workflows/model-health.yml
name: Model Health Check
on:
  schedule:
    - cron: '0 6 * * 1'  # Every Monday at 6am UTC
  workflow_dispatch:

jobs:
  health-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model_role: [primary, reasoning, judge]
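    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-eval.txt
      - name: Run health eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m eval.health_check \
            --role ${{ matrix.model_role }} \
            --baseline latest \
            --alert-on-regression
      - name: Alert on drift
        if: failure()
        uses: slackapi/slack-github-action@v1
        env:
          # Secret name is an assumption; use whatever your repo defines
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
        with:
          payload: |
            {"text": "Model health regression detected for ${{ matrix.model_role }}"}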

Canary Deployments for AI

Canary deployments route a small percentage of traffic to a new configuration before full rollout. For AI applications, “new configuration” can mean a new prompt version, a new model, or both.

Traffic Splitting Strategies

Random split. Route 5-10% of requests to the canary. Simple but requires enough traffic volume for statistical significance.

User-based split. Assign users to canary/control groups deterministically (hash of user ID). Ensures consistent experience per user and enables paired analysis.

Capability-based split. Route specific query types to the canary. Useful when the change targets a particular capability (e.g., improved code generation).

# src/routing/canary.py
import hashlib
from dataclasses import dataclass

@dataclass
class CanaryConfig:
    enabled: bool
    canary_percent: float  # 0.0 to 1.0
    canary_prompt_version: str
    canary_model: str
    control_prompt_version: str
    control_model: str
    sticky_by_user: bool = True

def route_request(user_id: str, config: CanaryConfig) -> str:
    """Returns 'canary' or 'control'."""
    if not config.enabled:
        return "control"
    
    if config.sticky_by_user:
        # Deterministic routing based on user ID
        hash_val = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
        bucket = (hash_val % 1000) / 1000.0
    else:
        import random
        bucket = random.random()
    
    return "canary" if bucket < config.canary_percent else "control"

Canary Metrics

The canary needs automated evaluation on live traffic. Key metrics to track:

| Metric | Collection Method | Alert Threshold |
|---|---|---|
| Error rate | Response status codes, parsing failures | >2x control rate |
| Latency P50/P95 | Timing middleware | >30% increase |
| User satisfaction | Thumbs up/down, follow-up queries | >10% decrease |
| Token usage | API response metadata | >25% increase (cost signal) |
| Structured output validity | Schema validation | >1% failure rate increase |

Diagram: Canary routing splits traffic between control and candidate configurations with automated metric comparison.

Promotion and Rollback Criteria

Automated canary analysis after a defined observation window (typically 1-24 hours depending on traffic volume):

# src/routing/canary_analysis.py

def analyze_canary(control_metrics: dict, canary_metrics: dict,
                   min_sample_size: int = 100) -> dict:
    
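    # Both arms need enough traffic for the comparison to mean anything
    if min(canary_metrics["sample_size"],
           control_metrics["sample_size"]) < min_sample_size:
        return {"decision": "WAIT", "reason": "insufficient samples"}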
    
    checks = []
    
    # Error rate comparison
    error_ratio = (canary_metrics["error_rate"] / 
                   max(control_metrics["error_rate"], 0.001))
    checks.append({
        "metric": "error_rate",
        "control": control_metrics["error_rate"],
        "canary": canary_metrics["error_rate"],
        "pass": error_ratio < 2.0
    })
    
    # Latency comparison
    latency_ratio = (canary_metrics["p95_latency_ms"] / 
                     control_metrics["p95_latency_ms"])
    checks.append({
        "metric": "p95_latency",
        "control": control_metrics["p95_latency_ms"],
        "canary": canary_metrics["p95_latency_ms"],
        "pass": latency_ratio < 1.3
    })
    
    # Quality score (from online eval sampling); compared only when both
    # arms report it
    if "quality_score" in canary_metrics and "quality_score" in control_metrics:
        quality_delta = (canary_metrics["quality_score"] - 
                        control_metrics["quality_score"])
        checks.append({
            "metric": "quality_score",
            "control": control_metrics["quality_score"],
            "canary": canary_metrics["quality_score"],
            "pass": quality_delta > -0.03  # tolerate a 0.03 absolute drop
        })
    
    all_pass = all(c["pass"] for c in checks)
    
    return {
        "decision": "PROMOTE" if all_pass else "ROLLBACK",
        "checks": checks,
        "sample_sizes": {
            "control": control_metrics["sample_size"],
            "canary": canary_metrics["sample_size"]
        }
    }
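
Fixed ratio thresholds can misfire at small sample sizes, which is why the random split strategy needs enough traffic for statistical significance. The sketch below adds a two-proportion z-test for the error-rate comparison using only the standard library. It relies on a normal approximation, which is rough when error counts are in the single digits, and the traffic numbers are invented for illustration.

```python
import math

def two_proportion_z_test(errors_a: int, n_a: int,
                          errors_b: int, n_b: int) -> tuple:
    """Two-sided z-test for a difference in error rates. Returns (z, p_value)."""
    rate_a, rate_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no errors (or all errors) in both arms
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented numbers: control has 20 errors in 1000 requests (2%),
# canary has 5 errors in 100 requests (5%).
z, p_value = two_proportion_z_test(20, 1000, 5, 100)
ratio = (5 / 100) / (20 / 1000)

print(f"error ratio: {ratio:.1f}x, z = {z:.2f}, p = {p_value:.3f}")
```

Here the canary's error rate is 2.5x the control's, which trips the 2x rule, but the p-value lands just above 0.05. With only 100 canary requests the evidence is borderline, so extending the observation window may be a better call than an immediate rollback.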

Rollback Strategies

Rollback for AI applications has dimensions that traditional deployments don’t.

What To Roll Back

| Rollback Target | Trigger | Speed | Complexity |
|---|---|---|---|
| Prompt version | Quality regression detected | Seconds (config change) | Low |
| Model version | Model behaving differently post-update | Seconds (config change) | Low |
| Both prompt + model | Major deployment failure | Seconds (config change) | Low |
| Code + prompt + model | System-level failure | Minutes (redeploy) | Medium |
| Feature flag off | Any of the above | Seconds | Very low |

The key insight: if prompts and model configurations are externalized from application code, most rollbacks don’t require a code deployment. A config store update (or feature flag toggle) reverts the AI behavior instantly.

Diagram: Failure classification determines the rollback mechanism, with most AI-specific regressions resolved via config changes.

Implementing Fast Rollback

Feature flags are the fastest rollback mechanism. LaunchDarkly, Flipt, or even a simple Redis key can gate prompt versions:

# src/prompts/loader.py
from pathlib import Path

import yaml

class PromptLoader:
    def __init__(self, feature_flags, prompt_dir: str = "prompts"):
        # feature_flags is any client exposing get_string(key, default=...)
        self.flags = feature_flags
        self.prompt_dir = Path(prompt_dir)
    
    def get_prompt(self, prompt_id: str) -> dict:
        # Feature flag determines which version to use
        version = self.flags.get_string(
            f"prompt_version_{prompt_id}",
            default=self._get_default_version(prompt_id)
        )
        return self._load_prompt(prompt_id, version)
    
    def _load_prompt(self, prompt_id: str, version: str) -> dict:
        path = self.prompt_dir / prompt_id / f"{version}.yaml"
        # read_text opens and closes the file, so no handle is leaked
        return yaml.safe_load(path.read_text())
    
    def _get_default_version(self, prompt_id: str) -> str:
        """Read default version from manifest."""
        manifest_path = self.prompt_dir / prompt_id / "manifest.yaml"
        manifest = yaml.safe_load(manifest_path.read_text())
        return manifest["default_version"]
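
The loader only needs a flag client with a `get_string(key, default=...)` method. The toy in-memory store below exposes that interface and adds a one-step rollback, which is enough to exercise the rollback path in tests. It is a sketch for illustration and not the API of LaunchDarkly, Flipt, or any other real flag service. `InMemoryFlags`, `set_string`, and `rollback` are hypothetical names.

```python
class InMemoryFlags:
    """Toy flag store with the get_string interface the loader calls."""

    def __init__(self):
        self._values = {}
        self._history = {}  # key -> list of previous values (None = unset)

    def get_string(self, key: str, default: str) -> str:
        return self._values.get(key, default)

    def set_string(self, key: str, value: str) -> None:
        # Remember what was there before so rollback can restore it.
        self._history.setdefault(key, []).append(self._values.get(key))
        self._values[key] = value

    def rollback(self, key: str) -> None:
        """Restore the previous value, or clear the flag if there was none."""
        history = self._history.get(key, [])
        previous = history.pop() if history else None
        if previous is None:
            self._values.pop(key, None)  # fall back to the manifest default
        else:
            self._values[key] = previous

flags = InMemoryFlags()
key = "prompt_version_summarizer"

flags.set_string(key, "v3")
flags.set_string(key, "v4")           # deploy the new prompt version
after_deploy = flags.get_string(key, default="v2")

flags.rollback(key)                   # quality regression detected
after_rollback = flags.get_string(key, default="v2")

print(after_deploy, "->", after_rollback)
```

Because the rollback is a dictionary write, it completes without touching the deployed code, which is the property the table above relies on.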

Rollback Testing

Include rollback scenarios in the CI pipeline. A rollback that hasn’t been tested is a rollback that might not work:

# eval/rollback_test.yaml
scenarios:
  - name: prompt_version_rollback
    steps:
      - deploy: summarizer/v4
      - verify: eval_passes
      - rollback_to: summarizer/v3
      - verify: eval_passes
      - assert: rollback_time_seconds < 5
  
  - name: model_fallback
    steps:
      - configure: model=claude-opus-4.7
      - simulate: provider_outage
      - verify: fallback_to=claude-opus-4.6
      - verify: eval_passes_with_fallback_model

The Config vs Code Boundary

A recurring architectural question: what belongs in code (deployed via standard CI/CD) and what belongs in configuration (changeable without deployment)?

| Layer | Storage | Change Process | Rollback Speed |
|---|---|---|---|
| System prompts | Config (versioned YAML) | PR → eval → merge | Seconds |
| Model selection | Config (model manifest) | PR → eval → merge | Seconds |
| Temperature, max_tokens | Config | PR → merge | Seconds |
| Output parsing logic | Code | PR → test → deploy | Minutes |
| Eval datasets | Repo (data files) | PR → merge | N/A |
| Eval thresholds | Config | PR → merge | Seconds |
| Routing/orchestration logic | Code | PR → test → deploy | Minutes |

The principle: anything that affects LLM behavior but doesn’t require code changes should be externalized as config. This keeps the rollback surface for AI-specific changes fast (seconds, not deployment cycles).

Config Validation

Externalized config needs its own validation in CI:

# scripts/validate_config.py
import yaml
import jsonschema
from pathlib import Path

PROMPT_SCHEMA = {
    "type": "object",
    "required": ["id", "version", "model", "system"],
    "properties": {
        "id": {"type": "string"},
        "version": {"type": "integer", "minimum": 1},
        "model": {
            "type": "string",
            "enum": [
                "claude-opus-4.7", "claude-opus-4.6",
                "claude-sonnet-4.6", "claude-haiku-4.5",
                "gpt-5.5", "gpt-5.4", "gpt-5.4-thinking",
                "gpt-4.1-nano",
                "gemini-3.1-pro", "gemini-3.1-flash-lite",
            ]
        },
        "temperature": {"type": "number", "minimum": 0, "maximum": 2},
        "max_tokens": {"type": "integer", "minimum": 1, "maximum": 128000},
        "system": {"type": "string", "minLength": 10},
    }
}

def validate_all_prompts(prompt_dir: str = "prompts"):
    errors = []
    validated = 0
    for path in Path(prompt_dir).rglob("*.yaml"):
        if path.name == "manifest.yaml":
            continue  # manifests follow a different schema
        validated += 1
        try:
            config = yaml.safe_load(path.read_text())
            jsonschema.validate(config, PROMPT_SCHEMA)
        except Exception as e:
            errors.append(f"{path}: {e}")
    
    if errors:
        for e in errors:
            print(f"FAIL: {e}")
        raise SystemExit(1)
    # Count only the files actually checked, excluding skipped manifests
    print(f"Validated {validated} prompt configs")

Pipeline Architecture

Putting it all together, a complete CI/CD pipeline for an AI application has more stages than a traditional pipeline.

Diagram: The complete pipeline adds eval tiers and canary stages between traditional CI steps and full deployment.

Stage Details

Stage 1: Validate (every push, <30s). Config schema validation, prompt template syntax checking, Python type checking and linting, dependency compatibility verification.

Stage 2: Unit tests (every push, <2min). Output parser tests with fixture data, routing logic tests, tool/function schema validation, mock-based integration tests (no LLM calls).

Stage 3: Smoke eval (every push, <2min). 20-50 real LLM calls against critical paths, structural checks (JSON validity, schema compliance), latency sanity checks.

Stage 4: Standard eval (every PR, 5-15min). 200-500 LLM calls across the full eval dataset, LLM-as-judge scoring, comparison against baseline (previous commit on main), regression detection with statistical significance.

Stage 5: Canary deploy (pre-production). 5-10% traffic split, automated metric collection, promotion/rollback decision after observation window.

Stage 6: Full promotion. Update production config to the new versions, run a final smoke eval against production, tag the release.

Branch Strategy

A branching model that works for AI applications:

main           ─── always deployable, eval-gated
├── feature/*  ─── new prompts, model changes, code changes
├── eval/*     ─── eval dataset additions/modifications
└── hotfix/*   ─── emergency prompt or model rollbacks

The eval/* branches are worth calling out. Eval dataset changes should go through code review — a poorly constructed eval case can either mask regressions or cause false failures. But eval changes shouldn’t trigger full eval runs (circular dependency). They should trigger a meta-eval: verify the new cases are well-formed and that existing thresholds still make sense.
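
The well-formedness half of that meta-eval needs no LLM calls. A possible sketch follows, assuming eval cases are stored as dicts with the same four fields as the `EvalCase` dataclass. The function name and the specific rules are illustrative. Empty inputs are deliberately allowed, since they are a legitimate edge case to test.

```python
REQUIRED_KEYS = {"id", "input", "expected", "metadata"}

def validate_eval_cases(cases: list) -> list:
    """Return a list of problems found; an empty list means well-formed."""
    errors = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case #{index}>")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            errors.append(f"{label}: missing keys {sorted(missing)}")
        if "id" in case:
            if case["id"] in seen_ids:
                errors.append(f"{label}: duplicate id")
            seen_ids.add(case["id"])
        if "input" in case and not isinstance(case["input"], dict):
            errors.append(f"{label}: input must be a mapping")
        if case.get("expected") is not None and not isinstance(case["expected"], dict):
            errors.append(f"{label}: expected must be a mapping or null")
    return errors

cases = [
    {"id": "empty-doc", "input": {"document": ""}, "expected": None, "metadata": {}},
    {"id": "empty-doc", "input": {"document": "x"}, "expected": None, "metadata": {}},
    {"id": "no-meta", "input": "raw string", "expected": None},
]
for problem in validate_eval_cases(cases):
    print(problem)
```

Checking that existing thresholds still make sense is harder to automate. One option is to run the new cases against the current production prompt and flag any that already fail.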

Cost Management in CI

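Running LLM evals in CI costs real money. Take a team pushing 20 PRs per day with a 500-case eval suite on Claude Sonnet 4.6: at the $5-25 per standard run estimated in the sizing table (a GPT-5.4 figure, so treat it as a rough proxy), that is $100-500 per day before counting re-runs on follow-up pushes.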

Cost Reduction Strategies

Cache eval outputs for unchanged prompts. If the prompt version and model haven’t changed, reuse the previous eval results. Only re-run when something relevant changes.

# eval/cache.py
import hashlib
import json
from pathlib import Path

def eval_cache_key(prompt_config: dict, eval_case: dict) -> str:
    """Deterministic cache key for an eval case."""
    content = json.dumps({
        "prompt": prompt_config,
        "case": eval_case,
    }, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

def get_cached_result(cache_dir: str, key: str) -> dict | None:
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    return None
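
The read path above needs a matching write path. The sketch below wraps both in one helper, assuming eval results are JSON-serializable dicts. `cached_eval` and its `run_case` callback are hypothetical names, and the key function repeats the same hashing scheme so the example stays self-contained.

```python
import hashlib
import json
from pathlib import Path

def cache_key(prompt_config: dict, eval_case: dict) -> str:
    """Same scheme as eval_cache_key: hash of prompt config plus case."""
    content = json.dumps({"prompt": prompt_config, "case": eval_case},
                         sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

def cached_eval(cache_dir: str, prompt_config: dict, eval_case: dict,
                run_case) -> tuple:
    """Return (result, was_cached); run_case is called only on a miss."""
    path = Path(cache_dir) / f"{cache_key(prompt_config, eval_case)}.json"
    if path.exists():
        return json.loads(path.read_text()), True

    result = run_case(prompt_config, eval_case)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename, so a parallel job never reads a
    # half-written cache entry.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)
    return result, False
```

One caveat: the key covers the prompt config and the case, not the provider's serving stack, so a cache hit can hide upstream model drift. The scheduled health check should bypass the cache for that reason.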

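Use cheaper models where a model is needed at all. JSON validity and schema compliance are deterministic checks that need no model call. For lightweight judge tasks such as format adherence or simple rubric checks, GPT-4.1 Nano or Claude Haiku 4.5 work fine at a fraction of the cost of a frontier model.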

Tiered eval triggers. Not every commit needs the full suite. Use path-based triggers:

| Files Changed | Eval Tier |
|---|---|
| src/ai/prompts/** | Full standard eval |
| src/ai/parsers/** | Smoke eval + unit tests |
| src/frontend/** | Unit tests only |
| model_config/** | Full standard eval + migration comparison |
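
The routing table translates directly into a small selector that a CI entry point could call with the list of changed files. In this sketch the tier names and the default of unit tests only for unmatched paths are assumptions. It uses `fnmatchcase`, whose `*` also matches path separators, so `dir/**` behaves as "anything under dir".

```python
from fnmatch import fnmatchcase

# Patterns mirror the table; the first matching rule decides a file's tier.
TIER_RULES = [
    ("model_config/**", "standard+migration"),
    ("src/ai/prompts/**", "standard"),
    ("src/ai/parsers/**", "smoke"),
    ("src/frontend/**", "unit"),
]

# Cheapest to most expensive.
TIER_ORDER = ["unit", "smoke", "standard", "standard+migration"]

def select_tier(changed_files: list) -> str:
    """Return the most expensive tier required by any changed file."""
    best = "unit"  # assumption: unmatched paths need unit tests only
    for path in changed_files:
        for pattern, tier in TIER_RULES:
            if fnmatchcase(path, pattern):
                if TIER_ORDER.index(tier) > TIER_ORDER.index(best):
                    best = tier
                break
    return best

print(select_tier(["src/frontend/app.tsx"]))
print(select_tier(["src/frontend/app.tsx", "src/ai/parsers/json_out.py"]))
print(select_tier(["src/ai/prompts/summarizer/v4.yaml", "model_config/models.yaml"]))
```

Taking the maximum across files means a PR that touches both the frontend and a prompt still gets the full standard eval.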

Parallelize with concurrency controls. Run eval cases concurrently (bounded by API rate limits) to reduce wall-clock time without increasing cost.

Diagram: Route eval effort based on what changed, using cheaper models for structural checks.

Budget Guardrails

Set hard limits on CI eval spending:

# eval/budget.yaml
daily_limit_usd: 50.00
per_run_limit_usd: 10.00
per_case_limit_usd: 0.10

alerts:
  - threshold: 0.80  # 80% of daily limit
    channel: "#ai-platform"
  - threshold: 0.95
    channel: "#ai-platform-urgent"
    action: pause_non_critical_evals
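
Limits in a YAML file do nothing until the runner enforces them. Below is a minimal enforcement sketch, assuming the runner can estimate each case's cost before dispatching it and that today's spend so far is loaded from wherever results are stored. The `BudgetGuard` class is hypothetical, and alert delivery is left out.

```python
from dataclasses import dataclass

@dataclass
class BudgetGuard:
    daily_limit_usd: float
    per_run_limit_usd: float
    spent_today_usd: float = 0.0     # loaded from persistent storage
    spent_this_run_usd: float = 0.0

    def try_spend(self, cost_usd: float) -> bool:
        """Record the cost and return True, or refuse if a limit would break."""
        if self.spent_this_run_usd + cost_usd > self.per_run_limit_usd:
            return False
        if self.spent_today_usd + cost_usd > self.daily_limit_usd:
            return False
        self.spent_this_run_usd += cost_usd
        self.spent_today_usd += cost_usd
        return True

    def daily_fraction_used(self) -> float:
        """Compare against the alert thresholds (0.80, 0.95) in budget.yaml."""
        return self.spent_today_usd / self.daily_limit_usd

# Limits from eval/budget.yaml; the team has already spent $38 today.
guard = BudgetGuard(daily_limit_usd=50.0, per_run_limit_usd=10.0,
                    spent_today_usd=38.0)

first = guard.try_spend(2.0)    # allowed: run total 2, day total 40
second = guard.try_spend(9.0)   # refused: run total would reach 11
print(first, second, guard.daily_fraction_used())
```

Refusing before the call, not reconciling after it, is what makes the limit hard. The trade-off is that cost estimates have to err on the high side.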

Summary

CI/CD for AI applications requires extending traditional pipelines with three new capabilities: prompt versioning (treat prompts as versioned, reviewable, diffable artifacts), eval pipelines (replace binary tests with scored evaluations at multiple tiers), and model-aware deployment (canary routing, automated comparison, fast config-based rollback).

The practical implementation:

  • Store prompts as versioned YAML in the repo. Validate schemas in CI.
  • Build a three-tier eval suite: smoke (every commit, 20 cases, <60s), standard (every PR, 200-500 cases, 5-15min), full (pre-release, 1000+ cases).
  • Use both deterministic evaluators (JSON validity, schema compliance, regex checks) and LLM-as-judge evaluators (quality, completeness, accuracy). Deterministic checks need no model call; use cheap models for simple judge tasks.
  • Track model versions as dependencies. Run comparative evals when migrating models. Schedule weekly health checks against production models.
  • Deploy with canary routing. Measure error rate, latency, quality score, and cost on live traffic before full promotion.
  • Externalize prompts and model configs from application code so rollbacks take seconds, not deployment cycles.
  • Gate evals by what changed. Frontend-only changes don’t need LLM eval runs. Control CI costs with caching, tiered triggers, and budget limits.

The teams that ship AI applications reliably are the ones that invested in this infrastructure early. Every week spent building eval pipelines pays back in regressions caught before users see them.

Further Reading

  • Promptfoo — Open-source LLM eval framework with CI integration, model comparison, and red-teaming capabilities
  • Braintrust — Eval and prompt management platform with logging, scoring, and dataset management for production AI
  • OpenAI Evals — OpenAI’s framework for evaluating LLMs, including dataset formats and eval patterns
  • DeepChecks LLM Evaluation — Open-source validation framework with LLM-specific checks for production monitoring
  • Langfuse — Open-source LLM observability platform with trace-based eval integration and prompt management
  • LMSYS Chatbot Arena — Crowdsourced LLM comparison platform; the methodology behind comparative/arena-style evaluation
  • Anthropic’s guide to prompt engineering — Practical patterns for prompt design that inform how prompts should be versioned and tested
  • Humanloop — Prompt management and evaluation platform with version control and CI/CD integration patterns