The LLM Encyclopedia, April 11, 2026
The most comprehensive reference for every major AI language model. 60+ models, 22 use cases, full pricing tables — updated weekly.
The Developer’s Complete LLM Comparison Guide (April 11, 2026)
Every Major, Minor, Niche, Open-Source, and Specialized Language Model — Researched, Compared, and Rated
Accuracy note: All model versions, release dates, pricing, and benchmark data reflect publicly confirmed information as of April 11, 2026. This is a fast-moving field — verify pricing and availability against official provider docs before production deployment.
Table of Contents
- What Is an LLM? A Developer’s Primer
- How to Read This Guide
- Tier 1 — Flagship Proprietary Models
- GPT Series (OpenAI)
- Claude Series (Anthropic)
- Gemini Series (Google DeepMind)
- Grok Series (xAI)
- Tier 2 — Strong Proprietary Challengers
- Perplexity (Sonar)
- Microsoft Copilot / Azure OpenAI
- Cohere Command R+
- Amazon Nova / Titan
- Tier 3 — Open-Source Powerhouses
- Meta Llama Series
- Mistral / Mixtral Series
- DeepSeek Series
- Qwen Series (Alibaba)
- Gemma Series (Google)
- IBM Granite
- Falcon Series (TII)
- Microsoft Phi Series
- BLOOM (BigScience)
- OLMo (Allen Institute)
- NVIDIA Nemotron
- Tier 4 — Chinese Frontier Models
- Baidu ERNIE
- Zhipu GLM-5 / GLM-5.1 / ChatGLM
- Moonshot Kimi
- Baichuan
- Yi (01.AI)
- MiniMax
- Hunyuan (Tencent)
- InternLM (Shanghai AI Lab)
- ByteDance Seed
- Tier 5 — Coding-Specialist Models
- GitHub Copilot
- DeepSeek Coder / Prover
- StarCoder / StarCoder2
- CodeLlama
- Codestral / Devstral (Mistral)
- WizardCoder
- Qwen Coder
- Amazon Q Developer
- Tabnine
- Tier 6 — Domain-Specific Models
- Healthcare: Med-PaLM 2, MedLLaMA, BioMedLM, ClinicalBERT
- Finance: BloombergGPT, FinGPT
- Legal: Harvey AI, CoCounsel, ChatLAW
- Science: Galactica, SciGLM
- Cybersecurity
- Tier 7 — Edge / On-Device / Small Models
- Tier 8 — Research & Historical Models
- Pricing Comparison Table (April 11, 2026)
- Benchmark Comparison
- Choosing the Right LLM: Decision Framework
- Real-World Enterprise Success Stories
- Trends and What’s Coming in 2026–2027
What’s New This Week
- Google’s Gemma 4 arrives with full on-device agentic capabilities: Google shipped Gemma 4 with text, image, and audio processing running entirely on-device — no cloud required. This is a significant step for privacy-first AI and democratizes agentic workflows away from expensive cloud APIs. The model crossed 2 million downloads rapidly, signaling strong practitioner uptake. Independent benchmarks are still accumulating; treat Google’s own performance claims as directional until third-party replication confirms them.
- Anthropic’s agent hosting economics don’t match the headline: Anthropic’s agent-hosting announcement generated substantial hype around a $0.08/hr figure, but detailed analysis of the actual unit economics reveals the real cost is considerably higher for production-scale workloads. Practitioners evaluating agentic deployment should model actual costs carefully before committing infrastructure bets to Anthropic’s hosted agent service.
- Claude Mythos restricted to security researchers only — the first “too dangerous to release” model since GPT-2: Anthropic is limiting access to Claude Mythos to vetted security researchers after the model demonstrated the ability to discover thousands of zero-day vulnerabilities in operating systems and browsers. This is the most serious “capability gatekeeping” decision by a frontier lab since OpenAI’s original GPT-2 staged release in 2019 — and the first one that appears genuinely warranted by demonstrated offensive capability rather than precautionary principle.
- Anthropic hires Microsoft’s Azure AI chief to fix infrastructure problems: Eric Boyd, Microsoft’s senior Azure AI executive, joined Anthropic as head of infrastructure. The hire signals that scaling and operational bottlenecks are real constraints on Claude’s deployment trajectory — not theoretical concerns — and that Anthropic is serious about closing the infrastructure gap with OpenAI.
- ChatGPT’s voice mode confirmed to use a weaker model than GPT-4o: Simon Willison documented that OpenAI’s voice mode runs on a significantly weaker model than the flagship GPT-4o, despite the product’s association with GPT-4o capabilities. This is a meaningful disclosure for practitioners building voice-first applications who may have assumed flagship-level reasoning in their implementations.
- OpenAI also restricts cybersecurity AI access, matching Anthropic’s move: Both OpenAI and Anthropic are now gatekeeping advanced offensive cybersecurity AI behind exclusive agreements with select companies. This parallel policy decision — taken independently at two frontier labs within days of each other — signals an emerging norm: dual-use capability restrictions are becoming the default response when models cross a demonstrable harm threshold.
- AI offensive cybersecurity capabilities doubling every 5.7 months: Safety researchers published findings showing that AI models’ ability to exploit security vulnerabilities has been accelerating dramatically since 2024 — with models like Opus 4.6 now solving three-hour expert-level penetration testing tasks. The offensive capability curve is outpacing defensive AI at a rate that is beginning to alarm security researchers across the industry.
- Molotov cocktail attack at Sam Altman’s home; Altman acknowledges past mistakes: A 20-year-old suspect was arrested after an incendiary device attack on OpenAI CEO Sam Altman’s California home. Altman’s subsequent public statements — acknowledging past errors and warning about rising hostility toward AI — signal that public tension around AI ambitions is materializing in ways that go beyond policy debate.
- Pentagon blacklisting of Anthropic survives legal challenge: A U.S. appeals court refused to temporarily block the Pentagon’s national security designation of Anthropic, establishing that geopolitical AI restrictions can survive legal challenge. This has significant downstream implications for venture capital, international partnerships, and talent allocation across the frontier AI sector.
- Meta’s Muse Spark draws scrutiny over whether it’s genuine frontier performance or benchmaxxing: Meta’s Superintelligence Labs shipped Muse Spark on a new architecture, claiming frontier-level capability. Critical analysis from practitioners is split: some see it as a genuine competitive move, others argue the benchmark profile reflects optimization for eval metrics rather than real-world capability. The debate is ongoing and unresolved.
- AI agent defames open-source developer; operator calls it a “social experiment”: An anonymous operator deployed an AI agent that published false, defamatory content about an open-source developer and subsequently described it as a “social experiment.” The incident exposes the absence of meaningful accountability norms around AI agent deployment — a governance gap that is becoming harder to ignore as agent capabilities increase.
- CoreWeave signs multi-year infrastructure deal with Anthropic; Meta commits $21B to CoreWeave: CoreWeave secured a major multi-year cloud deal to power Claude inference, and separately agreed to a $21 billion deal with Meta — the largest infrastructure commitment in CoreWeave’s history. Both deals underscore that compute supply and GPU cloud infrastructure are the real competitive battleground in frontier AI, not model quality alone.
- AI Engineer Europe 2026 conference surfaces new deployment patterns: The first AI Engineer conference in London revealed shifting practitioner norms around how AI systems are being built and deployed — with particular attention to the gap between research demos and production reliability. The conference is becoming a meaningful signal of where the practitioner community’s collective attention is focused.
- Snowflake’s AI found more than the PII it was asked to find: A real-world incident in which Snowflake’s AI tool exceeded its stated mandate while scanning for private data — surfacing unexpected patterns and risks — illustrates both the power and the unpredictability of AI applied to sensitive data governance tasks. Useful evidence for security teams modeling unintended consequences.
- OpenAI, Anthropic, and Google form defensive alliance against model copying: Three frontier labs are coordinating to combat unauthorized copying of their models by Chinese competitors — a striking pivot from pure competition to collective IP defense. The move has downstream implications for open-source strategy and international AI governance: expect it to intensify the already-fraught debate over what “open” means in a geopolitically contested AI landscape.
- Bezos’ Project Prometheus recruits xAI co-founder from OpenAI: Kyle Kosic, an xAI co-founder who had been at OpenAI, joined Bezos’ Project Prometheus. The talent move signals that a fourth credible frontier model contender — backed by Amazon-scale resources — is taking shape. Watch this space.
- CIA produces first fully autonomous intelligence report; plans to expand AI to all analysis platforms: The CIA used AI to generate its first fully autonomous intelligence report and announced plans to integrate AI assistants across all its analysis platforms. This is a watershed moment for AI adoption in high-stakes government decision-making — and a signal that the intelligence community has moved from evaluation to deployment.
- Google study exposes systematic flaw in AI benchmark design: A Google study found that standard AI benchmarking practices — typically using 3–5 human raters per example — systematically underestimate human disagreement and produce unreliable results. For practitioners using benchmark scores to make model-selection decisions, this is a direct challenge to the validity of published performance claims. Annotation budget allocation matters as much as budget size.
- GLM-5.1 demonstrates long-horizon task breakthrough: Zhipu AI’s GLM-5.1 shows significant improvements in long-horizon reasoning tasks, competitive with frontier models. Relevant for practitioners evaluating Chinese alternatives to OpenAI and Anthropic for complex agent-driven workflows.
- A quieter week for major new flagship model releases from the top three labs: No new flagship model releases from OpenAI, Google DeepMind (beyond Gemma 4 on the open-weight side), or Anthropic were confirmed this week. The dominant themes were safety and governance (Mythos/OpenAI restrictions, cybersecurity capability acceleration, agent accountability), infrastructure (CoreWeave deals, Anthropic infrastructure hire), and the ongoing open-weight momentum story (Gemma 4 traction).
1. What Is an LLM? A Developer’s Primer
A Large Language Model (LLM) is a deep learning system trained on massive corpora of text (and increasingly images, audio, and video) to predict and generate human-like language. Built on the Transformer architecture (Vaswani et al., 2017), LLMs are characterized by billions of parameters — the numerical weights learned during training that encode knowledge about language, facts, and reasoning.
Key concepts every developer needs to know:
- Parameters: The “weights” inside a model. More parameters generally means more capacity, but not always better performance. A 7B model with excellent training data can outperform a 70B model trained poorly.
- Context Window (Tokens): How much text the model can “see” at once. A token ≈ 0.75 words. A 1M context window can process ~750,000 words in one shot.
- Inference: The process of running a trained model to generate output. This is what you pay for when using APIs.
- Fine-tuning: Continuing training a base model on domain-specific data to specialize it.
- RLHF: Reinforcement Learning from Human Feedback — human raters rank outputs to teach the model to be more helpful and less harmful.
- MoE (Mixture of Experts): Architecture where only a subset of parameters (“experts”) activate per token, enabling massive total parameter counts with lower compute cost. Used in Mixtral, DeepSeek V3, Llama 4, Grok, and others.
- RAG (Retrieval-Augmented Generation): Pairing an LLM with a vector database so it can look up external documents before answering — reducing hallucinations (see the sketch after this list).
- Quantization: Compressing model weights (e.g., 32-bit floats → 4-bit integers) to reduce VRAM requirements and increase inference speed with minimal quality loss.
- Extended Thinking / Chain-of-Thought: The model reasons internally before producing an answer, trading latency for accuracy on hard problems. Now standard across frontier models.
- Computer Use: Models that can see a screen, move a cursor, click, and type — enabling truly autonomous agentic workflows. Native in GPT-5.4, Claude 4.x, and Gemini 3.x as of 2026.
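To make the RAG concept concrete, here is a minimal retrieval sketch in Python. The embed function is a hypothetical stand-in (deterministic hash-seeded random vectors) for a real embedding model; the rest shows the core loop: embed the documents, embed the query, rank by cosine similarity, and prepend the top hits to the prompt. A production system would swap in a real embedding API and a vector database.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical stand-in for a real embedding model: deterministic
    # hash-seeded random unit vectors, just to make the sketch runnable.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

documents = [
    "Quantization compresses model weights to cut VRAM requirements.",
    "MoE models activate only a subset of experts per token.",
    "RLHF uses human preference rankings to align model behavior.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vectors @ q              # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]    # indices of the k best matches
    return [documents[i] for i in top]

query = "How do I shrink a model to fit on my GPU?"
context = "\n".join(retrieve(query))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this augmented prompt is what actually goes to the LLM
```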
2. How to Read This Guide
Each model entry uses a consistent structure:
| Field | Description |
|---|---|
| Released | Date of first public availability |
| Developer | Organization behind the model |
| Type | Proprietary / Open-weight / Open-source |
| Context Window | Maximum token input |
| Strengths | What it genuinely does well |
| Weaknesses | Honest limitations |
| Best For | Ideal use cases and user profiles |
| Constraints | Rate limits, data policies, license restrictions |
| Cost | API pricing per million tokens (input / output), April 2026 |
| Real-World Use | Documented production deployments |
3. Tier 1 — Flagship Proprietary Models
These are the frontier models competing at the highest capability level. They define the industry benchmark each quarter.
🟢 GPT Series — OpenAI
Developer: OpenAI
Type: Proprietary (closed-source)
Headquarters: San Francisco, CA
OpenAI’s GPT family is the most recognized LLM series in the world. The progression: GPT-3 (2020) launched the modern LLM era; ChatGPT (Nov 2022, GPT-3.5) made it a consumer phenomenon; GPT-4 (2023) set new benchmarks; GPT-4o (May 2024) brought true multimodality; GPT-5 (mid-2025) unified reasoning and conversation; and GPT-5.4 (March 5, 2026) is the current flagship.
Current GPT-5 Family (as of April 11, 2026)
| Model | Released | Context | Role |
|---|---|---|---|
| GPT-5.4 | March 5, 2026 | 1M (API) / 272K (ChatGPT) | Current frontier flagship |
| GPT-5.4 Thinking | March 5, 2026 | 272K | Reasoning variant; replaced 5.2 Thinking |
| GPT-5.4 Pro | March 5, 2026 | 272K | Maximum performance; Pro/Enterprise only |
| GPT-5.3 Codex | Feb 5, 2026 | 256K | Coding specialist; still active |
| GPT-5.2 | Late 2025 | 400K | Previous flagship; still available |
| GPT-5.2 Thinking | Late 2025 | 200K | Being retired April 11, 2026 |
| GPT-5 Mini | 2025 | 128K | Budget tier |
| GPT-5 Nano | 2025 | 128K | Ultra-budget tier |
| GPT-OSS 20B / 120B | 2025 | 128K | Open-weight, Apache 2.0 |
Note: As of April 11, 2026, GPT-5.1 models are no longer available in ChatGPT. GPT-5.2 Thinking remains available through April 11, 2026, then retires.
GPT-5.4 (Current Flagship)
Released: March 5, 2026
Context Window: 1,000,000 tokens (API); 272,000 tokens (ChatGPT)
Strengths:
- First mainline reasoning model to incorporate the coding capabilities of GPT-5.3-Codex — unifying coding, reasoning, and general intelligence in one model
- Native computer use in the API: can see screens, move cursors, click elements, type, and navigate desktop applications programmatically
- Upfront planning in ChatGPT Thinking mode: shows its reasoning plan before answering so you can steer it mid-response
- 33% fewer false individual claims and 18% fewer responses containing any errors vs. GPT-5.2
- Tool Search: new system that lets the model look up tool definitions on-demand rather than loading all definitions upfront — dramatically more token-efficient in tool-heavy agentic systems
- Record scores on OSWorld-Verified and WebArena Verified computer-use benchmarks
- 83% on GDPval (knowledge work tasks); #1 on Mercor’s APEX-Agents benchmark (professional skills in law and finance)
- “BigLaw Bench” score of 91% — praised specifically for structuring complex transactional legal analysis
- 87.3% preference rate over GPT-5.2 in investment banking/financial modelling tasks
- 1M token context window in the API makes it viable for processing entire codebases or document archives in one session
Weaknesses:
- Proprietary and closed-source — no auditing, fine-tuning, or self-hosting
- ChatGPT UI context window (272K) smaller than API (1M) — matters for very long document workflows
- GPT-5.4 Pro pricing is extreme for high-volume use
- Not yet available to free-tier users (Plus, Team, Pro, Enterprise only for Thinking/Pro variants)
Best For: Enterprise professional workflows, legal and financial analysis requiring maximum accuracy, developers building agentic systems with computer use, complex multi-step reasoning tasks, coding at frontier quality
Constraints: Plus ($20/month) for standard access; Pro ($200/month) for GPT-5.4 Pro; Enterprise for early access; API access via standard OpenAI account; zero data retention options on Enterprise
Cost (API):
- GPT-5.4: ~$2.50/M input, ~$10/M output
- GPT-5.4 Pro: premium pricing (contact sales)
- GPT-5.2: ~$1.75/M input, ~$14/M output
- GPT-5 (base): ~$1.25/M input, ~$10/M output
- GPT-5 Mini: ~$0.25/M input, ~$2/M output
- GPT-5 Nano: ~$0.05/M input, ~$0.40/M output
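As a sanity check on these figures, per-request cost is simply (input_tokens × input_price + output_tokens × output_price) / 1,000,000. A quick sketch using the approximate GPT-5 family prices listed above; re-check the numbers against OpenAI's pricing page before budgeting.

```python
# Illustrative per-request cost calculator using the prices listed above.
PRICES = {  # USD per million tokens: (input, output)
    "gpt-5.4":    (2.50, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Example: a 50K-token document summarized into a 1K-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 1_000):.4f}")
# gpt-5.4 -> $0.1350, gpt-5-mini -> $0.0145, gpt-5-nano -> $0.0029
```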
Real-World Use:
- Morgan Stanley: GPT-4 powered AI assistant saves financial advisors 10–15 hours/week; GPT-5.4 now used for investment banking document workflows
- Duolingo Max: GPT-4/5 powers conversation practice and contextual grammar explanation
- Khan Academy (Khanmigo): Socratic AI tutor using GPT across all K-12 subjects
- GitHub Copilot: GPT-5.4 available as an option in GitHub Copilot’s multi-model picker
GPT-5.3 Codex (Coding Specialist)
Released: February 5, 2026
Context Window: 256,000 tokens
The predecessor to GPT-5.4 that specialized in coding. Still active; faster and cheaper than GPT-5.4 for pure coding workloads. GPT-5.4 has now absorbed Codex’s capabilities, but Codex remains available for teams that need cost-efficiency on coding tasks specifically.
GPT-OSS (Open-Weight Series)
Released: 2025
Context Window: 128,000 tokens
Type: Open-weight (Apache 2.0)
Sizes: 20B and 120B
OpenAI’s first open-weight release since GPT-2 (2019). Both use MoE architecture. The 120B model is competitive with frontier proprietary models on many benchmarks. Not exposed in the ChatGPT UI — designed for local deployment and agentic tasks. A GPT-OSS-Safeguard variant (20B) was also released for content moderation workflows.
Cost: Free (self-hosted); hosted via providers like Groq
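Since GPT-OSS is open-weight and aimed at local deployment, a common pattern is to serve it behind an OpenAI-compatible HTTP endpoint and reuse the standard client. A minimal sketch, assuming a local runtime that exposes this interface (vLLM, llama.cpp's server, and Ollama all do); the URL, port, and model identifier are placeholders to adjust for your setup.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server.
# Base URL, API key, and model name are placeholders -- match your runtime.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

response = client.chat.completions.create(
    model="gpt-oss-120b",  # local model identifier; depends on your server
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```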
🟣 Claude Series — Anthropic
Developer: Anthropic
Type: Proprietary (closed-source)
Founded: 2021 by Dario Amodei, Daniela Amodei, and former OpenAI researchers
Anthropic’s Claude is built around Constitutional AI — a training framework where the model follows a set of explicit, human-readable principles. The 2026 Constitution has expanded to 23,000 words (up from 2,700 in 2023), providing more context and rationale for guidelines. Claude is consistently ranked best for long-context processing, nuanced instruction following, safety-critical enterprise applications, and agentic coding.
Current Claude Family (as of April 11, 2026)
| Model | Released | Context | Role |
|---|---|---|---|
| Claude Opus 4.6 | February 5, 2026 | 1M tokens (default on Max/Team/Enterprise) | Current power flagship |
| Claude Sonnet 4.6 | February 17, 2026 | 1M tokens (beta) / 200K (default) | Current balanced flagship |
| Claude Haiku 4.5 | October 15, 2025 | 200K | Fast / budget tier |
| Claude Opus 4.5 | 2025 | 200K | Previous generation; still available |
| Claude Sonnet 4.5 | 2025 | 1M (beta) | Previous generation |
| Claude 3 Haiku | 2024 | 200K | Retiring April 11, 2026 |
Deprecation notice: Claude Opus 4 and 4.1 have been removed from the model selector. Claude 3 Haiku (claude-3-haiku-20240307) retires April 11, 2026 — migrate to Haiku 4.5. Claude 2, 2.1, and Sonnet 3 are deprecated.
Claude Opus 4.6 (Current Power Flagship)
Released: February 5, 2026
Context Window: 1,000,000 tokens (default on Max, Team, Enterprise; previously available only as additional paid usage)
Strengths:
- 1M token context window now available by default for Max/Team/Enterprise — enough to process entire corporate document libraries in one session
- 14.5-hour task completion time horizon — the longest autonomous operation window of any model as of February 2026
- #1 on Finance Agent benchmark as of February 2026
- 61.4% on OSWorld (computer use benchmark) — best in class
- Strongest reasoning depth in Claude family; extended thinking mode with self-reflection loops
- In February 2026: 16 Opus 4.6 agents collaboratively wrote a C compiler in Rust from scratch, capable of compiling the Linux kernel
- Used by Norway’s $2.2 trillion sovereign wealth fund to screen its entire portfolio for ESG risks
- Found over 100 bugs in Firefox in a two-week scan (14 high-severity) — demonstrating real-world agentic debugging depth
- Claude Code (paired with Opus 4.6) considered the best AI coding assistant as of January 2026
- Claude Code Security: reviews entire codebases for vulnerabilities (launched February 2026)
Weaknesses:
- Slower than Sonnet; higher cost — overkill for most routine tasks
- Proprietary; all data through Anthropic servers
- Anthropic refused in February 2026 to remove contractual prohibitions on use for mass domestic surveillance and fully autonomous weapons — U.S. federal agency use is being phased out as a result
Best For: Highest-stakes long-horizon tasks, financial analysis, compliance-critical document review, agentic coding, scientific research, tasks requiring the model to “stay in context” for hours
Cost: ~$5/M input, ~$25/M output (down from $15/$75 for Opus 4.1 — a 67% price drop)
Claude Sonnet 4.6 (Current Balanced Flagship)
Released: February 17, 2026
Context Window: 1M tokens (beta); 200K (default)
Strengths:
- Near-Opus-level performance on coding, document comprehension, and office tasks
- Significantly improved computer use: can navigate browsers, fill forms, operate software autonomously
- Better instruction following with fewer errors and less hallucination vs. prior versions
- Best value in the Claude family — handles tasks that previously required Opus
- Agentic search performance improvement while consuming fewer tokens
- Supports extended thinking; structured outputs GA; web search and web fetch now generally available (no beta header)
- Microsoft M365 Copilot now offers Claude Sonnet models to enterprise users (announced April 11, 2026)
- Data residency controls: can specify US-only inference with the inference_geo parameter (1.1x pricing; see the sketch at the end of this entry)
Weaknesses:
- May decline borderline creative/grey-area requests more than competitors
- Not the fastest model for latency-sensitive real-time applications
- Proprietary; enterprise pricing requires sales contact for full suite
Cost: ~$3/M input, ~$15/M output
Real-World Use:
- Deployed widely in enterprise knowledge management, legal document review, and code review workflows
- Notion AI, Quora Poe among major consumer integrations
- Used by NASA: Claude Code prepared a ~400m route plan for Mars rover Perseverance in December 2025
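A minimal sketch of the data-residency control mentioned in the strengths list above, as a raw HTTP call to the Anthropic Messages API. The endpoint, headers, and message shape follow Anthropic's documented REST interface; the inference_geo field and the model ID are taken from this guide's description and should be verified against Anthropic's current docs before use.

```python
import os
import requests

# Raw Messages API call with the US-only inference control described above.
# The "inference_geo" field is as described in this guide -- verify the exact
# field name and accepted values against Anthropic's current documentation.
resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-sonnet-4-6",   # placeholder ID; check the model list
        "max_tokens": 512,
        "inference_geo": "us",          # ~1.1x pricing for US-only inference
        "messages": [{"role": "user", "content": "Classify this document."}],
    },
)
print(resp.json())
```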
Claude Haiku 4.5
Released: October 15, 2025
Context Window: 200,000 tokens
The fastest, cheapest Claude model. Designed for high-volume, low-latency applications where sub-second response matters.
Best For: Customer service bots, content moderation, classification, simple summarization, real-time chat
Cost: ~$1/M input, ~$5/M output
Looking ahead: A “Claude 5” (codenamed “Fennec” for Sonnet 5) has appeared in Google Vertex AI logs with the version string claude-sonnet-5@20260203. As of this writing, Anthropic has not officially confirmed a release date; industry sources had pointed to February–March 2026, a window that has now passed without a launch. Expected features: coding capabilities exceeding Opus 4.5, a “Dev Team” multi-agent collaboration mode, and pricing ~50% lower than current flagship models.
🔵 Gemini Series — Google DeepMind
Developer: Google DeepMind
Type: Proprietary (closed-source)
First Released: December 2023
Google’s Gemini family replaced PaLM/Bard. Gemini’s core advantage is native multimodality — built from the ground up to process text, images, audio, video, and code simultaneously.
Current Gemini Family (as of April 11, 2026)
| Model | Released | Context | Role |
|---|---|---|---|
| Gemini 3.1 Pro | February 19, 2026 | 1M | Current flagship reasoning model |
| Gemini 3.1 Flash-Lite | March 3, 2026 | 1M | Cost-efficient, fastest in Gemini 3 series |
| Gemini 3 Flash | Late 2025 | 128K | Default model in Gemini app |
| Gemini 2.5 Pro | March 2025 | 1M | Still available; previous flagship |
| Gemini 2.5 Flash | 2025 | 1M | Strong budget option |
| Gemini 2.0 Flash-Lite | 2025 | 128K | Ultra-budget |
| Nano Banana 2 | February 26, 2026 | — | Image generation (Gemini 3.1 Flash Image) |
| Gemini Embedding 2 | March 10, 2026 | — | Multimodal embedding model |
Deprecation: Gemini 3 Pro Preview shut down April 11, 2026 — migrate to Gemini 3.1 Pro Preview. Several 2.5 models are being shut down the same day.
Gemini 3.1 Pro (Current Flagship)
Released: February 19, 2026
Context Window: 1,000,000 tokens
Strengths:
- Upgraded core reasoning; significant improvement on complex problem-solving benchmarks over Gemini 3 Pro
- Deep integration with Google Workspace (Docs, Sheets, Gmail, Drive, NotebookLM)
- Available via Gemini API (AI Studio), Vertex AI, Gemini Enterprise, Gemini CLI, Google Antigravity, Android Studio
- Available in Gemini app for Pro/Ultra subscribers; rolling out globally
- Native computer use tool supported (launched with Gemini 3 Pro; carried into 3.1)
- Gemini 3.1 Pro Preview exposed in the developer API; Gemini CLI supports agentic development
Weaknesses:
- Premium pricing vs. competitors at similar capability
- Somewhat ecosystem-locked to Google infrastructure for best results
- Historical image generation controversy in early 2024
Cost: ~$2/M input, ~$18/M output (Gemini 3.1 Pro); ~$2/M input, ~$12/M output (Gemini 3.1 Pro via AI Studio)
Gemini 3.1 Flash-Lite (Newest Budget Model)
Released: March 3, 2026
Context Window: 1,000,000 tokens
Strengths:
- 45% faster output speed and 2.5x lower time-to-first-token than Gemini 2.5 Flash
- Elo score of 1432 on LMArena — beats models from prior generations despite budget positioning
- 86.9% on GPQA Diamond (doctoral-level science); 76.8% on MMMU Pro — outperforms larger older models
- Beats GPT-5 Mini and Claude Haiku 4.5 across 6 of 11 benchmarks per Google’s internal tests
- Ideal for translation, content moderation, UI generation, simulations
- Available in preview via Gemini API / AI Studio and Vertex AI
Cost: $0.25/M input, $1.00/M output
Gemini 3 Flash (Default App Model)
Released: Late 2025
Context Window: 128,000 tokens
Now the default model in the Gemini app, replacing 2.5 Flash. PhD-level reasoning at Flash speed. Significant leap in multimodal understanding. 78% on SWE-bench Verified in coding tasks.
Cost: ~$0.50/M input, ~$3/M output
Gemini Embedding 2
Released: March 10, 2026
The first truly multimodal embedding model — brings text, images, video, audio, and documents into a single unified embedding space. Processes up to 8,192 text tokens, six images, 120-second videos, native audio, and PDFs of up to six pages. Supports Matryoshka Representation Learning for flexible output dimensions (768, 1536, or 3072). Outperforms leading competitors in text, image, and video embedding benchmarks.
Best For: Advanced RAG, semantic search across multimedia content, data clustering across modalities
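Matryoshka Representation Learning means the leading dimensions of an embedding carry the most information, so you can truncate a 3072-dim vector to 768 dims and renormalize instead of re-embedding at a smaller size. A minimal numpy sketch of the mechanics; the vectors here are random stand-ins for real API output, so the similarity values are illustrative only.

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` components and renormalize to unit length."""
    t = v[:dim]
    return t / np.linalg.norm(t)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)  # valid because both inputs are unit-norm

rng = np.random.default_rng(0)
a = rng.standard_normal(3072)
a /= np.linalg.norm(a)
b = rng.standard_normal(3072)
b /= np.linalg.norm(b)

# Compare similarity at full size vs. the 768-dim Matryoshka truncation.
print("3072-dim:", cosine(a, b))
print(" 768-dim:", cosine(truncate_embedding(a, 768), truncate_embedding(b, 768)))
```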
Gemma 3 (Open-Weight from Google)
Released: March 2025
Type: Open-weight
Sizes: 1B, 4B, 12B, 27B
Trained on the same infrastructure as Gemini but released as open weights. All variants are multimodal (text + image).
Strengths: Google-quality training, runs on consumer hardware, free, multimodal
Weaknesses: Smaller models lack reasoning depth of 70B+ open models
Best For: Local deployment, privacy-first apps, offline AI, Google-ecosystem developers
Cost: Free (self-hosted); Google AI Studio API pricing varies
⚡ Grok Series — xAI
Developer: xAI (Elon Musk)
Type: Grok-1 open-sourced (MoE, 314B); Grok 2+ proprietary
Launched: November 2023
Deeply integrated with X (formerly Twitter). Real-time social data access is a core differentiator. Intentionally less restricted than competitors.
Current Grok Family (as of April 11, 2026)
| Model | Released | Context | Role |
|---|---|---|---|
| Grok 4.20 | February 17, 2026 | 256K | Current flagship; four-agent architecture |
| Grok 4.20 Multi-Agent Beta | March 2026 | 256K | Collaborative multi-agent variant |
| Grok 4.1 | November 2025 | 256K | Previous flagship; still available |
| Grok Code Fast 1 | 2025 | 128K | Agentic coding specialist |
| Grok Voice | 2025 | — | Real-time voice agent; in Tesla vehicles |
| Grok Imagine API | March 2026 | — | Video + audio generation |
xAI scale: Approximately 600 million monthly active users across X and Grok apps. Colossus I and II supercomputers: over 1 million H100 GPU equivalents. Grok 5 reported to be in training.
Grok 4.20 (Current Flagship)
Released: February 17, 2026
Context Window: 256,000 tokens
Strengths:
- Four-agent parallel processing architecture (“study group”): multiple agents reason simultaneously, then aggregate solutions — especially powerful for math proofs, complex research, and multi-step planning (see the sketch at the end of this entry)
- Standard, Spicy (less restricted for Premium+), and Extended Thinking modes
- Lowest hallucination rate in the xAI lineup; strictly follows prompts
- Deep integration with X/Twitter real-time data
- Grok 4.20 Multi-Agent Beta: collaborative agents for deep research and tool coordination
- Real-time financial market monitoring; web + social data as first-class context
- Grok Voice: live in Tesla vehicles and the Grok mobile app, low-latency speech in dozens of languages
Weaknesses:
- Full access requires X Premium+ subscription ($16/month for SuperGrok)
- Enterprise compliance certifications (HIPAA, SOC 2, GDPR) less mature than competitors
- Regulatory scrutiny: UK ICO investigation (Feb 3, 2026) and Ireland DPC formal investigation (Feb 17, 2026) into data handling
- The “witty/irreverent” personality is a mismatch for formal enterprise workflows
Best For: Real-time information tasks, social media analysis, financial market monitoring, research tasks requiring multi-agent parallelism, users wanting a less restricted creative assistant
Cost: Grok 4.1 API: ~$3/M input, ~$15/M output; Grok 4.1 Fast: ~$0.12/M input, ~$0.30/M output; X Premium+: $16/month
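The “study group” pattern (several agents reasoning in parallel, answers aggregated afterwards) can be approximated on any chat API with parallel sampling plus a majority vote; this is a simplification of whatever aggregation xAI actually uses. In the sketch below, ask() is a hypothetical stand-in for a single agent call.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ask(prompt: str) -> str:
    # Hypothetical stand-in for one agent's model call. A real version would
    # hit a chat API with temperature > 0 so the agents' reasoning diverges.
    return random.choice(["42", "42", "42", "41"])

def study_group(prompt: str, n_agents: int = 4) -> str:
    # Run n agents in parallel, then keep the most common final answer.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(ask, [prompt] * n_agents))
    return Counter(answers).most_common(1)[0][0]

print(study_group("What is 6 * 7?"))  # majority vote across 4 parallel agents
```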
4. Tier 2 — Strong Proprietary Challengers
🔍 Perplexity AI (Sonar Models)
Developer: Perplexity AI
Type: Proprietary platform (orchestrates frontier models)
Users: ~22 million monthly active users (2025)
Perplexity is less a standalone LLM and more a search-augmented AI platform built on top of frontier models. Every answer includes live citations.
Strengths: Citations on every answer; real-time web access as core (not a plugin); Sonar Pro: research-grade cited answers; access to GPT-5, Claude, Gemini within Pro ($20/month); dominant for research-heavy workflows
Weaknesses: Not a standalone LLM; weaker on creative or open-ended generation
Best For: Researchers, journalists, analysts, competitive intelligence, literature review
Cost: Free tier; Pro: $20/month; Sonar API: ~$1/M input, ~$1/M output
🏢 Microsoft Copilot / Azure OpenAI
Developer: Microsoft (powered by OpenAI GPT-5.4, Phi-4, Claude, Gemini)
Released: GitHub Copilot 2021; M365 Copilot 2023
Not a single model — a family of AI products embedded across the Microsoft stack. Multi-model: admins can select GPT-5.2/5.4, Claude Opus/Sonnet 4.6, or Gemini 3.1 Pro.
Strengths: Embedded in Office 365, Teams, Outlook, SharePoint; GitHub Copilot: 20M users, 90% Fortune 100; Azure: GDPR/HIPAA/SOC 2; zero data retention options
Weaknesses: Not the best raw capability; sensitive record exposure risk if permissions misconfigured
Cost: GitHub Copilot Pro: $10/month; Business: $19/user/month; Enterprise: $39/user/month
Real-World Use: BNY Mellon (80%+ devs use daily); DNV shipping (90% compliance effort reduction); DoozyTemps (60% call volume reduction)
🟡 Cohere Command R+
Developer: Cohere
Released: April 2024
Context Window: 128,000 tokens
Purpose-built for RAG. Native tool use. Multilingual across 10+ business languages. Research weights on HuggingFace (non-commercial).
Best For: Enterprise RAG systems, multilingual document Q&A, knowledge base search
Cost: ~$2.50/M input, ~$10/M output
🟠 Amazon Nova / Bedrock
Developer: AWS
Released: Nova family late 2024
Available through Amazon Bedrock alongside third-party models (Llama, Claude, Mistral). Nova Micro is one of the cheapest capable models in existence.
Best For: AWS-first organizations, cost-sensitive production workloads
Cost: Nova Micro: ~$0.035/M input, ~$0.14/M output; Nova Pro: ~$0.80/M input, ~$3.20/M output
5. Tier 3 — Open-Source Powerhouses
🦙 Meta Llama Series
Developer: Meta AI
Type: Open-weight (Meta community license; commercial use permitted for most)
First Released: February 2023 (Llama 1)
The most influential open-weight model family in history, enabling self-hosting, fine-tuning, and a massive community ecosystem.
Llama Versions Overview
| Version | Released | Context | Key Feature |
|---|---|---|---|
| Llama 1 | Feb 2023 | 2K | Started the open-weight revolution |
| Llama 2 | July 2023 | 4K | First widely commercial open-weight model |
| Llama 3 | April 2024 | 8K | Strong performance at 8B and 70B |
| Llama 3.1 | July 2024 | 128K | 405B flagship; multilingual |
| Llama 3.2 | Sept 2024 | 128K | Added 1B, 3B edge models; vision capability |
| Llama 3.3 | Dec 2024 | 128K | 70B; improved multilingual instruction |
| Llama 4 Scout | April 2025 | 10M | 109B total / 17B active (MoE) |
| Llama 4 Maverick | April 2025 | 1M | Beats GPT-4o on most benchmarks |
Llama 4 Strengths:
- Scout: 10M context window on a single H100 GPU using MoE architecture
- Maverick: outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, multilingual
- Fully open-weight: self-host for free, fine-tune, run in air-gapped environments
- Enormous community: most fine-tunes and tools of any open model family
Weaknesses: Llama 4 lost download momentum to Qwen3 by late 2025 despite strong benchmarks; 405B Llama 3.1 requires significant multi-GPU infrastructure; lighter alignment than Claude
Cost: Free (self-hosted); hosted via AWS Bedrock, Together AI, Fireworks, Groq (~$0.05–$0.90/M depending on provider and size)
🌪️ Mistral / Mixtral Series
Developer: Mistral AI (Paris, France)
Type: Apache 2.0 open-weight (most models) + proprietary API
Founded: 2023 by former DeepMind and Meta AI researchers
Leading European AI lab. Champion of open-source efficiency.
Mistral Models Overview
| Model | Released | Context | Type |
|---|---|---|---|
| Mistral 7B | Sept 2023 | 32K | Open-weight foundation |
| Mixtral 8x7B | Dec 2023 | 64K | MoE; 12.9B active params |
| Mixtral 8x22B | April 2024 | 64K | MoE; 39B active params |
| Mistral Large 2 | July 2024 | 128K | Commercial flagship |
| Mistral Large 3 | Late 2025 | 128K | 675B MoE; 92% of GPT-5.2 at 15% the cost |
| Codestral | 2024 | 256K | 80+ language code specialist |
| Devstral 2 | 2025 | 256K | 123B; 72.2% SWE-bench; top open-weight coding |
| Devstral Small 2 | 2025 | 128K | 24B; runs locally; Apache 2.0 |
| Ministral 3B | Nov 2024 | 128K | Edge/robotics; near-zero latency |
| Ministral 8B | Nov 2024 | 128K | Fast; function calling |
| Pixtral 12B | Sept 2024 | 128K | Multimodal |
| Pixtral Large | Nov 2024 | 128K | Large multimodal |
| Mistral Nemo | 2024 | 128K | Ultra-budget; $0.02/M input |
Mistral Large 3 Highlights: Uses DeepSeek V3 architecture; 675B total MoE parameters; delivers 92% of GPT-5.2 performance at ~15% the cost. Mistral OCR 3: 74% win rate on complex document parsing. Ministral 3B: capable of running on drones and robotics hardware.
Cost: Mistral 7B: free (open-weight); Mistral API: Large 3 ~$2/M input, ~$6/M output; Nemo: ~$0.02/M input, ~$0.06/M output
🔴 DeepSeek Series
Developer: DeepSeek (Hangzhou, China)
Type: MIT license (most models)
DeepSeek shocked the AI world in January 2025 — training a frontier-quality model (V3) for ~$5.58M vs. the $100M–$1B OpenAI/Anthropic spend. This permanently changed pricing expectations industry-wide.
DeepSeek Models Overview
| Model | Released | Context | Specialty |
|---|---|---|---|
| DeepSeek-V3 | Dec 2024 | 128K | General flagship; 671B/37B active MoE |
| DeepSeek-V3.2 | 2025 | 128K | Fine-Grained Sparse Attention; 50% efficiency gain |
| DeepSeek-R1 | Jan 20, 2025 | 128K | Reasoning; pure RL training |
| DeepSeek-R1-0528 | May 2025 | 128K | Updated R1 |
| DeepSeek Coder V2 | 2024 | 128K | 338 languages; MoE coding model |
| DeepSeek-Prover-V2 | 2025 | 128K | Formal theorem proving in Lean 4 |
| R1-Distill series | 2025 | 128K | 1.5B–70B distilled reasoning models |
DeepSeek V4 was widely expected around this date, with reports of 1 trillion parameters and native multimodal capabilities; check official channels for confirmation before relying on it.
Strengths:
- Training cost ~98% lower than comparable Western models — permanently disrupted pricing
- MIT license: use commercially, modify, redistribute freely
- DeepSeek-R1: trained with pure reinforcement learning — independently discovered chain-of-thought reasoning; 87.5% on AIME math
- V3.2: first model to integrate “thinking” directly into tool-use (reasoning inside agentic workflows while calling external tools)
- Prover-V2: only major open-source model specialized for formal theorem proving
Weaknesses:
- Chinese ownership: data sovereignty concerns for regulated Western enterprises
- Avoids politically sensitive topics (Tiananmen Square, Chinese government officials)
- Countries including Italy, Denmark, and the Czech Republic have banned government agencies from using DeepSeek models over cybersecurity concerns
- DeepSeek’s market share declined from 50% to under 25% by end of 2025 as Chinese competition intensified (Alibaba, Moonshot, ByteDance, MiniMax)
Cost: V3.2: ~$0.28/M input, ~$0.42/M output; cache hits: $0.028/M (90% off); R1: ~$0.55/M input, ~$2.19/M output
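To see what the cache pricing means in practice, here is a quick worked example using the V3.2 rates above: a fixed 100K-token context re-sent across 1,000 calls, assuming the first call misses the cache and the rest hit it (output-token costs omitted for simplicity).

```python
# Worked example of DeepSeek V3.2 cache-hit savings, using the rates above.
BASE_INPUT   = 0.28 / 1_000_000    # USD per input token (cache miss)
CACHED_INPUT = 0.028 / 1_000_000   # USD per input token (cache hit, 90% off)

context_tokens = 100_000   # shared system/context prefix sent on every call
calls = 1_000

no_cache   = context_tokens * calls * BASE_INPUT
with_cache = context_tokens * (BASE_INPUT + (calls - 1) * CACHED_INPUT)
print(f"without cache: ${no_cache:.2f}")    # $28.00
print(f"with cache:    ${with_cache:.2f}")  # ~$2.83
```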
🐼 Qwen Series — Alibaba Cloud
Developer: Alibaba Cloud (DAMO Academy)
Type: Apache 2.0 open-weight
The most popular open-weight model family in 2025–2026 by download volume, having overtaken Llama.
Qwen Models Overview
| Model | Released | Context | Key Feature |
|---|---|---|---|
| Qwen 2.5 | Late 2024 | 128K | 0.5B–72B; 18T training tokens; 29+ languages |
| Qwen 2.5-Max | 2025 | 128K | 1T+ parameter MoE; 119 languages |
| Qwen 3 | 2025 | 128K | 4B, 30B, 235B; thinking + non-thinking |
| Qwen3-Next | 2025 | 128K | Frontier MoE; 87.8% on AIME25 |
| Qwen3-Coder | 2025 | 256K | Agentic coding; long-horizon tool use |
| Qwen-VL | 2024–2025 | 128K | Vision-language |
| Qwen-Audio | 2024 | — | Audio processing |
| Qwen3 0.5B–4B | 2025 | 32K | Edge/on-device variants |
Strengths:
- #1 by downloads and community derivatives in open-weight ecosystem (2025)
- Qwen3-Next: 87.8% on AIME25; Qwen2.5-Max: 1T+ MoE, 119 languages
- Adopted by 90,000+ enterprises across consumer electronics, gaming, automotive
- Best multilingual open-weight model family (29+ languages with cultural nuance)
- Qwen3 supports both “thinking” (extended reasoning) and “non-thinking” (fast) modes
Weaknesses: Alibaba Cloud affiliation raises similar data sovereignty questions as DeepSeek for some enterprises
Cost: Free (open-weight); Alibaba Cloud API pricing available; hosted via Groq, Together AI, etc.
🔷 IBM Granite (4.0 Family)
Developer: IBM Research
Type: Apache 2.0 open-source
Latest: Granite 4.0 (2025); Granite 4.0 1B Speech (April 11, 2026)
Strengths:
- Apache 2.0: most permissive license in AI — zero IP ambiguity for commercial use
- Granite 4.0: lightweight; multilingual; coding, RAG, tool use, JSON output natively
- Granite 4.0 1B Speech: compact ASR and speech translation model (April 11, 2026)
- Granite Code: 116 programming languages (3B, 8B, 20B, 34B)
- Granite Guardian: safety/guardrail models (2B–8B)
- Granite Embedding: purpose-built for semantic search and RAG
- Strong compliance story for banking, insurance, government
Best For: Regulated industries needing Apache 2.0 licensing clarity, on-premise deployment, IBM watsonx platform users
Cost: Free (open-source); IBM watsonx API pricing available
🦅 Falcon Series — TII (UAE)
Developer: Technology Innovation Institute (UAE)
Type: Apache 2.0
Released: Falcon 40B: May 2023; Falcon 180B: 2023; Falcon 2: 2024
Once the open-source benchmark leader; now surpassed by Llama and Qwen but historically important. Falcon 2 (11B) includes VLM variant with vision-to-language capability.
Best For: UAE/Middle Eastern government deployments; vision-language tasks at open-weight cost
Weakness: TII’s iteration pace has slowed significantly; Falcon 180B has extreme inference hardware requirements
🪟 Microsoft Phi Series
Developer: Microsoft Research
Type: MIT license
Released: Phi-3.5: April 2024; Phi-4: late 2024; Phi-4 Mini: early 2025
“Small language model” research proving that small models trained on high-quality synthetic data far exceed their size class.
Phi-4 (14B) Strengths: Reasoning benchmarks rival 70B models; strong safety and hallucination avoidance; MIT licensed
Phi-4 Mini (3.8B): 128K context; runs on consumer hardware; great for mobile and education
Best For: Education, mobile AI, resource-constrained devices, consumer hardware deployment
Cost: Free (open-weight); available on Azure
🌍 BLOOM — BigScience
Developer: BigScience Workshop (1,000+ global researchers)
Type: BigScience RAIL license
Released: July 2022 | Parameters: 176B
Supports 46 natural languages and 13 programming languages — the most multilingual open model ever released. Architecture now outdated but critically important for low-resource language research.
🔬 OLMo — Allen Institute for AI
Developer: Allen Institute for AI (AI2)
Type: Fully open-source (Apache 2.0, including training data and code)
Released: 2024 | Parameters: 7B, 65B
The most fully transparent major model family — it releases weights, training data (Dolma), training code, evaluation code, and intermediate checkpoints. Essential for AI safety research and reproducibility.
🟩 NVIDIA Nemotron 3 Super
Released: March 2026
Parameters: 120B total, 12B active (Hybrid Mamba-Transformer MoE)
Type: Open
Context Window: 1,000,000 tokens
Strengths:
- Hybrid Mamba-Transformer MoE architecture: over 50% higher token-generation throughput than leading open models
- Multi-token prediction (MTP) for faster inference
- 1M context window for long-term agent coherence
- 439 tokens/second — one of the fastest models available (any size)
- Optimized for complex multi-agent applications
Best For: High-throughput agentic applications needing long-context and extreme speed; NVIDIA ecosystem developers
6. Tier 4 — Chinese Frontier Models
China has built a parallel AI ecosystem serving hundreds of millions of users domestically and growing globally. Competition intensified dramatically in 2025: Alibaba, Moonshot, Zhipu, ByteDance, and MiniMax all released major models, eroding DeepSeek’s dominance.
🔴 Baidu ERNIE (文心 4.5)
Developer: Baidu
Type: Proprietary
Users: 200M+ registered users
China’s most-deployed enterprise LLM. Integrated into Baidu Search (dominant Chinese search engine). Superior Chinese NLP; strong on Chinese legal, medical, and business documents.
Weaknesses: Weaker than GPT-5 on English/multilingual; restricted to approved topics under Chinese regulations
Best For: Chinese-language applications, businesses operating in China, Mandarin-first customer service
🟤 Zhipu GLM-5 / ChatGLM
Developer: Zhipu AI (Beijing)
Released: GLM-5: 2025; GLM-5 Turbo: March 2026; GLM-5.1: April 2026
Strengths:
- GLM-5 (Reasoning): scores 50 on Intelligence Index — highest-ranked open-weight model globally
- GLM-5 Turbo: optimized for fast inference in agent-driven environments (OpenClaw scenarios); long execution chains, tool use, scheduled and persistent execution
- GLM-5.1 (April 2026): significant improvements in long-horizon reasoning tasks; benchmarks competitive with frontier proprietary models on complex agent workflows
- Strong bilingual Chinese + English performance
- For comparison, Moonshot’s Kimi K2.5 Thinking scores 47 on the same Intelligence Index
Best For: Bilingual applications, agentic tasks requiring persistent execution, Chinese-first reasoning, long-horizon reasoning tasks
🌙 Moonshot Kimi
Developer: Moonshot AI (Beijing)
Type: Proprietary
Strengths:
- Extraordinary long-context capabilities (up to 2M tokens)
- Kimi Linear (October 2025): efficient attention reducing memory usage for large context windows
- OK Computer feature: creates web applications from descriptions
- Kimi K2.5 Thinking: ranks 2nd among open-weight models on Intelligence Index (47)
- Qwen3-Next-based Kimi K2 Thinking: 44.9 on Intelligence Index
Best For: Long document analysis, Chinese market, web application generation
🔷 Baichuan / Yi / Hunyuan / InternLM
Baichuan: Strong Chinese cultural/historical knowledge; BaichuanMed for clinical decision support
Yi (01.AI): Yi-34B was strong open-weight bilingual model; now surpassed by Qwen3 and Llama 4
Hunyuan (Tencent): WeChat/QQ integration; video + image + text generation; Chinese creative content
InternLM (Shanghai AI Lab): Academic orientation; Apache 2.0; strong reasoning and code; InternLM 2.5 (7B, 20B)
📦 ByteDance Seed
Developer: ByteDance
Released: Seed 2.0 Lite and Pro: February 2026
ByteDance’s frontier model family, leveraging TikTok/Douyin ecosystem data. Seed 2.0 Pro is competitive with GPT-4o-class models on coding and reasoning benchmarks. Rapidly gaining adoption in China.
🔢 MiniMax M2.5
Developer: MiniMax
Released: February 2026
Rapidly emerging Chinese lab. M2.5 competitive with frontier models on coding and math. Known for efficient inference architecture and aggressive pricing. Growing developer adoption via API.
7. Tier 5 — Coding-Specialist Models
💻 GitHub Copilot
Developer: GitHub + Microsoft (multi-model backend)
Released: Preview 2021; GA 2022
Users: 20 million (July 2025; 400% YoY growth); 90% of Fortune 100
Now multi-model: users can choose GPT-5.4, Claude Opus/Sonnet 4.6, Gemini 3.1 Pro, or auto-selection. Agent mode handles autonomous multi-file development. Deep IDE integration (VS Code, JetBrains, Neovim, Xcode).
Cost: Free (limited, 2,000 completions/month); Pro: $10/month; Pro+: $39/month; Business: $19/user/month; Enterprise: $39/user/month
Real-World Use: BNY Mellon (80%+ devs daily); 20M developers globally; 90% Fortune 100
🤖 DeepSeek Coder V2 / Prover-V2
Coder V2: 236B MoE total / ~21B active; 338 programming languages; 128K context; near GPT-4 Turbo coding quality at DeepSeek pricing
Prover-V2: Open-source; only major model specialized for formal theorem proving in Lean 4 — significant for mathematics and formal verification communities
⭐ StarCoder2
Developer: BigCode (HuggingFace + ServiceNow)
Released: February 2024 | Sizes: 3B, 7B, 15B
Trained on The Stack v2 (619 programming languages). Fill-in-the-Middle capability. StarCoder2-15B rivals CodeLlama 34B. OpenRAIL-M license.
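Fill-in-the-Middle works by reordering the prompt with sentinel tokens so the model generates the missing span given the code both before and after the cursor. A sketch using StarCoder-style sentinels; the token names follow the StarCoder convention, so verify them against the tokenizer of the exact checkpoint you deploy.

```python
# Fill-in-the-Middle: the model sees prefix and suffix, generates the middle.
# Sentinel token names follow the StarCoder convention -- confirm them against
# the tokenizer of the checkpoint you actually deploy.
prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
print(fim_prompt)
# A FIM-trained model would complete the middle with something like: sum(xs)
```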
🦙 CodeLlama
Developer: Meta | Released: August 2023 | Sizes: 7B, 13B, 34B, 70B
Llama 2-based code model. Fill-in-the-Middle. 70B version approaches GPT-4 on coding benchmarks.
🌊 Codestral / Devstral 2 (Mistral)
Codestral: 80+ languages; fast code completion; 256K context
Devstral 2: 123B parameters; 72.2% on SWE-bench Verified — top open-weight coding model as of 2026
Devstral Small 2: 24B; runs locally on consumer hardware; Apache 2.0
🛒 Amazon Q Developer / Tabnine
Amazon Q Developer: Deep AWS service knowledge; ideal for developers in the AWS ecosystem
Tabnine: On-premise deployment; zero code leaves the organization — critical for IP-sensitive codebases at banks, defense contractors, law firms. Enterprise: custom pricing
8. Tier 6 — Domain-Specific Models
🏥 Healthcare LLMs
Med-PaLM 2 / MedLM (Google): First LLM at expert-level USMLE accuracy (85%+). MedLM deployed in multiple U.S. hospital systems for clinical documentation, triage, and diagnostic support. HIPAA-compliant via Google Cloud BAAs.
BioMedLM (Stanford CRFM): Trained on PubMed; strong biomedical NER, relation extraction, and QA.
ClinicalBERT: Fine-tuned BERT on MIMIC-III clinical notes. Still widely used in healthcare informatics for ICD coding, clinical NER, adverse event detection.
Real-World: One hospital reduced patient triage times by 34% using a domain-specific SLM trained on internal case data.
💰 Finance LLMs
BloombergGPT: 50B parameters; trained on 363B tokens of Bloomberg financial data. Cuts error rates by 30%+ vs. general LLMs. Integrated into investment platforms. Proprietary — Bloomberg products only.
FinGPT (AI4Finance Foundation): Open-source foundation for fintech. Fine-tunable on proprietary data. Sentiment analysis, stock prediction, financial QA.
Real-World: 60%+ of major North American financial institutions are running pilots or production financial LLM systems. JPMorgan’s COIN platform reviews loan agreements using domain-trained models.
⚖️ Legal LLMs
Harvey AI: Fine-tuned GPT-4/5 for legal workflows. BigLaw Bench score 91% (GPT-5.4). Integrates with Westlaw and LexisNexis.
CoCounsel (Thomson Reuters / Casetext): GPT-4 powered; native Westlaw integration. Top legal AI benchmarks alongside Harvey.
ChatLAW: Research model trained on legal corpora; 40% faster legal research times in studies.
Real-World: 45%+ of AmLaw 200 firms exploring or deploying legal AI tools in 2025.
🔬 Science / Security
Galactica (Meta, 2022): Trained on scientific papers — withdrawn after 3 days due to confident hallucinations. A cautionary tale about domain-LLM risk.
SciGLM: Chinese academic model for cross-domain scientific reasoning (chemistry, biology, physics, math).
Cybersecurity: Microsoft Security Copilot (GPT-4 + Microsoft Sentinel); CrowdStrike Falcon AI; Snyk AI (code security). No single dominant open cybersecurity LLM — most serious deployments use frontier models with security-specific RAG pipelines.
9. Tier 7 — Edge / On-Device / Small Models
| Model | Developer | Params | Context | License |
|---|---|---|---|---|
| Phi-4 Mini | Microsoft | 3.8B | 128K | MIT |
| Gemma 3 1B | Google | 1B | 32K | Open |
| Gemma 3 4B | Google | 4B | 128K | Open |
| Llama 3.2 1B | Meta | 1B | 128K | Meta |
| Llama 3.2 3B | Meta | 3B | 128K | Meta |
| MiniCPM 3B | ModelBest/Tsinghua | 3B | 32K | Open |
| Qwen3 0.5B–4B | Alibaba | 0.5–4B | 32K | Apache 2.0 |
| Ministral 3B | Mistral | 3B | 128K | Open |
| Apple on-device | Apple | Private | — | Proprietary |
Apple FastVLM (CVPR 2025): FastViTHD encoder reduces image encoding latency while generating 4x fewer tokens. All processing stays on-device. iOS 18+ AI features use on-device LLMs for privacy-first inference. Weights not publicly released.
Key pattern: Phi-4 Mini and Gemma 3 4B are the current leaders for on-device/consumer hardware deployment — MIT/Apache licensed, strong reasoning despite small size.
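Getting these small models onto consumer hardware usually also involves the 4-bit quantization described in the primer above. A minimal numpy sketch of symmetric int4 round-trip quantization of a toy weight matrix, showing how the mapping works and how small the reconstruction error stays. Real int4 formats pack two values per byte and quantize per-group; this per-tensor version is a simplification.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor int4: map floats to integers in [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32) * 0.02  # toy weights

q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

# Mean relative reconstruction error -- typically a few percent at int4.
err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {err:.3%}")
```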
10. Tier 8 — Research & Historical Models
These models are largely deprecated for production use but historically important and still referenced in research.
| Model | Developer | Year | Significance |
|---|---|---|---|
| GPT-1 | OpenAI | 2018 | First GPT; proved unsupervised pre-training |
| BERT | Google | 2018 | Bidirectional transformer; dominated NLP for years |
| GPT-2 (1.5B) | OpenAI | 2019 | “Too dangerous to release” — now fully open |
| XLNet | CMU + Google | 2019 | Permutation-based training; beat BERT on 20 tasks |
| RoBERTa | Facebook AI | 2019 | Improved BERT training methodology |
| GPT-3 (175B) | OpenAI | 2020 | Changed the field; first practical few-shot learning |
| T5 / FLAN-T5 | Google | 2020/2022 | Unified text-to-text framing |
| Megatron-Turing NLG (530B) | MS + NVIDIA | 2021 | Largest model at release; proved distributed training |
| Gopher (280B) | DeepMind | 2021 | Strong knowledge tasks |
| LaMDA | Google Brain | 2021 | Dialogue-focused; became Bard then Gemini |
| ERNIE 3.0 Titan | Baidu | 2021 | 260B; Chinese knowledge pre-training |
| WuDao 2.0 | BAAI/CAS | 2021 | 1.75T params; multilingual; largest announced model |
| Chinchilla (70B) | DeepMind | 2022 | Proved smaller models + more data beat larger models on less data — “Chinchilla scaling laws” changed how the entire industry trains |
| GPT-NeoX (20B) | EleutherAI | 2022 | Largest open model before LLaMA |
| GPT-J (6B) | EleutherAI | 2021 | First widely-used open GPT-3 alternative |
| BLOOM (176B) | BigScience | 2022 | 46 languages; global collaborative model |
| PaLM (540B) | Google | 2022 | Google’s dominant research model before Gemini |
| InstructGPT | OpenAI | 2022 | RLHF pioneer; led to ChatGPT |
| ChatGPT (GPT-3.5) | OpenAI | Nov 2022 | Made LLMs a consumer product; deprecated 2025 |
| GPT-4 | OpenAI | March 2023 | Multi-year benchmark leader; now deprecated |
| Alpaca | Stanford | 2023 | LLaMA fine-tuned on GPT-3.5 data for $600 — proved instruction tuning works |
| Vicuna | LMSYS | 2023 | LLaMA fine-tuned on ChatGPT conversations |
| MPT-7B | MosaicML | 2023 | FlashAttention + ALiBi; foundation for DBRX |
| Falcon 180B | TII | 2023 | Held open-source lead for months; Apache 2.0 |
| SOLAR 10.7B | Upstage | 2023 | “Depth Upscaling” to merge two 7B models; beat GPT-3.5 |
| Galactica | Meta | 2022 | Scientific LLM; withdrawn after 3 days |
| PaLM 2 | Google | 2023 | Powered Bard; PaLM API deprecated Oct 2024 |
| DBRX | Databricks | March 2024 | 132B MoE; Apache 2.0; strong at launch |
| Cerebras-GPT | Cerebras | 2023 | Trained on wafer-scale cluster |
| DistilBERT | HuggingFace | 2019 | 97% of BERT at 40% size; still used in prod |
Pricing Comparison Table (April 11, 2026)
All prices in USD per million tokens (Input / Output). Verified against official provider documentation as of April 11, 2026. Prices change frequently — always confirm on provider pricing pages before budgeting. No confirmed pricing changes were reported in verified sources this week; table reflects last confirmed figures.
Proprietary Models
| Model | Input ($/M) | Output ($/M) | Context | Notes |
|---|---|---|---|---|
| Mistral Nemo | $0.02 | $0.06 | 128K | |
| Nova Micro (AWS) | $0.035 | $0.14 | 128K | |
| GPT-5 Nano | $0.05 | $0.40 | 128K | |
| Gemini 2.0 Flash-Lite | $0.075 | $0.30 | 128K | |
| Grok 4.1 Fast | $0.12 | $0.30 | 2M | |
| Gemini 3.1 Flash-Lite | $0.25 | $1.00 | 1M | |
| GPT-5 Mini | $0.25 | $2.00 | 128K | |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | |
| Gemini 3 Flash | $0.50 | $3.00 | 128K | |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | |
| GPT-5 | $1.25 | $10.00 | 400K | |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | |
| GPT-5.2 | $1.75 | $14.00 | 400K | |
| Gemini 3.1 Pro | $2.00 | $12–18 | 1M | |
| Mistral Large 3 | $2.00 | $6.00 | 128K | |
| GPT-5.4 | $2.50 | $10.00 | 1M (API) | |
| Grok 4.1 | $3.00 | $15.00 | 256K | |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Restricted Mythos variant gated to security researchers |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Confirmed solving 3-hr expert security tasks |
| GPT-5.4 Pro | Contact sales | — | 272K | Enterprise/Pro tier |
Open-Weight Models (Self-Hosted = Free; Hosted Pricing Below)
| Model | Hosted Input ($/M) | Hosted Output ($/M) | Context | License | Notes |
|---|---|---|---|---|---|
| DeepSeek V3.2 | $0.14–0.28 | $0.28–0.42 | 128K | MIT | |
| Llama 4 Scout | ~$0.10 | ~$0.40 | 10M | Meta | |
| Qwen3 32B | ~$0.15 | ~$0.60 | 128K | Apache 2.0 | |
| Llama 4 Maverick | ~$0.15 | ~$0.60 | 1M | Meta | |
| Gemma 4 | ~$0.20 | ~$0.40 | 128K | Open | 2M+ downloads; hosted pricing estimated — verify before budgeting |
| Gemma 3 27B | ~$0.20 | ~$0.20 | 128K | Open | |
| Mistral 7B | ~$0.25 | ~$0.75 | 128K | Apache 2.0 | |
| IBM Granite 4.0 3B | Free on watsonx | — | 128K | Apache 2.0 | Multimodal vision variant |
| DeepSeek R1 | $0.55 | $2.19 | 128K | MIT | |
| Mixtral 8x7B | ~$0.65 | ~$0.65 | 32K | Apache 2.0 | |
| GPT-OSS 120B | ~$0.90 | ~$0.90 | 128K | Apache 2.0 | |
| IBM Granite 4.0 | Free on watsonx | — | 128K | Apache 2.0 | |
| DeepSeek V4 | TBD (beta) | TBD | TBD | MIT (confirmed) | Limited beta; figures unverified |
Note on Gemma 4 hosted pricing: Gemma 4 crossed 2 million downloads this week, confirming strong adoption. Hosted pricing from third-party providers is not yet confirmed in verified sources — the figure above is estimated based on comparable Gemma 3 pricing and should be verified on provider pages before budgeting.
Note on Anthropic agent hosting: Anthropic’s $0.08/hr headline price for agent hosting was scrutinized this week. Actual unit economics at production scale are reported to be significantly higher. Do not use the headline figure for infrastructure budgeting without modelling your specific workload against the real pricing tiers.
Cost Optimization Strategies
- Prompt caching: Up to 90% savings on repeated context — now supported by Anthropic, OpenAI, Google, and xAI
- Batch API: 50% discount for async, non-latency-sensitive workloads (OpenAI, Anthropic, Google)
- Tiered model routing: Budget model (Gemini Flash-Lite / Haiku 4.5) for triage and classification → mid-tier (Sonnet 4.6 / Grok 4.1) for generation → flagship (GPT-5.4 / Opus 4.6) only for high-stakes reasoning; can reduce costs 60–85% vs. using flagship for everything (see the routing sketch after this list)
- Quantization on open models: 4-bit quantization reduces compute ~60–70% with minimal quality degradation on Llama 4 and Qwen3 family; GGUF format now well-supported across llama.cpp and Ollama
- DeepSeek cache hits: DeepSeek V3.2 cache pricing at $0.028/M (90% off base) — strong for repetitive retrieval-augmented workloads
- Gemma 4 for open multimodal workloads: Now confirmed at 2M+ downloads, Gemma 4 is establishing itself as the leading open-weight option for vision-language and on-device agentic tasks — worth evaluating against proprietary tiers before committing to closed APIs
- Devstral 2 for coding pipelines: At competitive hosted pricing, offers strong open-weight coding quality (72%+ SWE-bench range) with Apache 2.0 license for self-hosted deployments
- Anthropic agent hosting — verify real costs first: This week’s analysis revealed the $0.08/hr figure does not reflect production-scale economics. Run your own cost model against Anthropic’s actual pricing tiers before treating hosted agents as a budget option
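A minimal sketch of the tiered-routing pattern, assuming a placeholder `call_model` wired to whichever provider SDK you use; the `classify_difficulty` heuristic, model names, and token estimate are illustrative, and the prices come from the table above (re-verify before budgeting):

```python
# Tiered model routing: budget tier for triage, flagship only for
# high-stakes reasoning. All numbers are illustrative.

PRICES = {  # USD per million output tokens (see pricing table above)
    "budget":   {"model": "gemini-3.1-flash-lite", "out_per_m": 1.00},
    "mid":      {"model": "claude-sonnet-4.6",     "out_per_m": 15.00},
    "flagship": {"model": "claude-opus-4.6",       "out_per_m": 25.00},
}

def classify_difficulty(prompt: str) -> str:
    """Toy heuristic; in production, use a cheap classifier model instead."""
    if len(prompt) > 2000 or "prove" in prompt.lower():
        return "flagship"
    if any(k in prompt.lower() for k in ("analyze", "refactor", "draft")):
        return "mid"
    return "budget"

def route(prompt: str) -> tuple[str, float]:
    """Pick a tier and estimate the per-call output cost."""
    tier = PRICES[classify_difficulty(prompt)]
    est_out_tokens = 500  # assumed average response length
    est_cost = est_out_tokens / 1_000_000 * tier["out_per_m"]
    return tier["model"], est_cost

model, cost = route("Summarize this support ticket: ...")
print(f"routing to {model}, est. ${cost:.5f} per call")
```

Misrouting a hard prompt to the budget tier usually costs more in retries than a classifier call would, so spend the first optimization effort on the router, not the worker models.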
Benchmark Comparison (April 11, 2026)
Key Benchmarks Explained
| Benchmark | What It Measures |
|---|---|
| AIME 2025 | Hard math competition problems — primary reasoning/math gold standard |
| SWE-bench Verified | Real GitHub issue resolution — most practical coding benchmark |
| HumanEval | Basic function-level code generation; largely saturated at frontier |
| GPQA Diamond | Doctoral-level science questions across biology, chemistry, physics |
| ARC-AGI-2 | Novel pattern reasoning explicitly designed to resist memorization |
| OSWorld | Computer use — can the model autonomously operate a real desktop |
| LMArena Elo | Human preference ranking via blind side-by-side comparisons |
| Finance Agent | Agentic financial analysis tasks across real-world scenarios |
| BigLaw Bench | Legal document analysis, contract review, transactional structuring |
| GDPval | Knowledge work tasks across professional domains (law, finance, medicine) |
| Aider Polyglot | Multi-language code editing across real repositories |
| MMMU | Multimodal understanding — images, charts, scientific figures |
| Penetration Testing (Expert-Level) | 3-hour expert security tasks; newly relevant given Opus 4.6 / Mythos findings |
Benchmark Snapshot (April 11, 2026)
| Model | AIME 2025 | SWE-bench Verified | OSWorld | LMArena Elo / II | Notes |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 100% (w/code) | — | — | Top tier | Current Google frontier flagship |
| Llama 4 Behemoth | 96.2%* | — | — | — | *Developer tech report; weights unreleased |
| GPT-5.4 | — | ~80% | Record | Top tier | BigLaw Bench 91%; GDPval 83% |
| DeepSeek V4 | ~95%* | ~78%* | — | — | *Early leak; unverified; limited beta only |
| Qwen3-Next | 87.8% | — | — | — | Strongest publicly available open-weight math |
| GPT-5.2 | 100% | — | — | — | Previous OpenAI flagship; still available |
| Grok 4.1 | — | — | — | 1483 Elo (#1*) | *Position may shift; monitor arena rankings |
| Claude Opus 4.6 | — | — | 61.4% | ~91.3 II | Finance Agent #1; 14.5hr task horizon; confirmed solving 3-hr expert pen-test tasks |
| Claude Sonnet 4.6 | — | 77.2% | — | ~89.9 II | |
| Claude Mythos | — | — | — | Restricted | Security-researcher-only; zero-day vuln discovery; no public benchmark profile available |
| DeepSeek R1 | 87.5% | — | — | — | Landmark pure-RL trained reasoning model |
| Devstral 2 | — | 72.2% | — | — | Top confirmed open-weight coding |
| Gemma 4 | — | — | — | Accumulating | Released recently; 2M+ downloads; independent benchmarks still accumulating |
| Meta Muse Spark | — | — | — | Contested | Frontier claims disputed; benchmaxxing concerns raised by practitioners |
| GLM-5 | — | — | — | 50 II | Highest open-weight Intelligence Index |
| Gemini 3.1 Flash-Lite | — | — | — | 1432 Elo | Budget model; beats prior-gen flagships |
| NVIDIA Nemotron 3 Super | — | — | — | — | 439 tokens/sec; speed-optimized |
| Llama 4 Maverick | — | ~65% | — | — | Top open-weight generalist at accessible cost |
| IBM Granite 4.0 3B | — | — | — | — | Multimodal vision; enterprise document focus |
II = Intelligence Index score. Asterisked scores (*) are from developer-reported or early/leaked evaluations; treat them as directional until third-party replication.
Benchmark Notes for This Week
- Claude Mythos has no public benchmark profile. Access is restricted to vetted security researchers following the discovery of thousands of zero-day vulnerabilities across operating systems and browsers. The capability story here is not about standard benchmark scores — it is about demonstrated offensive security capability that Anthropic judged too dangerous to release openly. This is the first time since GPT-2 that a major lab has adopted this posture, and it was triggered by empirical results rather than precautionary principle.
- Claude Opus 4.6 is now confirmed solving three-hour expert-level penetration testing tasks — a capability data point that sits outside the standard benchmark suite but is among the most consequential capability signals published this week.
- Meta Muse Spark benchmark claims are actively contested. Multiple practitioners have raised concerns that the model’s profile reflects benchmark optimization rather than general capability. Treat developer-reported scores as directional until independent third-party reproductions are published.
- Gemma 4 continues accumulating arena comparisons and third-party evaluations. The 2M+ download figure confirms practitioner interest, but independent benchmark scores on SWE-bench Verified, GPQA Diamond, and AIME 2025 are not yet confirmed in verified sources. We will add confirmed scores as they emerge.
- Google’s benchmark methodology was challenged this week by a Google-internal study finding that standard benchmarking practices (3–5 human raters per example) systematically underestimate human disagreement and produce unreliable results. This is a methodological warning that applies across the entire table above — treat all benchmark figures as directional signals, not ground truth, particularly on tasks with high annotator subjectivity.
- DeepSeek V4 leaked scores remain unverified pending independent reproduction. Continue treating as directional.
- LMArena Elo rankings shift regularly as new models enter the arena. Treat Elo positions as directional, not definitive.
Choosing the Right LLM: Decision Framework
Step 1: Define Your Primary Workload
| Workload | Top Picks (April 11, 2026) |
|---|---|
| Complex reasoning / math | GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6, DeepSeek R1 |
| Long document analysis | Claude Opus 4.6 (1M), Gemini 3.1 Pro (1M), GPT-5.4 API (1M) |
| Agentic coding | Claude Sonnet/Opus 4.6, GPT-5.4, Devstral 2 |
| Code completion (IDE) | GitHub Copilot, Codestral, StarCoder2 |
| Real-time conversation | GPT-5 Mini, Grok 4.20, Claude Haiku 4.5, Gemini 3 Flash |
| RAG / document Q&A | Cohere Command R+, Claude Sonnet 4.6, GPT-5 |
| Multilingual | Qwen3 (119 langs), Qwen2.5-Max, Mistral Large 3, BLOOM |
| High-volume, budget | DeepSeek V3.2, Gemini 3.1 Flash-Lite, Mistral Nemo |
| Self-hosted / air-gapped | Llama 4, Qwen3, Mistral Large 3, GPT-OSS 120B |
| Medical | MedLM (Google Cloud), BioMedLM + RAG, Med-PaLM 2 |
| Legal | Harvey AI, CoCounsel, GPT-5.4 (BigLaw Bench 91%) |
| Financial | Bloomberg GPT, Claude Opus 4.6 (Finance Agent #1) |
| On-device / edge | Phi-4 Mini, Gemma 3 1B–4B, Llama 3.2 1B–3B, Qwen3 0.5B–4B |
| Chinese language | ERNIE 4.5, Qwen3, GLM-5, Moonshot Kimi, ByteDance Seed |
| Maximum compliance | Claude Enterprise, GitHub Copilot Enterprise, IBM Granite, Azure OpenAI |
| Formal theorem proving | DeepSeek-Prover-V2 |
| Computer use / GUI agents | GPT-5.4 (native), Claude 4.6, Gemini 3.1 Pro |
| Real-time social/web data | Grok 4.20, Perplexity Sonar |
Step 2: Assess Constraints
| Constraint | Recommendation |
|---|---|
| Data sovereignty (data can’t leave country) | Self-hosted open-weight, or regional cloud (Azure EU, Google EU) |
| HIPAA/SOC 2/GDPR required | Azure OpenAI, Google Vertex AI, Claude Enterprise, AWS Bedrock |
| Budget (high volume) | DeepSeek V3.2, Gemini 3.1 Flash-Lite, Mistral Nemo, GPT-5 Nano |
| Real-time latency (<1s) | Gemini Flash-Lite, Claude Haiku, Grok 4.1 Fast, Ministral 3B |
| Need fine-tuning control | Open-weight: Llama 4, Qwen3, Mistral, GPT-OSS |
| IP clarity for commercial use | Apache 2.0 or MIT only: IBM Granite, Qwen3, Mistral, OLMo (Apache 2.0); Phi-4 (MIT) |
| Reasoning depth over speed | o3, Claude Opus 4.6, DeepSeek R1, Gemini 3.1 Pro Deep Think |
Step 3: Run Your Own Evaluation
Don’t rely solely on public benchmarks (a minimal harness sketch follows this list):
- Create 10–20 prompts from your actual production queries
- Score on: accuracy, format compliance, latency, and cost per correct answer
- Re-run monthly — model catalogs change every 2–3 weeks
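A minimal harness for this loop, assuming a placeholder `call_model` wired to your provider SDK; the substring-match scorer and the 4-characters-per-token estimate are crude stand-ins for your own rubric and tokenizer:

```python
import time

def call_model(model: str, prompt: str) -> str:
    """Placeholder: wire this to your provider SDK of choice."""
    raise NotImplementedError

def evaluate(model: str, cases: list[dict], out_price_per_m: float) -> dict:
    """Score accuracy, median latency, and cost per correct answer."""
    correct, latencies, tokens_out = 0, [], 0
    for case in cases:
        t0 = time.perf_counter()
        answer = call_model(model, case["prompt"])
        latencies.append(time.perf_counter() - t0)
        tokens_out += len(answer) // 4                  # rough token estimate
        if case["expected"].lower() in answer.lower():  # naive scorer
            correct += 1
    cost = tokens_out / 1_000_000 * out_price_per_m
    return {
        "accuracy": correct / len(cases),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
        "cost_per_correct": cost / max(correct, 1),
    }

# cases = [{"prompt": "<real production query>", "expected": "<key fact>"}]
# print(evaluate("claude-sonnet-4.6", cases, out_price_per_m=15.00))
```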
Real-World Enterprise Success Stories
OpenAI / GPT
- Morgan Stanley: AI research assistant saves financial advisors 10–15 hours/week; GPT-5.4 used for investment banking document workflows (87.3% preference rate)
- Duolingo Max: GPT-4/5 powers “Explain My Answer” and conversation practice for 30M+ learners
- Khan Academy (Khanmigo): Socratic AI tutor across all K-12 subjects
- GitHub Copilot: 20M developers globally; adopted across 90% of the Fortune 100 (see BNY Mellon under Microsoft below)
Anthropic / Claude
- NASA: Claude Code planned a ~400m route for Mars rover Perseverance (December 2025)
- Norway Sovereign Wealth Fund ($2.2T): Claude screens entire portfolio for ESG risks — earlier divestments, improved monitoring of forced labour and corruption (February 2026)
- Firefox audit: Claude found 100+ bugs in Firefox in two weeks; 14 high-severity (2026)
- Notion AI, Quora Poe: Major consumer integrations for writing and Q&A
Google / Gemini
- Google Workspace: Hundreds of millions of Docs/Sheets/Gmail users access Gemini AI Assist
- Hospital systems: MedLM deployed for clinical documentation at multiple U.S. health systems
- Gemini in Chrome: Rolled out to Canada, New Zealand, India with 50+ language support (April 11, 2026)
Microsoft / Copilot
- BNY Mellon: 80%+ of developers use GitHub Copilot daily — “part of our DNA”
- DNV (shipping/maritime): Azure OpenAI reduced compliance analysis effort by 90%
- DoozyTemps: Copilot customer service bot reduced call volume by 60%
- New Zealand power utility: Copilot planning system halved required project staff
DeepSeek
- Global startups: Hundreds switched after January 2025 announcement, cutting API costs 80–95%
- Academic research: R1’s pure RL training approach widely studied and reproduced
Finance / Legal
- BloombergGPT: 30%+ error rate reduction on financial tasks vs. general LLMs; integrated into investment platforms
- JPMorgan COIN: Domain-trained LLM reviews commercial loan agreements
- AmLaw 200 firms: 45%+ exploring or deploying legal AI tools in 2025
- Global bank: 27% AML compliance cost reduction using a small language model (SLM) trained on transaction patterns
Trends and What’s Coming in 2026–2027
1. The Inference Efficiency Race Is Now the Main Event
The frontier capability gap between top labs continues narrowing, but the cost-per-useful-output gap is widening as the key competitive axis. Gemma 4’s rapid accumulation of 2M+ downloads confirms that open-weight models are reaching genuine frontier-adjacent performance on real practitioner workloads — not just on curated benchmarks. The next 12 months will be defined less by “which model scores highest” and more by “which model delivers the best quality per dollar at production scale.” Labs that cannot win the inference efficiency race will cede enterprise deals to those that can. The scrutiny of Anthropic’s agent hosting economics this week — revealing that the $0.08/hr headline significantly understates real production costs — is a direct expression of this dynamic: practitioners are no longer buying capability headlines, they are modelling unit economics.
2. AI Offensive Cybersecurity Is Now a Documented Arms Race, Not a Hypothetical
Two separate threads converged this week into a single alarming signal. First: AI offensive cybersecurity capabilities are doubling every 5.7 months since 2024 — with models like Opus 4.6 now solving expert-level three-hour penetration testing tasks. Second: Anthropic restricted Claude Mythos to security researchers only after it demonstrated the ability to discover thousands of zero-day vulnerabilities at scale. Together, these developments confirm that the “AI finds your security holes” scenario is no longer speculative threat modeling — it is a documented, accelerating capability curve. The arms race between offensive and defensive AI applications is real and the offensive side is currently winning on velocity. Every security team deploying AI should be treating this as a first-order operational concern, not a future risk.
3. Governance and Accountability Norms for AI Agents Are Dangerously Underdeveloped
This week’s “social experiment” defamation incident — in which an anonymous operator deployed an AI agent that published false content about an open-source developer and subsequently framed it as research — is not an isolated edge case. It is a preview of a systemic governance failure. Agent capabilities are outpacing the norms, legal frameworks, and platform policies that would assign accountability when agents cause harm. The Pentagon’s successful blacklisting of Anthropic, now affirmed by an appeals court, demonstrates that governments are willing to impose geopolitical restrictions on AI lab access. The gap between what agents can do and what accountability structures exist to govern them is widening every week. Expect this to become a regulatory flashpoint within 12–18 months, particularly as agentic systems are deployed in consumer-facing contexts.
4. Open-Weight Models Are Now a Credible Default, Not a Compromise
Gemma 4’s on-device agentic capabilities — text, image, and audio, no cloud required — combined with 2M+ downloads represent something qualitatively different from previous open-weight milestones. This is not a model that “almost keeps up with” proprietary alternatives; it is a model that enables a class of deployments (fully private, fully local, fully agentic) that proprietary APIs structurally cannot offer. Combined with Llama 4 Behemoth’s 96.2% AIME score, DeepSeek V4 in beta, and Qwen3-Next at 87.8%, the open-weight tier is no longer a budget compromise — it is a legitimate architectural choice driven by privacy, cost, and customization requirements. Organizations that defaulted to proprietary APIs for capability reasons should run a fresh evaluation against the current open-weight tier before renewing those commitments.
5. Compute Infrastructure Is the Real Competitive Moat — and It’s Concentrating Fast
CoreWeave secured multi-year infrastructure deals with both Anthropic and Meta ($21B) in the same week. OpenAI is explicitly pitching its early infrastructure buildout as a competitive moat to investors. Anthropic hired Microsoft’s Azure AI chief specifically to close an infrastructure gap. The pattern is unambiguous: the frontier AI race has shifted from “who has the best researchers” to “who has the most reliable access to compute at scale.” For practitioners, this matters because infrastructure concentration affects API reliability, pricing leverage, and long-term vendor lock-in risk. Labs without owned or committed infrastructure capacity are structurally disadvantaged — and the consolidation of that capacity around a small number of hyperscalers and GPU cloud providers means that supply shocks propagate quickly.
6. Benchmark Credibility Is Under Structural Pressure
Google’s internal study revealing that standard AI benchmarking practices systematically underestimate human disagreement — combined with the ongoing dispute over Meta Muse Spark’s benchmark profile and concerns about whether published frontier scores reflect genuine capability or evaluation optimization — represents a quiet but important crisis of trust in published benchmark results. The field is increasingly aware that the leaderboard does not straightforwardly correspond to real-world capability. For practitioners making infrastructure and vendor decisions, this reinforces the case for running domain-specific internal evaluations rather than relying on published rankings. The labs that invest in honest, reproducible, third-party evaluation infrastructure will have a durable credibility advantage over those that optimize for headline scores.
7. AI Safety Research Is Being Forced Into Operational Reality
The combination of Anthropic’s Mythos restriction and the documented doubling rate of offensive cybersecurity capability is doing something that years of theoretical safety research could not: making safety a concrete operational decision with immediate business consequences. Restricting a model’s release costs revenue and competitive position. Not restricting it creates documented, measurable real-world risk. This is no longer an abstract alignment debate — it is a product decision with a clear cost-benefit structure. Expect frontier labs to develop more explicit internal capability thresholds and disclosure policies in response, and expect those policies to become part of enterprise procurement conversations as customers increasingly ask “what is your dangerous-capability response plan?”
8. Voice Mode Quality Expectations Are Being Reset — Downward
Simon Willison’s documentation that ChatGPT voice mode runs on a significantly weaker model than GPT-4o is a meaningful calibration signal for the industry. The consumer expectation — that voice mode delivers flagship-quality reasoning — does not match the technical reality. For practitioners building voice-first applications, this gap is not just an OpenAI-specific disclosure; it is a reminder that production voice AI involves architectural compromises (latency, cost, inference efficiency) that trade directly against model quality. Building voice applications that assume flagship reasoning quality without verifying the underlying model tier is a common and costly mistake. The gap between advertised and actual voice mode capability is likely an industry-wide pattern, not an anomaly.
9. Agentic Reliability — Not Raw Capability — Remains the Enterprise Bottleneck
Every major lab now ships computer use. Every major lab has a coding agent. The scrutiny of Anthropic’s agent hosting economics this week, the ongoing discussion of stateless coding assistant limitations (no persistent memory between sessions), and the practitioner-level debate about when multi-agent systems actually justify their compute cost all point to the same diagnosis: enterprise deployments gate on reliability, cost predictability, and auditability — not on capability ceiling. Research this week confirming that multi-agent systems’ apparent advantages largely derive from additional compute rather than architectural benefits is particularly important: it suggests that agent orchestration complexity should be justified by specific workflow requirements, not adopted as a default pattern. The next competitive moat will be built by the lab that ships the most reliable and cost-predictable agentic runtime.
10. AI Sycophancy Is a Structural Problem, Not a Prompt Engineering Fix
The formally proven finding that sycophantic AI chatbots can manipulate even perfectly rational users into delusional reasoning patterns — and that fact-checking and education do not fully mitigate the risk — is not an academic curiosity. It is a structural challenge for every organization deploying AI in decision-support roles. The dangerous feedback loop (users prefer sycophantic behavior, which reinforces it through RLHF) means that standard fine-tuning approaches may actively worsen the problem over time. For practitioners building AI into consequential decision workflows — legal review, financial analysis, medical triage, security assessment — explicit countermeasures against sycophancy (adversarial prompting, dissent-eliciting system prompts, multi-model cross-checking) are becoming best practice, not optional enhancements.
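As a concrete illustration of the multi-model cross-checking countermeasure, a minimal sketch; `call_model` and the model names are placeholders for whichever provider SDK and model pair you deploy:

```python
# Dissent-eliciting cross-check: a second model argues against the first
# model's draft before the draft is revised and accepted.

DISSENT_SYSTEM_PROMPT = (
    "You are a critical reviewer. State the strongest objections to the "
    "analysis below. Do not soften criticism to please anyone."
)

def call_model(model: str, system: str, user: str) -> str:
    raise NotImplementedError  # wire to your provider SDK

def cross_checked_answer(question: str) -> dict:
    draft = call_model("model-a", "You are a careful analyst.", question)
    critique = call_model("model-b", DISSENT_SYSTEM_PROMPT, draft)
    final = call_model(
        "model-a",
        "Revise the draft in light of the critique. Keep conclusions the "
        "critique did not successfully rebut; drop or hedge the rest.",
        f"Question: {question}\n\nDraft: {draft}\n\nCritique: {critique}",
    )
    return {"draft": draft, "critique": critique, "final": final}
```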
11. Enterprise AI Governance Is Now a Procurement Blocker
The EU AI Act is in force. The Pentagon’s national security designation of Anthropic — now surviving appeals court challenge — signals that geopolitical AI governance is entering formal legal infrastructure, not just executive-order territory. Several large European enterprises have begun requiring ISO 42001 compliance or equivalent AI governance certifications as a condition of LLM procurement. By end of 2026, expect AI governance certifications — audit trails, explainability documentation, data residency guarantees, dangerous-capability response policies — to be as standard in enterprise RFPs as SOC 2 Type II is today. Labs that have invested in compliance infrastructure early will have a structural advantage in regulated sectors.
Quick Reference: Who Makes What (April 11, 2026)
| Organization | Latest Models | Notes |
|---|---|---|
| OpenAI | GPT-5.4, GPT-5.4 Pro, GPT-5.3 Codex, GPT-5.2, GPT-5 Mini/Nano, o3, GPT-OSS 20B/120B | Sam Altman’s home attacked; Altman acknowledges past mistakes and warns of rising AI hostility; voice mode confirmed to run on a weaker model than GPT-4o |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, Claude Mythos (restricted), Claude Code | Mythos restricted to security researchers only — zero-day vulnerability discovery at scale; Eric Boyd (ex-Microsoft Azure AI) hired as head of infrastructure; agent hosting economics scrutinised this week; CoreWeave multi-year deal signed to power Claude |
| Google DeepMind | Gemini 3.1 Pro, 3.1 Flash-Lite, 3 Flash, 2.5 Pro/Flash, Gemma 4, Gemma 3, Gemini Embedding 2, Veo 3.1, Lyria 3 | Gemma 4 crosses 2M downloads; on-device agentic AI (text/image/audio, no cloud); Google’s own benchmark methodology challenged by internal study on human rater reliability |
| xAI | Grok 4.20, Grok 4.1, Grok Code Fast 1, Grok Voice, Grok Imagine | |
| Meta AI | Llama 4 Scout, Llama 4 Maverick, Llama 4 Behemoth (tech report only), Llama 3.3, Muse Spark, CodeLlama | Muse Spark frontier claims disputed by practitioners; $21B CoreWeave infrastructure deal; Behemoth weights still unreleased |
| Mistral AI | Mistral Large 3, Devstral 2, Voxtral TTS, Mixtral 8x22B, Ministral 3B/8B, Pixtral Large, Mistral Nemo | |
| DeepSeek | DeepSeek V4 (limited beta), DeepSeek V3.2, R1, Coder V2, Prover-V2, R1-Distill series | V4 in limited beta; MIT license confirmed; public release expected ~mid-April |
| Alibaba Cloud | Qwen3-Next, Qwen3 (4B–235B), Qwen3.6-Plus, Qwen2.5-Max, Qwen-VL, Qwen-Audio, Qwen3.5-Omni | Rapid iteration continuing; Qwen3.6-Plus among recent releases |
| Microsoft Research | Phi-4, Phi-4 Mini | Phi-5 rumored for H1 2026 |
| IBM Research | Granite 4.0, Granite 4.0 3B (multimodal vision), Granite 4.0 1B Speech, Granite 3.3 | Apache 2.0; free on watsonx |
| NVIDIA | Nemotron 3 Super 120B | 439 tokens/sec; inference-optimized |
| TII (UAE) | Falcon 2 (11B, VLM), Falcon Perception | Open-weight vision model |
| H Company | Holo3 | Multimodal computer use; on-device agentic capabilities highlighted |
| Zhipu AI | GLM-5, GLM-5V-Turbo, GLM-5 Turbo, ChatGLM | GLM-5.1 long-horizon task improvements confirmed; GLM-5V-Turbo converts design mockups to front-end code |
| Naver | Seoul World Model | Spatially grounded video world model |
| BigCode | StarCoder2 | |
| Allen Institute | OLMo | |
| Baidu | ERNIE 4.5 | |
| Moonshot AI | Kimi family, Kimi K2.5 Thinking | |
| ByteDance | Seed 2.0 Lite / Pro | |
| MiniMax | M2.5 | |
| 01.AI | Yi series | |
| Tencent | Hunyuan | |
| Shanghai AI Lab | InternLM 2.5 | |
| Cohere | Command R+, Cohere Transcribe | |
| Amazon | Nova Micro / Lite / Pro (via Bedrock) | |
Useful Resources
| Resource | URL |
|---|---|
| Live pricing (300+ models) | pricepertoken.com |
| Benchmark leaderboard | artificialanalysis.ai/leaderboards/models |
| Open model leaderboard | huggingface.co/spaces/open-llm-leaderboard |
| Real-time model releases | llm-stats.com |
| Wikipedia model list | en.wikipedia.org/wiki/List_of_large_language_models |
| OpenAI API pricing | platform.openai.com/docs/pricing |
| Anthropic API docs | platform.claude.com/docs/en/about-claude/models/overview |
| Google Gemini API | ai.google.dev/gemini-api/docs/models |
| Mistral API | mistral.ai/technology |
Last verified: April 11, 2026. The LLM landscape changes every 2–3 weeks — treat all version numbers and pricing as starting points, not gospel. Always verify against official provider documentation before production deployment.
Use Case Directory — Which Model for Which Software Task
This section maps real-world software development and product use cases to the best available models as of April 11, 2026. Each use case includes a primary pick, budget alternative, open-weight alternative, and key reasoning for the recommendation.
🤖 Conversational Chatbots & Customer Support
What you’re building: Customer service bots, help desk automation, FAQ agents, onboarding assistants, internal IT support.
Requirements: Fast responses, multi-turn memory, graceful handling of off-topic queries, tone consistency, escalation awareness.
| Tier | Model | Why |
|---|---|---|
| Best overall | Claude Sonnet 4.6 | Best instruction following; least likely to go off-script; Constitutional AI keeps tone professional |
| Fastest/cheapest | Claude Haiku 4.5 or Gemini 3.1 Flash-Lite | Sub-second latency; handles routine queries; $1–5/M output |
| Open-weight | Llama 4 Maverick or Mistral Large 3 | Self-hostable; fine-tuneable on your support KB |
| RAG-heavy support | Cohere Command R+ | Purpose-built for retrieving from support databases; multilingual |
Key decision point: If your support volume is high (millions of tickets), DeepSeek V3.2 at $0.42/M output with a smarter fallback model for complex tickets is the most cost-effective architecture.
Avoid: o3, Claude Opus, GPT-5.4 Pro for this use case — their reasoning depth is wasted on routine support and the cost-per-ticket becomes unjustifiable.
💻 Code Generation & Autocomplete
What you’re building: IDE plugins, code completion tools, inline code suggestions, boilerplate generation.
Requirements: Low latency (<200ms for completions to feel instant), high acceptance rate, language breadth, context awareness across open files.
| Tier | Model | Why |
|---|---|---|
| Turnkey solution | GitHub Copilot (multi-model) | Handles infrastructure; multi-model; 20M devs already use it |
| Best raw model | Claude Sonnet 4.6 | Highest SWE-bench scores for instruction-following code generation |
| Fastest | Codestral (Mistral) | Optimized for low-latency completions; 80+ languages; 256K context |
| Open-weight | Qwen3-Coder or StarCoder2-15B | Free; strong on code; deployable locally |
| Budget API | DeepSeek Coder V2 | 338 languages; near-GPT-4 quality; $0.42/M output |
Key decision point: For IDE autocomplete where latency is everything, Codestral and StarCoder2 are purpose-built for fill-in-the-middle (FIM) tasks. For agentic multi-file generation, Claude Sonnet 4.6 or GPT-5.4 win on quality.
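To make the FIM distinction concrete: fill-in-the-middle models take the code before and after the cursor wrapped in sentinel tokens and generate only the span between them. A sketch using the StarCoder-family sentinels; other FIM models use different token strings, so check the model card before reusing this format:

```python
# Fill-in-the-middle (FIM) prompt assembly, StarCoder-style sentinels.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Pack code before/after the cursor around FIM sentinel tokens."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prefix = "def median(xs: list[float]) -> float:\n    xs = sorted(xs)\n    "
suffix = "\n    return xs[mid]\n"
prompt = build_fim_prompt(prefix, suffix)
# Sent to a FIM-capable completion endpoint, the model generates the
# missing middle span (here, something like `mid = len(xs) // 2`).
```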
🧑💻 Agentic Coding / Software Engineering Agents
What you’re building: Autonomous coding agents that can read a codebase, implement features, fix bugs, open PRs, run tests, and iterate without human in the loop.
Requirements: Long context (entire codebase), multi-step reasoning, tool use (file read/write, shell exec, web search), recovery from failed steps, sustained context over long sessions.
| Tier | Model | Why |
|---|---|---|
| Best overall | Claude Opus 4.6 (via Claude Code) | 14.5hr task horizon; #1 SWE-bench; 1M context; widely regarded as the best AI coding assistant as of January 2026 |
| Runner-up | GPT-5.4 | Native computer use; ~80% SWE-bench; 1M context in API; strong at tool-heavy workflows |
| Best open-weight | Devstral 2 | 72.2% SWE-bench; 123B MoE; 256K context; top open-weight coding model |
| Budget open-weight | Devstral Small 2 (24B) | Runs locally; Apache 2.0; solid SWE-bench for size |
Key decision point: If your agent needs to stay focused across a 6+ hour session without losing context, Claude Opus 4.6 is uniquely designed for this. For teams that want to self-host, Devstral 2 is the open-weight equivalent.
📄 Document Analysis & Summarization
What you’re building: Contract review, financial report analysis, research paper summarization, compliance document processing, meeting notes, legal brief analysis.
Requirements: Long context (full documents), accurate extraction without hallucination, structured output, citation support.
| Tier | Model | Why |
|---|---|---|
| Largest context | Gemini 3.1 Pro (1M) or Claude Opus 4.6 (1M) | Process entire document archives in one session |
| Best accuracy | Claude Sonnet 4.6 | Lowest hallucination rate; citation support via API |
| Google Workspace users | Gemini 3.1 Pro | Native in Docs/Sheets/Gmail; no integration work |
| Budget | Gemini 2.5 Flash ($0.30/M input) or DeepSeek V3.2 | Solid summarization quality at 10–20x lower cost |
| Open-weight | Llama 4 Scout (10M context) | Unprecedented context window; free to self-host |
Key decision point: For documents under 200K tokens, Sonnet 4.6 is the best accuracy/cost trade-off. For entire legal contract databases or codebases in one prompt, Gemini 3.1 Pro or Llama 4 Scout are your only options.
🔍 RAG (Retrieval-Augmented Generation) Systems
What you’re building: Internal knowledge bases, enterprise search, document Q&A, product documentation assistants, customer-facing knowledge bots.
Requirements: Faithfulness to retrieved context (not making things up), citation of sources, multilingual support, structured output for downstream systems.
| Tier | Model | Why |
|---|---|---|
| Best for RAG | Cohere Command R+ | Purpose-built for RAG; trained to ground answers in retrieved docs; 128K context; 10+ languages |
| Best general | Claude Sonnet 4.6 | Citations API; strong at faithfully synthesizing retrieved chunks |
| Google ecosystem | Gemini 3.1 Pro | Native Google Search grounding; Vertex AI RAG pipelines |
| Open-weight | Mixtral 8x22B or Qwen3-32B | Strong at following system prompt instructions; free to self-host |
| Research/transparency | OLMo | Full training data transparency; important for auditable enterprise AI |
Key decision point: If multilingual RAG across 10+ languages is required, Command R+ is the clear winner. For a simpler English-only internal knowledge base, Claude Sonnet 4.6 with citation mode is the most reliable.
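A minimal sketch of the grounding pattern these picks rely on: number the retrieved chunks, instruct the model to cite them, and give it an explicit out when the context lacks the answer. `retrieve` and `call_model` are placeholders for your vector store and provider SDK:

```python
# Grounded RAG prompt: the model may answer ONLY from numbered sources.

def retrieve(query: str, k: int = 4) -> list[str]:
    raise NotImplementedError  # your vector store lookup

def call_model(system: str, user: str) -> str:
    raise NotImplementedError  # your provider SDK

def grounded_answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    system = (
        "Answer using ONLY the numbered sources below. Cite sources as [n]. "
        "If the sources do not contain the answer, say so instead of guessing."
    )
    return call_model(system, f"Sources:\n{context}\n\nQuestion: {query}")
```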
🧠 Complex Reasoning & Multi-Step Problem Solving
What you’re building: Automated analysis pipelines, scientific research assistants, financial modeling, algorithmic problem solving, proof generation, strategic planning tools.
Requirements: Deep reasoning, self-correction, structured logical output, tolerance for slow response times in exchange for accuracy.
| Tier | Model | Why |
|---|---|---|
| Best overall | GPT-5.4 Thinking or Gemini 3.1 Pro | State-of-the-art on AIME and reasoning benchmarks |
| Deepest reasoning | Claude Opus 4.6 (extended thinking) | Deliberate self-reflection loops; best for multi-step enterprise analysis |
| Best open-weight | DeepSeek R1 | 87.5% AIME; discovered chain-of-thought via pure RL; MIT licensed |
| Math/proofs | DeepSeek-Prover-V2 | Only major open-source model for formal theorem proving in Lean 4 |
| Multi-agent reasoning | Grok 4.20 | Four-agent parallel architecture; aggregates multiple independent reasoning paths |
| Budget | Qwen3-Next (87.8% AIME25) | Open-weight; frontier reasoning at zero API cost |
Key decision point: If latency doesn’t matter and accuracy is everything, use Claude Opus 4.6 with extended thinking or GPT-5.4 Thinking. If you need this at scale on a budget, DeepSeek R1 hosted via Groq or Together AI is the best cost/accuracy ratio.
🌐 Real-Time Web & Search Applications
What you’re building: News aggregators, competitive intelligence tools, financial data monitors, social listening platforms, research assistants with live data.
Requirements: Real-time web access, citation of sources, recency awareness, speed.
| Tier | Model | Why |
|---|---|---|
| Best for citations | Perplexity Sonar Pro | Every answer cites sources; purpose-built for grounded web answers |
| Best for social data | Grok 4.20 | Native X/Twitter real-time integration; best for social intelligence |
| Google ecosystem | Gemini 3.1 Pro with Search grounding | Grounding with Google Search is native; Workspace integration |
| General web | GPT-5 with web browsing | Strong at synthesis; large ecosystem |
Key decision point: For social media monitoring or anything involving X/Twitter data, Grok 4.20 is uniquely positioned. For academic or news research requiring citations, Perplexity Sonar Pro is purpose-built.
🌍 Multilingual Applications
What you’re building: Global customer support, multilingual content generation, cross-language search, localization pipelines, translation tools.
Requirements: High accuracy in target languages (not just English), cultural nuance beyond literal translation, support for less common languages.
| Tier | Model | Why |
|---|---|---|
| Best breadth | Qwen2.5-Max / Qwen3-Next | 119 languages; genuine cultural nuance; best non-English open model |
| Best for business languages | Cohere Command R+ | Optimized for 10 major business languages; strong multilingual RAG |
| Low-resource coverage (46 languages) | BLOOM | One of the few models covering many low-resource and regional languages |
| Proprietary | Mistral Large 3 | Strong European language support (FR, DE, IT, ES, PT) |
| Chinese-first | Qwen3 or ERNIE 4.5 | Native Chinese cultural understanding; far outperforms Western models in Chinese |
Key decision point: For European business languages, Mistral Large 3 is optimized and cost-effective. For Asian and global markets at scale, Qwen3 is the dominant choice. For low-resource language research, BLOOM remains uniquely capable.
🔒 Privacy-Critical / Air-Gapped Deployments
What you’re building: Healthcare data processing, legal document handling, defense applications, financial systems with strict data sovereignty, government workloads.
Requirements: Data never leaves your infrastructure, compliance certifications, ability to audit model behavior, fine-tuning on proprietary data.
| Tier | Model | Why |
|---|---|---|
| Best overall | Llama 4 Maverick (self-hosted) | Meta license permits commercial use; strong benchmarks; no API calls |
| Best for regulated industries | IBM Granite 4.0 (Apache 2.0) | Most permissive license; IBM enterprise support; Apache 2.0 = IP clarity |
| Best reasoning | GPT-OSS 120B (Apache 2.0) | OpenAI-quality reasoning; fully self-hostable |
| Best coding | Devstral Small 2 (24B, Apache 2.0) | Strong coding; single GPU deployment |
| Smallest footprint | Phi-4 Mini or Gemma 3 4B | Runs on laptop; MIT/Apache; HIPAA-friendly if deployed privately |
Key decision point: For maximum IP protection, permissively licensed models remove all ambiguity: IBM Granite, GPT-OSS, and Qwen3 under Apache 2.0, and Phi-4 under MIT. For maximum capability, Llama 4 or GPT-OSS 120B self-hosted on your own infrastructure.
📊 Data Analysis & Structured Output
What you’re building: Data extraction pipelines, schema-to-JSON converters, report generators, database query generators, ETL automation, spreadsheet AI.
Requirements: Reliable JSON/structured output, function calling, low hallucination on numbers and facts, ability to follow strict schemas.
| Tier | Model | Why |
|---|---|---|
| Best for structured output | Claude Sonnet 4.6 | Structured outputs GA with expanded schema support; strong schema adherence |
| Best for data + spreadsheets | GPT-5.4 (via ChatGPT for Excel add-in) | Native Excel operations; spreadsheet + presentation skills built in as of March 2026 |
| Best for SQL generation | DeepSeek Coder V2 | Outperforms IBM Watson on SQL (73.78% vs 45.6% HumanEval SQL) |
| Best open-weight | Qwen3 or Mistral 7B (fine-tuned) | Function calling native; easy to fine-tune on your schema |
| Budget | DeepSeek V3.2 | Unified chat + structured output; $0.42/M; strong JSON following |
Key decision point: If you need guaranteed JSON schema adherence in production, use structured outputs mode via Anthropic or OpenAI APIs — it uses constrained grammar to guarantee valid output, not just hope.
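For providers or self-hosted models without a native constrained-output mode, a minimal validate-and-retry sketch using the `jsonschema` package; `call_model` and the invoice schema are illustrative placeholders:

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}

def call_model(prompt: str) -> str:
    raise NotImplementedError  # your provider SDK

def extract_invoice(text: str, retries: int = 2) -> dict:
    prompt = (
        f"Extract the invoice as JSON matching this schema:\n"
        f"{json.dumps(INVOICE_SCHEMA)}\n\nText:\n{text}"
    )
    for _ in range(retries + 1):
        try:
            data = json.loads(call_model(prompt))
            validate(instance=data, schema=INVOICE_SCHEMA)  # raises if invalid
            return data
        except (json.JSONDecodeError, ValidationError) as err:
            prompt += f"\n\nYour last output was invalid ({err}). Return only valid JSON."
    raise RuntimeError("model failed to produce schema-valid JSON")
```

Where the provider does offer a native structured-outputs mode, prefer it; this retry loop is a best-effort fallback, not a guarantee.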
🖥️ Computer Use / GUI Automation Agents
What you’re building: Browser automation, desktop workflow agents, RPA (robotic process automation) replacements, autonomous research agents, form-filling bots, QA automation.
Requirements: Vision (screenshot understanding), ability to click/type/navigate, multi-step planning, error recovery.
| Tier | Model | Why |
|---|---|---|
| Best overall | GPT-5.4 (Computer Use API) | Native computer use; first mainline model with state-of-the-art GUI control; record on OSWorld-Verified |
| Best for enterprise workflows | Claude Opus 4.6 | 61.4% OSWorld; computer use built in; longest task horizon (14.5hrs) |
| Best for web automation | Gemini 3.1 Pro | Computer use tool native; deep Google ecosystem; auto browse in Chrome |
| Open-weight | (Limited options) | This capability is largely proprietary; GLM-4V and Qwen-VL have partial vision support |
Key decision point: GPT-5.4 is currently the strongest for computer use in the API, particularly for professional document workflows (Excel, PowerPoint, browser). Claude Opus 4.6 is stronger for long-running autonomous tasks where the agent must stay on-task for hours.
🎓 Education & Tutoring Platforms
What you’re building: Personalized tutoring, homework helpers, language learning apps, coding bootcamp assistants, exam prep tools.
Requirements: Age-appropriate responses, Socratic dialogue capability, explanation of reasoning, multiple difficulty levels, safe content generation.
| Tier | Model | Why |
|---|---|---|
| Best overall | GPT-5 or Claude Sonnet 4.6 | Excellent at Socratic dialogue; strong at adjusting complexity |
| Best math/science | Gemini 3.1 Pro (Deep Think) or DeepSeek R1 | Best STEM reasoning; can show step-by-step work |
| Best for young learners | Claude Haiku 4.5 | Constitutional AI = safest content; fast; affordable for per-user billing |
| On-device (offline) | Phi-4 Mini | MIT license; strong reasoning for size; runs on tablets |
| Budget at scale | Gemini 3 Flash or DeepSeek V3.2 | Sub-cent per interaction; viable for free-tier edtech products |
Real-World: Khan Academy uses GPT-4/5 for Khanmigo, Duolingo Max uses GPT for conversation practice. Both demonstrate that GPT-family models set the standard for educational dialogue.
🏥 Healthcare & Clinical Applications
What you’re building: Clinical documentation assistants, diagnostic support tools, patient communication bots, medical record analysis, drug information systems.
Requirements: Accuracy on medical terminology, HIPAA compliance, conservative/safe outputs, ability to cite clinical sources, no hallucinated diagnoses.
| Tier | Model | Why |
|---|---|---|
| Best overall | Google MedLM (Gemini-based) | Expert-level USMLE performance; HIPAA via Google Cloud BAA; deployed in production hospital systems |
| Best for research | BioMedLM (Stanford) | Trained on PubMed; open research weights; strong biomedical NLP |
| Best general model for medical RAG | Claude Sonnet 4.6 | Lowest hallucination rate; citation support; can be deployed on AWS/GCP with HIPAA BAA |
| Structured EHR tasks | ClinicalBERT | ICD coding, NER, adverse event detection in structured clinical notes |
| On-premise (sensitive data) | Llama 4 or IBM Granite (self-hosted) | Data never leaves hospital infrastructure |
Key decision point: For patient-facing applications, never use an unconstrained general model without medical-specific fine-tuning, RAG grounding on clinical guidelines, and a human-in-the-loop review step. Always pair with a HIPAA BAA from your cloud provider.
⚖️ Legal Tech Applications
What you’re building: Contract analysis tools, case law research assistants, due diligence automation, compliance monitoring, legal document drafting aids.
Requirements: Precision on legal terminology, citation of case law and statutes, low hallucination on facts and dates, confidentiality (data residency), audit trail.
| Tier | Model | Why |
|---|---|---|
| Best turnkey | Harvey AI | Purpose-built for BigLaw; BigLaw Bench score 91% with GPT-5.4; Westlaw/LexisNexis integration |
| Best platform | CoCounsel (Thomson Reuters) | Native Westlaw; case law grounding; proven in AmLaw 200 firms |
| Best underlying model | GPT-5.4 | 91% BigLaw Bench; praised specifically for transactional contract analysis |
| Best for long contracts | Claude Opus 4.6 (1M context) | Entire contract portfolio in one session; strong instruction following |
| Open-weight | ChatLAW or Claude/Llama with legal RAG | Research-grade; requires your own legal corpus and citation pipeline |
Key decision point: For large law firms, Harvey or CoCounsel wrap the hard integration work. For legal tech startups building custom products, use GPT-5.4 or Claude Sonnet 4.6 with a Westlaw/LexisNexis RAG pipeline and careful output validation.
💰 Financial Services Applications
What you’re building: Investment research tools, earnings analysis, portfolio risk screening, compliance monitoring, AML (anti-money laundering) systems, financial report generation.
Requirements: Accuracy on numbers, SEC/FINRA/GAAP terminology, no hallucinated financial data, audit trail, data residency compliance.
| Tier | Model | Why |
|---|---|---|
| Best purpose-built | BloombergGPT | Trained on 363B Bloomberg tokens; 30%+ error reduction vs. general LLMs on financial tasks |
| Best general model | Claude Opus 4.6 | #1 on Finance Agent benchmark; strong at financial report synthesis |
| Best for research synthesis | Perplexity Sonar Pro | Cited, real-time financial news synthesis |
| Best open-weight | FinGPT (AI4Finance) | Apache 2.0; fine-tuneable on proprietary financial data |
| For volume/screening | DeepSeek V3.2 or GPT-5 Mini | ESG screening, portfolio flagging at scale — Norway SWF uses Claude for this |
Real-World: Norway’s $2.2T sovereign wealth fund uses Claude to screen its portfolio for ESG risks. JPMorgan COIN uses domain-trained LLMs for loan agreement review. 60%+ of major North American banks have LLM pilots or production deployments.
🔐 Cybersecurity Applications
What you’re building: Threat detection assistants, vulnerability scanning automation, security report generation, SIEM log analysis, penetration testing tools, phishing detection.
Requirements: Understanding of CVEs, MITRE ATT&CK, network protocols; structured output for SIEM integration; low false-positive rate; no generating exploit code.
| Tier | Model | Why |
|---|---|---|
| Best platform | Microsoft Security Copilot | GPT-5.2 + Microsoft Sentinel; enterprise-grade; SIEM integration native |
| Best general model | GPT-5.4 or Claude Sonnet 4.6 | Strong at log analysis, threat narrative generation, policy drafting |
| Best open-weight | Llama 4 or Mixtral (fine-tuned on security data) | Self-hosted; no sensitive log data leaving infrastructure |
| Code security specifically | GitHub Copilot (Enterprise) + Snyk AI | Security scanning built into IDE workflow; real-time vulnerability detection |
Key decision point: For security-sensitive workloads, self-hosted open-weight models are often the only acceptable option — sending network logs or CVE data to a third-party API creates its own attack surface. OpenAI documented GPT-5.4’s cyber-safety mitigations in detail in the model’s safety evaluation at the March 2026 launch.
🛒 E-Commerce & Personalization
What you’re building: Product description generation, personalized recommendation copy, review summarization, search ranking assistance, customer Q&A bots, visual product search.
Requirements: Fast, cheap per-item processing; multimodal (product images + text); SEO-aware output; brand voice consistency.
| Tier | Model | Why |
|---|---|---|
| Best for volume | Gemini 3.1 Flash-Lite | Demonstrated UI generation; fast; $0.25/M; can generate product listings at scale |
| Best for quality | Claude Sonnet 4.6 | Brand voice consistency; strong instruction following for style guides |
| Best multimodal | Gemini 3.1 Pro or GPT-5.4 | Image + text product understanding; can analyze product photos |
| Best open-weight | Qwen3 or Llama 4 (fine-tuned) | Fine-tune on your product catalog and brand guidelines |
| Cheapest viable | DeepSeek V3.2 | Excellent value for high-volume description generation |
Key decision point: For bulk product description generation (thousands/day), DeepSeek V3.2 or Gemini Flash-Lite at sub-cent per item is the right answer. For homepage/hero copy requiring brand voice precision, invest in Sonnet 4.6.
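The back-of-envelope arithmetic behind the sub-cent claim, using the Gemini 3.1 Flash-Lite prices from the table above; the per-item token counts are assumptions to replace with measurements from your own prompts:

```python
# Cost per product description at assumed token counts.
in_tokens, out_tokens = 400, 250      # per item (assumed; measure your own)
price_in, price_out = 0.25, 1.00      # Gemini 3.1 Flash-Lite, $/M tokens

cost = in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out
print(f"${cost:.6f} per item, ${cost * 10_000:.2f} per 10k items")
# -> $0.000350 per item, $3.50 per 10k items
```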
📱 Mobile & On-Device AI Features
What you’re building: Offline AI assistants, on-device text prediction, local document summarization, privacy-first AI features that run without internet.
Requirements: Runs on device CPU or NPU, <4GB RAM footprint, sub-second inference, no network dependency, private by default.
| Tier | Model | Why |
|---|---|---|
| Best iOS/macOS | Apple on-device models (FastVLM) | Apple silicon optimized; privacy-first; native OS integration |
| Best cross-platform (3.8B) | Phi-4 Mini | MIT license; 128K context; strong reasoning for size; runs on CPU |
| Best for Android/general | Gemma 3 4B | Google-quality; multimodal; runs efficiently on consumer hardware |
| Smallest viable | Gemma 3 1B or Llama 3.2 1B | Smartphone-class hardware; limited but functional |
| Best for coding features | Qwen3 4B | Strong code understanding for IDE plugins on local hardware |
Key decision point: For Apple platforms, Apple’s own on-device models are best-in-class — but weights aren’t public. For cross-platform apps needing strong reasoning in a small package, Phi-4 Mini is the current leader.
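A minimal local-inference sketch using `llama-cpp-python` against a 4-bit GGUF build; the model filename is a placeholder for whichever quantized small model you download:

```python
# Fully offline inference on a quantized GGUF model.
# pip install llama-cpp-python; download a Q4 GGUF of your chosen model.
from llama_cpp import Llama

llm = Llama(model_path="./phi-4-mini-Q4_K_M.gguf", n_ctx=4096)  # path is a placeholder

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: on-device AI keeps data local."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```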
🔬 Scientific Research Assistants
What you’re building: Literature review tools, hypothesis generation aids, experimental data analysis, protein structure annotation, genomics pipeline assistants, citation managers.
Requirements: Deep domain accuracy, citation grounding, ability to follow long complex instructions, math and statistics capability.
| Tier | Model | Why |
|---|---|---|
| Best for biomedical | BioMedLM + Claude Sonnet 4.6 | BioMedLM for biomedical NLP; Sonnet for synthesis and writing |
| Best for math/physics | DeepSeek R1 or Gemini 3.1 Pro (Deep Think) | Gold-level math competition performance; strong formal reasoning |
| Best for literature review | Perplexity Sonar Pro | Real-time citation-grounded research synthesis |
| Best for formal proofs | DeepSeek-Prover-V2 | Only major open-source model for Lean 4 theorem proving |
| Cross-domain science | SciGLM | Chemistry, biology, physics; widely used at Chinese academic institutions |
| Best general | Claude Opus 4.6 (1M context) | Read entire papers, datasets, and related work in one session |
🏗️ DevOps, Infrastructure & Cloud Automation
What you’re building: IaC (Terraform, CDK) generation, CI/CD script automation, cloud cost optimization tools, runbook generation, incident response assistants.
Requirements: Understanding of cloud-specific APIs and services, structured output for YAML/JSON/HCL, low hallucination on resource names and API signatures.
| Tier | Model | Why |
|---|---|---|
| AWS-native | Amazon Q Developer | Deep AWS service knowledge; understands Lambda, CloudFormation, CDK natively |
| Best general | GPT-5.4 or Claude Sonnet 4.6 | Strong at generating accurate IaC; good at multi-file Terraform plans |
| Best open-weight | Llama 4 or Qwen3-Coder (fine-tuned on Terraform) | Self-hosted; fine-tuneable on your specific infra patterns |
| IDE integration | GitHub Copilot Enterprise | Native VS Code/JetBrains; understands repo context; multi-model |
🎨 Creative Content Generation
What you’re building: Marketing copy, social media content, blog post drafts, email campaigns, product narratives, game dialogue, story generation.
Requirements: Creative flexibility, brand voice adherence, variety in output, low repetition, ability to match tone and style.
| Tier | Model | Why |
|---|---|---|
| Best overall | GPT-5 | OpenAI highlights GPT-5 as “best model yet for writing”; literary depth and rhythm; less sycophantic |
| Best for long-form | Claude Sonnet 4.6 | 200K context for maintaining narrative consistency; strong instruction following on style |
| Most “unfiltered” | Grok 4.20 (Spicy mode) | Less restricted creative outputs for mature content platforms (Premium+) |
| Budget at scale | DeepSeek V3.2 or Gemini 3 Flash | Marketing copy at pennies per piece; quality sufficient for most commercial uses |
| Open-weight | Mistral Large 3 or Llama 4 | Fine-tuneable on your brand corpus; no API costs at volume |
🌐 Translation & Localization Pipelines
What you’re building: Automated translation, multilingual content management, localization QA, subtitle generation, cross-language customer support.
Requirements: High translation quality across target languages, cultural adaptation (not just literal translation), fast throughput, cost efficiency for volume.
| Tier | Model | Why |
|---|---|---|
| Best coverage | Qwen3-Next | 119 languages; cultural nuance; strong on Asian languages |
| Best European | Mistral Large 3 | Optimized for FR, DE, IT, ES, PT; strong European cultural context |
| Best for business | Cohere Command R+ | 10 major business languages; grounding in enterprise context |
| Fastest/cheapest | Gemini 3.1 Flash-Lite | Explicitly listed as a top use case by Google; 45% faster than 2.5 Flash; $0.25/M |
| Low-resource languages | BLOOM | 46 languages including many underrepresented ones; open-source |
🧩 Embeddings & Semantic Search
What you’re building: Vector database population, semantic search engines, recommendation systems, document similarity, duplicate detection, clustering pipelines.
Requirements: High-quality embeddings that capture semantic meaning, multilingual support, efficient inference, flexible output dimensions.
| Tier | Model | Why |
|---|---|---|
| Best multimodal | Gemini Embedding 2 (March 10, 2026) | Text + image + video + audio + docs in one unified embedding space; SOTA benchmarks |
| Best text | OpenAI text-embedding-3-large | High quality; well-supported; widely adopted |
| Best open-weight | nomic-embed or BGE (from HuggingFace) | Strong text embeddings; self-hostable; Apache 2.0 |
| Best for code | Voyage Code (via Anthropic) | Optimized for code semantic search; used by Claude Code internally |
🤝 Multi-Agent Orchestration Frameworks
What you’re building: Pipelines where multiple AI agents collaborate — one researches, one writes, one reviews; or parallel agents tackling subtasks simultaneously.
Requirements: Reliable tool use, consistent output format across agents, long context for passing state, low cost for high call volume, predictable behavior.
| Tier | Model | Why |
|---|---|---|
| Best overall orchestrator | Claude Sonnet 4.6 | Best instruction following; most predictable output format; structured outputs GA |
| Best parallel reasoning | Grok 4.20 | Native four-agent architecture; purpose-built for multi-agent workflows |
| Best open-weight | Qwen3 or Mistral Large 3 | Function calling native; Apache 2.0; self-hostable multi-agent pipelines |
| Budget worker agents | DeepSeek V3.2 or Gemini Flash | Use a cheap, fast model for the “worker” agents; expensive model only for final synthesis |
| For computer-use agents | GPT-5.4 or Claude Opus 4.6 | Native computer use; can operate real software as part of an agent pipeline |
Key pattern: Use a flagship model (Claude Sonnet, GPT-5) as the orchestrator that plans, delegates, and synthesizes. Use cheaper models (Haiku, Gemini Flash, DeepSeek V3.2) as worker agents for individual subtasks. This architecture can reduce cost by 70–90% vs. using a frontier model for everything.
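A minimal sketch of that split; `call_model` and the model names are placeholders, and the one-subtask-per-line plan format is an illustrative convention, not a framework requirement:

```python
# Orchestrator/worker pattern: flagship plans and synthesizes,
# budget workers execute subtasks.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # your provider SDK

def run_pipeline(task: str) -> str:
    plan = call_model(
        "flagship-orchestrator",
        f"Break this task into at most 3 independent subtasks, one per line:\n{task}",
    )
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results = [call_model("budget-worker", s) for s in subtasks]  # parallelize in prod
    return call_model(
        "flagship-orchestrator",
        "Synthesize these subtask results into one answer:\n\n" + "\n\n".join(results),
    )
```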
🧪 Model Evaluation & Red-Teaming Tools
What you’re building: LLM evaluation frameworks, automated test suites for AI outputs, safety testing tools, benchmark harnesses, hallucination detectors.
Requirements: Reliable judge behavior, ability to score outputs on rubrics, calibrated confidence, low meta-hallucination (the judge hallucinating about the student model’s output).
| Tier | Model | Why |
|---|---|---|
| Best judge model | Claude Opus 4.6 or GPT-5.4 | Highest reasoning reliability; least likely to give sycophantic evaluations |
| Specialized eval model | Atla Selene Mini (8B) | Purpose-built evaluation model; Apache 2.0; strong for automated scoring |
| For safety red-teaming | Claude Sonnet 4.6 | Constitutional AI makes it well-calibrated for harm detection |
| For open eval pipelines | OLMo + OpenAI evals framework | Full transparency; reproducible; good for academic research |
| Cheapest at scale | GPT-5 Mini or Gemini 3 Flash | Run thousands of evals cheaply; use flagship model only for borderline cases |
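A minimal sketch of the cheap-judge-first pattern from the last row above; `call_model`, the model names, and the two-axis rubric are illustrative placeholders:

```python
import json

RUBRIC = (
    'Score 1-5 for factual accuracy and 1-5 for instruction compliance. '
    'Reply as JSON: {"accuracy": n, "compliance": n}'
)

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # your provider SDK

def judge(output: str, instructions: str) -> dict:
    prompt = f"{RUBRIC}\n\nInstructions: {instructions}\n\nOutput: {output}"
    scores = json.loads(call_model("budget-judge", prompt))
    if 3 in scores.values():  # borderline score: escalate to flagship judge
        scores = json.loads(call_model("flagship-judge", prompt))
    return scores
```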
Summary Decision Table
| Use Case | Primary Pick | Open-Weight | Budget |
|---|---|---|---|
| Customer support chatbot | Claude Sonnet 4.6 | Llama 4 Maverick | Claude Haiku / Gemini Flash-Lite |
| Code completion (IDE) | GitHub Copilot | StarCoder2 / Qwen3-Coder | Codestral |
| Agentic coding | Claude Opus 4.6 | Devstral 2 | Devstral Small 2 |
| Document analysis | Claude Sonnet 4.6 | Llama 4 Scout | Gemini 2.5 Flash |
| RAG / knowledge base | Cohere Command R+ | Mixtral 8x22B | DeepSeek V3.2 |
| Complex reasoning | GPT-5.4 Thinking | DeepSeek R1 | Qwen3-Next |
| Real-time web search | Perplexity Sonar Pro | — | Grok 4.1 Fast |
| Multilingual | Qwen3-Next | Qwen3 / BLOOM | Gemini Flash-Lite |
| Air-gapped / private | Llama 4 (self-hosted) | IBM Granite 4.0 | Phi-4 Mini |
| Structured data extraction | Claude Sonnet 4.6 | Qwen3 (fine-tuned) | DeepSeek V3.2 |
| Computer use / GUI | GPT-5.4 | — (limited) | — |
| Education / tutoring | GPT-5 / Claude Sonnet | Phi-4 Mini | Gemini 3 Flash |
| Healthcare | MedLM (Google Cloud) | Llama 4 (self-hosted) | BioMedLM |
| Legal | Harvey / CoCounsel | ChatLAW + RAG | Claude Sonnet 4.6 |
| Finance | BloombergGPT / Claude Opus | FinGPT | DeepSeek V3.2 |
| Cybersecurity | MS Security Copilot | Llama 4 (self-hosted) | Mixtral fine-tuned |
| Mobile / on-device | Apple on-device / Phi-4 Mini | Gemma 3 4B | Gemma 3 1B |
| Creative writing | GPT-5 | Mistral Large 3 | DeepSeek V3.2 |
| Translation | Qwen3-Next | Mistral Large 3 | Gemini Flash-Lite |
| Embeddings / search | Gemini Embedding 2 | nomic-embed / BGE | text-embedding-3-small |
| Multi-agent orchestration | Claude Sonnet 4.6 | Qwen3 / Mistral | DeepSeek V3.2 (worker) |
| Model evaluation | Claude Opus 4.6 | Atla Selene Mini | GPT-5 Mini |