Deep Dives
Series
May 2026
- Cost Optimization for LLM Applications
- API Rate Limits Compared: Every Major LLM Provider (May 2026)
API rate limits for every major LLM provider — May 2026. Side-by-side tables for OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Cerebras, SambaNova, and more.
- LLM Token Costs and Efficiency: A Practitioner's Guide (May 2026)
LLM token costs across 15+ providers: per-token pricing, caching mechanics, batch discounts, model routing, and cost optimization for May 2026.
- LLM Observability in Production
- CI/CD for AI Applications
- Structured Output from LLMs
April 2026
- Building Reliable RAG Systems
- API Rate Limits Compared: Every Major LLM Provider (April 2026)
Side-by-side rate limit comparison across 17 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Cerebras, SambaNova, Perplexity, Alibaba, Moonshot, and more — as of April 2026.
- LLM Token Costs and Efficiency: A Practitioner's Guide (April 2026)
Beyond the pricing page. How to actually think about LLM costs: per-token pricing across 15+ providers, hidden multipliers, caching mechanics, batch discounts, model routing architectures, and what 'cost per useful output' means in production.
- AI Agent Orchestration Patterns
- How Vector Databases Actually Work
- Local LLM on a $550 AMD Mini PC: 28B Models at 20 tok/s
AMD 780M iGPU + 64GB DDR5 runs Gemma 4 28B at 19.5 tok/s. Setup guide, benchmarks, and cost breakdown vs. Mac Mini for local LLM inference under $600.
March 2026
- Embeddings in Practice: Every Major Model Compared
- omlx: Run Local LLMs on Apple Silicon with a RAG Customer Support App
omlx: macOS-native LLM server for Apple Silicon with SSD KV caching that cuts cold-start prefill from 90s to under 5s. Complete RAG customer support chatbot tutorial included.
- Prompt Injection Prevention in Production
Taxonomy of prompt injection attacks and the layered defenses — input validation, output filtering, guardrails — that actually work at scale.
- The Inference Stack Top to Bottom
What happens between your API call and a streamed token — routing, batching, KV cache, quantization, and speculative decoding explained.
- MCP, Tool Use, and Function Calling: How Agents Actually Work in 2026
A comprehensive rundown of function calling, Model Context Protocol, agent frameworks, and the patterns that actually work in production — across every major provider.
- API Rate Limits Compared: Every Major LLM Provider in One Place
A side-by-side comparison of rate limits across 15 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Perplexity, Alibaba, Moonshot, and more — as of March 2026.