19 editions

Deep Dives

Series

May 2026

  1. Cost Optimization for LLM Applications

  2. API Rate Limits Compared: Every Major LLM Provider (May 2026)

    API rate limits for every major LLM provider — May 2026. Side-by-side tables for OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Cerebras, SambaNova, and more.

  3. LLM Token Costs and Efficiency: A Practitioner's Guide (May 2026)

    LLM token costs across 15+ providers: per-token pricing, caching mechanics, batch discounts, model routing, and cost optimization for May 2026.

  4. LLM Observability in Production

  5. CI/CD for AI Applications

  6. Structured Output from LLMs

April 2026

  1. Building Reliable RAG Systems

  2. API Rate Limits Compared: Every Major LLM Provider (April 2026)

    Side-by-side rate limit comparison across 17 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Cerebras, SambaNova, Perplexity, Alibaba, Moonshot, and more — as of April 2026.

  3. LLM Token Costs and Efficiency: A Practitioner's Guide (April 2026)

    Beyond the pricing page. How to actually think about LLM costs: per-token pricing across 15+ providers, hidden multipliers, caching mechanics, batch discounts, model routing architectures, and what 'cost per useful output' means in production.

  4. AI Agent Orchestration Patterns

  5. How Vector Databases Actually Work

  6. Create collection with scalar quantization

  7. Local LLM on a $550 AMD Mini PC: 28B Models at 20 tok/s

    AMD 780M iGPU + 64GB DDR5 runs Gemma 4 28B at 19.5 tok/s. Setup guide, benchmarks, and cost breakdown vs. Mac Mini for local LLM inference under $600.

March 2026

  1. Embeddings in Practice: Every Major Model Compared

  2. omlx: Run Local LLMs on Apple Silicon with a RAG Customer Support App

    omlx: macOS-native LLM server for Apple Silicon with SSD KV caching that cuts cold-start prefill from 90s to under 5s. Complete RAG customer support chatbot tutorial included.

  3. Prompt Injection Prevention in Production

    Taxonomy of prompt injection attacks and the layered defenses — input validation, output filtering, guardrails — that actually work at scale.

  4. The Inference Stack Top to Bottom

    What happens between your API call and a streamed token — routing, batching, KV cache, quantization, and speculative decoding explained.

  5. MCP, Tool Use, and Function Calling: How Agents Actually Work in 2026

    A comprehensive rundown of function calling, Model Context Protocol, agent frameworks, and the patterns that actually work in production — across every major provider.

  6. API Rate Limits Compared: Every Major LLM Provider in One Place

    A side-by-side comparison of rate limits across 15 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Perplexity, Alibaba, Moonshot, and more — as of March 2026.