The Stack — Granola — Stochastic Sandbox

Granola is an AI-powered meeting notetaker that writes structured summaries from raw audio — not from a transcript you had to clean up first.

What It Is

Granola runs as a Mac desktop app that captures system audio from any meeting platform — Zoom, Google Meet, Teams, whatever — without requiring a bot to join the call. It produces structured notes in a user-editable template after the meeting ends. It’s used heavily by founders, product managers, and operators who are in back-to-back calls and can’t afford to manually clean up transcripts.

The Architecture

Granola’s most distinctive infrastructure decision is its local audio capture pipeline. Rather than routing audio through a server-side bot (the approach taken by Otter, Fireflies, and most competitors), Granola taps system audio directly on the user’s Mac using CoreAudio. This means the raw audio lives on the device until it’s ready to be processed. The audio is then sent post-meeting for transcription — Granola has publicly confirmed they use Whisper for transcription, though the specific deployment model (OpenAI’s hosted API vs. a self-hosted variant) has not been publicly disclosed.

For the summarization layer, Granola passes the raw transcript into a frontier model to generate structured notes. The specific model powering this is not publicly disclosed, but given their latency profile and cost constraints at the per-user level, it likely sits somewhere in the balanced-tier range of current offerings — something with strong instruction-following and long-context handling rather than the absolute frontier. The notes are generated post-meeting rather than streaming in real-time, which gives them architectural flexibility: they can batch the transcription and summarization steps without needing sub-second response times.

Granola’s infrastructure for note templates is worth examining. Users can create and modify templates (e.g., “action items + decisions + context”), and the model appears to apply those templates at inference time rather than fine-tuning separate models per template. This is almost certainly a prompting-layer concern — the template likely gets injected as a structured system prompt — which keeps the architecture simple and allows users to iterate on their own schemas without engineering intervention.

Caching strategy here is naturally limited: every meeting produces a unique transcript, so there’s minimal opportunity for semantic caching at the summarization layer. Where they likely do optimize is on the transcription side — Whisper processing can be parallelized across audio chunks before reassembly, reducing total turnaround time. Granola’s stated goal is notes ready within a couple of minutes of the meeting ending, which is achievable with this kind of chunked processing pipeline.

The Smart Decision

The decision to skip the bot entirely is the cleverest architectural call Granola made, and it’s doing more work than it first appears.

The bot-joining model (used by Otter, Fireflies, and many others) introduces a long chain of friction: the meeting organizer has to admit the bot, attendees can see it in the participant list, some enterprise IT policies block unknown attendees, and the whole thing requires a server-side stream to stay alive for the duration of the call. Granola sidesteps every single one of those failure modes by treating audio capture as a local concern. CoreAudio on macOS gives them access to all system audio — no permissions negotiation with the meeting platform, no bot admittance, no attendee visibility.

This decision has a meaningful second-order effect: it makes Granola platform-agnostic by default. They don’t need API integrations with Zoom, Google Meet, or Teams to capture audio from them. A new meeting platform launches tomorrow — Granola already works with it. Competitors built on bot infrastructure have to maintain per-platform integrations and renegotiate as those platforms change their bot policies. Granola’s capture layer is completely decoupled from the meeting platform layer.

The Tradeoff

The local audio capture model creates a real constraint: Granola only works on Mac. System audio capture at the OS level is handled very differently on Windows and Android, and CoreAudio is a macOS-specific API. Building equivalent functionality on other platforms isn’t a trivial port — it requires platform-specific audio subsystem work on each target OS.

This means Granola has effectively self-selected into the Mac-heavy demographic (startup operators, designers, PMs at tech companies) and is structurally blocked from enterprise deals where Windows fleets are the norm. It’s a meaningful ceiling. Competitors running server-side bots are platform-agnostic in the opposite direction — they work on any device because they’re not running locally at all. Granola traded cross-platform reach for a fundamentally better per-session experience, and that tradeoff is written into the architecture, not just the roadmap.

What You Can Steal

Capture locally, process remotely. If your product involves user-generated media (audio, screen recordings, documents), consider whether local capture can eliminate an entire class of integration dependencies before you commit to a server-side pipeline.
Post-hoc generation beats real-time streaming when latency budgets allow. Granola doesn’t stream notes during the meeting — it generates them after. This simplifies the architecture considerably and lets you use more capable models without fighting token-streaming UX constraints.
Templates as runtime prompts, not fine-tuned models. If your users need structured output customization, inject their schema into the system prompt at inference time. You get user personalization without model proliferation.
Pick your audience and let the architecture reflect it. Granola made a Mac-only call that also made them deeply good for a specific user. Trying to be platform-agnostic from day one often means being mediocre everywhere. Sometimes the right infra decision looks like a product limitation.
Whisper + frontier model summarization is a composable default stack for any audio-first use case. Transcription and summarization are separable concerns — keep them that way so you can swap either layer independently as models improve.