Office Hours — How do you know what data your AI coding agent is sending to the cloud, and what should you audit for?

You probably don’t. Most developers running Cursor Agent, GitHub Copilot with GPT-5.4, or Claude Code have never checked what leaves their machine. That matters because the data sent to API servers can include your source code, file paths, git history, environment variables you didn’t realize were in scope, and sometimes secrets baked into configuration files. Unless you have configured otherwise, assume that anything the agent can read in your working context may be sent to the provider.

The Data Audit Problem

Start by understanding what each tool sends. GitHub Copilot, for instance, sends your current file, surrounding context, and recent edits to GitHub’s (Microsoft-operated) infrastructure. Claude Code sends the session transcript along with the contents of files it reads. Cursor Agent with GPT-5.4 follows a similar pattern. None of these tools gives you a packet-level view of what leaves by default. You can inspect the traffic yourself with mitmproxy or Charles Proxy if you’re comfortable intercepting HTTPS, which requires installing and trusting a custom CA certificate on your machine, but that’s friction most teams skip.

The real issue: you can’t audit what you can’t see. Exclusion controls such as ignore files or content-exclusion policies exist in some tools, but they vary by product and plan and are easy to leave unconfigured, so don’t assume one is protecting you. Beyond that, the agent sends whatever the model asks to work with. If the model requests your git log to understand context, it goes. If it reads environment configs to understand your stack, those travel too.

What Actually Leaves Your Machine

When you use a coding agent on a private codebase, assume the agent sends:

Your entire active file and surrounding context (usually the whole file, sometimes adjacent files the model pulls in for context)
Directory structure and file paths (revealing project architecture and naming conventions)
Recent git commits and diffs (your development history and decision-making process)
Function signatures and imports from dependencies (information about your tech stack and internal APIs)
Error messages and stack traces from failed runs (potentially revealing infrastructure details, internal service names, or deployed configurations)
Test output and logs (which often contain business logic, API endpoints, or database structure hints)

What probably doesn’t leave: binaries, compiled artifacts, or truly large files that exceed context windows. But everything else is fair game.

Practical Audit Steps

Run your coding agent on a test repository first. Intercept its HTTPS traffic with a tool like Charles Proxy (this works only if you can install a trusted CA certificate on the machine), or use your IDE’s network logging if it has any. Cursor Agent and Copilot both have telemetry settings, but “telemetry” usually means diagnostic data, not code data. Those are different buckets, so disabling telemetry generally does not change what code is sent.
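
One way to make the test-repository run measurable is to plant unique marker strings ("canaries") in files you expect the agent to ignore, capture the session, and then search the outbound request bodies for them. The sketch below assumes the capture was exported in HAR format (Charles can export HAR; check your proxy’s export options). The host names and canary values are hypothetical, and the search only works on text bodies, so compressed or binary payloads will not show hits.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_har(har: dict, canaries: list[str]) -> dict:
    """Summarize outbound request bodies in a parsed HAR capture.

    Returns the number of body bytes sent to each host and, for every
    canary string, the sorted list of hosts whose request bodies contained it.
    """
    bytes_per_host = defaultdict(int)
    canary_hits = {canary: set() for canary in canaries}

    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        host = urlsplit(request.get("url", "")).hostname or "unknown"
        # postData is absent for requests without a body (for example GET).
        body = (request.get("postData") or {}).get("text") or ""
        bytes_per_host[host] += len(body.encode("utf-8"))
        for canary in canaries:
            if canary in body:
                canary_hits[canary].add(host)

    return {
        "bytes_per_host": dict(bytes_per_host),
        "canary_hits": {c: sorted(hosts) for c, hosts in canary_hits.items()},
    }


if __name__ == "__main__":
    # Hypothetical capture with a made-up agent endpoint.
    sample = {
        "log": {
            "entries": [
                {
                    "request": {
                        "url": "https://api.example-agent.test/v1/chat",
                        "postData": {"text": "file: .env CANARY_7f3a"},
                    }
                }
            ]
        }
    }
    print(json.dumps(summarize_har(sample, ["CANARY_7f3a"]), indent=2))
```

Use plain alphanumeric canaries so they survive JSON escaping inside the request body. For a real capture, load the export with json.load and pass the resulting dict to summarize_har.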

For GitHub Copilot specifically, check the Copilot settings in your editor and in your GitHub account for a data-use toggle such as “Copilot: Use External Sources” (described as enabled by default; setting names and defaults change between versions, so confirm against the current documentation). Disable it if you’re worried about contributing training data, but note that it doesn’t change what is sent during the current session. The privacy documentation says Copilot doesn’t train on private code by default on paid accounts, but your code still passes through GitHub’s servers for processing.

Claude Code and Cursor Agent work the same way on the core point: session data has to be sent for inference, and no setting changes that. What settings and contracts can change is retention and training use. Cursor has a Privacy Mode setting for this; check its current documentation for exactly what it covers. If you’re on Claude Code’s enterprise plan (pricing and availability shift), you can request a Data Processing Addendum that restricts how Anthropic handles your code.

What You Should Audit For

Focus on these categories:

Secrets and credentials: Environment variable names, API keys, database connection strings, AWS account IDs embedded in error messages. Grep your code for SECRET, KEY, PASS, TOKEN patterns and understand what the agent might pull in.
Business logic patterns: Your domain-specific algorithms, pricing logic, authentication schemes. These have competitive value; depending on the provider’s retention and training terms, leaking them to frontier labs could eventually make your approaches part of a training distribution.
Infrastructure details: Service names, internal API contracts, deployment configurations. These are reconnaissance data for someone trying to understand your system.
Data schemas and structures: Columns in your database queries, field names in your data models. These reveal what data you’re tracking.
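
The grep for SECRET, KEY, PASS, and TOKEN patterns can be sketched as a small scanner. This is a rough heuristic under stated assumptions, not a replacement for a dedicated secret scanner: the 8-character minimum value length, the 1 MB file size cap, and the function names are all choices made for illustration, and it will produce false positives (a variable named token_count, for example). It reports only line numbers and variable names so the report itself does not repeat the secret values.

```python
import re
from pathlib import Path

# Assignment of a non-trivial value to a credential-like name,
# for example API_KEY = "abc12345" or db_password: hunter2hunter2
ASSIGN_PATTERN = re.compile(
    r"""
    \b([A-Z0-9_]*(?:SECRET|KEY|PASS|TOKEN)[A-Z0-9_]*)  # credential-like name
    \s*[:=]\s*                                         # equals sign or colon
    ["']?([^\s"']{8,})["']?                            # value of 8+ characters
    """,
    re.IGNORECASE | re.VERBOSE,
)


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, variable_name) for each suspicious assignment."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ASSIGN_PATTERN.finditer(line):
            findings.append((lineno, match.group(1)))
    return findings


def scan_tree(root: Path, max_bytes: int = 1_000_000) -> dict[str, list[tuple[int, str]]]:
    """Scan every readable text file under root, skipping .git and large files."""
    results = {}
    for path in root.rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            if path.stat().st_size > max_bytes:
                continue
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        hits = scan_text(text)
        if hits:
            results[path.relative_to(root).as_posix()] = hits
    return results


if __name__ == "__main__":
    for file_name, hits in scan_tree(Path(".")).items():
        for lineno, name in hits:
            print(f"{file_name}:{lineno}: {name}")
```

Run it from the repository root before pointing an agent at the code, and treat every hit as a file the agent should not be able to read.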

A concrete example: you’re debugging a payment processing service. The agent pulls in recent error logs to understand the issue. Those logs contain transaction amounts, customer IDs, payment processor responses, and your internal fee calculations. The agent sends all of that to the API server to reason about the problem. At that point both your customers’ transaction data and your fee logic sit on the provider’s infrastructure.
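
For a case like this, one mitigation is to redact sensitive fields from logs before they land anywhere the agent can read. The sketch below assumes logs written as key=value pairs; the field names are hypothetical and need to match whatever your own logs contain.

```python
import re

# Hypothetical field names; replace with the ones your logs actually use.
SENSITIVE_FIELDS = ("customer_id", "amount", "card_last4", "fee")

# Matches field=value where the value runs up to the next whitespace.
FIELD_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SENSITIVE_FIELDS)) + r")=(\S+)"
)


def redact_line(line: str) -> str:
    """Replace the value of every sensitive key=value pair with a placeholder."""
    return FIELD_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", line)


if __name__ == "__main__":
    raw = "ERROR charge failed customer_id=C123 amount=49.99 processor=declined"
    print(redact_line(raw))
    # ERROR charge failed customer_id=[REDACTED] amount=[REDACTED] processor=declined
```

The error type and processor response stay intact, which is usually what the agent needs to reason about the failure. Values containing spaces, and structured formats such as JSON logs, would need a different parser.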

Practical Safeguards

If you can’t avoid using cloud-based coding agents on sensitive code:

Run agents on sanitized versions of your codebase. Create a mirror repository with real structure but fake/anonymized business logic. Use it for agent-assisted refactoring and architectural work.
Set strict file exclusions in your agent’s configuration if the tool supports it (most don’t, but some do). Exclude .env files, config directories, and test fixtures that contain production data.
Use agents for structural or algorithmic work (refactoring, test writing, boilerplate), not for domain-specific logic (payment systems, authentication, data processing pipelines).
Run self-hosted or open-source agents on sensitive code. Devstral 2 (72.2% on SWE-bench) or Llama 4 Scout running locally gives you code generation without cloud transmission, though with less capability.
Treat agent sessions like pull requests: review what the agent generated before shipping it. This catches both security issues and unintended data exposure in the generated code itself.
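
The mirror-repository and file-exclusion ideas above can be combined in a short script that copies a repository while skipping sensitive paths. This is a minimal sketch: the pattern list is an assumption to adapt to your own layout, the function names are made up for illustration, and it only leaves files out. It does not anonymize business logic inside the files it copies.

```python
import fnmatch
import shutil
from pathlib import Path

# Assumed exclusion patterns; adjust to your repository layout.
# Patterns with a slash match from the repository root only.
EXCLUDE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "config/*", "fixtures/*", ".git/*"]


def is_excluded(rel_path: str, patterns: list[str] = EXCLUDE_PATTERNS) -> bool:
    """True if the POSIX-style relative path, or its file name, matches a pattern."""
    name = rel_path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatch(rel_path, pattern) or fnmatch.fnmatch(name, pattern)
        for pattern in patterns
    )


def build_mirror(src: Path, dst: Path, patterns: list[str] = EXCLUDE_PATTERNS) -> list[str]:
    """Copy non-excluded files from src to dst and return the copied relative paths.

    dst should live outside src, otherwise the copy would be picked up
    by the directory walk.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src).as_posix()
        if is_excluded(rel, patterns):
            continue
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel)
    return copied


if __name__ == "__main__":
    files = build_mirror(Path("."), Path("../repo-mirror"))
    print(f"copied {len(files)} files")
```

Point the agent at the mirror directory rather than the original checkout, and review the returned list once to confirm nothing sensitive slipped through.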

Cost and Token Implications

This audit also surfaces a cost that most budgets hide. Agent requests typically resend a large share of the working context, which balloons token usage. If you’re running an agent on a 10,000-line codebase (very roughly 100,000 tokens at about 10 tokens per line) and it makes 20 requests that each resend the full context, you’re paying for about 2 million input tokens in a single session. Some teams have hit six-figure bills this way. Knowing what data leaves helps you understand where those tokens are going and whether selective context could cut costs 40-60% without losing capability.
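
A back-of-the-envelope version of this estimate follows. Note that lines are not tokens: the sketch assumes roughly 10 tokens per line of code, which varies by language and tokenizer, and it assumes every request resends the full context with no caching. The price used in the example is hypothetical, not any provider’s actual rate.

```python
def estimate_context_tokens(
    lines: int,
    requests: int,
    tokens_per_line: float = 10.0,   # assumed average; measure your own code
    context_fraction: float = 1.0,   # share of the codebase sent per request
) -> int:
    """Estimate input tokens for a session that resends context on every request."""
    return round(lines * tokens_per_line * context_fraction * requests)


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given price per million tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    full = estimate_context_tokens(lines=10_000, requests=20)
    half = estimate_context_tokens(lines=10_000, requests=20, context_fraction=0.5)
    price = 3.0  # hypothetical USD per million input tokens
    print(full, estimate_cost_usd(full, price))   # 2000000 6.0
    print(half, estimate_cost_usd(half, price))   # 1000000 3.0
```

Under these assumptions, halving the context sent per request halves the input cost, so the per-session figure matters mostly once it is multiplied across developers and sessions.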

Bottom line: Assume everything your coding agent can see gets sent to cloud infrastructure. Audit what your active working context contains before running an agent on it, and either sanitize the codebase or use local agents for sensitive systems.

Question via Hacker News