How We Defend Against Prompt Injection at Brightwave
Our customers use Brightwave to tear through hundreds of documents during a deal, build models from company data, and assemble reports and pitch decks. To do this well, our agents need real capabilities: shell access, internet connectivity, API credentials for external services like CRMs and data providers. They run in sandboxed Linux containers.
This architecture is powerful for the same reasons it is dangerous. An agent that can call APIs, run shell commands, and reach the internet can do a lot of damage if it starts following an attacker's instructions instead of yours. That technique, getting an agent to follow malicious instructions embedded in the content it reads, is called prompt injection, and no amount of system prompt hardening will reliably prevent it. Giskard researchers showed that a single malicious email could trick an OpenClaw agent into leaking credentials and internal files. Johann Rehberger demonstrated the same class of attack against Devin, GitHub Copilot, Google Jules, Claude Code, and others. This isn't specific to any one product. It's a structural problem with any agent that processes untrusted content and has real capabilities.
We approach this in two layers: limit the agent's exposure to malicious content, and bound the damage when it gets through.
Reducing attack surface
The biggest lever is controlling what the agent sees in the first place.
Our agents have an email inbox, but we whitelist which addresses can send to it. Only the user's own addresses get through. The platform supports composable skills, but they are all written within the platform. We don't have a third-party skill marketplace. There is no public endpoint where a stranger can talk to your agent.
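The inbox gate reduces to a membership check at delivery time. A minimal sketch, assuming a set populated from the user's verified addresses (the names here are ours, not Brightwave's):

```python
# Hypothetical sketch of the inbound-sender allowlist.
# ALLOWED_SENDERS would be populated from the user's own verified addresses.
ALLOWED_SENDERS = {"analyst@example.com", "analyst@personal-example.com"}

def accept_inbound(message_from: str) -> bool:
    """Deliver only email sent from one of the user's own addresses."""
    return message_from.strip().lower() in ALLOWED_SENDERS
```

Anything a stranger sends simply never reaches the agent's context window.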
These policies cut the attack surface dramatically. But they don't eliminate it. Agents still process content from web searches, documents the user uploads, and emails the user forwards. A poisoned document in a deal room, a compromised web page in a search result, a forwarded email with hidden instructions buried in white text. The untrusted content still gets in.
So we need infrastructure-level defenses that work even when the model itself is compromised.
Bounding the blast radius
The core idea is simple: keep real credentials out of the sandbox, and control what traffic leaves it.
The proxy
All of our sandboxed agents run behind a MITM (man-in-the-middle) proxy, a separate process running under its own user identity. All outbound traffic routes through it, which gives us two things: a centralized point to apply egress rules, and a place to keep real credentials out of the agent's hands.
Real API keys never enter the sandbox. When a customer connects an external service, the platform generates a dummy key for the integration and records which real credential it maps to. The agent uses the dummy key like a normal API key. The proxy intercepts the request, looks up the real credential, swaps it in, and forwards it. If an attacker exfiltrates the dummy key, they have nothing useful. And from the agent's perspective, it's just making normal API calls with normal keys. This keeps the model focused on the actual task rather than managing proxy mechanics.
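The swap itself is a small piece of logic at the proxy. A condensed sketch under our own assumptions (the mapping structure and key formats are hypothetical, not Brightwave's actual scheme):

```python
# Hypothetical sketch of the dummy-key swap at the proxy.
# The mapping lives outside the sandbox; the agent only ever sees dummy keys.
DUMMY_TO_REAL = {
    "bw_dummy_7f3a": "sk_live_real_secret",  # recorded when the user connects a service
}

def rewrite_auth_header(headers: dict) -> dict:
    """Replace a dummy bearer token with the real credential before forwarding."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    real = DUMMY_TO_REAL.get(token)
    if real is None:
        # Unknown credential: never forward, never guess.
        raise PermissionError("unrecognized credential; request dropped")
    return {**headers, "Authorization": f"{scheme} {real}"}
```

Because the lookup fails closed, a stolen or fabricated key yields a dropped request rather than a degraded one.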
Why cooperative proxying isn't enough
The natural way to route traffic through a proxy is the HTTPS_PROXY environment variable. Set it in the sandbox, and most HTTP clients will automatically send their connections through your proxy: Python's requests, curl, and our own Rust HTTP clients all honor it. (Not every client does by default; Node's built-in fetch, for one, needs explicit proxy configuration.)
This covers most outbound traffic. But it's cooperative. An agent acting on a malicious prompt can bypass it by unsetting the environment variable, opening raw TCP sockets, or using a library that ignores the proxy setting entirely.
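The gap is easy to demonstrate with the standard library: well-behaved clients discover the proxy from the environment, and nothing else does. A sketch (the proxy address is a placeholder):

```python
import os
import urllib.request

# Well-behaved clients discover the proxy from the environment...
os.environ["HTTPS_PROXY"] = "http://egress-proxy:8080"
assert urllib.request.getproxies_environment().get("https") == "http://egress-proxy:8080"

# ...but cooperation is optional. The variable can simply be unset:
os.environ.pop("HTTPS_PROXY")
os.environ.pop("https_proxy", None)  # clear any lowercase variant too
assert urllib.request.getproxies_environment().get("https") is None

# And raw sockets never consult it at all:
# socket.create_connection(("attacker.example", 443)) would connect directly,
# which is why enforcement has to live below the application layer.
```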
We needed enforcement that doesn't depend on the agent's cooperation at all.
Network-level enforcement
The fix is to enforce it at the network layer, below the agent entirely. We use iptables rules applied by an init container before the agent starts. Only the proxy process, identified by its user ID, is allowed to make outbound connections on ports 443 and 80. Everything else is dropped at the network level. The agent can't modify these rules because it doesn't have the required kernel capability, and it can't bypass them because they operate below the application layer. The proxy is the only way out.
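As a rough illustration of the shape of those rules (the UID, ports, and exact ruleset here are our assumptions, not Brightwave's production config), an init container could key egress on the proxy's UID via iptables' owner match:

```python
# Hypothetical sketch of UID-keyed egress rules an init container could apply.
# PROXY_UID is the user the proxy process runs as; everything else is dropped.
PROXY_UID = 2000

def egress_rules(proxy_uid: int) -> list[str]:
    """Build iptables commands: allow the proxy's outbound 443/80, drop the rest."""
    rules = [
        f"iptables -A OUTPUT -p tcp --dport {port} "
        f"-m owner --uid-owner {proxy_uid} -j ACCEPT"
        for port in (443, 80)
    ]
    rules.append("iptables -A OUTPUT -o lo -j ACCEPT")  # keep loopback working
    rules.append("iptables -A OUTPUT -j DROP")          # default-deny everything else
    return rules
```

The init container needs CAP_NET_ADMIN to install these; the agent's container runs without it, so the rules are immutable from inside.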
Egress control
Now that the proxy is the only way out of the sandbox, we can decide what gets through.
The first layer is restricting which domains the agent can reach. If a domain isn't in the list of services the customer has connected, the proxy blocks the request. An agent tricked into sending data to an attacker-controlled server gets nowhere.
But restricting domains alone isn't enough. Say the agent is working with sensitive documents from a deal room, and the user has connected their Gmail account with read-only access so the agent can triage their inbox. An attacker who compromises the agent can't send mail as the user — the token doesn't allow it. But they could try a different route: have the agent authenticate to gmail.com using the attacker's own token, and send the documents from the attacker's account. The request is going to gmail.com, a domain the agent is allowed to reach, so domain allowlisting lets it through. This is why the proxy also checks that every request carries the token it issued for this sandbox. A request to gmail.com authenticated as anyone other than the user gets dropped. The agent can only interact with gmail.com as the user, within the scopes the user granted.
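Put together, the proxy's per-request decision comes down to two checks: is the destination a service the customer connected, and is the credential one this sandbox was issued? A condensed sketch with hypothetical names:

```python
# Hypothetical sketch of the proxy's egress decision for one request.
CONNECTED_DOMAINS = {"gmail.com", "api.crm-example.com"}  # services the user connected
ISSUED_TOKENS = {"bw_dummy_7f3a"}                         # dummy keys minted for this sandbox

def allow_request(host: str, bearer_token: str) -> bool:
    """Allow only connected domains, and only with a token this proxy issued."""
    if host not in CONNECTED_DOMAINS:
        return False  # blocks exfiltration to attacker-controlled servers
    if bearer_token not in ISSUED_TOKENS:
        return False  # blocks the attacker's-own-token trick on an allowed domain
    return True
```

The second check is what closes the Gmail scenario above: the domain is allowed, but the attacker's token was never issued by this proxy, so the request dies.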
Closing
The principle underpinning this architecture is that security properties should be enforced by infrastructure, not by the model's judgment. A compromised agent hits the same walls as a well-behaved one. It can only reach the domains the user has allowed, using credentials the user has scoped. That's what lets us give agents real capabilities without asking customers to trust the model to behave. The architecture makes the power safe to grant.