Your Agents Are Running on Bare Metal. That Should Terrify You.
I’ve spent months building a layered enforcement architecture for AI agents: instructions, hooks, gates. Three layers of defense designed to make agents structurally incapable of shipping untested code. 247 commits, 100% test coverage, zero rollbacks.
But there’s a question I kept dodging: where are these agents actually running?
GitHub Agentic Workflows gives you a sandboxed runner — a disposable VM that spins up, does work, and disappears. It’s excellent. It’s also specific to GitHub. The moment your agent needs to hit your staging database, call an internal API, or access credentials to provision infrastructure, that sandbox boundary dissolves. Your agent is operating on real systems with real consequences.
Then NVIDIA dropped OpenShell at GTC 2026 — an open-source, policy-driven sandbox runtime for autonomous AI agents. And suddenly the conversation changed from “should we sandbox agents?” to “how fast can we get this deployed?”
That’s the gap this article addresses. We’ve been obsessing over what agents can do (hooks, gates, policies) without addressing where they do it. Sandboxes are the missing piece — Layer 0 of agentic DevOps.
Layer 0: The Enforcement Boundary
In my agent-proof architecture, I described three enforcement layers:
- Layer 1: Instructions — Tell the agent what you expect
- Layer 2: Hooks — Remind the agent at the moment of action
- Layer 3: Gates — Verify server-side before merge
These layers assume something critical: the agent is operating in an environment where enforcement can happen. But what if it isn’t?
An agent running on your local machine can spawn subprocesses that bypass hooks. It can write to disk outside your project directory. It can make network calls to services you didn’t authorize. Instructions tell it not to. Hooks try to catch it. But without an isolation boundary, these are speed bumps, not walls.
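To make the "speed bump" point concrete, here is a minimal sketch of a hook-level check that keeps writes inside the project directory. The function names and the hook shape are hypothetical and not taken from any particular agent framework. The check only sees tool calls that are routed through it, so a subprocess the agent spawns never reaches it.

```python
from pathlib import Path


def is_inside_project(target: str, project_root: str) -> bool:
    """Return True if target resolves to a location inside project_root."""
    root = Path(project_root).resolve()
    # Relative targets are interpreted relative to the project root;
    # absolute targets replace the root entirely, as with any path join.
    resolved = (root / target).resolve()
    return resolved == root or root in resolved.parents


def pre_write_hook(target: str, project_root: str) -> None:
    """Hypothetical hook: raise before a write tool call leaves the project."""
    if not is_inside_project(target, project_root):
        raise PermissionError(f"write outside project blocked: {target}")
```

Resolving the path first means `../` tricks are caught, but nothing here stops code that never calls the hook. That gap is what the isolation boundary closes.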
Sandboxes are Layer 0 — the execution environment that makes every other layer enforceable. They don’t replace hooks and gates. They make hooks and gates trustworthy.
Think of it this way:
- Hooks run inside the sandbox — they control what the agent does
- Gates validate from outside the sandbox — they verify what the agent produced
- Policies declare what the sandbox allows — they define the boundary itself
- The sandbox is the bridge between “tell the agent” and “enforce on the agent”
The Sandbox Landscape Exploded in 2025–2026
A year ago, “AI sandbox” meant E2B and maybe Docker. Today there are 30+ platforms competing across every dimension — isolation strength, cold start time, GPU access, persistence, and pricing.
The market segments by isolation technology:
| Isolation Tech | Strength | Trade-off | Key Platforms |
|---|---|---|---|
| Firecracker microVM | Strongest — dedicated kernel per workload | Slower cold starts, more resource overhead | E2B, Northflank, Vercel Sandbox, Blaxel, Fly.io Sprites |
| Kernel-level LSM | Strong — syscall-level enforcement | Requires Linux, complex policy authoring | NVIDIA OpenShell |
| gVisor | Good — userspace kernel interception | Some syscall compatibility gaps | Modal |
| Container | Moderate — shared kernel, namespace isolation | Escape vulnerabilities are well-documented | Daytona, Alibaba OpenSandbox |
| V8 Isolate / Wasm | Lightweight — in-process runtime isolation | Limited to specific runtimes | Cloudflare Workers, Rivet Secure Exec |
The cold start race tells you where the market is heading: Blaxel claims 25ms resume from standby, Daytona hits sub-90ms, E2B does ~150ms with full microVM isolation. For agentic workloads where an agent might spin up dozens of sandboxes during a single task, milliseconds matter.
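A quick back-of-the-envelope calculation shows why. The startup figures below are the ones quoted above, with Daytona's "sub-90ms" taken as an upper bound of 90ms. The count of 40 sandboxes per task is purely an assumption for illustration, as is the idea that they start one after another.

```python
# Cold start / resume figures quoted above, in milliseconds.
COLD_START_MS = {"Blaxel (resume)": 25, "Daytona": 90, "E2B": 150}

# Assumption: one agent task creates 40 sandboxes sequentially.
SANDBOXES_PER_TASK = 40


def startup_overhead_ms(cold_start_ms: int, sandboxes: int = SANDBOXES_PER_TASK) -> int:
    """Total time spent waiting on sandbox startup for one task."""
    return cold_start_ms * sandboxes


for platform, ms in COLD_START_MS.items():
    print(f"{platform}: {startup_overhead_ms(ms) / 1000:.1f}s of startup per task")
```

Under those assumptions the startup overhead ranges from about 1 second to 6 seconds per task, before the agent has done any actual work.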
The Comparison That Matters
For agentic DevOps specifically, here’s what I’d look at:
| Platform | Cold Start | Open Source | GPU | Self-Hosted | Pricing |
|---|---|---|---|---|---|
| E2B | ~150ms | ✅ (core) | ❌ | Via Terraform | ~$0.08/hr |
| Daytona | Under 90ms | ✅ (AGPL) | ✅ | ❌ | ~$0.08/hr |
| Modal | Sub-second | ❌ | ✅ Best | ❌ | Pay-per-second |
| OpenShell | Seconds | ✅ Apache 2.0 | ✅ (DGX/RTX) | ✅ | Free |
| Northflank | Fast | ❌ | ❌ | ✅ BYOC | Per-second |
| Fly.io Sprites | 1-12s | ❌ | ❌ | ❌ | CPU+mem+storage |
| OpenSandbox | Variable | ✅ Apache 2.0 | ❌ | ✅ | Free |
| Microsandbox | Variable | ✅ Apache 2.0 | ❌ | ✅ Local-first | Free |
If you need ephemeral execution for agent backends, E2B is the proven choice with 200M+ sandboxes served. If you need persistent state with fast starts, Daytona (67K GitHub stars) or Fly.io Sprites are compelling. For GPU workloads, Modal is unmatched.
But for agentic DevOps — where policy-governed isolation is the whole point — one platform stands out.
NVIDIA OpenShell: Policy-Driven Agent Sandboxing
OpenShell, announced at GTC 2026, takes a fundamentally different approach. Instead of “here’s a sandbox, run your code,” it’s “here’s a policy engine, declare what the agent can do.”
OpenShell enforces four protection domains:
- Filesystem — Landlock LSM locks allowed paths at sandbox creation. Not a namespace trick. Kernel-enforced.
- Network — Deny-by-default. Every outbound connection goes through an HTTP CONNECT proxy and is evaluated against OPA/Rego policies in real time.
- Process — Seccomp BPF filters block dangerous syscalls. No privilege escalation, no socket creation outside the proxy.
- Inference — A privacy router intercepts LLM API calls, strips caller credentials, and injects backend credentials. Your agent’s context never leaks to unauthorized model providers.
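The inference domain is the least familiar of the four, so here is a rough sketch of the credential swap it describes. This is not OpenShell's implementation: the function name, the set of headers treated as credentials, and the bearer-token format are all assumptions for illustration.

```python
# Assumption: these header names carry the caller's credentials.
CREDENTIAL_HEADERS = {"authorization", "x-api-key"}


def reroute_inference_request(headers: dict[str, str], backend_token: str) -> dict[str, str]:
    """Return a copy of headers with caller credentials removed
    and the backend credential injected in their place."""
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() not in CREDENTIAL_HEADERS
    }
    # Hypothetical: the approved backend expects a bearer token.
    cleaned["Authorization"] = f"Bearer {backend_token}"
    return cleaned
```

The point of the design is that the agent never holds the backend credential, so it cannot reuse it against a provider the router has not approved.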
The killer feature is declarative YAML policies that hot-reload on running sandboxes:
# Allow the agent to reach GitHub API and npm registry — nothing else
network:
  outbound:
    - host: "api.github.com"
      ports: [443]
      methods: [GET, POST]
    - host: "registry.npmjs.org"
      ports: [443]
      methods: [GET]
Change the policy file, and the running sandbox immediately enforces the new rules. No restart. No downtime. This is what makes it fit the agentic DevOps model — policies are code, code is versioned, versioned policies are auditable.
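Deny-by-default is easier to reason about once you see the evaluation logic. The sketch below mirrors the two rules from the policy above in plain Python. It is a stand-in for illustration, not OPA/Rego and not OpenShell's evaluator; the `OutboundRule` type and exact-match host comparison are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundRule:
    host: str
    ports: tuple[int, ...]
    methods: tuple[str, ...]


# Mirrors the two rules in the YAML policy above.
RULES = [
    OutboundRule("api.github.com", (443,), ("GET", "POST")),
    OutboundRule("registry.npmjs.org", (443,), ("GET",)),
]


def is_allowed(host: str, port: int, method: str, rules=RULES) -> bool:
    """Deny by default: allow only if some rule matches host, port, and method."""
    return any(
        host == rule.host and port in rule.ports and method.upper() in rule.methods
        for rule in rules
    )
```

With these rules, a `POST` to `api.github.com:443` passes, while a `POST` to `registry.npmjs.org:443` or any request to an unlisted host is refused. There is no "deny" list to maintain, because anything not explicitly allowed fails the check.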
OpenShell is Apache 2.0, fully self-hosted, and runs as a lightweight K3s cluster inside a single Docker container. Two commands to get started:
openshell sandbox create -- claude
openshell policy set my-sandbox --policy network-policy.yaml
It’s alpha software — single-player mode, rough edges. But the architecture is right: sandboxes aren’t just isolation, they’re governance infrastructure.
Sandboxes Complete the Agentic DevOps Stack
Here’s how sandboxes connect to everything I’ve written about agentic DevOps:
With hookflows, you enforce rules at the moment of action. But hookflows run in the agent’s process — they trust the environment. A sandbox makes the environment itself trustworthy.
With agent hooks, you intercept tool calls and block dangerous operations. But hooks can be disabled by a sufficiently creative agent (or developer). A sandbox enforces at the kernel level — there’s no --skip-sandbox flag.
With gates in CI/CD, you verify everything server-side. But gates only catch problems after the agent has already made changes. A sandbox prevents the problems from happening during execution.
With GitHub Agentic Workflows, you get a purpose-built sandbox for GitHub’s ecosystem. General-purpose sandboxes extend that model to any infrastructure — your staging environments, your databases, your internal APIs.
The progression is clear:
| Layer | Mechanism | When | Strength | Weakness |
|---|---|---|---|---|
| Layer 0: Sandbox | Kernel/VM isolation | During execution | Enforced outside the agent’s control | Requires infrastructure |
| Layer 1: Instructions | Context engineering | Before action | Easy to author | Easy to ignore |
| Layer 2: Hooks | Tool-call interception | At moment of action | Real-time enforcement | Can be disabled |
| Layer 3: Gates | CI/CD pipeline | After action | Server-side, tamper-proof | Catches problems late |
Each layer compensates for the weaknesses of the others. Sandboxes at Layer 0 mean that even if an agent bypasses hooks, it physically cannot access unauthorized filesystems, networks, or processes.
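A toy model makes the layering argument explicit. Everything here is hypothetical: the `Action` type, the allowlist, and the function names exist only to show how a bypassed hook and an always-on sandbox combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "write" or "connect"
    target: str  # a path or a hostname


# Hypothetical allowlist standing in for the real policy.
ALLOWED = {("write", "/workspace"), ("connect", "api.github.com")}


def hook_allows(action: Action, hooks_enabled: bool) -> bool:
    """Layer 2: only has a say when the hook is actually running."""
    if not hooks_enabled:
        return True  # a disabled or bypassed hook blocks nothing
    return (action.kind, action.target) in ALLOWED


def sandbox_allows(action: Action) -> bool:
    """Layer 0: evaluated regardless of what happened to the hooks."""
    return (action.kind, action.target) in ALLOWED


def execute(action: Action, hooks_enabled: bool) -> bool:
    """An action goes through only if every active layer allows it."""
    return hook_allows(action, hooks_enabled) and sandbox_allows(action)
```

In this model, `execute(Action("connect", "evil.example"), hooks_enabled=False)` still returns `False`: the hook was skipped, but the sandbox check has no off switch.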
The Bottom Line
We’ve been building agentic DevOps from the top down — instructions, hooks, gates. All essential. All insufficient without the foundation.
Sandboxes are that foundation. They’re the difference between “we told the agent not to” and “the agent literally cannot.” Between policy-as-suggestion and policy-as-physics.
NVIDIA’s OpenShell is the most significant new entrant because it treats sandboxes as governance infrastructure, not just containers. Declarative YAML policies, hot-reloadable at runtime, with kernel-level enforcement that agents physically cannot circumvent. It’s Apache 2.0, it’s free, and it works with Claude Code, Codex, and Copilot out of the box.
The sandbox market is mature enough to use today. E2B for ephemeral execution, Daytona for fast iteration, Modal for GPU workloads, OpenShell for policy-governed isolation. The tooling exists. The question is whether your agentic DevOps stack includes it.
If you’re running agents without sandbox isolation, you’re running agents on trust. And trust doesn’t scale.