Building a Sandboxed Execution System for AI Agents and Untrusted Code

I didn’t really set out to build something like a “sandboxed execution system for AI agents.” It started from something much more ordinary: I just wanted a way to safely run code that I didn’t fully trust on a server.

Sometimes it was code generated by AI agents. Sometimes it was automation scripts that looked correct at first glance but had no real guarantees behind them. The obvious answer is always the same:

Just run it in a container.

And for a while, that feels like the end of the story. But once you start using it for iterative, stateful workloads, it starts to break down.

So I started building something more opinionated: not just a way to run code, but a controlled execution environment that could manage state, sessions, and lifecycle around untrusted workloads.

That system became Bastion: a self-hosted sandboxed execution environment for running untrusted code from AI agents and automation tools, with persistent sessions instead of one-off containers.

Check out the project on GitHub!

The problem isn’t really “running code safely”

My first framing of the problem was pretty narrow: isolation. Docker gives you that. Spin up a container, execute code, tear it down. So that should be enough, right? But very quickly, I started running into friction that doesn’t show up in simple setups:

I don’t want a fresh environment every time
I need state to persist across executions (files, installs, intermediate outputs)
I want to run multiple commands in the same environment
I want interactive access (like a shell, not just a request/response model)
I want visibility into what’s happening while it’s running
I want strict resource limits, but not ones that make everything fragile or unpredictable

At that point, “just use Docker” starts to feel like it solves only one dimension of the problem, because the real issue wasn’t execution itself. It was persistent, stateful execution, where the environment isn’t disposable anymore but something you manage. Once you think in those terms, the problem stops being simple. You’re no longer just running code. You’re effectively managing a persistent, sandboxed execution environment per session.

The first approaches don’t really scale in your head (or in practice)

The naive version is straightforward: spin up a container per request, run a command (docker run), return stdout, destroy everything.

It works until you try to do anything iterative. The moment you need state, everything breaks down: dependency installs repeat every time, filesystem state disappears after execution, workflows can’t continue across commands, and debugging becomes almost impossible because context resets constantly.

So the next step feels obvious:

keep a container alive per session

This immediately feels closer to what you actually want. Now each session is a long-lived environment, and commands execute inside it.

But this shift quietly introduces a new set of problems:

how do you run multiple commands concurrently in the same environment?
how do you track which process belongs to which execution?
how do you stream output in real time?
what happens when the container crashes or gets out of sync?
how do you reason about state consistency over time?

At this point, I stopped thinking in terms of “a runner”. What I was building started to look more like: a system that manages execution environments. That shift changes everything.

The system naturally started organizing itself around sessions

Once you stop thinking in terms of isolated commands and start thinking in terms of environments, a core abstraction emerges pretty quickly:

A session is not a request. It is a long-lived execution environment.

That idea became the center of everything. From there, the system naturally split into layers:

API layer (how clients interact with the system)
orchestrator (how execution is scheduled and controlled)
container runtime (where code actually runs)
filesystem layer (how state persists across time)
persistence layer (how we remember what happened)

The orchestrator ends up being where the “real system” lives

One thing I didn’t fully appreciate at the beginning is that Docker is not really an orchestration system in the way this problem needs.

It gives you primitives: start container, exec into container, attach streams, kill container

But everything above that, the logic that makes these primitives meaningful, is missing. So the orchestrator becomes the actual control center.

It’s responsible for things like:

session lifecycle management
enforcing concurrency limits per session
deciding whether an execution should run at all
reconciling runtime state with persisted state
tracking execution metadata and history

At some point, it started to feel less like “application logic” and more like a lightweight scheduler, something closer to an OS concept, but at the container level. That analogy actually helped reason about it more clearly.
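
To make the "deciding whether an execution should run at all" part concrete, here is a minimal sketch of a per-session admission check. The class name `SessionGate`, the limit of two concurrent executions, and the state names are assumptions made up for illustration; this is not Bastion's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class SessionGate:
    """Hypothetical per-session admission check (illustrative only)."""

    max_concurrent: int = 2                 # assumed limit, chosen for the example
    state: str = "running"                  # assumed states: "running", "stopped"
    active: set = field(default_factory=set)

    def try_admit(self, execution_id: str) -> bool:
        """Decide whether a new execution may start in this session."""
        if self.state != "running":
            return False                    # session is not in a runnable state
        if len(self.active) >= self.max_concurrent:
            return False                    # per-session concurrency limit reached
        self.active.add(execution_id)
        return True

    def release(self, execution_id: str) -> None:
        """Free the slot once an execution has finished or been cancelled."""
        self.active.discard(execution_id)
```

The point of the sketch is that the decision happens before anything touches Docker: an execution that is refused never becomes a process that has to be cleaned up later.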

Sessions became the central abstraction

A session represents a stable environment where multiple executions happen over time. It consists of:

a persistent container
a filesystem mounted into it
environment configuration
execution history
runtime state

This changes how execution itself is modeled. It stops being: run → get output → discard. It becomes: stateless operations over a persistent, stateful environment.

That model is powerful, but it introduces its own complexity: concurrency inside a shared environment, consistency of state over time, and isolation between executions that still share a container.
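
The pieces of a session listed above might be modeled roughly like this. It is a sketch only: every class and field name here is an assumption for illustration, not the project's real schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExecutionRecord:
    """One command run inside a session (hypothetical shape)."""

    execution_id: str
    command: list[str]
    started_at: datetime
    exit_code: Optional[int] = None        # None while the command is still running


@dataclass
class Session:
    """A long-lived execution environment (field names are assumptions)."""

    session_id: str
    container_id: str                      # the persistent container
    workspace_path: str                    # host directory mounted into the container
    env: dict[str, str] = field(default_factory=dict)
    history: list[ExecutionRecord] = field(default_factory=list)
    state: str = "running"                 # runtime state

    def record(self, execution_id: str, command: list[str]) -> ExecutionRecord:
        """Append a new execution to this session's history."""
        rec = ExecutionRecord(execution_id, command, datetime.now(timezone.utc))
        self.history.append(rec)
        return rec
```

Each execution is a small record that comes and goes, while the session object is what actually carries state between them.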

Why `docker exec` started to make more sense than spawning containers

At some point, there are two obvious directions:

spawn a new container per execution
reuse a running container and execute inside it

I tried both directions, and they lead to very different systems. Spawning containers gives clean isolation, but:

it’s too slow for interactive workflows
state has to be reconstructed externally every time
it doesn’t support “continuation” of work very naturally

Using `docker exec` inside persistent containers shifts the system in a different direction: state is naturally preserved, execution becomes fast and incremental, and interactive workflows become possible.

But the tradeoff is real: you now have to manage multiple processes inside one environment, resource isolation becomes harder, and orphan processes and cleanup become real problems.

Still, for this kind of system, the tradeoff is worth it, because the core requirement is not “one-off execution” but continuous interaction with an evolving environment.
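
The two directions can be compared side by side as Docker CLI invocations. This is a hedged sketch: the helper names, the `/workspace` working directory, and the use of `sleep infinity` to keep the container idle (which assumes the image ships a `sleep` that accepts it) are illustrative assumptions, not how Bastion necessarily does it.

```python
import subprocess


def one_shot_argv(image: str, command: list[str]) -> list[str]:
    """Fresh container per execution: clean isolation, nothing carried over."""
    return ["docker", "run", "--rm", image, *command]


def start_session_argv(name: str, image: str) -> list[str]:
    """Start a detached, long-lived container that idles until commands arrive."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_argv(name: str, command: list[str], workdir: str = "/workspace") -> list[str]:
    """Run a command inside the already-running session container."""
    return ["docker", "exec", "-w", workdir, name, *command]


def run(argv: list[str], timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Execute an argv list and capture its output (needs a local Docker CLI)."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
```

With the second pair, something like `run(exec_argv("sess-1", ["pip", "install", "requests"]))` followed by `run(exec_argv("sess-1", ["python", "main.py"]))` sees the same filesystem both times, which is exactly what the one-shot form cannot offer.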

Filesystem persistence turned out to be the simplest hard decision

Persistence sounds easy until you realize containers are inherently ephemeral. Bind mounts solve a lot immediately:

state survives container restarts
no external storage layer needed at the start
simple mental model

But they also shift responsibility upward: isolation is now something you must enforce carefully, path traversal becomes your problem rather than Docker’s, and boundaries have to be enforced at the API and orchestration layer.
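
As an example of what "path traversal becomes your problem" means in practice, here is a minimal sketch of a check the API layer could run on any client-supplied path before touching the host filesystem. The function name and the per-session workspace layout are assumptions for illustration.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, user_path: str) -> Path:
    """Resolve a client-supplied path and reject anything outside the workspace."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks that already exist;
    # an absolute user_path replaces the root entirely and is rejected below.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes workspace: {user_path!r}")
    return candidate
```

A check like this only describes the filesystem at the moment it runs. Code inside the container can create symlinks afterwards, so it is a guard at the API boundary rather than a complete answer.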

This is a recurring pattern in the system: simplicity in one layer usually moves complexity to another.

The terminal subsystem made the system feel “alive”

At some point, I wanted more than just command execution. I wanted an actual interactive environment, something that behaves like a shell session. So the system evolved into a WebSocket-based terminal layer:

Client ↔ WebSocket ↔ API ↔ docker exec (TTY mode)

This introduces a different class of problems:

partial stream handling
terminal resizing events
maintaining session continuity across disconnects
backpressure in streaming output
keeping interactive state stable under load

Once this works, the system stops feeling like a “command runner”. It starts feeling like a live environment you can interact with directly.
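
One small piece of the "session continuity across disconnects" problem can be sketched without any WebSocket code: a bounded scrollback buffer that keeps recent terminal output so a reconnecting client can catch up. The class name and the 64 KiB cap are assumptions for illustration, and this says nothing about how the real terminal layer is implemented.

```python
from collections import deque


class Scrollback:
    """Keep the most recent terminal output for replay on reconnect."""

    def __init__(self, max_bytes: int = 64 * 1024) -> None:
        self.max_bytes = max_bytes          # assumed cap, chosen for the example
        self._chunks = deque()              # raw output chunks, oldest first
        self._size = 0                      # total bytes currently buffered

    def append(self, chunk: bytes) -> None:
        """Store a chunk of TTY output, evicting the oldest chunks past the cap."""
        self._chunks.append(chunk)
        self._size += len(chunk)
        # Always keep at least the newest chunk, even if it alone exceeds the cap.
        while self._size > self.max_bytes and len(self._chunks) > 1:
            self._size -= len(self._chunks.popleft())

    def replay(self) -> bytes:
        """Return buffered output to send first when a client reattaches."""
        return b"".join(self._chunks)
```

Capping the buffer also helps with backpressure: a slow or absent client costs a fixed amount of memory instead of an unbounded queue.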

Concurrency is where the system stops being simple

Single execution per session is straightforward. Multiple concurrent executions inside the same environment is where everything becomes more delicate. Now you have to think about:

competing resource usage inside one container
output streams interleaving
lifecycle management per execution
cancellation and timeout semantics
tracking processes that aren’t tied to a single request

All of this applies even though everything still runs inside one container. This is the point where the system stops feeling like a wrapper around Docker and starts feeling like its own execution model.
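
Timeout semantics per execution can be sketched with `asyncio`. This example runs ordinary local processes, and the names `ExecutionResult` and `run_with_timeout` and the status strings are assumptions for illustration.

```python
import asyncio
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionResult:
    execution_id: str
    status: str                 # "exited" or "timeout" (assumed status names)
    exit_code: Optional[int]
    output: str


async def run_with_timeout(execution_id: str, argv: list[str], timeout: float) -> ExecutionResult:
    """Run one command, capture its combined output, and kill it on timeout."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
        return ExecutionResult(execution_id, "exited", proc.returncode, out.decode())
    except asyncio.TimeoutError:
        proc.kill()             # stop the process we started
        await proc.wait()       # reap it so it does not linger as a zombie
        return ExecutionResult(execution_id, "timeout", None, "")


async def demo() -> list[ExecutionResult]:
    """Two executions run concurrently; each result stays tied to its own ID."""
    return await asyncio.gather(
        run_with_timeout("e1", [sys.executable, "-c", "print('hi')"], 10.0),
        run_with_timeout("e2", [sys.executable, "-c", "import time; time.sleep(30)"], 0.5),
    )
```

One caveat if the command is a `docker exec` client: killing the client process may not stop the process inside the container, so the in-container process likely needs to be tracked and signalled separately.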

Security is not one mechanism, it’s a stack of assumptions

It’s easy to assume containers give you “security”. They don’t, at least not by themselves. So the model becomes layered:

container isolation as the baseline
filesystem scoping via bind mounts
network control per session
CPU, memory, and process limits
execution timeouts enforced at runtime

But the important realization is that security here is not a single guarantee; it’s a composition of constraints.

And if any layer is misconfigured, the guarantees degrade quickly. So instead of thinking in terms of a “secure system”, the goal becomes a system designed to contain failure within predictable boundaries.
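
Several of these layers map directly onto `docker run` flags. The following sketch builds such a command line. The default values, the helper name, the `/workspace` mount point, and the `sleep infinity` idle command are assumptions picked for illustration, not recommended settings.

```python
def session_container_argv(
    name: str,
    image: str,
    workspace: str,
    *,
    memory: str = "512m",      # assumed default, for illustration only
    cpus: str = "1.0",         # assumed default
    pids_limit: int = 128,     # assumed default
    network: str = "none",     # no network unless the session is granted one
) -> list[str]:
    """Build a `docker run` command line that applies the per-session limits."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "--memory", memory,                  # memory limit
        "--cpus", cpus,                      # CPU limit
        "--pids-limit", str(pids_limit),     # process-count limit
        "--network", network,                # network control per session
        "-v", f"{workspace}:/workspace",     # filesystem scoping via bind mount
        "-w", "/workspace",
        image, "sleep", "infinity",
    ]
```

Execution timeouts are missing on purpose: they are not a container flag, so they have to be enforced by whatever launches each command.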

Persistence and reconciliation became unavoidable

One assumption I made early on was:

if a container exists, the system state is consistent

That turns out to be false fairly quickly. Containers crash. Machines restart. State drifts out of sync. So on startup, the system has to reconcile reality:

what exists in persistent storage (SQLite)
what is actually running in Docker
what is orphaned or inconsistent
what needs to be recovered or cleaned up

This becomes a simple but critical recovery loop:

load persisted session state
inspect runtime containers
match and reconcile
repair inconsistencies

It’s not exciting, but it’s the kind of thing that determines whether a system feels reliable or fragile.
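
The matching step of that loop reduces to set arithmetic. In this sketch, the two inputs are assumed to come from elsewhere (for example a query against the SQLite store and a listing of running containers), and the names `Plan` and `reconcile` are made up for illustration. It also assumes session IDs map one-to-one onto container names.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    keep: set[str]             # recorded as running and actually running
    mark_failed: set[str]      # recorded as running, but no container exists
    remove_orphans: set[str]   # container exists, but no session record


def reconcile(persisted_running: set[str], live_containers: set[str]) -> Plan:
    """Compare what storage believes with what the runtime reports."""
    return Plan(
        keep=persisted_running & live_containers,
        mark_failed=persisted_running - live_containers,
        remove_orphans=live_containers - persisted_running,
    )
```

For example, `reconcile({"s1", "s2"}, {"s2", "s3"})` keeps `s2`, flags `s1` for recovery or failure handling, and reports `s3` as an orphan to clean up. What "repair" means for each bucket is a policy choice the sketch leaves open.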

Observability ends up mattering more than expected

At some point, you can’t reason about the system without visibility into it. So everything starts emitting structured logs:

execution lifecycle events
timing information
exit codes
resource usage
policy decisions

Not because it’s “good engineering practice”, but because without it debugging turns into guessing. For a system that executes arbitrary code, guessing is not acceptable.
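
A structured log event can be as simple as one JSON object per line. The event name and field names below are assumptions for illustration, not the system's actual log schema.

```python
import json
import sys
import time


def log_event(event: str, *, stream=sys.stdout, **fields) -> str:
    """Write one JSON object per line describing a lifecycle event."""
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line, file=stream)
    return line


# Example: an execution finishing, with timing and exit code attached.
log_event(
    "execution.finished",      # hypothetical event name
    session_id="s1",
    execution_id="e1",
    exit_code=0,
    duration_ms=42,
)
```

Because every line carries the session and execution IDs, a question like "what happened to execution e1?" becomes a filter over the log rather than a reconstruction from memory.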

What this ended up teaching me

A few things became clear only after the system started stabilizing:

stateful systems are fundamentally harder than stateless ones, even when they look simpler on the surface
orchestration is often the real complexity, not execution itself
abstraction boundaries matter more than implementation details
concurrency inside shared environments forces you to confront design decisions immediately
and most importantly, “just use Docker” stops being meaningful once you need lifecycle control

Where this naturally leads next

Once a single-machine system like this starts working, the questions change:

how do sessions move across machines?
can environments be snapshotted and restored?
can execution be replayed deterministically?
how do you coordinate multiple runtimes safely?
what does multi-tenant isolation look like at this level?

But that feels like a different class of system entirely.

For now, the more interesting part was simply getting a single machine to behave like a stable, stateful execution environment that doesn’t fall apart under real usage. Even that turned out to be more subtle than it initially looked.