I didn’t really set out to build something like a “sandboxed execution system for AI agents.” It started from something much more ordinary: I just wanted a way to safely run code I didn’t fully trust on a server.
Sometimes it was code generated by AI agents. Sometimes it was automation scripts that looked correct at first glance but had no real guarantees behind them. The obvious answer is always the same:
Just run it in a container.
And for a while, that feels like the end of the story. But once you start using it for iterative, stateful workloads, it starts to break down.
So I started building something more opinionated: not just a way to run code, but a controlled execution environment that could manage state, sessions, and lifecycle around untrusted workloads.
That system became Bastion: a self-hosted sandboxed execution environment for running untrusted code from AI agents and automation tools, with persistent sessions instead of one-off containers.
Check out the project on GitHub!
The problem isn’t really “running code safely”
My first framing of the problem was pretty narrow: isolation. Docker gives you that. Spin up a container, execute code, tear it down. So that should be enough, right? But very quickly, I started running into friction that doesn’t show up in simple setups:
- I don’t want a fresh environment every time
- I need state to persist across executions (files, installs, intermediate outputs)
- I want to run multiple commands in the same environment
- I want interactive access (like a shell, not just a request/response model)
- I want visibility into what’s happening while it’s running
- I want strict resource limits, but not ones that make everything fragile or unpredictable
At that point, “just use Docker” starts to feel like it solves only one dimension of the problem, because the real issue wasn’t execution itself. It was persistent, stateful execution, where the environment isn’t disposable anymore but something you manage. Once you think in those terms, the problem stops being simple. You’re no longer just running code. You’re effectively managing a persistent, sandboxed execution environment per session.
The first approaches don’t really scale in your head (or in practice)
The naive version is straightforward: spin up a container per request, run a command (docker run), return stdout, destroy everything.
It works until you try to do anything iterative. The moment you need state, everything breaks down: dependency installs repeat every time, filesystem state disappears after execution, workflows can’t continue across commands, and debugging becomes almost impossible because context resets constantly.
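As a rough illustration of that naive model, here is a minimal Python sketch. The helper names and the image name in the usage note are placeholders I made up for illustration, not Bastion's actual code; the only real piece is the `docker run --rm` CLI call.

```python
import subprocess


def build_run_argv(image: str, command: list[str]) -> list[str]:
    """Build a one-shot `docker run` invocation (hypothetical helper)."""
    # --rm removes the container as soon as the command exits,
    # so nothing written inside it survives to the next request.
    return ["docker", "run", "--rm", image, *command]


def run_once(image: str, command: list[str], timeout_s: float = 30.0) -> tuple[int, str]:
    """Run a command in a throwaway container and return (exit code, stdout)."""
    result = subprocess.run(
        build_run_argv(image, command),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # only bounds the local docker client process
    )
    return result.returncode, result.stdout
```

Calling something like `run_once("python:3.12-slim", ["pip", "install", "requests"])` twice would repeat the install both times, which is exactly the friction described above.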
So the next step feels obvious:
keep a container alive per session
This immediately feels closer to what you actually want. Now each session is a long-lived environment, and commands execute inside it.
But this shift quietly introduces a new set of problems:
- how do you run multiple commands concurrently in the same environment?
- how do you track which process belongs to which execution?
- how do you stream output in real time?
- what happens when the container crashes or gets out of sync?
- how do you reason about state consistency over time?
At this point, I stopped thinking in terms of “a runner”. What I was building started to look more like: a system that manages execution environments. That shift changes everything.
The system naturally started organizing itself around sessions
Once you stop thinking in terms of isolated commands and start thinking in terms of environments, a core abstraction emerges pretty quickly:
A session is not a request. It is a long-lived execution environment.
That idea became the center of everything. From there, the system naturally split into layers:
- API layer (how clients interact with the system)
- orchestrator (how execution is scheduled and controlled)
- container runtime (where code actually runs)
- filesystem layer (how state persists across time)
- persistence layer (how we remember what happened)
The orchestrator ends up being where the “real system” lives
One thing I didn’t fully appreciate at the beginning is that Docker is not really an orchestration system in the way this problem needs.
It gives you primitives: start a container, exec into a container, attach streams, kill a container.
But everything above that, the logic that makes these primitives meaningful, is missing. So the orchestrator becomes the actual control center.
It’s responsible for things like:
- session lifecycle management
- enforcing concurrency limits per session
- deciding whether an execution should run at all
- reconciling runtime state with persisted state
- tracking execution metadata and history
At some point, it started to feel less like “application logic” and more like a lightweight scheduler, something closer to an OS concept, but at the container level. That analogy actually helped reason about it more clearly.
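To make "deciding whether an execution should run at all" a bit more concrete, here is a small sketch of a per-session concurrency gate. The class name, method names, and the idea of a fixed limit per session are assumptions for illustration; the real orchestrator may well decide differently.

```python
import threading
from collections import defaultdict


class SessionAdmission:
    """Caps how many executions may run at once in each session (illustrative)."""

    def __init__(self, max_concurrent: int) -> None:
        self._max = max_concurrent
        self._active: dict[str, int] = defaultdict(int)
        self._lock = threading.Lock()

    def try_admit(self, session_id: str) -> bool:
        """Return True and reserve a slot, or False if the session is at its limit."""
        with self._lock:
            if self._active[session_id] >= self._max:
                return False
            self._active[session_id] += 1
            return True

    def release(self, session_id: str) -> None:
        """Free a slot once an execution has finished, timed out, or been cancelled."""
        with self._lock:
            if self._active[session_id] > 0:
                self._active[session_id] -= 1
```

The point of the sketch is that the check and the reservation happen under one lock, so two requests arriving at the same moment cannot both take the last slot.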
Sessions became the central abstraction
A session represents a stable environment where multiple executions happen over time. It bundles:
- a persistent container
- a filesystem mounted into it
- environment configuration
- execution history
- runtime state
This changes how execution itself is modeled. It stops being “run → get output → discard” and becomes “stateless operations over a persistent stateful environment”.
That tradeoff is powerful, but it introduces its own complexity: concurrency inside a shared environment, consistency of state over time, and isolation between executions that still share a container.
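One way to picture a session as data is the sketch below. Every field name and the set of states are my own guesses for illustration, chosen to mirror the list above; they are not taken from Bastion's schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional


class SessionState(Enum):
    # Hypothetical lifecycle states.
    CREATING = "creating"
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"


@dataclass
class Execution:
    execution_id: str
    command: list[str]
    exit_code: Optional[int] = None  # None while the command is still running


@dataclass
class Session:
    session_id: str
    container_id: str  # the persistent container
    workspace: Path    # host directory mounted into the container
    env: dict[str, str] = field(default_factory=dict)
    history: list[Execution] = field(default_factory=list)
    state: SessionState = SessionState.CREATING
```

Each execution is a small record appended to `history`, while the container and workspace outlive any single command.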
Why docker exec started to make more sense than spawning containers
At some point, there are two obvious directions:
- spawn a new container per execution
- reuse a running container and execute inside it
I tried both directions, and they lead to very different systems. Spawning containers gives clean isolation, but:
- it’s too slow for interactive workflows
- state has to be reconstructed externally every time
- it doesn’t support “continuation” of work very naturally
Using docker exec inside persistent containers shifts the system in a different direction: state is naturally preserved, execution becomes fast and incremental, and interactive workflows become possible.
But the tradeoff is real: you now have to manage multiple processes inside one environment, resource isolation becomes harder, and orphan processes and cleanup become real problems.
Still, for this kind of system, the tradeoff is worth it because the core requirement is not “one-off execution”, it’s continuous interaction with an evolving environment.
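The reuse approach can be sketched in a few lines. The helper names and the `/workspace` path in the usage note are assumptions; `docker exec` and its `--workdir` flag are the real CLI.

```python
import subprocess
from typing import Optional


def build_exec_argv(container: str, command: list[str], workdir: Optional[str] = None) -> list[str]:
    """Build a `docker exec` invocation against an already-running container."""
    argv = ["docker", "exec"]
    if workdir is not None:
        argv += ["--workdir", workdir]  # run relative to the mounted workspace
    return [*argv, container, *command]


def exec_in_session(container: str, command: list[str]) -> tuple[int, str]:
    """Run one command in the session's container; whatever it leaves behind stays there."""
    result = subprocess.run(build_exec_argv(container, command), capture_output=True, text=True)
    return result.returncode, result.stdout
```

With this shape, a `pip install` in one call is still in place for the next call such as `build_exec_argv("sess-1", ["python", "main.py"], workdir="/workspace")`, because both run in the same container.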
Filesystem persistence turned out to be the simplest hard decision
Persistence sounds easy until you realize containers are inherently ephemeral. Bind mounts solve a lot immediately:
- state survives container restarts
- no external storage layer needed at the start
- simple mental model
But they also shift responsibility upward: isolation is now something you must enforce carefully, path traversal becomes your problem rather than Docker’s, and boundaries have to be enforced at the API and orchestration layer.
This is a recurring pattern in the system: simplicity in one layer usually moves complexity to another.
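Since path traversal lands on the API layer, a check along these lines is the kind of thing that layer needs. This is a generic sketch, not Bastion's code; the function and exception names are made up.

```python
from pathlib import Path


class PathEscapeError(ValueError):
    """Raised when a requested path would land outside the session workspace."""


def resolve_in_workspace(workspace: Path, user_path: str) -> Path:
    """Map a client-supplied path onto the workspace, rejecting any escape."""
    root = workspace.resolve()
    # Joining first and resolving second collapses ".." segments and follows
    # symlinks, so the check below sees the real target. An absolute user_path
    # replaces the root entirely and is therefore rejected as well.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathEscapeError(f"{user_path!r} escapes the workspace")
    return candidate
```

One caveat worth stating: code running inside the container can create symlinks in the mounted directory after a check like this has passed, so the check narrows the problem rather than closing it completely.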
The terminal subsystem made the system feel “alive”
At some point, I wanted more than just command execution. I wanted an actual interactive environment, something that behaves like a shell session. So the system evolved into a WebSocket-based terminal layer:
Client ↔ WebSocket ↔ API ↔ docker exec (TTY mode)
This introduces a different class of problems:
- partial stream handling
- terminal resizing events
- maintaining session continuity across disconnects
- backpressure in streaming output
- keeping interactive state stable under load
Once this works, the system stops feeling like a “command runner”. It starts feeling like a live environment you can interact with directly.
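Of the problems above, backpressure is the easiest to show in isolation. The sketch below uses a bounded `asyncio.Queue` between a reader and a writer; the function names are hypothetical, and the list of chunks and the sleep stand in for the exec stream and the WebSocket, which are not modeled here.

```python
import asyncio


async def pump(chunks: list[bytes], queue: asyncio.Queue) -> None:
    """Producer: stands in for reading output from the exec stream."""
    for chunk in chunks:
        # put() suspends while the queue is full, so a slow client
        # slows the reader down instead of growing memory without bound.
        await queue.put(chunk)
    await queue.put(None)  # sentinel: stream finished


async def drain(queue: asyncio.Queue, delay_s: float = 0.0) -> list[bytes]:
    """Consumer: stands in for writing frames to the WebSocket."""
    received = []
    while (chunk := await queue.get()) is not None:
        await asyncio.sleep(delay_s)  # simulate a slow client
        received.append(chunk)
    return received


async def relay(chunks: list[bytes], max_buffered: int = 4) -> list[bytes]:
    """Run producer and consumer together with at most max_buffered chunks in flight."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
    _, received = await asyncio.gather(pump(chunks, queue), drain(queue, 0.001))
    return received
```

The design choice is simply that the buffer has a ceiling: when the client falls behind, the reader waits rather than buffering an unbounded amount of terminal output.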
Concurrency is where the system stops being simple
Single execution per session is straightforward. Multiple concurrent executions inside the same environment is where everything becomes more delicate. Now you have to think about:
- competing resource usage inside one container
- output streams interleaving
- lifecycle management per execution
- cancellation and timeout semantics
- tracking processes that aren’t tied to a single request
And all of this happens even though everything still runs inside one container. This is the point where the system stops feeling like a wrapper around Docker and starts feeling like its own execution model.
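Timeout semantics are a good example of how much has to be decided explicitly. Here is a minimal sketch of one possible policy (kill on timeout, report a distinct status) using asyncio subprocesses. The function name and the status strings are assumptions, and the demo uses a local Python process rather than `docker exec`.

```python
import asyncio
import sys
from typing import Optional


async def run_with_timeout(argv: list[str], timeout_s: float) -> tuple[str, Optional[int]]:
    """Run argv; return ("exited", code) or ("timeout", None)."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        code = await asyncio.wait_for(proc.wait(), timeout=timeout_s)
        return "exited", code
    except asyncio.TimeoutError:
        proc.kill()        # stop the local process
        await proc.wait()  # reap it so it does not linger as a zombie
        return "timeout", None


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(asyncio.run(run_with_timeout(sleeper, timeout_s=0.5)))
```

When the local process is a `docker exec` client, killing it may leave the process inside the container running, which is one way the orphaned processes mentioned earlier come about. A real implementation would need some way to find and stop the in-container process too.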
Security is not one mechanism, it’s a stack of assumptions
It’s easy to assume containers give you “security”. They don’t, at least not by themselves. So the model becomes layered:
- container isolation as the baseline
- filesystem scoping via bind mounts
- network control per session
- CPU, memory, and process limits
- execution timeouts enforced at runtime
But the important realization is that security here is not a single guarantee; it’s a composition of constraints.
And if any layer is misconfigured, the guarantees degrade quickly. So instead of thinking in terms of a “secure system”, the goal becomes a system designed to contain failure within predictable boundaries.
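Several of these layers map onto standard `docker run` flags. The sketch below shows how a session container might be started with them. The default values, the `/workspace` mount point, and the `sleep infinity` keep-alive command are all assumptions for illustration, not Bastion's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    memory: str = "512m"   # passed to --memory
    cpus: str = "1.0"      # passed to --cpus
    pids: int = 128        # passed to --pids-limit
    network: str = "none"  # passed to --network; "none" disables networking


def build_session_argv(name: str, image: str, host_dir: str, limits: Limits) -> list[str]:
    """Build a `docker run` command for a long-lived, resource-limited session container."""
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", limits.memory,
        "--cpus", limits.cpus,
        "--pids-limit", str(limits.pids),
        "--network", limits.network,
        "--volume", f"{host_dir}:/workspace",
        image,
        "sleep", "infinity",  # keep-alive process so later `docker exec` calls have a target
    ]
```

Execution timeouts are not in this list because Docker has no flag for them; they have to be enforced by whatever launches each command.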
Persistence and reconciliation became unavoidable
One assumption I made early on was:
if a container exists, the system state is consistent
That turns out to be false fairly quickly. Containers crash. Machines restart. State drifts out of sync. So on startup, the system has to reconcile reality:
- what exists in persistent storage (SQLite)
- what is actually running in Docker
- what is orphaned or inconsistent
- what needs to be recovered or cleaned up
This becomes a simple but critical recovery loop:
- load persisted session state
- inspect runtime containers
- match and reconcile
- repair inconsistencies
It’s not exciting, but it’s the kind of thing that determines whether a system feels reliable or fragile.
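The matching step of that loop can be written as a pure function, which also makes it easy to test. The names and the three-way split below are my own sketch; loading from SQLite and listing Docker containers are left out and represented by plain inputs.

```python
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    healthy: list[str] = field(default_factory=list)  # sessions whose container is running
    recover: list[str] = field(default_factory=list)  # sessions whose container is gone
    orphans: list[str] = field(default_factory=list)  # containers no session claims


def reconcile(persisted: dict[str, str], running: set[str]) -> ReconcilePlan:
    """Compare persisted session -> container IDs against running container IDs."""
    plan = ReconcilePlan()
    for session_id, container_id in sorted(persisted.items()):
        if container_id in running:
            plan.healthy.append(session_id)
        else:
            plan.recover.append(session_id)
    plan.orphans = sorted(running - set(persisted.values()))
    return plan
```

Keeping the decision separate from the side effects means the repair step (restart, mark failed, remove orphan) can be chosen per category without touching the comparison logic.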
Observability ends up mattering more than expected
At some point, you can’t reason about the system without visibility into it. So everything starts emitting structured logs:
- execution lifecycle events
- timing information
- exit codes
- resource usage
- policy decisions
Not because it’s “good engineering practice”, but because without it debugging turns into guessing, and for a system that executes arbitrary code, guessing is not acceptable.
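A structured log line here can be as simple as one JSON object per event. The event name and field names in this sketch are invented for illustration; the actual log schema is not described in this post.

```python
import json
import time


def log_event(event: str, **fields: object) -> str:
    """Serialize one lifecycle event as a single JSON line, print it, and return it."""
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

A call such as `log_event("execution.finished", session_id="s1", exit_code=0, duration_ms=42)` produces a line that can be filtered by session or event type later, which is what makes after-the-fact debugging possible.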
What this ended up teaching me
A few things became clear only after the system started stabilizing:
- stateful systems are fundamentally harder than stateless ones, even when they look simpler on the surface
- orchestration is often the real complexity, not execution itself
- abstraction boundaries matter more than implementation details
- concurrency inside shared environments forces you to confront design decisions immediately
- and most importantly, “just use Docker” stops being meaningful once you need lifecycle control
Where this naturally leads next
Once a single-machine system like this starts working, the questions change:
- how do sessions move across machines?
- can environments be snapshotted and restored?
- can execution be replayed deterministically?
- how do you coordinate multiple runtimes safely?
- what does multi-tenant isolation look like at this level?
But that feels like a different class of system entirely.
For now, the more interesting part was simply getting a single machine to behave like a stable, stateful execution environment that doesn’t fall apart under real usage. Even that turned out to be more subtle than it initially looked.