Learning Why AI Agents Need Sandboxes

I started thinking about this after reading Nanoclaw.

The idea that stuck with me was simple: if an agent can run shell commands, it should run them inside a container.

Nanoclaw uses that pattern to give agents real tools without giving them the whole host machine.

I wanted to understand that pattern better, so I built a much smaller experiment.

The old Ruby gem maid let you write rules for cleaning up files. I wanted to make a tiny LLM version of that.

So I built mildred, a command-line maid for your filesystem.

You describe file organization rules in English, and an LLM figures out the shell commands to make them happen.

That immediately creates a safety problem.

Imagine giving an LLM the ability to run shell commands on your actual machine.

The model decides what to run. Your code executes it.

That is a pretty strange thing to trust.

The Boundary

Modern LLMs can use tools. You define a function, and the model decides when to call it and what arguments to pass.

In Mildred, the important tool is RunCommand:

class RunCommand < RubyLLM::Tool
  description "Executes a shell command and returns its output"
  param :command, desc: "The shell command to execute"

  def execute(command:)
    stdout, stderr, status = Open3.capture3("bash", "-c", command)
    { stdout: stdout, stderr: stderr, exit_code: status.exitstatus }
  end
end

That execute method is the trust boundary. The model asks for a command. My code decides whether and where to run it.

At first, I added a noop mode. Instead of executing the command, Mildred can print what it would run.

That is useful, but it is not real isolation.

If I actually want the LLM to move files, I need a safer place for those commands to run.

The Sandbox

This is where the Nanoclaw idea clicked for me.

Run the command in a container, and only mount the directories the tool should touch.

def run_container
  name = "mildred-#{Process.pid}-#{SecureRandom.hex(4)}"
  args = ["container", "run", "-d", "--name", name]
  args.concat(mount_args)  # Only these directories are visible
  args.concat([IMAGE, "sleep", "infinity"])
  # ...
end

If the LLM generates a bad command, the damage is limited to the mounted workspace.

It cannot see my whole home directory. It cannot reach ~/.ssh. It cannot accidentally delete files I never mounted.

I also get a predictable environment. The model is always writing commands for the same small Linux image.

That was the main thing I learned.

The hard part was not asking an LLM to generate shell commands. That part is easy.

The hard part was deciding where trust ends.

For a small student project, this was a useful lesson: when software starts acting on the real world, even in tiny ways, the boundary matters more than the clever part.