Inside Offices: how we isolate AI agents with per-employee VMs

When you create an Office in HumanikOS, we provision a dedicated virtual machine. Not a container. Not a shared runtime. A real machine with its own filesystem, its own network boundary, its own resource allocation — owned entirely by that one agent.

The isolation is not incidental. It is the point. AI agents are not stateless API handlers. They work on projects over time. They write files, install packages, configure tools, accumulate context. If two agents share a runtime, they share failure modes. One agent's memory leak, runaway process, or filesystem corruption affects the other. Per-VM isolation means each agent's environment is completely independent. What happens in one Office stays in that Office.

State that survives everything

The hardest problem in running persistent AI agents is not getting them started — it is keeping their state intact across the full lifecycle of a machine. VMs restart. Deployments happen. Users go idle for days. The agent needs to come back exactly as it left.

We solved this with a snapshot-based persistence model. When an Office shuts down — whether from idle timeout, a deployment, or a manual stop — we capture the full workspace before the machine goes away. The filesystem, the session state, the in-progress context, the configuration. All of it written to cloud storage.

When the Office wakes again, the machine pulls that snapshot, reinstates the workspace, and applies any configuration changes that were made while it was offline. An agent that was halfway through a task when it went idle picks up where it left off. An agent whose instructions were updated while it slept wakes with those updates already in place. The machine may be brand new. The agent's world is continuous.
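To make the capture-and-restore cycle concrete, here is a minimal sketch in Python. It is not our production code: the function names, the tar.gz format, and the config.json file are assumptions made for illustration, and an in-memory byte string stands in for cloud storage.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def capture_snapshot(workspace: Path) -> bytes:
    """Pack the whole workspace directory into an in-memory tar.gz archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(workspace, arcname=".")
    return buf.getvalue()


def restore_snapshot(snapshot: bytes, workspace: Path, pending_config: dict) -> None:
    """Unpack a snapshot on a fresh machine, then apply changes made while offline."""
    workspace.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(snapshot), mode="r:gz") as tar:
        tar.extractall(workspace)
    # config.json is a hypothetical location for agent configuration.
    config_path = workspace / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.update(pending_config)  # offline edits win over the snapshotted values
    config_path.write_text(json.dumps(config, indent=2))


def demo() -> dict:
    """Simulate shutdown on one machine and wake-up on a brand new one."""
    with tempfile.TemporaryDirectory() as tmp:
        old_vm = Path(tmp) / "old_vm"
        old_vm.mkdir()
        (old_vm / "notes.txt").write_text("halfway through task")
        (old_vm / "config.json").write_text(json.dumps({"model": "a", "tone": "formal"}))

        snapshot = capture_snapshot(old_vm)  # would be uploaded to cloud storage

        new_vm = Path(tmp) / "new_vm"
        restore_snapshot(snapshot, new_vm, {"tone": "casual"})
        return {
            "notes": (new_vm / "notes.txt").read_text(),
            "config": json.loads((new_vm / "config.json").read_text()),
        }
```

The ordering is the important part of the sketch: the snapshot is restored first and the offline configuration changes are layered on top, so the agent wakes with both its old workspace and its new instructions.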

Zero-downtime deployments

Agent runtimes need to be updated. New capabilities, configuration changes, dependency upgrades. Doing this with downtime — stopping the agent, deploying, restarting — is not acceptable for a worker that users depend on.

Offices use a blue-green deployment model. Every Office runs two slots: an active slot serving requests and an inactive slot available for updates. When new code needs to deploy, it builds in the inactive slot while the active slot keeps running. If the build succeeds and passes health checks, we swap: the inactive slot becomes active instantly. If the build fails, nothing changes. The agent keeps working throughout. Users never see a restart.

This also means agent code deployments are always reversible. If a new build introduces a problem, we can swap back to the previous slot without reprovisioning anything.
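The slot mechanics can be sketched in a few lines of Python. This is an illustrative model, not the real implementation: the SlotPair class, the "blue" and "green" slot names, and the build and healthy callbacks are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class SlotPair:
    """Two slots per Office; only one serves requests at a time."""
    slots: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"blue": "v1", "green": None}
    )
    active: str = "blue"

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(
        self,
        version: str,
        build: Callable[[str], bool],
        healthy: Callable[[str], bool],
    ) -> bool:
        """Build into the inactive slot; swap only if build and health check pass."""
        target = self.inactive
        # The active slot keeps serving while these two steps run.
        if not build(version) or not healthy(version):
            return False  # failed deploy: the active slot is untouched
        self.slots[target] = version
        self.active = target  # the swap itself is a single pointer flip
        return True

    def rollback(self) -> None:
        """Swap back to the previous build, which still sits in the other slot."""
        if self.slots[self.inactive] is None:
            raise RuntimeError("no previous build to swap back to")
        self.active = self.inactive
```

In this model a failed build returns False and leaves both the active pointer and the previous build in place, which is what makes rollback a swap rather than a redeploy.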

Self-healing by default

AI agents fail in ways web services do not. They modify their own filesystems. They spawn subprocesses. They write code and then run it. The failure surface is much larger than a traditional application, and restarts need to be graduated — not every failure warrants tearing down a machine.

When an Office encounters a failure, the recovery system works through escalating levels before taking drastic action. Minor failures get an in-process recovery — the agent reloads its context without the VM ever going down. If that does not resolve it, the agent process is cleanly restarted and re-initialized. Only if repeated restarts fail does the system escalate to a full machine replacement — reprovisioning a fresh VM from the last snapshot and reinstating the workspace from there.
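The escalation ladder described above could look roughly like this in Python. The recover function, the level names, and the attempt counts are assumptions for the sake of the example; each action is a callable that returns True once the Office is healthy again.

```python
from typing import Callable, List, Sequence, Tuple

# (level name, recovery action, maximum attempts at this level)
RecoveryStep = Tuple[str, Callable[[], bool], int]


def recover(steps: Sequence[RecoveryStep]) -> str:
    """Try each recovery level in order and return the name of the one that worked."""
    for name, action, attempts in steps:
        for _ in range(attempts):
            if action():
                return name
    raise RuntimeError("all recovery levels exhausted")


def example() -> Tuple[str, List[str]]:
    """A failure that survives a context reload but clears on a process restart."""
    calls: List[str] = []

    def reload_context() -> bool:
        calls.append("reload")
        return False  # the in-process recovery does not help this time

    def restart_process() -> bool:
        calls.append("restart")
        return True

    def replace_vm() -> bool:
        calls.append("replace")
        return True

    level = recover([
        ("reload_context", reload_context, 1),
        ("restart_process", restart_process, 3),
        ("replace_vm", replace_vm, 1),
    ])
    return level, calls
```

Because the loop stops at the first level that succeeds, the expensive machine replacement is only reached after the cheaper options have been exhausted.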

In practice, the vast majority of failures resolve at the first level without any user-visible disruption. The escalation path exists for the rare cases where something is fundamentally broken: a corrupted workspace, a poisoned configuration, or a bug that reproduces on every restart. At that point, starting fresh from a known-good snapshot is the right answer, and the system does it automatically.

Scale-to-zero without losing anything

Dedicated VMs cost money whether they are doing work or not. At scale, leaving every Office running continuously is not economically viable. We built scale-to-zero into the foundation from the start.

When an Office has been idle — no active prompts, no scheduled work — it captures its workspace snapshot and shuts the machine down. Cost drops to zero. When a user sends a message or a scheduled task triggers, the machine wakes, restores state, and is ready to work. The user sees a brief startup period. The agent resumes from exactly where it left off.
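As a rough sketch, the idle and wake decisions can be modeled as a pure function over an Office's state. Everything here is hypothetical, including the OfficeState fields, the action names, and the 15-minute IDLE_TIMEOUT_S value, which is an arbitrary number chosen for the example rather than the real threshold.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_TIMEOUT_S = 900.0  # assumed threshold for illustration only


@dataclass
class OfficeState:
    running: bool
    last_activity: float  # timestamp of the last prompt or task, in seconds
    active_prompts: int = 0
    next_scheduled: Optional[float] = None  # timestamp of the next scheduled task


def next_action(state: OfficeState, now: float, has_incoming: bool = False) -> str:
    """Decide what the control plane should do with this Office right now."""
    scheduled_due = state.next_scheduled is not None and state.next_scheduled <= now

    if not state.running:
        # A stopped Office wakes on a user message or a due scheduled task.
        return "wake" if has_incoming or scheduled_due else "stay_stopped"

    busy = state.active_prompts > 0 or has_incoming or scheduled_due
    idle_for = now - state.last_activity
    if not busy and idle_for >= IDLE_TIMEOUT_S:
        # The snapshot is captured before the machine goes away.
        return "snapshot_and_stop"
    return "keep_running"
```

For example, next_action(OfficeState(running=True, last_activity=0.0), now=1000.0) returns "snapshot_and_stop", while the same call with active_prompts=1 returns "keep_running".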

The economics of per-VM isolation only make sense with scale-to-zero. Without it, the cost of dedicated compute would be prohibitive for all but the most active agents. With it, an agent's compute cost is proportional to the work it actually does, not to the time it exists.
