Spinning up one AI agent is a weekend project. Running a hundred of them — each with its own environment, its own secrets, its own skills, its own data access, sleeping when idle and waking on demand — is an infrastructure problem that takes months to get right.
This post is about the second thing. Not the model, not the prompting technique, not the framework. The infrastructure underneath. What it actually takes to run AI agents at scale in a way that is reliable, affordable, and usable by people who are not AI engineers.
Each agent needs a machine to live in
The first decision is the hardest: where does the agent actually run? Containers are tempting because they are cheap and fast to spin up. But containers are ephemeral by design. An AI agent is not a stateless request handler — it is a worker with a desk. It has a filesystem where it keeps its work, a running process it returns to, an environment it has configured for itself. Kill the container and you kill the desk.
We give each agent a dedicated virtual machine. Not shared compute. A real VM with its own CPU allocation, memory, network boundary, and persistent filesystem. When the agent works on a project, the files stay on that machine. When it installs a dependency, that dependency stays installed. When it configures a tool, the configuration is still there the next time the machine wakes up.
The tradeoff is cost. A VM costs real money per hour. At a hundred offices, that adds up fast. So we built scale-to-zero into the foundation. When a user goes idle, the machine captures a full snapshot of the workspace — the filesystem, session state, open context, in-progress work — and writes it to cloud storage before shutting down. When the user returns, the machine wakes, pulls the snapshot, applies any configuration changes made while it was offline, reinstates the agent's context, and resumes. From the agent's perspective, almost nothing happened. From an infrastructure perspective, the machine was completely gone. It costs nothing while sleeping.
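The sleep/wake lifecycle can be sketched in a few lines of Python. This is a minimal illustration, not the real system: the class names, fields, and the dict standing in for cloud object storage are all assumptions.

```python
import json
from dataclasses import dataclass

@dataclass
class Workspace:
    files: dict            # path -> contents
    session: dict          # open context, in-progress work
    config_version: int    # configuration the workspace was built from

class Office:
    """Minimal sketch of scale-to-zero for one office."""
    def __init__(self, store):
        self.store = store          # stand-in for cloud object storage
        self.ws = None              # None while the machine is gone

    def sleep(self):
        # Capture a full snapshot before shutdown: filesystem,
        # session state, and the config version it was built under.
        self.store["snapshot"] = json.dumps({
            "files": self.ws.files,
            "session": self.ws.session,
            "config_version": self.ws.config_version,
        })
        self.ws = None              # machine deallocated; costs nothing now

    def wake(self, latest_config_version):
        snap = json.loads(self.store["snapshot"])
        self.ws = Workspace(snap["files"], snap["session"],
                            snap["config_version"])
        # Apply configuration changes made while the office slept.
        if latest_config_version > self.ws.config_version:
            self.ws.config_version = latest_config_version
```

The key property is in the last few lines of `wake`: the snapshot restores the agent's desk exactly as it left it, then reconciles against whatever configuration changed while the machine did not exist.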
We offer eight machine tiers — from a shared quarter-vCPU for lightweight agents to a dedicated 8-core, 32 GB machine for compute-intensive work. The right-sizing problem is real: an agent doing light research and writing has completely different requirements than one running builds, processing large files, or executing long-running code. Getting this wrong in either direction is expensive.
Waking a machine and giving it work
When a user sends a message to an office that is sleeping, something has to happen. The naive approach is to tell the user "your agent is starting, check back in 30 seconds." We did not think that was acceptable.
Our control plane — Nexus — runs an application load balancer that sits in front of every office. When a prompt arrives for a sleeping office, the load balancer detects no healthy instances, acquires a distributed lock to prevent duplicate spawns, and triggers machine creation. The user's request is held. When the machine comes online and passes its health check, the prompt is forwarded and execution begins. The user sees a brief loading state, not a hard error.
The distributed lock matters more than it sounds. Without it, two simultaneous messages to the same sleeping office would each trigger machine creation independently. You would end up with two machines, two sessions, split state. The lock ensures exactly one machine wakes, and all pending requests queue behind it.
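The double-checked pattern behind this can be sketched with a process-local lock standing in for the distributed one (in production this would be a lock in shared storage, not `threading.Lock`; everything here is illustrative):

```python
import threading

class WakeCoordinator:
    """Sketch: exactly one machine spawn per sleeping office,
    with concurrent requests queueing behind the first."""
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()
        self.machines = {}          # office_id -> machine address
        self.spawn_count = 0

    def _lock_for(self, office_id):
        with self._guard:
            return self._locks.setdefault(office_id, threading.Lock())

    def ensure_awake(self, office_id):
        if office_id in self.machines:          # healthy instance exists
            return self.machines[office_id]
        with self._lock_for(office_id):         # stand-in for distributed lock
            # Re-check after acquiring: a concurrent request may have
            # already spawned the machine while we waited on the lock.
            if office_id not in self.machines:
                self.spawn_count += 1
                self.machines[office_id] = f"10.0.0.{self.spawn_count}"
        return self.machines[office_id]
```

The re-check after acquiring the lock is what prevents the two-machine, split-state failure mode: the second request finds the machine already registered and simply queues behind it.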
Once the machine is awake, dispatching work to it is straightforward: the prompt is proxied through Nexus to the machine's internal address, execution begins, and results stream back in real time through our messaging layer. The API holds no persistent connections — the realtime layer handles delivery.
The environment is not just a container — it is a context
Before an agent can do useful work, it needs to understand its context. Who is it? What can it do? What data can it access? Who is it working for? What integrations are available? What are the rules?
We solve this with a set of context files materialized into the workspace filesystem on every wake. SOUL.md contains the agent's identity, personality, and behavioral instructions. AGENTS.md contains the full registry of agents in the workspace and how they relate to each other. Tool definitions tell the agent what it can actually do. Skill manifests describe the integrations and APIs it has access to.
These files are assembled on every wake from live configuration — not baked into the VM image. Change an agent's instructions in the dashboard at 2 PM, and the next time it wakes up its context reflects that change. Add a new integration before sending a prompt, and the agent knows about it when it starts working.
On top of the filesystem context, we inject environment variables at multiple layers: infrastructure IDs, server config, per-request context, and user-defined variables. The agent always knows what tenant it belongs to, which workspace and office it is running in, and which configuration it should be operating under. This is what makes isolation work correctly across a hundred concurrent offices — each one operating in its own context bubble.
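The layering reduces to a simple precedence rule: later layers override earlier ones. A sketch, with layer names taken from the list above and variable names assumed:

```python
def build_env(infra, server, request, user):
    """Sketch of layered env assembly; later layers win:
    infra IDs < server config < per-request context < user-defined."""
    env = {}
    for layer in (infra, server, request, user):
        env.update(layer)
    return env
```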
Secrets are a first-class concern, not an afterthought
Every real agent needs secrets: API keys for the services it calls, credentials for the integrations it uses, tokens for the data sources it reads. Handling these badly is how you get key leakage, billing fraud, and security incidents.
We built a vault-backed secrets system with two major isolation properties. First: secrets are never stored in plaintext anywhere in the system. Every key is encrypted before it touches our database. What is stored is a vault reference — a pointer to the encrypted value, not the value itself. Second: secrets are injected at the network level, not the application level.
The injection pattern works like this. Each office runs a localhost proxy alongside the agent process. When the agent calls an external API — Anthropic, OpenAI, a third-party service — it targets a local port. The proxy intercepts the request, resolves the appropriate credential from the vault (checking office-level BYOK first, then workspace-level, then platform default), injects the Authorization header, and forwards the request to the real endpoint. The agent never sees the raw key. It never appears in logs. It cannot be extracted from application memory. This is defense-in-depth applied to the AI layer.
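The resolution-and-inject step at the heart of the proxy can be sketched as follows. The vault here is a plain dict stand-in (the real one holds only encrypted references), and the scope keys are assumptions:

```python
def resolve_credential(vault, office_id, workspace_id, provider):
    """Resolution order from the post: office-level BYOK first,
    then workspace-level, then platform default."""
    for scope in (("office", office_id),
                  ("workspace", workspace_id),
                  ("platform", "default")):
        key = vault.get((*scope, provider))
        if key is not None:
            return key
    raise LookupError(f"no credential for {provider}")

def inject_auth(headers, vault, office_id, workspace_id, provider):
    # The proxy adds the header; the agent never sees the raw key.
    out = dict(headers)
    out["Authorization"] = "Bearer " + resolve_credential(
        vault, office_id, workspace_id, provider)
    return out
```

Everything to the left of `inject_auth` lives outside the agent process, which is what keeps the raw key out of logs and application memory.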
For service-to-service authentication, we use a dual-part key format: hsk_<accessKeyId>.hss_<secret>. The access key ID is stored in plaintext for O(1) lookup. The secret is stored as a SHA-256 hash and compared at auth time — never decrypted, never returned after creation. This is the same pattern used by payment APIs. It makes key compromise survivable: rotate the secret, the access key ID stays the same, existing integrations update without reconfiguration.
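A minimal sketch of minting and verifying such a key, assuming hex-encoded parts and an in-memory record (token lengths and storage shape are illustrative):

```python
import hashlib
import hmac
import secrets

def mint_key():
    """hsk_<accessKeyId>.hss_<secret>; only the secret's SHA-256 is stored."""
    access_id = secrets.token_hex(8)
    secret = secrets.token_hex(16)
    stored = {
        "access_id": access_id,   # plaintext, for O(1) lookup
        "secret_hash": hashlib.sha256(secret.encode()).hexdigest(),
    }
    return f"hsk_{access_id}.hss_{secret}", stored

def verify(presented_key, stored):
    try:
        access_part, secret_part = presented_key.split(".")
        access_id = access_part.removeprefix("hsk_")
        secret = secret_part.removeprefix("hss_")
    except ValueError:
        return False
    if access_id != stored["access_id"]:
        return False
    digest = hashlib.sha256(secret.encode()).hexdigest()
    # Constant-time comparison; the secret is never decrypted or returned.
    return hmac.compare_digest(digest, stored["secret_hash"])
```

Rotation follows directly: replace `secret_hash` with the hash of a new secret and the access key ID, and everything keyed on it, stays untouched.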
Skills are how agents know what to do with their environment
Giving an agent a machine, a context, and secrets is necessary but not sufficient. The agent also needs to know how to use the environment it has been given. That is what skills are for.
A skill is a structured capability package: a set of tools the agent can call, prose documentation describing what those tools do and when to use them, behavioral protocols that guide how the agent should approach a domain, and dynamic context compiled from live state at runtime. Skills are not prompts. They are reusable, composable capability modules that can be installed on any agent in any office.
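The shape of such a package can be sketched as a small data structure; the field names and the `compile_context` method are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Illustrative shape of a skill package."""
    name: str
    tools: list        # tool definitions the agent can call
    docs: str          # prose: what the tools do and when to use them
    protocols: list    # behavioral guidance for the domain

    def compile_context(self, live_state: dict) -> str:
        # Dynamic context: recompiled from live state at runtime,
        # so the same skill module adapts to any office it runs in.
        state = ", ".join(f"{k}={v}" for k, v in live_state.items())
        return f"# Skill: {self.name}\n\n{self.docs}\n\nLive state: {state}\n"
```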
We ship four core skill domains. The database skill gives agents the full data plane — query tables, run SQL, read and write object storage, perform semantic search — all scoped to the correct namespace. The office skill gives agents the ability to manage their own environment: provision secrets, register integrations, manage cron schedules, introspect their own configuration. The command skill operates at the workspace level: list all offices, inspect their state, dispatch work to specific agents, suggest routing based on capability. And agents can load skills dynamically mid-conversation via a meta-tool, acquiring new capabilities when the task requires them.
Integration skills are generated automatically. When you connect a third-party API — a CRM, a database, a webhook endpoint — the system reads its OpenAPI spec, generates a skill document that describes the available operations with auth instructions and example calls, and installs it into every office that should have access. The agent receives a skill that says, in plain language: here is what this API does, here is how to authenticate, here are the endpoints you can use. Change the integration credentials, and the skill updates on the next wake. Add a new integration to a workspace and it propagates to every office in that workspace automatically.
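The spec-to-skill step can be sketched as a small transform. This version reads only titles, paths, and operation summaries from an OpenAPI document; a real generator would also cover auth schemes, request schemas, and example calls:

```python
def skill_from_openapi(spec: dict) -> str:
    """Sketch: turn an OpenAPI spec into a plain-language skill doc."""
    lines = [
        f"# {spec['info']['title']}",
        "",
        "Authenticate through the local proxy; never handle raw keys.",
        "",
    ]
    for path, ops in spec.get("paths", {}).items():
        for verb, op in ops.items():
            lines.append(f"- {verb.upper()} {path}: {op.get('summary', '')}")
    return "\n".join(lines)
```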
The reuse mechanism matters for scale. You do not want to configure skills per-office — that would make 100 offices 100x the configuration work. Skills are defined once at the workspace level and inherited by offices based on their configuration. An office can have the database skill, the GitHub integration skill, the Slack integration skill, and three custom skills specific to its role. Another office in the same workspace shares the same GitHub and Slack skills but has a different role and different custom context. One configuration layer, differentiated behavior per agent.
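The inheritance itself is a one-layer resolution, sketched here with an assumed config shape:

```python
def skills_for_office(workspace_skills: dict, office_config: dict) -> list:
    """Sketch: skills defined once at the workspace level, resolved
    per office by name; per-office custom skills layered on top."""
    inherited = [workspace_skills[name] for name in office_config["skills"]]
    return inherited + list(office_config.get("custom_skills", []))
```

Two offices referencing the same workspace-level GitHub skill get the identical definition; their differences live entirely in the custom layer.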
Human interfaces are not optional
The hardest part of building agent infrastructure for real organizations is not the technical complexity — it is making the system usable by people who did not build it. A platform where only AI engineers can configure agents is not a product. It is a proof of concept.
We built two layers for this. The first is Nova: an orchestration agent that operates at the workspace level and understands your entire setup. Nova knows which offices exist, what each agent is capable of, what data is available, and how work should flow. You describe what you want in plain language. Nova figures out which agent should handle it, dispatches the work, monitors progress, and reports back. You do not need to understand the underlying infrastructure to use it. You just need to explain what you want done.
The second is the Board: a visual interface that shows all your offices as nodes on a canvas, with real-time state overlaid — which ones are running, which are idle, which are currently processing a task. You can send a prompt directly to a specific agent from the Board, inspect its current state, review its recent work, adjust its configuration, or provision new secrets — all without touching any API or configuration file. This is the interface for the person managing the workforce, not the person who built it.
We spent considerable time on the experience for non-technical users specifically. When an agent needs credentials it does not have, it surfaces a structured card asking the user to provide them — not an error message, not a log line, but a form with context explaining what is needed and why. Destructive operations require explicit confirmation. Step-by-step planning is exposed as a readable objective list that updates as the agent works through it. The goal is that a non-engineer managing a team of AI agents should feel like they are managing a team of people, not operating a computer system.
The unsexy truth
None of this is glamorous infrastructure work. Machine lifecycle management, secrets rotation, skill propagation, multi-tenant isolation, human orchestration interfaces — these are not the parts of AI that get written up in research papers. But they are the difference between a demo and a product.
If you are serious about running AI agents at scale, in a real organization, for real users — you will build all of this eventually. Either you build it yourself, or you use a platform that has already built it. The question is how much of your time you want to spend on the plumbing versus the work the agents are actually doing.
That is the problem HumanikOS was built to solve.


