Most teams are still treating agent behavior like handcrafted prompt art. That works right up until the agent gets real tool access, starts touching production systems, or needs to behave consistently across repos, environments, and sessions.
That’s where Harness as Code comes in.
The short version: Harness as Code takes the ideas that made Infrastructure as Code practical and scalable and applies them to AI agents. Instead of hiding governance inside application code or hoping a giant system prompt keeps your agent safe, you define the harness itself as version-controlled, reviewable, testable artifacts.
[Figure: From DevOps to Agent Governance: same principles, new domain]
The Problem Prompt Engineering Can’t Solve
Anthropic has been explicit that harness design matters for long-running agents. In its engineering write-up on effective harnesses for long-running agents, the company describes the harness as the layer that helps agents keep making progress across multiple context windows. In its earlier post on building effective agents, Anthropic also argues that simple, composable patterns beat unnecessary framework complexity.
I think that’s the right direction, but the industry still underspecifies one crucial idea: the harness should be code, not folklore.
Once agents move beyond toy demos, every team hits the same questions:
- How do I control what tools an agent can call?
- How do I review behavior changes in pull requests?
- How do I reproduce the same governance in another repo or environment?
- How do I know what context the agent actually saw on turn 27?
- How do I test that my guardrails work before trusting the agent with real autonomy?
If the answer is “we have a really good prompt,” you don’t have governance. You have hope.
What Harness as Code Actually Means
HashiCorp describes Infrastructure as Code as managing systems through declarative, version-controlled definitions that you can review, test, and automate. Harness as Code takes that same mental model and applies it to agent runtime behavior.
For me, a system only qualifies as Harness as Code if it gives you these properties:
- Declarative — behavior is defined in files, not buried in runtime branches
- Versioned — harness changes go through Git like any other engineering change
- Reviewable — permissions, hooks, and context rules show up in diffs
- Composable — you can layer capabilities without rewriting the core
- Observable — you can inspect what was active and why
- Testable — you can validate behavior in CI instead of relying on vibes
- Portable — the harness survives model churn and vendor churn
That is the big leap. The prompt stops being the product. The harness becomes the product.
Why This Matters
The DevOps parallel is real.
| DevOps gave us | Harness as Code gives agents |
|---|---|
| Infrastructure as Code | Agent governance as code |
| CI/CD gates | Approval and autonomy gates |
| RBAC / least privilege | Tool access boundaries |
| Build pipelines | Agent loops with retries and termination |
| Observability | Context provenance and event trails |
| Hooks and policy checks | Pre-tool and post-tool governance |
This matters because the failure mode for agents is rarely raw model quality. It is almost always control-plane quality.
An agent fails because it had the wrong tools, the wrong context, no retry policy, no safety hook, no way to explain its current state, or no clean boundary between static identity and dynamic runtime behavior. That’s a harness problem.
If you care about context engineering, this is the missing operational layer. Context engineering decides what the model should see. Harness as Code decides how that decision is defined, evaluated, audited, and evolved over time.
How It’s Different From Existing Approaches
A lot of current agent tooling is useful. I use and study these systems constantly. But they’re optimized around different centers of gravity.
- GitHub Copilot cloud agent is optimized around GitHub-native repo work in a GitHub-hosted environment.
- OpenAI’s Agents SDK is optimized around code-first orchestration, tools, guardrails, and state inside your application.
- OpenAI Sandbox Agents cleanly split harness and compute, which is an important architectural move.
- Pi is one of the strongest examples of a minimal terminal coding harness, extended through TypeScript extensions, skills, prompt templates, themes, packages, and multiple runtime surfaces.
Those are real strengths. But Harness as Code has a different bias: extensibility through portable governance artifacts.
That means the center of the system should stay tiny while the edges get powerful. It also means behavior should not require rewriting the runtime every time you want a new rule. Add an artifact. Add a hook. Add a condition. Review the diff. Re-run validation. Ship.
That’s a very different philosophy from both prompt-heavy setups and batteries-included mega-frameworks.
How AI Harness Implements Harness as Code
AI Harness is my reference implementation of this idea. The repo tagline says it plainly: declarative AI agent governance in Go.
Here’s how the product makes Harness as Code concrete.
1. Markdown-first control plane
The harness starts with harness.md and a .harness/ directory tree. Identity, tools, hooks, and sub-agents are defined as files you can diff, review, and move between projects.
That sounds simple, but it’s the whole point. The harness isn’t hidden behind a SaaS UI or locked into a provider-specific workflow. It lives in the repo, next to the code it governs.
2. Typed artifacts instead of loose files
AI Harness doesn’t treat all context as an undifferentiated blob. It introduces a typed artifact model with explicit precedence:
override (100) > harness (80) > builtin (60) > plugin (40) > model (20)
[Figure: Typed Artifact Precedence: deterministic composition, not accidental overrides]
That gives each capability a declared role, a priority, and composition semantics. Instead of asking, “Why did this rule win?” you can answer it deterministically.
This is one of the key differences between generic file-based customization and real Harness as Code. Composition is not accidental. It’s designed.
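Here is a small Go sketch of what deterministic resolution under that precedence could look like. The source names and numeric priorities come from the precedence line above; the Artifact type, the Resolve function, and the example keys are hypothetical and not taken from the AI Harness codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// Source identifies where an artifact came from.
type Source string

const (
	Override Source = "override"
	Harness  Source = "harness"
	Builtin  Source = "builtin"
	Plugin   Source = "plugin"
	Model    Source = "model"
)

// priority mirrors the documented precedence values.
var priority = map[Source]int{
	Override: 100, Harness: 80, Builtin: 60, Plugin: 40, Model: 20,
}

// Artifact is a hypothetical rule keyed by the setting it controls.
type Artifact struct {
	Key    string
	Value  string
	Source Source
}

// Resolve returns, for each key, the artifact from the highest-priority
// source. The stable sort keeps declaration order for equal priorities,
// so the same input always produces the same winners.
func Resolve(artifacts []Artifact) map[string]Artifact {
	sorted := make([]Artifact, len(artifacts))
	copy(sorted, artifacts)
	sort.SliceStable(sorted, func(i, j int) bool {
		return priority[sorted[i].Source] > priority[sorted[j].Source]
	})
	winners := make(map[string]Artifact)
	for _, a := range sorted {
		if _, seen := winners[a.Key]; !seen {
			winners[a.Key] = a
		}
	}
	return winners
}

func main() {
	winners := Resolve([]Artifact{
		{Key: "shell.allowed", Value: "true", Source: Plugin},
		{Key: "shell.allowed", Value: "false", Source: Harness},
		{Key: "tone", Value: "terse", Source: Model},
	})
	w := winners["shell.allowed"]
	fmt.Printf("shell.allowed=%s (from %s)\n", w.Value, w.Source)
}
```

In this example the harness-level rule (80) beats the plugin-level rule (40), so the answer to "why did this rule win?" is a lookup in the priority table rather than a guess about load order.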
3. Per-turn evaluation, not startup-only config
AI Harness evaluates artifact conditions every turn. If an artifact says it should only activate in review mode, after multiple errors, or once the session reaches a certain phase, the runtime reevaluates that condition on each turn instead of deciding once at startup.
The implementation uses Starlark for those conditional expressions, which keeps the language familiar and constrained while making the runtime dynamic.
That means governance can evolve with the session:
- an error-recovery artifact can activate after repeated failures
- a language-specific ruleset can appear only when relevant files are active
- a stricter override can kick in when risk increases
That’s the difference between static config and living governance. If you want the deeper implementation details, I wrote a separate breakdown on per-turn evaluation and dynamic governance.
[Figure: Per-Turn Evaluation: static config is hope; per-turn evaluation is engineering]
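The per-turn idea can be sketched in a few lines of Go. To keep the example self-contained, plain Go functions stand in for the Starlark expressions, and the SessionState fields, artifact names, and the threshold of three errors are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// SessionState is a hypothetical snapshot the runtime rebuilds each turn.
type SessionState struct {
	Turn        int
	Mode        string
	ErrorCount  int
	ActiveFiles []string
}

// ConditionalArtifact pairs a name with an activation condition.
// A Go func stands in here for a Starlark expression.
type ConditionalArtifact struct {
	Name string
	When func(SessionState) bool
}

// ActiveFor re-evaluates every condition against the current state and
// returns the names of matching artifacts in declaration order.
func ActiveFor(state SessionState, artifacts []ConditionalArtifact) []string {
	var active []string
	for _, a := range artifacts {
		if a.When(state) {
			active = append(active, a.Name)
		}
	}
	return active
}

// hasFileWithSuffix reports whether any active file ends with suffix.
func hasFileWithSuffix(files []string, suffix string) bool {
	for _, f := range files {
		if strings.HasSuffix(f, suffix) {
			return true
		}
	}
	return false
}

func main() {
	artifacts := []ConditionalArtifact{
		{Name: "error-recovery", When: func(s SessionState) bool { return s.ErrorCount >= 3 }},
		{Name: "go-rules", When: func(s SessionState) bool { return hasFileWithSuffix(s.ActiveFiles, ".go") }},
		{Name: "strict-review", When: func(s SessionState) bool { return s.Mode == "review" }},
	}
	turns := []SessionState{
		{Turn: 1, Mode: "build"},
		{Turn: 2, Mode: "build", ActiveFiles: []string{"main.go"}},
		{Turn: 3, Mode: "review", ErrorCount: 3, ActiveFiles: []string{"main.go"}},
	}
	// The active set is recomputed from scratch on every turn.
	for _, s := range turns {
		fmt.Printf("turn %d: %v\n", s.Turn, ActiveFor(s, artifacts))
	}
}
```

Nothing is cached between turns in this sketch, which is the property that matters: an artifact that was inactive on turn 2 can become active on turn 3 purely because the session state changed.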
4. Context observability as a first-class feature
This is the feature I think most harnesses still underinvest in.
AI Harness ships a harness context command so you can inspect what the agent sees, where each section came from, which artifacts are active, which are inactive, and how much of your token budget is already used.
That matters because agent behavior is downstream of context state. If you can’t inspect context composition, you’re debugging a black box.
5. A tiny core with powerful edges
The repo’s philosophy is simple: keep the core tiny and make the edges powerful.
AI Harness already ships commands like:
go install github.com/htekdev/ai-harness/cmd/harness@latest
harness init my-agent
harness validate
harness artifacts --verbose
harness context --verbose
That command set reflects the product thesis. Scaffold fast. Validate fast. Inspect the harness. Inspect the active context. Don’t bury governance inside a maze of framework internals.
Where I Think This Goes Next
I don’t think “harness engineering” is a side topic. I think it becomes its own discipline.
The same way teams eventually stopped debating whether infrastructure should be hand-managed, teams will stop debating whether agent behavior should live in undocumented prompt glue. They’ll expect agent governance to be:
- versioned
- inspectable
- testable
- composable
- vendor-portable
That’s why I keep framing Harness as Code as the DevOps of AI agents. It’s not about replacing good prompts. It’s about putting prompts in their proper place: as one input inside a larger, engineered runtime.
If you want the broader market landscape, read my live comparison of agent harnesses. If you want the product implementation, start with the AI Harness repository and treat it as a working reference, not just a pitch.
The Bottom Line
Harness as Code is the shift from “trust the model” to “trust the system around the model.”
That’s the move from prompts as governance to architecture as governance. And once agents start doing real work in real environments, I don’t think that move is optional.
Your model is not your control plane. Your harness is.