Everyone talks about the model. Almost nobody talks about the harness.
That is backwards. The model is the voice. The harness is the body—the part that reaches into the filesystem, runs a command, checks whether the command succeeded, remembers what you asked for three steps ago, and knows when to stop generating and start verifying. Without a harness, you have a brilliant monologue. With one, you have something closer to a worker.
I have been building harnesses longer than I have been impressed by any single model release. The releases matter. They change what the voice can say and how reliably it says it. But the leverage in day-to-day work has almost always come from the infrastructure wrapped around the model: what it can touch, what it is allowed to touch, how errors surface, and how a human stays in the loop without becoming the bottleneck.
A harness, in the plainest sense, is orchestration.
You give the model a task. The harness decides how to break that task into steps, which tools to call, what context to load, and when to hand control back to you. It is not magic. It is plumbing with opinions.
The pieces repeat across every serious setup I have seen:
Tools. Read a file. Edit a file. Run a test. Query an API. Search the codebase. The model proposes; the harness executes—or refuses, if the action is out of bounds.
Context. Not everything belongs in the prompt. A harness gathers the right slices: the file you have open, the failing test output, the schema for the collection you are editing. Good context management is half the game. Bad context is how you get confident answers about the wrong repository.
Memory and state. A single completion has no persistence. A harness carries forward what happened last turn, what branch you are on, what you already tried. Without that spine, every message starts from zero and the system keeps re-learning what you told it five minutes ago.
Verification. Generate, then check. Run the linter. Run the test. Diff the file. The harness treats “it sounded right” as insufficient evidence. This is where most demo-ware falls apart and most useful systems begin.
Human gates. Some steps should pause for approval: destructive commands, deploys, anything irreversible. The harness is also a safety rail—not because the model is evil, but because it is fast and ungrounded.
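To make the list concrete, here is a minimal sketch of three of those pieces (tools, state, and human gates) in one small class. Everything in it is hypothetical and invented for illustration: the names `Tool`, `Harness`, `needs_approval`, and `approve`, and the stand-in lambdas that take the place of real file operations. It is one possible shape, not a reference design, and it leaves out context gathering and verification entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    """Hypothetical tool entry: a callable plus a flag for human approval."""
    run: Callable[[str], str]
    needs_approval: bool = False


@dataclass
class Harness:
    """Hypothetical harness: executes proposed tool calls, or refuses them."""
    tools: Dict[str, Tool]
    approve: Callable[[str, str], bool]  # human gate: (tool name, argument) -> bool
    history: List[str] = field(default_factory=list)  # state carried across turns

    def execute(self, name: str, arg: str) -> str:
        tool = self.tools.get(name)
        if tool is None:
            # Out of bounds: the model proposed a tool that was never registered.
            result = f"refused: unknown tool {name!r}"
        elif tool.needs_approval and not self.approve(name, arg):
            # Gated step that the human (or policy) did not approve.
            result = f"refused: {name!r} was not approved"
        else:
            try:
                result = tool.run(arg)
            except Exception as exc:
                # Surface the error as a result instead of swallowing it.
                result = f"error: {exc}"
        # Record every attempt, including refusals, so later turns can see it.
        self.history.append(f"{name}({arg!r}) -> {result}")
        return result


# Stand-in tools for illustration; a real setup would touch the filesystem.
harness = Harness(
    tools={
        "read_file": Tool(run=lambda path: f"<contents of {path}>"),
        "delete_file": Tool(run=lambda path: f"deleted {path}", needs_approval=True),
    },
    approve=lambda name, arg: False,  # stand-in gate that denies every request
)

print(harness.execute("read_file", "notes.txt"))
print(harness.execute("delete_file", "notes.txt"))
print(harness.execute("format_disk", "/"))
```

The design choice worth noticing is that refusals and errors come back as ordinary results and land in `history`, so the next turn can see what was tried and why it did not happen.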
None of these ideas are new. We have been building variations on them since long before the current wave of LLMs. What changed is that the voice in the middle got good enough that the plumbing around it finally feels worth the effort.
People still argue about which model wins a benchmark. I am more interested in which harness lets a mediocre Tuesday afternoon turn into shipped work.
A stronger model with a weak harness gives you longer, more articulate mistakes. A decent model with a strong harness gives you a repeatable loop: inspect, act, verify, iterate. In production—by which I mean “my actual repo on a day I am tired”—the second arrangement wins more often than the leaderboard suggests.
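The inspect, act, verify, iterate loop is small enough to sketch directly. The function below is an assumption-laden illustration, not a real library API: `run_loop`, `propose`, `apply`, and `verify` are hypothetical names, and the model is reduced to a callable that sees the failure output from earlier attempts.

```python
from typing import Callable, List, Optional, Tuple


def run_loop(
    propose: Callable[[List[str]], str],
    apply: Callable[[str], None],
    verify: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first change that passes verification, or None if none does."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        change = propose(feedback)  # inspect: the proposer sees earlier failures
        apply(change)               # act: make the change
        ok, output = verify()       # verify: check it against something real
        if ok:
            return change
        feedback.append(output)     # iterate: carry the failure into the next try
    return None                     # give up loudly rather than pretend


# Toy stand-ins for a model and a workspace, purely for illustration.
state = {"value": 0}


def toy_propose(feedback: List[str]) -> str:
    # Pretend the model improves by one each time it sees another failure.
    return str(len(feedback) + 1)


def toy_apply(change: str) -> None:
    state["value"] = int(change)


def toy_verify() -> Tuple[bool, str]:
    return state["value"] == 3, f"expected 3, got {state['value']}"


print(run_loop(toy_propose, toy_apply, toy_verify))
```

The bound on attempts matters as much as the loop itself: a harness that retries forever just converts articulate mistakes into expensive ones.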
That is why I keep returning to the spinal-cord metaphor. The interesting subsystems multiply: retrieval, embeddings, tool protocols, background agents, desktop integrations, browser automation, skills files, hooks that fire before a commit. The model is one organ. The harness is what keeps the organism from collapsing into a pile of clever parts.
You do not need a grand unified architecture on day one. You need a loop that fails loudly and recovers cheaply. Start with read access and a test command. Add write access when you trust the verification step. Add memory when you are tired of repeating yourself. Add tools when you notice the model guessing about things it could just look up.
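The "test command" starting point can be very plain. This is a sketch under stated assumptions: `run_check` is a hypothetical helper name, the 60-second default timeout is arbitrary, and the commands in the usage lines are placeholders for whatever your project actually runs. It relies only on `subprocess.run` from the Python standard library.

```python
import subprocess
import sys
from typing import List, Tuple


def run_check(command: List[str], timeout: float = 60.0) -> Tuple[bool, str]:
    """Run a command and return (passed, combined output)."""
    try:
        proc = subprocess.run(
            command,
            capture_output=True,  # keep stdout and stderr for the report
            text=True,            # decode bytes to str
            timeout=timeout,      # do not let a hung test stall the loop
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a timeout counts as a failure, reported loudly.
        return False, f"could not complete {command!r}: {exc}"
    return proc.returncode == 0, proc.stdout + proc.stderr


# Placeholder commands; swap in your own test runner or linter.
passed, output = run_check([sys.executable, "-c", "print('all good')"])
print(passed, output.strip())

passed, output = run_check([sys.executable, "-c", "import sys; sys.exit(1)"])
print(passed)
```

Treating "the command could not even start" as a failed check, rather than raising, keeps the failure visible without crashing the harness around it.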
The harness grows the way any useful system grows: by covering the gaps that hurt.
There is a temptation to treat the harness as temporary, on the assumption that eventually the model will swallow tools, memory, and judgment whole. Maybe. I am skeptical in the short and medium term, for the same reason I am skeptical of any system that confuses fluency with contact with reality.
Reality is messy. Files exist or they do not. Tests pass or they fail. Production breaks at 2 a.m. in ways no training corpus quite captured. Something has to sit between the language and the world, applying permissions, catching errors, and preserving intent across steps. That something will keep earning its keep even as the models improve.
So when someone asks what I am excited about in AI right now, the honest answer is not a single checkpoint file. It is the emerging craft of building harnesses that are boring enough to trust and flexible enough to compound. The model is the spark. The harness is how you build a fire without burning the house down.
The work is unglamorous. It is also, increasingly, where the value lives.