A Paradigm Shift – Everyday Reflections on Harness Engineering in Vibe Coding

From Prompts to Standards

One thing has become increasingly clear to me while vibe coding lately: the entire paradigm is shifting from “writing better prompts” to “defining better standards.”

In the past, vibe coding was mostly about conversing with a model: explaining the requirements clearly, asking it to generate code, and then reviewing and adjusting the result ourselves. The key skill at that stage was the ability to express what we wanted through prompts.

As the toolchain matures, however—especially with the emergence of engineering-oriented capabilities such as Goals, Loops, Skills, and Harnesses—the focus of AI coding is beginning to change. It is no longer enough to tell a model what to do; we also need to define what doing it right looks like.

In the past: we relied on execution skill to meet standards that others had defined.
Now: we rely on judgment to turn vague, experience-based notions of “good” into checklists that AI can execute.

Once the standards are clear enough, a model can continuously explore, implement, inspect, and correct its work against them. The human role shifts from “prompting the model line by line” to “designing the task system and its acceptance criteria.”

What Makes a Good Standard?

This makes the quality of the standards themselves critical. To me, a good standard has four properties:

Executable
Verifiable
Clearly bounded
Aligned across layers

The process of establishing these standards is really a process of abstracting the business. In the past, a product manager’s primary output was a PRD. In the future, more of the work may revolve around aligning on standards—and that alignment is how we create a shared objective between the business and AI.

How to Build Good Standards

Step 1: Extract them
Make the “implicit acceptance checklist” in your head explicit.

Step 2: Structure them
Categorize, prioritize, and complete the extracted checklist items.

Step 3: Attach evidence
For every item, ask: “How can I prove that this has been achieved?”

Step 4: Iterate
Look for gaps after every use. Did a new issue surface during acceptance testing? Add another item. Does an item keep passing while the result still feels insufficient? Then the bar is too low—raise it. Iteration matters because good standards are often not written perfectly at the start; they are refined through repeated use.
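
As a rough illustration of Steps 1 to 3, here is a minimal Python sketch of a checklist in which every item carries its own evidence check. The names Criterion and evaluate, the priorities, and the sample page criteria are all hypothetical, made up for this example rather than taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One acceptance item plus the evidence that proves it."""
    name: str
    priority: int                      # 1 = must-have, larger = less critical
    evidence: Callable[[dict], bool]   # returns True when the item is proven


def evaluate(criteria: list[Criterion], artifact: dict) -> list[str]:
    """Return the names of failed criteria, most important first."""
    ordered = sorted(criteria, key=lambda c: c.priority)
    return [c.name for c in ordered if not c.evidence(artifact)]


# Hypothetical checklist for a generated web page.
checklist = [
    Criterion("has a title", 2, lambda a: bool(a.get("title"))),
    Criterion("tests pass", 1, lambda a: a.get("failed_tests") == 0),
    Criterion("loads in under 2 s", 3,
              lambda a: a.get("load_seconds", float("inf")) < 2.0),
]

# Hypothetical measurements collected from one run.
artifact = {"title": "Pricing", "failed_tests": 1, "load_seconds": 3.4}
failures = evaluate(checklist, artifact)
print(failures)  # ['tests pass', 'loads in under 2 s']
```

Step 4 then amounts to editing this list: append a Criterion when acceptance testing surfaces a new issue, or tighten an evidence check when an item keeps passing but the result still feels weak.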

Standardization and Automation: What Makes a Loop Possible

Once the standards are in place, a natural question follows: what comes next?

What can be standardized can also be automated.

This is the underlying logic that allows a Loop command to work. Once a standard has been broken down into verifiable checks, AI can operate in a cycle: execute → self-check → compare against the standard → correct → self-check again.

Without standards, a Loop wanders aimlessly.
With verifiable standards, it becomes a self-correcting engine that can keep running.
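
The cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any specific Loop command is implemented: run_loop, fake_model, and standard are hypothetical names, and fake_model is a stand-in for a real model call.

```python
from typing import Callable


def run_loop(
    execute: Callable[[list[str]], dict],
    check: Callable[[dict], list[str]],
    max_rounds: int = 5,
) -> tuple[dict, list[str], int]:
    """Execute, self-check against the standard, and correct until clean.

    execute(feedback) produces an artifact from the previous round's failures.
    check(artifact) returns the unmet criteria (an empty list means done).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    artifact: dict = {}
    for round_no in range(1, max_rounds + 1):
        artifact = execute(feedback)       # execute (or correct, after round 1)
        feedback = check(artifact)         # compare against the standard
        if not feedback:                   # every criterion met: stop
            return artifact, feedback, round_no
    return artifact, feedback, max_rounds  # budget spent, failures remain


# Hypothetical state of the work; True means the criterion is satisfied.
state = {"compiles": True, "tests_pass": False, "lint_clean": False}


def fake_model(feedback: list[str]) -> dict:
    """Stand-in for a model call: fixes whatever the feedback names."""
    state.update({name: True for name in feedback})
    return dict(state)


def standard(artifact: dict) -> list[str]:
    """The standard, expressed as a list of unmet criteria."""
    return [name for name, ok in artifact.items() if not ok]


result, remaining, rounds = run_loop(fake_model, standard)
print(rounds, remaining)  # 2 []
```

If check had nothing concrete to return, execute would receive no usable feedback, which is the "wandering" case described above.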

A Harness Is Not a Silver Bullet

I have also realized that a Harness is not a silver bullet. If the standards are not explicit enough, a Skill or task may spend a long time exploring without ever producing the expected result. The model may still lack a deep understanding of the industry context, the product goals, or the aesthetic criteria that you yourself may struggle to put into words. The final output may complete the task only “on paper.”

My current practical takeaway is this:

A Harness, combined with iterative Skill prompt tuning and explicit acceptance criteria, still works better than simply letting a model explore freely.

In other words, we should not think of a Harness merely as a way to “let AI run autonomously.” More importantly, AI needs to know which direction is correct, what qualifies as an acceptable result, when it should stop, and when it should keep iterating.
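
One way to make "when to stop and when to keep iterating" explicit is a small decision rule over the number of failed criteria per round. The sketch below is my own assumption about what such a rule could look like; Action, next_action, max_rounds, and patience are hypothetical names and defaults, not part of any Harness product.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"        # acceptance criteria met: stop
    CONTINUE = "continue"    # still improving: keep iterating
    ESCALATE = "escalate"    # stalled or out of budget: hand back to a human


def next_action(failure_counts: list[int],
                max_rounds: int = 6,
                patience: int = 2) -> Action:
    """Decide the next step from the failed-criteria count of each round so far."""
    if not failure_counts:
        return Action.CONTINUE                 # nothing has run yet
    if failure_counts[-1] == 0:
        return Action.ACCEPT
    if len(failure_counts) >= max_rounds:
        return Action.ESCALATE
    # Stalled: the last `patience` rounds did not beat the best earlier result.
    if len(failure_counts) > patience:
        best_before = min(failure_counts[:-patience])
        if min(failure_counts[-patience:]) >= best_before:
            return Action.ESCALATE
    return Action.CONTINUE


print(next_action([5, 3, 1]))     # Action.CONTINUE
print(next_action([4, 2, 0]))     # Action.ACCEPT
print(next_action([3, 2, 2, 2]))  # Action.ESCALATE
```

The ESCALATE branch is the important one here: it covers the case where the task keeps exploring without getting closer to an acceptable result.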

Standardizability Defines the Boundary of AI

This leads to a broader question: where exactly is the boundary of AI capability? We often need to judge which tasks AI can handle and which it cannot.

My current view is that this boundary depends largely on whether the work can be standardized.

If a task has a clearly defined correct answer—or at least explicit acceptance criteria—it is well suited to AI. For example:

Whether a page runs correctly
Whether a feature meets its requirements
Whether the tests pass
Whether the code style is consistent
Whether the output conforms to a structured template

These tasks share a common trait: they can be defined, verified, and run repeatedly.
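
Take the last item as an example. Checking that an output conforms to a structured template can be written as a function that either passes or names what is wrong. This is a minimal sketch using only Python's standard json module; TEMPLATE and template_errors are hypothetical names, and the three fields are invented for illustration.

```python
import json

# Hypothetical template: required field name -> expected Python type.
TEMPLATE = {"title": str, "summary": str, "tags": list}


def template_errors(raw: str, template: dict[str, type] = TEMPLATE) -> list[str]:
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    errors = []
    for field, expected in template.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


print(template_errors('{"title": "A", "summary": "B", "tags": ["x"]}'))  # []
print(template_errors('{"title": "A", "tags": "x"}'))
# ['missing field: summary', 'wrong type for tags: expected list']
```

Because the check returns specific failures rather than a bare pass or fail, the same output can be fed back to the model as correction input.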

But if the work has no fixed standard, or if the judgment depends heavily on human experience, aesthetics, context, and trade-offs, then AI may not be able to complete it independently. It can assist, but it is unlikely to fully replace human judgment.

Further Reflections

OpenAI recently introduced Record & Replay. At its core, it captures a manual workflow and turns it into a reusable Skill, giving an Agent stronger generalization capabilities. This makes the entire RPA workflow much simpler: one recording can be enough to automate it.

This reinforces the same conclusion: capabilities that can be standardized will become less valuable, while the ability to define the standards will become more valuable.

These are some informal reflections. I would be glad to connect with others exploring Harnesses and exchange ideas.

Way to data scientist