How AI Coding Agents Can Actually Finish What They Start: Inside the Goal Feature
By Sean Weldon
TL;DR
The goal feature in CodeX and Hermes agents solves AI's tendency to quit coding tasks prematurely by replacing simple iteration loops with LLM-based judgment of task completion. Success requires explicit definition of done states, comprehensive prompt engineering that defines what to achieve and when to stop, and structured workflows providing continuous context about progress toward multi-hour autonomous coding objectives.
Key Takeaways
AI coding agents consistently declare victory too early—fixing tests for 10-15 minutes then claiming completion without exhaustive resolution, requiring repeated human prompting to continue well-scoped work.
The goal feature replaces programmatic for-loops with intelligent LLM calls that evaluate task satisfaction using continuous context about goal state, eliminating arbitrary iteration limits that couldn't distinguish genuine completion from stopping points.
Effective goal prompts must define four components: what to achieve, what not to change, how to validate progress, and quantifiable completion criteria—occupying middle ground between single prompts and open-ended backlogs.
Initial alignment conversations prove critical for success: discussing project context and constraints, and asking the model to ask questions before starting, prevents the garbage results that come from simply pasting a prompt.
The mission extension enables week-long objectives by scheduling variable-interval runs (hours/days/weeks) with hypothesis formation and human-in-the-loop oversight, rather than continuous loops requiring immediate verifiable feedback.
What Problem Does the Goal Feature Solve?
AI coding agents suffer from a chronic problem: premature task completion. Models declare projects finished despite incomplete work, forcing developers to repeatedly prompt agents to continue execution on well-scoped tickets.
The failure pattern appears consistently across coding tasks. An agent fixes failing tests for 10-15 minutes, then claims the job is done without exhaustive resolution of underlying issues. Complex projects requiring sustained attention fragment into dozens of manual interventions, undermining the promise of autonomous AI assistance.
CodeX and Hermes agents introduced the goal feature to address this fundamental limitation. The feature enables AI to work autonomously on complex, long-running coding tasks by fundamentally changing how agents determine when to stop working.
How Did Goal Features Evolve from Earlier Approaches?
The predecessor system, called "rough loop," ran coding agents in simple programmatic for-loops with maximum iteration limits. Agents executed the same prompt repeatedly until hitting an arbitrary stopping point, a blunt instrument that couldn't distinguish genuine task completion from simply running out of iterations.
The goal feature replaces these dumb programmatic loops with LLM-based judgment of task satisfaction. Instead of receiving an identical prompt on each iteration, agents receive continuous context about goal state and progress toward completion.
An LLM call evaluates whether the goal is satisfied after each work cycle. When the goal remains incomplete, the agent receives a message containing the goal file and a prompt to continue working. This intelligent stopping mechanism eliminates the arbitrary nature of iteration-based limits.
How Does the Goal Feature Architecture Actually Work?
CodeX and Hermes implement goal evaluation differently but achieve the same outcome. CodeX requires agents to explicitly mark completion themselves within their work loop. Hermes uses a separate LLM judge call to evaluate results independently.
The continuous prompt includes a critical instruction that prevents premature stopping: "Do not accept proxy signals as completion by themselves. Only mark goal achieved when audit shows objective actually achieved." This guidance directs agents to verify actual completion rather than accepting superficial indicators.
The goal context provided to agents includes three components:
- The goal file containing the complete objective definition
- Current state showing progress and completed work
- Continuous prompt guidance: "Take next concrete steps. If goal completed, state so explicitly and stop"
The separate LLM call for goal evaluation uses a prompt that specifies the definition of done, the required output format, the goal itself, and how to respond. This architecture creates a judgment layer independent of the agent's own assessment.
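As a rough sketch, the control flow looks something like this (hypothetical TypeScript pseudocode; runWorkCycle, judgeGoal, and summarizeProgress are illustrative stand-ins, not the actual CodeX or Hermes API):

```typescript
// Hypothetical sketch of the goal loop; every name here is illustrative,
// not the actual CodeX or Hermes API.
interface GoalContext {
  goalFile: string; // the complete objective definition
  state: string;    // progress and completed work so far
}

// Stubs standing in for the real agent and judge calls.
declare function runWorkCycle(ctx: GoalContext, prompt: string): Promise<string>;
declare function judgeGoal(goalFile: string, result: string): Promise<"achieved" | "incomplete">;
declare function summarizeProgress(state: string, result: string): string;

async function runGoal(ctx: GoalContext): Promise<void> {
  const continuePrompt =
    "Take next concrete steps. If goal completed, state so explicitly and stop.";
  while (true) {
    // One work cycle: the agent acts with the goal file, current state,
    // and the continuous prompt as context.
    const result = await runWorkCycle(ctx, continuePrompt);

    // A separate LLM judge call (Hermes-style) decides whether the goal is
    // satisfied; CodeX instead has the agent mark completion itself.
    if ((await judgeGoal(ctx.goalFile, result)) === "achieved") break;

    // Goal incomplete: fold the new results into the state and loop again.
    ctx.state = summarizeProgress(ctx.state, result);
  }
}
```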
What Commands Control Goal Feature Execution?
Activation requires running codex features enable goal after listing the available experimental features. The goal feature ships as an experimental capability that must be explicitly enabled before use.
The core command set provides simple control over goal execution:
- /go initiates goal execution when first used
- /go (repeated) checks current status including runtime and token usage
- /go pause or /go clear stops execution immediately
- /side forks the conversation from the current point while the goal continues running independently
The /side command proves particularly valuable for long-running goals. Developers can branch conversations to ask questions or explore alternatives while the main goal execution continues uninterrupted in the background.
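Put together, a session might look like this (a hypothetical transcript; exact output formatting will vary):

```
$ codex features enable goal   # one-time activation of the experimental feature
> /go                          # start executing the goal for this conversation
> /go                          # repeated: reports runtime and token usage so far
> /side                        # fork the conversation; the goal keeps running
> /go pause                    # stop execution immediately (or /go clear)
```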
What Makes a Goal Prompt Effective?
Goal prompts must occupy specific scope boundaries: bigger than one prompt but smaller than an open-ended backlog. Prompts that are too narrow waste the goal feature's autonomous capabilities. Prompts that are too broad lack the specificity needed for the agent to judge completion.
Four essential components define effective goal prompts:
What to achieve: Specific, measurable objectives with clear deliverables that the agent can verify independently.
What not to change: Explicit constraints preventing scope creep or unintended modifications to working systems.
How to validate progress: Concrete verification methods the agent can execute without human intervention.
When to stop: Quantifiable completion criteria that eliminate ambiguity about the finished state.
The agent must know what "done" means before starting work. Fuzzy definitions like "keep going until everything is fixed" cause models to either quit prematurely or spiral into nonsense as they search for undefined completion signals.
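A minimal goal prompt covering all four components might look like this (an illustrative skeleton with made-up paths and numbers, not an official template):

```markdown
# Goal: eliminate failing tests in the payments service

What to achieve: all tests under tests/payments pass via `npm test`.
What not to change: public API signatures and the database schema stay untouched.
How to validate: run `npm test -- tests/payments` after every change set.
When to stop: three consecutive full runs with zero failures, then state completion explicitly.
```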
How Do Goal Prompts Work for Different Task Types?
Migration tasks require explicit visual verification constraints. An effective prompt states: "Migrate project from legacy stack to new stack, ensure all screens stay visually identical, use Playwright interactive to verify." The visual identity constraint prevents the agent from introducing UI regressions while modernizing the underlying technology.
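For the visual-identity constraint, the agent might lean on Playwright's built-in screenshot comparison, along these lines (a minimal sketch assuming @playwright/test; the route list and config are illustrative):

```typescript
// Visual-regression check using @playwright/test's screenshot comparison.
// Routes are illustrative; assumes baseURL is set in playwright.config.ts.
import { test, expect } from "@playwright/test";

const screens = ["/", "/dashboard", "/settings"];

for (const path of screens) {
  test(`screen ${path} stays visually identical`, async ({ page }) => {
    await page.goto(path);
    // Compares against a checked-in baseline image; fails on visual drift.
    await expect(page).toHaveScreenshot(`${path.replace(/\//g, "_")}.png`);
  });
}
```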
Prototype development benefits from structured planning documents. Goal prompts should point to plan.md or PRD files, create tests for each milestone, and verify UI against reference screens. This structure breaks large projects into verifiable checkpoints.
Optimization tasks demand quantifiable target metrics. Prompts must define target metrics, specify evaluation sets, and include explicit stop conditions: "Stop when target is met." Vincent, the OpenClaw maintainer, ran a goal for 3 days across 30 rounds using quantifiable conditions like "find 20 discrete new issues" to prevent premature stopping.
Quantifying requirements explicitly eliminates the ambiguity that causes agents to fail. Concrete numbers and verification methods give the LLM judge clear criteria for evaluating task completion.
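For instance, an optimization goal might quantify its target like this (hypothetical numbers and commands, shown only to illustrate the shape):

```markdown
Target metric: p95 latency on the eval set drops from 800ms to under 200ms.
Evaluation set: run `npm run bench` against fixtures/eval-set.json after each change.
Stop condition: stop when three consecutive runs report p95 under 200ms.
```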
What Lessons Emerge from Production Usage?
Initial alignment conversations with agents prove critical for success. Simply pasting a goal prompt and starting execution yields garbage results most of the time. Developers must invest in context-setting before autonomous work begins.
Effective alignment covers multiple dimensions:
- Project context: What the system does and why it matters
- Quality standards: What good looks like and what constitutes acceptable work
- Historical context: What's been tried before and what bugs keep appearing
- Constraints: What must remain unchanged and what can be modified
Asking the model to ask questions before starting work dramatically improves outcomes. The agent surfaces assumptions, clarifies ambiguous requirements, and identifies potential conflicts before investing hours in the wrong direction.
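A kickoff message along these lines captures all four dimensions (an illustrative example, not a prescribed template):

```
Before you start on the goal below, ask me any questions you need answered.
Context: this is a billing service; correctness matters more than speed.
Quality bar: every change must pass the existing lint and test suites.
History: the flaky websocket tests have burned us twice; do not "fix" them by deleting them.
Constraints: do not touch the public REST API or the migrations folder.
```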
Vincent's 3-day goal execution demonstrates the power of quantifiable stop conditions. By defining discrete, countable objectives like "find 20 discrete new issues," he enabled the agent to work autonomously across 30 rounds without spiraling into nonsense or quitting prematurely.
How Does Go Body Streamline Goal Creation?
Go Body is an open-source tool that helps construct well-formed goal prompts through structured interviews. Running npx go body triggers a CodeX interview that systematically captures the components of effective goals.
The workflow generates two files that work together:
goal.md contains the well-written request with constraints, stop rules, and detailed loop instructions. This file provides the comprehensive context the agent needs to understand the objective.
state.yaml maintains a task list that the agent updates on each loop. The structured state tracking enables the agent to maintain progress awareness across multiple work cycles.
The agent references goal.md for context and updates state.yaml to track progress. A single prompt combined with the goal.md file enables agents to generate necessary assets and build fully functional projects without additional human intervention.
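Between loops, the state file might look something like this (an illustrative shape; Go Body's actual schema may differ):

```yaml
# Illustrative state.yaml; Go Body's actual schema may differ.
goal: "Build the invoice export prototype defined in goal.md"
tasks:
  - id: 1
    title: "Scaffold project and CI"
    status: done
  - id: 2
    title: "Implement CSV export with tests"
    status: in_progress
  - id: 3
    title: "Verify UI against reference screens"
    status: todo
notes: "CSV edge case: multi-currency invoices still fail one test."
```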
What Are the Limitations of the Goal Feature?
The goal feature works effectively for hours-long sessions but fails for week-long or month-long objectives without immediate verifiable feedback. Tasks requiring sustained effort over extended periods need different architectural approaches.
Goals requiring days or weeks between verification cycles exceed the goal feature's design parameters. SEO optimization, ad ROI improvement, and follower growth require time for metrics to materialize before the next iteration can begin.
The continuous loop model assumes the agent can verify progress immediately after taking action. When verification requires waiting hours, days, or weeks for results, the continuous execution model breaks down.
How Does the Mission Extension Address Long-Running Objectives?
The mission concept extends the goal feature for long-running objectives that span weeks or months. Missions capture metrics to optimize rather than immediate tasks to complete.
Mission.md files define the metrics the agent should optimize. The agent forms hypotheses about improvements and completes one step per run, saving its output as artifacts. Instead of continuous loops, missions schedule the next run at variable intervals of hours, days, or weeks, depending on how long the feedback cycle takes.
New sessions receive mission.md and a summary of previous work. The agent evaluates results from the last hypothesis, forms new hypotheses based on data, and executes the next step.
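A mission file can stay short (an illustrative sketch of the described structure, not a documented format):

```markdown
# Mission: grow newsletter signups

Metric to optimize: weekly signups (baseline: 120/week).
Current hypothesis: a shorter landing-page headline lifts conversion.
This run: ship headline variant B; save the draft and results as artifacts.
Next run: in 7 days, compare signup numbers and form the next hypothesis.
Escalate to a human before any change to pricing or domain settings.
```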
Human-in-the-loop capabilities enable agents to message humans when attempting dramatic changes or when goals become unclear or unverifiable. This safety mechanism prevents autonomous systems from making destructive decisions during long-running missions.
A Twitter growth mission demonstrates the mission workflow in practice. The agent iterated from baseline tweets to founder-voice strategy across multiple scheduled runs, doubling performance through hypothesis-driven experimentation over extended timeframes.
What the Experts Say
"The goal feature is probably the most consequential thing they have shipped in CodeX this year."
This assessment reflects the fundamental shift from manual iteration management to autonomous task completion. The goal feature transforms coding agents from tools requiring constant supervision into systems capable of sustained independent work.
"Most time if you just simply paste in a prompt and ask it to do it is likely give you garbage."
This observation captures the critical importance of alignment conversations. The technical capability of the goal feature means nothing without proper context-setting and requirement clarification before autonomous execution begins.
Frequently Asked Questions
Q: How long can a goal run before it needs human intervention?
Goals work effectively for hours-long sessions with immediate verifiable feedback. Tasks requiring days or weeks benefit from the mission extension, which schedules runs at variable intervals and includes human-in-the-loop checkpoints for dramatic changes or unclear objectives.
Q: What's the difference between CodeX and Hermes goal implementations?
CodeX requires agents to explicitly mark completion themselves within their work loop, while Hermes uses a separate LLM judge call to evaluate results independently. Both approaches achieve intelligent stopping but differ in whether the working agent or external judge determines completion.
Q: Can I check on a running goal without interrupting it?
Yes. Running /go again while a goal executes shows current status including runtime and token usage without interrupting execution. The /side command forks conversations to ask questions while the goal continues running independently in the background.
Q: What happens if I don't define clear stop conditions?
Fuzzy definitions like "keep going until everything is fixed" cause models to either quit prematurely or spiral into nonsense. Agents need quantifiable completion criteria—concrete numbers, specific verification methods, or discrete countable objectives—to judge when work is genuinely complete.
Q: How does Go Body improve goal prompt creation?
Go Body conducts a structured CodeX interview via npx go body that generates goal.md with well-written requests, constraints, and stop rules, plus state.yaml for task tracking. This systematic approach captures the four essential components—what to achieve, what not to change, validation methods, and stop conditions.
Q: Why is the initial alignment conversation so critical?
Simply pasting prompts yields garbage results most of the time. Effective alignment discusses project context, quality standards, historical attempts, and constraints. Asking the model to ask questions before starting surfaces assumptions and clarifies ambiguous requirements before hours of autonomous work.
Q: What's the difference between goals and missions?
Goals run continuously for hours with immediate feedback verification. Missions extend goals for week-long or month-long objectives by scheduling runs at variable intervals (hours/days/weeks), forming hypotheses between runs, and including human-in-the-loop oversight for dramatic changes or unclear metrics.
Q: Can goals handle migration projects effectively?
Yes, with explicit constraints. Effective migration prompts specify: "Migrate project from legacy stack to new stack, ensure all screens stay visually identical, use Playwright interactive to verify." The visual identity constraint and verification method prevent UI regressions during technology modernization.
The Bottom Line
The goal feature represents a fundamental shift from supervised AI coding assistance to genuinely autonomous task completion by replacing arbitrary iteration limits with intelligent LLM-based judgment of when work is actually done.
Success requires investment in prompt engineering and alignment conversations before execution begins. Developers must define what to achieve, what not to change, how to validate progress, and quantifiable stop conditions. The initial context-setting conversation—discussing project background, constraints, and asking the model to ask questions—determines whether autonomous execution produces production-quality results or garbage.
Start by enabling the goal feature with codex features enable goal, then use Go Body (npx go body) to construct your first well-formed goal prompt. For tasks spanning hours, use goals with immediate verification. For week-long objectives requiring time between feedback cycles, explore the mission extension with scheduled runs and human-in-the-loop oversight.
Sources
- rIs802-bXDY - Original Creator (YouTube)
- Analysis and summary by Sean Weldon using AI-assisted research tools
About the Author
Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.