
Agent Loops

Plan, execute, review, correct: agents that run themselves

The pipeline that turns a task into a finished, reviewed result without a human in the loop. OpenRouter models call AgentOS MCP tools through a plan/execute/review pipeline with bounded self-correction. The model decides what to do; code decides what is allowed and performs every write. The first loop built on it keeps this website current from the source repo, and the pipeline itself became the node type the task-graph runtime fans out across the portfolio.

03 pipeline stages · 04 model roles · 02 live loops · 0 writes by the model

introspective_engineering [01]

A System That Audits Its Own Story

AgentOS does not just build software. It runs a content-audit loop that automatically evaluates whether the showcase website communicates the project's achievements, and when it falls short, proposes concrete improvements grounded in git evidence. The loop produces a synthesis layer with fields like whatFeelsUnderstated, whatIsMostImpressiveNow, and recommendedHeroShift, each backed by citations to commits and changelog entries. The model proposes; the evidence decides.

SchemaVersion 4

Deterministic Decision Engine

Code, not the model, computes nextBestActionScore using explicit penalties: effort (low 0, medium 4, high 8) and risk (low 0, medium 3, high 6). A +3 grounding bonus rewards fully evidenced proposals. Impact is weighted by stated confidence so ungrounded hype sinks regardless of self-rating.
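The scoring rule above can be sketched as a small pure function. The penalty tables and the +3 bonus come from the text; how impact, confidence, and the penalties combine is not stated, so the formula below (impact times confidence, minus penalties, plus bonus) is an assumption, as are the field names `impact`, `confidence`, `effort`, `risk`, and `fullyGrounded`.

```javascript
// Penalty tables and the grounding bonus are taken from the text above.
const EFFORT_PENALTY = { low: 0, medium: 4, high: 8 };
const RISK_PENALTY = { low: 0, medium: 3, high: 6 };
const GROUNDING_BONUS = 3;

// ASSUMPTION: the exact formula is not given in the source. This sketch
// multiplies impact by stated confidence (0..1), subtracts both penalties,
// and adds the bonus only when the proposal is fully evidenced.
function nextBestActionScore(proposal) {
  const { impact, confidence, effort, risk, fullyGrounded } = proposal;
  const effortPenalty = EFFORT_PENALTY[effort];
  const riskPenalty = RISK_PENALTY[risk];
  if (typeof effortPenalty !== "number" || typeof riskPenalty !== "number") {
    throw new Error(`unknown effort/risk level: ${effort}/${risk}`);
  }
  const bonus = fullyGrounded ? GROUNDING_BONUS : 0;
  return impact * confidence - effortPenalty - riskPenalty + bonus;
}

// Two illustrative proposals with the same self-rated impact and confidence.
const groundedScore = nextBestActionScore({
  impact: 10, confidence: 0.5, effort: "low", risk: "medium", fullyGrounded: true,
});
const hypeScore = nextBestActionScore({
  impact: 10, confidence: 0.5, effort: "high", risk: "high", fullyGrounded: false,
});
console.log(groundedScore, hypeScore); // 5 -9
```

Because the function is pure, the same proposal always gets the same score, which is what lets code rather than the model rank the next action.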

Grounding Split

Evidence vs Speculation

A structural synthesisGrounding coverage ratio reports how many factual claims cite evidence. Readers can tell demonstrated fact from positioning suggestion at a glance. New sections must pass a three-criteria justification gate (clearCapabilityGap, cannotBeMergedIntoExisting, addsDifferentiation) or be flagged as under-justified.
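A minimal sketch of the two checks described above. The three criterion names come from the text; the shapes of a claim (an object with an `evidence` array) and of a section (an object with a `justification` map) are assumptions made for illustration.

```javascript
// ASSUMPTION: each claim carries an `evidence` array of citations.
function synthesisGroundingCoverage(claims) {
  if (claims.length === 0) return 0; // avoid dividing by zero; reported as 0 here
  const cited = claims.filter((c) => Array.isArray(c.evidence) && c.evidence.length > 0);
  return cited.length / claims.length;
}

// Criterion names are taken from the text above.
const GATE_CRITERIA = ["clearCapabilityGap", "cannotBeMergedIntoExisting", "addsDifferentiation"];

// ASSUMPTION: a proposed section has a `justification` object of booleans.
function justifyNewSection(section) {
  const failed = GATE_CRITERIA.filter((name) => section.justification?.[name] !== true);
  return { underJustified: failed.length > 0, failed };
}

// Illustrative data only: three of four claims cite evidence.
const coverage = synthesisGroundingCoverage([
  { text: "claim a", evidence: ["commit-1"] },
  { text: "claim b", evidence: ["changelog-entry"] },
  { text: "claim c", evidence: [] },
  { text: "claim d", evidence: ["commit-2"] },
]);
const gateResult = justifyNewSection({
  title: "Proposed section",
  justification: { clearCapabilityGap: true, cannotBeMergedIntoExisting: false, addsDifferentiation: true },
});
console.log(coverage, gateResult);
```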

Loop 02

Content Audit

While the Website Update loop keeps the site current, the Content Audit loop asks a harder question: is the site as impressive as the work? It reads git history and CHANGELOG since the last sync, compares the evidence to the live site, and produces a three-layer report: synthesis (what changed, why it matters, how to reposition), findings (stale copy, missing concepts), and proposals (website-ready copy with placement anchors). Every proposal cites evidence; every write is safety-gated.

why_it_exists [02]

From demo to infrastructure

A single successful agent run proves very little. It shows the task can work once, not that it works reliably. The gap between a demo that worked yesterday and infrastructure you trust to run unattended is the whole problem.

This runtime closes that gap. It separates the parts that must be reliable (connecting tools, validating calls, retrying transient failures, applying writes) from the part that is allowed to be creative (the model deciding what to do). The reliable parts are code. The creative part is sandboxed behind a gate it cannot reach around.

The result is a loop that can be measured, hardened, and reused. Build one loop properly and the next one is a prompt and a config entry, not a rewrite.

The model is the least trusted component in the system. Every tool call it makes is parsed and schema-checked before it reaches a tool. Every write it proposes is applied by code, not by the model.

What it is

01 Plan / execute / review pipeline

02 Bounded self-correction loop

03 Model plans, code applies

04 Reliability harness, not one green run

05 Provider-agnostic by construction

Status

Live. Two loops in production (Website Update and Content Audit), on a shared pipeline, safety gate, and repeatability harness. The same pipeline is now the node type inside every graph workflow on the runtime.

model_routing [03]

Built for the Token Scarcity Era

AgentOS is built for a world where autonomous loops have to be sustainable. A single task can involve planning, execution, review, correction, audit, and summarisation, each step consuming tokens. Running every stage through a premium frontier model turns experimentation into an operating cost problem.

Planning + Review

DeepSeek

Used where reasoning quality matters most: planning the approach, reviewing outputs, and deciding whether a result should be accepted or corrected.

Synthesis + Audit

Kimi

Used where breadth matters: reading many files, synthesising evidence, and producing richer website audit proposals from long-context project history.

Execution + Tools

Qwen

Used for tool-heavy execution: inspecting files, calling MCP tools, and generating structured change-sets that deterministic code can validate.

OpenRouter sits behind this routing layer as the model gateway. The architecture is not “use the strongest model everywhere”; it is “use the cheapest capable model for each stage, keep repeatable work in code, and spend intelligence where it matters”.

model routing
├── plan (DeepSeek) ── reasoning-heavy planning, no tools executed
├── execute (Qwen) ── tool-heavy MCP execution and structured change-sets
├── audit (Kimi) ── long-context synthesis and narrative proposals
└── apply (code) ── deterministic validation, replacement, and structural inserts

the_engine [04]

One Pipeline, Every Loop

Three stages and an optional correction loop. One MCP connection is opened once and shared across every stage. The pipeline holds no transport or model detail itself, so every loop on the site runs on this same engine.

Stage 01

Plan

The planner model reads the tool catalogue and the task, then writes a short numbered plan. It executes nothing. Planning and doing are deliberately separate stages.

Stage 02

Execute

The executor runs the agent loop, calling MCP tools to carry out the plan and adapting when a step turns out to be wrong. It is turn-budgeted, so it cannot run forever.

Stage 03

Review

The reviewer checks the result against the original request. It either approves or rejects; a rejection carries a structured reason, what is missing, and a suggested fix.

On rejection, the executor runs again with the reviewer's feedback, up to a set correction budget. The verdict is parsed into structured fields before it goes back, so the executor receives actionable feedback, not raw prose.

task
└── plan (DeepSeek V4 Flash) ── reads tool catalogue, writes a plan, runs nothing
    └── execute (Qwen3 Coder) ── agent loop: model ⇄ MCP tools, turn-budgeted
        └── review (DeepSeek V4 Flash) ── APPROVED / REJECTED + reason, missing, suggestion
            └── correct ── on REJECTED, re-execute with feedback, up to N times
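The plan, execute, review, and correct flow can be sketched as a loop with a correction budget. This is a simplified illustration, not the runtime's code: the reviewer's text format (a verdict word followed by `reason:`, `missing:`, and `suggestion:` lines) and the names `parseVerdict` and `runPipeline` are assumptions, and the stages are synchronous stubs where real stages would be async model calls.

```javascript
// ASSUMPTION: the reviewer replies with a verdict word on the first line,
// followed by "reason:", "missing:", and "suggestion:" lines.
function parseVerdict(text) {
  const lines = text.trim().split("\n").map((l) => l.trim());
  const field = (name) => {
    const line = lines.find((l) => l.toLowerCase().startsWith(name + ":"));
    return line ? line.slice(name.length + 1).trim() : "";
  };
  return {
    approved: /^APPROVED\b/.test(lines[0]),
    reason: field("reason"),
    missing: field("missing"),
    suggestion: field("suggestion"),
  };
}

// Stages are injected as plain functions, so the loop holds no model detail.
function runPipeline(stages, task, maxCorrections) {
  const plan = stages.plan(task);
  let feedback = null;
  for (let attempt = 0; attempt <= maxCorrections; attempt++) {
    const result = stages.execute(task, plan, feedback);
    const verdict = parseVerdict(stages.review(task, result));
    if (verdict.approved) return { approved: true, attempts: attempt + 1, result };
    feedback = verdict; // structured fields go back to the executor, not raw prose
  }
  return { approved: false, attempts: maxCorrections + 1, feedback };
}

// Stub stages: this executor only succeeds once it has received feedback.
const outcome = runPipeline(
  {
    plan: () => "1. update the version string",
    execute: (_task, _plan, feedback) => (feedback ? "version updated" : "nothing changed"),
    review: (_task, result) =>
      result === "version updated"
        ? "APPROVED"
        : "REJECTED\nreason: no edit made\nmissing: version bump\nsuggestion: replace the old version",
  },
  "bump version",
  2
);
console.log(outcome); // { approved: true, attempts: 2, result: 'version updated' }
```

With a budget of 2, the loop runs the executor at most three times before giving up and returning the last structured feedback.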

reliability [05]

Built to Be Reliable

The model is never trusted

Every tool call is parsed and checked against the tool's JSON schema before it reaches the MCP server. The reasoning blocks the open-weight models prepend are stripped first. An invalid call never touches a tool.
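A rough sketch of that gate, under stated assumptions: reasoning is assumed to arrive wrapped in `<think>...</think>` tags (other models may use a different delimiter), the tool name `read_file` is hypothetical, and the checker below covers only required keys and primitive types. A real gate would use a full JSON Schema validator.

```javascript
// ASSUMPTION: reasoning blocks are delimited by <think>...</think> tags.
function stripReasoning(raw) {
  return raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Deliberately tiny checker: required keys plus string/number/boolean types.
function checkArgs(schema, args) {
  const errors = [];
  for (const key of schema.required ?? []) {
    if (!(key in args)) errors.push(`missing required "${key}"`);
  }
  for (const [key, spec] of Object.entries(schema.properties ?? {})) {
    if (key in args && typeof args[key] !== spec.type) {
      errors.push(`"${key}" should be ${spec.type}`);
    }
  }
  return errors;
}

// Returns { ok: true, call } only when the call parses and passes the schema.
function parseToolCall(raw, toolSchemas) {
  let call;
  try {
    call = JSON.parse(stripReasoning(raw));
  } catch {
    return { ok: false, errors: ["not valid JSON"] };
  }
  if (call === null || typeof call !== "object") {
    return { ok: false, errors: ["tool call must be an object"] };
  }
  const schema = toolSchemas[call.name];
  if (!schema) return { ok: false, errors: [`unknown tool "${call.name}"`] };
  const args = call.arguments;
  if (args === null || typeof args !== "object") {
    return { ok: false, errors: ["arguments must be an object"] };
  }
  const errors = checkArgs(schema, args);
  return errors.length ? { ok: false, errors } : { ok: true, call };
}

// Hypothetical tool catalogue with one tool.
const toolSchemas = {
  read_file: { required: ["path"], properties: { path: { type: "string" } } },
};
const good = parseToolCall(
  '<think>I should read the README first.</think>{"name":"read_file","arguments":{"path":"README.md"}}',
  toolSchemas
);
const bad = parseToolCall('{"name":"read_file","arguments":{"path":42}}', toolSchemas);
console.log(good.ok, bad.errors); // true [ '"path" should be string' ]
```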

Two failure classes, handled separately

Transient provider failures (429, 5xx, network) are retried with exponential backoff, invisible to the model. Invalid tool calls are sent back to the model with the schema so it can self-correct, budgeted per tool per run.
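The transient-failure half of that split might look like the sketch below. The status classes (429, 5xx, network) come from the text; the base delay, the cap, the attempt limit, and the rule that an error without an HTTP `status` counts as a network failure are all assumptions.

```javascript
// Transient: rate limits, server errors, and network failures.
function isTransient(error) {
  const status = error.status;
  if (status === undefined) return true; // ASSUMPTION: no HTTP status means a network error
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay doubles per attempt up to a cap. Base and cap are illustrative values.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries fn on transient errors only; anything else is rethrown immediately.
// `sleep` is injectable so tests can skip the real waiting.
async function withRetry(fn, options = {}) {
  const maxAttempts = options.maxAttempts ?? 4;
  const sleep = options.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (!isTransient(error) || attempt + 1 >= maxAttempts) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

console.log([0, 1, 2, 3, 10].map((n) => backoffDelayMs(n))); // [ 500, 1000, 2000, 4000, 8000 ]
```

The caller of `withRetry` only ever sees the final result or the final error, which is how the retries stay invisible to the model. Invalid tool calls never pass through this path; they go back to the model with the schema instead.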

A repeatability harness, not a single green run

The harness runs the same task many times and reports success rate, average attempts, turns, duration, and token cost. Reliability becomes a number you can watch move between changes, not a vibe. One green run proves capability; this measures whether it holds.
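As an illustration of what such a report computes, the sketch below aggregates per-run records into the metrics listed above. The record shape (`success`, `attempts`, `turns`, `durationMs`, `tokens`) is an assumption, and the sample numbers are made up for the example.

```javascript
// ASSUMPTION: each run record is { success, attempts, turns, durationMs, tokens }.
function summarizeRuns(runs) {
  if (runs.length === 0) throw new Error("no runs to summarize");
  const mean = (pick) => runs.reduce((sum, run) => sum + pick(run), 0) / runs.length;
  return {
    runs: runs.length,
    successRate: runs.filter((run) => run.success).length / runs.length,
    avgAttempts: mean((run) => run.attempts),
    avgTurns: mean((run) => run.turns),
    avgDurationMs: mean((run) => run.durationMs),
    avgTokens: mean((run) => run.tokens),
  };
}

// Illustrative records for four runs of the same task.
const summary = summarizeRuns([
  { success: true, attempts: 1, turns: 4, durationMs: 1000, tokens: 1000 },
  { success: true, attempts: 1, turns: 6, durationMs: 2000, tokens: 2000 },
  { success: true, attempts: 2, turns: 8, durationMs: 3000, tokens: 3000 },
  { success: false, attempts: 4, turns: 10, durationMs: 6000, tokens: 2000 },
]);
console.log(summary);
```

Comparing two such summaries, one before and one after a prompt or config change, is what turns reliability into a number that can be tracked.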

Provider-agnostic by construction

The loop holds no transport or model detail. Four role profiles map to OpenRouter models: planner on DeepSeek V4 Flash, tools on Qwen3 Coder, agentic on Kimi K2.5, cheap on the free Qwen tier. Swapping a provider or adding a second agent means touching the edges, not the loop.
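One way to picture "touching the edges, not the loop" is a lookup table that the pipeline consults by stage. The role names and model names below are the ones given in the text, written as display names rather than verified OpenRouter model IDs; the table layout and the `modelForStage` helper are assumptions.

```javascript
// Display names from the text above, not verified OpenRouter identifiers.
const ROLE_PROFILES = {
  planner: "DeepSeek V4 Flash",
  tools: "Qwen3 Coder",
  agentic: "Kimi K2.5",
  cheap: "Qwen (free tier)",
};

// Stage-to-role mapping as described on this page: plan and review share a role.
const STAGE_ROLES = { plan: "planner", execute: "tools", review: "planner", audit: "agentic" };

// The pipeline asks for a model by stage and never names a provider itself.
function modelForStage(stage, profiles = ROLE_PROFILES) {
  const role = STAGE_ROLES[stage];
  if (!role) throw new Error(`unknown stage: ${stage}`);
  return profiles[role];
}

console.log(modelForStage("review")); // DeepSeek V4 Flash

// Swapping the executor model is a one-line change at the edge.
const swapped = { ...ROLE_PROFILES, tools: "some-other-model" }; // hypothetical name
console.log(modelForStage("execute", swapped)); // some-other-model
```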

Offline test coverage

A smoke-test block exercises the write gate, scope allowlist, coverage checks, change-set parsing, and the deterministic apply without hitting a model or the network. The safety-critical paths are tested in isolation.

the_loops [06]

What Runs on the Pipeline

Two loops run directly on the pipeline, and the graph workflows (code audit, website audit, and the deterministic detectors) wrap it as their node type. Each new loop is a prompt and a config entry on top of the same pipeline, gate, and harness. This list grows; the engine does not change.

Loop 01

Website Update

The pipeline's first real job is the page you are reading. It keeps the Showcase Website current from each project's source repo, and it is the template every future loop follows.

How it works

The model reads the project's CHANGELOG, README, and shipped version, then returns a validated JSON change-set: what is stale, what is current, and why. In apply mode, code performs each edit with an exact-snippet replace and re-reads the file to confirm it took. The model plans. Code applies. A clean run reaches a fixed point: apply, then re-run, and nothing changes.
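The apply step and its fixed-point property can be sketched as follows. This is a simplified illustration: files live in an in-memory `Map` instead of the website directory, the change-set entry shape (`file`, `stale`, `current`, `why`) is an assumption, and the refusal of ambiguous snippets is an added safeguard not described in the source.

```javascript
// ASSUMPTION: a change-set entry is { file, stale, current, why }.
function applyEdit(files, edit) {
  const before = files.get(edit.file);
  if (before === undefined) return { applied: false, reason: "file not found" };
  const at = before.indexOf(edit.stale);
  if (at === -1) return { applied: false, reason: "snippet not found" };
  if (before.indexOf(edit.stale, at + 1) !== -1) {
    return { applied: false, reason: "snippet is ambiguous" }; // added safeguard
  }
  // Exact-snippet replace: only the matched span changes, never the whole file.
  files.set(edit.file, before.slice(0, at) + edit.current + before.slice(at + edit.stale.length));
  // Re-read the file and confirm the new snippet sits where the old one was.
  const after = files.get(edit.file);
  const confirmed = after.slice(at, at + edit.current.length) === edit.current;
  return { applied: confirmed, reason: confirmed ? "ok" : "verification failed" };
}

// Illustrative file and edit; the version numbers are made up.
const files = new Map([["website/index.html", "<span>v1.2.0</span>"]]);
const edit = {
  file: "website/index.html",
  stale: "v1.2.0",
  current: "v1.3.0",
  why: "CHANGELOG lists a newer release",
};

const firstRun = applyEdit(files, edit);
const secondRun = applyEdit(files, edit); // the stale snippet is gone, so nothing changes
console.log(firstRun, secondRun, files.get("website/index.html"));
```

The second call is a small stand-in for the fixed point: once the site is current, applying again finds nothing stale to replace.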

The safety gate

Writes are allowed only inside the website directory, only through exact-snippet replace (never full-file rewrites), only on elements inside the chosen scope, and only up to a per-run cap. No-op edits are skipped. Git operations are blocked. Dry-run intercepts every write, so a preview can never touch a file. The gate is enforced in code, not by asking the model nicely.
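A minimal sketch of a gate enforcing the file-write rules above (the git-operation block is not modelled here). The rules come from the text; the option names, the `website/` root, the default cap of 10, and the edit shape (`kind`, `file`, `element`, `stale`, `current`) are assumptions chosen for illustration.

```javascript
// ASSUMPTIONS: option names, the "website/" root, and the default cap are
// illustrative. Only the rules themselves come from the text above.
function createWriteGate({ root = "website/", scope = [], maxWrites = 10, dryRun = false } = {}) {
  let counted = 0;
  const deny = (reason) => ({ allowed: false, reason });
  return function check(edit) {
    if (edit.kind !== "replace") return deny("only exact-snippet replace is allowed");
    const escapes = edit.file.split("/").includes("..");
    if (escapes || !edit.file.startsWith(root)) return deny("outside website directory");
    if (!scope.includes(edit.element)) return deny("element outside chosen scope");
    if (edit.stale === edit.current) return deny("no-op skipped");
    if (counted >= maxWrites) return deny("per-run cap reached");
    counted++;
    // Dry-run passes every check but still refuses the write itself.
    return dryRun ? deny("dry-run: write intercepted") : { allowed: true, reason: "ok" };
  };
}

// Illustrative edit; the copy strings are made up.
const sampleEdit = {
  kind: "replace",
  file: "website/index.html",
  element: "metrics",
  stale: "41 tools",
  current: "42 tools",
};

const gate = createWriteGate({ scope: ["metrics"], maxWrites: 1 });
console.log(gate(sampleEdit)); // { allowed: true, reason: 'ok' }
console.log(gate(sampleEdit)); // { allowed: false, reason: 'per-run cap reached' }

const preview = createWriteGate({ scope: ["metrics"], dryRun: true });
console.log(preview(sampleEdit)); // { allowed: false, reason: 'dry-run: write intercepted' }
```

The gate is a plain function that code calls before every write, so no prompt wording can change its answer.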

run_it_from_the_editor [07]

Loops in Studio

The dashboard's Studio hub carries the loops. The Audits view runs single-target code audits as well as website content and facts audits, and opens their reports. The Website Update card offers project and scope selectors with Dry Run and Apply buttons, and Apply asks for confirmation before writing. Everything shells out to the same agent-runner command you would run in a terminal, with per-stage invocation and token stats so you can see what each run actually cost.

agent-runner.js (OpenRouter ⇄ MCP tools)
├── --website AgentOS ──▶ the website loop
├── --scope metrics | descriptions | hero | full
└── --mode dry-run (preview) | apply (write)

Dashboard → Agents tab
├── loop pipelines with per-stage run + token stats
└── Website Update card: Dry Run / Apply, confirm on apply
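Since the dashboard shells out to the same command, a card's selectors can be reduced to an argument list. The sketch below is an assumption about how that might look: the flag names and allowed values are taken from the tree above, while the helper name `buildRunnerArgs`, the value-follows-flag syntax, and the dry-run default are not confirmed by the source.

```javascript
// Allowed values as listed in the tree above.
const SCOPES = ["metrics", "descriptions", "hero", "full"];
const MODES = ["dry-run", "apply"];

// ASSUMPTION: each flag takes its value as the next argument. The real
// CLI's parsing rules are not shown in the source.
function buildRunnerArgs({ project, scope, mode = "dry-run" }) {
  if (!project) throw new Error("project is required");
  if (!SCOPES.includes(scope)) throw new Error(`unknown scope: ${scope}`);
  if (!MODES.includes(mode)) throw new Error(`unknown mode: ${mode}`);
  return ["agent-runner.js", "--website", project, "--scope", scope, "--mode", mode];
}

const previewArgs = buildRunnerArgs({ project: "AgentOS", scope: "metrics" });
console.log(["node", ...previewArgs].join(" "));
// node agent-runner.js --website AgentOS --scope metrics --mode dry-run
```

Defaulting to dry-run in this sketch means a caller has to ask for `apply` explicitly, in keeping with the confirm-on-apply behaviour of the card. The resulting array could be handed to Node's `child_process.spawn("node", previewArgs)`.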

what_happened_next [08]

The Hard Part Was Done Once

This page predicted a docs loop, a changelog loop, and a code-review loop. All three now exist, and they arrived as graph workflows rather than standalone loops: documentation-drift and changelog-coverage run as deterministic detectors, and the portfolio code audit runs the full pipeline per project with verification on top. The pipeline, the trust boundary, and the reliability measurement carried over unchanged.