The pipeline that turns a task into a finished, reviewed result without a human in the loop. OpenRouter models call AgentOS MCP tools through a plan/execute/review pipeline with bounded self-correction. The model decides what to do; code decides what is allowed and performs every write. The first loop built on it keeps this website current from the source repo, and the pipeline itself became the node type the task-graph runtime fans out across the portfolio.
AgentOS does not just build software. It runs a content-audit loop that automatically evaluates whether the showcase website communicates the project's achievements, and when it falls short, proposes concrete improvements grounded in git evidence. The loop produces a synthesis layer with fields like whatFeelsUnderstated, whatIsMostImpressiveNow, and recommendedHeroShift, each backed by citations to commits and changelog entries. The model proposes; the evidence decides.
Code, not the model, computes nextBestActionScore using explicit penalties: effort (low 0, medium 4, high 8) and risk (low 0, medium 3, high 6). A +3 grounding bonus rewards fully evidenced proposals. Impact is weighted by stated confidence so ungrounded hype sinks regardless of self-rating.
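As a rough illustration of the scoring described above, here is a minimal sketch. The penalty tables and the +3 bonus come from the text; the `Proposal` shape, the 0 to 10 impact scale, the 0.0 to 1.0 confidence scale, and the way the terms are combined (impact times confidence, minus penalties, plus bonus) are assumptions made for the example, not the actual formula.

```python
from dataclasses import dataclass

# Penalty tables and grounding bonus as stated in the text.
EFFORT_PENALTY = {"low": 0, "medium": 4, "high": 8}
RISK_PENALTY = {"low": 0, "medium": 3, "high": 6}
GROUNDING_BONUS = 3


@dataclass
class Proposal:
    """Hypothetical proposal shape; field names are illustrative."""
    impact: float       # model's stated impact (assumed 0-10 scale)
    confidence: float   # model's stated confidence (assumed 0.0-1.0)
    effort: str         # "low" | "medium" | "high"
    risk: str           # "low" | "medium" | "high"
    claims: int         # number of factual claims in the proposal
    cited_claims: int   # how many of those claims cite evidence


def next_best_action_score(p: Proposal) -> float:
    # ASSUMPTION: the terms are combined additively in this order.
    score = p.impact * p.confidence
    score -= EFFORT_PENALTY[p.effort]
    score -= RISK_PENALTY[p.risk]
    # The bonus applies only when every claim is evidenced.
    if p.claims > 0 and p.cited_claims == p.claims:
        score += GROUNDING_BONUS
    return score


grounded = Proposal(impact=8, confidence=0.5, effort="low", risk="low",
                    claims=3, cited_claims=3)
hype = Proposal(impact=10, confidence=1.0, effort="high", risk="high",
                claims=4, cited_claims=0)
```

Under these assumptions the modest, fully cited proposal outranks the one that rates itself 10 out of 10 but cites nothing, because the penalties and the bonus are applied by code the model cannot influence.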
A structural synthesisGrounding coverage ratio reports how many factual claims cite evidence. Readers can tell demonstrated fact from positioning suggestion at a glance. New sections must pass a three-criteria justification gate (clearCapabilityGap, cannotBeMergedIntoExisting, addsDifferentiation) or be flagged as under-justified.
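The coverage ratio and the justification gate can be sketched as two small functions. The three criterion names are taken from the text; the `Claim` type, the dict-of-booleans representation of a proposed section, and the treatment of an empty claim list are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical claim record: text plus commit or changelog references."""
    text: str
    citations: list = field(default_factory=list)


def synthesis_grounding(claims: list) -> float:
    """Fraction of factual claims that cite at least one piece of evidence."""
    if not claims:
        return 1.0  # ASSUMPTION: no claims means nothing is ungrounded
    cited = sum(1 for c in claims if c.citations)
    return cited / len(claims)


# Criterion names as given in the text.
GATE_CRITERIA = (
    "clearCapabilityGap",
    "cannotBeMergedIntoExisting",
    "addsDifferentiation",
)


def justification_gate(section: dict) -> list:
    """Return the criteria a proposed section fails; an empty list means it passes."""
    return [name for name in GATE_CRITERIA if not section.get(name)]


claims = [
    Claim("Added a retry layer", ["commit:abc123"]),
    Claim("Added a harness", ["changelog:1.2.0"]),
    Claim("Added a write gate", ["commit:def456"]),
    Claim("Best in class"),  # positioning, no evidence
]
proposed = {"clearCapabilityGap": True, "cannotBeMergedIntoExisting": True}
```

A section that fails any criterion is not rejected in this sketch; the returned list is what a report would use to flag it as under-justified.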
While the Website Update loop keeps the site current, the Content Audit loop asks a harder question: is the site as impressive as the work? It reads git history and CHANGELOG since the last sync, compares the evidence to the live site, and produces a three-layer report: synthesis (what changed, why it matters, how to reposition), findings (stale copy, missing concepts), and proposals (website-ready copy with placement anchors). Every proposal cites evidence; every write is safety-gated.
A single successful agent run proves very little. It shows the task can work once, not that it works reliably. The gap between a demo that worked yesterday and infrastructure you trust to run unattended is the whole problem.
This runtime closes that gap. It separates the parts that must be reliable (connecting tools, validating calls, retrying transient failures, applying writes) from the part that is allowed to be creative (the model deciding what to do). The reliable parts are code. The creative part is sandboxed behind a gate it cannot reach around.
The result is a loop that can be measured, hardened, and reused. Build one loop properly and the next one is a prompt and a config entry, not a rewrite.
Live. Two loops in production (Website Update and Content Audit), on a shared pipeline, safety gate, and repeatability harness. The same pipeline is now the node type inside every graph workflow on the runtime.
AgentOS is built for a world where autonomous loops have to be sustainable. A single task can involve planning, execution, review, correction, audit, and summarisation, each step consuming tokens. Running every stage through a premium frontier model turns experimentation into an operating cost problem.
Used where reasoning quality matters most: planning the approach, reviewing outputs, and deciding whether a result should be accepted or corrected.
Used where breadth matters: reading many files, synthesising evidence, and producing richer website audit proposals from long-context project history.
Used for tool-heavy execution: inspecting files, calling MCP tools, and generating structured change-sets that deterministic code can validate.
OpenRouter sits behind this routing layer as the model gateway. The architecture is not “use the strongest model everywhere”; it is “use the cheapest capable model for each stage, keep repeatable work in code, and spend intelligence where it matters”.
Three stages and an optional correction loop. One MCP connection is opened once and shared across every stage. The pipeline holds no transport or model detail itself, so every loop on the site runs on this same engine.
The planner model reads the tool catalogue and the task, then writes a short numbered plan. It executes nothing. Planning and doing are deliberately separate stages.
The executor runs the agent loop, calling MCP tools to carry out the plan and adapting when a step turns out wrong. It is turn-budgeted, so it cannot run forever.
The reviewer checks the result against the original request. It approves, or rejects with a structured reason, what is missing, and a suggested fix.
On rejection, the executor runs again with the reviewer's feedback, up to a set correction budget. The verdict is parsed into structured fields before it goes back, so the executor receives actionable feedback, not raw prose.
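The three stages and the bounded correction loop can be sketched in a few lines. This is a simplified outline under stated assumptions: the `Verdict` fields mirror the ones named above (reason, what is missing, suggested fix), while `run_pipeline`, its callable parameters, and the default budget of 2 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Structured reviewer verdict; field names are illustrative."""
    approved: bool
    reason: str = ""
    missing: str = ""
    suggested_fix: str = ""


def run_pipeline(task, plan, execute, review, max_corrections=2):
    """Plan once, then execute and review, re-executing on rejection.

    Returns (result, attempts, approved). `plan`, `execute`, and `review`
    are callables supplied by the caller (hypothetical interfaces).
    """
    if max_corrections < 0:
        raise ValueError("max_corrections must be >= 0")
    steps = plan(task)              # the planner writes a plan, executes nothing
    feedback = None
    for attempt in range(1, max_corrections + 2):
        result = execute(steps, feedback)   # executor sees structured feedback
        verdict = review(task, result)
        if verdict.approved:
            return result, attempt, True
        feedback = verdict          # parsed fields, not raw prose
    return result, attempt, False   # budget exhausted without approval


# Stand-in stages so the control flow can be exercised without a model.
def fake_plan(task):
    return ["1. update the version string"]


def fake_execute(steps, feedback):
    return "fixed" if feedback else "draft"


def fake_review(task, result):
    if result == "fixed":
        return Verdict(True)
    return Verdict(False, "copy is stale", "version bump", "update the version string")
```

Because the stages are passed in as plain callables, the loop itself stays free of transport and model detail, which matches the separation the text describes.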
Every tool call is parsed and checked against the tool's JSON schema before it reaches the MCP server. The reasoning blocks the open-weight models prepend are stripped first. An invalid call never touches a tool.
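A minimal sketch of that pre-flight check follows. It assumes the reasoning block is wrapped in `<think>` tags (the actual delimiter is not stated here) and it checks only a small subset of JSON Schema: required keys, unknown keys, and primitive types. A production validator would cover the full specification; `parse_tool_call` is a hypothetical name.

```python
import json
import re

# ASSUMPTION: reasoning is delimited by <think>...</think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def parse_tool_call(raw: str, schema: dict) -> dict:
    """Strip reasoning, parse the arguments, and check them against the schema.

    Raises ValueError on any problem, so an invalid call never reaches a tool.
    """
    cleaned = THINK_BLOCK.sub("", raw).strip()
    args = json.loads(cleaned)  # JSONDecodeError is a ValueError subclass
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            # ASSUMPTION: unknown arguments are rejected outright.
            raise ValueError(f"unknown argument: {name}")
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) and spec["type"] != "boolean":
            raise ValueError(f"argument {name} must be {spec['type']}")
        if not isinstance(value, expected):
            raise ValueError(f"argument {name} must be {spec['type']}")
    return args


read_file_schema = {
    "type": "object",
    "properties": {"path": {"type": "string"}},
    "required": ["path"],
}
raw_call = '<think>I should read the README first.</think>\n{"path": "README.md"}'
```

The error message raised here is the kind of text that could be sent back to the model alongside the schema so it can correct the call.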
Transient provider failures (429, 5xx, network) are retried with exponential backoff, invisible to the model. Invalid tool calls are sent back to the model with the schema so it can self-correct, budgeted per tool per run.
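The transient-failure half of this can be sketched as a small wrapper. The status classes (429, 5xx, network) are from the text; `ProviderError`, `call_with_backoff`, the retry count, and the base delay are hypothetical, and a real implementation would usually add random jitter to the delay.

```python
import time


class ProviderError(Exception):
    """Hypothetical error type; status is None for network-level failures."""

    def __init__(self, status=None):
        super().__init__(f"provider error (status={status})")
        self.status = status


def is_transient(status) -> bool:
    # 429, any 5xx, or no status at all (network failure) are retryable.
    return status is None or status == 429 or 500 <= status <= 599


def call_with_backoff(send, max_retries=4, base_delay=0.5, sleep=time.sleep):
    """Call send(), retrying transient failures with exponential backoff.

    The delay doubles on each retry: base_delay, 2x, 4x, and so on.
    Non-transient errors and the final failed attempt are re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except ProviderError as err:
            if not is_transient(err.status) or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the delays can be observed in a test without actually waiting. The caller, and therefore the model, sees either a successful response or the final error, never the intermediate retries.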
The harness runs the same task many times and reports success rate, average attempts, turns, duration, and token cost. Reliability becomes a number you can watch move between changes, not a vibe. One green run proves capability; this measures whether it holds.
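In outline, a harness of this kind only needs to run the task repeatedly and aggregate the per-run records. The sketch below assumes each run returns a `RunResult` with the five measurements listed above; the type, the field names, and the use of raw token counts as the cost figure are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """Hypothetical per-run record."""
    success: bool
    attempts: int      # executor passes, including corrections
    turns: int         # agent-loop turns used
    duration_s: float
    tokens: int


def summarise(runs: list) -> dict:
    """Aggregate repeated runs of the same task into reliability numbers."""
    if not runs:
        raise ValueError("no runs to summarise")
    return {
        "runs": len(runs),
        "success_rate": sum(r.success for r in runs) / len(runs),
        "avg_attempts": mean(r.attempts for r in runs),
        "avg_turns": mean(r.turns for r in runs),
        "avg_duration_s": mean(r.duration_s for r in runs),
        "avg_tokens": mean(r.tokens for r in runs),
    }


def run_harness(run_task, n: int) -> dict:
    """Run the same task n times; run_task() must return a RunResult."""
    return summarise([run_task() for _ in range(n)])


sample = [
    RunResult(True, 1, 4, 10.0, 1000),
    RunResult(False, 3, 12, 30.0, 3000),
    RunResult(True, 2, 8, 20.0, 2000),
    RunResult(True, 2, 8, 20.0, 2000),
]
report = summarise(sample)
```

Comparing two such reports, one from before a change and one from after, is what turns reliability into a number that can be tracked.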
The loop holds no transport or model detail. Four role profiles map to OpenRouter models: planner on DeepSeek V4 Flash, tools on Qwen3 Coder, agentic on Kimi K2.5, cheap on the free Qwen tier. Swapping a provider or adding a second agent means touching the edges, not the loop.
A smoke-test block exercises the write gate, scope allowlist, coverage checks, change-set parsing, and the deterministic apply without hitting a model or the network. The safety-critical paths are tested in isolation.
Two loops run directly on the pipeline, and the graph workflows (code audit, website audit, and the deterministic detectors) wrap it as their node type. Each new loop is a prompt and a config entry on top of the same pipeline, gate, and harness. This list grows; the engine does not change.
The pipeline's first real job is the page you are reading. It keeps the Showcase Website current from each project's source repo, and it is the template every future loop follows.
The model reads the project's CHANGELOG, README, and shipped version, then returns a validated JSON change-set: what is stale, what is current, and why. In apply mode, code performs each edit with an exact-snippet replace and re-reads the file to confirm it took. The model plans. Code applies. A clean run reaches a fixed point: apply, then re-run, and nothing changes.
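The apply step can be sketched as an exact-snippet replace followed by a read-back check. The function name `apply_edit` and the requirement that the snippet match exactly once are assumptions for this example; the text only states that the replace is exact and that the file is re-read to confirm it.

```python
from pathlib import Path


def apply_edit(path: Path, old: str, new: str) -> bool:
    """Replace one exact snippet and confirm the write by re-reading the file.

    Returns True if the file changed, False for a no-op edit.
    """
    if old == new:
        return False  # nothing to do
    text = path.read_text(encoding="utf-8")
    # ASSUMPTION: an edit must match exactly once; zero or several matches
    # mean the change-set no longer describes the file, so refuse to guess.
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for snippet, found {count}")
    expected = text.replace(old, new, 1)
    path.write_text(expected, encoding="utf-8")
    # Re-read to confirm the edit actually took.
    if path.read_text(encoding="utf-8") != expected:
        raise RuntimeError(f"write to {path} could not be confirmed")
    return True
```

Because the edit is anchored to an exact snippet rather than a line number or a regenerated file, a change-set that has drifted out of date fails loudly instead of silently overwriting something else.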
Writes are allowed only inside the website directory, only through exact-snippet replace (never full-file rewrites), only on elements inside the chosen scope, and only up to a per-run cap. No-op edits are skipped. Git operations are blocked. Dry-run intercepts every write, so a preview can never touch a file. The gate is enforced in code, not by asking the model nicely.
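A few of these rules can be sketched as a single gate object that every write must pass through. This covers only the directory allowlist, no-op skipping, the per-run cap, and dry-run; the element-scope check and the git block are left out. `WriteGate`, `GateError`, the default cap of 10, and the decision strings are hypothetical names and values.

```python
from dataclasses import dataclass
from pathlib import Path


class GateError(Exception):
    """Raised when a requested write is not allowed."""


@dataclass
class WriteGate:
    root: Path              # the website directory
    max_writes: int = 10    # ASSUMPTION: illustrative per-run cap
    dry_run: bool = True    # previews by default
    writes: int = 0         # writes granted so far in this run

    def check(self, target: Path, old: str, new: str) -> str:
        """Return "skip", "preview", or "write"; raise GateError if refused."""
        root = self.root.resolve()
        resolved = (root / target).resolve()
        # Resolving first defeats "../" traversal and absolute paths.
        if not resolved.is_relative_to(root):
            raise GateError(f"{target} is outside the website directory")
        if old == new:
            return "skip"       # no-op edits are skipped, not counted
        if self.writes >= self.max_writes:
            raise GateError("per-run write cap reached")
        if self.dry_run:
            return "preview"    # intercepted: nothing touches the file
        self.writes += 1
        return "write"
```

The caller performs the actual edit only when `check` returns `"write"`, so the decision about what is allowed lives in one place and never depends on what the model was told in its prompt.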
The dashboard's Studio hub carries the loops. The Audits view runs single-target code audits as well as website content and facts audits, and opens their reports. The Website Update card offers project and scope selectors with Dry Run and Apply buttons, and Apply confirms before writing. Everything shells out to the same agent-runner command you would run in a terminal, with per-stage invocation and token stats so you can see what each run actually cost.
This page predicted a docs loop, a changelog loop, and a code-review loop. All three now exist, and they arrived as graph workflows rather than standalone loops: documentation-drift and changelog-coverage run as deterministic detectors, and the portfolio code audit runs the full pipeline per project with verification on top. The pipeline, the trust boundary, and the reliability measurement carried over unchanged.