A chronological account of how AgentOS went from a folder of context templates to a personal agentic infrastructure, and then kept going: context portfolio, local MCP server, three AI clients, token tracking, a VSCode dashboard, and from late June a task-graph runtime with deterministic verification, gated applies, and the Studio operator hub. Built in sessions from 28 April 2026 onward.
Each phase produced something independently useful. The portfolio is useful without a server. The server is useful without a dashboard. But stacked, they compound into something that makes every AI session faster by default.
Worked through the AIDB AgentOS templates to fill out all ten portfolio files: identity, role, projects, team, tools, communication style, goals, preferences, domain knowledge, and decisions. Each file is a single page of dense, machine-readable markdown describing how I actually work. Merged Claude Code's session-context file with the richer content from the identity file, creating a single authoritative file that loads automatically every session. Cleaned up duplication across all ten files so each is self-contained.
Built a Node.js MCP server using the official Model Context Protocol SDK with StreamableHTTP (stateless) transport. Every portfolio file exposed as a typed portfolio:// resource over a loopback endpoint. Files read from disk on every request, so edits are live without a restart. Wrapped in pm2 with Task Scheduler auto-start on Windows login. The server now runs continuously in the background without any manual steps.
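A minimal sketch of the read-on-every-request idea, assuming each resource maps to one markdown file named after a slug. The slugs, `resolvePortfolioUri`, `makeResourceReader`, and the injected `readFile` callback are hypothetical names for illustration; the MCP SDK wiring is deliberately left out.

```typescript
// Hypothetical sketch: the real server's URI-to-file mapping is not shown in this log.
type ReadFile = (path: string) => string;

// Assumed slugs for the ten portfolio files; the allowlist also blocks path traversal.
const PORTFOLIO_FILES: readonly string[] = [
  "identity", "role", "projects", "team", "tools",
  "communication-style", "goals", "preferences", "domain-knowledge", "decisions",
];

function resolvePortfolioUri(uri: string, rootDir: string): string {
  const prefix = "portfolio://";
  if (uri.indexOf(prefix) !== 0) {
    throw new Error(`Unsupported URI: ${uri}`);
  }
  const name = uri.slice(prefix.length);
  if (PORTFOLIO_FILES.indexOf(name) === -1) {
    throw new Error(`Unknown portfolio resource: ${name}`);
  }
  return `${rootDir}/${name}.md`;
}

// No cache: readFile runs on every call, so an edit on disk is visible on the next request.
function makeResourceReader(rootDir: string, readFile: ReadFile) {
  return (uri: string): { uri: string; mimeType: string; text: string } => ({
    uri,
    mimeType: "text/markdown",
    text: readFile(resolvePortfolioUri(uri, rootDir)),
  });
}
```

In a real server `readFile` would wrap a synchronous or awaited disk read; injecting it keeps the sketch testable without touching the filesystem.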
Built a VSCode extension with two surfaces: a sidebar TreeView (always visible) and a WebviewPanel (on demand). One polling loop drives both. Status endpoint added to the MCP server returning metadata, resource list, request count, and a rolling log of the last 100 requests. pm2 Start, Stop, and Restart buttons inside the editor. Click a portfolio file to open it. Once I could see the request log live, I started using the system differently: watching which resources Claude actually pulls is feedback I never had before.
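The rolling request log can be sketched as a small bounded buffer. This is an illustration under assumptions, not the server's code: `RollingLog` and its method names are hypothetical, and only the capacity of 100 comes from the description above.

```typescript
// Hypothetical sketch of a bounded request log: keeps the newest `capacity` entries.
class RollingLog<T> {
  private readonly capacity: number;
  private entries: T[] = [];
  private total = 0;

  constructor(capacity: number = 100) {
    this.capacity = capacity;
  }

  push(entry: T): void {
    this.total += 1; // the request count keeps growing even after old entries drop out
    this.entries.push(entry);
    if (this.entries.length > this.capacity) {
      this.entries.shift(); // discard the oldest entry
    }
  }

  count(): number {
    return this.total;
  }

  // Return a copy so callers cannot mutate the log through the status payload.
  snapshot(): T[] {
    return this.entries.slice();
  }
}
```

A status handler could then return `{ requestCount: log.count(), log: log.snapshot() }` alongside the metadata and resource list.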
Added Skills and Memory tabs to the dashboard. Skills tab lists all Claude Code commands from global and project directories with scope badges. Memory tab parses project memory files with YAML frontmatter. Custom slash commands created for the full post-build workflow: update changelog, update docs, sync showcase copy, commit, push, all chained via a single jn-website-commit orchestrator with a confirmation gate at each step.
Dashboard panel now fills the window and resizes with it. Active tab content scrolls; status card and tab nav stay fixed at the top. Fixed a critical bug: log-skill.js used CommonJS require in an ESM package, so every skill invocation silently threw a ReferenceError and nothing was posted to the server. Fixed with an ES module import. Restarted the pm2 process to pick up the fix.
Claude global config made portable: skills and memory versioned in the AgentOS repo and synced to ~/.claude/ via sync.ps1. The global CLAUDE.md now travels with the repo and restores on any machine with a single push. Portfolio MCP section added to the project CLAUDE.md, instructing Claude to use portfolio:// URIs when reading portfolio files rather than searching the filesystem.
Codex connected to AgentOS via an HTTP MCP bridge (no separate server). Portfolio-read events from Codex now flow back to the shared dashboard server and appear tagged in the request log. A global config/projects.md registry defined paths, changelogs, and web page targets for all four projects. All skills refactored to read from the registry rather than hardcoding paths. $env:AGENTOSROOT eliminates usernames and machine-specific paths from the repo.
End-to-end token tracking built into the server: a normalisation pipeline converts token usage across providers, per-skill and per-tool totalTokens accumulated alongside invocation counters, and cumulative all-time totals persisted in persist.json. Both log-skill.js and log-tool.js now extract token usage from Claude hook payloads and include it in their POST bodies. Session and all-time totals returned in /status.
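A rough sketch of what such a normalisation step might look like. The field names `input_tokens`/`output_tokens` and `prompt_tokens`/`completion_tokens` follow the usage objects returned by the Anthropic and OpenAI APIs respectively; the exact shape of the Claude hook payload, and the names `normaliseUsage`, `recordInvocation`, and `SkillStats`, are assumptions.

```typescript
// Hypothetical sketch: one internal shape regardless of which provider reported usage.
interface RawUsage {
  input_tokens?: number;      // Anthropic-style
  output_tokens?: number;     // Anthropic-style
  prompt_tokens?: number;     // OpenAI-style
  completion_tokens?: number; // OpenAI-style
}

interface NormalisedUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

function normaliseUsage(raw?: RawUsage): NormalisedUsage {
  const inputTokens = raw?.input_tokens ?? raw?.prompt_tokens ?? 0;
  const outputTokens = raw?.output_tokens ?? raw?.completion_tokens ?? 0;
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}

interface SkillStats {
  invocations: number;
  totalTokens: number;
}

// Accumulate token totals next to the invocation counter, keyed by skill name.
function recordInvocation(
  stats: Record<string, SkillStats>,
  skill: string,
  raw?: RawUsage,
): void {
  const current = stats[skill] ?? { invocations: 0, totalTokens: 0 };
  stats[skill] = {
    invocations: current.invocations + 1,
    totalTokens: current.totalTokens + normaliseUsage(raw).totalTokens,
  };
}
```

A payload with no usage block still counts as an invocation and contributes zero tokens, which keeps the counters and the token totals consistent with each other.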
All ten portfolio-read skills now spawn Haiku sub-agents for resource retrieval. Token overhead shifted from the main session to cheaper Haiku contexts. The same delegation pattern applied to orchestration: jn-website-commit delegates changelog, docs, git, and showcase steps to sub-agents. The --auto flag added to five skills, allowing the orchestrator to skip confirmation gates when chaining. Only the changelog draft step still pauses for review.
MCP server extended to aggregate tokens per model (Haiku, Sonnet, Opus). Dashboard status card redesigned as a two-row table with session and all-time columns. Run buttons added to each skill entry for direct invocation from the dashboard, no terminal required. log-skill.js and log-tool.js now extract the model from Claude hook payloads and include it in POST bodies. 60-second auto-save interval added to server.js for regular state persistence between requests.
Setup Wizard webview panel added: gather AGENTOSROOT, validate prerequisites, run setup steps with progress feedback, auto-close on success. SyncDiffPanel added: shows file-by-file differences between canonical repo and runtime ~/.claude/ state. Agents tab added alongside Skills and Memory. Server uptime and restart tracking made persistent across restarts. Graceful shutdown handlers flush state to disk before exit.
Skills, memories, and agents can now be created directly from the dashboard via wizard forms: fill in the fields, hit Create, the file lands in the correct directory with proper structure and YAML frontmatter. Remove buttons with confirmation dialogs enable deletion without touching the filesystem manually. Agent Run button wired to fire agents via the Claude Code extension directly. Search and filter controls added to all four registry tabs. New /agent-log server endpoint mirrors the skill and tool log pattern, tracking agent invocations with token usage and per-agent stats.
Cursor connected to the shared AgentOS MCP server via Streamable HTTP. Command logging via hooks.json posts Cursor slash command invocations to /command-log. Dashboard updated with client pill selectors in Setup and Analytics tabs. All skills renamed to the new Commands nomenclature and moved to clients/claude/commands/. bootstrap.ps1 added for automated new-machine setup. Codex README updated with gateway pattern documentation.
Global memory files and skills moved from per-client directories into a single clients/shared/ directory. All three sync scripts updated to source from shared. Commands tab deduplication implemented: shared skills appearing in multiple clients collapse to a single row with stacked tool badges. An "All" badge replaces individual badges when all three clients carry the command. Scope terminology standardised from "local" to "project" throughout. Client colour assignments corrected: Claude = blue, Cursor = orange (previously swapped). Codex rate limit adapter fixed for expired rate limit windows.
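The deduplication rule can be illustrated with a short sketch. The types and the function name `collapseCommands` are hypothetical; only the behaviour (one row per command, stacked badges, an "All" badge when all three clients carry it) is taken from the description above.

```typescript
// Hypothetical sketch of the Commands tab deduplication.
type Client = "claude" | "codex" | "cursor";

interface CommandEntry {
  name: string;
  client: Client;
}

interface CommandRow {
  name: string;
  badges: string[];
}

const CLIENT_ORDER: Client[] = ["claude", "codex", "cursor"];

function collapseCommands(entries: CommandEntry[]): CommandRow[] {
  const names: string[] = []; // preserves first-seen order of commands
  const clientsByName = new Map<string, Client[]>();

  for (const entry of entries) {
    let clients = clientsByName.get(entry.name);
    if (clients === undefined) {
      clients = [];
      clientsByName.set(entry.name, clients);
      names.push(entry.name);
    }
    if (clients.indexOf(entry.client) === -1) {
      clients.push(entry.client);
    }
  }

  return names.map((name) => {
    const clients = clientsByName.get(name) ?? [];
    // One "All" badge when every client carries the command; otherwise a stable badge order.
    const badges: string[] =
      clients.length === CLIENT_ORDER.length
        ? ["All"]
        : CLIENT_ORDER.filter((c) => clients.indexOf(c) !== -1);
    return { name, badges };
  });
}
```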
The showcase website AgentOS section fully rewritten from scratch: seven new pages (overview, architecture, Claude, Codex, Cursor, dashboard, build log) with a persistent two-level navigation so users can discover and move between sections without getting lost. Pages sourced directly from the repo: READMEs, CHANGELOG, and source architecture. Previous pages archived with cross-links preserved for reference.
Shared skills became client-aware: {{LIGHTWEIGHT_MODEL}} and {{CLIENT_NAME}} placeholders now expand per client at sync time, so one skill file resolves to Haiku on Claude and Cursor and GPT-4o-mini on Codex. Codex skills now self-report to /command-log via a logging block injected at sync, so Codex invocations show in the dashboard alongside Claude and Cursor. The pm2 controls were hardened against shell injection: child_process.exec through cmd.exe replaced with spawn (no shell) plus process-name validation. Stats-cache accuracy fixed: a full ISO lastComputedAt timestamp replaces the date-only cutoff so same-day sessions are re-counted, and the hour-of-day heatmap now bins by local time. jn-ext-deploy made self-contained so it runs without the MCP extension tools loaded.
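As an illustration of sync-time expansion, here is a minimal sketch. The two placeholder names and the per-client model values come from the entry above; `expandPlaceholders`, the value table, and the choice to fail on an unknown placeholder are assumptions (the real sync scripts are PowerShell and may behave differently).

```typescript
// Hypothetical sketch of per-client placeholder expansion at sync time.
type Client = "claude" | "cursor" | "codex";

const CLIENT_VALUES: Record<Client, Record<string, string>> = {
  claude: { CLIENT_NAME: "Claude", LIGHTWEIGHT_MODEL: "Haiku" },
  cursor: { CLIENT_NAME: "Cursor", LIGHTWEIGHT_MODEL: "Haiku" },
  codex: { CLIENT_NAME: "Codex", LIGHTWEIGHT_MODEL: "GPT-4o-mini" },
};

function expandPlaceholders(template: string, client: Client): string {
  const values = CLIENT_VALUES[client];
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (_match: string, key: string): string => {
    const value = values[key];
    if (value === undefined) {
      // Assumption: fail loudly rather than ship a skill with a raw placeholder in it.
      throw new Error(`Unknown placeholder {{${key}}}`);
    }
    return value;
  });
}
```

One shared source line such as `Delegate retrieval to {{LIGHTWEIGHT_MODEL}}.` then yields a different file per client without any per-client copies in the repo.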
Codex gained pay-per-token access to OpenRouter-hosted models with no subscriptions: one API key, four role profiles (planner on DeepSeek V4 Flash, tools on Qwen3 Coder, agentic on Kimi K2.5, cheap on the free Qwen tier), all deployed through the managed config block. The dashboard followed: a shared pricing module estimates cost per model across Anthropic, OpenAI, and OpenRouter, the Codex analytics adapter attributes tokens to real model IDs from rollout logs, and the model tables render columns dynamically from the data instead of a hardcoded Haiku/Sonnet/Opus trio. Two real bugs fell out of the work. First, Codex token totals were double-counting cached and reasoning tokens, which are subsets of the input and output counts, not additions to them. Second, Codex had been unable to reach the MCP server at all since the security hardening, because nothing sent the Bearer token. The managed block now authenticates via AGENTOS_TOKEN, kept in sync by sync.ps1.
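The double-counting bug is easy to show in a few lines. The field names below are an assumption about what the Codex rollout logs report, and `naiveTotal` and `correctedTotals` are hypothetical names; the point is only that cached input is a subset of input and reasoning output is a subset of output.

```typescript
// Hypothetical sketch; field names are assumed, not taken from the adapter's source.
interface CodexUsage {
  input_tokens: number;            // includes cached input
  cached_input_tokens: number;     // subset of input_tokens
  output_tokens: number;           // includes reasoning output
  reasoning_output_tokens: number; // subset of output_tokens
}

// The bug: treating the subset counters as if they were extra tokens.
function naiveTotal(u: CodexUsage): number {
  return u.input_tokens + u.cached_input_tokens + u.output_tokens + u.reasoning_output_tokens;
}

// The fix: total is input plus output; the subsets are reported as breakdowns only.
function correctedTotals(u: CodexUsage): {
  total: number;
  uncachedInput: number;
  visibleOutput: number;
} {
  return {
    total: u.input_tokens + u.output_tokens,
    uncachedInput: u.input_tokens - u.cached_input_tokens,
    visibleOutput: u.output_tokens - u.reasoning_output_tokens,
  };
}
```

For a turn with 1,000 input tokens (800 cached) and 200 output tokens (150 reasoning), the naive sum reports 2,150 tokens where the corrected total is 1,200.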
A runtime agent loop built on top of the multi-provider work: a plan/execute/review pipeline with bounded self-correction, provider-agnostic by construction so the loop holds no model or transport detail. A planner model writes a plan, an executor model runs the agent loop against the MCP tools, and a reviewer model checks the result and sends it back for correction when it falls short. Reliability is treated as first-class: transient provider errors retry with backoff invisibly to the model, invalid tool calls are schema-checked and handed back to the model to fix, and a repeatability harness runs a task many times to measure success rate rather than trusting one green run. The first loop built on it keeps this Showcase Website current from each project's CHANGELOG, README, and shipped version. The model decides what is stale and proposes a JSON change-set; code then applies every edit deterministically through an exact-snippet replace, behind a safety gate that confines writes to the website directory, enforces a per-scope selector allowlist and a per-run write cap, and blocks git. The model plans; code writes.
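The "model plans; code writes" split can be sketched as two small functions: a gate that rejects a change-set before anything is touched, and an exact-snippet replace. All names here are hypothetical. The sketch covers only directory confinement and the write cap (the selector allowlist and git block are omitted), and requiring the snippet to occur exactly once is an assumption about what "exact" means.

```typescript
// Hypothetical sketch of a gated, deterministic apply step.
interface Edit {
  file: string;    // forward-slash path relative to the repo root
  find: string;    // exact snippet expected in the file
  replace: string; // text that takes its place
}

interface GateOptions {
  allowedDir: string; // e.g. the website directory
  maxWrites: number;  // per-run write cap
}

// Reject the whole change-set up front if any edit breaks the rules.
function assertAllowed(edits: Edit[], gate: GateOptions): void {
  if (edits.length > gate.maxWrites) {
    throw new Error(`Change-set has ${edits.length} edits; cap is ${gate.maxWrites}`);
  }
  for (const edit of edits) {
    const inside = edit.file.indexOf(gate.allowedDir + "/") === 0;
    const escapes = edit.file.split("/").indexOf("..") !== -1;
    if (!inside || escapes) {
      throw new Error(`Write outside ${gate.allowedDir}: ${edit.file}`);
    }
  }
}

// Replace the snippet only if it matches verbatim and exactly once.
function applyExact(content: string, edit: Edit): string {
  if (edit.find.length === 0) {
    throw new Error("Empty snippet");
  }
  const first = content.indexOf(edit.find);
  if (first === -1) {
    throw new Error(`Snippet not found in ${edit.file}`);
  }
  if (content.indexOf(edit.find, first + 1) !== -1) {
    throw new Error(`Snippet is ambiguous in ${edit.file}`);
  }
  return content.slice(0, first) + edit.replace + content.slice(first + edit.find.length);
}
```

Because the model only proposes `find`/`replace` pairs, a stale or hallucinated snippet fails the match and nothing is written.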
The linear pipeline became one node type in a task-graph engine: DAG validation before any model spend, bounded concurrency via a hand-rolled promise pool, a per-run token budget as a circuit breaker, and fan-out/gather node factories with deterministic gather order. The first portfolio workflow fanned the code audit out over every enrolled project concurrently. On top of it, a run-control API on the existing gateway: list workflows, start a run, fetch a snapshot, and stream live node lifecycle events over SSE, all behind bearer auth because these endpoints trigger model spend.
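Validating the graph before spending any tokens can be done with a standard topological sort. The sketch below uses Kahn's algorithm as one reasonable way to do it; the engine's actual validation is not shown in this log, and `GraphNode` and `validateDag` are hypothetical names.

```typescript
// Hypothetical sketch: DAG validation via Kahn's algorithm, run before any model call.
interface GraphNode {
  id: string;
  dependsOn: string[];
}

// Returns a valid execution order, or throws on duplicates, unknown dependencies, or cycles.
function validateDag(nodes: GraphNode[]): string[] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();

  for (const node of nodes) {
    if (indegree.has(node.id)) {
      throw new Error(`Duplicate node id: ${node.id}`);
    }
    indegree.set(node.id, 0);
    dependents.set(node.id, []);
  }

  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      const list = dependents.get(dep);
      if (list === undefined) {
        throw new Error(`Node ${node.id} depends on unknown node ${dep}`);
      }
      list.push(node.id);
      indegree.set(node.id, (indegree.get(node.id) ?? 0) + 1);
    }
  }

  // Start from nodes with no dependencies, then release dependents as they become ready.
  const ready = nodes.filter((n) => n.dependsOn.length === 0).map((n) => n.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const remaining = (indegree.get(next) ?? 0) - 1;
      indegree.set(next, remaining);
      if (remaining === 0) {
        ready.push(next);
      }
    }
  }

  if (order.length !== nodes.length) {
    throw new Error("Cycle detected: some nodes can never become ready");
  }
  return order;
}
```

A fan-out/gather workflow passes this check as long as the gather node lists every fanned-out node in `dependsOn`; a cycle or a typo in a dependency id fails before the first model call.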
Free-tier rate limits blocked the first portfolio run, so single models became ordered routes: preferred first, failovers behind. A transient provider error restarts the stage on the next model in the route; a real error is rethrown immediately. Telemetry records the attempted-model chain. In the same push, focused audit mode fixed the worst quality bug: a deterministic selector caps each audit at the highest-signal surfaces and a hard finalisation instruction makes ending a run without findings JSON a failure, not a silent clean pass.
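The route behaviour can be sketched as a loop over an ordered list of models. This is a simplified illustration: it is synchronous for brevity (a real stage would be async), and `runWithRoute`, `RouteResult`, and the `isTransient` callback are hypothetical names.

```typescript
// Hypothetical sketch of an ordered model route with failover.
interface RouteResult<T> {
  value: T;
  model: string;       // the model that finally succeeded
  attempted: string[]; // the attempted-model chain, for telemetry
}

function runWithRoute<T>(
  route: string[],
  runStage: (model: string) => T,
  isTransient: (err: unknown) => boolean,
): RouteResult<T> {
  const attempted: string[] = [];
  let lastError: unknown = new Error("Route is empty");

  for (const model of route) {
    attempted.push(model);
    try {
      return { value: runStage(model), model, attempted };
    } catch (err) {
      if (!isTransient(err)) {
        throw err; // a real error is rethrown immediately, no failover
      }
      lastError = err; // transient: restart the stage on the next model
    }
  }
  throw lastError; // every model in the route failed transiently
}
```

Keeping the transient check in a separate callback means the route logic never needs to know what a rate limit looks like for any particular provider.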
A verifier node re-checks every audit result deterministically across four locked failure domains: output, completion, schema, and anchor. Every finding's anchor text must resolve verbatim against the file on disk or the finding is rejected; a failed verdict routes to one bounded re-run at reduced scope. The write path landed alongside it: apply is a replay of a saved, verifier-passed report, dry-run by default, gated to a filesystem allow-list, a clean working tree, and a mandatory build gate, with CRLF-aware matching after a line-ending mismatch blocked the first real candidate. Two real fixes went through the full path and were committed.
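The anchor check, including the line-ending fix, might look roughly like this. Function and type names are hypothetical, and normalising CRLF to LF on both sides is one assumed way to make the match CRLF-aware.

```typescript
// Hypothetical sketch of the anchor failure domain: a finding survives only if its
// anchor text appears verbatim in the file, ignoring CRLF versus LF differences.
interface Finding {
  file: string;
  anchor: string;
  message: string;
}

function normaliseEol(text: string): string {
  return text.replace(/\r\n/g, "\n");
}

function anchorResolves(fileContent: string, anchor: string): boolean {
  if (anchor.length === 0) {
    return false; // an empty anchor would match everything and proves nothing
  }
  return normaliseEol(fileContent).indexOf(normaliseEol(anchor)) !== -1;
}

function verifyAnchors(
  findings: Finding[],
  files: Record<string, string>,
): { accepted: Finding[]; rejected: Finding[] } {
  const accepted: Finding[] = [];
  const rejected: Finding[] = [];
  for (const finding of findings) {
    const content = files[finding.file];
    const ok = content !== undefined && anchorResolves(content, finding.anchor);
    (ok ? accepted : rejected).push(finding);
  }
  return { accepted, rejected };
}
```

A model that paraphrases the code it is complaining about, or cites a file that does not exist, produces an anchor that cannot resolve, so the finding is dropped deterministically rather than argued with.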
An opt-in challenger stage runs downstream of the verifier and can raise advisory cautions on a passing result, but never changes a verdict, widens apply eligibility, or triggers a write; the policy is written down, not implied. Workflow profiles then generalised the layer: a registry with a hardened contract puts a stable id and honest metadata over each workflow, validated before any model call. Two no-model detector profiles (documentation drift and changelog coverage) proved the graph carries more than audits.
A SQLite run ledger and queue (better-sqlite3, WAL) dual-written alongside the canonical JSONL telemetry, behind an async interface portable to a hosted database later. Store reads are opt-in with strict JSONL fallback; orphaned rows are retained and reported, never fabricated into history. Queue payloads are secret-free with credentials injected at drain time, the gateway's non-secret environment is pinned by contract, and an approval spine (grants bound to an exact apply fingerprint, issued by a fail-closed CLI) exists for any future queued apply, which remains impossible by design.
A read-only, offline, no-model layer over the audit output: a workflow simulation report with structured evidence anchors, deterministic grouping, manual dispositions persisted in an append-only ledger, historical read-back, and a descriptive learning summary. Everything is a candidate, nothing auto-remediates. The website content audit also joined the graph as a first-class workflow: the proposal-first sibling of the code audit, fanning out over the enrolled projects with no apply path at all.
The dashboard grew native Workflows, Runs, and Graph tabs, then consolidated them into Studio: one hub organised around the run lifecycle, from the workflow catalogue through configuration, a live cost estimate priced from per-stage run history, launch behind an explicit spend confirmation, live observation with 4-second change-guarded refresh, and node drill-through to the findings report. Workflow and project catalogues moved into committed JSON registries so enrolment changes get git review. Next boundary on the roadmap: modularising the 12,000-line webview (Phase 5A).
Three components, one machine, one shared server. The portfolio is the source of truth. The MCP server is the access layer. The dashboard is the control plane.
Once you understand resources, URIs, and the transport split (stdio vs HTTP), the SDK does most of the work. StreamableHTTP stateless mode means no session management to worry about. The hardest part is the conceptual mapping, not the implementation.
The VSCode extension host does not inherit your shell's PATH, so npm globals are not directly callable. Wrapping child_process.exec with { shell: 'cmd.exe' } routes commands through cmd.exe, which resolves npm globals correctly. A TypeScript strict-overload mismatch on the typed exec signature required an as ExecOptions cast before it would compile. The pm2 controls later dropped this wrapper in favour of spawn (no shell) plus process-name validation, to close off shell injection.
I considered remote hosting on day one. But every agent I actually use runs on this machine, and binding to loopback only sidesteps the entire auth, TLS, and hosting question. When I need remote access, I will add it. Until then it is friction I do not need.
Watching which portfolio resources Claude actually pulls changed how I refine the files. Token counts per model changed which model I use for which task. Seeing invocation logs from Cursor alongside Claude logs confirmed the multi-client wiring was working. Visibility is not optional infrastructure: it is what makes a system worth building.
The move from per-client memory and skills directories to a single clients/shared/ directory was the right call. Any update to a command or memory file now propagates to all three clients on the next sync, with no divergence between them. The canonical-vs-runtime model means the repo is always the source of truth and the runtime directories are always recreatable.