noticed's code factory

As I write this, eight Claude Code sessions are running in parallel on my machine. Each one is on its own git worktree, on its own branch, working a different PRD: one is rebuilding search, one is reworking the relationship record, one is fixing avatars on /relationships, one is wiring live sync into the browser extension. I'm not typing code into any of them. I'm reviewing, steering, and starting the next one.

what a code factory is

The phrase caught on over the past year, as people noticed the bottleneck had moved. When a coding agent can write a feature in twenty minutes, the slow part becomes knowing what to build. Dan Shapiro called the end state a "dark factory", a box that turns specs into software. Addy Osmani put it as "you are no longer just writing code; you are building the factory that builds your software."

A code factory is the harness around that shift: the worktrees, the slash commands, the PRDs, the eval labs, the overnight loops, the always-on box - everything that lets one person keep many agents productive at once without losing the thread.

At noticed, we argue the CTO's job nowadays (at least part of it) is to build this code factory.

I wrote before about the multi-tenant agent harness we ship as a product. This is the other harness, the one we use to build the product. Here's what two months on the floor looks like, measured from my own Claude Code transcripts:

13.7 billion tokens generated across 32 active days.
The longest single session stayed alive 14 days and 4 hours - 14,930 turns ran through it.
31,920 model turns - roughly one every three minutes, around the clock.
~427 million tokens a day on average; peak days cross 1.3 billion.
Opus 4.8 does 86% of it. The rest is a long tail - Fable 5 (read up on Portugal's Dom Sebastião), a bit of Sonnet and Haiku for cheap work, and some Cursor Composer 2.5 for quick edits.

the floor: worktrees in parallel

The unit of work is a git worktree. One feature, one worktree, one branch, one session. They don't share a working directory, so they don't step on each other, and I can have as many open as I have attention to review.

Today the counter hit eight concurrent sessions, twelve across the day. The busiest day this month by session count ran 28 sessions; the busiest by token count burned 1.3B tokens across 15 sessions.

Eight is about the edge, and that's not a coincidence. Past that, the human reviewing the output becomes the bottleneck, not the agents.

Every session starts just about the same way. The first message is almost always some variant of:

Work in a worktree from main. Open a PR, watch CI green, merge and watch deploy CI green.

…followed by a link to a Notion PRD, a pasted error, or a one-line task. That sentence is the whole contract: isolate the work, prove it green, ship it, confirm the deploy. The agent owns the loop from branch to production and I own deciding what to work on.
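As a rough sketch of the "isolate the work" half of that contract, here is how one worktree per feature could be created from a script. The branch naming scheme (feature/<name>) and the sibling-directory layout are assumptions made for illustration, not a description of how noticed lays out its checkouts; the only real interface used is git's own worktree add command.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, feature: str, base: str = "main") -> list[str]:
    """Build the git command that gives one feature its own worktree and branch."""
    branch = f"feature/{feature}"                   # hypothetical branch naming scheme
    path = repo.parent / f"{repo.name}-{feature}"   # hypothetical sibling directory
    # `git worktree add -b <branch> <path> <base>` creates the branch from <base>
    # and checks it out in a separate working directory.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, feature: str, base: str = "main") -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_command(repo, feature, base), check=True)
```

Because each feature gets its own directory and branch, two sessions never write to the same files on disk, which is what makes running eight of them at once safe.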

the VPS: a floor that never sleeps

The factory also lives on a VPS. We took a fork of VS Code's code-server and turned it into an opinionated, browser-based IDE for exactly this workflow - a far-left sidebar of worktree-sessions, each an in-window tab, so the parallel sessions that used to be scattered across terminal panes are first-class objects you can glance at and switch between. It runs in a browser, which means the factory is reachable from my phone.

Beyond letting me work from anywhere, this VPS takes some of the longest-running tasks: data migrations, overnight loops, backfills that take hours, and releases I want watched for six hours after merge. It's work I want to kick off from my phone and check on later. The laptop is for the work I'm actively reviewing; the VPS is for the work that needs to keep going.

There's a nice loop here: the agents build the IDE that the agents run in. I'll ask a session to make a UI change to the IDE, then ask it to take a screenshot from the VPS and confirm the change is visible there.

the contract: PRDs on Notion

A worktree session is only as good as the brief it starts with. Ours live in Notion, in a database called noticed_ PRDs. The PRD is the single source of truth.

We have a /new-feature command for this. You hand it a raw idea, a bug report, or a voice-note transcript; it interviews you for the gaps, greps the actual codebase so the approach names real files and tables, and writes a one-screen PRD into Notion with a fixed shape: Problem, Goal / Non-goals, Approach, Locked decisions, Done when. If it doesn't fit on a single viewport, it probably means it needs to be broken into two features.

Two sections carry the weight:

Locked decisions - choices already settled that the build must not reopen ("WebSocket push, not polling"; "reuse the existing dialog, don't build a new modal").
Done when - an observable checklist, each item concrete enough that someone watching a screen recording could confirm it. Stuff like "A badge appears within 5s of a new comment," and not "the user is notified."

Then a build session reads the PRD and takes it to a green, reviewed PR. The Done when list is the acceptance test.
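The fixed shape of the PRD maps naturally onto a small data structure. The sketch below is illustrative only: the class, its field names, and the choice of which sections count as required are assumptions, not the schema of the actual Notion database.

```python
from dataclasses import dataclass, field


@dataclass
class PRD:
    """One-screen PRD with the five fixed sections (field names are hypothetical)."""
    problem: str
    goal: str
    non_goals: list[str] = field(default_factory=list)
    approach: str = ""
    locked_decisions: list[str] = field(default_factory=list)
    done_when: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the required sections that are still empty.

        Assumption: Non-goals and Locked decisions may legitimately be empty,
        while the other four are needed before a build session can start.
        """
        required = {
            "Problem": self.problem,
            "Goal": self.goal,
            "Approach": self.approach,
            "Done when": self.done_when,
        }
        return [name for name, value in required.items() if not value]
```

A build session could refuse to start while missing_sections() is non-empty, which keeps the "Done when" checklist from being skipped.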

the night shift: /goal

Some work doesn't need me in the loop. It needs persistence.

Claude Code's /goal is my overnight loop. My take is that it's good enough: I've never felt like coding up my own coding harness from scratch, and I do appreciate just how far $200/month takes me in Opus 4.8 credits. In fact, cost is what drove me away from Cursor a few months ago, where I was spending upwards of $600 a month.

This loop feeds the agent the same goal every iteration, lets it see its own previous work in the files and git history, and doesn't let it stop until the goal is unequivocally true. I've started it 47 times in two months, most of them between 22:00 and midnight. A few real ones, verbatim:

/goal run /simplify then /security-review then watch CI go green
and merge. Once deploy CI is green, run the backfill. Keep a watch
on this release over the next 6 hours.

/goal the noticed web app feels slow. Analyse the following pages
in prod (use the developer port open on my Helium browser to time
their load times), and use smart strategies to make them faster.

/goal drive all remaining jobs to completion fast.
Use /query-optimization-lab to debug and fix the slow queries.

The pattern is always the same shape: a goal, a tool to use, and a finish line the agent can verify for itself. I write it, go to sleep, and read the PRs in the morning.
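The shape of that loop can be sketched in a few lines. This is not how Claude Code implements /goal; it is a minimal model of the idea, where step stands in for one agent iteration and is_done stands in for a check the agent can verify for itself (both names, and the iteration cap, are hypothetical).

```python
from typing import Callable


def run_goal(
    goal: str,
    step: Callable[[str], None],
    is_done: Callable[[], bool],
    max_iterations: int = 50,
) -> int:
    """Feed the same goal to `step` until `is_done` reports success.

    Returns the number of iterations used; raises if the cap is hit first.
    """
    for iteration in range(1, max_iterations + 1):
        step(goal)       # the agent works; its output persists in files and git history
        if is_done():    # the finish line is checked independently of the agent's claim
            return iteration
    raise RuntimeError(f"goal not reached after {max_iterations} iterations: {goal}")
```

The important design choice is that the goal string never changes between iterations; progress lives in the repository, not in the prompt.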

the last 5%: /simao-here and the picker

Agents are great at the first 95% of a UI and bad at the last 5%. Everyone and their mother is trying to make LLMs get good at design (me included), but until we get there, I'm still the one driving it home. We've got hundreds (literally) of ESLint rules that enforce our brand guidelines (check them out here: https://www.noticed.so/brand/system/components), but stuff still really doesn't look good when it comes out of the factory.

So I built it a pair of eyes. There's an element picker: I hover, I click an element, and it writes a markdown description and a cropped screenshot of that exact element to a .cp-context/ folder, then types the reference straight into the Claude Code session text input. The agent sees what I see.

/simao-here is the claim. I run it in my active session and that session becomes the target - every pick from then on flows into that prompt until another session claims it. So my morning often starts with /simao-here, then a string of picks:

1. @.cp-context/element-...md  @.cp-context/element-...png
   - move logs into the user menu
2. @.cp-context/element-...md  @.cp-context/element-...png
   - this card's padding is wrong, match /relationships

I'm not describing the bug in words, I'm pointing at it. The picker turns "the thing in the top right looks off" - which an agent can't act on - into a screenshot and a DOM node it can. It's the highest-leverage tool I've built for the parts of the product that have to be felt. 20 claims in two months, almost all of them mornings and evenings.
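The claim mechanism is essentially a single pointer to "whichever session claimed last". A minimal sketch of that routing, with every class and method name invented for illustration (the real picker and /simao-here are not shown in this post):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pick:
    """One picked element: a markdown description plus a cropped screenshot."""
    markdown_path: str
    screenshot_path: str

    def as_reference(self) -> str:
        # Same shape as the references typed into the session input.
        return f"@{self.markdown_path}  @{self.screenshot_path}"


@dataclass
class PickRouter:
    """Routes every pick to the session that most recently claimed the picker."""
    claimed_by: Optional[str] = None
    inbox: dict[str, list[str]] = field(default_factory=dict)

    def claim(self, session_id: str) -> None:
        self.claimed_by = session_id   # a later claim simply replaces the earlier one

    def route(self, pick: Pick) -> str:
        if self.claimed_by is None:
            raise RuntimeError("no session has claimed the picker yet")
        self.inbox.setdefault(self.claimed_by, []).append(pick.as_reference())
        return self.claimed_by
```

With one claimant at a time there is never any ambiguity about which of the parallel sessions a pick belongs to.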

the labs: the agent grades its own homework

The danger with a fast factory is shipping fast garbage. The guardrail is measurement.

We run lab skills - skills that force a benchmark loop on the agent: measure, change one thing, re-measure, keep or discard. Three of them run constantly:

query-optimization-lab pulls a hot query from prod, reproduces it locally, and optimizes against four metrics - never touching prod to experiment. It's responsible for a 5.6x speedup on our combined workload, and it's the tool a /goal reaches for when it finds something OOMing.
agent-eval-lab improves the product agent's behavior against a judge that is immutable - you can't move the goalposts to make a score go up. If you try to lower a threshold to pass a failing scenario, the skill refuses.
identity-matching-lab guards the part of noticed that decides whether github:x and linkedin:y are the same human, scored against golden pairs so a prompt tweak that improves one case but regresses ten gets caught. If identity matching doesn't work, noticed doesn't work. So this one's really important.
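The benchmark loop the labs force on the agent (measure, change one thing, re-measure, keep or discard) can be sketched generically. The function below is an illustration under assumptions, not the code of any of the three skills: it assumes a single score where lower is better, and the measure function is passed in once and never modified, which mirrors the immutable-judge idea.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def lab_loop(
    baseline: T,
    candidates: Iterable[Callable[[T], T]],
    measure: Callable[[T], float],
) -> tuple[T, float]:
    """Apply one change at a time, keeping it only if the score improves (lower is better)."""
    best, best_score = baseline, measure(baseline)
    for change in candidates:
        trial = change(best)      # change exactly one thing
        score = measure(trial)    # re-measure with the same, untouched metric
        if score < best_score:    # keep
            best, best_score = trial, score
        # otherwise discard: `best` is left as it was
    return best, best_score
```

For example, starting from 10.0 with the changes "subtract 3", "add 5", and "halve", and scoring by the value itself, the loop keeps the first and third changes and discards the second.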

the repair bench: Vercel, ClickHouse, and our own logs

Half the factory's work is fixing stuff. The debugging loop is its own ritual, and it almost always starts with production logs.

A typical bug session opens with a pasted stack trace and an instruction:

Work in a worktree from main. Debug this error in prod.
Find the error logs in Vercel. TDD: reproduce locally,
watch it pass, ship a fix.

The agent reads the actual Vercel runtime logs through an MCP server, correlates them with our pipeline errors table, then reproduces the failure as a failing test before it touches anything. The fix is the thing that turns the test green. For ClickHouse - where the failures are OOMs and 30-second timeouts, not stack traces - it pulls the offending query straight from system.query_log and hands it to query-optimization-lab.

And the product has its own /logs page, so a lot of triage is just reading what the system already records about itself.
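For the ClickHouse path, pulling the offending query out of system.query_log might look something like the sketch below. The helper name and its parameters are hypothetical, and the filter is an assumption about what counts as "offending": it uses ClickHouse error codes 241 (MEMORY_LIMIT_EXCEEDED) and 159 (TIMEOUT_EXCEEDED), which are worth checking against the server version in use. The function only builds the SQL string; running it is left to whatever client the session has.

```python
def failed_queries_sql(hours: int = 24, limit: int = 20) -> str:
    """Build a ClickHouse query listing recent out-of-memory and timeout failures."""
    if hours <= 0 or limit <= 0:
        raise ValueError("hours and limit must be positive")
    # int() guards against anything but whole numbers reaching the SQL text.
    return f"""
        SELECT event_time, query_duration_ms, memory_usage, exception_code, query
        FROM system.query_log
        WHERE type = 'ExceptionWhileProcessing'
          AND exception_code IN (241, 159)  -- MEMORY_LIMIT_EXCEEDED, TIMEOUT_EXCEEDED
          AND event_time > now() - INTERVAL {int(hours)} HOUR
        ORDER BY event_time DESC
        LIMIT {int(limit)}
    """.strip()
```

The text of each returned query is what would then be handed to a local reproduction, so experiments never run against prod.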

a few words

A year ago I'd have read this and called the author a LARPer, but now I'm the one writing it. I still dream of building a nice UI for my code factory, setting a few defaults, and not prompting it at all. But for now my setup is a VS Code window (because it's where I feel at home, surrounded by code, even if I don't read most of it anymore) with n^n Claude Code terminal tabs open, and me cycling through them every couple of minutes.

Makes me pretty happy to be able to focus on creating cool PRDs and to trust an agent to do a better and faster job than me at coding what I described. The last-mile design process scratches my designer itch, I get to enjoy the weekends with my family and friends while my VPS pings me every now and then with a quick question, and I'm shipping product faster than I ever have before.

If you want to see what noticed's code factory is producing, get early access at noticed.so :)