
lab skills: how disciplined benchmarking made our agent 5.6x faster

simão nogueira · may 4, 2026

A note on stack: ClickHouse is GREAT and we use it for everything. The pricing is right and the free trial got us off the ground. noticed relies on it for analytics and our identity graphs, with Postgres handling transactional state. The query-optimization-lab below is ClickHouse-shaped: the settings, the system.query_log source, the spill behavior. The discipline ports to any database.

We open-sourced two Claude Code skills that force a measurement loop on AI coding agents. One optimizes database queries. The other improves agent eval scores. Both follow the same rule: the benchmark decides.


the problem with vibe-driven optimization

A query starts OOMing under load. You change five things, push and it works. You don't know which change fixed it.

An eval score drops. You tweak the system prompt, scores recover and you can't tell whether the prompt change helped or whether the LLM was just having a good day.

A week later, it regresses. There's no record of what you tried.

All three failures share the same shape: the change wasn't isolated, the metric wasn't pinned and the experiment wasn't logged.


what a lab skill is

A skill is a directory with a SKILL.md file that Claude Code reads into its system prompt. When something the user does matches the skill's description, Claude pulls in the full body and follows it.
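The frontmatter is what the matching runs on. A minimal sketch of a SKILL.md header, with the description paraphrased from the query lab's trigger conditions (the exact wording lives in the repo):

---
name: query-optimization-lab
description: Use when a query is slow, memory-heavy, timing out or scanning too many rows. Enforces a measure, change one thing, re-measure loop.
---

# query-optimization-lab
...core loop, keep/discard rules, output format...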

A lab skill is a particular shape of skill that enforces a benchmark loop:

  1. Measure
  2. Hypothesize
  3. Change one thing
  4. Re-measure
  5. Keep or discard

The skill body owns the discipline. The repo owns the benchmarks. The LLM owns the hypotheses and the rewrites.

We split each skill into four files:

skill-name/
├── SKILL.md         frontmatter + the core loop, under 500 lines
├── WORKFLOW.md      step-by-step with concrete files and commands
├── PATTERNS.md      reusable moves that have worked before
└── EXPERIMENTS.md   how to log experiments + variance discipline

The agent always sees SKILL.md. It pulls in the other three only when the loop sends it there. Anthropic notes that SKILL.md loses effectiveness past ~500 lines, so the rest belongs in linked files.
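EXPERIMENTS.md is what makes keep-or-discard auditable later. The exact format ships in the repo; the shape of an entry is roughly:

## date + one-line experiment name
- baseline: the pinned metrics, measured before the change
- change: the one variable touched
- result: the same metrics, re-measured on the same dataset
- decision: KEEP or DISCARD, with one line of why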


the two labs

At noticed we run a Postgres-backed pipeline, a 60M-row ClickHouse cluster and a multi-tenant agent that talks to thousands of users. Both labs came out of real production fires.

query-optimization-lab triggers when a query is slow, memory-heavy, timing out, or scanning too many rows. The loop:

  • Pull the hot query from prod system.query_log (read-only).
  • Reproduce it locally.
  • Benchmark the current shape on four metrics: read_rows, read_bytes, memory_usage, elapsed_ns.
  • Make one targeted rewrite.
  • Benchmark again on the same dataset.
  • Keep the change only if a bottleneck metric improves and correctness holds.

Production is read-only for diagnosis. All benchmarking runs locally.
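Here's what the first step looks like in practice. A sketch using the clickhouse-connect Python client; the host and credentials are placeholders, the system.query_log columns are stock ClickHouse:

import clickhouse_connect

# Read-only prod connection, used for diagnosis only.
client = clickhouse_connect.get_client(
    host="clickhouse.prod.internal",   # placeholder endpoint
    username="readonly",
    password="...",
)

# Rank the last 24 hours of finished queries by peak memory.
hot = client.query("""
    SELECT
        normalized_query_hash,
        any(query)             AS sample_query,
        count()                AS runs,
        max(memory_usage)      AS peak_memory_bytes,
        sum(read_rows)         AS total_read_rows,
        max(query_duration_ms) AS worst_duration_ms
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND event_time > now() - INTERVAL 1 DAY
    GROUP BY normalized_query_hash
    ORDER BY peak_memory_bytes DESC
    LIMIT 10
""")

for row in hot.result_rows:
    print(row)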

The skill encodes ClickHouse settings that LLMs get wrong by default (see the sketch after this list):

  • max_bytes_before_external_group_by only spills GROUP BY to disk. It does nothing for JOINs.
  • For JOIN hash-table OOMs you need join_algorithm='auto' plus max_bytes_in_join.
  • max_memory_usage is a hard kill, not a spill trigger.
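Applied to a local benchmark run, the three bullets combine like this. The byte budgets and the query are hypothetical; the setting names and semantics are ClickHouse's:

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # local replica, never prod

client.query(
    "SELECT user_id, count() FROM events GROUP BY user_id",  # hypothetical rewrite under test
    settings={
        # Spills GROUP BY state to disk past the budget; does nothing for JOINs.
        "max_bytes_before_external_group_by": 8_000_000_000,
        # Together these handle JOIN hash-table OOMs: fall back to a
        # spill-capable algorithm once the hash table crosses the cap.
        "join_algorithm": "auto",
        "max_bytes_in_join": 8_000_000_000,
        # Hard kill, not a spill trigger: the query dies here.
        "max_memory_usage": 16_000_000_000,
    },
)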

agent-eval-lab triggers when eval scenarios fail, judge scores drop, or you're iterating on system prompts and persona overlays. It covers two eval systems: scripted onboarding evals (6 fixed dimensions, CSV reports) and agent live evals (user-simulator LLM, mission-driven success criteria, JSON reports).

The judge is immutable. You improve the agent's behavior, not the grading. The skill refuses if you try to lower a threshold to make a failing scenario pass.

LLM evals are non-deterministic, so the lab includes a flakiness protocol: 2-of-2 consecutive passes to ship; mixed results trigger a third run; and you never ship a fix based on a single passing run after a failure.
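The protocol is small enough to state as code. A sketch, assuming a hypothetical run_scenario() that executes one eval scenario and returns True on a pass:

def ship_decision(run_scenario) -> bool:
    first, second = run_scenario(), run_scenario()
    if first and second:
        return True      # 2-of-2 consecutive passes: ship
    if not first and not second:
        return False     # 0-of-2: discard the fix
    # Mixed results trigger a third run. A lone pass after a failure never
    # ships; only two consecutive passes at the end do.
    third = run_scenario()
    return second and third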


why one-thing-at-a-time matters

If a pipeline is OOMing, the instinct is to change the query, bump the memory limit and reduce concurrency at the same time. If it works, you don't know which change mattered. If it fails, you don't know what to revert. The lab skill prevents this by construction: it won't propose a second change until the first has been benchmarked and decided.

Same with evals. A scenario fails across persona consistency and account awareness. The temptation is to rewrite the whole system prompt. The lab traces the failure to one source file, proposes one edit, re-runs that single scenario and compares before and after. Only once the scenario passes does it run the full suite to check for regressions.


results

On noticed's combined workload, query-optimization-lab produced a 5.6x end-to-end speedup. No single rewrite is responsible. The 5.6x is a sequence of one-variable experiments, each kept or discarded on the benchmark.

agent-eval-lab caught persona drift and account-awareness regressions that manual prompt iteration had been missing. The flakiness protocol is the reason. It filters out lucky runs.


what's still rough

Nothing here is finished. The places we're least comfortable:

  • Cross-skill conflicts. When both labs trigger on the same change (a query rewrite that affects an eval scenario), there's no shared state. The agent runs them sequentially and we coordinate by hand.
  • Variance budgets. The flakiness protocol is conservative. For some eval dimensions a 2-of-2 rule is overkill; for others, 2-of-2 isn't enough. We don't yet have per-dimension thresholds.
  • Discard logging. EXPERIMENTS.md works, but the agent doesn't always remember to write to it on a discard. Half the time we catch this in review and backfill.

how to use it

Install via Claude Code:

/plugin marketplace add noticedso/noticed-labs
/plugin install noticed-labs@noticed-labs

The skills ship with noticed-shaped defaults. You'll want to fork and adapt:

  1. Copy the skill folder into your own plugin repo.
  2. Rewrite the Repo Defaults section in each SKILL.md to point at your benchmark scripts, eval files and migration paths.
  3. Keep the Core Loop, Keep/Discard Rules and Output Format sections. That discipline is portable.
  4. Update the frontmatter description so your repo's queries and evals trigger the skill.

The skill only becomes useful once it's matched to the architecture of your repo.


The repo includes both skills, the four-file layout and the experiment log format we use internally. Take what's useful, ignore what's not, and if you see something we got wrong, open an issue.

noticed-labs on GitHub.

If you want to see what this discipline powers in practice, join the waitlist at noticed.so.