Coherence Daddy · Autoresearch

Optimize Claude Code skills, automatically.

An LLM-as-judge optimizer that finds the changes that actually make your skill files work — verified with stability-aware statistics, not vibes.

11 slides · ~5 min read · Claude Code · Anthropic API · Python · macOS · Linux · Windows + WSL
Why this exists

Skill files degrade silently.

Skill files are markdown instructions written for Claude. No compiler, no test suite, no signal that the AI actually follows them — until something breaks.

The manual path

Read 100 outputs. Spot the pattern. Slow, biased, and never replicable. The fix that "felt right" is one you can't defend or re-run when the model version changes.

The autoresearch path

Write 4 binary evals once. Run autoresearch optimize. Get a stability-tested winner with a z-stat in 30 minutes. Reproducible, defensible, model-agnostic.

The mental model
This is A/B testing for prompts — but the variants are written by an Opus mutator and the grader is a Haiku judge.
Slide 03 / Three commands

Three commands. Nothing else.

The whole tool is three subcommands. Each does one thing and prints a number you can defend.

autoresearch validate

Re-runs the judge K times on the same outputs and asks you to hand-grade some. Reports Fleiss' kappa (test-retest) and Cohen's kappa (judge-vs-human). If both are ≥ 0.7, the judge is trustworthy.

autoresearch optimize

Greedy mutation loop. Up to 20 experiments. Keeps changes that improve the train score, discards the rest. Logs every experiment to JSONL.

autoresearch compare

Head-to-head scoring. --repeat N runs N rounds and reports mean ± stdev with a z-stat. Below z=2 the result is noise. Above z=2 it's real.

Slide 04 / Proof it works

cd:plan went from 0.388 to 1.000 on holdout.

We ran Autoresearch against the real cd:plan skill from cd-skills. Four binary evals, 2 train inputs, and 4 holdout inputs Haiku had never seen.

| Eval | Baseline | Optimized |
| --- | --- | --- |
| has_checkbox_steps | 9/12 | 12/12 |
| every_step_has_verify | 0/12 | 12/12 |
| steps_reference_files | 12/12 | 12/12 |
| has_done_when | 0/12 | 12/12 |
Lift: +61.2pp on holdout.

Significance: z = 19.56 (z ≥ 2 means real signal).

Stability: stdev 0.000 over 5 rounds, zero variance.

Slide 05 / Install

Install as a plugin.

Inside Claude Code:

Claude Code
/plugin marketplace add Coherence-Daddy/autoresearch
/plugin install autoresearch@autoresearch
/reload-plugins

That registers the autoresearch:run skill so any agent can invoke it. The skill itself instructs the agent to clone the repo and install the Python CLI:

Terminal
git clone https://github.com/Coherence-Daddy/autoresearch ~/autoresearch
cd ~/autoresearch
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY=sk-ant-...

One-time setup, then every autoresearch call reuses the venv.

Slide 06 / Step 1 — Write a config

Tell Autoresearch what "good" means.

An eval config is a YAML file. It declares the skill to optimize, 2-6 test inputs, a holdout set, and 2-5 binary evals.

target_skill: my-skill.md
generator_model: claude-haiku-4-5-20251001
judge_model: claude-haiku-4-5-20251001
mutator_model: claude-opus-4-7
runs_per_input: 3

inputs:
  - name: case_a
    prompt: "The prompt you'd give the skill in real use."

holdout:
  - name: unseen_case
    prompt: "A prompt the optimizer never sees."

evals:
  - name: has_required_section
    question: "Does the output contain X?"
    pass_condition: "What a passing output looks like."
    fail_condition: "What a failing output looks like."
Eval design rule
At least one eval must be locally satisfiable — pass/fail determinable from a short fragment of the output. "Does it end with a `## Done when` section?" qualifies; "Is the plan coherent overall?" does not. Pure global properties starve the mutator.
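
Before spending API calls, a quick structural check of the config can catch typos. A minimal sketch, assuming the key names shown above and the documented limits (2-6 inputs, 2-5 binary evals); it is not part of the CLI:

Python
import yaml

REQUIRED_KEYS = {"target_skill", "generator_model", "judge_model",
                 "mutator_model", "inputs", "evals"}

def check_config(path: str) -> None:
    # Load the YAML and verify the documented structure before running validate.
    with open(path) as f:
        cfg = yaml.safe_load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    assert not missing, f"missing keys: {missing}"
    assert 2 <= len(cfg["inputs"]) <= 6, "use 2-6 train inputs"
    assert 2 <= len(cfg["evals"]) <= 5, "use 2-5 binary evals"
    for ev in cfg["evals"]:
        for field in ("name", "question", "pass_condition", "fail_condition"):
            assert field in ev, f"eval {ev.get('name', '?')} is missing {field}"

check_config("my-config.yaml")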
Slide 07 / Step 2 — Validate

Don't trust the judge until you've measured it.

Terminal
autoresearch validate my-config.yaml

The validator runs the judge K times on the same outputs (test-retest), then prompts you to hand-grade them (judge-vs-human). It prints a kappa per eval.

| Gate | Pass threshold | What it means |
| --- | --- | --- |
| Test-retest κ | ≥ 0.7 | Judge is consistent with itself |
| Judge-human κ | ≥ 0.7 | Judge agrees with you |
| Agreement | ≥ 0.8 | Same verdict 80%+ of the time |
If the gate fails
The optimizer's scores are meaningless. Stop, fix the eval prompts, and re-validate before continuing.
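
To make the gates concrete, here is a minimal sketch of how the three numbers can be computed from raw verdicts (scikit-learn and statsmodels are used for illustration; this is not the validator's actual code):

Python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa

# Judge-vs-human gate: pass/fail verdicts (1 = pass) for the same outputs.
judge = [1, 1, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 1, 1, 1, 1, 0]
kappa = cohen_kappa_score(judge, human)
agreement = sum(j == h for j, h in zip(judge, human)) / len(judge)
print(f"judge-human kappa: {kappa:.2f}  (gate: >= 0.7)")
print(f"raw agreement:     {agreement:.2f}  (gate: >= 0.8)")

# Test-retest gate: re-judge the same outputs K times, count fails/passes per
# output, then compute Fleiss' kappa on the N x 2 count table.
reruns = np.array([[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1]])  # rows = outputs
counts = np.stack([(reruns == 0).sum(axis=1), (reruns == 1).sum(axis=1)], axis=1)
print(f"test-retest kappa: {fleiss_kappa(counts):.2f}  (gate: >= 0.7)")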
Slide 08 / Step 3 — Optimize

Run the loop.

Terminal
autoresearch optimize my-config.yaml

For each experiment (max 20 by default), the loop runs four steps (sketched in code below):

  1. Mutate: Opus reads the failing examples and proposes a single skill change.
  2. Score: Haiku generates outputs against the train inputs; the judge scores them.
  3. Decide: if the new score beats the best, keep it. Otherwise discard and try again.
  4. Holdout check: every 5 experiments, score against the held-out inputs to detect overfitting.

Output is saved to runs/<skill>-<timestamp>/: the baseline, the best, every snapshot, and a JSONL log of every experiment.
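
Restated as code, the loop is a plain greedy search. A minimal sketch of the four steps; the two helpers stand in for the real Opus/Haiku calls, so this runs but does nothing useful:

Python
import random

def mutate(skill_text: str) -> str:
    # Stand-in for the Opus mutator, which proposes a single skill change.
    return skill_text + f"\n<!-- candidate edit {random.randint(0, 999)} -->"

def score_on(skill_text: str, inputs: list[str]) -> float:
    # Stand-in for generating outputs with Haiku and scoring them with the judge.
    return random.random()

def optimize(skill: str, train: list[str], holdout: list[str], max_experiments: int = 20):
    best, best_score = skill, score_on(skill, train)
    for i in range(1, max_experiments + 1):
        candidate = mutate(best)                      # 1. Mutate
        candidate_score = score_on(candidate, train)  # 2. Score
        if candidate_score > best_score:              # 3. Decide: keep only improvements
            best, best_score = candidate, candidate_score
        if i % 5 == 0:                                # 4. Holdout check for overfitting
            print(f"experiment {i}: holdout score {score_on(best, holdout):.2f}")
    return best, best_score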

Slide 09 / Step 4 — Compare

Always run --repeat 5.

Terminal
autoresearch compare my-config.yaml \
  runs/my-skill-2026-04-27/SKILL.md.baseline \
  runs/my-skill-2026-04-27/SKILL.md.best \
  --holdout --repeat 5

Single-shot comparisons are dominated by generator-temperature noise — they will lie to you. --repeat 5 runs the comparison five times and reports mean ± stdev plus a z-stat.

| z-stat | Verdict |
| --- | --- |
| < 2 | Noise. Don't ship this change. |
| 2 – 5 | Real but modest lift. |
| ≥ 5 | Strong signal. Ship it. |
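
The deck doesn't spell out the exact statistic, but a reasonable reading of "mean ± stdev plus a z-stat" over N repeated rounds is a difference-of-means z. A minimal sketch under that assumption, with made-up per-round scores:

Python
from statistics import mean, stdev
from math import sqrt

baseline_rounds  = [0.55, 0.60, 0.50, 0.58, 0.52]   # hypothetical scores, 5 rounds each
optimized_rounds = [0.85, 0.80, 0.88, 0.83, 0.86]

# z = mean lift divided by the standard error of that lift across rounds.
diff = mean(optimized_rounds) - mean(baseline_rounds)
se = sqrt(stdev(baseline_rounds) ** 2 / len(baseline_rounds)
          + stdev(optimized_rounds) ** 2 / len(optimized_rounds))
z = diff / se
print(f"lift = {diff:+.3f}, z = {z:.1f}")   # < 2: noise, 2-5: modest, >= 5: ship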
Slide 10 / What the optimizer finds

Two patterns recur across skills.

Pattern 01

Remove, don't add.

The first win on the plan skill came from deleting the verbose 9-item self-review section. Less context, fewer ways to derail. The model already knew the format — the extra prose was diluting it.

Pattern 02

Constraints beat process instructions.

Process changes ("write the file in reverse order") regressed. The win was a presence constraint with a self-check recovery action — a model-executable rescue, not workflow overhead.

## OUTPUT BUDGET (strict)

- Max 7 steps. Action ≤ 1 sentence. Verify ≤ 1 sentence.
- `## Done when` is mandatory and the LAST section.
- Self-check: scroll to the bottom. If the last heading
  is not `## Done when`, shorten earlier Actions until it fits.
Episode 01 · Done

Stop guessing. Start measuring.

Take any skill that "feels off" — write 4 binary evals and 6 inputs, then run optimize. Worst case: 30 minutes and a clean null result. Best case: a stability-tested fix you can ship.