Optimize Claude Code skills, automatically.
An LLM-as-judge optimizer that finds the changes that actually make your skill files work — verified with stability-aware statistics, not vibes.
Skill files degrade silently.
Skill files are markdown instructions written for Claude. No compiler, no test suite, no signal that the AI actually follows them — until something breaks.
The manual path
Read 100 outputs. Spot the pattern. Slow, biased, and never replicable. The fix that "felt right" is one you can't defend or re-run when the model version changes.
The autoresearch path
Write 4 binary evals once. Run autoresearch optimize. Get a stability-tested winner with a z-stat in 30 minutes. Reproducible, defensible, model-agnostic.
Three commands. Nothing else.
The whole tool is three subcommands. Each does one thing and prints a number you can defend.
autoresearch validate
Re-runs the judge K times on the same outputs and asks you to hand-grade some. Reports Fleiss' kappa (test-retest) and Cohen's kappa (judge-vs-human). Both ≥ 0.7 means the judge is trustworthy.
autoresearch optimize
Greedy mutation loop. Up to 20 experiments. Keeps changes that improve the train score, discards the rest. Logs every experiment to JSONL.
autoresearch compare
Head-to-head scoring. --repeat N runs N rounds and reports mean ± stdev with a z-stat. Below z=2 the result is noise. Above z=2 it's real.
cd:plan went from 0.388 to 1.000 on holdout.
We ran Autoresearch against the real cd:plan skill from cd-skills. Four binary evals, 2 train inputs, 4 holdout inputs Haiku had never seen.
| Eval | Baseline | Optimized |
|---|---|---|
| has_checkbox_steps | 9/12 | 12/12 |
| every_step_has_verify | 0/12 | 12/12 |
| steps_reference_files | 12/12 | 12/12 |
| has_done_when | 0/12 | 12/12 |
- Holdout lift: +61.2pp
- z-stat: 19.56 (z ≥ 2 means real signal)
- stdev: 0.000 across 5 rounds (zero variance)
Install as a plugin.
Inside Claude Code:
```
/plugin marketplace add Coherence-Daddy/autoresearch
/plugin install autoresearch@autoresearch
/reload-plugins
```
That registers the autoresearch:run skill so any agent can invoke it. The skill itself instructs the agent to clone the repo and install the Python CLI:
```
git clone https://github.com/Coherence-Daddy/autoresearch ~/autoresearch
cd ~/autoresearch
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY=sk-ant-...
```
One-time setup, then every autoresearch call reuses the venv.
Tell Autoresearch what "good" means.
An eval config is a YAML file. It declares the skill to optimize, 2-6 test inputs, a holdout set, and 2-5 binary evals.
```yaml
target_skill: my-skill.md
generator_model: claude-haiku-4-5-20251001
judge_model: claude-haiku-4-5-20251001
mutator_model: claude-opus-4-7
runs_per_input: 3

inputs:
  - name: case_a
    prompt: "The prompt you'd give the skill in real use."

holdout:
  - name: unseen_case
    prompt: "A prompt the optimizer never sees."

evals:
  - name: has_required_section
    question: "Does the output contain X?"
    pass_condition: "What a passing output looks like."
    fail_condition: "What a failing output looks like."
```
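The structural rules above (2-6 inputs, 2-5 evals, required keys) can be sanity-checked before a run. A minimal sketch, assuming the YAML has already been parsed into a dict; `check_config` is a hypothetical helper, not the tool's actual validator:

```python
def check_config(cfg):
    """Return a list of problems with a parsed eval-config dict."""
    problems = []
    # Required top-level keys, per the example config.
    for key in ("target_skill", "generator_model", "judge_model",
                "mutator_model", "runs_per_input"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    # The docs call for 2-6 test inputs and 2-5 binary evals.
    if not 2 <= len(cfg.get("inputs", [])) <= 6:
        problems.append("inputs must have 2-6 entries")
    if not 2 <= len(cfg.get("evals", [])) <= 5:
        problems.append("evals must have 2-5 entries")
    return problems
```

An empty list means the config clears the structural gates; anything else is a reason the run would fail early.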
Don't trust the judge until you've measured it.
```
autoresearch validate my-config.yaml
```
The validator runs the judge K times on the same outputs (test-retest), then prompts you to hand-grade them (judge-vs-human). It prints a kappa per eval.
| Gate | Pass threshold | What it means |
|---|---|---|
| Test-retest κ | ≥ 0.7 | Judge is consistent with itself |
| Judge-human κ | ≥ 0.7 | Judge agrees with you |
| Agreement | ≥ 0.8 | Same verdict 80%+ of the time |
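For binary verdicts, Cohen's kappa is a few lines. This is the standard formula as a sketch, not Autoresearch's internal code:

```python
def cohens_kappa(judge, human):
    """Cohen's kappa for two equal-length binary (0/1) verdict lists."""
    n = len(judge)
    # Observed agreement: fraction of items where the two raters match.
    po = sum(j == h for j, h in zip(judge, human)) / n
    # Expected chance agreement, from each rater's base rate of 1s.
    pj, ph = sum(judge) / n, sum(human) / n
    pe = pj * ph + (1 - pj) * (1 - ph)
    return (po - pe) / (1 - pe)
```

Raw agreement can look high while kappa stays low when one verdict dominates, which is why the gate uses kappa rather than agreement alone.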
Run the loop.
```
autoresearch optimize my-config.yaml
```
For each experiment (max 20 by default):
- **Mutate:** Opus reads the failing examples and proposes a single skill change.
- **Score:** Haiku generates outputs against the train inputs; the judge scores them.
- **Decide:** If the new score beats the best, keep it. Otherwise discard and try again.
- **Holdout check:** Every 5 experiments, score against the held-out inputs to detect overfitting.
Output is saved to runs/<skill>-<timestamp>/: the baseline, the best, every snapshot, and a JSONL log of every experiment.
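The accept/reject core of that loop can be sketched as plain greedy search. `mutate` and `score` stand in for the Opus mutator and the judge; the names and signature are illustrative, not the tool's API:

```python
import json

def greedy_optimize(skill, mutate, score, max_experiments=20, log_path=None):
    best, best_score = skill, score(skill)
    log = []
    for i in range(max_experiments):
        candidate = mutate(best)           # one proposed change per experiment
        s = score(candidate)               # judge-based train score
        kept = s > best_score              # keep only strict improvements
        if kept:
            best, best_score = candidate, s
        log.append({"experiment": i, "score": s, "kept": kept})
    if log_path:                           # JSONL: one record per experiment
        with open(log_path, "w") as f:
            f.writelines(json.dumps(r) + "\n" for r in log)
    return best, best_score
```

Because rejected mutations are discarded rather than chained, a bad proposal costs one experiment and nothing else.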
Always run --repeat 5.
```
autoresearch compare my-config.yaml \
  runs/my-skill-2026-04-27/SKILL.md.baseline \
  runs/my-skill-2026-04-27/SKILL.md.best \
  --holdout --repeat 5
```
Single-shot comparisons are dominated by generator-temperature noise — they will lie to you. --repeat 5 runs the comparison five times and reports mean ± stdev plus a z-stat.
| z-stat | Verdict |
|---|---|
| < 2 | Noise. Don't ship this change. |
| 2 – 5 | Real but modest lift. |
| ≥ 5 | Strong signal. Ship it. |
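One plausible form of that z-stat is the mean per-round lift over its standard error. The tool's exact formula isn't documented here, so treat this as a sketch of the idea:

```python
from math import sqrt
from statistics import mean, stdev

def z_stat(deltas):
    """deltas: per-round score differences (candidate minus baseline)."""
    m, s = mean(deltas), stdev(deltas)
    if s == 0:
        # Zero variance across rounds, as in the 5-round cd:plan run.
        return float("inf")
    return m / (s / sqrt(len(deltas)))
```

More rounds shrink the standard error, so a modest but consistent lift can still clear z = 2, while a large one-round swing cannot.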
Two patterns recur across skills.
Remove, don't add.
The first win on the plan skill came from deleting the verbose 9-item self-review section. Less context, fewer ways to derail. The model already knew the format — the extra prose was diluting it.
Constraints beat process instructions.
Process changes ("write the file in reverse order") regressed. The win was a presence constraint with a self-check recovery action — a model-executable rescue, not workflow overhead.
```markdown
## OUTPUT BUDGET (strict)
- Max 7 steps. Action ≤ 1 sentence. Verify ≤ 1 sentence.
- `## Done when` is mandatory and the LAST section.
- Self-check: scroll to the bottom. If the last heading is not `## Done when`,
  shorten earlier Actions until it fits.
```
Stop guessing. Start measuring.
Take any skill that "feels off" — write 4 binary evals, 6 inputs, run optimize. Worst case: 30 minutes and a clean null-result. Best case: a stability-tested fix you can ship.