Optimize Claude Code skills, automatically.
An LLM-as-judge optimizer that finds the changes that actually make your skill files work — verified with stability-aware statistics, not vibes.
Skill files degrade silently.
Skill files are markdown instructions written for Claude. No compiler, no test suite, no signal that the AI actually follows them — until something breaks.
The manual path
Read 100 outputs. Spot the pattern. Slow, biased, and never replicable. The fix that "felt right" is one you can't defend or re-run when the model version changes.
The autoresearch path
Write 4 binary evals once. Run autoresearch optimize. Get a stability-tested winner with a z-stat in 30 minutes. Reproducible, defensible, model-agnostic.
Three commands. Nothing else.
The whole tool is three subcommands. Each does one thing and prints a number you can defend.
autoresearch validate
Re-runs the judge K times on the same outputs and asks you to hand-grade some. Reports Fleiss' kappa (test-retest) and Cohen's kappa (judge-vs-human). Both ≥ 0.7 means the judge is trustworthy.
autoresearch optimize
Greedy mutation loop. Up to 20 experiments. Keeps changes that improve the train score, discards the rest. Logs every experiment to JSONL.
autoresearch compare
Head-to-head scoring. --repeat N runs N rounds and reports mean ± stdev with a z-stat. Below z=2 the result is noise. Above z=2 it's real.
cd:plan went from 0.388 to 1.000 on holdout.
We ran Autoresearch against the real cd:plan skill from cd-skills. Four binary evals, 2 train inputs, 4 holdout inputs Haiku had never seen.
| Eval | Baseline | Optimized |
|---|---|---|
| has_checkbox_steps | 9/12 | 12/12 |
| every_step_has_verify | 0/12 | 12/12 |
| steps_reference_files | 12/12 | 12/12 |
| has_done_when | 0/12 | 12/12 |
- Holdout lift: +61.2pp
- z-stat: 19.56 (z ≥ 2 means real signal)
- stdev: 0.000 across 5 rounds (zero variance)
Install as a plugin.
Inside Claude Code:
```
/plugin marketplace add Coherence-Daddy/autoresearch
/plugin install autoresearch@autoresearch
/reload-plugins
```
That registers the autoresearch:run skill so any agent can invoke it. The skill itself instructs the agent to clone the repo and install the Python CLI:
```
git clone https://github.com/Coherence-Daddy/autoresearch ~/autoresearch
cd ~/autoresearch
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY=sk-ant-...
```
One-time setup, then every autoresearch call reuses the venv.
Tell Autoresearch what "good" means.
An eval config is a YAML file. It declares the skill to optimize, 2-6 test inputs, a holdout set, and 2-5 binary evals.
```yaml
target_skill: my-skill.md
generator_model: claude-haiku-4-5-20251001
judge_model: claude-haiku-4-5-20251001
mutator_model: claude-opus-4-7
runs_per_input: 3

inputs:
  - name: case_a
    prompt: "The prompt you'd give the skill in real use."

holdout:
  - name: unseen_case
    prompt: "A prompt the optimizer never sees."

evals:
  - name: has_required_section
    question: "Does the output contain X?"
    pass_condition: "What a passing output looks like."
    fail_condition: "What a failing output looks like."
```
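The structural rules above (2-6 inputs, 2-5 evals, required keys) can be sanity-checked before a run. A minimal sketch, assuming the YAML has already been parsed into a dict; `check_config` is a hypothetical helper, not the tool's actual validator:

```python
def check_config(cfg):
    """Return a list of problems with a parsed eval-config dict."""
    problems = []
    # Required top-level keys, per the example config.
    for key in ("target_skill", "generator_model", "judge_model",
                "mutator_model", "runs_per_input"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    # The docs call for 2-6 test inputs and 2-5 binary evals.
    if not 2 <= len(cfg.get("inputs", [])) <= 6:
        problems.append("inputs must have 2-6 entries")
    if not 2 <= len(cfg.get("evals", [])) <= 5:
        problems.append("evals must have 2-5 entries")
    return problems
```

An empty list means the config clears the structural gates; anything else is a reason the run would fail early.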
Don't trust the judge until you've measured it.
```
autoresearch validate my-config.yaml
```
The validator runs the judge K times on the same outputs (test-retest), then prompts you to hand-grade them (judge-vs-human). It prints a kappa per eval.
| Gate | Pass threshold | What it means |
|---|---|---|
| Test-retest κ | ≥ 0.7 | Judge is consistent with itself |
| Judge-human κ | ≥ 0.7 | Judge agrees with you |
| Agreement | ≥ 0.8 | Same verdict 80%+ of the time |
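For binary verdicts, Cohen's kappa is a few lines. This is the standard formula as a sketch, not Autoresearch's internal code:

```python
def cohens_kappa(judge, human):
    """Cohen's kappa for two equal-length binary (0/1) verdict lists."""
    n = len(judge)
    # Observed agreement: fraction of items where the two raters match.
    po = sum(j == h for j, h in zip(judge, human)) / n
    # Expected chance agreement, from each rater's base rate of 1s.
    pj, ph = sum(judge) / n, sum(human) / n
    pe = pj * ph + (1 - pj) * (1 - ph)
    return (po - pe) / (1 - pe)
```

Raw agreement can look high while kappa stays low when one verdict dominates, which is why the gate uses kappa rather than agreement alone.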
Run the loop.
```
autoresearch optimize my-config.yaml
```
For each experiment (max 20 by default):
- **Mutate:** Opus reads the failing examples and proposes a single skill change.
- **Score:** Haiku generates outputs against the train inputs; the judge scores them.
- **Decide:** If the new score beats the best, keep it. Otherwise discard and try again.
- **Holdout check:** Every 5 experiments, score against the held-out inputs to detect overfitting.
Output is saved to runs/<skill>-<timestamp>/: the baseline, the best, every snapshot, and a JSONL log of every experiment.
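The accept/reject core of that loop can be sketched as plain greedy search. `mutate` and `score` stand in for the Opus mutator and the judge; the names and signature are illustrative, not the tool's API:

```python
import json

def greedy_optimize(skill, mutate, score, max_experiments=20, log_path=None):
    best, best_score = skill, score(skill)
    log = []
    for i in range(max_experiments):
        candidate = mutate(best)           # one proposed change per experiment
        s = score(candidate)               # judge-based train score
        kept = s > best_score              # keep only strict improvements
        if kept:
            best, best_score = candidate, s
        log.append({"experiment": i, "score": s, "kept": kept})
    if log_path:                           # JSONL: one record per experiment
        with open(log_path, "w") as f:
            f.writelines(json.dumps(r) + "\n" for r in log)
    return best, best_score
```

Because rejected mutations are discarded rather than chained, a bad proposal costs one experiment and nothing else.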
Always run --repeat 5.
```
autoresearch compare my-config.yaml \
  runs/my-skill-2026-04-27/SKILL.md.baseline \
  runs/my-skill-2026-04-27/SKILL.md.best \
  --holdout --repeat 5
```
Single-shot comparisons are dominated by generator-temperature noise — they will lie to you. --repeat 5 runs the comparison five times and reports mean ± stdev plus a z-stat.
| z-stat | Verdict |
|---|---|
| < 2 | Noise. Don't ship this change. |
| 2 – 5 | Real but modest lift. |
| ≥ 5 | Strong signal. Ship it. |
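One plausible form of that z-stat is the mean per-round lift over its standard error. The tool's exact formula isn't documented here, so treat this as a sketch of the idea:

```python
from math import sqrt
from statistics import mean, stdev

def z_stat(deltas):
    """deltas: per-round score differences (candidate minus baseline)."""
    m, s = mean(deltas), stdev(deltas)
    if s == 0:
        # Zero variance across rounds, as in the 5-round cd:plan run.
        return float("inf")
    return m / (s / sqrt(len(deltas)))
```

More rounds shrink the standard error, so a modest but consistent lift can still clear z = 2, while a large one-round swing cannot.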
Two patterns recur across skills.
Remove, don't add.
The first win on the plan skill came from deleting the verbose 9-item self-review section. Less context, fewer ways to derail. The model already knew the format — the extra prose was diluting it.
Constraints beat process instructions.
Process changes ("write the file in reverse order") regressed. The win was a presence constraint with a self-check recovery action — a model-executable rescue, not workflow overhead.
```markdown
## OUTPUT BUDGET (strict)
- Max 7 steps. Action ≤ 1 sentence. Verify ≤ 1 sentence.
- `## Done when` is mandatory and the LAST section.
- Self-check: scroll to the bottom. If the last heading is not `## Done when`,
  shorten earlier Actions until it fits.
```
Stop guessing. Start measuring.
Take any skill that "feels off" — write 4 binary evals, 6 inputs, run optimize. Worst case: 30 minutes and a clean null-result. Best case: a stability-tested fix you can ship.