FM Discovery — Autoresearch Progress

#	Label	Operator	Type	Set	Quality	Spec	W/T/L	Cost	Move

What do the columns & rubric mean?

Type: The lever group. Inference = inference-time only (prompt / decoding / RAG / topology, no weight change). SFT = runs on a LoRA adapter trained by imitation on Sonnet gold. ORPO = runs on an adapter trained on chosen/rejected preference pairs.
Quality: 0–1, higher is better. Computed per case as 0.5 × rubric + 0.5 × pairwise, then averaged over cases.
Spec: Percentage of cases that pass the hard technical gate (exactly 7 atomic questions, a valid title, no emojis). A case that fails the gate scores 0.
W/T/L: Head-to-head vs Sonnet's gold — Wins / Ties / Losses. A Win means the on-device questions were judged better than Sonnet's.
The two sets: Set A = Gold (Sonnet), Set B = On-device. The judge picks the better set and scores Set B from 1 to 5 on five dimensions:
atomicity: asks exactly one thing (no “and/or”)
specificity: concrete to this task, not generic filler
coverage: the 7 cover the most decision-critical unknowns
naturalness: reads like a thoughtful human coach
nonRedundancy: no overlapping or repeated questions
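
The column definitions above can be read as a small scoring routine. The following is only a minimal sketch under stated assumptions, not the dashboard's actual implementation: the rescaling of the 1–5 rubric mean to 0–1, the pairwise credit (win = 1, tie = 0.5, loss = 0), the "non-empty" test for a valid title, the emoji code-point ranges, and every name (`Case`, `passes_spec`, `case_quality`, `summarize`) are hypothetical. Atomicity is judged rather than computed, so the gate here checks only the question count.

```python
import re
from dataclasses import dataclass

RUBRIC_DIMENSIONS = ("atomicity", "specificity", "coverage",
                     "naturalness", "nonRedundancy")

# Assumption: a rough emoji detector covering common pictograph ranges only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Assumption: how a pairwise verdict maps to a 0-1 credit.
PAIRWISE_CREDIT = {"win": 1.0, "tie": 0.5, "loss": 0.0}


@dataclass
class Case:
    title: str
    questions: list   # on-device questions (Set B)
    rubric: dict      # dimension name -> judge score in 1..5
    verdict: str      # "win", "tie" or "loss" versus the gold set


def passes_spec(case):
    """Hard gate: exactly 7 questions, a valid title, no emojis."""
    has_seven = len(case.questions) == 7
    valid_title = bool(case.title.strip())  # assumption: valid means non-empty
    texts = [case.title, *case.questions]
    no_emoji = not any(EMOJI_RE.search(t) for t in texts)
    return has_seven and valid_title and no_emoji


def case_quality(case):
    """0.5 x rubric + 0.5 x pairwise; a case that fails the gate scores 0."""
    if not passes_spec(case):
        return 0.0
    mean_score = sum(case.rubric[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)
    rubric = (mean_score - 1) / 4  # assumption: rescale 1..5 to 0..1
    return 0.5 * rubric + 0.5 * PAIRWISE_CREDIT[case.verdict]


def summarize(cases):
    """Aggregate Quality, Spec (%) and W/T/L over a non-empty list of cases."""
    n = len(cases)
    wtl = tuple(sum(c.verdict == v for c in cases) for v in ("win", "tie", "loss"))
    return {
        "quality": sum(case_quality(c) for c in cases) / n,
        "spec": 100.0 * sum(passes_spec(c) for c in cases) / n,
        "wtl": wtl,
    }
```

In this sketch W/T/L counts every verdict, including cases that fail the gate; whether the real run log does the same is not stated in the definitions.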

Click any row to see what the experiment tried, the gold-vs-on-device questions side by side, and the judge's per-case notes.

Reference analyses

Controlled probes and diagnostics — not scored on the frozen ruler, so they never enter the run log above, but they are part of the experiment record. Each closes or reframes a question.

#	Analysis	Question	Result	Verdict