Alexis Rondeau · live research log

Autoresearch progress

Loading the experiment record…

Discarded Kept improvement Running best ×Invalid
#Label Operator Type SetQuality Spec W/T/L Cost Move
What do the columns & rubric mean?
Type
The lever group. Inference = inference-time only (prompt / decoding / RAG / topology, no weight change). SFT = runs on a LoRA adapter trained by imitation on Sonnet gold. ORPO = runs on an adapter trained on chosen/rejected preference pairs.
Quality
0–1, higher is better — 0.5 × rubric + 0.5 × pairwise, averaged over cases.
Spec
% of cases passing the hard technical gate: exactly 7 atomic questions, valid title, no emojis. A failed case scores 0.
W/T/L
Head-to-head vs Sonnet's gold — Wins / Ties / Losses. A Win means the on-device questions were judged better than Sonnet's.
The two sets
Set A = Gold (Sonnet)  vs  Set B = On-device. The judge picks which is better and scores Set B 1–5 on five dimensions:
atomicity
asks exactly one thing (no “and/or”)
specificity
concrete to this task, not generic filler
coverage
the 7 cover the most decision-critical unknowns
naturalness
reads like a thoughtful human coach
nonRedundancy
no overlapping or repeated questions
Click any row to see what the experiment tried, the gold-vs-on-device questions side by side, and the judge's per-case notes.