Alexis Rondeau · live research log
Autoresearch progress
Loading the experiment record…
| # | Label |
Operator |
Type |
Set | Quality |
Spec |
W/T/L |
Cost |
Move |
What do the columns & rubric mean?
- Type
- The lever group. Inference = inference-time only (prompt / decoding / RAG / topology, no weight change). SFT = runs on a LoRA adapter trained by imitation on Sonnet gold. ORPO = runs on an adapter trained on chosen/rejected preference pairs.
- Quality
- 0–1, higher is better —
0.5 × rubric + 0.5 × pairwise, averaged over cases.
- Spec
- % of cases passing the hard technical gate: exactly 7 atomic questions, valid title, no emojis. A failed case scores 0.
- W/T/L
- Head-to-head vs Sonnet's gold — Wins / Ties / Losses. A Win means the on-device questions were judged better than Sonnet's.
- The two sets
- Set A = Gold (Sonnet) vs Set B = On-device. The judge picks which is better and scores Set B 1–5 on five dimensions:
- atomicity
- asks exactly one thing (no “and/or”)
- specificity
- concrete to this task, not generic filler
- coverage
- the 7 cover the most decision-critical unknowns
- naturalness
- reads like a thoughtful human coach
- nonRedundancy
- no overlapping or repeated questions