Alexis Rondeau & Claude Code · final report

"Which city in Paris are you staying in?" - Can a phone-sized model learn to ask frontier-quality questions? Let's find out!

A post-training study on small-model specialization with an autoresearch loop

Across 96 experiments — every prompting, retrieval, decoding, pipeline, fine-tuning, preference-learning, reinforcement-learning and adversarial-distillation lever we could find — an on-device 3B model came within about one rubric point of a frontier cloud model on every quality dimension: close, but consistently a notch below, across the board.

1 The question

Move one feature from the cloud onto the device, with no quality loss — if that's even possible.

I'm building a native macOS app with a discovery planner: type a one-line task — "plan a trip to Paris" — and it returns a restated title and exactly seven clarifying questions. Today it runs on Claude Sonnet in the cloud. I wanted it entirely local and native, on Apple's built-in ~3B Foundation Model, with no Anthropic access in the shipping product. The whole project is one question: can a ~20× smaller, on-device model match a frontier cloud model on this one narrow task?

User input
"Plan a trip to Paris"
Discovery Planner
  1. What city are you departing from?
  2. When are you planning to travel?
  3. How many people are going on this trip?
  4. How long do you plan to stay in Paris?
  5. What is your total budget for this trip?
  6. What type of accommodation do you prefer?
  7. What is the main purpose of your visit?

The planner turns a vague one-liner into seven specific clarifying questions. (Shown: the gold-standard answer from claude-sonnet-4-6 — the bar we're trying to match on-device.)

To answer it honestly I built an autoresearch loop (inspired by Andrej Karpathy's autoresearch): a frozen ruler — 30 held-out tasks, each with a Sonnet gold answer and a Sonnet judge that scores the on-device answer head-to-head — and a mutable agent that every experiment changes. The judge scores each question set 1–5 on five dimensions; quality (0–1) blends the rubric with pairwise win rate. The ruler never changes, so scores are comparable across months of work.
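As a rough illustration of how a rubric and a pairwise win rate can be blended into one 0–1 number, here is a minimal sketch. The report does not give the exact formula, so everything below is an assumption made for illustration: the linear rescaling of 1–5 scores, counting a tie as half a win, the 50/50 weight, and all function names.

```python
from statistics import mean

DIMENSIONS = ("atomicity", "specificity", "coverage", "naturalness", "non_redundancy")


def rubric_score(scores: dict[str, int]) -> float:
    """Map the five 1-5 rubric scores onto 0-1 (assumed linear rescaling)."""
    return mean((scores[d] - 1) / 4 for d in DIMENSIONS)


def win_rate(wins: int, ties: int, losses: int) -> float:
    """Pairwise win rate against gold, counting a tie as half a win (assumption)."""
    return (wins + 0.5 * ties) / (wins + ties + losses)


def quality(scores: dict[str, int], wins: int, ties: int, losses: int,
            rubric_weight: float = 0.5) -> float:
    """Blend rubric and win rate; the 50/50 default is a placeholder weight."""
    return rubric_weight * rubric_score(scores) + (1 - rubric_weight) * win_rate(wins, ties, losses)


# The single Paris case shown in this report: rubric 4/3/3/3/4, verdict "Gold better".
paris = {"atomicity": 4, "specificity": 3, "coverage": 3,
         "naturalness": 3, "non_redundancy": 4}
paris_quality = quality(paris, wins=0, ties=0, losses=1)
```

Under these assumed weights the Paris case comes out at 0.3. The real ruler may weight or normalize differently.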

One real evaluation

Here is a single case from our test set, scored exactly the way every experiment is: the user's one-liner, the cloud model's gold answer, the on-device answer, and the judge's verdict.

User input
what the user types

"Plan a trip to Paris"

Gold standard
claude-sonnet-4-6 · cloud
  1. What city are you departing from?
  2. When are you planning to travel?
  3. How many people are going on this trip?
  4. How long do you plan to stay in Paris?
  5. What is your total budget for this trip?
  6. What type of accommodation do you prefer?
  7. What is the main purpose of your visit?
On-device candidate
Apple FM ~3B · on device
  1. What is your travel date or timeframe?
  2. How many nights will you stay in Paris?
  3. What is your total budget for the trip?
  4. Which city in Paris are you staying in?
  5. Do you prefer a guided tour or self-guided exploration?
  6. Is this trip for a specific occasion or purpose?
  7. Do you have any travel restrictions or mobility needs?
Judgement
claude-sonnet-4-6

Gold better

Atomicity 4 · Specificity 3 · Coverage 3 · Naturalness 3 · Non-redundancy 4 (each scored 1–5)

The on-device set asks a nonsensical question ("Which city in Paris are you staying in?" — Paris is a city) and omits two critical unknowns — the departure city and the party size — making it less useful for planning than the gold set.

The on-device model phrases questions naturally and even adds a sensible one about mobility — but it misses the few decision-critical unknowns (where you're flying from, how many people). That gap, which we call coverage, is the central challenge of this project. On other tasks the on-device model ties or even beats the cloud model; this is one honest, representative case.

The loop itself is run by Claude Code, Anthropic's coding agent. Each iteration it reads the running log of what's already been tried, proposes the next experiment, writes and runs the code against the ruler, and records the result — one experiment at a time. I worked alongside it as supervisor: steering which levers to pursue next, giving continuous feedback on its hypotheses, and triggering the manual fine-tuning runs. So throughout this report, "we" means that pairing — an AI agent generating and running the experiments under human direction.

How one experiment runs. Claude Code drives the cycle autonomously: read the log, form one hypothesis, edit the agent, and score it against the frozen ruler — keeping the result only if it beats the running best. The human supervisor sets direction and runs the fine-tuning track the agent can't. Solid arrows are the automated loop; the dashed link is the scoring against the ruler; rust marks human input.
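The cycle described above can be sketched as a keep-if-better loop. This is a minimal sketch, not the project's actual harness: `RunLog`, `run_iteration`, `toy_propose` and `toy_score` are hypothetical names, and the toy scores are invented so the example runs end to end.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunLog:
    """Running record of what has been tried (hypothetical structure)."""
    entries: list = field(default_factory=list)
    best_quality: float = float("-inf")
    best_config: Optional[dict] = None


def run_iteration(log: RunLog, config: dict,
                  propose: Callable[[RunLog, dict], tuple],
                  score: Callable[[dict], float]) -> dict:
    """Run one experiment and return the config to carry forward."""
    hypothesis, candidate = propose(log, config)  # read the log, form one hypothesis
    quality = score(candidate)                     # frozen ruler: same cases, same judge
    kept = quality > log.best_quality              # keep only if it beats the running best
    log.entries.append({"hypothesis": hypothesis, "quality": quality, "kept": kept})
    if kept:
        log.best_quality, log.best_config = quality, candidate
        return candidate
    return config                                  # otherwise fall back to the previous agent


# Toy stand-ins with invented scores; not the real agent or judge.
def toy_propose(log, config):
    step = len(log.entries)
    return f"try variant {step}", {**config, "variant": step}


def toy_score(config):
    return {0: 0.28, 1: 0.32, 2: 0.30}[config["variant"]]


log = RunLog()
config = {"variant": None}
for _ in range(3):
    config = run_iteration(log, config, toy_propose, toy_score)
```

Every result is logged, including the ones that are discarded, so the next proposal can see what has already failed.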

2 The climb

Real progress: quality rose by more than half, from 0.28 to 0.44. Then it stalled.

  • Stock baseline: 0.28 (3B, single shot)
  • In-context ceiling: 0.32 (best of 29 prompting / RAG tries)
  • Fine-tuned adapter: 0.41 (LoRA on the model's weights)
  • Best so far: 0.44 (adapter + over-generate & prune)
[Figure: running best quality (y-axis, 0.30 to 0.50, where 0.50 marks parity, a tie on every case) over the 96 experiments in run order (x-axis). Marked levels: 0.28, 0.32, 0.41, 0.44, then stalled with no further gains.]
Running best quality across the 96 experiments. The 29 in-context tries inch from 0.28 to 0.32; the first fine-tuned adapter jumps to 0.41; over-generate-and-prune reaches the 0.44 best. After that, fifty-plus further experiments (judge-reward RL, adversarial distillation, and an exhaustive inference-topology × best-adapter cross among them) add nothing. One cross cell, exp078, flashed 0.47, but a confirmation re-judge of its identical output fell to 0.40: judge noise, not a gain. Up by more than half, then flat.

A fine-tuned SFT adapter was the only lever that broke past the prompting ceiling; everything above ~0.32 runs on one. Four of the five rubric dimensions reached 4–5. One did not.

3 Everything we tried

Seven lever classes, 96 experiments, 9 that set a new best. One dimension stuck throughout.

| Lever class | Runs | Best quality | Coverage | Gate |
| --- | --- | --- | --- | --- |
| Stock baseline | 1 | 0.28 | 3 | full-30 |
| Inference: prompt · schema · decoding · RAG · multi-call pipelines · self-critique · over-generate-prune · solution-space info-gain · answer-simulation · decomposed verification | 29 | 0.32 | 3 | dev-10 |
| SFT: LoRA adapters (imitation), incl. more data & coverage-forced data | 34 | 0.44 | 3 | full-30 |
| ORPO: preference, style-vs-style pairs | 2 | 0.36 | 3 | full-30 |
| ORPO: preference, on-policy pairs (gold vs the model's own drafts) | 2 | 0.40 | 3 | full-30 |
| GRPO: judge-reward RL, group-relative advantages over the model's own judged drafts, incl. KL-anchored | 4 | 0.41 | 3 | full-30 |
| GAD: adversarial distillation; reward is a discriminator trained to tell Sonnet gold sets from the model's drafts | 2 | 0.36 | 3 | full-30 |

Four findings hold across every row. Atomicity, specificity, naturalness and non-redundancy reached 4–5 — the on-device model phrases questions like a thoughtful human and rarely repeats itself. Coverage — whether the seven questions include the single most decision-critical unknown — sat at 3 in every one of the 96 runs (the only values ever observed were 2 and 3). It did not move for prompting, for retrieval, for elaborate multi-call reasoning, for fine-tuning, for more or reshaped training data, for preference learning, for reinforcement learning against the rubric itself, or for adversarial distillation from the teacher.

The cleanest negative. Preference learning (ORPO) first regressed (0.33) because the pairs contrasted two Sonnet styles, not good-vs-bad judgment — the model learned the wrong correlate. Rebuilding the pairs on-policy (chosen = gold, rejected = the model's own weaker draft for the same task) recovered the loss (0.40, back to the SFT baseline) — confirming the diagnosis exactly — but coverage still did not move off 3. Even a clean, correct preference signal could not teach which unknown matters.
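The on-policy pair construction described above (chosen = gold, rejected = the model's own weaker draft for the same task) can be sketched as follows. This is an illustrative sketch only: picking the lowest-scoring draft as "the weaker draft", the `prompt`/`chosen`/`rejected` field names, and the function and variable names are all assumptions, not the project's code.

```python
def build_on_policy_pairs(tasks, gold, drafts, judge_score):
    """Build one preference pair per task from the model's own drafts.

    gold:        task -> gold question set (list of strings)
    drafts:      task -> list of question sets sampled from the current model
    judge_score: question set -> numeric score (higher is better)
    """
    pairs = []
    for task in tasks:
        candidates = drafts.get(task, [])
        if not candidates:
            continue                               # no draft for this task, so no pair
        rejected = min(candidates, key=judge_score)  # assumed: weakest draft is the reject
        if rejected == gold[task]:
            continue                               # identical sets carry no preference signal
        pairs.append({"prompt": task, "chosen": gold[task], "rejected": rejected})
    return pairs


# Toy data, invented for illustration.
toy_gold = {"Plan a trip to Paris": ["What city are you departing from?"]}
toy_drafts = {"Plan a trip to Paris": [
    ["Which city in Paris are you staying in?"],
    ["When are you planning to travel?"],
]}
toy_scores = {"Which city in Paris are you staying in?": 2,
              "When are you planning to travel?": 3}
toy_pairs = build_on_policy_pairs(
    list(toy_gold), toy_gold, toy_drafts,
    judge_score=lambda qs: toy_scores[qs[0]],
)
```

Because the rejected side comes from the model being trained, the contrast is good-vs-bad judgment on the same task rather than one writing style against another.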

Every experiment, at a glance. One tile per run, in the order each was run — its on-device pipeline stripped to bare structure, coded by the legend below. Read the shapes: thin single-shot baselines, ensembles fanning wide, the tall multi-call reasoning pipelines of the inference era, then the two-column adapter runs that fill the back half.

Legend: Inference call · SFT adapter call · ORPO adapter call · Swift / user step · Adapter training (dev-time)

You can view each experiment's details — its hypothesis, method, result and process diagram — on the live dashboard.

4 The confirmation experiment

Hold everything constant. Change only the model. Watch coverage.

If coverage is locked because the harness or the prompt is wrong, then a more capable model would also be stuck at 3. If it's locked because the 3B model isn't capable enough, a stronger model — given the identical prompt, schema, gold, judge and rubric, and decoded exactly the way the gold itself was — should clear it. So we ran two cloud models as the candidate through the exact same harness. The only variable is the model.

| Candidate model | Where | Quality | Coverage | vs gold (W/T/L) |
| --- | --- | --- | --- | --- |
| Apple Foundation Model (~3B) | on-device | 0.44 | 3 | best 2/5/22 |
| Claude Haiku 4.5 | cloud | 0.55 | 4 | 4/11/15 |
| Claude Sonnet 4.6 (= the gold model) | cloud | 0.65 | 4 | 4/18/8 |

Coverage moved the moment the model changed: 3 → 4. It was never the harness — the same rubric that scored the 3B at 3 scores Haiku and Sonnet at 4. Two further details matter:

5 Is it really coverage?

The obvious follow-up: if we stopped caring about coverage, would the model reach parity? No.

Here the story turns. Everything to this point — the climb, the 96 experiments, the cloud confirmation — frames coverage as the wall: the one dimension stuck at 3, the one the judge keeps naming. But "stuck at the lowest score" and "the reason it loses" are not the same claim, and this section is where we stopped assuming they were. Coverage was the lowest dimension and by far the most-cited in the judge's notes, so it was tempting to call it the gap. To test that, we re-judged the stored candidates with the exact official judge in two modes: the normal five dimensions, and a coverage-blind mode that drops coverage from the rubric and from the "which set is better" verdict. Same candidate outputs; only the judge's instructions change. The five-dimension control reproduces the logged quality, so the comparison is sound.

| Candidate | quality, 5-dim | quality, coverage-blind | Δ | loses to gold (of 30) |
| --- | --- | --- | --- | --- |
| SFT baseline | 0.398 | 0.399 | +0.00 | 23 → 23 |
| Best adapter (exp056) | 0.415 | 0.412 | −0.00 | 22 → 22 |
| On-policy ORPO | 0.369 | 0.370 | +0.00 | 25 → 24 |

Removing coverage changes quality by essentially zero — and, the telling part, the head-to-head barely moves: told to ignore coverage entirely, the judge still prefers the gold set ~22–24 times out of 30. The losses were never mostly about coverage.

The reason is visible in the rubric the same judge assigns each model as a candidate:

| Scored as candidate | atomicity | specificity | coverage | naturalness | non-redundancy |
| --- | --- | --- | --- | --- | --- |
| On-device best (exp056) | 4 | 4 | 3 | 4 | 4 |
| Sonnet (cloud anchor) | 5 | 4 | 4 | 5 | 5 |

The on-device model matches the frontier on specificity alone and trails by about a point on everything else. Coverage is the joint-lowest, but it is not special — the gap is broad. That is why re-weighting it away does nothing: the model is a uniform notch below the frontier, and coverage is simply the symptom the judge can name in a sentence.
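A back-of-envelope check on the rubric rows above makes the same point with plain arithmetic. This is not the coverage-blind re-judge itself (that re-ran the judge with changed instructions); it only averages the two candidates' rubric scores with and without the coverage column. The helper name `mean_rubric` is ours.

```python
from statistics import mean

# Rubric rows for the two candidates, copied from this section.
ON_DEVICE = {"atomicity": 4, "specificity": 4, "coverage": 3,
             "naturalness": 4, "non_redundancy": 4}
SONNET = {"atomicity": 5, "specificity": 4, "coverage": 4,
          "naturalness": 5, "non_redundancy": 5}


def mean_rubric(scores, drop=()):
    """Average rubric score, optionally ignoring some dimensions."""
    return mean(v for k, v in scores.items() if k not in drop)


# Gap on all five dimensions: 4.6 - 3.8
gap_all = mean_rubric(SONNET) - mean_rubric(ON_DEVICE)

# Gap with coverage ignored: 4.75 - 4.0
gap_blind = (mean_rubric(SONNET, drop=("coverage",))
             - mean_rubric(ON_DEVICE, drop=("coverage",)))

print(round(gap_all, 2), round(gap_blind, 2))
```

The mean gap goes from about 0.80 rubric points to 0.75 when coverage is dropped, so nearly all of it sits in the other dimensions.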

A correlation audit puts numbers on "broad". Across all 1,552 judged question-sets the five rubric dimensions are not independent: a single latent quality factor explains 58% of their variance, and coverage correlates 0.69 with specificity and 0.61 with naturalness. So coverage was never a separable axis you could target in isolation: it co-moves with the rest, which is precisely why any coverage-specific fix was structurally unlikely to land.
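One common way to get a "share of variance explained by a single factor" is the largest eigenvalue of the correlation matrix divided by the number of dimensions (the first principal component). The report does not say which method the audit used, so treat this as a sketch of that one approach. The scores below are synthetic, and the function names are ours.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_matrix(columns):
    """Pairwise correlations between rubric dimensions (one column per dimension)."""
    return [[pearson(a, b) for b in columns] for a in columns]


def top_factor_share(corr, iterations=500):
    """Largest eigenvalue of corr divided by n, found by power iteration."""
    n = len(corr)
    # Slightly uneven start so the vector is not orthogonal to the top eigenvector.
    v = [1.0 + 0.01 * i for i in range(n)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue / n


# Synthetic 1-5 scores for six question sets, one row per rubric dimension.
synthetic = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 4, 2, 5],
    [3, 4, 2, 3, 2, 4],
    [4, 5, 3, 3, 3, 5],
    [4, 5, 4, 4, 3, 4],
]
share = top_factor_share(correlation_matrix(synthetic))
```

With five fully independent dimensions the share would be 0.20; the closer it gets to 1.0, the more the dimensions move together.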

6 The verdict

The gap is a broad capability limit of the 3B base — not a coverage-specific wall, and not a method we failed to find.

Imitation, preference learning, reinforcement learning against the rubric reward, adversarial distillation from the teacher, and every inference-time technique we could find leave the on-device model a consistent notch below the frontier on every dimension. A more capable model clears that notch on the identical harness (coverage 3→4, and 4→5 elsewhere), and removing coverage from the judging recovers none of the gap. So the shortfall is bound to model capacity, broadly — no amount of further training or scaffolding on this base is likely to close it.

Four angles on coverage, one wall, and the mechanism for each. The final stretch attacked coverage directly from every side, and each route failed for a reason we could measure.

  • Generation (fine-tuning) imitates the gold's phrasing but not the judgment that chose it.
  • On-policy reward (GRPO) has nothing to optimise: the model's own drafts almost never vary on coverage, so the group-relative signal is flat. Left unregularised it drifts off the natural-language manifold, and a KL anchor pulls it straight back to the 0.41 baseline.
  • Adversarial distillation (GAD) grounds the reward in the teacher, but a probe shows a discriminator separates the model from Sonnet on style (AUC 0.76), not content (0.62), so it re-learns the register, not the coverage.
  • Discrimination, a dedicated critic that only has to recognise the better set, fails too: trained to predict the judge's coverage score it reaches 0.999 on seen tasks but generalises to nothing on new ones, because recognising the pivotal unknown takes the same value-of-information judgment as producing it.

Every channel for importing that judgment from outside the 3B narrows the gap or recovers the baseline; none crosses it.
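The flat group-relative signal is easy to see in a small sketch of the GRPO-style advantage, where each draft's reward is standardised within its own group. This is a simplified illustration and not the project's training code: the function name, the `eps` constant and the reward values are assumptions. The group size of six matches the G=6 drafts per task noted in the references.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Six drafts for one task that all score coverage 3: every advantage is zero,
# so the policy gradient has nothing to push on.
flat = group_relative_advantages([3, 3, 3, 3, 3, 3])

# Invented contrast case: if the drafts did vary, the best and worst would
# get opposite-signed advantages.
varied = group_relative_advantages([2, 3, 3, 3, 3, 4])
```

When every draft in a group gets the same reward, the advantages are all zero and the update carries no information about which draft was better.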

What this means for the product. The on-device model is uniformly close — roughly two-thirds of the frontier's practical quality, about a point below on nearly every dimension, and strong in absolute terms (4 of 5 on most). The shortfall is broad and capacity-bound, so there is no single dimension to fix. Three honest paths:

Ship & document

  • Ship the 0.44 adapter; document it honestly as uniformly near-parity — a notch below the frontier, not broken on any one axis.
  • Best fit if local + free + private outweighs a consistently-slightly-weaker question set.

Reframe the bar

  • Stop measuring against Sonnet-parity and measure absolute usefulness — does the set enable a good plan? The model may be plenty good for the product even a notch below the frontier.
  • Changes the question from "match the cloud" to "is this good enough," which the raw scores can't answer alone.

Wait for a bigger base

  • Re-run this exact harness when Apple ships a larger on-device model; it is now a turnkey regression test.
  • No work until then; outcome outside our control.

My recommendation: ship and document, and reframe the bar to absolute usefulness. The model is a uniform notch below the frontier, not broken anywhere; at ~two-thirds of the practical ceiling and strong in absolute terms it is plausibly good enough — and the broad, capacity-bound gap won't yield to more work on this base. Re-test when a larger on-device model ships.

7 Reproducibility

Every experiment is logged with its hypothesis, method, result and a process diagram on the dashboard; the narrative is in the project overview. The frozen ruler (30 gold cases + Sonnet judge), all agent configs, the LoRA training scripts (SFT, ORPO, the KL-anchored GRPO policy gradient, and the GAD discriminator), the on-policy draft generator, the cloud ceiling-confirmation harness, the coverage-blind re-judge (§5), the rubric correlation audit, and the coverage-critic probe are in the repository. The autoresearch loop runs unattended and can request the manual training track when it judges the weights are the answer. The cloud anchors in §4 and the mechanistic analyses in §5–§6 are reference only, deliberately kept out of the on-device run log.

8 References

Not built from scratch — the levers we tried, and the papers each one came from. For each: what the paper introduced, and how we used it here.

In-context phase — prompting, retrieval, decoding

  1. Language Models are Few-Shot Learners · Brown et al., 2020 · GPT-3 · NeurIPS 2020

    Showed a large language model can perform a new task from a few examples placed in the prompt, with no weight updates.

    Used: Our first lever, feeding the 3B a handful of gold question-sets as in-prompt demonstrations to steer its output before touching the weights.

    Experiments: Directly used by the early few-shot prompting runs, exp003 (2 nearest exemplars) and exp007 (semantic-embedding retrieval); influenced the whole in-context phase (exp003–exp027), which all built on in-prompt demonstrations.

  2. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · Lewis et al., 2020 · RAG · NeurIPS 2020

    Pairs a generator with a retriever that pulls relevant passages from an external index at inference time.

    Used: Retrieving the most relevant gold examples per task to use as demonstrations, instead of a fixed prompt set.

    Experiments: Directly used in the retrieval runs: exp003/exp007 (nearest-exemplar retrieval), exp020 (corpus-grounded ranking over a 3-temp pool), and exp040 (leak-free 619-pair corpus RAG, the best honest in-context add-on at 0.405); also exp033 leave-one-out RAG on the adapter.

  3. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries · Carbonell & Goldstein, 1998 · SIGIR'98

    A selection criterion that trades off relevance against novelty to avoid picking redundant items.

    Used: Choosing demonstration sets that were relevant but diverse; in our runs MMR selection didn't beat a simpler top-k.

    Experiments: Directly used in exp043 (MMR-diverse corpus few-shot, k=3, λ=0.6), which scored 0.382, below exp040's plain top-2-nearest (0.405); diversifying the demos did not diversify the output.

  4. The Curious Case of Neural Text Degeneration · Holtzman et al., 2019 · nucleus / top-p sampling · ICLR 2020

    Introduced top-p (nucleus) sampling, truncating the unreliable tail of the distribution to balance quality and diversity.

    Used: Our decoding controls: the top-p / temperature settings for single-shot output and for the over-generate-and-prune sampler (topP .95, t .9).

    Experiments: Directly used by the multi-temperature over-generate-and-prune runs, exp004 (best-of-4 temps) and exp020 (21 candidates over temps 0.4/0.7/1.0), and by the GRPO on-policy samplers (topP .95 / t .9 in exp057–058); the over-generate-and-prune champion is exp056 (0.44).

Fine-tuning

  1. Finetuned Language Models Are Zero-Shot Learners · Wei et al., 2021 · FLAN · ICLR 2022

    Fine-tuning on tasks phrased as natural-language instructions sharply improves zero-shot generalization.

    Used: The template for our supervised fine-tuning: training the model on (task → seven questions) pairs in instruction form.

    Experiments: The template for every SFT adapter: adapter_e1 (first dev-10 spike), adapter_v2a_e1 (the 0.409 SFT champion), and the v2b coverage-forced variant, all trained on (task → seven questions) instruction pairs.

  2. LoRA: Low-Rank Adaptation of Large Language Models · Hu et al., 2021 · ICLR 2022

    Freezes the base weights and trains small low-rank adapter matrices, cutting trainable parameters by orders of magnitude.

    Used: Our main fine-tuning method: the LoRA adapter that lifted quality from the 0.32 prompting ceiling to 0.41.

    Experiments: Directly used for every adapter run: adapter_v2a_e1 was the first to clear the prompting ceiling (0.409), and it became the base for the entire 0.40–0.44 adapter family (exp028–exp056) plus all the ORPO/GRPO/GAD fine-tunes.

Preference learning & reinforcement learning

  1. Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafailov et al., 2023 · DPO · NeurIPS 2023

    Aligns a model to preference pairs with a simple classification loss — no separate reward model or RL loop.

    Used: Preference-learning experiments: training on chosen/rejected question-set pairs to push toward gold-like output.

    Experiments: Influence/precedent only; no DPO run exists. It motivated the preference-learning direction; the variant we actually implemented was ORPO (adapter_orpo_e1/e2 and on-policy adapter_orpo_op_e1/e2).

  2. ORPO: Monolithic Preference Optimization without Reference Model · Hong et al., 2024

    Folds preference optimization into SFT in a single stage via an odds-ratio penalty, with no reference model.

    Used: Our preference runs combining the SFT and preference objectives in one pass, including the on-policy ORPO variant.

    Experiments: Directly used in adapter_orpo_e1/e2 (offline ORPO, chosen=gold vs rejected=model drafts) and adapter_orpo_op_e1/e2 (on-policy ORPO); both stayed at/below the SFT champion and left coverage pinned at 3.

  3. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models · Shao et al., 2024 · introduces GRPO

    Introduced GRPO, a PPO variant that drops the critic and estimates the baseline from group-relative scores of sampled outputs.

    Used: The RL algorithm behind exp057–058: a judge-rubric reward with group-relative advantage over G=6 on-policy drafts per task.

    Experiments: Directly implemented in exp057 (judge-reward GRPO, no KL anchor, 0.328), exp058 / adapter_grpo_kl_e1 (KL-anchored, recovered to 0.407 ≈ base), and reused as the policy-gradient core of the GAD run (exp059).

  4. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · DeepSeek-AI, 2025 · also Nature 645:633–638

    Showed strong reasoning can be incentivized by RL against a reward, without supervised reasoning traces.

    Used: The motivation for trying rubric-grounded RL on our non-verifiable, judge-scored task; the lever sweep that pointed us at GRPO.

    Experiments: Framing/motivation only: the literature sweep cited in exp057's log that pointed us at rubric-grounded RL. It maps to no run of its own, only to the GRPO attempts (exp057–058) it inspired.

  5. Training Language Models to Follow Instructions with Human Feedback · Ouyang et al., 2022 · InstructGPT / RLHF · NeurIPS 2022

    Defined the standard RLHF pipeline (SFT → reward model → RL), and found a 1.3B aligned model preferred over 175B GPT-3.

    Used: The conceptual backbone for our reward-based fine-tuning, and direct evidence for the thesis that a small specialized model can beat a far larger general one.

    Experiments: Framing for the whole study (the specialization thesis) plus the conceptual backbone for the reward-based runs, the ORPO (adapter_orpo_*) and GRPO (exp057–058) fine-tunes; not a single run.

  6. Constitutional AI: Harmlessness from AI Feedback · Bai et al., 2022 · RLAIF

    Replaces human preference labels with model-generated ones (RL from AI Feedback), supervised only by a set of principles.

    Used: The precedent for using an AI (Sonnet) rather than humans as the preference and reward source throughout our loop.

    Experiments: Influenced the whole loop: the Sonnet AI-feedback reward in exp057–058 (GRPO judge-rubric reward) and the discriminator teacher-likeness reward in exp059 (GAD); also the source of the ORPO preference pairs.

Evaluation & the specialization thesis

  1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · Zheng et al., 2023 · NeurIPS 2023 D&B

    Validated using a strong LLM as an automatic judge — including pairwise comparison — against human preference.

    Used: The basis for our frozen Sonnet "ruler": rubric scoring plus pairwise win-rate as the metric held constant across every experiment.

    Experiments: Used by every experiment: the frozen Sonnet rubric + pairwise judge scored all 96 runs (baseline through exp079); it was also the literal reward signal in the GRPO runs (exp057–058).

  2. Multitask Prompted Training Enables Zero-Shot Task Generalization · Sanh et al., 2021 · T0 · ICLR 2022

    An 11B model trained on prompted multitask data often beats models up to 16× larger, zero-shot.

    Used: Prior evidence that specialization can close size gaps; the optimistic prior this whole study set out to test on a 3B.

    Experiments: Framing/motivation for the whole study, the specialization prior the entire program tested; not tied to any single run.

Distillation lineage — the GAD experiment

  1. Distilling the Knowledge in a Neural Network · Hinton, Vinyals & Dean, 2015 · NIPS 2014 DL Workshop

    Trains a small "student" model to match a larger "teacher" model's outputs, transferring its knowledge.

    Used: The conceptual basis for distilling Sonnet (teacher) into the 3B (student); the framing behind our GAD experiment.

    Experiments: The distillation framing behind exp059 / adapter_gad_e1 (Sonnet teacher → 3B student); the SFT adapters (adapter_v2a, v2b) are also a form of teacher-text distillation from Sonnet gold.

  2. Generative Adversarial Networks · Goodfellow et al., 2014

    Trains a generator against a discriminator in a minimax game; the discriminator learns to tell real data from generated.

    Used: The adversarial setup we adapted for GAD (exp059): a discriminator trained to separate the model's drafts from Sonnet's, its teacher-likeness score used as the RL reward. GAD itself is our own construction; there is no single canonical "generative adversarial distillation" paper.

    Experiments: Directly used in exp059 / adapter_gad_e1: a MiniLM-embedding discriminator (val AUC 0.629) trained to separate the model's drafts from Sonnet's, its P(teacher-like) score feeding the GRPO reward; regressed to 0.355 and left coverage at 3.

The autoresearch loop methodology is inspired by Andrej Karpathy's autoresearch (linked in §1). Only ORPO (2024), DeepSeekMath/GRPO (2024) and DeepSeek-R1 (2025) are recent; the rest are foundational methods we applied off the shelf.