Alexis Rondeau & Claude Code · final report

"Which city in Paris are you staying in?"

Can an LLM autoresearch loop teach a phone-sized small language model (SLM) to ask frontier-quality questions? Let's find out!

A post-training study on small-model specialization with an autoresearch loop

Across 96 experiments — every prompting, retrieval, decoding, pipeline, fine-tuning, preference-learning, reinforcement-learning and adversarial-distillation lever we could find — Apple's native 2025 3B Foundation Model came within about one rubric point of a frontier cloud model (Claude Sonnet 4.6) on every quality dimension: close, but consistently a notch below, across the board.

1. The question

Move one feature from the cloud onto the device, with no quality loss — if that's even possible.

I'm building a native macOS app with a discovery planner: type a one-line task — "plan a trip to Paris" — and it returns a restated title and exactly seven clarifying questions. Today it runs on Claude Sonnet 4.6 in the cloud. I wanted it entirely local and native, on Apple's built-in ~3B Foundation Model, with no Anthropic access in the shipping product. The whole project is one question: can a ~20× smaller, on-device model match a frontier cloud model on this one narrow task?

User input
"Plan a trip to Paris"
Discovery Planner
  1. What city are you departing from?
  2. When are you planning to travel?
  3. How many people are going on this trip?
  4. How long do you plan to stay in Paris?
  5. What is your total budget for this trip?
  6. What type of accommodation do you prefer?
  7. What is the main purpose of your visit?

The planner turns a vague one-liner into seven specific clarifying questions. (Shown: the gold-standard answer from claude-sonnet-4-6 — the bar we're trying to match on-device.)

2. The climb

Real progress: quality rose from 0.28 to 0.44. Then it stalled.

Stock baseline: 0.28 (3B, single shot)
In-context ceiling: 0.32 (best of 29 prompting / RAG tries)
Fine-tuned adapter: 0.41 (LoRA on the model's weights)
Best so far: 0.44 (adapter + over-generate & prune)
[Chart: running best quality over the 96 experiments, in run order. Quality axis from 0.20 to 0.50, where 0.50 marks parity (a tie on every case). Labeled steps: 0.28, 0.32, 0.41, 0.44, then stalled with no further gains.]
Running best quality across the 96 experiments. The 29 in-context tries inch from 0.28 to 0.32; the first fine-tuned adapter jumps to 0.41; over-generate-and-prune reaches the 0.44 best. Then fifty-plus further experiments add nothing, among them judge-reward RL, adversarial distillation, and an exhaustive inference-topology × best-adapter cross. (One cross cell, exp078, flashed 0.47, but a confirmation re-judge of its identical output fell to 0.40: judge noise, not a gain.) Up from 0.28 to 0.44, then flat.

A fine-tuned SFT adapter was the only lever that broke past the prompting ceiling; everything above ~0.32 runs on one. Four of the five rubric dimensions reached 4–5. One did not.

3. Our approach

A fixed test set scored by a Sonnet judge — and the autoresearch loop that drives the experiments.

How we measure quality

The evaluation is fixed up front and never changes: 30 held-out one-liners, each with a gold answer from Claude Sonnet 4.6 — the cloud model we're matching. For any on-device candidate, a judge (also Sonnet) reads its seven questions beside the gold, picks the better set, and scores it 1–5 on five dimensions. One quality score (0–1) blends that rubric with the head-to-head win rate. A score from week one is comparable to one months later, because every experiment uses the same tasks, the same gold, and the same judge configuration (same model: Claude Sonnet 4.6, same prompt, same rubric).
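The per-case verdicts described above can be rolled up into the two summaries the report quotes later: a win/tie/loss tally against gold and a mean score per rubric dimension. The sketch below is a minimal illustration, not the project's harness. The `Verdict` type, the `summarize` function and the dimension keys are hypothetical names, and the blend into the single 0–1 quality score is deliberately left out because the report does not state its formula. The one verdict shown is the Paris case from this report.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical keys for the five rubric dimensions named in the report.
DIMENSIONS = ("atomicity", "specificity", "coverage", "naturalness", "non_redundancy")


@dataclass
class Verdict:
    outcome: str            # candidate vs gold: "win", "tie" or "loss"
    scores: Dict[str, int]  # rubric dimension -> score from 1 to 5


def summarize(verdicts: List[Verdict]) -> Tuple[Tuple[int, int, int], Dict[str, float]]:
    """Return the (wins, ties, losses) tally and the mean score per dimension."""
    tally = Counter(v.outcome for v in verdicts)
    wtl = (tally["win"], tally["tie"], tally["loss"])
    dim_means = {d: mean(v.scores[d] for v in verdicts) for d in DIMENSIONS}
    return wtl, dim_means


# The Paris case from the report: the judge preferred the gold set.
paris = Verdict(
    outcome="loss",
    scores={"atomicity": 4, "specificity": 3, "coverage": 3,
            "naturalness": 3, "non_redundancy": 4},
)

wtl, dim_means = summarize([paris])
print("W/T/L:", wtl)
print("mean coverage:", dim_means["coverage"])
```

Keeping the tally and the per-dimension means separate makes it easy to see which dimension is holding the overall score back.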

The rubric's five scoring dimensions

Atomicity: asks exactly one thing, with no "and / or"
Specificity: concrete to this task, not generic filler
Coverage: the seven cover the most decision-critical unknowns
Naturalness: reads like a thoughtful human coach
Non-redundancy: no overlapping or repeated questions

One real evaluation

Here is a single case from our test set, scored exactly the way every experiment is: the user's one-liner, the cloud model's gold answer, the on-device answer, and the judge's verdict.

User input
what the user types

"Plan a trip to Paris"

Gold standard
claude-sonnet-4-6 · cloud
  1. What city are you departing from?
  2. When are you planning to travel?
  3. How many people are going on this trip?
  4. How long do you plan to stay in Paris?
  5. What is your total budget for this trip?
  6. What type of accommodation do you prefer?
  7. What is the main purpose of your visit?
On-device candidate + Magic
Apple FM ~3B · on device
  1. What is your travel date or timeframe?
  2. How many nights will you stay in Paris?
  3. What is your total budget for the trip?
  4. Which city in Paris are you staying in?
  5. Do you prefer a guided tour or self-guided exploration?
  6. Is this trip for a specific occasion or purpose?
  7. Do you have any travel restrictions or mobility needs?
Judgement
claude-sonnet-4-6

Gold better

Atomicity 4 · Specificity 3 · Coverage 3 · Naturalness 3 · Non-redundancy 4 (each scored 1–5)

The on-device set asks a nonsensical question ("Which city in Paris are you staying in?" — Paris is a city) and omits two critical unknowns — the departure city and the party size — making it less useful for planning than the gold set.

The on-device model phrases questions naturally and even adds a sensible one about mobility — but it misses the few decision-critical unknowns (where you're flying from, how many people). That gap, which we call coverage, is the central challenge of this project. On other tasks the on-device model ties or even beats the cloud model; this is one honest, representative case.

The autoresearch loop

To answer our research question, I built an autoresearch loop (inspired by Andrej Karpathy's autoresearch): a bash script that each iteration spawns a headless Claude Code to make one change, lets the Sonnet judge score it against the gold, then commits the result and loops. It runs unattended, and pauses to hand me the manual fine-tuning track when the agent asks for it.

How one experiment runs. Claude Code drives the cycle autonomously: read the log, form one hypothesis, edit the agent, and have the judge score it — keeping the result only if it beats the running best. The human supervisor sets direction and runs the fine-tuning track the agent can't. Solid arrows are the automated loop; the dashed link is the scoring by the judge; rust marks human input.
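The keep-only-if-better cycle can be sketched in a few lines. The real loop is a bash script that spawns a headless Claude Code and commits each result; the Python below only illustrates the control flow. `propose` and `evaluate` are hypothetical stand-ins for the agent and the judge, and the scripted scores are illustrative values that echo the report's milestones, not logged results.

```python
from typing import Callable, List, Optional, Tuple

Log = List[Tuple[str, float]]


def autoresearch(
    propose: Callable[[Log], str],     # stand-in for the agent: reads the log, returns one change
    evaluate: Callable[[str], float],  # stand-in for the fixed judge: returns a quality score
    baseline: float,
    iterations: int,
) -> Tuple[Optional[str], float, Log]:
    """Run the loop and return the best change, its score, and the full log."""
    log: Log = []
    best_change: Optional[str] = None
    best_score = baseline
    for _ in range(iterations):
        change = propose(log)         # one hypothesis, one edit per iteration
        score = evaluate(change)      # scored against the same gold every time
        log.append((change, score))   # every run is logged, kept or not
        if score > best_score:        # keep the result only if it beats the running best
            best_change, best_score = change, score
    return best_change, best_score, log


# Illustrative run with scripted scores (assumed values, not real experiment data).
scripted = iter([0.30, 0.32, 0.41, 0.40, 0.44, 0.40])
best_change, best_score, log = autoresearch(
    propose=lambda log: f"change-{len(log) + 1}",
    evaluate=lambda change: next(scripted),
    baseline=0.28,
    iterations=6,
)
print(best_change, best_score, len(log))
```

Because the judge and gold never change, comparing each score with the running best is meaningful across the whole run, apart from judge noise.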

4. Everything we tried

Seven lever classes, 96 experiments, 9 that set a new best. One dimension stuck throughout.

Every experiment, at a glance. One tile per run, in the order each was run — its on-device pipeline stripped to bare structure, coded by the legend below. Read the shapes: thin single-shot baselines, ensembles fanning wide, the tall multi-call reasoning pipelines of the inference era, then the two-column adapter runs that fill the back half.

Legend: Inference call · SFT adapter call · ORPO adapter call · Swift / user step · Adapter training (dev-time)

You can view each experiment's details — its hypothesis, method, result and process diagram — on the live dashboard.

Best result by lever class

Lever class | Runs | Best quality | Atomicity | Specificity | Naturalness | Non-redundancy | Coverage | Gate
Stock baseline | 1 | 0.28 | 5 | 3 | 3 | 3 | 3 | full-30
Inference: prompt · schema · decoding · RAG · multi-call pipelines · self-critique · over-generate-prune · solution-space info-gain · answer-simulation · decomposed verification | 29 | 0.32 | 5 | 4 | 4 | 4 | 3 | dev-10
SFT: LoRA adapters (imitation), incl. more data & coverage-forced data | 34 | 0.44 | 5 | 4 | 4 | 4 | 3 | full-30
ORPO: preference, style-vs-style pairs | 2 | 0.36 | 4 | 4 | 4 | 3 | 3 | full-30
ORPO: preference, on-policy pairs (gold vs the model's own drafts) | 2 | 0.40 | 5 | 4 | 4 | 4 | 3 | full-30
GRPO: judge-reward RL with group-relative advantages over the model's own judged drafts, incl. KL-anchored | 4 | 0.41 | 5 | 4 | 4 | 4 | 3 | full-30
GAD: adversarial distillation, where the reward is a discriminator trained to tell Sonnet gold sets from the model's drafts | 2 | 0.36 | 5 | 3 | 4 | 4 | 3 | full-30

One pattern holds across every row. Atomicity, specificity, naturalness and non-redundancy reached 4–5 — the on-device model phrases questions like a thoughtful human and rarely repeats itself.

But coverage (whether the seven include the single most decision-critical unknown) never rose above 3 in any of the 96 runs; only 2s and 3s ever appeared. It was unmoved by prompting, retrieval, multi-call reasoning, fine-tuning, reshaped training data, preference learning, RL against the rubric, or adversarial distillation.

The cleanest negative. Preference learning (ORPO) first regressed (0.33): the pairs contrasted two Sonnet styles, not good-vs-bad judgment, so the model learned the wrong correlate. Rebuilding them on-policy (chosen = gold, rejected = the model's own weaker draft) recovered the loss (0.40, back to the SFT baseline) — confirming the diagnosis — yet coverage still did not move off 3. Even a clean preference signal could not teach which unknown matters.
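The on-policy rebuild described above (chosen = gold, rejected = the model's own weaker draft) amounts to a simple pairing step. The sketch below is one hedged way to write it, not the project's draft generator. The function name and the `prompt` / `chosen` / `rejected` keys are assumptions, and the drafts passed in are assumed to be ones the judge already rated below gold. The example data is the Paris case from this report.

```python
from typing import Dict, List

QuestionSet = List[str]


def build_on_policy_pairs(
    gold: Dict[str, QuestionSet],
    weaker_drafts: Dict[str, List[QuestionSet]],
) -> List[dict]:
    """Pair each task's gold set (chosen) with the model's own drafts (rejected)."""
    pairs = []
    for task, gold_set in gold.items():
        for draft in weaker_drafts.get(task, []):
            if draft == gold_set:
                continue  # an identical set carries no preference signal
            pairs.append({"prompt": task, "chosen": gold_set, "rejected": draft})
    return pairs


gold = {
    "Plan a trip to Paris": [
        "What city are you departing from?",
        "When are you planning to travel?",
        "How many people are going on this trip?",
        "How long do you plan to stay in Paris?",
        "What is your total budget for this trip?",
        "What type of accommodation do you prefer?",
        "What is the main purpose of your visit?",
    ]
}
weaker_drafts = {
    "Plan a trip to Paris": [
        [
            "What is your travel date or timeframe?",
            "How many nights will you stay in Paris?",
            "What is your total budget for the trip?",
            "Which city in Paris are you staying in?",
            "Do you prefer a guided tour or self-guided exploration?",
            "Is this trip for a specific occasion or purpose?",
            "Do you have any travel restrictions or mobility needs?",
        ],
        list(gold["Plan a trip to Paris"]),  # a draft identical to gold is skipped
    ]
}

pairs = build_on_policy_pairs(gold, weaker_drafts)
print(len(pairs), "pair(s) built")
```

Pairs built this way contrast gold with the model's actual failure modes, which is the difference from the earlier style-vs-style pairs.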

5. "But wait! What if..." The confirmation experiment

Hold everything constant. Change only the model. Watch coverage.

If coverage is locked because the harness or the prompt is wrong, then a more capable model would also be stuck at 3. If it's locked because the 3B model isn't capable enough, a stronger model — given the identical prompt, schema, gold, judge and rubric, and decoded exactly the way the gold itself was — should clear it. So we ran two cloud models as the candidate through the exact same harness. The only variable is the model.

Candidate model | Where | Quality | Coverage | vs gold (W/T/L)
Apple Foundation Model (~3B) | on-device | 0.44 | 3 | best 2/5/22
Claude Haiku 4.5 | cloud | 0.55 | 4 | 4/11/15
Claude Sonnet 4.6 (= the gold model) | cloud | 0.65 | 4 | 4/18/8

Coverage moved the moment the model changed: 3 → 4. It was never the harness — the same rubric that scored the 3B at 3 scores Haiku and Sonnet at 4. Two further details matter:

6. The verdict

The gap is a capability limit of the 3B base — measured against what's achievable, not a perfect score, and not a method we failed to find.

Every lever — imitation, preference learning, RL against the rubric reward, adversarial distillation, every inference-time trick — leaves the on-device model a notch below the frontier. The confirmation experiment (§5) shows why, and it's two gaps, not one:

Gap A — the 3B vs. the frontier. Swap in only a more capable model and the wall lifts: coverage 3→4, quality 0.44→0.65, head-to-head against gold from 2/5/22 to 4/18/8. This notch is the capability limit — it closes the moment capacity rises.

Gap B — the frontier vs. a perfect score. Even Sonnet, judged against its own gold, tops out at coverage 4 / quality 0.65 (ties itself 18, loses 8). That residual is the task's own answer-variance — two excellent question sets disagreeing on the single most critical unknown — which no model crosses. It's the task's ceiling, not the 3B's.

So the bar is 0.65 / coverage 4, not 1.0. The 3B's shortfall is its distance below that bar, and that distance is bound to model capacity — no further training or scaffolding on this base is likely to close it.
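The arithmetic behind that bar is short enough to check directly. The two inputs are the quality scores reported in §5; the variable names are ours.

```python
on_device = 0.44          # best on-device quality (adapter + over-generate & prune)
practical_ceiling = 0.65  # Sonnet judged against its own gold

shortfall = practical_ceiling - on_device
fraction = on_device / practical_ceiling

print(f"shortfall below the bar: {shortfall:.2f}")      # 0.21
print(f"fraction of the practical ceiling: {fraction:.2f}")  # 0.68
```

A ratio of about 0.68 is where the "roughly two-thirds" figure comes from.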

What this means for the product. The on-device model is uniformly close: roughly two-thirds of the frontier's practical quality (0.44 against a 0.65 practical ceiling), and strong in absolute terms (4 of 5 on most rubric dimensions). The shortfall is capacity-bound, so more work on this base is unlikely to move it. Three honest paths:

Ship & document

  • Ship the 0.44 adapter; document it honestly as uniformly near-parity — a notch below the frontier, not broken on any one axis.
  • Best fit if local + free + private outweighs a consistently-slightly-weaker question set.

Reframe the bar

  • Stop measuring against Sonnet-parity and measure absolute usefulness — does the set enable a good plan? The model may be plenty good for the product even a notch below the frontier.
  • Changes the question from "match the cloud" to "is this good enough," which the raw scores can't answer alone.

Wait for a bigger base

  • Re-run this exact harness when Apple ships a larger on-device model; it is now a turnkey regression test.
  • No work until then; outcome outside our control.

My recommendation: ship and document, and reframe the bar to absolute usefulness. The model is a uniform notch below the frontier, not broken anywhere; at ~two-thirds of the practical ceiling and strong in absolute terms it is plausibly good enough — and the capacity-bound gap won't yield to more work on this base. Re-test when a larger on-device model ships.

7. Reproducibility

Every experiment is logged with its hypothesis, method, result and a process diagram on the dashboard; the narrative is in the project overview. The fixed evaluation (30 gold cases + Sonnet judge), all agent configs, the LoRA training scripts (SFT, ORPO, the KL-anchored GRPO policy gradient, and the GAD discriminator), the on-policy draft generator, the cloud ceiling-confirmation harness, and the coverage-critic probe are in the repository. The autoresearch loop runs unattended and can request the manual training track when it judges the weights are the answer. The cloud anchors in §5 and the mechanistic analysis in §6 are reference only, deliberately kept out of the on-device run log.

8. References

Not built from scratch — the levers we tried, and the papers each one came from. For each: what the paper introduced, and how we used it here.

In-context phase — prompting, retrieval, decoding

  1. Language Models are Few-Shot Learners · Brown et al., 2020 · GPT-3 · NeurIPS 2020

    Showed a large language model can perform a new task from a few examples placed in the prompt, with no weight updates.

    Used: Our first lever, feeding the 3B a handful of gold question-sets as in-prompt demonstrations to steer its output before touching the weights.

    Experiments: Directly used by the early few-shot prompting runs, exp003 (2 nearest exemplars) and exp007 (semantic-embedding retrieval); influenced the whole in-context phase (exp003–exp027), which all built on in-prompt demonstrations.

  2. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · Lewis et al., 2020 · RAG · NeurIPS 2020

    Pairs a generator with a retriever that pulls relevant passages from an external index at inference time.

    Used: Retrieving the most relevant gold examples per task to use as demonstrations, instead of a fixed prompt set.

    Experiments: Directly used in the retrieval runs: exp003/exp007 (nearest-exemplar retrieval), exp020 (corpus-grounded ranking over a 3-temp pool), and exp040 (leak-free 619-pair corpus RAG, the best honest in-context add-on at 0.405); also exp033, leave-one-out RAG on the adapter.

  3. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries · Carbonell & Goldstein, 1998 · SIGIR'98

    A selection criterion that trades off relevance against novelty to avoid picking redundant items.

    Used: Choosing demonstration sets that were relevant but diverse; in our runs MMR selection didn't beat a simpler top-k.

    Experiments: Directly used in exp043 (MMR-diverse corpus few-shot, k=3, λ=0.6), which scored 0.382, below exp040's plain top-2-nearest (0.405); diversifying the demos did not diversify the output.

  4. The Curious Case of Neural Text Degeneration · Holtzman et al., 2019 · nucleus / top-p sampling · ICLR 2020

    Introduced top-p (nucleus) sampling, truncating the unreliable tail of the distribution to balance quality and diversity.

    Used: Our decoding controls: the top-p / temperature settings for single-shot output and for the over-generate-and-prune sampler (topP .95, t .9).

    Experiments: Directly used by the multi-temperature over-generate-and-prune runs, exp004 (best-of-4 temps) and exp020 (21 candidates over temps 0.4/0.7/1.0), and by the GRPO on-policy samplers (topP .95 / t .9 in exp057–058); the over-generate-and-prune champion is exp056 (0.44).
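For readers unfamiliar with the MMR criterion in entry 3 above, here is a minimal sketch of the selection rule: greedily pick the item that maximizes λ · relevance − (1 − λ) · redundancy with what is already selected. The defaults k=3 and λ=0.6 are the settings reported for exp043. The cosine similarity, the function names and the toy vectors are assumptions for illustration, not the project's embedding pipeline.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query: Sequence[float], candidates: List[Sequence[float]],
               k: int = 3, lam: float = 0.6) -> List[int]:
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def mmr_score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                         default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: candidate 1 is a near-duplicate of candidate 0, candidate 2 is different.
query = [1.0, 1.0, 1.0]
candidates = [[1.0, 1.0, 0.0], [1.0, 0.9, 0.0], [0.0, 1.0, 0.9]]
picked = mmr_select(query, candidates, k=2, lam=0.6)
print(picked)
```

In this toy case the second pick skips the near-duplicate in favor of the more distinct candidate, which is the behavior exp043 was testing on demonstration sets.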

Fine-tuning

  1. Finetuned Language Models Are Zero-Shot Learners · Wei et al., 2021 · FLAN · ICLR 2022

    Fine-tuning on tasks phrased as natural-language instructions sharply improves zero-shot generalization.

    Used: The template for our supervised fine-tuning: training the model on (task → seven questions) pairs in instruction form.

    Experiments: The template for every SFT adapter: adapter_e1 (first dev-10 spike), adapter_v2a_e1 (the 0.409 SFT champion), and the v2b coverage-forced variant, all trained on (task → seven questions) instruction pairs.

  2. LoRA: Low-Rank Adaptation of Large Language Models · Hu et al., 2021 · ICLR 2022

    Freezes the base weights and trains small low-rank adapter matrices, cutting trainable parameters by orders of magnitude.

    Used: Our main fine-tuning method: the LoRA adapter that lifted quality from the 0.32 prompting ceiling to 0.41.

    Experiments: Directly used for every adapter run: adapter_v2a_e1 was the first to clear the prompting ceiling (0.409), and it became the base for the entire 0.40–0.44 adapter family (exp028–exp056) plus all the ORPO/GRPO/GAD fine-tunes.
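As a reminder of what a LoRA adapter (entry 2 above) changes, the sketch below adds a scaled low-rank path B·A beside a frozen weight matrix W and counts trainable parameters. It follows the formulation in the LoRA paper, not Apple's adapter toolkit. All shapes, the rank and the alpha value are illustrative assumptions; the report does not state the adapter's actual rank or dimensions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 alpha: float, r: int) -> Vector:
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)                 # frozen base projection, W is d x k
    low_rank = matvec(B, matvec(A, x))  # A is r x k, B is d x r
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


def trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the adapter pair (B: d x r, A: r x k)."""
    return r * (d + k)


# With B initialised to zeros (as in the paper), the adapter starts as a no-op.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]    # r = 1, k = 2
B = [[0.0], [0.0]]   # d = 2, r = 1
out = lora_forward(W, A, B, x=[1.0, 1.0], alpha=2.0, r=1)
print(out)

# Illustrative sizes only: a 1024 x 1024 layer with rank 8.
full = 1024 * 1024
adapter = trainable_params(1024, 1024, 8)
print(full, adapter, full // adapter)
```

For these illustrative sizes the adapter trains 64 times fewer parameters than the full matrix, which is why adapter training fits a small on-device base.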

Preference learning & reinforcement learning

  1. Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafailov et al., 2023 · DPO · NeurIPS 2023

    Aligns a model to preference pairs with a simple classification loss — no separate reward model or RL loop.

    Used: Preference-learning experiments: training on chosen/rejected question-set pairs to push toward gold-like output.

    Experiments: Influence/precedent only; no DPO run exists. It motivated the preference-learning direction; the variant we actually implemented was ORPO (adapter_orpo_e1/e2 and on-policy adapter_orpo_op_e1/e2).

  2. ORPO: Monolithic Preference Optimization without Reference Model · Hong et al., 2024

    Folds preference optimization into SFT in a single stage via an odds-ratio penalty, with no reference model.

    Used: Our preference runs combining the SFT and preference objectives in one pass, including the on-policy ORPO variant.

    Experiments: Directly used in adapter_orpo_e1/e2 (offline ORPO on style-vs-style pairs) and adapter_orpo_op_e1/e2 (on-policy ORPO, chosen = gold vs rejected = the model's own drafts); both stayed at/below the SFT champion and left coverage pinned at 3.

  3. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models · Shao et al., 2024 · introduces GRPO

    Introduced GRPO, a PPO variant that drops the critic and estimates the baseline from group-relative scores of sampled outputs.

    Used: The RL algorithm behind exp057–058: a judge-rubric reward with group-relative advantage over G=6 on-policy drafts per task.

    Experiments: Directly implemented in exp057 (judge-reward GRPO, no KL anchor, 0.328), exp058 / adapter_grpo_kl_e1 (KL-anchored, recovered to 0.407 ≈ base), and reused as the policy-gradient core of the GAD run (exp059).

  4. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · DeepSeek-AI, 2025 · also Nature 645:633–638

    Showed strong reasoning can be incentivized by RL against a reward, without supervised reasoning traces.

    Used: The motivation for trying rubric-grounded RL on our non-verifiable, judge-scored task: the lever sweep that pointed us at GRPO.

    Experiments: Framing/motivation only: the literature sweep cited in exp057's log that pointed us at rubric-grounded RL; it maps to no run of its own, only to the GRPO attempts (exp057–058) it inspired.

  5. Training Language Models to Follow Instructions with Human Feedback · Ouyang et al., 2022 · InstructGPT / RLHF · NeurIPS 2022

    Defined the standard RLHF pipeline (SFT → reward model → RL), and found a 1.3B aligned model preferred over 175B GPT-3.

    Used: The conceptual backbone for our reward-based fine-tuning, and direct evidence for the thesis that a small specialized model can beat a far larger general one.

    Experiments: Framing for the whole study (the specialization thesis) plus the conceptual backbone for the reward-based runs, the ORPO (adapter_orpo_*) and GRPO (exp057–058) fine-tunes; not a single run.

  6. Constitutional AI: Harmlessness from AI Feedback · Bai et al., 2022 · RLAIF

    Replaces human preference labels with model-generated ones (RL from AI Feedback), supervised only by a set of principles.

    Used: The precedent for using an AI (Sonnet) rather than humans as the preference and reward source throughout our loop.

    Experiments: Influenced the whole loop: the Sonnet AI-feedback reward in exp057–058 (GRPO judge-rubric reward) and the discriminator teacher-likeness reward in exp059 (GAD); also the source of the ORPO preference pairs.
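The group-relative advantage at the heart of GRPO (entry 3 above) can be sketched as follows: each draft's reward is compared with the mean of its own group and scaled by the group's spread, so no learned critic is needed. The group size of six matches the G=6 reported for exp057–058. The reward values are made up for illustration, and the use of population standard deviation plus a small epsilon is an implementation choice assumed here, not taken from the project's training script.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge rewards for G = 6 on-policy drafts of one task.
rewards = [0.30, 0.35, 0.40, 0.40, 0.45, 0.50]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward {reward:.2f} -> advantage {advantage:+.2f}")

# If every draft earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.40] * 6)
print(flat)
```

The flat case is worth noting for this task: if the judge scores every draft in a group the same on a dimension, as coverage did throughout, that dimension contributes no gradient.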

Evaluation & the specialization thesis

  1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · Zheng et al., 2023 · NeurIPS 2023 D&B

    Validated using a strong LLM as an automatic judge — including pairwise comparison — against human preference.

    Used: The basis for our fixed Sonnet judge: rubric scoring plus pairwise win-rate as the metric held constant across every experiment.

    Experiments: Used by every experiment: the fixed Sonnet rubric + pairwise judge scored all 96 runs (baseline through exp079); it was also the literal reward signal in the GRPO runs (exp057–058).

  2. Multitask Prompted Training Enables Zero-Shot Task Generalization · Sanh et al., 2021 · T0 · ICLR 2022

    An 11B model trained on prompted multitask data often beats models up to 16× larger, zero-shot.

    Used: Prior evidence that specialization can close size gaps: the optimistic prior this whole study set out to test on a 3B.

    Experiments: Framing/motivation for the whole study (the specialization prior the entire program tested); not tied to any single run.

Distillation lineage — the GAD experiment

  1. Distilling the Knowledge in a Neural Network · Hinton, Vinyals & Dean, 2015 · NIPS 2014 DL Workshop

    Trains a small "student" model to match a larger "teacher" model's outputs, transferring its knowledge.

    Used: The conceptual basis for distilling Sonnet (teacher) into the 3B (student), the framing behind our GAD experiment.

    Experiments: The distillation framing behind exp059 / adapter_gad_e1 (Sonnet teacher → 3B student); the SFT adapters (adapter_v2a, v2b) are also a form of teacher-text distillation from Sonnet gold.

  2. Generative Adversarial Networks · Goodfellow et al., 2014

    Trains a generator against a discriminator in a minimax game; the discriminator learns to tell real data from generated.

    Used: The adversarial setup we adapted for GAD (exp059): a discriminator trained to separate the model's drafts from Sonnet's, its teacher-likeness score used as the RL reward. GAD itself is our own construction; there is no single canonical "generative adversarial distillation" paper.

    Experiments: Directly used in exp059 / adapter_gad_e1: a MiniLM-embedding discriminator (val AUC 0.629) trained to separate the model's drafts from Sonnet's, its P(teacher-like) score feeding the GRPO reward; regressed to 0.355 and left coverage at 3.

The autoresearch loop methodology is inspired by Andrej Karpathy's autoresearch (linked in §3). Only ORPO (2024), DeepSeekMath/GRPO (2024) and DeepSeek-R1 (2025) are recent; the rest are foundational methods we applied off the shelf.