Alexis Rondeau & Claude Code · final report

The wall that wasn't coverage

Across 71 experiments — every prompting, retrieval, decoding, pipeline, fine-tuning and preference-learning lever we could find — an on-device 3B model came within about one rubric point of a frontier cloud model on every quality dimension: close, but consistently a notch below, across the board. This is the record of how far it got, and the controlled tests that show the gap is a broad capability limit — not a single missing skill.

1The question

Move one feature from the cloud onto the device, with no quality loss — if that's even possible.

I'm building a native macOS app with a discovery planner: type a one-line task — "plan a trip to Paris" — and it returns a restated title and exactly seven clarifying questions. Today it runs on Claude Sonnet in the cloud. I wanted it entirely local and native, on Apple's built-in ~3B Foundation Model, with no Anthropic access in the shipping product. The whole project is one question: can a ~20× smaller, on-device model match a frontier cloud model on this one narrow task?

To answer it honestly I built an autoresearch loop (inspired by Andrej Karpathy's autoresearch): a frozen ruler — 30 held-out tasks, each with a Sonnet gold answer and a Sonnet judge that scores the on-device answer head-to-head — and a mutable agent that every experiment changes. The judge scores each question set 1–5 on five dimensions; quality (0–1) blends the rubric with pairwise win rate. The ruler never changes, so scores are comparable across months of work.

The loop itself is run by Claude Code, Anthropic's coding agent. Each iteration it reads the running log of what's already been tried, proposes the next experiment, writes and runs the code against the ruler, and records the result — one experiment at a time. I worked alongside it as supervisor: steering which levers to pursue next, giving continuous feedback on its hypotheses, and triggering the manual fine-tuning runs. So throughout this report, "we" means that pairing — an AI agent generating and running the experiments under human direction.

How one experiment runs. Claude Code drives the cycle autonomously: read the log, form one hypothesis, edit the agent, and score it against the frozen ruler — keeping the result only if it beats the running best. The human supervisor sets direction and runs the fine-tuning track the agent can't. Solid arrows are the automated loop; the dashed link is the scoring against the ruler; rust marks human input.

2The climb

Real progress — quality roughly doubled. Then it stalled.

Stock baseline
0.28
3B, single shot
In-context ceiling
0.32
best of 29 prompting / RAG tries
Fine-tuned adapter
0.41
LoRA on the model's weights
Best so far
0.44
adapter + over-generate & prune
0.40 0.30 0.28 0.32 0.41 0.44 stalled — no further gains 71 experiments, in run order →
Running best quality across the 71 experiments. The 29 in-context tries inch from 0.28 to 0.32; the first fine-tuned adapter jumps to 0.41; over-generate-and-prune reaches the 0.44 best — then thirty-plus further experiments add nothing. Roughly doubled, then flat.

A fine-tuned SFT adapter was the only lever that broke past the prompting ceiling; everything above ~0.32 runs on one. Four of the five rubric dimensions reached 4–5. One did not.

3Everything we tried

Five lever classes, 71 experiments, 9 that set a new best. One dimension stuck throughout.

Lever classRunsBest qualityCoverageGate
Stock baseline10.283full-30
Inference  prompt · schema · decoding · RAG · multi-call pipelines · self-critique · over-generate-prune · solution-space info-gain · answer-simulation · decomposed verification290.323dev-10
SFT  LoRA adapters (imitation), incl. more data & coverage-forced data340.443full-30
ORPO  preference, style-vs-style pairs20.363full-30
ORPO  preference, on-policy pairs (gold vs the model's own drafts)20.403full-30

Four findings hold across every row. Atomicity, specificity, naturalness and non-redundancy reached 4–5 — the on-device model phrases questions like a thoughtful human and rarely repeats itself. Coverage — whether the seven questions include the single most decision-critical unknown — sat at 3 in every one of the 71 runs (the only values ever observed, on either gate, were 2 and 3). It did not move for prompting, for retrieval, for elaborate multi-call reasoning, for fine-tuning, for more or reshaped training data, or for preference learning.

The cleanest negative. Preference learning (ORPO) first regressed (0.33) because the pairs contrasted two Sonnet styles, not good-vs-bad judgment — the model learned the wrong correlate. Rebuilding the pairs on-policy (chosen = gold, rejected = the model's own weaker draft for the same task) recovered the loss (0.40, back to the SFT baseline) — confirming the diagnosis exactly — but coverage still did not move off 3. Even a clean, correct preference signal could not teach which unknown matters.

Every experiment, at a glance. One tile per run, in the order each was run — its on-device pipeline stripped to bare structure, coded by the legend below. Read the shapes: thin single-shot baselines, ensembles fanning wide, the tall multi-call reasoning pipelines of the inference era, then the two-column adapter runs that fill the back half.

Inference call SFT adapter call ORPO adapter call Swift / user step Adapter training (dev-time)

You can view each experiment's details — its hypothesis, method, result and process diagram — on the live dashboard.

4The confirmation experiment

Hold everything constant. Change only the model. Watch coverage.

If coverage is locked because the harness or the prompt is wrong, then a more capable model would also be stuck at 3. If it's locked because the 3B model isn't capable enough, a stronger model — given the identical prompt, schema, gold, judge and rubric, and decoded exactly the way the gold itself was — should clear it. So we ran two cloud models as the candidate through the exact same harness. The only variable is the model.

Candidate modelWhereQualityCoveragevs gold (W/T/L)
Apple Foundation Model (~3B)on-device0.443best 2/5/22
Claude Haiku 4.5cloud0.5544/11/15
Claude Sonnet 4.6 (= the gold model)cloud0.6544/18/8

Coverage moved the moment the model changed: 3 → 4. It was never the harness — the same rubric that scored the 3B at 3 scores Haiku and Sonnet at 4. Two further details matter:

5Is it really coverage?

The obvious follow-up: if we stopped caring about coverage, would the model reach parity? No.

Here the story turns. Everything to this point — the climb, the 71 experiments, the cloud confirmation — frames coverage as the wall: the one dimension stuck at 3, the one the judge keeps naming. But "stuck at the lowest score" and "the reason it loses" are not the same claim, and this section is where we stopped assuming they were. Coverage was the lowest dimension and by far the most-cited in the judge's notes, so it was tempting to call it the gap. To test that, we re-judged the stored candidates with the exact official judge in two modes: the normal five dimensions, and a coverage-blind mode that drops coverage from the rubric and from the "which set is better" verdict. Same candidate outputs; only the judge's instructions change. The five-dimension control reproduces the logged quality, so the comparison is sound.

Candidatequality, 5-dimquality, coverage-blindΔloses to gold (of 30)
SFT baseline0.3980.399+0.0023 → 23
Best adapter (exp056)0.4150.412−0.0022 → 22
On-policy ORPO0.3690.370+0.0025 → 24

Removing coverage changes quality by essentially zero — and, the telling part, the head-to-head barely moves: told to ignore coverage entirely, the judge still prefers the gold set ~22–24 times out of 30. The losses were never mostly about coverage.

The reason is visible in the rubric the same judge assigns each model as a candidate:

Scored as candidateatomicityspecificitycoveragenaturalnessnon-redundancy
On-device best (exp056)44344
Sonnet (cloud anchor)54455

The on-device model matches the frontier on specificity alone and trails by about a point on everything else. Coverage is the joint-lowest, but it is not special — the gap is broad. That is why re-weighting it away does nothing: the model is a uniform notch below the frontier, and coverage is simply the symptom the judge can name in a sentence.

6The verdict

The gap is a broad capability limit of the 3B base — not a coverage-specific wall, and not a method we failed to find.

Imitation, preference learning, and every inference-time technique in the 2024–2025 literature leave the on-device model a consistent notch below the frontier on every dimension. A more capable model clears that notch on the identical harness (coverage 3→4, and 4→5 elsewhere), and removing coverage from the judging recovers none of the gap. So the shortfall is bound to model capacity, broadly — no amount of further training or scaffolding on this base is likely to close it.

What this means for the product. The on-device model is uniformly close — roughly two-thirds of the frontier's practical quality, about a point below on nearly every dimension, and strong in absolute terms (4 of 5 on most). The shortfall is broad and capacity-bound, so there is no single dimension to fix. Three honest paths:

Ship & document

  • Ship the 0.44 adapter; document it honestly as uniformly near-parity — a notch below the frontier, not broken on any one axis.
  • Best fit if local + free + private outweighs a consistently-slightly-weaker question set.

Reframe the bar

  • Stop measuring against Sonnet-parity and measure absolute usefulness — does the set enable a good plan? The model may be plenty good for the product even a notch below the frontier.
  • Changes the question from "match the cloud" to "is this good enough," which the raw scores can't answer alone.

Wait for a bigger base

  • Re-run this exact harness when Apple ships a larger on-device model; it is now a turnkey regression test.
  • No work until then; outcome outside our control.

My recommendation: ship and document, and reframe the bar to absolute usefulness. The model is a uniform notch below the frontier, not broken anywhere; at ~two-thirds of the practical ceiling and strong in absolute terms it is plausibly good enough — and the broad, capacity-bound gap won't yield to more work on this base. Re-test when a larger on-device model ships.

7Reproducibility

Every experiment is logged with its hypothesis, method, result and a process diagram on the dashboard; the narrative is in the project overview. The frozen ruler (30 gold cases + Sonnet judge), all agent configs, the LoRA training and ORPO scripts, the on-policy draft generator, the cloud ceiling-confirmation harness, and the coverage-blind re-judge (§5) are in the repository. The autoresearch loop runs unattended and can request the manual training track when it judges the weights are the answer. The cloud anchors in §4 and the coverage-blind re-judge in §5 are reference analyses only, deliberately kept out of the on-device run log.