Across 71 experiments — every prompting, retrieval, decoding, pipeline, fine-tuning and preference-learning lever we could find — an on-device 3B model came within about one rubric point of a frontier cloud model on every quality dimension: close, but consistently a notch below, across the board. This is the record of how far it got, and the controlled tests that show the gap is a broad capability limit — not a single missing skill.
Move one feature from the cloud onto the device, with no quality loss — if that's even possible.
I'm building a native macOS app with a discovery planner: type a one-line task — "plan a trip to Paris" — and it returns a restated title and exactly seven clarifying questions. Today it runs on Claude Sonnet in the cloud. I wanted it entirely local and native, on Apple's built-in ~3B Foundation Model, with no Anthropic access in the shipping product. The whole project is one question: can a ~20× smaller, on-device model match a frontier cloud model on this one narrow task?
To answer it honestly I built an autoresearch loop (inspired by Andrej Karpathy's autoresearch): a frozen ruler — 30 held-out tasks, each with a Sonnet gold answer and a Sonnet judge that scores the on-device answer head-to-head — and a mutable agent that every experiment changes. The judge scores each question set 1–5 on five dimensions; quality (0–1) blends the rubric with pairwise win rate. The ruler never changes, so scores are comparable across months of work.
The loop itself is run by Claude Code, Anthropic's coding agent. Each iteration it reads the running log of what's already been tried, proposes the next experiment, writes and runs the code against the ruler, and records the result — one experiment at a time. I worked alongside it as supervisor: steering which levers to pursue next, giving continuous feedback on its hypotheses, and triggering the manual fine-tuning runs. So throughout this report, "we" means that pairing — an AI agent generating and running the experiments under human direction.
Real progress — quality roughly doubled. Then it stalled.
A fine-tuned SFT adapter was the only lever that broke past the prompting ceiling; everything above ~0.32 runs on one. Four of the five rubric dimensions reached 4–5. One did not.
Five lever classes, 71 experiments, 9 that set a new best. One dimension stuck throughout.
| Lever class | Runs | Best quality | Coverage | Gate |
|---|---|---|---|---|
| Stock baseline | 1 | 0.28 | 3 | full-30 |
| Inference prompt · schema · decoding · RAG · multi-call pipelines · self-critique · over-generate-prune · solution-space info-gain · answer-simulation · decomposed verification | 29 | 0.32 | 3 | dev-10 |
| SFT LoRA adapters (imitation), incl. more data & coverage-forced data | 34 | 0.44 | 3 | full-30 |
| ORPO preference, style-vs-style pairs | 2 | 0.36 | 3 | full-30 |
| ORPO preference, on-policy pairs (gold vs the model's own drafts) | 2 | 0.40 | 3 | full-30 |
Four findings hold across every row. Atomicity, specificity, naturalness and non-redundancy reached 4–5 — the on-device model phrases questions like a thoughtful human and rarely repeats itself. Coverage — whether the seven questions include the single most decision-critical unknown — sat at 3 in every one of the 71 runs (the only values ever observed, on either gate, were 2 and 3). It did not move for prompting, for retrieval, for elaborate multi-call reasoning, for fine-tuning, for more or reshaped training data, or for preference learning.
Every experiment, at a glance. One tile per run, in the order each was run — its on-device pipeline stripped to bare structure, coded by the legend below. Read the shapes: thin single-shot baselines, ensembles fanning wide, the tall multi-call reasoning pipelines of the inference era, then the two-column adapter runs that fill the back half.
Inference call SFT adapter call ORPO adapter call Swift / user step Adapter training (dev-time)
You can view each experiment's details — its hypothesis, method, result and process diagram — on the live dashboard.
Hold everything constant. Change only the model. Watch coverage.
If coverage is locked because the harness or the prompt is wrong, then a more capable model would also be stuck at 3. If it's locked because the 3B model isn't capable enough, a stronger model — given the identical prompt, schema, gold, judge and rubric, and decoded exactly the way the gold itself was — should clear it. So we ran two cloud models as the candidate through the exact same harness. The only variable is the model.
| Candidate model | Where | Quality | Coverage | vs gold (W/T/L) |
|---|---|---|---|---|
| Apple Foundation Model (~3B) | on-device | 0.44 | 3 | best 2/5/22 |
| Claude Haiku 4.5 | cloud | 0.55 | 4 | 4/11/15 |
| Claude Sonnet 4.6 (= the gold model) | cloud | 0.65 | 4 | 4/18/8 |
Coverage moved the moment the model changed: 3 → 4. It was never the harness — the same rubric that scored the 3B at 3 scores Haiku and Sonnet at 4. Two further details matter:
The obvious follow-up: if we stopped caring about coverage, would the model reach parity? No.
Here the story turns. Everything to this point — the climb, the 71 experiments, the cloud confirmation — frames coverage as the wall: the one dimension stuck at 3, the one the judge keeps naming. But "stuck at the lowest score" and "the reason it loses" are not the same claim, and this section is where we stopped assuming they were. Coverage was the lowest dimension and by far the most-cited in the judge's notes, so it was tempting to call it the gap. To test that, we re-judged the stored candidates with the exact official judge in two modes: the normal five dimensions, and a coverage-blind mode that drops coverage from the rubric and from the "which set is better" verdict. Same candidate outputs; only the judge's instructions change. The five-dimension control reproduces the logged quality, so the comparison is sound.
| Candidate | quality, 5-dim | quality, coverage-blind | Δ | loses to gold (of 30) |
|---|---|---|---|---|
| SFT baseline | 0.398 | 0.399 | +0.00 | 23 → 23 |
| Best adapter (exp056) | 0.415 | 0.412 | −0.00 | 22 → 22 |
| On-policy ORPO | 0.369 | 0.370 | +0.00 | 25 → 24 |
Removing coverage changes quality by essentially zero — and, the telling part, the head-to-head barely moves: told to ignore coverage entirely, the judge still prefers the gold set ~22–24 times out of 30. The losses were never mostly about coverage.
The reason is visible in the rubric the same judge assigns each model as a candidate:
| Scored as candidate | atomicity | specificity | coverage | naturalness | non-redundancy |
|---|---|---|---|---|---|
| On-device best (exp056) | 4 | 4 | 3 | 4 | 4 |
| Sonnet (cloud anchor) | 5 | 4 | 4 | 5 | 5 |
The on-device model matches the frontier on specificity alone and trails by about a point on everything else. Coverage is the joint-lowest, but it is not special — the gap is broad. That is why re-weighting it away does nothing: the model is a uniform notch below the frontier, and coverage is simply the symptom the judge can name in a sentence.
Imitation, preference learning, and every inference-time technique in the 2024–2025 literature leave the on-device model a consistent notch below the frontier on every dimension. A more capable model clears that notch on the identical harness (coverage 3→4, and 4→5 elsewhere), and removing coverage from the judging recovers none of the gap. So the shortfall is bound to model capacity, broadly — no amount of further training or scaffolding on this base is likely to close it.
What this means for the product. The on-device model is uniformly close — roughly two-thirds of the frontier's practical quality, about a point below on nearly every dimension, and strong in absolute terms (4 of 5 on most). The shortfall is broad and capacity-bound, so there is no single dimension to fix. Three honest paths:
My recommendation: ship and document, and reframe the bar to absolute usefulness. The model is a uniform notch below the frontier, not broken anywhere; at ~two-thirds of the practical ceiling and strong in absolute terms it is plausibly good enough — and the broad, capacity-bound gap won't yield to more work on this base. Re-test when a larger on-device model ships.
Every experiment is logged with its hypothesis, method, result and a process diagram on the dashboard; the narrative is in the project overview. The frozen ruler (30 gold cases + Sonnet judge), all agent configs, the LoRA training and ORPO scripts, the on-policy draft generator, the cloud ceiling-confirmation harness, and the coverage-blind re-judge (§5) are in the repository. The autoresearch loop runs unattended and can request the manual training track when it judges the weights are the answer. The cloud anchors in §4 and the coverage-blind re-judge in §5 are reference analyses only, deliberately kept out of the on-device run log.