I'm building a native macOS app, and I want its AI to run entirely local and native, using Apple's Foundation Models. The open question: can that small, built-in model match a frontier cloud model on one specific task? This page is our attempt to find out.
The aim: move one feature from the cloud onto the device — ideally with no quality loss. Whether that's achievable is exactly what we're testing.
My app has a discovery planner: when someone types a one-line task — "plan a trip to Paris", "write my resume" — it responds with a restated title and exactly seven clarifying questions that surface what it needs to know before planning. Today that runs on Anthropic's Claude Sonnet in the cloud.
The planner turns a vague one-liner into seven specific clarifying questions. (Shown: the gold-standard answer from claude-sonnet-4-6 — the bar we're trying to match on-device.)
I want it to run entirely local and Apple-native, on Apple's built-in Foundation Model, the roughly 3-billion-parameter model that ships with macOS. Going native through the OS model means privacy (nothing leaves the Mac), zero inference cost, and a feature that keeps working offline.
The on-device model is ~20× smaller than the cloud model. The whole project is the question: can a small specialized model match a large general one on a narrow, well-defined task?
A frozen "ruler" to measure against, and a mutable "agent" we keep improving.
Rather than hand-tune the on-device model by trial and error, I built an autoresearch loop (inspired by Andrej Karpathy's autoresearch). It has two halves that must never blur together:
The ruler (frozen). A fixed set of 30 held-out tasks, each with a gold answer written by Sonnet, plus a Sonnet judge that compares the on-device answer to the gold head-to-head and scores it on a five-dimension rubric, each dimension from 1 to 5. This never changes: it is the honest measuring stick. Making the ruler easier is forbidden.
The agent (mutable). The on-device question-generator itself: its prompts, its multi-step pipelines, and (the big lever) a fine-tuned adapter. This is what every experiment changes, trying to score higher against the unchanging ruler.
Each experiment produces one number — a quality score from 0 to 1 — plus the judge's written reasoning for every case. Because the ruler is fixed, scores are comparable across months of experiments. The live dashboard plots every experiment as a point and tracks the running best.
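The bookkeeping this implies (one score per experiment, a running best, and a verdict on whether a run set a new best) can be sketched in a few lines. This is a minimal illustration, not the project's code: the `Experiment` record, the `gold_leak` flag, the status names, and every score below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    score: float             # quality in [0, 1], as measured by the frozen ruler
    gold_leak: bool = False  # hypothetical flag: gold answers leaked into the run


def label_experiments(experiments):
    """Walk experiments in order and return (name, status, running_best) rows."""
    best = float("-inf")
    rows = []
    for exp in experiments:
        if exp.gold_leak:
            status = "invalid"    # a leaked run is never allowed to set a new best
        elif exp.score > best:
            best = exp.score
            status = "kept"       # set a new best
        else:
            status = "discarded"  # did not beat the running best
        rows.append((exp.name, status, best))
    return rows


# Illustrative numbers only, not results from the project.
demo = [
    Experiment("baseline", 0.20),
    Experiment("prompt-tweak", 0.18),
    Experiment("leaky-run", 0.90, gold_leak=True),
    Experiment("adapter-v1", 0.35),
]

for row in label_experiments(demo):
    print(row)
```

Because the ruler is frozen, the running best only ever moves when a valid run genuinely scores higher, which is what makes scores comparable over time.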
A second AI that writes the experiments — one per iteration, on a tight leash.
The loop is driven by Claude Code, Anthropic's coding agent, running headless and unattended. Each iteration it does the full scientific cycle by itself: read the experiment log and the judge's latest failure notes, form one hypothesis, write the code, build it, run the evaluation (which auto-scores via the Sonnet judge), and honestly log whether it helped — then stop.
I work alongside it as supervisor — steering which levers it pursues next, giving continuous feedback on its hypotheses, and running the manual fine-tuning track it can't do itself. So when this page says "we," it means that pairing: an AI agent generating and running the experiments, with a human directing the research.
Two kinds of operator appear on the dashboard: Interactive (work done in a guided session, mostly the adapter training) and Autonomous (experiments written by this agent).
The levers we can pull, the rubric the model is scored on, and the rule that keeps us honest.
The rubric. The judge scores each set of seven questions on five dimensions, each from 1 to 5.
One real evaluation
Here is a single case from our test set, scored exactly the way every experiment is: the user's one-liner, the cloud model's gold answer, the on-device answer, and the judge's verdict.
Task: "Plan a trip to Paris"
Verdict: Gold better (rubric dimensions each scored 1–5)
Judge's reasoning: The on-device set asks a nonsensical question ("Which city in Paris are you staying in?", when Paris is itself a city) and omits two critical unknowns, the departure city and the party size, making it less useful for planning than the gold set.
The on-device model phrases questions naturally and even adds a sensible one about mobility — but it misses the few decision-critical unknowns (where you're flying from, how many people). That gap, which we call coverage, is the central challenge of this project. On other tasks the on-device model ties or even beats the cloud model; this is one honest, representative case.
The levers. An experiment is some combination of the knobs below, and every experiment is tagged on the dashboard by the most powerful lever it pulls — its type. There are three: Inference changes how the model answers with no retraining (this is what the autonomous agent explores), while SFT and ORPO retrain the model's weights by hand. SFT and ORPO produce the same artifact — a trained adapter — and differ only in the training recipe.
Inference: inference-time knobs that change how the model answers, with no retraining (what the autonomous agent explores)
How the model picks each next word — always the single most likely one (predictable) or sampling for variety. We found predictable ("greedy") to be both better and less noisy here.
The exact wording of the request and any guidance baked into it.
Forcing the shape of the answer — exactly seven questions, one-topic-per-slot, or deliberately eight-then-trim (the trick behind the current best).
Chaining calls — draft, then self-critique and revise; or generate several candidates and pick, vote, or run a tournament between them.
Pulling the most similar past tasks and showing their answers as inspiration before the model writes its own.
Deterministic clean-up after generation — removing near-duplicates, scoring coverage against known-important topics, the over-generate-and-prune step.
Letting the model call a helper function mid-answer (e.g. a coverage check). Available in the framework; not yet exploited.
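To make the decoding knob above concrete, here is a minimal sketch of greedy versus sampled next-word selection over a toy score list. The vocabulary and scores are invented for illustration, and nothing here reflects how Apple's framework exposes these options.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(tokens, logits):
    """Always return the single most likely token (deterministic)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best_index]


def pick_sampled(tokens, logits, temperature=1.0, rng=random):
    """Draw one token at random, in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["budget", "dates", "weather"]  # toy vocabulary
logits = [2.0, 1.5, 0.1]                 # toy scores, not from a real model

print(pick_greedy(tokens, logits))       # the same token on every run
rng = random.Random(0)
print([pick_sampled(tokens, logits, rng=rng) for _ in range(5)])  # depends on the seed
```

Greedy selection removes one source of run-to-run variance, which fits the observation above that it was less noisy in this evaluation.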
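The over-generate-and-prune idea from the structure and clean-up knobs above could look roughly like this. It is a sketch under stated assumptions: word-overlap (Jaccard) similarity stands in for whatever near-duplicate test the project uses, the `important_topics` keyword list and the 0.6 threshold are hypothetical, and the candidate questions are invented.

```python
import re


def _words(text):
    """Lowercased set of word tokens in a question."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two questions, from 0 to 1."""
    wa, wb = _words(a), _words(b)
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def prune_questions(candidates, keep=7, important_topics=(), dup_threshold=0.6):
    """Drop near-duplicates, then trim to `keep`, sparing important topics."""
    # 1. Keep the first of any group of near-duplicate questions.
    unique = []
    for question in candidates:
        if all(jaccard(question, kept) < dup_threshold for kept in unique):
            unique.append(question)

    def covers(question):
        lowered = question.lower()
        return any(topic in lowered for topic in important_topics)

    # 2. While too long, drop the last question that touches no important topic
    #    (or simply the last question if every one of them does).
    kept = list(unique)
    while len(kept) > keep:
        droppable = [i for i, q in enumerate(kept) if not covers(q)]
        kept.pop(droppable[-1] if droppable else -1)
    return kept


# Invented candidates for "plan a trip to Paris"; not model output.
candidates = [
    "Where will you be departing from?",
    "How many people are traveling?",
    "What are your travel dates?",
    "What is your total budget?",
    "What is your budget in total?",  # near-duplicate of the previous one
    "Do you prefer museums or food tours?",
    "Do you have any mobility needs?",
    "What kind of accommodation do you want?",
    "Do you need travel insurance?",
]
topics = ("departing", "people", "dates", "budget")  # hypothetical topic keywords

for question in prune_questions(candidates, keep=7, important_topics=topics):
    print(question)
```

The step is deterministic, so it adds no sampling noise of its own, and it preserves the order in which the model produced the surviving questions.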
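The retrieval knob above (show the most similar past tasks before the model writes its own answer) can be illustrated with a simple word-overlap ranking. This is only a sketch: the similarity measure, the prompt wording, the function names, and the stored examples are all assumptions, and a real system might use embeddings instead.

```python
import re


def tokenize(text):
    """Lowercased set of word tokens in a task description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Jaccard overlap between the word sets of two task descriptions."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(task, past, k=2):
    """Return the k (task, questions) entries most similar to `task`."""
    ranked = sorted(past, key=lambda item: similarity(task, item[0]), reverse=True)
    return ranked[:k]


def build_prompt(task, examples):
    """Assemble a prompt that shows retrieved examples before the new task."""
    lines = ["Here are clarifying questions written for similar tasks:"]
    for ex_task, ex_questions in examples:
        lines.append(f"Task: {ex_task}")
        lines.extend(f"- {q}" for q in ex_questions)
    lines.append(f"Now write seven clarifying questions for: {task}")
    return "\n".join(lines)


# Invented past tasks; not taken from the project's data.
past = [
    ("plan a trip to Rome", ["Where are you flying from?", "How many travelers?"]),
    ("write my resume", ["Which role are you targeting?"]),
    ("plan a birthday party", ["How many guests?", "What is the budget?"]),
]

top = most_similar("plan a trip to Paris", past, k=2)
print(build_prompt("plan a trip to Paris", top))
```

In a setup like the one described here, the stored examples would presumably need to come from outside the 30 held-out cases, so that no gold answer leaks into the prompt.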
The weights — retrain the model itself, by hand (the most powerful lever)
The adapter. The artifact both recipes produce: a small trained add-on baked into the model's own weights. It is the only lever that broke past the prompting ceiling: every result above ~0.32, including the 0.44 best, runs on one. What differs between adapters is the training recipe:
SFT. Supervised fine-tuning on (task → one good answer) pairs: "make answers look like this." Every adapter so far was trained this way. It nails phrasing and format, but adding or reshaping examples never moved coverage.
ORPO (Odds Ratio Preference Optimization). Train on pairs, a better answer beside a worse one, so the model learns to prefer high-coverage sets, not merely imitate them. It is the first lever aimed straight at the coverage gap. In training now.
The autonomous agent turns the Inference knobs on its own; it can't retrain the model, so when it judges the weights are the answer it flags a request for the manual track. The weights (SFT today, ORPO in training now) are trained by hand, and that manual track is the one aimed squarely at the coverage gap.
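For readers unfamiliar with ORPO, the following sketch shows the shape of its objective as published: the usual supervised loss on the better answer, plus a penalty that shrinks as the model's odds of the better answer grow relative to the worse one. It is a toy calculation on single numbers, not a training loop. The weighting `lam=0.1` and the log-probabilities are illustrative assumptions, and nothing here describes this project's actual training configuration.

```python
import math


def log_odds(avg_logprob):
    """log(p / (1 - p)), where p = exp(avg_logprob) is the length-normalised
    likelihood of an answer. Requires avg_logprob < 0 so that p < 1."""
    p = math.exp(avg_logprob)
    return avg_logprob - math.log1p(-p)


def orpo_loss(chosen_avg_logprob, rejected_avg_logprob, lam=0.1):
    """Supervised term on the chosen answer plus a weighted odds-ratio term."""
    sft_term = -chosen_avg_logprob  # negative log-likelihood of the better answer
    ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    # -log(sigmoid(ratio)) rewritten as log(1 + exp(-ratio))
    odds_ratio_term = math.log1p(math.exp(-ratio))
    return sft_term + lam * odds_ratio_term


# Toy average token log-probabilities, not measurements from any model.
prefers_better = orpo_loss(chosen_avg_logprob=-0.5, rejected_avg_logprob=-1.5)
indifferent = orpo_loss(chosen_avg_logprob=-0.5, rejected_avg_logprob=-0.5)
print(prefers_better < indifferent)  # the penalty is smaller once the model prefers the better answer
```

The supervised term alone is what SFT optimises; the extra term is what lets a preference pair say "this set, not that one," which is why it targets coverage rather than phrasing.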
Around 60 experiments across two tracks. Here's the honest scorecard.
The climb so far (quality on the full 30-case gate):
For reference, a perfect tie with Sonnet on every case would be ~0.5; consistently beating it would be higher. We're not there yet — the current best still loses about 21 of 30 head-to-heads — but the trajectory is real and every step is reproducible.
Chart legend: kept (set a new best), discarded, × invalid (gold leak), plus the running best (per metric).
Static snapshot of the live data. The full interactive version — each experiment's hypothesis, method, result, and process diagram — is on the dashboard.