Project overview: a post-training study on small-model specialization with an autoresearch loop

Update · research concluded

This project has reached its conclusion. The full findings and the verdict are in the final report.

1. The goal

The aim: move one feature from the cloud onto the device — ideally with no quality loss. Whether that's achievable is exactly what we're testing.

My app has a discovery planner: when someone types a one-line task — "plan a trip to Paris", "write my resume" — it responds with a restated title and exactly seven clarifying questions that surface what it needs to know before planning. Today that runs on Anthropic's Claude Sonnet in the cloud.

User input

"Plan a trip to Paris"

→

Discovery Planner

What city are you departing from?
When are you planning to travel?
How many people are going on this trip?
How long do you plan to stay in Paris?
What is your total budget for this trip?
What type of accommodation do you prefer?
What is the main purpose of your visit?

The planner turns a vague one-liner into seven specific clarifying questions. (Shown: the gold-standard answer from claude-sonnet-4-6 — the bar we're trying to match on-device.)

I want it to run entirely locally and Apple-native, on Apple's built-in Foundation Model, the roughly 3-billion-parameter model that ships with macOS. Going native through the OS model means privacy (nothing leaves the Mac), zero inference cost, and offline operation.

The hard constraint: the shipping product must contain no Anthropic access. I use Sonnet only at development time, in two roles — as the gold standard we're trying to match, and as the judge that scores how close we are. The on-device model has to stand on its own.

The on-device model is ~20× smaller than the cloud model. The whole project is the question: can a small specialized model match a large general one on a narrow, well-defined task?

2. Autoresearch: a self-improving research loop

A fixed "judge" to measure against, and a mutable "agent" we keep improving.

Rather than hand-tune the on-device model by trial and error, I built an autoresearch loop (inspired by Andrej Karpathy's autoresearch). It has two halves that must never blur together:

Fixed JUDGE

A fixed set of 30 held-out tasks, each with a gold answer written by Sonnet, plus a Sonnet judge that compares the on-device answer to the gold head-to-head and scores it on a five-dimension rubric, each dimension rated 1–5. This never changes: it is the honest measuring stick. Making the judge easier is forbidden.

vs.

Mutable AGENT

The on-device question-generator itself — its prompts, its multi-step pipelines, and (the big lever) a fine-tuned adapter. This is what every experiment changes, trying to score higher against the unchanging judge.

Each experiment produces one number — a quality score from 0 to 1 — plus the judge's written reasoning for every case. Because the judge is fixed, scores are comparable across months of experiments. The live dashboard plots every experiment as a point and tracks the running best.

How one experiment runs. Claude Code drives the cycle autonomously: read the log, form one hypothesis, edit the agent, and have the judge score it — keeping the result only if it beats the running best. The human supervisor sets direction and runs the fine-tuning track the agent can't. Solid arrows are the automated loop; the dashed link is the scoring by the judge; rust marks human input.
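
The keep-or-discard step of that cycle can be sketched in a few lines. This is a minimal illustration, not the project's code: the `ExperimentLog` class, its field names, and the status strings are hypothetical, and the only rule taken from the text is that a result is kept only when it beats the running best.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentLog:
    """Tracks the running best score across experiments (hypothetical structure)."""
    running_best: float = 0.0
    history: list = field(default_factory=list)

    def record(self, name: str, score: float) -> str:
        """Log one experiment and return 'kept' or 'discarded'."""
        if not 0.0 <= score <= 1.0:
            raise ValueError("quality score must be between 0 and 1")
        if score > self.running_best:
            status = "kept"  # strictly better than the running best
            self.running_best = score
        else:
            status = "discarded"  # ties and regressions are not kept
        self.history.append((name, score, status))
        return status


log = ExperimentLog()
for name, score in [("baseline", 0.28), ("prompting", 0.32),
                    ("regression", 0.30), ("adapter", 0.41)]:
    print(name, score, log.record(name, score))
```

Fed the scores 0.28, 0.32, 0.30, 0.41 in that order, the sketch keeps the first two, discards the third, and keeps the fourth. The experiment names here are placeholders.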

3. The autonomous coding agent

A second AI that writes the experiments — one per iteration, on a tight leash.

The loop is driven by Claude Code, Anthropic's coding agent, running headless and unattended. Each iteration it does the full scientific cycle by itself: read the experiment log and the judge's latest failure notes, form one hypothesis, write the code, build it, run the evaluation (which auto-scores via the Sonnet judge), and honestly log whether it helped — then stop.

I work alongside it as supervisor — steering which levers it pursues next, giving continuous feedback on its hypotheses, and running the manual fine-tuning track it can't do itself. So when this page says "we," it means that pairing: an AI agent generating and running the experiments, with a human directing the research.

What keeps it safe and honest

It may only edit the agent's code — the judge (gold tasks, rubric, metric) is off-limits and any tampering is auto-reverted.
It has no web access and a scoped toolset; it cannot reach outside the project.
Every experiment must end in a working build and exactly one logged result. Each is committed automatically, so progress is fully traceable.
Strict anti-cheating rules: it can use example answers as inspiration but never copy them into its output (more on why below).
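
The first rule above (the judge is off-limits) could be enforced by checking which files an experiment changed before accepting it. The sketch below is an assumption about how such a guard might look: the directory names `judge` and `gold_tasks` are invented for illustration, and the actual auto-revert mechanism is not described on this page.

```python
from pathlib import PurePosixPath

# Hypothetical layout: these directory names are assumptions for illustration.
PROTECTED_DIRS = ("judge", "gold_tasks")


def protected_paths_touched(changed_files, protected_dirs=PROTECTED_DIRS):
    """Return the changed files that live under a protected top-level directory."""
    touched = []
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in protected_dirs:
            touched.append(path)
    return touched


changed = ["agent/prompt.txt", "judge/rubric.md", "agent/pipeline.py"]
violations = protected_paths_touched(changed)
print(violations)  # ['judge/rubric.md']
```

A non-empty result would be the signal to revert the change rather than score it.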

Two kinds of operator appear on the dashboard: Interactive (work done in a guided session, mostly the adapter training) and Autonomous (experiments written by this agent).

4. How an experiment works

The levers we can pull, the rubric the model is scored on, and the rule that keeps us honest.

The rubric. The judge scores each set of seven questions on five dimensions, 1–5:

Atomicity — each question asks exactly one thing.
Specificity — concrete to this task, not generic filler.
Coverage — the seven cover the most decision-critical unknowns. (This has been the stubborn one.)
Naturalness — reads like a thoughtful human coach.
Non-redundancy — the questions don't overlap.
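
As a rough illustration, the five dimensions can be held in a small validated record. The class and method names are hypothetical. The page does not say how the five scores feed into the 0 to 1 quality number, so this sketch stops at a plain mean and a list of the weakest dimensions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScores:
    """One judge verdict: five dimensions, each an integer from 1 to 5."""
    atomicity: int
    specificity: int
    coverage: int
    naturalness: int
    non_redundancy: int

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be an integer from 1 to 5, got {value!r}")

    def mean(self) -> float:
        """Unweighted average of the five dimensions (an assumed summary)."""
        return sum(getattr(self, f.name) for f in fields(self)) / len(fields(self))

    def weakest(self) -> list:
        """Names of the dimensions that share the lowest score."""
        low = min(getattr(self, f.name) for f in fields(self))
        return [f.name for f in fields(self) if getattr(self, f.name) == low]


# Scores from the "Plan a trip to Paris" evaluation in this section.
paris = RubricScores(atomicity=4, specificity=3, coverage=3,
                     naturalness=3, non_redundancy=4)
print(paris.mean())     # 3.4
print(paris.weakest())  # ['specificity', 'coverage', 'naturalness']
```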

One real evaluation. Here is a single case from our test set, scored exactly the way every experiment is: the user's one-liner, the cloud model's gold answer, the on-device answer, and the judge's verdict.

User input

what the user types

"Plan a trip to Paris"

Gold standard

claude-sonnet-4-6 · cloud

What city are you departing from?
When are you planning to travel?
How many people are going on this trip?
How long do you plan to stay in Paris?
What is your total budget for this trip?
What type of accommodation do you prefer?
What is the main purpose of your visit?

On-device candidate

Apple FM ~3B · on device

What is your travel date or timeframe?
How many nights will you stay in Paris?
What is your total budget for the trip?
Which city in Paris are you staying in?
Do you prefer a guided tour or self-guided exploration?
Is this trip for a specific occasion or purpose?
Do you have any travel restrictions or mobility needs?

Judgement

claude-sonnet-4-6

Gold better

Atomicity 4 Specificity 3 Coverage 3 Naturalness 3 Non-redundancy 4

each scored 1–5

The on-device set asks a nonsensical question ("Which city in Paris are you staying in?" — Paris is a city) and omits two critical unknowns — the departure city and the party size — making it less useful for planning than the gold set.

The on-device model phrases questions naturally and even adds a sensible one about mobility — but it misses the few decision-critical unknowns (where you're flying from, how many people). That gap, which we call coverage, is the central challenge of this project. On other tasks the on-device model ties or even beats the cloud model; this is one honest, representative case.

The levers. An experiment is some combination of the knobs below, and every experiment is tagged on the dashboard by the most powerful lever it pulls — its type. There are three: Inference changes how the model answers with no retraining (this is what the autonomous agent explores), while SFT and ORPO retrain the model's weights by hand. SFT and ORPO produce the same artifact — a trained adapter — and differ only in the training recipe.

Inference: inference-time knobs that change how the model answers, with no retraining (what the autonomous agent explores)

Decoding

How the model picks each next word — always the single most likely one (predictable) or sampling for variety. We found predictable ("greedy") to be both better and less noisy here.

Prompt & instructions

The exact wording of the request and any guidance baked into it.

Output schema

Forcing the shape of the answer — exactly seven questions, one-topic-per-slot, or deliberately eight-then-trim (the trick behind the current best).

Multi-step pipelines

Chaining calls — draft, then self-critique and revise; or generate several candidates and pick, vote, or run a tournament between them.

Retrieval (few-shot examples)

Pulling the most similar past tasks and showing their answers as inspiration before the model writes its own.

Post-processing in code

Deterministic clean-up after generation — removing near-duplicates, scoring coverage against known-important topics, the over-generate-and-prune step.

Tool use

Letting the model call a helper function mid-answer (e.g. a coverage check). Available in the framework; not yet exploited.
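
Of the knobs above, the deterministic clean-up is the easiest to show in code. The following is a minimal sketch of near-duplicate removal, assuming a word-overlap (Jaccard) similarity and a threshold of 0.6. Both choices are illustrative, and the project's actual similarity measure is not described here.

```python
import re


def _tokens(question: str) -> set:
    """Lowercased word set of a question."""
    return set(re.findall(r"[a-z0-9']+", question.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two questions, from 0.0 to 1.0."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(questions, threshold=0.6):
    """Keep each question only if it is not too similar to one already kept."""
    kept = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept


questions = [
    "What is your total budget for this trip?",
    "What is your total budget for the trip?",
    "When are you planning to travel?",
]
print(drop_near_duplicates(questions))
```

Here the second question shares 7 of 9 distinct words with the first, so it is dropped and the other two survive. Word overlap will miss paraphrases that use different vocabulary; it only catches near-verbatim repeats.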

The weights — retrain the model itself, by hand (the most powerful lever)

LoRA adapter

The artifact both recipes produce: a small trained add-on baked into the model's own weights. The only lever that broke past the prompting ceiling — every result above ~0.32, including the 0.44 best, runs on one. What differs between adapters is the training recipe:

SFT imitation

Supervised fine-tuning on (task → one good answer) pairs — "make answers look like this." Every adapter so far. Nails phrasing and format, but adding or reshaping examples never moved coverage.

ORPO preference

Train on pairs — a better answer beside a worse one — so the model learns to prefer high-coverage sets, not merely imitate them. The first lever aimed straight at the coverage gap. In training now.
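
To make the "better answer beside a worse one" idea concrete, here is one way preference pairs might be assembled from candidate answers that already carry a coverage score. Everything in it is an assumption: the function name, the `min_gap` filter, and the `prompt` / `chosen` / `rejected` keys, which follow a common convention in preference-training tooling rather than anything stated about this project.

```python
from itertools import combinations


def build_preference_pairs(task, candidates, min_gap=1):
    """Build (chosen, rejected) records from scored candidate answers.

    candidates: list of (answer_text, coverage_score) tuples.
    Pairs whose scores differ by less than min_gap are skipped as ambiguous.
    """
    pairs = []
    for (a_text, a_cov), (b_text, b_cov) in combinations(candidates, 2):
        if abs(a_cov - b_cov) < min_gap:
            continue  # too close to call
        chosen, rejected = (a_text, b_text) if a_cov > b_cov else (b_text, a_text)
        pairs.append({"prompt": task, "chosen": chosen, "rejected": rejected})
    return pairs


candidates = [("set A", 5), ("set B", 3), ("set C", 3)]
for pair in build_preference_pairs("Plan a trip to Paris", candidates):
    print(pair["chosen"], ">", pair["rejected"])
```

With these placeholder candidates the sketch yields two pairs (A over B, A over C) and skips B versus C because their scores tie.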

The autonomous agent turns the Inference knobs on its own; it can't retrain the model, so when it judges that the weights are the answer, it flags a request for the manual track. The weights (SFT today, ORPO in training now) are trained by hand; this is the track aimed squarely at the coverage gap.

The measurement lesson, learned the hard way. We first screened ideas on a small 10-case proxy with random sampling. Twice, that produced a "win" that evaporated on the full set. The fix is now law: gate every decision on all 30 cases with deterministic (greedy) decoding. Noise had been masquerading as progress.
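
A toy example shows why sampling adds noise that greedy decoding does not. The logits below are made up and stand in for one next-token decision; the real model is not involved. Greedy always returns the same index, while sampling returns different indices from run to run.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Deterministic: always the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng, temperature=1.0):
    """Stochastic: draw an index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.1]  # hypothetical scores for three candidate tokens
rng = random.Random()
print("greedy :", [greedy_pick(logits) for _ in range(5)])
print("sampled:", [sample_pick(logits, rng) for _ in range(5)])
```

Because every token choice in a sampled answer can differ between runs, two evaluations of the same agent can produce different scores, which is the noise the 30-case greedy gate removes.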

Why we never copy examples. The agent retrieves similar past examples to learn from. One experiment scored a suspicious 0.498 — far above everything else. It turned out the retrieval had handed several test cases their own gold answer as a "demonstration," which the model copied and was then graded against. The agent caught this itself, flagged the result invalid, and we hard-wired a "leave-one-out" guard so it can never happen again. A real safeguard against fooling ourselves.
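
A leave-one-out guard of the kind described can be sketched as a retrieval function that refuses to return the query's own record. This is an illustration under assumptions: the record fields (`id`, `task`, `answer`), the word-overlap similarity, and the extra check for identical task text are all hypothetical, not the project's implementation.

```python
import re


def _tokens(text):
    """Lowercased word set of a task description."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve_examples(query_id, query_text, pool, k=3):
    """Return the k most similar past tasks, never the query's own record.

    pool: list of dicts with 'id', 'task', and 'answer' keys.
    """
    q = _tokens(query_text)
    scored = []
    for ex in pool:
        t = _tokens(ex["task"])
        if ex["id"] == query_id or t == q:
            continue  # leave-one-out: a case must never see its own gold answer
        union = q | t
        sim = len(q & t) / len(union) if union else 0.0
        scored.append((sim, ex["id"], ex))
    # Highest similarity first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ex for _, _, ex in scored[:k]]


pool = [
    {"id": "t1", "task": "Plan a trip to Paris", "answer": "gold for t1"},
    {"id": "t2", "task": "Plan a trip to Rome", "answer": "gold for t2"},
    {"id": "t3", "task": "Write my resume", "answer": "gold for t3"},
]
print([ex["id"] for ex in retrieve_examples("t1", "Plan a trip to Paris", pool, k=2)])
# ['t2', 't3']
```

Matching on task text as well as id is a belt-and-braces choice: it also blocks a duplicate of the test task that was stored under a different id.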

5. What we've learned

Around 95 experiments across two tracks. Here's the honest scorecard.

The climb so far (quality on the full 30-case gate):

Baseline

0.28

Stock model, single shot

→

In-context experiments

~0.32

Best of ~27 prompting / retrieval tries

→

Adapter experiment

0.41

Fine-tuned on the model's weights

→

Best experiment so far

0.44

Adapter + over-generate & prune

For reference, a perfect tie with Sonnet on every case would be ~0.5; consistently beating it would be higher. We're not there yet — the current best still loses about 21 of 30 head-to-heads — but the trajectory is real and every step is reproducible.

What didn't work

Throwing more model calls at it. Brainstorm-then-pick, best-of-N, self-critique, voting — every "make the small model check its own work" scheme failed. A weak judge of its own output just compounds the weakness.
Prompt and schema engineering for coverage. Forcing a fixed checklist of dimensions, chain-of-thought "name the unknowns first," redundancy repair — all plateaued at the same coverage score.
More training data alone. Doubling the examples helped a little; re-shaping them to emphasize coverage did not move coverage at all.
Borrowing phrasing from generic examples. Backfilling a weak slot from a foreign source always cost specificity.

What worked

Fine-tuning an adapter. Baking the skill into the weights was the only thing that broke past the prompting ceiling — a jump from ~0.32 to 0.41. Everything above that line, including the current 0.44 best, is built on top of this adapter — the later wins stack inference tricks onto it, they don't replace it.
Deterministic decoding. Removing sampling randomness both raised quality and made results trustworthy.
Over-generate in the model's own voice, then prune. Ask the adapter for 8 questions and drop the weakest — recovering a wasted slot without the specificity tax, because the spare comes from the model's own distribution.
Grounding the prune in known-important topics edged it a little higher still (current best, 0.44).
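
The over-generate-and-prune idea, grounded in known-important topics, might look roughly like the sketch below. It is not the project's pruning rule, which is not described here: the keyword-based topic matching, the `important_topics` mapping, and the tie-breaking choice are all assumptions made for illustration.

```python
def prune_to_target(questions, important_topics, target=7):
    """Drop the questions that add the least topic coverage until `target` remain.

    important_topics: dict mapping a topic name to keywords that signal it.
    """
    def topics_hit(question):
        text = question.lower()
        return {name for name, keywords in important_topics.items()
                if any(kw in text for kw in keywords)}

    kept = list(questions)
    while len(kept) > target:
        def unique_contribution(i):
            # Topics covered by question i and by no other kept question.
            others = set()
            for j, q in enumerate(kept):
                if j != i:
                    others |= topics_hit(q)
            return len(topics_hit(kept[i]) - others)

        # Drop the least useful question; on ties, drop the later one.
        drop = min(range(len(kept)), key=lambda i: (unique_contribution(i), -i))
        kept.pop(drop)
    return kept


topics = {  # hypothetical topic keywords for a travel task
    "departure": ["departing", "flying from"],
    "party_size": ["how many people"],
}
candidates = [
    "What city are you departing from?",
    "What is your favorite color?",
    "How many people are going?",
]
print(prune_to_target(candidates, topics, target=2))
```

In this toy run the color question covers no known topic, so it is the one pruned. The same function with the default `target=7` would trim an eight-question draft to seven.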

Biggest surprises

It's a capability gap, not a prompting gap. 27 clever in-context experiments couldn't beat what a single fine-tuned adapter did — strong evidence the limit is what the model knows, not how we ask.
Cheap beat expensive. A single well-trained call consistently outperformed elaborate multi-call pipelines.
"Coverage" is locked in the weights. Every inference-time trick improves phrasing and reduces overlap, but none teaches the model which unknown actually matters most. That judgment seems to live in the weights, not the prompt.
The system caught itself cheating (the 0.498 leak) — and we corrected it openly rather than banking the win.

Where this points next. Both tracks — the autonomous agent and the manual training work — keep arriving at the same conclusion from opposite directions: the remaining gap is coverage, and it is weight-locked. The next bet is a training method that distills not just Sonnet's answers but its judgment of which questions matter — teaching the model to prefer high-coverage question sets, not just imitate them.

6. The record at a glance

[Chart: the experiment record. Legend: kept (set a new best); discarded; × invalid (gold leak); line showing the running best (per metric).]

Static snapshot of the live data. The full interactive version — each experiment's hypothesis, method, result, and process diagram — is on the dashboard.