Four copies of a cheap model beat Fable at 1/7 the price
Update: the frontier-panel number here was synthesized with our earlier engine; the best synthesizer we have since found, a Kimi-k2.6 judge feeding a GLM-5.2 synthesizer, takes the same frontier panel to 73.4. See "Synth is two jobs" for the full judge × synthesizer grid.
Run MiniMax-M3 four times on a hard research question, synthesize the four reports with a fifth M3, and the answer scores 68.1 on DRACO deep research. Fable 5, a frontier closed model, scores 65.3 running once. Four M3 runs plus the synth cost about $37 to run the hundred-task benchmark; one Fable 5 run, at twice Opus's price, models to around $250. So four cheap copies clear the frontier model at roughly a seventh of the price. Run M3 ten times instead and you reach 69.4 — a little higher, still a fraction of the cost — but four is already past Fable.
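The recipe in that paragraph can be sketched in a few lines. This is a minimal sketch, not the published synth code: `run_research` and `synthesize` are hypothetical callables standing in for whatever client actually calls the model, and any judge or consensus pass is assumed to live inside `synthesize`.

```python
from typing import Callable, List


def synthesize_n_runs(
    question: str,
    run_research: Callable[[str], str],
    synthesize: Callable[[str, List[str]], str],
    n_runs: int = 4,
) -> str:
    """Run the same researcher n_runs times, then fuse the reports into one answer.

    run_research and synthesize are hypothetical stand-ins for real model calls.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # Each call is a separate run on the same question.
    reports = [run_research(question) for _ in range(n_runs)]
    # One further call (the "fifth M3" in the four-run setup) fuses the reports.
    return synthesize(question, reports)
```

With `n_runs=4` this makes five model calls per task: four research runs plus one synthesis.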
I did not expect ten copies of one model to do this, because two copies do nothing. Synthesize two M3 runs and you score 66.2 — the same 66.2 a single run scores. Not a tenth of a point of lift. Then I ran the same trick on Opus 4.8: one run scores 60.7, two runs synthesized score 67.6. Seven points, same synthesizer, same tasks. Two M3 runs buy nothing and two Opus runs buy seven points. Why?
What decides it is whether the runs fail in the same place. Synth can only recover an answer when at least one run got the part the others missed. Opus's two runs miss different things, so the pair lands the parts a single run flubbed: on the tasks where synthesizing two of them helped most, one Opus run averaged 52 out of 100, its own worst work, and the two together recovered it. M3's two runs move together. Score them against each other and M3 swings by more than five points on 42 of the hundred tasks — up on 22, down on 20, a wash — because when M3 gets a task wrong both of its runs get it wrong the same way. Opus is the shaky one as a solo researcher, and that is what lets it synthesize: a model that misses unpredictably misses somewhere new on the second try. A steady model like M3 misses the same way twice.
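The up/down tally above is easy to reproduce if you have per-task scores. Here is a rough sketch, assuming two aligned lists of per-task scores (say, a single run and a synthesized pair) and the five-point threshold from the text. The function name and the five sample scores are made up for illustration; they are not benchmark data.

```python
from typing import Sequence, Tuple


def count_swings(
    baseline: Sequence[float],
    candidate: Sequence[float],
    threshold: float = 5.0,
) -> Tuple[int, int]:
    """Return (up, down): tasks where candidate moves more than threshold from baseline."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must cover the same tasks")
    up = sum(1 for b, c in zip(baseline, candidate) if c - b > threshold)
    down = sum(1 for b, c in zip(baseline, candidate) if b - c > threshold)
    return up, down


# Made-up scores for five tasks, purely illustrative.
solo = [70.0, 55.0, 80.0, 62.0, 48.0]
pair = [78.0, 49.0, 81.0, 62.0, 60.0]
up, down = count_swings(solo, pair)
print(f"up on {up}, down on {down}")
```

When the up and down counts come out roughly equal, as in the 22 and 20 reported for M3, the swings cancel and the average does not move.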
More runs work for the same reason two don't. M3's mistakes mostly repeat, but not every time. Each extra independent run is another chance for one of them to dig up the primary source the others skipped, and the synthesizer keeps whatever survives the cross-check. So the score climbs with the number of runs, and you can watch how. Two runs do nothing. By four you have most of the gain and you have cleared Fable; around seven it tops out near 69; past that, more copies buy nothing. I checked the lazy explanation first — maybe you just need the runs to look different — and turned the sampling temperature up to force them apart. The score did not move. A high temperature changes the words and the search path, and leaves M3 blind on the same tasks. What works is more genuinely independent runs.
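One way to picture the climb and the plateau is a toy model, which is my illustration and not something fitted to the DRACO data. Assume a fraction `shared_blind` of the needed material is missed by every run, and the rest is missed by each run independently with probability `miss`. Both parameters and the values below are arbitrary assumptions. The toy model does not reproduce the flat stretch between one and two runs; it only shows why extra copies stop paying once the independent misses are used up.

```python
def coverage(n_runs: int, shared_blind: float, miss: float) -> float:
    """Fraction of material at least one of n_runs recovers, under the toy assumptions.

    shared_blind: fraction that every run misses (shared blind spots).
    miss: per-run chance of missing any of the remaining material, independent across runs.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    if not (0.0 <= shared_blind <= 1.0 and 0.0 <= miss <= 1.0):
        raise ValueError("shared_blind and miss must lie in [0, 1]")
    # Only the non-shared part can be rescued, and only if some run does not miss it.
    return (1.0 - shared_blind) * (1.0 - miss ** n_runs)


# Arbitrary illustrative parameters, not estimates for M3.
for n in (1, 2, 4, 7, 10):
    print(n, round(coverage(n, shared_blind=0.3, miss=0.5), 3))
```

Under these assumptions coverage can never exceed `1 - shared_blind` however many copies you add, which is the same shape as a score that tops out and stays there.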
Figure: DRACO score against the number of M3 runs synthesized, each point scored by its own full hundred-task judging pass. This is one nested ordering of the runs, so the bump at seven and the dip at nine are run-to-run noise of about a point. The shape that matters is the climb out of the flat and the plateau near 69, short of the all-open panel and well short of the frontier.
Cost is why you do this with a cheap model. M3 costs $0.30 per million input tokens and $1.20 per million output tokens. Fable 5 runs at twice Opus 4.8's price, about $9.90 in and $49.50 out, so per token it costs thirty-three to forty-one times what M3 does. Four M3 research runs come to about $30; add the M3 that fuses them and the gemini-3.1-pro grader that writes the consensus pass, and the measured cost is $37 over the hundred tasks. Ten runs, the full curve, cost $87. One Fable 5 run, priced out at the same token budget M3 uses, models to about $250. Its price is route-blocked and unpublished, so that figure is a model, not a bill. The model leans on one assumption, that Fable 5 burns tokens like M3 does, and the gap survives it: even at half the tokens Fable 5 would run $125. Four cheap runs beat the frontier model at about a seventh of its cost, ten at about a third; either way, a fraction.
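The arithmetic behind those ratios is short enough to check by hand or in a few lines. Every number below is taken from the paragraph above; only the variable names are mine, and the $250 figure is the modeled estimate, not a bill.

```python
# Prices in dollars per million tokens, as quoted in the text.
M3_IN, M3_OUT = 0.30, 1.20
FABLE_IN, FABLE_OUT = 9.90, 49.50

# Per-token price ratios, Fable 5 over M3.
ratio_in = FABLE_IN / M3_IN
ratio_out = FABLE_OUT / M3_OUT

# Costs over the hundred-task benchmark, in dollars.
FOUR_RUN_COST = 37.0   # measured: four M3 runs plus synth and grader
TEN_RUN_COST = 87.0    # measured: ten M3 runs, the full curve
FABLE_MODELED = 250.0  # modeled: assumes Fable 5 uses M3's token budget

four_share = FOUR_RUN_COST / FABLE_MODELED
ten_share = TEN_RUN_COST / FABLE_MODELED
fable_half_tokens = FABLE_MODELED / 2  # the half-the-tokens scenario

print(f"input ratio {ratio_in:.2f}, output ratio {ratio_out:.2f}")
print(f"four runs: {four_share:.1%} of modeled Fable cost")
print(f"ten runs: {ten_share:.1%} of modeled Fable cost")
print(f"Fable at half the tokens: ${fable_half_tokens:.0f}")
```

The four-run share comes out a little under 15 percent, which is where "about a seventh" comes from, and the ten-run share is just over a third.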
| Approach | Models | DRACO | Cost / 100 tasks |
|---|---|---|---|
| Frontier panel | 5 different (closed + open) | 73.4 | — |
| All-open panel | 5 different open | 69.2 | — |
| Ten M3 runs, synthesized | 1 open, run ten times | 69.4 | $87 measured |
| Four M3 runs, synthesized | 1 open, run four times | 68.1 | $37 measured |
| M3 solo | 1 open | 66.2 | — |
| Fable 5 solo | 1 closed frontier | 65.3 | ~$250 modeled |
Does stacking copies of one model beat a real panel of different ones? No, and the gap is the interesting part. A frontier-mixed panel of five models, GPT-5.5 and Opus among them, scores 73.4 with our best synthesizer, about four points above the M3 copies. A panel of five different open models scores 69.2, a hair above where the M3 copies plateau. Five different models clear any pile of the same one because different models go blind on different tasks, and copies of M3 share one set of blind spots. A real panel gets that spread from variety. You can get most of it from volume instead, and volume is cheap when the model is.
This is the same engine behind the panel results we have written about: synth pays out on diverse error, and the model in the synthesizer seat decides how much of it survives. You do not need a roster of models to get diverse error. You can manufacture most of it by running one cheap model a handful of times. The full run is public — every synthesized answer, the synth code, the grader — in TrustedRouter Synth Draco. Four tries at a cheap model beat one try at a frontier one.