Synth is two jobs, and no model wins both

2026-06-19 · TrustedRouter Synth Draco on GitHub

Synthesize a panel of models into one answer and you are running two separate jobs. A judge reads the panel and writes down where the models agree, where they contradict each other, and what they all missed. A synthesizer takes the panel and that analysis and writes the final answer. We put the three strongest open models (MiniMax-M3, GLM-5.2, and Kimi K2.6) into both seats in every combination, giving nine judge-synthesizer pairs, all run on the same frontier research panel and graded by the same gemini-3.1-pro. The best one is a pair of different models: a Kimi K2.6 judge with GLM writing scores 73.4 on DRACO deep research, the highest of the nine, with an M3 judge a hair behind at 72.3. The best single model doing both jobs, GLM at 70.3, trails by 3.1 points.
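The two seats can be sketched as two model calls in sequence. This is a minimal illustration, not the gateway's code: `call_model` is a hypothetical function you supply (model name and prompt in, text out), and the prompt wording is a placeholder rather than the prompts used in the run.

```python
from typing import Callable, List

# Hypothetical signature: (model_name, prompt) -> completion text.
CallModel = Callable[[str, str], str]


def synthesize(
    question: str,
    panel_answers: List[str],
    judge_model: str,
    synth_model: str,
    call_model: CallModel,
) -> str:
    """Run the two jobs in order: judge the panel, then write the answer."""
    panel_text = "\n\n".join(
        f"[Answer {i + 1}]\n{answer}" for i, answer in enumerate(panel_answers)
    )

    # Job 1: the judge only analyses the panel; it does not write the answer.
    # Placeholder prompt, not the gateway's.
    judge_prompt = (
        f"Question:\n{question}\n\nPanel answers:\n{panel_text}\n\n"
        "List where the answers agree, where they contradict each other, "
        "and what they all missed."
    )
    analysis = call_model(judge_model, judge_prompt)

    # Job 2: the synthesizer sees both the panel and the judge's analysis.
    synth_prompt = (
        f"Question:\n{question}\n\nPanel answers:\n{panel_text}\n\n"
        f"Judge analysis:\n{analysis}\n\nWrite the final answer."
    )
    return call_model(synth_model, synth_prompt)
```

Because the judge and synthesizer names are separate arguments, the nine cells of the experiment are just the nine ways of filling those two parameters from the same three models.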

We argued before that MiniMax-M3 is the best synthesizer. That test held the judge fixed at gemini and swapped only the writer, so it answered half the question. The judge is the other half, and the grid below covers both.

judge ↓ \ synthesizer →	MiniMax-M3	GLM-5.2	Kimi K2.6
MiniMax-M3	67.1	72.3	64.7
GLM-5.2	68.0	70.3	66.9
Kimi K2.6	67.1	73.4	48.7

Read it down the columns. GLM writes the best synthesized answer no matter who judges: 72.3 under an M3 judge, 73.4 under a Kimi judge, 70.3 under itself. Move GLM out of the writer's seat and put M3 there and the score settles between 67 and 68 in every row; put Kimi there and it is 65 to 67 under another judge and 48.7 under its own. The synthesizer is the seat where the score is won, and GLM owns it.
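Reading down the columns can be checked with a few lines of arithmetic. The sketch below copies the nine scores from the grid; the function names are ours, chosen for illustration.

```python
MODELS = ["MiniMax-M3", "GLM-5.2", "Kimi K2.6"]

# SCORES[judge][synthesizer], copied from the grid.
SCORES = {
    "MiniMax-M3": {"MiniMax-M3": 67.1, "GLM-5.2": 72.3, "Kimi K2.6": 64.7},
    "GLM-5.2": {"MiniMax-M3": 68.0, "GLM-5.2": 70.3, "Kimi K2.6": 66.9},
    "Kimi K2.6": {"MiniMax-M3": 67.1, "GLM-5.2": 73.4, "Kimi K2.6": 48.7},
}


def column(synth):
    """Scores for one synthesizer across all three judges."""
    return [SCORES[judge][synth] for judge in MODELS]


def column_summary():
    """Min, max and mean score per synthesizer, rounded to one decimal."""
    summary = {}
    for synth in MODELS:
        col = column(synth)
        summary[synth] = {
            "min": min(col),
            "max": max(col),
            "mean": round(sum(col) / len(col), 1),
        }
    return summary


if __name__ == "__main__":
    for synth, stats in column_summary().items():
        print(synth, stats)
```

The column means come out at 67.4 for M3, 72.0 for GLM and 60.1 for Kimi. GLM's worst cell (70.3) is still above the best cell in either other column (68.0 and 66.9).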

The surprise is hiding in that "under itself" number. GLM is the best synthesizer and the worst judge for its own writing. A GLM judge paired with a GLM writer scores 70.3; an M3 judge paired with the same GLM writer scores 72.3, two points higher, a gap that clears the error bars at about two standard errors on a paired comparison. A model judging the panel for its own synthesis brings its own blind spots to the analysis and waves through the gaps it was always going to leave. A second model sees them.
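A paired comparison of two cells can be sketched as follows. This assumes you have per-task scores for both cells on the same tasks in the same order; the post reports only cell averages, so the inputs and the function name here are hypothetical, and this is one standard way to compute a paired standard error rather than a description of how the number above was produced.

```python
import math
from statistics import mean, stdev


def paired_gap(scores_a, scores_b):
    """Mean per-task difference (a - b) and its standard error.

    Both lists must cover the same tasks in the same order, so that
    each difference compares the two cells on one task.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs equal-length score lists")
    if len(scores_a) < 2:
        raise ValueError("need at least two tasks to estimate a standard error")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Sample standard deviation of the differences, scaled by sqrt(n).
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs), standard_error
```

Dividing the mean difference by the standard error gives the gap in standard-error units; a ratio near 2 is what "about two standard errors" refers to. Pairing matters because both cells are scored on the same tasks, so task difficulty cancels out of each difference.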

Kimi writes a respectable answer when M3 or GLM judges, 64.7 and 66.9. With Kimi judging for its own writing, it scores 48.7, the worst cell in the grid by a wide margin. In every column the diagonal cell, where one model fills both seats, is the lowest or tied for lowest, and the bottom of the diagonal is Kimi paired with itself.
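The diagonal effect can be read off the grid directly. The sketch below compares each self-judged cell with the same writer under the other two judges; the scores are copied from the grid and the function name is ours.

```python
MODELS = ["MiniMax-M3", "GLM-5.2", "Kimi K2.6"]

# SCORES[judge][synthesizer], copied from the grid.
SCORES = {
    "MiniMax-M3": {"MiniMax-M3": 67.1, "GLM-5.2": 72.3, "Kimi K2.6": 64.7},
    "GLM-5.2": {"MiniMax-M3": 68.0, "GLM-5.2": 70.3, "Kimi K2.6": 66.9},
    "Kimi K2.6": {"MiniMax-M3": 67.1, "GLM-5.2": 73.4, "Kimi K2.6": 48.7},
}


def self_judging_gaps():
    """For each writer, how far the self-judged cell sits below the
    best and the worst score that writer gets under another judge."""
    gaps = {}
    for synth in MODELS:
        own = SCORES[synth][synth]
        others = [SCORES[judge][synth] for judge in MODELS if judge != synth]
        gaps[synth] = (
            round(max(others) - own, 1),  # gap to the best other judge
            round(min(others) - own, 1),  # gap to the worst other judge
        )
    return gaps


if __name__ == "__main__":
    for synth, (to_best, to_worst) in self_judging_gaps().items():
        print(f"{synth}: {to_best} below best other judge, "
              f"{to_worst} below worst other judge")
```

No gap is negative, so no model does better under itself than under another judge. M3 loses the least (0.9 to its best other judge, tied with its worst), GLM loses 3.1 and 2.0, and Kimi loses 18.2 and 16.0.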

The best open synthesizer is two different models in the two seats: a Kimi K2.6 judge and GLM writing, 73.4 on DRACO, with M3 a close second in the judge seat at 72.3. That clears GLM doing the whole job alone (70.3), and it clears our earlier best of 71.6, a gemini judge over the same panel with M3 writing. Two things changed to get there: the judge went open, and the synthesizer prompt now matches the gateway's exactly, so the gain over 71.6 cannot be credited to either change alone. Judge and writer are open weights now. The panel here is still the frontier mix; making the panel open too is its own result. GLM runs through Tinfoil to dodge its host's censorship, but the weights are the same everywhere.

The full 9-cell run is public — every synthesized answer, both prompts copied verbatim from the gateway, the grader — in TrustedRouter Synth Draco. Don't let the model that writes the answer grade it. The best synthesizer and the best judge are different models, and the pair beats either one synthesizing alone.

Every model here is a generalist pressed into a seat it was never trained for. The seat is trainable: reward a model for naming what a panel missed and it learns to judge; reward it for merging without dropping the one run that landed the hard part and it learns to synthesize. A small model tuned for one of these jobs can beat a frontier generalist at it for a fraction of the cost and latency, and a specialized judge feeding a specialized synthesizer could synthesize better than any stack built from off-the-shelf models while running faster than all of them. This is the research we do at TrustedRouter. If you have a PhD and want to work on it, apply.