
The best synthesizer of a research panel is an open-weights model

2026-06-17 · TrustedRouter-Fusion-Draco on GitHub

Fusing a panel of research reports into one answer is a skill of its own, and the best model at it is open-weights. We ran the test directly. On DRACO, our suite of 100 agentic deep-research tasks judged by gemini-3.1-pro, we held one thing fixed and varied one thing. The fixed part is a five-model panel (gpt-5.5, opus-4.8, gemini-3-flash, kimi-k2.6, and deepseek-v4-pro) plus a single fixed judge analysis. Every panelist sees the same task and writes the same report each time. The one thing we swap is the final model that reads those five reports and writes the answer, the slot we call the fuser. Any movement in the score therefore comes from the fuser and nothing else.
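
The design described above can be sketched as a loop in which only the fuser changes. This is a minimal illustration, not the code in the TrustedRouter-Fusion-Draco harness: `FixedInputs`, `run_fuser_ablation`, and the `fuse` and `judge` callables are hypothetical names, and the sketch assumes the fixed judge analysis is handed to the fuser alongside the five reports, which the post does not spell out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Mapping, Sequence


@dataclass(frozen=True)
class FixedInputs:
    """Everything held constant for one task (hypothetical structure)."""
    reports: Mapping[str, str]   # panelist name -> cached report text
    judge_analysis: str          # assumed to be shared with every fuser


# Hypothetical callables standing in for real model and judge calls.
FuseFn = Callable[[str, str, FixedInputs], str]   # (fuser, task, inputs) -> answer
JudgeFn = Callable[[str, str], float]             # (task, answer) -> score


def run_fuser_ablation(
    tasks: Sequence[str],
    fixed: Mapping[str, FixedInputs],
    fusers: Sequence[str],
    fuse: FuseFn,
    judge: JudgeFn,
) -> Dict[str, float]:
    """Return the mean judged score per fuser over identical inputs."""
    results: Dict[str, float] = {}
    for fuser in fusers:
        # Same tasks, same reports, same judge: only `fuser` varies.
        scores = [judge(task, fuse(fuser, task, fixed[task])) for task in tasks]
        results[fuser] = mean(scores)
    return results


# Toy stand-ins so the sketch runs without any model access.
def toy_fuse(fuser: str, task: str, inputs: FixedInputs) -> str:
    return f"{fuser}:{task}:{len(inputs.reports)}"


def toy_judge(task: str, answer: str) -> float:
    return float(len(answer))


toy_fixed = {"t1": FixedInputs({"a": "report a", "b": "report b"}, "analysis")}
toy_scores = run_fuser_ablation(
    ["t1"], toy_fixed, ["fuser-x", "fuser-yy"], toy_fuse, toy_judge
)
```

Because `fixed` is built once and never regenerated inside the loop, any difference between two entries in the returned dictionary can only come from the fuser call.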

The winner is MiniMax-M3 at 71.6, with GLM-5.2 right behind at 71.1. Both are open-weights, and both finish ahead of Claude Opus 4.8 at 70.6. That ordering should give you pause. The frontier closed model loses the synthesis slot to two models you can download, on a task where the panel feeding all three is identical.

| Fuser (synthesizer) | DRACO score (full 100) |
|---|---|
| minimax-m3 | 71.6 |
| glm-5.2 | 71.1 |
| opus-4.8 | 70.6 |
| kimi-k2.6 | 67.0 |
| deepseek-v4-pro | 65.7 |
| gpt-5.5 | 62.2 |
| gemma-4-31b | 54.0 |

The sharpest result is GPT-5.5. Run on its own as a researcher, it is the strongest single model on this benchmark, scoring 63.0 solo. Hand it five reports to reconcile and it drops to 62.2, the bottom of the capable fusers, below DeepSeek V4 Pro at 65.7 and Kimi K2.6 at 67.0. The model that is best at doing the research alone lands among the worst at reconciling the research of others. Solving a task and fusing five reports are two different abilities, and being excellent at the first tells you almost nothing about the second. A great soloist defaults to its own view. A great fuser weighs five views it didn't write and resolves where they disagree.
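
The comparisons in this section are simple arithmetic on the table, and a short script makes them easy to re-check. The scores and the 63.0 solo figure are copied from this post; the variable names are only illustrative.

```python
# Fuser scores copied from the table above (DRACO, full 100 tasks).
fuser_scores = {
    "minimax-m3": 71.6,
    "glm-5.2": 71.1,
    "opus-4.8": 70.6,
    "kimi-k2.6": 67.0,
    "deepseek-v4-pro": 65.7,
    "gpt-5.5": 62.2,
    "gemma-4-31b": 54.0,
}
GPT55_SOLO = 63.0  # solo score quoted in the text

# Rank fusers from best to worst.
ranked = sorted(fuser_scores, key=fuser_scores.get, reverse=True)
leader = ranked[0]

# Points behind the leader, rounded to the table's one decimal place.
gap_to_leader = {
    model: round(fuser_scores[leader] - score, 1)
    for model, score in fuser_scores.items()
}

# Where gpt-5.5 lands as a fuser, and how far it moves from its solo score.
gpt55_rank = ranked.index("gpt-5.5") + 1
gpt55_fusion_delta = round(fuser_scores["gpt-5.5"] - GPT55_SOLO, 1)
```

Here `gpt55_rank` is 6 of 7, `gpt55_fusion_delta` is -0.8, and `gap_to_leader["gemma-4-31b"]` is 17.6.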

Size matters here in a way it does not for a panelist. Gemma-4-31b collapses to 54.0, nearly eighteen points under the leaders. A 31-billion-parameter model holds its own as one voice on the panel, then runs out of room when asked to hold five frontier reports in context and reconcile them at once. The fuser has to keep all the evidence live, track which source said what, and adjudicate conflicts, and a small model lacks the room to do it. Panelists can stay small because each owns one slice. The synthesizer owns the whole thing, so it has to be big.
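
To picture why the fuser carries more than any one panelist, consider what its input might look like: every report in a single prompt, each tagged with its source so claims can be attributed. The sketch below is an assumption about layout, not the prompt used in our harness; `build_fusion_prompt` and the instruction wording are hypothetical.

```python
from typing import Mapping


def build_fusion_prompt(task: str, reports: Mapping[str, str]) -> str:
    """Concatenate every report under a source label (hypothetical layout)."""
    sections = [f"TASK:\n{task}"]
    for i, (source, text) in enumerate(reports.items(), start=1):
        # Label each report so the fuser can track which source said what.
        sections.append(f"REPORT {i} (source: {source}):\n{text.strip()}")
    sections.append(
        "Write one answer. Where reports disagree, say which sources conflict "
        "and which claim you accept."
    )
    return "\n\n".join(sections)


demo_prompt = build_fusion_prompt(
    "Example task",
    {"panelist-a": "Claim X is true.", "panelist-b": "Claim X is false."},
)
```

The prompt grows with every report added, while each panelist only ever sees the task itself. That asymmetry is the context pressure described above.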

The obvious objection: if GLM-5.2 lands within half a point of the top, why not just use it? Because it goes blank on Taiwan and Hong Kong. As we documented in "the best fuser goes blank on Taiwan", GLM-5.2 refuses politically sensitive China content, and a synthesizer that drops whole topics is unsafe as a default no matter how the average score reads. MiniMax-M3 tops the table with no such hole, which makes it the model we'd actually put in the fuser slot.

This sits on top of the result in our fusion evals post, where assembling a panel and fusing it beats any single frontier model. It also extends the finding from "the best open models aren't on your leaderboard": solo-model rankings fail to predict who fuses well. The fuser is a capability you have to measure on its own, because the usual proxies (raw smarts, parameter count, and solo benchmark rank) all mislead. You can see every panelist and judge model on our models page, and the harness that produced these numbers is open at TrustedRouter-Fusion-Draco.

If synthesis is a separate skill, it is a separate training target. A model could be built specifically to read N reports and produce one reconciled answer, optimized for that job alone, free of any requirement that it also be a pleasant chatbot or a strong solo researcher. The fuser slot is the most consequential position in an agentic research stack, since it decides what the user actually reads, and right now we fill it with general-purpose models that happen to be decent at it. The numbers suggest a purpose-built synthesizer could beat all of them.

