New SOTA: TrustedRouter Synth beats Fable and Frontier

2026-06-17 · TrustedRouter Synth Draco on GitHub

Synth is TrustedRouter's multi-model fusion: a panel of models, a judge, and a synthesizer behind one API. This is the research behind it.

Research is only worth as much as someone else's ability to run it again. Too much of AI has drifted the other way: the strongest results arrive as a single number in a post, produced by a model you cannot open, on a harness no one else can see, graded by a rubric that ships to nobody. You are asked to take it on faith. TrustedRouter is verifiable open source software. That is how a benchmark number earns trust: verifiability, not hype.

So we held ourselves to it. We set out to test the claim behind Synth directly: that a panel of models, each writing its own answer, with a final model synthesizing them, beats any single model on a hard research benchmark. Then we set out to push past it. On DRACO, a hundred deep-research tasks each graded against roughly forty weighted criteria by gemini-3.1-pro, a diverse panel synthesized by Claude Opus 4.8 scores 70.6. Swap the synthesizer engine, with a Kimi K2.6 judge feeding a GLM-5.2 synthesizer in place of Opus, and the same panel reaches 73.4, the new state of the art. The top configurations sit within about a standard error of each other (the whiskers above), but the whole TrustedRouter band clears the closed baselines. Every prompt, every tool call, and every graded answer behind the number is published.
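To make "weighted criteria" and "a standard error" concrete, here is a minimal sketch of how a rubric score and its uncertainty could be computed. The aggregation rule (weight earned over weight available, averaged across tasks) is an assumption for illustration; the post does not spell out DRACO's exact formula, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

def task_score(criteria):
    """Score one task from (weight, met) pairs on a 0-100 scale.

    Assumed rule: weight earned divided by total weight available.
    """
    total = sum(weight for weight, _ in criteria)
    earned = sum(weight for weight, met in criteria if met)
    return 100.0 * earned / total

def benchmark_score(task_scores):
    """Return (mean score, standard error of the mean) across tasks."""
    se = stdev(task_scores) / math.sqrt(len(task_scores))
    return mean(task_scores), se

# Toy example with made-up values, not DRACO data.
one_task = [(3, True), (1, False)]
print(task_score(one_task))
print(benchmark_score([60.0, 80.0]))
```

Under this rule, two configurations whose means differ by less than about one standard error cannot be ranked confidently against each other, which is the caveat attached to the top configurations above.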

The result comes from the panel, and the panel is itself an argument for open weights. The strongest older closed baselines paired two closed frontier models. Ours adds frontier open-weights models — DeepSeek V4 Pro and Kimi K2.6 — alongside GPT-5.5, Opus, and Gemini 3 Flash. Synth works on disagreement: models that fail in different places, reconciled by a strong synthesizer. Open-weights models are trained on different data and disagree in different ways than a closed pair does, and the wider panel is what reaches the top.
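The panel, judge, and synthesizer flow can be sketched in a few lines. This is an illustration under stated assumptions, not the Synth implementation: a "model" is any callable from prompt text to answer text, and the prompt wording and the `synth` function name are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a real API client: prompt text in, answer text out.
Model = Callable[[str], str]

def synth(task: str, panel: Dict[str, Model], judge: Model, synthesizer: Model) -> str:
    # 1. Every panel model drafts its own answer independently.
    drafts = {name: model(task) for name, model in panel.items()}
    joined = "\n\n".join(f"[{name}]\n{text}" for name, text in drafts.items())

    # 2. The judge reads all drafts and reports where they agree or conflict.
    analysis = judge(f"Task:\n{task}\n\nDrafts:\n{joined}\n\nCompare the drafts.")

    # 3. The synthesizer writes the final answer from the drafts plus the analysis.
    return synthesizer(
        f"Task:\n{task}\n\nDrafts:\n{joined}\n\n"
        f"Judge analysis:\n{analysis}\n\nWrite the final answer."
    )

# Stub models so the sketch runs without any network access.
panel = {"model_a": lambda p: "draft from A", "model_b": lambda p: "draft from B"}
final = synth("What changed?", panel, judge=lambda p: "A and B disagree", synthesizer=lambda p: p)
print(final)
```

Because the synthesizer is just one argument, swapping it while holding the panel fixed is a one-line change, which is the kind of comparison the post reports.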

The synthesizer carries most of that result. Hold the five-model panel fixed and change only the model that writes the final answer: Opus 4.8 scores 70.6, GPT-5.5 scores 62.2. Same reports, same judge analysis, same hundred tasks, eight points of swing from one decision. A larger panel behind a weaker synthesizer buys nothing, and which model fills that slot is its own ranking.

No single model comes near that on its own. Run each one through the same agentic loop with the same live tools, and the strongest of them, GPT-5.5 at 63.0, lands 7.6 points below the Opus-synthesized panel.

Solo model	DRACO score (TrustedRouter run)	DRACO score (published baseline)
GPT-5.5	63.0	60.0
Claude Opus 4.8	60.7	58.8
DeepSeek V4 Pro	59.9	60.3
Kimi K2.6	50.1	53.7
Gemini 3.1 Pro	47.4	45.4
Gemini 3 Flash	41.1	43.1

The strongest solo reaches 63; the best panel reaches 73.4. Assembling a frontier answer out of models that are each behind the frontier is the entire point.
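The gaps quoted in this post follow directly from the reported numbers. The short script below recomputes them from the solo table and the panel scores stated earlier; the variable names are ours, and the values are copied from the text rather than measured here.

```python
# Scores copied from the table above: (TrustedRouter run, published baseline).
SOLO = {
    "GPT-5.5": (63.0, 60.0),
    "Claude Opus 4.8": (60.7, 58.8),
    "DeepSeek V4 Pro": (59.9, 60.3),
    "Kimi K2.6": (50.1, 53.7),
    "Gemini 3.1 Pro": (47.4, 45.4),
    "Gemini 3 Flash": (41.1, 43.1),
}
PANEL_OPUS = 70.6  # five-model panel, Opus 4.8 synthesizer
PANEL_GPT = 62.2   # same panel, GPT-5.5 synthesizer
PANEL_BEST = 73.4  # same panel, Kimi K2.6 judge + GLM-5.2 synthesizer

best_solo = max(SOLO, key=lambda name: SOLO[name][0])
best_solo_score = SOLO[best_solo][0]

gap_to_opus_panel = round(PANEL_OPUS - best_solo_score, 1)
gap_to_best_panel = round(PANEL_BEST - best_solo_score, 1)
synthesizer_swing = round(PANEL_OPUS - PANEL_GPT, 1)

# Difference between our solo run and the published baseline, per model.
vs_published = {name: round(ours - pub, 1) for name, (ours, pub) in SOLO.items()}

print(best_solo, gap_to_opus_panel, gap_to_best_panel, synthesizer_swing)
print(vs_published)
```

The strongest solo model trails the Opus-synthesized panel by 7.6 points and the best configuration by 10.4, while changing only the synthesizer moves the score by 8.4.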

DRACO is an agentic benchmark. The answers are not in any model's weights, so each model in the panel has to search the web, read the sources, and run the numbers itself; we give every one of them live tools and let it drive its own research. Those runs issued thousands of searches and fetches, and all of them sit in the published replays — none touching the benchmark's own hosts, so nothing was looked up that was meant to be worked out. The leakage guard lives in the open-source harness, and the audit is yours to re-run.
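A leakage audit of this kind reduces to checking every fetched URL against a list of hosts that must never be contacted. The sketch below shows one way to do that; the host list, the function names, and the idea that replays expose a flat list of URLs are all assumptions for illustration, not the harness's actual interface.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; the real harness defines its own set of benchmark hosts.
BLOCKED_HOSTS = {"benchmark.example"}

def is_blocked(url, blocked=BLOCKED_HOSTS):
    """True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked)

def audit(urls, blocked=BLOCKED_HOSTS):
    """Return every URL in a replay that touched a blocked host."""
    return [u for u in urls if is_blocked(u, blocked)]

# Made-up replay entries.
replay_urls = [
    "https://en.wikipedia.org/wiki/Benchmark",
    "https://data.benchmark.example/answers.json",
    "https://notbenchmark.example/",
]
print(audit(replay_urls))
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as `notbenchmark.example`. An empty result from `audit` over all replays is what "none touching the benchmark's own hosts" means in practice.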

We ran all of it on TrustedRouter for the same reason we published the code. A benchmark sends your prompts and the documents you fetch through someone else's servers, and with most gateways you take their privacy on faith. TrustedRouter runs inside a Trusted Execution Environment (TEE), end-to-end encrypted: a sealed enclave the operator cannot read into, handling every request as an attested workload whose exact code is measured and published. You can pull the image digest, match it against the open source, and confirm the binary that saw your prompt is the one in the repository, with nowhere inside it to record anything. You check the privacy the way you check the score — by hand, against a hash.
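The last step of that check, comparing a hash you computed against one that was published, looks roughly like this. It is a sketch of the comparison only: real attestation also involves verifying the hardware-signed measurement, which is out of scope here, and the function names are hypothetical.

```python
import hashlib
import hmac

def sha256_digest(data: bytes) -> str:
    """Return a digest in the common 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def matches_published(data: bytes, published: str) -> bool:
    """Compare a locally computed digest against a published one."""
    local = sha256_digest(data)
    return hmac.compare_digest(local, published.strip().lower())

# The SHA-256 of empty input is a well-known constant, used here as a stand-in
# for a published image digest.
EMPTY = "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(matches_published(b"", EMPTY))
print(matches_published(b"tampered", EMPTY))
```

In practice the bytes would be the artifact you built from the open source, and the published value would be the digest the enclave attests to.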

We do not want you to trust our 73.4. Clone the repository — the harness, the tasks, the judge, the panel, and the raw run traces are all in it — point it at TrustedRouter, and produce the number yourself. Open code, open results, a score you can reproduce and a privacy guarantee you can verify. That is what an AI lab doing open science looks like, and it is the only kind of result worth believing.