Fusion eval results
2026-06-14
Source context: OpenRouter Fusion announcement.
Reproducing Fusion in the open. TrustedRouter is running the same class of routing experiment with public code, explicit model lists, and measurable cost/quality tradeoffs instead of a hidden benchmark harness.
Comparable full-run results are not published yet. The prior holistic-judge run is excluded from this post because it does not match OpenRouter's DRACO scoring method.
Reference Results
| Run | OpenRouter score | TrustedRouter score | Status |
|---|---|---|---|
| Solo Gemini 3 Flash | 43.1 | 29.35 on 10-task smoke | Investigating |
| Solo Kimi K2.6 | 53.7 | Not enough completed rows | Investigating |
| Solo DeepSeek V4 Pro | 60.3 | Not run with exact scorer yet | Pending |
| Fusion budget panel | 64.7 | Not run with exact scorer yet | Pending |
Replication Rules
- Mode: micro-hybrid, which runs the small public smoke test before any expensive full pass.
- Judge model: google/gemini-3.1-pro-preview.
- Scoring: DRACO criterion-level grading, three independent passes, normalized 0-100.
- Search: Exa with DRACO/rubric hostnames excluded and result leakage checks enabled.
- Publication rule: raw solo baselines must be close to OpenRouter's reference scores before any Fusion headline is published.
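
To make the scoring rule concrete, here is a minimal sketch of how criterion-level verdicts from three independent judge passes could be folded into one 0-100 score. It is an illustration under stated assumptions, not the harness's actual scorer: the function names, the positive-weight rubric, and the "mean of per-pass weighted fractions" aggregation are all hypothetical, and DRACO's exact formula may differ.

```python
from statistics import mean


def pass_score(verdicts, weights):
    """Score one judge pass on a 0-100 scale.

    verdicts: dict of criterion id -> bool (judge says the criterion is met)
    weights:  dict of criterion id -> positive weight (assumed rubric shape)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    # A criterion missing from the verdicts counts as not met.
    earned = sum(w for cid, w in weights.items() if verdicts.get(cid, False))
    return 100.0 * earned / total


def task_score(passes, weights):
    """Average the per-pass scores (assumed aggregation across passes)."""
    if not passes:
        raise ValueError("need at least one judge pass")
    return mean(pass_score(v, weights) for v in passes)


# Hypothetical rubric and three judge passes for one task.
weights = {"cites_source": 2, "correct_figure": 1, "states_caveat": 1}
passes = [
    {"cites_source": True, "correct_figure": True, "states_caveat": False},
    {"cites_source": True, "correct_figure": False, "states_caveat": False},
    {"cites_source": True, "correct_figure": True, "states_caveat": True},
]
score = task_score(passes, weights)
```

Averaging per-pass scores rather than majority-voting each criterion keeps judge disagreement visible in the final number. That is a design choice of this sketch, not something the post specifies.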
The exact scorer and leakage guard are implemented in the open-source harness. Full comparable results will replace this table when the raw baselines replicate.
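
As a rough illustration of what a leakage guard of this kind might look like, the sketch below post-filters search results that have already been fetched. It does not call Exa or reproduce the harness: the function names, the result shape (`url` and `text` keys), the example hostnames, and the substring-based leak check are all assumptions made for the example.

```python
from urllib.parse import urlparse


def is_excluded(url, blocked_hosts):
    """True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def _normalize(text):
    # Lowercase and collapse whitespace so formatting differences do not hide a match.
    return " ".join(text.lower().split())


def leaks_rubric(text, rubric_snippets, min_len=20):
    """True if any sufficiently long rubric snippet appears verbatim in the text."""
    haystack = _normalize(text)
    return any(
        len(s) >= min_len and _normalize(s) in haystack for s in rubric_snippets
    )


def filter_results(results, blocked_hosts, rubric_snippets):
    """Keep only results that pass both the hostname and the leakage check."""
    return [
        r
        for r in results
        if not is_excluded(r["url"], blocked_hosts)
        and not leaks_rubric(r["text"], rubric_snippets)
    ]


# Hypothetical inputs: one blocked host and one rubric line.
blocked = {"rubric.example"}
snippets = ["The answer must cite the 2019 revision of the standard."]
results = [
    {"url": "https://data.rubric.example/task/1", "text": "unrelated page"},
    {"url": "https://mirror.example.org/a",
     "text": "Copied: the answer must cite the 2019   revision of the standard."},
    {"url": "https://news.example.org/b", "text": "Independent coverage."},
]
kept = filter_results(results, blocked, snippets)
```

The `min_len` floor is there so that short, generic rubric phrases do not discard legitimate pages. A real guard would likely need fuzzier matching than exact substrings.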