Chasing Mythos-level Fusion in the open

2026-06-14

Source context: Open Fusion methodology.

We tried to push TrustedRouter Fusion toward Mythos- and Fable-class DRACO performance. The target panel was GPT-5.5, Claude Opus 4.8, Kimi K2.7 Code, GLM 5.2, MiniMax M3, Gemini 3 Flash, and Gemini 3.1 Pro, with Opus 4.8 synthesizing the final answer and Gemini 3.1 Pro judging against DRACO criteria.

That exact run is not publishable yet. Two blockers showed up immediately: GPT-5.5 needs special long-reasoning handling on DRACO prompts, and our Z.AI account is not yet entitled to GLM 5.2. Z.AI returns a permission error for glm-5.2, and silently substituting another model would be dishonest.
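One way to enforce the no-silent-substitution rule is to fail loudly when any requested model is blocked. The following is a minimal sketch, not the actual harness code: `resolve_panel`, `PanelBlocked`, and every model id other than glm-5.2 are hypothetical names chosen for illustration.

```python
class PanelBlocked(Exception):
    """Raised when at least one requested model cannot be used as requested."""

    def __init__(self, blocked):
        self.blocked = dict(blocked)
        reasons = ", ".join(f"{m}: {why}" for m, why in self.blocked.items())
        super().__init__(f"panel blocked ({reasons})")


def resolve_panel(requested, unavailable):
    """Return the requested panel unchanged, or raise.

    `unavailable` maps a model id to the reason it cannot run.
    The function never swaps in a different model.
    """
    blocked = {m: unavailable[m] for m in requested if m in unavailable}
    if blocked:
        raise PanelBlocked(blocked)
    return list(requested)


if __name__ == "__main__":
    # Hypothetical ids; only "glm-5.2" appears in the post itself.
    target = ["gpt-5.5", "glm-5.2", "gemini-3.1-pro"]
    try:
        resolve_panel(target, {"glm-5.2": "permission error"})
    except PanelBlocked as err:
        print(err)
```

With this shape, running a fallback panel is an explicit second call with a different model list, so the substitution shows up in the run configuration instead of happening behind the scenes.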

What actually ran

Run	Task slice	Result	Status
Exact 7-model target	Non-financial DRACO pilot	No score	Blocked by GPT-5.5 gateway handling and GLM 5.2 entitlement
Available 6-model fallback	First completed non-financial DRACO task	19.85	Completed, far below target

The fallback panel used Opus 4.8, Kimi K2.7 Code, GLM 5.1, MiniMax M3, Gemini 3 Flash, and Gemini 3.1 Pro. It completed one task before we stopped the pilot because of speed and reliability problems. A score of 19.85 is not close to the target, and we are not presenting it as a win.

What changed in the harness

GPT-5.5 eval calls now omit temperature and use max_completion_tokens.
Panel and final synthesis calls stream so long answers do not wait for full completion before parsing.
Analysis and judge calls stay non-streaming because they require structured JSON reliability.
The live runner now has explicit six-model and seven-model frontier Fusion configs behind a hard budget.
The recommended DRACO slice for this experiment is --task-filter non-financial.
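The first three changes amount to a per-call parameter policy. A rough sketch of that policy follows. The role names, the `build_call_params` helper, and the assumption that other models take `max_tokens` and `temperature` are all illustrative and not taken from the harness.

```python
# Assumption: these role names and sets are illustrative, not the harness's own.
STREAMING_ROLES = {"panel", "synthesis"}     # long answers, parsed as they arrive
NON_STREAMING_ROLES = {"analysis", "judge"}  # structured JSON, parsed once complete
NO_TEMPERATURE_MODELS = {"gpt-5.5"}          # omit temperature, use max_completion_tokens


def build_call_params(model, role, max_tokens, temperature=0.0):
    """Build request parameters for one model call according to the policy above."""
    if role not in STREAMING_ROLES | NON_STREAMING_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    params = {"model": model, "stream": role in STREAMING_ROLES}
    if model in NO_TEMPERATURE_MODELS:
        # No temperature key at all, and the token limit goes under a different name.
        params["max_completion_tokens"] = max_tokens
    else:
        # Assumption: other models accept the conventional pair.
        params["max_tokens"] = max_tokens
        params["temperature"] = temperature
    return params


if __name__ == "__main__":
    print(build_call_params("gpt-5.5", "panel", 4096))
    print(build_call_params("gemini-3.1-pro", "judge", 2048))
```

Keeping the policy in one function means a judge call cannot accidentally stream, and a GPT-5.5 call cannot accidentally carry a temperature.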

Next gates

The next clean run needs three fixes before any headline claim: enable GLM 5.2 on the Z.AI account, make GPT-5.5 long-reasoning responses produce useful content through the attested gateway, and finish a 10-task non-financial DRACO pilot without task-level hangs.
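The third gate, finishing a pilot without task-level hangs, can be approached with a per-task timeout combined with the hard budget. This is a sketch under stated assumptions: `run_pilot` and the `run_task` coroutine (returning a score and a cost) are hypothetical, and the real runner may handle both differently.

```python
import asyncio


async def run_pilot(task_ids, run_task, per_task_timeout_s, budget_usd):
    """Run tasks one by one; a hung task times out instead of stalling the pilot.

    `run_task` is a hypothetical coroutine returning (score, cost_usd).
    Limitation of this sketch: a timed-out task's cost is not counted.
    """
    spent = 0.0
    results = []
    for task_id in task_ids:
        if spent >= budget_usd:
            # Hard budget: stop starting new work once the limit is reached.
            results.append((task_id, "skipped_budget", None))
            continue
        try:
            score, cost = await asyncio.wait_for(
                run_task(task_id), timeout=per_task_timeout_s
            )
        except asyncio.TimeoutError:
            results.append((task_id, "timeout", None))
            continue
        spent += cost
        results.append((task_id, "completed", score))
    return results, spent
```

A timeout is recorded as its own status, not as a score of zero, so a hang stays visible in the artifacts and does not quietly pull the average down.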

This is the point of doing the work in the open. If TrustedRouter clears a Mythos/Fable-class target, the result should be reproducible from code, model ids, task filters, budget limits, and artifacts. Until then, the honest result is: not there yet.
