The best fuser we tested goes blank on Taiwan
Last week we published a result we were proud of: on a hundred deep-research tasks, a panel of models with GLM-5.2 writing the final synthesis scored 71.1 — the best fuser we tested, ahead of Claude Opus and GPT-5.5, and an open-weights model at that. It came with one asterisk we couldn't explain at first. On exactly one of the hundred tasks, GLM-5.2 returned nothing. No error, no refusal message, just a single token and then silence. In the scores it showed up as a zero, and a zero with no reason attached is the kind of thing that eats a tenth of a point and makes you wonder what else is wrong.
So we dug in, because a model that blanks once in a hundred is a model you can't fully trust the other ninety-nine times. It wasn't length: the input was under 20,000 tokens, nothing for a model that holds far more. It wasn't our code trimming the output: the raw response was empty, one token long. And it wasn't the question; hand GLM-5.2 that same task on its own and it writes a thorough report. The blank appeared only when we gave it the panel's evidence to synthesize.
So we narrowed the evidence. Dropping the five panel reports one at a time, we found that removing a single one, GPT-5.5's, brought the answer back. Bisecting that report put the trigger in one passage: a Greater China equity fund describing its holdings across the People's Republic of China, Hong Kong, and Taiwan. That was the whole of it. Replace "Taiwan" and "Hong Kong" with neutral placeholders and GLM-5.2 fuses the task perfectly, producing a clean 7,000-character report. Put the two names back and it goes silent, every single time.
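For readers who want to run the same hunt on their own pipeline, here is a minimal sketch of the two steps: drop each report in turn, then halve the offending report until one passage remains. It is an illustration under stated assumptions, not our harness. The `fuse` callable, the `fake_fuse` stand-in, and the way evidence is split into passages are all hypothetical, and the bisection assumes a single passage is enough to trigger the blank.

```python
from typing import Callable, List

# Hypothetical interface: evidence chunks in, synthesized report text out.
Fuser = Callable[[List[str]], str]


def is_blank(response: str) -> bool:
    """Treat an empty or whitespace-only response as a blank."""
    return not response.strip()


def leave_one_out(fuse: Fuser, reports: List[str]) -> List[int]:
    """Return indices of reports whose removal brings the answer back."""
    culprits = []
    for i in range(len(reports)):
        remaining = reports[:i] + reports[i + 1:]
        if not is_blank(fuse(remaining)):
            culprits.append(i)
    return culprits


def bisect_trigger(fuse: Fuser, context: List[str], passages: List[str]) -> List[str]:
    """Halve the suspect report's passages while the blank persists.

    `context` holds the reports already known to be harmless.
    Assumes one passage alone is sufficient to trigger the blank.
    """
    candidates = list(passages)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        if is_blank(fuse(context + first)):
            candidates = first
        elif is_blank(fuse(context + second)):
            candidates = second
        else:
            break  # the trigger spans both halves; stop narrowing
    return candidates


def fake_fuse(chunks: List[str]) -> str:
    """Stand-in fuser for the demo: goes blank if any chunk mentions Taiwan."""
    return "" if any("Taiwan" in c for c in chunks) else "synthesized report"


reports = ["report A", "report B", "fund holds Taiwan equities", "report D", "report E"]
suspects = leave_one_out(fake_fuse, reports)

harmless = [r for i, r in enumerate(reports) if i not in suspects]
passages = ["intro", "holdings across PRC, Hong Kong, and Taiwan", "fees", "outlook"]
trigger = bisect_trigger(fake_fuse, harmless, passages)
```

With a real model, each call to `fuse` costs a full generation, so the leave-one-out pass over five reports is five calls and the bisection adds at most two per halving.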
GLM-5.2 is built by Zhipu, a Chinese lab, and like other Chinese open-weight models it carries content rules from its training. Show it text that frames Taiwan and Hong Kong as distinct from mainland China, a routine line in a fund factsheet, and it stops cold. No argument, no refusal message, no banner that reads "I can't help with that." It emits one end-of-turn token, and where the report should be there is nothing.
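Because the failure arrives as a successful response with nothing in it, a pipeline has to look for it on purpose. The sketch below shows one way to do that: refuse to pass along a completion that is empty or only a token long. The `Completion` type, its field names, and the `min_tokens` threshold are hypothetical placeholders, not any provider's actual API.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical container for a model response."""
    text: str
    completion_tokens: int


class SilentBlankError(RuntimeError):
    """Raised when a model returns a 'successful' but empty response."""


def check_not_blank(completion: Completion, min_tokens: int = 2) -> Completion:
    """Raise instead of silently forwarding an empty or one-token reply.

    min_tokens is an assumed threshold: a lone end-of-turn token counts as 1.
    """
    if not completion.text.strip() or completion.completion_tokens < min_tokens:
        raise SilentBlankError(
            f"blank response ({completion.completion_tokens} token(s)); "
            "log the prompt and retry or fall back to another fuser"
        )
    return completion


# A normal report passes through unchanged.
ok = check_not_blank(Completion(text="A thorough report.", completion_tokens=5))

# A one-token, empty reply is surfaced as an error rather than a hole.
try:
    check_not_blank(Completion(text="", completion_tokens=1))
    caught = False
except SilentBlankError:
    caught = True
```

Raising here is a design choice: it turns an invisible gap into something that shows up in logs, which is where a blank like this one would otherwise never appear.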
That absence is the part worth sitting with. We caught it only because the whole benchmark is open and we audit every task: the lone zero stood out and we chased it down. In an ordinary pipeline it would have been invisible: a slightly lower score, a dropped section in one report, a user in Taipei getting an empty reply and no reason why. A model's politics don't appear in a quality benchmark. They appear as holes in your output, on whichever topics the lab that trained it decided you shouldn't read about. It is the same blank FreedomBench catches across the questions Beijing censors, here hiding inside a fusion pipeline. Open weights mean you can run the model anywhere; they don't mean the model left its politics at home.
It also sharpened what we want from a fuser. A synthesizer's entire job is to carry whatever the panel found, on any subject, faithfully. A model that drops a topic without a word is broken at exactly that job, however well it scores everywhere else. Neutrality and reliability belong in how fusers are measured, alongside answer quality; we ranked every model in that seat on exactly those terms. For now the benchmark scores the blank as the zero it is, and the full run is published: every prompt, the empty response included, reproducible end to end. We would rather show you the hole than paper over it.