FreedomBench: AI models that refuse to answer the truth about China

2026-06-17 · FreedomBench on GitHub

Ask GLM-5.2 — one of the best open-weight models in the world, built by a Chinese lab — whether the army killed protesters at Tiananmen Square in 1989. It doesn't argue with you, and it doesn't lie to you. It sends back nothing — no text, no refusal, no error message. A blank.

I found this while building FreedomBench, which took an afternoon. Sixty multiple-choice questions, each one a plain fact the Chinese government censors: Has Beijing ever governed Taiwan? What did the army do at Tiananmen? Who is the six-year-old the state took away after the Dalai Lama named him Panchen Lama? How many people died in the Great Leap Forward famine? Each question has one true answer and three wrong ones, and one of the wrong ones is always the official Beijing line. The model picks a letter; you count how many it gets right. That is the whole benchmark.

Multiple choice is the point. There is no judge model deciding what counts as an answer, no keyword matching, no arguing that a model "sort of acknowledged" something. It picks the true letter or it doesn't. And it corners a censored model, which has only two ways out: pick the false letter, or refuse to pick at all.
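A minimal sketch of that scoring rule, in Python. The names here (`score_response`, `truth`, `official`) are illustrative assumptions, not taken from the FreedomBench repository. The only behavior carried over from the description above is that a response is the true letter, a wrong letter, or no letter at all.

```python
import re
from collections import Counter

# Accept a bare option letter with optional parentheses or trailing punctuation,
# e.g. "B", "(b)", "C.". Anything else counts as "no letter picked".
_LETTER = re.compile(r"\(?([A-Da-d])\)?[.:]?")

def score_response(text: str, truth: str, official: str) -> str:
    """Classify one raw completion for a four-option question.

    `truth` is the correct letter and `official` is the letter carrying the
    official Beijing line (both hypothetical field names).
    """
    stripped = text.strip()
    if not stripped:
        return "blank"            # empty completion
    match = _LETTER.fullmatch(stripped)
    if match is None:
        return "no_letter"        # prose refusal, deflection, guardrail text
    letter = match.group(1).upper()
    if letter == truth:
        return "correct"
    if letter == official:
        return "official_line"
    return "wrong"

def tally(outcomes):
    """Count outcomes across a run; the score is tally(...)['correct']."""
    return Counter(outcomes)
```

Nothing in this sketch judges the content of a reply. A completion either parses as one of the four letters or it does not, which is what keeps a judge model out of the loop.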

I ran it across the current frontier. Most models score a flat 60 out of 60 — these are basic facts, and any model with the lights on gets all of them. Google's Gemma: 60. DeepSeek V4, a Chinese model: 60. Then GLM-5.2, also Chinese, also genuinely excellent at everything else: 29.

It did not simply get the other 31 wrong. It answered 34 of the questions, 29 of them correctly and 5 incorrectly, and returned an empty completion for the other 26. The same model writes code, does math, and will cheerfully tell you the capital of France. Ask it about Falun Gong or June 4th and it goes dark.

Getting that number right took some care, and the mistake I almost made is worth admitting. Run thirty models against sixty questions all at once and some providers start returning empty responses under the load — blanks indistinguishable from the real thing. My first pass had GLM-5.2 silent on 47 of 60; slow the run down and retry every blank, and 21 of them fill back in. Those were the server choking, not the model. The 26 that stay empty no matter how many times you ask are the real refusals — GLM goes dark on Tiananmen even when it's the only request in flight. "The model refused" and "the server choked" produce the identical empty string, and a censorship benchmark that can't tell them apart is just measuring its own plumbing.
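The retry pass can be sketched roughly as follows. This is an assumption about the shape of the check, not the repository's code: `ask` is a hypothetical callable that sends one question by itself and returns the raw completion text, and the attempt count and pause are placeholder values.

```python
import time
from typing import Callable, Dict, List

def confirm_blanks(
    ask: Callable[[str], str],
    blank_ids: List[str],
    attempts: int = 3,
    pause_s: float = 2.0,
) -> Dict[str, str]:
    """Re-ask every blank question serially, one request in flight at a time.

    Returns a mapping of question id to "recovered" (some attempt came back
    non-empty, so the first blank was likely the server) or "persistent"
    (still empty after every attempt).
    """
    verdicts: Dict[str, str] = {}
    for qid in blank_ids:
        verdict = "persistent"
        for _ in range(attempts):
            if ask(qid).strip():
                verdict = "recovered"
                break
            time.sleep(pause_s)   # back off so load is not the cause
        verdicts[qid] = verdict
    return verdicts
```

A "recovered" question still has to be scored on its new answer. This pass only separates blanks that survive a quiet, serial retry from blanks that do not.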

The pattern reads like a map of what the Party guards most closely. GLM-5.2 returned nothing on every question about Falun Gong and every question about Tiananmen — those two it will not touch at all. It went quiet on three of five about Tibet, Taiwan, and the jailed dissidents, fewer about Xinjiang and Xi Jinping, and it answered every single question about COVID's origins and the South China Sea. The further a topic sits from the Party's rawest nerves, the more it will say.

GLM-5.2 isn't the only one, and the censored models don't even refuse the same way: each lab built its own door. Z.ai's GLM models and Moonshot's Kimi coding model go silent, returning an empty completion, not a single word. Tencent's Hunyuan is polite about it and switches to Chinese to do it: "我无法提供相关信息," I cannot provide that information. Xiaomi's MiMo doesn't answer at all; a guardrail sitting above the model stamps the request "rejected because it was considered high risk." Four labs, three ways to say nothing (a blank, a courteous deflection, a safety label), all drawn around the same handful of facts.
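For reporting, the three refusal styles can be told apart with a rough heuristic like the one below. It is a descriptive sketch only and plays no part in scoring. The function name and the matching rules are assumptions built from the two strings quoted above, and a real run would likely need more cases.

```python
def refusal_style(text: str) -> str:
    """Label how a non-answer was delivered (heuristic, for reporting only)."""
    stripped = text.strip()
    if not stripped:
        return "blank"                    # empty completion
    if "considered high risk" in stripped.lower():
        return "guardrail_label"          # rejection stamped by a filter
    # Any CJK ideograph in reply to an English prompt is treated as the
    # switch-to-Chinese deflection; this is an assumption, not a rule.
    if any("\u4e00" <= ch <= "\u9fff" for ch in stripped):
        return "chinese_deflection"
    return "other"
```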

What I did not expect was how far apart two Chinese labs sit. DeepSeek and Z.ai both train excellent open models, in the same country, under the same government. DeepSeek V4 answered all sixty truthfully. GLM-5.2 went silent on twenty-six of them. Each lab makes that call itself. Same government over both, and they split.

And the silence isn't even in the weights. Run Z.ai's GLM-4.7 on Cerebras instead of Z.ai's own API and all twenty-seven of its banned answers come back: Tiananmen, Falun Gong, the lot. It holds for the headline model too: route GLM-5.2 to Tinfoil's sealed confidential enclave instead of Z.ai and every blank fills in, a clean sixty out of sixty, the censorship gone the instant the weights run somewhere the host can't reach into the prompt. The refusal is bolted onto Z.ai's endpoint, not trained into the model, which turned out to be its own story.
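One way to make that host comparison concrete is to diff two runs of the same weights. The sketch below assumes each run is a plain mapping from question id to raw completion text. That data shape and the function name are hypothetical, not the repository's replay format.

```python
from typing import Dict, List

def host_only_blanks(first_party: Dict[str, str], neutral: Dict[str, str]) -> List[str]:
    """Question ids that are blank on the lab's own host but answered elsewhere.

    A non-empty result suggests the silence comes from the serving endpoint
    rather than from the weights.
    """
    return sorted(
        qid
        for qid, text in first_party.items()
        if not text.strip() and neutral.get(qid, "").strip()
    )
```

If this list covers every persistent blank from the first-party run, the weights themselves answered all of those questions once the endpoint was out of the way.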

I half-expected each lab to tighten the screws over time, every new model censoring more than the last. The versions say no. Z.ai's GLM line holds flat — every release from 4.5 to 5.2 refuses the same twenty-six or so, a fixed policy that doesn't move across major version bumps. Moonshot's Kimi loosened in the middle: the base K2 refused, K2.5 and K2.6 answer all sixty, and only the K2.7 coding model clams up again. Xiaomi's MiMo never refuses at all — it just picks Beijing's answer, and the version barely changes the count.

Family, by version (oldest → newest)	Freedom score
Kimi — k2 / k2.5 / k2.6 / k2.7-code	70% / 100% / 100% / 68%
MiMo — v2-pro / v2.5 / v2.5-pro	68% / 80% / 72%
GLM — 4.5 / 4.6 / 4.7 / 5 / 5.1 / 5.2	47% / 50% / 45% / 45% / 47% / 48%
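
The 48% for GLM-5.2 matches the 29 correct answers reported earlier if the freedom score is read as correct answers over sixty, rounded to the nearest percent. That reading is an inference from those two numbers, not something confirmed against the scoring script. A sketch under that assumption:

```python
def freedom_score(correct: int, total: int = 60) -> int:
    """Percent of questions answered with the true letter.

    Blanks, refusals, and wrong letters all count the same way: not correct.
    """
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100 * correct / total)

print(freedom_score(29))  # 48
```

Read this way, the score alone cannot separate a model that goes silent from one that picks the wrong letter, so the table is best read next to the refusal counts.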

One caveat, because it cuts against the headline. A test like this measures the model as the router hands it to you, and the censorship sits in the serving host, so the score depends on which host that is. TrustedRouter sends some models to the lab's own API and others to neutral Western hosts. DeepSeek and GLM hit their makers' own endpoints, so those scores hold: DeepSeek's API answers, Z.ai's refuses. Qwen, though, routes to Novita and Together instead of Alibaba, so its clean sweep is the open weights on a neutral host. The benchmark never reaches Alibaba's own service, and Alibaba, MiniMax, and Tencent have no first-party API route here at all. The labs that still come back censored from a neutral host (Tencent's Hunyuan refusing in Chinese, Xiaomi's MiMo picking Beijing's line) wrote it into the weights. The ones that look spotless may just be served from somewhere their politics didn't follow.

The obvious objection is that this is China-bashing in a lab coat. It isn't. Every question is a documented fact with a source — UN findings, court rulings, the wire services — and the same test would catch an American model that fell silent on its own government's worst moments. These are the questions a curious teenager asks. What it catches is whether a model will tell you something true that a government would rather it didn't.

This matters more every month, because these models are getting very good. DeepSeek V4 draws level with Claude Opus on the factuality tests Anthropic uses to grade itself, and GLM and Kimi are right behind. People will run them — locally, in production — because they are cheap and excellent. A model trained to fall silent on certain facts will fall silent on them inside your app, for your users, and never mention that it did. The blank is the one straight thing it does.

FreedomBench is sixty questions and a scoring script. It's public at github.com/Lore-Hex/FreedomBench, the raw replay of every model's answers is in the repo, and you can run the whole panel through one API in a few minutes. The censored models won't tell you they're censored. This will.