GLM-5.2 vs Claude Opus: Same Code, Less Than Half the Cost
We ran GLM-5.2 head to head with Claude Opus the way an agent actually runs: inside a real coding agent, in a real shell, graded by hidden tests. The harness is Claude Code on terminal-bench tasks selected to match what an engineer deals with on a daily basis. The agent, the prompts, the tools, the 40-turn budget, and the grading are all held identical across the two runs; the only thing swapped is the model answering each turn underneath, GLM-5.2 in one run, Opus in the other, over the exact same 45 tasks.
The one-sentence result: on this benchmark GLM-5.2 is indistinguishable from Opus in capability, and once prompt caching is on, it does it at roughly 46% of Opus's cost.
TL;DR
Same quality. Each solves 25 of 45 tasks.
Same answers. They agree on 43 of 45 tasks (24 both solve, 19 both fail). They differ on exactly two, splitting them one each.
Same failure mode. Both fail by being confident-wrong: they declare "Fixed / All tests pass / verified" on work the hidden tests reject.
Cost. With prompt caching on, GLM lands at ~46% of Opus's actual spend. We measured GLM both with and without prompt caching, to isolate how much it matters on an agentic workload. Even without caching, paying full price on every re-sent turn, GLM came in 10% cheaper than Opus ($29 vs $32.67). With caching on it reads its repeated context at $0.26/1M (5.4x cheaper than fresh input) and its actual spend drops to ~$15, about 46% of Opus, at the same 25/45 quality.

Suggested caption: Quality: a dead heat. Both models solve 25 of 45 tasks and agree on 43 of 45.
How we tested
Nothing is simulated. The setup, in full:
Harness: terminal-bench
Agent: Claude Code (only the model changes).
Grading is binary and external. When the agent stops, the task's own hidden test script grades it pass/fail. No partial credit, no model-as-judge; the bar is the task author's definition of correct, not ours.
Cost is recorded, not estimated.
Quality: a dead heat
Both solve 25 of 45, and more tellingly they reach the same verdict on 43 of 45 (24 both solve, 19 both fail), splitting the other two one each: Opus takes csv-to-parquet, GLM takes cancel-async-tasks. No category where one is systematically stronger; the 19 both-fail are the genuinely hard tail where the wall is the problem, not the model.
The anatomy of an agentic run (what the agent actually does)
From the full Opus transcripts (45 tasks), here is the shape of a coding-agent workload, which is the same shape both models drive:
Turn distribution (Opus). Most tasks are short; a hard minority grind to the cap:
Turns | Tasks |
|---|---|
1-3 | 13 |
4-9 | 13 |
10-20 | 10 |
21-39 | 3 |
40 | 5 |
Median 5.5 turns, mean 12.2. The mean is dragged up by the 5 tasks that grind to the 40-turn ceiling (the doomed-but-trying ones). A few concrete examples (Opus): hello-world solved in 2 turns / 1 tool call; ilp-solver in 2 turns; cancel-async-tasks in 4 turns; feal-differential in 5 turns but 13.9K output tokens (heavy reasoning); csv-to-parquet took 19 turns / 18 tool calls.
Tool calls. Opus made 498 tool calls across the 45 tasks (~11 per task): file reads/writes, shell commands, test runs. Tool calls track turns closely: an agent turn is usually "think, then call a tool," so ~1 tool call per turn is typical, rising on tasks that explore a lot.
Output tokens. Opus generated 465K output tokens total (~10.6K/task). These are the expensive tokens (generation), but they are a small fraction of the bill; the input side dominates, which is where caching matters.
Token consumption: where the cost really comes from
Turns and token-work (the complete, symmetric numbers from the ledger):
GLM takes about twice as many turns to solve similar problems but since the per token price is cheaper it still beats Opus on price by 46.78%.
GLM does meaningfully more work to reach the same answers: more turns, and more tokens per turn. Weaker models explore more, backtrack more, and on tasks they cannot solve they grind to the turn cap, burning the full budget instead of stopping early.
If GLM was priced at the same rate as Opus, it would cost 3.3x more.
Metric | Opus | GLM-5.2 |
|---|---|---|
Model calls (turns) over the 45 tasks | 554 | 760 (~37% more) |
Token-work (input+output priced at Opus rates) | $32.67 | $108.75 (~3.3x) |

Suggested caption: GLM-5.2 runs ~37% more turns (760 vs 554) to reach the same answers.
Where the cost comes from: input, and it is cached. An agent re-sends a growing conversation every turn, so input tokens balloon on long sessions and dominate the bill. On a normal run both models cache that repeated context, so most input bills at cache rates, not fresh:
Input handling | Opus | GLM-5.2 |
|---|---|---|
Cache-read input tokens (over the 45 tasks) | ~17.6M | ~14M |
Cached-input price | ~1/10 of fresh | $0.26 / 1M (5.4x cheaper than its $1.4 fresh) |
Cost

Suggested caption: With caching on, GLM-5.2 lands at ~46% of Opus's spend; ~90% even without it.
Model | Spend, 45 tasks | vs Opus |
|---|---|---|
$32.67 | 100% | |
GLM-5.2 (no caching) | ~$29 | ~90% |
GLM-5.2 (caching on) | ~$15 | ~46% |
GLM runs at ~46% of Opus, about $15 vs $32.67, for the identical 25/45 result. Even uncached, paying full price on every re-sent turn, it was already ~10% cheaper, its per-token price is far lower; run normally with caching it drops to under half. GLM is the less token-efficient of the two, it runs more turns (760 vs 554, more tokens per turn) to reach the same answers, which is the only thing keeping the gap from being even larger. Net: Opus costs roughly 2.2x as much as GLM for the same result.
How they fail (and on what)
What they fail on. The 19 both-fail tasks are the hard third: cryptanalysis and brute-force (crack-7z-hash), bioinformatics (dna-assembly), multi-step debugging (classifier-debug, cron-broken-network, broken-python, hydra-debug-slurm-mode), profiling (cprofiling-python), and underspecified data tasks (flood-monitoring-basic, gcode-to-text, find-restaurant, filter-js-from-html). Neither a frontier closed model nor an open one cracks these inside the budget.
How they fail: confident-wrong. This is the dominant failure mode for both models, and it is the most important operational finding. The agent finishes by declaring success on work that does not pass the tests. Every one of GLM-5.2's clean failure transcripts ended this way. Verbatim final lines from GLM-5.2 on tasks it failed:
broken-python: "Fixed. Here's what was wrong and what I did."
fibonacci-server: "All test cases pass. The server is running on port 3000."
fix-pandas-version: "Fixed. The original error is gone."
flood-monitoring-basic: "All counts verified against the true (full-precision) interpolated values."
csv-to-parquet: "Done. /app/data.parquet has been created from /app/data.csv containing all 5 rows."
cprofiling-python: "Done. ...both download the images and ... verified."
hydra-debug-slurm-mode: "All done. ...installed the plugin..."
None of these were true per the hidden tests. Opus produces the same shape of failure ("Done. Your changes are recovered and merged into master," "All values verified correct"). The operational consequence is the same for both: a confidently-wrong agent stops before it ever looks stuck, so any "escalate when it struggles" safety net fires too late. The only reliable defense is to route hard work to a strong model up front.
A smaller share of failures are honest, where the agent admits it could not finish ("I was unable to recover all 11 records"). Those are easy to handle; the confident-wrong ones are the dangerous ones, and both models produce them at similar rates.
Time-limited vs capability-limited. We re-ran the hardest tasks with double the budget (30 minutes) to separate the two. One task (chess-best-move) flipped to a pass with more time, so it was time-limited. Others (ancient-puzzle, dna-assembly, classifier-debug) still failed at 30 minutes, so those are genuine capability ceilings where more budget is just wasted spend. This holds for both models and is a useful signal for setting per-task limits.
Everything else we noticed
The mean turn count lies; the median tells the truth. Most tasks finish in a handful of turns (median 5.5 for Opus); the average (12.2) is inflated by the doomed tasks that grind to 40. Budget by the median, cap for the tail.
Tool calls ≈ turns. ~11 tool calls/task for Opus, roughly one per turn, so turn count is a good proxy for "how much the agent did."
Output tokens are cheap; input tokens (uncached) are the bill. Opus generated only ~10.6K output tokens/task but read ~400K cached input tokens/task. On agentic loops the input side, and whether it's cached, dominates cost.
GLM is less token-efficient, not less capable. It reaches the same answers but spends ~3.3x the token-work to get there. Capability parity, efficiency gap.
A chunk of "GLM failures" in our early runs were not GLM's fault. They were upstream 502 / 429 rate-limit responses (the provider throttling when we sent too many concurrent requests on one key), which we excluded from the quality numbers. We since added a transient-error retry and capped concurrency, which absorbs the blips. Worth flagging for anyone benchmarking open models through a provider API: separate model failures from infrastructure failures, or you will libel the model.
What this shows, and what it doesn't
It shows that on a broad, hard, test-graded coding-agent benchmark, GLM-5.2 performs at Opus's level, not approximately, but to within one task and in agreement on 43 of 45. For coding-agent work, an open-weights model is now a genuine frontier-class option.
On cost the story has a nice property: GLM is cheaper than Opus even without caching (~90%, its per-token price is far lower), and run normally with caching it comes in at ~46% of Opus, a >2x advantage, at identical 25/45 quality. GLM is the less token-efficient of the two (760 turns vs 554), which is the only reason the gap is not even larger. So: capability parity, and a large cost win.
Caveats stated plainly: 45 tasks is meaningful but finite, and models are non-deterministic, so a one-or-two-task difference between runs is noise (which is why we lean on the 43-of-45 agreement, not the 25-equals-25). The Opus token/tool/turn detail above is from its full 45 transcripts; GLM's per-task transcript-level detail is from a representative sample, while its macro numbers (turns, token-work, cost) are complete from the ledger. The cost ratio will move with workload shape and with how long you let the agent grind.
Bottom line
GLM-5.2 codes like Opus: it solves the same problems, fails the same problems, and fails them the same way (confident-wrong). The open model has reached the frontier on real coding-agent work. And it does it for less: even uncached it was already cheaper than Opus, and with prompt caching on it runs at ~46% of Opus's cost for the same 25/45 result. Capability parity at less than half the spend, on an open-weights model. That is the headline, and it is remarkable.
Solving engineering smartly, releases coming soon @EntelligenceAI!