@cursor_ai
A new method for scoring models on agentic coding tasks compares intelligence and efficiency in Cursor. Audience reaction: 49.69% supportive, 15.95% confronting.
We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency: https://t.co/VItnifMh55
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Supporters welcome the intelligence-versus-efficiency framing as a practical alternative to saturated leaderboards.
The dominant theme is the primacy of token efficiency: many argue a slightly less smart but much cheaper and faster model beats a top-scoring but costly one for day-to-day coding.
GPT-5.4 (high) is repeatedly singled out for hitting a strong Pareto balance of intelligence and token efficiency, with several users reporting it outperforms older variants.
Smaller models such as Composer and GLM are praised for punching above their weight on the efficiency frontier.
Benchmarks covering multi-step task complexity, and especially long-horizon context retention across multi-hour coding sessions, are frequent requests.
Many ask to include more models (Grok, Qwen, Kimi, Sonnet variants, GPT-5.4 modes) and to publish the methodology or open-source the benchmark.
Others want per-task pricing alongside token counts so teams can estimate real production spend.
Several replies note that real-world signals (live usage, build/regression behavior, hallucinated imports) matter more than static scores, and that grounding the leaderboard in such signals would make it more trustworthy.
Users are excited to see data that helps route tasks to the right model rather than chase prestige metrics.
Critics call the chart self-serving and cherry-picked, pointing to suspicious rankings (Sonnet, Opus, Composer) and complaining about sloppy chart design.
Some balk at real-world costs (one user cited $26 for a single prompt) and see efficiency scores as a literal price tag for layoffs.
Conversation shifts from “which model writes best” to “which model uses the fewest resources,” with users stressing that efficiency metrics will drive hiring and tooling decisions.
Several call for evaluation on realistic end-to-end workflows instead of isolated bench scores.
Plenty of replies predict the chart will be repurposed as marketing collateral and call for clearer naming, methodology transparency, and better communication.
Mocking, profanity, and blunt disbelief pepper the thread, underscoring frustration with the presentation.
Concrete asks include fixing factual errors, improving rate limits, and publishing clearer methodology or updated Composer results to restore credibility.
Most popular replies, ranked by engagement
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
Learn more: https://t.co/AYcO0IQ33e
Here's the graph with the same data, but plotted against the actual output cost for each (Composer 1.5 output from Cursor docs is $17.5). Although this doesn't account for >200K Opus 4.6/>272K GPT 5.4/Gemini 3.1 >200K.
damn i did not expect sonnet 4.5 to be so behind
This graph loses credibility when codex and got5.4 are above Opus
Wait codex is better than Claude?