@cursor_ai
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
A new method for scoring models on agentic coding tasks, comparing intelligence and efficiency in Cursor. Audience reaction: 49.69% supportive, 15.95% critical.
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Community applauds the blend of offline benchmarks and online evals as a more practical lens for real developer workflows, calling CursorBench a useful, action-oriented alternative to saturated leaderboards.
The dominant theme is the primacy of token efficiency — many argue a slightly-less-smart but much cheaper/faster model beats a top-scoring but costly one for day-to-day coding.
4 (high) is repeatedly singled out for hitting a strong Pareto balance of intelligence and token efficiency, with several users reporting it outperforms older variants.
6, Composer, and GLM are praised for punching above their weight on the efficiency frontier.
Frequent requests include speed, multi-step task complexity, and especially long-horizon context retention across multi-hour coding sessions.
Many ask to include more models (Grok, Qwen, Kimi, Sonnet variants, GPT-5.4 modes) and to publish the methodology or open-source the benchmark.
There are also requests to show cost per solved task or per-task price alongside token counts so teams can estimate real production spend (a rough sketch of that calculation follows this list).
Several replies note real-world signals (live usage, build/regression behavior, hallucinated imports) matter more than static scores — this framing makes the leaderboard feel more trustworthy.
Users are excited to see data that helps route tasks to the right model rather than chase prestige metrics.
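
To make the cost-per-solved-task and Pareto-balance points above concrete, here is a minimal Python sketch; the model names, solve rates, token counts, and prices are hypothetical placeholders rather than CursorBench figures.

# Hypothetical figures for illustration only; real solve rates, token counts,
# and prices would come from the benchmark run and each provider's price list.
models = {
    "model_a": {"solve_rate": 0.62, "tokens_per_task": 45_000, "usd_per_m_tokens": 15.0},
    "model_b": {"solve_rate": 0.57, "tokens_per_task": 18_000, "usd_per_m_tokens": 3.0},
    "model_c": {"solve_rate": 0.40, "tokens_per_task": 30_000, "usd_per_m_tokens": 5.0},
}

def cost_per_solved_task(m):
    # Cost of one attempt, divided by the fraction of attempts that succeed.
    attempt_cost = m["tokens_per_task"] / 1_000_000 * m["usd_per_m_tokens"]
    return attempt_cost / m["solve_rate"]

def pareto_frontier(models):
    # Keep a model only if no other model solves at least as many tasks
    # and costs at most as much, with a strict improvement on one axis.
    def dominated(a):
        ca = cost_per_solved_task(models[a])
        return any(
            b != a
            and models[b]["solve_rate"] >= models[a]["solve_rate"]
            and cost_per_solved_task(models[b]) <= ca
            and (models[b]["solve_rate"] > models[a]["solve_rate"]
                 or cost_per_solved_task(models[b]) < ca)
            for b in models
        )
    return [a for a in models if not dominated(a)]

for name, m in models.items():
    print(f"{name}: ${cost_per_solved_task(m):.3f} per solved task")
print("Pareto frontier:", pareto_frontier(models))

With placeholder numbers like these, a cheaper model with a slightly lower solve rate can come out well ahead on dollars per solved task, which is the tradeoff the supportive replies are describing.
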
The benchmark is widely called unreliable and cherry‑picked, with users pointing to suspicious rankings (Sonnet, Opus, Composer) and complaining about sloppy chart design.
Several replies warn that GPT‑5.4's 1M context mode is prohibitively expensive (one user cited $26 for a single prompt; a back-of-the-envelope sketch follows this list), and some read the efficiency scores as a literal price tag for layoffs.
Conversation shifts from “which model writes best” to “which model uses the fewest resources,” with users stressing that efficiency metrics will drive hiring and tooling decisions.
Many argue benchmarks don’t reflect production reality and ask for retention and real‑world metrics instead of isolated bench scores.
Practical users note that spending time clarifying prompts often beats swapping models; prompt quality matters more than which model is selected.
Plenty of replies predict the chart will be turned into marketing collateral and call for clearer naming, methodology transparency, and better communication.
Mocking, profanity, and blunt disbelief pepper the thread, underscoring frustration with the presentation.
Concrete asks include fixing factual errors, improving rate limits, and publishing clearer methodology or updated Composer results to restore credibility.
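
For the long-context cost complaint above, here is a back-of-the-envelope sketch in the same spirit; the per-million-token prices are hypothetical placeholders, not any provider's published rates, but they show how one prompt that fills a very large context window can land in the tens of dollars.

# Placeholder prices; substitute the provider's actual long-context rates.
INPUT_USD_PER_M = 20.0   # hypothetical input price per million tokens
OUTPUT_USD_PER_M = 60.0  # hypothetical output price per million tokens

def prompt_cost(input_tokens: int, output_tokens: int) -> float:
    # Dollar cost of a single request: input and output tokens priced separately.
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M

# One prompt packing ~1M tokens of repository context plus a modest reply:
print(f"${prompt_cost(1_000_000, 8_000):.2f}")  # about $20 at these placeholder rates
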
Most popular replies, ranked by engagement
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
Learn more: https://t.co/AYcO0IQ33e
Here's the graph with the same data, but plotted against the actual output cost for each (Composer 1.5 output from Cursor docs is $17.5). Although this doesn't account for >200K Opus 4.6/>272K GPT 5.4/Gemini 3.1 >200K.
damn i did not expect sonnet 4.5 to be so behind
This graph loses credibility when codex and gpt5.4 are above Opus
Wait codex is better than Claude?