
Scoring Models on Agentic Coding: Cursor Benchmark Results

A new method for scoring models on agentic coding tasks compares intelligence and efficiency in Cursor. Audience reaction: 49.69% supportive, 15.95% opposing.

Community Sentiment Analysis

Real-time analysis of public opinion and engagement

Sentiment Distribution

Engaged: 66%
Positive: 50%
Negative: 16%
Neutral: 34%

Key Takeaways

What the community is saying — both sides

Supporting

1. Community applauds the blend of offline benchmarks and online evals as a more practical lens for real developer workflows, calling CursorBench a useful, action-oriented alternative to saturated leaderboards.

2. The dominant theme is the primacy of token efficiency: many argue a slightly-less-smart but much cheaper and faster model beats a top-scoring but costly one for day-to-day coding.

3. GPT-5.4 (high) is repeatedly singled out for hitting a strong Pareto balance of intelligence and token efficiency, with several users reporting it outperforms older variants (a sketch of the frontier computation follows this list).

4. Open-weight and efficiency standouts like Opus 4.6, Composer, and GLM are praised for punching above their weight on the efficiency frontier.

5. People want more axes: speed, multi-step task complexity, and especially long-horizon context retention across multi-hour coding sessions are frequent requests.

6. Calls for broader coverage: many ask to include more models (Grok, Qwen, Kimi, Sonnet variants, GPT-5.4 modes) and to publish the methodology or open-source the benchmark.

7. Practical metrics demanded: requests to show cost per solved task or per-task price alongside token counts, so teams can estimate real production spend (see the cost arithmetic in the sketch after this list).

8. Several replies note that real-world signals (live usage, build/regression behavior, hallucinated imports) matter more than static scores; this framing makes the leaderboard feel more trustworthy.

9. The tone is largely enthusiastic and pragmatic: users are excited to see data that helps route tasks to the right model rather than chase prestige metrics.
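
Two of the takeaways above are quantitative: the Pareto balance of intelligence versus token efficiency (item 3) and cost per solved task (item 7). Here is a minimal sketch of both computations; the model names, scores, token counts, solve rates, and the price are all hypothetical, since the post does not publish raw numbers.

```python
# Hypothetical data: (intelligence score, mean tokens per task, solve rate).
# None of these numbers come from the benchmark; they only illustrate the math.
PRICE_PER_MTOK = 10.0  # assumed blended $/1M tokens, purely illustrative

models = {
    "model-a": {"score": 62.0, "tokens_per_task": 180_000, "solve_rate": 0.55},
    "model-b": {"score": 58.0, "tokens_per_task": 60_000,  "solve_rate": 0.50},
    "model-c": {"score": 50.0, "tokens_per_task": 40_000,  "solve_rate": 0.42},
    "model-d": {"score": 48.0, "tokens_per_task": 150_000, "solve_rate": 0.35},
}

def pareto_frontier(models):
    """A model is on the frontier if no other model is both at least as smart
    (score) and at least as cheap (tokens per task), with one strict edge."""
    frontier = []
    for name, m in models.items():
        dominated = any(
            o["score"] >= m["score"]
            and o["tokens_per_task"] <= m["tokens_per_task"]
            and (o["score"] > m["score"] or o["tokens_per_task"] < m["tokens_per_task"])
            for other, o in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

def cost_per_solved_task(m, price_per_mtok=PRICE_PER_MTOK):
    """Expected spend per successful task: token cost of one attempt
    divided by the solve rate."""
    cost_per_attempt = m["tokens_per_task"] / 1e6 * price_per_mtok
    return cost_per_attempt / m["solve_rate"]

print("Pareto frontier:", pareto_frontier(models))
for name, m in models.items():
    print(f"{name}: ${cost_per_solved_task(m):.2f} per solved task")
```

Dividing attempt cost by solve rate is one common way to fold reliability into the spend estimate; the benchmark itself may weight these axes differently.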

Opposing

1. The benchmark is widely called unreliable and cherry-picked, with users pointing to suspicious rankings (Sonnet, Opus, Composer) and complaining about sloppy chart design.

2. Cost and compute fears dominate: several replies warn GPT-5.4's 1M context mode is prohibitively expensive (one user cited $26 for a single prompt; see the arithmetic sketch after this list) and see efficiency scores as a literal price tag for layoffs.

3. Conversation shifts from "which model writes best" to "which model uses the fewest resources," with users stressing that efficiency metrics will drive hiring and tooling decisions.

4. Many argue benchmarks don't reflect production reality and ask for retention and real-world metrics instead of isolated bench scores.

5. Practical users note that time spent clarifying prompts often beats swapping models: prompting quality trumps model choice.

6. Plenty of replies predict the chart will be turned into marketing collateral and call for clearer naming, methodology transparency, and better communication.

7. Tone ranges from skeptical to hostile: mocking, profanity, and blunt disbelief pepper the thread, underscoring frustration with the presentation.

8. Concrete asks include fixing factual errors, improving rate limits, and publishing clearer methodology or updated Composer results to restore credibility.
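
The $26 complaint in item 2 is straightforward to sanity-check. A minimal back-of-the-envelope sketch, with hypothetical long-context rates (no official pricing appears in the thread):

```python
# Back-of-the-envelope cost for one long-context prompt.
# Both rates below are assumptions for illustration, not published prices.
INPUT_PRICE_PER_MTOK = 24.0   # hypothetical $/1M input tokens at long context
OUTPUT_PRICE_PER_MTOK = 96.0  # hypothetical $/1M output tokens

def prompt_cost(input_tokens, output_tokens):
    return (input_tokens / 1e6) * INPUT_PRICE_PER_MTOK \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK

# Filling a 1M-token context window once, with a modest reply:
print(f"${prompt_cost(1_000_000, 20_000):.2f}")  # -> $25.92, near the cited $26
```

Any input rate in the low tens of dollars per million tokens reproduces a cost near the cited figure, which is why replies treat the efficiency axis as a budget line rather than a leaderboard curiosity.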

Top Reactions

Most popular replies, ranked by engagement

@cursor_ai · Supporting
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
253 · 5 · 45.5K

@cursor_ai · Supporting
Learn more: https://t.co/AYcO0IQ33e
125 · 2 · 26.6K

@labomen001 · Supporting
Here's the graph with the same data, but plotted against the actual output cost for each (Composer 1.5 output from Cursor docs is $17.5). Although this doesn't account for >200K Opus 4.6/>272K GPT 5.4/Gemini 3.1 >200K.
94 · 4 · 23.7K

@carboxydev · Opposing
damn i did not expect sonnet 4.5 to be so behind
4 · 4 · 5.0K

@dimitrioskonst · Opposing
This graph loses credibility when codex and got5.4 are above Opus
3 · 3 · 1.3K

@browntechdude · Opposing
Wait codex is better than Claude?
3 · 0 · 916