@cursor_ai
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
A new method for scoring models on agentic coding tasks, comparing intelligence and efficiency in Cursor. Audience reaction: 49.69% supportive, 15.95% critical.
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Community applauds the blend of offline benchmarks and online evals as a more practical lens for real developer workflows, calling CursorBench a useful, action-oriented alternative to saturated leaderboards.
The dominant theme is the primacy of token efficiency — many argue a slightly-less-smart but much cheaper/faster model beats a top-scoring but costly one for day-to-day coding.
4 (high) is repeatedly singled out for hitting a strong Pareto balance of intelligence and token efficiency, with several users reporting it outperforms older variants.
6, Composer, and GLM are praised for punching above their weight on the efficiency frontier.
Frequent requests include speed, multi-step task complexity, and especially long-horizon context retention across multi-hour coding sessions.
Many ask to include more models (Grok, Qwen, Kimi, Sonnet variants, GPT-5.4 modes) and to publish the methodology or open-source the benchmark.
There are also requests to show cost per solved task or per-task price alongside token counts so teams can estimate real production spend (a rough sketch of that calculation follows this list).
Several replies note real-world signals (live usage, build/regression behavior, hallucinated imports) matter more than static scores — this framing makes the leaderboard feel more trustworthy.
Users are excited to see data that helps route tasks to the right model rather than chase prestige metrics.
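
To make the cost-per-solved-task and Pareto-balance points above concrete, here is a minimal Python sketch; the model names, solve rates, token counts, and prices are hypothetical placeholders rather than CursorBench figures.

# Hypothetical figures for illustration only; real solve rates, token counts,
# and prices would come from the benchmark run and each provider's price list.
models = {
    "model_a": {"solve_rate": 0.62, "tokens_per_task": 45_000, "usd_per_m_tokens": 15.0},
    "model_b": {"solve_rate": 0.57, "tokens_per_task": 18_000, "usd_per_m_tokens": 3.0},
    "model_c": {"solve_rate": 0.40, "tokens_per_task": 30_000, "usd_per_m_tokens": 5.0},
}

def cost_per_solved_task(m):
    # Cost of one attempt, divided by the fraction of attempts that succeed.
    attempt_cost = m["tokens_per_task"] / 1_000_000 * m["usd_per_m_tokens"]
    return attempt_cost / m["solve_rate"]

def pareto_frontier(models):
    # Keep a model only if no other model solves at least as many tasks
    # and costs at most as much, with a strict improvement on one axis.
    def dominated(a):
        ca = cost_per_solved_task(models[a])
        return any(
            b != a
            and models[b]["solve_rate"] >= models[a]["solve_rate"]
            and cost_per_solved_task(models[b]) <= ca
            and (models[b]["solve_rate"] > models[a]["solve_rate"]
                 or cost_per_solved_task(models[b]) < ca)
            for b in models
        )
    return [a for a in models if not dominated(a)]

for name, m in models.items():
    print(f"{name}: ${cost_per_solved_task(m):.3f} per solved task")
print("Pareto frontier:", pareto_frontier(models))

With placeholder numbers like these, a cheaper model with a slightly lower solve rate can come out well ahead on dollars per solved task, which is the tradeoff the supportive replies are describing.
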
The benchmark is widely called unreliable and cherry‑picked, with users pointing to suspicious rankings (Sonnet, Opus, Composer) and complaining about sloppy chart design.
Several replies warn that GPT‑5.4's 1M context mode is prohibitively expensive (one user cited $26 for a single prompt; a back-of-the-envelope sketch follows this list), and some read the efficiency scores as a literal price tag for layoffs.
Conversation shifts from “which model writes best” to “which model uses the fewest resources,” with users stressing that efficiency metrics will drive hiring and tooling decisions.
Many argue benchmarks don’t reflect production reality and ask for retention and real‑world metrics instead of isolated bench scores.
Practical users note that spending time clarifying prompts often beats swapping models; prompt quality matters more than which model is selected.
Plenty of replies predict the chart will be turned into marketing collateral and call for clearer naming, methodology transparency, and better communication.
Mocking, profanity, and blunt disbelief pepper the thread, underscoring frustration with the presentation.
Concrete asks include fixing factual errors, improving rate limits, and publishing clearer methodology or updated Composer results to restore credibility.
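
For the long-context cost complaint above, here is a back-of-the-envelope sketch in the same spirit; the per-million-token prices are hypothetical placeholders, not any provider's published rates, but they show how one prompt that fills a very large context window can land in the tens of dollars.

# Placeholder prices; substitute the provider's actual long-context rates.
INPUT_USD_PER_M = 20.0   # hypothetical input price per million tokens
OUTPUT_USD_PER_M = 60.0  # hypothetical output price per million tokens

def prompt_cost(input_tokens: int, output_tokens: int) -> float:
    # Dollar cost of a single request: input and output tokens priced separately.
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M

# One prompt packing ~1M tokens of repository context plus a modest reply:
print(f"${prompt_cost(1_000_000, 8_000):.2f}")  # about $20 at these placeholder rates
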
Most popular replies, ranked by engagement
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
Learn more: https://t.co/AYcO0IQ33e
Here's the graph with the same data, but plotted against the actual output cost for each (Composer 1.5 output from Cursor docs is $17.5). Although this doesn't account for >200K Opus 4.6/>272K GPT 5.4/Gemini 3.1 >200K.
damn i did not expect sonnet 4.5 to be so behind
This graph loses credibility when codex and gpt5.4 are above Opus
Wait codex is better than Claude?