@VictorTaelin
LamBench: 120 fresh λ-calculus questions benchmarking LLMs on intelligence, elegance, and speed. Early reply sentiment: 53% supportive, 4% critical; charts and link included.
Introducing LamBench . . . You asked me to make a benchmark, so I made it. It is a simple, old-style Q&A consisting of 120 fresh λ-calculus programming questions. Some are easy, like "implement add for λ-encoded nats". Some are harder, like "derive a generic fold for arbitrary λ-encodings". It measures:
- intelligence (% tasks completed)
- elegance (BLC-length of solutions)
- speed (completion time)
Basically what I care about, other than long context. I made it today because I was excited about GPT 5.5. It didn't do too well ): (My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish them in a blind test. I need more time. It is much faster, though.) This is a new, simple bench, so expect bugs. Especially on OpenRouter models. I'll retest soon. Also, it was born saturated. V2 will be harder... ↓ Link and more charts below ↓
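For context on the task style, here is a minimal Haskell sketch of the two quoted tasks: addition on Church-encoded naturals, plus the observation behind the "generic fold" task that a Church-encoded value already is its own fold. The names and encodings are illustrative assumptions, not LamBench's actual reference solutions:

```haskell
{-# LANGUAGE RankNTypes #-}

-- Church-encoded naturals: n is a function that applies a
-- successor s to a base z exactly n times.
type Church = forall a. (a -> a) -> a -> a

zero :: Church
zero _ z = z

suc :: Church -> Church
suc n s z = s (n s z)

-- "implement add for λ-encoded nats": run m's iterations of s
-- on top of n's result.
add :: Church -> Church -> Church
add m n s z = m s (n s z)

-- Convert back to Int to inspect results.
toInt :: Church -> Int
toInt n = n (+ 1) 0

-- A Church-encoded list is its own fold; the harder "generic
-- fold" task asks for this idea generalized to arbitrary
-- λ-encodings.
type ChurchList a = forall r. (a -> r -> r) -> r -> r

nil :: ChurchList a
nil _ z = z

cons :: a -> ChurchList a -> ChurchList a
cons x xs c z = c x (xs c z)

main :: IO ()
main = do
  print (toInt (add (suc (suc zero)) (suc zero)))  -- 3
  print (cons 1 (cons 2 nil) (+) 0 :: Int)         -- 3
```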
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Many replies celebrate the benchmark — “finally, a proper λ-calculus benchmark,” “this is so sick,” and multiple “cool work” reactions.
Requests to broaden scope — run the top 20 leaderboard LLMs, test weekly, and publish longitudinal results including older models to track evolution.
Questions about single vs. multiple tries, whether models use max reasoning effort, how “intelligence” is defined, and suggestions to add price per token for comparability.
Several responders praise the use of lambda calculus as “brutally honest” and commend transparent admissions (e.g., not being able to tell 5.5 from 5.4).
Observers point out surprising drops (5.5 vs 5.4, GLM, K…) and criticize flagship upgrades that appear worse than previous releases.
Some argue they’d prefer slower, more accurate reasoning rather than optimizing for raw speed metrics.
People offer to run the repo on open models, suggest trying specific models (opus 4.5, grok, benchmaxx), and encourage broader community benchmarking.
Warnings that once a benchmark becomes a training target, models and teams optimize to exploit it, producing distributional drift, reward hacking, and overfitting. Recommended fixes: fresh blind test sets, periodic rotation of tasks, and evaluation by independent holders (a rotation sketch follows this list).
Others counter that benchmarks are static datasets, and calling them RL conflates supervised training with interactive decision processes. The distinction matters for tooling and expectations; solutions focus on clearer terminology and stronger generalization tests (out-of-distribution and real-world benchmarks) rather than renaming the problem.
Concerns about context length: longer examples skew metrics, affect throughput, and create incentives to game length rather than quality. Common responses: use length-normalized metrics, stratify results by input length, and include length-balanced splits so models aren't rewarded for trivial length manipulation (a stratification sketch follows below).
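On the contamination point, one standard mitigation is rotating which slice of a private task pool gets evaluated each period. This is a minimal, hypothetical sketch of that idea; nothing in the thread says LamBench actually does this:

```haskell
-- Hypothetical benchmark-hygiene helper: expose a different
-- deterministic slice of a private task pool each evaluation
-- period, so training on last period's public slice does not
-- pay off on the next one. Assumes a non-empty pool.
rotatingSlice :: Int -> Int -> [a] -> [a]
rotatingSlice period sliceSize pool =
  take sliceSize (drop offset (cycle pool))
  where
    offset = (period * sliceSize) `mod` length pool

main :: IO ()
main = print (rotatingSlice 1 40 [0 .. 119 :: Int])  -- tasks 40..79
```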
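And on the context-length point, a minimal sketch of length-stratified scoring; the `Result` fields and the bucket width are assumptions for illustration, not LamBench's actual schema:

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical per-task record; the fields are illustrative.
data Result = Result { inputLen :: Int, solved :: Bool }

-- Accuracy within one stratum.
accuracy :: [Result] -> Double
accuracy rs =
  fromIntegral (length (filter solved rs)) / fromIntegral (length rs)

-- Bucket tasks by input length (buckets of `width` tokens), then
-- score each bucket separately, so long and short tasks cannot
-- mask each other inside one global average.
stratifiedAccuracy :: Int -> [Result] -> [(Int, Double)]
stratifiedAccuracy width rs =
  [ (bucket (head g) * width, accuracy g)
  | g <- groupBy ((==) `on` bucket) (sortOn bucket rs) ]
  where
    bucket r = inputLen r `div` width

main :: IO ()
main = print (stratifiedAccuracy 100
                [Result 40 True, Result 60 False, Result 250 True])
-- prints [(0,0.5),(200,1.0)]
```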
Most popular replies, ranked by engagement
Speed: [chart]
Elegance: [chart]
Link: https://t.co/Z6jEbRZkFC. This is a silly tiny bench, don't expect much from it. Just automating part of what I test on each new model.
If you train on a benchmark it becomes an RL environment
I think that in this bench context length is a turnover