
LamBench Benchmark: GPT 5.x Performance & Sentiment

LamBench: 120 λ-calculus questions benchmarking GPT models on intelligence, elegance, and speed. Early community sentiment: 53% positive, 4% negative. Charts and link included.

@VictorTaelin posted on X

Introducing LamBench . . . You asked me to make a benchmark, so I made it. It is a simple, old-style Q&A consisting of 120 fresh λ-calculus programming questions. Some are easy, like "implement add for λ-encoded nats". Some are harder, like "derive a generic fold for arbitrary λ-encodings". It measures:

- intelligence (% tasks completed)
- elegance (BLC-length of solutions)
- speed (completion time)

Basically what I care about, other than long context. I made it today because I was excited about GPT 5.5. It didn't do too well ): (My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish them in a blind test. I need more time. It is much faster though.) This is a new, simple bench, so expect bugs, especially on OpenRouter models. I'll retest soon. Also, it was born saturated. V2 will be harder... ↓ Link and more charts below ↓
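To make the task style concrete, here is a minimal Haskell sketch of the easy example, "implement add for λ-encoded nats", using Church numerals. This is our illustration of the problem, not LamBench's reference solution or grading harness.

```haskell
{-# LANGUAGE RankNTypes #-}

-- A Church numeral n is the function that applies f to x exactly n times.
type Church = forall a. (a -> a) -> a -> a

zero :: Church
zero _ x = x

succ' :: Church -> Church
succ' n f x = f (n f x)

-- "implement add for λ-encoded nats":
-- run m's applications of f on top of n's applications of f.
add :: Church -> Church -> Church
add m n f x = m f (n f x)

-- Convert back to an ordinary Int to inspect the result.
toInt :: Church -> Int
toInt n = n (+ 1) 0

main :: IO ()
main = print (toInt (add (succ' (succ' zero)) (succ' zero)))  -- prints 3
```

Under this encoding, add simply splices the two application chains together, which is also why it comes out so small under the elegance metric below.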

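On the "elegance" metric: binary lambda calculus (BLC) assigns every λ-term a bit-length, so shorter solutions score better. Here is a minimal Haskell sketch of the standard BLC size calculation over de Bruijn terms, as an illustration of the metric rather than LamBench's actual scorer.

```haskell
-- Untyped λ-terms with 1-based de Bruijn indices, as used in BLC.
data Term = Var Int | Lam Term | App Term Term

-- Standard BLC bit-lengths:
--   abstraction  00<body>    -> 2 + size of body
--   application  01<t1><t2>  -> 2 + size of t1 + size of t2
--   variable n   n ones, 0   -> n + 1 bits
blcSize :: Term -> Int
blcSize (Var n)   = n + 1
blcSize (Lam b)   = 2 + blcSize b
blcSize (App f a) = 2 + blcSize f + blcSize a

-- Church add, λm.λn.λf.λx. m f (n f x), as a de Bruijn term.
addTerm :: Term
addTerm = Lam (Lam (Lam (Lam
  (App (App (Var 4) (Var 2))
       (App (App (Var 3) (Var 2)) (Var 1))))))

main :: IO ()
main = print (blcSize addTerm)  -- 33 bits; smaller is more elegant
```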

Community Sentiment Analysis

Real-time analysis of public opinion and engagement

Sentiment Distribution

57% engaged

Positive: 53%
Negative: 4%
Neutral: 43%

Key Takeaways

What the community is saying — both sides

Supporting

1

Enthusiastic endorsement:

Many replies celebrate the benchmark — “finally, a proper λ-calculus benchmark,” “this is so sick,” and multiple “cool work” reactions.

2

Expand and schedule it:

Requests to broaden scope — run the top 20 leader LLMs, test weekly, and publish longitudinal results including older models to track evolution.

3

Methodology and metrics need clarity:

Questions about single vs. multiple tries, whether models use maximum reasoning effort, how "intelligence" is defined, and suggestions to add price per token for comparability (see the cost sketch after this list).

4

Seen as a trustworthy signal:

Several responders praise the use of lambda calculus as “brutally honest” and commend transparent admissions (e.g., not being able to tell 5.5 from 5.4).

5

Regression detection and version concerns:

Observers point out surprising drops (5.5 vs. 5.4, GLM, K…) and criticize flagship upgrades that appear worse than previous releases.

6

Quality over speed preference:

Some argue they’d prefer slower, more accurate reasoning rather than optimizing for raw speed metrics.

7

Community reproduction and open-model testing:

People offer to run the repo on open models, suggest trying specific models (opus 4.5, grok, benchmaxx), and encourage broader community benchmarking.
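Picking up the price-per-token suggestion from point 3: here is a minimal Haskell sketch of one way to fold cost into the comparison, scoring tasks solved per dollar. All model names, prices, and counts below are hypothetical placeholders, not LamBench data.

```haskell
-- All figures below are hypothetical placeholders, not LamBench data.
data Run = Run
  { model       :: String
  , tasksSolved :: Int
  , tokensUsed  :: Int
  , usdPerMTok  :: Double  -- price per million tokens
  }

-- Cost-adjusted view of "intelligence": tasks solved per dollar spent.
solvedPerDollar :: Run -> Double
solvedPerDollar r =
  fromIntegral (tasksSolved r)
    / (fromIntegral (tokensUsed r) / 1e6 * usdPerMTok r)

main :: IO ()
main = mapM_ report
  [ Run "model-a" 90 2000000 10.0  -- hypothetical
  , Run "model-b" 80 500000 2.0    -- hypothetical
  ]
  where
    report r =
      putStrLn (model r ++ ": " ++ show (solvedPerDollar r) ++ " tasks/$")
```

A cheaper model that solves slightly fewer tasks can dominate on this view, which is exactly the comparability the commenters were asking for.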

Opposing

1

Training on a benchmark creates a feedback loop that effectively turns it into an RL environment:

Models and teams optimize to exploit the benchmark, producing distributional drift, reward hacking, and overfitting. Recommended fixes: fresh blind test sets, periodic rotation of tasks, and evaluation by independent holders.

2

Benchmarks remain static evaluation tools:

Calling them RL conflates supervised training with interactive decision processes. The distinction matters for tooling and expectations; proposed fixes focus on clearer terminology and stronger generalization tests (out-of-distribution and real-world benchmarks) rather than renaming the problem.

3

Sequence length materially changes evaluation and optimization:

Longer examples skew metrics, affect throughput, and create incentives to game length rather than quality. Common responses: use length-normalized metrics, stratify results by input length, and include length-balanced splits so models aren't rewarded for trivial length manipulation (a minimal sketch follows below).
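As a sketch of the stratification idea from point 3: bucket results by prompt length and report per-bucket accuracy, so long and short tasks are never averaged into a single number. The Result type and the bucket width are illustrative assumptions, not part of LamBench.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- One benchmark result: prompt length in tokens, and whether it was solved.
-- Illustrative assumption, not LamBench's actual data format.
data Result = Result { promptLen :: Int, solved :: Bool }

-- Group results into length buckets of the given width, then report
-- (bucket index, fraction solved) for each occupied bucket.
accuracyByLengthBucket :: Int -> [Result] -> [(Int, Double)]
accuracyByLengthBucket width rs =
  [ (bucket (head g), accuracy g)
  | g <- groupBy ((==) `on` bucket) (sortOn bucket rs)
  ]
  where
    bucket r   = promptLen r `div` width
    accuracy g = fromIntegral (length (filter solved g))
               / fromIntegral (length g)

main :: IO ()
main = print (accuracyByLengthBucket 100
  [Result 50 True, Result 80 False, Result 250 True])
  -- [(0,0.5),(2,1.0)]
```

Reporting per-bucket accuracy removes the incentive to pad or trim inputs, since a model's score within each length band is unaffected by how the other bands are populated.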

Top Reactions

Most popular replies, ranked by engagement

@VictorTaelin (Supporting)

Speed:

64 · 1 · 3.5K
@VictorTaelin (Supporting)

Elegance:

61 · 0 · 3.3K
@VictorTaelin (Supporting)

Link: https://t.co/Z6jEbRZkFC. This is a silly tiny bench, don't expect much from it. Just automating part of what I test on each new model.

42 · 1 · 3.0K
@NicholasBardy (Opposing)

If you train on a benchmark it becomes an RL environment.

2 · 0 · 686
@engMecComp (Opposing)

i think that in this bench context length is a turnover

1 · 1 · 245

This article was AI-generated from real-time signals discovered by PureFeed.
