@VictorTaelin
LamBench: 120 fresh λ-calculus questions benchmarking LLMs on intelligence, elegance, and speed. Early reply sentiment: 53% supportive, 4% critical; charts and link included.
Introducing LamBench . . . You asked me to make a benchmark, so I made it. It is a simple, old-style Q&A consisting of 120 fresh λ-calculus programming questions. Some are easy, like "implement add for λ-encoded nats". Some are harder, like "derive a generic fold for arbitrary λ-encodings". It measures:
- intelligence (% tasks completed)
- elegance (BLC-length of solutions)
- speed (completion time)
Basically what I care about, other than long context. I made it today because I was excited about GPT 5.5. It didn't do too well ): (My first-day impression is that I can't tell the difference between GPT 5.5 and GPT 5.4. I would be lying if I said otherwise. I'd not be able to distinguish them in a blind test. I need more time. It is much faster, though.) This is a new, simple bench, so expect bugs. Especially on OpenRouter models. I'll retest soon. Also, it was born saturated. V2 will be harder... ↓ Link and more charts below ↓
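For context on the task style, here is a minimal Haskell sketch of the two quoted tasks: addition on Church-encoded naturals, plus the observation behind the "generic fold" task that a Church-encoded value already is its own fold. The names and encodings are illustrative assumptions, not LamBench's actual reference solutions:

```haskell
{-# LANGUAGE RankNTypes #-}

-- Church-encoded naturals: n is a function that applies a
-- successor s to a base z exactly n times.
type Church = forall a. (a -> a) -> a -> a

zero :: Church
zero _ z = z

suc :: Church -> Church
suc n s z = s (n s z)

-- "implement add for λ-encoded nats": run m's iterations of s
-- on top of n's result.
add :: Church -> Church -> Church
add m n s z = m s (n s z)

-- Convert back to Int to inspect results.
toInt :: Church -> Int
toInt n = n (+ 1) 0

-- A Church-encoded list is its own fold; the harder "generic
-- fold" task asks for this idea generalized to arbitrary
-- λ-encodings.
type ChurchList a = forall r. (a -> r -> r) -> r -> r

nil :: ChurchList a
nil _ z = z

cons :: a -> ChurchList a -> ChurchList a
cons x xs c z = c x (xs c z)

main :: IO ()
main = do
  print (toInt (add (suc (suc zero)) (suc zero)))  -- 3
  print (cons 1 (cons 2 nil) (+) 0 :: Int)         -- 3
```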
Real-time analysis of public opinion and engagement
What the community is saying — both sides
Many replies celebrate the benchmark — “finally, a proper λ-calculus benchmark,” “this is so sick,” and multiple “cool work” reactions.
Requests to broaden scope — run the top 20 leaderboard LLMs, test weekly, and publish longitudinal results including older models to track evolution.
Questions about single vs. multiple tries, whether models use max reasoning effort, how “intelligence” is defined, and suggestions to add price per token for comparability.
Several responders praise the use of lambda calculus as “brutally honest” and commend transparent admissions (e.g., not being able to tell 5.5 from 5.4).
Observers point out surprising drops (5.5 vs 5.4, GLM, K…) and criticize flagship upgrades that appear worse than previous releases.
Some argue they’d prefer slower, more accurate reasoning rather than optimizing for raw speed metrics.
People offer to run the repo on open models, suggest trying specific models (opus 4.5, grok, benchmaxx), and encourage broader community benchmarking.
Warnings that once a benchmark becomes a training target, models and teams optimize to exploit it, producing distributional drift, reward hacking, and overfitting. Recommended fixes: fresh blind test sets, periodic rotation of tasks, and evaluation by independent holders (a rotation sketch follows this list).
Others counter that benchmarks are static datasets, and calling them RL conflates supervised training with interactive decision processes. The distinction matters for tooling and expectations; solutions focus on clearer terminology and stronger generalization tests (out-of-distribution and real-world benchmarks) rather than renaming the problem.
Concerns about context length: longer examples skew metrics, affect throughput, and create incentives to game length rather than quality. Common responses: use length-normalized metrics, stratify results by input length, and include length-balanced splits so models aren't rewarded for trivial length manipulation (a stratification sketch follows below).
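On the contamination point, one standard mitigation is rotating which slice of a private task pool gets evaluated each period. This is a minimal, hypothetical sketch of that idea; nothing in the thread says LamBench actually does this:

```haskell
-- Hypothetical benchmark-hygiene helper: expose a different
-- deterministic slice of a private task pool each evaluation
-- period, so training on last period's public slice does not
-- pay off on the next one. Assumes a non-empty pool.
rotatingSlice :: Int -> Int -> [a] -> [a]
rotatingSlice period sliceSize pool =
  take sliceSize (drop offset (cycle pool))
  where
    offset = (period * sliceSize) `mod` length pool

main :: IO ()
main = print (rotatingSlice 1 40 [0 .. 119 :: Int])  -- tasks 40..79
```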
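And on the context-length point, a minimal sketch of length-stratified scoring; the `Result` fields and the bucket width are assumptions for illustration, not LamBench's actual schema:

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical per-task record; the fields are illustrative.
data Result = Result { inputLen :: Int, solved :: Bool }

-- Accuracy within one stratum.
accuracy :: [Result] -> Double
accuracy rs =
  fromIntegral (length (filter solved rs)) / fromIntegral (length rs)

-- Bucket tasks by input length (buckets of `width` tokens), then
-- score each bucket separately, so long and short tasks cannot
-- mask each other inside one global average.
stratifiedAccuracy :: Int -> [Result] -> [(Int, Double)]
stratifiedAccuracy width rs =
  [ (bucket (head g) * width, accuracy g)
  | g <- groupBy ((==) `on` bucket) (sortOn bucket rs) ]
  where
    bucket r = inputLen r `div` width

main :: IO ()
main = print (stratifiedAccuracy 100
                [Result 40 True, Result 60 False, Result 250 True])
-- prints [(0,0.5),(200,1.0)]
```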
Most popular replies, ranked by engagement
Speed: [chart]
Elegance: [chart]
Link: https://t.co/Z6jEbRZkFC. This is a silly tiny bench, don't expect much from it. Just automating part of what I test on each new model.
If you train on a benchmark it becomes an RL environment
I think that in this bench context length is a turnover