Judging by my tl there is a growing gap in understanding of AI capability. The first issue I think is around recency and tier of use. I think a lot of people tried the free tier of ChatGPT sometime last year and allowed it to inform their views on AI a little too much. From this group come the reactions laughing at various quirks of the models, hallucinations, etc. Yes, I also saw the viral videos of OpenAI's Advanced Voice mode fumbling simple queries like "should I drive or walk to the carwash". The thing is that these free and old/deprecated models don't reflect the capability of the latest round of state-of-the-art agentic models this year, especially OpenAI Codex and Claude Code. But that brings me to the second issue. Even if people paid $200/month to use the state-of-the-art models, a lot of the capabilities are relatively "peaky" in highly technical areas. Typical queries around search, writing, advice, etc. are *not* the domain that has made the most noticeable and dramatic strides in capability. Partly, this is due to the technical details of reinforcement learning and its use of verifiable rewards. But partly, it's also because these use cases are not sufficiently prioritized by the companies in their hillclimbing, because they don't lead to as much $$$ value. The goldmines are elsewhere, and the focus follows. So that brings me to the second group of people, who *both* 1) pay for and use the state-of-the-art frontier agentic models (OpenAI Codex / Claude Code) and 2) do so professionally in technical domains like programming, math, and research. This group of people is subject to the highest amount of "AI Psychosis" because the recent improvements in these domains as of this year have been nothing short of staggering. When you hand a computer terminal to one of these models, you can now watch it melt programming problems that you'd normally expect to take days/weeks of work.
It's this second group of people that assigns a much greater gravity to the capabilities, their slope, and the various cyber-related repercussions. TLDR: the people in these two groups are speaking past each other. It really is simultaneously the case that OpenAI's free and, I think, slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram reels and, *at the same time*, OpenAI's highest-tier paid Codex model will go off for an hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems. This part really works and has made dramatic strides because of two properties: 1) these domains offer explicit reward functions that are verifiable, meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed, yes or no, in contrast to writing, which is much harder to explicitly judge), but also 2) they are a lot more valuable in B2B settings, meaning the biggest fraction of the team is focused on improving them. So here we are.
Figure “Select AI Index technical performance benchmarks vs. human performance” plots model performance (scaled to a human baseline) across multiple benchmarks from 2012–2024, highlighting very steep recent gains on coding benchmarks (HumanEval / SWE-bench) relative to many other tasks, directly illustrating the rapid, domain-specific improvements in coding/technical capabilities described in the passage above. ([hai.stanford.edu](https://hai.stanford.edu/assets/files/hai_ai-index-report-2025_chapter2_final.pdf))
Source: Stanford HAI — AI Index Report 2025
Research Brief
What our analysis found
A viral tweet from AI researcher Andrej Karpathy has crystallized a growing debate in the tech community: the gap between what casual users experience with free AI chatbots and what paying professionals witness with frontier coding agents has become a chasm. The argument centers on models like OpenAI's GPT-5.3-Codex, announced on February 5, 2026, which posted dramatic benchmark gains — scoring 77.3% on Terminal-Bench 2.0 compared to 64.0% for its predecessor, and jumping from roughly 38% to 64.7% on OSWorld-Verified, an agentic evaluation suite. On cybersecurity capture-the-flag challenges, the model achieved 77.6% versus 67.4% for the prior version, and OpenAI classified it as the first model to receive a "High capability" rating for cybersecurity under its internal Preparedness Framework.
The practical implications are already measurable. OpenAI's Codex Security research preview, reported by Axios in March 2026, identified approximately 800 critical findings and more than 10,500 high-severity issues during testing — numbers that lend credibility to claims that these models can surface real vulnerabilities at scale. Meanwhile, OpenAI has been shipping specialized variants like GPT-5.3-Codex-Spark, running on Cerebras wafers and claiming over 1,000 tokens per second, alongside desktop and IDE integrations designed to embed agents directly into professional coding workflows.
Yet the experience for the average user remains starkly different. OpenAI's own March 2026 release notes confirm that free-tier users are routed to smaller "Instant" or "mini" models, while frontier Codex models are reserved for paid, Pro, and enterprise plans. This tiered architecture means millions of casual users are forming opinions based on substantially less capable systems, a structural information gap that fuels the very disconnect Karpathy describes. The technical explanation is rooted in Reinforcement Learning with Verifiable Rewards (RLVR), a training method that leverages deterministic signals, such as unit test pass/fail outcomes, that are far more tractable in code and math than in subjective domains like creative writing or general advice.
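The "deterministic signal" at the heart of RLVR is simple enough to sketch. The toy verifier below (a hypothetical illustration, not OpenAI's actual training pipeline; the function name and sandboxing are ours) runs a candidate solution against its unit tests and emits a binary reward:

```python
import subprocess
import sys


def verifiable_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if candidate_code passes test_code, else 0.0.

    A toy stand-in for the deterministic verifiers RLVR relies on;
    real pipelines sandbox execution and aggregate many test suites.
    """
    program = candidate_code + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating candidates earn no reward
    return 1.0 if result.returncode == 0 else 0.0
```

The asymmetry the article describes falls out immediately: there is no analogous `returncode` for "this essay is good," so subjective domains must rely on learned, noisier judges instead of an exact pass/fail check.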
Fact Check
Evidence from both sides
Supporting Evidence
Frontier coding benchmarks show dramatic, quantifiable gains
OpenAI's own GPT-5.3-Codex announcement reports score jumps on agentic and coding evaluations: Terminal-Bench 2.0 rose from 64.0% to 77.3%, OSWorld-Verified from approximately 38% to 64.7%, and cybersecurity CTF performance from 67.4% to 77.6%. These numbers directly substantiate the claim that recent improvements in technical domains have been "staggering" (source: openai.com, Feb 5, 2026).
Real-world cybersecurity findings validate vulnerability-detection claims
Axios reported in March 2026 that OpenAI's Codex Security research preview identified roughly 800 critical findings and more than 10,500 high-severity issues during testing, supporting the tweet's assertion that these models can find and surface exploitable vulnerabilities in computer systems (source: axios.com).
Reinforcement learning with verifiable rewards explains the "peaky" improvement pattern
The training technique known as RLVR uses deterministic verifiers such as unit tests and automated checkers to provide clear reward signals. Published research from 2024–2026 documents that this approach yields stronger performance gains in code and math domains than in subjective tasks like writing, which aligns precisely with the tweet's explanation of why coding capabilities have outpaced general-purpose improvements (source: paperlens.io).
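One way to see why an exact 0/1 reward yields stronger gains: it filters candidate rollouts with no judge noise. The rejection-sampling loop below is an illustrative sketch only (the `generate` and `verifier` callables are stand-ins, not any lab's API), using integer search in place of code generation:

```python
import random


def keep_verified(generate, verifier, n=200):
    """Draw n candidate solutions and keep only those the deterministic
    verifier accepts. Because the reward is an exact pass/fail, every
    kept sample is a true positive, which is what makes code and math
    so amenable to this style of training."""
    return [c for c in (generate() for _ in range(n)) if verifier(c)]


# Toy demo: "solve" x * x == 49 by random search over small integers.
rng = random.Random(0)
solutions = keep_verified(
    generate=lambda: rng.randint(-10, 10),
    verifier=lambda x: x * x == 49,
)
# Every accepted sample provably satisfies the specification.
```

A subjective domain like writing offers no such `verifier`, so the filtered set inevitably contains false positives from an approximate judge, weakening the training signal.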
Free-tier vs. paid-tier model routing confirms the experience gap
OpenAI's March 2026 release notes and help documentation confirm that free users are routed to smaller "Instant" or "mini" models by default, while frontier Codex models are only available through paid, Pro, or enterprise plans. This directly supports the claim that casual users testing free ChatGPT are interacting with fundamentally different — and less capable — systems (source: help.openai.com).
Practitioner reports and production tooling corroborate compressed work cycles
Multiple industry writeups and engineering reports describe teams using Codex agents to compress debugging and refactoring cycles from days to minutes. OpenAI's rollout of GPT-5.3-Codex-Spark on Cerebras hardware, claiming over 1,000 tokens per second, and desktop IDE integrations further indicate that these tools are being engineered for professional technical workflows, not casual consumer use (sources: tomshardware.com, intuitionlabs.ai).
Contradicting Evidence
OpenAI itself says it has "no definitive evidence" of end-to-end autonomous cyberattack capability
While the tweet implies these models can "find and exploit vulnerabilities in computer systems," OpenAI's own February 2026 announcement explicitly states it has found no definitive evidence that GPT-5.3-Codex can automate cyberattacks from start to finish. The company also describes safeguards including routing high-risk queries to older models and gating access through a Trusted Access program. This is an important nuance to claims about offensive security capability (source: openai.com).
Access to the most capable cyber models is heavily restricted, limiting broad verification
OpenAI's Codex Security is a research preview, and the Trusted Access for Cyber pilot is an invite-only program with credits for vetted organizations. This means the most impressive cybersecurity capabilities are not broadly available for independent evaluation, and the strongest performance claims rest largely on the developer's own benchmarks and controlled testing environments rather than widespread, independent replication (sources: openai.com, axios.com).
Benchmark scores are self-reported and may not generalize to real-world complexity
The dramatic gains cited — such as Terminal-Bench 2.0 and OSWorld-Verified — come from OpenAI's own appendix tables. While directionally meaningful, self-reported benchmarks from the model developer warrant scrutiny, as performance on curated evaluation suites does not always translate to equivalent gains on messy, real-world engineering problems with ambiguous requirements and shifting specifications.
The "two groups" framing oversimplifies a spectrum of user experiences
The tweet presents a binary between casual free-tier users and professional frontier-model users, but OpenAI's tiered system includes multiple paid plans (Plus, Pro, Enterprise, Team) with varying model access and rate limits. Many paying users who are not software engineers may still find improvements incremental in their domains — such as writing, research synthesis, or business strategy — meaning the capability gap is not purely a function of willingness to pay but also of domain suitability for current training methods.