
How CoT Prompting Boosts LLM Math Performance — Data

How Chain-of-Thought prompting (vs. fine-tuning) raised PaLM 540B to 58% on GSM8K. Data-driven analysis of prompting strategies to boost LLM math and reasoning.

@akshay_pachaar posted on X

You're in a Research Scientist interview at Google. Interviewer: We have a base LLM that's terrible at maths. How would you turn it into a maths & reasoning powerhouse? You: I'll get some problems labeled and fine-tune the model. Interview over. Here's what you missed:

Process infographic from DeepMind’s July 25, 2024 blog showing AlphaProof’s training loop: informal problems are formalized into a large synthetic corpus, a solver network searches for formal proofs, and successful proofs are used to reinforce the model (AlphaZero-style). It directly supports the point that turning a weak LLM into a math/reasoning powerhouse requires formalization, synthetic data generation, search/verification and reinforcement learning — not just standard supervised fine-tuning on labeled problems.

Source: DeepMind (Google DeepMind blog)

Research Brief

What our analysis found

The viral tweet highlights a critical gap in how many practitioners approach improving LLMs at math: defaulting to fine-tuning while overlooking prompting strategies like Chain-of-Thought (CoT). Introduced by Google researchers in 2022, CoT prompting guides models to generate intermediate reasoning steps before arriving at a final answer. On the GSM8K benchmark of math word problems, a PaLM 540B model using CoT prompting achieved 58% accuracy, surpassing the previous state-of-the-art of 55% set by a fine-tuned GPT-3 175B model with a verifier — without modifying a single model weight.
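The few-shot CoT technique from that 2022 paper can be sketched in a few lines: each exemplar in the prompt shows worked intermediate reasoning before the final answer, and the model is asked to continue the pattern. The exemplar below is the tennis-ball problem from the original paper; the model call itself is left out, so `build_cot_prompt` only constructs the prompt string.

```python
# Minimal sketch of a few-shot Chain-of-Thought prompt in the style of the
# 2022 Google paper: the exemplar demonstrates intermediate reasoning steps
# before the answer, which the model is expected to imitate.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked reasoning exemplar to the target question."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

prompt = build_cot_prompt("A bag holds 4 apples. How many apples are in 3 bags?")
```

No weights are touched: the entire intervention lives in the prompt string passed at inference time.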

Perhaps most striking is the power of zero-shot CoT: simply appending "Let's think step by step" to a query improved accuracy on arithmetic reasoning tasks from 10.4% to 40.7% in early research. Advanced descendants of CoT have since emerged, including Self-Consistency (which boosted GSM8K accuracy by +17.9 percentage points through majority voting across multiple reasoning paths), Tree-of-Thought prompting from Princeton and Google DeepMind enabling parallel exploration with backtracking, and Microsoft Research's Chain-of-Reasoning framework introduced in June 2025.

However, the picture is far more nuanced than the tweet suggests. Fine-tuning remains a powerful tool: NVIDIA demonstrated state-of-the-art math performance using supervised fine-tuning on millions of problems, Stanford researchers showed 10-15% accuracy improvements on the MATH dataset using sequential fine-tuning with Mistral-7B, and 2025 research found that fine-tuning on as few as 1,000 examples can produce reasoning performance comparable to top models. The real expert answer likely involves combining both approaches — not dismissing either one.

Fact Check

Evidence from both sides

Supporting Evidence

1

CoT outperformed fine-tuned models on key benchmarks

On the GSM8K math word problem dataset, PaLM 540B with CoT prompting achieved 58% accuracy, beating the prior state-of-the-art of 55% from a fine-tuned GPT-3 175B model with a verifier, according to the original 2022 Google research paper.

2

CoT requires no model weight modification

The original CoT paper explicitly noted that while generating reasoning steps was "previously accomplished via fine-tuning," CoT achieves this through prompting alone, requiring neither large training datasets nor changes to the model's parameters — making it far more resource-efficient.

3

Zero-shot CoT delivers dramatic gains with minimal effort

Simply adding "Let's think step by step" to a prompt improved arithmetic reasoning accuracy from 10.4% to 40.7%, demonstrating that substantial reasoning improvements are possible without any fine-tuning or few-shot examples.
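The zero-shot variant is even simpler to implement: no exemplars at all, just the trigger phrase appended before the model's completion is sampled. A minimal sketch (the model call is again omitted):

```python
# Zero-shot CoT: append the trigger phrase so the model begins its completion
# with step-by-step reasoning rather than a bare answer.

ZERO_SHOT_TRIGGER = "Let's think step by step."

def zero_shot_cot(question: str) -> str:
    """Return the question formatted with the zero-shot CoT trigger."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"
```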

4

CoT has become an essential production technique

Industry practitioners now describe CoT as an "essential practice" for production LLM applications involving sequential decision-making, elevating it from an experimental curiosity to an indispensable tool in the AI engineer's toolkit.

5

Advanced CoT variants continue to push boundaries

Self-Consistency decoding, introduced by Google Research, achieved +17.9 percentage points in absolute accuracy gains on GSM8K by sampling multiple reasoning paths and using majority voting, further widening the gap over basic fine-tuning approaches.
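Self-consistency can be sketched as: sample several independent reasoning chains at a non-zero temperature, extract each chain's final answer, and keep the majority answer. In the sketch below, `sample_chain` is a hypothetical stand-in for a model call, and answers are taken to be the last number in each chain (one reasonable convention for GSM8K-style problems, not the paper's exact parser).

```python
# Self-consistency sketch: majority vote over the final answers of several
# independently sampled reasoning chains.

import re
from collections import Counter
from typing import Callable, Optional

def extract_answer(chain: str) -> Optional[str]:
    """Pull the last number out of a reasoning chain, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None

def self_consistency(sample_chain: Callable[[], str], n: int = 5) -> Optional[str]:
    """Sample n chains and return the most common extracted answer."""
    answers = [extract_answer(sample_chain()) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None

# Toy usage with canned chains standing in for stochastic model samples:
canned = iter([
    "5 + 6 = 11. The answer is 11.",
    "2 * 3 = 6, plus 5 is 11. The answer is 11.",
    "I think the answer is 12.",
])
print(self_consistency(lambda: next(canned), n=3))  # majority answer: "11"
```

The design choice that matters is sampling with temperature > 0 so the chains actually differ; with greedy decoding every chain is identical and the vote is pointless.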

Contradicting Evidence

1

Fine-tuning remains highly effective for math domains

NVIDIA research demonstrated that Supervised Fine-Tuning using a dataset of millions of math problems with reasoning traces achieved state-of-the-art performance on math benchmarks with the open-source Qwen2.5-Math-1.5B model, showing fine-tuning is far from obsolete.

2

Small-scale fine-tuning can match top reasoning models

Research published in May 2025 showed that supervised fine-tuning on as few as 1,000 examples can enable a pre-trained LLM to reason effectively, with performance comparable to leading reasoning models — challenging the notion that fine-tuning is an inadequate approach.

3

CoT fails with smaller models

The benefits of CoT prompting are an emergent property of models with roughly 100 billion parameters or more. In smaller models, CoT can produce illogical reasoning chains and actually deliver worse accuracy than standard prompting, limiting its practical applicability.

4

CoT can underperform direct answering

A study titled "The Curse of CoT" found that CoT and its variants sometimes underperform direct answering on pattern-based in-context learning tasks, suggesting that explicit step-by-step reasoning can disrupt a model's implicit reasoning mechanisms.

5

Fine-tuning delivers consistent, measurable gains

Stanford researchers using Multi-Task Sequential Fine-Tuning and Logic-Enhanced Sequential Fine-Tuning with Mistral-7B demonstrated approximately 10-15% accuracy improvements on the MATH dataset compared to a non-fine-tuned baseline, proving fine-tuning's reliability as a strategy.

6

Fine-tuning and CoT can conflict

Research has shown that fine-tuning can sometimes decrease CoT accuracy, especially in smaller LLMs and on math-reasoning datasets like GSM8K, suggesting the two approaches do not always complement each other cleanly.

This article was AI-generated from real-time signals discovered by PureFeed.
