Google Research on AI’s Reasoning Skills: How Chain-of-Thought Prompting Can Double Math Benchmark Performance
Discover how Google Research's new prompting method can significantly boost a language model's accuracy on complex reasoning tasks.
  • Publication: Google Brain Team
  • Publication Date: January 10, 2023
  • Organizations mentioned: Google Research, Google Brain Team
  • Publication Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou
  • Technical background required: Medium
  • Estimated read time (original text): 15 minutes
  • Sentiment score: 75% positive (100% being most positive)

TLDR

In this report, researchers at Google’s Brain Team explore chain-of-thought prompting with large language models and demonstrate its effectiveness on complex reasoning tasks. The method prompts the model with a few exemplars in which a series of intermediate reasoning steps leads to the final answer, improving the model’s ability to perform multi-step reasoning. The team shows that models like PaLM can achieve state-of-the-art performance on benchmarks like the GSM8K math word problems when prompted to produce intermediate reasoning steps.

Methodology:

  • The methodology involves few-shot prompting with “chain-of-thought” exemplars, each pairing a problem (input) with intermediate reasoning steps and the final answer (output); a prompt sketch follows this list.
  • Experiments were conducted on three LLMs: GPT-3 with 175 billion parameters (the learned statistical weights that govern how the model generates text), LaMDA with up to 137B, and PaLM with up to 540B, using greedy decoding to sample model responses.
  • The performance of chain-of-thought prompting was evaluated against standard prompting and various ablations on multiple reasoning benchmarks.
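To make the exemplar format concrete, here is a minimal sketch of how such a prompt can be assembled. The worked example is adapted from the paper’s running tennis-ball exemplar; the build_cot_prompt helper and its usage are illustrative assumptions, not the authors’ code.

```python
# Minimal sketch of few-shot chain-of-thought prompting.
# The exemplar pairs a question with intermediate reasoning steps
# that lead to the final answer; at inference time the model is
# expected to imitate this pattern for a new question.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend chain-of-thought exemplars to a new question (illustrative helper)."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
print(prompt)  # Send this to a text-completion model with greedy decoding.
```

Greedy decoding here corresponds to sampling with temperature 0 in most completion APIs, matching the paper’s evaluation setup.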

Key findings:

  • Chain-of-thought prompting significantly outperforms standard prompting, especially in sufficiently large models (around 100B parameters or larger).
  • The method yielded striking improvements on complex tasks, such as doubling performance on the GSM8K math word problems benchmark for the largest models tested.
  • The approach facilitated length generalization in symbolic reasoning tasks, enabling models to handle longer sequences than those seen in few-shot exemplars (see the task sketch after this list).
  • Performance gains from chain-of-thought prompting were robust across annotators, exemplars, and language models.
  • While chain-of-thought reasoning emerged due to model scale, it remained challenging to induce reasoning in smaller models.
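The length-generalization finding refers to symbolic tasks such as last-letter concatenation, where the few-shot exemplars show short inputs (e.g., two-word names) but the model is evaluated on longer ones. Below is a minimal sketch of the task’s ground truth, with illustrative names:

```python
# Last-letter concatenation: the symbolic reasoning task used to test
# length generalization. The ground truth concatenates the last letter
# of each word in the input.

def last_letter_concat(words: list[str]) -> str:
    """Ground-truth answer for the last-letter concatenation task."""
    return "".join(word[-1] for word in words)

# Few-shot exemplars show two-word inputs...
assert last_letter_concat(["Elon", "Musk"]) == "nk"
# ...while evaluation probes longer, out-of-distribution inputs, which
# chain-of-thought prompting handled far better than standard prompting.
assert last_letter_concat(["Johann", "Sebastian", "Bach"]) == "nnh"
```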

Recommendations:

  • Chain-of-thought prompting can be applied to a wide range of reasoning tasks, but it is most effective when tasks are complex, models are large, and standard prompting shows flat scaling curves.
  • The method can be further improved by addressing calculator errors and symbol mapping errors, which are common in incorrect chains of thought; a calculator post-processing sketch follows this list.
  • Future research should focus on improving the factuality of language model generations and exploring how to induce reasoning in smaller models.
  • Given the variance in performance due to prompt engineering, developing robust methods for generating chain-of-thought annotations could enhance the applicability of this approach.
  • Additional studies are needed to understand the emergent properties of chain-of-thought reasoning as the model scale increases.
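To illustrate the calculator recommendation above, here is a minimal sketch of post-processing a generated chain of thought: re-evaluate any “a op b = c” equations it contains and replace wrong right-hand sides. The regex and eval-based arithmetic are simplifying assumptions for this sketch, not the authors’ implementation.

```python
import re

# Find simple arithmetic equations like "5 + 6 = 12" in generated text.
EQUATION = re.compile(r"(\d+(?:\s*[-+*/]\s*\d+)+)\s*=\s*(\d+)")

def fix_arithmetic(chain_of_thought: str) -> str:
    """Recompute each equation's right-hand side with an external calculator."""
    def recompute(match: re.Match) -> str:
        expression = match.group(1)
        # eval is acceptable here because the regex only admits
        # digits, whitespace, and the four basic operators.
        return f"{expression} = {eval(expression)}"
    return EQUATION.sub(recompute, chain_of_thought)

print(fix_arithmetic("5 + 6 = 12. The answer is 12."))
# -> "5 + 6 = 11. The answer is 12."
# Only the equation is repaired; correcting the restated final answer
# would require an additional pass.
```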

Thinking Critically

Implications:

  • Adopting chain-of-thought prompting across organizations could lead to more efficient problem-solving processes in various fields. For example, this approach could improve the debugging and testing phases in software development by providing a step-by-step breakdown of logic errors. In education, it could aid in teaching complex concepts by offering clear reasoning paths for students to follow.
  • Organizations that don’t implement such prompting might miss out on the potential to enhance the performance of their existing LLMs. This could result in slower progress in fields that require complex reasoning, such as research and development.
  • The broader adoption of chain-of-thought prompting could shift LLM development towards creating models that are not only large but also inherently better at reasoning, thus changing the landscape of AI research and its applications in industry.

Alternative perspectives:

  • While the report suggests significant improvements through chain-of-thought prompting, the method’s success may be context-dependent. In scenarios where reasoning tasks are less structured, or where incorrect intermediate steps can still yield the correct answer by chance (e.g., multiple-choice questions), the reliability of this method may be questioned.
  • The report emphasizes the emergent nature of reasoning abilities at large scales, but this might overshadow the potential of smaller models that are specialized or fine-tuned for specific tasks. Alternative approaches could enable smaller models to perform complex reasoning without chain-of-thought prompting.
  • The findings are based on experiments with only a few LLMs and may not generalize across all models or tasks. It’s conceivable that other models or prompting techniques could yield similar or better results, and further research is needed to explore the full spectrum of possibilities.

AI predictions:

  • As AI research advances, we expect to see the development of models that incorporate chain-of-thought reasoning as a fundamental feature rather than an add-on prompted by exemplars.
  • There will likely be an increase in the creation of datasets that include not just answers but also detailed reasoning paths, catering to the training and evaluation of AI models with chain-of-thought capabilities.
  • The success of chain-of-thought prompting may inspire new interdisciplinary collaborations, where experts from various fields such as cognitive science and linguistics work alongside AI researchers to better understand and implement human-like reasoning in language models.

Glossary

  • GSM8K: A benchmark dataset of roughly 8,500 grade-school math word problems used to evaluate the reasoning abilities of language models.
  • CSQA (CommonsenseQA): A commonsense reasoning dataset that requires models to answer questions about the world involving complex semantics that often call for prior knowledge.
  • Emergent ability: A phenomenon where certain abilities, such as chain-of-thought reasoning, only arise at a certain scale of model parameters and are not evident in smaller models.
  • Symbol mapping error: An error type where the chain of thought is correct except for incorrect usage of numerical symbols, which could be corrected by modifying only the equations without changing the words.
  • One step missing error: An error type where the chain of thought is almost correct but is missing a single reasoning step.
  • Semantic understanding error: An error type where the chain of thought has major errors in understanding the semantics of the problem.
  • Incoherent chain of thought error: An error type where the chain of thought contains statements that do not logically follow from previous ones or violate basic world knowledge.
