ReAct: When Language Models Learn to Think and Do
The Gist
ReAct (Reasoning + Acting) was the breakthrough paper that showed language models don’t have to choose between two worlds: pure reasoning or pure action.
Before ReAct, you had a fork in the road:
- Chain-of-Thought (CoT): Models think step-by-step but stay in their own heads. Result: eloquent hallucinations.
- Action-only agents: Models just act—call APIs, search, click—without explicit reasoning. Result: rigid, hard to debug.
ReAct merged these paths into a Thought → Action → Observation loop. The model reasons about what to do, takes an action (like searching Wikipedia), observes the result, and uses that real observation to refine its next thought. It’s the difference between a student writing an essay in a vacuum versus one that can look things up as they write.
The payoff is massive:
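- HotpotQA (multi-hop question answering): in the paper’s human error analysis, hallucination accounted for 56% of CoT’s failures and 0% of ReAct’s. (On raw exact match, ReAct alone was slightly behind CoT, 27.4 vs. 29.4 with PaLM-540B; the best results came from combining the two.)
- ALFWorld (interactive household simulation): Action-only baseline: 45% success. ReAct: 71%.
- Fever (fact verification): ReAct beats CoT on accuracy (60.9% vs. 56.3% with PaLM-540B) and stays more faithful to retrieved facts.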
This wasn’t just incremental—it reframed what agents could be.
Why It Matters Now
If you’ve used Claude’s tool calling, GPT-4’s function calling, or any LangChain agent, you’ve been using ReAct’s DNA.
The Thought-Action-Observation loop is now the canonical design pattern for agentic AI. When you see:
Thought: I need to find the population of Nairobi
Action: Search["Nairobi population"]
Observation: [Wikipedia result: ~4.3 million]
Thought: Now I have the data, let me verify this...
…you’re looking at ReAct’s legacy, even if the paper isn’t cited.
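To make the format concrete, here is a minimal sketch of how a transcript like the one above could be split back into steps. The `Step` dataclass and `parse_trace` function are hypothetical names for this article, not part of the paper or of any library, and the sketch assumes each field sits on its own line.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    thought: str
    action: Optional[str] = None       # tool name, e.g. "Search"
    argument: Optional[str] = None     # raw text between the brackets
    observation: Optional[str] = None


LINE = re.compile(r"^(Thought|Action|Observation):\s*(.*)$")
ACTION = re.compile(r"^(\w+)\[(.*)\]$")


def parse_trace(text: str) -> List[Step]:
    """Group a ReAct-style transcript into steps; each Thought opens a new step."""
    steps: List[Step] = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # ignore lines that do not follow the format
        kind, body = m.groups()
        if kind == "Thought":
            steps.append(Step(thought=body))
        elif not steps:
            continue  # an Action/Observation before any Thought is skipped
        elif kind == "Action":
            am = ACTION.match(body)
            if am:
                steps[-1].action, steps[-1].argument = am.groups()
        else:
            steps[-1].observation = body
    return steps


trace = """Thought: I need to find the population of Nairobi
Action: Search["Nairobi population"]
Observation: [Wikipedia result: ~4.3 million]
Thought: Now I have the data, let me verify this..."""

steps = parse_trace(trace)
for s in steps:
    print(s)
```

Keeping the trace as structured steps rather than one long string is one way to make agent runs easier to log and debug, which is a large part of the pattern's appeal.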
The reason ReAct became foundational is subtle: it proved that grounding reasoning in reality works. Not just theoretically—empirically, on real benchmarks. That insight reshaped how ML engineers think about building systems that need to be both intelligent and trustworthy.
The Hallucination Problem & How ReAct Fixed It
Why CoT Hallucinates
Chain-of-Thought is powerful but fragile. When a model has to reason entirely in its own latent space, without access to external facts, it tends to confabulate. On HotpotQA, a benchmark that requires combining facts from multiple Wikipedia passages, the paper’s human error analysis attributed 56% of CoT’s failures to hallucinated facts or reasoning. That’s not a typo: hallucination was CoT’s single largest failure mode.
The reason is simple: the model was trained to predict plausible continuations of text. Without real information to anchor its reasoning, it predicts plausible-sounding answers. A question like “Which basketball player co-wrote a book with a chess grandmaster?” requires two separate facts. CoT might invent them both.
ReAct’s Grounding Strategy
ReAct’s solution was to interrupt the reasoning loop with observations from external reality. Instead of:
Q: Which basketball player co-wrote a book with a chess grandmaster?
Reasoning: Michael Jordan wrote books... Magnus Carlsen wrote books...
Answer: Michael Jordan and Magnus Carlsen
ReAct does:
Thought: I need to find basketball players who co-wrote books.
Action: Search["basketball player co-wrote book"]
Observation: [Result 1: Josh McRoberts... Result 2: ...]
Thought: Let me check if any of these co-wrote with a chess grandmaster.
Action: Search["chess grandmaster autobiography co-author"]
Observation: [Result: Hans Niemann... Hans Ree... ]
Thought: No match yet. Let me search more specifically.
Action: Search["[player name] chess grandmaster book"]
...
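The observation forces the model to work with real data. If a fact isn’t in the search results, the model has far less room to invent it. In the paper’s error analysis, hallucination fell from 56% of CoT’s failures to 0% of ReAct’s; ReAct’s failures came mostly from reasoning errors and uninformative search results instead.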
The Numbers
| Task | Baseline | ReAct |
|---|---|---|
| HotpotQA | CoT: hallucination behind 56% of failures | Hallucination behind 0% of failures |
| ALFWorld | Action-only: 45% success | 71% success |
| Fever | CoT: 56.3% accuracy | 60.9% accuracy (more faithful) |
The mechanism is straightforward: ground your reasoning in real observations, and you sharply reduce hallucination. It does not remove every error, because the model can still reason badly over good evidence.
The Lineage: From CoT to Modern Agents
ReAct built directly on Wei et al.’s Chain-of-Thought (2022), which showed that prompting models to reason step-by-step improved accuracy. But CoT was reasoning-only. The parallel thread was tool use in language models: browsing agents such as WebGPT before ReAct, then Toolformer and function calling APIs after it.
ReAct’s genius was combining them: let models reason and call tools in a tight feedback loop.
The lineage:
- Chain-of-Thought (Wei et al., 2022): Showed that reasoning helps.
- ReAct (Yao et al., arXiv 2022, ICLR 2023): Showed that grounded reasoning + action sharply reduces hallucination.
- Toolformer (Schick et al., 2023): Fine-tune models to call tools mid-generation.
- Function Calling APIs (OpenAI in 2023, other providers including Anthropic afterwards): Productize the ReAct loop.
- Modern Agents (LangChain, AutoGPT, Claude SDK): Industrialize the loop.
Each step built on the insight: the model’s internal reasoning is most powerful when anchored to external reality.
Rubber-Ducking the Jargon
- ReAct: Reasoning + Acting. The core contribution.
- Reasoning trace: The intermediate thoughts the model generates. “I need to find X.”
- Action space: The set of available actions. In the paper: Wikipedia search and lookup for the question-answering tasks, text commands in ALFWorld, and search/click actions in WebShop.
- Observation: The result of an action. The actual data returned.
- HotpotQA: Benchmark requiring multi-hop reasoning over Wikipedia. “Which film was released first, A or B?” → search for A’s release date, search for B’s, compare.
- ALFWorld: Simulated household environment. “Put the clean mug in the cupboard.” The model must take actions in a text-based sim.
- Grounding: Anchoring reasoning in real external data (the opposite of hallucination).
- Hallucination rate: Percentage of generated answers that contain false facts.
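One subtlety in the last definition: a figure like "56% hallucination" can mean a share of all answers or a share of failed answers, and the two are very different numbers. The sketch below computes the second kind from hand-assigned labels. The `failure_shares` helper and the label data are hypothetical and only illustrate the arithmetic; they are not the paper's data.

```python
from collections import Counter
from typing import Dict, Iterable


def failure_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return each failure category's share of all failures, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no failures were labeled, so there is nothing to report
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical hand labels for 10 failed answers (illustrative only).
labels = ["hallucination"] * 5 + ["reasoning_error"] * 3 + ["label_ambiguity"] * 2

shares = failure_shares(labels)
print(shares)
```

If those 10 failures came out of 40 total answers, hallucination would explain half of the failures but only 12.5% of all answers, which is why it pays to check the denominator before quoting a rate.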
What to Watch Out For
ReAct is beautiful, but it has limits worth knowing:
Model-specific: Tested mainly on PaLM-540B, with additional GPT-3 (text-davinci-002) runs. Smaller models might struggle with the multi-step reasoning format. (Though subsequent work shows the pattern scales.)
Action space is constrained: The paper’s question-answering setup exposes only a simple Wikipedia API (search, lookup, finish), and its interactive tasks use fixed, environment-defined actions. Not every task has a clean action set. What if you need to reason about something truly novel?
ALFWorld is simulated: It’s a text-based household sim, not the real world. The 71% success is impressive but doesn’t guarantee real-world performance.
Prompting is brittle: The format of the Thought-Action-Observation prompt matters. Small changes in phrasing or in the choice of few-shot examples can shift success rates, so results are sensitive to how you ask the model to format its reasoning.
Doesn’t solve the “when” problem: ReAct doesn’t tell you when to reason vs. when to act. Should you search first or think first? The paper doesn’t provide a principled answer. Modern agents often solve this with learned routing policies.
Limited to language models: Evaluated mostly by few-shot prompting autoregressive LLMs, with some fine-tuning of smaller PaLM models. What about reasoning-optimized models that think before responding?
So What?
If you’re building an agent—a chatbot that answers questions, a code assistant, an automation system—ReAct’s template is your starting point:
- Generate a thought (what do I need to do?).
- Propose an action (how do I get the information?).
- Observe the result (what did I learn?).
- Repeat until you reach an answer or conclusion.
And the key lesson: ground your reasoning in real data. Don’t let the model think in a vacuum. Every step should be anchored to something external—a search result, a code execution output, a database query. That’s how you get from “confidently wrong” to “actually correct.”
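The four steps and the grounding lesson can be seen end to end in a toy run. This is a minimal sketch under loud assumptions: `scripted_model` stands in for a real LLM call, `search` returns canned text from a tiny `FACTS` dictionary, and `run_loop` is a hypothetical helper written for this article. The point is only that the final answer is copied from an observation rather than from the model's memory.

```python
import re
from typing import Callable, Dict, Optional, Tuple


def run_loop(question: str,
             generate: Callable[[str], str],
             tools: Dict[str, Callable[[str], str]],
             max_steps: int = 5) -> Tuple[Optional[str], str]:
    """Thought -> Action -> Observation until Finish[...] or the step budget ends."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(context)          # thought plus proposed action
        context += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match is None:
            break                         # no parseable action: stop rather than guess
        name, arg = match.groups()
        if name == "Finish":
            return arg, context
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown action {name}"
        context += f"Observation: {observation}\n"   # ground the next thought
    return None, context


# Hypothetical stand-ins for a search tool and a language model.
FACTS = {"Nairobi population": "about 4.3 million"}


def search(query: str) -> str:
    return FACTS.get(query, "No results found.")


def scripted_model(context: str) -> str:
    if "Observation:" not in context:
        return ("Thought: I need the population of Nairobi.\n"
                "Action: Search[Nairobi population]")
    last_observation = context.rsplit("Observation: ", 1)[1].strip()
    return ("Thought: The search result answers the question.\n"
            f"Action: Finish[{last_observation}]")


answer, transcript = run_loop(
    "What is the population of Nairobi?", scripted_model, {"Search": search}
)
print(answer)
print(transcript)
```

Swapping `scripted_model` for a real model call and `search` for a real retrieval tool keeps the same control flow; the step budget is there so a confused model cannot loop forever.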
Reproduction & Implementation
Setting Up
To reproduce ReAct experiments locally, you’ll need:
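- A language model API or local model (the paper uses PaLM-540B, with additional GPT-3 text-davinci-002 runs).
- Environment access: a Wikipedia API for HotpotQA and Fever, plus the ALFWorld simulator. A web search tool, a calculator, or a code interpreter are optional extensions beyond the paper’s setup.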
- Benchmark datasets: HotpotQA, ALFWorld, Fever.
Example ReAct Prompt Format
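Here’s how you’d prompt a model to use ReAct (simplified; the paper itself relies on few-shot example trajectories rather than an instruction header, and Calculator is an illustrative extra tool):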
You will answer questions by interleaving Thought, Action, and Observation steps.
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Search[query], Lookup[topic], Calculator[expression], Finish[answer]]
Observation: the result of the action
... (Thought/Action/Observation can repeat N times)
Thought: I now know the final answer
Action: Finish[answer]
Begin!
Question: What is the elevation of the second-highest mountain in North America?
Thought: I need to find the second-highest mountain in North America and its elevation.
Action: Search[second highest mountain North America]
Observation: Denali is the highest at 20,310 ft. The second highest is Mount Logan...
...
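The prompt above can be assembled from a tool list so that the advertised actions never drift from the tools you actually wire up. This is a minimal sketch: `HEADER` and `build_prompt` are hypothetical names, and the header text simply mirrors the simplified format shown here, not the paper's few-shot prompts.

```python
from typing import Dict

# Instruction header mirroring the simplified format; {actions} is filled in below.
HEADER = (
    "You will answer questions by interleaving Thought, Action, and Observation steps.\n"
    "Use the following format:\n"
    "Question: the input question you must answer\n"
    "Thought: you should always think about what to do\n"
    "Action: the action to take, should be one of [{actions}]\n"
    "Observation: the result of the action\n"
    "... (Thought/Action/Observation can repeat N times)\n"
    "Begin!\n"
)


def build_prompt(question: str, action_templates: Dict[str, str]) -> str:
    """Render the header with one Name[slot] entry per tool, then open the first Thought."""
    actions = ", ".join(f"{name}[{slot}]" for name, slot in action_templates.items())
    return HEADER.format(actions=actions) + f"Question: {question}\nThought:"


prompt = build_prompt(
    "What is the elevation of the second-highest mountain in North America?",
    {"Search": "query", "Lookup": "topic", "Finish": "answer"},
)
print(prompt)
```

When sending this to a model, it usually helps to stop generation at "Observation:" so the model proposes an action but the real tool output, not the model, fills in what comes next.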
Pseudo-Code for the Loop
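import re

def parse_action(response):
    """Return (name, argument) from the last 'Action: Name[arg]' line, or None."""
    matches = re.findall(r"Action:\s*(\w+)\[(.*?)\]\s*$", response, flags=re.MULTILINE)
    return matches[-1] if matches else None

def react_agent(question, model, tools, max_steps=10):
    """
    Run the ReAct loop until the agent outputs Finish[answer].
    `model.generate(prompt, stop_at)` is a placeholder for your LLM client;
    `tools` maps action names to callables that take one string argument.
    """
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        # Generate thought + action, stopping before the model can
        # invent its own observation
        response = model.generate(prompt=context, stop_at=["Observation:"])
        context += response
        # Parse action
        action = parse_action(response)
        if action is None:
            context += "\nObservation: Could not parse an action. Use Name[argument].\n"
            continue
        name, arg = action
        if name == "Finish":
            return arg
        # Execute action and observe; unknown tools are reported, not raised
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown action: {name}"
        context += f"\nObservation: {observation}\n"
    return "Max steps reached"
Key Resources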
- Original Paper: ReAct: Synergizing Reasoning and Acting in Language Models | arXiv:2210.03629
- GitHub: ysymyth/ReAct – Official implementation, datasets, and prompts.
- HotpotQA Benchmark: hotpotqa.github.io
- ALFWorld Environment: alfworld.github.io
- Further Reading: Follow-up work on multi-agent ReAct (Sap et al., 2023) and structured reasoning (Yang et al., 2023).
Takeaway: ReAct showed that the best way to make language models reliable is to let them think out loud while grounding that thinking in real external data. It’s a simple idea—almost obvious in hindsight—but it fundamentally changed how we build AI agents.