58 Ways to Talk to an LLM — A Field Guide to Prompt Engineering
The Hook
Prompt engineering has a reputation problem. Half the internet treats it like a mystical art — “just add ‘think step by step’ and watch the magic happen.” The other half dismisses it as a temporary hack that better models will render obsolete. Both camps are wrong, and a massive survey paper finally brings the receipts.
“The Prompt Report: A Systematic Survey of Prompting Techniques” by Sander Schulhoff and 28 co-authors is the most ambitious attempt yet to catalog, categorize, and evaluate the full landscape of prompt engineering. We’re talking 58 text-based prompting techniques, 40 techniques for other modalities (image, audio, video, 3D, etc.), and a vocabulary of 33 standardized terms — assembled from over 1,500 papers across the field.
The core contribution? A taxonomy. Not a tutorial, not a “top 10 prompts” listicle — a systematic framework for understanding why different prompting strategies work, when they apply, and how they relate to each other. Think of it as the periodic table for talking to AI.
The Argument
The authors start from a simple observation: despite being one of the most widely practiced skills in modern AI, prompt engineering has no shared language. The same technique gets called different names across papers. Terms like “prompt,” “context,” and “exemplar” are used inconsistently. Researchers in one subfield don’t cite researchers in another. The result is a fragmented landscape where practitioners reinvent the wheel constantly.
The paper’s thesis is that this fragmentation isn’t just annoying — it actively holds the field back. Without a shared taxonomy, you can’t systematically compare techniques. Without systematic comparison, you can’t build a real science of prompting. You just have a collection of tricks.
So they built the taxonomy from the ground up.
The vocabulary alone is valuable. The paper defines a prompt as consisting of a directive (the core instruction), optional exemplars (examples), optional output formatting instructions, style instructions, role assignment, and additional information (context documents, etc.). This decomposition sounds obvious, but having it codified means we can finally talk precisely about which component of a prompt is doing the heavy lifting in any given technique.
The technique taxonomy organizes 58 text-based methods into clean categories. Here’s the landscape at a high level:
Zero-Shot techniques — prompting strategies that don’t require examples. This includes the now-famous Chain-of-Thought (CoT) prompting (“let’s think step by step”), but also less well-known methods like Emotion Prompting (appending motivational statements actually improves performance — LLMs respond to “this is very important to my career”), SimToM (perspective-taking for theory-of-mind tasks), and Rephrase and Respond (having the model rephrase the question before answering).
Few-Shot techniques — providing examples to guide the model. The paper maps out the surprisingly complex design space here: how to select examples (similarity-based, diversity-based, reinforcement-learning-based), how to order them (recency bias is real — models weight later examples more heavily), and even how to format them.
Thought Generation — the Chain-of-Thought family. The paper traces the full lineage from the original CoT paper through Tree of Thoughts (branching reasoning paths), Graph of Thoughts (allowing cycles and merging), and Algorithm of Thoughts (using algorithmic exploration patterns like DFS). Each adds structure to the reasoning process, trading simplicity for performance on complex tasks.
Decomposition — breaking complex problems into subproblems. Least-to-Most prompting, Decomposed Prompting, Plan-and-Solve, and DECOMP each take different approaches to the same insight: LLMs are better at simple steps than giant leaps.
Ensembling — running multiple prompts and aggregating. DENSE, DiVeRSe, Max Mutual Information, Universal Self-Consistency, and Meta-Prompting all find different ways to extract signal from multiple model outputs. The key insight: model responses are stochastic, so sampling multiple times and aggregating is almost always better than a single shot.
Self-Criticism — having the model check its own work. Self-Refine, Self-Verification, Chain-of-Verification, Cumulative Reasoning. These techniques exploit the asymmetry that verifying an answer is often easier than generating it, even for LLMs.
The Lineage
This paper sits at a very specific moment in the field’s history. Prompt engineering emerged from two converging threads:
The scaling laws revolution (2020-2022) — GPT-3 demonstrated that sufficiently large language models could perform tasks they weren’t explicitly trained on, purely through in-context learning. This created the entire field of prompt engineering overnight. Suddenly, how you talked to the model mattered as much as how you trained it.
The Chain-of-Thought breakthrough (Wei et al., 2022) — the discovery that adding “let’s think step by step” could dramatically improve reasoning performance was a paradigm shift. It proved that prompts weren’t just input formatting — they could fundamentally alter the model’s reasoning process. This spawned an entire subfield of “thought generation” techniques.
The Prompt Report synthesizes what happened after these two moments, as researchers went deep on extending, combining, and stress-testing prompting strategies across domains. The paper’s related work section reveals a key tension: the automation vs. manual design debate. One camp (DSPy, OPRO, APE) argues we should let models optimize their own prompts. The other camp argues that human-designed prompts remain more interpretable and controllable. The Prompt Report doesn’t pick a side — it documents both, which is arguably more useful.
The authors also connect prompting to the agent paradigm — dedicating a full section to how prompting techniques compose into agentic systems with tool use, planning, and memory. This bridges the gap between “prompt engineering” as a solo skill and “prompt engineering” as a component of larger AI architectures.
The Deep Dive
The Taxonomy That Matters Most
The paper’s Figure 1 is worth the read alone — a sprawling tree diagram organizing all 58 techniques by category. But the real value is in the relationships it reveals.
Take Chain-of-Thought as the trunk. From it branch:
- Tree of Thoughts — instead of a single reasoning chain, explore multiple branches and backtrack when paths fail. Best for problems with clear success/failure states (like puzzles or planning).
- Graph of Thoughts — extends trees by allowing reasoning paths to merge. If two branches reach the same intermediate conclusion, combine them. Better for problems where partial solutions can be synthesized.
- Algorithm of Thoughts — constrains exploration to follow algorithmic patterns (BFS, DFS). Trades flexibility for efficiency by pruning the search space.
Each is strictly more powerful than the last, but also strictly more complex to implement and more expensive in token usage. The paper makes this tradeoff explicit, which is rare and helpful.
The Benchmark Reality Check
The authors didn’t just catalog — they tested. They ran a benchmark study across multiple prompting techniques on the MMLU dataset with GPT-4, and the results are sobering:
Zero-shot Chain-of-Thought (the “think step by step” trick everyone uses) showed mixed results. On some categories it helped. On others it actually hurt performance compared to plain zero-shot prompting. The effect was inconsistent across domains — strong gains on logical reasoning, but sometimes negative on knowledge-heavy questions where overthinking led the model astray.
Few-shot prompting generally helped, but the gains were smaller than many practitioners assume — and highly sensitive to example selection and ordering.
Emotion Prompting (adding “this is important to my career” to the prompt) showed surprisingly consistent small gains across categories. The mechanism isn’t well understood, but the empirical result is robust.
The takeaway: there is no universal “best prompt.” The optimal technique depends on the task, the model, and often the specific inputs. The paper gives you the map — you still have to navigate.
The Security Angle
The paper dedicates serious space to prompt injection and adversarial attacks — something most prompting tutorials skip entirely. The taxonomy of attacks is organized into two categories:
Operator-level threats — where the developer building the system is the target. This includes training data poisoning, backdoor attacks in fine-tuned models, and prompt leaking (extracting the system prompt).
User-level threats — where end users try to break the system. Direct injection (“ignore previous instructions”), indirect injection (hiding instructions in retrieved documents), and jailbreaking (using creative framing to bypass safety filters).
The paper catalogs defensive techniques too: XML tagging to separate instructions from data, sandwich defense (repeating the instruction after user input), instruction hierarchy (prioritizing system prompts over user prompts), and more. None are bulletproof, but having the full attack surface mapped is valuable for anyone building production systems.
So What?
The Prompt Report matters for three reasons:
First, it’s a reference. Next time you’re stuck on a prompting problem, you don’t need to Google and hope — you can look up the taxonomy, find techniques designed for your problem class, and try them systematically.
Second, it reveals the field’s maturity. With 58 cataloged techniques and counting, prompt engineering is no longer a set of ad hoc tricks. It’s an emerging engineering discipline with its own design patterns, tradeoffs, and failure modes.
Third, it’s honest about limitations. The benchmark results show that many popular techniques don’t generalize as well as their original papers suggest. That’s not a failure — that’s science. And it’s information most “prompt engineering guides” won’t give you.
If you work with LLMs in any capacity — building products, doing research, or just trying to get better outputs — this is the paper to bookmark. It won’t tell you the one magic prompt. It’ll do something better: give you a framework for finding the right one.
Paper: “The Prompt Report: A Systematic Survey of Prompting Techniques” by Sander Schulhoff et al., arXiv:2406.06608, 2024.
Reproduction & Implementation
Environment Setup
# Core dependencies for reproducing the prompting techniques
pip install openai>=1.0.0 # GPT-4 API (used in benchmarks)
pip install anthropic>=0.30.0 # Claude API
pip install langchain>=0.2.0 # Prompt chaining / agent framework
pip install dspy-ai>=2.0.0 # Programmatic prompt optimization
pip install pandas>=2.0.0
pip install numpy>=1.24.0
pip install matplotlib>=3.7.0 # Visualization
pip install datasets>=2.14.0 # HuggingFace datasets (MMLU)Pseudo-Code: Key Prompting Techniques
# ---- CHAIN-OF-THOUGHT (Zero-Shot) ----
def zero_shot_cot(question: str, llm) -> str:
"""The simplest and most famous technique: append 'think step by step'."""
prompt = f"{question}\n\nLet's think step by step."
return llm.generate(prompt)
# ---- TREE OF THOUGHTS ----
def tree_of_thoughts(problem: str, llm, n_branches=3, max_depth=3):
"""
Explore multiple reasoning paths, evaluate, prune.
More powerful than CoT but more expensive.
"""
def evaluate_thought(thought):
eval_prompt = f"Rate this reasoning (1-10):\n{thought}"
return float(llm.generate(eval_prompt))
frontier = [("", problem)] # (reasoning_so_far, remaining_problem)
for depth in range(max_depth):
candidates = []
for reasoning, remaining in frontier:
# Branch: generate n different next steps
for _ in range(n_branches):
next_step = llm.generate(
f"{reasoning}\nNext step for: {remaining}",
temperature=0.7
)
candidates.append((reasoning + "\n" + next_step, remaining))
# Evaluate and prune to top-k
scored = [(c, evaluate_thought(c[0])) for c in candidates]
scored.sort(key=lambda x: x[1], reverse=True)
frontier = [c for c, s in scored[:n_branches]]
return frontier[0][0] # Best reasoning path
# ---- SELF-REFINE ----
def self_refine(question: str, llm, max_rounds=3):
"""Generate, critique, refine. Exploits verification asymmetry."""
answer = llm.generate(question)
for _ in range(max_rounds):
critique = llm.generate(
f"Question: {question}\nAnswer: {answer}\n\n"
f"Critique this answer. What's wrong or could be improved?"
)
if "looks correct" in critique.lower():
break
answer = llm.generate(
f"Question: {question}\nOriginal answer: {answer}\n"
f"Critique: {critique}\n\nProvide an improved answer."
)
return answer
# ---- EMOTION PROMPTING ----
def emotion_prompt(question: str, llm) -> str:
"""Surprisingly effective: add motivational stakes."""
prompt = (
f"{question}\n\n"
f"This is very important to my career. "
f"Please provide the most accurate and thorough answer possible."
)
return llm.generate(prompt)
# ---- ENSEMBLE: UNIVERSAL SELF-CONSISTENCY ----
def universal_self_consistency(question: str, llm, n_samples=5):
"""Sample multiple answers, pick the most consistent one."""
answers = [
llm.generate(question, temperature=0.7)
for _ in range(n_samples)
]
# Ask the LLM to pick the most consistent answer
selection_prompt = (
f"Question: {question}\n\n"
f"Here are {n_samples} candidate answers:\n"
+ "\n".join(f"{i+1}. {a}" for i, a in enumerate(answers))
+ "\n\nWhich answer is most consistent with the majority? "
f"Return only the number."
)
choice = int(llm.generate(selection_prompt).strip()) - 1
return answers[choice]Resource Links
Official Repository
- The Prompt Report resources: trigaten.github.io/Prompt_Survey_Site
Community Implementations
- DSPy (programmatic prompt optimization): github.com/stanfordnlp/dspy
- LangChain Prompts (prompt templates): github.com/langchain-ai/langchain
- Tree of Thoughts (official implementation): github.com/princeton-nlp/tree-of-thought-llm
- Graph of Thoughts: github.com/spcl/graph-of-thoughts
- Self-Refine: github.com/madaan/self-refine
Further Reading
- Chain-of-Thought Prompting — Wei et al., the original CoT paper
- Tree of Thoughts — Yao et al., deliberate reasoning with LLMs
- Large Language Models Are Human-Level Prompt Engineers — Zhou et al., automatic prompt generation (APE)
- DSPy: Compiling Declarative Language Model Calls — Khattab et al., programmatic prompting