Awesome Deliberative Prompting
How to ask Large Language Models (LLMs) to produce reliable reasoning and make reason-responsive decisions.
deliberation, n.
The action of thinking carefully about something, esp. in order to reach a decision; careful consideration; an act or instance of this. (OED)
Contents
- Success Stories
- Prompting Patterns and Strategies
- Beyond "Let's think step by step"
- Multi-Agent Deliberation
- Reflection and Meta-Cognition
- Text Generation Techniques
- Self-Correction
- Reasoning Analytics
- Limitations, Failures, Puzzles
- Datasets
- Tools and Frameworks
- Other Resources
Success Stories
Striking evidence for effectiveness of deliberative prompting.
- ? The original "chain of thought" (CoT) paper, first to give clear evidence that deliberative prompting works. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." 2022-01-28. [>paper]
- ? Deliberative prompting improves ability of Google's LLMs to solve unseen difficult problems, and instruction-finetuned (Flan-) models are much better at it.
- "Scaling Instruction-Finetuned Language Models." 2022-12-06. [>paper]
- "PaLM 2 Technical Report." 2023-05-17. [>paper]
- ? Deliberative prompting is highly effective for OpenAI's models (Text-Davinci-003, ChatGPT, GPT-4), increasing accuracy in many (yet not all) reasoning tasks in the AGIEval benchmark. "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models." 2023-04-13. [>paper]
- ? Deliberative prompting unlocks latent cognitive skills and is more effective for bigger models. "Challenging BIG-Bench tasks and whether chain-of-thought can solve them." 2022-10-17. [>paper]
- ? Experimentally introducing errors in CoT reasoning traces decreases decision accuracy, which provides indirect evidence for reason-responsiveness of LLMs. "Stress Testing Chain-of-Thought Prompting for Large Language Models." 2023-09-28. [>paper]
- ? Reasoning (about retrieval candidates) improves RAG. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." 2023-10-17. [>paper]
- ? Deliberative reading notes improve RAG. "Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models." 2023-11-15. [>paper]
- ? Good reasoning (CoT) causes good answers (i.e., LLMs are reason-responsive). "Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems." 2023-12-07. [>paper]
- ? Logical interpretation of internal layer-wise processing of reasoning tasks yields further evidence for reason-responsiveness. "Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Model." 2023-12-07. [>paper]
- ? Reasoning about alternative drafts improves text generation. "Self-Evaluation Improves Selective Generation in Large Language Models." 2023-12-14. [>paper]
- ? CoT with carefully retrieved, diverse reasoning demonstrations boosts multi-modal LLMs. "Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models." 2023-12-04. [>paper]
- ? Effective multi-hop CoT for visual question answering. "II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering." 2024-02-16. [>paper]
- ? ? DPO on synthetic CoT traces increases reason-responsiveness of small LLMs. "Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning" 2024-02-23. [>paper] [>code]
Prompting Patterns and Strategies
Prompting strategies and patterns to make LLMs deliberate.
Beyond "Let's think step by step"
Instructing LLMs to reason (in a specific way).
- ? Asking GPT-4 to provide both a correct and a wrong answer boosts accuracy. "Large Language Models are Contrastive Reasoners." 2024-03-13. [>paper]
- ? Guided dynamic prompting increases GPT-4 CoT performance by up to 30 percentage points. "Structure Guided Prompt: Instructing Large Language Model in Multi-Step Reasoning by Exploring Graph Structure of the Text" 2024-02-20. [>paper]
- ? Letting LLMs choose and combine reasoning strategies is cost-efficient and improves performance. "SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures." 2024-02-06. [>paper]
- ? CoA: Produce an abstract reasoning trace first, and fill in the details (using tools) later. "Efficient Tool Use with Chain-of-Abstraction Reasoning." 2024-01-30. [>paper]
- ? Reason over and over again until a verification test is passed. "Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts." 2023-10-23. [>paper]
- ? Generate multiple diverse deliberations, then synthesize them into a single reasoning path. "Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios." 2023-11-14. [>paper]
- ? Survey of CoT regarding task types, prompt designs, and reasoning quality metrics. "Towards Better Chain-of-Thought Prompting Strategies: A Survey." 2023-10-08. [>paper]
- ? Asking an LLM about a problem's broader context leads to better answers. "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." 2023-10-09. [>paper]
- Weighing Pros and Cons: This universal deliberation paradigm can be implemented with LLMs.
- ? A {{guidance}} program that does: 1. Identify Options → 2. Generate Pros and Cons → 3. Weigh Reasons → 4. Decide. [>code]
- ? ? Plan-and-Solve Prompting. "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." 2023-05-06. [>paper] [>code]
- ? Note-Taking. "Learning to Reason and Memorize with Self-Notes." 2023-05-01. [>paper]
- ? Deliberate-then-Generate improves text quality. "Deliberate then Generate: Enhanced Prompting Framework for Text Generation." 2023-05-31. [>paper]
- ? Make LLM spontaneously interleave reasoning and Q/A. "ReAct: Synergizing Reasoning and Acting in Language Models." 2022-10-06. [>paper]
- ? 'Divide-and-Conquer' instructions substantially outperform standard CoT. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models" 2022-05-21. [>paper]
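The four-step pros-and-cons scheme listed above (Identify Options → Generate Pros and Cons → Weigh Reasons → Decide) can be sketched as a plain chain of prompts. This is a minimal illustration and not the linked {{guidance}} program: `LLM` stands for any hypothetical prompt-to-text callable, the prompt wording is an assumption, and `toy_llm` is a canned stand-in that only makes the sketch runnable.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical: any function mapping a prompt to a completion


def deliberate(question: str, llm: LLM) -> Dict[str, object]:
    """Chain four prompts: options -> pros/cons -> weighing -> decision."""
    # 1. Identify options (expects one option per line in the reply).
    raw = llm(f"Question: {question}\nList the available options, one per line.")
    options = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

    # 2. Generate pros and cons for each option separately.
    reasons = {
        opt: llm(f"Question: {question}\nOption: {opt}\nGive pros and cons.")
        for opt in options
    }

    # 3. Weigh the collected reasons against each other.
    summary = "\n\n".join(f"{opt}:\n{text}" for opt, text in reasons.items())
    weighing = llm(f"Question: {question}\n{summary}\nWeigh these reasons.")

    # 4. Decide, restricting the answer to the identified options.
    decision = llm(
        f"Question: {question}\nWeighing: {weighing}\n"
        f"Answer with exactly one of: {', '.join(options)}"
    ).strip()
    return {"options": options, "reasons": reasons,
            "weighing": weighing, "decision": decision}


def toy_llm(prompt: str) -> str:
    """Canned replies (assumption, for demonstration only)."""
    if "List the available options" in prompt:
        return "- take the train\n- take the car"
    if "Give pros and cons" in prompt:
        return "Pro: ...\nCon: ..."
    if "Weigh these reasons" in prompt:
        return "The train's pros outweigh its cons."
    return "take the train"


result = deliberate("How should I travel?", toy_llm)
print(result["options"], "->", result["decision"])
```

Keeping each step in a separate call makes the intermediate reasons inspectable, at the cost of several model calls per question.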
Multi-Agent Deliberation
Let one (or many) LLMs simulate a free controversy.
- ? ? Carefully selected open LLMs that iteratively review and improve their answers outperform GPT-4o. "Mixture-of-Agents Enhances Large Language Model Capabilities." 2024-06-10. [>paper] [>code]
- ? More elaborate and costly multi-agent-system designs are typically more effective, according to this review: "Are we going MAD? Benchmarking Multi-Agent Debate between Language Models for Medical Q&A." 2023-11-19. [>paper]
- ? Systematic peer review is even better than multi-agent debate. "Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration." 2023-11-14. [>paper]
- ? Collective critique and reflection reduce factual hallucinations and toxicity. "N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics." 2023-10-28. [>paper]
- ? ? Delphi-process with diverse LLMs is veristically more valuable than simple debating. "ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs." 2023-09-22. [>paper] [>code]
- ? Multi-agent debate increases cognitive diversity, which in turn increases performance. "Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate." 2023-05-30. [>paper]
- ? Leverage wisdom of the crowd effects through debate simulation. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." 2023-05-23. [>paper]
- ? ? Emulate Socratic dialogue to collaboratively solve problems with multiple AI agents. "The Socratic Method for Self-Discovery in Large Language Models." 2023-05-05. [>blog] [>code]
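The debate-style entries above share a common loop: every agent answers, reads the other agents' answers, revises, and the final answers are aggregated. The following is a generic sketch under stated assumptions, not the protocol of any specific paper listed here: `Agent` is a hypothetical prompt-to-answer callable, the prompt format is invented for illustration, and the toy agents replace real LLM calls.

```python
from collections import Counter
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical: prompt in, answer out


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Agents answer, read each other's answers, revise, then vote."""
    answers = [agent(f"Question: {question}\nAnswer:") for agent in agents]
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            revised.append(agent(
                f"Question: {question}\nYour answer: {answers[i]}\n"
                f"Other agents answered:\n{others}\n"
                "Reconsider and give your final answer:"
            ))
        answers = revised
    # Aggregate by simple majority; ties go to the earliest-seen answer.
    return Counter(answers).most_common(1)[0][0]


def stubborn(answer: str) -> Agent:
    """Toy agent that never changes its mind."""
    return lambda prompt: answer


def conformist(initial: str) -> Agent:
    """Toy agent that adopts the most common answer among its peers."""
    def agent(prompt: str) -> str:
        if "Other agents answered:" not in prompt:
            return initial
        block = prompt.split("Other agents answered:\n")[1].split("\nReconsider")[0]
        return Counter(block.splitlines()).most_common(1)[0][0]
    return agent


agents = [stubborn("42"), stubborn("42"), conformist("41")]
print(debate("What is 6 * 7?", agents))
```

Majority voting is only one possible aggregation step; a separate judge model or a confidence-weighted vote would slot into the same place.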
Reflection and Meta-Cognition
Higher-order reasoning strategies that may improve first-order deliberation.
- ? ? Keeping track of general insights gained from CoT problem solving improves future accuracy and efficiency. "Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models." 2024-06-06. [>paper] [>code]
- ? ? Processing tasks according to their self-assessed difficulty boosts CoT effectiveness. "Divide and Conquer for Large Language Models Reasoning." 2024-01-10. [>paper] [>code]
- ? ? Reflecting on the task allows an LLM to autogenerate more effective instructions, demonstrations, and reasoning traces. "Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models." 2023-10-11. [>paper] [>code]
- ? ? LLM-based AI Instructor devises effective first-order CoT-instructions (open source models improve by up to 20%). "Agent Instructs Large Language Models to be General Zero-Shot Reasoners." 2023-10-05. [>paper] [>code]
- ? ? Clarify→Judge→Evaluate→Confirm→Qualify Paradigm. "Metacognitive Prompting Improves Understanding in Large Language Models." 2023-08-10. [>paper] [>code]
- ? ? Find-then-simulate-an-expert-for-this-problem Strategy. "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm." 2021-02-15. [>paper] [>lmql]
Text Generation Techniques
Text generation techniques, which can be combined with prompting patterns and strategies.
- ? Iterative revision of reasoning in light of previous CoT traces improves accuracy by 10-20%. "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation". 2024-03-08. [>paper]
- ? Pipeline for self-generating & choosing effective CoT few-shot demonstrations. "Universal Self-adaptive Prompting". 2023-05-24. [>paper]
- ? More reasoning (= longer reasoning traces) is better. "The Impact of Reasoning Step Length on Large Language Models". 2024-01-10. [>paper]
- ? Having (accordingly labeled) correct and erroneous (few-shot) reasoning demonstrations improves CoT. "Contrastive Chain-of-Thought Prompting." 2023-11-17. [>paper]
- ? Better problem-solving and deliberation through few-shot trial-and-error (in-context RL). "Reflexion: Language Agents with Verbal Reinforcement Learning." 2023-03-20. [>paper]
- ? External guides that constrain generation of reasoning improve accuracy by up to 35% on selected tasks. "Certified Reasoning with Language Models." 2023-06-06. [>paper]
- ? ? Highly effective beam search for generating complex, multi-step reasoning episodes. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." 2023-05-17. [>paper] [>code]
- ? A minimalistic implementation of Tree-of-Thoughts as plain prompt. [>code]
- ? An experimental LMQL implementation of Tree-of-Thoughts. [>code]
- ? ? LLM auto-generates diverse reasoning demonstrations to be used in deliberative prompting. "Automatic Chain of Thought Prompting in Large Language Models." 2022-10-07. [>paper] [>code]
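The search idea behind the Tree-of-Thoughts entries above can be illustrated with a small beam search over partial reasoning states. This is a sketch under assumptions, not the authors' implementation: `propose` and `score` are hypothetical stand-ins for LLM calls that generate and rate next steps, and the toy problem (build a digit string whose digits sum to a target) only serves to make the code runnable.

```python
from typing import Callable, List


def tree_of_thoughts(
    root: str,
    propose: Callable[[str], List[str]],  # hypothetical: returns candidate next states
    score: Callable[[str], float],        # hypothetical: higher means more promising
    depth: int = 3,
    beam_width: int = 2,
) -> str:
    """Expand every state in the beam, keep the best `beam_width`, repeat."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


TARGET = 7  # toy goal: digits of the final state should sum to this value


def toy_propose(state: str) -> List[str]:
    """Each 'thought' appends one digit to the partial solution."""
    return [state + digit for digit in "123"]


def toy_score(state: str) -> float:
    """Closer to the target digit sum is better (0 is a perfect score)."""
    return -abs(TARGET - sum(int(ch) for ch in state))


best = tree_of_thoughts("", toy_propose, toy_score, depth=3, beam_width=2)
print(best, sum(int(ch) for ch in best))
```

With a real model, `score` would typically be another prompt that rates a partial solution, which is where most of the extra cost of this technique comes from.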
Self-Correction
Let LLMs self-correct their deliberation.
- ? Consistency between multiple CoT-traces is an indicator of reasoning reliability, which can be exploited for self-check / aggregation. "Can We Verify Step by Step for Incorrect Answer Detection?" 2024-02-16. [>paper]
- ? Turn LLMs into intrinsic self-checkers by appending self-correction steps to standard CoT traces for finetuning. "Small Language Model Can Self-correct." 2024-01-14. [>paper]
- ? Reinforced Self-Training improves retrieval-augmented multi-hop Q/A. "ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent." 2023-12-15. [>paper]
- ? Conditional self-correction depending on whether critical questions have been addressed in reasoning trace. "The ART of LLM Refinement: Ask, Refine, and Trust." 2023-11-14. [>paper]
- ? Iteratively refining reasoning given diverse feedback increases accuracy by up to 10% (ChatGPT). "MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models." 2023-10-19. [>paper]
- ? Instructing a model just to "review" its answer and "find problems" doesn't lead to effective self-correction. "Large Language Models Cannot Self-Correct Reasoning Yet." 2023-09-25. [>paper]
- ? LLMs can come up with and address critical questions to improve their drafts. "Chain-of-Verification Reduces Hallucination in Large Language Models." 2023-09-25. [>paper]
- ? LogiCoT: Self-check and revision after each CoT step improves performance (for selected tasks and models). "Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic." 2023-09-23. [>paper]
- ? Excellent review about self-correcting LLMs, with application to unfaithful reasoning. "Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies." 2023-08-06. [>paper]
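Several entries above rely on agreement between independently sampled CoT traces as a reliability signal. A generic version of that check is sketched below; it is not the method of any particular paper in this list. It assumes each trace ends with a line of the form `Answer: ...` (a hypothetical convention), and the hard-coded traces stand in for sampled model outputs.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def extract_answer(trace: str) -> Optional[str]:
    """Pull the final answer out of a trace, or None if the marker is missing."""
    match = re.search(r"Answer:\s*(.+)", trace)
    return match.group(1).strip() if match else None


def aggregate(traces: List[str]) -> Tuple[Optional[str], float]:
    """Return the majority answer and the share of all traces that support it."""
    answers = [a for a in map(extract_answer, traces) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(traces)


traces = [
    "3 boxes with 4 pens each gives 12 pens.\nAnswer: 12",
    "4 + 4 + 4 = 12.\nAnswer: 12",
    "3 * 5 = 15.\nAnswer: 15",
    "I am not sure how to proceed.",
]
answer, agreement = aggregate(traces)
print(answer, agreement)
```

A low agreement score can be used as a trigger for a further self-correction round, or as a reason to abstain from answering.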
Reasoning Analytics
Methods for analysing LLM deliberation and assessing reasoning quality.
- ?? Comprehensive LLM-based reasoning analytics that breaks texts down into individual reasons. "DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models." 2024-01-04. [>paper] [>code]
- ?? Highly performant, open LLM (T5-based) for inference verification. "Minds versus Machines: Rethinking Entailment Verification with Language Models." 2024-02-06. [>paper] [>model]
- ?? Test dataset for CoT evaluators. "A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains." 2023-11-23. [>paper] [>dataset]
- ?? Framework for evaluating reasoning chains by viewing them as informal proofs that derive the final answer. "ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness." 2023-11-23. [>paper] [>code]
- ? GPT-4 is 5x better at predicting whether math reasoning is correct than GPT-3.5. "Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs." 2023-12-28. [>paper]
- ? Minimalistic GPT-4 prompts for assessing reasoning quality. "SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation." 2023-09-29. [>paper] [>code]
- ?? Automatic, semantic-similarity based metrics for assessing CoT traces (redundancy, faithfulness, consistency, etc.). "ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning." 2023-09-12. [>paper]
Limitations, Failures, Puzzles
Things that don't work, or are poorly understood.
- ? Structured generation risks degrading reasoning quality and CoT effectiveness. "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models." 2024-08-05. [>paper]
- ? Filler tokens can be as effective as sound reasoning traces for eliciting correct answers. "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models." 2024-04-24. [>paper]
- ? Causal analysis shows that LLMs sometimes ignore CoT traces, but reason-responsiveness increases with model size and is shaped by fine-tuning. "LLMs with Chain-of-Thought Are Non-Causal Reasoners" 2024-02-25. [>paper]
- ? Bad reasoning may lead to correct conclusions, hence better methods for CoT evaluation are needed. "SCORE: A framework for Self-Contradictory Reasoning Evaluation." 2023-11-16. [>paper]
- ? LLMs may produce "encoded reasoning" that's unintelligible to humans, which may nullify any XAI gains from deliberative prompting. "Preventing Language Models From Hiding Their Reasoning." 2023-10-27. [>paper]
- ? LLMs judge and decide according to the available arguments (reason-responsiveness), but are more strongly influenced by fallacious and deceptive reasons than by sound ones. "How susceptible are LLMs to Logical Fallacies?" 2023-08-18. [>paper]
- ? Incorrect reasoning improves answer accuracy (nearly) as much as correct reasoning does. "Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting." 2023-07-20. [>paper]
- ? Zero-shot CoT reasoning in sensitive domains increases an LLM's likelihood of producing harmful or undesirable output. "On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning." 2023-06-23. [>paper]
- ? LLMs may systematically fabricate erroneous CoT rationales for wrong answers, NYU/Anthropic team finds. "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting." 2023-05-07. [>paper]
- ? LLMs' practical deliberation is not robust, but easily led astray by re-wording scenarios. "Despite 'super-human' performance, current LLMs are unsuited for decisions about ethics and safety" 2022-12-13. [>paper]
Datasets
Datasets containing examples of deliberative prompting, potentially useful for training models / assessing their deliberation skills.
- Instruction-following dataset augmented with "reasoning traces" generated by LLMs.
- ? ORCA - Microsoft's original paper. "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." 2023-06-05. [>paper]
- ? OpenOrca - Open source replication of ORCA datasets. [>dataset]
- ? Dolphin - Open source replication of ORCA datasets. [>dataset]
- ? ORCA 2 - Improved Orca by Microsoft, e.g. with meta reasoning. "Orca 2: Teaching Small Language Models How to Reason." 2023-11-18. [>paper]
- ?? CoT Collection - 1.84 million reasoning traces for 1,060 tasks. "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning." [>paper] [>code]
- ? OASST1 - contains more than 200 instructions to generate pros and cons (acc. to nomic.ai's map). [>dataset]
- ? LegalBench - a benchmark for legal reasoning in LLMs [>paper]
- ?? ThoughtSource - an open resource for data and tools related to chain-of-thought reasoning in large language models. [>paper] [>code]
- ?? Review with many pointers to CoT-relevant datasets. "Datasets for Large Language Models: A Comprehensive Survey" [>paper] [>code]
- ? Maxime Labonne's LLM datasets list [github]
Tools and Frameworks
Tools and Frameworks to implement deliberative prompting.
- ? LMQL - a programming language for language model interaction. [>site]
- ? Interactive LMQL Playground [>site]
- ? "Prompting Is Programming: A Query Language for Large Language Models." 2022-12-12. [>paper]
- ? {{guidance}} - a language for controlling large language models. [>code]
- ? outlines - a language for guided text generation. [>code]
- ? DSPy - a programmatic interface to LLMs. [>code]
- ? llm-reasoners – A library for advanced large language model reasoning. [>code]
- ? ThinkGPT - framework and building blocks for chain-of-thought workflows. [>code]
- ? LangChain - a python library for building LLM chains and agents. [>code]
- ? PromptBench - a unified library for evaluating LLMs, inter alia the effectiveness of CoT prompts. [>code]
- ? SymbolicAI - a library for compositional differentiable programming with LLMs. [>code]
Other Resources
More awesome and useful material.
- Survey of Autonomous LLM Agents (continuously updated). [>site]
- ? LLM Dashboard - explore task-specific reasoning performance of open LLMs [>app]
- Prompt Engineering Guide set up by DAIR. [>site]
- ATLAS - principles and benchmark for systematic prompting [>code]
- Deliberative Prompting Guide set up by Logikon. [>site]
- Arguing with Arguments – recent and wonderful piece by H. Siegel discussing what it actually means to evaluate an argument. [>paper]