Reading_groups Download - Reading_groups Source code download

Reading_groups

Other source code

1.0.0

Download

Resources for hot topics related to large-scale pre-trained language models

The power of computing : A lot of evidence shows that advances in machine learning are largely driven by computing, not research, please refer to "The Bitter Lesson", and there are often Emergence and Homogenization phenomena. Studies have shown that the use of artificial intelligence computing doubles about every 3.4 months, while the efficiency improvement only doubles every 16 months. Among them, the amount of calculation is mainly driven by computing power, while the efficiency is driven by research. This means that computing growth has historically dominated advances in machine learning and its subfields. This is further proven by the emergence of GPT-4. Despite this, we still need to pay attention to whether there will be a more subverted architecture in the future, such as S4. Most of the current NLP research hotspots are based on more advanced LLM (~100B, $10^{23}$ FLOPs). In particular, ChatGPT uses fewer than pre-training calculations (4.9+60 petaflops/s-days vs 3640 petaflops/s-days) and human feedback ($500k, 20k hours, 13+33+31k data, Compared with the GPT-3's $12,000k, it released its GPT big model dialogue capabilities and became popular. Therefore, this library tracks and classifies articles related to large-scale pre-trained language model LLM, which allows us to grasp the frontier and see the direction clearly. Of course, in addition to [Big computing power technology foundation], there are other aspects: [Breakthrough in big model technology], [Enhanced in big data quality], [Open innovation ecological environment], [Close team collaboration], [Strong engineering capabilities] etc.

For more LLM topics papers please refer to here and here.

Papers ( rough category )

Model training, testing and optimization
Applications and LLM+
Principle analysis
Technology improvements
Survey and datasets

resource

LLM Courses
Important pictures
LLM Demo
Important blogs and self-selected articles
Training, reasoning, application tools (not compiled)

Large model training and optimization

【Testing on GPT-4, limitation】Sparks of Artificial General Intelligence: Early experiments with GPT-4

Model Card
Video

【InstructGPT papers, including sft, ppo, etc., one of the most important articles】 Training language models to follow instructions with human feedback

【scalable oversight: How can humans continue to improve their models after their models exceed their own tasks? 】Measuring Progress on Scalable Oversight for Large Language Models

Self-critiquing models for assisting human evaluators
Definition: The ability to provide reliable supervision to the model in the form of labels, reward signals, or criticism that will remain effective after the model begins to achieve a wide range of human-level performance.
Scalable oversight technology can improve the capacity and alignment of models (i.e., apply and achieve goals in the way humans expect).
If we can find a supervised learning paradigm based on the existing model (level above non-experts, under experts) that can improve the correctness of the model's answers, then we can get a better understanding of the model by no means relying on experts. Expert system.
Another perspective idea is to prompt the model by using multiple hints and strategies and to accept only the answers given by the model on a consistent and reasonable basis of evidence. But the technology from this angle may not be sufficiently scalable. Of course, any technology that can solve such challenges with high reliability may represent important advances in scalable supervision.
Existing solutions: Let existing models assist humans in obtaining knowledge to enable humans to produce high-quality supervision. Given the success of AlaphZero, self-game is also promising.

【Definition of Alignment, produced by deepmind】Alignment of Language Agents

A General Language Assistant as a Laboratory for Alignment

[RETRO paper, model searched using CCA+] Improving language models by retrieving from trillions of tokens

Fine-Tuning Language Models from Human Preferences

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

【Big model in Chinese and English, exceeding GPT-3】GLM-130B: An Open Bilingual Pre-trained Model

【Pre-training target optimization】UL2: Unifying Language Learning Paradigms

【Alignment’s new benchmarks, model libraries and new methods】Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

【MLM without the [MASK] tags through technology】Representation Deficiency in Masked Language Modeling

【Text to image training relieves Vocabulary's needs and resists certain attacks】 Language Modelling with Pixels

LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

InCoder: A Generative Model for Code Infilling and Synthesis

[Search Text-related images for language model pre-training] Visually-Augmented Language Modeling

A Non-monotonic Self-terminating Language Model

【Comparison and fine-tuning of negative feedback through propt design】 Chain of Hindsight Aligns Language Models with Feedback

Related articles: The Wisdom of Hindsight Makes Language Models Better Instruction Followers

【Sparrow Model】Improving alignment of dialog agents via targeted human judgements

[Use small model parameters to accelerate the training process of large model (not starting from scratch)] Learning to Grow Pretrained Models for Efficient Transformer Training

[MoE semi-parametric knowledge fusion model for multiple knowledge sources] Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models

[Merge method for merging multiple trained models on different datasets] Dataless Knowledge Fusion by Merging Weights of Language Models

[It is very inspiring that the search mechanism replaces the general architecture of FFN in Transformer (×2.54 time) to decouple knowledge stored in model parameters] Language model with Plug-in Knowldge Memory

【Automatically generate Instruction tuning data for GPT-3 training】 Self-Instruction: Aligning Language Model with Self Generated Instructions

【Data similar to yizhong wang that automatically generates Instruction, aimed at T0】Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
Language model acceptance judgments are not always robust to context
SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks
(FLAN-T5-CoT) 【CoT Fine Tuning】Scaling Instruction-Finetuned Language Models

-

Towards Conditionally Dependent Masked Language Models

【Iteratively calibrate imperfectly generated independent correctors, Sean Welleck's follow-up article】Generating Sequences by Learning to Self-Correct

Prediction: AI feedback will soon replace human user feedback for model updates
Towards Boosting the Open-Domain Chatbot with Human Feedback
Similar ideas 1. Constitutional AI: Harmlessness from AI Feedback
Similar ideas 2. Discovering Language Model Behaviors with Model-Written Evaluations
Application: [OpenAI] Recursively Summarizing Books with Human Feedback

[Continuous Learning: Add a propt for the new task, and the previous task's propt and the big model remain unchanged] Progressive Prompts: Continual Learning for Language Models without Forgetting

[EMNLP 2022, continuous update of the model] MemPrompt: Memory-assisted Prompt Editing with User Feedback

【New Neural Architecture (FOLNet), which contains first-order logical induction bias】 Learning Language Representations with Logical Inductive Bias

GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator

【Pretraining language model based on state-space models, exceeding BERT】Pretraining Without Attention

[Consider human feedback during pre-training] Pretraining Language Models with Human Preferences

[Meta's open source LLaMA model, 7B-65B, trains more labeled small models than usual, achieving optimal performance under various inference budgets] LLaMA: Open and Efficient Foundation Language Models

[Teaching large language models to self-debug and explain the generated code through a small number of examples, but they have been used like this now] Teaching Large Language Models to Self-Debug

A series of papers and tools published on the self-correction ability of large language models, babyAGI, Auto-GPT
Similar ideas: 0. [The model records and reflects on the mistakes you have made] Reflexion: an autonomous agent with dynamic memory and self-reflection
Similar ideas: 1. [Models iterate through communication and iterative correction of each other's output] DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents

How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

LIMA: Less Is More for Alignment

【Tree-of-thought, more and more like alphago】Deliberate Problem Solving with Large Language Models

Applications and LLM+

【Multi-step reasoning method for applying ICL is very inspiring】ReAct: Synergizing Reasoning and Acting in Language Models

【Using LLM alone is not enough to create a truly powerful app, and the real power will appear when LLM is combined with other sources of computing or knowledge]
【Tools】 langchain - Building applications with LLMs through composability
【survey】 Augmented Language Models: a Survey
Toolformer
Similar ideas 0. TALM: Tool Augmented Language Models
Similar ideas 1. DEMONSTRATE–SEARCH–PREDICT:Composing retrieval and language models for knowledge-intensive NLP
Similar Thoughts 2. LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
Similar ideas 3. [Select and reasoning] Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
Similar ideas 4. Language Models as Agent Models
Similar Thoughts 5. Prompting Is Programming: A Query Language For Large Language Models
Similar ideas 6.【Neurips 22'】Language Model Cascades
Similar ideas 7. ART: Automatic multi-step reasoning and tool-use for large language models
Generative Agents: Interactive Simulacra of Human Behavior

【CoT directly generates program code, and then lets python interpreter execute】 Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Related articles: [EMNLP 22'] Language Models of Code are Few-Shot Commonsense Learners
【Heng Ji Group】Code4Struct: Code Generation for Few-Shot Structured Prediction from Natural Language PAL: Program-aided Language Models
【Qing Lyu, Chris Callison-Burch Group】Faithful Chain-of-Thought Reasoning

[Big model directly generates evidence context] Generate rather than Retrieve: Large Language Models are Strong Context Generators

【Writing model with 4 specific operations】PEER: A Collaborative Language Model

【Combining Python, SQL executors and big models】Binding Language Models in Symbolic Languages

[Retrieve the document generation code] DocPrompting: Generating Code by Retrieving the Docs

[There will be many articles in Grounding+LLM in the next series] LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
https://say-can.github.io/

【Self-Iterative Generation (Verified using Python) Training Data】 Language Models Can Teach Themselves to Program Better

Related articles: Specializing Smaller Language Models towards Multi-Step Reasoning
STaR: Bootstrapping Reasoning With Reasoning, from Neurips 22 (generate CoT data for model fine-tuning), causing a series of CoT articles teaching small models.
Similar ideas [Knowledge Distillation] Teaching Small Language Models to Reason and Learning by Distilling Context
Similar ideas KAIST and Xiang Ren groups ([CoT's rationale fine-tuning (professor)] PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales, etc.) and Large Language Models Are Reasoning Teachers
ETH's [CoT data trains problem decomposition and problem-solving models separately] Distilling Multi-Step Reasoning Capabilities of Large Language Models into Smaller Models via Semantic Decompositions

【Let small models learn CoT abilities】In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models

【Big Model Teach Small Model CoT】 Large Language Models Are Reasoning Teachers

[Big model generates evidence (recitation) and then performs small sample closed-book question and answer] Recitation-Augmented Language Models

[Natural Language Methods of Inductive Reasoners] Language Models as Inductive Reasoners

[GPT-3 is used for data annotation (such as emotional classification)] Is GPT-3 a Good Data Annotator?

【Models for Data Augmentation based on multitasking training for less sample data augmentation】KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Low-Resource NLP

【Procedural planning work, not interested in the time being】Neuro-Symbolic Procedural Planning with Commonsense Prompting

[Objective: Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus

【Combining the results of the external physics simulator in the context】Mind's Eye: Grounded Language Model Reasoning through Simulation

[Retrieve the task of enhancing CoT to do knowledge Intensive] Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

【Contrast the potential (binary) knowledge in unsupervised recognition language model】Discovering Latent Knowledge in Language Models Without Supervision

[Percy Liang group, trusted search engine, only 51.5% of generated sentences are fully supported by citations] Evaluating Verifiability in Generative Search Engines

Progressive-Hint Prompting Improves Reasoning in Large Language Models

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Principle analysis

[In my opinion, it is one of the most important articles. The law of proportion of language models under cross entropy loss, the loss is a power-law relationship with the model size, the dataset size, the amount of the computation used for training, and the width and depth of the architecture details such as the width and depth. Scaling Laws for Neural Language Models

[One of the other most important articles, Chinchilla, under limited computing, the optimal model is not the largest model, but a smaller model trained with more data (60-70B)] Training Compute-Optimal Large Language Models

[Which architecture and optimization goals help zero-sample generalization] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

【Grokking “Epiphany” learning process Memorization->Circuit formation->Cleanup】Progress measures for grokking via mechanistic interpretation

[Investigate the characteristics of the search-based model and find that both are of limited reasoning] Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model

The idea of search + LLM is the next direction, but it is not the only answer.
[Analysis and research on when to use external knowledge, that is, the switching between external knowledge and parameter knowledge] Large Language Models with Controllable Working Memory
Rethink Search: Making Domain Experts out of Dilettants
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

[Human-AI Language Interaction Evaluation Framework] Evaluating Human-Language Model Interaction

Similar articles Measuring the Human Utility of Free-Text Rationales in Human-AI Collaboration

What learning algorithm is in-context learning? Investigations with linear models

[Use ICL to learn action prediction after reinforcement learning, really clever] In-context reformer learning with algorithm distillation

【Model editing, this is Hot topic】Mass-Editing Memory in a Transformer

[The model's sensitivity to irrelevant context, adding irrelevant information to the examples in the prompt and adding instructions that ignore irrelevant context partially resolve] Large Language Models Can Be Easily Distracted by Irrelevant Context

【zero-shot CoT will show bias and toxicity under sensitive issues】 On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning

【CoT of the big model has cross-language capabilities】 Language models are multilingual chain-of-thought reasons

[The lower the confusion of different Prompt sequences, the better the performance] Demystifying Prompts in Language Models via Perplexity Estimation

[Binary implicity resolution task of large models, this suggestion is difficult and there is no scaling phenomenon] Large language models are not zero-shot communicators (https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/ Implicity)

【Complexity-Based Prompting for Multi-step Reasoning

Objective: To improve the utility of CoT itself is closely related to the analysis of CoT utility
[Select a single sample after generation and then select a combination] Explanation Selection Using Unlabeled Data for In-Context Learning
【Automatic Chain of Thought Prompting in Large Language Models
[Make a secondary adjustment to the explanation of CoT generation, use a refiner module with parameters + information entropy optimization] Explanation Regeneration via Information Bottleneck

What Matters In The Structured Pruning of Generative Language Models?

[AmbiBench dataset, Task Ambiguity: The scaling RLHF model performs best in disambiguating tasks. Fine-tuning is more helpful than few-shot prompting】Task Ambiguity in Humans and Language Models

【GPT-3 test, including memory, calibration, bias, etc.】Prompting GPT-3 To Be Reliable

[OSU study which part of CoT is effective for performance] Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Similar Thoughts1 Complementary Explanations for Effective In-Context Learning (UT Austin, Xi Ye, Greg Durrett)
Similar Thought2 Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango

[Research on cross-language model of discrete prompts] Can discrete information extraction prompts generalize across language models?

【The memory rate is logarithmic linear relationship with the model size, prefix length and repetition rate in training】 Quantifying Memorization Across Neural Language Models

【It's very inspiring, decompose the problem into sub-questions through GPT iteration and answer it】Measuring and Narrowing the Compositionality Gap in Language Models

[Whether or when will the research be effective for reading in step-by-step answers, zero samples and low resources are effective] When Do Decompositions Help for Machine Reading?
Similar ideas Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Similar ideas Successive Prompting for Decomposing Complex Questions

[Analogous test of GPT-3 similar to civil servants’ intelligence questions] Emergent Analogical Reasoning in Large Language Models

【Short text training, long text testing, evaluation of model variable length adaptability】A Length-Extrapolatable Transformer

[When not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories

【ICL is another form of gradient update】Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers

Related articles: Transformers learn in-context by gradient descent

Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective

[Research on the process of training the OPT model in different sizes, and found that confusion is an indicator of ICL] Training Trajectories of Language Models Across Scales

[EMNLP 2022, Pre-trained pure English corpus contains other languages, and the model's cross-language capabilities may come from data leakage] Language Contamination Helps Explains the Cross-lingual Capabilities of English Pretrained Models

[Overriding semantic priors and using information in propt is a surge capability] Larger language models do in-context learning differently

【EMNLP 2022 findings】What Language Model to Train if You Have One Million GPU Hours?

Technology improvements (such as generation technology, Prompt engineering, indicators, credibility, etc.)

[Introducing CFG technology during reasoning greatly improves the instruction compliance ability of small models] Stay on topic with Classifier-Free Guidance

【Train your own LLaMA model with openai’s GPT-4, and I can only say I admire you】Instruction Tuning with GPT-4

Reflexion: an autonomous agent with dynamic memory and self-reflection

【Personalized style prompt learning, OPT】Extensible Prompts for Language Models

[Accelerating large model decoding, using the direct consensus between small models and large models to be used multiple times at a time, after all, the input will be very slow if it is long] Accelerating Large Language Model Decoding with Specialized Sampling

[Use soft prompt to reduce the decline in ICL capability caused by fine tuning, fine-tuning the first stage, fine-tuning the second stage] Preserving In-Context Learning ability in Large Language Model Fine-tuning

【Semantic parsing tasks, sample selection methods of ICL, CODEX and T5-large】Diverse Demonstrations Improve In-context Compositional Generalization

【A new optimization method for text generation】Tailoring Language Generation Models under Total Variation Distance

[Uncertainty estimation of conditional generation, using semantic clustering combined with multiple sampling outputs to estimate the entropy of clusters] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Related articles: 1. Language Models (Mostly) Know What They Know
Related articles: 2. Teaching Models to Express Their Uncertainty in Words
Related articles: 3. [How does language expression affect calibration and accuracy, and which expression method is the best? 】Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models
Related articles: 4. Reducing Conversational Agents' Overconfidence Through Linguistic Calibration
Calibration meta-analysis: Will the calibration of a large model change due to the size of the model, the architecture of the model, the different Instructions, the different Contexts, and the task domain?
What is the optimal calibration method for open domain dialogue generation? How to improve the calibration performance of the model, fine-tuning, RLHF, Instruction tuning?
Are the big models really calibrated for understanding the problem rather than getting a good credibility assessment through statistical bias? Is it like humans that there are deceptions, knowing that you don’t understand, but pretending that you know? How to evaluate this?
If the big model has good calibration, what can we do next, how can we apply it to applications such as dialogue generation?

Go-tuning: Improving Zero-shot Learning Abilitys of Smaller Language Models

[Very inspiring, text generation method under free text constraints] Controllable Text Generation with Language Constraints

[When generating predictions, use similarity to select phrase instead of softmax token] Nonparametric Masked Language Modeling

[ICL method for long text] Parallel Context Windows Improve In-Context Learning of Large Language Models

【Sample of InstructGPT model generating ICL by itself】 Self-Prompting Large Language Models for Open-Domain QA

【Transfer and attention mechanisms enable ICL to enter more annotation samples】Structured Prompting: Scaling In-Context Learning to 1,000 Examples

Momentum Calibration for Text Generation

【Two ICL sample selection methods, experiments based on OPT and GPTJ】Careful Data Curation Stabilizes In-context Learning

【Analysis of the evaluation indicators of Mauve (pillutla et al.)】On the Usefulness of Embeddings, Clusters and Strings for Text Generation Evaluation

Promptagator: Few-shot Dense Retrieval From 8 Examples

[Three cobblers, Zhuge Liang] Self-Consistency Improves Chain of Thought Reasoning in Language Models

【Use knowledge as a reference for cobblers】Rethinking with Retrieval: Faithful Large Language Model Inference

[Invert, input and label generate instructions for conditions] Guess the Instruction! Making Language Models Stronger Zero-Shot Learners

【LLM’s reverse derivation self-verification】 Large Language Models are reasons with Self-Verification

【Methods for searching - Safety scenarios under the process of generating evidence】Foveate, Attribute, and Rationalize: Towards Safe and Trustworthy AI

[Confidence estimation of fragments extracted by text-generated information based on beam search] How Does Beam Search improve Span-Level Confidence Estimation in Generative Sequence Labeling?

SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning

【A Discussion on Extracted Summary Gold Label】Text Summarization with Oracle Expectation

【OOD detection method based on Martian distance】Out-of-Distribution Detection and Selective Generation for Conditional Language Models

[Attention module integrates Prompt to predict sample level] Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning

【Prompt for multiple tasks by decomposition and distillation into one Prompt】Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning

[The evaluation indicators of step-by-step reasoning generated text can be used as the topic to share next time] ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

[Calibrating Sequence likelihood Improves Conditional Language Generation]

【Text attack method based on gradient optimization】TextGrad: Advancing Robustness Evaluation in NLP by Gradient-Driven Optimization

[GMM modeling ICL decision classification boundaries to calibrate] Prototypical Calibration for Few-shot Learning of Language Models

【Rewriting problem, and graph-based ICL aggregation method】Ask Me Anything: A simple strategy for prompting language models

[Database for selecting good candidates as ICLs from unannotated example pools] Selective Annotation Makes Language Models Better Few-Shot Learners

PromptBoosting: Black-Box Text Classification with Ten Forward Passes

Attention-Guided Backdoor Attacks against Transformers

【Prompt Mask position automatic label selection】Pre-trained Language Models can be Fully Zero-Shot Learners

[Compress the length of the FiD input vector and reorder it when outputting to output document ranking] FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation

【Explanation on the generation of large models】PINTO: Faithful Language Reasoning Using Prompted-Generated Rationales

【Find a subset of pre-training impacts】ORCA: Interpreting Prompted Language Models via Location Supporting Evidence in the Ocean of Pretraining Data

[Prompt project, aimed at Instruction, generates first stage and two stage sorting filtering] Large Language Models are Human-Level Prompt Engineers

Knowledge Unlearning for Mitigating Privacy Risks in Language Models

Editing models with task arithmetic

[Don't enter instructions and samples every time, convert them into parameter-efficient modules,] HINT: Hypernetwork Instruction Tuning for Efficient Zero-Shot Generalisation

[ICL display generation method without manual sample selection] Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations

[Task Instruction and text generate Embedding together]One Embedder, Any Task: Instruction-Finetuned Text Embeddings

【Big model teaching small model CoT】KNIFE: Knowledge Distillation with Free-Text Rationales

[Problem of inconsistency between source and target word segmentation of information extraction generation model] Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Parsel: A Unified Natural Language Framework for Algorithmic Reasoning

[ICL sample selection, first phase selection and second phase sorting] Self-adaptive In-context Learning

[Intensive reading, readable prompt unsupervised selection method, GPT-2] Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too

Survey and datasets

【PRONTOQA dataset tests CoT inference ability and finds that Planning ability is still limited】 Language Models Can (kind of) Reason: A Systematic Formal Analysis of Chain-of-Thought

【reasoning dataset】WikiWhy: Answering and Explaining Cause-and-Effect Questions

【reasoning dataset】STREET: A MULTI-TASK STRUCTURED REASONING AND EXPLANATION BENCHMARK

【Reasoning dataset, comparing OPT pre-training and fine-tuning, including CoT fine-tuning models】 ALERT: Adapting Language Models to Reasoning Tasks

[Summary of the recent reasoning by Zhang Ningyu team of Zhejiang University] Reasoning with Language Model Prompting: A Survey

[Summary of text generation technology and direction by Xiao Yanghua's team in Fudan] Harnessing Knowledge and Reasoning for Human-Like Natural Language Generation: A Brief Review

[Summary of recent reasoning articles, Jie Huang from UIUC] Towards Reasoning in Large Language Models: A Survey

【Review of tasks, datasets and methods of mathematical reasoning and DL】A Survey of Deep Learning for Mathematical Reasoning

A Survey on Natural Language Processing for Programming

Reward modeling dataset:

This dataset is provided by Stiennon et al. (2020) and contains manual feedback on the model-generated summary. This dataset has two parts: comparison and axis. In the comparison section, the manual annotator was asked to select the best from the two summary. In the axis section, the manual annotator scores the summary quality based on Likert scale. The comparison part only has training and verification splits, while the axis part only has testing and verification splits. The abstract used to train reward models in the paper comes from the TL;DR dataset. Other validation and test data are from TL;DR datasets, CNN articles, and Daily Mail articles. https://huggingface.co/datasets/openai/summarize_from_feedback
This dataset comes from Ganguli et al. (2022); Bai et al. (2022a) and contains conversations about human evaluation. 3 One example includes a pair of conversations between humans and chatbots. Humans prefer one of these two conversations. https://huggingface.co/datasets/Anthropic/hh-rlhf
This dataset is from Nakano et al. (2021). There are 19,578 comparisons in total. Each example in the dataset contains model answers to a pair of questions, as well as related metadata. Each answer has a preference score from humans that can be used to determine which of the two answers is better. https://huggingface.co/datasets/openai/webgpt_comparisons
SHP is a dataset composed of 385K collective human preferences for responses to problems/indications in 18 different subject areas, from cooking to legal consultation. These preferences are intended to reflect how helpful one answer is to another and are intended to be used to train RLHF reward models and NLG evaluation models (such as SteamSHP). https://huggingface.co/datasets/stanfordnlp/SHP

Red-teaming datasets, harmless vs. helpful, RLHF + scale is harder to attack (another effective technique is CoT fine-tuning):

The overall consensus among humans is low on what a successful attack is.
Meta's Bot Adversarial Dialog dataset https://github.com/facebookresearch/ParlAI/tree/main/parlai/tasks/bot_adversarial_dialogue
Anthropic's red-teaming attempts https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main/red-team-attempts
AI2's RealToxicityPrompts https://huggingface.co/datasets/allenai/real-toxicity-prompts