This paper list focuses on the theoretical and empirical analysis of language models, especially large language models (LLMs). It collects papers that investigate the learning behavior, generalization ability, and other properties of language models through theoretical analysis, empirical analysis, or a combination of both.
Scope of this list:
Limitations of this list:
Statistics of this paper list:
If you have any suggestions or want to contribute, please feel free to open an issue or a pull request.
For details on how to contribute, please refer to the contribution guidelines.
You can also share your thoughts and discuss with others in the Discussions.
Note
For the uncategorized version, please refer to here.
^ back to top ^
Categories focusing on different phenomena, properties, and behaviors observed in large language models (LLMs) and transformer-based models.
^ back to top ^
Papers focusing on the theoretical and empirical analysis of in-context learning in large language models.
Provable In-Context Learning with Transformers: A Case Study on Linear Regression [paper link] 2024-11-04
Dake Bu; Wei Huang; Andi Han; Atsushi Nitanda; Taiji Suzuki; Qingfu Zhang; Hau-San Wong
Pretrained transformer efficiently learns low-dimensional target functions in-context [paper link] 2024-11-04
Kazusato Oko; Yujin Song; Taiji Suzuki; Denny Wu
Toward Understanding In-context vs. In-weight Learning [paper link] 2024-10-30
Bryan Chan; Xinyi Chen; András György; Dale Schuurmans
On the Role of Depth and Looping for In-Context Learning with Task Diversity [paper link] 2024-10-29
Khashayar Gatmiry; Nikunj Saunshi; Sashank J. Reddi; Stefanie Jegelka; Sanjiv Kumar
Mechanisms of Symbol Processing for In-Context Learning in Transformer Networks [paper link] 2024-10-23
Paul Smolensky; Roland Fernandez; Zhenghao Herbert Zhou; Mattia Opper; Jianfeng Gao
Can Transformers In-Context Learn Behavior of a Linear Dynamical System? [paper link] 2024-10-21
Usman Akram; Haris Vikalo
Bayesian scaling laws for in-context learning [paper link] 2024-10-21
Aryaman Arora; Dan Jurafsky; Christopher Potts; Noah D. Goodman
Provable In-context Learning for Mixture of Linear Regressions using Transformers [paper link] 2024-10-18
Yanhao Jin; Krishnakumar Balasubramanian; Lifeng Lai
In-context learning and Occam's razor [paper link] 2024-10-17
Eric Elmoznino; Tom Marty; Tejas Kasetty; Leo Gagnon; Sarthak Mittal; Mahan Fathi; Dhanya Sridhar; Guillaume Lajoie
Context-Scaling versus Task-Scaling in In-Context Learning [paper link] 2024-10-16
Amirhesam Abedsoltan; Adityanarayanan Radhakrishnan; Jingfeng Wu; Mikhail Belkin
Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent [paper link] 2024-10-15
Bo Chen; Xiaoyu Li; Yingyu Liang; Zhenmei Shi; Zhao Song
How Transformers Implement Induction Heads: Approximation and Optimization Analysis [paper link] 2024-10-15
Mingze Wang; Ruoxi Yu; Weinan E; Lei Wu
On the Training Convergence of Transformers for In-Context Classification [paper link] 2024-10-15
Wei Shen; Ruida Zhou; Jing Yang; Cong Shen
Transformers learn variable-order Markov chains in-context [paper link] 2024-10-07
Ruida Zhou; Chao Tian; Suhas Diggavi
Revisiting In-context Learning Inference Circuit in Large Language Models [paper link] 2024-10-06
Hakaze Cho; Mariko Kato; Yoshihiro Sakai; Naoya Inoue
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context [paper link] 2024-10-02
Spencer Frei; Gal Vardi
Transformers Handle Endogeneity in In-Context Linear Regression [paper link] 2024-10-02
Haodong Liang; Krishnakumar Balasubramanian; Lifeng Lai
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [paper link] 2024-09-10
Siyu Chen; Heejune Sheen; Tianhao Wang; Zhuoran Yang
Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs [paper link] 2024-09-06
Aliakbar Nafar; Kristen Brent Venable; Parisa Kordjamshidi
Transformers are Minimax Optimal Nonparametric In-Context Learners [paper link] 2024-08-22
Juno Kim; Tai Nakamaki; Taiji Suzuki
Memorisation In In-Context Learning [paper link] 2024-08-21
Shahriar Golchin; Mihai Surdeanu; Steven Bethard; Eduardo Blanco; Ellen Riloff
In-Context Learning with Representations: Contextual Generalization of Trained Transformers [paper link] 2024-08-19
Tong Yang; Yu Huang; Yingbin Liang; Yuejie Chi
Fast Training Dataset Attribution via In-Context Learning [paper link] 2024-08-14
Milad Fotouhi; Mohammad Taha Bahadori; Oluwaseyi Feyisetan; Payman Arabshahi; David Heckerman
How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression [paper link] 2024-08-08
Xingwu Chen; Lei Zhao; Difan Zou
Transformers are Universal In-context Learners [paper link] 2024-08-02
Takashi Furuya; Maarten V. de Hoop; Gabriel Peyré
Polynomial Regression as a Task for Understanding In-context Learning Through Finetuning and Alignment [paper link] 2024-07-27
Max Wilcoxson; Morten Svendgård; Ria Doshi; Dylan Davis; Reya Vir; Anant Sahai
Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism [paper link] 2024-07-24
Anhao Zhao; Fanghua Ye; Jinlan Fu; Xiaoyu Shen
One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [paper link] 2024-07-24
Zihao Li; Yuan Cao; Cheng Gao; Yihan He; Han Liu; Jason M. Klusowski; Jianqing Fan; Mengdi Wang
When can transformers compositionally generalize in-context? [paper link] 2024-07-17
Seijin Kobayashi; Simon Schug; Yassir Akram; Florian Redhardt; Johannes von Oswald; Razvan Pascanu; Guillaume Lajoie; João Sacramento
In-Context In-Context Learning with Transformer Neural Processes [paper link] 2024-06-19
Matthew Ashman; Cristiana Diaconu; Adrian Weller; Richard E. Turner
Probing the Decision Boundaries of In-context Learning in Large Language Models [paper link] 2024-06-17
Siyan Zhao; Tung Nguyen; Aditya Grover
State Soup: In-Context Skill Learning, Retrieval and Mixing [paper link] 2024-06-12
Maciej Pióro; Maciej Wołczyk; Razvan Pascanu; Johannes von Oswald; João Sacramento
Estimating the Hallucination Rate of Generative AI [paper link] 2024-06-11
Andrew Jesson; Nicolas Beltran-Velez; Quentin Chu; Sweta Karlekar; Jannik Kossen; Yarin Gal; John P. Cunningham; David Blei
BERTs are Generative In-Context Learners [paper link] 2024-06-07
David Samuel
Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective [paper link] 2024-06-06
Xinhao Yao; Xiaolin Hu; Shenzhi Yang; Yong Liu
What Do Language Models Learn in Context? The Structured Task Hypothesis [paper link] 2024-06-06
Jiaoda Li; Yifan Hou; Mrinmaya Sachan; Ryan Cotterell
Exact Conversion of In-Context Learning to Model Weights in Linearized-Attention Transformers [paper link] 2024-06-05
Brian K Chen; Tianyang Hu; Hui Jin; Hwee Kuan Lee; Kenji Kawaguchi
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks [paper link] 2024-06-04
Tianyu He; Darshil Doshi; Aritra Das; Andrey Gromov
Why Larger Language Models Do In-context Learning Differently? [paper link] 2024-05-30
Zhenmei Shi; Junyi Wei; Zhuoyan Xu; Yingyu Liang
Is In-Context Learning Sufficient for Instruction Following in LLMs? [paper link] 2024-05-30
Hao Zhao; Maksym Andriushchenko; Francesco Croce; Nicolas Flammarion
Does learning the right latent variables necessarily improve in-context learning? [paper link] 2024-05-29
Sarthak Mittal; Eric Elmoznino; Leo Gagnon; Sangnie Bhardwaj; Dhanya Sridhar; Guillaume Lajoie
A Theory of In-Context Learning in Transformers [paper link] 2024-05-29
Yifei Wang; Yuyang Wu; Zeming Wei; Stefanie Jegelka; Yisen Wang
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [paper link] 2024-05-27
Chenyu Zheng; Wei Huang; Rongzhen Wang; Guoqiang Wu; Jun Zhu; Chongxuan Li
Transformer In-Context Learning for Categorical Data [paper link] 2024-05-27
Aaron T. Wang; Ricardo Henao; Lawrence Carin
Automatic Domain Adaptation by Transformers in In-Context Learning [paper link] 2024-05-27
Ryuichiro Hataya; Kota Matsui; Masaaki Imaizumi
Unifying Demonstration Selection and Compression for In-Context Learning [paper link] 2024-05-27
Jun Gao
On the Noise Robustness of In-Context Learning for Text Generation [paper link] 2024-05-27
Hongfu Gao; Feipeng Zhang; Wenyu Jiang; Jun Shu; Feng Zheng; Hongxin Wei
MLPs Learn In-Context [paper link] 2024-05-24
William L. Tong; Cengiz Pehlevan
Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [paper link] 2024-05-24
Shang Liu; Zhongze Cai; Guanting Chen; Xiaocheng Li
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [paper link] 2024-05-02
Khashayar Gatmiry; Nikunj Saunshi; Sashank J. Reddi; Stefanie Jegelka; Sanjiv Kumar
In-context Learning on Function Classes Unveiled for Transformers [paper link] 2024-05-02
Zhijie Wang; Bo Jiang; Shuai Li
In-Context Learning with Long-Context Models: An In-Depth Exploration [paper link] 2024-04-30
Amanda Bertsch; Maor Ivgi; Uri Alon; Jonathan Berant; Matthew R. Gormley; Graham Neubig
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation [paper link] 2024-04-10
Aaditya K. Singh; Ted Moskovitz; Felix Hill; Stephanie C. Y. Chan; Andrew M. Saxe
Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability [paper link] 2024-04-01
Ivan Lee; Nan Jiang; Taylor Berg-Kirkpatrick
Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [paper link] 2024-02-29
Siyu Chen; Heejune Sheen; Tianhao Wang; Zhuoran Yang
How Transformers Learn Causal Structure with Gradient Descent [paper link] 2024-02-22
Eshaan Nichani; Alex Damian; Jason D. Lee
In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization [paper link] 2024-02-22
Ruiqi Zhang; Jingfeng Wu; Peter L. Bartlett
Identifying Semantic Induction Heads to Understand In-Context Learning [paper link] 2024-02-20
Jie Ren; Qipeng Guo; Hang Yan; Dongrui Liu; Xipeng Qiu; Dahua Lin
How do Transformers perform In-Context Autoregressive Learning? [paper link] 2024-02-08
Michael E. Sander; Raja Giryes; Taiji Suzuki; Mathieu Blondel; Gabriel Peyré
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks [paper link] 2024-02-06
Jongho Park; Jaeseung Park; Zheyang Xiong; Nayoung Lee; Jaewoong Cho; Samet Oymak; Kangwook Lee; Dimitris Papailiopoulos
An Information-Theoretic Analysis of In-Context Learning [paper link] 2024-01-28
Hong Jun Jeon; Jason D. Lee; Qi Lei; Benjamin Van Roy
The Transient Nature of Emergent In-Context Learning in Transformers [paper link] 2023-12-11
Aaditya K. Singh; Stephanie C. Y. Chan; Ted Moskovitz; Erin Grant; Andrew M. Saxe; Felix Hill
In-Context Learning Functions with Varying Number of Minima [paper link] 2023-11-21
David Oniani; Yanshan Wang
Exploring the Relationship between In-Context Learning and Instruction Tuning [paper link] 2023-11-17
Hanyu Duan; Yixuan Tang; Yi Yang; Ahmed Abbasi; Kar Yan Tam
When does In-context Learning Fall Short and Why? A Study on Specification-Heavy Tasks [paper link] 2023-11-15
Hao Peng; Xiaozhi Wang; Jianhui Chen; Weikai Li; Yunjia Qi; Zimu Wang; Zhili Wu; Kaisheng Zeng; Bin Xu; Lei Hou; Juanzi Li
In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax [paper link] 2023-11-13
Aaron Mueller; Albert Webson; Jackson Petty; Tal Linzen
Transformers learn to implement preconditioned gradient descent for in-context learning [paper link] 2023-11-09
Kwangjun Ahn; Xiang Cheng; Hadi Daneshmand; Suvrit Sra
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models [paper link] 2023-10-26
Deqing Fu; Tian-Qi Chen; Robin Jia; Vatsal Sharan
In-Context Learning Creates Task Vectors [paper link] 2023-10-24
Roee Hendel; Mor Geva; Amir Globerson
Function Vectors in Large Language Models [paper link] 2023-10-23
Eric Todd; Millicent L. Li; Arnab Sen Sharma; Aaron Mueller; Byron C. Wallace; David Bau
In-context Learning with Transformer Is Really Equivalent to a Contrastive Learning Pattern [paper link] 2023-10-19
Ruifeng Ren; Yong Liu
Trained Transformers Learn Linear Models In-Context [paper link] 2023-10-19
Ruiqi Zhang; Spencer Frei; Peter L. Bartlett
How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [paper link] 2023-10-16
Tianyu Guo; Wei Hu; Song Mei; Huan Wang; Caiming Xiong; Silvio Savarese; Yu Bai
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions [paper link] 2023-10-13
Satwik Bhattamishra; Arkil Patel; Phil Blunsom; Varun Kanade
How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [paper link] 2023-10-13
Jingfeng Wu; Difan Zou; Zixiang Chen; Vladimir Braverman; Quanquan Gu; Peter Bartlett
In-Context Learning Learns Label Relationships but Is Not Conventional Learning [paper link] 2023-10-13
Jannik Kossen; Yarin Gal; Tom Rainforth
In-context Convergence of Transformers [paper link] 2023-10-13
Yu Huang; Yuan Cheng; Yingbin Liang
In-Context Learning through the Bayesian Prism [paper link] 2023-10-13
Madhur Panwar; Kabir Ahuja; Navin Goyal
Do pretrained Transformers Really Learn In-context by Gradient Descent? [paper link] 2023-10-12
Lingfeng Shen; Aayush Mishra; Daniel Khashabi
What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization [paper link] 2023-10-10
Yufeng Zhang; Fengzhuo Zhang; Zhuoran Yang; Zhaoran Wang
Explaining Emergent In-Context Learning as Kernel Regression [paper link] 2023-10-05
Chi Han; Ziqi Wang; Han Zhao; Heng Ji
CausalLM is not optimal for in-context learning [paper link] 2023-09-02
Nan Ding; Tomer Levinboim; Jialin Wu; Sebastian Goodman; Radu Soricut
One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention [paper link] 2023-07-07
Arvind Mahankali; Tatsunori B. Hashimoto; Tengyu Ma
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [paper link] 2023-07-06
Yu Bai; Fan Chen; Huan Wang; Caiming Xiong; Song Mei
Transformers Learn In-Context by Gradient Descent [paper link] 2023-06-15
Johannes Von Oswald; Eyvind Niklasson; Ettore Randazzo; Joao Sacramento; Alexander Mordvintsev; Andrey Zhmoginov; Max Vladymyrov
The Closeness of In-Context Learning and Weight Shifting for Softmax Regression [paper link] 2023-04-26
Shuai Li; Zhao Song; Yu Xia; Tong Yu; Tianyi Zhou
A Theory of Emergent In-Context Learning as Implicit Structure Induction [paper link] 2023-03-14
Michael Hahn; Navin Goyal
The Learnability of In-Context Learning [paper link] 2023-03-14
Noam Wies; Yoav Levine; Amnon Shashua
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes [paper link] 2023-01-14
Shivam Garg; Dimitris Tsipras; Percy Liang; Gregory Valiant
Transformers generalize differently from information stored in context vs in weights [paper link] 2022-10-13
Stephanie C. Y. Chan; Ishita Dasgupta; Junkyung Kim; Dharshan Kumaran; Andrew K. Lampinen; Felix Hill
In-Context Learning and Induction Heads [paper link] 2022-09-24
Catherine Olsson; Nelson Elhage; Neel Nanda; Nicholas Joseph; Nova DasSarma; Tom Henighan; Ben Mann; Amanda Askell; Yuntao Bai; Anna Chen; Tom Conerly; Dawn Drain; Deep Ganguli; Zac Hatfield-Dodds; Danny Hernandez; Scott Johnston; Andy Jones; Jackson Kernion; Liane Lovitt; Kamal Ndousse; Dario Amodei; Tom Brown; Jack Clark; Jared Kaplan; Sam McCandlish; Chris Olah
^ back to top ^
Papers analyzing the chain-of-thought phenomenon in large language models, exploring theoretical and empirical perspectives.
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [paper link] 2024-10-31
Ming Li; Yanhong Li; Tianyi Zhou
A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration [paper link] 2024-10-21
Yingqian Cui; Pengfei He; Xianfeng Tang; Qi He; Chen Luo; Jiliang Tang; Yue Xing
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency [paper link] 2024-10-07
Kaiyue Wen; Huaqing Zhang; Hongzhou Lin; Jingzhao Zhang
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [paper link] 2024-10-03
Hongkang Li; Meng Wang; Songtao Lu; Xiaodong Cui; Pin-Yu Chen
Autoregressive + Chain of Thought (CoT) ≃ Recurrent: Recurrence's Role in Language Models and a Revisit of Recurrent Transformer [paper link] 2024-09-14
Xiang Zhang; Muhammad Abdul-Mageed; Laks V.S. Lakshmanan
Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [paper link] 2024-08-25
Xinyang Hu; Fengzhuo Zhang; Siyu Chen; Zhuoran Yang
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning [paper link] 2024-07-01
Akshara Prabhakar; Thomas L. Griffiths; R. Thomas McCoy
On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [paper link] 2024-06-20
Franz Nowak; Anej Svete; Alexandra Butoi; Ryan Cotterell
Iteration Head: A Mechanistic Study of Chain-of-Thought [paper link] 2024-06-04
Vivien Cabannes; Charles Arnal; Wassim Bouaziz; Alice Yang; Francois Charton; Julia Kempe
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [paper link] 2024-04-24
Jacob Pfau; William Merrill; Samuel R. Bowman
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [paper link] 2024-02-20
Zhiyuan Li; Hong Liu; Denny Zhou; Tengyu Ma
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective [paper link] 2023-12-22
Guhao Feng; Bohang Zhang; Yuntian Gu; Haotian Ye; Di He; Liwei Wang
Why Can Large Language Models Generate Correct Chain-of-Thoughts? [paper link] 2023-10-20
Rasul Tutunov; Antoine Grosnit; Juliusz Ziomek; Jun Wang; Haitham Bou-Ammar
How Large Language Models Implement Chain-of-Thought? [paper link] 2023-10-13
Yiqun Wang; Sile Hu; Yonggang Zhang; Xiang Tian; Xuesong Liu; Yaowu Chen; Xu Shen; Jieping Ye
The Expressive Power of Transformers with Chain of Thought [paper link] 2023-10-13
William Merrill; Ashish Sabharwal
^ back to top ^
Papers examining the hallucination phenomenon in language models, including both theoretical and empirical analysis.
No Free Lunch: Fundamental Limits of Learning Non-Hallucinating Generative Models [paper link] 2024-10-24
Changlong Wu; Ananth Grama; Wojciech Szpankowski
Shared Imagination: LLMs Hallucinate Alike [paper link] 2024-07-23
Yilun Zhou; Caiming Xiong; Silvio Savarese; Chien-Sheng Wu
Estimating the Hallucination Rate of Generative AI [paper link] 2024-06-11
Andrew Jesson; Nicolas Beltran-Velez; Quentin Chu; Sweta Karlekar; Jannik Kossen; Yarin Gal; John P. Cunningham; David Blei
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? [paper link] 2024-05-09
Zorik Gekhman; Gal Yona; Roee Aharoni; Matan Eyal; Amir Feder; Roi Reichart; Jonathan Herzig
Mechanisms of non-factual hallucinations in language models [paper link] 2024-03-26
Lei Yu; Meng Cao; Jackie Chi Kit Cheung; Yue Dong
Unfamiliar Finetuning Examples Control How Language Models Hallucinate [paper link] 2024-03-08
Katie Kang; Eric Wallace; Claire Tomlin; Aviral Kumar; Sergey Levine
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation [paper link] 2024-03-05
Shiqi Chen; Miao Xiong; Junteng Liu; Zhengxuan Wu; Teng Xiao; Siyang Gao; Junxian He
Calibrated Language Models Must Hallucinate [paper link] 2023-11-24
Adam Tauman Kalai; Santosh S. Vempala
The Curious Case of Hallucinatory Unanswerability: Finding Truths in the Hidden States of Over-Confident Large Language Models [paper link] 2023-10-18
Aviv Slobodkin; Omer Goldman; Avi Caciularu; Ido Dagan; Shauli Ravfogel
^ back to top ^
Papers that analyze the reversal curse phenomenon in large language models.
Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics [paper link] 2024-05-07
Hanlin Zhu; Baihe Huang; Shaolun Zhang; Michael Jordan; Jiantao Jiao; Yuandong Tian; Stuart Russell
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" [paper link] 2024-04-04
Lukas Berglund; Meg Tong; Max Kaufmann; Mikita Balesni; Asa Cooper Stickland; Tomasz Korbak; Owain Evans
An Investigation of LLMs' Inefficacy in Understanding Converse Relations [paper link] 2023-12-01
Chengwen Qi; Bowen Li; Binyuan Hui; Bailin Wang; Jinyang Li; Jinwang Wu; Yuanjun Laili
Physics of Language Models: Part 3.2, Knowledge Manipulation [paper link] 2023-09-25
Zeyuan Allen-Zhu; Yuanzhi Li
The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More [paper link] 2024-06-07
Ouail Kitouni; Niklas Nolte; Diane Bouchacourt; Adina Williams; Mike Rabbat; Mark Ibrahim
^ back to top ^
Papers exploring how model performance scales with model size, data size, or computational resources, and the emergence of unexpected abilities.
Unlocking the Theory Behind Scaling 1-Bit Neural Networks [paper link] 2024-11-03
Majid Daliri; Zhao Song; Chiwun Yang
How Does Critical Batch Size Scale in Pre-training? [paper link] 2024-10-29
Hanlin Zhang; Depen Morwani; Nikhil Vyas; Jingfeng Wu; Difan Zou; Udaya Ghai; Dean Foster; Sham Kakade
An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models [paper link] 2024-10-15
Anuj K. Nayak; Lav R. Varshney
A Hitchhiker's Guide to Scaling Law Estimation [paper link] 2024-10-15
Leshem Choshen; Yang Zhang; Jacob Andreas
Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models [paper link] 2024-10-08
Siqi Wang; Zhengyu Chen; Bei Li; Keqing He; Min Zhang; Jingang Wang
Grokking at the Edge of Linear Separability [paper link] 2024-10-06
Alon Beck; Noam Levi; Yohai Bar-Sinai
An Empirical Study of Scaling Laws for Transfer [paper link] 2024-08-30
Matthew Barnett
A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language [paper link] 2024-08-22
Ekdeep Singh Lubana; Kyogo Kawaguchi; Robert P. Dick; Hidenori Tanaka
Scaling Law with Learning Rate Annealing [paper link] 2024-08-20
Howe Tissue; Venus Wang; Lu Wang
Performance Law of Large Language Models [paper link] 2024-08-19
Chuhan Wu; Ruiming Tang
Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition [paper link] 2024-08-16
Kenzo Clauw; Sebastiano Stramaglia; Daniele Marinazzo
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [paper link] 2024-07-31
Bradley Brown; Jordan Juravsky; Ryan Ehrlich; Ronald Clark; Quoc V. Le; Christopher Ré; Azalia Mirhoseini
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product [paper link] 2024-07-29
Neil Mallinar; Daniel Beaglehole; Libin Zhu; Adityanarayanan Radhakrishnan; Parthe Pandit; Mikhail Belkin
Exploring Scaling Trends in LLM Robustness [paper link] 2024-07-25
Nikolaus Howe; Michał Zajac; Ian McKenzie; Oskar Hollinsworth; Tom Tseng; Pierre-Luc Bacon; Adam Gleave
Understanding the Interplay of Scale, Data, and Bias in Language Models: A Case Study with BERT [paper link] 2024-07-25
Muhammad Ali; Swetasudha Panda; Qinlan Shen; Michael Wick; Ari Kobren
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies [paper link] 2024-07-18
Chaofan Tao; Qian Liu; Longxu Dou; Niklas Muennighoff; Zhongwei Wan; Ping Luo; Min Lin; Ngai Wong
Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition [paper link] 2024-07-17
Mohamad Amin Mohamadi; Zhiyuan Li; Lei Wu; Danica J. Sutherland
Predicting Emergent Capabilities by Finetuning [paper link] 2024-07-10
Charlie Victor Snell; Eric Wallace; Dan Klein; Sergey Levine
Resolving Discrepancies in Compute-Optimal Scaling of Language Models [paper link] 2024-06-25
Tomer Porian; Mitchell Wortsman; Jenia Jitsev; Ludwig Schmidt; Yair Carmon
Scaling Laws for Linear Complexity Language Models [paper link] 2024-06-24
Xuyang Shen; Dong Li; Ruitao Leng; Zhen Qin; Weigao Sun; Yiran Zhong
Scaling Laws for Fact Memorization of Large Language Models [paper link] 2024-06-22
Xingyu Lu; Xiaonan Li; Qinyuan Cheng; Kai Ding; Xuanjing Huang; Xipeng Qiu
Reconciling Kaplan and Chinchilla Scaling Laws [paper link] 2024-06-12
Tim Pearce; Jinyeop Song
Deep Grokking: Would Deep Neural Networks Generalize Better? [paper link] 2024-05-29
Simin Fan; Razvan Pascanu; Martin Jaggi
Linguistic Collapse: Neural Collapse in (Large) Language Models [paper link] 2024-05-28
Robert Wu; Vardan Papyan
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [paper link] 2024-05-28
Alexander Hägele; Elie Bakouch; Atli Kosson; Loubna Ben Allal; Leandro Von Werra; Martin Jaggi
gzip Predicts Data-dependent Scaling Laws [paper link] 2024-05-26
Rohan Pandey
Emergence of a High-Dimensional Abstraction Phase in Language Transformers [paper link] 2024-05-24
Emily Cheng; Diego Doimo; Corentin Kervadec; Iuri Macocco; Jade Yu; Alessandro Laio; Marco Baroni
A rationale from frequency perspective for grokking in training neural network [paper link] 2024-05-24
Zhangchen Zhou; Yaoyu Zhang; Zhi-Qin John Xu
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [paper link] 2024-05-23
Boshi Wang; Xiang Yue; Yu Su; Huan Sun
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining [paper link] 2024-05-23
Ce Ge; Zhijian Ma; Daoyuan Chen; Yaliang Li; Bolin Ding
4+3 Phases of Compute-Optimal Neural Scaling Laws [paper link] 2024-05-23
Elliot Paquette; Courtney Paquette; Lechao Xiao; Jeffrey Pennington
Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models [paper link] 2024-05-22
Raghu Mudumbai; Tyler Bell
Quantifying Emergence in Large Language Models [paper link] 2024-05-21
Hang Chen; Xinyu Yang; Jiaying Zhu; Wenya Wang
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [paper link] 2024-05-14
Xueyan Niu; Bo Bai; Lei Deng; Wei Han
More Compute Is What You Need [paper link] 2024-04-30
Zhen Guo
An exactly solvable model for emergence and scaling laws [paper link] 2024-04-26
Yoonsoo Nam; Nayara Fonseca; Seok Hyeong Lee; Ard Louis
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck [paper link] 2024-04-11
Nathan Godey; Éric de la Clergerie; Benoît Sagot
A Large-Scale Exploration of μ-Transfer [paper link]
Lucas Lingle
Emergent Abilities in Reduced-Scale Generative Language Models [paper link] 2024-04-02
Sherin Muckatira; Vijeta Deshpande; Vladislav Lialin; Anna Rumshisky
Understanding Emergent Abilities of Language Models from the Loss Perspective [paper link] 2024-03-23
Zhengxiao Du; Aohan Zeng; Yuxiao Dong; Jie Tang
Unraveling the Mystery of Scaling Laws: Part I [paper link] 2024-03-21
Hui Su; Zhi Tian; Xiaoyu Shen; Xunliang Cai
Language models scale reliably with over-training and on downstream tasks [paper link] 2024-03-13
Samir Yitzhak Gadre; Georgios Smyrnis; Vaishaal Shankar; Suchin Gururangan; Mitchell Wortsman; Rulin Shao; Jean Mercat; Alex Fang; Jeffrey Li; Sedrick Keh; Rui Xin; Marianna Nezhurina; Igor Vasiljevic; Jenia Jitsev; Alexandros G. Dimakis; Gabriel Ilharco; Shuran Song; Thomas Kollar; Yair Carmon; Achal Dave; Reinhard Heckel; Niklas Muennighoff; Ludwig Schmidt
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method [paper link] 2024-02-26
Biao Zhang; Zhongtao Liu; Colin Cherry; Orhan Firat
Interpreting Grokked Transformers in Complex Modular Arithmetic [paper link] 2024-02-26
Hiroki Furuta; Gouki Minegishi; Yusuke Iwasawa; Yutaka Matsuo
A Tale of Tails: Model Collapse as a Change of Scaling Laws [paper link] 2024-02-10
Elvis Dohmatob; Yunzhen Feng; Pu Yang; Francois Charton; Julia Kempe
Scaling Data-Constrained Language Models [paper link] 2023-10-25
Niklas Muennighoff; Alexander M. Rush; Boaz Barak; Teven Le Scao; Aleksandra Piktus; Nouamane Tazi; Sampo Pyysalo; Thomas Wolf; Colin Raffel
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning [paper link] 2023-10-06
Tian Jin; Nolan Clement; Xin Dong; Vaishnavh Nagarajan; Michael Carbin; Jonathan Ragan-Kelley; Gintare Karolina Dziugaite
Are Emergent Abilities of Large Language Models a Mirage? [paper link] 2023-04-28
Rylan Schaeffer; Brando Miranda; Sanmi Koyejo
Training Compute-Optimal Large Language Models [paper link] 2022-03-29
Jordan Hoffmann; Sebastian Borgeaud; Arthur Mensch; Elena Buchatskaya; Trevor Cai; Eliza Rutherford; Diego de Las Casas; Lisa Anne Hendricks; Johannes Welbl; Aidan Clark; Tom Hennigan; Eric Noland; Katie Millican; George van den Driessche; Bogdan Damoc; Aurelia Guy; Simon Osindero; Karen Simonyan; Erich Elsen; Jack W. Rae; Oriol Vinyals; Laurent Sifre
Scaling Laws for Neural Language Models [paper link] 2020-01-22
Jared Kaplan; Sam McCandlish; Tom Henighan; Tom B. Brown; Benjamin Chess; Rewon Child; Scott Gray; Alec Radford; Jeffrey Wu; Dario Amodei
^ back to top ^
Papers focusing on how large language models store, retrieve, and utilize knowledge, analyzing the memory mechanisms involved.
A Geometric Framework for Understanding Memorization in Generative Models [paper link] 2024-10-31
Brendan Leigh Ross; Hamidreza Kamkari; Tongzi Wu; Rasa Hosseinzadeh; Zhaoyan Liu; George Stein; Jesse C. Cresswell; Gabriel Loaiza-Ganem
Optimal Memorization Capacity of Transformers [paper link] 2024-09-26
Tokio Kajitsuka; Issei Sato
Schrodinger's Memory: Large Language Models [paper link] 2024-09-16
Wei Wang; Qing Li
Self-Attention Limits Working Memory Capacity of Transformer-Based Models [paper link] 2024-09-16
Dongyu Gong; Hantao Zhang
Great Memory, Shallow Reasoning: Limits of kNN-LMs [paper link] 2024-08-21
Shangyi Geng; Wenting Zhao; Alexander M Rush
Memorisation In In-Context Learning [paper link] 2024-08-21
Shahriar Golchin; Mihai Surdeanu; Steven Bethard; Eduardo Blanco; Ellen Riloff
Generalisation First, Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks [paper link] 2024-08-09
Verna Dankers; Ivan Titov
Understanding Memorisation in LLMs: Dynamics, Influencing Factors, and Implications [paper link] 2024-07-27
Till Speicher; Mohammad Aflah Khan; Qinyuan Wu; Vedant Nanda; Soumi Das; Bishwamittra Ghosh; Krishna P. Gummadi; Evimaria Terzi
Demystifying Verbatim Memorization in Large Language Models [paper link] 2024-07-25
Jing Huang; Diyi Yang; Christopher Potts
From Internal Conflict to Contextual Adaptation of Language Models [paper link] 2024-07-24
Sara Vera Marjanović; Haeun Yu; Pepa Atanasova; Maria Maistro; Christina Lioma; Isabelle Augenstein
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [paper link] 2024-07-20
Antonis Antoniades; Xinyi Wang; Yanai Elazar; Alfonso Amayuelas; Alon Albalak; Kexun Zhang; William Yang Wang
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction [paper link] 2024-07-16
Zeyuan Allen-Zhu; Yuanzhi Li
Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning [paper link] 2024-07-09
J. Crosbie; E. Shutova
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers [paper link] 2024-06-26
Yibo Jiang; Goutham Rajendran; Pradeep Ravikumar; Bryon Aragam
Scaling Laws for Fact Memorization of Large Language Models [paper link] 2024-06-22
Xingyu Lu; Xiaonan Li; Qinyuan Cheng; Kai Ding; Xuanjing Huang; Xipeng Qiu
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data [paper link] 2024-06-20
Johannes Treutlein; Dami Choi; Jan Betley; Cem Anil; Samuel Marks; Roger Baker Grosse; Owain Evans
Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models [paper link] 2024-06-20
Sunny Duan; Mikail Khona; Abhiram Iyer; Rylan Schaeffer; Ila R Fiete
Understanding Finetuning for Factual Knowledge Extraction [paper link] 2024-06-20
Gaurav Ghosal; Tatsunori Hashimoto; Aditi Raghunathan
Estimating Knowledge in Large Language Models Without Generating a Single Token [paper link] 2024-06-18
Daniela Gottesman; Mor Geva
How Do Large Language Models Acquire Factual Knowledge During Pretraining? [paper link] 2024-06-17
Hoyeon Chang; Jinho Park; Seonghyeon Ye; Sohee Yang; Youngkyung Seo; Du-Seong Chang; Minjoon Seo
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs [paper link] 2024-06-14
Abhimanyu Hans; Yuxin Wen; Neel Jain; John Kirchenbauer; Hamid Kazemi; Prajwal Singhania; Siddharth Singh; Gowthami Somepalli; Jonas Geiping; Abhinav Bhatele; Tom Goldstein
Knowledge Circuits in Pretrained Transformers [paper link] 2024-05-28
Yunzhi Yao; Ningyu Zhang; Zekun Xi; Mengru Wang; Ziwen Xu; Shumin Deng; Huajun Chen
Upper and lower memory capacity bounds of transformers for next-token prediction [paper link] 2024-05-22
Liam Madden; Curtis Fox; Christos Thrampoulidis
A Multi-Perspective Analysis of Memorization in Large Language Models [paper link] 2024-05-19
Bowen Chen; Namgi Han; Yusuke Miyao
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws [paper link] 2024-04-08
Zeyuan Allen-Zhu; Yuanzhi Li
Memorization Capacity of Multi-Head Attention in Transformers [paper link] 2024-03-02
Sadegh Mahdavi; Renjie Liao; Christos Thrampoulidis
Birth of a Transformer: A Memory Viewpoint [paper link] 2023-11-06
Alberto Bietti; Vivien Cabannes; Diane Bouchacourt; Herve Jegou; Leon Bottou
Physics of Language Models: Part 3.2, Knowledge Manipulation [paper link] 2023-09-25
Zeyuan Allen-Zhu; Yuanzhi Li
Can Neural Network Memorization Be Localized? [paper link] 2023-07-18
Pratyush Maini; Michael C. Mozer; Hanie Sedghi; Zachary C. Lipton; J. Zico Kolter; Chiyuan Zhang
Quantifying Memorization Across Neural Language Models [paper link] 2022-02-15
Nicholas Carlini; Daphne Ippolito; Matthew Jagielski; Katherine Lee; Florian Tramer; Chiyuan Zhang
^ back to top ^
Papers discussing various aspects of the training process, including optimization, fine-tuning, and the training landscape of large language models.
Global Convergence in Training Large-Scale Transformers [paper link] 2024-10-31
Cheng Gao; Yuan Cao; Zihao Li; Yihan He; Mengdi Wang; Han Liu; Jason Matthew Klusowski; Jianqing Fan
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [paper link] 2024-10-31
Ming Li; Yanhong Li; Tianyi Zhou
Learning and Transferring Sparse Contextual Bigrams with Linear Transformers [paper link] 2024-10-30
Yunwei Ren; Zixuan Wang; Jason D. Lee
Abrupt Learning in Transformers: A Case Study on Matrix Completion [paper link] 2024-10-29
Pulkit Gopalani; Ekdeep Singh Lubana; Wei Hu
LoRA vs Full Fine-tuning: An Illusion of Equivalence [paper link] 2024-10-28
Reece Shuttleworth; Jacob Andreas; Antonio Torralba; Pratyusha Sharma
A distributional simplicity bias in the learning dynamics of transformers [paper link] 2024-10-25
Riccardo Rende; Federica Gerace; Alessandro Laio; Sebastian Goldt
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [paper link] 2024-10-17
Tianyu Guo; Druv Pai; Yu Bai; Jiantao Jiao; Michael I. Jordan; Song Mei
How Transformers Implement Induction Heads: Approximation and Optimization Analysis [paper link] 2024-10-15
Mingze Wang; Ruoxi Yu; Weinan E; Lei Wu
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [paper link] 2024-10-14
Weronika Ormaniec; Felix Dangel; Sidak Pal Singh
Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve? [paper link] 2024-10-08
Fırat Öncel; Matthias Bethge; Beyza Ermis; Mirco Ravanelli; Cem Subakan; Çağatay Yıldız
On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent [paper link] 2024-10-07
Bingrui Li; Wei Huang; Andi Han; Zhanpeng Zhou; Taiji Suzuki; Jun Zhu; Jianfei Chen
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective [paper link] 2024-10-07
Kaiyue Wen; Zhiyuan Li; Jason Wang; David Hall; Percy Liang; Tengyu Ma
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [paper link] 2024-10-03
Hongkang Li; Meng Wang; Songtao Lu; Xiaodong Cui; Pin-Yu Chen
Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization [paper link] 2024-10-03
Xinhao Yao; Hongjin Qian; Xiaolin Hu; Gengze Xu; Yong Liu
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context [paper link] 2024-10-02
Spencer Frei; Gal Vardi
Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective [paper link] 2024-10-02
Zeyu Gan; Yong Liu
Investigating the Impact of Model Complexity in Large Language Models [paper link] 2024-10-01
Jing Luo; Huiyuan Wang; Weiran Huang
Benign or Not-Benign Overfitting in Token Selection of Attention Mechanism [paper link] 2024-09-26
Keitaro Sakamoto; Issei Sato
Non-asymptotic Convergence of Training Transformers for Next-token Prediction [paper link] 2024-09-25
Ruiquan Huang; Yingbin Liang; Jing Yang
Optimization Hyper-parameter Laws for Large Language Models [paper link] 2024-09-07
Xingyu Xie; Kuangyu Ding; Shuicheng Yan; Kim-Chuan Toh; Tianwen Wei
The AdEMAMix Optimizer: Better, Faster, Older [paper link] 2024-09-05
Matteo Pagliardini; Pierre Ablin; David Grangier
Clustering and Alignment: Understanding the Training Dynamics in Modular Addition [paper link] 2024-08-18
Tiberiu Musat
On the Convergence of Encoder-only Shallow Transformers [paper link] 2024-08
Yongtao Wu; Fanghui Liu; Grigorios G Chrysos; Volkan Cevher
Parameter-Efficient Fine-Tuning for Continual Learning: A Neural Tangent Kernel Perspective [paper link] 2024-07-24
Jingren Liu; Zhong Ji; YunLong Yu; Jiale Cao; Yanwei Pang; Jungong Han; Xuelong Li
Learning Dynamics of LLM Finetuning [paper link] 2024-07-15
Yi Ren; Danica J. Sutherland
Deconstructing What Makes a Good Optimizer for Language Models [paper link] 2024-07-10
Rosie Zhao; Depen Morwani; David Brandfonbrener; Nikhil Vyas; Sham Kakade
Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity [paper link] 2024-06-17
Bingxiang He; Ning Ding; Cheng Qian; Jia Deng; Ganqu Cui; Lifan Yuan; Huan-ang Gao; Huimin Chen; Zhiyuan Liu; Maosong Sun
Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective [paper link] 2024-05-27
Akiyoshi Tomihari; Issei Sato
Infinite Limits of Multi-head Transformer Dynamics [paper link] 2024-05-24
Blake Bordelon; Hamza Tahir Chaudhry; Cengiz Pehlevan
Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics [paper link] 2024-05-07
Hanlin Zhu; Baihe Huang; Shaolun Zhang; Michael Jordan; Jiantao Jiao; Yuandong Tian; Stuart Russell
Control Theoretic Approach to Fine-Tuning and Transfer Learning [paper link] 2024-04-16
Erkan Bayram; Shenyu Liu; Mohamed-Ali Belabbas; Tamer Başar
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think [paper link] 2024-04-12
Xinpeng Wang; Chengzhi Hu; Bolei Ma; Paul Röttger; Barbara Plank
On Training Data Influence of GPT Models [paper link] 2024-04-11
Qingyi Liu; Yekun Chai; Shuohuan Wang; Yu Sun; Keze Wang; Hua Wu
Best Practices and Lessons Learned on Synthetic Data for Language Models [paper link] 2024-04-11
Ruibo Liu; Jerry Wei; Fangyu Liu; Chenglei Si; Yanzhe Zhang; Jinmeng Rao; Steven Zheng; Daiyi Peng; Diyi Yang; Denny Zhou; Andrew M. Dai
How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse [paper link] 2024-04-07
Mohamed El Amine Seddik; Suei-Wen Chen; Soufiane Hayou; Pierre Youssef; Merouane Debbah
Unveiling the Generalization Power of Fine-Tuned Large Language Models [paper link] 2024-03-14
Haoran Yang; Yumeng Zhang; Jiaqi Xu; Hongyuan Lu; Pheng Ann Heng; Wai Lam
Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [paper link] 2024-03-14
Akhil Kedia; Mohd Abbas Zaidi; Sushil Khyalia; Jungho Jung; Harshith Goka; Haejun Lee
Linear Attention is (Maybe) All You Need (to Understand Transformer Optimization) [paper link] 2024-03-13
Kwangjun Ahn; Xiang Cheng; Minhak Song; Chulhee Yun; Ali Jadbabaie; Suvrit Sra
Hallmarks of Optimization Trajectories in Neural Networks and LLMs: The Lengths, Bends, and Dead Ends [paper link] 2024-03-12
Sidak Pal Singh; Bobby He; Thomas Hofmann; Bernhard Schölkopf
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models [paper link] 2024-03-06
Adithya Bhaskar; Dan Friedman; Danqi Chen
Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality [paper link] 2024-02-29
Siyu Chen; Heejune Sheen; Tianhao Wang; Zhuoran Yang
How Transformers Learn Causal Structure with Gradient Descent [paper link] 2024-02-22
Eshaan Nichani; Alex Damian; Jason D. Lee
LoRA Training in the NTK Regime has No Spurious Local Minima [paper link] 2024-02-19
Uijeong Jang; Jason D. Lee; Ernest K. Ryu
On the Emergence of Cross-Task Linearity in the Pretraining-Finetuning Paradigm [paper link] 2024-02-06
Zhanpeng Zhou; Zijun Chen; Yilan Chen; Bo Zhang; Junchi Yan
Transformers learn through gradual rank increase [paper link] 2023-12-10
Enric Boix-Adsera; Etai Littwin; Emmanuel Abbe; Samy Bengio; Joshua Susskind
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks [paper link] 2023-11-21
Samyak Jain; Robert Kirk; Ekdeep Singh Lubana; Robert P. Dick; Hidenori Tanaka; Edward Grefenstette; Tim Rocktäschel; David Scott Krueger
Connecting Pre-trained Language Model and Downstream Task via Properties of Representation [paper link] 2023-11-02
Chenwei Wu; Holden Lee; Rong Ge
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer [paper link] 2023-07-02
Yuandong Tian; Yiping Wang; Beidi Chen; Simon Du
A Kernel-Based View of Language Model Fine-Tuning [paper link] 2023-06-15
Sadhika Malladi; Alexander Wettig; Dingli Yu; Danqi Chen; Sanjeev Arora
A Stability Analysis of Fine-Tuning a Pre-Trained Model [paper link] 2023-01-24
Zihao Fu; Anthony Man-Cho So; Nigel Collier
^ back to top ^
Papers analyzing the learning capabilities and generalization performance of language models, from weak to strong generalization.
Generalization and Risk Bounds for Recurrent Neural Networks [paper link] 2024-11-05
Xuewei Cheng; Ke Huang; Shujie Ma
Provable Length Generalization in Sequence Prediction via Spectral Filtering [paper link] 2024-11-01
Annie Marsden; Evan Dogariu; Naman Agarwal; Xinyi Chen; Daniel Suo; Elad Hazan
RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner [paper link] 2024-10-31
Fu-Chieh Chang; Yu-Ting Lee; Hui-Ying Shih; Pei-Yuan Wu
Mixture of Parrots: Experts improve memorization more than reasoning [paper link] 2024-10-24
Samy Jelassi; Clara Mohri; David Brandfonbrener; Alex Gu; Nikhil Vyas; Nikhil Anand; David Alvarez-Melis; Yuanzhi Li; Sham M. Kakade; Eran Malach
How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs [paper link] 2024-10-17
Guhao Feng; Kai Yang; Yuntian Gu; Xinyue Ai; Shengjie Luo; Jiacheng Sun; Di He; Zhenguo Li; Liwei Wang
On Rank-Dependent Generalisation Error Bounds for Transformers [paper link] 2024-10-15
Lan V. Truong
Benign Overfitting in Single-Head Attention [paper link] 2024-10-10
Roey Magen; Shuning Shang; Zhiwei Xu; Spencer Frei; Wei Hu; Gal Vardi
Dynamics of Concept Learning and Compositional Generalization [paper link] 2024-10-10
Yongyi Yang; Core Francisco Park; Ekdeep Singh Lubana; Maya Okawa; Wei Hu; Hidenori Tanaka
Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [paper link] 2024-10-08
Junhyung Park; Patrick Bloebaum; Shiva Prasad Kasiviswanathan
Provable Weak-to-Strong Generalization via Benign Overfitting [paper link] 2024-10-06
David X. Wu; Anant Sahai
A Formal Framework for Understanding Length Generalization in Transformers [paper link] 2024-10-03
Xinting Huang; Andy Yang; Satwik Bhattamishra; Yash Sarrof; Andreas Krebs; Hattie Zhou; Preetum Nakkiran; Michael Hahn
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context [paper link] 2024-10-02
Spencer Frei; Gal Vardi
Lines of Thought in Large Language Models [paper link] 2024-10-02
Raphaël Sarfati; Toni J. B. Liu; Nicolas Boullé; Christopher J. Earls
Investigating the Impact of Model Complexity in Large Language Models [paper link] 2024-10-01
Jing Luo; Huiyuan Wang; Weiran Huang
Benign or Not-Benign Overfitting in Token Selection of Attention Mechanism [paper link] 2024-09-26
Keitaro Sakamoto; Issei Sato
Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics [paper link] 2024-09-15
Yi Ren; Danica J. Sutherland
Unforgettable Generalization in Language Models [paper link] 2024-09-03
Eric Zhang; Leshem Choshen; Jacob Andreas
The Many Faces of Optimal Weak-to-Strong Learning [paper link] 2024-08-30
Mikael Møller Høgsgaard; Kasper Green Larsen; Markus Engelund Mathiasen
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [paper link] 2024-08-29
Tian Ye; Zicheng Xu; Yuanzhi Li; Zeyuan Allen-Zhu
Out-of-distribution generalization via composition: a lens through induction heads in Transformers [paper link] 2024-08-18
Jiajun Song; Zhuoyan Xu; Yiqiao Zhong
On the Generalization of Preference Learning with DPO [paper link] 2024-08-06
Shawn Im; Yixuan Li
Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs [paper link] 2024-07-31
Kewei Cheng; Jingfeng Yang; Haoming Jiang; Zhengyang Wang; Binxuan Huang; Ruirui Li; Shiyang Li; Zheng Li; Yifan Gao; Xian Li; Bing Yin; Yizhou Sun