Awesome Resource-Efficient LLM Papers
A curated list of high-quality papers on resource-efficient LLMs.
This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.
Table of Contents
- Awesome Resource-Efficient LLM Papers
- Table of Contents
- LLM Architecture Design
  - Efficient Transformer Architecture
  - Non-transformer Architecture
- LLM Pre-Training
  - Memory Efficiency
    - Distributed Training
    - Mixed precision training
  - Data Efficiency
    - Importance Sampling
    - Data Augmentation
    - Training Objective
- LLM Fine-Tuning
  - Parameter-Efficient Fine-Tuning
  - Full-Parameter Fine-Tuning
- LLM Inference
  - Model Compression
  - Dynamic Acceleration
- System Design
  - Deployment optimization
  - Support Infrastructure
  - Other Systems
- Resource-Efficiency Evaluation Metrics & Benchmarks
  - Computation Metrics
  - Memory Metrics
  - ⚡️ Energy Metrics
  - Financial Cost Metric
  - Network Communication Metric
  - Other Metrics
  - Benchmarks
- Reference
LLM Architecture Design
Efficient Transformer Architecture
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Approximate attention | Simple linear attention language models balance the recall-throughput tradeoff | arXiv |
| 2024 | Hardware attention | MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases | arXiv |
| 2024 | Approximate attention | LoMA: Lossless Compressed Memory Attention | arXiv |
| 2024 | Approximate attention | Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation | ICML |
| 2024 | Hardware optimization | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | ICLR |
| 2023 | Hardware optimization | Flashattention: Fast and memory-efficient exact attention with io-awareness | NeurIPS |
| 2023 | Approximate attention | KDEformer: Accelerating Transformers via Kernel Density Estimation | ICML |
| 2023 | Approximate attention | Mega: Moving Average Equipped Gated Attention | ICLR |
| 2022 | Hardware optimization | xFormers - Toolbox to Accelerate Research on Transformers | GitHub |
| 2021 | Approximate attention | Efficient attention: Attention with linear complexities | WACV |
| 2021 | Approximate attention | An Attention Free Transformer | arXiv |
| 2021 | Approximate attention | Self-attention Does Not Need O(n^2) Memory | arXiv |
| 2021 | Hardware optimization | LightSeq: A High Performance Inference Library for Transformers | NAACL |
| 2021 | Hardware optimization | FasterTransformer: A Faster Transformer Framework | GitHub |
| 2020 | Approximate attention | Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | ICML |
| 2019 | Approximate attention | Reformer: The efficient transformer | ICLR |
Non-transformer Architecture
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Decoder | You Only Cache Once: Decoder-Decoder Architectures for Language Models | arXiv |
| 2024 | BitLinear layer | Scalable MatMul-free Language Modeling | arXiv |
| 2023 | RNN LM | RWKV: Reinventing RNNs for the Transformer Era | EMNLP-Findings |
| 2023 | MLP | Auto-Regressive Next-Token Predictors are Universal Learners | arXiv |
| 2023 | Convolutional LM | Hyena Hierarchy: Towards Larger Convolutional Language Models | ICML |
| 2023 | Sub-quadratic Matrices based | Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture | NeurIPS |
| 2023 | Selective State Space Model | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv |
| 2022 | Mixture of Experts | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
| 2022 | Mixture of Experts | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
| 2022 | Mixture of Experts | Mixture-of-Experts with Expert Choice Routing | NeurIPS |
| 2022 | Mixture of Experts | Efficient Large Scale Language Modeling with Mixtures of Experts | EMNLP |
| 2017 | Mixture of Experts | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | ICLR |
LLM Pre-Training
Memory Efficiency
Distributed Training
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Model Parallelism | ProTrain: Efficient LLM Training via Adaptive Memory Management | arXiv |
| 2024 | Model Parallelism | MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | arXiv |
| 2023 | Data Parallelism | Palm: Scaling language modeling with pathways | JMLR |
| 2023 | Model Parallelism | Bpipe: memory-balanced pipeline parallelism for training large language models | ICML |
| 2022 | Model Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2021 | Data Parallelism | FairScale: A general purpose modular PyTorch library for high performance and large scale training | GitHub |
| 2020 | Data Parallelism | Zero: Memory optimizations toward training trillion parameter models | IEEE SC20 |
| 2019 | Model Parallelism | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS |
| 2019 | Model Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | arXiv |
| 2019 | Model Parallelism | PipeDream: generalized pipeline parallelism for DNN training | SOSP |
| 2018 | Model Parallelism | Mesh-tensorflow: Deep learning for supercomputers | NeurIPS |
Mixed precision training
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2022 | Mixed Precision Training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |
| 2018 | Mixed Precision Training | Bert: Pre-training of deep bidirectional transformers for language understanding | NAACL |
| 2017 | Mixed Precision Training | Mixed Precision Training | ICLR |
Data Efficiency
Importance Sampling
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Importance sampling | LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | arXiv |
| 2023 | Survey on importance sampling | A Survey on Efficient Training of Transformers | IJCAI |
| 2023 | Importance sampling | Data-Juicer: A One-Stop Data Processing System for Large Language Models | arXiv |
| 2023 | Importance sampling | INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models | EMNLP |
| 2023 | Importance sampling | Machine Learning Force Fields with Data Cost Aware Training | ICML |
| 2022 | Importance sampling | Beyond neural scaling laws: beating power law scaling via data pruning | NeurIPS |
| 2021 | Importance sampling | Deep Learning on a Data Diet: Finding Important Examples Early in Training | NeurIPS |
| 2018 | Importance sampling | Training Deep Models Faster with Robust, Approximate Importance Sampling | NeurIPS |
| 2018 | Importance sampling | Not All Samples Are Created Equal: Deep Learning with Importance Sampling | ICML |
Data Augmentation
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Data augmentation | LLMRec: Large Language Models with Graph Augmentation for Recommendation | WSDM |
| 2024 | Data augmentation | LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition | arXiv |
| 2023 | Data augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
| 2023 | Data augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
| 2023 | Data augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
| 2023 | Data augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |
Training Objective
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
| 2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
| 2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |
| 2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
| 2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
LLM Fine-Tuning
Parameter-Efficient Fine-Tuning
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | LoRA-based fine-tuning | Dlora: Distributed parameter-efficient fine-tuning solution for large language model | arXiv |
| 2024 | LoRA-based fine-tuning | SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models | arXiv |
| 2024 | LoRA-based fine-tuning | Data-efficient Fine-tuning for LLM-based Recommendation | SIGIR |
| 2024 | LoRA-based fine-tuning | MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter | ACL |
| 2023 | LoRA-based fine-tuning | DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation | EACL |
| 2022 | Masking-based fine-tuning | Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively | NeurIPS |
| 2021 | Masking-based fine-tuning | BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models | ACL |
| 2021 | Masking-based fine-tuning | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP |
| 2021 | Masking-based fine-tuning | Unlearning Bias in Language Models by Partitioning Gradients | ACL |
| 2019 | Masking-based fine-tuning | SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization | ACL |
Full-Parameter Fine-Tuning
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Full-parameter fine-tuning | Hift: A hierarchical full parameter fine-tuning strategy | arXiv |
| 2024 | Study of full-parameter fine-tuning optimizations | A Study of Optimizations for Fine-tuning Large Language Models | arXiv |
| 2023 | Comparative study between full-parameter and LoRA-based fine-tuning | A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model | arXiv |
| 2023 | Comparative study between full-parameter and parameter-efficient fine-tuning | Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification | arXiv |
| 2023 | Full-parameter fine-tuning with limited resources | Full Parameter Fine-tuning for Large Language Models with Limited Resources | arXiv |
| 2023 | Memory-efficient fine-tuning | Fine-Tuning Language Models with Just Forward Passes | NeurIPS |
| 2023 | Full-parameter fine-tuning for medicine applications | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | arXiv |
| 2022 | Drawback of full-parameter fine-tuning | Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution | ICLR |
LLM Inference
Model Compression
Pruning
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Unstructured Pruning | SparseLLM: Towards Global Pruning for Pre-trained Language Models | NeurIPS |
| 2024 | Structured Pruning | Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models | arXiv |
| 2024 | Structured Pruning | BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | arXiv |
| 2024 | Structured Pruning | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | arXiv |
| 2024 | Structured Pruning | NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models | arXiv |
| 2024 | Structured Pruning | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR |
| 2024 | Unstructured Pruning | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ICLR |
| 2024 | Structured Pruning | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | arXiv |
| 2023 | Unstructured Pruning | SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | ICML |
| 2023 | Unstructured Pruning | A Simple and Effective Pruning Approach for Large Language Models | ICLR |
| 2023 | Unstructured Pruning | AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers | TCAD |
| 2023 | Structured Pruning | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS |
| 2023 | Structured Pruning | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML |
| 2023 | Structured Pruning | Structured Pruning for Efficient Generative Pre-trained Language Models | ACL |
| 2023 | Structured Pruning | ZipLM: Inference-Aware Structured Pruning of Language Models | NeurIPS |
| 2023 | Contextual Pruning | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML |
Quantization
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Weight Quantization | Evaluating Quantized Large Language Models | arXiv |
| 2024 | Weight Quantization | I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models | arXiv |
| 2024 | Weight Quantization | ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models | arXiv |
| 2024 | Weight-Activation Co-Quantization | Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs | NeurIPS |
| 2024 | Weight Quantization | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR |
| 2023 | Weight Quantization | Flexround: Learnable rounding based on element-wise division for post-training quantization | ICML |
| 2023 | Weight Quantization | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | EMNLP |
| 2023 | Weight Quantization | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI |
| 2023 | Weight Quantization | Gptq: Accurate post-training quantization for generative pre-trained transformers | ICLR |
| 2023 | Weight Quantization | Dynamic Stashing Quantization for Efficient Transformer Training | EMNLP |
| 2023 | Weight Quantization | Quantization-aware and tensor-compressed training of transformers for natural language understanding | Interspeech |
| 2023 | Weight Quantization | QLoRA: Efficient Finetuning of Quantized LLMs | NeurIPS |
| 2023 | Weight Quantization | Stable and low-precision training for large-scale vision-language models | NeurIPS |
| 2023 | Weight Quantization | Prequant: A task-agnostic quantization approach for pre-trained language models | ACL |
| 2023 | Weight Quantization | Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization | ISCA |
| 2023 | Weight Quantization | Awq: Activation-aware weight quantization for LLM compression and acceleration | arXiv |
| 2023 | Weight Quantization | Spqr: A sparse-quantized representation for near-lossless LLM weight compression | arXiv |
| 2023 | Weight Quantization | SqueezeLLM: Dense-and-Sparse Quantization | arXiv |
| 2023 | Weight Quantization | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | arXiv |
| 2022 | Activation Quantization | Gact: Activation compressed training for generic network architectures | ICML |
| 2022 | Fixed-point Quantization | Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | ACL |
| 2021 | Activation Quantization | Ac-gc: Lossy activation compression with guaranteed convergence | NeurIPS |
Dynamic Acceleration
Input Pruning
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Score-based Token Removal | Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation | COLM |
| 2024 | Score-based Token Removal | LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | arXiv |
| 2024 | Learning-based Token Removal | LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression | ACL |
| 2024 | Learning-based Token Removal | Compressed Context Memory For Online Language Model Interaction | ICLR |
| 2023 | Score-based Token Removal | Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference | KDD |
| 2023 | Learning-based Token Removal | PuMer: Pruning and Merging Tokens for Efficient Vision Language Models | ACL |
| 2023 | Learning-based Token Removal | Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model | arXiv |
| 2023 | Learning-based Token Removal | SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models | arXiv |
| 2022 | Learning-based Token Removal | Transkimmer: Transformer Learns to Layer-wise Skim | ACL |
| 2022 | Score-based Token Removal | Learned Token Pruning for Transformers | KDD |
| 2021 | Learning-based Token Removal | TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference | NAACL |
| 2021 | Score-based Token Removal | Efficient sparse attention architecture with cascade token and head pruning | HPCA |
System Design
Deployment optimization
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Hardware Optimization | LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration | arXiv |
| 2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | PMLR |
| 2023 | Hardware offloading | Fast distributed inference serving for large language models | arXiv |
| 2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |
| 2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
Support Infrastructure
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2024 | Edge devices | MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases | ICML |
| 2024 | Edge devices | EdgeShard: Efficient LLM Inference via Collaborative Edge Computing | arXiv |
| 2024 | Edge devices | Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs | ICML |
| 2024 | Edge devices | The breakthrough memory solutions for improved performance on LLM inference | IEEE Micro |
| 2024 | Edge devices | MELTing point: Mobile Evaluation of Language Transformers | MobiCom |
| 2024 | Edge devices | LLM as a System Service on Mobile Devices | arXiv |
| 2024 | Edge devices | LocMoE: A Low-overhead MoE for Large Language Model Training | arXiv |
| 2024 | Edge devices | Jetmoe: Reaching Llama2 performance with 0.1M dollars | arXiv |
| 2023 | Edge devices | Training Large-Vocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices | ICASSP |
| 2023 | Edge devices | Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly | arXiv |
| 2023 | Libraries | Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training | ICPP |
| 2023 | Libraries | GPT-NeoX-20B: An Open-Source Autoregressive Language Model | ACL |
| 2023 | Edge devices | Large Language Models Empowered Autonomous Edge AI for Connected Intelligence | arXiv |
| 2022 | Libraries | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
| 2022 | Libraries | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI |
| 2022 | Edge devices | EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation | arXiv |
| 2022 | Edge devices | ProFormer: Towards On-Device LSH Projection-Based Transformers | ACL |
| 2021 | Edge devices | Generate More Features with Cheap Operations for BERT | ACL |
| 2021 | Edge devices | SqueezeBERT: What can computer vision teach NLP about efficient neural networks? | SustaiNLP |
| 2020 | Edge devices | Lite Transformer with Long-Short Range Attention | arXiv |
| 2019 | Libraries | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | arXiv |
| 2018 | Libraries | Mesh-TensorFlow: Deep Learning for Supercomputers | NeurIPS |
Other Systems
| Date | Keywords | Paper | Venue |
| :--- | :------- | :---- | :---- |
| 2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
| 2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |
Resource-Efficiency Evaluation Metrics & Benchmarks
Computation Metrics
| Metric | Description | Example Usage |
| :----- | :---------- | :------------ |
| FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
| Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days] [hours] |
| Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds] [next-token generation latency in milliseconds] |
| Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s] [queries/s] |
| Speed-Up Ratio | the improvement in inference speed relative to a baseline model | [inference time speed-up] [throughput speed-up] |
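To make the timing metrics above concrete, here is a minimal, library-agnostic sketch of how latency, throughput, and speed-up ratio are typically measured; `generate_fn` is a placeholder for whatever inference stack you benchmark, not an API from the papers above.

```python
import time

def measure_generation(generate_fn, prompt, n_runs=5):
    """Time a text-generation callable and report average latency and throughput.

    `generate_fn(prompt)` stands in for any inference call that returns the
    generated output tokens (e.g., a list of token ids).
    """
    latencies, token_counts = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        output_tokens = generate_fn(prompt)
        latencies.append(time.perf_counter() - start)  # wall-clock seconds per request
        token_counts.append(len(output_tokens))

    avg_latency = sum(latencies) / n_runs              # inference latency (s)
    throughput = sum(token_counts) / sum(latencies)    # output tokens per second
    return avg_latency, throughput

# Speed-up ratio is simply the baseline latency divided by the optimized latency
# (equivalently, optimized throughput divided by baseline throughput).
```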
Memory Metrics
| Metric | Description | Example Usage |
| :----- | :---------- | :------------ |
| Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
| Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
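For a PyTorch model, the parameter count and the raw (uncompressed) weight storage can be read directly from the parameter tensors, as in the sketch below; note that peak memory at inference time is higher because of activations and the KV cache.

```python
import torch.nn as nn

def param_count_and_size(model: nn.Module):
    """Return the number of parameters and their raw storage footprint in GB."""
    n_params = sum(p.numel() for p in model.parameters())
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_params, n_bytes / 1e9

# Toy example; a real LLM would be loaded from a checkpoint instead.
model = nn.Linear(4096, 4096)
n_params, size_gb = param_count_and_size(model)
print(f"{n_params:,} parameters, {size_gb:.3f} GB")
```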
⚡️ Energy Metrics
| Metric | Description | Example Usage |
| :----- | :---------- | :------------ |
| Energy Consumption | the electrical energy consumed during the LLM’s lifecycle | [kWh] |
| Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
The following are available software packages designed for real-time tracking of energy consumption and carbon emission.
- CodeCarbon
- Carbontracker
- experiment-impact-tracker
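As a quick illustration, CodeCarbon can wrap a run with a tracker object and report an estimate in kgCO2eq; this is a minimal sketch in which `train_model()` is a placeholder for your own training or inference workload.

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # samples power draw and estimates emissions
tracker.start()
try:
    train_model()  # placeholder: your training or inference loop
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq
    print(f"Estimated carbon emissions: {emissions_kg:.4f} kgCO2eq")
```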
You might also find it helpful to estimate energy usage and carbon footprint before the actual training or inference run.
Financial Cost Metric
| Metric | Description | Example Usage |
| :----- | :---------- | :------------ |
| Dollars per parameter | the total cost of training (or running) the LLM divided by the number of parameters | |
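As a back-of-the-envelope illustration (both numbers below are assumed for the example, not taken from the survey), the metric is just total cost divided by parameter count:

```python
# Hypothetical figures for illustration only.
total_training_cost_usd = 4.6e6   # assumed total training cost in dollars
num_parameters = 175e9            # assumed parameter count

dollars_per_parameter = total_training_cost_usd / num_parameters
print(f"{dollars_per_parameter:.2e} dollars per parameter")
```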
Network Communication Metric
| Metric | Description | Example Usage |
| :----- | :---------- | :------------ |
| Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
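Communication volume can also be estimated analytically before a run. For instance, a ring all-reduce of gradients in data-parallel training moves roughly 2(N-1)/N times the gradient buffer per worker per step; the sketch below applies that standard formula with assumed model and cluster sizes.

```python
def ring_allreduce_volume_tb(num_params: int, bytes_per_grad: int,
                             num_workers: int, num_steps: int) -> float:
    """Estimated per-worker network traffic (TB) for gradient all-reduce.

    Assumes a ring all-reduce, which transfers about 2 * (N - 1) / N times
    the gradient buffer per worker on every training step.
    """
    grad_bytes = num_params * bytes_per_grad
    per_step_bytes = 2 * (num_workers - 1) / num_workers * grad_bytes
    return per_step_bytes * num_steps / 1e12

# Assumed setup: 7B parameters, fp16 gradients (2 bytes), 64 workers, 100k steps.
print(f"{ring_allreduce_volume_tb(7_000_000_000, 2, 64, 100_000):.0f} TB per worker")
```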
Other Metrics
| Metric | Description | Example Usage |
| :----- | :---------- | :------------ |
| Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate] [percentage of weights remaining] |
| Loyalty/Fidelity | the resemblance between the teacher and student models in terms of prediction consistency and alignment of predicted probability distributions | [loyalty] [fidelity] |
| Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
| Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)] [Pareto frontier (performance and FLOPs)] |
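Compression ratio and the percentage of weights remaining reduce to simple arithmetic over model sizes and parameter counts; a minimal sketch with assumed figures:

```python
def compression_stats(original_bytes: float, compressed_bytes: float,
                      original_params: int, remaining_params: int):
    """Compute compression ratio and the percentage of weights remaining."""
    compression_ratio = original_bytes / compressed_bytes
    weights_remaining_pct = 100.0 * remaining_params / original_params
    return compression_ratio, weights_remaining_pct

# Assumed example: a 14 GB fp16 model compressed to 3.5 GB with half the weights pruned.
ratio, remaining = compression_stats(14e9, 3.5e9, 7_000_000_000, 3_500_000_000)
print(f"compression ratio {ratio:.1f}x, {remaining:.0f}% of weights remaining")
```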
Benchmarks
| Benchmark | Description | Paper |
| :-------- | :---------- | :---- |
| General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
| Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
| EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
| SustaiNLP 2020 Shared Task | a challenge for developing energy-efficient NLP models, assessing performance across eight NLU tasks with SuperGLUE metrics and measuring energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
| ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
| VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
| Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
| Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |
Reference
If you find this paper list useful in your research, please consider citing:
@article{bai2024beyond,
title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
journal={arXiv preprint arXiv:2401.00625},
year={2024}
}