speculative decoding
0.2.0
Exploration of some recent techniques surrounding speculative decoding

Also have a few ideas of my own that I will try and share in this repository, if they work out. The initial goal is to use it to speed up the text-to-semantic decoder in Spear-TTS
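The draft-then-verify loop at the heart of speculative decoding can be sketched roughly as follows. This is a minimal, hypothetical sketch of the acceptance rule from the Leviathan et al. and Chen et al. papers cited below; the function name and its signature are made up for illustration, and real code would also thread the KV cache through:

```python
import torch

def speculative_step(draft_probs, target_probs, draft_tokens):
    """
    draft_probs:  (gamma, vocab) - draft model distributions q over gamma positions
    target_probs: (gamma + 1, vocab) - target model distributions p (one extra for the bonus token)
    draft_tokens: (gamma,) - tokens sampled from the draft model
    returns the prefix of accepted tokens plus one token from the target (or adjusted) distribution
    """
    accepted = []

    for t, token in enumerate(draft_tokens):
        p = target_probs[t, token]
        q = draft_probs[t, token]

        # accept the draft token with probability min(1, p / q)
        if torch.rand(()) < torch.clamp(p / q, max = 1.):
            accepted.append(token)
        else:
            # on rejection, resample from the adjusted distribution (p - q)+, then stop
            adjusted = (target_probs[t] - draft_probs[t]).clamp(min = 0.)
            adjusted = adjusted / adjusted.sum()
            accepted.append(torch.multinomial(adjusted, 1).squeeze())
            return torch.stack(accepted)

    # all gamma draft tokens accepted - sample one free bonus token from the target model
    accepted.append(torch.multinomial(target_probs[-1], 1).squeeze())
    return torch.stack(accepted)
```

The identity to keep in mind is that accepting with probability min(1, p/q) and resampling rejections from the renormalized (p - q)+ yields samples exactly from the target distribution p, so the speedup is lossless.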
for the early exit scheme, cache the hidden layer during spec decoding, as the small and large model share the same first few layers
for early exit, allow an extra transformer block head (separate from the main transformer stem)
figure out batched spec decoding - different rows may advance at different rates
further optimize batched spec decoding, as some performance is lost to all the indexing - seems like it will take some work for this technique to be actually usable
make batched spec decoding work with the early exit strategy
complete speculative sampling with the prophet transformer idea - seems to work well!
get some wandb charts and see how the prophet compares with the early exit strategy, share on the repository
also run experiments to see whether the prophet transformer brings any benefit to the main model loss. the original prophet paper only did a simple linear projection
for the early exit strategy, try randomly summing the last cached embeddings back into the same model (a la alphafold2 recycling), randomly cropped along the sequence length, and train the early exit loss that way. see if gamma can be improved this way
dedicate a morning to micro-optimizations
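The alphafold2-style recycling idea for the early exit head could look roughly like the following. This is purely a hypothetical sketch: the module names and layer layout are made up, gradients are stopped through the recycled embeddings as in alphafold2, and causal masking is omitted for brevity:

```python
import torch
from torch import nn

class EarlyExitRecycler(nn.Module):
    def __init__(self, dim, vocab, early_layers, late_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.early = nn.ModuleList([nn.TransformerEncoderLayer(dim, 4, batch_first = True) for _ in range(early_layers)])
        self.late = nn.ModuleList([nn.TransformerEncoderLayer(dim, 4, batch_first = True) for _ in range(late_layers)])
        self.recycle_norm = nn.LayerNorm(dim)  # normalize recycled hiddens before summing back in
        self.to_logits = nn.Linear(dim, vocab)

    def forward(self, tokens, recycled = None):
        x = self.embed(tokens)

        if recycled is not None:
            # sum the previous early-exit hiddens back in, gradients stopped (a la alphafold2 recycling)
            x = x + self.recycle_norm(recycled.detach())

        # shared stem - doubles as the small (draft) model
        for layer in self.early:
            x = layer(x)

        early_hiddens = x
        early_logits = self.to_logits(x)  # early exit loss is trained on these

        # remaining layers - the large (verifier) model
        for layer in self.late:
            x = layer(x)

        return self.to_logits(x), early_logits, early_hiddens
```

During training one would sometimes run a first pass to obtain `early_hiddens`, randomly crop along the sequence length, and feed them back as `recycled` on a second pass, so the early layers learn to refine their own draft, hopefully raising the average number of accepted draft tokens (gamma).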
@inproceedings { Leviathan2022FastIF ,
title = { Fast Inference from Transformers via Speculative Decoding } ,
author = { Yaniv Leviathan and Matan Kalman and Y. Matias } ,
booktitle = { International Conference on Machine Learning } ,
year = { 2022 } ,
url = { https://api.semanticscholar.org/CorpusID:254096365 }
}
@inproceedings { sun2023spectr ,
title = { SpecTr: Fast Speculative Decoding via Optimal Transport } ,
author = { Ziteng Sun and Ananda Theertha Suresh and Jae Hun Ro and Ahmad Beirami and Himanshu Jain and Felix Yu and Michael Riley and Sanjiv Kumar } ,
booktitle = { Workshop on Efficient Systems for Foundation Models @ ICML2023 } ,
year = { 2023 } ,
url = { https://openreview.net/forum?id=d0mGsaheuT }
}
@article { Chen2023AcceleratingLL ,
title = { Accelerating Large Language Model Decoding with Speculative Sampling } ,
author = { Charlie Chen and Sebastian Borgeaud and Geoffrey Irving and Jean-Baptiste Lespiau and L. Sifre and John M. Jumper } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2302.01318 } ,
url = { https://api.semanticscholar.org/CorpusID:256503945 }
}
@article { Yan2020ProphetNetPF ,
title = { ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training } ,
author = { Yu Yan and Weizhen Qi and Yeyun Gong and Dayiheng Liu and Nan Duan and Jiusheng Chen and Ruofei Zhang and Ming Zhou } ,
journal = { ArXiv } ,
year = { 2020 } ,
volume = { abs/2001.04063 } ,
url = { https://api.semanticscholar.org/CorpusID:210164665 }
}
@article { Zhang2023DraftV ,
title = { Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding } ,
author = { Jinchao Zhang and Jue Wang and Huan Li and Lidan Shou and Ke Chen and Gang Chen and Sharad Mehrotra } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2309.08168 } ,
url = { https://api.semanticscholar.org/CorpusID:262013673 }
}
@misc { medusa ,
author = { Tianle Cai and Yuhong Li and Zhengyang Geng and Hongwu Peng and Tri Dao } ,
title = { Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads } ,
year = { 2023 } ,
publisher = { GitHub } ,
journal = { GitHub repository } ,
howpublished = { \url{https://github.com/FasterDecoding/Medusa} } ,
}