airllm

Quickstart | Configurations | MacOS | Example notebooks | FAQ

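AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama3.1 405B on 8GB of VRAM.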

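Updates

[2024/08/20] v2.11.0: Support for Qwen2.5.

[2024/08/18] v2.10.1: Support for CPU inference. Support for non-sharded models. Thanks to @NavodPeiris for the great work!

[2024/07/30] Support for Llama3.1 405B (example notebook). Support for 8bit/4bit quantization.

[2024/04/20] AirLLM supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.

[2023/12/25] v2.8.2: Support for running 70B large language models on MacOS.

[2023/12/20] v2.7: Support for AirLLMMixtral.

[2023/12/20] v2.6: Added AutoModel, which detects the model type automatically, so there is no need to provide the model class to initialize a model.

[2023/12/18] v2.5: Added prefetching to overlap model loading and compute. 10% speed improvement.

[2023/12/03] Added support for ChatGLM, QWen, Baichuan, Mistral, and InternLM!

[2023/12/02] Added support for safetensors. All top 10 models on the open LLM leaderboard are now supported.

[2023/12/01] airllm 2.0. Support for compression: 3x runtime speed up!

[2023/11/20] airllm initial version!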

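Quickstart

1. Install package

First, install the airllm pip package.

pip install airllm

2. Inference

Then initialize the model (AirLLMLlama2, or AutoModel as in the example below), passing in the huggingface repo ID of the model being used or its local path. Inference can then be performed just as with a regular transformer model.

(You can also specify where to save the split, layer-wise model via layer_shards_saving_path when initializing AirLLMLlama2.)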

from airllm import AutoModel

MAX_LENGTH = 128
# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    #'I like',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

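Note: During inference, the original model is first decomposed and saved layer by layer. Please make sure there is enough disk space in the huggingface cache directory.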
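
Since the layer-wise copy is written next to the downloaded weights, it can help to check free disk space before the first run. The following is a minimal sketch and not part of AirLLM: the helper names, the assumed cache root (`HF_HOME`, falling back to `~/.cache/huggingface`), and the 2x safety factor (original weights plus split copy) are assumptions for illustration.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def hf_cache_dir() -> Path:
    """Return the assumed Hugging Face cache root (HF_HOME or ~/.cache/huggingface)."""
    return Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))


def existing_parent(path: Path) -> Path:
    """Walk up the tree until an existing directory is found."""
    path = path.expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path


def has_enough_disk(model_size_gb: float,
                    path: Optional[Path] = None,
                    factor: float = 2.0) -> bool:
    """Check that free space covers `factor` times the model size (assumed heuristic)."""
    target = existing_parent(path if path is not None else hf_cache_dir())
    free_bytes = shutil.disk_usage(target).free
    return free_bytes >= model_size_gb * factor * 1024 ** 3
```

For example, a 70B model stored in 16-bit precision is roughly 140 GB of weights, so `has_enough_disk(140)` would ask for about 280 GB free under these assumptions.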

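Model Compression - 3x Inference Speed Up!

We just added model compression based on block-wise quantization. This can further speed up inference by up to 3x, with almost negligible accuracy loss! (See more performance evaluation, and why we use block-wise quantization, in this paper.)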

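How to enable the model compression speed up:

Step 1. Make sure bitsandbytes is installed: pip install -U bitsandbytes
Step 2. Make sure the airllm version is later than 2.0.0: pip install -U airllm
Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):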

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization
                    )

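What are the differences between model compression and quantization?

Quantization normally needs to quantize both weights and activations to really speed things up. This makes it harder to maintain accuracy and to avoid the impact of outliers across all kinds of inputs.

In our case, however, the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. We therefore quantize only the weights, which makes it easier to preserve accuracy.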
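
To make the idea of weight-only, block-wise quantization concrete, here is a minimal pure-Python sketch. It is not AirLLM's implementation (which relies on bitsandbytes); the function names, the block size of 4, and the symmetric absmax int8 scheme are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_blockwise(weights: List[float],
                       block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize weights to int8 codes, storing one absmax scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # An all-zero block gets scale 1.0 to avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes: List[int],
                         scales: List[float],
                         block_size: int = 4) -> List[float]:
    """Recover approximate weights by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


weights = [0.1, -0.2, 0.05, 0.4, 100.0, -50.0, 25.0, 0.0]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
```

Because each block carries its own scale, the outlier 100.0 in the second block does not reduce the precision of the small values in the first block. Only the stored weights shrink; activations are left untouched.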

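Configurations

When initializing the model, the following configurations are supported:

compression: supported options are 4bit and 8bit (for 4-bit or 8-bit block-wise quantization), or None by default for no compression
profiling_mode: supported options are True to output time consumption, or False by default
layer_shards_saving_path: optionally, another path in which to save the split model
hf_token: a huggingface token can be provided here when downloading gated models such as meta-llama/Llama-2-7b-hf
prefetching: prefetching to overlap model loading and compute. Turned on by default. For now, only AirLLMLlama2 supports this.
delete_original: if you are short on disk space, set delete_original to true to delete the originally downloaded hugging face model and keep only the transformed one, saving half of the disk space.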
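
As a rough sketch of how these options might be combined, the helper below collects only the explicitly set options into keyword arguments and validates the compression value. The helper names `build_airllm_kwargs` and `load_model` are hypothetical; the option names come from the list above, and passing them as keyword arguments to `AutoModel.from_pretrained` follows the pattern of the compression example earlier in this document.

```python
from typing import Any, Dict, Optional


def build_airllm_kwargs(compression: Optional[str] = None,
                        profiling_mode: bool = False,
                        layer_shards_saving_path: Optional[str] = None,
                        hf_token: Optional[str] = None,
                        delete_original: bool = False) -> Dict[str, Any]:
    """Collect only the options that differ from their defaults."""
    if compression not in (None, "4bit", "8bit"):
        raise ValueError("compression must be None, '4bit' or '8bit'")
    kwargs: Dict[str, Any] = {}
    if compression is not None:
        kwargs["compression"] = compression
    if profiling_mode:
        kwargs["profiling_mode"] = True
    if layer_shards_saving_path is not None:
        kwargs["layer_shards_saving_path"] = layer_shards_saving_path
    if hf_token is not None:
        kwargs["hf_token"] = hf_token
    if delete_original:
        kwargs["delete_original"] = True
    return kwargs


def load_model(repo_id: str, **options: Any):
    """Hypothetical wrapper: airllm is imported lazily, so it is only needed here."""
    from airllm import AutoModel
    return AutoModel.from_pretrained(repo_id, **build_airllm_kwargs(**options))
```

A call such as `load_model("garage-bAInd/Platypus2-70B-instruct", compression="4bit", delete_original=True)` would then load the model with 4-bit compression and remove the original download after splitting.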

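MacOS

Just install airllm and run the code the same way as on linux. See the Quickstart for more.

Make sure you have installed mlx and torch
You probably need to install native Python; see more here
Only Apple silicon is supported

Example [python notebook] (https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb)

Example Python Notebook

Example Colab notebooks here:

Examples for other models (ChatGLM, QWen, Baichuan, Mistral, etc.):

ChatGLM: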

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

QWen:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

Baichuan, InternLM, Mistral, etc.:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

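To request support for other models: here

Acknowledgement

A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:

GitHub account @SimJeg, the code on Kaggle, and the associated discussion.

FAQ

1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

If you run into this error, the most likely cause is that you have run out of disk space. The process of splitting the model is very disk-intensive. See this. You may need to extend your disk space, clear the huggingface .cache, and rerun.

2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:

For QWen models: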

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

For ChatGLM models:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

3. 401 Client Error....Repo model ... is gated.

Some models are gated and require a huggingface api token. You can provide it via hf_token:

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')

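4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Some models' tokenizers do not have a padding token, so you can either set a padding token or simply turn off the padding config: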

input_tokens = model.tokenizer(input_text,
   return_tensors="pt",
   return_attention_mask=False,
   truncation=True,
   max_length=MAX_LENGTH,
   padding=False  #<-----------   turn off padding
)
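
The other option mentioned above, setting a padding token, is commonly done by reusing the end-of-sequence token. The sketch below assumes the tokenizer exposes `pad_token` and `eos_token` attributes, as Hugging Face tokenizers do; the helper name `ensure_pad_token` is hypothetical and not part of AirLLM.

```python
def ensure_pad_token(tokenizer):
    """Reuse the EOS token as the padding token when the tokenizer has none."""
    if getattr(tokenizer, "pad_token", None) is None:
        if getattr(tokenizer, "eos_token", None) is None:
            raise ValueError("tokenizer has neither a padding token nor an EOS token")
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```

With this in place, `ensure_pad_token(model.tokenizer)` could be called once before tokenizing with `padding=True`.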

Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTex entry:

 @software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}

Contribution

Contributions, ideas, and discussions are welcome!

If you find it useful, please star the project or buy me a coffee!

Additional information

Version: 1.0.0
Type: Other source code
Updated: 2024-12-05
Size: 1.94MB
Source: Github
