It is used in many LLM scenarios, for example:
cd rtp-llm/docker
# IMAGE_NAME =
# if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda11
# if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda12
sh ./create_container.sh <CONTAINER_NAME> <IMAGE_NAME>
sh CONTAINER_NAME/sshme.sh
cd ../
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
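The same request can also be sent from Python. Below is a minimal sketch using only the standard library; it assumes the service started above is listening on http://localhost:8088 and reuses the payload from the curl example.
# Minimal Python equivalent of the curl call above (sketch).
# Assumes the rtp-llm HTTP service is listening on http://localhost:8088.
import json
import urllib.request

payload = {
    "prompt": "hello, what is your name",
    "generate_config": {"max_new_tokens": 1000},
}
req = urllib.request.Request(
    "http://localhost:8088",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Print the raw JSON response returned by the server.
    print(resp.read().decode("utf-8"))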
# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the whl that matches your release version; the example below uses the cuda11 build of version 0.1.9. For the cuda12 whl package, please check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service
cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
libcufft.so
Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory
Solution: check that your CUDA version matches the rtp-llm version you installed.
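One quick diagnostic (a sketch, not part of the official troubleshooting steps) is to check which CUDA runtime your installed PyTorch build was compiled against and compare it with the rtp-llm whl or image you installed:
# Print the CUDA version this PyTorch build was compiled against,
# e.g. "11.8" for the cuda11 whl or "12.x" for the cuda12 whl.
import torch
print(torch.version.cuda)
# False here often points to a driver/runtime mismatch rather than rtp-llm itself.
print(torch.cuda.is_available())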
libth_transformer.so
Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory
Solution: if you installed via whl or docker (i.e. not a bazel build), make sure your current directory is not rtp-llm; otherwise Python will import the package from the relative path instead of the installed whl.
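One way to verify which copy of the package Python will actually pick up (a sketch using only the standard import machinery):
# Print the location maga_transformer is imported from. If this path points
# into the rtp-llm source checkout instead of site-packages, change to a
# different directory before starting the server.
import maga_transformer
print(maga_transformer.__file__)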
Bazel build timeout
Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)
Solution:
Curl error
Error log: thread '<unnamed>' panicked at 'index out of bounds: the len is 1 but the index is 1', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.8.1/src/dfa.rs:1415:45
Solution: upgrade tiktoken to 0.7.0
Our project is primarily based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM provide us with reliable performance guarantees, and Flash-Attention2 and cutlass have also been a great help in our ongoing performance optimization. Our continuous batching and incremental decoding draw on the implementation of vllm; sampling builds on Transformers, speculative sampling integrates the implementation of Medusa, and the multimodal part integrates the implementations of llava and qwen-vl. We thank these projects for their inspiration and help.