rtp-llm is used in a wide range of LLM scenarios, for example:
cd rtp-llm/docker
# IMAGE_NAME =
# if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda11
# if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda12
sh ./create_container.sh <CONTAINER_NAME> <IMAGE_NAME>
sh CONTAINER_NAME/sshme.sh
cd ../
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d ' {"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}} '
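If you prefer to call the service from Python rather than curl, the sketch below is a minimal client using only the standard library. It assumes the server started above is listening on localhost:8088; the payload mirrors the curl example, and the raw JSON response is printed without assuming any particular response schema.

```python
# Minimal Python client for the HTTP service started above (illustrative only).
import json
import urllib.request

payload = {
    "prompt": "hello, what is your name",
    "generate_config": {"max_new_tokens": 1000},
}
req = urllib.request.Request(
    "http://localhost:8088",                      # port used by the curl example
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))            # print the raw JSON response
```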
# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the whl that matches your release version; the example below installs the cuda11 build of 0.1.9. For the cuda12 whl package, check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service
cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d ' {"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}} '
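Before starting the server, it can help to confirm that the CUDA version PyTorch was built against matches the whl flavor you installed (cuda118 vs cuda12x). This is only an illustrative sanity check, not part of rtp-llm:

```python
# Illustrative sanity check: the reported CUDA version should match the whl
# flavor installed above (e.g. "11.8" for a +cuda118 whl).
import torch

print("torch CUDA version:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```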
libcufft.so
Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory
Solution: check that your CUDA version matches the rtp-llm package you installed.
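A quick way to confirm whether the library is actually visible to the process is to try loading it directly. This sketch is illustrative and not part of rtp-llm; the library name libcufft.so.11 is taken from the error above.

```python
# Illustrative check: try to load the missing library the same way the runtime would.
import ctypes

try:
    ctypes.CDLL("libcufft.so.11")
    print("libcufft.so.11 loaded successfully")
except OSError as err:
    print("libcufft.so.11 could not be loaded:", err)
```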
libth_transformer.so
Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory
Solution: if you installed via the whl or docker image (i.e. not a bazel build), make sure your current working directory is not rtp-llm; otherwise Python will import the package from the relative path instead of the installed whl.
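To see which maga_transformer package Python would actually import from your current directory, a small illustrative check (not part of rtp-llm) is:

```python
# Illustrative check: print the location of the maga_transformer package that
# Python resolves first. If the path points into your rtp-llm source checkout
# instead of site-packages, change directory before starting the server.
import importlib.util

spec = importlib.util.find_spec("maga_transformer")
print(spec.origin if spec else "maga_transformer not found")
```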
Bazel build timeout
Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)
Solution:
Curl error
Error log: thread '<unnamed>' panicked at 'index out of bounds: the len is 1 but the index is 1', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.8.1/src/dfa.rs:1415:45
Solution: upgrade tiktoken to 0.7.0.
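To confirm the installed version after upgrading, a small illustrative check (not part of rtp-llm) is:

```python
# Illustrative check: report the installed tiktoken version; it should be >= 0.7.0.
# Upgrade with: pip3 install "tiktoken>=0.7.0"
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
```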
Our project is mainly based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM give us reliable performance guarantees. Flash-Attention2 and cutlass have also been a great help in our ongoing performance optimization. Our continuous batching and incremental decoding draw on the implementation in vllm; sampling leverages Transformers, speculative sampling integrates Medusa's implementation, and the multimodal part integrates implementations from llava and qwen-vl. We thank these projects for their inspiration and help.