rtp-llm is used in a wide range of LLM scenarios, for example:
cd rtp-llm/docker
# IMAGE_NAME =
# if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda11
# if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda12
sh ./create_container.sh <CONTAINER_NAME> <IMAGE_NAME>
sh CONTAINER_NAME/sshme.sh
cd ../
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d ' {"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}} '
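If you prefer to call the service from Python rather than curl, the sketch below is a minimal client using only the standard library. It assumes the server started above is listening on localhost:8088; the payload mirrors the curl example, and the raw JSON response is printed without assuming any particular response schema.

```python
# Minimal Python client for the HTTP service started above (illustrative only).
import json
import urllib.request

payload = {
    "prompt": "hello, what is your name",
    "generate_config": {"max_new_tokens": 1000},
}
req = urllib.request.Request(
    "http://localhost:8088",                      # port used by the curl example
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))            # print the raw JSON response
```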
# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the whl that matches your release version; the example below installs the cuda11 build of 0.1.9. For the cuda12 whl package, check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service
cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d ' {"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}} '
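Before starting the server, it can help to confirm that the CUDA version PyTorch was built against matches the whl flavor you installed (cuda118 vs cuda12x). This is only an illustrative sanity check, not part of rtp-llm:

```python
# Illustrative sanity check: the reported CUDA version should match the whl
# flavor installed above (e.g. "11.8" for a +cuda118 whl).
import torch

print("torch CUDA version:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```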
libcufft.so
Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory
Solution: check that your CUDA version matches the rtp-llm package you installed.
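A quick way to confirm whether the library is actually visible to the process is to try loading it directly. This sketch is illustrative and not part of rtp-llm; the library name libcufft.so.11 is taken from the error above.

```python
# Illustrative check: try to load the missing library the same way the runtime would.
import ctypes

try:
    ctypes.CDLL("libcufft.so.11")
    print("libcufft.so.11 loaded successfully")
except OSError as err:
    print("libcufft.so.11 could not be loaded:", err)
```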
libth_transformer.so
Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory
Solution: if you installed via the whl or docker image (i.e. not a bazel build), make sure your current working directory is not rtp-llm; otherwise Python will import the package from the relative path instead of the installed whl.
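To see which maga_transformer package Python would actually import from your current directory, a small illustrative check (not part of rtp-llm) is:

```python
# Illustrative check: print the location of the maga_transformer package that
# Python resolves first. If the path points into your rtp-llm source checkout
# instead of site-packages, change directory before starting the server.
import importlib.util

spec = importlib.util.find_spec("maga_transformer")
print(spec.origin if spec else "maga_transformer not found")
```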
Bazel build timeout
Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)
Solution:
Curl error
Error log: thread '<unnamed>' panicked at 'index out of bounds: the len is 1 but the index is 1', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.8.1/src/dfa.rs:1415:45
Solution: upgrade tiktoken to 0.7.0.
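To confirm the installed version after upgrading, a small illustrative check (not part of rtp-llm) is:

```python
# Illustrative check: report the installed tiktoken version; it should be >= 0.7.0.
# Upgrade with: pip3 install "tiktoken>=0.7.0"
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))
```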
Our project is mainly based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM give us reliable performance guarantees. Flash-Attention2 and cutlass have also been a great help in our ongoing performance optimization. Our continuous batching and incremental decoding draw on the implementation in vllm; sampling leverages Transformers, speculative sampling integrates Medusa's implementation, and the multimodal part integrates implementations from llava and qwen-vl. We thank these projects for their inspiration and help.