It is used in many LLM scenarios, for example:
cd rtp-llm/docker
# IMAGE_NAME =
# if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda11
# if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:deploy_image_cuda12
sh ./create_container.sh <CONTAINER_NAME> <IMAGE_NAME>
sh CONTAINER_NAME/sshme.sh
cd ../
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
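The same request can also be sent from Python. Below is a minimal sketch using only the standard library; it assumes the service started above is listening on http://localhost:8088 and reuses the payload from the curl example.
# Minimal Python equivalent of the curl call above (sketch).
# Assumes the rtp-llm HTTP service is listening on http://localhost:8088.
import json
import urllib.request

payload = {
    "prompt": "hello, what is your name",
    "generate_config": {"max_new_tokens": 1000},
}
req = urllib.request.Request(
    "http://localhost:8088",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Print the raw JSON response returned by the server.
    print(resp.read().decode("utf-8"))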
# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the whl that matches your release version; the example below uses the cuda11 build of version 0.1.9. For the cuda12 whl package, please check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service
cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
libcufft.so
Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory
Solution: check that your CUDA version matches the rtp-llm version you installed.
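One quick diagnostic (a sketch, not part of the official troubleshooting steps) is to check which CUDA runtime your installed PyTorch build was compiled against and compare it with the rtp-llm whl or image you installed:
# Print the CUDA version this PyTorch build was compiled against,
# e.g. "11.8" for the cuda11 whl or "12.x" for the cuda12 whl.
import torch
print(torch.version.cuda)
# False here often points to a driver/runtime mismatch rather than rtp-llm itself.
print(torch.cuda.is_available())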
libth_transformer.so
Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory
Solution: if you installed via whl or docker (i.e. not a bazel build), make sure your current directory is not rtp-llm; otherwise Python will import the package from the relative path instead of the installed whl.
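One way to verify which copy of the package Python will actually pick up (a sketch using only the standard import machinery):
# Print the location maga_transformer is imported from. If this path points
# into the rtp-llm source checkout instead of site-packages, change to a
# different directory before starting the server.
import maga_transformer
print(maga_transformer.__file__)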
Bazel build timeout
Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)
Solution:
Curl error
Error log: thread '<unnamed>' panicked at 'index out of bounds: the len is 1 but the index is 1', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.8.1/src/dfa.rs:1415:45
Solution: upgrade tiktoken to 0.7.0
Our project is primarily based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM provide us with reliable performance guarantees, and Flash-Attention2 and cutlass have also been a great help in our ongoing performance optimization. Our continuous batching and incremental decoding draw on the implementation of vllm; sampling builds on Transformers, speculative sampling integrates the implementation of Medusa, and the multimodal part integrates the implementations of llava and qwen-vl. We thank these projects for their inspiration and help.