LightLLM

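LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM draws on the strengths of several well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.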

English Docs | Chinese Docs

Features

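Tri-process asynchronous collaboration: tokenization, model inference, and detokenization run asynchronously, which considerably improves GPU utilization.
Nopad (Unpad): supports nopad attention operations across multiple models to efficiently handle requests with large length disparities.
Dynamic Batch: enables dynamic batch scheduling of requests.
FlashAttention: incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
Tensor Parallelism: uses tensor parallelism over multiple GPUs for faster inference.
Token Attention: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference.
High-performance Router: works with Token Attention to carefully manage the GPU memory of each token, thereby optimizing system throughput.
Int8KV Cache: increases the token capacity to almost twice as much. Only LLaMA is supported.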
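
To make the token-wise KV cache idea concrete, here is a minimal sketch of token-granular bookkeeping. It is an illustration only and not LightLLM's implementation: the class `TokenKVAllocator` and its methods are hypothetical names, and real systems track GPU tensors rather than Python lists. The point it shows is that memory is reserved one token at a time, so a request never holds slots for tokens it has not produced.

```python
class TokenKVAllocator:
    """Toy token-granular KV cache bookkeeping (hypothetical, illustrative only)."""

    def __init__(self, max_total_token_num: int):
        # Each slot stands for the KV entries of exactly one token.
        self.free_slots = list(range(max_total_token_num))
        self.request_slots = {}  # request id -> list of slot indices

    def can_admit(self, num_tokens: int) -> bool:
        # A router could use this check before scheduling a request.
        return num_tokens <= len(self.free_slots)

    def allocate(self, request_id: str, num_tokens: int) -> list:
        if not self.can_admit(num_tokens):
            raise MemoryError("not enough free token slots")
        taken = [self.free_slots.pop() for _ in range(num_tokens)]
        self.request_slots.setdefault(request_id, []).extend(taken)
        return taken

    def release(self, request_id: str) -> None:
        # All slots of a finished request return to the pool at once.
        self.free_slots.extend(self.request_slots.pop(request_id, []))


allocator = TokenKVAllocator(max_total_token_num=8)
allocator.allocate("req-a", 5)  # prompt of 5 tokens
allocator.allocate("req-a", 1)  # one newly decoded token
allocator.allocate("req-b", 2)
print(len(allocator.free_slots))  # 0
allocator.release("req-a")
print(len(allocator.free_slots))  # 6
```

Because slots are handed out per token rather than per padded sequence, requests of very different lengths can share the same pool without reserving unused space.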

Supported Model List

BLOOM
LLaMA
LLaMA V2
StarCoder
Qwen-7b
ChatGLM2-6b
InternLM-7b
InternVL-Chat
Qwen-VL
Qwen-VL-Chat
Qwen2-VL
Llava-7b
Llava-13b
Mixtral
Stablelm
MiniCPM
Phi-3
CohereForAI
DeepSeek-V2-Lite
DeepSeek-V2

When you start Qwen-7b, you need to set the parameters '--eos_id 151643 --trust_remote_code'.

ChatGLM2 needs the parameter '--trust_remote_code'.

InternLM needs the parameter '--trust_remote_code'.

InternVL-Chat(Phi3) needs the parameters '--eos_id 32007 --trust_remote_code'.

InternVL-Chat(InternLM2) needs the parameters '--eos_id 92542 --trust_remote_code'.

Qwen2-VL-7b needs the parameters '--eos_id 151645 --trust_remote_code', and transformers must be upgraded to the latest version with 'pip install git+https://github.com/huggingface/transformers'.

Stablelm needs the parameter '--trust_remote_code'.

Phi-3 only supports Mini and Small.

DeepSeek-V2-Lite and DeepSeek-V2 need the parameter '--data_type bfloat16'.
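
The per-model flags above can be kept in one lookup table so that launch commands are assembled consistently. The following is a small sketch, not part of LightLLM: `EXTRA_ARGS` and `build_launch_command` are hypothetical helper names, the flag values are copied from the notes above, and the base flags (`--model_dir`, `--host`, `--port`) are the ones used by the launch commands in this README.

```python
import shlex

# Extra flags per model, copied from the notes above.
EXTRA_ARGS = {
    "Qwen-7b": ["--eos_id", "151643", "--trust_remote_code"],
    "ChatGLM2": ["--trust_remote_code"],
    "InternLM": ["--trust_remote_code"],
    "InternVL-Chat(Phi3)": ["--eos_id", "32007", "--trust_remote_code"],
    "InternVL-Chat(InternLM2)": ["--eos_id", "92542", "--trust_remote_code"],
    "Qwen2-VL-7b": ["--eos_id", "151645", "--trust_remote_code"],
    "Stablelm": ["--trust_remote_code"],
    "DeepSeek-V2-Lite": ["--data_type", "bfloat16"],
    "DeepSeek-V2": ["--data_type", "bfloat16"],
}


def build_launch_command(model_name, model_dir, port=8080):
    """Return a shell-quoted api_server command for the given model (hypothetical helper)."""
    base = [
        "python", "-m", "lightllm.server.api_server",
        "--model_dir", model_dir,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    # Models without special requirements get no extra flags.
    return shlex.join(base + EXTRA_ARGS.get(model_name, []))


print(build_launch_command("Qwen-7b", "/path/Qwen-7b"))
```

Models that are not in the table, such as LLaMA, fall through to the plain command.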

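Get started

Requirements

The code has been tested with Pytorch>=1.3, CUDA 11.8, and Python 3.9. To install the necessary dependencies, refer to the provided requirements.txt and follow the instructions below: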

# for cuda 11.8
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
# this version of nccl can support torch cuda graph
pip install nvidia-nccl-cu12==2.20.5

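Container

You can use the official Docker container to run the model more easily. To do this, follow these steps:

Pull the container from the GitHub Container Registry: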
```
docker pull ghcr.io/modeltc/lightllm:main
```

Run the container with GPU support and port mapping:

docker run -it --gpus all -p 8080:8080 \
        --shm-size 1g -v your_local_path:/data/ \
        ghcr.io/modeltc/lightllm:main /bin/bash

Alternatively, you can build the container yourself:

docker build -t <image_name> .
docker run -it --gpus all -p 8080:8080 \
        --shm-size 1g -v your_local_path:/data/ \
        <image_name> /bin/bash

You can also use a helper script to launch both the container and the server:
```
python tools/quick_launch_docker.py --help
```
Note: If you use multiple GPUs, you may need to increase the shared memory size by adding --shm-size to the docker run command.

Installation

Install from the source code:

python setup.py install

Install Triton Package

The code has been tested on a range of GPUs, including V100, A100, A800, 4090, and H800. If you are running the code on A100, A800, etc., we recommend using triton==3.0.0.

pip install triton==3.0.0 --no-deps

If you are running the code on H800 or V100, you can try triton-nightly to get better performance.

pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps

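Run LLaMA

With the efficient Router and TokenAttention, LightLLM can be deployed as a service and achieve state-of-the-art throughput performance.

Launch the server:

python -m lightllm.server.api_server --model_dir /path/llama-7B \
                                     --host 0.0.0.0 \
                                     --port 8080 \
                                     --tp 1 \
                                     --max_total_token_num 120000

The parameter max_total_token_num is influenced by the GPU memory of the deployment environment. You can also specify --mem_faction to have it calculated automatically.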

python -m lightllm.server.api_server --model_dir /path/llama-7B \
                                     --host 0.0.0.0 \
                                     --port 8080 \
                                     --tp 1 \
                                     --mem_faction 0.9
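
To see why max_total_token_num depends on GPU memory, a rough back-of-the-envelope estimate helps. The sketch below is an assumption-laden approximation and not how LightLLM computes the value: it counts only fp16 key/value storage, assumes about 13 GiB for the LLaMA-7B weights, and ignores activations and other runtime overhead. The function names and the `usable_fraction` parameter are hypothetical. The layer count (32) and hidden size (4096) are those of LLaMA-7B.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    # One key vector and one value vector are stored per layer for every token.
    return 2 * num_layers * hidden_size * bytes_per_value


def estimate_max_total_token_num(gpu_mem_gib, weight_gib, usable_fraction, per_token_bytes):
    # Memory left for the KV cache after reserving a fraction of the GPU
    # and subtracting the (assumed) weight footprint.
    usable_gib = gpu_mem_gib * usable_fraction - weight_gib
    return int(usable_gib * 1024**3 // per_token_bytes)


per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Assumed setup: 80 GiB GPU, 90% usable, roughly 13 GiB of fp16 weights.
print(estimate_max_total_token_num(80, 13, 0.9, per_token))
```

Under these assumptions the estimate lands in the same range as the 120000 used in the launch command above, which suggests that value was chosen for an 80G card.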

To initiate a query in the shell:

curl http://127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
    -H 'Content-Type: application/json'

To query from Python:

import time
import requests
import json

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'What is AI?',
    "parameters": {
        'do_sample': False,
        'ignore_eos': False,
        'max_new_tokens': 1024,
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)

Run Multimodal Model

Run QWen-VL

python -m lightllm.server.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --tp 1 \
    --max_total_token_num 12000 \
    --trust_remote_code \
    --enable_multimodal \
    --cache_capacity 1000 \
    --model_dir /path/of/Qwen-VL or /path/of/Qwen-VL-Chat

Run Llava

python -m lightllm.server.api_server \
    --host 0.0.0.0 \
    --port 8080 \
    --tp 1 \
    --max_total_token_num 12000 \
    --trust_remote_code \
    --enable_multimodal \
    --cache_capacity 1000 \
    --model_dir /path/of/llava-v1.5-7b or /path/of/llava-v1.5-13b

Query From QWen-VL

import time
import requests
import json
import base64

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}

uri = "/local/path/of/image"  # or "/http/path/of/image"
if uri.startswith("http"):
    images = [{"type": "url", "data": uri}]
else:
    with open(uri, 'rb') as fin:
        b64 = base64.b64encode(fin.read()).decode("utf-8")
    images = [{'type': "base64", "data": b64}]

data = {
    "inputs": "<img></img>Generate the caption in English with grounding:",
    "parameters": {
        "max_new_tokens": 200,
        # The space before <|endoftext|> is important: the server removes the first bos_token_id, but the QWen tokenizer does not have a bos_token_id
        "stop_sequences": [" <|endoftext|>"],
    },
    "multimodal_params": {
        "images": images,
    }
}

response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)

Query From QWen-VL-Chat

import json
import requests
import base64

def run_once(query, uris):
    images = []
    for uri in uris:
        if uri.startswith("http"):
            images.append({"type": "url", "data": uri})
        else:
            with open(uri, 'rb') as fin:
                b64 = base64.b64encode(fin.read()).decode("utf-8")
            images.append({'type': "base64", "data": b64})

    data = {
        "inputs": query,
        "parameters": {
            "max_new_tokens": 200,
            # The space before <|endoftext|> is important: the server removes the first bos_token_id, but the QWen tokenizer does not have a bos_token_id
            "stop_sequences": [" <|endoftext|>", " <|im_start|>", " <|im_end|>"],
        },
        "multimodal_params": {
            "images": images,
        }
    }

    # url = "http://127.0.0.1:8080/generate_stream"
    url = "http://127.0.0.1:8080/generate"
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        print(" + result: ({})".format(response.json()))
    else:
        print(' + error: {}, {}'.format(response.status_code, response.text))

"""
multi-img, multi-round:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<img></img>
<img></img>
上面两张图片分别是哪两个城市？请对它们进行对比。<|im_end|>
<|im_start|>assistant
根据提供的信息，两张图片分别是重庆和北京。<|im_end|>
<|im_start|>user
这两座城市分别在什么地方？<|im_end|>
<|im_start|>assistant
"""
run_once(
    uris=[
        "assets/mm_tutorial/Chongqing.jpeg",
        "assets/mm_tutorial/Beijing.jpeg",
    ],
    query="<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<img></img>\n<img></img>\n上面两张图片分别是哪两个城市？请对它们进行对比。<|im_end|>\n<|im_start|>assistant\n根据提供的信息，两张图片分别是重庆和北京。<|im_end|>\n<|im_start|>user\n这两座城市分别在什么地方？<|im_end|>\n<|im_start|>assistant\n"
)
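
Writing the multi-round prompt by hand is error-prone, so it may help to assemble it from a list of turns. The sketch below reproduces the layout shown in the docstring above. `build_qwen_chat_prompt` is a hypothetical helper, not part of LightLLM, and placing all image tags at the start of the first user turn is an assumption taken from this one example.

```python
def build_qwen_chat_prompt(system, turns, num_images=0):
    """Build a multi-round prompt; `turns` is a list of (role, text) pairs.

    The final assistant turn is left open so the model continues from it.
    """
    parts = ["<|im_start|>system\n{}<|im_end|>\n".format(system)]
    images_pending = True
    for role, text in turns:
        if role == "user" and images_pending:
            # Assumption: all image tags go before the first user message.
            text = "<img></img>\n" * num_images + text
            images_pending = False
        parts.append("<|im_start|>{}\n{}<|im_end|>\n".format(role, text))
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_qwen_chat_prompt(
    system="You are a helpful assistant.",
    turns=[
        ("user", "上面两张图片分别是哪两个城市？请对它们进行对比。"),
        ("assistant", "根据提供的信息，两张图片分别是重庆和北京。"),
        ("user", "这两座城市分别在什么地方？"),
    ],
    num_images=2,
)
print(prompt)
```

The resulting string can be passed as the `query` argument of `run_once`, together with one image URI per `<img></img>` tag.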

Query From Llava

import time
import requests
import json
import base64

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}

uri = "/local/path/of/image"  # or "/http/path/of/image"
if uri.startswith("http"):
    images = [{"type": "url", "data": uri}]
else:
    with open(uri, 'rb') as fin:
        b64 = base64.b64encode(fin.read()).decode("utf-8")
    images = [{'type': "base64", "data": b64}]

data = {
    "inputs": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nPlease explain the picture. ASSISTANT:",
    "parameters": {
        "max_new_tokens": 200,
    },
    "multimodal_params": {
        "images": images,
    }
}

response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
    print(response.json())
else:
    print('Error:', response.status_code, response.text)

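Additional launch parameters: --enable_multimodal and --cache_capacity. A larger --cache_capacity requires a larger shm-size.

--tp > 1 is supported; when tp > 1, the visual model runs on GPU 0.

The special image tag for Qwen-VL is <img></img> (<image> for Llava). The length of data["multimodal_params"]["images"] should equal the number of tags, which can be 0, 1, 2, ...

Input image format: a list of dicts such as {'type': 'url'/'base64', 'data': xxx}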
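
Since the number of image tags must match the number of images, a small check before sending the request can catch mismatches early. The sketch below is one way to do this: `to_image_item` and `build_payload` are hypothetical helpers, not part of LightLLM, and the payload layout follows the query examples above.

```python
import base64


def to_image_item(uri):
    """Convert a URL or a local file path into the dict format described above."""
    if uri.startswith("http"):
        return {"type": "url", "data": uri}
    with open(uri, "rb") as fin:
        return {"type": "base64", "data": base64.b64encode(fin.read()).decode("utf-8")}


def build_payload(prompt, uris, image_tag="<img></img>", max_new_tokens=200):
    """Build a /generate request body, rejecting tag/image count mismatches."""
    tag_count = prompt.count(image_tag)
    if tag_count != len(uris):
        raise ValueError(
            "prompt has {} image tag(s) but {} image(s) were given".format(tag_count, len(uris))
        )
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
        "multimodal_params": {"images": [to_image_item(u) for u in uris]},
    }


payload = build_payload("<img></img>Describe the picture.", ["http://example.com/cat.jpg"])
print(payload["multimodal_params"]["images"])
```

For Llava, pass `image_tag="<image>"` so that the count is taken against the tag that model expects.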

Performance

Service Performance

We compared the service performance of LightLLM and vLLM==0.1.2 on LLaMA-7B using an A800 with 80G GPU memory.

First, prepare the data as follows:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Launch the service:

python -m lightllm.server.api_server --model_dir /path/llama-7b --tp 1 --max_total_token_num 121060 --tokenizer_mode auto

Evaluation:

cd test
python benchmark_serving.py --tokenizer /path/llama-7b --dataset /path/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200

The performance comparison results are presented below:

| vLLM | LightLLM |
| --- | --- |
| Total time: 361.79 s, Throughput: 5.53 requests/s | Total time: 188.85 s, Throughput: 10.59 requests/s |
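
The throughput figures follow directly from the total times and the 2000 prompts passed via --num-prompts. The short check below reproduces them and derives the relative speedup, which the table does not state explicitly. The variable names are illustrative only.

```python
NUM_PROMPTS = 2000  # value of --num-prompts in the benchmark command

# Total benchmark time in seconds, taken from the comparison above.
total_time = {"vLLM": 361.79, "LightLLM": 188.85}

# Throughput is simply completed requests divided by wall-clock time.
throughput = {name: NUM_PROMPTS / seconds for name, seconds in total_time.items()}
for name, value in throughput.items():
    print("{}: {:.2f} requests/s".format(name, value))

speedup = total_time["vLLM"] / total_time["LightLLM"]
print("speedup: {:.2f}x".format(speedup))
```

In this benchmark LightLLM therefore finishes the same workload in a little over half the time.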

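Static Inference Performance

For debugging, we provide static performance testing scripts for various models. For instance, you can evaluate the inference performance of the LLaMA model with: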

cd test/model
python test_llama.py

FAQ

The LLaMA tokenizer fails to load.
- Consider resolving this by running the command pip install protobuf==3.20.0.
error : PTX .version 7.4 does not support .target sm_89
- Launch with bash tools/resolve_ptx_version python -m lightllm.server.api_server ...

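Projects using lightllm

If you have a project that should be listed here, please contact us via email or create a pull request.

LazyLLM: the easiest and laziest way to build multi-agent LLM applications.

Once you have installed lightllm and lazyllm, you can use the following code to build your own chatbot: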

from lazyllm import TrainableModule, deploy, WebModule
# The model will be downloaded automatically if you have an internet connection
m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
WebModule(m).start().wait()