lightllm 다운로드 - lightllm 소스 코드 다운로드

LightLLM은 Python 기반 LLM(Large Language Model) 추론 및 제공 프레임워크로, 경량 설계, 손쉬운 확장성 및 고속 성능으로 유명합니다. LightLLM은 FasterTransformer, TGI, vLLM 및 FlashAttention을 포함하되 이에 국한되지 않고 널리 알려진 수많은 오픈 소스 구현의 장점을 활용합니다.

영어 문서 | 중국어(중국어)

특징

3중 프로세스 비동기식 협업: 토큰화, 모델 추론, 토큰 해제가 비동기식으로 수행되어 GPU 활용도가 크게 향상됩니다.
Nopad(Unpad): 길이 차이가 큰 요청을 효율적으로 처리하기 위해 여러 모델에 걸쳐 nopad 어텐션 작업을 지원합니다.
동적 배치: 요청의 동적 배치 예약을 활성화합니다.
FlashAttention: FlashAttention을 통합하여 추론 중에 속도를 향상하고 GPU 메모리 공간을 줄입니다.
텐서 병렬성: 더 빠른 추론을 위해 여러 GPU에서 텐서 병렬성을 활용합니다.
Token Attention: 토큰 방식의 KV 캐시 메모리 관리 메커니즘을 구현하여 추론 중에 메모리 낭비를 방지합니다.
고성능 라우터: Token Attention과 협력하여 각 토큰의 GPU 메모리를 꼼꼼하게 관리하여 시스템 처리량을 최적화합니다.
Int8KV 캐시: 이 기능은 토큰 용량을 거의 두 배로 늘립니다. 라마만 지원합니다.

지원 모델 목록

꽃
야마
라마 V2
스타코더
Qwen-7b
채팅GLM2-6b
인턴LM-7b
InternVL-채팅
Qwen-VL
Qwen-VL-채팅
Qwen2-VL
라바-7b
라바-13b
믹스트랄
안정
미니CPM
파이-3
CohereForAI
DeepSeek-V2-Lite
DeepSeek-V2

Qwen-7b를 시작할 때 '--eos_id 151643 --trust_remote_code' 매개변수를 설정해야 합니다.

ChatGLM2에서는 '--trust_remote_code' 매개변수를 설정해야 합니다.

InternLM은 '--trust_remote_code' 매개변수를 설정해야 합니다.

InternVL-Chat(Phi3)은 '--eos_id 32007 --trust_remote_code' 매개변수를 설정해야 합니다.

InternVL-Chat(InternLM2)은 '--eos_id 92542 --trust_remote_code' 매개변수를 설정해야 합니다.

Qwen2-VL-7b는 '--eos_id 151645 --trust_remote_code' 매개변수를 설정하고 'pip install git+https://github.com/huggingface/transformers'를 사용하여 최신 버전으로 업그레이드해야 합니다.

Stablelm에서는 '--trust_remote_code' 매개변수를 설정해야 합니다.

Phi-3는 Mini와 Small만 지원합니다.

DeepSeek-V2-Lite 및 DeepSeek-V2는 '--data_type bfloat16' 매개변수를 설정해야 합니다.

시작하기

요구사항

코드는 Pytorch>=1.3, CUDA 11.8 및 Python 3.9에서 테스트되었습니다. 필요한 종속성을 설치하려면 제공된 요구 사항.txt 를 참조하고 다음 지침을 따르십시오.

 # for cuda 11.8
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
# this version nccl can support torch cuda graph 
pip install nvidia-nccl-cu12==2.20.5

컨테이너

공식 Docker 컨테이너를 사용하면 모델을 더 쉽게 실행할 수 있습니다. 이렇게 하려면 다음 단계를 따르세요.

GitHub Container Registry에서 컨테이너를 가져옵니다.
```
docker pull ghcr.io/modeltc/lightllm:main
```

GPU 지원 및 포트 매핑을 사용하여 컨테이너를 실행합니다.

docker run -it --gpus all -p 8080:8080                  
        --shm-size 1g -v your_local_path:/data/         
        ghcr.io/modeltc/lightllm:main /bin/bash

또는 컨테이너를 직접 빌드할 수도 있습니다.

docker build -t < image_name > .
docker run -it --gpus all -p 8080:8080                  
        --shm-size 1g -v your_local_path:/data/         
        < image_name > /bin/bash

도우미 스크립트를 사용하여 컨테이너와 서버를 모두 시작할 수도 있습니다.
```
python tools/quick_launch_docker.py --help
```
참고: 여러 GPU를 사용하는 경우 docker run 명령에 --shm-size 추가하여 공유 메모리 크기를 늘려야 할 수도 있습니다.

설치

다음을 통해 소스 코드에서 설치

python setup.py install

트리톤 패키지 설치

이 코드는 V100, A100, A800, 4090 및 H800을 포함한 다양한 GPU에서 테스트되었습니다. A100, A800 등에서 코드를 실행하는 경우 triton==3.0.0을 사용하는 것이 좋습니다.

pip install triton==3.0.0 --no-deps

H800 또는 V100에서 코드를 실행하는 경우 triton-nightly를 시도하여 더 나은 성능을 얻을 수 있습니다.

pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps

라마를 달려라

효율적인 라우터 및 TokenAttention을 통해 LightLLM을 서비스로 배포하고 최첨단 처리량 성능을 달성할 수 있습니다.

서버를 시작합니다:

python -m lightllm.server.api_server --model_dir /path/llama-7B     
                                     --host 0.0.0.0                 
                                     --port 8080                    
                                     --tp 1                         
                                     --max_total_token_num 120000

max_total_token_num 매개변수는 배포 환경의 GPU 메모리에 영향을 받습니다. --mem_faction을 지정하여 자동으로 계산되도록 할 수도 있습니다.

python -m lightllm.server.api_server --model_dir /path/llama-7B     
                                     --host 0.0.0.0                 
                                     --port 8080                    
                                     --tp 1                         
                                     --mem_faction 0.9

셸에서 쿼리를 시작하려면 다음 안내를 따르세요.

curl http://127.0.0.1:8080/generate     
    -X POST                             
    -d ' {"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}} ' 
    -H ' Content-Type: application/json '

Python에서 쿼리하려면 다음 안내를 따르세요.

 import time
import requests
import json

url = 'http://localhost:8080/generate'
headers = { 'Content-Type' : 'application/json' }
data = {
    'inputs' : 'What is AI?' ,
    "parameters" : {
        'do_sample' : False ,
        'ignore_eos' : False ,
        'max_new_tokens' : 1024 ,
    }
}
response = requests . post ( url , headers = headers , data = json . dumps ( data ))
if response . status_code == 200 :
    print ( response . json ())
else :
    print ( 'Error:' , response . status_code , response . text )

다중 모드 모델 실행

QWen-VL 실행

python -m lightllm.server.api_server 
    --host 0.0.0.0                 
    --port 8080                    
    --tp 1                         
    --max_total_token_num 12000    
    --trust_remote_code            
    --enable_multimodal            
    --cache_capacity 1000          
    --model_dir /path/of/Qwen-VL or /path/of/Qwen-VL-Chat

라바를 실행하세요

python -m lightllm.server.api_server 
    --host 0.0.0.0                 
    --port 8080                    
    --tp 1                         
    --max_total_token_num 12000    
    --trust_remote_code            
    --enable_multimodal            
    --cache_capacity 1000          
    --model_dir /path/of/llava-v1.5-7b or /path/of/llava-v1.5-13b

QWen-VL에서 쿼리

 import time
import requests
import json
import base64

url = 'http://localhost:8080/generate'
headers = { 'Content-Type' : 'application/json' }

uri = "/local/path/of/image" # or "/http/path/of/image"
if uri . startswith ( "http" ):
    images = [{ "type" : "url" , "data" : uri }]
else :
    with open ( uri , 'rb' ) as fin :
        b64 = base64 . b64encode ( fin . read ()). decode ( "utf-8" )
    images = [{ 'type' : "base64" , "data" : b64 }]

data = {
    "inputs" : "<img></img>Generate the caption in English with grounding:" ,
    "parameters" : {
        "max_new_tokens" : 200 ,
        # The space before <|endoftext|> is important, the server will remove the first bos_token_id, but QWen tokenizer does not has bos_token_id
        "stop_sequences" : [ " <|endoftext|>" ],
    },
    "multimodal_params" : {
        "images" : images ,
    }
}

response = requests . post ( url , headers = headers , data = json . dumps ( data ))
if response . status_code == 200 :
    print ( response . json ())
else :
    print ( 'Error:' , response . status_code , response . text )

QWen-VL-Chat에서 쿼리

 import json
import requests
import base64

def run_once ( query , uris ):
    images = []
    for uri in uris :
        if uri . startswith ( "http" ):
            images . append ({ "type" : "url" , "data" : uri })
        else :
            with open ( uri , 'rb' ) as fin :
                b64 = base64 . b64encode ( fin . read ()). decode ( "utf-8" )
            images . append ({ 'type' : "base64" , "data" : b64 })

    data = {
        "inputs" : query ,
        "parameters" : {
            "max_new_tokens" : 200 ,
            # The space before <|endoftext|> is important, the server will remove the first bos_token_id, but QWen tokenizer does not has bos_token_id
            "stop_sequences" : [ " <|endoftext|>" , " <|im_start|>" , " <|im_end|>" ],
        },
        "multimodal_params" : {
            "images" : images ,
        }
    }

    # url = "http://127.0.0.1:8080/generate_stream"
    url = "http://127.0.0.1:8080/generate"
    headers = { 'Content-Type' : 'application/json' }
    response = requests . post ( url , headers = headers , data = json . dumps ( data ))
    if response . status_code == 200 :
        print ( " + result: ({})" . format ( response . json ()))
    else :
        print ( ' + error: {}, {}' . format ( response . status_code , response . text ))

"""
multi-img, multi-round:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<img></img>
<img></img>
上面两张图片分别是哪两个城市？请对它们进行对比。<|im_end|>
<|im_start|>assistant
根据提供的信息，两张图片分别是重庆和北京。<|im_end|>
<|im_start|>user
这两座城市分别在什么地方？<|im_end|>
<|im_start|>assistant
"""
run_once (
    uris = [
        "assets/mm_tutorial/Chongqing.jpeg" ,
        "assets/mm_tutorial/Beijing.jpeg" ,
    ],
    query = "<|im_start|>system n You are a helpful assistant.<|im_end|> n <|im_start|>user n <img></img> n <img></img> n上面两张图片分别是哪两个城市？请对它们进行对比。<|im_end|> n <|im_start|>assistant n根据提供的信息，两张图片分别是重庆和北京。<|im_end|> n <|im_start|>user n这两座城市分别在什么地方？<|im_end|> n <|im_start|>assistant n "
)

Llava에서 쿼리

 import time
import requests
import json
import base64

url = 'http://localhost:8080/generate'
headers = { 'Content-Type' : 'application/json' }

uri = "/local/path/of/image" # or "/http/path/of/image"
if uri . startswith ( "http" ):
    images = [{ "type" : "url" , "data" : uri }]
else :
    with open ( uri , 'rb' ) as fin :
        b64 = base64 . b64encode ( fin . read ()). decode ( "utf-8" )
    images = [{ 'type' : "base64" , "data" : b64 }]

data = {
    "inputs" : "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> n Please explain the picture. ASSISTANT:" ,
    "parameters" : {
        "max_new_tokens" : 200 ,
    },
    "multimodal_params" : {
        "images" : images ,
    }
}

response = requests . post ( url , headers = headers , data = json . dumps ( data ))
if response . status_code == 200 :
    print ( response . json ())
else :
    print ( 'Error:' , response . status_code , response . text )

추가 LANuch 매개변수: --enable_multimodal , --cache_capacity , 더 큰 --cache_capacity 더 큰 shm-size 필요합니다.

--tp > 1 지원, tp > 1 일 때 시각적 모델은 GPU 0에서 실행됩니다.

Qwen-VL의 특수 이미지 태그는 <img></img> (Llava의 경우 <image> )이며, data["multimodal_params"]["images"] 는 태그 수와 동일해야 합니다. 0, 1, 2, ...이 될 수 있습니다.

입력 이미지 형식: {'type': 'url'/'base64', 'data': xxx} 와 같은 dict 목록

성능

서비스 성과

80G GPU 메모리를 탑재한 A800을 사용하여 LLaMA-7B에서 LightLLM과 vLLM==0.1.2의 서비스 성능을 비교했습니다.

시작하려면 다음과 같이 데이터를 준비하십시오.

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

서비스를 시작합니다:

python -m lightllm.server.api_server --model_dir /path/llama-7b --tp 1 --max_total_token_num 121060 --tokenizer_mode auto

평가:

 cd test
python benchmark_serving.py --tokenizer /path/llama-7b --dataset /path/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200

성능 비교 결과는 다음과 같습니다.

vLLM	라이트LLM
총 시간: 361.79초 처리량: 5.53 요청/초	총 시간: 188.85초 처리량: 10.59 요청/초

정적 추론 성능

디버깅을 위해 다양한 모델에 대한 정적 성능 테스트 스크립트를 제공합니다. 예를 들어 LLaMA 모델의 추론 성능을 다음과 같이 평가할 수 있습니다.

 cd test/model
python test_llama.py

FAQ

LLaMA 토크나이저를 로드하지 못했습니다.
- pip install protobuf==3.20.0 명령을 실행하여 이 문제를 해결하는 것이 좋습니다.
error : PTX .version 7.4 does not support .target sm_89
- bash tools/resolve_ptx_version python -m lightllm.server.api_server ... 로 시작하세요.

lightllm을 사용하는 프로젝트

통합해야 할 프로젝트가 있는 경우 이메일로 문의하거나 풀 요청을 작성하세요.

LazyLLM : 다중 에이전트 LLM 애플리케이션을 구축하는 가장 쉽고 느린 방법입니다.

lightllm 및 lazyllm 설치한 후 다음 코드를 사용하여 자신만의 챗봇을 구축할 수 있습니다.

 from lazyllm import TrainableModule , deploy , WebModule
# Model will be download automatically if you have an internet connection
m = TrainableModule ( 'internlm2-chat-7b' ). deploy_method ( deploy . lightllm )
WebModule ( m ). start (). wait ()