airllm

Quickstart | Configurations | MacOS | Example notebooks | FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run 405B Llama3.1 on 8GB of VRAM.

Updates

[2024/08/20] v2.11.0: Support Qwen2.5.

[2024/08/18] v2.10.1: Support CPU inference. Support non-sharded models. Thanks to @NavodPeiris for the great work!

[2024/07/30] Support Llama3.1 405B (example notebook). Support 8bit/4bit quantization.

[2024/04/20] AirLLM already supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.

[2023/12/25] v2.8.2: Support MacOS running 70B large language models.

[2023/12/20] v2.7: Support AirLLMMixtral.

[2023/12/20] v2.6: Added AutoModel, which detects the model type automatically, so there is no need to provide the model class when initializing a model.

[2023/12/18] v2.5: Added prefetching to overlap model loading and compute. 10% speed improvement.

[2023/12/03] Added support for ChatGLM, QWen, Baichuan, Mistral, and InternLM!

[2023/12/02] Added support for safetensors. Now supports all top 10 models on the open LLM leaderboard.

[2023/12/01] airllm 2.0: Support compression, for a 3x run time speed up!

[2023/11/20] airllm initial version!

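Quickstart

1. Install package

First, install the airllm pip package.

pip install airllm

2. Inference

Then initialize AirLLMLlama2 and pass in the Hugging Face repo ID of the model being used, or its local path. Inference can then be performed much like with a regular transformer model.

(You can also specify the path for saving the split, layer-wise model via layer_shards_saving_path when initializing AirLLMLlama2.)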

from airllm import AutoModel

MAX_LENGTH = 128
# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
        'What is the capital of United States?',
        #'I like',
    ]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Note: During inference, the original model is first decomposed and saved layer-wise. Please make sure there is enough disk space in the Hugging Face cache directory.
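To make the layer-wise decomposition more concrete, here is a minimal sketch of how a checkpoint's parameter names could be grouped into per-layer shards. It is an illustration only, not AirLLM's actual splitting code: the `group_by_layer` helper and the Llama-style parameter names are assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical helper for illustration; not part of the airllm API.
LAYER_PATTERN = re.compile(r"^(model\.layers\.\d+)\.")


def group_by_layer(param_names):
    """Group parameter names into shards: one per transformer layer,
    plus one for each remaining module (embeddings, final norm, head)."""
    shards = defaultdict(list)
    for name in param_names:
        match = LAYER_PATTERN.match(name)
        # Names outside the layer stack are keyed by their module name,
        # i.e. the parameter name without its last component.
        key = match.group(1) if match else name.rsplit(".", 1)[0]
        shards[key].append(name)
    return dict(shards)


# Toy parameter names in an assumed Llama-style layout.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
shards = group_by_layer(names)
for shard, members in shards.items():
    print(shard, len(members))
```

Because each shard can be loaded, run, and released on its own, only one layer's weights need to be resident on the GPU at a time, which is why the converted copy takes extra disk space.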

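Model Compression - 3x Inference Speed Up!

We just added model compression based on block-wise quantization. It can further speed up inference by up to 3x, with almost negligible accuracy loss! (See more performance evaluation, and why we use block-wise quantization, in this paper.)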

How to enable model compression speed up:

Step 1. Make sure bitsandbytes is installed: pip install -U bitsandbytes
Step 2. Make sure the airllm version is 2.0.0 or later: pip install -U airllm
Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization
                    )

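What are the differences between model compression and quantization?

Quantization normally needs to quantize both weights and activations to really speed things up. That makes it harder to maintain accuracy and to avoid the impact of outliers across all kinds of inputs.

In our case, the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. That lets us quantize the weights only, which makes it easier to preserve accuracy.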
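As a rough illustration of weight-only, block-wise quantization, the sketch below quantizes a flat list of weights in small blocks, with one scale per block. It uses plain absmax integer rounding in pure Python; this is an assumption chosen for readability, not the format that bitsandbytes or AirLLM actually uses, and the function names are hypothetical.

```python
def quantize_blockwise(weights, block_size=4, bits=8):
    """Absmax-quantize a flat list of floats, one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit codes
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # An all-zero block would give a zero scale, so fall back to 1.0.
        scale = absmax / qmax if absmax > 0 else 1.0
        codes = [round(w / scale) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]


# The second block holds large outliers; the first block keeps its own,
# much smaller scale and therefore its precision.
weights = [0.02, -0.01, 0.03, 0.015, 8.0, -6.5, 7.25, 0.5]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst-case round-trip error: {worst:.5f}")
```

Since each block carries its own scale, an outlier only coarsens the values that share its block. Activations are never touched, so they stay at full precision at run time.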

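Configurations

The following configurations are supported when initializing the model:

compression: supported options are 4bit and 8bit, for 4-bit or 8-bit block-wise quantization; defaults to None for no compression
profiling_mode: supported options are True to output time consumption, or False (the default)
layer_shards_saving_path: optionally, another path for saving the split model
hf_token: a Hugging Face token can be provided here when downloading gated models such as meta-llama/Llama-2-7b-hf
prefetching: prefetching to overlap model loading and compute. On by default. For now, only AirLLMLlama2 supports this.
delete_original: if you do not have much disk space, set delete_original to True to delete the originally downloaded Hugging Face model and keep only the converted one, saving half of the disk space.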
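The options above could be collected and validated before loading a model, as in the following minimal sketch. The keyword names come from the list above; the `build_model_kwargs` and `load_model` helpers, the `False` default for `delete_original`, and the example shard path are assumptions made for illustration.

```python
def build_model_kwargs(compression=None, profiling_mode=False,
                       layer_shards_saving_path=None, hf_token=None,
                       prefetching=True, delete_original=False):
    """Validate the options and return keyword arguments for from_pretrained."""
    if compression not in (None, "4bit", "8bit"):
        raise ValueError("compression must be None, '4bit' or '8bit'")
    kwargs = {
        "compression": compression,
        "profiling_mode": profiling_mode,
        "layer_shards_saving_path": layer_shards_saving_path,
        "hf_token": hf_token,
        "prefetching": prefetching,
        "delete_original": delete_original,
    }
    # Leave unset options out so the library's own defaults apply.
    return {key: value for key, value in kwargs.items() if value is not None}


def load_model(repo_id, **options):
    """Hypothetical wrapper; airllm is imported lazily so the helper
    above can be used and tested without a GPU or the package installed."""
    from airllm import AutoModel
    return AutoModel.from_pretrained(repo_id, **build_model_kwargs(**options))


# "/data/airllm_shards" is a made-up example path.
options = build_model_kwargs(compression="4bit",
                             layer_shards_saving_path="/data/airllm_shards")
print(options)
```

Calling `load_model("garage-bAInd/Platypus2-70B-instruct", compression="4bit")` would then forward only the validated options.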

MacOS

Just install airllm and run the same code as on Linux. See the Quickstart for details.

Make sure you have installed mlx and torch.
You will probably need to install a native Python build; see more here.
Only Apple silicon is supported.

Example [Python notebook](https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb)
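A quick way to check these prerequisites before running is sketched below, using only the standard library. The helper names are hypothetical. The check assumes that a native Python on Apple silicon reports the machine type `arm64`, whereas an x86 build running under Rosetta would report `x86_64`.

```python
import importlib.util
import platform


def is_apple_silicon():
    """True when running a native arm64 Python on macOS."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def missing_packages(names=("mlx", "torch")):
    """Return the packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if is_apple_silicon():
    missing = missing_packages()
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("mlx and torch found; ready to run on Apple silicon.")
else:
    print("Not a native Apple silicon Python; the MacOS path does not apply.")
```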

Example Python Notebook

Example Colabs:

Examples of other models (ChatGLM, QWen, Baichuan, Mistral, etc.):

ChatGLM:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

QWen:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

Baichuan, InternLM, Mistral, etc.:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

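To request support for other models: here

Acknowledgement

A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:

GitHub account @SimJeg, the code on Kaggle, and the associated discussion.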

FAQ

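1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

If you run into this error, the most likely cause is that you have run out of disk space. The model-splitting process is very disk-intensive. See this. You may need to extend your disk space, clear the Hugging Face .cache, and rerun.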
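One way to catch this before a long download is to check the free space where the Hugging Face cache lives. The sketch below uses only the standard library. It assumes the cache sits under `~/.cache/huggingface` unless the `HF_HOME` environment variable points elsewhere, and it assumes roughly twice the model size is needed (original plus converted copy). The helper names and the 130 GiB figure are illustrative, not measured.

```python
import os
import shutil


def free_gib(path):
    """Free space in GiB on the filesystem that holds `path`."""
    path = os.path.abspath(os.path.expanduser(path))
    # The directory may not exist yet, so walk up to an existing ancestor.
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 2**30


def has_room(path, model_size_gib, factor=2.0):
    """True if free space covers the model plus its converted copy.
    The factor of 2 is an assumption, not a documented requirement."""
    return free_gib(path) >= model_size_gib * factor


cache_dir = os.environ.get("HF_HOME", "~/.cache/huggingface")
model_size_gib = 130  # illustrative size for a 70B model in 16-bit weights
if not has_room(cache_dir, model_size_gib):
    print(f"Only {free_gib(cache_dir):.0f} GiB free under {cache_dir}; "
          "consider freeing space or setting layer_shards_saving_path.")
```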

2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:

For QWen models:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

For ChatGLM models:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

3. 401 Client Error....Repo model ... is gated.

Some models are gated and require a Hugging Face API token. You can provide it via hf_token:

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')
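To avoid hard-coding the token in a script, it can be read from the environment instead. This is a minimal sketch: `resolve_hf_token` is a hypothetical helper, and the `HF_TOKEN` variable name is an assumption about how your environment is set up.

```python
import os


def resolve_hf_token(explicit=None, env_var="HF_TOKEN"):
    """Return an explicitly passed token, else the one from the environment."""
    token = explicit or os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"No Hugging Face token found; pass one or set {env_var}.")
    return token


# Intended usage (requires airllm and access to the gated repo):
# model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf",
#                                   hf_token=resolve_hf_token())
```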

4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Some models' tokenizers do not have a padding token, so you can either set a padding token or simply turn off padding:

input_tokens = model.tokenizer(input_text,
   return_tensors="pt",
   return_attention_mask=False,
   truncation=True,
   max_length=MAX_LENGTH,
   padding=False  #<-----------   turn off padding
)
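For the other option, setting a padding token, a common approach with Hugging Face tokenizers is to reuse the end-of-sequence token. The sketch below shows that idea. `ensure_pad_token` is a hypothetical helper, and the stand-in tokenizer object exists only so the example runs without downloading a model.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse the EOS token for padding if the tokenizer has no pad token."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in with the two attributes the helper relies on; with AirLLM you
# would pass model.tokenizer instead.
dummy_tokenizer = SimpleNamespace(pad_token=None, eos_token="</s>")
ensure_pad_token(dummy_tokenizer)
print(dummy_tokenizer.pad_token)
```

After this, `padding=True` can be used in the tokenizer call. Reusing EOS for padding is usually fine for inference, but whether it suits a particular model is something to verify.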

Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTeX entry:

 @software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}

Contributing

Contributions, ideas, and discussions are welcome!

If you find it helpful, buy me a coffee!
