
Python Bindings for `llama.cpp`


Simple Python bindings for @ggerganov's llama.cpp library. This package provides:

- Low-level access to the C API via a ctypes interface
- High-level Python API for text completion
  - OpenAI-like API
  - LangChain compatibility
  - LlamaIndex compatibility
- OpenAI-compatible web server
  - Local Copilot replacement
  - Function calling support
  - Vision API support
  - Multiple models

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.

Installation

Requirements:

- Python 3.8+
- C compiler
  - Linux: gcc or clang
  - Windows: Visual Studio or MinGW
  - macOS: Xcode

To install the package, run:

pip install llama-cpp-python

This also builds llama.cpp from source and installs it alongside this Python package.

If this fails, add --verbose to the pip install command to see the full cmake build log.

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with basic CPU support.

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

Installation Configuration

llama.cpp supports a number of hardware acceleration backends to speed up inference, as well as backend-specific options. See the llama.cpp README for a full list.

All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.

Environment Variables

# Linux and Mac
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python

# Windows
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python

CLI / requirements.txt

The same options can also be set via the pip install -C / --config-settings flag and saved to a requirements.txt file.

pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"

# requirements.txt

llama-cpp-python -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
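
Note that the two forms separate options differently: the CMAKE_ARGS environment variable uses spaces, while the -C cmake.args= setting uses semicolons. A minimal sketch of converting one into the other follows; the helper name cmake_args_to_config_setting is hypothetical and not part of llama-cpp-python or pip.

```python
import shlex


def cmake_args_to_config_setting(cmake_args: str) -> str:
    """Turn a space-separated CMAKE_ARGS string into a `cmake.args=...` value.

    Hypothetical helper for illustration only.
    """
    # shlex.split honours shell-style quoting inside the string.
    options = shlex.split(cmake_args)
    return "cmake.args=" + ";".join(options)


print(cmake_args_to_config_setting("-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"))
```

The printed value can be passed to pip after -C, as in the commands above.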

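Supported Backends

Below are some common backends, their build commands, and any additional environment variables required.

OpenBLAS (CPU)

To install with OpenBLAS, set the GGML_BLAS and GGML_BLAS_VENDOR options in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

CUDA

To install with CUDA support, set GGML_CUDA=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python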

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with CUDA support, as long as your system meets these requirements:

- CUDA version is 12.1, 12.2, 12.3, 12.4 or 12.5
- Python version is 3.10, 3.11 or 3.12

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>

Where <cuda-version> is one of the following:

- cu121: CUDA 12.1
- cu122: CUDA 12.2
- cu123: CUDA 12.3
- cu124: CUDA 12.4
- cu125: CUDA 12.5

For example, to install the CUDA 12.1 wheel:

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
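
The mapping from CUDA version to index URL is mechanical, so it can be scripted. The sketch below assumes only the versions and base URL listed above; the function name cuda_wheel_index is hypothetical.

```python
BASE_URL = "https://abetlen.github.io/llama-cpp-python/whl"
SUPPORTED_CUDA = ("12.1", "12.2", "12.3", "12.4", "12.5")


def cuda_wheel_index(cuda_version: str) -> str:
    """Return the extra index URL for a supported CUDA version (hypothetical helper)."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(f"no pre-built wheel listed for CUDA {cuda_version}")
    # "12.1" becomes the tag "cu121"
    return f"{BASE_URL}/cu{cuda_version.replace('.', '')}"


print(cuda_wheel_index("12.1"))
```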

Metal

To install with Metal (MPS), set GGML_METAL=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with Metal support, as long as your system meets these requirements:

- macOS version is 11.0 or later
- Python version is 3.10, 3.11 or 3.12

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal

hipBLAS (ROCm)

To install with hipBLAS / ROCm support for AMD cards, set GGML_HIPBLAS=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python

Vulkan

To install with Vulkan support, set GGML_VULKAN=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python

SYCL

To install with SYCL support, set GGML_SYCL=on in the CMAKE_ARGS environment variable before installing:

source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

RPC

To install with RPC support, set GGML_RPC=on in the CMAKE_ARGS environment variable before installing:

source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python

Windows Notes

Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'

If the build complains that it can't find 'nmake' '?' or CMAKE_C_COMPILER, you can extract w64devkit as mentioned in the llama.cpp repo and add the compiler paths manually to CMAKE_ARGS before running pip install:

$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"

See the instructions above and set CMAKE_ARGS to the BLAS backend you want to use.

macOS Notes

Detailed macOS Metal GPU install documentation is available at docs/install/macos.md.

M1 Mac Performance Issue

Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh

Otherwise, the installation will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
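
One quick way to check which architecture the current interpreter was built for is the standard-library platform module. This is a small sketch; the helper name is_arm64_python is hypothetical. It rests on the assumption that an x86 Python running under Rosetta reports x86_64.

```python
import platform
from typing import Optional


def is_arm64_python(machine: Optional[str] = None) -> bool:
    """Return True when the interpreter reports an arm64 architecture."""
    # platform.machine() reflects the architecture this Python process runs as.
    machine = platform.machine() if machine is None else machine
    return machine.lower() in ("arm64", "aarch64")


print(platform.machine(), is_arm64_python())
```

If this prints x86_64 on an Apple Silicon Mac, install an arm64 Python before building.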

M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`

Try installing with:

CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python

Upgrading and Reinstalling

To upgrade and rebuild llama-cpp-python, add the --upgrade --force-reinstall --no-cache-dir flags to the pip install command to ensure the package is rebuilt from source.

High-level API

API Reference

The high-level API provides a simple managed interface through the Llama class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1, # Uncomment to use GPU acceleration
    # seed=1337, # Uncomment to set a specific seed
    # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
    "Q: Name the planets in the solar system? A: ", # Prompt
    max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)

By default llama-cpp-python generates completions in an OpenAI-compatible format:

{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}

Text completion is available through the __call__ and create_completion methods of the Llama class.
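
Because the result is a plain dict, the generated text and token counts can be read with ordinary indexing. The sketch below works on a hand-copied, abridged version of the output shown above, so it runs without a model; the helper names are hypothetical.

```python
# Sample output copied from the example above (abridged), used instead of a live model.
sample_output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, "
                    "Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}


def completion_text(output: dict) -> str:
    """Return the text of the first choice."""
    return output["choices"][0]["text"]


def total_tokens(output: dict) -> int:
    """Return the total token count reported in the usage block."""
    return output["usage"]["total_tokens"]


print(completion_text(sample_output))
print(total_tokens(sample_output))
```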

Pulling models from Hugging Face Hub

You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method. You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

By default, from_pretrained downloads the model to the Hugging Face cache directory. You can then manage installed model files with the huggingface-cli tool.

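Chat Completion

The high-level API also provides a simple interface for chat completion.

Chat completion requires that the model knows how to format the messages into a single prompt. The Llama class does this using pre-registered chat formats (e.g. chatml, llama-2, gemma, etc.) or a custom chat handler object that you provide.

The model formats the messages into a single prompt using the following order of precedence:

- Use the chat_handler if provided
- Use the chat_format if provided
- Use the tokenizer.chat_template from the gguf model's metadata (should work for most new models; older models may not have this)
- Otherwise, fall back to the llama-2 chat format

Set verbose=True to see the selected chat format.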
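
The order of precedence above can be pictured as a simple chain of checks. The following is an illustrative sketch only, not the library's actual code; the function resolve_chat_format and its return labels are hypothetical.

```python
from typing import Any, Mapping, Optional


def resolve_chat_format(
    chat_handler: Optional[Any] = None,
    chat_format: Optional[str] = None,
    metadata: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label describing which source would be used to format messages."""
    if chat_handler is not None:
        return "chat_handler"                   # 1. explicit handler wins
    if chat_format is not None:
        return f"chat_format:{chat_format}"     # 2. named, pre-registered format
    if metadata and "tokenizer.chat_template" in metadata:
        return "tokenizer.chat_template"        # 3. template stored in the gguf metadata
    return "chat_format:llama-2"                # 4. fallback


print(resolve_chat_format(chat_format="chatml"))
print(resolve_chat_format())
```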

from llama_cpp import Llama
llm = Llama(
    model_path="path/to/llama-2/llama-model.gguf",
    chat_format="llama-2"
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": "Describe this image in detail please."
        }
    ]
)

Chat completion is available through the create_chat_completion method of the Llama class.

For OpenAI API v1 compatibility, use the create_chat_completion_openai_v1 method, which returns pydantic models instead of dicts.

JSON and JSON Schema Mode

To constrain chat responses to only valid JSON or to a specific JSON Schema, use the response_format argument of create_chat_completion.

JSON Mode

The following example constrains the response to valid JSON strings only.

from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

JSON Schema Mode

To constrain the response further to a specific JSON Schema, add the schema to the schema property of the response_format argument.

from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
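
The constrained reply still arrives as a string in the message content, so it typically needs to be parsed before use. The sketch below assumes the OpenAI-style chat response layout (choices, then message, then content). The sample response is hand-written for illustration, not real model output, and parse_json_reply is a hypothetical helper.

```python
import json
from typing import Sequence

# Hand-written stand-in for the dict returned by create_chat_completion.
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": '{"team_name": "Example Team"}'}}
    ]
}


def parse_json_reply(response: dict, required: Sequence[str] = ("team_name",)) -> dict:
    """Parse the first choice's content as JSON and check the required keys."""
    content = response["choices"][0]["message"]["content"]
    data = json.loads(content)
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


print(parse_json_reply(sample_response)["team_name"])
```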

Function Calling

The high-level API supports OpenAI-compatible function and tool calling. This is possible through the chat format of the functionary pre-trained models or through the generic chatml-function-calling chat format.

from llama_cpp import Llama
llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
        },
        {
            "role": "user",
            "content": "Extract Jason is 25 years old"
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "UserDetail",
            "parameters": {
                "type": "object",
                "title": "UserDetail",
                "properties": {
                    "name": {
                        "title": "Name",
                        "type": "string"
                    },
                    "age": {
                        "title": "Age",
                        "type": "integer"
                    }
                },
                "required": ["name", "age"]
            }
        }
    }],
    tool_choice={
        "type": "function",
        "function": {
            "name": "UserDetail"
        }
    }
)
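
Reading the tool call back out of the result might look like the sketch below. It assumes the OpenAI-style layout in which the message carries a tool_calls list and each call's arguments are a JSON string. The sample response is hand-written to match the prompt above, not captured from a model, and extract_tool_calls is a hypothetical helper.

```python
import json
from typing import List, Tuple

# Hand-written stand-in for a create_chat_completion result with one tool call.
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            }
        }
    ]
}


def extract_tool_calls(response: dict) -> List[Tuple[str, dict]]:
    """Return (function name, decoded arguments) for each tool call in the first choice."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


print(extract_tool_calls(sample_response))
```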

Functionary v2

The various gguf-converted files for this set of models can be found here. Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary support parallel function calling. You can provide either functionary-v1 or functionary-v2 for the chat_format when initializing the Llama class.

Due to discrepancies between llama.cpp and HuggingFace's tokenizers, you must provide an HF tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This overrides the default llama.cpp tokenizer used in the Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
llm = Llama.from_pretrained(
    repo_id="meetkai/functionary-small-v2.2-GGUF",
    filename="functionary-small-v2.2.q4_0.gguf",
    chat_format="functionary-v2",
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)

NOTE: There is no need to provide the default system messages used in Functionary, as they are added automatically by the Functionary chat handler. The messages should therefore contain only the chat messages and/or system messages that provide additional context for the model (e.g. date and time).

Multi-modal Models

llama-cpp-python supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).

Model	`LlamaChatHandler`	`chat_format`
llava-v1.5-7b	`Llava15ChatHandler`	`llava-1-5`
llava-v1.5-13b	`Llava15ChatHandler`	`llava-1-5`
llava-v1.6-34b	`Llava16ChatHandler`	`llava-1-6`
moondream2	`MoondreamChatHandler`	`moondream2`
nanollava	`NanollavaChatHandler`	`nanollava`
llama-3-vision-alpha	`Llama3VisionAlphaChatHandler`	`llama-3-vision-alpha`
minicpm-v-2.6	`MiniCPMv26ChatHandler`	`minicpm-v-2.6`

You'll then need to use a custom chat handler to load the clip model and process the chat messages and images.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)

You can also pull the model from the Hugging Face Hub using the from_pretrained method.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)
# Chat completions return the reply under "message", not "text"
print(response["choices"][0]["message"]["content"])

Note: Multi-modal models also support tool calling and JSON mode.

Loading a Local Image

Images can be passed as base64-encoded data URIs. The following example demonstrates how to do this.

import base64

def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"

# Replace 'file_path.png' with the actual path to your PNG file
file_path = 'file_path.png'
data_uri = image_to_base64_data_uri(file_path)

messages = [
    {"role": "system", "content": "You are an assistant who perfectly describes images."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "Describe this image in detail please."}
        ]
    }
]
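
The helper above hard-codes image/png. For other image types, one possible variant guesses the MIME type from the file name using the standard-library mimetypes module. This is a sketch; bytes_to_data_uri is a hypothetical name, and whether a given chat handler accepts formats other than PNG is not covered here.

```python
import base64
import mimetypes


def bytes_to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URI, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Tiny placeholder payload; in practice read the bytes from an image file.
print(bytes_to_data_uri(b"abc", "photo.jpg"))
```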

Speculative Decoding

llama-cpp-python supports speculative decoding, which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.

Just pass it as the draft model to the Llama class during initialization.

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict; 10 is the default and generally good for GPU, 2 performs better for CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)
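
To give a feel for how prompt lookup can propose draft tokens without a second network, here is a simplified pure-Python sketch of the general idea: find an earlier occurrence of the most recent tokens and propose whatever followed it. This is an illustration under stated assumptions, not the code inside LlamaPromptLookupDecoding. The function name and the max_ngram_size parameter are part of the sketch.

```python
from typing import List


def prompt_lookup_draft(
    tokens: List[int], max_ngram_size: int = 2, num_pred_tokens: int = 10
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence."""
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + num_pred_tokens]
                if follow:
                    return follow
    return []  # no match: nothing to propose


# The trailing [1, 2] also appears at the start, followed by 3, 4.
print(prompt_lookup_draft([1, 2, 3, 4, 1, 2], num_pred_tokens=2))
```

The main model then verifies the proposed tokens, so a wrong guess costs some speed but does not change the output.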

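Embeddings

To generate text embeddings, use create_embedding or embed. Note that you must pass embedding=True to the constructor upon model creation for these to work properly.

import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])

There are two primary notions of embeddings in a Transformer-style model: token level and sequence level. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.

Models that are explicitly geared towards embeddings usually return sequence level embeddings by default, one for each input string. Non-embedding models, such as those designed for text generation, typically return only token level embeddings, one for each token in each sequence. The return type therefore has one more dimension for token level embeddings.

In some cases you can control the pooling behavior using the pooling_type flag on model creation. You can ensure token level embeddings from any model using LLAMA_POOLING_TYPE_NONE. The reverse, getting a generation-oriented model to yield sequence level embeddings, is currently not possible, but you can always do the pooling manually.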
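
Manual pooling can be as simple as averaging the token vectors. The following is a minimal sketch of mean pooling in plain Python, assuming the token level embeddings for one sequence are available as a list of equal-length float lists. The function name mean_pool is hypothetical.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average token level embeddings into one sequence level embedding."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


# Two tokens with 2-dimensional embeddings
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```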

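Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but it can be adjusted based on your requirements.

For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)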

OpenAI Compatible Web Server

llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp-compatible models with any OpenAI-compatible client (language libraries, services, etc.).

To install the server package and get started:

pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf

Similar to the hardware acceleration section above, you can also install with GPU (cuBLAS) support like this:

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.

To bind to 0.0.0.0 to enable remote connections, use python3 -m llama_cpp.server --host 0.0.0.0. Similarly, to change the port (default is 8000), use --port.
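
Once the server is running, any HTTP client can talk to it. The sketch below builds a chat request with the standard library only. It assumes the OpenAI-style /v1/chat/completions path and the default host and port mentioned above. The model name "local-model" is a placeholder, and build_chat_request is a hypothetical helper. Check the /docs page of your running server for the exact routes.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    messages: List[Dict[str, str]],
    base_url: str = "http://localhost:8000",
    model: str = "local-model",  # placeholder name, an assumption
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completion endpoint."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.get_method(), request.full_url)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```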

You probably also want to set the prompt format. For chatml, use:

python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml

That will format the prompt according to how the model expects it. You can find the prompt format in the model card. For possible options, see llama_cpp/llama_chat_format.py and look for lines starting with "@register_chat_format".

If you have huggingface-hub installed, you can also use the --hf_model_repo_id flag to load a model from the Hugging Face Hub.

python3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'

Web Server Features

- Local Copilot replacement
- Function calling support
- Vision API support
- Multiple models

Docker Image

A Docker image is available on GHCR. To run the server:

docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest

Docker on termux (requires root) is currently the only known way to run this on phones. See the termux support issue.

Low-level API

API Reference

The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

import llama_cpp
import ctypes
llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
params = llama_cpp.llama_context_default_params()
# use bytes for char * params
model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
max_tokens = params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
llama_cpp.llama_free(ctx)

Check out the examples folder for more examples of using the low-level API.

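Documentation

Documentation is available via https://llama-cpp-python.readthedocs.io/. If you find any issues with the documentation, please open an issue or submit a PR.

Development

This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in editable / development mode: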

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e .[server]

# to install all optional dependencies
pip install -e .[all]

# to clear the local build cache
make clean

You can also test specific commits of llama.cpp by checking out the desired commit in the vendor/llama.cpp submodule and then running make clean and pip install -e . again. Any changes to the llama.h API will require changes to the llama_cpp/llama_cpp.py file to match the new API (additional changes may be required elsewhere).