Python Bindings for `llama.cpp`

Simple Python bindings for @ggerganov's llama.cpp library. This package provides:

Low-level access to the C API via a ctypes interface
High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
OpenAI-compatible web server
- Local Copilot replacement
- Function calling support
- Vision API support
- Multiple models

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest

Installation

Requirements:

Python 3.8+
C compiler
- Linux: gcc or clang
- Windows: Visual Studio or MinGW
- MacOS: Xcode

To install the package, run:

pip install llama-cpp-python

This builds llama.cpp from source and installs it alongside this Python package.

If this fails, add --verbose to the pip install command to see the full cmake build log.

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with basic CPU support.

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

Installation Configuration

llama.cpp supports a number of hardware acceleration backends to speed up inference, as well as backend-specific options. See the llama.cpp README for a full list.

All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.

Environment Variables

# Linux and Mac
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python

# Windows
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python

CLI/requirements.txt

The options can also be set via the pip install -C / --config-settings command and saved to a requirements.txt file:

pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"

# requirements.txt

llama-cpp-python -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"
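
The two forms above differ mainly in their separator: CMAKE_ARGS takes space-separated -D flags, while cmake.args takes them separated by semicolons. The following is a minimal sketch that renders one set of options both ways. The helper names (cmake_flags, as_env_value, as_config_setting) are made up for this illustration and are not part of llama-cpp-python or pip.

```python
def cmake_flags(options):
    """Turn {"GGML_BLAS": "ON"} into ["-DGGML_BLAS=ON"], keeping insertion order."""
    return [f"-D{name}={value}" for name, value in options.items()]


def as_env_value(options):
    """Value for the CMAKE_ARGS environment variable: flags joined by spaces."""
    return " ".join(cmake_flags(options))


def as_config_setting(options):
    """Argument for `pip install -C`: flags joined by semicolons."""
    return "cmake.args=" + ";".join(cmake_flags(options))


# The OpenBLAS options used in the examples above.
blas = {"GGML_BLAS": "ON", "GGML_BLAS_VENDOR": "OpenBLAS"}
print(as_env_value(blas))
print(as_config_setting(blas))
```

Generating both strings from a single dictionary keeps a shell script and a requirements.txt file from drifting apart.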

Supported Backends

Below are some common backends, their build commands, and any additional environment variables required.

OpenBLAS (CPU)

To install with OpenBLAS, set the GGML_BLAS and GGML_BLAS_VENDOR options in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

CUDA

To install with CUDA support, set GGML_CUDA=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with CUDA support, as long as your system meets these requirements:

CUDA version is 12.1, 12.2, 12.3, 12.4 or 12.5
Python version is 3.10, 3.11 or 3.12

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>

Where <cuda-version> is one of the following:

cu121: CUDA 12.1
cu122: CUDA 12.2
cu123: CUDA 12.3
cu124: CUDA 12.4
cu125: CUDA 12.5

For example, to install the CUDA 12.1 wheel:

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
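
The mapping from a CUDA version to its wheel index follows a simple pattern, which the sketch below encodes for use in an install script. The function name cuda_wheel_index and the constants are hypothetical helpers for this example. The version list is copied from the requirements above and may not match what the index currently hosts.

```python
WHEEL_INDEX = "https://abetlen.github.io/llama-cpp-python/whl"
# CUDA versions listed in the requirements above.
SUPPORTED_CUDA = ("12.1", "12.2", "12.3", "12.4", "12.5")


def cuda_wheel_index(cuda_version):
    """Return the --extra-index-url value for a CUDA version such as "12.1"."""
    if cuda_version not in SUPPORTED_CUDA:
        raise ValueError(f"no pre-built wheel listed for CUDA {cuda_version}")
    major, minor = cuda_version.split(".")
    return f"{WHEEL_INDEX}/cu{major}{minor}"


print(cuda_wheel_index("12.1"))
```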

Metal

To install with Metal (MPS) support, set GGML_METAL=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with Metal support, as long as your system meets these requirements:

MacOS version is 11.0 or later
Python version is 3.10, 3.11 or 3.12

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal

hipBLAS (ROCm)

To install with hipBLAS / ROCm support for AMD cards, set GGML_HIPBLAS=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python

Vulkan

To install with Vulkan support, set GGML_VULKAN=on in the CMAKE_ARGS environment variable before installing:

CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python

SYCL

To install with SYCL support, set GGML_SYCL=on in the CMAKE_ARGS environment variable before installing:

source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

RPC

To install with RPC support, set GGML_RPC=on in the CMAKE_ARGS environment variable before installing:

source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python

Windows Notes

Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'

If the build complains that it can't find 'nmake', '?' or CMAKE_C_COMPILER, you can extract w64devkit as mentioned in the llama.cpp repo and add those paths manually to CMAKE_ARGS before running pip install:

$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"

See the instructions above and set CMAKE_ARGS to the BLAS backend you want to use.

MacOS Notes

Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md

M1 Mac Performance Issue

Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh

Otherwise, the installation will build the x86 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
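
Before installing, it can help to confirm which architecture the current interpreter was built for. The sketch below uses only the standard library. It assumes that an x86 Python running under Rosetta reports x86_64 from platform.machine(), which is the usual behaviour but worth verifying on your own machine. The function name describe_interpreter is made up for this example.

```python
import platform
import sys


def describe_interpreter():
    """Report the architecture the running Python interpreter identifies as."""
    machine = platform.machine()
    return {
        "platform": sys.platform,
        "machine": machine,
        "is_arm64": machine.lower() in ("arm64", "aarch64"),
    }


info = describe_interpreter()
print(info)
if info["platform"] == "darwin" and not info["is_arm64"]:
    print("This Python does not report arm64; llama.cpp would likely be built for x86.")
```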

M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`

Try installing with

CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python

Upgrading and Reinstalling

To upgrade and rebuild llama-cpp-python, add the --upgrade --force-reinstall --no-cache-dir flags to the pip install command to ensure the package is rebuilt from source.

High-level API

API Reference

The high-level API provides a simple managed interface through the Llama class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)

By default llama-cpp-python generates completions in an OpenAI-compatible format:

{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}

Text completion is available through the __call__ and create_completion methods of the Llama class.
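
Since the result is a plain dictionary, the generated text and token counts can be read with ordinary indexing. The sketch below works on a shortened, hand-written sample with the same keys as the response shown above, so it runs without a model. The helper first_text is a name invented for this example.

```python
# Hand-written sample using the keys of the response shown above (text shortened).
output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Mercury, Venus, Earth",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 28, "total_tokens": 42},
}


def first_text(completion):
    """Return the generated text and finish reason of the first choice."""
    choice = completion["choices"][0]
    return choice["text"], choice["finish_reason"]


text, reason = first_text(output)
usage = output["usage"]
print(text)
print(reason, usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```

In the response format shown above, a finish_reason of "stop" corresponds to hitting one of the stop sequences rather than running out of the max_tokens budget.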

Pulling Models from Hugging Face Hub

You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method. You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

By default from_pretrained downloads the model to the huggingface cache directory. You can then manage installed model files with the huggingface-cli tool.

Chat Completion

The high-level API also provides a simple interface for chat completion.

Chat completion requires that the model knows how to format the messages into a single prompt. The Llama class does this using pre-registered chat formats (e.g. chatml, llama-2, gemma, etc.) or by providing a custom chat handler object.

The model will format the messages into a single prompt using the following order of precedence:

Use the chat_handler if provided
Use the chat_format if provided
Use the tokenizer.chat_template from the gguf model's metadata (should work for most new models, older models may not have this)
Otherwise fall back to the llama-2 chat format

Set verbose=True to see the selected chat format.
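
The order of precedence above can be expressed as a small decision function. This is an illustrative stand-in rather than the library's internal code: the function name resolve_chat_format and its return shape are invented here, and metadata is assumed to be a plain dictionary of gguf metadata.

```python
def resolve_chat_format(chat_handler=None, chat_format=None, metadata=None):
    """Return (source, value) following the precedence list above."""
    metadata = metadata or {}
    if chat_handler is not None:
        return "chat_handler", chat_handler
    if chat_format is not None:
        return "chat_format", chat_format
    template = metadata.get("tokenizer.chat_template")
    if template is not None:
        return "tokenizer.chat_template", template
    # Nothing was supplied and the model carries no template.
    return "fallback", "llama-2"


print(resolve_chat_format(chat_format="chatml"))
print(resolve_chat_format(metadata={"tokenizer.chat_template": "<template text>"}))
print(resolve_chat_format())
```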

from llama_cpp import Llama
llm = Llama(
      model_path="path/to/llama-2/llama-model.gguf",
      chat_format="llama-2"
)
llm.create_chat_completion(
      messages=[
          {"role": "system", "content": "You are an assistant who perfectly describes images."},
          {
              "role": "user",
              "content": "Describe this image in detail please."
          }
      ]
)

Chat completion is available through the create_chat_completion method of the Llama class.

For OpenAI API v1 compatibility, use the create_chat_completion_openai_v1 method, which returns pydantic models instead of dicts.

JSON and JSON Schema Mode

To constrain chat responses to only valid JSON or a specific JSON Schema, use the response_format argument in create_chat_completion.

JSON Mode

The following example constrains the response to valid JSON strings only.

from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

JSON Schema Mode

To constrain the response further to a specific JSON Schema, add the schema to the schema property of the response_format argument.

from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
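
The constrained output still arrives as a string, so the caller has to parse it. The sketch below assumes the response follows the OpenAI chat completion shape, with the JSON text under choices[0].message.content. The sample response is hand-written with a placeholder value, and parse_constrained is a helper name invented for this example.

```python
import json

# Hand-written sample in the OpenAI chat completion shape (placeholder value).
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": '{"team_name": "Example Team"}',
            }
        }
    ]
}


def parse_constrained(response, required=("team_name",)):
    """Parse the JSON message content and check that the required keys exist."""
    content = response["choices"][0]["message"]["content"]
    data = json.loads(content)  # raises json.JSONDecodeError on invalid JSON
    missing = [key for key in required if key not in data]
    if missing:
        raise KeyError(f"missing required keys: {missing}")
    return data


print(parse_constrained(response)["team_name"])
```

Checking the required keys after parsing is a cheap safeguard in case generation is cut off, for example by max_tokens, before the object is complete.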

Function Calling

The high-level API supports OpenAI-compatible function and tool calling. This is possible through the chat format of the functionary pre-trained models or through the generic chatml-function-calling chat format.

from llama_cpp import Llama
llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
      messages=[
        {
          "role": "system",
          "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
        },
        {
          "role": "user",
          "content": "Extract Jason is 25 years old"
        }
      ],
      tools=[{
        "type": "function",
        "function": {
          "name": "UserDetail",
          "parameters": {
            "type": "object",
            "title": "UserDetail",
            "properties": {
              "name": {
                "title": "Name",
                "type": "string"
              },
              "age": {
                "title": "Age",
                "type": "integer"
              }
            },
            "required": ["name", "age"]
          }
        }
      }],
      tool_choice={
        "type": "function",
        "function": {
          "name": "UserDetail"
        }
      }
)
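
In the OpenAI tool calling format, the arguments of each call are returned as a JSON string that the caller decodes. The sketch below assumes the response carries tool calls under choices[0].message.tool_calls, as that format does. The sample response and its id value are hand-written, and extract_tool_calls is a helper name invented for this example.

```python
import json

# Hand-written sample in the OpenAI tool calling shape (not real model output).
response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "UserDetail",
                            "arguments": '{"name": "Jason", "age": 25}',
                        },
                    }
                ],
            }
        }
    ]
}


def extract_tool_calls(response):
    """Return a list of (function name, decoded arguments) pairs."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


for name, arguments in extract_tool_calls(response):
    print(name, arguments)
```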

Functionary v2

The various gguf-converted files for this set of models are hosted on Hugging Face (see the repo_id in the example below). Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary support parallel function calling. You can provide either functionary-v1 or functionary-v2 for the chat_format when initializing the Llama class.

Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide an HF tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in the Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
llm = Llama.from_pretrained(
  repo_id="meetkai/functionary-small-v2.2-GGUF",
  filename="functionary-small-v2.2.q4_0.gguf",
  chat_format="functionary-v2",
  tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)

NOTE: There is no need to provide the default system messages used in Functionary, as they are added automatically in the Functionary chat handler. Thus, the messages should contain just the chat messages and/or system messages that provide additional context for the model (e.g. datetime, etc.).

Multi-modal Models

llama-cpp-python supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).

Model	`LlamaChatHandler`	`chat_format`
llava-v1.5-7b	`Llava15ChatHandler`	`llava-1-5`
llava-v1.5-13b	`Llava15ChatHandler`	`llava-1-5`
llava-v1.6-34b	`Llava16ChatHandler`	`llava-1-6`
moondream2	`MoondreamChatHandler`	`moondream2`
nanollava	`NanollavaChatHandler`	`nanollava`
llama-3-vision-alpha	`Llama3VisionAlphaChatHandler`	`llama-3-vision-alpha`
minicpm-v-2.6	`MiniCPMv26ChatHandler`	`minicpm-v-2.6`

Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
  model_path="./path/to/llava/llama-model.gguf",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)

You can also pull the model from the Hugging Face Hub using the from_pretrained method.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*mmproj*",
)

llm = Llama.from_pretrained(
  repo_id="vikhyatk/moondream2",
  filename="*text-model*",
  chat_handler=chat_handler,
  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)
# Chat completions return the reply under "message", not "text"
print(response["choices"][0]["message"]["content"])

Note: Multi-modal models also support tool calling and JSON mode.

Loading a Local Image

Images can be passed as base64 encoded data URIs. The following example demonstrates how to do this.

import base64

def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"

# Replace 'file_path.png' with the actual path to your PNG file
file_path = 'file_path.png'
data_uri = image_to_base64_data_uri(file_path)

messages = [
    {"role": "system", "content": "You are an assistant who perfectly describes images."},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "Describe this image in detail please."}
        ]
    }
]
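
The helper above hard-codes image/png. For other image types, one option is to guess the MIME type from the file name with the standard library, as in the sketch below. The function name file_to_data_uri is invented for this example, and the three bytes written to the temporary file are only the JPEG signature, not a viewable image.

```python
import base64
import mimetypes
import os
import tempfile


def file_to_data_uri(file_path):
    """Encode a local image as a data URI, guessing the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image type for {file_path!r}")
    with open(file_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Demonstration with a throwaway file so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pixel.jpg")
    with open(path, "wb") as handle:
        handle.write(b"\xff\xd8\xff")  # JPEG signature bytes only
    uri = file_to_data_uri(path)
print(uri)
```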

Speculative Decoding

llama-cpp-python supports speculative decoding, which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.

Just pass this as a draft model to the Llama class during initialization.

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10) # num_pred_tokens is the number of tokens to predict; 10 is the default and generally good for gpu, 2 performs better for cpu-only machines.
)
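
To give an intuition for what a prompt lookup draft does, the sketch below proposes draft tokens by finding an earlier occurrence of the most recent n-gram and copying the tokens that followed it. It is a simplified illustration of the general idea, not the code used by LlamaPromptLookupDecoding. The function name prompt_lookup_draft and the max_ngram_size parameter are assumptions of this example.

```python
def prompt_lookup_draft(tokens, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`."""
    # Prefer longer n-grams, since a longer match is a stronger hint.
    for size in range(min(max_ngram_size, len(tokens) - 1), 0, -1):
        tail = tokens[-size:]
        # Scan earlier windows from most recent to oldest, skipping the tail itself.
        for start in range(len(tokens) - size - 1, -1, -1):
            if tokens[start:start + size] == tail:
                follow = tokens[start + size:start + size + num_pred_tokens]
                if follow:
                    return follow
    return []  # no earlier occurrence, so nothing to propose


history = [1, 2, 3, 4, 1, 2]
print(prompt_lookup_draft(history, num_pred_tokens=2))
```

The main model then verifies the proposed tokens, so a wrong draft costs some wasted work but does not change the output.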

Embeddings

To generate text embeddings, use create_embedding or embed. Note that you must pass embedding=True to the constructor upon model creation for these to work properly.

import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])

There are two primary notions of embeddings in a Transformer-style model: token level and sequence level. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.

Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models, such as those designed for text generation, will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.

It is possible to control pooling behavior in some cases using the pooling_type flag on model creation. You can ensure token level embeddings from any model using LLAMA_POOLING_TYPE_NONE. The reverse, getting a generation-oriented model to yield sequence level embeddings, is currently not possible, but you can always do the pooling manually.
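
Manual pooling can be as simple as the sketch below, which shows the two strategies mentioned above on token level embeddings held as nested Python lists. The function names and the tiny two-dimensional vectors are made up for illustration. Real embeddings have far more dimensions and would usually be handled with an array library.

```python
def mean_pool(token_embeddings):
    """Average token level vectors into one sequence level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


def first_token_pool(token_embeddings):
    """Use the first token's vector as the sequence level embedding."""
    return list(token_embeddings[0])


# Three tokens, each with a two-dimensional embedding.
token_level = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(token_level))
print(first_token_pool(token_level))
```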

Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.

For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
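
A quick way to reason about the window size is to add the prompt length to the requested completion length and compare the sum with n_ctx. The sketch below does that arithmetic. The helper name remaining_budget and the token counts are hypothetical, and in practice the prompt length would come from tokenizing the actual prompt.

```python
def remaining_budget(prompt_tokens, max_tokens, n_ctx=512):
    """Tokens left in the window after the prompt and the requested completion."""
    return n_ctx - (prompt_tokens + max_tokens)


# A hypothetical 600-token prompt with a 256-token completion.
for n_ctx in (512, 2048):
    left = remaining_budget(prompt_tokens=600, max_tokens=256, n_ctx=n_ctx)
    print(n_ctx, "fits" if left >= 0 else "too small", left)
```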

OpenAI Compatible Web Server

llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc.).

To install the server package and get started:

pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf

Similar to the hardware acceleration section above, you can also install with GPU (cuBLAS) support like this:

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.

To bind to 0.0.0.0 to enable remote connections, use python3 -m llama_cpp.server --host 0.0.0.0. Similarly, to change the port (default is 8000), use --port.

You probably also want to set the prompt format. For chatml, use

python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml

That will format the prompt according to how the model expects it. You can find the prompt format in the model card. For possible options, see llama_cpp/llama_chat_format.py and look for lines starting with "@register_chat_format".

If you have huggingface-hub installed, you can also use the --hf_model_repo_id flag to load a model from the Hugging Face Hub.

python3 -m llama_cpp.server --hf_model_repo_id Qwen/Qwen2-0.5B-Instruct-GGUF --model '*q8_0.gguf'
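
Once the server is running, any HTTP client can talk to it. The sketch below builds a chat completion request with the standard library only. It assumes the server exposes the OpenAI-style /v1/chat/completions route on the default port, and the model value "local-model" is a placeholder chosen for this example. The request is only constructed here, and the commented lines show how it could be sent.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000", model="local-model"):
    """Build (but do not send) a POST request for an OpenAI-style chat completion."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.get_method(), request.full_url)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as reply:
#     print(json.load(reply)["choices"][0]["message"]["content"])
```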

Web Server Features

Local Copilot replacement
Function calling support
Vision API support
Multiple models

Docker image

A Docker image is available on GHCR. To run the server:

docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest

Docker on termux (requires root) is currently the only known way to run this on phones. See the termux support issue.

Low-level API

API Reference

The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

import llama_cpp
import ctypes
llama_cpp.llama_backend_init(False) # Must be called once at the start of each program
params = llama_cpp.llama_context_default_params()
# use bytes for char * params
model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
ctx = llama_cpp.llama_new_context_with_model(model, params)
max_tokens = params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, llama_cpp.c_bool(True))
llama_cpp.llama_free(ctx)

Check out the examples folder for more examples of using the low-level API.
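
The two conventions noted in the comments above (bytes for char * parameters, ctypes arrays for array parameters) can be tried with the standard library alone. The sketch below does not call llama.cpp. It assumes, for illustration, that a token is a 32-bit signed integer, and the TokenArray name is invented for this example.

```python
import ctypes

# Assumption for illustration: a token is a 32-bit signed integer.
TokenArray = ctypes.c_int32 * 8  # an array type with room for 8 tokens
tokens = TokenArray()            # zero-initialised, fixed-length buffer
tokens[0] = 1

# char * parameters take bytes, so text is encoded before being passed.
prompt = "Q: Name the planets in the solar system? A: ".encode("utf-8")
char_ptr = ctypes.c_char_p(prompt)

print(len(tokens), list(tokens)[:3], char_ptr.value == prompt)
```

A C function fills such a buffer in place, which is why the tokenize call above receives the array and its capacity and returns a count.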

Documentation

Documentation is available at https://llama-cpp-python.readthedocs.io/. If you find any issues with the documentation, please open an issue or submit a PR.

Development

This package is under active development and I welcome any contributions.

To get started, clone the repository and install the package in editable / development mode:

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# if you want to use the fastapi / openapi server
pip install -e '.[server]'

# to install all optional dependencies
pip install -e '.[all]'

# to clear the local build cache
make clean

You can also test out specific commits of llama.cpp by checking out the desired commit in the vendor/llama.cpp submodule and then running make clean and pip install -e . again. Any changes in the llama.h API will require changes to the llama_cpp/llama_cpp.py file to match the new API (additional changes may be required elsewhere).