llama.go

v1.4: Server Mode

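Finally, some good news!

I have started reimplementing this library here: FastTensors

If you would like to see a GGML-compatible implementation in pure Go, please give it a star.

Looking for LLM debugging and inference with Golang?

Please check out my related project, Booster.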

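Motivation

We dream of a world where ML hackers can tinker with really large GPT models in their home labs, without GPU clusters burning through piles of money.

The code of this project is based on Georgi Gerganov's legendary ggml.cpp framework, written in C++ with the same attention to performance and elegance.

We hope that using Golang, instead of a powerful but overly low-level language, will allow much wider adoption.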

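V0 Roadmap

Tensor math in pure Golang
Implement the LLaMA neural net architecture and model loading
Test with the smaller LLaMA-7B model
Make sure Go inference works exactly the same as the C++ one
Let Go shine! Enable multi-threading and message passing to boost performance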

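V1 Roadmap - Spring'23

Cross-platform compatibility with Mac, Linux and Windows
Release the first stable version for ML hackers - v1.0
Enable bigger LLaMA models: 13B, 30B, 65B - v1.1
ARM NEON support on Apple Silicon (modern Macs) and ARM servers - v1.2
Performance boost with x64 AVX2 support for Intel and AMD - v1.2
Better memory use and GC optimizations - v1.3
Introduce Server Mode (embedded REST API) for use in real projects - v1.4
Release converted models for free access over the Internet - v1.4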

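V2 Roadmap - Winter'23

V3 Roadmap - Spring'23

Allow plugins and external APIs for complex projects
Allow model training and fine-tuning
Speed up execution on GPU cards and clusters
FP16 and BF16 math (if hardware support is there)
INT4 and GPTQ quantization
AMD Radeon GPU support with OpenCL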

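How to run?

First, obtain and convert the original LLaMA models yourself, or just download ready-to-use models:

LLaMA-7B: llama-7b-fp32.bin

LLaMA-13B: llama-13b-fp32.bin

Both models store FP32 weights, so you will need at least 32 GB of RAM (not VRAM or GPU RAM) for LLaMA-7B. Double that to 64 GB for LLaMA-13B.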
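
The memory figures above follow from simple arithmetic: each FP32 weight takes 4 bytes. The sketch below estimates the size of the raw weights only. The parameter counts are the nominal 7B and 13B, and the function name is made up for illustration. Real checkpoints differ slightly, and the running process also needs room for the context and working buffers, which is presumably why the recommendation is rounded up.

```go
package main

import "fmt"

// bytesPerFP32 is the size of one 32-bit floating point weight.
const bytesPerFP32 = 4

// weightsGiB returns the approximate size of the raw weights in GiB
// for a model with the given number of parameters stored as FP32.
// It ignores context buffers and any other runtime overhead.
func weightsGiB(params float64) float64 {
	return params * bytesPerFP32 / (1 << 30)
}

func main() {
	// Nominal parameter counts; actual checkpoints are slightly different.
	models := []struct {
		name   string
		params float64
	}{
		{"LLaMA-7B", 7e9},
		{"LLaMA-13B", 13e9},
	}
	for _, m := range models {
		fmt.Printf("%s: about %.1f GiB of FP32 weights\n", m.name, weightsGiB(m.params))
	}
}
```

With these nominal counts the weights alone come to roughly 26 GiB and 48 GiB, which fits under the 32 GB and 64 GB recommendations with some headroom.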

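Next, build the app binary from sources (see the instructions below), or just download an already built one:

Windows: llama-go-v1.4.0.exe

MacOS: llama-go-v1.4.0-macos

Linux: llama-go-v1.4.0-linux

Now that you have both the executable and the model, try it yourself: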

llama-go-v1.4.0-macos \
    --model ~/models/llama-7b-fp32.bin \
    --prompt "Why Golang is so popular?"

Useful command line flags:

--prompt   Text prompt from user to feed the model input
--model    Path and file name of converted .bin LLaMA model [ llama-7b-fp32.bin, etc ]
--server   Start in Server Mode acting as REST API endpoint
--host     Host to allow requests from in Server Mode [ localhost by default ]
--port     Port to listen on in Server Mode [ 8080 by default ]
--pods     Maximum pods or units of parallel execution allowed in Server Mode [ 1 by default ]
--threads  Adjust to the number of CPU cores you want to use [ all cores by default ]
--context  Context size in tokens [ 1024 by default ]
--predict  Number of tokens to predict [ 512 by default ]
--temp     Model temperature hyper parameter [ 0.5 by default ]
--silent   Hide welcome logo and other output [ shown by default ]
--chat     Chat with user in interactive mode instead of compute over static prompt
--profile  Profile CPU performance while running and store results to cpu.pprof file
--avx      Enable x64 AVX2 optimizations for Intel and AMD machines
--neon     Enable ARM NEON optimizations for Apple Macs and ARM server
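
To make the defaults in the list above concrete, here is a minimal sketch of how a subset of these flags could be declared with Go's standard flag package. It is not the project's actual parsing code: the Options struct, its field names and the parseOptions helper are hypothetical, and only the flag names and default values are taken from the list.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// Options mirrors a subset of the documented flags.
// The struct and its field names are hypothetical.
type Options struct {
	Model   string
	Prompt  string
	Server  bool
	Host    string
	Port    int
	Pods    int
	Threads int
	Context int
	Predict int
	Temp    float64
}

// parseOptions parses args using the defaults documented above.
// Go's flag package accepts both -name and --name forms.
func parseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("llama-go", flag.ContinueOnError)
	fs.StringVar(&o.Model, "model", "", "path to the converted .bin model")
	fs.StringVar(&o.Prompt, "prompt", "", "text prompt to feed the model")
	fs.BoolVar(&o.Server, "server", false, "start in Server Mode")
	fs.StringVar(&o.Host, "host", "localhost", "host in Server Mode")
	fs.IntVar(&o.Port, "port", 8080, "port in Server Mode")
	fs.IntVar(&o.Pods, "pods", 1, "max parallel pods in Server Mode")
	fs.IntVar(&o.Threads, "threads", runtime.NumCPU(), "CPU cores to use")
	fs.IntVar(&o.Context, "context", 1024, "context size in tokens")
	fs.IntVar(&o.Predict, "predict", 512, "number of tokens to predict")
	fs.Float64Var(&o.Temp, "temp", 0.5, "model temperature")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions([]string{"--model", "llama-7b-fp32.bin", "--server", "--pods", "4"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("pods=%d port=%d context=%d predict=%d\n", o.Pods, o.Port, o.Context, o.Predict)
}
```

Flags that are not passed keep their documented defaults, so the example above ends up with 4 pods, port 8080, a 1024-token context and 512 predicted tokens.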

Going to production

LLaMA.go embeds a standalone HTTP server exposing a REST API. To enable it, run the app with special flags:

llama-go-v1.4.0-macos \
    --model ~/models/llama-7b-fp32.bin \
    --server \
    --host 127.0.0.1 \
    --port 8080 \
    --pods 4 \
    --threads 4

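Choose the pods and threads parameters wisely, depending on the model size, the number of available CPU cores, how many requests you want to process in parallel, and how fast you would like to get answers.

Pods is the number of inference instances that may run in parallel.

The threads parameter sets how many cores will be used for tensor math within a pod.

For example, if you have a machine with 16 hardware cores capable of running 32 hyper-threads in parallel, you might end up with something like this: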

--server --pods 4 --threads 8

When there is no free pod to handle an arriving request, the request is placed into the waiting queue and started once some pod finishes its job.
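
The pod and queue behaviour described above can be sketched with a fixed pool of goroutines reading from a buffered channel. This is only an illustration of the scheduling idea, not the server's actual code: the Job type, the runPods function and the infer callback are all hypothetical, and the callback stands in for the real tensor math.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical unit of work: one prompt to run inference on.
type Job struct {
	ID     string
	Prompt string
}

// runPods starts `pods` workers that drain the queue. A job stays in the
// channel until some worker is free, which mimics the waiting queue.
// It returns once the queue is closed and every job has been processed.
func runPods(pods int, queue <-chan Job, infer func(Job) string) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for i := 0; i < pods; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				out := infer(job) // placeholder for the real inference
				mu.Lock()
				results[job.ID] = out
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return results
}

func main() {
	queue := make(chan Job, 16) // the buffered channel acts as the waiting queue
	for i := 0; i < 10; i++ {
		queue <- Job{ID: fmt.Sprintf("job-%d", i), Prompt: "Why Golang is so popular?"}
	}
	close(queue)

	results := runPods(4, queue, func(j Job) string {
		return "echo: " + j.Prompt
	})
	fmt.Println("finished jobs:", len(results))
}
```

With 4 pods at most four jobs run at once and the other six wait in the channel. In the 16-core example, 4 pods with 8 threads each add up to the 32 hyper-threads the machine can run in parallel.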

REST API examples

Schedule a new job

Send a POST request (with Postman) to your server address, with a JSON body containing a unique UUID v4 and the prompt:

{
    "id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3",
    "prompt": "Why Golang is so popular?"
}

Check job status

Send a GET request (with Postman or a browser) to a URL like http://host:port/jobs/status/:id

GET http://localhost:8080/jobs/status/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3

Get the results

Send a GET request (with Postman or a browser) to a URL like http://host:port/jobs/:id

GET http://localhost:8080/jobs/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3
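
The two GET endpoints can be called from Go in the same way. The sketch below builds both URLs from the documented patterns and prints the raw response bodies. The response format is not described here, so nothing is parsed. The helper names are hypothetical, and the job ID is assumed to be the one submitted when the job was scheduled.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// statusURL builds the documented /jobs/status/:id URL.
func statusURL(base, id string) string {
	return base + "/jobs/status/" + url.PathEscape(id)
}

// resultURL builds the documented /jobs/:id URL.
func resultURL(base, id string) string {
	return base + "/jobs/" + url.PathEscape(id)
}

// get fetches a URL and returns the raw response body as a string.
// The body is returned unparsed because its format is not documented above.
func get(u string) (string, error) {
	resp, err := http.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	base := "http://localhost:8080"
	id := "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3" // ID used when the job was scheduled

	for _, u := range []string{statusURL(base, id), resultURL(base, id)} {
		body, err := get(u)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(u, "->", body)
	}
}
```

In practice a client would call the status URL repeatedly until the job is reported as finished, and only then fetch the result.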

How to build

First, install Golang and git (on Windows, download the installers instead).

brew install git
brew install golang

Then clone the repo and enter the project folder:

git clone https://github.com/gotzmann/llama.go.git
cd llama.go

Some Go magic to install the external dependencies:

go mod tidy
go mod vendor

Now we are ready to build the binary from the source code:

go build -o llama-go-v1.exe -ldflags "-s -w" main.go

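FAQ

1) Where can I get the original LLaMA models?

Contact Meta directly, or just look for some torrent alternatives.

2) How do I convert the original LLaMA files into the supported format?

Place the original PyTorch FP16 files into the models directory, then convert them with the command:

python3 ./scripts/convert.py ~/models/LLaMA/7B/ 0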

Additional information

Version: v1.4: Server Mode
Type: other source code
Updated: 2024-11-30
Size: 10.3MB
Source: GitHub
