gpt-burn
1.0.0
This project aims to be a clean and concise re-implementation of GPT-2. The model implementation, contained in src/model.rs, is under 300 lines of code. While this was a fun exercise mostly for (my own) educational purposes, it demonstrates the usefulness of Rust and Burn in the machine-learning domain: the entire project compiles into a single binary, making deployment relatively straightforward.
At the moment, only a character-level tokenizer is supported, so official weights requiring a BPE tokenizer cannot be used yet. However, for fun, you can try out the small toy model I trained (see inference).
The project also includes a simple CLI for training and inference.
Usage: gpt-burn <COMMAND>
Commands:
run Generate text using a pre-trained model
train Train a new model
You can install gpt-burn with Nix:
nix run github:felix-andreas/gpt-burn
Alternatively, install with cargo:
cargo install --git https://github.com/felix-andreas/gpt-burn
Or, clone the repository and build from source:
nix develop # optional
cargo run --release
If you don't use Nix and are on an Ubuntu-based distribution, you need to install these additional dependencies:
apt install pkg-config libssl-dev libvulkan1 mesa-vulkan-drivers vulkan-tools
I trained a toy model with a character-level tokenizer on the German Wikipedia corpus for 20,000 batches (batch size 128) with the following parameters:
| Parameter | Value |
| --- | --- |
| Parameters | 83M |
| Context length | 128 |
| n_layers | 12 |
| n_heads | 12 |
| d_model | 768 |
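As a rough sanity check on those numbers, a GPT-2-style architecture with these dimensions lands in the same ballpark as the reported 83M parameters. The sketch below is only a back-of-the-envelope estimate; the 4x MLP expansion and the ~100-symbol character vocabulary are assumptions, not values taken from the repository:

```rust
// Back-of-the-envelope parameter estimate (assumptions: GPT-2-style blocks with
// a 4x MLP expansion and a ~100-symbol character-level vocabulary).
fn main() {
    let (n_layers, d_model, context_length, vocab_size) = (12u64, 768u64, 128u64, 100u64);
    let attention = 4 * d_model * d_model; // Q, K, V, and output projections
    let mlp = 8 * d_model * d_model; // two linear layers with 4x expansion
    let embeddings = (vocab_size + context_length) * d_model; // token + position embeddings
    let total = n_layers * (attention + mlp) + embeddings;
    println!("~{}M parameters", total / 1_000_000); // ~85M, close to the reported 83M
    // Training saw roughly 20_000 steps * 128 batch size * 128 context ≈ 328M characters.
    println!("~{}M training characters", 20_000u64 * 128 * 128 / 1_000_000);
}
```

The exact count depends on details such as biases, layer norms, and whether the embedding weights are tied, which is why the estimate differs slightly from 83M.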
You can download the model and extract it afterwards. Or, do both in a single command:
curl -s 'https://drive.usercontent.google.com/download?id=1GGLaPnmPQ8Z2B9vJQoI6-K128X9LJKG0&export=download&confirm=t' | tar xzf -
Then, run the model:
gpt-burn run ./model_83M
You should see output similar to the following:
So wurden bis 1977 679 nachhaltige Wörgler Torbauten vorgeworfen, die Einwohnerzahl Sirkes bestand 2015 bis 1998.
Sie war trotz weniger als 10.000 ausgedehnter Größen wahrscheinlich auf folgende Breitenauflagen mit 932 km.
2016 wurden rund 145 Händen nach Deutschland geladen.
The other command-line options are:
Usage: gpt-burn run [OPTIONS] <MODEL_PATH>
Arguments:
<MODEL_PATH>
Options:
-p, --prompt <PROMPT>
-n, --n-new-tokens <N_NEW_TOKENS> [default: 1000]
-s, --seed <SEED> [default: 0]
To train your own model, run:
gpt-burn train --context-length 128 --n-layers 12 --n-heads 12 --d-model 768 --batch-size 128 --learning-rate 0.0003 --seed 0 --text-corpus ./corpus.txt
Important: Make sure corpus.txt is a UTF-8 encoded text file!
You can pass most hyperparameters as command-line options:
Usage: gpt-burn train [OPTIONS]
Options:
-o, --output-path <PATH>
-c, --context-length <CONTEXT_LENGTH> [default: 64]
-d, --d-model <D_MODEL> [default: 64]
-l, --n-layers <N_LAYERS> [default: 2]
-h, --n-heads <N_HEADS> [default: 2]
-n, --n-steps <N_STEPS> [default: 50]
-b, --batch-size <BATCH_SIZE> [default: 32]
-r, --learning-rate <LEARNING_RATE> [default: 0.003]
-s, --seed <SEED> [default: 0]
-t, --text-corpus <TEXT_CORPUS> [default: .data/corpus.txt]
-m, --n-mega-bytes <N_MEGA_BYTES> Only use first <n> megabytes of dataset for training
-x, --no-save Don't save trained model (useful for debugging)
The model can be used with different tokenizers via the Tokenizer trait. Below you can see how the following sentence
Albert Einstein war ein schweizerisch-US-amerikanischer theoretischer Physiker deutscher Herkunft.
is encoded by the different tokenizers.
The CharTokenizer splits every character into an individual token:
Tokens: ["A", "l", "b", "e", "r", "t", " ", "E", "i", "n", "s", "t", "e", "i", "n", " ", "w", "a", "r", " ", "e", "i", "n", " ", "s", "c", "h", "w", "e", "i", "z", "e", "r", "i", "s", "c", "h", "-", "U", "S", "-", "a", "m", "e", "r", "i", "k", "a", "n", "i", "s", "c", "h", "e", "r", " ", "t", "h", "e", "o", "r", "e", "t", "i", "s", "c", "h", "e", "r", " ", "P", "h", "y", "s", "i", "k", "e", "r", " ", "d", "e", "u", "t", "s", "c", "h", "e", "r", " ", "H", "e", "r", "k", "u", "n", "f", "t", "."]
Values: [28, 13, 3, 6, 19, 21, 1, 32, 10, 15, 20, 21, 6, 10, 15, 1, 24, 2, 19, 1, 6, 10, 15, 1, 20, 4, 9, 24, 6, 10, 27, 6, 19, 10, 20, 4, 9, 66, 48, 46, 66, 2, 14, 6, 19, 10, 12, 2, 15, 10, 20, 4, 9, 6, 19, 1, 21, 9, 6, 16, 19, 6, 21, 10, 20, 4, 9, 6, 19, 1, 43, 9, 26, 20, 10, 12, 6, 19, 1, 5, 6, 22, 21, 20, 4, 9, 6, 19, 1, 35, 6, 19, 12, 22, 15, 7, 21, 67]
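For illustration, a character-level tokenizer along these lines fits in a few lines of Rust. This is only a sketch: the trait shape, the encode/decode method names, and the way the vocabulary is built are assumptions and may not match the actual definitions in the repository.

```rust
use std::collections::HashMap;

// Sketch of a character-level tokenizer (illustrative only; the actual
// `Tokenizer` trait and `CharTokenizer` in gpt-burn may differ).
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<usize>;
    fn decode(&self, ids: &[usize]) -> String;
}

struct CharTokenizer {
    char_to_id: HashMap<char, usize>,
    id_to_char: Vec<char>,
}

impl CharTokenizer {
    fn new(corpus: &str) -> Self {
        // Build the vocabulary from the unique characters of the corpus.
        let mut chars: Vec<char> = corpus.chars().collect();
        chars.sort_unstable();
        chars.dedup();
        let char_to_id = chars.iter().enumerate().map(|(i, &c)| (c, i)).collect();
        Self { char_to_id, id_to_char: chars }
    }
}

impl Tokenizer for CharTokenizer {
    fn encode(&self, text: &str) -> Vec<usize> {
        // Characters not in the vocabulary are silently dropped in this sketch.
        text.chars().filter_map(|c| self.char_to_id.get(&c).copied()).collect()
    }

    fn decode(&self, ids: &[usize]) -> String {
        ids.iter().map(|&i| self.id_to_char[i]).collect()
    }
}

fn main() {
    let tokenizer = CharTokenizer::new("Albert Einstein war ein Physiker.");
    let ids = tokenizer.encode("Einstein");
    assert_eq!(tokenizer.decode(&ids), "Einstein");
    println!("{ids:?}");
}
```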
The SimpleVowelTokenizer splits words before the next vowel if the chunk is longer than three characters, creating results that resemble syllables:
Tokens: ["Albert", " ", "Einst", "ein", " ", "war", " ", "ein", " ", "schw", "eizer", "isch", "-", "US", "-", "amer", "ikan", "isch", "er", " ", "theor", "etisch", "er", " ", "Phys", "iker", " ", "deutsch", "er", " ", "Herk", "unft"]
Values: [2, 0, 3, 9, 0, 19, 0, 9, 0, 16, 10, 15, 1, 6, 1, 7, 13, 15, 11, 0, 17, 12, 11, 0, 5, 14, 0, 8, 11, 0, 4, 18]
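The splitting rule itself is easy to sketch. The function below reproduces the behavior described above for a single word; how the real SimpleVowelTokenizer handles whitespace, punctuation, hyphens, or umlauts is not shown here and may differ.

```rust
// Illustrative sketch of the vowel-based splitting rule (not the actual
// SimpleVowelTokenizer implementation).
fn split_word(word: &str) -> Vec<String> {
    let is_vowel = |c: char| "aeiouAEIOU".contains(c);
    let mut chunks: Vec<String> = vec![String::new()];
    for c in word.chars() {
        // Start a new chunk before a vowel once the current chunk exceeds three characters.
        if is_vowel(c) && chunks.last().unwrap().chars().count() > 3 {
            chunks.push(String::new());
        }
        chunks.last_mut().unwrap().push(c);
    }
    chunks.retain(|chunk| !chunk.is_empty());
    chunks
}

fn main() {
    assert_eq!(split_word("Einstein"), vec!["Einst", "ein"]);
    assert_eq!(split_word("schweizerisch"), vec!["schw", "eizer", "isch"]);
    println!("{:?}", split_word("theoretischer")); // ["theor", "etisch", "er"]
}
```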