gpt-burn
1.0.0
This project aims to be a clean and concise re-implementation of GPT-2. The model implementation lives in src/model.rs and is under 300 lines of code. While this was mostly a fun exercise for (my own) educational purposes, it demonstrates the usefulness of Rust and Burn for machine learning: the whole project compiles to a single binary, which makes deployment relatively straightforward.
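To give a flavor of what a Burn-based model definition can look like, here is a hedged sketch of a decoder block as a Burn module. The names and structure are illustrative assumptions only and do not mirror the actual contents of src/model.rs:

```rust
use burn::{
    module::Module,
    nn::{LayerNorm, Linear},
    tensor::backend::Backend,
};

// Hypothetical decoder block: a plain Rust struct whose fields are Burn layers.
#[derive(Module, Debug)]
pub struct TransformerBlock<B: Backend> {
    norm_attn: LayerNorm<B>, // pre-norm before self-attention
    qkv: Linear<B>,          // fused query/key/value projection
    proj: Linear<B>,         // attention output projection
    norm_mlp: LayerNorm<B>,  // pre-norm before the feed-forward part
    mlp_up: Linear<B>,       // d_model -> 4 * d_model
    mlp_down: Linear<B>,     // 4 * d_model -> d_model
}
```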
Currently, only a character-level tokenizer is supported, so the official weights, which require a BPE tokenizer, cannot be used yet. For fun, however, you can try out the small toy model I trained (see inference).
The project also includes a simple CLI for training and inference.
Usage: gpt-burn <COMMAND>
Commands:
run Generate text using a pre-trained model
train Train a new model
You can install gpt-burn with Nix:
nix run github:felix-andreas/gpt-burn
Alternatively, install it with cargo:
cargo install --git https://github.com/felix-andreas/gpt-burn
Or clone the repository and build from source:
nix develop # optional
cargo run --release
If you don't use Nix and are on an Ubuntu-based distribution, you also need to install the following dependencies:
apt install pkg-config libssl-dev libvulkan1 mesa-vulkan-drivers vulkan-tools
I trained a toy model with a character-level tokenizer on the German Wikipedia corpus for 20,000 batches (batch size 128) with the following parameters:
| Parameter | Value |
|---|---|
| Parameters | 83M |
| Context length | 128 |
| n_layers | 12 |
| n_heads | 12 |
| d_model | 768 |
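As a rough, back-of-the-envelope sanity check (my own estimate, not a figure from the repository): a GPT-2-style block carries on the order of 12·d_model² weights (roughly 4·d² in the attention projections plus 8·d² in the MLP), so 12 layers at d_model = 768 give about 144 · 768² ≈ 85M parameters, in the same ballpark as the 83M listed above; with a character-level vocabulary, the embedding tables add comparatively little.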
You can download the model here and extract it afterwards. Or do both in a single command:
curl -s 'https://drive.usercontent.google.com/download?id=1GGLaPnmPQ8Z2B9vJQoI6-K128X9LJKG0&export=download&confirm=t' | tar xzf -
Then, run the model:
gpt-burn run ./model_83M
You should see something like the following:
So wurden bis 1977 679 nachhaltige Wörgler Torbauten vorgeworfen, die Einwohnerzahl Sirkes bestand 2015 bis 1998.
Sie war trotz weniger als 10.000 ausgedehnter Größen wahrscheinlich auf folgende Breitenauflagen mit 932 km.
2016 wurden rund 145 Händen nach Deutschland geladen.
The other command-line options are:
Usage: gpt-burn run [OPTIONS] <MODEL_PATH>
Arguments:
<MODEL_PATH>
Options:
-p, --prompt <PROMPT>
-n, --n-new-tokens <N_NEW_TOKENS> [default: 1000]
-s, --seed <SEED> [default: 0]
To train your own model, run:
gpt-burn train --context-length 128 --n-layers 12 --n-heads 12 --d-model 768 --batch-size 128 --learning-rate 0.0003 --seed 0 --text-corpus ./corpus.txt
Important: Make sure corpus.txt is a UTF-8 encoded text file!
You can pass most hyperparameters as command-line options:
Usage: gpt-burn train [OPTIONS]
Options:
-o, --output-path <PATH>
-c, --context-length <CONTEXT_LENGTH> [default: 64]
-d, --d-model <D_MODEL> [default: 64]
-l, --n-layers <N_LAYERS> [default: 2]
-h, --n-heads <N_HEADS> [default: 2]
-n, --n-steps <N_STEPS> [default: 50]
-b, --batch-size <BATCH_SIZE> [default: 32]
-r, --learning-rate <LEARNING_RATE> [default: 0.003]
-s, --seed <SEED> [default: 0]
-t, --text-corpus <TEXT_CORPUS> [default: .data/corpus.txt]
-m, --n-mega-bytes <N_MEGA_BYTES> Only use first <n> megabytes of dataset for training
-x, --no-save Don't save trained model (useful for debugging)
The model can be used with different tokenizers via the Tokenizer trait. Below, you can see how the following sentence
Albert Einstein war ein schweizerisch-US-amerikanischer theoretischer Physiker deutscher Herkunft.
is encoded by the different tokenizers.
The CharTokenizer splits every character into a separate token:
Tokens: ["A", "l", "b", "e", "r", "t", " ", "E", "i", "n", "s", "t", "e", "i", "n", " ", "w", "a", "r", " ", "e", "i", "n", " ", "s", "c", "h", "w", "e", "i", "z", "e", "r", "i", "s", "c", "h", "-", "U", "S", "-", "a", "m", "e", "r", "i", "k", "a", "n", "i", "s", "c", "h", "e", "r", " ", "t", "h", "e", "o", "r", "e", "t", "i", "s", "c", "h", "e", "r", " ", "P", "h", "y", "s", "i", "k", "e", "r", " ", "d", "e", "u", "t", "s", "c", "h", "e", "r", " ", "H", "e", "r", "k", "u", "n", "f", "t", "."]
Values: [28, 13, 3, 6, 19, 21, 1, 32, 10, 15, 20, 21, 6, 10, 15, 1, 24, 2, 19, 1, 6, 10, 15, 1, 20, 4, 9, 24, 6, 10, 27, 6, 19, 10, 20, 4, 9, 66, 48, 46, 66, 2, 14, 6, 19, 10, 12, 2, 15, 10, 20, 4, 9, 6, 19, 1, 21, 9, 6, 16, 19, 6, 21, 10, 20, 4, 9, 6, 19, 1, 43, 9, 26, 20, 10, 12, 6, 19, 1, 5, 6, 22, 21, 20, 4, 9, 6, 19, 1, 35, 6, 19, 12, 22, 15, 7, 21, 67]
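The trait itself is not reproduced here, so the following is only a hedged sketch of how a character-level tokenizer behind such a Tokenizer trait could look; the method names and the vocabulary construction are my assumptions, not the project's actual API:

```rust
use std::collections::HashMap;

/// Assumed shape of the Tokenizer trait; the real trait in gpt-burn may differ.
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<usize>;
    fn decode(&self, tokens: &[usize]) -> String;
}

/// Character-level tokenizer: every distinct character becomes one token id.
struct CharTokenizer {
    char_to_id: HashMap<char, usize>,
    id_to_char: Vec<char>,
}

impl CharTokenizer {
    /// Build the vocabulary from a reference corpus.
    fn new(corpus: &str) -> Self {
        let mut id_to_char: Vec<char> = corpus.chars().collect();
        id_to_char.sort_unstable();
        id_to_char.dedup();
        let char_to_id = id_to_char
            .iter()
            .enumerate()
            .map(|(id, &c)| (c, id))
            .collect();
        Self { char_to_id, id_to_char }
    }
}

impl Tokenizer for CharTokenizer {
    fn encode(&self, text: &str) -> Vec<usize> {
        // Characters missing from the vocabulary are silently skipped in this sketch.
        text.chars()
            .filter_map(|c| self.char_to_id.get(&c).copied())
            .collect()
    }

    fn decode(&self, tokens: &[usize]) -> String {
        tokens.iter().map(|&id| self.id_to_char[id]).collect()
    }
}
```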
The SimpleVowelTokenizer splits words before the next vowel whenever a chunk is longer than three characters, which creates syllable-like results:
Tokens: ["Albert", " ", "Einst", "ein", " ", "war", " ", "ein", " ", "schw", "eizer", "isch", "-", "US", "-", "amer", "ikan", "isch", "er", " ", "theor", "etisch", "er", " ", "Phys", "iker", " ", "deutsch", "er", " ", "Herk", "unft"]
Values: [2, 0, 3, 9, 0, 19, 0, 9, 0, 16, 10, 15, 1, 6, 1, 7, 13, 15, 11, 0, 17, 12, 11, 0, 5, 14, 0, 8, 11, 0, 4, 18]
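Again as a hedged sketch, here is one way the splitting rule just described could be implemented (this only illustrates the rule as I read it; it is not the project's SimpleVowelTokenizer, which additionally builds a vocabulary and implements the trait above):

```rust
const VOWELS: [char; 10] = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']; // umlauts omitted for brevity

/// Split text into syllable-like chunks: inside a word, start a new chunk
/// before a vowel once the current chunk is longer than three characters.
/// Whitespace and punctuation (e.g. "-") become tokens of their own.
fn vowel_split(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut chunk = String::new();
    for c in text.chars() {
        if c.is_alphabetic() {
            if VOWELS.contains(&c) && chunk.chars().count() > 3 {
                tokens.push(std::mem::take(&mut chunk));
            }
            chunk.push(c);
        } else {
            // End of word: flush the current chunk, then emit the separator itself.
            if !chunk.is_empty() {
                tokens.push(std::mem::take(&mut chunk));
            }
            tokens.push(c.to_string());
        }
    }
    if !chunk.is_empty() {
        tokens.push(chunk);
    }
    tokens
}

fn main() {
    let tokens = vowel_split("Albert Einstein war ein theoretischer Physiker.");
    println!("{tokens:?}");
    // Output of this sketch: ["Albert", " ", "Einst", "ein", " ", "war", " ",
    // "ein", " ", "theor", "etisch", "er", " ", "Phys", "iker", "."]
}
```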