https://github.com/abhaskumarsinha/Corpus2GPT
[MinimalGPT logo]
[GPT-1 Paper] [1002 short stories from Project Gutenberg] [logo.com] [Transformer - Paper] [Huggingface Transformers] [TensorFlow] [BPE Tokenizer: subword-nmt]
MinimalGPT is a concise, adaptable, and streamlined code framework that provides the essential components for building, training, running inference on, and fine-tuning GPT models. The framework is implemented exclusively with Keras and TensorFlow, ensuring compatibility and coherence within the broader deep-learning ecosystem.
New feature: CPU/GPU/TPU support, plus support for loading large-file datasets!
The repository contains two files that together form the proposed framework. The first file, GPT.py, serves as the core framework and contains the key blocks and layers: multi-head attention, the feed-forward mechanism, scaled dot-product attention, positional encoding, the softmaxed output, and an inference function for model prediction. The second file, MinimalGPT.py, streamlines the use of the framework through a concise command-line interface. This interface lets users perform the essential operations, including model creation, training, saving, loading, fine-tuning, and inference, all condensed into a single command-line invocation. Both files can also be imported into Python code, so users can incorporate them into their projects through simple function calls.
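For orientation, here is a minimal sketch of the scaled dot-product attention building block named above, written with plain TensorFlow ops. It is illustrative only (the function name and shapes here are our own); the authoritative implementation lives in GPT.py.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Illustrative sketch only -- see GPT.py for the actual implementation.

    q, k, v: tensors of shape (batch, heads, seq_len, d_k).
    """
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    # Similarity scores, scaled by sqrt(d_k) as in the Transformer paper
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        scores += mask * -1e9  # push masked (future) positions toward zero weight
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)
```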
pip install -r requirements.txt
The model architecture is governed by several key parameters: GPT_INPUT, D_MODEL, MULTI_HEAD, and DECODER_STACKS. These parameters must be kept consistent to avoid problems when loading a model for later retraining or inference. In case of uncertainty, the configuration file produced during the previous run provides the reference values. In addition, the VOCABULARY_START and VOCABULARY_END parameters play a crucial role in defining the vocabulary window over the corpus. These markers drive the creation of the vectorizer layer, which extracts the vocabulary from the corpus between the specified START and END token counts. Note that tokens in the corpus are whitespace-separated, and specifying VOCABULARY_START and VOCABULARY_END becomes especially relevant when no tokenizer file is explicitly provided.
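As a concrete illustration of the windowing described above, the following sketch shows one plausible reading of how the whitespace-tokenized corpus is sliced. The dataset path and window values are taken from the example run further below; the slicing itself is our assumption, not the repository's code.

```python
# Hedged sketch: assumed interpretation of the vocabulary/training windows.
with open('./dataset/output_dataset.txt', encoding='utf-8') as f:
    tokens = f.read().split()  # tokens are whitespace-separated

VOCABULARY_START, VOCABULARY_END = 0, 200000  # vocabulary window
TOKEN_START, TOKEN_END = 0, 40000             # training window

vocab = sorted(set(tokens[VOCABULARY_START:VOCABULARY_END]))
train_tokens = tokens[TOKEN_START:TOKEN_END]
print(f'Vocabulary size: {len(vocab)}, training tokens: {len(train_tokens)}')
```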
Also note that the tokenizer file and the model weights are saved/loaded together. The current code does not support saving/loading these two files separately.
Inference mode (-i) requires only the model parameters together with the saved tokenizer and weights files to generate output. It should be used in conjunction with the -ol switch.
usage: MinimalGPT.py [-h] [-d DATA_PATH] [-l LEARNING_RATE]
[-ol OUTPUT_LENGTH] [-e EPOCHS] [-b BATCH_SIZE]
[-s GPT_INPUT] [-dm D_MODEL] [-p MULTI_HEAD]
[-ds DECODER_STACKS] [-ts TOKEN_START] [-te TOKEN_END]
[-vs VOCABULARY_START] [-ve VOCABULARY_END] [-sd]
[-lt LOAD_TOKENIZER] [-lw LOAD_WEIGHTS]
[-st SAVE_TOKENIZER] [-sw SAVE_WEIGHTS] [-ot OPTIMIZER]
[-i] [-mv] [-mvo]
optional arguments:
-h, --help show this help message and exit
-d DATA_PATH, --data-path DATA_PATH
File: Corresponding to corpus or training text
[String]
-l LEARNING_RATE, --learning-rate LEARNING_RATE
Float: Learning Rate. The model will train ONLY IF the
rate is > 0, skip otherwise [Float]
-ol OUTPUT_LENGTH, --output-length OUTPUT_LENGTH
Length of the output sequence to be generated
-e EPOCHS, --epochs EPOCHS
Number of training Epochs [Int]
-b BATCH_SIZE, --batch-size BATCH_SIZE
Size of each batch [Int]
-s GPT_INPUT, --gpt-input GPT_INPUT
Number of Tokens of text the model inputs at a time
[Int]
-dm D_MODEL, --d-model D_MODEL
Embedding layer output dimensions [Int]
-p MULTI_HEAD, --multi-head MULTI_HEAD
Number of Multi-Head Attention layers in parallel [Int]
-ds DECODER_STACKS, --decoder-stacks DECODER_STACKS
Number of stacked Decoder layers [Int]
-ts TOKEN_START, --token-start TOKEN_START
The token number in the corpus marking the starting
point of training [Int]
-te TOKEN_END, --token-end TOKEN_END
The token number in the corpus marking the end point
of training [Int]
-vs VOCABULARY_START, --vocabulary-start VOCABULARY_START
Token number from the corpus to mark the starting
point of vocabulary data [Int]
-ve VOCABULARY_END, --vocabulary-end VOCABULARY_END
Token number from the corpus to mark the end point of
vocabulary data [Int]
-sd, --save Save the Model and Vectorizer data to disk
[True/False]
-lt LOAD_TOKENIZER, --load-tokenizer LOAD_TOKENIZER
File: Vectorization layer [File]
-lw LOAD_WEIGHTS, --load-weights LOAD_WEIGHTS
File: Model Weights [File]
-st SAVE_TOKENIZER, --save-tokenizer SAVE_TOKENIZER
File: Saving Vectorizer File [File]
-sw SAVE_WEIGHTS, --save-weights SAVE_WEIGHTS
File: Saving Model Weights [File]
-ot OPTIMIZER, --optimizer OPTIMIZER
Optimizer consistent to TensorFlow optimizer class
[tf.keras.optimizers]
-i, --inference-only Only Print the output of the model in Inference Mode
[True/False]
-mv, --model-vectorizer
Return Model, Vectorizer Tuple [True/False]
-mvo, --model-vectorizer-output
Return Model, Vectorizer, Output Tuple [True/False]
Suppose the desired model specification calls for GPT_INPUT = 10, D_MODEL = 128, MULTI_HEAD = 8, and DECODER_STACKS = 1, with training corpus tokens ranging from TOKEN_START = 0 to TOKEN_END = 40000, and the vectorizer built from the corpus range VOCABULARY_START = 0 to VOCABULARY_END = 200000. Run the following command to start the model training process. The resulting weights and tokenizer data are saved to the specified folder. The subsequent output illustrates the result of executing this command.
PS C:\gpt> python MinimalGPT.py -d './dataset/output_dataset.txt' -l 0.001 -ol 200 -e 4 -b 512 -s 10 -dm 128 -p 8 -ds 1 -ts 0 -te 40000 -vs 0 -ve 200000 -sd -st './models/tokenizer.mgt' -sw './models/weights.mgw'
Total tokens: 40000
100%|██████████████████████████████████████████████████████████████████████████████| 200000/200000 [02:02<00:00, 1636.38it/s]
New Vectorizer created successfully...
Vocabulary Size: 14270
100%|██████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 302926.25it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 1289942.19it/s]
(None, 10, 128)
Epoch 1/4
79/79 [==============================] - 88s 1s/step - loss: 7.8692
Epoch 2/4
79/79 [==============================] - 92s 1s/step - loss: 3.8066
Epoch 3/4
79/79 [==============================] - 93s 1s/step - loss: 1.1487
Epoch 4/4
79/79 [==============================] - 92s 1s/step - loss: 0.2900
100%|██████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 34.70it/s]
Vocabulary size saved: 14270
and her eyes in the library. She was the rather large woman, although not fat, and when she wore high heels--which she was not prone to do, because although Cutter would not have cared, she kept trying to project into other people's minds and trying, as she said, "Not to do anything to them, that I wouldn't want them to do you me."--she rose a good inch above Cutter. She was pleasant humored, and cooperative, and the one great irritant about her that annoyed Cutter, was the fact that she was not capable of meeting life wholeheartedly and with strength. She steadily worried about other people's feelings and thoughts, so that Cutter wondered if she were capable of the slightest personal conviction. Yet that weakness was an advantage at the same time, to him, because she worked constantly toward making him happy. The house was run to his minutest liking, and the servants liked her, so that while she did not use a strong enough
Suppose we want to fine-tune the model above (or retrain it). The command to reload the tokenizer and weights and retrain the model on new text from a specified window of the corpus is as follows:
PS C:\gpt> python MinimalGPT.py -d './dataset/output_dataset.txt' -l 0.00005 -ol 200 -e 1 -b 512 -s 10 -dm 128 -p 8 -ds 1 -ts 80000 -te 120000 -sd -st './models/tokenizer2.mgt' -sw './models/weights2.mgw' -lt './models/tokenizer.mgt' -lw './models/weights.mgw'
Total tokens: 40000
100%|██████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 302923.51it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 1428099.68it/s]
(None, 10, 128)
79/79 [==============================] - 81s 993ms/step - loss: 7.9725
100%|██████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:06<00:00, 30.29it/s]
Vocabulary size saved: 14270
of her own the black of my own and my wife had could seen the house at the same moment her mind caught the first suggestion of the folded paper. “But he must have a name! Where is the paper?” She moved to the desk, and began to turn over the scattered documents that littered it. The first that caught her eye was an unfinished letter in her husband’s hand, with his pen lying across it, as though dropped there at a sudden summons. “My dear Parvis,”--who was Parvis?--“I have just received your letter announcing Elwell’s death, and while I suppose there is now no farther risk of trouble, it might be safer--” That was all. The “risk of trouble” was easily explained by the newspaper clipping which had apprised Mary of the suit brought against her husband by one of his associates in the Blue Star enterprise. The only new information conveyed in the letter was the fact of its showing Boyne,
Inference mode loads the pre-trained weights and vectorizer, then uses these components to run the model and generate output of the specified length.
PS C:\gpt> python MinimalGPT.py -i -ol 500 -e 6 -b 512 -s 10 -dm 128 -p 8 -ds 1 -lt './models/tokenizer2.mgt' -lw './models/weights2.mgw'
(None, 10, 128)
100%|██████████████████████████████████████████████████████████████████████████████████████| 490/490 [00:13<00:00, 35.93it/s]
of her own “on the other from the inel’--a little sensational, of course. But I guess you’d better look it over.” He held out a newspaper to Mary, who unfolded it slowly, remembering, as she did so, the evening when, in that same room, the perusal of a clipping from the “Sentinel” had first shaken the depths of her security. As she opened the paper, her eyes, shrinking from the glaring head-lines, “Widow of Boyne’s Victim Forced to Appeal for Aid,” ran down the column of text to two portraits inserted in it. The first was her husband’s, taken from a photograph made the year they had come to England. It was the picture of him that she liked best, the one that stood on the writing-table up-stairs in her bedroom. As the eyes in the photograph met hers, she felt it would be impossible to read what was said of him, and closed her lids with the sharpness of the pain. “I thought if you felt disposed to put your name down--” she heard Parvis continue. She opened her eyes with an effort, and they fell on the other portrait. It was that of a youngish man, slightly built, in rough clothes, with features somewhat blurred by the shadow of a projecting hat-brim. Where had she seen that outline before? She stared at it confusedly, her heart hammering in her throat and ears. Then she gave a cry. “This is the man--the man who came for my husband!” She heard Parvis start to his feet, and was dimly aware that she had slipped backward into the corner of the sofa, and that he was bending above her in alarm. With an intense effort she straightened herself, and reached out for the paper, which she had dropped. “It’s the man! I should know him anywhere!” she cried in a voice that sounded in her own ears like a scream. Parvis’s voice seemed to come to her from far off, down endless, fog-muffled windings. “Mrs. Boyne, you’re not very well. Shall I call somebody? Shall I get a glass of water?” “No, no, no!” She threw herself toward him, her hand frantically clenching the newspaper. “I tell you, it’s the man! I KNOW him! He spoke to me in the garden!” Parvis took the journal from her, directing his glasses to the portrait. “It can’t be, Mrs. Boyne. It’s Robert Elwell.” “Robert Elwell?” Her white
Incorporating a trained model generated via MinimalGPT.py into your project is a straightforward process: import the MinimalGPT function and configure it to the desired specification. This is achieved by setting the parameter return_model_and_vectorizer = True or return_model_and_vectorizer_and_output = True within the inference_only = True (inference mode) framework. Training, creating, and exporting a model can be accomplished in a similar fashion, paralleling the command-line mode. For a comprehensive illustration of these processes, the accompanying Jupyter Notebook provides an example demonstration.
from MinimalGPT import MinimalGPT

model = MinimalGPT(output_length=200,
                   gpt_input=10,
                   d_model=128,
                   h=8,
                   decoder_stacks=1,
                   load_tokenizer='./models/tokenizer3.mgt',
                   load_weights='./models/weights3.mgw',
                   inference_only=True,
                   return_model_and_vectorizer_and_output=True)
model[0].summary()
Model: "model"
_________________________________________________________________
 Layer (type)                    Output Shape              Param #
=================================================================
 input_1 (InputLayer)            [(None, 10)]              0

 embedding (Embedding)           (None, 10, 128)           1826816

 positional_embedding            (None, 10, 128)           0
 (PositionalEmbedding)

 decoder (Decoder)               (None, 10, 128)           37160

 flatten (Flatten)               (None, 1280)              0

 dense (Dense)                   (None, 14273)             18283713

 tf.nn.softmax (TFOpLambda)      (None, 14273)             0

=================================================================
Total params: 20,147,689
Trainable params: 20,147,689
Non-trainable params: 0
_________________________________________________________________
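Since return_model_and_vectorizer_and_output = True was set, the returned value is a tuple. Based on the flag's help text ("Return Model, Vectorizer, Output Tuple") and the model[0].summary() call above, a reasonable (assumed) unpacking looks like this:

```python
# Assumed tuple ordering (model, vectorizer, output), matching the flag name.
gpt_model, vectorizer, generated_text = model
print(generated_text)  # the generated sequence of output_length tokens
```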
The model implemented here differs slightly from the implementation in the original paper. The matrix formed by concatenating the heads of the scaled dot-product attention outputs is multiplied by a parameter matrix of size (key dimension x d_model). In practice, this small tweak reduces the number of trainable parameters and yields slightly better performance.
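To make the difference concrete, the following shape comparison sketches our assumed reading of the paragraph above; consult GPT.py for the authoritative implementation.

```python
import tensorflow as tf

d_model, h = 128, 8
d_k = d_model // h  # key dimension = 16

# Original Transformer: concat(heads) has last dim h * d_k = d_model,
# so the output projection W_O has shape (d_model, d_model).
w_o_paper = tf.Variable(tf.random.normal((h * d_k, d_model)))  # 16,384 weights

# This implementation (as described above, assumed): the projection
# parameter has shape (d_k, d_model), i.e. h times fewer weights.
w_o_here = tf.Variable(tf.random.normal((d_k, d_model)))       # 2,048 weights
```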