https://github.com/abhaskumarsinha/Corpus2GPT
[MinimalGPT logo]
[GPT-1 Paper] [1002 short stories from Project Gutenberg] [logo.com] [Transformer - Paper] [Huggingface Transformers] [TensorFlow] [BPE Tokenizer: subword-nmt]
MinimalGPT is a concise, adaptable, and streamlined code framework that encompasses the essential components required for building, training, running inference with, and fine-tuning GPT models. The framework is implemented exclusively with Keras and TensorFlow, ensuring compatibility and coherence within the broader deep learning ecosystem.
New: CPU/GPU/TPU support, plus support for loading large dataset files!
The repository introduces two files that together constitute the proposed framework. The first file, GPT.py, serves as the foundational framework, housing critical blocks and layers. These components include multi-head attention, the feed-forward mechanism, scaled dot-product attention, positional encoding, softmaxed output, and an inference function for model prediction. The second file, MinimalGPT.py, streamlines the use of the framework by providing a concise command-line interface. This interface lets users perform the essential operations, including model creation, training, saving, loading, fine-tuning, and inference, all condensed into a single command-line execution. Both files can also be imported into Python code, allowing users to incorporate them seamlessly into their projects through simple function calls.
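As an illustration of the kind of building block GPT.py houses, here is a minimal, generic sketch of scaled dot-product attention in TensorFlow. It is a reference implementation written for this description, not the repository's exact code:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, depth)
    scores = tf.matmul(q, k, transpose_b=True)      # (batch, heads, seq_q, seq_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(d_k)             # scale by sqrt(d_k)
    if mask is not None:
        scores += mask * -1e9                       # suppress masked positions
    weights = tf.nn.softmax(scores, axis=-1)        # attention distribution
    return tf.matmul(weights, v)                    # weighted sum of values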
pip install -r requirements.txt
The model architecture is governed by several key parameters: GPT_INPUT, D_MODEL, MULTI_HEAD, and DECODER_STACKS. These parameters must be kept consistent to avoid problems when a model is loaded for subsequent retraining or inference. In case of doubt, the configuration file generated during the previous run offers valuable insight. In addition, the VOCABULARY_START and VOCABULARY_END parameters play a crucial role in defining the corpus window tokens. These markers drive the creation of the vectorizer layer, which extracts the vocabulary from the corpus between the specified START and END token counts. Note that tokens in the corpus are delimited by whitespace, and specifying VOCABULARY_START and VOCABULARY_END becomes especially relevant when no tokenizer file is explicitly provided.
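To make the whitespace-delimited windowing concrete, the hypothetical helper below mimics the described behavior: slice the corpus tokens between VOCABULARY_START and VOCABULARY_END and build a vocabulary from that window. This is an illustration only, not the repository's code:

# Hypothetical illustration of the vocabulary window; not the repository's code.
def build_window_vocab(corpus_text, vocab_start, vocab_end):
    tokens = corpus_text.split()               # corpus tokens are whitespace-delimited
    window = tokens[vocab_start:vocab_end]     # slice [VOCABULARY_START, VOCABULARY_END)
    return sorted(set(window))                 # unique vocabulary drawn from the window

with open('./dataset/output_dataset.txt', encoding='utf-8') as f:
    vocab = build_window_vocab(f.read(), 0, 200000)
print('Vocabulary size:', len(vocab))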
Also note that the tokenizer file and the model weights are saved/loaded together. The code does not currently support saving or loading these two files separately.
Inference mode (-i) requires only the model parameters together with the saved tokenizer and weights files to generate output. It should be used in conjunction with the -ol switch.
usage: MinimalGPT.py [-h] [-d DATA_PATH] [-l LEARNING_RATE]
[-ol OUTPUT_LENGTH] [-e EPOCHS] [-b BATCH_SIZE]
[-s GPT_INPUT] [-dm D_MODEL] [-p MULTI_HEAD]
[-ds DECODER_STACKS] [-ts TOKEN_START] [-te TOKEN_END]
[-vs VOCABULARY_START] [-ve VOCABULARY_END] [-sd]
[-lt LOAD_TOKENIZER] [-lw LOAD_WEIGHTS]
[-st SAVE_TOKENIZER] [-sw SAVE_WEIGHTS] [-ot OPTIMIZER]
[-i] [-mv] [-mvo]
optional arguments:
-h, --help show this help message and exit
-d DATA_PATH, --data-path DATA_PATH
File: Corresponding to corpus or training text
[String]
-l LEARNING_RATE, --learning-rate LEARNING_RATE
Float: Learning Rate. The model will train ONLY IF the
rate is > 0, skip otherwise [Float]
-ol OUTPUT_LENGTH, --output-length OUTPUT_LENGTH
Length of the output sequence to be generated
-e EPOCHS, --epochs EPOCHS
Number of training Epochs [Int]
-b BATCH_SIZE, --batch-size BATCH_SIZE
Size of each batch [Int]
-s GPT_INPUT, --gpt-input GPT_INPUT
Number of Tokens of text the model inputs at a time
[Int]
-dm D_MODEL, --d-model D_MODEL
Embedding layer output dimensions [Int]
-p MULTI_HEAD, --multi-head MULTI_HEAD
Number of Multi-head Attention layer in parallel [Int]
-ds DECODER_STACKS, --decoder-stacks DECODER_STACKS
Number of stacked Decoder layer [Int]
-ts TOKEN_START, --token-start TOKEN_START
The token number in the corpus to mark it as the
starting point of the training [Int]
-te TOKEN_END, --token-end TOKEN_END
The token number in the corpus to mark it as the end
point of the training [Int]
-vs VOCABULARY_START, --vocabulary-start VOCABULARY_START
Token number from the corpus to mark the starting
point of vocabulary data [Int]
-ve VOCABULARY_END, --vocabulary-end VOCABULARY_END
Token number from the corpus to mark the end point of
vocabulary data [Int]
-sd, --save Save the Model and Vectorizer data to disk
[True/False]
-lt LOAD_TOKENIZER, --load-tokenizer LOAD_TOKENIZER
File: Vectorization layer [File]
-lw LOAD_WEIGHTS, --load-weights LOAD_WEIGHTS
File: Model Weights [File]
-st SAVE_TOKENIZER, --save-tokenizer SAVE_TOKENIZER
File: Saving Vectorizer File [File]
-sw SAVE_WEIGHTS, --save-weights SAVE_WEIGHTS
File: Saving Model Weights[File]
-ot OPTIMIZER, --optimizer OPTIMIZER
Optimizer consistent to TensorFlow optimizer class
[tf.keras.optimizers]
-i, --inference-only Only Print the output of the model in Inference Mode
[True/False]
-mv, --model-vectorizer
Return Model, Vectorizer Tuple [True/False]
-mvo, --model-vectorizer-output
Return Model, Vectorizer, Output Tuple [True/False]
Suppose the desired model specification calls for GPT_INPUT = 10, D_MODEL = 128, MULTI_HEAD = 8, and DECODER_STACKS = 1, with the training corpus spanning tokens TOKEN_START = 0 through TOKEN_END = 40000, and the vectorizer layer built from the corpus range VOCABULARY_START = 0 through VOCABULARY_END = 200000. The following command launches the model training process; the resulting weights and tokenizer data are saved to the specified folder. The subsequent output illustrates the result of executing this command.
PS C:\gpt> python MinimalGPT.py -d './dataset/output_dataset.txt' -l 0.001 -ol 200 -e 4 -b 512 -s 10 -dm 128 -p 8 -ds 1 -ts 0 -te 40000 -vs 0 -ve 200000 -sd -st './models/tokenizer.mgt' -sw './models/weights.mgw'
Total tokens: 40000
100%|██████████████████████████████████████████████████████████████████████████████| 200000/200000 [02:02<00:00, 1636.38it/s]
New Vectorizer created successfully...
Vocabulary Size: 14270
100%|██████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 302926.25it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 1289942.19it/s]
(None, 10, 128)
Epoch 1/4
79/79 [==============================] - 88s 1s/step - loss: 7.8692
Epoch 2/4
79/79 [==============================] - 92s 1s/step - loss: 3.8066
Epoch 3/4
79/79 [==============================] - 93s 1s/step - loss: 1.1487
Epoch 4/4
79/79 [==============================] - 92s 1s/step - loss: 0.2900
100%|██████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 34.70it/s]
Vocabulary size saved: 14270
and her eyes in the library. She was the rather large woman, although not fat, and when she wore high heels--which sh
e was not prone to do, because although Cutter would not have cared, she kept trying to project into other people's minds and
trying, as she said, "Not to do anything to them, that I wouldn't want them to do you me."--she rose a good inch above Cutter.
She was pleasant humored, and cooperative, and the one great irritant about her that annoyed Cutter, was the fact that she wa
s not capable of meeting life wholeheartedly and with strength. She steadily worried about other people's feelings and thought
s, so that Cutter wondered if she were capable of the slightest personal conviction. Yet that weakness was an advantage at the
same time, to him, because she worked constantly toward making him happy. The house was run to his minutest liking, and the s
ervants liked her, so that while she did not use a strong enough
Suppose we want to fine-tune the model above (or retrain it). The command to reload the tokenizer and weights and retrain on new text from a specified window of the corpus is as follows:
PS C:\gpt> python MinimalGPT.py -d './dataset/output_dataset.txt' -l 0.00005 -ol 200 -e 1 -b 512 -s 10 -dm 128 -p 8 -ds 1 -ts 80000 -te 120000 -sd -st './models/tokenizer2.mgt' -sw './models/weights2.mgw' -lt './models/tokenizer.mgt' -lw './models/weights.mgw'
Total tokens: 40000
100%|██████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 302923.51it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 1428099.68it/s]
(None, 10, 128)
79/79 [==============================] - 81s 993ms/step - loss: 7.9725
100%|██████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:06<00:00, 30.29it/s]
Vocabulary size saved: 14270
of her own the black of my own and my wife had could seen the house at the same moment her mind caught the first sugg
estion of the folded paper. “But he must have a name! Where is the paper?” She moved to the desk, and began to turn over the s
cattered documents that littered it. The first that caught her eye was an unfinished letter in her husband’s hand, with his pe
n lying across it, as though dropped there at a sudden summons. “My dear Parvis,”--who was Parvis?--“I have just received your
letter announcing Elwell’s death, and while I suppose there is now no farther risk of trouble, it might be safer--” That was
all. The “risk of trouble” was easily explained by the newspaper clipping which had apprised Mary of the suit brought against
her husband by one of his associates in the Blue Star enterprise. The only new information conveyed in the letter was the fact
of its showing Boyne,
Inference mode involves loading the pretrained weights and vectorizer, then using these components to run the model and generate output of the specified length.
PS C:\gpt> python MinimalGPT.py -i -ol 500 -e 6 -b 512 -s 10 -dm 128 -p 8 -ds 1 -lt './models/tokenizer2.mgt' -lw './models/weights2.mgw'
(None, 10, 128)
100%|██████████████████████████████████████████████████████████████████████████████████████| 490/490 [00:13<00:00, 35.93it/s]
of her own “on the other from the inel’--a little sensational, of course. But I guess you’d better look it over.” He
held out a newspaper to Mary, who unfolded it slowly, remembering, as she did so, the evening when, in that same room, the per
usal of a clipping from the “Sentinel” had first shaken the depths of her security. As she opened the paper, her eyes, shrinki
ng from the glaring head-lines, “Widow of Boyne’s Victim Forced to Appeal for Aid,” ran down the column of text to two portrai
ts inserted in it. The first was her husband’s, taken from a photograph made the year they had come to England. It was the pic
ture of him that she liked best, the one that stood on the writing-table up-stairs in her bedroom. As the eyes in the photogra
ph met hers, she felt it would be impossible to read what was said of him, and closed her lids with the sharpness of the pain.
“I thought if you felt disposed to put your name down--” she heard Parvis continue. She opened her eyes with an effort, and t
hey fell on the other portrait. It was that of a youngish man, slightly built, in rough clothes, with features somewhat blurre
d by the shadow of a projecting hat-brim. Where had she seen that outline before? She stared at it confusedly, her heart hamme
ring in her throat and ears. Then she gave a cry. “This is the man--the man who came for my husband!” She heard Parvis start t
o his feet, and was dimly aware that she had slipped backward into the corner of the sofa, and that he was bending above her i
n alarm. With an intense effort she straightened herself, and reached out for the paper, which she had dropped. “It’s the man!
I should know him anywhere!” she cried in a voice that sounded in her own ears like a scream. Parvis’s voice seemed to come t
o her from far off, down endless, fog-muffled windings. “Mrs. Boyne, you’re not very well. Shall I call somebody? Shall I get
a glass of water?” “No, no, no!” She threw herself toward him, her hand frantically clenching the newspaper. “I tell you, it’s
the man! I KNOW him! He spoke to me in the garden!” Parvis took the journal from her, directing his glasses to the portrait.
“It can’t be, Mrs. Boyne. It’s Robert Elwell.” “Robert Elwell?” Her white
Incorporating a trained model produced by MinimalGPT.py into your own project is straightforward: import the MinimalGPT function and configure it with the desired specification. This is achieved by setting return_model_and_vectorizer = True or return_model_and_vectorizer_and_output = True within the inference_only = True (inference mode) setting. Training, creating, and exporting a model can likewise be accomplished with a similar approach, paralleling the command-line mode. For a comprehensive walkthrough of these processes, the accompanying Jupyter Notebook provides example demonstrations.
from MinimalGPT import MinimalGPT

model = MinimalGPT(output_length=200, gpt_input=10, d_model=128, h=8, decoder_stacks=1, load_tokenizer='./models/tokenizer3.mgt', load_weights='./models/weights3.mgw', inference_only=True, return_model_and_vectorizer_and_output=True)
model[0].summary()
Model: "model"
Layer (type) Output Shape Param
================================================================= input_1 (InputLayer) [(None, 10)] 0
embedding (Embedding) (None, 10, 128) 1826816
positional_embedding (Posit (None, 10, 128) 0
ionalEmbedding)
decoder (Decoder) (None, 10, 128) 37160
flatten (Flatten) (None, 1280) 0
dense (Dense) (None, 14273) 18283713
tf.nn.softmax (TFOpLambda) (None, 14273) 0
================================================================= Total params: 20,147,689 Trainable params: 20,147,689 Non-trainable params: 0
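Because return_model_and_vectorizer_and_output = True returns a (model, vectorizer, output) tuple, as the -mvo switch documents, the remaining elements can be unpacked in the same way:

gpt_model = model[0]       # the Keras model summarized above
vectorizer = model[1]      # the vectorization layer
generated_text = model[2]  # the generated output
print(generated_text)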
The model implemented here differs slightly from the original paper. The matrix formed by concatenating the heads of the scaled dot-product attention outputs is multiplied by a parameter matrix of size key-dimension x d_model. This small adjustment reduces the number of trainable parameters and, for practical purposes, yields a slight performance gain.
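Under one plausible reading of that description, the saving is roughly a factor of the head count: the standard Transformer output projection W_O has (h * d_v) x d_model entries, while a key-dimension x d_model projection is h times smaller. A back-of-the-envelope sketch, using this document's dimensions (d_model = 128, 8 heads) and the common assumption d_k = d_v = d_model / h:

# Back-of-the-envelope comparison; dimensions assumed, not taken from GPT.py.
d_model, h = 128, 8
d_k = d_v = d_model // h              # 16, per-head key/value dimension

standard_w_o = (h * d_v) * d_model    # original Transformer W_O: 16384 parameters
reduced_w_o = d_k * d_model           # variant described above: 2048 parameters
print(standard_w_o, reduced_w_o)      # the reduced projection is h times smaller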