https://github.com/abhaskumarsinha/Corpus2GPT
[GPT-1 Paper] [1002 short stories from Project Gutenberg] [logo.com] [Transformer - Paper] [Huggingface Transformers] [TensorFlow] [BPE Tokenizer: subword-nmt]
MinimalGPT is a concise, adaptable, and streamlined code framework that contains the essential components needed to build, train, run inference with, and fine-tune a GPT model. The framework is implemented exclusively with Keras and TensorFlow, which keeps it compatible with the broader deep learning ecosystem.
NEW: CPU/GPU/TPU support and support for loading large file datasets!
The repository contains two core files that make up the framework. The first file, GPT.py, serves as the foundation and contains the essential blocks and layers: multi-head attention, the feedforward mechanism, scaled dot-product attention, positional encoding, softmaxed output, and an inference function for model prediction. The second file, MinimalGPT.py, simplifies use of the framework by offering a concise command-line interface. This interface lets users create, train, save, load, fine-tune, and run inference with a model, all from a single command-line invocation. Both files can also be imported into Python code, so users can integrate them into their own projects through simple function calls.
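To make the component list above more concrete, here is a minimal pure-Python sketch of scaled dot-product attention with a causal mask. It is not the code from GPT.py: the function names, the list-of-lists representation, and the use of a causal mask are assumptions made for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v, causal=True):
    """q, k: seq_len vectors of size d_k; v: seq_len vectors of size d_v."""
    d_k = len(k[0])
    out = []
    for i, q_i in enumerate(q):
        # Under a causal mask, position i only attends to positions 0..i.
        visible = range(i + 1) if causal else range(len(k))
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Each output vector is a weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, visible))
            for c in range(len(v[0]))
        ])
    return out

# Hypothetical toy inputs: 3 positions, d_k = d_v = 2.
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attended = scaled_dot_product_attention(q, k, v)
print(attended[0])  # [1.0, 2.0] because the first position can only see itself
```

Multi-head attention runs several such attention computations in parallel on different learned projections and concatenates the results.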
pip install -r requirements.txt
The model architecture is governed by several critical parameters, including GPT_INPUT, D_MODEL, MULTI_HEAD, and DECODER_STACKS. These parameters must stay consistent across runs, otherwise loading the model for retraining or for later inference will fail. When in doubt, consult the configuration file generated during the previous run. In addition, the VOCABULARY_START and VOCABULARY_END parameters define the window markers for the corpus. These markers are used to generate the Vectorizer layer, which extracts the vocabulary from the corpus between the specified START and END token counts. Note that tokens in the corpus are separated by whitespace, and that VOCABULARY_START and VOCABULARY_END matter most when a token file is not specified explicitly.
Also note that BOTH the tokenizer file and the model weights are saved/loaded together. The code does not currently support saving or loading these two files separately.
Inference mode (-i) requires the model parameters together with the saved tokenizer and weights files in order to generate output. It must also be used with the (-ol) switch.
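The vocabulary window described above can be sketched as follows. This is a simplified illustration, not the repository's Vectorizer layer: `build_vocabulary` is a hypothetical helper, and the only behavior taken from the text is that tokens are whitespace-separated and that the vocabulary is drawn from the token range VOCABULARY_START to VOCABULARY_END.

```python
def build_vocabulary(corpus_text, vocabulary_start, vocabulary_end):
    """Collect the distinct tokens found in the given token window."""
    tokens = corpus_text.split()  # tokens are separated by whitespace
    window = tokens[vocabulary_start:vocabulary_end]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(window))

corpus = "the cat sat on the mat and the dog sat too"
vocab = build_vocabulary(corpus, 0, 6)
print(vocab)  # ['the', 'cat', 'sat', 'on', 'mat']
```

In this sketch, a token that appears only outside the window (such as "dog") never enters the vocabulary, which is why the window should cover the text the model will later be trained on.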
usage: MinimalGPT.py [-h] [-d DATA_PATH] [-l LEARNING_RATE]
                     [-ol OUTPUT_LENGTH] [-e EPOCHS] [-b BATCH_SIZE]
                     [-s GPT_INPUT] [-dm D_MODEL] [-p MULTI_HEAD]
                     [-ds DECODER_STACKS] [-ts TOKEN_START] [-te TOKEN_END]
                     [-vs VOCABULARY_START] [-ve VOCABULARY_END] [-sd]
                     [-lt LOAD_TOKENIZER] [-lw LOAD_WEIGHTS]
                     [-st SAVE_TOKENIZER] [-sw SAVE_WEIGHTS] [-ot OPTIMIZER]
                     [-i] [-mv] [-mvo]
optional arguments:
-h, --help show this help message and exit
-d DATA_PATH, --data-path DATA_PATH
File: Corresponding to corpus or training text
[String]
-l LEARNING_RATE, --learning-rate LEARNING_RATE
Float: Learning Rate. The model will train ONLY IF the
rate is > 0, skip otherwise [Float]
-ol OUTPUT_LENGTH, --output-length OUTPUT_LENGTH
Length of the output sequence to be generated
-e EPOCHS, --epochs EPOCHS
Number of training Epochs [Int]
-b BATCH_SIZE, --batch-size BATCH_SIZE
Size of each batch [Int]
-s GPT_INPUT, --gpt-input GPT_INPUT
Number of Tokens of text the model inputs at a time
[Int]
-dm D_MODEL, --d-model D_MODEL
Embedding layer output dimensions [Int]
-p MULTI_HEAD, --multi-head MULTI_HEAD
Number of Multi-head Attention layer in parallel [Int]
-ds DECODER_STACKS, --decoder-stacks DECODER_STACKS
Number of stacked Decoder layer [Int]
-ts TOKEN_START, --token-start TOKEN_START
The token number in the corpus to mark it as the
starting point of the training [Int]
-te TOKEN_END, --token-end TOKEN_END
The token number in the corpus to mark it as the end
point of the training [Int]
-vs VOCABULARY_START, --vocabulary-start VOCABULARY_START
Token number from the corpus to mark the starting
point of vocabulary data [Int]
-ve VOCABULARY_END, --vocabulary-end VOCABULARY_END
Token number from the corpus to mark the end point of
vocabulary data [Int]
-sd, --save Save the Model and Vectorizer data to disk
[True/False]
-lt LOAD_TOKENIZER, --load-tokenizer LOAD_TOKENIZER
File: Vectorization layer [File]
-lw LOAD_WEIGHTS, --load-weights LOAD_WEIGHTS
File: Model Weights [File]
-st SAVE_TOKENIZER, --save-tokenizer SAVE_TOKENIZER
File: Saving Vectorizer File [File]
-sw SAVE_WEIGHTS, --save-weights SAVE_WEIGHTS
File: Saving Model Weights[File]
-ot OPTIMIZER, --optimizer OPTIMIZER
Optimizer consistent to TensorFlow optimizer class
[tf.keras.optimizers]
-i, --inference-only Only Print the output of the model in Inference Mode
[True/False]
-mv, --model-vectorizer
Return Model, Vectorizer Tuple [True/False]
-mvo, --model-vectorizer-output
Return Model, Vectorizer, Output Tuple [True/False]
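For readers who want to see how such an interface is typically declared, here is a small argparse sketch covering a subset of the flags listed above. It is not the parser from MinimalGPT.py: the default values and the `should_train` helper are assumptions, based only on the help text stating that training runs only when the learning rate is greater than 0.

```python
import argparse

def build_parser():
    """Declare a small subset of the flags shown in the usage text."""
    parser = argparse.ArgumentParser(prog="MinimalGPT.py")
    parser.add_argument("-d", "--data-path", type=str)
    # Default of 0.0 is an assumption, not taken from the repository.
    parser.add_argument("-l", "--learning-rate", type=float, default=0.0)
    parser.add_argument("-ol", "--output-length", type=int)
    parser.add_argument("-s", "--gpt-input", type=int)
    parser.add_argument("-i", "--inference-only", action="store_true")
    return parser

def should_train(args):
    """Hypothetical rule: train only outside inference mode with a positive rate."""
    return (not args.inference_only) and args.learning_rate > 0

args = build_parser().parse_args(["-i", "-ol", "500", "-s", "10"])
print(args.inference_only, args.output_length, should_train(args))  # True 500 False
```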
Assume the desired model specification is GPT_INPUT = 10, D_MODEL = 128, MULTI_HEAD = 8, and DECODER_STACKS = 1, that the training range of the corpus spans TOKEN_START = 0 to TOKEN_END = 40000, and that the vectorizer layer is generated from the corpus range VOCABULARY_START = 0 to VOCABULARY_END = 200000. The following command then starts the training process. The resulting weights and tokenizer data are saved to the specified folder. The output that follows shows the result of running this command.
PS C:\gpt> python MinimalGPT.py -d './dataset/output_dataset.txt' -l 0.001 -ol 200 -e 4 -b 512 -s 10 -dm 128 -p 8 -ds 1 -ts 0 -te 40000 -vs 0 -ve 200000 -sd -st './models/tokenizer.mgt' -sw './models/weights.mgw'
Total tokens: 40000
100%|██████████████████████████████████████████████████████████████████████████████| 200000/200000 [02:02<00:00, 1636.38it/s]
New Vectorizer created successfully...
Vocabulary Size: 14270
100%|██████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 302926.25it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 1289942.19it/s]
(None, 10, 128)
Epoch 1/4
79/79 [==============================] - 88s 1s/step - loss: 7.8692
Epoch 2/4
79/79 [==============================] - 92s 1s/step - loss: 3.8066
Epoch 3/4
79/79 [==============================] - 93s 1s/step - loss: 1.1487
Epoch 4/4
79/79 [==============================] - 92s 1s/step - loss: 0.2900
100%|██████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 34.70it/s]
Vocabulary size saved: 14270
and her eyes in the library. She was the rather large woman, although not fat, and when she wore high heels--which sh
e was not prone to do, because although Cutter would not have cared, she kept trying to project into other people's minds and
trying, as she said, "Not to do anything to them, that I wouldn't want them to do you me."--she rose a good inch above Cutter.
She was pleasant humored, and cooperative, and the one great irritant about her that annoyed Cutter, was the fact that she wa
s not capable of meeting life wholeheartedly and with strength. She steadily worried about other people's feelings and thought
s, so that Cutter wondered if she were capable of the slightest personal conviction. Yet that weakness was an advantage at the
same time, to him, because she worked constantly toward making him happy. The house was run to his minutest liking, and the s
ervants liked her, so that while she did not use a strong enough
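The counts in the training log above can be reproduced with a short sliding-window sketch. The log reports 39989 samples for 40000 tokens and 79 steps per epoch at batch size 512. The indexing inside the repository is not shown here, so the loop bound below is an assumption chosen to match the logged number, and `make_training_pairs` is a hypothetical helper.

```python
import math

def make_training_pairs(token_ids, gpt_input):
    """Pair each window of gpt_input tokens with the token that follows it."""
    pairs = []
    # Loop bound is an assumption picked so the count matches the log (39989).
    for i in range(len(token_ids) - gpt_input - 1):
        context = token_ids[i:i + gpt_input]
        target = token_ids[i + gpt_input]
        pairs.append((context, target))
    return pairs

token_ids = list(range(40000))  # stand-in for TOKEN_START = 0 .. TOKEN_END = 40000
pairs = make_training_pairs(token_ids, gpt_input=10)
steps_per_epoch = math.ceil(len(pairs) / 512)
print(len(pairs), steps_per_epoch)  # 39989 79
```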
Suppose we want to fine-tune the model above (or retrain it). The command below reloads the tokenizer and weights and retrains the model on new text from a specified window of the corpus:
PS C:\gpt> python MinimalGPT.py -d './dataset/output_dataset.txt' -l 0.00005 -ol 200 -e 1 -b 512 -s 10 -dm 128 -p 8 -ds 1 -ts 80000 -te 120000 -sd -st './models/tokenizer2.mgt' -sw './models/weights2.mgw' -lt './models/tokenizer.mgt' -lw './models/weights.mgw'
Total tokens: 40000
100%|██████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 302923.51it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 39989/39989 [00:00<00:00, 1428099.68it/s]
(None, 10, 128)
79/79 [==============================] - 81s 993ms/step - loss: 7.9725
100%|██████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:06<00:00, 30.29it/s]
Vocabulary size saved: 14270
of her own the black of my own and my wife had could seen the house at the same moment her mind caught the first sugg
estion of the folded paper. “But he must have a name! Where is the paper?” She moved to the desk, and began to turn over the s
cattered documents that littered it. The first that caught her eye was an unfinished letter in her husband’s hand, with his pe
n lying across it, as though dropped there at a sudden summons. “My dear Parvis,”--who was Parvis?--“I have just received your
letter announcing Elwell’s death, and while I suppose there is now no farther risk of trouble, it might be safer--” That was
all. The “risk of trouble” was easily explained by the newspaper clipping which had apprised Mary of the suit brought against
her husband by one of his associates in the Blue Star enterprise. The only new information conveyed in the letter was the fact
of its showing Boyne,
Inference mode loads the pre-trained weights and vectorizer. These components are then used to run the model and generate output of the specified length.
PS C:\gpt> python MinimalGPT.py -i -ol 500 -e 6 -b 512 -s 10 -dm 128 -p 8 -ds 1 -lt './models/tokenizer2.mgt' -lw './models/weights2.mgw'
(None, 10, 128)
100%|██████████████████████████████████████████████████████████████████████████████████████| 490/490 [00:13<00:00, 35.93it/s]
of her own “on the other from the inel’--a little sensational, of course. But I guess you’d better look it over.” He
held out a newspaper to Mary, who unfolded it slowly, remembering, as she did so, the evening when, in that same room, the per
usal of a clipping from the “Sentinel” had first shaken the depths of her security. As she opened the paper, her eyes, shrinki
ng from the glaring head-lines, “Widow of Boyne’s Victim Forced to Appeal for Aid,” ran down the column of text to two portrai
ts inserted in it. The first was her husband’s, taken from a photograph made the year they had come to England. It was the pic
ture of him that she liked best, the one that stood on the writing-table up-stairs in her bedroom. As the eyes in the photogra
ph met hers, she felt it would be impossible to read what was said of him, and closed her lids with the sharpness of the pain.
“I thought if you felt disposed to put your name down--” she heard Parvis continue. She opened her eyes with an effort, and t
hey fell on the other portrait. It was that of a youngish man, slightly built, in rough clothes, with features somewhat blurre
d by the shadow of a projecting hat-brim. Where had she seen that outline before? She stared at it confusedly, her heart hamme
ring in her throat and ears. Then she gave a cry. “This is the man--the man who came for my husband!” She heard Parvis start t
o his feet, and was dimly aware that she had slipped backward into the corner of the sofa, and that he was bending above her i
n alarm. With an intense effort she straightened herself, and reached out for the paper, which she had dropped. “It’s the man!
I should know him anywhere!” she cried in a voice that sounded in her own ears like a scream. Parvis’s voice seemed to come t
o her from far off, down endless, fog-muffled windings. “Mrs. Boyne, you’re not very well. Shall I call somebody? Shall I get
a glass of water?” “No, no, no!” She threw herself toward him, her hand frantically clenching the newspaper. “I tell you, it’s
the man! I KNOW him! He spoke to me in the garden!” Parvis took the journal from her, directing his glasses to the portrait.
“It can’t be, Mrs. Boyne. It’s Robert Elwell.” “Robert Elwell?” Her white
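The progress bars above (190 iterations for -ol 200 and 490 iterations for -ol 500, both with GPT_INPUT = 10) are consistent with a generation loop that starts from GPT_INPUT seed tokens and appends one token per step. The sketch below illustrates that reading with a dummy predictor. `generate` and `predict_next` are hypothetical names, and the real inference function in GPT.py may differ.

```python
def generate(predict_next, seed_tokens, gpt_input, output_length):
    """Extend the seed one token at a time until output_length tokens exist."""
    tokens = list(seed_tokens[:gpt_input])
    steps = 0
    while len(tokens) < output_length:
        context = tokens[-gpt_input:]  # the model only sees the last gpt_input tokens
        tokens.append(predict_next(context))
        steps += 1
    return tokens, steps

# Dummy predictor (hypothetical): the next token id is the last id plus one.
tokens, steps = generate(lambda ctx: ctx[-1] + 1, list(range(10)),
                         gpt_input=10, output_length=500)
print(len(tokens), steps)  # 500 490
```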
A model trained with MinimalGPT.py can be used in your own project by importing the MinimalGPT function and configuring it as needed. To do so, set return_model_and_vectorizer = True or return_model_and_vectorizer_and_output = True together with inference_only = True (Inference Mode). Training, model creation, and export can be done the same way, mirroring the command-line mode. The accompanying Jupyter Notebook provides a worked demonstration of this procedure.
from MinimalGPT import MinimalGPT

# With return_model_and_vectorizer_and_output=True the call returns a
# (model, vectorizer, output) tuple, so element 0 is the Keras model.
model = MinimalGPT(output_length=200,
                   gpt_input=10,
                   d_model=128,
                   h=8,
                   decoder_stacks=1,
                   load_tokenizer='./models/tokenizer3.mgt',
                   load_weights='./models/weights3.mgw',
                   inference_only=True,
                   return_model_and_vectorizer_and_output=True)
model[0].summary()
Model: "model"
 Layer (type)                 Output Shape              Param #
=================================================================
 input_1 (InputLayer)         [(None, 10)]              0
 embedding (Embedding)        (None, 10, 128)           1826816
 positional_embedding (Posit  (None, 10, 128)           0
 ionalEmbedding)
 decoder (Decoder)            (None, 10, 128)           37160
 flatten (Flatten)            (None, 1280)              0
 dense (Dense)                (None, 14273)             18283713
 tf.nn.softmax (TFOpLambda)   (None, 14273)             0
=================================================================
Total params: 20,147,689
Trainable params: 20,147,689
Non-trainable params: 0
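The parameter counts in the summary above can be checked with a few lines of arithmetic. The numbers are copied from the summary. The only assumptions are the usual Keras conventions that an Embedding layer has rows x d_model weights and a Dense layer has inputs x units weights plus one bias per unit.

```python
gpt_input, d_model = 10, 128

# Embedding: one d_model-sized row per vocabulary entry.
embedding_params = 1826816
embedding_rows = embedding_params // d_model

# Flatten turns (gpt_input, d_model) into a single vector.
flatten_units = gpt_input * d_model

# Dense: weight matrix plus one bias per output unit.
dense_units = 14273
dense_params = flatten_units * dense_units + dense_units

decoder_params = 37160  # taken directly from the summary, not derived here
total_params = embedding_params + decoder_params + dense_params

print(embedding_rows, flatten_units, dense_params, total_params)
# 14272 1280 18283713 20147689
```

In this model the final Dense layer accounts for the large majority of the parameters, because it maps the flattened 1280-dimensional vector to every vocabulary entry.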
The model implemented here differs slightly from the implementation in the original paper. The matrix formed by concatenating the heads of the scaled dot-product attention output is multiplied by a matrix parameter of size key dimension x d_model. In practice, this small tweak reduces the number of parameters and is expected to give a slight performance gain, since fewer trainable parameters need to be optimized.