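Baby GPT is an exploratory project that builds a GPT-like language model step by step. It starts from a simple bigram model and gradually incorporates more advanced concepts from the Transformer architecture.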
| Hyperparameter | CPU model | GPU model |
| --- | --- | --- |
| device | 'cpu' | 'cuda' if available, otherwise 'cpu' |
| batch_size | 16 | 64 |
| block_size | 8 | 256 |
| num_iter | 10000 | 10000 |
| eval_interval | 500 | 500 |
| eval_iters | 100 | 200 |
| d_model | 16 | 512 |
| d_k | 4 | 16 |
| Nx | 2 | 6 |
| dropout_rate | 0.2 | 0.2 |
| lr_rate | 0.005 (5e-3) | 0.001 (1e-3) |
| h | 2 | 6 |
# Create a mini-batch for training or validation.
def get_batch(split):
    # Select data based on the training or validation split.
    data = train_data if split == "train" else valid_data
    # Sample random start indices, leaving room for 'block_size' elements plus the shifted target.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Build input (x) and target (y) sequences; y is x shifted one position to the right.
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    # Move the batch to the configured device (GPU if available).
    x, y = x.to(device), y.to(device)
    return x, y
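The relationship between inputs and targets in get_batch can be illustrated with a small plain-Python sketch that does not use PyTorch. The token values, the `make_example` helper, and the block size of 4 are made up for this illustration; the only point is that the target is the input shifted one position to the right, and that the largest valid start index leaves room for that shift.

```python
# Toy illustration in plain Python (no PyTorch); `tokens` is made-up data.
tokens = list(range(10, 20))   # stand-in for an encoded text: [10, 11, ..., 19]
block_size = 4                 # context length used only for this example

def make_example(data, start, block_size):
    """Return (x, y) where y is x shifted one token to the right."""
    x = data[start : start + block_size]
    y = data[start + 1 : start + block_size + 1]
    return x, y

x, y = make_example(tokens, 2, block_size)
print(x)  # [12, 13, 14, 15]
print(y)  # [13, 14, 15, 16]

# Largest valid start index: the target slice must still fit inside the data.
max_start = len(tokens) - block_size - 1
```

At every position t, the model sees x[0..t] and is trained to predict y[t], so one block yields block_size training signals.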
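| Factor | Small batch | Large batch |
| --- | --- | --- |
| Gradient noise | Higher (more variation between updates) | Lower (more consistent updates) |
| Convergence | Tends to explore more solutions, including flatter minima | Often converges to sharper minima |
| Generalization | Potentially better (flatter minima) | Potentially worse (sharper minima) |
| Bias | Lower (less likely to overfit training data patterns) | Higher (may overfit training data patterns) |
| Variance | Higher (more exploration of the solution space) | Lower (less exploration of the solution space) |
| Computational cost | Higher per epoch (more updates) | Lower per epoch (fewer updates) |
| Memory usage | Lower | Higher |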
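The "more updates per epoch" row can be made concrete with a little arithmetic. The sketch below assumes a hypothetical dataset of 100,000 training examples (this number is not from the project) and counts how many optimizer steps one pass over the data would take for the two batch sizes in the hyperparameter table. Note that get_batch samples random windows, so an "epoch" here is only a notional unit.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

num_examples = 100_000  # hypothetical dataset size, chosen only for illustration
small = updates_per_epoch(num_examples, 16)   # CPU batch size
large = updates_per_epoch(num_examples, 64)   # GPU batch size
print(small, large)  # 6250 1563
```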
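The estimate_loss function computes the model's average loss over a specified number of iterations (eval_iters). It evaluates the model's performance without affecting its parameters. The model is switched to evaluation mode, which disables layers such as dropout so that the loss is computed consistently. After the average loss has been computed for both the training and validation data, the model is put back into training mode. This function is essential for monitoring training and making adjustments when necessary.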
@torch.no_grad()  # Disable gradient tracking to save memory and computation
def estimate_loss():
    result = {}  # Mean loss per data split
    model.eval()  # Evaluation mode: disables dropout
    # Iterate over the data splits; any name other than 'train' selects valid_data in get_batch
    for split in ['train', 'valid_date']:
        # Tensor holding the loss of each evaluation iteration
        losses = torch.zeros(eval_iters)
        for e in range(eval_iters):
            X, Y = get_batch(split)  # Fetch a batch of data
            logits, loss = model(X, Y)  # Forward pass returns the outputs and the loss
            losses[e] = loss.item()  # Record the loss for this iteration
        # Store the mean loss for the current split
        result[split] = losses.mean()
    model.train()  # Restore training mode
    return result  # Dictionary with the mean loss per split
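Here we set up the AdamW optimizer and use it to train the neural network model in PyTorch. The Adam optimizer is favored in many deep learning scenarios because it combines the advantages of two other extensions of stochastic gradient descent: AdaGrad and RMSProp. Adam computes an adaptive learning rate for each parameter. In addition to storing an exponentially decaying average of past squared gradients, as RMSProp does, Adam also keeps an exponentially decaying average of past gradients, similar to momentum. This lets the optimizer adjust the learning rate for each weight of the network, which can make training on complex datasets and architectures more effective.
AdamW changes how weight decay is incorporated into the optimization process. It addresses an issue in the original Adam optimizer, where weight decay is not well separated from the gradient update and regularization is therefore applied suboptimally. Using AdamW can sometimes lead to better training performance and better generalization to unseen data. We chose AdamW because it handles weight decay more effectively than the standard Adam optimizer, which may improve model training and generalization.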
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_rate)
for step in range(num_iter):
    # Estimate the loss every eval_interval steps
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss is {losses['train']:.5f} and validation loss is {losses['valid_date']:.5f}")
    # Sample a mini-batch of data
    xb, yb = get_batch("train")
    # Forward pass
    logits, loss = model(xb, yb)
    # Zero the gradients: PyTorch accumulates gradients by default, so reset them before backpropagation.
    optimizer.zero_grad(set_to_none=True)
    # Backward pass (backpropagation): compute the gradients
    loss.backward()
    # Update the model parameters
    optimizer.step()
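To make the difference between Adam with L2 regularization and AdamW's decoupled weight decay more tangible, here is a minimal scalar sketch in plain Python. It is not PyTorch's implementation: the helper `adam_step`, the learning rate of 0.1, and the decay of 0.01 are assumptions chosen for illustration. The loss gradient is set to zero so that only the weight decay term moves the parameter.

```python
import math

def adam_step(p, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = b1 * m + (1 - b1) * g          # moving average of gradients (momentum-like)
    v = b2 * v + (1 - b2) * g * g      # moving average of squared gradients (RMSProp-like)
    m_hat = m / (1 - b1 ** t)          # bias correction for the first step(s)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p0, grad, wd, lr = 1.0, 0.0, 0.01, 0.1   # zero loss gradient isolates the decay term

# Adam with L2 regularization: decay is folded into the gradient, then rescaled by Adam.
p_l2, _, _ = adam_step(p0, grad + wd * p0, 0.0, 0.0, t=1, lr=lr)

# AdamW: decay is applied directly to the weight, separately from the Adam step.
p_w, _, _ = adam_step(p0 * (1 - lr * wd), grad, 0.0, 0.0, t=1, lr=lr)

print(p_l2)  # close to 0.9: Adam's normalization turns the small decay into a full-size step
print(p_w)   # close to 0.999: the weight shrinks by the factor (1 - lr * wd)
```

In the L2 variant, the adaptive rescaling hides how small the decay term was, whereas the decoupled variant shrinks the weight by exactly the configured amount.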
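Self-attention is a mechanism that lets the model weigh the importance of different parts of the input differently. It is a key component of the Transformer architecture and allows the model to focus on the relevant parts of the input sequence when making predictions.
OneHeadSelfAttention: An implementation of the single-head self-attention mechanism (the SelfAttention class), which lets the model attend to different positions of the input sequence.
Each model in the Baby GPT project builds incrementally on the previous one, starting with the intuition behind self-attention, moving on to practical implementations of dot-product and scaled dot-product attention, and finally integrating a one-head self-attention module.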
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention with one head; d_k is the size of the key, query, and value projections."""

    def __init__(self, d_k):
        super().__init__()  # superclass initialization for proper torch functionality
        # keys
        self.keys = nn.Linear(d_model, d_k, bias=False)
        # queries
        self.queries = nn.Linear(d_model, d_k, bias=False)
        # values
        self.values = nn.Linear(d_model, d_k, bias=False)
        # lower-triangular causal mask, stored as a buffer (saved with the module, not trained)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, X):
        """Compute causal scaled dot-product attention."""
        B, T, C = X.shape  # C is d_model
        # Keys matrix K
        K = self.keys(X)  # (B, T, d_k)
        # Query matrix Q
        Q = self.queries(X)  # (B, T, d_k)
        # Scaled dot product: divide by sqrt(d_k), the key dimension (not the input width C)
        scaled_dot_product = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])  # (B, T, T)
        # Mask the upper triangle so a position cannot attend to later positions
        scaled_dot_product_masked = scaled_dot_product.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        # Softmax over the key dimension
        attention_matrix = F.softmax(scaled_dot_product_masked, dim=-1)  # (B, T, T)
        # Weighted aggregation of the values
        V = self.values(X)  # (B, T, d_k)
        output = attention_matrix @ V  # (B, T, d_k)
        return output
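The SelfAttention class represents a fundamental building block of the Transformer model, encapsulating the self-attention mechanism with a single head. Here is a closer look at its components and flow:
Initialization: the constructor __init__(self, d_k) creates the bias-free linear projections for keys, queries, and values, each mapping d_model features to d_k features.
Buffers: self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) stores a lower-triangular matrix that serves as the causal mask. A buffer is saved with the module but is not a trainable parameter.
Forward Pass: the forward(self, X) method defines the computation performed each time the self-attention module is called.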
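The masking and softmax steps of the forward pass can be traced by hand on a tiny example. The sketch below is plain Python rather than PyTorch, and the 3x3 score matrix is a made-up stand-in for the scaled dot products. It shows that after masking, each position distributes its attention only over itself and earlier positions.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Mask the future: entries with j > i become -inf, so exp(-inf) = 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)                      # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3x3 score matrix (T = 3), standing in for Q @ K^T / sqrt(d_k).
scores = [[0.0, 1.0, 2.0],
          [0.0, 0.0, 5.0],
          [1.0, 1.0, 1.0]]
attention = causal_softmax(scores)
for row in attention:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The large score of 5.0 in the second row has no effect because it sits above the diagonal and is masked out before the softmax.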
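MultiHeadAttention: Combines the outputs of multiple attention heads in the MultiHeadAttention class. The class extends the single-head self-attention mechanism from the previous step: several attention heads now operate in parallel, and each can attend to different parts of the input.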
class MultiHeadAttention(nn.Module):
    """Multi-head self-attention; h is the number of heads."""

    def __init__(self, h, d_k):
        super().__init__()
        # initialize the heads: h attention heads, each of size d_k
        self.heads = nn.ModuleList([SelfAttention(d_k) for _ in range(h)])
        # linear layer projecting the concatenated heads back to the model dimension
        self.projections = nn.Linear(h * d_k, d_model)
        # dropout layer
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, X):
        # run the self-attention heads and concatenate their outputs along the channel dimension
        combined_attentions = torch.cat([h(X) for h in self.heads], dim=-1)
        # project the concatenated heads back to the model dimension
        combined_attentions = self.projections(combined_attentions)
        # apply dropout
        combined_attentions = self.dropout(combined_attentions)
        return combined_attentions
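The concatenation along the channel dimension can be pictured with nested lists. This plain-Python sketch uses two hypothetical heads with made-up outputs for two tokens; `concat_heads` is an illustrative helper and mirrors what torch.cat(..., dim=-1) does for a single batch element.

```python
def concat_heads(head_outputs):
    """Concatenate per-head outputs along the feature (last) dimension.

    head_outputs: list of h heads, each a list of T token vectors of length d_k.
    Returns a list of T vectors of length h * d_k.
    """
    num_tokens = len(head_outputs[0])
    return [
        [value for head in head_outputs for value in head[t]]
        for t in range(num_tokens)
    ]

# Two hypothetical heads (h = 2), two tokens (T = 2), d_k = 2.
head_a = [[1, 2], [3, 4]]
head_b = [[5, 6], [7, 8]]
combined = concat_heads([head_a, head_b])
print(combined)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Each token keeps its position, and its feature vector grows from d_k to h * d_k entries before the projection layer maps it to d_model.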
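FeedForward: Implements a feed-forward neural network with ReLU activation in the FeedForward class. As in the original Transformer model, this fully connected feed-forward network is added to our model.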
class FeedForward(nn.Module):
    """Feed-forward layer with ReLU activation function"""

    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            # 2 linear layers with a ReLU activation in between, followed by dropout
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout_rate),
        )

    def forward(self, X):
        # apply the feed-forward network
        return self.net(X)
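Because the hidden layer is four times as wide as d_model, the feed-forward network accounts for a large share of a block's parameters. The following arithmetic sketch counts weights and biases for the two linear layers, using the d_model values 16 and 512 from the hyperparameter table. The helper names are made up for this example.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully connected layer: weights plus optional biases."""
    return in_features * out_features + (out_features if bias else 0)

def feedforward_params(d_model, expansion=4):
    """Parameters in Linear(d_model, 4*d_model) followed by Linear(4*d_model, d_model)."""
    hidden = expansion * d_model
    return linear_params(d_model, hidden) + linear_params(hidden, d_model)

cpu_count = feedforward_params(16)    # d_model of the CPU configuration
gpu_count = feedforward_params(512)   # d_model of the GPU configuration
print(cpu_count, gpu_count)  # 2128 2099712
```

The count grows roughly with the square of d_model, which is one reason the larger configuration calls for a GPU.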
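TransformerBlocks: Stacks Transformer blocks using the Block class to create a deeper network architecture. Each additional layer (or block, in Transformer terms) allows the network to capture more complex and abstract features of the input data.
Sequential processing: Each Transformer block processes the output of the block before it, gradually building a more sophisticated understanding of the input. This sequential processing lets the network develop deep, hierarchical representations of the data.
Components of a Transformer block: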
# ---------------------------------- Blocks ----------------------------------#
class Block(nn.Module):
    """One Transformer block; several of these are stacked to build the model"""

    def __init__(self, d_model, h):
        super().__init__()
        d_k = d_model // h
        # Layer 4: Attention layer
        self.attention_head = MultiHeadAttention(h, d_k)  # h heads of d_k dimensional self-attention
        # Layer 5: Feed-forward layer
        self.feedforward = FeedForward(d_model)
        # Layer Normalization 1
        self.ln1 = nn.LayerNorm(d_model)
        # Layer Normalization 2
        self.ln2 = nn.LayerNorm(d_model)

    # Adding X back to each sublayer's output forms the residual connections
    def forward(self, X):
        X = X + self.attention_head(self.ln1(X))
        X = X + self.feedforward(self.ln2(X))
        return X
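As written, Block derives the per-head size from d_model and h (d_k = d_model // h) rather than reading a separate d_k setting. The short arithmetic sketch below applies that formula to the d_model and h values from the hyperparameter table. The helper `head_dims` is made up for this illustration.

```python
def head_dims(d_model, h):
    """Per-head size as computed in Block (d_model // h) and the concatenated width h * d_k."""
    d_k = d_model // h
    return d_k, h * d_k

cpu_dims = head_dims(16, 2)    # d_model and h of the CPU configuration
gpu_dims = head_dims(512, 6)   # d_model and h of the GPU configuration
print(cpu_dims)  # (8, 16)
print(gpu_dims)  # (85, 510)
```

When h does not divide d_model evenly, as with 512 and 6, the concatenated heads are slightly narrower than d_model, and the projection layer nn.Linear(h * d_k, d_model) in MultiHeadAttention maps them back to the full width.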
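ResidualConnections: Enhances the Block class with residual connections to improve learning efficiency. Residual connections, also known as skip connections, are a key innovation in the design of deep neural networks, and in Transformer models in particular. They address one of the main challenges of training deep networks: the vanishing gradient problem.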
# Adding X back to each sublayer's output forms the residual connections
def forward(self, X):
    X = X + self.attention_head(self.ln1(X))
    X = X + self.feedforward(self.ln2(X))
    return X
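Why the added X helps against vanishing gradients can be seen in a scalar toy model. The sketch below is plain Python and is not part of the project: it stacks 20 copies of a made-up layer f(x) = 0.5 * tanh(x) and uses the chain rule to compute the gradient of the output with respect to the input, once without and once with a skip connection.

```python
import math

def layer_derivative(x, w=0.5):
    """Derivative of the toy layer f(x) = w * tanh(x)."""
    return w * (1.0 - math.tanh(x) ** 2)

def input_gradient(x, depth, residual, w=0.5):
    """d(output)/d(input) through `depth` stacked toy layers, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        local = layer_derivative(x, w)
        if residual:
            grad *= 1.0 + local          # x + f(x): the identity path contributes the 1
            x = x + w * math.tanh(x)
        else:
            grad *= local                # f(x) alone: each factor is at most w
            x = w * math.tanh(x)
    return grad

plain = input_gradient(0.3, depth=20, residual=False)
skip = input_gradient(0.3, depth=20, residual=True)
print(plain, skip)
```

Without the skip connection every factor is at most 0.5, so the product shrinks below one millionth after 20 layers. With it, every factor is at least 1, so the gradient signal reaches the first layer intact in this toy setting.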
LayerNorm: Adds layer normalization to the Transformer, normalizing the layer outputs inside the Block class.
class LayerNorm:
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # forward pass calculation: normalize over the last (feature) dimension
        xmean = x.mean(-1, keepdim=True)  # layer mean
        xvar = x.var(-1, keepdim=True)  # layer variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
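The normalization itself is easy to verify on one feature vector. The following plain-Python sketch uses made-up activations and a hypothetical `layer_norm` helper. It uses the population (biased) variance, which is also what nn.LayerNorm uses, whereas x.var in the class above applies Bessel's correction by default, so the two differ slightly for small feature sizes.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n        # population (biased) variance
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

features = [2.0, 4.0, 6.0, 8.0]          # made-up activations for one token
normalized = layer_norm(features)
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```

With gamma set to ones and beta set to zeros, as in the initial state of the class above, the output has mean close to 0 and variance close to 1.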
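Dropout: Added to the self-attention layers as a regularization method to prevent overfitting. In the code shown in this document, dropout is applied after the output projection in MultiHeadAttention and as the last layer of FeedForward.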
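What a dropout layer does in training versus evaluation mode can be sketched in a few lines of plain Python. This is an illustration of inverted dropout, the scheme nn.Dropout follows, and not PyTorch's implementation. The helper `dropout`, the seed, and the input values are assumptions made for this example.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)              # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)                   # fixed seed so the example is repeatable
activations = [1.0] * 10
train_out = dropout(activations, p=0.2, training=True, rng=rng)
eval_out = dropout(activations, p=0.2, training=False, rng=rng)
print(train_out)   # a mix of 0.0 and 1.25
print(eval_out)    # unchanged: ten values of 1.0
```

Rescaling the surviving activations keeps their expected value the same in both modes, which is why estimate_loss can switch to model.eval() without any further adjustment.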
ScaleUp: Increases the complexity of the model by scaling up batch_size, block_size, d_model, and d_k. You will need the CUDA toolkit and a machine with an NVIDIA GPU to train and test this larger model.
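If you want to try GPU acceleration with CUDA, make sure you have an appropriate CUDA-enabled version of PyTorch installed.
import torch
torch.cuda.is_available()
You can do this by specifying the CUDA version in the PyTorch installation command, for example on the command line: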
pip install torch torchvision torchaudio --extra-index-url