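Building the LLaMA 3 LLM from Scratch Using Python

LLaMA 3 is one of the most promising open-source models after Mistral, capable of handling a wide range of tasks. I previously wrote a blog post on Medium about creating an LLM with over 2.3 million parameters from scratch using the LLaMA architecture. Now that LLaMA 3 has been released, we will recreate it in a simpler way.

We won't use a GPU in this blog, but you will need at least 17 GB of RAM because we will load files larger than 15 GB. If that is a problem, you can use Kaggle: since we don't need a GPU, a CPU-only Kaggle session provides 30 GB of RAM.

Here is the link to the earlier blog post on creating a 2.3+ million parameter LLM from scratch: 2.3+ Million Parameter LLM From Scratch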

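Prerequisites
Differences Between LLaMA 2 and LLaMA 3
Understanding the Transformer Architecture of LLaMA 3
- Pre-normalization Using RMSNorm
- SwiGLU Activation Function
- Rotary Embeddings (RoPE)
- Byte Pair Encoding (BPE) Algorithm
Setting the Stage
Understanding the File Structure
Tokenizing Our Input Data
Creating Embeddings for Each Token
Normalization Using RMSNorm
Attention Heads (Query, Key, Values)
Implementing RoPE
Implementing Self-Attention
Implementing Multi-Head Attention
Implementing the SwiGLU Activation Function
Merging Everything
Generating the Output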

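Prerequisites

The good part is that we won't use object-oriented programming (OOP), just plain Python. However, you should have a basic understanding of neural networks and the Transformer architecture. These are the only two prerequisites needed to follow this blog.

Topic	Link
Transformer Theory	Video Link
Neural Network Theory	Video Link
Python Basics	Video Link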

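Differences Between LLaMA 2 and LLaMA 3

Before looking into the technical details, the first thing you must know is that the overall architecture of LLaMA 3 is the same as that of LLaMA 2, so if you have not yet gone through the technical details of LLaMA 3, that will not be a problem for this blog. Even if you don't know the LLaMA 2 architecture, don't worry: we will also go through a high-level overview of its technical details. Either way, this blog is designed for you.

Here are some key points about LLaMA 2 and LLaMA 3.

Feature	Llama 3	Llama 2
Tokenizer	Tiktoken (developed by OpenAI)	SentencePiece
Number of parameters	8B, 70B	70B, 13B, 7B
Training data	15T tokens	2T tokens
Context length	8192 tokens	4096 tokens
Attention mechanism	Grouped-query attention	Grouped-query attention (70B model only)
Fine-tuned models	Yes	Yes
Performance	Better than Llama 2 on all benchmarks	Better than Llama 1 on most benchmarks
Computational requirements	Very high (70B model)	Very high (70B model)
Availability	Open source	Open source
Reinforcement learning from human feedback	Yes	Yes
Number of languages supported	30 languages	20 languages
Suitable for	Best for more demanding tasks, such as reasoning, coding, and proficiency tests	Good for more demanding tasks, such as reasoning, coding, and proficiency tests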

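Understanding the Transformer Architecture of LLaMA 3

Understanding the architecture of LLaMA 3 is important before diving into coding. For a better visual understanding, here is a comparison diagram between the vanilla Transformer, LLaMA 2/3, and Mistral.

Let's look at the most important components of LLaMA 3 in a bit more detail:

1. Pre-normalization Using RMSNorm:

In LLaMA 3, as in LLaMA 2, a technique called RMSNorm is used to normalize the input of each transformer sub-layer.

Imagine you are preparing for a big exam and you have a huge textbook full of chapters. Each chapter covers a different topic, but some chapters matter more than others for understanding the subject. Before diving into the whole textbook, you decide to assess the importance of each chapter. You don't want to spend the same amount of time on every chapter; you want to focus more on the critical ones. This is where pre-normalization using RMSNorm comes into play for large language models (LLMs) like ChatGPT. It is like assigning a weight to each chapter based on its importance. Chapters that are fundamental to the subject get higher weights, while less important ones get lower weights.

So, before studying in depth, you adjust your study plan according to the weighted importance of each chapter. You allocate more time and effort to the chapters with higher weights, ensuring you thoroughly grasp the core concepts.

Similarly, pre-normalization using RMSNorm helps LLMs prioritize which parts of the text are more critical for understanding context and meaning. It assigns higher weights to essential elements and lower weights to less crucial ones, ensuring the model focuses its attention where it is most needed for accurate comprehension. In concrete terms, RMSNorm divides each activation vector by its root mean square and then multiplies it by learned per-dimension weights, which keeps activations on a stable scale before each sub-layer. Interested readers can explore the detailed implementation of RMSNorm here.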
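
To make the operation behind the analogy concrete, here is a minimal sketch of RMSNorm in plain Python. It is only an illustration: the function name `rms_norm_sketch`, the `eps` default, and the toy values are assumptions, and the blog's own PyTorch version appears later.

```python
import math

def rms_norm_sketch(x, gain, eps=1e-5):
    """Scale x to (roughly) unit root-mean-square, then apply a per-dimension gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # eps guards against division by zero
    return [v * inv_rms * g for v, g in zip(x, gain)]

x = [3.0, -4.0, 0.0, 5.0]    # toy activation vector (hypothetical values)
gain = [1.0, 1.0, 1.0, 1.0]  # stand-in for the learned normalization weights
y = rms_norm_sketch(x, gain)

# After normalization the vector's root mean square is close to 1.
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_after)
```

Note that, unlike LayerNorm, this sketch does not subtract the mean; it only rescales.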

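2. SwiGLU Activation Function:

LLaMA introduces the SwiGLU activation function, drawing inspiration from PaLM.

Imagine you are a teacher trying to explain a complex topic to your students. You have a big whiteboard where you write down key points and draw diagrams to make things clearer. But sometimes your handwriting might not be very neat, or your diagrams might not be perfectly drawn. This can make it harder for your students to understand the material.

Now imagine you had a magic pen that automatically adjusted the size and style of your handwriting based on how important each point is. If something is really crucial, the pen writes it bigger and clearer so that it stands out. If it is less important, the pen writes it smaller, but still legibly. SwiGLU is like that magic pen for large language models (LLMs) like ChatGPT. Before generating text, SwiGLU adjusts the importance of each word or phrase based on its relevance to the context. Just as the magic pen adjusts the size and style of your writing, SwiGLU adjusts the emphasis of each word or phrase.

So, when the LLM generates text, it can give more prominence to the important parts, making them more noticeable and ensuring they contribute more to the overall understanding of the text. In this way, SwiGLU helps LLMs produce text that is clearer and easier to understand, just as the magic pen helps you create clearer explanations for your students on the whiteboard. Further details on SwiGLU can be found in the associated paper.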
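
For readers who prefer code to analogies, the following is a small sketch of a SwiGLU feed-forward block in plain Python. It assumes the common LLaMA-style layout in which `w1` produces the gate, `w3` the values being gated, and `w2` projects back to the model dimension; the toy matrices and helper names are hypothetical.

```python
import math

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def swiglu_ffn(x, w1, w3, w2):
    gate = [silu(z) for z in matvec(w1, x)]     # gating branch
    up = matvec(w3, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w2, hidden)                   # project back to model dim

# Toy sizes: model dim 2, hidden dim 3 (hypothetical values).
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w3 = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

out = swiglu_ffn([1.0, 2.0], w1, w3, w2)
zero_out = swiglu_ffn([0.0, 0.0], w1, w3, w2)
print(out, zero_out)
```

The gate lets the network scale each hidden unit smoothly between "mostly off" and "passed through", which is the "emphasis" the analogy refers to.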

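3. Rotary Embeddings (RoPE):

Rotary Embeddings, or RoPE, are a type of position embedding used in LLaMA 3.

Imagine you are in a classroom and you want to assign seats to students for a group discussion. Typically, you might arrange the seats in rows and columns, with each student having a fixed position. However, in some cases you want a more dynamic seating arrangement in which students can move around and interact more freely.

RoPE is like a special seating arrangement that allows students to rotate and change positions while still maintaining their positions relative to each other. Instead of being fixed in one place, students can now move around in a circular motion, allowing for more fluid interactions.

In this scenario, each student represents a word or token in a text sequence, and their seat corresponds to their position in the sequence. Just as the seating arrangement lets students rotate and change positions, RoPE lets the positional embeddings of words in a text sequence change dynamically based on their positions relative to each other. So, when processing text, instead of treating positional embeddings as fixed and static, RoPE introduces a rotational aspect, allowing more flexible representations that capture the dynamic relationships between words in the sequence. This flexibility helps models like ChatGPT better understand and generate text that flows naturally and stays coherent, much as a dynamic seating arrangement fosters more interactive discussions in a classroom. Those interested in the mathematical details can refer to the RoPE paper.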
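
The "relative position" property can be checked numerically. The sketch below rotates consecutive pairs of vector components by position-dependent angles and shows that the dot product between a rotated query and a rotated key depends only on their offset. The pairing of adjacent components, the function names, and the toy vectors are assumptions for illustration; the `base` default mirrors the `rope_theta` value that appears later in `params.json`.

```python
import math

def rope_rotate(vec, position, base=500000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)   # earlier pairs rotate faster
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # toy query vector (hypothetical values)
k = [0.2, -1.0, 0.7, 0.4]   # toy key vector (hypothetical values)

# Same offset (2 positions apart) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(score_a, score_b)  # equal up to floating-point error
```

Because only the offset matters, the attention score between two tokens is unaffected by where the pair sits in the sequence.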

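4. Byte Pair Encoding (BPE) Algorithm

LLaMA 3 uses byte pair encoding (BPE) from the tiktoken library introduced by OpenAI, whereas the LLaMA 2 tokenizer's BPE is based on the SentencePiece library. There are subtle differences between them, but first, let's understand what BPE actually is.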

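Let's start with a simple example. Suppose we have a text corpus containing the words "ab", "bc", "bcd", and "cde". We begin by initializing the vocabulary with all the individual characters in the corpus, so our initial vocabulary is {"a", "b", "c", "d", "e"}.

Next, we count the frequency of each character in the corpus. For our example, the frequencies are {"a": 1, "b": 3, "c": 3, "d": 2, "e": 1}.

Now we start the merging process. We repeat the following steps until the vocabulary reaches the desired size:

First, we find the most frequent pair of adjacent symbols. Here "bc" and "cd" both occur twice, and we break the tie in favor of "bc". After merging, the corpus is segmented as [a, b], [bc], [bc, d], [c, d, e], and the updated frequencies are {"a": 1, "b": 1, "c": 1, "d": 2, "e": 1, "bc": 2}. We add the new subword unit "bc" to the vocabulary, which becomes {"a", "b", "c", "d", "e", "bc"}.
We repeat the process. Every remaining pair now occurs exactly once, so any of them may be chosen; we merge "c" and "d" into "cd". The corpus becomes [a, b], [bc], [bc, d], [cd, e], and the frequencies become {"a": 1, "b": 1, "c": 0, "d": 1, "e": 1, "bc": 2, "cd": 1}. The vocabulary is now {"a", "b", "c", "d", "e", "bc", "cd"}.
Note that "de" can no longer be merged on its own, because the only "d" that preceded an "e" has been absorbed into "cd".
Next, we merge "a" and "b" into "ab". The corpus becomes [ab], [bc], [bc, d], [cd, e], and the frequencies become {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "bc": 2, "cd": 1, "ab": 1}.
We add "ab" to the vocabulary, which becomes {"a", "b", "c", "d", "e", "bc", "cd", "ab"}.
Then we merge "bc" and "d" into "bcd". The corpus becomes [ab], [bc], [bcd], [cd, e], and the frequencies become {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "bc": 1, "cd": 1, "ab": 1, "bcd": 1}. The vocabulary is now {"a", "b", "c", "d", "e", "bc", "cd", "ab", "bcd"}.
Finally, we merge "cd" and "e" into "cde". The corpus becomes [ab], [bc], [bcd], [cde], and the frequencies become {"a": 0, "b": 0, "c": 0, "d": 0, "e": 0, "bc": 1, "cd": 0, "ab": 1, "bcd": 1, "cde": 1}. The final vocabulary is {"a", "b", "c", "d", "e", "bc", "cd", "ab", "bcd", "cde"}.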
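
The merge loop described above can be sketched in a few lines of plain Python. This is a toy trainer, not the tiktoken implementation: the function name `learn_bpe` is hypothetical, and ties between equally frequent pairs are broken by whichever pair is seen first, so the merge order on a tied corpus depends on that choice.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges; ties go to the first pair encountered."""
    corpus = [list(word) for word in words]
    vocab = sorted({ch for word in words for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols in the current segmentation.
        pairs = Counter()
        for symbols in corpus:
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first pair with the highest count
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Replace every occurrence of the chosen pair with the merged symbol.
        new_corpus = []
        for symbols in corpus:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus.append(merged)
        corpus = new_corpus
    return vocab, merges, corpus

vocab, merges, corpus = learn_bpe(["ab", "bc", "bcd", "cde"], num_merges=5)
print(merges)
print(corpus)
```

On this corpus the first merge is "bc", and after five merges every word is a single token.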

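This technique can improve the performance of LLMs and handle rare and out-of-vocabulary words. The biggest difference between tiktoken BPE and SentencePiece BPE is that tiktoken BPE does not always split words into smaller parts if the whole word is already known. For example, if "hugging" is in the vocabulary, it stays as a single token instead of being split into ["hug", "ging"].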

Setting the Stage

We will work with a small set of Python libraries, but it is better to install them up front to avoid "no module found" errors.

!pip install sentencepiece tiktoken torch blobfile matplotlib huggingface_hub

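After installing the required libraries, we need to download some files. Since we are going to replicate the architecture of Llama-3-8B, you must have an account on HuggingFace. Also, because Llama-3 is a gated model, you have to accept its terms and conditions to access the model contents.

Here are the steps:

Create a HuggingFace account from this link
Accept the terms and conditions of Llama-3-8B from this link

Once you have completed both steps, we have to download some files. There are two options for doing that:

(Option 1: Manual) Go to the Llama-3-8B HF directory from this link and manually download each of these three files.

(Option 2: Coding) We can use the huggingface_hub library, which we installed earlier, to download all of these files. However, first we need to log in to the HuggingFace Hub within our working notebook using our HF token. You can create a new token or access it from this link.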

# Import the `notebook_login` function from the `huggingface_hub` module.
from huggingface_hub import notebook_login

# Execute the `notebook_login` function to log in to the Hugging Face Hub.
notebook_login()

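After you run this cell, it will ask you to enter the token. If an error occurs during login, retry, but make sure to uncheck "Add token as git credential". After that, we just need to run a simple Python snippet to download the three files that form the backbone of the Llama-3-8B architecture.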

# Import the necessary function from the huggingface_hub library
from huggingface_hub import hf_hub_download

# Define the repository information
repo_id = "meta-llama/Meta-Llama-3-8B"
subfolder = "original"  # Specify the subfolder within the repository

# List of filenames to download
filenames = ["params.json", "tokenizer.model", "consolidated.00.pth"]

# Specify the directory where you want to save the downloaded files
save_directory = "llama-3-8B/"  # Replace with your desired path

# Download each file
for filename in filenames:
    hf_hub_download(
        repo_id=repo_id,          # Repository ID
        filename=filename,        # Name of the file to download
        subfolder=subfolder,      # Subfolder within the repository
        local_dir=save_directory  # Directory to save the downloaded file
    )

original/params.json:   0%|          | 0.00/211 [00:00<?, ?B/s]
original/tokenizer.model:   0%|          | 0.00/2.18M [00:00<?, ?B/s]
original/consolidated.00.pth:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

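Once all the files are downloaded, we need to import the libraries that we will use in this blog.

# File system paths
from pathlib import Path

# Tokenization library
import tiktoken

# BPE loading function
from tiktoken.load import load_tiktoken_bpe

# PyTorch library
import torch

# JSON handling
import json

# Plotting library
import matplotlib.pyplot as plt

Next, we need to understand what each file will be used for.

Understanding the File Structure

Since we are aiming for an exact replication of Llama-3, our input text must yield a meaningful output. For example, if our input is "the color of the sun is?", the output must be "white". Achieving this requires training our LLM on a large dataset, which demands a lot of computational power and is not feasible for us.

However, Meta has publicly released its Llama-3 architecture files, or more precisely its pretrained weights, for use. We have just downloaded these files, which lets us replicate the architecture without needing training or a large dataset. Everything is already prepared; we just have to use the right components in the right places.

Let's look at each of these files and its role:

tokenizer.model - As discussed earlier, LLaMA-3 uses the byte pair encoding (BPE) tokenizer from tiktoken, and it was trained on a dataset of 15 trillion tokens, about 7 times larger than the dataset used for LLaMA-2. Let's load this file and see what it contains.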

# Loading the tokenizer from llama-3-8B
tokenizer_model = load_tiktoken_bpe("/kaggle/working/llama-3-8B/original/tokenizer.model")

# Get the length of the tokenizer model
len(tokenizer_model)
# OUTPUT: 128000

# Get the type of the `tokenizer_model` object.
type(tokenizer_model)
# OUTPUT: dict

 dict

The length shows the total vocabulary size, that is, the number of BPE tokens (byte sequences) learned from the training data, not counting special tokens. The type of tokenizer_model is a dictionary.

# Printing 10 items of the tokenizer model (ranks 5600 to 5609)
dict(list(tokenizer_model.items())[5600:5610])

 {b'mitted': 5600,
 b" $('#": 5601,
 b' saw': 5602,
 b' approach': 5603,
 b'ICE': 5604,
 b' saying': 5605,
 b' anyone': 5606,
 b'meta': 5607,
 b'SD': 5608,
 b' song': 5609}

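When we print 10 arbitrary items from it, you will see byte strings that were formed by the BPE algorithm, similar to the example we discussed earlier. The keys are byte sequences from BPE training, while the values are merge ranks based on frequency.

consolidated.00.pth - Contains the learned parameters (weights) of Llama-3-8B. These parameters include information about how the model understands and processes language, such as how it represents tokens, computes attention, performs feed-forward transformations, and normalizes its outputs.

# Loading a PyTorch model of LLaMA-3-8B
model = torch.load("/kaggle/working/llama-3-8B/original/consolidated.00.pth")

# Printing the first 11 weight names of the architecture
list(model.keys())[:11]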

 ['tok_embeddings.weight',
 'layers.0.attention.wq.weight',
 'layers.0.attention.wk.weight',
 'layers.0.attention.wv.weight',
 'layers.0.attention.wo.weight',
 'layers.0.feed_forward.w1.weight',
 'layers.0.feed_forward.w3.weight',
 'layers.0.feed_forward.w2.weight',
 'layers.0.attention_norm.weight',
 'layers.0.ffn_norm.weight',
 'layers.1.attention.wq.weight']

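If you are familiar with the transformer architecture, you will recognize the query and key matrices and so on. Later, we will use these layers/weights to build such matrices within the Llama-3 architecture.

params.json - Contains various parameter values, such as:

# Opening the parameters JSON file
with open("/kaggle/working/llama-3-8B/original/params.json", "r") as f:
    config = json.load(f)

# Printing the content
print(config)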

 {'dim': 4096, 'n_layers': 32, 'n_heads': 32, 'n_kv_heads': 8, 'vocab_size': 128256, 'multiple_of': 1024, 'ffn_dim_multiplier': 1.3, 'norm_eps': 1e-05, 'rope_theta': 500000.0}

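These values will help us replicate the Llama-3 architecture by specifying details such as the number of heads, the dimension of the embedding vectors, and so on.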
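
As a quick sanity check on how these values fit together, the sketch below derives a few sizes from the config printed above. The rounding rule in `ffn_hidden_dim` follows my understanding of how LLaMA-style reference code sizes the feed-forward layer; treat both the helper name and that rule as assumptions rather than something stated in this blog.

```python
config = {
    "dim": 4096,
    "n_heads": 32,
    "n_kv_heads": 8,
    "multiple_of": 1024,
    "ffn_dim_multiplier": 1.3,
}

# Each attention head works on an equal slice of the model dimension.
head_dim = config["dim"] // config["n_heads"]

# Key/value projections only cover n_kv_heads heads.
kv_dim = config["n_kv_heads"] * head_dim

def ffn_hidden_dim(dim, multiple_of, ffn_dim_multiplier):
    """Assumed sizing rule: 2/3 of 4*dim, scaled, then rounded up to a multiple."""
    hidden = int(2 * (4 * dim) / 3)
    hidden = int(ffn_dim_multiplier * hidden)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

hidden_dim = ffn_hidden_dim(
    config["dim"], config["multiple_of"], config["ffn_dim_multiplier"]
)

print(head_dim, kv_dim, hidden_dim)  # 128 1024 14336
```

The 1024 here matches the first dimension of the key and value weight matrices that are printed later in the walkthrough.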

Let's store these values so we can use them later.

# Dimension
dim = config["dim"]

# Layers
n_layers = config["n_layers"]

# Heads
n_heads = config["n_heads"]

# KV_heads
n_kv_heads = config["n_kv_heads"]

# Vocabulary
vocab_size = config["vocab_size"]

# Multiple
multiple_of = config["multiple_of"]

# Multiplier
ffn_dim_multiplier = config["ffn_dim_multiplier"]

# Epsilon
norm_eps = config["norm_eps"]

# RoPE
rope_theta = torch.tensor(config["rope_theta"])

Now that we have the tokenizer model, the architecture model containing the weights, and the configuration parameters, let's start coding our own Llama-3 from scratch.

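Tokenizing Our Input Data

The first thing we need to do is convert our input text into tokens. To do that, we first have to create some special tokens, which provide structural markers within the tokenized text and let the tokenizer recognize and handle specific conditions or instructions.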

special_tokens = [
    "<|begin_of_text|>",  # Marks the beginning of a text sequence.
    "<|end_of_text|>",  # Marks the end of a text sequence.
    "<|reserved_special_token_0|>",  # Reserved for future use.
    "<|reserved_special_token_1|>",  # Reserved for future use.
    "<|reserved_special_token_2|>",  # Reserved for future use.
    "<|reserved_special_token_3|>",  # Reserved for future use.
    "<|start_header_id|>",  # Indicates the start of a header ID.
    "<|end_header_id|>",  # Indicates the end of a header ID.
    "<|reserved_special_token_4|>",  # Reserved for future use.
    "<|eot_id|>",  # Marks the end of a turn (in a conversational context).
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]  # A large set of tokens reserved for future use.

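Next, we define the rules for splitting text into tokens by specifying patterns that match various types of substrings in the input text. Here is how we can do that.

# Patterns based on which text will be broken into tokens
tokenize_breaker = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

It can extract words, contractions, numbers (up to three digits), and sequences of non-whitespace characters from the input text, and you can customize it to your requirements. We need to write a simple tokenizer using tiktoken BPE that takes three inputs: tokenizer_model, tokenize_breaker, and special_tokens. It will encode/decode our input text accordingly.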
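
To see what this kind of pattern does before BPE is applied, here is a rough ASCII-only approximation that runs with Python's built-in `re` module. The standard library does not support `\p{L}` and `\p{N}`, so they are replaced with `[a-zA-Z]` and `[0-9]`; the name `ascii_breaker` is hypothetical, and this simplified pattern is an assumption for illustration, not the one the tokenizer actually uses.

```python
import re

# ASCII-only stand-in for the Unicode-aware pre-tokenization pattern.
ascii_breaker = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"    # common English contractions
    r"|[^\r\na-zA-Z0-9]?[a-zA-Z]+"     # a word, optionally with one leading symbol/space
    r"|[0-9]{1,3}"                     # digits in groups of at most three
    r"| ?[^\sa-zA-Z0-9]+[\r\n]*"       # punctuation runs
    r"|\s*[\r\n]+"                     # newlines
    r"|\s+(?!\S)"                      # trailing whitespace
    r"|\s+"                            # any other whitespace
)

text = "It's 2024, ok!"
pieces = re.findall(ascii_breaker, text)
print(pieces)  # ['It', "'s", ' ', '202', '4', ',', ' ok', '!']
```

Each piece is then encoded independently by BPE, which is why a leading space usually ends up attached to the word that follows it.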

# Initialize tokenizer with specified parameters
tokenizer = tiktoken.Encoding(

    # make sure to set path to tokenizer.model file
    name="/kaggle/working/llama-3-8B/original/tokenizer.model",

    # Define tokenization pattern string
    pat_str=tokenize_breaker,

    # Assign BPE mergeable ranks from tokenizer_model of LLaMA-3
    mergeable_ranks=tokenizer_model,

    # Set special tokens with indices
    special_tokens={token: len(tokenizer_model) + i for i, token in enumerate(special_tokens)},
)

# Encode "hello world!" and decode tokens to string
tokenizer.decode(tokenizer.encode("hello world!"))

 'hello world!'

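To verify that our tokenizer works properly, we pass "hello world!" to it. First it encodes the text, converting it into numeric values. Then it decodes those values back to text, giving "hello world!". This confirms that the tokenizer works properly. Let's tokenize our input.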
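
It can help to see which IDs the special tokens receive. The sketch below repeats the ID assignment in plain Python, using the base vocabulary size of 128000 reported earlier; the variable names `named_tokens` and `special_ids` are hypothetical.

```python
base_vocab_size = 128000  # number of regular BPE tokens reported earlier

named_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",
]
all_special = named_tokens + [
    f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)
]

# Special tokens are numbered right after the regular vocabulary.
special_ids = {token: base_vocab_size + i for i, token in enumerate(all_special)}

print(len(special_ids))                  # 256
print(special_ids["<|begin_of_text|>"])  # 128000
print(special_ids["<|eot_id|>"])         # 128009
print(max(special_ids.values()))         # 128255
```

Together, 128000 regular tokens and 256 special tokens give the vocab_size of 128256 found in params.json.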

# input prompt
prompt = "the answer to the ultimate question of life, the universe, and everything is "

# Encode the prompt using the tokenizer and prepend the <|begin_of_text|> token (ID 128000)
tokens = [128000] + tokenizer.encode(prompt)

print(tokens)  # Print the encoded tokens

# Convert the list of tokens into a PyTorch tensor
tokens = torch.tensor(tokens)

# Decode each token back into its corresponding string
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]

print(prompt_split_as_tokens)  # Print the decoded tokens

 [128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']

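We encoded our input text "the answer to the ultimate question of life, the universe, and everything is " with a special token at the start.

Creating Embeddings for Each Token

If we check the length of our input vector, it will be:

# checking dimension of input vector and embedding vector from llama-3 architecture
print(dim, len(tokens))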

 4096 17

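Our input vector currently has dimension (17x1) and needs to be converted into an embedding for each tokenized word. This means our (17x1) tokens will become (17x4096), where each token has a corresponding embedding of length 4096.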

# Define embedding layer with vocab size and embedding dimension
embedding_layer = torch.nn.Embedding(vocab_size, dim)

# Copy pre-trained token embeddings to the embedding layer
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])

# Get token embeddings for given tokens, converting to torch.bfloat16 format
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)

# Print shape of resulting token embeddings
token_embeddings_unnormalized.shape

 torch.Size([17, 4096])

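These embeddings are not normalized, and leaving them that way can make the downstream computations numerically unstable. In the next section, we will normalize our input vectors.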
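
Conceptually, an embedding layer is just a lookup table with one row per token ID. The toy sketch below illustrates this with a made-up 10-token vocabulary and 4-dimensional rows in plain Python; the sizes, names, and random values are all hypothetical.

```python
import random

random.seed(0)

toy_vocab_size, toy_dim = 10, 4

# One row of toy_dim numbers per token ID (random stand-ins for learned weights).
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(toy_dim)]
    for _ in range(toy_vocab_size)
]

toy_tokens = [7, 2, 2, 9]  # a "sentence" of four token IDs

# Embedding lookup: each ID selects its row, so shape (4,) becomes (4, toy_dim).
toy_embeddings = [embedding_table[token_id] for token_id in toy_tokens]

print(len(toy_embeddings), len(toy_embeddings[0]))
```

Repeated token IDs map to the same row, which is why position information has to be injected separately.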

Normalization Using RMSNorm

We will normalize the input vectors using the same RMSNorm formula we saw earlier.

# Calculating RMSNorm
def rms_norm(tensor, norm_weights):

    # Calculate the mean of the square of tensor values along the last dimension
    squared_mean = tensor.pow(2).mean(-1, keepdim=True)

    # Reciprocal of the root mean square; norm_eps avoids division by zero
    inv_rms = torch.rsqrt(squared_mean + norm_eps)

    # Scale the tensor to unit RMS, then multiply by the provided normalization weights
    return (tensor * inv_rms) * norm_weights

We will use the attention_norm weights from layers.0 to normalize our unnormalized embeddings. We use layers.0 because we are now building the first layer of the LLaMA-3 transformer architecture.

# using RMS normalization and provided normalization weights
token_embeddings = rms_norm(token_embeddings_unnormalized,
                            model["layers.0.attention_norm.weight"])

# Print the shape of the resulting token embeddings
token_embeddings.shape

 torch.Size([17, 4096])

As you may already expect, the dimensions don't change, because we are only normalizing the vectors and nothing else.

Attention Heads (Query, Key, Values)

First, we load the query, key, value, and output weight matrices from the model.

# Print the shapes of different weights
print(
    # Query weight shape
    model["layers.0.attention.wq.weight"].shape,

    # Key weight shape
    model["layers.0.attention.wk.weight"].shape,

    # Value weight shape
    model["layers.0.attention.wv.weight"].shape,

    # Output weight shape
    model["layers.0.attention.wo.weight"].shape
)

 torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])

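These dimensions show that the weights we downloaded are not stored separately per head; instead, the weights of multiple attention heads are stacked into a single matrix, which supports parallel computation and training. However, we can unwrap these matrices to work with a single head at a time.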
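
The shapes above can be reproduced with a little arithmetic. The sketch below uses only the config values printed earlier; the assumption that consecutive query heads share one key/value head (grouped-query attention with contiguous groups) reflects my understanding of LLaMA-style models, and the helper names are hypothetical.

```python
n_heads, n_kv_heads, dim = 32, 8, 4096

head_dim = dim // n_heads                # 128 features per head

wq_shape = (n_heads * head_dim, dim)     # all query heads stacked row-wise
wk_shape = (n_kv_heads * head_dim, dim)  # only 8 key heads are stored

# Grouped-query attention: several query heads share one key/value head.
group_size = n_heads // n_kv_heads
kv_head_for_query = [q // group_size for q in range(n_heads)]

def head_rows(head_index):
    """Row range inside a stacked weight matrix that belongs to one head."""
    return head_index * head_dim, (head_index + 1) * head_dim

print(wq_shape, wk_shape)        # (4096, 4096) (1024, 4096)
print(group_size)                # 4
print(kv_head_for_query[:8])     # [0, 0, 0, 0, 1, 1, 1, 1]
print(head_rows(1))              # (128, 256)
```

Slicing rows 0 to 128 of the stacked query matrix therefore gives the weights of the first query head.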

# Retrieve query weight for the first layer of attention
q_layer0 = model["layers.0.attention.wq.weight"]

# Calculate dimension per head
head_dim = q_layer0.shape[0] // n_heads

# Reshape query weight to separate heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)

# Print the shape of the reshaped query weight tensor
q_layer0.shape

 torch.Size([32, 128, 4096])

Here, 32 is the number of attention heads in Llama-3, 128 is the size of the query vector, and 4096 is the size of the token embedding. We can access the query weight matrix of the first head of the first layer with:

# Extract the query weight for the first head of the first layer of attention
q_layer0_head0 = q_layer0[0]

# Print the shape of the extracted query weight tensor for the first head
q_layer0_head0.shape

 torch.Size([128, 4096])

To find the query vector for each token, we multiply the token embeddings by the transposed query weight matrix.

# Matrix multiplication: token embeddings with transpose of query weight for first head
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)

# Shape of resulting tensor: queries per token
q_per_token.shape