
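Building the LLaMA 3 LLM from Scratch Using Python

LLaMA 3 is one of the most promising open-source models since Mistral, and it handles a wide range of tasks. I previously wrote a blog on Medium about creating an LLM with over 2.3 million parameters from scratch using the LLaMA architecture. Now that LLaMA-3 has been released, we will recreate it in a simpler way.

We won't use a GPU in this blog, but you will need at least 17 GB of RAM, because we are going to load some files that are more than 15 GB in size. If that is a problem, you can use Kaggle as a solution. Since we don't need a GPU, Kaggle offers 30 GB of RAM when you use only CPU cores as the accelerator.

Here is the link to the blog that walks you through creating a 2.3+ million parameter LLM from scratch: 2.3+ Million Parameter LLM from Scratch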

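Prerequisites
Difference between LLaMA 2 and LLaMA 3
Understanding the Transformer Architecture of LLaMA 3
- Pre-normalization Using RMSNorm
- SwiGLU Activation Function
- Rotary Embeddings (RoPE)
- Byte Pair Encoding (BPE) Algorithm
Setting the Stage
Understanding the File Structure
Tokenizing Our Input Data
Creating Embeddings for Each Token
Normalization Using RMSNorm
Attention Heads (Query, Key, Values)
Implementing RoPE
Implementing Self-Attention
Implementing Multi-Head Attention
Implementing the SwiGLU Activation Function
Merging Everything
Generating the Output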

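Prerequisites

The good part is that we won't use object-oriented programming (OOP), just plain Python. However, you should have a basic understanding of neural networks and the Transformer architecture. These are the only two prerequisites for following along with the blog.

Topic	Link
Transformer theory	Video link
Neural network theory	Video link
Python basics	Video link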

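Difference between LLaMA 2 and LLaMA 3

Before looking into the technical details, the first thing you must know is that the entire architecture of LLaMA 3 is the same as that of LLaMA 2. So if you haven't gone through the technical details of LLaMA 3 yet, you won't have any trouble following this blog. Even if you don't understand the LLaMA 2 architecture, don't worry: we will also go through a high-level overview of its technical details. This blog is designed for you either way.

Here are some key points about LLaMA 2 and LLaMA 3, as a quick reference if you are already familiar with their architecture: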

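Feature	Llama 3	Llama 2
Tokenizer	Tiktoken (developed by OpenAI)	SentencePiece
Number of parameters	8B, 70B	70B, 13B, 7B
Training data	15T tokens	2T tokens
Context length	8192 tokens	4096 tokens
Attention mechanism	Grouped-query attention	Grouped-query attention (70B model only)
Fine-tuned models	Yes	Yes
Performance	Better than Llama 2 on all benchmarks	Better than Llama 1 on most benchmarks
Computational requirements	Very high (70B model)	Very high (70B model)
Availability	Open source	Open source
Reinforcement learning from human feedback	Yes	Yes
Number of supported languages	30 languages	20 languages
Suitable for	Best for more demanding tasks, such as reasoning, coding, and proficiency tests	Good for more demanding tasks, such as reasoning, coding, and proficiency tests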

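Understanding the Transformer Architecture of LLaMA 3

Understanding the architecture of LLaMA 3 is important before diving into the code. For a better visual understanding, here is a comparison diagram between the vanilla Transformer, LLaMA 2/3, and Mistral.

Let's look at the most important components of LLaMA 3 in a bit more detail.

1. Pre-normalization Using RMSNorm:

In the LLaMA 3 approach, which is the same as LLaMA 2, a technique called RMSNorm is used to normalize the input of each transformer sub-layer.

Imagine you are studying for a big exam and you have a massive textbook full of chapters. Each chapter represents a different topic, but some chapters are more crucial for understanding the subject than others. Before diving into the entire textbook, you decide to evaluate the importance of each chapter. You don't want to spend the same amount of time on every chapter; you want to focus more on the critical ones. This is where pre-normalization using RMSNorm comes into play for large language models (LLMs) like ChatGPT. It is like assigning a weight to each chapter based on its significance. Chapters that are fundamental to the subject get higher weights, while less important ones get lower weights.

So, before studying in depth, you adjust your study plan based on the weighted importance of each chapter. You allocate more time and effort to the chapters with higher weights, making sure you grasp the core concepts thoroughly.

Similarly, pre-normalization using RMSNorm helps LLMs prioritize which parts of the text are more critical for understanding the context and meaning. It assigns higher weights to essential elements and lower weights to less crucial ones, so the model focuses its attention where it is most needed for accurate comprehension. Interested readers can explore the detailed implementation of RMSNorm here.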
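Setting the analogy aside, the arithmetic is small: RMSNorm divides a vector by its root mean square and then multiplies each element by a learned per-dimension weight. Below is a minimal sketch in plain Python lists rather than PyTorch tensors, so it runs without any model files. The function name, the toy vector, and the all-ones weights are assumptions made for illustration; the default `eps` matches `norm_eps` in the model's `params.json`.

```python
import math

def rms_norm(vector, weights, eps=1e-5):
    """Scale `vector` to (roughly) unit root mean square, then apply per-dimension weights."""
    # Mean of the squared elements
    mean_square = sum(v * v for v in vector) / len(vector)
    # Reciprocal of the root mean square; eps guards against division by zero
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(vector, weights)]

# Toy input: RMS of [3, 4] is sqrt((9 + 16) / 2), about 3.54
x = [3.0, 4.0]
y = rms_norm(x, weights=[1.0, 1.0])

# After normalization the mean square of the output is very close to 1
output_mean_square = sum(v * v for v in y) / len(y)
print(y, output_mean_square)
```

Note that, unlike LayerNorm, no mean is subtracted: only the scale of the vector changes, never its direction (before the learned weights are applied).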

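2. SwiGLU Activation Function:

LLaMA introduces the SwiGLU activation function, drawing inspiration from PaLM.

Imagine you are a teacher trying to explain a complex topic to your students. There is a big whiteboard where you write down key points and draw diagrams to make things clearer. But sometimes your handwriting is not very neat, or your diagrams are not perfectly drawn. This can make it harder for your students to understand the material.

Now imagine you had a magic pen that automatically adjusted the size and style of your handwriting based on how important each point is. If something is really crucial, the pen writes it bigger and clearer so that it stands out. If it is less important, the pen writes it smaller, but still legibly. SwiGLU is like that magic pen for large language models (LLMs) like ChatGPT. Before generating text, SwiGLU adjusts the importance of each word or phrase based on its relevance to the context. Just as the magic pen adjusts the size and style of your writing, SwiGLU adjusts the emphasis of each word or phrase.

So when the LLM generates text, it can give more prominence to the important parts, making them more noticeable and letting them contribute more to the overall understanding of the text. In this way, SwiGLU helps LLMs generate text that is clearer and easier to understand, much like the magic pen helps you write clearer explanations for your students on the whiteboard. Further details on SwiGLU can be found in the associated paper.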
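In concrete terms, a SwiGLU feed-forward block computes `w2(silu(w1 x) * w3 x)`: one projection is passed through the SiLU (Swish) function and acts as a gate on a second projection, and a third projection maps the result back to the model dimension. The sketch below uses plain Python lists and tiny made-up matrices; the names `w1`, `w3`, and `w2` mirror the `feed_forward.w1/w3/w2` weight names in the Llama-3 checkpoint, while the helper functions and the toy values are assumptions for illustration.

```python
import math

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def swiglu_ffn(x, w1, w3, w2):
    """Feed-forward block with a SwiGLU gate: w2 @ (silu(w1 @ x) * (w3 @ x))."""
    gate = [silu(g) for g in matvec(w1, x)]      # gating branch
    up = matvec(w3, x)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating
    return matvec(w2, hidden)                    # project back to the model dimension

# Toy sizes: model dimension 2, hidden dimension 3
x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w3 = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]

out = swiglu_ffn(x, w1, w3, w2)
print(out)
```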

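3. Rotary Embeddings (RoPE):

Rotary Embeddings, or RoPE, are the type of position embedding used in LLaMA 3.

Imagine you are in a classroom and you want to assign seats to students for a group discussion. Typically, you might arrange the seats in rows and columns, with each student having a fixed position. In some cases, however, you want a more dynamic seating arrangement in which students can move around and interact more freely.

RoPE is like a special seating arrangement that lets students rotate and change positions while keeping their relative positions to each other. Instead of being fixed in one place, students can now move around in a circular motion, allowing for more fluid interactions.

In this scenario, each student represents a word or token in a text sequence, and their position corresponds to their position in the sequence. Just as RoPE lets students rotate and change positions, it lets the positional embeddings of words in a text sequence change dynamically based on their positions relative to each other. So, when processing text, instead of treating positional embeddings as fixed and static, RoPE introduces a rotational aspect that allows more flexible representations capturing the dynamic relationships between words in the sequence. This flexibility helps models like ChatGPT better understand and generate text that flows naturally and stays coherent, much as a dynamic seating arrangement fosters more interactive discussions in a classroom. Those interested in the mathematical details can refer to the RoPE paper.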
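The "rotation" can be made concrete. RoPE treats each pair of adjacent dimensions of a query or key vector as a point in the plane and rotates it by an angle proportional to the token's position, with a different rotation speed per pair. The dot product between a rotated query and a rotated key then depends only on the distance between their positions. The sketch below is a plain-Python illustration using complex numbers; the function names and toy vectors are assumptions, and `theta=500000.0` is the `rope_theta` value from the model's `params.json`.

```python
import cmath

def rope_rotate(vector, position, theta=500000.0):
    """Rotate each adjacent pair (x[2k], x[2k+1]) by the angle position * theta**(-2k/d)."""
    head_dim = len(vector)
    rotated = []
    for i in range(0, head_dim, 2):
        freq = 1.0 / (theta ** (i / head_dim))               # rotation speed of this pair
        pair = complex(vector[i], vector[i + 1])             # view the pair as a complex number
        turned = pair * cmath.exp(1j * position * freq)      # rotate by position * freq radians
        rotated.extend([turned.real, turned.imag])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative distance (2 positions apart) at two different absolute offsets
score_near_start = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_further_on = dot(rope_rotate(q, 7), rope_rotate(k, 5))
print(score_near_start, score_further_on)  # the two scores agree up to rounding
```

Because a rotation never changes a vector's length, RoPE injects position information without rescaling the queries and keys.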

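4. Byte Pair Encoding (BPE) Algorithm

LLaMA 3 uses Byte Pair Encoding (BPE) from the tiktoken library introduced by OpenAI, whereas the LLaMA 2 tokenizer's BPE is based on the sentencepiece library. There is a slight difference between them, but first let's learn what BPE actually is.

Let's start with a simple example. Suppose we have a text corpus containing the words "ab", "bc", "bcd", and "cde". We begin by initializing our vocabulary with all the individual characters in the corpus, so our initial vocabulary is {"a", "b", "c", "d", "e"}.

Next, we calculate the frequency of each character in the corpus. For our example, the frequencies are {"a": 1, "b": 3, "c": 3, "d": 2, "e": 1}.

Now we start the merging process. We repeat the following steps until our vocabulary reaches the desired size.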

First, we find the most frequent pair of consecutive symbols. Here "bc" and "cd" both occur twice, so we break the tie by taking the pair that appears first in the corpus, "bc". We merge it into a new subword unit "bc" and update the frequency counts to reflect the new unit: {"a": 1, "b": 1, "c": 1, "d": 2, "e": 1, "bc": 2}. We add "bc" to the vocabulary, which becomes {"a", "b", "c", "d", "e", "bc"}. The corpus is now segmented as [a b], [bc], [bc d], [c d e].
We repeat the process. Every remaining pair ("a"+"b", "bc"+"d", "c"+"d", "d"+"e") now occurs once, so we again take the first one, "ab". We merge it into the subword unit "ab" and update the counts to {"a": 0, "b": 0, "c": 1, "d": 2, "e": 1, "bc": 2, "ab": 1}. Adding "ab" makes the vocabulary {"a", "b", "c", "d", "e", "bc", "ab"}.
Continuing, the next pair is "bc"+"d". We merge it into the subword unit "bcd" and update the counts to {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "bc": 1, "ab": 1, "bcd": 1}. Adding "bcd" makes the vocabulary {"a", "b", "c", "d", "e", "bc", "ab", "bcd"}.
Next comes "c"+"d" from the word "cde". We merge it into the subword unit "cd" and update the counts to {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "bc": 1, "ab": 1, "bcd": 1, "cd": 1}.
Adding "cd" makes the vocabulary {"a", "b", "c", "d", "e", "bc", "ab", "bcd", "cd"}.
Finally, the only pair left is "cd"+"e". We merge it into the subword unit "cde" and update the counts to {"a": 0, "b": 0, "c": 0, "d": 0, "e": 0, "bc": 1, "ab": 1, "bcd": 1, "cd": 0, "cde": 1}.
Adding "cde" gives the final vocabulary {"a", "b", "c", "d", "e", "bc", "ab", "bcd", "cd", "cde"}, and each word of the corpus is now a single token.
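The merge loop can be sketched in a few lines of plain Python on the same toy corpus. This is an illustration of the idea, not the tiktoken or sentencepiece implementation: the function names are made up for this sketch, the loop runs until no pairs are left instead of stopping at a target vocabulary size, and ties between equally frequent pairs are broken by taking the pair seen first in the corpus. That tie-breaking rule is an assumption, and a different rule gives a different but equally valid merge order.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words.append(out)
    return new_words

corpus = ["ab", "bc", "bcd", "cde"]
words = [list(word) for word in corpus]                   # start from single characters
vocab = sorted({ch for word in corpus for ch in word})    # initial vocabulary
merges = []

while True:
    pairs = count_pairs(words)
    if not pairs:
        break                                 # every word is a single token
    best = max(pairs, key=pairs.get)          # most frequent pair; first seen wins ties
    words = merge_pair(words, best)
    merges.append(best[0] + best[1])
    vocab.append(best[0] + best[1])

print(merges)
print(vocab)
```

A real tokenizer stores the merges in order (the "ranks") and replays them on new text, which is why rare words can still be encoded from smaller known pieces.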

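This technique can improve the performance of LLMs and lets them handle rare and out-of-vocabulary words. The big difference between TikToken BPE and sentencepiece BPE is that TikToken BPE doesn't always split a word into smaller parts if the whole word is already known. For example, if "hugging" is in the vocabulary, it stays as one token instead of being split into ["hug","ging"].

Setting the Stage

We will be working with a small number of Python libraries, but it is better to install them up front to avoid "no module found" errors.

!pip install sentencepiece tiktoken torch blobfile matplotlib huggingface_hub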

 Requirement already satisfied: sentencepiece in /opt/conda/lib/python3.10/site-packages (0.2.0)
Requirement already satisfied: tiktoken in /opt/conda/lib/python3.10/site-packages (0.7.0)
Requirement already satisfied: torch in /opt/conda/lib/python3.10/site-packages (2.1.2+cpu)
Requirement already satisfied: blobfile in /opt/conda/lib/python3.10/site-packages (2.1.1)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (3.7.5)
Requirement already satisfied: huggingface_hub in /opt/conda/lib/python3.10/site-packages (0.22.2)
Requirement already satisfied: regex>=2022.1.18 in /opt/conda/lib/python3.10/site-packages (from tiktoken) (2023.12.25)
Requirement already satisfied: requests>=2.26.0 in /opt/conda/lib/python3.10/site-packages (from tiktoken) (2.31.0)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch) (3.13.1)
Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.10/site-packages (from torch) (4.9.0)
Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch) (1.12)
Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch) (3.2.1)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch) (3.1.2)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch) (2024.2.0)
Requirement already satisfied: pycryptodomex~=3.8 in /opt/conda/lib/python3.10/site-packages (from blobfile) (3.20.0)
Requirement already satisfied: urllib3<3,>=1.25.3 in /opt/conda/lib/python3.10/site-packages (from blobfile) (1.26.18)
Requirement already satisfied: lxml~=4.9 in /opt/conda/lib/python3.10/site-packages (from blobfile) (4.9.4)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (4.47.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy<2,>=1.20 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (21.3)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (9.5.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.10/site-packages (from huggingface_hub) (6.0.1)
Requirement already satisfied: tqdm>=4.42.1 in /opt/conda/lib/python3.10/site-packages (from huggingface_hub) (4.66.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (3.6)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2024.2.2)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /opt/conda/lib/python3.10/site-packages (from sympy->torch) (1.3.0)

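After installing the required libraries, we need to download some files. Since we are going to replicate the architecture of llama-3-8B, you must have an account on HuggingFace. In addition, because Llama-3 is a gated model, you have to accept its terms and conditions to access the model content.

Here are the steps:

Create a HuggingFace account from this link.
Accept the terms and conditions of llama-3-8B from this link.

Once you have completed both of these steps, we have to download some files. There are two options for doing that:

(Option 1: Manual) Go to the llama-3-8B HF directory from this link and manually download each of the three files.

(Option 2: Coding) We can use the huggingface_hub library, which we installed earlier, to download all of these files. First, however, we need to log in to the HuggingFace Hub from our working notebook using our HF token. You can create a new token or access an existing one from this link.

# Import the `notebook_login` function from the `huggingface_hub` module.
from huggingface_hub import notebook_login

# Execute the `notebook_login` function to log in to the Hugging Face Hub.
notebook_login()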


Once you run this cell, it will ask you to enter the token. If there is an error during login, retry it, but make sure to uncheck "Add token as git credential". After that, we just need to run a short piece of Python code to download the three files that form the backbone of the llama-3-8B architecture.

# Import the necessary function from the huggingface_hub library
from huggingface_hub import hf_hub_download

# Define the repository information
repo_id = "meta-llama/Meta-Llama-3-8B"
subfolder = "original"  # Specify the subfolder within the repository

# List of filenames to download
filenames = ["params.json", "tokenizer.model", "consolidated.00.pth"]

# Specify the directory where you want to save the downloaded files
save_directory = "llama-3-8B/"  # Replace with your desired path

# Download each file
for filename in filenames:
    hf_hub_download(
        repo_id=repo_id,           # Repository ID
        filename=filename,         # Name of the file to download
        subfolder=subfolder,       # Subfolder within the repository
        local_dir=save_directory,  # Directory to save the downloaded file
    )

 original/params.json:   0%|          | 0.00/211 [00:00<?, ?B/s]



original/tokenizer.model:   0%|          | 0.00/2.18M [00:00<?, ?B/s]



original/consolidated.00.pth:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Once all the files are downloaded, we need to import the libraries that we will use throughout this blog.

# File system paths
from pathlib import Path

# Tokenization library
import tiktoken

# BPE loading function
from tiktoken.load import load_tiktoken_bpe

# PyTorch library
import torch

# JSON handling
import json

# Plotting library
import matplotlib.pyplot as plt

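Next, we need to understand what each file will be used for.

Understanding the File Structure

Since we are aiming for an exact replication of llama-3, our input text must yield a meaningful output. For example, if our input is "the color of the sun is?", the output must be "white". Achieving this would require training our LLM on a large dataset, which demands so much computational power that it is infeasible for us.

However, Meta has publicly released the llama-3 architecture files, or in more technical terms, the pretrained weights. We have just downloaded these files, which lets us replicate the architecture without training or a large dataset. Everything is already prepared; we just have to use the right components in the right places.

Let's take a look at each of these files and their importance.

tokenizer.model - As discussed earlier, LLaMA-3 uses the Byte Pair Encoding (BPE) tokenizer from tiktoken. According to the comparison above, the training dataset has 15 trillion tokens, about 7 times more than the dataset used for LLaMA-2. Let's load this file and see what it holds.

# Loading the tokenizer from llama-3-8B
tokenizer_model = load_tiktoken_bpe("/kaggle/working/llama-3-8B/original/tokenizer.model")

# Get the length of the tokenizer model
len(tokenizer_model)
# OUTPUT: 128000

# Get the type of the `tokenizer_model` object.
type(tokenizer_model)
# OUTPUT: dict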

 dict

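The length is the size of the base vocabulary, that is, the number of distinct BPE tokens (byte sequences) learned from the training data, not the number of unique characters. The type of tokenizer_model is a dictionary.

# Printing 10 items of the tokenizer model (ranks 5600 to 5609)
dict(list(tokenizer_model.items())[5600:5610])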

 {b'mitted': 5600,
 b" $('#": 5601,
 b' saw': 5602,
 b' approach': 5603,
 b'ICE': 5604,
 b' saying': 5605,
 b' anyone': 5606,
 b'meta': 5607,
 b'SD': 5608,
 b' song': 5609}

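When we print 10 arbitrary items from it, we see byte strings that were formed by the BPE algorithm, much like in the example we discussed earlier. The keys are the byte sequences from BPE training, and the values are their merge ranks, which are based on frequency.

consolidated.00.pth - Contains the learned parameters (weights) of Llama-3-8B. These parameters hold the information about how the model understands and processes language, such as how it represents tokens, computes attention, performs feed-forward transformations, and normalizes its outputs.

# Loading the PyTorch checkpoint of LLaMA-3-8B
model = torch.load("/kaggle/working/llama-3-8B/original/consolidated.00.pth")

# Printing the names of the first 11 weight tensors in the checkpoint
list(model.keys())[:11]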

 ['tok_embeddings.weight',
 'layers.0.attention.wq.weight',
 'layers.0.attention.wk.weight',
 'layers.0.attention.wv.weight',
 'layers.0.attention.wo.weight',
 'layers.0.feed_forward.w1.weight',
 'layers.0.feed_forward.w3.weight',
 'layers.0.feed_forward.w2.weight',
 'layers.0.attention_norm.weight',
 'layers.0.ffn_norm.weight',
 'layers.1.attention.wq.weight']

If you are familiar with the transformer architecture, you will already know about the query and key matrices and so on. Later, we will use these layers/weights to build those matrices within our Llama-3 architecture.

params.json - Contains various parameter values, such as:

# Opening the parameters JSON file
with open("/kaggle/working/llama-3-8B/original/params.json", "r") as f:
    config = json.load(f)

# Printing the content
print(config)

 {'dim': 4096, 'n_layers': 32, 'n_heads': 32, 'n_kv_heads': 8, 'vocab_size': 128256, 'multiple_of': 1024, 'ffn_dim_multiplier': 1.3, 'norm_eps': 1e-05, 'rope_theta': 500000.0}

These values will help us replicate the Llama-3 architecture by specifying details such as the number of heads and the dimension of the embedding vector.
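A few quantities that matter later follow directly from these numbers. The sketch below copies the printed config into a literal dictionary so it runs without the file. The variable names are illustrative, and the rule used to derive the feed-forward hidden size (two thirds of 4 x dim, scaled by ffn_dim_multiplier, then rounded up to a multiple of multiple_of) is an assumption based on how Meta's reference Llama code computes it; it is not stated in params.json itself.

```python
# Values copied from the params.json printout above
config = {"dim": 4096, "n_layers": 32, "n_heads": 32, "n_kv_heads": 8,
          "vocab_size": 128256, "multiple_of": 1024, "ffn_dim_multiplier": 1.3,
          "norm_eps": 1e-05, "rope_theta": 500000.0}

dim = config["dim"]
head_dim = dim // config["n_heads"]                                 # size of one attention head
queries_per_kv_head = config["n_heads"] // config["n_kv_heads"]     # grouped-query attention group size
kv_dim = config["n_kv_heads"] * head_dim                            # rows of the key/value weight matrices

# Assumed rule for the feed-forward hidden size (see the note above)
multiple_of = config["multiple_of"]
hidden_dim = int(2 * (4 * dim) / 3)
hidden_dim = int(config["ffn_dim_multiplier"] * hidden_dim)
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

# Parameter count: attention (wq, wo, wk, wv), feed-forward (w1, w2, w3), two norms per layer
attention_params = 2 * dim * dim + 2 * kv_dim * dim
ffn_params = 3 * dim * hidden_dim
norm_params = 2 * dim
per_layer = attention_params + ffn_params + norm_params
# Plus token embeddings, the output projection, and the final norm
total_params = config["n_layers"] * per_layer + 2 * config["vocab_size"] * dim + dim

size_gb = total_params * 2 / 1e9   # 2 bytes per parameter in bfloat16

print(head_dim, queries_per_kv_head, kv_dim, hidden_dim)
print(total_params, round(size_gb, 2))
```

Under these assumptions the total comes to roughly 8 billion parameters and about 16 GB at two bytes per parameter, which is consistent with the 16.1G size of consolidated.00.pth shown during the download.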

Let's store these values so we can use them later.

# Dimension
dim = config["dim"]

# Layers
n_layers = config["n_layers"]

# Heads
n_heads = config["n_heads"]

# KV_heads
n_kv_heads = config["n_kv_heads"]

# Vocabulary
vocab_size = config["vocab_size"]

# Multiple
multiple_of = config["multiple_of"]

# Multiplier
ffn_dim_multiplier = config["ffn_dim_multiplier"]

# Epsilon
norm_eps = config["norm_eps"]

# RoPE
rope_theta = torch.tensor(config["rope_theta"])

Now that we have the tokenizer model, the architecture model containing the weights, and the configuration parameters, let's start coding our own Llama-3 from scratch.

Tokenizing Our Input Data

The very first thing we need to do is convert our input text into tokens. To do this, we first have to create some special tokens, which provide structured markers within the tokenized text so that the tokenizer can recognize and handle specific conditions or instructions.

special_tokens = [
    "<|begin_of_text|>",  # Marks the beginning of a text sequence.
    "<|end_of_text|>",  # Marks the end of a text sequence.
    "<|reserved_special_token_0|>",  # Reserved for future use.
    "<|reserved_special_token_1|>",  # Reserved for future use.
    "<|reserved_special_token_2|>",  # Reserved for future use.
    "<|reserved_special_token_3|>",  # Reserved for future use.
    "<|start_header_id|>",  # Indicates the start of a header ID.
    "<|end_header_id|>",  # Indicates the end of a header ID.
    "<|reserved_special_token_4|>",  # Reserved for future use.
    "<|eot_id|>",  # Marks the end of a turn (in a conversational context).
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]  # A large set of tokens reserved for future use.

Next, we define the rules for splitting text into tokens by specifying different patterns that match various types of substrings in the input text. Here is how we can do that.

# Patterns based on which text will be broken into tokens
tokenize_breaker = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

It can extract words, contractions, numbers (up to three digits), and sequences of non-whitespace characters from the input text, and you can customize it based on your requirements. We need to code a simple tokenizer using TikToken BPE, which takes three inputs: tokenizer_model, tokenize_breaker, and special_tokens. It will encode and decode our input text accordingly.

# Initialize tokenizer with specified parameters
tokenizer = tiktoken.Encoding(

    # Name of the encoding; the path to the tokenizer.model file is used as the label here
    name="/kaggle/working/llama-3-8B/original/tokenizer.model",

    # Define tokenization pattern string
    pat_str=tokenize_breaker,

    # Assign BPE mergeable ranks from tokenizer_model of LLaMA-3
    mergeable_ranks=tokenizer_model,

    # Set special tokens with indices
    special_tokens={token: len(tokenizer_model) + i for i, token in enumerate(special_tokens)},
)

# Encode "hello world!" and decode tokens to string
tokenizer.decode(tokenizer.encode("hello world!"))

 'hello world!'

To verify that our encoder methods work correctly, we pass "hello world!" into them. First, the text is encoded, transforming it into numerical values. Then it is decoded back to text, giving "hello world!". This confirms that the tokenizer is working correctly. Let's tokenize our input.

# input prompt
prompt = "the answer to the ultimate question of life, the universe, and everything is "

# Encode the prompt using the tokenizer and prepend a special token (128000)
tokens = [128000] + tokenizer.encode(prompt)

print(tokens)  # Print the encoded tokens

# Convert the list of tokens into a PyTorch tensor
tokens = torch.tensor(tokens)

# Decode each token back into its corresponding string
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]

print(prompt_split_as_tokens)  # Print the decoded tokens

 [128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']

We encoded our input text, "the answer to the ultimate question of life, the universe, and everything is ", starting with a special token.
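The id 128000 prepended to the prompt is not arbitrary: special tokens are numbered right after the 128000 base BPE tokens, in list order, so `<|begin_of_text|>` comes first. The sketch below rebuilds that mapping in plain Python without loading the tokenizer file. The name `base_vocab_size` is introduced here for illustration and stands in for `len(tokenizer_model)`.

```python
# Stand-in for len(tokenizer_model), as printed when the tokenizer file was loaded
base_vocab_size = 128000

special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]

# Each special token gets the next free id after the base vocabulary
special_token_ids = {token: base_vocab_size + i for i, token in enumerate(special_tokens)}

print(len(special_tokens))                       # number of special tokens
print(special_token_ids["<|begin_of_text|>"])    # id used to start the prompt
print(base_vocab_size + len(special_tokens))     # total vocabulary size
```

The total matches the `vocab_size` of 128256 in params.json: 128000 base tokens plus 256 special tokens.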

Creating Embeddings for Each Token

If we check the length of our input vector, it is:

# Checking the embedding dimension of the llama-3 architecture and the length of the input vector
print(dim, len(tokens))

 4096 17

Our input vector, which currently has dimension (17x1), needs to be transformed into an embedding for each tokenized word. This means our (17x1) tokens will become (17x4096), where each token has a corresponding embedding of length 4096.

# Define embedding layer with vocab size and embedding dimension
embedding_layer = torch.nn.Embedding(vocab_size, dim)

# Copy pre-trained token embeddings to the embedding layer
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])

# Get token embeddings for given tokens, converting to torch.bfloat16 format
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)

# Print shape of resulting token embeddings
token_embeddings_unnormalized.shape

 torch.Size([17, 4096])
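Conceptually, the embedding layer is a lookup table with one row per vocabulary entry: token id t selects row t. The toy sketch below shows the same operation with plain Python lists and made-up sizes (a vocabulary of 10 and an embedding dimension of 4 instead of 128256 and 4096). All names and values here are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

toy_vocab_size, toy_dim = 10, 4

# One row of `toy_dim` numbers per token id
embedding_table = [[random.uniform(-1.0, 1.0) for _ in range(toy_dim)]
                   for _ in range(toy_vocab_size)]

toy_tokens = [7, 2, 2, 9]

# "Embedding" the tokens is just picking the matching rows
toy_embeddings = [embedding_table[token_id] for token_id in toy_tokens]

# Shape is (number of tokens, embedding dimension)
print(len(toy_embeddings), len(toy_embeddings[0]))
```

Repeated token ids map to identical rows, which is why position information has to be added separately later on.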

These embeddings are not normalized, and that would have a serious effect if we left them as they are. In the next section, we will normalize our input vectors.

Normalization Using RMSNorm

To make sure our inputs are normalized, we will normalize the input vectors using the RMSNorm formula.

# Calculating RMSNorm
def rms_norm(tensor, norm_weights):

    # Mean of the squared tensor values along the last dimension
    squared_mean = tensor.pow(2).mean(-1, keepdim=True)

    # Reciprocal of the root mean square; norm_eps avoids division by zero
    inv_rms = torch.rsqrt(squared_mean + norm_eps)

    # Scale the tensor by 1/RMS, then by the provided normalization weights
    return (tensor * inv_rms) * norm_weights

We will use the attention norm weights from layer 0 (layers.0.attention_norm.weight) to normalize our unnormalized embeddings. We use layer 0 because we are now creating the first layer of our LLaMA-3 transformer architecture.

# Using RMS normalization and the provided normalization weights
token_embeddings = rms_norm(token_embeddings_unnormalized,
                            model["layers.0.attention_norm.weight"])

# Print the shape of the resulting token embeddings
token_embeddings.shape

 torch.Size([17, 4096])

As you may already expect, the dimension doesn't change, because we are only normalizing the vectors and nothing else.

Attention Heads (Query, Key, Values)

First, let's load the query, key, value, and output weight matrices from the model.

# Print the shapes of different weights
print(
    # Query weight shape
    model["layers.0.attention.wq.weight"].shape,

    # Key weight shape
    model["layers.0.attention.wk.weight"].shape,

    # Value weight shape
    model["layers.0.attention.wv.weight"].shape,

    # Output weight shape
    model["layers.0.attention.wo.weight"].shape
)

 torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])

The dimensions indicate that the model weights we downloaded are not stored per head. Instead, each matrix holds the weights of multiple attention heads stacked together, which comes from the parallel approach used in the implementation and training. However, we can unwrap these matrices so that they are available for a single head at a time.

# Retrieve query weight for the first layer of attention
q_layer0 = model["layers.0.attention.wq.weight"]

# Calculate dimension per head
head_dim = q_layer0.shape[0] // n_heads

# Reshape query weight to separate heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)

# Print the shape of the reshaped query weight tensor
q_layer0.shape

 torch.Size([32, 128, 4096])

Here, 32 is the number of attention heads in Llama-3, 128 is the size of the query vector, and 4096 is the size of the token embedding. We can access the query weight matrix of the first head of the first layer using:

# Extract the query weight for the first head of the first layer of attention
q_layer0_head0 = q_layer0[0]

# Print the shape of the extracted query weight tensor for the first head
q_layer0_head0.shape

 torch.Size([128, 4096])
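What the reshape does is easiest to see at toy scale: the rows of the stacked weight matrix are cut into consecutive blocks, one block of head_dim rows per head. The sketch below uses plain Python lists with made-up sizes (2 heads of 3 rows each, embedding size 4) in place of 32 heads of 128 rows and size 4096. The names and values are illustrative assumptions.

```python
toy_n_heads, toy_head_dim, toy_dim = 2, 3, 4

# Stacked weight matrix with (n_heads * head_dim) rows; entry = 10 * row + column
toy_wq = [[10 * row + col for col in range(toy_dim)]
          for row in range(toy_n_heads * toy_head_dim)]

# Equivalent of .view(n_heads, head_dim, dim): consecutive row blocks, one per head
toy_heads = [toy_wq[h * toy_head_dim:(h + 1) * toy_head_dim]
             for h in range(toy_n_heads)]

# Shape is (n_heads, head_dim, dim)
print(len(toy_heads), len(toy_heads[0]), len(toy_heads[0][0]))

# The second head starts at row index 3 of the stacked matrix
print(toy_heads[1][0])
```

Indexing `toy_heads[0]` then plays the role of `q_layer0[0]`: it returns the weight block of the first head only.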

To find the query vector for each token, we multiply the token embeddings by the transposed query weights of that head:

# Matrix multiplication: token embeddings with transpose of query weight for first head
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)

# Shape of resulting tensor: queries per token
q_per_token.shape
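With the shapes printed earlier, a (17, 4096) embedding matrix times the transpose of a (128, 4096) weight matrix gives one 128-dimensional query vector per token, so the result has shape (17, 128). The same product at toy scale, in plain Python with made-up values (3 tokens, embedding size 4, head size 2), looks like this; the helper name and numbers are assumptions for illustration.

```python
def matmul_transposed(x, w):
    """Compute x @ w.T for matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w]
            for x_row in x]

# 3 tokens, each with an embedding of size 4
toy_embeddings = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]

# Query weights of one head: head size 2, embedding size 4
toy_wq_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]

toy_q_per_token = matmul_transposed(toy_embeddings, toy_wq_head)

# One query vector of size 2 per token: shape (3, 2)
print(toy_q_per_token)  # [[1.0, 2.0], [0.0, 2.0], [1.0, 1.0]]
```

Each row of the result is the dot product of one token embedding with every row of the head's weight matrix, which is exactly what the transpose arranges.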