This is an end-to-end project covering data ingestion, creation of the instruction/answer pairs, fine-tuning, and evaluation of the results.
First, install the dependencies:
pip install -r requirements.txt
To gather fine-tuning data, we scraped Arxiv for LLM (Large Language Model) papers published after the Llama 3 release date.
The Selenium scraping code can be found in llama3_8b_finetuning/arxiv_scraping/Arxiv_pdfs_download.py (a WebDriver must be downloaded before running this script).
The scraper picks up the papers on the first Arxiv results page and downloads them into the llama3_8b_finetuning/data/pdfs folder.
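The scraping script itself is not reproduced in this post; the sketch below only illustrates the general shape of such a Selenium scraper. The listing URL, link selector, and file naming here are illustrative assumptions, not the repository's actual code:

# Illustrative sketch of the Selenium scraping step (assumptions noted above);
# requires a Chrome WebDriver matching the installed browser.
import os
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

PDF_DIR = "llama3_8b_finetuning/data/pdfs"
os.makedirs(PDF_DIR, exist_ok=True)

driver = webdriver.Chrome()
driver.get("https://arxiv.org/list/cs.CL/recent")  # hypothetical listing page

# Grab the links labelled "pdf" on the first results page.
pdf_links = [a.get_attribute("href") for a in driver.find_elements(By.PARTIAL_LINK_TEXT, "pdf")]
driver.quit()

for url in pdf_links:
    file_name = url.rstrip("/").split("/")[-1] + ".pdf"
    with open(os.path.join(PDF_DIR, file_name), "wb") as f:
        f.write(requests.get(url).content)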
The code for the instruction dataset creation step can be found in /llama3_8b_finetuning/creating_instruction_dataset.py.
The text content of the downloaded papers is parsed with Langchain's PyPDFLoader. The text is then sent to the Llama 3 70B model through Groq, which was chosen for its speed and low cost. Note that the Llama 3 user license only allows its outputs to be used for training/fine-tuning Llama LLMs; we therefore could not use Llama 3 to create instruction/answer pairs for other models (even open-source ones) or for non-commercial use.
The prompt used to create the pairs lives in the utils file and is also shown below:
'''
You are a highly intelligent and knowledgeable assistant tasked with generating triples of instruction, input, and output from academic papers related to Large Language Models (LLMs). Each triple should consist of:
Instruction: A clear and concise task description that can be performed by an LLM.
Input: A sample input that corresponds to the instruction.
Output: The expected result or answer when the LLM processes the input according to the instruction.
Below are some example triples:
Example 1:
Instruction: Summarize the following abstract.
Input: "In this paper, we present a new approach to training large language models by incorporating a multi-task learning framework. Our method improves the performance on a variety of downstream tasks."
Output: "A new multi-task learning framework improves the performance of large language models on various tasks."
Example 2:
Instruction: Provide a brief explanation of the benefits of using multi-task learning for large language models.
Input: "Multi-task learning allows a model to learn from multiple related tasks simultaneously, which can lead to better generalization and performance improvements across all tasks. This approach leverages shared representations and can reduce overfitting."
Output: "Multi-task learning helps large language models generalize better and improve performance by learning from multiple related tasks simultaneously."
Now, generate similar triples based on the provided text from academic papers related to LLMs:
Source Text
(Provide the text from the academic papers here)
Generated Triples
Triple 1:
Instruction:
Input:
Output:
Triple 2:
Instruction:
Input:
Output:
Triple 3:
Instruction:
Input:
Output:
'''
Finally, the instruction dataset is saved to llama3_8b_finetuning/data/arxiv_instruction_dataset.json.
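The dataset-creation script is in the repository; the snippet below is only a rough sketch of that flow. It assumes the Groq Python client with a GROQ_API_KEY environment variable, Groq's llama3-70b-8192 model ID, and a CREATION_PROMPT placeholder standing in for the prompt shown above; none of these names are taken from the actual script:

# Rough sketch: parse each PDF, ask Llama 3 70B (served by Groq) for triples, save JSON.
import os, glob, json
from langchain_community.document_loaders import PyPDFLoader
from groq import Groq

CREATION_PROMPT = "..."  # the triple-generation prompt shown above
client = Groq(api_key=os.environ["GROQ_API_KEY"])
raw_generations = []

for pdf_path in glob.glob("llama3_8b_finetuning/data/pdfs/*.pdf"):
    pages = PyPDFLoader(pdf_path).load()  # one Document per page
    paper_text = "\n".join(page.page_content for page in pages)

    response = client.chat.completions.create(
        model="llama3-70b-8192",  # Groq's ID for Llama 3 70B at the time of writing
        messages=[
            {"role": "system", "content": CREATION_PROMPT},
            {"role": "user", "content": paper_text[:20000]},  # crude length guard
        ],
    )
    # Parsing the free-form answer into Instruction/Input/Output dicts is omitted here;
    # the real script handles that before writing the JSON file.
    raw_generations.append(response.choices[0].message.content)

with open("llama3_8b_finetuning/data/arxiv_instruction_dataset.json", "w") as f:
    json.dump(raw_generations, f)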
The code for the fine-tuning step can be found in /llama3_8b_finetuning/model_trainer.py.
First, we load the instruction/answer pairs, split them into train and test datasets, and format them into the required structure.
from datasets import load_dataset, DatasetDict

class DatasetHandler:
    def __init__(self, data_path):
        self.data_path = data_path

    def load_and_split_dataset(self):
        # Load the instruction/answer pairs from JSON and hold out 20% for testing
        dataset = load_dataset("json", data_files=self.data_path)
        train_test_split = dataset['train'].train_test_split(test_size=0.2)
        dataset_dict = DatasetDict({
            'train': train_test_split['train'],
            'test': train_test_split['test']
        })
        return dataset_dict['train'], dataset_dict['test']

    @staticmethod
    def format_instruction(sample):
        # Render a sample into the Alpaca-style prompt used for supervised fine-tuning
        return f"""
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
### Instruction:
{sample['Instruction']}
### Input:
{sample['Input']}
### Response:
{sample['Output']}
"""
Next, we define the class that loads the model and tokenizer from Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

class ModelManager:
    def __init__(self, model_id, use_flash_attention2, hf_token):
        self.model_id = model_id
        self.use_flash_attention2 = use_flash_attention2
        self.hf_token = hf_token
        # 4-bit (QLoRA-style) quantization so the 8B model fits in GPU memory
        self.bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16 if use_flash_attention2 else torch.float16
        )

    def load_model_and_tokenizer(self):
        model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            quantization_config=self.bnb_config,
            use_cache=False,
            device_map="auto",
            token=self.hf_token,
            attn_implementation="flash_attention_2" if self.use_flash_attention2 else "sdpa"
        )
        model.config.pretraining_tp = 1
        tokenizer = AutoTokenizer.from_pretrained(
            self.model_id,
            token=self.hf_token
        )
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"
        return model, tokenizer
We then define the Trainer class and the training configuration:
from peft import get_peft_model
from transformers import TrainingArguments
from trl import SFTTrainer

class Trainer:
    def __init__(self, model, tokenizer, train_dataset, peft_config, use_flash_attention2, output_dir):
        self.model = model
        self.tokenizer = tokenizer
        self.train_dataset = train_dataset
        self.peft_config = peft_config
        self.args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            gradient_checkpointing=True,
            optim="paged_adamw_8bit",
            logging_steps=10,
            save_strategy="epoch",
            learning_rate=2e-4,
            bf16=use_flash_attention2,
            fp16=not use_flash_attention2,
            tf32=use_flash_attention2,
            max_grad_norm=0.3,
            warmup_steps=5,
            lr_scheduler_type="linear",
            disable_tqdm=False,
            report_to="none"
        )
        # Wrap the quantized base model with LoRA adapters
        self.model = get_peft_model(self.model, self.peft_config)

    def train_model(self, format_instruction_func):
        trainer = SFTTrainer(
            model=self.model,
            train_dataset=self.train_dataset,
            peft_config=self.peft_config,
            max_seq_length=2048,
            tokenizer=self.tokenizer,
            packing=True,
            formatting_func=format_instruction_func,
            args=self.args,
        )
        trainer.train()
        return trainer
Finally, we instantiate the classes and start training.
Note that the Llama models are gated: Hugging Face requires a token, which is issued once you accept the terms of use and Meta approves access (approval is nearly instantaneous).
import os
from peft import LoraConfig
import utils

dataset_handler = DatasetHandler(data_path=utils.Variables.INSTRUCTION_DATASET_JSON_PATH)
train_dataset, test_dataset = dataset_handler.load_and_split_dataset()

# Blank out the answers in the test split so it can later be used for generation
new_test_dataset = []
for dict_ in test_dataset:
    dict_['Output'] = ''
    new_test_dataset.append(dict_)

model_manager = ModelManager(
    model_id="meta-llama/Meta-Llama-3-8B",
    use_flash_attention2=True,
    hf_token=os.environ["HF_TOKEN"]
)

model, tokenizer = model_manager.load_model_and_tokenizer()
# save_model_and_tokenizer and prepare_for_training are further ModelManager helpers
# defined in the full model_trainer.py script (not shown above)
model_manager.save_model_and_tokenizer(model, tokenizer, save_directory=utils.Variables.BASE_MODEL_PATH)
model = model_manager.prepare_for_training(model)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",
    ]
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    peft_config=peft_config,
    use_flash_attention2=True,
    output_dir=utils.Variables.FINE_TUNED_MODEL_PATH
)

trained_model = trainer.train_model(format_instruction_func=dataset_handler.format_instruction)
trained_model.save_model()
To evaluate the fine-tuning results, we use Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, which compare the overlap between two sets of text to measure their similarity.
Specifically, we use the rouge_score library to compute ROUGE-1 and ROUGE-2, which measure the 1-gram and 2-gram overlap between texts, as well as ROUGE-L, which is based on the longest common subsequence.
import pandas as pd
from rouge_score import rouge_scorer

def calculate_rouge_scores(generated_answers, ground_truth):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    total_rouge1, total_rouge2, total_rougeL = 0, 0, 0
    for gen, ref in zip(generated_answers, ground_truth):
        scores = scorer.score(gen, ref)
        total_rouge1 += scores['rouge1'].fmeasure
        total_rouge2 += scores['rouge2'].fmeasure
        total_rougeL += scores['rougeL'].fmeasure
    average_rouge1 = total_rouge1 / len(generated_answers)
    average_rouge2 = total_rouge2 / len(generated_answers)
    average_rougeL = total_rougeL / len(generated_answers)
    return {'average_rouge1': average_rouge1,
            'average_rouge2': average_rouge2,
            'average_rougeL': average_rougeL}
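As a quick sanity check of the helper, it can be called on any pair of lists of strings (the strings below are made-up examples, not model outputs):

generated = ["A new multi-task learning framework improves LLM performance on downstream tasks."]
reference = ["A new multi-task learning framework improves the performance of large language models on various tasks."]
print(calculate_rouge_scores(generated, reference))
# -> {'average_rouge1': ..., 'average_rouge2': ..., 'average_rougeL': ...}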
To run this calculation, we take the instructions from the test dataset, pass them through both the base model and the fine-tuned model, and compare the outputs with the expected answers from the instruction/answer dataset.
The evaluation code can be found in /llama3_8b_finetuning/model_evaluation.py.
import time

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer

class ModelHandler:
    def __init__(self):
        pass

    def loading_model(self, model_chosen='fine_tuned_model'):
        if model_chosen == 'fine_tuned_model':
            # Load the LoRA-adapted model saved after fine-tuning
            model_dir = utils.Variables.FINE_TUNED_MODEL_PATH
            self.model = AutoPeftModelForCausalLM.from_pretrained(
                model_dir,
                low_cpu_mem_usage=True,
                torch_dtype=torch.float16,
                load_in_4bit=True,
            )
        elif model_chosen == 'base_model':
            model_dir = utils.Variables.BASE_MODEL_PATH
            self.model = AutoModelForCausalLM.from_pretrained(
                model_dir,
                low_cpu_mem_usage=True,
                torch_dtype=torch.float16,
                load_in_4bit=True,
            )
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)

    def ask_question(self, instruction, temperature=0.5, max_new_tokens=1000):
        # format_instruction renders the same prompt template used during training
        prompt = format_instruction(instruction)
        input_ids = self.tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
        start_time = time.time()
        with torch.inference_mode():
            outputs = self.model.generate(
                input_ids=input_ids,
                pad_token_id=self.tokenizer.eos_token_id,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                top_p=0.5,
                temperature=temperature
            )
        end_time = time.time()
        total_time = end_time - start_time
        output_length = len(outputs[0]) - len(input_ids[0])
        self.output = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]
        return self.output
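The glue code that ties the evaluation together lives in model_evaluation.py; conceptually it does something like the sketch below, which assumes the new_test_dataset / test_dataset objects from the training script and the calculate_rouge_scores helper defined above (variable names here are illustrative):

# Sketch of the evaluation loop: generate answers with both models and score them.
handler = ModelHandler()
results = {}

for model_name in ['fine_tuned_model', 'base_model']:
    handler.loading_model(model_chosen=model_name)
    generated = [handler.ask_question(sample) for sample in new_test_dataset]
    ground_truth = [sample['Output'] for sample in test_dataset]
    # In practice you would strip the prompt prefix from the generated text before scoring.
    results[model_name] = calculate_rouge_scores(generated, ground_truth)

    # Free GPU memory before loading the next model
    del handler.model
    torch.cuda.empty_cache()

print(results)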
The ROUGE scores were as follows:
Fine-tuned model:
{'average_rouge1': 0.39997816307812206, 'average_rouge2': 0.2213826792342886, 'average_rougeL': 0.33508922374837047}
Base model:
{'average_rouge1': 0.2524191394349585, 'average_rouge2': 0.13402054342344535, 'average_rougeL': 0.2115590931984475}
As can be seen, the fine-tuned model performs considerably better than the base model on the test dataset.
Writing this code and getting it to run took a fair amount of time. It is good practice, but for day-to-day fine-tuning work it is easier to simply use locally hosted Hugging Face AutoTrain (https://github.com/huggingface/autotrain-advanced).