llama3_8b_finetuning ดาวน์โหลด - llama3_8b_finetuning ดาวน์โหลดซอร์สโค้ด

llama3_8b_finetuning

โค้ดแหล่งที่มา AI

1.0.0

ดาวน์โหลด

อ่านฉัน

นี่เป็นโครงการแบบ end-to-end ซึ่งรวมถึงการนำเข้าข้อมูล การสร้างคู่คำสั่ง/คำตอบ การปรับแต่ง และการประเมินผลลัพธ์

ขั้นตอนแรก - การขูดข้อมูล

เริ่มต้นด้วยการติดตั้งการพึ่งพาด้วย:

pip install -r requirements.txt

เพื่อหาข้อมูลสำหรับการปรับแต่งอย่างละเอียด Arxiv จึงถูกคัดลอกมาจากเอกสาร LLM ที่เผยแพร่หลังจากวันวางจำหน่าย Llama 3

โค้ดขูดซีลีเนียมสามารถพบได้ใน llama3_8b_finetuning/arxiv_scraping/Arxiv_pdfs_download.py (ต้องดาวน์โหลด webdriver ก่อนที่จะเรียกใช้สคริปต์นี้)

รหัสขูดจะนำเอกสารในหน้าแรกของ Arxiv และดาวน์โหลดลงในโฟลเดอร์ llama3_8b_finetuning/data/pdfs

ขั้นตอนที่สอง - การสร้างคู่คำแนะนำ/คำตอบ

โค้ดสำหรับขั้นตอนนี้อยู่ที่ /llama3_8b_finetuning/creating_instruction_dataset.py

เนื้อหาข้อความจากเอกสารที่ดาวน์โหลดถูกแยกวิเคราะห์โดยใช้ PyPDFLoader ของ Langchain จากนั้นข้อความจะถูกส่งไปยังรุ่น Llama 3 70B ผ่านทาง Grok Grok ถูกเลือกเนื่องจากความเร็วและต้นทุนต่ำ จะต้องสังเกตว่าใบอนุญาตผู้ใช้ Llama 3 อนุญาตให้ใช้สำหรับการฝึกอบรม/การปรับแต่ง Llama LLM เท่านั้น ดังนั้นเราจึงไม่สามารถใช้ Llama 3 เพื่อสร้างคู่คำสั่ง/คำตอบสำหรับรุ่นอื่นๆ แม้แต่รุ่นโอเพ่นซอร์ส หรือสำหรับการใช้งานที่ไม่ใช่เชิงพาณิชย์

ข้อความแจ้งสำหรับการสร้างคู่อยู่ในไฟล์ utils และสามารถดูได้ที่ด้านล่าง:

-

 You are a highly intelligent and knowledgeable assistant tasked with generating triples of instruction, input, and output from academic papers related to Large Language Models (LLMs). Each triple should consist of:

Instruction: A clear and concise task description that can be performed by an LLM.
Input: A sample input that corresponds to the instruction.
Output: The expected result or answer when the LLM processes the input according to the instruction.
Below are some example triples:

Example 1:

Instruction: Summarize the following abstract.
Input: "In this paper, we present a new approach to training large language models by incorporating a multi-task learning framework. Our method improves the performance on a variety of downstream tasks."
Output: "A new multi-task learning framework improves the performance of large language models on various tasks."
Example 2:

Instruction: Provide a brief explanation of the benefits of using multi-task learning for large language models.
Input: "Multi-task learning allows a model to learn from multiple related tasks simultaneously, which can lead to better generalization and performance improvements across all tasks. This approach leverages shared representations and can reduce overfitting."
Output: "Multi-task learning helps large language models generalize better and improve performance by learning from multiple related tasks simultaneously."
Now, generate similar triples based on the provided text from academic papers related to LLMs:

Source Text
(Provide the text from the academic papers here)

Generated Triples
Triple 1:

Instruction:
Input:
Output:
Triple 2:

Instruction:
Input:
Output:
Triple 3:

Instruction:
Input:
Output:

-

สุดท้าย คำแนะนำจะถูกบันทึกไว้ใน llama3_8b_finetuning/data/arxiv_instruction_dataset.json

ขั้นตอนที่สาม - การปรับแต่งอย่างละเอียด

โค้ดสำหรับขั้นตอนนี้มีอยู่ใน /llama3_8b_finetuning/model_trainer.py

ขั้นแรก เราจะโหลดคู่คำสั่ง/คำตอบ แยกออกเป็นชุดข้อมูลทดสอบและฝึก และ
จัดรูปแบบให้เป็นโครงสร้างที่ถูกต้อง

 class DatasetHandler :
    def __init__ ( self , data_path ):
        self . data_path = data_path

    def load_and_split_dataset ( self ):
        dataset = load_dataset ( "json" , data_files = self . data_path )
        train_test_split = dataset [ 'train' ]. train_test_split ( test_size = 0.2 )
        dataset_dict = DatasetDict ({
            'train' : train_test_split [ 'train' ],
            'test' : train_test_split [ 'test' ]
        })
        return dataset_dict [ 'train' ], dataset_dict [ 'test' ]

    @ staticmethod
    def format_instruction ( sample ):
        return f"""
        Below is an instruction that describes a task, paired with an input that provides further context. 
        Write a response that appropriately completes the request.

        ### Instruction:
        { sample [ 'Instruction' ] }

        ### Input:
        { sample [ 'Input' ] }

        ### Response:
        { sample [ 'Output' ] }
        """

จากนั้น เรากำหนดคลาสที่โหลดโมเดลและโทเค็นไนเซอร์จาก Hugging Face

 class ModelManager :
    def __init__ ( self , model_id , use_flash_attention2 , hf_token ):
        self . model_id = model_id
        self . use_flash_attention2 = use_flash_attention2
        self . hf_token = hf_token
        self . bnb_config = BitsAndBytesConfig (
            load_in_4bit = True ,
            bnb_4bit_use_double_quant = True ,
            bnb_4bit_quant_type = "nf4" ,
            bnb_4bit_compute_dtype = torch . bfloat16 if use_flash_attention2 else torch . float16
        )
    
    def load_model_and_tokenizer ( self ):
        model = AutoModelForCausalLM . from_pretrained (
            self . model_id , 
            quantization_config = self . bnb_config , 
            use_cache = False , 
            device_map = "auto" ,
            token = self . hf_token ,  
            attn_implementation = "flash_attention_2" if self . use_flash_attention2 else "sdpa"
        )
        model . config . pretraining_tp = 1

        tokenizer = AutoTokenizer . from_pretrained (
            self . model_id ,
            token = self . hf_token
        )
        tokenizer . pad_token = tokenizer . eos_token
        tokenizer . padding_side = "right"
        
        return model , tokenizer

เรากำหนดคลาส Trainer และการกำหนดค่าการฝึก:

 class Trainer :
    def __init__ ( self , model , tokenizer , train_dataset , peft_config , use_flash_attention2 , output_dir ):
        self . model = model
        self . tokenizer = tokenizer
        self . train_dataset = train_dataset
        self . peft_config = peft_config
        self . args = TrainingArguments (
            output_dir = output_dir ,
            num_train_epochs = 3 ,
            per_device_train_batch_size = 4 ,
            gradient_accumulation_steps = 4 ,
            gradient_checkpointing = True ,
            optim = "paged_adamw_8bit" ,
            logging_steps = 10 ,
            save_strategy = "epoch" ,
            learning_rate = 2e-4 ,
            bf16 = use_flash_attention2 ,
            fp16 = not use_flash_attention2 ,
            tf32 = use_flash_attention2 ,
            max_grad_norm = 0.3 ,
            warmup_steps = 5 ,
            lr_scheduler_type = "linear" ,
            disable_tqdm = False ,
            report_to = "none"
        )
        self . model = get_peft_model ( self . model , self . peft_config )

    def train_model ( self , format_instruction_func ):
        trainer = SFTTrainer (
            model = self . model ,
            train_dataset = self . train_dataset ,
            peft_config = self . peft_config ,
            max_seq_length = 2048 ,
            tokenizer = self . tokenizer ,
            packing = True ,
            formatting_func = format_instruction_func , 
            args = self . args ,
        )
        trainer . train ()
        return trainer

ในที่สุด ชั้นเรียนก็เริ่มต้นขึ้นและเริ่มการฝึกอบรม

โปรดทราบว่าโมเดล Llama มีรั้วรอบขอบชิด ซึ่งหมายความว่า Hugging Face ต้องใช้โทเค็นที่ให้ไว้หลังจากยอมรับข้อกำหนดการใช้งานและ Meta อนุมัติการเข้าถึง (ซึ่งเกือบจะทันที)

 dataset_handler = DatasetHandler ( data_path = utils . Variables . INSTRUCTION_DATASET_JSON_PATH )
train_dataset , test_dataset = dataset_handler . load_and_split_dataset ()

new_test_dataset = []
for dict_ in test_dataset :
    dict_ [ 'Output' ] = ''
    new_test_dataset . append ( dict_ )

model_manager = ModelManager (
    model_id = "meta-llama/Meta-Llama-3-8B" ,
    use_flash_attention2 = True ,
    hf_token = os . environ [ "HF_TOKEN" ]
)
model , tokenizer = model_manager . load_model_and_tokenizer ()
model_manager . save_model_and_tokenizer ( model , tokenizer , save_directory = utils . Variables . BASE_MODEL_PATH )
model = model_manager . prepare_for_training ( model )

peft_config = LoraConfig (
    lora_alpha = 16 ,
    lora_dropout = 0.1 ,
    r = 64 ,
    bias = "none" ,
    task_type = "CAUSAL_LM" ,
    target_modules = [
        "q_proj" , "k_proj" , "v_proj" , "o_proj" , "gate_proj" , "up_proj" , "down_proj" ,
    ]
)

trainer = Trainer (
    model = model ,
    tokenizer = tokenizer ,
    train_dataset = train_dataset ,
    peft_config = peft_config ,
    use_flash_attention2 = True ,
    output_dir = utils . Variables . FINE_TUNED_MODEL_PATH
)
trained_model = trainer . train_model ( format_instruction_func = dataset_handler . format_instruction )
trained_model . save_model ()

ขั้นตอนที่ 4 - การประเมินผล

ในการประเมินผลลัพธ์การปรับแต่งอย่างละเอียด เราใช้คะแนน Recall-Oriented Understudy for Gisting Evaling (ROUGE) ซึ่งเปรียบเทียบการทับซ้อนกันระหว่างข้อความสองชุดเพื่อวัดความคล้ายคลึงกันระหว่างข้อความทั้งสองชุด

โดยเฉพาะอย่างยิ่ง เราใช้ไลบรารี rouge_scorer เพื่อคำนวณ ROUGE-1 และ ROUGE-2 ซึ่งวัดการทับซ้อนกันของ 1 กรัมและ 2 กรัมระหว่างข้อความ

 import pandas as pd
from rouge_score import rouge_scorer

def calculate_rouge_scores ( generated_answers , ground_truth ):
    scorer = rouge_scorer . RougeScorer ([ 'rouge1' , 'rouge2' , 'rougeL' ], use_stemmer = True )
    total_rouge1 , total_rouge2 , total_rougeL = 0 , 0 , 0
    for gen , ref in zip ( generated_answers , ground_truth ):
        scores = scorer . score ( gen , ref )
        total_rouge1 += scores [ 'rouge1' ]. fmeasure
        total_rouge2 += scores [ 'rouge2' ]. fmeasure
        total_rougeL += scores [ 'rougeL' ]. fmeasure
    average_rouge1 = total_rouge1 / len ( generated_answers )
    average_rouge2 = total_rouge2 / len ( generated_answers )
    average_rougeL = total_rougeL / len ( generated_answers )
    return { 'average_rouge1' : average_rouge1 ,
            'average_rouge2' : average_rouge2 ,
            'average_rougeL' : average_rougeL }

ในการคำนวณนี้ เราใช้คำสั่งจากชุดข้อมูลทดสอบ ส่งผ่านไปยังทั้งโมเดลพื้นฐานและโมเดลที่ได้รับการปรับแต่ง และเปรียบเทียบผลลัพธ์กับผลลัพธ์ที่คาดหวังจากชุดข้อมูลคำสั่ง/คำตอบ

รหัสสำหรับการประเมินสามารถพบได้ใน /llama3_8b_finetuning/model_evalue.py

 class ModelHandler :

    def __init__ ( self ):
        pass

    def loading_model ( self , model_chosen = 'fine_tuned_model' ):

        if model_chosen == 'fine_tuned_model' :
            model_dir = utils . Variables . FINE_TUNED_MODEL_PATH
            self . model = AutoPeftModelForCausalLM . from_pretrained (
                model_dir ,
                low_cpu_mem_usage = True ,
                torch_dtype = torch . float16 ,
                load_in_4bit = True ,
                )

        elif model_chosen == 'base_model' :
            model_dir = utils . Variables . BASE_MODEL_PATH
            self . model = AutoModelForCausalLM . from_pretrained (
                model_dir ,
                low_cpu_mem_usage = True ,
                torch_dtype = torch . float16 ,
                load_in_4bit = True ,
                )

        self . tokenizer = AutoTokenizer . from_pretrained ( model_dir )

    def ask_question ( self , instruction , temperature = 0.5 , max_new_tokens = 1000 ):

        prompt = format_instruction ( instruction )

        input_ids = self . tokenizer ( prompt , return_tensors = "pt" , truncation = True ). input_ids . cuda ()

        start_time = time . time ()
        with torch . inference_mode ():
            outputs = self . model . generate ( input_ids = input_ids , pad_token_id = self . tokenizer . eos_token_id , max_new_tokens = max_new_tokens , do_sample = True , top_p = 0.5 , temperature = temperature )
        end_time = time . time ()

        total_time = end_time - start_time
        output_length = len ( outputs [ 0 ]) - len ( input_ids [ 0 ])

        self . output = self . tokenizer . batch_decode ( outputs . detach (). cpu (). numpy (), skip_special_tokens = True )[ 0 ]

        return self . output

ผลการประเมิน

คะแนน ROUGE มีดังนี้:

รุ่นที่ปรับแต่งอย่างละเอียด:

{'average_rouge1': 0.39997816307812206, 'average_rouge2': 0.2213826792342886, 'average_rougeL': 0.33508922374837047}

โมเดลพื้นฐาน:

{'average_rouge1': 0.2524191394349585, 'average_rouge2': 0.13402054342344535, 'average_rougeL': 0.2115590931984475}

ดังนั้นจะเห็นได้ว่าประสิทธิภาพของโมเดลที่ได้รับการปรับแต่งอย่างละเอียดบนชุดข้อมูลทดสอบนั้นเหนือกว่าโมเดลพื้นฐานอย่างมาก

ทางเลือก

ใช้เวลาค่อนข้างนานในการเขียนโค้ดนี้และทำให้มันใช้งานได้ เป็นแนวทางปฏิบัติที่ดี แต่สำหรับงานที่เกี่ยวข้องกับการปรับแต่งทุกวัน เพียงใช้ Hugging Face AutoTrain ที่โฮสต์ในเครื่อง (https://github.com/huggingface/autotrain-advanced)

รูปภาพ.png