llama3_8b_finetuning تنزيل - llama3_8b_finetuning تنزيل كود المصدر

llama3_8b_finetuning

كود الذكاء الاصطناعي

1.0.0

تنزيل

التمهيدي

هذا مشروع شامل يتضمن استيعاب البيانات، وإنشاء أزواج التعليمات/الإجابات، والضبط الدقيق، وتقييم النتائج.

الخطوة الأولى - تجريف البيانات

ابدأ بتثبيت التبعيات باستخدام:

pip install -r requirements.txt

للعثور على بيانات الضبط الدقيق، تم استخراج Arxiv من أجل أوراق LLM المنشورة بعد تاريخ إصدار Llama 3.

يمكن العثور على كود استخراج السيلينيوم في llama3_8b_finetuning/arxiv_scraping/Arxiv_pdfs_download.py (يجب تنزيل برنامج تشغيل الويب قبل تنفيذ هذا البرنامج النصي).

يأخذ رمز الكشط الأوراق الموجودة في صفحة Arxiv الأولى ويقوم بتنزيلها في المجلد llama3_8b_finetuning/data/pdfs .

الخطوة الثانية - التعليمات/إنشاء أزواج الإجابات

يمكن العثور على الكود الخاص بهذه الخطوة على /llama3_8b_finetuning/creating_instruction_dataset.py.

تم تحليل محتوى النص من الأوراق التي تم تنزيلها باستخدام PyPDFLoader من Langchain. وبعد ذلك، تم إرسال النص إلى نموذج Llama 3 70B عبر Grok. تم اختيار Grok بسبب سرعته وتكلفته المنخفضة. تجدر الإشارة إلى أن ترخيص مستخدم Llama 3 يسمح فقط باستخدامه لتدريب/ضبط Llama LLMs. ولذلك، لن نكون قادرين على استخدام Llama 3 لإنشاء أزواج تعليمات/إجابات لنماذج أخرى، حتى تلك مفتوحة المصدر، أو للاستخدام غير التجاري.

توجد المطالبة بإنشاء الأزواج في ملف utils ويمكن رؤيتها أيضًا أدناه:

'''

 You are a highly intelligent and knowledgeable assistant tasked with generating triples of instruction, input, and output from academic papers related to Large Language Models (LLMs). Each triple should consist of:

Instruction: A clear and concise task description that can be performed by an LLM.
Input: A sample input that corresponds to the instruction.
Output: The expected result or answer when the LLM processes the input according to the instruction.
Below are some example triples:

Example 1:

Instruction: Summarize the following abstract.
Input: "In this paper, we present a new approach to training large language models by incorporating a multi-task learning framework. Our method improves the performance on a variety of downstream tasks."
Output: "A new multi-task learning framework improves the performance of large language models on various tasks."
Example 2:

Instruction: Provide a brief explanation of the benefits of using multi-task learning for large language models.
Input: "Multi-task learning allows a model to learn from multiple related tasks simultaneously, which can lead to better generalization and performance improvements across all tasks. This approach leverages shared representations and can reduce overfitting."
Output: "Multi-task learning helps large language models generalize better and improve performance by learning from multiple related tasks simultaneously."
Now, generate similar triples based on the provided text from academic papers related to LLMs:

Source Text
(Provide the text from the academic papers here)

Generated Triples
Triple 1:

Instruction:
Input:
Output:
Triple 2:

Instruction:
Input:
Output:
Triple 3:

Instruction:
Input:
Output:

'''

وأخيرًا، يتم حفظ التعليمات في llama3_8b_finetuning/data/arxiv_instruction_dataset.json .

الخطوة الثالثة - الضبط الدقيق

يمكن العثور على رمز هذه الخطوة في /llama3_8b_finetuning/model_trainer.py

أولاً، نقوم بتحميل أزواج التعليمات/الإجابات، وتقسيمها إلى مجموعات بيانات اختبار وتدريب، و
تنسيقها في الهيكل الصحيح.

 class DatasetHandler :
    def __init__ ( self , data_path ):
        self . data_path = data_path

    def load_and_split_dataset ( self ):
        dataset = load_dataset ( "json" , data_files = self . data_path )
        train_test_split = dataset [ 'train' ]. train_test_split ( test_size = 0.2 )
        dataset_dict = DatasetDict ({
            'train' : train_test_split [ 'train' ],
            'test' : train_test_split [ 'test' ]
        })
        return dataset_dict [ 'train' ], dataset_dict [ 'test' ]

    @ staticmethod
    def format_instruction ( sample ):
        return f"""
        Below is an instruction that describes a task, paired with an input that provides further context. 
        Write a response that appropriately completes the request.

        ### Instruction:
        { sample [ 'Instruction' ] }

        ### Input:
        { sample [ 'Input' ] }

        ### Response:
        { sample [ 'Output' ] }
        """

بعد ذلك، نحدد الفئة التي تقوم بتحميل النموذج والرمز المميز من Hugging Face.

 class ModelManager :
    def __init__ ( self , model_id , use_flash_attention2 , hf_token ):
        self . model_id = model_id
        self . use_flash_attention2 = use_flash_attention2
        self . hf_token = hf_token
        self . bnb_config = BitsAndBytesConfig (
            load_in_4bit = True ,
            bnb_4bit_use_double_quant = True ,
            bnb_4bit_quant_type = "nf4" ,
            bnb_4bit_compute_dtype = torch . bfloat16 if use_flash_attention2 else torch . float16
        )
    
    def load_model_and_tokenizer ( self ):
        model = AutoModelForCausalLM . from_pretrained (
            self . model_id , 
            quantization_config = self . bnb_config , 
            use_cache = False , 
            device_map = "auto" ,
            token = self . hf_token ,  
            attn_implementation = "flash_attention_2" if self . use_flash_attention2 else "sdpa"
        )
        model . config . pretraining_tp = 1

        tokenizer = AutoTokenizer . from_pretrained (
            self . model_id ,
            token = self . hf_token
        )
        tokenizer . pad_token = tokenizer . eos_token
        tokenizer . padding_side = "right"
        
        return model , tokenizer

نحدد فئة Trainer وتكوين التدريب:

 class Trainer :
    def __init__ ( self , model , tokenizer , train_dataset , peft_config , use_flash_attention2 , output_dir ):
        self . model = model
        self . tokenizer = tokenizer
        self . train_dataset = train_dataset
        self . peft_config = peft_config
        self . args = TrainingArguments (
            output_dir = output_dir ,
            num_train_epochs = 3 ,
            per_device_train_batch_size = 4 ,
            gradient_accumulation_steps = 4 ,
            gradient_checkpointing = True ,
            optim = "paged_adamw_8bit" ,
            logging_steps = 10 ,
            save_strategy = "epoch" ,
            learning_rate = 2e-4 ,
            bf16 = use_flash_attention2 ,
            fp16 = not use_flash_attention2 ,
            tf32 = use_flash_attention2 ,
            max_grad_norm = 0.3 ,
            warmup_steps = 5 ,
            lr_scheduler_type = "linear" ,
            disable_tqdm = False ,
            report_to = "none"
        )
        self . model = get_peft_model ( self . model , self . peft_config )

    def train_model ( self , format_instruction_func ):
        trainer = SFTTrainer (
            model = self . model ,
            train_dataset = self . train_dataset ,
            peft_config = self . peft_config ,
            max_seq_length = 2048 ,
            tokenizer = self . tokenizer ,
            packing = True ,
            formatting_func = format_instruction_func , 
            args = self . args ,
        )
        trainer . train ()
        return trainer

وأخيرًا، يتم إنشاء مثيل للفصول الدراسية ويبدأ التدريب.

لاحظ أن نماذج Llama مسورة، مما يعني أن Hugging Face يتطلب رمزًا مميزًا يتم تقديمه بعد قبول شروط الاستخدام وموافقة Meta على الوصول (وهو أمر فوري تقريبًا).

 dataset_handler = DatasetHandler ( data_path = utils . Variables . INSTRUCTION_DATASET_JSON_PATH )
train_dataset , test_dataset = dataset_handler . load_and_split_dataset ()

new_test_dataset = []
for dict_ in test_dataset :
    dict_ [ 'Output' ] = ''
    new_test_dataset . append ( dict_ )

model_manager = ModelManager (
    model_id = "meta-llama/Meta-Llama-3-8B" ,
    use_flash_attention2 = True ,
    hf_token = os . environ [ "HF_TOKEN" ]
)
model , tokenizer = model_manager . load_model_and_tokenizer ()
model_manager . save_model_and_tokenizer ( model , tokenizer , save_directory = utils . Variables . BASE_MODEL_PATH )
model = model_manager . prepare_for_training ( model )

peft_config = LoraConfig (
    lora_alpha = 16 ,
    lora_dropout = 0.1 ,
    r = 64 ,
    bias = "none" ,
    task_type = "CAUSAL_LM" ,
    target_modules = [
        "q_proj" , "k_proj" , "v_proj" , "o_proj" , "gate_proj" , "up_proj" , "down_proj" ,
    ]
)

trainer = Trainer (
    model = model ,
    tokenizer = tokenizer ,
    train_dataset = train_dataset ,
    peft_config = peft_config ,
    use_flash_attention2 = True ,
    output_dir = utils . Variables . FINE_TUNED_MODEL_PATH
)
trained_model = trainer . train_model ( format_instruction_func = dataset_handler . format_instruction )
trained_model . save_model ()

الخطوة 4 - التقييم

لتقييم نتائج الضبط الدقيق، استخدمنا درجة التقييم الموجه نحو الاستدعاء (ROUGE)، والتي تقارن التداخل بين مجموعتين من النص لقياس التشابه بينهما.

على وجه التحديد، استخدمنا مكتبة rouge_scorer لحساب ROUGE-1 وROUGE-2، اللذين يقيسان تداخل 1 جرام و2 جرام بين النصوص.

 import pandas as pd
from rouge_score import rouge_scorer

def calculate_rouge_scores ( generated_answers , ground_truth ):
    scorer = rouge_scorer . RougeScorer ([ 'rouge1' , 'rouge2' , 'rougeL' ], use_stemmer = True )
    total_rouge1 , total_rouge2 , total_rougeL = 0 , 0 , 0
    for gen , ref in zip ( generated_answers , ground_truth ):
        scores = scorer . score ( gen , ref )
        total_rouge1 += scores [ 'rouge1' ]. fmeasure
        total_rouge2 += scores [ 'rouge2' ]. fmeasure
        total_rougeL += scores [ 'rougeL' ]. fmeasure
    average_rouge1 = total_rouge1 / len ( generated_answers )
    average_rouge2 = total_rouge2 / len ( generated_answers )
    average_rougeL = total_rougeL / len ( generated_answers )
    return { 'average_rouge1' : average_rouge1 ,
            'average_rouge2' : average_rouge2 ,
            'average_rougeL' : average_rougeL }

لإجراء هذا الحساب، نأخذ التعليمات من مجموعة بيانات الاختبار، ونمررها إلى كل من النموذج الأساسي والنموذج المضبوط بدقة، ونقارن المخرجات بالنتائج المتوقعة من مجموعة بيانات التعليمات/الإجابة.

يمكن العثور على رمز التقييم في /llama3_8b_finetuning/model_evaluation.py.

 class ModelHandler :

    def __init__ ( self ):
        pass

    def loading_model ( self , model_chosen = 'fine_tuned_model' ):

        if model_chosen == 'fine_tuned_model' :
            model_dir = utils . Variables . FINE_TUNED_MODEL_PATH
            self . model = AutoPeftModelForCausalLM . from_pretrained (
                model_dir ,
                low_cpu_mem_usage = True ,
                torch_dtype = torch . float16 ,
                load_in_4bit = True ,
                )

        elif model_chosen == 'base_model' :
            model_dir = utils . Variables . BASE_MODEL_PATH
            self . model = AutoModelForCausalLM . from_pretrained (
                model_dir ,
                low_cpu_mem_usage = True ,
                torch_dtype = torch . float16 ,
                load_in_4bit = True ,
                )

        self . tokenizer = AutoTokenizer . from_pretrained ( model_dir )

    def ask_question ( self , instruction , temperature = 0.5 , max_new_tokens = 1000 ):

        prompt = format_instruction ( instruction )

        input_ids = self . tokenizer ( prompt , return_tensors = "pt" , truncation = True ). input_ids . cuda ()

        start_time = time . time ()
        with torch . inference_mode ():
            outputs = self . model . generate ( input_ids = input_ids , pad_token_id = self . tokenizer . eos_token_id , max_new_tokens = max_new_tokens , do_sample = True , top_p = 0.5 , temperature = temperature )
        end_time = time . time ()

        total_time = end_time - start_time
        output_length = len ( outputs [ 0 ]) - len ( input_ids [ 0 ])

        self . output = self . tokenizer . batch_decode ( outputs . detach (). cpu (). numpy (), skip_special_tokens = True )[ 0 ]

        return self . output

نتائج التقييم

نتائج ROUGE هي كما يلي:

نموذج مضبوط بدقة:

{'average_rouge1': 0.39997816307812206, 'average_rouge2': 0.2213826792342886, 'average_rougeL': 0.33508922374837047}

النموذج الأساسي:

{'average_rouge1': 0.2524191394349585, 'average_rouge2': 0.13402054342344535, 'average_rougeL': 0.2115590931984475}

ولذلك، يمكن ملاحظة أن أداء النموذج المضبوط في مجموعة بيانات الاختبار يتفوق بشكل كبير على أداء النموذج الأساسي.

البدائل

لقد استغرق الأمر بعض الوقت لكتابة هذا الرمز وتشغيله. لقد كانت ممارسة جيدة، ولكن بالنسبة للمهام اليومية المتعلقة بالضبط الدقيق، ما عليك سوى استخدام Hugging Face AutoTrain المستضاف محليًا (https://github.com/huggingface/autotrain-advanced).