Qwen2 VL Finetune

Fine-tuning Qwen2-VL

This repository contains a script for training Qwen2-VL using only HuggingFace and Liger-Kernel.

Other projects

[Phi3-Vision Finetuning]
[Llama3.2-Vision Finetuning]
[Molmo Finetune]

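Update

[2024/11/05] Added memory-efficient 8-bit training.
[2024/09/12] The model is now trained using Liger-Kernel.
[2024/09/11] Supports setting different learning rates for the projector and the vision model.
[2024/09/11] Supports multi-image and video training.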

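Fine-tuning Qwen2-VL
- Other projects
- Update
- Table of Contents
- Supported Features
- Installation
  - Using environment.yaml
- Dataset Preparation
- Training
  - Full Finetuning
  - Finetune with LoRA
  - Train with video dataset
    - Merge LoRA Weights
    - Image resolution for performance boost
    - Issue for libcudnn error
- TODO
- Known Issues
- License
- Citation
- Acknowledgement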

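Supported Features

DeepSpeed
LoRA/QLoRA
Full finetuning
Finetuning the vision_model while using LoRA
Disabling/enabling Flash Attention 2
Multi-image and video training
Training optimized with Liger-Kernel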

Installation

Install the required packages using environment.yaml.

Using `environment.yaml`

conda env create -f environment.yaml
conda activate qwen2
pip install qwen-vl-utils
pip install flash-attn==2.5.8 --no-build-isolation

Note: Install flash-attn only after the other packages have been installed.

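Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file in which each entry contains information about the conversations and images. Make sure the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, the image tokens should all be <image>, and the image file names should be given in a list. See the examples below and follow the data format.

Example for single image dataset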

[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]
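A minimal sketch of how one such entry could be assembled in Python. The helper name `make_record` is hypothetical and not part of this repository; only the field names (`id`, `image`, `conversations`, `from`, `value`) are taken from the example above.

```python
import json


def make_record(sample_id, image_name, qa_pairs):
    """Build one LLaVA-style entry from (question, answer) pairs.

    Hypothetical helper for illustration; not part of the repository.
    """
    conversations = []
    for turn, (question, answer) in enumerate(qa_pairs):
        # Only the first human turn carries the <image> placeholder.
        prefix = "<image>\n" if turn == 0 else ""
        conversations.append({"from": "human", "value": prefix + question})
        conversations.append({"from": "gpt", "value": answer})
    return {"id": sample_id, "image": image_name, "conversations": conversations}


record = make_record(
    "000000033471",
    "000000033471.jpg",
    [("What are the colors of the bus in the image?",
      "The bus in the image is white and red.")],
)

# The dataset file is a JSON list of such entries.
dataset_json = json.dumps([record], indent=2)
```

Writing `dataset_json` to a file gives something that could be passed as --data_path, assuming the image file exists under --image_folder.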

Example for multi image dataset

[
  {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\n<image>\nIs the perspective of the camera different?"
      },
      {
        "from": "gpt",
        "value": "Yes, the perspective of the camera is different."
      }
    ]
  }
  ...
]
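Before training, it can help to sanity-check entries against this format. The sketch below assumes that the number of `<image>` placeholders in the human turns should equal the number of listed image files; that is a reading of the format described above, not something the repository states, and `check_entry` is a hypothetical helper.

```python
import os


def check_entry(entry, image_folder, exists=os.path.exists):
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    images = entry.get("image", [])
    if isinstance(images, str):  # single-image entries use a plain string
        images = [images]
    # Count <image> placeholders across all human turns.
    n_tokens = sum(
        turn["value"].count("<image>")
        for turn in entry["conversations"]
        if turn["from"] == "human"
    )
    if n_tokens != len(images):
        problems.append(f"{n_tokens} <image> tokens but {len(images)} image files")
    # Every listed file should be reachable under the image folder.
    for name in images:
        if not exists(os.path.join(image_folder, name)):
            problems.append(f"missing file: {name}")
    return problems
```

The `exists` parameter is only there so the check can be exercised without touching the file system.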

Example for video dataset

[
  {
    "id": "sample1",
    "video": "sample1.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nWhat is going on in this video?"
      },
      {
        "from": "gpt",
        "value": "A man is walking down the road."
      }
    ]
  }
  ...
]

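Note: Qwen2-VL treats a video as a sequence of images.

Training

Note: For mixed datasets (e.g. some samples in a batch have images while others do not), only ZeRO-2 is supported.

To run the training script, use the following command:

Full Finetuning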

bash scripts/finetune.sh

Full Finetuning with 8-bit

bash scripts/finetune_8bit.sh

This script finetunes the model with 8bit-adamw and the fp8 model dtype. Use it if you run out of VRAM.
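As a rough illustration of why 8-bit optimizer states help, the sketch below estimates the memory held by AdamW's two per-parameter states. The 7-billion parameter count is an assumed example, and the estimate ignores weights, gradients, activations and the small bookkeeping overhead of quantized states.

```python
def adamw_state_gib(num_params, bytes_per_state):
    """Memory (GiB) for AdamW's two per-parameter states (first and second moment)."""
    return 2 * num_params * bytes_per_state / 1024**3


num_params = 7_000_000_000  # assumed model size, for illustration only

fp32_states = adamw_state_gib(num_params, 4)  # standard 32-bit states
int8_states = adamw_state_gib(num_params, 1)  # 8-bit states

print(f"32-bit AdamW states: {fp32_states:.1f} GiB")
print(f"8-bit AdamW states:  {int8_states:.1f} GiB")
```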

Finetune with LoRA

If you want to train only the language model with LoRA and fully train the vision model:

bash scripts/finetune_lora.sh

If you want to train both the language model and the vision model with LoRA:

bash scripts/finetune_lora_vision.sh

IMPORTANT: If you want to tune embed_token with LoRA, you need to tune lm_head together with it. Note: Freezing the LLM only works without LoRA (including vision_model LoRA).
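Two small sketches related to these notes. The first computes the standard LoRA scaling factor, `lora_alpha / lora_rank`. The second is a hypothetical pre-flight check for the embed_token/lm_head rule; the module names it compares against are assumptions about how the layers are named, and neither function is part of the repository.

```python
def lora_scaling(lora_alpha, lora_rank):
    """Scale applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_rank


def check_lora_targets(target_modules):
    """Raise if the embedding layer is targeted without lm_head.

    Hypothetical check; "embed_tokens" and "lm_head" are assumed module names.
    """
    names = set(target_modules)
    if "embed_tokens" in names and "lm_head" not in names:
        raise ValueError("tune lm_head together with embed_tokens")
    return True


# With a rank of 128 and an alpha of 256, the update is scaled by 2.
scale = lora_scaling(lora_alpha=256, lora_rank=128)
```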

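Training arguments

--deepspeed (str): Path to the DeepSpeed config file (default: "scripts/zero2.json").
--data_path (str): Path to the LLaVA-formatted training data (a JSON file). (Required)
--image_folder (str): Path to the image folder referenced in the LLaVA-formatted training data. (Required)
--model_id (str): Path to the Qwen2-VL model. (Required)
--output_dir (str): Output directory for model checkpoints.
--num_train_epochs (int): Number of training epochs (default: 1).
--per_device_train_batch_size (int): Training batch size per GPU per forward step.
--gradient_accumulation_steps (int): Number of gradient accumulation steps (default: 4).
--freeze_vision_tower (bool): Option to freeze the vision_model (default: False).
--freeze_llm (bool): Option to freeze the LLM (default: False).
--tune_merger (bool): Option to tune the projector (default: True).
--num_lora_modules (int): Number of target modules to add LoRA to (-1 means all layers).
--vision_lr (float): Learning rate for the vision_model.
--merger_lr (float): Learning rate for the merger (projector).
--learning_rate (float): Learning rate for the language module.
--bf16 (bool): Option to use bfloat16.
--fp16 (bool): Option to use fp16.
--min_pixels (int): Option for the minimum number of input tokens, given as a pixel count.
--max_pixels (int): Option for the maximum number of input tokens, given as a pixel count.
--lora_namespan_exclude (str): Exclude modules whose names contain the given spans from LoRA.
--max_seq_length (int): Maximum sequence length (default: 32K).
--bits (int): Quantization bits (default: 16).
--disable_flash_attn2 (bool): Disable Flash Attention 2.
--report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
--logging_dir (str): Logging directory (default: "./tf-logs").
--lora_rank (int): LoRA rank (default: 128).
--lora_alpha (int): LoRA alpha (default: 256).
--lora_dropout (float): LoRA dropout (default: 0.05).
--logging_steps (int): Logging steps (default: 1).
--dataloader_num_workers (int): Number of data loader workers (default: 4).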
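The batch-related arguments combine into the number of samples that contribute to each optimizer step. A minimal sketch of that arithmetic, assuming plain data parallelism; the per-device batch size of 1 and the GPU count of 8 are example values, not defaults from this repository.

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_gpus):
    """Samples that contribute to one optimizer step under data parallelism."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


# gradient_accumulation_steps=4 is the default listed above;
# the other two values are assumptions for the example.
batch = effective_batch_size(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_gpus=8,
)
```

Lowering --per_device_train_batch_size while raising --gradient_accumulation_steps keeps this product unchanged and reduces peak memory per GPU.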

Note: The learning rate of the vision_model should be 5x to 10x smaller than that of the language_model.
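As a quick illustration of this rule of thumb, the sketch below derives the suggested range for --vision_lr from --learning_rate. The value 1e-5 is an assumed example, not a recommendation from the repository.

```python
def vision_lr_range(learning_rate):
    """Return (low, high): 10x smaller and 5x smaller than the language model LR."""
    return learning_rate / 10, learning_rate / 5


low, high = vision_lr_range(1e-5)  # 1e-5 is an assumed example value
print(f"vision_lr between {low:.1e} and {high:.1e}")
```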

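Train with video dataset

You can train the model using a video dataset. However, Qwen2-VL processes a video as a sequence of images, so you need to select specific frames and treat them as multiple images during training. LoRA configs can be set here as well to train with LoRA.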

bash scripts/finetune_video.sh

Note: Training with video works just like training with multiple images, so adjust max_pixels (the maximum resolution) and fps to fit the available VRAM.
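To make the frame-selection idea concrete, here is a generic sketch that picks evenly spaced frame indices for a target fps. It is not how the repository or qwen-vl-utils actually samples frames, and `sample_frame_indices` is a hypothetical helper.

```python
def sample_frame_indices(total_frames, native_fps, target_fps):
    """Pick evenly spaced frame indices so the clip is read at about target_fps."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("all arguments must be positive")
    step = max(native_fps / target_fps, 1.0)  # never sample more often than the source
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


# A 3-second clip recorded at 30 fps, read at 1 fps.
frames = sample_frame_indices(total_frames=90, native_fps=30, target_fps=1)
```

Each selected frame is processed like a separate image, so a lower fps or a smaller max_pixels reduces memory use.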

If you run out of VRAM, you can use zero3_offload instead of zero3. However, zero3 is preferred when it fits.
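For orientation, the sketch below builds a DeepSpeed ZeRO-3 config with and without CPU offload using standard DeepSpeed config keys. It only illustrates the difference between the two modes; the config files shipped in this repository may use different settings.

```python
import json


def zero3_config(offload=False):
    """Build a DeepSpeed ZeRO-3 config dict, optionally with CPU offload."""
    zero = {"stage": 3}
    if offload:
        # Move optimizer states and parameters to CPU memory to save VRAM,
        # at the cost of slower training.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "zero_optimization": zero,
        # "auto" lets the HuggingFace Trainer fill these in from its own arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": "auto"},
    }


print(json.dumps(zero3_config(offload=True), indent=2))
```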

Merge LoRA Weights

bash scripts/merge_lora.sh

Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your own paths. (Also in merge_lora.sh when using LoRA.)

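Image resolution for performance boost

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input. Native or higher pixel counts are recommended for better performance, but large images then take too much memory and computation time, so you can adjust the pixel counts. The pixel budget is expressed as token_num * 28 * 28, so you only need to change the token_num part in the script.
For example:

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28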

Note: For video, you don't have to set both values like this; setting only the maximum resolution is enough.
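The sketch below works through the pixel-budget arithmetic and shows roughly how an image could be fitted into that budget. `fit_to_budget` is a simplified approximation written for this example; the model's real preprocessing may round differently.

```python
import math

PATCH = 28  # side length, in pixels, of the block behind one visual token


def pixel_budget(token_num):
    """Convert a token count into a pixel count."""
    return token_num * PATCH * PATCH


def fit_to_budget(height, width, min_pixels, max_pixels):
    """Resize (height, width) to multiples of 28 that respect the pixel budget."""
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


min_pixels = pixel_budget(256)
max_pixels = pixel_budget(1280)

h, w = fit_to_budget(1080, 1920, min_pixels, max_pixels)
tokens = (h // PATCH) * (w // PATCH)
print(f"resized to {h}x{w}, about {tokens} visual tokens")
```

Raising the token count in max_pixels keeps more detail from large images but increases memory use and compute time.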

Issue for libcudnn error

 Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8

Running unset LD_LIBRARY_PATH resolves this error. See the related issue for details.
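If the training script is launched from Python rather than from a shell, the same effect can be had by passing an environment without that variable. A minimal sketch; the helper name is hypothetical.

```python
import os


def env_without_ld_library_path(base_env=None):
    """Copy an environment mapping, dropping LD_LIBRARY_PATH if present."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("LD_LIBRARY_PATH", None)
    return env


# Example use (commented out so the snippet has no side effects):
# import subprocess
# subprocess.run(["bash", "scripts/finetune.sh"],
#                env=env_without_ld_library_path(), check=True)
```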

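Inference

Note: Use the merged weights if the model was trained with LoRA.

Gradio Inference (WebUI)

Install gradio

pip install gradio

Launch app

python -m src.serve.app \
    --model-path /path/to/merged/weight

This command launches a Gradio-based demo. It also accepts other generation settings such as repetition_penalty and temperature.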

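TODO

Support for video data
Add a demo for multi-image and video
Support dynamic truncation

Known Issues

libcudnn issue

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Citation

If you find this repository useful in your project, please consider giving it a star and citing: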

@misc{Qwen2-VL-Finetuning,
  author = {Yuwon Lee},
  title = {Qwen2-VL-Finetune},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/2U1/Qwen2-VL-Finetune}
}