unsloth

Twitter (aka X)	Follow us on X
Installation	unsloth/README.md
Benchmarking	Performance Tables
Released Models	Unsloth Releases
Blog	Read our Blogs

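Key Features

All kernels are written in OpenAI's Triton language, with a manual backprop engine.
0% loss in accuracy: no approximation methods, everything is exact.
No change of hardware. Supports NVIDIA GPUs from 2018 onward, with a minimum CUDA Capability of 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40, etc.). Check your GPU! GTX 1070 and 1080 work, but are slow.
Works on Linux, and on Windows via WSL.
Supports 4-bit and 16-bit QLoRA / LoRA finetuning via bitsandbytes.
The open source version trains 5x faster; see Unsloth Pro for up to 30x faster training!
If you trained a model with Unsloth, you can use this cool sticker!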
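
To check a GPU against the minimum CUDA Capability of 7.0 mentioned above, a small script along these lines may help. It is only a sketch: the helper names `is_supported` and `describe` are illustrative and not part of Unsloth, and the live check runs only if PyTorch is installed and a CUDA device is visible.

```python
MIN_CAPABILITY = (7, 0)  # minimum CUDA Capability stated in the feature list


def is_supported(capability):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= MIN_CAPABILITY


def describe(capability):
    """Format a short, human-readable verdict for one compute capability."""
    major, minor = capability
    status = "meets the 7.0 minimum" if is_supported(capability) else "below the 7.0 minimum"
    return f"compute capability {major}.{minor}: {status}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability() returns a (major, minor) tuple for the current device
        print(describe(torch.cuda.get_device_capability()))
    else:
        print("No CUDA device visible; showing two known examples instead:")
        print(describe((7, 5)))  # Tesla T4
        print(describe((6, 1)))  # GTX 1080
```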

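Performance Benchmarking

For the full list of reproducible benchmarking tables, go to our website.

1 A100 40GB	Hugging Face	Flash Attention	Unsloth Open Source	Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x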

The benchmarking table below was conducted by Hugging Face.

Free Colab T4	Dataset	Hugging Face	PyTorch 2.1.1	Unsloth	VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

Installation Instructions

For stable releases, use pip install unsloth. For most installations, however, we recommend pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git".

Conda Installation

Only use Conda if you already have it. If not, use Pip. Select pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. We support python=3.10, 3.11, and 3.12.

    conda create --name unsloth_env \
        python=3.11 \
        pytorch-cuda=12.1 \
        pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
        -y
    conda activate unsloth_env

    pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    pip install --no-deps trl peft accelerate bitsandbytes

If you want to install Conda in a Linux environment, read here, or run the commands below:

    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm -rf ~/miniconda3/miniconda.sh
    ~/miniconda3/bin/conda init bash
    ~/miniconda3/bin/conda init zsh

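Pip Installation

Do **NOT** use this if you have Conda. Pip is a bit more complex because of dependency issues. The pip command differs for torch 2.2, 2.3, 2.4, 2.5 and for each CUDA version.

For other torch versions, we support torch211, torch212, torch220, torch230, torch240, and torch250, and for CUDA versions, we support cu118, cu121, and cu124. For Ampere devices (A100, H100, RTX3090) and above, use cu118-ampere, cu121-ampere, or cu124-ampere.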

For example, if you have torch 2.4 and CUDA 12.1, use:

    pip install --upgrade pip
    pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

As another example, if you have torch 2.5 and CUDA 12.4, use:

    pip install --upgrade pip
    pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"

And other examples:

    pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"

    pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"

    pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

Or, run the following in a terminal to get the optimal pip installation command:

    wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

Or, run the following manually in a Python REPL:

    try:
        import torch
    except ImportError:
        raise ImportError('Install torch via `pip install torch`')

    from packaging.version import Version as V
    v = V(torch.__version__)
    cuda = str(torch.version.cuda)
    # Ampere and newer GPUs report a compute capability major version >= 8
    is_ampere = torch.cuda.get_device_capability()[0] >= 8
    if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!")
    if   v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
    elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
    elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
    elif v  < V('2.3.0'): x = 'cu{}{}-torch220'
    elif v  < V('2.4.0'): x = 'cu{}{}-torch230'
    elif v  < V('2.5.0'): x = 'cu{}{}-torch240'
    elif v  < V('2.6.0'): x = 'cu{}{}-torch250'
    else: raise RuntimeError(f"Torch = {v} too new!")

    x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
    print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')
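
The selection logic can also be tried without a GPU or a torch install. The sketch below restates the same version thresholds as a pure function; `parse_version` and `pick_extra` are hypothetical names used only for illustration, and the simple parser assumes plain release strings such as `2.4.0` or `2.5.1+cu124`, not pre-release tags.

```python
SUPPORTED_CUDA = ("11.8", "12.1", "12.4")


def parse_version(text):
    """Turn '2.4.0+cu121' into (2, 4, 0), ignoring any local build suffix."""
    core = text.split("+")[0]
    numbers = [int(part) for part in core.split(".")[:3]]
    numbers += [0] * (3 - len(numbers))  # pad '2.4' to (2, 4, 0)
    return tuple(numbers)


def pick_extra(torch_version, cuda_version, is_ampere):
    """Return the extras tag, e.g. 'cu121-ampere-torch240'."""
    if cuda_version not in SUPPORTED_CUDA:
        raise RuntimeError(f"CUDA = {cuda_version} not supported!")

    v = parse_version(torch_version)
    if v <= (2, 1, 0):
        raise RuntimeError(f"Torch = {torch_version} too old!")
    elif v <= (2, 1, 1):
        torch_tag = "torch211"
    elif v <= (2, 1, 2):
        torch_tag = "torch212"
    elif v < (2, 3, 0):
        torch_tag = "torch220"
    elif v < (2, 4, 0):
        torch_tag = "torch230"
    elif v < (2, 5, 0):
        torch_tag = "torch240"
    elif v < (2, 6, 0):
        torch_tag = "torch250"
    else:
        raise RuntimeError(f"Torch = {torch_version} too new!")

    cuda_tag = "cu" + cuda_version.replace(".", "")
    ampere_tag = "-ampere" if is_ampere else ""
    return f"{cuda_tag}{ampere_tag}-{torch_tag}"


if __name__ == "__main__":
    extra = pick_extra("2.4.0", "12.1", is_ampere=False)
    print(f'pip install "unsloth[{extra}] @ git+https://github.com/unslothai/unsloth.git"')
```

With torch 2.4.0, CUDA 12.1, and a non-Ampere GPU, this yields the `cu121-torch240` tag used in the first pip example.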

Windows Installation

To run Unsloth directly on Windows:

Install Triton from this Windows fork and follow its instructions: https://github.com/woct0rdho/triton-windows
In the SFTTrainer, set dataset_num_proc=1 to avoid a crashing issue:

    trainer = SFTTrainer(
        dataset_num_proc=1,
        ...
    )
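
For a script that has to run on both Windows and Linux, the process count can be chosen at runtime. This is a minimal sketch, not something Unsloth provides: `pick_dataset_num_proc` is a hypothetical helper, and using every CPU core on non-Windows systems is an assumption that may not suit your machine.

```python
import os
import platform


def pick_dataset_num_proc(system=None, cpu_count=None):
    """Return 1 on Windows (per the advice above), else the CPU count."""
    system = platform.system() if system is None else system
    if system == "Windows":
        return 1
    cpu_count = os.cpu_count() if cpu_count is None else cpu_count
    return max(1, cpu_count or 1)


if __name__ == "__main__":
    print("dataset_num_proc =", pick_dataset_num_proc())
    # Then pass it on, for example:
    # trainer = SFTTrainer(dataset_num_proc=pick_dataset_num_proc(), ...)
```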

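For advanced installation instructions, or if you see weird errors during installation:

Install torch and triton. Go to https://pytorch.org to install them. For example: pip install torch torchvision torchaudio triton
Confirm that CUDA is installed correctly. Try nvcc. If that fails, you need to install cudatoolkit or the CUDA drivers.
Install xformers manually. You can try installing vllm and seeing whether vllm succeeds. Check whether xformers succeeded with python -m xformers.info. See https://github.com/facebookresearch/xformers. Another option is to install flash-attn for Ampere GPUs.
Finally, install bitsandbytes and check it with python -m bitsandbytes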
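
Before working through the steps above one by one, a quick pre-check can show which pieces are missing. The sketch below uses only the standard library; `find_missing` and `nvcc_on_path` are illustrative names. It only tests whether each package can be located and whether an `nvcc` executable is on PATH, so the `python -m xformers.info` and `python -m bitsandbytes` checks are still the real test.

```python
import importlib.util
import shutil

PACKAGES = ["torch", "triton", "xformers", "bitsandbytes"]


def find_missing(packages):
    """Return the packages that cannot be located on the import path."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def nvcc_on_path():
    """Return True if an nvcc executable is found on PATH."""
    return shutil.which("nvcc") is not None


if __name__ == "__main__":
    missing = find_missing(PACKAGES)
    print("missing packages:", ", ".join(missing) if missing else "none")
    print("nvcc on PATH:", nvcc_on_path())
```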

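Documentation

Go to our official Documentation for saving to GGUF, checkpointing, evaluation, and more!
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, or even PyTorch code!
We're in Hugging Face's official docs! Check out the SFT docs and DPO docs!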

    from unsloth import FastLanguageModel
    from unsloth import is_bfloat16_supported
    import torch
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    max_seq_length = 2048
    # Get the LAION dataset
    url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
    dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

    # 4-bit pre-quantized models we support for 4x faster downloading + no OOMs.
    fourbit_models = [
        "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
        "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
        "unsloth/llama-3-8b-Instruct-bnb-4bit",
        "unsloth/llama-3-70b-bnb-4bit",
        "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
        "unsloth/Phi-3-medium-4k-instruct",
        "unsloth/mistral-7b-bnb-4bit",
        "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
    ] # More models at https://huggingface.co/unsloth

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3-8b-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )

    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth",
    )

    trainer = SFTTrainer(
        model = model,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        tokenizer = tokenizer,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            warmup_steps = 10,
            max_steps = 60,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            output_dir = "outputs",
            optim = "adamw_8bit",
            seed = 3407,
        ),
    )
    trainer.train()

    # Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
    # (1) Saving to GGUF / merging to 16bit for vLLM
    # (2) Continued training from a saved LoRA adapter
    # (3) Adding an evaluation loop / OOMs
    # (4) Customized chat templates

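DPO Support

DPO (Direct Preference Optimization), PPO, and Reward Modelling all seem to work, according to third-party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4: notebook.

We're in Hugging Face's official docs! We're on the SFT docs and the DPO docs!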

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional: set the GPU device ID

    from unsloth import FastLanguageModel, PatchDPOTrainer
    from unsloth import is_bfloat16_supported
    PatchDPOTrainer()
    import torch
    from transformers import TrainingArguments
    from trl import DPOTrainer

    # Not defined in the original snippet; 2048 is an assumed value
    # that matches the SFT example.
    max_seq_length = 2048

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/zephyr-sft-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )

    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        r = 64,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 64,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth",
    )

    dpo_trainer = DPOTrainer(
        model = model,
        ref_model = None,
        args = TrainingArguments(
            per_device_train_batch_size = 4,
            gradient_accumulation_steps = 8,
            warmup_ratio = 0.1,
            num_train_epochs = 3,
            fp16 = not is_bfloat16_supported(),
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            seed = 42,
            output_dir = "outputs",
        ),
        beta = 0.1,
        train_dataset = YOUR_DATASET_HERE,
        # eval_dataset = YOUR_DATASET_HERE,
        tokenizer = tokenizer,
        max_length = 1024,
        max_prompt_length = 512,
    )
    dpo_trainer.train()
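
For intuition about what beta = 0.1 controls, here is a minimal sketch of the standard per-example DPO objective in plain Python. It is not Unsloth's or TRL's implementation; the function name and the summed log-probability inputs are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-z))
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2)
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy favours the chosen answer more than the reference does: lower loss
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A larger beta scales the margin more strongly, so the same preference gap moves the loss further from log(2) in either direction.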

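Detailed Benchmarking Tables

Click "Code" for fully reproducible examples.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remain identical.
For the full list of benchmarking tables, go to our website.

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.04x	1.98x	2.48x	5.32x	15.64x
code	Code	Code	Code	Code
seconds	1040	1001	525	419	196	67
memory MB	18235	15365	9631	8525
% saved		15.74	47.18	53.25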
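
The derived rows in this table appear to follow from the raw ones: speedup as baseline seconds divided by method seconds, and "% saved" as the memory reduction relative to the Hugging Face baseline. The sketch below checks that reading against the Alpaca numbers; the function names are illustrative.

```python
def speedup(baseline_seconds, method_seconds):
    """How many times faster a method is than the baseline."""
    return baseline_seconds / method_seconds


def percent_saved(baseline_mb, method_mb):
    """Memory reduction relative to the baseline, in percent."""
    return (baseline_mb - method_mb) / baseline_mb * 100


if __name__ == "__main__":
    # Columns: Flash Attention 2, Unsloth Open, Unsloth Equal (Alpaca, 1 A100 40GB)
    names = ["Flash Attention 2", "Unsloth Open", "Unsloth Equal"]
    seconds = [1001, 525, 419]
    memory_mb = [15365, 9631, 8525]
    for name, sec, mb in zip(names, seconds, memory_mb):
        print(f"{name}: {speedup(1040, sec):.2f}x, {percent_saved(18235, mb):.2f}% saved")
```

For these three columns the computed values match the table (1.04x, 1.98x, 2.48x and 15.74, 47.18, 53.25). The Pro and Max speedups computed this way come out slightly below the listed 5.32x and 15.64x, possibly because the listed seconds are rounded.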

Llama-Factory 3rd Party Benchmarking

Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1.

Method	Bits	TGS	GRAM (GPU memory)	Speed
HF	16	2392	18GB	100%
HF+FA2	16	2954	17GB	123%
Unsloth+FA2	16	4007	16GB	168%
HF	4	2415	9GB	101%
Unsloth+FA2	4	3726	7GB	160%

Performance comparisons between popular models

Benchmarking tables for specific models (Mistral 7b, CodeLlama 34b, etc.)

Mistral 7b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Mistral 7B Slim Orca	1x	1.15x	2.15x	2.53x	4.61x	13.69x
code	Code	Code	Code	Code
seconds	1813	1571	842	718	393	132
memory MB	32853	19385	12465	10271
% saved		40.99	62.06	68.74

CodeLlama 34b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Code Llama 34B	OOM	0.99x	1.87x	2.61x	4.27x	12.82x
code	Code	Code	Code	Code
seconds	1953	1982	1043	748	458	152
memory MB	40000	33217	27413	22161
% saved		16.96	31.47	44.60

1 Tesla T4

1 T4 16GB	Hugging Face	Flash Attention	Unsloth Open	Unsloth Pro Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.09x	1.69x	1.79x	2.93x	8.3x
code	Code	Code	Code	Code
seconds	1599	1468	942	894	545	193
memory MB	7199	7059	6459	5443
% saved		1.94	10.28	24.39

2 Tesla T4s via DDP

2 T4 DDP	Hugging Face	Flash Attention	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	0.99x	4.95x	4.44x	7.28x	20.61x
code	Code	Code	Code
seconds	9882	9946	1996	2227	1357	480
memory MB	9176	9128	6904	6782
% saved		0.52	24.76	26.09

Performance comparisons on 1 Tesla T4 GPU:

Time taken for 1 epoch

One Tesla T4 on Google Colab: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Huggingface	1 T4	23h 15m	56h 28m	8h 38m	391h 41m
Unsloth Open	1 T4	13h 7m (1.8x)	31h 47m (1.8x)	4h 27m (1.9x)	240h 4m (1.6x)
Unsloth Pro	1 T4	3h 6m (7.5x)	5h 17m (10.7x)	1h 7m (7.7x)	59h 53m (6.5x)
Unsloth Max	1 T4	2h 39m (8.8x)	4h 31m (12.5x)	0h 58m (8.9x)	51h 30m (7.6x)
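
The multipliers in parentheses can be reproduced from the durations themselves, written as hours and minutes (for example 23h 15m). The sketch below converts each duration to minutes and divides the Hugging Face time by the method's time; `to_minutes` and `speedup` are illustrative helpers, not part of any library.

```python
import re


def to_minutes(duration):
    """Convert a string such as '23h 15m' to total minutes."""
    match = re.fullmatch(r"\s*(\d+)h\s*(\d+)m\s*", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = (int(group) for group in match.groups())
    return hours * 60 + minutes


def speedup(baseline, method):
    """Ratio of baseline time to method time."""
    return to_minutes(baseline) / to_minutes(method)


if __name__ == "__main__":
    # Alpaca (52K) column on 1 T4
    for name, duration in [("Unsloth Open", "13h 7m"),
                           ("Unsloth Pro", "3h 6m"),
                           ("Unsloth Max", "2h 39m")]:
        print(f"{name}: {speedup('23h 15m', duration):.1f}x")
```

For the Alpaca column this gives 1.8x, 7.5x, and 8.8x, matching the table.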

Peak Memory Usage

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Huggingface	1 T4	7.3GB	5.9GB	14.0GB	13.3GB
Unsloth Open	1 T4	6.8GB	5.7GB	7.8GB	7.7GB
Unsloth Pro	1 T4	6.4GB	6.4GB	6.4GB	6.4GB
Unsloth Max	1 T4	11.4GB	12.4GB	11.9GB	14.4GB

Performance comparisons on 2 Tesla T4 GPUs via DDP:

**Time taken for 1 epoch**

Two Tesla T4s on Kaggle: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Huggingface	2 T4	84h 47m	163h 48m	30h 51m	1301h 24m *
Unsloth Pro	2 T4	3h 20m (25.4x)	5h 43m (28.7x)	1h 12m (25.7x)	71h 40m (18.1x) *
Unsloth Max	2 T4	3h 4m (27.6x)	5h 14m (31.3x)	1h 6m (28.1x)	54h 20m (23.9x) *
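
The benchmark settings give bsz = 2 and ga = 4. Reading these as per-device batch size and gradient accumulation steps (the usual meaning of the abbreviations, assumed here rather than stated in the source), the number of samples per optimizer step is their product times the number of GPUs. The sketch below works this out for the Alpaca column, using the rounded 52K row count from the table header; actual trainers may round the step count differently.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(num_samples, per_device_batch, grad_accum_steps, num_gpus=1):
    """Approximate optimizer steps needed to see every sample once."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
    return math.ceil(num_samples / batch)


if __name__ == "__main__":
    print(effective_batch_size(2, 4, num_gpus=1))               # 1 T4
    print(effective_batch_size(2, 4, num_gpus=2))               # 2 T4s via DDP
    print(optimizer_steps_per_epoch(52_000, 2, 4, num_gpus=2))  # Alpaca, 2 T4s
```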

Peak Memory Usage on a Multi GPU System (2 GPUs)

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Huggingface	2 T4	8.4GB | 6GB	7.2GB | 5.3GB	14.3GB | 6.6GB	10.9GB | 5.9GB *
Unsloth Pro	2 T4	7.7GB | 4.9GB	7.5GB | 4.9GB	8.5GB | 4.9GB	6.2GB | 4.7GB *
Unsloth Max	2 T4	10.5GB | 5GB	10.6GB | 5GB	10.6GB | 5GB	10.5GB | 5GB *