| Type | Links |
| --- | --- |
| Twitter (aka X) | Follow us on X |
| Installation | unsloth/README.md |
| Benchmarking | Performance Tables |
| Released Models | Unsloth Releases |
| Blog | Read our Blogs |
- All kernels written in OpenAI's Triton language. Manual backprop engine.
- 0% loss in accuracy - no approximation methods - all exact.
- No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). Check your GPU with the snippet after this list! GTX 1070 and 1080 work, but are slow.
- Works on Linux and Windows via WSL.
- Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
- Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
- If you trained a model with 🦥 Unsloth, you can use this cool sticker!
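To check your GPU, here is a minimal sketch, assuming `torch` is already installed:

```python
# Print this GPU's CUDA Capability and flag anything below the 7.0 minimum.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: CUDA Capability {major}.{minor}")
if (major, minor) < (7, 0):
    print("Below 7.0: unsupported (GTX 10xx-era cards run, but slowly).")
```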
For the full list of reproducible benchmarking tables, go to our website.
| 1 A100 40GB | 🤗 Hugging Face | Flash Attention | 🦥 Unsloth Open Source | 🦥 Unsloth Pro |
| --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |
The benchmarking table below was conducted by 🤗 Hugging Face.
| Free Colab T4 | Dataset | 🤗 Hugging Face | Pytorch 2.1.1 | 🦥 Unsloth | 🦥 VRAM reduction |
| --- | --- | --- | --- | --- | --- |
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |
For stable releases, use `pip install unsloth`. We recommend `pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"` for most installations though.
⚠️ Only use Conda if you have it. If not, use pip. Select either `pytorch-cuda=11.8` or `pytorch-cuda=12.1` for CUDA 11.8 or CUDA 12.1. We support `python=3.10`, `3.11` and `3.12`.
```bash
conda create --name unsloth_env \
    python=3.11 \
    pytorch-cuda=12.1 \
    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
    -y
conda activate unsloth_env

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps trl peft accelerate bitsandbytes
```
If you need to install Conda in a Linux environment first, run:

```bash
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
```
⚠️ Do **NOT** use this if you have Conda. Pip is a bit more complex since there are dependency issues. The pip command differs for torch `2.2`, `2.3`, `2.4` and `2.5`, and for CUDA versions.
For other torch versions, we support `torch211`, `torch212`, `torch220`, `torch230` and `torch240`, and for CUDA versions, we support `cu118`, `cu121` and `cu124`. For Ampere devices (A100, H100, RTX 3090) and above, use `cu118-ampere`, `cu121-ampere` or `cu124-ampere`.
For example, if you have torch `2.4` and CUDA `12.1`, use:

```bash
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
```
Another example, if you have torch `2.5` and CUDA `12.4`, use:

```bash
pip install --upgrade pip
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
```
And other examples:

```bash
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
```
Or, run the below in a terminal to get the optimal pip installation command:
```bash
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
```
Or, run the below manually in a Python REPL:
```python
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
v = V(torch.__version__)
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!")
if   v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v  < V('2.3.0'): x = 'cu{}{}-torch220'
elif v  < V('2.4.0'): x = 'cu{}{}-torch230'
elif v  < V('2.5.0'): x = 'cu{}{}-torch240'
elif v  < V('2.6.0'): x = 'cu{}{}-torch250'
else: raise RuntimeError(f"Torch = {v} too new!")
x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')
```
To run Unsloth directly on Windows:

- Install Triton from this Windows fork and follow the instructions: https://github.com/woct0rdho/triton-windows
- In the SFTTrainer, set `dataset_num_proc=1` to avoid a crashing issue:

```python
trainer = SFTTrainer(
    dataset_num_proc = 1,
    ...
)
```
For advanced installation instructions or if you see weird errors during installations:

1. Install `torch` and `triton`. Go to https://pytorch.org to install them. For example `pip install torch torchvision torchaudio triton`.
2. Confirm CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or CUDA drivers.
3. Install `xformers` manually. You can try installing `vllm` and seeing if `vllm` succeeds. Check if `xformers` succeeded with `python -m xformers.info`. Go to https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs.
4. Finally, install `bitsandbytes` and check it with `python -m bitsandbytes`.
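Once those steps succeed, here is a quick sanity-check sketch, assuming all four packages installed, to confirm the whole stack imports:

```python
# Confirm the core dependencies import and report their versions.
import torch
import triton
import xformers
import bitsandbytes

print("torch        :", torch.__version__, "| CUDA:", torch.version.cuda)
print("triton       :", triton.__version__)
print("xformers     :", xformers.__version__)
print("bitsandbytes :", bitsandbytes.__version__)
print("GPU available:", torch.cuda.is_available())
```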
Go to our official Documentation for saving to GGUF, checkpointing, evaluation and more!
We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
We're in 🤗 Hugging Face's official docs! Check out the SFT docs and DPO docs!
```python
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
```
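The wiki tips at the end of the example mention saving to GGUF. Here is a minimal sketch of that step using the `save_pretrained_gguf` helper from our saving docs; the output directory and quantization method are illustrative choices:

```python
# Export the finetuned model to GGUF; "model_gguf" and "q4_k_m" are examples.
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method = "q4_k_m")
```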
DPO (Direct Preference Optimization), PPO and Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: notebook.
We're in 🤗 Hugging Face's official docs! We're on the SFT docs and the DPO docs!
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID

from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Choose any; must be defined before use below

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
```
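`YOUR_DATASET_HERE` above is a placeholder you must fill in yourself. Purely as an illustration (this dataset choice is our assumption, not necessarily what the Zephyr notebook uses), `DPOTrainer` expects `prompt`, `chosen` and `rejected` columns:

```python
# One public preference dataset in the prompt/chosen/rejected format.
# Depending on your trl version, chosen/rejected may need mapping to plain
# strings before being passed to DPOTrainer.
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split = "train_prefs")
print(dataset.column_names)  # includes "prompt", "chosen", "rejected"
```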
Click "Code" for fully reproducible examples
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
For the full list of benchmarking tables, go to our website
| 1 A100 40GB | 🤗 Hugging Face | Flash Attention 2 | 🦥 Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | | |
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | | |
| % saved | | 15.74 | 47.18 | 53.25 | | |
Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
| Method | Bits | TGS | GRAM | Speed |
| --- | --- | --- | --- | --- |
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | 160% |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | | |
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | | |
| % saved | | 40.99 | 62.06 | 68.74 | | |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Code Llama 34B | OOM | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | | | |
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | | |
| % saved | | 16.96 | 31.47 | 44.60 | | |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | | | |
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | | |
| % saved | | 1.94 | 10.28 | 24.39 | | |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| --- | --- | --- | --- | --- | --- | --- |
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | | | | |
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | | |
| % saved | | 0.52 | 24.76 | 26.09 | | |
One Tesla T4 on Google Colab
`bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10` (see the sketch after the memory table for how these settings map onto `TrainingArguments`)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
Peak Memory Usage
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
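As a reproduction aid, here is a sketch of how the settings string above maps onto `TrainingArguments`; interpreting `schedule_steps = 10` as warmup steps is our assumption, and the model/dataset wiring is omitted:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size = 2,  # bsz = 2
    gradient_accumulation_steps = 4,  # ga = 4
    max_grad_norm = 0.3,
    num_train_epochs = 1,
    seed = 3047,
    learning_rate = 2e-4,             # lr = 2e-4
    weight_decay = 0.01,              # wd = 0.01
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",     # schedule = "linear"
    warmup_steps = 10,                # schedule_steps = 10 (assumed warmup)
    output_dir = "outputs",
)
```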
Two Tesla T4s on Kaggle
`bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10`
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |
Peak Memory Usage on a Multi GPU System (2 GPUs)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
| --- | --- | --- | --- | --- | --- |
| Huggingface | 2 T4 | 8.4GB \| 6GB | 7.2GB \| 5.3GB | 14.3GB \| 6.6GB | 10.9GB \| 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB \| 4.9GB | 7.5GB \| 4.9GB | 8.5GB \| 4.9GB | 6.2GB \| 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB \| 5GB | 10.6GB \| 5GB | 10.6GB \| 5GB | 10.5GB \| 5GB * |
\* Slim Orca uses `bsz=1` for all benchmarks since `bsz=2` OOMs. We can handle `bsz=2`, but we benchmark it with `bsz=1` for consistency.
- HuyNguyen-hust for making RoPE Embeddings 28% faster
- RandomInternetPreson for confirming WSL support
- 152334H for experimental DPO support
- atgctg for syntax highlighting