unsloth download - unsloth source code download

unsloth

Twitter (aka X)	Follow us on X
Installation	unsloth/README.md
Benchmarking	Performance Tables
Released Models	Unsloth Releases
Blog	Read our Blogs

Key Features

All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy: no approximation methods, all exact.
No change of hardware. Supports NVIDIA GPUs from 2018 onward. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc.). Check your GPU! GTX 1070 and 1080 work, but are slow.
Works on Linux and on Windows via WSL.
Supports 4-bit and 16-bit QLoRA / LoRA finetuning via bitsandbytes.
The open-source version trains 5x faster. See Unsloth Pro for up to 30x faster training!
If you trained a model with Unsloth, you can use this cool sticker!
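
The minimum-capability requirement in the list above can be checked mechanically. The following is a minimal sketch, not part of Unsloth: `meets_minimum` and `is_ampere_or_newer` are hypothetical helper names, and the only inputs are `(major, minor)` compute capability tuples. The Ampere test (major version 8 or higher) mirrors the check used by the install script later in this document.

```python
# Hypothetical helpers for illustration; not part of the Unsloth API.
MIN_CAPABILITY = (7, 0)  # minimum CUDA capability stated in the feature list


def meets_minimum(capability):
    """Return True if a (major, minor) compute capability is at least 7.0."""
    return tuple(capability) >= MIN_CAPABILITY


def is_ampere_or_newer(capability):
    """Ampere and later GPUs report a major compute capability of 8 or more."""
    return capability[0] >= 8


if __name__ == "__main__":
    # Compute capabilities of a few GPUs named in the feature list.
    examples = {"V100": (7, 0), "T4": (7, 5), "A100": (8, 0), "GTX 1080": (6, 1)}
    for name, cap in examples.items():
        print(name, cap, meets_minimum(cap), is_ampere_or_newer(cap))
```

With PyTorch installed and a GPU visible, `torch.cuda.get_device_capability()` returns a tuple of this shape for the current device. A GPU below 7.0, such as the GTX 1080, fails `meets_minimum` even though the list above notes that it still works, only slowly.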

Performance Benchmarking

For the full list of reproducible benchmarking tables, go to our website.

1 A100 40GB	Hugging Face	Flash Attention	Unsloth Open Source	Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x

The benchmarking table below was conducted by Hugging Face.

Free Colab T4	Dataset	Hugging Face	PyTorch 2.1.1	Unsloth	VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

Installation Instructions

For stable releases, use pip install unsloth. For most installations, however, we recommend pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git".

Conda Installation

Only use Conda if you have it. If not, use Pip. Select pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. We support python=3.10, 3.11 and 3.12.

    conda create --name unsloth_env \
        python=3.11 \
        pytorch-cuda=12.1 \
        pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
        -y
    conda activate unsloth_env

    pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    pip install --no-deps trl peft accelerate bitsandbytes

If you're looking to install Conda in a Linux environment, read here, or run the commands below.

    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm -rf ~/miniconda3/miniconda.sh
    ~/miniconda3/bin/conda init bash
    ~/miniconda3/bin/conda init zsh

Pip Installation

Do **NOT** use this if you have Conda. Pip is a bit more complex because of dependency issues. The pip command differs for torch 2.2, 2.3, 2.4, 2.5 and for each CUDA version.

For other torch versions, we support torch211, torch212, torch220, torch230, torch240 and torch250, and for CUDA versions we support cu118, cu121 and cu124. For Ampere devices (A100, H100, RTX3090) and above, use cu118-ampere, cu121-ampere or cu124-ampere.

For example, if you have torch 2.4 and CUDA 12.1, use:

    pip install --upgrade pip
    pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

As another example, if you have torch 2.5 and CUDA 12.4, use:

    pip install --upgrade pip
    pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"

And other examples:

    pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

Or, run the below in a terminal to get the optimal pip installation command:

    wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

Or, run the below manually in a Python REPL:

    try: import torch
    except: raise ImportError('Install torch via `pip install torch`')
    from packaging.version import Version as V
    v = V(torch.__version__)
    cuda = str(torch.version.cuda)
    # Ampere and newer GPUs report a major compute capability of 8 or more
    is_ampere = torch.cuda.get_device_capability()[0] >= 8
    if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!")
    if   v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
    elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
    elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
    elif v  < V('2.3.0'): x = 'cu{}{}-torch220'
    elif v  < V('2.4.0'): x = 'cu{}{}-torch230'
    elif v  < V('2.5.0'): x = 'cu{}{}-torch240'
    elif v  < V('2.6.0'): x = 'cu{}{}-torch250'
    else: raise RuntimeError(f"Torch = {v} too new!")
    x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
    print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')
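
The selection logic in the REPL script can also be restated as a pure function, which makes it easy to see which extra a given torch and CUDA combination maps to without a GPU present. This is a simplified sketch, not the project's installer: `parse_version` and `unsloth_extra` are hypothetical names, and versions are compared as plain integer tuples instead of with `packaging`, so local build suffixes such as `+cu121` are ignored and pre-release strings are not handled.

```python
# Hypothetical restatement of the version-to-extra mapping; not Unsloth code.

def parse_version(text):
    """Parse '2.4.0+cu121' into (2, 4, 0), ignoring any local '+...' suffix."""
    public = text.split("+", 1)[0]
    parts = [int(part) for part in public.split(".")[:3]]
    parts += [0] * (3 - len(parts))  # pad '2.4' to (2, 4, 0)
    return tuple(parts)


def unsloth_extra(torch_version, cuda_version, is_ampere):
    """Return the pip extra name, for example 'cu121-ampere-torch240'."""
    if cuda_version not in ("11.8", "12.1", "12.4"):
        raise RuntimeError(f"CUDA = {cuda_version} not supported!")
    v = parse_version(torch_version)
    if v <= (2, 1, 0):
        raise RuntimeError(f"Torch = {torch_version} too old!")
    elif v <= (2, 1, 1):
        torch_tag = "torch211"
    elif v <= (2, 1, 2):
        torch_tag = "torch212"
    elif v < (2, 3, 0):
        torch_tag = "torch220"
    elif v < (2, 4, 0):
        torch_tag = "torch230"
    elif v < (2, 5, 0):
        torch_tag = "torch240"
    elif v < (2, 6, 0):
        torch_tag = "torch250"
    else:
        raise RuntimeError(f"Torch = {torch_version} too new!")
    cuda_tag = "cu" + cuda_version.replace(".", "")
    ampere_tag = "-ampere" if is_ampere else ""
    return f"{cuda_tag}{ampere_tag}-{torch_tag}"


if __name__ == "__main__":
    extra = unsloth_extra("2.4.0", "12.1", is_ampere=True)
    print(f'pip install "unsloth[{extra}] @ git+https://github.com/unslothai/unsloth.git"')
```

Because the function takes plain strings, it can be exercised on a machine without torch installed. For a real installation the script above, which reads the values from torch itself, remains the safer choice.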

Windows Installation

To run Unsloth directly on Windows:

Install Triton from this Windows fork and follow its instructions: https://github.com/woct0rdho/triton-windows
In the SFTTrainer, set dataset_num_proc=1 to avoid a crashing issue:

    trainer = SFTTrainer(
        dataset_num_proc=1,
        ...
    )

For advanced installation instructions, or if you see weird errors during installation:

Install torch and triton. Go to https://pytorch.org to install them. For example, pip install torch torchvision torchaudio triton
Confirm that CUDA is installed correctly. Try nvcc. If that fails, you need to install cudatoolkit or the CUDA drivers.
Install xformers manually. You can try installing vllm and seeing whether vllm succeeds. Check whether xformers succeeded with python -m xformers.info. Go to https://github.com/facebookresearch/xformers. Another option is to install flash-attn for Ampere GPUs.
Finally, install bitsandbytes and check it with python -m bitsandbytes
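
As a quick first pass before running the individual checks above, it can help to confirm that each package is at least visible to the current Python interpreter. The sketch below uses only the standard library; the package list and the helper names `installed` and `report` are assumptions for illustration. It only tests whether a package can be located, not whether its CUDA kernels work, so the `python -m xformers.info` and `python -m bitsandbytes` checks are still needed.

```python
import importlib.util

# Packages named in the troubleshooting steps; adjust the list as needed.
PACKAGES = ["torch", "triton", "xformers", "bitsandbytes"]


def installed(name):
    """Return True if the import system can locate `name` (without importing it)."""
    return importlib.util.find_spec(name) is not None


def report(packages=PACKAGES):
    """Map each package name to whether it was found."""
    return {name: installed(name) for name in packages}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name:<14}{'found' if found else 'MISSING'}")
```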

Documentation

Go to our official documentation for saving to GGUF, checkpointing, evaluation and more!
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer or even plain PyTorch code!
We're in Hugging Face's official docs! Check out the SFT docs and the DPO docs!

    from unsloth import FastLanguageModel
    from unsloth import is_bfloat16_supported
    import torch
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

    # Get LAION dataset
    url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
    dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

    # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
    fourbit_models = [
        "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
        "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
        "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
        "unsloth/llama-3-8b-Instruct-bnb-4bit",
        "unsloth/llama-3-70b-bnb-4bit",
        "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
        "unsloth/Phi-3-medium-4k-instruct",
        "unsloth/mistral-7b-bnb-4bit",
        "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
    ] # More models at https://huggingface.co/unsloth

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/llama-3-8b-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )

    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
        random_state = 3407,
        max_seq_length = max_seq_length,
        use_rslora = False,  # We support rank stabilized LoRA
        loftq_config = None, # And LoftQ
    )

    trainer = SFTTrainer(
        model = model,
        train_dataset = dataset,
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        tokenizer = tokenizer,
        args = TrainingArguments(
            per_device_train_batch_size = 2,
            gradient_accumulation_steps = 4,
            warmup_steps = 10,
            max_steps = 60,
            fp16 = not is_bfloat16_supported(), # fall back to fp16 only when bf16 is unavailable
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            output_dir = "outputs",
            optim = "adamw_8bit",
            seed = 3407,
        ),
    )
    trainer.train()

    # Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
    # (1) Saving to GGUF / merging to 16bit for vLLM
    # (2) Continued training from a saved LoRA adapter
    # (3) Adding an evaluation loop / OOMs
    # (4) Customized chat templates
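
To make the training arguments in the example concrete: with gradient accumulation, each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` sequences per device, and `max_steps` counts optimizer steps. The short sketch below works through that arithmetic for the values used above; the helper names are hypothetical and a single GPU is assumed.

```python
# Hypothetical helpers illustrating the batch arithmetic; not part of any library.

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Sequences consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_devices


def sequences_seen(max_steps, per_device_batch, grad_accum_steps, num_devices=1):
    """Total sequences processed after `max_steps` optimizer steps."""
    return max_steps * effective_batch_size(per_device_batch, grad_accum_steps, num_devices)


if __name__ == "__main__":
    # Values from the SFT example: batch size 2, accumulation 4, 60 steps.
    print(effective_batch_size(2, 4))  # 8 sequences per optimizer step
    print(sequences_seen(60, 2, 4))    # 480 sequences in the whole run
```

So the 60-step run in the example is a short demonstration that touches 480 training sequences, not a full pass over the dataset.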

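DPO Support

DPO (Direct Preference Optimization), PPO and Reward Modelling all seem to work, as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4 here: notebook.

We're in Hugging Face's official docs! We're on the SFT docs and the DPO docs!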

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID

    from unsloth import FastLanguageModel, PatchDPOTrainer
    from unsloth import is_bfloat16_supported
    PatchDPOTrainer()
    import torch
    from transformers import TrainingArguments
    from trl import DPOTrainer

    max_seq_length = 2048 # Not defined in the original snippet; assumed, same value as the SFT example

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/zephyr-sft-bnb-4bit",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
    )

    # Do model patching and add fast LoRA weights
    model = FastLanguageModel.get_peft_model(
        model,
        r = 64,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 64,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
        random_state = 3407,
        max_seq_length = max_seq_length,
    )

    dpo_trainer = DPOTrainer(
        model = model,
        ref_model = None,
        args = TrainingArguments(
            per_device_train_batch_size = 4,
            gradient_accumulation_steps = 8,
            warmup_ratio = 0.1,
            num_train_epochs = 3,
            fp16 = not is_bfloat16_supported(), # fall back to fp16 only when bf16 is unavailable
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            seed = 42,
            output_dir = "outputs",
        ),
        beta = 0.1,
        train_dataset = YOUR_DATASET_HERE,
        # eval_dataset = YOUR_DATASET_HERE,
        tokenizer = tokenizer,
        max_length = 1024,
        max_prompt_length = 512,
    )
    dpo_trainer.train()
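
For readers unfamiliar with what `beta` controls, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is a plain-Python illustration of the published DPO objective, not the code that TRL or Unsloth runs; the function name and argument names are hypothetical, and real trainers operate on batched tensors.

```python
import math

# Illustrative only: the textbook DPO objective for one (chosen, rejected) pair.

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the chosen answer more than the reference does,
    # so the loss falls below log(2), its value when there is no preference.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1))
    print(math.log(2))
```

A larger `beta` scales the margin more aggressively, which penalizes the policy more strongly for drifting from the reference model's relative preferences.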

Detailed Benchmarking Tables

Click "Code" for fully reproducible examples.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remain identical.
For the full list of benchmarking tables, go to our website.

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.04x	1.98x	2.48x	5.32x	15.64x
code	Code	Code	Code	Code
seconds	1040	1001	525	419	196	67
memory MB	18235	15365	9631	8525
% saved		15.74	47.18	53.25
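
The percent-saved row of the table above follows directly from the memory row: each value is the reduction relative to the Hugging Face baseline. A minimal sketch of that calculation, using the Alpaca figures from this table (the helper name is hypothetical):

```python
# Hypothetical helper reproducing the percent-saved row from the memory row.

def percent_saved(baseline_mb, method_mb):
    """Memory reduction relative to the baseline, as a percentage with 2 decimals."""
    return round(100.0 * (baseline_mb - method_mb) / baseline_mb, 2)


if __name__ == "__main__":
    baseline = 18235  # Hugging Face, memory in MB
    for name, mb in [("Flash Attention 2", 15365),
                     ("Unsloth Open", 9631),
                     ("Unsloth Equal", 8525)]:
        print(f"{name}: {percent_saved(baseline, mb)}% saved")
```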

Llama-Factory 3rd party benchmarking

Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.

Method	Bits	TGS	GRAM	Speed
HF	16	2392	18GB	100%
HF+FA2	16	2954	17GB	123%
Unsloth+FA2	16	4007	16GB	168%
HF	4	2415	9GB	101%
Unsloth+FA2	4	3726	7GB	160%

Mistral 7b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Mistral 7B Slim Orca	1x	1.15x	2.15x	2.53x	4.61x	13.69x
code	Code	Code	Code	Code
seconds	1813	1571	842	718	393	132
memory MB	32853	19385	12465	10271
% saved		40.99	62.06	68.74

CodeLlama 34b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Code Llama 34B	OOM	0.99x	1.87x	2.61x	4.27x	12.82x
code	▶️ Code	Code	Code	Code
seconds	1953	1982	1043	748	458	152
memory MB	40000	33217	27413	22161
% saved		16.96	31.47	44.60

1 Tesla T4

1 T4 16GB	Hugging Face	Flash Attention	Unsloth Open	Unsloth Pro Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.09x	1.69x	1.79x	2.93x	8.3x
code	▶️ Code	Code	Code	Code
seconds	1599	1468	942	894	545	193
memory MB	7199	7059	6459	5443
% saved		1.94	10.28	24.39

2 Tesla T4s via DDP

2 T4 DDP	Hugging Face	Flash Attention	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	0.99x	4.95x	4.44x	7.28x	20.61x
code	▶️ Code	Code	Code
seconds	9882	9946	1996	2227	1357	480
memory MB	9176	9128	6904	6782
% saved		0.52	24.76	26.09

Performance comparisons on 1 Tesla T4 GPU:

Time taken for 1 epoch

One Tesla T4 on Google Colab: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Hugging Face	1 T4	23h 15m	56h 28m	8h 38m	391h 41m
Unsloth Open	1 T4	13h 7m (1.8x)	31h 47m (1.8x)	4h 27m (1.9x)	240h 4m (1.6x)
Unsloth Pro	1 T4	3h 6m (7.5x)	5h 17m (10.7x)	1h 7m (7.7x)	59h 53m (6.5x)
Unsloth Max	1 T4	2h 39m (8.8x)	4h 31m (12.5x)	0h 58m (8.9x)	51h 30m (7.6x)
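
The speedup factors in parentheses in the table above are the baseline epoch time divided by each system's epoch time. The sketch below reproduces a few of them from durations written as hours and minutes (for example `23h 15m`); the helper names are hypothetical.

```python
import re

# Hypothetical helpers for recomputing the speedup factors from the durations.

def to_minutes(text):
    """Convert a duration such as '23h 15m' to a number of minutes."""
    match = re.fullmatch(r"\s*(\d+)h\s*(\d+)m\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(group) for group in match.groups())
    return hours * 60 + minutes


def speedup(baseline, candidate):
    """Baseline time divided by candidate time, rounded to one decimal."""
    return round(to_minutes(baseline) / to_minutes(candidate), 1)


if __name__ == "__main__":
    # Alpaca (52K) column: Hugging Face baseline against each Unsloth variant.
    for name, duration in [("Unsloth Open", "13h 7m"),
                           ("Unsloth Pro", "3h 6m"),
                           ("Unsloth Max", "2h 39m")]:
        print(f"{name}: {speedup('23h 15m', duration)}x")
```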

Peak Memory Usage

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Hugging Face	1 T4	7.3GB	5.9GB	14.0GB	13.3GB
Unsloth Open	1 T4	6.8GB	5.7GB	7.8GB	7.7GB
Unsloth Pro	1 T4	6.4GB	6.4GB	6.4GB	6.4GB
Unsloth Max	1 T4	11.4GB	12.4GB	11.9GB	14.4GB

Performance comparisons on 2 Tesla T4 GPUs via DDP:

**Time taken for 1 epoch**

Two Tesla T4s on Kaggle: bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Hugging Face	2 T4	84h 47m	163h 48m	30h 51m	1301h 24m *
Unsloth Pro	2 T4	3h 20m (25.4x)	5h 43m (28.7x)	1h 12m (25.7x)	71h 40m (18.1x) *
Unsloth Max	2 T4	3h 4m (27.6x)	5h 14m (31.3x)	1h 6m (28.1x)	54h 20m (23.9x) *

Peak Memory Usage on a Multi GPU System (2 GPUs)

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Hugging Face	2 T4	8.4GB | 6GB	7.2GB | 5.3GB	14.3GB | 6.6GB	10.9GB | 5.9GB *
Unsloth Pro	2 T4	7.7GB | 4.9GB	7.5GB | 4.9GB	8.5GB | 4.9GB	6.2GB | 4.7GB *
Unsloth Max	2 T4	10.5GB | 5GB	10.6GB | 5GB	10.6GB | 5GB	10.5GB | 5GB *