nnScaler is a parallelization engine that compiles a deep neural network (DNN) model designed for single-GPU execution into a program capable of running in parallel across multiple GPUs.
Ease of Use: Only a few lines of code need to be changed to enable automated parallelization.
Pythonic: The parallelization output is in PyTorch code, making it easy for users to understand and convenient for further development or customization.
Extensibility: nnScaler exposes an API to support new operators for emerging models.
Reliability: Verified through various end-to-end training sessions, nnScaler is a dependable system.
Performance: By exploring a large parallelization space, nnScaler can significantly enhance parallel training performance.
DNN scientists can concentrate on model design with PyTorch on a single GPU, leaving parallelization complexities to nnScaler. nnScaler introduces innovative parallelism techniques that surpass existing methods in performance. Additionally, it supports the extension of DNN modules with new structures or execution patterns, enabling users to parallelize their custom DNN models.
DNN system experts can leverage nnScaler to explore new DNN parallelization mechanisms and policies for emerging models. Users can provide user-defined functions for new operators that nnScaler does not recognize, which allows novel DNN models to be parallelized seamlessly, for example to facilitate long-sequence support in LLMs.
Install the following packages before installing nnScaler:
Python >= 3.8, < 3.11 (3.10 is recommended)
PyTorch >= 2.0, < 2.4 (2.2.0 is recommended)
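The version ranges above can be checked before installing. The following is a minimal sketch and not part of nnScaler: the helper names `parse_version` and `in_range` are hypothetical, and it assumes version strings start with dotted numbers (a local suffix such as `+cu118` is ignored).

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '2.2.0+cu118' into (2, 2, 0); only the leading numeric fields are kept."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a dotted version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_range(version_text, low, high):
    """Return True if low <= version < high (hypothetical helper)."""
    return low <= parse_version(version_text) < high


if __name__ == "__main__":
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    print("Python in range:", in_range(python_version, (3, 8), (3, 11)))
    try:
        # Read the installed package metadata instead of importing torch.
        print("PyTorch in range:", in_range(version("torch"), (2, 0), (2, 4)))
    except PackageNotFoundError:
        print("PyTorch is not installed")
```

Comparing tuples rather than strings avoids ordering `3.10` before `3.8`.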
Execute the commands below in the nnScaler directory:
pip install -r requirements.txt
pip install -e .
In addition, to avoid a cppimport error, add the nnScaler directory to the PYTHONPATH environment variable:
export NNSCALER_HOME=$(pwd)
export PYTHONPATH=${NNSCALER_HOME}:$PYTHONPATH
Install the packages required to run Llama-3. Note that installing flash-attn requires a specific version of the CUDA library. For example, CUDA 11.8 is needed when using PyTorch 2.2.0.
python -m pip install transformers==4.40.0 flash-attn==2.5.5 tensorboard
Obtain access to the Llama-3 model from HuggingFace. You will receive an access token, which should be set as an environment variable:
export HF_TOKEN=<HUGGINGFACE_ACCESS_TOKEN>
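Because the token is read from the environment, a script can check for it up front rather than failing partway through a model download. This is a minimal sketch under that assumption; `require_hf_token` is a hypothetical helper, not an nnScaler or HuggingFace API.

```python
import os


def require_hf_token(environ=None):
    """Return the value of HF_TOKEN, or raise if it is missing or blank.

    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    if environ is None:
        environ = os.environ
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export your HuggingFace access token first."
        )
    return token
```

Calling `require_hf_token()` at the top of a training script surfaces a missing token before any download starts.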
You can find all the example code at examples/llama3_8B_128K. As shown below, a user needs to:
Wrap the Model: Include loss computation and other necessary components.
Configure Components: Set up the model, optimizer, and dataloader.
Initialize and Start: In the main function, create an nnScaler trainer with the above configurations and start the training process.
# import the nnScaler built-in parallelization-capable trainer
from nnscaler.cli.trainer import Trainer

# wrap model to include loss computing, etc.
class WrapperModel(torch.nn.Module):
    def __init__(self, model_id):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation='flash_attention_2')

    def forward(self, samples):
        outputs = self.model.model(
            input_ids=samples['net_input']['src_tokens'],
            use_cache=False,
            return_dict=False,
        )
        loss = torch.sum(chunk_linear_cross_entropy(outputs[0], self.model.lm_head.weight, samples['target'], ...))
        return loss, samples['ntokens'], samples['nsentences']

def main(args):
    # data config
    dataloader_config = ...

    # model config
    model_config = ModelConfig(
        type=WrapperModel,
        args={
            'model_id': args.model_id,
        },
    )

    # optimizer hyperparameters
    optimizer_config = OptimizerConfig(
        type=MixedPrecisionAdamW,
        args={'lr': 2e-5, 'betas': (0.9, 0.95), 'weight_decay': 0.0, 'fused': True},
        #...
    )

    #...

    # setup trainer with configs of dataloader/model/optimizer, etc.
    trainer = Trainer(train_args=TrainerArgs(
        #...
        model=model_config,
        optimizer=optimizer_config,
        dataloader=dataloader_config,
        #...
    ))
    trainer.run()
Then we can start the example. nnScaler completes all the parallelization tasks automatically.
cd examples/llama3_8B_128K

# prepare training data:
python bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096

# build the mini model
python create_mini_model.py --model_id meta-llama/Meta-Llama-3-8B-Instruct --output_id ./llama3_mini

# compile and run using data parallelism + zero1
torchrun --nproc_per_node=2 train.py --plan_ngpus 1 --runtime_ngpus 2 --name llama3_debug --model_id ./llama3_mini --dataset_path ./bookcorpus_llama3_4K
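The last command plans for one GPU (`--plan_ngpus 1`) and runs on two (`--runtime_ngpus 2`) under data parallelism. One way to read these flags is that the plan is replicated `runtime_ngpus / plan_ngpus` times. That reading is an assumption made for illustration, not something stated in this document or taken from nnScaler's code. `replica_count` is a hypothetical helper.

```python
def replica_count(plan_ngpus, runtime_ngpus):
    """Number of plan copies under the ASSUMED rule runtime_ngpus = k * plan_ngpus."""
    if plan_ngpus <= 0 or runtime_ngpus <= 0:
        raise ValueError("GPU counts must be positive")
    if runtime_ngpus % plan_ngpus != 0:
        # Assumption: the runtime GPU count is a whole multiple of the plan size.
        raise ValueError("runtime_ngpus is assumed to be a multiple of plan_ngpus")
    return runtime_ngpus // plan_ngpus


# Values from the torchrun command above: one planned GPU, two runtime GPUs.
print(replica_count(plan_ngpus=1, runtime_ngpus=2))  # prints 2
```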
We also provide an example to demonstrate how to parallelize a model through a PyTorch Lightning-compatible interface in nnScaler.
Find the nanoGPT example in the nnScaler repo:
cd examples/nanogpt
Install nanoGPT's dependencies:
pip install -r requirements.txt
Prepare dataset:
python nanoGPT/data/shakespeare_char/prepare.py
Test with Single GPU
Now you can run train_nnscaler.py with torchrun <https://pytorch.org/docs/stable/elastic/run.html>:
torchrun --nproc_per_node=1 train_nnscaler.py nanoGPT/config/train_shakespeare_char.py
This will train a baby GPT model on a single GPU. It will take several minutes and the best validation loss will be around 1.47.
Test with Multi-GPU
By default, nnScaler parallelizes a model over GPUs with data parallelism. If you have 4 GPUs on one node:
torchrun --nproc_per_node=4 train_nnscaler.py nanoGPT/config/train_shakespeare_char.py
Or if you have multiple nodes, for example 2 nodes with 4 GPUs each:
# on each node
torchrun --nnodes=2 --nproc_per_node=4 --rdzv-id=NNSCALER_NANOGPT --rdzv-backend=c10d --rdzv-endpoint=<IP> train_nnscaler.py nanoGPT/config/train_shakespeare_char.py
NOTE: The local batch size is fixed by default, so using more workers will result in a larger global batch size.
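The effect described in the note can be worked out directly: with a fixed local batch size, the global batch size is the local size multiplied by the total number of workers. The sketch below assumes one worker per GPU and no gradient accumulation. The local batch size of 64 is an illustrative value, not one taken from this document.

```python
def global_batch_size(local_batch_size, nnodes, nproc_per_node):
    """Global batch size when every worker processes `local_batch_size` samples per step."""
    workers = nnodes * nproc_per_node
    return local_batch_size * workers


local = 64  # illustrative value only
print(global_batch_size(local, nnodes=1, nproc_per_node=1))  # 64
print(global_batch_size(local, nnodes=1, nproc_per_node=4))  # 256
print(global_batch_size(local, nnodes=2, nproc_per_node=4))  # 512
```

Since the global batch size grows with the worker count, hyperparameters tuned on a single GPU, such as the learning rate, may need revisiting when scaling out.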
For advanced usage, please stay tuned for our future releases.
nnScaler has been adopted by multiple projects, including both product and research explorations:
(YOCO) You only cache once: Decoder-decoder architectures for language models
LongRoPE: Extending LLM context window beyond 2 million tokens
Post training for the long context version of Phi-3 series
You may find the Artifact Evaluation for OSDI'24 with the guidance here. Please cite nnScaler in your publications if it helps your research:
@inproceedings{lin2024nnscaler,
  title = {nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training},
  author = {Lin, Zhiqi and Miao, Youshan and Zhang, Quanlu and Yang, Fan and Zhu, Yi and Li, Cheng and Maleki, Saeed and Cao, Xu and Shang, Ning and Yang, Yilei and Xu, Weijiang and Yang, Mao and Zhang, Lintao and Zhou, Lidong},
  booktitle = {18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
  pages = {347--363},
  year = {2024}
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third-party's policies.
You can find our public repo at https://github.com/microsoft/nnscaler or the Microsoft internal repo at https://aka.ms/ms-nnscaler. For any questions or inquiries, please contact us at [email protected].