Recently, a variety of foundation models that act as the brains of generative AI have been released, and many companies are evaluating or building applications on top of them. However, running inference for large models on a single GPU is difficult, and serving them in production or fine-tuning them is not straightforward.
This hands-on is written for those who want to quickly evaluate generative AI and move it into production. It provides a step-by-step guide to efficiently fine-tuning and serving large-scale Korean models on AWS infrastructure.
Fine-tuning

- 1_prepare-dataset-alpaca-method.ipynb: Prepares a training dataset from the instruction dataset. This method tokenizes each sample individually.
- 1_prepare-dataset-chunk-method.ipynb: Prepares a training dataset from the instruction dataset. This method concatenates all samples and splits them into fixed-size chunks (see the sketch after this list).
- 2_local-train-debug-lora.ipynb: Debugs with a small amount of sample data in the development environment before running the full training on training instances. If you are already familiar with fine-tuning, skip this notebook and proceed to 3_sm-train-lora.ipynb.
- 3_sm-train-lora.ipynb: Performs fine-tuning on SageMaker training instances (see the launch sketch after this list).
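The chunk method follows the common Hugging Face packing pattern: tokenize every sample, concatenate the token streams, and slice them into fixed-size blocks. The sketch below illustrates the idea; the model ID, data file, "text" column, and block size are placeholders rather than the notebook's actual values.

```python
# A minimal sketch of the chunk method (assumes a JSON Lines file with a "text"
# column holding already-formatted prompts; model ID and block size are placeholders).
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/KoAlpaca-Polyglot-5.8B")  # placeholder
block_size = 1024  # chunk size; match the model's context length in practice

def tokenize(batch):
    # Tokenize the formatted instruction prompts.
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate every field across the batch, then slice into fixed-size chunks.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()  # causal LM: labels mirror inputs
    return result

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder path
lm_dataset = (
    dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
           .map(group_texts, batched=True)
)
print(lm_dataset)
```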
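For training on SageMaker, the job is launched with the SageMaker Hugging Face estimator. The sketch below shows the general pattern; the entry point, source directory, framework versions, hyperparameters, and S3 path are illustrative assumptions, not the notebook's exact configuration.

```python
# A minimal sketch of launching the fine-tuning job as a SageMaker training job.
# Entry point, hyperparameters, and S3 path are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = HuggingFace(
    entry_point="train.py",          # hypothetical training script
    source_dir="src",                # hypothetical source directory
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
    hyperparameters={
        "epochs": 1,
        "lora_r": 8,                 # LoRA rank (example value)
        "lora_alpha": 32,
        "per_device_train_batch_size": 2,
    },
)

# Assumes the prepared dataset was uploaded to S3 beforehand.
estimator.fit({"training": f"s3://{sess.default_bucket()}/korean-llm/train"})
```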
Serving

- 1_local-inference.ipynb: Loads the model from the Hugging Face Hub and performs simple inference. Although it is not required, we recommend starting here if you want to try out the model.
- 2_local-inference-deepspeed.py & 2_run.sh: Experiment with DeepSpeed distributed inference (see the sketch after this list). An instance or server equipped with multiple GPUs is recommended (e.g., ml.g5.12xlarge).
- 3_sm-serving-djl-deepspeed-from-hub.ipynb: Performs SageMaker model serving with the SageMaker DJL (Deep Java Library) serving container and DeepSpeed distributed inference. The hosting server downloads the model directly from the Hugging Face Hub.
- 3_sm-serving-djl-deepspeed-from-s3.ipynb: Performs SageMaker model serving with the SageMaker DJL (Deep Java Library) serving container and DeepSpeed distributed inference. The hosting server downloads the model from S3; the download is very fast because s5cmd fetches the files in parallel internally.
- 3_sm-serving-tgi-from-hub.ipynb: Performs SageMaker model serving with the SageMaker TGI (Text Generation Inference) serving container (see the deployment sketch after this list). TGI is a distributed inference server developed by Hugging Face and delivers very fast inference.
- 3_sm-serving-djl-fastertransformer-nocode.ipynb: Performs SageMaker model serving with the SageMaker DJL (Deep Java Library) serving container and NVIDIA FasterTransformer distributed inference. It is faster than DeepSpeed, but only for supported models.
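The local DeepSpeed experiment shards the model across the GPUs of a single host. Below is a minimal sketch of DeepSpeed tensor-parallel inference, assuming a 4-GPU instance such as ml.g5.12xlarge; the model ID and prompt are placeholders, and the actual script and launch command live in 2_local-inference-deepspeed.py and 2_run.sh.

```python
# A minimal sketch of DeepSpeed tensor-parallel inference on one multi-GPU host.
# Launch with the DeepSpeed launcher, e.g.: deepspeed --num_gpus 4 this_script.py
# The model ID and prompt are placeholders.
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/KoAlpaca-Polyglot-5.8B"          # placeholder Hub model ID
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Shard the model across the visible GPUs with DeepSpeed-Inference.
engine = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = engine.module

prompt = "아마존 웹 서비스(AWS)란 무엇인가요?"
inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```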
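For TGI-based serving, deployment generally amounts to pointing the TGI container at a Hub model ID through environment variables. The sketch below shows this pattern with the SageMaker Python SDK; the model ID, container version, instance type, and token limits are assumptions for illustration and may differ from the notebook.

```python
# A minimal sketch of deploying a Hub model behind the SageMaker TGI container.
# Model ID, container version, instance type, and token limits are assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")  # TGI image

model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "beomi/KoAlpaca-Polyglot-12.8B",  # placeholder Hub model ID
        "SM_NUM_GPUS": "4",             # tensor-parallel degree = number of GPUs
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,  # large models take a while to load
)

print(predictor.predict({"inputs": "아마존 웹 서비스(AWS)란 무엇인가요?"}))
```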
To perform this hands-on, we recommend preparing an instance with the specifications below. Alternatively, you can use SageMaker Studio Lab or SageMaker Studio.

For notebooks that do not require a GPU (e.g., dataset preparation):
- ml.t3.medium (minimum specification)
- ml.m5.xlarge (recommended)

For fine-tuning and local inference:
- ml.g5.2xlarge (minimum specification)
- ml.g5.12xlarge (recommended)

For serving:
- ml.g5.2xlarge: models with 7B parameters or less
- ml.g5.12xlarge (recommended)

This sample code is provided under the MIT-0 License. Please refer to the license file.