🤗 Models on Hugging Face | Blog | Website | Get Started
We are unleashing the power of large language models. Our latest version of Llama is now available to individuals, creators, researchers and businesses of all sizes so they can responsibly experiment, innovate and scale their ideas.
This release includes model weights and starting code for pre-trained and instruction-tuned Llama 3 language models, in 8B and 70B parameter sizes.
This repository is intended as a minimal example of loading a Llama 3 model and running inference. See llama-recipes for more detailed examples.
In order to download the model weights and tokenizer, please visit the Meta Llama website and accept our license agreement.
After submitting your request, you will receive a signed URL via email. Then run the download.sh script, passing the provided URL when prompted to start the download.
Prerequisite: make sure you have `wget` and `md5sum` installed. Then run the script: `./download.sh`.
Keep in mind that the link expires after 24 hours and a certain number of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link.
We also offer downloads on Hugging Face, in both transformers and native `llama3` formats. To download the weights from Hugging Face, follow these steps:
To download the original native weights to use with this repo, download the contents of the `original` folder. You can also download them from the command line if you `pip install huggingface-hub`:

```
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct
```
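If you prefer to script the download from Python, the same filter can be expressed with the `huggingface_hub` library's `snapshot_download` helper; this is a minimal sketch of an alternative, not a command from this repository:

```python
# Sketch: programmatic equivalent of the huggingface-cli command above,
# using the huggingface_hub library (pip install huggingface-hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    allow_patterns=["original/*"],  # only the native llama3 weights
    local_dir="meta-llama/Meta-Llama-3-8B-Instruct",
)
```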
To use with transformers, the following pipeline snippet will download and cache the weights:

```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
```
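Once the pipeline is constructed, calling it on a prompt runs generation. A minimal usage sketch follows; the prompt string and generation arguments are purely illustrative:

```python
# Sketch: run the pipeline built above on a simple prompt.
outputs = pipeline(
    "Hey, how are you doing today?",
    max_new_tokens=64,  # illustrative cap on generated tokens
    do_sample=True,
)
print(outputs[0]["generated_text"])
```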
You can get started with Llama 3 quickly by following the steps below, which let you run fast inference locally. For more examples, see the llama-recipes repository.
1. Clone and download this repository in a conda environment with PyTorch/CUDA installed.
2. In the top-level directory, run `pip install -e .`
3. Visit the Meta Llama website and register to download the models.
4. After registering, you will receive an email with the URL to download the model. You will need this URL when running the download.sh script.
5. Once you receive the email, navigate to the llama repository you downloaded and run the download.sh script.
6. After downloading the required model, you can run it locally using the command below:
```
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```
Note
- Replace `Meta-Llama-3-8B-Instruct/` with the path to your checkpoint directory and `Meta-Llama-3-8B-Instruct/tokenizer.model` with the path to your tokenizer model.
- `--nproc_per_node` should be set to the MP value of the model you are using.
- Adjust the `max_seq_len` and `max_batch_size` parameters as needed.

Different models require different model parallelism (MP) values:
| Model | MP |
|-------|----|
| 8B    | 1  |
| 70B   | 8  |
All models support sequence lengths up to 8192 tokens, but we pre-allocate the cache based on the values of `max_seq_len` and `max_batch_size`. Therefore, set these values according to your hardware.
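To get a feel for why these two values matter, here is a rough back-of-the-envelope sketch of the pre-allocated KV-cache size. The 8B architecture numbers (32 layers, 8 KV heads, head dimension 128) and the bf16 assumption are mine, not stated in this README:

```python
# Rough sketch of KV-cache memory pre-allocated for the 8B model (assumed
# config: 32 layers, 8 KV heads, head_dim 128, bf16 = 2 bytes per element).
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
max_batch_size, max_seq_len = 6, 512

# Factor of 2 for the key cache plus the value cache.
cache_bytes = (2 * n_layers * max_batch_size * max_seq_len
               * n_kv_heads * head_dim * bytes_per_elem)
print(f"{cache_bytes / 2**20:.0f} MiB")  # ~384 MiB with these settings
```

Raising `max_seq_len` to the full 8192 or increasing `max_batch_size` scales this allocation linearly, which is why the README tells you to size both to your hardware.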
These models are not fine-tuned for chat or Q&A. Prompts should be set up so that the expected answer is a natural continuation of the prompt.
See `example_text_completion.py` for some examples. For illustration, see the command below to run it using the llama-3-8b model (`nproc_per_node` needs to be set to the MP value):
```
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir Meta-Llama-3-8B/ \
    --tokenizer_path Meta-Llama-3-8B/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```
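As a purely illustrative sketch of what "a natural continuation of the prompt" means for the pretrained (non-instruct) models, the prompts below are written so that the answer simply continues the text; the exact strings are hypothetical, not taken from `example_text_completion.py`:

```python
# Illustrative continuation-style prompts for the pretrained (non-chat) model.
prompts = [
    # The model should complete the sentence.
    "Simply put, the theory of relativity states that ",
    # Few-shot pattern: the model is expected to continue the translation list.
    """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>""",
]
```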
Fine-tuned models are trained for conversational applications. To obtain their expected characteristics and performance, prompts need to follow the specific format defined in `ChatFormat`: a prompt starts with the special token `<|begin_of_text|>`, followed by one or more messages. Each message starts with the tag `<|start_header_id|>`, contains the role (`system`, `user`, or `assistant`), and ends with the tag `<|end_header_id|>`. After a double newline (`\n\n`), the content of the message follows. The end of each message is marked by the `<|eot_id|>` token.
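Putting those pieces together, here is a minimal sketch of how such a prompt could be assembled by hand; the reference implementation is the `ChatFormat` class in this repository, and the helper below is only an illustration:

```python
# Minimal, illustrative assembly of a chat prompt from the special tokens
# described above (not the repository's ChatFormat implementation).
def format_chat(messages):
    prompt = "<|begin_of_text|>"
    for msg in messages:  # each msg is {"role": ..., "content": ...}
        prompt += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        prompt += msg["content"] + "<|eot_id|>"
    # Leave an open assistant header so the model generates the reply next.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(format_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]))
```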
You can also deploy additional classifiers to filter out inputs and outputs deemed unsafe. See an example in the llama-recipes repository on how to add safety checkers to the input and output of your inference code.
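The general pattern is to check the user input before generation and the model output after it. Below is a generic, hedged sketch of that wrapping; `generate` and `is_safe` are placeholders standing in for your inference call and a real classifier (such as one from llama-recipes), not APIs from that repository:

```python
# Generic sketch of wrapping inference with input/output safety checks.
# `generate` and `is_safe` are hypothetical callables, not llama-recipes APIs.
def guarded_chat(generate, is_safe, user_message):
    if not is_safe(user_message):
        return "Sorry, I can't help with that."
    reply = generate(user_message)
    return reply if is_safe(reply) else "[response withheld by safety filter]"
```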
Example using llama-3-8b-chat:
```
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 6
```
Llama 3 is a new technology that carries potential risks with use. Testing conducted to date has not, and could not, cover all scenarios. To help developers address these risks, we have created the Responsible Use Guide.
Please report any software "bug", or other problems with the models, through one of the following means:
- Reporting issues with the model: https://github.com/meta-llama/llama3/issues
- Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
- Reporting bugs and security concerns: facebook.com/whitehat/info
See MODEL_CARD.md.
Our models and weights are licensed to researchers and commercial entities, adhering to open principles. Our mission is to empower individuals and industries through this opportunity while promoting an environment of discovery and ethical AI advancement.
Please review the LICENSE document, as well as our accompanying Acceptable Use Policy.
For common questions, see the FAQ at https://llama.meta.com/faq; it will be updated as new questions arise.