Use HuggingFace's official download tools, huggingface-cli and hf_transfer, to download models and datasets at high speed from the HuggingFace mirror site.
This script is only a thin wrapper around huggingface-cli, written for my own convenience. If you need more advanced features, please refer to the official documentation and modify the script yourself. Users in mainland China can also follow the download method described on the HuggingFace mirror site.
12/17/2023 update: Added the --include and --exclude parameters to download only certain files or to skip certain files.
--include "tokenizer.model tokenizer_config.json"
--include "*.bin"
--exclude "*.md"
--include "*.json" --exclude "config.json"
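The selection logic behind these patterns can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: it assumes that the quoted string is split on spaces into separate glob patterns, that a file must match at least one include pattern when any are given, and that exclude patterns are applied afterwards. The helper name select_files and the sample file list are made up for the example.

```python
from fnmatch import fnmatchcase


def select_files(files, include="", exclude=""):
    """Return the files kept by space-separated include/exclude glob patterns."""
    include_patterns = include.split()  # empty list means "keep everything"
    exclude_patterns = exclude.split()
    kept = []
    for name in files:
        # Assumption: with include patterns present, a file must match one of them.
        if include_patterns and not any(fnmatchcase(name, p) for p in include_patterns):
            continue
        # Assumption: exclude patterns are applied after include patterns.
        if any(fnmatchcase(name, p) for p in exclude_patterns):
            continue
        kept.append(name)
    return kept


# Hypothetical repository listing, used only for illustration.
repo_files = [
    "README.md",
    "config.json",
    "tokenizer_config.json",
    "tokenizer.model",
    "pytorch_model.bin",
]

print(select_files(repo_files, include="*.json", exclude="config.json"))
print(select_files(repo_files, exclude="*.md"))
```

Under these assumptions, the last example above (--include "*.json" --exclude "config.json") keeps every JSON file except config.json.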
Obtain the required model name from HuggingFace, such as lmsys/vicuna-7b-v1.5:
python hf_download.py --model lmsys/vicuna-7b-v1.5 --save_dir ./hf_hub
To download a model that requires authorization, such as the meta-llama series, pass your HuggingFace access token with the --token parameter.
Things to note:
(1) If --save_dir is specified, the files are stored temporarily in the default transformers path ~/.cache/huggingface/hub during the download and are moved automatically to the directory given by --save_dir once the download completes, so make sure the default path has enough free space beforehand.
After downloading, pass the saved path when loading the model with the transformers library, for example:
from transformers import pipeline
pipe = pipeline("text-generation", model="./hf_hub/models--lmsys--vicuna-7b-v1.5")
If --save_dir is not specified, the files are downloaded to the default path ~/.cache/huggingface/hub. In that case you can load the model directly by its name, lmsys/vicuna-7b-v1.5.
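The folder name in the example path follows a pattern that can be derived from the model name. The helper below is a small sketch of that mapping. The function name is hypothetical, and the datasets-- prefix for datasets is an assumption based on the usual HuggingFace cache naming rather than something stated in this README.

```python
from pathlib import Path


def saved_folder_name(repo_id, repo_type="model"):
    """Map a repo id such as 'lmsys/vicuna-7b-v1.5' to its folder name on disk.

    Assumption: the folder is '<repo_type>s--' followed by the repo id with
    every '/' replaced by '--', as in the example path above.
    """
    return f"{repo_type}s--" + repo_id.replace("/", "--")


# Hypothetical usage: build the path to pass to transformers after a download
# that used --save_dir ./hf_hub.
save_dir = Path("./hf_hub")
model_path = save_dir / saved_folder_name("lmsys/vicuna-7b-v1.5")
print(model_path.name)
```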
(2) If you do not want to use an absolute path when loading the model and do not want to keep all models under the default path, you can set up a symbolic link. The steps are as follows:
mkdir /data/huggingface_cache
If the default path ~/.cache/huggingface/hub already exists, it needs to be deleted first (this removes everything already cached there): rm -r ~/.cache/huggingface
ln -s /data/huggingface_cache ~/.cache/huggingface
If --save_dir is not specified when running the download script later, the files are downloaded automatically to the directory created in the first step: python hf_download.py --model lmsys/vicuna-7b-v1.5
from transformers import pipeline
pipe = pipeline("text-generation", model="lmsys/vicuna-7b-v1.5")
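The symbolic-link steps above can also be done from Python, which makes it easier to guard against mistakes in the rm step. The sketch below is not part of the script: link_cache is a hypothetical helper, and the demonstration runs in a temporary directory instead of touching /data or ~/.cache.

```python
import shutil
import tempfile
from pathlib import Path


def link_cache(target: Path, link: Path) -> None:
    """Make `link` a symbolic link to `target`, mirroring the shell steps."""
    target.mkdir(parents=True, exist_ok=True)  # mkdir /data/huggingface_cache
    if link.is_symlink():
        link.unlink()  # replace an existing link instead of following it
    elif link.exists():
        shutil.rmtree(link)  # rm -r ~/.cache/huggingface
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(target, target_is_directory=True)  # ln -s target link


# Demonstration in a throwaway directory (paths are stand-ins for the real ones).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "data" / "huggingface_cache"
    link = root / "home" / ".cache" / "huggingface"
    link.mkdir(parents=True)  # simulate an already existing default cache
    link_cache(target, link)
    (link / "hub").mkdir()  # anything written through the link...
    demo_is_link = link.is_symlink()
    demo_lands_in_target = (target / "hub").is_dir()  # ...ends up in the target

print(demo_is_link, demo_lands_in_target)
```

For the real setup, the call would be link_cache(Path("/data/huggingface_cache"), Path.home() / ".cache" / "huggingface").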
(3) The script automatically installs huggingface-cli and hf_transfer through pip. If the hf_transfer version is lower than 0.1.4, the download progress bar is not displayed; in that case you can update it manually:
pip install -U hf-transfer -i https://pypi.org/simple
If a huggingface-cli: error occurs, try reinstalling:
pip install -U huggingface_hub
If an error related to hf_transfer occurs, you can turn off hf_transfer with the --use_hf_transfer False parameter.
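To find out whether the installed hf_transfer is older than 0.1.4 before deciding to upgrade, a check along the following lines may help. It is a sketch only: the function names are hypothetical, the version parsing handles just the leading three numeric components, and it assumes the package metadata is registered under the name hf_transfer.

```python
import re
from importlib import metadata

MIN_VERSION = (0, 1, 4)  # first version that shows the progress bar, per the note above


def parse_version(text):
    """Turn a string such as '0.1.4' into (0, 1, 4)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def progress_bar_supported(installed):
    """Return True if the given hf_transfer version is at least 0.1.4."""
    return parse_version(installed) >= MIN_VERSION


def describe_installed():
    """Report the state of the local hf_transfer installation as a message."""
    try:
        installed = metadata.version("hf_transfer")
    except metadata.PackageNotFoundError:
        return "hf_transfer is not installed"
    try:
        supported = progress_bar_supported(installed)
    except ValueError:
        return f"hf_transfer {installed}: could not parse version"
    if supported:
        return f"hf_transfer {installed}: progress bar available"
    return f"hf_transfer {installed}: upgrade with pip install -U hf-transfer"


print(describe_installed())
```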
Downloading a dataset works the same way as downloading a model. Taking zh-plus/tiny-imagenet as an example:
python hf_download.py --dataset zh-plus/tiny-imagenet --save_dir ./hf_hub
--model: The name of the model to download from HuggingFace, for example --model lmsys/vicuna-7b-v1.5
--dataset: The name of the dataset to download from HuggingFace, for example --dataset zh-plus/tiny-imagenet
--save_dir: The actual storage path of the files after downloading
--token: Required when downloading a model that needs login (gated model), such as meta-llama/Llama-2-7b-hf; specify your HuggingFace token in the format hf_****
--use_hf_transfer: Use hf-transfer to accelerate downloads. Enabled by default (True). If the hf_transfer version is lower than 0.1.4, the progress bar is not displayed.
--use_mirror: Download from the mirror site https://hf-mirror.com/. Enabled by default (True); recommended for users in mainland China.
--include: Download only the specified files, for example --include "tokenizer.model tokenizer_config.json" or --include "*.bin"
--exclude: Do not download the specified files; usage is the same as --include, for example --exclude "*.md"
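For readers who want to adapt the script, the sketch below shows one way these parameters could be translated into a huggingface-cli call. It is not the script's actual implementation. The function name is hypothetical, and passing --save_dir as --cache-dir is an assumption suggested by the models--... folder layout shown earlier. The environment variables HF_HUB_ENABLE_HF_TRANSFER and HF_ENDPOINT are the ones huggingface_hub reads for hf_transfer and for a custom endpoint.

```python
def build_download_command(repo_id, repo_type="model", save_dir=None, token=None,
                           include="", exclude="",
                           use_hf_transfer=True, use_mirror=True):
    """Return (argv, env_overrides) for a huggingface-cli download call."""
    cmd = ["huggingface-cli", "download", repo_id, "--repo-type", repo_type]
    if save_dir:
        cmd += ["--cache-dir", save_dir]  # assumption: save_dir maps to the cache dir
    if token:
        cmd += ["--token", token]
    if include:
        cmd += ["--include", *include.split()]  # space-separated patterns
    if exclude:
        cmd += ["--exclude", *exclude.split()]

    env_overrides = {}
    if use_hf_transfer:
        env_overrides["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    if use_mirror:
        env_overrides["HF_ENDPOINT"] = "https://hf-mirror.com"
    return cmd, env_overrides


cmd, env_overrides = build_download_command(
    "lmsys/vicuna-7b-v1.5",
    save_dir="./hf_hub",
    include="*.json",
    exclude="config.json",
)
print(" ".join(cmd))
print(env_overrides)

# To actually run it (requires huggingface_hub to be installed and network access):
#   import os, subprocess
#   subprocess.run(cmd, env={**os.environ, **env_overrides}, check=True)
```

Returning only the environment overrides keeps the function free of side effects; the caller decides whether to merge them into os.environ.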