Pre-training has become an essential part of AI technology. TencentPretrain is a toolkit for pre-training and fine-tuning on data of different modalities (e.g. text and vision). TencentPretrain is characterized by its modular design. It facilitates the use of existing pre-training models and provides interfaces that users can extend. With TencentPretrain, we build a model zoo which contains pre-trained models of different properties. TencentPretrain inherits the open source toolkit UER (https://github.com/dbiir/UER-py/) and extends it to a multimodal pre-training framework.
TencentPretrain has the following features:
This section uses several common examples to demonstrate how to use TencentPretrain. More details are discussed in the Instructions section. We first use BERT (a text pre-training model) on a book review sentiment classification dataset. We pre-train the model on the book review corpus and then fine-tune it on the book review sentiment classification dataset. There are three input files: the book review corpus, the book review sentiment classification dataset, and the vocabulary. All files are encoded in UTF-8 and included in this project.
The format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines):
doc1-sent1
doc1-sent2
doc1-sent3

doc2-sent1

doc3-sent1
doc3-sent2
The book review corpus is obtained from the book review sentiment classification dataset. We remove the labels and split each review into two parts at the middle to construct a document with two sentences (see book_review_bert.txt in the corpora folder).
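The construction described above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce book_review_bert.txt: the function name `build_bert_corpus`, the character-level midpoint split, and the skipping of reviews shorter than two characters are all assumptions.

```python
import csv
import io


def build_bert_corpus(tsv_text):
    """Turn a labelled TSV (header: label, text_a) into BERT corpus text.

    Each review becomes one document with two sentences, obtained by
    splitting the review at its middle character. Documents are
    separated by an empty line.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    documents = []
    for row in reader:
        review = row["text_a"].strip()
        if len(review) < 2:
            continue  # assumption: too short to split into two sentences
        middle = len(review) // 2
        documents.append(review[:middle] + "\n" + review[middle:])
    return "\n\n".join(documents) + "\n"


# Hypothetical two-review sample in the classification dataset format.
sample = "label\ttext_a\n1\tgreat book, loved it\n0\tboring plot\n"
corpus = build_bert_corpus(sample)
print(corpus)
```

Splitting on characters rather than on sentence boundaries keeps the sketch language-agnostic, which suits Chinese text that has no whitespace between words.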
The format of the classification dataset is as follows:
label text_a
1 instance1
0 instance2
1 instance3
Label and instance are separated by \t. The first row is a list of column names. The label ID should be an integer between (and including) 0 and n-1 for n-way classification.
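Before fine-tuning, it can be useful to check that a dataset file follows these rules. The sketch below is a hypothetical helper (`validate_classification_tsv` is not part of TencentPretrain); it assumes the header contains at least the `label` and `text_a` columns shown above.

```python
def validate_classification_tsv(lines, labels_num):
    """Check the header and the label range; return the number of instances."""
    header = lines[0].rstrip("\n").split("\t")
    if "label" not in header or "text_a" not in header:
        raise ValueError("header must contain 'label' and 'text_a'")
    label_idx = header.index("label")
    count = 0
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} columns")
        label = int(fields[label_idx])
        # Labels must lie in [0, labels_num - 1] for n-way classification.
        if not 0 <= label < labels_num:
            raise ValueError(f"line {line_no}: label {label} out of range")
        count += 1
    return count


rows = ["label\ttext_a\n", "1\tinstance1\n", "0\tinstance2\n", "1\tinstance3\n"]
print(validate_classification_tsv(rows, labels_num=2))
```

To check a real file, pass the result of `open(path, encoding="utf-8").readlines()` as `lines`.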
We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.
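As a rough illustration of how such a vocabulary file is typically consumed, the sketch below maps each token to an integer ID. It assumes the usual BERT vocabulary layout of one token per line, with the ID given by the 0-based line index; `load_vocab` and the sample tokens are hypothetical and do not reproduce TencentPretrain's own loader or the real file contents.

```python
def load_vocab(lines):
    """Map each token to its 0-based line index."""
    vocab = {}
    for index, line in enumerate(lines):
        token = line.rstrip("\n")
        vocab.setdefault(token, index)  # keep the first ID if a token repeats
    return vocab


# Hypothetical miniature vocabulary, not the real google_zh_vocab.txt.
sample_lines = ["[PAD]\n", "[UNK]\n", "[CLS]\n", "[SEP]\n", "[MASK]\n", "书\n", "评\n"]
vocab = load_vocab(sample_lines)
print(len(vocab), vocab["书"])
```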
We first pre-process the book review corpus. In the pre-processing stage, the corpus is converted into the format required by the specified pre-training model (--data_processor):
python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --data_processor bert
Notice that six>=1.12.0 is required.
Pre-processing is time-consuming. Using multiple processes (--processes_num) can substantially speed it up. The BERT tokenizer is used by default (--tokenizer bert). After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py.
Then we download Google's pre-trained Chinese BERT model google_zh_model.bin (converted to TencentPretrain format from Google's original release) and put it in the models folder. We load the pre-trained Chinese BERT model and further pre-train it on the book review corpus.
A pre-training model is usually composed of embedding, encoder, and target layers. To build one, we have to specify these components: the configuration file (--config_path) lists the modules and hyper-parameters used by the pre-training model. More details can be found in models/bert/base_config.json. Suppose we have a machine with 8 GPUs:
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--pretrained_model_path models/google_zh_model.bin \
--config_path models/bert/base_config.json \
--output_model_path models/book_review_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32
mv models/book_review_model.bin-5000 models/book_review_model.bin
Note that the model saved by pretrain.py has a suffix attached that records the training step (here 5000, the value of --total_steps). We can remove the suffix for ease of use.
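When several checkpoints exist, picking the most recent one by hand is error-prone. The sketch below selects the file with the highest step suffix. It assumes checkpoints are named `<output_model_path>-<step>`, as in the `mv` command above, and that one is written every --save_checkpoint_steps steps; the helper `latest_checkpoint` is hypothetical.

```python
import re


def latest_checkpoint(filenames, output_model_path):
    """Return the checkpoint with the highest step suffix, or None."""
    pattern = re.compile(re.escape(output_model_path) + r"-(\d+)$")
    best_step, best_name = -1, None
    for name in filenames:
        match = pattern.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


# Assumed file names for --total_steps 5000 and --save_checkpoint_steps 1000.
files = [f"models/book_review_model.bin-{step}" for step in range(1000, 6000, 1000)]
print(latest_checkpoint(files, "models/book_review_model.bin"))
```

The returned name can then be passed to `os.replace` to strip the suffix, mirroring the `mv` command.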
Then we fine-tune the pre-trained model on the downstream classification dataset. We use the embedding and encoder layers of book_review_model.bin, which is the output of pretrain.py:
python3 finetune/run_classifier.py --pretrained_model_path models/book_review_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--test_path datasets/book_review/test.tsv \
--epochs_num 3 --batch_size 32
The default path of the fine-tuned classifier model is models/finetuned_model.bin. Note that the actual batch size of pre-training is --batch_size times --world_size (32 × 8 = 256 in the example above), while the actual batch size of a downstream task (e.g. classification) is --batch_size. Then we do inference with the fine-tuned model:
python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--test_path datasets/book_review/test_nolabel.tsv \
--prediction_path datasets/book_review/prediction.tsv \
--labels_num 2
--test_path specifies the path of the file to be predicted. The file should contain text_a column. --prediction_path specifies the path of the file with prediction results. We need to explicitly specify the number of labels by --labels_num. The above dataset is a two-way classification dataset.
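If only a labelled file is at hand, an input file with just the text_a column can be derived from it. The sketch below shows one way to do so; `strip_labels` is a hypothetical helper, and the source does not state how test_nolabel.tsv was actually produced.

```python
import csv
import io


def strip_labels(tsv_text):
    """Keep only the text_a column of a labelled TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = ["text_a"]  # header row expected by the inference script
    for row in reader:
        lines.append(row["text_a"])
    return "\n".join(lines) + "\n"


labelled = "label\ttext_a\n1\tinstance1\n0\tinstance2\n"
print(strip_labels(labelled))
```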
The above content provides basic ways of using TencentPretrain to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in the complete ➡️ quickstart ⬅️, which covers most pre-training application scenarios. We recommend reading the complete quickstart to make full use of the project.
This section provides links to a range of ➡️ pre-training data ⬅️ . TencentPretrain can load these pre-training data directly.
This section provides links to a range of ➡️ downstream datasets ⬅️ . TencentPretrain can load these datasets directly.
With the help of TencentPretrain, we pre-trained models of different properties (e.g. models based on different modalities, encoders, and targets). Detailed introduction of pre-trained models and their download links can be found in ➡️ modelzoo ⬅️ . All pre-trained models can be loaded by TencentPretrain directly.
TencentPretrain is organized as follows:
TencentPretrain/
|--tencentpretrain/
| |--embeddings/ # contains modules of embedding component
| |--encoders/ # contains modules of encoder component such as RNN, CNN, Transformer
| |--decoders/ # contains modules of decoder component
| |--targets/ # contains modules of target component such as language modeling, masked language modeling
| |--layers/ # contains frequently-used NN layers
| |--models/ # contains model.py, which combines modules of different components
| |--utils/ # contains frequently-used utilities
| |--model_builder.py
| |--model_loader.py
| |--model_saver.py
| |--opts.py
| |--trainer.py
|
|--corpora/ # contains pre-training data
|--datasets/ # contains downstream tasks
|--models/ # contains pre-trained models, vocabularies, and configuration files
|--scripts/ # contains useful scripts for pre-training models
|--finetune/ # contains fine-tuning scripts for downstream tasks
|--inference/ # contains inference scripts for downstream tasks
|
|--preprocess.py
|--pretrain.py
|--README.md
|--README_ZH.md
|--requirements.txt
|--LICENSE
The code is organized by component (e.g. embeddings, encoders). Users can use and extend it with little effort.
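The idea of assembling a model from interchangeable components can be illustrated with a toy registry. This is a conceptual sketch in plain Python, not TencentPretrain's actual API: the dictionaries `EMBEDDINGS`, `ENCODERS`, and `TARGETS`, the function `build_model`, and the config keys are all hypothetical, and strings stand in for real neural network modules.

```python
# Hypothetical registries: each maps a config name to a component factory.
EMBEDDINGS = {"word": lambda: "WordEmbedding"}
ENCODERS = {"transformer": lambda: "TransformerEncoder",
            "lstm": lambda: "LstmEncoder"}
TARGETS = {"mlm": lambda: "MlmTarget", "lm": lambda: "LmTarget"}


def build_model(config):
    """Assemble (embedding, encoder, target) from names in a config dict."""
    try:
        return (
            EMBEDDINGS[config["embedding"]](),
            ENCODERS[config["encoder"]](),
            TARGETS[config["target"]](),
        )
    except KeyError as err:
        raise ValueError(f"unknown or missing component: {err}") from err


print(build_model({"embedding": "word", "encoder": "transformer", "target": "mlm"}))
```

In a design like this, adding a new encoder means registering one more factory; the embedding and target components stay unchanged.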
Comprehensive examples of using TencentPretrain can be found in ➡️ instructions ⬅️ , which help users quickly implement pre-training models such as BERT, GPT-2, ELMo, T5, CLIP and fine-tune pre-trained models on a range of downstream tasks.
TencentPretrain has been used in winning solutions of many competitions. In this section, we provide some examples of using TencentPretrain to achieve SOTA results on competitions, such as CLUE. See ➡️ competition solutions ⬅️ for more detailed information.
@article{zhao2023tencentpretrain,
title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
journal={ACL 2023},
pages={217},
year={2023}
}