[README] [HF Repo] [Web version]
Chinese | English
The Yayi large model is instruction fine-tuned on millions of manually curated, high-quality domain data samples. The training data covers five major fields, including media publicity, public opinion analysis, public safety, financial risk control, and urban governance, and spans hundreds of natural language instruction tasks. During the iteration of the Yayi large model from pre-trained initialization weights to domain models, we gradually strengthened its basic Chinese capabilities and domain analysis capabilities, and added multi-turn dialogue and some plug-in capabilities. Through continuous human feedback collected during internal testing with hundreds of users, we further improved the model's performance and safety.
By open sourcing the Yayi large model, we hope to contribute to the development of the Chinese open-source community for pre-trained large models and to build the Yayi model ecosystem together with every partner.
News: the Yayi large model has open sourced a Chinese-optimized model version based on LLaMA 2 to explore the latest practices for Chinese multi-domain tasks.
| Model name | HF model identifier | Download link |
| --- | --- | --- |
| YAYI-7B | wenge-research/yayi-7b | Model download |
| YAYI-7B-Llama2 | wenge-research/yayi-7b-llama2 | Model download |
| YAYI-13B-Llama2 | wenge-research/yayi-13b-llama2 | Model download |
```bash
git clone https://github.com/wenge-research/YAYI.git
cd YAYI
conda create --name YAYI python=3.8
conda activate YAYI
pip install -r requirements.txt
```
The installed `torch` and `transformers` versions should not be lower than the recommended versions.
The model weights (7B version) have been open sourced in our Hugging Face model repository, and you are welcome to download and use them. The following is a simple example of calling `yayi-7b` for downstream task inference. It can run on a single GPU such as an A100/A800/3090 and uses about 20 GB of GPU memory with FP16 inference:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

yayi_7b_path = "wenge-research/yayi-7b"
tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)

# Build the YAYI chat prompt with the system / human / assistant markers used in training.
prompt = "你好"
formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YAYI.\nYAYI is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YAYI|>:"
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# <|End|> is the end-of-response token added during training; use it to stop generation.
eos_token_id = tokenizer("<|End|>").input_ids[0]
generation_config = GenerationConfig(
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.3,
    repetition_penalty=1.1,
    no_repeat_ngram_size=0
)
response = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(response[0]))
```
Note that the special token `<|End|>` is added as the end-of-response marker during model training, so `eos_token_id` in the `GenerationConfig` above is set to the token id of that marker. The inference code for the models instruction fine-tuned from LLaMA 2 differs slightly; for details, please refer to the corresponding version in our Hugging Face model repository.
This project uses the `deepspeed` framework for model training. After configuring the environment, execute the corresponding script to start training. The project supports full-parameter fine-tuning on instruction data, LoRA fine-tuning on instruction data, full-parameter fine-tuning on multi-turn dialogue data, and LoRA fine-tuning on multi-turn dialogue data.
Data format: refer to `data/yayi_train_example.json`, which follows the Alpaca project's jsonline format. Each line is one JSON object with three fields, `"instruction"`, `"input"`, and `"output"`, where `"instruction"` and `"input"` together form the instruction input and `"output"` is the target answer.
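For illustration, one line of such a file might look like the following; the field values here are invented for the example and are not taken from the actual training set:

```json
{"instruction": "Classify the sentiment of the following news headline as positive, negative, or neutral.", "input": "A technology company releases a new open-source large model", "output": "positive"}
```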
Operation instructions: run the following command to start full-parameter fine-tuning of the Yayi large model. The command supports single-machine multi-GPU training; to configure multi-machine multi-GPU training, please refer to the official deepspeed documentation. A hardware configuration of 4*A100 (80G) or above is recommended.
```bash
deepspeed --num_gpus=8 \
    --module training.trainer \
    --data-path ./data/yayi_train_example.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-6 \
    --seed 515
```
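The command above points to `./config/deepspeed_zero2_bf16.json`. The repository ships its own configuration file; the snippet below is only an illustrative sketch of what a ZeRO stage-2, bf16 DeepSpeed configuration of this kind typically looks like, not the project's actual file:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```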
Data format: same as above; refer to `data/yayi_train_example.json`.
Operation instructions: LoRA is a low-resource, efficient fine-tuning method, and a single GPU can fine-tune a model with tens of billions of parameters. This project implements LoRA fine-tuning mainly with `peft`. Run the following command to start LoRA fine-tuning of the Yayi large model. Fine-tuning can be completed on a single A100 (80G), and the learning rate can be set to a larger value. `--lora-dim` sets the rank of the update matrices; the larger the value, the more parameters are trained. `--lora-module-name` sets the modules to which the LoRA update matrices are applied and can be changed according to the model type. A peft-based sketch of the corresponding configuration follows the command below.
```bash
deepspeed --num_gpus=1 \
    --module training.trainer_lora \
    --data-path ./data/yayi_train_example.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-4 \
    --seed 515 \
    --lora-dim 16 \
    --lora-module-name query_key_value
```
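For reference, `--lora-dim` and `--lora-module-name` roughly correspond to the `r` and `target_modules` arguments of a `peft` `LoraConfig`. The sketch below shows how such a configuration might be wrapped around a causal LM; it illustrates the general peft API rather than the project's `training.trainer_lora` code, and the `lora_alpha`/`lora_dropout` values are assumptions:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./checkpoints/yayi-7b")

# r matches --lora-dim; target_modules matches --lora-module-name.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,       # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    target_modules=["query_key_value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```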
Data format: refer to `data/yayi_train_example_multi_rounds.json`, which is a standard JSON file. Each data item consists of `"system"` and `"conversations"`, where `"system"` is the global role-setting information and may be an empty string, and `"conversations"` is a multi-turn dialogue between two roles, human and YAYI.
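As an illustration, a record in this format might look like the following; the per-turn keys (`"from"`, `"value"`) and the dialogue text are assumptions made for this example, so check `data/yayi_train_example_multi_rounds.json` for the exact field names:

```json
{
  "system": "You are YAYI, a helpful assistant developed by Beijing Wenge Technology Co.,Ltd.",
  "conversations": [
    {"from": "human", "value": "Please briefly introduce large language models."},
    {"from": "yayi", "value": "Large language models are neural networks trained on large text corpora to follow instructions and hold dialogues."},
    {"from": "human", "value": "What can they be used for?"},
    {"from": "yayi", "value": "Typical uses include question answering, summarization, information extraction, and multi-turn dialogue."}
  ]
}
```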
Operation instructions: run the following command to start full-parameter fine-tuning of the Yayi large model on multi-turn dialogue data. For this data, the loss is computed only on the replies generated by the model (a simplified masking sketch follows the command below). The command supports single-machine multi-GPU training; to configure multi-machine multi-GPU training, please refer to the official deepspeed documentation. A hardware configuration of 4*A100 (80G) or above is recommended.
```bash
deepspeed --num_gpus=8 \
    --module training.trainer_multi_rounds \
    --data-path ./data/yayi_train_example_multi_rounds.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-7 \
    --seed 515
```
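As noted above, only the model's own replies contribute to the loss for multi-turn data. The usual way to achieve this in causal LM training is to set the label of every non-reply token to -100 so that it is ignored by the cross-entropy loss. The snippet below is a simplified sketch of that idea, not the project's `training.trainer_multi_rounds` implementation:

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_non_reply_tokens(input_ids, reply_spans):
    """Build labels that keep only the YAYI reply tokens.

    input_ids: list of token ids for the whole rendered conversation.
    reply_spans: list of (start, end) index pairs covering the model's replies.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in reply_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```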
Data format: same as above; refer to `data/yayi_train_example_multi_rounds.json`.

The Yayi large model is trained on Zhongke Wenge's million-scale, high-quality domain instruction fine-tuning dataset. This release open sources 50,000 training samples, which can be downloaded from our Hugging Face data repository. The dataset mainly covers several major fields, including finance, security, public opinion, and media. Discrete prompt prefixes have been added to most of the instruction data for tasks in each field so that data from different fields can be distinguished. The training data also includes some security-enhancement data, plug-in capability data, and multi-turn dialogue data.
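If the open-sourced subset is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library as in the sketch below; the repository id is a placeholder and should be replaced with the actual id listed in the Hugging Face data repository:

```python
from datasets import load_dataset

# The repository id below is a placeholder; substitute the actual id of the
# open-sourced YAYI instruction subset from the Hugging Face data repository.
dataset = load_dataset("wenge-research/<dataset-id>")
print(dataset)
```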
The SFT model trained on the current data and base models still has limitations in the following respects:
Given the above model limitations, we require that developers use the code, data, models, and any derivatives produced by this project for research purposes only, and not for commercial purposes or any other use that would harm society. Please carefully evaluate and use content generated by the Yayi large model, and do not spread harmful generated content on the internet. The disseminator bears responsibility for any adverse consequences.
This project may only be used for research purposes. The project developers are not responsible for any harm or loss caused by using this project, including but not limited to its data, models, and code. Please refer to the disclaimer for details.
The code in this project is open sourced under the Apache-2.0 license, the data is released under the CC BY-NC 4.0 license, and use of the YAYI series model weights must follow the Model License.