[README] [HF Repo] [Web version]
Chinese | English
The Yayi large model is instruction fine-tuned on millions of manually curated, high-quality domain data samples. The training data covers five major fields, including media publicity, public opinion analysis, public safety, financial risk control, and urban governance, and spans hundreds of natural language instruction tasks. During the iteration of the Yayi large model from pre-trained initialization weights to domain models, we gradually strengthened its basic Chinese capabilities and domain analysis capabilities, and added multi-turn dialogue and some plug-in capabilities. Through continuous human feedback collected during internal testing with hundreds of users, we further improved the model's performance and safety.
By open sourcing the Yayi large model, we hope to contribute to the development of the Chinese open-source community for pre-trained large models and to build the Yayi model ecosystem together with every partner.
News: the Yayi large model has open sourced a Chinese-optimized model version based on LLaMA 2 to explore the latest practices for Chinese multi-domain tasks.
| Model name | HF model identifier | Download link |
| --- | --- | --- |
| YAYI-7B | wenge-research/yayi-7b | Model download |
| YAYI-7B-Llama2 | wenge-research/yayi-7b-llama2 | Model download |
| YAYI-13B-Llama2 | wenge-research/yayi-13b-llama2 | Model download |
```bash
git clone https://github.com/wenge-research/YAYI.git
cd YAYI
conda create --name YAYI python=3.8
conda activate YAYI
pip install -r requirements.txt
```
The installed `torch` and `transformers` versions should not be lower than the recommended versions.
The model weights (7B version) have been open sourced in our Hugging Face model repository, and you are welcome to download and use them. The following is a simple example of calling `yayi-7b` for downstream task inference. It can run on a single GPU such as an A100/A800/3090 and uses about 20 GB of GPU memory with FP16 inference:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

yayi_7b_path = "wenge-research/yayi-7b"
tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)

# Build the YAYI chat prompt with the system / human / assistant markers used in training.
prompt = "你好"
formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YAYI.\nYAYI is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YAYI|>:"
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

# <|End|> is the end-of-response token added during training; use it to stop generation.
eos_token_id = tokenizer("<|End|>").input_ids[0]
generation_config = GenerationConfig(
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.3,
    repetition_penalty=1.1,
    no_repeat_ngram_size=0
)
response = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(response[0]))
```
Note that the special token `<|End|>` is added as the end-of-response marker during model training, so `eos_token_id` in the `GenerationConfig` above is set to the token id of that marker. The inference code for the models instruction fine-tuned from LLaMA 2 differs slightly; for details, please refer to the corresponding version in our Hugging Face model repository.
This project uses the `deepspeed` framework for model training. After configuring the environment, execute the corresponding script to start training. The project supports full-parameter fine-tuning on instruction data, LoRA fine-tuning on instruction data, full-parameter fine-tuning on multi-turn dialogue data, and LoRA fine-tuning on multi-turn dialogue data.
Data format: refer to `data/yayi_train_example.json`, which follows the Alpaca project's jsonline format. Each line is one JSON object with three fields, `"instruction"`, `"input"`, and `"output"`, where `"instruction"` and `"input"` together form the instruction input and `"output"` is the target answer.
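For illustration, one line of such a file might look like the following; the field values here are invented for the example and are not taken from the actual training set:

```json
{"instruction": "Classify the sentiment of the following news headline as positive, negative, or neutral.", "input": "A technology company releases a new open-source large model", "output": "positive"}
```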
Operation instructions: run the following command to start full-parameter fine-tuning of the Yayi large model. The command supports single-machine multi-GPU training; to configure multi-machine multi-GPU training, please refer to the official deepspeed documentation. A hardware configuration of 4*A100 (80G) or above is recommended.
```bash
deepspeed --num_gpus=8 \
    --module training.trainer \
    --data-path ./data/yayi_train_example.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-6 \
    --seed 515
```
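The command above points to `./config/deepspeed_zero2_bf16.json`. The repository ships its own configuration file; the snippet below is only an illustrative sketch of what a ZeRO stage-2, bf16 DeepSpeed configuration of this kind typically looks like, not the project's actual file:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```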
Data format: same as above; refer to `data/yayi_train_example.json`.
Operation instructions: LoRA is a low-resource, efficient fine-tuning method, and a single GPU can fine-tune a model with tens of billions of parameters. This project implements LoRA fine-tuning mainly with `peft`. Run the following command to start LoRA fine-tuning of the Yayi large model. Fine-tuning can be completed on a single A100 (80G), and the learning rate can be set to a larger value. `--lora-dim` sets the rank of the update matrices; the larger the value, the more parameters are trained. `--lora-module-name` sets the modules to which the LoRA update matrices are applied and can be changed according to the model type. A peft-based sketch of the corresponding configuration follows the command below.
```bash
deepspeed --num_gpus=1 \
    --module training.trainer_lora \
    --data-path ./data/yayi_train_example.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-4 \
    --seed 515 \
    --lora-dim 16 \
    --lora-module-name query_key_value
```
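For reference, `--lora-dim` and `--lora-module-name` roughly correspond to the `r` and `target_modules` arguments of a `peft` `LoraConfig`. The sketch below shows how such a configuration might be wrapped around a causal LM; it illustrates the general peft API rather than the project's `training.trainer_lora` code, and the `lora_alpha`/`lora_dropout` values are assumptions:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./checkpoints/yayi-7b")

# r matches --lora-dim; target_modules matches --lora-module-name.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,       # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    target_modules=["query_key_value"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```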
Data format: refer to `data/yayi_train_example_multi_rounds.json`, which is a standard JSON file. Each data item consists of `"system"` and `"conversations"`, where `"system"` is the global role-setting information and may be an empty string, and `"conversations"` is a multi-turn dialogue between two roles, human and YAYI.
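As an illustration, a record in this format might look like the following; the per-turn keys (`"from"`, `"value"`) and the dialogue text are assumptions made for this example, so check `data/yayi_train_example_multi_rounds.json` for the exact field names:

```json
{
  "system": "You are YAYI, a helpful assistant developed by Beijing Wenge Technology Co.,Ltd.",
  "conversations": [
    {"from": "human", "value": "Please briefly introduce large language models."},
    {"from": "yayi", "value": "Large language models are neural networks trained on large text corpora to follow instructions and hold dialogues."},
    {"from": "human", "value": "What can they be used for?"},
    {"from": "yayi", "value": "Typical uses include question answering, summarization, information extraction, and multi-turn dialogue."}
  ]
}
```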
Operation instructions: run the following command to start full-parameter fine-tuning of the Yayi large model on multi-turn dialogue data. For this data, the loss is computed only on the replies generated by the model (a simplified masking sketch follows the command below). The command supports single-machine multi-GPU training; to configure multi-machine multi-GPU training, please refer to the official deepspeed documentation. A hardware configuration of 4*A100 (80G) or above is recommended.
```bash
deepspeed --num_gpus=8 \
    --module training.trainer_multi_rounds \
    --data-path ./data/yayi_train_example_multi_rounds.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-7 \
    --seed 515
```
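As noted above, only the model's own replies contribute to the loss for multi-turn data. The usual way to achieve this in causal LM training is to set the label of every non-reply token to -100 so that it is ignored by the cross-entropy loss. The snippet below is a simplified sketch of that idea, not the project's `training.trainer_multi_rounds` implementation:

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def mask_non_reply_tokens(input_ids, reply_spans):
    """Build labels that keep only the YAYI reply tokens.

    input_ids: list of token ids for the whole rendered conversation.
    reply_spans: list of (start, end) index pairs covering the model's replies.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in reply_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```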
Data format: same as above; refer to `data/yayi_train_example_multi_rounds.json`.

The Yayi large model is trained on Zhongke Wenge's million-scale, high-quality domain instruction fine-tuning dataset. This release open sources 50,000 training samples, which can be downloaded from our Hugging Face data repository. The dataset mainly covers several major fields, including finance, security, public opinion, and media. Discrete prompt prefixes have been added to most of the instruction data for tasks in each field so that data from different fields can be distinguished. The training data also includes some security-enhancement data, plug-in capability data, and multi-turn dialogue data.
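If the open-sourced subset is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library as in the sketch below; the repository id is a placeholder and should be replaced with the actual id listed in the Hugging Face data repository:

```python
from datasets import load_dataset

# The repository id below is a placeholder; substitute the actual id of the
# open-sourced YAYI instruction subset from the Hugging Face data repository.
dataset = load_dataset("wenge-research/<dataset-id>")
print(dataset)
```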
The SFT model trained on the current data and base models still has limitations in the following respects:
Given the above model limitations, we require that developers use the code, data, models, and any derivatives produced by this project for research purposes only, and not for commercial purposes or any other use that would harm society. Please carefully evaluate and use content generated by the Yayi large model, and do not spread harmful generated content on the internet. The disseminator bears responsibility for any adverse consequences.
This project may only be used for research purposes. The project developers are not responsible for any harm or loss caused by using this project, including but not limited to its data, models, and code. Please refer to the disclaimer for details.
The code in this project is open sourced under the Apache-2.0 license, the data is released under the CC BY-NC 4.0 license, and use of the YAYI series model weights must follow the Model License.