
Chinese dialogue 0.2B small model ChatLM-Chinese-0.2B


1. Introduction

Today's large language models keep growing in parameter count, and consumer-grade computers are slow even at simple inference, let alone training a model from scratch. The goal of this project is to train a generative language model from scratch, covering data cleaning, tokenizer training, model pre-training, SFT instruction fine-tuning, and RLHF optimization.

ChatLM-mini-Chinese is a small Chinese dialogue model with only 0.2B parameters (about 210M including shared weights). It can be pre-trained on a machine with as little as 4GB of video memory (batch_size=1, fp16 or bf16), and loading and running inference in float16 requires at least 512MB of video memory.
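The 512MB figure is in line with simple arithmetic on the weights alone. The sketch below is only a back-of-envelope check: it takes the 210M parameter count from the paragraph above, assumes 2 bytes per parameter for float16, and ignores activations and framework overhead.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in MiB (float16/bfloat16 use 2 bytes)."""
    return num_params * bytes_per_param / (1024 ** 2)


params = 210_000_000  # "about 210M including shared weights", from the text above

print(f"float16 weights: {weight_memory_mib(params):.0f} MiB")
print(f"float32 weights: {weight_memory_mib(params, bytes_per_param=4):.0f} MiB")
```

The float16 weights come to roughly 400 MiB, which leaves some headroom within 512MB for activations during generation.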

All data set sources for pre-training, SFT instruction fine-tuning, and DPO preference optimization are disclosed.
Built on the Huggingface NLP stack, including transformers, accelerate, trl, peft, etc.
The self-implemented trainer supports pre-training and SFT fine-tuning on a single machine with one or multiple GPUs. Training can be stopped at any point and resumed from that point.
Pre-training: integrated as end-to-end Text-to-Text pre-training, not mask prediction pre-training.
- All code for data cleaning (such as normalization, document deduplication based on mini_hash, etc.), data set construction, and data set loading optimization is open source;
- Tokenizer training with multi-process word frequency counting, supporting both sentencepiece and huggingface tokenizers;
- Pre-training supports checkpoints at any position, and training can be resumed from the checkpoint;
- Streaming loading of large (GB-level) data sets with buffered shuffling, without using memory or disk as a cache, which effectively reduces memory and disk usage. With batch_size=1, max_len=320, pre-training is possible on a machine with at least 16GB of memory and 4GB of video memory;
- Training logs are recorded.
SFT fine-tuning: the SFT data set and data processing code are open source.
- The self-implemented trainer supports prompt instruction fine-tuning and can resume training from any checkpoint;
- Supports sequence-to-sequence fine-tuning with the Huggingface trainer;
- Supports the traditional approach of fine-tuning with a low learning rate while training only the decoder layers.
RLHF preference optimization: uses DPO for full-parameter preference optimization.
- Supports preference optimization with peft lora;
- Supports model merging: the Lora adapter can be merged into the original model.
Downstream task fine-tuning: finetune_examples provides a fine-tuning example for a triple information extraction task. The model retains its dialogue ability after fine-tuning.
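Buffered shuffling of a streamed data set, mentioned in the list above, can be sketched as follows. This is one common way to do it and not necessarily how the project implements it; the function name `buffered_shuffle` and its parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most `buffer_size` items in memory."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the new item in its slot
    rng.shuffle(buffer)          # flush what is left at the end of the stream
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Only the buffer is held in memory, so the cost is independent of the data set size. A larger buffer gives a shuffle closer to uniform.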

If you need retrieval-augmented generation (RAG) based on a small model, refer to my other project, Phi2-mini-Chinese. For the code, see rag_with_langchain.ipynb.

Latest updates

2024-01-30

- The model files have been uploaded to ModelScope and can be downloaded quickly with `snapshot_download`.

2024-01-07

- Added document deduplication based on mini hash to the data cleaning process (in this project it is effectively sample-level deduplication of the data set), to prevent the model from reproducing training data at inference time after seeing the same sample many times.
- Add the `DropDatasetDuplicate` class to implement deduplication of documents from large data sets.
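As an illustration of the idea, here is a minimal near-duplicate filter using only the standard library. It assumes "mini hash" refers to MinHash, and it is not the project's `DropDatasetDuplicate` class: all function names, the 3-character shingles, and the 0.8 threshold are hypothetical choices.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Character n-grams; these suit Chinese text, which has no whitespace word boundaries."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list:
    """For each of `num_perm` salted hash functions, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "little"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_near_duplicates(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not too similar to any document already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = ["今天天气很好，适合出去散步。", "今天天气很好，适合出去散步。", "如何训练一个中文分词器？"]
print(drop_near_duplicates(docs))
```

The pairwise comparison here is quadratic in the number of kept documents. For large data sets, MinHash is normally combined with locality-sensitive hashing so that only likely duplicates are compared.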

2023-12-29

- Updated the model code (weights unchanged); the model can now be loaded directly with `AutoModelForSeq2SeqLM.from_pretrained(...)`.
- Updated readme documentation.

2023-12-18

- Added code that uses the `ChatLM-mini-0.2B` model to fine-tune on the downstream triple information extraction task, along with a demonstration of the extraction results.
- Updated readme documentation.

2023-12-14

- Updated model weight files after SFT and DPO.
- Updated pre-training, SFT and DPO scripts.
- Updated `tokenizer` to `PreTrainedTokenizerFast`.
- Refactored the `dataset` code to support a dynamic maximum length. The maximum length of each batch is determined by the longest text in that batch, which saves video memory.
- Added `tokenizer` training details.
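The dynamic maximum length described in the 2023-12-14 notes amounts to padding each batch only up to its own longest sequence. A minimal sketch in plain Python, where the function name `pad_batch` and the padding id 0 are assumptions rather than the project's code:

```python
def pad_batch(batch, pad_id=0, max_len=None):
    """Pad token-id lists to the longest sequence in this batch, not to a global maximum.

    `pad_id=0` is an assumption; use the tokenizer's real padding id in practice.
    """
    longest = max(len(ids) for ids in batch)
    if max_len is not None:
        longest = min(longest, max_len)  # optional hard cap
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:longest]
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

A batch of short texts then produces a small tensor, instead of one padded to the longest sample in the whole data set.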

2023-12-04

- Updated `generate` parameters and model effect display.
- Updated readme documentation.

2023-11-28

- Updated dpo training code and model weights.

2023-10-19

- The project is open source and the model weights are open for download.

2. ChatLM-0.2B-Chinese model training process

2.1 Pre-training data set

All data sets come from single-turn conversation data sets published on the Internet. After cleaning and formatting, they are saved as parquet files. For the data processing code, see utils/raw_data_process.py. The main data sets include:

Community Q&A (json version) webtext2019zh, a large-scale high-quality data set, see: nlp_chinese_corpus. 4.1 million in total, with 2.6 million remaining after cleaning.
baike_qa2019 encyclopedia Q&A, see: https://aistudio.baidu.com/datasetdetail/107726. 1.4 million in total, with 1.3 million remaining after cleaning.
Chinese medical question-and-answer data set, see: Chinese-medical-dialogue-data. 790,000 in total, with 790,000 remaining after cleaning.
~~Financial industry question-and-answer data, see: https://zhuanlan.zhihu.com/p/609821974. 770,000 in total, with 520,000 remaining after cleaning.~~ The data quality was too poor, so it was not used.
Zhihu question-and-answer data, see: Zhihu-KOL. 1 million rows in total, with 970,000 rows remaining after cleaning.
belle open source instruction training data, introduction: BELLE, download: BelleGroup. Only samples from Belle_open_source_1M, train_2M_CN, and train_3.5M_CN that have short answers, contain no complex table structures, and are not translation tasks (there is no English vocabulary list) were selected. 3.7 million rows in total, with 3.38 million rows remaining after cleaning.
Wikipedia entry data: entries are pieced together into prompts, and the first N words of the encyclopedia text are the answers. The 202309 encyclopedia data is used, and 1.19 million entry prompts and answers remain after cleaning. Wiki download: zhwiki. To convert the downloaded bz2 file to wiki.txt, see: WikiExtractor.

The data sets total 10.23 million samples. Text-to-Text pre-training set: 9.3 million; evaluation set: 25,000 (decoding is slow, so the evaluation set is kept small). ~~Test set: 900,000.~~ The SFT fine-tuning and DPO optimization data sets are described below.

2.2 Model

T5 model (Text-to-Text Transfer Transformer). For details, see the paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

The model source code comes from huggingface, see: T5ForConditionalGeneration.

See model_config.json for the model configuration. The official T5-base has 12 encoder layers and 12 decoder layers; in this project both are changed to 10.

Model parameters: 0.2B. Vocabulary size: 29298, covering only Chinese and a small amount of English.
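The parameter count can be roughly cross-checked from these numbers. The estimate below assumes the remaining dimensions are the standard T5-base values (d_model=768, d_ff=3072, 12 heads, d_kv=64) and the original non-gated feed-forward layer; the project's model_config.json may differ, so treat this as a sketch rather than the exact count.

```python
def t5_param_estimate(vocab, d_model, d_ff, n_heads, d_kv, n_enc, n_dec, rel_buckets=32):
    """Rough T5 parameter count; T5 linear layers have no biases.

    Returns (count with tied embeddings, count with a separate output projection).
    """
    inner = n_heads * d_kv
    attn = 4 * d_model * inner                 # q, k, v and output projections
    ffn = 2 * d_model * d_ff                   # two feed-forward projections
    enc_layer = attn + ffn + 2 * d_model       # self-attention + FFN + 2 layer norms
    dec_layer = 2 * attn + ffn + 3 * d_model   # self-attn + cross-attn + FFN + 3 layer norms
    embed = vocab * d_model
    extras = 2 * rel_buckets * n_heads + 2 * d_model  # relative position biases + final norms
    tied = embed + n_enc * enc_layer + n_dec * dec_layer + extras
    return tied, tied + embed


# vocab=29298 and 10+10 layers are from the text; the other values are assumed T5-base defaults
tied, untied = t5_param_estimate(29298, 768, 3072, 12, 64, n_enc=10, n_dec=10)
print(f"tied embeddings:   {tied / 1e6:.1f}M")
print(f"separate lm head:  {untied / 1e6:.1f}M")
```

Under these assumptions the estimate lands near 0.19B with tied embeddings and near 0.21B when the output projection is counted separately, which is in line with the "0.2B (about 210M including shared weights)" figure.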

2.3 Training process

Hardware:

# Pre-training stage:
CPU: 28 vCPU Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Memory: 60 GB
GPU: RTX A5000(24GB) * 2

# SFT and DPO stages:
CPU: Intel(R) i5-13600k @ 5.1GHz
Memory: 32 GB
GPU: NVIDIA GeForce RTX 4060 Ti 16GB * 1

Tokenizer training: existing tokenizer training libraries run out of memory (OOM) on a large corpus. The full corpus was therefore merged and built into a vocabulary by word frequency, using a method similar to BPE, which took half a day to run.
Text-to-Text pre-training: the learning rate is dynamic, ranging from 1e-4 to 5e-3, and pre-training took 8 days. Training loss:

[Figure: training loss]

Prompt supervised fine-tuning (SFT): uses the belle instruction training data set (instruction and answer lengths both below 512). The learning rate is dynamic, ranging from 1e-7 to 5e-5, and fine-tuning took 2 days. Fine-tuning loss:

[Figure: fine-tuning loss]

DPO direct preference optimization (RLHF): the alpaca-gpt4-data-zh data set is used as the chosen text. The SFT model from step 2 batch-generates responses to the prompts in the data set to obtain the rejected text, which took 1 day. DPO is run as full-parameter preference optimization with a learning rate of 1e-5, half-precision fp16, for 2 epochs in total, taking 3 hours. DPO loss:

[Figure: DPO loss]

2.4 Dialogue effect display

2.4.1 stream chat

By default, TextIteratorStreamer from huggingface transformers is used to implement streaming dialogue, which only supports greedy search. If you need other generation methods such as beam sample, change the stream_chat parameter in cli_demo.py to False.

2.4.2 Dialogue display

Known problems: the pre-training data set has only a little over 9 million samples and the model has only 0.2B parameters, so it cannot cover every domain. Answers may be wrong, and the generated text may be nonsense.

3. Instructions for use

3.1 Quick start:

If huggingface cannot be reached, use modelscope.snapshot_download to download the model files from modelscope.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = 'charent/ChatLM-mini-Chinese'

# If huggingface cannot be reached, uncomment the next two lines to download the model files
# from modelscope; they are saved to the './model_save' directory
# from modelscope import snapshot_download
# model_id = snapshot_download(model_id, cache_dir='./model_save')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True).to(device)

txt = '如何评价Apple这家公司？'

encode_ids = tokenizer([txt])
input_ids, attention_mask = torch.LongTensor(encode_ids['input_ids']), torch.LongTensor(encode_ids['attention_mask'])

outs = model.my_generate(
    input_ids=input_ids.to(device),
    attention_mask=attention_mask.to(device),
    max_seq_len=256,
    search_type='beam',
)

outs_txt = tokenizer.batch_decode(outs.cpu().numpy(), skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(outs_txt[0])

Apple是一家专注于设计和用户体验的公司，其产品在设计上注重简约、流畅和功能性，而在用户体验方面则注重用户的反馈和使用体验。作为一家领先的科技公司，苹果公司一直致力于为用户提供最优质的产品和服务，不断推陈出新，不断创新和改进，以满足不断变化的市场需求。
在iPhone、iPad和Mac等产品上，苹果公司一直保持着创新的态度，不断推出新的功能和设计，为用户提供更好的使用体验。在iPad上推出的iPad Pro和iPod touch等产品，也一直保持着优秀的用户体验。
此外，苹果公司还致力于开发和销售软件和服务，例如iTunes、iCloud和App Store等，这些产品在市场上也获得了广泛的认可和好评。
总的来说，苹果公司在设计、用户体验和产品创新方面都做得非常出色，为用户带来了许多便利和惊喜。

3.2 Start by cloning the repository

Caution

The model in this project is a TextToText model. In the prompt, response, and other fields used in the pre-training, SFT, and RLHF stages, be sure to add the [EOS] end-of-sequence token.
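A small helper along these lines can make sure every field ends with the marker when you prepare your own data. It is only a sketch: the helper names are hypothetical, and whether the project's own data scripts already append the token is not covered here.

```python
EOS = "[EOS]"  # end-of-sequence marker named in the caution above


def with_eos(text: str, eos: str = EOS) -> str:
    """Append the end-of-sequence marker unless the text already ends with it."""
    return text if text.endswith(eos) else text + eos


def prepare_sample(sample: dict) -> dict:
    """Apply `with_eos` to every text field of a sample such as {"prompt": ..., "response": ...}."""
    return {key: with_eos(value) for key, value in sample.items()}


sample = {"prompt": "解释什么是欧洲启示录", "response": "欧洲启示录是新约圣经的最后一卷书。"}
print(prepare_sample(sample))
```

The check for an existing marker keeps the helper safe to run twice on the same data.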

3.2.1 Clone project:

git clone --depth 1 https://github.com/charent/ChatLM-mini-Chinese.git

cd ChatLM-mini-Chinese

3.2.2 Install dependencies

It is recommended to use python 3.10 for this project. Older python versions may not be compatible with the third-party libraries it depends on.

pip installation:

pip install -r ./requirements.txt

If pip installed the CPU version of pytorch, you can install the CUDA version of pytorch with the following command:

# Install torch + cu118 with pip
pip3 install torch --index-url https://download.pytorch.org/whl/cu118

conda installation:

conda install --yes --file ./requirements.txt

3.2.3 Download the pre-trained model and model configuration file

Use git to download the model weights and configuration files from the Hugging Face Hub. Install Git LFS first, then run:

# Download the huggingface model with git. Install [Git LFS] first, otherwise the downloaded model files will be unusable
git clone --depth 1 https://huggingface.co/charent/ChatLM-mini-Chinese

# If huggingface cannot be reached, download from modelscope instead
git clone --depth 1 https://www.modelscope.cn/charent/ChatLM-mini-Chinese.git

mv ChatLM-mini-Chinese model_save

You can also download the files manually from the Hugging Face Hub repository ChatLM-Chinese-0.2B and move them to the model_save directory.

3.3 Tokenizer training

Prepare the txt corpus

The corpus should be as comprehensive as possible. It is recommended to combine several kinds of corpora, such as encyclopedias, code, papers, blogs, and conversations.

This project mainly uses the Chinese wiki encyclopedia. To obtain the Chinese wiki corpus: go to the Chinese Wiki download address zhwiki and download the zhwiki-[archive date]-pages-articles-multistream.xml.bz2 file (about 2.7GB); convert the downloaded bz2 file to wiki.txt (see WikiExtractor); then use python's OpenCC library to convert it to Simplified Chinese; finally, put the resulting wiki.simple.txt in the data directory under the project root. If you use multiple corpora, merge them into one txt file yourself.

Training the tokenizer consumes a lot of memory. If your corpus is very large (the merged txt file exceeds 2GB), it is recommended to sample the corpus by category and proportion to reduce training time and memory consumption. Training on a 1.7GB txt file requires about 48GB of memory (an estimate: I only have 32GB, swap was triggered frequently, and the computer froze for long stretches T_T) and takes about 1 hour on a 13600k CPU.
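Sampling by category and proportion could look like the sketch below. The function name, the category names, and the ratios are all hypothetical, and for simplicity it holds the lines in memory; a real multi-GB corpus would be sampled while streaming from disk.

```python
import random


def sample_corpus(corpora: dict, ratios: dict, seed: int = 0) -> list:
    """Keep a fixed fraction of the lines of each category, then mix the result.

    corpora: category name -> list of text lines
    ratios:  category name -> fraction in [0, 1] to keep (categories not listed are kept whole)
    """
    rng = random.Random(seed)
    sampled = []
    for name, lines in corpora.items():
        keep = int(len(lines) * ratios.get(name, 1.0))
        sampled.extend(rng.sample(lines, keep))  # sample without replacement
    rng.shuffle(sampled)
    return sampled


corpora = {
    "wiki": [f"wiki line {i}" for i in range(100)],
    "chat": [f"chat line {i}" for i in range(50)],
}
subset = sample_corpus(corpora, {"wiki": 0.5, "chat": 0.2})
print(len(subset))  # 60
```

Choosing the ratios per category keeps smaller corpora such as code or conversations represented in the vocabulary, instead of letting the largest corpus dominate.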

Train the tokenizer

The difference between char level and byte level is shown below (search for details on how their usage differs). The tokenizer is trained at char level by default. If byte level is required, set token_type='byte' in train_tokenizer.py.

# Original text
txt = '这是一段中英混输的句子, （chinese and English, here are words.）'

tokens = charlevel_tokenizer.tokenize(txt)
print(tokens)
# char level tokens output
# ['▁这是', '一段', '中英', '混', '输', '的', '句子', '▁,', '▁(', '▁ch', 'inese', '▁and', '▁Eng', 'lish', '▁,', '▁h', 'ere', '▁', 'are', '▁w', 'ord', 's', '▁.', '▁)']

tokens = bytelevel_tokenizer.tokenize(txt)
print(tokens)
# byte level tokens output
# ['Ġè¿Ļæĺ¯', 'ä¸Ģæ®µ', 'ä¸Ńèĭ±', 'æ··', 'è¾ĵ', 'çļĦ', 'åı¥åŃĲ', 'Ġ,', 'Ġ(', 'Ġch', 'inese', 'Ġand', 'ĠEng', 'lish', 'Ġ,', 'Ġh', 'ere', 'Ġare', 'Ġw', 'ord', 's', 'Ġ.', 'Ġ)']
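The byte-level output looks garbled because a byte-level tokenizer works on UTF-8 bytes and displays each byte as a stand-in printable character. The short example below shows the underlying difference for the first two characters of the sample sentence, using only built-in string methods.

```python
text = "这是"

chars = list(text)         # char level: one unit per Unicode character
raw = text.encode("utf-8")  # byte level: one unit per UTF-8 byte

print(chars)         # ['这', '是']
print(len(raw))      # 6, since each of these characters takes 3 bytes in UTF-8
print(raw.hex(" "))  # e8 bf 99 e6 98 af
```

The bytes e8 bf and e6 ... af show up as 'è', '¿', 'æ' and '¯' in the byte-level token 'Ġè¿Ļæĺ¯' above, while bytes without a printable form are shown as substitute characters.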

Start training:

# Make sure your training corpus `txt` file is already in the data directory
python train_tokenizer.py

3.4 Text-to-Text pre-training

Pre-training dataset example

{
    "prompt" : "对于花园街，你有什么了解或看法吗？ " ,
    "response" : "花园街（是香港油尖旺区的一条富有特色的街道，位于九龙旺角东部，北至界限街，南至登打士街，与通菜街及洗衣街等街道平行。现时这条街道是香港著名的购物区之一。位于亚皆老街以南的一段花园街，也就是"波鞋街"整条街约150米长，有50多间售卖运动鞋和运动用品的店舖。旺角道至太子道西一段则为排档区，售卖成衣、蔬菜和水果等。花园街一共分成三段。明清时代，花园街是芒角村栽种花卉的地方。此外，根据历史专家郑宝鸿的考证：花园街曾是1910年代东方殷琴拿烟厂的花园。纵火案。自2005年起，花园街一带最少发生5宗纵火案，当中4宗涉及排档起火。2010年。2010年12月6日，花园街222号一个卖鞋的排档于凌晨5时许首先起火，浓烟涌往旁边住宅大厦，消防接报4 "
}

jupyter-lab or jupyter notebook:
See the file train.ipynb. jupyter-lab is recommended, since you then do not need to worry about the terminal process being killed after disconnecting from the server.
Console:
When training from the console, keep in mind that the process will be killed if the connection drops. It is recommended to use the process daemon tool Supervisor, or screen, to keep the session alive.
First, configure accelerate: run the following command and answer the prompts, using accelerate.yaml as a reference. Note: DeepSpeed is troublesome to install on Windows.
```
accelerate config
```
Start training. To use the configuration provided by the project, add the parameter --config_file ./accelerate.yaml after accelerate launch in the commands below. That configuration is for a single machine with 2 GPUs.
There are two pre-training scripts: train.py uses the trainer implemented in this project, and pre_train.py uses the huggingface trainer. Either one works, and the results are the same. The trainer implemented in this project displays nicer training information and makes it easier to modify training details (such as the loss function and logging). Both support resuming training from a checkpoint; the trainer implemented in this project can resume from a checkpoint at any position, and pressing ctrl+c saves the checkpoint information before the script exits.
Single machine, single GPU:
```
# Trainer implemented in this project
accelerate launch ./train.py train

# Or use the huggingface trainer
python pre_train.py
```
Single machine, multiple GPUs: 2 is the number of GPUs; change it to match your setup.
```
# Trainer implemented in this project
accelerate launch --multi_gpu --num_processes 2 ./train.py train

# Or use the huggingface trainer
accelerate launch --multi_gpu --num_processes 2 pre_train.py
```
Resume training from a checkpoint:
```
# Trainer implemented in this project
accelerate launch --multi_gpu --num_processes 2 ./train.py train --is_keep_training=True

# Or use the huggingface trainer
# You need to add `resume_from_checkpoint=True` to the `train` function in `pre_train.py`
accelerate launch --multi_gpu --num_processes 2 pre_train.py
```

3.5 SFT fine-tuning

All of the SFT data comes from the contributions of the BELLE team, thank you. The SFT data sets are generated_chat_0.4M, train_0.5M_CN, and train_2M_CN, with approximately 1.37 million rows remaining after cleaning. Example of an SFT instruction fine-tuning sample:

{
    "prompt" : "解释什么是欧洲启示录" ,
    "response" : "欧洲启示录（The Book of Revelation）是新约圣经的最后一卷书，也被称为《启示录》、《默示录》或《约翰默示录》。这本书从宗教的角度描述了世界末日的来临，以及上帝对世界的审判和拯救。 书中的主题包括来临的基督的荣耀，上帝对人性的惩罚和拯救，以及魔鬼和邪恶力量的存在。欧洲启示录是一个充满象征和暗示的文本，对于解读和理解有许多不同的方法和观点。 "
}

To build your own data set, refer to the sample parquet file in the data directory. The format is a parquet file with two columns: a prompt column holding the prompt text and a response column holding the expected model output. For fine-tuning details, see the train method in model/trainer.py. When is_finetune is set to True, fine-tuning is performed. By default, fine-tuning freezes the embedding layer and the encoder layers and trains only the decoder layers. If you need to freeze other parameters, adjust the code yourself.

Run SFT fine-tuning:

# Trainer implemented in this project: add the parameter `--is_finetune=True`. The parameter `--is_keep_training=True` resumes training from any checkpoint
accelerate launch --multi_gpu --num_processes 2 ./train.py --is_finetune=True

# Or use the huggingface trainer. For multiple GPUs, use: accelerate launch --multi_gpu --num_processes <number of GPUs> sft_train.py
python sft_train.py

3.6 RLHF (Reinforcement Learning from Human Feedback)

Two common preference optimization methods are described here: PPO and DPO. For implementation details, search for the relevant papers and blogs.

PPO method (Proximal Policy Optimization)
Step 1: Use the fine-tuning data set for supervised fine-tuning (SFT, Supervised Finetuning).
Step 2: Use a preference data set to train the reward model (RM, Reward Model). In a preference data set, each prompt has at least 2 responses, one wanted and one unwanted; multiple responses can be ranked by score, with the most wanted response scoring highest. You can use the peft library to quickly build a Lora reward model.
Step 3: Use the RM to run supervised PPO training on the SFT model so that the model matches the preferences.
DPO method (Direct Preference Optimization; this project uses DPO fine-tuning, which saves video memory)
Starting from the SFT model, there is no need to train a reward model: fine-tuning can begin once positive answers (chosen) and negative answers (rejected) are available. The chosen text comes from the original data set alpaca-gpt4-data-zh, and the rejected text comes from the output of the model after 1 epoch of SFT fine-tuning. Two other data sets are also used, huozi_rlhf_data_json and rlhf-reward-single-round-trans_chinese; after merging there are 80,000 DPO samples in total.
For the DPO data set processing code, see utils/dpo_data_process.py.
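For intuition, the per-sample DPO objective on a chosen/rejected pair can be written in a few lines. This sketch follows the standard formulation from the DPO paper using plain floats; it is not the project's training code, and beta=0.1 is only an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a response under the policy being
    trained or under the frozen reference model (here, the SFT model).
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # how much more the policy likes "chosen"
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # same for "rejected"
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss is log(2)
print(round(dpo_loss(-3.0, -3.0, -3.0, -3.0), 4))
# Policy raises "chosen" and lowers "rejected" relative to the reference: the loss decreases
print(round(dpo_loss(-1.0, -5.0, -3.0, -3.0), 4))
```

No reward model appears anywhere in the computation: the reference model's log-probabilities play that role, which is why DPO needs less video memory than the three-step PPO pipeline.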

DPO preference optimization data set example:

    {
        "prompt": "为给定的产品创建一个创意标语。，输入：可重复使用的水瓶。",
        "chosen": "\"保护地球，从拥有可重复使用的水瓶开始！\"",
        "rejected": "\"让你的水瓶成为你的生活伴侣，使用可重复使用的水瓶，让你的水瓶成为你的伙伴\""
    }

Run preference optimization:

# For multiple GPUs, use: accelerate launch --multi_gpu --num_processes <number of GPUs> dpo_train.py
python dpo_train.py

3.7 Inference

Make sure the model_save directory contains the following files. They can be found in the Hugging Face Hub repository ChatLM-Chinese-0.2B:

ChatLM-mini-Chinese
├─model_save
|  ├─config.json
|  ├─configuration_chat_model.py
|  ├─generation_config.json
|  ├─model.safetensors
|  ├─modeling_chat_model.py
|  ├─special_tokens_map.json
|  ├─tokenizer.json
|  └─tokenizer_config.json

Console run:

python cli_demo.py

API call

python api_demo.py

API call example:

curl --location '127.0.0.1:8812/api/chat' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer Bearer' \
--data '{
    "input_txt": "感冒了要怎么办"
}'
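The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above: the helper names are hypothetical, and because the response format is not documented here, the reply is returned as raw text.

```python
import json
import urllib.request


def build_chat_request(input_txt: str,
                       host: str = "http://127.0.0.1:8812",
                       token: str = "Bearer") -> urllib.request.Request:
    """Build the POST request for /api/chat with the same headers and body as the curl example."""
    payload = json.dumps({"input_txt": input_txt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/chat",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_chat_request("感冒了要怎么办")
print(request.full_url)
# With api_demo.py running locally:
# print(send(request))
```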

[Figure: api demo]

3.8 Fine-tuning of downstream tasks

Here, extracting triples from text is used as an example of downstream fine-tuning. For a traditional deep learning extraction method for this task, see the repository pytorch_IE_model. The task is to extract all the triples in a piece of text. For example, from the sentence 《写生随笔》是冶金工业2006年出版的图书，作者是张来亮, extract the triples (写生随笔,作者,张来亮) and (写生随笔,出版社,冶金工业).

The original data set is the Baidu triple extraction data set. Example of the processed fine-tuning data format:

{
    "prompt" : "请抽取出给定句子中的所有三元组。给定句子：《家乡的月亮》是宋雪莱演唱的一首歌曲，所属专辑是《久违的哥们》 " ,
    "response" : " [(家乡的月亮,歌手,宋雪莱),(家乡的月亮,所属专辑,久违的哥们)] "
}

You can fine-tune directly with the sft_train.py script. The notebook finetune_IE_task.ipynb contains the detailed decoding process. The training data set has about 17,000 samples, the learning rate is 5e-5, and training runs for 5 epochs. The model's dialogue ability on other tasks does not disappear after fine-tuning.
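Turning a generated response back into triples is a small parsing step. The sketch below assumes the response follows the `[(subject,relation,object),...]` format of the example above; it is not the decoding code from finetune_IE_task.ipynb, and it would mis-split entities that themselves contain commas or parentheses.

```python
import re

# Three comma-separated fields inside one pair of parentheses
TRIPLE_PATTERN = re.compile(r"\(([^,()]+),([^,()]+),([^,()]+)\)")


def parse_triples(response: str) -> list:
    """Extract (subject, relation, object) tuples from a model response string."""
    return [
        tuple(part.strip() for part in match)
        for match in TRIPLE_PATTERN.findall(response)
    ]


response = " [(家乡的月亮,歌手,宋雪莱),(家乡的月亮,所属专辑,久违的哥们)] "
print(parse_triples(response))
```

Comparing the parsed tuples with the gold triples of each sentence is then enough to compute precision, recall, and F1.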

[Figure: dialogue ability after fine-tuning on the information extraction task]

Fine-tuning results: the published dev set of the Baidu triple extraction data set is used as the test set, for comparison with the traditional method pytorch_IE_model.

Model	F1 score	Precision (P)	Recall (R)
ChatLM-Chinese-0.2B fine-tuned	0.74	0.75	0.73
ChatLM-Chinese-0.2B without pre-training	0.51	0.53	0.49
Traditional deep learning methods	0.80	0.79	0.801

Note: "ChatLM-Chinese-0.2B without pre-training" means initializing the parameters randomly and training directly, with a learning rate of 1e-4. All other parameters are the same as for fine-tuning.
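The F1 column is the harmonic mean of precision and recall, which can be checked against the two ChatLM rows of the table:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.75, 0.73), 2))  # 0.74, the fine-tuned model
print(round(f1_score(0.53, 0.49), 2))  # 0.51, the model without pre-training
```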

3.9 C-Eval score

The model was not trained on a larger data set, nor was it fine-tuned on instructions for answering multiple-choice questions, so its C-Eval score is basically at baseline level. Use it as a reference if needed. For the C-Eval evaluation code, see: eval/c_eavl.ipynb

category	correct	question_count	accuracy
Humanities	63	257	24.51%
Other	89	384	23.18%
STEM	89	430	20.70%
Social Science	72	275	26.18%
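The accuracy column follows directly from the counts. The snippet below recomputes it and also derives an overall accuracy, which is a figure calculated here and not one reported by the project. C-Eval questions are four-option multiple choice, so random guessing would score about 25%.

```python
# (correct, question_count) per category, copied from the table above
results = {
    "Humanities": (63, 257),
    "Other": (89, 384),
    "STEM": (89, 430),
    "Social Science": (72, 275),
}


def accuracy(correct: int, total: int) -> float:
    return correct / total


for category, (correct, total) in results.items():
    print(f"{category}: {accuracy(correct, total):.2%}")

total_correct = sum(correct for correct, _ in results.values())
total_questions = sum(total for _, total in results.values())
overall = accuracy(total_correct, total_questions)
print(f"Overall: {total_correct}/{total_questions} = {overall:.2%}")
```

The overall figure sits slightly below the 25% guessing rate, which matches the description of the score as baseline level.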

4. Citation

If you find this project helpful, please cite it.

@misc{Charent2023,
    author={Charent Chen},
    title={A small chinese chat language model with 0.2B parameters base on T5},
    year={2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/charent/ChatLM-mini-Chinese}},
}

5. Other matters

This project accepts no liability for data security or public opinion risks arising from the open source model and code, or for any risks arising from the model being misled, abused, disseminated, or improperly exploited.
