zero_nlp
1.0.0
- Goal: build an out-of-the-box NLP training framework for the Chinese field on top of pytorch and transformers, providing a full set of solutions for training and fine-tuning models (including large models, text-to-vector/embedding models, text generation, multi-modal models, and more);
- Data: roughly 100 GB of training data;
- Workflow: each project comes with the complete model training steps, such as data cleaning, data processing, model construction, model training, model deployment, and model illustration;
- Models: currently supports gpt2, clip, gpt-neox, dolly, llama, chatglm-6b, VisionEncoderDecoderModel, and other multi-modal large models;
- Multi-GPU chaining: most large models are now far larger than the video memory of a single consumer-grade graphics card, so several cards must be chained together to train and deploy them; some model structures were therefore modified to support multi-card operation both at training time and at inference time (a minimal sharded-loading sketch follows this list);
- Model tools: added vocabulary pruning and vocabulary expansion tutorials for large models in model_modify (a minimal vocabulary-expansion sketch appears after the table below).
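The repo's multi-card support comes from modified model structures; as a rough, generic illustration of the underlying idea of splitting one model across several GPUs, the sketch below uses the stock transformers/accelerate `device_map="auto"` loading path. The model id is a placeholder, and this is not the repo's modified code.

```python
# Minimal sketch: shard one large causal LM across all visible GPUs for inference.
# This uses the generic transformers + accelerate device_map mechanism, not the
# repo's modified model code; the model id below is only a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in any supported model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so layers fit on consumer cards
    device_map="auto",          # let accelerate place layers on gpu0, gpu1, ...
)

inputs = tokenizer("你好，请介绍一下你自己。", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```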
Project | Folder | Data | Data cleaning | Large model | Model deployment | Illustration |
---|---|---|---|---|---|---|
Chinese text classification | chinese_classifier | ✅ | ✅ | ✅ | ✅ | |
Chinese gpt2 | chinese_gpt2 | ✅ | ✅ | ✅ | ✅ | |
Chinese clip | chinese_clip | ✅ | ✅ | ✅ | ✅ | |
Image generation Chinese text | VisionEncoderDecoderModel | ✅ | ✅ | ✅ | ✅ | |
Introduction to vit core source code | vit model | ✅ | | | | |
Thu-ChatGlm-6b (v1, obsolete) | simple_thu_chatglm6b | ✅ | ✅ | ✅ | ✅ | |
chatglm-v2-6b | chatglm_v2_6b_lora | ✅ | ✅ | ✅ | | |
Chinese dolly_v2_3b | dolly_v2_3b | ✅ | ✅ | ✅ | ||
Chinese llama (obsolete) | chinese_llama | ✅ | ✅ | ✅ | ||
Chinese bloom | chinese_bloom | ✅ | ✅ | ✅ | ||
Chinese falcon (note: the falcon model is similar to the bloom structure) | chinese_bloom | ✅ | ✅ | ✅ | ||
Chinese pre-training code | model_clm | ✅ | ✅ | ✅ | ||
Baichuan large model | model_baichuan | ✅ | ✅ | ✅ | ✅ | |
Model trimming ✂️ | model_modify | ✅ | ✅ | ✅ | | |
llama2 pipeline parallelism | pipeline | ✅ | ✅ | ✅ | ||
Baichuan2-7b-chat DPO | DPO baichuan2-7b-chat | ✅ | ✅ | ✅ | | |
Varying the data mix proportions during training | train_data_sample | ✅ | ✅ | ✅ | | |
internlm-base sft | internlm-sft | ✅ | ✅ | ✅ | ||
train qwen2 | train_qwen2 | ✅ | ✅ | ✅ | ✅ | |
train llava | train_llava | ✅ | ✅ | ✅ | ✅ | ✅ |
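For the vocabulary pruning and expansion tutorials mentioned above (model_modify), the sketch below shows only the expansion direction using standard transformers APIs: add new tokens to the tokenizer, then resize the model's embedding matrix. The model id and the new tokens are placeholders, not taken from the repo's tutorial.

```python
# Minimal vocabulary-expansion sketch using standard transformers APIs.
# Not the repo's tutorial code; model id and new tokens are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; the same steps apply to larger models

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Register domain-specific tokens that the original vocabulary is missing.
new_tokens = ["<domain_term_1>", "<domain_term_2>"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens, new vocab size = {len(tokenizer)}")

# Grow the (tied) embedding matrix to match the enlarged vocabulary; the new
# rows are randomly initialized and must be learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

Vocabulary pruning goes in the opposite direction: keep only a subset of tokens and slice out the corresponding embedding rows to shrink the model; see model_modify for the repo's actual procedure.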
I have always felt that data flow is expressed most clearly in diagram form, so I try to provide a diagram for every task.
I have also been doing source-code walkthroughs of the transformers library; you can find the videos by Liangmulu Programmer on Bilibili.