DL-C & partial DL-D demonstration • AI Society
This open-source release is DL-B, a digital avatar solution based on ChatGLM, Wav2Lip, and So-VITS-SVC. The code base was written in mid-March 2023 and has not been optimized or updated since.
This project is currently entered in a competition, which moves to the provincial stage in late June. The project team is working on further optimization and improvement of DL-C and on the testing and development of DL-D. No code or details about DL-C and DL-D will be released until the competition ends; the code and detailed framework will be organized and published afterwards. Thank you for your understanding.
The current code is fairly crude. I am a second-year undergraduate majoring in finance, with no aesthetic sense or engineering skill in code writing (mostly copy-and-paste), so please go easy on the criticism.
After the competition, the project will be taken over by the AI Society, which will produce a more user-friendly framework along with an all-in-one "lazy" installation package.
The hardware used to build DL-B is listed below as a reference (suggestions for lower-spec configurations that still run are welcome):
| Graphics card | CPU | Memory | Hard disk |
| --- | --- | --- | --- |
| RTX 3060 12 GB | Intel i5-12400F | 16 GB | 30 GB |
The test environment is based on Python 3.9.13 (64-bit). Install the dependencies with pip:

```
pip install -r requirements.txt
```
Note that you still need to download a separate Python 3.8 environment package for running So-VITS (click the environment package link). Don't worry, it is already configured; just download it, unzip it into the DL-B folder, and keep the file layout as follows:
```
DL-B
├───python3.8
├───Lib
├───libs
├───···
└───Tools
```
In addition, you also need to install FFmpeg. If you don't want to install it manually, you can try the lazy package we provide.
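If you are not sure whether FFmpeg is actually reachable from your environment, a quick check such as the sketch below can save debugging time later (this snippet is only illustrative and is not part of the DL-B code base):

```python
import shutil
import subprocess

# Illustrative sanity check (not part of DL-B): confirm an ffmpeg executable
# is on PATH before running the audio/video processing steps.
ffmpeg = shutil.which("ffmpeg")
if ffmpeg is None:
    raise RuntimeError("ffmpeg not found on PATH; install it or use the lazy package")
version_line = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True).stdout.splitlines()[0]
print(version_line)
```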
ChatGLM offers many fine-tuning methods, and you can choose the one that fits your situation. The Tsinghua University team gives a detailed explanation of fine-tuning ChatGLM with P-tuning. There is also a good fine-tuning example repository on GitHub that uses Zhen Huan as the fine-tuning example. That repository contains the P-tuning fine-tuning code, but does not include the ChatGLM pre-trained model.
The program will automatically download the model implementation and parameters via transformers. The complete model implementation is available on the Hugging Face Hub. If your network is slow, downloading the model parameters may take a long time or even fail; in that case you can download the model locally first and then load it from the local path.
To download the model from the Hugging Face Hub, install Git LFS first and then run:

```
git clone https://huggingface.co/THUDM/chatglm-6b
```
If downloading the checkpoint from the Hugging Face Hub is slow, you can download only the model implementation:

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm-6b
```
Then manually download the model parameter files from here and replace the downloaded files in the local `module/chatglm-6b` directory.
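As a quick check that the local copy works, you can load ChatGLM-6B from that directory using the standard calls from the official ChatGLM-6B examples. This is only a minimal sketch; `module/chatglm-6b` is the local path described above, and you should adjust it if your layout differs:

```python
from transformers import AutoTokenizer, AutoModel

# Load ChatGLM-6B from the local directory instead of pulling from the Hub.
local_path = "module/chatglm-6b"  # adjust if your local path differs
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True).half().cuda()
model = model.eval()

# Simple smoke test of the chat interface.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```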
The model implementation is still evolving. If you want to pin the model implementation to ensure compatibility, you can run:

```
git checkout v0.1.0
```
Use your own dataset
Here is a helpful repository for collecting a personal corpus.
Modify `train_file`, `validation_file` and `test_file` in `train.sh` and `evaluate.sh` to the path of your own JSON-format dataset, and change `prompt_column` and `response_column` to the keys of the input text and output text in the JSON file. You may also need to increase `max_source_length` and `max_target_length` to match the maximum input and output lengths in your own dataset.
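For reference, a single-turn dataset is simply one JSON object per line. The sketch below is only illustrative (the keys `prompt` and `response` and the file name `train.json` are assumptions; they must match whatever you pass to `prompt_column`, `response_column`, and `train_file`):

```python
import json

# Illustrative single-turn samples; replace these with your own corpus.
samples = [
    {"prompt": "你好,你是谁?", "response": "我是DL-B的数字形象。"},
    {"prompt": "今天天气怎么样?", "response": "抱歉,我还不能联网查询天气。"},
]

# Write one JSON object per line, matching prompt_column="prompt" and response_column="response".
with open("train.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```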
Conversation dataset
If you need to fine-tune the model with multi-turn conversation data, you can provide the chat history. For example, the following is the training data for a three-turn conversation:
```
{"prompt": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "response": "用电脑能读数据流吗?水温多少", "history": []}
{"prompt": "95", "response": "上下水管温差怎么样啊?空气是不是都排干净了呢?", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"]]}
{"prompt": "是的。上下水管都好的", "response": "那就要检查线路了,一般风扇继电器是由电脑控制吸合的,如果电路存在断路,或者电脑坏了的话会出现继电器不吸合的情况!", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"], ["95", "上下水管温差怎么样啊?空气是不是都排干净了呢?"]]}
```
During training, you need to specify `--history_column` as the key of the chat history in the data (`history` in this example); the chat history will be concatenated automatically. Note that content exceeding the input length `max_source_length` will be truncated.
You can refer to the following command:

```
bash train_chat.sh
```
Of course, you can also mix multi-turn and single-turn dialogue corpora; just add dialogue entries of the following form on top of the data above:
```
{"prompt": "老刘,你知道那个作业要怎么从电脑上保存到手机上吗?", "response": "我也不知道啊", "history": []}
```
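If you build such data from your own chat transcripts, the only subtlety is accumulating the `history` field turn by turn. Here is a small helper sketch (not part of the repository, shown only to illustrate the format above):

```python
import json

# Illustrative helper: turn an ordered list of (prompt, response) pairs into
# multi-turn training rows with the accumulated "history" field shown above.
def conversation_to_rows(turns):
    rows, history = [], []
    for prompt, response in turns:
        rows.append({"prompt": prompt, "response": response, "history": list(history)})
        history.append([prompt, response])
    return rows

turns = [("你好", "你好,有什么可以帮你?"), ("今天是周几?", "抱歉,我无法获取日期。")]
with open("train_chat.json", "w", encoding="utf-8") as f:
    for row in conversation_to_rows(turns):
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```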
So-VITS is already a very popular and mature model, and there are many tutorial videos on Bilibili, so I won't go into detail here. Here are the tutorials that I consider the highest quality. This repository contains the code for basic So-VITS training and clustering training, but it is not very user-friendly, and the So-VITS part of DL-B has not been changed since it was completed in March. Note that this repository does not include tools for data processing and preliminary preparation.
Some model files still need to be supplied: `checkpoint_best_legacy_500.pt`, placed under `hubert`, and the two matching pre-trained models `G_0.pth` and `D_0.pth`, placed under the `pre_trained_model` folder in `./module/So-VITS`.
This is an older method; many optimizations have been made in newer frameworks. This version is based on the original Wav2Lip, and users can choose different pre-trained model weights. The model here is a required download and is placed in the `./module/wav2lip` folder.
| Model | Description | Link |
| --- | --- | --- |
| Wav2Lip | Highly accurate lip sync | Link |
| Wav2Lip + GAN | Slightly inferior lip sync, but better visual quality | Link |
| Expert Discriminator | | Link |
| Visual Quality Discriminator | | Link |
Note that this repository requires some videos, which can be recorded with a phone, computer, or camera; they are used to collect facial information. The recommended format is `.mp4` at `720p` or `480p` resolution. A single video is usually 5–10 s, and multiple videos can be recorded. Store the video files in the `source` folder.
As for optimizing Wav2Lip, many experts on Bilibili have already covered it, so I won't go into detail (laziness). Here is a video.
Note that in addition to the above, you also need to download the model `s3fd.pth`, which is used during inference, and place it in the `./face_detection/detection/sfd` folder.
This repository does not contain any models!! It cannot be used straight after cloning!! You must train the models yourself.
The source code needs to be changed in the following places:
Place all fine-tuned models into the corresponding folders under `module`. Copy all the files that P-tuning training writes to `output` into the corresponding local `output` folder. `So-VITS/44k` stores the So-VITS training models, and the Wav2Lip + GAN model goes in the `wav2lip` folder.
In line 32 of `main_demo.py`, change `CHECKPOINT_PATH` to your own fine-tuned model path:

```python
prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
```
Note that you may need to change `pre_seq_len` to the actual value used during your training. If you load the model locally, you also need to change `THUDM/chatglm-6b` to the local model path (note: not the checkpoint path).
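For context, the default (new-checkpoint) loading path in the official ChatGLM-6B P-tuning example looks roughly like the sketch below; `CHECKPOINT_PATH` and `pre_seq_len=128` are placeholders that you would replace with your own output directory and training value:

```python
import os
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Placeholders: point CHECKPOINT_PATH at your P-tuning output folder and set
# pre_seq_len to the value actually used during training.
CHECKPOINT_PATH = "output/your-ptuning-checkpoint"
config = AutoConfig.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True, pre_seq_len=128)
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", config=config, trust_remote_code=True)

# Load only the PrefixEncoder weights from the P-tuning checkpoint.
prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
```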
By default, the source code loads a new checkpoint (containing only the PrefixEncoder parameters). If you need to load an old checkpoint (containing both the ChatGLM-6B and PrefixEncoder parameters), or you performed full-parameter fine-tuning, load the entire checkpoint directly:
```python
model = AutoModel.from_pretrained(CHECKPOINT_PATH, trust_remote_code=True)
```
Add the model path and speaker name in `So-VITS_run.py` (depending on your training settings):
```python
parser.add_argument('-m', '--model_path', type=str, default="", help='Model path')
parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=[''], help='Name of the target speaker to synthesize')
```
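For example, with assumed values (the checkpoint name `G_30400.pth`, the `./module/So-VITS/44k` location, and the speaker name `my_speaker` are placeholders; use whatever your So-VITS training actually produced), the two lines might become:

```python
# Hypothetical example values; substitute your own checkpoint path and speaker name.
parser.add_argument('-m', '--model_path', type=str,
                    default="./module/So-VITS/44k/G_30400.pth", help='Model path')
parser.add_argument('-s', '--spk_list', type=str, nargs='+',
                    default=['my_speaker'], help='Name of the target speaker to synthesize')
```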
You also need to make changes in `wav2lip_run.py`:
```python
#VIDEO
face_dir = "./source/"
```
The video loaded here is the one recorded earlier; you can write your own video-selection logic.
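A minimal selection scheme, assuming the recordings sit in `./source/` as described earlier, could simply pick the most recently recorded `.mp4` file (the variable name `face_video` is hypothetical; adapt it to however `wav2lip_run.py` consumes the video):

```python
import glob
import os

# Illustrative sketch: use the most recently modified .mp4 in ./source/.
# Replace this with your own selection logic if you record multiple takes.
candidates = glob.glob(os.path.join("./source/", "*.mp4"))
if not candidates:
    raise FileNotFoundError("No .mp4 files found in ./source/")
face_video = max(candidates, key=os.path.getmtime)
print("Using video:", face_video)
```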
If nothing goes wrong, just run `main_demo.py` directly in VS Code or another editor. Have fun, everyone.
The code in this repository is open-sourced under the GNU GPLv3 license. The use of each model's weights must follow its own open-source license.