CareGPT Download - CareGPT Source code download

CareGPT


⚡Features:

Added a ChatGPT fine-tuning implementation; users with API credits are encouraged to run fine-tuning experiments on ChatGPT;
Supports deploying fine-tuned models with ChatGPT-Next-Web;
Supports deploying fine-tuned models with Gradio;
Supports training the full LLaMA and LLaMA-2 model series;
Supports LoRA and QLoRA, as well as the subsequent PPO and DPO reinforcement learning stages;
Supports question answering that combines the model with a knowledge base;
Open-sourced medical guidance material for more than 60 hospital departments;
Developed a tool that distills medical data from GPT-4/ChatGPT and can batch-generate data for building knowledge bases and for fine-tuning;
Aggregates a large collection of open source medical LLMs, medical data for LLM training, LLM deployment resources, LLM evaluation resources, and related LLM material;
Our medical LLM IvyGPT took part in the CMB leaderboard evaluation, where it ranked ahead of ChatGPT and a number of open source medical LLMs;
We have open-sourced several medical LLMs trained from different base LLMs on our own datasets; they can be downloaded and tried directly;

Dataset

Pre-training data

LLM-Pretrain-FineTune/data_pretrain
MedicalGPT/pretrain
zyj
TCM-Ancient-Books (nearly 700 ancient Chinese medicine texts)
epfl-llm/guidelines

Supervised training data

icliniq-10k(en)
HealthCareMagic-100k(en)
ShenNong_TCM_Dataset
✅ChatMed_Consult_Dataset
Chinese-medical-dialogue-data
cMedQA2
✅Huatuo-26M
webMedQA
PubMedQA
CMCQA
✅QiZhenGPT
✅LLM-Pretrain-FineTune/data_sft
Medical-Dialogue-System
IMCS-V2
CHIP-MDCFNPC
MedDG
✅HuatuoGPT-sft-data-v1
MedicalGPT/finetune
✅shibing624/medical
medAlpaca/data
✅Zhongjing/sft
medical_dialog
huatuo_encyclopedia_qa
Med-ChatGLM/data
CMB
GenMedGPT-5k(en)
Alpaca-CoT(general)
✅DISC-Med-SFT
✅HuatuoGPT2_sft_instruct
FreedomIntelligence/Medbase_data
openmedlab/Awesome-Medical-Dataset

Reward training data

MedicalGPT/reward
Zhongjing/rw
comparison_gpt4_data
HH-RLHF
UltraFeedback

Full process training

1. Install dependencies

conda create -n llm python=3.11
conda activate llm
python -m pip install -r requirements.txt

LLaMA model download: https://blog.csdn.net/u014297502/article/details/129829677

# Convert to HF format
python -m transformers.models.llama.convert_llama_weights_to_hf \
    --input_dir path_to_llama_weights --model_size 7B --output_dir path_to_llama_model

LLaMA-2 model download: https://huggingface.co/meta-llama

2. Data configuration

Dataset configuration and the PT, SFT, and RW data formats

dataset_info

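If you use a custom dataset, be sure to define it in the dataset_info.json file in the following format.

"dataset name": {
  "hf_hub_url": "Project URL on the HuggingFace Hub (if specified, the following three parameters are ignored)",
  "script_url": "Name of the local folder containing the data loading script (if specified, the following two parameters are ignored)",
  "file_name": "Name of the dataset file in this directory (required if the above parameters are not specified)",
  "file_sha1": "SHA-1 hash of the dataset file (optional)",
  "columns": {
    "prompt": "Column name in the dataset that holds the prompt (default: instruction)",
    "query": "Column name in the dataset that holds the query (default: input)",
    "response": "Column name in the dataset that holds the response (default: output)",
    "history": "Column name in the dataset that holds the dialogue history (default: None)"
  }
}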

The prompt and response columns should be non-empty strings. The content of the query column will be concatenated with the prompt column as model input. The history column should be a list, where each element is a string tuple representing the user request and model reply respectively.
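The column rules above can be checked before training. The following is a minimal sketch, not part of CareGPT: the function name `build_model_input` is hypothetical, it assumes the default column names (instruction, input, output, history), and the newline used to join prompt and query is an assumption, since the text only says the two are concatenated. It applies to SFT records, where the response is a single string.

```python
from typing import Any


def build_model_input(record: dict[str, Any]) -> str:
    """Validate one SFT record and return prompt + query as the model input.

    Hypothetical helper; the "\n" separator is an assumption.
    """
    prompt = record.get("instruction", "")
    response = record.get("output", "")
    query = record.get("input", "") or ""

    # The prompt and response columns must be non-empty strings.
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt column must be a non-empty string")
    if not isinstance(response, str) or not response.strip():
        raise ValueError("response column must be a non-empty string")

    # The history column is a list of (user request, model reply) string pairs.
    for turn in record.get("history") or []:
        if len(turn) != 2 or not all(isinstance(t, str) for t in turn):
            raise ValueError("each history element must be a pair of strings")

    # The query column, when present, is appended to the prompt.
    return f"{prompt}\n{query}" if query.strip() else prompt
```

Running this over every record of a custom dataset surfaces empty prompts or malformed history entries before a long training job starts.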

PT example data

Plain .txt format, one unsupervised sample per line.

Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks.
Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
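As an illustration of the one-sample-per-line format, the sketch below reads such a file into a list of strings. The function name `load_pretrain_samples` is hypothetical, and skipping blank lines is an assumption rather than documented CareGPT behavior.

```python
def load_pretrain_samples(path: str) -> list[str]:
    """Read a PT corpus file where each non-empty line is one sample."""
    samples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if text:  # assumption: blank lines carry no sample
                samples.append(text)
    return samples
```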

SFT example data 1

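[
  {
    "instruction": "That sounds great. In what areas might artificial intelligence face challenges?",
    "input": "",
    "output": "The challenges facing artificial intelligence include data privacy, security, and ethical issues, as well as automation that affects employment opportunities.",
    "history": [
      ["Hello, can you help me answer a question?", "Of course, what is your question?"],
      ["I would like to understand the future direction of artificial intelligence. Do you have any thoughts?", "Future directions for artificial intelligence may include more powerful machine learning algorithms, more advanced natural language processing, and more intelligent robots."]
    ]
  }
]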

SFT example data 2

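[
  {
    "instruction": "That sounds great. In what areas might artificial intelligence face challenges?",
    "input": "",
    "output": "The challenges facing artificial intelligence include data privacy, security, and ethical issues, as well as automation that affects employment opportunities.",
    "history": []
  }
]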
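The two SFT examples differ only in whether history is populated. One way to see how the fields relate is to flatten a record into an ordered list of chat turns. This is a sketch under assumptions: `to_messages` is a hypothetical helper, and the role/content message shape is borrowed from the API test request later in this document, not from the training code itself.

```python
def to_messages(record: dict) -> list[dict]:
    """Flatten one SFT record into user/assistant turns, oldest first."""
    messages = []
    # Earlier rounds come from the history column.
    for user_turn, model_turn in record.get("history") or []:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": model_turn})

    # The current round is instruction (+ optional input) and output.
    user_text = record["instruction"]
    extra = (record.get("input") or "").strip()
    if extra:
        user_text = f"{user_text}\n{extra}"  # separator is an assumption
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": record["output"]})
    return messages
```

A record with two history pairs yields six messages; a record with an empty history yields two.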

RW example data

[
  {
    "instruction": "Generate three verbs that mean the same as “apologize”",
    "input": "",
    "output": [
      "Admit, express regret, make amends.",
      "Apologize"
    ]
  }
]
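In the RW format the output field holds two candidate answers instead of one. The sketch below turns such a record into a preference pair. It assumes that the first answer is the preferred one and the second the rejected one; the example above does not state this ordering, so verify it against the training code. The helper name `to_preference_pair` is hypothetical.

```python
def to_preference_pair(record: dict) -> dict:
    """Split an RW record into prompt, chosen answer, and rejected answer.

    Assumption: output[0] is preferred over output[1].
    """
    outputs = record["output"]
    if not isinstance(outputs, list) or len(outputs) != 2:
        raise ValueError("RW records need exactly two candidate answers")
    chosen, rejected = outputs

    prompt = record["instruction"]
    extra = (record.get("input") or "").strip()
    if extra:
        prompt = f"{prompt}\n{extra}"  # separator is an assumption
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```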

3. Training configuration

Training parameters and instructions

Configure distributed

Check whether your GPUs are connected via NVLink. Parallel training with accelerate is only effective when the GPUs are linked by NVLink.

nvidia-smi topo -m

accelerate config # configure the environment
accelerate launch src/train_bash.py # arguments (same as the training commands in this section)
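The NVLink check can also be scripted. The sketch below scans the text printed by nvidia-smi topo -m for matrix cells such as NV1 or NV2, which is how that command marks NVLink connections. The function names are hypothetical, and the parsing is a heuristic over the printed table, not an official API.

```python
import re
import subprocess


def has_nvlink(topo_text: str) -> bool:
    """Return True if any cell of the topology matrix looks like NV<number>."""
    # The legend line uses "NV#", which this pattern deliberately does not match.
    return re.search(r"\bNV\d+\b", topo_text) is not None


def query_topology() -> str:
    """Run `nvidia-smi topo -m` and return its output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a GPU machine, `has_nvlink(query_topology())` gives a quick yes/no answer before deciding whether to launch multi-GPU training.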

Supervised training

# LLaMA-2
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --do_train \
    --dataset mm \
    --finetuning_type lora \
    --quantization_bit 4 \
    --overwrite_cache \
    --output_dir output \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --fp16 \
    --template llama2 \
    --lora_target q_proj,v_proj

# LLaMA
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Llama-7b-hf \
    --do_train \
    --dataset mm,hm \
    --finetuning_type lora \
    --overwrite_cache \
    --output_dir output-1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 5e-5 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --fp16 \
    --template default \
    --lora_target q_proj,v_proj
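The batch size seen by each optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The short sketch below works this out for the two commands above; the function name is hypothetical and the GPU count is an assumption, since it depends on how accelerate is configured.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_device * grad_accum * num_gpus


# LLaMA-2 command: batch size 8, accumulation 4, single GPU assumed.
print(effective_batch_size(8, 4))  # 32
# LLaMA command: batch size 4, accumulation 4, single GPU assumed.
print(effective_batch_size(4, 4))  # 16
```

If GPU memory is tight, halving the per-device batch size while doubling the accumulation steps keeps this product unchanged.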

Reinforcement learning

# LLaMA-2, DPO
accelerate launch src/train_bash.py \
    --stage dpo \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --do_train \
    --dataset rlhf \
    --template llama2 \
    --finetuning_type lora \
    --quantization_bit 4 \
    --lora_target q_proj,v_proj \
    --resume_lora_training False \
    --checkpoint_dir ./output-2 \
    --output_dir output-dpo \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

4. Inference configuration

Inference parameters and instructions

Web access

# LLaMA-2
python src/web_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output \
    --finetuning_type lora \
    --template llama2

# LLaMA
python src/web_demo.py \
    --model_name_or_path ./Llama-7b-hf \
    --checkpoint_dir output-1 \
    --finetuning_type lora \
    --template default

# DPO
python src/web_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output-dpo \
    --finetuning_type lora \
    --template llama2

API access

# LLaMA-2
python src/api_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output \
    --finetuning_type lora \
    --template llama2

# LLaMA
python src/api_demo.py \
    --model_name_or_path ./Llama-7b-hf \
    --checkpoint_dir output-1 \
    --finetuning_type lora \
    --template default

# DPO
python src/api_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output-dpo \
    --finetuning_type lora \
    --template llama2

Test API:

curl -X 'POST' \
    'http://127.0.0.1:8888/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "string",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "temperature": 0,
    "top_p": 0,
    "max_new_tokens": 0,
    "stream": false
  }'
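The same test request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above field by field; the helper names are hypothetical, the URL and port are copied from that call, and the shape of the JSON response is not assumed beyond it being valid JSON.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8888/v1/chat/completions"  # same address as the curl test


def build_chat_request(content: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the same payload as the curl example."""
    payload = {
        "model": "string",
        "messages": [{"role": "user", "content": content}],
        # Values below are copied from the curl example; tune them for real use.
        "temperature": 0,
        "top_p": 0,
        "max_new_tokens": 0,
        "stream": False,
    }
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {"accept": "application/json", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply (needs the API server running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API service started, `send(build_chat_request("Hello"))` returns the decoded reply as a dictionary.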

CLI access

# LLaMA-2
python src/cli_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output \
    --finetuning_type lora \
    --template llama2

# LLaMA
python src/cli_demo.py \
    --model_name_or_path ./Llama-7b-hf \
    --checkpoint_dir output-1 \
    --finetuning_type lora \
    --template default

# DPO
python src/cli_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output-dpo \
    --finetuning_type lora \
    --template llama2

Batch prediction

# LLaMA-2
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --do_predict \
    --dataset mm \
    --template llama2 \
    --finetuning_type lora \
    --checkpoint_dir output \
    --output_dir predict_output \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate

# LLaMA
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Llama-7b-hf \
    --do_predict \
    --dataset mm \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir output-1 \
    --output_dir predict_output \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate

Experimental evaluation (BLEU and ROUGE_CHINESE)

# LLaMA-2
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --do_eval \
    --dataset mm \
    --template llama2 \
    --finetuning_type lora \
    --checkpoint_dir output \
    --output_dir eval_output \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate

# LLaMA
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Llama-7b-hf \
    --do_eval \
    --dataset mm \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir output-1 \
    --output_dir eval_output \
    --per_device_eval_batch_size 8 \
    --max_samples 100 \
    --predict_with_generate

For 4/8-bit evaluation, it is recommended to use --per_device_eval_batch_size=1 and --max_target_length 128

5. Gradio deployment

Gradio deployment instructions

Model export

# LLaMA-2 (checkpoint_dir matches the output_dir of the LLaMA-2 SFT command)
python src/export_model.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --template llama2 \
    --finetuning_type lora \
    --checkpoint_dir output \
    --output_dir output_export

# LLaMA (checkpoint_dir matches the output_dir of the LLaMA SFT command)
python src/export_model.py \
    --model_name_or_path ./Llama-7b-hf \
    --template default \
    --finetuning_type lora \
    --checkpoint_dir output-1 \
    --output_dir output_export

Start running

cd Gradio
python app.py

6. ChatGPT-Next-Web deployment

Next deployment instructions

Start API service

# LLaMA-2
python src/api_demo.py \
    --model_name_or_path ./Llama-2-7b-chat-hf \
    --checkpoint_dir output \
    --finetuning_type lora \
    --template llama2

# LLaMA
python src/api_demo.py \
    --model_name_or_path ./Llama-7b-hf \
    --checkpoint_dir output-1 \
    --finetuning_type lora \
    --template default

Download Next and run

Download Next:

Modify configuration: install and open Next, open Settings, and change the API endpoint to http://127.0.0.1:8000/ (that is, the address of your API service); you can then start using it.

Practical experience

1. CareGPT does not add Chinese tokens to the vocabulary or retrain the tokenizer, yet the results are still promising;
2. The full LLM training pipeline consists of pre-training, supervised fine-tuning, reward modeling, and reinforcement learning. In most cases, supervised fine-tuning alone is enough to meet your needs;
3. When compute is sufficient, we recommend training on medical data together with general corpus data, so that the model learns medical knowledge while retaining general capabilities (such as instruction following);
4. Don't expect a single medical LLM to meet every need. A more reasonable approach may be a knowledge base updated in real time plus a fine-tuned medical LLM (as in ChatLaw);
5. The BLOOMZ model series was trained using the PILE corpus, which contains various medical texts, including PubMed Central and PubMed Abstracts. These texts have greatly enriched the medical knowledge of the BLOOMZ models, so many open source projects prefer BLOOMZ as the base model for medical fine-tuning;
6. (2023.08.26) ChatGPT was trained on top of a code-pretrained GPT. Would fine-tuning CodeLLaMA on downstream tasks give better results than fine-tuning LLaMA-1/2?
7. Our recent work, together with many recently published studies, shows that in the LLM era data quality matters more than quantity. See, for example, "Less is More! SJTU Qingyuan && Lehigh | Fine-tuning a model with 200 samples to surpass MiniGPT-4!". Ultra-large-scale SFT data can weaken an LLM on downstream tasks or cause it to lose ICL, CoT, and other capabilities;
8. For vertical-domain models, perhaps we should pay more attention to the PT stage instead of collecting tens of millions of SFT samples. Our suggestion is: large-scale pre-training + small-scale supervised fine-tuning = a very strong LLM;
9. A good pre-trained medical LLM has not yet been released in the open source community, and we hope someone will fill this gap;
10. Does pre-training inject knowledge, while supervised fine-tuning only activates domain capabilities (and cannot add knowledge)? Should the knowledge used in pre-training echo the knowledge used in supervised fine-tuning? Will tens of GB of continued pre-training corpus be drowned out by the trillions of tokens of knowledge in the original pre-trained model?
11. Secondary pre-training on a large amount of data must be mixed with other types of data: (1) once language model training is complete, the role of each region of the parameters is fixed. Adding a large amount of knowledge that was absent during pre-training changes the parameters by a large amplitude and degrades the capability of the whole language model; (2) for secondary pre-training on large-scale data, data from the original pre-training distribution amounting to 5-10 times the new data must be mixed in and trained together;
12. The instruction fine-tuning stage should not run for too many epochs: (1) training for multiple epochs on a small amount of data may alter the key language regions of the model and cause the whole model to fail; (2) when fine-tuning on instructions to improve a specific task, general instruction fine-tuning data or pre-training data must be added so that the key regions behind the model's language capabilities are not significantly changed;
13. Noise in the training data must be strictly controlled: (1) even a small amount of continuous noise in the pre-training data, such as runs of repeated words or non-word sequences, may shift specific dimensions and cause the model's overall PPL to fluctuate significantly; (2) if the supervised fine-tuning data contains many instruction fragments that do not match the original large language model, the model may likewise adjust specific dimensions, significantly reducing its overall performance;
14. When a large model is fine-tuned on mixed data covering multiple capabilities, the capabilities conflict in the high-resource setting and benefit one another in the low-resource setting, so mixing different data for fine-tuning requires some engineering skill;
15. Generally speaking, there is a non-negligible performance gap between LoRA and full fine-tuning (for example, LoRA results are 4-6% lower than full fine-tuning);
16. For 7B-scale models, prefer full-parameter fine-tuning. LoRA, QLoRA, and similar methods can be used for models with 13B parameters or more;
17. Even after a very large model is quantized, its capabilities are still well preserved;
18. Although LLM training (like all model training on GPUs) involves unavoidable randomness, the results of multiple training runs are still very consistent;
19. If you are limited by GPU memory, QLoRA offers a cost-effective compromise: it saves 33% of memory at the cost of a 39% increase in running time;
20. When fine-tuning an LLM, the choice of optimizer is not a major factor. Whether you use AdamW, SGD with a scheduler, or AdamW with a scheduler, the impact on the results is minimal;
21. Adam is often considered memory-intensive because it keeps two extra values for each model parameter, but this does not significantly affect the peak memory requirement of an LLM, because most memory is used for large matrix multiplications rather than for holding the extra values;
22. On a static dataset, iterating many times, as in multi-epoch training, may not work well. It often leads to overfitting and worse results;
23. If you use LoRA, make sure it is applied to all layers, not just the Key and Value matrices, to maximize model performance;
24. It is crucial to tune the LoRA rank and choose an appropriate α value. As a rule of thumb, try setting α to twice the rank;
25. A single GPU with 14GB of memory can efficiently fine-tune a 7-billion-parameter model in a few hours. With a static dataset, it is impossible to turn an LLM into an "all-rounder" that performs well on every benchmark task. Solving this requires more diverse data sources or techniques other than LoRA;
26. According to the recommendations of the NeurIPS workshop, as of December 18, 2023, the recommended base models for fine-tuning are: for English under 10B, Mistral-7B; for Chinese under 10B, Yi-6B; above 10B, Qwen-14B and Yi-34B;

Important

Everyone is welcome to share new experience in the Issues!

Items 11-13 are drawn from the article "In a 13-billion-parameter large language model, changing just one weight completely destroys its language ability! The latest research from the Natural Language Processing Laboratory of Fudan University".

Item 14 is drawn from "How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition".

Items 17-25 are drawn from the Chinese-language commentary on "LLM Optimization: Layer-wise Optimal Rank Adaptation (LORA)".

Model open source

Stage	Weights description	Download address	Features	Base model	Fine-tuning method	Dataset
Supervised fine-tuning	Trained on multi-turn dialogue data, based on LLaMA2-7b-Chat	CareLlama2-7b-chat-sft-multi, CareLlama2-7b-multi	Excellent multi-turn conversation skills	LLaMA2-7b-Chat	QLoRA	mm
Supervised fine-tuning	Trained on rich, efficient doctor-patient dialogue data, based on LLaMA2-7b-Chat	CareLlama2-7b-chat-sft-med	Excellent patient disease diagnosis capabilities	LLaMA2-7b-Chat	QLoRA	hm
supervise


Additional Information

Version 1.0.0
Type AI Source Code
Update Time 2024-12-09
size 22.13MB
From Github
