Since the emergence of large language models (LLMs) represented by ChatGPT, whose striking general-purpose capabilities point toward artificial general intelligence (AGI), a wave of research and applications has swept the field of natural language processing. In particular, after smaller-scale LLMs such as ChatGLM and LLaMA were open-sourced and became runnable by ordinary users, many LLM-based fine-tuned models and applications have appeared. This project aims to collect and organize open-source models, applications, datasets, and tutorials related to Chinese LLMs; the resources included so far already exceed 100!
If this project helps you even a little, please give it a star ~
You are also welcome to contribute open-source models, applications, datasets, and other resources that this project has not yet included. To add a new repository, please open a PR and provide the repository link, number of stars, a brief description, and other related information following the format of this project. Thank you ~
Overview of common base models:

| Base model | Included models | Parameter sizes | Training tokens | Max context length | Commercial use |
| --- | --- | --- | --- | --- | --- |
| ChatGLM | ChatGLM/2/3/4 Base & Chat | 6B | 1T/1.4T | 2K/32K | Allowed |
| LLaMA | LLaMA/2/3 Base & Chat | 7B/8B/13B/33B/70B | 1T/2T | 2K/4K | Partially allowed |
| Baichuan | Baichuan/2 Base & Chat | 7B/13B | 1.2T/1.4T | 4K | Allowed |
| Qwen | Qwen/1.5/2/2.5 Base & Chat & VL | 7B/14B/32B/72B/110B | 2.2T/3T/18T | 8K/32K | Allowed |
| BLOOM | BLOOM | 1B/7B/176B-MT | 1.5T | 2K | Allowed |
| Aquila | Aquila/2 Base & Chat | 7B/34B | - | 2K | Allowed |
| InternLM | InternLM/2 Base & Chat | 7B/20B | - | 200K | Allowed |
| Mixtral | Mixtral Base & Chat | 8x7B | - | 32K | Allowed |
| Yi | Yi Base & Chat | 6B/9B/34B | 3T | 200K | Allowed |
| DeepSeek | DeepSeek Base & Chat | 1.3B/7B/33B/67B | - | 4K | Allowed |
| XVERSE | XVERSE Base & Chat | 7B/13B/65B/A4.2B | 2.6T/3.2T | 8K/16K/256K | Allowed |
Table of contents
1. Models
1.1 Text LLM models
1.2 Multimodal LLM models
2. Applications
2.1 Vertical-domain fine-tuning
Healthcare
Law
Finance
Education
Science and technology
E-commerce
Cybersecurity
Agriculture
2.2 LangChain applications
2.3 Other applications
3. Datasets
Pre-training datasets
SFT datasets
Preference datasets
4. LLM training and fine-tuning frameworks
5. LLM inference and deployment frameworks
6. LLM evaluation
7. LLM tutorials
LLM fundamentals
Prompt engineering tutorials
LLM application tutorials
Hands-on LLM tutorials
8. Related repositories
Star History
1. Models
1.1 Text LLM models
ChatGLM-6B:
Address: https://github.com/thudm/chatglm-6b
Introduction: One of the most effective open-source base models for Chinese, optimized for Chinese Q&A and dialogue. It was trained on roughly 1T tokens of bilingual Chinese-English data, supplemented by techniques such as supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback (RLHF).
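As a usage illustration (a minimal sketch assuming the THUDM/chatglm-6b checkpoint on the Hugging Face Hub and a CUDA GPU with enough memory for the FP16 weights; not an excerpt from the upstream README), ChatGLM-style checkpoints are typically loaded through transformers with trust_remote_code:

```python
# Minimal sketch: load ChatGLM-6B in half precision and run one chat turn.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# The remote code exposes a chat() helper that carries multi-turn history.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```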
ChatGLM2-6B
Address: https://github.com/thudm/chatglm2-6b
Introduction: The second-generation version of the open-source bilingual Chinese-English dialogue model ChatGLM-6B. While retaining many excellent features of the first-generation model, such as smooth dialogue and a low deployment threshold, it introduces GLM's hybrid objective function and has undergone pre-training on 1.4T Chinese and English tokens as well as human preference alignment; the context length of the base model has been extended to 32K, with an 8K context length used during dialogue-stage training; inference is more efficient with lower GPU memory usage; commercial use is allowed.
ChatGLM3-6B
Address: https://github.com/thudm/Chatglm3
Introduction: ChatGLM3-6B is the open-source model in the ChatGLM3 series. While retaining many excellent features of the previous two generations, such as smooth dialogue and a low deployment threshold, ChatGLM3-6B introduces the following: a more powerful base model, ChatGLM3-6B-Base, trained with more diverse data, more training steps, and a more reasonable training strategy; more complete functional support: ChatGLM3-6B adopts a newly designed prompt format and, in addition to normal multi-turn dialogue, natively supports complex scenarios such as tool calling (Function Call), code execution (Code Interpreter), and Agent tasks; a more comprehensive open-source series: besides the dialogue model ChatGLM3-6B, the base model ChatGLM3-6B-Base and the long-text dialogue model ChatGLM3-6B-32K are also open-sourced. All of the above weights are fully open for academic research, and free commercial use is also allowed after filling out a questionnaire.
GLM-4
Address: https://github.com/thudm/glm-4
Introduction: GLM-4-9B is the open-source version of the latest-generation pre-trained model series released by Zhipu AI. On evaluation datasets covering semantics, mathematics, reasoning, code, and knowledge, both GLM-4-9B and its human-preference-aligned version GLM-4-9B-Chat show excellent performance surpassing Llama-3-8B. Beyond multi-turn dialogue, GLM-4-9B-Chat also offers advanced features such as web browsing, code execution, custom tool calling (Function Call), and long-text reasoning (supporting up to 128K context). This generation adds multilingual support for 26 languages including Japanese, Korean, and German. The team also released GLM-4-9B-Chat-1M, which supports a 1M context length (about 2 million Chinese characters), and the multimodal model GLM-4V-9B based on GLM-4-9B. GLM-4V-9B supports bilingual Chinese-English multi-turn dialogue at a high resolution of 1120*1120 and, in multimodal evaluations covering comprehensive Chinese and English ability, perceptual reasoning, text recognition, and chart understanding, outperforms GPT-4-Turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
Qwen/Qwen1.5/Qwen2/Qwen2.5
Address: https://github.com/qwenlm
Introduction: Tongyi Qianwen (Qwen) is a series of large language models developed by Alibaba Cloud, with parameter scales including 1.8 billion (1.8B), 7 billion (7B), 14 billion (14B), 72 billion (72B), and 110 billion (110B). Each scale includes the base model Qwen and the dialogue model. The training data covers multiple data types such as text and code, spanning both general and professional domains, and the models support context lengths of 8K to 32K. Alignment data related to plug-in calling has been specifically optimized, so the current models can effectively call plug-ins and be upgraded into agents.
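As an illustration (not taken from the README; the model id below is an assumption), a Qwen chat checkpoint can be driven through the standard Hugging Face chat-template API:

```python
# Minimal sketch: run a Qwen instruct checkpoint with transformers' chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "用一句话介绍通义千问。"},
]
# apply_chat_template renders the messages into the prompt format the model expects.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```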
Introduction: SenseTime, Shanghai AI Laboratory, together with the Chinese University of Hong Kong, Fudan University, and Shanghai Jiao Tong University, released the hundred-billion-parameter-level large language model InternLM (书生·浦语). InternLM is reported to have 104 billion parameters and was trained on a "multilingual high-quality dataset containing 1.6 trillion tokens".
InternLM2
Address: https://github.com/internlm/internlm
Introduction: SenseTime, Shanghai AI Laboratory, together with the Chinese University of Hong Kong, Fudan University, and Shanghai Jiao Tong University, released the large language model InternLM2. InternLM2 has made significant progress in mathematics, code, dialogue, and creative writing, and its overall performance has reached a leading level among open-source models. InternLM2 comes in two sizes, 7B and 20B: the 7B model provides a lightweight yet capable option for research and applications, while the 20B model is more powerful overall and can effectively support more complex, practical scenarios.
Introduction: Baichuan-7B is a large-scale pre-trained language model developed by Baichuan Intelligence. Based on the Transformer architecture, the 7-billion-parameter model was trained on about 1.2 trillion tokens, supports both Chinese and English, and has a context window length of 4096. It achieves the best results among models of the same size on standard authoritative Chinese and English benchmarks (C-Eval/MMLU).
Introduction: Baichuan-13B is a large language model with 13 billion parameters developed by Baichuan Intelligence following Baichuan-7B. It achieves the best results among models of the same size on authoritative Chinese and English benchmarks. The project releases two versions: Baichuan-13B-Base and Baichuan-13B-Chat.
Introduction: Baichuan2 is the new generation of open-source large language models launched by Baichuan Intelligence, trained on a high-quality corpus of 2.6 trillion tokens. It releases 7B and 13B Base versions together with Chat versions trained with PPO, and provides 4-bit quantized Chat versions.
XVERSE-7B
Address: https://github.com/xverse-ai/xverse-7b
Introduction: A large language model supporting multiple languages, independently developed by Shenzhen Yuanxiang (XVERSE) Technology. It supports an 8K context length and was trained on 2.6 trillion tokens of high-quality, diverse data, supporting more than 40 languages including Chinese, English, Russian, and Spanish. GGUF and GPTQ quantized versions are also provided, with inference supported via llama.cpp and vLLM on macOS/Linux/Windows.
XVERSE-13B
Address: https://github.com/xverse-ai/xverse-13b
Introduction: A large language model supporting multiple languages, independently developed by Shenzhen Yuanxiang (XVERSE) Technology. It supports an 8K context length and was fully trained on 3.2 trillion tokens of high-quality, diverse data, supporting more than 40 languages including Chinese, English, Russian, and Spanish. The release includes the long-sequence dialogue model XVERSE-13B-256K, which supports a context window of up to 256K and can take roughly 250,000 characters of input, helping with tasks such as literature summarization and report analysis. GGUF and GPTQ quantized versions are also provided, with inference supported via llama.cpp and vLLM on macOS/Linux/Windows.
XVERSE-65B
Address: https://github.com/xverse-ai/xverse-65b
Introduction: A large language model supporting multiple languages, independently developed by Shenzhen Yuanxiang (XVERSE) Technology. It supports a 16K context length and was fully trained on 2.6 trillion tokens of high-quality, diverse data, supporting more than 40 languages including Chinese, English, Russian, and Spanish. The release includes the incrementally pre-trained XVERSE-65B-2 model. GGUF and GPTQ quantized versions are also provided, with inference supported via llama.cpp and vLLM on macOS/Linux/Windows, as sketched below.
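The XVERSE entries above list vLLM (alongside llama.cpp for the GGUF weights) as a supported inference backend; the following is a hedged sketch of the vLLM path, where the model id and prompt are illustrative assumptions:

```python
# Minimal sketch: offline generation with vLLM; requires a GPU and the vllm package.
from vllm import LLM, SamplingParams

llm = LLM(model="xverse/XVERSE-7B-Chat", trust_remote_code=True)  # hypothetical checkpoint id
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["请用一句话介绍大语言模型。"], params)
print(outputs[0].outputs[0].text)
```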
Introduction: A large language model supporting multiple languages, independently developed by Shenzhen Yuanxiang (XVERSE) Technology, with 4.2 billion activated parameters. It supports an 8K context length and was fully trained on 3.2 trillion tokens of high-quality, diverse data, supporting more than 40 languages including Chinese, English, Russian, and Spanish.
Skywork
Address: https://github.com/skyworkai/skywork
Introduction: This project open-sources the Tiangong (Skywork) series of models, pre-trained on 3.2TB of high-quality multilingual and code data. The open-source release includes model weights, training data, evaluation data, and evaluation methods, specifically the Skywork-13B-Base, Skywork-13B-Chat, Skywork-13B-Math, and Skywork-13B-MM models, plus quantized versions of each so that users can deploy and run inference on consumer GPUs.
Yi
Address: https://github.com/01-ai/yi
Introduction: This project open-sources the Yi-6B and Yi-34B models. The series supports a 200K ultra-long context window version that can handle roughly 400,000 characters of ultra-long text input, enough to understand PDF documents of more than 1,000 pages.
Introduction: Chinese LLaMA & Alpaca large language models plus local CPU/GPU deployment. Based on the original LLaMA, the Chinese vocabulary was expanded and secondary pre-training was performed on Chinese data.
Introduction: This project builds on the commercially usable LLaMA-2: Chinese-LLaMA2 continues pre-training LLaMA-2 on Chinese data, with the training scale to be gradually increased; Chinese-LLaMA2-Chat applies instruction fine-tuning and multi-turn dialogue fine-tuning to Chinese-LLaMA2 to adapt it to various application scenarios and multi-turn dialogue interaction. The project also considers a faster Chinese adaptation route: Chinese-LLaMA2-SFT-v0, which directly fine-tunes LLaMA-2 with existing open-source Chinese instruction or dialogue data (to be open-sourced soon).
Introduction: This project focuses on optimizing the LLaMA2 model for Chinese and building applications on top of it. Based on large-scale Chinese data, it continually upgrades the Chinese capabilities of LLaMA2 starting from pre-training.
Introduction: A large language model base obtained by incremental pre-training of LLaMA-7B on Chinese datasets. Compared with the original LLaMA, this model greatly improves Chinese understanding and generation and stands out on many downstream tasks.
BELLE:
Address: https://github.com/lianjiaatech/belle
Introduction: Open-sources a series of models optimized on BLOOMZ and LLaMA, together with training data, related models, training code, and application scenarios. The project also continuously evaluates the impact of different training data and training algorithms on model performance.
Introduction: Open-sources language models continually pre-trained on Chinese-domain data based on LLaMA-7B, -13B, -33B, and -65B, using nearly 15M samples for secondary pre-training.
Robin:
Address: https://github.com/optimalscale/lmflow
Introduction: Robin is a Chinese-English bilingual model developed by the LMFlow team at the Hong Kong University of Science and Technology. The second-generation Robin model, fine-tuned with only 180K data samples, reached first place on a Hugging Face leaderboard. LMFlow lets users quickly train personalized models: a 7-billion-parameter customized model can be fine-tuned on a single 3090 GPU in 5 hours.
Introduction: Fengshenbang-LM is an open-source large-model system led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute. It includes the Ziya general-purpose large models, capable of translation, programming, text classification, information extraction, summarization, copywriting, common-sense Q&A, and mathematical calculation. In addition to the Ziya series, the project also open-sources models such as the Taiyi and Erlangshen series.
BiLLa:
Address: https://github.com/neutralzz/billa
Introduction: This project open-sources BiLLa, a Chinese-English bilingual LLaMA model with enhanced reasoning ability. Its main features: it greatly enhances LLaMA's Chinese understanding while minimizing damage to the original LLaMA's English ability; training adds task-type data and uses ChatGPT-generated analyses to strengthen the model's grasp of task-solving logic; it performs full-parameter updates in pursuit of better generation results.
MOSS:
Address: https://github.com/openlmlab/moss
Introduction: An open-source dialogue language model supporting Chinese-English bilingual conversation and multiple plug-ins. The MOSS base language model was pre-trained on about 700 billion Chinese, English, and code tokens; after subsequent dialogue instruction fine-tuning, plug-in augmented learning, and human preference training, it gained multi-turn dialogue ability and the ability to use multiple plug-ins.
Introduction: A collection of open-source projects around Chinese large language models, containing language models built on existing open-source models (MOSS, LLaMA) as well as instruction fine-tuning datasets.
Linly:
Address: https://github.com/cvi-szu/linly
Introduction: Provides the Chinese dialogue model Linly-ChatFlow, the Chinese base model Linly-Chinese-LLaMA, and their training data. The Chinese base model is built on LLaMA and incrementally trained with Chinese and Chinese-English parallel corpora. The project aggregates current multilingual instruction data, performs large-scale instruction-following training on the Chinese base model, and obtains the Linly-ChatFlow dialogue model.
Firefly:
Address: https://github.com/yangjianxin1/firefly
Introduction: Firefly is an open-source Chinese large language model project. The open-source release includes data, fine-tuning code, and multiple models fine-tuned from BLOOM, Baichuan, and others; it supports fine-tuning most mainstream open-source models, such as Baichuan, Ziya, BLOOM, and LLaMA; it also supports merging LoRA weights with the base model for more convenient inference, as sketched below.
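As an illustration of the LoRA-plus-merge workflow mentioned above (a generic PEFT sketch under assumed checkpoint ids, target modules, and hyperparameters, not Firefly's own training code):

```python
# Minimal sketch: attach LoRA adapters with PEFT, then merge them back for inference.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"  # hypothetical base checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id)

lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# ... supervised fine-tuning on instruction data would run here ...

# Merge the adapter into the base weights so inference needs no PEFT wrapper.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")
```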
ChatYuan
Address: https://github.com/clue-ai/chatyuan
Introduction: A series of functional dialogue language models released by YuanYu Intelligence, supporting Chinese-English bilingual dialogue and optimized with respect to fine-tuning data, reinforcement learning from human feedback, chain-of-thought, and more.
ChatRWKV:
Address: https://github.com/blinkdl/chatrwkv
Introduction: Open-sources a series of chat models (English and Chinese) based on the RWKV architecture. Published models include Raven, Novel-ChnEng, Novel-Chn, and Novel-ChnEng-ChnPro, which can chat directly and write poetry, novels, and other creative text. 7B and 14B models are included.
CPM-Bee
Address: https://github.com/openbmb/cpm-bee
Introduction: A fully open-source, commercially usable Chinese-English base model with 10 billion parameters. It adopts a Transformer autoregressive architecture and is pre-trained on a trillion-token-scale high-quality corpus, giving it strong foundational capabilities. Developers and researchers can adapt the CPM-Bee base model to various scenarios to create application models for specific domains.
Introduction: A multilingual, multitask large language model (LLM). The open-source release includes the models TigerBot-7B, TigerBot-7B-Base, and TigerBot-180B, basic training and inference code, 100 GB of pre-training data, and domain data covering finance, law, and encyclopedias, as well as an API.
Introduction: Released by the Beijing Academy of Artificial Intelligence (BAAI), the Aquila language model inherits the architectural design strengths of GPT-3, LLaMA, and others, replaces a set of underlying operators with more efficient implementations, redesigns a Chinese-English bilingual tokenizer, and upgrades the BMTrain parallel training method, training from scratch on high-quality Chinese and English corpora. Through data quality control and various training optimizations, it achieves better performance than other open-source models with smaller datasets and shorter training time. It is also the first large-scale open-source language model that supports Chinese-English bilingual knowledge, supports a commercial license agreement, and meets domestic data compliance requirements.
Aquila2
Address: https://github.com/flagai-open/aquila2
Introduction: Released by the Beijing Academy of Artificial Intelligence (BAAI), the Aquila2 series includes the base language models Aquila2-7B, Aquila2-34B, and Aquila2-70B-Expr, the dialogue models AquilaChat2-7B, AquilaChat2-34B, and AquilaChat2-70B-Expr, and the long-text dialogue models AquilaChat2-7B-16K and AquilaChat2-34B-16K.
Anima
Address: https://github.com/lyogavin/anima
Introduction: An open-source QLoRA-based 33B Chinese large language model developed by AIWrite Technology. The model builds on the QLoRA-trained Guanaco 33B and was fine-tuned for 10,000 steps on the guanaco_belle_merge_v1.0 dataset open-sourced by the Chinese-Vicuna project; it performs well in an Elo rating tournament evaluation.
KnowLM
Address: https://github.com/zjunlp/knowlm
Introduction: The KnowLM project aims to release open-source large-model frameworks and corresponding model weights to help reduce knowledge fallacies, including the difficulty large models have in keeping knowledge up to date and their potential errors and biases. The first phase of the project released a LLaMA-based knowledge-extraction large model, which further pre-trains LLaMA (13B) on Chinese and English corpora and optimizes knowledge extraction tasks with instructions converted from knowledge graphs.
BayLing
Address: https://github.com/ictnlp/bayling
Introduction: A general-purpose large model with enhanced cross-lingual alignment, developed by the Natural Language Processing team of the Institute of Computing Technology, Chinese Academy of Sciences. BayLing uses LLaMA as its base model and explores instruction fine-tuning centered on interactive translation tasks, aiming to achieve language alignment and alignment with human intent simultaneously and to transfer generation ability from English to other languages (Chinese). In evaluations on multilingual translation, interactive translation, general tasks, and standardized exams, BayLing shows strong performance in Chinese/English. An online demo is available for everyone to try.
YuLan-Chat
Address: https://github.com/ruc-gsai/yulan-chat
Introduction: YuLan-Chat is a large language model developed by researchers at the Gaoling School of Artificial Intelligence (GSAI), Renmin University of China. It is fine-tuned on top of LLaMA with high-quality English and Chinese instructions. YuLan-Chat can chat with users, follows English or Chinese instructions well, and after quantization can be deployed on a single GPU (A800-80G or RTX 3090), as sketched below.
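Entries like this one note that a quantized chat model can run on a single consumer GPU; the sketch below shows one common route, 4-bit loading via bitsandbytes (the checkpoint id is a hypothetical placeholder, not a name from this README):

```python
# Minimal sketch: load a causal LM in 4-bit with bitsandbytes to fit a consumer GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-chinese-chat-model"  # hypothetical checkpoint id
quant_cfg = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)
```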
PolyLM
Address: https://github.com/damo-nlp-mt/polylm
Introduction: A multilingual model trained from scratch on 640 billion tokens, available in two sizes (1.7B and 13B). PolyLM covers Chinese, English, Russian, Spanish, French, Portuguese, German, Italian, Dutch, Polish, Arabic, Turkish, Hebrew, Japanese, Korean, Thai, Vietnamese, Indonesian, and other languages, and is especially friendly to Asian languages.
huozi
Address: https://github.com/hit-sCir/huozi
Introduction: An open-source large-scale pre-trained language model developed by the Natural Language Processing Institute of Harbin Institute of Technology. The model is based on the BLOOM architecture with 7 billion parameters, supports Chinese and English, and has a context window length of 2048. The project also open-sources an RLHF-trained model and a 16.9K Chinese preference dataset.
YaYi
Address: https://github.com/wenge-research/yayi
Introduction: The YaYi model is fine-tuned on millions of human-curated, high-quality domain instructions. The training data covers five major domains (media publicity, public-opinion analysis, public safety, financial risk control, and urban governance) across hundreds of natural language instruction tasks. During iterations from pre-training initialization weights to the domain model, the model's basic Chinese ability and domain analysis ability were gradually enhanced, and multi-turn dialogue and some plug-in capabilities were added. Through continuous human feedback optimization during internal testing with hundreds of users, model performance and safety were further improved. The project also open-sources a Chinese-optimized model based on LLaMA 2, exploring the latest practices suitable for Chinese multi-domain tasks.
YaYi2
Address: https://github.com/wenge-research/yayi2
Introduction: YaYi 2 is the new generation of open-source large language models developed by Wenge Technology, including Base and Chat versions at a 30B parameter scale. YaYi2-30B is a Transformer-based large language model pre-trained on a high-quality multilingual corpus of more than 2 trillion tokens. For both general and domain-specific application scenarios, millions of instructions were used for fine-tuning, and reinforcement learning from human feedback was applied to better align the model with human values. The currently open-sourced model is the YaYi2-30B Base model.
Yuan-2.0
Address: https://github.com/ieit-yuan/yuan-2.0
Introduction: This project open-sources the new generation of foundational language models released by Inspur Information, specifically all three models Yuan 2.0-102B, Yuan 2.0-51B, and Yuan 2.0-2B, along with scripts for pre-training, fine-tuning, and inference services. Yuan 2.0 builds on Yuan 1.0, using more high-quality pre-training data and instruction fine-tuning datasets to give the model stronger understanding of semantics, mathematics, reasoning, code, and knowledge.
Introduction: This project performs Chinese vocabulary expansion and incremental pre-training based on the Mixtral-8x7B sparse mixture-of-experts model. It open-sources the Chinese-Mixtral-8x7B vocabulary-expanded model and the training code. The Chinese encoding efficiency of this model is significantly higher than that of the original model, and incremental pre-training on a large-scale open-source corpus gives it strong Chinese generation and understanding abilities.
BlueLM
Address: https://github.com/vivo-ai-lab/bluelm
Introduction: BlueLM is a large-scale pre-trained language model independently developed by the vivo AI Global Research Institute. This release includes a 7B Base model and a 7B Chat model, and also open-sources long-text Base and Chat versions.
Introduction: TuringMM-34B-Chat is an open-source Chinese-English chat model developed by Beijing Guangnian Wuxian Technology Co., Ltd. It is fine-tuned from the open-source Yi-34B model with SFT on 140K refined education data and 150K alignment data.
Orion
Address: https://github.com/orionStarai/orion
Introduction: Orion-14B-Base is a multilingual model with 14 billion parameters, trained on a diverse dataset of 2.5 trillion tokens covering Chinese, English, Japanese, Korean, and other languages.
Introduction: OrionStar-Yi-34B-Chat is a chat model fine-tuned by OrionStar from the open-source Yi-34B model released by 01.AI, using 150K+ high-quality corpus entries, aiming to provide an outstanding interactive experience for users in the large-model community.
MiniCPM
Address:
Introduction: MiniCPM is a series of edge-side models jointly open-sourced by ModelBest (Mianbi Intelligence) and the Natural Language Processing Laboratory of Tsinghua University. The main language model MiniCPM-2B has only 2.4 billion (2.4B) non-embedding parameters, 2.7B parameters in total.
Mengzi3
Address: https://github.com/langboat/mengzi3
Introduction: The Mengzi3 8B/13B models are based on the LLaMA architecture, with corpus selected from web pages, encyclopedias, social media, news, and high-quality open-source datasets. Through continued multilingual training on trillions of tokens, the models achieve outstanding Chinese ability while retaining multilingual ability.
1.2 Multimodal LLM models
VisualGLM-6B
Address: https://github.com/thudm/visualglm-6b
Introduction: An open-source multimodal dialogue language model supporting images, Chinese, and English. The language model is based on ChatGLM-6B with 6.2 billion parameters; the full model has 7.8 billion parameters. It is pre-trained on 30M high-quality Chinese image-text pairs from the CogView dataset together with 300M filtered English image-text pairs.
CogVLM
Address: https://github.com/thudm/cogvlm
Introduction: A powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters. It achieves SOTA performance on 10 classic cross-modal benchmarks and can describe images accurately with almost no hallucination.
Introduction: A Chinese multimodal model developed on top of the Chinese LLaMA & Alpaca models. VisualCLA adds an image encoding module to the Chinese LLaMA/Alpaca model so that the LLaMA model can receive visual information. On that basis, multimodal pre-training on Chinese image-text data aligns image and text representations, giving it basic multimodal understanding; it is then fine-tuned with multimodal instruction data to strengthen its understanding, execution, and dialogue abilities for multimodal instructions. VisualCLA-7B-v0.1 is currently open-sourced.
LLaSM
Address: https://github.com/linksoul-ai/llasm
Introduction: The first open-source, commercially usable dialogue model supporting Chinese-English bilingual speech-text multimodal dialogue. Convenient voice input can greatly improve the experience of large models that take only text as input, while avoiding the tedious pipeline and potential errors introduced by ASR-based solutions. Currently open-sourced models and datasets include LLaSM-Chinese-Llama-2-7B and LLaSM-Baichuan-7B.
VisCPM
Address:
Introduction: An open-source multimodal large model series that supports Chinese-English bilingual multimodal dialogue (the VisCPM-Chat models) and text-to-image generation (the VisCPM-Paint models). VisCPM is trained on the ten-billion-parameter language model CPM-Bee (10B), integrating a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual input and output. Thanks to CPM-Bee's strong bilingual ability, VisCPM can achieve excellent Chinese multimodal capability by pre-training on English multimodal data only.
MiniCPM-V
Address: https://github.com/OpenBMB/MiniCPM-V
Introduction: A series of edge-side multimodal models for image-text understanding, including MiniCPM-V 2/2.6 at parameter sizes such as 2B and 8B. The 2B model's overall multimodal performance surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B; the 8B model's single-image, multi-image, and video understanding performance exceeds GPT-4V.
Qwen-VL
Address: https://github.com/qwenlm/qwen-vl
Introduction: A large vision-language model developed by Alibaba Cloud that takes images, text, and bounding boxes as input and produces text and bounding boxes as output. Features include: strong performance: the best results among open general models of the same scale on standard English evaluations of four classes of multimodal tasks; multilingual dialogue: natural support for English, Chinese, and other languages, with end-to-end long-text recognition of Chinese and English within images; multi-image interleaved dialogue: support for multi-image input and comparison, question answering about specified images, multi-image storytelling, and more; the first general model supporting open-domain grounding in Chinese: detection boxes can be specified through open-domain Chinese language expressions; fine-grained recognition and understanding: compared with the 224 resolution used by other open-source LVLMs, Qwen-VL is the first open-source LVLM at 448 resolution, and the higher resolution improves fine-grained text recognition, document question answering, and detection-box annotation.
InternVL/1.5/2.0
Address: https://github.com/opengvlab/internvl
Introduction: An open-source multimodal large model and the first model in China to break 60 on MMMU (multidisciplinary Q&A). On the math benchmark MathVista, InternVL's score of 66.3% is significantly higher than that of other closed-source commercial models and open-source models. On the general chart benchmark ChartQA, the document benchmark DocVQA, the infographic benchmark InfographicVQA, and the general visual Q&A benchmark MMBench (v1.1), InternVL also achieves state-of-the-art (SOTA) performance.
2. Applications
2.1 Vertical-domain fine-tuning
Healthcare
DoctorGLM:
Address:
Introduction: A Chinese medical consultation model based on ChatGLM-6B, fine-tuned on Chinese medical dialogue datasets, with fine-tuning and deployment implemented via LoRA, P-Tuning v2, and other methods.
Introduction: Open-sources a LLaMA-7B model fine-tuned with Chinese medical instructions. A Chinese medical instruction dataset was built using a medical knowledge graph and the GPT-3.5 API, and LLaMA was fine-tuned on it, improving its Q&A performance in the medical domain.
BianQue:
Address: https://github.com/scutcyr/bianque
Introduction: A medical dialogue large model with enhanced multi-turn inquiry ability, fine-tuned on the ClueAI/ChatYuan-large-v2 base with a mixed dataset of Chinese medical Q&A instructions and multi-turn inquiry dialogues.
Introduction: Open-sources a GPT-like model that has undergone Chinese medical instruction fine-tuning (instruct-tuning).
Med-ChatGLM:
Address: https://github.com/scir-hi/med-chatglm
Introduction: A ChatGLM model fine-tuned on Chinese medical knowledge; the fine-tuning data is the same as that of BenTsao.
QiZhenGPT:
Address: https://github.com/cmkrg/qizhengpt
Introduction: This project uses a Chinese medical instruction dataset built from the QiZhen medical knowledge base and fine-tunes the LLaMA-7B model on it, greatly improving the model's performance in Chinese medical scenarios. The project also provides evaluation datasets and plans to further optimize question answering for diseases, surgeries, and examinations, and to expand to applications such as doctor-patient Q&A and automatic medical record generation.
ChatMed:
Address: https://github.com/michael-wzhu/chatmed
Introduction: This project releases the ChatMed series of Chinese medical large language models, based on LLaMA-7B and fine-tuned with LoRA. Specifically, ChatMed-Consult is trained on the Chinese online medical consultation dataset ChatMed_Consult_Dataset, using 500K+ online consultations plus ChatGPT replies as the training set; ChatMed-TCM is trained on the traditional Chinese medicine instruction dataset ChatMed_TCM_Dataset, which applies an entity-centered self-instruct method over an open-source TCM knowledge graph and uses ChatGPT to obtain 26K+ TCM instruction examples.
XrayGLM, the first Chinese multimodal medical model that can read chest X-rays:
Address: https://github.com/wangrongSheng/xrayglm
Introduction: To promote research and development of medical multimodal models in the Chinese domain, this project releases the XrayGLM dataset and model, which shows remarkable potential in medical imaging diagnosis and multi-turn interactive dialogue.
MeChat, a Chinese mental-health support dialogue model:
Address: https://github.com/qiuhuachuan/smile
Introduction: This project open-sources a mental-health support general model fine-tuned from ChatGLM-6B with LoRA 16-bit instruction tuning. The dataset expands real psychological mutual-help QA into multi-turn mental-health support dialogues by calling the gpt-3.5-turbo API, improving the performance of general large language models in the mental-health support domain and better matching long, multi-turn dialogue application scenarios.
Introduction: MemFree is an open-source hybrid AI search engine that can search both your personal knowledge base (such as bookmarks, notes, and documents) and the internet to provide the best answer. MemFree supports a self-hosted, extremely fast serverless vector database, a self-hosted, extremely fast local embedding and rerank service, and one-click deployment.
Dataset description: The dataset rewrites real psychological mutual-help QA into multi-turn mental-health support dialogues via ChatGPT (single-turn to multi-turn inclusive language expansion via ChatGPT). It contains 56K multi-turn dialogues whose topics, vocabulary, and discourse semantics are richer and more diverse, better matching long multi-turn dialogue application scenarios.
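As an illustration of the single-turn-to-multi-turn expansion described above (a hedged sketch, not the project's actual pipeline; the prompt wording and model name are assumptions):

```python
# Minimal sketch: ask gpt-3.5-turbo to rewrite one QA pair as a multi-turn dialogue.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_to_multi_turn(question: str, answer: str) -> str:
    prompt = (
        "将下面的单轮心理互助问答改写为一段多轮的心理健康支持对话，"
        "保持求助者的问题和支持者的建议内容不变：\n"
        f"求助者：{question}\n支持者：{answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content
```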
Preference datasets
CValues
Address: https://github.com/X-PLUG/CValues
Dataset description: This project open-sources a value-alignment dataset of 145K samples. For each prompt, the dataset includes three response types ranked as refusal & positive suggestion (safe and responsible) > refusal only (safe) > risky reply (unsafe). It can be used to enhance the safety of SFT models or to train reward models.
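As an illustration of how the three ranked response types above can be turned into pairwise preference data of the kind reward-model or DPO training usually expects (the field names and format are assumptions, not the project's official schema):

```python
# Minimal sketch: convert one prompt with ranked responses into chosen/rejected pairs.
RANKING = ["safe_and_responsible", "safe", "unsafe"]  # better -> worse

def to_preference_pairs(prompt: str, responses: dict) -> list:
    """responses maps a ranking tier name to a response string."""
    pairs = []
    for i, better in enumerate(RANKING):
        for worse in RANKING[i + 1:]:
            if better in responses and worse in responses:
                pairs.append({"prompt": prompt,
                              "chosen": responses[better],
                              "rejected": responses[worse]})
    return pairs

pairs = to_preference_pairs(
    "如何缓解工作压力？",
    {"safe_and_responsible": "可以尝试规律作息并寻求支持……",
     "safe": "抱歉，我无法提供帮助。",
     "unsafe": "压力大就辞职吧。"},
)
print(pairs)
```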
Introduction: A Chinese-language introductory tutorial on large models, organized around Andrew Ng's LLM course series, mainly including Chinese versions of Andrew Ng's "ChatGPT Prompt Engineering for Developers", "Building Systems with the ChatGPT API", and "LangChain for LLM Application Development" courses.
Introduction: This repo aims at recording open-source ChatGPT alternatives and providing an overview of how to get involved, including: base models, technologies, data, domain models, training pipelines, speed-up techniques, multi-language, multi-modal, and more.
Introduction: This repo records a list of totally open alternatives to ChatGPT.
Awesome-LLM:
Address: https://github.com/Hannibal046/Awesome-LLM
Introduction: This repo is a curated list of papers about large language models, especially relating to ChatGPT. It also contains frameworks for LLM training, tools to deploy LLMs, courses and tutorials about LLMs, and all publicly available LLM checkpoints and APIs.