HF Repo • ⚒️ SwissArmyTransformer (sat) • Twitter
• [CogView@NeurIPS 21] [GitHub] • [GLM@ACL 22] [GitHub]
Join us on Slack and WeChat
[2023.10] Please check out CogVLM (https://github.com/THUDM/CogVLM), a new-generation multi-modal dialogue model from Zhipu AI. It adopts a new visual-expert architecture and took first place on 10 authoritative classic multi-modal tasks. The currently open-sourced CogVLM-17B English model will be followed by a Chinese model based on open-source GLM.
VisualGLM-6B is an open-source, multi-modal dialogue language model that supports images, Chinese, and English. The language model is based on ChatGLM-6B with 6.2 billion parameters; the image part builds a bridge between the visual model and the language model through the training of BLIP2-Qformer, giving the overall model a total of 7.8 billion parameters.
VisualGLM-6B is pre-trained on 30M high-quality Chinese image-text pairs from the CogView dataset and 300M filtered English image-text pairs, with Chinese and English weighted equally. This training approach better aligns visual information with the semantic space of ChatGLM; in the subsequent fine-tuning stage, the model is trained on long visual question-answering data to generate answers that match human preferences.
VisualGLM-6B is trained with the SwissArmyTransformer (sat for short) library, a toolkit that supports flexible modification and training of Transformers as well as parameter-efficient fine-tuning methods such as LoRA and P-tuning. This project provides a Hugging Face interface that matches common usage habits, as well as an interface based on sat.
Combined with model quantization, users can deploy the model locally on consumer-grade graphics cards (a minimum of 6.3 GB of GPU memory is required at the INT4 quantization level).
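As a concrete illustration, INT4 deployment through the Hugging Face interface is a one-line change to the usual loading code (a minimal sketch based on the quantized-loading example in the deployment section below; only the ChatGLM part is quantized, and only 4/8-bit quantization is supported):
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
# quantize(4) selects INT4, which brings GPU memory down to roughly the 6.3 GB figure above
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).quantize(4).half().cuda()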
The VisualGLM-6B open-source model aims to promote the development of large-model technology together with the open-source community. Developers and other users are kindly asked to abide by the open-source license and to refrain from using this model, its code, or derivatives of this project for any purpose that may harm the country or society, or for any service that has not undergone safety assessment and filing. Currently, this project has not officially developed any applications based on VisualGLM-6B, including websites, Android apps, Apple iOS applications, or Windows apps.
Since VisualGLM-6B is still at version v1, it currently has quite a few known limitations, such as factuality/hallucination problems in image descriptions, insufficient capture of image details, and some limitations inherited from the language model. Although the model strives for compliance and accuracy of the data at every stage of training, the accuracy of the output content cannot be guaranteed because of the small scale of VisualGLM-6B and the influence of probabilistic and random factors, and the model is easily misled (see the Limitations section for details). Subsequent versions of VisualGLM will try to optimize these problems. This project does not bear the risks and responsibilities of data security or public-opinion risks caused by the open-source model and code, nor the risks and responsibilities arising from any model being misled, abused, disseminated, or improperly exploited.
VisualGLM-6B can describe images and answer questions about related knowledge.
Use pip to install dependencies
pip install -i https://pypi.org/simple -r requirements.txt
# In mainland China, please use the Aliyun mirror; TUNA and other mirrors have recently had synchronization problems. The command is as follows
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
This will also install the deepspeed library by default (it is needed for sat training). The library is not necessary for model inference, and some Windows environments run into problems installing it. To skip the deepspeed installation, change the commands to
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements_wo_ds.txt
pip install -i https://mirrors.aliyun.com/pypi/simple/ --no-deps "SwissArmyTransformer>=0.4.4"
If you call the model through the Hugging Face transformers library (the dependencies above still need to be installed!), you can use the following code (where the image path is a local path):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
image_path = "your image path"
response, history = model.chat(tokenizer, image_path, "描述这张图片。", history=[])
print(response)
response, history = model.chat(tokenizer, image_path, "这张图片可能是在什么场所拍摄的?", history=history)
print(response)
The above code automatically downloads the model implementation and parameters via transformers. The complete model implementation can be found on the Hugging Face Hub. If downloading the model parameters from the Hugging Face Hub is slow, you can manually download the model parameter files from here and load the model locally; for the specific method, please refer to Loading the model from local. For quantization, CPU inference, Mac MPS backend acceleration, etc. based on the transformers model, please refer to the low-cost deployment of ChatGLM-6B.
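For example, if the Hugging Face files have been downloaded manually, loading from a local directory is a small change to the code above (a minimal sketch; the directory name below is a placeholder and should contain both the model implementation files and the parameter files):
from transformers import AutoTokenizer, AutoModel

local_dir = "./visualglm-6b"  # placeholder: directory with the manually downloaded files
tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(local_dir, trust_remote_code=True).half().cuda()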
If you call the model through the SwissArmyTransformer library, the method is similar. You can use the environment variable SAT_HOME to set the model download location. In this repository directory:
import argparse
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
from model import chat, VisualGLMModel
model, model_args = VisualGLMModel.from_pretrained('visualglm-6b', args=argparse.Namespace(fp16=True, skip_init=True))
from sat.model.mixins import CachedAutoregressiveMixin
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
image_path = "your image path or URL"
response, history, cache_image = chat(image_path, model, tokenizer, "描述这张图片。", history=[])
print(response)
response, history, cache_image = chat(None, model, tokenizer, "这张图片可能是在什么场所拍摄的?", history=history, image=cache_image)
print(response)
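If you want sat to cache checkpoints under a specific directory, one option (a minimal sketch; the path is just a placeholder) is to set SAT_HOME before loading the model:
import os

# Placeholder path: sat downloads and looks up checkpoints under SAT_HOME,
# so set it before calling VisualGLMModel.from_pretrained
os.environ["SAT_HOME"] = "/path/to/sat_models"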
Parameter-efficient fine-tuning can also be performed easily with the sat library.
Multi-modal tasks are widely distributed and diverse, and pre-training often cannot cover everything. Here we provide an example of few-shot fine-tuning, using 20 annotated images to enhance the model's ability to answer questions about the "background" of an image.
Unzip fewshot-data.zip
and run the following command:
bash finetune/finetune_visualglm.sh
Currently, three fine-tuning methods are supported:
- LoRA: the --layer_range and --lora_rank parameters can be adjusted according to the specific scenario and data volume.
- QLoRA: if resources are limited, you can use bash finetune/finetune_visualglm_qlora.sh. QLoRA quantizes the linear layers of ChatGLM to 4 bits and only requires 9.8 GB of GPU memory for fine-tuning.
- P-tuning: you can replace --use_lora with --use_ptuning, but this is not recommended unless the model's application scenario is very fixed.
After training, you can use the following command for inference:
python cli_demo.py --from_pretrained your_checkpoint_path --prompt_zh 这张图片的背景里有什么内容?
Before fine-tuning:
欢迎使用 VisualGLM-6B 模型,输入图像URL或本地路径读图,继续输入内容对话,clear 重新开始,stop 终止程序
请输入图像路径或URL(回车进入纯文本对话): fewshot-data/2p.png
用户:这张图片的背景里有什么内容?
VisualGLM-6B:人们总是被生活琐事所困扰,无法摆脱束缚.
用户:clear
请输入图像路径或URL(回车进入纯文本对话): fewshot-data/meme.png
用户:这张图片的背景里有什么内容?
VisualGLM-6B:这张图片的背景是一张狗的图片,它躺在木地板上。
用户:clear
请输入图像路径或URL(回车进入纯文本对话): fewshot-data/passport.png
用户:这张图片的背景里有什么内容?
VisualGLM-6B:这张图片的背景是一台电脑键盘和两张护照。
用户:clear
请输入图像路径或URL(回车进入纯文本对话):
After fine-tuning:
欢迎使用 VisualGLM-6B 模型,输入图像URL或本地路径读图,继续输入内容对话,clear 重新开始,stop 终止程序
请输入图像路径或URL(回车进入纯文本对话): fewshot-data/2p.png
用户:这张图片的背景里有什么内容?
VisualGLM-6B:这张图片的背景是雨天,
用户:描述一下这张图片。
VisualGLM-6B:这张图片描绘了雨天的情景,雨水淅淅沥沥地落在地面上。人们穿着雨衣或雨伞,走在雨中,似乎在寻找着什么.
用户:clear
请输入图像路径或URL(回车进入纯文本对话): fewshot-data/meme.png
用户:这张图片的背景里有什么内容?
VisualGLM-6B:这张图片的背景是木质地板。
用户:什么颜色的地板
VisualGLM-6B:这张图片的背景是蓝色的。
用户:描述一下这张图片。
VisualGLM-6B:这张图片描绘了一只棕色的小狗躺在蓝色的木质地板上。
用户:clear
请输入图像路径或URL(回车进入纯文本对话): fewshot-data/passport.png
用户:这张图片的背景里有什么内容?
VisualGLM-6B:这张图片的背景是电脑键盘。
用户:描述一下这张图片。
VisualGLM-6B:这张图片的特点是电脑键盘和护照。
用户:有几张护照
VisualGLM-6B:两张护照。
用户:clear
请输入图像路径或URL(回车进入纯文本对话):
If you want to merge the LoRA parameters into the original weights, you can call merge_lora(), for example:
from finetune_visualglm import FineTuneVisualGLMModel
import argparse

model, args = FineTuneVisualGLMModel.from_pretrained('checkpoints/finetune-visualglm-6b-05-19-07-36',
    args=argparse.Namespace(
        fp16=True,
        skip_init=True,
        use_gpu_initialization=True,
        device='cuda',
    ))
model.get_mixin('lora').merge_lora()
args.layer_range = []
args.save = 'merge_lora'
args.mode = 'inference'
from sat.training.model_io import save_checkpoint
save_checkpoint(1, model, None, None, args)
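The merged checkpoint is saved under the directory given by args.save (merge_lora in this example) and can then be loaded for inference like any other sat checkpoint, for example via the --from_pretrained option of cli_demo.py shown above.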
Fine-tuning requires installing the deepspeed library, and this process currently only supports Linux. More sample scripts and instructions for Windows will be provided in the near future.
python cli_demo.py
The program will automatically download the sat model and conduct an interactive conversation on the command line. Enter instructions and press Enter to generate a reply. Enter clear to clear the conversation history. Enter stop to terminate the program.
The program provides the following hyperparameters to control the generation process and quantization accuracy:
usage: cli_demo.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE] [--english] [--quant {8,4}]
optional arguments:
-h, --help show this help message and exit
--max_length MAX_LENGTH
max length of the total sequence
--top_p TOP_P top p for nucleus sampling
--top_k TOP_K top k for top k sampling
--temperature TEMPERATURE
temperature for sampling
--english only output English
--quant {8,4} quantization bits
Note that during training, the prompt for English question-answer pairs is Q: A:, while the Chinese prompt is 问:答:. The web demo uses the Chinese prompts, so English replies will be worse and mixed with Chinese; if you need replies in English, please use the --english option of cli_demo.py.
We also provide a command-line tool with a typewriter effect inherited from ChatGLM-6B; this tool uses the Hugging Face model:
python cli_demo_hf.py
We also support multi-GPU parallel deployment of the model (you need to update to the latest version of sat; if you downloaded a checkpoint earlier, you also need to delete it manually and download it again):
torchrun --nnode 1 --nproc-per-node 2 cli_demo_mp.py
We provide a web demo based on Gradio. First install Gradio: pip install gradio. Then clone this repository, enter it, and run web_demo.py:
git clone https://github.com/THUDM/VisualGLM-6B
cd VisualGLM-6B
python web_demo.py
The program will automatically download the sat model, run a Web Server, and output the address. Open the output address in a browser to use it.
We also provide a web demo with a typewriter effect inherited from ChatGLM-6B; it uses the Hugging Face model and serves on port 8080 after startup:
python web_demo_hf.py
Both web demos accept the command-line parameter --share to generate a public Gradio link, and --quant 4 or --quant 8 to use 4-bit or 8-bit quantization respectively to reduce GPU memory usage.
First install the additional dependencies pip install fastapi uvicorn, then run api.py in the repository:
python api.py
The program automatically downloads the sat model. The API is deployed on local port 8080 by default and is called via the POST method. Below is an example request using curl; in general, you can also send the POST request from code (a Python sketch follows the sample response below).
echo " { " image " : " $( base64 path/to/example.jpg ) " , " text " : "描述这张图片" , " history " :[]} " > temp.json
curl -X POST -H " Content-Type: application/json " -d @temp.json http://127.0.0.1:8080
The returned value is:
{
"response":"这张图片展现了一只可爱的卡通羊驼,它站在一个透明的背景上。这只羊驼长着一张毛茸茸的耳朵和一双大大的眼睛,它的身体是白色的,带有棕色斑点。",
"history":[('描述这张图片', '这张图片展现了一只可爱的卡通羊驼,它站在一个透明的背景上。这只羊驼长着一张毛茸茸的耳朵和一双大大的眼睛,它的身体是白色的,带有棕色斑点。')],
"status":200,
"time":"2023-05-16 20:20:10"
}
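As noted above, the same request can also be sent from code; below is a minimal sketch using the third-party requests library (assuming the package is installed and the API is running on the default port):
import base64
import requests

# Build the same JSON payload as the curl example: a base64-encoded image, a query, and an empty history
with open("path/to/example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {"image": image_b64, "text": "描述这张图片", "history": []}
resp = requests.post("http://127.0.0.1:8080", json=payload)
print(resp.json()["response"])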
We also provide api_hf.py that uses the Huggingface model. The usage is consistent with the API of the sat model:
python api_hf.py
In the Hugging Face implementation, the model is loaded in FP16 precision by default, and running the code above requires about 15 GB of GPU memory. If your GPU memory is limited, you can try loading the model in quantized mode, as follows:
# Modify as needed; currently only 4/8-bit quantization is supported. The following only quantizes ChatGLM, since quantizing the ViT introduces large errors
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).quantize(8).half().cuda()
In the sat implementation, you need to first pass a parameter that changes the loading location to cpu, and then perform quantization as follows (see cli_demo.py for details):
from sat.quantization.kernels import quantize
quantize(model, args.quant).cuda()
# Inference then requires only 7 GB of GPU memory
This project is at the V1 stage; the parameter count and compute of both the visual and language models are relatively small. We have summarized the main directions for improvement as follows:
The code in this repository is open-sourced under the Apache-2.0 license. Use of the VisualGLM-6B model weights must comply with the Model License.
If you find our work helpful, please consider citing the following papers:
@inproceedings{du2022glm,
title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={320--335},
year={2022}
}
@article{ding2021cogview,
title={Cogview: Mastering text-to-image generation via transformers},
author={Ding, Ming and Yang, Zhuoyi and Hong, Wenyi and Zheng, Wendi and Zhou, Chang and Yin, Da and Lin, Junyang and Zou, Xu and Shao, Zhou and Yang, Hongxia and others},
journal={Advances in Neural Information Processing Systems},
volume={34},
pages={19822--19835},
year={2021}
}
The dataset used in the instruction fine-tuning stage of VisualGLM-6B includes part of the English image-text data from the MiniGPT-4 and LLaVA projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.