[Project Page] [Paper] [Hugging Face Space] [Model Zoo] [Introduction] [Video]
git clone https://github.com/cnzzx/VSA.git
cd VSA
conda create -n vsa python=3.10
conda activate vsa
cd models/LLaVA
pip install -e .
pip install -r requirements.txt
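Before launching anything, you can optionally verify the environment with a minimal check like the one below. It assumes PyTorch and Transformers are pulled in by the LLaVA install and requirements.txt above; the snippet does not install anything itself.
# Optional sanity check: confirm core dependencies import and a GPU is visible.
import torch           # assumed to be installed via the LLaVA dependencies above
import transformers    # assumed to be installed via requirements.txt
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)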
The local demo is based on Gradio; you can simply run it with:
python app.py
We provide some samples to get you started. In the "Samples" UI, select one in the "Samples" panel and click "Select This Sample"; the sample input will then be filled into the "Run" UI automatically.
You can also chat with our Vision Search Assistant in the terminal by running:
python cli.py \
    --vlm-model "liuhaotian/llava-v1.6-vicuna-7b" \
    --ground-model "IDEA-Research/grounding-dino-base" \
    --search-model "internlm/internlm2_5-7b-chat" \
    --vlm-load-4bit
Then, select an image and type your question.
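The three checkpoints referenced above are typically fetched from the Hugging Face Hub on first use, which can make the first run slow. If you prefer to pre-download them, a minimal sketch (assuming a standard huggingface_hub setup and the default cache location) is:
# Optional: pre-download the model weights used by cli.py so the first run
# does not block on network I/O. Model IDs are copied from the command above.
from huggingface_hub import snapshot_download
for repo_id in (
    "liuhaotian/llava-v1.6-vicuna-7b",
    "IDEA-Research/grounding-dino-base",
    "internlm/internlm2_5-7b-chat",
):
    snapshot_download(repo_id)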
This project is released under the Apache 2.0 license.
Vision Search Assistant is greatly inspired by the following outstanding contributions to the open-source community: GroundingDINO, LLaVA, MindSearch.
If you find this project useful in your research, please consider citing:
@article{zhang2024visionsearchassistantempower,
title={Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines},
author={Zhang, Zhixin and Zhang, Yiyuan and Ding, Xiaohan and Yue, Xiangyu},
journal={arXiv preprint arXiv:2410.21220},
year={2024}
}