Awesome Multimodal Assistant is a curated list of multimodal chatbots and conversational assistants that combine multiple modes of interaction, such as text, speech, images, and video, to provide a seamless and versatile user experience. These assistants help users with a wide range of tasks, from simple information retrieval to complex multimedia reasoning.
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
arXiv 2022/12
[paper]
GPT-4
arXiv 2023/03
[paper] [blog]
Visual Instruction Tuning
arXiv 2023/04
[paper] [code] [project page] [demo]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
arXiv 2023/04
[paper] [code] [project page] [demo]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
arXiv 2023/04
[paper] [code] [demo]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
arXiv 2023/04
[paper] [code] [demo]
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding
[code]
LMEye: An Interactive Perception Network for Large Language Models
arXiv 2023/05
[paper] [code]
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
arXiv 2023/05
[paper] [code] [demo]
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
arXiv 2023/05
[paper] [code] [project page]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
arXiv 2023/05
[paper] [code] [demo]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
arXiv 2023/05
[paper] [code]
InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
arXiv 2023/05
[paper] [code] [demo]
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
arXiv 2023/05
[paper] [code]
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
arXiv 2023/05
[paper] [code] [project page]
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
arXiv 2023/05
[paper] [code] [project page]
DetGPT: Detect What You Need via Reasoning
arXiv 2023/05
[paper] [code] [project page]
PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology
arXiv 2023/05
[paper] [code]
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
arXiv 2023/05
[paper] [code] [project page]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
arXiv 2023/06
[paper] [code]
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
arXiv 2023/06
[paper]
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
arXiv 2023/06
[paper] [project page]
Valley: Video Assistant with Large Language Model Enhanced Ability
arXiv 2023/06
[paper] [code]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arXiv 2023/03
[paper] [code] [demo]
ViperGPT: Visual Inference via Python Execution for Reasoning
arXiv 2023/03
[paper] [code] [project page]
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
arXiv 2023/03
[paper] [code]
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
arXiv 2023/03
[paper] [code]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2023/03
[paper] [code] [project page] [demo]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
arXiv 2023/03
[paper] [code] [demo]
VLog: Video as a Long Document
[code] [demo]
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
arXiv 2023/04
[paper] [code]
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
arXiv 2023/04
[paper] [project page]
VideoChat: Chat-Centric Video Understanding
arXiv 2023/05
[paper] [code] [demo]