InternVL Download - InternVL Source Code Download

InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites. A Pioneering Open-Source Alternative to GPT-4o

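News

2024/10/21: We release the Mini-InternVL series. These models achieve impressive performance at minimal size: the 4B model reaches 90% of the performance with just 5% of the model size. For more details, please check our project page and document.
2024/08/01: The Chartmimic team evaluated the InternVL2 series models on their benchmark. The InternVL2-26B and 76B models achieved the top two results among open-source models, with the InternVL2 76B model surpassing GeminiProVision and showing results comparable to Claude-3-opus.
2024/08/01: InternVL2-Pro achieved SOTA performance among open-source models on the CharXiv dataset, surpassing many closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet.
2024/07/24: The MLVU team evaluated InternVL-1.5 on their benchmark. The average score on the multiple-choice tasks was 50.4%, and the average score on the generative tasks was 4.02. The multiple-choice result ranked first among all open-source MLLMs.
2024/07/18: InternVL2-40B achieved SOTA performance among open-source models on the Video-MME dataset, scoring 61.2 with 16 input frames and 64.4 with 32 input frames. It significantly outperforms other open-source models and is the closest open-source model to GPT-4o mini.
2024/07/18: InternVL2-Pro achieved SOTA performance on the DocVQA and InfoVQA benchmarks.
2024/07/04: We release the InternVL2 series. InternVL2-Pro achieved 62.0% accuracy on the MMMU benchmark, matching the performance of leading closed-source commercial models such as GPT-4o. A free API for this model can be requested by filling out the application form. Other models are available at the HF link.
2024/06/19: We propose Needle In A Multimodal Haystack (MM-NIAH), the first benchmark designed to systematically evaluate the ability of existing MLLMs to comprehend long multimodal documents.
2024/05/30: We release ShareGPT-4o, a large-scale dataset that we plan to open-source, containing 200K images, 10K videos, and 10K audio clips with detailed descriptions.
2024/05/28: Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ.
2024/05/13: InternVL 1.0 can now be used as the text encoder for diffusion models to support multilingual generation in over 110 languages worldwide. See MuLan for more details.
2024/04/18: InternVL-Chat-V1-5 has been released at the HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista.
2024/02/27: InternVL is accepted by CVPR 2024 (Oral)!
2024/02/21: InternVL-Chat-V1-2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
2024/02/12: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog and SFT data. The model is now available on HuggingFace, and both the training/evaluation data and the scripts are open-sourced.
2024/01/24: InternVL-Chat-V1-1 is released. It supports Chinese and has stronger OCR capability; see here.
2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale detection and segmentation models.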

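TODO List

Support vLLM and Ollama
Rebuild documents using readthedocs
Support fine-tuning different LLMs with LoRA
Support video and PDF input in the online demo
Release InternVL2 with VisionLLMv2 integration
Release requirements.txt for InternVL2
Release training/evaluation code for the InternVL2 series
Release a Streamlit web UI for InternVL1.5 and InternVL2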

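Documents

Get Started

Installation: [Environment] [requirements.txt]
Evaluation Data Preparation: [InternVL Evaluation]
Chat Data Format: [Meta File] [Pure Text] [Single-Image] [Multi-Image] [Video]
InternVL-Chat API: [InternVL2-Pro]
Local Chat Demo: [Streamlit Demo] [Gradio Demo] [LMDeploy Demo]
Tutorials: [Enhancing InternVL2 on COCO Caption Using LoRA Fine-Tuning]

InternVL Family

InternVL 2.0: [Introduction] [Quick Start] [Finetune] [Evaluation] [Deployment]
InternVL 1.5: [Introduction] [Quick Start] [Finetune] [Evaluation] [Deployment]
InternVL 1.2: [Introduction] [Quick Start] [Finetune] [Evaluation]
InternVL 1.1: [Introduction] [Quick Start] [Evaluation]
InternVL 1.0: [Classification] [CLIP-Benchmark] [Segmentation] [InternVL-Chat-LLaVA] [InternVL-G]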

Compared with SOTA VLLMs

Model Zoo

Multimodal Large Language Model (InternVL 2.0)

Model Name	Vision Part	Language Part	HF Link	MS Link	Document
InternVL2-1B	InternViT-300M-448px	Qwen2-0.5B-Instruct	link	link	doc
InternVL2-2B	InternViT-300M-448px	internlm2-chat-1_8b	link	link	doc
InternVL2-4B	InternViT-300M-448px	Phi-3-mini-128k-instruct	link	link	doc
InternVL2-8B	InternViT-300M-448px	internlm2_5-7b-chat	link	link	doc
InternVL2-26B	InternViT-6B-448px-V1-5	internlm2-chat-20b	link	link	doc
InternVL2-40B	InternViT-6B-448px-V1-5	Nous-Hermes-2-Yi-34B	link	link	doc
InternVL2-Llama3-76B	InternViT-6B-448px-V1-5	Hermes-2-Theta-Llama-3-70B	link	link	doc

InternVL2-Pro API

We welcome everyone to use our API for research. For better management, please submit the application form to obtain free API access.

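Multimodal Large Language Model (InternVL 1.0-1.5)

Model	Date	HF Link	MS Link	Note
Mini-InternVL-Chat-4B-V1-5	2024/05/28	link	link	16% of the model size, 90% of the performance
Mini-InternVL-Chat-2B-V1-5	2024/05/19	link	link	8% of the model size, 80% of the performance
InternVL-Chat-V1-5	2024/04/18	link	link	Supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista
InternVL-Chat-V1-2-Plus	2024/02/21	link	link	More SFT data and stronger
InternVL-Chat-V1-2	2024/02/11	link	link	LLM upgraded to 34B
InternVL-Chat-V1-1	2024/01/24	link	link	Supports Chinese; stronger OCR
InternVL-Chat-19B	2023/12/25	link	link	English multimodal dialogue
InternVL-Chat-13B	2023/12/25	link	link	English multimodal dialogue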

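Vision Foundation Model (InternVL 1.0-1.5)

Model	Date	HF Link	MS Link	Note
InternViT-300M-448px	2024/05/25	link	link	Distilled small vision foundation model with 300M parameters (new)
InternViT-6B-448px-V1-5	2024/04/20	link	link	Supports dynamic resolution and very strong OCR feature extraction through incremental pre-training (new)
InternViT-6B-448px-V1-2	2024/02/11	link	link	Supports 448 resolution through incremental pre-training
InternViT-6B-448px-V1-0	2024/01/30	link	link	Supports 448 resolution through incremental pre-training
InternViT-6B-224px	2023/12/22	link	link	The first version of InternViT-6B, extracted from InternVL‑14B‑224px

Vision-Language Foundation Model (InternVL 1.0)

Model	Date	HF Link	MS Link	Note
InternVL-14B-224px	2023/12/22	link	link	Vision-language foundation model, InternViT-6B + QLLaMA; can be used for image-text retrieval like CLIP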

What can InternVL do?

Visual Perception

Linear-Probe Image Classification [see details]

*ViT-22B uses the private JFT-3B dataset.

method	#param	IN-1K	IN-ReaL	IN-V2	IN-A	IN-R	IN-Sketch
OpenCLIP-G	1.8B	86.2	89.4	77.2	63.8	87.8	66.4
DINOv2-g	1.1B	86.5	89.6	78.4	75.9	78.8	62.5
EVA-01-CLIP-g	1.1B	86.5	89.3	77.4	70.5	87.7	63.1
MAWS-ViT-6.5B	6.5B	87.8	-	-	-	-	-
ViT-22B*	21.7B	89.5	90.9	83.2	83.8	87.4	-
InternViT-6B (ours)	5.9B	88.2	90.4	79.9	77.5	89.8	69.1

Semantic Segmentation [see details]

method	decoder	#param (train/total)	crop size	mIoU
OpenCLIP-G (frozen)	Linear	0.3M / 1.8B	512	39.3
ViT-22B (frozen)	Linear	0.9M / 21.7B	504	34.6
InternViT-6B (frozen)	Linear	0.5M / 5.9B	504	47.2 (+12.6)
ViT-22B (frozen)	UperNet	0.8B / 22.5B	504	52.7
InternViT-6B (frozen)	UperNet	0.4B / 6.3B	504	54.9 (+2.2)
ViT-22B	UperNet	22.5B / 22.5B	504	55.3
InternViT-6B	UperNet	6.3B / 6.3B	504	58.9 (+3.6)
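
The bracketed gains in the InternViT-6B rows appear to be differences from the ViT-22B row that uses the same decoder and training setting. A minimal sketch that recomputes them from the mIoU values in the table; the dictionary layout, the "full" label for the rows where all parameters are trained, and the `gains` helper are illustrative and not part of the original project:

```python
# mIoU values copied from the semantic segmentation table.
# Keys are (decoder, setting); this layout is illustrative only.
vit_22b = {
    ("Linear", "frozen"): 34.6,
    ("UperNet", "frozen"): 52.7,
    ("UperNet", "full"): 55.3,
}
internvit_6b = {
    ("Linear", "frozen"): 47.2,
    ("UperNet", "frozen"): 54.9,
    ("UperNet", "full"): 58.9,
}


def gains(ours, baseline):
    """Return the per-setting mIoU difference, rounded to one decimal."""
    return {key: round(ours[key] - baseline[key], 1) for key in ours}


deltas = gains(internvit_6b, vit_22b)
for key, delta in deltas.items():
    print(key, f"+{delta}")
```

The three differences match the values shown in parentheses in the table.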

Zero-Shot Image Classification [see details]

method	IN-1K	IN-A	IN-R	IN-V2	IN-Sketch	ObjectNet
OpenCLIP-G	80.1	69.3	92.1	73.6	68.9	73.0
EVA-02-CLIP-E+	82.0	82.1	94.5	75.7	71.6	79.6
ViT-22B*	85.9	90.1	96.0	80.9	-	87.6
InternVL-C (ours)	83.2	83.8	95.5	77.3	73.9	80.6
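
Zero-shot classification with a contrastive model is usually done by comparing one image embedding against the text embeddings of the class names and picking the closest one. A minimal sketch of that idea, assuming toy 3-dimensional embeddings; the vectors, class names, and the `zero_shot_classify` helper are made up for illustration, and real embeddings would come from the model:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, class_embeddings):
    """Return the best-matching class name and all similarity scores."""
    scores = {name: cosine(image_embedding, emb)
              for name, emb in class_embeddings.items()}
    return max(scores, key=scores.get), scores


# Hypothetical text embeddings, one per class prompt.
class_embeddings = {
    "red panda": [0.9, 0.1, 0.0],
    "giant panda": [0.1, 0.9, 0.0],
    "cat": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding.
image_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_embedding, class_embeddings)
print(label)
```

No classifier is trained here: adding a class only requires embedding one more text prompt.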

Multilingual Zero-Shot Image Classification [see details]

EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

method	IN-1K (EN)	IN-1K (ZH)	IN-1K (JP)	IN-1K (AR)	IN-1K (IT)
Taiyi-CLIP-ViT-H	-	54.4	-	-	-
WuKong-ViT-L-G	-	57.5	-	-	-
CN-CLIP-ViT-H	-	59.6	-	-	-
AltCLIP-ViT-L	74.5	59.6	-	-	-
EVA-02-CLIP-E+	82.0	-	-	-	41.2
OpenCLIP-XLM-R-H	77.0	55.7	53.1	37.0	56.8
InternVL-C (ours)	83.2	64.5	61.5	44.9	65.7

Zero-Shot Video Classification

method	#frame	K400	K600	K700
OpenCLIP-G	1	65.9	66.1	59.2
EVA-02-CLIP-E+	1	69.8	69.3	63.4
InternVL-C (ours)	1	71.0	71.3	65.7
ViCLIP	8	75.7	73.5	66.4
InternVL-C (ours)	8	79.4	78.8	71.5
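
The #frame column is the number of frames taken from each video. One common way to choose them is to split the video into equal segments and take the middle frame of each; whether the evaluation above used exactly this rule is an assumption, and `sample_frame_indices` is a hypothetical helper:

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick num_frames indices at the centers of equal-length segments."""
    if num_frames <= 0 or total_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    return [
        min(total_frames - 1, int(segment * i + segment / 2))
        for i in range(num_frames)
    ]


# A 240-frame clip sampled with 1 frame and with 8 frames.
indices_1 = sample_frame_indices(240, 1)
indices_8 = sample_frame_indices(240, 8)
print(indices_1)
print(indices_8)
```

With a single frame only the middle of the clip is seen, which is one plausible reason the 8-frame rows score higher.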

Cross-Modal Retrieval

English Zero-Shot Image-Text Retrieval [see details]

model	Flickr30K						COCO						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
OpenCLIP-G	92.9	99.3	99.8	79.5	95.0	97.1	67.3	86.9	92.6	51.4	74.9	83.0	85.0
EVA-02-CLIP-E+	93.9	99.4	99.8	78.8	94.2	96.8	68.8	87.8	92.8	51.1	75.0	82.7	85.1
EVA-CLIP-8B	95.6	99.6	99.9	80.8	95.5	97.6	70.3	89.3	93.9	53.0	76.0	83.4	86.2
InternVL-C (ours)	94.7	99.6	99.9	81.7	96.0	98.2	70.6	89.0	93.5	54.1	77.3	84.6	86.6
InternVL-G (ours)	95.7	99.7	99.9	85.0	97.0	98.6	74.9	91.3	95.2	58.6	81.3	88.0	88.8
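
R@K (recall at K) is the percentage of queries whose correct match appears among the top K retrieved items, and the avg column is consistent with the mean of the twelve recall values in each row. A minimal sketch under simplifying assumptions: the toy similarity matrix is made up, and each query is assumed to have exactly one correct candidate at the same index, which is simpler than the real benchmarks:

```python
def recall_at_k(similarity, k):
    """similarity[i][j] scores query i against candidate j.

    The correct candidate for query i is assumed to be candidate i.
    Returns recall at k as a percentage.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical scores: query 1 ranks the wrong candidate first.
toy_similarity = [
    [0.9, 0.3, 0.1],
    [0.6, 0.5, 0.2],
    [0.1, 0.2, 0.8],
]
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)

# The twelve recall values of the InternVL-G row in the table.
internvl_g_row = [95.7, 99.7, 99.9, 85.0, 97.0, 98.6,
                  74.9, 91.3, 95.2, 58.6, 81.3, 88.0]
average = round(sum(internvl_g_row) / len(internvl_g_row), 1)
print(r1, r2, average)
```

The recomputed average for the InternVL-G row agrees with the 88.8 reported in the last column.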

Chinese Zero-Shot Image-Text Retrieval [see details]

model	Flickr30K-CN						COCO-CN						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
CN-CLIP-ViT-H	81.6	97.5	98.8	71.2	91.4	95.5	63.0	86.6	92.9	69.2	89.9	96.1	86.1
OpenCLIP-XLM-R-H	86.1	97.5	99.2	71.0	90.5	94.9	70.0	91.5	97.0	66.1	90.8	96.0	87.6
InternVL-C (ours)	90.3	98.8	99.7	75.1	92.9	96.4	68.8	92.0	96.7	68.9	91.9	96.5	89.0
InternVL-G (ours)	92.9	99.4	99.8	77.7	94.8	97.3	71.4	93.9	97.7	73.8	94.4	98.1	90.9

Multilingual Zero-Shot Image-Text Retrieval on XTD [see details]

method	EN	ES	FR	ZH	IT	KO	RU	JP	average
AltCLIP	95.4	94.1	92.9	95.1	94.2	94.4	91.8	91.7	93.7
OpenCLIP-XLM-R-H	97.3	96.1	94.5	94.7	96.0	90.2	93.9	94.0	94.6
InternVL-C (ours)	97.3	95.7	95.1	95.6	96.0	92.2	93.3	95.5	95.1
InternVL-G (ours)	98.6	97.7	96.5	96.7	96.9	95.1	94.8	96.1	96.6

Multimodal Dialogue

See the "Compared with SOTA VLLMs" section.

Quick Start with HuggingFace

Using InternViT-6B for visual feature extraction

    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor

    # Load the vision encoder in bfloat16 and move it to the GPU.
    model = AutoModel.from_pretrained(
        'OpenGVLab/InternViT-6B-448px-V1-5',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True).cuda().eval()

    image = Image.open('./examples/image1.jpg').convert('RGB')

    image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')

    # Preprocess the image and match the model's dtype and device.
    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    outputs = model(pixel_values)

Using InternVL-C (contrastive) and InternVL-G (generative) for cross-modal retrieval

    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor
    from transformers import AutoTokenizer

    model = AutoModel.from_pretrained(
        'OpenGVLab/InternVL-14B-224px',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True).cuda().eval()

    image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

    tokenizer = AutoTokenizer.from_pretrained(
        'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
    tokenizer.pad_token_id = 0  # set pad_token_id to 0

    images = [
        Image.open('./examples/image1.jpg').convert('RGB'),
        Image.open('./examples/image2.jpg').convert('RGB'),
        Image.open('./examples/image3.jpg').convert('RGB')
    ]
    prefix = 'summarize:'
    texts = [
        prefix + 'a photo of a red panda',  # English
        prefix + '一张熊猫的照片',  # Chinese
        prefix + '二匹の猫の写真'  # Japanese
    ]

    pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()
    input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                          truncation=True, padding='max_length').input_ids.cuda()

    # InternVL-C
    logits_per_image, logits_per_text = model(
        image=pixel_values, text=input_ids, mode='InternVL-C')
    probs = logits_per_image.softmax(dim=-1)
    # tensor([[..., 5.2185e-03, 6.0070e-08],
    #         [2.2949e-02, 9.7656e-01, 5.9903e-06],
    #         [3.2932e-06, 7.4863e-05, ...]], device='cuda:0',
    #        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

    # InternVL-G
    logits_per_image, logits_per_text = model(
        image=pixel_values, text=input_ids, mode='InternVL-G')
    probs = logits_per_image.softmax(dim=-1)
    # tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
    #         [8.6060e-03, 9.9219e-01, ...],
    #         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
    #        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

    # please set add_eos_token to False for generation
    tokenizer.add_eos_token = False
    image = Image.open('./examples/image1.jpg').convert('RGB')
    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    tokenized = tokenizer("English caption:", return_tensors='pt')
    pred = model.generate(
        pixel_values=pixel_values,
        input_ids=tokenized.input_ids.cuda(),
        attention_mask=tokenized.attention_mask.cuda(),
        num_beams=5,
        min_new_tokens=8,
    )
    caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
    # English caption: a red panda sitting on top of a wooden platform
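
The probability matrices in the comments above come from applying softmax along the last dimension of `logits_per_image`, so every row sums to 1 and the largest entry in a row marks the text that best matches that image. A dependency-free sketch of that step; the logit values are hypothetical and only chosen so that each image matches the text with the same index:

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical image-to-text logits: rows are images, columns are texts.
logits_per_image = [
    [12.0, 6.0, -4.0],
    [5.0, 9.0, -3.0],
    [-2.0, 1.0, 11.0],
]
probs = [softmax(row) for row in logits_per_image]

# Index of the best-matching text for each image.
best_text = [row.index(max(row)) for row in probs]
print(best_text)
```

Subtracting the row maximum before exponentiating does not change the result but avoids overflow for large logits.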

Using InternVL-Chat for multimodal chat

Here, we take the smaller OpenGVLab/InternVL2-8B as an example:

    import numpy as np
    import torch
    import torchvision.transforms as T
    from decord import VideoReader, cpu
    from PIL import Image
    from torchvision.transforms.functional import InterpolationMode
    from transformers import AutoModel, AutoTokenizer

    IMAGENET_MEAN = (0.485, 0.456, 0.406)
    IMAGENET_STD = (0.229, 0.224, 0.225)

    def build_transform(input_size):
        MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
        transform = T.Compose([
            T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD)
        ])
        return transform

    def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
        best_ratio_diff = float('inf')
        best_ratio = (1, 1)
        area = width * height
        for ratio in target_ratios:
            target_aspect_ratio = ratio[0] / ratio[1]
            ratio_diff = abs(aspect_ratio - target_aspect_ratio)
            if ratio_diff < best_ratio_diff:
                best_ratio_diff = ratio_diff
                best_ratio = ratio
            elif ratio_diff == best_ratio_diff:
                if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                    best_ratio = ratio
        return best_ratio

    def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
        orig_width, orig_height = image.size
        aspect_ratio = orig_width / orig_height

        # calculate the existing image aspect ratio
        target_ratios = set(
            (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
            i * j <= max_num and i * j >= min_num)
        target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

        # find the closest aspect ratio to the target
        target_aspect_ratio = find_closest_aspect_ratio(
            aspect_ratio, target_ratios, orig_width, orig_height, image_size)

        # calculate the target width and height
        target_width = image_size * target_aspect_ratio[0]
        target_height = image_size * target_aspect_ratio[1]
        blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

        # resize the image
        resized_img = image.resize((target_width, target_height))
        processed_images = []
        for i in range(blocks):
            box = (
                (i % (target_width // image_size)) * image_size,
                (i // (target_width // image_size)) * image_size,
                ((i % (target_width // image_size)) + 1) * image_size,
                ((i // (target_width // image_size)) + 1) * image_size
            )
            # split the image
            split_img = resized_img.crop(box)
            processed_images.append(split_img)
        # ... (the listing is truncated here)
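
The tiling logic in the listing above can be traced without loading an image: it picks the grid of 448-pixel tiles whose aspect ratio is closest to the input and then cuts the resized image into that many crops. The sketch below repeats the ratio search from the listing and adds a hypothetical `tile_boxes` helper (not part of the original code) that returns only the chosen grid and crop boxes; the 800x600 input size is an arbitrary example:

```python
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Same search as in the listing: the closest ratio wins; ties prefer larger grids for large images."""
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        ratio_diff = abs(aspect_ratio - ratio[0] / ratio[1])
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def tile_boxes(width, height, min_num=1, max_num=12, image_size=448):
    """Hypothetical helper: return the (columns, rows) grid and the crop boxes."""
    target_ratios = sorted(
        {(i, j)
         for i in range(1, max_num + 1)
         for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda r: r[0] * r[1])
    cols, rows = find_closest_aspect_ratio(
        width / height, target_ratios, width, height, image_size)
    boxes = []
    for i in range(cols * rows):
        left = (i % cols) * image_size
        top = (i // cols) * image_size
        boxes.append((left, top, left + image_size, top + image_size))
    return (cols, rows), boxes


grid, boxes = tile_boxes(800, 600)
print(grid, len(boxes), boxes[0], boxes[-1])
```

For an 800x600 input the 4:3 grid matches the aspect ratio exactly, so the image would be resized to 1792x1344 and split into 12 tiles, while a 448x448 input stays a single tile.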


Additional information

Version: InternL-Chat-1.5.0
Type: Other source code
Updated: 2024-11-02
Size: 50MB
Source: GitHub
