clip retrieval

Version 2.44.0

Easily compute CLIP embeddings and build a CLIP retrieval system with them. 100M text+image embeddings can be processed in 20 hours on a 3080.

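clip client allows remote querying of a backend via Python (see the clip-client notebook).
clip inference lets you quickly compute image and text embeddings (1500 samples/s on a 3080).
clip index builds efficient indices out of the embeddings.
clip filter lets you filter the data using the clip index.
clip back hosts the indices with a simple Flask service.
clip front is a simple UI that queries the back. Check it out at the clip-retrieval UI.
clip end2end runs img2dataset, inference and index, then back and front, to make all of this easier to start with.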

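End to end, this makes it possible to build a simple semantic search system. Interested in learning about semantic search in general? You can read my Medium post on the topic.

Also see "laion5B" and "semantic search at billions scale" to read more on how to scale this to billions of samples.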

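If you believe in building reusable tools that make data easy to use for ML and you would like to contribute, please join the DataToML chat.

Who is using clip retrieval?

cah-prepro preprocesses the 400M image+text Crawling@Home dataset. clip-retrieval is used to compute the 400M CLIP embeddings and the indices.
autofaiss uses clip-retrieval to show a usage example (see the multimodal notebook example there).
afiaka87's openai demo shows how to search among the 1M examples released by OpenAI for their DALL-E demo.
antarctic-captions by dzryk uses autofaiss and clip inference to generate anchors for the image-to-text task, with great success.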

Install

pip install clip-retrieval

If you are interested in running the laion5B index, see the dedicated doc.

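clip client

ClipClient allows remote querying of a clip-retrieval backend via Python.

See "ClipClient - Getting Started Notebook" for a Jupyter notebook example.

API initialization

During initialization you can specify a few parameters:

backend_url: the URL of the backend. (required)
indice_name: the name of the index you want to use. (required)
aesthetic_score: the aesthetic score as rated by the aesthetic predictor. Default is 9.
use_mclip: whether to use a multilingual version of CLIP. Default is False.
aesthetic_weight: the weight of the aesthetic score. Default is 0.5.
modality: search over images or texts in the index, one of Modality.IMAGE or Modality.TEXT. Default is Modality.IMAGE.
num_images: the number of images to return from the API. Default is 40.
deduplicate: whether to deduplicate the results by image embedding. Default is True.
use_safety_model: whether to remove unsafe images. Default is True.
use_violence_detector: whether to remove images containing violence. Default is True.

For instance, to query the hosted backend for Laion5B with the default parameters: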

from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(url="https://knn.laion.ai/knn-service", indice_name="laion5B-L-14")

Query by text

You can find captioned images similar to the text you provide.

results = client.query(text="an image of a cat")
results[0]
> {'url': 'https://example.com/kitten.jpg', 'caption': 'an image of a kitten', 'id': 14, 'similarity': 0.2367108941078186}

Query by image

You can also find captioned images similar to the image you provide. Images can be passed as a local path or a URL.

cat_results = client.query(image="cat.jpg")
dog_results = client.query(image="https://example.com/dog.jpg")

Query by embedding

You can also find captioned images similar to a CLIP embedding you provide.

cat_results = client.query(embedding_input=cat_embedding)

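Query a directory of images

To enhance an existing dataset with similar text/image pairs, you can query a directory of images and combine the results.

import json
import os

# Query each image by its full path and flatten the per-image result lists into one list.
all_results = [result for image in os.listdir("my-images") for result in client.query(image=os.path.join("my-images", image))]
with open("search-results.json", "w") as f:
    json.dump(all_results, f)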

Create a dataset

You can create a dataset from the saved JSON results with the img2dataset tool.

img2dataset "search-results.json" \
    --input_format="json" \
    --output_folder="knn_search_dataset" \
    --caption_col="caption"

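clip end2end

First pick a dataset of image URLs and captions (examples). If your GPU does not have enough VRAM, you may want to run export CUDA_VISIBLE_DEVICES= first to avoid using it. Then run:

wget https://github.com/rom1504/img2dataset/raw/main/tests/test_files/test_1000.parquet
clip-retrieval end2end test_1000.parquet /tmp/my_output

Then go to http://localhost:1234 and enjoy searching among your pictures.

Use --run_back False if you don't want to run the backend.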

clip inference

Get some images into a folder (image_folder in this example), for instance by running:

pip install img2dataset
echo 'https://placekitten.com/200/305' >> myimglist.txt
echo 'https://placekitten.com/200/304' >> myimglist.txt
echo 'https://placekitten.com/200/303' >> myimglist.txt
img2dataset --url_list=myimglist.txt --output_folder=image_folder --thread_count=64 --image_size=256

You can also put text files with the same names as the images in that folder to get the text embeddings.

Then run clip-retrieval inference --input_dataset image_folder --output_folder embeddings_folder

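The output folder will contain:

img_emb/
- img_emb_0.npy containing the image embeddings as a numpy array
text_emb/
- text_emb_0.npy containing the text embeddings as a numpy array
metadata/
- metadata_0.parquet containing the image paths, captions and metadata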

This scales to millions of samples. At the 1400 samples/s of a 3080, 10M samples can be processed in 2 hours.
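The throughput figures quoted in this README translate directly into wall-clock estimates. A minimal sketch of that arithmetic follows; the function name is purely illustrative, and the rates are the ones stated here for a 3080, so your own hardware may differ.

```python
def processing_hours(num_samples: int, samples_per_second: float) -> float:
    """Return the estimated processing time in hours at a constant throughput."""
    return num_samples / samples_per_second / 3600


# 10M samples at 1400 samples/s: a little under 2 hours.
print(f"{processing_hours(10_000_000, 1400):.2f} h")
# 100M samples at 1500 samples/s: under the 20 hours quoted at the top.
print(f"{processing_hours(100_000_000, 1500):.2f} h")
```

The estimate ignores data loading and writing overhead, so treat it as a lower bound.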

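API

clip_inference turns a set of text+image pairs into CLIP embeddings.

input_dataset: path to the input dataset. A folder if input_format is files. A bash brace pattern such as "{000..150}.tar" (see https://pypi.org/project/braceexpand/) if webdataset (required)
output_folder: folder where the CLIP embeddings will be saved, as well as the metadata (required)
input_format: files or webdataset (default files)
cache_path: cache path for webdataset (default None)
batch_size: number of items to run inference on at once (default 256)
num_prepro_workers: number of processes doing the preprocessing (default 8)
enable_text: enable text processing (default True)
enable_image: enable image processing (default True)
enable_metadata: enable metadata processing (default False)
write_batch_size: write batch size (default 10**6)
wds_image_key: key to use for images in a webdataset (default jpg)
wds_caption_key: key to use for captions in a webdataset (default txt)
clip_model: CLIP model to load (default ViT-B/32). Specify it as "open_clip:ViT-B-32/laion2b_s34b_b79k" to use open_clip, or as "hf_clip:patrickjohncyh/fashion-clip" to use a Hugging Face CLIP model.
mclip_model: MCLIP model to load (default sentence-transformers/clip-ViT-B-32-multilingual-v1)
use_mclip: if False, run inference with CLIP; otherwise with MCLIP (default False)
use_jit: use jit for the CLIP model (default True)
distribution_strategy: how to distribute the job, see the distribution section for details (default sequential)
wds_number_file_per_input_file: estimated number of samples per tar when using wds without specifying output_partition_count (default 10000)
output_partition_count: number of output partitions (default None)
wandb_project: wandb project to use (default clip_retrieval)
enable_wandb: whether to use wandb (default False)
clip_cache_path: cache path for CLIP (default None)
slurm_job_name: the job name to use in slurm (default None)
slurm_partition: the slurm partition to create a job in (default None)
slurm_jobs: the number of jobs to create in slurm (default None)
slurm_job_comment: the job comment to use (default None)
slurm_nodelist: a list of specific nodes to use
slurm_exclude: a list of nodes to exclude when creating jobs (default None)
slurm_job_timeout: the job timeout; if not supplied it defaults to 2 weeks (default None)
slurm_cache_path: cache path to use for slurm-related tasks (default None)
slurm_verbose_wait: whether to print the status of your slurm job (default False)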
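For webdataset input, input_dataset is a brace pattern that stands for a list of tar shards. The sketch below shows what such a pattern expands to. It is a simplified stand-in written for illustration, not the braceexpand package itself, and it only handles a single zero-padded numeric range; expand_numeric_range is a hypothetical helper name.

```python
import re


def expand_numeric_range(pattern: str) -> list[str]:
    """Expand one zero-padded "{start..end}" range, e.g. "{000..150}.tar".

    Simplified stand-in for illustration; the real braceexpand package
    also supports nested and comma-separated patterns.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = match.group(1), match.group(2)
    width = max(len(start), len(end))  # keep the zero padding
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]


shards = expand_numeric_range("{000..150}.tar")
print(len(shards), shards[0], shards[-1])
```

Both bounds are inclusive, so "{000..150}.tar" covers 151 shards.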

DeepSparse backend

DeepSparse is an inference runtime for fast sparse model inference on CPUs. A backend for it is available in clip-retrieval: install it with pip install deepsparse-nightly[clip] and specify a clip_model with the prefix "nm:", such as "nm:neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds" or "nm:mgoin/CLIP-ViT-B-32-laion2b_s34b_b79k-ds".

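Inference worker

If you want more control over how inference is run, you can create and call workers directly with clip-retrieval inference.worker.

Example usage:

clip-retrieval inference.worker \
    --tasks="[0]" \
    --input_dataset="input/folder/{000000..000100}.tar" \
    --output_folder="example/path" \
    --input_format="webdataset" \
    --output_partition_count="1"

This invokes a single worker that can be instructed to focus on a specific subset of the input_dataset. The worker processes the tasks passed to it sequentially. Here, tasks is the list of partition_ids that this worker is responsible for.

To compute the number of tasks manually, use the following formula: number_samples / wds_number_file_per_input_file.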
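The formula above can be sketched in a few lines. Rounding up and the round-robin split across workers are assumptions made for this illustration, not behaviour confirmed by the tool; count_tasks and split_tasks are hypothetical helper names.

```python
import math


def count_tasks(number_samples: int, wds_number_file_per_input_file: int = 10000) -> int:
    """Number of partitions, rounded up so a partial partition still gets a task."""
    return math.ceil(number_samples / wds_number_file_per_input_file)


def split_tasks(num_tasks: int, num_workers: int) -> list[list[int]]:
    """Assign partition ids to workers round-robin (one possible assignment)."""
    return [list(range(w, num_tasks, num_workers)) for w in range(num_workers)]


num_tasks = count_tasks(1_000_000)  # 1M samples at 10000 per partition
assignments = split_tasks(num_tasks, num_workers=4)
print(num_tasks, assignments[0][:3])
```

Each inner list could then be passed as the --tasks value of one worker.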

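The API is very similar to clip-retrieval inference, with a few minor changes:

tasks: list of integers giving the partition_ids this worker is responsible for computing (required)
input_dataset: path to the input dataset. A folder if input_format is files. A bash brace pattern such as "{000..150}.tar" (see https://pypi.org/project/braceexpand/) if webdataset (required)
output_folder: folder where the CLIP embeddings will be saved, as well as the metadata (required)
output_partition_count: number of output partitions (required)
input_format: files or webdataset (default files)
cache_path: cache path for webdataset (default None)
batch_size: number of items to run inference on at once (default 256)
num_prepro_workers: number of processes doing the preprocessing (default 8)
enable_text: enable text processing (default True)
enable_image: enable image processing (default True)
enable_metadata: enable metadata processing (default False)
wds_image_key: key to use for images in a webdataset (default jpg)
wds_caption_key: key to use for captions in a webdataset (default txt)
clip_model: CLIP model to load (default ViT-B/32). Specify it as "open_clip:ViT-B-32-quickgelu" to use open_clip, or as "hf_clip:patrickjohncyh/fashion-clip" to use a Hugging Face CLIP model.
mclip_model: MCLIP model to load (default sentence-transformers/clip-ViT-B-32-multilingual-v1)
use_mclip: if False, run inference with CLIP; otherwise with MCLIP (default False)
use_jit: use jit for the CLIP model (default True)
wandb_project: wandb project to use (default clip_retrieval)
enable_wandb: whether to use wandb (default False)
clip_cache_path: cache path for CLIP (default None)

Note: the worker does not accept the following arguments:
write_batch_size: write batch size (default 10**6)
distribution_strategy: how to distribute the job, see the distribution section for details (default sequential)
wds_number_file_per_input_file: estimated number of samples per tar when using wds without specifying output_partition_count (default 10000)
any of the SLURM arguments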

Loading/writing files on hdfs

To load a webdataset from an hdfs folder, set --input_dataset "pipe:hdfs dfs -cat path_on_hdfs" in the request, without the "hdfs://" prefix.
To write the output to hdfs, set --output_hdfs_folder to the path on hdfs, prefixed by "hdfs://".

Example of an hdfs query using the webdataset format: clip_inference --input_dataset "pipe:hdfs dfs -cat /myfolder/webdataset/{00000..00010}.tar" --output_folder "hdfs://myfolder/embeddings" --input_format webdataset

Loading/writing files on s3

clip_inference --input_dataset "pipe:aws s3 cp --quiet s3://myfolder/webdataset/{00000..00010}.tar -" --output_folder "s3://myfolder/embeddings" --input_format webdataset

Distributed inference

To run this on multiple nodes (and multiple GPUs), see the tutorial at docs/distributed_clip_inference.md

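clip index

clip index takes the output of clip inference as input and builds an index from it using autofaiss.

clip-retrieval index --embeddings_folder embeddings_folder --index_folder index_folder

--max_index_memory_usage "16G" configures the amount of memory the index will consume. More memory means better knn recall (default 4G).
--current_memory_available 24G controls how much memory is used during the creation process (default 16G).
--image_subfolder "img_emb" specifies a subfolder for the image embeddings, which is appended to the --embeddings_folder option (default img_emb).
--text_subfolder "text_emb" specifies a subfolder for the text embeddings, which is appended to the --embeddings_folder option (default text_emb).
--copy_metadata True chooses whether to copy the metadata at the end of the process (default True).
--nb_cores 8 controls the number of threads (default None, which uses all cores).

The output is a folder containing:

image.index, a faiss index for the images
text.index, a faiss index for the texts
a metadata folder containing the parquet metadata

Thanks to autofaiss and faiss, this scales to hundreds of millions of samples in a few hours.

You may want to choose carefully how much memory your index uses in order to maximize the knn recall. The autofaiss index selection colab can help, along with the autofaiss score_index command, to check the recall of your index. In general, indices using more memory get better recall and are therefore closer to a naive (slow) knn.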
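To make "recall" concrete: it is the share of the true nearest neighbours, as found by a naive exhaustive search, that the index also returns. The following pure-Python sketch illustrates the measure on toy data. It is not how autofaiss score_index is implemented, and the approximate result list is a made-up value standing in for the output of a compressed index.

```python
def exact_knn(query, vectors, k):
    """Naive (slow) knn: score every vector by inner product, keep the top k ids."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
exact = exact_knn([1.0, 0.0], vectors, k=2)  # the two best matches are ids 0 and 1
approx = [0, 3]                              # hypothetical output of a compressed index
print(recall_at_k(approx, exact))            # half of the true neighbours were found
```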

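clip filter

Once the embeddings are computed, you may want to filter the data by a specific query. To do so, run clip-retrieval filter --query "cat" --output_folder "cat/" --indice_folder "indice_folder". It copies the 100 best images for this query into the output folder. Using --num_results or --threshold may help refine the filter.

Thanks to the fast knn index, this can run in real time (<10ms) for large K values (100000), and in minutes for very large K values.

This script works for small datasets. For larger ones, see [notebook/simple_filter.ipynb].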

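clip back

clip back is a simple knn service backend. When both hdf5 and faiss memory mapping are used, it only needs the memory used by CLIP itself, which is 4GB.

Run it as follows (output_folder is the output of clip index):

echo '{"example_index": "output_folder"}' > indices_paths.json
clip-retrieval back --port 1234 --indices-paths indices_paths.json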
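When several indices are served, it can be more convenient to generate indices_paths.json from a script than with echo. A minimal sketch, assuming the file is nothing more than the name-to-folder mapping shown above; write_indices_paths is a hypothetical helper, and the demo writes to a temporary directory only to keep the example side-effect free.

```python
import json
import os
import tempfile


def write_indices_paths(indices: dict, path: str) -> None:
    """Write the index-name -> index-folder mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(indices, f, indent=2)


# In practice you would write next to where you launch the backend.
demo_path = os.path.join(tempfile.mkdtemp(), "indices_paths.json")
write_indices_paths({"example_index": "output_folder"}, demo_path)
print(open(demo_path, encoding="utf-8").read())
```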

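Options:

--use_jit True uses jit for the CLIP model.
--clip_model "ViT-B/32" selects the CLIP model to use. Prefix it with "open_clip:" to use an open_clip model.
--enable_mclip_option True loads the mclip model, making it possible to search in any language.
--columns_to_return='["url", "image_path", "caption", "NSFW"]' specifies which columns should be fetched from the metadata and returned by the backend. With the hdf5 cache, specifying fewer columns is useful to speed up queries.
--enable_faiss_memory_mapping=True uses an index with memory mapping. This reduces the memory usage of the index to zero.
--enable_hdf5 True enables hdf5 caching for the metadata. The hdf5 cache makes it possible to use the metadata with almost no memory usage.
--use_arrow True uses arrow instead of hdf5. It should be used along with clip_back_prepro for very large datasets (billions).
--reorder_metadata_by_ivf_index True takes advantage of the data locality of the results of a knn ivf index: it orders the metadata collection by IVF cluster. This makes metadata retrieval much faster, as reads access a few mostly sequential parts of the metadata instead of many non-sequential parts. In practice, this means being able to retrieve 1M items in 1s, whereas only 1000 items can be retrieved in 1s without this method. The metadata is ordered using the first image index.
--provide_safety_model True automatically downloads and loads a safety model. You need to pip install the optional autokeras dependency for this to work.
--provide_violence_detector True loads a violence detector (described in a paper).
--provide_aesthetic_embeddings True loads the aesthetic embeddings and lets users move the query towards a nicer point of the CLIP space.

These options can also be provided in the config file, so that each index gets its own options. Example: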

{
    "laion5B": {
        "indice_folder": "/mnt/laion5B/prepared_data",
        "provide_safety_model": true,
        "enable_faiss_memory_mapping": true,
        "use_arrow": true,
        "enable_hdf5": false,
        "reorder_metadata_by_ivf_index": false,
        "columns_to_return": ["url", "caption"],
        "clip_model": "ViT-L/14",
        "enable_mclip_option": false
    },
    "laion_400m": {
        "indice_folder": "/mnt/laion400M/index100",
        "provide_safety_model": true,
        "enable_faiss_memory_mapping": true,
        "enable_hdf5": true,
        "use_arrow": false,
        "reorder_metadata_by_ivf_index": true,
        "enable_mclip_option": true,
        "clip_model": "ViT-B/32"
    }
}

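Using the hdf5 or arrow cache is a good idea when:

you do not have enough RAM to load the metadata in memory
your disk is fast (i.e. you have an SSD)

At this point you have a simple Flask server running on port 1234 that can answer these queries:

/indices-list -> returns the list of indices
/knn-service, which takes as input: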

{
    "text": "a text query",
    "image": "a base64 image",
    "image_url": "http://some-url.com/a.jpg",
    "modality": "image", // image or text index to use
    "num_images": 4, // number of output images
    "indice_name": "example_index",
    "num_result_ids": 4 // optional, if specified fetch this number of results in total but only num_images with metadata
}

text, image and image_url are mutually exclusive. The service returns:

[
    {
        "image": "base 64 of an image",
        "text": "some result text",
        "id": 543
    },
    {
        "image": "base 64 of an image",
        "text": "some result text",
        "id": 782
    }
]

Each object may also contain a url field if the metadata provides it.
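The request format above can be assembled in code. The sketch below builds the JSON body and enforces the mutual exclusivity of text, image and image_url. The helper name build_knn_payload is hypothetical, and requiring exactly one of the three fields is an assumption of this example; sending the request is left out so that the snippet runs without a live backend.

```python
import json


def build_knn_payload(indice_name, text=None, image=None, image_url=None,
                      modality="image", num_images=4, num_result_ids=None):
    """Build the JSON body for /knn-service.

    Exactly one of text, image (base64 string) and image_url must be given.
    """
    queries = {"text": text, "image": image, "image_url": image_url}
    provided = {key: value for key, value in queries.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("provide exactly one of text, image or image_url")
    payload = {**provided, "modality": modality,
               "num_images": num_images, "indice_name": indice_name}
    if num_result_ids is not None:  # optional field
        payload["num_result_ids"] = num_result_ids
    return json.dumps(payload)


body = build_knn_payload("example_index", text="a text query")
print(body)
```

The resulting string could then be sent as the body of an HTTP request to the /knn-service endpoint of your backend.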

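The id is the position of the item in the index. It can be used to query the metadata through the /metadata endpoint, which takes as input: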

{
    "indice_name": "example_index",
    "ids": [543, 782]
}

which returns:

{
    "image": "base 64 of an image",
    "text": "some result text"
    // any other key available in the metadata and specified in columns_to_return cli option
}

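The num_result_ids argument of /knn-service and the /metadata endpoint can be used together to run a large knn query and then fetch the metadata only when needed. This makes sense because the knn search can be very efficient: the strong locality of reference of knn IVF indices makes it fast to run knn with a large K, whereas the current on-disk implementation of the metadata (hdf5) lacks that property and therefore cannot retrieve a large number of random items quickly. In particular, this can be used to implement infinite scroll in a front end.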
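As an illustration of that pattern, the sketch below splits the ids returned by one large knn query into pages and builds one /metadata request body per page, which is what an infinite-scroll front end would send as the user scrolls. The helper name and the sample ids are made up for the example.

```python
def metadata_requests(indice_name, result_ids, page_size):
    """Yield one /metadata request body per page of ids returned by /knn-service."""
    for start in range(0, len(result_ids), page_size):
        yield {"indice_name": indice_name, "ids": result_ids[start:start + page_size]}


# Hypothetical ids from a knn query made with a large num_result_ids.
ids = [543, 782, 12, 90, 7]
pages = list(metadata_requests("example_index", ids, page_size=2))
print(len(pages))
print(pages[0])
```

Only the pages that are actually displayed need to be requested, which keeps the slow metadata reads proportional to what the user sees.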

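By default the backend also exposes a front end. That front end hits this backend by default; however, you may need to specify whether this happens over http or https, in which case use the --default_backend option to specify the backend URL. --url_column specifies the name of the URL column for the front end.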

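clip back: benchmark and monitoring

This backend has a 50ms latency when using memory-mapped indices and metadata. Throughput is about 20 queries/s. For high throughput, a grpc server is required, as well as a GPU for fast CLIP inference. Turning off the memory mapping options can also speed up requests, at the cost of high RAM usage.

This backend also exposes a prometheus /metrics endpoint, as well as a human-readable summary at /metrics-summary. This can (optionally) be used to set up a grafana dashboard for monitoring.

Such a dashboard shows that, for image URL searches, the slowest part of any call is fetching the image by its URL, which takes up to 300ms. For text queries or image queries, the latency is about 50ms. Here is an example of the metrics summary output: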

 Among 20.0 calls to the knn end point with an average latency of 0.1889s per request, the step costs are (in order):
                        name                               description  calls  average proportion
0              download_time             Time spent downloading an url      6  0.3215s     170.2%
1          metadata_get_time            Time spent retrieving metadata     20  0.0415s      21.9%
2             knn_index_time       Time spent doing a knn on the index     20  0.0267s      14.1%
3  image_clip_inference_time   Time spent doing a image clip inference      6  0.0206s      10.9%
4   text_clip_inference_time    Time spent doing a text clip inference     14  0.0186s       9.8%
5          image_prepro_time  Time spent doing the image preprocessing      6  0.0097s       5.2%
6           text_prepro_time   Time spent doing the text preprocessing     14  0.0020s       1.0%
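The proportions in this summary add up to more than 100% because not every step runs on every call: only 6 of the 20 calls downloaded an image. The sketch below re-derives two figures from the table. Reading the proportion column as "step average divided by overall average latency" is an inference from the numbers, not a documented definition.

```python
# (name, calls, average seconds) copied from the summary above.
steps = [
    ("download_time", 6, 0.3215),
    ("metadata_get_time", 20, 0.0415),
    ("knn_index_time", 20, 0.0267),
    ("image_clip_inference_time", 6, 0.0206),
    ("text_clip_inference_time", 14, 0.0186),
    ("image_prepro_time", 6, 0.0097),
    ("text_prepro_time", 14, 0.0020),
]
total_calls = 20
average_latency = 0.1889

# Proportion column: a step's average divided by the overall average latency.
download_share = steps[0][2] / average_latency
print(f"{download_share:.1%}")

# Weighting each step by how often it ran gives back roughly the average latency.
weighted = sum(calls * avg for _, calls, avg in steps) / total_calls
print(f"{weighted:.4f}s")
```

The call-weighted sum comes out close to the reported 0.1889s, which suggests the listed steps account for almost all of the request time.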

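clip front

clip front is a simple UI that connects to clip back and displays the results. You can use it at the clip-retrieval UI.

Or you can run it yourself:

npm install -g clip-retrieval-front
clip-retrieval-front 3005

You can also run it with clip-retrieval front from the Python package.

Development

To develop it, go to the front directory and run npm install, then npm start.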

For development

Either locally, or in gitpod (run export PIP_USER=false there).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.

Run python -m pytest -x -s -v tests -k "test_runner" to run a specific test.

If you want to use the front end through the Python backend or front end, run:

cd front
npm install
npm run build
cd ..
pip install -e .

Citation

@misc{beaumont-2022-clip-retrieval,
  author = {Romain Beaumont},
  title = {Clip Retrieval: Easily compute clip embeddings and build a clip retrieval system with them},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/rom1504/clip-retrieval}}
}
