clip-retrieval

Version 2.44.0

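Easily compute clip embeddings and build a clip retrieval system with them. 100M text+image embeddings can be processed in 20 hours using a 3080.

clip client allows remote querying of a backend via Python. See the clip-client notebook
clip inference lets you quickly (1500 samples/s on a 3080) compute image and text embeddings
clip index builds efficient indices out of the embeddings
clip filter lets you filter out data using the clip index
clip back hosts the indices with a simple flask service
clip front is a simple UI that queries the back. Check it out at clip-retrieval ui
clip end2end runs img2dataset, inference and index, then back and front, to make all of this easier to get started with

End to end, this makes it possible to build a simple semantic search system. Interested in learning about semantic search in general? You can read my Medium post on the topic.

Also see laion5B and semantic search at billions scale to read more on how to scale this to billions of samples.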

[Image: clip front]

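If you believe in making reusable tools to make data easy to use for ML and you would like to contribute, please join the DataToML chat.

Who is using clip retrieval?

cah-prepro preprocesses the 400M image+text crawling at home dataset. clip-retrieval is used to compute the 400M clip embeddings and the indices
autofaiss uses clip-retrieval to show an example of use (see the multimodal notebook example there)
afiaka87 openai demo shows how to look through the 1M examples released by openai for their DALL-E demo
antarctic-captions by dzryk uses autofaiss and clip inference as a way to generate anchors for the image-to-text task, with great success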

Install

pip install clip-retrieval

If you are interested in running the laion5B index, see this doc

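clip client

ClipClient allows remote querying of a clip-retrieval backend via Python.

See the ClipClient - Getting Started notebook for a Jupyter notebook example.

API initialization

During initialization you can specify a few parameters:

backend_url: the URL of the backend; the example below passes it as the url keyword (required)
indice_name: the name of the index to use (required)
aesthetic_score: the aesthetic score as rated by the aesthetic predictor. Default is 9.
use_mclip: whether to use a multilingual version of CLIP. Default is False.
aesthetic_weight: the weight of the aesthetic score. Default is 0.5.
modality: search over images or text in the index, one of Modality.IMAGE or Modality.TEXT. Default is Modality.IMAGE.
num_images: the number of images to return from the API. Default is 40.
deduplicate: whether to deduplicate the results by image embedding. Default is True.
use_safety_model: whether to remove unsafe images. Default is True.
use_violence_detector: whether to remove images containing violence. Default is True.

For instance, to query the hosted backend for Laion5B with the default parameters: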

from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(url="https://knn.laion.ai/knn-service", indice_name="laion5B-L-14")

Query by text

You can find captioned images similar to the text you provide.

results = client.query(text="an image of a cat")
results[0]
> {'url': 'https://example.com/kitten.jpg', 'caption': 'an image of a kitten', 'id': 14, 'similarity': 0.2367108941078186}

Query by image

You can also find captioned images similar to the image you provide. Images can be passed as a local path or a URL.

cat_results = client.query(image="cat.jpg")
dog_results = client.query(image="https://example.com/dog.jpg")

Query by embedding

You can also find captioned images similar to a clip embedding you provide.

cat_results = client.query(embedding_input=cat_embedding)

Query a directory of images

To enhance an existing dataset with similar text/image pairs, you can query a directory of images and combine the results.

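import json
import os

# os.listdir returns bare file names, so join them with the directory; flatten the per-image result lists.
all_results = [r for name in os.listdir("my-images") for r in client.query(image=os.path.join("my-images", name))]
with open("search-results.json", "w") as f:
    json.dump(all_results, f)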
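
When several query images return overlapping neighbours, the combined list can contain the same URL more than once. The following is a minimal sketch of one way to deduplicate it, assuming each result is a dict with the url and similarity keys shown in the text-query example. The helper name dedupe_by_url and the sample data are hypothetical and not part of clip-retrieval.

```python
def dedupe_by_url(results):
    """Keep one entry per URL, preferring the highest similarity (hypothetical helper)."""
    best = {}
    for result in results:
        url = result["url"]
        if url not in best or result["similarity"] > best[url]["similarity"]:
            best[url] = result
    return list(best.values())


# Made-up results shaped like the ClipClient output shown earlier.
sample_results = [
    {"url": "https://example.com/a.jpg", "caption": "a cat", "id": 1, "similarity": 0.31},
    {"url": "https://example.com/b.jpg", "caption": "a dog", "id": 2, "similarity": 0.27},
    {"url": "https://example.com/a.jpg", "caption": "a cat", "id": 1, "similarity": 0.35},
]
unique_results = dedupe_by_url(sample_results)
print(len(unique_results))
```

Running this before writing the JSON file would keep img2dataset from downloading the same image twice.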

Create a dataset

You can create a dataset using the saved json results and the tool img2dataset.

img2dataset "search-results.json" \
    --input_format="json" \
    --output_folder="knn_search_dataset" \
    --caption_col="caption"

clip end2end

First pick a dataset of image urls and captions (examples). If your GPU does not have enough VRAM, you may want to run export CUDA_VISIBLE_DEVICES= first to avoid using it. Then run:

 wget https://github.com/rom1504/img2dataset/raw/main/tests/test_files/test_1000.parquet
clip-retrieval end2end test_1000.parquet /tmp/my_output

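Then go to http://localhost:1234 and enjoy searching among your pictures

Use --run_back False if you don't want to run the backend

clip inference

Get some images into a folder (image_folder in the commands below), for instance by doing: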

 pip install img2dataset
echo 'https://placekitten.com/200/305' >> myimglist.txt
echo 'https://placekitten.com/200/304' >> myimglist.txt
echo 'https://placekitten.com/200/303' >> myimglist.txt
img2dataset --url_list=myimglist.txt --output_folder=image_folder --thread_count=64 --image_size=256

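You can also put text files with the same names as the images in that folder, to get the text embeddings.

Then run clip-retrieval inference --input_dataset image_folder --output_folder embeddings_folder

The output folder will contain:

img_emb/
- img_emb_0.npy containing the image embeddings as numpy
text_emb/
- text_emb_0.npy containing the text embeddings as numpy
metadata/
- metadata_0.parquet containing the image paths, captions and metadata

This scales to millions of samples. At 1400 samples/s on a 3080, 10M samples are processed in 2 hours.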
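
The 2 hour figure follows from simple arithmetic. The sketch below reproduces it, assuming a constant throughput and ignoring model loading and disk I/O; the function name hours_to_process is illustrative only.

```python
def hours_to_process(num_samples, samples_per_second):
    """Wall-clock hours needed at a constant throughput."""
    return num_samples / samples_per_second / 3600


# 10M samples at the 1400 samples/s quoted for a 3080.
inference_hours = hours_to_process(10_000_000, 1400)
print(f"{inference_hours:.2f} hours")
```

The same function can be used to estimate run time for another dataset size or a GPU with a different measured throughput.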

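API

clip_inference turns a set of text+image into clip embeddings

input_dataset: path to the input dataset. A folder if input_format is files. A bash brace pattern such as "{000..150}.tar" (see https://pypi.org/project/braceexpand/) if webdataset (required)
output_folder: folder where the clip embeddings will be saved, as well as the metadata (required)
input_format: files or webdataset (default files)
cache_path: cache path for webdataset (default None)
batch_size: number of items to run inference on at once (default 256)
num_prepro_workers: number of processes doing the preprocessing (default 8)
enable_text: enable text processing (default True)
enable_image: enable image processing (default True)
enable_metadata: enable metadata processing (default False)
write_batch_size: write batch size (default 10**6)
wds_image_key: key to use for images in webdataset (default jpg)
wds_caption_key: key to use for captions in webdataset (default txt)
clip_model: CLIP model to load (default ViT-B/32). Specify it as "open_clip:ViT-B-32/laion2b_s34b_b79k" to use open_clip or "hf_clip:patrickjohncyh/fashion-clip" to use a Hugging Face clip model.
mclip_model: MCLIP model to load (default sentence-transformers/clip-ViT-B-32-multilingual-v1)
use_mclip: if False, inference is performed with CLIP; otherwise with MCLIP (default False)
use_jit: use jit for the clip model (default True)
distribution_strategy: choose how to distribute the job, see the distribution section for details (default sequential)
wds_number_file_per_input_file: estimate of the number of samples per tar when using wds and not specifying output_partition_count (default 10000)
output_partition_count: number of output partitions (default None)
wandb_project: wandb project to use (default clip_retrieval)
enable_wandb: whether to use wandb (default False)
clip_cache_path: cache path for clip (default None)
slurm_job_name: the job name to use in slurm (default None)
slurm_partition: the slurm partition to create a job in (default None)
slurm_jobs: the number of jobs to create in slurm (default None)
slurm_job_comment: the job comment to use (default None)
slurm_nodelist: a list of specific nodes to use (default None)
slurm_exclude: a list of nodes to exclude when creating jobs (default None)
slurm_job_timeout: the job timeout; if not supplied it defaults to 2 weeks (default None)
slurm_cache_path: cache path to use for slurm-related tasks (default None)
slurm_verbose_wait: whether to print the status of your slurm job (default False)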

DeepSparse backend

DeepSparse is an inference runtime for fast sparse model inference on CPUs. A backend for it is available within clip-retrieval: install it with pip install deepsparse-nightly[clip] and specify a clip_model with the prefix "nm:", such as "nm:neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds" or "nm:mgoin/CLIP-ViT-B-32-laion2b_s34b_b79k-ds".

Inference worker

If you wish to have more control over how inference is run, you can create and call workers directly using clip-retrieval inference.worker

Example usage:

clip-retrieval inference.worker \
--tasks="[0]" \
--input_dataset="input/folder/{000000..000100}.tar" \
--output_folder="example/path" \
--input_format="webdataset" \
--output_partition_count="1"

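Doing so invokes a single worker that can be instructed to focus on a specific subset of the input_dataset. That worker sequentially processes the tasks passed to it. Here, tasks is a list of the partition_id values that this worker is responsible for.

To manually compute the number of tasks, use the following formula: number_samples / wds_number_file_per_input_file.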
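
The formula above can be sketched in a few lines. Two things here are assumptions rather than documented behaviour: the result is rounded up so that a partial partition still gets a task, and partition ids are handed to workers round-robin. The helper names count_tasks and split_tasks are hypothetical.

```python
import math


def count_tasks(number_samples, wds_number_file_per_input_file=10000):
    """number_samples / wds_number_file_per_input_file, rounded up (rounding is an assumption)."""
    return math.ceil(number_samples / wds_number_file_per_input_file)


def split_tasks(num_tasks, num_workers):
    """Round-robin assignment of partition ids to workers (an illustrative choice)."""
    return [list(range(worker, num_tasks, num_workers)) for worker in range(num_workers)]


num_tasks = count_tasks(1_000_000)
assignments = split_tasks(num_tasks, 4)
print(num_tasks, assignments[0][:3])
```

Each inner list could then be passed as the --tasks value of one clip-retrieval inference.worker call.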

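The API is very similar to clip-retrieval inference, with some minor changes:

tasks: a list of integers giving the partition_id values this worker is responsible for computing (required)
input_dataset: path to the input dataset. A folder if input_format is files. A bash brace pattern such as "{000..150}.tar" (see https://pypi.org/project/braceexpand/) if webdataset (required)
output_folder: folder where the clip embeddings will be saved, as well as the metadata (required)
output_partition_count: number of output partitions (required)
input_format: files or webdataset (default files)
cache_path: cache path for webdataset (default None)
batch_size: number of items to run inference on at once (default 256)
num_prepro_workers: number of processes doing the preprocessing (default 8)
enable_text: enable text processing (default True)
enable_image: enable image processing (default True)
enable_metadata: enable metadata processing (default False)
wds_image_key: key to use for images in webdataset (default jpg)
wds_caption_key: key to use for captions in webdataset (default txt)
clip_model: CLIP model to load (default ViT-B/32). Specify it as "open_clip:ViT-B-32-quickgelu" to use open_clip or "hf_clip:patrickjohncyh/fashion-clip" to use a Hugging Face clip model.
mclip_model: MCLIP model to load (default sentence-transformers/clip-ViT-B-32-multilingual-v1)
use_mclip: if False, inference is performed with CLIP; otherwise with MCLIP (default False)
use_jit: use jit for the clip model (default True)
wandb_project: wandb project to use (default clip_retrieval)
enable_wandb: whether to use wandb (default False)
clip_cache_path: cache path for clip (default None)

Note: the worker does not accept the following arguments
write_batch_size: write batch size (default 10**6)
distribution_strategy: choose how to distribute the job, see the distribution section for details (default sequential)
wds_number_file_per_input_file: estimate of the number of samples per tar when using wds and not specifying output_partition_count (default 10000)
any of the SLURM arguments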

Loading/writing files on hdfs

To load a webdataset from an hdfs folder, set --input_dataset "pipe:hdfs dfs -cat path_on_hdfs" in your request, without the "hdfs://" prefix.
To write the output to hdfs, set --output_hdfs_folder to the path on hdfs, prefixed by "hdfs://"

Example of an hdfs query using the webdataset format:
clip_inference --input_dataset "pipe:hdfs dfs -cat /myfolder/webdataset/{00000..00010}.tar" --output_folder "hdfs://myfolder/embeddings" --input_format webdataset

Loading/writing files on s3

clip_inference --input_dataset "pipe:aws s3 cp --quiet s3://myfolder/webdataset/{00000..00010}.tar -" --output_folder "s3://myfolder/embeddings" --input_format webdataset
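
To see which shards a pattern such as {00000..00010} covers, the expansion can be sketched with the standard library. This is a simplified stand-in for the braceexpand package mentioned earlier, not its implementation: it handles a single zero-padded numeric range only, and the function name expand_numeric_range is hypothetical.

```python
import re

BRACE_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_range(pattern):
    """Expand the first {start..end} range in pattern, keeping zero padding."""
    match = BRACE_RANGE.search(pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)
    return [
        pattern[: match.start()] + str(index).zfill(width) + pattern[match.end():]
        for index in range(int(start), int(end) + 1)
    ]


shards = expand_numeric_range("s3://myfolder/webdataset/{00000..00010}.tar")
print(len(shards), shards[0], shards[-1])
```

Note that the range is inclusive at both ends, so the pattern above names 11 tar files.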

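Distributed inference

To run this on multiple nodes (and multiple GPUs), see the tutorial at docs/distributed_clip_inference.md

clip index

clip index takes the output of clip inference as input and builds an index from it using autofaiss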

clip-retrieval index --embeddings_folder embeddings_folder --index_folder index_folder

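--max_index_memory_usage "16G" configures the amount of memory the index will consume. More memory gives better knn recall (default 4G).
--current_memory_available 24G controls how much memory is used during the creation process (default 16G).
--image_subfolder "img_emb" specifies a subfolder for the image embeddings, which is appended to the --embeddings_folder option (default img_emb).
--text_subfolder "text_emb" specifies a subfolder for the text embeddings, which is appended to the --embeddings_folder option (default text_emb).
--copy_metadata True chooses whether to copy the metadata at the end of the process (default True).
--nb_cores 8 controls the number of threads (default None, which uses all cores).

The output is a folder containing:

image.index containing the faiss index for images
text.index containing the faiss index for texts
a metadata folder containing the parquet metadata

Thanks to autofaiss and faiss, this scales to hundreds of millions of samples in a few hours.

You may want to choose carefully how much memory your index uses in order to maximize the knn recall. The autofaiss index selection colab can help, along with the autofaiss score_index command, to check the recall of your index. In general, indices using more memory get a better recall and hence are closer to a naive (slow) knn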
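
For readers unfamiliar with the term, knn recall compares what an approximate index returns with what an exact (naive) search would return. The sketch below uses one common definition, recall at k; it is an illustration with made-up ids and is not claimed to be the exact metric that autofaiss score_index reports.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbours that the approximate index also returned."""
    exact_top = set(exact_ids[:k])
    found = exact_top.intersection(approx_ids[:k])
    return len(found) / len(exact_top)


# Made-up neighbour ids for a single query.
exact = [7, 3, 9, 1, 4]
approx = [7, 9, 2, 3, 8]
score = recall_at_k(approx, exact, k=4)
print(score)
```

A recall of 1.0 would mean the index returned exactly the neighbours a naive knn would have found.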

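clip filter

Once the embeddings are computed, you may want to filter out the data by a specific query. For that you can run clip-retrieval filter --query "cat" --output_folder "cat/" --indice_folder "indice_folder". It will copy the 100 best images for this query into the output folder. Using --num_results or --threshold may help refine the filter

Thanks to the fast knn index, this can run in real time (<10ms) for large K values (100000), and in minutes for very large K values.

This script works for small datasets. For larger ones, please check [notebook/simple_filter.ipynb].

clip back

Clip back is a simple knn service backend. If both hdf5 and faiss memory mapping are used, it uses only the memory used by clip, which is 4GB.

Run (output_folder is the output of clip index)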

echo '{"example_index": "output_folder"}' > indices_paths.json
clip-retrieval back --port 1234 --indices-paths indices_paths.json
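
The same file can be produced from Python, which is convenient when the index list is generated by a script. This is a minimal sketch: it writes into a temporary directory so it does not overwrite a real indices_paths.json, and the variable names are illustrative.

```python
import json
import os
import tempfile

# Maps an index name to the folder produced by clip index.
indices_paths = {"example_index": "output_folder"}

work_dir = tempfile.mkdtemp()  # scratch directory so the sketch does not clobber real files
config_path = os.path.join(work_dir, "indices_paths.json")
with open(config_path, "w") as f:
    json.dump(indices_paths, f, indent=2)

# Read the file back to confirm it is valid JSON.
with open(config_path) as f:
    loaded = json.load(f)
print(loaded)
```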

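Options:

--use_jit True uses jit for the clip model
--clip_model "ViT-B/32" chooses the clip model to use. Prefix it with "open_clip:" to use an open_clip model.
--enable_mclip_option True loads the mclip model, making it possible to search in any language.
--columns_to_return='["url", "image_path", "caption", "NSFW"]' specifies which columns should be fetched from the metadata and returned by the backend. Specifying fewer columns is useful with hdf5 caching to speed up the queries.
--enable_faiss_memory_mapping=True uses an index with memory mapping. This reduces the memory usage to zero.
--enable_hdf5 True enables hdf5 caching for the metadata. HDF5 caching makes it possible to use the metadata with almost no memory usage.
--use_arrow True uses arrow instead of hdf5. It should be used along with clip_back_prepro for very large datasets (billions)
--reorder_metadata_by_ivf_index True takes advantage of the data locality property of the results of a knn ivf index: it orders the metadata collection in the order of the IVF clusters. This makes metadata retrieval much faster, as reads touch a few mostly sequential parts of the metadata instead of many non-sequential parts. In practice, that means being able to retrieve 1M items in 1s, whereas only 1000 items can be retrieved in 1s without this method. The metadata is ordered using the first image index.
--provide_safety_model True automatically downloads and loads a safety model. You need to pip install the optional autokeras dependency for this to work.
--provide_violence_detector True loads a violence detector (see the paper)
--provide_aesthetic_embeddings True loads the aesthetic embeddings and lets users move the query towards a nicer point of the clip space

These options can also be provided in the config file to have different options for each index. Example: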

{
    "laion5B": {
        "indice_folder": "/mnt/laion5B/prepared_data",
        "provide_safety_model": true,
        "enable_faiss_memory_mapping": true,
        "use_arrow": true,
        "enable_hdf5": false,
        "reorder_metadata_by_ivf_index": false,
        "columns_to_return": ["url", "caption"],
        "clip_model": "ViT-L/14",
        "enable_mclip_option": false
    },
    "laion_400m": {
        "indice_folder": "/mnt/laion400M/index100",
        "provide_safety_model": true,
        "enable_faiss_memory_mapping": true,
        "enable_hdf5": true,
        "use_arrow": false,
        "reorder_metadata_by_ivf_index": true,
        "enable_mclip_option": true,
        "clip_model": "ViT-B/32"
    }
}

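Using hdf5 or arrow caching is a good idea if:

you do not have enough RAM to load the metadata in memory
your disk is fast (i.e. you have an SSD)

At this point you have a simple flask server running on port 1234 that can answer these queries:

/indices-list -> returns a list of indices
/knn-service, which takes as input: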

{
    "text": "a text query",
    "image": "a base64 image",
    "image_url": "http://some-url.com/a.jpg",
    "modality": "image", // image or text index to use
    "num_images": 4, // number of output images
    "indice_name": "example_index",
    "num_result_ids": 4 // optional, if specified fetch this number of results in total but only num_images with metadata
}

text, image and image_url are mutually exclusive. The endpoint returns:

[
    {
        "image": "base 64 of an image",
        "text": "some result text",
        "id": 543
    },
    {
        "image": "base 64 of an image",
        "text": "some result text",
        "id": 782
    }
]

Each object may also contain a url field if the metadata provides it.
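
The request format above can be assembled with a small helper that enforces the mutual exclusivity of text, image and image_url. This is a sketch only: build_knn_request is a hypothetical name, and it assumes the server accepts a body in which the two unused query keys are simply omitted.

```python
import json


def build_knn_request(indice_name, text=None, image=None, image_url=None,
                      modality="image", num_images=4, num_result_ids=None):
    """Build a /knn-service JSON body; exactly one of text, image, image_url must be set."""
    queries = {"text": text, "image": image, "image_url": image_url}
    provided = {key: value for key, value in queries.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("provide exactly one of text, image or image_url")
    # Assumption: unused query keys are left out rather than sent as null.
    payload = dict(provided, modality=modality, num_images=num_images, indice_name=indice_name)
    if num_result_ids is not None:
        payload["num_result_ids"] = num_result_ids
    return payload


body = json.dumps(build_knn_request("example_index", text="a text query"))
print(body)
```

The resulting string could be sent as the body of a POST to the /knn-service endpoint of a running clip back instance, for example http://localhost:1234/knn-service.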

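The id is the position of the item in the index. It can be used to query the metadata through the /metadata endpoint:

{
    "indice_name": "example_index",
    "ids": [543, 782]
}

which returns:

{
    "image": "base 64 of an image",
    "text": "some result text"
    // any other key available in the metadata and specified in columns_to_return cli option
}

The num_result_ids argument of /knn-service and the /metadata endpoint can be used together to run large knn queries and then fetch the metadata only when needed. This makes sense because knn search can be very efficient: the strong locality of reference of knn IVF indices makes a knn with a large K fast, whereas the current on-disk implementation of the metadata (hdf5) does not have that property and hence cannot retrieve a large number of random items quickly. In particular, this can be used to implement infinite scroll in a front end.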
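
One way the infinite-scroll idea might look on the client side is to request many ids once and then build one /metadata request per page as the user scrolls. The sketch below only prepares the request bodies in the format shown above; the helper name metadata_requests, the page size and the sample ids are hypothetical.

```python
def metadata_requests(indice_name, result_ids, page_size):
    """Yield one /metadata request body per page of ids (hypothetical helper)."""
    for start in range(0, len(result_ids), page_size):
        yield {"indice_name": indice_name, "ids": result_ids[start:start + page_size]}


# Ids as they might come back from a /knn-service call made with a large num_result_ids.
all_ids = [543, 782, 12, 901, 33, 4, 250]
pages = list(metadata_requests("example_index", all_ids, page_size=3))
print(pages[0])
```

Because the generator is lazy, a front end could pull the next page only when the user reaches the bottom of the current results.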

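By default the backend also exposes a front end. That front end hits this backend by default; however, you may need to specify whether this happens over http or https, in which case use the option --default_backend to specify the backend url. --url_column specifies the name of the url column for the front

clip back: benchmark and monitoring

This backend has a 50ms latency when using memory-mapped indices and metadata. Throughput is about 20 queries/s. For high throughput, a grpc server is required, as well as a GPU for fast clip inference. Turning off the memory mapping options can also speed up requests, at the cost of high RAM usage.

This backend also exposes a prometheus /metrics endpoint as well as a human-readable summary at /metrics-summary. This can (optionally) be used to set up a grafana dashboard for monitoring:

[Image: grafana dashboard]

On this dashboard it can be seen that the slowest part of any call is fetching the image by its url in the case of image url search, which takes up to 300ms. For text queries or image queries, the latency is about 50ms. Here is an example of the output of the metrics summary: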

 Among 20.0 calls to the knn end point with an average latency of 0.1889s per request, the step costs are (in order):
                        name                               description  calls  average proportion
0              download_time             Time spent downloading an url      6  0.3215s     170.2%
1          metadata_get_time            Time spent retrieving metadata     20  0.0415s      21.9%
2             knn_index_time       Time spent doing a knn on the index     20  0.0267s      14.1%
3  image_clip_inference_time   Time spent doing a image clip inference      6  0.0206s      10.9%
4   text_clip_inference_time    Time spent doing a text clip inference     14  0.0186s       9.8%
5          image_prepro_time  Time spent doing the image preprocessing      6  0.0097s       5.2%
6           text_prepro_time   Time spent doing the text preprocessing     14  0.0020s       1.0%
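
The proportion column appears to be each step's average divided by the overall average latency (0.3215 / 0.1889 is about 170.2%), which is why download_time exceeds 100%: it only runs on 6 of the 20 calls. This reading is inferred from the numbers rather than documented. The sketch below amortises each step over all 20 requests to show that the steps then add up to roughly the reported average latency.

```python
# (step name, number of calls, average seconds per call), copied from the summary above.
steps = [
    ("download_time", 6, 0.3215),
    ("metadata_get_time", 20, 0.0415),
    ("knn_index_time", 20, 0.0267),
    ("image_clip_inference_time", 6, 0.0206),
    ("text_clip_inference_time", 14, 0.0186),
    ("image_prepro_time", 6, 0.0097),
    ("text_prepro_time", 14, 0.0020),
]
total_calls = 20
average_latency = 0.1889

# Cost of each step amortised over all requests, not only those that ran it.
per_request = {name: calls * seconds / total_calls for name, calls, seconds in steps}
accounted = sum(per_request.values())

for name, cost in per_request.items():
    print(f"{name:28s} {cost:.4f}s  {cost / average_latency:6.1%}")
print(f"accounted for: {accounted:.4f}s of {average_latency}s")
```

Even when amortised, downloading the query image remains the largest contributor, which matches the observation made about the dashboard.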

clip front

Clip front is a simple UI that connects to clip back and displays the results. You can use it at clip-retrieval ui

Or you can run it yourself with:

 npm install -g clip-retrieval-front
clip-retrieval-front 3005

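You can also run it with clip-retrieval front from the Python package.

Development

For development, go to front and run npm install, then npm start.

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Set up a virtualenv: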

 python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code

python -m pytest -x -s -v tests -k "test_runner" runs a specific test

If you want to use the front through the Python backend or frontend, run

 cd front
npm install
npm run build
cd ..
pip install -e .

Citation

@misc{beaumont-2022-clip-retrieval,
  author = {Romain Beaumont},
  title = {Clip Retrieval: Easily compute clip embeddings and build a clip retrieval system with them},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/rom1504/clip-retrieval}}
}
