Controlled-Text-Generation-Image-Datasets
Controllable text-to-image generation dataset
2D datasets
1. Pre-training datasets
Noah-Wukong Dataset
- Address: https://wukong-dataset.github.io/wukong-dataset/download.html
- Introduction: The Noah-Wukong dataset is a large-scale multi-modal Chinese dataset containing 100 million image-text pairs.
Zero (a large-scale Chinese cross-modal benchmark)
- Address: https://zero.so.com/download.html
- Introduction: Zero is a large-scale Chinese cross-modal benchmark consisting of two pre-training datasets (the 23M Zero-Corpus and a 2.3M sub-corpus) and five downstream datasets.
- Pre-training datasets
  - Zero-Corpus: 23 million image-text pairs collected from a search engine, containing images and their corresponding text descriptions, filtered from 5 billion image-text pairs based on user click-through rate.
  - Zero-Corpus-Sub: a sub-dataset of Zero-Corpus containing 10% of its image-text pairs (2.3 million). Training a VLP model on the full Zero-Corpus may require extensive GPU resources, so this sub-dataset is provided for research purposes.
- Downstream datasets
  - ICM: designed for the image-text matching task. It contains 400,000 image-text pairs, including 200,000 positive and 200,000 negative examples.
  - IQM: also used for the image-text matching task, but with search queries instead of detailed description text. Like ICM, IQM contains 200,000 positive and 200,000 negative examples.
  - ICR: 200,000 collected image-text pairs, covering both image-to-text and text-to-image retrieval tasks.
  - IQR: also proposed for image-text retrieval; 200,000 queries and their corresponding images are randomly selected as annotated image-query pairs, similar to IQM.
  - Flickr30k-CNA: professional English-Chinese linguists re-translated all of the Flickr30k data and carefully checked every sentence. Beijing Magic Data Technology Co., Ltd. contributed to the translation of this dataset.
Flickr 30k Dataset
- Address: https://shannon.cs.illinois.edu/DenotationGraph/data/index.html
- Introduction: The Flickr30k dataset consists of roughly 31,000 images collected from Flickr, each annotated with five crowd-sourced captions.
Visual Genome Dataset
- Address: http://visualgenome.org/
- Introduction: Visual Genome is a large-scale image semantic understanding dataset released by Fei-Fei Li's group in 2016, containing images together with region descriptions and question-answer data. The annotations are dense and semantically diverse. The dataset contains about 5M image-text (region description) pairs.
Conceptual Captions (CC) Dataset
- Address: https://ai.google.com/research/ConceptualCaptions/download
- Introduction: Conceptual Captions (CC) is a multi-modal dataset annotated without human labeling, consisting of image URLs and captions; the captions are filtered from the alt-text attributes of web pages. The CC dataset comes in two versions of different size: CC3M (approximately 3.3 million image-text pairs) and CC12M (approximately 12 million image-text pairs).
YFCC100M Dataset
- Address: http://projects.dfki.uni-kl.de/yfcc100m/
- Introduction: YFCC100M is an image and video database built from Yahoo Flickr and released in 2014. It consists of 100 million media items created between 2004 and 2014, including 99.2 million photos and 0.8 million videos. The YFCC100M dataset provides a text metadata document built from the database, in which each line is the metadata of one photo or video.
ALT200M Dataset
- Address: None
- Introduction: ALT200M is a large-scale image-text dataset built by a Microsoft team to study scaling trends in image captioning. It contains 200M image-text pairs, with the text descriptions filtered from the alt-text attributes of web pages. (Private dataset; no download link.)
LAION-400M Dataset
- Address: https://laion.ai/blog/laion-400-open-dataset/
- Introduction: LAION-400M collects text and images from web pages crawled by Common Crawl between 2014 and 2021, then uses CLIP to filter out image-text pairs whose image and text embedding similarity is below 0.3, ultimately retaining 400 million image-text pairs. However, LAION-400M contains a large number of NSFW images, which significantly affects text-to-image generation: the dataset has been used to generate pornographic images, with harmful effects. Larger and cleaner datasets therefore became a requirement.
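The CLIP-similarity filtering described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not LAION's actual pipeline: the checkpoint openai/clip-vit-base-patch32 and the keep_pair helper are choices made for the example, while LAION applied the same idea at web scale with its own tooling.

```python
# Minimal sketch of CLIP-similarity filtering, the idea behind LAION-400M/5B.
# Model name, threshold default, and the helper are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    """Return True if the cosine similarity of image and text embeddings is
    at or above the threshold; pairs below it would be discarded."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalize and take the dot product to get cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item() >= threshold

# Example: keep_pair(Image.open("cat.jpg"), "a photo of a cat")
```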
LAION-5B Dataset
- Address: https://laion.ai/blog/laion-5b/
- Introduction: LAION-5B is the largest known open-source multi-modal dataset. It obtains text and images through Common Crawl, and then uses CLIP to filter out image-text pairs whose image and text embedding similarity is below 0.28, ultimately retaining about 5 billion image-text pairs. The dataset contains 2.32 billion English captions, 2.26 billion captions in 100+ other languages, and 1.27 billion samples whose language could not be detected.
Wikipedia-based Image Text (WIT) Dataset
- Address: https://github.com/google-research-datasets/wit/blob/main/DATA.md
- Introduction: The WIT (Wikipedia-based Image Text) dataset is a large multi-modal, multilingual dataset with more than 37 million image-text sets covering more than 11 million unique images across more than 100 languages. WIT is provided as a set of 10 zipped tsv files with a total size of approximately 25GB; this constitutes the entire training set. To get started quickly, pick any one of the ~2.5GB files, which provides about 10% of the data and contains roughly 3.5M image-text examples. Validation and test sets (5 files each) are also included.
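Since the training data ships as zipped, tab-separated files, a minimal reading sketch with pandas looks like the following; the shard file name is an assumption based on the naming pattern on the download page, so adjust it to whichever file you actually downloaded.

```python
# Minimal sketch of reading one WIT training shard (gzipped tsv with a header
# row). The file name below is an assumption; use your downloaded shard.
import pandas as pd

shard = pd.read_csv(
    "wit_v1.train.all-00000-of-00010.tsv.gz",
    sep="\t",
    compression="gzip",
)
print(shard.shape)              # rows x columns in this shard
print(shard.columns.tolist())   # language, image URL, caption fields, etc.
print(shard.iloc[0])            # inspect the first image-text example
```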
TaiSu (a large-scale Chinese vision-language pre-training dataset)
- Address: https://github.com/ksOAn6g5/TaiSu
- Introduction: TaiSu is a large-scale, high-quality Chinese vision-language pre-training dataset containing 166 million image-text pairs.
COYO-700M: Large-scale image-text pair dataset
- Address: https://huggingface.co/datasets/kakaobrain/coyo-700m
- Introduction: COYO-700M is a large dataset containing 747M image-text pairs along with many other meta-attributes that improve usability for training various models. The dataset follows a strategy similar to previous vision-and-language datasets, collecting many informative alt-texts and their associated image pairs from HTML documents. COYO is expected to be used to train popular large-scale foundation models, complementing other similar datasets.
- Sample example
WIT: Image text dataset based on Wikipedia
- Address: https://github.com/google-research-datasets/wit
- Introduction: The Wikipedia-based Image to Text (WIT) dataset is a large multi-modal multi-lingual dataset. WIT consists of a curated set of 37.6 million entity-rich image text examples, containing 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pre-training dataset for multi-modal machine learning models.
- Paper: WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
- Sample example
DiffusionDB
- Address: https://huggingface.co/datasets/poloclub/diffusiondb
- Introduction: DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion from real user-specified prompts and hyperparameters. The unprecedented scale and diversity of this human-driven dataset offers exciting research opportunities for understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools that help users work with these models more easily. The 2 million images in DiffusionDB 2M are divided into 2,000 folders, each containing 1,000 images and a JSON file that links those 1,000 images to their prompts and hyperparameters; similarly, the 14 million images in DiffusionDB Large are divided into 14,000 folders (see the loading sketch after this entry).
- Paper: DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models
- Sample example
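Given the folder layout described above (each part folder holds 1,000 images plus a JSON metadata file), a minimal sketch for pairing images with their prompts might look like this; the local path and part number are assumptions for illustration, and the metadata keys are whatever the JSON actually provides.

```python
# Minimal sketch of pairing images with their prompts/hyperparameters in one
# DiffusionDB 2M part folder. Folder and file names follow the layout
# described above; the local download path is an assumption.
import json
from pathlib import Path
from PIL import Image

part_dir = Path("diffusiondb-2m/part-000001")  # assumed local download path
with open(part_dir / "part-000001.json", encoding="utf-8") as f:
    metadata = json.load(f)  # maps image file name -> prompt and hyperparameters

for image_name, info in list(metadata.items())[:3]:
    image = Image.open(part_dir / image_name)
    print(image_name, image.size, info)  # info contains the prompt and settings
```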
2. Text-to-image fine-tuning datasets
- DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation
- Address: https://github.com/google/dreambooth
- Introduction: This dataset includes 30 subjects across 15 different categories; 9 of them are live subjects (dogs and cats) and 21 are objects. The dataset contains a variable number of images per subject (4-6).
3. Controllable text-to-image generation datasets
- COCO-Stuff Dataset
- Address: https://github.com/nightrome/cocostuff
- Introduction: COCO-Stuff enhances all 164K images of the popular COCO [2] dataset with pixel-level content annotations. These annotations can be used for scene understanding tasks such as semantic segmentation, object detection, and image captioning.
- Sample example
- Command line download
# Get this repo
git clone https://github.com/nightrome/cocostuff.git
cd cocostuff
# Download everything
wget --directory-prefix=downloads http://images.cocodataset.org/zips/train2017.zip
wget --directory-prefix=downloads http://images.cocodataset.org/zips/val2017.zip
wget --directory-prefix=downloads http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip
# Unpack everything
mkdir -p dataset/images
mkdir -p dataset/annotations
unzip downloads/train2017.zip -d dataset/images/
unzip downloads/val2017.zip -d dataset/images/
unzip downloads/stuffthingmaps_trainval2017.zip -d dataset/annotations/
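After running the commands above, images and their pixel-level label maps can be paired by file name. The sketch below assumes the unzip layout produced by the script (dataset/images/train2017 and dataset/annotations/train2017) and uses an arbitrary example image id.

```python
# Minimal sketch of pairing a COCO image with its COCO-Stuff label map after
# running the download/unzip commands above (paths assume that layout).
from pathlib import Path
import numpy as np
from PIL import Image

image_id = "000000000009"  # an arbitrary COCO train2017 image id
image = Image.open(Path("dataset/images/train2017") / f"{image_id}.jpg")
labels = np.array(Image.open(Path("dataset/annotations/train2017") / f"{image_id}.png"))

print(image.size)         # (width, height) of the RGB image
print(labels.shape)       # per-pixel class ids (things + stuff)
print(np.unique(labels))  # class ids present in this image
```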
- Pick-a-Pic: An open dataset of user preferences for text-to-image generation
- Address: https://huggingface.co/datasets/yuvalkirstain/pickapic_v1
- Introduction: The Pick-a-Pic dataset is collected via the Pick-a-Pic web application and contains over 500,000 examples of human preferences for model-generated images. The dataset with URLs instead of actual images (which makes it much smaller in size) can be found here.
- Command line download (via the hf-mirror accelerator, useful for users in mainland China)
1. Download hfd
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
2. Set the environment variable
export HF_ENDPOINT=https://hf-mirror.com
3.1 Download a model (example)
./hfd.sh gpt2 --tool aria2c -x 4
3.2 Download the dataset
./hfd.sh yuvalkirstain/pickapic_v1 --dataset --tool aria2c -x 4
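If you prefer the Hugging Face `datasets` library over a full download, a minimal streaming sketch is shown below; it assumes only that the dataset exposes a train split, and inspects the first preference example rather than hard-coding column names.

```python
# Minimal sketch of loading Pick-a-Pic with the Hugging Face `datasets`
# library; streaming avoids downloading the full image dataset up front.
from datasets import load_dataset

ds = load_dataset("yuvalkirstain/pickapic_v1", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())  # prompt, candidate images, and the preference label
```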
DeepFashion-MultiModal
- Address: https://drive.google.com/drive/folders/1An2c_ZCkeGmhJg0zUjtZF46vyJgQwIr2
- Introduction: This dataset is a large-scale, high-quality human image dataset with rich multi-modal annotations. It has the following properties:
  - It contains 44,096 high-resolution human images, including 12,701 full-body images.
  - For each full-body image, 24 categories of human-parsing labels are manually annotated.
  - For each full-body image, keypoints are manually annotated.
  - Each image is manually annotated with clothing shape and texture attributes.
  - A text description is provided for each image.
  DeepFashion-MultiModal can be applied to text-driven human image generation, text-guided human image manipulation, skeleton-guided human image generation, human pose estimation, human image captioning, multi-modal learning of human images, human attribute recognition, human parsing prediction, etc. The dataset is introduced in Text2Human.
- Paper: Text2Human: Text-Driven Controllable Human Image Generation
DeepFashion
- Address: https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
- Introduction: This dataset is a large-scale clothing database with several attractive properties. First, DeepFashion contains over 800,000 diverse fashion images, ranging from posed shop images to unconstrained consumer photos, constituting the largest visual fashion analysis database. Second, DeepFashion is annotated with rich clothing item information: each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding boxes, and clothing landmarks. Third, DeepFashion contains more than 300,000 cross-pose/cross-domain image pairs. Four benchmarks were developed using the DeepFashion database, including attribute prediction, consumer-to-shop clothing retrieval, in-shop clothing retrieval, and landmark detection. The data and annotations of these benchmarks can also be used as training and test sets for computer vision tasks such as clothing detection, clothing recognition, and image retrieval.
- Paper: ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet
COCO (COCO Captions) Dataset
- Address: https://cocodataset.org/#download
- Introduction: COCO Captions is a captioning dataset targeting scene understanding. The images are captured from everyday scenes, and the image descriptions are written manually. The dataset contains about 330K captioned images.
- Paper: Text to Image Generation Using Generative Adversarial Networks (GANs)
- Sample example
CUB-200-2011 Dataset
- Address: https://www.vision.caltech.edu/datasets/cub_200_2011/
- Related data: https://www.vision.caltech.edu/datasets/
- Introduction: This is a fine-grained dataset released by the California Institute of Technology (first proposed as CUB-200 in 2010, then extended as CUB-200-2011), and it remains the benchmark image dataset for fine-grained classification and recognition research. The dataset contains 11,788 bird images covering 200 bird subcategories; the training set has 5,994 images and the test set has 5,794 images. Each image provides the class label, the bounding box of the bird, keypoint (part) annotations for the bird, and attribute information (see the loading sketch after this entry).
- Paper: Text to Image Generation Using Generative Adversarial Networks (GANs)
- Sample example
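The class-label, bounding-box, and part annotations mentioned above are distributed as plain text files. A minimal reading sketch, assuming the standard layout of the extracted CUB_200_2011 archive (images.txt and bounding_boxes.txt, both space-separated and indexed by image id):

```python
# Minimal sketch of reading CUB-200-2011 annotation files, assuming the
# standard layout of the extracted archive (an assumed local path).
from pathlib import Path

root = Path("CUB_200_2011")  # assumed extraction directory

# image id -> relative image path
image_paths = dict(line.split() for line in (root / "images.txt").read_text().splitlines())

# image id -> (x, y, width, height) bounding box of the bird
bounding_boxes = {}
for line in (root / "bounding_boxes.txt").read_text().splitlines():
    image_id, x, y, w, h = line.split()
    bounding_boxes[image_id] = tuple(float(v) for v in (x, y, w, h))

print(image_paths["1"], bounding_boxes["1"])
```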
102 Category Flower Dataset
- Address: https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
- Introduction: We created a 102-category dataset consisting of 102 flower categories. The flowers were chosen to be flowers commonly occurring in the United Kingdom. Each category consists of between 40 and 258 images.
- Sample example
- Reference: https://blog.csdn.net/air__heaven/article/details/136141343
- After downloading the image dataset, you also need to download the corresponding text dataset, which is likewise hosted on Google Drive: https://drive.google.com/file/d/1G4QRcRZ_s57giew6wgnxemwWRDb-3h5P/view
Flickr8k_dataset
- Address: see the download links below (GitHub mirrors)
- Introduction: A new benchmark collection for sentence-based image description and search, consisting of 8,000 images, each paired with five distinct captions that provide clear descriptions of the salient entities and events. The images were selected from six different Flickr groups and tend not to contain any well-known people or places, but were hand-selected to depict a variety of scenes and situations.
- Paper: Caption to Image generation using Deep Residual Generative Adversarial Networks [DR-GAN]
- Flickr8k_Dataset.zip: https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
- Flickr8k_text.zip: https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip
- Flickr30k_dataset: the Flickr30k dataset for image captioning
- Address: https://www.kaggle.com/datasets/adityajn105/flickr30k
- Introduction: A new benchmark collection for sentence-based image description and search, consisting of 30,000 images, each paired with five distinct captions that provide clear descriptions of the salient entities and events. The images were selected from six different Flickr groups and generally do not contain any well-known people or places, but were hand-selected to depict a variety of scenes and situations.
Nouns Dataset (auto-captioned)
- Address: https://huggingface.co/datasets/m1guelpf/nouns
- Introduction: A dataset for training Nouns text-to-image models, with captions generated automatically from each Noun's attributes, colors, and items. Each row of the dataset contains image and text keys: image is a PIL JPEG of varying size, and text is the accompanying text caption. Only a train split is available (see the loading sketch after this entry).
- Sample example
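A minimal sketch of loading the dataset with the Hugging Face `datasets` library and inspecting the image/text keys described above (the only assumption is a local environment with `datasets` installed):

```python
# Minimal sketch of loading the Nouns dataset from the Hugging Face Hub and
# inspecting its `image` / `text` columns (only a train split is provided).
from datasets import load_dataset

nouns = load_dataset("m1guelpf/nouns", split="train")
row = nouns[0]
print(row["text"])        # the programmatically generated caption
print(row["image"].size)  # a PIL image of varying size
```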
OxfordTVG-HIC Dataset: a large-scale humorous image-text dataset
- Address: https://github.com/runjiali-rl/Oxford_HIC?tab=readme-ov-file
- Introduction: This is a large dataset for humor generation and understanding. Humor is an abstract, subjective, context-dependent cognitive construct involving multiple cognitive factors, which makes its generation and interpretation a challenging task. Oxford HIC provides approximately 2.9 million image-text pairs with humor scores to train a general humor captioning model. In contrast to existing captioning datasets, Oxford HIC has a wide range of emotional and semantic diversity, and its out-of-context examples are particularly conducive to generating humor.
- Sample example
Multi-Modal-CelebA-HQ large-scale face image text dataset
- Address: https://github.com/IIGROUP/MM-CelebA-HQ-Dataset
- Introduction: Multi-Modal-CelebA-HQ (MM-CelebA-HQ) is a large-scale face image dataset, which has 30k high-resolution face images, selected from the CelebA dataset according to CelebA-HQ. Each image in the dataset is accompanied by a semantic mask, a sketch, a descriptive text, and an image with a transparent background. Multi-Modal-CelebA-HQ can be used to train and evaluate algorithms for a range of tasks, including text-to-image generation, text-guided image manipulation, sketch-to-image generation, image captioning, and visual question answering. This dataset is introduced and used in TediGAN.
- Sample example
3D datasets
1. Pre-training datasets
- Multimodal3DIdent: A multimodal dataset of image/text pairs generated from controllable ground truth factors
- Address: https://zenodo.org/records/7678231
- Introduction: The official code for generating the Multimodal3DIdent dataset, introduced in the paper "Identifiability Results for Multimodal Contrastive Learning" published at ICLR 2023. The dataset provides an identifiability benchmark containing image/text pairs generated from controllable ground-truth factors, some of which are shared between the image and text modalities.
- Paper: Identifiability Results for Multimodal Contrastive Learning
2. Text-to-image fine-tuning datasets
3. Controllable text-to-image generation datasets