Able to perform tasks across a range of combined modalities like text, image, audio and video, multimodal AI systems are fast becoming more versatile and powerful. However, building useful multimodal AI models requires good multimodal datasets: the necessary fuel for training these versatile systems, allowing them to expand their understanding of the world beyond a single dimension or modality.
For instance, image captioning requires training data that pairs images with relevant, descriptive text. After training on such pairs, the model can be deployed, using natural language processing and computer vision techniques to recognize the contents of a new image and generate an associated caption.
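As a rough illustration of the deployment step, a pretrained captioning model can be run in a few lines. This is a minimal sketch using the Hugging Face transformers image-to-text pipeline; the checkpoint name is just one publicly available example, not a recommendation tied to any dataset in this post, and the image URL is a placeholder.

```python
# A minimal sketch of image captioning inference with a pretrained model.
# The checkpoint is one public example from the Hugging Face Hub, and the
# image URL is a placeholder -- swap in your own image path or URL.
from transformers import pipeline

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",  # example checkpoint
)

# The pipeline accepts a local file path or an image URL.
result = captioner("https://example.com/photo.jpg")
print(result[0]["generated_text"])
```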
The same idea applies to a wide range of tasks, like video analysis, audio-visual speech recognition, cross-modal retrieval, medical diagnostics and more. This is because multimodal datasets empower AI models to learn more complex semantic relationships between objects and their context, thus boosting model performance and accuracy.
With so many multimodal datasets out in the wild, it can be difficult to know where to start. In this post, we’ll cover the most notable multimodal datasets that are currently available, and briefly describe what they include and what they can potentially be used for.
1. Flickr30K Entities
An extension of the popular image-captioning Flickr30K dataset, this dataset contains more than 31,000 images sourced from Flickr, each paired with five crowd-sourced captions. Flickr30K Entities augments the original 158,000 captions with 244,000 coreference chains and adds bounding-box annotations for all entities (i.e., people or objects) referred to in the captions.
One important advantage of the Flickr30K Entities dataset is its more in-depth annotations for image-text tasks, which help models not only describe the contents of an image but also locate the entities within it.
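To make the annotation structure concrete, here is a small sketch of pulling entity mentions out of a Flickr30K Entities caption. It assumes the bracketed sentence markup described in the dataset's release, where each phrase is tagged as [/EN#chain_id/type phrase]; treat the exact syntax as an assumption and verify it against the official annotation files.

```python
# A hedged sketch: extracting entity mentions from a Flickr30K Entities
# caption string. The bracketed [/EN#chain_id/type phrase] syntax is assumed
# from the dataset's published annotation format -- verify against the
# official files before relying on it.
import re

ENTITY_PATTERN = re.compile(r"\[/EN#(\d+)/(\S+) ([^\]]+)\]")

def extract_entities(annotated_caption):
    """Return (coreference_chain_id, entity_type, phrase) triples."""
    return [
        (int(chain_id), entity_type, phrase)
        for chain_id, entity_type, phrase in ENTITY_PATTERN.findall(annotated_caption)
    ]

caption = "[/EN#1/people A man] throws [/EN#2/other a ball] to [/EN#3/animals his dog]."
print(extract_entities(caption))
# [(1, 'people', 'A man'), (2, 'other', 'a ball'), (3, 'animals', 'his dog')]
```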
Applications: Real-time image captioning; image search.
License: Use of the images must abide by Flickr’s Terms of Use; the dataset can be used by researchers and educators for non-commercial purposes.

Examples from the Flickr30K Entities dataset.
2. InternVid
Developed for video-related tasks like video captioning, video retrieval and video generation, InternVid is a relatively new video-text dataset that includes 7 million videos covering a wide range of objects and activities and totaling almost 760,000 hours of footage. These videos are broken down into an impressive 234 million clips, paired with richly descriptive captions totaling over 4.1 billion words.
One of the biggest benefits of this dataset is its breadth, with 16 distinct types of scenes and over 6,000 distinct actions covered.
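Video-text datasets at this scale are typically distributed as clip-level metadata (a video identifier, clip timestamps and a caption) rather than raw video. The sketch below shows how such records might be read from a JSON Lines file; the file name and field names are illustrative assumptions, not InternVid's exact schema, so check the release for the real layout.

```python
# A minimal sketch of reading clip-level metadata for a video-text dataset.
# The JSON Lines layout and field names (video_id, start, end, caption) are
# illustrative assumptions, not InternVid's exact schema -- check the release.
import json

def iter_clip_captions(jsonl_path):
    """Yield (clip, caption) pairs from a metadata file with one JSON object per line."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            clip = {
                "video_id": record["video_id"],  # assumed field
                "start": record["start"],        # assumed field (clip start time)
                "end": record["end"],            # assumed field (clip end time)
            }
            yield clip, record["caption"]        # assumed field

for clip, caption in iter_clip_captions("internvid_metadata.jsonl"):  # placeholder path
    print(clip["video_id"], caption)
    break
```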
Applications: Video chatbots; personalized e-learning.
License: Apache License 2.0.
3. MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews)
This intriguing text-audio-video dataset is designed to capture sentiment in user-generated video reviews, with a focus on the emotional engagement that occurs during product reviews. The MuSe-CaR dataset consists of over 40 hours of extensively annotated, high-quality, user-generated video recordings, which provide insight into emotional nuances that can show up in faces, voices, gestures or body language.
The aim of the dataset is to advance multimodal sentiment analysis by providing an in-depth resource for understanding complex human emotions across modalities.
Applications: Mental health chatbots or assistants; automated sentiment analysis systems for evaluating customer satisfaction with products.
License: Non-commercial under an End User Licence Agreement (EULA).

Examples from MuSe-CaR dataset.
4. MovieQA
MovieQA is a multimodal text-and-video dataset designed for evaluating story comprehension through video question-answering (VideoQA) tasks. It consists of nearly 15,000 multiple-choice questions paired with subtitled film clips taken from over 400 movies of high semantic diversity.
Answering the questions correctly requires the model to have a sufficient understanding of the visual and textual context contained within the video clip, such as sequential events, human interactions, intent, and the text used to describe them. The dataset is unique in that it contains multiple sources of information: video clips, plots, subtitles, scripts and DVS (Descriptive Video Service) annotations.
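To show what evaluation on a multiple-choice benchmark like this looks like in practice, here is a small, generic scoring sketch: a model (any callable that rates how well a candidate answer fits a question) picks the highest-scoring choice, and accuracy is the fraction of questions answered correctly. This is a general illustration, not MovieQA's official evaluation code, and the field names are assumptions.

```python
# A generic sketch of multiple-choice QA scoring, in the spirit of MovieQA-style
# evaluation (not the benchmark's official protocol). `score_fn` stands in for
# any model that rates how well a candidate answer fits a question.

def evaluate(questions, score_fn):
    """questions: iterable of dicts with 'question', 'choices' (list of str)
    and 'answer_idx' (index of the correct choice) -- assumed field names."""
    correct = 0
    total = 0
    for q in questions:
        scores = [score_fn(q["question"], choice) for choice in q["choices"]]
        predicted = scores.index(max(scores))
        correct += int(predicted == q["answer_idx"])
        total += 1
    return correct / total if total else 0.0

# Toy usage with a deliberately naive scorer that prefers longer answers,
# just to show the call shape.
sample = [{"question": "Why does the hero leave town?",
           "choices": ["For work", "To escape his past and start over", "No reason"],
           "answer_idx": 1}]
print(evaluate(sample, lambda question, choice: len(choice)))  # 1.0
```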
Applications: Automated film analysis, summary and categorization.
License: Not specified.

Examples from MovieQA dataset.
5. MINT-1T
MINT-1T is a massive, open source dataset from Salesforce AI Research that contains one trillion text tokens and 3.4 billion images — nearly ten times larger than the next largest open source dataset. This is an incredibly diverse, multimodal, interleaved dataset that integrates text and images in a way that imitates documents in the real world, like web pages and scientific papers — including PDFs and ArXiv papers.
The sheer scale of this dataset means that models can be more broadly versed in the existing online corpus of scientific and technological research. According to the research team, the goal was to create a dataset that features “free-form interleaved sequences of images and text,” suitable for training large multimodal AI models.
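To make “free-form interleaved sequences of images and text” concrete, the sketch below walks a document represented as an ordered mix of text and image segments and flattens it into a single training sequence with image placeholders. The segment representation is a hypothetical illustration of the interleaving idea, not MINT-1T's actual on-disk schema.

```python
# A hypothetical illustration of an interleaved image-text document: an ordered
# list of segments that a training pipeline flattens into one sequence, with a
# placeholder token marking where each image goes. This is not MINT-1T's actual
# storage format -- just a sketch of the "interleaved" idea.

IMAGE_TOKEN = "<image>"

def flatten_document(segments):
    """segments: list of {'type': 'text', 'text': str} or {'type': 'image', 'ref': str}.
    Returns (sequence_text, ordered_image_refs)."""
    parts = []
    image_refs = []
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["text"])
        else:  # image segment
            parts.append(IMAGE_TOKEN)
            image_refs.append(seg["ref"])
    return " ".join(parts), image_refs

doc = [
    {"type": "text", "text": "Figure 1 shows the measured spectrum."},
    {"type": "image", "ref": "fig1.png"},
    {"type": "text", "text": "The peak at 520 nm matches the model's prediction."},
]
print(flatten_document(doc))
```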
Applications: Developing more context-aware AI assistants; as a massive open dataset, MINT-1T also levels the playing field for researchers and businesses with smaller budgets.
License: CC-BY-4.0.
Conclusion
New datasets are continuously emerging, so here are some other recent multimodal datasets that are also worth a mention:
- BigDocs: This open and “permissively licensed” dataset is designed to train models for extracting information from documents, using enhanced OCR, layout and diagram analysis, and table detection.
- Newsmediabias-plus (NMB+): Combining textual and visual data from news articles, this dataset from the Vector Institute is designed for the detection and analysis of media bias and disinformation.
These are but a handful of the vast number of multimodal datasets that are available — not to mention multilingual datasets that are also coming to the fore. With so many options out there, it’s relatively easy to find the right datasets to train your AI model. For more information, check out our posts on tools for building multimodal AI applications, plus some open source and small-scale multimodal AI models.