Able to perform tasks across a range of combined modalities like text, image, audio and video, multimodal AI systems are fast becoming more versatile and powerful. However, building useful multimodal AI models requires good multimodal datasets: the necessary fuel for training these versatile systems, allowing them to expand their understanding of the world beyond a single dimension or modality.
For instance, image captioning requires training data that pairs images with relevant, descriptive text. After training on such pairs, the model can be deployed, using natural language processing and computer vision techniques to recognize the contents of a new image and generate an associated caption.
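As a rough illustration of the deployment step, a pretrained captioning model can be run in a few lines. This is a minimal sketch using the Hugging Face transformers image-to-text pipeline; the checkpoint name is just one publicly available example, not a recommendation tied to any dataset in this post, and the image URL is a placeholder.

```python
# A minimal sketch of image captioning inference with a pretrained model.
# The checkpoint is one public example from the Hugging Face Hub, and the
# image URL is a placeholder -- swap in your own image path or URL.
from transformers import pipeline

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",  # example checkpoint
)

# The pipeline accepts a local file path or an image URL.
result = captioner("https://example.com/photo.jpg")
print(result[0]["generated_text"])
```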
The same idea applies to a wide range of tasks, like video analysis, audio-visual speech recognition, cross-modal retrieval, medical diagnostics and more. This is because multimodal datasets empower AI models to learn more complex semantic relationships between objects and their context, thus boosting model performance and accuracy.
With so many multimodal datasets out in the wild, it can be difficult to know where to start. In this post, we’ll cover the most notable multimodal datasets that are currently available, and briefly describe what they include and what they can potentially be used for.
1. Flickr30K Entities
An extension of the popular image-captioning Flickr30K dataset, this dataset contains more than 31,000 images sourced from Flickr, each paired with five crowd-sourced captions. Flickr30K Entities augments the original 158,000 captions with 244,000 coreference chains and adds bounding-box annotations for all entities (i.e., people or objects) referred to in the captions.
One important advantage of the Flickr30K Entities dataset is its more in-depth annotations for image-text tasks, which help models not only describe the contents of an image but also locate the entities within it.
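To make the annotation structure concrete, here is a small sketch of pulling entity mentions out of a Flickr30K Entities caption. It assumes the bracketed sentence markup described in the dataset's release, where each phrase is tagged as [/EN#chain_id/type phrase]; treat the exact syntax as an assumption and verify it against the official annotation files.

```python
# A hedged sketch: extracting entity mentions from a Flickr30K Entities
# caption string. The bracketed [/EN#chain_id/type phrase] syntax is assumed
# from the dataset's published annotation format -- verify against the
# official files before relying on it.
import re

ENTITY_PATTERN = re.compile(r"\[/EN#(\d+)/(\S+) ([^\]]+)\]")

def extract_entities(annotated_caption):
    """Return (coreference_chain_id, entity_type, phrase) triples."""
    return [
        (int(chain_id), entity_type, phrase)
        for chain_id, entity_type, phrase in ENTITY_PATTERN.findall(annotated_caption)
    ]

caption = "[/EN#1/people A man] throws [/EN#2/other a ball] to [/EN#3/animals his dog]."
print(extract_entities(caption))
# [(1, 'people', 'A man'), (2, 'other', 'a ball'), (3, 'animals', 'his dog')]
```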
Applications: Real-time image captioning; image search.
License: Use of the images must abide by Flickr’s Terms of Use; the dataset can be used by researchers and educators for non-commercial purposes.

Examples from the Flickr30K Entities dataset.
2. InternVid
Developed for video-related tasks like video captioning, video retrieval and video generation, InternVid is a relatively new video-text dataset that includes 7 million videos covering a wide range of objects and activities and totaling almost 760,000 hours of footage. These videos are broken down into an impressive 234 million clips, paired with richly descriptive captions totaling over 4.1 billion words.
One of the biggest benefits of this dataset is its breadth, with 16 distinct types of scenes and over 6,000 distinct actions covered.
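Video-text datasets at this scale are typically distributed as clip-level metadata (a video identifier, clip timestamps and a caption) rather than raw video. The sketch below shows how such records might be read from a JSON Lines file; the file name and field names are illustrative assumptions, not InternVid's exact schema, so check the release for the real layout.

```python
# A minimal sketch of reading clip-level metadata for a video-text dataset.
# The JSON Lines layout and field names (video_id, start, end, caption) are
# illustrative assumptions, not InternVid's exact schema -- check the release.
import json

def iter_clip_captions(jsonl_path):
    """Yield (clip, caption) pairs from a metadata file with one JSON object per line."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            clip = {
                "video_id": record["video_id"],  # assumed field
                "start": record["start"],        # assumed field (clip start time)
                "end": record["end"],            # assumed field (clip end time)
            }
            yield clip, record["caption"]        # assumed field

for clip, caption in iter_clip_captions("internvid_metadata.jsonl"):  # placeholder path
    print(clip["video_id"], caption)
    break
```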
Applications: Video chatbots; personalized e-learning.
License: Apache License 2.0.
3. MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews)
This intriguing text-audio-video dataset is designed to capture sentiment in user-generated video reviews, with a focus on the emotional engagement that occurs during product reviews. The MuSe-CaR dataset consists of over 40 hours of extensively annotated, high-quality, user-generated video recordings, which provide insight into emotional nuances that can show up in faces, voices, gestures or body language.
The aim of the dataset is to advance multimodal sentiment analysis by providing an in-depth resource for understanding complex human emotions across modalities.
Applications: Mental health chatbots or assistants; automated sentiment analysis systems for evaluating customer satisfaction with products.
License: Non-commercial under an End User Licence Agreement (EULA).

Examples from MuSe-CaR dataset.
4. MovieQA
MovieQA is a multimodal text-and-video dataset designed for evaluating story comprehension through video question-answering (VideoQA) tasks. It consists of nearly 15,000 multiple-choice questions paired with subtitled film clips taken from over 400 movies of high semantic diversity.
Answering the questions correctly requires the model to have a sufficient understanding of the visual and textual context contained within the video clip, such as sequential events, human interactions, intent, and the text used to describe them. The dataset is unique in that it contains multiple sources of information: video clips, plots, subtitles, scripts and DVS (Descriptive Video Service) annotations.
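To show what evaluation on a multiple-choice benchmark like this looks like in practice, here is a small, generic scoring sketch: a model (any callable that rates how well a candidate answer fits a question) picks the highest-scoring choice, and accuracy is the fraction of questions answered correctly. This is a general illustration, not MovieQA's official evaluation code, and the field names are assumptions.

```python
# A generic sketch of multiple-choice QA scoring, in the spirit of MovieQA-style
# evaluation (not the benchmark's official protocol). `score_fn` stands in for
# any model that rates how well a candidate answer fits a question.

def evaluate(questions, score_fn):
    """questions: iterable of dicts with 'question', 'choices' (list of str)
    and 'answer_idx' (index of the correct choice) -- assumed field names."""
    correct = 0
    total = 0
    for q in questions:
        scores = [score_fn(q["question"], choice) for choice in q["choices"]]
        predicted = scores.index(max(scores))
        correct += int(predicted == q["answer_idx"])
        total += 1
    return correct / total if total else 0.0

# Toy usage with a deliberately naive scorer that prefers longer answers,
# just to show the call shape.
sample = [{"question": "Why does the hero leave town?",
           "choices": ["For work", "To escape his past and start over", "No reason"],
           "answer_idx": 1}]
print(evaluate(sample, lambda question, choice: len(choice)))  # 1.0
```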
Applications: Automated film analysis, summary and categorization.
License: Not specified.

Examples from MovieQA dataset.
5. MINT-1T
MINT-1T is a massive, open source dataset from Salesforce AI Research that contains one trillion text tokens and 3.4 billion images — nearly ten times larger than the next largest open source dataset. This is an incredibly diverse, multimodal, interleaved dataset that integrates text and images in a way that imitates documents in the real world, like web pages and scientific papers — including PDFs and ArXiv papers.
The sheer scale of this dataset means that models can be more broadly versed in the existing online corpus of scientific and technological research. According to the research team, the goal was to create a dataset that features “free-form interleaved sequences of images and text,” suitable for training large multimodal AI models.
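To make “free-form interleaved sequences of images and text” concrete, the sketch below walks a document represented as an ordered mix of text and image segments and flattens it into a single training sequence with image placeholders. The segment representation is a hypothetical illustration of the interleaving idea, not MINT-1T's actual on-disk schema.

```python
# A hypothetical illustration of an interleaved image-text document: an ordered
# list of segments that a training pipeline flattens into one sequence, with a
# placeholder token marking where each image goes. This is not MINT-1T's actual
# storage format -- just a sketch of the "interleaved" idea.

IMAGE_TOKEN = "<image>"

def flatten_document(segments):
    """segments: list of {'type': 'text', 'text': str} or {'type': 'image', 'ref': str}.
    Returns (sequence_text, ordered_image_refs)."""
    parts = []
    image_refs = []
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["text"])
        else:  # image segment
            parts.append(IMAGE_TOKEN)
            image_refs.append(seg["ref"])
    return " ".join(parts), image_refs

doc = [
    {"type": "text", "text": "Figure 1 shows the measured spectrum."},
    {"type": "image", "ref": "fig1.png"},
    {"type": "text", "text": "The peak at 520 nm matches the model's prediction."},
]
print(flatten_document(doc))
```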
Applications: Developing more context-aware AI assistants; as a massive open dataset, MINT-1T also levels the playing field for researchers and businesses with smaller budgets.
License: CC-BY-4.0.
Conclusion
New datasets are continuously emerging, so here are some other recent multimodal datasets that are also worth a mention:
- BigDocs: This open and “permissively licensed” dataset is designed to train models for extracting information from documents, using enhanced OCR, layout and diagram analysis, and table detection.
- Newsmediabias-plus (NMB+): Combining textual and visual data from news articles, this dataset from the Vector Institute is designed for the detection and analysis of media bias and disinformation.
These are but a handful of the vast number of multimodal datasets that are available — not to mention multilingual datasets that are also coming to the fore. With so many options out there, it’s relatively easy to find the right datasets to train your AI model. For more information, check out our posts on tools for building multimodal AI applications, plus some open source and small-scale multimodal AI models.