5 Useful Datasets for Training Multimodal AI Models

With the ability to perform tasks across a range of combined modalities like text, image, audio, video and more, multimodal AI systems are fast becoming more versatile and powerful. However, building useful multimodal AI models requires good multimodal datasets, which are the necessary fuel for training these polyvalent systems — allowing them to expand their understanding of the world beyond one dimension or modality.

For instance, tasks like image captioning require a set of training data that combines both images and relevant, descriptive text, which can be used to train an AI model. After the training process, the AI model can then be deployed, using natural language processing and computer vision techniques to recognize the contents of a new image and to generate the associated text.
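
To make the deployment step concrete, here is a minimal captioning sketch using the Hugging Face Transformers pipeline API. The specific checkpoint named below (Salesforce/blip-image-captioning-base) and the image URL are illustrative assumptions rather than anything tied to the datasets in this list.

# Minimal image-captioning sketch (assumes: pip install transformers pillow torch).
# The checkpoint is one publicly available example, not a recommendation.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Accepts a local file path or an image URL; the URL here is a placeholder.
result = captioner("https://example.com/photo.jpg")
print(result[0]["generated_text"])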

The same idea applies to a wide range of tasks, like video analysis, audio-visual speech recognition, cross-modal retrieval, medical diagnostics and more. This is because multimodal datasets empower AI models to learn more complex semantic relationships between objects and their context, thus boosting model performance and accuracy.

With so many multimodal datasets out in the wild, it can be difficult to know where to start. In this post, we’ll cover the most notable multimodal datasets that are currently available, and briefly describe what they include and what they can potentially be used for.

1. Flickr30K Entities

As an extension of the popular image-captioning Flickr30K dataset, this dataset contains more than 31,000 images sourced from Flickr, with each image associated with five crowd-sourced captions. The Flickr30K Entities dataset augments the original 158,000 captions with 244,000 coreference chains, and adds bounding-box annotations for all entities (i.e., the people and objects) referred to in the captions.

One important advantage of the Flickr30K Entities dataset is that it provides more in-depth annotations for image-text tasks, and helps models better describe the contents of an image — in addition to locating the entities within the image.
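
To give a rough sense of what those annotations look like in practice, the sketch below parses a single caption into its entity mentions. It assumes the bracketed [/EN#id/type phrase] convention used in the dataset’s sentence files, and the caption string itself is a made-up example; check the official annotation files before relying on this format.

import re

# Hypothetical caption in the Flickr30K Entities sentence format: each mention is
# assumed to be written as [/EN#<chain_id>/<type> <phrase>], where the chain id links
# the phrase to a coreference chain and to bounding boxes in the XML annotations.
caption = ("[/EN#1/people Two young guys] with shaggy hair look at their hands "
           "while hanging out in [/EN#2/scene the yard] .")

MENTION = re.compile(r"\[/EN#(\d+)/(\S+) ([^\]]+)\]")

for chain_id, entity_type, phrase in MENTION.findall(caption):
    print(f"chain={chain_id} type={entity_type} phrase={phrase!r}")

# Strip the markup to recover the plain caption for image-captioning models.
print(MENTION.sub(lambda m: m.group(3), caption))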

Applications: Real-time image captioning; image search.

License: Use of the images must abide by Flickr’s Terms of Use; the dataset may be used by researchers and educators for non-commercial purposes.

Examples from the Flickr30K Entities dataset.

2. InternVid

Developed for video-related tasks like video captioning, video retrieval and video generation, InternVid is a relatively new video-text dataset that includes 7 million videos of various types of objects and activities lasting almost 760,000 hours. This is broken down into an impressive 234 million clips, paired with richly descriptive captions that total over 4.1 billion words.

One of the biggest benefits of this dataset is its breadth, with 16 distinct types of scenes and over 6,000 distinct actions covered.
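
A corpus of this size is normally consumed in streaming mode rather than downloaded in full. The sketch below peeks at the caption metadata with the Hugging Face datasets library; the repository ID and the record layout are assumptions based on how InternVid is distributed on the Hugging Face Hub, so verify them against the official release.

from itertools import islice
from datasets import load_dataset

# Stream the metadata instead of downloading ~760,000 hours of video up front.
# The repository ID and columns are assumptions; check the official InternVid release.
ds = load_dataset("OpenGVLab/InternVid", split="train", streaming=True)

for record in islice(ds, 3):
    # Each record is expected to pair a clip reference (URL plus timestamps) with its caption.
    print(record)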

Applications: Video chatbots; personalized e-learning.

License: Apache License 2.0.

3. MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews)

This intriguing text-audio-video dataset is designed to capture sentiment in user-generated video reviews, with the goal of understanding the emotional engagement that occurs during product reviews. The MuSe-CaR dataset consists of over 40 hours of extensively annotated, high-quality, user-generated video recordings, which provide insights into emotional nuances that might show up in faces, voices, gestures or body language.

The aim of the dataset is to advance multimodal sentiment analysis by providing an in-depth resource for understanding the many ways in which complex human emotions are expressed.
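
A common way to use a dataset like this is late fusion: features are extracted from each modality separately, then concatenated before a sentiment score is predicted. The PyTorch sketch below illustrates that general idea only; the feature dimensions are invented for the example and it is not the architecture used by the MuSe-CaR baselines.

import torch
import torch.nn as nn

class LateFusionSentiment(nn.Module):
    """Toy late-fusion regressor: concatenate per-modality features, predict one score."""
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # a continuous sentiment value, e.g. valence
        )

    def forward(self, text_feat, audio_feat, video_feat):
        fused = torch.cat([text_feat, audio_feat, video_feat], dim=-1)
        return self.head(fused)

# Dummy batch of four reviews with made-up feature sizes.
model = LateFusionSentiment()
scores = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(scores.shape)  # torch.Size([4, 1])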

Applications: Mental health chatbots or assistants; automated sentiment analysis systems for evaluating customer satisfaction with products.

License: Non-commercial under an End User Licence Agreement (EULA).

Examples from the MuSe-CaR dataset.

4. MovieQA

MovieQA is a text-video question-answer multimodal dataset designed for evaluating story comprehension and performing video question-answering (VideoQA) tasks. It consists of nearly 15,000 multiple-choice questions paired with subtitled film clips taken from over 400 movies of high semantic diversity.

Answering the questions correctly requires the model to have a sufficient understanding of the visual and textual context contained within the video clip, such as sequential events, human interactions, intent, and the text used to describe them. The dataset is unique in that it contains multiple sources of information, ranging from video clips and plot summaries to subtitles, scripts and DVS (Descriptive Video Service) transcriptions.
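
Since every question ships with a fixed set of answer choices, evaluation usually reduces to scoring each candidate and measuring accuracy. The loop below sketches that pattern with a deliberately naive stand-in scorer and a made-up example question; it is not the official MovieQA evaluation code.

# Generic multiple-choice VideoQA evaluation loop (illustrative only).
def score_answer(question, clip_id, candidate):
    # Placeholder scorer: a real system would use a trained multimodal model here.
    return len(set(question.lower().split()) & set(candidate.lower().split()))

examples = [  # made-up example in the MovieQA style: one question, five choices
    {"clip_id": "clip_001",
     "question": "Who does Forrest meet on the school bus?",
     "choices": ["A drill sergeant", "Jenny", "Lieutenant Dan", "Bubba", "His mother"],
     "answer_idx": 1},
]

correct = 0
for ex in examples:
    candidate_scores = [score_answer(ex["question"], ex["clip_id"], c) for c in ex["choices"]]
    predicted = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    correct += int(predicted == ex["answer_idx"])

print(f"accuracy = {correct / len(examples):.2f}")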

Applications: Automated film analysis, summary and categorization.

License: Not specified.

Examples from the MovieQA dataset.

5. MINT-1T

MINT-1T is a massive, open source dataset from Salesforce AI Research that contains one trillion text tokens and 3.4 billion images — nearly ten times larger than the next largest open source dataset. This is an incredibly diverse, multimodal, interleaved dataset that integrates text and images in a way that imitates documents in the real world, like web pages and scientific papers — including PDFs and ArXiv papers.

The sheer scale of this dataset means that models can be more broadly versed in the existing online corpus of scientific and technological research. According to the research team, the goal was to create a dataset that features “free-form interleaved sequences of images and text,” suitable for training large multimodal AI models.
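
At the trillion-token scale, streaming is effectively mandatory. The sketch below streams one of the published subsets with the Hugging Face datasets library; the repository ID and the document schema are assumptions, so check the official MINT-1T release for the exact subset names before use.

from itertools import islice
from datasets import load_dataset

# Stream a MINT-1T subset rather than materializing a trillion tokens on disk.
# The repository ID is an assumption; consult the official release for subset names.
ds = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)

for doc in islice(ds, 2):
    # Interleaved documents are expected to mix text segments and image references in reading order.
    print(list(doc.keys()))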

Applications: Developing AI assistants that are more context-aware; as a massive open dataset, MINT-1T also levels the playing field for researchers and businesses with smaller budgets.

License: CC-BY-4.0.

Conclusion

New datasets are continuously emerging, so here are some other recent multimodal datasets that are also worth mentioning:

  • BigDocs: This open and “permissively licensed” dataset is designed to train models for extracting information from documents, using enhanced OCR, layout and diagram analysis, and table detection.
  • Newsmediabias-plus (NMB+): Combining textual and visual data from news articles, this dataset from the Vector Institute is designed for the detection and analysis of media bias and disinformation.

These are but a handful of the vast number of multimodal datasets that are available — not to mention multilingual datasets that are also coming to the fore. With so many options out there, it’s relatively easy to find the right datasets to train your AI model. For more information, check out our posts on tools for building multimodal AI applications, plus some open source and small-scale multimodal AI models.

