5 Multimodal AI Models That Are Actually Open Source



Multimodal AI is attracting a lot of attention, thanks to the tantalizing promise of AI systems that are designed to be jacks of all trades — capable of processing a combination of text, image, audio, and video.

But while there is already a constellation of powerful, proprietary multimodal AI systems on the market, smaller multimodal AI models and open source alternatives are also rapidly gaining ground, as users continue to seek out options that are more accessible and adaptable, and prioritize transparency and collaboration. To get you up to speed on the latest open source multimodal AI systems, we’ll outline some of the more popular options — including their features and uses.

1. Aria

The recently introduced Aria AI model from Rhymes AI is touted as the world’s first open source, multimodal native mixture-of-experts (MoE) model that can process text, code, images, and video — all within one architecture.

This versatile model performs competitively against larger models while being more efficient, because it activates only the relevant subsets (or “mini-experts”) of its network for a given task. Its architecture is designed for ease of scalability, as new “experts” could be added to address new tasks without straining the system. Aria excels at long multimodal input understanding, meaning that it is adept at quickly and accurately parsing long documents and videos.
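To make the idea of selectively activating experts concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Aria's implementation: the toy experts, the router scores, and the function names (`route_top_k`, `moe_forward`) are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so they sum to 1.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(token, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(router_scores, k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Hypothetical toy experts: each one just scales the token by a constant.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_scores = [0.1, 2.0, 0.3, 2.0]  # made-up scores favoring experts 1 and 3

routing = route_top_k(router_scores, k=2)
mixed = moe_forward(token, experts, router_scores, k=2)
```

Only two of the four toy experts run for this token, which is where the efficiency gain of a mixture-of-experts design comes from. In a real model the router scores would be produced by a learned layer rather than written by hand.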

Aria’s architecture.

2. Leopard

Developed by an interdisciplinary team of researchers from the University of Notre Dame, Tencent AI Seattle Lab, and the University of Illinois Urbana-Champaign (UIUC), Leopard is an open source multimodal model that is specifically designed for text-rich image tasks.

Leopard is intended to tackle two of the biggest challenges in the multimodal AI space: the scarcity of high-quality multi-image datasets, and the need to balance image resolution against sequence length. To achieve this, the model is trained on a curated dataset of over 1 million high-quality, human-made and synthetic samples collected from real-world examples. The dataset is also openly available for use in training other models.

“Leopard stands out with its novel adaptive high-resolution encoding module, which dynamically optimizes the allocation of visual sequence lengths based on the original aspect ratios and resolutions of the input images,” Wenhao Yu, a senior researcher at Tencent America and one of the creators of Leopard, explained to The New Stack. “Additionally, it uses pixel shuffling to losslessly compress long visual feature sequences into shorter ones. This design enables the model to handle multiple high-resolution images without sacrificing detail or clarity.”
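The pixel shuffling that Yu describes can be illustrated with a generic space-to-depth rearrangement: each block of neighboring visual features is merged into one longer feature vector, so the sequence gets shorter without any values being dropped. The sketch below is an assumption-laden illustration in plain Python, not Leopard's code; the function names, the block size `r`, and the toy grid are hypothetical.

```python
def compress_tokens(grid, r):
    """Merge each r x r block of feature vectors into one longer vector.

    grid is a list of rows; each row is a list of feature vectors (lists).
    The token count drops by a factor of r * r while the feature length
    grows by the same factor, so no values are discarded.
    """
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid height and width must be divisible by r")
    out = []
    for top in range(0, height, r):
        out_row = []
        for left in range(0, width, r):
            merged = []
            for dy in range(r):
                for dx in range(r):
                    merged.extend(grid[top + dy][left + dx])
            out_row.append(merged)
        out.append(out_row)
    return out

def expand_tokens(grid, r, channels):
    """Inverse of compress_tokens, included to show the step is reversible."""
    height, width = len(grid) * r, len(grid[0]) * r
    out = [[None] * width for _ in range(height)]
    for y, row in enumerate(grid):
        for x, merged in enumerate(row):
            for n in range(r * r):
                dy, dx = divmod(n, r)
                out[y * r + dy][x * r + dx] = merged[n * channels:(n + 1) * channels]
    return out

# Toy 4 x 4 grid of 2-dimensional features (values are arbitrary).
features = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
compact = compress_tokens(features, r=2)
restored = expand_tokens(compact, r=2, channels=2)
```

Here 16 visual tokens become 4, each carrying 8 values instead of 2, and `expand_tokens` recovers the original grid exactly. That reversibility is what "lossless" means for the rearrangement itself.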

These features make Leopard well suited to multi-page document understanding (think slide decks and scientific or financial reports), data visualization, webpage comprehension, and the deployment of multimodal AI agents capable of handling tasks in visually complex environments.

Leopard’s overall model pipeline.

3. CogVLM

CogVLM, short for Cognitive Visual Language Model, is an open source, state-of-the-art visual language foundation model that uses deep fusion techniques to attain high performance. It can be used for visual question answering (VQA) and image captioning.

CogVLM uses an attention-based fusion mechanism (https://openreview.net/pdf?id=c72vop46KY) that fuses text and image embeddings, and freezes network layers to keep performance high. It also employs an EVA2-CLIP-E visual encoder and a multi-layer perceptron (MLP) adapter for mapping visual and text features into the same space.

4. LLaVA

Large Language and Vision Assistant (LLaVA) is another open source, state-of-the-art option. It pairs a CLIP vision encoder with Vicuna as the language decoder, and is fine-tuned on instruction-following data. That training data was generated with ChatGPT and GPT-4. LLaVA uses a trainable projection matrix to map visual representations onto the language embedding space.
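As a rough illustration of what a projection matrix does here, the sketch below multiplies toy visual features by a small matrix so they have the same length as the text embeddings, then joins the two into one input sequence. The weights, dimensions, function names, and the choice to place image tokens before the text are all assumptions for illustration, not LLaVA's actual code.

```python
def project(visual_features, weights):
    """Multiply each visual feature vector by the projection matrix.

    visual_features: list of vectors of length d_vision
    weights: d_vision rows by d_text columns
    Returns one vector of length d_text per visual feature.
    """
    d_text = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_text)]
        for v in visual_features
    ]

def build_input_sequence(visual_features, text_embeddings, weights):
    """Place projected image tokens ahead of the text token embeddings."""
    return project(visual_features, weights) + text_embeddings

# Toy sizes: 3-dimensional vision features, 2-dimensional language embeddings.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]                               # stand-in for learned weights
patches = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]   # two image patch features
words = [[0.5, 0.5]]                           # one text token embedding

sequence = build_input_sequence(patches, words, W)
```

After projection, every element of `sequence` has the same length, so the language model can process image and text tokens side by side. In training, the entries of `W` would be learned rather than fixed.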

As a versatile visual assistant, LLaVA can be used to create more advanced chatbots that can handle text- and image-based queries.

5. xGen-MM

Also known as BLIP-3, this state-of-the-art, open source suite of multimodal models from Salesforce features a line of variants, including a base pretrained model, an instruction-tuned model, and a safety-tuned model that is intended to reduce harmful outputs.

One crucial development is that the systems were trained using a massive, open source trillion-token dataset of “interleaved” image and text data, which the researchers characterize as “the most natural form of multimodal data.” That means the models are skilled at handling inputs with text and multiple images, which could be useful in a wide range of settings, such as autonomous vehicles, image analysis and disease diagnosis in healthcare, interactive educational tools, and promotional marketing materials.

Conclusion

There is still an ongoing, vigorous debate surrounding the actual definition of open source AI, peppered with accusations of large tech companies “open washing” their AI models in order to gain wider credibility and cachet.

Regardless of how the open source AI debate unfolds, it’s clear that there is still a need for truly open source systems and datasets that emphasize transparency, collaboration and accessibility, and that actually live up to the open source ethos.



Disclaimer

We strive to uphold the highest ethical standards in all of our reporting and coverage. We at StartupNews.fyi want to be transparent with our readers about any potential conflicts of interest that may arise in our work. It’s possible that some of the investors we feature may have connections to other businesses, including competitors or companies we write about. However, we want to assure our readers that this will not have any impact on the integrity or impartiality of our reporting. We are committed to delivering accurate, unbiased news and information to our audience, and we will continue to uphold our ethics and principles in all of our work. Thank you for your trust and support.

Team SNFYI



