5 Multimodal AI Models That Are Actually Open Source

Multimodal AI is attracting a lot of attention, thanks to the tantalizing promise of AI systems that are designed to be jacks of all trades — capable of processing a combination of text, image, audio, and video.

But while there is already a constellation of powerful, proprietary multimodal AI systems on the market, smaller multimodal AI models and open source alternatives are also rapidly gaining ground, as users continue to seek out options that are more accessible and adaptable, and prioritize transparency and collaboration. To get you up to speed on the latest open source multimodal AI systems, we’ll outline some of the more popular options — including their features and uses.

1. Aria

The recently introduced Aria AI model from Rhymes AI is touted as the world’s first open source, multimodal native mixture-of-experts (MoE) model that can process text, code, images, and video — all within one architecture.

This versatile model performs competitively with larger models, yet is more efficient, because it selectively activates only the relevant subsets (or “mini-experts”) of its framework, depending on the task. Its architecture is designed for ease of scalability, as new “experts” could be added to address new tasks without straining the system. Aria excels at long multimodal input understanding, meaning that it is adept at quickly and accurately parsing long documents and videos.
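To make the idea of selective expert activation concrete, here is a minimal, illustrative sketch of top-k routing in a mixture-of-experts layer. It is not Aria's actual implementation: the router, the toy "experts" (simple scalings), the sizes, and the names `moe_forward` and `make_expert` are all assumptions chosen for illustration.

```python
import math
import random

def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    """A stand-in 'expert' (hypothetical): element-wise scaling of the input."""
    return lambda vec: [scale * x for x in vec]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token vector through only the top_k highest-scoring experts."""
    # One router score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    # Keep only the top_k experts; the remaining experts are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / norm  # renormalise the gates over the selected experts
        expert_out = experts[i](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

random.seed(0)
dim, num_experts = 4, 8  # toy sizes, not Aria's real configuration
experts = [make_expert(scale=i + 1) for i in range(num_experts)]
router_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -1.0, 2.0, 0.1]
output, chosen = moe_forward(token, router_weights, experts, top_k=2)
print("experts used:", chosen)
```

Only two of the eight toy experts run for this token, which is where the efficiency gain comes from: the total parameter count can grow with the number of experts while the compute per token stays roughly constant.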

Aria’s architecture.

2. Leopard

Developed by an interdisciplinary team of researchers from University of Notre Dame, Tencent AI Seattle Lab, and the University of Illinois Urbana-Champaign (UIUC), Leopard is an open source multimodal model that is specifically designed for text-rich image tasks.

Leopard is intended to tackle two of the biggest challenges in the multimodal AI space, namely the scarcity of high-quality multi-image datasets, and balancing image resolution with sequence length. To achieve this, the model is trained with a curated dataset featuring over 1 million high-quality, human-made and synthetic data samples that have been collected from real-world examples. The dataset is also openly available for use in training other models.

“Leopard stands out with its novel adaptive high-resolution encoding module, which dynamically optimizes the allocation of visual sequence lengths based on the original aspect ratios and resolutions of the input images,” Wenhao Yu, a senior researcher at Tencent America and one of the creators of Leopard, explained to The New Stack. “Additionally, it uses pixel shuffling to losslessly compress long visual feature sequences into shorter ones. This design enables the model to handle multiple high-resolution images without sacrificing detail or clarity.”

These features make Leopard an excellent tool for multi-page document understanding (think slide decks, scientific and financial reports), data visualization, webpage comprehension, and deploying multimodal AI agents capable of handling tasks in visually complex environments.
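The pixel-shuffling step mentioned in the quote above can be illustrated with a small sketch. This is not Leopard's code; it assumes the compression is a space-to-depth rearrangement (what PyTorch calls `PixelUnshuffle`), and the function name, grid size, and ratio are hypothetical. Each block of neighbouring patch features is merged into one longer vector, so the sequence gets shorter while every value is kept.

```python
def pixel_shuffle_compress(grid, ratio=2):
    """Merge each ratio x ratio block of patch features into one longer feature vector.

    grid is a height x width grid (nested lists) of per-patch feature vectors.
    The result has (height / ratio) x (width / ratio) tokens, each ratio * ratio
    times wider, so values are only rearranged, never discarded.
    """
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid height and width must be divisible by ratio")
    compressed = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            merged = []
            for dy in range(ratio):
                for dx in range(ratio):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        compressed.append(row)
    return compressed

# Toy 4x4 grid of patches with 3 channels each; values encode (row, col, channel).
grid = [[[100 * r + 10 * c + ch for ch in range(3)] for c in range(4)] for r in range(4)]
compressed = pixel_shuffle_compress(grid, ratio=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(compressed) * len(compressed[0])
print(tokens_before, "->", tokens_after, "tokens; channels per token:", len(compressed[0][0]))
# prints: 16 -> 4 tokens; channels per token: 12
```

With a ratio of 2, the visual sequence is four times shorter and each token is four times wider, which is why the rearrangement itself loses no information.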

Leopard’s overall model pipeline.

3. CogVLM

CogVLM, short for Cognitive Visual Language Model, is an open source, state-of-the-art visual language foundation model that uses deep fusion techniques to attain high performance. It can be used for visual question answering (VQA) and image captioning.

CogVLM uses an attention-based fusion mechanism (https://openreview.net/pdf?id=c72vop46KY) that fuses text and image embeddings, and keeps the pretrained language model's layers frozen so that its language performance is preserved. It also employs an EVA2-CLIP-E visual encoder and a multi-layer perceptron (MLP) adapter that maps visual and text features into the same space.

4. LLaVA

Large Language and Vision Assistant (LLaVA) is another open source, state-of-the-art option. It uses Vicuna as its language decoder and CLIP as its vision encoder. The model has been trained on instruction-following data generated by ChatGPT and GPT-4. LLaVA uses a trainable projection matrix to map visual representations onto the language embedding space.
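As a rough illustration of what mapping visual representations onto the language embedding space means, the sketch below multiplies toy image-patch features by a small projection matrix and then joins the result with text embeddings. The dimensions, the matrix values, and the `project` helper are all hypothetical; in a real model the matrix is learned during training and the vectors are far larger.

```python
def project(visual_features, projection):
    """Multiply each visual feature vector by the projection matrix (vision_dim x text_dim)."""
    text_dim = len(projection[0])
    return [
        [sum(feat[i] * projection[i][j] for i in range(len(feat))) for j in range(text_dim)]
        for feat in visual_features
    ]

# Hypothetical toy sizes: 3 image patches with 4-dim vision features, 2-dim language embeddings.
visual_features = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
projection = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.0], [0.0, 0.25]]  # would be learned in training
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]  # stand-ins for the prompt's token embeddings

visual_tokens = project(visual_features, projection)
# Projected image tokens now have the same width as the text embeddings, so both
# can be concatenated into a single input sequence for the language model.
sequence = visual_tokens + text_embeddings
print(visual_tokens)
print(len(sequence), "tokens of width", len(sequence[0]))
```

Because the projected image tokens and the text tokens share one width, the language model can attend over both without any change to its own architecture.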

As a versatile visual assistant, LLaVA can be used to create more advanced chatbots that can handle text- and image-based queries.

5. xGen-MM

Also known as BLIP-3, this state-of-the-art, open source suite of multimodal models from Salesforce features a line of variants, including a base pretrained model, an instruction-tuned model, and a safety-tuned model that is intended to reduce harmful outputs.

One crucial development is that the systems were trained using a massive, open source trillion-token dataset of “interleaved” image and text data, which the researchers characterize as “the most natural form of multimodal data”. That means the models are skilled at handling inputs with text and multiple images, which could be useful in a wide range of settings, such as autonomous vehicles, medical image analysis and disease diagnosis, interactive educational tools, and promotional marketing materials.
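For readers unfamiliar with the term, an interleaved sample is simply a document in which text and images appear in their natural order. The sketch below shows one possible way to represent such a sample and flatten it for a model. The class names, the `<image>` placeholder, and the file names are assumptions for illustration, not xGen-MM's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    path: str  # hypothetical file reference

Segment = Union[TextSegment, ImageSegment]

def to_model_input(document: List[Segment], image_token: str = "<image>") -> Tuple[str, List[str]]:
    """Flatten an interleaved document into a prompt with placeholders plus an ordered image list."""
    parts, images = [], []
    for segment in document:
        if isinstance(segment, ImageSegment):
            parts.append(image_token)  # marks where this image sits in the text
            images.append(segment.path)
        else:
            parts.append(segment.text)
    return " ".join(parts), images

document = [
    TextSegment("The chart below shows quarterly revenue."),
    ImageSegment("revenue.png"),
    TextSegment("Compare it with last year:"),
    ImageSegment("revenue_prev.png"),
]
prompt, images = to_model_input(document)
print(prompt)
print(images)
```

Keeping the placeholders in reading order preserves which image each sentence refers to, which is the property that makes interleaved data useful for multi-image tasks.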

Conclusion

There is still an ongoing, vigorous debate surrounding the actual definition of open source AI, peppered with accusations of large tech companies “open washing” their AI models in order to gain wider credibility and cachet.

Regardless of how the open source AI debate unfolds, it’s clear that there is still a need for truly open source systems and datasets that emphasize transparency, collaboration and accessibility, and that actually live up to the open source ethos.


Kimberley is a tech and design reporter who covers artificial intelligence, robotics, quantum computing, tech culture, and science stories for The New Stack. Trained as an architect, she is also an illustrator and multidisciplinary designer who has been passionate about…

