Decoding India’s Bid To Build Multimodal LLM

SUMMARY

The key “distinguishing features” of BharatGen include its multilingual and multimodal nature, indigenously built datasets, and open-source architecture

Indian authorities have set their sights on extensive AI model development, experimentation, and the establishment of AI benchmarks tailored to India’s needs by July 2026

One of the core features of BharatGen will be its focus on data-efficient learning, particularly for Indian languages with limited digital presence

It was December of 2023 and Prime Minister (PM) Narendra Modi had just taken to the stage to address the gathering of people at the Kashi Tamil Sangamam in Uttar Pradesh’s Varanasi. Just as PM Modi commenced his address in Hindi, the attendees plugged in their headphones to hear the translated version of his speech in real time.

At work was the Centre’s ambitious AI platform, Bhashini, a language translation platform that aims to make digital services and the internet more accessible in Indian languages. Much has happened in the 10 months since.

The government has taken a series of steps to bolster its AI offerings. Most recently, the Centre announced the BharatGen project, touted as the world’s first government-funded multimodal large language model (LLM) project.

The Ministry of Science and Technology said that it will undertake the development of the multimodal LLM project, which will be focussed on creating “efficient and inclusive AIs” in Indian languages.

Once completed, BharatGen will be able to generate high-quality text and “multimodal content” in various Indian languages.

For the uninitiated, a multimodal LLM can process multiple types of data, or modalities, such as text, images, audio, video, and 3D environments. It can also generate content in all these formats.

As per the government, there will be four key “distinguishing features” of BharatGen:

Multilingual and multimodal nature of foundation models
Indigenously built datasets, which will be leveraged to train the LLMs
Open-source architecture
Development of an ecosystem of GenAI research in India

The Making Of BharatGen

Slated to be completed in a span of two years, BharatGen will cater to both text and speech to ensure coverage across India’s “diverse linguistic landscape”.

“Looking ahead, BharatGen’s roadmap outlines key milestones up to July 2026. These include extensive AI model development, experimentation, and the establishment of AI benchmarks tailored to India’s needs,” the government said in a statement.

BharatGen will be developed under DST’s National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), with IIT Bombay spearheading the effort. Other academic institutes, including IIIT Hyderabad, IIT Mandi, IIT Kanpur, IIT Hyderabad, IIM Indore, and IIT Madras, will also participate in the project.

The multimodal LLM will be trained on multilingual datasets to “deeply capture” the nuances of Indian languages.

To address the paucity of datasets needed to train AI models for Indian languages, BharatGen will also look to develop processes for collecting and curating India-centric data. This data will be gathered in a way that accurately represents the country’s diverse languages, dialects, and cultural contexts.

Notably, one of the core features of BharatGen will be its focus on data-efficient learning, particularly for Indian languages with limited digital presence. The government will partner with multiple academic institutions to develop AI models that are effective with minimal data.

“This emphasis on data sovereignty strengthens India’s control over its digital resources and narrative,” the statement added.

As of now, LLMs are predominantly trained on English-language data, which is abundant online. There have been attempts by the likes of Google to roll out their AI chatbots in multiple Indian languages on the back of a treasure trove of search-related data.

Smaller players, however, do not have access to such resources. It is this gap that the government wants to fill with its open-architecture LLM, which startups and academicians can use to build products on top of the tech stack and linguistic datasets.

“BharatGen will deliver generative AI models and their applications as a public good by prioritising India’s socio-cultural and linguistic diversity. It strives to address India’s broader needs such as social equity, cultural preservation, and linguistic diversity, while ensuring that GenAI reaches all segments of society,” as per the government.

Professor Abhay Karandikar, secretary in the department of science and technology (DST), said that BharatGen will go beyond making AI accessible to all and serving industrial and commercial purposes, and will be leveraged to address “national priorities” such as cultural preservation and inclusive technology development.

Aligned with the government’s ‘Atmanirbhar Bharat’ vision, one of the stated goals of the project is to reduce “reliance on foreign technologies” and strengthen the domestic AI ecosystem for startups, industries, and government agencies.

The Centre also believes that BharatGen will democratise access to AI through foundational models, adding that the tech stack will allow innovators, researchers, and startups to build AI applications quickly and affordably.

The project will also look to foster a vibrant AI research community through training programmes, hackathons, and collaborations with global experts.

The proposed project is part of the Indian government’s overarching push for digital public infrastructure (DPI). Leveraging AI could give a further impetus to India’s existing digital public goods rails and pave the way for offering cost-effective solutions not just in India, but globally.

The BharatGen project also echoes India’s focus on fostering the adoption of AI technologies. Earlier this year, the union cabinet approved the IndiaAI Mission with an allocation of INR 10,372 Cr over the next five years. The outlay will be utilised to facilitate funding for emerging AI startups and spur innovation in the sector.

In September, the government also invited applications from startups and researchers to build and deploy “impactful” AI solutions in critical areas. Amid all this, the Centre has already constituted an advisory group to formulate a framework to regulate AI.

At the heart of all this is the Indian AI landscape, which already hosts more than 100 startups that have raised more than $600 Mn between 2019 and H1 2024. As per Inc42 data, the Indian GenAI ecosystem is projected to be a $17 Bn market opportunity by 2030 on the back of the growing adoption of the emerging technology.
