Indian IT firm Tech Mahindra intends to launch ‘Project Indus’, its LLM designed for Hindi and its 37 dialects, by the end of December or early January, the Economic Times reported. The planned launch comes four months after the company introduced ‘Project Indus,’ a strategic effort by the fifth-largest software services firm to develop a foundational model for Indian languages.
Over the last two months, the 15-member Project Indus team has gathered 1.2 terabytes of data in Hindi and its related dialects. The team is now refining this data into web text, which it plans to release as open source by the end of November, said Nikhil Malhotra, global head of Makers Lab at Tech Mahindra, according to the ET report.
“In the meantime, we have started constructing the model… We are looking at probably the end of December or starting of January, we will release the model for at least Hindi and its dialects. And then the other work starts for other dialects in other regions,” Malhotra said.
The team encountered difficulties related to data availability and collection. “In Hindi, the maximum number of tokens available is about 2.8 billion, which doesn’t meet the model’s requirements. For instance, to create a 7 billion parameter model, I would need at least around 100 billion tokens,” explained Malhotra.
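To put the figures Malhotra quotes in perspective, the gap can be worked out with simple arithmetic. The short Python sketch below is illustrative only: the function and variable names are hypothetical, and the numbers are taken directly from his statement (2.8 billion available Hindi tokens, roughly 100 billion tokens needed for a 7-billion-parameter model).

```python
# Illustrative arithmetic only; all names here are hypothetical.
# The figures come from Malhotra's quoted estimates.

AVAILABLE_HINDI_TOKENS = 2_800_000_000   # "about 2.8 billion"
REQUIRED_TOKENS = 100_000_000_000        # "at least around 100 billion"
MODEL_PARAMETERS = 7_000_000_000         # 7-billion-parameter model


def token_shortfall(available, required):
    """Return (missing tokens, required-to-available multiple)."""
    return required - available, required / available


gap, multiple = token_shortfall(AVAILABLE_HINDI_TOKENS, REQUIRED_TOKENS)
tokens_per_parameter = REQUIRED_TOKENS / MODEL_PARAMETERS

print(f"Missing tokens: {gap / 1e9:.1f} billion")
print(f"Required vs. available: about {multiple:.1f}x")
print(f"Implied tokens per parameter: about {tokens_per_parameter:.1f}")
```

On these numbers, the available Hindi text covers under 3% of the stated requirement, which helps explain why the team turned to crowdsourcing and field collection.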
Early on, a portal was set up to crowdsource voice samples in local dialects. It drew 1,500 responses in the first two days, but participation then tapered off. In total, only 6,000 samples were received, according to Malhotra.
To address this, teams were dispatched to regions like Uttar Pradesh, Madhya Pradesh, Haryana, and Jammu to collect data in person. Additionally, the Hyderabad campus of Tech Mahindra organized a camp where employees contributed samples in dialects like Hyderabadi Dakhini.
According to Tech Mahindra chief CP Gurnani, the model will be the biggest Indic LLM and could possibly cater to 25% of the world’s population. While Tech Mahindra has not revealed the cost associated with the project or when the model is expected to be launched, the aim is to build a 7-billion-parameter LLM to begin with, Malhotra told AIM in an exclusive interview.
The post Tech Mahindra to Launch OpenAI Rival ‘Project Indus’ Early Next Year appeared first on Analytics India Magazine.