New AI Models Enhance Voice Interactions with Large Language Models

Researchers from Alibaba have unveiled FunAudioLLM, a framework designed to enable natural voice interactions between humans and large language models (LLMs). The system comprises two key components: SenseVoice for voice understanding and CosyVoice for voice generation.

Read the full paper here – https://arxiv.org/pdf/2407.04051

SenseVoice, available in Small and Large variants, excels in multilingual speech recognition, emotion recognition, and audio event detection. SenseVoice-Small offers low-latency ASR for five languages, while SenseVoice-Large supports high-precision ASR for over 50 languages.

CosyVoice, on the other hand, specialises in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. It supports five languages: Chinese, English, Japanese, Cantonese, and Korean.

The integration of these models with LLMs enables various applications, including speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.
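As a rough illustration of how such an integration could be wired together, the sketch below chains a speech-understanding stage, an LLM stage, and a speech-generation stage. It is a minimal sketch under stated assumptions, not the researchers' implementation: `VoicePipeline`, `Transcript`, and the three `fake_*` functions are hypothetical names invented for this example, and the stand-in functions only mimic the roles that SenseVoice, an LLM, and CosyVoice would play.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    """Output of the understanding stage (hypothetical structure)."""
    text: str
    language: str
    emotion: str  # e.g. "neutral" or "happy"


@dataclass
class VoicePipeline:
    """Chains understanding -> LLM -> generation. All stages are pluggable."""
    understand: Callable[[bytes], Transcript]  # role played by a model like SenseVoice
    respond: Callable[[str], str]              # role played by an LLM
    synthesize: Callable[[str, str], bytes]    # role played by a model like CosyVoice

    def run(self, audio: bytes) -> bytes:
        transcript = self.understand(audio)
        # Pass the detected language and emotion to the LLM as plain-text tags.
        prompt = (
            f"[lang={transcript.language}] "
            f"[emotion={transcript.emotion}] {transcript.text}"
        )
        reply = self.respond(prompt)
        # Reuse the detected emotion as a speaking-style hint for synthesis.
        return self.synthesize(reply, transcript.emotion)


# Stand-in stages (assumptions for illustration only, no real models involved).
def fake_understand(audio: bytes) -> Transcript:
    return Transcript(text=audio.decode("utf-8"), language="en", emotion="happy")


def fake_respond(prompt: str) -> str:
    return "Echo: " + prompt


def fake_synthesize(text: str, style: str) -> bytes:
    return f"<{style}>{text}".encode("utf-8")


pipeline = VoicePipeline(fake_understand, fake_respond, fake_synthesize)
```

Calling `pipeline.run(b"hello")` passes the input through all three stand-in stages in order. In a real deployment each stand-in would be replaced by a call into the corresponding model.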

Experimental results show that SenseVoice outperforms existing models such as Whisper on many benchmarks. For instance, SenseVoice-Small is more than 5 times faster than Whisper-small and more than 15 times faster than Whisper-large for speech recognition tasks.

CosyVoice demonstrates high-quality speech synthesis, with content consistency and speaker similarity comparable to or better than those of the original utterances.

The researchers have open-sourced the SenseVoice and CosyVoice models on ModelScope and Hugging Face, along with training, inference, and fine-tuning code on GitHub.

While the system shows promising results, the researchers acknowledge some limitations. These include lower performance for under-resourced languages, lack of streaming transcription capabilities, and the need for improvement in expressive emotional changes while maintaining original voice timbre.

Alibaba previously released an image generator, Tongyi Wanxiang, positioned as a rival to Midjourney and DALL-E. FunAudioLLM extends the company's lineup of generative models to voice.

StartupNews.fyi
