New AI Models Enhance Voice Interactions with Large Language Models


Researchers from Alibaba unveiled FunAudioLLM, a groundbreaking framework designed to facilitate natural voice interactions between humans and large language models (LLMs). The system comprises two key components: SenseVoice for voice understanding and CosyVoice for voice generation.

Read the full paper here – https://arxiv.org/pdf/2407.04051

SenseVoice, available in Small and Large variants, excels in multilingual speech recognition, emotion recognition, and audio event detection. SenseVoice-Small offers low-latency ASR for five languages, while SenseVoice-Large supports high-precision ASR for over 50 languages.

CosyVoice, on the other hand, specialises in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. It supports five languages: Chinese, English, Japanese, Cantonese, and Korean.

The integration of these models with LLMs enables various applications, including speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration.
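As a rough illustration of how such an integration could be wired together, the sketch below chains three stages: speech understanding, an LLM response, and speech generation. It is a minimal sketch under stated assumptions, not FunAudioLLM's actual API: the names `Transcript`, `speech_to_speech`, and the three `stub_*` functions are hypothetical placeholders, and the stubs only pass text around so that the example runs without any models installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    """Hypothetical output of the voice-understanding stage."""
    text: str
    language: str
    emotion: str


def speech_to_speech(
    audio: bytes,
    understand: Callable[[bytes], Transcript],
    respond: Callable[[Transcript], str],
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Chain the stages: audio -> transcript -> LLM reply -> audio."""
    transcript = understand(audio)   # role a SenseVoice-style model would play
    reply = respond(transcript)      # role an LLM would play
    # The detected emotion is carried through to the generation stage.
    return synthesize(reply, transcript.emotion)


# Stand-in stages (assumptions for illustration only; no real model is called).
def stub_understand(audio: bytes) -> Transcript:
    return Transcript(text=audio.decode("utf-8"), language="en", emotion="neutral")


def stub_respond(transcript: Transcript) -> str:
    return f"[translated] {transcript.text}"


def stub_synthesize(text: str, emotion: str) -> bytes:
    return f"<{emotion}> {text}".encode("utf-8")


if __name__ == "__main__":
    out = speech_to_speech(b"hello", stub_understand, stub_respond, stub_synthesize)
    print(out)
```

Passing the stages in as plain callables keeps the pipeline independent of any particular model, so each stub could be swapped for a real recognition, LLM, or synthesis call.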

Experimental results show that SenseVoice outperforms existing models such as Whisper on many benchmarks. For instance, SenseVoice-Small is more than 5 times faster than Whisper-small and more than 15 times faster than Whisper-large on speech recognition tasks.

CosyVoice demonstrates high-quality speech synthesis, achieving content consistency and speaker similarity comparable to, or better than, the original utterances.

The researchers have open-sourced the SenseVoice and CosyVoice models on ModelScope and Hugging Face, along with training, inference, and fine-tuning code on GitHub.

While the system shows promising results, the researchers acknowledge some limitations. These include lower performance on under-resourced languages, a lack of streaming transcription, and the need to improve expressive emotional changes while preserving the original voice timbre.

Alibaba previously created an image generator called Tongyi Wanxiang, which challenged Midjourney and DALL-E. FunAudioLLM extends the company's line of creative models to voice.




Disclaimer

We strive to uphold the highest ethical standards in all of our reporting and coverage. We at StartupNews.fyi want to be transparent with our readers about any potential conflicts of interest that may arise in our work. It’s possible that some of the investors we feature may have connections to other businesses, including competitors or companies we write about. However, we want to assure our readers that this will not have any impact on the integrity or impartiality of our reporting. We are committed to delivering accurate, unbiased news and information to our audience, and we will continue to uphold our ethics and principles in all of our work. Thank you for your trust and support.


