29 Feb, 2024

Alibaba’s new AI system EMO creates realistic talking and singing videos from photos

Researchers at Alibaba's Institute for Intelligent Computing have developed a new artificial intelligence system called "EMO," short for Emote Portrait Alive, that can animate a single portrait photo and generate videos of the person talking or singing in a remarkably lifelike fashion.

The system, described in a research paper published on arXiv, is able to create fluid and expressive facial movements and head poses that closely match the nuances of a provided audio track. This represents a major advance in audio-driven talking head video generation, an area that has challenged AI researchers for years.

“Traditional techniques often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles,” said lead author Linrui Tian in the paper. “To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks.”

The EMO system employs an AI technique known as a diffusion model, which has proven remarkably effective at generating realistic synthetic imagery. The researchers trained the model on a dataset of over 250 hours of talking head videos curated from speeches, films, TV shows, and singing performances.

Unlike previous methods that rely on 3D face models or blend shapes to approximate facial movements, EMO directly converts the audio waveform into video frames. This allows it to capture subtle motions and identity-specific quirks associated with natural speech.
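The direct audio-to-video idea can be illustrated with a toy sketch. This is not Alibaba's actual EMO code (the paper's architecture is a trained neural network; every shape, function, and constant below is a hypothetical stand-in), but it shows the basic diffusion loop the article describes: start from noise and iteratively denoise a frame, conditioning each step on an audio feature vector rather than on a 3D face model or landmarks.

```python
import numpy as np

# Toy illustration (NOT the actual EMO implementation) of audio-conditioned
# diffusion: a frame is generated from pure noise by repeated denoising,
# with each step conditioned on an audio embedding so facial motion can
# follow the sound. All shapes and the "denoiser" are hypothetical.

rng = np.random.default_rng(0)

FRAME_SHAPE = (8, 8)   # tiny stand-in for a video frame
AUDIO_DIM = 4          # stand-in for an audio feature vector
STEPS = 10             # number of reverse-diffusion steps

def toy_denoiser(noisy_frame, audio_embedding, t):
    """Hypothetical denoiser: predicts the noise to subtract at step t.
    A real system would use a trained network here."""
    # Mix in the audio conditioning so the output depends on the sound.
    audio_bias = audio_embedding.mean() * 0.01
    return noisy_frame * (t / STEPS) - audio_bias

def generate_frame(audio_embedding):
    x = rng.standard_normal(FRAME_SHAPE)       # start from pure noise
    for t in range(STEPS, 0, -1):              # reverse diffusion
        predicted_noise = toy_denoiser(x, audio_embedding, t)
        x = x - predicted_noise                # remove predicted noise
    return x

audio = rng.standard_normal(AUDIO_DIM)
frame = generate_frame(audio)
print(frame.shape)  # (8, 8)
```

The key point the sketch captures is that the audio features enter the denoiser directly at every step, so no intermediate 3D model or landmark stage is needed between waveform and pixels.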

According to experiments described in the paper, EMO significantly outperforms existing state-of-the-art methods on metrics measuring video quality, identity preservation, and expressiveness. The researchers also conducted a user study that found the videos generated by EMO to be more natural and emotive than those produced by other systems.

Beyond conversational videos, EMO can also animate singing portraits with appropriate mouth shapes and evocative facial expressions synchronized to the vocals. The system supports generating videos for an arbitrary duration based on the length of the input audio.
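Since the video length simply follows the input audio, the frame budget is a back-of-envelope calculation. The frame rate below is an assumption for illustration; the paper's actual rate may differ.

```python
import math

FPS = 25  # assumed frame rate for illustration; not taken from the paper

def frames_for_audio(duration_seconds: float, fps: int = FPS) -> int:
    """Number of video frames needed to cover an audio clip of the
    given duration, at the given frame rate."""
    return math.ceil(duration_seconds * fps)

print(frames_for_audio(3.0))  # 75 frames for a 3-second clip
```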

“Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism,” the paper states.

The EMO research hints at a future where personalized video content can be synthesized from just a photo and an audio clip. However, ethical concerns remain about potential misuse of such technology to impersonate people without consent or spread misinformation. The researchers say they plan to explore methods to detect synthetic video.

Source: VentureBeat
