Prosody Modeling with 3D Visual Information for Expressive Video Dubbing

The automatic video dubbing task is proposed to meet personal and industrial demands for dubbing. Current methods mostly focus on duration matching and overlook the synchronization of prosody, and thus lack expressiveness. In this paper, we introduce visual prosody modeling to promote expressiveness for video dubbing. Visual prosody is defined as the expression and head pose in 3D space, which has the advantages of 1) high relevance to the tone and stress of utterances; 2) higher accuracy than 2D images; and 3) disentanglement from irrelevant factors such as speaker identity. We propose a 3D-VD (3D Video Dubber) system to incorporate visual prosody, utilizing a visual-text step-wise aligner to control the generated prosody. Experiments demonstrate that the proposed method outperforms previous methods that only consider 2D face images in terms of naturalness, lip-speech alignment, and synchronization of visual and auditory prosody. A case study demonstrates the correlation between expression and pitch.
