Contribution

Evaluation of lip synchronization algorithms via SRT measurements in virtual environments

Day / Time: 20.03.2025, 11:00-11:40
Type: Poster
Abstract ID: DAS-DAGA2025/461
Abstract: In this study, two algorithms for real-time lip animation are evaluated via speech reception threshold (SRT) measurements in a virtual audio-visual environment. The first algorithm is a deep neural network that takes images of human faces cropped to the lips as input and outputs seven blendshapes. The second algorithm takes the speech signal power in four frequency bins of the smoothed short-term power spectral density as input and outputs three blendshapes. Blendshapes are 3D mesh deformations that allow 3D objects to be animated. Both algorithms were used to animate a virtual avatar and record videos of it, using the audio of the female-speaker Oldenburger Satztest. Four conditions were presented to 10 normal-hearing participants: the animated videos from each of the two algorithms with audio, the original audio-visual version of the Oldenburger Satztest, and an audio-only condition. All conditions were presented in a virtual environment in noise. SRTs, defined as the signal-to-noise ratio (SNR) at which 80% sentence intelligibility is reached, were measured in each condition. The results confirmed the established audio-visual benefit in SRTs for the original videos compared to the audio-only condition, whereas neither real-time lip-synchronization algorithm yielded such a benefit. We conclude that real-time lip synchronization requires more precise and better-trained models to potentially yield an audio-visual benefit.
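
To make the second, signal-based algorithm more concrete, the following is a minimal sketch of how four band powers from a smoothed short-term spectrum could be mapped to three blendshape weights. It is not the implementation evaluated in the study: the band edges, the blendshape names, the mixing weights, the smoothing constant, and the level normalization are all hypothetical values chosen for illustration, since the abstract does not specify them.

```python
import cmath
import math

# All constants below are illustrative assumptions, not values from the study.
BAND_EDGES_HZ = [0.0, 500.0, 1500.0, 2500.0, 4000.0]   # four hypothetical bands
BLENDSHAPES = ["jaw_open", "lips_wide", "lips_pucker"]  # hypothetical names
WEIGHTS = [                                             # hypothetical 3x4 mixing matrix
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.3, 0.7, 0.0],
    [0.6, 0.0, 0.0, 0.4],
]


def band_powers(frame, sample_rate, band_edges_hz=BAND_EDGES_HZ):
    """Sum the periodogram of one Hann-windowed frame into frequency bands."""
    n = len(frame)
    windowed = [
        x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    powers = [0.0] * (len(band_edges_hz) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        # Naive DFT coefficient; adequate for short frames in a sketch.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        power = abs(coeff) ** 2 / n
        for b in range(len(powers)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                powers[b] += power
                break
    return powers


def smooth_powers(previous, current, alpha):
    """Exponential smoothing over time: alpha weights the previous estimate."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]


def to_blendshapes(powers, weights=WEIGHTS, floor_db=-60.0):
    """Map band powers to blendshape weights clipped to the range 0..1."""
    levels = []
    for p in powers:
        level_db = 10.0 * math.log10(p + 1e-12)
        # Normalize so that floor_db maps to 0 and 0 dB maps to 1.
        levels.append(min(1.0, max(0.0, (level_db - floor_db) / -floor_db)))
    return [
        min(1.0, max(0.0, sum(w * l for w, l in zip(row, levels))))
        for row in weights
    ]


if __name__ == "__main__":
    sample_rate = 8000
    # One 64-sample frame of a 1 kHz tone as a stand-in for speech.
    frame = [math.sin(2.0 * math.pi * 1000.0 * i / sample_rate) for i in range(64)]
    smoothed = smooth_powers([0.0] * 4, band_powers(frame, sample_rate), alpha=0.5)
    print(dict(zip(BLENDSHAPES, to_blendshapes(smoothed))))
```

In a real-time setting, `smooth_powers` would be called once per incoming frame with the previous smoothed estimate, so the blendshape weights change gradually instead of flickering from frame to frame.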