Abstract
Speaker diarization in real-world acoustic environments is a challenging task that has attracted increasing interest from both academia and industry. Although it is widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering a rich variety of real-world scenarios and languages. All clips are naturally shot videos without over-editing such as lens switching. Both audio and video are released. In particular, MSDWild contains a large proportion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.
In Interspeech 2022