Abstract
The most challenging problem in video conferencing systems is the
degradation of sound quality caused by various noise sources. Speech
enhancement includes background noise reduction, acoustic echo
cancellation, and dereverberation. A number of studies have addressed
the removal of acoustic echo and background noise in video
conferencing systems, and recently, deep neural network (DNN)
approaches built on classical digital signal processing techniques
have been applied to speech processing, leading to great progress. We
first propose a multi-input deep complex convolution recurrent network
(MIDCCRN) for noise suppression. Then, using this network, we propose
a model for joint acoustic echo cancellation and background noise
suppression in online voice communication systems, including video
conferencing systems. Experiments demonstrate that the proposed method
achieves the best performance on objective metrics, including echo
return loss enhancement (ERLE), signal-to-artifacts ratio (SAR), and
scale-invariant source-to-noise ratio (SI-SNR); on the subjective mean
opinion score (MOS); and on AECMOS, real-time factor (RTF), network
size, and final score.
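For readers unfamiliar with two of the objective metrics named above, the following is a minimal sketch of how ERLE and SI-SNR are commonly computed. It assumes time-aligned, equal-length signals given as plain Python lists; the function names, the `eps` stabilizer, and the mean removal in SI-SNR are illustrative assumptions and not necessarily the exact evaluation code used in this work.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def erle_db(mic, residual, eps=1e-12):
    """Echo return loss enhancement in dB (common definition, assumed here).

    Ratio of microphone-signal power to residual power after echo
    cancellation, usually measured during far-end single talk.
    """
    return 10.0 * math.log10((_dot(mic, mic) + eps) / (_dot(residual, residual) + eps))


def si_snr_db(estimate, target, eps=1e-12):
    """Scale-invariant SNR in dB between an estimate and a clean target."""
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    e = [x - est_mean for x in estimate]  # zero-mean estimate
    t = [x - tgt_mean for x in target]    # zero-mean target
    # Project the estimate onto the target to get the scaled target component.
    scale = _dot(e, t) / (_dot(t, t) + eps)
    s_target = [scale * x for x in t]
    e_noise = [x - y for x, y in zip(e, s_target)]
    return 10.0 * math.log10((_dot(s_target, s_target) + eps) / (_dot(e_noise, e_noise) + eps))
```

Higher values are better for both metrics. Because SI-SNR projects the estimate onto the target, rescaling the estimate by a constant gain leaves the score unchanged.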
Authors
Kum-Song Pak (1), Chol-I Om (2), Kwon Kim (3), Chol-Ui Ri (4), Chol-Nam Om (5)
(1, 3, 4, 5) Kim Il Sung University, Democratic People’s Republic of Korea; (2) University of Sciences, Democratic People’s Republic of Korea
Keywords
Acoustic Echo Cancellation (AEC), Background Noise Suppression (BNS), Multi-Input Deep Complex Convolution Recurrent Network (MIDCCRN), Speech Enhancement (SE)