Abstract
Personalized speech enhancement (PSE) removes interfering speech,
background noise, and reverberation by conditioning on a speaker
embedding extracted from the target speaker, such as a d-vector or
an x-vector. In full-duplex communication
scenarios, the near-end microphone also picks up the far-end signal
played back by the loudspeaker, which creates acoustic echo. This
echo is a major cause of degraded sound quality in online
communication systems, including video conferencing. Hence,
acoustic echo cancellation (AEC), a technique for removing these
echoes, has been investigated. In full-duplex communication, where
acoustic echo coexists with background noise and interfering
speech, AEC and PSE must be combined.
We study this combination. Our goal is to develop a causal
model that can be applied to various architectures to efficiently
handle the tasks of AEC, PSE, and joint AEC-PSE. Features are
extracted from both the far-end and the near-end signals. A
cross-attention alignment mechanism aligns the far-end signal
features, and x-vectors serve as the speaker embedding.
We apply the proposed method to PSE models such as E3Net and
VoiceFilter-Lite. Extensive experiments on several standard audio
datasets and real recordings demonstrate its effectiveness across
a range of evaluation metrics.
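
The abstract does not give the details of the cross-attention alignment, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each near-end frame acts as a query over current and past far-end frames (keeping the operation causal), and it omits the learned projections a real model would use. The function name `cross_attention_align` and the `max_delay` parameter are hypothetical.

```python
import math


def cross_attention_align(near, far, max_delay):
    """Softly align far-end frames to near-end frames (illustrative only).

    near, far: lists of T feature frames, each a list of D floats.
    max_delay: assumed upper bound (in frames) on the echo delay.
    Returns (aligned, weights), where aligned[t] is a weighted mix of
    far-end frames t - max_delay .. t, so no future frame is used.
    """
    dim = len(near[0])
    aligned, weights = [], []
    for t, query in enumerate(near):
        # Causal window: only the current and past far-end frames.
        start = max(0, t - max_delay)
        window = far[start:t + 1]
        # Scaled dot-product scores between the query and each key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in window
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Attention-weighted sum of the far-end frames in the window.
        aligned.append([
            sum(p * frame[d] for p, frame in zip(probs, window))
            for d in range(dim)
        ])
        # Full-length weight row, zero outside the causal window.
        row = [0.0] * len(far)
        row[start:t + 1] = probs
        weights.append(row)
    return aligned, weights
```

In a sketch like this, the aligned far-end features would then be combined with the near-end features and the speaker embedding before being passed to the enhancement model; how the paper combines them is not stated in the abstract.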
Authors
Kwon Kim, Yong-Hun Yun, Chol-Nam Om
Kim Il Sung University, Democratic People’s Republic of Korea
Keywords
Personalized Speech Enhancement, Acoustic Echo Cancellation, Cross-Attention Alignment, X-Vector