Speaker diarization is the process of identifying who is speaking at
different times in audio recordings. This is important in various
situations, such as recording meetings, monitoring calls in call centers,
or analyzing media. In this paper, we examine how well different methods
for speaker diarization perform in real-life scenarios. We focus on two
modern techniques: I-vectors and X-vectors. I-vectors are effective for
automatic speaker recognition because they create compact and
efficient representations of speakers using statistical models. However,
they struggle in situations involving overlapping voices or background
noise. On the other hand, X-vectors overcome these limitations. They
use deep neural networks to create more complex and reliable
representations, making them better suited for challenging conditions.
To evaluate these two approaches, we used standard datasets, specifically
the AMI Meeting Corpus and VoxCeleb. We measured their performance
using two indicators: Diarization Error Rate (DER) and Jaccard Error
Rate (JER). Results show that while I-vectors are less resource-
intensive and work well in ideal conditions, X-vectors perform better in
real-world settings where noise and overlapping speech are present.
This study provides guidance for practitioners in choosing the right
approach based on their needs, considering factors such as accuracy,
computational costs, and reliability.
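
As an illustration of the first evaluation metric named above, DER is conventionally defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is a simplified frame-level version and not the scoring setup used in this study: it assumes one label per fixed-length frame, at most one speaker per frame (so overlapping speech is not modeled), no forgiveness collar, and a brute-force speaker mapping that is only practical for a handful of speakers. The function name frame_der and the toy labels are hypothetical.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level Diarization Error Rate.

    Both inputs are equal-length sequences with one speaker label per frame;
    None marks a non-speech frame. Overlapping speech is not modeled.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    pairs = list(zip(reference, hypothesis))
    total_speech = sum(r is not None for r, _ in pairs)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    # Speech in the reference that the system labeled as non-speech.
    missed = sum(r is not None and h is None for r, h in pairs)
    # Non-speech in the reference that the system labeled as speech.
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Frames where both sides claim speech; these may be speaker confusion.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_speakers = list({r for r, _ in both} | {r for r in reference if r is not None})
    hyp_speakers = list({h for _, h in both})

    # Hypothesis labels are arbitrary, so try every one-to-one mapping of
    # hypothesis speakers onto reference speakers and keep the best one.
    padding = [None] * max(0, len(hyp_speakers) - len(ref_speakers))
    best_correct = 0
    for perm in permutations(ref_speakers + padding, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total_speech


# Toy example: 8 frames, two reference speakers, two hypothesis clusters.
reference = ["A", "A", "A", "B", "B", None, None, "B"]
hypothesis = ["x", "x", "y", "y", "y", "y", None, None]
print(frame_der(reference, hypothesis))
```

In the toy example there are six reference speech frames, one missed frame, one false-alarm frame, and one confused frame under the best mapping (x to A, y to B), so the function returns 0.5. Standard scoring tools work on timed segments rather than frames and handle overlap explicitly, so their numbers are not directly comparable to this sketch.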
Authors : Vinod K. Pande (1), Vijay K. Kale (2), Sangramsing N. Kayte (3)
Affiliations : (1, 2) Dr. G.Y. Pathrikar College of Computer Science and Information Technology, India; (3) University of Copenhagen, Denmark
Keywords : Speaker Diarization, I-Vector, X-Vector, MFCC, Speech Recognition
Published By : ICTACT
Published In : ICTACT Journal on Soft Computing (Volume: 15, Issue: 4, Pages: 3717-3721)
Date of Publication : January 2025