A Speaker Verification Method Based On TDNN-LSTMP
https://doi.org/10.1007/s00034-019-01092-3
Abstract
In speaker recognition, a robust recognition method is essential. This paper proposes
a speaker verification method that is based on the time-delay neural network (TDNN)
and long short-term memory with recurrent projection layer (LSTMP) model for the speaker modeling problem in speaker verification. In this work, we present the application of the fusion of TDNN and LSTMP to the i-vector speaker recognition system
that is based on the Gaussian mixture model-universal background model. By using
a model that can establish long-term dependencies to create a universal background
model that contains a larger amount of speaker information, it is possible to extract
more feature parameters, which are speaker dependent, from the speech signal. We
conducted experiments with this method on four corpora: two in Chinese and two in
English. The equal error rate, minimum detection cost function and detection error
tradeoff curve are used as criteria for system performance evaluation. The exper-
imental results show that the TDNN–LSTMP/i-vector speaker recognition method
outperforms the baseline system on both Chinese and English corpora and has better
robustness.
1 Introduction
Speaker verification (SV) is a branch of speaker recognition (SR) that refers to authen-
ticating the claimed identity of a speaker based on a speech signal and an enrolled
speaker record [15]. The result of this task is a binary decision: “accept” or “reject.”
Deep neural network models can process speech signals and effectively learn dynamic, time-based signals for long-range, time-dependent modeling.
A DNN can model long time spans as well as high-dimensional and correlated features. In addition to using context windows as input features for modeling temporal dynamics, neural network architectures can capture long-term dependencies between sequential events. The TDNN is an architecture designed for processing sequential data. The TDNN is a feedforward network; however, it introduces a set of delays on the layer weights that are associated with the inputs [14]. By adding a set of delays to the input, the data can be represented at different points in time, which allows the TDNN to have a finite dynamic response to time-series input data. The TDNN is similar to a convolutional neural network (CNN) in which the convolution is applied only along the time axis. The TDNN starts with a short context and expands the learned context as the number of hidden layers increases. Therefore, the TDNN can represent temporal context better than a standard DNN.
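For illustration, the following numpy sketch (not part of the paper's implementation) shows how time-delayed splicing represents the input at several points in time and how a sub-sampled TDNN widens this context layer by layer; the offsets and dimensions are hypothetical.

```python
import numpy as np

def splice(frames, offsets):
    """Concatenate the feature frames at the given time offsets around every centre frame.

    frames:  (T, D) array of acoustic features
    offsets: relative time offsets, e.g. (-2, -1, 0, 1, 2)
    returns: (T, D * len(offsets)) array; out-of-range indices are clamped to the edges
    """
    T = len(frames)
    out = []
    for t in range(T):
        idx = [min(max(t + o, 0), T - 1) for o in offsets]
        out.append(np.concatenate([frames[i] for i in idx]))
    return np.stack(out)

# Hypothetical per-layer contexts of a sub-sampled TDNN: the first layer sees a dense
# context, deeper layers see only a few sparse offsets, so the effective context grows
# with depth while each layer processes few inputs (the affine transform and nonlinearity
# that follow each splice are omitted here).
feats = np.random.randn(300, 40)                 # 3 s of 40-dim MFCC frames (10 ms shift)
layer1_in = splice(feats, (-2, -1, 0, 1, 2))     # dense context at the first layer
layer2_in = splice(layer1_in, (-1, 2))           # sub-sampled context at a deeper layer
print(layer1_in.shape, layer2_in.shape)          # (300, 200) (300, 400)
```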
In a feedforward neural network, information can flow only from the lower layers to the upper layers, without considering the historical information computed by the same layer in previous steps. The LSTM [20] model not only provides both long-term and short-term memory but can also simulate the selective forgetting of the human brain. In recent years, LSTM RNNs have been successfully applied to speech recognition and have significantly improved recognition performance. The LSTM hidden layer consists of a set of recurrently connected units called “memory blocks.” Each memory block contains one or more self-connected memory cells and three gates, namely the input, output and forget gates, to control the flow of information; that is, they provide the memory cells with continuous write, read and reset operations. Among the many LSTM variants, we use the structure proposed in [18], called LSTMP, whose architecture is shown in Fig. 1. This structure reduces the number of neurons in the recurrent (feedback) layer; decreasing this number reduces the number of parameters in the network and makes the acoustic model more robust. A summary of the LSTMP formulas is as follows:
m_t = o_t ⊙ tanh(c_t)    (5)
p_t = W_pm m_t    (6)
r_t = W_rm m_t    (7)
y_t = (p_t, r_t)    (8)
where the W terms are weight matrices and the b terms are biases, σ is the sigmoid function, and i_t, f_t, o_t, c_t, m_t, r_t and p_t are the input gate, forget gate, output gate, cell state, cell output, recurrent and projection vectors, respectively. The cell input and cell output activation functions are generally tanh.
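As a concrete illustration (not code from the paper), the following numpy sketch implements one LSTMP time step: the gate and cell equations follow the standard formulation of [18] with peephole connections omitted, Eqs. (5)-(8) appear explicitly, and the layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x_t, r_prev, c_prev, W, b):
    """One LSTMP step: gates and cell as in [18] (no peepholes), then Eqs. (5)-(8)."""
    z = np.concatenate([x_t, r_prev])                  # input plus previous projected state
    i_t = sigmoid(W["i"] @ z + b["i"])                 # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])                 # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])   # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])                 # output gate
    m_t = o_t * np.tanh(c_t)                           # Eq. (5): cell output
    p_t = W["pm"] @ m_t                                # Eq. (6): output projection
    r_t = W["rm"] @ m_t                                # Eq. (7): recurrent projection
    return np.concatenate([p_t, r_t]), r_t, c_t        # Eq. (8): y_t, plus new state

# Illustrative sizes: 40-dim input, 1024 cells, 256-dim projection (hypothetical values).
x_dim, cell, proj = 40, 1024, 256
W = {k: rng.standard_normal((cell, x_dim + proj)) * 0.01 for k in "ifco"}
W["pm"] = rng.standard_normal((proj, cell)) * 0.01
W["rm"] = rng.standard_normal((proj, cell)) * 0.01
b = {k: np.zeros(cell) for k in "ifco"}
y_t, r_t, c_t = lstmp_step(rng.standard_normal(x_dim), np.zeros(proj), np.zeros(cell), W, b)
print(y_t.shape)   # (512,): projected output p_t concatenated with recurrent state r_t
```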
To improve the modeling capabilities of acoustic models, it has become a trend to combine different network structures in different layers [6, 10]. Therefore, combining the advantages of the TDNN and LSTM network models, in this paper we use a TDNN–LSTMP network structure together with the i-vector to model the speaker space. In the process of extracting sufficient statistics, this method uses the deep neural network model based on tied triphone states from speech recognition to replace the UBM in the UBM/i-vector framework and to extract the posterior probability of each class for each frame. The difference between the GMM and the TDNN–LSTMP network in speaker recognition is that in the GMM, each mixture component represents one class obtained by unsupervised clustering; each class has no specific meaning and only represents a region of the acoustic feature space. In contrast, in the neural network model based on tied triphone states, each output node represents a class, and the classes are obtained by decision-tree clustering in speech recognition, so the tied triphone states have a clear correspondence with the speech content. The DNN is used to calculate the posterior probability of each class for each frame and to extract hidden information in the speaker's voice, so more speaker information can be mined.
The proposed TDNN–LSTMP network structure is composed of TDNN and
LSTMP hidden layers, and its combination is shown in Fig. 2. The figure shows a
network structure with six hidden layers. In the TDNN layers, we use the sub-sampled TDNN to obtain sufficient speech feature information from the input time steps, which allows the inputs of each layer to be spaced apart in time. Behind the TDNN hidden layers, the LSTMP layers follow.
3.1 Datasets
In the experiment, we trained the UBM model, the T-matrix, and the PLDA back end
using the LibriSpeech clean-100 [13] dataset which includes 100 h of clean speech of
28539 English utterances from 251 speakers (125 female and 126 male speakers).
The test set of this paper used four corpora (two English corpora and two Chinese
corpora). The corpus parameters are shown in Table 2. LibriSpeech is a corpus of read English speech derived from LibriVox audiobooks. Timit [5] is a read speech corpus whose speakers come from the 8 major dialect regions of the United States of America. Aishell [8] is a multi-channel Mandarin corpus recorded in a quiet office environment, covering 11 domains such as smart home and autonomous driving. Thchs-30 [21] is a Mandarin corpus recorded in a clean environment, with content selected from a large amount of news text. All datasets are sampled at 16 kHz.
The indicators that are used in this paper are equal error rate (EER), minimum detection
cost function (MinDCF) and detection error tradeoff (DET) curve according to the
2008 NIST speaker recognition evaluation (SRE) [12]. The EER is the error rate when
the false rejection rate is equal to the false acceptance rate. The detection cost function
is defined by the following expression:
C_det = C_Miss × P_Miss|Target × P_target + C_FalseAlarm × P_FalseAlarm|Nontarget × (1 − P_target)    (9)
where C_Miss and C_FalseAlarm are the costs of a missed detection and a false alarm, respectively, P_Miss|Target and P_FalseAlarm|Nontarget are the miss rate and the false alarm rate at a given threshold θ, and P_target is the prior probability of the specified target speaker. In the NIST SRE task, the parameters of MinDCF are defined as C_Miss = 1, C_FalseAlarm = 10, and P_target = 0.01. The DET curve gives a comprehensive performance
evaluation of the speaker recognition system, and the minDCF is the most important
evaluation indicator of system performance.
The lower the EER and MinDCF are, the closer the DET curve is to the origin and
the better the recognition performance of the system is. The distance between DET
curves can effectively describe the differences between systems.
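As an illustration (not the official NIST scoring software), the following Python sketch shows one way to compute the EER and the MinDCF of Eq. (9) from a list of trial scores; the toy target and non-target scores are synthetic.

```python
import numpy as np

def eer_and_min_dcf(scores, labels, c_miss=1.0, c_fa=10.0, p_target=0.01):
    """Compute EER and MinDCF from trial scores (a sketch, not the NIST scoring tool).

    scores: array of verification scores, one per trial
    labels: 1 for target trials, 0 for non-target trials
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    order = np.argsort(scores)               # sweep the decision threshold over sorted scores
    labels = labels[order]
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    # P_miss: targets scored below the threshold; P_fa: non-targets scored above it
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_non])
    eer = p_miss[np.argmin(np.abs(p_miss - p_fa))]
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)   # Eq. (9)
    return eer, dcf.min()

# Toy example with synthetic scores (for illustration only)
rng = np.random.default_rng(1)
tar, non = rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 10000)
eer, min_dcf = eer_and_min_dcf(np.concatenate([tar, non]),
                               np.concatenate([np.ones(1000), np.zeros(10000)]))
print(f"EER = {eer:.3f}, MinDCF = {min_dcf:.3f}")
```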
This article describes experiments that use the open-source Kaldi toolkit under the Linux operating system Ubuntu 16.04. Kaldi is a widely used, open-source speech recognition toolkit written in C++ [16]. In the training of a neural network, a GPU can substantially increase the training speed; thus, a GPU is used to train the neural networks. When a GPU is used for neural network training, the number of parallel training processes is set and a fixed sample is extracted for each parallel process at each training step; after the fully parallel training of the neural network is completed, the mean training time of the neural network is calculated. One iteration of the network structure must therefore be completed many times. We adopt mel-frequency cepstral coefficients (MFCCs) as the acoustic feature representation for training the neural networks [9]. The experimental configuration of the system structure is described below.
The speaker recognition model based on the UBM/i-vector method [22] is shown in Fig. 3, where the UBM provides the sufficient statistics for the extraction of the i-vector. Based on the MFCC features, a UBM is first trained using the expectation-maximization algorithm. The T-matrix is a low-rank total variability matrix, and the T-matrix and the i-vector are estimated using the zeroth-order and first-order Baum–Welch statistics of the UBM. The PLDA back end calculates the similarity scores between the i-vectors.
For two given speech segments, assuming that their corresponding i-vectors are w_1 and w_2, respectively, the PLDA model calculates the similarity with the following log-likelihood ratio:

score = log [ p((w_1, w_2) | θ_tar) / p((w_1, w_2) | θ_nontar) ]    (10)

where θ_tar denotes the hypothesis that w_1 and w_2 are from the same speaker, and θ_nontar denotes the hypothesis that w_1 and w_2 are from different speakers. The higher the score, the more likely it is that the two utterances come from the same speaker.
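For concreteness, here is a minimal Python sketch of this scoring rule under a simplified two-covariance PLDA model; the zero-mean i-vectors and the between- and within-speaker covariances B and W are illustrative assumptions, not the exact Kaldi back end used in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w1, w2, B, W):
    """Eq. (10) for a simplified two-covariance PLDA model (a sketch).

    B: between-speaker covariance, W: within-speaker covariance.
    Under theta_tar the two i-vectors share one latent speaker variable; under
    theta_nontar they are independent. Both hypotheses are zero-mean Gaussians
    over the stacked vector [w1, w2], differing only in the cross-covariance block.
    """
    d = len(w1)
    stacked = np.concatenate([w1, w2])
    tot = B + W
    cov_tar = np.block([[tot, B], [B, tot]])                       # shared speaker
    cov_non = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    return (multivariate_normal.logpdf(stacked, mean=np.zeros(2 * d), cov=cov_tar)
            - multivariate_normal.logpdf(stacked, mean=np.zeros(2 * d), cov=cov_non))

# Toy usage with hypothetical 100-dim i-vectors and isotropic covariances
rng = np.random.default_rng(0)
d = 100
B, W = 0.6 * np.eye(d), 0.4 * np.eye(d)
spk = rng.standard_normal(d)
w1, w2 = spk + 0.5 * rng.standard_normal(d), spk + 0.5 * rng.standard_normal(d)
print(plda_llr(w1, w2, B, W))     # same-speaker pair, so a positive score is expected
```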
In this paper, the TDNN–LSTMP network is used to create the UBM from the speaker features; under this framework, the GMM parameters of the UBM are estimated from the DNN posteriors, and the update formulas are as follows:
γ_k^(i) = P(k | y_i, Φ)    (11)

λ_k = (1/N) Σ_{i=1}^{N} γ_k^(i)    (12)

μ_k = ( Σ_{i=1}^{N} γ_k^(i) x_i ) / ( Σ_{i=1}^{N} γ_k^(i) )    (13)

Σ_k = ( Σ_{i=1}^{N} γ_k^(i) (x_i − μ_k)(x_i − μ_k)^T ) / ( Σ_{i=1}^{N} γ_k^(i) )    (14)
Here, for the given speaker recognition features x_i, λ_k, μ_k and Σ_k are, respectively, the weight, mean and covariance of the k-th Gaussian. The DNN parameters are denoted by Φ, and P(k | y_i, Φ), i.e., γ_k^(i), is the posterior probability of triphone state k at frame i given the DNN features y_i.
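A minimal numpy sketch of Eqs. (11)-(14) is given below (an illustration, not the Kaldi implementation); it assumes the per-frame posteriors have already been produced by the TDNN–LSTMP network, and the features and posteriors here are synthetic.

```python
import numpy as np

def gmm_from_dnn_posteriors(feats, post):
    """Estimate GMM parameters from DNN posteriors, a sketch of Eqs. (11)-(14).

    feats: (N, D) speaker recognition features x_i (e.g., MFCCs)
    post:  (N, K) per-frame posteriors gamma_k^(i) = P(k | y_i, Phi), assumed
           to come from the tied-triphone-state classifier
    """
    N, D = feats.shape
    occ = post.sum(axis=0)                                   # soft counts per class
    weights = occ / N                                        # Eq. (12)
    means = (post.T @ feats) / occ[:, None]                  # Eq. (13)
    covs = np.empty((post.shape[1], D, D))
    for k in range(post.shape[1]):                           # Eq. (14), one class at a time
        diff = feats - means[k]
        covs[k] = (post[:, k, None] * diff).T @ diff / occ[k]
    return weights, means, covs

# Toy usage with random features and softmax posteriors (illustrative only)
rng = np.random.default_rng(0)
x = rng.standard_normal((5000, 60))
logits = rng.standard_normal((5000, 8))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
w, mu, sigma = gmm_from_dnn_posteriors(x, gamma)
print(w.sum(), mu.shape, sigma.shape)       # ~1.0 (8, 60) (8, 60, 60)
```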
The baseline is a standard i-vector system that is based on the GMM–UBM Kaldi
SRE10 V1 [16]. The front end consists of 20 MFCCs with a 25-ms frame length, and
the features are mean-normalized over a 3-s window. All features are 60-dimensional,
including the first-order and second-order difference coefficients. Non-
speech segments are removed with energy-based voice activity detection. The UBM
is a 1425-component full-covariance GMM. The GMM–UBM is first trained with diagonal covariance matrices for 4 EM iterations and then with full covariance matrices for 4 further iterations. The 100-dimensional i-vector extractor is trained for 5 EM iterations. The back end is PLDA scoring.
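For illustration, the following Python sketch (using librosa rather than the Kaldi recipe actually employed) approximates this front end: 20 MFCCs from 25-ms frames, deltas and delta-deltas appended to 60 dimensions, a 3-s sliding-mean normalization, and a simple energy-based VAD; the function name, the wav path argument and the VAD threshold rule are assumptions.

```python
import numpy as np
import librosa

def baseline_frontend(wav_path, sr=16000):
    """Approximation of the baseline front end described above (a sketch, not the Kaldi recipe)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc, order=1),
                       librosa.feature.delta(mfcc, order=2)]).T        # (T, 60)

    # Sliding mean normalization over a 3-s (~300-frame) window
    half = 150
    norm = np.empty_like(feats)
    for t in range(len(feats)):
        lo, hi = max(0, t - half), min(len(feats), t + half + 1)
        norm[t] = feats[t] - feats[lo:hi].mean(axis=0)

    # Energy-based VAD: keep frames whose log energy exceeds a data-dependent threshold
    e = librosa.feature.rms(y=y, frame_length=int(0.025 * sr),
                            hop_length=int(0.010 * sr))[0]
    n = min(len(norm), len(e))
    log_e = np.log(e[:n] + 1e-10)
    keep = log_e > log_e.mean() - 0.5 * log_e.std()     # hypothetical threshold rule
    return norm[:n][keep]
```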
3.3.2 TDNN–LSTMP/i-Vector
TDNN–LSTMP is used to create a UBM that models phonetic content, which differs
from the baseline in the UBM training steps. The system creates the GMM from the DNN posteriors together with the speaker features. In the extraction of the i-vector, both systems
follow Fig. 3. We use the TDNN–LSTMP architecture that is described in Sect. 2. The
output dimension of the TDNN layer is 1024, and the LSTM layer has a cell dimension
of 1024. The input features are 40 MFCCs with a 25-ms frame length.
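As a concrete, purely illustrative rendering of this architecture, the following PyTorch sketch stacks dilated 1-D convolutions (the TDNN layers) in front of an LSTM with a recurrent projection (the LSTMP layer); the splicing contexts, projection size, layer count and senone count are hypothetical and do not reproduce the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TDNNLSTMP(nn.Module):
    """Sketch of a TDNN-LSTMP senone classifier in the spirit of Sect. 2.

    Only the 1024-dim TDNN/LSTM sizes and the 40-dim MFCC input follow the
    description above; everything else is an assumption for illustration.
    """
    def __init__(self, feat_dim=40, hidden=1024, proj=256, num_senones=3000):
        super().__init__()
        # TDNN layers as 1-D convolutions over time; dilation widens the temporal context
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # LSTMP: LSTM with a recurrent projection layer (proj_size), as in [18]
        self.lstmp = nn.LSTM(input_size=hidden, hidden_size=hidden,
                             proj_size=proj, batch_first=True)
        self.output = nn.Linear(proj, num_senones)   # posteriors over tied triphone states

    def forward(self, x):                                   # x: (batch, time, feat_dim)
        h = self.tdnn(x.transpose(1, 2)).transpose(1, 2)    # back to (batch, time', hidden)
        h, _ = self.lstmp(h)
        return self.output(h).log_softmax(dim=-1)           # per-frame log posteriors

frames = torch.randn(2, 200, 40)          # two toy utterances of 2 s of MFCC frames
log_post = TDNNLSTMP()(frames)
print(log_post.shape)                     # (2, 186, 3000): TDNN context trims the edges
```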
3.4 Results
To evaluate the effectiveness of the proposed method, three sets of experiments were
performed based on the four corpora mentioned above. Part 1 compares the perfor-
mances of the UBM/i-vector and TDNN–LSTMP/i-vector systems on the LibriSpeech,
Timit, Aishell, and Thchs-30 corpora. Part 2 compares the performances on the four corpora under various numbers of enrollment utterances. Part 3 tests the performance of the systems under various speech duration conditions. Because of the different recording
conditions and methods of the corpora, the four corpora represent different experi-
mental environments.
Experiment 2 studied the influence of the number of enrollment utterances per speaker on the two systems.
[Figure: DET curves (miss probability vs. false alarm probability, in %) of the UBM/i-vector and TDNN-LSTMP/i-vector systems on the LibriSpeech, Timit, Aishell, and Thchs-30 corpora.]

Tables 5 and 6 present comparisons
of the EER and minDCF of the systems, respectively, when the number of enrolled utterances is increased from 1 to 20. For a more intuitive comparison, the corresponding line graphs are drawn, as shown in Figs. 5 and 6. Figures 5 and 6 show that the performance of both systems improves as the number of enrolled utterances increases. Once the number of enrolled utterances reaches approximately eight, the performance of the systems stabilizes. On the LibriSpeech test set, which comes from the same corpus as the training data, the performance is relatively good even when the number of enrolled utterances is as low as 1 or 2, and the number of enrolled utterances does not have a large impact.
This experiment tests the performance of the two systems on the Aishell and LibriSpeech corpora under various speech duration conditions; the results are shown in Table 7.
Table 5 SV performance on the test set, with 1, 2, 3, 4, 8, 12, and 20 utterances for enrollment in terms of
EER (%)
Table 6 SV performance on the test set, with 1, 2, 3, 4, 8, 12, and 20 utterances for enrollment in terms of
minDCF
Fig. 5 UBM/i-vector SV performance on the test set, with 1, 2, 3, 4, 8, 12, and 20 utterances for enrollment
in terms of EER
Fig. 6 TDNN–LSTMP/i-vector SV performance on the test set, with 1, 2, 3, 4, 8, 12, and 20 utterances for
enrollment in terms of EER
From the above experiments, we can see that the speaker verification system based
on TDNN–LSTMP significantly outperforms the UBM/i-vector-based speaker verifi-
cation system on both Chinese and English data sets. The experiments show that our
system always outperforms the traditional i-vector system on short-duration speaker
recognition tasks. From the experiments, we can also conclude that in the speaker verification system, when the number of enrolled utterances reaches eight, the system's recognition performance has already reached its optimal value. This experimental observation can provide guidance for the practical design of many speaker recognition systems in the future, especially for systems that can use only limited resources.

Table 7 SV performance on the test set, with various duration conditions in terms of EER (%)
4 Conclusion
This paper studies the TDNN–LSTMP network applied to the speaker recognition task, using a TDNN–LSTMP fusion network to create a UBM and to extract additional speaker-related information from the speech signal. The paper compares the mainstream SV method, namely UBM/i-vector, with the proposed TDNN–LSTMP/i-vector method on recognition tasks over four Chinese and English corpora. The experimental results show that TDNN–LSTMP/i-vector improves the speech modeling capability and the recognition performance; it outperforms the baseline system on both Chinese and English corpora and has better robustness.

Our future work will study speaker recognition systems in noisy environments and the fusion of multiple systems with various features.
Acknowledgements The research reported here was supported by the National Natural Science Foundation of China (No. 31772064).
References
1. Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157 (1994). https://doi.org/10.1109/72.279181
2. G.E. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20(1), 30 (2012). https://doi.org/10.1109/TASL.2011.2134090
3. N. Dehak, P.J. Kenny, R. Dehak, Front-end factor analysis for speaker verification. IEEE Trans. Audio
Speech Lang. Process. 19(4), 788 (2011). https://doi.org/10.1109/TASL.2010.2064307
4. D. Garcia-Romero, X. Zhou, C.Y. Espy-Wilson, Multicondition training of Gaussian PLDA models in
i-vector space for noise and reverberation robust speaker recognition, in IEEE International Conference
on Acoustics, Speech and Signal Processing (2012), pp. 4257–4260
5. J.S. Garofolo, L.F. Lamel, W.M. Fisher, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N 93 (1993)
6. K.J. Han, S. Hahm, B.H. Kim, Deep learning-based telephony speech recognition in the wild, in
INTERSPEECH (2017), pp. 1323–1327. https://doi.org/10.21437/Interspeech.2017-1695
7. G. Hinton, L. Deng, D. Yu, Deep neural networks for acoustic modeling in speech recognition: the
shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82 (2012). https://doi.org/10.
1109/MSP.2012.2205597
8. H. Bu, J. Du, X. Na, B. Wu, H. Zheng, AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline, in Oriental COCOSDA 2017 (2017)
9. V. Joshi, N.V. Prasad, S. Umesh, Modified mean and variance normalization: transforming to utterance-
specific estimates. Circuits Syst. Signal Process. 35(5), 1593 (2016). https://doi.org/10.1007/s00034-
015-0129-y
10. S. Karita, A. Ogawa, M. Delcroix, Forward-backward convolutional lstm for acoustic modeling, in
INTERSPEECH (2017), pp. 1601–1605. https://doi.org/10.21437/Interspeech.2017-554
11. P. Kenny, Bayesian speaker verification with heavy tailed priors, in Proceedings of the Odyssey Speaker
and Language Recognition Workshop, Brno, Czech Republic (2010)
12. NIST, The NIST year 2008 speaker recognition evaluation plan. https://www.nist.gov/sites/default/
files/documents/2017/09/26/sre08_evalplan_release4.pdf. Accessed 3 Apr 2008
13. V. Panayotov, G. Chen, D. Povey, Librispeech: an ASR corpus based on public domain audio books, in
IEEE International Conference on Acoustics, Speech and Signal Processing (2015), pp. 5206–5210.
https://doi.org/10.1109/ICASSP.2015.7178964
14. V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling
of long temporal contexts, in INTERSPEECH (2015), pp. 3214–3218
15. A. Poddar, M. Sahidullah, G. Saha, Speaker verification with short utterances: a review of challenges,
trends and opportunities. IET Biom. 7(2), 91 (2018). https://doi.org/10.1049/iet-bmt.2017.0065
16. D. Povey, A. Ghoshal, G. Boulianne, The Kaldi speech recognition toolkit, in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (IEEE Signal Processing Society, 2011)
17. D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1–3), 19 (2000)
18. H. Sak, A. Senior, F. Beaufays, Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition, in INTERSPEECH (2014), pp. 338–342. https://arxiv.org/abs/1402.1128
19. D. Snyder, D. Garcia-Romero, D. Povey, Time delay deep neural network-based universal background models for speaker recognition, in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2015). https://doi.org/10.1109/ASRU.2015.7404779
20. L.M. Surhone, M.T. Tennoe, S.F. Henssonow, Long Short Term Memory (Betascript Publishing, Riga,
2010)
21. D. Wang, X. Zhang, THCHS-30: a free Chinese speech corpus (2015). https://arxiv.org/abs/1512.01882
22. Y. Xu, I. Mcloughlin, Y. Song, Improved i-vector representation for speaker diarization. Circuits Syst.
Signal Process. 35(9), 3393 (2016). https://doi.org/10.1007/s00034-015-0206-2