Title: Rapid Speaker and Environment Adaptation in Automatic Speech Recognition - Part I: Parametric Normalization; Part II: Latent Variable Approaches
Authors: Zhang, Xueru; S0208330
Issue Date: 2-Sep-2014
Abstract: The progress made in automatic speech recognition (ASR) during the last few decades has allowed ASR techniques to be introduced into various application fields. With the growing use of ASR in fields such as intelligent voice-driven personal assistants in smart phones, query-by-voice automatic telephone attendants, or voice-driven user interfaces like those used in car navigation systems, the requirements for the performance of speech recognition systems have also increased. One of the demands on ASR systems is robustness against the variability present in the speech signal. Unlike speech recognition by humans, the performance of an ASR system degrades severely when there is variability in the speech signal. Two of the main sources of variability that deteriorate the performance of an ASR system are environmental variability and speaker variability. In this thesis, we focus on compensating for the variability caused by these two main factors.

Additive background noise causes mismatches between the distributions that the automatic speech recognition system learned from the training data and the distribution of the test data feature vectors. To compensate for the variability introduced by additive background noise, we propose a parametric histogram equalization (pHEQ) algorithm which maps the distributions of the feature vectors of both the training and testing data to a common target distribution. A unique property of the proposed mapping is that the noise in the input signal is tracked, allowing the noise distribution observed in the input signal to be mapped to its own normalized distribution located a fixed number of decibels below the target speech distribution. In other words, the pHEQ algorithm tries to normalize both the noise distribution and the speech distribution while maintaining a fixed distance between the two. As a result, the signal-to-noise ratio (SNR) of clean speech (e.g. the training data) is lowered by injecting extra noise, a process called noise masking. When facing noisy speech (e.g. the test data), on the other hand, the algorithm transforms the data to reach a target SNR. A noise power spectrum tracking algorithm allows the pHEQ algorithm to estimate and map the noise distribution even when facing non-stationary noise. By applying pHEQ both during training and testing, the algorithm effectively compensates for the non-linear distortions in the speech feature vectors introduced by additive noise.

The second main factor which deteriorates the performance of an automatic speech recognition system is speaker variability. Speaker variability is caused by differences among speaker characteristics, for example gender, age or the dialect region the person grew up in. The proposed algorithm handles speaker variability by adjusting the acoustic model. More specifically, our model-based algorithm adjusts the state emission density functions, i.e. the Gaussian mixture models (GMMs) of a hidden Markov model (HMM), to better fit the observations of a target speaker. Unique to our method is that the speaker-independent (SI) Gaussian mixture weights are adapted towards speaker-dependent (SD) weights. By expressing the SD weights as a linear combination of a set of latent speaker vectors, the Gaussian mixture weights can be adapted rapidly, given limited amounts of adaptation or enrollment data. Non-negative matrix factorization (NMF) is used to estimate the latent speaker vectors.
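As a rough, hypothetical illustration of this idea (not the thesis implementation), the sketch below factorizes a non-negative matrix V, whose rows could stand for per-speaker Gaussian mixture weight estimates, into W and H using the standard Lee-Seung multiplicative updates. Under that reading, the rows of H act as latent speaker vectors and each row of W holds a speaker's non-negative combination coefficients. All names (`nmf`, `matmul`, the toy matrix) are illustrative.

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, r, iters=500, eps=1e-9):
    """Factorize a non-negative matrix V (m x n) as W (m x r) @ H (r x n)
    with Lee-Seung multiplicative updates for the squared-error objective."""
    m, n = len(V), len(V[0])
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H): keeps H non-negative by construction
        WT = transpose(W)
        num = matmul(WT, V)
        den = matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(n)] for i in range(r)]
        # W <- W * (V H^T) / (W H H^T)
        HT = transpose(H)
        num = matmul(V, HT)
        den = matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(r)] for i in range(m)]
    return W, H

# Toy rank-1 example: three "speakers" whose weight rows are scalings of one basis.
V = [[1.0, 1.0, 2.0],
     [2.0, 2.0, 4.0],
     [3.0, 3.0, 6.0]]
W, H = nmf(V, r=1)
```

The multiplicative form of the updates is what preserves non-negativity, which matters here because mixture weights must stay non-negative (a separate renormalization step would be needed to make each adapted weight row sum to one).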
The NMF-based weight adaptation technique can be combined with existing mean-based (and variance-based) speaker adaptation techniques, for example speaker adaptive training (SAT) and eigenvoice speaker adaptation, to further improve the state emission probabilities by adapting both the Gaussian mixture weights and means (and variances). Replacing the non-negative matrix factorization by a non-negative tensor decomposition allows the adaptation of the Gaussian mixture weights to compensate for both speaker and noise variability. Since weight-based speaker adaptation was already shown to work, the experiments focused on compensating for the noise variability by estimating noise-dependent (ND) mixture weights in the model space. Considering the non-stationarity of the noise, a set of Gaussian mixture weights is estimated for each frame during evaluation.

The proposed techniques are evaluated and analyzed on large-vocabulary continuous speech recognition benchmark tasks: the Wall Street Journal (WSJ) benchmark and the Aurora4 benchmark. The Aurora4 task was constructed by artificially adding different types of noise at different signal-to-noise ratios (SNRs) to the clean WSJ database. The results show that the proposed algorithms significantly improve the performance of ASR systems.
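The quantile-matching principle that underlies histogram equalization, and that the pHEQ algorithm of the abstract parameterizes, can be sketched in a few lines: each feature value is pushed through the empirical CDF of the observed data and then through the inverse CDF of a target distribution. This is a minimal sketch of plain HEQ only; the noise tracking and fixed SNR offset between the speech and noise distributions described above are not modeled, and the name `heq_map` is illustrative.

```python
from statistics import NormalDist

def heq_map(samples, target=NormalDist(mu=0.0, sigma=1.0)):
    """Map each sample x through F_emp(x), then through the target's
    inverse CDF, i.e. quantile matching to the target distribution."""
    n = len(samples)
    # Indices of the samples in ascending order of value.
    order = sorted(range(n), key=lambda i: samples[i])
    mapped = [0.0] * n
    for rank, i in enumerate(order):
        # Empirical CDF value, kept strictly inside (0, 1).
        p = (rank + 0.5) / n
        mapped[i] = target.inv_cdf(p)
    return mapped

# Whatever the input distribution, the output follows the target's quantiles;
# the median sample lands at the target's median (0 for a standard normal).
equalized = heq_map([3.0, 1.0, 2.0, 5.0, 4.0])
```

Because the mapping is monotone, the rank order of the features is preserved; only their distribution is reshaped, which is exactly the property that lets both training and test data be normalized toward a common target.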
Description: Zhang X., ''Rapid speaker and environment adaptation in automatic speech recognition - Part I: Parametric normalization; Part II: Latent variable approaches'', dissertation presented to obtain the degree of Doctor in Engineering Science, KU Leuven, September 2014, Leuven, Belgium.
Publication status: published
KU Leuven publication type: TH
Appears in Collections: ESAT - PSI, Processing Speech and Images

Files in This Item:
File: thesis.pdf (Status: Published; Size: 4232 Kb; Format: Adobe PDF)

These files are only available to some KU Leuven Association staff members


All items in Lirias are protected by copyright, with all rights reserved.