2nd European Conference on Speech Communication and Technology (Eurospeech 1991), 1991
We show how the objective measure of mutual information (MI) can be used to confirm that information for identifying Place of Articulation (PoA) for plosives in Vowel-Plosive-Vowel (VPV) context is concentrated at both ON (burst onset) and OFF (voicing termination) events in the acoustic spectrogram and is predominantly dynamic rather than static. We then run recognition tests to show that single-speaker plosive PoA in VPV context can be reliably identified from just one pair of short-term spectra centred at either ON or OFF position.
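As an illustration of the measure referred to above, the following is a minimal sketch of a plug-in estimate of mutual information between two discrete variables. It assumes the acoustic features have already been quantised into discrete symbols; the abstract does not say how MI was estimated, so the function name `mutual_information` and the counting estimator are illustrative assumptions only.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples.

    Assumption: x is a quantised feature symbol and y a class label
    (for example a PoA category); both are hashable discrete values.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (x, y)
    px = Counter(x for x, _ in pairs)      # counts of x
    py = Counter(y for _, y in pairs)      # counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

A feature that determines a two-way, equally likely label carries one bit about it, while a feature independent of the label carries none.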
4th European Conference on Speech Communication and Technology (Eurospeech 1995), 1995
We present a simple model for onset and offset detection which is based on the broad functionality of onset cells in the cochlear nucleus, the first auditory brain centre. We show that the clusters of transition events detected by this model in the spectrogram can be used to both locate and broad-classify phoneme transitions. A preliminary Isolated Word Recognition (IWR) system is described which bases recognition solely on evidence from detected transition clusters together with short spectral samples taken from each cluster centre. Recognition performance is compared with that for two other IWR systems of a similar complexity which process the whole signal uniformly.
In this chapter we will discuss feature extraction methods for speaker classification. We introduce linear predictive coding, mel frequency cepstral coefficients and wavelets and perform experimental studies on AURORA and TIMIT data. For the speaker identification ...
Sixth International Conference on Spoken Language …
In this paper we apply the Full Combination (FC) multi-band approach, which was originally introduced in the framework of posterior-based HMM/ANN (Hidden Markov Model/Artificial Neural Network) hybrid systems, to systems in which the ANN (or Multilayer Perceptron (MLP)) is ...
Speech synthesis by unit selection requires the segmentation of a large single-speaker, high-quality recording. Automatic speech recognition techniques, e.g. Hidden Markov Models (HMM), can be optimised for maximum segmentation accuracy. This paper presents the results of tuning such a phoneme segmentation system. Firstly, using no text transcription, the design of an HMM phoneme recogniser is optimised subject to a phoneme bigram language model. Optimal performance is obtained with triphone models, 7 states ...
Speaker recognition on the 630-speaker TIMIT speech database, using maximum probability selection with a simple Gaussian Mixture Model (GMM) for the data distribution for each speaker, gives above 99% correct recognition. In contrast, a powerful classifier such as a Multi Layer Perceptron (MLP), trained to estimate speaker probabilities, even on a small subset of speakers often performs no better than random selection. We hypothesise two effects which could combine to produce this situation. MLPs do badly because the acoustic feature data is primarily clustered around phonemes, so that speaker classes are highly fragmented and interspersed. In contrast, GMMs model speaker data distributions well because variation within the phonetic cluster identified by each Gaussian is primarily due to speaker variation, with the result that when speaker models are trained by adapting only the means from a multi-speaker world model, the resulting GMMs are highly discriminative between speakers. In this article we analyse the distribution of speech and speaker information, both overall and within the cluster identified by each Gaussian in a GMM tuned for speaker recognition on TIMIT. We show that the results of this analysis support the above hypotheses, and then discuss ways in which the enhanced speaker separability within each Gaussian cluster could be used to harness the discriminative power of MLPs to provide feature data enhancement and improved speaker identification.
We address the theoretical and practical issues involved in ASR when some of the observation data for the target signal is masked by other signals. Techniques discussed range from simple missing data imputation to Bayesian optimal classification. We have developed the Bayesian approach because this allows prior knowledge to be incorporated naturally into the recognition process, thereby permitting us to go beyond the simple “integrate over missing data” or “marginals” approach reported elsewhere, which we show to be inadequate for dealing with realistic patterns of missing data. After deriving general techniques for recognition with missing data, these techniques are formulated in the context of an HMM based CSR system. This scheme is evaluated under both random and more realistic patterns of missing data, with speech from the DARPA RM corpus and noise from NOISEX. We find that a key problem in real world recognition with missing data is that efficient ASR requires data vector com...
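For orientation, the baseline “marginals” idea mentioned above can be sketched for a single diagonal-covariance Gaussian: with independent dimensions, integrating over the missing components amounts to dropping them from the likelihood. This is a minimal sketch of that baseline only, not of the Bayesian scheme the abstract develops; the function name `marginal_log_likelihood` and the boolean reliability mask are assumptions made for illustration.

```python
import math


def marginal_log_likelihood(x, mask, mean, var):
    """Log-density of the reliable components of x under a diagonal Gaussian.

    Assumptions: mask[i] is True where x[i] is considered reliable, and the
    covariance is diagonal, so missing dimensions integrate out to 1.
    """
    ll = 0.0
    for xi, ok, m, v in zip(x, mask, mean, var):
        if ok:
            # log N(xi; m, v) for one reliable dimension
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return ll
```

Values in masked dimensions have no effect on the score, which is the property the marginals approach relies on.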
... andrew.morris@spinvox.com, jacques.koreman@hf.ntnu.no, bao.ly_van@int-evry.fr, {harin.sellahewa,sabah.jassim}@buckingham.ac.uk, rafael ... Furthermore, if we divide the frame sequence into NP equal parts and compute separate mean and variance vectors for each part, ...
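The segment statistics mentioned in the excerpt above could be computed along the following lines. This is a minimal sketch under stated assumptions: the excerpt is truncated, so the handling of sequences whose length is not a multiple of NP (integer boundary rounding here), the use of the population variance, and the name `segment_stats` are all hypothetical choices rather than the authors' method.

```python
def segment_stats(frames, num_parts):
    """Split a frame sequence into num_parts contiguous parts and return
    a (mean_vector, variance_vector) pair for each part.

    Assumptions: frames is a list of equal-length feature vectors, parts
    are cut at integer-rounded boundaries, and variance is the population
    variance (divide by the number of frames in the part).
    """
    n = len(frames)
    if num_parts < 1 or n < num_parts:
        raise ValueError("need at least one frame per part")
    dim = len(frames[0])
    stats = []
    for p in range(num_parts):
        # Contiguous slice covering the p-th part of the sequence
        part = frames[p * n // num_parts:(p + 1) * n // num_parts]
        mean = [sum(f[d] for f in part) / len(part) for d in range(dim)]
        var = [sum((f[d] - mean[d]) ** 2 for f in part) / len(part)
               for d in range(dim)]
        stats.append((mean, var))
    return stats
```

Concatenating the per-part vectors gives a fixed-length summary of a variable-length sequence, which is one plausible reading of why the parts are treated separately.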