Kjell Elenius

KTH Royal Institute of Technology, School of Computer Science and Communication, Alumnus

Papers

SpeechDat - Speech Databases for Creation of Voice Driven Teleservices

The SpeechDat project aims at producing telephone speech databases to be used for training and testing of speech recognition and speaker verification devices. The main features are: coverage of applications (application-oriented words, phonetically rich sentences, spontaneous utterances, speaker verification), coverage of the 11 official European languages and variants, coverage of speaking styles (commands, carefully pronounced and spontaneous speech), coverage of

Experiments with artificial neural networks for phoneme and word recognition

An artificial neural network has been trained by the error back-propagation technique to recognise phonemes and words. The speech material was recorded by a male Swedish talker and was labelled by a phonetician. There were 38 output nodes corresponding to Swedish phonemes. Introducing coarticulation information by adding simple recurrency to the net is shown to be more effective than expanding the

Multi-layer perceptrons and probabilistic neural networks for phoneme recognition

Two artificial neural networks have been trained to recognise phonemes in continuous speech: multi-layer perceptron (MLP) nets and probabilistic neural networks (PNN). The speech material was recorded by one male Swedish speaker and the sentences were phonetically labelled. Fifty sentences were used for training and another fifty were used for testing. Both networks had a single hidden layer and 38

Comparing a connectionist and a rule based model for assigning parts-of-speech

International Conference on Acoustics, Speech, and Signal Processing, 1990

The orthographic structure of Swedish words was used for predicting word class using a connectionist approach. This technique can be used to aid syntactic processing within a text-to-speech system. The error backpropagation technique was used for the connectionist learning procedure. A corpus of the 10,000 most frequent Swedish words was used for training and testing the system. The results indicate

Phoneme recognition with an artificial neural network

An artificial neural network has been trained to recognize phonemes using the error back-propagation technique. First a coarse feature network is trained to extract seven quasi-phonetic features from the spectral frames of a Bark-scaled filter bank. The outputs of this net and the spectral outputs of the filter bank were input to a phoneme recognition net. The coarse

Two Swedish Speechdat databases - some experiences and results

Experiences from Collecting Two Swedish Telephone Speech Databases

International Journal of Speech Technology, 2000

The EU-funded SpeechDat project was initiated in order to create large-scale speech databases for the development of voice-operated telecommunication services. This paper deals with the design of two such Swedish resources: 5000 speakers recorded over the fixed telephone network and 1000 speakers over the mobile network. Speakers were balanced according to gender, age and dialect. We also report on experiences from speaker recruitment. A “snowball” method, in which people gave addresses to friends according to a chain letter principle, was shown to be effective. Females were, in general, more cooperative than males. However, using the Internet for recruiting favored young males. Statistics on speaker distribution are presented. Results regarding orthographic labeling of pronunciation, pronunciation errors and non-speech events are also included. The length of the longest word in a read sentence is shown to be directly correlated with mispronunciations and word repetitions.

Speech recognizer for voice control of mobile telephone

The OLGA project: An animated talking agent in a dialogue system

Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system

ICASSP '82. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1982

Phonetic properties of the basic vocabulary of five European languages: Implications for speech recognition

by Kjell Elenius and Björn Granström

ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986

Corpora of approximately 10,000 words have been examined in five languages: Swedish, English, German, Italian, and French. A 2-class and a 6-class "cohort" classification have been defined, and calculations made of the number of cohorts, the number of unique cohorts, and their maximum and expected sizes. The discriminatory ability of stress is also considered.
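
The cohort statistics named in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say how the classes are defined, so the sketch assumes a 2-class split into vowels and consonants, uses spelling as a stand-in for a phonetic transcription, takes a "unique cohort" to be a cohort containing a single word, and computes the expected size as the average cohort size seen by a randomly chosen word. The names `two_class_key` and `cohort_stats` are hypothetical.

```python
from collections import Counter

# Assumption: orthographic vowels (including Swedish ones) stand in for
# the vowel class of a real phonetic transcription.
VOWELS = set("aeiouyåäö")


def two_class_key(word):
    """Map a word to its vowel/consonant pattern, e.g. 'bil' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())


def cohort_stats(words, key=two_class_key):
    """Group words by class pattern and summarise the resulting cohorts."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    sizes = Counter(key(w) for w in words)  # pattern -> cohort size
    return {
        "cohorts": len(sizes),
        # Assumption: a "unique" cohort holds exactly one word.
        "unique": sum(1 for s in sizes.values() if s == 1),
        "max_size": max(sizes.values()),
        # Assumption: expected size of the cohort of a randomly drawn word.
        "expected_size": sum(s * s for s in sizes.values()) / len(words),
    }


stats = cohort_stats(["bil", "sol", "hus", "ö", "träd"])
```

Here three words share the pattern CVC, while the other two words each form a cohort of their own. A 6-class variant would only need a different `key` function.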

OLGA - a dialogue system with an animated talking agent

The object of the Olga project is to develop an interactive 3D animated talking agent. A futuristic application scenario is interactive digital TV, where the Olga agent would guide naive users through the various services available on the network. The current application is a consumer information service for microwave ovens. Olga required the

Phoneme recognition using multi-layer perceptrons

An artificial neural network has been trained to recognize phonemes using the error back-propagation technique. First a coarse feature network was trained to extract seven quasi-phonetic features from the spectral frames of a Bark-scaled filter bank. The outputs of this net and the spectral outputs of the filter bank were input to a phoneme recognition net. The coarse features were

Auditory models in isolated word recognition

ICASSP '84. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1984

Nonlinear frequency warp for speech recognition

ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986

A technique of nonlinear frequency warping has been investigated for recognition of Swedish vowels. A frequency warp between two spectra is computed using a standard dynamic programming algorithm. The frequency distance, defined as the area between the obtained warping function and the diagonal, contributes to the spectral distance. The distance between two spectra is a weighted sum of the warped amplitude distance and the frequency distance. By changing two weights, we get a gradual shift between non-warped amplitude distance, warped amplitude distance, and frequency distance. In recognition experiments on natural and synthetic vowel spectra, a metric combining the frequency and amplitude distances gave better results than using only amplitude or frequency deviation. Analysis of the results for the synthetic vowels shows a reduced sensitivity to voice source and pitch variation. For the natural vowels, the recognition improvement is larger for the male and female speakers separately than for the combined groups.
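
The distance described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the step pattern (diagonal, vertical, horizontal), the absolute-difference amplitude cost, and the approximation of the area between the warping function and the diagonal by the summed channel offset `|i - j|` along the path are all assumptions, as are the names `warp_distance`, `w_amp` and `w_freq`.

```python
def warp_distance(a, b, w_amp=1.0, w_freq=1.0):
    """Return (total, amp_dist, freq_dist) for two equal-length spectra.

    a, b: amplitude values per frequency channel (e.g. in dB).
    """
    n = len(a)
    if n == 0 or len(b) != n:
        raise ValueError("spectra must be non-empty and of equal length")
    inf = float("inf")
    # cost[i][j]: cheapest accumulated cost of a path from (0, 0) to (i, j)
    cost = [[inf] * n for _ in range(n)]
    back = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Local cost: weighted amplitude difference plus weighted
            # offset from the diagonal (assumed frequency penalty).
            local = w_amp * abs(a[i] - b[j]) + w_freq * abs(i - j)
            if i == 0 and j == 0:
                cost[i][j] = local
                continue
            best, arg = inf, None
            # Diagonal step is tried first, so it wins ties.
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = best + local
            back[i][j] = arg
    # Backtrack the optimal path and accumulate both distance components.
    amp = freq = 0.0
    cell = (n - 1, n - 1)
    while cell is not None:
        i, j = cell
        amp += abs(a[i] - b[j])
        freq += abs(i - j)
        cell = back[i][j]
    return w_amp * amp + w_freq * freq, amp, freq


# Two toy spectra whose single peak is shifted by one channel.
peak_a = [0, 0, 10, 0, 0, 0]
peak_b = [0, 0, 0, 10, 0, 0]
free_warp = warp_distance(peak_a, peak_b, w_amp=1.0, w_freq=0.0)
no_warp = warp_distance(peak_a, peak_b, w_amp=1.0, w_freq=1000.0)
```

With `w_freq=0` the path may warp freely and the shifted peaks align, so the amplitude component vanishes. With a very large `w_freq` the path stays on the diagonal and the result reduces to the non-warped amplitude distance, which mirrors the gradual shift between the three distances mentioned in the abstract.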

Auditory models as front ends in speech-recognition systems

Includes comments by Stephanie Seneff and Nelson Kiang.

Continuous speech recognition using synthetic word and triphone prototypes

Speech recognition based on a text-to-speech synthesis system

by Kjell Elenius and Björn Granström

Emotion Recognition

by Kjell Elenius and Susanne Burger

Computers in the Human Interaction Loop, 2009

Studies of expressive speech have shown that discrete emotions such as anger, fear, joy, and sadness can be accurately communicated, also cross-culturally, and that each emotion is associated with reasonably specific acoustic characteristics [8]. However, most previous research has been conducted on acted emotions. These certainly have something in common with naturally occurring emotions but may also be more intense

Acoustic-phonetic recognition of continuous speech by artificial neural networks
