IEEE TRANSACTIONS ON REHABILITATION ENGINEERING, VOL. 3, NO. 1, MARCH 1995
Choice of Speech Features for Tactile
Presentation to the Profoundly Deaf
Ian R. Summers and Denise A. Gratton
Abstract-Measurements have been made, using acoustic presentation of stimuli, to compare a variety of speech-derived signals (amplitude envelope, voice fundamental frequency, second-formant frequency, zero-crossing frequency, and an "on/off" control derived from the amplitude envelope via a comparator) as to their suitability for tactile presentation to the profoundly deaf as an aid to speech reception. Segmental (phonemic) information was conveyed adequately by all five signals; suprasegmental (stress) information was conveyed very well by voice fundamental frequency, and significantly less well by the other signals. The best choice of speech features for presentation via a practical tactile aid is discussed.
I. INTRODUCTION
Since the pioneering work of Gault in the 1920's, researchers have produced a variety of devices to convey
acoustic information to the profoundly deaf via the sense of
touch [1]-[3]. Such aids may be vibrotactile (with stimulation
via vibratory transducers on the skin) or electrotactile (with
electrical stimulation of the tactile nerves via skin electrodes).
Existing devices are most often single channel [4] (with a
single output transducer or pair of electrodes) but there has
also been investigation of multichannel devices [5] (with a
1-D or 2-D array of transducers or electrodes). Performance
of commercially available aids is generally disappointing [6]:
the benefit to the user, particularly in the important context of
speech reception, is not sufficient for these devices to be in
widespread use.
The available information-carrying capacity of tactile devices is such that only a small part of the information in
a speech signal can be conveyed (generally as a support to
lipreading). An obvious strategy, therefore, is to extract from
the speech signal one or more features which are particularly
significant in such a context and then recode these features
for ease of perception via the tactile channel. In our laboratories, we have investigated the design of a single-channel
aid to lipreading in which the information flow involves the
following stages:
1) extraction of speech feature(s) from the microphone
signal;
2) linear or nonlinear transformation of the time-varying
signals from 1) into control signals to specify the output
stimulus;
3) generation of the output stimulus with time-varying
parameters determined by the signals from 2).
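As a schematic of this three-stage information flow (not the authors' implementation), the stages can be pictured as a chain of functions; the Python sketch below is illustrative only, and the type and function names are our own.

```python
from typing import Callable
import numpy as np

# Stage 1: feature extraction (e.g. amplitude envelope, F0, F2, FX)
FeatureExtractor = Callable[[np.ndarray], np.ndarray]
# Stage 2: linear or nonlinear recoding of the feature into a control signal
Recoder = Callable[[np.ndarray], np.ndarray]
# Stage 3: stimulus generation driven by the control signal
StimulusGenerator = Callable[[np.ndarray], np.ndarray]

def tactile_aid_pipeline(mic_signal: np.ndarray,
                         extract: FeatureExtractor,
                         recode: Recoder,
                         generate: StimulusGenerator) -> np.ndarray:
    """Schematic single-channel aid: microphone signal -> feature ->
    control signal -> output stimulus (stages 1-3 of the text)."""
    feature = extract(mic_signal)   # stage 1
    control = recode(feature)       # stage 2
    return generate(control)        # stage 3
```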
Results from the many previous studies of this type of
aid (e.g., Thornton and Phillips [7], Plant [8]) do not clearly
indicate the best design choices for the stages 1), 2), and 3);
this is because, in the overall evaluation of a complete
device, it is difficult to assess these areas independently. This
paper describes part of a study the overall aim of which
is to optimize the design of such an aid by considering
individually each of the stages listed above. In a previous
study [9] on stage 3), we have compared vibrotactile and
electrotactile output, as well as frequency-modulated, amplitude-modulated, and frequency-and-amplitude-modulated output
stimuli. Best results were obtained with a vibrotactile pulse
stimulus whose amplitude and frequency, i.e. repetition rate,
were varied redundantly. We here describe an experiment on
stage 1), comparing the utility of various speech features,
in which subjects were required to discriminate segmental
or suprasegmental speech distinctions on the basis of five
different speech-derived signals.
II. EXPERIMENTAL METHODS
Manuscript received January 14, 1994; revised December 30, 1994. This
work was supported by the Trent Area Health Authority.
I. R. Summers is with the Medical Physics Group, Department of Physics,
University of Exeter, U.K.
D. A. Gratton is with the Department of Medical Physics and Clinical
Engineering, University of Sheffield, Sheffield, U.K.
IEEE Log Number 9409086.
A. Overall Design
In order to ensure that results were not confounded by
problems in presentation or reception of stimuli (as is generally
the case in tactile studies), each signal was presented as a
frequency-modulated acoustic stimulus to normally hearing
subjects; in addition, the different signals were transformed
so that each produced approximately the same dynamic range
of frequency in the output stimulus. To avoid distortion of
results by the fact that some speech features may be easier to
train on than others, the test format was one that required no
feature-specific training: subjects were required to discriminate
pairs of stimuli presented in a “one-from-three” format, i.e.
each test item was a three-element sequence consisting of two
examples of one member of the stimulus pair and one example
of the other; subjects were required to identify the position in
the sequence of the element which occurred once only (see
below).
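As a concrete illustration of the "one-from-three" format, the short Python sketch below enumerates the six possible patterns for a stimulus pair (A, B) and identifies the position of the odd element; the representation of test items is hypothetical.

```python
import random

PATTERNS = ["BAA", "ABA", "AAB", "ABB", "BAB", "BBA"]

def make_item(pair, pattern):
    """Build a three-element test item from a stimulus pair, e.g.
    pair = {"A": "hid", "B": "had"}, pattern = "ABB" -> ["hid", "had", "had"]."""
    return [pair[label] for label in pattern]

def odd_one_position(pattern):
    """Position (1-3) of the element that occurs once only."""
    for i, label in enumerate(pattern, start=1):
        if pattern.count(label) == 1:
            return i

# Example: present one random pattern for the pair hid/had and find the correct response.
pair = {"A": "hid", "B": "had"}
pattern = random.choice(PATTERNS)
item = make_item(pair, pattern)
correct = odd_one_position(pattern)
```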
B. Speech Material
Speech recordings were made from a male speaker
using a Sony microphone (type ECM 959 DT) and DAT
recorder (type TCD D10). The selected speech items were: vowels differing principally in a) duration, b) first-formant frequency F1, c) second-formant frequency F2 (four pairs of each, in the context hVd: heed/hid, hard/hod, who'd/hood, herd/hud, who'd/hard, hid/head, hood/hod, hid/had, who'd/heed, hood/hid, hod/head, hod/had); consonants differing in d) place of articulation, e) manner, f) voicing (four pairs of each in the context aCa: ama/ana, afa/asa, aba/aga, ata/aka, aba/ama, ata/asa, ada/aza, aza/ana, apa/aba, afa/ava, asa/aza, aka/aga); and phrases differing in g) stress (two large trees/two large trees and 11 similar examples, taken from the SPAC test [10]).

Each pair AB was recorded in the six different "one-from-three" patterns BAA, ABA, AAB, ABB, BAB, and BBA. A balanced set of six vowel tests was constructed, each test containing one pattern for each of the 12 vowel pairs. Consonant and stress tests were similarly produced.

C. Extraction and Coding of Speech Features

In addition to a control condition based on the output of the commercially available Tactile Acoustic Monitor (TAM) device [11] (which is constant frequency and constant amplitude, keyed "on" in response to the "loud" sections of an input speech signal via a comparator which operates on the amplitude envelope), four different speech features were selected for comparison: the amplitude envelope of the speech signal AE, voice fundamental frequency F0, an estimated second-formant frequency F2, and the zero-crossing frequency of the speech signal FX. (These were chosen on the basis of previous investigations [4]: AE, F0, and F2 have been demonstrated to facilitate lipreading when presented acoustically or tactually [12]-[16]; electrotactile transmission of FX formed the basis of an earlier study [17] from which our current project is derived.) Details of the feature extraction applied to the recorded speech material are as follows:

1) AE: The input speech signal is full-wave rectified and low-pass filtered (first-order) with a 10-ms time constant. The resulting envelope has a dynamic range of around 35 dB, i.e. approximately 1:65 (a threshold on the input signal is set so that the system does not respond to inputs below -35 dB re typical peaks).

2) F0: A pitch extractor based on the peak-picking design of Howard [18], followed by cycle-by-cycle frequency-to-voltage conversion and smoothing with a 10-ms time constant, produces a signal with a dynamic range of around 1:2, corresponding to an F0 range of 110-220 Hz in the original speech.

3) F2: The input speech is bandpass filtered in the range 1-3 kHz and the zero-crossing frequency of the resulting signal (averaged over each successive group of eight cycles and then smoothed with a 10-ms time constant) taken as an estimate of F2. An output is produced for unvoiced as well as voiced sounds (a threshold on the input signal is set so that the system does not respond to inputs below -35 dB re typical peaks). The resulting signal has a dynamic range of around 1:5, corresponding to zero-crossing frequencies of 800-4000 Hz.

4) FX: This is the zero-crossing frequency of the input speech signal (averaged over each successive group of eight cycles and then smoothed with a 10-ms time constant; a threshold on the input signal is set so that the system does not respond to inputs below -35 dB re typical peaks). The resulting signal has a dynamic range of around 1:35, corresponding to zero-crossing frequencies of 300 Hz-10.5 kHz.

The signals produced by the circuitry described above were each used to control the frequency of an oscillator whose output (Gaussian-like pulses of width 2 ms) was recorded on tape to provide the acoustic test stimuli. However, in order to compare the four speech features on a relatively equal footing, the four signals were first subjected to nonlinear or linear transformations so as to reduce the disparity between their dynamic ranges:

a) AE: This signal was square-rooted to reduce the dynamic range to approximately 1:8, giving output frequencies in the range 50-400 Hz.

b) F0: This signal was subjected to a linear transformation (offset and gain), similar to that which we have used previously to code F0 for tactile presentation [9], which increased the dynamic range to approximately 1:6, giving output frequencies in the range 70-400 Hz.

c) F2: The dynamic range of this signal was maintained at approximately 1:5, giving output frequencies in the range 80-400 Hz.

d) FX: This signal was square-rooted to reduce the dynamic range to approximately 1:6, giving output frequencies in the range 70-400 Hz.

Output signals for the TAM control condition were at a constant frequency of 200 Hz and were keyed "on" when the speech amplitude envelope was above a threshold set at -15 dB re typical peaks in this envelope.

D. Test Procedure

Eleven normally hearing, young adult subjects (unpaid volunteers, age range 20-25 years) were played the test material via a loudspeaker in a soundproof test room at a level of 70 dB (A). In a single test session, each subject completed two vowel tests, two consonant tests, and two stress tests for one particular speech-derived signal (each test comprised 12 items), preceded by a short period of explanation and demonstration. This procedure was repeated in four subsequent sessions to cover all five types of signal. For each test item (e.g. hid/had/had) subjects were required to indicate, by circling the appropriate number on an answer sheet, the position in the three-element sequence of the element which occurred once only. (Note: this experiment does not involve lipreading; subjects respond on the basis of acoustic information only.)

E. Supplementary Experiment

Although the main experiment was designed with acoustic presentation, out of interest it was decided to also investigate tactile presentation of the same stimulus waveforms. In a similar procedure to that described above, six normally hearing subjects were presented, via a vibrator at the right index fingertip, with vibrotactile stimuli at a peak displacement level of 30 dB re a nominal threshold [9] of 0.5 μm. Noise masking via headphones was used to eliminate acoustic cues from the vibrator.
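As an illustration of the feature extraction and recoding described in Section II-C, the Python sketch below gives a possible digital approximation of the AE feature (full-wave rectification followed by first-order low-pass smoothing with a 10-ms time constant) and of the square-root recoding of AE onto the 50-400 Hz output-frequency range. The sampling rate, the discrete filter form, and the function names are our own assumptions; the original processing was implemented in analogue circuitry.

```python
import numpy as np

def amplitude_envelope(x, fs, tau=0.010):
    """AE: full-wave rectify, then first-order low-pass with a 10-ms time constant.
    Digital approximation of the analogue circuit described in Section II-C;
    fs (sampling rate) and the discrete filter form are assumptions."""
    alpha = 1.0 - np.exp(-1.0 / (fs * tau))    # first-order smoothing coefficient
    env = np.zeros(len(x))
    state = 0.0
    for n, sample in enumerate(np.abs(x)):     # full-wave rectification
        state += alpha * (sample - state)
        env[n] = state
    return env

def ae_to_output_frequency(env, f_lo=50.0, f_hi=400.0, floor_db=-35.0):
    """Recode AE onto an output frequency in the range 50-400 Hz via a square root,
    as in transformation a); values below the -35 dB threshold are held at f_lo."""
    peak = env.max()
    floor = peak * 10.0 ** (floor_db / 20.0)   # threshold re typical peaks
    gated = np.clip(env, floor, peak)
    # square-root compression, then linear mapping onto [f_lo, f_hi]
    r = (np.sqrt(gated) - np.sqrt(floor)) / (np.sqrt(peak) - np.sqrt(floor))
    return f_lo + r * (f_hi - f_lo)
```

The F2 and FX features would be obtained analogously by averaging zero-crossing intervals over successive groups of eight cycles before the same kind of smoothing and recoding.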
[Fig. 1. Mean percent-correct scores from the main experiment for the five speech-derived signals (AE, F0, F2, FX, TAM): (a) vowel duration, (b) vowel F1, (c) vowel F2, (d) consonant place, (e) consonant manner, (f) consonant voicing, (g) stress, (h) mean of (e), (f), and (g).]
III. RESULTS
Mean percentage-correct scores from the main experiment
are shown in Fig. 1. Scores for vowels are subdivided into
the three types of distinction: (a) duration, (b) F1, and (c) F2.
Scores for consonants are similarly subdivided into (d) place,
(e) manner, and (f) voicing. Discrimination scores are also
shown for (g) stress. The scores for (e), (f), and (g) are most
relevant to a lipreading context since these distinctions are
difficult on the basis of visual information alone: the means
of the (e), (f), and (g) scores are shown in Fig. l(h), and the
individual scores are shown together in Fig. 2 (top panel). The
bottom panel in Fig. 2 shows results for tests (e), (f), and (g)
from the supplementary experiment with tactile presentation.
It is encouraging to note that, as might be predicted, the
scores for discrimination of vowel F2 distinctions in Fig. 1(c)
are highest when F2 is the signal presented [the difference
between the score for F2 presentation and the next highest
score in Fig. 1(c) is significant at the 0.95 level (single-sided
Student t-test on paired data)].
Scores in Fig. 1(g) show that stress information is conveyed
much better by voice fundamental frequency F0, which gives a
score of 92.8 ± 2.3%, than by the other signals [the difference
between the score for F0 presentation and the next highest
score in Fig. 1(g) is significant at the 0.95 level (single-sided
Student t-test on paired data)]. This is not surprising, since
voice fundamental frequency F0 corresponds to voice pitch,
and inflections of pitch in the acoustic signal of normal speech
provide the main cues to stress patterns.

Fig. 2. Discrimination of (e) consonant manner, (f) consonant voicing, and (g) stress from acoustic presentation of the five speech-derived signals (top panel: mean percent-correct scores from 11 subjects) and from tactile presentation (bottom panel: mean percent-correct scores from six subjects). The chance score is 33% throughout.
Consonant manner is conveyed well by all five signals: all
scores in Fig. 1(e) are significantly greater than 50% (single-sided Student t-test, 0.95 confidence level), with F2 and FX
producing scores of 92.1 ± 1.9% and 95.5 ± 1.9%, respectively. (Note: A score of 50% on this type of three-alternative
forced-choice test [19] corresponds to a discrimination index
d' of 1.47. Scores in the range 90-96% correspond to d' in
the range 4.03 to 5.03.)
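The percent-correct to d' conversion is taken from the table in [19]. Purely as an illustration of how such values can be approximated, the Python sketch below estimates percent correct for a given d' by Monte-Carlo simulation of a simple "farthest from the mean of the other two" decision rule; this heuristic rule is our own assumption and will not reproduce the tabulated values exactly.

```python
import numpy as np

def oddity_percent_correct(d_prime, n_trials=100_000, seed=0):
    """Monte-Carlo estimate of percent correct in a 3-alternative odd-man-out task
    for an equal-variance Gaussian observer using a heuristic decision rule
    (choose the observation farthest from the mean of the other two)."""
    rng = np.random.default_rng(seed)
    obs = rng.standard_normal((n_trials, 3))          # two "same" observations per trial
    odd = rng.integers(0, 3, n_trials)                # position of the odd stimulus
    obs[np.arange(n_trials), odd] += d_prime          # odd stimulus shifted by d'
    total = obs.sum(axis=1, keepdims=True)
    dist = np.abs(obs - (total - obs) / 2.0)          # distance from mean of the other two
    choice = dist.argmax(axis=1)
    return 100.0 * np.mean(choice == odd)
```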
Vowel F1 and consonant place are conveyed badly by all
five signals, with no scores in Fig. 1(b) or (d) significantly
greater than 50% (single-sided Student t-test, 0.95 confidence
level); however, transmission of these distinctions is not
particularly important in a lipreading context since visual cues
are available.
IV. DISCUSSION
The scores from the main experiment show no clear overall
superiority for any of the speech-derived signals. Attempts to
relate this finding to results from similar studies are hindered
by the fact that other researchers have chosen to present their
different speech-derived signals as different modulations of the
acoustic signal (for example, F0 presented as varying stimulus
frequency might be compared with AE presented as varying
stimulus amplitude). Bearing in mind this caveat, the results
presented in this paper are in line with those of Breeuwer
and Plomp [12] and Grant et al. [13], who found the benefit
to lipreading from acoustic presentation of F0 and of AE to
be similar. We are not aware of any previous study in which
a comparison of speech features has included second-formant
frequency or zero-crossing frequency in isolation, although the
former has been studied when presented in combination with
other speech-derived signals [12], [14].
The TAM scores are, in general, surprisingly high in view
of the crude signal processing involved: scores for the vowel-
duration and consonant-manner distinctions are significantly
greater than 50%, and four scores from seven are significantly
greater than chance (single-sided Student t-test, 0.95 confidence level). This is in line with the unexpected findings of
Thornton and Phillips [7] that the TAM aid can give significant
benefit to lipreading.
Although lower overall, scores (Fig. 2, bottom panel) from
the supplementary tactile experiment show a similar pattern to
scores (Fig. 2, top panel) from the main experiment (correlation coefficient r = 0.86, 13 degrees of freedom), suggesting
that it is valid to use the results from the main experiment
to predict performance with tactile stimulation. It is to be
expected that somewhat higher scores would be obtained if
the tactile stimuli were optimized for tactile presentation (to do
this would involve lowering pulse repetition rates by a factor
of at least two and providing additional cues by amplitude
modulation of stimuli [9]). However, the experience of other
investigators [20], [21] suggests that tactile perception of
speech-derived signals will always be worse than acoustic
perception.
V. CONCLUSION: CHOICE OF SPEECH FEATURE(S) IN A PRACTICAL DEVICE
The results of this experiment, conveniently summarized in
Fig. 1(h), are equivocal. This suggests that the choice of best
speech feature(s) for tactile presentation may rest principally
on suitability for tactile perception after appropriate coding. In
future work, we propose to address this issue, using a range
of speech-to-tactile codings with amplitude- and frequency-modulated vibrotactile pulse output. On the basis of our
experience from this and earlier experiments, the following
coding strategies have been chosen for comparison:
1) output amplitude determined by AE, output frequency
determined by AE;
2) output amplitude determined by F0, output frequency
determined by F0;
3) output amplitude determined by F2', output frequency
determined by F2';
4) output amplitude determined by AE, output frequency
determined by F0;
5) output amplitude determined by AE, output frequency
determined by F2'.
(Here F2’ refers to a modified version of the F2 feature
described above in which the upper limit on zero-crossings
is increased to around 8 kHz by raising the low-pass cut-off
of the filter which precedes the zero-crossing detector; hence,
the characteristics of the F2' signal are intermediate between
those of the F2 and FX features described above. This signal is
intended to convey the most useful information from both the
F2 and FX features: the gross spectral information of FX and
the vowel-related information of F2.)
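For concreteness, the five candidate codings listed above can be written as pairings of the control signals with the two output parameters (pulse amplitude and pulse repetition rate); the Python sketch below is illustrative only and does not specify the planned device.

```python
# Each strategy pairs one speech-derived control signal with output amplitude
# and one with output frequency (pulse repetition rate) of the vibrotactile stimulus.
# "F2p" stands for the modified feature F2' described in the text.
CODING_STRATEGIES = {
    1: {"amplitude": "AE",  "frequency": "AE"},
    2: {"amplitude": "F0",  "frequency": "F0"},
    3: {"amplitude": "F2p", "frequency": "F2p"},
    4: {"amplitude": "AE",  "frequency": "F0"},
    5: {"amplitude": "AE",  "frequency": "F2p"},
}

def drive_output(features, strategy):
    """Select the control signals for a given strategy, e.g.
    features = {"AE": ae, "F0": f0, "F2p": f2p} (arrays of control values)."""
    mapping = CODING_STRATEGIES[strategy]
    return features[mapping["amplitude"]], features[mapping["frequency"]]
```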
Strategies 1), 2), and 3) give an output in which the
amplitude and frequency vary redundantly, corresponding to
the most favorable condition for information transfer [9].
In strategies 2) and 3), speech-amplitude cues are reduced
to "one-bit" information which derives from the operating
thresholds of the F0 and F2' extractors. However, it may be
that transmission of a greater amount of amplitude information is desirable for a practical scenario in which low-level
background must be distinguished from the speech signals
of interest. With this in mind, alternatives to strategies 2)
and 3) are offered by strategies 4) and 5), in which the
amplitude-insensitive parameters F0 and F2' are supplemented
by amplitude information as AE.
All five coding strategies are designed to be implementable
in an entirely wristworn aid, and the most successful strategy
will be used with profoundly deaf subjects in a field trial of
such a device, with testing against our earlier TAM aid.

ACKNOWLEDGMENT
We are grateful to our coworkers R. Gray, J. Stevens, and
B. Brown for their assistance.

REFERENCES
[1] J. M. Weisenberger, "Communication of the acoustic environment via tactile stimuli," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 83-109.
[2] C. M. Reed, N. I. Durlach, L. A. Delhorne, W. M. Rabinowitz, and K. W. Grant, "Research on tactual communication of speech: Ideas, issues and findings," Volta Rev., vol. 91, pp. 65-78, 1989.
[3] C. E. Sherrick, "Basic and applied research on tactile aids for deaf people: Progress and prospects," J. Acoust. Soc. Am., vol. 75, pp. 1325-1342, 1984.
[4] I. R. Summers, "Signal processing strategies for single-channel systems," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 110-127.
[5] J. L. Mason and B. J. Frost, "Signal processing strategies for multichannel systems," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 128-145.
[6] L. E. Bernstein, "The evaluation of tactile aids," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 167-186.
[7] A. R. D. Thornton and A. J. Phillips, "A comparative trial of four vibrotactile aids," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 231-251.
[8] G. Plant, "A comparison of five commercially available tactile aids," Aust. J. Audiol., vol. 11, pp. 11-19, 1989.
[9] I. R. Summers, P. R. Dixon, P. G. Cooper, D. A. Gratton, B. H. Brown, and J. C. Stevens, "Vibrotactile and electrotactile perception of time-varying pulse trains," J. Acoust. Soc. Am., vol. 95, pp. 1548-1558, 1994.
[10] A. Boothroyd, "Perception of speech pattern contrasts from auditory presentation of voice fundamental frequency," Ear Hear., vol. 9, pp. 313-321, 1988.
[11] I. R. Summers and M. C. Martin, "A tactile sound level monitor for the profoundly deaf," Brit. J. Audiol., vol. 14, pp. 30-33, 1980.
[12] M. Breeuwer and R. Plomp, "Speechreading supplemented with auditorily presented speech parameters," J. Acoust. Soc. Am., vol. 79, pp. 481-499, 1986.
[13] K. W. Grant, L. H. Ardell, P. K. Kuhl, and D. W. Sparks, "The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects," J. Acoust. Soc. Am., vol. 77, pp. 671-677, 1985.
[14] P. J. Blamey, L. F. A. Martin, and G. M. Clark, "A comparison of three speech coding strategies using an acoustic model of a cochlear implant," J. Acoust. Soc. Am., vol. 77, pp. 209-217, 1985.
[15] L. E. Bernstein, S. P. Eberhardt, and M. E. Demorest, "Single-channel vibrotactile supplements to visual perception of intonation and stress," J. Acoust. Soc. Am., vol. 85, pp. 397-405, 1989.
[16] P. J. Blamey, R. S. C. Cowan, J. I. Alcantara, and G. M. Clark, "Phonemic information transmitted by a multichannel electrotactile speech processor," J. Speech Hear. Res., vol. 31, pp. 620-629, 1988.
[17] G. S. Dodgson, B. H. Brown, I. L. Freeston, and J. C. Stevens, "Electrical stimulation at the wrist as an aid to the profoundly deaf," Clin. Phys. Physiol. Meas., vol. 4, pp. 403-416, 1983.
[18] D. M. Howard, "Peak-picking fundamental period estimation for hearing prostheses," J. Acoust. Soc. Am., vol. 86, pp. 902-910, 1990.
[19] B. J. Craven, "A table of d' for M-alternative odd-man-out forced-choice procedures," Percept. Psychophys., vol. 51, pp. 379-385, 1992.
[20] C. M. Reed, M. S. Bratakos, L. A. Delhorne, and F. Denesvich, "A comparison of auditory and tactile presentation of a single-band envelope cue as a supplement to speechreading," presented at the 3rd Int. Conf. Tactile Aids, Miami, FL, 1994.
[21] T. Hnath-Chisolm and L. Kishon-Rabin, "Tactile presentation of voice fundamental frequency as an aid to the perception of speech pattern contrasts," Ear Hear., vol. 9, pp. 329-334, 1988.
Ian R. Summers received the B.A. in physics from
the University of Oxford in 1974 and the Ph.D. in
acoustics from the University of London in 1977.
He has been involved with tactile-aid research
since 1977 and currently lectures on the Physics
with Medical Physics program in the Department of
Physics, University of Exeter. He was Editor of the
1992 book, Tactile Aids for the Hearing Impaired.
His other research interests include more fundamental aspects of tactile perception and targeted
magnetic-resonance imaging.
Denise A. Gratton received the Ph.D. in neurophysiology from the University of Sheffield in 1973.
From 1973-1983, she worked in the Physiology Department at Sheffield on information transfer
along the sensory pathways of the rat and how
anaesthetic agents block this transfer. Since then,
she has been involved in research to determine the
limits of the skin senses to convey tactile information. She is currently Coordinator of the cochlear
implant team at the Central Sheffield University
Hospitals.