IEEE TRANSACTIONS ON REHABILITATION ENGINEERING, VOL. 3, NO. 1, MARCH 1995

Choice of Speech Features for Tactile Presentation to the Profoundly Deaf

Ian R. Summers and Denise A. Gratton

Abstract: Measurements have been made, using acoustic presentation of stimuli, to compare a variety of speech-derived signals (amplitude envelope, voice fundamental frequency, second-formant frequency, zero-crossing frequency, and an "on/off" control derived from the amplitude envelope via a comparator) as to their suitability for tactile presentation to the profoundly deaf as an aid to speech reception. Segmental (phonemic) information was conveyed adequately by all five signals; suprasegmental (stress) information was conveyed very well by voice fundamental frequency, and significantly less well by the other signals. The best choice of speech features for presentation via a practical tactile aid is discussed.

I. INTRODUCTION

Since the pioneering work of Gault in the 1920s, researchers have produced a variety of devices to convey acoustic information to the profoundly deaf via the sense of touch [1]-[3]. Such aids may be vibrotactile (with stimulation via vibratory transducers on the skin) or electrotactile (with electrical stimulation of the tactile nerves via skin electrodes). Existing devices are most often single-channel [4] (with a single output transducer or pair of electrodes), but there has also been investigation of multichannel devices [5] (with a 1-D or 2-D array of transducers or electrodes). Performance of commercially available aids is generally disappointing [6]: the benefit to the user, particularly in the important context of speech reception, is not sufficient for these devices to be in widespread use.

The available information-carrying capacity of tactile devices is such that only a small part of the information in a speech signal can be conveyed (generally as a support to lipreading). An obvious strategy, therefore, is to extract from the speech signal one or more features which are particularly significant in such a context and then recode these features for ease of perception via the tactile channel. In our laboratories, we have investigated the design of a single-channel aid to lipreading in which the information flow involves the following stages:

1) extraction of speech feature(s) from the microphone signal;
2) linear or nonlinear transformation of the time-varying signals from 1) into control signals to specify the output stimulus;
3) generation of the output stimulus with time-varying parameters determined by the signals from 2).

Results from the many previous studies of this type of aid (e.g., Thornton and Phillips [7], Plant [8]) do not clearly indicate the best design choices for stages 1), 2), and 3); this is because, in the overall evaluation of a complete device, it is difficult to assess these areas independently. This paper describes part of a study whose overall aim is to optimize the design of such an aid by considering each of the stages listed above individually. In a previous study [9] on stage 3), we compared vibrotactile and electrotactile output, as well as frequency-modulated, amplitude-modulated, and frequency-and-amplitude-modulated output stimuli. Best results were obtained with a vibrotactile pulse stimulus whose amplitude and frequency (i.e., repetition rate) were varied redundantly.
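The three-stage structure lends itself to a simple software description. The following is a minimal Python sketch of the pipeline, using the amplitude envelope AE as the extracted feature and an FM tone as a stand-in for the output stage; the function names and toy test signal are ours, while the 10-ms time constant, square-root compression, and 50-400 Hz output range are taken from Section II below.

```python
import numpy as np
from scipy.signal import lfilter

def stage1_envelope(x, fs, tau=0.010):
    """Stage 1: amplitude envelope AE -- full-wave rectification followed
    by first-order low-pass smoothing with a 10-ms time constant."""
    a = np.exp(-1.0 / (fs * tau))              # one-pole smoother coefficient
    return lfilter([1.0 - a], [1.0, -a], np.abs(x))

def stage2_control(env, f_lo=50.0, f_hi=400.0):
    """Stage 2: square-root compression of the envelope (roughly 1:65
    reduced to 1:8), mapped onto a frequency track spanning f_lo..f_hi Hz."""
    c = np.sqrt(env / (env.max() + 1e-12))
    return f_lo + (f_hi - f_lo) * c

def stage3_fm_output(f_track, fs):
    """Stage 3: an output whose instantaneous frequency follows the
    control track (here a sine; the real stimuli used 2-ms pulses)."""
    return np.sin(2.0 * np.pi * np.cumsum(f_track) / fs)

# Toy input: a 150-Hz tone with a slow amplitude modulation.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
stimulus = stage3_fm_output(stage2_control(stage1_envelope(speech, fs)), fs)
```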
We here describe an experiment on stage 1), comparing the utility of various speech features, in which subjects were required to discriminate segmental or suprasegmental speech distinctions on the basis of five different speech-derived signals.

Manuscript received January 14, 1994; revised December 30, 1994. This work was supported by the Trent Area Health Authority. I. R. Summers is with the Medical Physics Group, Department of Physics, University of Exeter, U.K. D. A. Gratton is with the Department of Medical Physics and Clinical Engineering, University of Sheffield, Sheffield, U.K. IEEE Log Number 9409086.

II. EXPERIMENTAL METHODS

A. Overall Design

In order to ensure that results were not confounded by problems in presentation or reception of stimuli (as is generally the case in tactile studies), each signal was presented as a frequency-modulated acoustic stimulus to normally hearing subjects; in addition, the different signals were transformed so that each produced approximately the same dynamic range of frequency in the output stimulus. To avoid distortion of results by the fact that some speech features may be easier to train on than others, the test format was one that required no feature-specific training: subjects were required to discriminate pairs of stimuli presented in a "one-from-three" format, i.e., each test item was a three-element sequence consisting of two examples of one member of the stimulus pair and one example of the other; subjects were required to identify the position in the sequence of the element which occurred once only (see below).

B. Speech Material

Speech recordings were made from a male speaker using a Sony microphone (type ECM 959 DT) and DAT recorder (type TCD D10). The selected speech items were: vowels differing principally in a) duration, b) first-formant frequency F1, c) second-formant frequency F2 (four pairs of each, in the context hVd: heed/hid, hard/hod, who'd/hood, herd/hud, who'd/hard, hid/head, hood/hod, hid/had, who'd/heed, hood/hid, hod/head, hod/had); consonants differing in d) place of articulation, e) manner, f) voicing (four pairs of each, in the context aCa: ama/ana, afa/asa, aba/aga, ata/aka, aba/ama, ata/asa, ada/aza, aza/ana, apa/aba, afa/ava, asa/aza, aka/aga); and phrases differing in g) stress (the phrase "two large trees" produced with two different stress patterns, and 11 similar examples, taken from the SPAC test [10]). Each pair AB was recorded in the six different "one-from-three" patterns BAA, ABA, AAB, ABB, BAB, and BBA. A balanced set of six vowel tests was constructed, each test containing one pattern for each of the 12 vowel pairs; consonant and stress tests were similarly produced.
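The construction and scoring of a one-from-three item can be made explicit. A small sketch (our own illustration, not the authors' test materials):

```python
PATTERNS = ["BAA", "ABA", "AAB", "ABB", "BAB", "BBA"]

def make_item(a, b, pattern):
    """Build a three-element test sequence from the stimulus pair (a, b)."""
    return [a if ch == "A" else b for ch in pattern]

def odd_position(pattern):
    """1-based position of the element that occurs once only -- the
    answer the subject must identify."""
    minority = "A" if pattern.count("A") == 1 else "B"
    return pattern.index(minority) + 1

# Example: the vowel pair heed/hid under pattern "ABA".
item = make_item("heed", "hid", "ABA")      # ['heed', 'hid', 'heed']
answer = odd_position("ABA")                # 2 -- 'hid' occurs once only

# A test is scored as the percentage of items for which the circled
# position matches odd_position(pattern); chance performance is 33%.
```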
C. Extraction and Coding of Speech Features

In addition to a control condition based on the output of the commercially available Tactile Acoustic Monitor (TAM) device [11] (which is constant-frequency and constant-amplitude, keyed "on" in response to the "loud" sections of an input speech signal via a comparator which operates on the amplitude envelope), four different speech features were selected for comparison: the amplitude envelope of the speech signal AE, voice fundamental frequency F0, an estimated second-formant frequency F2, and the zero-crossing frequency of the speech signal FX. (These were chosen on the basis of previous investigations [4]: AE, F0, and F2 have been demonstrated to facilitate lipreading when presented acoustically or tactually [12]-[16]; electrotactile transmission of FX formed the basis of an earlier study [17] from which our current project is derived.) Details of the feature extraction applied to the recorded speech material are as follows:

1) AE: The input speech signal is full-wave rectified and low-pass filtered (first-order) with a 10-ms time constant. The resulting envelope has a dynamic range of around 35 dB, i.e., approximately 1:65 (a threshold on the input signal is set so that the system does not respond to inputs below -35 dB re typical peaks).

2) F0: A pitch extractor based on the peak-picking design of Howard [18], followed by cycle-by-cycle frequency-to-voltage conversion and smoothing with a 10-ms time constant, produces a signal with a dynamic range of around 1:2, corresponding to an F0 range of 110-220 Hz in the original speech.

3) F2: The input speech is bandpass filtered in the range 1-3 kHz and the zero-crossing frequency of the resulting signal (averaged over each successive group of eight cycles and then smoothed with a 10-ms time constant) is taken as an estimate of F2. An output is produced for unvoiced as well as voiced sounds (a threshold on the input signal is set so that the system does not respond to inputs below -35 dB re typical peaks). The resulting signal has a dynamic range of around 1:5, corresponding to zero-crossing frequencies of 800-4000 Hz.

4) FX: This is the zero-crossing frequency of the input speech signal (averaged over each successive group of eight cycles and then smoothed with a 10-ms time constant; a threshold on the input signal is set so that the system does not respond to inputs below -35 dB re typical peaks). The resulting signal has a dynamic range of around 1:35, corresponding to zero-crossing frequencies of 300 Hz-10.5 kHz.

The signals produced by the circuitry described above were each used to control the frequency of an oscillator whose output (Gaussian-like pulses of width 2 ms) was recorded on tape to provide the acoustic test stimuli. However, in order to compare the four speech features on a relatively equal footing, the four signals were first subjected to nonlinear or linear transformations so as to reduce the disparity between their dynamic ranges:

a) AE: This signal was square-rooted to reduce the dynamic range to approximately 1:8, giving output frequencies in the range 50-400 Hz.

b) F0: This signal was subjected to a linear transformation (offset and gain), similar to that which we have used previously to code F0 for tactile presentation [9], which increased the dynamic range to approximately 1:6, giving output frequencies in the range 70-400 Hz.

c) F2: The dynamic range of this signal was maintained at approximately 1:5, giving output frequencies in the range 80-400 Hz.

d) FX: This signal was square-rooted to reduce the dynamic range to approximately 1:6, giving output frequencies in the range 70-400 Hz.

Output signals for the TAM control condition were at a constant frequency of 200 Hz and were keyed "on" when the speech amplitude envelope was above a threshold set at -15 dB re typical peaks in this envelope.
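The zero-crossing measures can be reconstructed in software. Below is a Python sketch of the F2/FX extraction as described above (group-of-eight cycle averaging, 10-ms smoothing, -35 dB gate); this is our digital approximation of what was originally analogue circuitry.

```python
import numpy as np
from scipy.signal import butter, filtfilt, lfilter

def smooth(x, fs, tau=0.010):
    """First-order low-pass smoothing with a 10-ms time constant."""
    a = np.exp(-1.0 / (fs * tau))
    return lfilter([1.0 - a], [1.0, -a], x)

def zero_cross_freq(x, fs, group=8, gate_db=-35.0):
    """Zero-crossing frequency track, averaged over each successive
    group of eight cycles, smoothed, and gated below -35 dB re peak."""
    zc = np.where((x[:-1] < 0.0) & (x[1:] >= 0.0))[0]    # cycle boundaries
    track = np.zeros(len(x))
    for k in range(0, len(zc) - group, group):
        span = zc[k + group] - zc[k]        # samples spanning `group` cycles
        track[zc[k]:zc[k + group]] = group * fs / span
    track = smooth(track, fs)
    env = smooth(np.abs(x), fs)             # gate on the input envelope
    track[env < np.abs(x).max() * 10.0 ** (gate_db / 20.0)] = 0.0
    return track

def f2_estimate(x, fs):
    """F2 feature: band-pass 1-3 kHz, then zero-crossing frequency."""
    b, a = butter(2, [1000.0, 3000.0], btype="band", fs=fs)
    return zero_cross_freq(filtfilt(b, a, x), fs)

def fx_estimate(x, fs):
    """FX feature: zero-crossing frequency of the unfiltered speech."""
    return zero_cross_freq(x, fs)
```

The a)-d) range compressions then amount to a square-root (AE, FX) or affine (F0) map of each track onto its stated output-frequency span.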
D. Test Procedure

Eleven normally hearing, young adult subjects (unpaid volunteers, age range 20-25 years) were played the test material via a loudspeaker in a soundproof test room at a level of 70 dB (A). In a single test session, each subject completed two vowel tests, two consonant tests, and two stress tests for one particular speech-derived signal (each test comprised 12 items), preceded by a short period of explanation and demonstration. This procedure was repeated in four subsequent sessions to cover all five types of signal. For each test item (e.g., hid/had/had) subjects were required to indicate, by circling the appropriate number on an answer sheet, the position in the three-element sequence of the element which occurred once only. (Note: this experiment does not involve lipreading; subjects respond on the basis of acoustic information only.)

E. Supplementary Experiment

Although the main experiment was designed with acoustic presentation, out of interest it was decided to also investigate tactile presentation of the same stimulus waveforms. In a similar procedure to that described above, six normally hearing subjects were presented, via a vibrator at the right index fingertip, with vibrotactile stimuli at a peak displacement level of 30 dB re a nominal threshold [9] of 0.5 μm. Noise masking via headphones was used to eliminate acoustic cues from the vibrator.

III. RESULTS

[Fig. 1: mean percent-correct discrimination scores for the five signals AE, F0, F2, FX, and TAM, in panels (a)-(h); only the panel labels and axis annotations survive from the original figure.]

[Fig. 2: Discrimination of (e) consonant manner, (f) consonant voicing, and (g) stress from acoustic presentation of the five speech-derived signals (top panel: mean percent-correct scores from 11 subjects) and from tactile presentation (bottom panel: mean percent-correct scores from six subjects). The chance score is 33% throughout.]

Mean percentage-correct scores from the main experiment are shown in Fig. 1. Scores for vowels are subdivided into the three types of distinction: (a) duration, (b) F1, and (c) F2. Scores for consonants are similarly subdivided into (d) place, (e) manner, and (f) voicing. Discrimination scores are also shown for (g) stress. The scores for (e), (f), and (g) are most relevant to a lipreading context since these distinctions are difficult on the basis of visual information alone: the means of the (e), (f), and (g) scores are shown in Fig. 1(h), and the individual scores are shown together in Fig. 2 (top panel). The bottom panel in Fig. 2 shows results for tests (e), (f), and (g) from the supplementary experiment with tactile presentation.

It is encouraging to note that, as might be predicted, the scores for discrimination of vowel F2 distinctions in Fig. 1(c) are highest when F2 is the signal presented [the difference between the score for F2 presentation and the next highest score in Fig. 1(c) is significant at the 0.95 level (single-sided Student t-test on paired data)]. Scores in Fig. 1(g) show that stress information is conveyed much better by voice fundamental frequency F0, which gives a score of 92.8 ± 2.3%, than by the other signals [the difference between the score for F0 presentation and the next highest score in Fig. 1(g) is significant at the 0.95 level (single-sided Student t-test on paired data)]. This is not surprising, since voice fundamental frequency F0 corresponds to voice pitch, and inflections of pitch in the acoustic signal of normal speech provide the main cues to stress patterns.
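The significance statements above follow a standard recipe. A sketch of the single-sided Student t-test on paired data in SciPy, with made-up per-subject scores (not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical percent-correct scores for 11 subjects on the same stress
# test under two signals -- illustrative values only.
scores_f0    = np.array([94, 91, 96, 89, 93, 95, 90, 92, 94, 93, 91], float)
scores_other = np.array([70, 66, 74, 62, 69, 71, 65, 68, 72, 70, 67], float)

# One-sided paired t-test (the `alternative` keyword needs SciPy >= 1.6);
# H1 is that the F0 scores exceed the other signal's scores.
t_stat, p_one = stats.ttest_rel(scores_f0, scores_other, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_one:.4f}")
# "Significant at the 0.95 level" corresponds to p_one < 0.05.
```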
Consonant manner is conveyed well by all five signals: all scores in Fig. 1(e) are significantly greater than 50% (single-sided Student t-test, 0.95 confidence level), with F2 and FX producing scores of 92.1 ± 1.9% and 95.5 ± 1.9%, respectively. (Note: a score of 50% on this type of three-alternative forced-choice test [19] corresponds to a discrimination index d' of 1.47. Scores in the range 90-96% correspond to d' in the range 4.03 to 5.03.) Vowel F1 and consonant place are conveyed badly by all five signals, with no scores in Fig. 1(b) or (d) significantly greater than 50% (single-sided Student t-test, 0.95 confidence level); however, transmission of these distinctions is not particularly important in a lipreading context since visual cues are available.

IV. DISCUSSION

The scores from the main experiment show no clear overall superiority for any of the speech-derived signals. Attempts to relate this finding to results from similar studies are hindered by the fact that other researchers have chosen to present their different speech-derived signals as different modulations of the acoustic signal (for example, F0 presented as varying stimulus frequency might be compared with AE presented as varying stimulus amplitude). Bearing in mind this caveat, the results presented in this paper are in line with those of Breeuwer and Plomp [12] and Grant et al. [13], who found the benefit to lipreading from acoustic presentation of F0 and of AE to be similar. We are not aware of any previous study in which a comparison of speech features has included second-formant frequency or zero-crossing frequency in isolation, although the former has been studied when presented in combination with other speech-derived signals [12], [14].

The TAM scores are, in general, surprisingly high in view of the crude signal processing involved: scores for the vowel-duration and consonant-manner distinctions are significantly greater than 50%, and four scores from seven are significantly greater than chance (single-sided Student t-test, 0.95 confidence level). This is in line with the unexpected finding of Thornton and Phillips [7] that the TAM aid can give significant benefit to lipreading.

Although lower overall, scores (Fig. 2, bottom panel) from the supplementary tactile experiment show a similar pattern to scores (Fig. 2, top panel) from the main experiment (correlation coefficient r = 0.86, 13 degrees of freedom), suggesting that it is valid to use the results from the main experiment to predict performance with tactile stimulation. It is to be expected that somewhat higher scores would be obtained if the stimuli were optimized for tactile presentation (to do this would involve lowering pulse repetition rates by a factor of at least two and providing additional cues by amplitude modulation of stimuli [9]). However, the experience of other investigators [20], [21] suggests that tactile perception of speech-derived signals will always be worse than acoustic perception.

V. CONCLUSION: CHOICE OF SPEECH FEATURE(S) IN A PRACTICAL DEVICE

The results of this experiment, conveniently summarized in Fig. 1(h), are equivocal. This suggests that the choice of best speech feature(s) for tactile presentation may rest principally on suitability for tactile perception after appropriate coding. In future work, we propose to address this issue, using a range of speech-to-tactile codings with amplitude- and frequency-modulated vibrotactile pulse output. On the basis of our experience from this and earlier experiments, the following coding strategies have been chosen for comparison:

1) output amplitude determined by AE, output frequency determined by AE;
2) output amplitude determined by F0, output frequency determined by F0;
3) output amplitude determined by F2', output frequency determined by F2';
4) output amplitude determined by AE, output frequency determined by F0;
5) output amplitude determined by AE, output frequency determined by F2'.

(Here F2' refers to a modified version of the F2 feature described above in which the upper limit on zero-crossings is increased to around 8 kHz by raising the low-pass cut-off of the filter which precedes the zero-crossing detector; hence, the characteristics of the F2' signal are intermediate to those of the F2 and FX features described above. This signal is intended to convey the most useful information from both the F2 and FX features: the gross spectral information of FX and the vowel-related information of F2.)

Strategies 1), 2), and 3) give an output in which the amplitude and frequency vary redundantly, corresponding to the most favorable condition for information transfer [9]. In strategies 2) and 3), speech-amplitude cues are reduced to "one-bit" information which derives from the operating thresholds of the F0 and F2' extractors. However, it may be that transmission of a greater amount of amplitude information is desirable for a practical scenario in which low-level background must be distinguished from the speech signals of interest. With this in mind, alternatives to strategies 2) and 3) are offered by strategies 4) and 5), in which the amplitude-insensitive parameters F0 and F2' are supplemented by amplitude information as AE (a sketch of strategy 4 is given below). All five coding strategies are designed to be implementable in an entirely wrist-worn aid, and the most successful strategy will be used with profoundly deaf subjects in a field trial of such a device, with testing against our earlier TAM aid.
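As an illustration of how these strategies might be realized digitally, here is a sketch of strategy 4): a pulse train whose repetition rate follows the F0 track and whose pulse amplitude follows the AE track, with Gaussian-like 2-ms pulses as used for the acoustic stimuli. This is our own reconstruction; parameter choices beyond those stated in the text are assumptions.

```python
import numpy as np

def strategy4_output(f0_track, ae_track, fs, width=0.002):
    """Coding strategy 4: output amplitude determined by AE, output
    frequency (pulse repetition rate) determined by F0."""
    # Gaussian-like pulse of ~2-ms width (sigma = width / 2).
    tp = np.arange(-2.0 * width, 2.0 * width, 1.0 / fs)
    pulse = np.exp(-0.5 * (tp / (width / 2.0)) ** 2)
    out = np.zeros(len(f0_track))
    phase = 0.0
    for n in range(len(f0_track)):
        phase += f0_track[n] / fs        # accumulated cycles of F0
        if phase >= 1.0:                 # emit one pulse per cycle
            phase -= 1.0
            end = min(len(out), n + len(pulse))
            out[n:end] += ae_track[n] * pulse[: end - n]
    return out

# Strategy 2) would pass the F0 track for both arguments, so that
# amplitude and frequency vary redundantly; where the F0 extractor gives
# no output (phase never advances), no pulses are emitted, which is the
# "one-bit" amplitude behaviour described above.
```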
ACKNOWLEDGMENT

We are grateful to our coworkers R. Gray, J. Stevens, and B. Brown for their assistance.

REFERENCES

[1] J. M. Weisenberger, "Communication of the acoustic environment via tactile stimuli," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 83-109.
[2] C. M. Reed, N. I. Durlach, L. A. Delhorne, W. M. Rabinowitz, and K. W. Grant, "Research on tactual communication of speech: Ideas, issues and findings," Volta Rev., vol. 91, pp. 65-78, 1989.
[3] C. E. Sherrick, "Basic and applied research on tactile aids for deaf people: Progress and prospects," J. Acoust. Soc. Am., vol. 75, pp. 1325-1342, 1984.
[4] I. R. Summers, "Signal processing strategies for single-channel systems," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 110-127.
[5] J. L. Mason and B. J. Frost, "Signal processing strategies for multichannel systems," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 128-145.
[6] L. E. Bernstein, "The evaluation of tactile aids," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 167-186.
[7] A. R. D. Thornton and A. J. Phillips, "A comparative trial of four vibrotactile aids," in Tactile Aids for the Hearing Impaired, I. R. Summers, Ed. London: Whurr, 1992, pp. 231-251.
[8] G. Plant, "A comparison of five commercially available tactile aids," Aust. J. Audiol., vol. 11, pp. 11-19, 1989.
[9] I. R. Summers, P. R. Dixon, P. G. Cooper, D. A. Gratton, B. H. Brown, and J. C. Stevens, "Vibrotactile and electrotactile perception of time-varying pulse trains," J. Acoust. Soc. Am., vol. 95, pp. 1548-1558, 1994.
[10] A. Boothroyd, "Perception of speech pattern contrasts from auditory presentation of voice fundamental frequency," Ear Hear., vol. 9, pp. 313-321, 1988.
[11] I. R. Summers and M. C. Martin, "A tactile sound level monitor for the profoundly deaf," Brit. J. Audiol., vol. 14, pp. 30-33, 1980.
[12] M. Breeuwer and R. Plomp, "Speechreading supplemented with auditorily presented speech parameters," J. Acoust. Soc. Am., vol. 79, pp. 481-499, 1986.
[13] K. W. Grant, L. H. Ardell, P. K. Kuhl, and D. W. Sparks, "The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects," J. Acoust. Soc. Am., vol. 77, pp. 671-677, 1985.
[14] P. J. Blamey, L. F. A. Martin, and G. M. Clark, "A comparison of three speech coding strategies using an acoustic model of a cochlear implant," J. Acoust. Soc. Am., vol. 77, pp. 209-217, 1985.
[15] L. E. Bernstein, S. P. Eberhardt, and M. E. Demorest, "Single-channel vibrotactile supplements to visual perception of intonation and stress," J. Acoust. Soc. Am., vol. 85, pp. 397-405, 1989.
[16] P. J. Blamey, R. S. C. Cowan, J. I. Alcantara, and G. M. Clark, "Phonemic information transmitted by a multichannel electrotactile speech processor," J. Speech Hear. Res., vol. 31, pp. 620-629, 1988.
[17] G. S. Dodgson, B. H. Brown, I. L. Freeston, and J. C. Stevens, "Electrical stimulation at the wrist as an aid to the profoundly deaf," Clin. Phys. Physiol. Meas., vol. 4, pp. 403-416, 1983.
[18] D. M. Howard, "Peak-picking fundamental period estimation for hearing prostheses," J. Acoust. Soc. Am., vol. 86, pp. 902-910, 1990.
[19] B. J. Craven, "A table of d' for M-alternative odd-man-out forced-choice procedures," Percept. Psychophys., vol. 51, pp. 379-385, 1992.
[20] C. M. Reed, M. S. Bratakos, L. A. Delhorne, and F. Denesvich, "A comparison of auditory and tactile presentation of a single-band envelope cue as a supplement to speechreading," presented at the 3rd Int. Conf. Tactile Aids, Miami, FL, 1994.
[21] T. Hnath-Chisolm and L. Kishon-Rabin, "Tactile presentation of voice fundamental frequency as an aid to the perception of speech pattern contrasts," Ear Hear., vol. 9, pp. 329-334, 1988.

Ian R. Summers received the B.A. in physics from the University of Oxford in 1974 and the Ph.D. in acoustics from the University of London in 1977. He has been involved with tactile-aid research since 1977 and currently lectures on the Physics with Medical Physics program in the Department of Physics, University of Exeter. He was Editor of the 1992 book Tactile Aids for the Hearing Impaired. His other research interests include more fundamental aspects of tactile perception and targeted magnetic-resonance imaging.
Denise A. Gratton received the Ph.D. in neurophysiology from the University of Sheffield in 1973. From 1973 to 1983, she worked in the Physiology Department at Sheffield on information transfer along the sensory pathways of the rat and how anaesthetic agents block this transfer. Since then, she has been involved in research to determine the limits of the skin senses to convey tactile information. She is currently Coordinator of the cochlear implant team at the Central Sheffield University Hospitals.