JP2008102551A

JP2008102551A - Apparatus for processing voice signal and processing method thereof

Info

Publication number: JP2008102551A
Application number: JP2007335479A
Authority: JP
Inventors: Masami Miura; 雅美三浦
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-12-27
Filing date: 2007-12-27
Publication date: 2008-05-01

Abstract

PROBLEM TO BE SOLVED: To improve word intelligibility under masking by voiced sound and under successive (temporal) masking.
SOLUTION: An apparatus for processing a speech signal is provided with an amplitude changing means 15, which changes the amplitude of the consonant band of an input speech signal S17, and an extraction means 21, which extracts a pitch component and a formant component of the signal S17. The apparatus is also provided with a level calculation means 22, which calculates a signal indicating the level of the voiced sound from the output of the extraction means 21, and a detection means 23, which detects the start point and the end point of the voiced sound in the input speech signal S17 from the output of the level calculation means 22. A control means 24 controls the amplitude changing means 15 so that its gain is raised above a reference value only in the section from the detected start point of the voiced sound to the fall that follows the end determination point, and so that its gain is lowered to the reference value in the fall section that follows the detected end point of the voiced sound.
COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to an audio signal processing apparatus and a processing method therefor.

When speech is transmitted or reproduced, its intelligibility drops if the transmission or reproduction system has a lot of reverberation or echo. In such cases, processing is applied such as slowing the speech rate, or breaking continuously uttered speech sounds into small pieces and reproducing them with pauses in between.

In addition, when high frequencies such as consonants are hard to hear, the high frequencies are sometimes emphasized by frequency equalizer processing. Processing that applies a weighting function accounting for so-called successive masking (the phenomenon in which, when a high-energy vowel is followed by a consonant, the vowel masks the consonant) has also been attempted.

The above processing is also sometimes applied for hearing-impaired or elderly listeners.

Prior art documents include, for example, the following.
Japanese Patent Laid-Open No. 8-179792
Japanese Patent Laid-Open No. 9-16193

However, slowing the speech rate or breaking up continuously uttered speech sounds, as described above, causes the following problems.

1. A time lag arises relative to the original speech, so immediacy is lost. The processing therefore cannot be used for conversation. Even when listening to a broadcast, it takes longer to finish listening.
2. The rate of change of the speech components is an important cue for perceiving speech sounds, so slowing the speech rate can alter this cue and cause a different speech sound to be perceived.
3. When speech sounds are broken up and reproduced slowly, the information carried by the speech sound as a unit and by its transitional portions is lost, which can reduce intelligibility.
4. Speech whose high frequencies are always amplified by frequency equalizer processing has an unbalanced timbre and can be unpleasant or hard to hear.
5. Processing that applies a weighting function accounting for successive masking introduces a delay of at least the time length of the weighting function, so immediacy is lost. As a result, a time lag arises between the mouth movement and the processed sound, which can harm intelligibility. In addition, when there is acoustic feedback from the earphone to the microphone, the delay causes a reverberation-like phenomenon.

The present invention is intended to solve the above problems.

In this invention, an audio signal processing apparatus comprises:
amplitude changing means for changing the amplitude of the consonant band of an input audio signal;
extraction means for extracting a pitch component and a formant component of the input audio signal;
level calculation means for calculating, from the signal extracted by the extraction means, a signal indicating the level of voiced sound;
voiced sound start point detecting means for detecting, from the output of the level calculation means, the start point of the voiced sound in the input audio signal;
voiced sound end point detecting means for detecting, from the output of the level calculation means, the end point of the voiced sound in the input audio signal; and
control means for controlling the amplitude changing means so that the gain of the amplitude changing means is raised above a reference value only in the section from the start point of the voiced sound, as detected by the voiced sound start point detecting means and the voiced sound end point detecting means, to the fall following the end determination point, and so that the gain of the amplitude changing means is lowered to the reference value in the fall section following the end point of the voiced sound detected by the voiced sound end point detecting means.

According to the present invention, speech becomes clearer and intelligibility can be improved. There is also none of the discomfort that arises when the high range of speech is always emphasized. Furthermore, no time difference arises between the speaker's mouth movement and the processed sound.

Ordinary conversational speech consists of a combination of low-frequency and high-frequency components. Sounds found in everyday living environments are also often combinations of low-frequency and high-frequency components.

In hearing, low-frequency components are known to mask high-frequency components, and this masking is also at work when speech is perceived. For listeners with normal hearing, the interference from this masking is small, and speech is perceived correctly despite it. For hearing-impaired listeners, however, the interference is large and can persist for a long time, which is one of the causes of poor word comprehension.

Even for listeners with normal hearing, word comprehension can deteriorate when noise with strong low-frequency components is present. So-called successive masking also reduces word intelligibility.

The present invention therefore aims to suppress the loss of intelligibility caused by such masking or successive masking.

To this end, in one embodiment of the present invention, a period of several milliseconds to a little over ten milliseconds from the start point of a voiced sound is taken as the rising period, and a period of a little over ten milliseconds to several tens of milliseconds from the end point of the voiced sound is taken as the falling period. The high-frequency components are boosted during these rising and falling periods.

FIG. 1 shows one embodiment of the present invention. An unprocessed audio signal S11 is supplied through an input terminal 11 to a filter 12 whose passband is the voiced sound band. The filter 12 extracts the voiced sound signal component S12, which is supplied to an adder circuit 13. The signal S11 from the terminal 11 is also supplied to a filter 14 whose passband is the consonant band. The filter 14 extracts the consonant signal component S14, which is supplied to the adder circuit 13 through a variable gain amplifier 15.

The adder circuit 13 therefore adds the signal components S12 and S14. If the gain G15 of the variable gain amplifier 15 is the reference gain (for example, unity), the adder circuit 13 outputs an audio signal S13 that contains the voiced sound component S12 and the consonant component S14 of the audio signal S11 in equal proportion. This signal S13 is delivered to an output terminal 16.
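As an illustration of the summation in the adder circuit 13, the following minimal Python sketch computes one output sample as S13 = S12 + G15 × S14. The function name `mix_sample` and the numeric values are hypothetical and are not taken from the specification. The two band filters are assumed to have already produced the samples `s12` and `s14`.

```python
def mix_sample(s12: float, s14: float, g15: float = 1.0) -> float:
    """Sum formed by adder circuit 13: voiced band plus consonant band scaled by G15.

    g15 = 1.0 stands for the reference gain (unity is the example given in
    the text); any larger value used here is an illustrative assumption.
    """
    return s12 + g15 * s14


# With the reference gain, both components pass in equal proportion.
print(mix_sample(0.5, 0.25))        # 0.75
# With a boosted gain (hypothetical factor 2), only the consonant band grows.
print(mix_sample(0.5, 0.25, 2.0))   # 1.0
```

Because only the consonant path passes through the amplifier, raising G15 leaves the voiced sound component untouched.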

In addition, the signal S11 at the terminal 11 is supplied, for preprocessing, to a band pass filter 21 and then to a level calculation circuit 22. The band pass filter 21 extracts the pitch component and the formant component from the signal S11 as a signal S21, so that the start and end points of the voiced sound are easy to detect and the influence of noise is small. The passband of the band pass filter 21 is therefore set to, for example, 150 Hz to 1000 Hz.

The level calculation circuit 22 forms a signal S22 indicating the level of the signal S21, for example by full-wave rectifying the signal S21 and extracting its low-frequency components (for example, components at or below 60 Hz).
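The level calculation can be sketched in Python as full-wave rectification followed by a low-pass filter. The one-pole recursive filter used here is an assumption chosen for brevity: the text only states that components at or below about 60 Hz are extracted and does not specify the filter type. The function name `level_signal` and the sample rate in the usage lines are likewise hypothetical.

```python
import math


def level_signal(s21, fs, cutoff_hz=60.0):
    """Return a level estimate (a stand-in for S22) for each sample of S21.

    s21:       samples at the output of band pass filter 21
    fs:        sample rate in Hz (assumed; not given in the specification)
    cutoff_hz: low-pass cutoff; 60 Hz is the example value from the text
    """
    # Coefficient of a one-pole low-pass filter (assumed filter type).
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    level = 0.0
    out = []
    for x in s21:
        rectified = abs(x)                  # full-wave rectification
        level += alpha * (rectified - level)  # smooth toward the rectified value
        out.append(level)
    return out


# A constant-amplitude input settles toward its absolute value.
s22 = level_signal([-1.0] * 2000, fs=8000.0)
print(s22[-1] > 0.99)   # True
```

A smoother of this kind adds no look-ahead, which is consistent with the stated goal of processing without delay relative to the original speech.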

The calculated signal S22 from the level calculation circuit 22 is supplied to a detection circuit 23, which detects the start and end points of the voiced sound. The resulting detection signal S23 is supplied to a control circuit 24, which forms a control signal S24. This signal S24 is supplied to the variable gain amplifier 15 as the control signal for its gain G15.

The detection of the start and end points of the voiced sound and the magnitude of the gain G15 of the amplifier 15 are related, for example, as shown in FIG. 2. While the voiced sound level indicated by the calculated signal S22 is below the start determination threshold, the gain G15 of the amplifier 15 stays at the reference value. Once the voiced sound level exceeds the start determination threshold, the gain G15 is raised gradually to its maximum value over a rising period of several milliseconds to a little over ten milliseconds.

While the voiced sound level indicated by the calculated signal S22 is above the end determination threshold, the gain G15 remains high. Once the voiced sound level falls below the end determination threshold, the gain G15 is lowered gradually to the reference value over a falling period of several tens of milliseconds to about 200 milliseconds.

With this configuration, when the unprocessed audio signal S11 contains a voiced sound component, the signal S24 raises the gain G15 of the amplifier 15 during the period from the start point to the end point of that voiced sound, so the consonant signal component S14 passing through the amplifier 15 is larger during that period.

Consequently, the level of the consonant component S14 in the audio signal S13 output at the terminal 16 is raised from the start point to the end point of the voiced sound. Even if masking occurs in the reproduced sound of the signal S13, the consonants are louder by an amount commensurate with that masking, so word comprehension is improved.

In the period shortly after the end point of a voiced sound, successive masking is strong, but the consonant component S14 is strongly amplified, so intelligibility is effectively improved against successive masking as well. When the interval until the start of the next consonant is long, successive masking is weak, but in that period the consonant component S14 is amplified only slightly, so the timbre balance is not upset.

FIG. 3 shows one form of the method by which the detection circuit 23 and the control circuit 24 form the control signal S24 from the signal S22. In this case, the entire circuit shown in FIG. 1 is digitized and implemented, for example, on a DSP. The audio signal S11 is a digital audio signal obtained by A/D conversion of the original unprocessed analog audio signal.

In the detection circuit 23 and the control circuit 24, the processing routine 100 of FIG. 3 is executed for each sample of the digital audio signal S11, and the gain G15 of the amplifier 15 is controlled, for example, as shown in FIG. 2. The variables used in the routine 100 and in the following description have the meanings below.

e(i): level indicated by the i-th sample of the audio signal S11.
threshold1: threshold for determining the end of voiced sound. When the signal S11 falls below this value, the voiced sound is judged to have ended.
threshold2: threshold for determining the start of voiced sound. When the signal S11 rises above this value, the voiced sound is judged to have started. The thresholds are set so that threshold1 ≦ threshold2.
w: weighting coefficient for controlling the gain G15, with 0 ≦ w ≦ 1. G15 is the reference gain when w = 0 and the maximum gain when w = 1.
d1: step width used when decreasing the coefficient w.
d2: step width used when increasing the coefficient w.
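The variable definitions fix only the two end points of the mapping from w to G15 (reference gain at w = 0, maximum gain at w = 1). The Python sketch below assumes a linear interpolation between them, which the text does not state. It also shows how step widths could be derived from a desired rising or falling period, assuming w changes by one fixed step per sample. The names `step_width` and `gain_g15`, the maximum gain of 4.0, and the 16 kHz sample rate are all hypothetical.

```python
def step_width(period_s: float, fs: float) -> float:
    """Per-sample step that moves w across its full 0..1 range in period_s seconds."""
    return 1.0 / (period_s * fs)


def gain_g15(w: float, g_ref: float = 1.0, g_max: float = 4.0) -> float:
    """Map the weight w to a gain: g_ref at w = 0, g_max at w = 1.

    The linear interpolation and the value of g_max are assumptions.
    """
    return g_ref + w * (g_max - g_ref)


fs = 16000.0                 # assumed sample rate in Hz
d2 = step_width(0.010, fs)   # rising period of 10 ms (within the range given in the text)
d1 = step_width(0.100, fs)   # falling period of 100 ms (within the range given in the text)
print(d2 > d1)               # True: the gain rises faster than it falls
print(gain_g15(0.0), gain_g15(0.5), gain_g15(1.0))   # 1.0 2.5 4.0
```

Choosing d2 larger than d1 reproduces the asymmetry described for FIG. 2, where the rising period is shorter than the falling period.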

In the routine 100, step 101 first determines whether the signal level e(i) of the i-th sample is smaller than the start determination threshold threshold2. If it is smaller, processing proceeds from step 101 to step 102.

Step 102 determines whether the signal level e(i) of the i-th sample is smaller than the end determination threshold threshold1. If it is smaller, processing proceeds from step 102 to step 103, where the coefficient w is decreased by the step width d1, and the routine 100 ends. Therefore, as shown in FIG. 2, once the end point of the voiced sound has been detected, the gain G15 gradually decreases.

If, in step 102, the signal level e(i) of the i-th sample is equal to or greater than the end determination threshold threshold1, the routine 100 ends directly after step 102. Therefore, as shown in FIG. 2, the gain G15 is held during the period until the end of the voiced sound is detected (the period in which the gain G15 is large).

If, in step 101, the signal level e(i) of the i-th sample is equal to or greater than the start determination threshold threshold2, processing proceeds from step 101 to step 104, where the coefficient w is increased by the step width d2, and the routine 100 ends. Therefore, as shown in FIG. 2, once the start point of the voiced sound has been detected, the gain G15 gradually increases.
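Steps 101 to 104 can be sketched as a single per-sample update in Python. The function name `routine_100` and the numeric values in the usage lines are hypothetical. Clamping w to the range 0 to 1 is an assumption added here to honor the stated range of w; the step descriptions themselves do not mention it.

```python
def routine_100(e_i, w, threshold1, threshold2, d1, d2):
    """One per-sample update of the weight w, following steps 101 to 104.

    e_i:        level of the current sample
    threshold1: end determination threshold (threshold1 <= threshold2)
    threshold2: start determination threshold
    d1, d2:     step widths for decreasing and increasing w
    """
    if e_i < threshold2:        # step 101: below the start threshold?
        if e_i < threshold1:    # step 102: below the end threshold?
            w -= d1             # step 103: voiced sound ended, lower w
        # otherwise: between the two thresholds, hold w unchanged
    else:
        w += d2                 # step 104: voiced sound present, raise w
    # Assumed clamp so that w stays within 0..1.
    return min(1.0, max(0.0, w))


# Illustrative run with values chosen to be exact in binary floating point.
w = 0.0
trajectory = []
for level in [0.0, 0.75, 0.75, 0.375, 0.0, 0.0]:
    w = routine_100(level, w, threshold1=0.25, threshold2=0.5, d1=0.25, d2=0.5)
    trajectory.append(w)
print(trajectory)   # [0.0, 0.5, 1.0, 1.0, 0.75, 0.5]
```

The fourth sample lies between the two thresholds, so w is held, which corresponds to the hold behavior described for step 102.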

In this way, the routine 100 corrects the consonant level by controlling the gain G15 of the amplifier 15 according to the level of the voiced sound. It can therefore compensate for the perceived attenuation of consonant components caused by masking and successive masking, and can improve the intelligibility of speech, such as the consonant portions of conversation.

FIG. 4 shows observed speech waveforms. FIG. 4A is an example of the waveform of the audio signal S11 without the processing of the routine 100, and FIG. 4B is an example of the waveform of the audio signal S13 with the processing of the routine 100. The utterance is "１行目に書いてください" ("Please write on the first line").

In section B, from the start point to the end point of a voiced sound, the consonant portions are strongly amplified. In the short periods following the end point of a voiced sound (arrows A and F), successive masking is strong, so the consonants are strongly amplified. In the long periods before the start of the next consonant (arrows C, D, and E), successive masking is weak, so the consonants are amplified only slightly.

Therefore, the processing circuit described above provides the following effects when speech is transmitted or reproduced through a system with reverberation or echo, or when a hearing-impaired or elderly person listens to speech.
1. Consonants are emphasized so that only the successive masking of the next uttered sound is reduced, so speech becomes clearer and intelligibility can be improved.
2. Consonants are emphasized only while masking is occurring, so there is none of the discomfort of an unbalanced timbre that arises when the high range is always emphasized.
3. The processing can in principle be performed immediately, so no time difference arises between the speaker's mouth movement and the processed sound. Even if there is acoustic feedback from the earphone to the microphone, the result does not sound reverberant and is easy to listen to.
4. The rate of change of the speech components, which is important for perceiving speech sounds, the information carried by the speech sound as a unit, and the information in its transitional portions are not lost.
5. The processing routine 100 of FIG. 3 has few processing steps, so even a relatively slow DSP can handle it adequately.

The routine 200 shown in FIG. 5 is a simplification of the routine 100 obtained by setting
threshold = threshold1 = threshold2
That is, in the routine 200,
threshold: threshold for determining both the start and the end of voiced sound. When the signal S11 is smaller than this value, the voiced sound is judged to have ended; when it is larger, the voiced sound is judged to have started.
All other variables are the same as in the routine 100.

In step 201, the signal level e(i) of the i-th sample is compared with the threshold threshold. If the level e(i) is smaller than threshold, step 202 decreases the coefficient w by the step width d1. Otherwise, step 203 increases the coefficient w by the step width d2.
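Steps 201 to 203 reduce to a single comparison per sample, as the following Python sketch shows. The function name `routine_200` and the numeric values are hypothetical. As before, clamping w to the range 0 to 1 is an assumption based on the stated range of w.

```python
def routine_200(e_i, w, threshold, d1, d2):
    """One per-sample update of the weight w with a single threshold (steps 201 to 203)."""
    if e_i < threshold:     # step 201: compare the level with the threshold
        w -= d1             # step 202: below the threshold, lower w
    else:
        w += d2             # step 203: at or above the threshold, raise w
    # Assumed clamp so that w stays within 0..1.
    return min(1.0, max(0.0, w))


# Illustrative run with values chosen to be exact in binary floating point.
w = 0.0
trajectory = []
for level in [0.0, 0.75, 0.75, 0.375, 0.0]:
    w = routine_200(level, w, threshold=0.5, d1=0.25, d2=0.5)
    trajectory.append(w)
print(trajectory)   # [0.0, 0.5, 1.0, 0.75, 0.5]
```

With a single threshold there is no hold region, so w starts to fall as soon as the level drops below the threshold.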

The routine 200 therefore makes the processing even simpler and places an even lighter load on the DSP.

FIG. 1 is a system diagram showing one embodiment of the present invention.
FIG. 2 is a diagram for explaining the present invention.
FIG. 3 is a flowchart showing part of one embodiment of the present invention.
FIG. 4 is a waveform diagram for explaining the present invention.
FIG. 5 is a flowchart showing part of one embodiment of the present invention.

Explanation of symbols

12 ... voiced sound band filter, 13 ... adder circuit, 14 ... consonant band filter, 15 ... variable gain amplifier, 21 ... band pass filter, 22 ... level calculation circuit, 23 ... detection circuit, 24 ... control circuit, 100 ... processing routine

Claims

Amplitude changing means for changing the amplitude of the consonant band of the input voice signal;
Extraction means for extracting the pitch component and formant component of the input audio signal;
Level calculation means for calculating a signal indicating the level of voiced sound from the signal extracted by the extraction means;
Voiced sound start point detecting means for detecting the start point of the voiced sound in the input voice signal from the output of the level calculating means;
Voiced sound end point detecting means for detecting the end point of the voiced sound in the input voice signal from the output of the level calculating means;
Control means for controlling the amplitude changing means so that the gain of the amplitude changing means is raised above a reference value only in the section from the start point of the voiced sound, as detected by the voiced sound start point detecting means and the voiced sound end point detecting means, to the fall following the end determination point, and so that the gain of the amplitude changing means is lowered to the reference value in the fall section following the end point of the voiced sound detected by the voiced sound end point detecting means; an audio signal processing apparatus comprising the above means.

The audio signal processing device according to claim 1,
An audio signal processing apparatus in which the voiced sound start point detecting means outputs a detection signal for the start point when the signal level of the voiced sound exceeds a predetermined threshold value.

The audio signal processing device according to claim 1,
An audio signal processing apparatus in which the voiced sound end point detecting means outputs a detection signal for the end point when the signal level of the voiced sound falls below a predetermined threshold value.

The audio signal processing device according to claim 1,
An audio signal processing apparatus in which the amplitude changing means is a variable gain amplifier.

In the audio signal processing apparatus according to claim 1,
An audio signal processing apparatus configured so that, when a period of several milliseconds to a little over ten milliseconds from the start point of the voiced sound is taken as the rising period and a period of a little over ten milliseconds to several tens of milliseconds from the end point of the voiced sound is taken as the falling period, the gain of the amplitude changing means is changed during the rising period and the falling period.

Amplitude changing means for changing the amplitude of the consonant band of the input voice signal;
Extraction means for extracting the pitch component and formant component of the input audio signal;
Level calculation means for calculating a signal indicating the level of voiced sound from the signal extracted by the extraction means;
Voiced sound start point detecting means for detecting the start point of the voiced sound in the input voice signal from the output of the level calculating means;
Voiced sound end point detecting means for detecting the end point of the voiced sound in the input voice signal from the output of the level calculating means;
Control means for controlling the amplitude changing means so that the gain of the amplitude changing means is raised above the reference value only in the section from the start point of the voiced sound, as detected by the voiced sound start point detecting means and the voiced sound end point detecting means, to the fall following the end determination point, and so that the gain of the amplitude changing means is changed in accordance with the level of the voiced sound.

An amplitude changing step for changing the amplitude of the consonant band of the input audio signal;
An extraction step for extracting a pitch component and a formant component of the input audio signal;
A level calculation step for calculating a signal indicating the level of voiced sound from the signal extracted in the extraction step;
A voiced sound start point detecting step for detecting a start point of the voiced sound in the input voice signal from an output of the level calculating step;
A voiced sound end point detecting step of detecting an end point of the voiced sound in the input voice signal from the output of the level calculating step;
A control step for controlling the gain of the amplitude changing step based on outputs of the voiced sound start point detection step and the voiced sound end point detection step;
In this control step, when the voiced sound start point detecting step detects the start point, a control signal is supplied to the amplitude changing step so that the gain becomes larger than a reference value, and when the voiced sound end point detecting step detects the end point, the control signal is supplied to the amplitude changing step so as to return the gain to the reference value.

An amplitude changing step for changing the amplitude of the consonant band of the input audio signal;
An extraction step for extracting a pitch component and a formant component of the input audio signal;
A level calculation step for calculating a signal indicating a level from the signal extracted in the extraction step;
A voiced sound start point detecting step for detecting a start point of the voiced sound in the input voice signal from an output of the level calculating step;
A voiced sound end point detecting step of detecting an end point of the voiced sound in the input voice signal from the output of the level calculating step;
A control step for controlling the gain of the amplitude changing step based on outputs of the voiced sound start point detection step and the voiced sound end point detection step;
In this control step, when the voiced sound start point detecting step detects the start point, a control signal is supplied to the amplitude changing step so that the gain becomes larger than a reference value, and the control signal is supplied to the amplitude changing step so as to change the gain according to the level of the voiced sound.
A method for processing an audio signal, comprising the above steps.