JPS5982608A

JPS5982608A - System for controlling reproducing speed of sound

Info

Publication number: JPS5982608A
Application number: JP57192310A
Authority: JP
Inventors: Katsumi Hosoya; 細谷　克美
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1982-11-01
Filing date: 1982-11-01
Publication date: 1984-05-12

Abstract

PURPOSE: To obtain high-quality reproduced sound by dividing audio data into fixed-length frames, detecting whether each frame is a silent section, an unvoiced consonant section, a stationary vowel section, or a non-stationary section, and thinning or interpolating the frame according to the result. CONSTITUTION: An audio signal is transferred frame by frame from an input buffer 3 to a silence detection section 4. The detection section 4 calculates the power and the zero-crossing count of the signal in the frame. When these exceed their thresholds, the frame is regarded as a non-stationary section (c) or a stationary vowel section (d) and is transferred to a stationarity determination section 5. Otherwise, the frame is regarded as a silent section (a) or an unvoiced consonant section (b) and is transferred to an interpolation/thinning section 7. The determination section 5 calculates the autocorrelation coefficient of the audio signal. When its maximum value exceeds a threshold, the frame is regarded as section (d) and is transferred to a fundamental period extraction section 6; otherwise it is regarded as section (c) and is transferred to an output buffer 8. The extraction section 6 calculates the fundamental period T of the audio signal from the lag at which the maximum occurs and transfers T together with the audio data to the interpolation/thinning section 7, which interpolates or thins in units of T according to the playback speed.

Description

Detailed Description of the Invention

Technical Field of the Invention

The present invention relates to an audio playback speed control system that can increase or decrease the playback speed of recorded speech without changing the frequency of the reproduced sound.

Prior Art and Problems

In tape recorders and similar devices, changing the playback speed changes the frequency of the reproduced sound along with it, which makes the sound very hard to listen to. To eliminate this drawback, schemes have been proposed that divide the speech into frames of several tens of milliseconds and thin out or interpolate the speech frame by frame, so that the playback speed can be increased or decreased without changing the frequency of the reproduced sound. However, click noise occurs at the discontinuities where adjacent frames are joined after thinning or interpolation, so the clarity of the reproduced sound is degraded.

To remedy this drawback, a scheme has also been proposed that calculates the cross-correlation coefficient at the frame joint and finely adjusts the joining time so that the frames connect continuously. However, because the length and position of the frames to be thinned or interpolated are fixed regardless of the properties of the speech, the clarity of phonemes whose state changes within a short time, such as plosives, is degraded.

In addition, a scheme that expands or compresses only the silent portions has been proposed, but it cannot control the speed of the actual speech portions.

Object of the Invention

The present invention remedies the drawbacks described above. Its object is to allow the playback speed to be increased or decreased without changing the frequency of the reproduced sound and without degrading its quality. An embodiment is described in detail below.

Embodiment of the Invention

First, the present invention is described in detail with reference to the audio signal waveform diagram of FIG. 1. In the figure, a denotes a silent section, b a stationary unvoiced consonant section, c a non-stationary section, and d a stationary vowel section.

In the silent section a and the stationary unvoiced consonant section b, the amplitude of the waveform is small, as illustrated, so thinning or interpolating with a suitable section length has almost no effect on the clarity of the reproduced sound. In the stationary vowel section d, similar waveforms repeat every fundamental period T of the speech, as illustrated. By thinning or interpolating in units of the fundamental period T, the reproduced sound can therefore be expanded or compressed without changing the frequency components of the speech and with almost no waveform discontinuities.

In the non-stationary section c, however, the properties of the speech waveform change abruptly. Thinning or interpolating there causes the phoneme to lose its distinguishing features or, in the case of a short phoneme, causes the phoneme itself to drop out, and the clarity of the reproduced sound is degraded.

For these reasons, the present invention thins or interpolates with a suitable section length in the silent section a and the stationary unvoiced consonant section b, performs neither thinning nor interpolation in the non-stationary section c, and thins or interpolates in units of the fundamental period T of the speech in the stationary vowel section d. The playback speed can thus be increased or decreased without changing the frequency components of the speech and with almost no discontinuities.
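The per-section policy described above can be summarized as a small lookup. This is only an illustrative sketch: the names `FrameClass` and `stretch_unit` are hypothetical and do not appear in the source, and unit lengths are assumed to be expressed in samples.

```python
from enum import Enum


class FrameClass(Enum):
    """The four section types labelled a to d in FIG. 1."""
    SILENT = "a"
    UNVOICED_CONSONANT = "b"
    NON_STATIONARY = "c"
    STATIONARY_VOWEL = "d"


def stretch_unit(frame_class, fixed_unit, fundamental_period):
    """Return the unit length (in samples) used for thinning/interpolation.

    Returns None when the frame must be passed through unchanged.
    """
    if frame_class in (FrameClass.SILENT, FrameClass.UNVOICED_CONSONANT):
        return fixed_unit              # any suitable fixed section length
    if frame_class is FrameClass.STATIONARY_VOWEL:
        return fundamental_period      # repeat or drop whole pitch periods
    return None                        # non-stationary: no thinning, no interpolation
```

For example, `stretch_unit(FrameClass.NON_STATIONARY, 40, 64)` returns `None`, reflecting that non-stationary frames are reproduced as they are.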

FIG. 2 is a block diagram of an embodiment of the present invention, in which 1 is an input terminal for an analog audio signal, 2 is an AD converter, 3 is an input buffer memory, 4 is a silence detection section, 5 is a stationarity determination section, 6 is a fundamental period extraction section, 7 is an interpolation/thinning section, 8 is an output buffer memory, 9 is a DA converter, and 10 is an output terminal.

The analog audio signal from the input terminal 1 is digitally encoded by the AD converter 2 and then stored in the input buffer memory 3 under a clock signal of constant period t. Audio data is read out of the input buffer memory 3 one frame at a time and transferred to the silence detection section 4.

Assume that one frame holds, for example, 32 milliseconds of audio data. Based on the audio data in the frame, the silence detection section 4 calculates the power and the zero-crossing count of the corresponding audio signal. If these are greater than predetermined thresholds, the frame is regarded as sounded, that is, as a non-stationary section c or a stationary vowel section d, and its audio data is transferred to the stationarity determination section 5. If they are at or below the predetermined thresholds, the frame is regarded as soundless, that is, as a silent section a or an unvoiced consonant section b, and the audio data in the frame is transferred to the interpolation/thinning section 7.
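As a rough illustration of the silence detection step, the sketch below computes the two frame measurements in plain Python. The source does not say how the power and zero-crossing results are combined or what the thresholds are; requiring both to exceed their thresholds, and the function names themselves, are assumptions made here.

```python
def frame_power(frame):
    """Mean squared amplitude of the samples in one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def is_sounded(frame, power_threshold, zc_threshold):
    """True if the frame is treated as section c or d, False for a or b.

    Assumption: both measurements must exceed their (hypothetical)
    thresholds; the source only says "they" are compared with thresholds.
    """
    return (frame_power(frame) > power_threshold
            and zero_crossings(frame) > zc_threshold)
```

A sounded frame would then go to the stationarity determination, while a soundless one would go straight to interpolation/thinning.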

Based on the audio data in the frame, the stationarity determination section 5 calculates the autocorrelation coefficient of the corresponding audio signal. If its maximum value is greater than a predetermined threshold, the frame is regarded as periodic, that is, as a stationary vowel section d, and the audio data in the frame is transferred to the fundamental period extraction section 6. If the maximum value of the autocorrelation coefficient is at or below the predetermined threshold, the frame is regarded as a non-stationary section c, and the audio data in the frame is passed to the output buffer memory 8.
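One way to picture the stationarity test is the following sketch. The normalization by the zero-lag energy, the restriction of the search to a lag range that excludes lag 0, and all names are assumptions; the source only states that the maximum of the autocorrelation coefficient is compared with a threshold.

```python
def normalized_autocorrelation(frame, lag):
    """Autocorrelation at `lag`, divided by the zero-lag energy."""
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    acc = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
    return acc / energy


def autocorrelation_peak(frame, min_lag, max_lag):
    """Return (peak value, lag of the peak) over min_lag..max_lag."""
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda k: normalized_autocorrelation(frame, k))
    return normalized_autocorrelation(frame, best_lag), best_lag


def is_stationary_vowel(frame, threshold, min_lag, max_lag):
    """True if the peak exceeds the (hypothetical) threshold."""
    peak, _ = autocorrelation_peak(frame, min_lag, max_lag)
    return peak > threshold
```

For a frame that repeats exactly every 8 samples, the peak over lags 4 to 20 falls at lag 8, which is the value the fundamental period extraction would use.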

The fundamental period extraction section 6 calculates the fundamental period T of the audio signal from the lag at which the autocorrelation coefficient takes its maximum value, and transfers the calculated fundamental period T together with the audio data to the interpolation/thinning section 7.
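The relation between the peak lag and the fundamental period can be written out as follows. Here r(k) denotes the autocorrelation coefficient at lag k and t is the sampling clock period mentioned earlier; the search bounds k_min and k_max and the numerical values in the example are assumptions, not figures from the source.

```latex
k^{*} = \arg\max_{k_{\min} \le k \le k_{\max}} r(k), \qquad T = k^{*}\, t
% Worked example under an assumed 8 kHz sampling rate:
% t = 125\ \mu\mathrm{s},\quad k^{*} = 64
% \;\Rightarrow\; T = 64 \times 125\ \mu\mathrm{s} = 8\ \mathrm{ms}
% \;\Rightarrow\; 1/T = 125\ \mathrm{Hz}
```

With these assumed values, a 32 ms frame would contain four whole fundamental periods.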

For audio data transferred from the fundamental period extraction section 6, the interpolation/thinning section 7 interpolates or thins in units of the fundamental period T according to the desired playback speed, and transfers the resulting audio data to the output buffer memory 8. For example, to halve the playback speed, one frame is divided, as shown in FIG. 3(A), into sections (1) to (n-1), each one fundamental period T long, and a remainder section (n). Sections (1) to (n-1) are interpolated by outputting each section twice in succession, and the remainder section (n) is output once as it is. To double the playback speed, one frame is divided, as shown in FIG. 3(B), into sections (1) to (n-1), each one fundamental period T long, and a remainder section (n). Sections (1) to (n-1) are thinned by outputting only every other section, and the remainder section (n) is output as it is.
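The two cases of FIG. 3 can be sketched on a list of samples as shown below. The function names are hypothetical, the period is assumed to be given in samples, and keeping the 1st, 3rd, ... sections when thinning is an assumption, since the text only says every other section is output.

```python
def split_into_periods(frame, period):
    """Split a frame into whole-period sections (1)..(n-1) and a remainder (n)."""
    n_full = len(frame) // period
    sections = [frame[i * period:(i + 1) * period] for i in range(n_full)]
    remainder = frame[n_full * period:]
    return sections, remainder


def half_speed(frame, period):
    """FIG. 3(A): output each whole-period section twice, remainder once."""
    sections, remainder = split_into_periods(frame, period)
    out = []
    for section in sections:
        out.extend(section)
        out.extend(section)
    out.extend(remainder)
    return out


def double_speed(frame, period):
    """FIG. 3(B): output every other whole-period section, remainder as is."""
    sections, remainder = split_into_periods(frame, period)
    out = []
    for section in sections[::2]:   # assumption: keep sections 1, 3, 5, ...
        out.extend(section)
    out.extend(remainder)
    return out
```

Because the remainder section is always output once, the output length is only approximately twice or half the frame length.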

For audio data transferred directly from the silence detection section 4, the interpolation/thinning section 7 interpolates or thins in units of a fixed length (for example, 5 milliseconds) according to the desired playback speed, and transfers the resulting audio data to the output buffer memory 8. The audio data stored in the output buffer memory 8 is read out under a clock signal of constant period t and output from the output terminal 10 via the DA converter 9.
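The source does not say how units are selected for playback speeds other than the 1/2 and 2x examples. The sketch below shows one possible approach for fixed-length units, using an exact-fraction credit counter to decide how many times each unit is emitted; the 8 kHz sampling rate, the counter scheme, and the name `resample_units` are all assumptions.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 8000                    # assumed; not given in the source
UNIT_LEN = SAMPLE_RATE_HZ * 5 // 1000    # 5 ms unit -> 40 samples


def resample_units(samples, unit_len, speed):
    """Repeat or drop whole units so the output lasts about len(samples)/speed."""
    speed = Fraction(speed)
    if speed <= 0:
        raise ValueError("speed must be positive")
    out = []
    credit = Fraction(0)
    for start in range(0, len(samples), unit_len):
        unit = samples[start:start + unit_len]
        credit += 1 / speed              # output units owed per input unit
        while credit >= 1:
            out.extend(unit)
            credit -= 1
    return out
```

With `speed=2` every other unit is dropped, with `speed=Fraction(1, 2)` every unit is output twice, and with `speed=1` the data passes through unchanged.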

Effects of the Invention

As described above, the present invention comprises frame dividing means for dividing recorded audio data into fixed-length frames (in the embodiment, the input buffer memory 3 and related elements), detection means for detecting whether each frame corresponds to a silent section, an unvoiced consonant section, a stationary vowel section, or a non-stationary section (in the embodiment, the silence detection section 4 and the stationarity determination section 5), and fundamental period extraction means for extracting the fundamental period of the speech in a stationary vowel section (in the embodiment, the fundamental period extraction section 6).

For a frame corresponding to a stationary vowel section, in which the speech waveform changes periodically, the audio data in the frame is interpolated or thinned in units of the fundamental period of the speech before playback, so discontinuities can be reduced. For a frame corresponding to a non-stationary section, in which the speech waveform changes abruptly, the audio data in the frame is played back as it is, so distortion of the reproduced sound can be suppressed. The present invention therefore has the advantage that high-quality reproduced sound is obtained even when the playback speed is changed.

Accordingly, the present invention is highly effective when applied to various audio processing devices, such as high-speed playback tape recorders, audio during high-speed VTR playback, high-speed readout of answering machines, voice message editing in voice mail systems, dictation machines that use low-speed playback, and language practice machines.

[Brief explanation of drawings]

FIGS. 3(A) and 3(B) are explanatory diagrams of the interpolation method and the thinning method for audio data, respectively.

1 is an input terminal, 2 is an AD converter, 3 is an input buffer memory, 4 is a silence detection section, 5 is a stationarity determination section, 6 is a fundamental period extraction section, 7 is an interpolation/thinning section, 8 is an output buffer memory, 9 is a DA converter, and 10 is an output terminal.

Patent applicant: Nippon Telegraph and Telephone Public Corporation. Agent: Patent Attorney Hisa Gobu Tamamushi (玉蟲久五部) and three others.

Claims

[Claims]

An audio playback speed control system for controlling the playback speed of recorded audio data, characterized by comprising: frame dividing means for dividing the audio data into frames of a fixed length; detection means for detecting whether each frame corresponds to a silent section, an unvoiced consonant section, a stationary vowel section, or a non-stationary section; and fundamental period extraction means for extracting the fundamental period of the speech in the stationary vowel section; wherein, for frames corresponding to the silent section and the unvoiced consonant section, the audio data contained in the frame is interpolated or thinned in units of a fixed length and then played back; for frames corresponding to the stationary vowel section, the audio data contained in the frame is interpolated or thinned in units of the fundamental period extracted by the fundamental period extraction means and then played back; and for frames corresponding to the non-stationary section, the audio data contained in the frame is played back as it is.