JP3976169B2 - Audio signal processing apparatus, audio signal processing method and program

Info

Publication number: JP3976169B2
Application number: JP2001298608A
Authority: JP
Inventors: 寧佐藤
Original assignee: Kenwood KK
Current assignee: Kenwood KK
Priority date: 2001-09-27
Filing date: 2001-09-27
Publication date: 2007-09-12
Anticipated expiration: 2021-09-27
Also published as: JP2003108172A

Description

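[0001]
[Technical Field of the Invention]
The present invention relates to an audio signal processing apparatus, an audio signal processing method, and a program.
[0002]
[Prior Art]
There are techniques that perform speech recognition and speech synthesis using acoustic parameters that represent the characteristics of human speech (for example, pitch information and formant information).
In speech recognition, for example, as shown in FIG. 3, the following are performed in association with one another: extraction of acoustic parameters from speech samples in advance (FIG. 3, step S101); phoneme processing, which includes identifying and symbolizing the phonemes of speech sounds (step S102); word processing for recognizing words or phrases (step S103); and natural language processing, which includes syntactic processing for recognizing syntax and semantic processing for recognizing the meaning of sentences (step S104).
[0003]
Methods for extracting acoustic parameters include applying cepstrum analysis to speech using a digital signal that represents the speech waveform, and using such a digital signal to obtain a correlation function of the speech and extracting the acoustic parameters based on that correlation function.
[0004]
[Problems to Be Solved by the Invention]
When acoustic parameters are extracted using cepstrum analysis or a correlation function, frequency-domain information must be obtained, for example by applying an FFT (Fast Fourier Transform) to the digital signal representing the speech waveform.
However, the pitch of real speech contains fluctuations, so the pitch may change abruptly. As a result, acoustic parameters extracted from the result of an FFT or the like contain errors caused by abrupt pitch changes.
[0005]
The present invention has been made in view of the above circumstances, and its object is to provide an audio signal processing apparatus and an audio signal processing method for accurately extracting information that represents the characteristics of speech whose pitch contains fluctuations.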
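[0006]
[Means for Solving the Problems]
To achieve the above object, an audio signal processing apparatus according to a first aspect of the present invention comprises:
pitch waveform signal generating means for acquiring an audio signal to be processed that represents a speech waveform, and processing the audio signal into a pitch waveform signal by making the time lengths of the sections corresponding to a unit pitch of the audio signal substantially identical;
subband extraction means for generating, based on the pitch waveform signal, a subband signal representing the temporal change of the fundamental frequency component and the harmonic components of the audio signal to be processed;
filter means for filtering the subband signal generated by the subband extraction means, thereby substantially removing, from the temporal change of the fundamental frequency component and the harmonic components represented by the subband signal, components at or above a predetermined frequency; and
output subband information generating means for generating and outputting output subband information representing the result of nonlinearly quantizing the subband signal filtered by the filter means,
characterized in that the output subband information generating means stores reference subband information representing the temporal change of the fundamental frequency component and the harmonic components of reference speech, determines whether any of the reference subband information shows at least a certain degree of correlation with the output subband information, and, when it determines that such reference subband information exists, outputs identification information identifying that reference subband information in place of the output subband information.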
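[0007]
The audio signal processing apparatus may include means for determining, based on the subband signal, whether the audio signal to be processed represents a fricative and, when it determines that the signal represents a fricative, generating and outputting information representing the spectral distribution of the pitch waveform signal before filtering by the filter means.
[0009]
The subband extraction means may include:
a variable filter whose frequency characteristic changes under control and which extracts the fundamental frequency component of the speech to be processed by filtering the audio signal to be processed;
filter characteristic determining means for identifying the fundamental frequency of the speech based on the fundamental frequency component extracted by the variable filter, and controlling the variable filter so that its frequency characteristic blocks components other than those near the identified fundamental frequency;
pitch extraction means for dividing the audio signal to be processed into sections, each consisting of a unit pitch of the audio signal, based on the value of the fundamental frequency component of the audio signal; and
a pitch length fixing unit for generating a pitch waveform signal in which the time lengths of the sections are substantially identical, by sampling each section of the audio signal to be processed with substantially the same number of samples.
[0010]
The audio signal processing apparatus may include pitch information output means for generating and outputting pitch information for identifying the original time length of each section of the pitch waveform signal.
[0011]
The filter characteristic determining means may include cross detection means for identifying the period at which the fundamental frequency component extracted by the variable filter reaches a predetermined value, and identifying the fundamental frequency based on the identified period.
[0012]
The filter characteristic determining means may include:
average pitch detection means for detecting, based on the audio signal to be processed before filtering, the time length of the pitch of the speech represented by the audio signal; and
determination means for determining whether the period identified by the cross detection means and the pitch time length identified by the average pitch detection means differ from each other by a predetermined amount or more; when it determines that they do not, controlling the variable filter so that its frequency characteristic blocks components other than those near the fundamental frequency identified by the cross detection means; and when it determines that they do, controlling the variable filter so that its frequency characteristic blocks components other than those near the fundamental frequency derived from the pitch time length identified by the average pitch detection means.
[0013]
The average pitch detection means may include:
cepstrum analysis means for obtaining the frequency at which the cepstrum of the audio signal to be processed, before filtering by the variable filter, takes a local maximum;
autocorrelation analysis means for obtaining the frequency at which the periodogram of the autocorrelation function of the audio signal to be processed, before filtering by the variable filter, takes a local maximum; and
average calculation means for obtaining the average pitch of the speech represented by the audio signal to be processed based on the frequencies obtained by the cepstrum analysis means and the autocorrelation analysis means, and identifying the obtained average as the time length of the pitch of the speech.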
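[0014]
An audio signal processing method according to a second aspect of the present invention comprises:
a pitch waveform signal generating step of acquiring an audio signal to be processed that represents a speech waveform, and processing the audio signal into a pitch waveform signal by making the time lengths of the sections corresponding to a unit pitch of the audio signal substantially identical;
a filtering step of filtering the pitch waveform signal, thereby substantially removing components at or above a predetermined frequency from the pitch waveform signal; and
an output subband information generating step of extracting the fundamental frequency component and the harmonic components of the audio signal to be processed from the filtered pitch waveform signal, and generating and outputting output subband information representing the result of nonlinearly quantizing the extracted fundamental frequency component and harmonic components,
characterized in that, in the output subband information generating step, reference subband information representing the temporal change of the fundamental frequency component and the harmonic components of reference speech is stored, it is determined whether any of the reference subband information shows at least a certain degree of correlation with the output subband information, and, when it is determined that such reference subband information exists, identification information identifying that reference subband information is output in place of the output subband information.
[0015]
A program according to a third aspect of the present invention causes a computer to function as:
pitch waveform signal generating means for acquiring an audio signal to be processed that represents a speech waveform, and processing the audio signal into a pitch waveform signal by making the time lengths of the sections corresponding to a unit pitch of the audio signal substantially identical;
filter means for filtering the pitch waveform signal, thereby substantially removing components at or above a predetermined frequency from the pitch waveform signal; and
output subband information generating means for extracting the fundamental frequency component and the harmonic components of the audio signal to be processed from the pitch waveform signal filtered by the filter means, and generating and outputting output subband information representing the result of nonlinearly quantizing the extracted fundamental frequency component and harmonic components,
characterized in that the output subband information generating means stores reference subband information representing the temporal change of the fundamental frequency component and the harmonic components of reference speech, determines whether any of the reference subband information shows at least a certain degree of correlation with the output subband information, and, when it determines that such reference subband information exists, outputs identification information identifying that reference subband information in place of the output subband information.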
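[0016]
[Embodiments of the Invention]
An embodiment of the present invention is described below with reference to the drawings, taking an acoustic parameter extractor as an example.
[0017]
FIG. 1 shows the configuration of an acoustic parameter extractor according to the embodiment of the present invention. As illustrated, the acoustic parameter extractor comprises an audio data input unit 1, a pitch extraction unit 2, a pitch length fixing unit 3, a subband division unit 4, a band limiting unit 5, a nonlinear quantization unit 6, a dictionary selection unit 7, a speech dictionary 8, and a fricative detection unit 9.
[0018]
The audio data input unit 1 comprises, for example, a recording medium driver (such as a flexible disk drive or an MO drive) that reads data recorded on a recording medium (such as a flexible disk or an MO (Magneto Optical disk)).
The audio data input unit 1 acquires audio data representing the waveform of the speech from which acoustic parameters are to be extracted, and supplies it to the pitch extraction unit 2.
[0019]
The audio data is assumed to be in the form of a PCM (Pulse Code Modulation) digital signal and to represent speech sampled at a constant period sufficiently shorter than the pitch of the speech.
[0020]
The pitch extraction unit 2, pitch length fixing unit 3, subband division unit 4, band limiting unit 5, nonlinear quantization unit 6, dictionary selection unit 7, and fricative detection unit 9 each comprise a data processing device such as a DSP (Digital Signal Processor) or a CPU (Central Processing Unit).
A single data processing device may perform some or all of the functions of the pitch extraction unit 2, pitch length fixing unit 3, subband division unit 4, band limiting unit 5, fricative detection unit 9, nonlinear quantization unit 6, and dictionary selection unit 7.
[0021]
Functionally, the pitch extraction unit 2 comprises, for example as shown in FIG. 2, a cepstrum analysis unit 21, an autocorrelation analysis unit 22, a weight calculation unit 23, a BPF (Band Pass Filter) coefficient calculation unit 24, a BPF 25, a zero-cross analysis unit 26, a waveform correlation analysis unit 27, and a phase adjustment unit 28.
A single data processing device may perform some or all of the functions of the cepstrum analysis unit 21, autocorrelation analysis unit 22, weight calculation unit 23, BPF coefficient calculation unit 24, BPF 25, zero-cross analysis unit 26, waveform correlation analysis unit 27, and phase adjustment unit 28.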
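[0022]
The cepstrum analysis unit 21 applies cepstrum analysis to the audio data supplied from the audio data input unit 1 to identify the fundamental frequency of the speech represented by the audio data, generates data indicating the identified fundamental frequency, and supplies it to the weight calculation unit 23.
[0023]
Specifically, when supplied with audio data from the audio data input unit 1, the cepstrum analysis unit 21 first converts the intensity of the audio data to a value substantially equal to the logarithm of the original value. (The base of the logarithm is arbitrary; the common logarithm, for example, may be used.)
Next, the cepstrum analysis unit 21 obtains the spectrum of the converted audio data (that is, the cepstrum) by the fast Fourier transform (or by any other method that generates data representing the result of Fourier-transforming a discrete variable).
It then identifies the lowest of the frequencies at which the cepstrum takes a local maximum as the fundamental frequency, generates data indicating the identified fundamental frequency, and supplies it to the weight calculation unit 23.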
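As a rough illustration of cepstrum-based pitch estimation, the sketch below uses the common textbook real cepstrum (inverse transform of the log-magnitude spectrum), which may differ in detail from the procedure described in [0023]. The function names, the naive DFT standing in for an FFT, the `eps` guard, and the search range are all assumptions made for this example; the result is a pitch period in samples rather than a frequency.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out

def cepstral_pitch_period(samples, min_lag, max_lag, eps=1e-9):
    """Return the lag (in samples) of the largest real-cepstrum value."""
    # eps avoids log(0) for empty spectral bins (an assumption of this sketch).
    log_mag = [math.log(abs(c) + eps) for c in dft(samples)]
    cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    return max(range(min_lag, max_lag + 1), key=lambda q: cepstrum[q])

# Hypothetical test signal: an impulse train with a 16-sample period.
train = [1.0 if t % 16 == 0 else 0.0 for t in range(128)]
period = cepstral_pitch_period(train, min_lag=2, max_lag=24)
```

Dividing the sampling rate by `period` would give the corresponding fundamental frequency estimate.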
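[0024]
When supplied with audio data from the audio data input unit 1, the autocorrelation analysis unit 22 identifies the fundamental frequency of the speech represented by the audio data based on the autocorrelation function of the waveform of the audio data, generates data indicating the identified fundamental frequency, and supplies it to the weight calculation unit 23.
[0025]
Specifically, when supplied with audio data from the audio data input unit 1, the autocorrelation analysis unit 22 first identifies the autocorrelation function r(l) represented by the right-hand side of Equation 1.
[0026]
[Equation 1]

[0027]
Next, the autocorrelation analysis unit 22 identifies, as the fundamental frequency, the lowest frequency exceeding a predetermined lower limit among the frequencies at which the function obtained by Fourier-transforming the autocorrelation function r(l) (the periodogram) takes a local maximum. It then generates data indicating the identified fundamental frequency and supplies it to the weight calculation unit 23.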
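The following sketch shows one way an autocorrelation function can expose the pitch period. Equation 1 is not reproduced in this text, so the biased estimate used here is an assumed form, and the peak is picked directly in the lag domain rather than from the periodogram described in [0027]. All names and the lag range are hypothetical.

```python
import math

def autocorrelation(samples, lag):
    """Biased estimate r(l) = (1/N) * sum_t x(t) * x(t + l) (assumed form)."""
    n = len(samples)
    return sum(samples[t] * samples[t + lag] for t in range(n - lag)) / n

def autocorr_pitch_period(samples, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(samples, lag))

# Hypothetical test signal: a sine wave with an 8-sample period.
tone = [math.sin(2 * math.pi * t / 8) for t in range(64)]
period = autocorr_pitch_period(tone, min_lag=4, max_lag=20)
```

The lower bound on the lag plays a role similar to the "predetermined lower limit" in the text: it keeps the trivially high correlation at very small lags from being mistaken for the pitch.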
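[0028]
When supplied with two pieces of data indicating fundamental frequencies, one each from the cepstrum analysis unit 21 and the autocorrelation analysis unit 22, the weight calculation unit 23 obtains the average of the absolute values of the reciprocals of the fundamental frequencies indicated by these two pieces of data. It then generates data indicating the obtained value (that is, the average pitch length) and supplies it to the BPF coefficient calculation unit 24.
[0029]
When supplied with the data indicating the average pitch length from the weight calculation unit 23 and with the zero-cross signal, described later, from the zero-cross analysis unit 26, the BPF coefficient calculation unit 24 determines, based on the supplied data and zero-cross signal, whether the average pitch length and the zero-crossing period of the pitch signal differ from each other by a predetermined amount or more. When it determines that they do not, it controls the frequency characteristic of the BPF 25 so that the reciprocal of the zero-crossing period becomes the center frequency (the frequency at the center of the passband of the BPF 25). When it determines that they differ by the predetermined amount or more, it controls the frequency characteristic of the BPF 25 so that the reciprocal of the average pitch length becomes the center frequency.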
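The decision made by the weight calculation unit and the BPF coefficient calculation unit can be sketched as follows. This is a minimal illustration, assuming pitch lengths in seconds and frequencies in hertz; the function names and the `max_diff` parameter (the "predetermined amount") are hypothetical.

```python
def average_pitch_length(f_cepstrum, f_autocorr):
    """Mean of the absolute reciprocals of the two fundamental-frequency estimates."""
    return (abs(1.0 / f_cepstrum) + abs(1.0 / f_autocorr)) / 2.0

def choose_center_frequency(avg_pitch_len, zero_cross_period, max_diff):
    """Pick the band-pass center frequency as described in [0029]."""
    if abs(avg_pitch_len - zero_cross_period) < max_diff:
        # The two estimates agree: trust the zero-crossing period.
        return 1.0 / zero_cross_period
    # They differ by max_diff or more: fall back to the average pitch length.
    return 1.0 / avg_pitch_len

avg_len = average_pitch_length(100.0, 125.0)             # 0.009 s
f_close = choose_center_frequency(avg_len, 0.0095, 0.001)
f_far = choose_center_frequency(avg_len, 0.02, 0.001)
```

Falling back to the average pitch length keeps the filter from locking onto a spurious zero-crossing period, for example when the filter is initially tuned far from the true pitch.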
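[0030]
The BPF 25 functions as an FIR (Finite Impulse Response) filter with a variable center frequency.
Specifically, the BPF 25 sets its center frequency to a value determined under the control of the BPF coefficient calculation unit 24. It then filters the audio data supplied from the audio data input unit 1 and supplies the filtered audio data (the pitch signal) to the zero-cross analysis unit 26 and the waveform correlation analysis unit 27. The pitch signal is assumed to consist of digital data with a sampling interval substantially identical to that of the audio data.
The bandwidth of the BPF 25 is preferably such that the upper limit of its passband always stays within twice the fundamental frequency of the speech represented by the audio data.
[0031]
The zero-cross analysis unit 26 identifies the times at which the instantaneous value of the pitch signal supplied from the BPF 25 becomes 0 (the zero-crossing times), and supplies a signal representing the identified timing (the zero-cross signal) to the BPF coefficient calculation unit 24.
Alternatively, the zero-cross analysis unit 26 may identify the times at which the instantaneous value of the pitch signal reaches a predetermined nonzero value, and supply a signal representing that timing to the BPF coefficient calculation unit 24 in place of the zero-cross signal.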
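A zero-crossing detector of the kind described here might look like the sketch below. It counts only upward crossings (negative to non-negative) so that one crossing corresponds to one period; that choice, the function names, and the sample interval are assumptions of this example rather than details given in the text.

```python
import math

def zero_cross_indices(signal):
    """Indices i where the signal crosses from negative to non-negative."""
    return [i for i in range(1, len(signal))
            if signal[i - 1] < 0.0 <= signal[i]]

def mean_zero_cross_period(signal, sample_interval):
    """Average time between successive upward zero crossings, or None."""
    idx = zero_cross_indices(signal)
    if len(idx) < 2:
        return None
    return (idx[-1] - idx[0]) / (len(idx) - 1) * sample_interval

# Hypothetical pitch signal: 10 samples per period, offset to avoid exact zeros.
pitch_signal = [math.sin(2 * math.pi * (t + 0.5) / 10) for t in range(40)]
crossings = zero_cross_indices(pitch_signal)
period = mean_zero_cross_period(pitch_signal, sample_interval=0.001)
```

Replacing the `0.0` threshold with another constant gives the variant in which a predetermined nonzero value is used instead.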
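[0032]
When supplied with audio data from the audio data input unit 1 and with the pitch signal from the BPF 25, the waveform correlation analysis unit 27 divides the audio data at the timing of the boundaries of a unit period (for example, one period) of the pitch signal. Then, for each resulting section, it obtains the correlation between variously phase-shifted versions of the audio data in the section and the pitch signal in the section, and identifies the phase of the audio data that gives the highest correlation as the phase of the audio data in that section.
[0033]
Specifically, for each section, the waveform correlation analysis unit 27 obtains, for example, the value cor represented by the right-hand side of Equation 2 for each of various values of φ, which represents the phase (where φ is an integer of 0 or more). It then identifies the value Ψ of φ that maximizes cor, generates data indicating Ψ, and supplies it to the phase adjustment unit 28 as phase data representing the phase of the audio data in the section.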
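The phase search can be illustrated as follows. Equation 2 is not reproduced in this text, so the correlation used here (a sum of products under a circular shift of the section) is an assumed form, and the function names are hypothetical.

```python
def best_phase(section, pitch_section):
    """Return the shift phi that maximizes the correlation with the pitch signal."""
    n = len(section)

    def cor(phi):
        # Assumed form of cor: circularly shift the section by phi samples.
        return sum(section[(i - phi) % n] * pitch_section[i] for i in range(n))

    return max(range(n), key=cor)

def shift_section(section, phi):
    """Circularly shift a section by phi samples."""
    n = len(section)
    return [section[(i - phi) % n] for i in range(n)]

# Hypothetical data: the section is the pitch signal rotated left by 3 samples.
pitch_section = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
section = pitch_section[3:] + pitch_section[:3]
psi = best_phase(section, pitch_section)
aligned = shift_section(section, psi)
```

After the shift, every section starts at the same point of the pitch cycle, which is what allows the later stages to compare sections sample by sample.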
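[0034]
[Equation 2]

[0035]
The time length of a section is preferably about one pitch. The longer the section, the more samples it contains, which increases the data volume of the pitch waveform signal, or the longer the sampling interval becomes, which makes the speech represented by the pitch waveform signal less accurate.
[0036]
When supplied with audio data from the audio data input unit 1 and with data indicating the phase Ψ of each section of the audio data from the waveform correlation analysis unit 27, the phase adjustment unit 28 shifts the phase of the audio data in each section so that it equals the phase Ψ of that section indicated by the phase data. It then supplies the phase-shifted audio data to the pitch length fixing unit 3.
[0037]
When supplied with the phase-shifted audio data from the phase adjustment unit 28, the pitch length fixing unit 3 resamples each section of the audio data and supplies the resampled audio data (the pitch waveform data) to the subband division unit 4 and the fricative detection unit 9. In doing so, the pitch length fixing unit 3 resamples so that the sections have approximately equal numbers of samples and so that the samples within each section are equally spaced.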
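The resampling step can be sketched with simple linear interpolation. The interpolation method, the treatment of each section as a non-periodic segment with both end points included, and all names are assumptions of this example; the text does not specify how the resampling is carried out.

```python
def resample_section(section, target_len):
    """Linearly interpolate one section onto target_len equally spaced points."""
    n = len(section)
    if n == 1 or target_len == 1:
        return [section[0]] * target_len
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)   # position in the original section
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(section[lo] * (1.0 - frac) + section[hi] * frac)
    return out

def fix_pitch_length(sections, target_len):
    """Return equal-length sections plus the original sample counts (pitch information)."""
    pitch_info = [len(s) for s in sections]
    fixed = [resample_section(s, target_len) for s in sections]
    return fixed, pitch_info

# Hypothetical sections of 3 and 5 samples, both resampled to 5 samples.
sections = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0, 6.0, 8.0]]
fixed, pitch_info = fix_pitch_length(sections, target_len=5)
```

Keeping `pitch_info` alongside the fixed-length sections is what makes it possible to stretch each section back to its original duration later.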
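[0038]
The pitch length fixing unit 3 also generates data indicating the original number of samples in each section and outputs it externally as information representing the original pitch length of each section (pitch information).
[0039]
The subband division unit 4 generates subband data by applying an orthogonal transform such as the DCT (Discrete Cosine Transform) to the audio data supplied from the pitch length fixing unit 3. It then supplies the generated subband data to the band limiting unit 5.
[0040]
The subband data includes data representing the temporal change in the intensity of the fundamental frequency component of the speech represented by the audio data supplied to the subband division unit 4, and n pieces of data (n is a natural number) representing the temporal change in the intensity of n harmonic components of that speech. Accordingly, when the intensity of the fundamental frequency component (or a harmonic component) does not change over time, the subband data represents that intensity in the form of a DC signal.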
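One way to picture subband data is to transform each fixed-length section and then follow each coefficient across successive sections. The sketch below uses an unnormalized DCT-II; the choice of DCT variant, the function names, and the mapping from coefficient index to the fundamental or a particular harmonic are assumptions that the text does not settle.

```python
import math

def dct2(section):
    """Unnormalized DCT-II of one pitch-length section."""
    n = len(section)
    return [sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                for t, x in enumerate(section))
            for k in range(n)]

def subband_tracks(sections, num_bands):
    """For each coefficient index k, collect its value across successive sections."""
    coeffs = [dct2(s) for s in sections]
    return [[c[k] for c in coeffs] for k in range(num_bands)]

# Hypothetical input: two constant sections with different levels.
sections = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
tracks = subband_tracks(sections, num_bands=2)
```

Each entry of `tracks` is a time series with one value per pitch section, so a component whose strength does not change from section to section appears as a constant (DC) track.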
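[0041]
The band limiting unit 5 functions, for example, as an FIR digital filter. It filters each of the (n+1) pieces of data that make up the subband data supplied from the subband division unit 4, and supplies the filtered subband data to the nonlinear quantization unit 6.
The filtering by the band limiting unit 5 substantially removes, from the temporal change in the intensity of each of the (n+1) frequency components (the fundamental frequency component or a harmonic component) represented by the subband data, the components exceeding a predetermined frequency.
[0042]
In addition to a data processing device, the nonlinear quantization unit 6 includes a volatile storage device such as a RAM (Random Access Memory) and a nonvolatile storage device such as a ROM (Read Only Memory).
[0043]
When supplied with the filtered subband data from the band limiting unit 5, the nonlinear quantization unit 6 generates subband data corresponding to the quantized result of applying nonlinear compression to the instantaneous value of each frequency component represented by the subband data (specifically, for example, the value obtained by substituting the instantaneous value into an upward-convex function). It then supplies the generated subband data (the nonlinearly quantized subband data) to the dictionary selection unit 7 and the fricative detection unit 9.
[0044]
Specifically, for example, the nonlinear quantization unit 6 may perform nonlinear quantization by changing the instantaneous value of each frequency component after nonlinear compression to a value substantially equal to the quantized value of the function Xri(xi) shown on the right-hand side of Equation 3.
[0045]
[Equation 3]
$X_{ri}(x_i) = \operatorname{sgn}(x_i) \cdot |x_i|^{4/3} \cdot 2^{\mathrm{global\_gain}(x_i)/4}$
(where sgn(α) = α/|α|, xi is the original instantaneous value of the frequency component represented by the subband data, and global_gain(xi) is a function of xi for setting the full scale)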
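Equation 3 can be evaluated directly, as in the sketch below. For simplicity the gain is passed in as a plain number rather than as a function of xi, sgn(0) is treated as 0, and the uniform `quantize` helper with its `step` parameter is a hypothetical stand-in for the actual quantizer, which the text does not specify.

```python
import math

def x_r(x, global_gain):
    """Evaluate sgn(x) * |x|^(4/3) * 2^(global_gain / 4) from Equation 3."""
    if x == 0.0:
        return 0.0  # sgn(0) is undefined in the text; 0 is assumed here
    magnitude = abs(x) ** (4.0 / 3.0) * 2.0 ** (global_gain / 4.0)
    return math.copysign(magnitude, x)

def quantize(value, step=1.0):
    """Uniform rounding to a multiple of step (hypothetical quantizer)."""
    return round(value / step) * step

y_pos = x_r(8.0, 0.0)     # 8^(4/3) = 16
y_neg = x_r(-8.0, 4.0)    # -(16 * 2^1) = -32
q = quantize(y_pos)
```

Raising `global_gain` by 4 doubles the output, so the gain acts as a scale setting in quarter-octave steps of amplitude.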
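[0046]
The nonlinear quantization unit 6 is assumed to store in advance data specifying the function global_gain(xi), for example in accordance with a write operation by the user.
The function global_gain(xi) is preferably such that the data volume of the nonlinearly quantized subband data is about one hundredth of what it would be if the nonlinear quantization unit 6 quantized without applying nonlinear compression.
[0047]
The dictionary selection unit 7 accesses the speech dictionary 8 and determines whether, among the subband data stored in the speech dictionary 8 as described later, the one most strongly correlated with the nonlinearly quantized subband data supplied from the nonlinear quantization unit 6 shows at least a certain degree of correlation.
[0048]
Specifically, the dictionary selection unit 7 may, for example, perform the processing shown below as (1) to (3):
(1) First, between the subband data supplied from the nonlinear quantization unit 6 and one set of subband data stored in the speech dictionary 8, obtain the correlation coefficient for each pair of identical frequency components, and obtain the average of these correlation coefficients.
(2) Perform the processing of (1) for all the subband data contained in the speech dictionary 8, and identify the subband data with the highest average correlation coefficient as the one most highly correlated with the subband data supplied from the nonlinear quantization unit 6.
(3) Next, determine whether the average correlation coefficient between the subband data identified in (2) and the subband data supplied from the nonlinear quantization unit 6 is greater than a predetermined value.
[0049]
When the dictionary selection unit 7 determines that there is at least the certain degree of correlation, it outputs externally, as acoustic information, the index number (or symbol), described later, assigned to the subband data showing that correlation. When it determines that there is not, it outputs the subband data supplied from the nonlinear quantization unit 6 itself as acoustic information.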
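Steps (1) to (3) and the output decision can be sketched as follows. The use of the Pearson correlation coefficient, the dictionary layout (index number mapped to a list of per-band time series), the tagged return value, and the threshold are assumptions made for this illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm if norm else 0.0

def mean_correlation(subbands, reference):
    """Step (1): average the per-band correlation coefficients."""
    return sum(pearson(s, r) for s, r in zip(subbands, reference)) / len(subbands)

def select_from_dictionary(subbands, dictionary, threshold):
    """Steps (2) and (3): return ("index", key) on a strong match, else ("data", subbands)."""
    best_key, best_score = None, -2.0
    for key, reference in dictionary.items():
        score = mean_correlation(subbands, reference)
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score > threshold:
        return ("index", best_key)
    return ("data", subbands)

# Hypothetical dictionary with two entries of two bands each.
dictionary = {1: [[1, 2, 3, 4], [4, 3, 2, 1]], 2: [[1, 3, 1, 3], [2, 2, 5, 2]]}
hit = select_from_dictionary([[2, 4, 6, 8], [8, 6, 4, 2]], dictionary, 0.9)
miss = select_from_dictionary([[1, 0, 0, 1], [0, 1, 1, 0]], dictionary, 0.9)
```

Because the correlation coefficient ignores overall scale, the first query matches entry 1 even though its values are twice as large.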
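[0050]
The speech dictionary 8 comprises a nonvolatile storage device such as a hard disk device.
For each of various speech sounds, the speech dictionary 8 stores nonlinearly compressed subband data representing the temporal change of each frequency component of that speech. It also stores an index number (or symbol) unique to each piece of subband data, in one-to-one correspondence with the subband data. In response to access by the dictionary selection unit 7, it supplies the stored subband data and index numbers (or symbols) to the dictionary selection unit 7.
[0051]
When supplied with the nonlinearly quantized subband data from the nonlinear quantization unit 6, the fricative detection unit 9 determines, based on this subband data, whether the audio data input to the acoustic parameter extractor represents a fricative.
[0052]
A fricative waveform has a broad spectrum like white noise, but contains little fundamental frequency or harmonic content. The fricative detection unit 9 may therefore, for example, determine whether the intensity of the harmonic components represented by the supplied subband data is at or below a predetermined proportion of the total intensity of the speech from which acoustic parameters are to be extracted. When it is at or below the predetermined proportion, the unit determines that the audio data input to the acoustic parameter extractor represents a fricative; when it exceeds the predetermined proportion, the unit determines that it does not. The fricative detection unit 9 may acquire the audio data from the audio data input unit 1 in order to obtain the total intensity of the speech from which acoustic parameters are to be extracted.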
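The ratio test described here might be sketched as below. Measuring intensity as a sum of squares, assuming the subband values and the audio samples are on comparable scales, and the names `power`, `is_fricative`, and `max_ratio` are all assumptions of this example.

```python
def power(values):
    """Sum of squares, used here as a simple intensity measure."""
    return sum(v * v for v in values)

def is_fricative(harmonic_tracks, samples, max_ratio):
    """True when harmonic intensity is at most max_ratio of the total intensity."""
    total = power(samples)
    if total == 0.0:
        return False  # silence: no decision possible, treated as non-fricative
    harmonic = sum(power(track) for track in harmonic_tracks)
    return harmonic / total <= max_ratio

# Hypothetical inputs: weak versus strong harmonic content for the same audio.
audio = [1.0, -1.0, 1.0, -1.0]
noisy = is_fricative([[0.1, 0.1]], audio, max_ratio=0.1)
voiced = is_fricative([[1.0, 1.0], [1.0, 1.0]], audio, max_ratio=0.1)
```

In practice the threshold would need tuning against real speech, since the text only says that a predetermined proportion is used.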
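[0053]
When the fricative detection unit 9 determines that the audio data input to the acoustic parameter extractor represents a fricative, it generates data representing the spectral distribution of the audio data supplied from the pitch length fixing unit 3 by transforming that audio data with the FFT (Fast Fourier Transform) (or with any other method that generates data representing the result of Fourier-transforming a discrete variable). It then outputs the generated data externally as information representing the fricative (fricative information).
[0054]
The acoustic parameter extractor described above outputs, as data representing acoustic parameters, pitch information representing the pitch of the speech represented by the input audio data, acoustic information representing the temporal change in the intensity of the fundamental frequency component and harmonic components of that speech, and fricative information representing whether that speech is a fricative.
[0055]
The time length of each unit-pitch section of the input audio data is normalized, which removes the effect of pitch fluctuation. Highly accurate acoustic information is extracted from the audio data.
In addition, the original time length of each section of the audio data can be identified using the pitch information and the known sampling interval of the original audio data. The original audio data can therefore be restored easily by restoring the time length of each section of the pitch waveform signal to its time length in the original audio data.
[0056]
The configuration of this pitch waveform extraction system is not limited to the one described above.
For example, the audio data input unit 1 may acquire audio data from outside over a communication line such as a telephone line, a dedicated line, or a satellite link. In that case, the audio data input unit 1 need only include a communication control unit consisting of, for example, a modem or a DSU (Data Service Unit).
[0057]
The audio data input unit 1 may also include a sound collecting device consisting of a microphone, an AF (Audio Frequency) amplifier, a sampler, an A/D (Analog-to-Digital) converter, a PCM encoder, and the like. The sound collecting device may acquire audio data by amplifying the audio signal representing the sound picked up by its microphone, sampling and A/D-converting it, and then applying PCM modulation to the sampled audio signal. The audio data acquired by the audio data input unit 1 need not be a PCM signal.
[0058]
The pitch extraction unit 2 need not include the cepstrum analysis unit 21 (or the autocorrelation analysis unit 22). In that case, the weight calculation unit 23 may treat the reciprocal of the fundamental frequency obtained by whichever of the two units is present as the average pitch length as is.
The zero-cross analysis unit 26 may also supply the pitch signal supplied from the BPF 25, as is, to the BPF coefficient calculation unit 24 as the zero-cross signal.
[0059]
The pitch length fixing unit 3 may supply the pitch information to the outside over a communication line. In that case, the pitch length fixing unit 3 need only include a communication control unit consisting of a modem, a DSU, or the like. Similarly, the fricative detection unit 9 (or the dictionary selection unit 7) may supply the fricative information (or the acoustic information) to the outside over a communication line, in which case it need only include a communication control unit similar to that of the pitch length fixing unit 3. A single device may perform some or all of the functions of the communication control units of the pitch length fixing unit 3, the fricative detection unit 9, and the dictionary selection unit 7.
[0060]
The pitch length fixing unit 3 may also write the pitch information to an external recording medium or to an external storage device such as a hard disk device. In that case, the pitch length fixing unit 3 need only include a recording control unit consisting of a recording medium driver, a control circuit such as a hard disk controller, or the like. Similarly, the fricative detection unit 9 (or the dictionary selection unit 7) may write the fricative information (or the acoustic information) to an external storage device, in which case it need only include a recording control unit similar to that of the pitch length fixing unit 3. A single device may perform some or all of the functions of the recording control units of the pitch length fixing unit 3, the fricative detection unit 9, and the dictionary selection unit 7.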
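[0061]
The dictionary selection unit 7 may include a storage unit that stores the most recent nonlinearly quantized subband data previously supplied from the nonlinear quantization unit 6. In that case, each time it is newly supplied with nonlinearly quantized subband data, the dictionary selection unit 7 may determine whether this subband data shows at least a certain degree of correlation with the nonlinearly quantized subband data it has stored, and output information representing the result as data constituting the acoustic information. A single storage device may serve as both the storage unit of the dictionary selection unit 7 and the speech dictionary 8.
[0062]
When the dictionary selection unit 7 determines that the newly supplied nonlinearly quantized subband data shows at least a certain degree of correlation with the nonlinearly quantized subband data it has stored, it may omit the subband data or index number (or symbol) from the acoustic information. This reduces the data volume of the acoustic information.
[0063]
When the dictionary selection unit 7 determines that none of the subband data stored in the speech dictionary 8 shows at least a certain degree of correlation with the nonlinearly quantized subband data supplied from the nonlinear quantization unit 6, it may assign a unique index number (or symbol) to the supplied subband data and store the subband data and the index number (or symbol) in the speech dictionary 8 in association with each other.
[0064]
An embodiment of the present invention has been described above. The audio signal processing apparatus according to the present invention can, however, be realized using an ordinary computer system rather than a dedicated system.
For example, an acoustic parameter extractor that executes the above processing can be configured by installing, on a personal computer, a program for executing the operations of the audio data input unit 1, pitch extraction unit 2, pitch length fixing unit 3, fricative detection unit 9, subband division unit 4, band limiting unit 5, nonlinear quantization unit 6, dictionary selection unit 7, and speech dictionary 8 described above, from a medium (such as a CD-ROM, MO, or flexible disk) that stores the program.
[0065]
The program may also, for example, be posted on a bulletin board system (BBS) on a communication line and distributed over the communication line. Alternatively, a carrier wave may be modulated with a signal representing the program, the resulting modulated wave transmitted, and the program restored by a device that receives and demodulates the modulated wave.
The above processing can then be executed by starting the program and running it under the control of the OS in the same way as other application programs.
[0066]
When the OS takes on part of the processing, or when the OS constitutes part of one component of the present invention, the recording medium may store the program with that part excluded. In this case too, in the present invention, the recording medium is regarded as storing a program for executing each function or step executed by the computer.
[0067]
[Effects of the Invention]
As described above, the present invention realizes an audio signal processing apparatus and an audio signal processing method for accurately extracting information that represents the characteristics of speech whose pitch contains fluctuations.
[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of an acoustic parameter extractor according to an embodiment of the present invention.
[FIG. 2] A block diagram showing the configuration of the pitch extraction unit.
[FIG. 3] A diagram schematically illustrating the concept of a conventional rule-based synthesis scheme.
[Explanation of Reference Numerals]
1 Audio data input unit
2 Pitch extraction unit
21 Cepstrum analysis unit
22 Autocorrelation analysis unit
23 Weight calculation unit
24 BPF coefficient calculation unit
25 BPF
26 Zero-cross analysis unit
27 Waveform correlation analysis unit
28 Phase adjustment unit
3 Pitch length fixing unit
4 Subband division unit
5 Band limiting unit
6 Nonlinear quantization unit
7 Dictionary selection unit
8 Speech dictionary
9 Fricative detection unit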
BACKGROUND OF THE INVENTION
The present invention relates to an audio signal processing apparatus, an audio signal processing method, and a program.
[0002]
[Prior art]
There is a technique for performing speech recognition and speech synthesis using acoustic parameters (for example, speech pitch information and formant information) representing the characteristics of human speech.
For example, in the case of speech recognition, as shown in FIG. 3, phoneme processing (step S102) including extraction of acoustic parameters from a speech sample in advance (FIG. 3, step S101), identification of phonemes of language sounds and symbolization of phonemes. ), Word processing (step S103) for recognizing words or phrases, and natural language processing (step S104) including syntax processing for recognizing syntax and semantic processing for recognizing the meaning of sentences. It is done in association with.
[0003]
As a method for extracting acoustic parameters, a method for performing cepstrum analysis on a voice using a digital signal representing a voice waveform, a correlation function of the voice using such a digital signal is obtained, and this correlation function is obtained. There is a method for extracting acoustic parameters based on the above.
[0004]
[Problems to be solved by the invention]
When an acoustic parameter is extracted using cepstrum analysis or a correlation function, it is necessary to obtain frequency domain information by performing FFT (Fast Fourier Transform) on a digital signal representing a speech waveform.
However, the pitch of actual speech includes fluctuations, and thus the pitch may fluctuate rapidly. Therefore, there arises a problem that the acoustic parameter extracted from the result of FFT or the like includes an error caused by a rapid change in pitch.
[0005]
The present invention has been made in view of the above circumstances, and an object thereof is to provide an audio signal processing apparatus and an audio signal processing method for accurately extracting information representing the characteristics of an audio including pitch fluctuations. .
[0006]
[Means for Solving the Problems]
In order to achieve the above object, an audio signal processing apparatus according to the first aspect of the present invention includes:
A pitch waveform signal that acquires a voice signal to be processed representing a voice waveform and processes the voice signal into a pitch waveform signal by aligning the time length of the section corresponding to the unit pitch of the voice signal substantially the same. Generating means;
Subband extraction means for generating a subband signal representing a time change of the fundamental frequency component and the harmonic component of the audio signal to be processed based on the pitch waveform signal;
Filter means for substantially removing a component having a predetermined frequency or more from temporal changes of the fundamental frequency component and the harmonic component represented by the subband signal by filtering the subband signal generated by the subband extraction means;
Output subband information generating means for generating and outputting output subband information representing the result of nonlinear quantization of the subband signal filtered by the filter means, and
The output subband information generating means stores reference subband information representing temporal changes in the fundamental frequency component and the harmonic component of reference audio, and the output subband information is included in the reference subband information. It is determined whether or not there is a certain degree of correlation with the information, and when it is determined that there is, identification information for identifying the corresponding reference subband information instead of the output subband information Output,
It is characterized by that.
[0007]
The sound signal processing device determines whether the sound signal to be processed represents a friction sound based on the subband signal, and when it is determined that the sound signal represents a friction sound, is filtered by the filter means. There may be provided means for generating and outputting information representing the spectral distribution of the previous pitch waveform signal.
[0009]
The subband extracting means includes
A variable filter that extracts a fundamental frequency component of a voice to be processed by changing a frequency characteristic according to control and filtering the voice signal to be processed;
A filter characteristic that specifies the fundamental frequency of the voice based on the fundamental frequency component extracted by the variable filter, and controls the variable filter so as to have a frequency characteristic that blocks other components near the identified fundamental frequency. A determination means;
Pitch extraction means for dividing the audio signal to be processed into sections consisting of audio signals for a unit pitch based on the value of the fundamental frequency component of the audio signal;
A pitch length fixing unit that generates a pitch waveform signal in which the time lengths in each section are substantially the same by sampling each section of the speech signal to be processed with substantially the same number of samples. And may be provided.
[0010]
The audio signal processing apparatus may include pitch information output means for generating and outputting pitch information for specifying the original time length of each section of the pitch waveform signal.
[0011]
The filter characteristic determination unit may include a cross detection unit that identifies a period in which a timing at which the fundamental frequency component extracted by the variable filter reaches a predetermined value comes and identifies the fundamental frequency based on the identified period. Good.
[0012]
The filter characteristic determining means includes
Average pitch detecting means for detecting the time length of the pitch of the voice represented by the voice signal based on the voice signal to be processed before being filtered;
It is determined whether or not the period specified by the cross detection means and the time length of the pitch specified by the average pitch detection means are different from each other by a predetermined amount or more. The variable filter is controlled so as to have a frequency characteristic such that components other than the component near the specified fundamental frequency are cut off, and when it is determined that they are different, the average pitch detecting means is specified from the time length of the specified pitch. And a discriminating means for controlling the variable filter so as to have a frequency characteristic such that components other than the components near the fundamental frequency are cut off.
[0013]
The average pitch detecting means is
Cepstrum analysis means for obtaining a frequency at which the cepstrum of the sound signal to be processed before being filtered by the variable filter takes a maximum value;
Autocorrelation analysis means for obtaining a frequency at which the periodogram of the autocorrelation function of the speech signal to be processed before being filtered by the variable filter has a maximum value, and each frequency obtained by the cepstrum analysis means and the autocorrelation analysis means And calculating an average value of the pitch of the voice represented by the voice signal to be processed, and specifying the calculated average value as a time length of the pitch of the voice.
[0014]
An audio signal processing method according to the second aspect of the present invention is as follows:
Get the machining object of speech signal representing the speech waveform, by aligning the time length of the unit pitch corresponding to the interval of the audio signal substantially the same, the pitch waveform signal for processing the audio signal into a pitch waveform signal Generation step ;
A filtering step of substantially removing a component having a predetermined frequency or higher from the pitch waveform signal by filtering the pitch waveform signal;
Extracts the fundamental frequency component and harmonic component of the processed speech signal from the filtered pitch waveform signal, and generates output subband information representing the result of nonlinear quantization of the extracted fundamental frequency component and harmonic component. Output subband information generating step for outputting
In the output subband information generation step, reference subband information representing temporal changes in the fundamental frequency component and the harmonic component of the reference sound is stored, and the output subband of the reference subband information is stored. It is determined whether or not there is a certain degree of correlation with the information, and when it is determined that there is, identification information for identifying the corresponding reference subband information instead of the output subband information Output,
It is characterized by that.
[0015]
A program according to the third aspect of the present invention is:
Computer
A pitch waveform signal that acquires a voice signal to be processed representing a voice waveform and processes the voice signal into a pitch waveform signal by aligning the time length of the section corresponding to the unit pitch of the voice signal substantially the same. Generating means;
Filter means for substantially removing a component having a predetermined frequency or higher from the pitch waveform signal by filtering the pitch waveform signal;
An output subband representing a result of extracting the fundamental frequency component and the harmonic component of the audio signal to be processed from the pitch waveform signal filtered by the filter means and nonlinearly quantizing the extracted fundamental frequency component and the harmonic component Output subband information generating means for generating and outputting information ;
A program to make it function ,
The output subband information generating means stores reference subband information representing temporal changes in the fundamental frequency component and the harmonic component of reference audio, and the output subband information is included in the reference subband information. It is determined whether or not there is a certain degree of correlation with the information, and when it is determined that there is, identification information for identifying the corresponding reference subband information instead of the output subband information Output,
It is characterized by that.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings, taking an acoustic parameter extractor as an example.
[0017]
FIG. 1 is a diagram showing a configuration of an acoustic parameter extractor according to an embodiment of the present invention. As shown, the acoustic parameter extractor includes an audio data input unit 1, a pitch extraction unit 2, a pitch length fixing unit 3, a subband division unit 4, a band limiting unit 5, and a nonlinear quantization unit 6. And a dictionary selection unit 7, a speech dictionary 8, and a frictional sound detection unit 9.
[0018]
The audio data input unit 1 includes, for example, a recording medium driver (flexible disk drive, MO drive, etc.) that reads data recorded on a recording medium (for example, a flexible disk, an MO (Magneto Optical disk), etc.), and the like. Yes.
The voice data input unit 1 acquires voice data representing a waveform of a target voice from which acoustic parameters are extracted, and supplies the acquired voice data to the pitch extraction unit 2.
[0019]
It is assumed that the audio data has a PCM (Pulse Code Modulation) modulated digital signal format and represents audio sampled at a constant cycle sufficiently shorter than the audio pitch.
[0020]
The pitch extraction unit 2, pitch length fixing unit 3, subband division unit 4, band limiting unit 5, nonlinear quantization unit 6, dictionary selection unit 7, and frictional sound detection unit 9 are all DSP (Digital Signal Processor) or CPU (Central Processing Unit) or the like.
Note that a part or all of the functions of the pitch extracting unit 2, the pitch length fixing unit 3, the subband dividing unit 4, the band limiting unit 5, the frictional sound detecting unit 9, the nonlinear quantizing unit 6 and the dictionary selecting unit 7 are combined into a single function. The data processing apparatus may perform this.
[0021]
As shown in FIG. 2, for example, the pitch extraction unit 2 functionally includes a cepstrum analysis unit 21, an autocorrelation analysis unit 22, a weight calculation unit 23, a BPF (Band Pass Filter) coefficient calculation unit 24, A BPF 25, a zero-cross analysis unit 26, a waveform correlation analysis unit 27, and a phase adjustment unit 28 are included.
A cepstrum analysis unit 21, an autocorrelation analysis unit 22, a weight calculation unit 23, a BPF (Band Pass Filter) coefficient calculation unit 24, a BPF 25, a zero cross analysis unit 26, a waveform correlation analysis unit 27, and a part of the phase adjustment unit 28 or All functions may be performed by a single data processing device.
[0022]
The cepstrum analysis unit 21 performs cepstrum analysis on the audio data supplied from the audio data input unit 1, thereby specifying the fundamental frequency of the voice represented by the audio data, generating data indicating the identified basic frequency, and weighting It supplies to the calculation part 23.
[0023]
Specifically, when audio data is supplied from the audio data input unit 1, the cepstrum analysis unit 21 first converts the intensity of the audio data into a value substantially equal to the logarithm of the original value. (The base of the logarithm is arbitrary, and may be a common logarithm, for example.)
Next, the cepstrum analysis unit 21 uses a fast Fourier transform technique (or other arbitrary data that generates a result of Fourier transform of discrete variables) on the spectrum of the speech data (ie, the cepstrum) whose values have been converted. This method is used.
Then, the minimum value among the frequencies giving the maximum value of the cepstrum is specified as the fundamental frequency, and data indicating the identified fundamental frequency is generated and supplied to the weight calculator 23.
[0024]
When the audio data is supplied from the audio data input unit 1, the autocorrelation analysis unit 22 specifies the basic frequency of the audio represented by the audio data based on the autocorrelation function of the waveform of the audio data, and the specified basic frequency Is generated and supplied to the weight calculation unit 203.
[0025]
Specifically, when the audio data is supplied from the audio data input unit 1, the autocorrelation analysis unit 22 first specifies the autocorrelation function r (l) represented by the right side of Equation 1.
[0026]
[Expression 1]

[0027]
Next, the autocorrelation analysis unit 22 sets a minimum value exceeding a predetermined lower limit value as a basic frequency among frequencies giving a maximum value of a function (periodogram) obtained as a result of Fourier transform of the autocorrelation function r (l). Is generated, and data indicating the specified fundamental frequency is generated and supplied to the weight calculator 23.
[0028]
When a total of two pieces of data indicating the fundamental frequency are supplied one by one from the cepstrum analysis unit 21 and the autocorrelation analysis unit 22, the weight calculation unit 23 averages the absolute value of the reciprocal of the fundamental frequency indicated by these two data. Ask for. Then, data indicating the obtained value (that is, average pitch length) is generated and supplied to the BPF coefficient calculation unit 24.
[0029]
When the BPF coefficient calculation unit 24 is supplied with data indicating the average pitch length from the weight calculation unit 23 and is supplied with a zero cross signal described later from the zero cross analysis unit 26, the average pitch length is based on the supplied data and the zero cross signal. It is determined whether or not the pitch signal and the zero-crossing period differ from each other by a predetermined amount or more. When it is determined that they are not different, the frequency characteristics of the BPF 25 are controlled so that the reciprocal of the zero-crossing period is the center frequency (the center frequency of the pass band of the BPF 25). On the other hand, when it is determined that they are different by a predetermined amount or more, the frequency characteristic of the BPF 25 is controlled so that the reciprocal of the average pitch length is set as the center frequency.
[0030]
The BPF 25 performs a function of a FIR (Finite Impulse Response) type filter having a variable center frequency.
Specifically, the BPF 25 sets its own center frequency to a value according to the control of the BPF coefficient calculation unit 24. Then, the voice data supplied from the voice data input unit 1 is filtered, and the filtered voice data (pitch signal) is supplied to the zero cross analysis unit 26 and the waveform correlation analysis unit 27. The pitch signal is assumed to be digital data having a sampling interval substantially the same as the sampling interval of audio data.
The bandwidth of the BPF 25 is desirably a bandwidth that always keeps the upper limit of the pass band of the BPF 25 within twice the fundamental frequency of the voice represented by the voice data.
[0031]
The zero cross analysis unit 26 specifies the timing at which the instantaneous value of the pitch signal supplied from the BPF 25 becomes 0 (the time of a zero crossing), and supplies a signal indicating the specified timing (zero cross signal) to the BPF coefficient calculation unit 24.
However, the zero cross analysis unit 26 may instead specify the timing at which the instantaneous value of the pitch signal reaches a predetermined value other than 0, and supply a signal representing that timing to the BPF coefficient calculation unit 24 in place of the zero cross signal.
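As a rough illustration of the zero-cross analysis described above, the following Python sketch locates the times at which a sampled pitch signal crosses a given level. The function names, the linear interpolation between samples, and the restriction to upward crossings (so that consecutive crossings are one full period apart) are assumptions; the text does not specify these details.

```python
import math

def zero_cross_times(samples, sample_interval, level=0.0):
    """Return the times (s) at which the signal crosses `level` going upward."""
    times = []
    for i in range(1, len(samples)):
        a = samples[i - 1] - level
        b = samples[i] - level
        if a < 0.0 <= b:
            frac = -a / (b - a)  # linear interpolation between the two samples
            times.append((i - 1 + frac) * sample_interval)
    return times

def mean_period(times):
    """Average spacing between consecutive crossings (needs at least two)."""
    gaps = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Usage: a 100 Hz sine sampled at 8 kHz should give a period close to 0.01 s.
signal = [math.sin(2 * math.pi * 100 * i / 8000 - 0.3) for i in range(800)]
period = mean_period(zero_cross_times(signal, 1 / 8000))
```

Passing a non-zero `level` corresponds to the variant in which a predetermined value other than 0 is used.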
[0032]
When the waveform correlation analysis unit 27 is supplied with the audio data from the audio data input unit 1 and with the pitch signal from the BPF 25, it divides the audio data at the timing at which the boundary of a unit period (for example, one cycle) of the pitch signal comes. Then, for each of the resulting sections, it obtains the correlation between the audio data in the section, with its phase changed in various ways, and the pitch signal in the section, and specifies the phase of the audio data at which the correlation is highest as the phase of the audio data in that section.
[0033]
Specifically, for each section, the waveform correlation analysis unit 27 obtains the value cor represented by, for example, the right side of Equation 2 for each of various values of φ representing the phase (where φ is an integer of 0 or more). The waveform correlation analysis unit 27 then specifies the value ψ of φ that maximizes cor, generates data indicating the value ψ, and supplies it to the phase adjustment unit 28 as phase data representing the phase of the audio data in this section.
[0034]
[Equation 2]

[0035]
Note that the time length of a section is preferably about one pitch. The longer the section, the more the number of samples in the section grows, which increases the data amount of the pitch waveform signal, or the more the sampling interval grows, which makes the voice represented by the pitch waveform signal inaccurate.
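The phase search of paragraphs [0032] and [0033] can be sketched in Python as follows. Because the body of Equation 2 is not reproduced in this text, the correlation used here, a sum of products of the shifted audio samples and the pitch signal samples, is an assumption, as are the circular shift within the section and the function name.

```python
import math

def best_phase(audio, pitch):
    """Return the shift psi (in samples) that maximizes the correlation.

    Assumed form: cor(phi) = sum_i audio[i - phi] * pitch[i], with the index
    wrapped around inside the section.
    """
    n = len(audio)

    def cor(phi):
        return sum(audio[(i - phi) % n] * pitch[i] for i in range(n))

    return max(range(n), key=cor)

# Usage: the audio leads the pitch signal by 5 samples, so psi should be 5.
n = 40
pitch = [math.sin(2 * math.pi * i / n) for i in range(n)]
audio = [math.sin(2 * math.pi * (i + 5) / n) for i in range(n)]
psi = best_phase(audio, pitch)
```

Shifting the audio data of the section by the value found in this way is what brings the sections to a common phase before resampling.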
[0036]
When the phase adjustment unit 28 is supplied with the audio data from the audio data input unit 1 and with the data indicating the phase ψ of each section of the audio data from the waveform correlation analysis unit 27, it shifts the phase of the audio data in each section so that it becomes equal to the phase ψ of that section indicated by the phase data. It then supplies the phase-shifted audio data to the pitch length fixing unit 3.
[0037]
When the phase-shifted audio data is supplied from the phase adjustment unit 28, the pitch length fixing unit 3 resamples each section of the audio data and supplies the resampled audio data (pitch waveform data) to the subband dividing unit 4 and the frictional sound detecting unit 9. The pitch length fixing unit 3 performs the resampling so that the number of samples is substantially equal across the sections of the audio data and the samples are equally spaced within each section.
[0038]
The pitch length fixing unit 3 generates data indicating the original number of samples in each section, and outputs the generated data to the outside as information (pitch information) indicating the original pitch length of each section.
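The resampling and the pitch information described in paragraphs [0037] and [0038] can be sketched as below. Linear interpolation, the mapping of the first and last samples onto themselves, and the function names are assumptions chosen for brevity; the text only requires an equal number of equally spaced samples per section.

```python
def fix_pitch_length(section, n_out):
    """Resample one section to n_out equally spaced samples (linear interpolation).

    Both the section and n_out must contain at least two samples.
    """
    n_in = len(section)
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)   # position in the original section
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append(section[i] * (1.0 - frac) + section[i + 1] * frac)
    return out

def fix_all(sections, n_out):
    """Return (pitch waveform data, pitch information)."""
    pitch_info = [len(s) for s in sections]          # original sample counts
    waveform = [fix_pitch_length(s, n_out) for s in sections]
    return waveform, pitch_info

# Usage: two sections of different length are both brought to 4 samples.
waveform, pitch_info = fix_all([[0, 1, 2, 3], [0, 2, 4, 6, 8, 10, 12]], 4)
```

Keeping the original sample counts alongside the fixed-length sections is what later allows the original time length of each section to be recovered.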
[0039]
The subband dividing unit 4 generates subband data by performing orthogonal transform such as DCT (Discrete Cosine Transform) on the audio data supplied from the pitch length fixing unit 3. Then, the generated subband data is supplied to the band limiting unit 5.
[0040]
The subband data consists of data representing the temporal change in intensity of the fundamental frequency component of the voice represented by the voice data supplied to the subband dividing unit 4, and n pieces of data (n is a natural number) representing the temporal changes in intensity of n harmonic components of that voice. Therefore, when the intensity of the fundamental frequency component (or a harmonic component) of the voice does not change over time, the subband data represents the intensity of that component in the form of a direct current signal.
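One way to picture the subband data is to transform each fixed-length section and then read the same coefficient across successive sections as a time series. The following Python sketch does this with an unnormalized DCT-II. Which coefficient indices correspond to the fundamental and to each harmonic is not stated in the text, so the sketch simply tracks every index; the function names are hypothetical.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of one section."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def subband_tracks(sections):
    """Turn per-section DCT coefficients into one time series per coefficient."""
    spectra = [dct2(s) for s in sections]
    return [list(track) for track in zip(*spectra)]

# Usage: three sections with the same shape; the middle one is twice as loud.
n = 8
shape = [math.cos(math.pi * (i + 0.5) * 2 / n) for i in range(n)]
tracks = subband_tracks([shape, [2 * v for v in shape], shape])
```

If every section had the same shape and level, each track would be constant, which corresponds to the direct current case mentioned above.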
[0041]
The band limiting unit 5 functions as, for example, an FIR digital filter. It filters each of the (n + 1) pieces of data constituting the subband data supplied from the subband dividing unit 4 and supplies the filtered subband data to the nonlinear quantization unit 6.
As a result of the filtering by the band limiting unit 5, components exceeding a predetermined frequency are substantially removed from the temporal changes in intensity of the (n + 1) frequency components (the fundamental frequency component and the harmonic components) represented by the subband data.
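The band limiting step can be sketched as a small FIR low-pass filter applied to each track of intensities. The three-tap coefficients, the handling of the first samples, and the function names below are assumptions for illustration; the text does not give the filter design or the cutoff frequency.

```python
def fir_filter(track, taps):
    """Causal FIR filter: y[t] = sum_j taps[j] * track[t - j].

    For t - j < 0 the first sample is reused (an assumed edge treatment).
    """
    out = []
    for t in range(len(track)):
        acc = 0.0
        for j, h in enumerate(taps):
            acc += h * track[max(t - j, 0)]
        out.append(acc)
    return out

def band_limit(tracks, taps=(0.25, 0.5, 0.25)):
    """Filter every subband track with the same low-pass taps."""
    return [fir_filter(track, taps) for track in tracks]

# Usage: a constant track passes unchanged; a rapidly alternating one is removed.
steady, jitter = band_limit([[3, 3, 3, 3, 3, 3], [1, -1, 1, -1, 1, -1]])
```

With these taps a slowly varying intensity is preserved while the fastest possible variation (alternating sign on every section) is cancelled once the filter has filled.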
[0042]
In addition to the data processing device, the nonlinear quantization unit 6 further includes a volatile storage device such as a RAM (Random Access Memory) and a nonvolatile storage device such as a ROM (Read Only Memory).
[0043]
When the nonlinear quantization unit 6 is supplied with the filtered subband data from the band limiting unit 5, it generates subband data corresponding to the quantized values of the values obtained by nonlinearly compressing the instantaneous value of each frequency component represented by that subband data (specifically, for example, values obtained by substituting the instantaneous value into an upward-convex function). It then supplies the generated subband data (subband data after nonlinear quantization) to the dictionary selection unit 7 and the frictional sound detection unit 9.
[0044]
Specifically, for example, the nonlinear quantization unit 6 may perform the nonlinear quantization by changing the instantaneous value of each frequency component so that its value after nonlinear compression is substantially equal to the quantized value of the function Xri(xi) shown on the right side of Equation 3.
[0045]
[Equation 3]
$$X_{ri}(x_i) = \operatorname{sgn}(x_i) \cdot |x_i|^{4/3} \cdot 2^{\mathrm{global\_gain}(x_i)/4}$$
(where $\operatorname{sgn}(\alpha) = \alpha / |\alpha|$, $x_i$ is the original instantaneous value of the frequency component represented by the subband data, and global_gain(xi) is a function of xi for setting the full scale)
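Equation 3 as printed can be evaluated with a few lines of Python. In this sketch global_gain is passed in as a plain number rather than as a function of xi, the value 0 is returned for an input of 0 (where sgn is undefined), and rounding to the nearest integer is used as the quantizer; all three choices, and the function names, are assumptions.

```python
import math

def nonlinear_value(x, global_gain):
    """Equation 3 as printed: sgn(x) * |x|**(4/3) * 2**(global_gain / 4)."""
    if x == 0:
        return 0.0  # sgn(0) is undefined in the text; 0 is assumed here
    magnitude = abs(x) ** (4.0 / 3.0) * 2.0 ** (global_gain / 4.0)
    return math.copysign(magnitude, x)

def quantize(x, global_gain):
    """Round the value of Equation 3 to the nearest integer (assumed quantizer)."""
    return round(nonlinear_value(x, global_gain))

# Usage: global_gain scales the result by a factor of 2 for every 4 steps.
levels = [quantize(x, 4) for x in (-8, 0, 1, 8)]
```

The choice of global_gain therefore sets the full scale, and with it the number of distinct levels the quantized subband data can take.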
[0046]
It is assumed that the nonlinear quantization unit 6 stores in advance data specifying the function global_gain (xi) in accordance with a user's write operation or the like.
The function global_gain(xi) is desirably chosen so that the data amount of the subband data after nonlinear quantization is on the order of one hundredth of the data amount that would result if the nonlinear quantization unit 6 quantized without performing nonlinear compression.
[0047]
The dictionary selection unit 7 accesses the speech dictionary 8 and determines whether or not, among the subband data stored in the speech dictionary 8 as described later, the one having the highest correlation with the subband data after nonlinear quantization supplied from the nonlinear quantization unit 6 shows a strong correlation of a certain level or more.
[0048]
Specifically, the dictionary selection unit 7 may perform, for example, the processing shown as (1) to (3) below.
(1) First, the correlation coefficients between identical frequency components are obtained between the subband data supplied from the nonlinear quantization unit 6 and one set of subband data stored in the speech dictionary 8, and the average value of these correlation coefficients is obtained.
(2) The processing of (1) is performed on all the subband data stored in the speech dictionary 8, and the subband data with the highest average correlation coefficient is identified as having the highest correlation with the subband data supplied from the nonlinear quantization unit 6.
(3) Next, it is determined whether or not the average value of the correlation coefficients between the subband data identified in (2) and the subband data supplied from the nonlinear quantization unit 6 is greater than a predetermined value.
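Steps (1) to (3) can be sketched in Python as follows. The Pearson correlation coefficient, the dictionary layout (a mapping from index to a list of tracks), the treatment of constant tracks, and all names are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # assumption: a constant track counts as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def select_from_dictionary(tracks, dictionary, threshold):
    """Return (index, score) of the best entry, or (None, score) if too weak."""
    best_index, best_score = None, -2.0
    for index, entry in dictionary.items():
        # Step (1): average the per-component correlation coefficients.
        score = sum(pearson(t, e) for t, e in zip(tracks, entry)) / len(tracks)
        # Step (2): keep the entry with the highest average.
        if score > best_score:
            best_index, best_score = index, score
    # Step (3): accept the best entry only above the predetermined value.
    if best_score > threshold:
        return best_index, best_score
    return None, best_score

# Usage with a two-entry dictionary of two tracks each.
dictionary = {"a": [[1, 2, 3, 4], [4, 3, 2, 1]],
              "b": [[1, 3, 2, 4], [1, 1, 2, 2]]}
hit = select_from_dictionary([[2, 4, 6, 8.5], [8, 6, 4, 2]], dictionary, 0.9)
miss = select_from_dictionary([[1, -1, 1, -1], [1, 1, -1, -1]], dictionary, 0.9)
```

When an index is returned, it can be output in place of the subband data; when `None` is returned, the subband data itself is output.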
[0049]
When the dictionary selection unit 7 determines that a correlation of a certain level or more is shown, it outputs to the outside, as acoustic information, the index number (or symbol), described later, that is assigned to the subband data showing that correlation. When it determines that no such correlation is shown, it outputs the subband data supplied from the nonlinear quantization unit 6 to the outside as acoustic information.
[0050]
The speech dictionary 8 is composed of a nonvolatile storage device such as a hard disk device.
The speech dictionary 8 stores, for each of various voices, subband data after nonlinear compression that represents the temporal change of each frequency component of that voice. It also stores a unique index number (or symbol) for each piece of subband data in association with that subband data. In response to an access by the dictionary selection unit 7, it supplies the subband data and the index numbers (or symbols) that it stores to the dictionary selection unit 7.
[0051]
When the frictional sound detecting unit 9 is supplied with the subband data after nonlinear quantization from the nonlinear quantizing unit 6, it determines, based on that subband data, whether or not the voice data input to the acoustic parameter extractor represents a frictional sound.
[0052]
The waveform of a frictional sound (fricative) has a broad spectrum like white noise, but characteristically contains little of the fundamental frequency component or harmonic components. Therefore, the frictional sound detection unit 9 may, for example, determine whether or not the intensity of the harmonic components represented by the supplied subband data is at or below a predetermined ratio of the total intensity of the voice from which the acoustic parameters are extracted. When it determines that the intensity is at or below the predetermined ratio, it determines that the voice data input to the acoustic parameter extractor represents a frictional sound; when it determines that the intensity exceeds the predetermined ratio, it determines that the voice data does not represent a frictional sound. Note that the frictional sound detection unit 9 may acquire voice data from the voice data input unit 1 in order to obtain the total intensity of the voice from which the acoustic parameters are extracted.
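The decision rule in this paragraph amounts to a ratio test, which can be sketched as below. How intensity is measured (for example as a sum of squares), the value of the ratio, the handling of silence, and the function name are assumptions not fixed by the text.

```python
def is_frictional(harmonic_intensities, total_intensity, max_ratio):
    """True when the harmonic components carry at most max_ratio of the total."""
    if total_intensity <= 0.0:
        return False  # assumption: silence is not classified as a frictional sound
    return sum(harmonic_intensities) / total_intensity <= max_ratio

# Usage with an assumed ratio of 0.2.
noise_like = is_frictional([0.10, 0.05], 10.0, 0.2)   # little harmonic content
voiced = is_frictional([6.0, 2.0], 10.0, 0.2)         # mostly harmonic content
```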
[0053]
When the frictional sound detection unit 9 determines that the voice data input to the acoustic parameter extractor represents a frictional sound, it applies an FFT (Fast Fourier Transform), or any other method that generates data representing the result of a Fourier transform of a discrete variable, to the audio data supplied from the pitch length fixing unit 3, thereby generating data representing the spectral distribution of that audio data. It then outputs the generated data to the outside as information representing the frictional sound (frictional sound information).
[0054]
The acoustic parameter extractor described above outputs, as data representing acoustic parameters, pitch information representing the pitch of the voice represented by the input voice data, acoustic information representing the temporal change in intensity of the fundamental frequency component and harmonic components of that voice, and, when that voice is a frictional sound, frictional sound information representing its spectral distribution.
[0055]
The time length of each unit-pitch section of the input voice data is standardized, which removes the influence of pitch fluctuation. As a result, highly accurate acoustic information is extracted from the voice data.
In addition, the original time length of each section of the audio data can be specified using the pitch information and the known value of the sampling interval of the original audio data. For this reason, the original voice data can be easily restored by restoring the time length of each section of the pitch waveform signal to the time length in the original voice data.
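The restoration described here can be sketched by resampling each fixed-length section back to its original sample count, taken from the pitch information. Linear interpolation and the function names are assumptions; the original time length of a section in seconds is simply its original sample count multiplied by the sampling interval of the original data.

```python
def resample(section, n_out):
    """Linearly resample a section (at least two samples) to n_out >= 2 samples."""
    n_in = len(section)
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append(section[i] * (1.0 - frac) + section[i + 1] * frac)
    return out

def restore(waveform, pitch_info):
    """Bring every section back to its original length and concatenate them."""
    restored = []
    for section, n_orig in zip(waveform, pitch_info):
        restored.extend(resample(section, n_orig))
    return restored

def section_durations(pitch_info, sample_interval):
    """Original time length (s) of each section."""
    return [n * sample_interval for n in pitch_info]

# Usage: two 4-sample sections whose original lengths were 4 and 7 samples.
restored = restore([[0, 1, 2, 3], [0, 4, 8, 12]], [4, 7])
```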
[0056]
Note that the configuration of the pitch waveform extraction system is not limited to that described above.
For example, the voice data input unit 1 may acquire voice data from the outside via a communication line such as a telephone line, a dedicated line, or a satellite line. In this case, the audio data input unit 1 only needs to include a communication control unit including, for example, a modem or a DSU (Data Service Unit).
[0057]
The audio data input unit 1 may include a sound collection device including a microphone, an AF (Audio Frequency) amplifier, a sampler, an A/D (Analog-to-Digital) converter, a PCM encoder, and the like. The sound collection device may acquire audio data by amplifying a sound signal representing the sound collected by its own microphone, sampling and A/D-converting it, and then applying PCM modulation to the sampled sound signal. Note that the audio data acquired by the audio data input unit 1 need not be a PCM signal.
[0058]
In addition, the pitch extraction unit 2 need not include the cepstrum analysis unit 21 (or the autocorrelation analysis unit 22). In this case, the weight calculation unit 23 may treat the reciprocal of the fundamental frequency obtained by whichever of the cepstrum analysis unit 21 and the autocorrelation analysis unit 22 is present as the average pitch length as-is.
Alternatively, the zero cross analysis unit 26 may supply the pitch signal supplied from the BPF 25 to the cepstrum analysis unit 21 as it is as a zero cross signal.
[0059]
The pitch length fixing unit 3 may supply the pitch information to the outside via a communication line. In this case, the pitch length fixing unit 3 only needs to include a communication control unit including a modem, a DSU, or the like. Similarly, the frictional sound detection unit 9 (or the dictionary selection unit 7) may supply the frictional sound information (or the acoustic information) to the outside via a communication line. In this case, the frictional sound detection unit 9 (or the dictionary selection unit 7) only needs to include a communication control unit similar to that of the pitch length fixing unit 3. A single device may perform some or all of the functions of the communication control units of the pitch length fixing unit 3, the frictional sound detection unit 9, and the dictionary selection unit 7.
[0060]
The pitch length fixing unit 3 may write the pitch information to an external storage device such as an external recording medium or a hard disk device. In this case, the pitch length fixing unit 3 only needs to include a recording control unit including a control circuit such as a recording medium driver or a hard disk controller. Similarly, the frictional sound detection unit 9 (or the dictionary selection unit 7) may write the frictional sound information (or the acoustic information) to an external storage device. In this case, the frictional sound detection unit 9 (or the dictionary selection unit 7) only needs to include a recording control unit similar to that of the pitch length fixing unit 3. A single device may perform some or all of the functions of the recording control units of the pitch length fixing unit 3, the frictional sound detection unit 9, and the dictionary selection unit 7.
[0061]
Moreover, the dictionary selection unit 7 may include a storage unit that stores the most recent of the nonlinearly quantized subband data supplied from the nonlinear quantization unit 6 in the past. In this case, each time the dictionary selection unit 7 is newly supplied with subband data after nonlinear quantization, it may determine whether or not that subband data shows a correlation of a certain level or more with the subband data after nonlinear quantization that it has stored, and output information indicating the determination result as data constituting the acoustic information. A single storage device may serve as both the storage unit of the dictionary selection unit 7 and the speech dictionary 8.
[0062]
In addition, when the dictionary selection unit 7 determines that newly supplied subband data after nonlinear quantization shows a correlation of a certain level or more with the subband data after nonlinear quantization that it has stored, the acoustic information need not include the subband data or an index number (or symbol). Doing so reduces the data amount of the acoustic information.
[0063]
Further, when the dictionary selection unit 7 determines that none of the subband data stored in the speech dictionary 8 shows a strong correlation of a certain level or more with the subband data after nonlinear quantization supplied from the nonlinear quantization unit 6, it may assign a unique index number (or symbol) to that supplied subband data and store the subband data and the index number (or symbol) in the speech dictionary 8 in association with each other.
[0064]
Although the embodiments of the present invention have been described above, the audio signal processing apparatus according to the present invention can be realized using a normal computer system, not a dedicated system.
For example, an acoustic parameter extractor that executes the above-described processing can be configured by installing, on a personal computer, a program for executing the operations of the above-described voice data input unit 1, pitch extraction unit 2, pitch length fixing unit 3, frictional sound detection unit 9, subband dividing unit 4, band limiting unit 5, nonlinear quantization unit 6, dictionary selection unit 7, and speech dictionary 8 from a medium (CD-ROM, MO, flexible disk, etc.) storing the program.
[0065]
Further, for example, this program may be posted on a bulletin board system (BBS) on a communication line and distributed via the communication line. Alternatively, a carrier wave may be modulated by a signal representing this program and the resulting modulated wave transmitted, and an apparatus that receives the modulated wave may demodulate it to restore the program.
The above-described processing can be executed by starting this program and executing it under the control of the OS in the same manner as other application programs.
[0066]
When the OS performs part of the processing, or when the OS constitutes part of one component of the present invention, the recording medium may store the program with that part excluded. In this case too, in the present invention, the recording medium is regarded as storing a program for executing each function or step executed by the computer.
[0067]
[Effects of the invention]
As described above, according to the present invention, an audio signal processing apparatus and an audio signal processing method for accurately extracting information representing the characteristics of audio including pitch fluctuations are realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an acoustic parameter extractor according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a configuration of a pitch extraction unit.
FIG. 3 is a diagram schematically illustrating the concept of a conventional rule composition method.
[Explanation of symbols]
1 Voice data input unit
2 Pitch extraction unit
21 Cepstrum analysis unit
22 Autocorrelation analysis unit
23 Weight calculation unit
24 BPF coefficient calculation unit
25 BPF
26 Zero cross analysis unit
27 Waveform correlation analysis unit
28 Phase adjustment unit
3 Pitch length fixing unit
4 Subband dividing unit
5 Band limiting unit
6 Nonlinear quantization unit
7 Dictionary selection unit
8 Speech dictionary
9 Frictional sound detection unit

Claims

Pitch waveform signal generating means for acquiring an audio signal to be processed that represents a voice waveform, and processing the audio signal into a pitch waveform signal by making the time lengths of the sections corresponding to a unit pitch of the audio signal substantially the same;
Subband extraction means for generating a subband signal representing a time change of the fundamental frequency component and the harmonic component of the audio signal to be processed based on the pitch waveform signal;
Filter means for substantially removing a component having a predetermined frequency or more from temporal changes of the fundamental frequency component and the harmonic component represented by the subband signal by filtering the subband signal generated by the subband extraction means;
Output subband information generating means for generating and outputting output subband information representing the result of nonlinear quantization of the subband signal filtered by the filter means, wherein
The output subband information generating means stores reference subband information representing temporal changes in the fundamental frequency component and the harmonic components of reference audio, determines whether or not any of the reference subband information shows a correlation of a certain level or more with the output subband information, and, when it determines that there is, outputs identification information identifying the corresponding reference subband information in place of the output subband information,
An audio signal processing apparatus.

Further comprising means for determining, based on the subband signal, whether or not the audio signal to be processed represents a frictional sound and, when it is determined that it does, generating and outputting information representing the spectral distribution of the pitch waveform signal before it is filtered by the filter means,
The audio signal processing apparatus according to claim 1.

The subband extracting means includes:
A variable filter that extracts the fundamental frequency component of the voice to be processed by changing its frequency characteristic according to control and filtering the audio signal to be processed;
Filter characteristic determining means for specifying the fundamental frequency of the voice based on the fundamental frequency component extracted by the variable filter, and controlling the variable filter so that it has a frequency characteristic that blocks components other than those near the specified fundamental frequency;
Pitch extracting means for dividing the audio signal to be processed into sections each consisting of the audio signal for a unit pitch, based on the value of the fundamental frequency component of the audio signal; and
Pitch length fixing means for generating a pitch waveform signal in which the time lengths of the sections are substantially the same by sampling each section of the audio signal to be processed with substantially the same number of samples,
The audio signal processing apparatus according to claim 1 or 2.

Pitch information output means for generating and outputting pitch information for specifying the original time length of each section of the pitch waveform signal,
The audio signal processing apparatus according to claim 3.

The filter characteristic determining means includes cross detecting means for specifying the period of the timings at which the fundamental frequency component extracted by the variable filter reaches a predetermined value, and specifies the fundamental frequency based on the specified period,
The audio signal processing apparatus according to claim 3 or 4.

The filter characteristic determining means includes:
Average pitch detecting means for detecting the time length of the pitch of the voice represented by the audio signal, based on the audio signal to be processed before it is filtered; and
Discriminating means for determining whether or not the period specified by the cross detecting means and the time length of the pitch specified by the average pitch detecting means differ from each other by a predetermined amount or more, controlling the variable filter, when it determines that they do not, so that it has a frequency characteristic that blocks components other than those near the fundamental frequency specified by the cross detecting means, and controlling the variable filter, when it determines that they do, so that it has a frequency characteristic that blocks components other than those near the fundamental frequency specified from the time length of the pitch specified by the average pitch detecting means,
The audio signal processing apparatus according to claim 5.

The average pitch detecting means is
Cepstrum analysis means for obtaining a frequency at which the cepstrum of the sound signal to be processed before being filtered by the variable filter takes a maximum value;
Autocorrelation analysis means for obtaining a frequency at which the periodogram of the autocorrelation function of the speech signal to be processed before being filtered by the variable filter takes a maximum value;
Based on each frequency obtained by the cepstrum analysis means and the autocorrelation analysis means, an average value of the pitch of the voice represented by the voice signal to be processed is obtained, and the obtained average value is specified as a time length of the pitch of the voice. An average calculating means,
The audio signal processing apparatus according to claim 6.

A pitch waveform signal generating step of acquiring an audio signal to be processed that represents a voice waveform, and processing the audio signal into a pitch waveform signal by making the time lengths of the sections corresponding to a unit pitch of the audio signal substantially the same;
A filtering step of substantially removing a component having a predetermined frequency or higher from the pitch waveform signal by filtering the pitch waveform signal;
An output subband information generating step of extracting the fundamental frequency component and the harmonic components of the audio signal to be processed from the filtered pitch waveform signal, and generating and outputting output subband information representing the result of nonlinear quantization of the extracted fundamental frequency component and harmonic components, wherein
In the output subband information generating step, reference subband information representing temporal changes in the fundamental frequency component and the harmonic components of reference audio is stored, it is determined whether or not any of the reference subband information shows a correlation of a certain level or more with the output subband information, and, when it is determined that there is, identification information identifying the corresponding reference subband information is output in place of the output subband information,
An audio signal processing method characterized by the above.

A program for causing a computer to function as:
Pitch waveform signal generating means for acquiring an audio signal to be processed that represents a voice waveform, and processing the audio signal into a pitch waveform signal by making the time lengths of the sections corresponding to a unit pitch of the audio signal substantially the same;
Filter means for substantially removing a component having a predetermined frequency or higher from the pitch waveform signal by filtering the pitch waveform signal; and
Output subband information generating means for extracting the fundamental frequency component and the harmonic components of the audio signal to be processed from the pitch waveform signal filtered by the filter means, and generating and outputting output subband information representing the result of nonlinear quantization of the extracted fundamental frequency component and harmonic components, wherein
The output subband information generating means stores reference subband information representing temporal changes in the fundamental frequency component and the harmonic components of reference audio, determines whether or not any of the reference subband information shows a correlation of a certain level or more with the output subband information, and, when it determines that there is, outputs identification information identifying the corresponding reference subband information in place of the output subband information,
A program characterized by the above.