JP2014109770A - Speech processing unit, speech recognition system, speech processing method, and speech processing program

Info

Publication number: JP2014109770A
Application number: JP2012265707A
Authority: JP
Inventors: Junta Asano; 純太浅野; Ryutaro Okamoto; 隆太郎岡本
Original assignee: Samsung R&D Institute Japan Co Ltd
Current assignee: Samsung R&D Institute Japan Co Ltd
Priority date: 2012-12-04
Filing date: 2012-12-04
Publication date: 2014-06-12
Anticipated expiration: 2032-12-04
Also published as: JP6229869B2

Abstract

PROBLEM TO BE SOLVED: To provide a novel and improved speech processing unit capable of extracting speech more clearly by reducing the influence of noise on the speech signal, an influence that changes with the fluctuation of the voice.

SOLUTION: The speech processing unit includes a detection part that detects the period of a speaker's behavior, a specification part that specifies a change in gain along a time series based on the detected period of the behavior, and a signal processing part that adjusts the amplitude of an input signal including a speech signal along the time series based on the specified change in gain.

Description

The present invention relates to a speech processing device, a speech recognition system, a speech processing method, and a speech processing program.

Speech recognition is widely used as a technique for determining operation content or input characters by analyzing speech that is uttered by the person speaking (hereinafter simply called the “speaker”) and acquired as input.

Patent Document 1: JP 2004-163458 A

Such speech recognition technology must correctly recognize the speech uttered by the speaker, but noise, that is, sound other than the speech, hinders this. Patent Document 1 discloses an example of a technique for removing such noise.

Speech that is uttered by a speaker and picked up by a given microphone has a fluctuation peculiar to each speaker. Noise, in contrast, is not affected by the speaker's movements. Consequently, the magnitude of the influence of noise on the speech signal may change with the phase of this fluctuation. The speaker-specific fluctuation also appears in the rhythm of the speaker's movements, for example utterance movements such as mouth movements, movements of the head and body, and changes in facial expression.

The present invention has been made in view of the above problem. Its object is to provide a novel and improved speech processing device that, based on the speaker's movements, reduces the influence of noise on the speech signal, an influence that changes with the fluctuation of the speech, and can thereby extract the speech more clearly.

To solve the above problem, one aspect of the present invention provides a speech processing device comprising: a detection unit that detects the period of a speaker's behavior; a specifying unit that specifies a change in gain along a time series based on the detected period of the behavior; and a signal processing unit that adjusts the amplitude of an input signal including a speech signal along the time series based on the specified change in gain.

With this configuration, the fluctuation of the speech signal is estimated as the period of the speaker's behavior, and a difference in the amplitude of the input signal is produced between a desired timing within that period and other timings. As a result, the input signal (and hence the speech signal) at the desired timing is emphasized in step with the period of the speaker's behavior (and hence with the fluctuation of the speech signal).

The device may include a synchronization calculation unit that specifies a synchronization timing between the period of the speaker's behavior and the input signal, and the signal processing unit may determine the timing at which to adjust the amplitude of the input signal based on the specified synchronization timing.

This makes it possible to synchronize the input signal with the processing that adjusts its amplitude along the time series, in accordance with the synchronization timing.

The synchronization calculation unit may calculate the amount of shift along the time series between the period of the speaker's behavior and the input signal, based on information indicating the speaker's utterance movements detected along the time series and on the change in amplitude of the input signal along the time series, and may specify the synchronization timing based on that shift amount.

The synchronization calculation unit may take, as the shift amount, the difference between the timing at which the amplitude of the input signal increases by a predetermined amount or more and the timing at which the utterance movement starts.
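The shift amount described above can be sketched as follows. This is only an illustrative sketch: the function names, the use of sample-to-sample differences to detect the rise, the millisecond timestamps, and the sign convention (audio rise time minus utterance start time) are assumptions and are not specified in the source.

```python
def first_rise_time(amplitudes, times, min_increase):
    """Return the time of the first sample whose amplitude exceeds the
    previous sample's amplitude by min_increase or more (None if absent)."""
    for i in range(1, len(amplitudes)):
        if amplitudes[i] - amplitudes[i - 1] >= min_increase:
            return times[i]
    return None


def shift_amount(amplitudes, times, min_increase, utterance_start_time):
    """Shift between the audio rise and the detected start of the
    utterance movement (assumed convention: audio minus movement)."""
    rise = first_rise_time(amplitudes, times, min_increase)
    if rise is None:
        return None  # no sufficiently large rise was found
    return rise - utterance_start_time


# Hypothetical data: amplitude envelope sampled every 100 ms,
# mouth movement detected at 150 ms.
envelope = [0.1, 0.1, 0.6, 0.7]
timestamps_ms = [0, 100, 200, 300]
shift_ms = shift_amount(envelope, timestamps_ms, 0.3, 150)
```

A positive value here would mean that the audio rise lags the detected movement, so the gain processing would be delayed by that amount before being applied.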

The detection unit may detect the period of the behavior based on image information showing the speaker's movements.

The detection unit may detect, from the image information, the period of movement of a predetermined part among the parts of the speaker's body, and may specify the period of the behavior based on the detected period.

The detection unit may detect the periods of movement of a plurality of parts as the movement of the predetermined part, and may specify the period of the behavior by applying predetermined statistical processing to the periods of the respective parts.

With this configuration, the period of the speaker's behavior, which changes according to the combination of movements of a plurality of parts, can be specified more accurately.

As the statistical processing, the detection unit may weight the period of each of the plurality of parts and specify the period of the behavior by averaging the weighted periods.

As the statistical processing, the detection unit may specify the period of the behavior from the detected periods of the respective parts by using a Bayesian network created in advance based on the causal relationship between the period of each of the plurality of parts and the period of the behavior.

The specifying unit may specify the change in gain so that it is synchronized with the detected period of the behavior.

The specifying unit may specify the change in gain so that the peak position of the change in gain is shifted from the peak position of the period of the behavior by a predetermined time width.

The specifying unit may specify the change in gain so that its phase is shifted from the period of the behavior by a predetermined time width, such that the gain is boosted at a predetermined phase within the detected period of the behavior.

To solve the above problem, another aspect of the present invention provides a speech recognition device comprising: a sound collection unit that collects an input signal including a speech signal; an image acquisition unit that acquires the speaker's movements as a moving image; a detection unit that detects the period of the speaker's behavior based on the moving image; a specifying unit that specifies a change in gain along a time series based on the detected period of the behavior; a signal processing unit that adjusts the amplitude of the input signal including the speech signal along the time series based on the specified change in gain; and a speech recognition unit that performs speech recognition based on the amplitude-adjusted input signal.

To solve the above problem, another aspect of the present invention provides a speech processing method comprising: a detection step of detecting the period of a speaker's behavior; a specifying step of specifying a change in gain along a time series based on the detected period of the behavior; and a signal processing step of adjusting the amplitude of an input signal including a speech signal along the time series based on the specified change in gain.

To solve the above problem, another aspect of the present invention provides a speech processing program that executes: detection processing that detects the period of a speaker's behavior; specifying processing that specifies a change in gain along a time series based on the detected period of the behavior; and signal processing that adjusts the amplitude of an input signal including a speech signal along the time series based on the specified change in gain.

As described above, according to the present invention, it is possible, based on the speaker's movements, to reduce the influence of noise on the speech signal, an influence that changes with the fluctuation of the speech, and to extract the speech more clearly.

A conceptual diagram for explaining an example of an application scene of the speech recognition system according to an embodiment of the present invention.
A block diagram showing the configuration of the speech recognition system according to the embodiment.
A block diagram showing the configuration of the speech processing unit according to the embodiment.
A diagram for explaining the change in amplitude along the time series of each signal included in the input signal.
A diagram for explaining the method of specifying the period of the speaker's behavior.
A diagram for explaining gain control based on the period of the speaker's behavior.
A diagram for explaining the synchronization between the input signal and the processing based on the period of the speaker's behavior.
A flowchart showing a series of operations of the speech processing unit.
A diagram for explaining gain control in one aspect of the speech processing unit according to a modification.
A diagram for explaining gain control in one aspect of the speech processing unit according to a modification.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and redundant description is omitted.

First, an example of an application scene of the speech recognition system according to this embodiment is described with reference to FIG. 1A. FIG. 1A is a conceptual diagram for explaining an example of an application scene of the speech recognition system according to this embodiment. As an example, FIG. 1A shows a PC (Personal Computer) that includes a main body M11 and a display M13 and that has, as input interfaces, a microphone M12 that picks up the voice of a user U1 and a camera M131 that captures the appearance of the user U1.

The speech recognition system according to this embodiment is applied, for example, to a PC such as the one shown in FIG. 1A. The system acquires the voice of the speaker (user U1) as an input signal with a sound collection unit such as the microphone M12, and analyzes the input signal to determine the content of the speaker's instruction (for example, operation content or input characters). The system then causes an execution unit that executes each process (for example, the CPU of the main body M11) to execute the process corresponding to the determined instruction, for example via an OS (Operating System) installed on the main body M11. The configuration of this speech recognition system is described below.

[Speech recognition system]
First, the configuration of the speech recognition system according to this embodiment is described with reference to FIG. 1B. FIG. 1B is a block diagram showing the configuration of the speech recognition system according to this embodiment. As shown in FIG. 1B, the speech recognition system includes a sound collection unit 101, an image acquisition unit 102, a speech processing unit 11, a speech recognition unit 12, and an operation control unit 13. This speech recognition system corresponds to the “speech recognition device”.

The sound collection unit 101 acquires the speech uttered by the speaker as an input signal. This input signal contains a speech signal representing the speech uttered by the speaker and noise such as background sound. A specific example of the sound collection unit 101 is the microphone M12 shown in FIG. 1A. The sound collection unit 101 outputs the acquired input signal InA to the speech processing unit 11.

The image acquisition unit 102 captures the appearance of the speaker along a time series and acquires a moving image showing the speaker's movements. A specific example of the image acquisition unit 102 is the camera M131 shown in FIG. 1A. The image acquisition unit 102 outputs the acquired moving image InB to the speech processing unit 11. This moving image InB corresponds to the “image information showing the speaker's movements”.

The speech processing unit 11 acquires the input signal InA from the sound collection unit 101. As described above, the input signal InA may contain a speech signal. This speech signal has a fluctuation peculiar to each speaker; that is, its amplitude changes along the time series. The fluctuation also appears in the speaker's movements, such as utterance movements (for example, mouth movements) and movements of the neck and hands. These movements are hereinafter collectively called the “speaker's behavior”.

The speech processing unit 11 therefore acquires the moving image InB from the image acquisition unit 102 and analyzes it to specify the speaker's behavior. Based on the specified behavior, the speech processing unit 11 estimates the change in amplitude of the speech signal along the time series and adjusts the amplitude of the input signal to match the estimated change, thereby emphasizing the speech signal in the input signal. The operating principle, the configuration of the speech processing unit 11, and the details of its processing are described later.

The speech processing unit 11 outputs the input signal Out, whose amplitude has been adjusted along the time series, to the speech recognition unit 12. The speech processing unit 11 corresponds to the “speech processing device”.

The speech recognition unit 12 receives from the speech processing unit 11 the input signal Out whose amplitude has been adjusted along the time series. The speech recognition unit 12 performs speech recognition processing on the input signal Out and specifies the sentence uttered by the speaker. The speech recognition processing may, for example, analyze the frequency components of the input signal and specify each character of the sentence based on the distribution of those components. The processing is not limited to this method as long as the sentence can be specified based on the input signal Out.

The speech recognition unit 12 performs syntactic analysis (for example, lexical analysis and semantic analysis) on the specified sentence and recognizes its meaning. Based on the sentence whose meaning has been recognized, the speech recognition unit 12 instructs the operation control unit 13 to perform the processing indicated by that sentence. The syntactic analysis is likewise not limited to the above method as long as the meaning of the sentence can be specified based on the input signal Out.

Some characters may not be specifiable because the amplitude of part of the speech signal in the input signal Out is small. In that case, the speech recognition unit 12 may be operated so that, in the speech recognition processing, the syntactic analysis, or both, it estimates the characters that could not be specified from the characters of the other parts that could be specified.

The operation control unit 13 schematically represents a control unit that controls the operation of each part (not shown) of the device or system into which the speech recognition system is incorporated. The operation control unit 13 receives from the speech recognition unit 12 an instruction to execute processing based on the input signal Out and, based on this instruction, controls the operation of each part of the device or system. A specific example of the operation control unit 13 is the CPU of the main body M11 shown in FIG. 1A.

[Speech processing unit 11]
Next, the operating principle, detailed configuration, and processing of the speech processing unit 11 are described.

(Operating principle)
First, the form of each signal included in the input signal, and the operating principle of the speech processing unit 11 based on it, are described with reference to FIG. 2. FIG. 2 is a diagram for explaining the change in amplitude along the time series of each signal included in the input signal. The vertical axis of FIG. 2 represents the amplitude A of each signal, and the horizontal axis represents time t. In FIG. 2, f10 denotes the input signal, which contains a speech signal f110 representing the speaker's voice and noise f130. The speech signal f110 contains components whose periods derive from, for example, sentences, clauses, phrases, bunsetsu (phrase segments), syllables, and phonemes. The speech signal f110 and the noise f130 are generally superimposed, but in the example of FIG. 2 they are shown separately for illustration.

As described above, the speech signal f110 fluctuates in a way that depends on the speaker, and this fluctuation also appears in the speaker's behavior, such as utterance movements (or the accompanying breaks in the text between sentences, clauses, phrases, bunsetsu, and so on) and movements of the neck and hands. In other words, as shown in FIG. 2, the amplitude of the speech signal f110 can be said to change along the time series with a period corresponding to the speaker's behavior. This period of fluctuation, that is, the period of the change in amplitude of the speech signal along the time series, is hereinafter called the period f20.

The noise f130, on the other hand, does not depend on the speaker's behavior and therefore, unlike the speech signal f110, does not fluctuate with the speaker. Consequently, when the amplitude of the speech signal f110 is small (at phases where it becomes small), the ratio of the amplitude of the noise f130 to that of the speech signal f110 is large, and the noise f130 dominates. Conversely, when the amplitude of the speech signal f110 is large (at phases where it becomes large), that ratio is small, and the speech signal f110 is more dominant than in the former case.
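The dependence of the noise-to-speech ratio on the phase of the fluctuation can be illustrated with a simple model. The symbols below are assumptions introduced only for illustration and do not appear in the source: A_s(t) is the amplitude of the speech signal f110, A_n is the amplitude of the noise f130 (taken as roughly constant), \bar{A} is the mean speech amplitude, m is the modulation depth, and T is the period f20. The sinusoidal shape of the fluctuation is likewise an assumption.

```latex
r(t) = \frac{A_n}{A_s(t)}, \qquad
A_s(t) = \bar{A}\,\bigl(1 + m\cos(2\pi t / T)\bigr), \quad 0 < m < 1
\;\Longrightarrow\;
\frac{A_n}{\bar{A}\,(1 + m)} \;\le\; r(t) \;\le\; \frac{A_n}{\bar{A}\,(1 - m)}
```

Under this model, with m = 0.5 the ratio r(t) at the trough of the fluctuation is (1 + m)/(1 - m) = 3 times its value at the peak, which is why the noise is more dominant at phases where the speech amplitude is small.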

The speech processing unit 11 according to this embodiment therefore estimates, based on the speaker's behavior, the period f20 with which the amplitude of the speech signal changes along the time series, and adjusts the amplitude of the input signal along the time series based on the estimated period. This estimated period is hereinafter called the “period of the speaker's behavior”.

As a specific example, based on the period of the speaker's behavior, the speech processing unit 11 attenuates the amplitude of the input signal f10 more strongly the smaller the amplitude of the speech signal f110 becomes. Operating in this way, the speech processing unit 11 attenuates the input signal f10 more at timings where the ratio of the amplitude of the noise f130 to that of the speech signal f110 is larger (that is, the more dominant the noise f130 is). As a result, timings at which this ratio is small (that is, timings at which the speech signal f110 dominates) are emphasized, which improves the accuracy of the speech recognition processing performed by the downstream speech recognition unit 12. The input signal f10 in FIG. 2 corresponds to the input signal InA described above.
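A minimal sketch of this kind of periodic attenuation is shown below. It is not the patent's implementation: the raised-cosine gain shape, the function and parameter names (`gain_at`, `apply_gain`, `peak_time`, `min_gain`), and the default floor of 0.2 are all assumptions chosen for illustration. The only property taken from the text is that the gain is highest where the speech amplitude is expected to peak and lowest where it is expected to be smallest.

```python
import math


def gain_at(t, period, peak_time, min_gain=0.2):
    """Raised-cosine gain in [min_gain, 1.0] (assumed shape).

    The gain is 1.0 at peak_time, falls to min_gain half a period
    later, and repeats every `period` seconds.
    """
    phase = 2.0 * math.pi * (t - peak_time) / period
    return min_gain + (1.0 - min_gain) * 0.5 * (1.0 + math.cos(phase))


def apply_gain(samples, sample_rate, period, peak_time, min_gain=0.2):
    """Scale each input sample by the gain at its time position."""
    return [
        s * gain_at(n / sample_rate, period, peak_time, min_gain)
        for n, s in enumerate(samples)
    ]


# Hypothetical usage: a 1-second behavior period whose peak is at t = 0,
# applied to a constant-amplitude input sampled at 2 Hz.
adjusted = apply_gain([1.0, 1.0], sample_rate=2, period=1.0, peak_time=0.0)
```

Shifting `peak_time` moves the gain peak relative to the detected behavior period, which is one way to express a fixed offset between the two.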

(Configuration)
Next, the configuration of the speech processing unit 11 is described with reference to FIG. 1C. FIG. 1C is a block diagram showing the configuration of the speech processing unit 11. As shown in FIG. 1C, the speech processing unit 11 includes a motion detection unit 111, a gain specifying unit 112, a speech signal acquisition unit 113, a delay processing unit 114, a synchronization calculation unit 115, and a signal processing unit 116. The signal processing unit 116 in turn includes an amplifier 1161 and a gain control unit 1162. The operation of each of these units is described below.

The motion detection unit 111 receives the moving image InB from the image acquisition unit 102 as input. The moving image InB captures the speaker's appearance along a time series and shows the speaker's movements. The motion detection unit 111 performs image analysis on the moving image InB to specify the period of the speaker's behavior. A specific example of this operation is described below.

From each frame image of the moving image InB, the motion detection unit 111 specifies the positions of one or more predetermined parts among the parts of the speaker's body. A part whose position is specified in this way is hereinafter sometimes called a “target part”. The position may be specified, for example, by extracting a shape feature of the target part and using the position and orientation of that shape feature.

Reference is now made to FIG. 3. FIG. 3 is a diagram for explaining the method of specifying the period of the speaker's behavior, and schematically shows the parts that make up a speaker U11. In the example shown in this figure, a part U111 corresponding to the “head” and a part U112 corresponding to the “arm” are the target parts. The position and orientation of the part U112 corresponding to the “arm” can be specified as the position and orientation of a feature point P12 that indicates a shape feature of this part, such as a “joint” (for example, the “elbow” or “shoulder”) or the “hand”. Extracting a plurality of feature points from each part improves the accuracy with which the position and orientation of that part are specified. Similarly, the movement of the part U111 corresponding to the “head” can be specified as the position and orientation of a feature point P11 that indicates a shape feature such as the “eyes”, “nose”, or “ears”. In this way, the motion detection unit 111 specifies the position and orientation of each target part.

Next, the motion detection unit 111 monitors the movement of each target part, that is, the change in its position, over a predetermined number of frames and specifies the period of that movement for each target part. The number of frames monitored should therefore span a period long enough to specify the period of movement of the target part, and is determined in advance according to the type of target part. How many frames of monitoring are needed to specify the period of each part is investigated in advance, for example by experiment. The time width corresponding to this predetermined number of frames is hereinafter denoted h111. In addition, the “period of a target part” hereinafter means the period of movement of that target part.
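The source does not say how the period is computed from the monitored positions. One common approach, shown here purely as an assumed sketch, is to take the lag of the strongest autocorrelation peak of the position track after the initial lobe around lag 0. The function names, the one-dimensional position track, and the synthetic test data are all hypothetical.

```python
import math


def estimate_period_frames(positions):
    """Estimate the movement period, in frames, of a 1-D position track.

    Uses the unnormalized autocorrelation of the mean-removed track and
    returns the lag of its largest value after the first negative lag,
    which skips the trivial lobe around lag 0. Returns None if the
    track shows no oscillation.
    """
    n = len(positions)
    mean = sum(positions) / n
    x = [p - mean for p in positions]
    acf = [
        sum(x[i] * x[i + lag] for i in range(n - lag))
        for lag in range(n // 2 + 1)
    ]
    start = next((lag for lag, v in enumerate(acf) if v < 0), None)
    if start is None:
        return None
    return max(range(start, len(acf)), key=lambda lag: acf[lag])


def frames_to_seconds(lag_frames, frame_rate):
    """Convert a period in frames to seconds."""
    return lag_frames / frame_rate


# Hypothetical track: a feature point oscillating once every 10 frames,
# monitored for 60 frames (the monitoring window h111 in the text).
track = [math.sin(2 * math.pi * i / 10) for i in range(60)]
period_frames = estimate_period_frames(track)
```

The monitoring window must cover at least two full cycles for this estimate to find the true period, which is consistent with the requirement that the number of monitored frames be chosen per target part.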

The motion detection unit 111 applies predetermined statistical processing to the specified periods of the target parts and thereby specifies, as the period of the speaker's behavior, a single period whose amplitude changes along the time series based on the movements of those target parts. One example of this statistical processing is weighted averaging. Specifically, the period of each target part is given a weight determined in advance for that part, and the average of the weighted periods is specified as the period of the speaker's behavior. The weight for each part is decided in advance from the causal relationship, obtained for example by experiment, between the movement of that part and the change in amplitude of the speech signal f110 along the time series (that is, the period f20).
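The weighted averaging can be sketched as follows. The part names, the weight values, and the normalization by the sum of the weights are assumptions made for illustration; the source states only that each period is weighted and the weighted periods are averaged.

```python
def behavior_period(part_periods, part_weights):
    """Weighted average of per-part movement periods.

    part_periods: mapping from part name to its period in seconds.
    part_weights: mapping from part name to its predetermined weight.
    The result is normalized by the sum of the weights (an assumption).
    """
    total_weight = sum(part_weights[name] for name in part_periods)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    weighted_sum = sum(
        part_periods[name] * part_weights[name] for name in part_periods
    )
    return weighted_sum / total_weight


# Hypothetical values: the head is weighted three times as heavily as the arm.
periods = {"head": 1.0, "arm": 2.0}
weights = {"head": 3.0, "arm": 1.0}
period = behavior_period(periods, weights)
```

With these example values the result lies closer to the head's period, reflecting its larger weight.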

As another example of this statistical processing, estimation by a Bayesian network may be used. Specifically, a Bayesian network is created in advance based on the causal relationship between the motion of each part and the period of the speaker's behavior (and, by extension, the period f20). The motion detection unit 111 applies this Bayesian network with the identified period of each target part as input, and takes its output as the period of the speaker's behavior. By identifying the motion periods of multiple target parts and applying statistical processing in this way, the period of the speaker's behavior, which changes according to the combination of movements of multiple parts, can be identified more accurately.

The period of the speaker's behavior is not limited to information, such as a period in the ordinary sense, that does not explicitly specify positions on the time series. It may instead be information that makes those positions explicit, for example information indicating the change in amplitude along the time series during the monitoring period (time width h111). With such information, the start timing of the monitoring period and the start timing of the period of the speaker's behavior (the timing at which the amplitude starts to change) need not coincide. A specific example is the case where the speaker starts moving some time after starting to speak. Hereinafter, both kinds of information are referred to simply as the “period of the speaker's behavior”.

動作検出部１１１は、特定された話者の振舞いの周期をゲイン特定部１１２に出力する。 The motion detection unit 111 outputs the specified speaker behavior period to the gain specification unit 112.

また、動作検出部１１１は、所定のフレーム毎（例えば、１フレーム毎）に、「口」のように話者の発話の動作を示す部位の特徴点Ｐ２０ａ及びＰ２０ｂの位置や向きを特定する。動作検出部１１１は、特定された特徴点Ｐ２０ａ及びＰ２０ｂの位置や向きを示す情報を同期演算部１１５に逐次出力する。 In addition, the motion detection unit 111 identifies the positions and orientations of the feature points P20a and P20b of the part indicating the motion of the speaker's speech such as “mouth” for each predetermined frame (for example, every frame). The motion detection unit 111 sequentially outputs information indicating the positions and orientations of the identified feature points P20a and P20b to the synchronization calculation unit 115.

The gain specifying unit 112 receives information indicating the period of the speaker's behavior from the motion detection unit 111. Based on this period, the gain specifying unit 112 generates a control signal for changing the amplitude of the input signal InA along the time series, that is, a control signal indicating the change in gain applied to the input signal InA along the time series. As a specific example, the gain specifying unit 112 according to the present embodiment generates the control signal so that the input signal InA is attenuated in synchronization with the period of the speaker's behavior received from the motion detection unit 111. The control signal is thus generated so that the input signal InA is attenuated at the timings (phases) at which the amplitude becomes small within the period of the speaker's behavior.
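One possible shape for such a control signal is sketched below. The raised-cosine envelope, the gain limits g_min and g_max, and the sampling step are assumptions chosen for illustration; this description does not prescribe a particular waveform.

```python
import math


def gain_control_signal(period_s, duration_s, step_s, g_min=0.2, g_max=1.0):
    """Return a list of gain values sampled every step_s seconds.

    Assumption: the behavior amplitude is smallest at the start of each
    period, so the gain is lowest there and highest at mid-period.
    """
    n = int(round(duration_s / step_s))
    signal = []
    for i in range(n):
        t = i * step_s
        # Raised cosine: 0 at t = 0, 1 at t = period_s / 2, 0 at t = period_s.
        envelope = 0.5 * (1.0 - math.cos(2.0 * math.pi * t / period_s))
        signal.append(g_min + (g_max - g_min) * envelope)
    return signal


# Hypothetical values: 0.5 s behavior period, 1 s of control signal.
control = gain_control_signal(period_s=0.5, duration_s=1.0, step_s=0.125)
```

The gain dips to g_min once per period, which corresponds to attenuating the input signal InA at the phases where the amplitude of the speaker's behavior is small.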

For example, FIG. 4 is a diagram for explaining gain control based on the period of the speaker's behavior in this embodiment. In FIG. 4, f50 is a graph schematically showing the change in gain along the time series based on the generated control signal. The example of FIG. 4 shows the case where gain control based on the generated control signal is applied in synchronization with the identified period of the speaker's behavior; ideally, this gain control along the time series is synchronized with the period f20. Details of gain control based on this control signal will be described later. Hereinafter, the processing time of the series of operations of the gain specifying unit 112 described above is denoted h112. In this way, the gain control along the time series (that is, the change in gain along the time series) is specified by the gain specifying unit 112 as a control signal.

ゲイン特定部１１２は、生成された制御信号をゲイン制御部１１６２に出力する。 The gain specifying unit 112 outputs the generated control signal to the gain control unit 1162.

音声信号取得部１１３は、入力として集音部１０１から音声信号ｆ１１０を含む入力信号ＩｎＡを逐次受ける。音声信号取得部１１３は、この入力信号ＩｎＡを遅延処理部１１４に逐次出力する。 The audio signal acquisition unit 113 sequentially receives the input signal InA including the audio signal f110 from the sound collection unit 101 as an input. The audio signal acquisition unit 113 sequentially outputs the input signal InA to the delay processing unit 114.

また、音声信号取得部１１３は、入力信号ＩｎＡの振幅を監視し、少なくとも、この振幅を示す情報を同期演算部１１５に逐次出力する。 The audio signal acquisition unit 113 monitors the amplitude of the input signal InA and sequentially outputs at least information indicating the amplitude to the synchronization calculation unit 115.

遅延処理部１１４は、音声信号取得部１１３から入力信号ＩｎＡを逐次受ける。遅延処理部１１４は、この入力信号ＩｎＡが、あらかじめ決めた遅延量ｈ１１４だけ遅延するように遅延処理を施す。このときの遅延量ｈ１１４は、動作検出部１１１及びゲイン特定部１１２の処理時間を鑑みてあらかじめ決定しておく。具体的には、この遅延量ｈ１１４は、少なくとも、動作検出部１１１が話者の振舞いの周期を特定するために要するフレーム数分の時間幅ｈ１１１と、ゲイン特定部１１２の処理時間ｈ１１２とを加算した時間幅ｈ１１１＋ｈ１１２だけ設ける。即ち、ｈ１１４≧ｈ１１１＋ｈ１１２の条件を満たすように、遅延量ｈ１１４をあらかじめ決定しておく。これは、入力信号ＩｎＡの振幅を時系列に沿って調整するための制御信号が、入力信号ＩｎＡに対して、少なくともこの時間幅ｈ１１１＋ｈ１１２分だけ遅延して出力されるためである。 The delay processing unit 114 sequentially receives the input signal InA from the audio signal acquisition unit 113. The delay processing unit 114 performs a delay process so that the input signal InA is delayed by a predetermined delay amount h114. The delay amount h114 at this time is determined in advance in consideration of the processing time of the motion detection unit 111 and the gain specifying unit 112. Specifically, the delay amount h114 is obtained by adding at least the time width h111 for the number of frames required for the motion detection unit 111 to specify the period of the speaker's behavior and the processing time h112 of the gain specifying unit 112. The time width h111 + h112 is provided. That is, the delay amount h114 is determined in advance so as to satisfy the condition of h114 ≧ h111 + h112. This is because the control signal for adjusting the amplitude of the input signal InA along the time series is output with a delay of at least the time width h111 + h112 with respect to the input signal InA.
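The condition h114 ≧ h111 + h112 and the delay processing itself can be sketched as a simple sample-based delay line. The class name, the helper function, and the sample rate are hypothetical; an actual device could realize the delay in hardware or with a different buffer structure.

```python
import math
from collections import deque


def min_delay_samples(h111_s, h112_s, sample_rate_hz):
    """Smallest delay, in samples, that satisfies h114 >= h111 + h112."""
    return math.ceil((h111_s + h112_s) * sample_rate_hz)


class DelayLine:
    """Fixed delay: each pushed sample is returned delay_samples pushes later."""

    def __init__(self, delay_samples):
        if delay_samples < 0:
            raise ValueError("delay must be non-negative")
        # Pre-fill with silence so the first outputs are zeros.
        self._buffer = deque([0.0] * delay_samples)

    def push(self, sample):
        self._buffer.append(sample)
        return self._buffer.popleft()


# Hypothetical values: 3-sample delay applied to a short input sequence.
line = DelayLine(3)
delayed = [line.push(x) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]
```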

遅延処理部１１４は、遅延処理が施された入力信号ＩｎＡを増幅器１１６１に出力する。 The delay processing unit 114 outputs the input signal InA subjected to the delay processing to the amplifier 1161.

同期演算部１１５は、動作検出部１１１から、話者の発話の動作を示す部位の特徴点Ｐ２０ａ及びＰ２０ｂの位置や向き示す情報を逐次受ける。同期演算部１１５は、この情報を基に特徴点Ｐ２０ａ及びＰ２０ｂの位置や向きの変化を監視し、話者の発話の動作が開始されるタイミングｔＢを検出する。 The synchronization calculation unit 115 sequentially receives information indicating the positions and orientations of the feature points P20a and P20b of the part indicating the operation of the speaker's utterance from the motion detection unit 111. Based on this information, the synchronization calculation unit 115 monitors changes in the positions and orientations of the feature points P20a and P20b, and detects the timing tB when the speaker's speech operation is started.

また、同期演算部１１５は、音声信号取得部１１３から、入力信号ＩｎＡの振幅を示す情報を逐次受ける。入力信号ＩｎＡは、話者が発話していないタイミングでは、ノイズｆ１３０のみが含まれる。即ち、このタイミングにおける入力信号ＩｎＡの振幅は、ノイズｆ１３０の振幅となる。これに対して、話者が発話しているタイミングでは、このノイズｆ１３０の振幅に、音声信号ｆ１１０の振幅が重畳する。即ち、話者が発話を開始するタイミングで、入力信号ＩｎＡの振幅が増加する。そこで、同期演算部１１５は、この情報を基に入力信号ＩｎＡの振幅の変化を監視し、入力信号ＩｎＡの振幅が所定量以上変化（増加）するタイミングｔＡを検出する。 Further, the synchronization calculation unit 115 sequentially receives information indicating the amplitude of the input signal InA from the audio signal acquisition unit 113. The input signal InA includes only the noise f130 at the timing when the speaker is not speaking. That is, the amplitude of the input signal InA at this timing is the amplitude of the noise f130. On the other hand, at the timing when the speaker speaks, the amplitude of the audio signal f110 is superimposed on the amplitude of the noise f130. That is, the amplitude of the input signal InA increases at the timing when the speaker starts speaking. Therefore, the synchronization calculation unit 115 monitors the change in the amplitude of the input signal InA based on this information, and detects the timing tA at which the amplitude of the input signal InA changes (increases) by a predetermined amount or more.
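The detection of the timing tA can be illustrated with a minimal sketch. It assumes that the first few amplitude values contain only the noise f130 and serve as a baseline; the baseline length and the threshold are hypothetical parameters, not values given in this description.

```python
def detect_onset(amplitudes, min_increase, baseline_frames=3):
    """Return the first index at which the amplitude exceeds the noise
    baseline by at least min_increase, or None if there is no such index.

    Assumption: the first baseline_frames values contain noise only.
    """
    if len(amplitudes) <= baseline_frames:
        return None
    baseline = sum(amplitudes[:baseline_frames]) / baseline_frames
    for i in range(baseline_frames, len(amplitudes)):
        if amplitudes[i] - baseline >= min_increase:
            return i
    return None


# Hypothetical amplitude sequence: noise only, then speech superimposed.
t_a_index = detect_onset([1, 1, 1, 1, 5, 6], min_increase=3)
```

The returned index corresponds to the timing tA once it is converted to time using the interval at which the amplitude information is output.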

The input signal InA acquired through the sound collection unit 101 and the moving image InB acquired through the image acquisition unit 102 are not necessarily strictly synchronized. In other words, the input signal InA is not necessarily strictly synchronized with the period of the speaker's behavior identified from the moving image InB (or with the processing based on that period). Here, the processing based on the period of the speaker's behavior refers to the gain control along the time series performed by the gain control unit 1162 using the control signal generated by the gain specifying unit 112. Details of the gain control unit 1162 will be described later.

例えば、図５は、遅延処理が施された入力信号ＩｎＡと、話者の振舞いの周期に基づく処理との同期について説明するための図である。図５の横軸は時間ｔを示している。図５におけるｆ１０ａ〜ｆ１０ｃは、それぞれ、異なるタイミングにおける、ノイズｆ１３０に音声信号ｆ１１０が重畳した入力信号ＩｎＡを模擬的に示している。また、ｔ１０ａ〜ｔ１０ｃは、入力信号ｆ１０ａ〜ｆ１０ｃそれぞれの開始タイミング（即ち、入力信号ＩｎＡの振幅が所定量以上変化（増加）するタイミング）を示している。また、ｔ３０ａ〜ｔ３０ｃは、それぞれ、話者の発話の動作が開始されたタイミングを示している。また、ｆ５０ａ〜ｆ５０ｃは、それぞれ、異なるタイミングにおける、話者の振舞いの周期に基づく処理（即ち、時系列に沿ったゲイン制御）を模擬的に示している。また、ｔ５０ａ〜ｔ５０ｃは、話者の振舞いの周期に基づく処理ｆ５０ａ〜ｆ５０ｃそれぞれの開始タイミングを示している。 For example, FIG. 5 is a diagram for explaining the synchronization between the input signal InA subjected to the delay process and the process based on the period of the speaker's behavior. The horizontal axis in FIG. 5 indicates time t. F10a to f10c in FIG. 5 schematically illustrate the input signal InA in which the audio signal f110 is superimposed on the noise f130 at different timings. Further, t10a to t10c indicate start timings of the input signals f10a to f10c (that is, timings at which the amplitude of the input signal InA changes (increases) by a predetermined amount or more). Further, t30a to t30c indicate timings when the speaker's speech operation is started. Also, f50a to f50c schematically show processing (that is, gain control along a time series) based on the period of the speaker's behavior at different timings. Further, t50a to t50c indicate the start timings of the processes f50a to f50c based on the period of the speaker's behavior.

As described above, the start timings t50a to t50c of the processes f50a to f50c based on the period of the speaker's behavior are synchronized with the start timings t30a to t30c of the speech motion. This is because the start timings t50a to t50c and t30a to t30c are identified from the same moving image InB. In contrast, the input signal InA and the moving image InB are not necessarily strictly synchronized. In other words, the start timings t10a to t10c are not necessarily synchronized with the start timings t30a to t30c and t50a to t50c; either side may lag the other.

Therefore, in order to synchronize the input signal InA with the period of the speaker's behavior (strictly speaking, with the processing based on that period), the synchronization calculation unit 115 calculates the amount of deviation along the time series (that is, the time difference) Δt between them. Specifically, as shown in FIG. 5, the synchronization calculation unit 115 calculates the deviation amount Δt needed to synchronize the start timings t10a to t10c of the input signals f10a to f10c with the corresponding start timings t30a to t30c of the speaker's speech. Here, the start timings t10a to t10c are given by the detected timing tA, and the start timings t30a to t30c are given by the detected timing tB. That is, the synchronization calculation unit 115 calculates the deviation amount as the difference between the detected timings, Δt = tA − tB.

The synchronization calculation unit 115 synchronizes the input signal InA after the delay processing with the processing based on the period of the speaker's behavior by controlling either the input signal InA or that processing so that it is shifted (delayed) along the time series by Δt.

Specifically, when the input signal InA is delayed with respect to the moving image InB (Δt < 0), the input signal InA after the delay processing lags the processing based on the period of the speaker's behavior. In this case, the synchronization calculation unit 115 notifies the gain control unit 1162 of the deviation amount Δt. On receiving this deviation amount Δt, the gain control unit 1162 delays the start timing of its own processing by the deviation amount Δt.

When the moving image InB is delayed with respect to the input signal InA (Δt > 0), the processing based on the period of the speaker's behavior lags the input signal InA after the delay processing. In this case, the synchronization calculation unit 115 notifies the delay processing unit 114 of the deviation amount Δt. On receiving this notification, the delay processing unit 114 delays the input signal InA by a further Δt in addition to the delay amount h114. This notification must reach the delay processing unit 114 before its delay processing completes, so the delay amount h114 must be set sufficiently longer than the processing time needed to calculate Δt. In most cases, however, the calculation of Δt takes far less time than the delay amount h114. In this case, the synchronization calculation unit 115 may also notify the gain control unit 1162 of Δt = 0 so that the gain control unit 1162 starts processing immediately.

When the input signal InA and the moving image InB are synchronized (Δt = 0), the input signal InA after the delay processing is already synchronized with the processing based on the period of the speaker's behavior. In this case, the synchronization calculation unit 115 either skips the notification of the deviation amount Δt, or notifies one or both of the gain control unit 1162 and the delay processing unit 114 of Δt = 0.

In this way, the synchronization calculation unit 115 identifies the synchronization timing between the input signal InA and the period of the speaker's behavior based on the deviation amount Δt, and shifts (delays) either the input signal InA or the processing based on the period of the speaker's behavior along the time series to match this synchronization timing. As a result, the input signal InA and the processing based on the period of the speaker's behavior are synchronized.

In the above description, the synchronization calculation unit 115 identifies the synchronization timing between the input signal InA after the delay processing and the processing based on the period of the speaker's behavior indirectly, from the deviation amount Δt between the input signal InA and the moving image InB. Alternatively, the synchronization calculation unit 115 may identify this synchronization timing directly, from the detected timings tA and tB together with the time width h111, the processing time h112, and the delay amount h114 described above. In this case, for example, the timing corresponding to the input signal InA after the delay processing is tA + h114, and the timing at which the processing based on the period of the speaker's behavior can start is tB + h111 + h112. The synchronization timing between the input signal InA after the delay processing and the processing based on the period of the speaker's behavior can therefore be identified by calculating the deviation amount as Δt = (tA + h114) − (tB + h111 + h112).
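The two ways of calculating the deviation amount Δt can be compared in a short sketch. The function names and the millisecond values are hypothetical; only the two formulas are taken from the description above.

```python
def offset_indirect(t_a, t_b):
    """Deviation between the audio onset tA and the speech-motion onset tB."""
    return t_a - t_b


def offset_direct(t_a, t_b, h111, h112, h114):
    """Deviation between the delayed input signal (tA + h114) and the earliest
    start of the gain control (tB + h111 + h112)."""
    return (t_a + h114) - (t_b + h111 + h112)


# Hypothetical timings in milliseconds.
t_a, t_b = 1030, 1000
h111, h112 = 500, 20

# When h114 equals h111 + h112, both calculations give the same value.
same = offset_direct(t_a, t_b, h111, h112, h114=520)
# A longer delay h114 adds its surplus to the directly calculated deviation.
longer = offset_direct(t_a, t_b, h111, h112, h114=600)
```

The direct form additionally accounts for any margin by which h114 exceeds h111 + h112, which the indirect form does not capture.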

また、上記では、同期演算部１１５が、入力信号ＩｎＡの振幅が所定量以上変化（増加）したタイミングｔＡを特定していたが、これを音声信号取得部１１３が行ってもよい。この場合には、音声信号取得部１１３は、特定されたタイミングｔＡを示す情報を同期演算部１１５に通知すればよい。同様に、動作検出部１１１が、話者の発話の動作が開始されたタイミングｔＢを特定してもよい。この場合には、動作検出部１１１は、特定されたタイミングｔＢを示す情報を同期演算部１１５に通知すればよい。 In the above description, the synchronization calculation unit 115 specifies the timing tA at which the amplitude of the input signal InA has changed (increased) by a predetermined amount or more. However, the audio signal acquisition unit 113 may perform this. In this case, the audio signal acquisition unit 113 may notify the synchronization calculation unit 115 of information indicating the specified timing tA. Similarly, the motion detection unit 111 may identify the timing tB when the speaker's speech motion is started. In this case, the operation detection unit 111 may notify the synchronization calculation unit 115 of information indicating the specified timing tB.

The amplifier 1161 receives the input signal InA after the delay processing from the delay processing unit 114. The gain Gain of the amplifier 1161 is controlled by the gain control unit 1162. That is, under the control of the gain control unit 1162, the amplifier 1161 adjusts (amplifies or attenuates) the amplitude of the input signal InA. The amplifier 1161 outputs the amplitude-adjusted input signal InA to the voice recognition unit 12 (see FIG. 1B) located at the subsequent stage.

ゲイン制御部１１６２は、ゲイン特定部１１２から、時系列に沿った入力信号ＩｎＡに対するゲインの変化を示す制御信号を受ける。ゲイン制御部１１６２は、この制御信号に基づき、増幅器１１６１の利得Ｇａｉｎを時系列に沿って制御する（この動作が「ゲイン制御」に相当する）。 The gain control unit 1162 receives a control signal indicating a gain change with respect to the input signal InA in time series from the gain specifying unit 112. Based on this control signal, gain control section 1162 controls gain Gain of amplifier 1161 in time series (this operation corresponds to “gain control”).

また、ゲイン制御部１１６２は、同期演算部１１５からずれ量Δｔの通知を受ける。このずれ量Δｔの通知を受けた場合には、ゲイン制御部１１６２は、ずれ量Δｔだけ自身の処理、即ち、ゲイン制御の開始タイミングを遅延させる。これにより、ゲイン制御部１１６２による時系列に沿ったゲイン制御（即ち、話者の振舞いの周期に基づく処理）が、遅延処理が施された入力信号ＩｎＡに同期する。この態様について、図４を参照しながら以下に説明する。 Further, the gain control unit 1162 receives a notification of the shift amount Δt from the synchronization calculation unit 115. When receiving the notification of the shift amount Δt, the gain control unit 1162 delays its own process, that is, the gain control start timing by the shift amount Δt. Thereby, gain control along the time series by the gain control unit 1162 (that is, processing based on the period of the speaker's behavior) is synchronized with the input signal InA subjected to the delay processing. This aspect will be described below with reference to FIG.

As described above, the period of the speaker's behavior is an estimate of the period f20 with which the amplitude of the audio signal changes along the time series. Therefore, by synchronizing the gain control along the time series by the gain control unit 1162 with the input signal InA after the delay processing, the gain control along the time series (corresponding to the graph f50) is, ideally, synchronized with the period f20 as shown in FIG. 4.

この場合には、利得Ｇａｉｎが低下し信号が減衰されるタイミングｔ５０１、ｔ５０３と、入力信号ｆ１０（即ち、入力信号ＩｎＡ）中の音声信号ｆ１１０の振幅が小さくなるタイミングｔ２０１、ｔ２０３とが一致する。また、利得Ｇａｉｎが増加し信号が減衰されない（または、増幅される）タイミングｔ５０２と、入力信号ｆ１０中の音声信号ｆ１１０の振幅が大きくなるタイミングｔ２０２とが一致する。 In this case, timings t501 and t503 at which the gain Gain decreases and the signal is attenuated coincide with timings t201 and t203 at which the amplitude of the audio signal f110 in the input signal f10 (that is, the input signal InA) decreases. Also, the timing t502 at which the gain is increased and the signal is not attenuated (or amplified) coincides with the timing t202 at which the amplitude of the audio signal f110 in the input signal f10 increases.

これにより、音声信号ｆ１１０の振幅に対してノイズｆ１３０の振幅の比率が高いほど、入力信号ｆ１０の減衰量が大きくなる。即ち、ノイズｆ１３０が支配的なタイミングで入力信号ｆ１０がより減衰され、音声信号ｆ１１０が支配的なタイミングにおける入力信号ｆ１０の振幅が強調される。 Thereby, the higher the ratio of the amplitude of the noise f130 to the amplitude of the audio signal f110, the greater the attenuation of the input signal f10. That is, the input signal f10 is further attenuated at the timing when the noise f130 is dominant, and the amplitude of the input signal f10 is emphasized at the timing when the audio signal f110 is dominant.
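The effect of the time-varying gain on the delayed input signal can be sketched as a per-sample multiplication. Two points are assumptions made for illustration: the control signal repeats cyclically, and unity gain is applied before the gain control starts. The start_index parameter stands in for a start timing shifted by the deviation amount Δt.

```python
def apply_gain(delayed_input, control, start_index=0):
    """Multiply each delayed input sample by the gain for its time position.

    Assumptions: the control signal repeats cyclically, and samples before
    start_index pass through with unity gain.
    """
    if not control:
        raise ValueError("control signal must not be empty")
    output = []
    for i, sample in enumerate(delayed_input):
        k = i - start_index
        gain = control[k % len(control)] if k >= 0 else 1.0
        output.append(gain * sample)
    return output


# Hypothetical two-step control signal: attenuate, then pass through.
adjusted = apply_gain([1.0, 1.0, 1.0, 1.0], [0.5, 1.0], start_index=1)
```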

If the positions and orientations of the feature points P20a and P20b of the part indicating the speaker's speech motion have not changed for a predetermined time or longer, the synchronization calculation unit 115 may determine that the input signal InA does not contain the audio signal f110. In this case, the synchronization calculation unit 115 may, for example, temporarily stop the gain control along the time series by the gain control unit 1162, and restart the processing by the gain control unit 1162 when the positions and orientations of the feature points P20a and P20b change again. At that time, the motion detection unit 111 may identify the period of the speaker's behavior anew, and the gain specifying unit 112 may regenerate the control signal based on this period.

また、遅延処理が施された入力信号ＩｎＡと、話者の振舞いの周期に基づく処理とを厳密に同期させる必要が無い場合や、入力信号ＩｎＡと動画像ＩｎＢとの間の同期が保障されている場合には、同期演算部１１５を設けない構成としてもよい。この場合には、話者の発話の動作が開始されるタイミングｔＢが、ノイズｆ１３０に音声信号ｆ１１０が重畳し始めるタイミング、即ち、入力信号ＩｎＡの振幅が所定量以上変化（増加）するタイミングｔＡと等しいものとして処理することとなる。このような構成とする場合には、ゲイン制御部１１６２は、同期演算部１１５からの通知を待たず、直ちに処理を開始すればよい。 In addition, when it is not necessary to strictly synchronize the input signal InA subjected to the delay processing and the processing based on the period of the speaker's behavior, the synchronization between the input signal InA and the moving image InB is guaranteed. In such a case, the synchronization calculation unit 115 may be omitted. In this case, the timing tB at which the speaker's speech operation is started is the timing at which the audio signal f110 starts to be superimposed on the noise f130, that is, the timing tA at which the amplitude of the input signal InA changes (increases) by a predetermined amount or more. It will be treated as equal. In the case of such a configuration, the gain control unit 1162 may start processing immediately without waiting for a notification from the synchronization calculation unit 115.

The above description covers processing for a single speaker, but the device may also be operated so that, among a plurality of speakers, one speaker is selectively taken as the processing target. In this case, the motion detection unit 111 may identify the target speaker by applying a technique for identifying individuals, such as face recognition.

なお、対象となる話者の選択基準については、所定の処理毎に操作者（例えば、話者）が指定できるようにしてもよいし、対象の話者をあらかじめ決めておいてもよい。また、識別された話者ごとに、各部位の周期に適用する統計処理のパラメタ、例えば、重み付け平均処理における重みのつけ方や、適用するベイジアンネットワークを切り替えてもよい。これらの情報は、あらかじめ作成しておき、動作検出部１１１が適宜読み出せる場所に記憶させておけばよい。 Note that the selection criterion for the target speaker may be specified by an operator (for example, a speaker) for each predetermined process, or the target speaker may be determined in advance. In addition, for each identified speaker, a parameter of statistical processing applied to the period of each part, for example, how to apply a weight in weighted average processing, or applied Bayesian network may be switched. These pieces of information may be created in advance and stored in a place where the motion detection unit 111 can appropriately read.

With this operation, even when, for example, a plurality of speakers are speaking at the same time, the period of the speaker's behavior is identified according to the speech of the target speaker, and the amplitude of the input signal InA is adjusted along the time series based on this period. The target speaker's speech is therefore more readily emphasized than that of the other speakers, so that even when multiple speakers are speaking, the accuracy of voice recognition that takes the target speaker's speech as input can be improved.

[A series of processes of the audio processing unit 11]
Next, a series of processes of the audio processing unit 11 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing a series of operations of the audio processing unit 11.

(Step S11)
The motion detection unit 111 receives the moving image InB from the image acquisition unit 102 as an input. The moving image InB captures the appearance of the speaker along the time series and shows the speaker's motions.

動作検出部１１１は、所定のフレーム毎（例えば、１フレーム毎）に、「口」のように話者の発話の動作を示す部位の特徴点Ｐ２０ａ及びＰ２０ｂの位置や向きを特定する。動作検出部１１１は、特定された特徴点Ｐ２０ａ及びＰ２０ｂの位置や向きを示す情報を同期演算部１１５に逐次出力する。 The motion detection unit 111 identifies the positions and orientations of the feature points P20a and P20b of the part indicating the speaker's utterance motion, such as “mouth”, for each predetermined frame (for example, every frame). The motion detection unit 111 sequentially outputs information indicating the positions and orientations of the identified feature points P20a and P20b to the synchronization calculation unit 115.

また、動作検出部１１１は、動画像ＩｎＢを構成する各フレーム画像から、話者の身体を構成する各部位のうち、あらかじめ決められた１または複数の部位の位置を特定する。 Further, the motion detection unit 111 identifies positions of one or more predetermined parts among the parts constituting the body of the speaker from the frame images constituting the moving image InB.

(Step S12)
Next, the motion detection unit 111 monitors the motion of each target part, that is, the change in the position of each target part, for a predetermined number of frames, and identifies the period of that motion for each target part.

The motion detection unit 111 applies predetermined statistical processing to the identified periods of the target parts, and thereby identifies, as the period of the speaker's behavior, a single period whose amplitude changes along the time series based on the motions of these target parts. Examples of this statistical processing include weighted averaging and estimation using a Bayesian network.

動作検出部１１１は、特定された話者の振舞いの周期をゲイン特定部１１２に出力する。なお、この話者の振舞いの周期の特定に係る動作が、「検出ステップ」に相当する。 The motion detection unit 111 outputs the specified speaker behavior period to the gain specification unit 112. The operation related to the specification of the speaker's behavior period corresponds to the “detection step”.

The gain specifying unit 112 receives information indicating the period of the speaker's behavior from the motion detection unit 111. Based on this period, the gain specifying unit 112 generates a control signal for changing the amplitude of the input signal InA along the time series, that is, a control signal indicating the change in gain applied to the input signal InA along the time series. As a specific example, the gain specifying unit 112 according to the present embodiment generates the control signal so that the input signal InA is attenuated in synchronization with the period of the speaker's behavior received from the motion detection unit 111. The control signal is thus generated so that the input signal InA is attenuated at the timings (phases) at which the amplitude becomes small within the period of the speaker's behavior.

The gain specifying unit 112 outputs the generated control signal to the gain control unit 1162. The operation of generating this control signal corresponds to the “specifying step”.

(Step S21)
The audio signal acquisition unit 113 sequentially receives, as input from the sound collection unit 101, the input signal InA including the audio signal f110. The audio signal acquisition unit 113 sequentially outputs this input signal InA to the delay processing unit 114.

(Step S22)
The delay processing unit 114 sequentially receives the input signal InA from the audio signal acquisition unit 113. The delay processing unit 114 applies a delay process so that the input signal InA is delayed by a predetermined delay amount h114. The delay amount h114 is determined in advance in consideration of the processing time of the motion detection unit 111 and the gain specifying unit 112.

(Step S30)
The synchronization calculation unit 115 sequentially receives, from the motion detection unit 111, information indicating the positions and orientations of the feature points P20a and P20b of the part that indicates the speaker's utterance motion. Based on this information, the synchronization calculation unit 115 monitors changes in the positions and orientations of the feature points P20a and P20b, and detects the timing tB at which the speaker's utterance motion starts.

The synchronization calculation unit 115 also sequentially receives information indicating the amplitude of the input signal InA from the audio signal acquisition unit 113. Based on this information, the synchronization calculation unit 115 monitors the change in the amplitude of the input signal InA, and detects the timing tA at which the amplitude of the input signal InA changes (increases) by a predetermined amount or more.

The synchronization calculation unit 115 calculates the shift amount Δt = tA − tB as the difference between the detected timings tA and tB. This shift amount Δt represents the shift (time difference) along the time series between the input signal InA and the processing based on the period of the speaker's behavior.
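The detection of tA and tB and the calculation of Δt can be sketched as follows. This assumes that both streams have been resampled to a common sampling interval `dt` and that the motion of the feature points P20a and P20b has been reduced to a single scalar per frame (called `mouth_opening` here); both are assumptions for illustration, as are the function names and thresholds.

```python
def first_rise(samples, threshold, dt):
    """Return the time of the first sample that rises by at least
    `threshold` over the previous sample, or None if there is none."""
    for i in range(1, len(samples)):
        if samples[i] - samples[i - 1] >= threshold:
            return i * dt
    return None


def shift_amount(audio_envelope, mouth_opening, audio_thr, motion_thr, dt):
    """Return the shift amount tA - tB, or None if either onset is not found."""
    t_a = first_rise(audio_envelope, audio_thr, dt)   # onset of the input signal
    t_b = first_rise(mouth_opening, motion_thr, dt)   # onset of the utterance motion
    if t_a is None or t_b is None:
        return None
    return t_a - t_b


# Illustrative data sampled every 0.1 s (hypothetical values).
audio = [0.0, 0.0, 0.0, 0.5, 0.6]
mouth = [0.0, 0.3, 0.4, 0.4, 0.4]
delta_t = shift_amount(audio, mouth, audio_thr=0.3, motion_thr=0.2, dt=0.1)
```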

(Step S41)
The synchronization calculation unit 115 performs control so that either the input signal InA or the processing based on the period of the speaker's behavior is shifted (delayed) along the time series by this Δt, thereby synchronizing the delayed input signal InA with the processing based on the period of the speaker's behavior.

Specifically, when the input signal InA is delayed with respect to the moving image InB (Δt < 0), the synchronization calculation unit 115 notifies the gain control unit 1162 of the shift amount Δt. On receiving this shift amount Δt, the gain control unit 1162 delays the start timing of its own processing by the shift amount Δt.

When the moving image InB is delayed with respect to the input signal InA (Δt > 0), the synchronization calculation unit 115 notifies the delay processing unit 114 of the shift amount Δt. On receiving this notification, the delay processing unit 114 delays the input signal InA by a further Δt in addition to the delay amount h114. In this case, the synchronization calculation unit 115 may notify the gain control unit 1162 of Δt = 0 so that the gain control unit 1162 starts processing immediately.

When the input signal InA and the moving image InB are synchronized (Δt = 0), the delayed input signal InA is already synchronized with the processing based on the period of the speaker's behavior. In this case, therefore, the synchronization calculation unit 115 either does not perform the notification of the shift amount Δt, or notifies one or both of the gain control unit 1162 and the delay processing unit 114 of Δt = 0.

(Step S42)
The gain control unit 1162 receives, from the gain specifying unit 112, the control signal indicating the change in the gain applied to the input signal InA along the time series. Based on this control signal, the gain control unit 1162 controls the gain Gain of the amplifier 1161 along the time series (this operation corresponds to “gain control”).

The gain control unit 1162 also receives the notification of the shift amount Δt from the synchronization calculation unit 115. On receiving this notification, the gain control unit 1162 delays its own processing, that is, the start timing of the gain control, by the shift amount Δt.

The amplifier 1161 receives the delayed input signal InA from the delay processing unit 114. The gain Gain of the amplifier 1161 is controlled by the gain control unit 1162. That is, the amplifier 1161 adjusts (amplifies or attenuates) the amplitude of the input signal InA under the control of the gain control unit 1162. The amplifier 1161 outputs the input signal InA whose amplitude has been adjusted to the voice recognition unit 12 (see FIG. 1B) located at the subsequent stage. The operation of adjusting the amplitude of the input signal InA along the time series corresponds to the “signal processing step”.
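The combination of the delay process and the time-varying gain can be illustrated with the following sketch. It works on lists of samples; the function names `delay` and `apply_gain`, the zero padding, and the example values are assumptions for illustration and are not taken from the specification.

```python
def delay(samples, n):
    """Delay a signal by n samples by prepending zeros, keeping its length."""
    if n <= 0:
        return list(samples)
    return ([0.0] * n + list(samples))[:len(samples)]


def apply_gain(samples, gains):
    """Multiply each sample by the gain specified for the same time index."""
    return [s * g for s, g in zip(samples, gains)]


# Hypothetical input signal, delayed by one sample and then gain-adjusted.
input_signal = [1.0, 2.0, 3.0, 4.0]
delayed = delay(input_signal, 1)
adjusted = apply_gain(delayed, [1.0, 0.5, 1.0, 0.5])
```

In this sketch the delay stands in for the delay amount h114 (plus any additional Δt), and the list of gains stands in for the control signal that drives the gain of the amplifier.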

The series of operations described above can be implemented as a program that causes the CPU of a device operating the speech processing unit 11 (or a voice recognition system including the speech processing unit 11), for example the main body M11 in FIG. 1A, to perform these functions. This program may be configured to be executed via an OS (Operating System) installed on that device (for example, the main body M11). The location where the program is stored is not limited, as long as the device operating the speech processing unit 11 can read it. For example, the program may be stored on a recording medium connected from outside the device. In that case, connecting the recording medium to the device may allow the CPU of the device to execute the program.

[Modification]
Next, the operation of the speech processing unit 11 according to a modification will be described. The speech recognition processing and syntax analysis executed by the voice recognition unit 12 may include a process that, when part of the speech cannot be recognized and the corresponding characters are missing, estimates the unrecognized part from the speech (characters) that was recognized. Depending on the characteristics of the speech recognition processing and syntax analysis (for example, the algorithm), the ease of estimating the missing part may differ according to which part of a set of characters in a predetermined unit, such as a sentence, clause, or phrase, was recognized. As a specific example, the recognition rate may be higher when the first half of the set of characters in the predetermined unit is recognized than when the middle part is recognized. The speech processing unit 11 according to the modification estimates, based on the period of the speaker's behavior, the position along the time series of this set of characters in the predetermined unit (or of the breaks between sentences, clauses, phrases, and so on), and adjusts the part of the input signal to be emphasized, thereby improving the recognition rate of the speech recognition processing.

To this end, the speech processing unit 11 according to the modification regards a predetermined portion (for example, 1/2 period or one period) of the period of fluctuation of the audio signal, that is, of the specified period of the speaker's behavior, as being synchronized with the position along the time series of a set of characters in a predetermined unit. On that basis, the speech processing unit 11 controls the gain so that a predetermined position (timing) along the time series within this set of characters is emphasized, according to the characteristics of the speech recognition processing and syntax analysis.

An example will be described below with reference to FIGS. 7A and 7B. FIG. 7A is a diagram for describing gain control based on the period of the speaker's behavior in one aspect of the speech processing unit according to the modification. FIG. 7B shows an aspect of the speech processing unit according to the modification that differs from FIG. 7A, and is a diagram for describing gain control based on the period of the speaker's behavior in that aspect. The examples shown in FIGS. 7A and 7B illustrate the case where the first half of a set of characters in a predetermined unit is emphasized. The following description focuses on the gain specifying unit 112, whose processing differs from that of the embodiment described above; the other components are the same as in that embodiment, and their detailed description is omitted.

First, the example shown in FIG. 7A will be described. In FIG. 7A, f10 corresponds to the input signal f10 shown in FIG. 4 (that is, the input signal InA), and f20 corresponds to the period f20 shown in FIG. 4. The graph f50 schematically shows a change in gain along the time series and corresponds to f50 in FIG. 4. The graph f51 in FIG. 7A schematically shows the change in gain along the time series in this example.

The gain specifying unit 112 according to this modification generates the control signal indicating the change in gain so that its phase is shifted along the time series by a predetermined time width h51 with respect to the period of the speaker's behavior received from the motion detection unit 111. In the example shown in FIG. 7A, the phase of the graph f51 is shifted forward (earlier in the time series) by the time width h51 compared with the graph f50. The time width h51 may be a predetermined constant value, or a value relative to the length of the period of the speaker's behavior (for example, the length of one period). Which of these to use may be decided according to, for example, the characteristics of the speech recognition processing and syntax analysis. In this way, the gain specifying unit 112 adjusts the phase of the change in gain along the time series so that the gain is amplified at a desired phase within the period of the speaker's behavior.
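The phase shift by the time width h51 can be sketched as below, covering both a constant shift and a shift expressed relative to the period. The raised-cosine gain shape and all names (`shifted_gain`, `shift`, `relative`, `min_gain`) are assumptions for illustration only.

```python
import math


def shifted_gain(t, period, peak_time, shift, relative=False, min_gain=0.2):
    """Gain at time t when the gain curve is moved earlier by a time width.

    shift:    the time width (corresponding to h51); seconds if relative is
              False, or a fraction of the period if relative is True
    The raised-cosine shape is an assumed choice, not from the specification.
    """
    h = shift * period if relative else shift
    # Moving the curve forward (earlier) by h puts its peak at peak_time - h.
    phase = 2.0 * math.pi * (t - (peak_time - h)) / period
    return min_gain + (1.0 - min_gain) * 0.5 * (1.0 + math.cos(phase))


# With a 2 s period and the unshifted peak at t = 1.0 s, a shift of 0.5 s
# (or, equivalently, 0.25 of the period) moves the gain peak to t = 0.5 s.
g_const = shifted_gain(0.5, period=2.0, peak_time=1.0, shift=0.5)
g_rel = shifted_gain(0.5, period=2.0, peak_time=1.0, shift=0.25, relative=True)
```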

As a result, as shown in FIG. 7A, the peak position of the change in gain along the time series indicated by the graph f51 shifts forward by the time width h51. Consequently, within the portion of the audio signal f110 between the timings t201 and t203, for example, the earlier part of the signal remains without being attenuated. That is, the earlier part of the interval between the timings t201 and t203 is emphasized, which makes it possible to improve the recognition rate of the set of characters in the predetermined unit corresponding to this interval.

In the above description, the gain specifying unit 112 adjusts the phase of the control signal indicating the change in gain. Alternatively, for example, the synchronization calculation unit 115 may take the time width h51 into account and adjust the start timing of the gain control performed along the time series by the gain control unit 1162.

Next, the example shown in FIG. 7B will be described. In the example shown in FIG. 7A, the gain is controlled so that a predetermined position (timing) along the time series within a set of characters in a predetermined unit is emphasized by adjusting the phase of the control signal indicating the change in gain. In the example shown in FIG. 7B, instead of adjusting the phase, the peak position of the change in gain along the time series is shifted so that the desired part is emphasized.

In FIG. 7B, f10 corresponds to the input signal f10 shown in FIGS. 4 and 7A (that is, the input signal InA), and f20 corresponds to the period f20 shown in FIGS. 4 and 7A. The graph f50 schematically shows a change in gain along the time series and corresponds to f50 in FIGS. 4 and 7A. The graph f52 in FIG. 7B schematically shows the change in gain along the time series in this example.

The gain specifying unit 112 according to this modification generates the control signal indicating the change in gain so that its peak position is shifted along the time series by a predetermined time width h52 with respect to the period of the speaker's behavior received from the motion detection unit 111. In the example shown in FIG. 7B, the peak position of the graph f52 is shifted forward (earlier in the time series) by the time width h52 compared with the graph f50. The time width h52 may be a predetermined constant value, or a value relative to the length of the period of the speaker's behavior (for example, the length of one period). Which of these to use may be decided according to, for example, the characteristics of the speech recognition processing and syntax analysis.

As a result, as shown in FIG. 7B, the peak position of the change in gain along the time series indicated by the graph f52 shifts forward by the time width h52. Consequently, within the portion of the audio signal f110 between the timings t201 and t203, for example, the earlier part of the signal remains without being attenuated. That is, the earlier part of the interval between the timings t201 and t203 is emphasized, which makes it possible to improve the recognition rate of the set of characters in the predetermined unit corresponding to this interval. In addition, in the example shown in FIG. 7B, the timings at which the amplitude of the audio signal f110 is lowest (for example, t201 and t203) coincide with the timings at which the gain Gain is lowest (for example, t501 and t503). It is therefore possible to attenuate the signal at the timings at which the ratio of the amplitude of the noise f130 to the amplitude of the audio signal f110 is largest, that is, the timings at which the noise f130 is dominant, while emphasizing the other parts.

The embodiment and modification described above attenuate the signal at other timings so that the input signal InA is emphasized at a desired timing based on the period of the speaker's behavior. However, the method is not limited to this, as long as a difference in signal amplitude can be created between the part of the input signal InA to be emphasized and the other parts. For example, the entire input signal InA may be amplified first, and the amplitude of the signal in the other parts may then be attenuated. Alternatively, control may be performed so that the signal in the part to be emphasized is amplified. Such control can be realized, for example, by having the gain specifying unit 112 generate the control signal with the gain Gain at each timing adjusted to suit the desired control.
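The two alternatives, attenuating the other parts or amplifying the emphasized part, differ only in how the gain at each timing is chosen. The sketch below makes this concrete; the function name `make_gain`, the `emphasis` weight in the range 0 to 1, and the `depth` parameter are hypothetical and serve only as an illustration.

```python
def make_gain(emphasis, mode, depth=0.5):
    """Return the gain for one timing.

    emphasis: 1.0 at a timing to be emphasized, 0.0 at a timing to be suppressed
    mode:     "attenuate" keeps emphasized parts at unity gain and lowers the rest;
              "amplify" keeps the rest at unity gain and raises emphasized parts
    depth:    size of the gain difference between the two kinds of timing
    """
    if mode == "attenuate":
        return 1.0 - depth * (1.0 - emphasis)
    if mode == "amplify":
        return 1.0 + depth * emphasis
    raise ValueError("mode must be 'attenuate' or 'amplify'")


# Both modes create the same amplitude difference between the two timings.
attenuate_gains = [make_gain(e, "attenuate") for e in (1.0, 0.0)]
amplify_gains = [make_gain(e, "amplify") for e in (1.0, 0.0)]
```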

The preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field to which the present invention pertains can conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to belong to the technical scope of the present invention.

DESCRIPTION OF SYMBOLS
101 Sound collection unit
102 Image acquisition unit
11 Speech processing unit
111 Motion detection unit
112 Gain specifying unit
113 Audio signal acquisition unit
114 Delay processing unit
115 Synchronization calculation unit
116 Signal processing unit
1161 Amplifier
1162 Gain control unit
12 Voice recognition unit
13 Operation control unit

Claims

A detector that detects the period of the speaker's behavior;
Based on the detected period of the behavior, a specifying unit that specifies a change in gain along a time series;
A signal processing unit that adjusts the amplitude of an input signal including an audio signal along a time series based on the identified change in the gain;
An audio processing apparatus comprising:

A synchronization calculation unit for specifying a synchronization timing between the speaker's behavior period and the input signal;
The audio processing apparatus according to claim 1, wherein the signal processing unit determines a timing for adjusting an amplitude of the input signal based on the identified synchronization timing.

The audio processing apparatus according to claim 2, wherein the synchronization calculation unit calculates a shift amount along the time series between the period of the speaker's behavior and the input signal, based on information indicating the speaker's utterance motion detected along the time series and on the change in the amplitude of the input signal along the time series, and specifies the synchronization timing based on the shift amount.

The speech processing apparatus, wherein the synchronization calculation unit uses, as the shift amount, the difference between the timing at which the amplitude of the input signal increases by a predetermined amount or more and the timing at which the utterance motion starts.

The speech processing apparatus according to claim 1, wherein the detection unit detects a period of the behavior based on image information indicating a speaker's operation.

The speech processing apparatus according to claim 5, wherein the detection unit detects, from the image information, a period of movement of a predetermined part among the parts of the speaker, and specifies the period of the behavior based on the detected period.

The speech processing apparatus according to claim 6, wherein the detection unit detects the periods of movement of a plurality of parts as the movement of the predetermined part, and specifies the period of the behavior by applying a predetermined statistical process to the periods of the plurality of parts.

The speech processing apparatus, wherein, as the statistical process, the detection unit weights the period of each of the plurality of parts and specifies the period of the behavior by taking the average of the weighted periods.

The speech processing apparatus according to claim 7, wherein, as the statistical process, the detection unit specifies the period of the behavior from the detected periods of the plurality of parts, based on a Bayesian network created in advance from the causal relationship between the period of each of the plurality of parts and the period of the behavior.

The speech processing apparatus according to claim 1, wherein the specifying unit specifies the change in the gain so as to synchronize with the detected period of the behavior.

The speech processing apparatus, wherein the specifying unit specifies the change in the gain so that the peak position of the change in the gain is shifted by a predetermined time width from the peak position of the period of the behavior.

The voice processing apparatus according to claim 1, wherein the specifying unit specifies the change in the gain so that its phase is shifted by a predetermined time width with respect to the period of the behavior, such that the gain is amplified at a desired phase in the detected period of the behavior.

A sound collection unit for collecting an input signal including an audio signal;
An image acquisition unit for acquiring the motion of the speaker as a moving image;
A detection unit for detecting a period of a speaker's behavior based on the moving image;
Based on the detected period of the behavior, a specifying unit that specifies a change in gain along a time series;
A signal processing unit that adjusts the amplitude of an input signal including an audio signal along a time series based on the identified change in the gain;
A speech recognition unit for performing speech recognition based on the input signal whose amplitude is adjusted;
A speech recognition device comprising:

A detection step for detecting the period of the speaker's behavior;
A specifying step of specifying a change in gain along a time series based on the detected period of the behavior;
A signal processing step of adjusting the amplitude of the input signal including the audio signal along a time series based on the specified change in the gain;
A speech processing method comprising:

A detection process that detects the period of the speaker's behavior;
A specifying process of specifying a change in gain along a time series based on the detected period of the behavior;
Signal processing for adjusting the amplitude of the input signal including the audio signal along a time series based on the specified change in the gain;
A voice processing program characterized by executing