JP2020046500A

JP2020046500A - Information processing apparatus, information processing method and information processing program

Info

Publication number: JP2020046500A
Application number: JP2018173676A
Authority: JP
Inventors: 信瑩何; xin ying He
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2018-09-18
Filing date: 2018-09-18
Publication date: 2020-03-26
Also published as: WO2020059245A1

Abstract

To provide an information processing apparatus, an information processing method, and an information processing program capable of acquiring information representing a performance of a musical instrument from an image. SOLUTION: An information processing apparatus 100 comprises: a position recognition unit that recognizes the position of a part of a player's body from an input image; a musical instrument recognition unit that recognizes the musical instrument from the input image; and a performance information generation unit that generates, based on the relationship between the position of the part and the musical instrument, performance information representing the player's performance of the musical instrument. SELECTED DRAWING: Figure 2

Description

The present technology relates to an information processing apparatus, an information processing method, and an information processing program.

Systems that convert a human performance such as a dance into data have been proposed (Patent Document 1).

JP 2016-24740 A

The system described in Patent Document 1 generates score data that records a performer's dance motions in three-dimensional space. With this kind of method for converting motion into data, the information and processing required differ depending on the type of motion, so it is difficult to apply the method as-is to other motions, for example, playing a musical instrument.

The present technology has been made in view of this point, and its object is to provide an information processing apparatus, an information processing method, and an information processing program capable of acquiring information indicating a performance of a musical instrument from an image.

To solve the above problem, a first technique is an information processing apparatus including: a position recognition unit that recognizes the position of a part of a player's body from an input image; a musical instrument recognition unit that recognizes a musical instrument from the input image; and a performance information generation unit that generates, based on the relationship between the position of the part and the musical instrument, performance information indicating the player's performance of the musical instrument.

A second technique is an information processing method that recognizes the position of a part of a player's body from an input image, recognizes a musical instrument from the input image, and generates, based on the relationship between the position of the part and the musical instrument, performance information indicating the player's performance of the musical instrument.

Further, a third technique is an information processing program that causes a computer to execute an information processing method that recognizes the position of a part of a player's body from an input image, recognizes a musical instrument from the input image, and generates, based on the relationship between the position of the part and the musical instrument, performance information indicating the player's performance of the musical instrument.

A block diagram illustrating the configuration of the terminal device.
A block diagram illustrating the configuration of the information processing apparatus according to the first embodiment.
A diagram illustrating an example of an input image in the first embodiment.
An explanatory diagram of recognition of the player's hand.
An explanatory diagram of recognition of the player's hand.
An explanatory diagram of recognition of chords.
A flowchart illustrating the process of generating partial performance information in the first embodiment.
A flowchart illustrating the process of generating composite performance information in the first embodiment.
A diagram illustrating an example of an input image in the second embodiment.
A block diagram illustrating the configuration of the information processing apparatus according to the second embodiment.
A flowchart illustrating the process of generating partial performance information in the second embodiment.
A flowchart illustrating the process of generating composite performance information in the second embodiment.
A diagram illustrating an example of an input image in the third embodiment.
A block diagram illustrating the configuration of the information processing apparatus according to the third embodiment.
A flowchart illustrating the process of generating partial performance information in the third embodiment.
A flowchart illustrating the process of generating composite performance information in the third embodiment.

Hereinafter, embodiments of the present technology will be described with reference to the drawings. The description will be given in the following order.
<1. First Embodiment>
[1-1. Configuration of Terminal Device]
[1-2. Configuration of Information Processing Apparatus]
[1-3. Processing by Information Processing Apparatus]
[1-3-1. Generation of Partial Performance Information]
[1-3-2. Generation of Composite Performance Information]
<2. Second Embodiment>
[2-1. Configuration of Information Processing Apparatus]
[2-2. Processing by Information Processing Apparatus]
<3. Third Embodiment>
[3-1. Configuration of Information Processing Apparatus]
[3-2. Processing by Information Processing Apparatus]
<4. Modifications>

<1. First Embodiment>
[1-1. Configuration of Terminal Device]
First, the terminal device 10 will be described with reference to FIG. 1. The terminal device 10 includes a control unit 11, a storage unit 12, a communication unit 13, a display unit 14, an input unit 15, a camera unit 16, and an information processing apparatus 100.

The control unit 11 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like. The ROM stores programs and the like that are read and executed by the CPU. The RAM is used as the CPU's work memory. The CPU controls the entire terminal device 10 by executing various processes according to the programs stored in the ROM and issuing commands.

The storage unit 12 is a large-capacity storage medium using, for example, a hard disk or semiconductor memory. The storage unit 12 can store still images and video captured by the camera unit 16, performance information and score information generated by the information processing apparatus 100, and content, applications, and the like.

The communication unit 13 is a communication module, a communication connector, or the like for communicating with other devices, the Internet, and so on. Communication by the communication unit 13 may use any method: wired communication such as USB, or wireless communication such as a wireless LAN (for example, Wi-Fi), Bluetooth (registered trademark), ZigBee, 4G (fourth-generation mobile communication system), or broadband.

The display unit 14 is a display device configured with, for example, an LCD (Liquid Crystal Display), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel. The display unit 14 displays the user interface of the terminal device 10, interfaces presented to the user for processing by the information processing apparatus 100, and the like.

The input unit 15 accepts the user's operation input to the terminal device 10. When the user makes an input to the input unit 15, an input signal corresponding to that input is generated and output to the control unit 11. The control unit 11 then performs arithmetic processing corresponding to the input signal and controls the terminal device 10. Examples of the input unit 15 include a touch panel integrated with the display unit 14; a pointing device, called a trackpad or touchpad, that is operated by tracing a finger over a flat sensor not integrated with the display unit 14; a keyboard; and a mouse.

The camera unit 16 includes an image sensor, an image processing LSI, and the like, and provides a camera function capable of capturing still images and video. Still images or video captured by the camera unit 16 can be used for the performance information generation processing in the information processing apparatus 100. The camera unit 16 is not an essential component of the terminal device 10.

The information processing apparatus 100 performs the performance information generation processing according to the present technology. Details of the information processing apparatus 100 will be described later.

The terminal device 10 is configured as described above. Specific examples of the terminal device 10 include a personal computer, a notebook computer, a tablet terminal, a smartphone, an electronic keyboard, a synthesizer, and a DAW (Digital Audio Workstation).

[1-2. Configuration of Information Processing Apparatus]
Next, the configuration of the information processing apparatus 100 will be described with reference to FIG. 2. The information processing apparatus 100 includes an image input unit 101, a position recognition unit 102, a shape recognition unit 103, a motion recognition unit 104, a musical instrument recognition unit 105, a relevance recognition unit 106, a performance information generation unit 107, and a score information generation unit 108.

The image input unit 101 receives, as input images to be processed, a plurality of consecutive still images or a plurality of consecutive frame images constituting a moving image. The image input unit 101 supplies the input images to the position recognition unit 102 and the musical instrument recognition unit 105. An input image to be processed in the present technology is each one of the consecutive still images or each one of the frame images constituting the moving image.

The input image may be one captured by the camera unit 16 of the terminal device 10, or one captured by a camera other than the camera unit 16 and imported into the information processing apparatus 100 via the terminal device 10. It may also be supplied from another external device to the information processing apparatus 100 via the terminal device 10. It may be a capture of a performance actually taking place in front of the user of the information processing apparatus 100, or a capture of video shown on a display such as a television or personal computer. It may also be video recorded on a commercially available DVD or Blu-ray (registered trademark) disc, or a still image or video obtainable on the Internet. In other words, the input image may be any image that shows a player performing. The input image may be an RGB (Red, Green, Blue) image or another type such as an IR image.

As shown in FIG. 3, the input image in the first embodiment shows both of the player's hands and the whole of the area of the instrument that the player's hands touch in order to play it (the performance area).

The position recognition unit 102 recognizes the three-dimensional position of the hand, a part of the player's body, in the input image by means such as: hand recognition techniques for the human body such as Hand Pose Detection, Hand Pose Estimation, and Hand Segmentation; feature point extraction methods such as HOG (Histogram of Oriented Gradient) and SIFT (Scale Invariant Feature Transform); subject recognition methods based on pattern recognition such as Boosting and SVM (Support Vector Machine); region extraction methods such as Graph Cut; and CNN (Convolutional Neural Network). In addition to the hand, the position recognition unit 102 also recognizes, as needed for generating performance information, the positions of the fingers, arms, elbows, and the like as parts of the player's body. The three-dimensional hand position information is supplied to the shape recognition unit 103, the motion recognition unit 104, and the relevance recognition unit 106.

Feature points of the hand used to recognize its three-dimensional position include the fingertips, finger joints, and wrist. Because the position information indicates the three-dimensional position of the player's hand in the input image, it is expressed, for example, as (x, y, z) coordinates with a predetermined position in the input image as the origin (0, 0, 0). If the consecutive input images are numbered t (t = 1, 2, 3, ...) and the hand feature points are numbered P (P = 1, 2, 3, ...), the position information is expressed in the form (x_tP, y_tP, z_tP).

For example, as shown in FIG. 4A, when five hand feature points are recognized in the input image (t = 1), they are expressed as follows.
Feature point P1: (x_11, y_11, z_11)
Feature point P2: (x_12, y_12, z_12)
Feature point P3: (x_13, y_13, z_13)
Feature point P4: (x_14, y_14, z_14)
Feature point P5: (x_15, y_15, z_15)

Likewise, as shown in FIG. 4B, when five hand feature points are recognized in the input image (t = 2), they are expressed as follows.
Feature point P1: (x_21, y_21, z_21)
Feature point P2: (x_22, y_22, z_22)
Feature point P3: (x_23, y_23, z_23)
Feature point P4: (x_24, y_24, z_24)
Feature point P5: (x_25, y_25, z_25)
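The (t, P) indexing of feature points described above can be sketched as a small data structure. This is only an illustration: the names `Point3D` and `HandTrack` and all coordinate values are hypothetical and do not come from the original text.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Point3D:
    x: float
    y: float
    z: float


@dataclass
class HandTrack:
    """Feature-point positions keyed by (t, P): input image number t, feature point P."""
    points: Dict[Tuple[int, int], Point3D] = field(default_factory=dict)

    def set(self, t: int, p: int, x: float, y: float, z: float) -> None:
        self.points[(t, p)] = Point3D(x, y, z)

    def get(self, t: int, p: int) -> Point3D:
        return self.points[(t, p)]

    def frame(self, t: int) -> Dict[int, Point3D]:
        """All feature points recognized in input image t."""
        return {p: pt for (tt, p), pt in self.points.items() if tt == t}


track = HandTrack()
for p in range(1, 6):                          # five fingertip feature points P1..P5
    track.set(1, p, x=10.0 * p, y=5.0, z=1.0)  # input image t = 1 (made-up values)
    track.set(2, p, x=10.0 * p, y=4.0, z=1.0)  # input image t = 2 (made-up values)
```

Keying on the pair (t, P) keeps the notation of the text, so `track.get(2, 3)` corresponds to (x_23, y_23, z_23).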

The three-dimensional position information of the hand may be expressed in a global coordinate system with the camera as the origin, or in a local coordinate system on the input image plus depth information. Alternatively, the three-dimensional position of the hand may be obtained from the centroid of the region found by Hand Segmentation together with depth information.
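The centroid-plus-depth approach can be illustrated as follows. This is a minimal sketch under stated assumptions: the segmentation mask and depth map are given as plain nested lists, the function name `hand_position_from_mask` is hypothetical, and taking the median depth over the hand region is a choice made here for robustness, not something the text specifies.

```python
from statistics import median
from typing import List, Tuple


def hand_position_from_mask(
    mask: List[List[bool]], depth: List[List[float]]
) -> Tuple[float, float, float]:
    """Return (u, v, d): mask centroid in pixel coordinates plus a depth value.

    mask  : True where the segmentation marks a hand pixel
    depth : per-pixel depth, same shape as mask (units depend on the sensor)
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not pixels:
        raise ValueError("no hand pixels in mask")
    u = sum(c for _, c in pixels) / len(pixels)   # horizontal image coordinate
    v = sum(r for r, _ in pixels) / len(pixels)   # vertical image coordinate
    d = median([depth[r][c] for r, c in pixels])  # assumption: median depth of region
    return u, v, d


mask = [
    [False, True, True, False],
    [False, True, True, False],
    [False, False, False, False],
]
depth = [
    [9.0, 0.5, 0.5, 9.0],
    [9.0, 0.75, 1.0, 9.0],
    [9.0, 9.0, 9.0, 9.0],
]
position = hand_position_from_mask(mask, depth)
```

The result is in the "local image coordinates plus depth" form; converting it to a camera-origin global system would additionally need the camera intrinsics, which are outside this sketch.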

For convenience of explanation, FIG. 4 shows feature points recognized only at the tips of the five fingers. In practice, as indicated by the black dots superimposed on the hand in FIG. 5, many feature points are recognized, for example at the joints of each finger, the webbing between the fingers, and the wrist. Recognizing a larger number of feature points in this way allows performance information to be generated more accurately.

In a cropped image obtained by cutting out a partial region of the input image, the position may instead be expressed as coordinates (u_tP, v_tP, d_tP) in a coordinate system different from the (x, y, z) system, with a predetermined position in the cropped image as the origin.

The shape recognition unit 103 uses techniques such as CNN, pattern matching, and Boosting to recognize the shape of the hand indicated by the position information supplied from the position recognition unit 102. The hand shape information is supplied to the motion recognition unit 104 and the performance information generation unit 107.

The motion recognition unit 104 uses techniques such as CNN and Hand Tracking to recognize the movement of the player's hand whose position and shape have been recognized. The hand movement information is supplied to the performance information generation unit 107. Hand movement can be recognized from the change in motion vectors between one input image (t) among the consecutive input images and a later input image (t + n) in the time series.

The musical instrument recognition unit 105 uses techniques such as CNN and pattern matching to recognize the musical instrument in the input image and the area of that instrument that the player's hands touch in order to play it (the performance area). The performance area is, for example, the keyboard if the instrument is a piano, or the pickup portion (the sound hole in the case of an acoustic guitar) and the neck if the instrument is a guitar. The instrument recognition information is supplied to the relevance recognition unit 106.

The relevance recognition unit 106 uses techniques such as CNN and pattern matching to recognize the relevance between the position of the player's hand and the performance area of the instrument. Relevance here means the relationship between the player and the instrument for the purpose of playing it, that is, the contact position indicating where on the performance area the player's hand is touching. Relevance also includes the direction of movement of the player's hands, arms, elbows, and other parts relative to the performance area. The relevance information is supplied to the performance information generation unit 107.

The performance information generation unit 107 uses techniques such as CNN to recognize whether or not the player is in a playing state. It then generates performance information corresponding to a single input image (partial performance information) from performance elements based on the state in which the player is playing in the input image (first performance elements), performance elements based on the state in which the player is not playing (second performance elements), and performance elements spanning multiple input images (third performance elements).

The first performance elements differ by instrument. For keyboard instruments such as the piano, they include pitch (the note of the scale), note length, tempo, and dynamics. For string instruments such as the guitar, they likewise include pitch, note length, and dynamics. For percussion instruments such as drums, they include which piece of the drum set is struck, note length, tempo, and dynamics.

The second performance elements, for any instrument, include the length of rests. The third performance elements include tempo, note length, rest length, key, and dynamics. Dynamics and note length are both first and third performance elements, because in some cases they can be estimated from a single input image and in other cases the estimate requires multiple input images. For example, if the player's finger is far from the performance area in a single input image, that image alone suggests a loud note. If the finger is moving in small motions close to the performance area, however, the dynamics cannot be estimated from one image, and the player's finger movement must be recognized across multiple input images to estimate them.

To obtain performance elements from the relevance information between the position of the player's hand and the performance area, the apparatus recognizes which key in the performance area each of the player's fingers is touching, based on the hand position recognized by the position recognition unit 102, the hand shape recognized by the shape recognition unit 103, and the instrument and performance area recognized by the musical instrument recognition unit 105. This makes it possible to recognize which note of the scale is being sounded in the state of the player shown in that input image. It also makes it possible to recognize which chord, made up of multiple notes, is being sounded.
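Mapping a fingertip position to a touched key can be sketched as below. This is a deliberately simplified illustration, not the method of the text, which relies on recognizers such as CNNs. It assumes the keyboard has been recognized as a horizontal span of equal-width white keys, handles white keys only, assumes the leftmost key is a C, and treats the finger as touching when its height is within a tolerance of the key surface. All names and parameters are hypothetical.

```python
from typing import Optional

WHITE_NOTES = ["C", "D", "E", "F", "G", "A", "B"]


def touched_white_key(
    finger_x: float,
    finger_z: float,
    kb_left: float,
    kb_right: float,
    n_white_keys: int,
    key_surface_z: float = 0.0,
    touch_tol: float = 0.5,
    first_octave: int = 4,
) -> Optional[str]:
    """Return a note name such as 'E4', or None if the fingertip is not on a key."""
    if not (kb_left <= finger_x < kb_right):
        return None                      # outside the recognized performance area
    if abs(finger_z - key_surface_z) > touch_tol:
        return None                      # hovering above the keys, not touching
    key_width = (kb_right - kb_left) / n_white_keys
    index = min(int((finger_x - kb_left) // key_width), n_white_keys - 1)
    octave = first_octave + index // 7   # assumption: leftmost key is C of first_octave
    return f"{WHITE_NOTES[index % 7]}{octave}"


# Keyboard spanning x = 0..140 with 14 white keys (two octaves), made-up values.
note_a = touched_white_key(25.0, 0.1, 0.0, 140.0, 14)
note_b = touched_white_key(95.0, 0.0, 0.0, 140.0, 14)
hover = touched_white_key(25.0, 3.0, 0.0, 140.0, 14)
```

Applying the same lookup to every recognized fingertip in one input image gives the set of notes sounding in that image, from which a chord could be named.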

The note length can also be recognized by recognizing how long the player's finger remains in contact with the same location in the performance area.
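As a rough sketch of this idea, a per-frame record of the touched location can be collapsed into runs whose length in frames, divided by the frame rate, gives a duration in seconds. The function name `note_durations` and the use of `None` for "no contact" are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def note_durations(
    touched: List[Optional[str]], fps: float
) -> List[Tuple[Optional[str], float]]:
    """Collapse a per-frame sequence of touched keys into (key, seconds) runs.

    touched : one entry per input image; None means no contact (a rest)
    fps     : frame rate of the input images
    """
    runs: List[list] = []
    for key in touched:
        if runs and runs[-1][0] == key:
            runs[-1][1] += 1              # same location still in contact
        else:
            runs.append([key, 1])         # contact location changed
    return [(key, count / fps) for key, count in runs]


# 15 frames on C4, 30 frames of no contact, 15 frames on E4, at 30 frames per second.
frames = ["C4"] * 15 + [None] * 30 + ["E4"] * 15
durations = note_durations(frames, fps=30.0)
```

One limitation of this simplification: a key that is released and struck again between two consecutive frames is merged into a single long note.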

For chord recognition, when the instrument is a guitar for example, the performance information generation unit 107 holds in advance, for each type of chord, template images showing the finger positions and shapes used to play that chord, as shown in FIGS. 6A and 6B. The finger position information and finger shape information extracted from the input image are then compared with the template images (template matching) to determine the chord whose finger positions and shapes are the closest match.

Chord recognition is also possible, as shown in FIG. 6C, by having the performance information generation unit 107 hold in advance, for each type of chord, coordinate information of finger feature points indicating the finger positions used to play that chord, and comparing that coordinate information with the finger position information (coordinate information) extracted from the input image.
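The coordinate-comparison variant can be sketched as a nearest-template search. The text does not specify a distance measure; the sum of squared distances used here is an assumption, as are the function name `closest_chord`, the (fret, string) coordinate convention, and the template values, which are placeholders rather than authoritative fingerings.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def closest_chord(fingers: List[Point], templates: Dict[str, List[Point]]) -> str:
    """Return the chord whose stored fingertip coordinates are nearest to `fingers`.

    Points are compared in order, so fingers and templates must list the
    feature points in the same order (e.g. index, middle, ring).
    """
    def cost(template: List[Point]) -> float:
        return sum(
            (fx - tx) ** 2 + (fy - ty) ** 2
            for (fx, fy), (tx, ty) in zip(fingers, template)
        )

    candidates = {n: t for n, t in templates.items() if len(t) == len(fingers)}
    if not candidates:
        raise ValueError("no template with a matching number of feature points")
    return min(candidates, key=lambda name: cost(candidates[name]))


# Placeholder templates in (fret, string) coordinates.
TEMPLATES = {
    "C": [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0)],
    "G": [(2.0, 5.0), (3.0, 6.0), (3.0, 1.0)],
}
observed = [(1.1, 2.0), (2.0, 3.9), (2.9, 5.1)]
chord = closest_chord(observed, TEMPLATES)
```

In practice the extracted coordinates would first need to be normalized into the same frame as the stored templates (for example relative to the recognized neck), which this sketch takes as given.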

In addition, based on the hand movement recognized by the motion recognition unit 104, the substantially vertical movement of the player's hand, which can be recognized from one input image (t) and a later input image (t + n) in the time series, makes it possible to recognize whether the player is playing, the dynamics of the performance, the tempo, and so on.

The substantially vertical direction here means, when the instrument is a piano, the direction substantially perpendicular to the direction in which the keys are arranged. Whether the player is playing can be determined from whether the hand is away from the keyboard. The dynamics can be determined from the position of the hand in the substantially vertical direction (the height of the hand). For example, the farther the hand is from the keyboard in the vertical direction, the louder the note can be judged to be, and the closer the hand is to the keyboard, the softer. The tempo of the piece can be recognized from the time interval of the hand's regular up-and-down movement in the vertical direction. Time-related performance elements such as the tempo and note lengths can be obtained by associating the frame rate of the video with real time and using the real-time interval of the player's regular movements together with the playback time of the video.
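The tempo estimate from regular vertical motion can be sketched as follows. This is a simplified illustration under stated assumptions: the hand height above the keyboard is available per input image, a "strike" is the frame where the hand first reaches the key surface, and each strike falls on one beat. The function names are hypothetical.

```python
from typing import List


def strike_frames(heights: List[float], contact_height: float) -> List[int]:
    """Indices of frames where the hand arrives at the keys from above."""
    return [
        i
        for i in range(1, len(heights))
        if heights[i] <= contact_height < heights[i - 1]
    ]


def tempo_bpm(strikes: List[int], fps: float) -> float:
    """Beats per minute from the mean frame gap between strikes.

    Assumption: one strike per beat. The frame rate ties frame counts to real time.
    """
    if len(strikes) < 2:
        raise ValueError("need at least two strikes to estimate a tempo")
    mean_gap_frames = (strikes[-1] - strikes[0]) / (len(strikes) - 1)
    return 60.0 * fps / mean_gap_frames


# Hand raised for 14 frames, then on the keys for 1 frame, repeated four times.
heights = ([2.0] * 14 + [0.0]) * 4
strikes = strike_frames(heights, contact_height=0.0)
bpm = tempo_bpm(strikes, fps=30.0)
```

With 30 frames per second and a strike every 15 frames, the interval is 0.5 seconds, which corresponds to 120 beats per minute.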

Similarly, the pitch can be recognized from the substantially horizontal movement of the player's hand, which can be recognized from one input image (t) and a later input image (t + n) in the time series. The substantially horizontal direction here means, when the instrument is a piano, the direction substantially parallel to the direction in which the keys are arranged. Specifically, a change in the hand's substantially horizontal position relative to the piano shows which region of the keyboard is being played, which makes it possible to recognize the notes being played, including changes of register and octave.

The substantially vertical and substantially horizontal movements of the hand can be recognized from the change in motion vectors between one input image (t) among the consecutive input images and a later input image (t + n) in the time series.
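A motion vector between two input images and its split into horizontal and vertical parts can be sketched like this. The sketch assumes that unit vectors for the direction along the keys and for the upward direction are already known from instrument recognition; the function names and default axes are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def motion_vector(p_t: Sequence[float], p_tn: Sequence[float]) -> Vec3:
    """Displacement of one feature point from input image t to input image t + n."""
    dx, dy, dz = (b - a for a, b in zip(p_t, p_tn))
    return (dx, dy, dz)


def split_motion(
    vec: Sequence[float],
    keyboard_axis: Sequence[float] = (1.0, 0.0, 0.0),
    up_axis: Sequence[float] = (0.0, 0.0, 1.0),
) -> Tuple[float, float]:
    """Project a motion vector onto the key-row axis and the vertical axis.

    Assumption: both axes are unit vectors expressed in the same coordinate
    system as the feature points.
    """
    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return dot(vec, keyboard_axis), dot(vec, up_axis)


vec = motion_vector((10.0, 5.0, 3.0), (14.0, 5.0, 1.0))
horizontal, vertical = split_motion(vec)
```

Here the hand moved 4 units along the keys and 2 units downward, so the horizontal component relates to a change of register and the vertical component to a keystroke.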

The third performance elements can be generated based on changes in the player's fingers and arms across multiple input images. For example, the longer the hand remains away from the keyboard in the vertical direction across multiple input images, the louder the next note can be taken to be, and this can serve as a third performance element. The tempo of the piece can also be recognized from the time interval of the hand's up-and-down movement in the vertical direction across multiple input images and used as a third performance element.

さらに演奏情報生成部１０７は、複数の入力画像のそれぞれに対応した部分演奏情報を時系列に従って繋いでいくことにより、それら複数の入力画像により構成されるフレーズ、曲の一部または全部の複合演奏情報を生成する。フレーズや曲の一部の複合演奏情報とは、１または複数の小節単位での演奏情報である。 Further, the performance information generation unit 107 connects the partial performance information corresponding to each of the plurality of input images in chronological order, thereby generating composite performance information for a phrase, or for part or all of a piece, composed of those input images. Composite performance information for a phrase or part of a piece is performance information in units of one or more measures.

部分演奏情報および複合演奏情報は、五線譜で記された楽譜に限らず、その情報に基づいて演奏者、コンピュータ、音楽演奏用ソフトウェア、音楽作成用ソフトウェアなどが楽曲を再現することができればどのような形式の情報でもよい。例えば、ＭＩＤＩ（Musical Instrument Digital Interface）形式の情報やプログラミング形式の情報、音楽演奏／制作用ソフトウェア独自のフォーマットの情報などでもよい。 The partial performance information and the composite performance information are not limited to a score written in staff notation; they may be in any format as long as a player, a computer, music performance software, music creation software, or the like can reproduce the music based on them. For example, they may be information in MIDI (Musical Instrument Digital Interface) format, information in a programming format, or information in a format specific to music performance/production software.

楽譜情報生成部１０８は、演奏情報生成部１０７から部分演奏情報が供給された場合には入力画像一枚に対応する部分楽譜情報を生成する。また、演奏情報生成部１０７から複合演奏情報が供給された場合には複数の入力画像により構成されるフレーズ、曲の一部または全部の楽譜情報である複合楽譜情報を生成する。ここでいう楽譜とは五線譜で記された楽譜であり、楽譜情報を構成する情報としては、音符、休符、拍子記号、テンポ、臨時記号、調号、強弱などがある。臨時記号情報は演奏者が演奏している状態に基づく第１演奏要素である、演奏されている音階と、複数の入力画像に跨る第３演奏要素である調とから導き出すことができる。 When partial performance information is supplied from the performance information generation unit 107, the score information generation unit 108 generates partial score information corresponding to one input image. When composite performance information is supplied from the performance information generation unit 107, it generates composite score information, which is the score information for a phrase, or for part or all of a piece, composed of a plurality of input images. The score referred to here is a score written in staff notation, and the information making up the score information includes notes, rests, time signature, tempo, accidentals, key signature, dynamics, and the like. Accidental information can be derived from the scale being played, which is a first performance element based on the state in which the player is playing, and the key, which is a third performance element spanning a plurality of input images.
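The derivation of accidental information from the played pitch and the key can be sketched as below. The representation is an assumption for illustration only: pitches are MIDI note numbers, the key is given as the pitch class of its tonic (0 for C, 7 for G), and only major keys are handled.

```python
# Semitone offsets of the major scale degrees from the tonic.
MAJOR_SCALE_STEPS = (0, 2, 4, 5, 7, 9, 11)


def needs_accidental(midi_pitch, key_tonic_pc):
    """Return True if the pitch lies outside the major scale on key_tonic_pc.

    A note outside the scale of the current key cannot be written with the
    key signature alone, so it needs an accidental in the score.
    """
    return (midi_pitch - key_tonic_pc) % 12 not in MAJOR_SCALE_STEPS
```

For example, F sharp (MIDI note 66) needs an accidental in C major but not in G major, where it is already covered by the key signature.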

情報処理装置１００は以上のようにして構成されている。情報処理装置１００はプログラムで構成され、そのプログラムは予め端末装置１０にインストールされていてもよいし、ダウンロード、記憶媒体などで配布されて、ユーザが自ら端末装置１０にインストールするようにしてもよい。また、情報処理装置１００は、プログラムによって実現されるのみでなく、その機能を有するハードウェアによる専用の装置、回路などを組み合わせて実現されてもよい。 The information processing device 100 is configured as described above. The information processing device 100 is implemented as a program, which may be installed in the terminal device 10 in advance, or may be distributed by download, on a storage medium, or the like and installed in the terminal device 10 by the user. The information processing device 100 may be realized not only by a program but also by a combination of dedicated devices, circuits, and the like implemented in hardware having those functions.

［１−３．情報処理装置による処理］ [1-3. Processing by information processing device]
［１−３−１．部分演奏情報の生成］ [1-3-1. Generation of partial performance information]
次に図７のフローチャートを参照して情報処理装置１００における処理の流れについて説明する。図７のフローチャートの処理は、入力画像一枚に対応した部分演奏情報を生成するものである。なお、上述したように入力画像の一枚とは、複数枚の連続する静止画像のうちの一枚または、動画を構成する複数のフレーム画像のうちの一枚である。 Next, the flow of processing in the information processing device 100 will be described with reference to the flowchart of FIG. 7. The processing in the flowchart of FIG. 7 generates partial performance information corresponding to one input image. As described above, one input image is one of a plurality of consecutive still images or one of a plurality of frame images constituting a moving image.

まずステップＳ１０１で、画像入力部１０１に対して入力画像が入力される。この入力画像は一枚の静止画像またはフレーム画像でもよいし、連続する複数の静止画像でもよいし、動画を構成する連続する複数のフレーム画像でもよい。複数の入力画像が入力されると、以下のステップＳ１０２以降の処理は、まず（ｔ＝１）の一番目の入力画像に対して行われる。また、連続する複数の静止画像または動画を構成する連続する複数のフレーム画像が入力された場合、どの入力画像の部分演奏情報を生成するかをユーザが選択できるようにしてもよい。 First, in step S101, an input image is input to the image input unit 101. This input image may be a single still image or a frame image, a plurality of continuous still images, or a plurality of continuous frame images constituting a moving image. When a plurality of input images are input, the following processes from step S102 are first performed on the first input image (t = 1). When a plurality of continuous still images or a plurality of continuous frame images constituting a moving image are input, the user may be able to select which input image to generate partial performance information for.

次にステップＳ１０２で位置認識部１０２により入力画像中における演奏者の手の３次元位置が認識され、手の位置情報が形状認識部１０３、動き認識部１０４および関連性認識部１０６に供給される。 Next, in step S102, the position recognition unit 102 recognizes the three-dimensional position of the player's hand in the input image, and the hand position information is supplied to the shape recognition unit 103, the motion recognition unit 104, and the relevance recognition unit 106.

次にステップＳ１０３で、形状認識部１０３により入力画像中において位置が認識された手の形状が認識される。手の形状情報は動き認識部１０４と演奏情報生成部１０７に供給される。さらにステップＳ１０４で、動き認識部１０４により、位置および形状が認識された手の動きが認識される。手の動き情報は演奏情報生成部１０７に供給される。 Next, in step S103, the shape of the hand whose position has been recognized in the input image is recognized by the shape recognition unit 103. The hand shape information is supplied to the motion recognition unit 104 and the performance information generation unit 107. Further, in step S104, the movement recognition unit 104 recognizes the movement of the hand whose position and shape have been recognized. The hand movement information is supplied to the performance information generation unit 107.

次にステップＳ１０５で楽器認識部１０５により入力画像中における楽器および演奏領域が認識される。楽器情報および演奏領域情報は関連性認識部１０６に供給される。なお、ステップＳ１０２乃至ステップＳ１０４における演奏者の手の位置、形状、動きの認識処理とステップＳ１０５における楽器および演奏領域の認識は並行して行うようにしてもよいし、楽器および演奏領域の認識を先に行ってもよい。 Next, in step S105, the musical instrument recognition unit 105 recognizes the musical instrument and its performance area in the input image. The musical instrument information and the performance area information are supplied to the relevance recognition unit 106. The recognition of the position, shape, and movement of the player's hand in steps S102 to S104 and the recognition of the musical instrument and the performance area in step S105 may be performed in parallel, or the recognition of the musical instrument and the performance area may be performed first.

次にステップＳ１０６で関連性認識部１０６により手の各指とそれに対応する楽器の演奏領域の位置の関連性が認識される。関連性とは演奏者の手が楽器の演奏領域のどこに位置しているかを示すものであり、関連性情報は演奏情報生成部１０７に供給される。 Next, in step S106, the relevance recognizing unit 106 recognizes the relevance of each finger of the hand and the position of the playing area of the corresponding musical instrument. The relevance indicates where the player's hand is located in the performance area of the musical instrument, and the relevancy information is supplied to the performance information generation unit 107.

次にステップＳ１０７で演奏情報生成部１０７は手の動き情報および関連性情報から演奏者が入力画像において楽器を演奏している状態であるか否かを判定する。 Next, in step S107, the performance information generation unit 107 determines from the hand movement information and the relevance information whether or not the player is playing a musical instrument in the input image.

判定の結果、演奏者が演奏している場合、処理はステップＳ１０８からステップＳ１０９に進む（ステップＳ１０８のＹｅｓ）。そしてステップＳ１０９で演奏情報生成部１０７は手の３次元位置情報、手の形状情報、手の動き情報、関連性情報とから第１演奏要素を生成する。 If the result of the determination is that the player is performing, the process proceeds from step S108 to step S109 (Yes in step S108). Then, in step S109, the performance information generation unit 107 generates a first performance element from the three-dimensional position information of the hand, hand shape information, hand movement information, and relevance information.

一方、ステップＳ１０７での判定の結果、演奏者が演奏していない場合、処理はステップＳ１０８からステップＳ１１０に進む（ステップＳ１０８のＮｏ）。そしてステップＳ１１０で演奏情報生成部１０７は第２演奏要素を生成する。 On the other hand, if the result of determination in step S107 is that the player is not playing, the process proceeds from step S108 to step S110 (No in step S108). Then, in step S110, the performance information generation unit 107 generates a second performance element.

次にステップＳ１１１で演奏情報生成部１０７は、第１演奏要素または第２演奏要素から入力画像に対応した部分演奏情報を生成する。そしてステップＳ１１２でその部分演奏情報を出力する。 Next, in step S111, the performance information generation unit 107 generates partial performance information corresponding to the input image from the first performance element or the second performance element. Then, in step S112, the partial performance information is output.
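The branch in steps S107 to S111 can be sketched in Python as follows. This is a simplified illustration, not the specification's implementation: the `FrameObservation` fields, the rule used to decide that the player is playing, and the modelling of the second performance element as a rest are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameObservation:
    """Hypothetical summary of the recognition results for one input image."""
    touching_key: Optional[int]  # index of the key under the finger, if any
    pressing: bool               # True if a downward hand movement was detected


def is_playing(obs):
    """Step S107: decide from movement and relevance whether a key is played."""
    return obs.touching_key is not None and obs.pressing


def partial_performance_info(obs):
    """Steps S108 to S111: build partial performance information for one image."""
    if is_playing(obs):
        # Yes in step S108: step S109 generates a first performance element.
        element = {"type": "first", "key": obs.touching_key}
    else:
        # No in step S108: step S110 generates a second performance element.
        element = {"type": "second", "rest": True}
    return {"elements": [element]}  # step S111
```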

出力された部分演奏情報は端末装置１０の表示部１４において表示したり、端末装置１０が備える音楽演奏用ソフトウェア、音楽制作用ソフトウェアなどにおいて使用可能である。また、ユーザ、演奏者などからの要求に応じて楽譜情報生成部１０８によって演奏情報に基づいて楽譜情報を生成してもよい。また、部分演奏情報を端末装置１０の記憶部１２に保存しておき、必要に応じて記憶部１２から読み出して使用することも可能である。 The output partial performance information can be displayed on the display unit 14 of the terminal device 10 or used in music performance software, music production software, or the like provided in the terminal device 10. Further, the musical score information generating unit 108 may generate musical score information based on the performance information in response to a request from a user, a player, or the like. Further, the partial performance information can be stored in the storage unit 12 of the terminal device 10 and read out from the storage unit 12 and used as needed.

以上のようにして入力画像に対する演奏情報生成処理が行われる。 The performance information generation processing for the input image is performed as described above.

［１−３−２．複合演奏情報の生成］ [1-3-2. Generation of composite performance information]
次に図８のフローチャートについて説明する。図８のフローチャートの処理は、複数の入力画像により構成されるフレーズ、曲の一部または全部の演奏情報である複合演奏情報を生成する処理である。 Next, the flowchart of FIG. 8 will be described. The processing in the flowchart of FIG. 8 generates composite performance information, which is the performance information for a phrase, or for part or all of a piece, composed of a plurality of input images.

まずステップＳ１０１で、画像入力部１０１に対して入力画像として連続する複数の静止画像または動画を構成する連続する複数のフレーム画像が入力される。複数の入力画像が入力されると以下のステップＳ１０２以降の処理はまず入力画像（ｔ＝１）の一番目の入力画像に対して行われる。また、連続する複数の静止画像または動画を構成する連続する複数のフレーム画像が入力された場合、どの入力画像から処理を開始するかをユーザが選択できるようにしてもよい。 First, in step S101, a plurality of continuous still images or a plurality of continuous frame images forming a moving image are input to the image input unit 101 as input images. When a plurality of input images are input, the processing from step S102 onward is first performed on the first input image of the input image (t = 1). When a plurality of continuous still images or a plurality of continuous frame images constituting a moving image are input, the user may be able to select from which input image to start processing.

ステップＳ１０１からステップＳ１１１までの処理は図７のフローチャートと同様であるため、説明を省略する。 The processing from step S101 to step S111 is the same as in the flowchart of FIG. 7, so its description is omitted.

ステップＳ１１１の後、次にステップＳ１２１で、演奏情報生成部１０７は複数の入力画像間に跨る演奏要素である第３演奏要素があるか否かが判定する。複数の入力画像間に跨る第３演奏要素があるか否かは、以下のように判断できる。例えば音の強弱（大きさ）の場合、現在処理中の入力画像（ｔ）において認識された演奏の強弱が一つ前の入力画像である入力画像（ｔ−１）で認識された強弱よりも強くなる場合、入力画像（ｔ−１）から入力画像（ｔ）まで、「だんだん強く」という演奏要素が導き出せる。また、同様に、例えば、入力画像（ｔ＋１）の音程において認識された演奏の強弱が入力画像（ｔ）の強弱より大きい場合は、入力画像（ｔ−１）、入力画像（ｔ）、入力画像（ｔ＋１）とも「だんだん強く」という演奏要素が導き出せる。このように、処理対象である複数の入力画像それぞれの状態により、現在の入力画像における演奏要素に基づいて過去の入力画像における演奏要素が認識される場合「フレーム間に跨る演奏要素である第３演奏要素がある」と判断することができる。 After step S111, in step S121 the performance information generation unit 107 determines whether there is a third performance element, that is, a performance element spanning a plurality of input images. Whether such a third performance element exists can be determined as follows. Taking dynamics (loudness) as an example, if the intensity of the performance recognized in the input image (t) currently being processed is stronger than the intensity recognized in the immediately preceding input image (t−1), the performance element "gradually louder" can be derived for input image (t−1) through input image (t). Likewise, if the intensity of the performance recognized for the pitch in input image (t+1) is greater than that of input image (t), the performance element "gradually louder" can be derived for input image (t−1), input image (t), and input image (t+1) together. In this way, when the states of the input images being processed mean that a performance element of a past input image is recognized based on the performance element of the current input image, it can be determined that "there is a third performance element, which is a performance element spanning frames".
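The "gradually louder" example can be sketched as a search for runs of increasing intensity across input images. The per-image intensity values are assumed to be numbers already produced by earlier recognition steps; how they are measured is outside this sketch.

```python
def crescendo_spans(intensities):
    """Return (start, end) index pairs of maximal runs of rising intensity.

    intensities[t] is the assumed numeric intensity recognized in input
    image t. A span (a, b) means "gradually louder" applies from input
    image a through input image b.
    """
    spans = []
    start = 0
    for t in range(1, len(intensities) + 1):
        rising = t < len(intensities) and intensities[t] > intensities[t - 1]
        if not rising:
            if t - 1 > start:  # the run covers at least two input images
                spans.append((start, t - 1))
            start = t
    return spans
```

With intensities `[3, 4, 5, 5, 2, 6]`, the function returns `[(0, 2), (4, 5)]`: the first span grows from two images to three as each later image turns out to be louder, which mirrors the way the text extends the element from (t−1), (t) to (t−1), (t), (t+1).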

複数の入力画像間に跨る第３演奏要素がある場合、処理はステップＳ１２２に進み（ステップＳ１２１のＹｅｓ）、ステップＳ１１１で生成した部分演奏情報に第３演奏要素を付加することにより部分演奏情報を更新する。そして処理はステップＳ１２２からステップＳ１２３に進む。なお、第３演奏要素は部分演奏情報において第１演奏要素、第２演奏要素と同様に部分演奏情報の構成要素としてもよいし、部分演奏情報とは別情報としたまま紐付けにより対応付けてもよい。 If there is a third performance element spanning a plurality of input images, the process proceeds to step S122 (Yes in step S121), and the partial performance information generated in step S111 is updated by adding the third performance element to it. The process then proceeds from step S122 to step S123. The third performance element may be made a component of the partial performance information in the same way as the first and second performance elements, or it may be kept as separate information and associated with the partial performance information by linking.

一方、ステップＳ１２１で複数の画像間に跨る第３演奏要素がない場合処理はステップＳ１２３に進む（ステップＳ１２１のＮｏ）。 On the other hand, if there is no third performance element spanning a plurality of images in step S121, the process proceeds to step S123 (No in step S121).

次にステップＳ１２３で処理対象である次の入力画像があるか否かが判定される。ステップＳ１０１で画像入力部１０１に対して入力された、連続する複数の静止画像または動画を構成する連続する複数のフレーム画像にまだ未処理の画像がある場合には次の入力画像があるとして処理はステップＳ１０２に戻る（ステップＳ１２３のＹｅｓ）。そして、時系列で次の順の入力画像（フレーム画像である場合には次のフレーム番号の画像）に対してステップＳ１０２乃至ステップＳ１２３の処理が行われる。そして、入力された全ての入力画像のそれぞれに対して処理が行われるまでステップＳ１０２乃至ステップＳ１２３が繰り返される。 Next, in step S123, it is determined whether there is a next input image to be processed. If unprocessed images remain among the consecutive still images, or the consecutive frame images constituting a moving image, that were input to the image input unit 101 in step S101, it is determined that there is a next input image and the process returns to step S102 (Yes in step S123). Steps S102 to S123 are then performed on the next input image in time series (for frame images, the image with the next frame number). Steps S102 to S123 are repeated until every input image has been processed.

ステップＳ１２３で処理対象の画像がない場合、処理はステップＳ１２４に進む（ステップＳ１２３のＮｏ）。 If there is no image to be processed in step S123, the process proceeds to step S124 (No in step S123).

次にステップＳ１２４で演奏情報生成部１０７は、複数の入力画像のそれぞれに対応した部分演奏情報を時系列に従ってつないでいくことにより、それら複数の入力画像により構成されるフレーズ、曲の一部または全部の複合演奏情報を生成する。 Next, in step S124, the performance information generation unit 107 connects the partial performance information corresponding to each of the plurality of input images in a time series, thereby forming a phrase, a part of a song, or a part of the plurality of input images. Generate all composite performance information.
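The overall loop of FIG. 8 (steps S102 to S124) can be outlined as below. The two callables `make_partial` and `find_third_element` are hypothetical placeholders for the per-image processing and the check of step S121; the dictionary layout is likewise an assumption.

```python
def generate_composite(frames, make_partial, find_third_element):
    """Build composite performance information from input images in time order.

    make_partial(frame) stands in for steps S102 to S111.
    find_third_element(previous_partials, current_partial) stands in for
    step S121 and returns None when no element spans the input images.
    """
    partials = []
    for frame in frames:  # repeated until step S123 finds no next image
        partial = make_partial(frame)
        third = find_third_element(partials, partial)
        if third is not None:  # step S122: add the third performance element
            partial = {**partial, "third": third}
        partials.append(partial)
    # Step S124: connect the partial performance information in time series.
    return {"partials": partials}
```

A usage example with stand-in callables:

```python
result = generate_composite(
    [1, 2, 2],
    make_partial=lambda f: {"intensity": f},
    find_third_element=lambda prev, cur: (
        "gradually louder"
        if prev and cur["intensity"] > prev[-1]["intensity"]
        else None
    ),
)
```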

次にステップＳ１２５で、演奏情報生成部１０７は複合演奏情報を出力する。出力された複合演奏情報は端末装置１０の表示部１４において表示したり、端末装置１０が備える音楽演奏用ソフトウェア、音楽制作用ソフトウェアなどにおいて使用可能である。また、ユーザ、演奏者などからの要求に応じて楽譜情報生成部１０８が複合演奏情報に基づいて複合楽譜情報を生成してもよい。また、複合演奏情報を出力する際に部分演奏情報も出力してもよい。 Next, in step S125, the performance information generation unit 107 outputs composite performance information. The output composite performance information can be displayed on the display unit 14 of the terminal device 10 or used in music performance software, music production software, or the like provided in the terminal device 10. Further, the score information generating unit 108 may generate the composite score information based on the composite performance information in response to a request from a user, a player, or the like. Further, when the composite performance information is output, the partial performance information may also be output.

以上のようにして第１の実施の形態における処理が行われる。本技術の第１の実施の形態によれば、複数枚の連続する静止画像または動画を構成する複数のフレーム画像に基づいて演奏情報と楽譜情報を生成することができる。 The processing according to the first embodiment is performed as described above. According to the first embodiment of the present technology, performance information and score information can be generated based on a plurality of consecutive still images or a plurality of frame images constituting a moving image.
これにより、専門的な知識のない人でも手軽に演奏情報、楽譜情報を得ることができる。また、例えば、音声がない映像データ、音声が劣化／破損している映像データなどに基づいても演奏情報と楽譜情報を生成することができる。また、音声を出力することができない環境においても映像データのみに基づいて演奏情報を生成することができる。 As a result, even people without specialized knowledge can easily obtain performance information and score information. Performance information and score information can also be generated from, for example, video data with no audio or video data whose audio is degraded or damaged. Further, even in an environment where audio cannot be output, performance information can be generated based on video data alone.

なお、第１の実施の形態において演奏情報を生成するための入力画像は、例えば楽器がピアノの場合には、ピアノの演奏領域である鍵盤と演奏者の両手を認識することができる上方から撮影したものが好ましい。楽器がギターの場合にはギターの演奏領域であるピックアップ部分（アコースティックギターであればサウンドホール）およびネックと演奏領域の両手を認識することができる正面から撮影したものが好ましい。 In the first embodiment, when the instrument is a piano, for example, the input image used to generate the performance information is preferably captured from above, so that the keyboard, which is the performance area of the piano, and both of the player's hands can be recognized. When the instrument is a guitar, the image is preferably captured from the front, so that the pickup section (the sound hole in the case of an acoustic guitar) and the neck, which are the performance areas of the guitar, and both hands on those performance areas can be recognized.

本技術は、自分または自分以外の他の演奏者の即興演奏の楽譜化、楽器練習の楽譜化、好きなアーティスト曲を演奏するための楽譜作成、作曲、編曲などの用途に用いることができる。また、作曲、編曲の際には、楽器でいろいろな演奏、フレーズなどを試し、必要な演奏パターンまたは全ての演奏パターンを用意に演奏情報、楽譜情報として得ることができる。また、「楽譜を書いて、楽器で演奏してみる」、または「楽器で演奏してみて、良かったら楽譜を書く」の繰り返し作業が必要なくなる。 The present technology can be used for purposes such as transcribing improvised performances by oneself or by other players, transcribing instrument practice, creating scores for playing a favorite artist's songs, composing, and arranging. When composing or arranging, the user can try out various performances and phrases on an instrument and easily obtain the required performance patterns, or all of them, as performance information and score information. This also removes the need to repeat the cycle of "write a score, then try playing it on the instrument" or "try playing on the instrument, then write the score if it sounds good".

＜２．第２の実施の形態＞ <2. Second Embodiment>
［２−１．情報処理装置の構成］ [2-1. Configuration of Information Processing Apparatus]
次に本技術の第２の実施の形態について説明する。第２の実施の形態は図９に示すように、入力画像において演奏者の身体の部位である手の一部が遮蔽されて隠れているまたは写っていない場合において演奏情報の生成を行うものである。図９においては演奏者の左手の一部が隠れている。なお、情報処理装置１００が動作する端末装置１０の構成は第１の実施の形態と同様であるためその説明を省略する。 Next, a second embodiment of the present technology will be described. As shown in FIG. 9, the second embodiment generates performance information when part of a hand, which is a part of the player's body, is occluded and hidden, or is not captured, in the input image. In FIG. 9, part of the player's left hand is hidden. The configuration of the terminal device 10 on which the information processing device 100 operates is the same as in the first embodiment, so its description is omitted.

図１０に示すように情報処理装置２００は、画像入力部１０１、センサ情報取得部２０１、第１位置認識部２０２、第２位置認識部２０３、形状認識部１０３、動き認識部１０４、楽器認識部１０５、関連性認識部１０６、演奏情報生成部１０７、楽譜情報生成部１０８とから構成されている。画像入力部１０１、形状認識部１０３、動き認識部１０４、楽器認識部１０５、関連性認識部１０６、演奏情報生成部１０７、楽譜情報生成部１０８は第１の実施の形態と同様のものである。 As shown in FIG. 10, the information processing apparatus 200 includes an image input unit 101, a sensor information acquisition unit 201, a first position recognition unit 202, a second position recognition unit 203, a shape recognition unit 103, a motion recognition unit 104, and a musical instrument recognition unit. 105, a relevance recognition unit 106, a performance information generation unit 107, and a score information generation unit 108. The image input unit 101, the shape recognition unit 103, the motion recognition unit 104, the musical instrument recognition unit 105, the relevance recognition unit 106, the performance information generation unit 107, and the score information generation unit 108 are the same as those in the first embodiment. .

センサ情報取得部２０１は端末装置１０が備える、または端末装置１０に接続された外部のセンサで取得されたセンサ情報を取得して第２位置認識部２０３に供給するものである。センサとしては、マイクロホン、圧力センサ、動きセンサなどがある。 The sensor information acquisition unit 201 acquires sensor information acquired by an external sensor provided in the terminal device 10 or connected to the terminal device 10 and supplies the acquired sensor information to the second position recognition unit 203. The sensors include a microphone, a pressure sensor, a motion sensor, and the like.

第１位置認識部２０２は入力画像中において隠れていない演奏者の手の位置を認識するものであり、第１の実施の形態における位置認識部１０２と同様のものである。 The first position recognition unit 202 recognizes the position of the player's hand that is not hidden in the input image, and is similar to the position recognition unit 102 in the first embodiment.

第１位置認識部２０２は第１の実施の形態における位置認識部１０２と同様に、入力画像からHand Pose Detection、Hand Pose Estimationなどと称される人体の手認識技術やＨＯＧ、ＳＩＦＴなどの特徴点抽出方法、Ｂｏｏｓｔｉｎｇ、ＳＶＭなどのパターン認識による被写体認識方法、ＧｒａｐｈＣｕｔなどによる領域抽出方法、ＣＮＮなどにより、入力画像中における演奏者の身体の部位である手の３次元位置を認識する。 Similar to the position recognition unit 102 according to the first embodiment, the first position recognition unit 202 uses a hand recognition technique for a human body called Hand Pose Detection, Hand Pose Estimation, or the like, and features such as HOG and SIFT from an input image. The three-dimensional position of the hand, which is a part of the player's body in the input image, is recognized by an extraction method, a subject recognition method by pattern recognition such as Boosting, SVM, a region extraction method by Graph Cut, or the like, or a CNN.

第２位置認識部２０３は、入力画像中において遮蔽されることによって一部が隠れている演奏者の手の３次元位置を補助情報を用いて認識するものである。補助情報としては、センサ情報取得部２０１から供給されるセンサ情報などがある。センサ情報としては、マイクロホンで集音される演奏の音、手または指が楽器を押圧する力を示す圧力センサ情報、演奏者の腕／手／指の動き示す動きセンサ情報などがある。さらに補助情報としては、第１位置認識部２０２と同様の手法を用いて認識した演奏者の腕および／または肘の位置／形状／動き情報などもある。 The second position recognition unit 203 recognizes, using the auxiliary information, the three-dimensional position of the player's hand that is partially hidden by being occluded in the input image. The auxiliary information includes sensor information supplied from the sensor information acquisition unit 201 and the like. Examples of the sensor information include a performance sound collected by a microphone, pressure sensor information indicating a force of a hand or a finger pressing an instrument, and motion sensor information indicating a movement of a player's arm / hand / finger. Further, the auxiliary information includes position / shape / movement information of the player's arm and / or elbow recognized using the same method as the first position recognition unit 202.

例えば、演奏者の腕および肘の位置、形状の情報から演奏者の肘から先の腕の先端にある手（隠れている手）が楽器の演奏領域のどこに位置しているかを推定して認識することができる。 For example, from information on the position and shape of the player's arm and elbow, it is possible to estimate and recognize where in the performance area of the instrument the hand at the end of the forearm beyond the elbow (the hidden hand) is located.
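One simple way to picture this estimation is to extrapolate from the elbow along the forearm. The sketch below is an assumption made for illustration, not the method of the specification: it supposes that the elbow position, the forearm direction, and a forearm length in the same coordinate units are already available from the recognition steps.

```python
import math


def estimate_hand_position(elbow, forearm_direction, forearm_length):
    """Extrapolate the hidden hand position from the elbow.

    elbow and forearm_direction are (x, y, z) tuples; the direction need
    not be normalized. forearm_length is an assumed, known length.
    """
    norm = math.sqrt(sum(c * c for c in forearm_direction))
    if norm == 0:
        raise ValueError("forearm_direction must be a non-zero vector")
    return tuple(
        e + forearm_length * c / norm
        for e, c in zip(elbow, forearm_direction)
    )
```

The estimated position can then be compared with the recognized performance area in the same way as a directly observed hand position.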

第１位置認識部２０２および第２位置認識部２０３により取得された位置情報は３次元位置を示す情報であるため、例えば、入力画像の所定の位置を原点とした（ｘ,ｙ,ｚ）の座標で表される。また、入力画像の一部領域を切り出した切り出し画像においては、切り出し画像の所定の位置を原点とした（ｕ,ｖ,ｄ）の座標で表される。この点は第１の実施の形態と同様である。位置情報は形状認識部１０３および関連性認識部１０６に供給される。 Since the position information acquired by the first position recognizing unit 202 and the second position recognizing unit 203 is information indicating a three-dimensional position, for example, the (x, y, z) with a predetermined position of the input image as the origin Expressed in coordinates. Further, in a cut-out image obtained by cutting out a partial area of the input image, coordinates are represented by (u, v, d) with a predetermined position of the cut-out image as an origin. This is the same as in the first embodiment. The position information is supplied to the shape recognition unit 103 and the association recognition unit 106.
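The relationship between the two coordinate systems can be sketched as a simple translation. The sketch assumes that the cut-out image's origin is at a known offset `crop_origin` (in input-image coordinates) and that the depth value is carried over unchanged; neither detail is stated in the text.

```python
def to_crop_coords(point_xyz, crop_origin):
    """Convert (x, y, z) in the input image to (u, v, d) in the cut-out image."""
    x, y, z = point_xyz
    ox, oy = crop_origin
    return (x - ox, y - oy, z)


def to_image_coords(point_uvd, crop_origin):
    """Convert (u, v, d) in the cut-out image back to (x, y, z)."""
    u, v, d = point_uvd
    ox, oy = crop_origin
    return (u + ox, v + oy, d)
```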

第２の実施の形態における情報処理装置２００は以上のように構成されている。 The information processing device 200 according to the second embodiment is configured as described above.

［２−２．情報処理装置の処理］ [2-2. Processing of information processing device]
次に第２の実施の形態における情報処理装置２００の処理の流れについて説明する。図１１のフローチャートは第１の実施の形態で説明した、一つの入力画像に対応する部分演奏情報を生成するための処理に対応したものである。 Next, the flow of processing of the information processing device 200 according to the second embodiment will be described. The flowchart of FIG. 11 corresponds to the processing, described in the first embodiment, for generating partial performance information corresponding to one input image.

まずステップＳ１０１で、画像入力部１０１に対して入力画像が入力されると、次にステップＳ２０１で入力画像において手の一部が隠れているか否かが判定される。これは、例えば、第１位置認識部２０２において２つの手の全体が認識された否かに基づいて判定することができる。 First, in step S101, when an input image is input to the image input unit 101, it is determined in step S201 whether a part of the hand is hidden in the input image. This can be determined, for example, based on whether the first position recognition unit 202 has recognized the entire two hands.

手の一部が隠れている場合、処理はステップＳ２０２に進み（ステップＳ２０１のＹｅｓ）、第２位置認識部２０３により補助情報を用いて一部が隠れている演奏者の手が認識される。 When a part of the hand is hidden, the process proceeds to step S202 (Yes in step S201), and the second position recognition unit 203 recognizes the partly hidden player's hand using the auxiliary information.

一方、手の一部が隠れていない場合処理はステップＳ１０３に進み、第１位置認識部２０２により演奏者の手が認識される。 On the other hand, if a part of the hand is not hidden, the process proceeds to step S103, where the first position recognition unit 202 recognizes the player's hand.

これ以降の処理は第１の実施の形態におけるものと同様である。 Subsequent processes are the same as those in the first embodiment.

また、図１２のフローチャートに示すように、複数の入力画像により構成されるフレーズ、曲の一部または全部の複合演奏情報を生成する処理においても図１１のフローチャートにおけるステップＳ２０１とステップＳ２０２と同様の処理が行われる。 Also, as shown in the flowchart of FIG. 12, in a process of generating composite performance information of a phrase composed of a plurality of input images and a part or all of a tune, the same as steps S201 and S202 in the flowchart of FIG. Processing is performed.

この第２の実施の形態によれば、入力画像において演奏者の手の一部が隠れていても第１の実施の形態と同様に演奏情報、楽譜情報の生成を行うことができる。 According to the second embodiment, even if a part of the player's hand is hidden in the input image, performance information and score information can be generated in the same manner as in the first embodiment.

＜３．第３の実施の形態＞ <3. Third Embodiment>
［３−１．情報処理装置の構成］ [3-1. Configuration of Information Processing Apparatus]
次に本技術の第３の実施の形態について説明する。第３の実施の形態は図１３に示すように、入力画像において楽器の一部が隠れているまたは映っていない場合において演奏情報の生成を行うものである。図１３においては、楽器であるピアノの鍵盤の一部のみが映っており、鍵盤の一部が入力画像の画角外に存在している。なお、情報処理装置３００が動作する端末装置１０の構成は第１の実施の形態と同様であるためその説明を省略する。 Next, a third embodiment of the present technology will be described. As shown in FIG. 13, the third embodiment generates performance information when part of the musical instrument is hidden or not captured in the input image. In FIG. 13, only part of the keyboard of the piano, which is the musical instrument, is captured, and part of the keyboard lies outside the angle of view of the input image. The configuration of the terminal device 10 on which the information processing device 300 operates is the same as in the first embodiment, so its description is omitted.

図１４に示すように、情報処理装置３００は、画像入力部１０１、センサ情報取得部３０１、位置認識部１０２、形状認識部１０３、動き認識部１０４、楽器認識部１０５、関連性認識部１０６、演奏情報生成部１０７、楽譜情報生成部１０８とから構成されている。画像入力部１０１、位置認識部１０２、形状認識部１０３、動き認識部１０４、演奏情報生成部１０７、楽譜情報生成部１０８は第１の実施の形態と同様のものである。 As shown in FIG. 14, the information processing apparatus 300 includes an image input unit 101, a sensor information acquisition unit 301, a position recognition unit 102, a shape recognition unit 103, a motion recognition unit 104, a musical instrument recognition unit 105, a relevance recognition unit 106, It comprises a performance information generator 107 and a score information generator 108. The image input unit 101, the position recognition unit 102, the shape recognition unit 103, the motion recognition unit 104, the performance information generation unit 107, and the score information generation unit 108 are the same as those in the first embodiment.

センサ情報取得部２０１は、端末装置１０が備える、または端末装置１０に接続された外部のセンサで取得されたセンサ情報を取得して演奏情報生成部１０７に供給するものである。センサとしては、マイクロホン、圧力センサ、動きセンサなどがある。 The sensor information acquiring unit 201 acquires sensor information acquired by an external sensor provided in the terminal device 10 or connected to the terminal device 10 and supplies the acquired sensor information to the performance information generating unit 107. The sensors include a microphone, a pressure sensor, a motion sensor, and the like.

楽器認識部１０５は、ＣＮＮ、パターンマッチング、テンプレートマッチングなどの技術を用いて、入力画像中における楽器およびその楽器において演奏者の手が接触する演奏のための領域（演奏領域）を認識するものである。そこで、例えば、テンプレートマッチングで楽器の一部分のみがテンプレートと一致するような場合、認識された楽器は一部分が隠れているまたは映っていないと判断する。入力画像に楽器の一部分しか映ってないことを示す情報と共に楽器認識情報は関連性認識部１０６に供給される。 The musical instrument recognizing unit 105 recognizes a musical instrument in an input image and a performance area (performance area) in which the player's hand touches the musical instrument in the input image using techniques such as CNN, pattern matching, and template matching. is there. Therefore, for example, when only a part of the musical instrument matches the template in the template matching, it is determined that the recognized musical instrument has a part hidden or not reflected. The instrument recognition information is supplied to the association recognition unit 106 together with information indicating that only a part of the instrument is reflected in the input image.

関連性認識部１０６は、ＣＮＮ、パターンマッチングなどの技術を用いて演奏者の手の位置と楽器の演奏領域の関連性を認識する。関連性とは、演奏者の手が楽器の演奏領域のどこに接触しているかを示す接触位置である。また、関連性は、楽器の演奏領域に対する演奏者の手の動作の方向である。関連性認識部１０６は、入力画像中において楽器の一部しか映っていない場合、手の位置情報、手の形状情報、楽器（例えばピアノ）の演奏領域である鍵盤が並ぶ方向に対する略水平方向における腕／肘の開き具合の角度、腕の動きから指が接触している演奏領域を推定ことにより演奏者の指と楽器の演奏領域の関連性を認識する。関連性情報は演奏情報生成部１０７に供給される。 The relevance recognition unit 106 recognizes the relevance between the position of the player's hand and the performance area of the musical instrument using techniques such as CNN and pattern matching. The relevance is the contact position indicating where in the performance area of the instrument the player's hand is touching. The relevance is also the direction of movement of the player's hand relative to the performance area of the instrument. When only part of the instrument appears in the input image, the relevance recognition unit 106 recognizes the relevance between the player's fingers and the performance area by estimating the performance area that the fingers are touching from the hand position information, the hand shape information, the opening angle of the arm/elbow in the direction substantially horizontal to the direction in which the keys, the performance area of the instrument (for example, a piano), are arranged, and the movement of the arm. The relevance information is supplied to the performance information generation unit 107.

演奏情報生成部１０７は補助情報としてセンサ情報を用いて指が接触している鍵盤を推定する。センサ情報としては、マイクロホンで集音される演奏の音、手または指が楽器を押圧する力を示す圧力センサ情報、演奏者の腕／手／指の動き示す動きセンサ情報などがある。さらに関連性認識部１０６は、複数の入力画像において楽器全体が写っている入力画像がある場合、その入力画像と腕、手の動き情報から指が接触している鍵盤を推定することにより演奏者の指と楽器の演奏領域の関連性を推定する。 The performance information generation unit 107 estimates the key that a finger is touching using sensor information as auxiliary information. The sensor information includes the sound of the performance collected by a microphone, pressure sensor information indicating the force with which a hand or finger presses the instrument, motion sensor information indicating the movement of the player's arm/hand/finger, and the like. Further, when the plurality of input images includes an input image in which the whole instrument appears, the relevance recognition unit 106 estimates the relevance between the player's fingers and the performance area of the instrument by estimating the key that a finger is touching from that input image and the arm and hand movement information.
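As one illustration of using microphone information as auxiliary information, the sketch below narrows a set of candidate keys (obtained, say, from the partially visible keyboard) using a detected fundamental frequency. The candidate list, the use of MIDI note numbers for keys, and the availability of a frequency estimate are assumptions; the frequency-to-note conversion uses the standard equal-temperament relation with A4 = 440 Hz = MIDI note 69.

```python
import math


def frequency_to_midi(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def pick_key(candidate_keys, freq_hz):
    """Choose the candidate key closest in pitch to the detected sound.

    candidate_keys is an assumed list of MIDI note numbers for the keys
    that the image-based estimate considers plausible.
    """
    target = frequency_to_midi(freq_hz)
    return min(candidate_keys, key=lambda k: abs(k - target))
```

For instance, a detected frequency of about 261.63 Hz maps to MIDI note 60 (middle C), so `pick_key([57, 60, 64], 261.63)` selects 60.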

By estimating the key that a finger is touching in this way, the first, second, and third performance elements can be generated from the estimation result in the same manner as in the first embodiment.

The information processing device 300 according to the third embodiment is configured as described above.

[3-2. Processing of information processing device]
Next, the flow of processing of the information processing device 300 according to the third embodiment will be described. The flowchart of FIG. 15 corresponds to the processing, described in the first embodiment, for generating the performance information and the musical score information corresponding to one input image.

Steps S101 to S105 are the same as the processing in the first embodiment.

In step S301, the relevance recognition unit 106 determines whether the entire playing area of the musical instrument appears in the input image. If it does, the process proceeds to step S106 (Yes in step S301), and steps S106 to S112 are performed in the same manner as in the first embodiment.

On the other hand, if the entire playing area of the musical instrument does not appear in the input image, the process proceeds to step S302 (No in step S301). In step S302, the relevance recognition unit 106 estimates the relevance using the hand position information, the sensor information, and the like.

Thereafter, steps S106 to S112 are performed in the same manner as in the first embodiment, and the partial performance information is generated and output.
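The branch at step S301 can be sketched as follows. This is a minimal illustration only: the `FrameObservation` class, the `relevance_for_frame` function, and the two callables standing in for direct recognition and for the estimation of step S302 are hypothetical names, not part of the specification.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FrameObservation:
    """Hypothetical per-image result of steps S101 to S105."""
    playing_area_fully_visible: bool
    hand_position: Tuple[float, float]


def relevance_for_frame(
    obs: FrameObservation,
    recognize: Callable[[FrameObservation], int],
    estimate: Callable[[FrameObservation], int],
) -> Tuple[str, int]:
    """Return how the relevance was obtained and the contact position."""
    if obs.playing_area_fully_visible:  # step S301: Yes
        return "recognized", recognize(obs)
    return "estimated", estimate(obs)   # step S301: No, so run step S302


# Stub callables for illustration; real ones would use the image and sensors.
full = FrameObservation(True, (0.5, 0.2))
partial = FrameObservation(False, (0.9, 0.2))
print(relevance_for_frame(full, lambda o: 44, lambda o: 80))
print(relevance_for_frame(partial, lambda o: 44, lambda o: 80))
```

In both branches the result would then feed the same downstream steps (S106 to S112), which is why the function returns a single contact position regardless of how it was obtained.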

Also, as shown in the flowchart of FIG. 16, processing similar to steps S301 and S302 in the flowchart of FIG. 15 is performed in the process of generating composite performance information for a phrase, or for part or all of a piece of music, composed of a plurality of input images.

According to the third embodiment, performance information and musical score information can be generated in the same manner as in the first embodiment even if a part of the musical instrument does not appear in the input image.

<4. Modifications>
Although the embodiments of the present technology have been specifically described above, the present technology is not limited to the above-described embodiments, and various modifications based on the technical idea of the present technology are possible.

The embodiments describe that performance information and musical score information can be generated, even without the sound of the performance, from a plurality of consecutive still images or from a plurality of frame images constituting a moving image. However, the present technology does not exclude the use of sound. Sound information may be used as auxiliary information when generating the performance information and musical score information, or when checking the accuracy of the generated performance information and musical score information. For example, audio recognition processing may be applied to the audio of the input video to recognize the scale (pitch) from the frequency of the sound, or to recognize the dynamics and whether or not the instrument is being played from the volume.
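The use of sound as auxiliary information can be illustrated with a short Python sketch. It maps a frequency to a note name using the standard equal-temperament relation (MIDI note 69 = A4 = 440 Hz) and uses an RMS threshold to decide whether the instrument is sounding. The function names and the threshold value are assumptions for this example; the specification does not prescribe a particular method.

```python
import math
from typing import Sequence

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frequency_to_note(freq_hz: float, a4_hz: float = 440.0) -> str:
    """Map a frequency to the nearest equal-temperament note name."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))  # MIDI note number
    octave = midi // 12 - 1  # MIDI note 60 is C4 under this convention
    return f"{NOTE_NAMES[midi % 12]}{octave}"


def is_playing(samples: Sequence[float], threshold: float = 0.01) -> bool:
    """Treat the instrument as sounding when the RMS level reaches threshold."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold  # threshold is an arbitrary illustrative value


print(frequency_to_note(440.0))       # A4
print(frequency_to_note(261.63))      # C4
print(is_playing([0.0, 0.0, 0.0]))    # False
print(is_playing([0.5, -0.5, 0.5]))   # True
```

The same RMS value could also serve as a rough indicator of dynamics, with larger values corresponding to louder playing.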

By combining the second embodiment and the third embodiment, performance information can be generated even when a part of the player's hand and a part of the playing area of the musical instrument do not appear in the input image.

The present technology is not limited to the piano, guitar, and drums described in the embodiments, and can also be used for the performance of instruments such as the xylophone, the metallophone, and percussion instruments.

The embodiments have been described mainly using piano playing methods such as pressing and striking, and guitar playing methods such as strumming with the hand and plucking with the fingernails, as examples. However, performance information may also be generated by recognizing other playing actions, such as pulling or flicking. Performance information may be generated based on any action of playing a musical instrument that can be recognized from the input image.

In the third embodiment, a part of the musical instrument that does not appear in the input image may be estimated, and the performance information generation unit 107 may generate the performance information based on the estimation result.

The present technology can also have the following configurations.
(1)
An information processing apparatus comprising:
a position recognition unit that recognizes a position of a part of a player's body from an input image;
an instrument recognition unit that recognizes a musical instrument from the input image; and
a performance information generation unit that generates, based on a relevance between the position of the part and the musical instrument, performance information indicating a performance of the musical instrument by the player.
(2)
The information processing apparatus according to (1), further comprising a shape recognition unit that recognizes a shape of the part recognized by the position recognition unit,
wherein the performance information generation unit generates the performance information based on a relevance between the shape of the part and the musical instrument.
(3)
The information processing apparatus according to (1) or (2), further comprising a movement recognition unit that recognizes a movement of the part recognized by the position recognition unit,
wherein the performance information generation unit generates the performance information based on a relevance between the movement of the part and the musical instrument.
(4)
The information processing apparatus according to any one of (1) to (3), wherein the performance information includes a first performance element corresponding to a state in which the player is playing the musical instrument.
(5)
The information processing device according to (4), wherein the first performance element includes a scale played by the player.
(6)
The information processing apparatus according to any one of (1) to (5), wherein the performance information includes a second performance element corresponding to a state in which the player is not playing the musical instrument.
(7)
The information processing apparatus according to (6), wherein the second performance element includes the length of a rest during which the player is not playing.
(8)
The information processing apparatus according to any one of (1) to (7), wherein the performance information includes a third performance element that is an element spanning a plurality of the input images.
(9)
The information processing device according to (8), wherein the third performance element includes a tempo of a song played by the player.
(10)
The information processing apparatus according to any one of (1) to (9), wherein the performance information generation unit generates the performance information corresponding to one of the input images.
(11)
The information processing apparatus according to any one of (1) to (9), wherein the performance information generation unit generates performance information corresponding to part or all of a performance of the musical instrument composed of a plurality of the input images.
(12)
The information processing apparatus according to any one of (1) to (11), wherein the relevance is a contact position of the part with the musical instrument.
(13)
The information processing apparatus according to any one of (1) to (12), wherein the relevance is a direction of movement of the part with respect to the musical instrument.
(14)
The information processing apparatus according to any one of (1) to (13), wherein the part is a hand of the player.
(15)
The information processing apparatus according to any one of (1) to (14), further comprising a score information generating unit configured to generate score information from the performance information.
(16)
The information processing apparatus according to any one of (1) to (15), wherein, when a portion of the part does not appear in the input image, the position of that portion is estimated, and the performance information generation unit generates the performance information based on the estimation result.
(17)
The information processing apparatus according to any one of (1) to (16), wherein, when a part of the musical instrument does not appear in the input image, the performance information generation unit estimates the relevance between the part of the body and that part of the musical instrument, and generates the performance information based on the estimation result.
(18)
The information processing apparatus according to any one of (1) to (17), wherein the performance information generation unit generates the performance information using sound information as auxiliary information.
(19)
An information processing method comprising:
recognizing a position of a part of a player's body from an input image;
recognizing a musical instrument from the input image; and
generating, based on a relevance between the position of the part and the musical instrument, performance information indicating a performance of the musical instrument by the player.
(20)
An information processing program for causing a computer to execute an information processing method comprising:
recognizing a position of a part of a player's body from an input image;
recognizing a musical instrument from the input image; and
generating, based on a relevance between the position of the part and the musical instrument, performance information indicating a performance of the musical instrument by the player.

100, 200, 300 ... Information processing device
102 ... Position recognition unit
103 ... Shape recognition unit
104 ... Movement recognition unit
105 ... Instrument recognition unit
107 ... Performance information generation unit
108 ... Musical score information generation unit
202 ... First position recognition unit
203 ... Second position recognition unit

Claims

An information processing apparatus comprising:
a position recognition unit that recognizes a position of a part of a player's body from an input image;
an instrument recognition unit that recognizes a musical instrument from the input image; and
a performance information generation unit that generates, based on a relevance between the position of the part and the musical instrument, performance information indicating a performance of the musical instrument by the player.

The information processing apparatus according to claim 1, further comprising a shape recognition unit that recognizes a shape of the part recognized by the position recognition unit,
wherein the performance information generation unit generates the performance information based on a relevance between the shape of the part and the musical instrument.

The information processing apparatus according to claim 1, further comprising a movement recognition unit that recognizes a movement of the part recognized by the position recognition unit,
wherein the performance information generation unit generates the performance information based on a relevance between the movement of the part and the musical instrument.

The information processing apparatus according to claim 1, wherein the performance information includes a first performance element corresponding to a state in which the player is playing the musical instrument.

The information processing apparatus according to claim 4, wherein the first performance element includes a scale played by the player.

The information processing apparatus according to claim 1, wherein the performance information includes a second performance element corresponding to a state in which the player does not play the musical instrument.

The information processing apparatus according to claim 6, wherein the second performance element includes the length of a rest during which the player is not playing.

The information processing apparatus according to claim 1, wherein the performance information includes a third performance element that is an element spanning a plurality of the input images.

The information processing apparatus according to claim 8, wherein the third performance element includes a tempo of a music piece played by the player.

The information processing apparatus according to claim 1, wherein the performance information generation unit generates the performance information corresponding to one of the input images.

The information processing apparatus according to claim 1, wherein the performance information generation unit generates performance information corresponding to part or all of a performance of the musical instrument composed of a plurality of the input images.

The information processing apparatus according to claim 1, wherein the relevance is a contact position of the part with the musical instrument.

The information processing apparatus according to claim 1, wherein the relevance is a direction of movement of the part with respect to the musical instrument.

The information processing apparatus according to claim 1, wherein the part is a hand of the player.

The information processing apparatus according to claim 1, further comprising a musical score information generating unit configured to generate musical score information from the performance information.

The information processing apparatus according to claim 1, wherein, when a portion of the part does not appear in the input image, the position of that portion is estimated, and the performance information generation unit generates the performance information based on the estimation result.

The information processing apparatus according to claim 1, wherein, when a part of the musical instrument does not appear in the input image, the performance information generation unit estimates the relevance between the part of the body and that part of the musical instrument, and generates the performance information based on the estimation result.

The information processing apparatus according to claim 1, wherein the performance information generation unit generates the performance information using sound information as auxiliary information.

An information processing method comprising:
recognizing a position of a part of a player's body from an input image;
recognizing a musical instrument from the input image; and
generating, based on a relevance between the position of the part and the musical instrument, performance information indicating a performance of the musical instrument by the player.

An information processing program for causing a computer to execute an information processing method comprising:
recognizing a position of a part of a player's body from an input image;
recognizing a musical instrument from the input image; and
generating, based on a relevance between the position of the part and the musical instrument, performance information indicating a performance of the musical instrument by the player.