JP6676562B2

JP6676562B2 - Image synthesizing apparatus, image synthesizing method, and computer program

Info

Publication number: JP6676562B2
Application number: JP2017023668A
Authority: JP
Inventors: 和樹岡見; 広太竹内; 英明木全
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Current assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Priority date: 2017-02-10
Filing date: 2017-02-10
Publication date: 2020-04-08
Anticipated expiration: 2037-02-10
Also published as: JP2018129009A

Description

The present invention relates to a technique for generating free-viewpoint video.

In free-viewpoint video, video from an arbitrary viewpoint is synthesized from video captured by cameras arranged at a plurality of positions. This synthesis makes it possible to view the scene from any viewpoint. Free-viewpoint video has long been studied as a next-generation video medium. By reconstructing the three-dimensional shape of the subjects in a scene, it becomes possible to generate video from a viewpoint at which no camera is actually placed.

One representative study that achieves high-quality free-viewpoint video is that of Collet et al. (see Non-Patent Document 1). That study proposes a complete pipeline for capturing, synthesizing, and distributing free-viewpoint video, and its technique can synthesize high-quality free-viewpoint video. However, it requires a large number of cameras and infrared cameras. The background must also be restricted to a uniform color in order to extract the subject region. Furthermore, calibration specialized for this special environment is required. The shooting environment is thus subject to very strict constraints, which makes the technique difficult to use in real scenes.

As another study, a free-viewpoint video synthesis method that uses distance sensors and operates under comparatively realistic constraints has been proposed (see Non-Patent Document 2). However, the synthesis quality of that technique is not sufficiently high. Occlusion and temporal flicker are among the major factors that degrade synthesis quality. Occlusion requires reproducing information that was never acquired, so it is in principle impossible to resolve without using prior information or the like. Temporal flicker arises because the distance sensor is affected by infrared interference and the like, so the information acquired varies from frame to frame. Attempts have been made to resolve this by filtering the distance information and similar means. Although this has brought improvement, the problem has not yet been sufficiently eliminated. Occlusion and temporal flicker must be solved because even a slight occurrence makes viewers feel that something is seriously wrong.

Non-Patent Document 1: A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, S. Sullivan, “High-quality streamable free-viewpoint video,” ACM Transactions on Graphics, 34(4), 2015.
Non-Patent Document 2: D. Alexiadis, D. Zarpalas, P. Daras, “Fast and smooth 3D reconstruction using multiple RGB-Depth sensors,” in Visual Communications and Image Processing Conference 2014, pp.173-176.

As described above, conventional free-viewpoint video techniques still have problems to be solved, and generating images of sufficient quality under constraints that are usable in real scenes has not been achieved.
In view of the above circumstances, an object of the present invention is to provide a technique for generating an image for an arbitrary field of view and time with higher quality, using video obtained without imposing strict constraints such as restricting the background to a uniform color.

One aspect of the present invention is an image synthesizing apparatus comprising: an input unit that receives input of a plurality of moving images and parameters relating to the three-dimensional shape of a subject captured in the moving images; an estimation unit that estimates three-dimensional joint information of the subject for the frame at each time, based on a machine learning result obtained in advance; and an image synthesis unit that acquires information indicating the three-dimensional shape of the subject at each time based on the estimation result of the three-dimensional joint information in the frame at each time and the parameters relating to the three-dimensional shape of the subject, and that generates, based on the information indicating the three-dimensional shape of the subject, an image including an image of the subject in a specified field of view at a specified time.

One aspect of the present invention is the above image synthesizing apparatus, wherein the parameters relating to the three-dimensional shape of the subject include reference shape information indicating the three-dimensional shape of the subject, and reference joint information indicating information on each joint in the three-dimensional shape indicated by the reference shape information.

One aspect of the present invention is the above image synthesizing apparatus, further comprising: a frame classification unit that determines, for the frame at each time constituting the moving images and based on the image of the frame, whether or not the frame is a key frame, that is, a frame for which the joint information of the subject is not to be obtained by interpolation; and a frame interpolation unit that, for frames other than key frames, acquires the three-dimensional joint information by interpolation using the estimation results of the three-dimensional joint information in the key frames.

One aspect of the present invention is the above image synthesizing apparatus, further comprising: a data generation unit that generates learning data including, for a three-dimensional model representing the three-dimensional shape of a living thing or object having joints, images rendered from a plurality of fields of view and joint information indicating the positions and angles of the joints of the three-dimensional model at the time of rendering; and a learning unit that performs machine learning using the images and the joint information generated by the data generation unit, thereby acquiring a learning result for estimating, from a target image that is an image to be processed, the joint information of the living thing or object captured in the target image, wherein the estimation unit estimates the three-dimensional joint information based on the result of the machine learning by the learning unit.

One aspect of the present invention is the above image synthesizing apparatus, wherein the data generation unit generates, based on the same three-dimensional model, a plurality of scenes that differ in the joint information, and renders images of one or more fields of view for each scene.

One aspect of the present invention is an image synthesizing method comprising: an input step of receiving input of a plurality of moving images and parameters relating to the three-dimensional shape of a subject captured in the moving images; an estimation step of estimating three-dimensional joint information of the subject for the frame at each time, based on a machine learning result obtained in advance; and an image synthesis step of acquiring information indicating the three-dimensional shape of the subject at each time based on the estimation result of the three-dimensional joint information in the frame at each time and the parameters relating to the three-dimensional shape of the subject, and generating, based on the information indicating the three-dimensional shape of the subject, an image including an image of the subject in a specified field of view at a specified time.

One aspect of the present invention is a computer program for causing a computer to function as the above image synthesizing apparatus.

According to the present invention, an image for an arbitrary field of view and time can be generated with higher quality by using video obtained without imposing strict constraints such as restricting the background to a uniform color.

A schematic block diagram illustrating a configuration example of the image synthesizing apparatus 10 according to the embodiment.
A diagram illustrating a configuration example of the data generation unit 111.
A diagram illustrating a configuration example of the learning unit 112.
A diagram illustrating a specific example of the network constructed by the network construction unit 211.
A diagram illustrating a configuration example of the input unit 12.
A diagram illustrating a configuration example of the estimation unit 13.
A diagram illustrating a configuration example of the frame classification unit 14.
A diagram illustrating a configuration example of the frame interpolation unit 15.
A diagram illustrating a configuration example of the image synthesis unit 16.
A diagram illustrating a specific example of the flow of preprocessing in the image synthesizing apparatus 10.
A diagram illustrating a specific example of processing relating to image synthesis.

FIG. 1 is a schematic block diagram illustrating a configuration example of the image synthesizing apparatus 10 according to the embodiment. The image synthesizing apparatus 10 includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes an image synthesis program. By executing the image synthesis program, the image synthesizing apparatus 10 functions as an apparatus including a learning device 11, an input unit 12, an estimation unit 13, a frame classification unit 14, a frame interpolation unit 15, and an image synthesis unit 16. All or some of the functions of the image synthesizing apparatus 10 may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The image synthesis program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The image synthesis program may be transmitted via a telecommunication line.

The learning device 11 includes a data generation unit 111 and a learning unit 112.
First, the data generation unit 111 will be described. The data generation unit 111 generates the learning data used by the learning unit 112. The learning data generated by the data generation unit 111 includes image data and joint information. The image data is generated as computer graphics in which a previously generated three-dimensional person model appears in the field of view in a predetermined posture. Using one or more three-dimensional person models, the data generation unit 111 forms one or more scenes for each three-dimensional person model and generates computer graphics in one or more fields of view for each scene, thereby generating a plurality of image data items.

A three-dimensional person model is data that includes, for example, the position of each joint of a person, the surface shape of the person, and an image of the person's surface (a texture image). Using a three-dimensional person model makes it possible to generate an image of a person in a desired posture from a desired field of view. Such a three-dimensional person model may be data created in advance by hand, data created by artificial intelligence (AI), or data created using a technique for recording three-dimensional shape, such as motion capture. The number of joints in the three-dimensional person model may be determined as appropriate according to, for example, the accuracy required of the estimation process in the estimation unit 13. For example, the number of joints may be 15, or it may be smaller or larger. When, for example, fingertip movement must be estimated more accurately, it is desirable to use more joints in the corresponding part, in which case the total number of joints increases.

The predetermined posture is a posture of a person defined by the positions and angles of the person's joints. By changing the positions and angles of the joints of a three-dimensional person model, a plurality of postures can be obtained from the same three-dimensional person model.

A scene is the modeling data of the space in which the computer graphics are generated (hereinafter, the "target space"). A scene is defined by a three-dimensional person model and environment information. Environment information is information about the things located in the target space. For example, the environment information indicates the position of a light source in the target space, the type of the light source, the direction in which the light source emits light, the intensity of the emitted light, and the material and position of objects (walls, furniture, plants, animals, and so on) in the target space. If any one of the posture or position of the three-dimensional person model or the environment information differs, the scene is a different scene. The plurality of scenes may be created in advance by hand or by artificial intelligence.

As described above, the data generation unit 111 may render computer graphics in a plurality of fields of view for a single scene. A field of view is defined, for example, by the position of the viewpoint and the direction of the line of sight. Viewpoint positions may be set all around the scene. The fields of view may also be changed according to information such as the characteristics and intended use of the scene. For example, if the field of view of the images to be actually processed, which are input to the input unit 12 described later, is determined in advance, the computer graphics may be generated in that same field of view.

The joint information includes information indicating the position of each joint of the three-dimensional person model (joint position information) and information indicating the angle formed by each joint (joint angle information). The joint position information may be expressed, for example, as the three-dimensional coordinates of each joint with a predetermined position of the three-dimensional person model as the origin. The joint position information may also be expressed, for example, relative to camera coordinates in the scene. The joint angle information may be expressed, for example, using Euler angles, using quaternions, or in another representation. The joint information is generated for each scene of the three-dimensional person model. Even if the field of view changes, the joint information does not change as long as the three-dimensional person model and the scene do not change.

The data generation unit 111 generates computer graphics in one or more fields of view for the same scene (meaning that the three-dimensional person model and the scene are the same). The computer graphics of one or more fields of view generated for the same scene are called a scene CG set. The data generation unit 111 outputs, as a unit learning data item, data in which a scene CG set is associated with the joint information for that scene. The data generation unit 111 generates a plurality of such unit learning data items and outputs learning data that includes them.

FIG. 2 is a diagram illustrating a configuration example of the data generation unit 111. For reasons of space in the drawing, the three-dimensional person model is labeled "3D model" in FIG. 2. A plurality of (for example, N) three-dimensional person models are input to the data generation unit 111. The data generation unit 111 includes a scene generation unit 201 and an image generation unit 202. The scene generation unit 201 generates one or more (for example, M) scenes for each input three-dimensional person model. A different number of scenes may be generated for each three-dimensional person model. The image generation unit 202 generates computer graphics for one or more (for example, L) fields of view for each scene. The image generation unit 202 generates a plurality of unit learning data items based on the generated computer graphics.
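The bookkeeping implied by N models, M scenes per model, and L views per scene can be sketched as follows. This is only an illustrative sketch: the class name `UnitLearningData`, its fields, and the function `enumerate_unit_learning_data` are hypothetical and do not appear in the patent, and no actual rendering is performed. It shows that one unit learning data item is produced per (model, scene) pair, with a single joint-information record shared by all views of that scene.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UnitLearningData:
    """Hypothetical container: one scene CG set plus its joint information."""
    model_id: int
    scene_id: int
    view_ids: Tuple[int, ...]        # the views that make up the scene CG set
    joint_info_key: Tuple[int, int]  # joint information is per scene, not per view


def enumerate_unit_learning_data(n_models: int,
                                 scenes_per_model: int,
                                 views_per_scene: int) -> List[UnitLearningData]:
    """List the unit learning data items for N models, M scenes, L views.

    Assumes, for simplicity, the same M for every model and the same L for
    every scene; the text allows these counts to differ.
    """
    units = []
    for model_id in range(n_models):
        for scene_id in range(scenes_per_model):
            units.append(UnitLearningData(
                model_id=model_id,
                scene_id=scene_id,
                view_ids=tuple(range(views_per_scene)),
                joint_info_key=(model_id, scene_id),
            ))
    return units


units = enumerate_unit_learning_data(n_models=2, scenes_per_model=3, views_per_scene=4)
total_images = sum(len(u.view_ids) for u in units)
```

Under these assumptions the learning data contains N × M unit learning data items and N × M × L rendered images.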

Next, the learning unit 112 will be described. The learning unit 112 performs learning processing based on the plurality of unit learning data items generated by the data generation unit 111. By executing machine learning, the learning unit 112 acquires the parameters used in the estimation process performed by the estimation unit 13. The estimation process estimates, from an image to be processed (hereinafter, the "target image"), the joint information of the person captured in the target image. Any machine learning technique may be implemented in the learning unit 112. For example, a deep neural network (DNN) or a support vector machine (SVM) may be applied.

FIG. 3 is a diagram illustrating a configuration example of the learning unit 112. Learning data including a plurality of unit learning data items is input to the learning unit 112. The learning unit 112 includes a network construction unit 211 and a parameter learning unit 212. The learning unit 112 shown in FIG. 3 is merely a specific example for the case where a DNN is applied. When another machine learning technique is applied to the learning unit 112, the configuration of the learning unit 112 may be changed according to the technique applied.

The network construction unit 211 constructs the network used for learning. For example, when a DNN is applied to the learning unit 112, the learning unit 112 constructs a deep neural network that takes a target image as input and outputs the three-dimensional joint information of the subject. The network constructed by the network construction unit 211 is built according to the number of three-dimensional joint information items to be output. For example, the number of dimensions of the output layer of the network is determined according to the number of joints required.

FIG. 4 is a diagram illustrating a specific example of the network constructed by the network construction unit 211. The network takes as input the L images included in a unit learning data item and outputs the joint information corresponding to those images. For example, each image included in the unit learning data (an image generated as computer graphics) is a 3 × 256 × 256 color image: 256 pixels high, 256 pixels wide, with three RGB channels. One such color image is input for each viewpoint. Each image is convolved using a 5 × 5 kernel with 36 channels (a 36 × 5 × 5 convolution), and 2 × 2 pooling is then applied. At this stage, the number of channels produced by the convolutional layer is 36 and the stride is 2. Next, these outputs are concatenated vertically by a CONCAT operation. The data in which the convolved multi-view images have been combined by this CONCAT operation is convolved with the same filters in the subsequent processing. Using a network with this structure makes it possible to extract the spatial features needed to obtain joint information. The structure and processing of the rest of the network follow those of a typical network.
The number of channels is 72, and a 72 × 3 × 3 convolution is performed using a 3 × 3 kernel. This convolution, together with 2 × 2 pooling with a stride of 2, is repeated twice. The results are arranged in sequence and fed into the FC (fully connected) layers. The FC layers consist of, for example, three layers, each using ReLU as its activation function. The numbers of nodes are 512, 1024, and 2048 in order from the upstream side. The output layer outputs the xyz coordinate position and the xyz Euler angles for each of the 15 joints. The output layer therefore has 90 dimensions.
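To make the layer sizes above easier to follow, the sketch below traces tensor shapes through the described network using only arithmetic. It is not the patented implementation. Several details are assumptions that the text does not state: convolutions are taken to use stride 1 with no padding, the stride of 2 is attributed to the pooling steps, "concatenated vertically" is read as stacking the per-view feature maps along the height axis, and the results are flattened before the FC layers. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output length of a pooling step along one axis."""
    return (size - window) // stride + 1


def trace_shapes(num_views, height=256, width=256, num_joints=15):
    """Return (stage name, shape) pairs for the network outlined in the text.

    Padding, convolution stride, and concatenation axis are assumptions.
    """
    shapes = []
    channels, h, w = 3, height, width
    shapes.append(("input per view", (channels, h, w)))

    # Per-view stage: 5x5 convolution to 36 channels, then 2x2 pooling.
    channels, h, w = 36, conv_out(h, 5), conv_out(w, 5)
    shapes.append(("conv 5x5, 36 channels", (channels, h, w)))
    h, w = pool_out(h), pool_out(w)
    shapes.append(("pool 2x2, stride 2", (channels, h, w)))

    # CONCAT: stack the feature maps of all views along the height axis.
    h = h * num_views
    shapes.append(("concat views vertically", (channels, h, w)))

    # Shared stage: 3x3 convolution to 72 channels plus pooling, twice.
    for _ in range(2):
        channels, h, w = 72, conv_out(h, 3), conv_out(w, 3)
        shapes.append(("conv 3x3, 72 channels", (channels, h, w)))
        h, w = pool_out(h), pool_out(w)
        shapes.append(("pool 2x2, stride 2", (channels, h, w)))

    shapes.append(("flatten", (channels * h * w,)))
    for nodes in (512, 1024, 2048):
        shapes.append((f"fc {nodes} + ReLU", (nodes,)))

    # Output: xyz position and xyz Euler angles (6 values) per joint.
    shapes.append(("output", (num_joints * 6,)))
    return shapes


for name, shape in trace_shapes(num_views=4):
    print(f"{name:26s} {shape}")
```

Whatever padding or concatenation axis is actually used, the final line reflects the one figure the text fixes explicitly: 15 joints × 6 values = 90 output dimensions.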

The network of FIG. 4 described above is only one example. Other values may be used for the size and stride of each kernel, and any type of activation function may be used. However, even when the number of joints changes, it is desirable that the output joint information consist of the xyz coordinate position and the xyz Euler angles for each joint.
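The per-joint output convention can be illustrated with a small packing helper. This is a sketch under an assumed layout: the six values of each joint are stored contiguously as (x, y, z, rx, ry, rz). The text states only that each joint has an xyz position and xyz Euler angles; the ordering within the vector and the function names `pack_joint_info` and `unpack_joint_info` are hypothetical.

```python
def pack_joint_info(positions, euler_angles):
    """Flatten per-joint (x, y, z) positions and (rx, ry, rz) Euler angles.

    Assumed layout: [x, y, z, rx, ry, rz] for joint 0, then joint 1, and so on.
    """
    if len(positions) != len(euler_angles):
        raise ValueError("positions and euler_angles must cover the same joints")
    flat = []
    for (px, py, pz), (ax, ay, az) in zip(positions, euler_angles):
        flat.extend([px, py, pz, ax, ay, az])
    return flat


def unpack_joint_info(flat):
    """Inverse of pack_joint_info: recover positions and Euler angles."""
    if len(flat) % 6 != 0:
        raise ValueError("vector length must be a multiple of 6")
    positions = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 6)]
    euler_angles = [tuple(flat[i + 3:i + 6]) for i in range(0, len(flat), 6)]
    return positions, euler_angles


# 15 joints give a 90-dimensional vector, matching the output layer size.
example_positions = [(float(j), 0.0, 1.0) for j in range(15)]
example_angles = [(0.0, 0.5 * j, 0.0) for j in range(15)]
example_vector = pack_joint_info(example_positions, example_angles)
```

Because the layout is per joint, changing the number of joints changes only the vector length (6 values per joint), not the meaning of each slot.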

Constructing such a network yields a network for estimating the three-dimensional joint information of a subject from images. In addition, by stacking images from a plurality of fields of view, the joint information can be handled as joint information in space rather than as joint information on an image. This makes it possible to estimate three-dimensional joint information.

The parameter learning unit 212 acquires the parameters of the network constructed by the network construction unit 211 by performing machine learning on that network using the learning data generated by the data generation unit 111. The number of iterations and the initial parameters may be set manually to values considered optimal. In the examples of FIGS. 3 and 4, the parameters obtained by this learning processing are output from the learning unit 112.

The input unit 12 and the estimation unit 13 will now be described.
FIG. 5 is a diagram illustrating a configuration example of the input unit 12. The input unit 12 receives input of a plurality of moving images, reference shape information indicating the three-dimensional shape of a predetermined subject, reference joint information indicating the three-dimensional joint information of the predetermined subject, and deformation parameters of the predetermined subject. The plurality of moving images are obtained by capturing the same scene at the same time with cameras arranged at a plurality of positions. For example, the input is moving images obtained by capturing a field such as a soccer pitch at the same time (for example, for one hour from 13:00 to 14:00 on the same day) with a plurality of cameras arranged around the field. The moving images may be color, grayscale, or binary. The moving image data may be input from each camera in real time, or moving images recorded on a recording medium such as a hard disk drive (HDD) may be input. The moving images need not have been captured at exactly the same time; slight time offsets between them are acceptable.

The frame separation unit 121 separates each input moving image into a plurality of frame images. In the present embodiment, the image of each frame of an input moving image is processed as a target image. That is, the estimation process using the learning result of the learning device 11 is performed on the image of each frame. Each frame of a moving image is preferably tagged with the time at which the frame was captured (hereinafter, the "frame time"). If no frame time has been assigned, the input unit 12 may assign a frame time to each frame based on the shooting start time and the playback time of the moving image. In the following description, for simplicity, it is assumed that only one person is present in the moving images and that this person is the subject to be rendered using the reference shape information and the like (hereinafter, the "subject of interest"). When there are a plurality of subjects of interest, the input unit 12 may divide the moving images by subject of interest by cutting out person regions in the moving images. Applying the same processing as in the single-subject case to the moving images of each subject of interest allows the same processing to be performed even when there are a plurality of subjects of interest.
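One simple way to assign frame times when a moving image carries none is sketched below. It assumes a constant frame rate, which the text does not state; the function name `assign_frame_times` and the parameter `frames_per_second` are hypothetical. Each frame time is computed as the shooting start time plus the frame's offset within the playback.

```python
from datetime import datetime, timedelta


def assign_frame_times(start_time, num_frames, frames_per_second):
    """Assign a capture time to each frame, assuming a constant frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return [
        start_time + timedelta(seconds=index / frames_per_second)
        for index in range(num_frames)
    ]


# Example: a recording that starts at 13:00 and runs at an assumed 25 frames per second.
frame_times = assign_frame_times(datetime(2017, 2, 10, 13, 0, 0), 3, 25.0)
```

For variable-frame-rate recordings, per-frame presentation timestamps from the container would be a better source than this constant-rate approximation.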

The frame separation unit 121 associates frames obtained from different moving images with one another when, based on their frame times, they satisfy a predetermined condition indicating that they were captured at the same time. The predetermined condition may be, for example, that for a frame of a reference moving image, the frame is the one with the closest frame time among the frames obtained from another moving image. A set of frames associated as having been captured at the same time is called a same-time frame set. In the following processing, the frames included in a same-time frame set may be treated as having been shot at the same time regardless of their actual frame times.
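The nearest-frame-time condition described above could be implemented along the following lines. This is a sketch under assumptions: frame times are given in seconds as sorted lists, one moving image is chosen as the reference, and the names `nearest_index` and `build_same_time_frame_sets` are hypothetical.

```python
from bisect import bisect_left


def nearest_index(times, t):
    """Index of the entry in the sorted, non-empty list `times` closest to t."""
    if not times:
        raise ValueError("times must not be empty")
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    # Choose the closer neighbour; ties go to the earlier frame.
    return i - 1 if t - times[i - 1] <= times[i] - t else i


def build_same_time_frame_sets(reference_times, other_videos):
    """Pair every reference frame with the nearest-in-time frame of each other video.

    reference_times: sorted frame times of the reference moving image.
    other_videos: dict mapping a camera name to its sorted frame times.
    Returns one dict {camera name: frame index} per reference frame.
    """
    frame_sets = []
    for ref_idx, t in enumerate(reference_times):
        frame_set = {"reference": ref_idx}
        for name, times in other_videos.items():
            frame_set[name] = nearest_index(times, t)
        frame_sets.append(frame_set)
    return frame_sets
```

When the cameras run at different frame rates, the same frame of a slower camera can appear in several same-time frame sets, which is consistent with treating the frames of a set as simultaneous regardless of their actual frame times.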

The reference shape information indicates the three-dimensional shape of a subject (the subject of interest) that is determined in advance according to a predetermined criterion from among the subjects captured in the input moving images. For example, reference shape information is input for subjects that are particularly likely to attract attention. When moving images of a soccer match are input, for instance, reference shape information may be input for all players taking part in the match (the starting players and the players on the bench). The reference shape information may be obtained by performing three-dimensional shape reconstruction on the subject of interest in advance. For example, it may be generated from data obtained by measuring the subject of interest with a measuring device such as a distance sensor, or from still images captured by cameras at a plurality of positions. The reference shape information may be, for example, data (three-dimensional person model data) that includes the position of each joint of the person, the surface shape of the person, and an image of the person's surface (a texture image). Using three-dimensional person model data makes it possible to generate an image of the person in a desired posture from a desired field of view. The posture of the subject of interest in the reference shape information may be a posture such as a T-pose or an A-stance, or any other posture. When the input reference shape information contains defects or noise, the input unit 12 may improve the quality of the surface shape using a technique such as Poisson Surface Reconstruction (Reference 1) or general spatial filtering. With such processing, the subsequently reconstructed shape keeps a continuous surface, so that missing regions caused by a change in viewpoint position are less likely to occur.
Reference 1: M. Kazhdan, M. Bolitho, H. Hoppe, “Poisson Surface Reconstruction,” Symposium on Geometry Processing 2006, 61-70.

The reference joint information is the three-dimensional joint information of each joint in the posture of the input reference shape information. Three-dimensional joint information represents the position and the angle of a joint. The position of a joint is represented, for example, by xyz coordinates. The angle of a joint is represented, for example, by Euler angles about the x, y, and z axes. The reference joint information may be measured with a measurement technique such as motion capture when the reference shape information is generated, or may be obtained by performing an estimation process on an image of the subject of interest.

The deformation parameter determines how the shape of the subject of interest in the moving images deforms when the joints of the subject of interest corresponding to the reference shape information change. The deformation parameter is, for example, a parameter that defines the change in shape according to the rotation of a joint. The deformation parameter may be obtained in advance by measurement or the like. For example, using a general skinning method, the deformation parameter may be determined so as to be inversely proportional to the distance between a joint and a vertex of the shape. The deformation parameter may also be set manually using software.
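As an illustration of deformation parameters that are inversely proportional to the joint-to-vertex distance, the sketch below computes one weight per joint for a single vertex. The normalisation to a sum of 1 and the small constant `eps` that avoids division by zero are assumptions added here; the name `inverse_distance_weights` is hypothetical.

```python
import math


def inverse_distance_weights(vertex, joints, eps=1e-9):
    """Weight of each joint on one vertex, proportional to 1 / distance.

    vertex: (x, y, z) position of the vertex.
    joints: list of (x, y, z) joint positions.
    The weights are normalised so that they sum to 1 (an assumption).
    """
    raw = [1.0 / (math.dist(vertex, joint) + eps) for joint in joints]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    # A vertex three times closer to the first joint than to the second.
    print(inverse_distance_weights((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]))
```

In practice such weights are often restricted to the few nearest joints so that distant joints do not pull on a vertex at all.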

FIG. 6 is a diagram illustrating a configuration example of the estimation unit 13. The target images and the learning result are input to the estimation unit 13. The target images input to the estimation unit 13 are the images of each frame of each moving image received by the input unit 12. The estimation unit 13 estimates the three-dimensional joint information of the subject of interest shown in each target image using the result of the machine learning performed by the learning device 11. The result of machine learning used by the estimation unit 13 is, for example, a network to which the parameters obtained by the learning unit 112 have been given, or a classifier obtained by machine learning.

The frame classification unit 14 classifies each same-time frame set as either a key frame or a non-key frame using the joint information of the subject of interest estimated for each frame by the estimation unit 13. The frame classification unit 14 gives each classified frame a flag indicating the classification result (a key frame flag or a non-key frame flag). A key frame is a same-time frame set whose joint information is not subjected to interpolation processing in the frame interpolation unit 15 described later. A non-key frame is a same-time frame set whose joint information is subjected to that interpolation processing.

FIG. 7 is a diagram illustrating a configuration example of the frame classification unit 14. The frame classification unit 14 may, for example, give a key frame flag to same-time frame sets at a predetermined cycle and give a non-key frame flag to the other same-time frame sets. The frame classification unit 14 may instead give a key frame flag to a same-time frame set for which a predetermined condition is satisfied in the image. The predetermined condition may be, for example, that the moving speed of the subject of interest in the image takes an extreme value. The moving speed may be that of the entire subject of interest or that of some joints or body parts (for example, an arm or the face). In this case, the frame classification unit 14 may determine the moving speed of the subject of interest in the image and give a key frame flag when that speed takes an extreme value. Instead of the moving speed, the rate of change of the angle of some joints of the subject of interest may be used. The joint information estimation flag assigning unit 122 may give a key frame flag to a same-time frame set when any one of its frames satisfies the predetermined condition, or when the predetermined condition is satisfied in a predetermined number or more of its frames. A non-key frame flag is given to every same-time frame set that has not been given a key frame flag.
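The speed-extremum condition and the "any one frame" or "predetermined number or more" rule can be sketched as below. The sketch assumes that a speed value per frame is already available for each camera, and it always flags the first and last same-time frame sets as key frames so that every non-key frame lies between two key frames; that boundary rule and the names `speed_extrema`, `classify_frame_sets`, and `min_votes` are assumptions made here for illustration.

```python
def speed_extrema(speeds):
    """Indices at which the speed sequence has a strict local maximum or minimum."""
    return [
        i
        for i in range(1, len(speeds) - 1)
        if (speeds[i] > speeds[i - 1] and speeds[i] > speeds[i + 1])
        or (speeds[i] < speeds[i - 1] and speeds[i] < speeds[i + 1])
    ]


def classify_frame_sets(speeds_per_camera, min_votes=1):
    """Return one boolean per same-time frame set; True marks a key frame.

    speeds_per_camera: one speed sequence per camera, all of equal length.
    min_votes: number of cameras whose frame must show an extremum
               (1 corresponds to the "any one frame" rule).
    """
    n = len(speeds_per_camera[0])
    votes = [0] * n
    for speeds in speeds_per_camera:
        for i in speed_extrema(speeds):
            votes[i] += 1
    flags = [v >= min_votes for v in votes]
    # Assumption: both ends are key frames so interpolation is always bracketed.
    flags[0] = flags[-1] = True
    return flags
```

Every frame set left as `False` corresponds to a same-time frame set that receives the non-key frame flag.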

FIG. 8 is a diagram illustrating a configuration example of the frame interpolation unit 15. The frame interpolation unit 15 obtains the joint information of a same-time frame set to which a non-key frame flag has been given (hereinafter referred to as a “non-key frame”) by interpolation processing that uses the joint information of same-time frame sets that have a nearby frame time and to which a key frame flag has been given (hereinafter referred to as “key frames”). The frame interpolation unit 15 obtains joint information by interpolation processing for all non-key frames. The interpolation may be linear interpolation according to the number of frames between key frames, or the interpolation coefficients may be weighted in some way. Alternatively, the moving speed of the subject of interest or of each joint between frames may be calculated, and the interpolation coefficients may be obtained from the calculated speed. For example, let P be the nearest key frame with a frame time earlier than the non-key frame being processed, let F be the nearest key frame with a frame time later than it, and let the angle and angular velocity of the joint being processed be rP and vP at key frame P and rF and vF at key frame F. For simplicity, a single joint angle is described. In this case, taking the change in speed into account, the angle r of the joint being processed may be expressed as follows by performing internal division using the speeds.
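The expression referred to above is not reproduced in this text, so the following is only an assumed illustration of the two kinds of interpolation mentioned: plain linear interpolation between key frames P and F, and a velocity-aware variant. The velocity-aware variant uses cubic Hermite interpolation, which is a standard technique substituted here and not necessarily the internal division used in the embodiment. The function names and the time arguments `t_p`, `t_f`, and `t` are hypothetical, and angle wrap-around is ignored.

```python
def lerp_angle(r_p, r_f, t_p, t_f, t):
    """Linear interpolation of a joint angle between key frames P and F."""
    s = (t - t_p) / (t_f - t_p)
    return (1.0 - s) * r_p + s * r_f


def hermite_angle(r_p, v_p, r_f, v_f, t_p, t_f, t):
    """Cubic Hermite interpolation of a joint angle.

    Matches the angles (r_p, r_f) and angular velocities (v_p, v_f)
    at both key frames; an assumed stand-in for the speed-based
    internal division mentioned in the text.
    """
    h = t_f - t_p
    s = (t - t_p) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * r_p + h10 * h * v_p + h01 * r_f + h11 * h * v_f
```

When the angular velocities at both key frames equal the slope of the straight line between rP and rF, the Hermite curve reduces to the linear interpolation, which makes the two variants easy to compare.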

Next, the image synthesizing unit 16 will be described. The reference shape information, the reference joint information, the three-dimensional joint information, and the deformation parameters are input to the image synthesizing unit 16. The input three-dimensional joint information is either the joint information estimated by the estimation unit 13 or the joint information obtained by the interpolation processing of the frame interpolation unit 15: for key frames the estimated joint information is input, and for non-key frames the interpolated joint information is input. The image synthesizing unit 16 compares the three-dimensional joint information with the reference joint information and, based on the comparison result and the deformation parameters, deforms the reference shape to obtain a deformed shape. The image synthesizing unit 16 performs this processing on the three-dimensional joint information obtained for every same-time frame set, and synthesizes a free viewpoint video using the obtained deformed shapes.

FIG. 9 is a diagram illustrating a configuration example of the image synthesizing unit 16. The image synthesizing unit 16 is described in detail below with reference to FIG. 9. The image synthesizing unit 16 includes a joint displacement calculation unit 161, a deformation unit 162, and an image generation unit 163.

The joint displacement calculation unit 161 calculates the difference between the three-dimensional joint information and the reference joint information. For example, the joint displacement calculation unit 161 calculates the difference between the three-dimensional coordinates in the three-dimensional joint information and those in the reference joint information, and thereby obtains the positional deviations along the x-axis, y-axis, and z-axis. The joint displacement calculation unit 161 also calculates the difference between the three-dimensional angles in the estimated three-dimensional joint information and those in the reference joint information, and thereby obtains the deviations of the rotation angles about the x-axis, y-axis, and z-axis.
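The difference calculation above can be sketched as follows. The sketch assumes Euler angles in degrees and wraps angle differences into the range [-180, 180), a convention added here that the text does not specify; `JointInfo`, `wrap_degrees`, and `joint_displacement` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class JointInfo:
    position: tuple  # (x, y, z) coordinates of the joint
    angles: tuple    # Euler angles about the x, y, z axes, in degrees (assumed unit)


def wrap_degrees(a):
    """Map an angle difference into the range [-180, 180)."""
    return (a + 180.0) % 360.0 - 180.0


def joint_displacement(estimated, reference):
    """Per-axis position and rotation-angle deviations of one joint."""
    d_pos = tuple(e - r for e, r in zip(estimated.position, reference.position))
    d_ang = tuple(wrap_degrees(e - r) for e, r in zip(estimated.angles, reference.angles))
    return d_pos, d_ang
```

For example, an estimated angle of 350 degrees against a reference of 10 degrees is reported as a deviation of -20 degrees, so the shorter rotation is used.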

The deformation unit 162 deforms the reference shape based on the difference between the three-dimensional joint information and the reference joint information and on the deformation parameters. Through this processing, the deformation unit 162 deforms the reference shape so that it takes the same posture as the subject of interest had at the frame time of the same-time frame set being processed. The deformation unit 162 outputs the information on the shape after deformation as deformed shape information. Performing this processing for all same-time frame sets yields the deformed shape information corresponding to each same-time frame set. The deformation unit 162 may record the deformed shape information in a storage device in association with each frame time.
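As an illustration of how the joint differences and the deformation parameters could drive the deformation, the sketch below applies linear blend skinning to a single vertex. Linear blend skinning is a common general skinning technique chosen here as an assumption; the text does not specify the deformation model. For brevity each joint rotates about the z-axis only, and `rotate_z` and `blend_skin_vertex` are hypothetical names.

```python
import math


def rotate_z(point, pivot, degrees):
    """Rotate `point` about a z-parallel axis through `pivot`."""
    a = math.radians(degrees)
    x, y, z = (p - c for p, c in zip(point, pivot))
    return (
        pivot[0] + x * math.cos(a) - y * math.sin(a),
        pivot[1] + x * math.sin(a) + y * math.cos(a),
        pivot[2] + z,
    )


def blend_skin_vertex(vertex, joints, weights):
    """Linear blend skinning of one vertex.

    joints: list of (pivot, rotation_deg_about_z, translation) tuples,
            i.e. the per-joint deviation from the reference posture.
    weights: deformation parameters of this vertex, assumed to sum to 1.
    """
    out = [0.0, 0.0, 0.0]
    for (pivot, deg, shift), w in zip(joints, weights):
        moved = rotate_z(vertex, pivot, deg)
        for k in range(3):
            out[k] += w * (moved[k] + shift[k])
    return tuple(out)
```

Applying this to every vertex of the reference shape with the per-joint deviations of one same-time frame set would give one deformed shape for that frame time.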

The image generation unit 163 generates a free viewpoint video for a designated field of view and time. The field of view and the time may be designated, for example, by a device that plays back the free viewpoint video or by a person viewing it. The image generation unit 163 acquires the deformed shape information for the frame time corresponding to the designated time. The acquired deformed shape information indicates the position and posture of the subject of interest at that time. The image generation unit 163 renders an image in the designated field of view using the acquired deformed shape information; the image of the subject of interest is obtained by this rendering. The image generation unit 163 may render the background of the subject of interest based on model data of the background obtained in advance, or by applying image processing such as an affine transformation to one or more frame images with a nearby field of view in the corresponding same-time frame set. The image generation unit 163 generates the free viewpoint video for the designated field of view and time by combining the background image with the image of the subject of interest. When a moving image is requested, the image generation unit 163 may generate the free viewpoint video by repeating the above processing along the time axis.

FIG. 10 is a diagram illustrating a specific example of the flow of the preprocessing of the image synthesizing device 10. The preprocessing of the image synthesizing device 10 is executed by the learning device 11. First, the data generation unit 111 acquires a three-dimensional person model including joint information (step S101). Next, the data generation unit 111 generates scenes with a plurality of postures by changing the joint information of the three-dimensional person model (step S102). Next, the data generation unit 111 renders computer graphics (images) from a plurality of fields of view for each scene (step S103). The data generation unit 111 generates unit learning data by associating the plurality of images generated for each scene with the joint information (step S104). Next, the learning unit 112 executes a learning process based on the plurality of unit learning data (images and joint information) generated by the data generation unit 111 (step S105). The learning unit 112 sets the parameters obtained as a result of the learning process in the network (step S106).

FIG. 11 is a diagram illustrating a specific example of the processing related to image synthesis. First, the input unit 12 receives a plurality of moving images showing the subject of interest, reference shape information indicating the three-dimensional shape of the subject of interest, reference joint information indicating the three-dimensional joint information of the subject of interest, and the deformation parameters of the subject of interest (step S107). Next, the input unit 12 separates the input moving images into frames and performs processing such as resizing as needed (step S108). Next, the estimation unit 13 estimates the joint information of the subject for each same-time frame set (step S109). Next, the frame classification unit 14 classifies each same-time frame set as either a key frame or a non-key frame and gives it a flag according to the classification result (step S110). Next, the frame interpolation unit 15 obtains, by interpolation processing, the joint information for the same-time frame sets given a non-key frame flag (step S111). Next, the image synthesizing unit 16 generates a deformed shape for each same-time frame set (step S112).

Thereafter, at the time of generating the free viewpoint video, the image synthesizing unit 16 renders the image of the subject of interest at the designated time and field of view using the deformed shapes obtained in advance (step S113). The image synthesizing unit 16 then generates and outputs a composite image by combining the rendered image with the background image (step S114).

In the learning device 11 configured as described above, learning data is generated using the computer graphics rendered by the data generation unit 111. A large number of learning data items, each including three-dimensional joint information and an image, can therefore be generated more easily. Machine learning such as deep learning generally requires a large amount of learning data, so the learning device 11 described above is particularly effective in that setting.

Further, by changing the joint information of a single three-dimensional person model, images and joint information for a wide variety of postures can be generated from that one model. With conventional motion capture, for example, joint information had to be measured separately for each posture even for the same person (three-dimensional person model), which was laborious. With the learning device 11, by contrast, even if motion capture is used, once a three-dimensional person model has been acquired, learning data for a plurality of postures can easily be obtained afterwards by changing the joint information.

Further, since the joint information is defined in the three-dimensional person model itself, more accurate joint information can be obtained as learning data.

In the image synthesizing device 10 configured as described above, the following processing is performed to generate a free viewpoint video. Based on a plurality of actually captured moving images, the three-dimensional joint information of each joint of the subject of interest at each frame time is estimated. Based on the estimated three-dimensional joint information, the reference shape of the subject of interest obtained in advance is deformed, yielding the deformed shape information of the subject of interest at each frame time. When the free viewpoint video is actually generated, the video of the subject of interest is produced by rendering in the designated field of view using the deformed shape information of the subject of interest at the designated time. Therefore, even for parts that were hidden and thus not captured in the actually shot moving images (for example, the armpits or the area under the chin of the subject of interest), occlusion problems can be suppressed.

Further, when obtaining the deformed shape information at each frame time, the three-dimensional joint information of the subject of interest is not estimated independently from the moving images at every frame time. Instead, the estimation result of the estimation unit 13 is adopted only for the same-time frame sets at some frame times (the frame times given a key frame flag). For the same-time frame sets at the remaining frame times (the frame times given a non-key frame flag), the three-dimensional joint information is obtained not from the moving images but by interpolation processing based on the estimation results for the same-time frame sets given a key frame flag. Therefore, flicker in the time direction is unlikely to occur, at least between one same-time frame set given a key frame flag and the next. Such processing makes it possible to suppress flicker in the time direction.

(Modification)
The computer graphics generated by the data generation unit 111 may be color images, grayscale images, or binary images.

Although the target of processing by the image synthesizing device 10 described above is a person, the target need not be limited to a person. The target may be any living thing or object that has joints. For example, an animal may be the target. In this case, a three-dimensional animal model is used in place of the three-dimensional person model to obtain the learning result for the joint information. An image of the animal to be processed is input to the input unit 12 of the image synthesizing device 10, and the joint information of that animal is estimated. A robot may likewise be the target. In this case, a three-dimensional robot model is used in place of the three-dimensional person model to obtain the learning result for the joint information. An image of the robot to be processed is input to the input unit 12 of the image synthesizing device 10, and the joint information of that robot is estimated.

The data generation unit 111 may execute predetermined processing (hereinafter referred to as “image preprocessing”) on the generated computer graphics. Image preprocessing is performed to improve the accuracy of the learning process in the learning unit 112 and of the estimation process in the estimation unit 13. The image preprocessing may be, for example, resizing or cutting out the region of the processing target (for example, a person). The same image preprocessing is applied to all generated computer graphics, and it is performed so that all images have the same size after resizing. The image preprocessing executed in the data generation unit 111 is likewise executed on the images to be processed that are input to the input unit 12. In this case, the size of an image after image preprocessing in the input unit 12 is desirably the same as the size of the computer graphics after image preprocessing by the data generation unit 111. In other words, the size of the computer graphics used in the learning unit 112 and the size of the images used in the estimation unit 13 are desirably the same.

The network used by the learning unit 112 need not be constructed by the network construction unit 211; a network constructed in advance may instead be stored in a storage unit (not shown). In this case, the parameter learning unit 212 executes machine learning after reading out the network stored in the storage unit.

The learning device 11 may be configured as a device separate from the image synthesizing device 10. In this case, an image synthesizing system may be constructed. The image synthesizing system includes the image synthesizing device 10 and the learning device 11. In this case, the image synthesizing device 10 includes the input unit 12 and the estimation unit 13. The image synthesizing device 10 acquires data indicating the learning result from the learning device 11 via a network or the like and executes the estimation process.

The data generation unit 111 may be provided in a learning data generation device that is separate from the learning device 11. In this case, the learning data generated by the learning data generation device may be supplied to the learning device 11 via a network, a storage medium, or the like.

The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments and includes designs and the like within a range that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS
10: image synthesizing device, 11: learning device, 111: data generation unit, 112: learning unit, 12: input unit, 13: estimation unit, 14: frame classification unit, 15: frame interpolation unit, 16: image synthesizing unit, 201: scene generation unit, 202: image generation unit, 211: network construction unit, 212: parameter learning unit, 121: frame separation unit, 161: joint displacement calculation unit, 162: deformation unit, 163: image generation unit

Claims

An image synthesizing device comprising:
an input unit that receives input of a plurality of moving images and a parameter related to a three-dimensional shape of a subject captured in the moving images;
an estimating unit that, for the frame at each time, estimates three-dimensional joint information of the subject based on a result of machine learning obtained in advance; and
an image synthesizing unit that obtains information indicating the three-dimensional shape of the subject at each time based on the estimation result of the three-dimensional joint information in the frame at each time and the parameter related to the three-dimensional shape of the subject, and generates an image including an image of the subject in a specified field of view at a specified time based on the information indicating the three-dimensional shape of the subject.

The image synthesizing device according to claim 1, wherein the parameter related to the three-dimensional shape of the subject includes reference shape information indicating the three-dimensional shape of the subject, and reference joint information indicating information on each joint in the three-dimensional shape indicated by the reference shape information.

The image synthesizing device according to a preceding claim, further comprising:
a frame classification unit that determines, based on the image of each frame, whether the frame at each time constituting the moving image is a key frame, that is, a frame whose joint information of the subject is not to be obtained by interpolation processing; and
a frame interpolation unit that, for each frame other than a key frame, acquires the three-dimensional joint information by performing interpolation processing using the estimation result of the three-dimensional joint information in the key frames.
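
The interpolation performed for frames other than key frames can be illustrated with a minimal sketch. The claim does not specify the interpolation method; the sketch below assumes simple linear interpolation of joint positions between the nearest preceding and following key frames. The names `Joint` and `interpolate_joints`, and the representation of a joint as an (x, y, z) position, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

# Hypothetical representation: a joint is an (x, y, z) position.
Joint = Tuple[float, float, float]


def interpolate_joints(key_frames: Dict[int, List[Joint]], frame: int) -> List[Joint]:
    """Return joint positions for `frame`.

    `key_frames` maps a frame time to the joints estimated for that key frame.
    Key frames are returned as-is; any other frame is linearly interpolated
    (an assumption, not the patent's stated method) between its two
    neighbouring key frames.
    """
    if frame in key_frames:
        return list(key_frames[frame])

    times = sorted(key_frames)
    idx = bisect_left(times, frame)
    if idx == 0 or idx == len(times):
        raise ValueError("frame lies outside the range covered by key frames")

    t0, t1 = times[idx - 1], times[idx]
    w = (frame - t0) / (t1 - t0)  # 0 at the earlier key frame, 1 at the later one
    return [
        tuple(a + w * (b - a) for a, b in zip(j0, j1))
        for j0, j1 in zip(key_frames[t0], key_frames[t1])
    ]
```

For example, with key frames at times 0 and 10, calling `interpolate_joints(key_frames, 5)` returns positions halfway between the two estimates. Joint angles, if used instead of positions, would generally call for a rotation-aware interpolation rather than this component-wise blend.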

The image synthesizing device according to claim 1, further comprising:
a data generation unit that generates learning data including, for a three-dimensional model representing the three-dimensional shape of a living thing or an object having joints, images rendered from a plurality of fields of view and joint information indicating the position and angle of the joints of the three-dimensional model at the time of rendering; and
a learning unit that, by performing machine learning using the images and the joint information generated by the data generation unit, acquires a learning result for estimating, based on a target image that is an image to be processed, joint information of a living thing or an object captured in the target image,
wherein the estimating unit estimates the three-dimensional joint information based on the result of the machine learning performed by the learning unit.

The image synthesizing device according to claim 4, wherein the data generation unit generates a plurality of scenes having different joint information based on the same three-dimensional model, and renders an image from one or more fields of view for each scene.
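
The generation of learning data from a single model can be sketched as follows. This is only an illustration under strong simplifying assumptions: a planar chain of bones stands in for the three-dimensional model, and a pinhole projection of the joint positions stands in for rendering an image. All names (`Sample`, `pose_chain`, `project`, `generate_learning_data`) and all parameters are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Sample:
    scene_id: int
    view_id: int
    image: List[Tuple[float, float]]  # stand-in for a rendered image: projected joints
    joint_angles: List[float]         # joint information stored as the label
    joint_positions: List[Vec3]


def pose_chain(bone_lengths: Sequence[float], angles: Sequence[float]) -> List[Vec3]:
    """Forward kinematics of a planar chain (stand-in for posing a 3D model)."""
    x = y = total = 0.0
    positions: List[Vec3] = [(0.0, 0.0, 0.0)]
    for length, angle in zip(bone_lengths, angles):
        total += angle
        x += length * math.cos(total)
        y += length * math.sin(total)
        positions.append((x, y, 0.0))
    return positions


def project(p: Vec3, yaw: float, distance: float = 5.0, focal: float = 1.0) -> Tuple[float, float]:
    """Pinhole projection for a camera orbiting the origin at angle `yaw`.

    Assumes `distance` exceeds the total chain length so depth stays positive.
    """
    x, y, z = p
    xc = x * math.cos(yaw) - z * math.sin(yaw)
    zc = x * math.sin(yaw) + z * math.cos(yaw) + distance
    return (focal * xc / zc, focal * y / zc)


def generate_learning_data(bone_lengths: Sequence[float], num_scenes: int,
                           view_yaws: Sequence[float], seed: int = 0) -> List[Sample]:
    """Create several scenes (poses) from one model and view each from several cameras."""
    rng = random.Random(seed)
    data: List[Sample] = []
    for scene_id in range(num_scenes):
        # Each scene has its own joint information for the same model.
        angles = [rng.uniform(-math.pi / 2, math.pi / 2) for _ in bone_lengths]
        positions = pose_chain(bone_lengths, angles)
        for view_id, yaw in enumerate(view_yaws):
            image = [project(p, yaw) for p in positions]
            data.append(Sample(scene_id, view_id, image, angles, positions))
    return data
```

Each scene contributes one sample per field of view, and all samples of a scene share the same joint information, which is what lets a learner associate different views of one pose with a single label.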

An image synthesis method comprising:
an input step of receiving input of a plurality of moving images and a parameter related to a three-dimensional shape of a subject captured in the moving images;
an estimation step of estimating, for the frame at each time, three-dimensional joint information of the subject based on a result of machine learning obtained in advance; and
an image synthesizing step of obtaining information indicating the three-dimensional shape of the subject at each time based on the estimation result of the three-dimensional joint information in the frame at each time and the parameter related to the three-dimensional shape of the subject, and generating an image including an image of the subject in a specified field of view at a specified time based on the information indicating the three-dimensional shape of the subject.

A computer program for causing a computer to function as the image synthesizing device according to claim 1.