JP2020017927A

JP2020017927A - Image processing apparatus, control method therefor, and image processing system

Info

Publication number: JP2020017927A
Application number: JP2018141630A
Authority: JP
Inventors: Mitsuru Maeda (前田 充)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-07-27
Filing date: 2018-07-27
Publication date: 2020-01-30

Abstract

PROBLEM TO BE SOLVED: To increase the quality of a virtual viewpoint image generated using images acquired from a plurality of cameras.

SOLUTION: An image processing apparatus that generates a virtual viewpoint image observed from a virtual viewpoint, using a plurality of images acquired from a plurality of cameras, selects from the plurality of images an image that can be used for texture mapping of a part of the virtual viewpoint image, based on the positional relationship between the virtual viewpoint and the plurality of cameras. The apparatus determines how the selected image is to be used for the texture mapping based on encoding information representing the encoding applied to the selected image, executes the texture mapping using the selected image in accordance with the determined usage, and generates an image of that part of the virtual viewpoint image.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an image processing apparatus for generating a virtual viewpoint image using a plurality of images obtained by photographing a subject from a plurality of directions with a plurality of cameras, to a control method for the apparatus, and to an image processing system.

A technique that has attracted attention installs a plurality of cameras at different positions, performs synchronized shooting from multiple viewpoints, and generates a virtual viewpoint image using the multi-viewpoint images obtained by that shooting. With such a virtual viewpoint image, a highlight scene of a soccer or basketball game, for example, can be viewed from various angles, which gives the user a greater sense of presence than an ordinary image. In a system that generates such virtual viewpoint images, the multi-viewpoint images captured by the cameras are generally collected in an image processing unit such as a server. The image processing unit uses a model generated from the multi-viewpoint images to generate a virtual viewpoint image representing the appearance of the model from a specified virtual viewpoint, and transmits it to a user terminal. The user can view the virtual viewpoint image by displaying it on the user terminal.

Patent Literature 1 describes a technique for transmitting images from a plurality of cameras in which each camera adjusts its quantization step size (quantization parameter) to control the amount of data generated and thereby control the transmission rate. In Patent Literature 2, a model is generated from a plurality of images captured by a plurality of cameras, and a virtual image is generated by texture mapping the part of the model visible from a virtual camera. Patent Literature 2 describes a technique that assigns priorities to the images according to spatial (geometric) information, such as the position of the virtual camera and the position of the object, and selects the image to be used for texture mapping.

Patent Literature 1: Japanese Patent No. 4101857
Patent Literature 2: Japanese Patent No. 4464773

In Patent Literature 1, each camera adjusts its code amount by controlling image quality and changing the encoding mode according to the state of the communication path. The degree of degradation caused by encoding can therefore differ from camera to camera. That is, the images suffer varying amounts of coding degradation, so an image ranked highly under a priority order derived from spatial information, as in Patent Literature 2, does not necessarily have high image quality. In other words, when the transmission rate is controlled by changing the encoding per camera, choosing the camera images used to generate the virtual viewpoint image from spatial information alone may preferentially select an image that is heavily degraded by encoding. As a result, the image quality of the generated virtual viewpoint image is reduced.

The present invention has been made in view of the above problem, and its object is to improve the quality of a virtual viewpoint image generated using images acquired from a plurality of cameras.

An image processing apparatus according to one aspect of the present invention has the following configuration. That is,
an image processing apparatus that generates a virtual viewpoint image observed from a virtual viewpoint, using a plurality of images obtained from a plurality of cameras, comprises:
selection means for selecting, from the plurality of images, an image usable for texture mapping of a part of the virtual viewpoint image, based on the positional relationship between the virtual viewpoint and the plurality of cameras;
determination means for determining how the selected image is to be used for the texture mapping, based on encoding information representing the encoding applied to the image selected by the selection means; and
generation means for executing texture mapping using the selected image in accordance with the usage determined by the determination means, and generating an image of the part of the virtual viewpoint image.
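The three means listed above (selection, determination, generation) can be read as a small pipeline. The following Python sketch illustrates that structure only. The names `CandidateImage`, `covers_part`, and `qp` are hypothetical, and the policy of preferring the image with the smallest quantization parameter is an assumption made for illustration; the text at this point only states that the usage is determined from encoding information.

```python
from dataclasses import dataclass


@dataclass
class CandidateImage:
    camera_id: str     # hypothetical camera identifier
    covers_part: bool  # outcome of the geometric (positional) test
    qp: int            # quantization parameter carried as encoding information


def select_candidates(images):
    """Selection step: keep images whose camera geometrically covers the part."""
    return [img for img in images if img.covers_part]


def determine_usage(candidates):
    """Determination step (assumed policy): rank by ascending QP,
    so the image with the least coding degradation comes first."""
    return sorted(candidates, key=lambda img: img.qp)


def choose_texture_source(images):
    """Generation step stand-in: return the camera whose image would be mapped."""
    ranked = determine_usage(select_candidates(images))
    return ranked[0].camera_id if ranked else None
```

In this sketch a camera that is geometrically unusable is never chosen, however good its encoding quality, which mirrors the order of the selection and determination steps in the text.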

According to the present invention, the quality of a virtual viewpoint image generated using images acquired from a plurality of cameras can be improved.

FIG. 1 is a block diagram illustrating a configuration example of an image processing system according to the first embodiment.
FIG. 2 is a diagram illustrating an example of an object being photographed by a plurality of sensor systems.
FIG. 3 is a diagram illustrating an example of the relationship between an object photographed by a plurality of cameras and a virtual camera.
FIG. 4 is a block diagram illustrating a configuration example of the synthesizer in the first embodiment.
FIG. 5 is a flowchart illustrating the operation of the synthesizer in the first embodiment.
FIG. 6 is a flowchart illustrating the operation of the synthesizer according to a modification of the first embodiment.
FIG. 7 is a block diagram illustrating a configuration example of an image processing system in the second embodiment.
FIG. 8 is a block diagram illustrating a configuration example of the synthesizer in the second embodiment.
FIG. 9 is a flowchart illustrating the operation of the synthesizer in the second embodiment.
FIG. 10 is a block diagram illustrating a configuration example of an image processing system in the third embodiment.
FIG. 11 is a block diagram illustrating a configuration example of the synthesizer in the third embodiment.
FIG. 12 is a flowchart illustrating the operation of the synthesizer in the third embodiment.
FIG. 13 is a diagram illustrating an example of the positional relationship between an object and cameras.
FIG. 14 is a flowchart illustrating the operation of the synthesizer according to a modification of the third embodiment.

Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. The configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.

<First embodiment>
This section describes the configuration of the image processing system of the present embodiment, in which a plurality of cameras and microphones are installed in a facility such as a stadium or concert hall to capture images and collect sound. FIG. 1 is a block diagram illustrating a configuration example of an image processing system 100 according to the first embodiment. The image processing system 100 includes sensor systems 110a to 110z, an image computing server 200, a controller 280, a switching hub 180, and an end user terminal 190.

In the present embodiment, the 26 sensor systems 110a to 110z have the same configuration and may be referred to collectively as the sensor system 110 when they need not be distinguished. Likewise, unless otherwise noted, the devices in each sensor system 110 are referred to without distinction as the microphone 111, the camera 112, the camera platform 113, and the camera adapter 120. The figure of 26 sensor systems is only an example; the number is not limited to this.

In the present embodiment, unless otherwise noted, the term "image" covers both moving images and still images. That is, the image processing system 100 can process both still images and moving images. The description focuses on an example in which the virtual viewpoint content provided by the image processing system 100 includes a virtual viewpoint image and virtual viewpoint audio, but the content is not limited to this. For example, the virtual viewpoint content need not include audio. As another example, the audio included in the virtual viewpoint content may be the audio collected by the microphone closest to the virtual viewpoint. To simplify the description, audio is partly omitted in what follows, but images and audio are basically processed together.

In the image processing system 100, the cameras 112a to 112z of the sensor systems 110a to 110z constitute a plurality of cameras for photographing a subject from a plurality of directions. The sensor systems 110a to 110z are connected in a daisy chain. The connection form is not limited to this: a star network may be used in which each of the sensor systems 110a to 110z is connected to the switching hub 180 and data is exchanged between the sensor systems 110 via the switching hub 180. FIG. 1 shows all of the sensor systems 110a to 110z cascaded into a single daisy chain, but this is not a limitation either. For example, the sensor systems 110 may be divided into several groups, with the sensor systems 110 daisy-chained within each group.

The audio collected by the microphone 111a and the image captured by the camera 112a undergo predetermined image processing in the camera adapter 120a and are then encoded by the encoder 121a. The camera adapter 120a transmits the encoded data to the camera adapter 120b of the sensor system 110b through the network 170a. Similarly, the sensor system 110b encodes its own collected audio and captured image and transmits them to the sensor system 110c together with the encoded image and audio data acquired from the sensor system 110a.

By repeating this operation, the images and audio acquired by the sensor systems 110a to 110z are passed from the sensor system 110z to the switching hub 180 via the network 180b and then transmitted to the image computing server 200. The configuration of the sensor system 110 is not limited to the above. For example, the camera 112 and the camera adapter 120 are separate in the present embodiment, but they may be integrated in a single housing. In that case, the microphone 111 may be built into the integrated camera 112 or connected externally to the camera 112. The front-end server 230 may provide at least part of the functions of the camera adapter 120. The sensor systems 110a to 110z also need not share the same configuration; individual sensor systems 110 may be configured differently.
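The daisy-chain relay described above, in which each sensor system forwards its own encoded data together with everything received from upstream, can be sketched as follows. This is a minimal illustration, not the actual transport: the function name `relay_through_chain` and the `encode` callback are hypothetical, and packetization and networking are omitted.

```python
def relay_through_chain(sensor_ids, encode):
    """Pass data down a daisy chain.

    Each node appends its own encoded payload to the data received from its
    upstream neighbour and forwards the combined list to the next node.
    """
    forwarded = []  # data arriving from the upstream neighbour (none for the first node)
    for sensor_id in sensor_ids:
        own = (sensor_id, encode(sensor_id))
        forwarded = forwarded + [own]  # upstream data plus this node's data
    return forwarded  # what the last node hands to the switching hub


# Usage: three sensor systems, with a placeholder encoder.
delivered = relay_through_chain(["110a", "110b", "110c"], lambda s: f"encoded-{s}")
```

One consequence visible in the sketch is that the data volume grows toward the end of the chain, which is why the amount of upstream data matters for rate control in each sensor system.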

The controller 280 includes a control station 281 and a virtual camera operation UI 282. The control station 281 manages the operating state of each block of the image processing system 100 over the network and sets and controls their parameters. The virtual camera operation UI 282 provides a user interface with which the user specifies a viewpoint, and supplies the viewpoint specified by the user operation to the back-end server 270 via the control station 281.

The time server 290 distributes a time and a synchronization signal to the sensor systems 110a to 110z via the switching hub 180. On receiving them, the camera adapters 120a to 120z genlock the cameras 112a to 112z to the time and synchronization signal to synchronize image frames. In other words, the time server 290 synchronizes the shooting timing of the plurality of cameras 112.

The image computing server 200 processes the data acquired from the sensor system 110z. It includes encapsulators 210a and 210b, a front-end server 230, a database 250, and a back-end server 270.

The encapsulators 210a and 210b reassemble the segmented transmission packets of images and audio acquired from the sensor system 110z and convert them into frame data. The front-end server 230 writes the frame data to the database 250 in association with the camera identifier, the data type, and the frame number. The database 250 also stores, for each camera identified by a camera identifier, camera setting information including its position, direction, and angle of view. The back-end server 270 accepts a viewpoint designation from the virtual camera operation UI 282, acquires the corresponding images from the database 250 based on that viewpoint, and generates a virtual viewpoint image by rendering and related processing. The back-end server 270 also acquires audio data from the database 250 and generates audio corresponding to the virtual viewpoint image.

The configuration of the image computing server 200 is not limited to the above. For example, at least two of the front-end server 230, the database 250, and the back-end server 270 may be integrated. At least one of the front-end server 230, the database 250, and the back-end server 270 may be provided as a separate unit. Devices other than those above may be included at any position in the image computing server 200. Furthermore, the end user terminal 190 or the virtual camera operation UI 282 may provide at least part of the functions of the image computing server 200.

The image rendered by the back-end server 270 (the virtual viewpoint image) is transmitted from the back-end server 270 to the end user terminal 190. The user operating the end user terminal 190 can thus view images and listen to audio corresponding to the designated viewpoint. That is, the back-end server 270 generates a virtual viewpoint image based on the plurality of captured images (multi-viewpoint images) taken by the plurality of cameras 112 and on viewpoint information indicating the viewpoint designated by the user operation. The back-end server 270 then provides the generated virtual viewpoint content to the end user terminal 190.

The virtual viewpoint content in the present embodiment is content that includes a virtual viewpoint image, that is, an image that would be obtained if the subject were photographed from a virtual viewpoint. Put differently, the virtual viewpoint image represents the appearance from a designated viewpoint. The virtual viewpoint may be designated by the user (for example, using the virtual camera operation UI 282) or designated automatically based on, for example, the result of image analysis. Virtual viewpoint images thus include an arbitrary-viewpoint image corresponding to a viewpoint freely designated by the user, an image corresponding to a viewpoint the user selected from a plurality of candidates, and an image corresponding to a viewpoint designated automatically by the apparatus. The back-end server 270 may compress and encode the virtual viewpoint image using a standard technique typified by H.264 or HEVC and then transmit it to the end user terminal 190 using the MPEG-DASH protocol. The image processing system 100 of the present embodiment is not limited to the physical configuration described above and may be configured logically.

An example of how the image processing system 100 configured as above photographs a shooting target (object) is described with reference to FIG. 2. FIG. 2 shows an example of an object being photographed by a plurality of sensor systems. For ease of explanation, the object 320 being photographed is assumed to be a sphere in the present embodiment, but it is not limited to this. In FIG. 2, the cameras 112a to 112c are the cameras of the sensor systems 110a to 110c, respectively, and 301a to 301c denote the angles of view of the cameras 112a to 112c. The cameras 112a to 112c photograph the object 320 from different angles (line-of-sight directions). The range of the object 320 captured by the camera 112a lies between point 331 and point 335. Similarly, the range captured by the camera 112b lies between point 332 and point 336, and the range captured by the camera 112c lies between point 334 and point 338. The images of the object 320 captured by the cameras 112a to 112c at the angles of view 301a to 301c are rate-controlled and encoded separately in each sensor system.

The camera adapter 120 of the sensor system 110 in the present embodiment is described next. The encoder 121 of the camera adapter 120 receives an image from the camera 112, encodes it to generate code data, and sends the code data to the network 170. The encoder 121 performs rate control to control the amount of code data it generates. Rate control adjusts the code amount by, for example, adjusting the quantization parameter or selecting the encoding mode (intra or inter, lossless or lossy). The information used for this rate control is hereinafter called encoding information. The present embodiment is described using the case where, among the encoding information, the quantization parameter is adjusted to perform rate control, but it is not limited to this. Likewise, the encoder 121 is described as encoding with the H.264 encoding method, but this is not a limitation; a JPEG or MPEG encoding method, for example, may be used instead.

The encoder 121 adjusts the quantization parameter (for example, the quantization step) based on, for example, the amount of code data arriving over the daisy chain and the characteristics of the image input from the camera 112, and then encodes. When the amount of arriving code data is large, the encoder 121 raises the quantization parameter to reduce the amount of code data generated by its sensor system 110. Conversely, when there is headroom in the data amount, the encoder 121 lowers the quantization parameter to increase the amount of data generated by the sensor system 110. When the captured image contains important information such as edges, the encoder 121 encodes with a smaller quantization parameter. The H.264 encoding method allows the quantization parameter to be set per frame: the picture parameter set (PictureParameterSet), which serves as the frame header, contains the pic_init_qp_minus26 code, which defines a frame-level quantization parameter. At the finer granularity of a slice, the slice_qp_delta code in the slice header defines the quantization parameter, and at the still finer granularity of a macroblock, the mb_qp_delta code does so. H.264 thus allows the quantization parameter to be set per frame, per slice, and per macroblock.
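As an illustration of how the three syntax elements named above combine, the following Python sketch derives the luma quantization parameter at slice and macroblock level following the H.264 derivation for 8-bit video (bit-depth offset of zero). The function names are hypothetical, and chroma QP derivation and higher bit depths are deliberately left out.

```python
def slice_qp(pic_init_qp_minus26: int, slice_qp_delta: int) -> int:
    """Initial luma QP of a slice: 26 + pic_init_qp_minus26 + slice_qp_delta."""
    return 26 + pic_init_qp_minus26 + slice_qp_delta


def macroblock_qp(previous_qp: int, mb_qp_delta: int) -> int:
    """Luma QP of a macroblock for 8-bit video.

    The delta is applied to the QP of the previous macroblock in the slice
    (or to the slice QP for the first macroblock) and wraps modulo 52.
    """
    return (previous_qp + mb_qp_delta + 52) % 52


# Usage: the frame-level default is 26, the slice raises it by 4,
# and one macroblock then lowers it by 2.
qp_slice = slice_qp(pic_init_qp_minus26=0, slice_qp_delta=4)
qp_mb = macroblock_qp(qp_slice, mb_qp_delta=-2)
```

A decoder-side component such as the synthesizer could recover the quantization parameter actually applied to each region in this way, without any side channel beyond the bitstream itself.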

The encoder 121 controls the quantization parameter when quantizing the transform coefficients and thereby adjusts the code amount. The quantization parameter used, which constitutes the encoding information, is itself encoded using the codes described above. The resulting code data is packetized and transmitted to the image computing server 200 via the network 170 and the switching hub 180.
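The rate-control behaviour of the encoder 121 (raise the quantization parameter when much data arrives from upstream, lower it when there is headroom or when the image contains edges) can be sketched as a simple rule. This is a toy heuristic under stated assumptions: the byte budget, the fixed step size, and the boolean edge flag are all hypothetical, and the actual control law is not specified in the text.

```python
QP_MIN, QP_MAX = 0, 51  # valid luma QP range for 8-bit H.264


def adjust_qp(current_qp, upstream_bytes, budget_bytes, has_edges, step=2):
    """Return the next QP for this sensor system.

    upstream_bytes: amount of code data arriving over the daisy chain
    budget_bytes:   hypothetical threshold separating "large" from "headroom"
    has_edges:      True when the captured image holds important detail
    """
    qp = current_qp
    if upstream_bytes > budget_bytes:
        qp += step  # coarser quantization, less code data
    else:
        qp -= step  # finer quantization, more code data
    if has_edges:
        qp -= step  # protect important detail such as edges
    return max(QP_MIN, min(QP_MAX, qp))  # clamp to the valid range
```

Because every sensor system applies such a rule independently, images of the same object can arrive at the server with different quantization parameters, which is the situation the invention addresses.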

In the image computing server 200, the encapsulator 210 receives the code data of the image data captured by each sensor system 110. One or more encapsulators 210 may be provided, and processing can be parallelized according to the bandwidth and the load of each process. The present embodiment uses two encapsulators, 210a and 210b. For example, the encapsulator that handles a frame may be chosen according to whether the time code indicating when the image was captured is odd or even. Because the time code consists of the capture time and a frame number, the encapsulator 210 may, for example, be selected according to whether the frame number is odd or even. The encapsulator 210, for example, gathers the packetized code data of the received images into units of one frame and outputs them to the front-end server 230.
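The two encapsulator tasks described above, choosing an encapsulator by frame-number parity and gathering packets into whole frames, might look like the following Python sketch. The packet layout `(camera_id, frame_number, seq, payload)` is a hypothetical format chosen for illustration; the text does not define the packet structure.

```python
from collections import defaultdict


def pick_encapsulator(frame_number, count=2):
    """Route a frame to an encapsulator index; count=2 gives an even/odd split."""
    return frame_number % count


def reassemble(packets):
    """Group packets into per-frame code data.

    packets: iterable of (camera_id, frame_number, seq, payload) tuples,
             possibly arriving out of order.
    Returns a dict mapping (camera_id, frame_number) to the joined payload bytes.
    """
    parts_by_frame = defaultdict(list)
    for camera_id, frame_number, seq, payload in packets:
        parts_by_frame[(camera_id, frame_number)].append((seq, payload))
    return {
        key: b"".join(payload for _, payload in sorted(parts, key=lambda part: part[0]))
        for key, parts in parts_by_frame.items()
    }
```

With two encapsulators, index 0 would handle even frame numbers and index 1 odd frame numbers, so consecutive frames are processed in parallel.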

The front-end server 230 converts the per-frame code data into the data format used for writing to the database 250 and attaches the necessary meta-information. The meta-information includes, for example, a camera identifier that identifies the camera 112 of the sensor system 110, a time used for synchronization, and a frame number. The database 250 stores the code data of each camera's frame image at each time.
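A minimal in-memory stand-in for the storage scheme described above, with frames keyed by camera identifier, data type, and frame number, is sketched below. The class `FrameStore` and its method names are hypothetical; the actual database 250 and its schema are not described at this level of detail.

```python
class FrameStore:
    """Toy substitute for the database: one record per (camera, type, frame)."""

    def __init__(self):
        self._records = {}

    def write(self, camera_id, data_type, frame_number, code_data, timestamp):
        """Store code data together with its meta-information."""
        key = (camera_id, data_type, frame_number)
        self._records[key] = {"time": timestamp, "data": code_data}

    def read(self, camera_id, data_type, frame_number):
        """Return the code data for one camera, data type, and frame."""
        return self._records[(camera_id, data_type, frame_number)]["data"]

    def frames_at(self, frame_number, data_type="image"):
        """Return {camera_id: code_data} for every camera at one frame number."""
        return {
            cam: record["data"]
            for (cam, dtype, number), record in self._records.items()
            if dtype == data_type and number == frame_number
        }
```

Keying by frame number makes it straightforward to fetch the synchronized images of all cameras for a single instant, which is what virtual viewpoint generation needs.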

Using the virtual camera operation UI 282, the user sets the position, direction, and angle of view of a virtual camera that represents the virtual viewpoint of the virtual viewpoint image. This information (position, direction, angle of view, and so on) is hereinafter called virtual camera information. Suppose, for example, that a virtual camera 350 is set between the camera 112b and the camera 112c as shown in FIG. 2. Reference numeral 351 denotes the angle of view of the virtual camera 350. The range of the object 320 captured by the virtual camera 350 lies between point 333 and point 337. FIG. 3 shows in detail the relationship between the portions of the object 320 captured by the cameras 112a to 112c and by the virtual camera 350 in the positional relationship shown in FIG. 2. Components common to FIG. 3 and FIG. 2 carry the same reference numerals. The virtual camera information set with the virtual camera operation UI 282 is output to the database 250 and the back-end server 270.

Based on the virtual camera information input from the virtual camera operation UI 282, the back-end server 270 searches the images captured by the sensor systems 110 and selects the images needed to generate the virtual viewpoint image seen from the virtual camera 350. From the virtual camera information of the virtual camera 350, the back-end server 270 determines the shooting range in real space contained in the angle of view 351 (for example, the range from point 333 to point 337 on the object 320). Whether a given object falls within this shooting range is determined by comparing the object's position with the range. In other words, the images used to generate the virtual viewpoint image for the angle of view 351 are selected by identifying, from the virtual camera information, the range captured by the virtual camera 350.

Code data including a camera identifier is attached to each image captured by a sensor system 110. Based on the camera identifier, the back-end server 270 can read the camera setting information of each camera from the database 250 and identify the cameras that captured the images needed to compose the virtual viewpoint image of the virtual camera 350. The back-end server 270 can also read the images of those cameras from the database 250. In FIG. 3, the angle of view 351 is the angle of view of the virtual camera 350 with respect to the object 320. The angles of view of the cameras 112a, 112b, and 112c with respect to the object 320 are the angles of view 301a, 301b, and 301c, respectively. Based on spatial information (the angle of view 351 of the virtual camera 350 and the angles of view of the cameras 112), the database 250 selects the images that can be used to generate the virtual viewpoint image captured by the virtual camera 350 and provides them to the back-end server 270. Alternatively, the back-end server 270 may select such images and request them from the database 250.

バックエンドサーバ２７０では、仮想カメラ操作ＵＩ２８２からの仮想カメラ情報がモデル生成器２７２と合成器２７３に入力される。合成器２７３で生成される仮想カメラ３５０からみたオブジェクト３２０の画像はその視線を法線とする面３５２上に射影した画像となる。図３では、オブジェクト３２０が射影される画像領域は点３３３〜点３３７の領域となる。この画像領域のうちの部分領域３２１は、カメラ１１２ａ〜１１２ｃの画角内にある。したがって、部分領域３２１については、カメラ１１２ａ〜１１２ｃにより撮影された画像が利用可能な画像であり、これらの画像からモデル上の点にテクスチャマッピングするための画像が選択される。部分領域３２２については、カメラ１１２ｂとカメラ１１２ｃにより撮影された画像がテクスチャマッピングに利用可能な画像である。さらに、部分領域３２３については、カメラ１１２ｃにより撮影された画像がテクスチャマッピングに利用可能な画像である。以上は各カメラの位置、画角といった空間的情報によって一意に判別される。データベース２５０はこれらの画像の符号データを時刻ごとにバックエンドサーバ２７０に出力する。 In the back-end server 270, virtual camera information from the virtual camera operation UI 282 is input to the model generator 272 and the synthesizer 273. The image of the object 320 viewed from the virtual camera 350 generated by the synthesizer 273 is an image projected on a plane 352 whose normal is the line of sight. In FIG. 3, the image area where the object 320 is projected is an area of points 333 to 337. The partial area 321 of the image area is within the angle of view of the cameras 112a to 112c. Therefore, regarding the partial region 321, images captured by the cameras 112a to 112c are available images, and an image for texture mapping to a point on the model is selected from these images. Regarding the partial region 322, images captured by the cameras 112b and 112c are images that can be used for texture mapping. Further, regarding the partial region 323, an image captured by the camera 112c is an image that can be used for texture mapping. The above is uniquely determined by spatial information such as the position and angle of view of each camera. The database 250 outputs the code data of these images to the back-end server 270 at each time.

バックエンドサーバ２７０において、デコーダ２７１は、データベース２５０から受け取った符号データの中から符号化情報を抽出し、符号化情報を用いて復号を行うことにより復号画像を生成する。生成した復号画像は、モデル生成器２７２と合成器２７３に入力される。また、抽出された符号化情報は合成器２７３に入力される。 In the back-end server 270, the decoder 271 extracts encoded information from encoded data received from the database 250, and performs decoding using the encoded information to generate a decoded image. The generated decoded image is input to the model generator 272 and the synthesizer 273. The extracted encoded information is input to the synthesizer 273.

モデル生成器２７２は入力された復号画像を用いて、オブジェクトの輪郭を取得し、画像とそのカメラのカメラ設定情報に基づいて３次元モデルを生成する。生成された３次元モデルでは、合成器２７３に出力される。オブジェクト３２０の３次元モデルの生成には、例えば、ＶｉｓｕａｌＨｕｌｌなどの方法を用いることができる。なお、３次元モデルの生成方法に関しては特に限定はなく、３次元モデルは例えば点群や、メッシュで表現され得る。ここでは、３次元モデルは点群で表現されるものとする。点群は撮影対象の空間を三次元空間で表した座標の値を持った点の集合である。 The model generator 272 obtains the outline of the object using the input decoded image, and generates a three-dimensional model based on the image and the camera setting information of the camera. The generated three-dimensional model is output to the synthesizer 273. For generating the three-dimensional model of the object 320, for example, a method such as Visual Hull can be used. The method of generating the three-dimensional model is not particularly limited, and the three-dimensional model can be represented by, for example, a point cloud or a mesh. Here, it is assumed that the three-dimensional model is represented by a point cloud. A point group is a set of points having coordinate values representing the space to be imaged in a three-dimensional space.

FIG. 4 is a block diagram illustrating a configuration example of the synthesizer 273. For each point of the point cloud that forms the model, the synthesizer 273 maps a value obtained by selecting or combining pixel values of foreground images (texture mapping). In general, when a virtual viewpoint image is generated, each image is divided into a foreground image and a background image, and the image from the virtual viewpoint is generated using the three-dimensional models generated for each. Hereinafter, a foreground image corresponding to a video frame is also referred to as a foreground frame. In FIG. 4, the model buffer 500 stores the three-dimensional model generated by the model generator 272 for each time. The foreground frame buffer 506 stores some or all of the foreground images decoded by the decoder 271 together with their camera identifiers. The point selection unit 501 sequentially selects, from the point cloud of the three-dimensional model, the points that lie inside the angle of view of the virtual camera, as determined from the virtual camera information and the position information of the three-dimensional model. Information on the selected point is input to the foreground frame selection unit 502 and the foreground mapping unit 507.

The foreground frame selection unit 502 acquires the camera setting information of each camera from the database 250 and determines, based on that information, whether the point selected by the point selection unit 501 falls within the angle of view of each camera. The foreground frame selection unit 502 also projects the selected point onto the image of each camera, based on the coordinates of the point and the position, orientation, and angle of view of the camera, and calculates the pixel position corresponding to the projected point. Note that the selected point does not necessarily correspond to exactly one pixel; it may correspond to several. The foreground frame selection unit 502 outputs, to the quantization parameter comparison unit 505, the camera identifier of each camera whose angle of view contains the point, together with the pixel position in that camera's image corresponding to the selected point.
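The projection step described above can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with no lens distortion; the `CameraSetting` fields, the function names, and the use of a focal length in pixels are hypothetical stand-ins for the "camera setting information" and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class CameraSetting:
    """Hypothetical stand-in for the camera setting information."""
    camera_id: str
    position: Vec3                      # camera centre in world coordinates
    rotation: Tuple[Vec3, Vec3, Vec3]   # world-to-camera rotation, row-major
    focal_px: float                     # focal length in pixels (assumed model)
    width: int                          # image width in pixels
    height: int                         # image height in pixels


def project_point(cam: CameraSetting, point: Vec3) -> Optional[Tuple[int, int]]:
    """Return the pixel hit by `point`, or None if it is outside the view."""
    # Move the point into the camera frame: translate, then rotate.
    d = tuple(p - c for p, c in zip(point, cam.position))
    x, y, z = (sum(row[i] * d[i] for i in range(3)) for row in cam.rotation)
    if z <= 0.0:
        return None  # behind the camera
    # Pinhole projection with the principal point at the image centre.
    u = cam.focal_px * x / z + cam.width / 2.0
    v = cam.focal_px * y / z + cam.height / 2.0
    if 0.0 <= u < cam.width and 0.0 <= v < cam.height:
        return int(u), int(v)
    return None


def candidate_cameras(cams: Sequence[CameraSetting],
                      point: Vec3) -> List[Tuple[str, Tuple[int, int]]]:
    """List (camera_id, pixel position) for every camera that sees `point`."""
    hits = []
    for cam in cams:
        pixel = project_point(cam, point)
        if pixel is not None:
            hits.append((cam.camera_id, pixel))
    return hits
```

The list returned by `candidate_cameras` plays the role of the camera identifiers and pixel positions that are passed on for quantization parameter comparison. A real system would also need an occlusion test, which this sketch omits.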

Selecting the camera identifiers of the cameras determined to include the selected point within their angle of view is equivalent to selecting the foreground images that are candidates for texture mapping. The quantization parameter comparison unit 505 is one example of a configuration that determines how a selected foreground image is used for texture mapping, based on encoding information representing the encoding applied to that image. In the present embodiment, when a plurality of foreground images are selected, the foreground image to be used for texture mapping is determined based on the encoding information (the quantization parameter).

デコーダ２７１で復号された符号化情報は量子化パラメータ抽出部５０３に入力される。量子化パラメータ抽出部５０３は符号化情報からマクロブロック単位で量子化パラメータを抽出する。抽出された量子化パラメータはカメラ識別子ごとにマクロブロック単位で量子化パラメータメモリ５０４に格納される。なお、量子化パラメータの格納の単位は、これに限定されず、フレーム、スライス、ブロック、画素単位であっても良い。 The encoded information decoded by the decoder 271 is input to the quantization parameter extraction unit 503. The quantization parameter extraction unit 503 extracts a quantization parameter from coding information in macroblock units. The extracted quantization parameters are stored in the quantization parameter memory 504 in macroblock units for each camera identifier. The unit for storing the quantization parameter is not limited to this, and may be a frame, slice, block, or pixel unit.
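As an illustration of the storage scheme described above, the following sketch keeps one quantization parameter per macroblock for each camera identifier and looks it up from a pixel position. The class and method names are hypothetical, and the 16×16 macroblock size is an assumption that matches H.264 but is not stated in this paragraph.

```python
from typing import Dict, Tuple

MB_SIZE = 16  # assumed macroblock edge length in pixels (as in H.264)


class QuantizationParameterMemory:
    """Hypothetical model of the quantization parameter memory 504."""

    def __init__(self) -> None:
        # camera_id -> {(macroblock row, macroblock column): QP}
        self._qp: Dict[str, Dict[Tuple[int, int], int]] = {}

    def store(self, camera_id: str, mb_row: int, mb_col: int, qp: int) -> None:
        """Record the QP extracted for one macroblock of one camera."""
        self._qp.setdefault(camera_id, {})[(mb_row, mb_col)] = qp

    def lookup(self, camera_id: str, x: int, y: int) -> int:
        """Return the QP of the macroblock that contains pixel (x, y)."""
        return self._qp[camera_id][(y // MB_SIZE, x // MB_SIZE)]
```

Storing at frame or slice granularity instead, as the paragraph allows, would only change the key used in the inner dictionary.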

Based on the camera identifier of a foreground image selected by the foreground frame selection unit 502 and the pixel position corresponding to the selected point, the quantization parameter comparison unit 505 reads, from the quantization parameter memory 504, the quantization parameter of the macroblock that contains that pixel position. When a plurality of foreground images are selected, the quantization parameter comparison unit 505 compares the quantization parameters it has read in order to determine the foreground image to be used for texture mapping. In the present embodiment, the quantization parameter comparison unit 505 compares the magnitudes of the quantization parameters and, based on the result, obtains the camera identifier of the foreground image selected for generating the virtual viewpoint image. For example, choosing the image with the smallest quantization parameter can improve the image quality of the texture, since a smaller quantization parameter corresponds to finer quantization and therefore less coding distortion. The quantization parameter comparison unit 505 outputs the determined camera identifier and the pixel position corresponding to the selected point to the foreground frame buffer 506.

Based on the camera identifier determined from the comparison and the pixel position corresponding to the selected point, the foreground frame buffer 506 supplies the corresponding pixel value of the corresponding image, among the decoded images obtained from the decoder 271, to the foreground mapping unit 507. The foreground mapping unit 507 projects the selected point onto the image of the virtual camera, based on the coordinates of the point selected by the point selection unit 501 and the virtual camera information, and calculates the corresponding pixel position. The foreground mapping unit 507 performs texture mapping by placing the pixel value input from the foreground frame buffer 506 at the calculated pixel position in the image of the virtual camera. When there are several pixel values for one point, the pixel value is calculated, for example, by averaging them. Using the model obtained by texture mapping all points of the three-dimensional model that are observed from the virtual camera in this way, the image seen from the virtual viewpoint camera (the virtual viewpoint image) is generated. The virtual viewpoint image is transmitted to the end user terminal 190 and displayed.

図５は、以上の様な構成を備えた合成器２７３の動作を表したフローチャートである。なお、本実施形態では、コントローラ２８０が画像処理システム１００内のフロントエンドサーバ２３０やデータベース２５０等のワークフローを制御することにより、以下の制御が実現される。このことは、第２、第３実施形態も同様である。 FIG. 5 is a flowchart showing the operation of the synthesizer 273 having the above configuration. In the present embodiment, the following control is realized by the controller 280 controlling the workflow of the front-end server 230 and the database 250 in the image processing system 100. This is the same in the second and third embodiments.

ステップＳ６００からステップＳ６０８は処理のループを表し、３次元モデルを構成する点群のすべての点について処理を行うためのループである。ステップＳ６０１において、点選択部５０１は前景をテクスチャマッピングする対象となる点、すなわち３次元モデルを構成する点群の１つを選択する。ステップＳ６０２において、点選択部５０１により選択された点が仮想視点カメラから見えるか否かを判定する。選択された点が仮想視点カメラから見えないと判定された場合、処理はステップＳ６０８からステップＳ６０１に戻り、点選択部５０１は次の点を選択する。ステップＳ６０２において、選択された点が見えると判定された場合、処理はステップＳ６０３に進む。 Steps S600 to S608 represent a processing loop, which is a loop for performing processing on all points of the point group forming the three-dimensional model. In step S601, the point selection unit 501 selects a point to be subjected to texture mapping of the foreground, that is, one of a group of points forming a three-dimensional model. In step S602, it is determined whether or not the point selected by the point selection unit 501 is visible from the virtual viewpoint camera. If it is determined that the selected point is not visible from the virtual viewpoint camera, the process returns from step S608 to step S601, and the point selection unit 501 selects the next point. If it is determined in step S602 that the selected point is visible, the process proceeds to step S603.

In step S603, the foreground frame selection unit 502 acquires the camera setting information of each camera from the database 250 and, based on that information, selects the foreground images that can be used for texture mapping of the selected point. In step S604, the quantization parameter comparison unit 505 determines whether the foreground frame selection unit 502 has selected more than one foreground image frame. If only one foreground image has been selected, the process proceeds to step S605; if more than one has been selected, the process proceeds to step S606.

In step S605, the quantization parameter comparison unit 505 acquires the camera identifier of the camera that captured the foreground image selected by the foreground frame selection unit 502. In step S606, the quantization parameter comparison unit 505 reads the quantization parameter of each of the selected foreground images from the quantization parameter memory 504. The quantization parameter comparison unit 505 compares the quantization parameters it has read and selects the minimum one. The quantization parameter comparison unit 505 then acquires the camera identifier of the camera that captured the foreground image decoded with the minimum quantization parameter and notifies the foreground frame buffer 506 of that identifier.

In step S607, the foreground frame buffer 506 acquires, from the foreground image corresponding to the camera identifier notified by the quantization parameter comparison unit 505, the pixel value at the pixel position of the selected point, and outputs it to the foreground mapping unit 507. The foreground mapping unit 507 maps the pixel value input from the foreground frame buffer 506 onto the point, selected by the point selection unit 501, of the three-dimensional model acquired from the model buffer 500. The process then returns to step S601, and the point selection unit 501 selects the next point.
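The loop of steps S600 to S608 can be sketched as follows. This is only an outline under stated assumptions: the visibility test, candidate selection, quantization parameter lookup, and pixel fetch are passed in as hypothetical callables, and the guard for a point with no candidate camera is an addition that the flowchart does not describe.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Pixel = Tuple[int, int]
Candidate = Tuple[str, Pixel]  # (camera identifier, pixel position)


def texture_map_points(
    points: Iterable[Hashable],
    is_visible: Callable[[Hashable], bool],
    candidates_for: Callable[[Hashable], List[Candidate]],
    qp_of: Callable[[str, Pixel], int],
    pixel_of: Callable[[str, Pixel], object],
) -> Dict[Hashable, object]:
    """Map one pixel value onto each visible point of the model."""
    mapped: Dict[Hashable, object] = {}
    for point in points:                       # S600/S601: next point
        if not is_visible(point):              # S602: seen by virtual camera?
            continue
        candidates = candidates_for(point)     # S603: usable foreground images
        if not candidates:
            continue                           # added guard, not in the flowchart
        if len(candidates) == 1:               # S604 -> S605: single candidate
            camera_id, pos = candidates[0]
        else:                                  # S606: pick the minimum QP
            camera_id, pos = min(candidates, key=lambda c: qp_of(c[0], c[1]))
        mapped[point] = pixel_of(camera_id, pos)   # S607: map the pixel value
    return mapped
```

Passing the lookups as callables keeps the sketch independent of how the buffers and the quantization parameter memory are actually implemented.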

図３を用いて、上記処理についてさらに説明する。ここでは簡略化のために、カメラごとにフレーム単位で量子化パラメータが設定されるものとして説明する。カメラ１１２ａで撮影された画像を符号化するためにエンコーダ１２１ａが使用した量子化パラメータをＱ１とする。同様に、エンコーダ１２１ｂで使用した量子化パラメータをＱ２、エンコーダ１２１ｃで使用した量子化パラメータをＱ３とする。この時、各量子化パラメータではＱ１＜Ｑ２＜Ｑ３の関係があったとする。 The above processing will be further described with reference to FIG. Here, for the sake of simplicity, a description will be given assuming that the quantization parameter is set for each camera in frame units. The quantization parameter used by the encoder 121a to encode the image captured by the camera 112a is Q1. Similarly, the quantization parameter used in the encoder 121b is Q2, and the quantization parameter used in the encoder 121c is Q3. At this time, it is assumed that there is a relationship of Q1 <Q2 <Q3 in each quantization parameter.

In FIG. 3, the foreground images that can be used for texture mapping of points in the partial area 321 of the virtual viewpoint image of the object 320 are the images captured by the cameras 112a to 112c. Because the quantization parameter Q1 used by the encoder 121a is the smallest of these, the image captured by the camera 112a is selected for points projected onto the partial area 321. The foreground images that can be used for texture mapping of points in the partial area 322 are the images of the cameras 112b and 112c. In this case, the quantization parameter Q2 used by the encoder 121b is the smaller, so the image captured by the camera 112b is selected for points projected onto the partial area 322. Finally, the only foreground image that can be used for texture mapping of points in the partial area 323 is the image captured by the camera 112c. The image of the camera 112c is therefore selected for points projected onto the partial area 323.
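The FIG. 3 example can be reproduced in a few lines. The numeric quantization parameters below are invented for illustration and only need to satisfy Q1 < Q2 < Q3; the dictionary names are likewise hypothetical.

```python
# Frame-level quantization parameters per camera (illustrative values only,
# chosen so that Q1 < Q2 < Q3 as assumed in the text).
frame_qp = {"112a": 22, "112b": 27, "112c": 32}

# Cameras whose images are usable for each partial area of FIG. 3.
usable_cameras = {
    321: ["112a", "112b", "112c"],
    322: ["112b", "112c"],
    323: ["112c"],
}

# For each partial area, keep the camera with the smallest quantization parameter.
selected = {
    area: min(cameras, key=frame_qp.__getitem__)
    for area, cameras in usable_cameras.items()
}

for area, camera in selected.items():
    print(f"partial area {area}: camera {camera}")
```

With these values the selection is camera 112a for area 321, camera 112b for area 322, and camera 112c for area 323, matching the description.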

以上のようにして仮想視点画像生成に必要な点に対してマッピングが終了したら、本処理を終了する。その後、全ての点でテクスチャマッピングによって得られたモデルを用いて、仮想視点カメラからの見た画像を仮想視点画像の画像として画像コンピューティングサーバ２００から送出する。生成された画像はエンドユーザ端末１９０に送られ、表示、閲覧される。 When the mapping for the points required for generating the virtual viewpoint image has been completed as described above, the present processing ends. Thereafter, using the models obtained by texture mapping at all points, the image viewed from the virtual viewpoint camera is transmitted from the image computing server 200 as a virtual viewpoint image. The generated image is sent to the end user terminal 190, where it is displayed and browsed.

以上の世に、第１実施形態によれば、仮想視点画像の生成に関連して、複数のカメラの前景画像を用いる場合、符号化による劣化の少ない画像を符号化情報によって選択することができる。結果、画像から特徴量等の抽出を行わなくても、高画質な仮想視点画像を生成できる。 As described above, according to the first embodiment, when foreground images of a plurality of cameras are used in connection with generation of a virtual viewpoint image, an image with little deterioration due to encoding can be selected based on encoded information. As a result, a high-quality virtual viewpoint image can be generated without extracting a feature amount or the like from the image.

In the present embodiment, step S604 checks whether there is only one foreground image, but the embodiment is not limited to this. Steps S604 and S605 may be omitted so that the foreground image with the minimum quantization parameter is selected even when there is only one image, at the cost of additional quantization parameter reads.

＜Modification＞
A modification concerning how the use of the foreground images selected by the foreground frame selection unit 502 for texture mapping is determined will now be described. The first embodiment described a configuration in which, for each point of the three-dimensional model, one of a plurality of foreground images is selected based on the encoding information and used for texture mapping. In the modification, by contrast, the pixel values obtained from the plurality of foreground images are weighted based on the encoding information and combined, and the combined pixel value is used for texture mapping. For this purpose, for example, the foreground mapping unit 507 classifies the images, based on the encoding information, into images to be used preferentially (in this example, the images with the smallest quantization parameter) and other images, sets a weight for each class, and combines the pixel values.

The configuration of the synthesizer 273 according to the modification will be described with reference to FIG. 4. In the modification, a connection between the quantization parameter comparison unit 505 and the foreground mapping unit 507 is added to FIG. 4. The quantization parameter comparison unit 505 outputs, to the foreground mapping unit 507, the total number of cameras that captured the foreground images selected by the foreground frame selection unit 502 and the number of those cameras whose images were encoded with the minimum quantization parameter. The foreground mapping unit 507 weights the pixel values of the foreground image of each camera according to the total number of cameras that captured the selected foreground images and the number of cameras with the minimum quantization parameter, and then performs the mapping.

図６は変形例による合成器２７３の動作を示すフローチャートである。図５に示した処理と同様の処理を行うステップには、図５と同一のステップ番号を付してある。ステップＳ６２０において、量子化パラメータ比較部５０５は前景フレーム選択部５０２で選択された前景画像のフレームが複数か否かを判断する。１つと判定された場合、処理はステップＳ６０５に進み、複数と判定された場合、処理はステップＳ６２１に進む。 FIG. 6 is a flowchart showing the operation of the synthesizer 273 according to the modification. Steps that perform the same processing as the processing shown in FIG. 5 are given the same step numbers as in FIG. In step S620, the quantization parameter comparison unit 505 determines whether there are a plurality of frames of the foreground image selected by the foreground frame selection unit 502. If it is determined that the number is one, the process proceeds to step S605. If it is determined that the number is plural, the process proceeds to step S621.

In step S621, the quantization parameter comparison unit 505 reads, from the quantization parameter memory 504, the quantization parameters used when the selected foreground images were decoded. The quantization parameter comparison unit 505 compares the quantization parameters it has read and selects the minimum one. The quantization parameter comparison unit 505 acquires the camera identifier of the camera that captured the foreground image decoded with the minimum quantization parameter and determines the weighting coefficient W0 used for combining. For example, let M be the number of cameras that captured the foreground images selected by the foreground frame selection unit 502, and let N be the number of those cameras whose foreground images were decoded with the minimum quantization parameter. Using a coefficient α, the weighting coefficient W0 is determined by
W0 = α / ((M − N) + Nα) ... (1)

In step S622, the quantization parameter comparison unit 505 determines the weighting coefficient W1 to be applied to the foreground images decoded with a quantization parameter other than the minimum one, for example, by
W1 = 1 / ((M − N) + Nα) ... (2)

なお、（１）式、（２）式において、係数αは１以上の値である。係数αは、予め決めておいてもよいし、量子化パラメータの最小値と最大値の差分値などに基づいて動的に決めても良い。例えば、差分値が大きければαの値を大きくし、差分値が小さければ値を小さくする。ただし、重み係数の決定方法はこれに限定されない。 In the expressions (1) and (2), the coefficient α is a value of 1 or more. The coefficient α may be determined in advance, or may be dynamically determined based on a difference value between the minimum value and the maximum value of the quantization parameter. For example, if the difference value is large, the value of α is increased, and if the difference value is small, the value is decreased. However, the method of determining the weight coefficient is not limited to this.
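Equations (1) and (2) can be checked with a short sketch. The function names are hypothetical, and the linear rule that derives α from the spread of the quantization parameters (with a `gain` parameter) is only one assumed way of making α grow with the difference; the text does not specify a formula for it.

```python
from typing import Sequence, Tuple


def blend_weights(m: int, n: int, alpha: float) -> Tuple[float, float]:
    """Return (W0, W1) from equations (1) and (2).

    m: number of cameras whose foreground images were selected
    n: number of those cameras that used the minimum quantization parameter
    alpha: coefficient of at least 1 that favours the minimum-QP images
    """
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= m")
    if alpha < 1:
        raise ValueError("alpha must be 1 or more")
    denom = (m - n) + n * alpha
    return alpha / denom, 1.0 / denom


def alpha_from_qp_spread(qps: Sequence[int], gain: float = 0.25) -> float:
    """Assumed rule: alpha grows linearly with (max QP - min QP)."""
    return 1.0 + gain * (max(qps) - min(qps))
```

Because N images receive W0 and the remaining M − N images receive W1, the weights sum to N·W0 + (M − N)·W1 = 1, so the result is a proper weighted average. With α = 1 both weights reduce to 1/M, which is a plain average.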

In step S623, the foreground mapping unit 507 multiplies the pixel value read for each camera identifier by the corresponding weighting coefficient determined in steps S621 and S622 and takes the weighted average, thereby determining the pixel value of the virtual viewpoint image. In step S624, the foreground mapping unit 507 acquires the pixel value selected in step S605 or the pixel value combined in step S623 and maps it onto the position of the selected point of the three-dimensional model.
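The combining step can be sketched as a weighted average over RGB samples. The function name and the `(weight, (r, g, b))` sample layout are assumptions made for illustration; the weights would be the W0 and W1 values assigned to each camera's image.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def weighted_pixel(samples: Sequence[Tuple[float, RGB]]) -> RGB:
    """Combine (weight, pixel value) samples into one pixel value."""
    total = sum(weight for weight, _ in samples)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    acc = [0.0, 0.0, 0.0]
    for weight, rgb in samples:
        for channel, value in enumerate(rgb):
            acc[channel] += weight * value
    # Dividing by the total keeps the result valid even if the
    # weights were not normalised beforehand.
    return tuple(round(value / total) for value in acc)
```

For example, with three cameras, one of which has the minimum quantization parameter, and α = 2, the weights are 0.5, 0.25, and 0.25, so the minimum-QP image contributes half of the final value.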

With the above configuration and operation, when foreground images from a plurality of cameras are used to generate a virtual viewpoint image, images with little degradation from encoding are used preferentially, so a high-quality virtual viewpoint image can be generated without extracting feature values or the like from the images.

In the above flowchart, the weighting coefficients are divided into two classes according to whether the quantization parameter is the minimum, but the modification is not limited to this. The image with the smallest quantization parameter may be given the largest weighting coefficient, and the weighting coefficients of the other images may be determined according to the magnitudes of their quantization parameters. The selected foreground images may also be classified based on the coding mode. Furthermore, in the above embodiment, the database 250 selects, from the images captured by the sensor systems 110, the images needed to generate the virtual viewpoint image seen with the angle of view 351, but the embodiment is not limited to this. The back-end server 270 may use the result of the selection by the foreground frame selection unit 502 to select and read the required images from the database 250.

なお、上述の実施形態は、画像処理システム１００が競技場やコンサートホールなどの施設に設置される場合の例を中心に説明した。施設の他の例としては、例えば、遊園地、公園、競馬場、競輪場、カジノ、プール、スケートリンク、スキー場、ライブハウスなどがある。また、各種施設で行われるイベントは、屋内で行われるものであっても屋外で行われるものであっても良い。また、本実施形態における施設は、一時的に（期間限定で）建設される施設も含む。 In the above-described embodiment, an example in which the image processing system 100 is installed in a facility such as a stadium or a concert hall has been mainly described. Other examples of facilities include, for example, amusement parks, parks, racetracks, bicycle racetracks, casinos, pools, skating rinks, ski resorts, live houses, and the like. In addition, events performed in various facilities may be performed indoors or outdoors. The facilities in the present embodiment also include facilities that are temporarily constructed (for a limited time).

＜Second Embodiment＞
The first embodiment described a configuration that determines the foreground image to be used for texture mapping based on encoding information, taking as an example the case where the quantization parameter serves as the encoding information. The second embodiment describes a configuration that determines the foreground image to be used for texture mapping based on the coding scheme contained in the encoding information.

FIG. 7 is a block diagram illustrating a configuration example of the image processing system 100 according to the second embodiment. In FIG. 7, blocks having the same functions as in FIG. 1 are denoted by the same reference numerals. In FIG. 7, the encapsulators 210a and 210b include decoders 271a and 271b, respectively. In addition to the functions of the decoder 271 in FIG. 1, the decoders 271a and 271b output the decoded image data and the encoding information to the front-end server 230a. The front-end server 230a includes a model generator 272a. In addition to the functions of the model generator 272 in FIG. 1, the model generator 272a has a function of acquiring the image data and encoding information of the foreground images from the encapsulators 210a and 210b and a function of outputting the three-dimensional model to the database 250a. In addition to the functions of the database 250 of the first embodiment, the database 250a has a function of storing the three-dimensional model frame by frame. Unlike the back-end server 270 of the first embodiment, the back-end server 270a does not generate the three-dimensional model; it reads the three-dimensional model from the database 250a.

上述の構成での仮想視点画像生成処理について説明する。第１実施形態と同様に、センサシステム１１０において、カメラ１１２が画像を撮影し、エンコーダ１２１はその画像を符号化して符号化情報とともに出力する。エンコーダ１２１は、伝送されてくる符号データのデータ量、カメラ１１２から入力される画像の特徴から符号化モードを調整して符号化を行う。 The virtual viewpoint image generation processing in the above configuration will be described. As in the first embodiment, in the sensor system 110, the camera 112 captures an image, and the encoder 121 encodes the image and outputs the encoded image together with the encoded information. The encoder 121 performs encoding by adjusting the encoding mode based on the amount of transmitted encoded data and the characteristics of an image input from the camera 112.

For example, in the H.264 coding scheme, a macroblock mode can be set for each macroblock. The macroblock layer begins with mb_type, which defines the encoding mode of the macroblock. Macroblock encoding modes include the Intra mode, which performs intra-frame prediction, and the Inter mode, which performs inter-frame prediction. Among the Intra modes there is also the I_PCM mode, in which the coefficients of the block are PCM-coded as they are. The I_PCM mode is used when there is fine texture that is difficult to encode, and the pixel values are encoded as they are. The I_PCM mode can therefore also perform lossless encoding without degradation; its code amount is large, but its image quality is excellent.

The encoder 121 controls these encoding modes and the quantization parameter to quantize the transform coefficients and adjust the code amount. The encoding mode that was used is itself encoded, as encoding information, using the code described above. The code data obtained by the encoding is packetized and transmitted to the image computing server 200 via the network 170 and the switching hub 180.

The image computing server 200 receives, at the encapsulator 210, the code data of the image data captured by each sensor system 110. If the received data is an image, the encapsulators 210a and 210b decode the packetized code data with the decoders 271a and 271b, assemble the reconstructed image data into units of one frame, and output it to the front-end server 230a. The encapsulators 210a and 210b also output the encoding information to the front-end server 230a. Here, the encoding information is attached to the image data as meta information and output with it. The encoding information includes the quantization parameter and the encoding mode for each macroblock. However, the method of outputting the encoding information is not limited to the above; for example, it may be output as data separate from the image data and managed separately.

The front-end server 230a acquires the image data and the encoding information from the encapsulators 210a and 210b. The model generator 272a aggregates the image data by time and generates a three-dimensional model. That is, the model generator 272a generates a three-dimensional model from the images of the respective sensor systems captured at the same time. The method of generating the three-dimensional model is the same as that of the model generator 272 of the first embodiment. The generated three-dimensional model is stored in the database 250a together with the decoded foreground image data and the encoding information. As in the first embodiment, information such as a camera identifier for specifying the camera 112 of the sensor system 110 and a frame time for synchronization is added. The database 250a stores the three-dimensional model, the decoded foreground image data, and the encoding information for each time.

The user sets virtual camera information using the virtual camera operation UI 282. The set virtual camera information is output to the database 250a and the back-end server 270a. As in the first embodiment, the database 250a selects the foreground images needed to generate the virtual viewpoint image seen from the virtual camera, based on the position and angle of view of each camera and on the three-dimensional model. The database 250a outputs the selected foreground images to the back-end server 270a for each time. In the back-end server 270a, the input encoding information, decoded foreground image data, and three-dimensional model are input to the synthesizer 273a.

FIG. 8 is a block diagram illustrating a detailed functional configuration example of the synthesizer 273a according to the second embodiment. In FIG. 8, blocks having the same functions as those in the first embodiment (FIG. 4) are denoted by the same reference numerals.

The encoding information buffer 810 stores the encoding information input from the database 250a for each frame. The encoding information includes the encoding mode and the quantization parameter of each macroblock included in the frame. The encoding mode extraction unit 811 extracts the encoding modes, frame by frame, from the encoding information stored in the encoding information buffer 810 and stores them in the encoding mode memory 812. For each foreground frame selected by the foreground frame selection unit 502, the encoding mode comparison unit 813 reads from the encoding mode memory 812 the encoding mode of the pixel corresponding to the point selected by the point selection unit 501, and compares the modes that it has read.
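Because the encoding information is held per macroblock while the comparison is made per pixel, some mapping from a pixel position to its macroblock is needed. The following is a minimal sketch of such a lookup, not the implementation of the encoding mode memory 812. It assumes 16x16 macroblocks stored in plain raster-scan order (no interlaced or slice-group reordering); the names `MacroblockInfo`, `macroblock_index`, and `lookup`, and the mode labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

MB_SIZE = 16  # an H.264 macroblock covers 16x16 luma samples


@dataclass(frozen=True)
class MacroblockInfo:
    """Per-macroblock encoding information (field names are illustrative)."""
    mode: str  # e.g. "I_PCM", "INTRA", "INTER"
    qp: int    # quantization parameter of the macroblock


def macroblock_index(x: int, y: int, frame_width: int) -> int:
    """Raster-scan index of the macroblock that contains pixel (x, y)."""
    mbs_per_row = (frame_width + MB_SIZE - 1) // MB_SIZE  # round up
    return (y // MB_SIZE) * mbs_per_row + (x // MB_SIZE)


def lookup(info: Sequence[MacroblockInfo], x: int, y: int,
           frame_width: int) -> MacroblockInfo:
    """Return the encoding information that applies to pixel (x, y)."""
    return info[macroblock_index(x, y, frame_width)]
```

With a 1920-pixel-wide frame there are 120 macroblocks per row, so pixel (0, 16) falls in macroblock 120.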

The foreground pixel selection unit 814 selects the foreground image and pixel to be used for mapping based on the selection state of foreground images by the foreground frame selection unit 502, the comparison result of the encoding mode comparison unit 813, and the comparison result of the quantization parameter comparison unit 505. When the foreground frame selection unit 502 has selected a single foreground image, the foreground pixel selection unit 814 selects the pixel of that foreground image corresponding to the selected point as the pixel to be used for texture mapping. On the other hand, when the foreground frame selection unit 502 has selected a plurality of foreground images, the foreground pixel selection unit 814 selects a camera identifier and the pixel position corresponding to the selected point based on the comparison results of the quantization parameter comparison unit 505 and the encoding mode comparison unit 813. The foreground frame buffer 506 provides the pixel values of the foreground image selected by the foreground pixel selection unit 814 to the foreground mapping unit 507a and the pixel synthesis unit 806. When a plurality of pixel values are selected, the pixel synthesis unit 806 combines them to generate the pixel value used for the mapping process. The combining method used by the pixel synthesis unit 806 is described in detail later.

The foreground mapping unit 507a projects the point selected by the point selection unit 501 onto the image of the virtual camera, based on the coordinate information of the point input from the point selection unit 501 and on the virtual camera information, and calculates the corresponding pixel position. The foreground mapping unit 507a then maps the pixel value read from the foreground frame buffer 506, or the pixel value combined by the pixel synthesis unit 806, to the calculated pixel position of the virtual camera.
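The projection of a model point onto the virtual camera image can be illustrated as follows. This is a sketch under assumptions: the description does not state the camera model, so a simple pinhole model is assumed here, with the virtual camera information taken to consist of a rotation matrix, a translation vector, focal lengths `fx`, `fy`, and a principal point `cx`, `cy`. The function name `project_point` and all parameter names are hypothetical.

```python
def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D model point to a pixel position of the virtual camera.

    Assumes a pinhole model: camera coordinates are R * X + t, followed by
    perspective division. Returns None for points behind the camera.
    """
    # Transform the world-space point into camera coordinates.
    xc, yc, zc = (
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if zc <= 0:
        return None  # not in front of the virtual camera
    # Perspective division and conversion to pixel coordinates.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return (int(round(u)), int(round(v)))
```

A point on the optical axis lands on the principal point, which gives a quick sanity check of the parameters.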

FIG. 9 is a flowchart showing the operation of the synthesizer 273a according to the second embodiment. In FIG. 9, steps that perform the same processing as in the first embodiment (FIG. 5) and its modification (FIG. 6) are denoted by the same reference numerals. Steps S600 to S608 represent a processing loop that performs the processing for all points of the point group constituting the three-dimensional model. The loop may either process every point of the point group or identify and select only the visible points; the choice is not particularly limited.

In step S900, the foreground frame selection unit 502 determines whether only one foreground image was selected in step S603, and outputs the result to the quantization parameter comparison unit 505 and the encoding mode comparison unit 813. If one foreground image was selected, the process proceeds to step S605; if a plurality of foreground images were selected, the process proceeds to step S901. In step S605, the pixel of the selected foreground image is selected. That is, the foreground pixel selection unit 814 provides the foreground frame buffer 506 with the pixel position corresponding to the selected point in the selected foreground image, and the foreground frame buffer 506 outputs that pixel value to the foreground mapping unit 507a.

In step S901, the encoding mode comparison unit 813 reads from the encoding mode memory 812, for each foreground image selected in step S603, the encoding mode that applied when the pixel corresponding to the selected point was decoded. The encoding mode comparison unit 813 determines whether there is a foreground image in which the encoding mode of the pixel corresponding to the selected point is the I_PCM mode. If it is determined in step S901 that such a foreground image exists, the process proceeds to step S902; if it is determined that no such foreground image exists, the process proceeds to step S606.

In step S606, the quantization parameter comparison unit 505 compares the quantization parameters and inputs, to the foreground pixel selection unit 814, the camera identifier of the foreground image decoded with the smallest quantization parameter. The foreground pixel selection unit 814 selects, from the foreground frame buffer 506, the pixel value of the foreground image to be used for texture mapping based on the camera identifier and the pixel position. In step S902, the foreground pixel selection unit 814 determines whether there are a plurality of foreground images in which the encoding mode of the pixel corresponding to the selected point is the I_PCM mode. If there is only one such foreground image, the process proceeds to step S903; if there are a plurality of such foreground images, the process proceeds to step S904.

In step S903, the foreground pixel selection unit 814 selects, from the foreground frame buffer 506, the pixel of the foreground image to be used for texture mapping, based on the camera identifier of the camera that captured the foreground image in which the pixel corresponding to the selected point was decoded in the I_PCM mode. In step S904, on the other hand, the foreground pixel selection unit 814 acquires all of the applicable camera identifiers for the plurality of foreground images in which the pixel corresponding to the selected point was decoded in the I_PCM mode. The foreground pixel selection unit 814 selects a plurality of pixels from the foreground frame buffer 506 based on the acquired camera identifiers and the pixel positions corresponding to the selected point. The pixel synthesis unit 806 combines the selected pixels. As the combining method of the pixel synthesis unit 806, for example, the arithmetic mean of the pixel values can be used.

In step S624, texture mapping onto the three-dimensional model is performed using the pixel value of the pixel selected in step S605, step S606, or step S903, or the pixel value combined in step S904.
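The branch structure of steps S900 to S904 can be summarized in a short sketch. It is an illustration of the decision order described above, not the actual implementation of the synthesizer 273a. The `Candidate` record, the mode label strings, and the function name are hypothetical, and the tie-breaking rule when several images share the smallest quantization parameter (the first one wins here) is an assumption that the description does not specify.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One selected foreground image, seen at the pixel for the chosen point."""
    camera_id: str
    mode: str                    # "I_PCM", "INTRA" or "INTER" (illustrative)
    qp: int                      # quantization parameter of the macroblock
    pixel: Tuple[int, int, int]  # decoded pixel value


def select_pixel_value(candidates: Sequence[Candidate]) -> Tuple[int, ...]:
    if not candidates:
        raise ValueError("no foreground image covers the selected point")
    if len(candidates) == 1:                      # S900 -> S605
        return candidates[0].pixel
    pcm = [c for c in candidates if c.mode == "I_PCM"]  # S901
    if not pcm:                                   # S606: smallest QP wins
        return min(candidates, key=lambda c: c.qp).pixel
    if len(pcm) == 1:                             # S902 -> S903
        return pcm[0].pixel
    # S904: arithmetic mean of all I_PCM pixels, channel by channel.
    n = len(pcm)
    return tuple(round(sum(c.pixel[i] for c in pcm) / n) for i in range(3))
```

Treating the I_PCM test as a filter before the quantization parameter comparison keeps the two criteria independent, which matches the order of the flowchart steps.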

As described above, according to the second embodiment, when a plurality of foreground images are used to generate a virtual viewpoint image, a foreground image with little degradation due to encoding is selected preferentially, so a high-quality virtual viewpoint image can be generated without extracting feature amounts or the like from the images.

In the second embodiment, the code amount is adjusted using both the encoding mode and the quantization parameter, but this is not limiting. I_PCM-mode pixels in the foreground image data may be used preferentially and, if there are no I_PCM-mode pixels, the average value of the pixels in the other encoding modes may be used.

Further, in the second embodiment, the encoding modes that are compared are the I_PCM mode and all other modes, but this is not limiting. It is generally known that a decoded image in the Intra mode has higher image quality than a decoded image in the Inter mode. In the Inter mode, prediction is performed from an image that already contains coding degradation, so coding errors tend to accumulate and the image quality tends to be low. Therefore, decoded images in the Intra mode may be used preferentially.
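Preferring Intra-mode pixels over Inter-mode pixels amounts to ranking the encoding modes. The following sketch shows one way to express that. Note that the three-level ordering (I_PCM, then Intra, then Inter) is an assumption that merges this variation with the I_PCM rule of the second embodiment; the description itself only states each preference separately. The names `MODE_RANK` and `best_mode_cameras` are hypothetical.

```python
# Lower rank means less expected coding degradation (assumed ordering).
MODE_RANK = {"I_PCM": 0, "INTRA": 1, "INTER": 2}


def best_mode_cameras(candidates):
    """Return the camera identifiers whose pixel has the best-ranked mode.

    `candidates` is a sequence of (camera_id, mode) pairs for the pixel
    that corresponds to the selected point.
    """
    best = min(MODE_RANK[mode] for _, mode in candidates)
    return [cid for cid, mode in candidates if MODE_RANK[mode] == best]
```

If more than one camera remains after this filter, a further criterion such as the quantization parameter could be applied to the remaining cameras.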

In the second embodiment, the decoder 271 is included in the encapsulator 210, but this is not limiting; for example, it may be included in the back-end server 270 as in the first embodiment. Further, in step S904 of the second embodiment the pixel values are combined, and weighting such as that described in the modification of the first embodiment may be applied when doing so.

<Third embodiment>
In the second embodiment, a configuration was described in which the foreground image used for texture mapping is determined based on the encoding method (encoding mode) contained in the encoding information. In the third embodiment, a configuration is described in which the foreground image used for texture mapping is additionally selected based on the distance between the camera and the subject in the image.

FIG. 10 is a block diagram illustrating a configuration example of the image processing system 100 according to the third embodiment. In FIG. 10, blocks having the same functions as those in the first embodiment (FIG. 1) or the second embodiment (FIG. 7) are denoted by the same reference numerals. The database 250b has a camera setting information storage unit 251 that stores the camera setting information of each camera. The camera setting information includes at least one of the position, direction, and angle of view of the camera identified by the camera identifier. The back-end server 270b reads the camera setting information from the camera setting information storage unit 251 of the database 250b.

Next, the virtual viewpoint image generation processing according to the third embodiment will be described. As in the first and second embodiments, in the third embodiment the camera 112 of the sensor system 110 captures an image, and the encoder 121 encodes the image captured by the camera 112 and outputs the encoded image and the encoding information. In the third embodiment, however, encoding is performed while adjusting the encoding mode based on the amount of code data transmitted from the other sensor systems and the characteristics of the image input from the camera 112. The encoder 121 controls these encoding modes and the quantization parameter to quantize the transform coefficients and adjust the code amount. The encoding mode that was used is itself encoded, as encoding information, using the code described above. The code data obtained by the encoding is packetized and transmitted to the image computing server 200 via the network 170 and the switching hub 180.

The encapsulators 210a and 210b of the image computing server 200 receive, from the sensor system 110, code data that includes the encoded image data. If the received code data is an image, the encapsulators 210a and 210b decode the packetized code data with the decoders 271a and 271b, assemble the reconstructed image data into units of one frame, and output it to the front-end server 230a.

The front-end server 230a reads the image data and the encoding information, and the model generator 272a aggregates the image data by time and generates a three-dimensional model. The generated three-dimensional model is stored in the database 250b together with the decoded foreground images and the encoding information. The model generator 272a also outputs the position of the three-dimensional model within the stadium (referred to as object position information) to the database 250b. As in the second embodiment, information such as a camera identifier for specifying the camera 112 of the sensor system 110 and a frame time for synchronization is added. The database 250b also acquires the camera setting information and stores it in the camera setting information storage unit 251. The database 250b stores the three-dimensional model for each time, the decoded foreground image data and encoding information of each camera for each time, and the camera setting information.

The user sets virtual camera information using the virtual camera operation UI 282. The set virtual camera information is output to the database 250b and the back-end server 270b. In the database 250b, as in the second embodiment, the foreground images needed to generate the virtual viewpoint image seen from the virtual camera are selected based on the three-dimensional model, the virtual camera information, and the camera setting information. The database 250b outputs these data to the back-end server 270b for each time.

The encoding information, decoded foreground images, and three-dimensional model input to the back-end server 270b are input to the synthesizer 273b. The back-end server 270b also reads in advance, from the camera setting information storage unit 251 of the database 250b, the camera setting information (camera position information) of the sensor systems 110 that captured the foreground images used for texture mapping. The camera setting information storage unit 251 may instead be included in the back-end server 270b.

FIG. 11 is a detailed block diagram of the synthesizer 273b according to the third embodiment. In FIG. 11, blocks having the same functions as those in the first embodiment (FIG. 4) and the second embodiment (FIG. 7) are denoted by the same reference numerals.

The model buffer 500a stores the input three-dimensional model for each frame. At that time, it also stores the object position information of the three-dimensional model in frame units. The distance calculation unit 1101 receives the position information of each camera from the camera setting information storage unit 251 of the database 250b and calculates the distance between each camera and the object using the camera position information and the object position information. The foreground pixel selection unit 814a selects the pixel of the foreground image to be used for mapping based on the results from the encoding mode comparison unit 813, the quantization parameter comparison unit 505, and the distance calculation unit 1101. The pixel selection method is described in detail later. Based on these results, the foreground pixel selection unit 814a selects the camera identifier of a foreground image and selects, from the foreground frame buffer 506, the pixel value at the pixel position in that foreground image corresponding to the selected point. The foreground frame buffer 506 outputs the selected pixel value to the foreground mapping unit 507b.

FIG. 12 is a flowchart showing the operation of the synthesizer 273b according to the third embodiment. In FIG. 12, steps that perform the same processing as in the first embodiment (FIG. 5) and the second embodiment (FIG. 9) are denoted by the same step numbers.

Steps S600 to S608 represent a processing loop. This loop performs the processing for all points of the point group constituting the three-dimensional model. The loop may either process every point of the point group or identify and select only the visible points; the choice is not particularly limited.

In step S902, the foreground pixel selection unit 814a determines whether, among the foreground images selected in step S603, there are a plurality of foreground images in which the encoding mode of the pixel corresponding to the selected point is the I_PCM mode. If it is determined in step S902 that there is only one such foreground image, the process proceeds to step S903; if it is determined that there are a plurality, the process proceeds to step S1204.

In step S1204, the foreground pixel selection unit 814a selects the pixel of the foreground image captured by the camera closest to the selected point of the three-dimensional model. The processing of step S1204 is described in more detail below. The foreground frame selection unit 502 notifies the distance calculation unit 1101 of the camera identifiers of the cameras that acquired the foreground images selected in step S603 and of the point of the three-dimensional model selected by the point selection unit 501. The distance calculation unit 1101 acquires, from the camera setting information storage unit 251, the position information of each camera specified by a camera identifier. The distance calculation unit 1101 also acquires the position of the selected point of the three-dimensional model from the object position information held in the model buffer 500a. The distance calculation unit 1101 calculates the distance between the position of the point acquired from the object position information and the position of the camera, and outputs the result to the foreground pixel selection unit 814a. The foreground pixel selection unit 814a selects the pixel corresponding to the selected point in the foreground image captured by the camera for which the distance calculated by the distance calculation unit 1101 is shortest (the camera closest to the selected point). The foreground frame buffer 506 provides the pixel selected by the foreground pixel selection unit 814a to the foreground mapping unit 507b.
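The nearest-camera choice of step S1204 can be sketched as follows. This is a minimal illustration that assumes camera positions and the selected point are given as 3D coordinates in a common coordinate system and that the distance is Euclidean; the description does not specify the metric. The function name `nearest_camera` and the dictionary layout are hypothetical.

```python
import math


def nearest_camera(point, camera_positions):
    """Return the identifier of the camera closest to the selected point.

    `camera_positions` maps a camera identifier to its (x, y, z) position;
    only the cameras whose foreground images were selected in step S603
    (and, in step S1204, decoded in I_PCM mode) should be passed in.
    """
    return min(
        camera_positions,
        key=lambda cid: math.dist(camera_positions[cid], point),
    )
```

For example, with one camera 10 units away and another 5 units away from the point, the identifier of the second camera is returned.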

In step S607, the foreground mapping unit 507b projects the point onto the image of the virtual camera, based on the coordinate information of the point input from the point selection unit 501 and on the virtual camera information, and calculates the pixel position in the image to be output. The foreground mapping unit 507b texture-maps the pixel value provided from the foreground frame buffer 506 to the calculated pixel position of the virtual camera.

With the above configuration and operation, when foreground images from a plurality of cameras are used to generate a virtual viewpoint image, an image with little degradation due to encoding is selected according to the encoding mode, so a high-quality virtual viewpoint image can be generated without extracting feature amounts or the like from the images. In addition, when foreground images of the same quality exist, selecting the one from the closer camera picks the image with less loss of resolution due to blur, so a still higher-quality virtual viewpoint image can be generated.

<Modification>
The third embodiment provides a configuration in which the foreground image used for texture mapping is selected using the distance between the camera and the selected point. Such distance-based selection of the foreground image can also be applied, for example, to the selection of the foreground image based on the quantization parameter described in the first embodiment. FIG. 14 is a flowchart showing the operation of the synthesizer 273b according to the modification. Here, a method of selecting an image based on the distance information, the encoding mode, and the encoding parameters is described. Accordingly, the functional configuration of the synthesizer 273b is the same as that of FIG. 11, except that the encoding mode extraction unit 811, the encoding mode memory 812, and the encoding mode comparison unit 813 can be omitted.

In step S1401, the foreground pixel selection unit 814a uses the calculation result of the distance calculation unit 1101 and the comparison result of the quantization parameter comparison unit 505 to select the foreground image with the smallest quantization parameter and the foreground image for which the distance between the selected point and the camera is smallest (the closest foreground image). In step S1402, the foreground pixel selection unit 814a acquires from the distance calculation unit 1101 the distance between the camera and the selected point for both the foreground image with the smallest quantization parameter and the closest foreground image. FIG. 13 is a diagram illustrating an example of the positional relationship between a point of the object 320 and the cameras 112a and 112b. Let dA be the distance between the point on the object 320 and the camera 112a, and let dB be the distance between that point and the camera 112b. The distance calculation unit 1101 calculates the distances dA and dB from the position of the point on the object and the camera position information. Here, the distance dA is the camera-to-point distance for the closest foreground image, and the distance dB is the camera-to-point distance for the foreground image with the smallest quantization parameter.

In step S1403, the foreground pixel selection unit 814a compares the distances dA and dB using a predetermined coefficient β. More specifically, the value of dA / dB is compared with the coefficient β. If dA / dB is smaller than β, the process proceeds to step S1405; otherwise, the process proceeds to step S1404. In step S1404, the foreground pixel selection unit 814a selects a pixel of the foreground image having the smallest quantization parameter. In step S1405, the foreground pixel selection unit 814a selects a pixel of the closest foreground image. In step S607, the foreground mapping unit 507a maps the pixel value of the virtual viewpoint image using the pixel value of the pixel selected in any of steps S605, S1404, and S1405.
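The decision in steps S1401 to S1405 can be sketched as follows. This is only a minimal illustration, not the implementation of the foreground pixel selection unit 814a: the `Candidate` structure, its field names, and the function name `select_pixel` are hypothetical, and the distances are assumed to be strictly positive.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One foreground-image candidate for the selected point (hypothetical structure)."""
    camera_id: str
    distance: float  # camera-to-point distance, e.g. dA or dB
    qp: int          # quantization parameter used for the corresponding pixel
    pixel: tuple     # pixel value corresponding to the selected point


def select_pixel(candidates, beta):
    """Choose between the closest image and the minimum-QP image (S1401 to S1405)."""
    if not candidates:
        raise ValueError("at least one candidate image is required")
    if any(c.distance <= 0 for c in candidates):
        raise ValueError("distances are assumed to be positive")

    # S1401: the closest foreground image and the one with the smallest QP.
    closest = min(candidates, key=lambda c: c.distance)
    min_qp = min(candidates, key=lambda c: c.qp)

    # S1402: distances for the two images (dA: closest, dB: smallest QP).
    d_a = closest.distance
    d_b = min_qp.distance

    # S1403: compare the ratio dA / dB with the coefficient beta.
    if d_a / d_b < beta:
        return closest  # S1405: use the closest foreground image
    return min_qp       # S1404: use the image with the smallest QP
```

For example, with dA = 10 and dB = 40 the ratio is 0.25, so a coefficient β of 0.5 favors the closer camera 112a, while a β of 0.2 keeps the image with the smaller quantization parameter.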

With the above configuration and operation, even when there are a plurality of high-quality foreground images, selecting the one captured from the shorter distance yields an image with less loss of resolution due to blur, so that a higher-quality virtual viewpoint image can be generated.

In the second and third embodiments, the code amount is adjusted using both the encoding mode and the quantization parameter, but the present invention is not limited to this. In the foreground image data to be used, pixels in the I_PCM mode may be used preferentially, and if there are no pixels in the I_PCM mode, the average value of the pixels in the other encoding modes may be used. In the second and third embodiments, the decoders 271a and 271b are included in the encapsulators 210a and 210b, but the present invention is not limited to this; for example, they may be included in the back-end server 270 as in the first embodiment. Further, in the synthesis of pixel values by the pixel synthesis unit 806 (second embodiment), a weighted average according to the distance information calculated by the distance calculation unit 1101 (third embodiment) may be used.
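The I_PCM-first variation mentioned above might look like the following sketch. The function name `combine_pixels`, the mode strings, and the representation of a pixel as a tuple of channel values are assumptions made for illustration. The text does not say what to do when several I_PCM pixels are available, so averaging them here is also an assumption.

```python
def combine_pixels(samples):
    """Prefer I_PCM pixels; otherwise average the pixels of the other modes.

    samples: list of (encoding_mode, pixel) pairs, where pixel is a tuple of
    channel values (hypothetical representation).
    """
    if not samples:
        raise ValueError("at least one sample is required")

    # Pixels coded in I_PCM mode are used preferentially.
    pcm = [pixel for mode, pixel in samples if mode == "I_PCM"]

    # If no I_PCM pixel exists, fall back to the pixels of all other modes.
    chosen = pcm if pcm else [pixel for _, pixel in samples]

    # Channel-wise average of the chosen pixels (assumption when several
    # I_PCM pixels are present).
    count = len(chosen)
    return tuple(sum(channel) / count for channel in zip(*chosen))
```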

<Other embodiments>
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or an apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

As described above, according to the above-described embodiments, a virtual viewpoint image can be generated easily regardless of the scale of the devices constituting the system, such as the number of cameras 112, and regardless of the output resolution, output frame rate, and the like of the captured images. Although the embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.


110: Sensor system, 111: Microphone, 112: Camera, 113: Pan head, 120: Camera adapter, 121: Encoder, 180: Switching hub, 190: End user terminal, 230, 930: Front-end server, 250: Database

Claims

An image processing apparatus that generates a virtual viewpoint image observed from a virtual viewpoint using a plurality of images obtained from a plurality of cameras, the apparatus comprising:
selection means for selecting, from the plurality of images, an image usable for texture mapping of a portion of the virtual viewpoint image, based on a positional relationship between the virtual viewpoint and the plurality of cameras;
determining means for determining how to use the selected image for the texture mapping, based on encoding information representing the encoding used for the image selected by the selection means; and
generating means for executing the texture mapping using the selected image in accordance with the usage determined by the determining means, and generating an image of the portion of the virtual viewpoint image.

The image processing apparatus according to claim 1, wherein the selection unit selects an image usable for the texture mapping for each point of a point group forming a three-dimensional model of the subject generated based on the plurality of images.

The image processing apparatus according to claim 2, wherein the selection unit sets, as a processing target, a point viewed from the virtual viewpoint, out of a group of points constituting the three-dimensional model.

The image processing apparatus according to claim 1, wherein, when a plurality of selected images are obtained by the selecting means, the determining means determines, from the plurality of selected images, an image to be used for the texture mapping based on encoding information of each of the plurality of selected images.

The image processing apparatus according to claim 4, wherein the encoding information includes a quantization parameter, and the determining unit determines an image to be used for the texture mapping such that an image having the smallest quantization parameter among the images selected by the selecting unit is used.

The image processing apparatus according to claim 4, wherein the encoding information includes an encoding mode indicating whether or not the encoding is lossless, and the determining unit determines an image to be used for the texture mapping such that an image whose encoding mode is lossless is used.

The image processing apparatus according to claim 4, wherein the encoding information includes an encoding mode indicating whether the mode is an Intra mode or an Inter mode, and the determining unit determines an image to be used for the texture mapping such that an image whose encoding mode is the Intra mode is used.

The image processing apparatus according to claim 7, wherein the encoding information indicates whether the encoding mode is the Intra mode and whether the encoding mode is the I_PCM mode, and the determining unit determines an image to be used for the texture mapping such that an image whose encoding mode is the I_PCM mode is used.

The image processing apparatus according to any one of claims 1 to 3, wherein, when a plurality of selected images are obtained by the selecting unit, the determining unit determines weights for the plurality of selected images based on the encoding information, and the generation unit performs the texture mapping using a pixel value obtained by combining pixel values of the plurality of selected images using the weights determined by the determining unit.

The image processing apparatus according to claim 9, wherein the determining means classifies the plurality of selected images into images to be used preferentially and other images based on the encoding information, and, where M is the number of the plurality of selected images, N is the number of images to be used preferentially, and α is a predetermined coefficient, the weight W0 of each image to be used preferentially and the weight W1 of each other image are:
W0 = α / ((M−N) + Nα)
W1 = 1 / ((M−N) + Nα)
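As a worked check of these weights, the sketch below computes W0 and W1 and confirms that the N preferred images and the M−N other images together receive a total weight of 1. The function name `mapping_weights` is hypothetical, and α is assumed to be positive so that the denominator is never zero.

```python
def mapping_weights(m, n, alpha):
    """Return (W0, W1) for M selected images, N of which are preferred.

    W0 = alpha / ((M - N) + N * alpha), W1 = 1 / ((M - N) + N * alpha).
    alpha is assumed to be positive (an assumption, not stated in the text).
    """
    if m <= 0 or not 0 <= n <= m:
        raise ValueError("require M > 0 and 0 <= N <= M")
    if alpha <= 0:
        raise ValueError("alpha is assumed to be positive")

    denominator = (m - n) + n * alpha
    w0 = alpha / denominator  # weight of each preferentially used image
    w1 = 1.0 / denominator    # weight of each other image
    return w0, w1


# Example: 4 selected images, 1 preferred, alpha = 2.
w0, w1 = mapping_weights(4, 1, 2.0)
total = 1 * w0 + (4 - 1) * w1  # total weight over all selected images
```

With α greater than 1 each preferred image counts α times as much as each other image, and with α equal to 1 the combination reduces to a plain average.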

The image processing apparatus according to claim 10, wherein the encoding information includes a quantization parameter, and the determining unit classifies an image having the smallest quantization parameter among the plurality of selected images as an image to be used preferentially.

The image processing apparatus according to claim 10, wherein the encoding information includes information indicating whether an encoding mode is an Intra mode, and the determination unit classifies an image whose encoding mode is the Intra mode as an image to be used preferentially.

The image processing apparatus according to claim 1, wherein, when there are a plurality of images determined to be used preferentially based on the encoding information, the determining unit determines that a pixel value obtained by combining the pixel values of the pixels corresponding to the portion in the plurality of images determined to be used preferentially is used.

The image processing apparatus according to claim 13, wherein the combining is averaging.

The image processing apparatus according to claim 13, wherein the combining is a weighted average using weights according to the distance between the portion and each camera.

The image processing apparatus according to any one of claims 1 to 3, wherein, when there are a plurality of images determined to be used preferentially based on the encoding information, the determining unit determines that an image captured by the camera closest to the portion is used for generating the virtual viewpoint image.

A method of controlling an image processing apparatus that generates a virtual viewpoint image observed from a virtual viewpoint using a plurality of images obtained from a plurality of cameras, the method comprising:
a selection step of selecting, from the plurality of images, an image usable for texture mapping of a portion of the virtual viewpoint image, based on a positional relationship between the virtual viewpoint and the plurality of cameras;
a determining step of determining how to use the selected image for the texture mapping, based on encoding information representing the encoding used for the image selected in the selection step; and
a generating step of performing the texture mapping using the selected image according to the usage determined in the determining step, and generating an image of the portion of the virtual viewpoint image.

An image processing system comprising:
a plurality of cameras, each of which captures an image, encodes it using encoding selected by rate control, and outputs the result; and
an image processing apparatus that generates a virtual viewpoint image observed from a virtual viewpoint using a plurality of images obtained from the plurality of cameras,
wherein the image processing apparatus comprises:
selection means for selecting, from the plurality of images, an image usable for texture mapping of a portion of the virtual viewpoint image, based on a positional relationship between the virtual viewpoint and the plurality of cameras;
determining means for determining how to use the selected image for the texture mapping, based on encoding information representing the encoding used for the image selected by the selection means; and
generating means for performing the texture mapping using the selected image according to the usage determined by the determining means, and generating an image of the portion of the virtual viewpoint image.

A program for causing a computer to function as each unit of the image processing apparatus according to claim 1.