JP7578209B1 - Image generation system, image generation method, and image generation program

Info

Publication number: JP7578209B1
Application number: JP2024065475A
Authority: JP
Inventors: 京二郎永野; 駿之介片岡; 裕之鎌田; 直樹西田; 佑樹田中; 敬太望月; 壮一郎稲谷; 恭則鎌田
Original assignee: Sony Corp; Sony Group Corp
Current assignee: Sony Corp; Sony Group Corp
Priority date: 2024-04-15
Filing date: 2024-04-15
Publication date: 2024-11-06
Anticipated expiration: 2044-04-15
Also published as: JP2025162283A; WO2025220361A1

Abstract

[Problem] To acquire video data in response to a query input by a user.
[Solution] The video generation system disclosed herein comprises an acquisition unit that acquires an input query related to video generation from a user, a scenario generation unit that generates scenario data related to video generation based on the input query, a code generation unit that generates code for constructing 3D data based on the scenario data, and a video data acquisition unit that acquires video data based on the code.
[Selected figure] Figure 3

Description

The present disclosure relates to a video generation system, a video generation method, and a video generation program.

Techniques for automatically generating video (also referred to as "moving images") have been provided. For example, a technique has been provided that estimates the three-dimensional posture of a virtual person from a two-dimensional line drawing and generates a video (see, for example, Patent Document 1).

Patent Document 1: JP 2003-058906 A

However, the conventional technology leaves room for improvement. For example, it requires a two-dimensional line drawing, that is, an image, in order to generate a video. Preparing such an image places a heavy burden on the user, and generating a video is difficult when the user cannot prepare one. It is therefore desirable to provide a video generation service that places little burden on the user and offers high usability, for example by acquiring video data in response to an input query from the user.

Accordingly, the present disclosure proposes a video generation system, a video generation method, and a video generation program capable of acquiring video data in response to an input query from a user.

To solve the above problem, a video generation system according to one aspect of the present disclosure includes: an acquisition unit that acquires, from a user, an input query related to video generation; a scenario generation unit that generates scenario data related to video generation based on the input query; a code generation unit that generates code for constructing 3D data based on the scenario data; and a video data acquisition unit that acquires video data based on the code.
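The four units named above can be read as a linear pipeline. The following Python sketch is illustrative only: the function names, the `Scenario` structure, and the stub bodies are assumptions made for this example and are not taken from the disclosure, which leaves the concrete implementation (for instance, model-based generation) open.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario data: an ordered list of cut descriptions."""
    cuts: list[str] = field(default_factory=list)


def acquire_query(raw: str) -> str:
    """Acquisition unit: obtain the user's input query (here, trimmed text)."""
    return raw.strip()


def generate_scenario(query: str) -> Scenario:
    """Scenario generation unit (stub): a real system might call an AI model here."""
    return Scenario(cuts=[f"Opening shot for: {query}", f"Closing shot for: {query}"])


def generate_code(scenario: Scenario) -> str:
    """Code generation unit (stub): emit code that would construct 3D data."""
    lines = [f"add_cut({i}, {cut!r})" for i, cut in enumerate(scenario.cuts)]
    return "\n".join(lines)


def acquire_video(code: str) -> dict:
    """Video data acquisition unit (stub): stand-in for rendering the 3D data."""
    return {"num_cuts": len(code.splitlines()), "source_code": code}


def run_pipeline(raw_query: str) -> dict:
    """Chain the four units: query -> scenario -> code -> video data."""
    query = acquire_query(raw_query)
    scenario = generate_scenario(query)
    code = generate_code(scenario)
    return acquire_video(code)
```

Keeping each unit as a separate function mirrors the claim structure: any stage could be swapped for a different implementation without touching the others.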

A diagram showing an example of the video generation system of the present disclosure.
A diagram showing an example of the hardware configuration of the video generation system of the present disclosure.
A diagram showing an example of the flow of the video generation processing of the present disclosure.
A diagram showing another example of the flow of the video generation processing of the present disclosure.
A diagram showing an example of the flow of the evaluation processing of the present disclosure.
A diagram showing an example of the processing for generating scenario generation information.
A diagram showing an example of the processing for generating scenario data.
A diagram showing an example of the processing for generating code generation information.
A diagram showing an example of image quality improvement processing.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of the processing for generating sound generation information.
A diagram showing an example of a USD file.
A flowchart showing a processing procedure executed by the video generation system.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of a user interface.
A diagram showing an example of editing processing.
A diagram showing an example of editing processing.
A diagram showing an example of editing processing.
A diagram showing an example of editing processing.
A diagram showing an example of editing processing.
A diagram showing an example of editing processing.
A diagram showing an example of editing processing.
A diagram showing an example of confirmation work during editing.
A flowchart showing a processing procedure executed by the video generation system.
A diagram showing an example of evaluation processing according to range selection.
A conceptual diagram showing an example of processing according to the depth of field.
A diagram showing an example of video during confirmation work.
A diagram showing an example of highlighting a changed portion.
A diagram showing an example of presenting the relationship between cuts.
A diagram showing an example of object selection.
A diagram showing an example of object selection.
A diagram showing an example of processing using reference data.
A flowchart showing the flow of processing in response to a user operation.
A hardware configuration diagram showing an example of a computer that implements the functions of the information processing device.

Embodiments of the present disclosure are described in detail below with reference to the drawings. The video generation system, video generation method, and video generation program according to the present application are not limited by these embodiments. In the following embodiments, identical parts are given identical reference numerals, and duplicate descriptions are omitted.

The present disclosure will be described in the following order.
1. Embodiment
1-1. Overview of the configuration of the video generation system of the present disclosure
1-2. Processing by the video generation system of the present disclosure
1-3. User interface
1-4. Processing examples
1-4-1. Sound generation example
1-4-2. Text/logo generation example
1-4-3. Retraining example
1-4-4. USD update example
1-4-5. Evaluation example
1-4-6. Example of selecting multiple cuts
1-4-7. Example of using time information
1-4-8. Example of responses according to the degree of verbalization
1-4-9. Example of using qualitative values
1-4-10. Example of rendering processing during input
1-4-11. Example of confirmation work during editing
1-4-12. Example of processing according to range selection
1-4-13. Example of processing according to depth of field
1-4-14. Example of playback processing during confirmation work
1-4-15. Highlighting example
1-4-16. Example of presenting relationships between cuts
1-4-17. Example of selecting objects
1-4-18. Example of using 3D models
1-4-19. Example of using reference data
1-4-20. Example of advantages of having 3D data
1-5. Example of the processing flow from the user's perspective
1-6. About AI models
2. Other embodiments
2-1. Other configuration examples
2-2. Others
3. Effects of the present disclosure
4. Hardware configuration

<1. Embodiment>
<1-1. Overview of the configuration of the video generation system of the present disclosure>
FIG. 1 is a diagram showing an example of the video generation system of the present disclosure. The video generation system 1 includes a video generation module 100, an information acquisition module 200, a sensor unit 300, and a client UI display unit 400. Although FIG. 1 shows only one of each component, the video generation system 1 may include a plurality of video generation modules 100, a plurality of information acquisition modules 200, a plurality of sensor units 300, and a plurality of client UI display units 400.

First, the configuration of the video generation module 100, which performs the video generation processing, will be described. The video generation module 100 includes an input text analysis unit 110, a sensor analysis unit 120, a prompt generation unit 130, a video generation unit 140, a sound generation unit 150, a text/logo generation unit 160, a composite editing unit 170, an evaluation unit 180, a client UI module 190, and the like.

The input text analysis unit 110 analyzes input text. For example, the input text analysis unit 110 analyzes text input from the information acquisition module 200. The sensor analysis unit 120 analyzes input sensor information. For example, the sensor analysis unit 120 analyzes sensor information acquired from the information acquisition module 200.

The prompt generation unit 130 generates the various kinds of information needed to generate a video (moving image), including prompts and the like to be input to an AI (Artificial Intelligence) model (also simply referred to as a "model"), which is a machine learning model described later. For example, the prompt generation unit 130 generates a prompt using the user's input and a pre-stored prompt (a template or the like). A prompt is merely one example of the information input to an AI model (model input information). The model input information is not limited to prompts and may take any form, so the "prompt generation unit" may be read as a "model input information generation unit." In FIG. 1, the prompt generation unit 130 includes a scenario-oriented generation unit 131, a video-oriented generation unit 132, a sound-oriented generation unit 133, and a text/logo-oriented generation unit 134.
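As a minimal sketch of combining a user's input with a pre-stored prompt template, the Python below uses the standard-library `string.Template`. The template wording, the `num_cuts` parameter, and the function name are assumptions introduced for illustration; the disclosure does not specify the template contents.

```python
from string import Template

# Hypothetical pre-stored template; the wording is an assumption for this sketch.
SCENARIO_PROMPT_TEMPLATE = Template(
    "You are a video director. Write a scenario with $num_cuts cuts "
    "for the following request:\n$user_input"
)


def build_scenario_prompt(user_input: str, num_cuts: int = 3) -> str:
    """Combine the user's input with the stored template to form model input."""
    return SCENARIO_PROMPT_TEMPLATE.substitute(
        num_cuts=num_cuts, user_input=user_input.strip()
    )
```

Because only the template is scanned for placeholders, a `$` character inside the user's text is passed through unchanged.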

The scenario-oriented generation unit 131 generates various information related to scenario generation. Specifically, it generates the input information to be fed to a model that outputs a scenario. For example, the scenario-oriented generation unit 131 is a scenario generation unit that generates scenario data related to video generation based on an input query. As another example, the scenario-oriented generation unit 131 is a first output unit that, based on the input query, outputs scenario generation information used by the scenario generation unit to generate the scenario data.

The video-oriented generation unit 132 generates various information related to video generation. Specifically, it generates the input information to be fed to a model that outputs code for constructing 3D (three-dimensional) data. For example, the video-oriented generation unit 132 is a code generation unit that generates code for constructing 3D data based on scenario data. As another example, the video-oriented generation unit 132 is a second output unit that, based on the scenario data, outputs code generation information used by the code generation unit to generate the code for constructing the 3D data.

The sound-oriented generation unit 133 generates various information related to the generation of sound information (audio information). Specifically, it generates the input information to be fed to a model that outputs sound. The text/logo-oriented generation unit 134 generates various information related to the generation of text and logos. Specifically, it generates the input information to be fed to a model that outputs at least one of text and a logo.

The video generation unit 140 executes processing related to video generation. The video generation unit 140 is a video acquisition unit that acquires video data based on the code. The video generation unit 140 generates video using the various information generated by the prompt generation unit 130. For example, the video generation unit 140 is a video generation unit that generates video data based on the code. The video generation unit 140 may acquire the video data in any manner. For example, it may transmit the data used for generating the video data to an external service providing device (a vendor or the like) that offers a video data generation service, and acquire the video data by receiving the result generated by that device. In FIG. 1, the video generation unit 140 includes a USD generation unit 141, a rendering unit 142, and a video refinement unit 143.
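The point that video data may be either generated locally or received from an external provider can be sketched as an interchangeable "video source". Everything below is a hypothetical illustration: both functions are stubs, and the external call in particular is a placeholder rather than any real service API.

```python
from typing import Callable

# A "video source" turns generated code into video bytes. Both sources below are
# stubs; the external-service function is a placeholder, not a real API.
VideoSource = Callable[[str], bytes]


def render_locally(code: str) -> bytes:
    """Stand-in for USD generation, rendering, and refinement on this machine."""
    return b"LOCAL:" + code.encode("utf-8")


def request_from_external_service(code: str) -> bytes:
    """Stand-in for sending data to an external provider and receiving video."""
    return b"REMOTE:" + code.encode("utf-8")


def acquire_video_data(code: str, source: VideoSource = render_locally) -> bytes:
    """Acquire video data from whichever source is configured."""
    return source(code)
```

Callers depend only on `acquire_video_data`, so switching between in-house generation and a vendor is a matter of passing a different `source`.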

The USD generation unit 141 generates various information related to USD (Universal Scene Description). For example, the USD generation unit 141 uses the prompt obtained by the video-oriented prompt generation in the video-oriented generation unit 132 to generate USD-Python or the like with an AI model such as a large language model (hereinafter also referred to as an "LLM").
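To give a sense of what code that constructs USD data might produce, the sketch below builds a minimal scene in the human-readable `.usda` text form. It is an assumption-laden illustration: the scene contents, the prim names, and the `build_usda` helper are hypothetical, and code generated in practice would more likely use the `pxr` Python bindings than write the text form directly. Plain string building is used here only to keep the example free of external dependencies.

```python
def build_usda(prim_name: str, radius: float) -> str:
    """Return a minimal USD layer in the human-readable .usda text form."""
    return (
        "#usda 1.0\n"
        "(\n"
        '    defaultPrim = "World"\n'
        ")\n"
        "\n"
        'def Xform "World"\n'
        "{\n"
        f'    def Sphere "{prim_name}"\n'
        "    {\n"
        f"        double radius = {radius}\n"
        "    }\n"
        "}\n"
    )
```

Writing the returned string to a file with a `.usda` extension should yield a layer that USD tools can open, with a single sphere under a `World` transform.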

The rendering unit 142 executes various processes related to rendering. Specifically, it renders the USD generated by the USD generation unit 141.

The video refinement unit 143 executes various processes for refining video. The rendering unit 142 raises the quality of the generated video through the video refinement processing. For example, the video refinement unit 143 is an image quality improvement unit that executes image quality improvement processing for improving the image quality of the video data.

The sound generation unit 150 executes processing for generating sound. Using the prompt obtained by the sound-oriented prompt generation in the sound-oriented generation unit 133, the sound generation unit 150 generates sound information such as BGM (background music), SE (sound effects), narration, and dialogue with an AI model such as a Contrastive Learning Model.

The text/logo generation unit 160 executes processing for generating at least one of text and a logo. It does so using the information generated by the text/logo-oriented generation unit 134.

With the configuration described above, the video generation module 100 combines a pre-stored prompt with the user's input to generate a prompt for scenario generation. It then generates a scenario by inputting that prompt into an AI model such as an LLM. From the generated scenario, the video generation module 100 generates prompts for producing the video, the sound, and the text/logo. Using those prompts, the scenario, and the like, it performs video generation, sound generation, and text/logo generation.
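The fan-out from a single scenario to separate prompts for video, sound, and text/logo could look like the following Python sketch. The `ModalityPrompts` container and the prompt wording are hypothetical; they only illustrate that each downstream generator receives its own prompt derived from the same scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityPrompts:
    """Hypothetical container for the per-modality prompts derived from a scenario."""
    video: str
    sound: str
    text_logo: str


def prompts_from_scenario(scenario: str) -> ModalityPrompts:
    """Derive one prompt per downstream generator from a single scenario text."""
    return ModalityPrompts(
        video=f"Generate code that builds a 3D scene for this scenario:\n{scenario}",
        sound=f"Describe BGM, sound effects, and narration for this scenario:\n{scenario}",
        text_logo=f"Propose on-screen text and a logo for this scenario:\n{scenario}",
    )
```

Deriving all three prompts from one scenario is what keeps the video, audio, and text consistent with each other before they are composited.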

The composite editing unit 170 executes processing related to editing. For example, the composite editing unit 170 combines (composites) the generated video, sound, and text/logo into a single video.

The evaluation unit 180 executes evaluation processing on various targets. Specifically, it evaluates the information generated by the components described above. For example, the evaluation unit 180 generates information indicating an evaluation of at least one of the scenario data and the video data.

The client UI module 190 executes processing related to output on the client-side UI (User Interface). For example, the client UI module 190 generates various information related to output on the client-side UI. In this case, the client UI module 190 executes processing for generating the UI displayed on the user side. The client UI module 190 generates the various information to be displayed on the client UI display unit 400.

The information acquisition module 200 acquires various information. It includes an input text acquisition unit 210, a sensor acquisition unit 220, and the like. The input text acquisition unit 210 acquires text information input via the keyboard 320 or the microphone 330. For example, the input text acquisition unit 210 acquires text information that the user has input with the keyboard 320 or the microphone 330. For example, the input text acquisition unit 210 is an acquisition unit that acquires, from the user, an input query related to video generation.

The sensor acquisition unit 220 acquires information detected by sensors such as the camera 340 or a motion capture device (also referred to as "sensor information"). The information acquisition module 200 provides (transmits) the acquired information to the video generation module 100. The information acquisition module 200 may be integrated with the video generation module 100.

The sensor unit 300 has various sensors. The sensor unit 300 senses user input and accepts operations by the user. For example, the sensor unit 300 is a reception unit that accepts operations related to video editing from the user. For example, the sensor unit 300 includes a mouse 310, a keyboard 320, a microphone 330, a camera 340, an IMU 350 (an inertial measurement unit), and the like. In other words, in addition to the mouse 310 and the keyboard 320, the sensor unit 300 may include a user terminal (a smartphone or the like) equipped with the microphone 330, the camera 340, and the IMU 350, as well as sensors such as motion capture devices, and uses them to sense user input.

The client UI display unit 400 displays various information to be presented to the client (user). It displays the UI generated by the client UI module 190 on the client's display (display device). For example, the client UI display unit 400 is a display control unit that displays a storyboard based on the scenario data. The storyboard is configured to display video data for each cut of the video.
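A storyboard that shows video data per cut can be modeled as a list of panels, one for each cut in the scenario. The Python sketch below is a hypothetical data model: `StoryboardPanel`, its fields, and the use of a URI string to stand for video data are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoryboardPanel:
    """One storyboard entry: a cut from the scenario plus its video, if available."""
    cut_index: int
    description: str
    video_uri: Optional[str] = None  # None until video data exists for this cut


def build_storyboard(cut_descriptions, videos_by_cut):
    """Pair each scenario cut with the video data generated for that cut."""
    return [
        StoryboardPanel(i, text, videos_by_cut.get(i))
        for i, text in enumerate(cut_descriptions)
    ]
```

For example, `build_storyboard(["wide shot", "close-up"], {0: "cut0.mp4"})` yields two panels, the second of which has no video attached yet.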

The video generation system 1 may have a hardware configuration such as that shown in FIG. 2. FIG. 2 is a diagram showing an example of the hardware configuration of the video generation system of the present disclosure. In FIG. 2, the video generation system 1 has, as its hardware configuration, a cloud-side computer 10, a client-side computer 20, a camera/sensor 30 including a camera and various other sensors, and the like. The video generation system 1 may also include an information providing device (computer) that provides the computer 10 with information resources 40, such as training data, and an AI model 50.

The hardware configuration shown in FIG. 2 is merely an example, and the video generation system 1 may adopt any hardware configuration capable of executing the desired processing. For example, the computer 10 and the computer 20 may be integrated. The information resources 40 and the AI model 50 may also be stored inside the computer 10.

The computer 10 includes a CPU (Central Processing Unit) 11, a GPU (Graphics Processing Unit) 12, a communication device 13, and a memory/storage 14. For example, the computer 10 corresponds to the video generation module 100 and the information acquisition module 200 in FIG. 1. The computer 10 may be a service providing device (server device) that provides a video generation service. The CPU 11 and the GPU 12 are so-called processors and execute calculation processing (arithmetic processing) related to various kinds of processing such as video generation.

The communication device 13 has a communication function for transmitting and receiving information to and from the computer 20, the information providing device, and the like, and may be, for example, a communication circuit or a NIC (Network Interface Card). The communication device 13 communicates with other devices such as the computer 20 and the information providing device via a predetermined network (the Internet or the like). For example, the communication device 13 is connected to the predetermined network by wire or wirelessly and exchanges information with those devices.

The memory/storage 14 is a storage device that stores various types of information. It is, for example, a semiconductor memory element such as a RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or an optical disk. The memory/storage 14 stores the various information that processors such as the CPU 11 and the GPU 12 use for processing. The memory/storage 14 may also store the information resources 40, the AI model 50, and the like.

The computer 20 includes a CPU 21, a GPU 22, a communication device 23, a memory/storage 24, and an IO interface 25. For example, the computer 20 corresponds to the client UI display unit 400 in FIG. 1. The computer 20 may be a terminal device (a PC (Personal Computer), a portable device such as a smartphone, or the like) used by a user of the video generation service. The CPU 21 and the GPU 22 are so-called processors and execute calculation processing (arithmetic processing) related to various kinds of processing such as video display. The above is merely an example, and the computer 20 may adopt any configuration capable of the desired processing. For example, the computer 20 may execute such calculation processing with a circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). Instead of storing a program in memory (the memory/storage 24 or the like), the computer 20 may be configured with the program built directly into the processor circuitry. In that case, the processor implements its functions by reading and executing the program built into the circuitry. Each processor in this embodiment need not be a single circuit; multiple independent circuits may be combined into one processor that implements the functions. Like the computer 20, the computer 10 may also adopt any configuration capable of the desired processing.

The communication device 23 has a communication function for transmitting and receiving information to and from the computer 10, the sensor 30, and the like, and may be, for example, a communication circuit or a NIC. The communication device 23 communicates with other devices such as the computer 10 and the sensor 30 via a predetermined network (the Internet or the like). For example, the communication device 23 is connected to the predetermined network by wire or wirelessly and exchanges information with those devices.

The memory/storage 24 is a storage device that stores various types of information. It is, for example, a semiconductor memory element such as a RAM or flash memory, or a storage device such as a hard disk or an optical disk. The memory/storage 24 stores the various information that processors such as the CPU 21 and the GPU 22 use for processing.

The IO interface 25 is an input/output interface device. The computer 20 receives input from the sensor 30 via the IO interface 25. For example, the computer 20 receives input from input devices such as a keyboard and a mouse via the IO interface 25. The computer 20 also outputs information through a display (display device) and a speaker (audio output device) via the IO interface 25. For example, the computer 20 plays back video on the display and the speaker via the IO interface 25.

The various sensors 30, such as a camera, sense user input and accept operations by the user. For example, the sensor 30 corresponds to the sensor unit 300 in FIG. 1. The information resource 40 includes various information such as training data. For example, the information resource 40 includes training data used to train various AI models such as an LLM. The AI model 50 includes information on AI models, such as an LLM, that are used in processing related to video generation. For example, the AI model 50 includes information on various AI models such as models M1 to M3 described below. As described above, the video generation system 1 may have a configuration other than that shown in FIG. 2.

<1-2. Processing by the video generation system of the present disclosure>
From here, the processing by the video generation system will be described. First, an example of the flow of the video generation processing shown in FIG. 3 will be described. FIG. 3 is a diagram showing an example of the flow of the video generation processing of the present disclosure. Note that the processing described below with the video generation system 1 as the processing subject may be performed by any device capable of executing the processing, depending on the device configuration included in the video generation system 1.

図３では「User's Input」と表記するユーザ入力情報ＵＩＮ１は、ユーザが映像生成のために入力した情報（「入力クエリ」ともいう）に対応する。なお、入力クエリは、テキスト（文字情報）に限らず、任意の情報が採用可能である。入力クエリは、テキスト、画像、音声、３Ｄデータのうち少なくとも１つを含む任意の情報であってもよい。 User input information UIN1, denoted as "User's Input" in FIG. 3, corresponds to information input by the user for image generation (also referred to as "input query"). Note that the input query is not limited to text (character information) and any information can be used. The input query may be any information including at least one of text, image, audio, and 3D data.

The video generation system 1 uses the user input information UIN1 to generate scenario generation information (also referred to as "first input information") that is used as input for model M1, denoted as "LLM" in FIG. 3; how this information is generated will be described later. For example, model M1 is a first model that outputs scenario data in response to the input of the first input information. Any AI model, such as an LLM (large language model), can be used for model M1, so long as it is capable of producing the desired output in response to the input. AI models such as model M1 will be described later.

映像生成システム１は、モデルＭ１に第１の入力情報を入力し、モデルＭ１にシナリオデータであるシナリオＦＤ１を出力させることにより、シナリオＦＤ１を生成する。そして、映像生成システム１は、シナリオＦＤ１、シナリオＦＤ１を入力とするモデルＭ２の出力、及びユーザ入力情報ＵＩＮ２等を用いて、モデルＭ３の入力として用いられるコード生成用情報（「第２の入力情報」ともいう）であるＵＳＤ生成必要情報ＳＤ１を生成する。例えば、モデルＭ３は、第２の入力情報の入力に応じてコードを出力する第２のモデルである。モデルＭ３は、入力に応じて所望の出力が可能であれば、ＬＬＭ（大規模言語モデル）等の任意のＡＩモデルが採用可能である。 The video generation system 1 generates a scenario FD1 by inputting first input information to the model M1 and having the model M1 output a scenario FD1, which is scenario data. Then, the video generation system 1 generates information required for USD generation SD1, which is code generation information (also referred to as "second input information") used as input for the model M3, using the scenario FD1, the output of the model M2 which uses the scenario FD1 as input, and user input information UIN2, etc. For example, the model M3 is a second model which outputs a code in response to the input of the second input information. The model M3 can be any AI model, such as an LLM (large scale language model), as long as it is capable of producing the desired output in response to the input.

Note that, although only one piece of information SD1 required for USD generation is illustrated in FIG. 3, there may be multiple pieces of information SD1, for example, depending on the number of USD files to be generated. For example, there may be as many pieces of information SD1 as there are USD files to be generated for a data structure such as that shown in FIG. 18.

例えば、モデルＭ２は、シナリオデータの入力に応じて、そのシナリオデータに対応するテンプレート等を出力するモデルであってもよい。モデルＭ２は、入力に応じて所望の出力が可能であれば、任意のＡＩモデルが採用可能である。例えば、ユーザ入力情報ＵＩＮ２は、映像生成での制約条件の指定等を行うための情報であってもよい。なお、映像生成システム１は、シナリオＦＤ１とテンプレート入力情報とを用いて第２の入力情報を生成してもよいがこの点については後述する。 For example, model M2 may be a model that outputs a template or the like corresponding to scenario data in response to the input of that scenario data. Any AI model can be adopted for model M2 as long as it is capable of providing the desired output in response to the input. For example, user input information UIN2 may be information for specifying constraint conditions for image generation. Note that image generation system 1 may generate second input information using scenario FD1 and template input information, but this point will be described later.

映像生成システム１は、モデルＭ３にＵＳＤ生成必要情報ＳＤ１を入力し、モデルＭ３に図３では「python」と表記するパイソンコードＯＤ１を出力させることにより、パイソンコードＯＤ１を生成する。例えば、パイソンコードＯＤ１は実行によりＵＳＤ形式のデータ（「ＵＳＤファイル」ともいう）を生成する（プログラム）コードである。なお、パイソンは一例に過ぎず、コードは所望の３ＤＣＧ用のデータを生成可能であれば、パイソンに限らず任意の形式のコードが採用可能である。また、ＵＳＤは一例に過ぎず、３ＤＣＧ用のデータであれば、ＦＢＸ（Film Box）等任意の形式が採用可能である。映像生成システム１は、パイソンコードＯＤ１を実行し、図３では「ＵＳＤ」と表記するＵＳＤファイルＯＤ２を生成する。 The video generation system 1 generates Python code OD1 by inputting USD generation necessary information SD1 to the model M3 and having the model M3 output Python code OD1, indicated as "python" in FIG. 3. For example, the Python code OD1 is (program) code that generates USD format data (also called a "USD file") when executed. Note that Python is only an example, and any code format can be used, not limited to Python, as long as it can generate desired 3DCG data. Also, USD is only an example, and any format, such as FBX (Film Box), can be used as long as the data is for 3DCG. The video generation system 1 executes the Python code OD1 to generate a USD file OD2, indicated as "USD" in FIG. 3.
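As a rough illustration of what code such as the Python code OD1 might look like, the following minimal sketch writes a small text-format USD (.usda) file using only the standard library. The function names, the prim names "World" and "Ball", and the output file name are assumptions made for illustration and do not appear in the disclosure; code actually output by the model M3 could equally rely on a dedicated USD library.

```python
from pathlib import Path
import tempfile


def build_usda(radius: float = 1.0) -> str:
    """Return the text of a minimal USD (.usda) stage containing one sphere."""
    # The prim names "World" and "Ball" are hypothetical placeholders.
    return (
        "#usda 1.0\n"
        "(\n"
        '    defaultPrim = "World"\n'
        ")\n"
        "\n"
        'def Xform "World"\n'
        "{\n"
        '    def Sphere "Ball"\n'
        "    {\n"
        f"        double radius = {radius}\n"
        "    }\n"
        "}\n"
    )


def write_usd_file(path: Path) -> Path:
    """Write the stage text to disk; this plays the role of USD file OD2."""
    path.write_text(build_usda(), encoding="utf-8")
    return path


# Executing the code yields the USD file, analogous to executing OD1 to obtain OD2.
usd_path = write_usd_file(Path(tempfile.gettempdir()) / "scene_od2.usda")
```

Executing the script leaves a file that a USD-capable renderer could then load, which corresponds to the step in which the Python code OD1 is executed to obtain the USD file OD2.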

映像生成システム１は、図３では「Renderer」と表記するレンダリング処理ＰＳ１を実行することにより、図３では「PreMovie」と表記する動画データＭＶ１を生成する。例えば、動画データＭＶ１は、後述するリファイン処理ＰＳ２を実行する前のデータ（「第１の動画データ」ともいう）である。 The video generation system 1 generates video data MV1, which is indicated as "PreMovie" in FIG. 3, by executing a rendering process PS1, which is indicated as "Renderer" in FIG. 3. For example, the video data MV1 is data (also referred to as "first video data") before executing a refinement process PS2, which will be described later.

映像生成システム１は、図３では「Refiner」と表記するリファイン処理ＰＳ２を実行することにより、図３では「RefinedMovie」と表記する動画データＭＶ２を生成する。例えば、リファイン処理ＰＳ２は、動画データの画質を改善する画質改善処理である。動画データＭＶ２は、リファイン処理ＰＳ２により第１の動画データである動画データＭＶ１が更新された後のデータ（「第２の動画データ」ともいう）である。 The video generation system 1 generates video data MV2, denoted as "RefinedMovie" in FIG. 3, by executing a refinement process PS2, denoted as "Refiner" in FIG. 3. For example, the refinement process PS2 is a picture quality improvement process that improves the picture quality of the video data. The video data MV2 is data (also referred to as "second video data") obtained after the video data MV1, which is the first video data, is updated by the refinement process PS2.

映像生成システム１は、動画データＭＶ２、ユーザ入力情報ＵＩＮ３等を用いてコンポジット編集ＰＳ３を実行することにより、図３では「FinalMovie」と表記する動画データＭＶ３を生成する。例えば、コンポジット編集ＰＳ３は、ユーザ入力情報ＵＩＮ３が示すユーザの編集指示に応じて、動画データＭＶ２を更新（編集）する処理を実行することにより、動画データＭＶ２が更新された動画データＭＶ３を生成する。 The video production system 1 generates video data MV3, indicated as "FinalMovie" in FIG. 3, by executing composite editing PS3 using video data MV2, user input information UIN3, etc. For example, the composite editing PS3 executes a process of updating (editing) video data MV2 in response to a user's editing instruction indicated by user input information UIN3, thereby generating video data MV3 in which video data MV2 has been updated.
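The flow of FIG. 3 described so far can be summarized as a chain of stages. The following is a minimal sketch under the assumption that each stage is a plain callable taking and returning a string; the function name generate_video, the prompt wording, and the stub stages are hypothetical, and the model M2 and the user input information UIN2 and UIN3 are omitted for brevity.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def generate_video(
    user_query: str,
    scenario_model: Stage,  # stands in for model M1 (scenario generation)
    code_model: Stage,      # stands in for model M3 (code generation)
    run_code: Stage,        # executes code OD1 and returns USD data OD2
    render: Stage,          # rendering process PS1 -> video data MV1
    refine: Stage,          # refinement process PS2 -> video data MV2
    composite: Stage,       # composite editing PS3 -> video data MV3
) -> str:
    """Chain the stages in the order shown in FIG. 3."""
    scenario = scenario_model(f"Write a scenario for: {user_query}")
    code = code_model(f"Write code that builds 3D data for: {scenario}")
    usd = run_code(code)
    pre_movie = render(usd)
    refined_movie = refine(pre_movie)
    return composite(refined_movie)


# Stub stages that only record the order in which they were called.
trace: List[str] = []


def stage(name: str) -> Stage:
    def run(value: str) -> str:
        trace.append(name)
        return f"{name}({value})"
    return run


final_movie = generate_video(
    "a 15-second sneaker commercial video",
    stage("M1"), stage("M3"), stage("exec"),
    stage("PS1"), stage("PS2"), stage("PS3"),
)
```

Passing the stages in as parameters keeps the sketch independent of any particular model or renderer, in line with the statement that any AI model capable of the desired output may be adopted.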

Note that the flow of the video generation process shown in FIG. 3 is merely an example, and the video generation system 1 can adopt any processing mode as long as it can generate video data from a user's input query. For example, although FIG. 3 was described using the case where the model M3 outputs code (Python code), the model M3 may instead be a model that outputs 3DCG data such as a USD file. In addition, the video generation system 1 is not limited to the processing shown in FIG. 3 and may perform various types of video generation processing. An example of this point will be described with reference to FIG. 4. FIG. 4 is a diagram showing another example of the flow of the video generation process of the present disclosure. FIG. 4 differs from FIG. 3 in that, among other things, information SD2 required for sound generation and information SD3 required for text logo generation are generated and the video generation processing is performed using them. Note that description of points similar to those described for FIG. 3 will be omitted as appropriate.

In FIG. 4, the video generation system 1 uses the scenario FD1, the output of the model M2 that takes the scenario FD1 as input, the user input information UIN2, and the like to generate information SD2 required for sound generation, which is sound generation information used as input for a model M4, denoted as "AI" in FIG. 4. For example, the model M4 is a model that outputs various sound data in response to input of the information SD2 required for sound generation and the video data MV2. Any AI model can be used for the model M4 as long as it is capable of producing the desired output in response to the input. Note that the model M4 may be a model that takes only the information SD2 required for sound generation as input.

映像生成システム１は、モデルＭ４に音生成必要情報ＳＤ２を入力し、モデルＭ４にＢＧＭ（BackGround Music）用の音データＡＤ１、ＳＥ用の音データＡＤ２、Narration用の音データＡＤ３等を出力させることにより、映像に対応する音データを生成する。また、映像生成システム１は、ユーザ入力情報ＵＩＮ４を用いて、音データＡＤ１、ＡＤ２、ＡＤ３を生成してもよい。例えば、モデルＭ４が音データＡＤ１、ＡＤ２、ＡＤ３を１つの音データとして出力する場合、映像生成システム１は、ユーザ入力情報ＵＩＮ４での指定に基づいて、モデルＭ４が出力した１つの音データから、音データＡＤ１、ＡＤ２、ＡＤ３を抽出して、音データＡＤ１、ＡＤ２、ＡＤ３を生成してもよい。 The video production system 1 generates sound data corresponding to a video by inputting information SD2 required for sound production to the model M4 and having the model M4 output sound data AD1 for background music (BGM), sound data AD2 for sound effects, sound data AD3 for narration, etc. The video production system 1 may also generate sound data AD1, AD2, AD3 using user input information UIN4. For example, if the model M4 outputs sound data AD1, AD2, AD3 as one piece of sound data, the video production system 1 may extract sound data AD1, AD2, AD3 from the one piece of sound data output by the model M4 based on the specification in the user input information UIN4, and generate the sound data AD1, AD2, AD3.

図４では、映像生成システム１は、シナリオＦＤ１、シナリオＦＤ１を入力とするモデルＭ２の出力、及びユーザ入力情報ＵＩＮ２等を用いて、モデルＭ５の入力として用いられるテキストロゴ生成用情報であるテキストロゴ生成必要情報ＳＤ３を生成する。例えば、モデルＭ５は、テキストロゴ生成必要情報ＳＤ３の入力に応じてテキスト及びロゴのうち少なくとも１つを出力するモデルである。モデルＭ５は、入力に応じて所望の出力が可能であれば、任意のＡＩモデルが採用可能である。 In FIG. 4, the video generation system 1 uses a scenario FD1, the output of a model M2 that uses the scenario FD1 as input, and user input information UIN2 to generate information required for text logo generation SD3, which is information for generating a text logo to be used as input for model M5. For example, model M5 is a model that outputs at least one of text and a logo in response to the input of information required for text logo generation SD3. Any AI model can be used for model M5 as long as it is capable of producing the desired output in response to the input.

映像生成システム１は、モデルＭ５にテキストロゴ生成必要情報ＳＤ３を入力し、モデルＭ５にＴｅｘｔ用のテキストロゴデータＤＩ１、Ｌｏｇｏ用のテキストロゴデータＤＩ２等を出力させることにより、映像に対応するテキストロゴデータを生成する。 Video generation system 1 inputs information SD3 required for text logo generation to model M5, and generates text logo data corresponding to the video by having model M5 output text logo data DI1 for Text, text logo data DI2 for Logo, etc.

The video generation system 1 generates video data MV3 by executing composite editing PS3 using the video data MV2, the sound data AD1, AD2, and AD3, the text logo data DI1 and DI2, the user input information UIN3, and the like. For example, the composite editing PS3 executes a process of combining (compositing) the video data MV2, the sound data AD1, AD2, and AD3, the text logo data DI1 and DI2, and the like into one, thereby generating the video data MV3 as a single video.

FIGS. 3 and 4 described above show an example of processing from an initial state in which there is no scenario, information required for USD generation, USD, PreMovie, RefinedMovie, or the like. As described above, the video generation system 1 generates a prompt for generating a scenario based on the user's input query, provides the prompt to a natural language model to generate the scenario, generates, from the text information described in the scenario, a prompt for outputting code that constructs a video, provides that prompt to the natural language model to generate the code that constructs the video, and generates the video. In this way, with the video generation system 1, when the user inputs what they want to create or their purpose, an effective video storyboard and video are generated even if the user has no knowledge of 3DCG or video production. In addition, in the video generation system 1, organizing the result as a storyboard makes subsequent editing easier. These points will be described in detail later.

また、映像生成システム１は、映像生成に関連する各種の処理を行ってもよい。例えば、映像生成システム１は、生成した情報を対象として評価処理を行ってもよい。この点について、図５を用いて、評価処理の流れの一例について説明する。図５は、本開示の評価処理の流れの一例を示す図である。 The video production system 1 may also perform various processes related to video production. For example, the video production system 1 may perform evaluation processing on the generated information. In this regard, an example of the flow of the evaluation processing will be described with reference to FIG. 5. FIG. 5 is a diagram showing an example of the flow of the evaluation processing of the present disclosure.

図５では、映像生成システム１は、シナリオＦＤ１、ＵＳＤ生成必要情報ＳＤ１、音生成必要情報ＳＤ２、テキストロゴ生成必要情報ＳＤ３のうち少なくとも１つを入力し、モデルＭ１０に、入力された情報についての評価を示す評価テキスト情報ＥＶ１を出力させることにより、生成した情報に対する評価を行う。例えば、モデルＭ１０は、情報の入力に応じて、その入力された情報の評価を出力するモデルである。例えば、モデルＭ１０は、シナリオＦＤ１の入力に応じて、その入力されたシナリオＦＤ１の評価を示す評価テキストを出力する。なお、モデルＭ１０は、シナリオＦＤ１、ＵＳＤ生成必要情報ＳＤ１、音生成必要情報ＳＤ２、テキストロゴ生成必要情報ＳＤ３ごとに入力を受け付けるモデルであってもよいし、これらの情報を組み合わせた入力を受け付けるモデルであってもよい。また、モデルＭ１０は、映像を示す情報（キャプション等）の入力に応じて、その入力された情報に対応する映像の評価を出力するモデルであってもよい。 In FIG. 5, the video generation system 1 inputs at least one of the scenario FD1, information required for USD generation SD1, information required for sound generation SD2, and information required for text logo generation SD3, and causes the model M10 to output evaluation text information EV1 indicating an evaluation of the input information, thereby evaluating the generated information. For example, the model M10 is a model that outputs an evaluation of input information in response to the input of information. For example, the model M10 outputs evaluation text indicating an evaluation of the input scenario FD1 in response to the input of a scenario FD1. Note that the model M10 may be a model that accepts input of each of the scenario FD1, information required for USD generation SD1, information required for sound generation SD2, and information required for text logo generation SD3, or may be a model that accepts input of a combination of these pieces of information. The model M10 may also be a model that, in response to the input of information indicating a video (such as a caption), outputs an evaluation of the video corresponding to the input information.

ここから、上述した処理の流れについて、映像生成システム１が実行する各処理の具体例について記載する。なお、上述した内容と同様の点について適宜説明を省略する。 From here, specific examples of each process executed by the video production system 1 will be described for the above-mentioned process flow. Note that explanations of points similar to those described above will be omitted as appropriate.

例えば、映像生成システム１は、図６に示すように、シナリオ生成用情報（第１の入力情報）を生成する。図６は、シナリオ生成用情報の生成処理の一例を示す図である。図６では、映像生成システム１は、コンテンツＣＴ１にユーザが入力したユーザ入力情報ＩＤＴ１、ＩＤＴ２をユーザの入力情報として取得する。コンテンツＣＴ１は、「どういう動画を作りたいですか？」という質問事項、及び「スタイル」という質問事項の各々に対するユーザの入力情報を受け付けるためのコンテンツである。 For example, the video production system 1 generates scenario generation information (first input information) as shown in FIG. 6. FIG. 6 is a diagram showing an example of a generation process of scenario generation information. In FIG. 6, the video production system 1 acquires user input information IDT1 and IDT2 input by the user to content CT1 as user input information. Content CT1 is content for accepting user input information for each of the questions "What kind of video do you want to create?" and "Style."

例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１を表示し、センサ部３００は、ユーザ入力情報ＩＤＴ１、ＩＤＴ２をユーザ入力情報として受け付ける。例えば、ユーザ入力情報ＩＤＴ１、ＩＤＴ２は、図３及び図４中のユーザ入力情報ＵＩＮ１に対応する。 For example, the client UI display unit 400 displays the content CT1, and the sensor unit 300 accepts the user input information IDT1 and IDT2 as user input information. For example, the user input information IDT1 and IDT2 correspond to the user input information UIN1 in Figures 3 and 4.

図６では、クライアントＵＩ表示部４００は、「どういう動画を作りたいですか？」という質問事項を表示する。センサ部３００は、「どういう動画を作りたいですか？」という質問事項に対しては、「１５秒のスニーカーのＣＭ動画」というユーザ入力情報ＩＤＴ１を受け付ける。また、クライアントＵＩ表示部４００は、「スタイル」という質問事項を表示する。センサ部３００は、「スタイル」という質問事項に対しては「映画風」というユーザ入力情報ＩＤＴ２を受け付ける。 In FIG. 6, the client UI display unit 400 displays the question "What kind of video do you want to make?". In response to the question "What kind of video do you want to make?", the sensor unit 300 accepts user input information IDT1, "a 15-second sneaker commercial video." The client UI display unit 400 also displays the question "style." In response to the question "style," the sensor unit 300 accepts user input information IDT2, "cinematic."

なお、映像生成システム１は、ユーザの入力情報を任意の態様により受け付けてもよく、複数の候補からユーザの選択を受け付けてもよい。例えば、映像生成システム１は、ユーザがキーボードやマイクから入力した情報をテキスト情報にして、ユーザの入力情報として受け付けてもよい。また、映像生成システム１は、自由文だけでなく、動画全体の秒数、スタイル、カメラワークなどの設定値、画像、動画などの他ファイルをユーザの入力情報として受け付けてもよい。 The video production system 1 may accept user input information in any manner, or may accept a user selection from multiple candidates. For example, the video production system 1 may convert information input by the user from a keyboard or microphone into text information and accept it as user input information. The video production system 1 may also accept, in addition to free text, settings such as the number of seconds for the entire video, style, and camerawork, as well as other files such as images and videos, as user input information.

映像生成システム１は、ユーザ入力情報ＩＤＴ１、ＩＤＴ２、及びテンプレート入力情報であるテンプレートＴＰ１を用いて、シナリオ生成用情報（第１の入力情報）であるプロンプトＰＴ１を生成する。例えば、テンプレートＴＰ１は、予め設定されたものであってもよいし、複数のテンプレート候補から選択されてもよい。例えば、映像生成システム１は、複数のテンプレート候補のうち、ユーザの入力情報に対応するテンプレートを選択してもよい。例えば、映像生成システム１は、ユーザ入力情報ＩＤＴ１、ＩＤＴ２が示す内容に基づいて、複数のテンプレート候補のうち、映画風の広告に関連するテンプレートＴＰ１を選択してもよい。 The video production system 1 generates a prompt PT1, which is scenario generation information (first input information), using the user input information IDT1, IDT2, and a template TP1, which is template input information. For example, the template TP1 may be preset, or may be selected from a plurality of template candidates. For example, the video production system 1 may select a template that corresponds to the user's input information from among the plurality of template candidates. For example, the video production system 1 may select a template TP1 related to a movie-style advertisement from among the plurality of template candidates based on the contents indicated by the user input information IDT1, IDT2.

例えば、映像生成システム１は、テンプレートＴＰ１にユーザ入力情報ＩＤＴ１、ＩＤＴ２を反映することにより、プロンプトＰＴ１を生成する。図６では、映像生成システム１は、制約条件のスタイルの項目に入力情報ＩＤＴ２が示す「映画風」を追加し、入力文に入力情報ＩＤＴ１が示す「１５秒のスニーカーのＣＭ動画」を追加することにより、プロンプトＰＴ１を生成する。このように、映像生成システム１は、ユーザの入力情報により、シナリオを生成するためのプロンプトを生成する。なお、ユーザの入力情報は、一画面で入力されてもよいし、いくつかの質問に答えることにより入力されてもよいが、これらの点の例について後述する。 For example, the video generation system 1 generates a prompt PT1 by reflecting the user input information IDT1 and IDT2 in the template TP1. In FIG. 6, the video generation system 1 generates the prompt PT1 by adding "cinematic" indicated by the input information IDT2 to the style item of the constraint condition, and adding "15-second sneaker commercial video" indicated by the input information IDT1 to the input sentence. In this way, the video generation system 1 generates a prompt for generating a scenario based on the user's input information. Note that the user's input information may be input on a single screen, or may be input by answering several questions, and examples of these points will be described later.
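The way user input is reflected in a template to obtain the prompt PT1 can be sketched as simple placeholder substitution. The template wording and the placeholder names below are assumptions made for illustration; the actual contents of the template TP1 are those shown in FIG. 6, which are not reproduced here.

```python
from string import Template

# Hypothetical stand-in for template TP1; the wording is an assumption.
TEMPLATE_TP1 = Template(
    "Write a video scenario as a list of scenes with durations.\n"
    "Constraints:\n"
    "- Style: $style\n"
    "Input: $request\n"
)


def build_scenario_prompt(request: str, style: str) -> str:
    """Reflect user input information IDT1 (request) and IDT2 (style) in the template."""
    return TEMPLATE_TP1.substitute(request=request, style=style)


prompt_pt1 = build_scenario_prompt(
    request="a 15-second sneaker commercial video",  # user input information IDT1
    style="cinematic",                               # user input information IDT2
)
print(prompt_pt1)
```

Template.substitute raises KeyError if a placeholder is left unfilled, which gives a simple guard against sending an incomplete prompt to the model M1.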

また、映像生成システム１は、図７に示すように、シナリオデータを生成する。図７は、シナリオデータの生成処理の一例を示す図である。図７では、映像生成システム１は、プロンプトＰＴ１を用いて、シナリオデータＳＮ１を生成する。例えば、シナリオデータＳＮ１は、図３及び図４中のシナリオＦＤ１に対応する。シナリオデータＳＮ１には、オープニングシーン、スニーカーを履くシーン等のシーンごとにその秒数、カットの説明などの情報が含まれる。 The video production system 1 also generates scenario data as shown in FIG. 7. FIG. 7 is a diagram showing an example of a scenario data generation process. In FIG. 7, the video production system 1 generates scenario data SN1 using a prompt PT1. For example, the scenario data SN1 corresponds to the scenario FD1 in FIG. 3 and FIG. 4. The scenario data SN1 includes information such as the number of seconds for each scene, such as the opening scene, the scene where sneakers are put on, and an explanation of the cut.

For example, the video generation system 1 generates the scenario data SN1 by inputting the prompt PT1 into the model M1, such as an LLM, and having the model M1 output the scenario data SN1. In this way, the video generation system 1 generates a scenario by inputting the generated prompt into an AI (such as an LLM). In addition to the information shown in FIG. 7 (also called "scenario information"), the scenario data SN1 also includes information on the environment, characters, motion, camera work, lighting, color, and the like. For example, to give the generated scenarios distinctive characteristics, training data reflecting the user's past experience, or training data of a specific director or person, may be supplied to the model M1 or the like through RAG (Retrieval-Augmented Generation) or fine-tuning, which allows the video generation system 1 to generate a variety of scenario variations.

また、映像生成システム１は、図８に示すように、コード生成用情報（第２の入力情報）を生成する。図８は、コード生成用情報の生成処理の一例を示す図である。映像生成システム１は、シナリオデータＳＮ１、及びテンプレート入力情報であるテンプレートＴＰ２を用いて、コード生成用情報（第２の入力情報）であるプロンプトＰＴ２を生成する。例えば、テンプレートＴＰ２は、予め設定されたものであってもよいし、複数のテンプレート候補から選択されてもよい。例えば、映像生成システム１は、複数のテンプレート候補のうち、シナリオに対応するテンプレートを選択してもよい。例えば、映像生成システム１は、シナリオデータＳＮ１が示す内容に基づいて、複数のテンプレート候補のうち、ＣＭに関連するテンプレートＴＰ２を選択してもよい。 The video production system 1 also generates code generation information (second input information) as shown in FIG. 8. FIG. 8 is a diagram showing an example of a process for generating code generation information. The video production system 1 generates a prompt PT2, which is code generation information (second input information), using scenario data SN1 and a template TP2, which is template input information. For example, the template TP2 may be preset or may be selected from a plurality of template candidates. For example, the video production system 1 may select a template corresponding to a scenario from among the plurality of template candidates. For example, the video production system 1 may select a template TP2 related to a commercial from among the plurality of template candidates based on the contents indicated by the scenario data SN1.

例えば、映像生成システム１は、テンプレートＴＰ２にシナリオデータＳＮ１を反映することにより、プロンプトＰＴ２を生成する。図８では、映像生成システム１は、入力文にシナリオデータＳＮ１が示す情報を追加することにより、プロンプトＰＴ２を生成する。このように、映像生成システム１は、ＡＩにより生成されたシナリオにより、映像向けのプロンプトを生成する。 For example, the video production system 1 generates a prompt PT2 by reflecting the scenario data SN1 in the template TP2. In FIG. 8, the video production system 1 generates a prompt PT2 by adding information indicated by the scenario data SN1 to the input sentence. In this way, the video production system 1 generates a prompt for a video based on a scenario generated by AI.

For example, FIG. 8 shows an example of prompt generation for converting a scenario related to a person into USD-Python. Conversion into USD-Python is merely one example, and the conversion may target any format. For example, the target format may be Python for Blender, or USD itself. Furthermore, the video generation system 1 may generate prompts individually for each target, such as people, environments, and camera work, or may generate a single combined prompt. The generated prompt may also include the paths of the assets and motions to be used, or may include source code or an API (Application Programming Interface) for passing a request (input) to an AI generation algorithm for assets or motions.
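The choice between per-target prompts and a single combined prompt can be sketched as follows. The scenario fields, the header wording, and the function name build_code_prompts are hypothetical and serve only to illustrate the two modes.

```python
from typing import Dict

# Hypothetical excerpt of scenario data SN1, keyed by target.
SCENARIO_SN1: Dict[str, str] = {
    "person": "A runner in her 30s ties her sneakers.",
    "environment": "A city street at dawn.",
    "camera": "Slow dolly-in at ground level.",
}

HEADER = "Convert the following scenario into USD-Python code.\n"


def build_code_prompts(scenario: Dict[str, str], merged: bool = False) -> Dict[str, str]:
    """Build one prompt per target, or a single combined prompt when merged is True."""
    if merged:
        body = "\n".join(f"[{target}] {text}" for target, text in scenario.items())
        return {"all": HEADER + body}
    return {target: f"{HEADER}[{target}] {text}" for target, text in scenario.items()}


per_target = build_code_prompts(SCENARIO_SN1)
combined = build_code_prompts(SCENARIO_SN1, merged=True)
```

Per-target prompts keep each request to the model small, at the cost of more model calls; which mode works better in practice is not stated in the disclosure.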

そして、映像生成システム１は、生成したプロンプトをＡＩ（ＬＬＭなど）に入力することでＵＳＤ－Ｐｙｔｈｏｎファイルを生成する。なお、ファイルの形式はpython形式に限らず、ＵＳＤなど他の形式でファイルが生成されてもよい。そして、映像生成システム１は、ＵＳＤファイル等、レンダリングできる形式に変換し、レンダリングを実行することにより、PreMovie（ｍｐ４等の映像ファイル）を生成する。 Then, the video generation system 1 generates a USD-Python file by inputting the generated prompt into an AI (such as LLM). Note that the file format is not limited to Python format, and files may be generated in other formats such as USD. The video generation system 1 then converts the file into a format that can be rendered, such as a USD file, and generates a PreMovie (a video file such as mp4) by performing rendering.

なお、映像生成システム１は、PreMovieを用いてコンポジット編集を行ってもよいが、図９に示すように、PreMovieに画質改善処理の一例であるリファイナ処理を行ってもよい。図９は、画質改善処理の一例を示す図である。図９では、映像生成システム１は、PreMovieである第１の動画ＩＮ１を入力として、RefinedMovieである第２の動画ＯＴ１を出力するDiffusionモデルであるモデルＭ１１を用いたリファイナ処理により、第１の動画ＩＮ１から第２の動画ＯＴ１を生成する。 The video production system 1 may perform composite editing using PreMovie, but may also perform refinement processing, which is an example of image quality improvement processing, on the PreMovie, as shown in FIG. 9. FIG. 9 is a diagram showing an example of image quality improvement processing. In FIG. 9, the video production system 1 generates a second video OT1 from a first video IN1 by refinement processing using a model M11, which is a diffusion model that receives a first video IN1, which is a PreMovie, as input, and outputs a second video OT1, which is a RefinedMovie.

The AI model (such as the model M11) used in the refiner process is not limited to a diffusion model, and any AI model such as an LDM (Latent Diffusion Model) or an LCM (Latent Consistency Model) can be used. In addition, techniques such as AnimateDiff (temporal stabilization) and ControlNet (line art control) may be used in the refiner process. Through this refiner process, the video generation system 1 can improve the quality of the video while maintaining the consistency of the characters, backgrounds, props, and the like.

また、リファイナ処理には、動画だけでなくプロンプトを用いてもよい。例えば、モデルＭ１１は、第１の動画ＩＮ１に加えて、プロンプトＩＮ２を入力としてもよい。例えば、モデルＭ１１は、プロンプトＩＮ２により３０代女性等の対象が指定された場合、第１の動画ＩＮ１中の３０代女性の箇所を改善した第２の動画ＯＴ１を出力する。これにより、映像生成システム１は、第１の動画ＩＮ１のうちプロンプトＩＮ２により指定された対象について画質等が改善された第２の動画ＯＴ１を生成する。 In addition, the refiner process may use prompts in addition to videos. For example, the model M11 may input prompt IN2 in addition to the first video IN1. For example, when a target such as a woman in her 30s is specified by prompt IN2, the model M11 outputs a second video OT1 in which the parts of the first video IN1 that represent the woman in her 30s have been improved. This allows the video generation system 1 to generate a second video OT1 in which the image quality, etc., of the target specified by prompt IN2 in the first video IN1 has been improved.

From the user's perspective, the processing of the video generation system 1 described above makes it appear that a scenario (storyboard) and a video for each cut are generated after input, while the intermediate processing is contained within the system. This processing may run in one pass from the user's input text through the scenario, all of the cuts, and the refiner process, or user input such as the selection of preferences may be accepted partway through. For example, the video generation system 1 may generate several outline-only scenarios, have the user select one, and then execute detailed scenario generation and video generation based on the selected outline. Alternatively, after scenario generation and before video generation, the video generation system 1 may generate several candidate characters for the video and, after the user selects one, execute the rendering process and the refiner process on the video.

シナリオの構成要素として、カットの動画や代表とする画像（動画の１フレーム目など）、カットの説明、登場人物（ビジュアル、設定など）、各登場人物のモーション、ライティング、カメラワーク、背景の環境情報、カット間のトランジション、セリフ、ナレーションなどがありうる。これらはユーザに提示するものもあれば、ユーザに提示はせずに処理のために持つものもある。シナリオはカットごとに時系列に並んでいる。 The components of a scenario may include videos of cuts, a representative image (such as the first frame of a video), a description of the cut, characters (visuals, setting, etc.), the motions of each character, lighting, camera work, background environmental information, transitions between cuts, dialogue, narration, etc. Some of these are presented to the user, while others are not presented to the user and are kept for processing purposes. The scenario is arranged in chronological order by cut.
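One possible in-memory representation of such a scenario is a list of per-cut records kept in time order. The following is a minimal sketch; the class names, field names, and sample cuts are assumptions, and only a few of the components listed above are modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cut:
    start_sec: float
    duration_sec: float
    description: str        # description of the cut (may be shown to the user)
    camera_work: str = ""   # may be kept internally for processing only
    lighting: str = ""
    narration: str = ""


@dataclass
class Scenario:
    cuts: List[Cut] = field(default_factory=list)

    def ordered(self) -> List[Cut]:
        """Return the cuts in chronological order."""
        return sorted(self.cuts, key=lambda cut: cut.start_sec)

    def total_seconds(self) -> float:
        """Total running time, assuming the cuts do not overlap."""
        return sum(cut.duration_sec for cut in self.cuts)


scenario = Scenario([
    Cut(5.0, 6.0, "Putting on the sneakers"),
    Cut(0.0, 5.0, "Opening scene"),
    Cut(11.0, 4.0, "Logo and tagline"),
])
ordered_descriptions = [cut.description for cut in scenario.ordered()]
```

A total-duration check of this kind could be used to confirm that a generated scenario matches a requested length such as 15 seconds.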

現在、Pika、Runway Gen-2、Lumiere、Stable Video Diffusionなどの様々な既存動画生成サービスが提供されている。これらは画像から空間方向や時間方向のベクトルを動かすDiffusionモデルを使い、動画を生成している。これらは、２Ｄの画像だけを使い映像を生成している。一方、映像生成システム１は、内部に３Ｄの情報を保持している。例えば、既存動画生成サービスでは動画内の指定した（Ｘ，Ｙ）領域のみ修正することは可能であるが、服の色味だけ変化させたいのにモーションも変わってしまうという課題がある。一方で、映像生成システム１の場合、内部に３Ｄの情報を保持しているため、モーションのみ、ライティングのみ、人の服の色のみなどの狙った部分のみの修正が可能となる。 Currently, various existing video generation services are provided, such as Pika, Runway Gen-2, Lumiere, and Stable Video Diffusion. These generate videos using a diffusion model that moves vectors in the spatial and temporal directions from an image. These generate videos using only 2D images. On the other hand, video generation system 1 stores 3D information internally. For example, with existing video generation services, it is possible to modify only a specified (X, Y) area in a video, but there is an issue that even if you want to change only the color of the clothes, the motion also changes. On the other hand, video generation system 1 stores 3D information internally, so it is possible to modify only the targeted parts, such as only the motion, only the lighting, or only the color of a person's clothes.
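The benefit of holding 3D information internally can be illustrated with a toy scene description in which each attribute is stored separately. The scene layout and the helper edit_attribute below are hypothetical; in the described system the corresponding information would reside in 3D data such as USD.

```python
import copy
from typing import Any, Dict, Tuple

# Hypothetical scene description: attributes are kept apart from one another.
scene: Dict[str, Dict[str, Any]] = {
    "character": {"clothes_color": (1.0, 0.0, 0.0), "motion": "walk_cycle_01"},
    "lighting": {"key_intensity": 0.8},
}


def edit_attribute(source: Dict[str, Any], path: Tuple[str, ...], value: Any) -> Dict[str, Any]:
    """Return a copy of the scene in which only the attribute at `path` is changed."""
    updated = copy.deepcopy(source)
    node = updated
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return updated


# Change only the clothing color; motion and lighting are left untouched.
recolored = edit_attribute(scene, ("character", "clothes_color"), (0.0, 0.0, 1.0))
```

Because the color is a separate attribute rather than pixels in a frame, changing it cannot alter the motion, which is the contrast with purely 2D generation drawn above.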

また、既存動画生成サービスはカットごとの動画を生成するのみであり、ユーザが各カットの一貫性を自ら担保する必要があるが、映像生成システム１はシナリオ（ストーリーボード）から動画生成、修正まで一貫して実施することが可能であり、映像の出演者や背景、カラーグレーディングなど一貫性を持った動画生成をすることが可能である。 In addition, existing video generation services only generate a video for each cut, so users must ensure consistency across cuts themselves. In contrast, the video generation system 1 can carry out everything from the scenario (storyboard) through video generation and modification in an integrated manner, and can generate videos that are consistent in terms of performers, backgrounds, color grading, and the like.

＜１－３．ユーザインタフェース＞ <1-3. User interface>
ここから、映像生成システム１を利用するユーザに対するユーザインタフェース（ＵＩ）について記載する。なお、上述した内容と同様の点については適宜説明を省略する。 The following describes the user interface (UI) presented to users of the video generation system 1. Explanations of points similar to those described above are omitted where appropriate.

映像生成システム１は、図１０に示すように、コンテンツＣＴ１１をユーザに提供する。図１０は、ユーザインタフェースの一例を示す図である。コンテンツＣＴ１１は、ユーザの入力情報を受け付けるための表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１１を表示する。ユーザは、コンテンツＣＴ１１を介して、どんな動画を作るかを指示するテキスト（図１０では「Prompt」の欄）及びスタイルの選択（図１０では「Style」の欄）をユーザの入力情報として入力する。このように、ユーザは、テキスト及びスタイルの選択をユーザの入力情報として入力する。例えば、ユーザは、コンテンツＣＴ１１に含まれる例文等を参考に「Prompt」の欄に文字情報を入力する。例えば、ユーザは、「Style」の欄の下向きの三角形を押す（クリック等）すること等により、表示される複数のスタイル候補から使用するスタイルを選択する。 As shown in FIG. 10, the video generation system 1 provides the user with content CT11. FIG. 10 is a diagram showing an example of a user interface. The content CT11 is a display screen (content) for receiving user input information. For example, the client UI display unit 400 displays the content CT11. The user inputs text (the "Prompt" column in FIG. 10) instructing what kind of video to create and a style selection (the "Style" column in FIG. 10) as user input information via the content CT11. In this way, the user inputs text and style selection as user input information. For example, the user inputs text information in the "Prompt" column with reference to example sentences included in the content CT11. For example, the user selects a style to use from multiple style candidates displayed by pressing (clicking, etc.) the downward triangle in the "Style" column.

ユーザの入力情報の入力が完了したユーザは、図１０中の「Ask AI Director」と表記されたボタンを選択することにより、映像生成システム１にユーザの入力情報に応じた動画生成を指示する。これにより、映像生成システム１は、ユーザの入力情報に応じた動画の生成処理を実行する。 When the user has completed inputting the user's input information, the user selects the button labeled "Ask AI Director" in FIG. 10 to instruct the video production system 1 to generate a video in accordance with the user's input information. This causes the video production system 1 to execute the process of generating a video in accordance with the user's input information.

映像生成システム１は、図１４に示すように、生成した動画に関するコンテンツＣＴ１５をユーザに提供する。図１４は、ユーザインタフェースの一例を示す図である。コンテンツＣＴ１５は、生成した動画に対するユーザの操作（指示）を受け付けるためのストーリーボード画面（コンテンツ）である。図１４に示すように、コンテンツＣＴ１５は、生成した動画のカット毎に動画データを表示するストーリーボード画面である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１５を表示する。このように、映像生成システム１は、ユーザの入力情報に応じて、ストーリーボードと映像が出力されるＵＩを提供する。ユーザは、ストーリーボード画面にて、各カットの動画、内容、ナレーション、セリフ、カメラワーク、ＢＧＭ、ライティング、カラーなどの設定をする。 As shown in FIG. 14, the video generation system 1 provides the user with content CT15 related to the generated video. FIG. 14 is a diagram showing an example of a user interface. The content CT15 is a storyboard screen (content) for accepting user operations (instructions) for the generated video. As shown in FIG. 14, the content CT15 is a storyboard screen that displays video data for each cut of the generated video. For example, the client UI display unit 400 displays the content CT15. In this way, the video generation system 1 provides a UI that outputs a storyboard and video according to information input by the user. The user sets the video, content, narration, dialogue, camera work, background music, lighting, color, etc. for each cut on the storyboard screen.

なお、映像生成システム１は、ユーザに質問を行いながら、ユーザによるユーザの入力情報を受け付けてもよい。例えば、映像生成システム１は、図１０中の「Ask AI Director」と表記されたボタンを選択した場合、図１１～図１３に示すように、ユーザとの会話（対話）によりユーザの入力情報を受け付ける。図１１～図１３は、ユーザインタフェースの一例を示す図である。図１１中のコンテンツＣＴ１２は、図１０で入力されたユーザの入力情報に対応して生成したサンプルを提示して、ユーザにイメージに近いものがあるかを質問する表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１２を表示する。 The video generation system 1 may accept user input information while asking the user a question. For example, when the button labeled "Ask AI Director" in FIG. 10 is selected, the video generation system 1 accepts user input information through a conversation (dialogue) with the user, as shown in FIGS. 11 to 13. FIGS. 11 to 13 are diagrams showing an example of a user interface. Content CT12 in FIG. 11 is a display screen (content) that presents samples generated in response to the user's input information entered in FIG. 10, and asks the user whether there is any that is close to the image they have in mind. For example, the client UI display unit 400 displays the content CT12.

図１２中のコンテンツＣＴ１３は、図１１中のコンテンツＣＴ１２で提示したサンプルにイメージに合うものがないとのユーザの回答（入力情報）に応じて、詳細なターゲット等を要求（質問）する表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１３を表示する。 Content CT13 in FIG. 12 is a display screen (content) that requests (asks) detailed targets, etc. in response to the user's response (input information) that none of the samples presented in content CT12 in FIG. 11 match the image. For example, the client UI display unit 400 displays content CT13.

図１３中のコンテンツＣＴ１４は、ターゲット等を具体的に指定したユーザの回答（入力情報）に対応して再度生成したサンプルを提示して、ユーザにイメージに近いものがあるかを質問する表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１４を表示する。図１３では、ユーザがマウスカーソルを４つのサンプルのうち左端のサンプル動画に合わせてクリック等の指定操作を行うことにより、４つのサンプルのうち左端のサンプル動画がイメージに近い動画であると、ユーザが指定した場合を示す。これにより、映像生成システム１は、４つのサンプルのうち左端のサンプル動画を指定するユーザの入力情報に応じた動画の生成処理を実行する。 Content CT14 in FIG. 13 is a display screen (content) that presents samples that have been regenerated in response to a user's answer (input information) that specifically specifies a target, etc., and asks the user whether there is one that is close to the image. For example, the client UI display unit 400 displays content CT14. FIG. 13 shows a case in which the user places the mouse cursor over the leftmost sample video of the four samples and performs a designation operation such as clicking, thereby designating that the leftmost sample video of the four samples is the video that is close to the image. As a result, the video generation system 1 executes the process of generating a video in response to the user's input information that designates the leftmost sample video of the four samples.

この場合、映像生成システム１は、図１４に示すように、生成した動画に関するコンテンツＣＴ１５をユーザに提供する。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ１５を表示する。このように、映像生成システム１は、初期入力内容で、ストーリーボードと動画を生成するにあたり、必要な情報が足りない場合はユーザとの会話（対話）により必要な情報を収集しながら、詳細を詰めていってもよい。 In this case, the video generation system 1 provides the user with content CT15 related to the generated video, as shown in FIG. 14. For example, the client UI display unit 400 displays the content CT15. In this way, if the initial input does not contain enough information to generate a storyboard and video, the video generation system 1 may work out the details while collecting the necessary information through conversation (dialogue) with the user.

また、映像生成システム１は、図１５及び図１６に示すように、ユーザに作りたい動画に関する短文を入力させ、その短文を基に動画のストーリーをいくつか生成し、気に入ったものをユーザに選択させてもよい。図１５及び図１６は、ユーザインタフェースの一例を示す図である。 As shown in Figs. 15 and 16, the video generation system 1 may also have the user input a short sentence about the video they wish to create, generate several video stories based on the short sentence, and allow the user to select the one they like best. Figs. 15 and 16 show an example of a user interface.

映像生成システム１は、図１５に示すように、コンテンツＣＴ２１をユーザに提供する。コンテンツＣＴ２１は、ユーザの入力情報を受け付けるための表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ２１を表示する。ユーザは、コンテンツＣＴ２１を介して、どんな動画を作りたいかを示す文章（短文）をユーザの入力情報として入力する。例えば、ユーザは、コンテンツＣＴ２１中の入力欄に文字情報を入力する。 As shown in FIG. 15, the video generation system 1 provides the user with content CT21. Content CT21 is a display screen (content) for accepting information input by the user. For example, the client UI display unit 400 displays the content CT21. The user inputs a sentence (short sentence) indicating what kind of video they want to create as user input information via the content CT21. For example, the user inputs text information into an input field in the content CT21.

ユーザの入力情報の入力が完了したユーザは、コンテンツＣＴ２１中の入力欄の右端の「開始」と表記されたボタンを選択することにより、映像生成システム１にユーザの入力情報に応じた動画のストーリーの生成を指示する。これにより、映像生成システム１は、ユーザの入力情報に応じた動画のストーリーの生成処理を実行する。 When the user has finished inputting the user's input information, the user selects the button marked "Start" on the right side of the input field in content CT21 to instruct video production system 1 to generate a video story according to the user's input information. This causes video production system 1 to execute the process of generating a video story according to the user's input information.

映像生成システム１は、図１６に示すように、生成した動画のストーリーに関するコンテンツＣＴ２２をユーザに提供する。図１６中のコンテンツＣＴ２２は、図１５で入力されたユーザの入力情報に対応して生成した動画のストーリーのサンプルを提示する表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ２２を表示する。例えば、ユーザは、動画のストーリーのサンプルのうち、気に入ったものがあれば、そのサンプルの表示領域の右端の「続ける」と表記されたボタンを選択することにより、映像生成システム１に選択したサンプルに対応する動画生成を指示する。 As shown in FIG. 16, the video generation system 1 provides the user with content CT22 relating to the story of the generated video. The content CT22 in FIG. 16 is a display screen (content) that presents a sample of the story of the video generated in response to the user's input information entered in FIG. 15. For example, the client UI display unit 400 displays the content CT22. For example, if the user likes one of the video story samples, the user selects a button labeled "Continue" on the right edge of the display area for that sample, thereby instructing the video generation system 1 to generate a video corresponding to the selected sample.

なお、映像生成システム１は、ユーザが気に入ったものがなければ別パターンを再度生成する。例えば、ユーザは、動画のストーリーのサンプルのうち、気に入ったものが無ければ、短文の表示領域の右端の「続ける」と表記されたボタンを選択することにより、映像生成システム１に別パターンの動画のストーリーのサンプルを再度生成することを指示する。これにより、映像生成システム１は、別パターンの動画のストーリーのサンプルを生成する。 If the user does not like any of the video story samples, the video generation system 1 will generate a different pattern again. For example, if the user does not like any of the video story samples, the user can select the button labeled "Continue" on the right edge of the short sentence display area to instruct the video generation system 1 to generate a different pattern of video story sample again. This causes the video generation system 1 to generate a different pattern of video story sample.

＜１－４．処理例＞ <1-4. Processing example>
ここから、上述した具体例以外に、映像生成システム１が実行する各処理の具体例について記載する。以下では、上述した映像生成システム１の処理における音等の生成処理、評価処理等についての具体例を説明する。なお、上述した内容と同様の点については適宜説明を省略する。 The following describes further specific examples of the processes executed by the video generation system 1, beyond those given above. Specifically, examples of the generation of sound and the like and of the evaluation process in the processing of the video generation system 1 are described. Explanations of points similar to those described above are omitted as appropriate.

＜１－４－１．音生成例＞ <1-4-1. Sound generation example>
例えば、映像生成システム１は、図１７に示すように、音生成用情報（図３及び図４中の音生成必要情報ＳＤ２に対応）を生成する。図１７は、音生成用情報の生成処理の一例を示す図である。映像生成システム１は、シナリオデータＳＮ１、及びテンプレート入力情報であるテンプレートＴＰ３を用いて、音生成用情報であるプロンプトＰＴ３を生成する。例えば、テンプレートＴＰ３は、予め設定されたものであってもよいし、複数のテンプレート候補から選択されてもよい。例えば、映像生成システム１は、複数のテンプレート候補のうち、シナリオに対応するテンプレートを選択してもよい。例えば、映像生成システム１は、シナリオデータＳＮ１が示す内容に基づいて、複数のテンプレート候補のうち、ＣＭに関連するテンプレートＴＰ３を選択してもよい。 For example, the video production system 1 generates information for sound generation (corresponding to the information required for sound generation SD2 in FIG. 3 and FIG. 4) as shown in FIG. 17. FIG. 17 is a diagram showing an example of a process for generating information for sound generation. The video production system 1 generates a prompt PT3, which is information for sound generation, using scenario data SN1 and a template TP3, which is template input information. For example, the template TP3 may be preset or may be selected from a plurality of template candidates. For example, the video production system 1 may select a template corresponding to a scenario from among the plurality of template candidates. For example, the video production system 1 may select a template TP3 related to a commercial from among the plurality of template candidates based on the contents indicated by the scenario data SN1.

例えば、映像生成システム１は、テンプレートＴＰ３にシナリオデータＳＮ１を反映することにより、プロンプトＰＴ３を生成する。図１７では、映像生成システム１は、入力文にシナリオデータＳＮ１が示す情報を追加することにより、プロンプトＰＴ３を生成する。例えば、映像生成システム１は、シナリオにより合うサウンド（ＢＧＭ等）を生成するために、シナリオの中からキーとなるフレーズをいくつか抽出するためのプロンプトを生成する。映像生成システム１は、生成したプロンプトをＡＩ（ＬＬＭ等）に入力することで、サウンド生成に必要なキーワードやテキスト情報を取得してもよい。 For example, the video production system 1 generates a prompt PT3 by reflecting the scenario data SN1 in the template TP3. In FIG. 17, the video production system 1 generates a prompt PT3 by adding information indicated by the scenario data SN1 to an input sentence. For example, the video production system 1 generates a prompt for extracting several key phrases from the scenario in order to generate a sound (BGM, etc.) that matches the scenario. The video production system 1 may input the generated prompt to an AI (LLM, etc.) to obtain keywords and text information required for sound generation.
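
Reflecting scenario data in a template to obtain a prompt can be sketched as plain string substitution. This is a minimal sketch, assuming a text template with named placeholders; the wording of the template, the placeholder names, and the function `build_sound_prompt` are hypothetical and do not reproduce the actual template TP3 or prompt PT3.

```python
from string import Template

# Hypothetical stand-in for a commercial-oriented template such as TP3.
SOUND_TEMPLATE = Template(
    "You are a sound designer for a $genre video.\n"
    "Extract up to $max_phrases key phrases describing suitable background music.\n"
    "Scenario:\n$scenario"
)


def build_sound_prompt(scenario_text: str, genre: str = "commercial",
                       max_phrases: int = 5) -> str:
    """Fill the template with scenario text to obtain a prompt for an LLM."""
    return SOUND_TEMPLATE.substitute(
        genre=genre,
        max_phrases=max_phrases,
        scenario=scenario_text.strip(),
    )


if __name__ == "__main__":
    print(build_sound_prompt("Cut 1: A child skips toward her mother at sunset."))
```

The resulting string would then be passed to whichever language model the system uses; that call is left out here because the disclosure does not specify it.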

映像生成システム１は、生成したプロンプトＰＴ３を用いて、音データを生成する。例えば、映像生成システム１は、生成された動画とストーリーボード（シナリオデータＳＮ１等）に記載のテキスト情報に基づき、ＢＧＭやＳＥ、ナレーション、セリフなどのサウンド（音データ）を生成する。また、映像生成システム１は、セリフやナレーション等、既に必要な言葉（文字情報）がシナリオ内に抜き出されている場合は、プロンプトを生成せずに、その言葉（文字情報）をセリフやナレーション等の音データとして保存してもよい。 The video generation system 1 generates sound data using the generated prompt PT3. For example, the video generation system 1 generates sounds (sound data) such as background music, sound effects, narration, and dialogue based on the generated video and text information written on the storyboard (scenario data SN1, etc.). Furthermore, if necessary words (text information) such as dialogue or narration have already been extracted from the scenario, the video generation system 1 may not generate a prompt and may instead save the words (text information) as sound data for dialogue, narration, etc.

例えば、映像生成システム１は、得られたサウンド生成のための必要情報（音生成用情報）、動画、ユーザが入力した音声データ（音源、ユーザの声や鼻歌など）より音データを生成する。なお、映像生成システム１は、ＢＧＭやＳＥに関しては、text to music generationの様なTransformer（モデル）を利用し、テキストや動画からサウンドを生成してもよい。また、映像生成システム１は、text to music estimationの様なContrastive Learningされた音源を自然文から検索してもよい。 For example, the video generation system 1 generates sound data from the necessary information for sound generation (information for sound generation), video, and audio data input by the user (sound source, user's voice, humming, etc.). For background music and sound effects, the video generation system 1 may generate sound from text or video using a transformer (model) such as text to music generation. The video generation system 1 may also search for contrastively learned sound sources such as text to music estimation from natural text.

また、映像生成システム１は、セリフやナレーションに関しては、その言葉自体と、シナリオから得られた人物像に関するテキスト情報を元に、text to speech（DiffusionモデルやFlow Matchingなど）で音声を生成してもよい。また、映像生成システム１は、生成された動画と音声をつなげる際、音声の開始終了時間、音量等を示すため、動画内もしくは音声ファイル内にメタ情報を組み込んでもよい。 In addition, for lines and narration, the video generation system 1 may generate audio using text to speech (such as a diffusion model or flow matching) based on the words themselves and text information about the characters obtained from the scenario. In addition, when linking the generated video and audio, the video generation system 1 may incorporate meta information into the video or audio file to indicate the start and end times, volume, etc. of the audio.
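
The meta information that ties generated audio to the video (start and end times, volume) could take many forms. The sketch below assumes a JSON sidecar listing one entry per audio file; the `AudioCue` class, its field names, and the 0.0 to 1.0 volume range are illustrative assumptions, and the disclosure equally allows embedding such information in the video or audio file itself.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class AudioCue:
    """Placement of one audio file on the video timeline (hypothetical schema)."""
    file: str
    kind: str          # e.g. "bgm", "se", "narration", "dialogue"
    start_sec: float
    end_sec: float
    volume: float = 1.0  # assumed range: 0.0 (silent) to 1.0 (full)

    def __post_init__(self) -> None:
        if not 0.0 <= self.start_sec < self.end_sec:
            raise ValueError("start_sec must be >= 0 and earlier than end_sec")
        if not 0.0 <= self.volume <= 1.0:
            raise ValueError("volume must be between 0.0 and 1.0")


def cues_to_json(cues: List[AudioCue]) -> str:
    """Serialize cues, ordered by start time, as sidecar meta information."""
    ordered = sorted(cues, key=lambda cue: cue.start_sec)
    return json.dumps({"audio_cues": [asdict(cue) for cue in ordered]}, indent=2)
```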

＜１－４－２．テキストロゴ生成例＞ <1-4-2. Example of text logo generation>
例えば、映像生成システム１は、テキストロゴ生成用情報（図３及び図４中のテキストロゴ生成必要情報ＳＤ３に対応）を生成する。映像生成システム１は、生成したストーリーボード（シナリオデータＳＮ１等）に基づき、映像上に表示されるテキストやロゴ情報（キャプション、タイトル、ロゴ、説明文等）を生成する。 For example, video production system 1 generates information for generating a text logo (corresponding to information required for text logo generation SD3 in FIGS. 3 and 4). Based on the generated storyboard (scenario data SN1, etc.), video production system 1 generates text and logo information (captions, titles, logos, descriptions, etc.) to be displayed on the video.

例えば、生成されたシナリオ内に、表示するテキスト文章が明確に記載されている場合もあるが、明確に記載されていない場合は、映像生成システム１は、テキスト情報（テキストロゴデータ等）を生成するためのプロンプトを生成し、ＡＩ（ＬＬＭなど）に投げてテキスト情報を生成する。また、映像生成システム１は、ユーザが表示するテキストを自ら入力した場合、ユーザが入力したその情報をテキスト情報（テキストロゴデータ等）として用いてもよい。 For example, the text sentence to be displayed may be clearly stated in the generated scenario, but if it is not clearly stated, the video generation system 1 generates a prompt to generate text information (text logo data, etc.) and sends it to an AI (such as an LLM) to generate the text information. In addition, if the user inputs the text to be displayed, the video generation system 1 may use the information input by the user as text information (text logo data, etc.).

また、テキスト表示のフォントやサイズ位置は、任意の方法により決定される。例えば、映像生成システム１は、Diffusionモデル、ＶＡＥ（Variational Auto-Encoder）、ＧＡＮ（Generative Adversarial Networks）、ＤＡＬＬＥ、ＳｔｙｌｅＧＡＮ、ＳｔｙｌｅＧＡＮ２、Ｐｉｘ２Ｐｉｘ、ＴｒａｎｓＧＡＮ、ＬＬＭなどの任意のＡＩを用いて、テキスト表示のフォントやサイズ位置を決定してもよい。また、テキスト表示のフォントやサイズ位置は、ユーザが自ら手動で設定してもよい。 The font, size, and position of the text display may be determined by any method. For example, the video generation system 1 may determine the font, size, and position of the text display using any AI such as a diffusion model, a variational auto-encoder (VAE), generative adversarial networks (GAN), DALL E, StyleGAN, StyleGAN2, Pix2Pix, TransGAN, or LLM. The font, size, and position of the text display may also be manually set by the user.

ロゴやイメージに関しては、ユーザが画像や動画をｊｐｅｇ形式やｍｐ４形式等で入力してもよい。また、ロゴやイメージに関しては、映像生成システム１は、画像生成のためのプロンプトを生成し、Diffusionモデル、ＶＡＥ、ＧＡＮ、ＤＡＬＬＥ、ＳｔｙｌｅＧＡＮ、ＳｔｙｌｅＧＡＮ２、Ｐｉｘ２Ｐｉｘ、ＴｒａｎｓＧＡＮ、ＬＬＭなどの任意のＡＩに投げてロゴ情報を生成してもよい。 Regarding logos and images, the user may input images or videos in jpeg or mp4 format, etc. Also, regarding logos and images, the video generation system 1 may generate a prompt for image generation and send it to any AI such as a Diffusion model, VAE, GAN, DALL E, StyleGAN, StyleGAN2, Pix2Pix, TransGAN, or LLM to generate logo information.

また、映像生成システム１は、テキストやロゴ情報と動画をつなげる際、テキストやロゴの開始終了時間や位置、大きさを明確（メタ情報）に示すため、動画内もしくはテキストやロゴ自体にメタ情報を組み込んでもよい。 In addition, when combining text or logo information with a video, the video generation system 1 may embed meta information in the video, or in the text or logo itself, so that the start and end times, position, and size of the text or logo are stated explicitly.

＜１－４－３．再学習例＞ <1-4-3. Retraining example>
また、映像生成システム１は、シナリオや映像を生成する際、取り替え可能な特定の学習データに基づき、その学習データにしか出せないシナリオや映像を生成してもよい。例えば、これまで制作した映像や画像を基に学習データとして再学習する事が可能である。ＲＡＧやファインチューニングで再学習することで、生成されるシナリオや映像を変化させることができる。すなわち、映像生成システム１は、過去に制作したユーザ個人のデータ（履歴）を再学習することも可能であるし、特定の映画監督の作品を学習データとして再学習したモデルで生成することも可能である。 Furthermore, when generating a scenario or video, the video generation system 1 may rely on specific, replaceable training data to generate a scenario or video that only that training data can yield. For example, previously produced videos and images can be used as training data for retraining. Retraining by RAG or fine-tuning changes the scenarios and videos that are generated. In other words, the video generation system 1 can retrain on a user's own previously produced data (history), and can also generate with a model retrained on the works of a specific film director.

また、その学習データは、個人のＰＣ上で再学習させることも可能であるし、サーバ上で再学習させることも可能である。学習データは、ＬＬＭやDiffusionモデルなど数多くのモデルを学習させるための学習データとなる。全体のシナリオはＡ監督を用いてシナリオ生成したいが、映像のカラーは別のＢ監督を用いてカラーグレーディングを生成したい場合等においては、映像生成システム１は、特定の部分に関して別の学習データで再学習してもよい。 Retraining with this data can be performed on an individual's PC or on a server. The data serves as training data for many models, such as LLMs and diffusion models. When, for example, the overall scenario should be generated in the manner of director A but the color grading of the video in the manner of a different director B, the video generation system 1 may retrain specific parts with different training data.

＜１－４－４．ＵＳＤ更新例＞ <1-4-4. USD update example>
上述したように、映像生成システム１は、ユーザの入力情報と生成されたストーリーボードのテキスト情報に基づき、既存動画の元となる３ＤＣＧアセットやレンダリング方法などを修正する。そして、映像生成システム１は、新しく動画を構成するコード出力のためのプロンプトを出力し、自然言語モデルへプロンプトを提供して、動画を構成するコードを出力し、動画を生成する。これにより、映像生成システム１は、ユーザが３ＤＣＧや映像制作の知識がなくても、ユーザの入力に合わせて映像を修正することができる。 As described above, the video generation system 1 modifies the 3DCG assets, rendering method, and the like underlying an existing video, based on the user's input information and the text information of the generated storyboard. The video generation system 1 then outputs a prompt for producing code that constructs the new video, provides the prompt to a natural language model to obtain that code, and generates the video. In this way, the video generation system 1 can modify the video in accordance with the user's input even if the user has no knowledge of 3DCG or video production.

映像生成システム１では、映像（動画）を生成した後、シナリオ情報、映像の情報などはテキストや動画として保存されている。そのため、ユーザはこれらの情報を入力情報（テキストやセンサなど）で修正することが可能である。 After generating a video (video), video generation system 1 stores scenario information, video information, and the like as text and video. This allows the user to modify this information with input information (text, sensors, etc.).

例えば、映像生成システム１では、ＵＳＤファイルは図１８に示すように、アセットやモーションごとにわかれて保存される。図１８は、ＵＳＤファイルの一例を示す図である。例えば、図１８に示すデータ構造において、上位の階層のＵＳＤには下位の階層のＵＳＤへのパス（ファイルパス）が含まれてもよい。 For example, in the video production system 1, USD files are stored separately for each asset and motion, as shown in FIG. 18. FIG. 18 is a diagram showing an example of a USD file. For example, in the data structure shown in FIG. 18, a USD in a higher hierarchical level may include a path (file path) to a USD in a lower hierarchical level.

例えば、全体ＵＳＤには、環境アセットＵＳＤ、人アセットＵＳＤ、カメラＵＳＤ等へのパスが含まれる。また、環境アセットＵＳＤには、建物アセットＵＳＤ、ＰｒｏｐアセットＵＳＤへのパスが含まれる。また、建物アセットＵＳＤには、建物自体のメッシュ情報等が含まれる。また、人アセットＵＳＤには、人のメッシュ情報、モーションＵＳＤへのパスが含まれる。このように、ＵＳＤファイル等の３ＤＣＧ用のデータ（３Ｄデータ）は、複数のデータセットを含んでもよい。なお、図１８に示すＵＳＤファイルの構成（データ構造）は一例に過ぎず、任意の構成が採用可能であり、全体が一塊（１つのデータセット）のＵＳＤ（ＵＳＤファイル）として構成されてもよい。 For example, the overall USD contains paths to the environment asset USD, the person asset USD, the camera USD, and so on. The environment asset USD contains paths to the building asset USD and the prop asset USD. The building asset USD contains the mesh information of the building itself and the like. The person asset USD contains the person's mesh information and a path to the motion USD. In this way, 3DCG data (3D data) such as a USD file may include multiple data sets. The USD file configuration (data structure) shown in FIG. 18 is merely an example; any configuration may be adopted, and the whole may be configured as a single USD (USD file), that is, as one data set.
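
The layered arrangement described for FIG. 18, in which a higher-level USD holds paths to lower-level USDs, can be sketched without any 3D library as a mapping from each file to the files it points to. The file names below are hypothetical, and the sketch deliberately does not use the OpenUSD Python bindings; in an actual USD pipeline such links would typically be expressed through USD's own composition mechanisms, such as references or sublayers.

```python
from typing import Dict, List

# Hypothetical file names. Each key is a USD file; each value lists the
# lower-level USD files whose paths it contains.
USD_REFERENCES: Dict[str, List[str]] = {
    "scene.usd": ["environment.usd", "person.usd", "camera.usd"],
    "environment.usd": ["building.usd", "prop.usd"],
    "person.usd": ["motion.usd"],   # also holds the person's mesh information
    "building.usd": [],             # holds the building's mesh information
    "prop.usd": [],
    "motion.usd": [],
    "camera.usd": [],
}


def collect_files(root: str, references: Dict[str, List[str]]) -> List[str]:
    """Return every file reachable from root, depth-first, without duplicates."""
    ordered: List[str] = []
    seen = set()

    def visit(path: str) -> None:
        if path in seen:
            return
        seen.add(path)
        ordered.append(path)
        for child in references.get(path, []):
            visit(child)

    visit(root)
    return ordered


if __name__ == "__main__":
    print(collect_files("scene.usd", USD_REFERENCES))
```

Splitting the data this way is what makes a targeted update possible: changing a motion only requires rewriting `motion.usd`, while the files that point to it stay as they are.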

修正が行われる際は、図１９に示すような処理フローにより、映像生成システム１がユーザの修正情報をＡＩ（ＬＬＭなど）で解析し、ＵＳＤを取り替えるかＵＳＤの一部を修正するかにより、処理が変わる。また、修正後はレンダリング処理とリファイン処理が実行される。図１９は、映像生成システムが実行する処理手順を示すフローチャートである。具体例には、図１９は、ＵＳＤファイルの書き換えに関する処理手順を示すフローチャートである。 When modifications are made, the video production system 1 analyzes the user's modification information using AI (such as LLM) according to the processing flow shown in FIG. 19, and the processing differs depending on whether the USD is replaced or part of the USD is modified. After the modification, rendering processing and refinement processing are executed. FIG. 19 is a flowchart showing the processing procedure executed by the video production system. As a specific example, FIG. 19 is a flowchart showing the processing procedure related to rewriting a USD file.

まず、映像生成システム１は、ユーザの修正情報入力を受け付ける（ステップＳ１０１）。例えば、センサ部３００は、ユーザによる修正を指示する入力情報を受け付ける。映像生成システム１は、ＡＩにて入力を解析する（ステップＳ１０２）。例えば、映像生成モジュール１００は、各種モデル等を用いてユーザによる修正を指示する入力情報の内容を解析する。 First, the video generation system 1 accepts the user's correction input (step S101). For example, the sensor unit 300 accepts input information in which the user instructs a correction. The video generation system 1 analyzes the input using AI (step S102). For example, the video generation module 100 uses various models and the like to analyze the content of the input information in which the user instructs the correction.

映像生成システム１は、既存ＵＳＤファイルの形式を認識する（ステップＳ１０３）。例えば、映像生成モジュール１００は、修正前の状態におけるＵＳＤファイルの形式を認識する。映像生成システム１は、ＵＳＤの一部を修正するか否かを判定する（ステップＳ１０４）。例えば、映像生成モジュール１００は、ユーザによる修正を指示する入力情報の内容及び既存ＵＳＤファイルの形式に基づいて、ＵＳＤの一部を修正するか否かを判定する。 The video generation system 1 recognizes the format of the existing USD file (step S103). For example, the video generation module 100 recognizes the format of the USD file in its state before modification. The video generation system 1 determines whether or not to modify a part of the USD (step S104). For example, the video generation module 100 makes this determination based on the content of the input information in which the user instructs the correction and on the format of the existing USD file.

映像生成システム１は、ＵＳＤの一部を修正する場合（ステップＳ１０４：Ｙｅｓ）、修正用ＵＳＤ－Ｐｙｔｈｏｎ生成のためのプロンプトを生成する（ステップＳ１０５）。例えば、映像生成モジュール１００は、ＵＳＤの一部を修正する場合、修正用ＵＳＤ－Ｐｙｔｈｏｎ生成のためテンプレート等を用いて、修正用ＵＳＤ－Ｐｙｔｈｏｎ生成のためのプロンプトを生成する。 When a portion of the USD is to be modified (step S104: Yes), the video generation system 1 generates a prompt for generating a modified USD-Python (step S105). For example, when a portion of the USD is to be modified, the video generation module 100 generates a prompt for generating the modified USD-Python using a template or the like for generating the modified USD-Python.

映像生成システム１は、ＡＩ（ＬＬＭ等）にてＵＳＤ－Ｐｙｔｈｏｎを生成する（ステップＳ１０６）。例えば、映像生成モジュール１００は、ＵＳＤ－Ｐｙｔｈｏｎを生成するためのモデルに、プロンプトを入力することにより、ＵＳＤ－Ｐｙｔｈｏｎを生成する。 The video generation system 1 generates USD-Python using AI (LLM, etc.) (step S106). For example, the video generation module 100 generates USD-Python by inputting a prompt into a model for generating USD-Python.

映像生成システム１は、修正対象ＵＳＤファイルを置き換える（ステップＳ１０７）。例えば、映像生成モジュール１００は、生成したＵＳＤ－Ｐｙｔｈｏｎを修正対象ＵＳＤファイルに反映することにより、修正対象ＵＳＤファイルを置き換える。このように、映像生成システム１は、ＵＳＤファイルの複数のデータセットのうち少なくとも１つを更新する。例えば、映像生成システム１は、ＵＳＤファイルの複数のデータセットのうち一部を更新する処理を実行する。 The video production system 1 replaces the USD file to be modified (step S107). For example, the video production module 100 replaces the USD file to be modified by reflecting the generated USD-Python in the USD file to be modified. In this manner, the video production system 1 updates at least one of the multiple data sets of the USD file. For example, the video production system 1 executes a process of updating a portion of the multiple data sets of the USD file.

映像生成システム１は、更新後のＵＳＤファイルを用いてレンダリング処理を実行する（ステップＳ１０８）。例えば、映像生成モジュール１００は、書き換え後、すなわち修正後のＵＳＤファイルを用いてレンダリング処理を実行する。 The video generation system 1 executes the rendering process using the updated USD file (step S108). For example, the video generation module 100 executes the rendering process using the rewritten, i.e., modified, USD file.

一方、映像生成システム１は、ＵＳＤの一部を修正しない場合（ステップＳ１０４：Ｎｏ）、作成用ＵＳＤ－Ｐｙｔｈｏｎ生成のためのプロンプトを生成する（ステップＳ１０９）。例えば、映像生成モジュール１００は、ＵＳＤの一部を修正しない、すなわちＵＳＤを新たに作成（生成）する場合、作成用ＵＳＤ－Ｐｙｔｈｏｎ生成のためテンプレート等を用いて、作成用ＵＳＤ－Ｐｙｔｈｏｎ生成のためのプロンプトを生成する。 On the other hand, when a partial modification of the USD is not to be performed (step S104: No), the video generation system 1 generates a prompt for generating the USD-Python for creation (step S109). For example, when the USD is not to be partially modified, that is, when a new USD is to be created (generated), the video generation module 100 generates the prompt using a template or the like for generating the USD-Python for creation.

映像生成システム１は、ＡＩ（ＬＬＭ等）にてＵＳＤ－Ｐｙｔｈｏｎ及びＵＳＤを生成する（ステップＳ１１０）。例えば、映像生成モジュール１００は、ＵＳＤ－Ｐｙｔｈｏｎを生成するためのモデルに、プロンプトを入力することにより、ＵＳＤ－Ｐｙｔｈｏｎ及びＵＳＤを生成する。 The video generation system 1 generates USD-Python and USD using AI (LLM, etc.) (step S110). For example, the video generation module 100 generates USD-Python and USD by inputting a prompt into a model for generating USD-Python.

映像生成システム１は、修正対象ＵＳＤファイルと置き換える（ステップＳ１１１）。例えば、映像生成モジュール１００は、生成したＵＳＤを、修正対象ＵＳＤファイルと置き換える。このように、映像生成システム１は、ＵＳＤファイルを更新する処理を実行する。例えば、映像生成システム１は、ＵＳＤファイルの複数のデータセット全体を更新する処理を実行する。そして、映像生成システム１は、ステップＳ１０８の処理を実行する。例えば、映像生成モジュール１００は、置き換え後、すなわち修正後のＵＳＤファイルを用いてレンダリング処理を実行する。 The video generation system 1 replaces the USD file to be modified with the generated USD (step S111). For example, the video generation module 100 replaces the USD file to be modified with the generated USD. In this manner, the video generation system 1 executes a process of updating the USD file. For example, the video generation system 1 executes a process of updating all of the multiple data sets of the USD file. The video generation system 1 then executes the process of step S108. For example, the video generation module 100 executes the rendering process using the replaced, i.e., modified, USD file.
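
The branch structure of steps S101 to S111 in FIG. 19 can be summarized in a short sketch. Everything here is an assumption made for illustration: the function `apply_modification`, the representation of USD files as a dictionary from path to content, the prompt wording, and the callables `analyze`, `llm`, and `render`, which stand in for the AI analysis, the USD-Python generation, and the rendering process. For simplicity the `llm` callable returns the new file content directly, whereas the described system generates USD-Python that is then reflected in the USD file.

```python
from typing import Callable, Dict, List, Optional, Tuple

UsdFiles = Dict[str, str]  # hypothetical: file path -> file content


def apply_modification(
    instruction: str,                                       # S101: user's correction input
    usd_files: UsdFiles,
    analyze: Callable[[str], Dict[str, Optional[str]]],
    llm: Callable[[str], str],
    render: Callable[[UsdFiles], List[str]],
) -> Tuple[UsdFiles, List[str]]:
    analysis = analyze(instruction)                         # S102: analyze input with AI
    layout = sorted(usd_files)                              # S103: recognize existing files
    target = analysis.get("target")

    if target in usd_files:                                 # S104: Yes, modify part of the USD
        prompt = (f"Modify only {target}. Request: {instruction}. "
                  f"Existing files: {layout}")              # S105: prompt for modification
        updated = dict(usd_files)
        updated[target] = llm(prompt)                       # S106, S107: replace one file
    else:                                                   # S104: No, create a new USD
        prompt = f"Create a new scene. Request: {instruction}"   # S109
        updated = {"scene.usd": llm(prompt)}                # S110, S111: replace everything

    return updated, render(updated)                         # S108: render with updated USD
```

Copying the dictionary before the partial update leaves the original files untouched, so a rejected modification can simply be discarded.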

ここで上述した修正に関するユーザインタフェース（ＵＩ）について記載する。映像生成システム１は、図２０に示すように、コンテンツＣＴ３１をユーザに提供する。図２０は、ユーザインタフェースの一例を示す図である。コンテンツＣＴ３１は、ストーリーボード等、生成した動画に関する情報を提示し、ユーザの修正指示を受け付けるための表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ３１を表示する。 Now, the user interface (UI) related to the above-mentioned modifications will be described. As shown in FIG. 20, the video generation system 1 provides the user with content CT31. FIG. 20 is a diagram showing an example of a user interface. The content CT31 is a display screen (content) for presenting information related to the generated video, such as a storyboard, and for accepting modifications from the user. For example, the client UI display unit 400 displays the content CT31.

ユーザは、コンテンツＣＴ３１を介して、生成された動画に対しての修正を指示する情報を入力する。例えば、ユーザが動画の特定の部分をクリックし、テキストで修正内容を入力した場合、映像生成システム１は、ＵＳＤを更新し、修正内容が反映された動画に更新（変更）する。 The user inputs information instructing corrections to be made to the generated video via content CT31. For example, when the user clicks on a specific part of the video and inputs corrections in text, the video generation system 1 updates the USD and updates (changes) the video to reflect the corrections.

図２０では、ユーザが一番上のカット（サムネイル）画像を選択し、「子供がスキップしてお母さんによっていく」という修正を指示した場合を示す。この場合、映像生成システム１は、生成した動画のうち、一番上のカット（サムネイル）画像に対応する部分を、「子供がスキップしてお母さんによっていく」という修正指示を基にＵＳＤを更新し、修正内容が反映された動画を生成する。 Figure 20 shows a case where the user selects the top cut (thumbnail) image and instructs a modification to "child skips and runs to mother." In this case, the video production system 1 updates the USD for the part of the generated video that corresponds to the top cut (thumbnail) image based on the modification instruction to "child skips and runs to mother," and generates a video that reflects the modification content.

なお、上記のＵＩは一例に過ぎず、映像生成システム１は、様々な態様によりユーザの修正指示を受け付けてもよい。例えば、映像生成システム１は、図２１に示すように、コンテンツＣＴ３２をユーザに提供し、ユーザの修正指示を受け付けてもよい。図２１は、ユーザインタフェースの一例を示す図である。コンテンツＣＴ３２は、ユーザの修正指示を受け付けるための表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ３２を表示する。 Note that the above UI is merely an example, and the video generation system 1 may accept the user's correction instructions in various ways. For example, as shown in FIG. 21, the video generation system 1 may provide the user with content CT32 and accept the user's correction instructions. FIG. 21 is a diagram showing an example of a user interface. The content CT32 is a display screen (content) for accepting the user's correction instructions. For example, the client UI display unit 400 displays the content CT32.

また、映像生成システム１は、図２２に示すように、コンテンツＣＴ３３をユーザに提供し、ユーザの修正指示を受け付けてもよい。図２２は、ユーザインタフェースの一例を示す図である。コンテンツＣＴ３３は、ユーザの修正指示を受け付けるための表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ３３を表示する。 Also, as shown in FIG. 22, the video generation system 1 may provide the user with content CT33 and accept the user's correction instructions. FIG. 22 is a diagram showing an example of a user interface. The content CT33 is a display screen (content) for accepting the user's correction instructions. For example, the client UI display unit 400 displays the content CT33.

ユーザは、コンテンツＣＴ３２またはコンテンツＣＴ３３を介して、生成された動画に対しての修正を指示する情報を入力する。これにより、映像生成システム１は、ユーザからの修正指示を受け付けて、修正指示を基にＵＳＤを更新し、修正内容が反映された動画を生成する。例えば、ユーザは動画中に出てくる人物や物を選択し、選択した対象に関するモーションの設定をテキストで編集することができる。また、ユーザは選択した対象のアセットを変更することができる。また、ユーザはカメラワークについてもテキストで編集することができる。また、ユーザは上記以外にも、動画の背景や照明の設定をテキストで編集することができる。 The user inputs information instructing corrections to the generated video via content CT32 or content CT33. As a result, the video generation system 1 accepts the correction instructions from the user, updates the USD based on the correction instructions, and generates a video that reflects the corrections. For example, the user can select a person or object that appears in the video, and edit the motion settings for the selected target in text. The user can also change the assets of the selected target. The user can also edit the camera work in text. In addition to the above, the user can also edit the background and lighting settings of the video in text.

なお、上記では動画の一例として説明したが、映像生成システム１は、ユーザの修正指示に基づいて、サウンド情報（ＢＧＭ、ＳＥ、ナレーション、セリフなど）やテキストロゴ情報などに対しての修正処理を実行してもよい。 Although the above description is given as an example of a moving image, the video generation system 1 may also perform correction processing on sound information (BGM, SE, narration, dialogue, etc.) and text logo information, etc., based on correction instructions from the user.

＜１－４－５．評価例＞ <1-4-5. Evaluation example>
また、映像生成システム１は、評価処理を実行する。例えば、映像生成システム１は、モデルＭ１０等のＡＩモデルを用いてシナリオデータについての評価テキストを生成する。例えば、映像生成システム１は、シナリオデータについて生成した評価テキストを基に、ユーザが行う編集（修正等）の指示を基に、シナリオデータを再度生成してもよい。映像生成システム１は、生成した評価テキストを提示する。これにより、ユーザは評価を見ながら映像制作を行うことができる。 Furthermore, the video generation system 1 executes an evaluation process. For example, the video generation system 1 generates evaluation text for the scenario data using an AI model such as model M10. For example, the video generation system 1 may regenerate the scenario data based on an editing (correction, etc.) instruction that the user gives in light of the evaluation text generated for the scenario data. The video generation system 1 presents the generated evaluation text. This allows the user to produce the video while viewing the evaluation.

例えば、映像生成システム１は、シナリオデータについての評価テキストをユーザに提示し、提示した評価テキストを確認したユーザからシナリオデータに対する編集の指示を受け付ける。映像生成システム１は、ユーザから受け付けた編集の指示を基に、シナリオデータを生成する。例えば、映像生成システム１は、ユーザから受け付けた編集の指示を基に、シナリオデータの内容を変更（更新）する。映像生成システム１は、評価テキストを基に生成されたシナリオデータを基に、コードを生成する。映像生成システム１は、評価テキストを基に生成されたコードを用いて動画データを生成する。 For example, the video generation system 1 presents evaluation text about the scenario data to the user, and accepts editing instructions for the scenario data from the user who has confirmed the presented evaluation text. The video generation system 1 generates scenario data based on the editing instructions accepted from the user. For example, the video generation system 1 changes (updates) the contents of the scenario data based on the editing instructions accepted from the user. The video generation system 1 generates code based on the scenario data generated based on the evaluation text. The video generation system 1 generates video data using the code generated based on the evaluation text.

なお、映像生成システム１は、評価テキストを基にシナリオデータまたはコードのうち少なくとも１つを自動で生成（更新）してもよい。例えば、映像生成システム１は、シナリオデータについての評価テキストが示す内容に対応するようにシナリオデータの内容を変更してもよい。例えば、映像生成システム１は、シナリオデータについての評価テキストがある登場人物の向きが良くないことを示す場合、その登場人物の向きを変更したシナリオデータを生成する。なお、上記は一例に過ぎず、映像生成システム１は、評価テキストを適宜用いて、シナリオデータまたはコードのうち少なくとも１つを生成してもよい。 The video generation system 1 may automatically generate (update) at least one of the scenario data or the code based on the evaluation text. For example, the video generation system 1 may change the content of the scenario data to correspond to the content indicated by the evaluation text for the scenario data. For example, when the evaluation text for the scenario data indicates that the orientation of a certain character is not good, the video generation system 1 generates scenario data in which the orientation of the character is changed. Note that the above is merely one example, and the video generation system 1 may generate at least one of the scenario data or the code by appropriately using the evaluation text.

映像生成システム１では、映像（動画）を生成した後、シナリオ情報、映像の情報、音情報、テキスト／ロゴ情報などはテキストや動画ファイル、サウンドファイル、画像ファイルなどで保存されている。そのため、映像生成システム１は、これらの情報を用いた評価処理を行うことが可能である。映像生成システム１は、ユーザの入力情報と生成されたストーリーボードのテキスト情報、評価テキストに基づき、既存動画の元となる３ＤＣＧアセットやレンダリング方法などを修正し、新しく動画を構成するコード出力のためのプロンプトを出力し、自然言語モデルへプロンプトを提供して、動画を構成するコードを出力し、動画を生成する。例えば、映像生成システム１は、ユーザの入力情報と生成されたストーリーボードのテキスト情報に基づき、評価テキスト（図５中の評価テキスト情報ＥＶ１に対応）を生成するためのプロンプトを出力し、自然言語モデルへプロンプトを提供して評価テキストを生成してもよい。例えば、映像生成システム１は、シナリオの評価については、シナリオ情報を基に評価のためのプロンプトを生成しＡＩ（ＬＬＭ等）に入力することで、シナリオの評価を示す評価テキストを生成してもよい。 In the video generation system 1, after a video (moving image) is generated, the scenario information, video information, sound information, text/logo information, and the like are stored as text, video files, sound files, image files, and so on. The video generation system 1 can therefore perform evaluation processing using this information. Based on the user's input information, the text information of the generated storyboard, and the evaluation text, the video generation system 1 modifies the 3DCG assets, rendering method, and the like underlying the existing video, outputs a prompt for producing the code that constitutes the new video, provides the prompt to a natural language model to obtain the code that constitutes the video, and generates the video. For example, the video generation system 1 may output a prompt for generating evaluation text (corresponding to the evaluation text information EV1 in FIG. 5) based on the user's input information and the text information of the generated storyboard, and provide the prompt to the natural language model to generate the evaluation text. For example, to evaluate a scenario, the video generation system 1 may generate an evaluation prompt based on the scenario information and input it to an AI (LLM, etc.) to generate evaluation text indicating the evaluation of the scenario.

また、映像生成システム１は、映像の構成の評価については、動画の１フレームをＡＩ（Contrastive Captioner Model、Image Captioning Modelなど）に入力することで動画のキャプションを取得する。そして、映像生成システム１は、取得したキャプションとシナリオ文をＡＩ（ＬＬＭ等）に入力することで、映像の構成の評価を示す評価テキストを生成する。例えば、映像生成システム１は、取得したキャプションとシナリオ文をＡＩ（ＬＬＭ等）で比較することでキャプション通りの画になっているかを評価してもよい。また、映像生成システム１は、ユーザが評価をして欲しいと希望する評価者像の指定を受け付けてもよい。例えば、映像生成システム１は、マーケット戦略、コピーライター、映像監督、特定の人などの視点で評価を行ってもよい。 In addition, to evaluate the composition of a video, the video generation system 1 acquires a caption for the video by inputting one frame of the video into an AI (Contrastive Captioner Model, Image Captioning Model, etc.). The video generation system 1 then generates evaluation text indicating an evaluation of the composition of the video by inputting the acquired caption and the scenario text into an AI (LLM, etc.). For example, the video generation system 1 may have the AI (LLM, etc.) compare the acquired caption with the scenario text to evaluate whether the picture matches the caption. The video generation system 1 may also accept a specification of the kind of evaluator by whom the user wishes the video to be evaluated. For example, the video generation system 1 may perform the evaluation from the perspective of market strategy, a copywriter, a video director, a specific person, or the like.
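
The composition evaluation described above (a caption from one frame, compared with the scenario text from a chosen evaluator's viewpoint) can be sketched as a prompt-assembly step. This is a minimal illustration only: the function name `build_evaluation_prompt`, the prompt wording, and the default evaluator are assumptions, and the captioning model and LLM calls themselves are left out.

```python
def build_evaluation_prompt(scenario_text: str, frame_caption: str,
                            evaluator: str = "video director") -> str:
    """Assemble an LLM prompt that compares a frame caption with the scenario text.

    Hypothetical helper: the source only says that the caption and the
    scenario text are given to an AI (LLM, etc.); the wording is assumed.
    """
    if not scenario_text.strip() or not frame_caption.strip():
        raise ValueError("scenario_text and frame_caption must be non-empty")
    return (
        f"You are evaluating a generated video from the viewpoint of a {evaluator}.\n"
        f"Scenario text:\n{scenario_text.strip()}\n\n"
        f"Caption obtained from one frame of the video:\n{frame_caption.strip()}\n\n"
        "Does the frame match the scenario? Describe any mismatch and suggest corrections."
    )
```

The returned string would then be passed to whichever language model the system uses; the evaluator argument corresponds to the evaluator type specified by the user.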

＜１－４－６．複数カットの選択例＞ <1-4-6. Example of multiple cut selection>
ここから、編集処理についていくつか例示を記載する。従来技術では、ストーリーボード上では複数カットを同時に修正できないという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図２３に示すように、複数のカットを選択して編集処理（修正処理）を行ってもよい。図２３は、編集処理の一例を示す図である。具体例には、図２３は、複数カットの選択に基づく編集処理の一例を示す図である。 From here, several examples of the editing process are described. In the conventional technology, multiple cuts cannot be corrected simultaneously on a storyboard. The conventional technology thus has a usability problem, and there is room for improvement in usability. Therefore, as shown in FIG. 23, the video generation system 1 may perform the editing process (correction process) on multiple selected cuts. FIG. 23 is a diagram showing an example of the editing process. Specifically, FIG. 23 is a diagram showing an example of the editing process based on the selection of multiple cuts.

映像生成システム１は、図２３に示すように、コンテンツＣＴ４１をユーザに提供し、ユーザの複数カットの選択に応じた修正指示を受け付けてもよい。コンテンツＣＴ４１は、カットＣＵ１～ＣＵ４等の複数のカット（シーン）を含む動画に対するユーザの修正指示を受け付けるための表示画面（コンテンツ）である。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ４１を表示する。 As shown in FIG. 23, the video production system 1 may provide the user with content CT41 and accept correction instructions according to the user's selection of multiple cuts. The content CT41 is a display screen (content) for accepting the user's correction instructions for a video including multiple cuts (scenes) such as cuts CU1 to CU4. For example, the client UI display unit 400 displays the content CT41.

ユーザは、コンテンツＣＴ４１を介して、カットＣＵ１～ＣＵ４のうち、複数のカットを選択し、選択した複数のカットに対しての修正を指示する情報を入力する。例えば、ユーザは、コンテンツＣＴ４１中のカットＣＵ１～ＣＵ４のうち、カットＣＵ１～ＣＵ３が表示された範囲を選択する操作（線で囲む操作等）を行うことやカットＣＵ１～ＣＵ３の各々をクリックすること等により、カットＣＵ１～ＣＵ３を選択する。 The user selects multiple cuts from among cuts CU1 to CU4 via content CT41, and inputs information instructing corrections to the selected multiple cuts. For example, the user selects cuts CU1 to CU3 by performing an operation to select the range in which cuts CU1 to CU3 are displayed (such as by surrounding it with a line) among cuts CU1 to CU4 in content CT41, or by clicking on each of cuts CU1 to CU3.

そして、ユーザは、カットＣＵ１～ＣＵ３を選択した後に、修正指示を示すプロンプト（文字情報等）を、ユーザの入力情報として入力することにより、映像生成システム１にカットＣＵ１～ＣＵ３を対象とした修正を指示する。映像生成システム１は、ユーザからの修正指示に応じて、カットＣＵ１～ＣＵ３を対象とした修正を実行する。これにより、ユーザは、複数のカットを選択し、プロンプトを入力することで選択されたカットの構成やカット内容を修正することができる。例えば、各カットのシナリオ情報には、カットの内容、登場人物、カット撮影時間帯などが含まれており、映像生成システム１は、ユーザの入力情報とシナリオ情報をＡＩ（ＬＬＭなど）に与えることで、シナリオを再生成する。なお、映像生成システム１は、ユーザの修正指示の内容に基づいて、任意の修正処理を実行する。例えば、映像生成システム１は、必要に応じて映像のＵＳＤファイルを更新してもよいし、ＵＳＤファイルはそのままでカットの順番のみを変更してもよい。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 After selecting cuts CU1 to CU3, the user inputs a prompt (text information, etc.) indicating a correction instruction as user input information, thereby instructing the video production system 1 to correct cuts CU1 to CU3. The video production system 1 executes corrections for cuts CU1 to CU3 in response to the correction instruction from the user. This allows the user to select multiple cuts and correct the configuration and content of the selected cuts by inputting a prompt. For example, the scenario information for each cut includes the content of the cut, the characters, the time period during which the cut was shot, etc., and the video production system 1 regenerates the scenario by providing the user's input information and scenario information to the AI (such as LLM). Note that the video production system 1 executes any correction process based on the content of the user's correction instruction. For example, the video production system 1 may update the USD file of the video as necessary, or may change only the order of the cuts while leaving the USD file as is. The above-mentioned process allows the video production system 1 to improve usability.
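
As a rough sketch of the multi-cut correction described above, the scenario information of the selected cuts can be serialized together with the user's instruction before being handed to an LLM. The class `CutScenario`, its field names, the function `build_multi_cut_prompt`, and the prompt wording are all assumptions made for illustration; the source states only that the user's input information and the scenario information are given to an AI.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CutScenario:
    """Scenario information of one cut (field names are hypothetical)."""
    cut_id: str
    content: str
    characters: list[str]
    time_of_day: str


def build_multi_cut_prompt(cuts: list[CutScenario], selected_ids: list[str],
                           instruction: str) -> str:
    """Build a regeneration prompt that covers only the selected cuts."""
    wanted = set(selected_ids)
    selected = [asdict(c) for c in cuts if c.cut_id in wanted]
    missing = wanted - {c["cut_id"] for c in selected}
    if missing:
        raise KeyError(f"unknown cut ids: {sorted(missing)}")
    return (
        "Regenerate the scenario for the selected cuts only.\n"
        f"Instruction: {instruction}\n"
        "Selected cuts (JSON):\n"
        f"{json.dumps(selected, ensure_ascii=False, indent=2)}"
    )
```

Keeping the unselected cuts out of the prompt is one way to limit the regeneration to the user's selection; whether the USD file is then updated or only the cut order changes depends on the instruction, as the text notes.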

＜１－４－７．時間情報の利用例＞ <1-4-7. Example of using time information>
従来技術では、特定のカットの修正後、その修正の影響を受ける他のシーンが修正（変更）されないという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図２４に示すように、時間情報を用いて編集処理を行ってもよい。図２４は、編集処理の一例を示す図である。具体例には、図２４は、時間情報に応じて決定した修正内容に基づく編集処理の一例を示す図である。例えば、映像生成システム１は、各カットにＡＩが生成した日付時間情報（「時間情報」または「日付情報」ともいう）を入れておき、その情報を基に映像のカット割りや修正内容を決定する。 In the conventional technology, after a specific cut is corrected, other scenes affected by that correction are not corrected (changed). The conventional technology thus has a usability problem, and there is room for improvement in usability. Therefore, as shown in FIG. 24, the video generation system 1 may perform the editing process using time information. FIG. 24 is a diagram showing an example of the editing process. Specifically, FIG. 24 is a diagram showing an example of an editing process based on correction content determined according to the time information. For example, the video generation system 1 stores AI-generated date and time information (also called "time information" or "date information") in each cut, and determines the cut division and the correction content of the video based on that information.

図２４では、映像生成システム１は、各カットにそのカットに対応する日付時間情報（時間情報）を対応付けて管理する。例えば、カットＣＵ１１には２０２３年１２月２１日１０時３１分を示す時間情報ＴＩ１１が対応付けられる。また、カットＣＵ１２には２０２３年１２月２１日１０時４１分を示す時間情報ＴＩ１２が対応付けられる。このように、カットＣＵ１１及びカットＣＵ１２は、時間的に近い（近接した）カットである。この場合、映像生成システム１は、カットＣＵ１１が修正された場合、その修正をカットＣＵ１２にも反映する。動画データの各カットには時間情報（日付情報）が対応付けられている。 In FIG. 24, the video production system 1 manages each cut by associating it with date and time information (time information) corresponding to that cut. For example, cut CU11 is associated with time information TI11 indicating 10:31 on December 21, 2023. Cut CU12 is associated with time information TI12 indicating 10:41 on December 21, 2023. In this way, cut CU11 and cut CU12 are cuts that are close in time (close to each other). In this case, if cut CU11 is modified, the video production system 1 also reflects the modification in cut CU12. Time information (date information) is associated with each cut in the video data.

例えば、映像生成システム１は、カットＣＵ１１での人物Ｘの服装が修正された場合、カットＣＵ１１での人物Ｘの服装と同じようにカットＣＵ１２での人物Ｘの服装も修正する。例えば、映像生成システム１は、各カット間の時間情報を比較し、修正されたカット（「修正対象カット」ともいう）との時間差が所定の範囲内であるカット（「影響カット」ともいう）がある場合、修正対象カットでの修正内容に基づく修正を、その影響カットにも反映すると決定する。 For example, when the clothing of person X in cut CU11 is modified, the video production system 1 also modifies the clothing of person X in cut CU12 in the same way as the clothing of person X in cut CU11. For example, the video production system 1 compares the time information between each cut, and when there is a cut (also called an "affected cut") whose time difference with the modified cut (also called a "cut to be modified") is within a predetermined range, it determines that the modification based on the modification content in the cut to be modified should also be reflected in the affected cut.

そして、映像生成システム１は、修正対象カットでの修正内容に基づく修正を影響カットに対して実行する。なお、このカット間の影響に基づく影響カットの修正（変更）については、映像生成システム１は、人アセットのみに限らず、環境アセット（天候、ライティング、時間経過で変化するろうそく等のプロップ等）にも行ってもよい。 Then, the video production system 1 performs corrections on the affected cuts based on the correction content in the cut to be corrected. Note that the video production system 1 may perform corrections (changes) on the affected cuts based on the influence between cuts not only on human assets, but also on environmental assets (weather, lighting, props such as candles that change over time, etc.).

例えば、カットＣＵ２１には２０２３年１２月２１日１０時３１分を示す時間情報ＴＩ２１が対応付けられる。また、カットＣＵ２２には２０２３年１２月２１日１８時４１分を示す時間情報ＴＩ２２が対応付けられる。このように、カットＣＵ２１及びカットＣＵ２２は、時間的に遠い（離間した）カットである。この場合、映像生成システム１は、カットＣＵ２１が修正された場合、その修正をカットＣＵ２２には反映しない。 For example, cut CU21 is associated with time information TI21 indicating 10:31 on December 21, 2023. Also, cut CU22 is associated with time information TI22 indicating 18:41 on December 21, 2023. In this way, cut CU21 and cut CU22 are cuts that are distant in time (separate). In this case, when cut CU21 is modified, the video production system 1 does not reflect the modification in cut CU22.

例えば、映像生成システム１は、カットＣＵ２１での人物Ｘの服装が修正された場合、カットＣＵ２１での人物Ｘの服装の修正に応じて、カットＣＵ２２での人物Ｘの服装は修正しない。例えば、映像生成システム１は、各カット間の時間情報を比較し、修正されたカット（修正対象カット）との時間差が所定の範囲内であるカット（影響カット）がない場合、修正対象カットでの修正内容に基づく修正を他のカットには反映しないと決定する。 For example, when the clothing of person X in cut CU21 is modified, the video production system 1 does not modify the clothing of person X in cut CU22 in response to the modification of the clothing of person X in cut CU21. For example, the video production system 1 compares time information between each cut, and if there is no cut (affected cut) whose time difference with the modified cut (cut to be modified) is within a predetermined range, it determines that the modification based on the modification content of the cut to be modified will not be reflected in other cuts.
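
The time-difference comparison described above can be sketched as follows. This is a minimal illustration: the function name `find_affected_cuts` and the one-hour threshold are assumptions, since the source says only that the time difference must fall within a "predetermined range".

```python
from datetime import datetime, timedelta


def find_affected_cuts(cut_times: dict[str, datetime], modified_cut: str,
                       max_gap: timedelta = timedelta(hours=1)) -> list[str]:
    """Return the cuts whose time information lies within max_gap of the modified cut.

    max_gap stands in for the "predetermined range"; its value is an assumption.
    """
    base = cut_times[modified_cut]
    return sorted(
        cut_id
        for cut_id, t in cut_times.items()
        if cut_id != modified_cut and abs(t - base) <= max_gap
    )


# Time information taken from the FIG. 24 examples in the text.
near = {"CU11": datetime(2023, 12, 21, 10, 31), "CU12": datetime(2023, 12, 21, 10, 41)}
far = {"CU21": datetime(2023, 12, 21, 10, 31), "CU22": datetime(2023, 12, 21, 18, 41)}
print(find_affected_cuts(near, "CU11"))  # CU12 is 10 minutes away, so it is affected
print(find_affected_cuts(far, "CU21"))   # CU22 is about 8 hours away, so nothing is affected
```

A correction made to the modified cut would then be applied to every cut in the returned list, and to no other cut when the list is empty.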

例えば、映像生成システム１は、シナリオ生成時に、各カットに日付時間情報もＡＩ（ＬＬＭなど）で生成しておき、各カットのメタ情報として保存しておく。この日付時間情報は前後のカットとの関係性を考えるためや、季節などを把握するために利用される。生成された架空の日付時間情報は、各カットの内容や各カット間の関係性を保つために入れておく。例えば、映像生成システム１は、雪が降っている朝のショットだとすると、１月２４日午前６時３０分などとする。また、例えば、映像生成システム１は、前後のカットの関係性が強いショットだとすると、１月２４日午前６時３０分と１月２４日午前７時など同日の近い時間帯とする。 For example, when generating a scenario, the video production system 1 generates date and time information for each cut using AI (such as LLM) and saves it as meta information for each cut. This date and time information is used to consider the relationship with the previous and next cuts and to understand the season, etc. The generated fictitious date and time information is included to maintain the content of each cut and the relationships between each cut. For example, if the shot is taken on a morning when it is snowing, the video production system 1 may set it to 6:30 AM on January 24th. Also, if the shot has a strong relationship with the previous and next cuts, the video production system 1 may set it to nearby time periods on the same day, such as 6:30 AM on January 24th and 7:00 AM on January 24th.

例えば、映像生成システム１は、季節や時間に応じてユーザの着ている服、太陽のライティング設定、空気のモヤなどを変化させる。また、対象カットと前のカットの日付時間情報が近い場合、対象カットを変更すると前後のカットも影響を受ける。例えば、同じ人物が別カットに出演し、対象カットと時間が近い場合、対象カットの服を変更すると別カットの服も変更される。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 For example, the video generation system 1 changes the clothes worn by the user, the sun's lighting settings, the haze of the air, etc. depending on the season and time. Also, if the date and time information of the target cut and the previous cut are close, changing the target cut will also affect the previous and next cuts. For example, if the same person appears in another cut and the time is close to that of the target cut, changing the clothes in the target cut will also change the clothes in the other cut. Through the above-mentioned processing, the video generation system 1 can improve usability.

＜１－４－８．言語化の度合いに応じた応答例＞ <1-4-8. Response examples according to the degree of verbalization>
従来技術では、修正内容をユーザが具体的に言語化できない場合、ユーザの意図に沿った修正が難しいという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、以下に示すような処理により、修正内容をユーザが具体的に言語化できない場合であっても、ユーザの意図に沿った修正を可能にしてもよい。 In the conventional technology, when the user cannot put the desired correction into concrete words, it is difficult to make a correction that matches the user's intention. The conventional technology thus has a usability problem, and there is room for improvement in usability. Therefore, the video generation system 1 may use the processing described below to enable corrections that match the user's intention even when the user cannot put the desired correction into concrete words.

例えば、映像生成システム１は、入力された文章に応じて、返答方式を変化させてもよい。映像生成システム１は、図２５に示すように、ユーザの指示の抽象度、言語化の度合いに応じて応答を異ならせてもよい。図２５は、編集処理の一例を示す図である。具体例には、図２５は、言語化の度合いに応じた編集処理の一例を示す図である。 For example, the video production system 1 may change the response method depending on the input text. As shown in FIG. 25, the video production system 1 may vary the response depending on the level of abstraction and verbalization of the user's instructions. FIG. 25 is a diagram showing an example of an editing process. As a specific example, FIG. 25 is a diagram showing an example of an editing process depending on the level of verbalization.

図２５中の応答例ＡＰ１は、「見るユーザが日常を感じる映像にしてください」といった目的ベースの修正指示をユーザが行った場合を示す。この場合、映像生成システム１は、具体的な指示の文章を添えて映像を生成する。例えば、映像生成システム１は、「見るユーザが日常を感じる映像にしてください」といった目的ベースの修正指示に対して、「手を自然な角度で下に下げ、体の向きに合わせて動かします。」という文章を添えて、その文章を基に修正した映像をディスプレイ等に表示することにより、ユーザに対して提示する。 Response example AP1 in FIG. 25 shows a case where a user issues a purpose-based correction instruction such as "Please make the video so that the viewer feels like they are in an everyday situation." In this case, image generation system 1 generates an image with a specific instruction text attached. For example, in response to a purpose-based correction instruction such as "Please make the video so that the viewer feels like they are in an everyday situation," image generation system 1 adds a text such as "Lower your hands at a natural angle and move them in accordance with the direction of your body," and presents the video corrected based on the text to the user by displaying it on a display or the like.

また、図２５中の応答例ＡＰ２は、「両腕を自然な状態にして下さい」といった抽象的な文章の修正指示をユーザが行った場合を示す。この場合、映像生成システム１は、具体的な文章を複数提示し、ユーザに選択させる。例えば、映像生成システム１は、「手を下に下ろして、空上の向きに合わせて動かします」、「手で頭をかいて、その後手を下ろします」、「ポケットに手を入れます」等の複数の文章をディスプレイ等に表示することにより、ユーザに対して提示する。そして、映像生成システム１は、複数の文章のうちユーザが選択した文章を基に修正した映像をディスプレイ等に表示することにより、ユーザに対して提示する。 Response example AP2 in FIG. 25 shows a case where the user gives a correction instruction in abstract wording, such as "Please put both arms in a natural position." In this case, the video generation system 1 presents several concrete sentences and has the user select one. For example, the video generation system 1 presents to the user several sentences such as "Put your hands down and move them in the direction of the sky," "Scratch your head with your hand and then lower your hand," and "Put your hands in your pockets" by displaying them on a display or the like. The video generation system 1 then presents to the user the video corrected based on the sentence that the user selected, by displaying it on a display or the like.

図２５中の応答例ＡＰ３は、「２秒で右前の車の方を見るようにして下さい」といった具体的な文書の修正指示をユーザが行った場合を示す。この場合、映像生成システム１は、ユーザが入力した文章通りに変更（修正）した映像をディスプレイ等に表示することにより、ユーザに対して提示する。例えば、目的ベースの文章、抽象的な文章、具体的な文章の違いは、ＡＩ（ＬＬＭなど）により判断され、映像生成システム１は、それに応じて生成するプロンプトを変えてＡＩ（ＬＬＭなど）に処理を投げてもよい。 Response example AP3 in FIG. 25 shows a case where the user gives a correction instruction in concrete wording, such as "Please look toward the car at the front right within 2 seconds." In this case, the video generation system 1 presents to the user the video changed (corrected) exactly as the user's sentence specifies, by displaying it on a display or the like. For example, whether a sentence is purpose-based, abstract, or concrete may be determined by an AI (such as an LLM), and the video generation system 1 may change the prompt it generates accordingly and pass the processing to the AI (such as an LLM).
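
The three response patterns AP1 to AP3 can be summarized as a small dispatch step. The sketch below assumes that the classification into purpose-based, abstract, or concrete has already been made (by an LLM, per the text) and that any concrete candidate sentences have already been generated; `InstructionKind`, `plan_response`, and the returned dictionary layout are hypothetical names introduced for illustration.

```python
from enum import Enum


class InstructionKind(Enum):
    PURPOSE = "purpose"    # e.g. "make the viewer feel everyday life" (AP1)
    ABSTRACT = "abstract"  # e.g. "put both arms in a natural position" (AP2)
    CONCRETE = "concrete"  # e.g. "look toward the car within 2 seconds" (AP3)


def plan_response(kind: InstructionKind, instruction: str,
                  concrete_candidates: tuple[str, ...] = ()) -> dict:
    """Decide how to respond to a correction instruction of the given kind."""
    if kind is InstructionKind.CONCRETE:
        # AP3: apply the instruction exactly as written.
        return {"action": "apply", "edit_text": instruction, "choices": []}
    if not concrete_candidates:
        raise ValueError("purpose-based and abstract instructions need candidate sentences")
    if kind is InstructionKind.PURPOSE:
        # AP1: apply one concrete sentence and show it alongside the video.
        return {"action": "apply_with_explanation",
                "edit_text": concrete_candidates[0], "choices": []}
    # AP2: present several concrete sentences and let the user choose.
    return {"action": "ask_user_to_choose", "edit_text": None,
            "choices": list(concrete_candidates)}
```

Choosing the first candidate for a purpose-based instruction is an arbitrary simplification; the source does not say how the accompanying sentence is selected.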

例えば、映像生成システム１は、センサ用いてモーションやカメラの動きをユーザに指定させてもよい。映像生成システム１は、図２６に示すように、Ｗｅｂカメラ等の任意のセンサから得たモーション情報を映像に重畳表示（重ねて表示）し、ユーザの修正を受け付けてもよい。図２６は、編集処理の一例を示す図である。具体例には、図２６は、映像への重畳表示による編集処理の一例を示す図である。 For example, the video production system 1 may allow the user to specify motion or camera movement using a sensor. As shown in FIG. 26, the video production system 1 may superimpose (overlap) motion information obtained from an arbitrary sensor such as a web camera on the video and accept user corrections. FIG. 26 is a diagram showing an example of editing processing. As a specific example, FIG. 26 is a diagram showing an example of editing processing by superimposing a display on the video.

例えば、映像生成システム１は、モバイルモーションキャプチャ、Ｗｅｂカメラ等の任意のセンサから得たモーション情報ＭＴを映像ＭＶ１１に重畳させて表示し、ユーザの修正指示を受け付ける。このように、映像生成システム１は、顔表情等を含むモーション情報を現在の映像の上に重ねて表示することにより、現在の映像とモーション情報との差異を可視化した状態を基に、ユーザの修正指示を受け付ける。例えば、４秒のカットであれば、常に４秒のカットが再生され続け、ユーザは気に入るまで何度も自分でモーション情報を変更することができる。例えば、ユーザが気に入ったモーション情報がある場合、映像生成システム１は、自然なモーション情報を最終生成して映像のモーションを変更してもよい。 For example, the video generation system 1 displays motion information MT obtained from any sensor such as a mobile motion capture or a web camera superimposed on the video MV11 and accepts correction instructions from the user. In this way, the video generation system 1 displays motion information including facial expressions etc. superimposed on the current video, and accepts correction instructions from the user based on a visualized state of the difference between the current video and the motion information. For example, if it is a four-second cut, the four-second cut will always continue to be played, and the user can change the motion information themselves as many times as necessary until they are satisfied. For example, if there is motion information that the user likes, the video generation system 1 may finally generate natural motion information to change the motion of the video.

ユーザのモーションはモバイルモーションキャプチャ、Ｗｅｂカメラなどでトラッキングされる。また、ユーザはどの人物のモーションを修正するかは、事前にＵＩからマウスで選択してもよい。映像中の頭と体の大きさや向きより、重ねる自分のモーション表示部分の大きさと位置向きを決定する。細かい大きさと位置向きの調整は、ユーザがマウスとキーボードで入力して調整してもよい。モーションの録画は、上記画像に記載したカットが繰り返し際される方法もありうるし、スタートボタンを押してから体勢を整えるまでの数秒後に録画が開始する方法もありうる。また、撮影後、ユーザがモーション生成ボタンを押すことで、入力したモーションが指定した登場人物に適応されてもよい。 The user's motion is tracked using mobile motion capture, a web camera, or the like. The user may select in advance, with the mouse from the UI, which character's motion is to be corrected. The size, position, and orientation of the superimposed display of the user's own motion are determined from the size and orientation of the head and body in the video. The user may fine-tune the size, position, and orientation by mouse and keyboard input. The motion may be recorded while the cut shown in the image above is played back repeatedly, or recording may start a few seconds after the start button is pressed, giving the user time to get into position. After shooting, the user may press a motion generation button to apply the input motion to the specified character.

また、撮影したモーション自体は不自然であることもありうるので、映像生成システム１は、撮影したモーションをmotion to motionのＡＩを用い、モーション推定や生成などを行い、より自然なモーションに変換してから登場人物のモーションに適応してもよい。また、映像生成システム１は、入力したモーションと自然文を用い、モーションを推定したり生成したりしてもよい。例えば、映像生成システム１は、モーションを入力後、ユーザが入力した「こんな感じで活き活きとした動きにする」などの自然文と共にＡＩ（ＬＬＭ、text&motion to motion生成のモデルなど）に処理を投げてもよい。 In addition, since the captured motion itself may be unnatural, the video generation system 1 may use a motion-to-motion AI to estimate or generate the captured motion, convert it into a more natural motion, and then adapt it to the motion of the characters. The video generation system 1 may also estimate or generate the motion using the input motion and natural text. For example, after inputting the motion, the video generation system 1 may send it to an AI (such as an LLM or a text&motion to motion generation model) for processing along with the natural text input by the user, such as "Make the movement lively like this."

また、ユーザは、自身の手をカメラと見立てて動かして、カメラワークを変更してもよい。例えば、映像生成システム１は、ユーザの手の動きをＷｅｂカメラ等のセンサで取得する。映像生成システム１は、図２７に示すように、変更したカメラワークを枠として映像に重ねて提示（重畳表示）する。図２７は、編集処理の一例を示す図である。具体例には、図２７は、映像へのカメラワークの重畳表示の一例を示す図である。 The user may also change the camerawork by moving their hand as if it were a camera. For example, the video production system 1 acquires the movement of the user's hand using a sensor such as a web camera. As shown in FIG. 27, the video production system 1 presents (superimposed) the changed camerawork as a frame superimposed on the video. FIG. 27 is a diagram showing an example of editing processing. As a specific example, FIG. 27 is a diagram showing an example of superimposed display of camerawork on video.

例えば、映像生成システム１は、変更前のカメラワークを示す枠ＣＷ１と変更後のカメラワークを示す枠ＣＷ２とを映像ＭＶ１２に重畳させて表示する。そして、映像生成システム１は、ユーザが変更後のカメラワークを用いることを指示した場合、その変更後のカメラワークを基にレンダリング処理を実行する。このように、ユーザが変更後のカメラワークをＯＫとして、その変更後のカメラワークに映像がレンダリングされる。 For example, the video production system 1 displays a frame CW1 indicating the camerawork before the change and a frame CW2 indicating the camerawork after the change, superimposed on the video MV12. Then, when the user instructs the video production system 1 to use the changed camerawork, the video production system 1 executes a rendering process based on the changed camerawork. In this way, when the user approves the changed camerawork, the video is rendered with the changed camerawork.

また、ユーザは、自身の携帯端末（スマートフォン等）のＩＭＵやＩｍａｇｅＳＬＡＭ等を用いてカメラワークを指定してもよい。例えば、映像生成システム１は、ＡＲ（Augmented Reality）でメインキャラクターを実空間（現実の机の上等）に配置して提示してもよい。映像生成システム１は、変更したカメラワークを図２７に示す場合と同様に映像に枠を重ねて表示する。 The user may also specify the camerawork using an IMU or ImageSLAM of the user's mobile device (such as a smartphone). For example, the image generation system 1 may present the main character by placing it in real space (such as on a real desk) using AR (Augmented Reality). The image generation system 1 displays the changed camerawork by superimposing a frame on the image, in the same way as in the case shown in FIG. 27.

上記のように、映像生成システム１では、ユーザが手やスマホの動きを使い、モーションやカメラの動きを決めてもよい。例えば、ユーザが手の動きでカメラワークを決める際、もう一方の手で指定した人物を想定し、左手（カメラ）と右手（人物）の距離から、カメラの相対的な位置を決めてもよい。また、映像生成システム１は、ユーザが手やカメラの入力で変更したカメラワークを枠として映像に重ねて表示するが、その後、ユーザがマウス操作で枠やその動きを微調整してもよい。また、入力したカメラワークは手で入れているため不自然な動きである可能性もあるため、映像生成システム１は、motion to motion生成モデル、text&motion to motion生成モデル等を用いて、自然なカメラワークに修正してもよい。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 As described above, in the video generation system 1, the user may use the movement of his/her hand or smartphone to determine the motion and camera movement. For example, when the user determines the camerawork with the movement of his/her hand, the user may imagine a person designated with the other hand and determine the relative position of the camera from the distance between the left hand (camera) and the right hand (person). In addition, the video generation system 1 displays the camerawork changed by the user's hand or camera input as a frame superimposed on the video, but the user may then fine-tune the frame and its movement by operating the mouse. In addition, since the input camerawork is entered by hand and may be unnatural, the video generation system 1 may correct the camerawork to a more natural one using a motion to motion generation model, a text&motion to motion generation model, etc. Through the above-mentioned processing, the video generation system 1 can improve usability.

＜１－４－９．定性的な値の利用例＞ <1-4-9. Example of using qualitative values>
従来技術では、現状や変更履歴をシステムとしてどのように判断するのかという点については考慮されていない場合があった。例えば、「もう少し後ろに立って」や「もう少し明るくして」など、現状と比較して調整したいという場合等があり、システム的にどのように現状を理解（把握）するかについては課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、以下に示すような処理により、現状や変更履歴を適切に判断可能にしてもよい。 In the conventional technology, how the system should determine the current state and the change history has in some cases not been considered. For example, a user may want to make an adjustment relative to the current state, such as "stand a little further back" or "make it a little brighter," and how the system should understand (grasp) the current state remains an issue. The conventional technology thus has a usability problem, and there is room for improvement in usability. Therefore, the video generation system 1 may use the processing described below to make it possible to determine the current state and the change history appropriately.

例えば、映像生成システム１は、動き速度、明るさなど各ポイントを、定量的な数値として保持しておき、その値を比較して映像を修正してもよい。映像生成システム１は、図２８に示すように、定性的な値を用いて編集処理を行ってもよい。図２８は、編集処理の一例を示す図である。具体例には、図２８は、定性的な値に応じて決定した修正内容に基づく編集処理の一例を示す図である。 For example, the video production system 1 may store each point, such as the speed of motion and brightness, as a quantitative numerical value and modify the video by comparing the values. The video production system 1 may also perform editing processing using qualitative values, as shown in FIG. 28. FIG. 28 is a diagram showing an example of editing processing. As a specific example, FIG. 28 is a diagram showing an example of editing processing based on modification contents determined according to qualitative values.

図２８では、値情報ＶＬ２１は映像ＭＶ２１に対応づけられた定量的な値を示す。例えば、値情報ＶＬ２１は、町全体の明るさ、人の歩行スピード、人の顔を動かすスピード、人の位置、車のスピード、車の位置等の定性的な値を含み、映像ＭＶ２１のカット等に対応づけられる値を示す。このように、映像生成システム１は、変更されうる値は全て定量的な値として保持しておき、その値と比較して映像を変更する。 In FIG. 28, value information VL21 indicates a quantitative value associated with video MV21. For example, value information VL21 includes qualitative values such as the brightness of the entire town, people's walking speed, the speed at which people's faces move, people's positions, car speeds, and car positions, and indicates values associated with cuts in video MV21. In this way, video production system 1 stores all values that can be changed as quantitative values, and changes the video by comparing them with those values.

図２８では、映像ＭＶ２１について、ユーザが「顔をもう少しゆっくり動かして」という修正指示を行い、映像生成システム１は、映像ＭＶ２１の顔を動かすスピードを遅くした映像ＭＶ２２を生成する。映像生成システム１は、「顔をもう少しゆっくり動かして」という修正指示を基に、映像ＭＶ２１の値情報ＶＬ２１のうち人の顔を動かすスピードの値を「２１」から「１０」に減少させた値情報ＶＬ２２の映像ＭＶ２２を生成する。 In FIG. 28, the user gives a modification instruction for video MV21, "move your face a little slower," and video production system 1 generates video MV22 in which the speed at which the face moves in video MV21 is slowed down. Based on the modification instruction, "move your face a little slower," video production system 1 generates video MV22 with value information VL22 in which the value of the speed at which the person's face moves, in the value information VL21 of video MV21, is reduced from "21" to "10."

図２８では、値情報ＶＬ２２は修正後の映像ＭＶ２２に対応づけられた定量的な値を示す。例えば、値情報ＶＬ２２は、町全体の明るさ、人の歩行スピード、人の顔を動かすスピード、人の位置、車のスピード、車の位置等の定性的な値を含み、映像ＭＶ２２のカット等に対応づけられる値を示す。 In FIG. 28, value information VL22 indicates quantitative values associated with the corrected video MV22. For example, value information VL22 indicates values associated with cuts in video MV22, including qualitative values such as the brightness of the entire town, people's walking speed, the speed at which people's faces move, people's positions, car speeds, and car positions, etc.

このように、映像生成システム１は、定性的な値を用いて映像の編集処理を行ってもよい。例えば、映像生成システム１は、動き速度、明るさなど各ポイント（項目）について、定量的な数値として保持しておき、その値を比較して映像を修正する。上述したように、定性的な値を保持しておく街全体の明るさ、人の歩行スピードなどの分類は、事前に設定された項目である。例えば、映像生成システム１は、各々を自然言語からＡＩ（ＬＬＭなど）を使って設定修正してもよいし、設定値を直接修正してもよい。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 In this way, the video production system 1 may perform video editing processing using qualitative values. For example, the video production system 1 may store quantitative values for each point (item), such as movement speed and brightness, and compare the values to modify the video. As described above, classifications such as the brightness of the entire city and people's walking speed, for which qualitative values are stored, are items that are set in advance. For example, the video production system 1 may set and modify each of these using AI (such as LLM) from natural language, or may directly modify the set values. Through the above-mentioned processing, the video production system 1 can improve usability.
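
The value handling described for FIG. 28 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class `ValueInfo`, its field names, and the helpers `apply_edit` and `changed_items` are hypothetical, and every number except the face speed change from 21 to 10 is made up. The step that turns the instruction "move your face a little slower" into an item and a new value (for example with an LLM) is assumed to happen elsewhere.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ValueInfo:
    """Preset items kept as numeric values for one cut (names are hypothetical)."""
    town_brightness: float
    walking_speed: float
    face_speed: float
    car_speed: float


def apply_edit(values: ValueInfo, item: str, new_value: float) -> ValueInfo:
    """Return a copy of the value information with one preset item changed."""
    if item not in {f.name for f in fields(values)}:
        raise KeyError(f"unknown item: {item}")
    return replace(values, **{item: new_value})


def changed_items(before: ValueInfo, after: ValueInfo) -> dict:
    """Compare two value sets and report (old, new) for each item that differs."""
    return {
        f.name: (getattr(before, f.name), getattr(after, f.name))
        for f in fields(before)
        if getattr(before, f.name) != getattr(after, f.name)
    }


# VL21 -> VL22: only the face speed (21 -> 10) comes from the description.
vl21 = ValueInfo(town_brightness=50, walking_speed=30, face_speed=21, car_speed=40)
vl22 = apply_edit(vl21, "face_speed", 10)
diff = changed_items(vl21, vl22)
```

Keeping the value sets immutable means the previous set (VL21) stays available for comparison after the edit.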

また、それぞれの値の取得方法の一例を以下に示すが、取得方法は以下に限らず他の取得方法であってもよい。例えば、映像生成システム１は、ライティングについては、３ＤＣＧ内のライトの設定値（位置、回転、強さ、色など）を取得する。また、映像生成システム１は、カラーグレーディングについては、コンポジット編集で設定された、ホワイトバランス、色温度、色かぶり補正、彩度、露光量、コントラスト、ハイライト、シャドウ、白レベル、黒レベル、カラー、ＬＵＴ設定などを取得する。また、映像生成システム１は、歩行スピードについては、対象の3Dモデルの腰のボーンの位置移動速度を取得する。また、映像生成システム１は、顔を動かすスピードについては、頭のボーンの回転速度を取得する。また、映像生成システム１は、位置については、３Ｄモデルの位置を取得する。 An example of a method for acquiring each value is shown below, but the acquisition method is not limited to the following and may be another method. For example, for lighting, the image generation system 1 acquires the setting values (position, rotation, strength, color, etc.) of the light in the 3DCG. For color grading, the image generation system 1 acquires the white balance, color temperature, color cast correction, saturation, exposure, contrast, highlights, shadows, white level, black level, color, LUT settings, etc. set in composite editing. For walking speed, the image generation system 1 acquires the position movement speed of the hip bone of the target 3D model. For face movement speed, the image generation system 1 acquires the rotation speed of the head bone. For position, the image generation system 1 acquires the position of the 3D model.
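
As an illustration of the acquisition methods above, the following sketch derives a walking speed from per-frame hip bone positions and a face movement speed from per-frame head bone angles. The function names, the input layout (one sample per frame), and the use of a single rotation angle in degrees are assumptions made for the example.

```python
import math


def hip_movement_speed(positions, fps):
    """Average speed (scene units per second) of the hip bone.

    positions: list of (x, y, z) tuples, one per frame.
    """
    if len(positions) < 2:
        return 0.0
    distance = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    return distance * fps / (len(positions) - 1)


def head_rotation_speed(angles_deg, fps):
    """Average rotation speed (degrees per second) of the head bone.

    angles_deg: one rotation angle per frame around a single axis.
    """
    if len(angles_deg) < 2:
        return 0.0
    rotation = sum(abs(b - a) for a, b in zip(angles_deg, angles_deg[1:]))
    return rotation * fps / (len(angles_deg) - 1)


walk = hip_movement_speed([(0, 0, 0), (1, 0, 0), (2, 0, 0)], fps=24)
turn = head_rotation_speed([0, 5, 10], fps=30)
```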

＜１－４－１０．入力途中でのレンダリング処理例＞ <1-4-10. Example of rendering process during input>
従来技術では、レンダリング時間を待つのがユーザにとって大変（ユーザビリティが低い）という課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、以下に示すような処理により、レンダリングに関するユーザビリティを向上させてもよい。 In the conventional technology, waiting for rendering to finish is burdensome for the user (low usability). Thus, the conventional technology has a usability problem, and there is room for improvement in usability. Therefore, the video generation system 1 may improve the usability of rendering through the processing described below.

例えば、映像生成システム１は、テキスト入力途中からレンダリング処理を開始してもよい。映像生成システム１は、図２９に示すように、ユーザの入力途中でレンダリング処理を行ってもよい。図２９は、編集処理の一例を示す図である。具体例には、図２９は、入力途中でのレンダリング処理に基づく編集処理の一例を示す図である。 For example, the video production system 1 may start the rendering process in the middle of text input. As shown in FIG. 29, the video production system 1 may perform the rendering process in the middle of user input. FIG. 29 is a diagram showing an example of the editing process. As a specific example, FIG. 29 is a diagram showing an example of the editing process based on the rendering process in the middle of input.

図２９では、映像生成システム１は、ユーザが「手で頭を掻いて」というテキストＴＸ３１の入力に応じて、レンダリング処理等を実行し映像ＭＶ３１を表示する。例えば、テキストＴＸ３１中の末尾の「｜」はユーザが修正指示を入力途中であることを示す。映像生成システム１は、文章として理解できるようになったら一旦バックグラウンドで処理を開始する。例えば、映像生成システム１は、「手で頭を掻いて」までを入力した時点で、バックグラウンドでレンダリング処理等を開始する。このように、映像生成システム１は、ユーザがテキストを入力している途中でレンダリング処理を開始してもよい。 In FIG. 29, the video generation system 1 executes rendering processing and the like in response to the user's input of text TX31, "scratching his head with his hand," and displays video MV31. For example, the "|" at the end of text TX31 indicates that the user is in the middle of inputting a correction instruction. Once the video generation system 1 is able to understand the text as a sentence, it starts processing in the background. For example, the video generation system 1 starts rendering processing and the like in the background once the user has input up to "scratching his head with his hand." In this way, the video generation system 1 may start rendering processing while the user is in the middle of inputting text.

図２９では、ユーザは、テキストＴＸ３１から「手をおろして」というテキストＴＸ３２に文章を変更する。例えば、テキストＴＸ３２中の末尾の「｜」はユーザが修正指示を入力途中であることを示す。映像生成システム１は、テキストＴＸ３１からテキストＴＸ３２の変更に応じて、処理を実行する。例えば、映像生成システム１は、文章が変更されたら、その時点で行っている処理を止めて再度処理をやり直す。例えば、映像生成システム１は、テキストＴＸ３１を基に行っていた処理を終了し、テキストＴＸ３２を基に、レンダリング処理等を実行し映像ＭＶ３２を表示する。 In FIG. 29, the user changes the sentence from text TX31 to text TX32, which reads "Put your hands down." For example, the "|" at the end of text TX32 indicates that the user is in the middle of inputting a correction instruction. The video production system 1 executes processing in response to the change from text TX31 to text TX32. For example, when the sentence is changed, the video production system 1 stops the processing being performed at that time and starts the processing again. For example, the video production system 1 ends the processing that was being performed based on text TX31, and executes rendering processing, etc. based on text TX32 to display video MV32.

図２９では、ユーザは、テキストＴＸ３２から「手をおろして、自然な感じで」というテキストＴＸ３３に変更して、文章として完成させ、開始ボタン等を押すこと等により処理の実行を指示する。映像生成システム１は、始まっているバックグラウンド処理を終わらせて、テキストＴＸ３３を基にレンダリング処理等が実行された映像ＭＶ３３を表示する。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 In FIG. 29, the user changes text TX32 to text TX33, "Keep your hands down, in a natural way," completes the sentence, and instructs execution of processing by pressing a start button or the like. The video generation system 1 ends the background processing that has begun, and displays video MV33, which has been rendered based on text TX33. Through the above-mentioned processing, the video generation system 1 can improve usability.
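
The behavior described for FIG. 29 (start processing once the text reads as a sentence, and stop and restart whenever the text changes) can be sketched as below. The classes `RenderJob` and `RenderSession` and the predicate `is_sentence` are hypothetical; a real system would run the job in the background and would need some way to decide that the input is understandable, which is reduced here to a word count purely for illustration.

```python
class RenderJob:
    """Stand-in for one background rendering run."""

    def __init__(self, text):
        self.text = text
        self.cancelled = False


class RenderSession:
    def __init__(self, is_sentence):
        self.is_sentence = is_sentence  # hypothetical "understandable as a sentence" check
        self.current = None
        self.history = []

    def on_text_changed(self, text):
        """Cancel the running job and start a new one for the edited text."""
        if self.current is not None and self.current.text == text:
            return self.current          # nothing changed, keep rendering
        if self.current is not None:
            self.current.cancelled = True  # stop the processing in progress
            self.current = None
        if self.is_sentence(text):
            self.current = RenderJob(text)  # start over with the new text
            self.history.append(self.current)
        return self.current


session = RenderSession(is_sentence=lambda t: len(t.split()) >= 3)
session.on_text_changed("scratch your head")                 # like TX31
session.on_text_changed("put your hands down")               # like TX32
final = session.on_text_changed("put your hands down, naturally")  # like TX33
```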

＜１－４－１１．編集時の確認作業例＞ <1-4-11. Example of confirmation work during editing>
従来技術では、編集時の確認作業についてはユーザビリティが低いなどの課題があり改善の余地があった。そこで、映像生成システム１は、図３０に示すように、編集時の確認作業を行ってもよい。図３０は、編集処理の一例を示す図である。具体例には、図３０は、編集時の確認作業の一例を示す図である。 In the conventional technology, confirmation work during editing suffers from problems such as low usability, and there is room for improvement. Therefore, the video generation system 1 may handle confirmation work during editing as shown in FIG. 30. FIG. 30 is a diagram showing an example of the editing process. As a specific example, FIG. 30 is a diagram showing an example of confirmation work during editing.

図３０中の確認作業ＣＰ１は、モーションやカメラワークなどの時間軸方向の変化を見る必要がある場合の確認作業の一例を示す。例えば、確認作業ＣＰ１では、ユーザは理想に近い動画を選ぶことを繰り返す。また、図３０中の確認作業ＣＰ２は、時間軸方向の変化を見る必要がある場合以外の確認作業の一例を示す。例えば、確認作業ＣＰ２では、ユーザは理想に近い画像を選ぶことを繰り返す。 Checking task CP1 in FIG. 30 shows an example of a checking task when it is necessary to check changes in the time axis direction, such as motion or camera work. For example, in checking task CP1, the user repeatedly selects a video that is close to ideal. Checking task CP2 in FIG. 30 shows an example of a checking task other than when it is necessary to check changes in the time axis direction. For example, in checking task CP2, the user repeatedly selects an image that is close to ideal.

例えば、映像生成システム１は、全てのカットの画像をレンダリングした後、編集したいものから動画のレンダリングを実施してもよい。例えば、映像生成システム１は、生成した動画をユーザに提示し、入力情報の編集を行うかどうかをユーザに判断させてもよい。例えば、確認作業ＣＰ１、ＣＰ２等に示すように、ユーザの入力情報に対してバリエーションをもたせた数パターンの動画の生成結果をユーザに提示する場合、複数の動画をレンダリングするための待ち時間が生じるという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。 For example, the video production system 1 may render images of all cuts and then render a video starting with the one to be edited. For example, the video production system 1 may present the generated video to the user and allow the user to decide whether or not to edit the input information. For example, as shown in confirmation tasks CP1, CP2, etc., when presenting the user with the generated results of several patterns of video with variations in response to the user's input information, there is an issue of waiting time for rendering multiple videos. As such, the conventional technology has issues with usability and there is room for improvement in usability.

そこで、映像生成システム１は、その待ち時間を低減するために、時間軸方向の変化を見る必要がある確認作業以外の作業（確認作業ＣＰ２に対応）については、動画中の１フレームないしは数フレームのみをレンダリングし、画像としてユーザに候補を提示する。これにより、映像生成システム１は、レンダリング待ち時間を低減することができる。 Therefore, in order to reduce this waiting time, for tasks other than the confirmation task that requires viewing changes along the time axis (corresponding to confirmation task CP2), the video generation system 1 renders only one or a few frames from the video and presents candidates to the user as images. This enables the video generation system 1 to reduce the rendering waiting time.

また、映像生成システム１は、ユーザがモーションやカメラワークの編集作業を行う場合は動画による候補提示を行う。例えば、レンダリングに用いる動画中の数フレームを選択する方法として、単純に動画の先頭と最後の２フレームのみをレンダリングする方法、ＵＳＤファイル内のアニメーションの変化量が大きいフレームを数フレームレンダリングする方法、ＡＩに全てのフレームの中でハイライトとして表示すべきフレームを選択させる方法などが挙げられる。例えば、映像生成システム１では、複数枚画像をレンダリングした場合はユーザがマウスカーソルをホバーすることでパラパラ漫画のような形で生成結果を確認することができる。 Furthermore, when the user edits motion or camerawork, the video generation system 1 presents candidates as videos. Methods for selecting the few frames of a video to render include simply rendering only two frames, namely the first and the last frame of the video; rendering a few frames at which the amount of change in the animation in the USD file is large; and having an AI select, from among all the frames, the frames that should be shown as highlights. When multiple images have been rendered, the user can check the generated results in a flip-book-like manner by hovering the mouse cursor over them.
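
Two of the frame selection methods mentioned above can be sketched as follows. The function names are hypothetical, and the "amount of change in the animation" is reduced to one scalar sample per frame for illustration; how such samples would be read from a USD file is outside this sketch.

```python
def first_and_last(frame_count):
    """Indices of the first and the last frame (a single index for a 1-frame clip)."""
    if frame_count <= 0:
        return []
    return sorted({0, frame_count - 1})


def largest_change_frames(samples, k):
    """Indices of the k frames whose value changed most from the previous frame."""
    deltas = [
        (abs(b - a), i + 1)
        for i, (a, b) in enumerate(zip(samples, samples[1:]))
    ]
    # Largest change first; earlier frame wins ties.
    top = sorted(deltas, key=lambda d: (-d[0], d[1]))[:k]
    return sorted(index for _, index in top)


ends = first_and_last(6)
peaks = largest_change_frames([0, 0, 5, 5, 6, 20], k=2)
```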

また、映像生成システム１は、動画生成については、候補選択用の画像生成を終えた段階で、順次、各パターンの動画生成を開始することで動画プレビュー時の待ち時間を軽減することができる。パターンごとの動画生成の順番については、ランダムな順番で生成する方法の他に、ユーザがＵＩ上のボタンを押すことでレンダリング順番を選ぶ方法、マウスカーソルが画像上にホバーされた時間が長い順番にレンダリングする方法などが挙げられる。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 In addition, when generating videos, the video generation system 1 can reduce waiting time during video preview by starting video generation for each pattern in sequence once image generation for candidate selection is completed. As for the order of video generation for each pattern, in addition to a method of generating them in a random order, other methods include a method in which the user selects the rendering order by pressing a button on the UI, and a method in which images are rendered in order of the length of time the mouse cursor is hovered over them. Through the above-mentioned processing, the video generation system 1 can improve usability.
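
The hover-based ordering mentioned above could look like the following sketch, where `hover_seconds` is a hypothetical mapping from a pattern name to the total time the mouse cursor stayed over its preview image.

```python
def render_order(hover_seconds):
    """Patterns ordered by hover time, longest first (ties keep insertion order)."""
    return sorted(hover_seconds, key=lambda pattern: -hover_seconds[pattern])


order = render_order({"A": 0.5, "B": 3.0, "C": 1.2})
```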

ここで、上述した確認作業に関する処理フローの一例について図３１を用いて説明する。図３１は、映像生成システムが実行する処理手順を示すフローチャートである。具体例には、図３１は、編集処理に関する処理手順を示すフローチャートである。 Here, an example of a processing flow related to the above-mentioned confirmation work will be described with reference to FIG. 31. FIG. 31 is a flowchart showing the processing procedure executed by the video production system. As a specific example, FIG. 31 is a flowchart showing the processing procedure related to editing processing.

図３１では、映像生成システム１は、ユーザの設定からＵＳＤファイルを数パターン生成する（ステップＳ２０１）。映像生成システム１は、すべてのパターンのＵＳＤファイルの画像書き出しが終了している場合（ステップＳ２０２：Ｙｅｓ）、ファイルの画像書き出しについての処理（例えばステップＳ２０２～Ｓ２０４）を終了する。 In FIG. 31, the video production system 1 generates several patterns of USD files from the user's settings (step S201). If image writing of all patterns of USD files has been completed (step S202: Yes), the video production system 1 ends the process of writing images of the files (e.g., steps S202 to S204).

映像生成システム１は、すべてのパターンのＵＳＤファイルの画像書き出しが終了していない場合（ステップＳ２０２：Ｎｏ）、書き出しが完了していないＵＳＤファイルの画像を書き出す（ステップＳ２０３）。映像生成システム１は、書き出した画像をＵＩ上に表示する（ステップＳ２０４）。その後、映像生成システム１は、ステップＳ２０５以降の処理を開始するとともに、ファイルの画像書き出しが終了するまでステップＳ２０２～Ｓ２０４の処理を繰り返す。 If image writing of all patterns of USD files has not been completed (step S202: No), the video production system 1 writes out images of the USD files for which writing has not been completed (step S203). The video production system 1 displays the written images on the UI (step S204). After that, the video production system 1 starts the process from step S205 onwards, and repeats the process of steps S202 to S204 until image writing of the files is completed.

映像生成システム１は、すべてのパターンのＵＳＤファイルの動画書き出しが終了している場合（ステップＳ２０５：Ｙｅｓ）、ファイルの動画書き出しについての処理（例えばステップＳ２０５～Ｓ２０７）を終了する。 When video export of all patterns of USD files has been completed (step S205: Yes), the video production system 1 ends the process of exporting the files (e.g., steps S205 to S207).

映像生成システム１は、すべてのパターンのＵＳＤファイルの動画書き出しが終了していない場合（ステップＳ２０５：Ｎｏ）、書き出しが完了していないＵＳＤファイルの動画を書き出す（ステップＳ２０６）。映像生成システム１は、書き出した動画をＵＩ上に表示する（ステップＳ２０７）。その後、映像生成システム１は、ファイルの動画書き出しが終了するまでステップＳ２０５～Ｓ２０７の処理を繰り返す。 If video export of all patterns of USD files has not been completed (step S205: No), the video production system 1 exports videos of USD files for which export has not been completed (step S206). The video production system 1 displays the exported videos on the UI (step S207). After that, the video production system 1 repeats the processes of steps S205 to S207 until video export of the files is completed.
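
The flow of FIG. 31 can be summarized in the sketch below. It is a simplified sequential reading (all images first, then all videos); the description also allows the video loop to start while images are still being written, which is not modeled here. The callbacks `export_image`, `export_video`, and `show` are hypothetical stand-ins for the rendering and UI steps.

```python
def run_confirmation_flow(patterns, export_image, export_video, show):
    """Write out and display an image, then a video, for every USD pattern."""
    pending_images = list(patterns)      # patterns generated in step S201
    while pending_images:                # S202: any image not yet written?
        pattern = pending_images.pop(0)
        image = export_image(pattern)    # S203: write out the image
        show(image)                      # S204: display it on the UI

    pending_videos = list(patterns)
    while pending_videos:                # S205: any video not yet written?
        pattern = pending_videos.pop(0)
        video = export_video(pattern)    # S206: write out the video
        show(video)                      # S207: display it on the UI


shown = []
run_confirmation_flow(
    ["pattern_a.usd", "pattern_b.usd"],
    export_image=lambda p: p.replace(".usd", ".png"),
    export_video=lambda p: p.replace(".usd", ".mp4"),
    show=shown.append,
)
```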

＜１－４－１２．範囲選択に応じた処理例＞ <1-4-12. Examples of processing according to range selection>
従来技術では、映像（動画）生成について経験（知見）が無い人（「素人」ともいう）が動画を作ろうとした場合、何が良くて何が悪いのかを判断することが難しく、判断を誤る場合も多いという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図３２に示すように、生成された動画について評価を行ってもよい。図３２は、範囲選択に応じた評価処理の一例を示す図である。 In the conventional technology, when a person (also called an "amateur") who has no experience (knowledge) in video (moving image) generation tries to create a video, it is difficult for the person to judge what is good and what is bad, and the person often makes an erroneous judgment. Thus, the conventional technology has a problem regarding usability, and there is room for improvement in usability. Therefore, the video generation system 1 may evaluate the generated video, as shown in FIG. 32. FIG. 32 is a diagram showing an example of evaluation processing according to range selection.

図３２では、映像生成システム１は、範囲選択に応じた評価処理を行う。図３２中の映像ＭＶ４０は、ユーザ入力に応じて生成された動画を示す。図３２中の映像ＭＶ４１は、ユーザが評価してほしい範囲として人の部分を指定し、その範囲に対する映像生成システム１による評価を示すテキストＴＸ４１が重畳表示された状態を示す。図３２では、映像生成システム１は、ユーザが指定した人の部分に対して、「この人の顔の表情を見せたほうがユーザーに感情が伝わりやすい」というテキストＴＸ４１が示す評価（修正案の提示）を行う。 In FIG. 32, the video production system 1 performs evaluation processing in response to the range selection. Video MV40 in FIG. 32 shows a moving image generated in response to user input. Video MV41 in FIG. 32 shows a state in which the user specifies a part of a person as the range they want evaluated, and text TX41 indicating the evaluation by the video production system 1 of that range is superimposed. In FIG. 32, the video production system 1 makes an evaluation (suggests a correction) of the part of the person specified by the user, as indicated by the text TX41, "it would be easier to convey emotions to the user if you showed this person's facial expression."

図３２中の映像ＭＶ４２は、映像生成システム１による評価を示すテキストＴＸ４２がさらに重畳表示された状態を示す。図３２では、映像生成システム１は、ユーザが指定した人の部分に対して、「演出の観点だと顔を真正面から捉えた方が良い」というテキストＴＸ４２が示す評価を行う。 Video MV42 in FIG. 32 shows a state in which text TX42 indicating an evaluation by video production system 1 is further superimposed. In FIG. 32, video production system 1 performs an evaluation on the part of the person specified by the user, as indicated by text TX42, saying, "From a production perspective, it would be better to capture the face from directly in front."

例えば、テキストＴＸ４２が示す評価をユーザが良いと思い、映像生成システム１は、その評価に基づく修正指示のユーザから受け付ける。そして、映像生成システム１は、テキストＴＸ４２が示す評価に対応する複数の候補動画ＭＶ４３、ＭＶ４４、ＭＶ４５を生成し、表示する。これにより、映像生成システム１は、テキストＴＸ４２が示す評価に対応する複数の候補動画ＭＶ４３、ＭＶ４４、ＭＶ４５をユーザに提示する。 For example, if the user thinks that the evaluation indicated by text TX42 is good, video production system 1 accepts from the user an instruction for correction based on that evaluation. Then, video production system 1 generates and displays multiple candidate videos MV43, MV44, and MV45 that correspond to the evaluation indicated by text TX42. In this way, video production system 1 presents to the user multiple candidate videos MV43, MV44, and MV45 that correspond to the evaluation indicated by text TX42.

例えば、ユーザはマウスで範囲を指定し、映像生成システム１は、指定された範囲に対する評価を行いユーザとの対話（議論）を開始する。映像生成システム１は、マウスでの範囲選択に応じて、その範囲を対象としてＡＩによる評価を開始する。例えば、映像生成システム１は、ユーザが修正を指示するまでの間、Ｎ秒に１回の評価（修正案の提示）を行う。ユーザが良いと思ったところで、マウスでクリックすることにより、映像生成システム１は、それまでの対話（議論）を踏まえた修正候補を複数提示する。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 For example, a user specifies a range with a mouse, and video generation system 1 evaluates the specified range and starts a dialogue (discussion) with the user. In response to the range selected with the mouse, video generation system 1 starts an AI evaluation of that range. For example, video generation system 1 performs an evaluation (presentation of correction suggestions) once every N seconds until the user instructs correction. When the user clicks with the mouse when they think it is good, video generation system 1 presents multiple correction suggestions based on the dialogue (discussion) up to that point. Through the above-mentioned processing, video generation system 1 can improve usability.

＜１－４－１３．被写界深度に応じた処理例＞ <1-4-13. Example of processing according to depth of field>
従来技術では、環境アセットのポリゴン数が高かったり、テクスチャの解像度が高かったりしてアセットが重い場合、レンダリングに要する時間の増大を抑制することが難しいという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図３３に示すように、被写界深度に応じた処理を行ってもよい。図３３は、被写界深度に応じた処理の一例を示す概念図である。 In the conventional technology, when the number of polygons in the environmental asset is high or the texture resolution is high, and the asset is heavy, it is difficult to suppress the increase in the time required for rendering. Thus, the conventional technology has a problem regarding usability, and there is room for improvement in usability. Therefore, the video generation system 1 may perform processing according to the depth of field as shown in FIG. 33. FIG. 33 is a conceptual diagram showing an example of processing according to the depth of field.

図３３では、映像生成システム１は、被写界深度に応じて、被写体の前後所定の範囲内については高ポリゴン数、高解像度テクスチャにし、被写体の前後所定の範囲外については低ポリゴン数、低解像度テクスチャにする。図３３では、高ポリゴン数、高解像度テクスチャの対象物（被写体に近い丸）を濃いハッチングで示し、低ポリゴン数、低解像度テクスチャの対象物（被写体から遠い丸）を薄いハッチングで示す。このように、映像生成システム１は、被写界深度に応じて、ポリゴン数やテクスチャ解像度を変更する。映像生成システム１は、以下のような式（１）～（３）により、被写界深度を算出する。 In FIG. 33, depending on the depth of field, the video generation system 1 uses a high polygon count and high-resolution textures within a predetermined range in front of and behind the subject, and a low polygon count and low-resolution textures outside that range. In FIG. 33, objects with a high polygon count and high-resolution textures (circles close to the subject) are shown with dark hatching, and objects with a low polygon count and low-resolution textures (circles far from the subject) are shown with light hatching. In this way, the video generation system 1 changes the number of polygons and the texture resolution depending on the depth of field. The video generation system 1 calculates the depth of field using equations (1) to (3) as follows.

式（１）は、前方被写界深度（ｍｍ）を算出する関数である。映像生成システム１は、式（１）を用いて、前方被写界深度を算出する。例えば、図３３では、前方被写界深度は、被写体の前方側（カメラに近づく側）に対応する。また、式（２）は、後方被写界深度（ｍｍ）を算出する関数である。映像生成システム１は、式（２）を用いて、後方被写界深度を算出する。例えば、図３３では、後方被写界深度は、被写体の後方側（カメラから離れる側）に対応する。式（３）は、被写界深度を算出する関数である。映像生成システム１は、式（３）を用いて、前方被写界深度と後方被写界深度とを足し合わせることにより、被写界深度を算出する。 Equation (1) is a function for calculating the front depth of field (mm). The video generation system 1 calculates the front depth of field using equation (1). For example, in FIG. 33, the front depth of field corresponds to the front side of the subject (the side closer to the camera). Equation (2) is a function for calculating the rear depth of field (mm). The video generation system 1 calculates the rear depth of field using equation (2). For example, in FIG. 33, the rear depth of field corresponds to the rear side of the subject (the side farther from the camera). Equation (3) is a function for calculating the depth of field. The video generation system 1 calculates the depth of field by adding together the front depth of field and the rear depth of field using equation (3).
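
The equations themselves are not reproduced in this text. For reference only, a commonly used approximation for the front and rear depth of field is shown below; it may differ from the equations of the original drawings. The symbols are assumptions introduced here: $\delta$ is the permissible circle of confusion (mm), $F$ is the F-number, $L$ is the subject distance (mm), and $f$ is the focal length (mm). Only the last relation, the sum of the front and rear depths, is stated explicitly in the description.

```latex
\begin{align}
D_{\mathrm{front}} &= \frac{\delta F L^{2}}{f^{2} + \delta F L} \\
D_{\mathrm{rear}}  &= \frac{\delta F L^{2}}{f^{2} - \delta F L} \\
D &= D_{\mathrm{front}} + D_{\mathrm{rear}}
\end{align}
```

In this approximation the rear depth of field grows without bound as $\delta F L$ approaches $f^{2}$, which corresponds to focusing at or beyond the hyperfocal distance.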

例えば、映像生成システム１は、前方被写界深度よりカメラに近い部分、後方被写界深度よりカメラから遠い部分に関して、レンダリング前にポリゴン数やテクスチャ解像度を落としたものに差し替えておく。例えば、映像生成システム１は、被写界深度的にボケを作る場合、ポリゴン数やテクスチャ解像度を小さく（低く）する。このように、映像生成システム１は、焦点距離、Ｆ値、被写体距離等の基準に、ポリゴン数やテクスチャ解像度を決定する。例えば、映像生成システム１は、上記の決定をＵＳＤファイル生成時に行ってもよい。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 For example, the video generation system 1 replaces parts closer to the camera than the forward depth of field and parts farther from the camera than the rear depth of field with parts with a lower number of polygons and texture resolution before rendering. For example, when creating a blur in terms of depth of field, the video generation system 1 reduces (lowers) the number of polygons and texture resolution. In this way, the video generation system 1 determines the number of polygons and texture resolution based on criteria such as focal length, F-number, and subject distance. For example, the video generation system 1 may make the above determination when generating a USD file. Through the above-mentioned processing, the video generation system 1 can improve usability.
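
The selection of polygon count and texture resolution from the depth of field can be sketched as follows. The depth-of-field formula is the common approximation based on the permissible circle of confusion, which is an assumption here and not necessarily equations (1) to (3) of the description. The function names, the default circle of confusion of 0.03 mm, and the two-level "high"/"low" result are likewise hypothetical.

```python
def depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Return (front, rear) depth of field in mm using a common approximation."""
    k = coc_mm * f_number * subject_mm
    front = k * subject_mm / (focal_mm ** 2 + k)
    denominator = focal_mm ** 2 - k
    # At or beyond the hyperfocal distance the rear depth is unbounded.
    rear = k * subject_mm / denominator if denominator > 0 else float("inf")
    return front, rear


def choose_detail(object_mm, focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Pick the asset detail level for an object at object_mm from the camera."""
    front, rear = depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm)
    near_limit = subject_mm - front
    far_limit = subject_mm + rear
    # Inside the in-focus range keep full detail; outside it will be blurred.
    return "high" if near_limit <= object_mm <= far_limit else "low"


front, rear = depth_of_field_mm(focal_mm=50, f_number=2.8, subject_mm=3000)
levels = [choose_detail(d, 50, 2.8, 3000) for d in (1000, 3000, 3200, 10000)]
```

Because the decision depends only on camera parameters and object distance, it can be made before rendering, for example when the USD file is generated.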

＜１－４－１４．確認作業時の再生処理例＞ <1-4-14. Example of playback process during confirmation work>
従来技術では、生成された動画を全て再生して確認する場合、確認作業に要する時間の増大を抑制することが難しいという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、確認作業時に適した再生を行ってもよい。映像生成システム１は、ユーザの操作に応じて再生の態様を変更してもよい。例えば、映像生成システム１は、ユーザの操作に応じて、パラパラ漫画のような態様で動画を再生してもよい。 In the conventional technology, when all the generated videos are played back for review, it is difficult to suppress an increase in the time required for the reviewing work. Thus, the conventional technology has a problem regarding usability, and there is room for improvement in usability. Therefore, the video generation system 1 may perform playback suitable for the reviewing work. The video generation system 1 may change the mode of playback in response to a user's operation. For example, the video generation system 1 may play back the video in a mode like a flip book in response to a user's operation.

映像生成システム１は、ユーザによるクリックやマウスホイール回転ごとに所定の秒数（例えば０．５秒）だけ動画の再生を進めてもよい。例えば、映像生成システム１は、マウスやマウスホイールの動きや位置に応じて、カット内のコマ送りをする。映像生成システム１は、図３４に示すように、マウスの位置に対応する秒数だけ動画の再生を進めてもよい。 The video production system 1 may advance the playback of the video by a predetermined number of seconds (e.g., 0.5 seconds) for each click or rotation of the mouse wheel by the user. For example, the video production system 1 advances frames within a cut according to the movement or position of the mouse or mouse wheel. The video production system 1 may advance the playback of the video by the number of seconds corresponding to the position of the mouse, as shown in FIG. 34.

図３４では、映像生成システム１は、コンテンツＣＴ４１をユーザに提供し、ユーザによる動画を進める度合い（秒数、フレーム数等）の指示を受け付けてもよい。コンテンツＣＴ４１は、ユーザによるユーザの動画を進める度合いを受け付けるための表示画面（コンテンツ）である。コンテンツＣＴ４１は、動画に重畳させて、動画を進める秒数を指定するための情報を配置する。図３４では、０～３．５秒の間で指定可能であり、左から右に行くほど動画を進める秒数が大きくなる場合を示す。例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ４１を表示する。 In FIG. 34, the video generation system 1 may provide the user with content CT41 and accept the user's instruction on how far to advance the video (number of seconds, number of frames, etc.). Content CT41 is a display screen (content) for accepting from the user the amount by which the video is to be advanced. Content CT41 is superimposed on the video and lays out the information used to specify the number of seconds to advance. FIG. 34 shows a case in which a value between 0 and 3.5 seconds can be specified, with the number of seconds increasing from left to right. For example, the client UI display unit 400 displays content CT41.

ユーザは、コンテンツＣＴ４１を介して、動画を再生する際に動画を進める秒数を指定する情報を入力する。図３４では、ユーザはマウスを操作して、マウスカーソルＭＳを１．０と表示された領域に位置させることに動画を進める秒数を１．０秒に指定する。この場合、映像生成システム１は、ユーザからの動画を進める秒数の指定に応じて、動画を１．０秒の間隔で再生を進める。なお、ユーザはマウスを操作して、マウスカーソルＭＳを１．０と表示された領域に位置させ、クリックすること等により、動画を進める秒数を１．０秒に指定してもよい。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 Through content CT41, the user inputs information specifying the number of seconds to advance the video when playing the video. In FIG. 34, the user operates the mouse to position the mouse cursor MS in the area marked 1.0 to specify 1.0 seconds as the number of seconds to advance the video. In this case, the video production system 1 advances the playback of the video at 1.0 second intervals in response to the user's specification of the number of seconds to advance the video. Note that the user may operate the mouse to position the mouse cursor MS in the area marked 1.0 and click, etc. to specify 1.0 seconds as the number of seconds to advance the video. Through the above-described processing, the video production system 1 can improve usability.
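
The stepping behavior of FIG. 34 can be sketched as follows. The 0 to 3.5 second range and the left-to-right ordering come from the description; the 0.5 second spacing between selectable values, the equal-width regions, and the function names are assumptions for illustration.

```python
def step_from_cursor(x, width, max_step=3.5, increment=0.5):
    """Map a horizontal cursor position to a step size in seconds.

    The overlay is split into equal regions labelled 0.0, 0.5, ..., max_step,
    with larger values further to the right.
    """
    slots = int(max_step / increment) + 1
    index = min(int(x / width * slots), slots - 1)
    return index * increment


def step_positions(duration_s, step_s):
    """Playback positions visited when advancing step_s seconds per operation."""
    if step_s <= 0:
        raise ValueError("step must be positive")
    count = int(duration_s / step_s + 1e-9)
    return [round(i * step_s, 6) for i in range(count + 1)]


step = step_from_cursor(x=250, width=800)   # cursor inside the "1.0" region
positions = step_positions(3.0, step)
```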

＜１－４－１５．強調表示例＞ <1-4-15. Highlighting example>
従来技術では、動画のうち編集等により変更された生成された部分が分かりづらい場合があり、確認作業に要する時間の増大を抑制することが難しいという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図３５に示すように、強調表示を行ってもよい。図３５は、変更部分の強調表示の一例を示す図である。例えば、映像生成システム１は、前回の生成結果と変化された部分を強調表示する。図３５では映像中の女性が変更された場合を一例として説明する。 In the conventional technology, it may be difficult to tell which parts of a generated video have been changed by editing or the like, and it is difficult to suppress an increase in the time required for confirmation work. Thus, the conventional technology has a problem regarding usability, and there is room for improvement in usability. Therefore, the video generation system 1 may perform highlighting as shown in FIG. 35. FIG. 35 is a diagram showing an example of highlighting a changed portion. For example, the video generation system 1 highlights the part that has changed from the previous generation result. In FIG. 35, a case where the woman in the video has been changed is described as an example.

図３５中の映像ＭＶ５１は、第１の強調表示態様を示す。映像ＭＶ５１は、変更された箇所以外を暗くする（明度を下げる等）ことにより、変更された箇所を強調表示する態様を示す。映像生成システム１は、映像ＭＶ５１を生成し、映像ＭＶ５１を表示することにより、編集等により変更された部分を強調表示する。 Video MV51 in FIG. 35 shows a first highlighting mode. Video MV51 shows a mode in which the changed parts are highlighted by darkening parts other than the changed parts (reducing brightness, etc.). Video production system 1 generates video MV51 and displays video MV51 to highlight the parts that have been changed by editing, etc.

また、図３５中の映像ＭＶ５２は、第２の強調表示態様を示す。映像ＭＶ５２は、変更された箇所をハイライトする（色を付ける等）ことにより、変更された箇所を強調表示する態様を示す。映像生成システム１は、映像ＭＶ５２を生成し、映像ＭＶ５２を表示することにより、編集等により変更された部分を強調表示する。このように、映像ＭＶ５１、ＭＶ５２は、人の部分が変更された場合にその部分を強調表示する場合を示す。 Video MV52 in FIG. 35 shows a second highlighting mode. Video MV52 shows a mode in which the changed parts are highlighted (by adding color, etc.). Video production system 1 generates video MV52 and displays video MV52 to highlight the parts that have been changed by editing, etc. In this way, videos MV51 and MV52 show a case in which parts of people are highlighted when they have been changed.

図３５中の映像ＭＶ５３は、第３の強調表示態様を示す。映像ＭＶ５３は、変更があった時間をシークバー上に示すことにより、変更された箇所を強調表示する態様を示す。映像生成システム１は、変更があった時間に対応する位置に色付けした点ＨＬ５３１及び点ＨＬ５３２が配置されたシークバーを含む映像ＭＶ５３を生成し、映像ＭＶ５３を表示することにより、編集等により変更された部分を強調表示する。 Video MV53 in FIG. 35 shows a third highlighting mode. Video MV53 shows a mode in which the changed portion is highlighted by indicating the time when the change occurred on the seek bar. Video production system 1 generates video MV53 including a seek bar on which colored points HL531 and HL532 are positioned corresponding to the time when the change occurred, and displays video MV53 to highlight the portion that has been changed by editing, etc.

図３５中の映像ＭＶ５４は、第４の強調表示態様を示す。映像ＭＶ５４は、変更があった時間をシークバー上に示すことにより、変更された箇所を強調表示する態様を示す。映像生成システム１は、変更があった時間帯に対応する範囲に位置に色付けしたバーＨＬ５４が配置されたシークバーを含む映像ＭＶ５４を生成し、映像ＭＶ５４を表示することにより、編集等により変更された部分を強調表示する。 Video MV54 in FIG. 35 shows a fourth highlighting mode. Video MV54 shows a mode in which the changed portion is highlighted by indicating the time when the change occurred on the seek bar. Video production system 1 generates video MV54 including a seek bar with a colored bar HL54 positioned in a range corresponding to the time period when the change occurred, and displays video MV54 to highlight the portion that has been changed by editing, etc.

上述したように、ユーザが動画生成に関わる設定を行い動画の再生成を行った際には、生成された動画の確認作業が必要になる。複数の生成結果を提示するようなＵＩでは、それぞれの動画をひとつずつ確認する動作のユーザ負担が大きい。そこで、動画の確認作業の負担を軽減するために、映像生成システム１は、前回の動画生成の結果との差分をユーザに提示する。例えば、映像生成システム１は、動画中で変更のあった箇所のみを強調表示したり、動画中で変更のあった時間をシークバー上に表示したりしてユーザに提示する。これにより、ユーザは変更があった箇所のみを確認できるようになり、映像生成システム１は、確認作業の負担を軽減させることができる。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 As described above, when a user configures settings related to video generation and regenerates a video, the generated video needs to be checked. In a UI that presents multiple generation results, the user has to check each video one by one, which places a heavy burden on the user. Therefore, in order to reduce the burden of checking the videos, the video generation system 1 presents the user with the differences from the results of the previous video generation. For example, the video generation system 1 presents to the user only the parts of the video that have been changed by highlighting them or by displaying the time when the change occurred in the video on a seek bar. This allows the user to check only the parts that have been changed, and the video generation system 1 can reduce the burden of the checking work. Through the above-mentioned processing, the video generation system 1 can improve usability.

例えば、動画中で変化した箇所を取り出す方法として、生成したＵＳＤファイルから差分を検出する方法、レンダリング済みの動画と過去にレンダリングした動画を１フレームごとに比較して差分を検出する方法等が挙げられる。例えば、生成したＵＳＤファイルから差分を検出する方法では、ＵＳＤファイルを都度生成して動画レンダリングを行っている特性を活かし、映像生成システム１は、ユーザが動画生成に関する設定を更新する前後で生成されたＵＳＤファイルの内容を比較し、変化があったオブジェクトや変化があった時間を検出する。また、例えば、レンダリング済みの動画と過去にレンダリングした動画を１フレームごとに比較して差分を検出する方法では、映像生成システム１は、ユーザが動画生成に関する設定を更新する前後で生成された動画を１フレームごとに比較して、変化があったピクセルおよび変化があった時間を検出する。 For example, methods for extracting changed parts of a video include detecting differences from a generated USD file, and detecting differences by comparing a rendered video with a previously rendered video frame by frame. For example, in a method for detecting differences from a generated USD file, taking advantage of the characteristic of generating USD files each time and performing video rendering, the video generation system 1 compares the contents of the USD files generated before and after the user updates the video generation settings, and detects objects that have changed and the time at which the changes occurred. Also, in a method for detecting differences by comparing a rendered video with a previously rendered video frame by frame, the video generation system 1 compares the videos generated before and after the user updates the video generation settings, and detects changed pixels and the time at which the changes occurred.
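
The frame-by-frame comparison can be sketched as follows. Frames are represented here as nested lists of pixel values so the example stays self-contained; the function name and this data layout are assumptions, and a real system would more likely operate on decoded image arrays.

```python
def diff_videos(before, after, fps):
    """Return [(time_in_seconds, [(x, y), ...])] for every frame that changed.

    before / after: lists of frames; each frame is a list of rows of pixel values.
    """
    changes = []
    for index, (frame_a, frame_b) in enumerate(zip(before, after)):
        pixels = [
            (x, y)
            for y, (row_a, row_b) in enumerate(zip(frame_a, frame_b))
            for x, (pa, pb) in enumerate(zip(row_a, row_b))
            if pa != pb
        ]
        if pixels:
            changes.append((index / fps, pixels))
    return changes


blank = [[0, 0], [0, 0]]
previous = [blank, blank, blank]
current = [blank, [[0, 9], [0, 0]], blank]   # one pixel differs in frame 1
changed = diff_videos(previous, current, fps=2)
```

The changed pixel coordinates could drive the first and second highlighting modes, and the changed times could drive the seek bar markers of the third and fourth modes.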

＜１－４－１６．カット間の関係提示例＞ <1-4-16. Example of showing the relationship between cuts>
従来技術では、カットごとに修正した場合、前後カットとの関係が分からなくなる場合があるという課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図３６に示すように、カット間の関係の提示を行ってもよい。図３６は、カット間の関係の提示の一例を示す図である。 In the conventional technology, there is a problem that when each cut is corrected, the relationship between the previous and next cuts may become unclear. Thus, the conventional technology has a problem with usability, and there is room for improvement in usability. Therefore, the video generation system 1 may present the relationship between cuts as shown in FIG. 36. FIG. 36 is a diagram showing an example of the presentation of the relationship between cuts.

図３６では、カットＣＵ６１が修正されたカット（「対象カット」ともいう）である場合を示す。対象カットであるカットＣＵ６１よりも前のカットＣＵ６０には、そのカットが対象カットよりも前のカットであることを示す前関係バーＴＲ６０が重畳表示される。例えば、前関係バーＴＲ６０は、右側を底辺として左側に延びる三角形である。前関係バーＴＲ６０は、対象カットとの時間が離れているほど左側に長く伸びる態様で表示される。 Figure 36 shows a case where cut CU61 is a modified cut (also called the "target cut"). A previous relationship bar TR60 is superimposed on cut CU60 that precedes cut CU61, the target cut, to indicate that the cut precedes the target cut. For example, the previous relationship bar TR60 is a triangle with its base on the right and extending to the left. The previous relationship bar TR60 is displayed in such a way that it extends further to the left the further away it is from the target cut.

また、対象カットであるカットＣＵ６１よりも後のカットＣＵ６２には、そのカットが対象カットよりも後のカットであることを示す後関係バーＴＲ６２が重畳表示される。例えば、後関係バーＴＲ６２は、左側を底辺として右側に延びる三角形である。対象カットであるカットＣＵ６１よりも後のカットＣＵ６２のさらに後のカットＣＵ６３には、そのカットが対象カットよりも後のカットであることを示す後関係バーＴＲ６３が重畳表示される。例えば、後関係バーＴＲ６３は、左側を底辺として右側に延びる三角形である。 In addition, a subsequent relationship bar TR62 is superimposed on cut CU62 that is after cut CU61, the target cut, indicating that the cut is a cut that comes after the target cut. For example, the subsequent relationship bar TR62 is a triangle that extends to the right with its base on the left side. A subsequent relationship bar TR63 is superimposed on cut CU63 that is after cut CU62 that is after cut CU61, the target cut, indicating that the cut is a cut that comes after the target cut. For example, the subsequent relationship bar TR63 is a triangle that extends to the right with its base on the left side.

後関係バーＴＲ６２、ＴＲ６３は、対象カットとの時間が離れているほど右側に長く延びる態様で表示される。図３６では、カットＣＵ６２よりもカットＣＵ６３の方がさらに後であるため、後関係バーＴＲ６２よりも後関係バーＴＲ６３の方が、右側へ長く延びる態様で表示される。なお、三角形での表示態様は表示態様の一例に過ぎず、時間の前後関係及びその量が提示可能であれば、任意の表示態様が採用可能である。 The subsequent relationship bars TR62 and TR63 are displayed so that they extend further to the right the further the cut is in time from the target cut. In FIG. 36, cut CU63 is later than cut CU62, so the subsequent relationship bar TR63 extends further to the right than the subsequent relationship bar TR62. Note that the triangular shape is only one example of a display mode, and any display mode can be used as long as it can present whether a cut is before or after the target cut and by how much.
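As a rough illustration of how the direction and length of such a bar could be derived from the playback position, the following sketch maps the number of seconds between the current playback time and the target cut to a bar length. The function name `relation_bar`, the linear scaling, and the `max_offset` and `max_length` parameters are assumptions made for this example; the text only requires that the direction and the amount of separation be presentable.

```python
from typing import Tuple


def relation_bar(playback_time: float, target_start: float, target_end: float,
                 max_offset: float, max_length: float = 100.0) -> Tuple[str, float]:
    """Return (direction, length) of the relationship bar for the current time.

    direction is "left" before the target cut, "right" after it, and "none"
    while the target cut itself is playing. The length grows linearly with
    the distance in seconds from the target cut and is clamped to max_length.
    """
    if playback_time < target_start:
        direction, offset = "left", target_start - playback_time
    elif playback_time > target_end:
        direction, offset = "right", playback_time - target_end
    else:
        return ("none", 0.0)
    length = max_length * min(offset, max_offset) / max_offset
    return (direction, length)
```

For a target cut spanning 10 s to 15 s, a playback position of 2 s yields a bar extending to the left, and a position of 20 s yields a shorter bar extending to the right.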

映像生成システム１は、ユーザの操作に応じて、カットＣＵ６０～ＣＵ６３を含む動画を再生する。例えば、映像生成システム１は、ユーザの操作に応じて、カットＣＵ６０を表示する際は、前関係バーＴＲ６０を重畳表示する。例えば、映像生成システム１は、ユーザの操作に応じて、カットＣＵ６２を表示する際は、後関係バーＴＲ６２を重畳表示する。例えば、映像生成システム１は、ユーザの操作に応じて、カットＣＵ６３を表示する際は、後関係バーＴＲ６３を重畳表示する。これにより、映像生成システム１は、対象カットの前後のカットを再生する際、対象カットからの離れた量を提示する。このように、映像生成システム１は、対象カットの前後も再生する場合、対象カットからの離れた秒数に応じて、その量及び方向を示す情報を画面上に重畳表示することで、そのカットが対象カットから前後のどちらに、どの程度離れているかをユーザに認識させることができる。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 The video generation system 1 plays a video including cuts CU60 to CU63 in response to a user's operation. For example, when displaying cut CU60, the video generation system 1 superimposes the previous relationship bar TR60. When displaying cut CU62, it superimposes the subsequent relationship bar TR62. When displaying cut CU63, it superimposes the subsequent relationship bar TR63. In this way, when playing cuts before and after the target cut, the video generation system 1 presents how far each cut is from the target cut. Specifically, it superimposes on the screen information indicating the amount and direction of the separation, according to the number of seconds from the target cut, so that the user can recognize whether the cut is before or after the target cut and how far away it is. Through the above-mentioned processing, the video generation system 1 can improve usability.

＜１－４－１７．オブジェクトの選択例＞ <1-4-17. Example of object selection>
従来技術では、編集時のオブジェクトの選択がユーザにとって大変（ユーザビリティが低い）という課題がある。このように、従来技術には、ユーザビリティに関する課題があり、ユーザビリティの改善の余地がある。そこで、映像生成システム１は、図３７に示すように、オブジェクトの選択を行ってもよい。図３７は、オブジェクトの選択の一例を示す図である。 In the conventional technology, selecting an object during editing is laborious for the user (low usability). The conventional technology thus has a usability problem and leaves room for improvement. Therefore, the video generation system 1 may perform object selection as shown in Fig. 37. Fig. 37 is a diagram showing an example of object selection.

図３７の映像ＭＶ７１では、ユーザはマウスを操作して、３人のうち右側の人が含まれる範囲にマウスカーソルＭＳ７１を位置させることに右側の人を選択する。この場合、映像生成システム１は、マウスカーソルＭＳ７１を位置する右側の人を対象オブジェクトとして選択する操作を受け付ける。例えば、映像生成システム１は、マウスカーソルＭＳ７１を位置する右側の人に対応するセグメンテーション（範囲）を、ユーザが選択した範囲として特定する。 In video MV71 of FIG. 37, the user operates the mouse to select the right-hand person of the three people by positioning the mouse cursor MS71 in a range that includes the right-hand person. In this case, video production system 1 accepts an operation to select the right-hand person where the mouse cursor MS71 is positioned as the target object. For example, video production system 1 identifies the segmentation (range) corresponding to the right-hand person where the mouse cursor MS71 is positioned as the range selected by the user.

図３７の映像ＭＶ７２では、ユーザはマウスを操作して、３人のうち中央の人が含まれる範囲にマウスカーソルＭＳ７２を位置させることに中央の人を選択する。この場合、映像生成システム１は、マウスカーソルＭＳ７２を位置する中央の人を対象オブジェクトとして選択する操作を受け付ける。例えば、映像生成システム１は、マウスカーソルＭＳ７２を位置する中央の人に対応するセグメンテーション（範囲）を、ユーザが選択した範囲として特定する。 In video MV72 of FIG. 37, the user operates the mouse to select the central person by positioning the mouse cursor MS72 in a range that includes the central person among the three people. In this case, video production system 1 accepts an operation to select the central person where the mouse cursor MS72 is positioned as the target object. For example, video production system 1 identifies the segmentation (range) corresponding to the central person where the mouse cursor MS72 is positioned as the range selected by the user.

図３７の映像ＭＶ７３では、ユーザはマウスを操作して、３人のうち左側の人が含まれる範囲にマウスカーソルＭＳ７３を位置させることに左側の人を選択する。この場合、映像生成システム１は、マウスカーソルＭＳ７３を位置する左側の人を対象オブジェクトとして選択する操作を受け付ける。例えば、映像生成システム１は、マウスカーソルＭＳ７３を位置する左側の人に対応するセグメンテーション（範囲）を、ユーザが選択した範囲として特定する。 In video MV73 of FIG. 37, the user operates the mouse to select the left person of the three people by positioning the mouse cursor MS73 in a range that includes the left person. In this case, video production system 1 accepts an operation to select the person on the left where the mouse cursor MS73 is positioned as the target object. For example, video production system 1 identifies the segmentation (range) corresponding to the person on the left where the mouse cursor MS73 is positioned as the range selected by the user.

このように、映像生成システム１は、対象オブジェクトの選択を、セグメンテーションで認識された範囲で認識する。これにより、ユーザは、クリックだけで指定したいオブジェクトを選択することができる。例えば、動画中に表示されるオブジェクトに対して、ユーザがモーション、アセットの変更を行う際には、変更対象のオブジェクトを選択する必要がある。動画中に表示されるオブジェクトの一覧から変更対象を選ぶようなＵＩよりも、動画中のオブジェクトを直接クリックできれば、ユーザはより直感的に変更対象のオブジェクトを選択することができる。動画中のオブジェクトのクリック範囲は、動画１フレームごとに各オブジェクトのピクセル領域をセグメンテーションすることにより実現できる。 In this way, the video generation system 1 recognizes the selection of the target object within the range recognized by segmentation. This allows the user to select the object they want to specify with just a click. For example, when the user wants to change the motion or assets of an object displayed in a video, it is necessary to select the object to be changed. If the user can directly click on the object in the video, rather than using a UI that allows the user to select the object to be changed from a list of objects displayed in the video, the user can more intuitively select the object to be changed. The click range of an object in a video can be achieved by segmenting the pixel area of each object for each frame of the video.

オブジェクトのクリック範囲を求める方法については、以下のような方法であってもよい。例えば、動画のフレームごとのオブジェクトとカメラの情報がＵＳＤ上にあるため、そのフレームにおいてカメラから見たあるピクセルがどのオブジェクトを指しているかのマッピングが可能であるため、映像生成システム１は、そこから逆算してユーザが動画上でクリックした座標と編集対象のオブジェクトのマッピングが可能になる。 The click range of an object may be calculated in the following manner. For example, because object and camera information for each frame of a video is stored in the USD, it is possible to map which object a pixel seen from the camera in that frame points to, and the video generation system 1 can then work backwards from that to map the coordinates where the user clicked on the video to the object to be edited.

オブジェクトのクリック範囲を求める方法としては、動画１フレームごとに、カメラから逆算したピクセルごとのオブジェクトのセグメンテーションで選んだ人と、対象ＵＳＤを紐づける方法等が挙げられる。例えば、動画自体にｘ,ｙ（座標等）のこの領域には何が位置するかの情報を含ませる。例えば、映像生成システム１は、図３８に示すように、オブジェクトの選択を行ってもよい。 Methods for determining the click range of an object include linking the target USD to a person selected by segmenting the object pixel by pixel calculated backwards from the camera for each frame of the video. For example, the video itself may contain information about what is located in a particular area of x, y (coordinates, etc.). For example, the video generation system 1 may select an object as shown in FIG. 38.

図３８中のフレームＦＲは、カメラからレンダリングされた動画の１フレームを示す。図３８中のＵＳＤデータＯＢは、ＵＳＤ上の３Ｄオブジェクトとカメラを示す。３Ｄオブジェクトとカメラの情報がＵＳＤに含まれるため、映像生成システム１は、カメラ（図３８中のカメラＣＭ等）から見たあるピクセルが３Ｄオブジェクト上のどの場所を指すのかをマッピングすることができる。そのため、映像生成システム１は、ユーザが動画上でクリックした座標とオブジェクトのマッピングが可能となる。上述した処理により、映像生成システム１は、ユーザビリティを向上させることができる。 Frame FR in FIG. 38 shows one frame of a video rendered from a camera. USD data OB in FIG. 38 shows a 3D object and a camera on the USD. Because information on the 3D object and the camera is included in the USD, the video production system 1 can map which location on the 3D object a pixel viewed from a camera (such as camera CM in FIG. 38) points to. This enables the video production system 1 to map the coordinates clicked by the user on the video to an object. Through the above-mentioned processing, the video production system 1 can improve usability.
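One way to picture the pixel-to-object mapping is an object-ID buffer rendered alongside each frame, in which every pixel stores the ID of the object visible from the camera at that pixel. The sketch below assumes such a buffer already exists; the function name `pick_object`, the buffer layout, and the object paths are hypothetical and are not taken from the source or from the USD API.

```python
from typing import Dict, List, Optional


def pick_object(id_buffer: List[List[int]], click_x: float, click_y: float,
                view_width: float, view_height: float,
                id_to_path: Dict[int, str]) -> Optional[str]:
    """Map a click in player coordinates to the object path under the cursor.

    id_buffer[y][x] holds the ID of the object seen by the camera at pixel
    (x, y) of the current frame; IDs missing from id_to_path (for example
    the background) yield None.
    """
    height, width = len(id_buffer), len(id_buffer[0])
    # Scale from the player's display size to the frame resolution.
    px = min(int(click_x * width / view_width), width - 1)
    py = min(int(click_y * height / view_height), height - 1)
    return id_to_path.get(id_buffer[py][px])
```

Because the lookup is per frame, the same click position can resolve to different objects as the objects or the camera move over time.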

＜１－４－１８．３Ｄモデル利用例＞ <1-4-18. Example of using 3D models>
上述した処理は一例に過ぎず、映像生成システム１は、上述した処理以外にも様々な処理を実行してもよい。この点について、以下いくつかの実施例を記載する。 The above-described processing is merely an example, and the video generation system 1 may execute various processes other than those described above. Several examples are described below.

例えば、映像生成システム１は、登場人物やプロップ（商品など）を取り込む処理を行ってもよい。映像生成システム１における登場人物やプロップ（商品など）の取り込み方法としては、例えば指定して動画内に登場させたい、３Ｄモデルや三面図を登録しそれを動画内で使ってもよい。例えば、映像生成システム１は、ユーザにより入力された３Ｄモデルを登録し、動画内で使ってもよい。例えば、映像生成システム１においては、様々な角度で撮影した商品の写真を入れることが可能であってもよい。この場合、映像生成システム１は、ＮｅＲＦ（Neural Radiance Fields）等の技術を用いて、様々な角度で撮影した商品の写真から、その商品の３Ｄモデルを生成し、動画内で使ってもよい。 For example, the video generation system 1 may perform a process of importing characters and props (such as products). As one import method, a 3D model or a three-view drawing of a character or prop that the user wants to appear in the video may be registered and then used in the video. For example, the video generation system 1 may register a 3D model input by a user and use it in the video. The video generation system 1 may also accept photos of a product taken from various angles. In this case, the video generation system 1 may use technology such as NeRF (Neural Radiance Fields) to generate a 3D model of the product from those photos and use it in the video.

例えば、映像生成システム１は、動画データのうち、変更の対象とする対象物を指定するユーザの操作を受け付け、ユーザの操作が示す対象物の３Ｄデータが変更されたコードを生成する。例えば、映像生成システム１は、動画中のある登場人物（「登場人物Ａ」ともいう）が変更対象物としてユーザに指定され、登録された３Ｄモデルから別の登場人物（「登場人物Ｂ」ともいう）が変更後の登場人物としてユーザが選択した場合、動画中の登場人物Ａの３Ｄデータが登場人物Ｂの３Ｄデータに変更されたコードを生成する。これにより、映像生成システム１は、動画中の登場人物Ａが、登録された３Ｄモデルが示す登場人物Ｂに変更された動画データを生成することができる。 For example, the video generation system 1 accepts a user operation to specify an object to be changed in the video data, and generates code in which the 3D data of the object specified by the user operation has been changed. For example, when a character in a video (also referred to as "character A") is specified by the user as the object to be changed, and the user selects another character (also referred to as "character B") from a registered 3D model as the changed character, the video generation system 1 generates code in which the 3D data of character A in the video has been changed to the 3D data of character B. This allows the video generation system 1 to generate video data in which character A in the video has been changed to character B indicated by the registered 3D model.

なお、上述した処理は一例に過ぎず、映像生成システム１は、任意の処理によりユーザの操作が示す対象物の３Ｄデータが変更されたコードを生成してもよい。例えば、映像生成システム１は、動画データのうち、ユーザの操作が示す対象物の３Ｄデータ自体に変更を行うことにより、変更されたコードを生成してもよい。例えば、映像生成システム１は、動画データのうち、ユーザの操作が示す対象物の３Ｄデータの外形（身長等）の変更を行うことにより、変更されたコードを生成してもよい。また、映像生成システム１は、登録した人やプロップに関しては、レンダリング後、リファイナ処理を部分的に行わなくてもよい。また、映像生成システム１においては、登場人物やプロップなどに関しては、映像生成サービス内でマーケットプレイスを用意し販売しても良い。 Note that the above-mentioned processing is merely an example, and the video generation system 1 may use any processing to generate code in which the 3D data of the object indicated by the user's operation has been changed. For example, the video generation system 1 may generate the changed code by modifying the 3D data of the indicated object itself. For example, it may generate the changed code by changing the external shape (height, etc.) of the 3D data of the indicated object. Furthermore, for registered people and props, the video generation system 1 may partially skip the refiner processing after rendering. Characters, props, and the like may also be sold through a marketplace provided within the video generation service.

＜１－４－１９．参考データの利用例＞ <1-4-19. Example of using reference data>
また、映像生成システム１は、以前作成した（動画）プロジェクトにより、動画とストーリーを参考にしてもよい。映像生成システム１は、図３９に示すように、以前作成したプロジェクトの続編を使いたい時、そのプロジェクトを参考データとして入力を受け付けてもよい。図３９は、参考データを用いた処理の一例を示す図である。 Furthermore, the video generation system 1 may refer to the video and story of a previously created (video) project. As shown in Fig. 39, when the user wants to create a sequel to a previously created project, the video generation system 1 may accept that project as reference data. Fig. 39 is a diagram showing an example of processing using reference data.

例えば、映像生成システム１は、コンテンツＣＴ５１にユーザが入力した他のプロジェクトの参考に関するユーザの入力情報を取得する。コンテンツＣＴ５１は、他のプロジェクトを参考にするか否かをチェックマークで指定する項目、及び参考にするプロジェクトをチェックマークで指定する項目、参考にするプロジェクトのうちどの情報を参考にするかをチェックマークで指定する項目等についてのユーザの入力情報を受け付けるためのコンテンツである。 For example, the video production system 1 acquires user input information regarding references to other projects that the user has input to content CT51. Content CT51 is content for accepting user input information regarding items such as an item for specifying with a check mark whether or not to refer to other projects, an item for specifying with a check mark the projects to refer to, and an item for specifying with a check mark which information of the referred projects to refer to.

例えば、クライアントＵＩ表示部４００は、コンテンツＣＴ５１を表示し、センサ部３００は、ユーザ入力情報を受け付ける。図３９では、ユーザは「他のプロジェクトを参考にする」にチェックマークを入れ、他のプロジェクトを参考にすることを選択する。また、ユーザは「商品ＸＣＭ動画」にチェックマークを入れ、商品ＸのＣＭ動画のプロジェクトを参考にすることを選択する。 For example, the client UI display unit 400 displays the content CT51, and the sensor unit 300 accepts user input information. In FIG. 39, the user checks "Refer to other projects" to select to refer to other projects. The user also checks "Product X CM video" to select to refer to the project for the CM video for Product X.

また、ユーザは「登場人物」及び「ビジュアルスタイル」にチェックマークを入れ、商品ＸのＣＭ動画のプロジェクトのうち、登場人物及びビジュアルスタイルを参考にすることを選択する。また、ユーザは「ストーリー」、「コンテンツスタイル」及び「ＢＧＭ」にチェックマークを入れておらず、商品ＸのＣＭ動画のプロジェクトのうち、ストーリー、コンテンツスタイル及びＢＧＭについては参考にしないことを選択する。 The user also checks "Characters" and "Visual Style," choosing to use the characters and visual style of the commercial video project for Product X as a reference. The user also does not check "Story," "Content Style," or "BGM," choosing not to use the story, content style, or BGM of the commercial video project for Product X as a reference.

映像生成システム１は、コンテンツＣＴ５１で受け付けたユーザ入力情報を用いて、シナリオ生成用情報（第１の入力情報）であるプロンプトＰＴ５１を生成する。なお、図３９では説明を省略するが、映像生成システム１は、図６に示すテンプレートＴＰ１のようなテンプレート入力情報を用いて、プロンプトＰＴ５１を生成してもよい。 The video production system 1 uses the user input information received in the content CT51 to generate a prompt PT51, which is information for scenario generation (first input information). Note that although not illustrated in FIG. 39, the video production system 1 may generate the prompt PT51 using template input information such as the template TP1 shown in FIG. 6.

例えば、映像生成システム１は、商品ＸのＣＭ動画のプロジェクトのうち、登場人物及びビジュアルスタイルを反映することにより、プロンプトＰＴ５１を生成する。図３９では、映像生成システム１は、商品ＸのＣＭ動画での登場人物“Mike”を使い、商品ＸのＣＭ動画でのビジュアルスタイル“映画風”を使う事を指定する制約を含むプロンプトＰＴ５１を生成する。このように、映像生成システム１は、ユーザの選択に応じて、過去のプロジェクトを参考データとして、シナリオを生成するためのプロンプトを生成する。これにより、映像生成システム１は、ユーザが指定した過去のプロジェクトに基づいて動画データを生成することができる。 For example, the video generation system 1 generates prompt PT51 by reflecting the characters and visual style of the commercial video project for product X. In FIG. 39, the video generation system 1 generates prompt PT51 including constraints that specify the use of "Mike", a character in the commercial video for product X, and of "cinematic", the visual style of that video. In this way, the video generation system 1 generates a prompt for scenario generation that uses a past project as reference data, according to the user's selection. This allows the video generation system 1 to generate video data based on the past project specified by the user.
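The step from checked items to prompt constraints can be sketched as below. This is only an illustration of the idea: the field names, the constraint wording, and the function `build_constraints` are hypothetical, and the actual format of prompt PT51 is not specified in the text.

```python
from typing import Dict, List, Union

# Hypothetical mapping from checkbox fields to the wording used in constraints.
FIELD_LABELS = {
    "characters": "character",
    "visual_style": "visual style",
    "story": "story",
    "content_style": "content style",
    "bgm": "BGM",
}


def build_constraints(project: Dict[str, Union[str, List[str]]],
                      selected_fields: List[str]) -> List[str]:
    """Turn the fields the user checked into constraint lines for the prompt."""
    constraints = []
    for field in selected_fields:
        values = project[field]
        if isinstance(values, str):
            values = [values]
        for value in values:
            constraints.append(
                f'Use the {FIELD_LABELS[field]} "{value}" '
                f'from the project "{project["name"]}".'
            )
    return constraints


project_x = {
    "name": "Product X CM video",
    "characters": ["Mike"],
    "visual_style": "cinematic",
    "story": "(not referenced)",
    "bgm": "(not referenced)",
}
constraints = build_constraints(project_x, ["characters", "visual_style"])
```

Only the checked fields contribute constraints, so unchecked items such as the story or BGM of the referenced project leave the scenario generation unconstrained in those respects.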

＜１－４－２０．３Ｄデータを有する利点例＞ <1-4-20. Examples of advantages of having 3D data>
映像生成システム１は、３Ｄデータを内部に有するため、以下のような機能や利点を有する。映像生成システム１は、生成する映像（動画）については以下のような機能や利点を有する。例えば、映像生成システム１は、物理シミュレーションにより、より自然な映像を作ることができる。例えば、映像生成システム１は、台の上にものを乗っけたり、跳ね返ったり、転がったり、自然な布の揺れを再現したりすることができる。 Because the video generation system 1 holds 3D data internally, it has the following functions and advantages. With regard to the video it generates, the video generation system 1 can, for example, create more natural video by using physical simulation. For example, it can reproduce objects resting on a platform, bouncing, or rolling, as well as the natural swaying of cloth.

例えば、映像生成システム１は、ライティングとオブジェクト素材に応じて、光の反射具合をリアルに再現することができる。反射しやすいボンネットや鏡の時は、その手前にいる人やものが反射光の影響で明るくなる。反射しにくい布の時は、その手前にいる人やものが反射光をあまり受けない。 For example, the image generation system 1 can realistically reproduce the reflection of light depending on the lighting and object material. When it comes to a bonnet or mirror that is highly reflective, people or objects in front of it will be brighter due to the reflected light. When it comes to fabric that is less reflective, people or objects in front of it will not receive much reflected light.

例えば、映像生成システム１は、ライティングとオブジェクトが固定出来るので、時系列的な破綻が生じる可能性を低減させることができる。例えば、映像生成システム１は、ライティングの位置を映像中に変化させても、破綻なく表現することができる。例えば、映像生成システム１は、簡易光源を設定することで、リアルタイムに映像を出力できる。この場合、映像生成システム１は、リファイナ処理を行わなくてもよい。 For example, because the video generation system 1 can fix lighting and objects, it is possible to reduce the possibility of time-series breakdowns. For example, the video generation system 1 can express the lighting position without breakdowns even if it is changed during the video. For example, the video generation system 1 can output video in real time by setting a simple light source. In this case, the video generation system 1 does not need to perform refinement processing.

例えば、映像生成システム１は、商品データや登場人物を３Ｄデータとして入れることで、商品自体を忠実に映像内で再現することができる。例えば、映像生成システム１は、デプス、グローバル位置、ノーマルなどがわかるため、リファイナ処理時に破綻が生じる可能性を低減させることができる。例えば、映像生成システム１は、スピーカーなど音のなる３次元位置を固定できるため、音に対してのインタラクションを作りやすい。例えば、映像生成システム１は、Deferred Renderingのようなライティング処理をすることで、レンダリング時間を削減することができる。 For example, by inputting product data and characters as 3D data, the video generation system 1 can faithfully reproduce the product itself within the video. For example, because the video generation system 1 knows the depth, global position, normals, and so on, it can reduce the possibility of artifacts during refiner processing. For example, the video generation system 1 can fix the three-dimensional position of a sound source such as a speaker, making it easy to create interactions with sound. For example, the video generation system 1 can reduce rendering time by performing lighting processing such as deferred rendering.

また、映像生成システム１は、映像（動画）の編集（修正）については以下のような機能や利点を有する。例えば、映像生成システム１は、カメラの位置、角度、カメラワークを大幅に修正した時でも、周りの環境やライティングなどの一貫性を保つことができる。例えば、映像生成システム１は、動画内で３次元的に位置角度を指定することができる。 In addition, the video production system 1 has the following functions and advantages when it comes to editing (correcting) video (moving images). For example, the video production system 1 can maintain consistency of the surrounding environment and lighting, even when the camera position, angle, and camerawork are significantly modified. For example, the video production system 1 can specify the position and angle three-dimensionally within the video.

例えば、映像生成システム１は、ある程度動画が生成された後であっても、登場人物のみ、プロップのみなど特定のモノだけを変更し、その他の部分は映像を保つことができる。例えば、映像生成システム１は、キャラクターをリアルな人型から二頭身キャラクターに変更することができる。例えば、映像生成システム１は、設置している看板を黒板タイプからプラスチックボードに変更することができる。例えば、映像生成システム１は、商品パッケージの中のロゴの一部のみを変更することができる。 For example, even after the video has been generated to some extent, the video generation system 1 can change only specific items, such as only the characters or only the props, while keeping the rest of the video as it is. For example, it can change a character from a realistic human figure to a stylized character that is two heads tall. For example, it can change an installed signboard from a blackboard type to a plastic board. For example, it can change only a part of the logo on a product package.

例えば、映像生成システム１は、３次元で撮影シーンをみることで、動画のフレームには映らない箇所の映像を修正することができる。例えば、映像生成システム１は、映っていない箇所に光源を設置することができる。例えば、映像生成システム１は、映っていない箇所に人やモノを設置して、動画フレーム内に影だけを表示することができる。例えば、映像生成システム１は、映っていないところに人等を置いて、映っている人が映っていない人の目を見てなどの指定をすることができる。 For example, by viewing the captured scene in three dimensions, the image generation system 1 can correct the image of parts that are not shown in the video frames. For example, the image generation system 1 can place a light source in a part that is not shown. For example, the image generation system 1 can place a person or object in a part that is not shown and display only a shadow in the video frame. For example, the image generation system 1 can place a person or object in a part that is not shown and specify that the person who is shown looks into the eyes of the person who is not shown.

また、映像生成システム１は、上記以外の点については以下のような機能や利点を有する。例えば、映像生成システム１は、ＡＲ、ＶＲ（Virtual Reality）、ＳＲＤ（Spatial Reality Display）、３Ｄディスプレイなどの３Ｄデバイスのコンテンツに容易に変換することができる。 In addition to the above, the image generation system 1 has the following functions and advantages. For example, the image generation system 1 can easily convert content into 3D device content such as AR, VR (Virtual Reality), SRD (Spatial Reality Display), and 3D displays.

＜１－５．ユーザから見た処理フロー例＞ <1-5. Example of processing flow from the user's perspective>
次に、図４０を用いて、ユーザから見た処理フローの一例として、ユーザ操作に応じた映像生成システム１による情報処理の手順について説明する。図４０は、ユーザ操作に応じた処理の流れを示すフローチャートである。 Next, with reference to Fig. 40, the procedure of information processing performed by the video generation system 1 in response to user operations is described as an example of the processing flow seen from the user's perspective. Fig. 40 is a flowchart showing the flow of processing in response to user operations.

図４０に示すように、映像生成システム１では、ユーザがプロジェクト作成ボタンを押下する（ステップＳ１）。そして、映像生成システム１では、ユーザが動画生成のための必要情報入力を行う（ステップＳ２）。例えば、ユーザは、動画作成の目的、動画を通して伝えたいメッセージ、ターゲットユーザ、伝えたい商品／サービスの機能特徴、動画の長さ、アスペクト比等を含む情報の入力を行う。 As shown in FIG. 40, in the video production system 1, the user presses a project creation button (step S1). Then, in the video production system 1, the user inputs the necessary information for video generation (step S2). For example, the user inputs information including the purpose of creating the video, the message to be conveyed through the video, the target users, the functional features of the product/service to be conveyed, the length of the video, the aspect ratio, etc.

そして、映像生成システム１では、ユーザがストーリーボード、動画、音、テキストロゴ作成ボタンを押下する（ステップＳ３）。例えば、映像生成システム１は、ユーザ操作に応じて、一気に全て（例えば動画データまで）生成してもよいし、ストーリーボードを提示して、ユーザからの指示に応じてある程度修正してから、動画、音、テキストロゴを生成してもよい。 Then, in the video generation system 1, the user presses the button for creating the storyboard, video, sound, and text logo (step S3). For example, the video generation system 1 may generate everything at once (for example, up to the video data) in response to the user operation, or it may first present a storyboard, modify it to some extent according to the user's instructions, and then generate the video, sound, and text logo.

そして、映像生成システム１は、ユーザ操作に応じて、ストーリーボード、動画、音、テキストロゴの修正を行う（ステップＳ４）。例えば、映像生成システム１は、それぞれをユーザの好きな順番に修正を受け付けてもよい。 Then, the video production system 1 modifies the storyboard, video, sound, and text logo in response to user operations (step S4). For example, the video production system 1 may accept modifications to each of them in the order of the user's preference.

そして、映像生成システム１は、ユーザ操作に応じて、エクスポートを行う（ステップＳ５）。例えば、映像生成システム１は、ユーザ操作に応じて、ｍｐ4、ａｖｉ、ｍｏｖなどの動画ファイル形式でのエクスポートを行ってもよい。また、例えば、映像生成システム１は、Premiere Pro、After Effects、DaVinci Resolveなど、任意の動画編集ソフトウェアのファイル形式でのエクスポートを行ってもよい。 Then, the video generation system 1 performs an export in response to a user operation (step S5). For example, the video generation system 1 may export in a video file format such as mp4, avi, or mov. It may also export in the file format of any video editing software, such as Premiere Pro, After Effects, or DaVinci Resolve.

＜１－６．ＡＩモデルについて＞ <1-6. About the AI models>
なお、上述した各処理で用いられるＡＩモデルの各々については、各箇所で記載した例示に限らず、入力に応じて、所望の情報を出力可能であれば、その内部構造は任意の構造が採用可能である。ＡＩモデルの入力、出力及び内部構造については、所望の情報を出力可能であれば、任意の組合せが採用可能である。 Each of the AI models used in the processes described above is not limited to the examples given in each section, and any internal structure can be adopted as long as the model can output the desired information in response to its input. Any combination of input, output, and internal structure can be adopted for an AI model as long as the desired information can be output.

ＡＩモデルの入力は、テキスト、画像、音声、３Ｄデータ等であってもよく、それらの組合せであってもよい。また、ＡＩモデルの出力は、テキスト、画像、音声、３Ｄデータ等であってもよい。なお、上述した入力及び出力は一例に過ぎず、上述したＡＩモデルは、任意の入力及び出力であってもよい。 The input of the AI model may be text, images, audio, 3D data, etc., or a combination thereof. The output of the AI model may be text, images, audio, 3D data, etc. Note that the above-mentioned inputs and outputs are merely examples, and the above-mentioned AI model may have any input and output.

また、ＡＩモデルの内部構造は、入力及び出力の組合せに応じて、任意の構造が採用可能である。すなわち、ＡＩモデルの内部構造は、入力に対して、所望の出力が可能であればどのような構造であってもよい。 The internal structure of the AI model can be any structure depending on the combination of input and output. In other words, the internal structure of the AI model can be any structure that can produce the desired output for the input.

例えば、ＡＩモデルは、Transformerに関する構造を有してもよい。例えば、ＡＩモデルは、Transformerに関する構造を有し、テキスト、時系列データ等、データ内での前後関係等のコンテキストを考慮した処理を行ってもよい。例えば、ＡＩモデルは、自己注意機構（self-attention mechanism）を有してもよい。例えば、ＡＩモデルは、Single-Head Attention、Multi-head Attention等の任意のアテンション機構を有してもよい。なお、ＡＩモデルは、アテンション機構を有しなくてもよい。 For example, the AI model may have a structure related to Transformer. For example, the AI model may have a structure related to Transformer and perform processing that takes into account context such as the context within data, such as text and time-series data. For example, the AI model may have a self-attention mechanism. For example, the AI model may have any attention mechanism, such as Single-Head Attention or Multi-head Attention. Note that the AI model does not have to have an attention mechanism.

ＡＩモデルは、入力から特徴を抽出する機構を有してもよい。例えば、ＡＩモデルは、エンコーダを有してもよい。ＡＩモデルは、抽出された特徴を基に、情報を生成する機構を有してもよい。例えば、ＡＩモデルは、デコーダを有してもよい。 The AI model may have a mechanism for extracting features from an input. For example, the AI model may have an encoder. The AI model may have a mechanism for generating information based on the extracted features. For example, the AI model may have a decoder.

ＡＩモデルは、ＣＮＮ（Convolutional Neural Network）に関する構造を有してもよい。例えば、ＡＩモデルは、画像を対象とする処理を行う場合、ＣＮＮに関する構造を有してもよい。例えば、ＡＩモデルは、畳み込み層、プーリング層、全結合層等のうち少なくとも１つを有してもよい。 The AI model may have a structure related to a CNN (Convolutional Neural Network). For example, when performing processing on an image, the AI model may have a structure related to a CNN. For example, the AI model may have at least one of a convolutional layer, a pooling layer, a fully connected layer, etc.

なお、上述した内部構造は一例に過ぎず、上述したＡＩモデルは、任意の内部構造を有してもよい。例えば、ＡＩモデルは、スキップ接続（skip connection）を有してもよい。また、ＡＩモデルは、Diffusionモデルに関する構造を有してもよい。 Note that the above-mentioned internal structure is merely an example, and the above-mentioned AI model may have any internal structure. For example, the AI model may have a skip connection. In addition, the AI model may have a structure related to a diffusion model.

また、上述したＡＩモデルは、任意の学習処理により生成（学習）されてもよい。ＡＩモデルは、任意の機械学習の手法を用いて学習された機械学習モデルであってもよい。例えば、ＡＩモデルは、いわゆるFoundation Model（基盤モデル）を基に、その基盤モデルを特定のタスク（例えばシナリオデータ生成、コード生成等）に適用するようにファインチューニングされて生成されたモデルであってもよい。例えば、上述したＬＬＭのようなＡＩモデルは、基盤モデルを特定のタスクに適用するようにファインチューニングされて生成されたモデルであってもよい。 The AI model described above may be generated (learned) by any learning process. The AI model may be a machine learning model learned using any machine learning method. For example, the AI model may be a model generated based on a so-called Foundation Model by fine-tuning the Foundation Model to apply it to a specific task (e.g., scenario data generation, code generation, etc.). For example, an AI model such as the LLM described above may be a model generated by fine-tuning the Foundation Model to apply it to a specific task.

ここでいう基盤モデルは、様々なタスクに適用可能なように、例えば多種多様なタスクを実行可能になるように学習されたモデルである。例えば、基盤モデルは、大量のラベル無しデータセットで事前学習させたニューラルネットワークである。なお、基盤モデルは、Transformerベースのアーキテクチャ等の任意の構造を有してもよい。例えば、基盤モデルは、正解ラベルのないデータを使用した自己教師あり学習により生成される。上記のように、基盤モデルは、幅広い下流タスクに適応できるようにファインチューニングされる。 The foundation model (base model) referred to here is a model trained so that it can be applied to various tasks, for example so that it can perform a wide variety of tasks. For example, the base model is a neural network pre-trained on a large unlabeled dataset. Note that the base model may have any structure, such as a Transformer-based architecture. For example, the base model is generated by self-supervised learning using data without correct answer labels. As described above, the base model is fine-tuned so that it can be adapted to a wide range of downstream tasks.

例えば、シナリオデータ生成のタスクに適用される場合、基盤モデルがシナリオデータ生成のタスクに適応できるようにファインチューニングされ、シナリオデータ生成のタスクに適用したＡＩモデル（モデルＭ１等）が生成される。また、例えば、コード生成のタスクに適用される場合、基盤モデルがコード生成のタスクに適応できるようにファインチューニングされ、コード生成のタスクに適用したＡＩモデル（モデルＭ３等）が生成される。また、例えば、音データ生成のタスクに適用される場合、基盤モデルが音データ生成のタスクに適応できるようにファインチューニングされ、音データ生成のタスクに適用したＡＩモデル（モデルＭ４等）が生成される。また、例えば、テキストロゴ生成のタスクに適用される場合、基盤モデルがテキストロゴ生成のタスクに適応できるようにファインチューニングされ、テキストロゴ生成のタスクに適用したＡＩモデル（モデルＭ５等）が生成される。 For example, when applied to a task of scenario data generation, the base model is fine-tuned so that it can be adapted to the task of scenario data generation, and an AI model (model M1, etc.) applied to the task of scenario data generation is generated. Also, when applied to a task of code generation, the base model is fine-tuned so that it can be adapted to the task of code generation, and an AI model (model M3, etc.) applied to the task of code generation is generated. Also, when applied to a task of sound data generation, the base model is fine-tuned so that it can be adapted to the task of sound data generation, and an AI model (model M4, etc.) applied to the task of sound data generation is generated. Also, when applied to a task of text logo generation, the base model is fine-tuned so that it can be adapted to the task of text logo generation, and an AI model (model M5, etc.) applied to the task of text logo generation is generated.

例えば、シナリオデータ生成のタスクに適用したＡＩモデル（モデルＭ１等）は、そのＡＩモデルに対応する入力情報と、その入力情報を入力した場合の正解の出力となるシナリオデータ（「正解情報」ともいう）とを組合せを含む学習データを用いて学習される。なお、入力情報、正解情報等の学習データは、人が作成したデータであってもよいし、学習データを生成するコンピュータが自動で生成したデータであってもよい。例えば、正解情報となるシナリオデータ等は、人が作成したデータであってもよい。以下、モデルＭ１を一例として簡単に説明する。例えば、モデルＭ１は、学習データ中の各入力情報が入力された場合に、その各入力情報に対応する正解情報を出力するように学習される。例えば、モデルＭ１は、バックプロパゲーション（誤差逆伝播法）等の手法により、ある入力情報が入力された場合のモデルＭ１における出力と、その入力情報に対応する正解情報との誤差が少なくなるようにパラメータ（接続係数）が調整（補正）されることにより学習される。また、コード生成のタスクに適用したＡＩモデル（モデルＭ３等）、音データ生成のタスクに適用したＡＩモデル（モデルＭ４等）、テキストロゴ生成のタスクに適用したＡＩモデル（モデルＭ５等）、評価のタスクに適用したＡＩモデル（モデルＭ１０等）、画像改善処理のタスクに適用したＡＩモデル（モデルＭ１１等）等の他のＡＩモデルについても同様の学習処理により学習されてもよい。 For example, an AI model (such as model M1) applied to a task of generating scenario data is trained using training data that includes a combination of input information corresponding to the AI model and scenario data (also called "correct answer information") that is the correct output when the input information is input. The training data, such as the input information and the correct answer information, may be data created by a person, or may be data automatically generated by a computer that generates the training data. For example, the scenario data that is the correct answer information may be data created by a person. Below, model M1 will be briefly described as an example. For example, model M1 is trained to output correct answer information corresponding to each input information when each piece of input information in the training data is input. For example, model M1 is trained by adjusting (correcting) parameters (connection coefficients) using a method such as backpropagation (error backpropagation method) so that the error between the output of model M1 when certain input information is input and the correct answer information corresponding to that input information is reduced. 
In addition, other AI models such as an AI model applied to a code generation task (e.g., model M3), an AI model applied to a sound data generation task (e.g., model M4), an AI model applied to a text logo generation task (e.g., model M5), an AI model applied to an evaluation task (e.g., model M10), and an AI model applied to an image improvement processing task (e.g., model M11) may also be trained using a similar learning process.
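The supervised training described above can be illustrated with a minimal sketch. It assumes a tiny one-input, one-output network with a single tanh hidden layer and a squared-error loss; the function name `train`, the network size, the learning rate, and the toy input/correct-answer pairs are all hypothetical and are not taken from the disclosure. Model M1 itself would be a far larger model, but the parameter update follows the same idea: compute the output, measure the error against the correct answer information, and propagate that error backward to adjust the connection coefficients.

```python
import math
import random


def train(pairs, hidden=4, lr=0.05, epochs=500, seed=0):
    """Fit a tiny network to (input, correct answer) pairs.

    Returns the mean squared error observed during each epoch.
    All sizes and rates here are illustrative assumptions.
    """
    rng = random.Random(seed)
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden                                   # hidden biases
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0                                              # output bias
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in pairs:
            # Forward pass: compute the model output for this input.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            y = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = y - target  # error against the correct answer information
            total += err * err
            # Backward pass: propagate the error from the output to the hidden layer
            # (gradients of 0.5 * err**2), then adjust every parameter.
            for j in range(hidden):
                grad_pre = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_pre * x
                b1[j] -= lr * grad_pre
            b2 -= lr * err
        history.append(total / len(pairs))
    return history


# Hypothetical training data: inputs paired with their correct outputs.
pairs = [(-1.0, 1.0), (-0.5, 0.25), (0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]
history = train(pairs)
```

The returned `history` makes it easy to confirm that the error shrinks as the parameters are corrected, which is the behavior the paragraph above describes.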

なお、上述した学習処理は一例に過ぎず、上述したＡＩモデルは、そのＡＩモデルの入力、出力及び内部構造に応じて任意の学習処理により学習される。例えば、ＡＩモデルは、ＧＡＮ（Generative Adversarial Network）等のように、教師なし学習の手法により学習されてもよい。また、ＡＩモデルは、フェデレーテッドラーニング等のようにデータを集約せずに分散した状態で学習されてもよい。この場合、各映像生成サービスの装置（サーバ等）において、そのサービスで収集したローカルモデルを生成し、各映像生成サービスの装置（サーバ等）が生成したローカルモデルの情報（パラメータ等）を集約するサーバ（集約サーバ）がローカルモデルの情報を用いてグローバルモデルを生成してもよい。この場合、映像生成システム１は、集約サーバが生成したグローバルモデルを集約サーバから受信し、受信したグローバルモデルをＡＩモデルとして処理に用いてもよい。 Note that the training process described above is merely an example, and each AI model described above is trained by any training process suited to its inputs, outputs, and internal structure. For example, an AI model may be trained by an unsupervised learning method, such as a GAN (Generative Adversarial Network). An AI model may also be trained in a distributed manner without aggregating the data, as in federated learning. In this case, the device (server, etc.) of each video generation service may generate a local model from the data collected by that service, and a server (aggregation server) that aggregates the information (parameters, etc.) of the local models generated by those devices may generate a global model using the local model information. The video generation system 1 may then receive the global model generated by the aggregation server from the aggregation server and use the received global model as an AI model for processing.
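As a minimal sketch of the aggregation step in the federated setup described above, the following assumes that each service reports its local model parameters as named lists of floats together with the number of samples it trained on, and that the aggregation server combines them by a sample-weighted average. The disclosure does not specify the aggregation rule, so the weighted average, the function name `aggregate`, and the parameter layout are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def aggregate(local_params: Sequence[Dict[str, List[float]]],
              sample_counts: Sequence[int]) -> Dict[str, List[float]]:
    """Build global parameters as a sample-weighted average of local parameters.

    Only parameters are exchanged; the raw data stays with each service.
    """
    total = sum(sample_counts)
    global_params: Dict[str, List[float]] = {}
    for name in local_params[0]:
        size = len(local_params[0][name])
        global_params[name] = [
            sum(p[name][i] * n for p, n in zip(local_params, sample_counts)) / total
            for i in range(size)
        ]
    return global_params


# Hypothetical local models from two video generation services.
service_a = {"w": [1.0, 2.0], "b": [0.0]}
service_b = {"w": [3.0, 4.0], "b": [1.0]}
global_model = aggregate([service_a, service_b], sample_counts=[1, 3])
```

Weighting by sample count lets a service with more collected data influence the global model more strongly; an unweighted mean would be an equally valid choice under the text above.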

このように、上述したＡＩモデルは、いずれのコンピュータが生成（学習）してもよい。すなわち、ＡＩモデルを生成する学習処理は、映像生成システム１のいずれかの装置（コンピュータ等）が行ってもよいし、映像生成システム１外の装置が行ってもよい。例えば、映像生成システム１外の装置が上述したＡＩモデルのうち少なくとも１つを生成する場合、映像生成システム１は、映像生成システム１外の装置からそのＡＩモデルを取得し、取得したＡＩモデルを用いて処理を行う。 In this way, the AI models described above may be generated (trained) by any computer. In other words, the training process that generates the AI models may be performed by any device (computer, etc.) of the video generation system 1, or by a device outside the video generation system 1. For example, when a device outside the video generation system 1 generates at least one of the AI models described above, the video generation system 1 acquires that AI model from the external device and performs processing using the acquired AI model.

＜２．その他の実施形態＞ <2. Other embodiments>
上述した各実施形態に係る処理は、上記各実施形態や変形例以外にも種々の異なる形態（変形例）にて実施されてよい。 The processing according to each of the above-described embodiments may be implemented in various different forms (variations) other than the above-described embodiments and variations.

＜２－１．その他の構成例＞ <2-1. Other configuration examples>
上記の映像生成システム１の構成は一例に過ぎず、映像生成システム１における機能の分割は任意の態様が採用可能である。すなわち、上述した構成は一例であり、上述した映像生成に関するサービスを提供可能であれば、映像生成システム１は、どのような機能の分割態様であってもよく、どのような構成であってもよい。例えば、映像生成システム１は、上述した処理を行う１つの装置（コンピュータ等）により構成されてもよい。この場合、映像生成システム１の１つの装置が、映像生成モジュール１００、情報取得モジュール２００、センサ部３００、及びクライアントＵＩ表示部４００の機能を有してもよい。例えば、映像生成システム１が提供する映像生成サービスは、ユーザが利用する端末装置（コンピュータ２０等）上で動作するツール（AI Assist Creation Tool）等のプログラムとしてユーザに提供されてもよい。 The above-described configuration of the image generation system 1 is merely an example, and any manner of division of functions in the image generation system 1 can be adopted. That is, the above-described configuration is merely an example, and the image generation system 1 may have any manner of division of functions and any configuration as long as it can provide the above-described service related to image generation. For example, the image generation system 1 may be configured by one device (such as a computer) that performs the above-described processing. In this case, one device of the image generation system 1 may have the functions of the image generation module 100, the information acquisition module 200, the sensor unit 300, and the client UI display unit 400. For example, the image generation service provided by the image generation system 1 may be provided to the user as a program such as a tool (AI Assist Creation Tool) that operates on a terminal device (such as the computer 20) used by the user.

＜２－２．その他＞ <2-2. Others>
また、上記各実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。例えば、各図に示した各種情報は、図示した情報に限られない。 Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically by a known method. In addition, the information including the processing procedures, specific names, various data and parameters shown in the above documents and drawings can be changed arbitrarily unless otherwise specified. For example, the various information shown in each drawing is not limited to the illustrated information.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。 In addition, each component of each device shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or part of them can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc.

また、上述してきた各実施形態及び変形例は、処理内容を矛盾させない範囲で適宜組み合わせることが可能である。 Furthermore, the above-mentioned embodiments and variations can be combined as appropriate to the extent that they do not cause any contradiction in the processing content.

また、本明細書に記載された効果はあくまで例示であって限定されるものでは無く、他の効果があってもよい。 Furthermore, the effects described in this specification are merely examples and are not limiting, and other effects may also be present.

＜３．本開示に係る効果＞ <3. Effects of the Present Disclosure>
上述のように、本開示に係る映像生成システム（実施形態では映像生成システム１）は、取得部（実施形態では入力テキスト取得部２１０）と、シナリオ生成部（実施形態ではシナリオ向け生成部１３１）と、コード生成部（実施形態では映像向け生成部１３２）と、動画データ取得部（実施形態では映像生成部１４０）とを備える。取得部は、動画生成に関する入力クエリをユーザから取得する。シナリオ生成部は、入力クエリに基づいて、動画生成に関するシナリオデータを生成する。コード生成部は、シナリオデータに基づいて、３Ｄデータを構成するためのコードを生成する。動画データ取得部は、コードに基づいて、動画データを取得する。 As described above, the video generation system (video generation system 1 in the embodiment) according to the present disclosure includes an acquisition unit (input text acquisition unit 210 in the embodiment), a scenario generation unit (scenario-oriented generation unit 131 in the embodiment), a code generation unit (video-oriented generation unit 132 in the embodiment), and a video data acquisition unit (video generation unit 140 in the embodiment). The acquisition unit acquires an input query related to video generation from a user. The scenario generation unit generates scenario data related to video generation based on the input query. The code generation unit generates code for constructing 3D data based on the scenario data. The video data acquisition unit acquires video data based on the code.

このように、本開示に係る映像生成システムは、ユーザからの入力クエリに基づいて生成したシナリオデータに基づいて、３Ｄデータを構成するためのコードを生成し、コードに基づいて、動画データを取得することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 In this way, the video generation system according to the present disclosure generates code for constructing 3D data based on scenario data generated based on a query input from a user, and obtains video data based on the code, thereby obtaining video data in response to a query input from a user.
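The four-stage flow summarized above (input query, scenario data, code for constructing 3D data, video data) can be sketched as a chain of functions. Everything in this sketch is a hypothetical stand-in: the AI models (such as model M1 for scenarios and model M3 for code) are replaced by deterministic placeholders, the "code" is a made-up scene script rather than code for any real 3D tool, and the renderer merely labels frames. It only illustrates how each stage consumes the output of the previous one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    cuts: List[str]  # one textual description per cut


@dataclass
class Video:
    frames: List[str]  # placeholder for rendered frames


def generate_scenario(query: str) -> Scenario:
    # Placeholder for the scenario generation unit: one cut per sentence.
    cuts = [s.strip() for s in query.split(".") if s.strip()]
    return Scenario(cuts=cuts)


def generate_code(scenario: Scenario) -> str:
    # Placeholder for the code generation unit: emit a made-up scene script.
    return "\n".join(
        f"add_cut({i}, {cut!r})" for i, cut in enumerate(scenario.cuts)
    )


def acquire_video(code: str) -> Video:
    # Placeholder for the video data acquisition unit: one frame label per script line.
    return Video(frames=[f"frame<{line}>" for line in code.splitlines()])


def generate_video(query: str) -> Video:
    # Acquisition unit -> scenario generation -> code generation -> video acquisition.
    scenario = generate_scenario(query)
    code = generate_code(scenario)
    return acquire_video(code)


video = generate_video("A robot wakes up. It walks outside.")
```

Keeping the generated code as an explicit intermediate artifact is what allows the later editing and evaluation steps to modify the code and re-acquire the video, rather than regenerating everything from the query.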

また、映像生成システムは、画質改善部（実施形態では映像リファイン部１４３）を備える。画質改善部は、動画データの画質を改善する画質改善処理を実行する。このように、映像生成システムは、動画データの画質を改善することにより、高品質な動画を取得することができる。 The video generation system also includes an image quality improvement unit (in this embodiment, an image refinement unit 143). The image quality improvement unit executes an image quality improvement process that improves the image quality of the video data. In this way, the video generation system can obtain high-quality videos by improving the image quality of the video data.

また、画質改善部は、テキストプロンプトに基づいて画質改善処理を実行することにより、動画データの画質を改善する。このように、映像生成システムは、テキストプロンプトに基づいて動画データの画質を改善することにより、高品質な動画を取得することができる。 The image quality improvement unit also improves the image quality of the video data by performing an image quality improvement process based on the text prompt. In this way, the video generation system can obtain high-quality video by improving the image quality of the video data based on the text prompt.

また、画質改善部は、動画データのうち画質改善処理の対象を指定するテキストプロンプトに基づいて、画質改善処理を実行する。このように、映像生成システムは、指定された対象について動画データの画質を改善することにより、高品質な動画を取得することができる。 The image quality improvement unit also performs image quality improvement processing based on a text prompt that specifies the target of the image quality improvement processing in the video data. In this way, the video generation system can obtain high-quality video by improving the image quality of the video data for the specified target.

また、画質改善部は、改善が必要と判断された対象を指定するテキストプロンプトに基づいて、画質改善処理を実行する。このように、映像生成システムは、改善が必要と判断された対象について動画データの画質を改善することにより、高品質な動画を取得することができる。 The image quality improvement unit also performs image quality improvement processing based on a text prompt that specifies the target determined to require improvement. In this way, the video generation system can obtain high-quality video by improving the image quality of the video data for the target determined to require improvement.

また、映像生成システムは、表示制御部（実施形態ではクライアントＵＩ表示部４００）を備える。表示制御部は、シナリオデータに基づいたストーリーボードを表示させる。このように、映像生成システムは、シナリオデータに基づいたストーリーボードを表示させることにより、ユーザにとって利便性が高い態様で情報を提示することができる。 The video generation system also includes a display control unit (client UI display unit 400 in this embodiment). The display control unit displays a storyboard based on the scenario data. In this way, the video generation system can present information in a manner that is highly convenient for the user by displaying a storyboard based on the scenario data.

また、ストーリーボードは、動画のカット毎に動画データを表示するように構成される。このように、映像生成システムは、ストーリーボードが動画のカット毎に動画データを表示するように構成されることにより、ユーザにとって利便性が高い態様で情報を提示することができる。 The storyboard is also configured to display video data for each cut of the video. In this way, the video generation system can present information in a manner that is highly convenient for the user, because the storyboard is configured to display video data for each cut of the video.

また、映像生成システムは、サウンド生成部（実施形態ではサウンド生成部１５０）を備える。サウンド生成部は、シナリオデータと動画データとに基づいて、動画データに対応するサウンドデータを生成する。このように、映像生成システムは、動画データに対応するサウンドデータを生成することで、音を含む動画を取得することができる。 The video generation system also includes a sound generation unit (sound generation unit 150 in this embodiment). The sound generation unit generates sound data corresponding to the video data based on the scenario data and the video data. In this way, the video generation system can obtain a video including sound by generating sound data corresponding to the video data.

また、映像生成システムは、テキスト生成部（実施形態ではテキスト／ロゴ生成部１６０）を備える。テキスト生成部は、シナリオデータに基づいて、動画データにより表示される映像上に表示させるテキストを示すテキストデータを生成する。このように、映像生成システムは、動画データに対応するテキストデータを生成することで、テキストを含む動画を取得することができる。 The video generation system also includes a text generation unit (in this embodiment, a text/logo generation unit 160). The text generation unit generates text data indicating text to be displayed on the video displayed by the video data, based on the scenario data. In this way, the video generation system can obtain a video including text by generating text data corresponding to the video data.

また、映像生成システムは、ロゴ生成部（実施形態ではテキスト／ロゴ生成部１６０）を備える。ロゴ生成部は、シナリオデータに基づいて、動画データにより表示される映像上に表示させるロゴを示すロゴデータを生成する。このように、映像生成システムは、動画データに対応するロゴデータを生成することで、ロゴを含む動画を取得することができる。 The video generation system also includes a logo generation unit (in this embodiment, a text/logo generation unit 160). The logo generation unit generates logo data indicating a logo to be displayed on a video displayed by the video data, based on the scenario data. In this way, the video generation system can obtain a video including a logo by generating logo data corresponding to the video data.

また、入力クエリは、テキスト、画像、音声、３Ｄデータのうち少なくとも１つを含む。このように、映像生成システムは、入力クエリがテキスト、画像、音声、３Ｄデータのうち少なくとも１つを含むことにより、ユーザからの入力クエリに応じて動画データを取得することができる。 The input query also includes at least one of text, images, audio, and 3D data. In this way, the video generation system can obtain video data in response to the input query from the user by the input query including at least one of text, images, audio, and 3D data.

また、映像生成システムは、第１の出力部（実施形態ではシナリオ向け生成部１３１）を備える。第１の出力部は、入力クエリに基づいて、シナリオデータを生成するためにシナリオ生成部が用いるシナリオ生成用情報を出力する。シナリオ生成部は、シナリオ生成用情報に基づいてシナリオデータを生成する。このように、映像生成システムは、入力クエリに基づいて生成されたシナリオ生成用情報に基づいてシナリオデータを生成することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 The video generation system also includes a first output unit (in this embodiment, a scenario-oriented generation unit 131). The first output unit outputs scenario generation information used by the scenario generation unit to generate scenario data based on the input query. The scenario generation unit generates scenario data based on the scenario generation information. In this way, the video generation system can acquire video data in response to an input query from a user by generating scenario data based on the scenario generation information generated based on the input query.

また、第１の出力部は、入力クエリに基づいて、シナリオデータを生成するための第１のプロンプトを、シナリオ生成用情報として生成する。シナリオ生成部は、第１のプロンプトに基づいてシナリオデータを生成する。このように、映像生成システムは、第１のプロンプトに基づいてシナリオデータを生成することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 The first output unit also generates a first prompt for generating scenario data based on the input query as scenario generation information. The scenario generation unit generates scenario data based on the first prompt. In this way, the video generation system can obtain video data in response to the input query from the user by generating scenario data based on the first prompt.

また、第１の出力部は、入力クエリに基づいて、シナリオデータを生成するための第１のモデルの入力として用いられる第１の入力情報をシナリオ生成用情報として生成する。シナリオ生成部は、入力クエリを用いて生成された第１の入力情報を、第１のモデルに入力し、第１のモデルにシナリオデータを出力させることにより、シナリオデータを生成する。このように、映像生成システムは、第１のモデルを用いてシナリオデータを生成することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 Furthermore, the first output unit generates, based on the input query, first input information used as an input of a first model for generating scenario data as scenario generation information. The scenario generation unit inputs the first input information generated using the input query to the first model and causes the first model to output scenario data, thereby generating scenario data. In this way, the video generation system can obtain video data in response to an input query from a user by generating scenario data using the first model.

また、映像生成システムは、第２の出力部（実施形態では映像向け生成部１３２）を備える。第２の出力部は、シナリオデータに基づいて、３Ｄデータを構成するためのコードを生成するためにコード生成部が用いるコード生成用情報を出力する。コード生成部は、コード生成用情報に基づいてコードを生成する。このように、映像生成システムは、シナリオデータに基づいて生成されたコード生成用情報に基づいてコードを生成することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 The video generation system also includes a second output unit (video-oriented generation unit 132 in this embodiment). The second output unit outputs, based on the scenario data, code generation information used by the code generation unit to generate code for constructing 3D data. The code generation unit generates code based on the code generation information. In this way, the video generation system can acquire video data in response to an input query from a user by generating code based on the code generation information generated based on the scenario data.

また、第２の出力部は、シナリオデータに基づいて、３Ｄデータを構成するためのコードを出力するための第２のプロンプトを、コード生成用情報として生成する。シナリオ生成部は、第２のプロンプトに基づいてコードを生成する。このように、映像生成システムは、第２のプロンプトに基づいてコードを生成することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 The second output unit also generates, as code generation information, a second prompt for outputting a code for constructing 3D data based on the scenario data. The scenario generation unit generates the code based on the second prompt. In this way, the video generation system can acquire video data in response to an input query from a user by generating a code based on the second prompt.

また、第２の出力部は、入力クエリに基づいて、コードを生成するための第２のモデルの入力として用いられる第２の入力情報をコード生成用情報として生成する。シナリオ生成部は、シナリオデータを用いて生成された第２の入力情報を、第２のモデルに入力し、第２のモデルにコードを出力させることにより、コードを生成する。このように、映像生成システムは、第２のモデルを用いてコードを生成することにより、ユーザからの入力クエリに応じて動画データを取得することができる。 Furthermore, the second output unit generates, based on the input query, second input information used as an input of a second model for generating code as information for code generation. The scenario generation unit generates code by inputting the second input information generated using the scenario data to the second model and causing the second model to output code. In this way, the video generation system can obtain video data in response to an input query from a user by generating code using the second model.

また、映像生成システムは、受付部（実施形態ではセンサ部３００）を備える。受付部は、ユーザから動画編集に関する操作を受け付ける。コード生成部は、操作に基づいた編集により、コードを生成する。このように、映像生成システムは、ユーザから動画編集の操作に応じてコードを生成することにより、ユーザの編集に対応する映像を適切に取得することができる。 The video generation system also includes a reception unit (sensor unit 300 in this embodiment). The reception unit receives operations related to video editing from a user. The code generation unit generates code from editing based on the operations. In this way, the video generation system can appropriately acquire video corresponding to the user's editing by generating code in response to the video editing operations from the user.

また、受付部は、センサを用いてモーションまたはカメラの動きを指定する操作を受け付ける。コード生成部は、操作が示すモーションまたはカメラの動きに対応するコードを生成する。このように、映像生成システムは、ユーザからモーションまたはカメラの動きを指定する操作に応じてコードを生成することにより、ユーザの編集に対応する映像を適切に取得することができる。 The receiving unit also receives an operation specifying a motion or camera movement using a sensor. The code generating unit generates a code corresponding to the motion or camera movement indicated by the operation. In this way, the video generation system can appropriately acquire video corresponding to the user's editing by generating a code in response to a user's operation specifying a motion or camera movement.

また、受付部は、動画データのうち、複数のカットを選択する操作を受け付ける。コード生成部は、操作が示す複数のカットに対応する部分が変更されたコードを生成する。このように、映像生成システムは、ユーザから複数のカットを選択するに応じてコードを生成することにより、ユーザの編集に対応する映像を適切に取得することができる。 The receiving unit also receives an operation to select multiple cuts from the video data. The code generating unit generates code in which the portions corresponding to the multiple cuts indicated by the operation have been modified. In this way, the video generating system can appropriately acquire video corresponding to the user's editing by generating code in response to the user selecting multiple cuts.

また、動画データの各カットには日付情報が対応付けられている。コード生成部は、動画データの各カットの日付情報に基づいて、操作が示す編集の内容を決定する。このように、映像生成システムは、動画データの各カットの日付情報に基づいて、編集の内容を決定することにより、ユーザの編集に対応する映像を適切に取得することができる。 Furthermore, date information is associated with each cut of the video data. The code generation unit determines the content of the edit indicated by the operation based on the date information of each cut of the video data. In this way, the video generation system can appropriately acquire video corresponding to the user's editing by determining the content of the edit based on the date information of each cut of the video data.

また、受付部は、動画データのうち、変更の対象とする対象物を指定する操作を受け付ける。コード生成部は、操作が示す対象物の３Ｄデータが変更されたコードを生成する。このように、映像生成システムは、ユーザから変更の対象として指定された３Ｄデータが変更されたコードを生成することにより、ユーザの編集に対応する映像を適切に取得することができる。 The receiving unit also receives an operation to specify an object of the video data to be changed. The code generating unit generates a code in which the 3D data of the object specified by the operation has been changed. In this way, the video generating system can appropriately acquire video corresponding to the user's edits by generating a code in which the 3D data specified by the user as the object to be changed has been changed.

また、映像生成システムは、評価部（実施形態では評価部１８０）を備える。評価部は、シナリオデータと動画データのうち少なくとも１つの評価を示す情報を生成する。このように、映像生成システムは、シナリオデータと動画データのうち少なくとも１つの評価を示す情報を生成することにより、生成した情報に対して評価を行うことができる。 The video generation system also includes an evaluation unit (evaluation unit 180 in this embodiment). The evaluation unit generates information indicating an evaluation of at least one of the scenario data and the video data. In this way, the video generation system can generate information indicating an evaluation of at least one of the scenario data and the video data, thereby evaluating the generated information.

また、コード生成部は、評価に基づいて、コードを生成する。このように、映像生成システムは、評価に基づいて、コードを生成することにより、評価に応じて適切に情報を取得することができる。 The code generation unit also generates a code based on the evaluation. In this way, the video generation system can appropriately acquire information according to the evaluation by generating a code based on the evaluation.

また、シナリオ生成部は、評価に基づいて、シナリオデータを生成する。このように、映像生成システムは、評価に基づいて、シナリオデータを生成することにより、評価に応じて適切に情報を取得することができる。 The scenario generation unit also generates scenario data based on the evaluation. In this way, the video generation system can acquire information appropriately according to the evaluation by generating scenario data based on the evaluation.

また、シナリオ生成部は、評価に基づき生成されたシナリオデータに基づいて、コードを生成する。このように、映像生成システムは、評価に基づき生成されたシナリオデータに基づいて、コードを生成することにより、評価に応じて適切に情報を取得することができる。 The scenario generation unit also generates a code based on the scenario data generated based on the evaluation. In this way, the video generation system can appropriately acquire information according to the evaluation by generating a code based on the scenario data generated based on the evaluation.
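The evaluation-driven regeneration described in the preceding paragraphs can be sketched as a simple loop: generate a candidate, evaluate it, and feed the evaluation back into the next generation. This is a sketch under stated assumptions; the function `refine`, the score threshold, the round limit, and the convention that the evaluator returns a numeric score plus textual feedback are all hypothetical, since the disclosure does not fix the form of the evaluation information.

```python
from typing import Callable, Tuple


def refine(generate: Callable[[str, str], str],
           evaluate: Callable[[str], Tuple[float, str]],
           query: str,
           threshold: float = 0.8,
           max_rounds: int = 3) -> Tuple[str, float]:
    """Regenerate until the evaluation score reaches the threshold or rounds run out.

    generate(query, feedback) stands in for scenario or code generation;
    evaluate(candidate) stands in for the evaluation unit.
    """
    feedback = ""
    best, best_score = "", float("-inf")
    for _ in range(max_rounds):
        candidate = generate(query, feedback)
        score, feedback = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break
    return best, best_score


# Toy stand-ins: each "+" of feedback improves the candidate by half a point.
def toy_generate(query: str, feedback: str) -> str:
    return query + feedback


def toy_evaluate(candidate: str) -> Tuple[float, str]:
    plus = candidate.count("+")
    return plus / 2, "+" * (plus + 1)


result = refine(toy_generate, toy_evaluate, "q")
```

Tracking the best candidate separately from the latest one means a later round that scores worse does not discard an earlier, better result.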

＜４．ハードウェア構成＞ <4. Hardware Configuration>
上述してきた各実施形態に係る映像生成モジュール１００、情報取得モジュール２００及びクライアントＵＩ表示部４００等を有する情報処理装置（情報機器）は、例えば図４１に示すような構成のコンピュータ１０００によって実現される。図４１は、情報処理装置の機能を実現するコンピュータ１０００の一例を示すハードウェア構成図である。以下、実施形態に係る映像生成モジュール１００を例に挙げて説明する。コンピュータ１０００は、ＣＰＵ１１００、ＲＡＭ１２００、ＲＯＭ（Read Only Memory）１３００、ＨＤＤ（Hard Disk Drive）１４００、通信インターフェイス１５００、及び入出力インターフェイス１６００を有する。コンピュータ１０００の各部は、バス１０５０によって接続される。 An information processing device (information equipment) having the image generation module 100, the information acquisition module 200, the client UI display unit 400, and the like according to each of the above-described embodiments is realized by a computer 1000 having a configuration as shown in FIG. 41, for example. FIG. 41 is a hardware configuration diagram showing an example of the computer 1000 that realizes the functions of the information processing device. The image generation module 100 according to the embodiment will be described below as an example. The computer 1000 has a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input/output interface 1600. The units of the computer 1000 are connected by a bus 1050.

ＣＰＵ１１００は、ＲＯＭ１３００又はＨＤＤ１４００に格納されたプログラムに基づいて動作し、各部の制御を行う。例えば、ＣＰＵ１１００は、ＲＯＭ１３００又はＨＤＤ１４００に格納されたプログラムをＲＡＭ１２００に展開し、各種プログラムに対応した処理を実行する。 The CPU 1100 operates based on the programs stored in the ROM 1300 or the HDD 1400 and controls each part. For example, the CPU 1100 loads the programs stored in the ROM 1300 or the HDD 1400 into the RAM 1200 and executes processes corresponding to the various programs.

ＲＯＭ１３００は、コンピュータ１０００の起動時にＣＰＵ１１００によって実行されるＢＩＯＳ（Basic Input Output System）等のブートプログラムや、コンピュータ１０００のハードウェアに依存するプログラム等を格納する。 The ROM 1300 stores boot programs such as the Basic Input Output System (BIOS) that is executed by the CPU 1100 when the computer 1000 starts up, as well as programs that depend on the hardware of the computer 1000.

ＨＤＤ１４００は、ＣＰＵ１１００によって実行されるプログラム、及び、かかるプログラムによって使用されるデータ等を非一時的に記録する、コンピュータが読み取り可能な記録媒体である。具体的には、ＨＤＤ１４００は、プログラムデータ１４５０の一例である本開示に係る映像生成プログラムを記録する記録媒体である。 The HDD 1400 is a computer-readable recording medium that non-transitorily records programs executed by the CPU 1100 and data used by such programs. Specifically, the HDD 1400 is a recording medium that records the image generation program according to the present disclosure, which is an example of program data 1450.

通信インターフェイス１５００は、コンピュータ１０００が外部ネットワーク１５５０（例えばインターネット）と接続するためのインターフェイスである。例えば、ＣＰＵ１１００は、通信インターフェイス１５００を介して、他の機器からデータを受信したり、ＣＰＵ１１００が生成したデータを他の機器へ送信したりする。 The communication interface 1500 is an interface for connecting the computer 1000 to an external network 1550 (e.g., the Internet). For example, the CPU 1100 receives data from other devices and transmits data generated by the CPU 1100 to other devices via the communication interface 1500.

入出力インターフェイス１６００は、入出力デバイス１６５０とコンピュータ１０００とを接続するためのインターフェイスである。例えば、ＣＰＵ１１００は、入出力インターフェイス１６００を介して、キーボードやマウス等の入力デバイスからデータを受信する。また、ＣＰＵ１１００は、入出力インターフェイス１６００を介して、ディスプレイやスピーカーやプリンタ等の出力デバイスにデータを送信する。また、入出力インターフェイス１６００は、所定の記録媒体（メディア）に記録されたプログラム等を読み取るメディアインターフェイスとして機能してもよい。メディアとは、例えばＤＶＤ（Digital Versatile Disc）、ＰＤ（Phase change rewritable Disk）等の光学記録媒体、ＭＯ（Magneto-Optical disk）等の光磁気記録媒体、テープ媒体、磁気記録媒体、または半導体メモリ等である。 The input/output interface 1600 is an interface for connecting the input/output device 1650 and the computer 1000. For example, the CPU 1100 receives data from an input device such as a keyboard or a mouse via the input/output interface 1600. The CPU 1100 also transmits data to an output device such as a display, a speaker, or a printer via the input/output interface 1600. The input/output interface 1600 may also function as a media interface for reading programs and the like recorded on a predetermined recording medium. The media may be, for example, optical recording media such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disc), magneto-optical recording media such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

例えば、コンピュータ１０００が実施形態に係る映像生成モジュール１００として機能する場合、コンピュータ１０００のＣＰＵ１１００は、ＲＡＭ１２００上にロードされた映像生成プログラムを実行することにより、制御部１３０１等の機能を実現する。また、ＨＤＤ１４００には、本開示に係る映像生成プログラムや、記憶部１３０２内のデータが格納される。なお、ＣＰＵ１１００は、プログラムデータ１４５０をＨＤＤ１４００から読み取って実行するが、他の例として、外部ネットワーク１５５０を介して、他の装置からこれらのプログラムを取得してもよい。 For example, when the computer 1000 functions as the image generation module 100 according to the embodiment, the CPU 1100 of the computer 1000 realizes the functions of the control unit 1301 and the like by executing the image generation program loaded onto the RAM 1200. The HDD 1400 also stores the image generation program according to the present disclosure and the data in the storage unit 1302. Note that the CPU 1100 reads the program data 1450 from the HDD 1400 and executes it, but as another example, these programs may be obtained from other devices via the external network 1550.

なお、本技術は以下のような構成も取ることができる。
（１）
動画生成に関する入力クエリをユーザから取得する取得部と、
前記入力クエリに基づいて、動画生成に関するシナリオデータを生成するシナリオ生成部と、
前記シナリオデータに基づいて、３Ｄデータを構成するためのコードを生成するコード生成部と、
前記コードに基づいて、動画データを取得する動画データ取得部と、
を備える映像生成システム。
（２）
前記動画データの画質を改善する画質改善処理を実行する画質改善部、
を更に備える（１）に記載の映像生成システム。
（３）
前記画質改善部は、
テキストプロンプトに基づいて前記画質改善処理を実行することにより、前記動画データの画質を改善する
（２）に記載の映像生成システム。
（４）
前記画質改善部は、
前記動画データのうち前記画質改善処理の対象を指定するテキストプロンプトに基づいて、前記画質改善処理を実行する
（２）または（３）に記載の映像生成システム。
（５）
前記画質改善部は、
改善が必要と判断された前記対象を指定するテキストプロンプトに基づいて、前記画質改善処理を実行する
（４）に記載の映像生成システム。
（６）
前記シナリオデータに基づいたストーリーボードを表示させる表示制御部、
を更に備える（１）～（５）のいずれか１つに記載の映像生成システム。
（７）
前記ストーリーボードは、動画のカット毎に前記動画データを表示するように構成される
（６）に記載の映像生成システム。
（８）
前記シナリオデータと前記動画データとに基づいて、前記動画データに対応するサウンドデータを生成するサウンド生成部、
を更に備える（１）～（７）のいずれか１つに記載の映像生成システム。
（９）
前記シナリオデータに基づいて、前記動画データにより表示される映像上に表示させるテキストを示すテキストデータを生成するテキスト生成部、
を更に備える（１）～（８）のいずれか１つに記載の映像生成システム。
（１０）
前記シナリオデータに基づいて、前記動画データにより表示される映像上に表示させるロゴを示すロゴデータを生成するロゴ生成部、
を更に備える（１）～（９）のいずれか１つに記載の映像生成システム。
（１１）
前記入力クエリは、テキスト、画像、音声、３Ｄデータのうち少なくとも１つを含む
（１）～（１０）のいずれか１つに記載の映像生成システム。
（１２）
前記入力クエリに基づいて、前記シナリオデータを生成するために前記シナリオ生成部が用いるシナリオ生成用情報を出力する第１の出力部、
を更に備え、
前記シナリオ生成部は、
前記シナリオ生成用情報に基づいて前記シナリオデータを生成する
（１）～（１１）のいずれか１つに記載の映像生成システム。
（１３）
前記第１の出力部は、
前記入力クエリに基づいて、前記シナリオデータを生成するための第１のプロンプトを、前記シナリオ生成用情報として生成し、
前記シナリオ生成部は、
前記第１のプロンプトに基づいて前記シナリオデータを生成する
（１２）に記載の映像生成システム。
（１４）
前記第１の出力部は、
前記入力クエリに基づいて、前記シナリオデータを生成するための第１のモデルの入力として用いられる第１の入力情報を前記シナリオ生成用情報として生成し、
前記シナリオ生成部は、
前記入力クエリを用いて生成された前記第１の入力情報を、前記第１のモデルに入力し、前記第１のモデルに前記シナリオデータを出力させることにより、前記シナリオデータを生成する
（１２）または（１３）に記載の映像生成システム。
（１５）
前記シナリオデータに基づいて、前記３Ｄデータを構成するためのコードを生成するために前記コード生成部が用いるコード生成用情報を出力する第２の出力部、
を更に備え、
前記コード生成部は、
前記コード生成用情報に基づいて前記コードを生成する
（１）～（１４）のいずれか１つに記載の映像生成システム。
（１６）
前記第２の出力部は、
前記シナリオデータに基づいて、前記３Ｄデータを構成するためのコードを出力するための第２のプロンプトを、前記コード生成用情報として生成し、
前記シナリオ生成部は、
前記第２のプロンプトに基づいて前記コードを生成する
（１５）に記載の映像生成システム。
（１７）
前記第２の出力部は、
前記入力クエリに基づいて、前記コードを生成するための第２のモデルの入力として用いられる第２の入力情報を前記コード生成用情報として生成し、
前記シナリオ生成部は、
前記シナリオデータを用いて生成された前記第２の入力情報を、前記第２のモデルに入力し、前記第２のモデルに前記コードを出力させることにより、前記コードを生成する
（１５）または（１６）に記載の映像生成システム。
（１８）
ユーザから動画編集に関する操作を受け付ける受付部、
を更に備え、
前記コード生成部は、
前記操作に基づいた編集により、前記コードを生成する
（１）～（１７）のいずれか１つに記載の映像生成システム。
（１９）
前記受付部は、
センサを用いてモーションまたはカメラの動きを指定する前記操作を受け付け、
前記コード生成部は、
前記操作が示す前記モーションまたは前記カメラの動きに対応する前記コードを生成する
（１８）に記載の映像生成システム。
（２０）
前記受付部は、
前記動画データのうち、複数のカットを選択する前記操作を受け付け、
前記コード生成部は、
前記操作が示す前記複数のカットに対応する部分が変更された前記コードを生成する
（１８）または（１９）に記載の映像生成システム。
（２１）
前記動画データの各カットには日付情報が対応付けられており、
前記コード生成部は、
前記動画データの各カットの日付情報に基づいて、前記操作が示す前記編集の内容を決定する
（１８）～（２０）のいずれか１つに記載の映像生成システム。
（２２）
前記受付部は、
前記動画データのうち、変更の対象とする対象物を指定する前記操作を受け付け、
前記コード生成部は、
前記操作が示す前記対象物の３Ｄデータが変更された前記コードを生成する
（１８）～（２１）のいずれか１つに記載の映像生成システム。
（２３）
前記３Ｄデータは、複数のデータセットを含み、
前記コード生成部は、
前記操作に基づいた編集により、前記複数のデータセットのうち少なくとも１つに対応する前記コードを生成する
（１８）～（２２）のいずれか１つに記載の映像生成システム。
（２４）
前記コード生成部は、
前記操作が示す編集内容に応じて、前記複数のデータセットのうち一部を更新する処理と前記複数のデータセット全体を更新する処理とのうちのいずれかを実行する
（２３）に記載の映像生成システム。
（２５）
前記シナリオデータと前記動画データのうち少なくとも１つの評価を示す情報を生成する評価部、
を更に備える（１）～（２４）のいずれか１つに記載の映像生成システム。
（２６）
前記コード生成部は、
前記評価に基づいて、前記コードを生成する
（２５）に記載の映像生成システム。
（２７）
前記シナリオ生成部は、
前記評価に基づいて、前記シナリオデータを生成する
（２５）または（２６）に記載の映像生成システム。
（２８）
前記コード生成部は、
前記評価に基づき生成された前記シナリオデータに基づいて、前記コードを生成する
（２７）に記載の映像生成システム。
（２９）
動画生成に関する入力クエリをユーザから取得することと、
前記入力クエリに基づいて、動画生成に関するシナリオデータを生成することと、
前記シナリオデータに基づいて、３Ｄデータを構成するためのコードを生成することと、
前記コードに基づいて、動画データを取得することと
を含む映像生成方法。
（３０）
動画生成に関する入力クエリをユーザから取得することと、
前記入力クエリに基づいて、動画生成に関するシナリオデータを生成することと、
前記シナリオデータに基づいて、３Ｄデータを構成するためのコードを生成することと、
前記コードに基づいて、動画データを取得することと
をコンピュータに実行させる映像生成プログラム。
The present technology can also be configured as follows.
(1)
An image generation system comprising:
an acquisition unit that acquires an input query regarding video generation from a user;
a scenario generation unit that generates scenario data related to video generation based on the input query;
a code generation unit that generates code for constructing 3D data based on the scenario data; and
a video data acquisition unit that acquires video data based on the code.
(2)
The image generation system according to (1), further comprising
an image quality improvement unit that executes an image quality improvement process to improve the image quality of the video data.
(3)
The image generation system according to (2), wherein
the image quality improvement unit improves the image quality of the video data by executing the image quality improvement process based on a text prompt.
(4)
The image generation system according to (2) or (3), wherein
the image quality improvement unit executes the image quality improvement process based on a text prompt that specifies a target of the image quality improvement process in the video data.
(5)
The image generation system according to (4), wherein
the image quality improvement unit executes the image quality improvement process based on a text prompt that specifies the target determined to require improvement.
(6)
The image generation system according to any one of (1) to (5), further comprising
a display control unit that displays a storyboard based on the scenario data.
(7)
The image generation system according to (6), wherein the storyboard is configured to display the video data for each cut of the video.
(8)
The image generation system according to any one of (1) to (7), further comprising
a sound generation unit that generates sound data corresponding to the video data based on the scenario data and the video data.
(9)
The image generation system according to any one of (1) to (8), further comprising
a text generation unit that generates, based on the scenario data, text data indicating text to be displayed on the video displayed by the video data.
(10)
a logo generating unit that generates logo data indicating a logo to be displayed on a video image displayed based on the moving image data, based on the scenario data;
The image generation system according to any one of (1) to (9), further comprising:
(11)
The image generation system according to any one of (1) to (10), wherein the input query includes at least one of text, image, audio, and 3D data.
(12)
The image generation system according to any one of (1) to (11), further comprising a first output unit that outputs scenario generation information used by the scenario generation unit to generate the scenario data based on the input query, wherein the scenario generation unit generates the scenario data based on the scenario generation information.
(13)
The image generation system according to (12), wherein the first output unit generates, as the scenario generation information, a first prompt for generating the scenario data based on the input query, and the scenario generation unit generates the scenario data based on the first prompt.
(14)
The image generation system according to (12) or (13), wherein the first output unit generates, as the scenario generation information and based on the input query, first input information to be used as an input of a first model for generating the scenario data, and the scenario generation unit generates the scenario data by inputting the first input information generated using the input query into the first model and causing the first model to output the scenario data.
(15)
The image generation system according to any one of (1) to (14), further comprising a second output unit that outputs code generation information used by the code generation unit to generate a code for configuring the 3D data based on the scenario data, wherein the code generation unit generates the code based on the code generation information.
(16)
The image generation system according to (15), wherein the second output unit generates, as the code generation information, a second prompt for outputting a code for configuring the 3D data based on the scenario data, and the scenario generation unit generates the code based on the second prompt.
(17)
The image generation system according to (15) or (16), wherein the second output unit generates, as the code generation information and based on the input query, second input information to be used as an input of a second model for generating the code, and the scenario generation unit generates the code by inputting the second input information generated using the scenario data into the second model and causing the second model to output the code.
(18)
The image generation system according to any one of (1) to (17), further comprising a reception unit that receives an operation related to video editing from a user, wherein the code generation unit generates the code by editing based on the operation.
(19)
The image generation system according to (18), wherein the reception unit accepts the operation specifying a motion or camera movement using a sensor, and the code generation unit generates the code corresponding to the motion or camera movement indicated by the operation.
(20)
The image generation system according to (18) or (19), wherein the reception unit accepts the operation of selecting a plurality of cuts from the video data, and the code generation unit generates the code in which the portions corresponding to the plurality of cuts indicated by the operation have been changed.
(21)
The image generation system according to any one of (18) to (20), wherein date information is associated with each cut of the video data, and the code generation unit determines the content of the editing indicated by the operation based on the date information of each cut of the video data.
(22)
The image generation system according to any one of (18) to (21), wherein the reception unit accepts the operation of designating an object to be changed from among the video data, and the code generation unit generates the code in which 3D data of the object indicated by the operation has been changed.
(23)
The image generation system according to any one of (18) to (22), wherein the 3D data comprises a plurality of data sets, and the code generation unit generates, by editing based on the operation, the code corresponding to at least one of the plurality of data sets.
(24)
The image generation system according to (23), wherein the code generation unit executes, depending on the editing content indicated by the operation, either a process for updating a part of the plurality of data sets or a process for updating the entirety of the plurality of data sets.
(25)
an evaluation unit that generates information indicating an evaluation of at least one of the scenario data and the video data;
The image generation system according to any one of (1) to (24), further comprising:
(26)
The image generation system according to (25), wherein the code generation unit generates the code based on the evaluation.
(27)
The image generation system according to (25) or (26), wherein the scenario generation unit generates the scenario data based on the evaluation.
(28)
The image generation system according to (27), wherein the code generation unit generates the code based on the scenario data generated based on the evaluation.
(29)
Obtaining an input query for video generation from a user;
generating scenario data related to video generation based on the input query;
generating a code for constructing 3D data based on the scenario data; and
acquiring video data based on the code.
(30)
Obtaining an input query for video generation from a user;
generating scenario data related to video generation based on the input query;
generating a code for constructing 3D data based on the scenario data; and
acquiring video data based on the code.
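
As a non-limiting illustration, the flow recited in (1) and (29) (input query, scenario data, code, video data) can be sketched in Python as follows. Everything in this sketch is an assumption made for illustration: the names `Cut`, `generate_scenario`, `generate_code`, `acquire_video`, and `generate_video`, the one-line-per-cut code format, and the frame labels are hypothetical and do not appear in the disclosure. Simple string handling stands in for the generative models and the renderer that an actual system would presumably use.

```python
from dataclasses import dataclass


@dataclass
class Cut:
    """One cut of the scenario (hypothetical structure)."""
    description: str
    duration_s: float


def generate_scenario(input_query: str) -> list[Cut]:
    # Stand-in for the scenario generation unit: each sentence of the
    # query becomes one cut with a fixed, assumed duration.
    sentences = [s.strip() for s in input_query.split(".") if s.strip()]
    return [Cut(description=s, duration_s=2.0) for s in sentences]


def generate_code(scenario: list[Cut]) -> str:
    # Stand-in for the code generation unit: emit editable text with one
    # scene definition per cut. The format is invented for this sketch.
    lines = []
    for index, cut in enumerate(scenario):
        lines.append(
            f'scene "cut_{index}" duration={cut.duration_s} # {cut.description}'
        )
    return "\n".join(lines)


def acquire_video(code: str, fps: int = 24) -> list[str]:
    # Stand-in for the video data acquisition unit: "render" each scene
    # line into frame labels instead of real images.
    frames = []
    for line in code.splitlines():
        name = line.split('"')[1]
        duration = float(line.split("duration=")[1].split()[0])
        frames.extend(f"{name}:frame{i}" for i in range(int(duration * fps)))
    return frames


def generate_video(input_query: str) -> list[str]:
    # Query -> scenario data -> code -> video data.
    scenario = generate_scenario(input_query)
    code = generate_code(scenario)
    return acquire_video(code)
```

Because the intermediate code is plain text in this sketch, it can be edited before rendering. For example, replacing `duration=2.0` with `duration=1.0` in the output of `generate_code` halves the number of frames returned by `acquire_video`.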

1 Video generation system
100 Video generation module
110 Input text analysis unit
120 Sensor analysis unit
130 Prompt etc. generation unit
131 Generation unit for scenarios
132 Generation unit for video
133 Generation unit for sound
134 Generation unit for text/logo
140 Video generation unit
141 USD generation unit
142 Rendering unit
143 Video refinement unit
150 Sound generation unit
160 Text/logo generation unit
170 Composite editing unit
180 Evaluation unit
190 Client UI module
200 Information acquisition module
210 Input text acquisition unit
220 Sensor acquisition unit
300 Sensor unit
400 Client UI display unit

Claims

An acquisition unit that acquires an input query regarding video generation from a user;
a scenario generation unit that generates scenario data related to video generation based on the input query;
a code generation unit that generates code, which is an editable computer program, for configuring 3D data and defining at least one of a scene and an object based on the scenario data;
a video data acquisition unit that acquires video data based on the code;
An image generation system comprising:

an image quality improvement unit that executes an image quality improvement process to improve the image quality of the video data;
The image generation system of claim 1, further comprising:

The image generation system of claim 2, wherein the image quality improvement unit executes the image quality improvement process on an image of the video data based on a text prompt.

The image generation system according to claim 2, wherein the image quality improvement unit executes the image quality improvement process based on a text prompt that specifies a target of the image quality improvement process in the video data.

The image generation system of claim 4, wherein the image quality improvement unit executes the image quality improvement process based on a text prompt that specifies an object determined to require improvement.

a display control unit that displays a storyboard based on the scenario data;
The image generation system of claim 1, further comprising:

The image generation system according to claim 6, wherein the storyboard is configured to display the video data for each cut of the video.

a sound generating unit that generates sound data corresponding to the video data based on the scenario data and the video data;
The image generation system of claim 1, further comprising:

a text generating unit that generates text data indicating text to be displayed on a video displayed based on the video data, based on the scenario data;
The image generation system of claim 1, further comprising:

a logo generating unit that generates logo data indicating a logo to be displayed on a video image displayed based on the moving image data, based on the scenario data;
The image generation system of claim 1, further comprising:

The image generation system of claim 1, wherein the input query includes at least one of text, image, audio, and 3D data.

The image generation system according to claim 1, further comprising a first output unit that outputs scenario generation information used by the scenario generation unit to generate the scenario data based on the input query, wherein the scenario generation unit generates the scenario data based on the scenario generation information.

The image generation system of claim 12, wherein the first output unit generates, as the scenario generation information, a first prompt for generating the scenario data based on the input query, and the scenario generation unit generates the scenario data based on the first prompt.

The image generation system according to claim 12, wherein the first output unit generates, as the scenario generation information and based on the input query, first input information to be used as an input of a first model for generating the scenario data, and the scenario generation unit generates the scenario data by inputting the first input information generated using the input query into the first model and causing the first model to output the scenario data.

The image generation system according to claim 1, further comprising a second output unit that outputs code generation information used by the code generation unit to generate a code for configuring the 3D data based on the scenario data, wherein the code generation unit generates the code based on the code generation information.

The image generation system of claim 15, wherein the second output unit generates, as the code generation information, a second prompt for outputting a code for configuring the 3D data based on the scenario data, and the scenario generation unit generates the code based on the second prompt.

The image generation system according to claim 15, wherein the second output unit generates, as the code generation information and based on the input query, second input information to be used as an input of a second model for generating the code, and the scenario generation unit generates the code by inputting the second input information generated using the scenario data into the second model and causing the second model to output the code.

The image generation system according to claim 1, further comprising a reception unit that receives an operation related to video editing from a user, wherein the code generation unit generates the code by editing based on the operation.

The image generation system of claim 18, wherein the reception unit accepts the operation specifying a motion or camera movement using a sensor, and the code generation unit generates the code corresponding to the motion or camera movement indicated by the operation.

The image generation system according to claim 18, wherein the reception unit accepts the operation of selecting a plurality of cuts from the video data, and the code generation unit generates the code in which portions corresponding to the plurality of cuts indicated by the operation have been changed.

The image generation system according to claim 18, wherein date information is associated with each cut of the video data, and the code generation unit determines the content of the editing indicated by the operation based on the date information of each cut of the video data.

The image generation system according to claim 18, wherein the reception unit accepts the operation of designating an object to be changed from among the video data, and the code generation unit generates the code in which 3D data of the object indicated by the operation has been changed.

The image generation system of claim 18, wherein the 3D data comprises a plurality of data sets, and the code generation unit generates, by editing based on the operation, the code corresponding to at least one of the plurality of data sets.

The image generation system according to claim 23, wherein the code generation unit executes, depending on editing content indicated by the operation, either a process for updating a part of the plurality of data sets or a process for updating all of the plurality of data sets.

an evaluation unit that generates information indicating an evaluation of at least one of the scenario data and the video data;
The image generation system of claim 1, further comprising:

The image generation system of claim 25, wherein the code generation unit generates the code based on the evaluation.

The image generation system according to claim 25, wherein the scenario generation unit generates the scenario data based on the evaluation.

The image generation system according to claim 27, wherein the code generation unit generates the code based on the scenario data generated based on the evaluation.

Obtaining an input query for video generation from a user;
generating scenario data related to video generation based on the input query;
generating, based on the scenario data, code that is an editable computer program for constructing 3D data and defining at least one of a scene and an object; and
acquiring video data based on the code.

Obtaining an input query for video generation from a user;
generating scenario data related to video generation based on the input query;
generating, based on the scenario data, code that is an editable computer program for constructing 3D data and defining at least one of a scene and an object; and
acquiring video data based on the code.
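
As a non-limiting illustration of claims 23 and 24, the choice between updating a part of the data sets and updating all of them can be sketched in Python as follows. The sketch assumes, purely for illustration, that each data set is the code text for one cut keyed by a cut index, and that an edit naming no cuts applies to the whole video. The names `EditOperation` and `apply_edit` are hypothetical, and appending a comment line stands in for regenerating the code.

```python
from dataclasses import dataclass, field


@dataclass
class EditOperation:
    """A user editing operation (hypothetical shape)."""
    instruction: str
    # Cut indices the edit applies to; empty means the whole video (assumed).
    target_cuts: list[int] = field(default_factory=list)


def apply_edit(
    data_sets: dict[int, str], edit: EditOperation
) -> tuple[dict[int, str], list[int]]:
    # Partial update when the edit names specific cuts, full update otherwise.
    targets = edit.target_cuts or sorted(data_sets)
    updated = dict(data_sets)  # leave the caller's data sets untouched
    for cut in targets:
        if cut not in updated:
            raise KeyError(f"unknown cut {cut}")
        # Stand-in for regenerating the code of this data set.
        updated[cut] = updated[cut] + f"\n# edit: {edit.instruction}"
    return updated, targets
```

Returning the list of touched cuts lets a caller re-render only those cuts after a partial update, while a full update returns every cut index.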