
CN114827886A - Audio generation method and device, electronic equipment and storage medium - Google Patents

Audio generation method and device, electronic equipment and storage medium

Info

Publication number
CN114827886A
Authority
CN
China
Prior art keywords
audio
signal
information
dimensional
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210448723.2A
Other languages
Chinese (zh)
Inventor
陈联武
郑羲光
范欣悦
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210448723.2A
Publication of CN114827886A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01 Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present disclosure provides an audio generation method, an audio generation apparatus, an electronic device, and a storage medium. The audio generation method may include: acquiring an audio signal to be processed; obtaining a plurality of track signals for a plurality of sound sources by performing track separation on the audio signal to be processed; determining user orientation information in a three-dimensional space and generating three-dimensional spatial metadata corresponding to each of the plurality of track signals; and generating a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the plurality of separated track signals, and the three-dimensional spatial metadata corresponding to each track signal. The method and apparatus can generate three-dimensional audio that is closer to genuinely three-dimensional audio, increasing the sense of immersion of the three-dimensional audio and improving the user experience.

Description

Audio generation method, apparatus, electronic device, and storage medium

Technical Field

The present disclosure relates to the technical field of audio processing, and in particular to an audio generation method, an audio generation apparatus, an electronic device, a storage medium, and a program product.

Background

Three-dimensional (3D) audio refers to audio whose signals are distributed across three-dimensional space and played back dynamically. For example, as a piece of music plays, the vocals may move from far to near in three-dimensional space, while drums, bass, and other instruments jump back and forth between different locations in that space.

However, mainstream audio today is mostly in two-channel stereo or single-channel format, while multi-channel and 3D audio content is relatively scarce. Automatically converting traditional stereo audio and the like into 3D audio is therefore an important direction for the development of 3D audio.

Summary

The present disclosure provides an audio generation method, an audio generation apparatus, an electronic device, and a storage medium that address at least the problems mentioned above.

According to a first aspect of the embodiments of the present disclosure, an audio generation method is provided. The method may include: acquiring an audio signal to be processed; obtaining a plurality of track signals for a plurality of sound sources by performing track separation on the audio signal to be processed; determining user orientation information in a three-dimensional space and generating three-dimensional spatial metadata corresponding to each of the plurality of track signals, wherein the three-dimensional spatial metadata includes orientation information and sound width information of the corresponding sound source in the three-dimensional space; and generating, based on the user orientation information, the plurality of separated track signals, and the three-dimensional spatial metadata corresponding to each track signal, a three-dimensional audio signal corresponding to the audio signal to be processed.
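The four steps above can be sketched as a small pipeline. This is only a hedged illustration: `generate_3d_audio` and the stub separator, metadata, and renderer callables below are hypothetical names introduced here, not an implementation from the disclosure.

```python
import numpy as np

def generate_3d_audio(mixture, separate, make_metadata, render, listener=(0.5, 0.5, 0.5)):
    """Run the four steps: `mixture` is the acquired signal, `separate`
    performs track separation, `make_metadata` attaches 3D spatial metadata
    to each track, and `render` produces the final 3D audio."""
    tracks = separate(mixture)                                                   # step 2
    metadata = {name: make_metadata(name, sig) for name, sig in tracks.items()}  # step 3
    return render(tracks, metadata, listener)                                    # step 4

# Trivial stand-ins, for illustration only: the two stereo channels are
# treated as two "sources", metadata is a fixed front position, and the
# renderer just pairs each track with its metadata.
mixture = np.zeros((2, 1000))
result = generate_3d_audio(
    mixture,
    separate=lambda m: {"vocals": m[0], "accompaniment": m[1]},
    make_metadata=lambda name, sig: {"position": (0.5, 0.8, 0.5), "width": 0.1},
    render=lambda tracks, md, listener: {n: (s, md[n]) for n, s in tracks.items()},
)
```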

As an example, generating the three-dimensional spatial metadata corresponding to each of the plurality of track signals may include: acquiring feature information of the audio signal to be processed, wherein the feature information includes at least one of beat information and structure information of the audio signal to be processed, and the structure information includes type information of each audio segment of the audio signal to be processed; and generating the three-dimensional spatial metadata corresponding to each track signal based on the feature information.

As an example, generating the three-dimensional spatial metadata corresponding to each track signal based on the feature information may include: for a vocal signal among the plurality of track signals, determining, according to the type of each audio segment in the structure information, position adjustment information of the sound source corresponding to the vocal signal relative to the user orientation information in the three-dimensional space, and determining the three-dimensional spatial metadata of the vocal signal based on the position adjustment information.

As an example, generating the three-dimensional spatial metadata corresponding to each track signal based on the feature information may include: for an instrument signal among the plurality of track signals, determining, according to the beat information and the type of each audio segment in the structure information, movement information of the sound source corresponding to the instrument signal in the three-dimensional space, and determining the three-dimensional spatial metadata of the instrument signal based on the movement information.

As an example, generating the three-dimensional spatial metadata corresponding to each of the plurality of track signals may include: determining, from a plurality of preset templates, a preset template corresponding to each track signal, wherein the preset template includes at least one of movement trajectory information, movement speed information, and sound width variation information of the corresponding sound source in the three-dimensional space; and generating the three-dimensional spatial metadata for each track signal using the determined preset template.
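A preset template of this kind could, for instance, describe a circular orbit at a given speed plus a sound-width ramp, and then be expanded into per-frame metadata. The template names, fields, and frame rate below are assumptions for illustration; the disclosure does not fix a concrete template format.

```python
import numpy as np

# Hypothetical preset templates: trajectory (orbit radius), movement speed
# (revolutions per second), and a start/end sound width.
TEMPLATES = {
    "orbit":  {"radius": 0.4, "speed_hz": 0.25, "width": (0.1, 0.1)},
    "static": {"radius": 0.0, "speed_hz": 0.0,  "width": (0.2, 0.2)},
}

def metadata_from_template(name, duration_s, frame_rate=50, center=(0.5, 0.5, 0.5)):
    """Expand a template into per-frame (x, y, z) positions and sound widths."""
    tpl = TEMPLATES[name]
    n = int(duration_s * frame_rate)
    phase = 2.0 * np.pi * tpl["speed_hz"] * (np.arange(n) / frame_rate)
    pos = np.stack([center[0] + tpl["radius"] * np.cos(phase),
                    center[1] + tpl["radius"] * np.sin(phase),
                    np.full(n, center[2])], axis=1)
    w0, w1 = tpl["width"]
    return {"position": pos, "width": np.linspace(w0, w1, n)}
```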

As an example, generating the three-dimensional spatial metadata corresponding to each of the plurality of track signals may include: acquiring setting information input by a user, wherein the setting information includes at least one of a movement trajectory, a movement speed, and a sound width variation value in the three-dimensional space for each sound source corresponding to the plurality of track signals; and generating the three-dimensional spatial metadata for each track signal based on the setting information.

As an example, generating the three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the plurality of separated track signals, and the three-dimensional spatial metadata corresponding to each track signal may include: identifying the type of playback device used to play the three-dimensional audio signal; acquiring a rendering strategy corresponding to the type of the playback device; and generating, through the rendering strategy, the three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the plurality of separated track signals, and the three-dimensional spatial metadata corresponding to each track signal.

As an example, generating the three-dimensional audio signal corresponding to the audio signal to be processed through the rendering strategy may include: when the playback device is an in-ear playback device, generating, for each track signal, a three-dimensional audio signal corresponding to the track signal based on the orientation information corresponding to each audio frame of the track signal and the user orientation information, and adjusting the sound width of the three-dimensional audio signal corresponding to the track signal based on the sound width information of each audio frame.

As an example, generating the three-dimensional audio signal corresponding to the audio signal to be processed through the rendering strategy may include: when the playback device is a loudspeaker playback device, rendering each track signal based on the orientation information of the sound source corresponding to the track signal and the orientation information of a plurality of loudspeakers, so as to generate a three-dimensional audio signal corresponding to the track signal, and adjusting the sound width of the three-dimensional audio signal corresponding to the track signal based on the sound width information of the sound source.
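For a concrete intuition of rendering from the source and loudspeaker orientations, one classical option for a stereo loudspeaker pair is tangent-law amplitude panning. The disclosure does not name a specific panning law, so the following is only a hedged sketch under that assumption.

```python
import math

def pan_gains(source_az_deg, speaker_half_angle_deg=30.0):
    """Tangent-law stereo panning: returns (left, right) gains, normalized
    to constant power, for a source azimuth between the two loudspeakers."""
    az = max(-speaker_half_angle_deg, min(speaker_half_angle_deg, source_az_deg))
    r = math.tan(math.radians(az)) / math.tan(math.radians(speaker_half_angle_deg))
    g_left, g_right = 1.0 - r, 1.0 + r        # r in [-1, 1]; positive azimuth pans right
    norm = math.hypot(g_left, g_right)
    return g_left / norm, g_right / norm
```

A larger reported sound width could then be approximated by spreading the source over several adjacent azimuths and summing the per-azimuth gains.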

According to a second aspect of the embodiments of the present disclosure, an audio generation apparatus is provided. The apparatus may include: an acquisition module configured to acquire an audio signal to be processed; a track separation module configured to obtain a plurality of track signals for a plurality of sound sources by performing track separation on the audio signal to be processed; a metadata generation module configured to determine user orientation information in a three-dimensional space and generate three-dimensional spatial metadata corresponding to each of the plurality of track signals, wherein the three-dimensional spatial metadata includes orientation information and sound width information of the corresponding sound source in the three-dimensional space; and a rendering module configured to generate a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the plurality of separated track signals, and the three-dimensional spatial metadata corresponding to each track signal.

As an example, the metadata generation module may be configured to: acquire feature information of the audio signal to be processed, wherein the feature information includes at least one of beat information and structure information of the audio signal to be processed, and the structure information includes type information of each audio segment of the audio signal to be processed; and generate the three-dimensional spatial metadata corresponding to each track signal based on the feature information.

As an example, the metadata generation module may be configured to: for a vocal signal among the plurality of track signals, determine, according to the type of each audio segment in the structure information, position adjustment information of the sound source corresponding to the vocal signal relative to the user orientation information in the three-dimensional space, and determine the three-dimensional spatial metadata of the vocal signal based on the position adjustment information.

As an example, the metadata generation module may be configured to: for an instrument signal among the plurality of track signals, determine, according to the beat information and the type of each audio segment in the structure information, movement information of the sound source corresponding to the instrument signal in the three-dimensional space, and determine the three-dimensional spatial metadata of the instrument signal based on the movement information.

As an example, the metadata generation module may be configured to: determine, from a plurality of preset templates, a preset template corresponding to each track signal, wherein the preset template includes at least one of movement trajectory information, movement speed information, and sound width variation information of the corresponding sound source in the three-dimensional space; and generate the three-dimensional spatial metadata for each track signal using the determined preset template.

As an example, the metadata generation module may be configured to: acquire setting information input by a user, wherein the setting information includes at least one of a movement trajectory, a movement speed, and a sound width variation value in the three-dimensional space for each sound source corresponding to the plurality of track signals; and generate the three-dimensional spatial metadata for each track signal based on the setting information.

As an example, the rendering module may be configured to: identify the type of playback device used to play the three-dimensional audio signal; acquire a rendering strategy corresponding to the type of the playback device; and generate, through the rendering strategy, the three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the plurality of separated track signals, and the three-dimensional spatial metadata corresponding to each track signal.

As an example, the rendering module may be configured to: when the playback device is an in-ear playback device, generate, for each track signal, a three-dimensional audio signal corresponding to the track signal based on the orientation information corresponding to each audio frame of the track signal and the user orientation information, and adjust the sound width of the three-dimensional audio signal corresponding to the track signal based on the sound width information of each audio frame.

As an example, the rendering module may be configured to: when the playback device is a loudspeaker playback device, render each track signal based on the orientation information of the sound source corresponding to the track signal and the orientation information of a plurality of loudspeakers, so as to generate a three-dimensional audio signal corresponding to the track signal, and adjust the sound width of the three-dimensional audio signal corresponding to the track signal based on the sound width information of the sound source.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device may include: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio generation method described above.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the audio generation method described above.

According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, wherein instructions in the computer program product are executed by at least one processor in an electronic device to perform the audio generation method described above.

The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:

By separating track signals of multiple individual sound sources from the audio to be processed and assigning these sources a variety of trajectories in three-dimensional space, the audio to be processed is converted into three-dimensional audio that is closer to genuinely three-dimensional audio, which increases the sense of immersion of the three-dimensional audio and improves the user experience.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.

Brief Description of the Drawings

The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.

Fig. 1 is a flowchart of an audio generation method according to an embodiment of the present disclosure;

Figs. 2 and 3 are schematic diagrams of a virtual three-dimensional space according to an embodiment of the present disclosure;

Fig. 4 is a schematic flowchart of an audio generation method according to another embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of an audio generation device according to an embodiment of the present disclosure;

Fig. 6 is a block diagram of an audio generation apparatus according to an embodiment of the present disclosure;

Fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Throughout the drawings, it should be noted that the same reference numerals are used to denote the same or similar elements, features, and structures.

Detailed Description

To help those of ordinary skill in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the present disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but they are to be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are omitted for clarity and conciseness.

The terms and words used in the following description and claims are not limited to their dictionary meanings, but are merely used by the inventors to enable a clear and consistent understanding of the present disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of the various embodiments of the present disclosure is provided for illustration purposes only, and not for the purpose of limiting the present disclosure as defined by the claims and their equivalents.

It should be noted that the terms "first", "second", and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments of the present disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.

In the related art, conventional signal processing schemes are used to separate the direct sound and the ambient sound from a stereo signal, and the direct sound and the ambient sound are then processed differently to generate a 3D audio signal. However, track separation algorithms based on conventional signal processing are of limited effectiveness and cannot specifically separate the track signals of individual sound sources such as vocals and the various instruments, so the immersion of the resulting 3D audio differs noticeably from that of genuine 3D audio. Alternatively, a 3D audio effect can be constructed by rotating the audio signal to some extent in three-dimensional space based on beat point detection results. However, such a scheme does not apply targeted processing to each sound element based on a track separation algorithm; for example, when the drums rotate, the vocal signal must rotate with them, which results in a monotonous 3D audio effect.

The present disclosure combines a series of techniques, including beat point detection, track separation, and music structure analysis, to assign different three-dimensional spatial trajectories to the tracks in the audio, such as vocals, drums, and bass, convert traditional stereo audio into a 3D audio signal, and finally generate immersive 3D audio content through audio spatial rendering.

Hereinafter, the method and apparatus of the present disclosure are described in detail with reference to the accompanying drawings, according to various embodiments of the present disclosure.

Fig. 1 is a flowchart of an audio generation method according to an embodiment of the present disclosure. The audio generation method of Fig. 1 can be used to convert traditional stereo audio, single-channel audio, and the like into three-dimensional audio.

The audio generation method according to the present disclosure may be performed by any electronic device. The electronic device may be at least one of a smartphone, a tablet computer, a portable computer, a desktop computer, and the like. A target application for implementing the three-dimensional audio generation method of the present disclosure may be installed on the electronic device.

Referring to Fig. 1, in step S101, an audio signal to be processed is acquired. Here, the audio signal to be processed may be, for example, a stereo music signal, a single-channel music signal, or the like.

In step S102, a plurality of track signals for a plurality of sound sources are obtained by performing track separation on the audio signal to be processed.

A track separation system based on deep learning may be used to perform track separation on the audio signal to be processed. Deep-learning-based track separation can separate vocals and individual instrument signals with higher fidelity. For example, a deep-learning-based track separation system may consist of three parts: an encoder, a separation module, and a decoder. A short-time Fourier transform may first be applied to the audio signal to be processed to obtain its spectral data; the spectral data is fed into the encoder to obtain encoded features of the audio signal; the separation module further extracts target track features from the encoded features; and finally, based on the target track features, the decoder produces a target mask matrix corresponding to the target track signal. After the spectral data is multiplied by the target mask matrix, an inverse short-time Fourier transform is applied to the product to obtain the target track signal. Here, the target track signal may include one or more track signals. For example, after the above process is performed on a traditional stereo music signal, multiple track signals such as vocals, drums, bass, and other instruments can be obtained.
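The mask-and-resynthesize step can be sketched without the learned model: below, the encoder/separator/decoder stack is replaced by a caller-supplied `predict_mask` stub, and the STFT/iSTFT pair is a minimal hand-rolled overlap-add implementation. The frame sizes and window are illustrative assumptions, not the parameters of the system described in the disclosure.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed short-time Fourier transform, frames along axis 0."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=512, hop=256):
    """Overlap-add inverse; dividing by the summed analysis windows undoes
    the windowing wherever the window sum is nonzero."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += win
    return out / np.maximum(norm, 1e-8)

def separate_target(x, predict_mask):
    """Multiply the mixture spectrogram by a [0, 1] mask for the target
    source (the trained model's job), then invert to a time-domain track."""
    spec = stft(x)
    mask = predict_mask(np.abs(spec))
    return istft(spec * mask)
```

With an all-ones mask the pipeline reconstructs the input away from the signal edges, which is a useful sanity check before plugging in a real mask predictor.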

In step S103, user orientation information in the three-dimensional space is determined, and three-dimensional spatial metadata is generated for each of the separated track signals. The user orientation information may include at least a part of the three-dimensional position coordinates, direction, speed, movement path, and the like of the virtual user in the three-dimensional space at each time point or in each time period. For example, the user orientation may be fixed at the center of the three-dimensional space. The three-dimensional spatial metadata of each sound source may be determined with reference to the user orientation information. The three-dimensional spatial metadata of a track signal can be used to indicate how the corresponding sound source changes in the three-dimensional space, for example, how the sound source moves in the three-dimensional space, how its sound width changes, and so on.

The three-dimensional spatial metadata may include orientation information and sound width information of the corresponding sound source in the three-dimensional space. Here, the orientation information of a sound source may include some or all of the three-dimensional position coordinates, direction, speed, movement path, and the like of the corresponding sound source in the three-dimensional space at each time point or in each time period.
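As a data structure, the per-track metadata described here might be held in a small container like the following; the field names are illustrative assumptions, not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SpatialMetadata:
    """Per-frame orientation and sound width for one track's sound source."""
    positions: List[Tuple[float, float, float]]   # (x, y, z) per frame, each in [0, 1]
    widths: List[float]                           # sound width per frame
    speed: float = 0.0                            # optional movement speed

    def at(self, frame: int):
        """Position and width at a frame, clamped to the last entry so a
        short metadata stream still resolves for every audio frame."""
        i = min(frame, len(self.positions) - 1)
        return self.positions[i], self.widths[i]
```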

图2和图3示出根据本公开的实施例的虚拟三维空间的示意图。FIG. 2 and FIG. 3 show schematic diagrams of a virtual three-dimensional space according to an embodiment of the present disclosure.

假设存在一个虚拟的三维空间,空间的三维坐标轴为X轴、Y轴、Z轴,其中,Z轴代表高度,每个坐标轴的范围被设置为[0,1],如图2所示。Assume that there is a virtual three-dimensional space whose coordinate axes are the X-axis, Y-axis, and Z-axis, where the Z-axis represents height; the range of each coordinate axis is set to [0, 1], as shown in FIG. 2.

用户/听者在三维空间的中心位置,诸如(0.5,0.5,0.5),图3中的白色圆点可表示各个声源,诸如歌唱人声和乐器等声源,圆点的坐标可表示声源的三维空间位置,圆点的尺寸可表示声音的宽度。然而,上述示例仅是示例性的,本公开不限于此。此外,三维空间元数据中的声源方位可以是相对于用户方位的方位。The user/listener is at the center of the three-dimensional space, such as (0.5, 0.5, 0.5). The white dots in FIG. 3 may represent individual sound sources, such as singing voices and musical instruments; the coordinates of a dot may represent the three-dimensional position of the sound source, and the size of a dot may represent the width of the sound. However, the above example is merely exemplary, and the present disclosure is not limited thereto. Furthermore, the sound source orientation in the three-dimensional spatial metadata may be an orientation relative to the user orientation.
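The scene of FIGS. 2 and 3 can be represented minimally as below; the class and field names (`SourceMeta`, `width`) are illustrative assumptions, not terms from the disclosure:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class SourceMeta:
    """Per-frame 3D metadata of one sound source in the unit cube."""
    position: np.ndarray  # (x, y, z), each in [0, 1]; z is height
    width: float          # sound width, e.g. 0.05 = narrow, 0.5 = wide

LISTENER = np.array([0.5, 0.5, 0.5])  # user fixed at the center of the space

def direction_to(source: SourceMeta) -> np.ndarray:
    """Unit vector from the listener toward the source (relative orientation)."""
    d = source.position - LISTENER
    n = np.linalg.norm(d)
    return d / n if n > 0 else d

# A vocal source directly in front of the listener (at y = 0), narrow width.
vocal = SourceMeta(position=np.array([0.5, 0.0, 0.5]), width=0.05)
```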

根据本公开的实施例,可基于待处理音频信号的特征信息来生成针对待处理音频信号中包括的每个音轨信号的三维空间元数据。特征信息可包括待处理音频信号的节拍信息和结构信息中的至少一个。According to an embodiment of the present disclosure, three-dimensional spatial metadata for each track signal included in the audio signal to be processed may be generated based on feature information of the audio signal to be processed. The feature information may include at least one of tempo information and structure information of the audio signal to be processed.

可通过对待处理音频信号进行音频卡点检测来获得节拍信息。例如,音频卡点检测可以通过基于深度学习的模型实现,模型可主要包括特征提取模块、基于深度学习的概率预测模块和全局节拍位置估计模块。基于深度学习的卡点检测,可以准确判断出音频信号的节拍位置。The beat information can be obtained by performing beat point detection on the audio signal to be processed. For example, beat point detection can be implemented by a deep-learning-based model, which may mainly include a feature extraction module, a deep-learning-based probability prediction module, and a global beat position estimation module. Deep-learning-based beat point detection can accurately determine the beat positions of the audio signal.

特征提取通常使用频域特征,例如,可使用待处理音频信号的梅尔谱及其一阶差分作为特征提取模块的输入特征,然后将提取的特征输入至概率预测模块。概率预测模块可使用CRNN等深度网络实现,学习待处理音频信号的局部特征和时序特征。通过概率预测模块,可以针对每一帧音频数据计算其是否为节拍点的概率。最后基于预测概率,利用全局节拍位置估计模块得到全局最优的节拍位。全局节拍位置估计模块可由动态规划算法实现。生成的节拍位可包括正常节拍以及重拍两种类型。上述音频卡点检测的过程仅是示例性的,本公开不限于此。Feature extraction usually uses frequency-domain features. For example, the Mel spectrum of the audio signal to be processed and its first-order difference can be used as input features of the feature extraction module, and the extracted features are then input to the probability prediction module. The probability prediction module can be implemented with a deep network such as a CRNN to learn local and temporal features of the audio signal to be processed. Through the probability prediction module, the probability of being a beat point can be calculated for each frame of audio data. Finally, based on the predicted probabilities, the globally optimal beat positions are obtained using the global beat position estimation module, which can be implemented by a dynamic programming algorithm. The generated beat positions may include two types: normal beats and downbeats. The above process of beat point detection is only exemplary, and the present disclosure is not limited thereto.
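The global beat position estimation step can be sketched as a dynamic program over per-frame beat probabilities. Here the CRNN output is replaced by a hand-made probability array, and the tempo prior (`period`, `tol`) is an assumed parameter rather than something specified by the disclosure:

```python
import numpy as np

def track_beats(prob, period=8, tol=2):
    """Pick a globally optimal beat sequence by dynamic programming.

    prob:   per-frame beat probability (stand-in for the prediction module)
    period: expected frames between beats (tempo prior, assumed known)
    tol:    allowed deviation from the expected period
    """
    n = len(prob)
    score = prob.copy()
    prev = np.full(n, -1)
    for t in range(n):
        lo, hi = t - period - tol, t - period + tol
        cands = list(range(max(lo, 0), max(hi + 1, 0)))
        if cands:
            best = max(cands, key=lambda s: score[s])
            if score[best] > 0:       # chain onto the best predecessor beat
                score[t] += score[best]
                prev[t] = best
    # Backtrack from the best-scoring final beat.
    t = int(np.argmax(score))
    beats = []
    while t >= 0:
        beats.append(int(t))
        t = prev[t]
    return beats[::-1]

# A probability curve with clear peaks every 8 frames is recovered exactly.
p = np.zeros(40)
p[[4, 12, 20, 28, 36]] = 1.0
```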

可通过对待处理音频信号进行音频结构分析来获得音频信号的结构信息。结构信息可包括音频信号的各音频片段的类型信息和时间信息。例如,音频结构分析技术是指通过算法将音频信号分割成不同类型的片段,例如序曲、主歌、副歌和过渡段等几个不同类型片段。例如,音频结构分析过程主要包括分割、聚类和标识等几个步骤。以音乐信号为例,首先对输入的音乐信号进行分帧处理,并且提取语音帧的频谱特征(诸如梅尔频率倒谱系数等)。通过计算帧与帧之间的特征相关性,可得到音乐信号的相关性矩阵。根据相关性矩阵,可以将语音信号分割成多个片段,并且根据这些片段之间的相关性,可以将这些片段进行聚类。在通过分割和聚类过程之后,对于输入的音乐信号,可以得到诸如a-b-c-b-c形式的音乐结构和对应片段的时间点。最后基于各个片段的重复次数以及音量、明亮度等声学特征,可以判断出音乐信号中主歌部分、副歌部分等。The structure information of the audio signal can be obtained by performing audio structure analysis on the audio signal to be processed. The structure information may include type information and time information of each audio segment of the audio signal. For example, audio structure analysis refers to segmenting an audio signal algorithmically into segments of different types, such as intro, verse, chorus, and transition. The audio structure analysis process mainly includes steps of segmentation, clustering, and labeling. Taking a music signal as an example, the input music signal is first divided into frames, and spectral features of the frames (such as Mel-frequency cepstral coefficients) are extracted. By computing the feature correlation between frames, a correlation matrix of the music signal can be obtained. Based on the correlation matrix, the signal can be segmented into multiple segments, and the segments can be clustered according to the correlation between them. After the segmentation and clustering process, for the input music signal, a music structure such as the form a-b-c-b-c and the time points of the corresponding segments can be obtained. Finally, based on the number of repetitions of each segment and acoustic characteristics such as volume and brightness, the verse part, the chorus part, etc. in the music signal can be determined. The above process of audio structure analysis is only exemplary, and the present disclosure is not limited thereto.
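The segmentation step based on a frame-to-frame correlation matrix can be sketched as follows. The feature vectors are synthetic stand-ins for MFCCs, and the checkerboard novelty kernel is one common boundary-detection choice, not necessarily the one used by the disclosure:

```python
import numpy as np

def self_similarity(feats):
    """Cosine-similarity (correlation) matrix between per-frame features."""
    f = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-8)
    return f @ f.T

def novelty(ssm, k=4):
    """Checkerboard-kernel novelty along the diagonal; peaks mark boundaries."""
    n = len(ssm)
    kern = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.ones((k, k)))
    nov = np.zeros(n)
    for i in range(k, n - k):
        nov[i] = np.sum(ssm[i - k:i + k, i - k:i + k] * kern)
    return nov

# Two homogeneous sections -> one clear boundary at the section change.
feats = np.vstack([np.tile([1.0, 0.0], (20, 1)),
                   np.tile([0.0, 1.0], (20, 1))])
boundary = int(np.argmax(novelty(self_similarity(feats))))
```

With real MFCC features, the novelty peaks give candidate segment boundaries, and the clustering/labeling steps then group and name the resulting segments.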

在获得待处理音频信号的节拍信息和各音频片段的结构信息后,可以针对不同的音轨信号生成相应的三维空间元数据。After obtaining the tempo information of the audio signal to be processed and the structure information of each audio segment, corresponding three-dimensional spatial metadata can be generated for different audio track signals.

对于分离出的多个音轨信号中的人声信号,可根据结构信息中的各个音频片段的类型,确定人声信号对应的声源相对于三维空间中的用户方位信息的位置调整信息,基于位置调整信息确定人声信号的三维空间元数据。For the vocal signal among the separated multiple track signals, position adjustment information of the sound source corresponding to the vocal signal relative to the user orientation information in the three-dimensional space can be determined according to the type of each audio segment in the structure information, and the three-dimensional spatial metadata of the vocal signal can be determined based on the position adjustment information.

作为示例,针对音轨信号中的人声信号,可在第一预设类型的音频片段期间将与人声信号对应的声源设置为在三维空间中向用户所在的位置移动,并且将与移动相关的信息作为人声信号的三维空间元数据。As an example, for the vocal signal among the track signals, the sound source corresponding to the vocal signal may be set to move toward the user's position in the three-dimensional space during an audio segment of a first preset type, and the information related to the movement may be used as the three-dimensional spatial metadata of the vocal signal.

以音乐信号为例,对于人声信号,可在主歌部分在三维空间中使人声信号对应的声源与听者的距离由远及近。例如,该声源在图2的三维空间中从坐标(0.5,0,0.5)变化到(0.5,0.3,0.5)逐渐靠近听者,从而增加音乐的代入感。Taking a music signal as an example, for the vocal signal, the distance between the corresponding sound source and the listener can be made to change from far to near in the three-dimensional space during the verse. For example, the sound source moves from coordinates (0.5, 0, 0.5) to (0.5, 0.3, 0.5) in the three-dimensional space of FIG. 2, gradually approaching the listener, thereby increasing the sense of immersion in the music.

作为另一示例,针对音轨信号中的人声信号,可在第二预设类型的音频片段期间将与人声信号对应的声源在三维空间中的高度坐标增加至预定高度并且将该声源的声音宽度增加至预定声音宽度,并且将与增加至预定高度和预定声音宽度相关的信息作为人声信号的三维空间元数据。As another example, for the vocal signal among the track signals, during an audio segment of a second preset type, the height coordinate in the three-dimensional space of the sound source corresponding to the vocal signal may be increased to a predetermined height and the sound width of the sound source may be increased to a predetermined sound width, and the information related to the increases to the predetermined height and the predetermined sound width is used as the three-dimensional spatial metadata of the vocal signal.

以音乐信号为例,对于人声信号,可在主歌部分与副歌部分的过渡段,在三维空间中逐渐提高人声信号对应的声源的三维坐标中的高度以及整个声音的宽度,使得该声源最后在副歌部分达到一定的高度和声音宽度值,从而提高副歌的震撼效果。例如,该声源的位置坐标在图2的三维空间中从(0.5,0,0.5)变化到(0.5,0,1),声音宽度值从0.05变化到0.5。高度值和宽度值的变化可遵循预设的线性函数或非线性函数进行改变。Taking a music signal as an example, for the vocal signal, the height in the three-dimensional coordinates of the corresponding sound source and the width of the whole sound can be gradually increased in the three-dimensional space during the transition between the verse and the chorus, so that the sound source finally reaches certain height and sound width values in the chorus, thereby enhancing the impact of the chorus. For example, the position coordinates of the sound source change from (0.5, 0, 0.5) to (0.5, 0, 1) in the three-dimensional space of FIG. 2, and the sound width value changes from 0.05 to 0.5. The changes in the height and width values may follow a preset linear or non-linear function.
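The preset linear or non-linear change of height and width values can be sketched with the numbers from the example above; the frame count and the smoothstep curve are illustrative assumptions:

```python
import numpy as np

def ramp(start, end, n, curve="linear"):
    """Interpolate a metadata value over n frames with a preset curve."""
    t = np.linspace(0.0, 1.0, n)
    if curve == "smoothstep":            # a simple non-linear ease-in/ease-out
        t = t * t * (3.0 - 2.0 * t)
    return start + (end - start) * t

n = 100                                   # frames in the verse->chorus transition (assumed)
height = ramp(0.5, 1.0, n)                # z coordinate: 0.5 -> 1.0
width = ramp(0.05, 0.5, n, "smoothstep")  # sound width: 0.05 -> 0.5
```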

对于分离出的多个音轨信号中的乐器信号,可根据节拍信息和结构信息中的各音频片段的类型,确定乐器信号对应的声源在三维空间中的移动信息,基于移动信息确定乐器信号的三维空间元数据。For the instrument signals among the separated multiple track signals, movement information of the sound source corresponding to each instrument signal in the three-dimensional space can be determined according to the beat information and the type of each audio segment in the structure information, and the three-dimensional spatial metadata of the instrument signal can be determined based on the movement information.

作为示例,针对音轨信号中的乐器信号,可根据节拍信息将与乐器信号对应的声源设置为在三维空间中按照预定轨迹周期性地移动,并且将与移动相关的信息作为乐器信号的三维空间元数据。以音乐信号为例,对于与鼓声和贝斯等乐器对应的声源,可以根据节拍在三维空间中以特定的轨迹进行周期性变化。例如,可使乐器声源在三维空间中按照特定轨迹进行旋转,并且在重拍时使该声源位于听者头部正中间位置,以进一步提升听者的听觉体验。As an example, for an instrument signal among the track signals, the sound source corresponding to the instrument signal may be set, according to the beat information, to move periodically along a predetermined trajectory in the three-dimensional space, and the information related to the movement may be used as the three-dimensional spatial metadata of the instrument signal. Taking a music signal as an example, sound sources corresponding to instruments such as drums and bass can change periodically along a specific trajectory in the three-dimensional space according to the beat. For example, the instrument sound source can be rotated along a specific trajectory in the three-dimensional space and positioned at the center of the listener's head on downbeats, to further enhance the listening experience.
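A beat-synchronized trajectory of the kind described (periodic rotation, source at the listener's head on downbeats) might be sketched as follows; the radius, frame indices, and beat lists are illustrative:

```python
import numpy as np

CENTER = np.array([0.5, 0.5, 0.5])  # listener position at the center

def drum_positions(n_frames, beats, downbeats, radius=0.4):
    """One revolution per beat interval; jump to the head center on downbeats."""
    pos = np.tile(CENTER, (n_frames, 1)).astype(float)
    for b0, b1 in zip(beats[:-1], beats[1:]):
        for f in range(b0, min(b1, n_frames)):
            phase = 2 * np.pi * (f - b0) / (b1 - b0)   # rotate in the x-y plane
            pos[f, 0] = 0.5 + radius * np.cos(phase)
            pos[f, 1] = 0.5 + radius * np.sin(phase)
    for d in downbeats:                 # downbeat: source at the listener's head
        if d < n_frames:
            pos[d] = CENTER
    return pos

# Beats every 8 frames, downbeats at frames 0 and 16.
pos = drum_positions(32, beats=[0, 8, 16, 24, 31], downbeats=[0, 16])
```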

作为又一示例,针对音轨信号中的乐器信号,可在第二预设类型的音频片段期间将与乐器信号对应的声源在三维空间中的旋转速度增加至预定旋转速度,并且将与增加至预定旋转速度相关的信息作为乐器信号的三维空间元数据。以音乐信号为例,对于与鼓声和贝斯等乐器对应的声源,可根据主歌和副歌的特点,提高乐器声源在副歌部分的空间旋转速度,从而提升副歌的动感。As yet another example, for an instrument signal among the track signals, the rotational speed in the three-dimensional space of the sound source corresponding to the instrument signal may be increased to a predetermined rotational speed during an audio segment of the second preset type, and the information related to the increase to the predetermined rotational speed is used as the three-dimensional spatial metadata of the instrument signal. Taking a music signal as an example, for sound sources corresponding to instruments such as drums and bass, the spatial rotation speed of the instrument sound source in the chorus can be increased according to the characteristics of the verse and chorus, thereby enhancing the dynamism of the chorus.

根据本公开的另一实施例,可从多个预设模板中确定分别与分离出的每个音轨信号对应的预设模板,使用确定的预设模板分别生成针对每个音轨信号的三维空间元数据。According to another embodiment of the present disclosure, a preset template corresponding to each separated track signal may be determined from a plurality of preset templates, and the determined preset templates may be used to respectively generate the three-dimensional spatial metadata for each track signal.

每个预设模板可包括声源在三维空间中的移动轨迹信息、移动速度信息和声音宽度变化信息中的至少一个。预设模板可预先设置声源在三维空间中的变化过程,例如声源在三维空间中如何移动、声源在三维空间中的声音宽度变化等。在将待处理音频信号分离出多个音轨信号后,可针对每个分离出的音轨信号,分配一个预设模板,使得音轨信号按照对应模板中的信息进行变化。例如,可根据每个音轨信号的特征,从多个预设模板中确定与音轨信号相匹配的预设模板。Each preset template may include at least one of movement track information, movement speed information, and sound width variation information of the sound source in the three-dimensional space. The preset template can preset the change process of the sound source in the three-dimensional space, such as how the sound source moves in the three-dimensional space, and the sound width of the sound source in the three-dimensional space changes. After the to-be-processed audio signal is separated into a plurality of track signals, a preset template may be allocated for each separated track signal, so that the track signal changes according to the information in the corresponding template. For example, a preset template matching the audio track signal may be determined from a plurality of preset templates according to the characteristics of each audio track signal.

作为示例,预设模板可包括在主歌部分使人声声源与听者的距离由远及近的移动、在主副歌的过渡段逐渐提高人声声源在三维坐标中的高度以及整个声音的宽度、根据节拍使乐器声源在三维空间中以特定的轨迹进行周期性变化、提高乐器声源在副歌部分的空间旋转速度等模板。然而上述示例仅是示例性的,本公开不限于此。通过对不同音轨信号应用对应预设模板,生成更加符合各音轨信号属性的三维空间元数据,使得能生成更加逼真的三维音频信号。As an example, the preset templates may include templates such as moving the vocal sound source from far to near relative to the listener during the verse, gradually increasing the height of the vocal sound source in the three-dimensional coordinates and the width of the whole sound during the transition between the verse and the chorus, making an instrument sound source change periodically along a specific trajectory in the three-dimensional space according to the beat, and increasing the spatial rotation speed of an instrument sound source in the chorus. However, the above examples are only exemplary, and the present disclosure is not limited thereto. By applying corresponding preset templates to different track signals, three-dimensional spatial metadata that better matches the properties of each track signal is generated, so that a more realistic three-dimensional audio signal can be produced.
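Such preset templates could be stored as plain per-track data; the schema and values below are a hypothetical illustration, not a format defined by the disclosure:

```python
# Hypothetical template table; keys and fields are illustrative only.
PRESET_TEMPLATES = {
    "vocals": {
        "verse":  {"move": {"from": (0.5, 0.0, 0.5), "to": (0.5, 0.3, 0.5)}},
        "bridge": {"height": {"to": 1.0}, "width": {"to": 0.5}},
    },
    "drums": {
        "all":    {"orbit": {"radius": 0.4, "sync": "beat"}},
        "chorus": {"orbit": {"radius": 0.4, "sync": "beat", "speed": 2.0}},
    },
}

def template_for(track_name):
    """Pick the preset template matching a separated track (fallback: static)."""
    return PRESET_TEMPLATES.get(track_name, {"all": {"static": True}})
```

A metadata generator would then walk the detected segments and beats and evaluate the matching template entry for each track at each frame.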

根据本公开的另一实施例,可获取用户输入的设置信息,设置信息可包括与多个音轨信号对应的各声源在三维空间中的移动轨迹、移动速度和声音宽度变化值中的至少一个,然后基于设置信息分别生成针对每个音轨信号的三维空间元数据。According to another embodiment of the present disclosure, setting information input by a user may be acquired; the setting information may include at least one of a movement trajectory, a movement speed, and a sound width variation value, in the three-dimensional space, of each sound source corresponding to the multiple track signals; and three-dimensional spatial metadata for each track signal is then generated based on the setting information.

作为示例,可基于用户输入分别生成针对分离出的每个音轨信号的三维空间元数据。用户输入可用于设置与多个音轨信号对应的各声源在三维空间中的移动轨迹、移动速度和声音宽度变化值中的至少一个。用户可根据对待处理音频内容的理解来自定义针对各个音轨信号的变化。As an example, three-dimensional spatial metadata for each separated audio track signal may be generated separately based on user input. The user input may be used to set at least one of a movement trajectory, a movement speed, and a sound width variation value of each sound source corresponding to the plurality of track signals in the three-dimensional space. The user can customize the signal changes for each track according to his understanding of the audio content to be processed.

以音乐信号为例,用户可根据自己对该音乐内容的理解,自定义该音乐中的贝斯声源在副歌部分时在三维空间中的移动轨迹等,在生成三维音乐信号时,可将用户自定义的移动轨迹应用于该贝斯声源。然而上述示例仅是示例性的,本公开不限于此。用户可根据自己的喜好、对音乐内容的理解,设置各个声源在三维空间中的变化情况,以得到自己期望的音乐效果,满足了用户需求并且提高了用户体验。Taking a music signal as an example, a user can customize, according to his or her own understanding of the music content, the movement trajectory in the three-dimensional space of the bass sound source during the chorus; when the three-dimensional music signal is generated, the user-defined movement trajectory can be applied to the bass sound source. However, the above example is only exemplary, and the present disclosure is not limited thereto. Users can set how each sound source changes in the three-dimensional space according to their own preferences and understanding of the music content, so as to obtain the desired musical effect, which satisfies user needs and improves the user experience.

根据本公开的又一示例,可基于待处理音频信号的节拍信息和各音频片段,使用预设模板来生成针对分离出的多个音轨信号的三维空间元数据。也就是说,各个音轨信号对应的声源可在三维空间中按照预设模板在对应的节拍位和片段部分进行改变。According to yet another example of the present disclosure, preset templates may be used to generate the three-dimensional spatial metadata for the separated multiple track signals based on the beat information of the audio signal to be processed and the audio segments. That is, the sound source corresponding to each track signal can change in the three-dimensional space at the corresponding beat positions and segments according to the preset template.

在步骤S104,基于用户方位信息、分离出的多个音轨信号和与每个音轨信号对应的三维空间元数据,生成与待处理音频信号对应的三维音频信号。生成的三维音频信号可包括分离的音轨信号和每个音轨信号对应的三维空间元数据,可通过空间音频渲染技术得到最终的3D音频效果。In step S104, a three-dimensional audio signal corresponding to the to-be-processed audio signal is generated based on the user orientation information, the separated multiple audio track signals, and the three-dimensional spatial metadata corresponding to each audio track signal. The generated 3D audio signal may include separated audio track signals and 3D spatial metadata corresponding to each audio track signal, and the final 3D audio effect may be obtained through spatial audio rendering technology.

根据本公开的实施例,在生成三维音频信号时,可考虑用于播放该三维音频信号的播放设备的类型,针对不同播放设备,使用不同的方式渲染出三维音频信号。According to an embodiment of the present disclosure, when generating a 3D audio signal, the type of playback device used to play the 3D audio signal may be considered, and different ways are used to render the 3D audio signal for different playback devices.

具体地,可识别用于播放三维音频信号的播放设备的类型,获取与播放设备的类型对应的渲染策略,然后基于用户方位信息、分离出的多个音轨信号和与每个音轨信号对应的三维空间元数据,通过获取的渲染策略生成与待处理音频信号对应的三维音频信号。Specifically, the type of playback device used to play the three-dimensional audio signal can be identified, a rendering strategy corresponding to the type of playback device can be obtained, and the three-dimensional audio signal corresponding to the audio signal to be processed can then be generated through the obtained rendering strategy based on the user orientation information, the separated multiple track signals, and the three-dimensional spatial metadata corresponding to each track signal.

对于入耳式播放设备,可针对分离出的每个音轨信号,基于与音轨信号的每个音频帧对应的方位信息和用户方位信息,生成与音轨信号对应的三维音频信号,并且基于每个音频帧的声音宽度信息来调整与音轨信号对应的三维音频信号的声音宽度。例如,可基于头部相关传递函数(HRTF),将音轨信号的每个音频帧与其三维坐标对应的HRTF进行卷积,得到对应的双耳音频信号。For an in-ear playback device, for each separated track signal, a three-dimensional audio signal corresponding to the track signal may be generated based on the orientation information corresponding to each audio frame of the track signal and the user orientation information, and the sound width of the three-dimensional audio signal corresponding to the track signal may be adjusted based on the sound width information of each audio frame. For example, based on head-related transfer functions (HRTFs), each audio frame of the track signal can be convolved with the HRTF corresponding to its three-dimensional coordinates to obtain the corresponding binaural audio signal.
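The per-frame HRTF convolution can be sketched with overlap-add as below. The `fake_hrir` table is a synthetic placeholder that only mimics crude interaural time/level differences; a real renderer would look up measured HRTFs by source direction:

```python
import numpy as np

def fake_hrir(azimuth_deg, length=32):
    """Placeholder HRIR pair: crude interaural time and level differences."""
    s = np.sin(np.radians(azimuth_deg))      # -1 (hard left) .. +1 (hard right)
    delay = int(round(4 * s))                # samples of interaural time delay
    gain_r = 0.5 + 0.25 * s                  # right-ear level
    h_l = np.zeros(length)
    h_r = np.zeros(length)
    h_l[max(delay, 0)] = 1.0 - gain_r
    h_r[max(-delay, 0)] = gain_r
    return h_l, h_r

def binauralize(frames, azimuths, hop=256):
    """Convolve each frame with the HRIR for its direction; overlap-add."""
    n = hop * (len(frames) - 1) + len(frames[0]) + 31
    out = np.zeros((2, n))                   # rows: left ear, right ear
    for i, (frame, az) in enumerate(zip(frames, azimuths)):
        h_l, h_r = fake_hrir(az)
        out[0, i * hop:i * hop + len(frame) + 31] += np.convolve(frame, h_l)
        out[1, i * hop:i * hop + len(frame) + 31] += np.convolve(frame, h_r)
    return out

# A source at 90 degrees (hard right) should come out louder in the right ear.
frames = [np.ones(256), np.ones(256)]
out = binauralize(frames, azimuths=[90.0, 90.0])
```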

对于外放式播放设备,可针对每个音轨信号,基于与音轨信号对应的声源的方位信息和多个扬声器的方位信息,对音轨信号进行渲染,以生成与音轨信号对应的三维音频信号,并且基于该声源的声音宽度信息来调整与音轨信号对应的三维音频信号的声音宽度。例如,可基于矢量基幅值相移(VBAP)技术,将声音的单位方向矢量表示为距离声源方向最近的多个扬声器的单位方向矢量的线性组合,计算每个扬声器的增益因子,从而针对多个扬声器渲染得到3D音乐效果。For a loudspeaker playback device, for each track signal, the track signal may be rendered based on the orientation information of the corresponding sound source and the orientation information of multiple loudspeakers to generate a three-dimensional audio signal corresponding to the track signal, and the sound width of the three-dimensional audio signal corresponding to the track signal may be adjusted based on the sound width information of the sound source. For example, based on the vector base amplitude panning (VBAP) technique, the unit direction vector of the sound can be expressed as a linear combination of the unit direction vectors of the loudspeakers closest to the direction of the sound source, and a gain factor for each loudspeaker can be calculated, thereby rendering a 3D music effect for multiple loudspeakers.
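The VBAP gain computation can be sketched as solving the small linear system p = Lᵀg for the three loudspeakers nearest the source direction; the loudspeaker layout below is illustrative:

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """Gains g with p = L^T g, clipped and normalized to unit energy."""
    L = np.asarray(speaker_dirs, dtype=float)  # rows = speaker unit vectors
    p = np.asarray(source_dir, dtype=float)    # source unit direction vector
    g = np.linalg.solve(L.T, p)                # solve L^T g = p
    g = np.clip(g, 0.0, None)                  # negative gains are invalid
    return g / np.linalg.norm(g)               # constant-energy normalization

# Illustrative triangle of speakers; source halfway between the first two.
speakers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
src = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
g = vbap_gains(src, speakers)
```

A source between the first two speakers gets equal gain on them and none on the third, which is the intuitive panning behavior.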

考虑到播放设备的类型,可生成使各类型播放设备的播放效果更佳的三维音频信号。Considering the type of playback device, a three-dimensional audio signal can be generated to make the playback effect of each type of playback device better.

本公开提出的3D音频生成方法可以将传统双通道立体声音乐自动转换为3D音乐,增加音乐的沉浸感。The 3D audio generation method proposed in the present disclosure can automatically convert traditional two-channel stereo music into 3D music, increasing the immersion of the music.

图4是根据本公开的另一实施例的音频生成方法的流程示意图。在图4中,以将立体声音乐/单通道音乐转换为3D音乐为例进行描述。然而,图4示出的系统也可被用于将任何形式的音频转换为3D音频。FIG. 4 is a schematic flowchart of an audio generation method according to another embodiment of the present disclosure. In FIG. 4 , the description is made by taking the conversion of stereo music/mono-channel music into 3D music as an example. However, the system shown in Figure 4 can also be used to convert any form of audio to 3D audio.

输入的单通道或者立体声音乐通过音轨分离模块,提取出预先设定好的多种音轨信号,例如人声、鼓声和贝斯等声源的音频信号。同时,输入的单通道或者立体声音乐通过音频卡点检测模块提取出音乐的节拍点信息,并且通过音频结构分析模块提取出主歌、副歌等类型的音乐片段的结构信息。基于这些卡点信息和音乐结构信息,根据一些自定义模版或者用户编辑,通过三维元数据生成模块,确定每个音轨信号的三维空间元数据。空间音频渲染模块可基于分离出来的音轨信号和对应的三维空间元数据,利用空间音频渲染技术得到最终的3D音乐信号。The input mono or stereo music passes through the track separation module, which extracts multiple preset track signals, for example audio signals of sound sources such as vocals, drums, and bass. At the same time, the beat point information of the music is extracted from the input mono or stereo music through the beat point detection module, and the structure information of music segments such as verse and chorus is extracted through the audio structure analysis module. Based on the beat information and the music structure information, and according to some custom templates or user editing, the three-dimensional spatial metadata of each track signal is determined through the three-dimensional metadata generation module. The spatial audio rendering module can obtain the final 3D music signal using spatial audio rendering technology, based on the separated track signals and the corresponding three-dimensional spatial metadata.

音轨分离模块可基于深度学习的方式实现,例如可包括编码器、解码器和分离器。输入时域音乐信号经过STFT模块,得到对应的音乐频谱信号,经过多层卷积层构成的编码器后,利用分离器进一步提取音轨特征,最后通过解码器得到目标音轨信号对应的目标掩蔽矩阵。将音乐频谱信号与目标掩蔽矩阵相乘后,再经过ISTFT模块,可以得到目标音轨信号,如人声、鼓声、贝斯和其他乐器等。The track separation module can be implemented based on deep learning and may include, for example, an encoder, a decoder, and a separator. The input time-domain music signal passes through an STFT module to obtain the corresponding music spectrum signal; after an encoder composed of multiple convolutional layers, the separator further extracts track features, and finally the decoder produces the target masking matrix corresponding to the target track signal. After multiplying the music spectrum signal by the target masking matrix and passing the product through an ISTFT module, the target track signals, such as vocals, drums, bass, and other instruments, can be obtained.

音频卡点检测模块可以基于深度学习的方式实现,例如可包括特征提取模块、基于深度模型的概率预测模块、全局节拍位置估计模块。首先特征提取通常使用频域特征,在一种实现中,梅尔谱以及其一阶差分通常会用做输入特征。概率预测模块通常选择CRNN等深度网络实现,来学习局部特征和时序特征,通过概率预测模块可以为每一帧音频数据计算是否为拍点的概率。最后基于预测概率,全局节拍位置估计模块利用动态规划的算法得到全局最优的节拍位。生成的节拍位可包括正常节拍以及重拍两种。The beat point detection module can be implemented based on deep learning and may include, for example, a feature extraction module, a probability prediction module based on a deep model, and a global beat position estimation module. Feature extraction usually uses frequency-domain features; in one implementation, the Mel spectrum and its first-order difference are used as input features. The probability prediction module is usually implemented with a deep network such as a CRNN to learn local and temporal features; through it, the probability of being a beat point can be calculated for each frame of audio data. Finally, based on the predicted probabilities, the global beat position estimation module uses a dynamic programming algorithm to obtain the globally optimal beat positions. The generated beat positions may include both normal beats and downbeats.

音乐结构分析模块可通过算法将音乐信号分割成不同的片段,例如序曲、主歌、副歌和过渡段等几个不同的部分。音乐结构分析过程主要包括分割、聚类和标识等几个步骤。首先对音乐信号进行分帧处理,并且提取语音帧的梅尔频率倒谱系数MFCC等频谱特征。通过计算帧与帧之间的特征相关性,可以得到音乐信号的相关性矩阵。根据相关性矩阵可以将语音信号分割成多个片段,并且根据这些片段之间的相关性,可以将这些片段进行聚类。通过分割和聚类过程之后,对于待处理音乐信号,可以得到类似于a-b-c-b-c形式的音乐结构和对应片段时间点。最后基于片段的重复次数、以及音量、明亮度等声学特征,可以判断出音乐中主歌和副歌等部分。The music structure analysis module can divide the music signal into different segments by algorithm, such as several different parts such as overture, verse, chorus and transition. The music structure analysis process mainly includes several steps such as segmentation, clustering and identification. Firstly, the music signal is divided into frames, and the spectral features such as the Mel frequency cepstral coefficient MFCC of the speech frame are extracted. By calculating the feature correlation between frames, the correlation matrix of the music signal can be obtained. The speech signal can be segmented into segments according to the correlation matrix, and the segments can be clustered according to the correlation between the segments. After the segmentation and clustering process, for the music signal to be processed, a music structure similar to a-b-c-b-c form and corresponding segment time points can be obtained. Finally, based on the number of repetitions of the segment, as well as acoustic characteristics such as volume and brightness, the verse and chorus in the music can be judged.

三维元数据生成模块可基于音乐节拍信息和音乐结构信息,针对不同音轨信号生成对应的三维空间元数据。三维空间元数据对应音轨信号在三维空间的信息,具体主要包括三维位置坐标以及声音的宽度等。The three-dimensional metadata generation module can generate corresponding three-dimensional spatial metadata for different track signals based on the music tempo information and the music structure information. The three-dimensional space metadata corresponds to the information of the audio track signal in the three-dimensional space, and specifically mainly includes the three-dimensional position coordinates and the width of the sound.

可以通过由使用者根据自己对于音乐内容的理解自定义各个声源的模板,也可以通过一些预先设定好的模版自动生成各个声源的三维空间元数据。The template of each sound source can be customized by the user according to his own understanding of the music content, or the three-dimensional spatial metadata of each sound source can be automatically generated through some preset templates.

例如,预先设定的模版可包括在主歌部分使人声声源与听者的距离由远及近的移动、在主副歌的过渡段逐渐提高人声声源在三维坐标中的高度以及整个声音的宽度、根据节拍使乐器声源在三维空间中以特定的轨迹进行周期性变化、提高乐器声源在副歌部分的空间旋转速度等模板。然而上述示例仅是示例性的,本公开不限于此。For example, the preset templates may include templates such as moving the vocal sound source from far to near relative to the listener during the verse, gradually increasing the height of the vocal sound source in the three-dimensional coordinates and the width of the whole sound during the transition between the verse and the chorus, making an instrument sound source change periodically along a specific trajectory in the three-dimensional space according to the beat, and increasing the spatial rotation speed of an instrument sound source in the chorus. However, the above examples are only exemplary, and the present disclosure is not limited thereto.

空间音频渲染模块生成的3D音乐信号可包括分离的音轨信号和每个音轨对应的三维元数据,通过空间音频渲染技术可以得到最终的3D音乐效果。例如,对于耳机播放设备,可以基于头部相关传递函数(HRTF),将输入音轨的每个音频帧与其三维坐标对应的HRTF进行卷积,得到对应的双耳音频信号。对于多个扬声器播放设备,可以基于矢量幅度平移技术(VBAP),将声音的单位方向矢量表示为距离声源方向最近的多个扬声器的单位方向矢量的线性组合,计算每个扬声器的增益因子,从而针对多个扬声器渲染得到3D音乐效果。The 3D music signal generated by the spatial audio rendering module may include the separated track signals and the three-dimensional metadata corresponding to each track, and the final 3D music effect can be obtained through spatial audio rendering technology. For example, for a headphone playback device, each audio frame of an input track can be convolved, based on head-related transfer functions (HRTFs), with the HRTF corresponding to its three-dimensional coordinates to obtain the corresponding binaural audio signal. For a multi-loudspeaker playback device, based on the vector base amplitude panning (VBAP) technique, the unit direction vector of the sound can be expressed as a linear combination of the unit direction vectors of the loudspeakers closest to the direction of the sound source, and a gain factor for each loudspeaker can be calculated, thereby rendering a 3D music effect for multiple loudspeakers.

图5是本公开实施例的硬件运行环境的音频生成设备的结构示意图。FIG. 5 is a schematic structural diagram of an audio generation device of a hardware operating environment according to an embodiment of the present disclosure.

如图5所示,音频生成设备500可包括:处理组件501、通信总线502、网络接口503、输入输出接口504、存储器505以及电源组件506。其中,通信总线502用于实现这些组件之间的连接通信。输入输出接口504可以包括视频显示器(诸如,液晶显示器)、麦克风和扬声器以及用户交互接口(诸如,键盘、鼠标、触摸输入装置等),可选地,输入输出接口504还可包括标准的有线接口、无线接口。网络接口503可选的可包括标准的有线接口、无线接口(如无线保真接口)。存储器505可以是高速的随机存取存储器,也可以是稳定的非易失性存储器。存储器505可选的还可以是独立于前述处理组件501的存储装置。As shown in FIG. 5, the audio generation device 500 may include a processing component 501, a communication bus 502, a network interface 503, an input/output interface 504, a memory 505, and a power supply component 506. The communication bus 502 is used to realize connection and communication between these components. The input/output interface 504 may include a video display (such as a liquid crystal display), a microphone and speakers, and a user interaction interface (such as a keyboard, a mouse, or a touch input device); optionally, the input/output interface 504 may also include standard wired and wireless interfaces. The network interface 503 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 505 may be a high-speed random access memory or a stable non-volatile memory. Optionally, the memory 505 may also be a storage device independent of the aforementioned processing component 501.

本领域技术人员可以理解,图5中示出的结构并不构成对音频生成设备500的限定,可包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the audio generation device 500, and may include more or less components than those shown, or combine some components, or arrange different components.

如图5所示,作为一种存储介质的存储器505中可包括操作系统(诸如MAC操作系统)、数据存储模块、网络通信模块、用户接口模块、程序以及数据库。As shown in FIG. 5 , the memory 505 as a storage medium may include an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a program, and a database.

在图5所示的音频生成设备500中,网络接口503主要用于与外部电子设备/终端进行数据通信;输入输出接口504主要用于与用户进行数据交互;音频生成设备500中的处理组件501、存储器505可被设置在音频生成设备500中,音频生成设备500通过处理组件501调用存储器505中存储的程序以及由操作系统提供的各种API,执行本公开实施例提供的音频生成方法。In the audio generation device 500 shown in FIG. 5, the network interface 503 is mainly used for data communication with external electronic devices/terminals; the input/output interface 504 is mainly used for data interaction with the user; the processing component 501 and the memory 505 may be provided in the audio generation device 500, and the audio generation device 500 invokes, through the processing component 501, the program stored in the memory 505 and various APIs provided by the operating system to execute the audio generation method provided by the embodiments of the present disclosure.

The processing component 501 may include at least one processor, and the memory 505 stores a set of computer-executable instructions which, when executed by the at least one processor, perform the audio generation method according to an embodiment of the present disclosure. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.

The processing component 501 may control the components included in the audio generation device 500 by executing a program.

As an example, the audio generation device 500 may be a PC, a tablet device, a personal digital assistant, a smartphone, or any other device capable of executing the above set of instructions. The audio generation device 500 need not be a single electronic device; it may also be any assembly of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The audio generation device 500 may also be part of an integrated control system or a system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).

In the audio generation device 500, the processing component 501 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processing component 501 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

The processing component 501 may execute instructions or code stored in the memory 505, which may also store data. Instructions and data may also be sent and received over a network via the network interface 503, which may employ any known transport protocol.

The memory 505 may be integrated with the processing component 501, for example, with RAM or flash memory arranged within an integrated circuit microprocessor. In addition, the memory 505 may include an independent device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 505 and the processing component 501 may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processing component 501 can read data stored in the memory 505.

FIG. 6 is a block diagram of an audio generation apparatus according to an embodiment of the present disclosure.

Referring to FIG. 6, the audio generation apparatus 600 may include an acquisition module 601, a track separation module 602, a metadata generation module 603, and a rendering module 604. Each module in the audio generation apparatus 600 may be implemented by one or more modules, and the name of a corresponding module may vary according to its type. In various embodiments, some modules in the audio generation apparatus 600 may be omitted, or additional modules may be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined into a single entity that equivalently performs the functions of the corresponding modules/elements before the combination.

The acquisition module 601 may acquire an audio signal to be processed.

The track separation module 602 may obtain multiple track signals for multiple sound sources by performing track separation on the audio signal to be processed.
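As a rough illustration of the track-separation interface only (the separation itself would typically be performed by a trained source-separation model, which the sketch below does not implement), a toy mid/side decomposition approximates center-panned content such as vocals by the mid signal of a stereo pair. The function name `separate_tracks` and the output keys are hypothetical:

```python
def separate_tracks(left, right):
    """Toy stand-in for model-based track separation (hypothetical).

    Center-panned content (often the vocal) is approximated by the mid
    signal of a stereo pair; the remainder by the side signal.
    """
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return {"vocal": mid, "accompaniment": side}
```

Each returned track can then be rendered independently with its own spatial metadata.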

The metadata generation module 603 may determine user orientation information in a three-dimensional space and generate three-dimensional spatial metadata corresponding to each of the separated track signals, where the three-dimensional spatial metadata may include orientation information and sound width information of the corresponding sound source in the three-dimensional space.

Optionally, the metadata generation module 603 may acquire feature information of the audio signal to be processed, where the feature information may include at least one of beat information of the audio signal to be processed and structure information of its audio segments, and generate the three-dimensional spatial metadata for each track signal based on the feature information.

Optionally, the metadata generation module 603 may obtain the beat information by performing beat-point detection on the audio signal to be processed, and obtain the structure information of the audio segments by performing audio structure analysis on the audio signal to be processed.
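A minimal energy-based sketch of beat-point detection, under the simplifying assumption that onsets show up as jumps in frame energy; production detectors would use spectral flux or learned models, and `detect_beats`, `frame_size`, and `threshold` are illustrative names:

```python
def detect_beats(samples, frame_size=4, threshold=1.5):
    """Crude energy-flux beat-point detection (illustrative only).

    Splits the signal into fixed-size frames and flags a frame as a
    beat point when its energy exceeds the running average so far by
    a multiplicative threshold.
    """
    energies = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energies.append(sum(x * x for x in frame))
    beats = []
    for i in range(1, len(energies)):
        avg = sum(energies[:i]) / i          # running average of past frames
        if energies[i] > threshold * avg:
            beats.append(i)                  # frame index of the beat point
    return beats
```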

Optionally, for a vocal signal among the multiple track signals, the metadata generation module 603 may determine, according to the type of each audio segment in the structure information, position adjustment information of the sound source corresponding to the vocal signal relative to the user orientation information in the three-dimensional space, and determine the three-dimensional spatial metadata of the vocal signal based on the position adjustment information.

For example, for a vocal signal among the multiple track signals, the metadata generation module 603 may set the sound source corresponding to the vocal signal to move toward the user's position in the three-dimensional space during an audio segment of a first preset type, and use the movement-related information as the three-dimensional spatial metadata of the vocal signal.

As another example, for a vocal signal among the multiple track signals, the metadata generation module 603 may, during an audio segment of a second preset type, increase the height coordinate of the corresponding sound source in the three-dimensional space to a predetermined height and increase the sound width of the sound source to a predetermined sound width, and use the information related to these increases as the three-dimensional spatial metadata of the vocal signal.
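The vocal position adjustment described above can be sketched as per-frame metadata moving the source toward the listener during a segment of the first preset type. The linear path and the name `vocal_approach` are illustrative assumptions, not the disclosed metadata format:

```python
def vocal_approach(start_pos, listener_pos, n_frames):
    """Per-frame 3D positions moving the vocal source linearly from
    start_pos toward listener_pos across a segment (e.g. a chorus)."""
    positions = []
    for k in range(n_frames):
        t = k / (n_frames - 1) if n_frames > 1 else 1.0  # 0 → 1 over the segment
        positions.append(tuple(s + t * (l - s)
                               for s, l in zip(start_pos, listener_pos)))
    return positions
```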

Optionally, for an instrument signal among the multiple track signals, the metadata generation module 603 may determine movement information of the corresponding sound source in the three-dimensional space according to the beat information and the type of each audio segment in the structure information, and determine the three-dimensional spatial metadata of the instrument signal based on the movement information.

For example, for an instrument signal among the multiple track signals, the metadata generation module 603 may set the corresponding sound source to move periodically along a predetermined trajectory in the three-dimensional space according to the beat information, and use the movement-related information as the three-dimensional spatial metadata of the instrument signal.

As another example, for an instrument signal among the multiple track signals, the metadata generation module 603 may, during an audio segment of the second preset type, increase the rotational speed of the corresponding sound source in the three-dimensional space to a predetermined rotational speed, and use the information related to this increase as the three-dimensional spatial metadata of the instrument signal.
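A periodic instrument trajectory can be sketched as a circle around the listener whose period is tied to the beat grid; raising the rotational speed during a second-preset-type segment would amount to shrinking `frames_per_cycle`. The function name, radius, and height defaults are illustrative assumptions:

```python
import math

def circle_positions(n_frames, frames_per_cycle, radius=2.0, height=0.5):
    """3D positions for a source circling the listener, completing one
    revolution every frames_per_cycle frames (tied to the beat grid)."""
    positions = []
    for k in range(n_frames):
        angle = 2.0 * math.pi * (k % frames_per_cycle) / frames_per_cycle
        positions.append((radius * math.cos(angle),
                          radius * math.sin(angle),
                          height))
    return positions
```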

Optionally, the metadata generation module 603 may determine, from multiple preset templates, a preset template corresponding to each track signal, where a preset template includes at least one of movement trajectory information, movement speed information, and sound width variation information of a sound source in the three-dimensional space, and generate the three-dimensional spatial metadata for each track signal using the determined preset template.
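A preset template carrying the three fields named in the text (movement trajectory, movement speed, sound width variation) might be modeled as a small record; the field names and the template table below are hypothetical stand-ins, not the disclosed template format:

```python
from dataclasses import dataclass

@dataclass
class MotionTemplate:
    """Hypothetical preset template for one track's spatial motion."""
    trajectory: str     # e.g. "circle" or "approach"
    speed: float        # e.g. revolutions (or metres) per second
    width_delta: float  # change applied to the apparent source width

# Illustrative template table keyed by track type.
TEMPLATES = {
    "vocal": MotionTemplate("approach", 0.5, 0.5),
    "drums": MotionTemplate("circle", 1.0, 0.0),
}
```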

Optionally, the metadata generation module 603 may generate the three-dimensional spatial metadata for the multiple track signals according to the determined preset templates based on the feature information of the audio signal to be processed.

Optionally, the metadata generation module 603 may acquire setting information input by the user, where the setting information includes at least one of the movement trajectory, movement speed, and sound width variation value of each sound source corresponding to the multiple track signals in the three-dimensional space, and generate the three-dimensional spatial metadata for each track signal based on the setting information.

For example, the metadata generation module 603 may receive a user input for setting at least one of the movement trajectory, movement speed, and sound width variation value of each sound source corresponding to the multiple track signals in the three-dimensional space, and generate the three-dimensional spatial metadata for the multiple track signals based on the user input.

The rendering module 604 may generate a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated track signals, and the three-dimensional spatial metadata corresponding to each track signal.

The rendering module 604 may identify the type of the playback device used to play the three-dimensional audio signal, acquire a rendering strategy corresponding to that type, and generate, through the acquired rendering strategy, the three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated track signals, and the three-dimensional spatial metadata corresponding to each track signal.

For an in-ear playback device, the rendering module 604 may, for each of the multiple track signals, generate the three-dimensional audio signal corresponding to the track signal based on the orientation information corresponding to each audio frame of the track signal and the user orientation information, and adjust the sound width of that three-dimensional audio signal based on the sound width information of each audio frame.
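For the in-ear case, a constant-power pan by the source azimuth relative to the listener's facing direction gives a crude stand-in for per-frame binaural rendering (real renderers convolve with HRTFs rather than pan); `binaural_pan` and its angle convention are illustrative assumptions:

```python
import math

def binaural_pan(sample, source_azimuth, user_azimuth):
    """Constant-power left/right pan from the source azimuth relative
    to the listener's facing direction (radians, positive to the right).
    A sketch standing in for HRTF-based binaural rendering."""
    rel = source_azimuth - user_azimuth
    pan = (math.sin(rel) + 1.0) / 2.0        # 0 = hard left, 1 = hard right
    left = sample * math.cos(pan * math.pi / 2.0)
    right = sample * math.sin(pan * math.pi / 2.0)
    return left, right
```

A source directly ahead of the listener yields equal left and right gains, and the squared gains always sum to one (constant power).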

For a loudspeaker playback device, the rendering module 604 may, for each of the multiple track signals, render the track signal based on the orientation information of the corresponding sound source and the orientation information of the multiple loudspeakers to generate the three-dimensional audio signal corresponding to the track signal, and adjust the sound width of that three-dimensional audio signal based on the sound width information of the sound source.
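For the loudspeaker case, panning a source between a symmetric speaker pair by the tangent law illustrates how source and speaker azimuths combine into per-speaker gains; this is a sketch of pairwise amplitude panning under assumed conventions, not the disclosed rendering strategy:

```python
import math

def pair_gains(source_az, half_angle):
    """Tangent-law amplitude panning for a symmetric speaker pair at
    ±half_angle (radians; positive azimuth toward the right speaker),
    for a source lying between the two speakers."""
    r = math.tan(source_az) / math.tan(half_angle)   # -1 .. 1 inside the pair
    g_left, g_right = (1.0 - r) / 2.0, (1.0 + r) / 2.0
    norm = math.hypot(g_left, g_right)               # constant-power normalisation
    return g_left / norm, g_right / norm
```

A centered source gets equal gains; a source at a speaker's azimuth collapses onto that speaker alone.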

The manner of converting a conventional stereo signal into a three-dimensional audio signal has been described in detail above with reference to FIG. 1 to FIG. 4 and will not be repeated here.

According to an embodiment of the present disclosure, an electronic device may be provided. FIG. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 700 may include at least one memory 702 and at least one processor 701, where the at least one memory 702 stores a set of computer-executable instructions which, when executed by the at least one processor 701, perform the audio generation method according to an embodiment of the present disclosure.

The processor 701 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor 701 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

The memory 702, as a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, a program for executing the audio generation method of the present disclosure, and a database.

The memory 702 may be integrated with the processor 701; for example, RAM or flash memory may be arranged within an integrated circuit microprocessor. In addition, the memory 702 may include an independent device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 702 and the processor 701 may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processor 701 can read files stored in the memory 702.

In addition, the electronic device 700 may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device 700 may be connected to one another via a bus and/or a network.

Those skilled in the art will understand that the structure shown in FIG. 7 is not limiting; the device may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.

According to an embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, where the instructions, when executed by at least one processor, cause the at least one processor to perform the audio generation method according to the present disclosure. Examples of the computer-readable storage medium here include read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store, in a non-transitory manner, a computer program and any associated data, data files, and data structures, and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures may be distributed over networked computer systems so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

According to an embodiment of the present disclosure, a computer program product may also be provided, where instructions in the computer program product can be executed by a processor of a computer device to complete the above audio generation method.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of audio generation, the method comprising:
acquiring an audio signal to be processed;
obtaining a plurality of audio track signals for a plurality of sound sources by performing track separation on the audio signal to be processed;
determining user orientation information in a three-dimensional space, and generating three-dimensional spatial metadata corresponding to each of the plurality of audio track signals, wherein the three-dimensional spatial metadata comprises orientation information and sound width information of a corresponding sound source in the three-dimensional space; and
generating a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each of the audio track signals.
2. The method of claim 1, wherein generating three-dimensional spatial metadata corresponding to each of the plurality of audio track signals comprises:
acquiring feature information of the audio signal to be processed, wherein the feature information comprises at least one of beat information and structure information of the audio signal to be processed, and the structure information comprises type information of each audio segment of the audio signal to be processed; and
generating the three-dimensional spatial metadata corresponding to each audio track signal based on the feature information.
3. The method of claim 2, wherein generating the three-dimensional spatial metadata corresponding to each audio track signal based on the feature information comprises:
for a vocal signal among the plurality of audio track signals, determining, according to the type of each audio segment in the structure information, position adjustment information of a sound source corresponding to the vocal signal relative to the user orientation information in the three-dimensional space, and determining the three-dimensional spatial metadata of the vocal signal based on the position adjustment information.
4. The method of claim 2, wherein generating the three-dimensional spatial metadata corresponding to each audio track signal based on the feature information comprises:
for an instrument signal among the plurality of audio track signals, determining movement information of the sound source corresponding to the instrument signal in the three-dimensional space according to the beat information and the type of each audio segment in the structure information, and determining the three-dimensional spatial metadata of the instrument signal based on the movement information.
5. The method of claim 1, wherein generating three-dimensional spatial metadata corresponding to each of the plurality of audio track signals comprises:
determining, from a plurality of preset templates, a preset template corresponding to each audio track signal, wherein the preset template comprises at least one of movement trajectory information, movement speed information, and sound width variation information of a corresponding sound source in the three-dimensional space; and
generating the three-dimensional spatial metadata for each audio track signal using the determined preset template.
6. The method of claim 1, wherein generating three-dimensional spatial metadata corresponding to each of the plurality of audio track signals comprises:
acquiring setting information input by a user, wherein the setting information comprises at least one of a movement trajectory, a movement speed, and a sound width variation value of each sound source corresponding to the plurality of audio track signals in the three-dimensional space; and
generating the three-dimensional spatial metadata for each of the audio track signals based on the setting information.
7. An apparatus for audio generation, the apparatus comprising:
an acquisition module configured to acquire an audio signal to be processed;
a track separation module configured to obtain a plurality of audio track signals for a plurality of sound sources by performing track separation on the audio signal to be processed;
a metadata generation module configured to determine user orientation information in a three-dimensional space and generate three-dimensional spatial metadata corresponding to each of the plurality of audio track signals, wherein the three-dimensional spatial metadata comprises orientation information and sound width information of a corresponding sound source in the three-dimensional space; and
a rendering module configured to generate a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each of the audio track signals.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio generation method of any of claims 1 to 6.
9. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the audio generation method of any of claims 1 to 6.
10. A computer program product comprising instructions which, when executed by at least one processor of an electronic device, perform the audio generation method of any of claims 1 to 6.
CN202210448723.2A 2022-04-26 2022-04-26 Audio generation method and device, electronic equipment and storage medium Pending CN114827886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210448723.2A CN114827886A (en) 2022-04-26 2022-04-26 Audio generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114827886A true CN114827886A (en) 2022-07-29

Family

ID=82508081


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035209A (en) * 2021-02-25 2021-06-25 北京达佳互联信息技术有限公司 Three-dimensional audio acquisition method and three-dimensional audio acquisition device
CN113784274A (en) * 2020-06-09 2021-12-10 美国Lct公司 3D audio system
CN113821190A (en) * 2021-11-25 2021-12-21 广州酷狗计算机科技有限公司 Audio playing method, device, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination