
CN118972776A - 3D audio system - Google Patents

3D audio system

Info

Publication number
CN118972776A
CN118972776A (Application No. CN202411229561.9A)
Authority
CN
China
Prior art keywords
sound
audio
dimensional
tracks
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411229561.9A
Other languages
Chinese (zh)
Inventor
李齐
丁寅
J·奥兰
J·泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
American Lct Co
Original Assignee
American Lct Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/227,067 (US11240621B2)
Application filed by American Lct Co filed Critical American Lct Co
Publication of CN118972776A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/12 Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00 Stereophonic arrangements
    • H04R5/02 Spatial or constructional arrangements of loudspeakers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01 Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/05 Generation or adaptation of centre channel in multi-channel audio systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/13 Application of wave-field synthesis in stereophonic audio systems

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Stereophonic System (AREA)

Abstract

This description includes techniques for defining filters, and a distribution mesh of those filters, on a grid in a three-dimensional space that can be presented in a user interface. Multiple audio tracks can be separated from a mono or stereo input, or determined from the output of a microphone array, where each track is associated with a corresponding sound-source position in the three-dimensional space. The relative positions can be set by the listener in the user interface. According to the relative positions of the listener and the sound sources in the three-dimensional space, one or more pairs of filters are selected from the distribution mesh of filters. Each pair of filters is applied to the corresponding track to generate multiple filtered tracks, and three-dimensional sound is then generated from the multiple filtered tracks for listening over headphones. If multiple speakers are used, each separated or determined track can instead be passed directly (i.e., all-pass filtered) through an amplifier to drive a corresponding speaker. The sound from the multiple speakers then forms a three-dimensional sound field directly, so the three-dimensional sound can be heard without wearing headphones.

Description

3D audio system

This application is a divisional application of Chinese patent application No. 202110585625.9, filed May 27, 2021, and entitled "Three-dimensional Audio System" (which is based on and claims priority to U.S. Provisional Application No. 63/036,797, filed June 9, 2020, and U.S. Application No. 17/227,067, filed April 9, 2021). The entire contents of each application are incorporated herein by reference.

Technical Field

The present description relates to the generation of three-dimensional sound and, in particular, to systems and methods for picking up sound, or for processing mixed audio tracks into separated sound types and then applying transfer functions to the separated sounds, to generate three-dimensional sound that contains spatial information about the sound sources, with the three-dimensional (3D) sound field recreated according to user settings.

Background Art

Billions of people around the world listen to music, but most listeners can hear it only in mono or stereo formats, stereo being a method of sound reproduction. Stereo sound, however, generally refers to just two audio channels played through two speakers or headphones. More immersive technologies such as surround sound require multiple audio tracks to be recorded and stored (for example, a 5.1 or 7.1 surround setup), and the sound must be played through an equal number of speakers. Each audio channel or track contains mixed sound from multiple sound sources. Stereo sound therefore differs from "real" sound (e.g., the sound in front of a concert stage) because spatial information about the individual sound sources (e.g., the positions of instruments and vocals) is not reflected in it. The method of the present description can use multiple independent audio channels played through multiple speakers or headphones, so that the sound from the speakers or headphones arrives from various directions, as in natural hearing.

With two ears, a person can perceive spatial information and hear "real" three-dimensional (3D) sound as binaural sound (i.e., sound as perceived by the left and right ears), for example music heard binaurally in a concert hall or theater, or at a sporting event in a stadium or arena. As noted above, however, today's music technology usually provides only mono or stereo sound without spatial cues or spatial information. Music heard through headphones or earbuds, over speakers, or even on a multi-track, multi-speaker surround system therefore differs from music experienced in theaters, arenas, and concert halls, where the experience is usually more enjoyable. Generating 3D sound is currently possible, for example with many speakers mounted on the walls of a movie theater, each driven by a separate track recorded during film production. Such a 3D audio system, however, can be very expensive and cannot be implemented as an application on mobile devices, or even in most home theaters or car audio systems. Consequently, in today's music and entertainment industry most music and other audio data is stored and played as mono or stereo sound, in which all sound sources (such as vocals and the various instruments) are pre-mixed into only one (mono) or two (stereo) tracks.

Most of the audio from video-conferencing devices (such as computers, laptops, smartphones, or tablets) is mono sound. Although on the display screen a user (e.g., an attendee or participant) can see all participants of a meeting in separate windows, the audio is usually a single mono channel with narrow bandwidth. Using the video of the individual attendees, a virtual conference room can be realized, but the audio component cannot match the video component, because the 3D sound needed for a more accurate (e.g., spatial) virtual-reality sound experience cannot be provided. Moreover, when two attendees have similar-sounding voices, the user may be unable to tell the voices apart when they speak simultaneously or even separately. This can happen, for example, when the user is looking at a shared document in another screen or video window rather than at the attendees' faces. The problem can become more serious as more attendees join the video conference (e.g., a distance-learning classroom). Users need spatial information, such as 3D sound, to help identify which attendee is speaking.

BRIEF DESCRIPTION OF THE DRAWINGS

The present description will be more fully understood from the detailed description given below and from the accompanying drawings of its various embodiments. The drawings, however, should not be taken to limit the description to the specific implementations; they are for explanation and understanding only.

FIGS. 1A-1B illustrate systems for generating three-dimensional sound implemented in accordance with the present description.

FIGS. 2A-2B illustrate the spatial relationship between a sound source and a listener in a three-dimensional space, and the data structure and retrieval of the HRTF filters used to generate 3D sound reflecting that spatial relationship, implemented in accordance with the present description.

FIG. 3 illustrates a system for training a machine learning model to separate mixed audio tracks, implemented in accordance with the present description.

FIG. 4 illustrates a system for separating and filtering mixed audio tracks using transformed sound signals, implemented in accordance with the present description.

FIGS. 5A-5E show, as waveforms and spectrograms, an original mixed sound and that mixed sound separated into vocals, drums, bass, and other sounds, respectively, in accordance with the present description.

FIG. 6 illustrates far-field voice control of a 3D binaural music system with music retrieval through speech and sound separation, implemented in accordance with the present description.

FIGS. 7A-7D illustrate a GUI for user settings of 3D sound implemented in accordance with the present description, including listener positions within the band (7A-7C) and in front of the band (7D).

FIG. 8 illustrates a system for generating 3D sound using a microphone array, implemented in accordance with the present description.

FIGS. 9A-9B illustrate beam patterns of a 3D microphone and of a 3D microphone array with spatial noise cancellation, implemented in accordance with the present description.

FIG. 10 illustrates a conference or virtual concert system for generating three-dimensional sound in accordance with the present description.

FIG. 11 illustrates a virtual conference room displayed as the GUI of a conference system for generating three-dimensional sound, implemented in accordance with the present description.

FIG. 12 illustrates a method for generating three-dimensional sound implemented in accordance with the present description.

FIG. 13 illustrates another method for generating three-dimensional sound implemented in accordance with the present description.

FIG. 14 illustrates a block diagram of the hardware of a computer system according to one or more implementations of the present description.

DETAILED DESCRIPTION

This document describes a three-dimensional (3D) configurable soundstage audio system, together with its applications and implementations. A three-dimensional (3D) sound field refers to sound that includes discrete sound sources located at different spatial positions. A 3D soundstage is sound that represents a 3D sound field. For example, soundstage music can give a listener, when listening to a given piece of music through earbuds, headphones, or speakers, an auditory perception of the independent positions of the instruments and vocal sources. In general, a 3D soundstage can give the listener a perception of spatial information. The soundstage is configurable: it can be set by the listener, a DJ, software, or the audio system. For example, the position of each instrument in the 3D sound field can be moved, and the listener's position in the 3D sound field can be moved, statically or dynamically, close to a favorite instrument's position.

To listen to or play a 3D soundstage, the listener can use binaural sound represented by two tracks, one for the left ear and one for the right ear, which lets the listener perceive the spatial position information associated with the sound sources. Through earphones, headsets, or other such devices, binaural sound can be experienced like 3D sound (e.g., perceived as coming from different positions). Alternatively, 3D sound can be played directly as a 3D soundstage. In direct 3D sound, sounds are played from a set of speakers located at different 3D positions (e.g., corresponding to the desired positions of the individual sound sources in the 3D sound field). Each speaker can play a single separated track, for example one speaker for the drums and another for the bass. The listener can hear the 3D sound field directly from the speakers, because the speakers form a real-world 3D sound field. In both cases, binaural and direct 3D sound, the listener's brain can perceive the 3D sound field and can identify and track discrete individual sound sources, just as in the real world; in this description this may be referred to as acoustic virtual reality.

Another way to realize a 3D sound field is to record binaural sound directly with a dedicated binaural/3D microphone. Most existing binaural microphones are simply dummy heads with microphones installed in the ears, which may be too large and/or too expensive for many applications. This document therefore describes a 3D microphone that can be made small by using a very small microphone array together with signal-processing techniques. Such a small 3D microphone can be used with any handheld recording device (such as a smartphone or tablet). The sound output picked up by the 3D microphone can be a binaural, stereo, or multi-track recording, in which each track corresponds to one spatial direction associated with a sound source of the 3D sound field.

In this description, three techniques, described below, are used to enhance the signal-to-noise ratio (SNR) of audio signals. Noise reduction is the process of reducing background noise in an audio channel based on temporal information, such as the statistical properties of signal versus noise or the frequency distributions of different kinds of signals. A microphone array uses one or more sound beam patterns to enhance sound arriving from a beam direction while suppressing sound from outside that direction. An acoustic echo canceller (AEC) uses one or more reference signals to cancel the corresponding signals mixed into the signal captured by the microphones; the reference signal is correlated with the signal the AEC is to cancel.

System

FIGS. 1A-1B illustrate systems 100A and 100B for generating three-dimensional sound implemented in accordance with the present description. Systems 100A and 100B may be stand-alone computer systems or networked computing resources implemented in a computing cloud.

Referring to FIG. 1A, system 100A may include a sound separation unit 102A, a storage unit 104A for storing multiple filters (such as head-related transfer function (HRTF) filters, all-pass filters, or equalization filters), a signal processing unit 106A, and a 3D sound field setting unit 108A having a graphical user interface (GUI) 110A to receive user input. For brevity of discussion, the filters are referred to below as HRTF filters, although it should be understood that they can be any suitable type of filter, including all-pass or equalizer filters. The sound separation unit 102A, the storage unit 104A, and the 3D sound field setting unit 108A can be connected to the signal processing unit 106A. The signal processing unit 106A can be a programmable device that can be programmed according to settings made in GUI 110A on a user interface device (not shown) to generate three-dimensional sound.

In the example of FIG. 1A, the input to the sound separation unit 102A is an original mixed track consisting of a mono or stereo signal or audio, and the outputs of the signal processing unit 106A are 3D binaural audio for the left and right ears, respectively. Each of the input mixed tracks or channels can first be separated by the sound separation unit 102A into a set of separated tracks (e.g., sound sources associated with one sound type or a group of sound types), where each track represents one type (or category) of sound, such as vocals, drums, bass, or other (e.g., based on the nature of the corresponding sound source).

The signal processing unit 106A can then process each separated track with a pair of HRTF filters from the storage unit 104A, so that for each separated track it outputs two audio channels representing the left-ear and right-ear channels, respectively. In one implementation, this processing can be performed in parallel on the individual input mixed tracks.

Each pair of HRTF filters (e.g., the pair of left and right HRTF filters 200B of FIG. 2B described below) can be associated with a point on a grid in the three-dimensional space (e.g., the HRTF filters can be stored in a database as a distribution network of grid points), and each grid point can be represented by two parameters: an azimuth angle θ and an elevation angle γ (e.g., 202B and 204B of FIG. 2B, respectively). The mesh of HRTF filters (e.g., 200B) can be an array of pre-computed or pre-measured left/right HRTF filter pairs defined on a grid in the three-dimensional space (e.g., 200A), where each grid point has an associated pair of left and right HRTF filters. An HRTF filter pair can be retrieved by applying an activation function, where the input of the activation function can include the relative position and distance/range between the sound source and the listener, and the output can be the HRTF database index determined for retrieving the filter pair defined at a grid point. For example, in one activation-function implementation, the inputs can be the azimuth angle θ and the elevation angle γ, and the output is the database index used to retrieve the left/right HRTF filter pair. The retrieved HRTF filters can then be used to filter the separated tracks. For each separated track, the activation function is called to retrieve the corresponding HRTF filter pair. The values of θ and γ can be determined from the user's settings. For example, as shown in FIG. 7A, if the azimuth angle θ has the values 0° (vocals), 30° (drums), 180° (bass), and 330° (keyboards) while the elevation angle γ is 0, then four pairs of HRTF filters need to be retrieved by the activation function to filter the four separated tracks, respectively.
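
A minimal sketch of such a mesh lookup is given below, assuming filter pairs sampled every 30° in azimuth and every 10° in elevation; the grid spacing, the HrtfPair structure, and the nearest-grid-point activation function are illustrative assumptions, not a required implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HrtfPair:
    left: np.ndarray   # left-ear impulse response
    right: np.ndarray  # right-ear impulse response


AZ_STEP, EL_STEP = 30, 10  # assumed grid spacing in degrees

# Assumed database: one filter pair per (azimuth, elevation) grid point.
# Dummy 256-tap impulse responses stand in for measured HRTFs.
hrtf_mesh = {
    (az, el): HrtfPair(np.zeros(256), np.zeros(256))
    for az in range(0, 360, AZ_STEP)
    for el in range(-90, 91, EL_STEP)
}


def activation(theta_deg: float, gamma_deg: float) -> tuple[int, int]:
    """Map a (theta, gamma) direction to the index of the nearest grid point."""
    az = int(round(theta_deg / AZ_STEP)) * AZ_STEP % 360
    el = max(-90, min(90, int(round(gamma_deg / EL_STEP)) * EL_STEP))
    return az, el


def retrieve_pair(theta_deg: float, gamma_deg: float) -> HrtfPair:
    """The activation function's output indexes the stored filter mesh."""
    return hrtf_mesh[activation(theta_deg, gamma_deg)]


# Example: filter pair for a drum track placed 30 degrees to the right, at ear level.
drum_filters = retrieve_pair(30.0, 0.0)
```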

As shown in FIGS. 2A and 2B, the listener (e.g., 202A) and/or the sound sources (e.g., 204A) can move, with the angles θ and γ changing over time. A series of new HRTF filter pairs (e.g., 200B) may then need to be retrieved dynamically in order to output the correct binaural sound that virtually represents the sound received by the listener (e.g., 202A) in the 3D sound space (e.g., 200A). Dynamic retrieval of HRTF filters can be facilitated by storing the filters as a distribution mesh, because a stored HRTF filter pair may already be associated with any point on the grid in the three-dimensional space at which the listener and/or a sound source can be located during movement. The range R (210A) can be represented by the volume of the filtered sound: the shorter the distance between the listener and the sound source, the louder the volume.

The left audio tracks output by all the filters can then be mixed to generate the left channel of the binaural sound (e.g., binaural L), and all the right channels can be mixed to generate the right channel (e.g., binaural R). When the L and R channels are played simultaneously through earphones or a headset, the listener can experience 3D binaural sound and perceive the spatial positions of the sound sources in the 3D sound field.
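
A minimal sketch of this per-track filtering and mixdown, assuming time-domain convolution, equal-length tracks, and a lookup function such as the retrieve_pair sketch above:

```python
from typing import Callable

import numpy as np
from scipy.signal import fftconvolve


def render_binaural(tracks: dict[str, np.ndarray],
                    positions: dict[str, tuple[float, float]],
                    retrieve: Callable) -> np.ndarray:
    """tracks: name -> equal-length mono samples; positions: name -> (theta, gamma)
    in degrees; retrieve: (theta, gamma) -> object with .left/.right impulse
    responses, e.g. the retrieve_pair sketch above."""
    left, right = 0.0, 0.0
    for name, samples in tracks.items():
        pair = retrieve(*positions[name])                 # HRTF pair for this source
        left = left + fftconvolve(samples, pair.left)     # left-ear contribution
        right = right + fftconvolve(samples, pair.right)  # right-ear contribution
    return np.stack([left, right])  # binaural (L, R) output
```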

In addition, the listener can use GUI 110A to set the position and/or volume of each sound source, and/or of the listener, in the 3D sound field. Virtually (e.g., in acoustic virtual reality), the listener and the sound sources can be positioned anywhere within the 3D sound field, and the volume of each sound source can be scaled according to the distance from the listener's position to that source's position in the 3D sound field, closer sources sounding louder. For example, the source positions and/or volumes can be set through GUI 110A, which can be presented via a user interface device, for example in the form of a touch screen on a smartphone (FIGS. 7A-7D) or a tablet. In one implementation, the virtual position of the vocal source in the 3D sound field can be in front of the listener, the drum source to the listener's front right, the bass source behind the other sources relative to the listener (e.g., farther away), and the "other" instruments (e.g., of unidentified sound type or category) to the listener's front left; the drum and bass sources can be made louder, and the vocals and "other" sources softer, by positioning the listener (the virtual head) near the drums and bass (FIG. 7C). The listener can then hear this 3D sound field, according to his or her own settings, from the binaural outputs (e.g., binaural L and binaural R). If the virtual head is placed at the same position as an instrument (e.g., FIG. 7B), the listener hears that instrument's solo sound.

In one implementation, to generate the binaural output (e.g., binaural L+R) as shown in FIG. 1A, for each separated track associated with a corresponding sound-source position, a pair of corresponding HRTF filters can be selected (e.g., from the storage unit 104A) to process (e.g., by the signal processing unit 106A) that separated track into two outputs: L and R audio. Finally, a mixer (not shown) can mix all the L tracks and all the R tracks, respectively, to output the binaural L and R signals. The selection of the corresponding HRTF filters is discussed in further detail below (e.g., see the description of FIG. 2). If the mixed track is stereo (two tracks), each track needs to go through the above processing to generate the mixed binaural sound. When the L and R channels are played simultaneously through earphones or a headset, the listener can experience 3D binaural sound and perceive the 3D sound field.

Referring to FIG. 1B, system 100B may include a sound separation unit 102B, a 3D signal processing unit 104B, amplifiers 106B, speakers 108B, and a 3D sound field setting unit 110B having a graphical user interface (GUI) 112B for receiving user input. The sound separation unit 102B and the 3D sound field setting unit 110B can be connected to the signal processing unit 104B. The signal processing unit 104B can be a programmable device that can be programmed to perform three-dimensional sound generation according to settings received via GUI 112B presented on a user interface device (not shown).

In the example of FIG. 1B, the input to the sound separation unit 102B is an original mixed track consisting of a mono or stereo signal or audio, and the output of the 3D signal processing unit 104B is a set of tracks that drive multiple speakers 108B through the amplifiers 106B. Each of the input mixed tracks or channels can first be separated by the sound separation unit 102B into a set of separated tracks (e.g., one per corresponding sound source or type), where each track represents one type (or category) of sound, such as vocals, drums, bass, or other (e.g., based on the nature of the corresponding sound source). Each separated track can then be processed by the 3D signal processing unit 104B to output a single track which, for each processed track, drives one speaker 108B through one amplifier 106B. In one implementation, this processing can be performed in parallel on the individual input mixed tracks. All the output tracks can then be played through the speakers 108B (e.g., at different positions in the real world) to form a real-world 3D sound field at the listener's real-world position.

As noted above with respect to FIG. 1A, the listener can use GUI 112B to set the position and/or volume of each sound source and/or of the listener in the 3D sound field. Virtually (e.g., in acoustic virtual reality), the listener and the sound sources can be positioned anywhere within the 3D sound field, and the volume of each source can be scaled according to the distance from the listener's position to the source's position, closer sources sounding louder. For example, the source positions and/or volumes can be set through GUI 112B, which can be presented via a user interface device, for example in the form of a touch screen on a smartphone or tablet. The listener can then hear, from the output of the speakers 108B, the 3D sound field according to his or her own settings.

Implementations of GUI 110A and GUI 112B can be seen in FIGS. 7A-7D, which are described in detail below.

FIGS. 2A-2B illustrate the spatial relationship between a sound source 204A and a listener 202A in a 3D space 200A, and the selection of the HRTF filters 200B used to generate 3D sound reflecting that spatial relationship, implemented in accordance with the present description.

Head-related transfer function (HRTF) filters (e.g., similar to those stored in the storage unit 104A of FIG. 1A) can characterize how a human listener (with external human ears on the head), at a first specified position in three-dimensional space, receives sound from a sound source at a second specified position in the same 3D space. As a sound wave reaches the listener, the size and shape of the head, ears, and ear canals, the density of the hair, and the size and shape of the nasal and oral cavities all transform the sound and affect the listener's perception by boosting some frequencies and attenuating others. The envelope of the response spectrum can be more complex than a simple boost or attenuation, however: it may affect a wide range of the spectrum, and/or it may change significantly with the direction of the sound.

With two ears (e.g., binaural hearing), a listener can localize sound in three dimensions: in range (distance); above and below; and in front, behind, and to either side. This is possible because the brain, the inner ear, and the external ears (pinnae) work together to infer position. The listener can estimate the location of a source by taking cues from each ear (monaural cues) and by comparing the cues received at the two ears (difference cues, or binaural cues). What the brain perceives are the differences in arrival time and in intensity of the sound at each ear. The monaural cues come from the interaction between the sound source and the listener's anatomy: the original source sound is modified by the inner ear and the external ear (pinna) before entering the ear canal for processing by the cochlea and the brain. These modifications encode the source position, and they can be captured via the relationship between the source position and the listener position. A track filter based on this relationship is referred to herein as an HRTF filter. Convolving a track with a pair of HRTF filters transforms the sound to generate binaural signals for the left and right ears, respectively, such that if the source sound were played at the position associated with that pair of HRTF filters, the binaural sound signals (e.g., binaural L+R of FIG. 1A) would correspond to the real-world 3D sound-field signals that would be heard at the listener's position.

A pair of binaural tracks, for the listener's left and right ears, can be used to generate from mono or stereo a binaural sound that appears to come from a specific position in space. An HRTF filter is a transfer function describing how sound from a specific position in 3D space reaches the listener's position (usually at the outer end of the listener's ear canal). HRTF filters can be implemented as convolutions in the time domain, or as multiplications in the frequency domain to save computation time, as shown in FIG. 4 (described more fully below). Multiple pairs of HRTF filters can be applied to multiple tracks from multiple sound sources to generate a 3D sound field represented as binaural sound signals. The corresponding HRTF filters can be selected based on the listener's settings (i.e., the desired positions of the sound sources relative to the listener).
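
To make the two stated implementations concrete, the following sketch compares them on arbitrary illustrative signals; the frequency-domain product must use an FFT length of at least signal length + filter length - 1 to avoid circular-convolution aliasing:

```python
import numpy as np
from scipy.signal import fftconvolve

x = np.random.randn(1024)       # one separated track (illustrative)
h_left = np.random.randn(256)   # a left-ear HRTF impulse response (illustrative)

# Time domain: convolve the track with the filter.
y_time = fftconvolve(x, h_left)

# Frequency domain: multiply the spectra, then transform back.
n = len(x) + len(h_left) - 1    # linear-convolution length
y_freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h_left, n), n)

assert np.allclose(y_time, y_freq)  # identical up to numerical precision
```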

Referring to FIG. 2A, the 3D sound space 200A in which the sound sources (e.g., 204A) and the listener 202A are located can be represented as a grid with a polar coordinate system. The relative position of, and distance between, the listener 202A and the sound source 204A can be determined from three parameters: the azimuth angle θ (202B of FIG. 2B), the elevation angle γ (204B of FIG. 2B), and the radius R (210A).
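
For reference, a sketch of converting a source position given in Cartesian coordinates relative to the listener into these three parameters; the axis convention (x toward the listener's front, z up) is an assumption:

```python
import math


def to_polar(dx: float, dy: float, dz: float) -> tuple[float, float, float]:
    """Return (theta, gamma, R): azimuth and elevation in degrees, range in meters."""
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    theta = math.degrees(math.atan2(dy, dx)) % 360.0           # azimuth around the head
    gamma = math.degrees(math.asin(dz / r)) if r > 0 else 0.0  # elevation from ear level
    return theta, gamma, r
```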

Referring to FIG. 2B, the corresponding HRTF filters 200B for each position of the listener in the 3D space 200A can be measured, generated, saved, and stored as functions of the polar coordinate system representing the 3D space 200A. Each HRTF filter 200B (e.g., a pair of left and right HRTF filters) can be associated with a point on the grid (e.g., the HRTF filters are stored as a distribution mesh), and each grid point can be represented by two parameters: the azimuth angle θ 202B and the elevation angle γ 204B. From the user's settings, the system (e.g., 100A of FIG. 1A) will know the spatial relationship between each sound source (e.g., 204A) and the listener 202A; that is, the system will know α 206A, β 208A, and R 210A. Therefore, based on θ=α and γ=β, the system can retrieve, for the separated track associated with the sound source 204A, the corresponding HRTF filter pair 200B (e.g., HRTF_Right and HRTF_Left) for the listener's left and right ears. The track of the sound source 204A can then be processed (e.g., by the signal processing unit 106A of FIG. 1A) using the retrieved HRTF filters 200B. The output volume of the generated 3D sound can be a function of the radius R 210A: the shorter R 210A is, the louder the output 3D volume.
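
One common choice of volume function, shown here purely as an illustrative assumption (the description fixes only that shorter R means louder output), is inverse-distance attenuation clamped at a reference radius:

```python
def distance_gain(r: float, r_ref: float = 1.0) -> float:
    """Amplitude scale for a source at range R: full volume inside r_ref, 1/R beyond."""
    return r_ref / max(r, r_ref)


print(distance_gain(0.5), distance_gain(2.0))  # 1.0 0.5
```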

In one embodiment, for multiple sound sources (such as sound source 204A), the system can repeat the above filter retrieval and filtering operations for each source and then combine (e.g., mix) the filtered tracks together for the final binaural output, or for stereo output (better than mono) to two speakers.

As noted above with respect to FIG. 1A, the listener 202A and/or the sound source 204A may move, with the angles θ and γ changing over time. A series of new HRTF filter pairs 200B may then need to be retrieved dynamically in order to output the correct binaural sound that virtually represents the sound received by the listener 202A in the 3D sound space 200A. Dynamic retrieval of the HRTF filters 200B can be facilitated by storing the filters as a distribution mesh, because a stored HRTF filter pair may already be associated with any point on the grid in the three-dimensional space at which the listener and/or a sound source can be located during movement.

FIG. 3 illustrates a system 300 for training a machine learning model 308 to separate mixed audio tracks, according to one embodiment of the present description.

Although music can be recorded on multiple tracks using multiple microphones, with each individual track representing one instrument or voice recorded in the studio, the music streams consumers most often receive are mixed down to stereo sound. The costs of recording, storing, transmitting (bandwidth), and playing multi-track audio can be very high, so most existing music recording and delivery devices (radios or smartphones) are set up for mono or stereo sound only. To generate a 3D soundstage from the conventional mixed-track formats (mono and stereo), the system (e.g., system 100A of FIG. 1A or 100B of FIG. 1B) needs to separate each mixed track into multiple tracks, each of which represents, or separates out, one type (or category) of sound or instrument. This separation can be performed according to a mathematical model and a corresponding software or hardware implementation, where the input is the mixed track and the outputs are the separated tracks. In one embodiment, for stereo input, the left and right tracks can be processed jointly or separately (e.g., by the sound separation unit 102A of FIG. 1A or the sound separation unit 102B of FIG. 1B).

Machine learning, in this description, refers to methods implemented in software on hardware processing devices that use statistical techniques and/or artificial neural networks to give a computer the ability to "learn" from data (i.e., progressively improve performance on a specific task) without being explicitly programmed. Machine learning can use a parameterized model (called a "machine learning model"), which can be implemented with supervised/semi-supervised, unsupervised, or reinforcement learning methods. Supervised/semi-supervised methods use labeled training samples to train the machine learning model. To perform a task with a supervised machine learning model, a computer can train the model on samples (commonly called "training data") and adjust the model's parameters based on a performance measure (e.g., the error rate). The process of adjusting the model's parameters (commonly called "training the machine learning model") produces a specific model that performs the actual task it was trained for. After training, the computer can receive new data inputs associated with the task and, based on the trained model, compute estimated outputs that predict the task's results. Each training sample can include input data and the corresponding desired output data, where the data can be in a suitable form, such as alphanumeric symbols or vectors of numerical values, as representations of the audio tracks.

The learning process can be iterative. It can include a forward-propagation pass that computes an output from the machine learning model and the input data fed into it, followed by computing the difference between the desired output data and the computed output. It can further include a back-propagation pass that adjusts the parameters of the machine learning model based on the computed difference.
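
A minimal sketch of such an iterative forward/back-propagation loop, written in PyTorch purely as an illustrative assumption (no particular framework is prescribed here); the model is assumed to map a mixture spectrogram to four source estimates:

```python
import torch
import torch.nn as nn


def train_epoch(model: nn.Module, loader, optimizer) -> None:
    """One pass over (mixture, sources) training pairs."""
    loss_fn = nn.L1Loss()
    for mixture, sources in loader:         # mixture: (B, 1, F, T); sources: (B, 4, F, T)
        estimates = model(mixture)          # forward propagation
        loss = loss_fn(estimates, sources)  # difference from the desired outputs
        optimizer.zero_grad()
        loss.backward()                     # back-propagate the difference
        optimizer.step()                    # adjust the model parameters
```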

The parameters of the machine learning model 308 for separating mixed tracks can be trained with machine learning, statistical, or signal processing techniques. As shown in FIG. 3, the machine learning model 308 can operate in two phases: a training phase and a separation phase. In the training phase of the machine learning model 308, audio or music recordings of mixed sound can be used as the input to the feature extraction unit 302, and the corresponding separated tracks can be used as the targets of the separation model training unit 304, i.e., as samples of the desired separation output. The separation model training unit 304 can include a data processing unit comprising a data normalization/data perturbation unit 306 and the feature extraction unit 302. Data normalization normalizes the input training data so that the data have similar dynamic ranges. Data perturbation generates plausible variations of the data in order to cover more signal conditions than are available in the training data, providing more data for more training. Data normalization and perturbation can be optional, depending on the amount of available data.
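
The following sketch suggests what the normalization and perturbation steps might do, assuming peak normalization and simple gain/time-shift perturbations; the specific perturbations are illustrative assumptions:

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """Scale a waveform so its peak magnitude is 1 (similar dynamic range across clips)."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x


def perturb(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Generate a plausible variation: random gain and a small circular time shift."""
    gain = rng.uniform(0.5, 1.0)
    shift = int(rng.integers(0, 1000))
    return gain * np.roll(x, shift)
```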

The feature extraction unit 302 can extract features from the raw input data (e.g., the mixed sound) to aid the training and separation computations. The training data can be processed in the time domain (raw data), the frequency domain, a feature domain, or the time-frequency domain via the fast Fourier transform (FFT), the short-time Fourier transform (STFT), spectrograms, auditory transforms, wavelets, or other transforms. FIG. 4 (described more fully below) shows how track separation and HRTF filtering can be performed in a transform domain.
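
For instance, a magnitude-spectrogram front end based on the STFT could be sketched as follows; the window and hop sizes are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft


def extract_features(x: np.ndarray, fs: int = 44100) -> np.ndarray:
    """Magnitude spectrogram of a waveform: one feature column per analysis frame."""
    _, _, z = stft(x, fs=fs, nperseg=2048, noverlap=1536)  # complex STFT
    return np.abs(z)  # shape: (freq_bins, frames)
```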

The model structure and training algorithm for the machine learning model 308 can be a neural network (NN), convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), long short-term memory (LSTM), Gaussian mixture model (GMM), hidden Markov model (HMM), or any model and/or algorithm that can be used to separate the sound sources in a mixed track. After training, in the separation phase, the trained separation model computation unit 310 can separate the input music data into multiple tracks, each separated track corresponding to an individual sound. In one embodiment, the multiple separated tracks can be mixed in different ways for different sound effects according to user settings (e.g., via GUI 110A of FIG. 1A).

In one embodiment, the machine learning model 308 can be a DNN or CNN, which can include multiple layers; in particular, an input layer for receiving the data input (e.g., during training), an output layer for generating the output (e.g., during separation), and one or more hidden layers, each comprising linear or nonlinear computing elements (called neurons) that perform the DNN or CNN computations propagated from the input layer to the output layer, transforming the data input into the output. Two adjacent layers can be connected by links, and each link can be associated with a parameter value (called a synaptic weight) that scales the output of a neuron in the preceding layer as input to one or more neurons in the following layer.
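
A toy example of such a layered network, sketched as a small CNN that predicts four masked source spectrograms from a mixture spectrogram; the layer sizes and the four-source output are illustrative assumptions consistent with the vocals/drums/bass/other split used elsewhere in this description:

```python
import torch
import torch.nn as nn


class SeparatorCNN(nn.Module):
    """Input: (batch, 1, freq, time) mixture magnitude; output: 4 source estimates."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),   # hidden layer
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),  # hidden layer
            nn.Conv2d(32, 4, kernel_size=1),                         # output layer
        )

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        masks = torch.sigmoid(self.net(mix))  # one soft mask per source
        return masks * mix                    # masked mixture = source estimates


estimates = SeparatorCNN()(torch.rand(1, 1, 1025, 64))  # shape: (1, 4, 1025, 64)
```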

FIGS. 5A-5E show (as described more fully below) the waveforms and corresponding spectrograms associated with a mixed music track (e.g., the mixed sound input) and with the separated tracks of vocals, drums, bass, and other sounds, where the mixed track has been separated by the trained machine learning model 308. The separation computation can be performed according to the system 400 shown in FIG. 4.

FIG. 4 illustrates a system 400 for separating mixed tracks, and filtering them using transform-domain sound signals, according to one implementation of the present description.

The training data (e.g., the time-domain mixed sound signal) can be processed by the separation unit 404 (e.g., the sound separation unit 102A of FIG. 1A) in the time domain (e.g., on raw data), or a forward transform 402 can be used so that the data can be processed in the frequency, feature, or time-frequency domain via the fast Fourier transform (FFT), short-time Fourier transform (STFT), spectrograms, auditory transforms, wavelets, or other transforms. The HRTF filters 406 (e.g., those stored in the storage unit 104A of FIG. 1A) can be implemented as convolutions in the time domain, or an inverse transform 408 can be used so that the HRTF filters 406 can be implemented as multiplications in the frequency domain to save computation time. Track separation and HRTF filtering can therefore both be performed in the transform domain.
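
A compact sketch of this transform-domain pipeline, assuming an STFT-domain separation mask and per-bin HRTF multiplication (a frame-by-frame approximation of the convolution); the shapes and filter lengths are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft


def separate_and_filter(mix: np.ndarray, mask: np.ndarray,
                        h_left: np.ndarray, h_right: np.ndarray,
                        fs: int = 44100) -> tuple[np.ndarray, np.ndarray]:
    """Forward transform -> separate by masking -> HRTF multiply -> inverse transform."""
    _, _, Z = stft(mix, fs=fs, nperseg=2048)     # forward transform 402
    source = mask * Z                            # separation 404 (mask from the model)
    HL = np.fft.rfft(h_left, 2048)[:, None]      # left HRTF as a frequency response
    HR = np.fft.rfft(h_right, 2048)[:, None]     # right HRTF as a frequency response
    _, left = istft(HL * source, fs=fs, nperseg=2048)   # filtering 406 + inverse 408
    _, right = istft(HR * source, fs=fs, nperseg=2048)
    return left, right
```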

FIGS. 5A-5E show, as waveforms and spectrograms, an original mixed sound and that mixed sound separated into vocals, drums, bass, and other sounds, respectively, implemented in accordance with the present description.

FIG. 5A shows the waveform and corresponding spectrogram associated with a mixed music track (e.g., the mixed sound input of system 100A of FIG. 1A).

FIG. 5B shows the waveform and corresponding spectrogram associated with the vocal track separated from the mixed music track.

FIG. 5C shows the waveform and corresponding spectrogram associated with the drum track separated from the mixed music track.

FIG. 5D shows the waveform and corresponding spectrogram associated with the bass track separated from the mixed music track.

FIG. 5E shows the waveform and corresponding spectrogram associated with the track of other sounds (e.g., of unidentified sound type) separated from the mixed music track.

In one embodiment of the present description, the mixed track is separated using the trained machine learning model 308. The separation computation can be performed according to the system 400 of FIG. 4 described above.

FIG. 6 illustrates far-field voice control of a 3D binaural music system 600 with sound separation, implemented in accordance with the present description.

First, the microphone array 602 can pick up a voice command. A preamplifier/analog-to-digital converter (ADC) 604 can amplify the analog signal and/or convert it to a digital signal. Both the preamplifier and the ADC are optional, depending on the kind of microphones used in the microphone array 602; digital microphones, for example, may not need them.

The acoustic beamformer 606 forms one or more sound beams to enhance the speech or voice command and suppress any background noise. An acoustic echo canceller (AEC) 608 additionally uses a reference signal to cancel the loudspeaker sound (e.g., from the speakers 630) picked up by the microphone array 602. The reference signal can be picked up by one or more reference microphones 610 near the speakers 630, or obtained from the audio signal (e.g., from the settings/equalizer unit 624) before that signal is sent to the amplifier 628 for the speakers 630. The output of the AEC can then be sent to the noise reduction unit 612 to further reduce background noise.
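
As one deliberately simple illustration of beamforming, a delay-and-sum beamformer for a uniform linear array might look like the following; the array geometry, sample rate, and integer-sample steering are assumptions, not the specific design of beamformer 606:

```python
import numpy as np


def delay_and_sum(channels: np.ndarray, angle_deg: float, spacing_m: float = 0.02,
                  fs: int = 16000, c: float = 343.0) -> np.ndarray:
    """channels: (num_mics, num_samples) from a uniform linear array."""
    num_mics, n = channels.shape
    out = np.zeros(n)
    for m in range(num_mics):
        # Geometric delay of mic m for a plane wave from angle_deg (broadside = 0).
        tau = m * spacing_m * np.sin(np.radians(angle_deg)) / c
        shift = int(round(tau * fs))
        out += np.roll(channels[m], -shift)  # align the channels, then sum coherently
    return out / num_mics
```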

The cleaned speech is then sent to the wake-phrase recognizer 614, which can recognize a wake phrase predefined for the system 600. The system 600 can mute the speakers 630 to further improve the speech quality. An automatic speech recognizer (ASR) 616 can then recognize the voice command (e.g., the name of a song) and instruct the music retrieval unit 618 to retrieve the music from the music library 620. In one embodiment, the wake-phrase recognizer 614 and the ASR 616 can be combined into one unit. The retrieved music can then be separated by the sound separation unit 622, which can be similar to the sound separation unit 102A of FIG. 1A.

The settings/equalizer unit 624 can then adjust the volume of each sound source and/or equalize each track (a gain for each frequency band, or for each instrument or vocal). Finally, the separated music tracks can be played from the speakers 630 as direct 3D sound (via the amplifier 628), as in the system 100B of FIG. 1B, or the HRTF filters 626 can be used to process the separated tracks to generate binaural sound, as in the system 100A of FIG. 1A.
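
A minimal sketch of such per-band gain adjustment, using three illustrative bands with assumed crossover frequencies of 250 Hz and 4 kHz:

```python
import numpy as np


def equalize(x: np.ndarray, fs: int, gains: tuple[float, float, float]) -> np.ndarray:
    """Apply low/mid/high gains with crossovers at 250 Hz and 4 kHz (assumed)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    g = np.where(freqs < 250, gains[0], np.where(freqs < 4000, gains[1], gains[2]))
    return np.fft.irfft(spectrum * g, len(x))
```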

FIGS. 7A-7D illustrate a GUI 700 for user settings of 3D sound implemented in accordance with the present description, with the selected listener position within the band (FIGS. 7A-7C) and in front of the band (FIG. 7D), respectively.

In one implementation, the GUI 700 may be arranged so that all sound sources (e.g., from a music band on a stage) are represented by band-member icons on a virtual stage, and the listener is represented by a listener head icon (wearing headphones to highlight the positions of the left and right ears) that the user of the GUI 700 can move freely around the stage. In another implementation, all of the icons in FIGS. 7A-7D can be moved freely around the stage by the user's touch on the GUI 700.

In FIG. 7A, when the listener head icon is placed at the center of the virtual stage, the listener hears binaural sound and perceives the sound field: the vocals are perceived as coming from the front, the drums from the right, the bass from behind, and the other instruments (e.g., keyboards) from the left.

In FIG. 7B, when the listener head icon is placed on top of the drummer icon, the listener hears the separated drum track as a solo.

In FIG. 7C, when the listener head icon is placed close to the drummer and bassist icons, the drum and bass sounds are enhanced (e.g., increased in volume) while the sounds from the other sources (e.g., vocals and the remaining instruments) are relatively reduced (e.g., decreased in volume); through these GUI 700 settings, the listener experiences enhanced bass and beat effects. A sketch of one possible position-to-gain mapping follows.
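The patent does not specify how icon distance maps to volume, so the inverse-distance law below, including the falloff constant, is purely illustrative.

```python
import numpy as np

def gains_from_layout(listener_xy, source_xy, floor=0.1):
    """Boost sources whose icons sit near the listener head icon and
    attenuate distant ones. Positions are normalized GUI coordinates."""
    gains = {}
    for name, pos in source_xy.items():
        d = float(np.hypot(pos[0] - listener_xy[0], pos[1] - listener_xy[1]))
        gains[name] = max(floor, 1.0 / (1.0 + 4.0 * d))  # 4.0: arbitrary falloff
    return gains

# listener dragged next to the drummer icon: drums loud, vocals quiet
print(gains_from_layout((0.8, 0.5), {"drums": (0.85, 0.5), "vocals": (0.2, 0.9)}))
```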

FIG. 7D shows another virtual 3D sound field setting. In this setting, the listener can virtually feel and hear the band in front of him or her, even if that was not the case in the real-world stage recording. All band-member icons and the listener head icon can be moved anywhere on the GUI 700 display to set up and modify the virtual sound field and the listening experience.

The GUI 700 can also be adapted to a remote control for a television with a direct 3D sound system, or to other similar applications. For example, while watching a movie, the user can move the listener head icon closer to the voice icon so that the dialogue volume is enhanced while the volume of other background sounds (e.g., music) is reduced, letting the user hear the dialogue more clearly.

FIG. 8 illustrates a system 800 for generating 3D sound using a microphone array 802, according to one implementation of the present description.

The system 800 can be described as a 3D microphone system that directly picks up and outputs 3D and binaural sound. As referred to herein, a 3D microphone system can include a microphone array system that picks up sound from different directions together with spatial information about the locations of the sound sources. The system 800 can produce two kinds of output: (1) multiple tracks, each corresponding to sound from one direction, where each track can drive one speaker or one group of speakers to generate a three-dimensional sound field; and (2) binaural left and right tracks that represent the 3D sound field virtually through earbuds or headphones.

Each microphone of the microphone array 802 can have its signal processed by the preamplifier/ADC unit 804. The preamplifier and analog-to-digital converter (ADC) can amplify the analog signal and/or convert it to a digital signal. Both are optional, depending on the microphone components selected for the microphone array 802; they may be unnecessary for digital microphones, for example.

The acoustic beamformer 806 can simultaneously form acoustic beam patterns pointing in different directions or at different sound sources, as shown in FIG. 9B. Each beam enhances the sound from its "look" direction while suppressing sound from other directions, improving the signal-to-noise ratio (SNR) and isolating the look-direction sound from sound arriving from elsewhere. If needed, the noise reduction unit 808 can further reduce background noise in the beamformer outputs. The beamformer output can include multiple tracks corresponding to sounds from different directions.
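A minimal delay-and-sum sketch of the kind of beam a unit like 806 forms. Practical beamformers use fractional delays or frequency-domain weights (e.g., MVDR), so the integer-sample version below is only illustrative.

```python
import numpy as np

def delay_and_sum(frames, mic_xy, look_deg, fs, c=343.0):
    """frames: (num_mics, num_samples) array; mic_xy: (num_mics, 2) mic
    positions in meters. Delay each channel so a plane wave from the
    look direction adds coherently, then average the channels."""
    u = np.array([np.cos(np.radians(look_deg)), np.sin(np.radians(look_deg))])
    delays = mic_xy @ u / c          # per-mic time offsets (seconds)
    delays -= delays.min()           # make all delays non-negative
    out = np.zeros(frames.shape[1])
    for ch, d in zip(frames, delays):
        k = int(round(d * fs))       # integer-sample approximation
        if k:
            out[k:] += ch[:-k]
        else:
            out += ch
    return out / len(frames)
```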

To generate direct 3D sound, the multiple tracks can drive multiple amplifiers and speakers to construct a 3D sound field for the listener.

To generate binaural output, the multiple tracks can be passed through multiple pairs of selected HRTF filters 810 that convert the spatial tracks into binaural sound. The HRTF filters can be selected based on the user's settings (e.g., via the output audio settings unit 814) or based on the actual spatial positions of the sound sources in the real world. The mixer 812 can then combine the HRTF outputs into one pair of binaural outputs, one for the left ear and one for the right ear. The final binaural output represents the 3D sound field recorded by the microphone array 802.
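As a sketch of the HRTF-filter-plus-mixer stage (810 and 812), assuming each spatial track already has a chosen (left, right) HRTF impulse-response pair:

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_downmix(tracks, hrtf_pairs):
    """tracks: list of mono arrays, one per beam direction.
    hrtf_pairs: matching list of (h_left, h_right) impulse responses.
    Filter each track with its pair, then mix everything into one
    (left, right) binaural output."""
    length = max(len(t) + max(len(h[0]), len(h[1])) - 1
                 for t, h in zip(tracks, hrtf_pairs))
    left = np.zeros(length)
    right = np.zeros(length)
    for track, (hl, hr) in zip(tracks, hrtf_pairs):
        yl = fftconvolve(track, hl)
        yr = fftconvolve(track, hr)
        left[:len(yl)] += yl
        right[:len(yr)] += yr
    peak = max(np.abs(left).max(), np.abs(right).max(), 1e-9)
    return left / peak, right / peak  # simple normalization against clipping
```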

When the microphone array 802 forms only two acoustic beam patterns, pointing left and right respectively, it works as a stereo microphone, which is a special case of a 3D microphone.

FIGS. 9A-9B illustrate the beam patterns of a 3D microphone 902 and of a 3D microphone array 904 with spatial noise cancellation, respectively, implemented in accordance with the present description.

FIG. 9A shows the beam pattern of the 3D microphone 902, which can pick up sound from different directions together with spatial information about the sound sources.

FIG. 9B shows a microphone array 904 (e.g., comprising multiple microphones 902) with beam patterns A and B formed by beamformers A and B, respectively; the array is set up to pick up sound from two different sources, A and B. Sound picked up from source A in the "look" direction of one acoustic beam (beam pattern A) is often mixed with sound picked up from other directions (e.g., the direction of source B). To cancel sound from other directions, the 3D microphone array 904 can use the same array to form one or more additional beam patterns, such as beam pattern B. The sound picked up by beam pattern B can then be used to cancel the unwanted mixed-in sound in the output of beam pattern A: sound arriving from the direction of source B that has mixed with the look-direction sound is removed from beam pattern A's output. The cancellation algorithm can be provided by an acoustic echo canceller (AEC) unit 906.
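Since the text says the cancellation can be provided by an AEC unit (906), the same NLMS-style adaptive filter sketched earlier for echo cancellation applies here, with beam B's output playing the role of the reference. A compressed sketch:

```python
import numpy as np

def cancel_cross_beam(beam_a, beam_b, filter_len=128, mu=0.2, eps=1e-8):
    """Adaptively estimate how source B leaks into beam A's output and
    subtract it, using beam B's output as the reference signal."""
    w = np.zeros(filter_len)
    cleaned = np.zeros_like(beam_a)
    for n in range(filter_len, len(beam_a)):
        x = beam_b[n - filter_len:n][::-1]
        e = beam_a[n] - w @ x                 # beam A minus predicted leakage
        w += mu * e * x / (x @ x + eps)
        cleaned[n] = e
    return cleaned
```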

FIG. 10 illustrates a conferencing system 1000 for generating three-dimensional sound implemented according to the present description.

The conferencing system 1000 may include a signal processing and computing unit 1002, a repository 1004 of head-related transfer function (HRTF) filters, a display unit 1006 with a graphical user interface (GUI), an amplifier 1008, a headset or headphones 1010, and a speaker 1012. For example, the system 1000 may be implemented as software on a user's laptop, tablet, computer, or smartphone connected to a headset. Video and audio conferences (hereinafter "conferences") may also be called teleconferences, virtual meetings, web conferences, webinars, or video conferences. One such conference can include multiple local and/or multiple remote attendees. In one implementation, the attendees can be connected via the Internet and telephone network 1014. In another implementation, the conference can be controlled by a cloud server or remote server via the Internet and telephone network 1014.

A user of the system 1000 may be one of the attendees of a conference or a virtual concert. She or he owns the laptop, tablet, computer, or smartphone running the conferencing software with video and audio, and may be wearing the headset 1010. The terms "speaker" (talker) and "attendee" refer to a person attending the conference. The loudspeaker 1012 can be any device that converts an audio signal into audible sound. The amplifier 1008 can be an electronic device or circuit that increases signal power to drive the loudspeaker 1012 or the headset 1010. The headset 1010 can be a pair of headphones, an over-ear device, or an in-ear audio device.

The input signal (e.g., from the cloud via 1014) can include video, audio, and a speaker identification (ID). The speaker ID associates the video and audio input with the attendee who is speaking. When no speaker ID is present, a new one can be generated by the speaker ID unit 1016, as described below.

The speaker ID unit 1016 can obtain the speaker ID from the conferencing software, based on the speaker's videoconference session. The speaker ID unit 1016 can also obtain the speaker ID from a microphone array (e.g., the microphone array 802 of FIG. 8 or 904 of FIG. 9B). For example, the microphone array beam patterns in FIG. 9B (e.g., beam patterns A and B) can detect a speaker's direction relative to the array; based on the detected direction, the system 1000 can determine the speaker ID. Further, the speaker ID unit 1016 can obtain the speaker ID with a speaker identification algorithm. For example, given a track composed of multiple speakers' voices, a speaker ID system can have two phases: training and inference. During training, with labels available, each speaker's voice is used to train a speaker-dependent model, one model per speaker. Without labels, the speaker ID system can first perform unsupervised training, label the voices from the track with speaker IDs, and then perform supervised training to generate one model per speaker. During inference, given the conference audio, the speaker ID unit 1016 can process the input sound with the trained models and identify the corresponding speaker. The model can be a Gaussian mixture model (GMM), a hidden Markov model (HMM), a DNN, a CNN, an LSTM, or an RNN.
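A toy sketch of the GMM variant of the train/infer scheme described above, assuming per-frame features (e.g., MFCCs) have already been extracted; scikit-learn is used here purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=8, seed=0):
    """features_by_speaker: dict speaker_id -> (num_frames, num_dims)
    feature array. Train one speaker-dependent GMM per speaker."""
    return {
        spk: GaussianMixture(n_components, random_state=seed).fit(feats)
        for spk, feats in features_by_speaker.items()
    }

def identify_speaker(models, utterance_features):
    """Inference: return the speaker whose model assigns the highest
    average log-likelihood to the utterance's feature frames."""
    return max(models, key=lambda spk: models[spk].score(utterance_features))
```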

For the attendee who is speaking, the video window associated with that attendee can be visually highlighted in the display/GUI 1006, so the user knows which attendee is talking, e.g., attendee 2 in FIG. 11 described below. From the speaker's position, for example at a 50-degree angle from the user, the system 1000 can retrieve a pair of corresponding HRTF filters from the pre-stored database or memory 1004. The signal processing and computing unit 1002 can perform convolution of the input mono signal with the HRTF filters from the pre-stored database or memory 1004. Its output can have two channels of binaural sound, for the left ear and the right ear respectively. The user or attendee can wear the headset unit 1010 to hear the binaural sound and experience the 3D effect. For example, a user who is not looking at the display 1006 but is wearing the headset 1010 can still tell from the 3D sound which attendee is talking, so that she or he feels as if in a real conference room.
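A sketch of this retrieval-and-convolution step, assuming the repository 1004 is keyed by azimuth in degrees (the database layout is our assumption):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_talker(mono, azimuth_deg, hrtf_db):
    """hrtf_db: dict stored_azimuth_deg -> (h_left, h_right).
    Pick the stored pair nearest the talker's angle (e.g., 50 degrees
    from the user) and convolve the mono speech into binaural L/R."""
    nearest = min(hrtf_db,
                  key=lambda a: abs((a - azimuth_deg + 180) % 360 - 180))
    h_left, h_right = hrtf_db[nearest]
    return fftconvolve(mono, h_left), fftconvolve(mono, h_right)
```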

When multiple displays/GUIs 1006 and multiple speakers 1012 are used in a real conference room, each speaker 1012 can be dedicated to the voice of one talker shown on one display/GUI 1006 at one location. In this case the user does not need the headset 1010, and she or he can experience 3D sound from the speakers 1012. Multiple speakers can be placed in a home theater, a movie theater, a sound bar, a television, a smart speaker, a smartphone, a mobile or handheld device, a laptop, a PC, a car, or anywhere with more than one speaker or sound generator.

FIG. 11 shows a virtual conference room 1100 displayed by the GUI 1006 of the conferencing system 1000 for generating three-dimensional sound, according to an implementation of the present description.

The virtual conference room 1100 can have multiple windows (1102-1112) containing the videos of the user and the conference attendees. The positions of the windows (1102-1112) can be arranged by the conferencing software (e.g., running on a laptop) or by the user (e.g., via the display/GUI 1006 of FIG. 10). For example, the user can move the windows (1102-1112) around to lay out the virtual conference room 1100. In one embodiment, the center of the conference room 1100 can contain a virtual conference table.

As noted above, the virtual conference room 1100 can also be set up by the user, so that the attendees' video windows (1104-1112) can be placed virtually anywhere in the virtual conference room 1100 using a mouse, keypad, touchscreen, and the like. From a talker's position relative to the user (e.g., the angle from attendee 2's video window 1106 to the user's video window 1102), the relevant HRTF can be selected and applied to attendee 2 automatically when they speak.
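For illustration, one way the angle between two video windows could be derived from GUI coordinates; the coordinate convention is an assumption, not something the patent specifies.

```python
import numpy as np

def azimuth_from_windows(user_xy, attendee_xy):
    """Angle of the line from the user's window to an attendee's window,
    measured from straight ahead (up on the screen), positive to the
    right. Screen y is assumed to grow downward."""
    dx = attendee_xy[0] - user_xy[0]
    dy = user_xy[1] - attendee_xy[1]
    return float(np.degrees(np.arctan2(dx, dy)))

# an attendee up and to the right of the user maps to roughly +31 degrees:
print(azimuth_from_windows((0.5, 0.9), (0.8, 0.4)))
```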

Method

To simplify the description, the methods of this description are depicted and described as a series of acts. However, acts in accordance with this description can occur in various orders and/or concurrently, and with other acts not presented and described herein. Furthermore, not all illustrated acts may be required to implement the methods in accordance with the described subject matter. In addition, those skilled in the art will understand and appreciate that the methods could alternatively be represented as a series of interrelated states via a state diagram or events. Additionally, it should be appreciated that the methods described in this specification are capable of being stored on an article of manufacture to facilitate transporting and transferring such methods to computing devices. The term "article of manufacture," as used herein, is intended to encompass a computer program accessible from any computer-readable device or storage medium.

The described methods can be performed by processing devices that may comprise hardware (e.g., circuitry, dedicated logic), computer-readable instructions (e.g., run on a general-purpose computer system or a dedicated machine), or a combination of both. The methods and each of their individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer device executing the method. In certain implementations, a method may be performed by a single processing thread. Alternatively, a method may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations.

FIG. 12 shows a method 1200 for generating three-dimensional sound according to one implementation of the present description.

In one implementation, the method 1200 can be performed by the signal processing unit of the system 100A of FIG. 1A or the subsystem 100B of FIG. 1B.

At 1202, the method includes receiving a specification of a three-dimensional space (e.g., 200A of FIG. 2A) and of a distribution grid of head-related transfer function (HRTF) filters (e.g., 200B of FIG. 2B) defined on a grid in the three-dimensional space, where the three-dimensional space is presented in a user interface of a user interface device (e.g., the GUI 110A of FIG. 1A).

At 1204, the method includes determining (e.g., by the sound separation unit 102A of FIG. 1A) multiple audio tracks (e.g., separated tracks), where each of the multiple tracks corresponds to a respective sound source (e.g., vocals).

At 1206, the method includes representing, in the three-dimensional space, a listener (e.g., the listener 202A of FIG. 2A) and the sound sources of the multiple tracks (e.g., the sound sources 204A of FIG. 2A).

At 1208, the method includes, responsive to a user setting (e.g., via the GUI 110A of FIG. 1A) of at least one of the listener position or a sound source position in the three-dimensional space, generating multiple HRTF filters (e.g., 200B of FIG. 2B) based on the distribution grid of HRTF filters (e.g., stored in the storage unit 104A of FIG. 1A) and the positions of the sound sources and the listener in the three-dimensional space.
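Step 1208 implies producing a filter for positions that may fall between grid points. A naive sketch that linearly blends the two nearest stored azimuths; real systems interpolate more carefully, often in the frequency domain, and the one-dimensional grid here is a simplification of the 3D mesh.

```python
import numpy as np

def interpolate_hrtf(grid, azimuth_deg):
    """grid: dict stored_azimuth_deg -> (h_left, h_right), with all
    impulse responses the same length. Blend the two neighboring grid
    filters to approximate an HRTF pair between grid points."""
    azs = sorted(grid)
    lo = max((a for a in azs if a <= azimuth_deg), default=azs[0])
    hi = min((a for a in azs if a >= azimuth_deg), default=azs[-1])
    if lo == hi:
        return grid[lo]
    t = (azimuth_deg - lo) / (hi - lo)
    h_left = (1 - t) * grid[lo][0] + t * grid[hi][0]
    h_right = (1 - t) * grid[lo][1] + t * grid[hi][1]
    return h_left, h_right
```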

At 1210, the method includes applying each of the multiple HRTF filters (e.g., 200B of FIG. 2B) to a respective one of the multiple separated tracks to generate multiple filtered tracks.

At 1212, the method includes generating three-dimensional sound based on the filtered tracks.

FIG. 13 shows a method 1300 for generating three-dimensional sound according to one implementation of the present description.

At 1302, the method includes picking up sound from multiple sound sources with a microphone array (e.g., the microphone array 802 of FIG. 8) comprising multiple microphones (e.g., the microphone 902 of FIG. 9A).

At 1304, the method includes presenting three-dimensional sound with one or more speakers (e.g., the speakers 108B of FIG. 1B).

At 1306, the method includes removing echoes in the multiple tracks with an acoustic echo cancellation unit (e.g., the AEC 608 of FIG. 6).

At 1308, the method includes reducing noise components in the multiple tracks with a noise reduction unit (e.g., the noise reduction unit 612 of FIG. 6).

At 1310, the method includes processing the multiple tracks with a sound equalizer unit (e.g., the settings/equalizer unit 624 of FIG. 6).

At 1312, the method includes picking up a reference signal with a reference sound pickup circuit (e.g., the reference microphone 610 of FIG. 6) located near the one or more speakers (e.g., the speaker 630 of FIG. 6), where the acoustic echo cancellation unit (e.g., the AEC 608 of FIG. 6) uses the picked-up reference signal to remove the echoes.

At 1314, the method includes recognizing a voice command with a speech recognition unit (e.g., the speech recognizer 616 of FIG. 6).

Hardware

FIG. 14 depicts a block diagram of a computer system 1400 operating in accordance with one or more aspects of the present description. In various examples, the computer system 1400 may correspond to any signal processing unit/device described in connection with the systems presented herein, such as the system 100A of FIG. 1A or the system 100B of FIG. 1B.

In certain implementations, the computer system 1400 may be connected (e.g., via a network, such as a local area network (LAN), an intranet, an extranet, or the Internet) to other computer systems. The computer system 1400 may operate in the capacity of a server or a client computer in a client-server environment, or as a peer computer in a peer-to-peer or distributed network environment. The computer system 1400 may be provided by a personal computer (PC), a tablet, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch, or bridge, a computing device in a vehicle, home, room, or office, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Further, the term "computer" shall include any collection of computers, processors, chips, or SoCs that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods described herein.

In another implementation, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a server-client or cloud network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be an in-vehicle system, a wearable device, a personal computer (PC), a tablet, a hybrid tablet, a personal digital assistant (PDA), a mobile telephone, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein. Similarly, the term "processor-based system" shall be taken to include any set of one or more machines that are controlled by or operated by a processor (e.g., a chip, computer, or cloud server) to individually or jointly execute instructions to perform any one or more of the methods discussed herein.

The example computer system 1400 includes at least one processor 1402 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both; a processor core; a compute node; a cloud server; etc.), a main memory 1404, and a static memory 1406, which communicate with each other via a link 1408 (e.g., a bus). The computer system 1400 may further include a video display unit 1410, an alphanumeric input device 1412 (e.g., a keyboard), and a user interface (UI) navigation device 1414 (e.g., a mouse). In one implementation, the video display unit 1410, the input device 1412, and the UI navigation device 1414 are incorporated into a touchscreen display. The computer system 1400 may additionally include a storage device 1416 (e.g., a drive unit), a sound production device 1418 (e.g., a speaker), a network interface device 1420, and one or more sensors 1422, such as a Global Positioning System (GPS) sensor, an accelerometer, a gyroscope, a position sensor, a motion sensor, a magnetometer, or other sensors.

The storage device 1416 includes a machine-readable medium 1424 on which is stored one or more sets of data structures and instructions 1426 (e.g., software) embodying or utilized by any one or more of the methods or functions described herein. The instructions 1426 may also reside, completely or at least partially, within the main memory 1404, the static memory 1406, and/or the processor 1402 during execution thereof by the computer system 1400, with the main memory 1404, the static memory 1406, and the processor 1402 also constituting machine-readable media.

Although the machine-readable medium 1424 is illustrated in an example embodiment as a single medium, the term "machine-readable medium" may include a single medium or multiple media (e.g., a centralized, cloud, or distributed database, and/or associated caches and servers) that store the one or more instructions 1426. The term "machine-readable medium" shall also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methods of the present description, or that is capable of storing, encoding, or carrying data structures utilized by or associated with such instructions. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories and optical and magnetic media. Specific examples of machine-readable media include volatile or non-volatile memory, including but not limited to, by way of example, semiconductor memory devices (e.g., electrically programmable read-only memory (EPROM) and electrically erasable programmable read-only memory (EEPROM)) and flash memory devices; magnetic disks such as internal hard disks and removable disks; and CD-ROM and DVD-ROM disks.

The instructions 1426 may further be transmitted or received over a communications network 1428 using a transmission medium via the network interface device 1420, utilizing any one of a number of well-known transfer protocols (e.g., HTTP). Examples of communication networks include a local area network (LAN), a wide area network (WAN), the Internet, mobile telephone networks, plain old telephone (POTS) networks, and wireless data networks (e.g., Bluetooth, Wi-Fi, 3G, and 4G LTE/LTE-A or WiMAX networks). The term "transmission medium" shall be taken to include any intangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.

The example computer system 1400 can also include an input/output controller 1430 to receive input and output requests from the at least one central processor 1402, and then to send device-specific control signals to the devices it controls. The input/output controller 1430 can free the at least one central processor 1402 from having to deal with the details of controlling each separate kind of device.

Language

Unless specifically stated otherwise, terms such as "receiving," "associating," "determining," "updating," or the like refer to actions and processes performed or implemented by a computer system that manipulate and transform data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices. Also, as used herein, the terms "first," "second," "third," "fourth," etc. are meant as labels to distinguish among different elements and may not have an ordinal meaning according to their numerical designation.

Examples described herein also relate to an apparatus for performing the methods described herein. This apparatus may be specially constructed for performing the methods described herein, or it may comprise a general-purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a computer-readable tangible storage medium.

The methods and illustrative examples described herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used in accordance with the teachings described herein, or it may prove convenient to construct more specialized apparatus to perform the methods and/or each of their individual functions, routines, subroutines, or operations. Examples of the structure for a variety of these systems are set forth in the description above.

The above description is intended to be illustrative, and not restrictive. Although the present description has been described with reference to specific illustrative examples and implementations, it will be recognized that the present description is not limited to the examples and implementations described. The scope of the description should be determined with reference to the following claims, along with the full scope of equivalents to which the claims are entitled.

Claims (21)

1. A three-dimensional sound generation system, comprising:
one or more processing devices configured to:
receive a streaming sound composed of one or more sound sources, wherein each sound source corresponds to one or more sound types;
receive a specification, preset, or setting of a three-dimensional space, including one or more listener locations, one or more sound source locations, and one or more sound generating devices and their locations, to generate a desired sound field in the three-dimensional space;
separate the streaming sound into one or more tracks, wherein each track includes the one or more sound types;
combine the separated tracks into one or more audio channels based on the specification, preset, or setting; and
drive the one or more sound generating devices with the one or more audio channels to generate the desired sound field, such that the desired sound field produces an equivalent rendering effect in which sounds are heard, at the one or more listener locations in the three-dimensional space, as coming from the one or more sound source locations.
2. The three-dimensional sound generation system of claim 1, wherein the three-dimensional space is an acoustic virtual space of a real space, including a stage space of a band or concert, or of a non-real space, including a telephone, video, and/or web conference space; and wherein the specification, preset, and/or setting is implemented, modified, or updated by a user via one or more user interfaces or one or more computer programs, wherein the user interfaces comprise at least a graphical interface.
3. The three-dimensional sound generation system of claim 1, wherein separating the streaming sound into the one or more tracks and combining the separated tracks to generate the one or more audio channels are implemented, jointly or separately, by at least one machine learning system, at least one artificial neural network system, at least one digital signal processing system, or a combination thereof.
4. The three-dimensional sound generation system of claim 1, wherein the sound generating device is a binaural sound generating device comprising earbuds, earphones, or a headset, each track being processed by a retrieved head-related transfer function (HRTF) for the corresponding location and combined into a binaural audio signal to drive the binaural sound generating device to generate the desired sound field; wherein the binaural sound generating device is a standalone device or part of another device comprising at least an in-vehicle system, a wearable apparatus, a computer, a personal digital assistant, a mobile phone, or a display.
5. The three-dimensional sound generation system of claim 1, wherein the sound generating device is a loudspeaker-based sound generating device comprising speakers, the three-dimensional sound generation system being implemented in an open or closed space, the closed space including a vehicle or a room, the open space including an outdoor venue or a network space; and wherein connections among the one or more processing devices and with the one or more sound generating devices are wired or wireless, including the Internet, Wi-Fi, Bluetooth, a bus, a local network, or a cloud network.
6. A three-dimensional sound generation system, comprising:
one or more processing devices configured to:
receive a streaming sound composed of one or more sound sources, wherein each sound source corresponds to one or more sound types;
receive a specification, preset, online setting, or user setting of one or more desired output sound types and their corresponding volumes;
separate the streaming sound into one or more tracks according to the specification or settings; and
let the user adjust the volume or frequency response of each track, between zero and a maximum, or combine the tracks into one or more audio channel signals according to the specification and settings, to drive a sound generating device comprising a speaker, earbud, earphone, or headset that converts the audio channel signals into sound.
7. The three-dimensional sound generation system of claim 6, wherein each output audio signal is a solo, vocals only, speech only, instruments only, or a combination thereof, and the separation of each track is accomplished by at least one machine learning system, at least one artificial neural network system, at least one signal processing system, or a combination thereof.
8. A three-dimensional sound generation system for video conferencing, virtual concerts, or cloud communication, comprising:
a plurality of user terminals connected through wired or wireless communication;
each user terminal comprising at least one receiving device, at least one processing device, and at least one sound generating device;
wherein the user terminal receives a streaming sound formed from the sound sources of other user terminals, or a video signal containing the streaming sound, and separates the streaming sound into one or more tracks according to the different user terminals, each track containing one or more sound source signals; and
wherein the user terminal converts the one or more separated tracks into one or more channel signals according to a specification, preset, or setting, and drives the at least one sound generating device to generate a sound field in which each user terminal's sound source has a specific position.
9. The three-dimensional sound generation system of claim 8, wherein the specification, preset, or setting is implemented or adjusted online through a graphical display or a computer program in the user interface or user interface device of each user terminal and/or of a designated user terminal; wherein, when a video display is present, the sound source position is set at or near the position where that user's image is displayed; and wherein, when the display position or virtual position of another user terminal changes in the specification, preset, or setting, the sound generating device generates a correspondingly changed sound field.
10. The three-dimensional sound generation system of claim 8, wherein separating the streaming sound into the plurality of tracks is accomplished by at least one machine learning system, at least one artificial neural network system, at least one digital signal processing system, or a combination thereof, and the streaming sound is separated into the plurality of tracks by identifying or detecting different sound source positions from microphone array pickup and/or from the known user terminal sources.
11. The three-dimensional sound generation system of claim 8, wherein the sound generating device comprises at least one speaker, earbud, earphone, headset, or any device capable of converting the channel signals into sound at the specific positions in the sound field.
12. The three-dimensional sound generation system of claim 8, wherein each separated track is processed by a head-related transfer function (HRTF) selected according to a specification, preset, user-terminal setting, or the sound source position displayed at the user terminal, and the processed tracks are combined to generate the channel signals that drive earbuds, earphones, or a headset to generate the sound field.
13. A three-dimensional sound generation system, comprising:
a sound source apparatus and a sound receiving apparatus connected by wired or wireless communication, wherein each apparatus has or is associated with one or more sensors, and the sound receiving apparatus comprises one or more sound generating devices;
wherein the sound receiving apparatus receives an audio signal from the sound source apparatus and receives relative position information between the sound source apparatus and the sound receiving apparatus, the relative position information being obtained by processing signals from the sensors; and
wherein the sound receiving apparatus, the sound source apparatus, or both cooperatively retrieve a relevant head-related transfer function (HRTF) based on the relative position information, process the received audio signal using the retrieved HRTF, and drive the sound generating device on the sound receiving apparatus with the processed audio signal to generate a sound field in which the direction of the sound is perceived as coming from the sound source apparatus.
14. A three-dimensional sound generation system with a video display, comprising:
one or more user terminals;
each user terminal comprising at least one receiving device, at least one processing device, at least one sound generating device, and at least one display device;
wherein each user terminal receives a streaming signal composed of a video signal and an audio signal, the video signal displaying an object at a specific position;
identifies, from the streaming signal, a corresponding sound field including one or more sound sources associated with the specific position;
separates the audio signal into one or more tracks, wherein each track contains one or more sound sources in the corresponding sound field of the audio signal;
combines the separated tracks into an audio channel signal according to the identified sound source settings; and
drives the at least one sound generating device using the audio channel signal to reproduce the sound field at the displayed specific position, achieving a rendering effect in which the sound comes from the displayed object at the specific position.
15. The three-dimensional sound generation system of claim 14, wherein the object and its specific position are identified from one or a series of video frames of the video signal by at least one machine learning system, at least one artificial neural network system, at least one digital signal processing system, or a combination thereof; and wherein the steps of separating the audio signal into the one or more tracks, combining the separated tracks into the audio channel signal, and driving the sound generating device are implemented, jointly or separately, by at least one machine learning system, at least one artificial neural network system, at least one digital signal processing system, or a combination thereof.
16. The three-dimensional sound generation system of claim 14, wherein the sound generating device comprises a speaker, earbud, earphone, or headset; and wherein, when the sound generating device comprises an earbud, earphone, or headset, the user terminal retrieves a head-related transfer function (HRTF) corresponding to the sound source position for processing the separated tracks.
17. A computer method for three-dimensional sound generation, comprising:
receiving a streaming sound composed of one or more sound sources, wherein each sound source corresponds to one or more sound types;
receiving a specification, preset, or setting of a three-dimensional space, including one or more listener locations, one or more sound source locations, and one or more sound generating devices and their locations, to generate a desired sound field in the three-dimensional space;
separating the streaming sound into one or more tracks, wherein each track includes the one or more sound types;
combining the separated tracks into one or more audio channels based on the specification, preset, or setting; and
driving the one or more sound generating devices with the one or more audio channels to generate the desired sound field, such that the desired sound field produces an equivalent rendering effect in which sounds are heard, at the one or more listener locations in the three-dimensional space, as coming from the one or more sound source locations.
18. The computer method of claim 17, wherein the three-dimensional space is an acoustic virtual space of a real space, including a stage space of a band or concert, or of a non-real space, including a telephone, video, and/or web conference space; and wherein the specification, preset, and/or setting is implemented, modified, or updated by a user via one or more user interfaces or one or more computer programs, wherein the user interfaces comprise at least a graphical interface.
19. The computer method of claim 17, wherein the steps of separating the streaming sound into the one or more tracks and combining the separated tracks to generate the one or more audio channels are implemented, jointly or separately, by at least one machine learning system, at least one artificial neural network system, at least one digital signal processing system, or a combination thereof.
20. The computer method of claim 17, wherein the sound generating device is a binaural sound generating device comprising earbuds, earphones, or a headset, each track being processed by a retrieved head-related transfer function (HRTF) for the corresponding location and combined into a binaural audio signal to drive the binaural sound generating device to generate the desired sound field; wherein the binaural sound generating device is a standalone device or part of another device comprising at least an in-vehicle system, a wearable apparatus, a computer, a personal digital assistant, a mobile phone, or a display.
21. The computer method of claim 17, wherein the sound generating device is a loudspeaker-based sound generating device comprising speakers, the computer method being implemented in an open or closed space, the closed space including a vehicle or a room, the open space including an outdoor venue or a network space; and wherein connections among the one or more processing devices and with the one or more sound generating devices are wired or wireless, including the Internet, Wi-Fi, Bluetooth, a bus, a local network, or a cloud network.

