CN116830560A - Echo reference generation and echo reference index estimation based on rendering information

Info

Publication number: CN116830560A
Application number: CN202280013949.8A
Authority: CN (China)
Other languages: Chinese (zh)
Legal status: Pending
Prior art keywords: echo, audio, examples, echo reference, references
Inventors: B·J·索斯韦尔, D·古纳万, Y-L·何, S·C·萨马拉塞克拉
Current assignee: Dolby Laboratories Licensing Corp
Original assignee: Dolby Laboratories Licensing Corp
Application filed by Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2022/015436 (WO2022173684A1)
Classification (Landscapes): Circuit For Audible Band Transducer (AREA)

Abstract

Some implementations relate to receiving location information for each of a plurality of audio devices in an audio environment, generating rendering information for the plurality of audio devices in the audio environment based at least in part on the location information, and determining a plurality of echo reference indicators based at least in part on the rendering information. Each echo reference indicator may correspond to audio data reproduced by one or more of the plurality of audio devices. The rendering information may include a loudspeaker activation matrix. Some examples involve estimating importance of each of a plurality of echo references based at least in part on an echo reference indicator, selecting one or more echo references based at least in part on the importance estimates, and providing them to at least one echo management system for use in cancelling or suppressing echoes.

Description

Echo reference generation and echo reference index estimation based on rendering information
Cross Reference to Related Applications
The present application claims priority from European patent application No. 21177382.5, U.S. provisional patent application No. 63/201,939, and U.S. provisional patent application No. 63/147,573, all filed in 2021, each of which is incorporated herein by reference.
Technical Field
The present disclosure relates to devices, systems, and methods for implementing echo management.
Background
Audio devices with acoustic echo management systems have been widely deployed. The acoustic echo management system may include an acoustic echo canceller and/or an acoustic echo suppressor. While existing devices, systems, and methods for acoustic echo management provide benefits, improved devices, systems, and methods would still be desirable.
Symbols and terms
Throughout this disclosure, including in the claims, the terms "speaker," "loudspeaker," and "audio reproduction transducer" are synonymously used to refer to any sound producing transducer (or set of transducers). A typical set of headphones includes two speakers. The speakers may be implemented to include multiple transducers (e.g., woofers and tweeters) that may be driven by a single common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuit branches coupled to different transducers.
Throughout this disclosure, including in the claims, the expression "perform an operation on (on)" a signal or data (e.g., filter, scale, transform, or apply gain) is used in a broad sense to mean either directly performing the operation on the signal or data or performing the operation on a processed version of the signal or data (e.g., a version of the signal that has undergone preliminary filtering or preprocessing prior to performing the operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean a system or device that is programmable or otherwise configurable (e.g., in software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Throughout this disclosure, including in the claims, the term "coupled" or "coupling" is used to mean a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.
As used herein, a "smart device" is an electronic device that may operate interactively and/or autonomously to some degree, typically configured to communicate with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, and the like. Some well-known smart device types are smart phones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, tablet phones and tablet computers, smart watches, smart bracelets, smart key chains, and smart audio devices. The term "smart device" may also refer to a device that exhibits some properties of pervasive computing, such as artificial intelligence.
The expression "smart audio device" is used herein to denote a smart device that is a single-purpose audio device or a multi-purpose audio device (e.g., an audio device implementing at least some aspects of the virtual assistant functionality). A single-purpose audio device is a device that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera) and is designed largely or primarily to achieve a single purpose, such as a television (TV). For example, while a TV may generally play (and be considered capable of playing) audio from program material, in most instances, modern TVs run some operating system on which applications (including television-watching applications) run locally. In this sense, single-purpose audio devices having speaker(s) and microphone(s) are typically configured to run local applications and/or services to directly use the speaker(s) and microphone(s). Some single-purpose audio devices may be configured to be combined together to enable playback of audio over a zone or user-configured area.
One common type of multi-purpose audio device is an audio device that implements at least some aspects of virtual assistant functionality, although other aspects of virtual assistant functionality may be implemented by one or more other devices, such as one or more servers with which the multi-purpose audio device is configured to communicate. Such multi-purpose audio devices may be referred to herein as "virtual assistants." A virtual assistant is a device (e.g., a smart speaker or voice assistant integrated device) that includes or is coupled to at least one microphone (and optionally also includes or is coupled to at least one speaker and/or at least one camera). In some examples, the virtual assistant may provide the ability to use multiple devices (other than the virtual assistant) for applications that in a sense support the cloud or that are otherwise not fully implemented in or on the virtual assistant itself. In other words, at least some aspects of the virtual assistant functionality (e.g., speech recognition functionality) may be implemented (at least in part) by one or more servers or other devices with which the virtual assistant may communicate via a network (e.g., the internet). Virtual assistants can sometimes work together, for example, in a discrete and conditionally defined manner. For example, two or more virtual assistants may work together in the sense that one of them (e.g., the virtual assistant that is most confident that the wake word has been heard) responds to the wake word. In some implementations, the connected virtual assistants may form a constellation that may be managed by a host application, which may be (or implement) the virtual assistant.
In this document, the term "wake word" is used in a broad sense to mean any sound (e.g., a word spoken by a human, or another sound), wherein the smart audio device is configured to wake up in response to detecting ("hearing") the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, "wake up" means that the device enters a state in which it waits for (in other words, listens for) a sound command. In some examples, a so-called "wake word" herein may include more than one word, e.g., a phrase.
Herein, the expression "wake word detector" means a device configured (or software including instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wake word event is triggered whenever the wake word detector determines that the probability of detecting a wake word exceeds a predefined threshold. For example, the threshold may be a predetermined threshold that is adjusted to give a reasonable tradeoff between the false acceptance rate and the false rejection rate. After the wake word event, the device may enter a state (which may be referred to as an "awake" state or an "attention" state) in which the device listens for commands and passes the received commands to a larger, more computationally intensive recognizer.
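As a concrete illustration of the thresholding described above, the sketch below triggers a wake-word event when a detector's per-frame probability exceeds a predefined threshold. The function name, the threshold value, and the per-frame probability input are assumptions made for illustration only; they are not taken from this disclosure.

```python
def wake_word_event(frame_probabilities, threshold=0.85):
    """Return True when any per-frame wake-word probability exceeds the
    predefined threshold. The value 0.85 is only an example; in practice the
    threshold is tuned to trade off false acceptances against false rejections."""
    return any(p > threshold for p in frame_probabilities)

# A detection on the third frame would put the device into its "awake" state.
print(wake_word_event([0.10, 0.22, 0.91, 0.40]))  # True
```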
As used herein, the terms "program stream" and "content stream" refer to a collection of one or more audio signals, and in some instances, a collection of video signals, at least portions of which are intended to be heard together. Examples include music selections, movie soundtracks, movies, television programs, audio portions of television programs, podcasts, live voice conversations, synthesized voice responses from intelligent assistants, and the like. In some examples, the content stream may include multiple versions of at least a portion of the audio signal, e.g., the same conversation in more than one language. In such instances, only one version of the audio data or portion thereof (e.g., a version corresponding to a single language) is intended to be reproduced at a time.
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via one or more audio processing methods. In some examples, the method(s) may be implemented at least in part by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some such methods may involve receiving, by a control system, location information for each of a plurality of audio devices in an audio environment. Some such methods may involve generating, by the control system and based at least in part on the location information, rendering information for a plurality of audio devices in the audio environment. Some such methods may involve determining, by the control system and based at least in part on the rendering information, a plurality of echo reference indicators. In some examples, each of the plurality of echo reference indicators may correspond to audio data reproduced by one or more of the plurality of audio devices.
According to some examples, the rendering information may be or may include a loudspeaker activation matrix. In some examples, the at least one echo reference indicator may correspond to a level of a corresponding echo reference, a uniqueness of a corresponding echo reference, a temporal persistence of a corresponding echo reference, an audibility of a corresponding echo reference, or one or more combinations thereof.
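The disclosure does not specify how such indicators are computed. The following is a minimal sketch of one way a loudspeaker activation matrix might yield per-device level and uniqueness indicators; the function names, array shapes, and weightings are illustrative assumptions rather than the method claimed here.

```python
import numpy as np

def echo_reference_indicators(activation, object_levels):
    """Derive rough per-device echo reference indicators from rendering info.

    activation:    (num_devices, num_objects) loudspeaker activation matrix,
                   i.e. the gain applied to each audio object for each device.
    object_levels: (num_objects,) RMS level estimate of each audio object.
    Returns a dict of indicator arrays, one value per device.
    """
    # Level indicator: how much signal energy each device is expected to emit.
    per_device_level = activation @ object_levels

    # Uniqueness indicator: 1 minus the mean cosine similarity between a
    # device's activation pattern and every other device's pattern. A device
    # rendering the same mix as its neighbours yields a redundant reference.
    norm = activation / (np.linalg.norm(activation, axis=1, keepdims=True) + 1e-12)
    corr = norm @ norm.T
    num_devices = activation.shape[0]
    mean_similarity = (corr.sum(axis=1) - 1.0) / max(num_devices - 1, 1)
    uniqueness = 1.0 - mean_similarity

    return {"level": per_device_level, "uniqueness": uniqueness}

# Example: three devices rendering four audio objects.
activation = np.array([[0.9, 0.1, 0.0, 0.2],
                       [0.1, 0.8, 0.3, 0.2],
                       [0.0, 0.2, 0.9, 0.2]])
object_levels = np.array([0.5, 0.7, 0.4, 0.1])
print(echo_reference_indicators(activation, object_levels))
```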
In some examples, the method may further involve receiving, by the control system, a content stream comprising audio data and corresponding metadata. According to some such examples, determining the at least one echo reference indicator may be based at least in part on one or more of loudspeaker metadata, metadata corresponding to received audio data, or an upmix matrix.
In some embodiments, the control system may be or may include an audio device control system. According to some such embodiments, the method may involve estimating, by the control system and based at least in part on the echo reference indicator, an importance of each echo reference of a plurality of echo references. In some such implementations, making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. The at least one echo management system may be or include an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES. In some such embodiments, the method may involve selecting, by the control system and based at least in part on the importance estimate, one or more selected echo references. In some such embodiments, the method may involve providing, by the control system, the one or more selected echo references to the at least one echo management system.
In some examples, the method may further involve causing at least one echo management system to cancel or suppress echo based at least in part on the one or more selected echo references. According to some examples, making the importance estimate may involve determining an importance index for a corresponding echo reference. In some examples, determining the importance index may be based at least in part on a current listening objective, a current ambient noise estimate, or both the current listening objective and the current ambient noise estimate.
According to some examples, the method may also involve cost determination by the control system. In some examples, the cost determination may involve determining a cost of at least one echo reference of the plurality of echo references. In some such examples, selecting the one or more selected echo references may be determined based at least in part on the cost. In some examples, the cost determination may be based on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, echo management system calculation requirements for using the at least one echo reference by the at least one echo management system, or one or more combinations thereof.
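One plausible way to combine the importance estimates and cost determinations described above is a greedy, budget-constrained selection. This heuristic, and the idea of measuring cost in kilobits per second of network bandwidth, are assumptions made for illustration; the disclosure does not prescribe a particular selection algorithm.

```python
def select_echo_references(importance, cost, budget):
    """Greedy cost/benefit selection of echo references.

    importance: dict mapping reference id -> estimated importance
                (expected contribution to echo mitigation).
    cost:       dict mapping reference id -> cost (e.g. network bandwidth
                in kbps needed to deliver that reference).
    budget:     total cost the device is willing to spend.
    Returns the list of selected reference ids.
    """
    selected, spent = [], 0.0
    # Rank by importance per unit cost, highest first.
    ranked = sorted(importance,
                    key=lambda r: importance[r] / max(cost[r], 1e-9),
                    reverse=True)
    for ref in ranked:
        if spent + cost[ref] <= budget:
            selected.append(ref)
            spent += cost[ref]
    return selected

# Example: the local reference is free; remote references carry network cost.
importance = {"local": 1.0, "devB": 0.6, "devC": 0.5, "devD": 0.1}
cost = {"local": 0.0, "devB": 128.0, "devC": 128.0, "devD": 128.0}
print(select_echo_references(importance, cost, budget=200.0))
```

A real implementation might re-evaluate such a selection whenever the rendering configuration, listening objective, or network conditions change.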
In some examples, the method may further involve determining a current echo management system performance level. According to some such examples, the importance estimate may be based at least in part on the current echo management system performance level.
According to some examples, the method may further involve receiving, by the control system, scene change metadata. In some examples, the importance estimate may be based at least in part on the scene change metadata.
In some examples, the method may also involve rendering the audio data based at least in part on the rendering information to produce rendered audio data. According to some embodiments, the control system may be or may include an orchestration device control system. In some such implementations, the method may further involve providing at least a portion of the rendered audio data to each of the plurality of audio devices.
In some examples, the method may further involve providing at least one echo reference indicator to each of the plurality of audio devices.
According to some examples, the method may further involve generating, by the control system, at least one virtual echo reference corresponding to two or more of the plurality of audio devices.
In some examples, the method may further involve determining, by the control system, a weighted sum of echo references within a certain low frequency range. According to some such examples, the method may involve providing the weighted sum to the at least one echo management system.
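Below is a minimal sketch of the low-frequency weighted sum mentioned above, assuming SciPy is available; the crossover frequency, filter order, and weighting are arbitrary illustrative choices rather than values from the disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def combined_low_frequency_reference(references, weights, sample_rate, cutoff_hz=200.0):
    """Combine several echo references into one low-frequency reference.

    At low frequencies the room responses from different loudspeakers overlap
    heavily, so a single weighted sum can stand in for several separate
    references (cutoff and weights here are illustrative assumptions).

    references: (num_refs, num_samples) array of echo reference signals.
    weights:    (num_refs,) weights, e.g. derived from echo reference indicators.
    """
    low_pass = butter(4, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    weighted_sum = np.tensordot(weights, references, axes=1)
    return sosfilt(low_pass, weighted_sum)

# Example: two 48 kHz references combined with weights favouring the louder device.
fs = 48000
t = np.arange(fs) / fs
refs = np.stack([np.sin(2 * np.pi * 80 * t), 0.5 * np.sin(2 * np.pi * 120 * t)])
low_ref = combined_low_frequency_reference(refs, np.array([0.7, 0.3]), fs)
```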
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, Read Only Memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is or includes an audio processing system having an interface system and a control system. The control system may include one or more general purpose single or multi-chip processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or a combination thereof.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Like reference numbers and designations in the various drawings indicate like elements.
Fig. 1A is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure.
Fig. 1B illustrates an example of an audio environment.
Fig. 1C and 1D illustrate examples of how audio devices 110A-110C may receive playback channels.
Fig. 1E illustrates another example of an audio environment.
Fig. 2A presents a block diagram of an audio device capable of performing at least some of the disclosed embodiments.
Fig. 2B and 2C illustrate additional examples of audio devices in an audio environment.
Fig. 3A presents a block diagram illustrating components of an audio device according to one example.
Fig. 3B and 3C are graphs showing examples of expected echo management performance and the number of echo references used for echo management.
Fig. 4 presents a block diagram illustrating components of an echo reference orchestrator according to one example.
Fig. 5A is a flow chart summarizing one example of a disclosed method.
Fig. 5B is a flow chart summarizing another example of the disclosed methods.
FIG. 6 is a flow chart summarizing one example of a disclosed method.
Fig. 7 and 8 illustrate block diagrams including components of an echo reference orchestrator according to some alternative examples.
Fig. 9A shows an example of a graph showing the positions of a listener and audio devices in an audio environment.
Fig. 9B shows an example of a chart corresponding to the rendering matrix of each audio device shown in fig. 9A.
Fig. 10A and 10B show examples of graphs indicating spatial audio object counts for a single song.
Fig. 11A and 11B show examples of spatially informed correlation matrices and uninformed rendering correlation matrices.
Fig. 12A, 12B, and 12C show examples of echo reference importance rankings according to a PCM-based correlation matrix, a spatially informed correlation matrix, and an uninformed correlation matrix, respectively.
Fig. 13 illustrates a simplified example of determining a virtual echo reference.
Fig. 14 shows an example of a low frequency management module.
Fig. 15A and 15B illustrate examples of low frequency management for embodiments with and without subwoofers.
Fig. 15C illustrates elements that may be used to implement a higher frequency management method according to one example.
FIG. 16 is a block diagram outlining another example of the disclosed method.
FIG. 17 is a flow chart summarizing another example of the disclosed methods.
Fig. 18 shows an example of a plan view of an audio environment, which in this example is a living space.
Detailed Description
Fig. 1A is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the present disclosure. As with the other figures provided herein, the types and numbers of elements shown in fig. 1A are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 50 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 50 may be or may include one or more components of an audio system. For example, in some embodiments, the apparatus 50 may be an audio device, such as a smart audio device. In other examples, apparatus 50 may be a mobile device (e.g., a cellular telephone), a laptop computer, a tablet computer device, a television, or other type of device.
According to some alternative embodiments, the apparatus 50 may be or may include a server. In some such examples, the apparatus 50 may be or may include an encoder. Thus, in some examples, the apparatus 50 may be a device configured for use within an audio environment, such as a home audio environment, while in other examples, the apparatus 50 may be a device configured for use in a "cloud", e.g., a server.
In this example, the apparatus 50 includes an interface system 55 and a control system 60. In some implementations, the interface system 55 can be configured to communicate with one or more other devices of the audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 55 can be configured to exchange control information and associated data with audio devices of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 50.
In some implementations, the interface system 55 can be configured for receiving a content stream or for providing a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some examples, the audio data may include spatial data such as channel data and/or spatial metadata. For example, the metadata may be provided by a device that may be referred to herein as an "encoder." In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 55 may include one or more network interfaces and/or one or more external device interfaces (e.g., one or more Universal Serial Bus (USB) interfaces). According to some embodiments, interface system 55 may include one or more wireless interfaces. The interface system 55 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, interface system 55 may include one or more interfaces between control system 60 and a memory system (such as optional memory system 65 shown in fig. 1A). However, in some examples, control system 60 may include a memory system. In some implementations, the interface system 55 may be configured to receive input from one or more microphones in an environment.
For example, control system 60 may include a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some embodiments, control system 60 may reside in more than one device. For example, in some implementations, a portion of the control system 60 may reside in a device within one of the environments depicted herein, and another portion of the control system 60 may reside in a device outside of the environment, such as a server, mobile device (e.g., smart phone or tablet computer), or the like. In other examples, a portion of control system 60 may reside in a device within one of the environments depicted herein, and another portion of control system 60 may reside in one or more other devices of the environment. For example, the functionality of the control system may be distributed across multiple smart audio devices of the environment, or may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of control system 60 may reside in a device (e.g., a server) implementing a cloud-based service, and another portion of control system 60 may reside in another device (e.g., another server, a memory device, etc.) implementing a cloud-based service. In some examples, interface system 55 may also reside in more than one device.
In some embodiments, the control system 60 may be configured to at least partially perform the methods disclosed herein. According to some examples, the control system 60 may be configured to obtain a plurality of echo references. The plurality of echo references may include at least one echo reference for each of a plurality of audio devices in the audio environment. Each echo reference may, for example, correspond to audio data played back by one or more loudspeakers of one of the plurality of audio devices.
In some implementations, the control system 60 may be configured to perform an importance estimation for each of a plurality of echo references. In some examples, making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. The echo management system(s) may include an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES).
According to some examples, control system 60 may be configured to select one or more selected echo references based at least in part on the importance estimate. In some examples, control system 60 may be configured to provide one or more selected echo references to at least one echo management system.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. One or more non-transitory media may reside, for example, in the optional memory system 65 and/or the control system 60 shown in fig. 1A. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. For example, the software may include instructions for controlling at least one device to perform some or all of the methods disclosed herein. For example, the software may be executed by one or more components of a control system, such as control system 60 of FIG. 1A.
In some examples, the apparatus 50 may include an optional microphone system 70 shown in fig. 1A. The optional microphone system 70 may include one or more microphones. According to some examples, optional microphone system 70 may include a microphone array. In some examples, the microphone array may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, for example, according to instructions from the control system 60. In some examples, the microphone array may be configured for receive-side beamforming, e.g., according to instructions from the control system 60. In some implementations, one or more microphones may be part of or associated with another device (e.g., a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 50 may not include the microphone system 70. However, in some such embodiments, the apparatus 50 may still be configured to receive microphone data for one or more microphones in an audio environment via the interface system 55. In some such embodiments, a cloud-based embodiment of the apparatus 50 may be configured to receive microphone data, or data corresponding to microphone data, from one or more microphones in an audio environment via the interface system 55.
According to some embodiments, the apparatus 50 may include an optional loudspeaker system 75 shown in fig. 1A. The optional loudspeaker system 75 may include one or more loudspeakers, which may also be referred to herein as "speakers" or more generally as "audio reproduction transducers". In some examples (e.g., cloud-based implementations), the apparatus 50 may not include the loudspeaker system 75.
In some embodiments, the apparatus 50 may include an optional sensor system 80 shown in fig. 1A. The optional sensor system 80 may include one or more touch sensors, gesture sensors, motion detectors, and the like. According to some embodiments, the optional sensor system 80 may include one or more cameras. In some implementations, the camera may be a standalone camera. In some examples, one or more cameras of the optional sensor system 80 may reside in a smart audio device, which may be a single-purpose audio device or a virtual assistant. In some such examples, one or more cameras of the optional sensor system 80 may reside in a television, mobile phone, or smart speaker. In some examples, the apparatus 50 may not include the sensor system 80. However, in some such embodiments, the apparatus 50 may still be configured to receive sensor data for one or more sensors in the audio environment via the interface system 55.
In some embodiments, the apparatus 50 may include an optional display system 85 shown in fig. 1A. The optional display system 85 may include one or more displays, such as one or more Light Emitting Diode (LED) displays. In some examples, optional display system 85 may include one or more Organic Light Emitting Diode (OLED) displays. In some examples, optional display system 85 may include one or more displays of a smart audio device. In other examples, optional display system 85 may include a television display, a laptop computer display, a mobile device display, or another type of display. In some examples where apparatus 50 includes display system 85, sensor system 80 may include a touch sensor system and/or a gesture sensor system proximate to one or more displays of display system 85. According to some such embodiments, control system 60 may be configured to control display system 85 to present one or more Graphical User Interfaces (GUIs).
According to some such examples, apparatus 50 may be or may include a smart audio device. In some such embodiments, the apparatus 50 may be or may include a wake-up word detector. For example, the apparatus 50 may be or may include a virtual assistant.
Stereo or mono playback media has traditionally been rendered into an audio environment (e.g., a living space, an automobile, an office space, etc.) via a pair of speakers connected by physical cables to an audio player (e.g., a CD/DVD player, a television (TV), etc.). With the popularity of smart speakers, users typically have more than two audio devices (which may include, but are not limited to, smart speakers or other smart audio devices) configured for wireless communication and capable of playing back audio in their home (or other audio environment).
Smart speakers are typically configured to operate in accordance with voice commands, and are therefore typically configured to listen for a wake word, which will typically be followed by a voice command. Any continuous listening task (such as waiting for a wake word or performing any type of "continuous calibration") will preferably continue to run while content playback (such as music playback, soundtrack playback for movies and television programs, etc.) and device interactions (e.g., a telephone conversation) occur. Audio devices that need to listen during content playback typically need to employ some form of echo management, such as echo cancellation and/or echo suppression, to remove the "echo" (the content played by the device) from the microphone signal.
Fig. 1B illustrates an example of an audio environment. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 1B are provided as examples only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements.
According to this example, audio environment 100 includes audio devices 110A, 110B, and 110C. In this example, each of the audio devices 110A-110C is an example of the apparatus 50 of FIG. 1A and includes an example of the microphone system 70 and the loudspeaker system 75, but these are not shown in FIG. 1B. According to some examples, each audio device 110A-110C may be a smart audio device, such as a smart speaker.
In this example, audio devices 110A-110C play back audio content while person 130 is speaking. The microphone of audio device 110B detects not only audio content played back by its own speaker, but also voice sounds 131 of person 130 and audio content played back by audio devices 110A and 110C.
In order to utilize as many speakers as possible at the same time, a typical approach is to have all audio devices in the audio environment play back the same content and use some timing mechanism to keep the playback media synchronized. This has the advantage of simplifying distribution, because all devices will receive the same copy of the playback media, whether it is downloaded or streamed to each audio device or broadcast or multicast by one device to all audio devices.
One major drawback of this approach is that no spatial effect is obtained. A spatial effect may be achieved by adding more playback channels (e.g., one for each speaker), e.g., by upmixing. In some examples, the spatial effect may be implemented via a flexible rendering process such as center of mass amplitude panning (CMAP), Flexible Virtualization (FV), or a combination of CMAP and FV. Related examples of CMAP, FV, and combinations thereof are described in International patent publication No. WO 2021/021707 A1 (e.g., pages 25-41), which is hereby incorporated by reference.
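The CMAP and FV renderers referenced above are described in the cited publication and are not reproduced here. Purely to illustrate what a loudspeaker activation matrix looks like, the following toy renderer activates each loudspeaker in inverse proportion to its distance from each audio object; it is not CMAP, FV, or any method from this disclosure, and the positions, exponent, and normalization are arbitrary.

```python
import numpy as np

def naive_activation_matrix(speaker_positions, object_positions, power=2.0):
    """Toy renderer: activate each loudspeaker in inverse proportion to its
    distance from each audio object, then normalize so the gains for each
    object sum to one. Illustration only, not CMAP or FV.

    speaker_positions: (num_speakers, 2) x/y positions in metres.
    object_positions:  (num_objects, 2) intended x/y positions of audio objects.
    Returns a (num_speakers, num_objects) activation matrix.
    """
    diffs = speaker_positions[:, None, :] - object_positions[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1) + 1e-3
    act = 1.0 / dist ** power
    return act / act.sum(axis=0, keepdims=True)

speakers = np.array([[0.0, 0.0], [3.0, 0.0], [1.5, 2.5]])
objects = np.array([[0.5, 0.5], [2.5, 1.0]])
print(naive_activation_matrix(speakers, objects))
```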
Fig. 1C and 1D illustrate additional examples of audio devices in an audio environment. According to these examples, audio environment 100 includes smart home hub 105 and audio devices 110A, 110B, and 110C. In these examples, smart home hub 105 and audio devices 110A-110C are examples of apparatus 50 of FIG. 1A. According to these examples, each of the audio devices 110A-110C includes a corresponding one of the loudspeakers 121A, 121B, and 121C. According to some examples, each audio device 110A-110C may be a smart audio device, such as a smart speaker.
Fig. 1C and 1D illustrate examples of how audio devices 110A-110C may receive playback channels. In fig. 1C, the encoded audio bitstream is multicast to all audio devices 110A-110C. In fig. 1D, each of the audio devices 110A-110C receives only the channels that it requires for playback. The choice of bitstream distribution may vary according to individual implementations, and may be based on, for example, available system bandwidth, the efficiency of the audio codec used, the capabilities of the audio devices 110A-110C, and/or other factors. The exact topology of the audio environment shown in fig. 1C and 1D is not important. However, these examples illustrate the fact that distributing audio channels to audio devices incurs some cost. Cost may be assessed in terms of the network bandwidth required, the computational cost added by encoding and decoding the audio channels, and the like.
Fig. 1E illustrates another example of an audio environment. According to this example, audio environment 100 includes audio devices 110A, 110B, 110C, and 110D. In this example, each of the audio devices 110A-110D is an example of the apparatus 50 of fig. 1A and includes at least one microphone (see microphones 120A, 120B, 120C, and 120D) and at least one loudspeaker (see loudspeakers 121A, 121B, 121C, and 121D). According to some examples, each audio device 110A-110D may be a smart audio device, such as a smart speaker.
In this example, audio devices 110A-110D render content 122A, 122B, 122C, and 122D via loudspeakers 121A-121D. Each of the microphones 120A-120D detects an "echo" corresponding to content 122A-122D played back by each of the audio devices 110A-110D. In this example, audio devices 110A-110D are configured to listen for commands or wake words in speech 131 from person 130 within audio environment 100.
Fig. 2A presents a block diagram of an audio device capable of performing at least some of the disclosed embodiments. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 2A are provided as examples only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. In this example, audio device 110A is an example of audio device 110A of fig. 1E. Here, the audio device 110A includes a control system 60A, which is an example of the control system 60 of fig. 1A. According to this embodiment, the control system 60A is able to listen to the voice 131 of the person 130 in the presence of echoes corresponding to the content 122A, 122B, 122C and 122D played back by each audio device in the audio environment 100.
According to this example, control system 60A implements renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, and a speech processing block 240A. The MC-EMS 203A may include an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES, depending on the particular implementation. According to this example, the speech processing block 240A is configured to detect wake words and commands of the user. In some implementations, the speech processing block 240A can be configured to support a communication session, such as a telephone call.
In this embodiment, the renderer 201A is configured to provide the local echo reference 220A to the MC-EMS 203A. The local echo reference 220A corresponds to (and in this example is equivalent to) a speaker feed provided to the loudspeaker 121A for playback by the audio device 110A. According to this example, the renderer 201A is further configured to provide a non-local echo reference 221A (corresponding to content 122B, 122C, and 122D played back by other audio devices in the audio environment 100) to the MC-EMS 203A.
According to some examples, audio device 110A receives a combined bitstream of audio data (e.g., as shown in FIG. 1C) that includes all of audio devices 110A-110D of FIG. 1E. In some such examples, the renderer 201A may be configured to separate the local echo reference 220A from the non-local echo reference 221A to provide the local echo reference 220A to the loudspeaker 121A and to provide the local echo reference 220A and the non-local echo reference 221A to the MC-EMS 203A. In some alternative examples, audio device 110A may receive a bitstream intended for playback only on audio device 110A, e.g., as shown in fig. 1D. In some such examples, the smart home hub 105 (or other audio devices 110B-D) may provide the non-local echo reference 221A to the audio device 110A, as indicated by the dashed arrow next to reference 221A in fig. 2A.
In some examples, local echo reference 220A and/or non-local echo reference 221A may be full-fidelity replicas of the speaker feeds provided to loudspeakers 121A-121D for playback. In some alternative examples, local echo reference 220A and/or non-local echo reference 221A may be lower-fidelity representations of the speaker feed signals provided to loudspeakers 121A-121D for playback. In some such examples, non-local echo reference 221A may be a downsampled version of the speaker feeds provided to loudspeakers 121B-121D for playback. According to some examples, non-local echo reference 221A may be a lossy compression of a speaker feed signal provided to loudspeakers 121B-121D for playback. In some examples, non-local echo reference 221A may be banded power information corresponding to the speaker feed signals provided to loudspeakers 121B-121D for playback.
According to this embodiment, the MC-EMS 203A is configured to use the local echo reference 220A and the non-local echo reference 221A to predict and cancel and/or suppress echoes from the microphone signal 223A, thereby generating a residual signal 224A in which the speech-to-echo ratio (SER) may have been improved relative to the microphone signal 223A. The residual signal 224A may enable the speech processing block 240A to detect user wake-up words and commands. In some implementations, the speech processing block 240A can be configured to support a communication session, such as a telephone call.
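The MC-EMS 203A is not specified at the algorithmic level in this passage. As a rough, generic illustration of how a local reference and non-local references can jointly drive echo cancellation, here is a minimal multi-channel normalized-LMS canceller; the filter length, step size, and per-sample time-domain update are arbitrary assumptions, and a practical system would typically also include an echo suppressor and subband processing.

```python
import numpy as np

def mc_nlms_echo_canceller(mic, references, filter_len=256, mu=0.5, eps=1e-6):
    """Minimal multi-channel NLMS acoustic echo canceller.

    mic:        (num_samples,) microphone signal containing echo.
    references: (num_refs, num_samples) echo reference signals
                (e.g. a local reference plus one or more non-local references).
    Returns the residual signal after echo cancellation.
    """
    num_refs, num_samples = references.shape
    w = np.zeros((num_refs, filter_len))          # one adaptive filter per reference
    residual = np.zeros(num_samples)
    for n in range(filter_len, num_samples):
        # Most recent filter_len samples of each reference, newest first.
        x = references[:, n - filter_len:n][:, ::-1]
        echo_estimate = np.sum(w * x)
        e = mic[n] - echo_estimate
        residual[n] = e
        # Normalized LMS update applied jointly across all reference channels.
        w += mu * e * x / (np.sum(x * x) + eps)
    return residual
```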
Some aspects of the present disclosure relate to importance estimation for each of a plurality of echo references (e.g., for local echo reference 220A and non-local echo reference 221A). Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by the MC-EMS 203A of audio device 110A). Various examples are provided below.
In the context of distributed and orchestrated devices, according to some examples, each audio device may obtain, for purposes of echo management, not only its own echo reference but also echo references corresponding to content played back by one or more other audio devices in the audio environment. The impact of including a particular echo reference in a local echo management system or "EMS" (such as the MC-EMS 203A of audio device 110A) may vary depending on a number of parameters, such as the diversity of the audio content being played out, the network bandwidth required for transmitting the echo reference, the encoding calculation requirements for encoding the echo reference (in the case of transmitting an encoded echo reference), the decoding calculation requirements for decoding the echo reference, the echo management system calculation requirements for using the echo reference, the relative audibility of the audio device, and so on.
For example, if each audio device is rendering the same content (in other words, if mono audio is being played back), providing an additional reference to the EMS has little (though non-zero) benefit. Furthermore, due to practical limitations (such as bandwidth-limited networks), it may not be desirable for all devices to share a replica of their local echo reference. Thus, some embodiments may provide a Distributed and Orchestrated EMS (DOEMS), where echo references are prioritized and transmitted (or not transmitted) accordingly. Some such examples may implement a tradeoff between cost (e.g., required network bandwidth and/or required computational overhead) and benefit (e.g., expected echo mitigation improvements, which may be measured in terms of signal-to-echo ratio (SER) and/or echo return loss enhancement (ERLE)) for each additional echo reference.
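The disclosure quantifies benefit in terms of SER and ERLE but does not define them formally; the conventional definitions, stated here for orientation rather than as part of the disclosed method, are

$$\mathrm{SER} = 10\log_{10}\frac{E\{s^2(n)\}}{E\{d^2(n)\}}\ \mathrm{dB}, \qquad \mathrm{ERLE} = 10\log_{10}\frac{E\{d^2(n)\}}{E\{e^2(n)\}}\ \mathrm{dB},$$

where $s(n)$ is the near-end speech component of the microphone signal, $d(n)$ is the echo component, and $e(n)$ is the residual echo remaining after echo management. On this view, an additional echo reference is worthwhile when the ERLE improvement it enables (and hence the effective SER improvement seen by, e.g., wake-word detection) outweighs the network bandwidth and computation it consumes.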
Fig. 2B and 2C illustrate additional examples of audio devices in an audio environment. According to these examples, audio environment 100 includes smart home hub 105 and audio devices 110A, 110B, and 110C. In these examples, smart home hub 105 and audio devices 110A-110C are examples of apparatus 50 of FIG. 1A. According to these examples, each of the audio devices 110A-110C includes a corresponding one of the microphones 120A, 120B, and 120C and a corresponding one of the loudspeakers 121A, 121B, and 121C. According to some examples, each audio device 110A-110C may be a smart audio device, such as a smart speaker.
In FIG. 2B, the smart home hub 105 sends the same encoded audio bitstream to all of the audio devices 110A-110C. In fig. 2C, the smart home hub 105 only transmits the audio channels required for playback by each of the audio devices 110A-110C. In both examples, audio channel 0 is intended for playback on audio device 110A, audio channel 1 is intended for playback on audio device 110B and audio channel 2 is intended for playback on audio device 110C.
Fig. 2B and 2C illustrate examples of sharing echo reference data over a local network. In these examples, audio device 110A sends echo reference 220A', which corresponds to the loudspeaker playback of audio device 110A, to audio devices 110B and 110C over the local network. In these examples, the echo reference 220A' is different from the channel 0 audio found in the bitstream. In some examples, the echo reference 220A' may be different from the channel 0 audio because post-playback processing is implemented on the audio device 110A. In the example shown in fig. 2C, not all of the audio devices 110A-110C are provided with the combined bitstream, and thus another device (such as the audio device 110A or the smart home hub 105) provides the echo reference 220A'. In the scenario depicted in fig. 2B, the combined bitstream is provided to all audio devices 110A-110C; even so, in some such instances it may still be desirable to transmit the echo reference 220A'.
In other examples, the echo reference 220A 'may be different from channel 0 audio because the echo reference 220A' may not be a full fidelity replica of the audio data played back on the audio device 110A. In some such examples, the echo reference 220A 'may correspond to audio data played back on the audio device 110A, but may require relatively less data than a complete replica, and thus may consume relatively less local network bandwidth when the echo reference 220A' is transmitted.
According to some such examples, the audio device 110A may be configured to generate a downsampled version of the local echo reference 220A described above with reference to fig. 2A. In some such examples, the echo reference 220A' may be or may include a downsampled version.
In some examples, the audio device 110A may be configured to lossy compress the local echo reference 220A. In such an instance, the echo reference 220A' may be the result of the control system 60A applying a lossy compression algorithm to the local echo reference 220A.
According to some examples, audio device 110A may be configured to provide audio devices 110B and 110C with banded power information corresponding to local echo reference 220A. In some such examples, instead of transmitting a full-fidelity replica of the audio data played back on audio device 110A, control system 60A may be configured to determine a power level in each of a plurality of frequency bands of the audio data played back on audio device 110A and transmit the corresponding banded power information to audio devices 110B and 110C. In some such examples, the echo reference 220A' may be or may include the banded power information.
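Below is a minimal sketch of how banded power information for one frame of a local echo reference might be computed before transmission; the band edges, window, and frame size are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def banded_power_info(frame, sample_rate,
                      band_edges_hz=(0, 200, 500, 1000, 2000, 4000, 8000)):
    """Compute per-band power for one frame of a local echo reference.

    Transmitting these few numbers per frame instead of PCM audio is one way
    to share a low-rate representation of an echo reference with other devices.
    """
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    bands = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(power[mask].sum())
    return np.array(bands)

# Example: a 20 ms frame at 48 kHz reduces to len(band_edges_hz) - 1 numbers.
fs = 48000
frame = np.random.default_rng(0).standard_normal(960)
print(banded_power_info(frame, fs))
```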
Fig. 3A presents a block diagram illustrating components of an audio device according to one example. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 3A are provided as examples only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. For example, some implementations may be configured to send and/or receive "original" echo references (which may be complete, full-fidelity replicas of the audio being rendered on an audio device) or low-fidelity versions or representations of the audio being rendered on an audio device (such as downsampled versions, versions produced by lossy compression, or banded power information corresponding to the audio being rendered on an audio device), but not to send and/or receive both the original version and the low-fidelity version.
In this example, the audio device 110A is an example of the audio device 110A of fig. 1E and includes a control system 60A, which is an example of the control system 60 of fig. 1A. According to this example, control system 60A is configured to implement a renderer 201A, a multi-channel acoustic echo management system (MC-EMS) 203A, a speech processing block 240A, an echo reference orchestrator 302A, a decoder 303A, and a noise estimator 304A. The reader may assume that the MC-EMS 203A and the speech processing block 240A function as described above with reference to FIG. 2A, unless otherwise indicated by the following description of FIG. 3A. In this example, network interface 301A is an example of interface system 55 described above with reference to FIG. 1A.
In this example, the elements of fig. 3A are as follows:
110A: an audio device;
120A: a representative microphone. In some implementations, the audio device 110A may have more than one microphone;
121A: representative microphones. In some implementations, the audio device 110A may have more than one loudspeaker;
201A: a renderer that generates a reference for local playback and an echo reference that simulates audio played back by other audio devices in the audio environment;
203A: a multi-channel acoustic echo management system (MC-EMS) that may include an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES);
220A: local echo references for playback and cancellation;
221A: locally generated copies of echo references being played by one or more non-local audio devices (one or more other audio devices in an audio environment);
223A: a plurality of microphone signals;
224A: a plurality of residual signals (MC-EMS 203A eliminates and/or suppresses microphone signals after the predicted echo);
240A: a voice processing block configured for wake word detection, voice command detection and/or providing telephony communications;
301A: a network interface configured for communication between audio devices, which may also be configured for communication via the internet and/or via one or more cellular networks;
302A: an echo reference composer configured to rank the echo references and select an appropriate set of one or more echo references;
303A: an audio decoder block;
304A: a noise estimator block;
310A: one or more decoded echo references received by audio device 110A from one or more other devices in the audio environment;
311A: transmitting a request for echo references from one or more other devices (such as a smart home hub or one or more of the audio devices 110B-110D) over a local network;
312A: metadata, which may be or may include metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, upmix matrices, and/or loudspeaker activation matrices;
313A: An echo reference selected by the echo reference orchestrator 302A;
314A: echo references received by device 110A from one or more other devices;
315A: echo references sent from device 110A to other devices;
316A: the device 110A receives raw echo references from one or more other devices of the audio environment;
317A: a low fidelity (e.g., codec) version of the echo reference received by device 110A from one or more other devices of the audio environment;
318A: an audio environmental noise estimate;
350A: one or more indicators indicative of the current performance of the MC-EMS203A, which may be or may include adaptive filter coefficient data or other AEC statistics, speech Echo (SER) ratio data, and the like.
Echo reference orchestrator 302A may function in various ways, depending on the particular implementation. Many examples are disclosed herein. In some examples, the echo reference orchestrator 302A may be configured to make an importance estimate for each of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo reference 221A). Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by the MC-EMS 203A of audio device 110A).
Some examples of making the importance estimate may involve determining an importance index. In some such examples, the importance index may be based at least in part on one or more characteristics of each echo reference, such as level, uniqueness, temporal persistence, audibility, or one or more combinations thereof. In some examples, the importance index may be based at least in part on metadata (e.g., metadata 312A), such as metadata corresponding to the audio device layout, loudspeaker metadata, metadata corresponding to the received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, the importance index may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of the current performance of the at least one echo management system, or one or more combinations thereof.
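The following sketch shows one way the indicators, the current listening objective, and an ambient noise estimate might be folded into a single importance index. The weighting scheme, the objective-dependent gain, and the assumption that strong ambient noise down-weights a reference are illustrative choices, not taken from the disclosure.

```python
def importance_index(indicator, ambient_noise_level,
                     listening_objective="wakeword", weights=None):
    """Combine per-reference indicators into a single importance index.

    indicator: dict with keys such as "level", "uniqueness", "audibility",
               each already normalised to roughly [0, 1].
    ambient_noise_level: current noise estimate on the same normalised scale;
               assumed here to down-weight references whose residual echo
               would be masked by the noise anyway.
    listening_objective: e.g. "wakeword" or "communication"; a communication
               session is assumed to demand more aggressive echo removal.
    """
    if weights is None:
        weights = {"level": 0.5, "uniqueness": 0.3, "audibility": 0.2}
    base = sum(weights[k] * indicator.get(k, 0.0) for k in weights)
    objective_gain = 1.5 if listening_objective == "communication" else 1.0
    return base * objective_gain * max(1.0 - ambient_noise_level, 0.1)

print(importance_index({"level": 0.8, "uniqueness": 0.4, "audibility": 0.9},
                       ambient_noise_level=0.2))
```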
According to some examples, echo reference orchestrator 302A may be configured to select a set of one or more echo references based at least in part on the cost determination. In some examples, echo reference orchestrator 302A may be configured to make the cost determination, while in other examples another block of control system 60A may be configured to make the cost determination. In some examples, the cost determination may involve determining a cost of at least one of the plurality of echo references or, in some cases, of each of the plurality of echo references. In some examples, the cost determination may be based on the network bandwidth required for transmitting the echo reference, encoding calculation requirements for encoding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, downsampling costs for making a downsampled version of the echo reference, echo management system calculation requirements for using the at least one echo reference by the echo management system, or one or more combinations thereof.
According to some examples, the cost determination may be based on whether the at least one echo reference is provided as a replica in the time or frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, banded power information for the at least one echo reference, or one or more combinations thereof. In some examples, a relatively more important echo reference may be compressed less than a relatively less important echo reference. In some implementations, the echo reference orchestrator 302A (or another block of the control system 60A) may be configured to determine a current echo management system performance level (e.g., based at least in part on the indicator(s) 350A). In some such examples, selecting one or more selected echo references may be based at least in part on the current echo management system performance level.
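Echoing the idea that less important references can be compressed more, here is a toy mapping from an importance estimate to a transmission representation; the tier names and thresholds are assumptions made for illustration rather than anything specified by the disclosure.

```python
def choose_representation(importance, thresholds=(0.7, 0.3)):
    """Map an importance estimate to a fidelity tier for transmission.

    Higher-importance references are compressed less; the tiers below mirror
    the representations discussed in the text (full replica, lossy/downsampled
    version, banded power information).
    """
    hi, lo = thresholds
    if importance >= hi:
        return "full_pcm"        # complete, full-fidelity replica
    if importance >= lo:
        return "lossy_coded"     # lossy-compressed or downsampled version
    return "banded_power"        # banded power information only

for name, imp in {"local": 1.0, "devB": 0.55, "devD": 0.1}.items():
    print(name, choose_representation(imp))
```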
The rate at which the importance of each echo reference is estimated and the rate at which the echo reference set is selected may differ depending on the distributed audio device system, its configuration, the type of audio session (e.g., communication or listening to music) and/or the nature of the rendered content. Furthermore, the rate at which importance is estimated need not be equal to the rate at which the echo reference selection process makes decisions. If the two are not synchronized, in some examples the importance calculations will be the more frequent of the two. In some examples, the echo reference selection may be a discrete process in which a binary decision is made as to whether or not a particular echo reference is used.
Fig. 3B and 3C are graphs showing examples of expected echo management performance versus the number of echo references used for echo management. In fig. 3B, it can be seen that as additional references are added, the expected echo management performance also improves. However, in this example, there are only a few discrete points at which the system can operate. In some examples, the points shown in fig. 3B may correspond to processing a complete, full-fidelity replica of each echo reference. For example, point 301 may correspond to an instance of processing a local echo reference (e.g., local reference 220A of fig. 2A or 3A), and point 310 may correspond to an instance of receiving a complete replica of a first non-local echo reference (e.g., a full-fidelity version of one of the received echo references 314A of fig. 3A, which may have been selected as the most important non-local echo reference) and processing both the local echo reference and the complete replica of the first non-local echo reference.
FIG. 3C illustrates one example of operation between any two of the discrete operating points shown in FIG. 3B. The line connecting the points in fig. 3B may, for example, correspond to a range of echo reference fidelity, including lower-fidelity versions or representations of each echo reference. For example, points 303, 305, and 307 may correspond to copies or representations of the first non-local echo reference at increasing fidelity levels, where point 303 corresponds to the lowest-fidelity representation and point 307 corresponds to the highest-fidelity representation other than the full-fidelity replica. In some examples, point 303 may correspond to segmented power information of the first non-local echo reference. According to some examples, points 305 and 307 may correspond to a relatively more lossy compression of the first non-local echo reference and a relatively less lossy compression of the first non-local echo reference, respectively.
The fidelity of a copy or representation of an echo reference generally increases with the number of bits required for each such copy or representation. Thus, the fidelity of the copy or representation of the echo reference provides an indication of the tradeoff between network cost (due to the number of bits required for transmission) and expected echo management performance (since performance should increase as fidelity increases). Note that the straight line connecting the points in fig. 3C represents only one of many different possible trajectories, in part because the incremental change from one echo reference to the next depends on which echo reference will be selected as the next echo reference, and in part because there may not be a linear relationship between expected echo management performance and fidelity.
Fig. 4 presents a block diagram illustrating components of an echo reference orchestrator according to one example. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 4 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements. For example, some implementations may be configured to send and/or receive "original" echo references (which may be full-fidelity copies of audio reproduced on an audio device), low-fidelity versions or representations of audio reproduced on an audio device (such as downsampled versions, versions produced by lossy compression, or segment power information corresponding to audio reproduced on an audio device), but not send and/or receive both the original and low-fidelity versions. As another example, some implementations of the echo reference orchestrator 302A may include a metadata-based metric calculation module, such as the metadata-based metric calculation module 705 described herein with reference to fig. 7 and subsequent figures. In some such examples, the metadata-based metrics calculation module may generate EMS look-ahead statistics based at least in part on the scene change message(s) from the scene change analyzer, and may provide the EMS look-ahead statistics to the MC-EMS performance model 405A. According to some examples, the metadata-based metric calculation module may generate an echo reference characteristic from which the importance metric 420 may be determined. In some examples, the echo reference characteristics may be based at least in part on the metadata 312. According to some examples, the echo reference characteristic may be based at least in part on the audio scene change message. In some examples, the metadata-based metric calculation module may provide the echo reference characteristics to the echo reference importance estimator 401A. According to some examples, the metadata-based metrics calculation module may provide the echo reference characteristics to the echo reference selector 402A.
In this example, the echo reference orchestrator 302A is an example of the echo reference orchestrator 302A of fig. 3A and is implemented by an example of the control system 60a of fig. 3A. According to this example, the elements of fig. 4 are as follows:
220A: local echo references for playback and cancellation;
221A: a locally generated copy of a non-local echo reference being played by another audio device of the audio environment;
302A: an echo reference orchestrator configured to rank and select a set of one or more echo references;
310A: one or more decoded echo references received by audio device 110A from one or more other devices in the audio environment;
311A: transmitting a request for echo referencing from one or more other devices of the audio environment over a local network;
312A: metadata, which may be or may include metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, upmix matrices, and/or loudspeaker activation matrices;
313A: in this example, a set of one or more echo references selected by the echo reference orchestrator 302A and sent to the MC-EMS 203A;
316A: the device 110A receives raw echo references from one or more other devices of the audio environment;
317A: a low fidelity (e.g., codec) version of the echo reference received by device 110A from one or more other devices of the audio environment;
318A: an audio environmental noise estimate;
350A: one or more indicators indicative of the current performance of the MC-EMS 203A, which may be or may include adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, and the like;
401A: an echo reference importance estimator configured to estimate an expected importance of each echo reference and, in this example, generate a corresponding importance index 420A;
402: an echo reference selector configured to select the echo reference set 313A based at least in part on a current listening object (as shown at 421A), a cost per echo reference (as shown at 422A), a current state/performance of the EMS (as shown at 350A), and an estimated importance of each candidate echo reference (as shown by an importance index 420A), in this example;
403A: a cost estimation module configured to determine cost(s) (e.g., computational and/or network cost) to include the echo references in echo reference set 313A;
404A: an optional module for determining or estimating a current listening object of the audio device 110A;
405A: a module configured to implement one or more MC-EMS performance models, which in some examples may generate data such as that shown in fig. 3B or fig. 3C;
420A: an importance index 420A generated by the echo reference importance estimator 401A;
421A: information indicating a current listening object;
422A: information indicating cost(s) to include the echo references in echo reference set 313A; and
423A: information generated by MC-EMS performance model 405A, which in some examples may be or include data such as shown in FIG. 3B or FIG. 3C; the information 423A may be referred to herein as "EMS health data".
The echo reference importance estimator 401A may function in various ways depending on the particular implementation. Various examples are provided in the present disclosure. In some examples, the echo reference importance estimator 401A may be configured to perform an importance estimation for each of a plurality of echo references (e.g., for the local echo reference 220A and the non-local echo reference 221A). Making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment (e.g., echo mitigation by MC-EMS203A of audio device 110A).
In this example, making the importance estimate involves determining an importance index 420A. The importance index 420A may be based at least in part on one or more characteristics of each echo reference, such as level, uniqueness, time persistence, audibility, or one or more combinations thereof. In some examples, the importance index may be based at least in part on metadata (e.g., metadata 312A) that may include metadata corresponding to an audio device layout, loudspeaker metadata (e.g., Sound Pressure Level (SPL) ratings, frequency ranges, whether the loudspeaker is an upward-firing loudspeaker, etc.), metadata corresponding to received audio data (e.g., location metadata, metadata indicating a human voice or other speech, etc.), an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, the echo reference importance estimator 401A may provide the importance index 420A to the MC-EMS performance model 405A, as indicated by the dashed arrow 420A.
According to this example, the importance index 420A is based at least in part on the current listening objective, as indicated by information 421A. As described in more detail below, the current listening objective may significantly change the manner in which factors such as level, uniqueness, time persistence and audibility are evaluated. For example, the importance analysis performed during a telephone call may be quite different from that performed while waiting for a wake word.
In this example, the importance index 420A is based at least in part on the current ambient noise estimate 318A, the indicator(s) 350A indicative of the current performance of the MC-EMS 203A, the information 423A generated by the MC-EMS performance model 405A, or one or more combinations thereof. In some implementations, the echo reference importance estimator 401A may determine that, if the room noise level is relatively high (as indicated by the current ambient noise estimate 318A), adding an echo reference is unlikely to significantly help mitigate echo. As described above, the information 423A may correspond to the type of information described above with reference to figs. 3B and 3C, which may provide a direct correlation between the use of echo references and the expected performance increase of the MC-EMS 203A. As described in more detail below, the performance of the EMS may be based in part on the robustness of the EMS when disturbed by noise in the audio environment.
According to this embodiment, the echo reference selector 402A selects a set of one or more echo references based at least in part on: one or more indicators 350A indicating the current performance of the MC-EMS 203A, the importance index 420A, the current listening objective 421A, information 422A indicating the cost(s) of including the echo references in the echo reference set 313A, and information 423A generated by the MC-EMS performance model 405A. Some detailed examples of how the echo reference selector 402A may select an echo reference are provided below.
In this example, the cost estimation module 403A is configured to determine a computational and/or network cost of including the echo references in the echo reference set 313A. The computational cost may, for example, include additional computational cost of using a particular echo reference by the MC-EMS 203A. This computational cost may in turn depend on the number of bits required to represent the echo reference. In some examples, the computational cost may include the computational cost of a lossy echo reference encoding process and/or the computational cost of a corresponding echo reference decoding process. Determining the network cost may involve determining the amount of data required to transmit a complete replica of the echo reference or a copy or representation of the echo reference across a local data network (e.g., a local wireless data network).
In some examples, the echo reference selection block 402A may generate and transmit a request 311A for another device in the audio environment to send one or more echo references to it over a network. (Element 314A of fig. 3A indicates one or more echo references received by audio device 110A, which in some instances may have been sent in response to request 311A.) In some examples, the request 311A may specify the fidelity of the requested echo reference, e.g., whether an "original" copy of the echo reference (a full-fidelity copy) should be sent, whether an encoded version of the echo reference should be sent, whether a relatively more or relatively less lossy compression algorithm should be applied to the echo reference if an encoded version is to be sent, whether segmented power information corresponding to the echo reference should be sent, etc.
One may note that a request for an encoded echo reference not only introduces network costs due to the sending of the request and of the reference, but also increases the computational cost for the responding device(s) (e.g., the smart home hub 105 or one or more of audio devices 110B-110D), which must encode the reference, as well as the computational cost for audio device 110A, which must decode the received reference. However, the encoding cost may be a one-time cost. Thus, sending a request for an encoded reference over a network from one audio device to another changes the potential performance/cost tradeoff performed in the other devices (e.g., in audio devices 110C and 110D).
In some implementations, one or more blocks of the echo reference orchestrator 302A may be performed by an orchestration device (e.g., the smart home hub 105 or one of the audio devices 110A-110D). According to some such embodiments, at least some functions of the echo reference importance estimator 401A and/or the echo reference selection block 402A may be performed by an orchestration device. Some such implementations may be capable of determining a cost/benefit tradeoff for the overall system in view of performance enhancements for all instances of the MC-EMS in an audio environment, overall computing requirements for all instances of the MC-EMS, overall requirements for the local network, and/or overall computing requirements for all encoders and decoders.
Examples of various indices and components
Importance index
Briefly, an importance index (which may be referred to herein as "importance" or "I") may be a measure of the expected improvement in EMS performance due to inclusion of a particular echo reference. In some embodiments, the importance may depend on the current state of the EMS, in particular on the set of echo references already in use and the fidelity level at which they are being received. The importance may be obtained on different time scales, depending on the particular implementation. In one extreme case, importance may be evaluated on a frame-by-frame basis (e.g., based on the importance of the signal in each frame). In other examples, the importance may be implemented as a constant value for the duration of a content segment, or as a constant value for the time during which a particular configuration of the audio devices is in use. The configuration of the audio devices may correspond to audio device locations and/or audio device orientations.
Thus, the importance index may be calculated on various time scales depending on the particular implementation, for example:
analyzing the current audio content in real time, e.g., according to events in the audio environment (e.g., incoming calls), etc.;
On a longer time scale, e.g. track by track, where the tracks correspond to content segments such as songs or other pieces of music content that may last, e.g., on a time scale of a few minutes; or alternatively
Only once, for example, when the audio system is initially configured or reconfigured.
The decision about which echo references to select for echo management purposes may be made on a time scale similar to (or slower than) the time scale of evaluating the importance index. For example, a device or system may estimate importance every 30 seconds and make decisions about changing the selected echo reference every few minutes.
According to some examples, the control system may be configured to determine an importance matrix, which may include all importance information of the current audio device system. In some such examples, the importance matrix may have a dimension n×m, including an entry for each audio device and an entry for each potential echo reference channel. In some such examples, N represents the number of audio devices and M represents the number of potential echo references. This type of importance matrix is not always square, as some audio devices may play back more than one channel.
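As an illustration, the importance matrix described above could be assembled as in the following Python sketch. The dictionary-based input, the zero-fill convention for unused entries, and the function name are assumptions made for illustration only, not a prescribed data layout.

```python
import numpy as np

def build_importance_matrix(num_devices, num_references, importance_entries):
    """Assemble an N x M importance matrix.

    importance_entries maps (device_index, reference_index) -> importance value;
    entries for references that a device does not render remain zero.
    """
    matrix = np.zeros((num_devices, num_references))
    for (n, m), value in importance_entries.items():
        matrix[n, m] = value
    return matrix

# Example: 3 audio devices, 4 potential echo reference channels.
I = build_importance_matrix(3, 4, {(0, 0): 0.9, (0, 1): 0.4, (1, 2): 0.7, (2, 3): 0.2})
```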
In some implementations, the importance index I may be based on one or more of the following:
l: level of echo reference;
u: uniqueness of the echo references;
p: time persistence of echo references, and/or
A: audibility of the device rendering the echo reference.
As used herein, the acronym "LUPA" generally refers to an echogenic reference characteristic from which an importance indicator may be determined, including but not limited to one or more of L, U, P and/or a.
L or "level" aspects
This aspect describes the level or loudness of the echo reference. Other conditions being equal, it is known that the louder the playback signal, the greater the impact on EMS performance. As used herein, the term "level" refers to a level within a digital representation of an audio signal, and not necessarily to the actual sound pressure level of the audio signal after reproduction via a loudspeaker. In some examples, the loudness of a single channel of the echo reference may be based on a Root Mean Square (RMS) indicator or an LKFS (k-weighted loudness relative to full scale) indicator. Such an index is easily calculated in real time on the echo reference or may exist as metadata in the bit stream. According to some embodiments, L may be determined from a volume setting, such as an audio system volume setting or a volume setting within a media application.
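For example, a simple per-frame RMS level for one echo reference channel might be computed as in the following sketch; the frame length, hop size and dBFS floor are illustrative assumptions, and an LKFS-based measure could be substituted.

```python
import numpy as np

def frame_rms_dbfs(reference, frame_len=1024, hop=512):
    """Return the RMS level, in dB relative to full scale, of each analysis frame."""
    reference = np.asarray(reference, dtype=float)
    levels = []
    for start in range(0, len(reference) - frame_len + 1, hop):
        frame = reference[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        levels.append(20.0 * np.log10(max(rms, 1e-10)))  # floor avoids log(0) for silence
    return np.array(levels)
```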
U or "uniqueness" aspects
The uniqueness aspect aims to capture how much new information a particular echo reference provides about the overall audio presentation. From a statistical perspective, multi-channel audio presentations often contain redundancy across channels. For example, such redundancy may occur because musical instruments and other sound sources are duplicated on the left and right channels of the room, or because a signal is panned and thus reproduced in multiple active loudspeakers simultaneously. Although this scenario means that the EMS has to solve a non-unique problem, in which the echo filter may infer the observations from multiple echo paths, some benefit and higher performance may still be observed in practice.
U may be calculated or estimated in various ways. In some examples, U may be based at least in part on a correlation coefficient between each echo reference. In one such example, U may be estimated as follows:
where the subscript "r" corresponds to the particular echo reference being evaluated, N represents the total number of audio devices in the audio environment, N represents a single audio device, M represents the total number of potential echo references in the audio environment, and M represents a single echo reference.
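The exact estimator is not reproduced in this excerpt; the following sketch shows one hedged, correlation-based possibility, in which a reference is treated as more unique when its average absolute correlation with the other candidate references is low. The function name and the averaging choice are assumptions made for illustration.

```python
import numpy as np

def uniqueness(references, r):
    """references: M x T array of M candidate echo references, T samples each.
    Returns a value in [0, 1]; 1 means the r-th reference is uncorrelated with the others."""
    M = references.shape[0]
    if M < 2:
        return 1.0
    corr = np.corrcoef(references)                      # M x M correlation-coefficient matrix
    others = [abs(corr[r, m]) for m in range(M) if m != r]
    return 1.0 - float(np.mean(others))                 # low average correlation -> high uniqueness
```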
Alternatively or additionally, in some examples, U may be based at least in part on decomposing the audio signal to find redundancy. Some such examples may involve instantaneous frequency estimation, fundamental frequency (F0) estimation, spectral inversion, and/or non-Negative Matrix Factorization (NMF).
According to some examples, U may be based at least in part on data used for matrix decoding. Matrix decoding is an audio technique in which a small number of discrete audio channels (e.g., 2) are decoded into a larger number of channels (e.g., 4 or 5) upon playback. The channels are typically arranged for transmission or recording by an encoder and decoded by a decoder for playback. Matrix decoding allows multichannel audio (e.g., surround sound) to be encoded into a stereo signal that can be played back as stereo on a stereo device and as surround sound on a surround sound device. In one such example, if a Dolby 5.1 system is receiving a stereo audio data stream, a static upmix matrix may be applied to the stereo audio data to provide correctly rendered audio for each loudspeaker of the Dolby 5.1 system. According to some examples, U may be based at least in part on the coefficients of the upmix or downmix matrix used to assign audio to each loudspeaker of the audio environment (e.g., to each of the audio devices 110A-110D).
In some examples, U may be based at least in part on a standard loudspeaker layout (e.g., Dolby 5.1, Dolby 7.1, etc.) used in the audio environment. Some such examples may involve exploiting the ways in which media content is traditionally mixed and rendered for such standard loudspeaker layouts. For example, in Dolby 5.1 or Dolby 7.1 systems, artists typically place vocals in the center channel rather than in the surround channels. As described above, audio corresponding to musical instruments and other sound sources is typically reproduced on channels on the left and right sides of a room. In some instances, vocals, dialogue, instruments, etc. may be identified via metadata received with the corresponding audio data.
P or "persistence" aspects
The persistence indicator aims to capture the fact that different types of playback media may have a wide range of temporal persistence, with different types of content having different degrees of silence and loudspeaker activation. A spectrally dense, continuous content stream, such as the audio output of a music or video game console, may have a high level of time persistence, while a podcast may have a lower level of time persistence. The time persistence of infrequent system notifications will be very low. Depending on the specific listening task at hand, echo references corresponding to media with a lower degree of persistence may be less important to the EMS. For example, occasional system notifications are less likely to collide with wake words or episodic requests, and thus the relative importance of managing the corresponding echo is lower.
The following are examples of metrics that may be used to measure or estimate persistence:
the percentage of time in the recent history window that the playback signal is above a specific digital loudness threshold;
metadata tags or media classification indications indicating that the content corresponds to music, broadcast content, podcasts or system sounds; and/or
the percentage of time during the recent history window that the playback signal is in the typical frequency range of human voice (e.g., 100 Hz to 3 kHz).
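The first and third metrics listed above might be computed, for example, as in the following sketch, assuming frame-level levels and banded powers are already available; the activity threshold, band layout and dominant-band heuristic are assumptions made for illustration.

```python
import numpy as np

def fraction_active(frame_levels_dbfs, threshold_dbfs=-50.0):
    """Fraction of recent frames whose level exceeds a digital loudness threshold."""
    return float(np.mean(np.asarray(frame_levels_dbfs) > threshold_dbfs))

def fraction_in_speech_range(band_powers, band_centres_hz):
    """band_powers: frames x bands matrix of banded powers; band_centres_hz: band centre frequencies.
    Returns the fraction of frames whose dominant band lies in roughly 100 Hz - 3 kHz."""
    dominant = np.asarray(band_centres_hz)[np.argmax(band_powers, axis=1)]
    return float(np.mean((dominant >= 100.0) & (dominant <= 3000.0)))
```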
According to some examples, the audio content type may affect the estimation of L, U and/or P. For example, knowing that the audio content is stereo music would allow all echo references to be ranked using only the channel assignments described above. Alternatively, if the control system does not analyze the audio content but relies on channel assignments, knowing that the audio content is Dolby Atmos content may alter the default L, U and/or P assumptions.
A or "audibility" aspect
The audibility indicator addresses the fact that audio devices have different playback characteristics, and that the distances between audio devices may differ in any given audio environment. The following are examples of metrics that may be used to measure or estimate the audibility of an audio device:
direct measurement of audibility of an audio device;
reference to a data structure including characteristics of one or more loudspeakers of the audio device, such as rated SPL, frequency response, and directivity (e.g., whether the loudspeaker is omnidirectional, forward-firing, upward-firing, etc.);
based on an estimate of distance from the audio device; and/or
Any combination of the above.
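A very rough distance-based audibility estimate might look like the following sketch. The 1 m reference distance and the free-field 6 dB-per-doubling-of-distance assumption are simplifications for illustration, not the patent's model.

```python
import math

def estimated_audibility_db(rated_spl_at_1m_db, distance_m):
    """Rough received level at the listening device, assuming spherical spreading."""
    distance_m = max(distance_m, 0.1)  # avoid log of ~0 for nearly co-located devices
    return rated_spl_at_1m_db - 20.0 * math.log10(distance_m / 1.0)
```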
Other factors may be evaluated for estimating importance and, in some instances, for determining an importance index.
Listening objective
The listening objective may define the context and desired performance characteristics of the EMS. In some examples, the listening objective may modify parameters and/or fields of the LUPA evaluation. The following discussion considers three potential contexts in which the listening objective changes. In these different contexts, we will see how probability and criticality can affect LUPA.
1. Episodic (e.g., waiting to detect a wake word)
There is no immediate urgency while waiting for a conversation to begin: the probability of the user speaking a wake word is generally considered to be the same in all future time intervals. Furthermore, the wake-word detector may be the most robust element of a voice assistant, and the effect of echo leakage is less critical.
2. Command
The likelihood of a person speaking a command immediately after speaking a wake word is very high. Therefore, the probability of a collision with echo in the near future is high. Furthermore, because the command recognition module may be relatively less robust than the wake-word detector, the criticality of echo leakage may often be high.
3. Communication (e.g., a voice call)
During a voice call, it is essentially certain that the participants (the person or persons in the audio environment and the person or persons at the far end) will talk to each other. In other words, the probability of an echo colliding with the user's voice is essentially 1. However, since the person or persons at the far end are human and can cope well with background noise, the criticality is small, because they are unlikely to be significantly affected by echo leakage.
In these different listening objective contexts, the manner in which LUPA is evaluated may vary in some examples.
1. Episodic
There may be no temporal distinction, because the probability of uttering a wake word is considered to be the same for all future time intervals. Thus, the time frame over which the control system evaluates LUPA may be quite long, in order to obtain better estimates of these parameters. In some such examples, the time interval over which the control system evaluates LUPA may be set to look relatively far into the future (e.g., over a time frame of a few minutes).
2. Command
A command is very likely to be spoken in the time interval immediately after the wake word is spoken. Thus, after detecting the wake word, in some embodiments LUPA may be evaluated on a much shorter time scale (e.g., on the order of a few seconds) than in the episodic context. In some examples, during this time interval, a reference that is sparse in time but has content playing within the next few seconds after wake-word detection will be considered more important because of the high likelihood of collisions.
Fig. 5A is a flow chart summarizing one example of a disclosed method. As with other methods described herein, the blocks of method 500 need not be performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described. For example, some implementations may not include block 501.
In this example, method 500 is an echo reference selection method. The blocks of method 500 may be performed, for example, by a control system (such as control system 60a of fig. 2A or 3A). In some examples, blocks of method 500 may be performed by an echo reference selector module (such as echo reference selector 402A described above with reference to fig. 4).
The reference selection method of fig. 5A is an example of what may be referred to herein as a "greedy" echo reference selection method, which involves evaluating cost and expected performance improvement only at the current operating point of the MC-EMS (in other words, at the number of echo references, including the already-selected echo references, that the MC-EMS is currently using), and evaluating the result of adding each additional echo reference, e.g., in descending order of importance. Accordingly, this example involves a process of determining whether to add a new echo reference. In some implementations, the echo references evaluated in method 500 may already have been ranked according to their estimated importance (e.g., by the echo reference importance estimator 401A). If more complex techniques are employed (such as tree-search methods), solutions that are more nearly optimal in terms of cost and performance may be found. Alternative examples may involve other search and/or optimization routines, including brute-force methods. Some alternative implementations may involve determining whether to discard a previously selected echo reference.
In this example, block 501 involves determining whether a current performance level of the EMS is greater than or equal to a desired performance level. If so, the process terminates (block 510). However, if it is determined that the current performance level is below the desired performance level, in this example the process continues to block 502. According to this example, the determination of block 501 is based at least in part on one or more indicators of the current performance of the EMS, such as adaptive filter coefficient data or other AEC statistics, speech-to-echo ratio (SER) data, and the like. In some examples in which the determination of block 501 is made by the echo reference orchestrator 302A, this determination may be based at least in part on one or more indicators 350A from the MC-EMS 203A. As noted above, some embodiments may not include block 501.
According to this example, block 502 involves ranking the remaining unselected echo references by importance and estimating the potential EMS performance boost obtained by including the most important echo references that the EMS has not used. In some examples where the process of block 502 is performed by the echo reference orchestrator 302A, the process may be based at least in part on information 423A generated by the MC-EMS performance model 405A, which in some examples may be or include data as shown in fig. 3B or fig. 3C. In some implementations, the ranking and prediction process described above may be performed at an earlier stage of the method 500, for example, when evaluating a previous echo reference. In some examples, the ranking and prediction process described above may be performed prior to performing method 500. In some embodiments where the ranking and prediction process described above has been previously performed, block 502 may simply involve selecting the highest ranked unselected echo reference determined by such previous process.
In this example, block 503 involves comparing the performance and cost of adding the echo reference selected in block 502. In some examples where the process of block 503 is performed by echo reference orchestrator 302A, block 503 may be based at least in part on information 422A from cost estimation module 403A indicating the cost(s) to include the echo reference in echo reference set 313A.
Because performance and cost may be variables with different ranges and/or domains, comparing these variables directly may be challenging. Thus, in some embodiments, the evaluation of block 503 may be facilitated by mapping the performance and cost variables to similar scales (such as a range between predefined minimum and maximum values).
In some embodiments, the estimated cost of adding an echo reference may simply be set to zero if adding the echo reference does not cause a predetermined network bandwidth and/or computational cost budget to be exceeded. In some such examples, the estimated cost of adding the echo reference may be set to infinity if adding the echo reference would cause the predetermined network bandwidth and/or computational cost budget to be exceeded. This approach has the benefit of simplicity and efficiency. In this way, the control system can simply add the maximum number of echo references allowed by the predetermined network bandwidth and/or computational cost budget.
According to some examples, if the estimated performance improvement corresponding to adding the echo reference is not above a predetermined threshold (e.g., 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, etc.), the estimated performance improvement may be set to zero. Such an approach may prevent network bandwidth and/or computational overhead from being consumed by including echo references that provide only a negligible performance boost. Some detailed alternative examples of cost determination are described below.
In this example, block 504 involves determining whether a new echo reference is to be added given the performance/cost evaluation of block 503. In some examples, blocks 503 and 504 may be combined into a single block. According to this example, block 504 involves determining whether the cost of adding the evaluated echo reference will be less than the EMS performance boost estimated to be caused by adding the echo reference. In this example, if the estimated cost is not less than the estimated performance boost, then the process continues to block 511 and the method 500 terminates. However, in this embodiment, if the estimated cost is less than the estimated performance improvement, then the process continues to block 505.
According to this example, block 505 involves adding the new echo reference to the selected echo reference set. In some examples, block 505 may include notifying renderer 201 to output the relevant echo reference. According to some examples, block 505 may involve sending an echo reference over a local network, or sending a request 311A to another device to send the echo reference over the local network.
The echo references evaluated in method 500 may be local echo references or non-local echo references, which may be determined locally (e.g., by a local renderer as described above) or received over a local network. Thus, the cost estimation of some echo references may involve evaluating both computational and network costs.
According to some examples, to evaluate the next echo reference after block 505, the control system may simply reset the selected and unselected echo references and revert to the previous blocks of fig. 5A, such as block 501, block 502, or block 503. However, more complex methods may also involve evaluating the already selected references, e.g. ranking all the references already selected, and deciding whether to discard the echo reference with the lowest estimated importance.
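A minimal sketch of the greedy loop of fig. 5A is shown below. The EchoReference fields and the per-reference boost and cost values stand in for the outputs of the importance estimator (401A), the MC-EMS performance model (405A) and the cost estimation module (403A); they are assumptions made for illustration rather than the patent's APIs, and the fidelity-variant loop of fig. 5B (method 550) is omitted.

```python
from dataclasses import dataclass

@dataclass
class EchoReference:
    name: str
    importance: float        # importance index (cf. 420A)
    estimated_boost: float   # predicted EMS performance improvement if this reference is added
    estimated_cost: float    # combined network/computational cost, mapped to the same scale

def greedy_select(candidates, current_performance, desired_performance):
    """Greedily add echo references, most important first, while cost < estimated boost."""
    selected = []
    remaining = sorted(candidates, key=lambda ref: ref.importance, reverse=True)
    while remaining and current_performance < desired_performance:   # block 501
        best = remaining.pop(0)                                       # block 502
        if best.estimated_cost >= best.estimated_boost:               # blocks 503/504
            break
        selected.append(best)                                         # block 505
        current_performance += best.estimated_boost
    return selected
```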
Alternative echo reference forms
The echo references may be transmitted in a number of forms or variants (or used locally within a device, such as the device that generated the entire echo reference), which may alter the cost/benefit ratio of a particular echo reference. For example, transforming the echo reference into a segmented power form (in other words, determining the power in each of a plurality of frequency bands and transmitting segmented power information about the power in each band) can reduce the cost of transmitting the echo reference over the local network. However, the potential improvement that the EMS can achieve using such a low-fidelity variant of an echo reference is typically also lower. Making any particular variant of an echo reference available may be interpreted as making it a potential selection candidate.
In some embodiments, the echo references may be one of the following forms listed below (with the first four being arranged in descending order of estimated performance):
full fidelity (original, exact) echo reference, which will result in full computational cost and network cost (if transmitted over the network)
Downsampling an echo reference, whose computational cost and network cost will be scaled down according to the downsampling factor, but which will result in the computational cost of the downsampling process;
the network cost of the encoded echo reference generated via the lossy encoding process can be reduced according to the compression ratio of the encoding scheme, but the encoding and decoding computation costs are incurred;
segmented power information corresponding to the echo reference, whose network cost can be significantly reduced because the number of frequency bands can be much lower than the number of subbands of the full fidelity echo reference, and whose computational cost can be significantly reduced because the cost of implementing segmented AES is much lower than the cost of implementing subband AEC; or alternatively
Reduce fidelity in exchange for any other form of cost reduction, whether computational, network, or other costs, such as memory.
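The segmented-power variant listed above might be produced as in the following sketch, which computes a per-frame, per-band power summary of an echo reference. The frame length, the rectangular framing and the four-band split are illustrative assumptions.

```python
import numpy as np

def banded_power(reference, frame_len=512, num_bands=4):
    """Return a (frames x num_bands) matrix of band powers for one echo reference."""
    frames = len(reference) // frame_len
    spectra = np.abs(np.fft.rfft(
        np.asarray(reference[:frames * frame_len], dtype=float).reshape(frames, frame_len),
        axis=1)) ** 2
    band_edges = np.linspace(0, spectra.shape[1], num_bands + 1, dtype=int)
    return np.stack([spectra[:, band_edges[b]:band_edges[b + 1]].sum(axis=1)
                     for b in range(num_bands)], axis=1)
```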
Fig. 5B is a flow chart summarizing another example of the disclosed methods. As with other methods described herein, the blocks of method 550 need not be performed in the order indicated. In some examples, one or more blocks may be performed concurrently. Moreover, such methods may include more or fewer blocks than shown and/or described.
The blocks of method 550 may be performed, for example, by a control system (such as control system 60a of fig. 2A or 3A). In some examples, blocks of method 550 may be performed by an echo reference selector module (such as echo reference selector 402A described above with reference to fig. 4).
Method 550 takes into account the fact that an echo reference need not be transmitted or used in full-fidelity form, but may instead be transmitted or used in one of the alternative partial-fidelity forms described above. Thus, in method 550, the evaluation of performance and cost does not involve a binary decision as to whether or not to use the full-fidelity form of the echo reference. Instead, method 550 involves determining whether to include one or more lower-fidelity versions of the echo reference, which may yield a smaller EMS performance improvement, but at a lower cost. Methods such as method 550 provide additional flexibility regarding the set of potential echo references to be used by the echo management system.
In this example, method 550 is an extension of echo reference selection method 500 described above with reference to fig. 5A. Accordingly, blocks 501 (if included), 502, 503, 504, and 505 may be performed as described above with reference to fig. 5A, unless otherwise indicated below. Method 550 adds a potential iteration loop including blocks 506 and 507 to method 500. According to this example, if it is determined (here, in block 504) that the estimated cost of adding one version of the echo reference will not be less than the estimated EMS performance boost, then a determination is made in block 506 as to whether another version of the echo reference is present. In some examples, the full fidelity version of the echo reference may be evaluated before the lower fidelity version (if any is available). According to this embodiment, if it is determined in block 506 that another version of the echo reference is available, then in block 507 another version of the echo reference (e.g., the highest fidelity version that is not the full fidelity version) will be selected and evaluated in block 503.
Thus, method 550 involves evaluating a lower fidelity version of the echo reference, if any is available. Such lower fidelity versions may include downsampled versions of the echo reference, encoded versions of the echo reference generated via a lossy encoding process, and/or segment power information corresponding to the echo reference.
Cost model
The "cost" of an echo reference refers to the resources required for echo management using the reference, whether AEC or AES is used. Some disclosed embodiments may involve estimating one or more of the following types of costs:
computational cost, which may be determined with reference to the use of a limited amount of processing power available on one or more devices in an audio environment. The computational cost may refer to one or more of the following:
the cost required to perform echo management on a particular listening device using this reference. This may mean that the reference is used in an AEC or an AES. One will note that an AEC runs on bins or subbands (which are complex-valued) and requires many more CPU operations than an AES, which runs on bands (the number of bands used by an AES is smaller, and the band powers are real-valued rather than complex);
the cost required to encode or decode an echo reference when a coded version of the reference is used;
the cost required to segment the signal (in other words, transform the signal from a simple linear frequency domain representation to a segmented frequency domain representation); and/or
The cost required to generate the echo reference (e.g., by the renderer).
Network cost refers to the use of a limited amount of network resources, such as bandwidth available in a local network (e.g., a local wireless network in an audio environment) for sharing echo references between devices.
The total cost of a particular set of echo references may be determined as the sum of the costs of each echo reference in the set. Some disclosed examples relate to combining network costs and computational costs. According to some examples, the total cost C_total can be determined as follows:

$$C_{total} = \max\left(\frac{1}{R_{comp}}\sum_{m=1}^{M} C^{comp}_{m},\ \frac{1}{R_{network}}\sum_{m=1}^{M} C^{network}_{m}\right)$$

In the above equation, R_comp represents the total amount of computational resources available for echo management, R_network represents the total amount of network resources available for echo management, C^comp_m represents the computational cost associated with using the m-th reference, and C^network_m represents the network cost associated with using the m-th reference (where a total of M references are used in the EMS). One may notice that this definition means that

$$0 \le C_{total} \le 1,$$

and that C_total only includes the cost component that is closest to becoming limited by the available resources of the system.
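Following the normalized form sketched above, the combined cost of a selected reference set might be computed as follows; the function and argument names are assumptions made for illustration.

```python
def total_cost(comp_costs, network_costs, r_comp, r_network):
    """comp_costs / network_costs: per-reference costs; r_comp / r_network: resource budgets.
    The most constrained resource dominates; the result stays within [0, 1] while both budgets hold."""
    comp_usage = sum(comp_costs) / r_comp
    network_usage = sum(network_costs) / r_network
    return max(comp_usage, network_usage)
```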
Performance of
The "performance" of an Echo Management System (EMS) may refer to the following:
the amount of echo removed from the microphone feed, which can be measured as echo return loss enhancement (ERLE), which is measured in decibels and is the ratio of the power of the (microphone) send signal to the power of the residual signal. The metric may be normalized, for example, according to an application-based metric such as the minimum ERLE required for an Automatic Speech Recognition (ASR) processor to perform a wake word detection task, i.e., detecting a particular keyword spoken in the presence of echo;
robustness of EMS when disturbed by room noise sources, non-linearities of the local audio system, double talk etc.;
robustness of EMS when using echo references below full fidelity;
the ability of the EMS to track system changes, including the ability of the EMS to initially converge; and/or
The EMS's ability to track changes in rendered audio scenes. For example, this may refer to a shift of the echo reference covariance matrix and robustness of the EMS to non-stationary non-uniqueness issues.
Some examples may involve determining a single performance index P. Some such examples use ERLE and a robustness estimate derived from adaptive filter coefficient data or other AEC statistics obtained from the EMS. According to some such examples, the performance robustness index P_Rob may be determined using a "microphone probability" extracted from the AEC, for example as follows:

$$P_{Rob} = 1 - M_{prob}$$

In the above equation, 0 ≤ P_Rob ≤ 1, 0 ≤ M_prob ≤ 1, and M_prob represents the microphone probability, which is the proportion of sub-band adaptive filters in the AEC that produce poor echo predictions, providing little (or no) echo cancellation in their respective sub-bands.
The performance of a wake word (WW) detector depends largely on the speech-to-echo ratio (SER), which can be scaled up by the ERLE of the EMS. When the SER is too low, the WW detector is more likely both to falsely trigger (false alarms) and to miss keywords uttered by the user (missed detections), because echo can corrupt the microphone signal and reduce the accuracy of the system. The SER of the residual signal (e.g., residual signal 224A of fig. 2A) consumed by the ASR processor (e.g., speech processing block 240A of fig. 2A) is increased by the EMS in proportion to the ERLE of the EMS, thereby improving the performance of the WW detector.
Thus, some disclosed examples involve mapping a desired WW performance level to a nominal SER level which, in combination with knowledge of the typical playback levels of the devices in the system, allows the control system to map that desired WW performance level directly to a nominal ERLE. In some examples, the method may be extended to map the WW performance of the system to ERLE at various SER levels. In some such embodiments, input data having a range of SER values may be used to generate Receiver Operating Characteristic (ROC) curves for a particular WW detector. Some examples involve selecting a particular false alarm rate (FAR) of interest and, at that particular FAR, taking the accuracy of the WW detector as a function of SER as the application basis. In some such examples,
$$\mathrm{Acc}(SER_{res}) = \mathrm{ROC}(SER_{res},\, FAR_{I})$$

In the above equation, Acc(SER_res) represents the accuracy of the WW detector as a function of SER_res, the SER of the residual signal output by the EMS. ROC() represents a set of ROC curves for multiple SERs, and FAR_I represents the false alarm rate of interest, which may be, for example, 3 per 24 hours or 1 per 10 hours. The accuracy Acc(SER_res) may be expressed as a percentage or normalized such that it is in the range of 0 to 1, which may be expressed as follows:

$$0 \le \mathrm{Acc}(SER_{res}) \le 1$$
With knowledge of the playback capabilities of the audio devices in the audio environment, a typical SER value in a microphone signal (e.g., microphone signal 223A of fig. 2A) may be determined using, for example, the LUPA components of the actual echo level in combination with a typical speech level in the target audio environment, e.g., as follows:

$$SER_{mic} = \frac{Speech\_pwr}{Echo\_pwr}$$

In the above equation, Speech_pwr and Echo_pwr represent the expected baseline speech power level and echo power level, respectively, of the target audio environment. SER_mic can be improved by the EMS to SER_res in proportion to the ERLE, for example as follows:

$$SER^{dB}_{res} = SER^{dB}_{mic} + ERLE^{dB}$$

In the above equation, the superscript dB indicates that the variable is expressed in decibels in this example. For completeness, some embodiments may define the ERLE of the EMS as the ratio, expressed in decibels, of the echo power in the microphone signal to the echo power remaining in the residual signal output by the EMS.
using the foregoing equations, some embodiments may define EMS performance metrics based on WW applications as follows:
Wherein,,representing the SER in the target environment. In some examples, a->May be a static default number that is set,while in other examples +_s>May be estimated as a function of, for example, one or more LUPA components. Some embodiments may involve defining the net performance index P as a vector containing each element, for example as follows:
$$P = [P_{WW},\ P_{Rob}]$$
In some examples, one or more additional performance components may be added by increasing the size of the net performance vector. In some alternative examples, the individual performance components may be combined into a single scalar indicator by weighting them, for example as follows:
$$P = (1-K)\,P_{WW} + K\,P_{Rob}$$
in the above equation, K represents a weighting factor selected by the system designer, which is used to determine the degree of contribution of each component to net performance. Some alternative examples may use another approach, such as simply averaging the individual performance metrics. However, it may be advantageous to combine the individual performance metrics into a single scalar metric.
Cost and performance trade-off
When comparing the estimated cost of an echo reference with the estimated EMS performance enhancement, a method is needed to compare these two quantities, which are not typically in the same domain. One such method involves evaluating the cost estimate and the performance estimate separately and adopting the lowest-cost solution that meets a predefined minimum performance criterion P_min. The predefined EMS performance criterion may be determined, for example, based on the requirements of a particular downstream application (e.g., providing a phone call, music playback, waiting for a WW, etc.).
For example, in embodiments where the application is WW detection, the performance may be related to the WW performance index P_WW. In some such examples, there may be some minimum level of WW detector accuracy deemed sufficient (e.g., an 80% level of WW detector accuracy, an 85% level, a 90% level, a 95% level, etc.), which will have a corresponding ERLE^dB according to the previous section. In some such examples, an EMS performance model (e.g., the MC-EMS performance model 405A of fig. 4) may be used to estimate the ERLE of the EMS. Thus, if the goal is simply to find the least costly solution (e.g., in terms of the total cost C_total), such an embodiment does not require a direct cost-versus-performance tradeoff.
As an alternative to meeting some minimum performance metric, some embodiments may involve using both a performance index P and a cost index C. Some such examples may involve using a trade-off parameter λ (e.g., a Lagrangian multiplier) and expressing the cost/performance evaluation process as an optimization problem that seeks to maximize some quantity, such as the variable F in the following expression:

$$F = P - \lambda\, C_{total}$$

It can be observed that in the above equation, a relatively large value of F corresponds to a relatively large difference between the performance index P and the product of λ and the total cost C_total. The trade-off parameter λ may be selected (e.g., by a system designer) to directly trade off cost against performance. An optimization algorithm may then be used to find a solution for the echo reference set used by the EMS, where the set of candidate echo references (which may include all available echo reference fidelity levels) determines the search space.
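A brute-force version of this optimization, enumerating every candidate subset (including fidelity variants) and keeping the one that maximizes F, could be sketched as follows; the callables standing in for the EMS performance model and the cost model are assumptions made for illustration, and the exhaustive search is exponential in the number of candidates, so practical implementations would use a cheaper search.

```python
from itertools import chain, combinations

def best_reference_set(candidates, performance_of, cost_of, lam=1.0):
    """candidates: (reference, fidelity) options; performance_of / cost_of map a
    selected subset to its estimated performance P and total cost C_total."""
    def all_subsets(items):
        return chain.from_iterable(combinations(items, n) for n in range(len(items) + 1))
    best, best_f = (), float("-inf")
    for subset in all_subsets(candidates):
        f = performance_of(subset) - lam * cost_of(subset)   # F = P - lambda * C_total
        if f > best_f:
            best, best_f = subset, f
    return best, best_f
```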
FIG. 6 is a flow chart summarizing one example of a disclosed method. As with other methods described herein, the blocks of method 600 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed simultaneously. In this example, method 600 is an audio processing method.
The method 600 may be performed by an apparatus or system of the apparatus 50 as shown in fig. 1A and described above. In some examples, blocks of method 600 may be performed by one or more devices within an audio environment, e.g., by an audio system controller (e.g., a device referred to herein as a smart home hub) or by another component of an audio system, such as a smart speaker, a television control module, a laptop computer, a mobile device (e.g., a cellular telephone), etc. In some implementations, the audio environment can include one or more rooms of a home environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. However, in alternative embodiments, at least some of the blocks of method 600 may be performed by a device (e.g., a server) implementing a cloud-based service.
In this embodiment, block 605 involves obtaining a plurality of echo references by a control system. In this example, the plurality of echo references includes at least one echo reference for each of a plurality of audio devices in the audio environment. Here, each echo reference corresponds to audio data played back by one or more loudspeakers of one of the plurality of audio devices.
In this example, block 610 involves performing, by the control system, an importance estimation for each of a plurality of echo references. According to this example, making the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment. In this example, the at least one echo management system includes an Acoustic Echo Canceller (AEC) and/or an Acoustic Echo Suppressor (AES).
In this embodiment, block 615 involves selecting, by the control system and based at least in part on the importance estimate, one or more selected echo references. In this example, block 620 involves providing, by the control system, one or more selected echo references to at least one echo management system. In some implementations, the method 600 may involve causing at least one echo management system to cancel or suppress echo based at least in part on one or more selected echo references.
In some examples, obtaining the plurality of echo references may involve receiving a content stream including audio data and determining one or more of the plurality of echo references based on the audio data. Some examples are described above with reference to renderer 201A of fig. 2A.
In some implementations, the control system may include an audio device control system of an audio device of a plurality of audio devices in an audio environment. In some such examples, the method may involve rendering, by the audio device control system, audio data for reproduction on the audio device, thereby producing a local speaker feed signal. In some such examples, the method may involve determining a local echo reference corresponding to a local speaker feed signal.
In some examples, obtaining the plurality of echo references may involve determining one or more non-local echo references based on the audio data. For example, each non-local echo reference may correspond to a non-local speaker feed for playback on another audio device of the audio environment.
According to some examples, obtaining the plurality of echo references may involve receiving one or more non-local echo references. For example, each non-local echo reference may correspond to a non-local speaker feed for playback on another audio device of the audio environment. In some examples, receiving one or more non-local echo references may involve receiving the one or more non-local echo references from one or more other audio devices of the audio environment. In some examples, receiving one or more non-local echo references may involve receiving each of the one or more non-local echo references from a single other device of the audio environment.
In some examples, the method may involve cost determination. According to some such examples, the cost determination may involve determining a cost of at least one of the plurality of echo references. In some such examples, selecting one or more selected echo references may be based at least in part on the cost determination. According to some such examples, the cost determination may be based at least in part on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, echo management system calculation requirements for use of the at least one echo reference by the echo management system, or one or more combinations thereof. In some examples, the cost determination may be based at least in part on a full-fidelity replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, segmented power information of the at least one echo reference, or one or more combinations thereof. In some examples, the cost determination may be based at least in part on a method that compresses relatively more important echo references less than relatively less important echo references.
According to some examples, the method may involve determining a current echo management system performance level. In some such examples, selecting one or more selected echo references may be based at least in part on a current echo management system performance level.
In some examples, making the importance estimate may involve determining an importance index for the corresponding echo reference. In some examples, determining the importance index may involve determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a time duration of the corresponding echo reference, determining an audibility of the corresponding echo reference, or one or more combinations thereof. According to some examples, determining the importance index may be based at least in part on metadata corresponding to the audio device layout, loudspeaker metadata, metadata corresponding to the received audio data, an upmix matrix, a loudspeaker activation matrix, or one or more combinations thereof. In some examples, determining the importance index may be based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or one or more combinations thereof.
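By way of illustration only, and not as part of the disclosed implementations, the following Python sketch shows one plausible way in which level, uniqueness, persistence, and audibility estimates might be combined into a single importance index per echo reference. The function name, the normalization, and the weights are assumptions introduced for this example; in practice the weighting could depend on the current listening objective, ambient noise estimate, and echo management system performance.

    import numpy as np

    def importance_index(level, uniqueness, persistence, audibility,
                         weights=(0.25, 0.25, 0.25, 0.25)):
        """Hypothetical combination of LUPA-style characteristics into one score.

        All inputs are assumed to be normalized to [0, 1]; the weights are
        illustrative only."""
        w = np.asarray(weights, dtype=float)
        lupa = np.asarray([level, uniqueness, persistence, audibility], dtype=float)
        return float(w @ lupa)

    # Example: a loud, unique, persistent, clearly audible reference scores highly.
    print(importance_index(0.9, 0.8, 0.7, 0.95))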
Some disclosed embodiments address challenges that arise when other ("non-native") devices are required to provide playback references for each native Echo Management System (EMS). The bandwidth required to transmit the echo references to all participating audio devices in the audio environment can be significant. Such bandwidth requirements may be excessive if the number of audio devices is large and if the transmitted echo references are full fidelity replicas of the speaker feed signals provided to the loudspeakers. The computational resources required to implement such methods and systems, including but not limited to those used to implement post-processing of non-native devices, can also be substantial.
However, for some implementations, it may not be necessary, or even desirable, to transmit all playback streams to all participating audio devices in the audio environment. This is so in part because the amount of echo in each audio device depends largely on the content, the listening objective(s), and the audio device configuration.
It is noted above that one way to evaluate the importance of each "non-local" reference is via an importance index 420, which may be calculated using the rendered audio streams used as echo references in the EMS. It is also noted above that in some disclosed examples, the importance index may be based at least in part on metadata (e.g., one or more components of the metadata 312 described above), such as metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data (e.g., a spatial index), an upmix matrix, a loudspeaker activation matrix (which may also be referred to herein as a "rendering matrix"), or one or more combinations thereof. Furthermore, it is noted above that in some examples, the "uniqueness" aspect U (one of the "LUPA" echo reference characteristics from which the importance index may be determined) may be based at least in part on data used for matrix decoding, such as a static upmix matrix.
Fig. 7 and the subsequent figures and corresponding description detail such an alternative method of calculating an importance index based on metadata (including, but not limited to, rendering information). Some disclosed examples disclose echo references generated using such metadata. Such an embodiment may significantly reduce the computational and bandwidth requirements of EMS management, at least in part because many relevant metrics may be pre-computed and encoded in an efficient manner.
Fig. 7 and 8 illustrate block diagrams including components of an echo reference orchestrator according to some alternative examples. As with the other figures provided herein, the types, numbers, and arrangements of elements shown in fig. 7 and 8 are provided by way of example only. Other embodiments may include more, fewer, and/or different types, numbers, and/or arrangements of elements.
Many of the elements shown in fig. 7 and 8 have been disclosed elsewhere herein. The reader may assume that such elements are shown and described elsewhere herein unless they are described differently. However, some of the above elements are optional in the embodiments of fig. 7 and 8. For example, all individual elements corresponding to arrows rendered as dashed lines are optional, including the following elements previously described:
Echo references 314 received from one or more other devices in the audio environment;
information 422 indicating the cost(s) to include the echo references in the echo reference set 313; and
a request 311 for another device in the audio environment to send one or more echo references to it over the network.
Furthermore, in the embodiments of fig. 7 and 8, the cost estimation module 403 itself is optional.
The new or newly defined elements in fig. 7 and 8 are as follows:
701: audio data, which may include audio signals corresponding to audio objects, such as Pulse Code Modulation (PCM) data, audio bed signals corresponding to loudspeaker positions, etc.;
702: audio object metadata, which may include audio object space metadata, audio object size metadata, and the like. In some examples, the audio object metadata 702 may be received as a component of the metadata 312 disclosed elsewhere herein;
703: audio scene change metadata (such as spatial renderer scene change metadata) indicating an impending change in the audio scene, such as a change that will occur within a determined time interval (e.g., within the next second, within the next 100 milliseconds, etc.), which may be used by the audio scene change analyzer 755 to estimate the sound field change. In some examples, the audio scene change metadata 703 may include aggregated audio object statistics calculated from the audio object metadata 702. According to some examples, the scene change metadata may include one or more indications (e.g., in a specified portion of the audio data structure, such as a header portion) that are selectable (e.g., by the content creator) for indicating changes in the audio scene. In some examples, the audio scene change metadata 703 may be received as a component of the metadata 312 disclosed elsewhere herein;
705: an index calculation module based on metadata, in this example, is configured to calculate echo reference characteristics 733 based on metadata 312 and audio scene change message 715, from which an importance index 420 may be determined. In some such examples, the metadata 312 may include audio object metadata 702 and/or rendering information, such as information about the rendering matrix 722 (or the rendering matrix 722 itself). In some examples, the echo reference characteristics 733 may include an approximation of L, U, P and/or a as described above;
710: an echo reference generator, in this example, configured to generate one or more local audio device echo references 220 and non-local audio device echo references 721 for the MC-EMS203 based on the rendered audio stream 720, the metadata 312, the selected echo references 313, the EMS statistics 350, and/or the information 423 generated by the MC-EMS performance model 405. In some implementations, the echo reference generator 710 may be configured to generate a virtual echo reference 742. In some alternative implementations, the renderer 201 may be configured to generate the virtual echo reference 742. In the examples shown in fig. 7 and 8, echo reference generator 710 is configured to generate subspace-based non-local device echo references 723 and/or low frequency device echo references 723LF. As noted elsewhere herein, in some examples, the low frequency non-native device echo reference 723LF may be considered a subset of the subspace-based non-native device echo reference 723. In some examples, the echo reference generator 710 may be configured to customize the echo reference for each audio device. In some examples, the echo reference generator 710 may also use EMS look-ahead statistics 732 and/or audio scene change messages 715 as inputs. Although fig. 7 and 8 indicate that MC-EMS203 receives all of the outputs of echo reference generator 710, in some embodiments MC-EMS203 receives only selected echo references 313, as shown in fig. 3A. According to some such embodiments, all echo references generated by echo reference generator 710 may be provided to blocks 401 and 402 (and in some examples to block 705), and only selected echo reference 313 may be provided to MC-EMS203;
715: one or more scene change messages from scene change analyzer 755;
720: a rendered audio stream of an audio device of an audio environment;
721: locally generated copies of the references being played by non-native devices. This element will appear in many embodiments, but in some embodiments one or more virtual echo references may replace the element;
722: rendering information, which in these examples includes one or more rendering matrices;
723: one or more subspace-based non-native device references;
723LF: one or more low frequency device references;
732: EMS look-ahead statistics generated by metadata-based metrics calculation module 705 based at least in part on scene change message(s) 715 and provided to MC-EMS performance model 405;
733: echo reference characteristics output by the metadata-based index calculation module 705;
742: virtual echo reference; and
755: scene change analyzer.
In the example shown in fig. 7, the illustrated blocks are implemented by one or more examples of the control system 60 disclosed herein (see, e.g., fig. 1A and associated description). In some disclosed examples, all of the blocks of the control system 60 shown in fig. 7 may be implemented via a single device (e.g., an audio device). In some such "distributed model" implementations, each of the plurality of audio devices in the audio environment may implement all of the blocks of the control system 60 shown in fig. 7, as well as other features (such as loudspeaker systems, microphone systems, noise estimators, and/or speech processing blocks).
If the non-native audio device post-processing signal chain parameters are available to the native audio device (e.g., via a setup calibration step), the non-native device references may be generated locally and added to the rendered audio stream 720. In this case, the non-native device references may take the form of one or more virtual audio device references, or of a set of device-specific non-native device streams selected for the device by the reference selection block. In the event that device computing power becomes a bottleneck, each audio device may render only its local references and use the local network to exchange echo references (e.g., as described above with reference to fig. 1-6). In such an instance, the optional elements 311 and 314 of fig. 7 (the request 311 to receive echo references and the echo references 314 received from one or more other audio devices) may be included in the signals transmitted and received by each audio device. However, such elements are only necessary when the computing power of the audio device is severely limited or the audio device does not have the non-native device signal chain parameters available for locally generating non-native echo references.
In other implementations, the blocks shown in fig. 7 may be implemented via two or more devices. In some such implementations, the echo reference orchestrator 302 of fig. 7 may be implemented (at least in part) by an orchestration device (such as the smart home hub 105 disclosed herein) or via an audio device 110 configured to act as an orchestration device. According to some implementations, the echo reference orchestrator 302 of fig. 7 may be implemented (at least in part) via a cloud-based service (e.g., via one or more servers).
In the example shown in fig. 8, the renderer 201 and a portion of the echo reference orchestrator 302 are implemented by the hub device 805, while the MC-EMS 203 and the remaining portions of the echo reference orchestrator 302 are implemented by the audio devices 110 (only one of which is illustrated in fig. 8). This type of implementation may be referred to herein as a "hub and spoke" model. In some examples, the "hub" may be a smart Television (TV), and the audio devices 110 may be a set of wireless speakers configured to communicate with the smart TV. In other examples, hub device 805 may be the smart home hub 105 as disclosed elsewhere herein. In other examples, hub device 805 may be one of the audio devices 110, such as an audio device 110 having greater computing power than the other audio devices 110. In other implementations, the portion of the echo reference orchestrator 302 implemented by the hub device 805 may be implemented by one or more servers.
In this hub-and-spoke example, both the renderer 201 and the echo reference generator 710 reside in the hub device 805. In this example, each audio device 110 receives rendered audio data for playback from the hub device 805. In this example, the rendered audio data for playback includes a local echo reference 220. The non-native audio device references may be rendered at the hub device 805 as one or more virtual non-native device references, as device-specific echo references (e.g., as described above with reference to fig. 1-6), or as a combination thereof. In this example, the hub device 805 is provided with the information required to generate an echo reference, such as rendering information (which may include rendering matrix information), audio device specific information (such as audio device capability information), spatial metadata, and the like. In some alternative examples, the local echo reference 220 may be created in each audio device 110.
Computing echo reference indicators based on rendering information
Various examples of computing an echo reference indicator from rendering information, such as rendering metadata, are disclosed in the following paragraphs.
Audibility estimation based on rendering matrix information
The primary component of the rendering metadata set is the rendering matrix (722) for a given audio device configuration. The rendering matrix defines the spatial frequency response of the audio device configuration to any audio object in the encoded audio stream. In some rendering matrix examples, the audio environment (e.g., the room in which the audio devices are located) is first discretized into [n_x, n_y, n_z] points, and a rendering filter is designed for each device and each spatial point. The rendering filters may be defined in the frequency bin domain, where all filters have n_bin taps. Thus, for a system of N devices, the rendering matrix is a set of N × n_x × n_y × n_z filters, each of length n_bin.
Let us assume that a sound source should be located at point (x_a, y_a, z_a). In some implementations, this information can be obtained for each audio object from an audio object metadata file (e.g., an Atmos .prm file) provided to the renderer (702). An ideal rendering system would achieve this placement with high accuracy. However, given the limitations of the audio device configuration, such a level of accuracy may not be guaranteed in some examples; only a best-effort result of the renderer may be achieved. For example, the audio object position may be approximated by a weighted average of the values corresponding to the nearest grid points of the rendering matrix, and a subset of the rendering filters activated for these points may be used to render the sound source at that location.
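By way of illustration only, the following Python sketch (using NumPy) shows how a rendering matrix with the shape described above might be stored, and how the per-device rendering filters for an audio object at a given position might be approximated from the nearest grid points. The array layout, the inverse-distance weighting, and all names and dimensions are assumptions introduced for this example, not a definitive implementation.

    import numpy as np

    # Illustrative dimensions: N devices, an n_x x n_y x n_z spatial grid,
    # and rendering filters of n_bin frequency-domain taps.
    N, n_x, n_y, n_z, n_bin = 5, 8, 8, 4, 64
    rng = np.random.default_rng(0)
    rendering_matrix = rng.standard_normal((N, n_x, n_y, n_z, n_bin))

    # Grid point coordinates of the discretized audio environment (in meters).
    grid = np.stack(np.meshgrid(np.linspace(-2, 2, n_x),
                                np.linspace(-2, 2, n_y),
                                np.linspace(0, 2, n_z),
                                indexing="ij"), axis=-1)        # (n_x, n_y, n_z, 3)

    def filters_for_object(position, k=4):
        """Approximate per-device rendering filters for an audio object at
        `position` by inverse-distance weighting of the k nearest grid points."""
        dists = np.linalg.norm(grid - np.asarray(position), axis=-1).ravel()
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / np.maximum(dists[nearest], 1e-6)
        weights /= weights.sum()
        flat = rendering_matrix.reshape(N, -1, n_bin)           # (N, grid points, n_bin)
        return np.tensordot(weights, flat[:, nearest, :], axes=([0], [1]))  # (N, n_bin)

    print(filters_for_object((0.3, -0.5, 1.0)).shape)  # (5, 64)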
Thus, the rendering matrix may act as a spatial transfer function defined at each device and at each point on the spatial grid. Thus, the rendering matrix includes information about the audibility of each audio device to each other audio device (which may be referred to herein as "mutual audibility"). Although the rendering matrix 722 contains this information, it is desirable to calculate an audibility index that can be easily consumed by the echo reference importance estimator 401. The embodiments described below provide relevant examples. In some implementations, the metadata-based index calculation block 705 is configured to perform the calculation, and the echo reference characteristics 733 include the indices.
Fig. 9A shows an example of a graph showing the positions of a listener and an audio device in an audio environment. In this example, the audio environment is a room. The vertical axis of the graph 900 indicates the y-coordinate (width) of the room in meters and the horizontal axis indicates the x-coordinate (length) of the room in meters. According to this example, listener L is located at the center of the audio environment, i.e., the origin (position (0, 0)) of graph 900, and audio devices 1, 2, 3, 4, and 5 and subwoofer S are located at various points along a circle one meter from listener L. In the example shown in fig. 9A, the audio devices 1-5 are all of the same type and have the same or substantially the same audio device characteristics (e.g., number of loudspeakers, type, and function). Other audio environments may include different numbers, types, and/or arrangements of audio devices, listener(s), etc.
Fig. 9B shows an example of a chart corresponding to the rendering matrix of each audio device shown in fig. 9A. In this example, a graph 905a corresponds to audio device 1, a graph 905b corresponds to audio device 2, a graph 905c corresponds to audio device 3, a graph 905d corresponds to audio device 4, and a graph 905e corresponds to audio device 5. In these examples, graphs 905 a-905 e show a rendering matrix cross section of each audio device after averaging in the z and frequency dimensions. Furthermore, in these examples, the x, y plane of the audio environment is divided into 64 equal regions, each region having a side length of 0.5 meters, and only one loudspeaker activation value is represented for each of the 64 regions. Such regions may be referred to herein as "spatial tiles". The loudspeaker activation value is similar to the total wideband gain for playback of each corresponding audio device. Here we show a rendering matrix cross section averaged over frequency and height (z) to demonstrate some general concepts of the present disclosure. A more practical implementation uses a complete rendering matrix.
The rendering matrix for each device contains all the information needed to estimate the spatial implementation of an audio object (e.g., an Atmos audio object) for the audio device configuration shown in fig. 9A. In other words, the rendering matrix information may be used to estimate the percentage of each audio object to be rendered in each device and the degree of similarity of the device channels.
A simple implementation involves computing a covariance matrix of a device-level rendering matrix and using the covariance matrix as a proxy for covariance of the resulting speaker feed. We refer to this as the "non-notification rendering covariance matrix" or "non-notification rendering correlation matrix" herein.
We can see how the rendering matrix itself contains spatial information from which inter-device audibility can be estimated. Even in its simplest form, one can use a non-notification rendering correlation matrix to obtain the audibility ranking of each device heard from each other device. In addition, the complete non-notification rendering correlation matrix will also contain information about how this audibility varies with frequency.
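As a minimal sketch of the computation described above, and assuming the rendering matrix layout used in the earlier sketch, the following Python code averages each device's rendering matrix over the height and frequency dimensions and forms a device-by-device correlation matrix that can serve as the non-notification rendering correlation matrix. The function name and the use of a correlation (rather than covariance) matrix are illustrative choices for this example.

    import numpy as np

    def non_notification_rendering_correlation(rendering_matrix):
        """Device-by-device correlation of per-tile loudspeaker activations.

        `rendering_matrix` is assumed to have shape (N, n_x, n_y, n_z, n_bin);
        averaging over z and frequency yields one activation map per device,
        whose pairwise correlation serves as a proxy for speaker-feed covariance."""
        activation = np.abs(rendering_matrix).mean(axis=(3, 4))   # (N, n_x, n_y)
        flattened = activation.reshape(activation.shape[0], -1)    # one row per device
        return np.corrcoef(flattened)                              # (N, N)

    # Example with random data standing in for a real rendering matrix.
    rng = np.random.default_rng(1)
    R = rng.standard_normal((5, 8, 8, 4, 64))
    print(non_notification_rendering_correlation(R).round(2))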
Metadata-based LUPA estimation
Similarly, some embodiments relate to transforming audio object spatial metadata (which may be components of audio object metadata 702) into an index that may be readily consumed by echo reference importance estimator 401. In some examples, the metadata-based metrics calculation module 705 may be configured to perform such transformations.
In the discussion above, it is noted that in some embodiments, the importance index I may be based on one or more of the following:
L: level of echo reference;
U: Uniqueness of the echo references;
P: Time persistence of echo references, and/or
A: audibility of the device rendering the echo reference.
As used herein, the acronym "LUPA" generally refers to the echo reference characteristics from which an importance indicator may be determined, including but not limited to one or more of L, U, P, and/or A. As described above, the rendering matrix includes audibility information, which is the "A" component of LUPA. Other LUPA parameters may be estimated based on the rendering matrix and the spatial data. Some embodiments estimate the LUPA parameters by determining statistical data, based on aggregated spatial data, that is highly correlated with one or more LUPA parameters.
In some examples, the audio object spatial metadata indicates a spatial-temporal distribution of each audio source in the received audio data bitstream. Some implementations relate to calculating an amount of time an audio object is present in each spatial grid tile. Some such implementations relate to generating a "counted" 3D heat map for each audio object channel.
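A minimal Python sketch of such a count-based heat map is shown below. The grid resolution, the room extent, and the form of the per-object metadata (a sequence of (x, y, z) positions over time) are illustrative assumptions for this example.

    import numpy as np

    def object_count_heatmap(positions, grid_shape=(8, 8, 4),
                             extent=((-2, 2), (-2, 2), (0, 2))):
        """Count how many time instances an audio object spends in each spatial tile.

        `positions` is an (n_frames, 3) array of (x, y, z) object coordinates."""
        counts = np.zeros(grid_shape, dtype=int)
        for pos in np.atleast_2d(positions):
            idx = []
            for axis, (lo, hi) in enumerate(extent):
                i = int((pos[axis] - lo) / (hi - lo) * grid_shape[axis])
                idx.append(np.clip(i, 0, grid_shape[axis] - 1))
            counts[tuple(idx)] += 1
        return counts

    # Example: an object drifting across the room accumulates counts along its path.
    t = np.linspace(0, 1, 100)
    path = np.stack([4 * t - 2, np.zeros_like(t), np.ones_like(t)], axis=1)
    print(object_count_heatmap(path).sum())  # 100 counts distributed over tiles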
Fig. 10A and 10B show examples of graphs indicating spatial audio object counts for a single song. In these examples, the song is in the Atmos format and the audio objects are Atmos audio objects. In these examples, graph 1005a corresponds to audio object 1, graph 1005b corresponds to audio object 2, graph 1005c corresponds to audio object 3, graph 1005d corresponds to audio object 4, graph 1005e corresponds to audio object 5, graph 1005f corresponds to audio object 6, graph 1005g corresponds to audio object 7, graph 1005h corresponds to audio object 8, graph 1005i corresponds to audio object 9, graph 1005j corresponds to audio object 10, graph 1005k corresponds to audio object 11, graph 1005l corresponds to audio object 12, graph 1005m corresponds to audio object 13, graph 1005n corresponds to audio object 14, and graph 1005o corresponds to audio object 15.
In each graph, the coordinates x, y, and z represent the length, width, and height of an Atmos bin of an acoustic space, which is an example of a cubic audio environment. In each graph, a sphere at a particular position represents a "count", i.e., an instance of time at which the corresponding audio object is at the corresponding (x, y, z) position during the song. In some implementations, the audio object counts may be used as a basis for estimating P (the time persistence of an echo reference).
Some embodiments use audio object counts as a basis for spatial importance weighting. In some examples, the spatial importance weighting may be used with various other types of importance indicators, such as audibility indicators. For example, if spatial importance weighting is used in conjunction with a "non-notification rendering correlation matrix" such as described with reference to fig. 9B, some embodiments involve generating a "spatial notification correlation matrix" in which the spatial locations where there are more audio objects are more prominent.
As described above, in some embodiments, U may be based at least in part on a correlation index between each echo reference. In some examples, the spatial notification correlation matrix may be used as a proxy for correlation metrics based on audio data (e.g., correlation matrices based on PCM data for each echo reference) to generate the importance metrics for input to the echo reference importance estimator 401.
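By way of illustration only, the following Python sketch combines per-tile audio object counts with per-device activation maps to form such a spatial notification correlation matrix. The array shapes and the simple multiplicative weighting are assumptions introduced for this example.

    import numpy as np

    def spatial_notification_correlation(rendering_matrix, object_counts):
        """Correlation of per-device activation maps weighted by audio object counts.

        `rendering_matrix` is assumed to have shape (N, n_x, n_y, n_z, n_bin) and
        `object_counts` shape (n_x, n_y, n_z); tiles where more audio objects occur
        contribute more to the resulting device-by-device correlation matrix."""
        activation = np.abs(rendering_matrix).mean(axis=4)          # (N, n_x, n_y, n_z)
        weighted = activation * object_counts[np.newaxis, ...]      # emphasize busy tiles
        flattened = weighted.reshape(weighted.shape[0], -1)
        return np.corrcoef(flattened)

    rng = np.random.default_rng(2)
    R = rng.standard_normal((5, 8, 8, 4, 64))
    counts = rng.integers(0, 10, size=(8, 8, 4))
    print(spatial_notification_correlation(R, counts).shape)  # (5, 5)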
Fig. 11A and 11B show examples of a non-notification rendering correlation matrix and a spatial notification correlation matrix, respectively. The non-notification rendering correlation matrix of fig. 11A and the spatial notification correlation matrix of fig. 11B both correspond to the audio device arrangement shown in fig. 9A and the same audio content used to generate the spatial audio object counts shown in fig. 10. In these examples, the highest possible correlation value is 1.0, and the lowest possible correlation value is zero. In both cases, the effect of the subwoofer is omitted. Further, in both cases, the value corresponding to each audio device's correlation with itself is omitted.
One can observe that the ranking of the spatial notification correlation matrix is different from the ranking of the non-notification rendering correlation matrix. Except for the ranking corresponding to audio device 4, the highest ranked non-local echo reference according to the non-notification rendering correlation matrix differs from the highest ranked non-local echo reference according to the spatial notification rendering correlation matrix. Referring first to fig. 11A, for example, audio played back by audio device 2 is the highest ranked non-local echo reference for audio device 1 according to the non-notification rendering correlation matrix, while audio played back by audio device 5 is the highest ranked non-local echo reference for audio device 1 according to the spatial notification rendering correlation matrix of fig. 11B.
One way to compare the utility of the approximations of the PCM based correlation matrices via the spatial notification correlation matrices is to evaluate the resulting non-local reference management scheme implemented by the local device based on each of these metrics. A simple indicator of how close the approximation is would be a comparison of the echo reference ranking produced by the echo reference importance estimator 401 based on each type of indicator.
Fig. 12A, 12B, and 12C illustrate examples of echo reference importance ranking generated by the echo reference importance estimator 401 from PCM-based correlation matrix, spatial notification correlation matrix, and non-notification correlation matrix, respectively, using the same audio content as that used to generate the spatial audio object count shown in fig. 10. In these examples, the highest possible ranking is 1.0, and the lowest possible ranking is zero.
In this example, the echo reference importance ranking shown in fig. 12A (which corresponds to the echo reference importance ranking produced by the echo reference importance estimator 401 from the PCM based correlation matrix in some embodiments) is used as a "ground truth" against which the other rankings can be evaluated. A comparison of fig. 12A and 12C shows that the importance ranking based on the non-notification correlation matrix provides only a very rough approximation to the importance ranking according to the PCM-based correlation matrix: for example, the highest ranked non-local echo reference based on the non-notification correlation matrix does not match, for any audio device, the highest ranked non-local echo reference according to the PCM-based correlation matrix. However, a comparison of fig. 12A, 12B, and 12C shows that the importance ranking based on the spatial notification correlation matrix provides a better approximation of the importance ranking according to the PCM based correlation matrix than the importance ranking based on the non-notification correlation matrix. For example, the highest ranked non-local echo references of audio devices 1 and 5 based on the spatial notification correlation matrix match the highest ranked non-local echo references of audio devices 1 and 5 based on the PCM based correlation matrix. Furthermore, the highest ranked non-local echo references of audio devices 3 and 4 according to the spatial notification correlation matrix are the second highest ranked non-local echo references according to the PCM-based correlation matrix.
Audio scene change metadata
In some implementations, the LUPA estimation is based on the assumption that the spatial scene is stationary within the estimation time window. The LUPA estimate will ultimately reflect any significant changes in the spatially rendered scene after some (variable) time delay. This means that during significant audio scene changes, the echo references selected using these estimates, as well as the virtual echo references generated, may be incorrect. In some examples, the echo return loss enhancement (ERLE) may be reduced beyond operational limits, which may cause instability of the echo management system. Such conditions may also trigger rapid reference switching, which may not actually be needed but may instead be an artifact caused by changes in scene dynamics. To prevent these potential negative consequences, we disclose here two additions to upstream data processing:
1) Audio scene change metadata 703, which may include information about:
a) A significant spatially varying event for each audio object and bed source, and/or
b) An instance of audio object overlap/interaction; and
2) The scene change analyzer 755, which may be configured to analyze the audio scene change metadata 703 and/or corresponding audio data (e.g., PCM data), compares the spatial energy distribution of the current audio scene with the spatial energy distribution of the upcoming audio scene and generates an audio scene change message 715 based on the audio content look ahead. The look-ahead time window may vary depending on the particular implementation. In some examples, the scene change analyzer 755 may be configured to analyze the audio scene change metadata 703 and/or corresponding audio data to be reproduced during a look-ahead time window spanning multiple seconds in the future, such as 5 seconds, 8 seconds, 10 seconds, 12 seconds, etc.
These supplements have various potential advantages. For example, the audio scene change message 715 may enable the echo reference importance estimator 401 and the metadata-based metric calculation module 705 to dump their histories and reset their memory buffers, thereby enabling a fast response to audio scene changes. The "fast" response may be on the order of hundreds of milliseconds (e.g., 300 to 500 milliseconds). For example, such a fast response may avoid the risk of AEC divergence.
In some examples, the audio object metadata file contains spatial coordinates for each time interval. This spatial metadata may be used as input to, for example, the scene change analyzer 755 of fig. 7 and 8. In some examples, based at least in part on the spatial metadata, the scene change analyzer 755 may calculate an audio object density in each tile (area or volume unit) of the spatial grid used to represent the audio environment. In its most basic form, this may be a count of all audio objects in a given tile, as shown in the example of FIG. 10.
However, a key point is to have one look-ahead buffer with values related to audio scene changes during a look-ahead time window (e.g., 5 seconds, 8 seconds, 10 seconds, 12 seconds, etc.). The input from the look-ahead buffer may enable the scene change analyzer 755 to estimate the similarity of the currently rendered audio scene as compared to the audio scene rendered in the near future. The metadata-based indicator calculation module 705 and the echo reference importance estimator 401 may then use this information to adjust the adaptation rate of their indicators. In some implementations, the audio scene change message 715 is device-specific in that the audio device only needs information about audio scene changes within a subset of the audible spatial grid of the audio device (a subset of grids that significantly affect the operation of the MC-EMS 203).
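By way of illustration only, the following Python sketch compares the current spatial object-density distribution with a distribution accumulated over a look-ahead buffer and emits a scene change flag when the two differ sufficiently. The similarity measure (cosine similarity) and the threshold value are assumptions introduced for this example.

    import numpy as np

    def scene_change_flag(current_density, lookahead_density, threshold=0.5):
        """Return True when the look-ahead spatial object distribution differs
        markedly from the current one.

        Both inputs are per-tile densities (any matching shape); they are
        compared with a cosine-similarity measure."""
        a = current_density.astype(float).ravel()
        b = lookahead_density.astype(float).ravel()
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        if norm == 0.0:
            return False
        similarity = float(a @ b) / norm
        return similarity < threshold

    now = np.zeros((8, 8))
    now[0:2, 0:2] = 5          # objects currently clustered front-left
    soon = np.zeros((8, 8))
    soon[6:8, 6:8] = 5         # objects about to jump to the rear-right
    print(scene_change_flag(now, soon))   # True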
According to some examples, an example importance index I(t) at time t may be expressed as a weighted sum, over the spatial grid indices i and a look-ahead window of length n, of the audio object counts C_i(t+k) at look-ahead time k, where α_ik and β_ik are predefined coefficients for each spatial grid point that depend on the configuration of the audio device. In most cases, α_ik and β_ik are less than 1. Such importance indicators may be designed to approximate weighted object densities, or cumulative object persistence, within the spatial and temporal regions of interest.
With previously deployed metadata schemes, the renderer can only access spatial data up to the current time t. In contrast, in the examples above, some disclosed embodiments enhance the audio data stream with audio object spatial coordinates within the look-ahead window n.
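Because the precise functional form of I(t) is implementation-dependent, the following Python sketch shows only one plausible realization, in which per-tile, per-look-ahead-step coefficients α_ik and β_ik weight the audio object counts C_i(t+k). The particular combination used here (a linear term plus a quadratic term) is an assumption for illustration only.

    import numpy as np

    def importance_index_lookahead(counts_lookahead, alpha, beta):
        """One hypothetical form of I(t) built from look-ahead object counts.

        `counts_lookahead[i, k]` is C_i(t + k) for spatial grid index i and
        look-ahead step k; `alpha` and `beta` are predefined per-(i, k)
        coefficients (typically < 1) chosen for the audio device configuration."""
        C = np.asarray(counts_lookahead, dtype=float)
        # Assumed combination: a linear term plus a term that emphasizes
        # persistent activity, summed over the spatial and temporal region.
        return float(np.sum(alpha * C + beta * C ** 2))

    n_tiles, n_lookahead = 64, 10
    rng = np.random.default_rng(3)
    C = rng.integers(0, 4, size=(n_tiles, n_lookahead))
    alpha = np.full((n_tiles, n_lookahead), 0.1)
    beta = np.full((n_tiles, n_lookahead), 0.02)
    print(importance_index_lookahead(C, alpha, beta))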
EMS health prediction based on metadata
A component of metadata-based scene analysis for the purpose of echo management is EMS health data 423 determined by MC-EMS performance model 405. The EMS health data 423 is highly sensitive to significant audio scene changes and may, for example, indicate EMS divergence caused by such audio scene changes.
Because, in some disclosed examples, information about such audio scene changes is now transmitted ahead of time (e.g., via the EMS look-ahead statistics 732 from the metadata-based metric calculation module 705 and/or the audio scene change messages 715 from the scene change analyzer 755), some implementations of the echo reference orchestrator 302 may be configured to use such audio scene change information to predict the EMS health data 423 (e.g., via the MC-EMS performance model 405). For example, if the MC-EMS performance model 405 predicts possible EMS filter divergence based on one or more EMS look-ahead statistics 732 and/or audio scene change messages 715, the MC-EMS performance model 405 may be configured to provide corresponding EMS health data 423 to the echo reference importance estimator 401 and the echo reference selector 402, which may reset their algorithms accordingly, according to some disclosed examples.
In some examples, the MC-EMS performance model 405 may be configured to implement embodiments of EMS health prediction according to a regression model based on scene change importance look-ahead data, for example as follows:
A(t+k) = f({I_ik})
In the above equation, A represents EMS health data, f represents a regression function (which may be linear or nonlinear), and the set {I_ik} represents the set of importance values over the total look-ahead window and the set of spatial grid points.
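A minimal sketch of such a regression-based health predictor follows, assuming a simple linear form for f and illustrative coefficients; the functional form and the values are assumptions, not the disclosed model.

    import numpy as np

    def predict_ems_health(importance_lookahead, coeffs, bias=1.0):
        """Predict EMS health A(t + k) = f({I_ik}) with an assumed linear f.

        `importance_lookahead[i, k]` holds the importance value for spatial grid
        point i at look-ahead step k; larger predicted scene activity lowers the
        predicted health score in this illustrative model."""
        I = np.asarray(importance_lookahead, dtype=float)
        return float(bias - np.sum(coeffs * I))

    rng = np.random.default_rng(4)
    I = rng.random((64, 10))
    coeffs = np.full_like(I, 1e-3)
    health = predict_ems_health(I, coeffs)
    print(health)  # values near 1.0 suggest stable EMS; low values predict divergence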
Rendering virtual echo references for echo management
We can reduce the complexity of the echo reference generation process by approximating the salient properties of the non-local echo references and generating the minimum set of echo references needed to achieve satisfactory ERLE performance on each device. The resulting virtual echo references (also referred to herein as virtual sound sources) may even improve the ERLE output by the device EMS compared to the device-level set of remote echo references. The use of virtual echo references may provide important benefits, especially when the number of audio devices in an audio environment is large (e.g., > 10). In this case, if a device-level PCM based algorithm is used, the non-uniqueness of the echo references may lead to excessive ERLE and EMS failures.
Fig. 13 illustrates a simplified example of determining a virtual echo reference. In some implementations, the echo reference generator 710 may be configured to generate one or more virtual echo references according to the methods disclosed in this section. In some alternative implementations, the renderer 201 may be configured to generate one or more virtual echo references according to such methods. According to this example, audio device a is a local audio device and audio devices B and C are non-local audio devices. In this example, the virtual echo reference corresponds to a virtual sound source D at position O.
The position O may be obtained via an initial and/or periodic calibration step, for example using room mapping data (such as audio device position data) available to the renderer 201 or the echo reference generator 710. For example, if none of the loudspeakers are occluded, the location O may be determined from the centroid of the cumulative far-end device heat maps, which may be generated by adding the rendering matrix slices for each non-local or "far-end" audio device (e.g., as shown in fig. 9B). For example, if we denote the wideband gain of the i-th far-end device at spatial tile j as w_ij and the position vector of spatial tile j as x_j, the position vector of O may be expressed as
x_O = ( Σ_i Σ_j w_ij x_j ) / ( Σ_i Σ_j w_ij )
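A minimal Python sketch of this weighted-centroid computation follows, assuming per-device wideband gain maps over the spatial tiles; the tile positions and gain values are illustrative.

    import numpy as np

    def virtual_source_position(farend_gains, tile_positions):
        """Weighted centroid of the cumulative far-end device heat map.

        `farend_gains[i, j]` is the wideband gain w_ij of far-end device i at
        spatial tile j; `tile_positions[j]` is the (x, y) or (x, y, z) position
        of tile j."""
        w = np.asarray(farend_gains, dtype=float)
        cumulative = w.sum(axis=0)                      # add heat maps over devices
        return cumulative @ np.asarray(tile_positions) / cumulative.sum()

    # Two far-end devices whose energy is concentrated around (1, 0) and (0, 1).
    tiles = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    gains = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1]])
    print(virtual_source_position(gains, tiles))        # roughly between the two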
The virtual sound source D corresponds to playback of the audio devices B and C from the perspective of the audio device a. The virtual sound source D is an equivalent sound source at location O that creates the same non-local audio device playback sound field as the separate playback audio from audio devices B and C would create at the location of audio device a. It should be noted that virtual source D need not approximate the complete sound field that individual playback audio from audio devices B and C would create in all parts of the audio environment.
The low-dimensional approximation of D from the far-end device echo references may be implemented using different methods; some of these methods are now described herein. In general, these methods may involve finding a weight matrix W and an input subspace matrix X (e.g., a PCM matrix or a Principal Component Analysis (PCA) matrix) such that the device echo reference frame d(t) (e.g., a PCM frame) at time t may be approximated as d(t) ≈ W X (equation 0).
In the examples described below, we use frequency-domain separation (low frequency and high frequency), audio-object-based methods, and statistical-independence-based methods as example methods of creating virtual sound sources.
Low frequency management
At low frequencies, the renderer will produce different references due to the difference in the ability of the loudspeakers to play back content at these frequencies. The particular low frequency range may depend on the specifics of the particular implementation, such as loudspeaker capabilities. In some embodiments where the ability of one or more loudspeakers in the audio environment to reproduce sound in the bass range is minimal, the low frequency range may be 400Hz or less, while for other embodiments the low frequency range may be 350Hz or less, 300Hz or less, 250Hz or less, 200Hz or less, etc. In the context of a multi-channel echo canceller, a renderer configuration may be used to determine the reference signals for low frequency cancellation. Instead of delivering all or a subset of the available echo references, a weighted sum of echo references at a proportion of the low frequencies may be used. The amount of crossover with higher frequency cancellation may also be considered. Given the weights w, for any echo reference r, at frequency k, the selected echo reference can be expressed as:
r_sum(k) = Σ_{i=1}^{n} w_i r_i(k)    (equation A)
In equation A, the superscript n represents the total number of echo references. In some examples, the weights, the low frequency range over which the summation is used, and the crossover with higher-frequency cancellation may be extracted from the rendering information. Examples of weights and low frequency ranges are described below. The weights and low frequency ranges may be based at least in part on the capabilities and limitations of the individual loudspeakers, and on how the content may be rendered for each device. One motivation for implementing low frequency management methods is to avoid non-uniqueness issues and high cross-correlation between low frequency echo references.
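A minimal Python sketch of the weighted low-frequency summation of equation A follows, assuming frequency-domain echo reference frames and weights derived from rendering information; the shapes and values are illustrative.

    import numpy as np

    def low_frequency_reference(references, weights, max_mono_bin):
        """Weighted sum of n echo references over the low-frequency bins.

        `references` has shape (n, n_bins), one frequency-domain frame per echo
        reference; bins below `max_mono_bin` are combined into a single summed
        reference, as in equation A."""
        refs = np.asarray(references)
        w = np.asarray(weights, dtype=float).reshape(-1, 1)
        summed = (w * refs).sum(axis=0)          # weighted sum across references
        return summed[:max_mono_bin]             # keep only the managed low bins

    rng = np.random.default_rng(5)
    refs = rng.standard_normal((4, 256))          # 4 echo references, 256 bins
    weights = [0.4, 0.3, 0.2, 0.1]                # e.g. derived from rendering info
    print(low_frequency_reference(refs, weights, max_mono_bin=32).shape)  # (32,)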
Fig. 14 shows an example of a low frequency management module. In this example, the low frequency management module 1410 is a component of the echo reference generator 710. In some implementations, the low frequency management module 1410 may be configured to determine the weights referenced in equation a, the low frequency range for summation, and the cross-scale with higher frequency cancellation (if any). The elements of fig. 14 are as follows:
312: metadata, which may be or may include metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, upmix matrices, and/or loudspeaker activation matrices;
220: local echo references for playback and cancellation;
721: locally generated copies of echo references being played by non-local devices;
1402: a frequency selector module configured to select a low frequency and a crossover (if any) to be applied. The frequency selector module 1402 may, for example, select a threshold value for k in equation a;
1403: a weight generation module configured to generate a weight for each echo reference based on the loudspeaker metadata 312;
1404: a summation module configured to calculate a weighted sum of echo references;
1411: the frequencies (and possible crossings) selected by the frequency selector module 1402 and provided to the summing module 1404;
1412: the weights generated by the weight generation module 1403 and provided to the summation module 1404; and
723LF: the weighted sum of the echo references over a range of low frequencies produced by the summing block 1404. The summing module 1404 can generate one or more weighted sums of the low frequency device echo references 723 LF.
In some implementations, the low frequency management module 1410 may be configured to select frequencies and/or generate weights based at least in part on rendering information, such as information about the rendering matrix 722 (or the rendering matrix 722 itself).
Frequency selection
The frequency range over which low frequency management is performed may be based on a hard cut-off frequency or on a crossover frequency range. A crossover frequency range may be desirable to account for different loudspeaker capabilities in overlapping frequency regions, where the frequency content of the echo references of certain audio devices is lower than that of the summed reference. For example, when a subwoofer is present, a crossover frequency range may be desirable, and the subwoofer may be considered the primary or sole reference at most lower frequencies. The cut-off frequency or crossover frequency range may be included in the rendering information, which may take into account the loudspeaker capabilities of the audio devices in the audio environment. In some examples, the cutoff frequency may have a value of a few hundred Hz, such as 200Hz, 250Hz, 300Hz, 350Hz, 400Hz, etc. According to some examples, the low end of the crossover frequency range may be 100Hz, 150Hz, 200Hz, etc., and may include frequencies up to 200Hz, 250Hz, 300Hz, 350Hz, 400Hz, etc.
Weight generation
In some implementations, weights may be applied according to audio device configuration and capabilities, such as using only local or subwoofer references for low frequency playback if a subwoofer is present. If a subwoofer is present, it is often desirable to play back most of the low frequency audio content by the subwoofer, rather than providing the low frequency audio content to an audio device that may not play back audible lower frequencies without distortion. According to some embodiments that do not include a subwoofer and where the audio devices have the same (or similar) capabilities for low frequency audio reproduction, to obtain any audible low frequency performance, the low frequencies reproduced by all audio devices may be the same in order to maximize power. In some such embodiments, the reproduced low frequency audio may be mono/non-directional. In some such examples, the weight of a single reference may be 1.0. This corresponds to mono-only echo cancellation below a certain frequency, which may be referred to herein as "max_mono_hz".
Fig. 15A and 15B illustrate examples of low frequency management for embodiments with and without subwoofers. In both examples, the multi-channel echo cancellation occurs between frequencies min_multi_hz and max_cancel_hz. Fig. 15A shows an example of low frequency management for an embodiment with a subwoofer. In this example, max_mono_hz is in the frequency range (higher than min_multi_hz) where multi-channel echo cancellation occurs. This example applies to a subwoofer reference ("Sub Ref" in fig. 15A), where it is desirable to perform echo cancellation for the subwoofer reference up to a max_mono_hz value of several hundred Hz (such as 200Hz, 300Hz, 400Hz, etc.). In some such examples, min_multi_hz may be 100Hz, 150Hz, 200Hz, etc. This allows some crossover between the mono-only and multi-channel echo cancellation frequency ranges.
Fig. 15B shows an example of low frequency management for an embodiment without a subwoofer. In this example, the local reference from 0Hz to max_mono_hz is used for mono cancellation. In this example, max_mono_hz and min_multi_hz are set to the same frequency. In some alternative examples, max_mono_hz and min_multi_hz may not be the same frequency for embodiments without a subwoofer. According to some alternative embodiments including subwoofers, max_mono_hz and min_multi_hz may be the same frequency.
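By way of illustration only, the following Python sketch shows how the frequency selector and weight generator described above might choose max_mono_hz, min_multi_hz, and per-reference weights depending on whether a subwoofer is present. The specific frequencies are example values taken from the present disclosure, while the selection logic itself is an assumption introduced for this sketch.

    def low_frequency_plan(has_subwoofer, n_references, primary_index=0):
        """Illustrative selection of crossover frequencies and summation weights.

        With a subwoofer, its reference dominates low-frequency cancellation and a
        crossover region overlaps multi-channel cancellation (as in fig. 15A);
        without one, all devices are assumed to play identical mono low
        frequencies, so a single reference with weight 1.0 suffices below
        max_mono_hz (as in fig. 15B)."""
        if has_subwoofer:
            plan = {"max_mono_hz": 300, "min_multi_hz": 150}
        else:
            plan = {"max_mono_hz": 200, "min_multi_hz": 200}
        weights = [0.0] * n_references
        weights[primary_index] = 1.0   # subwoofer reference, or the local reference
        plan["weights"] = weights
        return plan

    print(low_frequency_plan(True, 5))
    print(low_frequency_plan(False, 5))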
Higher frequency management
As used herein, "higher frequency" may refer to any audible frequency that is higher than one of the low frequency ranges described with reference to fig. 14. The differences in propagation characteristics in higher frequencies compared to lower frequencies and the differences in audio driver (speaker) beamforming for different frequencies indicate that the reference, importance and selection will be highly frequency sensitive at higher frequencies. For example, low frequency audio content is typically much less directional than high frequency audio content. Furthermore, typical rendered audio scenes contain more information in the high frequencies than in the lower frequencies.
Thus, the rendering of an echo reference having a large number of high frequency components may be relatively more complex than the rendering of a virtual reference having mainly low frequency components. For example, some high frequency management implementations may involve multiple instances of equation A, each instance covering a different portion of the high frequency range and each instance having potentially different weighting factors. Non-uniqueness and the associated AEC divergence are less of a risk in the higher frequency bands.
However, some examples exploit the frequency sparsity of some audio content to manage multi-band reference generation. Creating a mix based at least in part on the frequency dependent audibility differences may eliminate the need to have multiple echo references without degrading the quality of AEC health. In some such examples, the rendering implementation may be similar to the frequency management implementation for echo references. In some such examples, only the weight generator and frequency selector parameters may be different.
Fig. 15C illustrates elements that may be used to implement a higher frequency management method according to one example. Fig. 15C illustrates blocks configured to perform a multi-band higher frequency management method, which in this example is implemented via frequency management modules 1410A-1410K. In this example, each of the frequency management modules 1410A-1410K is configured to implement frequency management for one of the frequency bands a-K. According to some examples, band a is a band adjacent to the low frequency band processed by the low frequency management module 1410 of fig. 14. In some examples, band a may overlap with the low frequency band processed by the low frequency management module 1410. In some such embodiments, bands a through K may also overlap. In this example, K represents an integer of three or more. The frequency bands a to K may be selected according to any convenient method, such as according to a linear frequency scale, according to a logarithmic frequency scale, according to a mel scale, etc.
According to this example, each of the frequency management modules 1410A-1410K is configured to operate generally as described above with reference to the low frequency management module 1410 of fig. 14, except that each of the frequency selectors 1402A-1402K is configured to select a different frequency range. The weights generated by the weight generation modules 1403A-1403K may also vary according to frequency. In these examples, the frequency management modules 1410A-1410K are configured to output frequency segmented non-native device echo references 723BA-723BK to the band-to-PCM converter 1515, each of which corresponds to one of the frequency bands a-K. In this example, the band-to-PCM converter 1515 is configured to combine the non-native device echo references 723BA-723BK of the frequency segments and output the non-native higher frequency device echo references 723HF.
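A minimal sketch of the multi-band combination performed by the frequency management modules and the band-to-PCM converter follows, assuming per-band weights and frequency-domain references; the band edges, weights, and shapes are illustrative.

    import numpy as np

    def higher_frequency_reference(references, band_edges, band_weights):
        """Combine echo references per frequency band and concatenate the bands.

        `references` has shape (n_refs, n_bins); `band_edges` lists bin boundaries
        for bands A..K; `band_weights[b]` gives one weight per reference for band b."""
        refs = np.asarray(references)
        pieces = []
        for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
            w = np.asarray(band_weights[b], dtype=float).reshape(-1, 1)
            pieces.append((w * refs[:, lo:hi]).sum(axis=0))   # weighted sum in this band
        return np.concatenate(pieces)                          # banded non-local reference

    rng = np.random.default_rng(6)
    refs = rng.standard_normal((4, 256))
    edges = [32, 64, 128, 256]                  # three higher-frequency bands
    weights = [[0.7, 0.1, 0.1, 0.1],
               [0.25, 0.25, 0.25, 0.25],
               [0.1, 0.1, 0.1, 0.7]]
    print(higher_frequency_reference(refs, edges, weights).shape)   # (224,)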
Echo reference generation based on statistical subspaces
Some disclosed subspace-based examples involve defining low-dimensional embedding via statistical properties. Some subspace-based examples involve methods using independent component analysis or principal component analysis, among others. By implementing such a method, the control system may be configured to find K statistically independent audio streams that approximate a non-local reference.
Fig. 16 is a block diagram outlining an example of another disclosed method. As with other methods described herein, the blocks of method 1600 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed simultaneously. Method 1600 may be performed by an apparatus or system such as apparatus 50 shown in fig. 1A and described above. Method 1600 may be performed, for example, by control system 60 of fig. 1A.
In this example, the echo reference generator 710 works in conjunction with the echo reference importance estimator 401 and the echo reference selector 402: in this example, blocks 1620, 1655, and 1660 are implemented by the echo reference importance estimator 401 and/or the echo reference selector 402, and blocks 1625-1650 are implemented by the echo reference generator 710.
In this example, the method 1600 begins at block 1601, followed by selecting an initial local audio device and an initial non-local ("remote") audio device in block 1605. In block 1610, a determination is made as to whether all local audio devices have been processed. If so, the process stops (block 1615). If, however, it is determined in block 1610 that the current local audio device has not been processed, then the process continues to block 1620.
According to this example, block 1620 involves determining whether each remote device has been evaluated for the current local audio device. If not, the process continues to block 1655, where it is determined whether the echo reference characteristics (e.g., LUPA values) of the audio stream of the current remote device exceed a threshold. In some examples, the threshold may be a long-term function of audio device configuration (such as audio device layout and audio device capabilities), audio environment characteristics, and playback content characteristics. According to some such examples, the threshold may be approximated as a long-term average of echo reference characteristics for the current audio device configuration, audio environment, and content type. In this context, "long term" may mean hours or days. In some examples, playback may not be continuous during a "long-term" time interval. Thus, this example involves selecting a subset of the remote devices based on the echo reference characteristics 733 (e.g., LUPA scores) output by the metadata-based metrics calculation module 705. The current playback frames of the selected remote devices form the PCM matrix P of the local device currently being evaluated. Thus, if it is determined in block 1655 that the echo reference characteristics of the current remote device's audio stream exceed the threshold, then the remote device's audio frame is added to a column of the PCM matrix P in block 1660. In this example, the next remote device (if any) is selected in block 1662, and the process then continues to block 1620.
After determining in block 1620 that all remote devices have been evaluated for the current local audio device, the process continues to blocks implemented by the echo reference generator 710. In this example, the process continues to block 1625, which involves obtaining the PCM matrix P (e.g., from memory). According to this example, dimension reduction is performed to reduce any feature redundancy. For example, dimension reduction may be achieved by methods such as Principal Component Analysis (PCA). Other examples may implement other dimension reduction methods. In the PCA example shown in fig. 16, the PCM matrix is mean-centered in block 1630 as follows:
P_c = P - mean(P)
In this example, the covariance matrix C is calculated in block 1635 as
C = (P_c^T P_c) / (n - 1)
In the above equation, n represents the number of rows of the PCM matrix. According to this example, block 1640 involves performing an eigendecomposition to determine an eigenvalue matrix D and an eigenvector matrix V such that
C = V D V^(-1)
In this example, only eigenvalues greater than the threshold T are retained, and redundant components are discarded. An example implementation of such a threshold may be constructed using an energy-based approximation. Given that D is a diagonal matrix whose values decrease along the main diagonal, we can define D_T by retaining the most significant eigenvalues that together contain a given percentage (90% in this example) of the signal energy.
In other examples, different percentages (e.g., 75%, 80%, 85%, 95%, etc.) may be used. Thus, in this example, block 1645 involves determining a truncated eigenvalue matrix D_T and a truncated eigenvector matrix V_T. The truncated eigenvalue matrix D_T is an example of the weight matrix in equation 0, and the corresponding eigenvectors collected in the truncated matrix V_T are an example of the input matrix X in equation 0. Thus, in this example, block 1650 involves multiplying D_T by V_T to determine an echo reference.
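By way of illustration only, the following Python sketch performs the PCA-based steps of blocks 1630 through 1650 and, as one plausible realization, forms the low-dimensional echo reference streams by projecting the mean-centered PCM frames onto the retained eigenvectors. The exact way the truncated matrices are combined into echo references, as well as the array shapes and the 90% energy fraction, are assumptions for this example and may differ between implementations.

    import numpy as np

    def subspace_echo_references(P, energy_fraction=0.9):
        """PCA-based low-dimensional approximation of the non-local references.

        `P` has shape (n_samples, n_remote_devices), one column per selected
        remote-device PCM stream. Returns the truncated eigenvalues D_T, the
        truncated eigenvectors V_T, and projected reference streams."""
        P_c = P - P.mean(axis=0)                       # block 1630: zero-mean columns
        n = P_c.shape[0]
        C = (P_c.T @ P_c) / (n - 1)                    # block 1635: covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)           # block 1640: eigendecomposition
        order = np.argsort(eigvals)[::-1]              # sort eigenvalues, descending
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        cumulative = np.cumsum(eigvals) / eigvals.sum()
        keep = int(np.searchsorted(cumulative, energy_fraction)) + 1   # block 1645
        D_T, V_T = np.diag(eigvals[:keep]), eigvecs[:, :keep]
        references = P_c @ V_T                         # block 1650 (one realization)
        return D_T, V_T, references

    rng = np.random.default_rng(7)
    frames = rng.standard_normal((1024, 6))            # 6 remote devices, 1024 samples
    D_T, V_T, refs = subspace_echo_references(frames)
    print(D_T.shape, V_T.shape, refs.shape)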
In the example shown in fig. 16, after performing block 1650, the process continues to block 1652 where, if there are any more local audio devices to process, another local audio device is selected for processing. In this example, block 1652 involves incrementing the local audio device number to be processed. For example, if local audio device 1 has just been processed, the audio device number may be incremented to audio device 2. The process returns to block 1610, where a determination is made as to whether all local audio devices have been processed. After all local audio devices have been processed, the process ends (block 1615).
FIG. 17 is a flow chart summarizing another example of the disclosed methods. As with other methods described herein, the blocks of method 1700 need not be performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described. In some examples, two or more blocks may be performed simultaneously. In this example, method 1700 is an audio processing method.
The method 1700 may be performed by an apparatus or system such as the apparatus 50 shown in fig. 1A and described above. Method 1700 may be performed, for example, by control system 60 of fig. 1A. In some examples, the blocks of method 1700 may be performed by one or more devices within an audio environment, for example, by an audio system controller (e.g., a device referred to herein as a smart home hub) or by another component of an audio system, such as a smart speaker, a television control module, a laptop computer, a mobile device (e.g., a cellular telephone), etc. However, in alternative embodiments, at least some of the blocks of method 1700 may be performed by a device (e.g., a server) implementing a cloud-based service. In some implementations, the audio environment can include one or more rooms of a home environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth.
In this embodiment, block 1705 relates to receiving, by a control system, location information for each of a plurality of audio devices in an audio environment. In some examples, the location information may be included in metadata 312 disclosed herein, which may include information corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, and the like. According to some examples, block 1705 may involve receiving, by a renderer, location information such as renderer 201 described herein (see, e.g., fig. 7 and 8).
In this example, block 1710 relates to generating, by a control system and based at least in part on the location information, rendering information for a plurality of audio devices in an audio environment. In some examples, the rendering information may be or may include a loudspeaker activation matrix. According to some examples, the method 1700 may involve rendering audio data based at least in part on rendering information to produce rendered audio data. In some such examples, the control system may be an orchestration device control system. In some such implementations, the method 1700 may involve providing at least a portion of the rendered audio data to each of a plurality of audio devices in an audio environment.
In this implementation, block 1715 involves determining, by the control system and based at least in part on the rendering information, a plurality of echo reference indicators. In this example, each of the plurality of echo reference indicators corresponds to audio data reproduced by one or more of the plurality of audio devices. In some such examples, the control system may be an orchestration device control system. In some such implementations, the method 1700 may involve providing at least one echo reference indicator to each of a plurality of audio devices.
In some examples, the method 1700 may involve receiving, by a control system, a content stream including audio data and corresponding metadata. In some such examples, determining the at least one echo reference indicator may be based at least in part on loudspeaker metadata, metadata corresponding to the received audio data, and/or an upmix matrix.
According to some examples, block 1715 may be performed, at least in part, by the metadata-based indicator calculation module 705 of figs. 7 and 8. In some such examples, the at least one echo reference indicator may correspond to the echo reference characteristics 733 output by the metadata-based indicator calculation module 705. In some examples, the at least one echo reference indicator may correspond to a level of a corresponding echo reference, a uniqueness of a corresponding echo reference, a time duration of a corresponding echo reference, or an audibility of a corresponding echo reference.
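The following sketch, offered only as an example, computes two of the indicator types mentioned above (a level indicator and a uniqueness indicator) from an activation matrix and the audio to be rendered. The dB level measure, the correlation-based uniqueness measure, and the dictionary output format are assumptions; actual implementations may compute different or additional indicators, such as time duration or audibility.

```python
import numpy as np

def echo_reference_indicators(activation, object_signals):
    """Estimate per-device echo reference indicators from rendering info.

    activation:     (num_speakers, num_objects) loudspeaker activation matrix.
    object_signals: (num_objects, num_frames) audio to be rendered.
    Returns a list of dicts with a level indicator (dB) and a uniqueness
    indicator (1 minus the largest correlation with any other device's feed).
    """
    feeds = activation @ object_signals          # (num_speakers, num_frames)
    rms = np.sqrt(np.mean(feeds ** 2, axis=1) + 1e-12)
    level_db = 20.0 * np.log10(rms)

    # Correlation between device feeds: highly correlated feeds are less
    # unique, so one reference could stand in for several devices.
    # (Assumes no feed is entirely silent, which would leave the
    # correlation undefined.)
    corr = np.corrcoef(feeds)
    np.fill_diagonal(corr, 0.0)
    uniqueness = 1.0 - np.max(np.abs(corr), axis=1)

    return [{"device": i, "level_db": float(level_db[i]),
             "uniqueness": float(uniqueness[i])}
            for i in range(feeds.shape[0])]
```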
In some examples, the method 1700 may involve performing, by the control system and based at least in part on the echo reference indicator, an importance estimation for each of the plurality of echo references. In some such embodiments, the control system may be an audio device control system. According to some embodiments, the echo reference importance estimator 401 may make an importance estimate. According to some examples, making the importance estimate may involve determining an expected contribution of each echo reference to echo mitigation by an echo management system of an audio device of the audio environment. The echo management system may include an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES. The echo management system may be or may include an instance of the MC-EMS203 disclosed herein.
According to some examples, making the importance estimate may involve determining an importance index for the corresponding echo reference. In some examples, determining the importance index may be based at least in part on one or more of a current listening goal or a current ambient noise estimate.
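Purely as an illustration of how an importance index might combine such indicators with a current listening goal and an ambient noise estimate, consider the following sketch. The specific weights, the "wakeword" versus "playback" goals, and the linear combination are invented for this example and are not part of any described embodiment.

```python
def importance_index(indicator, listening_goal="playback", noise_floor_db=-60.0):
    """Combine echo reference indicators into a single importance index.

    indicator: dict with "level_db" and "uniqueness" keys, such as those
    produced by the indicator sketch above.
    listening_goal: "wakeword" emphasizes echo mitigation more strongly
    than ordinary "playback", so levels are weighted more heavily.
    noise_floor_db: current ambient noise estimate; references well below
    the noise floor contribute little audible echo and score lower.
    """
    level_weight = 1.5 if listening_goal == "wakeword" else 1.0

    # Audibility margin of the reference above the ambient noise estimate.
    margin_db = max(indicator["level_db"] - noise_floor_db, 0.0)

    # Simple weighted combination; larger values mean the reference is
    # expected to contribute more to echo mitigation if provided to the EMS.
    return level_weight * margin_db * (0.5 + 0.5 * indicator["uniqueness"])
```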
Some such examples may involve selecting, by the control system and based at least in part on the importance estimate, one or more selected echo references. According to some examples, the echo reference may be selected by an instance of the echo reference selector 402 disclosed herein. Some examples may involve providing, by the control system, one or more selected echo references to the at least one echo management system.
Some examples may involve cost determination by a control system. In some such examples, the cost estimation module 403 may be configured to make a cost determination. The cost determination may, for example, involve determining a cost of at least one of the plurality of echo references. In some such examples, selecting one or more selected echo references may be based at least in part on a cost determination. According to some examples, the cost determination may be based on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, and/or echo management system calculation requirements for using the at least one echo reference by the at least one echo management system.
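One simple, hypothetical way to trade the importance estimates off against the cost determination is a greedy selection under a cost budget, sketched below. The budget, the importance-per-cost ranking, and the field names are assumptions made for illustration; the echo reference selector 402 is not limited to any such strategy.

```python
def select_echo_references(references, budget):
    """Pick echo references whose total cost fits within a budget.

    references: list of dicts with "id", "importance", and "cost" keys,
    where cost might reflect network bandwidth or compute needed to use
    the reference.
    budget: maximum total cost the echo management system can afford.
    Returns the ids of the selected references.
    """
    # Greedy: take references in order of importance per unit cost.
    ranked = sorted(references,
                    key=lambda r: r["importance"] / max(r["cost"], 1e-9),
                    reverse=True)
    selected, spent = [], 0.0
    for ref in ranked:
        if spent + ref["cost"] <= budget:
            selected.append(ref["id"])
            spent += ref["cost"]
    return selected
```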
Some embodiments may involve determining, by the control system, a current echo management system performance level. In some such examples, the MC-EMS performance model 405 may be configured to determine a current echo management system performance level. According to some examples, the importance estimate may be based at least in part on a current echo management system performance level.
Some examples may involve receiving, by a control system, scene change metadata. In some examples, the importance estimate may be based at least in part on scene change metadata. In some implementations, the scene change analyzer 755 can receive scene change metadata and can generate one or more scene change messages 715. In some such examples, the importance estimate may be based at least in part on one or more scene change messages 715.
In some examples, the method 1700 may involve generating, by the control system, at least one echo reference. In some examples, at least one echo reference may be generated by echo reference generator 710. According to some examples, the echo reference generator 710 may generate at least one echo reference based at least in part on one or more components of the metadata 312, such as a loudspeaker activation matrix (e.g., rendering matrix 722). In some examples, the method 1700 may involve generating, by the control system, at least one virtual echo reference. The virtual echo reference may, for example, correspond to two or more of the plurality of audio devices.
In some examples, method 1700 may involve generating (e.g., by echo reference generator 710) one or more subspace-based non-local device echo references. In some examples, the subspace-based non-local device echo references may comprise a low-frequency non-local device echo reference. Some such examples may involve determining, by the control system, a weighted sum of echo references within a certain low frequency range. Some such examples may involve providing the weighted sum to an echo management system. Some embodiments may involve causing an echo management system to cancel or suppress echo based at least in part on one or more selected echo references.
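As an illustrative sketch only, the following shows how a single low-frequency echo reference could be formed as a weighted sum of several devices' echo references, which is also one way a combined or virtual reference spanning multiple devices might be produced. The Butterworth low-pass filter, the 200 Hz cutoff, and the SciPy signal-processing calls are assumptions made for this example.

```python
import numpy as np
from scipy.signal import butter, lfilter

def low_frequency_echo_reference(feeds, weights, sample_rate, cutoff_hz=200.0):
    """Combine several echo references into one low-frequency reference.

    feeds:   (num_devices, num_frames) speaker-feed echo references.
    weights: (num_devices,) weights, e.g., proportional to each device's
             low-frequency output level.
    Returns a single (num_frames,) reference that an echo management system
    could use in place of the individual low-frequency references.
    """
    # Weighted sum across devices; low frequencies from different devices
    # overlap strongly at a microphone, so one combined reference may suffice.
    combined = np.asarray(weights, dtype=float) @ np.asarray(feeds, dtype=float)

    # Restrict the combined reference to the low-frequency range of interest.
    b, a = butter(4, cutoff_hz / (sample_rate / 2.0), btype="low")
    return lfilter(b, a, combined)
```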
Fig. 18 shows an example of a plan view of an audio environment, which in this example is a living space. As with the other figures provided herein, the types and numbers of elements shown in fig. 18 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements.
According to this example, the environment 1800 includes a living room 1810 at the upper left, a kitchen 1815 at the lower center, and a bedroom 1822 at the lower right. The boxes and circles distributed across the living space represent a set of loudspeakers 1805a-1805h, at least some of which may be intelligent loudspeakers in some embodiments, placed in locations convenient to the space but not following any standard prescribed layout (arbitrarily placed). In some examples, the television 1830 may be configured to at least partially implement one or more of the disclosed embodiments. In this example, the environment 1800 includes cameras 1811a-1811e distributed throughout the environment. In some implementations, one or more intelligent audio devices in the environment 1800 may also include one or more cameras. The one or more intelligent audio devices may be single-purpose audio devices or virtual assistants. In some such examples, one or more cameras of the optional sensor system 130 may reside in or on the television 1830, in a mobile phone, or in a smart speaker (e.g., one or more of the loudspeakers 1805b, 1805d, 1805e, or 1805h). Although the cameras 1811a-1811e are not shown in every depiction of an audio environment presented in this disclosure, in some implementations each audio environment may nevertheless include one or more cameras.
Various features and aspects will be understood from the following Enumerated Example Embodiments (EEEs):
EEE1. An audio processing method comprising:
obtaining, by a control system, a plurality of echo references, the plurality of echo references comprising at least one echo reference for each of a plurality of audio devices in an audio environment, each echo reference corresponding to audio data played back by one or more loudspeakers of one of the plurality of audio devices;
performing, by the control system, an importance estimate for each echo reference of the plurality of echo references, wherein performing the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment, the at least one echo management system comprising an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES;
selecting, by the control system and based at least in part on the importance estimates, one or more selected echo references; and
providing, by the control system, the one or more selected echo references to the at least one echo management system.
EEE2. The audio processing method of EEE1, further comprising causing at least one echo management system to cancel or suppress echo based at least in part on the one or more selected echo references.
EEE3. The audio processing method of either EEE1 or EEE2, wherein obtaining the plurality of echo references involves:
receiving a content stream comprising audio data; and
one or more echo references of the plurality of echo references are determined based on the audio data.
EEE4. The audio processing method of EEE3 wherein the control system comprises an audio device control system of an audio device of the plurality of audio devices in the audio environment, the audio processing method further comprising:
rendering, by the audio device control system, the audio data for reproduction on the audio device to produce a local speaker feed; and
determining a local echo reference corresponding to the local speaker feed.
EEE5. The audio processing method of EEE4 wherein obtaining the plurality of echo references involves determining one or more non-local echo references based on the audio data, each of the non-local echo references corresponding to a non-local speaker feed for playback on another audio device of the audio environment.
EEE6. The audio processing method of EEE4 wherein obtaining the plurality of echo references involves receiving one or more non-local echo references, each of the non-local echo references corresponding to a non-local speaker feed for playback on another audio device of the audio environment.
EEE7. The audio processing method of EEE6 wherein receiving the one or more non-local echo references involves receiving the one or more non-local echo references from one or more other audio devices of the audio environment.
EEE8. The audio processing method of EEE6 wherein receiving the one or more non-local echo references involves receiving each of the one or more non-local echo references from a single other device of the audio environment.
EEE9. The audio processing method of any of EEEs 1-8, further comprising making a cost determination involving determining a cost of at least one of the plurality of echo references, wherein selecting the one or more selected echo references is based at least in part on the cost determination.
EEE10. The audio processing method of EEE9 wherein the cost determination is based on network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, echo management system calculation requirements for using the at least one echo reference by the echo management system, or a combination thereof.
EEE11. The audio processing method of EEE9 or EEE10 wherein the cost determination is based on a replica of the at least one echo reference in the time or frequency domain, a downsampled version of the at least one echo reference, a lossy compression of the at least one echo reference, segment power information of the at least one echo reference, or a combination thereof.
EEE12. The audio processing method of any of EEEs 9-11, wherein the cost determination is based on a method that compresses a relatively more important echo reference less than a relatively less important echo reference.
EEE13 the audio processing method of any of EEEs 1-12, further comprising determining a current echo management system performance level, wherein selecting the one or more selected echo references is based at least in part on the current echo management system performance level.
EEE14. The audio processing method according to any one of the EEEs 1-13, wherein making the importance estimation involves determining an importance index for the corresponding echo reference.
EEE15. The audio processing method of EEE14 wherein determining the importance index involves determining a level of the corresponding echo reference, determining a uniqueness of the corresponding echo reference, determining a time duration of the corresponding echo reference, determining an audibility of the corresponding echo reference, or a combination thereof.
EEE16. The audio processing method of EEE14 or EEE15, wherein determining the importance index is based at least in part on metadata corresponding to an audio device layout, loudspeaker metadata, metadata corresponding to received audio data, an upmix matrix, a loudspeaker activation matrix, or a combination thereof.
EEE17 the audio processing method of any of EEEs 14-16, wherein determining the importance index is based at least in part on a current listening objective, a current ambient noise estimate, an estimate of a current performance of the at least one echo management system, or a combination thereof.
EEE18. An apparatus configured to perform the audio processing method of any of EEEs 1-17.
EEE19. A system configured to perform the audio processing method of any of EEEs 1-17.
EEE20. One or more non-transitory media having software stored thereon that includes instructions for controlling one or more devices to perform the audio processing method of any of EEEs 1-17.

Aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer-readable medium (e.g., disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) Digital Signal Processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform the required processing on the audio signal(s), including the execution of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general-purpose processor (e.g., a Personal Computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed and/or otherwise configured with software or firmware to perform any of a variety of operations, including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more microphones and/or one or more loudspeakers). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or keyboard), memory, and a display device.
Another aspect of the disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code (e.g., an encoder executable to perform one or more examples of the disclosed methods or steps thereof) for performing one or more examples of the disclosed methods or steps thereof.
While specific embodiments of, and applications for, the present disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many more modifications than mentioned herein are possible without departing from the scope of the disclosure described and claimed herein. It is to be understood that while certain forms of the disclosure have been illustrated and described, the disclosure is not to be limited to the specific embodiments described and illustrated or to the specific methods described.

Claims (20)

1. An audio processing method, comprising:
receiving, by a control system, location information for each of a plurality of audio devices in an audio environment;
generating, by the control system and based at least in part on the location information, rendering information for a plurality of audio devices in an audio environment; and
determining, by the control system and based at least in part on the rendering information, a plurality of echo reference indicators, each echo reference indicator of the plurality of echo reference indicators corresponding to audio data reproduced by one or more of the plurality of audio devices.
2. The audio processing method of claim 1, wherein the rendering information comprises a loudspeaker activation matrix.
3. The audio processing method of claim 1 or claim 2, wherein the at least one echo reference indicator corresponds to one or more of: the level of the corresponding echo reference, the uniqueness of the corresponding echo reference, the time duration of the corresponding echo reference, or the audibility of the corresponding echo reference.
4. The audio processing method of any one of claims 1 to 3, further comprising receiving, by the control system, a content stream comprising audio data and corresponding metadata, wherein determining the at least one echo reference indicator is based at least in part on one or more of loudspeaker metadata, metadata corresponding to the received audio data, or an upmix matrix.
5. The audio processing method according to any one of claims 1 to 4, wherein the control system includes an audio device control system, the audio processing method further comprising:
performing, by the control system and based at least in part on the echo reference indicator, an importance estimate for each echo reference of a plurality of echo references, wherein performing the importance estimate involves determining an expected contribution of each echo reference to echo mitigation by at least one echo management system of at least one audio device of the audio environment, the at least one echo management system comprising an Acoustic Echo Canceller (AEC), an Acoustic Echo Suppressor (AES), or both AEC and AES;
selecting, by the control system and based at least in part on the importance estimates, one or more selected echo references; and
providing, by the control system, the one or more selected echo references to the at least one echo management system.
6. The audio processing method of claim 5, further comprising causing the at least one echo management system to cancel or suppress echo based at least in part on the one or more selected echo references.
7. The audio processing method of claim 5 or claim 6, wherein performing the importance estimation involves determining an importance index for a corresponding echo reference.
8. The audio processing method of claim 7, wherein determining the importance index is based at least in part on one or more of a current listening objective or a current ambient noise estimate.
9. The audio processing method of claim 5, further comprising making, by the control system, a cost determination involving determining a cost of at least one of the plurality of echo references, wherein selecting the one or more selected echo references is based at least in part on the cost determination.
10. The audio processing method of claim 9, wherein the cost determination is based on one or more of: network bandwidth required for transmitting the at least one echo reference, coding calculation requirements for coding the at least one echo reference, decoding calculation requirements for decoding the at least one echo reference, or echo management system calculation requirements for using the at least one echo reference by the at least one echo management system.
11. The audio processing method of any of claims 5 to 10, further comprising determining a current echo management system performance level, wherein the importance estimate is based at least in part on the current echo management system performance level.
12. The audio processing method of any of claims 5 to 11, further comprising receiving, by the control system, scene change metadata, wherein the importance estimate is based at least in part on the scene change metadata.
13. The audio processing method of claim 4, further comprising rendering the audio data based at least in part on the rendering information to produce rendered audio data.
14. The audio processing method of claim 13, wherein the control system comprises an orchestration device control system, the audio processing method further comprising providing at least a portion of the rendered audio data to each of the plurality of audio devices.
15. The audio processing method of any of claims 1 to 4, wherein the control system comprises an orchestration device control system, the audio processing method further comprising providing at least one echo reference indicator to each of the plurality of audio devices.
16. The audio processing method of any of claims 1 to 15, further comprising generating, by the control system, at least one virtual echo reference corresponding to two or more of the plurality of audio devices.
17. The audio processing method of any one of claims 1 to 16, further comprising:
determining, by the control system, a weighted sum of echo references in a low frequency range; and
providing the weighted sum to at least one echo management system.
18. An apparatus configured to perform the method of any one of claims 1 to 17.
19. A system configured to perform the method of any one of claims 1 to 17.
20. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1-17.
CN202280013949.8A 2021-02-09 2022-02-07 Echo reference generation and echo reference index estimation based on rendering information Pending CN116830560A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US63/147,573 2021-02-09
US202163201939P 2021-05-19 2021-05-19
US63/201,939 2021-05-19
EP21177382.5 2021-06-02
PCT/US2022/015436 WO2022173684A1 (en) 2021-02-09 2022-02-07 Echo reference generation and echo reference metric estimation according to rendering information

Publications (1)

Publication Number Publication Date
CN116830560A true CN116830560A (en) 2023-09-29

Family

ID=88114965

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202280013990.5A Pending CN116830561A (en) 2021-02-09 2022-02-07 Echo reference prioritization and selection
CN202280013949.8A Pending CN116830560A (en) 2021-02-09 2022-02-07 Echo reference generation and echo reference index estimation based on rendering information

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202280013990.5A Pending CN116830561A (en) 2021-02-09 2022-02-07 Echo reference prioritization and selection

Country Status (1)

Country Link
CN (2) CN116830561A (en)

Also Published As

Publication number Publication date
CN116830561A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
US10607629B2 (en) Methods and apparatus for decoding based on speech enhancement metadata
US10224046B2 (en) Spatial comfort noise
EP2936485B1 (en) Object clustering for rendering object-based audio content based on perceptual criteria
US20240267469A1 (en) Coordination of audio devices
US11817114B2 (en) Content and environmentally aware environmental noise compensation
EP3818730A1 (en) Energy-ratio signalling and synthesis
US20240296822A1 (en) Echo reference generation and echo reference metric estimation according to rendering information
WO2017043309A1 (en) Speech processing device and method, encoding device, and program
CN116830560A (en) Echo reference generation and echo reference index estimation based on rendering information
US20230421952A1 (en) Subband domain acoustic echo canceller based acoustic state estimator
WO2023086273A1 (en) Distributed audio device ducking
RU2818982C2 (en) Acoustic echo cancellation control for distributed audio devices
CN118235435A (en) Distributed audio device evasion
CN116783900A (en) Acoustic state estimator based on subband-domain acoustic echo canceller

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination