JPWO2022023417A5

Info

Publication number: JPWO2022023417A5
Application number: JP2023506248A
Authority: JP
Publication date: 2024-12-16

Description

The present invention relates to headphone equalization and room adaptation for binaural reproduction in augmented reality (AR).

Selective hearing (SH) refers to the ability of a listener to direct attention to a particular sound source, or to several sound sources, in an auditory scene. Conversely, this means that the listener's focus on sources of no interest is reduced.

Human listeners are therefore able to communicate even in loud environments. This usually exploits several cues: when listening with two ears, there are direction-dependent time and level differences as well as a direction-dependent spectral coloration of the sound. The latter allows the auditory system to determine the direction of a sound source and to separate different sound sources even when listening with only one ear.

Time and level differences alone are not sufficient to determine the exact position of a sound source, because all positions with the same time and level differences lie on a hyperboloid. The resulting localization ambiguity is called the cone of confusion. In a room, each sound source is reflected by the boundary surfaces. Each of the resulting so-called mirror sources lies on a further hyperboloid. Human hearing combines the information about the direct sound and the associated reflections into a single auditory event and thereby resolves the ambiguity of the cone of confusion. At the same time, the reflections that belong to a sound source increase its perceived loudness.

In addition, for natural sound sources, and for speech in particular, signal components at different frequencies are temporally coupled. Binaural hearing uses all of these cues together. Moreover, loud interference sources that can be localized well can be actively ignored, so to speak.

In the literature, the concept of selective hearing is related to other terms such as assisted listening [1] and virtual and amplified auditory environments [2]. Assisted listening is the broader term and includes virtual, amplified and SH applications.

According to the prior art, classical hearing devices operate mainly in a monaural manner, i.e. the signal processing for the left and right ears is completely independent with respect to frequency response and dynamic compression. As a result, the time, level and frequency differences between the ear signals are lost.

Modern, so-called binaural hearing devices couple the correction factors of the two hearing devices. They often have several microphones, but usually only the microphone with the "most speech-like" signal is selected, and no explicit beamforming is computed. In complex hearing situations, desired and undesired sound signals are amplified in the same way, so focusing on the desired sound components is not supported.

In the field of hands-free devices, for example for telephony, several microphones are already used today. So-called beams are calculated from the individual microphone signals: sound coming from the direction of a beam is amplified, and sound from other directions is attenuated. Today's methods learn constant background sounds (e.g. engine and wind noise in a car), learn loud, well-localizable disturbances via further beams, and subtract these from the useful signal (e.g. generalized sidelobe canceller). Telephone systems sometimes use detectors that detect the static characteristics of speech and suppress everything that is not structured like speech. In hands-free devices, only a mono signal is ultimately transmitted, so the spatial information is lost on the transmission path. That information would be valuable for capturing the situation and, in particular when several talkers are in a joint call, for providing the illusion that "one was there". Suppressing non-speech signals also discards important information about the acoustic environment of the other party, which can hinder communication.
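The beam calculation mentioned above can be illustrated with a delay-and-sum beamformer, the simplest member of this family. The following is a minimal sketch, not the method of any particular hands-free device: the function name `delay_and_sum`, the two-microphone test signals and the restriction to non-negative integer sample delays are all assumptions made for illustration.

```python
def delay_and_sum(mic_signals, delays):
    """Align each microphone signal by its arrival delay and average.

    mic_signals: list of equal-length sample lists, one per microphone.
    delays: non-negative integer arrival delay (in samples) of the look
            direction at each microphone (assumed known here).
    """
    n = len(mic_signals[0])
    out = [0.0] * n
    for sig, d in zip(mic_signals, delays):
        for i in range(n - d):
            out[i] += sig[i + d]  # advance the signal by d samples
    return [v / len(mic_signals) for v in out]


# A pulse from the look direction reaches mic 0 first and mic 1 one sample later.
target = [[0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
# A pulse from another direction arrives in the opposite order.
other = [[0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0]]
mics = [[t + o for t, o in zip(target[m], other[m])] for m in range(2)]

beam = delay_and_sum(mics, delays=[0, 1])
print(beam)
```

With these inputs the pulse from the look direction adds coherently and keeps its full amplitude, while the pulse from the other direction is smeared over two samples at half amplitude. An adaptive structure such as the generalized sidelobe canceller named in the text goes further by estimating the interference and subtracting it.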

By nature, humans can consciously concentrate on individual sound sources in their surroundings ("selective listening"). An automatic system for selective hearing based on artificial intelligence (AI) must first learn the underlying concepts. Automatic decomposition of an acoustic scene (scene decomposition) first requires the detection and classification of all active sound sources, followed by their separation, so that they can be further processed, amplified or attenuated as separate audio objects.

The research field of auditory scene analysis attempts to detect and classify, on the basis of recorded audio signals, temporally localized sound events such as footsteps, clapping or shouting, as well as more global acoustic scenes such as a concert, a restaurant or a supermarket. Current methods here exclusively use techniques from the fields of artificial intelligence (AI) and deep learning. This involves the data-driven training of deep neural networks, which learn from large amounts of training data to detect characteristic patterns in audio signals [70]. Inspired above all by advances in the research fields of image processing (computer vision) and speech processing (natural language processing), a combination of convolutional neural networks for two-dimensional pattern detection in spectrogram representations and recurrent layers (recurrent neural networks) for the temporal modeling of sounds is generally used.

Audio analysis poses a series of specific challenges. Because of their complexity, deep learning models are very data-hungry. In contrast to the research fields of image processing and speech processing, the datasets available for audio processing are comparatively small. The largest dataset is the AudioSet dataset from Google [83], with about 2 million sound examples and 632 different sound event classes; most datasets used in research are significantly smaller. This small amount of training data can be addressed, for example, with transfer learning, in which a model pre-trained on a large dataset is subsequently fine-tuned on a smaller dataset with the new classes defined for the use case [77]. In addition, methods from semi-supervised learning are used so that unannotated audio data, which is generally available in large quantities, can also be included in training.

A further important difference from image processing is that simultaneously audible acoustic events do not mask one another (as objects in images do) but overlap in a complex, phase-dependent manner. Current deep learning algorithms use so-called "attention" mechanisms, which, for example, allow a model to focus on specific time segments or frequency ranges for classification [23]. The detection of sound events is further complicated by the high variance of their duration. Algorithms should robustly detect very short events, such as a gunshot, as well as long events, such as a passing train.

Because the models depend strongly on the acoustic conditions under which the training data was recorded, they often behave unexpectedly in new acoustic environments that differ, for example, in room reverberation or microphone placement. Various approaches have been developed to mitigate this problem. Data augmentation methods, for example, try to achieve greater robustness and invariance of the models by simulating different acoustic conditions [68] and by artificially overlapping different sound sources. Furthermore, the parameters of complex neural networks can be regularized in different ways so that overtraining and specialization on the training data are avoided and better generalization to unseen data is achieved. In recent years, various algorithms have been proposed for "domain adaptation" [67] in order to adapt previously trained models to new application conditions. In the in-headphone usage scenario planned in this project, the real-time capability of the sound source detection algorithms is of fundamental importance. A trade-off must necessarily be made between the complexity of the neural network and the maximum possible number of arithmetic operations on the underlying computing platform. Even if a sound event has a long duration, it must be detected as quickly as possible so that the corresponding sound source separation can start.

At Fraunhofer IDMT, a large amount of research has been carried out in the field of automatic sound source detection in recent years. In the research project "StadtLärm", a distributed sensor network was developed that can measure noise levels and classify between 14 different acoustic scene and event classes on the basis of audio signals recorded at different locations within a city [69]. The processing in the sensors runs in real time on the embedded platform Raspberry Pi 3. Earlier work examined novel approaches to the data compression of spectrograms based on autoencoder networks [71]. More recently, the use of deep learning methods in the field of music signal processing (music information retrieval) has brought great progress in applications such as music transcription [76], [77], chord detection [78] and instrument detection [79]. In the field of industrial audio processing, new datasets have been established, and deep learning methods are being used, for example, to monitor the acoustic condition of electric motors [75].

The scenario addressed in this embodiment assumes several sound sources whose number and type are initially unknown and may change constantly. For sound source separation, several sources with similar characteristics, such as several talkers, are a particularly great challenge [80].

To achieve high spatial resolution, several microphones must be used in the form of an array [72]. In contrast to conventional audio recordings in mono (one channel) or stereo (two channels), such a recording scenario allows precise localization of the sound sources around the listener.

Sound source separation algorithms usually leave behind artifacts such as distortion and crosstalk between the sources [5], which listeners generally perceive as disturbing. Remixing the tracks can partially mask such artifacts and thereby reduce them [10].

To improve on "blind" source separation, additional information such as the detected number and type of the sound sources or their estimated spatial position is often used (informed source separation [74]). For meetings in which several talkers are active, current analysis systems can simultaneously estimate the number of talkers, determine their respective temporal activity, and subsequently isolate them by source separation [66].

At Fraunhofer IDMT, a great deal of research has been carried out in recent years on the perception-based evaluation of sound source separation algorithms [73].

In the field of music signal processing, a real-time capable algorithm has been developed for separating a solo instrument from the accompanying instruments, using a fundamental frequency estimate of the solo instrument as additional information [81]. An alternative approach for separating singing from complex pieces of music on the basis of deep learning methods is proposed in [82]. Specialized source separation algorithms have also been developed for applications in the context of industrial audio analysis [7].

Headphones significantly influence the acoustic perception of the surroundings. Depending on the construction of the headphones, sound arriving at the ear is attenuated to different degrees. In-ear headphones completely block the ear canal [85]. Closed headphones that surround the pinna also isolate the listener strongly from the external environment. Open and semi-open headphones allow sound to pass through completely or partially [84]. In many everyday applications, it is desirable for headphones to isolate unwanted ambient sound more strongly than their construction type alone allows.

In addition, the influence of external interference can be attenuated by active noise control (ANC). This is achieved by recording the incident sound signals with the microphones of the headphones and then reproducing them through the loudspeakers in such a way that these sound components and the sound components penetrating the headphones cancel each other out by interference. Overall, strong acoustic isolation from the surroundings can be achieved in this way. However, in many everyday situations this involves dangers, which is why it is desirable to be able to switch this function on intelligently and on demand.
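The cancellation by interference described above can be illustrated numerically. The following is a minimal sketch under strong simplifying assumptions: the anti-noise is an inverted copy of the noise with a single scalar `gain`, and propagation delay and the acoustic path between loudspeaker and ear are ignored. The function names `residual_after_anc` and `rms` are hypothetical.

```python
import math


def residual_after_anc(noise, gain=1.0):
    """Superimpose an inverted, scaled copy of the noise on the noise itself.

    gain=1.0 models ideal cancellation; gain<1.0 models an imperfect
    anti-noise estimate (an assumption made for illustration).
    """
    anti_noise = [-gain * x for x in noise]
    return [x + a for x, a in zip(noise, anti_noise)]


def rms(signal):
    """Root-mean-square level of a sample list."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


# 100 Hz tone sampled at 8 kHz, one full period (80 samples).
noise = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(80)]

perfect = residual_after_anc(noise, gain=1.0)
partial = residual_after_anc(noise, gain=0.8)

print(rms(noise), rms(perfect), rms(partial))
```

With a gain of 1.0 the residual vanishes, and with a gain of 0.8 about one fifth of the noise amplitude remains. A real ANC system has to estimate the anti-noise adaptively, because the sound that penetrates the headphones is filtered by the headphone shell.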

First products allow the microphone signals to be passed through to the headphones in order to reduce the passive isolation. Thus, besides prototypes [86], there are already commercial products that advertise a "transparent listening" function. For example, Sennheiser offers this function with the AMBEO headset [88], and Bragi with its product "Dash Pro". However, this capability is only the beginning. In the future, this function should be greatly extended so that, in addition to switching the ambient sound completely on and off, individual signal components (e.g. only speech or only alarm signals) can be made exclusively audible on demand. The French company Orosound allows the wearer of its "Tilde Earphones" headset to adjust the strength of the ANC with a slider [89]. In addition, the voice of a conversation partner can be passed through while ANC is active. However, this works only if the conversation partner is located face to face within a cone of 60°. A direction-independent adaptation is not possible.

US Patent Application Publication No. 2015195641 (see [91]) discloses a method for generating an auditory environment for a user. The method includes receiving a signal representing the user's ambient auditory environment and processing the signal using a microprocessor to identify at least one sound type of a plurality of sound types in the ambient auditory environment. The method further includes receiving user preferences for each of the plurality of sound types, modifying the signal for each sound type in the ambient auditory environment, and outputting the modified signal to at least one loudspeaker to generate the auditory environment for the user.

Headphone equalization and room adaptation (also called spatial adaptation or room compensation) for binaural reproduction in augmented reality (AR) are important problems.

In a typical scenario, a human listener wears acoustically (partially) transparent headphones and hears the surroundings through them. In addition, further sound sources are reproduced via the headphones and embedded into the real surroundings in such a way that the listener cannot distinguish between the real acoustic scene and the additional sound.

Usually, the direction in which the head is turned and the position of the listener in the room (or space) are determined by tracking with six degrees of freedom (6DoF). It is known from research that good results (i.e. externalization and correct localization) are achieved if the room acoustic properties of the recording room and the reproduction room match, or if the recording is adapted to the reproduction room.

An exemplary solution can be implemented as follows.

In a first step, the binaural room impulse responses (BRIRs) are measured without headphones, either individually for the listener or using an artificial head with probe microphones.

In a second step, the room characteristics of the recording room are analyzed on the basis of the measured BRIR.

In a third step, the headphone transfer function is measured, either individually for the listener or using an artificial head, with the probe microphones at the same position. From this measurement, the equalization function is determined.

Optionally, in a fourth step, the room characteristics of the reproduction room may be measured, the acoustic characteristics of the reproduction room analyzed, and the BRIR adapted to the reproduction room.

Then, in a further step, the source to be augmented is convolved with the correctly positioned and optionally adapted BRIR to obtain two raw channels. The raw channels are then convolved with the equalization function to obtain the headphone signals.

Finally, in a further step, the headphone signals are reproduced via the headphones.
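The two convolution steps of this rendering chain can be sketched as follows. This is a minimal illustration, not the implementation described in the publication: the function names `convolve` and `render`, the short example impulse responses and the identity equalization filter are assumptions chosen so that the result is easy to check by hand.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def render(source, brir_left, brir_right, eq_left, eq_right):
    """Source -> two raw channels (BRIR) -> two headphone signals (EQ)."""
    raw_left = convolve(source, brir_left)
    raw_right = convolve(source, brir_right)
    return convolve(raw_left, eq_left), convolve(raw_right, eq_right)


# Hypothetical toy data: a two-sample source and three-tap BRIRs.
source = [1.0, 0.5]
brir_left = [1.0, 0.0, 0.25]   # direct sound plus one reflection
brir_right = [0.0, 0.5, 0.0]   # delayed and attenuated at the far ear
identity_eq = [1.0]            # placeholder for the equalization function

left, right = render(source, brir_left, brir_right, identity_eq, identity_eq)
print(left)   # [1.0, 0.5, 0.25, 0.125]
print(right)  # [0.0, 0.5, 0.25, 0.0]
```

Real BRIRs contain thousands of taps, so a practical renderer would typically use FFT-based, partitioned convolution instead of the direct form shown here.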

However, there is the problem that wearing headphones changes the influence of the pinna on the BRIR, i.e. the BRIR with headphones differs from the BRIR without headphones. As a result, the natural sound sources are heard differently than they would be without headphones, whereas the virtual, augmented sound sources are reproduced as if no headphones were present.

US Patent Application Publication No. 2015195641

It would be desirable to provide a concept that allows a simple, fast and efficient determination of the room characteristics of the reproduction room.

Embodiments of the present invention are provided below.

Accordingly, claim 1 provides a system according to an embodiment of the invention, claim 19 provides a method, and claim 20 provides a computer program.

A system according to an embodiment of the invention includes an analyzer for determining a plurality of binaural room impulse responses, and a loudspeaker signal generator for generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on the source signal of at least one sound source. The analyzer is configured to determine the plurality of binaural room impulse responses such that each of them takes into account an effect resulting from the headphones being worn by a user.

Furthermore, a method according to an embodiment of the present invention is provided, the method comprising:
- determining a plurality of binaural room impulse responses; and
- generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on the source signal of at least one sound source.

The plurality of binaural room impulse responses is determined such that each of the binaural room impulse responses takes into account an effect resulting from the headphones being worn by a user.

Furthermore, a computer program according to an embodiment of the present invention is provided, having program code for carrying out the method described above.

Preferred embodiments of the present invention are described below with reference to the drawings.

A diagram showing a system according to one embodiment.
A diagram showing a further system for assisting selective hearing according to a further embodiment.
A diagram showing a further system for assisting selective hearing that additionally includes a user interface.
A diagram showing a system for assisting selective hearing that includes a hearing device with two corresponding loudspeakers.
A diagram showing a system for assisting selective hearing that includes a housing structure and two loudspeakers.
A diagram showing a system for assisting selective hearing that includes headphones with two loudspeakers.
A diagram showing a system according to one embodiment that includes a remote device 190 comprising a detector and a position determiner, an audio type classifier, a signal portion modifier and a signal generator.
A diagram showing a system according to one embodiment that includes five subsystems.
A diagram showing a corresponding scenario according to one embodiment.
A diagram showing a scenario according to one embodiment with four external sound sources.
A diagram showing a processing workflow of an SH application according to one embodiment.

FIG. 1 shows a system according to one embodiment.

The system includes an analyzer 152 for determining a plurality of binaural room impulse responses.

Furthermore, the system includes a loudspeaker signal generator 154 for generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on the source signal of at least one sound source.

The analyzer 152 is configured to determine the plurality of binaural room impulse responses such that each of them takes into account an effect resulting from the headphones being worn by a user.

In one embodiment, the system may, for example, include the headphones, and the headphones may, for example, be configured to output the at least two loudspeaker signals.

According to one embodiment, the headphones may, for example, include two headphone capsules and at least one microphone for measuring sound in each of the two headphone capsules, i.e. at least one such microphone may be arranged in each capsule. The analyzer 152 may then, for example, be configured to determine the plurality of binaural room impulse responses using the measurements of the at least one microphone in each of the two headphone capsules. Headphones intended for binaural reproduction always comprise at least two headphone capsules, and three or more capsules (for example for different frequency ranges) may also be provided.

In one embodiment, the at least one microphone in each of the two headphone capsules may, for example, be configured to generate, before the headphones reproduce the at least two loudspeaker signals, one or more recordings of the sound situation in the reproduction room (or space). From the one or more recordings, an estimate of the raw audio signal of at least one sound source is determined, and a binaural room impulse response of the plurality of binaural room impulse responses is determined for that sound source in the reproduction room.

According to one embodiment, the at least one microphone in each of the two headphone capsules may, for example, be configured to generate, while the headphones reproduce the at least two loudspeaker signals, one or more further recordings of the sound situation in the reproduction room. The augmented signal is subtracted from these further recordings, an estimate of the raw audio signal of one or more sound sources is determined, and a binaural room impulse response of the plurality of binaural room impulse responses is determined for the sound source in the reproduction room.
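The subtraction of the augmented signal from a recording made during playback can be sketched as follows. This is a deliberately simplified illustration: the function name `remove_playback` and the scalar `leakage` gain are assumptions, and the sketch treats the path from the headphone loudspeaker to the in-capsule microphone as a pure gain, whereas a real system would have to model it as a filter.

```python
def remove_playback(recording, playback, leakage=1.0):
    """Subtract the known playback (augmented) signal from a recording.

    leakage: assumed scalar gain of the loudspeaker-to-microphone path.
    """
    return [r - leakage * p for r, p in zip(recording, playback)]


# Hypothetical toy signals: ambient room sound plus the signal being played.
ambient = [0.5, -0.25, 0.125, 0.0]
playback = [1.0, 0.5, -0.5, 0.25]
recording = [a + p for a, p in zip(ambient, playback)]

estimate = remove_playback(recording, playback)
print(estimate)  # [0.5, -0.25, 0.125, 0.0]
```

The remaining signal approximates the room sound alone, which is the input needed to estimate the raw source signals and the binaural room impulse responses while augmentation is running.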

In one embodiment, the analyzer 152 may, for example, be configured to determine acoustic room characteristics of the reproduction room and to adapt the plurality of binaural room impulse responses depending on the acoustic room characteristics.

According to one embodiment, the at least one microphone may, for example, be arranged in each of the two headphone capsules so as to measure the sound near the entrance of the ear canal.

In one embodiment, the system may, for example, include one or more further microphones outside the two headphone capsules for measuring the sound situation in the reproduction room.

一実施形態によれば、例えば、ヘッドホンはヘッドホンバンドを含むことができ、例えば、１つまたは複数のさらなるマイクロホンのうちの少なくとも１つがヘッドホンバンド上に配置される。 According to one embodiment, for example, the headphones may include a headphone band, with at least one of the one or more further microphones being located on the headphone band.

一実施形態では、例えば、ラウドスピーカ信号発生器１５４は、複数のバイノーラル室内インパルス応答の各々が複数の１つまたは複数の音源信号の音源信号と畳み込まれることによって少なくとも２つのラウドスピーカ信号を生成するように構成されてもよい。 In one embodiment, for example, the loudspeaker signal generator 154 may be configured to generate at least two loudspeaker signals by convolving each of a plurality of binaural room impulse responses with a source signal of one or more of the source signals.
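The convolution described above can be sketched as follows. This is a minimal illustration, not the implementation of the loudspeaker signal generator 154: it assumes each source signal is a mono list of samples, each BRIR is a (left, right) pair of impulse responses, and the per-ear results of all sources are simply summed. The names `convolve` and `render_binaural` are hypothetical.

```python
from typing import List, Sequence, Tuple

Signal = Sequence[float]


def convolve(signal: Signal, impulse_response: Signal) -> List[float]:
    """Full linear convolution of two finite sequences (direct form)."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(
    sources: Sequence[Signal],
    brirs: Sequence[Tuple[Signal, Signal]],
) -> Tuple[List[float], List[float]]:
    """Convolve each source with its own BRIR pair and sum per ear.

    Assumption: sources[i] belongs to brirs[i]. The two output channels
    may differ in length if the left and right responses differ in length.
    """
    left: List[float] = []
    right: List[float] = []
    for source, (h_left, h_right) in zip(sources, brirs):
        for channel, h in ((left, h_left), (right, h_right)):
            wet = convolve(source, h)
            # Grow the output channel if this contribution is longer.
            channel.extend([0.0] * (len(wet) - len(channel)))
            for n, value in enumerate(wet):
                channel[n] += value
    return left, right
```

A real-time system would more likely use block-wise FFT convolution; the direct form is used here only because it is short and easy to check.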

一実施形態によれば、例えば、分析器１５２は、ヘッドホンの動きに応じて、複数のバイノーラル室内インパルス応答（またはいくつかまたはすべてのバイノーラル室内インパルス応答）のうちの少なくとも１つを決定するように構成されてもよい。 According to one embodiment, for example, the analyzer 152 may be configured to determine at least one of a plurality of binaural room impulse responses (or some or all of the binaural room impulse responses) in response to headphone movement.

実施形態では、システムは、ヘッドホンの動きを決定するためのセンサを含み得る。例えば、センサは、頭の回転を捕捉するように少なくとも３ＤｏＦ（３自由度）を備える加速度ピックアップなどのセンサであってもよい。例えば、６ＤｏＦのセンサ（６自由度センサ）を用いてもよい。 In an embodiment, the system may include a sensor for determining the movement of the headphones. For example, the sensor may be a sensor such as an accelerometer with at least 3DoF (three degrees of freedom) to capture head rotation. For example, a 6DoF sensor (six degrees of freedom sensor) may be used.
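How head movement can influence the choice of BRIR may be sketched as follows. The sketch assumes that only head yaw is tracked, that angles are given in degrees, and that BRIRs have been measured for a discrete set of azimuths; the function names and the nearest-neighbour selection are illustrative assumptions, not taken from the description.

```python
from typing import Iterable


def wrap_degrees(angle: float) -> float:
    """Wrap an angle to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_azimuth(source_azimuth: float, head_yaw: float) -> float:
    """Azimuth of the source relative to the current head orientation."""
    return wrap_degrees(source_azimuth - head_yaw)


def nearest_brir_angle(target: float, measured_angles: Iterable[float]) -> float:
    """Pick the measured BRIR azimuth closest to the target (assumed grid)."""
    return min(measured_angles, key=lambda a: abs(wrap_degrees(a - target)))
```

For example, a source at 170 degrees heard with the head turned to -170 degrees appears at a relative azimuth of -20 degrees, so the BRIR measured closest to -20 degrees would be selected.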

本発明の特定の実施形態は、聴覚環境において非常に大きいことが多く、聴覚環境における特定の音が邪魔であり、選択的な聴覚が望まれるという技術的課題に対処する。人間の脳自体はある程度選択的な聴覚を実行することができるが、知的技術補助者は選択的な聴覚を大幅に改善することができる。眼鏡が現代生活において多くの人々が自分の環境をよりよく知覚するのを助けるのと同様に、聴覚用の補聴器があるが、通常の聴力を有する人々であっても、多くの状況においてインテリジェントシステムによる支援から利益を得ることができる。「インテリジェントヒアラブル」（聴覚デバイスまたは補聴器）を実現するために、技術システムは、（音響）環境を分析し、個々の音源を個別に処理できるように識別する必要がある。この課題に対する研究は既に行われているが、従来技術では、音響環境全体をリアルタイム（耳に透明）かつ高音質（通常の音響環境と区別できないように聞こえるコンテンツ）で解析・処理することは実現されていなかった。 Certain embodiments of the present invention address the technical challenge that in an auditory environment, which is often very loud, certain sounds in the auditory environment are disturbing and selective hearing is desired. While the human brain itself can perform selective hearing to some extent, intelligent technical assistants can significantly improve selective hearing. Just as glasses help many people in modern life to better perceive their environment, there are hearing aids for hearing, but even people with normal hearing can benefit from assistance from intelligent systems in many situations. To realize an "intelligent hearable" (auditory device or hearing aid), a technical system needs to analyze the (acoustic) environment and identify individual sound sources so that they can be processed separately. Research has already been conducted on this issue, but the prior art has not achieved the analysis and processing of the entire acoustic environment in real time (transparent to the ear) and with high sound quality (content that sounds indistinguishable from the normal acoustic environment).

機械聴取のための改善された概念が以下に提供される。 Improved concepts for machine listening are provided below.

第１のステップでは、ヘッドホンを用いたＢＲＩＲの測定は、プローブマイクロホンを用いて個別に、またはヘッドホンを用いて行われる。 In the first step, measurements of the BRIR using headphones are performed either individually using a probe microphone or using headphones.

任意選択で、例えば、第３のステップにおいて、再生前に、各シェル内の少なくとも１つの内蔵マイクロホンが、再生室内の実際の音状況を記録する。これらの録音から、１つまたは複数の音源の生音声信号の推定値が決定され、再生室内の音源／音源のそれぞれのＢＲＩＲが決定される。この推定から、再生室の音響室特性が決定され、それに記録室内のＢＲＩＲが適合される。 Optionally, for example in a third step, before playback, at least one built-in microphone in each shell records the actual sound situation in the playback room. From these recordings, estimates of the raw audio signals of one or more sound sources are determined, and the respective BRIR of the sound source(s) in the playback room is determined. From this estimate, the acoustic room characteristics of the playback room are determined, and the BRIRs of the recording room are adapted to them.

任意選択で、例えばさらなるステップにおいて、再生中に、各シェル内の少なくとも１つの内蔵マイクロホンが、再生室内の実際の音状況を記録する。これらの録音から、拡張信号が最初に減算され、次いで、１つまたは複数の音源の生音声信号の推定値が決定され、再生室内の音源／音源のそれぞれのＢＲＩＲが決定される。この推定から、再生室の音響室特性が決定され、再生室のＢＲＩＲがそれに適合される。 Optionally, for example in a further step, during playback, at least one built-in microphone in each shell records the actual sound situation in the playback room. The augmented signal is first subtracted from these recordings; then estimates of the raw audio signals of one or more sound sources are determined, and the respective BRIR of the sound source(s) in the playback room is determined. From this estimate, the acoustic room characteristics of the playback room are determined, and the BRIRs of the playback room are adapted to them.
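The subtraction of the augmented signal during playback can be illustrated with a short sketch. It assumes that the recording and the augmented signal are sample-aligned and that the path from the headphone loudspeaker to the built-in microphone can be modelled by a short, known impulse response `path_ir`; both assumptions and the function name are hypothetical and only serve the illustration.

```python
from typing import List, Sequence


def remove_augmented_signal(
    recording: Sequence[float],
    augmented: Sequence[float],
    path_ir: Sequence[float] = (1.0,),
) -> List[float]:
    """Subtract the (filtered) augmented signal from the microphone recording.

    path_ir is an assumed loudspeaker-to-microphone impulse response;
    the default (1.0,) means the microphone picks up the signal unchanged.
    """
    residual = list(recording)
    for n in range(len(residual)):
        # Predicted contribution of the augmented signal at sample n.
        predicted = 0.0
        for k, h in enumerate(path_ir):
            if 0 <= n - k < len(augmented):
                predicted += h * augmented[n - k]
        residual[n] -= predicted
    return residual
```

The residual then approximates the real sound situation in the playback room, from which the source signals and room characteristics are estimated.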

さらなるステップでは、ヘッドホン信号を取得するために、正しく位置決めされ、任意選択的に適合されたＢＲＩＲで増強されるソースの畳み込みが実行される。 In a further step, a convolution of the source, augmented with a correctly positioned and optionally adapted BRIR, is performed to obtain the headphone signal.

最後に、さらなるステップにおいて、ヘッドホン信号の再生が、ヘッドホンを介して実行される。 Finally, in a further step, playback of the headphone signal is performed via headphones.

一実施形態では、例えば、外耳道の入り口付近の音を測定するために、少なくとも１つのマイクロホンが各ヘッドホンカプセル内に配置される。 In one embodiment, for example, at least one microphone is placed within each headphone capsule to measure sounds near the entrance of the ear canal.

一実施形態によれば、再生室内の音状況を測定および分析するために、追加のマイクロホンが任意選択的にヘッドホンの外側に、場合によってはヘッドホンバンドの上側にも配置される。 According to one embodiment, additional microphones are optionally placed on the outside of the headphones, possibly also on top of the headphone band, to measure and analyze the sound situation in the playback room.

実施形態では、同一の自然音源および拡張音源の音が実現される。 In embodiments, natural and augmented sound sources are made to sound identical.

実施形態は、ヘッドホンの特性の測定が不要であることを認識する。 The embodiment recognizes that measuring headphone characteristics is not required.

したがって、実施形態は、再生室の室内特性を測定するための概念を提供する。 Thus, the embodiment provides a concept for measuring the room characteristics of a playback room.

いくつかの実施形態は、室内適応の開始値および（後）最適化を提供する。提供される概念は、再生室の室内音響効果が変化する場合、例えば、聴取者が別の室内（または空間）に移動する場合にも機能する。 Some embodiments provide starting values and (post)optimization of room adaptation. The concepts provided also work when the room acoustics of the reproduction room change, e.g. when the listener moves to a different room (or space).

とりわけ、実施形態は、技術的システムにおいて聴覚を支援するための異なる技術をインストールし、その後、音および生活の質（例えば、所望の音はより大きく、望ましくない音はより柔らかく、発話の理解性がより良好である。）の改善が正常な聴覚を有する人々および難聴を有する人々に対して達成されるように組み合わせることに基づいている。 Among other things, the embodiment is based on installing different technologies for assisting hearing in a technical system and then combining them in such a way that an improvement in sound and quality of life (e.g. desired sounds are louder, undesired sounds are softer, speech intelligibility is better) is achieved for people with normal hearing and for people with hearing loss.

図２は、一実施形態による選択的聴覚を支援するためのシステムを示す図である。 Figure 2 illustrates a system for supporting selective hearing in one embodiment.

システムは、聴覚環境（または聴取環境）の少なくとも２つの受信マイクロホン信号を使用することによって、１つまたは複数の音源の音源信号部分を検出するための検出器１１０を含む。 The system includes a detector 110 for detecting source signal portions of one or more sound sources by using at least two received microphone signals of the auditory environment (or listening environment).

さらに、システムは、位置情報を１つまたは複数の音源の各々に割り当てるための位置決定器１２０を含む。 In addition, the system includes a position determiner 120 for assigning position information to each of the one or more sound sources.

さらに、システムは、音声信号タイプを、１つまたは複数の音源の各々の音源信号部分に割り当てるための音声タイプ分類器１３０を含む。 The system further includes an audio type classifier 130 for assigning an audio signal type to the audio source signal portion of each of the one or more audio sources.

さらに、システムは、少なくとも１つの音源の修正音声信号部分を取得するために、少なくとも１つの音源の音源信号部分の音声信号タイプに応じて、１つまたは複数の音源の少なくとも１つの音源の音源信号部分を変更するための信号部分修正器１４０を含む。 Furthermore, the system includes a signal portion modifier 140 for modifying the source signal portion of at least one of the one or more sound sources depending on the sound signal type of the source signal portion of the at least one sound source to obtain a modified sound signal portion of the at least one sound source.

図１の分析器１５２およびラウドスピーカ信号発生器１５４は共に信号発生器１５０を形成する。 The analyzer 152 and the loudspeaker signal generator 154 of FIG. 1 together form the signal generator 150.

信号発生器１５０の分析器１５２は、複数のバイノーラル室内インパルス応答を生成するように構成され、複数のバイノーラル室内インパルス応答は、この音源の位置情報およびユーザの頭部の向きに依存する、１つまたは複数の音源の各音源に対する複数のバイノーラル室内インパルス応答である。 The analyzer 152 of the signal generator 150 is configured to generate a plurality of binaural room impulse responses for each of the one or more sound sources, the plurality of binaural room impulse responses being dependent on the position information of the sound source and the orientation of the user's head.

信号発生器１５０のラウドスピーカ信号発生器１５４は、複数のバイノーラル室内インパルス応答に応じて、かつ少なくとも１つの音源の修正音声信号部分に応じて、少なくとも２つのラウドスピーカ信号を生成するように構成される。 The loudspeaker signal generator 154 of the signal generator 150 is configured to generate at least two loudspeaker signals in response to a plurality of binaural room impulse responses and in response to the modified audio signal portion of at least one sound source.

一実施形態によれば、例えば、検出器１１０は、深層学習モデルを使用することによって、１つまたは複数の音源の音源信号部分を検出するように構成されてもよい。 According to one embodiment, for example, detector 110 may be configured to detect source signal portions of one or more sound sources by using a deep learning model.

一実施形態では、例えば、位置決定器１２０は、１つまたは複数の音源の各々について、捕捉画像または記録された映像に応じて位置情報を決定するように構成されてもよい。 In one embodiment, for example, position determiner 120 may be configured to determine position information for each of one or more sound sources as a function of the captured image or recorded video.

一実施形態によれば、例えば、位置決定器１２０は、映像内の人物の唇の動きを検出し、唇の動きに応じて、位置情報を、１つまたは複数の音源のうちの１つの音源信号部分に割り当てることによって、１つまたは複数の音源の各々について、映像に応じた位置情報を決定するように構成されてもよい。 According to one embodiment, for example, the position determiner 120 may be configured to determine position information for each of the one or more sound sources in response to the video by detecting lip movements of a person in the video and assigning position information to a source signal portion of one of the one or more sound sources in response to the lip movements.

一実施形態では、例えば、検出器１１０は、少なくとも２つの受信マイクロホン信号に応じて、聴覚環境の１つまたは複数の音響特性を決定するように構成されてもよい。 In one embodiment, for example, detector 110 may be configured to determine one or more acoustic characteristics of the auditory environment in response to at least two received microphone signals.

一実施形態によれば、例えば、信号発生器１５０は、聴覚環境の１つまたは複数の音響特性に応じて複数のバイノーラル室内インパルス応答を決定するように構成されてもよい。 According to one embodiment, for example, the signal generator 150 may be configured to determine multiple binaural room impulse responses as a function of one or more acoustic characteristics of the auditory environment.

一実施形態では、例えば、信号部分修正器１４０は、その音源信号部分が以前に学習されたユーザシナリオに応じて修正される少なくとも１つの音源を選択し、それを以前に学習されたユーザシナリオに応じて修正するように構成されてもよい。 In one embodiment, for example, the signal portion modifier 140 may be configured to select at least one sound source whose sound source signal portion is to be modified in response to a previously learned user scenario and to modify it in response to the previously learned user scenario.

一実施形態によれば、例えば、システムは、２つ以上の以前に学習されたユーザシナリオのグループから以前に学習されたユーザシナリオを選択するためのユーザインターフェース１６０を含み得る。図３は、そのようなユーザインターフェース１６０をさらに含む、一実施形態によるそのようなシステムを示す。 According to one embodiment, for example, the system may include a user interface 160 for selecting a previously learned user scenario from a group of two or more previously learned user scenarios. FIG. 3 illustrates such a system according to one embodiment, further including such a user interface 160.

一実施形態では、例えば、検出器１１０および／または位置決定器１２０および／または音声タイプ分類器１３０および／または信号修正器１４０および／または信号発生器１５０は、ハフ変換を使用して、または複数のＶＬＳＩチップを使用して、または複数のメモリスタを使用することによって、並列信号処理を実行するように構成され得る。 In one embodiment, for example, the detector 110 and/or the position determiner 120 and/or the audio type classifier 130 and/or the signal portion modifier 140 and/or the signal generator 150 may be configured to perform parallel signal processing using a Hough transform, using multiple VLSI chips, or using multiple memristors.

一実施形態によれば、例えば、システムは、聴覚能力が制限されているおよび／または聴覚を損傷しているユーザのための補聴器として機能する聴覚デバイス１７０を含むことができ、聴覚デバイスは、少なくとも２つのラウドスピーカ信号を出力するための少なくとも２つのラウドスピーカ１７１、１７２を含む。図４は、２つの対応するラウドスピーカ１７１、１７２を有するそのような聴覚デバイス１７０を含む、一実施形態によるそのようなシステムを示す。 According to one embodiment, for example, the system may include a hearing device 170 that functions as a hearing aid for a user with limited hearing capabilities and/or impaired hearing, the hearing device including at least two loudspeakers 171, 172 for outputting at least two loudspeaker signals. Figure 4 shows such a system according to one embodiment including such a hearing device 170 with two corresponding loudspeakers 171, 172.

一実施形態では、例えば、システムは、少なくとも２つのラウドスピーカ信号を出力するための少なくとも２つのラウドスピーカ１８１、１８２と、少なくとも２つのラウドスピーカを収容するハウジング構造１８３とを含むことができ、少なくとも１つのハウジング構造１８３は、ユーザの頭部１８５またはユーザの任意の他の身体部分に固定されるのに適している。図５ａは、そのようなハウジング構造１８３および２つのラウドスピーカ１８１、１８２を含む対応するシステムを示す。 In one embodiment, for example, the system may include at least two loudspeakers 181, 182 for outputting at least two loudspeaker signals and a housing structure 183 for housing the at least two loudspeakers, at least one housing structure 183 being suitable for being fixed to a user's head 185 or any other body part of the user. FIG. 5a shows such a housing structure 183 and a corresponding system including two loudspeakers 181, 182.

一実施形態によれば、例えば、システムは、少なくとも２つのラウドスピーカ信号を出力するための少なくとも２つのラウドスピーカ１８１、１８２を含むヘッドホン１８０を含み得る。図５ｂは、一実施形態による、２つのラウドスピーカ１８１、１８２を有する対応するヘッドホン１８０を示す。 According to one embodiment, for example, the system may include a headphone 180 including at least two loudspeakers 181, 182 for outputting at least two loudspeaker signals. Figure 5b shows a corresponding headphone 180 having two loudspeakers 181, 182 according to one embodiment.

一実施形態では、例えば、検出器１１０および位置決定器１２０ならびに音声タイプ分類器１３０ならびに信号部分修正器１４０および信号発生器１５０は、ヘッドホン１８０に統合されてもよい。 In one embodiment, for example, the detector 110, the position determiner 120, the audio type classifier 130, the signal portion modifier 140, and the signal generator 150 may be integrated into the headphones 180.

図６に示す一実施形態によれば、例えば、システムは、検出器１１０および位置決定器１２０ならびに音声タイプ分類器１３０ならびに信号部分修正器１４０および信号発生器１５０を含む遠隔デバイス１９０を含み得る。この場合、例えば、遠隔デバイス１９０は、ヘッドホン１８０から空間的に分離されていてもよい。 According to one embodiment shown in FIG. 6, for example, the system may include a remote device 190 that includes the detector 110, the position determiner 120, the audio type classifier 130, the signal portion modifier 140, and the signal generator 150. In this case, for example, the remote device 190 may be spatially separated from the headphones 180.

一実施形態では、例えば、遠隔デバイス１９０はスマートフォンであってもよい。 In one embodiment, for example, the remote device 190 may be a smartphone.

実施形態は、必ずしもマイクロプロセッサを使用するのではなく、とりわけ、人工ニューラルネットワークのためにも、エネルギー効率の良い実現のために、ハフ変換、ＶＬＳＩチップ、またはメモリスタなどの並列信号処理ステップを使用する。 Embodiments do not necessarily use microprocessors, but rather use parallel signal processing steps such as Hough transforms, VLSI chips, or memristors for an energy-efficient implementation, especially for artificial neural networks.

実施形態では、聴覚環境は空間的に捕捉され再生され、一方では入力信号の表現に２つ以上の信号を使用し、他方では空間的再生も使用する。 In an embodiment, the auditory environment is captured and reproduced spatially, using two or more signals for the representation of the input signal on the one hand, and also spatial reproduction on the other hand.

実施形態では、信号分離は、深層学習（ＤＬ：ｄｅｅｐｌｅａｒｎｉｎｇ）モデル（例えば、ＣＮＮ、ＲＣＮＮ、ＬＳＴＭ、シャムネットワーク）によって実行され、少なくとも２つのマイクロホンチャネルからの情報を同時に処理し、各ヒアラブルに少なくとも１つのマイクロホンがある。本発明によれば、（個々の音源に応じた）いくつかの出力信号が、それらのそれぞれの空間位置と共に相互分析によって決定される。記録手段（マイクロホン）がヘッドに接続されている場合、対象物の位置はヘッドの移動に伴って変化する。これにより、例えば音対象物に向くことによって、重要な／重要でない音に自然に焦点を合わせることが可能になる。 In an embodiment, the signal separation is performed by a deep learning (DL) model (e.g. CNN, RCNN, LSTM, Siamese network), simultaneously processing information from at least two microphone channels, at least one microphone in each hearable. According to the invention, several output signals (corresponding to the individual sound sources) are determined by mutual analysis together with their respective spatial positions. If the recording means (microphones) are connected to the head, the position of the object changes with the movement of the head. This allows a natural focus on important/unimportant sounds, for example by orienting oneself towards the sound object.

いくつかの実施形態では、信号分析のためのアルゴリズムは、例えば深層学習アーキテクチャに基づいている。あるいは、これは、解析ユニットによる変動、または態様の位置特定、検出、および音の分離のための分離されたネットワークによる変動を使用する。一般化相互相関（相関対時間オフセット）の代替的な使用は、頭部による周波数依存性シャドーイング／分離に対応し、位置特定、検出、および音源分離を改善する。 In some embodiments, the algorithms for signal analysis are based, for example, on a deep learning architecture. This may be implemented either as a variant with a single analysis unit or as a variant with separate networks for the aspects of localization, detection, and sound separation. Alternatively, generalized cross-correlation (correlation versus time offset) may be used; it accommodates the frequency-dependent shadowing/isolation by the head and improves localization, detection, and source separation.
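The idea of correlation versus time offset can be sketched with a plain, unweighted cross-correlation between the two ear signals. This is the simplest member of the generalized cross-correlation family and is used here as a stand-in: it applies no frequency weighting, and the conversion to an angle uses a free-field, far-field model that ignores head shadowing. The function names, the microphone distance, and the speed of sound are assumptions for the illustration.

```python
import math
from typing import Sequence


def cross_correlation_delay(
    left: Sequence[float], right: Sequence[float], max_lag: int
) -> int:
    """Lag (in samples) that maximises the cross-correlation.

    A positive result means the right signal is a delayed copy of the left.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for n, sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                value += sample * right[m]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


def delay_to_azimuth(
    lag: int,
    sample_rate: float,
    mic_distance: float = 0.18,    # metres, assumed ear spacing
    speed_of_sound: float = 343.0,  # metres per second
) -> float:
    """Convert a time offset to an azimuth in degrees (far-field model)."""
    ratio = speed_of_sound * (lag / sample_rate) / mic_distance
    ratio = max(-1.0, min(1.0, ratio))  # clamp against rounding and noise
    return math.degrees(math.asin(ratio))
```

A frequency-weighted variant would additionally normalise the cross-spectrum before transforming back to the lag domain, which is where the frequency dependence mentioned above would enter.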

一実施形態によれば、異なるソースカテゴリ（例えば、スピーチ、乗り物、子供の男性／女性／声、警告トーンなど。）は、訓練段階において検出器によって学習される。ここで、音源分離ネットワークはまた、高い信号品質、ならびに定位の高い精度に関する標的刺激を有する定位ネットワークに関して訓練される。 According to one embodiment, different source categories (e.g., speech, vehicles, male/female/children's voices, warning tones, etc.) are learned by the detector in a training phase. Here, the source separation networks are also trained for high signal quality, and the localization networks are trained with targeted stimuli for high localization accuracy.

例えば、上述の訓練ステップは、マルチチャネル音声データを使用し、第１の訓練ラウンドは、通常、シミュレートまたは記録された音声データを用いて実験室で実行される。これに続いて、異なる自然環境（例えば、居室、教室、駅、（産業用）生産環境など。）での訓練実行が行われ、すなわち転移学習およびドメイン適応が実行される。 For example, the training steps described above use multi-channel audio data, and a first training round is typically performed in a laboratory using simulated or recorded audio data. This is followed by training runs in different natural environments (e.g., a living room, a classroom, a train station, an (industrial) production environment, etc.), i.e., transfer learning and domain adaptation are performed.

代替的または追加的に、位置検出器は、音源／音源の視覚位置も決定するように１つまたは複数のカメラに結合することができる。発話の場合、唇の動きと音源分離器から来る音声信号とが相関し、より正確な位置特定を達成する。 Alternatively or additionally, the position detector can be coupled to one or more cameras to also determine the visual position of the sound source/source. In case of speech, lip movements are correlated with the audio signal coming from the sound source separator to achieve more accurate localization.

訓練後、ネットワークアーキテクチャおよび関連するパラメータを有するＤＬモデルが存在する。 After training, we have a DL model with the network architecture and associated parameters.

いくつかの実施形態では、聴覚化はバイノーラル合成によって行われる。バイノーラル合成は、望ましくない成分を完全に削除することはできないが、知覚可能であるが妨害しない程度までそれらを減らすことができるというさらなる利点を提供する。これは、完全にオフにされた場合に見逃されるであろう予期しないさらなるソース（警告信号、吹き出し、．．．）を知覚するというさらなる利点を有する。 In some embodiments, the auralization is performed by binaural synthesis. Binaural synthesis offers the further advantage that undesired components need not be removed completely; they can instead be reduced to a level at which they remain perceptible but are no longer disturbing. As a result, unexpected additional sources (warning signals, shouts, ...) that would be missed if they were switched off completely can still be perceived.

いくつかの実施形態によれば、聴覚環境の分析は、対象物を分離するためだけでなく、音響特性（例えば、残響時間、初期時間ギャップ）を分析するためにも使用される。次いで、これらの特性は、予め記憶された（場合によっては個別化された）バイノーラル室内インパルス応答（ＢＲＩＲ：ｂｉｎａｕｒａｌｒｏｏｍｉｍｐｕｌｓｅｒｅｓｐｏｎｓｅｓ）を実際の部屋（または空間）に適合させるようにバイノーラル合成において使用される。室内発散を低減することによって、聴取者は、最適化された信号を理解するときに、著しく低減された聴取労力を有する。部屋の広がりを最小限に抑えることは、聴覚イベントの外部化、したがって監視室における空間音声再生の信憑性に影響を及ぼす。音声理解または最適化された信号の一般的な理解のために、従来の技術には既知の解決策はない。 According to some embodiments, the analysis of the auditory environment is used not only to separate objects, but also to analyze acoustic characteristics (e.g., reverberation time, initial time delay gap). These characteristics are then used in the binaural synthesis to adapt pre-stored (possibly individualized) binaural room impulse responses (BRIRs) to the actual room (or space). Because the room divergence is reduced, the listener needs significantly less listening effort to understand the optimized signals. Minimizing the room divergence affects the externalization of auditory events and therefore the plausibility of spatial audio reproduction in the listening room. For speech understanding, or for the general understanding of optimized signals, there are no known solutions in the prior art.
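One of the acoustic characteristics mentioned above, the reverberation time, can be estimated from a room impulse response as in the following sketch. It uses Schroeder backward integration and extrapolates the decay between -5 dB and -25 dB to 60 dB (a T20-style estimate). The description does not state which estimator is used, so this choice, the function name, and the simple threshold search without line fitting are assumptions.

```python
import math
from typing import Sequence


def estimate_rt60(impulse_response: Sequence[float], sample_rate: float) -> float:
    """Estimate the reverberation time in seconds from an impulse response."""
    energy = [x * x for x in impulse_response]

    # Schroeder backward integration: energy remaining after each sample.
    decay = [0.0] * len(energy)
    running = 0.0
    for n in range(len(energy) - 1, -1, -1):
        running += energy[n]
        decay[n] = running
    total = decay[0] if decay else 0.0
    if total <= 0.0:
        raise ValueError("impulse response contains no energy")

    def first_index_below(level_db: float):
        for n, value in enumerate(decay):
            if value <= 0.0 or 10.0 * math.log10(value / total) <= level_db:
                return n
        return None

    start = first_index_below(-5.0)
    end = first_index_below(-25.0)
    if start is None or end is None or end <= start:
        raise ValueError("decay range too short for a T20 estimate")

    # 20 dB of decay observed; scale by 3 to extrapolate to 60 dB.
    return 3.0 * (end - start) / sample_rate
```

An estimate of this kind, obtained for the playback room, could then steer how the stored BRIRs are lengthened or shortened, although the adaptation itself is outside this sketch.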

実施形態では、ユーザインターフェースを使用して、どの音源が選択されるかを決定する。本発明によれば、これは、「音声を真正面から増幅する」（１人との会話）、「音声を±６０度の範囲で増幅する」（グループでの会話）、「音声を抑制し、音楽を増幅する」（コンサートに行く人の声を聴きたくない）、「すべてを無音にする」（１人にしておきたい）、「すべての声および警告トーンを抑制する」などの異なるユーザシナリオを事前に学習することによって行われる。 In an embodiment, a user interface is used to determine which sound sources are selected. According to the invention, this is done by learning different user scenarios in advance, such as "amplify speech from directly in front" (conversation with one person), "amplify speech within a range of ±60 degrees" (group conversation), "suppress voices and amplify music" (I do not want to hear the other concert-goers), "silence everything" (I want to be left alone), "suppress all voices and warning tones", etc.
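How a selected user scenario could translate into per-source gains is sketched below. In the description the scenarios are learned in advance; the hand-written rules, the scenario keys, the gain values, and the 10-degree tolerance for "directly in front" are hypothetical and only illustrate the mapping from class label and direction to a gain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedSource:
    label: str          # class assigned by a classifier, e.g. "speech"
    azimuth_deg: float  # 0 = straight ahead (assumed convention)


BOOST = 2.0                  # assumed amplification factor
FRONT_TOLERANCE_DEG = 10.0   # assumed width of "directly in front"


def scenario_gain(scenario: str, source: DetectedSource) -> float:
    """Return the linear gain to apply to one source under a scenario."""
    is_speech = source.label == "speech"
    if scenario == "voice_front":
        in_range = abs(source.azimuth_deg) <= FRONT_TOLERANCE_DEG
        return BOOST if is_speech and in_range else 1.0
    if scenario == "voice_pm60":
        in_range = abs(source.azimuth_deg) <= 60.0
        return BOOST if is_speech and in_range else 1.0
    if scenario == "silence_everything":
        return 0.0
    raise ValueError(f"unknown scenario: {scenario}")
```

With these rules, a talker at 45 degrees is amplified in the group-conversation scenario but left unchanged in the one-to-one scenario.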

いくつかの実施形態は、使用されるハードウェアに依存せず、すなわち、開放型および閉鎖型ヘッドホンを使用することができる。信号処理は、ヘッドホンに組み込まれてもよいし、外部デバイスに組み込まれてもよいし、スマートフォンに組み込まれてもよい。任意選択的に、音響的に記録され処理された信号の再生に加えて、スマートフォンから信号を直接再生することができる（例えば、音楽、電話）。 Some embodiments are agnostic to the hardware used, i.e., open and closed headphones can be used. Signal processing may be built into the headphones, an external device, or the smartphone. Optionally, in addition to playing acoustically recorded and processed signals, signals can be played directly from the smartphone (e.g., music, phone calls).

他の実施形態では、「ＡＩ支援による選択的聴取」のためのエコシステムが提供される。実施形態は、「個人向け聴覚現実」（ＰＡＲｔｙ：ｐｅｒｓｏｎａｌｉｚｅｄａｕｄｉｔｏｒｙｒｅａｌｉｔｙ）を指す。そのような個人向け環境では、聴取者は、定義された音響対象物を増幅、低減、または修正することができる。個々の要件に適合した健全な体験を作り出すために、一連の分析および合成処理が実行されるべきである。目標とする変換段階の研究は、このための必須の構成要素を形成する。 In other embodiments, an ecosystem for "AI-assisted selective listening" is provided. Embodiments refer to "personalized auditory reality" (PARty). In such a personalized environment, the listener can amplify, attenuate, or modify defined acoustic objects. To create a sound experience adapted to individual requirements, a series of analysis and synthesis processes has to be performed. Research into targeted transformation stages forms an essential component of this.

いくつかの実施形態は、実際の音環境の分析および個々の音響対象物の検出、利用可能な対象物の分離、追跡、および編集可能性、ならびに修正された音響シーンの再構成および再生を実現する。 Some embodiments provide analysis of the real sound environment and detection of individual acoustic objects, isolation, tracking and editability of available objects, as well as reconstruction and playback of modified acoustic scenes.

実施形態では、音イベントの検出、音イベントの分離、およびいくつかの音イベントの抑制が実現される。 In an embodiment, detection of sound events, separation of sound events, and suppression of some sound events are achieved.

実施形態では、ＡＩ方法（特に深層学習ベースの方法）が使用される。 In an embodiment, AI methods (particularly deep learning-based methods) are used.

本発明の実施形態は、空間音声の記録、信号処理、および再生のための技術開発に寄与する。 Embodiments of the present invention contribute to technological developments for recording, signal processing, and playback of spatial audio.

例えば、実施形態は、対話するユーザを有するマルチメディアシステムにおいて空間性および三次元性を生成する。 For example, embodiments create spatiality and three-dimensionality in multimedia systems with interacting users.

この場合、実施形態は、空間聴覚／聴取の知覚および認知処理の研究知識に基づく。 In this case, the embodiment is based on research knowledge of perceptual and cognitive processing of spatial hearing/listening.

いくつかの実施形態は、以下の概念のうちの２つ以上を使用する。 Some embodiments use two or more of the following concepts:

シーン分解：これは、実際の環境の空間音響検出、ならびにパラメータ推定および／または位置依存音場解析を含む。 Scene decomposition: This involves spatial acoustic detection of the real environment, as well as parameter estimation and/or position-dependent sound field analysis.

シーン表現：これは、対象物および／または環境の表現および識別、ならびに／あるいは効率的な表現および記憶を含む。 Scene representation: This involves the representation and identification, and/or efficient representation and storage of objects and/or the environment.

シーンの組み合わせと再生：これには、対象物と環境の適応と変化、および／またはレンダリングと聴覚化が含まれる。 Scene combination and reproduction: This includes the adaptation and transformation of objects and environments, and/or the rendering and auralization.

品質評価：これには、技術的および／または聴覚的品質測定が含まれる。 Quality assessment: This includes technical and/or auditory quality measurements.

マイクロホン位置決め：これは、マイクロホンアレイの適用および適切な音声信号処理を含む。 Microphone positioning: This involves the application of a microphone array and appropriate audio signal processing.

信号調整：これは、特徴抽出ならびにＭＬ（ｍａｃｈｉｎｅｌｅａｒｎｉｎｇ：機械学習）のためのデータセット生成を含む。 Signal conditioning: This includes feature extraction as well as dataset generation for machine learning (ML).

室内および周囲音響の推定：これは、室内音響パラメータのその場測定および推定、ならびに／または音源分離およびＭＬのための室内音響特徴の提供を含む。 Estimation of room and ambient acoustics: This involves in-situ measurement and estimation of room acoustic parameters and/or providing room acoustic signatures for source separation and ML.

聴覚化：これには、環境への聴覚適応を伴う空間音声再生および／または検証および評価および／または機能的証明および品質推定が含まれる。 Auralization: This includes spatial audio reproduction with auditory adaptation to the environment and/or validation and evaluation and/or functional proof and quality estimation.

図８は、一実施形態による対応するシナリオを示す図である。 Figure 8 illustrates a corresponding scenario according to one embodiment.

実施形態は、音源の検出、分類、分離、位置特定、および強化のための概念を組み合わせ、各分野における最近の進歩が強調され、それらの間の接続が示される。 Embodiments combine concepts for sound source detection, classification, separation, localization, and enhancement, highlighting recent advances in each field and illustrating connections between them.

以下では、現実のＳＨに必要な柔軟性および堅牢性を提供するために、音源を組み合わせ／検出／分類／位置特定および分離／強化することができる一貫した概念を提供する。 In the following, we provide a consistent concept that can combine/detect/classify/localize and separate/enhance sound sources to provide the flexibility and robustness required for real-world SH.

さらに、実施形態は、現実の聴覚シーンのダイナミクスを扱うときのリアルタイム性能に適した低レイテンシの概念を提供する。 Furthermore, the embodiments provide a low-latency concept suitable for real-time performance when dealing with the dynamics of real-world auditory scenes.

いくつかの実施形態は、深層学習、機械聴取、およびスマートヘッドホン（スマートヒアラブル）の概念を使用し、聴取者が聴覚シーンを選択的に修正することを可能にする。 Some embodiments use concepts from deep learning, machine hearing, and smart headphones (smart hearables) to allow the listener to selectively modify the auditory scene.

実施形態は、ヘッドホン、イヤホンなどの聴覚デバイスを用いて聴覚シーン内の音源を選択的に増強、減衰、抑制、または修正する可能性を聴取者に提供する。 Embodiments provide the listener with the possibility to selectively enhance, attenuate, suppress or modify sound sources within an auditory scene using hearing devices such as headphones, earphones etc.

図９は、４つの外部音源を有する一実施形態によるシナリオを示す図である。 Figure 9 shows a scenario in one embodiment with four external sound sources.

図９において、ユーザは聴覚シーンの中心である。この場合、ユーザの周囲では４つの外部音源（Ｓ１～Ｓ４）がアクティブになっている。ユーザインターフェースは、聴取者が聴覚シーンに影響を与えることを可能にする。ソースＳ１～Ｓ４は、それらの対応するスライダによって減衰、改善、または抑制され得る。図２に見られるように、聴取者は、聴覚シーン内に保持されるべき、または聴覚シーンから抑制されるべき音源または音イベントを定義することができる。図２では、都市の暗騒音は抑制されるべきであるが、警報または電話の呼び出しは保持されるべきである。常に、ユーザは、聴覚デバイスを介して音楽またはラジオなどの追加の音声ストリームを再生（または再生）する可能性を有する。 In Fig. 9, the user is the center of the auditory scene. In this case, four external sound sources (S1-S4) are active around the user. The user interface allows the listener to influence the auditory scene. Sources S1-S4 can be attenuated, enhanced, or suppressed by their corresponding sliders. As can be seen in Fig. 2, the listener can define sound sources or sound events that should be kept in the auditory scene or suppressed from it. In Fig. 2, the urban background noise should be suppressed, but alarms or ringing telephones should be kept. At all times, the user has the possibility to play (or play back) additional audio streams such as music or radio through the hearing device.
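The slider control described for sources S1-S4 can be sketched as a gain per separated source followed by a sum. The sketch assumes that the separated source signals are already available as lists of samples and that a slider value is a linear gain (0 suppresses, values below 1 attenuate, values above 1 enhance); the function name and these conventions are assumptions.

```python
from typing import Dict, List, Sequence


def mix_with_sliders(
    sources: Dict[str, Sequence[float]],
    sliders: Dict[str, float],
) -> List[float]:
    """Apply one slider gain per source and sum the results.

    Sources without a slider entry are passed through with gain 1.0.
    """
    length = max((len(signal) for signal in sources.values()), default=0)
    mix = [0.0] * length
    for name, signal in sources.items():
        gain = sliders.get(name, 1.0)
        for n, value in enumerate(signal):
            mix[n] += gain * value
    return mix
```

An additional audio stream such as music could be added to the same sum as one more entry in `sources`.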

ユーザは、通常、システムの中心であり、制御ユニットによって聴覚シーンを制御する。ユーザは、図９に示すようなユーザインターフェースを用いて、または音声制御、ジェスチャ、視線方向などの任意のタイプの対話を用いて聴覚シーンを修正することができる。ユーザがシステムにフィードバックを提供すると、次のステップは、検出／分類／位置特定段階からなる。場合によっては、例えば、ユーザが聴覚シーンで発生する任意の発話を保持したい場合、検出のみが必要である。他の場合では、例えば、ユーザが、電話の呼び出し音やオフィスの雑音ではなく、聴覚シーンで火災警報を維持したい場合には、分類が必要であり得る。場合によっては、ソースの位置のみがシステムに関連する。これは、例えば、図９の４つの音源の場合であり、ユーザは、音源の種類または特性に関係なく、特定の方向から来る音源を除去または減衰することを決定することができる。 The user is usually the center of the system and controls the auditory scene by means of a control unit. The user can modify the auditory scene by means of a user interface as shown in FIG. 9 or by means of any type of interaction such as voice control, gestures, gaze direction, etc. Once the user has provided feedback to the system, the next step consists of a detection/classification/location stage. In some cases, only detection is necessary, for example if the user wants to keep any speech occurring in the auditory scene. In other cases, classification may be necessary, for example if the user wants to keep fire alarms in the auditory scene, but not telephone ringing or office noise. In some cases, only the location of the source is relevant to the system. This is for example the case of the four sound sources in FIG. 9, where the user can decide to remove or attenuate sound sources coming from a certain direction, regardless of the type or characteristics of the source.

図１０は、実施形態に係るＳＨ用途の処理ワークフローを示す。 Figure 10 shows the processing workflow for SH applications according to an embodiment.

まず、図１０の分離強調段階で聴覚シーンを修正する。これは、特定の音源（または複数の特定の音源）を抑制、減衰、または増強することによって行われる。図１０に示されるように、ＳＨにおける追加の処理選択肢は、聴覚シーンにおけるバックグラウンドノイズを除去または最小化するという目的を有するノイズ制御である。おそらく、ノイズ制御のための最も一般的で広く普及している技術は、能動的ノイズ制御（ＡＮＣ）である［１１］。 First, the auditory scene is modified in the separation and enhancement stage of Fig. 10. This is done by suppressing, attenuating, or enhancing a particular sound source (or particular sound sources). As shown in Fig. 10, an additional processing option in SH is noise control, which aims to remove or minimize background noise in the auditory scene. Perhaps the most common and widespread technique for noise control is active noise control (ANC) [11].

選択的聴覚は、シーンに仮想音源を追加しようとすることなく、聴覚シーンにおいて実際の音源のみが変更される用途に選択的聴覚を制限することによって、仮想および拡張聴覚環境と区別される。 Selective hearing is distinguished from virtual and augmented auditory environments by restricting selective hearing to applications where only real sound sources are modified in the auditory scene, without any attempt to add virtual sound sources to the scene.

機械聴取の観点から、選択的聴覚用途は、音源を自動的に検出、位置特定、分類、分離、および強化するための技術を必要とする。選択的聴覚に関する用語をさらに明確にするために、以下の用語を定義し、それらの違いおよび関係を強調する。 From a machine hearing perspective, selective hearing applications require techniques to automatically detect, localize, classify, separate, and enhance sound sources. To further clarify the terminology related to selective hearing, we define the following terms and highlight their differences and relationships:

実施形態では、例えば、聴覚シーン内の音源の位置を検出する能力を指す音源位置特定が使用される。音声処理の文脈では、音源位置は通常、所与の音源の到来方向（ＤＯＡ：ｄｉｒｅｃｔｉｏｎｏｆａｒｒｉｖａｌ）を指し、これは、仰角を含む場合に２Ｄ座標（方位角）または３Ｄ座標のいずれかとして与えることができる。いくつかのシステムはまた、音源からマイクロホンまでの距離を位置情報として推定する［３］。音楽処理の文脈では、位置は、最終的な混合物における音源のパンニングを指すことが多く、通常、度単位の角度として与えられる［４］。 In embodiments, for example, source localization is used, which refers to the ability to detect the position of a sound source within an auditory scene. In the context of audio processing, source location usually refers to the direction of arrival (DOA) of a given sound source, which can be given either as a 2D coordinate (azimuth) or a 3D coordinate if it includes the elevation angle. Some systems also estimate the distance from the sound source to the microphone as position information [3]. In the context of music processing, location often refers to the panning of the sound source in the final mix, usually given as an angle in degrees [4].
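The two ways of giving a direction of arrival, azimuth only or azimuth plus elevation, can be related to a Cartesian direction as in the following sketch. The axis convention (x forward, y to the left, z up, angles in degrees) is an assumption; other conventions are equally common.

```python
import math
from typing import Tuple


def doa_to_unit_vector(
    azimuth_deg: float, elevation_deg: float = 0.0
) -> Tuple[float, float, float]:
    """Convert a DOA to a unit vector (assumed: x forward, y left, z up).

    With the default elevation of 0, this is the 2D (azimuth-only) case.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )
```

Multiplying the unit vector by an estimated source distance would give a full 3D position for systems that also estimate distance.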

実施形態によれば、例えば、所与の音源タイプの任意のインスタンスが聴覚シーンに存在するかどうかを決定する能力を参照して、音源検出が使用される。検出タスクの一例は、シーン内に話者が存在するかどうかを決定することである。これに関連して、シーン内のスピーカの数またはスピーカの識別情報を決定することは、音源検出の範囲外である。検出は、クラスが「ソースが存在する」および「ソースが存在しない」に対応するバイナリ分類タスクとして理解することができる。 According to an embodiment, for example, sound source detection is used, which refers to the ability to determine whether any instance of a given sound source type is present in the auditory scene. An example of a detection task is determining whether a speaker is present in the scene. In this context, determining the number of speakers in a scene or their identity is outside the scope of sound source detection. Detection can be understood as a binary classification task in which the classes correspond to "source present" and "source absent".

実施形態では、例えば、音源分類が使用され、所定のクラスのセットからのクラスラベルを所与の音源または所与の音イベントに割り当てる。分類タスクの一例は、所与の音源が音声、音楽、または環境雑音に対応するかどうかを決定することである。音源の分類と検出は密接に関連した概念である。場合によっては、分類システムは、「クラスなし」を可能なラベルの１つとして考慮することによって検出段階を含む。これらの場合、システムは暗黙的に音源の有無を検出することを学習し、音源のいずれかがアクティブであるという十分な証拠がない場合にはクラスラベルを割り当てることを強制されない。 In an embodiment, for example, sound source classification is used, which assigns a class label from a set of predefined classes to a given sound source or a given sound event. An example of a classification task is to determine whether a given sound source corresponds to speech, music, or environmental noise. Sound source classification and detection are closely related concepts. In some cases, the classification system includes a detection stage by considering "no class" as one of the possible labels. In these cases, the system learns to detect the presence or absence of a sound source implicitly and is not forced to assign a class label in the absence of sufficient evidence that any of the sound sources is active.

実施形態によれば、例えば、音声混合または聴覚シーンからの所与の音源の抽出を参照して、音源分離が使用される。音源分離の例は、混合音声から歌唱音声を抽出することであり、歌唱者以外に、他の楽器が同時に演奏している［５］。音源分離は、聴取者にとって関心のない音源を抑制することを可能にするので、選択的聴取シナリオに関連するようになる。いくつかの音声分離システムは、混合物から音源を抽出する前に検出タスクを暗黙的に実行する。しかしながら、これは必ずしも規則ではなく、したがって、これらのタスク間の区別を強調する。さらに、分離は、ソース強調［６］または分類［７］などの他のタイプの分析の前処理段階として機能することが多い。 According to an embodiment, source separation is used with reference to, for example, the extraction of a given sound source from a sound mixture or an auditory scene. An example of source separation is the extraction of a singing voice from a sound mixture where, apart from the singer, other instruments are playing simultaneously [5]. Source separation becomes relevant in selective listening scenarios, since it allows to suppress sound sources that are not of interest to the listener. Some sound separation systems implicitly perform a detection task before extracting the sound source from the mixture. However, this is not necessarily the rule, and therefore highlights the distinction between these tasks. Moreover, separation often serves as a pre-processing stage for other types of analysis, such as source emphasis [6] or classification [7].

実施形態では、例えば、音源識別が使用され、これはさらに一段階進み、音声信号内の音源の特定のインスタンスを識別することを目的とする。話者識別は、今日ではおそらく音源識別の最も一般的な使用法である。このタスクにおける目標は、特定の話者がシーン内に存在するかどうかを識別することである。図１の例では、ユーザは、聴覚シーンに保持される音源の１つとして「スピーカＸ」を選択している。これには、音声の検出および分類を超える技術が必要であり、この正確な識別を可能にする話者固有のモデルが必要である。 In an embodiment, for example, audio source identification is used, which goes one step further and aims to identify specific instances of audio sources within an audio signal. Speaker identification is perhaps the most common use of audio source identification today. In this task, the goal is to identify whether a particular speaker is present in the scene. In the example of Figure 1, the user has selected "Speaker X" as one of the audio sources held in the auditory scene. This requires techniques that go beyond voice detection and classification, and requires speaker-specific models that allow this accurate identification.

実施形態によれば、例えば音源強調が使用されるとは、聴覚シーンにおける所与の音源の顕著性を増加させる処理を指す［８］。音声信号の場合、目標は、その知覚品質および了解度を高めることであることが多い。音声強調の一般的なシナリオは、ノイズによって損なわれた音声のノイズ除去である［９］。音楽処理の文脈において、ソース強化は、リミックスの概念に関連し、１つの楽器（音源）をミックスにおいてより顕著にするためにしばしば実行される。リミキシング用途は、個々の音源にアクセスして混合物の特性を変更するために音声分離フロントエンドを使用することが多い［１０］。音源強調の前に音源分離段階を行うことができるが、これは常にそうであるとは限らず、したがって、これらの用語の区別も強調する。 According to an embodiment, for example, sound source enhancement is used, which refers to a process that increases the salience of a given sound source in an auditory scene [8]. For a speech signal, the goal is often to increase its perceptual quality and intelligibility. A common scenario for speech enhancement is the denoising of speech corrupted by noise [9]. In the context of music processing, source enhancement is related to the concept of remixing and is often performed to make one instrument (sound source) more prominent in the mix. Remixing applications often use a sound separation front-end to access the individual sound sources and modify the characteristics of the mixture [10]. Although source enhancement can be preceded by a source separation stage, this is not always the case, and therefore the distinction between these terms is also emphasized.

音源の検出、分類、および識別の分野では、例えば、いくつかの実施形態は、音響シーンおよびイベントの検出および分類などの以下の概念のうちのいずれかを使用する［１８］。これに関連して、家庭環境における音声イベント検出（ＡＥＤ：ａｕｄｉｏｅｖｅｎｔｄｅｔｅｃｔｉｏｎ）のための方法が提案されており、目標は、１０秒の録音［１９］、［２０］以内に所与の音イベントの時間境界を検出することである。この特定の事例では、猫、犬、話し声、警報、および水道水を含む１０の音イベントクラスが考慮された。ポリフォン音イベント（いくつかの同時イベント）検出のための方法もまた、文献［２１］、［２２］に提案されている。［２１］において、双方向ロングショートタームメモリ（ＢＬＳＴＭ：ｂｉ－ｄｉｒｅｃｔｉｏｎａｌｌｏｎｇｓｈｏｒｔ－ｔｅｒｍｍｅｍｏｒｙ）リカレントニューラルネットワーク（ＲＮＮ：ｒｅｃｕｒｒｅｎｔｎｅｕｒａｌｎｅｔｗｏｒｋ）に基づくバイナリアクティビティ検出器を使用して、現実の文脈からの合計６１個の音イベントが検出される、ポリフォン音イベント検出のための方法が提案されている。 In the field of sound source detection, classification, and identification, some embodiments use, for example, one of the following concepts, such as the detection and classification of acoustic scenes and events [18]. In this context, methods for audio event detection (AED) in domestic environments have been proposed, where the goal is to detect the time boundaries of a given sound event within 10-second recordings [19], [20]. In this particular case, 10 sound event classes were considered, including cat, dog, talking, alarm, and tap water. Methods for polyphonic sound event detection (several simultaneous events) have also been proposed in the literature [21], [22]. In [21], a method for polyphonic sound event detection is proposed in which a total of 61 sound events from real-life contexts are detected using a binary activity detector based on a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN).

例えば、弱くラベル付けされたデータを扱うために、いくつかの実施形態は、分類のための信号の特定の領域に焦点を合わせるための時間的注意メカニズムを組み込む［２３］。分類におけるノイズの多いラベルの問題は、クラスラベルが非常に多様であり、高品質の注釈が非常にコストがかかる選択的聴覚用途に特に関連する［２４］。音事象分類タスクにおけるノイズの多いラベルは、［２５］で対処されており、カテゴリのクロスエントロピーに基づくノイズに強い損失関数、ならびにノイズの多いデータと手動でラベル化されたデータの両方を評価する方法が提示されている。同様に、［２６］は、訓練例の複数のセグメントに対するＣＮＮの予測コンセンサスに基づくノイズの多いラベルの検証ステップを組み込んだ畳み込みニューラルネットワーク（ＣＮＮ：ｃｏｎｖｏｌｕｔｉｏｎａｌｎｅｕｒａｌｎｅｔｗｏｒｋ）に基づく音声イベント分類のためのシステムを提示する。 For example, to handle weakly labeled data, some embodiments incorporate a temporal attention mechanism to focus on specific regions of the signal for classification [23]. The problem of noisy labels in classification is particularly relevant for selective hearing applications where class labels are highly diverse and high-quality annotation is very costly [24]. Noisy labels in sound event classification tasks are addressed in [25], where a noise-robust loss function based on categorical cross-entropy is presented, as well as a method to evaluate both noisy and manually labeled data. Similarly, [26] presents a system for sound event classification based on a convolutional neural network (CNN) that incorporates a validation step of noisy labels based on the prediction consensus of the CNN on multiple segments of training examples.

例えば、いくつかの実施形態は、音イベントの同時検出および位置特定を実現する。したがって、いくつかの実施形態は、［２７］のようなマルチラベル分類タスクとして検出を実行し、位置は、各音イベントの到来方向（ＤＯＡ）の３Ｄ座標として与えられる。 For example, some embodiments provide simultaneous detection and localization of sound events. Thus, some embodiments perform detection as a multi-label classification task, such as in [27], where the location is given as the 3D coordinates of the direction of arrival (DOA) of each sound event.

いくつかの実施形態は、ＳＨのための音声アクティビティ検出および話者認識／識別の概念を使用する。音声アクティビティ検出は、ノイズ除去オートエンコーダ［２８］、リカレントニューラルネットワーク［２９］を使用して、または生波形を使用するエンドツーエンドシステム［３０］として、ノイズの多い環境で対処されてきた。話者認識用途のために、文献［３１］において多数のシステムが提案されており、その大部分は、例えばデータ増強または認識を容易にする改善された埋め込み［３２］～［３４］を用いて、異なる条件に対するロバスト性を高めることに焦点を当てている。したがって、実施形態のいくつかは、これらの概念を使用する。 Some embodiments use concepts of voice activity detection and speaker recognition/identification for SH. Voice activity detection has been addressed in noisy environments using denoising autoencoders [28], recurrent neural networks [29], or as end-to-end systems using raw waveforms [30]. For speaker recognition applications, numerous systems have been proposed in the literature [31], most of which focus on increasing robustness to different conditions, for example with data augmentation or improved embeddings that facilitate recognition [32]-[34]. Some of the embodiments therefore use these concepts.

さらなる実施形態は、音イベント検出のための楽器の分類のための概念を使用する。モノラル設定とポリフォニック設定の両方における楽器分類は、文献［３５］、［３６］で対処されている。［３５］では、３秒の音声セグメントにおける支配的な楽器は、１１の楽器クラスの間で分類され、いくつかの集約技術が提案されている。同様に、［３７］は、１秒のより細かい時間分解能で楽器を検出することができる楽器アクティビティ検出のための方法を提案している。歌唱音声分析の分野では、かなりの量の研究が行われてきた。特に、歌声が活発である録音におけるセグメントを検出するタスクのための［３８］などの方法が提案されている。いくつかの実施形態は、これらの概念を使用する。 Further embodiments use concepts for instrument classification for sound event detection. Instrument classification in both monophonic and polyphonic settings has been addressed in [35], [36]. In [35], dominant instruments in 3-s audio segments are classified among 11 instrument classes and several aggregation techniques are proposed. Similarly, [37] proposes a method for instrument activity detection that can detect instruments with a finer time resolution of 1 s. A significant amount of research has been carried out in the field of singing voice analysis. In particular, methods such as [38] have been proposed for the task of detecting segments in recordings where the singing voice is active. Some embodiments use these concepts.

実施形態のいくつかは、音源位置特定のために以下で説明する概念のうちの１つを使用する。音源定位は、聴覚シーン内の音源の数が現実の用途では通常知られていないため、音源カウントの問題と密接に関連している。いくつかのシステムは、シーン内のソースの数が既知であるという仮定の下で動作する。これは、例えば、能動強度ベクトルのヒストグラムを使用してソースの位置を特定する［３９］に提示されたモデルの場合である。教師ありの観点から、［４０］は、入力表現として位相マップを使用して聴覚シーン内の複数の話者のＤＯＡを推定するためのＣＮＮベースのアルゴリズムを提案する。対照的に、文献のいくつかの研究は、シーン内のソースの数およびそれらの位置情報を共同で推定する。これは、［４１］の場合であり、ここでは、雑音環境および残響環境におけるマルチスピーカ位置特定のためのシステムが提案されている。システムは、ソースの数およびそれらの位置特定の両方を推定するために、複素値ガウス混合モデル（ＧＭＭ：ＧａｕｓｓｉａｎＭｉｘｔｕｒｅＭｏｄｅｌ）を使用する。そこに記載された概念は、いくつかの実施形態によって使用される。 Some of the embodiments use one of the concepts described below for source localization. Source localization is closely related to the problem of source counting, since the number of sources in an auditory scene is usually unknown in real-world applications. Some systems work under the assumption that the number of sources in a scene is known. This is the case, for example, of the model presented in [39], which uses a histogram of active intensity vectors to localize the sources. From a supervised perspective, [40] proposes a CNN-based algorithm to estimate the DOA of multiple speakers in an auditory scene using a phase map as an input representation. In contrast, some works in the literature jointly estimate the number of sources in a scene and their location information. This is the case in [41], where a system for multi-speaker localization in noisy and reverberant environments is proposed. The system uses a complex-valued Gaussian Mixture Model (GMM) to estimate both the number of sources and their localization. The concepts described there are used by some of the embodiments.

音源位置特定アルゴリズムは、聴覚シーンの周りの大きな空間をスキャンすることを含むことが多いため、計算上要求が厳しい場合がある［４２］。位置推定アルゴリズムにおける計算要件を低減するために、いくつかの実施形態は、クラスタリングアルゴリズムを使用することによって［４３］、またはステアリング応答電力位相変換（ＳＲＰ－ＰＨＡＴ：ｓｔｅｅｒｅｄｒｅｓｐｏｎｓｅｐｏｗｅｒｐｈａｓｅｔｒａｎｓｆｏｒｍ）に基づく方法などの確立された方法で多重解像度探索を実行することによって［４２］、探索空間を低減する概念を使用する。他の方法は、スパース性制約を課し、所与の時間－周波数領域においてただ１つの音源が優勢であると仮定する［４４］。最近、生波形から直接方位角検出するためのエンドツーエンドシステムが［４５］で提案されている。いくつかの実施形態は、これらの概念を使用する。 Sound source localization algorithms can be computationally demanding, as they often involve scanning a large space around the auditory scene [42]. To reduce the computational requirements in localization algorithms, some embodiments use concepts that reduce the search space, such as by using clustering algorithms [43] or by performing a multi-resolution search with established methods, such as those based on the steered response power phase transform (SRP-PHAT) [42]. Other methods impose sparsity constraints and assume that only one sound source dominates in a given time-frequency region [44]. Recently, an end-to-end system for azimuth detection directly from raw waveforms has been proposed in [45]. Some embodiments use these concepts.
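
As an illustration of the PHAT weighting that SRP-PHAT builds on, the following minimal sketch estimates the time delay between two microphone signals with the generalized cross-correlation with phase transform (GCC-PHAT); SRP-PHAT accumulates such pairwise correlations over candidate source positions. The function names, the naive O(N²) DFT, and the restriction of the search to `max_lag` samples are assumptions made to keep the example self-contained; a practical system would use an FFT library.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (illustration only, O(N^2))."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def gcc_phat_delay(sig, ref, max_lag):
    """Estimate the delay of `sig` relative to `ref`, in samples.

    A positive result means `sig` lags behind `ref`. Limiting the search
    to +/- max_lag is one simple way of shrinking the search space.
    """
    n = len(sig) + len(ref)  # zero-pad so circular wrap-around is harmless
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [ai * bi.conjugate() for ai, bi in zip(a, b)]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    cc = [v.real for v in dft(cross, inverse=True)]
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cc[lag % n])
```

Given the microphone spacing and the speed of sound, the estimated delay can then be mapped to an angle of arrival.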

いくつかの実施形態は、特に音声分離および音楽分離の分野からの音源分離（ＳＳＳ：ｓｏｕｎｄｓｏｕｒｃｅｓｅｐａｒａｔｉｏｎ）について後に説明する概念を使用する。 Some embodiments use concepts described below for sound source separation (SSS), particularly from the field of speech and music separation.

特に、いくつかの実施形態は、話者に依存しない分離の概念を使用する。分離は、シーン内の話者に関するいかなる事前情報もなしにそこで実行される［４６］。いくつかの実施形態はまた、分離を実行するためにスピーカの空間位置を評価する［４７］。 In particular, some embodiments use concepts for speaker-independent separation, in which separation is performed without any prior information about the speakers in the scene [46]. Some embodiments also evaluate the spatial location of the speakers to perform the separation [47].

選択的聴覚用途における計算性能の重要性を考えると、低レイテンシを達成するという特定の目的で行われた研究は、特に重要である。利用可能な訓練データがほとんどない状態で低遅延音声分離（＜１０ｍｓ）を実行するためのいくつかの研究が提案されている［４８］。周波数領域におけるフレーミング解析によって生じる遅延を回避するために、いくつかのシステムは、時間領域に適用されるフィルタを慎重に設計することによって分離問題にアプローチする［４９］。他のシステムは、エンコーダ－デコーダフレームワークを使用して時間領域信号を直接モデリングすることによって低レイテンシ分離を達成する［５０］。対照的に、いくつかのシステムは、周波数領域分離手法におけるフレーミング遅延を低減することを試みている［５１］。これらの概念は、いくつかの実施形態によって採用される。 Given the importance of computational performance in selective hearing applications, research conducted with the specific goal of achieving low latency is particularly important. Several studies have been proposed to perform low-latency speech separation (<10 ms) with little available training data [48]. To avoid the delays caused by framing analysis in the frequency domain, some systems approach the separation problem by carefully designing filters applied in the time domain [49]. Other systems achieve low-latency separation by directly modeling the time-domain signal using an encoder-decoder framework [50]. In contrast, some systems attempt to reduce the framing delay in frequency-domain separation approaches [51]. These concepts are adopted by some embodiments.

いくつかの実施形態は、リード楽器伴奏分離のための概念［５２］などの、音声混合［５］から音楽ソースを抽出する音楽音分離（ＭＳＳ：ｍｕｓｉｃｓｏｕｎｄｓｅｐａｒａｔｉｏｎ）のための概念を使用する。これらのアルゴリズムは、そのクラスラベルに関係なく、混合物において最も顕著な音源を取得し、それを残りの付随物から分離しようと試みる。いくつかの実施形態は、歌声分離のための概念を使用する［５３］。ほとんどの場合、歌唱音声の特性を捕捉するために、特定のソースモデル［５４］またはデータ駆動モデル［５５］のいずれかが使用される。［５５］で提案されているようなシステムは、分離を達成するために分類または検出段階を明示的に組み込んでいないが、これらの手法のデータ駆動型の性質は、これらのシステムが分離前に歌唱音声を特定の精度で検出することを暗黙的に学習することを可能にする。音楽ドメインにおける別のクラスのアルゴリズムは、分離前に音源を分類または検出しようと試みることなく、音源の位置のみを使用して分離を実行しようと試みる［４］。 Some embodiments use concepts for music sound separation (MSS), which extracts a music source from an audio mixture [5], such as concepts for separating the lead instrument from the accompaniment [52]. These algorithms take the most prominent sound source in the mixture, regardless of its class label, and attempt to separate it from the remaining accompaniment. Some embodiments use concepts for singing voice separation [53]. In most cases, either a specific source model [54] or a data-driven model [55] is used to capture the characteristics of the singing voice. Although systems such as the one proposed in [55] do not explicitly incorporate a classification or detection stage to achieve separation, the data-driven nature of these approaches allows them to implicitly learn to detect the singing voice with a certain accuracy before separation. Another class of algorithms in the music domain attempts to perform separation using only the location of the sound sources, without attempting to classify or detect them before separation [4].

いくつかの実施形態は、アクティブノイズキャンセル（ＡＮＣ）などのアクティブノイズコントロール（ＡＮＣ）の概念を使用する。ＡＮＣシステムは、主に、雑音除去信号を導入してキャンセルすることにより、ヘッドホンユーザの背景雑音を除去することを目的としている［１１］。ＡＮＣは、ＳＨの特別なケースと考えることができ、同様に厳しい性能要件に直面する［１４］。いくつかの研究は、自動車のキャビン［５６］または産業シナリオ［５７］などの特定の環境における能動騒音制御に焦点を当てている。［５６］の作業は、ロードノイズやエンジンノイズなどの異なる種類のノイズの除去を分析し、異なる種類のノイズに対処できる統一されたノイズ制御システムを必要とする。いくつかの研究は、特定の空間領域にわたってノイズを除去するためのＡＮＣシステムの開発に焦点を当てている。［５８］において、空間領域にわたるＡＮＣは、雑音フィールドを表すための基底関数として球面調和関数を使用して対処される。いくつかの実施形態は、本明細書に記載の概念を使用する。 Some embodiments use concepts for active noise control, also referred to as active noise cancellation (ANC). ANC systems primarily aim to remove background noise for headphone users by introducing an anti-noise signal that cancels it [11]. ANC can be considered a special case of SH and faces similarly stringent performance requirements [14]. Some works have focused on active noise control in specific environments, such as automobile cabins [56] or industrial scenarios [57]. The work in [56] analyzes the cancellation of different types of noise, such as road noise and engine noise, and calls for a unified noise control system that can handle different types of noise. Other works have focused on developing ANC systems that cancel noise over specific spatial regions. In [58], ANC over a spatial region is addressed using spherical harmonics as basis functions to represent the noise field. Some embodiments use the concepts described there.

実施形態のいくつかは、音源拡張のための概念を使用する。 Some of the embodiments use concepts for sound source enhancement.

音声強調の文脈において、最も一般的な用途の１つは、ノイズによって損なわれた音声の強調である。多くの研究が、単一チャネル音声強調の位相処理に集中している［８］。深層ニューラルネットワークの観点から、音声のノイズ除去の問題は、［５９］のノイズ除去オートエンコーダ、［６０］の深層ニューラルネットワーク（ＤＮＮ：ｄｅｅｐｎｅｕｒａｌｎｅｔｗｏｒｋ）を使用したクリーンな音声とノイズの多い音声との間の非線形回帰問題、および［６１］の生成敵対ネットワーク（ＧＡＮ：ＧｅｎｅｒａｔｉｖｅＡｄｖｅｒｓａｒｉａｌＮｅｔｗｏｒｋｓ）を使用したエンドツーエンドシステムで対処されている。多くの場合、［６２］の場合のように、音声強調は自動音声認識（ＡＳＲ：ａｕｔｏｍａｔｉｃｓｐｅｅｃｈｒｅｃｏｇｎｉｔｉｏｎ）システムのフロントエンドとして適用され、ＬＳＴＭＲＮＮで音声強調にアプローチする。音声強調はまた、最初に音声を抽出し、次に分離音声信号に強調技術を適用することが考えられる音源分離手法と併せて行われることも多い［６］。本明細書に記載の概念は、いくつかの実施形態によって使用される。 In the context of speech enhancement, one of the most common applications is the enhancement of speech corrupted by noise. Many studies have focused on phase processing for single-channel speech enhancement [8]. From the perspective of deep neural networks, the problem of speech denoising has been addressed by denoising autoencoders in [59], by nonlinear regression problems between clean and noisy speech using deep neural networks (DNNs) in [60], and by end-to-end systems using Generative Adversarial Networks (GANs) in [61]. Often, speech enhancement is applied as a front-end of automatic speech recognition (ASR) systems, as in the case of [62], which approaches speech enhancement with an LSTM RNN. Speech enhancement is also often performed in conjunction with source separation techniques, which may involve first extracting the speech and then applying enhancement techniques to the separated speech signal [6]. The concepts described herein are used by some embodiments.

ほとんどの場合、音楽に関連するソース強化とは、音楽リミックスを作成するための用途を指す。多くの場合、音声が雑音源によってのみ損なわれると仮定される音声強調とは対照的に、音楽用途は、ほとんどの場合、他の音源（楽器）が強調されるべき音源と同時に再生されていると仮定する。このため、音楽リミックスの用途は、ソース分離ステージが先行するように常に提供される。例えば、［１０］では、混合物におけるより良好な音バランスを達成するために、リード伴奏および調波打楽器分離技術を適用することによって初期のジャズ録音がリミックスされた。同様に、［６３］は、歌唱音声とバッキングトラックの相対音量を変更するために異なる歌唱音声分離アルゴリズムの使用を研究し、最終的な混合物にわずかであるが可聴歪みを導入することによって６ｄＢの増加が可能であることを示した。［６４］において、著者らは、音源分離技術を適用して新しいミックスを達成することにより、蝸牛インプラントユーザの音楽知覚を向上させる方法を研究している。そこに記載された概念は、いくつかの実施形態によって使用される。 In most cases, source enhancement in the context of music refers to applications for creating music remixes. In contrast to speech enhancement, where it is often assumed that the speech is only impaired by noise sources, music applications almost always assume that other sound sources (musical instruments) are playing simultaneously with the source to be enhanced. For this reason, music remix applications are always preceded by a source separation stage. For example, in [10], early jazz recordings were remixed by applying lead-accompaniment and harmonic-percussive separation techniques to achieve a better sound balance in the mixture. Similarly, [63] studied the use of different singing voice separation algorithms to change the relative loudness of the singing voice and the backing track, showing that an increase of 6 dB is possible while introducing slight but audible distortion into the final mixture. In [64], the authors study how to improve the music perception of cochlear implant users by applying source separation techniques to achieve a new mix. The concepts described there are used by some embodiments.

選択的聴覚用途における最大の課題の１つは、処理時間に関する厳しい要件に関する。ユーザの自然さおよび知覚品質を維持するために、完全な処理ワークフローを最小限の遅延で実行する必要がある。システムの最大許容レイテンシは、用途および聴覚シーンの複雑さに大きく依存する。例えば、ＭｃＰｈｅｒｓｏｎらは、インタラクティブ音楽インターフェースの許容可能なレイテンシ基準として１０ｍｓを提案している［１２］。ネットワークを介した音楽パフォーマンスについて、［１３］の著者らは、遅延が２０～２５ｍｓ～５０～６０ｍｓの範囲で知覚可能になると報告している。しかしながら、能動的雑音制御／除去（ＡＮＣ）技術は、より良好な性能のために超低遅延処理を必要とする。これらのシステムでは、許容可能なレイテンシの量は、周波数および減衰の両方に依存するが、２００Ｈｚ未満の周波数の約５ｄＢの減衰に対して１ｍｓ程度に低くすることができる［１４］。ＳＨ用途における最後の考察は、修正された聴覚シーンの知覚品質を指す。様々な用途における音声品質の信頼できる評価のための方法論にかなりの量の作業が費やされてきた［１５］、［１６］、［１７］。しかしながら、ＳＨの課題は、処理の複雑さと知覚品質との間の明確なトレードオフを管理することである。いくつかの実施形態は、そこに記載されている概念を使用する。 One of the biggest challenges in selective hearing applications relates to the stringent requirements on processing time. To maintain naturalness and perceived quality for the user, the complete processing workflow needs to be performed with minimal delay. The maximum acceptable latency of the system depends heavily on the application and the complexity of the auditory scene. For example, McPherson et al. propose 10 ms as an acceptable latency criterion for interactive music interfaces [12]. For music performance over a network, the authors of [13] report that delays become perceptible in the range of 20-25 ms to 50-60 ms. However, active noise control/cancellation (ANC) techniques require ultra-low latency processing for better performance. In these systems, the amount of acceptable latency depends on both frequency and attenuation, but can be as low as 1 ms for an attenuation of about 5 dB for frequencies below 200 Hz [14]. A final consideration in SH applications refers to the perceived quality of the modified auditory scene. A significant amount of work has been devoted to methodologies for reliable assessment of speech quality in various applications [15], [16], [17]. However, a challenge in SH is managing the clear trade-off between processing complexity and perceptual quality. Some embodiments use the concepts described therein.
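
To relate the latency figures above to block-based processing, the following minimal sketch computes the buffering delay of a single audio frame and the largest frame that fits a given latency budget. The function names and the 48 kHz sample rate used in the usage note are assumptions for illustration; the real end-to-end latency of a system also includes computation, conversion, and transport delays that are not modelled here.

```python
def frame_latency_ms(frame_len, sample_rate_hz):
    """Buffering delay, in milliseconds, of collecting one frame of samples."""
    return 1000.0 * frame_len / sample_rate_hz


def max_frame_len(budget_ms, sample_rate_hz):
    """Largest frame length (in samples) whose buffering delay fits the budget."""
    return int(budget_ms * sample_rate_hz / 1000.0)
```

At an assumed 48 kHz, a 512-sample frame already exceeds a 10 ms budget, and a 1 ms budget leaves room for at most 48 samples per frame, which illustrates why frequency-domain framing is difficult to reconcile with ANC-like latency requirements.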

いくつかの実施形態は、［４１］に記載されているようなカウント／計算および位置特定、［２７］に記載されているような位置特定および検出、［６５］に記載されているような分離および分類、ならびに［６６］に記載されているような分離およびカウントのための概念を使用する。 Some embodiments use concepts for counting and localization as described in [41], localization and detection as described in [27], separation and classification as described in [65], and separation and counting as described in [66].

いくつかの実施形態は、［２５］、［２６］、［３２］、［３４］に記載されているように、現在の機械聴取方法のロバスト性を高めるための概念を使用し、新しい出現方向は、ドメイン適応［６７］および複数のデバイスで記録されたデータセットに対する訓練［６８］を含む。 Some embodiments use concepts to increase the robustness of current machine listening methods, as described in [25], [26], [32], [34], and new emerging directions include domain adaptation [67] and training on datasets recorded on multiple devices [68].

いくつかの実施形態は、生の波形を扱うことができる、［４８］に記載されているような機械聴取方法の計算効率を高めるための概念、または［３０］、［４５］、［５０］、［６１］に記載されている概念を使用する。 Some embodiments use concepts for increasing the computational efficiency of machine listening methods, as described in [48], or concepts described in [30], [45], [50], [61] that are able to handle raw waveforms.

いくつかの実施形態は、シーン内の音源を選択的に修正することができるように、組み合わされた方法で検出／分類／位置特定および分離／強調する統合最適化スキームを実現し、独立した検出、分離、位置特定、分類、および強調方法は信頼性が高く、ＳＨに必要な堅牢性および柔軟性を提供する。 Some embodiments provide an integrated optimization scheme that detects/classifies/localizes and separates/enhances in a combined manner so that sound sources in a scene can be selectively modified, while the independent detection, separation, localization, classification, and enhancement methods are reliable and provide the robustness and flexibility required for SH.

いくつかの実施形態は、アルゴリズムの複雑さと性能との間に良好なトレードオフがあるリアルタイム処理に適している。 Some embodiments are suitable for real-time processing with a good trade-off between algorithmic complexity and performance.

いくつかの実施形態は、ＡＮＣと機械聴取とを組み合わせる。例えば、聴覚シーンが最初に分類され、次いでＡＮＣが選択的に適用される。 Some embodiments combine ANC with machine hearing. For example, an auditory scene is first classified and then ANC is selectively applied.

さらなる実施形態を以下に提供する。 Further embodiments are provided below.

仮想音声対象物を用いて実際の聴覚環境を増強するために、音声対象物の各位置から部屋内の聴取者の各位置への伝達関数を十分に知らなければならない。 To augment a real auditory environment with virtual sound objects, the transfer functions from each location of the sound object to each location of the listener in the room must be well known.

伝達関数は、音源の特性、および対象物とユーザとの間の直接音、および部屋内で発生するすべての反射をマッピングする。聴取者が現在いる実際の室内音響効果のための正しい空間音声再生を保証するために、伝達関数はさらに、聴取者室内の室内音響特性を十分な精度でマッピングする必要がある。 The transfer function maps the characteristics of the sound source and the direct sound between the object and the user, as well as all reflections that occur in the room. To ensure correct spatial audio reproduction for the actual room acoustics in which the listener is currently located, the transfer function must additionally map with sufficient accuracy the room acoustics characteristics in the listener's room.

部屋の異なる位置にある個々の音声対象物の表現に適した音声システムでは、多数の音声対象物が存在する場合の課題は、個々の音声対象物の適切な検出および分離である。また、各対象物の音声信号は、部屋の録音位置または聴取位置で重なっている。室内音響効果および音声信号のオーバーラップは、室内内の対象物および／または聴取位置が変化するときに変化する。 In an audio system suitable for the representation of individual audio objects at different locations in a room, the challenge is proper detection and separation of the individual audio objects when multiple audio objects are present, and the audio signals of each object overlap at the recording or listening position in the room. The room acoustics and audio signal overlap change when the objects and/or listening positions in the room change.

相対的な動きにより、室内音響効果パラメータの推定は、十分に迅速に実行されなければならない。ここで、推定の低レイテンシは高精度よりも重要である。音源および受信機の位置が変化しない場合（静的な場合）、高い精度が要求される。提案されたシステムでは、室内音響パラメータ、ならびに室内幾何学形状および聴取者位置は、音声信号のストリームから推定または抽出される。音声信号は、音源（複数可）および受信機（複数可）が任意の方向に移動することができ、音源（複数可）および／または受信機（複数可）がそれらの向きを任意に変更することができる実際の環境で記録される。 Due to the relative motion, the estimation of the room acoustics parameters must be performed quickly enough. Here, low estimation latency is more important than high accuracy. High accuracy is required when the positions of the sound sources and receivers do not change (static case). In the proposed system, the room acoustic parameters, as well as the room geometry and listener positions, are estimated or extracted from a stream of audio signals. The audio signals are recorded in a real environment, where the sound source(s) and receiver(s) can move in any direction and the sound source(s) and/or receiver(s) can arbitrarily change their orientation.

音声信号ストリームは、１つまたは複数のマイクロホンを含む任意のマイクロホンセットアップの結果であってもよい。ストリームは、前処理および／またはさらなる分析のために信号処理段に供給される。次に、出力は特徴抽出段階に供給される。この段階は、例えばＴ６０（残響時間）、ＤＲＲ（Ｄｉｒｅｃｔ－ｔｏ－ＲｅｖｅｒｂｅｒａｎｔＲａｔｉｏ）などの室内音響パラメータを推定する。 The audio signal stream may be the result of any microphone setup, including one or more microphones. The stream is fed to a signal processing stage for pre-processing and/or further analysis. The output is then fed to a feature extraction stage, which estimates room acoustic parameters, e.g. T60 (reverberation time), DRR (direct-to-reverberant ratio), etc.
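
For reference, the two parameters named above can be computed from a known room impulse response (RIR) as in the following minimal sketch. Note the assumption: this is the classical non-blind calculation (Schroeder backward integration with a line fit, and a direct/reverberant energy split), not the blind estimation from a microphone stream that the described system performs. The function names, the -5 dB to -25 dB fitting range, and the 2.5 ms direct-sound window are illustrative choices.

```python
import math


def energy_decay_curve_db(rir):
    """Schroeder backward integration of the squared RIR, normalised to 0 dB."""
    edc, total = [], 0.0
    for h in reversed(rir):
        total += h * h
        edc.append(total)
    edc.reverse()
    return [10.0 * math.log10(v / edc[0]) for v in edc if v > 0.0]


def estimate_t60(rir, sample_rate_hz, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay curve and extrapolate to a 60 dB decay (seconds)."""
    edc_db = energy_decay_curve_db(rir)
    pts = [(i / sample_rate_hz, d) for i, d in enumerate(edc_db)
           if end_db <= d <= start_db]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - mean_t) * (d - mean_d) for t, d in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))  # dB per second
    return -60.0 / slope


def estimate_drr_db(rir, sample_rate_hz, direct_window_ms=2.5):
    """Direct-to-reverberant ratio: energy around the main peak vs. the rest."""
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    half = int(direct_window_ms * sample_rate_hz / 1000.0)
    lo, hi = max(0, peak - half), peak + half + 1
    direct = sum(h * h for h in rir[lo:hi])
    reverb = sum(h * h for h in rir[hi:])
    return 10.0 * math.log10(direct / reverb)
```

Values obtained this way from measured or simulated RIRs are the kind of ground truth that could serve as training targets for a blind estimator.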

第２のデータストリームは、マイクロホン設定の向きおよび位置を取り込む６ＤｏＦセンサ（「６自由度」：室内位置と視線方向の３次元）によって生成される。位置データストリームは、前処理またはさらなる分析のために６ＤｏＦ信号処理段に供給される。 The second data stream is generated by a 6DoF sensor ("six degrees of freedom": three dimensions each for the position in the room and the viewing direction) that captures the orientation and position of the microphone setup. The position data stream is fed to a 6DoF signal processing stage for pre-processing or further analysis.

６ＤｏＦ信号処理の出力、音声特徴抽出段、および前処理されたマイクロホンストリームは、機械学習ブロックに供給され、機械学習ブロックでは、聴覚空間、すなわち聴取室内（サイズ、幾何学的形状、反射面）、および室内内のマイクロホンフィールドの位置が推定される。さらに、よりロバストな推定を可能にするために、ユーザ挙動モデルが適用される。このモデルは、人間の動き（例えば、連続移動、速度など。）の制限、ならびに様々なタイプの動きの確率分布を考慮する。 The output of the 6DoF signal processing, the audio feature extraction stage, and the preprocessed microphone streams are fed into a machine learning block, where the auditory space, i.e. the listening room (size, geometry, reflective surfaces) and the position of the microphone field within the room, are estimated. Furthermore, to enable more robust estimation, a user behavior model is applied. This model takes into account the limitations of human motion (e.g. continuous movement, speed, etc.) as well as the probability distribution of different types of motion.
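
One simple way to picture the movement limits of such a user behavior model is a speed constraint on successive position estimates. The following is a minimal sketch under that assumption only; the function name and the 2 m/s default are hypothetical, and the probability distributions over movement types mentioned above are not modelled.

```python
import math


def constrain_position(prev_pos, new_pos, dt_s, max_speed_mps=2.0):
    """Clamp a new position estimate so the implied speed stays plausible.

    prev_pos, new_pos: (x, y, z) in metres; dt_s: time between estimates.
    If the jump is too large for a human to have made in dt_s, the estimate
    is pulled back along the same direction to the maximum reachable point.
    """
    delta = [n - p for n, p in zip(new_pos, prev_pos)]
    dist = math.sqrt(sum(d * d for d in delta))
    max_dist = max_speed_mps * dt_s
    if dist <= max_dist:
        return tuple(new_pos)
    scale = max_dist / dist
    return tuple(p + d * scale for p, d in zip(prev_pos, delta))
```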

実施形態のいくつかは、任意のマイクロホン配置を使用し、ユーザの位置および姿勢情報を追加することによって、ならびに機械学習方法を用いたデータの解析によって、室内音響パラメータのブラインド推定を実現する。 Some embodiments achieve blind estimation of room acoustic parameters using arbitrary microphone arrangements and by adding user position and pose information, as well as by analyzing the data using machine learning methods.

例えば、実施形態によるシステムは、音響拡張現実（ＡＡＲ：ａｃｏｕｓｔｉｃａｌｌｙａｕｇｍｅｎｔｅｄｒｅａｌｉｔｙ）に使用することができる。この場合、推定されたパラメータから仮想室内インパルス応答を合成しなければならない。 For example, systems according to embodiments can be used for acoustically augmented reality (AAR), where a virtual room impulse response must be synthesized from estimated parameters.

いくつかの実施形態は、記録された信号からの残響の除去を含む。そのような実施形態の例は、正常な聴覚の人々および聴覚障害の人々のための補聴器である。この場合、推定されたパラメータの助けを借りて、マイクロホン設定の入力信号から残響を除去することができる。 Some embodiments include the removal of reverberation from the recorded signal. An example of such an embodiment is hearing aids for people with normal hearing and for people with hearing impairments. In this case, with the help of estimated parameters, reverberation can be removed from the input signal of the microphone setup.

さらなる用途は、現在の聴覚空間以外の部屋で生成された音声シーンの空間合成である。この目的のために、音声シーンの一部である室内音響効果パラメータは、聴覚空間の室内音響効果パラメータに対して適合される。 A further application is the spatial synthesis of sound scenes generated in rooms other than the current auditory space. For this purpose, the room acoustics parameters that are part of the sound scene are adapted to the room acoustics parameters of the auditory space.

バイノーラル合成の場合、この目的のために、利用可能なＢＲＩＲは、聴覚空間の異なる音響パラメータに適合される。 In the case of binaural synthesis, for this purpose the available BRIRs are adapted to the different acoustic parameters of the auditory space.

一実施形態では、１つまたは複数の室内音響効果パラメータを決定するための装置が提供される。 In one embodiment, an apparatus for determining one or more room acoustics parameters is provided.

装置は、１つまたは複数のマイクロホン信号を含むマイクロホンデータを取得するように構成される。 The device is configured to acquire microphone data including one or more microphone signals.

さらに、装置は、ユーザの位置および／または向きに関する追跡データを取得するように構成される。 Further, the device is configured to obtain tracking data regarding the user's position and/or orientation.

さらに、装置は、マイクロホンデータおよび追跡データに応じて１つまたは複数の室内音響効果パラメータを決定するように構成される。 Additionally, the apparatus is configured to determine one or more room acoustics parameters as a function of the microphone data and the tracking data.

一実施形態によれば、例えば、装置は、マイクロホンデータおよび追跡データに応じて１つまたは複数の室内音響効果パラメータを決定するために機械学習を使用するように構成されてもよい。 According to one embodiment, for example, the device may be configured to use machine learning to determine one or more room acoustics parameters as a function of the microphone data and the tracking data.

実施形態では、例えば、装置は、装置がニューラルネットワークを使用するように構成され得るという点で、機械学習を使用するように構成され得る。 In an embodiment, for example, the device may be configured to use machine learning, in that the device may be configured to use a neural network.

一実施形態によれば、例えば、装置は、機械学習のためにクラウドベースの処理を使用するように構成されてもよい。 According to one embodiment, for example, the device may be configured to use cloud-based processing for machine learning.

一実施形態では、例えば、１つまたは複数の室内音響効果パラメータは、残響時間を含み得る。 In one embodiment, for example, one or more room acoustics parameters may include reverberation time.

一実施形態によれば、例えば、１つまたは複数の室内音響効果パラメータは、指向性対残響比を含み得る。 According to one embodiment, for example, the one or more room acoustics parameters may include a direct-to-reverberant ratio.

一実施形態では、例えば、追跡データは、ユーザの位置をラベル付けするためのｘ座標、ｙ座標、およびｚ座標を含み得る。 In one embodiment, for example, the tracking data may include x, y, and z coordinates to label the user's location.

実施形態によれば、例えば、追跡データは、ユーザの向きをラベル付けするためのピッチ座標、ヨー座標、およびロール座標を含み得る。 According to an embodiment, for example, tracking data may include pitch, yaw, and roll coordinates to label the user's orientation.
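
The tracking data described in the two preceding paragraphs can be pictured as a small record with six fields. The following is a minimal sketch; the class name, the method name, and the units (metres and degrees) are assumptions for illustration and are not specified in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose6DoF:
    """One tracking sample: position plus orientation (six degrees of freedom)."""
    x: float      # position, assumed to be in metres
    y: float
    z: float
    pitch: float  # orientation, assumed to be in degrees
    yaw: float
    roll: float

    def as_vector(self):
        """Flatten to a list, e.g. as input features for a learning stage."""
        return [self.x, self.y, self.z, self.pitch, self.yaw, self.roll]
```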

実施形態では、例えば、装置は、１つまたは複数のマイクロホン信号を時間領域から周波数領域に変換するように構成されてもよく、例えば、装置は、周波数領域における１つまたは複数のマイクロホン信号の１つまたは複数の特徴を抽出するように構成されてもよく、装置は、１つまたは複数の特徴に応じて１つまたは複数の室内音響パラメータを決定するように構成されてもよい。 In an embodiment, for example, the device may be configured to transform one or more microphone signals from the time domain to the frequency domain, e.g., the device may be configured to extract one or more features of the one or more microphone signals in the frequency domain, and the device may be configured to determine one or more room acoustic parameters as a function of the one or more features.
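
The time-to-frequency transformation and feature extraction described above can be sketched as follows for a single frame of one microphone signal. The source does not specify which features are extracted; the spectral centroid is used here purely as an assumed example, and the function names and the naive DFT are illustrative (a practical system would use an FFT library).

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Hann-window one frame and return the magnitudes of bins 0..N/2."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectral_centroid_hz(frame, sample_rate_hz):
    """Magnitude-weighted mean frequency of the frame (example feature)."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0.0:
        return 0.0
    n = len(frame)
    return sum(k * sample_rate_hz / n * m for k, m in enumerate(mags)) / total
```

Features computed per frame and per microphone could then be stacked, together with the tracking data, into the input of the parameter-estimation stage.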

一実施形態によれば、例えば、装置は、１つまたは複数の特徴を抽出するためにクラウドベースの処理を使用するように構成されてもよい。 According to one embodiment, for example, the device may be configured to use cloud-based processing to extract one or more features.

一実施形態では、例えば、装置は、いくつかのマイクロホン信号を記録するためのいくつかのマイクロホンのマイクロホン構成を含み得る。 In one embodiment, for example, the device may include a microphone arrangement of several microphones for recording several microphone signals.

一実施形態によれば、例えば、マイクロホン構成は、ユーザの身体に装着されるように構成されてもよい。 According to one embodiment, for example, the microphone arrangement may be configured to be worn on the user's body.

実施形態では、例えば、上述のシステムは、１つまたは複数の室内音響効果パラメータを決定するための上述の装置をさらに含み得る。 In an embodiment, for example, the above-mentioned system may further include the above-mentioned apparatus for determining one or more room acoustics parameters.

一実施形態によれば、例えば、信号部分修正器１４０は、および／または、前記信号発生器１５０は、前記１つまたは複数の室内音響効果パラメータのうちの前記少なくとも一方に応じて、前記１つまたは複数の音源の各音源について、前記複数のバイノーラル室内インパルス応答のうちの少なくとも一方の生成を実行するように構成されてもよい。 According to one embodiment, for example, the signal portion modifier 140 and/or the signal generator 150 may be configured to perform the generation of at least one of the plurality of binaural room impulse responses for each of the one or more sound sources depending on the at least one of the one or more room acoustics parameters.

図７は、５つのサブシステム（サブシステム１～５）を含む、一実施形態によるシステムを示す。 Figure 7 shows a system according to one embodiment, which includes five subsystems (subsystems 1-5).

サブシステム１は、１つ、２つ、またはそれ以上の個々のマイクロホンのマイクロホンセットアップを含み、１つまたは複数のマイクロホンが利用可能であれば、これらを組み合わせてマイクロホンフィールドにすることができる。マイクロホン／マイクロホンの互いに対する位置決めおよび相対配置は任意であり得る。マイクロホン構成は、ユーザによって装着されたデバイスの一部であってもよく、または対象の部屋に配置された別個のデバイスであってもよい。 Subsystem 1 includes a microphone setup of one, two, or more individual microphones, which can be combined into a microphone array if more than one microphone is available. The positioning and relative arrangement of the microphones with respect to each other can be arbitrary. The microphone arrangement can be part of a device worn by the user or can be a separate device placed in the room of interest.

また、サブシステム１は、部屋内におけるユーザの並進位置およびユーザの頭部姿勢を計測する追跡デバイスを備える。６ＤｏＦ（ｘ座標、ｙ座標、ｚ座標、ピッチ角、ヨー角、ロール角）まで測定することができる。 Subsystem 1 also includes a tracking device that measures the user's translational position in the room and the user's head pose. It can measure up to 6DoF (x, y, z, pitch, yaw, roll).
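
One way to represent such a tracking sample is a record with up to six optional components, so that devices measuring fewer degrees of freedom can use the same structure. The class name, field names, and units below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrackingSample:
    """One tracking measurement; components that are not measured stay None."""

    timestamp_s: float
    x_m: Optional[float] = None
    y_m: Optional[float] = None
    z_m: Optional[float] = None
    pitch_deg: Optional[float] = None
    yaw_deg: Optional[float] = None
    roll_deg: Optional[float] = None

    def measured_dof(self) -> int:
        """Count how many of the six degrees of freedom this sample carries."""
        components = (self.x_m, self.y_m, self.z_m, self.pitch_deg, self.yaw_deg, self.roll_deg)
        return sum(value is not None for value in components)
```

For instance, a head tracker that reports position and yaw only would produce samples for which `measured_dof()` returns 4.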

追跡デバイスは、ユーザの頭部に配置されてもよく、または必要なＤｏＦを測定するためにいくつかのサブデバイスに分割されてもよく、ユーザに配置されてもされなくてもよい。 The tracking device may be placed on the user's head or may be split into several sub-devices to measure the required DoF, and may or may not be placed on the user.

したがって、サブシステム１は、マイクロホン信号入力インターフェース１０１および位置情報入力インターフェース１０２を含む入力インターフェースを表す。 Thus, subsystem 1 represents an input interface including a microphone signal input interface 101 and a position information input interface 102.

サブシステム２は、記録されたマイクロホン信号の信号処理を含む。これは、周波数変換および／または時間領域ベースの処理を含む。さらに、これは、異なるマイクロホン信号を組み合わせてフィールド処理を実現するための方法を含む。サブシステム２における信号処理のパラメータを適合させるために、システム４からのフィードバックが可能である。マイクロホン信号の信号処理ブロックは、マイクロホンが組み込まれているデバイスの一部であってもよく、または別個のデバイスの一部であってもよい。また、クラウドベースの処理の一部であってもよい。 Subsystem 2 includes signal processing of the recorded microphone signals. This includes frequency transformation and/or time-domain-based processing. Furthermore, it includes methods for combining different microphone signals to realize array processing. Feedback from subsystem 4 is possible in order to adapt the parameters of the signal processing in subsystem 2. The signal processing block for the microphone signals may be part of the device in which the microphones are embedded or may be part of a separate device. It may also be part of cloud-based processing.

さらに、サブシステム２は、記録された追跡データのための信号処理を含む。これは、周波数変換および／または時間領域ベースの処理を含む。さらに、ノイズ抑制、平滑化、補間、および外挿を使用することによって信号の技術的品質を向上させる方法が含まれる。さらに、より高いレベルの情報を導出するための方法を含む。これには、速度、加速度、経路方向、アイドル時間、移動範囲、および移動経路が含まれる。また、近い将来の移動経路や、近い将来の速度の予測を含む。追跡信号の信号処理ブロックは、追跡デバイスの一部であってもよいし、別個のデバイスの一部であってもよい。また、クラウドベースの処理の一部であってもよい。 Furthermore, the subsystem 2 includes signal processing for the recorded tracking data. This includes frequency transformation and/or time domain based processing. Furthermore, it includes methods to improve the technical quality of the signal by using noise suppression, smoothing, interpolation and extrapolation. Furthermore, it includes methods to derive higher level information. This includes speed, acceleration, path direction, idle time, movement range and movement path. It also includes prediction of near future movement path and near future speed. The signal processing block of the tracking signal may be part of the tracking device or may be part of a separate device. It may also be part of cloud based processing.
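
A few of the operations listed in this paragraph (smoothing, deriving velocity and speed, and predicting a near-future position) can be sketched with elementary formulas. Exponential smoothing and constant-velocity extrapolation are assumed here purely for illustration; the description does not name specific algorithms, and all function names are hypothetical.

```python
import math


def smooth(positions, alpha=0.5):
    """Exponentially smooth a sequence of (x, y, z) positions."""
    if not positions:
        return []
    smoothed = [tuple(positions[0])]
    for current in positions[1:]:
        previous = smoothed[-1]
        smoothed.append(tuple(alpha * c + (1.0 - alpha) * p for c, p in zip(current, previous)))
    return smoothed


def velocity(p_prev, p_curr, dt_s):
    """Finite-difference velocity between two consecutive positions."""
    if dt_s <= 0:
        raise ValueError("time step must be positive")
    return tuple((c - p) / dt_s for c, p in zip(p_curr, p_prev))


def speed(v):
    """Magnitude of a velocity vector."""
    return math.sqrt(sum(component * component for component in v))


def predict_position(p_curr, v, horizon_s):
    """Constant-velocity extrapolation of the position horizon_s seconds ahead."""
    return tuple(c + vc * horizon_s for c, vc in zip(p_curr, v))
```

The predicted position is the kind of near-future information that a later stage could use to estimate acoustic parameters ahead of time.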

サブシステム３は、処理されたマイクロホンの特徴の抽出を含む。 Subsystem 3 includes extraction of processed microphone features.

特徴抽出ブロックは、ユーザのウェアラブルデバイスの一部であってもよいし、別個のデバイスの一部であってもよい。また、クラウドベースの処理の一部であってもよい。 The feature extraction block may be part of the user's wearable device, or it may be part of a separate device, or it may be part of cloud-based processing.

サブシステム２および３は、それらのモジュール１１１および１２１と共に、例えば、検出器１１０、音声タイプ分類器１３０、および信号部分修正器１４０を実現する。例えば、サブシステム３、モジュール１２１は、音声分類の結果をサブシステム２、モジュール１１１（フィードバック）に出力することができる。例えば、サブシステム２、モジュール１１２は、位置決定器１２０を実現する。さらに、一実施形態では、サブシステム２および３は、例えば、サブシステム２、モジュール１１１がバイノーラル室内インパルス応答およびラウドスピーカ信号を生成することによって、信号発生器１５０を実現することもできる。 Subsystems 2 and 3, together with their modules 111 and 121, for example, implement detector 110, voice type classifier 130, and signal portion modifier 140. For example, subsystem 3, module 121 may output the result of voice classification to subsystem 2, module 111 (feedback). For example, subsystem 2, module 112 implements position determiner 120. Furthermore, in one embodiment, subsystems 2 and 3 may also implement signal generator 150, for example by subsystem 2, module 111 generating binaural room impulse response and loudspeaker signals.

サブシステム４は、処理されたマイクロホン信号、抽出されたマイクロホン信号の特徴、および処理された追跡データを使用して室内音響パラメータを推定する方法およびアルゴリズムを含む。このブロックの出力は、アイドルデータとしての室内音響特性パラメータ、ならびにサブシステム２におけるマイクロホン信号処理のパラメータの制御および変動である。機械学習ブロック１３１は、ユーザのデバイスの一部であってもよいし、別個のデバイスの一部であってもよい。また、クラウドベースの処理の一部であってもよい。 Subsystem 4 contains methods and algorithms that estimate room acoustic parameters using the processed microphone signals, the features extracted from the microphone signals, and the processed tracking data. The output of this block is the room acoustic parameters as raw data, as well as the control and variation of the parameters of the microphone signal processing in subsystem 2. Machine learning block 131 may be part of the user's device or part of a separate device. It may also be part of cloud-based processing.

さらに、サブシステム４は、室内音響アイドルデータパラメータの後処理を含む（例えばブロック１３２において）。これには、外れ値の検出、個々のパラメータの新しいパラメータへの組み合わせ、平滑化、外挿、補間、および妥当性検証が含まれる。また、このブロックは、サブシステム２から情報を取得する。これは、近い将来の音響パラメータを推定するために、部屋内のユーザの近い将来の位置を含む。このブロックは、ユーザのデバイスの一部であってもよいし、別個のデバイスの一部であってもよい。また、クラウドベースの処理の一部であってもよい。 Furthermore, subsystem 4 includes post-processing of the raw room acoustic parameter data (e.g., in block 132). This includes outlier detection, combining individual parameters into new parameters, smoothing, extrapolation, interpolation, and plausibility checking. This block also receives information from subsystem 2, including the user's near-future position in the room, in order to estimate near-future acoustic parameters. This block may be part of the user's device or part of a separate device. It may also be part of cloud-based processing.
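
Two of the post-processing steps named here, outlier detection and smoothing, can be illustrated on a short series of parameter estimates. The sketch assumes a median-based outlier test (modified z-score with the conventional 3.5 threshold) and a centered moving average; neither choice comes from the description, and the function names are hypothetical.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Mark values whose modified z-score (based on the median) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values equal the median; nothing is flagged in this sketch.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def clean_estimates(values, threshold=3.5):
    """Replace flagged outliers with the median of the remaining values."""
    flags = flag_outliers(values, threshold)
    inliers = [v for v, bad in zip(values, flags) if not bad]
    fallback = statistics.median(inliers)
    return [fallback if bad else v for v, bad in zip(values, flags)]


def moving_average(values, width=3):
    """Centered moving average; the window shrinks at both ends of the series."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Applied to a series of reverberation-time estimates such as `[0.50, 0.52, 0.49, 2.0, 0.51]`, the isolated value 2.0 is flagged and replaced before smoothing.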

サブシステム５は、下流システム（例えば、メモリ１４１において、）のための室内音響効果パラメータの記憶および割り当てを含む。パラメータの割り当ては、ジャストインタイムで実現されてもよく、および／または時間応答が格納されてもよい。記憶は、ユーザ上またはユーザの近くに位置するデバイス内で実行されてもよく、またはクラウドベースのシステム内で実行されてもよい。 Subsystem 5 includes the storage and assignment of the room acoustics parameters for downstream systems (e.g., in memory 141). The parameters may be assigned just in time, and/or their time history may be stored. Storage may be performed in a device located on or near the user, or in a cloud-based system.

本発明の実施形態のユースケースを以下に説明する。 A use case for an embodiment of the present invention is described below.

一実施形態のユースケースは、家庭娯楽であり、家庭環境のユーザに関する。例えば、ユーザは、ＴＶ、ラジオ、ＰＣ、タブレットなどの特定の再生デバイスに集中したいと考え、他の外乱源（他のユーザや子供のデバイス、工事騒音、街頭騒音）を抑制したいと考える。この場合、ユーザは、好適な再生デバイスの近くに位置し、デバイスまたはその位置を選択する。ユーザの位置にかかわらず、選択されたデバイス、または音源位置は、ユーザが自分の選択をキャンセルするまで音響的に強調される。 One embodiment use case is home entertainment, involving a user in a home environment. For example, a user wants to focus on a particular playback device, such as a TV, radio, PC, tablet, etc., and wants to suppress other sources of disturbance (other users' or children's devices, construction noise, street noise). In this case, the user positions himself close to the preferred playback device and selects the device or its location. Regardless of the user's location, the selected device, or sound source location, is acoustically emphasized until the user cancels his selection.

例えば、ユーザは、対象音源の近くに移動する。ユーザは適切なインターフェースを介して対象音源を選択し、ヒアラブルはそれに応じて、ユーザ位置、ユーザの視線方向、および対象音源に基づいて音声再生を適合させて、雑音が妨害される場合でも対象音源を十分に理解できるようにする。 For example, the user moves closer to a target sound source. The user selects the target sound source via an appropriate interface, and the hearable accordingly adapts the sound playback based on the user position, the user's gaze direction, and the target sound source, ensuring that the target sound source is well understood even in the presence of noise interference.

あるいは、ユーザは、特に妨害する音源の近くを移動する。ユーザは適切なインターフェースを介してこの妨害音源を選択し、ヒアラブル（聴覚デバイス）はそれに応じて、ユーザ位置、ユーザの視線方向、および妨害音源に基づいて音声再生を調整して、妨害音源を明示的に調整する。 Alternatively, the user moves close to a particularly disturbing sound source. The user selects this disturbing sound source via an appropriate interface, and the hearable (hearing device) accordingly adapts the audio playback based on the user's position, the user's gaze direction, and the disturbing sound source in order to explicitly tune out the disturbing sound source.

さらなる実施形態のさらなるユースケースは、ユーザがいくつかのスピーカの間に位置するカクテルパーティーである。 A further use case of a further embodiment is a cocktail party where the user is located among several talkers.

多くの話者の存在下では、例えば、ユーザは、それらのうちの１つ（または複数）に集中することを望み、他の外乱源を調整または減衰させることを望む。このユースケースでは、ヒアラブルの制御は、ユーザからの対話をほとんど必要としないはずである。バイオシグナルまたは会話困難の検出可能な指標（頻出する質問、外国語、強いダイアモンド語）に基づく選択性の強度の制御は任意であろう。 In the presence of many talkers, for example, the user may wish to focus on one (or more) of them and tune out or attenuate other sources of disturbance. In this use case, control of the hearable should require little interaction from the user. Controlling the strength of the selectivity based on biosignals or on detectable indicators of conversation difficulties (frequent questions, foreign languages, strong dialects) would be optional.

例えば、話者はランダムに分布し、聴取者に対して相対的に移動する。さらに、音声の周期的な一時停止があり、新しい話者が追加され、または他の話者がシーンを離れる。おそらく、音楽などの外乱の音は比較的大きい。選択された話者は、音響的に強調され、発話の一時停止、自身の位置または姿勢の変化の後に再び認識される。 For example, speakers are randomly distributed and move relative to the listener. In addition, there are periodic pauses in speech, new speakers are added or others leave the scene. Perhaps external disturbing sounds, such as music, are relatively loud. The selected speaker is acoustically highlighted and is recognized again after a pause in speech, a change in his or her position or posture.

例えば、ヒアラブルは、ユーザの近傍の話者を認識する。適切な制御可能性（例えば、視線方向、注意制御）により、ユーザは好ましい話者を選択することができる。ヒアラブルは、ユーザの視線方向および選択された対象音源に応じて音声再生を適応させることにより、騒音が妨害される場合であっても対象音源を十分に理解することができる。 For example, the hearable recognizes speakers in the user's vicinity. With appropriate controllability (e.g., gaze direction, attention control) the user can select a preferred speaker. The hearable adapts the audio playback according to the user's gaze direction and the selected target sound source, allowing the target sound source to be fully understood even in the presence of noise interference.

あるいは、ユーザが（以前は）好ましくない話者によって直接アドレス指定されている場合、自然なコミュニケーションを確実にするために、ユーザは少なくとも可聴でなければならない。 Alternatively, if the user is directly addressed by a (previously) non-preferred speaker, that speaker must at least be audible in order to ensure natural communication.

別の実施形態の別のユースケースは、ユーザが自分の（または）自動車に位置する自動車におけるものである。運転中、ユーザは、妨害雑音（風、モータ、乗客）の隣でそれらをよりよく理解することができるように、ナビゲーションデバイス、ラジオ、または会話相手などの特定の再生デバイスに自分の音響的注意を能動的に向けることを望む。 Another use case of another embodiment is in an automobile, where the user is located in his/her own car (or in another car). While driving, the user wants to actively direct his/her acoustic attention to specific playback devices such as a navigation device, a radio, or a conversation partner, so that he/she can understand them better despite interfering noise (wind, engine, passengers).

例えば、ユーザおよび目標音源は、自動車内の固定位置に配置される。ユーザは基準システムに対して静止しているが、車両自体は動いている。これには、適応追跡解決策が必要である。選択された音源位置は、ユーザが選択をキャンセルするまで、または警告信号がデバイスの機能を停止するまで音響的に強調される。 For example, the user and the target sound source are placed at fixed locations in a car. The user is stationary relative to the reference system, but the vehicle itself is moving. This requires an adaptive tracking solution. The selected sound source location is acoustically highlighted until the user cancels the selection or a warning signal stops the device functioning.

例えば、ユーザが自動車に乗車し、デバイスによって周囲が検出される。適切な制御可能性（例えば、速度認識）により、ユーザは対象音源を切り替えることができ、ヒアラブルは、ノイズが妨害される場合でも対象音源を十分に理解できるように、ユーザの視線方向および選択された対象音源に応じて音声再生を適合させる。 For example, a user enters a car and the surroundings are detected by the device. With appropriate controllability (e.g., speech recognition), the user can switch between target sound sources, and the hearable adapts the sound playback according to the user's gaze direction and the selected target sound source so that the target sound source can be understood well even in the presence of interfering noise.

あるいは、例えば、交通関連の警告信号が通常の流れを中断し、ユーザの選択をキャンセルする。その後、通常の流れの再開が実行される。 Alternatively, for example, a traffic-related warning signal interrupts the normal flow and cancels the user's selection. Resumption of normal flow is then performed.

さらなる実施形態の別の使用事例は、ライブ音楽であり、ライブ音楽イベントにおけるゲストに関する。例えば、コンサートまたはライブ音楽のパフォーマンスにおけるゲストは、聞き取れる人の助けを借りてパフォーマンスへの集中力を高めたいと望み、妨害的に行動する他のゲストを無視したいと望む。さらに、例えば、好ましくない聴取位置または室内音響効果のバランスをとるために、音声信号自体を最適化することができる。 Another use case of a further embodiment is live music and concerns guests at a live music event. For example, a guest at a concert or live music performance may wish to increase his/her focus on the performance with the help of the hearable and to tune out other guests who behave in a disruptive manner. Furthermore, the audio signal itself may be optimized, for example to compensate for an unfavorable listening position or unfavorable room acoustics.

例えば、ユーザは、多くの外乱源の間に位置する。しかしながら、ほとんどの場合、性能は比較的大きい。対象音源は、固定された位置または少なくとも所定の領域に配置されるが、ユーザは非常に移動しやすい（例えば、ユーザはダンスをしていてもよい。）。選択された音源位置は、ユーザが選択をキャンセルするまで、または警告信号がデバイスの機能を停止するまで音響的に強調される。 For example, the user is located among many sources of disturbance. In most cases, however, the performance is relatively loud. The target sound sources are located at fixed positions or at least in a defined area, while the user may be highly mobile (e.g., the user may be dancing). The selected sound source position is acoustically emphasized until the user cancels the selection or until a warning signal interrupts the function of the device.

例えば、ユーザは、ステージエリアまたはミュージシャンを対象音源として選択する。適切な制御可能性により、ユーザは、ステージ／ミュージシャンの位置を定義することができ、ヒアラブルは、ノイズが妨害される場合であっても対象音源を十分に理解することができるように、ユーザの視線方向および選択された対象音源に従って音声再生を適合させる。 For example, the user selects a stage area or musicians as the target sound source. With appropriate controllability, the user can define the stage/musician position and the hearable adapts the sound playback according to the user's gaze direction and the selected target sound source so that the target sound source can be fully understood even in the case of noise interference.

あるいは、例えば、警告情報（例えば、屋外イベントの場合の避難、近づきつつある雷雨）および警告信号は、通常の流れを中断し、ユーザの選択をキャンセルすることができる。その後、通常の流れの再開がある。 Alternatively, for example, warning information (e.g., evacuation in case of an outdoor event, oncoming thunderstorm) and warning signals can interrupt the normal flow and cancel the user's selection, after which there is a resumption of the normal flow.

別の実施形態のさらなるユースケースは、主要イベントであり、主要イベントにおけるゲストに関する。したがって、主要イベント（例えば、フットボール競技場、アイスホッケー競技場、大型コンサートホールなど。）では、ヒアラブルを使用して、そうでなければ群衆の騒音にかき消される家族や友人の声を強調することができる。 A further use case of another embodiment is for major events and guests at major events. Thus, at major events (e.g., football stadiums, ice hockey stadiums, large concert halls, etc.), hearables can be used to highlight the voices of family and friends that would otherwise be drowned out by crowd noise.

例えば、スタジアムや大きなコンサートホールでは、多くの出席者が集まる大きなイベントが行われる。グループ（家族、友人、学校の授業）は、イベントに参加し、大群衆が歩き回るイベント場所の外側または中に位置する。１人または複数の子供は、グループとの眼の接触を失い、ノイズに起因する高いノイズレベルにもかかわらず、グループを求める。その後、ユーザは音声認識をオフにし、ヒアラブルは音声を増幅しなくなる。 For example, a major event with many attendees takes place in a stadium or a large concert hall. A group (family, friends, school class) attends the event and is located outside or inside the event venue, where a large crowd is walking around. One or more children lose eye contact with the group and call out for the group despite the high noise level. The user then turns off the voice recognition, and the hearable no longer amplifies the voice.

例えば、グループの人物は、ヒアラブルで、迷子の子供の音声を選択する。ヒアラブルは音声を見つける。そして、ヒアラブルは、音声を増幅し、ユーザは、増幅された音声に基づいて、迷子を回復することができる（より早く）。 For example, a person in the group selects the lost child's voice on the hearable. The hearable finds the voice. The hearable then amplifies the voice and the user can recover the lost child (faster) based on the amplified voice.

あるいは、例えば、行方不明の子供もヒアラブルを装着し、親の声を選択する。ヒアラブルは、親の音声を増幅する。増幅により、子供は両親の位置を突き止めることができる。これにより、子供は、歩いて親に戻ることができる。あるいは、例えば、行方不明の子供もヒアラブルを装着し、親の声を選択する。ヒアラブルは、親の音声（複数可）を見つけ、その音声までの距離をアナウンスする。これにより、子供は親を見つけやすくなる。任意選択的に、距離の告知のためにヒアラブルからの人工音声の再生が提供されてもよい。 Alternatively, for example, a missing child may also wear a hearable and select a parent's voice. The hearable amplifies the parent's voice. The amplification allows the child to locate the parents. This allows the child to walk back to the parent. Alternatively, for example, a missing child may also wear a hearable and select a parent's voice. The hearable finds the parent's voice(s) and announces the distance to the voice(s). This helps the child find the parent. Optionally, playback of an artificial voice from the hearable may be provided for the distance announcement.

例えば、音声の選択的増幅のためのヒアラブルの結合が提供され、音声プロファイルが記憶される。 For example, coupling of hearables for selective amplification of sound is provided and sound profiles are stored.

さらなる実施形態のさらなるユースケースは、レクリエーションスポーツであり、レクリエーション競技者に関する。スポーツ時に音楽を聴くことは人気がある；しかし、危険も伴う。警告信号または他の道路利用者が聞こえない可能性がある。音楽の再生に加えて、ヒアラブルは、警告信号または声に反応し、音楽再生を一時的に中断することができる。これに関連して、さらなるユースケースは、小グループにおけるスポーツである。スポーツグループのヒアラブルは、他の妨害ノイズを抑制しながら、スポーツ中の良好なコミュニケーションを確保するために接続することができる。 A further use case of a further embodiment is recreational sports and concerns recreational athletes. Listening to music during sports is popular; however, it also involves dangers. Warning signals or other road users may not be heard. In addition to playing music, the hearables can react to warning signals or voices and temporarily interrupt the music playback. In this context, a further use case is sports in small groups. The hearables of a sports group can be connected to ensure good communication during sports while suppressing other disturbing noises.

例えば、ユーザは移動可能であり、可能な警告信号は多くの外乱源によって重複される。警告信号のすべてが潜在的にユーザに関係するわけではないことが問題である（街中の遠隔サイレン、通りでの警笛）。これにより、ヒアラブルは、音楽再生を自動的に停止し、ユーザが選択をキャンセルするまで、通信相手の警告信号を音響的に強調する。その後、音楽が正常に再生される。 For example, users can be mobile and possible warning signals are overlapped by many sources of disturbance. The problem is that not all warning signals are potentially relevant to the user (remote sirens in the city, horns in the street). This causes the hearable to automatically stop music playback and acoustically emphasize the communication partner's warning signal until the user cancels the selection. After which the music will play normally.

例えば、ユーザは、スポーツに従事しており、ヒアラブルを介して音楽を聴いている。ユーザに関する警告信号または声が自動的に検出され、ヒアラブルは音楽の再生を中断する。ヒアラブルは、対象音源／音響環境を十分に理解できるように音声再生を適合させる。次いで、ヒアラブルは、（例えば、警告信号の終了後に、）音楽の再生を自動的に継続するか、またはユーザによる要求に従って継続する。 For example, a user is engaged in sports and is listening to music via a hearable. A warning signal or voice regarding the user is automatically detected and the hearable pauses the music playback. The hearable adapts the audio playback to fully understand the target sound source/acoustic environment. The hearable then continues playing the music automatically (e.g., after the warning signal ends) or as requested by the user.

あるいは、例えば、グループのアスリートは、彼らのヒアラブルを接続することができる。グループメンバー間の発話理解性が最適化され、他の妨害雑音が抑制される。 Or, for example, athletes in a group can connect their hearables, optimizing speech intelligibility between group members and suppressing other distracting noises.

別の実施形態の別の使用事例は、いびきの抑制であり、いびきによって妨害される睡眠を望むすべての人々に関する。パートナーのいびきをかいている人は、夜間の安静が妨げられ、睡眠に問題がある。ヒアラブルは、いびき音を抑制し、夜間の休息を保証し、家庭内の安全を提供するので、安心感を提供する。同時に、ヒアラブルは、ユーザが外界から音響的に完全に隔離されないように、他の音を通過させる（赤ん坊が叫ぶ、警報音など。）。例えば、いびき検出が提供される。 Another use case of another embodiment is the suppression of snoring, which concerns everyone whose sleep is disturbed by snoring. People whose partners snore have their night's rest disturbed and have trouble sleeping. The hearable provides a sense of security since it suppresses the snoring sounds, ensures a night's rest, and provides security in the home. At the same time, the hearable lets other sounds through (a baby crying, alarm sounds, etc.) so that the user is not completely acoustically isolated from the outside world. For example, snore detection is provided.

例えば、ユーザは、いびき音のために睡眠障害を有する。ヒアラブルを使用することにより、ユーザは再びよりよく睡眠することができ、これはストレス低減効果を有する。 For example, a user has trouble sleeping due to snoring sounds. By using the hearable, the user can sleep better again, which has a stress reducing effect.

例えば、ユーザは、睡眠中にヒアラブルを装着する。彼／彼女は、すべてのいびき音を抑制するスリープモードにヒアラブルを切り替える。就寝後、再びヒアラブルをオフにする。 For example, a user wears the hearable while sleeping. He/she switches the hearable to a sleep mode that suppresses all snoring sounds. After sleeping, he/she turns the hearable off again.

あるいは、睡眠中に工事の騒音、芝刈り機の騒音などの他の音を抑制することができる。 Alternatively, it can suppress other sounds such as construction noise, lawnmower noise, etc. while you sleep.

さらなる実施形態のさらなるユースケースは、日常生活におけるユーザのための診断デバイスである。ヒアラブルは、好み（例えば、どの音源が選択され、どの減衰／増幅が選択されるか）を記録し、使用期間を介して傾向を有するプロファイルを作成する。このデータは、聴覚能力に関する変化に関する結論を引き出すことを可能にし得る。この目的は、可及的速やかに難聴を検出することである。 A further use case of a further embodiment is a diagnostic device for the user in daily life. The hearable records preferences (e.g. which sound source is selected, which attenuation/amplification is selected) and creates a profile with trends over the period of use. This data may make it possible to draw conclusions regarding changes in hearing ability. The aim is to detect hearing loss as soon as possible.

例えば、ユーザは、数ヶ月または数年間、日常生活または言及されたユースケースでデバイスを携帯する。ヒアラブルは、選択された設定に基づいて分析を作成し、警告および推奨をユーザに出力する。 For example, a user carries the device with them in their daily life or the mentioned use case for months or years. The hearable creates analytics based on the selected settings and outputs warnings and recommendations to the user.

例えば、ユーザは、ヒアラブルを長期間（数ヶ月から数年）にわたって装着する。デバイスは、聴覚選好に基づいて分析を作成し、デバイスは、聴覚損失の発症の場合に推奨および警告を出力する。 For example, a user wears a hearable for an extended period of time (months to years). The device creates an analysis based on the hearing preferences, and the device outputs recommendations and warnings in case of the onset of hearing loss.
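
The idea of building a profile with trends from the recorded preferences can be illustrated with a least-squares slope over the amplification the user selects over time. This is only a sketch: the function names, the use of a linear trend, and the limit of 3 dB per year are hypothetical placeholders, not values from the description and not clinical guidance.

```python
def slope_per_day(days, gains_db):
    """Least-squares slope of the selected amplification (dB) over time (days)."""
    count = len(days)
    if count < 2 or count != len(gains_db):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(days) / count
    mean_g = sum(gains_db) / count
    sxx = sum((t - mean_t) ** 2 for t in days)
    if sxx == 0:
        raise ValueError("all samples share the same time stamp")
    sxy = sum((t - mean_t) * (g - mean_g) for t, g in zip(days, gains_db))
    return sxy / sxx


def hearing_trend_warning(days, gains_db, limit_db_per_year=3.0):
    """Return True when the selected amplification rises faster than the assumed limit."""
    return slope_per_day(days, gains_db) * 365.0 > limit_db_per_year
```

A steadily rising amplification setting would then trigger the kind of recommendation or warning described above, while a flat profile would not.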

別の実施形態のさらなる使用事例は、治療デバイスであり、日常生活における聴覚損傷を有するユーザに関する。聴覚デバイスに向かう途中の移行デバイスとしての役割では、可能な限り早期に潜在的な患者が支援され、したがって認知症が予防的に治療される。他の可能性は、濃度トレーナ（例えばＡＤＨＳの場合）としての使用、耳鳴りの治療、およびストレス軽減である。 A further use case of another embodiment is as a therapeutic device and concerns users with hearing impairment in their daily life. In its role as a transitional device on the way to a hearing device, potential patients are supported as early as possible, and dementia is thus treated preventively. Other possibilities are use as a concentration trainer (e.g., in the case of ADHD), treatment of tinnitus, and stress reduction.

例えば、聴取者は、聴覚の問題または注意欠陥を有し、一時的に／暫定的に聴覚デバイスとしてヒアラブルを使用する。聴覚の問題に応じて、聴覚器によって、例えば、すべての信号の増幅（聴覚の硬さ）、好ましい音源の高い選択性（注意欠陥）、治療音の再生（耳鳴りの治療）によって軽減される。 For example, a listener has a hearing problem or an attention deficit and uses the hearable as a hearing device temporarily or on an interim basis. Depending on the hearing problem, the hearable mitigates it, for example, by amplifying all signals (hardness of hearing), by high selectivity for preferred sound sources (attention deficit), or by playing back therapeutic sounds (treatment of tinnitus).

ユーザは、独立して、または医師の助言に基づいて、治療の形態を選択し、好ましい調整を行い、ヒアラブルは選択された治療を実行する。 The user, independently or based on a physician's advice, selects the form of treatment, makes the desired adjustments, and the hearable carries out the selected treatment.

あるいは、ヒアラブルは、ＵＣ－ＰＲＯ１から聴覚の問題を検出し、検出された問題に基づいて再生を自動的に適合させ、ユーザに通知する。 Alternatively, the hearable can detect hearing problems from the UC-PRO1 and automatically adapt playback based on the detected problem and notify the user.

さらなる実施形態のさらなるユースケースは、公共部門での仕事であり、公共部門の従業員に関する。仕事中に高レベルの騒音を受ける公共部門の従業員（病院、小児科医、空港カウンター、教育者、レストラン業界、サービスカウンターなど。）は、例えばストレスの軽減を通じて、１人または少数の人々の発言を強調してより良好に伝達し、仕事におけるより良好な安全のためにヒアラブルを着用する。 A further use case of a further embodiment is work in the public sector and concerns public sector employees. Public sector employees who are exposed to high noise levels during their work (hospitals, pediatricians, airport counters, educators, restaurant industry, service counters, etc.) wear hearables to emphasize the speech of one or a few people, to communicate better, and to improve safety at work, for example through reduced stress.

例えば、従業員は、彼らの作業環境において高レベルの騒音を受け、バックグラウンド騒音にもかかわらず、より穏やかな環境に切り替えることができずにクライアント、患者、または同僚と話す必要がある。病院の従業員は、医療デバイスの音およびビープ音による高レベルのノイズ（または任意の他の業務関連ノイズ）を受け、依然として患者または同僚と通信することができなければならない。小児科医および教育者は、子供の騒音または叫ぶ中で働き、親と話すことができなければならない。空港のカウンターでは、従業員は、空港のコンコース内の騒音レベルが高い場合に、航空会社の乗客を理解することが困難である。ウエイターは、よく行くレストランの騒音の中で、客の注文を聞くことが困難である。その後、例えば、ユーザは音声選択をオフにし、ヒアラブルはもはや音声を増幅しない。 For example, employees experience high levels of noise in their work environment and need to talk to clients, patients, or coworkers despite the background noise without being able to switch to a quieter environment. Hospital employees experience high levels of noise from medical devices ringing and beeping (or any other work-related noise) and must still be able to communicate with patients or coworkers. Pediatricians and educators must work amidst noise or screaming children and be able to talk to parents. At an airport counter, employees have difficulty understanding airline passengers when the noise level is high in the airport concourse. A waiter has difficulty hearing customers' orders amidst the noise of a frequented restaurant. Then, for example, the user turns off the voice selection and the hearable no longer amplifies the voice.

例えば、人は、搭載されたヒアラブルをオンにする。ユーザは、ヒアラブルを近くの音声の音声選択に設定し、ヒアラブルは、近くの音声、または近くの少数の音声を増幅し、同時にバックグラウンドノイズを抑制する。その場合、ユーザは関連する音声をよりよく理解する。 For example, a person turns on an equipped hearable. The user sets the hearable to audio selection for nearby voices, and the hearable amplifies the nearby voice, or a few nearby voices, while simultaneously suppressing background noise. The user then better understands the relevant voice.

あるいは、人がヒアラブルを継続的なノイズサプレッションに設定する。ユーザは、利用可能な音声を検出して増幅するために機能をオンにする。したがって、ユーザは、より低いレベルのノイズで作業を続けることができる。ｘメートル付近から直接アドレス指定されると、次にヒアラブルは音声を増幅する。したがって、ユーザは、低レベルのノイズで他の人と会話することができる。会話の後、ヒアラブルはノイズ抑制モードに戻り、仕事の後、ユーザはヒアラブルを再びオフにする。 Alternatively, a person sets the hearable to continuous noise suppression. The user turns on the function that detects voices as they occur and amplifies them. Thus, the user can continue working at a lower noise level. When the user is addressed directly from within x meters, the hearable amplifies that voice. Thus, the user can converse with other people at a low noise level. After the conversation, the hearable returns to noise suppression mode, and after work, the user turns the hearable off again.
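
The mode switching in this scenario can be written as a small state function. The sketch below is an assumption about how such logic might look; the mode names, the function name, and the way "being addressed" and the distance are supplied (as ready-made inputs) are hypothetical, and the distance limit corresponds to the unspecified "x meters" in the text.

```python
from enum import Enum


class Mode(Enum):
    OFF = "off"
    NOISE_SUPPRESSION = "noise_suppression"
    VOICE_AMPLIFICATION = "voice_amplification"


def next_mode(mode, addressed, speaker_distance_m, max_distance_m):
    """Choose the next mode from the current mode and the detection results."""
    if mode is Mode.OFF:
        # A switched-off hearable stays off until the user turns it on again.
        return Mode.OFF
    if addressed and speaker_distance_m <= max_distance_m:
        # Someone within the configured radius is speaking to the user.
        return Mode.VOICE_AMPLIFICATION
    # Default, and the state returned to once the conversation has ended.
    return Mode.NOISE_SUPPRESSION
```

Calling the function once per detection cycle reproduces the sequence described above: suppression by default, amplification while the user is addressed from nearby, and suppression again afterwards.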

別の実施形態の別のユースケースは、乗客の輸送であり、乗客の輸送のための自動車のユーザに関する。例えば、乗客輸送機のユーザおよび運転者は、運転中に乗客ができる限り注意を逸らさないことを望む。乗客は妨害の主な原因であるにもかかわらず、時々彼らとの通信が必要である。 Another use case of another embodiment is passenger transport and concerns users of motor vehicles for transporting passengers. For example, the user, who is the driver of a passenger transport vehicle, wants to be distracted by the passengers as little as possible while driving. Even though the passengers are the main source of disturbance, communication with them is necessary from time to time.

例えば、ユーザまたは運転者、および外乱源は、自動車内の固定位置に配置される。ユーザは基準システムに対して静止しているが、車両自体は動いている。これには、適応追跡解決策が必要である。したがって、通信が行われない限り、乗客の音および会話はデフォルトで音響的に抑制される。 For example, the user or driver and the disturbance sources are located at fixed positions in a car. The user is stationary relative to the reference system, but the vehicle itself is moving. This requires an adaptive tracking solution. Thus, passenger sounds and conversations are acoustically suppressed by default unless communication is taking place.

例えば、ヒアラブルは、デフォルトで搭乗者の騒音を抑制する。ユーザは、適切な制御可能性（音声認識、車両内のボタン）を通じて手動で抑制を解除することができる。ここで、ヒアラブルは、選択に応じて音声再生を適応させる。 For example, the hearable suppresses noise for the occupants by default. The user can manually override the suppression through appropriate control possibilities (voice recognition, buttons in the vehicle). The hearable then adapts the audio playback according to the selection.

あるいは、ヒアラブルは、搭乗者が積極的に運転者に話しかけていることを検出し、ノイズ抑制を一時的に停止する。 Alternatively, the hearable could detect when the passenger is actively talking to the driver and temporarily halt noise suppression.

さらなる実施形態の別の使用事例は、学校および教育であり、クラスの教師および生徒に関する。一例では、ヒアラブルは２つの役割を有し、デバイスの機能は部分的に結合される。教師／話者のデバイスは、妨害雑音を抑圧し、生徒からの発話／質問を増幅する。また、聴取者のヒアラブルは、教師のデバイスを介して制御されてもよい。したがって、特に重要なコンテンツは、より大きな声で話す必要なく強調され得る。生徒は、教師をよりよく理解することができ、邪魔なクラスメートを除外することができるように、ヒアラブルを設定することができる。 Another use case of further embodiments is school and education, involving teachers and students in a class. In one example, the hearable has two roles and the functions of the devices are partially combined. The teacher/speaker's device suppresses interfering noise and amplifies speech/questions from the students. The hearable of the listener may also be controlled via the teacher's device. Thus, particularly important content can be highlighted without the need to speak louder. Students can set their hearables in such a way that they can understand the teacher better and filter out disruptive classmates.

例えば、教師および生徒は、閉じた空間内の定義された領域に位置する（これが規則である）。すべてのデバイスが互いに結合されている場合、相対位置は交換可能であり、これによりソース分離が単純化される。選択された音源は、ユーザ（教師／生徒）が選択をキャンセルするまで、または警告信号がデバイスの機能を中断するまで音響的に強調される。 For example, the teacher and the students are located in defined areas in a closed space (as is usually the case). If all devices are coupled to each other, the relative positions can be exchanged between them, which simplifies source separation. The selected sound source is acoustically emphasized until the user (teacher/student) cancels the selection or until a warning signal interrupts the function of the device.

例えば、教師またはスピーカがコンテンツを提示し、デバイスは妨害雑音を抑制する。教師は、生徒の質問を聞きたいと思い、（自動的にまたは適切な制御可能性を介して）質問を有する人にヒアラブルの焦点を変更する。通信後、すべての音は再び抑制される。さらに、例えば、クラスメートによって妨害されていると感じている学生が、音響的に彼らを調整することが提供され得る。例えば、先生から遠く離れて座っている生徒が、先生の音声を増幅するようにしてもよい。 For example, a teacher or speaker presents content and the device suppresses distracting noise. The teacher wants to hear a student's question and changes the focus of the hearable (automatically or via suitable control possibilities) to the person with the question. After the communication, all sounds are suppressed again. Furthermore, it may be provided that students who feel, for example, that they are being disturbed by their classmates, tune them out acoustically. For example, a student sitting far away from the teacher may have the teacher's voice amplified.

あるいは、例えば、教師および生徒のデバイスが結合されてもよい。生徒デバイスの選択性は、教師デバイスを介して一時的に制御されてもよい。特に重要なコンテンツの場合、教師は、自分の声を増幅するために生徒デバイスの選択性を変更する。 Alternatively, for example, teacher and student devices may be combined. Selectivity of the student device may be temporarily controlled via the teacher device. For particularly important content, the teacher changes the selectivity of the student device to amplify his or her voice.

A further use case of another embodiment is the military and concerns soldiers. Verbal communication between soldiers in the field takes place via radio on the one hand and via shouting and direct contact on the other. Radio is mostly used when communication takes place between different units and subgroups, and a predefined radio etiquette is often followed. Shouting and direct contact are mostly used for communication within a squad or group. During a mission, difficult acoustic conditions (e.g. people screaming, weapon noise, bad weather) may impair both communication paths. A radio set with earphones is often part of a soldier's equipment. Besides audio reproduction, these devices also protect against high sound pressure levels, and they are often equipped with microphones that bring environmental signals to the wearer's ears. Active noise suppression is also part of such systems. Enhancing or extending this functional range enables shouting and direct contact between soldiers in noisy environments through intelligent attenuation of interfering noise and selective emphasis of speech with directional reproduction. For this purpose, the relative positions of the soldiers in the room or field must be known. Furthermore, speech signals and interfering noise must be separated from each other spatially and by content. The system must also be able to handle high SNR levels, across signals ranging from quiet whispers to screams and explosions. The advantages of such a system are: verbal communication between soldiers in noisy environments, preserved hearing protection, the possibility of dispensing with radio etiquette, and security against interception (since it is not a radio solution).

For example, shouting and direct contact between soldiers on a mission can be complicated by interfering noise. This problem is currently addressed by radio solutions for short and longer ranges. The new system enables shouting and direct contact in the near field through intelligent, spatial emphasis of the respective speaker and attenuation of ambient noise.

例えば、兵士は任務中である。声および音声が自動的に検出され、システムはそれらをバックグラウンドノイズの同時減衰で増幅する。システムは、対象音源を十分に理解できるように空間音声再生を適合させる。 For example, a soldier is on a mission. Voices and sounds are automatically detected and the system amplifies them with simultaneous attenuation of background noise. The system adapts the spatial audio playback to ensure that the target sound source is fully understandable.

あるいは、例えば、システムは、グループの兵士を知ることができる。これらのグループメンバーの音声信号のみを通過させる。 Or, for example, the system could know which soldiers are in a group and only let through the voice signals of those group members.
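The group-based pass-through described above can be sketched as follows. This is only a minimal illustration, assuming that an upstream speaker-identification stage (not shown) already labels each separated voice; the names `DetectedVoice` and `group_gains` and the gain values are hypothetical and not taken from the described embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedVoice:
    speaker_id: str   # hypothetical label from a speaker-identification stage
    level_db: float   # estimated level of the separated voice signal


def group_gains(voices, group_members, pass_gain_db=0.0, block_gain_db=-60.0):
    """Return one gain in dB per detected voice.

    Voices of known group members are passed, all others are attenuated.
    The default gain values are assumptions chosen for illustration.
    """
    return {
        v.speaker_id: (pass_gain_db if v.speaker_id in group_members else block_gain_db)
        for v in voices
    }
```

A downstream renderer would apply these gains to the separated signals before binaural playback.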

A further use case of a further embodiment concerns security personnel and security guards. For example, the hearable may be used at confusing major events (celebrations, demonstrations) for the preemptive detection of crimes. The selectivity of the hearable is controlled by keywords, e.g. cries for help or calls to violence. This presupposes a content analysis of the audio signal (e.g. speech recognition).
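The keyword-controlled selectivity can be sketched as follows. This is a minimal illustration, assuming that an upstream speech recognizer (not shown) already delivers a text transcript for each separated source; the function names, the keyword list, and the transcripts are hypothetical.

```python
import re


def spot_keywords(transcript, keywords):
    """Return the keywords (sorted) that occur as whole words in the transcript."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return sorted(k for k in keywords if k.lower() in words)


def select_sources(transcripts, keywords):
    """Return the ids of all sources whose transcript contains at least one keyword."""
    return [
        source_id
        for source_id, text in transcripts.items()
        if spot_keywords(text, keywords)
    ]
```

Sources returned by `select_sources` would then be candidates for acoustic emphasis; simple word matching stands in here for whatever content analysis an actual system uses.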

For example, a security guard is surrounded by many loud sound sources, and the guard and all sound sources may be moving. Under normal hearing conditions, a person calling for help cannot be heard, or only to a limited extent (poor SNR). A manually or automatically selected sound source is acoustically emphasized until the user cancels the selection. Optionally, a virtual sound object is placed at the position/direction of the sound source of interest so that the location can be found easily (e.g. in the case of a one-off call for help).

For example, the hearable detects sound sources that are potential sources of danger. The security guard selects which sound source or event to follow (e.g. via a selection on a tablet). The hearable then adapts the audio playback so that the sound source can be understood and localized well even in the presence of interfering noise.

あるいは、例えば、目標音源が無音である場合、音源に向かう／音源の距離内の位置特定信号が配置されてもよい。 Alternatively, for example, if the target sound source is silent, a localization signal may be placed towards/within the distance of the sound source.

Another use case of another embodiment is communication on stage and concerns musicians. On stage, during rehearsals or concerts (e.g. band, orchestra, choir, musical), individual instruments (or instrument groups) may be inaudible because of difficult acoustic conditions, even though they would still be audible in other environments. This impairs the interplay, since important (accompanying) voices can no longer be perceived. The hearable can emphasize these voices and make them audible again, thus improving or ensuring the interplay of the individual musicians. This use can also reduce the noise exposure of individual musicians and prevent hearing loss, for example by attenuating the drums, while the musicians still hear everything important at the same time.

例えば、ヒアラブルのないミュージシャンは、もはやステージ上で少なくとも１つの他の音声を聞くことができない。この場合、ヒアラブルを用いてもよい。リハーサルまたはコンサートの終了後、ユーザは、ヒアラブルをオフにした後に取り外す。 For example, a musician without hearables can no longer hear at least one other voice on stage. In this case, hearables may be used. After the rehearsal or concert is over, the user turns off and then removes the hearables.

一例では、ユーザはヒアラブルをオンにする。ユーザは、増幅されるべき１つまたは複数の所望の楽器を選択する。一緒に音楽を作成するとき、選択された音楽楽器は増幅され、したがってヒアラブルによって再び聞こえるようにされる。音楽を作成した後、ユーザは再びヒアラブルをオフにする。 In one example, a user turns on the hearable. The user selects one or more desired instruments to be amplified. When creating music together, the selected musical instruments are amplified and thus made audible again by the hearable. After creating the music, the user turns off the hearable again.

In another example, the user turns on the hearable and selects the instrument whose volume is to be reduced. When making music together, the hearable lowers the volume of the selected instrument so that the user hears it only at a moderate volume.

例えば、楽器プロファイルをヒアラブルに格納することができる。 For example, instrument profiles can be stored in a hearable.

さらなる実施形態の別の使用事例は、エコシステムの意味での聴覚デバイス用のソフトウェアモジュールとしての音源分離であり、聴覚デバイスの製造業者、または聴覚デバイスのユーザに関する。製造業者は、聴覚デバイスの追加ツールとして音源分離を使用し、それを顧客に提供することができる。したがって、聴覚デバイスは、開発から利益を得ることもできる。他の市場／デバイス（ヘッドホン、携帯電話等。）用のライセンスモデルも考えられる。 Another use case for further embodiments is source separation as a software module for hearing devices in the ecosystem sense, which concerns the manufacturer of the hearing device, or the user of the hearing device. The manufacturer can use source separation as an additional tool for the hearing device and offer it to the customer. Thus, the hearing device can also benefit from the development. Licensing models for other markets/devices (headphones, mobile phones, etc.) are also conceivable.

例えば、聴覚デバイスのユーザは、例えば特定の話者に焦点を合わせるために、複雑な聴覚状況において異なる音源を分離することが困難である。外部の追加システム（例えば、Ｂｌｕｅｔｏｏｔｈを介した移動無線機セットからの信号の転送、ＦＭ機器または誘導聴覚機器を介した教室での選択的な信号の転送）がなくても選択的に聞くことができるようにするために、ユーザは、選択的聴取のための追加機能を有する聴覚デバイスを使用する。したがって、外部の努力がなくても、ユーザは、音源分離を通じて個々の音源に焦点を合わせることができる。最後に、ユーザは、追加機能をオフにして、聴覚デバイスで正常に聞き続ける。 For example, a user of a hearing device has difficulty separating different sound sources in a complex hearing situation, for example to focus on a particular speaker. In order to be able to listen selectively without an external additional system (e.g. transfer of signals from a mobile radio set via Bluetooth, selective transfer of signals in a classroom via FM equipment or inductive hearing equipment), the user uses a hearing device with an additional function for selective listening. Thus, without any external effort, the user can focus on individual sound sources through sound source separation. Finally, the user turns off the additional function and continues to listen normally with the hearing device.

For example, a hearing device user acquires a new hearing device with an integrated additional function for selective hearing. The user sets up the selective hearing function on the hearing device and then selects a profile (e.g. amplify the loudest/closest source, or amplify certain voices from the personal surroundings by means of voice recognition, as in UC-CE5, major events). The hearing device amplifies the respective sound source(s) according to the selected profile and, on request, simultaneously suppresses background noise. The user thus hears individual sound sources from the complex auditory scene rather than just "noise" or a clutter of acoustic sources.

Alternatively, the hearing device user acquires the additional function for selective hearing as software or the like for an existing hearing device and installs it on the device. The user then sets up the selective hearing function and selects a profile (amplify the loudest/closest sound source, or amplify certain voices from the personal surroundings by means of voice recognition, as in UC-CE5, major events). The hearing device amplifies the respective sound source(s) according to the selected profile and, on request, simultaneously suppresses background noise. In this case, too, the user hears individual sound sources from the complex auditory scene rather than just "noise" or a clutter of acoustic sources.

For example, the hearable may provide storable voice profiles.
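The profile selection mentioned above ("amplify the loudest/closest source") can be sketched as follows. This is a minimal illustration, assuming that level and distance estimates per separated source are already available from upstream stages; the names `Source`, `select_source`, and `apply_profile` and the gain values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    level_db: float    # estimated level of the separated source
    distance_m: float  # estimated distance to the listener


def select_source(sources, profile):
    """Pick the target source for a profile: 'loudest' or 'closest'."""
    if not sources:
        return None
    if profile == "loudest":
        return max(sources, key=lambda s: s.level_db)
    if profile == "closest":
        return min(sources, key=lambda s: s.distance_m)
    raise ValueError(f"unknown profile: {profile!r}")


def apply_profile(sources, profile, boost_db=6.0, background_db=-12.0):
    """Return a gain in dB per source: boost the target, attenuate the rest."""
    target = select_source(sources, profile)
    return {s.name: (boost_db if s is target else background_db) for s in sources}
```

For instance, with a nearby talker and a louder television further away, the `closest` profile boosts the talker while the `loudest` profile would boost the television.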

A further use case of a further embodiment is professional sports and concerns athletes in competition. In sports such as biathlon, triathlon, cycling, and marathon running, professional athletes rely on information from their coaches or on communication with teammates. However, there are also situations in which they want to shield themselves from loud sounds (shooting in biathlon, loud cheering, party horns, etc.) in order to concentrate. The hearable can be adapted to the respective sport/athlete to enable a fully automatic selection of the relevant sound sources (detection of particular voices, volume limiting for typical interfering noises).

For example, the user may be highly mobile, and the type of interfering noise depends on the sport. Because of the intense physical exertion, the athlete cannot control the device, or can do so only to a limited extent. In most sports, however, there is a fixed sequence (biathlon: running, shooting), and the important communication partners (coaches, teammates) can be defined in advance. Noise is suppressed in general or during certain phases of the activity. Communication between the athlete and teammates and coaches is always emphasized.

例えば、競技者は、スポーツの種類に合わせて特別に調整されたヒアラブルを使用する。ヒアラブルは、特にそれぞれのタイプのスポーツにおいて高度な注意が必要とされる状況において、完全に自動的に（事前調整されて）妨害雑音を抑制する。加えて、ヒアラブルは、トレーナーおよびチームメンバーが聴力範囲にあるときに完全に自動的に（事前調整されて）強調する。 For example, athletes use hearables that are specially tuned for their type of sport. The hearables fully automatically (pre-tuned) suppress disturbing noises, especially in situations that require high levels of attention in each type of sport. In addition, the hearables fully automatically (pre-tuned) highlight trainers and team members when they are within hearing range.

A further use case of further embodiments is hearing training and concerns music students, professional musicians, and amateur musicians. In music rehearsals (e.g. in an orchestra, band, ensemble, or music lesson), the hearable can be used selectively to follow individual voices in a filtered manner. Especially at the start of rehearsals, it is helpful to listen to a finished recording of the piece and follow one's own voice. Depending on the composition, voices in the background are hard to make out because mostly the foreground voices are heard. With the hearable, a voice can be selectively emphasized on the basis of the instrument or the like, allowing more targeted practice.

(Aspiring) music students can also use the hearable to train their hearing and to prepare specifically for exams by reducing the emphasis of individual voices step by step until they can finally pick out individual voices from complex pieces without assistance.
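The step-by-step reduction of emphasis can be sketched as a simple gain schedule. Linear steps in decibels are an assumption made for illustration only, since the text does not specify how the emphasis is reduced; `emphasis_schedule` and its parameters are hypothetical names.

```python
def emphasis_schedule(start_db, sessions):
    """Return one emphasis gain in dB per practice session.

    The gain falls linearly from start_db in the first session
    to 0 dB (no assistance) in the last session.
    """
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    if sessions == 1:
        return [0.0]
    step = start_db / (sessions - 1)
    return [start_db - i * step for i in range(sessions)]
```

With `emphasis_schedule(12.0, 4)`, the selected voice would be emphasized by 12 dB in the first session and not at all in the fourth.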

A further possible use case is karaoke, e.g. when SingStar or the like is not available. The singing voice can be suppressed from a piece of music on request so that only the instrumental version is heard for singing karaoke.

For example, a musician starts learning a voice from a piece of music. The musician listens to a recording of the piece via a CD player or another playback medium. When the practice session is over, the user turns the hearable off again.

In one example, the user turns on the hearable and selects the instrument to be amplified. While the piece is playing, the hearable amplifies the voice of the selected instrument and lowers the volume of the remaining instruments, so that the user can follow his or her own voice better.

In another example, the user turns on the hearable and selects the instrument to be suppressed. While the piece is playing, the voice of the selected instrument is suppressed so that only the remaining voices are heard. The user can then practice that voice on his or her own instrument together with the other voices, without being distracted by the corresponding voice in the recording.
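Amplifying or suppressing a selected instrument can be sketched as a remix of separated signals. This is a minimal illustration, assuming that an upstream source separation stage (not shown) already provides one sample sequence per instrument; the function `remix`, the instrument names, and the linear gain factors are hypothetical.

```python
def remix(stems, gains, default_gain=1.0):
    """Mix separated instrument signals with a linear gain per instrument.

    stems: dict mapping instrument name to a list of samples (equal lengths).
    gains: dict mapping instrument name to a linear gain factor;
           0.0 suppresses an instrument, values above 1.0 amplify it.
    """
    if not stems:
        return []
    length = len(next(iter(stems.values())))
    mix = [0.0] * length
    for name, samples in stems.items():
        if len(samples) != length:
            raise ValueError("all stems must have the same length")
        gain = gains.get(name, default_gain)
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    return mix
```

Setting the gain of the user's own instrument to 0.0 corresponds to the suppression example, while a gain above 1.0 corresponds to the amplification example.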

実施例では、ヒアラブルは、格納された楽器プロファイルを提供することができる。 In an embodiment, the hearable can provide a stored instrument profile.

別の実施形態の別のユースケースは、作業時の安全性であり、騒がしい環境の作業者に関する。機械ホールまたは建設現場などの騒々しい環境にいる労働者は、騒音から自分自身を保護しなければならないが、警告信号を知覚し、同僚と通信することもできなければならない。 Another use case for another embodiment is safety at work, which concerns workers in noisy environments. Workers in noisy environments, such as machine halls or construction sites, must protect themselves from the noise, but must also be able to perceive warning signals and communicate with their colleagues.

For example, the user is located in a very loud environment, and the target sound sources (warning signals, colleagues) may be considerably quieter than the interfering noise. The user may be mobile, whereas the interfering noise is often stationary. As with hearing protection, noise is permanently reduced, and the hearable emphasizes warning signals fully automatically. Communication with colleagues is ensured by amplifying the speaker sound sources.

例えば、ユーザは仕事中であり、聴覚保護としてヒアラブルを使用する。警告信号（例えば、火災報知器）は音響的に強調され、ユーザは必要に応じて作業を停止する。 For example, a user is at work and uses a hearable as hearing protection. A warning signal (e.g. a fire alarm) is acoustically accentuated, causing the user to stop working if necessary.

あるいは、例えば、ユーザは仕事中であり、聴覚保護としてヒアラブルを使用する。同僚とのコミュニケーションの必要性がある場合、コミュニケーションパートナーが選択され、適切なインターフェース（ここでは、例えば、眼の制御）の助けを借りて音響的に強調される。 Or, for example, the user is at work and uses the hearable as hearing protection. When there is a need to communicate with a colleague, a communication partner is selected and acoustically enhanced with the help of a suitable interface (here, for example, eye control).

さらなる実施形態の別のユースケースは、ライブトランスレータ用のソフトウェアモジュールとしてのソース分離であり、ライブトランスレータのユーザに関する。ライブ翻訳者は、話し言葉の外国語をリアルタイムで翻訳し、ソース分離のために上流のソフトウェアモジュールから利益を得ることができる。特に、複数の話者が存在する場合、ソフトウェアモジュールは、目標話者を抽出し、潜在的に翻訳を改善することができる。 Another use case of a further embodiment is source separation as a software module for a live translator, with respect to the user of the live translator. The live translator translates spoken foreign languages in real time and can benefit from an upstream software module for source separation. In particular, in the case of multiple speakers, the software module can extract the target speaker and potentially improve the translation.

例えば、ソフトウェアモジュールは、ライブトランスレータ（スマートフォン上の専用デバイスまたはアプリ）の一部である。例えば、ユーザは、デバイスのディスプレイを介して目標話者を選択することができる。ユーザおよび対象音源は、並進時に移動しないか、またはわずかしか移動しないことが有利である。選択された音源位置は音響的に強調され、したがって並進を潜在的に改善する。 For example, the software module is part of a live translator (a dedicated device or an app on a smartphone). The user can, for example, select a target speaker via the device's display. Advantageously, the user and the target sound source do not move or move only slightly during the translation. The selected sound source position is acoustically highlighted, thus potentially improving the translation.

例えば、ユーザは、外国語での会話を希望したり、外国語の話者の話を聞いたりする。ユーザは適切なインターフェース（例えば、ディスプレイ上のＧＵＩ）を介して目標話者を選択し、ソフトウェアモジュールはトランスレータでさらに使用するために録音を最適化する。 For example, a user may wish to converse in a foreign language or listen to a speaker of a foreign language. The user selects the target speaker via an appropriate interface (e.g., a GUI on the display) and the software module optimizes the recording for further use in the translator.

A further use case of another embodiment is occupational safety for emergency responders and concerns firefighters, civil protection, police, and emergency services. For emergency responders, good communication is essential to handle a mission successfully. Despite loud ambient noise, responders often cannot wear hearing protection, since it would make communication impossible. Firefighters, for example, must be able to give and understand commands precisely despite loud engine noise, which is done partly via radio. Emergency responders are therefore exposed to high noise levels at which hearing protection regulations cannot be observed. The hearable provides hearing protection for the responders on the one hand and still enables communication between them on the other. Furthermore, with the help of the hearable, responders wearing helmets or protective equipment are not acoustically cut off from their environment and can therefore provide better assistance. They can communicate better and can also better assess the danger to themselves (e.g. by hearing what type of fire is occurring).

For example, the user is exposed to strong ambient noise, therefore cannot wear hearing protection, and must still be able to communicate with others. The user puts on the hearable. After the mission is completed or the dangerous situation has ended, the user takes the hearable off again.

例えば、ユーザは、ミッション中にヒアラブルを装着する。ヒアラブルをオンにする。ヒアラブルは、周囲の雑音を抑圧し、周囲の同僚や他の話者の発話を増幅する（例えば火災犠牲者）。 For example, a user wears a hearable during a mission. The hearable is turned on. The hearable suppresses background noise and amplifies the speech of nearby colleagues and other speakers (e.g., fire victims).

あるいは、ユーザは、任務中にヒアラブルを装着する。ヒアラブルをオンにし、ヒアラブルは周囲の雑音を抑圧し、ラジオを介して同僚の音声を増幅する。 Or, a user could wear a hearable while on a mission. They would turn the hearable on and it would suppress background noise and amplify the voice of their colleagues over the radio.

Where applicable, the hearable is specially designed to be structurally suitable for operations in accordance with the applicable operational regulations. The hearable may also include an interface to a radio device.

Although some aspects are described in the context of a device, it is understood that these aspects also represent a description of the corresponding method, so that a block or structural component of a device is also to be understood as a corresponding method step or as a feature of a method step. Similarly, aspects described in the context of, or as, a method step also represent a description of a corresponding block, detail, or feature of a corresponding device. Some or all of the method steps may be performed using a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, some or several of the most important method steps may be performed by such a device.

具体的な実装要件に応じて、本発明の実施形態は、ハードウェアまたはソフトウェアで実装することができる。実装は、デジタル記憶媒体、例えば、フロッピーディスク、ＤＶＤ、ブルーレイディスク、ＣＤ、ＲＯＭ、ＰＲＯＭ、ＥＰＲＯＭ、ＥＥＰＲＯＭもしくはフラッシュメモリ、ハードディスク、またはそれぞれの方法が実行されるようにプログラマブルコンピュータシステムと協働するか、または協働することができる電子的に読み取り可能な制御信号が記憶された任意の他の磁気もしくは光学メモリを使用して行われてもよい。これが、デジタル記憶媒体がコンピュータ可読であり得る理由である。 Depending on the specific implementation requirements, embodiments of the invention can be implemented in hardware or in software. Implementation may be done using digital storage media, such as floppy disks, DVDs, Blu-ray disks, CDs, ROMs, PROMs, EPROMs, EEPROMs or flash memories, hard disks or any other magnetic or optical memory on which electronically readable control signals are stored that cooperate or can cooperate with a programmable computer system so that the respective method is executed. This is why the digital storage medium may be computer readable.

したがって、本発明によるいくつかの実施形態は、本明細書に記載の方法のいずれかが実行されるようにプログラム可能なコンピュータシステムと協働することができる電子的に読み取り可能な制御信号を含むデータキャリアを含む。 Thus, some embodiments according to the invention include a data carrier that includes electronically readable control signals that can cooperate with a programmable computer system to perform any of the methods described herein.

一般に、本発明の実施形態は、プログラムコードを有するコンピュータプログラム製品として実施することができ、プログラムコードは、コンピュータプログラム製品がコンピュータ上で実行されるときに方法のいずれかを実行するのに有効である。 In general, embodiments of the invention may be implemented as a computer program product having program code that is effective to perform any of the methods when the computer program product is run on a computer.

プログラムコードは、例えば、機械可読キャリアに格納することもできる。 The program code may, for example, be stored on a machine-readable carrier.

他の実施形態は、本明細書に記載の方法のいずれかを実行するためのコンピュータプログラムを含み、前記コンピュータプログラムは、機械可読キャリアに格納される。言い換えれば、本発明の方法の一実施形態は、したがって、コンピュータプログラムがコンピュータ上で実行されるときに、本明細書に記載の方法のいずれかを実行するためのプログラムコードを有するコンピュータプログラムである。 Other embodiments include a computer program for performing any of the methods described herein, said computer program being stored on a machine readable carrier. In other words, an embodiment of the inventive method is therefore a computer program having a program code for performing any of the methods described herein, when the computer program runs on a computer.

したがって、本発明の方法のさらなる実施形態は、本明細書に記載の方法のいずれかを実行するためのコンピュータプログラムが記録されるデータキャリア（またはデジタル記憶媒体もしくはコンピュータ可読媒体）である。データキャリア、デジタル記憶媒体、または記録媒体は、通常、有形または不揮発性である。 Thus, a further embodiment of the method of the present invention is a data carrier (or digital storage medium or computer readable medium) on which is recorded a computer program for performing any of the methods described herein. The data carrier, digital storage medium or recording medium is typically tangible or non-volatile.

したがって、本発明の方法のさらなる実施形態は、本明細書に記載の方法のいずれかを実行するためのコンピュータプログラムを表すデータストリームまたは一連の信号である。データストリームまたは信号シーケンスは、例えば、データ通信リンクを介して、例えばインターネットを介して送信されるように構成することができる。 A further embodiment of the inventive method is therefore a data stream or a sequence of signals representing a computer program for performing any of the methods described herein. The data stream or signal sequence can for example be configured to be transmitted over a data communication link, for example via the Internet.

さらなる実施形態は、本明細書に記載の方法のいずれかを実行するように構成または適合された処理ユニット、例えばコンピュータまたはプログラマブル論理デバイスを含む。 Further embodiments include a processing unit, e.g., a computer or programmable logic device, configured or adapted to perform any of the methods described herein.

さらなる実施形態は、本明細書に記載の方法のいずれかを実行するためのコンピュータプログラムがインストールされたコンピュータを含む。 A further embodiment includes a computer having installed thereon a computer program for performing any of the methods described herein.

本発明によるさらなる実施形態は、本明細書に記載の方法のうちの少なくとも１つを実行するためのコンピュータプログラムを受信機に送信するように構成されたデバイスまたはシステムを含む。送信は、例えば、電子的または光学的であってもよい。受信機は、例えば、コンピュータ、モバイルデバイス、メモリデバイス、または同様のデバイスであってもよい。デバイスまたはシステムは、例えば、コンピュータプログラムを受信機に送信するためのファイルサーバを含み得る。 Further embodiments according to the invention include a device or system configured to transmit a computer program for performing at least one of the methods described herein to a receiver. The transmission may be, for example, electronic or optical. The receiver may be, for example, a computer, a mobile device, a memory device, or a similar device. The device or system may, for example, include a file server for transmitting the computer program to the receiver.

いくつかの実施形態では、プログラマブル論理デバイス（例えば、フィールドプログラマブルゲートアレイ、ＦＰＧＡ）を使用して、本明細書に記載の方法の機能の一部またはすべてを実行することができる。いくつかの実施形態では、フィールドプログラマブルゲートアレイは、本明細書に記載の方法のいずれかを実行するためにマイクロプロセッサと協働することができる。一般に、本方法は、いくつかの実施形態では、任意のハードウェアデバイスによって実行される。前記ハードウェアデバイスは、コンピュータプロセッサ（ＣＰＵ）などの任意の普遍的に適用可能なハードウェアであってもよく、ＡＳＩＣなどの方法に固有のハードウェアであってもよい。 In some embodiments, a programmable logic device (e.g., a field programmable gate array, FPGA) may be used to perform some or all of the functions of the methods described herein. In some embodiments, the field programmable gate array may cooperate with a microprocessor to perform any of the methods described herein. In general, the methods are performed in some embodiments by any hardware device. The hardware device may be any universally applicable hardware, such as a computer processor (CPU), or may be method-specific hardware, such as an ASIC.

上述の実施形態は、本発明の原理の単なる例示を表す。他の当業者は、本明細書に記載の構成および詳細の修正および変形を理解するであろうことが理解される。このため、本発明は、実施形態の説明および議論によって本明細書に提示された特定の詳細によってではなく、以下の特許請求の範囲によってのみ限定されることが意図される。 The above-described embodiments merely represent illustrations of the principles of the present invention. It is understood that others skilled in the art will appreciate modifications and variations of the configurations and details described herein. As such, it is intended that the present invention be limited only by the scope of the following claims and not by the specific details presented herein by way of the description and discussion of the embodiments.

参考文献
［１］Ｖ．Ｖａｌｉｍａｋｉ，Ａ．Ｆｒａｎｃｋ，Ｊ．Ｒａｍｏ，Ｈ．Ｇａｍｐｅｒ，ａｎｄＬ．Ｓａｖｉｏｊａ，”Ａｓｓｉｓｔｅｄｌｉｓｔｅｎｉｎｇｕｓｉｎｇａｈｅａｄｓｅｔ：Ｅｎｈａｎｃｉｎｇａｕｄｉｏｐｅｒｃｅｐｔｉｏｎｉｎｒｅａｌ，ａｕｇｍｅｎｔｅｄ，ａｎｄｖｉｒｔｕａｌｅｎｖｉｒｏｎｍｅｎｔｓ，” ＩＥＥＥＳｉｇｎａｌＰｒｏｃｅｓｓｉｎｇＭａｇａｚｉｎｅ，ｖｏｌｕｍｅ３２，ｎｏ．２，ｐｐ．９２－９９，Ｍａｒｃｈ２０１５ References [1] V. Valimaki, A. Franck, J. Ramo, H. Gamper, and L. Savioja, “Assisted listening using a headset: Enhancing audio perception in real, augmented, and virtual environment,” IEEE Signal Processing Magazine, volume 32, no. 2, pp. 92-99, March 2015

［２］Ｋ．Ｂｒａｎｄｅｎｂｕｒｇ，Ｅ．Ｃａｎｏ，Ｆ．Ｋｌｅｉｎ，Ｔ．Ｋｏｌｌｍｅｒ，Ｈ．Ｌｕｋａｓｈｅｖｉｃｈ，Ａ．Ｎｅｉｄｈａｒｄｔ，Ｕ．Ｓｌｏｍａ，ａｎｄＳ．Ｗｅｒｎｅｒ，”Ｐｌａｕｓｉｂｌｅａｕｇｍｅｎｔａｔｉｏｎｏｆａｕｄｉｔｏｒｙｓｃｅｎｅｓｕｓｉｎｇｄｙｎａｍｉｃｂｉｎａｕｒａｌｓｙｎｔｈｅｓｉｓｆｏｒｐｅｒｓｏｎａｌｉｚｅｄａｕｄｉｔｏｒｙｒｅａｌｉｔｉｅｓ，” ｉｎＰｒｏｃ．ｏｆＡＥＳＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＡｕｄｉｏｆｏｒＶｉｒｔｕａｌａｎｄＡｕｇｍｅｎｔｅｄＲｅａｌｉｔｙ，Ａｕｇｕｓｔ２０１８ [2] K. Brandenburg, E. Cano, F. Klein, T. Kollmer, H. Lukashevich, A. Neidhardt, U. Sloma, and S. Werner, “Plausible augmentation of auditory scenes using dynamic binaural synthesis for personalized auditory “realities,” in Proc. of AES International Conference on Audio for Virtual and Augmented Reality, August 2018

［３］Ｓ．Ａｒｇｅｎｔｉｅｒｉ，Ｐ．Ｄａｎｓ，ａｎｄＰ．Ｓｏｕｒｅｓ，”Ａｓｕｒｖｅｙｏｎｓｏｕｎｄｓｏｕｒｃｅｌｏｃａｌｉｚａｔｉｏｎｉｎｒｏｂｏｔｉｃｓ：Ｆｒｏｍｂｉｎａｕｒａｌｔｏａｒｒａｙｐｒｏｃｅｓｓｉｎｇｍｅｔｈｏｄｓ，” ＣｏｍｐｕｔｅｒＳｐｅｅｃｈＬａｎｇｕａｇｅ，ｖｏｌｕｍｅ３４，ｎｏ．１，ｐｐ．８７－１１２，２０１５ [3] S. Argentieri, P. Dans, and P. Soures, “A survey on sound source localization in robotics: From binural to array processing methods,” Computer Speech Language, volume 34, no. 1, pp. 87-112, 2015

［４］Ｄ．ＦｉｔｚＧｅｒａｌｄ，Ａ．Ｌｉｕｔｋｕｓ，ａｎｄＲ．Ｂａｄｅａｕ，”Ｐｒｏｊｅｃｔｉｏｎ－ｂａｓｅｄｄｅｍｉｘｉｎｇｏｆｓｐａｔｉａｌａｕｄｉｏ，” ＩＥＥＥ／ＡＣＭＴｒａｎｓ．ｏｎＡｕｄｉｏ，Ｓｐｅｅｃｈ，ａｎｄＬａｎｇｕａｇｅＰｒｏｃｅｓｓｉｎｇ，ｖｏｌｕｍｅ２４，ｎｏ．９，ｐｐ．１５６０－１５７２，２０１６ [4]D. FitzGerald, A. Liutkus, and R. Badeau, “Projection-based demixing of spatial audio,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, volume 24, no. 9, pp. 1560-1572, 2016

［５］Ｅ．Ｃａｎｏ，Ｄ．ＦｉｔｚＧｅｒａｌｄ，Ａ．Ｌｉｕｔｋｕｓ，Ｍ．Ｄ．Ｐｌｕｍｂｌｅｙ，ａｎｄＦ．Ｓｔｏｔｅｒ，”Ｍｕｓｉｃａｌｓｏｕｒｃｅｓｅｐａｒａｔｉｏｎ：Ａｎｉｎｔｒｏｄｕｃｔｉｏｎ，” ＩＥＥＥＳｉｇｎａｌＰｒｏｃｅｓｓｉｎｇＭａｇａｚｉｎｅ，ｖｏｌｕｍｅ３６，ｎｏ．１，ｐｐ．３１－４０，Ｊａｎｕａｒｙ２０１９ [5] E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F. Stoter, “Musical source separation: An introduction,” IEEE Signal Processing Magazine, volume 36, no. 1, pp. 31-40, January 2019

[6] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 25, no. 4, pp. 692-730, April 2017

[7] E. Cano, J. Nowak, and S. Grollmisch, “Exploring sound source separation for acoustic condition monitoring in industrial scenarios,” in Proc. of 25th European Signal Processing Conference (EUSIPCO), August 2017, pp. 2264-2268

[8] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Processing Magazine, volume 32, no. 2, pp. 55-66, March 2015

[9] E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement. Wiley, 2018

[10] D. Matz, E. Cano, and J. Abesser, “New sonorities for early jazz recordings using sound source separation and automatic mixing tools,” in Proc. of the 16th International Society for Music Information Retrieval Conference. Malaga, Spain: ISMIR, October 2015, pp. 749-755

[11] S. M. Kuo and D. R. Morgan, “Active noise control: a tutorial review,” Proceedings of the IEEE, volume 87, no. 6, pp. 943-973, June 1999

[12] A. McPherson, R. Jack, and G. Moro, “Action-sound latency: Are our tools fast enough?” in Proceedings of the International Conference on New Interfaces for Musical Expression, July 2016

[13] C. Rottondi, C. Chafe, C. Allocchio, and A. Sarti, “An overview on networked music performance technologies,” IEEE Access, volume 4, pp. 8823-8843, 2016

[14] S. Liebich, J. Fabry, P. Jax, and P. Vary, “Signal processing challenges for active noise cancellation headphones,” in Speech Communication; 13th ITG-Symposium, October 2018, pp. 1-5

[15] E. Cano, J. Liebetrau, D. Fitzgerald, and K. Brandenburg, “The dimensions of perceptual quality of sound source separation,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 601-605

[16] P. M. Delgado and J. Herre, “Objective assessment of spatial audio quality using directional loudness maps,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 621-625

[17] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, volume 19, no. 7, pp. 2125-2136, September 2011

[18] M. D. Plumbley, C. Kroos, J. P. Bello, G. Richard, D. P. Ellis, and A. Mesaros, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). Tampere University of Technology. Laboratory of Signal Processing, 2018

[19] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. Parag Shah, “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments,” July 2018, submitted to DCASE2018 Workshop

[20] L. JiaKai, “Mean teacher convolution system for dcase 2018 task 4,” DCASE2018 Challenge, Tech. Rep., September 2018

[21] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for polyphonic sound event detection in real life recordings,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 6440-6444

[22] E. Cakir and T. Virtanen, “End-to-end polyphonic sound event detection using convolutional recurrent neural networks with learned time-frequency representation input,” in Proc. of International Joint Conference on Neural Networks (IJCNN), July 2018, pp. 1-7

[23] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, “Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 121-125

[24] B. Frenay and M. Verleysen, “Classification in the presence of label noise: A survey,” IEEE Transactions on Neural Networks and Learning Systems, volume 25, no. 5, pp. 845-869, May 2014

[25] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, and X. Serra, “Learning sound event classifiers from web audio with noisy labels,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019

[26] M. Dorfer and G. Widmer, “Training general-purpose audio tagging networks with noisy labels and iterative self-verification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Surrey, UK, 2018

[27] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, pp. 1-1, 2018

[28] Y. Jung, Y. Kim, Y. Choi, and H. Kim, “Joint learning using denoising variational autoencoders for voice activity detection,” in Proc. of Interspeech, September 2018, pp. 1210-1214

[29] F. Eyben, F. Weninger, S. Squartini, and B. Schuller, “Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 483-487

[30] R. Zazo-Candil, T. N. Sainath, G. Simko, and C. Parada, “Feature learning with raw-waveform CLDNNs for voice activity detection,” in Proc. of INTERSPEECH, 2016

[31] M. McLaren, Y. Lei, and L. Ferrer, “Advances in deep neural network approaches to speaker recognition,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 4814-4818

[32] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 5329-5333

[33] M. McLaren, D. Castan, M. K. Nandwana, L. Ferrer, and E. Yilmaz, “How to train your speaker embeddings extractor,” in Odyssey, 2018

[34] S. O. Sadjadi, J. W. Pelecanos, and S. Ganapathy, “The IBM speaker recognition system: Recent advances and error analysis,” in Proc. of Interspeech, 2016, pp. 3633-3637

[35] Y. Han, J. Kim, and K. Lee, “Deep convolutional neural networks for predominant instrument recognition in polyphonic music,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 25, no. 1, pp. 208-221, January 2017

[36] V. Lostanlen and C.-E. Cella, “Deep convolutional networks on the pitch spiral for musical instrument recognition,” in Proceedings of the 17th International Society for Music Information Retrieval Conference. New York, USA: ISMIR, 2016, pp. 612-618

[37] S. Gururani, C. Summers, and A. Lerch, “Instrument activity detection in polyphonic music using deep neural networks,” in Proceedings of the 19th International Society for Music Information Retrieval Conference. Paris, France: ISMIR, September 2018, pp. 569-576

[38] J. Schlutter and B. Lehner, “Zero mean convolutions for level-invariant singing voice detection,” in Proceedings of the 19th International Society for Music Information Retrieval Conference. Paris, France: ISMIR, September 2018, pp. 321-326

[39] S. Delikaris-Manias, D. Pavlidi, A. Mouchtaris, and V. Pulkki, “DOA estimation with histogram analysis of spatially constrained active intensity vectors,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 526-530

[40] S. Chakrabarty and E. A. P. Habets, “Multi-speaker DOA estimation using deep convolutional networks trained with noise signals,” IEEE Journal of Selected Topics in Signal Processing, volume 13, no. 1, pp. 8-21, March 2019

[41] X. Li, L. Girin, R. Horaud, and S. Gannot, “Multiple-speaker localization based on direct-path features and likelihood maximization with spatial sparsity regularization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 25, no. 10, pp. 1997-2012, October 2017

[42] F. Grondin and F. Michaud, “Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations,” Robotics and Autonomous Systems, volume 113, pp. 63-80, 2019

[43] D. Yook, T. Lee, and Y. Cho, “Fast sound source localization using two-level search space clustering,” IEEE Transactions on Cybernetics, volume 46, no. 1, pp. 20-26, January 2016

[44] D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiple sound source localization and counting using a circular microphone array,” IEEE Transactions on Audio, Speech, and Language Processing, volume 21, no. 10, pp. 2193-2206, October 2013

[45] P. Vecchiotti, N. Ma, S. Squartini, and G. J. Brown, “End-to-end binaural sound localisation from the raw waveform,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 451-455

[46] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 26, no. 4, pp. 787-796, April 2018

[47] Z. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 1-5

[48] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, “Low latency sound source separation using convolutional recurrent neural networks,” in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), October 2017, pp. 71-75

[49] M. Sunohara, C. Haruta, and N. Ono, “Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with truncation of non-causal components,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 216-220

[50] Y. Luo and N. Mesgarani, “TaSNet: Time-domain audio separation network for real-time, single-channel speech separation,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 696-700

[51] J. Chua, G. Wang, and W. B. Kleijn, “Convolutive blind source separation with low latency,” in Proc. of IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), September 2016, pp. 1-5

[52] Z. Rafii, A. Liutkus, F. Stoter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, “An overview of lead and accompaniment separation in music,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 26, no. 8, pp. 1307-1335, August 2018

[53] F.-R. Stoter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in Latent Variable Analysis and Signal Separation, Y. Deville, S. Gannot, R. Mason, M. D. Plumbley, and D. Ward, Eds. Cham: Springer International Publishing, 2018, pp. 293-305

[54] J.-L. Durrieu, B. David, and G. Richard, “A musically motivated midlevel representation for pitch estimation and musical audio source separation,” IEEE Journal of Selected Topics in Signal Processing, volume 5, no. 6, pp. 1180-1191, October 2011

[55] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017

[56] P. N. Samarasinghe, W. Zhang, and T. D. Abhayapala, “Recent advances in active noise control inside automobile cabins: Toward quieter cars,” IEEE Signal Processing Magazine, volume 33, no. 6, pp. 61-73, November 2016

[57] S. Papini, R. L. Pinto, E. B. Medeiros, and F. B. Coelho, “Hybrid approach to noise control of industrial exhaust systems,” Applied Acoustics, volume 125, pp. 102-112, 2017

[58] J. Zhang, T. D. Abhayapala, W. Zhang, P. N. Samarasinghe, and S. Jiang, “Active noise control over space: A wave domain approach,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 26, no. 4, pp. 774-786, April 2018

[59] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. of Interspeech, 2013

[60] Y. Xu, J. Du, L. Dai, and C. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 23, no. 1, pp. 7-19, January 2015

[61] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: speech enhancement generative adversarial network,” in Proc. of Interspeech, August 2017, pp. 3642-3646

[62] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Latent Variable Analysis and Signal Separation, E. Vincent, A. Yeredor, Z. Koldovsky, and P. Tichavsky, Eds. Cham: Springer International Publishing, 2015, pp. 91-99

[63] H. Wierstorf, D. Ward, R. Mason, E. M. Grais, C. Hummersone, and M. D. Plumbley, “Perceptual evaluation of source separation for remixing music,” in Proc. of Audio Engineering Society Convention 143, October 2017

[64] J. Pons, J. Janer, T. Rode, and W. Nogueira, “Remixing music using source separation algorithms to improve the musical experience of cochlear implant users,” The Journal of the Acoustical Society of America, volume 140, no. 6, pp. 4338-4349, 2016

[65] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, “A joint separation-classification model for sound event detection of weakly labelled data,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2018

[66] T. v. Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, “All-neural online source separation, counting, and diarization for meeting analysis,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 91-95

[67] S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), November 2018, pp. 138-142

[68] A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop, Surrey, UK, 2018

[69] J. Abesser, M. Gotze, S. Kuhnlenz, R. Grafe, C. Kuhn, T. Clauss, H. Lukashevich, “A Distributed Sensor Network for Monitoring Noise Level and Noise Sources in Urban Environments,” in Proceedings of the 6th IEEE International Conference on Future Internet of Things and Cloud (FiCloud), Barcelona, Spain, pp. 318-324, 2018

[70] T. Virtanen, M. D. Plumbley, D. Ellis (Eds.), “Computational Analysis of Sound Scenes and Events,” Springer, 2018

[71] J. Abesser, S. Ioannis Mimilakis, R. Grafe, H. Lukashevich, “Acoustic scene classification by combining autoencoder-based dimensionality reduction and convolutional neural networks,” in Proceedings of the 2nd DCASE Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 2017

[72] A. Avni, J. Ahrens, M. Geier, S. Spors, H. Wierstorf, B. Rafaely, “Spatial perception of sound fields recorded by spherical microphone arrays with varying spatial resolution,” Journal of the Acoustical Society of America, 133(5), pp. 2711-2721, 2013

[73] E. Cano, D. FitzGerald, K. Brandenburg, “Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics,” in Proceedings of the 24th European Signal Processing Conference (EUSIPCO), pp. 1758-1762, 2016

[74] S. Marchand, “Audio scene transformation using informed source separation,” The Journal of the Acoustical Society of America, 140(4), p. 3091, 2016

[75] S. Grollmisch, J. Abesser, J. Liebetrau, H. Lukashevich, “Sounding industry: Challenges and datasets for industrial sound analysis (ISA),” in Proceedings of the 27th European Signal Processing Conference (EUSIPCO) (submitted), A Coruna, Spain, 2019

[76] J. Abesser, M. Muller, “Fundamental frequency contour classification: A comparison between hand-crafted and CNN-based features,” in Proceedings of the 44th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019

[77] J. Abesser, S. Balke, M. Muller, “Improving bass saliency estimation using label propagation and transfer learning,” in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, pp. 306-312, 2018

[78] C.-R. Nagar, J. Abesser, S. Grollmisch, “Towards CNN-based acoustic modeling of seventh chords for automatic chord recognition,” in Proceedings of the 16th Sound & Music Computing Conference (SMC) (submitted), Malaga, Spain, 2019

[79] J. S. Gomez, J. Abesser, E. Cano, “Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning,” in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, pp. 577-584, 2018

[80] J. R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31-35, 2016

[81] E. Cano, G. Schuller, C. Dittmar, “Pitch-informed solo and accompaniment separation towards its use in music education applications,” EURASIP Journal on Advances in Signal Processing, 2014:23, pp. 1-19

[82] S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen, Y. Bengio, “Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, pp. 721-725, 2018

[83] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017

[84] M. Kleiner, “Acoustics and Audio Technology,” 3rd ed., USA: J. Ross Publishing, 2012

[85] M. Dickreiter, V. Dittel, W. Hoeg, M. Wohr, “Handbuch der Tonstudiotechnik,” A. medienakademie (Eds.), 7th edition, Vol. 1, Munich: K. G. Saur Verlag, 2008

[86] F. Muller, M. Karau, “Transparent hearing,” in CHI '02 Extended Abstracts on Human Factors in Computing Systems (CHI EA '02), Minneapolis, USA, pp. 730-731, April 2002

[87] L. Vieira, “Super hearing: a study on virtual prototyping for hearables and hearing aids,” Master Thesis, Aalborg University, 2018. Available: https://projekter.aau.dk/projekter/files/287515943/MasterThesis_Luis.pdf

[88] Sennheiser, “AMBEO Smart Headset,” [Online]. Available: https://de-de.sennheiser.com/finalstop [Accessed: March 1, 2019]

[89] Orosound, “Tilde Earphones,” [Online]. Available: https://www.orosound.com/tilde-earphones/ [Accessed: March 1, 2019]

[90] K. Brandenburg, E. Cano Ceron, F. Klein, T. Kollmer, H. Lukashevich, A. Neidhardt, J. Nowak, U. Sloma, and S. Werner, “Personalized auditory reality,” in 44. Jahrestagung fur Akustik (DAGA), Garching bei Munchen, Deutsche Gesellschaft fur Akustik (DEGA), 2018

[91] US 2015195641 A1, application date: January 6, 2014; published on July 9, 2015.

Claims

1. A system comprising:
an analyzer (152) for determining a plurality of binaural room impulse responses; and
a loudspeaker signal generator (154) for generating at least two loudspeaker signals in response to the plurality of binaural room impulse responses and in response to a source signal of at least one sound source,
wherein the analyzer (152) is configured to determine the plurality of binaural room impulse responses such that each of the plurality of binaural room impulse responses takes into account an effect resulting from a user wearing headphones.

2. The system of claim 1, wherein:
the system comprises the headphones; and
the headphones are configured to output the at least two loudspeaker signals.

3. The system of claim 1 or 2, wherein:
the headphones include two headphone capsules and at least one microphone for measuring sound in each of the two headphone capsules;
the at least one microphone is configured to measure the sound and is disposed in each of the two headphone capsules; and
the analyzer (152) is configured to determine the plurality of binaural room impulse responses using the measurements of the at least one microphone in each of the two headphone capsules.

4. The system of claim 3, wherein the at least one microphone in each of the two headphone capsules is configured to generate one or more recordings of a sound situation in a reproduction room prior to reproduction of the at least two loudspeaker signals by the headphones, to determine an estimate of a first sound signal of at least one sound source from the one or more recordings, and to determine a binaural room impulse response of the plurality of binaural room impulse responses for the sound source in the reproduction room.

5. The system of claim 4, wherein the at least one microphone in each of the two headphone capsules is configured to generate, during playback of the at least two loudspeaker signals by the headphones, one or more further recordings of the sound situation in the reproduction room, to subtract an augmented signal from these one or more further recordings, to determine the estimate of the first sound signal from one or more sound sources, and to determine the binaural room impulse response of the plurality of binaural room impulse responses for the sound sources in the reproduction room.

6. The system of claim 4 or 5, wherein the analyzer (152) is configured to determine an acoustic room characteristic of the reproduction room and to adapt the plurality of binaural room impulse responses in response to the acoustic room characteristic.

7. The system of any one of claims 4 to 6, wherein the at least one microphone is disposed in each of the two headphone capsules so as to measure sound near the entrance of the ear canal.

8. The system of any one of claims 4 to 7, wherein the system includes one or more further microphones outside the two headphone capsules for measuring the sound situation in the reproduction room.

9. The system of claim 8, wherein the headphones include a headphone band, and at least one of the one or more further microphones is disposed on the headphone band.

10. The system of any one of claims 1 to 9, wherein the loudspeaker signal generator (154) is configured to generate the at least two loudspeaker signals by convolving each of the plurality of binaural room impulse responses with a source signal of one or more source signals.

11. The system of any one of claims 1 to 10, wherein the analyzer (152) is configured to determine at least one of the plurality of binaural room impulse responses in response to a movement of the headphones.

12. The system of claim 11, wherein the system includes a sensor for determining the movement of the headphones.

13. The system of claim 5, wherein the system is a system for assisting selective hearing and further comprises:
a detector (110) for detecting source signal portions of one or more sound sources using at least two received microphone signals of the auditory environment;
a position determiner (120) for assigning position information to each of the one or more sound sources;
a sound type classifier (130) for assigning a sound signal type to the source signal portion of each of the one or more sound sources; and
a signal portion modifier (140) for modifying the source signal portion of at least one sound source of the one or more sound sources depending on the sound signal type of the source signal portion of the at least one sound source, to obtain a modified sound signal portion of the at least one sound source,
wherein:
the analyzer (152) and the loudspeaker signal generator (154) together form a signal generator (150);
the analyzer (152) of the signal generator (150) is configured to generate the plurality of binaural room impulse responses, the plurality of binaural room impulse responses being a plurality of binaural room impulse responses for each of the one or more sound sources, dependent on the position information of the sound source and on a head orientation of the user; and
the loudspeaker signal generator (154) of the signal generator (150) is configured to generate the at least two loudspeaker signals in response to the plurality of binaural room impulse responses and in response to the modified sound signal portion of the at least one sound source.

14. The system of claim 13, wherein the detector (110) is configured to detect the source signal portions of the one or more sound sources by using a deep learning model.

15. The system of claim 13 or 14, wherein the position determiner (120) is configured to determine the position information for each of the one or more sound sources in response to captured images or recorded video.

16. The system of any one of claims 13 to 15, wherein the signal portion modifier (140) is configured to select, in response to a previously learned user scenario, the at least one sound source whose source signal portion is to be modified, and to modify that source signal portion in response to the previously learned user scenario.

17. The system of any one of claims 13 to 16, wherein:
the system includes a remote device (190) including the detector (110), the position determiner (120), the sound type classifier (130), the signal portion modifier (140), and the signal generator (150); and
the remote device (190) is spatially separated from the headphones.

18. The system of claim 17, wherein the remote device (190) is a smartphone.

19. A method comprising:
determining, by an analyzer (152) of a system, a plurality of binaural room impulse responses; and
generating, by a loudspeaker signal generator (154) of the system, at least two loudspeaker signals in response to the plurality of binaural room impulse responses and in response to a source signal of at least one sound source,
wherein the plurality of binaural room impulse responses are determined by the analyzer (152) such that each of the plurality of binaural room impulse responses takes into account an effect due to a user wearing headphones.

20. A computer program having a program code for causing the execution of the method according to claim 19 when the computer program is executed by a computer or signal processor.
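
For illustration only, and not as part of the claims: the claim in which the loudspeaker signal generator (154) convolves each binaural room impulse response with a source signal can be sketched in a few lines of Python. The function names `convolve` and `render_binaural`, the representation of a binaural room impulse response as a (left, right) pair of sample lists, and the summation of all sources into two output channels are assumptions made for this sketch; the publication does not prescribe a particular implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical type: one binaural room impulse response as (left ear, right ear).
Brir = Tuple[Sequence[float], Sequence[float]]


def convolve(signal: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Direct-form linear convolution of a signal with an impulse response."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def render_binaural(
    sources: Sequence[Sequence[float]], brirs: Sequence[Brir]
) -> Tuple[List[float], List[float]]:
    """Convolve each source with its own BRIR pair and sum per ear.

    Assumption of this sketch: sources[i] is rendered with brirs[i], and the
    two returned lists are the left and right loudspeaker (headphone) signals.
    """
    if len(sources) != len(brirs):
        raise ValueError("one BRIR pair is required per source signal")
    length = max(
        (len(s) + max(len(hl), len(hr)) - 1 for s, (hl, hr) in zip(sources, brirs)),
        default=0,
    )
    left = [0.0] * max(length, 0)
    right = [0.0] * max(length, 0)
    for s, (hl, hr) in zip(sources, brirs):
        for i, v in enumerate(convolve(s, hl)):
            left[i] += v
        for i, v in enumerate(convolve(s, hr)):
            right[i] += v
    return left, right


if __name__ == "__main__":
    # Two toy sources with made-up impulse responses.
    toy_sources = [[1.0, 0.0], [0.0, 2.0]]
    toy_brirs = [([1.0, 0.5], [0.5]), ([0.25], [1.0, 1.0])]
    print(render_binaural(toy_sources, toy_brirs))
```

A real-time system would more likely use block-wise FFT convolution and would exchange the impulse responses as the head orientation changes; the direct form is used here only because it is short and has no dependencies.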