Acoustic Echo Cancellation Thesis
A Thesis
Presented to
The Academic Faculty
by
Ted S. Wada
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy in the
School of Electrical and Computer Engineering
Approved by:
ACKNOWLEDGEMENTS
Many thanks go out to countless people that I met throughout the academic journey at Georgia Tech.
To start with, I want to recognize the fellow research group members and graduate
students from the Center for Signal and Image Processing (CSIP) and the School of Electrical and Computer Engineering (ECE):
Jeon, Gaofeng Yue, Qiang Fu, Enrique Robledo-Arnuncio, Dwi Sianto Mansjur, Soohyun
Bae, Yong Zhao, Sunghwan Shin, Umair Altaf, Jason Wung, Mingyu Chen, Chao Weng, Jin
Wang, Yu Tsao, Yeongseon Lee, Byungki Byun, Jinwoo Kang, Gregory Krudysz, Kaustubh
Kalgaonkar, Toygar Akgun, Kun Shi, Thao Tran, Ilseo Kim, Jonathan Kim, Xusheng Sun,
Deryck Yeung, Byungchil Kim, and Calvin Xu. I apologize to those I may have left out.
I would like to acknowledge the visiting scholars and professors I have come across during
my long-term stay at Georgia Tech: Drs. Marco Siniscalchi, Shigeki Miyabe, Francesco
Nesta, Taras Butko, Taeyoon Kim, Shinji Watanabe, Tomoko Matsui, Tomohiro Nakatani,
Hiroshi Saruwatari, and Thrasyvoulos Pappas. I especially thank Dr. Miyabe and Dr.
Nesta, without whom I would not have had a direct exposure to the topic of blind source separation.
I would like to direct my appreciation to the professors who agreed to share their pre-
cious time to be on my dissertation committee: Profs. David Anderson, Xiaoli Ma, Jeff
Shamma, and Brani Vidakovic. I also would like to thank those behind the scenes who
kept or still keep ECE and CSIP running: Prof. David Hertling, Prof. Bonnie Ferri,
Prof. Douglas Williams, Marilou Mycko, Daniela Staiculescu, Suzzette Willingham, Tasha
Diana Fouts, Pat Dixon, Christy Ellis, Tammy Scott, Jennifer Lunsford, Lisa Gardner, and
Stacie Speights.
I am forever grateful to the bosses, mentors, and co-workers at the companies I interned
with for giving me the opportunities to gain valuable work experiences: Drs. Eric Diethorn
and Gary Elko at Avaya, Inc., Drs. Rick Younce and Rafid Sukkar at Tellabs, Inc., Drs.
Qi Li and Manli Zhu at Li Creative Technologies, Inc., and Drs. Juin-Hwey Chen and Jes Thyssen at Broadcom Corporation.
Most importantly, I would like to reflect the sincerest respect and gratitude towards my
advisor, Prof. Biing-Hwang Juang. I could not have stuck around for so long at Georgia
Tech without his patience and generosity. Dr. Juang’s critical, insightful, and practical
view on scientific research and everyday life should continue to guide my future endeavors.
Finally but not least, I would like to thank my mother Junjitsu and father Motonaka
for giving me an identity that can never be taken away, and my brothers David and Roy.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
I INTRODUCTION
1.1 Motivations
1.2 Objectives
1.3 Outline
II PROBLEM BACKGROUND
2.1 Acoustic Echo Cancellation (AEC)
2.1.1 Mean Square Error (MSE) Minimization
2.1.2 Least Mean Square (LMS) Algorithm
2.1.3 Orthogonality Principle
2.1.4 Performance Measures
2.2 Multi-channel AEC (MCAEC)
2.2.1 Single-channel Solution to MCAEC
2.2.2 Non-uniqueness Problem
2.2.3 Coherence Measure
2.3 Source Separation
2.3.1 Blind Source Separation (BSS)
2.3.2 Independence Maximization via Independent Component Analysis (ICA)
2.3.3 Semi-Blind Source Separation (SBSS)
2.4 Effect of Distortions on AEC and BSS
2.4.1 Linear Distortion
2.4.2 Nonlinear Distortion
2.4.3 Sampling Rate Mismatch
III ROBUST AEC VIA RESIDUAL ECHO ENHANCEMENT (REE)
3.1 Conventional Approaches
3.1.1 Robustness against Non-stationarity
3.1.2 Robustness against Ambient Noise
3.1.3 Robustness against Double-Talk
3.2 Residual Echo Enhancement
3.2.1 Error Recovery Nonlinearity (ERN)
3.2.2 Connection to Conventional Approaches
3.2.3 Analysis of ERN
3.2.4 Connection between LMS, ICA, and SBSS
3.2.5 Block-Iterative Adaptation (BIA)
3.3 Experimental Evaluation
3.3.1 Time-Domain AEC with Ambient Noise
3.3.2 Time-Domain AEC with Speech Coding Distortion
3.3.3 Frequency-Domain AEC with Ambient Noise and Double-Talk
V ROBUST MCAEC VIA SBSS
5.1 Generalization of MCAEC by SBSS
5.1.1 SBSS Model
5.1.2 Non-uniqueness Problem in SBSS
5.1.3 Derivation of Steady-State Solution
5.1.4 Connection between MSE and ICA
5.1.5 Constraint on Separation Matrix
5.2 Algorithm Design and Related Issues
5.2.1 Online Implementation of SBSS
5.2.2 Proposed SBSS Algorithm
5.3 Experimental Evaluation
5.3.1 Evaluation Methods
5.3.2 Experimental Results
VI CONCLUSIONS
6.1 Contributions
6.2 Future Research Suggestions
REFERENCES
VITA
LIST OF TABLES
1 Three sample-based adaptive algorithms obtained through least-square optimization and commonly used for AEC: LMS, AP, and RLS.
LIST OF FIGURES
10 GSM AMR speech coding distortion, measured in terms of the signal-to-distortion ratio (SDR), versus input speech magnitude, measured in terms of the signal loss, for various bit-rates (kbps).
11 Echo return loss enhancement (ERLE) from the network-based AEC (using the FBLMS algorithm) as a function of the loudspeaker saturation parameter ρ (smaller ρ indicates greater saturation) and the GSM AMR bit-rate.
12 ERLE when there is a sampling rate mismatch between the reference signal and the acoustic echo. The correction was made through re-sampling by interpolation in the time domain by using the coefficients re-use block size of B.
13 Adaptive filtering with linear or nonlinear distortion on the true, noise-free response d̄(n). A noise-suppressing memoryless nonlinearity f(·) is applied to the observed filter estimation error e(n) in the feedback loop in order to suppress the effects of the distortion during the adaptation of a linear filter.
14 PDFs (scaled by log(·)) and score functions of s, t, and u when s = t + u, t ∼ Gaussian(0, 1), and u ∼ Laplacian(0, 1). For |x| < 2, φ_s(x) exhibits approximately linear scaling by a factor of 0.5 that accounts for relatively equal contributions from t and u for small observed (noisy) magnitude |s|. For |x| > 2, φ_s(x) → φ_u(x) very rapidly as |x| → ∞ since the kurtosis is higher for u than t, i.e., the probability of u is greater than that of t for large observed magnitude |s|.
15 Realization of noise-suppressing nonlinearities (61), (62), (65), (66), and (67) obtained through the MMSE and the MAP estimation procedures when the observed adaptive filter estimation error is modeled additively as e = ē + v.
16 Realization of ad-hoc noise-suppressing nonlinearities (68) and (69).
17 Comparison between AEC and adaptive noise cancellation (ANC). In both cases, the application of ERN to the filter error e(n) allows an LMS-based adaptive filter to produce the estimate of the target source s(n) that is independent on average from the signal x(n) to be canceled.
18 From the regularized NLMS algorithms (rNLMS1, rNLMS2) with the ERN (f_MMSE^GL).
19 From the regularized NLMS algorithm (rNLMS2) with the ERN (f_MMSE^GL). Corresponding average tERLE is provided in the legend.
20 Network-based AEC with speech coding distortion for various bit-rates (kbps). Quantization noise starts to take over beyond the signal loss of 25 dB.
21 From the NLMS algorithms (NLMS, rNLMS2) with the ERN (f_MMSE^GG). GSM AMR speech encoding and decoding at 12.2 kbps bit-rate were applied to the acoustic echo.
22 From the regularized FBLMS algorithm (RFBLMS) with the ERN (f_MMSE^GG, f_MMSE^LG, f_MMSE^GL).
23 From RFBLMS with f_MMSE^GL for various block iterations iter and step-size α. The RIR changes suddenly at 10 seconds due to the displacement of the loudspeaker-microphone enclosure.
24 From RFBLMS with f_MMSE^GL or f_MAP^GL.
25 System integration of adaptive filter and REE.
26 Signal delay is linearly varied after resampling frame-wise (a) alternatingly across channels and (b,c,d) simultaneously across channels where every other block is resampled after time reversal and reversed back afterward (N = 2048, R = 1.0004, fs = 16 kHz).
27 Inter-channel coherence (averaged over first 5 seconds of speech). From top to bottom: N = ∞, 4096, 2048, 1024, 512, 256.
28 Inter-channel coherence comparison after decorrelation.
29 Distortion on a speech signal by FDR (normalized with respect to TDR) for R = 1.0004, N = 2048, and fs = 16 kHz.
30 Block-iterative adaptation (BIA) of the filter coefficient vector w_{i,j}, where i is the iteration index and j is the block index.
31 Near-end loudspeaker, local, and microphone signals for SAEC. Double-arrows indicate individual speech activity.
32 True ERLE (averaged over left and right channels) for combinations of iter, B, DTD, and EW without decorrelation.
33 Improvement in misalignment without decorrelation.
34 True residual echo and tERLE without decorrelation.
35 True residual echo and tERLE with NLP.
36 True residual echo and tERLE with AWGN.
37 True residual echo and tERLE with OSD.
38 True residual echo and tERLE with RUD or RBI.
39 Improvement in misalignment after decorrelation.
40 Misalignment averaged over 20 runs and all echo paths.
41 Sub-band tERLE decomposition (stereophonic, averaged over all channels).
42 Sub-band misalignment decomposition (stereophonic, averaged over all echo paths).
43 Variable resampling ratios R1, R2, and R3 and their corresponding coherence plots, which match the coherence from SBR to that of other decorrelation methods.
44 FDR with fixed ∆R = 0.0028 and the corresponding variable resampling ratios R4 and R5 that match the coherence of FDR in the high-frequency band.
45 Sub-band tERLE decomposition (single channel).
46 Sub-band misalignment decomposition (single channel).
47 Model of the near-end and the far-end mixing systems and the semi-blind source separation (SBSS) system.
48 Source activities in the worst-case scenario.
49 Performance of SBSS with AWGN decorrelation procedure (with de-mixing matrix constraints and 5 iterations).
50 Performance of constrained and unconstrained SBSS (with 5 iterations).
51 Performance of SBSS with different number of iterations (with de-mixing matrix constraint).
52 Performance of SBSS with different EBR (with de-mixing matrix constraint and 5 iterations).
53 Performance of SBSS with AWGN noise at near-end microphones (with de-mixing matrix constraint and 5 iterations).
54 Performance of SBSS with different FFT frame size (with de-mixing matrix constraint and 5 iterations).
55 Performance of SBSS with different ICA step-size η (with de-mixing matrix constraint and 5 iterations).
56 Performance of SBSS with online and batch-online implementations (with de-mixing matrix constraint and 5 iterations).
57 Performance of SBSS with variation in the active source number (with de-mixing matrix constraint and 5 iterations).
58 Performance of SBSS for different microphone configurations (with de-mixing matrix constraint and 5 iterations).
59 Performance of SBSS with different microphone configurations and without applying the near-end source separation (with de-mixing matrix constraint and 5 iterations).
60 Comparison between SBSS and FBLMS for a real-world scenario.
SUMMARY
In the real world, an acoustic signal is almost never present in isolation. The ubiqui-
tous “acoustic ambience” is always full of noise of various kinds. These interferences are
detrimental in problems that involve linear system identification since the input-output
relationship is much more than what is defined by the linear system. In particular, the
conventional acoustic echo cancellation (AEC) framework based on the least mean square
(LMS) algorithm by itself lacks the ability to handle many secondary signals, e.g., local
speech and background noise, that interfere with the adaptive identification and filtering
process. There are also nonlinear interferences, such as those introduced by speech codecs
used in telecommunication networks, that can pose more difficulties to AEC than linearly
additive noises. We need to address the deficiencies of traditional AEC techniques and
develop a set of novel techniques to provide the AEC performance for immersive telecon-
In this dissertation, we build a foundation for what we refer to as the system approach, which aims for the integration of individual components, or algorithms, into a cohesive unit for the benefit of the system as a whole to cope with real-world enhancement problems. The standard system identification approach of minimizing the mean square error (MSE) of a linear system is sensitive to distortions that greatly affect the quality of the identification. We address this sensitivity through a noise-suppressing nonlinearity in the adaptive filter error feedback-loop of the LMS algorithm when there is an interference at the near end, where the source of distortion may be linear or nonlinear.
We provide a thorough derivation and analysis of the error recovery nonlinearity (ERN)
that “enhances” the filter estimation error prior to the adaptation to transform the cor-
rupted error’s distribution into a desired one, or very close to it, in order to assist the linear
adaptation process. We reveal important connections of the residual echo enhancement
(REE) technique to other existing AEC and signal enhancement procedures, where the
technique is well-founded in the information-theoretic sense and has strong ties to indepen-
dent component analysis (ICA), which is the basis for blind source separation (BSS). We show that the single-channel AEC problem can be viewed as a special case of semi-blind source separation (SBSS) in which one of the source signals is partially known, i.e., the far-end microphone signal that generates the near-end acoustic echo. Indeed, SBSS optimized via ICA leads to the system combination of the LMS algorithm with the ERN that allows continuous and stable adaptation even during double talk.
Next, we extend the system perspective to the decorrelation problem for AEC that,
apart from the system identification issues, can retard the adaptive algorithm’s conver-
gence speed and degrade the AEC performance. We show that the REE procedure can be combined with the decorrelation-by-resampling (DBR) technique, which alleviates the non-uniqueness problem while introducing minimal distortion to signal quality and statistics. We then illustrate the systematic relationship between DBR and block-iterative adaptation (BIA), or batch adaptation, and extend DBR to frequency-domain resampling (FDR) for computational saving and more design flexibility for controlling the trade-off between signal distortion and decorrelation amount. As an advanced extension of FDR, we come up with the sub-band resampling
(SBR) technique, which, given the same degree of decorrelation measured in terms of the
coherence, has the potential to not only achieve superior audio quality but also provide
better overall AEC performance than other decorrelation procedures. We show that the
system approach can also be applied to the multi-delay filter (MDF), which suffers from
the inter-block correlation problem. Sub-band analysis of the echo return loss enhancement
and the misalignment illustrates that DBR and BIA work together mutually to recover the
cancellation performance lost from inter-channel correlation during MCAEC and that BIA
enables the MDF to naturally regain the performance lost from inter-block correlation.
Finally, we generalize the MCAEC problem in the SBSS framework and discuss many
issues related to the implementation of an SBSS system. After a deep analysis of the struc-
ture of the SBSS adaptation, which is a truly multi-channel approach to AEC rather than a simple combination of single-channel solutions, we propose a batch-online implementation that stabilizes the convergence behavior even in the worst-case scenario
of a single far-end talker along with the non-uniqueness condition on the far-end mixing
system. Specifically, we use a matrix constraint to reduce the effect of the non-uniqueness
problem. Experimental results show that high echo cancellation can be achieved just as well without applying a decorrelation procedure to the far-end signals even for the single far-end talker case.
The proposed techniques are developed from a pragmatic standpoint, motivated by real-
world problems in acoustic and audio signal processing. From a theoretical point of view, an important element of this work is the re-examination of the mean square error optimization criteria that give rise to the LMS algorithm. Short-
comings of these methods and criteria, although conceptually simple and practical in terms
of computational saving, have been much discussed. On the other hand, the results from ICA-based processing have been in many cases surprisingly good and robust, as reported in the recent literature. Raising the orthogonality principle to the system level of an AEC problem allows us to relate AEC to source sep-
aration that seeks to maximize the independence, hence implicitly the orthogonality, not
only between the error signal and the far-end signal, but rather, among all signals involved.
The system approach, for which the REE paradigm is just one realization, enables the enhancement of signals in a theoretically sound yet practically effective manner for solving the enhancement problem in a very noisy and distortion-filled real-world environment.
CHAPTER I
INTRODUCTION
1.1 Motivations
Acoustic echo arises during a telecommunication session since there is usually some degree
of coupling between a loudspeaker and a microphone at the near end (or the local end),
which results in a signal from the far end (or the remote end) being played out and captured
locally and sent back to its originating location (see Figure 1). When the round trip delay is
substantial (∼ 200 ms or more), the resulting “echo” becomes highly objectionable and may
render the teleconferencing ineffective [40, Chapter 1]. The situation is obviously avoided by
using a half-duplex transmission mode to eliminate the echo’s return path completely (i.e.,
hard clipping) or a headset to isolate the earpiece from the microphone. But such solutions
are not always desirable or even possible. Many everyday situations clearly demand full-duplex, hands-free communication, in which case acoustic echo cancellation (AEC) is indispensable. In addition, the usage of multiple loudspeakers and microphones to provide spatial audio awareness requires effective multi-channel acoustic echo cancellation (MCAEC) among other signal processing techniques.
AEC is fundamentally a system identification problem in which the room impulse response (RIR) (i.e., the echo path) between a loudspeaker
and a microphone is estimated by a linear adaptive filter. The least mean square (LMS)
algorithm is commonly used for such a task due to its computational simplicity and adapt-
ability. To cancel the echo, the estimated RIR in the form of a finite impulse
response (FIR) filter is applied to the far-end signal (i.e., the reference signal), the result
of which is a close replica of the far-end signal captured at the near-end microphone that
can be subtracted from the near-end microphone signal before being transmitted back to
Figure 1: Model for acoustic echo inside the loudspeaker-enclosure-microphone system (LEMS). A loudspeaker power amplifier (PA) is often modeled by a memory-less saturation nonlinearity that distorts the far-end (reference) signal x(n), whereas the effect of a microphone pre-amp is usually ignored. Impulse responses hd(n), due to a direct echo path, and hr(n), due to reverberation (i.e., reflections), are combined together as a single room impulse response (RIR) h(n). A local noise v(n) is added directly to the acoustic echo d(n), and the resulting near-end signal y(n) is sent back to the far end.
the far end. However, there may be many types of distortion that can prevent the LMS-
based adaptive filter from properly identifying the echo path in the first place for sufficient
AEC performance. Any remaining echo then has to be reduced further by residual echo suppression (RES) (see Figure 2).
Figure 2: Model for acoustic echo cancellation (AEC). There may be linear or nonlinear distortion applied to the reference signal x̄(n) or the acoustic echo d̄(n) that can prevent a linear finite impulse response (FIR) adaptive filter from properly identifying the RIR h(n). Any remaining echo component in the estimation error e(n) is reduced further through residual echo suppression (RES).
For example, a local noise added to the acoustic echo can complicate the task of sys-
tem identification since it is unrelated to the reference signal that is actually driving the
loudspeaker-enclosure-microphone system (LEMS) (see Figure 1). Such a type of distortion
can be categorized into two groups. One is the ambient background noise, e.g., from an
air conditioner or a display projector in a conference room, that is ubiquitous and continu-
ously present. The other type of noise occurs when the near-end talker speaks concurrently
with the far-end talker, i.e., the double-talk situation. Although usually short in duration,
double talk can severely disrupt the filter adaptation since the near-end speech is colored,
non-stationary, and most likely larger in volume than the acoustic echo. A practical yet
rather ad-hoc solution, much like the hard-clipping RES approach, is to completely freeze
the filter adaptation during double talk, which obviously limits the effectiveness of an adaptive filter.
Whereas the effect of a local noise is linear in the LEMS, nonlinear characteristics, e.g., of the loudspeaker power amplifier (see Figure 1), can also degrade the AEC performance excessively [16, 134]. A linear approximation of the LEMS is no longer valid in such a case, and an FIR filter determined by the LMS algorithm is able to cancel only the linear portion of the acoustic echo. In such a situation, one remedy is to decouple the echo path into a cascade of a nonlinear response, e.g., the loudspeaker saturation, and a linear response, i.e., the acoustic impulse response itself (see Figure 1). More detailed discussion of the nonlinear distortion is given in Section 2.4.2.
Another source of nonlinear and pervasive distortion that greatly degrades the AEC
performance consists of the speech codecs used in wireless and voice-over-IP (VoIP) com-
munications. Many signal enhancement tasks such as the automatic gain control (AGC)
and the noise reduction are often relegated to the telecommunications network providers
due to the lack of proper computing resources at the near end, e.g., a cellular handset.
Hence the AEC must then be implemented at a central location somewhere in the net-
work as illustrated in Figure 3. When a speech coding distortion is applied to the acoustic
echo, even fast-converging alternatives to the LMS algorithm such as the frequency-block
LMS (FBLMS) and the recursive least squares (RLS) algorithms are unable to provide an adequate level of echo cancellation.
Figure 3: Network-based AEC, where the acoustic echo canceller operates at a base station or central office, between the speech decoders and encoders that connect the remote caller and the cellular handset.
Another problem often overlooked by the hardware manufacturers and the developers of signal enhancement algorithms is the mismatch in the sampling rate between the digital-to-analog converter (DAC) and the analog-to-digital converter (ADC). A rate mismatch of several hundred parts per million (ppm) is normally expected due to the randomness in manufacturing processes and the lack of a common clock source by the DAC and the ADC, where such a small amount of mismatch can create enough nonlinear distortion to degrade the AEC performance significantly.
Finally, a distortion to the acoustic echo may occur simply through a change in the RIR
itself, after which an adaptive filter must be able to converge quickly to a new solution.
Particularly for the case of MCAEC, a filter needs to be re-adapted also when the far-end
echo paths change [118], e.g., when the far-end talkers change in location or when an active
talker switches from one to another. The situation is further complicated by the so-called
“non-uniqueness” problem caused by the correlated multiple reference signals, i.e., far-end
microphone signals, that dramatically slows down the convergence rate of a multi-channel adaptive algorithm.
1.2 Objectives
There are many distortions that can affect the AEC performance as discussed above, in-
cluding:
• Local noise in the form of ambient background noise.
• Local noise in the form of near-end speech, i.e., the double-talk situation.
• Nonlinear distortion from the loudspeaker power amplifier.
• Speech coding distortion.
• Sampling rate mismatch between the reference signal and the acoustic echo.
• Change in the RIR, i.e., the echo path.
• Ill-conditioning of the far-end mixing system that causes the non-uniqueness problem during MCAEC.
In order to better attack these problems, we may generalize all signal enhancement pro-
cedures (e.g., AEC, noise reduction, dereverberation, beamforming, etc.) as “source sep-
aration,” i.e., estimation of individual signals from a mixture to obtain the signal(s) of
interest. AEC and other signal enhancement algorithms are traditionally derived through
the standard mean square error (MSE) optimization. This leads to the well-known orthog-
onality principle, i.e., decorrelation of algorithm output signals. On the other hand, recent
works on source separation clearly suggest that by elevating the signal enhancement crite-
rion to achieve maximum statistical independence of the processed signals, an improved and
even more robust performance can be obtained than with standard MSE-based methods.
By “robustness,” we mean the ability to provide sufficient performance even if the assumed
conditions deviate from the ideal ones. The conventional MCAEC framework happens to be
inherently limited by the single-channel LMS optimization approach that does not properly take all the signals involved into account.
The main challenge is in deriving a new AEC paradigm in the framework of blind source
separation (BSS) based on independent component analysis (ICA) that provides a solid
analytical foundation to solving the robustness issue. Many classical AEC algorithms must
be adjusted often in an ad-hoc fashion to fit the real-world situations and, by lacking the
robustness, may consequently fail to produce the desired results. The ICA-based BSS is a
very powerful and effective signal enhancement method that allows the recovery of a target
signal from a mixture of multiple signals even when the original signals are unavailable
(hence “blind”). By relating the LMS-based AEC to the ICA-based semi-blind source
separation (SBSS), in which case some of the source signals are known in advance (e.g.,
the far-end reference signal), the development of many new AEC techniques that are much more robust to real-world distortions becomes possible.
Our overall objective is to establish what we call the system approach for the development of algorithms, or components, for robust AEC in order to overcome the real-world challenges discussed above, where the components are designed properly to interact with each other mutually for the benefit of the system as a whole.
The REE technique is just one realization of the system approach to the AEC problem that,
in fact, falls out naturally from the SBSS framework. The goal is to show that by raising
the orthogonality principle to the system level that seeks to maximize the independence
between outputs representing all the involved signals, robust signal enhancement is possible.
1.3 Outline
In Chapter 2, we briefly go over adaptive filtering algorithms based on the MSE op-
timization for single-channel AEC. We review the LMS algorithm and the corresponding orthogonality principle, and we describe the non-uniqueness problem, which occurs when the single-channel LMS algorithm is applied to the MCAEC problem.
We then provide an overview of BSS, the statistical independence maximization via ICA,
and SBSS for a generalized framework for MCAEC. We also define the performance mea-
sures for AEC and the coherence, or correlation, measure for MCAEC, and examine the effects of various distortions on the AEC and BSS performance.
In Chapter 3, we establish the foundation for the system approach to robust AEC by
introducing the REE technique. We derive what we refer to as the error recovery nonlinear-
ity (ERN) through the Bayesian estimation procedure for the suppression of additive noise
remaining in an adaptive filter’s cancellation error that allows the adaptive filter to cope
with distortions to the acoustic echo. We show that the technique is fundamentally related
to many other conventional procedures for the robustification of AEC, and that it is also well-founded in the information-theoretic sense.
Through the system approach, we motivate what we call the block-iterative adaptation
(BIA) as an essential part of the robust AEC system that not only permits the recovery of
the convergence speed lost in noisy situations but also the reduction in sensitivity to the misadjustment of the algorithm parameters. We conclude the chapter with AEC simulation with linear and nonlinear distortions to illustrate the technique's effectiveness.
In Chapter 4, we expand the system perspective to cover the decorrelation aspect for
assisting the AEC process. We first apply the REE technique to MCAEC to show that
BIA permits a natural recovery of the AEC performance lost due to inter-channel correlation. We next introduce the decorrelation-by-resampling (DBR) technique that is compatible with the REE-based MCAEC system for alleviating the non-uniqueness problem. We then extend the DBR technique to frequency-domain resampling (FDR) for
computational saving, where we obtain the sub-band resampling (SBR) technique as a vari-
ation of FDR that not only achieves superior audio quality compared to other decorrelation
procedures but also provides the most improvement in the overall AEC performance. Fi-
nally, we apply the decorrelation via system approach to the multi-delay filter (MDF) [120],
which suffers from inter-block correlation [19]. We demonstrate the advantage of the sys-
tem approach through the sub-band decomposition of the true echo return loss enhancement
(tERLE) and the misalignment obtained from MCAEC and MDF simulations.
In Chapter 5, we generalize MCAEC in the SBSS framework. We first define the SBSS model and cast the non-uniqueness problem in the
SBSS framework. We then derive the steady-state solution for SBSS and make a clear
connection between the MSE solution to MCAEC and the ICA solution to SBSS, where we
show that the ICA solution is contingent on certain constraints on the separation matrix.
As the convergence rate of ICA-based adaptive algorithms is generally much slower com-
pared to that of the LMS algorithm and is strongly dependent on the amount of data, we
introduce the batch-online adaptation procedure for practical SBSS and provide an outline
of the proposed SBSS algorithm along with simulation results.
CHAPTER II
PROBLEM BACKGROUND
This chapter is organized as follows. First in Section 2.1, we go over the acoustic echo
cancellation (AEC) problem and its workhorse, the least mean square (LMS) algorithm
based on the mean square error (MSE) optimization. Second in Section 2.2, we present
the multi-channel AEC (MCAEC) scenario and the corresponding non-uniqueness prob-
lem that occurs when the LMS algorithm is applied to MCAEC. Third in Section 2.3, we
motivate blind source separation (BSS), or namely semi-blind BSS (SBSS), for the gen-
eralization of MCAEC that maximizes the statistical independence of output signals via
independent component analysis (ICA). Finally in Section 2.4, we examine the effect of various distortions on the performance of AEC and BSS.
2.1 Acoustic Echo Cancellation (AEC)
2.1.1 Mean Square Error (MSE) Minimization
A linear adaptive filter can be obtained through an optimization procedure that minimizes
some cost function (or criterion) in terms of the estimation error and the filter coefficients.
There are basically three types of sample-based time-domain adaptive filtering algorithms
commonly used for AEC that are derived from the traditional least-square optimization ap-
proach: LMS [138, 139], affine projection (AP) [94, 40], and recursive least squares (RLS)
[49, 46]. The corresponding cost function and algorithmic complexity for the three algo-
The convergence rate of the AP and the RLS algorithms is generally much higher than
that of the LMS algorithm, but so is their computational complexity. The LMS algorithm is
a special case of the AP algorithm when only a single tap rather than multiple taps of past
estimation error is used to update the filter coefficients. The LMS algorithm is also a robust
way to stochastically approximate the least-squares solution [47] and is derived in the same
fashion as with the Wiener filter through the minimization of the MSE E{|e(n)|2 }, where
Table 1: Three sample-based adaptive algorithms obtained through least-square optimization and
commonly used for AEC: least mean square (LMS), affine projection (AP), and recursive least
squares (RLS). E{·} is the expectation operator, e(n) is the estimation error, L is the filter order,
P is the AP order, and 0 < λ ≤ 1 is the forgetting factor.
E{·} is the expectation operator and e(n) is the filter output error. The RLS algorithm,
which may be considered as a stochastic version of the Kalman filter and exhibits much
faster convergence speed than the Wiener filter, is computationally far more expensive
than the LMS algorithm and is susceptible to numerical instability since it recursively
estimates the inverse of the input auto-correlation matrix. The LMS algorithm suffers the
most from the decreased convergence rate when the input signal is non-white, e.g., the
speech signal. Still, the LMS algorithm is widely adopted for the AEC purpose due to
its computational efficiency and adaptability, and it is a standard against which all other
On the other hand, block-based frequency-domain versions of the LMS algorithm in-
clude frequency-block LMS (FBLMS) [23, 115], multi-delay filter (MDF) [120] (or equivalently,
partitioned block frequency-domain adaptive filter (PBFDAF) [17]), and generalized MDF
(GMDF) [82]. These adaptive algorithms achieve improved computational efficiency and
faster convergence rate than the time-domain sample-based counterpart through the imple-
mentation of filtering (i.e., convolution) and adaptation in the frequency domain. Although
the overlap-add (OLA) or overlap-save (OLS) method [93] can be used to reduce a buffer-
ing delay introduced by the block processing, the delay may still be significant when a long
room impulse response (RIR) corresponding to the reverberant acoustic condition must be
estimated. A trade-off between the computational complexity, convergence rate, and pro-
cessing delay must be considered while implementing the frequency-domain AEC in real
time.
2.1.2 Least Mean Square (LMS) Algorithm
The LMS algorithm updates the filter coefficient vector according to
w(n + 1) = w(n) + µe(n)x(n), (1)
where µ is the step-size, x(n) = [x(n), x(n − 1), · · · , x(n − L + 1)]^T is the reference signal vector of length L at time n, w(n) = [w_0(n), w_1(n), · · · , w_{L−1}(n)]^T is the filter coefficient vector of the same
length, and
e(n) = d(n) − d̂(n) (2)
is the error from approximating the observed acoustic echo d(n) by the replica
d̂(n) = w^T(n)x(n) (3)
based on the estimated echo return path response reflected in w(n). It iteratively and stochastically approximates the Wiener solution,
w_opt = R_xx^{−1} r_xd, (4)
where R_xx = E{xx^T} is the auto-correlation matrix and r_xd = E{xd} is the cross-correlation vector.
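As a concrete illustration, a minimal NumPy sketch of the sample-based adaptation of (1)-(3) may look as follows; the filter length and step-size below are arbitrary illustrative assumptions, not values taken from this thesis:

```python
import numpy as np

def lms(x, d, L=128, mu=0.01):
    """Sketch of the LMS update w(n+1) = w(n) + mu * e(n) * x(n) of (1)-(3)."""
    w = np.zeros(L)                        # filter coefficient vector w(n)
    e = np.zeros(len(x))                   # estimation error e(n)
    for n in range(L - 1, len(x)):
        x_n = x[n - L + 1:n + 1][::-1]     # x(n) = [x(n), ..., x(n - L + 1)]^T
        d_hat = w @ x_n                    # echo replica, eq. (3)
        e[n] = d[n] - d_hat                # estimation error, eq. (2)
        w = w + mu * e[n] * x_n            # coefficient update, eq. (1)
    return w, e
```

For a white reference signal and a known FIR echo path, w drifts toward the Wiener solution of (4) as the iterations proceed.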
There are many real-world factors that affect the performance of the LMS algorithm.
First and the foremost is that a finite number of coefficients are used to describe a system
response h(n) that is usually infinite in length, hence e(n) ≠ 0 no matter how perfect an
adaptive algorithm is. Another factor is that the sample-based approximation of the gradient of the MSE leads to noisy estimation (i.e., the gradient noise [49]) of the filter coefficients, which exacerbates the sensitivity of the adaptive algorithm to non-whiteness and non-stationarity of signals. Finally, there are distortions to the acoustic echo, modeled here by
d(n) = d̄(n) + v(n), (5)
d̄(n) = h(n) ∗ x(n) ≈ h^T(n)x(n), (6)
where d̄(n) is the true acoustic echo, v(n) is the noise, assumed here to be additive at the time-sample level, “∗” is the convolution operator, and h(n) is a truncated vector representation of the RIR h(n). By “true,” we mean a distortion-free situation, i.e., v(n) = 0. The estimation error then decomposes as
e(n) = ē(n) + v(n), (7)
where
ē(n) = d̄(n) − d̂(n) ≈ h_∆^T(n)x(n), (8)
is the true estimation error and h_∆(n) = h(n) − w(n) is the filter misadjustment vector.
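The additive model of (5)-(8) is straightforward to reproduce in simulation; the sketch below uses an assumed exponentially decaying random RIR and white local noise, with all parameter values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128                                                     # truncated RIR length (N = L assumed)
x = rng.standard_normal(16000)                              # reference signal x(n), white for simplicity
h = rng.standard_normal(N) * np.exp(-np.arange(N) / 32.0)   # decaying RIR h(n)
d_bar = np.convolve(x, h)[:len(x)]                          # true acoustic echo, eq. (6)
v = 0.1 * rng.standard_normal(len(x))                       # local additive noise v(n)
d = d_bar + v                                               # observed acoustic echo, eq. (5)
```

Feeding (x, d) to the LMS sketch above exposes the noisy-update behavior analyzed in the next subsection.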
After its introduction in [138], the LMS algorithm was initially applied to the network
echo (or line echo) cancellation (NEC) in the 1960s [119, 116]. A network echo occurs due to
the reflection of electrical signals caused by the impedance mismatch between the two-wire
and the four-wire circuits in the telecommunications network. AEC is much more difficult
to implement than NEC since the acoustic echo is usually longer in length (on the order of
64 to 128 ms or more with the reverberation time of T60 > 200 ms) and highly time-varying
when compared to the network echo (typically less than 64 ms) [40, 9]. Many modifications
have been proposed to make the LMS algorithm robust to deviations from the ideal conditions.
2.1.3 Orthogonality Principle
A major deficiency of the LMS algorithm is that the short-term effect of additive noise on the sample-based gradient estimate is not accounted for. (For simplification purposes, the RIR h(n) is assumed hereafter to be finite in length, represented by a vector h(n) of length N such that d̄(n) = h^T(n)x(n), with N = L assumed if not otherwise specified; to simplify the statistical analysis, all signals are assumed to be generated by zero-mean random processes.)
To begin with, we observe that the LMS algorithm of (1) is actually a stochastic version of the steepest descent algorithm,
w(n + 1) = w(n) − µ∇_w E{|e(n)|²} (9)
= w(n) + 2µE{e(n)x(n)}, (10)
where the direction and the amount of adaptive correction to w(n) depend on the sample estimate of the gradient. (10) converges to the optimal filter coefficients w_opt(n) that minimize the MSE and consequently satisfy the orthogonality principle E{e(n)x(n)} = 0
[49]. The orthogonality principle is illustrated in Figure 4, where a projection of the desired
signal d̄(n), which resides in the L-dimensional reference signal space, onto the reference
signal vector x(n) gives the optimal estimate d̂_opt(n) such that the true estimation error
ē(n) is orthogonal, i.e., uncorrelated, to x(n). The error is modeled to be quadratic in the
space of the parameter w(n), and the adaptation rule is based on the second-order statistics
(SOS).
Figure 4: Effect of additive noise v on the least mean square (LMS) estimation of the optimal filter coefficient vector w_opt. The optimal solution is obtained as long as E{vx} = 0, i.e., E{ex} = E{(ē + v)x} = E{ēx}. However, the LMS algorithm may never converge to w_opt due to the sample-wise estimation of E{ex} by ex = (ē + v)x, where vx = 0 does not necessarily hold.
As Figure 4 also shows, (11) implies that the optimal solution is attainable in principle even when v(n) ≠ 0 as long as v(n) is uncorrelated with x(n) such that E{v(n)x(n)} = 0 and
∇_w E{|e(n)|²} = −2E{e(n)x(n)} (11)
= −2E{ē(n)x(n)}. (12)
However, the orthogonality condition is undermined when the expectation in (11) is estimated sample-wise:
∇_w E{|e(n)|²} ≈ ∇_w |e(n)|² = −2e(n)x(n). (13)
That is, even though the noise is most likely uncorrelated with the reference signal, in which
case the optimal solution should be still attainable since E{e(n)x(n)} = E{ē(n)x(n)} +
E{v(n)x(n)} = E{ē(n)x(n)} (i.e., v(n) and x(n) are uncorrelated on average), such an
ideal condition does not always hold for individual samples that may actually be correlated
in a short period of time such that v(n)x(n) ≠ 0, which translates to noisy updating of
w(n) at each iteration. This causes the LMS algorithm, which has an inherent ability to
eventually converge, with an arbitrary degree of precision depending on the filter length,
to a unique and optimal solution w_opt(n), to have difficulty in converging to the optimal
solution in a noisy situation. Therefore, coupled with the gradient noise, the local noise can
greatly disrupt the LMS adaptation process and contribute to even larger MSE.
2.1.4 Performance Measures
The two conventional performance measures for AEC are the echo return loss enhancement (ERLE),
ERLE (dB) ≡ 10 log10 (E{|d(n)|²} / E{|e(n)|²}) ≈ 10 log10 (Σₙ |d(n)|² / Σₙ |e(n)|²), (14)
i.e., the ratio of energies between the acoustic echo d(n) and the residual echo e(n), and the misalignment,
misalignment (dB) ≡ 10 log10 (‖h_∆(n)‖² / ‖h(n)‖²), (15)
i.e., the ratio of energies between the misadjustment h_∆(n) and the RIR h(n).
which should be as high as possible, measures the MSE performance E{|e(n)|2 }, while the
misalignment, which should be as low as possible, measures the mean square deviation (MSD) of the filter coefficients from the true RIR.
When v(n) ≠ 0, e(n) → v(n) as n → ∞ by (7), and the ERLE defined in (14) converges to the a priori signal-to-noise ratio (SNR),
a priori SNR (dB) ≡ 10 log10 (E{|d̄(n)|²} / E{|v(n)|²}) ≈ 10 log10 (Σₙ |d̄(n)|² / Σₙ |v(n)|²). (17)
Hence the standard definition of the ERLE tends to hide the true echo cancellation per-
formance when there is an additive noise. A more objective MSE performance measure is the true ERLE (tERLE),
tERLE (dB) ≡ 10 log10 (E{|d̄(n)|²} / E{|ē(n)|²}), (18)
i.e., the usual ERLE measured after subtracting out the contributions from the near-end noise.
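The measures (14), (15), and (18) translate directly into code; a sketch follows, noting that d̄(n) and ē(n) are available only in simulation, where the true RIR is known:

```python
import numpy as np

def erle_db(d, e):
    """ERLE of (14): energy ratio of the acoustic echo to the residual echo."""
    return 10 * np.log10(np.sum(d ** 2) / np.sum(e ** 2))

def misalignment_db(h, w):
    """Misalignment of (15): energy ratio of the misadjustment to the RIR."""
    return 10 * np.log10(np.sum((h - w) ** 2) / np.sum(h ** 2))

def terle_db(d_bar, e_bar):
    """True ERLE of (18): near-end noise contributions subtracted out."""
    return 10 * np.log10(np.sum(d_bar ** 2) / np.sum(e_bar ** 2))
```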
2.2 Multi-channel AEC (MCAEC)
2.2.1 Single-channel Solution to MCAEC
Figure 5 illustrates the case of the simplest stereophonic AEC (SAEC) (i.e., two loudspeak-
ers and two microphones) with one talker located at the far end and another located at
the near end. Any adaptive algorithm may be used to minimize the MSE individually
for each near-end microphone channel's error e(n) = d(n) − d̂(n) = (h(n) − w(n))^T x(n), where h(n) = [h_1^T(n), h_2^T(n)]^T, w(n) = [w_1^T(n), w_2^T(n)]^T, and x(n) = [x_1^T(n), x_2^T(n)]^T are formed by concatenating the respective vectors from the two channels. Thus MCAEC is conventionally treated as a direct extension of single-channel AEC.
Furthermore, the reference signals x_1(n) and x_2(n) are highly correlated when they are
produced by a single far-end source. In such a case, the auto-correlation matrix Rxx (n) =
E{x(n)xT (n)} formed by x(n) is very poorly conditioned, which in effect slows down the
convergence rate of an LMS-based adaptive filter. The conditioning of Rxx (n) is improved
naturally when there are two or more active far-end sources, but there is normally at most
one active talker on one end during a conversation. The worst-case scenario of rank-
deficient Rxx (n) occurs when the adaptive filter length L is longer than or equal to the
far-end RIR length M that leads to the non-uniqueness problem, in which case no unique
solution exists for the echo paths h1 (n) and h2 (n) [118].
Figure 5: Model for stereophonic AEC (SAEC). Only two of the four possible echo paths are shown in the figure, not only for simplification purposes but also since the conventional SAEC approach attempts to minimize the mean square error (MSE) for one near-end microphone at a time. A decorrelation procedure is applied to the reference signals before playback and adaptation at the near end to alleviate the non-uniqueness problem.
Since the acoustic impulse response in reality is infinite in length, the uniqueness con-
dition of L < M should automatically be satisfied. Still, severe ill-conditioning of Rxx (n)
occurs when there is only one colored source at the far end. A high level of inter-channel
correlation and the effect of local distortions would then cause the convergence rate of an
LMS-based adaptive algorithm to decrease so much that the solution would appear as if it
is non-unique. Also, the tail effect occurs when the RIR is under-modeled in length (i.e.,
L < N) such that it leads to biased filter coefficients when R_xx(n) is ill-conditioned.
2.2.2 Non-uniqueness Problem
As shown in Figure 5 for the case of SAEC, let s_f(n) be the far-end source signal. For
i = 1, 2, . . . , P , where P is the number of channels, let gi (n) be the ith far-end RIR vector
of length M, s_f(n) be the far-end source signal vector of the same length, x_i(n) = g_i^T(n)s_f(n)
be the ith reference signal, and hi (n) be the ith near-end RIR vector of length N = L (i.e.,
same length as wi (n)). Also, let sn (n) = 0 for the near-end source signal (i.e., no double
talk). Then we can show (under regular assumptions) that the MSE is given by
E{|e(n)|²} = h_∆^T(n) R_xx h_∆(n) (19)
= h_∆^T(n) G(n) R_{s_f s_f} G^T(n) h_∆(n), (20)
where h(n) = [h_1^T(n), . . . , h_P^T(n)]^T, w(n) = [w_1^T(n), . . . , w_P^T(n)]^T, h_∆(n) = h(n) − w(n), x(n) = [x_1^T(n), . . . , x_P^T(n)]^T, R_xx = E{x(n)x^T(n)}, R_{s_f s_f} = E{s_f(n)s_f^T(n)}, and G(n) is a matrix composed of the far-end RIR vectors g_i(n) (refer to [57] for a more detailed presentation). Assuming that R_{s_f s_f} is fully ranked, the optimal solution is unique if and only if L < M. Therefore, the non-uniqueness problem arises when the matrix G(n) is either ill-conditioned or rank deficient. (20) also shows clearly that a change in the far-end mixing matrix G(n) (e.g., when the far-end talker activity switches) requires the adaptive filter to re-converge even when the near-end echo paths remain unchanged.
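The non-uniqueness condition can also be verified numerically: with a single far-end source and L ≥ M, the vector formed by the far-end RIRs lies in the null space of R_xx. The sketch below illustrates this under assumed lengths and random signals:

```python
import numpy as np

rng = np.random.default_rng(1)
M = L = 32                               # L >= M triggers the non-uniqueness condition
s = rng.standard_normal(8000)            # single far-end source s_f(n)
g1, g2 = rng.standard_normal(M), rng.standard_normal(M)
x1 = np.convolve(s, g1)[:len(s)]         # reference signals x_i(n) = g_i^T(n) s_f(n)
x2 = np.convolve(s, g2)[:len(s)]

# Stack x(n) = [x1(n)..x1(n-L+1), x2(n)..x2(n-L+1)]^T and estimate Rxx.
X = np.array([np.concatenate([x1[n:n - L:-1], x2[n:n - L:-1]])
              for n in range(L, len(s))])
Rxx = X.T @ X / X.shape[0]

# u = [g2; -g1] is a null vector: g2 * x1 - g1 * x2 = g2 * g1 * s - g1 * g2 * s = 0.
u = np.concatenate([g2, -g1])
print(np.linalg.norm(Rxx @ u))           # ~0: Rxx is rank deficient, no unique solution
```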
Figure 6 is a simplified illustration of the effect of the far-end mixing system on the
MSE for SAEC when L = N = 1, M = 2, and h(n) = [2, 1]T . The MSE surface becomes
deformed by the ill-conditioning of the far-end mixing matrix G(n), which in effect decreases
the convergence rate of the LMS-based adaptive algorithm (see Figure 6(b)). When G(n)
is rank deficient, the dimension of the solution space of h(n) (or alternatively the null space
of Rxx (n)) is greater than zero such that there is no unique solution (see Figure 6(c)).
For the case of L = M = N ≥ 2, the MSE surface forms a “manifold” and cannot be
visualized easily in a three-dimensional space [32]. Figure 6 can be used to visualize the concept for the higher-dimensional cases.
(a) G(n) = I. (b) G(n) ≠ I but fully ranked. (c) G(n) is rank deficient.
Figure 6: Effect of ill-conditioning of the far-end mixing matrix G(n) on the MSE E{|e(n)|²} for SAEC. The optimal solution is at h = [2, 1]^T for both (a) and (b). However, even when G(n) is fully ranked such that there is a unique solution, the deformation of the MSE surface in (b) causes the convergence rate of the LMS-based adaptive algorithm to go down. The worst-case scenario occurs when G(n) is rank deficient as in (c), for which the solution is a line that goes through h = [2, 1]^T.
Many decorrelation procedures have been proposed to reduce the inter-channel correla-
tion during SAEC (see [9, 40] for references), one of which applies memory-less nonlinearities
to the reference signals before playback and adaptation at the near end [12]. Addition of
random noises to the reference signals has also proven to be effective. However, these tech-
niques come with the cost of degraded signal quality and must be controlled carefully to not
destroy the spatial audio information perceived by the near-end listeners. Other indirect
ways for reducing the inter-channel correlation include the input-sliding technique [125], the
selective-tap adaptive algorithm [64], and the two-stage estimation of the echo path [31],
where all these methods introduce some form of distortion to the estimated echo paths to a varying degree.
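For illustration, one widely used variant of the memory-less nonlinearity approach of [12] adds oppositely signed half-wave rectified copies to the two reference channels; the sketch below is an assumed realization, with α kept small in practice to limit audible distortion:

```python
import numpy as np

def decorrelate_nl(x1, x2, alpha=0.3):
    """Memory-less nonlinear pre-processing of a stereo reference pair."""
    pos = lambda x: 0.5 * (x + np.abs(x))    # positive half-wave rectifier
    neg = lambda x: 0.5 * (x - np.abs(x))    # negative half-wave rectifier
    return x1 + alpha * pos(x1), x2 + alpha * neg(x2)
```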
2.2.3 Coherence Measure
A standard measure for quantifying the amount of correlation between two signals x1(n) and x2(n) is the magnitude-squared coherence (MSC) [20] (or “coherence” in short),
C_{x1x2}(k) ≡ |E{X1(k)X2*(k)}|² / (E{|X1(k)|²} E{|X2(k)|²}), (21)
where X1 (k) and X2 (k) are the discrete Fourier transform (DFT) of x1 (n) and x2 (n),
respectively, and “∗ ” is the complex conjugate operator. It can be shown for SAEC that
the misalignment at each frequency bin k is inversely proportional to 1−Cx1 x2 (k) [36, 63, 62].
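The MSC of (21) is typically estimated by Welch averaging over frames; a SciPy-based sketch follows, where the sampling rate and segment length are illustrative assumptions:

```python
import numpy as np
from scipy.signal import coherence

def inter_channel_msc(x1, x2, fs=16000, nperseg=2048):
    """Welch estimate of the magnitude-squared coherence of (21)."""
    f, C = coherence(x1, x2, fs=fs, nperseg=nperseg)
    return f, C    # C close to 1 indicates strong inter-channel correlation
```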
2.3 Source Separation
2.3.1 Blind Source Separation (BSS)
BSS refers to a signal enhancement procedure that attempts to recover the original source
signals when only the observations of some mixture of the signals are available. The simplest
scenario is when the mixing system is linear and instantaneous as indicated in Figure 7.
That is, given the mixing matrix A(n) and the source signal vector s(n), the observed mixture is x(n) = A(n)s(n), and the goal is to find a de-mixing matrix W(n) such that y(n) = W(n)x(n) recovers s(n). The matrix W(n) can be found up to arbitrary permutation and scaling of rows as long as
the original sources are statistically independent from one another and A(n) is an invertible
square matrix [56]. In general, a unique solution still exists if the number of sources Q is
less than the number of sensors P , i.e., the over-determined case. The under-determined
case of Q > P is a difficult problem and is out of the scope of this research. Without loss of generality, Q = P is assumed in the following.
However, the acoustic mixing systems in real life are convolutive by nature and not
instantaneous. To overcome such an obstacle, the separation is performed in the DFT domain, where the convolutive mixture in the time domain becomes approximately an instantaneous mixture at each frequency in the frequency domain:
y(k, l) = W(k)x(k, l), (24)
where k is the frequency index and l is the block index in time. A major drawback is
that the permutation ambiguity exists at each frequency, and the separated components
must somehow be re-aligned and associated with the correct sources for all frequencies. There is
Figure 7: Model for blind source separation (BSS) consisting of the mixing system with Q source signals followed by the de-mixing system with P sensors.
much ongoing research on the permutation alignment problem in the frequency domain.
One set of approaches consists of methods based on the time difference of arrival (TDOA)
or direction of arrival (DOA) [102, 88]. Another set consists of those based on the inter-
frequency magnitude correlation and the spectral continuity across frequency, both of which
apply directly to the speech signal [98, 87]. The scaling ambiguity is much simpler to solve
than the permutation ambiguity and can be corrected through either the projection-back
processing [86] or the minimum distortion principle (MDP) [78]. Good coverage of the
ambiguity problems and other techniques associated with BSS is provided in [73].
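In code, the frequency-domain separation of (24) amounts to an STFT, a per-bin matrix multiply, and an inverse STFT; a sketch follows, where the STFT parameters are assumptions and the permutation/scaling corrections discussed above are omitted:

```python
import numpy as np
from scipy.signal import stft, istft

def separate(x, W, fs=16000, nperseg=2048):
    """x: (P, samples) mixtures; W: (K, Q, P) de-mixing matrices, K = nperseg//2 + 1."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)     # X: (P, K, L) spectra x(k, l)
    Y = np.einsum('kqp,pkl->qkl', W, X)           # y(k, l) = W(k) x(k, l), eq. (24)
    _, y = istft(Y, fs=fs, nperseg=nperseg)       # back to the time domain
    return y
```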
2.3.2 Independence Maximization via Independent Component Analysis (ICA)
A very powerful and effective approach for estimating the de-mixing matrix W(n) when the sources are non-Gaussian is ICA, which utilizes the higher-order statistics (HOS) [24]. The objective is to maximize the statistical independence between the separated output signals.
A standard adaptive algorithm used for ICA optimization is the natural gradient (NG) algorithm,
W_{m+1}(k) = W_m(k) + µ [I − E{φ(y(k, l))y^T(k, l)}] W_m(k), (25)
where m is the iteration index, µ is the step-size, I is the identity matrix, and φ(·) is a
memory-less nonlinearity commonly referred to as the score function that depends on the
source's probability density function (PDF). (25) converges when E{φ(y(k, l))y^T(k, l)} = I, i.e., the off-diagonal terms of the nonlinear cross-correlation matrix E{φ(y(k, l))y^T(k, l)}
become zero. The process is similar to principal component analysis (PCA) that uses the
SOS to obtain decorrelated components, and (25) can be interpreted as performing a nonlinear decorrelation of the separated outputs.
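One NG iteration of (25) for a single frequency bin can be sketched as below; the sign-type score function φ(y) = y/|y|, often used for super-Gaussian sources such as speech, and the step-size are assumptions, and the Hermitian transpose replaces the ordinary transpose for complex spectra:

```python
import numpy as np

def ng_update(W_k, Y_kl, mu=0.1):
    """One NG iteration of (25) at frequency bin k.
    W_k: (Q, Q) de-mixing matrix; Y_kl: (Q, L) separated spectra y(k, l) over L blocks."""
    L = Y_kl.shape[1]
    phi = Y_kl / (np.abs(Y_kl) + 1e-9)        # assumed score function phi(y) = y / |y|
    R = (phi @ Y_kl.conj().T) / L             # sample estimate of E{phi(y) y^H}
    return W_k + mu * (np.eye(W_k.shape[0]) - R) @ W_k
```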
The advantage of ICA over PCA is illustrated in Figure 8 [68]. By maximizing the
variance in the direction of the principal components, thereby minimizing the correlation
between the unmixed signals, PCA is able to recover the best estimate of the original sig-
nals. However, the scatter plots show that while PCA is able to “sphere” to decorrelate, it
is unable to “rotate” to match the directions of the latent components. This is because the
uniform distribution requires not just the SOS but also the HOS for its complete character-
ization. On the other hand, ICA is capable of achieving much better separation than PCA
for non-Gaussian signals (e.g., speech) by taking into account all the HOS. The disadvan-
tage is that then at most one of the source signals can be Gaussian distributed for ICA to
be effective [56].
Figure 8: PCA versus ICA. The original source signals s1 and s2 were generated by two independent uniform distributions [68].
There are several ways to derive the NG algorithm (see [68, 56] for references). One
approach is to maximize the transfer of statistical information during the separation pro-
cess, i.e., the infomax or equivalently the maximum-likelihood (ML) approach [6]. Another
way is to minimize the mutual information (MI) or equivalently to maximize the statisti-
cal independence between the separated components [68]. Differences in the geometrical
interpretation between the ML and the MI approaches are presented in [73, Chapter 6].
2.3.3 Semi-Blind Source Separation (SBSS)
It is possible to perform AEC through BSS by treating the acoustic echoes captured by microphones as some of the source signals to be separated as long as there are enough microphones [103]. However, the loudspeaker (i.e., the far-end) signals are already known, so treating them as completely unknown sources wastes the available information.
A better approach is to cast the MCAEC problem into one system framework consisting
of BSS and AEC in terms of SBSS. Figure 9 illustrates a model for SBSS-based SAEC. The
subscripts “n” and “f ” denote “near end” and “far end,” respectively. The far-end and the
near-end linear mixing systems can be combined together systematically in the frequency
domain as
H(k, l) = [ H11(k, l)  H12(k, l)G(k, l) ; 0  G(k, l) ] (26)
for the kth frequency bin and the lth time block such that
[ x_n(k, l) ; x_f(k, l) ] = [ H11(k, l)  H12(k, l)G(k, l) ; 0  G(k, l) ] [ s_n(k, l) ; s_f(k, l) ], (27)
where
[ x_n(k, l) ; x_f(k, l) ] = DFT{ [ x_n(n) ; x_f(n) ] } (28)
is a 4-by-1 vector formed by the DFT of the near- and far-end microphone signal vectors x_n(n) and x_f(n),
[ s_n(k, l) ; s_f(k, l) ] = DFT{ [ s_n(n) ; s_f(n) ] } (29)
is a 4-by-1 vector formed by the DFT of the near- and far-end source signal vectors s_n(n) and s_f(n),
H11(k, l) = DFT{H11(n)} (30)
is the 2-by-2 near-end response matrix between the near-end sources and microphones,
H12(k, l) = DFT{H12(n)} = DFT{ [ h13(n)  h14(n) ; h23(n)  h24(n) ] } (31)
is the 2-by-2 near-end response matrix between the near-end loudspeakers and microphones, and G(k, l) is the 2-by-2 far-end mixing matrix between the far-end sources and microphones.
Figure 9: Model for SBSS-based SAEC, consisting of the near-end and far-end mixing systems (with H22(n) = I) followed by the de-mixing system, where a decorrelation processing procedure may be applied to the far-end reference signals before playback.
Then the objective is to find the 4-by-4 de-mixing matrix
W(k, l) = [ W11(k, l)  W12(k, l) ; 0  W22(k, l) ] (33)
such that
[ y_n(k, l) ; y_f(k, l) ] = [ W11(k, l)  W12(k, l) ; 0  W22(k, l) ] [ x_n(k, l) ; x_f(k, l) ] ≃ [ s_n(k, l) ; s_f(k, l) ], (34)
where W(k, l) can be estimated by using the NG algorithm of (25). The optimal solution
opt opt
(W11 (k, l)H12 (k, l) + W12 (k, l))G(k, l) = 0 (36)
and
opt opt
H12 (k, l) = −W11 (k, l)−1 W12 (k, l) (37)
We can show through matrix manipulations that the sub-matrix W11 (k, l) is responsible
for source separation while W12 (k, l) performs echo cancellation, where the two matrices
are linked together with the near-end response matrix H12 (k, l), which represents the echo
paths, through (37). More specifically, the BSS is performed such that W11 (k, l)H11 (k, l)
≃ I, and the estimation error from the AEC is obtained by E(k, l) = xn (k, l) − W11 (k, l)−1
W12 (k, l)xf (k, l). (36) also indicates that just as in the traditional MCAEC framework, a
change in the far-end room response G(k, l) disrupts the adaptation of the echo cancellation
The SBSS framework was first proposed in [60] and applied successfully as a combination
of BSS and single-channel AEC in [80] without double-talk detection, where such noise-
robustness is naturally derived from the ICA optimization. We observed in [135] that
24
a decorrelation procedure on the far-end reference signals can indeed improve the SBSS
performance for SAEC. On the other hand, we have also observed that a relatively high
SAEC performance is still achievable without the decorrelation procedure if W22 (k, l) is
There are many open issues yet to be solved for applying the ICA-based SBSS algorithm
to real-time situations due to its computational complexity and slow convergence character-
istics. Still, the approach is naturally suited for general signal enhancement purpose with
multiple signals. The main difference of such an approach from the traditional MCAEC
framework besides the use of the HOS over the SOS is that it is a truly multi-channel ap-
proach [89]. Although multiple loudspeakers and microphones are used, the conventional
MCAEC is inherently a single-channel approach and thus is ill-suited when there are many
interfering signals, some of which are often not directly observable, e.g., local speech and
background noise.
An important question is whether or not tERLE > a priori SNR is possible after the con-
vergence of the LMS algorithm, i.e., E{|ē(n)|2 } < E{|v(n)|2 } as n → ∞. The steady-state
convergence analysis in terms of the system distance for the regularized NLMS algorithm
x(n)
w(n + 1) = w(n) + µe(n) kx(n)k2 +δ , where δ > 0 is the regularization parameter, is summa-
25
Thus since E{|ē(n)|2 } ≈ σx2 E{kh∆ (n)k2 } according to (8) for white x(n) and uncorrelated
x(n) and h(n), achieving the echo cancellation below the average noise level is generally
We note that the above analysis holds when assuming white signals. In reality for a
highly correlated signal such as speech in a complex acoustic environment, low misalignment
is only a sufficient and not a necessary condition for high ERLE or tERLE. Misalignment
determined in the time domain as defined by (15) is purely a measure of the system identifi-
cation performance. A meaningful measure for the overall cancellation performance should
practically be the tERLE for the linear distortion case and the ERLE for the nonlinear
distortion case, and the two measures do not have to directly correlate with the MSD
performance.
Figure 10 illustrates the amount of distortion created by the GSM AMR speech codec [58]
on a female speech. The figure shows that a speech coding distortion is directly proportional
to the signal level and the bit-rate and that it starts to increase exponentially beyond 20 dB
signal attenuation. The figure also indicates that the distortion is quite significant even if
it is perceptually unnoticeable.
Figure 11 was obtained from simulating the loudspeaker saturation by using a saturation-
1 − exp(−x/ρ)
ϕ(x) = , −1≤x≤1 (41)
1 + exp(−x/ρ)
for input signal x and some saturation parameter ρ > 0, i.e., smaller ρ gives more compres-
sion. The figure shows that while a mild loudspeaker nonlinearity would cause the ERLE
to go down significantly, it is a speech coding distortion on the acoustic echo signal that
Results from other simulations we performed in [134] showed that while the NLMS
volume level, block-based algorithms such as FBLMS and RLS are unable by themselves
26
16.5
16
15.5
15 12.2
10.2
14.5 7.95
SDR (dB)
7.4
14
6.7
13.5 5.9
5.15
13 4.75
12.5
12
11.5
0 5 10 15 20 25 30 35 40
Signal Loss (dB)
Figure 10: GSM AMR speech coding distortion, measured in terms of the signal-to-distortion ratio
(SDR), versus input speech magnitude, measured in terms of the signal loss, for various bit-rate
(kbps).
30
No Codec
12.2 kbps
25 7.4 kbps
4.75 kbps
20
ERLE (dB)
15
10
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7
ρ
Figure 11: Echo return loss enhancement (ERLE) from the network-based AEC (using the FBLMS
algorithm) as a function of the loudspeaker saturation parameter ρ (smaller ρ indicates greater
saturation) and the GSM AMR bit-rate.
27
to adequately handle the effect of either the loudspeaker saturation or the speech coding
distortion on the acoustic echo at any volume level. Figure 11, which was obtained from the
FBLMS algorithm, not only indicates the severity of speech coding distortion but also the
issues on how the nonlinearities should be treated with respect to an adaptive algorithm,
Nonlinear AEC (NAEC) is used when the LEMS can be modeled by a Hammerstein
system to decouple the echo path into a cascade of nonlinear response, e.g., the loudspeaker
saturation, and linear response, i.e., the RIR itself. Main approaches to NAEC include using
a nonlinearity with memory, represented by the polynomial Volterra filter [123, 43] or by
neural network [15], in a Hammerstein system to compensate for the nonlinear response.
Other recent techniques consist of orthogonalized power filter with parallelized nonlinear
modeling of the echo path [66], pre-distortion of the reference signal to linearize the echo
path prior to the excitation of the LEMS [109], and optimal control of RES to remove
the remaining nonlinear echo component after AEC [67, 108]. More detailed treatment of
We showed in [100] that a mismatch between two signals x1 (n1 ) and x2 (n2 ) with different
procedure in the time domain instead of the conventional approach of upsampling followed
by downsampling. That is, if the signals are band-limited, T1 ≈ T2 , and the amount
function can be applied to x1 (n1 ) to match its sampling rate to that of x2 (n2 ), i.e.,
P
X
x̂1 (n2 ) ≈ x1 (n1 ) sinc(t1 − n1 ), (42)
n1 =−P
where 2P + 1 is the filter order and t1 = n2 T2 /T1 is the interpolated sample position in
continuous time. The computational cost is further reduced by re-using the interpolation
filter coefficients for a fixed ť1 = t1 − [t1 ], where [·] is the greatest integer function (i.e.,
28
flooring function), to determine a block of interpolated samples for ň2 ≤ n2 ≤ ň2 + B − 1,
Table 2 illustrates the variations in the actual sampling rate of portable audio capturing
devices (cell phones, PDAs, laptops). The table shows that a mismatch can be over 0.5% of
the nominal sampling rate. Figure 12 illustrates the effect of the sampling rate mismatch on
AEC before and after the mismatch correction of (42) using P = 64 and various re-use block
length B. White Gaussian noise (WGN) was added at 40 dB signal-to-noise ratio (SNR)
to a simulated acoustic echo of 200 ms in length to create a more realistic acoustic mixing
condition. The figure shows that a mere 100 ppm mismatch (0.01%) in the sampling rate
can decrease the ERLE by 10 dB. Only about 5 dB ERLE would then be possible without
the sampling rate correction for Device 5 and 6 in Table 2, for which the re-use block size
Table 2: Sampling rate mismatch, measured in terms of parts per million (ppm), for six portable
audio capturing devices with the nominal sampling rate of 16 kHz.
25
20
B = 16
B = 64
ERLE (dB)
B = 128
15
B = 256
B = 512
No Correction
10
5
0 1000 2000 3000 4000 5000 6000
Rate Mismatch (ppm)
Figure 12: ERLE when there is a sampling rate mismatch between the reference signal and the
acoustic echo. The correction was made through re-sampling by interpolation in the time domain
by using the coefficients re-use block size of B.
29
CHAPTER III
The least mean square (LMS) algorithm is a popular choice for acoustic echo cancellation
(AEC) due to its computational simplicity and adaptability. The filter error e(n) is modeled
to be quadratic in the space of the filter coefficients vector w(n), and the adaptation rule
is based on the second-order statistics (SOS). The LMS algorithm has a natural ability to
eventually converge to a unique and optimal solution that minimizes the mean square error
where x(n) is the reference signal. However, along with the gradient noise, local distor-
tions (e.g., double talk, speech codec) can greatly disrupt the LMS adaptation process and
There is ultimately a need for a residual echo suppressor (RES) based on classical signal
enhancement, namely the Wiener filtering, to attenuate the remaining cancellation error
to an acceptable level [46], i.e., the reduction of ē within e. Similar in spirit to the joint
statistical adaptation of AEC and RES studied in [45], we will motivate in this chapter
the idea of system approach to achieving the robust signal enhancement performance. The
to mutually support one another and achieve the robustness of the entire AEC system
rather than focusing simply on the AEC adaptive filtering component itself. Specifically,
we postulate that if an adaptive filter is capable of converging to the optimal solution, then
the adaptation process can be assisted if the effects of disruption are removed before the filter
coefficients are updated, i.e., the reduction of v within e prior to the filter adaptation. In
other words, by including a signal enhancement procedure not just for the post-enhancement
purpose to further reduce the residual echo but within the feedback-loop of an adaptive filter
to enhance the error, the linear portion of the impulse response should be better estimated
30
We investigated in [133] the new “system” perspective to applying an instantaneous
memoryless nonlinearity that performs the signal enhancement of the residual echo in order
to increase the robustness of the AEC in a continuously noisy environment [128, 129, 130].
We will hereafter refer to such a nonlinearity as the error recovery nonlinearity (ERN).
The procedure, represented schematically in Figure 13, is summarized by the LMS update
equation
where the ERN f (·) is applied to the filter estimation error e(n). We refer to such a
in general. The use of an error nonlinearity to modify the convergence behavior of the
LMS algorithm has been addressed on a number of occasions in the past: [117, 30, 76,
14, 104, 28, 2], just to name a few. Although the same signal-enhancement approach to
“enhance” the residual echo was recently studied also by others in [114], we are to the best
of our knowledge the first to propose and demonstrate the effectiveness of the technique for
AEC in [128, 129, 130]. The idea is akin to the Bussang technique for blind equalization
[48], where the system inversion without the reference signal is made possible through the
iterative application of a linear adaptive filter, which deconvolutes the observed signal,
followed by a memoryless nonlinearity, which reduces the modeling error (referred to as the
x(n)
Adaptive
Filter h(n)
Nonlinearity f (e(n)) ˆ
_ d(n)
+ Linear/Nonlinear
e(n) d(n) Distortion ¯
d(n)
Figure 13: Adaptive filtering with linear or nonlinear distortion on the true, noise-free response
¯
d(n). A noise-suppressing memoryless nonlinearity f (·) is applied to the observed filter estimation
error e(n) in the feedback-loop in order to suppress the effects of the distortion during the adaptation
of a linear filter.
31
error in such a manner is similar to minimizing the MSE or the mean square deviation
(MSD) E{kh∆ (n)k2 }, where h∆ (n) is the misalignment vector, of LMS-type algorithms.
Furthermore, the ERN can also be viewed as a function that controls the step-size µ
for sub-optimal conditions reflected in the error statistics when the signals are no longer
Gaussian distributed, as most often is the case in reality, and it may be combined with
other existing noise-robust schemes to improve the overall performance of the LMS algo-
rithm. More importantly, we have shown in [131] that the technique is well-founded in an
information-theoretic sense and has a deep connection to the algorithms based on indepen-
dent component analysis (ICA) [56], e.g., blind source separation (BSS) that maximizes the
independence of output signals and allows the recovery of a target signal among interfering
signals even when the source signals are unknown. In fact, the single-channel AEC problem
is a special case of semi-blind source separation (SBSS) for which one of the source signals
is partially known, i.e., the far-end reference signal that becomes the acoustic echo. With
system view of the signal enhancement problem in general. Often times the constrained
physical world. By placing all possible signals from both the near and the far end, their prior
source information, and the system components together into a proper perspective, robust
performance is achieved in a very complex, real-world setting with non-ideal acoustic mix-
ing conditions. Moreover, if adapted appropriately, the system approach alleviates the need
for precise tracking of the room impulse response (RIR) or the signal statistics to obtain
adaptation (BIA), which is related to the traditional data reuse and batch adaptation pro-
BSS systems developed previously by many others and will be illustrated here in the scope
the independence between the involved signals leads naturally to the ERN and offers a
32
refreshingly new view of the conventional AEC problem. Our proposed approach in this
chapter brings seemingly separate signal enhancement techniques under one roof and ties
The rest of this chapter is organized as follows. First in Section 3.1, we review the
derive and analyze the ERN for REE in detail; show the REE technique’s connections to
the traditional AEC procedures, ICA, and SBSS; and discuss the system approach to the
AEC problem and the BIA procedure. Finally in Section 3.3, we provide the experimental
results from REE when dealing with linear and nonlinear distortions.
The normalized least mean square (NLMS) algorithm [139] is preferred over the LMS algo-
rithm since the real-world acoustic signals involved in communications are rarely stationary,
which leads to the gradient noise during the LMS adaptation. The algorithm can be derived
through the principle of minimal disturbance, which states that “from one iteration to the
next, the weight vector of an adaptive filter should be changed in a minimal manner, subject
In the NLMS algorithm, the LMS filter coefficients adaptation rule is “normalized” by
x(n)
w(n + 1) = w(n) + µe(n) , (44)
kx(n)k2
where k · k is the Euclidian norm. By such a normalization procedure, the NLMS algorithm
becomes invariant to the scaling of the reference signal, and its convergence kw(n + 1) −
Just as with the LMS algorithm, the NLMS algorithm is still susceptible to the effect of
distortions on the acoustic echo. A simple procedure for improving the convergence behavior
in the presence of a local additive noise v is to normalize the reference signal vector x with
33
kxk2 + δ for δ > 0 [46] to regularize, or stabilize, the adaptation when kxk ≈ 0:
x(n)
w(n + 1) = w(n) + µe(n) . (45)
kx(n)k2 + δ
While the regularization parameter δ can be maintained large enough to keep the NLMS
algorithm from diverging when the signal-to-noise ratio (SNR) between x and v is very
small, there is no explicit guidance on how to control the parameter precisely in practical
Many adaptive, or variable, step-size algorithms were developed not only to find the
optimal step-size µ to improve the stability of the LMS and the NLMS algorithms in the
presence of v, e.g., [13, 77, 26, 1, 71, 5, 65, 74], but also to specifically reduce the effect of
v on the adaptive algorithms, e.g., [52, 97, 91, 50, 124]. The first set of approaches keeps
µ as large as possible during early stages of adaptation for fast convergence and reduces it
when the true error ē or the misalignment becomes small as the algorithm converges and
the effect of v increases. The second set of approaches generally controls µ as a function of
the signal statistics, where µ is adjusted to be large at high SNR and small at low SNR.
Most of the adaptive step-size techniques listed above except for [124] are designed for
ambient and continuous background noise. They do not work as well during the double
talk situation when the local noise is dominated by a near-end speech signal, which can
be highly non-stationary and “bursty” (i.e., occurs suddenly with large energy). There
are frequency-domain adaptive step-size algorithms designed for dealing with double talk
[112, 126, 127, 114]. Still, distinguishing the near-end speech from the far-end speech in
the observed, noisy acoustic echo in order to estimate the SNR is a very complicated task,
A traditional and practical solution is to stop the adaptation completely during double
comparison between the available signals, e.g., the reference signal (the far-end microphone
signal), the near-end microphone signal, the estimated acoustic echo, and the residual echo
34
in terms of the magnitude ratio [29] (the Geigel algorithm), the cross correlation (or co-
herence) [145, 39, 10, 35, 61, 37, 59], or the mutual information [107, 106]. The two-path
models approach, which uses a “background” filter only for adaptation and a “foreground”
filter for actual cancellation, can also be utilized to overcome the sporadic disturbances from
a local speech without entirely freezing the adaptation process [92, 111, 25]..
nonlinearity to the estimation error to limit the sudden jump in the error that may occur
when a DTD fails to detect the event initially [33, 38, 7, 18, 113]. The technique is based
on the robust statistics theory [54], which provides the means to minimize the influence of
the outliers on a statistical estimation procedure. There is a trade-off between the false
alarm, which unduly stops the adaptation and retards the convergence rate, and the missed
detection, which may lead to the adaptation instability. A logical solution for handling
the two-sided problem is to adjust the DTD threshold level accordingly to decrease the
false alarm rate at the cost of increased missed detections, in which case the effect of the
Residual echo enhancement closely follows the Bussang technique for blind equalization or
deconvolution:
where y(n) is a vector of the observed (convoluted) signal y(n) = h(n) ∗ x(n), z(n) =
w(n) ∗ y(n) ≈ wT (n)y(n) is the filtered (deconvoluted) output, and φ(·) is a memoryless
nonlinearity such that φ(z(n)) = x̂(n) is an estimator of the original, unobservable signal
x(n). The system equalization without any training signal is made possible through the iter-
ative application of a linear adaptive filter, which estimates the inverse filter w(n), followed
by the nonlinearity, which enhances the filtered signal z(n). Specifically, a decomposition
35
of z(n) reveals that
where h−1 (n) is the inverse impulse response, i.e., h−1 (n) ∗ h(n) = 1, and v(n) = (w(n) −
h−1 (n)) ∗ h(n) ∗ x(n) is the convolutional noise. Hence by reducing the effect of v(n) in z(n)
through the application of the nonlinearity φ(·), the estimation error e(n) = z(n)−φ(z(n)) =
z(n) − x̂(n) necessary for the filter adaptation can be obtained blindly.
Assuming the noise v(n) is statistically independent from the true estimation error ē(n), the
as possible from the observed noisy error e(n) = ē(n) + v(n). Hereafter, the sample-wise
signal enhancement is assumed, and the time index n may be dropped for simplification in
notation.
As done commonly in the signal enhancement community and also in [48] for the Bussang
technique, the Bayesian parameter estimation procedure [99] is utilized to perform the error
enhancement. Namely, the best estimate of ē can be found through either the minimum
procedure 1 (the MAP estimate may also be referred to as the maximum likelihood (ML)
estimate [42, 55] 2 ). At the heart of the Bayesian estimation is the conditional probability
1
For the random variables that consist of the unknown parameter θ, the observation y, and the estimate
θ̂(y), the MMSE procedure minimizes the risk r(θ̂) = Eθ {C[θ̂(y), θ]} for the quadratic cost C[θ̂, θ] = (θ̂ − θ)2 ,
whereas the MAP procedure minimizes the risk for the uniform cost C[θ̂, θ] = 1, if |θ̂ − θ| > ∆ for ∆ > 0,
and C[θ̂, θ] = 0, otherwise. There is another approach not covered here called the minimum mean-absolute
error (MMAE) procedure that minimizes the risk for the absolute cost C[θ̂, θ] = |θ̂ − θ|.
2
Generally in statistics, the ML estimate is defined to be a special case of the MAP estimate with the
prior probability p(θ) = 1 such that it maximizes the likelihood p(y|θ). The ML procedure defined as such
is not covered here since it gives the trivial solution θ̂ = y when used as a sample-wise estimator. Some
procedures may be referred to as being ML instead of MAP even though the prior probability is not explicitly
set equal to 1 since they still maximize the joint density (or “likelihood”) p(y, θ) = p(y|θ)p(θ).
36
given by the Bayes formula
pe|ē (e|ē)pē (ē)
pē|e (ē|e) = R ∞ . (48)
−∞ pe|ē (e|ē)pē (ē)dē
The MMSE estimate is obtained by minimizing the expectation of the residual E{(ē− ēˆ)2 |e}
with respect to the estimate ēˆ conditioned on the observation e, resulting in ēˆ = E{ē|e}:
Z ∞
fMMSE (e) = ē pē|e (ē|e)dē. (49)
−∞
The MAP estimate is obtained by maximizing (48) directly, which is equivalent to maxi-
∂ p ′ (s)
φs (s) = − log ps (s) = − s (51)
∂s ps (s)
for some probability density function (PDF) ps , where “ ′ ” is the derivative operator. (51)
measures the relative rate at which ps changes at a value s. Let us consider three random
variables s, t, and u, where s = t + u to reflect the additive distortion model. If the PDF
of t is zero-mean Gaussian
1 −t2
pt (t; σt ) = p exp , (52)
2πσt2 2σt2
If u is from any probability distribution, then it can be shown by the convolution theorem
for the PDF of the sum of independent random variables that (see Appendix A)
R∞ 2
−t
1 −∞ t p u (s − t) exp 2σt2
dt
φs(u) (s; σt ) = 2 R , (54)
σt ∞ pu (s − t) exp −t22 dt
−∞ 2σt
where “s; σt ” and “s(u)” indicate explicit dependence of s on t (which is integrated out)
Laplacian
1 −|u|
pu (u; αu ) = exp , (55)
2αu αu
37
denoted hereafter by u ∼ Laplacian(0, αu ), then
where χ = s/σt and ψ = s/αu are the scaled signals, ξ = σt2 /α2u is the SNR (in a general
sense), erfcx(r) = exp (r 2 )erfc(r) is the scaled complementary error function, and erfc(r) =
R∞
√2 exp (−r̄ 2 )dr̄ is the complementary error function. (57) then leads to
π r
ξ−ψ ξ+ψ
1 erfcx √
2ξ
− erfcx √
2ξ
φs (s; σt , αu ) = . (58)
αu erfcx ξ−ψ
√ + erfcx √ξ+ψ
2ξ 2ξ
The PDFs (52), (55), (57), and the score functions (53), (56), (58), are plotted in Figure 14
for σt = αu = 1.
0 4
−2
2
log(p(x))
−4
φ(x)
0
−6 p φs
s
pt −2 φt
−8
pu φu
−10 −4
−4 −2 0 2 4 −4 −2 0 2 4
x x
Figure 14: PDFs (scaled by log(·)) and score functions of s, t, and u when s = t + u, t ∼
Gaussian(0, 1), and u ∼ Laplacian(0, 1). For |x| < 2, φs (x) exhibits approximately linear scaling
by a factor of 0.5 that accounts for relatively equal contributions from t and u for small observed
(noisy) magnitude |s|. For |x| > 2, φs (x) → φu (x) very rapidly as |x| → ∞ since the kurtosis is
higher for u than t, i.e., the probability of u is greater than that of t for large observed magnitude
|s|.
Specific forms of the ERN obtained from the MMSE and the MAP estimations are as
follows (see Appendix A for the derivation). First, if ē ∼ Gaussian(0, σē ) and v is any
38
GA
fMAP (e) = {ēˆ : ēˆ = σē2 φv (e − ēˆ)}, (60)
where the superscripts “G” and “A” indicate “Gaussian” error ē and “any” type of distortion
GL
fMMSE (e) = σē2 φe (e; σē , αv ). (61)
sign(e)σ 2 /αv if |e| ≥ σ 2 /αv ,
GL ē ē
fMAP (e) = (62)
e if |e| < σē2 /αv .
AG
fMMSE (e) = e − σv2 φe(ē) (e; σv ), (63)
AG
fMAP (e) = {ēˆ : ēˆ = e − σv2 φē (ēˆ)}. (64)
LG
fMMSE (e) = e − σv2 φe (e; σv , αē ), (65)
e − sign(e)σ 2 /αē if |e| ≥ σ 2 /αē ,
LG v v
fMAP (e) = (66)
0 if |e| < σv2 /αē .
GG GG σē2
fMMSE (e) = fMAP (e) = e, (67)
σē2 + σv2
which is the well-known Wiener scaling rule used often by the traditional frequency-domain
signal enhancement techniques. The nonlinearities (61), (62), (65), (66), and (67) are
The center-clipping (CC) and coring (COR) nonlinearities, used widely for image and audio
39
4 4
2 2
f (e)
f (e)
0 0
−2 −2
−4 −4
−4 −2 0 2 4 −4 −2 0 2 4
e e
(a) (61) for σē = αv = 1. (b) (62) for σē = αv = 1.
4 4 4
2 2 2
f (e)
f (e)
f (e)
0 0 0
−2 −2 −2
−4 −4 −4
−4 −2 0 2 4 −4 −2 0 2 4 −4 −2 0 2 4
e e e
(c) (65) for αē = σv = 1. (d) (66) for αē = σv = 1. (e) (67) for σē = σv = 1.
Figure 15: Realization of noise-suppressing nonlinearities (61), (62), (65), (66), and (67) obtained
through the MMSE and the MAP estimation procedures when the observed adaptive filter estimation
error is modeled additively as e = ē + v.
40
e if |e| ≥ kcor ,
fCOR (e) = (69)
sign(e) |e/kcor |η kcor if |e| < kcor ,
for kcc > 0, kcor > 0, and η ≥ 1. (68) and (69) are plotted in Figure 16(a) and 16(b),
4 4
2 2
f (e)
f (e)
0 0
−2 −2
−4 −4
−4 −2 0 2 4 −4 −2 0 2 4
e e
(a) (68) for kcc = 1. (b) (69) for kcor = 3, η = 2.
(67) can be interpreted as the “mid-point” of (59) and (63), where (59) → (61) and (63) →
(65) when pv for (59) and pē for (63) become Laplacian-like, i.e., has higher kurtosis than
the Gaussian PDF. For example, Figure 14 illustrates that when the kurtosis is higher for
v than ē, φe tends to represent φv for large |e| at which the probability of v is higher than
that of ē, and re-scaling the output of φe by σē2 results in the best estimate of ē. On the
other hand, the opposite effect takes place for (65) when the kurtosis is higher for ē than
v, in which case φe is more likely to represent φē instead of φv , and re-scaling the output
of φe by σv2 and then subtracting the result from e consequently gives the best estimate of
ē. Using the Laplacian PDF for both ē and v gives the nonlinearity similar in form to (61)
if αē < αv and to (65) if αē > αv , whereas the nonlinearity becomes identical to the linear
Wiener scaling of (67) if αē = αv [75]. Whether or not the overall form is “compressive”
the score function φe that depends on the degree of non-Gaussianity of pē or pv and on the
SNR.
41
Similar analysis applies to the MAP nonlinearities (60), (62), (64), and (66), which are
the limits of the MMSE nonlinearities, i.e., (61) → (62) and (65) → (66) asymptotically
as |e| → 0 or ∞. Thus (61) and (65) may be approximated by (62) and (66), respectively,
for large |e| at which the MMSE nonlinearities are numerically unstable due to the division
of pe′ by pe . Moreover, (59) → (60) and (63) → (64) as n → ∞ since φe (e) → φv (e) as
the LMS algorithm dynamically reduces ē and since the estimates ēˆ and v̂ are related by
e = ēˆ+ v̂. The limiting behavior is consistent with the previous observation that the general
form of the nonlinearity is governed by the Gaussianity of pē and pv . Also, it implies that
the overall amount of scaling performed by the nonlinearity depends ultimately on the SNR.
For example, the effect of (62) can be interpreted as multiplying σē by σē /αv to get the best
estimate of ē above the threshold σē2 /αv , whereas the same interpretation holds for (66), in
which case the ratio of and the threshold of interest are σv /αē and σv2 /αē , respectively.
Since (67) is a linear estimator of ē(n) by definition is not a nonlinearity. However, for a
constant σv2 , the effect of (67) is to scale down e(n) when ē(n) is small (i.e., σē2 /(σē2 +σv2 ) → 0
as σē2 → 0) and leave e(n) as it is when ē(n) is large (i.e., σē2 /(σē2 + σv2 ) → 1 as σē2 → ∞).
Thus (67) dynamically emulates a coring nonlinearity for the case of ambient background
noise [128].
When the observed adaptive filter error is modeled as e = ē + v, where ē is the true, zero-
mean Gaussian-distributed error, the optimal error nonlinearity for the LMS algorithm that
which is simply the score function (51) in terms of e. Since (70) is equal to (59) for zero-
mean standardized Gaussian-distributed ē (i.e., σē = 1) or σē2 in (59) may be absorbed into
the step-size µ of the LMS algorithm, such an ERN is optimal in the LMS-sense for any
42
3.2.2.2 Adaptive Step-Size Procedure
The optimal adaptive step-size for the NLMS algorithm that achieves the largest decrease
E{|ē(n)|2 } E{|ē(n)|2 }
µopt (n) = = (71)
E{|e(n)|2 } E{|ē(n)|2 } + E{|v(n)|2 }
when ē and v are uncorrelated. (71) is precisely the Wiener scaling coefficient in (67), i.e.,
the MMSE and the MAP estimation procedures provide the MSD-optimal step-size for the
application of (67) to the NLMS algorithm means that while the reference signal vector
x(n) is normalized to give a proper direction of adaptation in the solution space of h(n),
the observed estimation error e(n) is re-scaled to be nearly equal to E{|ē|} to provide as
A very effective adaptive regularization procedure for the NLMS algorithm proposed in [52]
kx(n)k2
w(n + 1) = w(n) + µe(n) x(n) (72)
kx(n)k4 + γσv4
for γ > 0. The regularized normalization factor in (72) can be expanded into the product
||x(n)||2 1 σē2
≈ . (73)
||x(n)||4 + γσv4 ||x(n)||2 σ 2 + γ E{||h∆ (n)||2 } σv2 σ 2
ē L2 σx2 v
That is, (73) is also capable of performing the error enhancement by itself. This ensures
that the Wiener step-size control is carried out properly when the adaptive filter has not
reached sufficient convergence (i.e., for large E{||h∆ ||2 }) or when the mixing system is
weakly excited (i.e., for small σx2 /σv2 ). The error enhancement procedure by itself would
have difficulty adjusting quickly since it lacks any direct information about the reference
signal. In this sense, (73) also performs voice activity detection (VAD) on the reference
speech signal to ensure stable adaptation during such an ill-conditioned situation. The
43
effectiveness of a coring-type nonlinearity that exhibits VAD-like behavior and stabilizes
the adaptation of the NLMS algorithm at low signal level was demonstrated in [128], which
is consistent with the behavior of (72). Similar behavior should be expected from the oft-
used regularized normalization factor (||x||2 +δ)−1 in (45), which can be used also to perform
a function of the noise statistics (e.g., a procedure for adapting δ is provided in [74]).
The use of a compressive nonlinearity to limit the onset of double talk essentially suppresses
the outliers, i.e., signals corrupted by the impulsive noise, to get back the signal of interest.
The strategy is very similar in its functionality to the error enhancement approach. It
was shown in [38] that a compressive form of the error nonlinearity can be derived for a
bursty noise through the robust statistics theory [54] to improve the performance of the
for 0 < λs < 1 and β > 0. (74) appears exactly the same in form as (62) in Figure 15(b),
which is also able to limit the effect of v generated by a heavy-tailed PDF. The main
difference is that (61) and (62) hold the output to within the ±σē2 /αv range, which allows for
an adaptive adjustment of the threshold according to the SNR, while (74) limits the output
range to roughly ±k0 s, where (76) indicates that the adaptive scaling term s represents a
low-passed, noise-robust estimate of E{|ē|}. Hence (61) and (62) should be able to track
the gradual changes in the noise condition better than (74), whereas (74) is more effective
than (61) or (62) in limiting the large fluctuations but may be too restrictive such that it
leads to slower convergence by the LMS algorithm than necessary during online adaptation.
44
A compressive nonlinearity that is more general than (74) and goes to zero beyond the
threshold can also be derived from the robust statistics theory [146], which is equivalent in
effect to freezing the filter adaptation when the residual echo is greater in magnitude than
some threshold as a function of the reference signal, e.g., the Geigel DTD algorithm [29].
By following the same reasoning for the origin of the “coring” nonlinearity as discussed in
[128], such a “super-compressive” nonlinearity may be obtained when pē is Gaussian and
The convergence behavior in terms of the system distance for the LMS algorithm with the
where I is the L×L identity matrix and Rx = E{xxT } is the L×L auto-correlation matrix
2
0<µ< , (78)
λmax E{φe′ (e(n))}
where λmax is the maximum eigenvalue of Rx . Refer to [28] for detailed convergence analysis
and more precise form of the optimal error nonlinearity, which becomes the same as (70)
A recursive and alternating application of the LMS algorithm and the ERN may also be
algorithm, the maximization step (M-step) is performed by maximizing the joint likelihood
¯
of the known signal (i.e., d(n)) and the “hidden” signal (i.e., d(n)), given the old estimate
of the “latent” system parameter (i.e., h(n)) to update the parameter estimate (i.e., w(n)),
whereas the expectation step (E-step) is performed by determining the conditional expec-
tation of the hidden signal given the observed signal and the new parameter estimate. The
45
M-step is roughly implemented through the LMS algorithm, where it is well known that the
maximum-likelihood (ML) solution is exactly equal to the least-squares (LS) solution when
the estimation error is zero-mean Gaussian distributed [56] and that the LMS algorithm
stochastically approximates the LS solution [47]. The E-step is precisely what is performed
through the Bayesian estimation procedure from which the error enhancement nonlinear-
ities are derived. For example, performing the BSS strictly through the EM algorithm is
possible [84], as ICA-based algorithms can also be developed in the ML framework [70].
Many alternative approaches, statistical or heuristic, may be taken to arrive at the ERN with
similar form and functionality. For example, ad-hoc nonlinearities such as the compressed
center-clipper and the coring nonlinearity, useful in other signal enhancement scenarios,
can be related to the MMSE and the MAP nonlinearities [128]. While the general form
of the ERN is dictated by the source statistics, the amount of noise suppression depends
directly on the SNR. The major obstacle, however, is that even a typical real-life acoustic
mixing environment usually precludes the precise measurement of individual signal statis-
tics. Nonetheless, we show in the remainder of this chapter that rigorous specification of
the ERN and estimation of the SNR are not necessary for a sufficient online echo cancel-
lation performance as long as the overall form of the ERN is correctly chosen, given the
approximate a priori knowledge of the signal distribution, and the entire AEC system is
updated via the BIA procedure, which takes advantage of EM-like “dual” combination of
adaptation and error enhancement, to take full advantage of the inherent adaptability of
The Gaussianity of ē is usually assumed for the LMS algorithm when the filter length is long
enough by the argument of the central limit theorem [2]. The error enhancement technique
may be interpreted as a generalization of the adaptive step-size method for any probability
the signal level for non-Gaussian signals even when their statistics remain stationary. The
46
ERN technique enables the incorporation of the statistical source information for linear
MSE-based adaptive filtering. In fact, the ERN obtained from (60) or (64) suppresses the
noise signal better than the Wiener enhancement rule of (67) does when either ē or v is non-
Gaussian distributed (see Appendix B), the implication of which is significant since ē may
not be Gaussian-like during early stages of adaptation if the filter length is relatively short.
In any case, most of the signals encountered in real life are not distributed as Gaussian,
e.g., the speech signal distribution is widely regarded to be super-Gaussian in either the
time or the frequency domain [41]. This leads naturally to the role of ICA as discussed in
A single-channel AEC setup shown in Figure 17(a) can be viewed as a special case of the
source separation problem for the recovery of the near-end signals when some of the source
signals are partially known, i.e., the far-end (reference) signal. By following the source
separation convention, the mixing system in Figure 17(a) can be modeled linearly as
T
d(n) 1 h(n) v(n)
= (79)
x(n) 0 I x(n)
where 0 is the zero vector of length L and w̃ is the scaled impulse response estimate vector
(the scaling process is discussed in more detail below). Then the natural gradient (NG)
for some adaptation step-sizes µ1 and µ2 . The usual MSE-based system identification
47
algorithm simplifies to
(a) Single-channel AEC with the signal v(n) from a local source s(n) added to the
¯
desired response d(n) (acoustic echo) from a loudspeaker signal x(n) to be canceled.
(b) Single-channel ANC with the signal v(n) form a local source s(n) added to the
¯
desired response d(n) (noise) from a secondary source x(n) to be canceled.
Figure 17: Comparison between AEC and adaptive noise cancellation (ANC). In both cases, the
application of ERN to the filer error e(n) allows an LMS-based adaptive filter to produce the estimate
of the target source s(n) that is independent on average from the signal x(n) to be canceled.
By casting the single-channel AEC problem into the two-channel source separation
framework with the loudspeaker and the microphone channels as two separate inputs, the
ERN falls out automatically from the NG algorithm by applying ICA to the optimization
procedure. For example, the mixing system depicted in Figure 17(a) can be generalized
for the adaptive noise cancellation (ANC) problem examined in [96] as illustrated in Fig-
ure 17(b), which is simply a two-channel BSS problem for recovering one of the source
signals. More importantly, the perspective is shifted beyond the identification of the RIR h
and focused instead on the estimation of the near-end signal v represented by the output e,
i.e., recovery of the original signals by maximizing the independence of the system output
48
due to the presence of a near-end source, where historically the adaptive step-size proce-
dure and DTD were developed in retrospect to remedy the stability issue. Then the same
assumptions used to guarantee the idealized algorithmic solution in the MSE sense may
instead restrict the search for other valid solutions that are, though not necessarily the
global optimal, also just as effective. Other significant observations with respect to ICA are
as follows.
average in SOS, i.e., E{ex} = 0. On the other hand, the application of φe to e during
the MSE optimization procedure attempts to make e and x independent, which means
decorrelation through the second and all high-order statistics (HOS), i.e., E{ek x} = 0 for
but not vice versa, the ICA-based LMS algorithm is a generalization of the LMS algorithm
for non-Gaussian signals. Gaussian signals are characterized by only up to the SOS, i.e.,
φe (e) = e/σe2 when e ∼ Gaussian(0, σe ) such that (83) reduces to the LMS algorithm.
The adaptive scaling factor a of (82) is used to obtain the scaled estimate ẽ = ad + w̃T x ≈ ṽ
of the local signal v such that ṽ is “standardized” during the convergence process. Its
functionality is similar to that of the scaling factor s of (76) used by the double-talk-
robust nonlinearity of (74). The adaptive scaling process makes the NG algorithm robust
to large fluctuations in d due to a noise source [144]. Such a mechanism is enforced by the
natural gradient learning rule that is scale-invariant, or “covariant” [70], which governs the
convergence process of (81) and (82), i.e., E{φẽ (ẽ)ẽ} → 1 as n → ∞. It has been exploited
to perform the single-channel AEC during BSS without a DTD through semi-blind sources
separation in the frequency domain in [79]. SBSS is a direct extension of BSS when some
partial knowledge of the source signals is already available and is consequently suited for
multi-channel AEC in the presence of interferences [89]. Even though the adaptive scaling
is fixed for the error enhancement technique, i.e., a(n) = 1, the scaling process is an integral
49
part of the MMSE and the MAP nonlinearities by the virtue of the PDF that standardizes
the input signal as a function of the signal statistics. Hence the ICA-based LMS algorithm
As motivated above, the ERN derived from the score function is inherently capable of acting
= ṽ + ē˜. (84)
The score function φẽ suppresses the scaled local noise ṽ = ad + h̃T x, where h̃ is the scaled
negative of the RIR vector h, such that a better estimate of the scaled true filter estimation
error ē˜ = (w̃ − h̃)T x can be obtained and then be used to update the de-mixing filter
coefficients w̃. Thus the use of a nonlinearity in the NG algorithm may be viewed not only
as an essential part of the nonlinear decorrelation process [56, 89] but also as performing
the error enhancement to facilitate the filter adaptation. After convergence, (84) must be
re-scaled to the original scaling to obtain the local signal estimate [144]. In other words,
the error enhancement and the adaptation are both carried out on the scaled signals by the
NG algorithm via optimization through ICA, whereas the adaptation is performed entirely
in the original scale after the error enhancement by the ICA-based LMS algorithm.
The HOS-based adaptive algorithms are normally suited for batch-wise, offline adaptation
such that a mis-specification in the signal statistics, or PDF in general, does not diminish
the effectiveness of ICA [56]. There is a trade-off between the offline learning and the
online learning, where the former may use as many adaptive iterations as allowed in a given
time to further improve the solution, whereas the latter offers the tracking capability in a
50
depends on how well an adaptation procedure is modified to retain the advantage of batch
learning, e.g., the use of so-called “batch-online” adaptation for SBSS in [89] that shortens
the batch size considerably yet accomplishes adequate real-time separation performance.
For the LMS algorithm, a small adaptation step-size corresponds to low MSE at the cost
of slow convergence speed [46, 49]. The aim is to exploit such an inherent ability of the LMS
algorithm to converge to the optimal solution by instilling just enough noise-robustness con-
domain AEC in [131] that a loss in the convergence rate incurred for a gain in the stability
due to the error enhancement can then be compensated via BIA without requiring precise
estimation of the signal statistics. The EM-like dual re-estimation procedure involving the
filter adaptation and the error enhancement permits the refinement of the roughly estimated
signal statistics at each iteration until sufficient convergence is obtained. This is different
from the conventional data-reuse or batch-wise adaptation philosophy that simply applies
the adaptive algorithm repeatedly on a same set of data. Furthermore, it was shown clearly
in [8] that the data-reuse NLMS (DR-NLMS) algorithm is similar in form and convergence
behavior to the affine projection algorithm (APA). The FBLMS algorithm with BIA re-
quires far less computation than the DR-NLMS algorithm or the APA due to the block
processing and the overlap-and-save (or -add) filtering procedure while providing noise ro-
bustness during the APA-like adaptation through the error enhancement procedure. We
observed experimentally in [131] that four iterations are sufficient for the FBLMS algorithm
to maximize the BIA performance, which is analogous to the generally accepted practice of
The error enhancement technique has also been applied to the traditional multi-channel
AEC [132] and combined with a RES [142, 143] with excellent results. The successes are a
testament to the system approach to signal enhancement that culminates in the robustness
against mis-estimation of statistics. Most of the existing ICA-based BSS procedures already
utilize a priori knowledge of the source statistics and batch-wise adaptation along with the
natural gradient algorithm to provide sufficient source separation. The same framework
51
underlies the error enhancement paradigm for AEC, realized through the system combina-
tion of the LMS algorithm, the ERN, the adaptive regularization procedure (which can be
consider an integral part of the error enhancement component), and the BIA procedure.
In such a framework, where the system components are designed to interact mutually with
each other, prior source information, which dictates the overall characteristics of the ERN,
and not necessarily the precisely estimated signal statics becomes essential for robust AEC
performance.
The standard usage of VAD to estimate the noise variance during silence for the error en-
hancement has already been tested in [128, 129, 130]. Other advanced techniques, e.g.,
[124], may be employed for improved tracking performance. However, the conventional
single-channel techniques tend to be heuristic in nature and depend strongly on the ar-
bitrarily chosen thresholds and smoothing constants. In order to clearly demonstrate the
behaviors of different combinations of the ERN, the given SNR (averaged over time for en-
tire data) is used throughout the time-domain simulation. The main purpose is to show that
the sensitivity to the choice of fixed parameters is reduced by the system approach and that
the precisely estimated signal statistics are not necessary through the BIA procedure, which
can be carried out efficiently in the frequency domain. Thus the full practical potential of
A simulated acoustic echo, generated from a RIR with the reverberation time of about
250 ms measured at 8 kHz sampling rate, was used. The RIR was truncated prior to mixing
to match its length to that of the adaptive filter, chosen here to be 64 ms, and force zero
filter-length mismatch error (again for the purpose of illustrating the behavior of the ERNs
more clearly), and it was scaled to produce the echo return loss (ERL, i.e., attenuation) of
10 dB. 16 kHz, 16-bit PCM speech from the TIMIT database downsampled to 8 kHz were
used.
Three main scenarios were examined. First, an ambient noise comprised of white Gaus-
sian noise (WGN) was added to the acoustic echo during the time-domain AEC to simulate
52
a linear distortion. Second, encoding and decoding using the GSM AMR speech codec [58]
were applied to the acoustic echo during the time-domain AEC to simulate a nonlinear
distortion. Finally, a continuous near-end speech was added along with the WGN during
the frequency-domain AEC to simulate a very noisy AEC condition that would normally
Figure 18 shows the results from a simulated acoustic echo with additive WGN at 10 dB
GL
SNR and using a compressive ERN (fMMSE ) with known noise energy. The regularization
parameters, δ for the first regularized NLMS algorithm in (45) (rNLMS1) and γ for the
second regularized NLMS algorithm in (72) (rNLMS2), were adjusted to give the highest
from the figure. First, rNLMS2 performs better than rNLMS1 when the SNR is low,
i.e., at low signal level, which is consistent with the intended design of rNLMS2. Second,
GL
combining fMMSE with rNLMS1 or rNLMS2 allows as much as 5 dB improvement in the
tERLE. Third, the tERLE plot clearly indicates that echo cancellation beyond the noise level
(10 dB a priori SNR) is indeed possible through a combination of the error enhancement
obtained with the inclusion of the ERN technique even though the misalignment is higher
when compared to the performances of rNLMS1 or rNLMS2 without the ERN. Thus lower
misalignment, or better system identification, does not always translate to higher echo
Figure 19 illustrates the stabilizing effect of the ERN on rNLMS2 for the mis-specification
of γ, which also implies to a large extent the robustness against the mis-estimation of the
53
5
0
Misalignment (dB)
rNLMS1
GL
rNLMS1+f
−5
rNLMS2
rNLMS2+f GL
−10
−15
0 5 10 15 20 25
time (s)
(a) Misalignment.
30
25
20
tERLE (dB)
15
10
−5
0 5 10 15 20 25
time (s)
GL
Figure 18: From the regularized NLMS algorithms (rNLMS1, rNLMS2) with the ERN (fMMSE ).
54
emulates a coring nonlinearity [128] and enhances the filter estimation error at low signal
GL
level, whereas a combination of rNLMS2 and fMMSE ensures the proper enhancement for a
wide range of time-varying SNR resulting from the non-stationary speech signal. The same
By modeling the nonlinear speech coding distortion as an additive noise, i.e., v(n) = ȳ(n) −
ŷ(n) for the original speech ȳ and a distorted speech ŷ, the signal-to-distortion ratio (SDR)
can be defined in the same fashion as the SNR. Then the error enhancement procedure may
and processing delays between the encoder and the decoder are ignored). The amount of
where the GSM AMR coding distortion (measured in terms of the SDR and averaged over
20 female speech utterances) is plotted against the input speech magnitude (measured in
Figure 21 is obtained from a simulated acoustic echo encoded and decoded by GSM
AMR at 12.2 kbps bit-rate. The regularization parameter for rNLMS2 was adjusted to give
model the amount of distortion in terms of the input signal magnitude as in [128], a fixed
distortion estimate was used this time throughout the simulation just as a fixed noise power
estimate was used for the previous ambient noise case. Compared to the increase in the
the step-size is adjusted when dealing with nonlinear distortions, in which case it is better
to let the LMS algorithm converge naturally without exerting more secondary control than
GG ,
necessary. rNLMS2 by itself provides the same overall performance as NLMS+fMMSE
which is as expected since rNLMS2 already possesses the Wiener-like error enhancement
GG
capability. As such, the application of fMMSE to rNLMS2 does not provide any substantial
benefit.
55
5
0
Misalignment (dB)
−5
−10
6
−15 γ =10 , tERLE=8.5 dB
γ =107, tERLE=10.9 dB
γ =108, tERLE=12.9 dB
−20
γ =109, tERLE=15.6 dB
γ =1010, tERLE=18.7 dB
−25
0 5 10 15 20 25
time (s)
GL
(a) Without fMMSE .
0
Misalignment (dB)
−5
−10
6
−15 γ =10 , tERLE=14.3 dB
γ =107, tERLE=16.9 dB
γ =108, tERLE=19.0 dB
−20
γ =109, tERLE=21.4 dB
γ =1010, tERLE=22.8 dB
−25
0 5 10 15 20 25
time (s)
GL
(b) With fMMSE .
GL
Figure 19: From the regularized NLMS algorithm (rNLMS2) with the ERN (fMMSE ). Correspond-
ing average tERLE is provided in the legend.
56
in the network near end
Encoder
Decoder
Decoder
x̄(n) x(n) x(n)
Adaptive
Filter h(n)
ERN
_ ŷ(n)
Encoder
Decoder
+
e(n) y(n) ȳ(n)
16.5
16
15.5
15 12.2
10.2
14.5 7.95
SDR (dB)
7.4
14
6.7
13.5 5.9
5.15
13 4.75
12.5
12
11.5
0 5 10 15 20 25 30 35 40
Signal Loss (dB)
Figure 20: Network-based AEC with speech coding distortion for various bit-rates (kbps). Quan-
tization noise starts to take over beyond the signal loss of 25 dB.
57
4
2
Misalignment (dB)
−2
−4
−6
−8
0 5 10 15 20 25
time (s)
(a) Misalignment.
25
20
15
ERLE (dB)
10
NLMS
5 NLMS+f GG
rNLMS2
0 rNLMS2+f GG
−5
0 5 10 15 20 25
time (s)
(b) ERLE.
GG
Figure 21: From the NLMS algorithms (NLMS, rNLMS2) with the ERN (fMMSE ). GSM AMR
speech encoding and decoding at 12.2 kbps bit-rate were applied to the acoustic echo.
58
We have observed that using other types of ERNs with NLMS does not improve the
GG
ERLE much more than fMMSE for the speech coding distortion. This is also expected since
GSM AMR by design imparts perceptually weighted distortion in the frequency domain,
which results in the fixed SDR on average except at low signal level when the distortion is
GG
mainly due to the quantization noise as indicated by Figure 20(b). In such a case, fMMSE
that dynamically mimics a coring nonlinearity well and does not overly alter the input signal
should serve better than either a subtractive or a compressive ERN. This is a good example
of how the a priori knowledge of source characteristics and the understanding of the actual
physics involved influence the proper integration of system components for delivering the
GG
Figure 22 shows the results from using all the three types of ERNs (fMMSE LG
, fMMSE GL
, fMMSE )
with the regularized FBLMS (RFBLMS) [129] on a simulated acoustic echo, where silence
(WGN at 90 dB SNR) of 1 second was inserted between the reference speech utterances
before the near-end mixing. Both the WGN at 10 dB SNR and a continuous local speech
at 0 dB SNR were added to the acoustic echo. The same simplified statistics estimation
GL
strategy in [129] was utilized, i.e., v ≈ e and ξ = 1 (in such a case, fMMSE GL
and fMAP
become much like the compressive nonlinearity of (74) by limiting e to be roughly within
GL the threshold becomes t = ξα ≈ σ for the SNR ξ = σ 2 /α2 ), as well
±σē , e.g., for fMAP v e ē v
as γ = 1 for the regularization procedure. The ERN was applied only to the input mag-
nitude while the phase was left unmodified. Relatively large adaptation step-size α = 0.02
and smoothing constant β = 0.998 were used for RFBLMS such that only four BIAs (i.e.,
GL
filter→ERN→adaptation) per batch of data were necessary for RFBLMS+fMMSE to reach
GL
the maximum steady-state tERLE. The tERLE plot shows that RFBLMS+fMMSE pro-
vides the best result during the entire simulation. The misalignment plot also exhibits the
GL
stability of RFBLMS+fMMSE , where the large α causes instability at the start of adap-
tation from which the other combinations are unable to recover. The noise robustness of
GL
RFBLMS+fMMSE is also apparent as high tERLE is maintained even during the silences in
59
the acoustic echo signal.
GL
Figure 23 illustrates the APA-like behavior of RFBLMS with fMMSE for various block
iterations iter and step-size α. A significant recovery of the convergence speed, especially
at the initial adaptation stage, is possible through BIA. The AEC system is able to recover
after a sudden change in the RIR at 10 seconds. We note that although the acoustic echo
is still audible at around 20 dB tERLE, the use of a RES and the masking effect during
double talk must be taken into account in practice [142]. Also, this simulation example does
not imply that a DTD should not be used at all; rather, it should be incorporated in such
a manner to help the overall system performance, e.g., the step-size can be decreased by a
factor of half during double talk for extra stability [132]. The current practice of freezing the
adaptation entirely, depending on a particular choice of the DTD threshold, may be more
disruptive than helpful due to, for example, the false detection or when the RIR changes
The double-talk situation can create large variations in the filter estimation error, espe-
cially during the frequency-domain AEC when a speech signal is very sparsely represented
GL
across frequency. fMMSE performs the best out of the three types of ERNs since it is able
to suppress the large outliers in the observed estimation error just as with the traditional
the BIA procedure enables the recovery of the convergence rate lost due to the scaling down
of the step-size, such a mechanism becomes crucial during double talk that requires a more
stringent step-size control than for the ambient noise. None of the previously mentioned
frequency-domain AEC algorithms designed specifically to work with the double-talk situa-
tion [126, 114] were able to take advantage of batch-wise adaptability afforded by the error
GL
enhancement procedure. Replacing fMMSE GL , which is much simpler to implement
with fMAP
numerically, results in nearly identical behavior as shown in Figure 24 for the same step-
size (slight differences in the performance may be adjusted by varying the step-size). Thus
the proper specification of the overall form of the ERN, dictated by the a priori source
information, is practically more relevant than that of the exact mathematical form.
60
15
RFBLMS
RFBLMS+f GG
10
RFBLMS+f LG
RFBLMS+f GL
Misalignment (dB)
−5
−10
−15
0 5 10 15 20 25
Time (sec)
(a) Misalignment.
40
35
30
local speech silence
echo + WGN
25
tERLE (dB)
20
15
10
0
0 5 10 15 20 25
Time (sec)
GG LG
Figure 22: From the regularized FBLMS algorithm (RFBLMS) with the ERN (fMMSE , fMMSE ,
GL
fMMSE ).
61
0
iter = 1, α = 0.01
enclosure displacement iter = 4, α = 0.01
iter = 1, α = 0.02
iter = 4, α = 0.02
Misalignment (dB)
−5
−10
−15
0 5 10 15 20 25
Time (sec)
(a) Misalignment.
40
35
30
local speech silence
echo + WGN
25
tERLE (dB)
20
15
10
0
0 5 10 15 20 25
Time (sec)
GL
Figure 23: From RFBLMS with fMMSE for various block iterations iter and step-size α. The RIR
changes suddenly at 10 seconds due to the displacement of the loudspeaker-microphone enclosure.
62
0
MMSE
enclosure displacement MAP
Misalignment (dB)
−5
−10
−15
0 5 10 15 20 25
Time (sec)
(a) Misalignment.
40
35
30
local speech silence
echo + WGN
25
tERLE (dB)
20
15
10
0
0 5 10 15 20 25
Time (sec)
GL GL
Figure 24: From RFBLMS with fMMSE or fMAP .
63
CHAPTER IV
As with any other least-square-based adaptive filtering, the least mean square (LMS) al-
gorithm suffers from two main problems. One is that the convergence speed is decreased
greatly for highly correlated reference signal x, e.g., speech, thereby making the normal
equation
ill-conditioned for solving numerically, where Rxx = E{xxT } is the auto-correlation matrix,
x and w are the reference and the filter coefficients vectors, respectively, rxy = E{xy} is
the cross-correlation vector, y is the microphone signal, and {·}T is the transpose operator.
d + v for linear distortion where v can be the near-end speech or the background noise,
that consequently corrupts the error e needed for ideal filter coefficients updating, i.e.,
LMS algorithm by itself has difficulty converging to the optimal solution in the presence of
local noise (e.g., double talk) since the noise directly perturbs the single-sample, “noisy”
estimate of the MSE gradient (i.e., the gradient, or estimation, noise [49]). Therefore, the
ill-conditioned Rxx , the near-end noise v, and the gradient noise all together conspire to
degrade the AEC performance in a disruptive and complex acoustic mixing system.
There are other issues associated with the LMS algorithm in the real-world AEC scenar-
ios. Namely, the “non-uniqueness” problem arises due to inter-channel correlation during
multi-channel AEC (MCAEC) that further retards the convergence speed [34]. To improve
the echo-path tracking, a decorrelation procedure must be applied to the reference signal
before playback and adaptation at the cost of decreased audio quality. A similar prob-
lem occurs due to inter-block correlation with the multi-delay filter (MDF) [120], which is a
partitioned-block approach to the frequency-block LMS (FBLMS) algorithm [49] for reduced
64
computation and blocking delay. The two related issues can be alleviated by accounting for
the inter-correlation structure [19]. This, however, ultimately involves the inversion of the
auto-correlation matrix, where a fast, but potentially unstable, algorithm may be utilized
In the previous chapter, we showed the systematic relationship in detail between the
LMS algorithm and the error recovery nonlinearity (ERN) during residual echo enhancement
(REE) for noise-robust AEC performance [131, 133]. The system approach, which places
analytically consistent yet global perspective on the problem at hand rather than focusing
noise-robustness issue can be effectively solved by REE that applies the ERN to “enhance”
the filter estimation error before updating the filter coefficients. Both the steady-state
and the convergence behaviors of the LMS algorithm are improved significantly through
REE and multiple recursive filtering and adaptation on a batch of noisy data via block-
iterative adaptation (BIA). However, while the REE technique may be readily extended
to the traditional single-channel solution to MCAEC, it does not directly address the non-
uniqueness issue to reduce the dependency of the Wiener solution on the far-end room
response. Some form of decorrelation must still be used with aforementioned trade-offs.
Many conventional techniques, e.g., a nonlinear “half-wave rectifying” processor [34] and
comb filtering [11], do not entirely satisfy the first two requirements. They may not likely
meet the third condition necessary for optimal steady-state performance by an MSE-based
65
adaptive filter, and they also tend to be incompatible for the case of more than two audio
channels.
We will demonstrate in this chapter that the same system perspective can be extended to
the decorrelation procedure for directly assisting the AEC adaptation process. The overall
aim here is to achieve decorrelation by integrating the decorrelation procedure not simply
as a separate pre-processor applied prior to the LMS algorithm but as a part of the AEC
system, capable of controlling both the echo cancellation and tracking performances while
The rest of this chapter is organized as follows. First in Section 4.1, we present the
decorrelation by resampling (DBR) technique that exploits the systematic link between BIA,
decorrelation, and resampling. The new technique leads to the development of frequency-
domain resampling (FDR), which takes advantage of the computational efficiency of the Fast
Fourier Transform (FFT), and sub-band resampling (SBR), which is an extension of FDR
that allows selective decorrelation per frequency sub-band as measured by the coherence [20]
for controlling the trade-off between signal distortion and decorrelation amount. Second in
Section 4.2, we provide extension of the techniques already developed and used for a single-
channel robust AEC (R-AEC) system to MCAEC and MDF, where we also discuss the
variations in BIA for further improvement in the AEC performance. Finally in Section 4.3,
all the proposed DBR techniques are tested for audio quality and AEC. In particular, we
evaluate the true echo return loss enhancement (tERLE) and the misalignment through the
sub-band decomposition to demonstrate the superiority of our system approach over other
Our stance is markedly different from the traditional aspects in several ways. First,
the DBR and the R-AEC components complement one another such that low coherence,
or equivalently low misalignment, over the entire frequency range is not necessary for high
tERLE during MCAEC. Conventional wisdom instead favors largest decrease in the coher-
ence as possible over all frequencies, which most likely causes the degradation of the actual
cancellation performance itself [132, 137]. Second, the R-AEC component takes advantage
of intrinsic adaptability of the LMS algorithm by applying BIA to the FBLMS algorithm
66
for increased convergence speed rather than simply relying on fast-converging alternatives
such as the affine projection algorithm or the recursive least square algorithm. Finally, the
R-AEC component, which includes the error enhancement and the adaptive regularization
procedures [131, 133], is adapted continuously in noise-robust fashion via REE even during
double talk. Such a robustness, in turn, is essentially what makes BIA possible. We illus-
trate in this chapter the first two points above through the sub-band analysis of the tERLE
microphone, where “T ” denotes vector transposition, xj (n) is the reference vector from
= N , a set of filter coefficients corresponding to the echo paths between all Q loudspeakers
Rwi = ri , (88)
where R = {E xj (n − k)xj ′ (n − k′ ) } is an LQ×LQ matrix, wi = {wij (k)}i and ri =
{E yi (n)xj ′ (n − k′ ) }i are LQ×1 vectors, and E{·} is the expectation operator.
The normal equation indicates that even if the uniqueness condition of L < M is met
(i.e., xj is linearly independent of xj ′ for j 6= j ′ [57]), which is most likely the case in
reality, the problem remains ill-conditioned if E xj xj ′ 6= 0. The convergence behavior of
67
a stochastic gradient descent algorithm is then assisted by a decorrelation procedure φ(·)
such that E φ(xj )φ(xj ′ ) ≈ 0 for j 6= j ′ . Still, any extra processing, linear or nonlinear, will
inevitably change the statistics of non-stationary random processes xj (n) and xj ′ (n) and
modify the steady-state (or near steady-state) solution, where the effect may be significant
for the LMS algorithm that uses a very rough estimate of the gradient. A decorrelation
The REE technique is represented systematically in Figure 25 that recursively refines the
estimations b̂ and v̂ of the corrupted residual echo e = b+v. This built-in dual re-estimation
process, carried out via BIA, makes the R-AEC component based on the LMS algorithm
less sensitive to mis-specification of the system parameters and mis-estimation of the signal
statistics [133]. BIA also permits a natural recovery of the convergence speed lost to the
coloration of the reference signals, e.g., due to the non-uniqueness problem [132, 137].
Adaptive Filter
b̂ e=b+v v̂
Residual Echo
Enhancement
The LMS algorithm iteratively and stochastically solves the normal equation (85). For
such a dynamic solution, a mismatch in the sampling rate between the loudspeaker and the
microphone channels on the order of few hundred parts per million (∼ 0.01%) is enough
to break down the correlation structure of rxy in (85) for significant decrease in the LMS
same effect for the decorrelation purpose, i.e., to improve the conditioning of Rxx , by
resampling x frame-wise. Such a close and dynamic relationship between the sampling rate
and the LMS algorithm is one major reason for DBR’s effectiveness as revealed in [132, 137].
One other crucial system aspect is that BIA enables a natural recovery of the convergence
68
speed and hence reduces the need for aggressive decorrelation applied directly to x, which
subsequently minimizes the audio distortion and also the potential interference with the
adaptive cancellation process. We have observed that the DBR and the R-AEC components
complement one another such that low coherence, or equivalently low misalignment, over the
entire frequency range is not necessary for high cancellation performance during MCAEC.
Conventional wisdom instead favors as large decrease in the coherence as possible over all
frequencies, which most likely causes the degradation of the actual cancellation performance
[132, 137].
Let fs and fs′ be the original and the new sampling rates, respectively. The resampling
ratio is defined as
fs′
R= = 1 + r, (89)
fs
where the mismatch ratio is defined as r = f∆ /fs , f∆ = fs′ − fs . Assuming without loss
of generality a real-valued R > 1 (or r > 0), sampling rate expansion gives the identity
relationship
where x(n) and X(Z) are the discrete time sequence of a continuous time signal x(t) and
the corresponding Z-transform, respectively, after which the upsampled signal is obtained
which is the fractional delay of expanded samples with respect to the original samples. Thus
applied to the original signal, and (91) means the delay grows progressively in time (i.e.,
MCAEC in [51] with larger modulation at higher bands to perceptually hide the signal dis-
tortion after synthesis, whereas one-sample delay was inserted periodically across channels
69
into frames with half delay period per frame and quarter-period shifting during stereophonic
AEC (SAEC) in [125]. We propose combining the resampling approach with the alternat-
ing projection technique of [110]. Such a combination takes on key features from [51] and
[125] as it periodically imparts smoothly increasing modulation (in frequency) and delay (in
time) across channels. The main drawback is the computational cost of resampling at the
rate R ≃ 1 (r ≃ 0), which requires very large integer-valued resampling ratios for the ideal
upsampling and downsampling scheme. Such a problem can be solved by the resampling-
by-interpolation strategy proposed in [100], which drastically reduces the computation time
by omitting the downsampling process and reusing a short interpolation filter (sinc func-
tion) per block. Therefore, the decorrelation is achieved simply by lowpass filtering in an
appropriate manner.
Figure 26 demonstrates several ways to apply DBR for inter-channel decorrelation. The
resampling rate R is adjusted arbitrarily here to produce the nearest whole extra sample
for the given frame size N . For DBR1, resampling is applied to every other channel and
time frame, which then requires a smoothing procedure to ensure continuity across the
resampled frames [132]. On the other hand for DBR2, DRB3, and DBR4, resampling is
applied simultaneously across all channels by time-reversing a frame in every other channel
before resampling at the opposite rate than the previous frame, i.e., 1/R versus R, and
reversing back afterward. This results in continuously varying delay, thus continuity in
the resampled signal. Negative delay in Channel 2 for DBR3 and DBR4 is produced by
resampling every other frame at the rate 1/R, R > 1, and by time-reversing and resampling
the rest at R and reversing them back afterward. Many other framing procedures are of
Figures 27 and 28 show the corresponding inter-channel coherence plots. The coherence
and DBR3 provide results very similar to each other, where DBR1 gives slightly lower
coherence at higher frequencies than others due to the undesirable aliasing distortion caused
by decimation, or discontinuity in delay, that occurs across frames [125]. DBR4 is able to
lower the coherence more than others since it applies as much continuous delay variation
70
−4 Channel 1 −4 Channel 1
x 10 x 10
delay (s)
delay (s)
1 1
0 0
0 0.5 1 0 0.5 1
−4 Channel 2 −4 Channel 2
x 10 x 10
delay (s)
delay (s)
1 1
0 0
0 0.5 1 0 0.5 1
time (s) time (s)
(a) DBR1 [132]. (b) DBR2 [137].
−4 Channel 1 −4 Channel 1
x 10 x 10
1 1
delay (s)
delay (s)
0 0
−1 −1
0 0.5 1 0 0.5 1
−4 Channel 2 −4 Channel 2
x 10 x 10
1 1
delay (s)
delay (s)
0 0
−1 −1
0 0.5 1 0 0.5 1
time (s) time (s)
(c) DBR3. (d) DBR4.
Figure 26: Signal delay is linearly varied after resampling frame-wise (a) alternatingly across
channels and (b,c,d) simultaneously across channels where every other block is resampled after time
reversal and reversed back afterward (N = 2048, R = 1.0004, fs = 16 kHz).
as possible across both time and channel. Also, increasing R for fixed N leads to a large
reduction of the coherence from mid to high frequencies much like in Figure 27(d).
Figure 28 includes the result from nonlinear processing (NLP) with the nonlinearity
parameter set at α = 0.5 [34] for comparison with DBR. It show that NLP tends to reduce
the coherence more than DBR at low frequencies where most of the speech energy resides.
Depending on the frame size, DBR is able to reduce the coherence more than NLP at mid
to high frequencies while altering the original signal less than NLP at low frequencies.
Informal listening tests have revealed that no audible distortion is noticed during loud-
speaker playback after DBR as long as the loudspeakers are spaced sufficiently apart to
eliminate the minor perception of “flutter,” or audio image fluctuation, at high frequencies,
most likely due to the coupling across frequency of time-varying phase difference between
channels. The resampling rate R must be adjusted to be closer to unity in order to reduce
71
magnitude−squared coherence
magnitude−squared coherence
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 2 4 6 8 0 2 4 6 8
frequency (kHz) frequency (kHz)
(a) DBR1 [132]. (b) DBR2 [137].
magnitude−squared coherence
magnitude−squared coherence
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 2 4 6 8 0 2 4 6 8
frequency (kHz) frequency (kHz)
(c) DBR3. (d) DBR4.
Figure 27: Inter-channel coherence (averaged over first 5 seconds of speech). From top to bottom:
N = ∞, 4096, 2048, 1024, 512, 256.
magnitude−squared coherence
0.8
no decor. proc.
0.6 NLP (0.5)
DBR4 (N=2048)
0.4
DBR4 (N=256)
0.2
0
0 2 4 6 8
frequency (kHz)
72
4.1.4 Decorrelation via Frequency-Domain Resampling (FDR)
As described above, DBR introduces time-varying signal delay frame-wise across channels
with negligible audible distortion [132]. Resampling by interpolation (RBI) was utilized in
[100, 132] since the conventional upsampling followed by downsampling may be computa-
tionally infeasible for real-time applications when the desired sampling rate change is very
small (on the order of 0.01%). We may still do better computationally if we can take advan-
tage of the efficiency of the FFT and resample, or interpolate, across frequency rather than
across time with appropriate expansion or contraction dictated by the resampling ratio R.
holds for continuous time signal x(t) and the corresponding Fourier transform X(ej2πf ),
i.e., the time-frequency scaling is inversely related. Let X(ejω ) be the discrete-time Fourier
transform (DTFT) and XN (k) be the N -point discrete Fourier transform (DFT) of x(n),
respectively, with proper bandlimiting during the sampling of x(t) to avoid the frequency
aliasing to obtain x(n). Extending an N -point sequence from x(n) by inserting zeros at the
end of the sequence to turn it into an L-point sequence gives the relationship
x(n), n = 0, 1, . . . , N − 1,
y(n) =
0, n = N, N + 1, . . . , L,
←→ YL (k) = U (ejω ) ∗ X(ejω )ω=2πk/L , (93)
(93) specifies how often YL (k), which results from the convolution (or smearing) of X(ejω )
with the sinc function over frequency due to the windowing of x(n) by u(n) over time, is
sampled in the frequency domain. The phase shift e−jπk(N −1)/L in (94) is due to the DFT
73
of u(n) not centered at the origin. On the other hand, expanding the N -point sequence of
x(n) by padding M − 1 zeros between the samples gives an L = M N -point sequence with
the relationships
x(n/M ), n = 0, M, 2M, . . . , L − 1,
y(n) =
0, otherwise,
where “∗ ” denotes the complex conjugation. (96) describes the M -times upsampling proce-
dure with zero-padding followed by convolution with the sinc function for interpolation over
time. The term ejπn(N −1)/L in UL∗ (n) of (96) arises from the shifting of centered windowing,
The above analysis implies that the DFT or the inverse-DFT of sampled signals can
increasing L by varying M with fixed N in (93) and (96) leads to resampled values over
frequency and time, respectively. One must note, however, that the zero-extension technique
does not increase the time or the frequency resolution, which is proportional to the number
Therefore, we propose the following procedure for resampling an N -point sequence from
74
with the constraints k ≤ Rk′ ≤ k + 1 and α′ = Rk′ − k for each (k′ )th new sample to
5. Discard the samples at the end of the new sequence x′ (n) to retain the first RN
Using the zero-extension factor M ≥ 2 and taking the 2N -point inverse-DFT avoids the
general for efficient implementation of FFT. The computation load can be reduced further
by taking advantage of conjugate-symmetric XL (k) for real-valued x(n) and by storing the
interpolated values over frequency in memory provided that they can be used later by the
frequency-domain AEC.
Figure 29 illustrates the amount of distortion from resampling by the proposed FDR
that 4 ≤ P ≤ 6 should be sufficient for most applications. FDR should offer computational
saving over TDR when R is near unity and N is sufficiently small. For example, 2(128) + 1
interpolation filter coefficients for relatively high resampling accuracy were taken during
TDR to obtain Figure 29, which translates to the computational overhead by roughly a
factor of 257N/(M N log2 (M N )) ≈ 1.07 when compared to FDR with P = 4 and N = 2048.
normalized distortion (dB)
−20
−30
−40
1 2 3 4 5 6 7 8
zero−extention order P
Figure 29: Distortion on a speech signal by FDR (normalized with respect to TDR) for R = 1.0004,
N = 2048, and fs = 16 kHz.
We note that one can choose to interpolate over the magnitude and the phase separately
instead of over the real and the imaginary parts as implied by (97). It then necessitates the
unwrapping of the phase, which may not be an accurate process for some signals. One can
75
also zero-extend at the beginning or evenly at both ends of a sequence rather than at the
end. We have observed that applying these alternate approaches to a speech signal does
not produce any perceptible difference. In any case, the overall distortion per resampled
frame should be minimized as long as large enough P is taken to ensure accurate short-
time interpolation. We also have observed that compared to FDR, TDR gives virtually
the same coherence reduction as FDR except near 8 kHz where a notable roll-off in the
coherence occurs due to the low-pass, anti-alias filtering. Thus FDR should provide even
For perceptual quality and the actual cancellation performance reasons [132, 137], we may
want to modify the signal only in certain sub-bands. For example, interaural time differences
plays an important role for sound localization at low frequencies [140]. A modification of
the low-band content disturbs the phase information of the signal and ultimately alters the
To that end, we point out that for achieving the same overall reduction in the coherence,
or equivalently the cross-correlation, the resampling ratio R may be adjusted separately over
each sub-band in the frequency domain as if resampling the entire signal frame with a fixed
R. This can be done to make sure that the spatial image distortion will be minimized by
in the low band and R = R0 > 1 in the high band, may introduce the unwanted frequency-
domain distortion. We have experimentally verified that the distortion created by such a
the resampling ratio per frequency bin as smoothly across the bins as possible, which simply
involves making R a continuous function of frequency, i.e., R(k), and applying the desired
Unlike the traditional decorrelation procedures [34, 125, 51] where the coherence is usu-
ally fixed throughout the frequencies for a given parameter, SBR is highly flexible and can
be fine-tuned for better perceptual quality, e.g., less “resampling” at lower frequencies and
76
vice versa at higher frequencies. We have shown that given the same degree of decorrelation,
SBR outperforms the conventional methods in terms of the audio quality [141].
Just as in [131], FBLMS can be combined with the REE procedure for MCAEC by using
a “compressive” nonlinearity for the ERN, which provides robustness against an impul-
sive noise and permits continuous adaptation during double talk, and with the regularized
which ensures a wide range of noise robustness (e.g., when the echo path is weakly excited)
and is an integral part of the REE procedure, where the reference and the noise power
spectrums Sxj and Svi for kth frequency bin at lth block index are determined per j th and
ith channels, respectively. The same simplified statistics estimation strategy in [131] can
be utilized, where Svi in (98) is estimated directly by the residual echo power spectrum
Sei and the over-suppression factor η ≥ 1 is employed this time with REE such that the
The following modifications may be included for extra improvement in the overall
MCAEC performance. First, double-talk detection (DTD) [39] is used to decrease the
step-size by half during double talk to maintain as much stability as possible. Second, ex-
ponential weighting (EW) [72] is applied to the time-domain filter coefficients during the
FBLMS’s gradient constraint procedure for increased convergence rate and also adaptation
stability. Finally, a truly batch-wise adaptation can be carried out for a batch of B samples,
where the filtering and the adaptation steps are performed per block size L (same as the
filter size) and repeated across blocks of L < B, with or without overlap, in the same batch
There are mainly two options for BIA implementation. BIA1 in Figure 30(a) adapts over
each filter block, assumed here to be of size L, for several iterations before moving on to
77
the next block in time. BIA2 in Figure 30(b) adapts across J blocks partitioned over a
batch of data of size B, and the process is repeated for several iterations. The blocks may
be overlapped over time with the integer overlap factor of > 1 (of = 2 in Figure 30) for
further increase in the convergence speed, which then requires proper output smoothing [83]
1. The two cases are each suited for real-time and off-line implementations, respectively.
If the adaptation is permitted for B > L, BIA2 should provide higher tERLE than BIA1
for the same amount of calculation since it allows for wider input signal variation in time,
w1,1
1,2
w2,1
w1,J
w2,2
···
··· w2,J
···
···
···
(a) BIA1: Adaptive iteration per block.
w1,1 w2,1
w1,2 w2,2 ···
···
···
w1,J w2,J
batch size B
(b) BIA2: Adaptive iteration across blocks.
Figure 30: Block-iterative adaptation (BIA) of the filter coefficient vector wi,j , where i is the
iteration index and j is the block index.
The MDF partitions x and w into K sub-blocks of size D each for reduced computation in
1
Smoothing of the filter output dˆ for of > 1 as suggested in [83] actually interferes with BIA. It is
sufficient to simply smooth the residual echo e to eliminate the audible discontinuity due to the block-
adaptive processing.
78
where w = [w1T , . . . , wK
T ]T , x = [xT , . . . , xT ]T , and L = KD is the filter length. On
1 K
the other hand for MCAEC with P channels, the acoustic echo estimate at one of the
microphones is given by
P
X −1
ˆ
d(n) = wT (n)x(n) = wiT (n)xi (n), (100)
i=0
were w = [w1T , . . . , wPT ]T and x = [xT1 , . . . , xTP ]T are the concatenated filter coefficient and
reference signal vectors, respectively. We can observe from (99) and (100) that the inter-
block and the inter-channel correlation problems are due to high correlation between the
sub-blocks of the reference vector x. Thus we should expect the same systematic benefit
from applying REE and BIA to MDF as when they are applied to MCAEC.
Compared to the full-block version, FBLMS, the advantages and the disadvantages of
Advantage (computation-wise):
• Less BIA iterations necessary for blocks towards the RIR tail with less energy.
Disadvantage (performance-wise):
• Shorter xk means increased ill-conditioning of Rxk xk for a speech signal, thus better
regularization is required.
• Shorter ek is used to update all filter coefficients, thus better error enhancement is
required.
The generalized MDF (GMDF), i.e., MDF with of > 1, allows for increase in the conver-
gence speed [83]. In such a case, BIA2 in Figure 30(b) may be executed for each sub-block
of +1
(i.e., J = of and B = of D) for a gain in the tERLE.
79
4.3 Experimental Evaluation
TIMIT speech corpus recorded at 16 kHz sampling rate was used to simulate the near
and far-end talkers with at most two simultaneous talkers at both ends with an overlap
of at most 2 seconds (see Figure 31). Same number of microphones, loudspeakers, and
talkers was used at each end. Talkers were randomly selected such that the talkers from
both ends took turns to speak exactly one utterance per sequence, and such a sequence
was repeated for at least 20 seconds for each simulation. The signal energy during voice
activity after decorrelation was normalized to match that of the original signal to maintain
the same energy after all decorrelation procedures and ensure consistent echo return loss
(ERL) control.
Left Right
Loudspeaker
0.5 0.5
0 0
−0.5 −0.5
0 5 10 15 20 0 5 10 15 20
0.2 0.2
Local
0 0
−0.2 −0.2
0 5 10 15 20 0 5 10 15 20
Microphone
0.2 0.2
0 0
−0.2 −0.2
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
Figure 31: Near-end loudspeaker, local, and microphone signals for SAEC. Double-arrows indicate
individual speech activity.
The following decorrelation procedures were tested with simulated SAEC to initially eval-
80
• Resampling by upsampling and downsampling (RUD) via DBR1.
In order to eliminate the audible “pops” generated between the processed frames due to
discontinuity for the last three methods, the following simplified smoothing schemes were
used, where xj (m, n) and x̂j (m, n) are the original and the resampled values, respectively,
• OSD: The average of xj (m − 1, Nf ) and xj (m, 1) is inserted between the two samples
• RUD: R > 1 is chosen such that one extra sample is produced by resampling. After-
ward, x̂j (m, 1) is averaged with xj (m − 1, Nf ), and the extra sample x̂j (m, Nf + 1) is
as before, x̂j (m, 1) is averaged with xj (m − 1, Nf ), and x̂j (m, Nf ) is averaged with
xj (m + 1, 1).
Only one extra sample delay (look-ahead) is incurred by the above strategies for real-time
playback. More advanced framing and smoothing are possible, e.g., [125], albeit with longer
delay.
to record three sets of RIRs, two for the near and far-end talkers and one for the near-end
loudspeakers, with average reverberation time of T60 = 250 ms. The RIRs were truncated
to 128 ms (L = 2048 at fs = 16 kHz) before convolution, and the near-end RIRs were scaled
to produce the ERL of 10 dB. 40 dB SNR AWGN was applied to the far-end microphone
signals, and an air-conditioner noise and local speech signals (double talk) with echo-to-
noise ratios (ENRs) of 20 dB and 0 dB, respectively, were mixed with the acoustic echo to
81
comprise the near-end microphone signals. µ = 0.15, β = 0.99, γ = 1, and η = 5 were used
Figure 32 indicates that BIA accelerates the convergence rate significantly especially
at the beginning of adaptation (in contrast to the common approach of simply using a
Using DTD to decrease the step-size during double talk assists in increasing the tERLE
calculated without the local noise. The effectiveness of EW during both single and double
talk is also observed. Continuous, noise-robust adaptation afforded by REE is crucial for
MCAEC since the far-end room response change may occur during double talk, e.g., far-end
40
iter = 1, B = L
iter = 4, B = L
30 iter = 4, B = 4L
tERLE (dB)
20
10
0
0 2 4 6 8 10
time (sec)
40
30
tERLE (dB)
20
10 iter = 4, B = 4L
iter = 4, B = 4L, DTD
iter = 4, B = 4L, DTD, EW
0
0 2 4 6 8 10
time (sec)
Figure 32: True ERLE (averaged over left and right channels) for combinations of iter, B, DTD,
and EW without decorrelation.
However, Figure 33 reveals that only a minor improvement in the overall misalignment
is possible without a decorrelation procedure. The effect of the far-end speech activity
transition is clearly visible at t ≈ 2.5 seconds in Figure 34 without decorrelation even when
all other techniques are employed. Some improvement is displayed after decorrelation in
Figures 35, 36, and 37 for NLP (the nonlinearity parameter was set at α = 0.5 [34]), AWGN
82
(30 dB SNR), and OSD (Nf = L), respectively, but with limitations, e.g., degradation of
Misalignment (dB)
0 0
−10 −10
−20 −20
−30 −30
0 5 10 15 20 0 5 10 15 20
R Microphone, L Loudspeaker R Microphone, R Loudspeaker
Misalignment (dB)
0 0
−10 −10
iter = 1, B = L
−20 −20 iter = 4, B = 4L,
DTD, EW
−30 −30
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
Left Right
true residual echo
0.05 0.05
0 0
−0.05 −0.05
0 5 10 15 20 0 5 10 15 20
40 40
tERLE (dB)
20 20
0 0
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
On the other hand, Figures 38 and 39 confirm the effectiveness of the resampling tech-
nique. RBI provides virtually the same results as RUD (MATLAB’s resample function was
used for RUD, the reuse-block size and sinc function length of 64 was used for RBI [100],
and Nf = L and R = 1.0004 were used for both). Informal listening tests indicated no loss
83
Left Right
0 0
−0.05 −0.05
0 5 10 15 20 0 5 10 15 20
40 40
tERLE (dB)
20 20
0 0
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
Left Right
true residual echo
0.05 0.05
0 0
−0.05 −0.05
0 5 10 15 20 0 5 10 15 20
40 40
tERLE (dB)
20 20
0 0
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
Left Right
true residual echo
0.05 0.05
0 0
−0.05 −0.05
0 5 10 15 20 0 5 10 15 20
40 40
tERLE (dB)
20 20
0 0
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
84
Left Right
0 0
−0.05 −0.05
0 5 10 15 20 0 5 10 15 20
40 40
tERLE (dB)
20 20
0 0
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
Figure 38: True residual echo and tERLE with RUD or RBI.
no decorr no decorr
AWGN AWGN
−10 NLP −10 NLP
OSD OSD
−20 −20
no decorr no decorr
NLP
NLP OSD
−10 OSD −10
AWGN
−20 AWGN −20
RUD / RBI
−30 RUD / RBI −30
0 5 10 15 20 0 5 10 15 20
time (sec) time (sec)
85
4.3.2 Robust MCAEC with FDR
The following cases were tested for MCAEC to evaluate the FDR technique.
α mod (i−1,2)
x̃i (n) = xi (n) + xi (n) + (−1) |xi (n)| , (101)
2
here xi (n) is the reference signal from ith channel, i ≥ 1, and mod (·, ·) is the modulus
function.
• AWGN.
For the rest of the simulations in this chapter, a set of RIRs with the reverberation time
of T60 ≈ 200 ms was measured using an 8-microphone (omni-directional) circular array with
0.02 m spacing placed in the center of an 8-loudspeaker circular array with 1 m spacing.
The arrays were ordered counter-clockwise, and the RIRs between four microphones with
indicies i ∈ {1, 2, 3, 4} and four loudspeakers with indicies j ∈ {1, 2, 3, 4} were used for
the near-end echo paths, those between i ∈ {1, 2, 3, 4} and j ∈ {5, 6, 7, 8} for the near-end
speech mixing, and those between i ∈ {5, 6, 7, 8} and j ∈ {5, 6, 7, 8} for the far-end speech
mixing. Different sets of loudspeakers were used for the sets {1, 2, 3, 4} and {5, 6, 7, 8} to
vary the near and the far-end echo paths as much as possible. The RIRs were truncated to
128 ms before convolution to set it equal to the filter length (L = 2048 at fs = 16 kHz),
and the near-end RIRs were scaled to produce the ERL of around 10 dB.
40 dB and 100 dB SNR AWGNs were applied to the near and far-end microphones, respec-
tively, and a background noise from an air-conditioner and local speech signals with the
ENRs of 20 dB and 0 dB, respectively, were mixed with the acoustic echo to comprise the
along with DTD (to decrease the step-size by half during double talk) and EW were used
for the REE-based FBLMS algorithm [137]. Smaller step-size µ was used this time than in
86
[132] to achieve better steady-state performance at the cost of slower convergence speed.
For decorrelation, NLP was used with α = 0.5, AWGN was generated at 30 dB SNR, and
Tables 3, 4, 5 provide the tERLE, segmental signal-to-residual echo ratio (SSRR, cal-
culated in the same fashion as the segmental SNR between v and b), and LSD (measured
between v and e), respectively. The measurements were averaged over all channels for 20
independent simulations, where the last 19 out of 20 seconds of data were used to ensure
sufficient initial filter convergence. SSRR and LSD quantify the amount of time-domain
and frequency-domain distortions, respectively, remaining after the echo cancellation pro-
cess. tERLE was measured during voice activity only. The tables show that FDR is able to
maintain consistently better performance than NLP or AWGN as the number of channels is
increased, although the apparent differences are relatively small due to the averaging pro-
cess. In addition, as already referred to in [132], all decorrelation procedures tend to result
in improved tracking of the echo paths but at the cost of decreased initial and steady-state
tERLE. The trade-off is more pronounced when a larger step-size is used. Therefore, re-
taining the original reference signal’s characteristics during decorrelation can be beneficial
Figure 40 provides the misalignment plot, averaged over all echo paths, that indicates
the advantage of FDR over others for quicker tracking performance. Block-iterative adap-
frequency-domain MCAEC since it naturally allows the recovery of tERLE lost due to
the non-uniqueness problem [132]. Data reuse by the normalized LMS (NLMS) algorithm
is shown to be another form of the affine projection algorithm (APA)[8], and BIA of the
FBLMS algorithm evidently exhibits an APA-like convergence behavior [137]. For MCAEC,
a decorrelation procedure should not only reduce the coherence across channels but also
emphasize the natural variation of the correlation across time such that a projection-type
adaptive algorithm may “zig-zag” its way faster towards the optimal solution than otherwise
[110]. Stabilized batch adaptation of the FBLMS algorithm through the REE procedure
should be compatible with the time-varying decorrelation procedure via resampling since
87
Table 3: tERLE comparison (dB, higher is better).
less decorrelation is then required for the low-frequency signal components that carry most
of the speech energy and information. This in turn minimizes the interference of the actual
linear cancellation process and ultimately aids in the maximization of the cancellation per-
and not a necessary condition for high tERLE [137] especially for MCAEC with multiple
0 0 0
none
Misalignment (dB)
Misalignment (dB)
Misalignment (dB)
NLP
−5 AWGN −5 −5
FDR
Figure 40: Misalignment averaged over 20 runs and all echo paths.
88
4.3.2.2 Sub-band tERLE and Misalignment Decomposition
The tERLE and the misalignment were decomposed into three sub-band components (low,
mid, and high) through the Fourier series expansion, which gives far better reconstruction
accuracy than using the DFT filter banks formed from a prototype filter. For tERLE,
the microphone and the residual echo signals were decomposed with 50% overlap of the
analysis frames. For misalignment, the RIR and the filter coefficients were mirrored in time
This time the near-end RIRs were switched at 15 sec to enact a sudden disruption to
the RIR. 100 dB and 40 dB SNR AWGNs were added to the near and far-end microphones,
respectively, and air-conditioner noise and speech with the ENRs of 20 dB and 0 dB,
respectively, were added to the acoustic echo. The the REE-based FBLMS algorithm used
Figure 30(a)), and of = 4 (of = 1 was used in [132, 137]) along with EW and scaling of the
step-size µ by half during double talk [132]. For decorrelation, NLP was used with α = 0.5
and FDR was used with P = 6 and with DBR4 of given frame size.
Figures 41, 42, and Table 6 show the SAEC results. The averaged tERLE was obtained
over the echo duration only while the misalignment was averaged over the entire time. The
far-end talker activity change takes place four times, occurring initially at around 2.5 sec.
The main observations are as follows. First, both NLP and DBR tend to hurt the initial
and the steady-state tERLE for improved echo-path tracking. Second, NLP does not fare
as well as DBR in the mid and high bands after the far and the near-end RIR change.
For the low band, NLP leads to lower steady-state misalignment but not higher tERLE
than DBR1. Over all bands, DBR is able to give higher tERLE and lower misalignment on
average than NLP. Third, a substantial gain in the tracking capability appears in the mid
to high bands for DBR. Such an improvement is attributed largely to the DBR’s ability to
continuously instill both short and long-time decorrelation for the direct benefit of the LMS
algorithm and not simply due to the coherence reduction. Finally, smaller N leads to much
better tracking for DBR, where the two features are both crucial for real-time AEC.
89
no decorrelation procedure
40
(dB)
20
NLP (α = 0.5)
40
(dB)
20
0
low
mid
DBR4 (N = 2048)
high
40
(dB)
20
DBR4 (N = 256)
40
(dB)
20
0
0 5 10 15 20 25 30
time (sec)
Figure 41: Sub-band tERLE decomposition (stereophonic, averaged over all channels).
Table 6: Average tERLE (dB, left column) and misalignment(dB, right column).
90
no decorrelation procedure
0
−10
(dB)
−20
−30
NLP (α = 0.5)
0
−10
(dB)
−20
−30
low
DBR4 (N = 2048) mid
0 high
−10
(dB)
−20
−30
DBR4 (N = 256)
0
−10
(dB)
−20
−30
0 5 10 15 20 25 30
time (sec)
Figure 42: Sub-band misalignment decomposition (stereophonic, averaged over all echo paths).
91
4.3.3 Robust MCAEC with SBR
The following methods were tested to compare the R-AEC performance with the proposed
• SBR via DBR3 with N = 512 and variable resampling ratios R1 through R5 as shown
For the evaluation of speech quality after decorrelation, a stereo reference signal of 30
seconds was used. Silences were removed prior to calculating the coherence. As SBR
allows us to fine-tune the coherence at each frequency bin, R1 is used to achieve the same
coherence given by AWGN, R2 to achieve that by NLP, and R3 to achieve that by PMod to
form the same basis for measuring the processed speech quality and comparing against other
decorrelation procedures. Figure 43 also shows how well the coherence can be controlled by
SBR. Thus by properly choosing ∆R = R−1, the average degree of decorrelation, measured
in terms of the coherence, by SBR can be matched to that of AWGN, NLP, and PMod.
Also to demonstrate the ability of SBR to control the AEC performance per sub-band, the
coherence is matched with regular FDR only in the mid to high bands while leaving the
low band unmodified. The two other SBR coherence-matching schemes with the variable
resampling ratios R4 and R5 used for this purpose are shown in Figure 44.
For objective quality evaluation, SSNR, LSD, and perceptual evaluation of speech qual-
ity (PESQ) score were used. The SSNR measures the deviation of the processed signal
from the original signal in the time domain while the LSD measures the distortion in the
frequency domain. Both narrowband and wideband modes were used for the PESQ score
92
−3
x 10
1
6 Var. ∆ R1
5 Var. ∆ R2
0.8
Var. ∆ R3
coherence
4
∆R
3 0.6
2 ∆R=0
0.4
AWGN
1 Var. ∆ R1
0 0.2
0 2000 4000 6000 8000 0 2000 4000 6000 8000
frequency (Hz) frequency (Hz)
1 1
0.8 0.8
coherence
coherence
0.6 0.6
∆R=0 ∆R=0
0.4 0.4
NLP PMod
Var. ∆ R2 Var. ∆ R3
0.2 0.2
0 2000 4000 6000 8000 0 2000 4000 6000 8000
frequency (Hz) frequency (Hz)
Figure 43: Variable resampling ratios R1 , R2 , and R3 and their corresponding coherence plots,
which match the coherence from SBR to that of other decorrelation methods.
−3
x 10
3 1
2.5
0.8
2
coherence
0.6
∆R
1.5
0.4 ∆R=0
1 ∆ R = 0.0028
∆ R = 0.0028
Var. ∆ R4
Var. ∆ R4
0.5 0.2
Var. ∆ R5 Var. ∆ R5
0 0
0 2000 4000 6000 8000 0 2000 4000 6000 8000
frequency (Hz) frequency (Hz)
Figure 44: FDR with fixed ∆R = 0.0028 and the corresponding variable resampling ratios R4 and
R5 that match the coherence of FDR in the high frequency band.
93
(PESQNB and PESQWB ), which is an objective measurement that predicts the results of
mean opinion score (MOS) in subjective listening tests. PESQNB-LR and PESQWB-LR cor-
respond to the evaluations obtained after averaging the measures taken individually from
Table 7 summarizes the quality measures. SBR generally outperforms the other con-
ventional methods in terms of the sound quality as reflected by the PESQ score [141]. We
note that even though the SSNR and the LSD of AWGN are better than R1 , the distortion
introduced by AWGN is quite audible as indicated by the PESQ score. The distortion
introduced by SBR, on the other hand, is almost negligible when ∆R is very small. We
also note that the resampling ratio for FDR and R4 and R5 for SBR in the igh band are
set to large values to demonstrate the effect of highly decorrelated signal after DBR on the
tERLE and the misalignment performances. As a result, the PESQWB score suffers due to
large distortion in the high frequency region. Still, even though the PESQ scores are quite
similar in those cases, better SSNR and LSD can be achieved by avoiding the resampling
Just as in the case of FDR, the REE-based FBLMS algorithm was used with µ = 0.12,
β = 0.998, γ = 10, η = 5, iter = 4, B = L, and the overlap factor of = 4 (of = 1 was used
in [132, 132]) along with EW and scaling of the step-size µ by half during double talk [132].
Tables 8 and 9 show the SAEC results. The main observations are as follows. First,
although AWGN is able to provide the tERLE closer to when no decorrelation is used than
SBR, it leads to much worse misalignment. Second, NLP tends to hurt the low-band tERLE
94
more than SBR when compared to no decorrelation. The performance gain by SBR against
NLP is even larger in the mid and the high bands for the tERLE and especially for the
misalignment. Finally, PMod is capable of providing lower misalignment over all bands than
SBR when its coherence is matched by SBR, but it does not necessarily translate to higher
tERLE, which is less in the low to mid bands for PMod than SBR. Poor misalignment by
SBR in this case is expected since the coherence is not reduced much after the matching.
The results in Tables 10 and 11 (averaged only over echo duration for tERLE and over
entire time for misalignment) indicate that a substantial gain in the tracking capability
appears in the mid to high bands for FDR and SBR when compared to no decorrelation
and other decorrelation procedures. The tERLE is also increased especially in the high band.
Such an improvement is again attributed to the DBR’s ability to continuously instill both
short and long-time decorrelation for the benefit of the LMS algorithm. Furthermore, SBR
is able to provide higher tERLE than FDR in the low band by selectively not modifying the
signal components in that region, in which case the tERLE is recovered naturally through
BIA of the REE technique. This results in higher overall tERLE for SBR than without
decorrelation.
95
Table 10: Average tERLE (dB, higher is better).
The system parameters used for FBLMS were µ = 0.12, β = 0.998, γ = 10, η = 5, iter = 4,
B = L (i.e., BIA1 in Figure 30(a)), and of = 4 (of = 1 was used in [132, 137]) along with
EW and scaling of the step-size µ by half during double talk [132]. For MDF with K = 4,
the parameters adjusted from FBLMS were µ = 0.35, γ = 10K 2 , η = 10, and of = 1. For
GMDF, BIA2 in Figure 30(b) was applied with µ = 0.2, γ = 20K 2 , η = 10, and J = of = 4.
Figures 45, 46, and Table 12 provide the results from single-channel AEC. The figures
show that, as BIA with iter = 4 is used instead of the usual iter = 1, MDF is capable of
closely matching the performance of FBLMS even in a noisy condition. Due to the increased
overlap, GMDF is able to provide better result than MDF. Still, the MDF is limited by
shorter DFT blocks than with FBLMS, which affects the high frequency components more
than the lower ones. Further improvement is expected through refined regularization and
error enhancement procedures, e.g., [143]. Although not plotted here, the reduction of
adaptive iterations over the RIR tail portions), gives practically the same results for MDF
and GMDF. Moreover, it turns out that the EW technique is essential for MDF with BIA,
without which the overall performance decreases substantially, as it in effect applies the
96
FBLMS
40
(dB)
20
0
MDF
40
(dB) low
20 mid
high
0
GMDF
40
(dB)
20
0
0 5 10 15 20 25 30
time (sec)
FBLMS
0
−10
(dB)
−20
−30
MDF
0
−10 low
(dB)
−20 mid
high
−30
GMDF
0
−10
(dB)
−20
−30
0 5 10 15 20 25 30
time (sec)
Table 12: Average tERLE (dB, left column) and misalignment(dB, right column).
97
CHAPTER V
The least mean square (LMS) algorithm [138] is summarized by the filter coefficient update
equation
where r(n) is the reference signal vector, w(n) is the filter coefficient vector, µ is the
adaptation step-size, and e(n) is the estimation error. It relies on information from the
second-order statistics (SOS) for adaptation as the mean-square error (MSE) E{e2 (n)} is
minimized to obtain the optimal filter coefficients that best represent the true echo path.
There are two main issues that prevent the LMS algorithm from achieving a desired echo
cancellation performance: the presence of local acoustic noise (or near-end speech) and the
non-uniqueness problem, the latter of which arises during multi-channel AEC (MCAEC). To
better handle the two problems, we propose a shift in the conventional MCAEC paradigm
to a new framework of semi-blind source separation (SBSS). Blind source separation (BSS)
is a powerful signal enhancement method for recovering a target signal from a mixture of
signals when no prior information on the original source signals are available. SBSS is
a direct extension of BSS when some partial knowledge of the source signals are already
available (e.g., components of the reference signals), and is naturally suited for the MCAEC
BSS can be implemented in the frequency domain through a batch, i.e., offline, adapta-
tion based on independent component analysis (ICA), which aims to maximize the statistical
independence of separated signal components given that the original sources themselves are
independent. In particular, the natural gradient algorithm [3] has become a standard in
realizing the ICA optimization and can also be interpreted as performing a nonlinear decor-
relation by using the higher-order statistics (HOS) [56]. Besides computational efficiency
and batch-wise adaptability, BSS is commonly carried out in the frequency domain since
98
a convolutive mixture becomes a linear and instantaneous mixture after the short-time
Fourier transform (STFT), thus making the separation problem much more tractable than
in the time domain. Although there are many problems associated with a practical imple-
mentation of the ICA-based BSS in the frequency domain, e.g., the permutation and the
scaling ambiguities, such a truly multi-channel approach to signal enhancement has been
proven to be very effective in handling a convolutive mixture of speech signals. The SBSS
framework was first proposed in [60] and was successfully implemented in [79] as a combi-
nation of multi-channel BSS and a single-channel AEC in the frequency domain. We have
shown subsequently in [135] that BSS and stereophonic AEC (SAEC) can be effectively
implemented together in such a framework and that a decorrelation procedure also helps
SBSS achieve better echo cancellation performance just as in the MCAEC case.
In this chapter, as we have done in [89], we analyze the structure of the SBSS de-mixing
matrix to see how the multi-channel echo cancellation performance can be improved. After
a deep theoretical analysis, algorithmic issues are discussed to provide suggestions for the
design of robust and practical SBSS systems. In particular, we examine the behavior of a
to possibly take advantage of both types of learning. We ultimately show through different
far-end mixing conditions that with a proper constraint on the de-mixing matrix and a
regularization procedure, both high echo cancellation and relatively low misalignment can
be achieved without any pre-decorrelation procedure even for the worst-case scenario of a
single far-end talker along with the non-uniqueness condition on the far-end mixing system.
The rest of this chapter is organized as follows. First in Section 5.1, we present a deep
SBSS system, discussion of the origin of the non-uniqueness problem in SBSS, steady-state
analysis of the SBSS system, illustration of a connection between the MSE-based and the
ICA-based approaches, and exploration of constraints on the SBSS de-mixing filter and
their effect on the ICA optimization. Next in Section 5.2, we discuss the main issues of
the algorithmic SBSS design, including: detailed outline of an online implementation of the
SBSS system and description of the implemented SBSS algorithm. Finally in Section 5.3,
99
we describe methods for evaluating the SBSS performance and provide simulated and real-
world results.
possible, and we assume zero-mean random processes that generate the involved signals.
We will make some more simplifications later to make the analysis more tractable.
To begin with, we define the notations that are used in the following sections. A model
for the near-end and the far-end mixing systems and the SBSS system is illustrated in
Figure 47. At the far end, Q sources are recorded by an array of R microphones. At the
near end, an array of S microphones records P near-end sources and R loudspeaker signals.
matrix G, which represents the far-end mixing system, to give the reference signal vector
r:
where r(ω, t) = [r1 (ω, t), · · · , rR (ω, t)]T and q(ω, t) = [q1 (ω, t), · · · , qQ (ω, t)]T . A near-
response matrix H12 , i.e., the echo paths. The two matrices can be combined into a single
where s(ω, t) = [s1 (ω, t), · · · , sP (ω, t)]T , xs (ω, t) = [x1 (ω, t), · · · , xS (ω, t)]T , xr (ω, t) =
where OR×P is an R × P matrix with all elements equal to zero and H22 is automatically
100
the far-end and the near-end mixing systems can be combined into a unified mixing system
Hereafter, we assume S = P and R = Q, which implies that the matrix H̃ may be invertible.
We will show later that with proper constraints on the adaptation, different conditions for
the source numbers Q and P may occur for SBSS to be still effective for the MCAEC
purpose.
Figure 47: Model of the near-end and the far-end mixing systems and the semi-blind source
separation (SBSS) system.
The goal of an SBSS system is to perform the estimation of the near-end source signals
where ys (ω, t) = [y1 (ω, t), · · · , yS (ω, t)]T and yr (ω, t) = [yS+1 (ω, t), · · · , yS+R (ω, t)]T . We
101
where W11 = [wij ]1≤i,j≤S , W12 = [wij ]1≤i≤S,S+1≤j≤R , and W22 = [wij ]S+1≤i,j≤S+R . We
can see from (106) and (108) that the optimal solution for W is obtained when WH̃ = I
such that
That is, the SBSS system is able to jointly perform the separation of near-end source signals
and the cancellation of acoustic echoes such that the diagonal terms become an identity
while the off-diagonal terms become null. More specifically, W11 and W22 provide the
separation of the near-end and the far-end sources, respectively, whereas W11 and W12 are
together responsible for the acoustic echo cancellation. There are two other key observations
First, we must point out that we are not interested in recovering the signals played
through the loudspeakers since we already have them as the reference signals. Moreover,
we assume that the responsibility of separating the far-end microphone signals, i.e., the
reference signals, is with the far-end SBSS system, thus we do not have to adapt W22
for the purpose of obtaining the original source vector q. Therefore, yr can be any linear
combination of r, and the exact form of W22 may be controlled to optimize the near-end
SBSS performance appropriately. The constraining of W22 and its effect on the SBSS
The other observation is that W is a block upper-triangular matrix, i.e., W21 = OR×S ,
since the reference signal channels are completely blind to the near-end source signals, i.e.,
H21 = OR×S . However, the near-end signals themselves may take a round-trip in a full-
duplex teleconferencing system, and the resulting echo may not be canceled entirely by the
far-end SBSS system such that the reference signals would contain some remnants of the
near-end signals. We neglect such a circumstance here by assuming that the far-end system
has performed a reasonable job in suppressing the echo and that the latest instance of the
near-end signals is uncorrelated with its own echo due to a sufficiently large round-trip delay.
102
Hence by requiring W21 = OR×S , we avoid the possibility that yr contains the near-end
source signals. In other words, the echo-canceled output components of the SBSS system
in ys are subject to the permutation ambiguity introduced only through the separation
of the near-end sources, and the ambiguity problem then needs to be solved only for the
sub-matrix W11 .
The non-uniqueness problem also occurs in SBSS and can be briefly analyzed as follows.
If W11 and G are not singular, H12 that corresponds to the echo paths can be uniquely
When there is an equal number of sources and microphones at the near end (i.e., P = S), the
physical interpretation of the frequency-domain BSS [102, 88] ensures that W11 is almost
always non-singular by assuming spatial diversity and mutual independence between the
sources. In the case of more sources than microphones (i.e., the under-determined case
P > S), H11 is not invertible, and there is no unique solution for W11 . However, the
estimate of W11 is not necessarily singular in such a case, and its inversion is still attainable
for the estimation of H12 . On the other hand, the severe ill-conditioning, or near-singularity,
have already mentioned that a unique identification of the echo paths by the MSE-based
MCAEC is possible only if the far-end mixing matrix is not singular. Also, (112) indicates
the dependence of the solution for H12 on G just as in the MCAEC case [118]. Therefore, by
neglecting the rare occurrence of singularity of W11 , the non-uniqueness problem in SBSS
103
5.1.3 Derivation of Steady-State Solution
The global de-mixing matrix W can be estimated through the gradient-descent estimation
procedure:
and n is the iteration index. Γ takes different forms according to the cost function that is
non-stationary source signals can be achieved by minimizing either the second-order or the
higher-order correlation among the output signals in yn . In a more general case of non-
Gaussian source signals, the separation is possible through ICA by maximizing the mutual
Any gradient-descent algorithm may be used for the estimation of W. For the analysis of
the steady-state solution, we consider an ICA optimization procedure based on the natural
gradient algorithm and the cost function determined by the Kullback-Leibler divergence [6]:
Z
p(y)
DKL [p(y)kp(y1 ), . . . , p(yS+R )] = p(y) log QS+R dy (116)
i=1 p(yi )
where p(y) is the joint probability density function (PDF) of the de-mixed output signals
and p(yi ) is the marginal PDF of each output signal. We also perform batch adaptation
Γ[Wn (ω), yn (ω, t)] = IS+R − E{Φ(yn (ω, t))yn (ω, t)H } Wn (ω), (117)
where Φ(·) is a nonlinear function, {·}H is the Hermitian (conjugate) transpose operator,
and E{·} is the expectation operator that, assuming stationarity, can be approximated by
averaging over time. Then assuming that all of the near-end sources and the echo paths are
104
The convergence analysis of the natural gradient algorithm is a very difficult task. Still,
we can approximate the analysis through a direct inspection of what we refer to as the
generalized covariance matrix E{Φ(y)yH } by assuming that the components in the output
vector y are zero-mean random variables such that they can be considered statistically
minimizing each cross-moment of order u after a Taylor expansion of the nonlinear function
Φ(·):
where ∗ denotes the complex conjugation. That is, the statistical independence for two out-
put components ya and yb is achieved when the generalized covariance E{Φ(ya )yb∗ } becomes
zero, as two zero-mean random variables can be considered statistically independent if all
We can analyze the structure of the steady-state solution for H12 as follows. First, the
Next, the output signals used for updating W in (115) are obtained from (108) and (109):
ys (ω, t) W11 (ω)H11 (ω)s(ω, t) + [W11 (ω)H12 (ω) + W12 (ω)] r(ω, t)
= . (120)
yr (ω, t) W22 (ω)r(ω, t)
Now, let’s for the moment consider statistical independence between the separated sources
vector ys associated with the near-end system and the separated sources vector yr associated
with the reference signals. The optimal solution for H12 is obtained by setting E{ysu yrH } =
OS×R ∀u ∈ N. Then, after omitting the frequency and time dependencies for notation
where ysu indicates the raising of each component of ys to the integer power u (i.e., the
scalar sources ya and yb from (118) are simply substituted by the vectors ys and yr ). By
105
applying the binomial expansion, we can rewrite (121) as:
expansion to further expand the additive terms with powers u, u − 1− k, and k, it is possible
to demonstrate that if r and s are statistically independent from each other, the first and
the third terms in (122) are zero. In fact, all the matrix elements would be factorized as a
sum of moments E{sui rj } that are zero for each u if si and rj are zero-mean and mutually
independent. It means the solution for H12 that satisfies (121) does not depend on the
near-end sources, and the optimization is possible even though both the near-end and the
far-end sources are active at the same time (i.e., the double-talk situation). It follows then
Since the far-end sources are assumed to be statistically independent, we can rewrite (123)
as (see Appendix C)
where E{qu qH } is the full-ranked generalized covariance matrix of the far-end source. If
Finally, assuming that W11 is known and invertible, H12 can be estimated as
b 12 = −W−1 W12 ,
H (126)
11
Several key observations can be made from the above derivation. First of all, going from
(122) to (123) was made possible by assuming independence between the reference signals
in r and the near-end source signals in s. That is, the optimal solution for H12 that satisfies
106
(126) is ideally possible through the ICA optimization even though both the near-end and
the far-ends sources are active at the same time. This is analogous to the fact that the
MSE optimization during AEC is still possible even if there is a local noise as long as the
reference signal and the noise are uncorrelated with each other [53, Chapter 6]. However,
due to the noisy sample-wise estimate of the gradient of the MSE, the LMS algorithm may
fail in identifying the true echo paths since its asymptotic performance depends on the
adaptation step-size and echo-to-background power ratio (EBR) [53, Chapter 6]. Similarly,
we must be careful how the gradient term in (117) is estimated in order to maintain the
inherent ability of the natural gradient algorithm to converge to the optimal solution when
Second, we assume by (123) that there is always a solution W12 = −W11 H12 that
maximizes the statistical independence of the output signals in ys , but the exact echo path
identification is possible only if W11 , W22 , and G are fully ranked. During the derivation,
we only considered the optimization by maximizing the mutual independence between the
vectors ys and yr . In a full optimization procedure, W11 and W22 should also be adapted
to make the output components in ys and yr mutually independent from one another.
According to (110) and (111), W11 and W22 are expected to be the inverses of H11 and
pointed out earlier, a physical configuration of the frequency-domain BSS system makes
the singularity of both W11 and W22 a rare occurrence. Therefore, as we are not interested
in separating the reference signals at the near end, we can focus on the structure of W22
such that the entire SBSS system can be made robust to ill-conditioned situations.
Third, due to the possibility of ill-conditioning of the far-end response matrix G, the
b 12 may
derivations from (123) to (126) may not be exact, and thus the optimal solution H
not be unique. Although the natural gradient algorithm through a batch adaptation would
still converge to a solution for W12 that maximizes the output independence within the
current observed data even if the non-uniqueness problem exists, the solution would also be
affected by the changes in the far-end mixing system, and a continuous and stable sample-
wise or batch-wise adaptation would not be possible. To avoid such a problem, traditional
107
MCAEC methods employ a decorrelation procedure to not only improve the convergence
rate but also to keep the current estimate of the echo paths as close to the optimal solution
such that the SBSS system can continue converging towards a unique solution. With re-
gard to this observation, we should remember that the ICA optimization jointly uses the
HOS from the observed signals, thus it should be less sensitive to the effect of the non-
uniqueness problem so that a decorrelation procedure with less degree of distortion can be
implemented when compared to the MSE-based MCAEC that uses only the SOS. Never-
theless, we show in Section 5.1.5 that a proper matrix constraint, which exploits the low
spatial correlation between the far-end room impulse responses (RIRs) (observable under
certain conditions [44]), is sufficient for reducing the fluctuations in the estimate of H12
during the ICA optimization so that the adaptation process becomes stable even without
Finally, the joint adaptation of W11 and W12 requires that the near-end source separa-
tion must also be applied. However, if simultaneous source separation and echo cancellation
−1
are not necessary, W12 can be re-scaled by multiplying by the inverse W11 after adapting
W11 and W12 separately. The re-scaling, or normalization, process and its effect on the
Assuming that there are no active near-end sources, i.e., ys = 0S , where 0S is a zero vector
of length S, the conventional MCAEC minimizes the MSE in each near-end microphone
where ei (ω, t) is the estimation error for the ith microphone channel, 1 ≤ i ≤ S, r =
108
hiR ]T is a vector of R frequency responses corresponding to the echo paths from the j th
e j and
loudspeaker to the ith microphone. Taking the gradient of (127) with respect to h
E{e∗i (ω, t)r(ω, t)} = E{ei (ω, t)r∗ (ω, t)} = 0R , (128)
which is the well-known orthogonality principle, i.e., the estimation error is decorrelated
On the other hand, the echo paths determined by the ICA-based SBSS satisfies (123)
for all integer powers u. Since the near-end sources are assumed to be inactive, (123) can be
simplified by setting W11 = IS . Moreover, we can impose the constraint W22 = IR since
we do not need to separate the reference signals at the near end. Substituting r = Gq into
−1
Once an ICA optimization procedure converges to the steady-state solution H12 = −W11 W12
= −W12 , (129) can be rewritten for u = 1 (i.e., only the SOS are considered) as
signals. That is, the ICA-based SBSS minimizes the MSEs for all microphone channels
such that the orthogonality principle is satisfied just as in the MSE-based MCAEC case.
The above result illustrates that the ICA-based SBSS is capable of jointly minimizing
not only the MSE for every near-end microphone channel but also all the higher-order cross-
correlations between the reference and the microphone channels through the diagonalization
of the generalized covariance matrix E{Φ(y)yH } such that (129) is satisfied for all u. The
traditional MCAEC approach is inherently a single-channel AEC approach that simply min-
imizes the MSE at the output of each microphone independently from other microphones.
109
Thus the ICA-based SBSS should theoretically be able to handle multiple acoustic echoes
better than the traditional MSE-based MCAEC since it has more information about the
involved signals. Such a multi-channel adaptability based on the HOS is also the key to the
ICA-based approach for being much more robust to the effects of the noise signals than the
MSE-based approach that uses only the SOS, and it allows the ICA-based SBSS to perform
We assume here that the uniqueness condition on the far-end mixing system is met such
that the reference signals in r are linearly independent and that the generalized autocovari-
ance matrix E{Φ(r)rH } is fully ranked. However, the actual conditioning of E{Φ(r)rH }
varies largely according to the number of far-end sources and microphones due to sparse
representations of the far-end signals and mixing system in the frequency domain.
To discuss the effects of constraining the global de-mixing matrix W on the SBSS
adaptation process, we first assume that the number of the sources is equal to the number
of microphones at the near end and consider three different mixing conditions at the far
end:
2.(Case B: Q > R) The number of active sources is greater than the number of micro-
phones.
3.(Case C: Q < R) The number of active sources is less than the number of microphones.
Afterward, we analyze the effect on the global adaptation for two different mixing conditions
5.(Case E: P 6= S) The number of active sources is different from the number of micro-
phones.
110
5.1.5.1 Case A: Q = R
at the far end, the reference signals in r are guaranteed to be linearly independent and
Although the near-end SBSS system is not responsible for the separation of the far-end
source signals such that the estimation of W22 is not necessary, a progressive decorrelation
of the reference signals during the adaptation would stabilize its convergence behavior since
a stable minimum is approached when the entire generalized covariance matrix E{Φ(y)yH }
is diagonalized.
Therefore, the first case to be examined involves no constraint on W22 to get a full
benefit of the natural gradient algorithm based on ICA. However, the estimate of the matrix
consider a normalization procedure for reducing the intrinsic scaling ambiguity of the ICA
optimization. In particular, the scaling ambiguity for W can be reduced by applying the
c
W(ω) = diag[W−1 (ω)]W(ω). (131)
The matrix W is not invertible if W22 is singular. However, when W has a block upper-
Moreover, since we are not interested in the final output components corresponding to the
decorrelated reference signals, we can avoid the inversion of the entire de-mixing matrix W
In other words, the scaling ambiguity only depends on the separation of the near-end sources
and not on the echo paths estimation. The idea is similar to limiting the permutation
111
5.1.5.2 Case B: Q > R
If the number of active sources is greater than the number of microphones at the far end, the
reference signals in r(ω) are linearly independent and are decorrelated naturally by the pres-
ence of extra source signals beyond the number of microphones. Thus E{Φ(yr )yrH } will be
conditioned well, and the optimal solution for H12 can be found. However, E{Φ(yr )yrH } can
never be diagonalized since the number of observations is less than the number of sources.
Consequently, an ICA optimization procedure may become unstable, and the iterative so-
lution for W22 may never converge to a stable minimum. Hence, as the decorrelation of
the reference signals through the adaptation of W22 is ill-advised in such a case, the overall
stability of the system can be improved by constraining W22 to be a fixed matrix, e.g.,
W22 = IR .
b 12 may not be unique when there are fewer active sources than the
The optimal solution H
microphones at the far end. Such a case corresponds to the near-singularity of the far-end
matrix E{Φ(r)rH }. In a real-life scenario, E{Φ(r)rH } should always be fully ranked since
the modeling filters are generally shorter in length than the far-end RIRs. Also, nonlinear
loudspeaker distortion and additive background noise naturally decorrelate the reference
in the frequency domain ultimately hampers an ICA optimization procedure from converg-
ing to the true echo paths, thus the overall echo cancellation performance becomes very
sensitive to the variations in both the far-end and the near-end mixing systems. Neverthe-
less, although exploiting the HOS cannot solve the non-uniqueness problem, the likelihood
that the gradient of the ICA optimization cost would point towards a specific region in the
the de-mixing matrix and to the characteristics of the far-end RIRs. For example, if the
112
cm), the far-end RIRs are already sparse in the time domain. The sparseness is not neces-
sarily inherited from the time domain at each frequency in the frequency domain, but the
correlation between the frequency responses decreases with the microphones spacing if we
assume the reverberation to be a diffuse noise field [44]. Then we can approximate the first
which can be derived as in Appendix C by considering W11 , H12 , and W12 to be constant
matrices and G a matrix of zero-mean independent random variables, taking the expectation
of [(W11 H12 + W12 )G]u , and estimating E{Gu } by Gu . Also, since the far-end sources are
Assuming for simplicity W11 = IS (i.e., no near-end source separation is performed) and
It becomes clear from (137) that with the de-mixing matrix constraint of (136) and
an “ideal” assumption of zero correlation between the far-end frequency responses (i.e.,
infinite distance between the far-end microphones), the elements of the matrix W12 are
independently optimized. In other words, the update direction during the gradient-descent
optimization procedure for the (i, j)th element of W12 does not depend on the other elements
that are related to different echo paths, and hence the effect of the non-uniqueness problem
is alleviated through the reduction in the ambiguity of physically allowed solution of the
echo paths.
113
We should point out that the diagonal constraint of W22 cannot completely solve the
that the iterative solution for W12 is attracted more likely to a region very close to the point
that corresponds to the true echo paths since the contribution of the off-diagonal terms in
(135) can be neglected for many frequencies. Thus, the constraint tends to globally bind
the solution space of the time-domain filters related to W12 , which consequently reduces
5.1.5.4 Case D: P = S
If the number of active sources is equal to the number of microphones at the near end, there
exists an invertible de-mixing matrix W11 that maximize the output independence of the
estimated near-end sources ys . Therefore, the estimation of the echo paths is possible by
applying (126).
5.1.5.5 Case E: P 6= S
If the number of active sources is different from the number of microphones at the near end,
we can examine two sub-cases. For the case of P < S, the mixing system is not fully ranked,
and it may not be possible to estimate an invertible de-mixing matrix that corresponds to
the inverse of H11 . However, since the natural gradient algorithm does not apply any matrix
inversion, the adaptation may still converge to a singular matrix that maximizes the output
independence of the estimated near-end sources ys (in such a case, some rows of W11 would
be zero). For the case of P > S, since the matrix H11 is not square, the natural gradient
source mixture and may approach singularity during the adaptation of W11 .
From the MCAEC perspective, the singularity of W11 should be avoided for both of the
sub-cases P < S and P > S since it would hamper the estimation of the true echo paths by
(126) and of the output sources by the MDP of (133). Therefore, constraints for preventing
the singularity of W11 are required. For example, if the near-end source separation is not the
main goal of the SBSS system, the constraint W11 = IS may be adopted. Alternatively, the
114
such as the Flexible ICA proposed in [22]. However, it is worth underlining that in real-
world scenarios, the singularity is a very rare occurrence since the number of independent
acoustic sources is in general greater than the number of the microphones. Furthermore,
we note that the loudspeakers themselves represent independent sources at the near end.
Hence the singularity of W may theoretically occur only when both the near-end sources
and the acoustic echoes are absent, which is a degenerate case controlled easily by inhibiting
Thus similarly to the permutation and scaling ambiguities, it may be concluded that the
ambiguity of the source number needs to be solved only if the separation of the near-end
sources is desired. On the other hand, the MCAEC is not affected by such an ambiguity
since the number of independent sources that generate the echoes (i.e., the loudspeakers)
is constrained most often to be equal to the number of reference signals, which is fixed in
A batch implementation of the SBSS system was considered in the previous discussions.
However, online structures need to be analyzed for a practical scenario. We note that the
adaptation is still performed in the frequency domain and that the term “online” does not
AEC procedure.
1. Online adaptation.
2. Batch-online adaptation.
of the frequency-domain values of y. By substituting the iteration index n with the current
time index t, and the expectation in (117) with the instantaneous generalized covariance
115
matrix Φ(y)yH , the ICA solution is updated with the incoming data by iterating over the
following formulas:
Wt+1 (ω) = Wt (ω) + η{IS+R − Φ[y(ω, t)]y(ω, t)H }Wt (ω), (139)
where η is the adaptation step-size of the online adaptation. The choice of η plays an
important role in the overall stability of the adaptation process. For example, a large value
is preferred for quick adaptation to the changes in the mixing conditions, but alternatively
a small value increases the accuracy of the steady-state solution. By using a fixed step-
size, the stability of an online adaptation can be compromised by abrupt variations in the
Among several solutions, a promising method for stabilizing the convergence behavior
of the natural gradient algorithm consists of using the a posteriori unit-norm constraint on
Φ(y)yH . Such a normalized version of the natural gradient is referred to as the scaled nat-
ural gradient [27]. The need for scaling through normalization becomes even more relevant
when the constraint W22 = IR is enforced. In fact, W is re-scaled by the intrinsic normal-
ization effect of the natural gradient algorithm that regularizes the convergence behavior
which is tightly linked to the estimation of the HOS, the accuracy of which is often limited
by the lack in the amount of available data. However, if the W22 = IR constraint is enforced,
the normalization effect does not apply to W22 , and hence the norm of the de-mixing matrix
may increase and possibly lead to divergence. Furthermore, even without any constraint on
W22 , the norm may increase also when the power of one of the reference signals becomes
very small.
A common approach for stabilizing the adaptation process is to remove the normalization
effect by modifying the structure of the natural gradient with a non-holonomic constraint:
Wt+1 (ω) = Wt (ω) + η{Λt (ω) − Φ[y(ω, t)]y(ω, t)H }Wt (ω), (141)
116
where Λt is a diagonal matrix. The adaptation becomes stable by using Λt = diag[Φ(y)yH ]
even when the power of the signals changes rapidly over time [4]. However, the non-
holonomic constraint does not guarantee the diagonality of W22 , which in effect would
increase the misalignment with a subsequent degradation in the overall echo cancellation
performance according to the analysis in Section 5.1.5. Therefore, the use of the scaled
natural gradient is preferable since the stabilization can be obtained while still preserving
We note that other diagonal constraints on W22 are possible, but we focus in this work
on the applicability of the efficient scaling normalization. Specifically, we impose the scaled
where ∆W21 and ∆W22 are the sub-matrices relative to the gradient
1 H
∆Wt (ω) = I − Φ[y(ω, t)]y(ω, t) Wt (ω)
d(ω, t)
∆W11 (ω) ∆W12 (ω)
= , (143)
∆W21 (ω) ∆W22 (ω)
where d is an inverse scaling factor. After the calculation of (143) with the constraints in
where c is the scaling normalization. The scaling factors c and d are computed as in [27].
An online optimization is preferred for its ability in tracking the changes in the mixing
conditions and for small input-output algorithm processing delay. However, the statistical
bias in the instantaneous generalized covariance matrix cannot be neglected, especially when
the adaptation is controlled through the HOS. Furthermore, the scaling and the permutation
117
5.2.1.2 Batch-Online Adaptation
The statistical bias in the instantaneous covariance matrix can be reduced by adopting a
batch-online adaptation, where the higher-order correlations between the de-mixed output
Wn+1 (ω) = Wn (ω) + η{IS+R − E{Φ(yn(b) (ω, t))yn(b) (ω, t)H }}Wn (ω) (146)
(b)
where x(b) is a vector of input signals in the bth batch and yn is a vector of output signals
obtained at the nth iteration in the bth batch. Then in a batch-online implementation, the
solution for W is recursively refined for a certain number of iterations by using the expected
(b) (b)
generalized covariance matrix E{Φ(yn )(yn )H }.
The expectation operator E{·} may be approximated by averaging over time in a batch
adaptation procedure. On the other hand, it is estimated through a moving average proce-
µE{Φ[y(b−1) (ω, t)]y(b−1) (ω, t)H } + (1 − µ)A{Φ[yn(b) (ω, t)]yn(b) (ω, t)H }, (147)
where µ is a smoothing parameter that controls the averaging across batches, E{Φ[y(b−1) ]
(y(b−1) )H } is the generalized covariance matrix estimated from the previous batch, and A{·}
where Tb and Tb+1 defines the time interval of the data used in the bth batch. We note
that the approximation of the expectation by a moving average model is valid only if we
assume ergodicity and stationarity of the random processes. However, the full ergodicity and
stationarity cannot always be guaranteed, especially when the ICA cost function involving
the HOS is used since the source statistics is almost always expected to evolve over time. In
other words, the higher-order correlations cannot be estimated accurately by averaging over
118
(b) (b)
time, which would result in an instability of the recursively estimated E{Φ[yn ](yn )H },
E{Φ[yn(b) (ω, t)]yn(b) (ω, t)H } ≃ A{Φ[yn(b) (ω, t)]yn(b) (ω, t)H }, (149)
where the output signal components in yn (ω, t) are obtained by initializing the iteration in
(145) and (146) with the matrix obtained at the (b − 1)th batch. It means that we avoid the
estimation of the generalized covariance matrix that depends on the HOS, and we enforce
Section 5.2.1 to verify the effectiveness of the constraints discussed in Section 5.1.5.
The time-domain microphone and reference signals in x(t) = [xTs (t), xTr (t)]T are trans-
where k is the frequency bin index and l is the block index in time. Signals are divided into
non-overlapping batches of STFT blocks. The estimate of the de-mixing matrix W(k) for
each frequency bin is adapted in each batch by recursing over the algorithm summarized
in Table 13 for nmax iterations. The adaptation of the de-mixing matrix across batches is
We note that the permutation ambiguity problem for W11 still exists and needs to be
solved also by the SBSS algorithm. Permutation ambiguity is currently an open problem
for ICA-based frequency-domain BSS, and many methods have been proposed. For exam-
ple, TDOA-based approaches are reliable and robust in a batch adaptation even when the
observed signals are very short in time [88, 101], where they are expected to be just as
119
Table 13: Outline of the proposed SBSS algorithm.
y(k, l) = W(k)x(k, l)
1 H
∆W(k) = {IS+R − d(k) E{Φ(y(k, l))y(k, l) }}W(k)
c
W(k) = IS+R ;
while b
for k=1 to Nk
c
W(k) = W(k)
for n=1 to nmax
refine W(k) as in Table 13
end for
c
W(k) = W(k)
−1
W11 (k) = diag(W11 (k))W11 (k)
Solve permutation for W11 (k)
end for
end while
120
5.3 Experimental Evaluation
5.3.1 Evaluation Methods
We discuss here some important issues that need to be taken into account for a fair perfor-
For the traditional MSE-based AEC, the identification of the echo path is performed by
at time t is obtained by using the previous L taps from the reference signal r during the
adaptation process, b
h is constrained to be causal. On the other hand, in the batch-online
SBSS, the de-mixing matrix W is estimated by evaluating the generalized covariance of the
de-mixed signals over time observations within a batch of STFT blocks. Consequently, the
estimated de-mixing matrix does not generally correspond to causal filters. Therefore, for
a correct measurement of the misalignment, we need to transform the obtained filters from
b
H(t) −1
= IF F T [−W11 (ω)W12 (ω)], (152)
The causality of the estimated filters depend strictly on the rank of the far-end response
matrix G. If the rank of G is full, the optimal solution for the echo paths can be obtained
b are
regardless of the adaptation technique. In such a case, the time-domain filters in H
already causal, and a circular shifting would introduce only a delay without any effect on the
echo cancellation. However, when the rank of G is not full, the ICA optimization procedure
can converge to a solution for W12 that is not unique and whose physical interpretation does
not match with the true echo paths represented by H12 , in which case the corresponding
121
To measure the echo cancellation performance, the following two metrics are used. First,
where hij is a RIR vector corresponding to the echo path from the j th loudspeaker to the
b ij is the estimated RIR vector. Second, the true echo return loss
ith microphone and h
where h̃ij is a vector corresponding to the RIR from the j th near-end source to the ith
microphone. (155) is simply the traditional ERLE calculated after removing the near-end
source signal so that the true amount of echo cancellation can be calculated during noisy
time.
It is important to stress that for a fair evaluation of the SBSS performance, the above
two metrics must be used with extra caution. The misalignment is an objective measure
of the system identification performance and is in general indicative of the actual echo
poor echo cancellation performance since a high degree of echo cancellation can still be
achieved even with non-causal filters whose physical interpretation is not related to the
true system identification, in which case the tERLE is the only meaningful metric for the
performance evaluation.
The SBSS algorithm proposed in Section 5.2.2 was evaluated on both simulated and real-
world data. First, to better control the environmental conditions and to evaluate the effect
of the parameters, we simulated the case of two pairs of near-end sources and microphones
(i.e., P = 2 and S = 2) and two pairs of far-end sources and microphones (i.e., Q = 2
and R = 2). From the misalignment perspective, the theoretically worst-case scenario of
single far-end talker and rapid changes in the far-end RIRs were simulated by alternating
the activity of two far-end sources every 25 seconds as illustrated in Figure 48.
122
Figure 48: Source activities in the worst-case scenario.
A Hanning window of 4096 taps with 75% overlap was applied to speech signals sampled
at fs = 16 kHz before taking the STFT. The RIRs were simulated with by the Lehmann &
Johansson’s image source method [69]. The far-end and the near-end RIRs were truncated
to 4096 and 3200 taps, respectively. The non-uniqueness problem should then be expected
batches, where each batch lasted for one second to avoid more than one far-end source
being active within a same batch and maintain the worst-case scenario. The algorithm
Parameters
η = 0.1
Φ(x) = tanh(10 · |x|) exp(jφ(x))
1 ≤ nmax ≤ 20 (depending on the test situation)
adaptation without any input-output delay. That is, the data in the bth batch are processed
with a filter estimated from the previous (b − 2)th batch. Such a method takes into account
both the algorithmic and the computational delays assuming an unitary real-time factor.
123
Thus the tERLEs for first two batches at the start of the adaptation are always equal to
0 dB. The tERLE and the misalignment are averaged over all possible combinations of
signals and echo paths, respectively, to obtain a single performance measure for the entire
system. Furthermore, the tERLE measurements have been smoothed over time with a first-
order autoregressive moving average, with smoothing parameter equal to 0.8, for a more
Figure 49 shows the SBSS performance when additive white Gaussian noise (AWGN)
is used to decorrelate the reference signals, where “AWGN=inf” corresponds to the case of
noise ratio (SNR). We note that AWGN is used here as a practical choice to compare the
SBSS performance with and without a decorrelation procedure and is not representative
of the best procedure. As expected, the misalignment is reduced as the SNR is decreased
to better decorrelate the reference signals. The misalignment converges to a relatively low
value regardless of the presence of the near-end source signal, which evidently indicates the
double-talk robustness of the SBSS algorithm. However, we note that the tERLE is almost
equivalent for any value of SNR. As we have already stated previously, the optimal solution
obtained through SBSS is not always equivalent to the actual echo paths, and an effective
echo cancellation is possible even if the reference signals are linearly dependent.
Figure 50 shows the performance of SBSS with and without the W22 = IR constraint.
We observe that for the constrained SBSS, the tERLE rapidly converges to a value of
about 23 dB while the misalignment slowly decreases. As the two far-end sources alternate
in activity every 25 seconds, the unconstrained SBSS displays a large degradation in the
tERLE since the estimation of W12 depends on the far-end mixing condition. A better
understanding can be obtained from the behavior of the unconstrained SBSS during the first
25 seconds, where the misalignment is considerably high even though the echo cancellation
performance is acceptable. Again, it means that the SBSS algorithm converges to a solution
that is strongly dependent on the far-end mixing system, and since a non-causal filtering
was used, the solution does not have to make a physical sense. Such a solution is then valid
for only one of the far-end source signals with which the adaptation was performed and
124
(a) True ERLE.
(b) Misalignment.
Figure 49: Performance of SBSS with AWGN decorrelation procedure (with de-mixing matrix
constraints and 5 iterations).
125
not for the other, which explains a large degradation in tERLE for the unconstrained SBSS
(b) Misalignment.
Figure 51 shows the performance of SBSS with various number of iterations per batch.
As the number of iterations is decreased, the adaptation process becomes more stable across
time due to the reduction of the effect of the statistical bias, and the tERLE converge
asymptotically to a higher value. However, the convergence rate is also decreased, which is
expected since there is usually a trade-off between the convergence rate and the steady-state
performance.
Figure 52 shows the effect of a near-end source signal on the SBSS performance. The
signals were scaled appropriately to control the SNR between the loudspeakers and the
126
(a) True ERLE.
(b) Misalignment.
Figure 51: Performance of SBSS with different number of iterations (with de-mixing matrix con-
straint).
127
near-end signals. The EBR is defined as
hri2 (t)it
EBR ≡ 10log , ∀i, j (156)
hs2j (t)it
where ri and sj are the ith and j th reference and near-end signals, respectively, and h · it
indicates the averaging operator over time t. We can observe that even when the EBR is
at −20 dB, which means that the acoustic echoes are almost inaudible with respect to the
near-end signal, the SBSS algorithm is still able to achieve a high tERLE of about 20 dB.
This experiment clearly demonstrates the robustness of the proposed SBSS algorithm to
very noisy near-end mixing conditions and shows that the self-normalization introduced by
the scaled natural gradient algorithm allows a stable adaptation that is mostly insensitive
to differences in the input signal magnitude. This attractive characteristic was already
observed in [27] for BSS and seems to be particularly effective also in the SBSS context.
Figure 53 shows the SBSS performance when the AWGN is added to the near-end
microphone signals. We note that the performance does not vary considerably even with
a very low SNR. It is worth noting that SBSS is capable of intrinsically adjusting to the
noise. In particular, all of the acoustic sources besides the acoustic echo at the near end can
be considered as a single interfering source with given probability distribution. Thus the
effectiveness of SBSS should depend also on the correct choice of the ICA contrast function
Φ(x).
Figure 54 shows the SBSS performance with different FFT frame sizes. As expected,
the performance decreases with the size of the FFT frame. In fact, the mixing system can
Figure 55 shows the SBSS performance for different values of the ICA step-size η. Sim-
ilarly to the influence of the number of iterations, the converge rate increases with the
Figure 56 shows the SBSS performance from using different number of iterations and
STFT frame sizes for each batch. We gradually move from the full online implementation
128
(a) True ERLE.
(b) Misalignment.
Figure 52: Performance of SBSS with different EBR (with de-mixing matrix constraint and 5 it-
erations).
129
(a) True ERLE.
(b) Misalignment.
Figure 53: Performance of SBSS with AWGN noise at near-end microphones (with de-mixing
matrix constraint and 5 iterations).
130
(a) True ERLE.
(b) Misalignment.
Figure 54: Performance of SBSS with different FFT frame size (with de-mixing matrix constraint
and 5 iterations).
131
(a) True ERLE.
(b) Misalignment.
Figure 55: Performance of SBSS with different ICA step-size η (with de-mixing matrix constraint
and 5 iterations).
132
(i.e., covariance matrix Φ is computed per frame and adapted by using (144) per iteration)
(b) Misalignment.
Figure 56: Performance of SBSS with online and batch-online implementations (with de-mixing
matrix constraint and 5 iterations).
In both of the cases, the SBSS algorithm converges very quickly to a high tERLE. Simi-
larly to the batch-online frequency-domain BSS in [85], it can be noted that the performance
improves as we move from an online to a batch-online strategy. However, using batches with
overly large number of frames may reduce the adaptability of the system to a variation in
the mixing system, which again implies a trade-off between the convergence rate and the
steady-state performance.
As we have previously discussed in Section 5.1.5 about the effect of the separation
133
matrix constraint on stability during changes in the source number or the mixing condition,
we simulated another set of experiments with more realistic scenario where the number
of both the near-end and far-end sources changes over time. Figure 57 shows the SBSS
performance with and without the W22 = IR constraint when maximum of three sources
were simultaneously active at the near end and the far end (i.e., 0 ≤ P ≤ 2 and 0 ≤ Q ≤ 3),
the number of the near-end and the far-end microphones were set to two, and AWGN
with 80 dB SNR was added to the near-end microphone signals. The figure indicates that,
similarly to the case discussed in Figure 50, the adaptation is stabilized with the matrix
constraint regardless of the variation in the near-end and the far-end source numbers.
(b) Misalignment.
Figure 57: Performance of SBSS with variation in the active source number (with de-mixing matrix
constraint and 5 iterations).
The effectiveness of the proposed framework can be better shown by evaluating the
134
SBSS performance for several configurations of near-end and far-end microphones: (1) one
near-end microphone and two far-end microphones; (2) two near-end microphones and two
far-end microphones; and (3) three near-end microphones and three far-end microphones.
Figure 58 shows that the tERLE is higher for a larger number of near-end microphones
while the misalignment is lower for a reduced number of the microphones. Such a result
confirms that for SBSS, the echo cancellation performance does not correspond to the
correct system identification. In fact, when many near-end microphones are used, the truly
multi-channel nature of the SBSS adaptation allows the exploitation of spatial information
through proper adaptation of the de-mixing sub-matrices W11 and W12 . This aspect
becomes much more clear by observing Figure 59 that shows the SBSS performance when
−1
the near-end source separation is not applied, where W11 and W12 are multiplied by W11
and the AEC is performed by using causal mixing filters estimated by (152) and (153). As
expected, the performance for the configurations with two and three near-end microphones
become similar to the single microphone case. In fact, removing the effect of W11 is almost
algorithm based on FBLMS [23], where we used a regularization procedure proposed in [52]
to make FBLMS robust to the ambient background noise. The combined algorithm was
implemented along with an ideal double-talk detector that froze adaptation by using the
prior knowledge of near-end source activity. Real-world data were recorded in a room with
a reverberation time of approximatively T60 = 200 ms, and the algorithms were tested with
two near-end sources and two near-end loudspeakers (i.e., two far-end microphones). The
far-end and the near-end source numbers were varied over time, 0 ≤ P, Q ≤ 2, where the
near-end sources became active after 25 seconds. Changes in the near-end mixing condition
were simulated by moving the microphones after 50 seconds and 80 seconds. Two adaptation
strategies were used: (1) a batch-online adaptation with overlapped blocks of 1.024 second
shifted by 0.128 second, and (2) an online STFT frame-by-frame adaptation. In both of
the adaptation strategies, the signals were transformed through STFT and non-overlapping
Hanning windows of 4096 samples, the covariance matrix was averaged over four frames
135
(a) True ERLE.
(b) Misalignment.
Figure 58: Performance of SBSS for different microphone configurations (with de-mixing matrix
constraint and 5 iterations).
136
(a) True ERLE.
(b) Misalignment.
Figure 59: Performance of SBSS with different microphone configurations and without applying
the near-end source separation (with de-mixing matrix constraint and 5 iterations).
137
during the batch-online adaptation, and the solution for W was updated per iteration per
block/frame.
Figure 60(a) shows the tERLE performance for the batch-online strategy where no
smoothing was applied to the tERLE measurements. In the first 25 seconds, the FBLMS
converges quickly to a high tERLE, but the echo cancellation performance is degraded
when the near-end sources become active. On the other hand, while the SBSS exhibits
slow convergence during the first 25 seconds, it clearly outperforms the FBLMS algorithm
during double talk, most likely due to the scaled natural gradient algorithm. In particular,
when the near-end source is active and when the mixing condition changes over time, the
FBLMS algorithm is unable to track the variation in the room response while the SBSS
Figure 60(b) shows the results from the online adaptation strategy. In this case, the
FBLMS algorithm displays consistently low convergence rate due to the noisy block-wise
adaptation even when the near-end talkers are silent. The SBSS algorithm also suffers
from a slower convergence rate than the corresponding batch-online implementation, but
the adaptation is stable enough due to the scaled normalization of the natural gradient
algorithm to ultimately provide acceptable echo cancellation performance. We note that the
generalized covariance matrix used in the SBSS adaptation exploits the higher-order source
statistics. Similarly to the case of BSS, any measures of higher-order source dependencies
would be highly biased if the amount of data is limited. Therefore, in order to maximize the
benefit of the SBSS structure and achieve the best overall echo cancellation performance,
138
(a) True ERLE from batch-online implementation.
Figure 60: Comparison between SBSS and FBLMS for a real-world scenario.
139
CHAPTER VI
CONCLUSIONS
First, we showed that the use of an error recovery nonlinearity (ERN) to “enhance” the
error signal proves simple and effective in dealing with the robustness issue of acoustic echo
cancellation (AEC) in the real world. The residual echo enhancement (REE), or error en-
hancement, paradigm arises from a very simple notion that reducing the effect of distortion
remaining in the residual echo after the AEC and prior to the filter coefficients adaptation
should provide for improved linear adaptive filtering performance in a noisy condition. The
idea stems from the “system” approach to signal enhancement, where the system compo-
nents should interact with one another mutually for the entire system’s benefit. The ERN
can be derived from well-established signal enhancement techniques based on the Bayesian
statistical analysis. The combined technique evidently has close ties to the traditional noise-
robust AEC schemes, namely the adaptive step-size and regularization procedures, and it
can be readily utilized not only in the presence of an additive local noise but also when
there is a nonlinear distortion on the acoustic echo due to, for example, a speech codec.
Moreover, the ERN technique can be viewed as a generalization of the adaptive step-size
the conventional practice of interrupting the filter adaptation in the presence of significant
near-end interferences, e.g., double talk. There is no need to freeze the filter adapta-
tion entirely during the double-talk situation when the error enhancement procedure using
Such a systematic paring allows the filter adaptation to be carried out continuously and
recursively on a batch of very noisy data during the frequency-domain AEC. The “block-
iterative” adaptation (BIA) in turn mitigates for the retarded convergence speed due to
140
an aggressive noise-robustness control enforced by the ERN. Recursive, batch-wise adapta-
tion is nothing new for the blind source separation (BSS) algorithms based on independent
component analysis (ICA) that are capable of unsupervised adaptation in the presence of
multiple interfering signals. Indeed, the natural gradient algorithm optimized by ICA leads
directly to the least mean square (LMS) algorithm and the ERN, where the combination is
equivalent to semi-blind source separation (SBSS), i.e., BSS when some source signals are
very noisy acoustic conditions. Other traditional techniques, such as double-talk detection
and exponential weighting, were integrated into the AEC system to further improve the
noise robustness and the overall cancellation performance. We proposed the decorrelation-
by-resampling (DBR) technique with minimal signal distortion that effectively alleviates
the non-uniqueness problem, which occurs due to inter-channel correlation when the LMS
domain resampling (FDR) technique that is computationally efficient, as it relies on the Fast
Fourier Transform for most of the interpolation work, and can be extended readily to more
than two channels and higher sampling rates. The coherence measurement supports the
intended design of DBR to smoothly decrease the coherence from low to high frequencies
in order to minimize the signal distortion of the low frequency signal components, the
process of which may degrade not only the audio quality but also the AEC performance
itself. Simulation results indicate that the time-varying decorrelation via resampling is well
suited for a frequency-domain MCAEC system with the REE procedure, as BIA used in
conjunction with REE enables the recovery of lost cancellation performance due to the
non-uniqueness problem and consequently reduces the necessity for aggressive decorrelation
at low frequencies where most of the speech energy and information resides. In addition,
we observed that BIA permits a natural recovery of the cancellation performance lost due
141
As an extended realization of the system approach to decorrelation for AEC, the in-
tegration of the sub-band resampling (SBR) technique, obtained directly from FDR, with
and the AEC system as we showed through the sub-band analysis of the tERLE and mis-
alignment. SBR allows selective decorrelation, measured in terms of the coherence, per
frequency sub-band in order to, for example, leave the low frequency band unmodified to
maintain high cancellation performance in that region naturally through BIA without the
need for aggressive decorrelation. Such a selective decorrelation results in higher tERLE
and lower misalignment overall than without decorrelation or with any other decorrelation
procedures tested while achieving superior audio quality. SBR thus provides the flexibility
to leave certain bands untouched and only resample other bands according to the desired
that generalizes the MCAEC and is a truly multi-channel approach to AEC. We analyzed
in detail the structure of the SBSS optimization and algorithmic issues in order to define
constrained batch-online algorithm that stabilizes the adaptation process and improves the
overall echo cancellation performance. Promising results show that in the simulated worst
scenario case of a single far-end talker along with the non-uniqueness condition on the far-
end mixing system, a stable adaptation is possible without the need of any decorrelation
procedure on the reference signals. Furthermore, the adaptation is insensitive to the double-
talk situation even when there are multiple near-end sources and acoustic echoes.
The problem of real-world AEC is best dealt with as a system issue, unlike the conven-
tional single algorithm approaches in which the focus used to be unduly steered away from
the robustness challenge. While we were first motivated by the intuitive notion of REE in
the presence of noise and distortion, the technique is nonetheless a culmination of many
diverse signal enhancement techniques. All of our findings thus far indicate strongly that
the principle of orthogonality, can be elevated to the system level of an AEC problem. In a
142
rather interesting manner, such a connection allows us to relate AEC to source separation
that seeks to maximize the independence, and thus implicitly the orthogonality, not only
between the error signal and the far-end signal, but rather, among all signals involved. In
with limited practical applicability to the global, system-level foundation where multiple in-
formation are available for the refinement of involved signals. This gives rise to an entirely
new approach to the conventional AEC problem in a noisy and disruptive environment,
where a single, idealized algorithm solution alone will not always be able to properly carry
out its intended function as the acoustic mixing system becomes even more complex and
6.1 Contributions
• Proposed the system approach to signal enhancement for tackling the real-world en-
hancement problems.
• Proposed the REE technique based on ICA for robust AEC performance in the pres-
BIA in the frequency domain for noise-robust AEC without double-talk detection.
• Proposed applying BIA along with REE for a natural recovery of lost AEC perfor-
during MDF.
• Proposed the DBR technique to directly alleviate the non-uniqueness problem with
• Proposed the FDR technique for expanded applicability of DBR, e.g., reduction in
• Proposed the SBSS as generalized, robust, and truly multi-channel solution to MCAEC.
143
Refer to the references chapter under the authors F. Nesta, E. Robledo-Arununcio, T.
S. Wada, and J. Wung for two journal and fourteen conference publications that have come
• Improvement in the regularization and the REE procedures, e.g., use of information
from residual echo suppression (RES) to assist the error enhancement [143].
• Improvement in the DBR procedure, e.g., perceptually hide the audio image fluctua-
• Application of the proposed techniques to other adaptive filters, e.g., APA and RLS.
• Application of the proposed techniques to the sampling rate mismatch problem, e.g.,
• Application of the proposed techniques to BSS and SBSS, e.g., use of DBR for further
• Application of the system approach to other AEC structures, e.g., sub-band AEC,
nonlinear AEC.
• Application of the system approach to other aspects of AEC, e.g., double-talk detec-
• Application of the system approach to other signal enhancement problems, e.g., feed-
144
APPENDIX A
BAYESIAN ESTIMATION
Then applying (52) and (54) to (159) gives (59) for ē ∼ Gaussian(0, σē ) and any v, (63) for
v ∼ Gaussian(0, σv ) and any ē, and (67) for ē ∼ Gaussian(0, σē ) and v ∼ Gaussian(0, σv ).
Furthermore, applying (58) to (59) gives (61) for v ∼ Laplacian(0, αv ), and applying (58)
to (63) gives (65) for ē ∼ Laplacian(0, αē ). Similar derivations are provided in [42, 75].
The MAP nonlinearity is obtained by solving φē (ēˆ) = φv (e − ēˆ) for ēˆ as a function of e.
First, if ē ∼ Gaussian(0, σē ), then (60) is obtained for any v by applying (53) to (160) to
145
get
Applying (56) to (162) gives (66) for ē ∼ Laplacian(0, αē ). Finally, if ē ∼ Gaussian(0, σē )
and v ∼ Gaussian(0, σv ), then applying (53) to (160) gives (67). Similar derivations are
146
APPENDIX B
FISHER INFORMATION
Non-Gaussianity for some PDF ps can be measured through the Fisher information [55]
[55] that the relative improvement in noise reduction by using (60) over the linear Wiener
filtering rule of (67) when ē ∼ Gaussian(0, σē ) but v is zero-mean non-Gaussian distributed
GA 2 2 2
Ref f ective = [I(v/σē ) + I(v/γv ) − 1]σē /γv + o(σē ) (165)
for σē2 < γv2 , where “G” stands for Gaussian and “A” for any distribution. Similarly, the
relative improvement by using (64) over (71) when v ∼ Gaussian(0, σv ) but ē is zero-mean
AG 2 2 2
Ref f ective = [I(ē/γē ) + I(ē/σv ) − 1]σv /γē + o(σv ) (166)
for σv2 < γē2 . In a set of PDF with unit variance, (163) is minimized and is equal to 1 for
the Gaussian PDF. Therefore, (165) and (166) indicate improved noise suppression when
either one of ē and v is Gaussian distributed while the other is non-Gaussian distributed.
147
APPENDIX C
GENERALIZED DECORRELATION
Assuming that x = {xj } is a vector of zero-mean random variables of length N and that
where xu and Au indicate the raising of each element of the vector x and of the matrix A
know that if the random variables in x are independent, then given the N × N matrices A
and B, we have
By using (168) and following the derivation of (170), it is possible to generalize (171) for
higher-order moments as
148
REFERENCES
[1] Aboulnasr, T. and Mayyas, K., “A robust variable step-size LMS-type algorithm:
Analysis and simulations,” IEEE Trans. Signal Process., vol. 45, pp. 631–639, Mar.
1997.
[2] Al-Naffouri, T. Y. and Sayed, A. H., “Adaptive filters with error nonlineari-
ties: Mean-square analysis and optimum design,” EURASIP Applied Signal Process.,
vol. 2001, pp. 192–205, Oct. 2001.
[3] Amari, S., “Natural gradient works efficiently in learning,” Neural Computat., vol. 10,
no. 2, pp. 251–276, 1998.
[4] Amari, S.-I., Chen, T.-P., and Cichocki, A., “Nonholonomic orthogonal learning
algorithms for blind source separation,” Neural Computat., vol. 12, no. 6, pp. 1463–
1484, 2000.
[5] Ang, W.-P. and Farhang-Boroujeny, B., “A new class of gradient adaptive step-
size LMS algorithms,” IEEE Trans. Signal Process., vol. 49, pp. 805–810, Apr. 2001.
[7] Benesty, J. and Gänsler, T., “A robust fast recursive least squares adaptive al-
gorithm,” in Proc. IEEE ICASSP, vol. 6, pp. 3785–3788, May 2001.
[8] Benesty, J. and Gänsler, T., “On data-reuse adaptive algorithms,” in Proc.
IWAENC, Sep. 2003.
[9] Benesty, J., Gänsler, T., Morgan, D. R., Sondhi, M. M., and Gay, S. L.,
Advances in Network and Acoustic Echo Cancellation. Springer, 2001.
[10] Benesty, J., Morgan, D. R., and Cho, J. H., “A new class of doubletalk detectors
based on cross-correlation,” IEEE Trans. Speech Audio Process., vol. 8, pp. 168–172,
Mar. 2000.
[11] Benesty, J., Morgan, D. R., Hall, J. L., and Sondhi, M. M., “Stereophonic
acoustic echo cancellation using nonlinear transformations and comb filtering,” in
Proc. IEEE ICASSP, vol. 6, pp. 3673–3676, May 1998.
[12] Benesty, J., Morgan, D. R., and Sondhi, M. M., “A better understanding and an
improved solution to the specific problems of stereophonic acoustic echo cancellation,”
IEEE Trans. Speech Audio Process., vol. 6, pp. 156–165, Mar. 1998.
[13] Benveniste, A., Metivier, M., and Priouret, P., Adaptive Algorithms and
Stochastic Approximation. Springer, 1990.
149
[14] Bershad, N. J., “On error-saturation nonlinearities in LMS adaptation,” IEEE
Trans. Acoust. Speech Signal Process., vol. 36, pp. 440–452, Apr. 1988.
[15] Birkett, A. N. and Goubran, R. A., “Acoustic echo cancellation using NLMS-
neural network structures,” in Proc. IEEE ICASSP, vol. 5, pp. 3035–3038, May 1995.
[16] Birkett, A. N. and Goubran, R. A., “Limitations of handsfree acoustic echo
cancellers due to nonlinear loudspeaker distortion and enclosure vibration effects,” in
Proc. IEEE WASPAA, pp. 103–106, Oct. 1995.
[17] Borrallo, J. M. P. and Otero, M. G., “On the implementation of a partitioned
block frequency domain adaptive filter (pbfdaf) for long acoustic echo cancellation,”
Signal Process., vol. 27, pp. 301–315, Jun. 1992.
[18] Buchner, H., Benesty, J., Gänsler, T., and Kellermann, W., “An outlier-
robust extended multidelay filter with application to acoustic echo cancellation,” in
Proc. IWAENC, pp. 19–22, Sep. 2003.
[19] Buchner, H., Benesty, J., and Kellermann, W., “Generalized multichannel
frequency-domain adaptive filtering: Efficient realization and application to hands-
free speech communication,” Signal Process., vol. 85, pp. 549–570, Mar. 2005.
[20] Carter, G., “Coherence and time delay estimation,” Proc. IEEE, vol. 75, pp. 236–
255, Feb. 1987.
[21] Cichocki, A. and Amari, S., Adaptive Blind Signal and Image Processing: Learning
Algorithms and Applications. John Wiley & Sons, 2002.
[22] Cichocki, A., Sabala, I., and Amari, S., “Intelligent neural networks for blind
signal separation with unknown number of sources,” in Proc. CEIS, pp. 148–154,
1998.
[23] Clark, G. A., Parker, S. R., and Mitra, S. K., “A unified approach to time-
and frequency-domain realization of FIR adaptive digital filters,” IEEE Trans. Acoust.
Speech Signal Process., vol. ASSP-31, pp. 1073–1083, Oct. 1983.
[24] Comon, P., “Independent component analysis, a new concept?,” Signal Process.,
vol. 36, pp. 287–314, Apr. 1994.
[25] Diethorn, E. J., “Improved decision logic for two-path echo cancelers,” in Proc.
IWAENC, Sep. 2001.
[26] Douglas, S. C., “Generalized gradient adaptive step sizes for stochastic gradient
adaptive filters,” in Proc. IEEE ICASSP, vol. 2(9), pp. 1396–1399, May 1995.
[27] Douglas, S. C. and Gupta, M., “Scaled natural gradient algorithms for instan-
taneous and convolutive blind source separation,” in Proc. IEEE ICASSP, vol. 2(5),
pp. 637–640, Apr. 2007.
[28] Douglas, S. C. and Meng, T. H.-Y., “Stochastic gradient adaptation under general
error criteria,” IEEE Trans. Signal Process., vol. 42, pp. 1335–1351, June 1994.
[29] Duttweiler, D. L., “A twelve-channel digital echo canceller,” IEEE Trans. Com-
mun., vol. COM-26, pp. 647–653, May 1978.
150
[30] Duttweiler, D. L., “Adaptive filter performance with nonlinearities in the correla-
tion multiplier,” IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-30, pp. 578–
586, Aug. 1982.
[31] Faller, C., Gänsler, T., and Vetterli, M., “Two stage estimation for the echo
paths in stereophonic acoustic echo cancellation,” in Proc. IWAENC, pp. 157–160,
Sep. 2006, paper no. 40.
[32] Fozunbal, M., Kalker, T., and Schafer, R. W., “Multi-channel echo control by
model learning,” in Proc. IWAENC, Sep. 2008, paper no. 9014.
[33] Gänsler, T., “A double-talk resistant subband echo canceller,” Signal Process.,
vol. 65, pp. 89–101, Feb. 1998.
[34] Gänsler, T. and Benesty, J., “Stereophonic acoustic echo cancellation and two-
channel adaptive filtering: an overview,” Int. J. Adapt. Control Signal Process., vol. 14,
pp. 565–586, 2000.
[36] Gänsler, T. and Benesty, J., “New insights into the stereophonic acoustic echo
cancellation problem and an adaptive nonlinearity solution,” IEEE Trans. Speech
Audio Process., vol. 10, pp. 257–267, July 2002.
[37] Gänsler, T. and Benesty, J., “The fast normalized cross-correlation double-talk
detector,” Signal Process., vol. 86, pp. 1124–1139, June 2006.
[38] Gänsler, T., Gay, S. L., Sondhi, M. M., and Benesty, J., “Double-talk robust
fast converging algorithms for network echo cancellation,” IEEE Trans. Speech Audio
Process., vol. 8, pp. 656–663, Nov. 2000.
[39] Gänsler, T., Hansson, M., Ivarson, C. J., and Salomonsson, G., “A double-
talk detector based on coherence,” IEEE Trans. Commun., vol. 44, pp. 1421–1427,
Nov. 1996.
[40] Gay, S. L. and Benesty, J., Acoustic Signal Processing for Telecommunication.
Kluwer Academic, 2000.
[41] Gazor, S. and Zhang, W., “Speech probability distribution,” IEEE Signal Process.
Letters, vol. 10, pp. 204–207, July 2003.
[43] Guérin, A., Faucon, G., and Bouquin-Jeannes, R. L., “Nonlinear acoustic echo
cancellation based on Volterra filters,” IEEE Trans. Speech Audio Process., vol. 11,
pp. 672–683, Nov. 2003.
[44] Gustaffson, T., Rao, B. D., and Trivedi, M., “Source localization in reverberant
environments: Modeling and statistical analysis,” IEEE Trans. Speech Audio Process.,
vol. 11, pp. 791–803, Nov. 2003.
[45] Hänsler, E. and Schmidt, G. U., “Hands-free telephones - joint control of echo
cancellation and postfiltering,” Signal Process., vol. 80, pp. 2295–2305, Oct. 2000.
[46] Hänsler, E. and Schmidt, G. U., Acoustic Echo and Noise Control: A Practical
Approach. John Wiley & Sons, 2004.
[47] Hassibi, B., Sayed, A. H., and Kailath, T., “H∞ optimality of the LMS algo-
rithm,” IEEE Trans. Signal Process., vol. 44, pp. 267–280, Feb. 1996.
[49] Haykin, S., Adaptive Filter Theory. Prentice Hall, 4th ed., 2002.
[50] Heckmann, M., Vogel, J., and Kroschel, K., “Frequency selective step-size
control for acoustic echo cancellation,” in Proc. EURASIP EUSIPCO, pp. 1855–1858,
Sep. 2000.
[51] Herre, J., Buchner, H., and Kellermann, W., “Acoustic echo cancellation for
surround sound using perceptually motivated convergence enhancement,” in Proc.
IEEE ICASSP, vol. I, pp. 17–20, Apr. 2007.
[52] Hirano, A. and Sugiyama, A., “A noise-robust stochastic gradient algorithm with
an adaptive step-size for mobile hands-free telephones,” in Proc. IEEE ICASSP, vol. 2,
pp. 1392–1395, May 1995.
[53] Huang, Y. and Benesty, J., eds., Audio Signal Processing for Next-Generation
Multimedia Communication Systems. Kluwer Academic, 2004.
[54] Huber, P. J., Robust Statistics. John Wiley & Sons, 1981.
[55] Hyvärinen, A., “Sparse code shrinkage: Denoising of nongaussian data by maximum
likelihood estimation,” Neural Computat., vol. 11, pp. 1739–1768, Oct. 1999.
[56] Hyvärinen, A., Karhunen, J., and Oja, E., Independent Component Analysis.
John Wiley & Sons, 2001.
[57] Ikeda, K. and Sakamoto, R., “Convergence analyses of stereo acoustic echo can-
celers with preprocessing,” IEEE Trans. Signal Process., vol. 51, pp. 1324–1334, May
2003.
[58] European Telecommunications Standards Institute (ETSI), “ETSI TS 126 104
V6.1.0: ANSI-C code for the floating-point Adaptive Multi-Rate (AMR) speech
codec,” 2004.
[59] Iqbal, M. A., Stokes, J., and Grant, S. L., “Normalized double-talk detection
based on microphone and AEC error cross-correlation,” in Proc. IEEE ICME, vol. 2,
pp. 360–363, July 2007.
[60] Joho, M., Mathis, H., and Moschytz, G. S., “Combined blind/nonblind source
separation based on the natural gradient,” IEEE Signal Process. Letters, vol. 8,
pp. 236–238, Aug. 2001.
[61] Kallinger, M., Mertins, A., and Kammeyer, K. D., “Enhanced double-talk
detection based on pseudo-coherence in stereo,” in Proc. IWAENC, pp. 177–180, Sep.
2005.
[62] Khong, A. W. H., Benesty, J., and Naylor, P. A., “Effect of interchannel
coherence on conditioning and misalignment performance for stereo acoustic echo
cancellation,” in Proc. IEEE ICASSP, vol. 5, pp. 265–268, May 2006.
[63] Khong, A. W. H., Benesty, J., and Naylor, P. A., “Stereophonic acoustic echo
cancellation: Analysis of the misalignment in the frequency domain,” IEEE Signal
Process. Letters, vol. 13, pp. 33–36, Jan. 2006.
[65] Koike, S., “A class of adaptive step-size control algorithms for adaptive filters,”
IEEE Trans. Signal Process., vol. 50, pp. 1315–1326, June 2002.
[66] Kuech, F. and Kellermann, W., “Orthogonalized power filters for nonlinear acous-
tic echo cancellation,” Signal Process., vol. 86, pp. 1168–1181, June 2006.
[67] Kuech, F. and Kellermann, W., “Nonlinear residual echo suppression using a
power filter model of the acoustic echo path,” in Proc. IEEE ICASSP, vol. 1, pp. 73–
76, Apr. 2007.
[68] Lee, T.-W., Independent Component Analysis: Theory and Applications. Kluwer
Academic, 1998.
[69] Lehmann, E. and Johansson, A., “Prediction of energy decay in room impulse re-
sponses simulated with an image-source model,” J. Acoust. Soc. America, vol. 124, no. 1,
pp. 269–277, July 2008.
[70] MacKay, D. J. C., Information Theory, Inference, and Learning Algorithms. Cam-
bridge University Press, 2003.
[71] Mader, A., Puder, H., and Schmidt, G. U., “Step-size control for acoustic echo
cancellation filters - an overview,” Signal Process., vol. 80, pp. 1697–1719, Sep. 2000.
[72] Makino, S., Kaneda, Y., and Koizumi, N., “Exponentially weighted step-size
NLMS adaptive filter based on the statistics of a room impulse response,” IEEE
Trans. Speech Audio Process., vol. 1, pp. 101–108, Jan. 1993.
[73] Makino, S., Lee, T.-W., and Sawada, H., eds., Blind Speech Separation. Springer,
2007.
[74] Mandic, D. P., “A generalized normalized gradient descent algorithm,” IEEE Signal
Process. Letters, vol. 11, pp. 115–118, Feb. 2004.
[75] Martin, R., “Speech enhancement based on minimum mean-square error estimation
and supergaussian priors,” IEEE Trans. Speech Audio Process., vol. 13, pp. 845–856,
Sep. 2005.
[77] Mathews, V. J. and Xie, Z., “A stochastic gradient adaptive filter with gradient
adaptive stepsize,” IEEE Trans. Signal Process., vol. 41, pp. 2075–2087, June 1993.
[78] Matsuoka, K. and Nakashima, S., “Minimal distortion principle for blind source
separation,” in Proc. ICA, vol. 3, pp. 722–727, Dec. 2001.
[79] Miyabe, S., Takatani, T., Saruwatari, H., Shikano, K., and Tatekura, Y.,
“Barge-in and noise-free spoken dialogue interface based on sound field control and
semi-blind source separation,” in Proc. EURASIP EUSIPCO, pp. 232–236, Sep. 2007.
[80] Miyabe, S., Barge-in Robust Spoken Dialogue Interface Using Multichannel Sound
Field Control and Array Signal Processing. PhD thesis, Nara Institute of Science and
Technology, Nara, Japan, Sep. 2007.
[82] Moulines, E., Ait-Amrane, O., and Grenier, Y., “The generalized multidelay
adaptive filter: Structure and convergence analysis,” IEEE Trans. Signal Process.,
vol. 43, pp. 14–28, Jan. 1995.
[84] Moulines, E., Cardoso, J.-F., and Gassiat, E., “Maximum likelihood for blind
separation and deconvolution of noisy signals using mixture models,” in Proc. IEEE
ICASSP, pp. 3617–3620, Apr. 1997.
[85] Mukai, R., Sawada, H., Araki, S., and Makino, S., “Real-time blind source
separation and DOA estimation using small 3-D microphone array,” in Proc. IWAENC,
pp. 45–48, Sep. 2005.
[86] Murata, N. and Ikeda, S., “An on-line algorithm for blind source separation on
speech signals,” in Proc. NOLTA, vol. 3, pp. 923–926, Sep. 1998.
[87] Murata, N., Ikeda, S., and Ziehe, A., “An approach to blind source separation
based on temporal structure of speech signals,” Neurocomputing, vol. 41, pp. 1–24,
Aug. 2001.
[88] Nesta, F., Omologo, M., and Svaizer, P., “A novel robust solution to the per-
mutation problem based on a joint multiple TDOA estimation,” in Proc. IWAENC,
Sep. 2008, paper no. 9023.
[89] Nesta, F., Wada, T. S., and Juang, B.-H., “Batch-online semi-blind source sepa-
ration applied to multi-channel acoustic echo cancellation,” IEEE Trans. Audio Speech
Language Process., vol. 19, pp. 583–599, Mar. 2011.
[90] Nesta, F., Wada, T. S., Miyabe, S., and Juang, B.-H., “On the non-uniqueness
problem and the semi-blind source separation,” in Proc. IEEE WASPAA, pp. 101–104,
Oct. 2009.
[91] Nitsch, B. H., “A frequency-selective stepfactor control for an adaptive filter algo-
rithm working in the frequency domain,” Signal Process., vol. 80, pp. 1733–1745, Sep.
2000.
[92] Ochiai, K., Araseki, T., and Ogihara, T., “Echo canceler with two echo path
models,” IEEE Trans. Commun., vol. COM-25, pp. 589–595, June 1977.
[93] Oppenheim, A. V., Schafer, R. W., and Buck, J. R., Discrete-Time Signal
Processing. Prentice Hall, 2nd ed., 1999.
[94] Ozeki, K. and Umeda, T., “An adaptive filtering algorithm using an orthogonal
projection to an affine subspace and its properties,” Trans. Inst. Elect. Info. Commun.
Engineers, vol. J67-A, pp. 126–132, Feb. 1984.
[95] Papoulis, A. and Pillai, S. U., Probability, Random Variables and Stochastic Pro-
cesses. McGraw Hill, 4th ed., 2002.
[96] Park, H.-M., Oh, S.-H., and Lee, S.-Y., “On adaptive noise cancelling based on
independent component analysis,” Electronics Letters, vol. 38, pp. 832–833, July 2002.
[98] Pham, D.-T., Servière, C., and Boumaraf, H., “Blind separation of speech mix-
tures based on nonstationarity,” in Proc. IEEE ISSPA, vol. 2, pp. 73–76, July 2003.
[99] Poor, H. V., An Introduction to Signal Detection and Estimation. Springer, 2nd ed.,
1994.
[100] Robledo-Arnuncio, E., Wada, T. S., and Juang, B.-H., “On dealing with
sampling rate mismatches in blind source separation and acoustic echo cancellation,”
in Proc. IEEE WASPAA, pp. 34–37, Oct. 2007.
[101] Saruwatari, H., Kurita, S., Takeda, K., Itakura, F., Nishikawa, T., and
Shikano, K., “Blind source separation combining independent component analysis
and beamforming,” EURASIP J. Applied Signal Process., no. 1, pp. 1135–1146, 2003.
[102] Sawada, H., Araki, S., Mukai, R., and Makino, S., “Grouping separated fre-
quency components by estimating propagation model parameters in frequency-domain
blind source separation,” IEEE Trans. Audio Speech Language Process., vol. 15,
pp. 1592–1604, July 2007.
[104] Sethares, W. A., “Adaptive algorithms with nonlinear data and error functions,”
IEEE Trans. Signal Process., vol. 40, pp. 2199–2206, Sep. 1992.
[105] Shi, K., Nonlinear Acoustic Echo Cancellation. PhD thesis, Georgia Institute of
Technology, Atlanta, GA, USA, Dec. 2008.
[106] Shi, K., Ma, X., and Zhou, G. T., “A double-talk detector based on generalized
mutual information for stereophonic acoustic echo cancellation in the presence of non-
linearity,” in Proc. IEEE ACSSC, Oct. 2008.
[107] Shi, K., Ma, X., and Zhou, G. T., “A mutual information based double-talk de-
tector for nonlinear systems,” in Proc. IEEE CISS, pp. 356–360, Mar. 2008.
[108] Shi, K., Ma, X., and Zhou, G. T., “A residual echo suppression technique for
systems with nonlinear acoustic echo paths,” in Proc. IEEE ICASSP, pp. 257–260,
Apr. 2008.
[109] Shi, K., Zhou, G. T., and Viberg, M., “Compensation for nonlinearity in a Ham-
merstein system using the coherence function with application to nonlinear acoustic
echo cancellation,” IEEE Trans. Signal Process., vol. 55, pp. 5853–5858, Dec. 2007.
[110] Shimauchi, S. and Makino, S., “Stereo projection echo canceller with true echo
path estimation,” in Proc. IEEE ICASSP, pp. 3059–3062, May 1995.
[111] Shimauchi, S., Makino, S., Haneda, Y., Nakagawa, A., and Sakauchi, S.,
“A stereo echo canceller implemented using a stereo shaker and a duo-filter control
system,” in Proc. IEEE ICASSP, pp. 857–860, Mar. 1999.
[112] Shimauchi, S., Haneda, Y., and Kataoka, A., “Double-talk robust frequency
domain echo cancellation algorithm with scalable nonlinear reference and error func-
tions,” in Proc. IWAENC, pp. 23–26, Sep. 2003.
[113] Shimauchi, S., Haneda, Y., and Kataoka, A., “A robust NLMS algorithm for
acoustic echo cancellation,” Elect. Commun. Japan III: Fund. Elect. Sci., vol. 89,
pp. 1–9, Aug. 2006.
[114] Shimauchi, S., Haneda, Y., and Kataoka, A., “Robust frequency domain acous-
tic echo cancellation filter employing normalized residual echo enhancement,” IEICE
Trans. Fundamentals, vol. E91-A, pp. 1347–1356, June 2008.
[115] Shynk, J. J., “Frequency-domain and multirate adaptive filtering,” IEEE Signal
Process. Magazine, vol. 9, pp. 14–37, Jan. 1992.
[116] Sondhi, M. M., “Adaptive echo cancellers,” Bell Syst. Tech. Journal, vol. 46,
pp. 497–511, Mar. 1967.
[117] Sondhi, M. M. and Mitra, D., “New results on the performance of a well-known
class of adaptive filters,” Proc. IEEE, vol. 64, pp. 1583–1597, Nov. 1976.
[118] Sondhi, M. M., Morgan, D. R., and Hall, J. L., “Stereophonic acoustic echo
cancellation - an overview of the fundamental problem,” IEEE Signal Process. Letters,
vol. 2, pp. 148–151, Aug. 1995.
[119] Sondhi, M. M. and Presti, A. J., “A self-adaptive echo canceler,” Bell Syst. Tech.
Journal, vol. 45, pp. 1851–1854, Dec. 1966.
[120] Soo, J.-S. and Pang, K., “Multidelay block frequency domain adaptive filter,” IEEE
Trans. Acoust. Speech Signal Process., vol. 38, pp. 373–376, Feb. 1990.
[121] Stenger, A. and Kellermann, W., “Adaptation of a memoryless preprocessor for
nonlinear acoustic echo cancelling,” Signal Process., vol. 80, pp. 1747–1760, Sep. 2000.
[122] Stenger, A. and Rabenstein, R., “An acoustic echo canceller with compensation
of nonlinearities,” in Proc. EURASIP EUSIPCO, pp. 969–972, Sep. 1998.
[123] Stenger, A., Trautmann, L., and Rabenstein, R., “Nonlinear acoustic echo
cancellation with 2nd order adaptive Volterra filter,” in Proc. IEEE ICASSP, vol. 2,
pp. 877–880, Mar. 1999.
[124] Sugiyama, A., “A robust NLMS algorithm with a novel noise modeling based on
stationary/nonstationary noise decomposition,” in Proc. IEEE ICASSP, pp. 201–204,
Apr. 2009.
[125] Sugiyama, A., Joncour, Y., and Hirano, A., “A stereo echo canceler with correct
echo-path identification based on an input-sliding technique,” IEEE Trans. Signal
Process., vol. 49, pp. 2577–2587, Nov. 2001.
[126] Valin, J.-M., “On adjusting the learning rate in frequency domain echo cancellation
with double-talk,” IEEE Trans. Audio Speech Language Process., vol. 15, pp. 1030–
1034, Mar. 2007.
[127] Valin, J.-M. and Collings, I. B., “A new robust frequency domain echo canceller
with closed-loop learning rate adaptation,” in Proc. IEEE ICASSP, vol. 1, pp. 93–96,
Apr. 2007.
[128] Wada, T. S. and Juang, B.-H., “Enhancement of residual echo for improved acous-
tic echo cancellation,” in Proc. EURASIP EUSIPCO, pp. 1620–1624, Sep. 2007.
[129] Wada, T. S. and Juang, B.-H., “Enhancement of residual echo for improved
frequency-domain acoustic echo cancellation,” in Proc. IEEE WASPAA, pp. 175–178,
Oct. 2007.
[130] Wada, T. S. and Juang, B.-H., “Towards robust acoustic echo cancellation during
double-talk and near-end background noise via enhancement of residual echo,” in
Proc. IEEE ICASSP, pp. 253–256, Apr. 2008.
[131] Wada, T. S. and Juang, B.-H., “Acoustic echo cancellation based on indepen-
dent component analysis and integrated residual echo enhancement,” in Proc. IEEE
WASPAA, pp. 205–208, Oct. 2009.
[132] Wada, T. S. and Juang, B.-H., “Multi-channel acoustic echo cancellation based
on residual echo enhancement with effective channel decorrelation via resampling,” in
Proc. IWAENC, Sep. 2010.
[133] Wada, T. S. and Juang, B.-H., “Enhancement of residual echo for robust acoustic
echo cancellation,” IEEE Trans. Audio Speech Language Process., vol. 20, pp. 175–
189, Jan. 2012.
[134] Wada, T. S., Juang, B.-H., and Sukkar, R. A., “Measurement of the effects of
nonlinearities on the network-based acoustic echo cancellation,” in Proc. EURASIP
EUSIPCO, Sep. 2006.
[135] Wada, T. S., Miyabe, S., and Juang, B.-H., “Use of decorrelation procedure for
source and echo suppression,” in Proc. IWAENC, Sep. 2008, paper no. 9086.
[136] Wada, T. S., Robledo-Arnuncio, E., Yue, G., and Juang, B.-H., “Immersive
acoustic signal processing for intelligent collaboration,” in Proc. WESPAC, June
2006, paper no. 653.
[137] Wada, T. S., Wung, J., and Juang, B.-H., “Decorrelation by resampling in fre-
quency domain for multi-channel acoustic echo cancellation based on residual echo
enhancement,” in Proc. IEEE WASPAA, pp. 289–292, Oct. 2011.
[138] Widrow, B. and Hoff, Jr., M. E., “Adaptive switching circuits,” in IRE WESCON
Conv. Rec., part 4, pp. 96–104, 1960.
[139] Widrow, B. and Stearns, S. D., Adaptive Signal Processing. Prentice Hall, 1985.
[140] Wightman, F. and Kistler, D., “The dominant role of low-frequency interaural
time differences in sound localization,” J. Acoust. Soc. America, vol. 91, pp. 1648–
1661, Mar. 1992.
[141] Wung, J., Wada, T. S., and Juang, B.-H., “Inter-channel decorrelation by sub-
band resampling in frequency domain,” in Proc. IEEE ICASSP, pp. 29–32, Mar.
2012.
[142] Wung, J., Wada, T. S., Juang, B.-H., Lee, B., Kalker, T., and Schafer, R.,
“System approach to residual echo suppression in robust hands-free teleconferencing,”
in Proc. IEEE ICASSP, pp. 445–448, May 2011.
[143] Wung, J., Wada, T. S., Juang, B.-H., Lee, B., Trott, M., and Schafer,
R. W., “A system approach to acoustic echo cancellation in robust hands-free tele-
conferencing,” in Proc. IEEE WASPAA, pp. 101–104, Oct. 2011.
[144] Yang, J.-M. and Sakai, H., “A robust ICA-based adaptive filter algorithm for sys-
tem identification,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 55, pp. 1259–
1263, Dec. 2008.
[145] Ye, H. and Wu, B.-X., “A new double-talk detection algorithm based on the or-
thogonality theorem,” IEEE Trans. Commun., vol. 39, pp. 1542–1545, Nov. 1991.
[146] Zou, Y., Chan, S.-C., and Ng, T.-S., “Least mean M-estimate algorithms for
robust adaptive filtering in impulse noise,” IEEE Trans. Circuits Syst. II: Analog and
Digital Signal Proc., vol. 47, pp. 1564–1569, Dec. 2000.
VITA
Ted S. Wada was born in Tokyo, Japan, and moved to the United States during the summer
of 1982. After earning B.S. and M.S. degrees in applied physics, statistics, and electrical and
computer engineering from Columbia University in New York, NY, Georgia State University
in Atlanta, GA, and the Georgia Institute of Technology in Atlanta, GA, he is about to
complete his academic pursuit with a Ph.D. degree in electrical and computer engineering from Georgia
Tech. He spent the summer of 2003 at Avaya, Inc., in Basking Ridge, NJ, the summer of
2005 at Tellabs, Inc., in Naperville, IL, and the winter/spring of 2009/2010 at Li Creative
Technologies, Inc., in Florham Park, NJ, while working on acoustic echo cancellation and
speech enhancement problems. He has been with Broadcom, Inc., in Irvine, CA, since
the summer of 2012. His research interests include speech and audio signal processing,