Distributionally Robust Instrumental Variables Estimation
∗Columbia University, Data Science Institute
†Columbia University, Department of Statistics
Abstract
Instrumental variables (IV) estimation is a fundamental method in econometrics
and statistics for estimating causal effects in the presence of unobserved confounding.
However, challenges such as untestable model assumptions and poor finite sample
properties have undermined its reliability in practice. Viewing common issues in
IV estimation as distributional uncertainties, we propose DRIVE, a distributionally
robust IV estimation method. We show that DRIVE minimizes a square root variant
of the ridge regularized two stage least squares (TSLS) objective when the ambiguity
set is based on a Wasserstein distance. In addition, we develop a novel asymptotic
theory for this estimator, showing that it achieves consistency without requiring the
regularization parameter to vanish. This novel property ensures that the estimator is
robust to distributional uncertainties that persist in large samples. We further derive
the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures
to select the regularization parameter based on theoretical results. Simulation studies
demonstrate the superior finite sample performance of Wasserstein DRIVE in terms of
estimation error and out-of-sample prediction. Due to its regularization and robustness
properties, Wasserstein DRIVE presents an appealing option when the practitioner is
uncertain about model assumptions or distributional shifts in data.
1 Introduction
Instrumental variables (IV) estimation leverages instruments that affect the outcome exogenously and exclusively through the endogenous regressor to yield
consistent causal estimates, even when the standard ordinary least squares (OLS) estimator
is biased by unobserved confounding (Imbens and Angrist, 1994; Angrist et al., 1996; Imbens
and Rubin, 2015). Over the years, IV estimation has become an indispensable tool for
causal inference in empirical works in economics (Card and Krueger, 1994), as well as in
the study of genetic and epidemiological data (Davey Smith and Ebrahim, 2003).
Despite the widespread use of IV in empirical and applied works, it has important
limitations and challenges, such as invalid instruments (Sargan, 1958; Murray, 2006), weak
instruments (Staiger and Stock, 1997), non-compliance (Imbens and Angrist, 1994), and
heteroskedasticity and highly leveraged data (Andrews et al., 2019; Young, 2022). These issues could significantly impact the validity
and quality of estimation and inference using instrumental variables (Jiang, 2017). Many
works have since been devoted to assessing and addressing these issues, such as statistical
tests (Hansen, 1982; Stock and Yogo, 2002), sensitivity analysis (Rosenbaum and Rubin,
1983; Bonhomme and Weidner, 2022), and additional assumptions or structures on the data
generating process (Kolesár et al., 2015; Kang et al., 2016; Guo et al., 2018b).
A growing body of work connects causality and the concepts of invariance and robustness (Peters et al., 2016; Meinshausen,
2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022; Fan et al.,
2024). Their guiding philosophy is that causal properties can be viewed as robustness against distributional shifts.
The robustness of an estimator against distributional shifts is often represented in a distributionally robust optimization (DRO) problem of the form

min_β sup_{Q ∈ P(P0, ρ)} E_Q[ℓ(W; β)].   (1)

In many estimation and regression settings, one assumes that the true data distribution satisfies certain conditions, such as moment or independence conditions. These conditions guarantee that standard statistical procedures based on the empirical distribution behave well. In practice, however, there is often the concern that the distribution P0 of the observed data might deviate from that generated by the ideal model that satisfies such conditions, e.g., due to measurement errors or model mis-specification. DRO incorporates such possible deviations into an ambiguity set P = P(P0, ρ) of distributions that are “close” to P0. The parameter ρ quantifies the degree of uncertainty, e.g., as the radius of a ball centered at P0. By minimizing the worst-case loss over P(P0, ρ) in the min-max optimization problem (1), the resulting estimator acquires robustness properties not shared by standard statistical methods. For example, it is well-known that members of the family of k-class
estimators (Anderson and Rubin, 1949; Nagar, 1959; Theil, 1961) are more robust than
the standard IV estimator against weak instruments (Andrews, 2007). Recent works by
Rothenhäusler et al. (2021) and Jakobsen and Peters (2022) show that k-class estimators in
fact have a DRO representation of the form (1), where ℓ is the square loss, W = (X, Y ),
and X, Y are endogenous and outcome variables generated from structural equation models
parameterized by the natural parameter of k-class estimators. See Appendix A.2 for details.
The general robust optimization problem (1) can trace its roots to the classical robust
statistics literature (Huber, 1964; Huber and Ronchetti, 2011) as well as classic works on
robustness in economics (Hansen and Sargent, 2008). Drawing inspiration from them,
recent works in econometrics have also explored the use of robust optimization to account for
(local) deviations from model assumptions (Kitamura et al., 2013; Armstrong and Kolesár,
2021; Chen et al., 2021; Bonhomme and Weidner, 2022; Adjaho and Christensen, 2022; Fan
et al., 2023). These works, together with works on invariance and robustness, highlight the promise of explicitly accounting for deviations from model assumptions via robust optimization.
Despite new developments connecting causality and robustness, many questions and opportunities remain. An important challenge in DRO is the choice of the ambiguity set P(P0, ρ), which should ideally reflect the structure of the particular problem of interest. While some existing DRO approaches
use ambiguity sets P(P0 , ρ) based on marginal or joint distributions of data, such P(P0 , ρ)
may not effectively capture the structure of IV estimation models. In addition, as the
min-max problem (1) minimizes the loss function under the worst-case distribution in
P(P0 , ρ), a common concern is that the resulting estimator is too conservative when P(P0 , ρ)
is too large. In particular, although DRO estimators enjoy better empirical performance
in finite samples, their asymptotic validity typically requires the ambiguity set to vanish as the sample size grows. However, distributional uncertainties about model assumptions may persist even in large samples, necessitating an ambiguity set that does not vanish to a singleton.
It is therefore important to ask whether and how one can construct an estimator in the
IV estimation setting that can sufficiently capture the distributional uncertainties about
model assumptions, and at the same time remains asymptotically valid with a non-vanishing
robustness parameter.
In this paper, we propose to view common challenges to IV estimation through the lens
of DRO, whereby uncertainties about model assumptions, such as the exclusion restriction
and homoskedasticity, are captured by a suitably chosen ambiguity set in (1). Based on this perspective, we propose DRIVE, a distributionally robust IV estimation method. Instead of constructing the ambiguity set based on marginal or joint distributions as in existing works, DRIVE constructs it around the empirical distribution of Y, X projected onto the space spanned by instrumental variables. When the ambiguity set of
DRIVE is based on the 2-Wasserstein metric, we show that the resulting estimator minimizes
a square root version of ridge regularized two stage least squares (TSLS) objective, where
the radius ρ of the ambiguity set becomes the regularization parameter. This regularized
regression formulation relies on the general duality of Wasserstein DRO problems (Gao and Kleywegt, 2023; Blanchet and Murthy, 2019).
We next reveal a surprising statistical property of the square root ridge by showing that it remains consistent as long as the regularization parameter is bounded above by an estimable constant, which depends on the first stage coefficient of the IV
model and can be interpreted as a measure of instrument quality. To our knowledge, this is
the first consistency result for regularized regression estimators where the regularization
parameter does not vanish as the sample size n → ∞. One implication of our results is that
Wasserstein DRIVE, being a regularized regression estimator, enjoys better finite sample
properties, but does not introduce bias asymptotically even for non-vanishing ρ, unlike standard regularized regression estimators. We then demonstrate in numerical experiments that Wasserstein DRIVE improves over the finite sample performance of IV and k-class estimators, thanks to its ridge type regularization, while at the same time remaining asymptotically valid. In particular, Wasserstein DRIVE achieves significant improvements in mean squared errors (MSEs) over IV and OLS
when instruments are moderately invalid. These findings suggest that Wasserstein DRIVE
can be an attractive option in practice when we are concerned about model assumptions.
The rest of the paper is organized as follows. In Section 2, we discuss the standard IV esti-
mation framework and common challenges. In Section 3, we propose the Wasserstein DRIVE
framework and provide the duality theory. In Section 4, we develop asymptotic results for the Wasserstein DRIVE estimator and discuss the choice of the regularization parameter. Section 5 conducts numerical studies that compare Wasserstein DRIVE with
other estimators including IV, OLS, and k-class estimators. Background materials, proofs,
and additional results are included in the appendices in the supplementary material.
Notation. Throughout the paper, ∥v∥p denotes the p-norm of a vector v, while
∥v∥ := ∥v∥2 denotes the Euclidean norm. Tr(M ) denotes the trace of a matrix M . λk (M )
represents the k-th largest eigenvalue of a symmetric matrix M. Boldfaced variables, such as X, Y, and Z, denote matrices of stacked observations.
2 Instrumental Variables Estimation
In this section, we first provide a brief review of the standard IV estimation framework. We then motivate the DRO approach to IV estimation by viewing common challenges from the perspective of distributional shifts.
Consider the following standard linear instrumental variables regression model with endogenous variables X ∈ R^p and instruments Z ∈ R^d:

Y = β0ᵀ X + ϵ,
X = γᵀ Z + ξ.   (2)
In (2), X are the endogenous variables, Z are the instrumental variables, and Y is the
outcome variable. The error terms ϵ and ξ capture the unobserved (or residual) components of Y and X, respectively. We observe independent and identically distributed (i.i.d.) samples {(Xi, Yi, Zi)}_{i=1}^n. However, X and Y are confounded through some unobserved confounders U that are correlated with both Y and X, represented graphically by the causal diagram Z →(γ) X →(β0) Y, with U affecting both X and Y. As a result,

E[Xϵ] ≠ 0.
As a result of the unobserved confounding, the standard ordinary least squares (OLS) estimator of β0 is biased. The IV estimation approach leverages access to the instrumental variables Z, also often called instruments, which satisfy the moment conditions

rank(E[ZXᵀ]) = p,   (3)
E[Zϵ] = 0,   E[Zξᵀ] = 0.   (4)
Under these conditions, a popular IV estimator is the two stage least squares (TSLS, sometimes also stylized as 2SLS) estimator (Theil, 1953). With ΠZ := Z(ZᵀZ)⁻¹Zᵀ the projection onto the column space of the instruments, the TSLS estimator β̂ IV solves

min_β (1/n)∥Y − ΠZ Xβ∥².   (5)
In contrast, the standard OLS estimator β̂ OLS solves the problem

min_β (1/n)∥Y − Xβ∥².   (6)
When the moment conditions (3) and (4) hold, the TSLS estimator is a consistent estimator
of the causal effect β0 under standard assumptions (Wooldridge, 2020), and valid inference can be conducted based on the asymptotic distribution of β̂ IV (Imbens and Rubin, 2015). Although not the most common presentation of TSLS, the optimization formulation in (5) provides intuition on how IV estimation works: when the instruments Z are uncorrelated with the unobserved confounders U affecting X and Y, projecting the data onto the instruments removes the variation due to confounding, so that minimizing the projected objective yields a consistent estimator of β0.
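To make the projection formulation concrete, here is a minimal numerical sketch of TSLS computed directly from (5) (illustrative only; the function name and interface are ours):

```python
import numpy as np

def tsls(Y, X, Z):
    """TSLS via the projection formulation (5).

    Y: (n,) outcome; X: (n, p) endogenous regressors; Z: (n, d) instruments.
    """
    # First stage fitted values Pi_Z @ X, with Pi_Z = Z (Z'Z)^{-1} Z'.
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Second stage: since Pi_Z is idempotent, the minimizer of (5) is
    # (Xhat' Xhat)^{-1} Xhat' Y, i.e., OLS of Y on the fitted values.
    return np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ Y)
```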
The validity of estimation and inference based on β̂ IV relies critically on the moment
conditions (3) and (4). Condition (3) is often called the relevance condition or rank
condition, and requires E[ZXᵀ] to have full rank (recall p ≤ d). In the special case of univariate X and Z, it reduces to E[ZX] ≠ 0. Intuitively, the relevance condition ensures that the instruments Z can
explain sufficient variations in the endogenous variables X. In this case, the instruments
are said to be relevant and strong. When E[ZXᵀ] is close to being rank deficient, i.e., the smallest eigenvalue λp(E[ZXᵀ]) ≈ 0, IV estimation suffers from the so-called weak
instrument problem, which results in many issues in estimation and inference, such as small
sample bias and non-normal statistics (Stock et al., 2002). Some k-class estimators, such as
limited information maximum likelihood (LIML) (Anderson and Rubin, 1949), are partially
motivated to address these problems. Condition (4) is often referred to as the exclusion
restriction or instrument exogeneity (Imbens and Rubin, 2015), and instruments that
satisfy this condition are called valid instruments. When an instrument Z is correlated
with the unobserved confounder that confounds X, Y , or when Z affects the outcome Y
through an unobserved variable other than the endogenous variable X, the instrument
becomes invalid, resulting in biased estimation and invalid inference of β0 (Murray, 2006).
These issues can often be exacerbated when the instruments are weak, when there is
heteroskedasticity (Andrews et al., 2019), or the data is highly leveraged (Young, 2022).
Although many works have been devoted to addressing the problems of weak and invalid
instruments, there are fundamental limits on the extent to which one can test for these issues in practice. It is therefore important to develop estimation and inference procedures that are robust to the presence of such issues. Our perspective in this paper is that common challenges to IV estimation can be viewed as uncertainties about the data distribution, i.e., deviations from the ideal model that satisfies IV assumptions, which can be explicitly taken into account by choosing a suitable ambiguity set in the DRO formulation (1). To demonstrate this perspective more concretely, we now examine some common problems in IV estimation and show that they can be viewed as distributional shifts under a suitable metric. Consider the following augmented version of the linear IV model:
Y = Xβ0 + ϵ,
X = Zγ + ξ,
ϵ = Zη + U,   ξ = U.
Note that in addition to U , there is also potentially a direct effect η from the instrument Z
to the outcome variable Y . We focus on the resulting model for our subsequent discussions:
Y = Xβ0 + Zη + U,
X = Zγ + U.   (7)
The standard IV assumptions can be succinctly summarized for (7). The relevance condition
(3) requires that γ ̸= 0, while the exclusion restriction (4) requires that Z is uncorrelated
with U and that in addition η = 0. Assume that U, Z are i.i.d. standard normal. X, Y are
then determined by (7). We are interested in the shifts in data distribution, appropriately
defined and measured, when the exogeneity and relevance conditions are violated.
Example 1 (Invalid Instruments). The instrument Z is invalid if η ≠ 0, and |η| quantifies the degree of instrument invalidity. Let Pη denote the joint distribution on (X, Y, Z) in the model (7) indexed by η ∈ R. Let P̃η,Z be the resulting distribution of (X̃, Ỹ), the projections of (X, Y) onto the instrument, conditional on Z. We are interested in the (expected) distributional shift between P̃η,Z and P̃0,Z. We
choose the 2-Wasserstein distance W2 (·, ·) (Kantorovich, 1942, 1960), also known as the
Kantorovich metric, to measure this shift. Conveniently, the 2-Wasserstein distance between
two normal distributions Q1 = N(μ1, Σ1) and Q2 = N(μ2, Σ2) has an explicit formula due to Olkin and Pukelsheim (1982):

W2(Q1, Q2)² = ∥μ1 − μ2∥² + Tr(Σ1 + Σ2 − 2(Σ2^{1/2} Σ1 Σ2^{1/2})^{1/2}).   (8)
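For reference, formula (8) is straightforward to evaluate numerically; below is a sketch using SciPy's matrix square root (the function name is ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, Sigma1, mu2, Sigma2):
    """2-Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2), per (8)."""
    root2 = sqrtm(Sigma2)
    cross = sqrtm(root2 @ Sigma1 @ root2)
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(Sigma1 + Sigma2 - 2 * cross)
    return np.sqrt(max(np.real(w2_sq), 0.0))  # clip round-off noise
```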
Applying (8) to the conditional distributions P̃η,Z, P̃0,Z, and taking the expectation with respect to Z, we obtain an expected distributional shift proportional to |η|. This calculation shows that the degree of instrument invalidity, as measured by the strength |η| of the direct effect, translates directly into the expected shift of the distribution on (X̃, Ỹ) from that under the valid IV assumption. Moreover, the simple form of the expected distributional shift relies on our choice of the Wasserstein distance to measure the distributional shift of the conditional random variables (X̃, Ỹ). If we instead measure shifts in the joint distribution Pη on (X, Y, Z), the resulting distributional shift will depend on other model parameters in addition to η. This example therefore suggests that the Wasserstein metric applied to the conditional distributional shift of (X̃, Ỹ) is well-suited to capturing uncertainties about instrument validity.
Example 2 (Weak Instruments). Now consider another common problem with IV esti-
mation, which happens when the first stage coefficient γ is close to 0. Let Q̃γ,Z be the
distribution on (X̃, Ỹ ) indexed by γ ∈ R and η = 0 in (7). In this case, we can verify that
EW2(Q̃γ1,Z, Q̃γ2,Z) = √(2/π) · √(1 + β0²) · |γ1 − γ2|.
The expected distributional shift between the setting with a “strong” instrument with γ = γ1 and a “weak” instrument with γ = γ2 ≈ 0 is therefore proportional to |γ1 − γ2|. Similar to the previous example, the degree of violation of the strong instrument assumption translates directly into an expected distributional shift on (X̃, Ỹ). Note, however, that the distance is also proportional to the factor √(1 + β0²), suggesting that instrument strength is relative, and should be measured relative to the scale of the true causal parameter.
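The displayed identity is easy to check by simulation. Under the reading that, conditional on Z, the projected pair (X̃, Ỹ) is a point mass at (γZ, β0γZ), the conditional W2 distance is |γ1 − γ2| · |Z| · √(1 + β0²), and E|Z| = √(2/π) for standard normal Z. A sketch under this assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, gamma1, gamma2 = 1.0, 1.0, 0.2

# Conditional on Z, each projected distribution is a point mass, so W2 is
# just the Euclidean distance between the two conditional means.
Z = rng.standard_normal(10**6)
w2_given_z = np.abs(gamma1 - gamma2) * np.abs(Z) * np.sqrt(1 + beta0**2)

print(w2_given_z.mean())                                                  # Monte Carlo
print(np.sqrt(2 / np.pi) * np.sqrt(1 + beta0**2) * abs(gamma1 - gamma2))  # closed form
```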
Next, we consider the distributional shift resulting from heteroskedastic errors, which are
known to render the TSLS estimator inefficient and the standard variance estimator invalid (Baum et al., 2003). Some k-class estimators, such as the LIML and the Fuller estimators, are partially motivated by such settings.
Example 3 (Heteroskedasticity). In this example, we assume η = 0 in (7) and that the standard deviation of U conditional on Z is α · |Z| + 1, where α ≥ 0. We are interested in the average distributional shift between the heteroskedastic setting (α > 0) and the homoskedastic setting (α = 0). We can verify that the expected distributional shift on (X̃, Ỹ) again grows with the degree of heteroskedasticity α.
3 Wasserstein Distributionally Robust IV Estimation
Motivated by the examples in the previous section, we propose to construct an ambiguity set in (1) using a Wasserstein ball around the empirical distribution of the projected data (X̃, Ỹ).
3.1 DRIVE
We now formally introduce the Distributionally Robust IV Estimation (DRIVE) framework, which solves the following DRO problem given
a dataset {(Xi, Yi, Zi)}_{i=1}^n and robustness parameter ρ:

(DRIVE Objective)   min_β sup_{Q: D(Q, P̃n) ≤ ρ} E_Q[(Y − Xᵀβ)²],   (12)
where P̃n(X × Y) is the empirical distribution on (X, Y) induced by the projected samples (X̃i, Ỹi) := ((ΠZ X)i, (ΠZ Y)i), and ΠZ = Z(ZᵀZ)⁻¹Zᵀ is the projection matrix onto the column space of Z. D(·, ·) is a metric on the space of probability distributions. In the DRIVE framework, we first regress both the outcome Y and covariate X on the instrument Z to form the n predicted samples (ΠZ X, ΠZ Y). Then an ambiguity set is constructed using D around the empirical distribution P̃n. This choice of the reference distribution P0 is a key distinction of our work from previous works that leverage DRO in statistical estimation settings, where P0 is usually chosen as the empirical distribution P̂n on {Xi, Yi}_{i=1}^n (Blanchet et al., 2019). In the IV estimation setting where we have additional access to instruments Z, we have the choice of constructing ambiguity sets around the empirical distribution on the marginal quantities {(Xi, Yi, Zi)}_{i=1}^n, which is the approach taken in Bertsimas et al. (2022). In contrast, we choose to use the empirical distribution on the conditional quantities {(X̃i, Ỹi)}_{i=1}^n. This choice better captures the structure of IV estimation models, as illustrated by the examples in Section 2.
The choice of the divergence measure D(·, ·) is also important, as it characterizes the
potential distributional uncertainties that DRIVE is robust to. In this paper, we propose to use the Wasserstein distance (Mohajerin Esfahani and Kuhn, 2018; Gao and Kleywegt, 2023). One advantage of the
Wasserstein distance is the tractability of its associated DRO problems (Blanchet et al., 2019),
which can often be formulated as regularized regression problems with unique solutions.
See also Appendix A.1. In Section 2, we provided several examples that demonstrate the relevance of the Wasserstein distance for capturing deviations from IV model assumptions. Other divergence measures, such as the class of ϕ-divergences (Ben-Tal et al., 2013), can also be used instead of the Wasserstein distance. For example, Kitamura et al. (2013) use the Hellinger distance to model local deviations from model assumptions in a moment condition estimation setting. In this paper, we focus on the Wasserstein DRIVE framework based on the 2-Wasserstein distance.
We next begin our formal study of Wasserstein DRIVE. In Section 3.2, we will show
that the Wasserstein DRIVE objective is dual to a convex regularized regression problem.
As a result, the solution to the optimization problem (12) is well-defined, and we denote it by β̂ DRIVE. In Section 4, we establish the consistency of Wasserstein DRIVE for non-vanishing choices of the robustness parameter and derive its asymptotic distribution.
3.2 Duality Theory of Wasserstein DRIVE
Wasserstein DRO problems such as (12) often have equivalent formulations as regularized regression problems. This correspondence between regularization and robustness already manifests itself in the ridge and LASSO regressions (Blanchet et al., 2019; Mohajerin Esfahani and Kuhn, 2018). Importantly, the regularized regression formulations are often more tractable in
terms of solving the resulting optimization problem, and also facilitate the study of the
statistical properties of the estimators. We first show that the Wasserstein DRIVE objective
can also be written as a regularized regression problem similar to, but distinct from, the
standard TSLS objective with ridge regularization. Proofs can be found in Appendix F.
Theorem 3.1. The optimization problem in (12) is equivalent to the following convex
regularized regression problem:
min_β √((1/n)∥ΠZ Y − ΠZ Xβ∥²) + √(ρ(∥β∥² + 1)),   (13)
where ΠZ = Z(ZᵀZ)⁻¹Zᵀ is the finite sample projection operator, and ΠZ Y and ΠZ X are the projected outcome and endogenous variables.
Note that the robustness parameter ρ of the DRO formulation (12) now has the dual interpretation of a regularization parameter. The regularized regression formulation implies that the min-max problem (12) associated with Wasserstein DRIVE has a unique solution, thanks to the strict convexity of the regularization term √(ρ(∥β∥² + 1)), and is easy to compute despite not having a closed form solution. In particular, (13) can be reformulated as a standard second order conic program (SOCP) (El Ghaoui and Lebret, 1997), which can be solved efficiently with off-the-shelf convex optimization routines.
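For instance, the following sketch solves (13) with the generic conic modeling package CVXPY (our own choice of tooling; the function name and interface are ours):

```python
import cvxpy as cp
import numpy as np

def drive(Y, X, Z, rho):
    """Solve the square root ridge formulation (13) of Wasserstein DRIVE."""
    n = len(Y)
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection matrix Pi_Z
    Yt, Xt = P @ Y, P @ X                   # projected outcome and regressors
    beta = cp.Variable(X.shape[1])
    # sqrt((1/n)||.||^2) = ||.||_2 / sqrt(n); sqrt(rho(||b||^2 + 1)) = sqrt(rho)*||(b, 1)||_2.
    obj = cp.norm(Yt - Xt @ beta, 2) / np.sqrt(n) \
        + np.sqrt(rho) * cp.norm(cp.hstack([beta, np.ones(1)]), 2)
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value
```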
The equivalence between Wasserstein DRO problems and regularized regression problems
is a familiar general result in recent works. For example, Blanchet et al. (2019) and Gao
and Kleywegt (2023) derive similar duality results for distributionally robust regression
with q-Wasserstein distances for q > 1. Compared to previous works, our work is distinct
in the following aspects. First, we apply Wasserstein DRO to the IV estimation setting
instead of standard regression settings, such as OLS or logistic regression. Although the duality arguments are technically similar, the IV setting gives rise to a new asymptotic regime that uncovers interesting statistical properties of the resulting
estimators. Second, the regularization term in (13) is distinct from those in previous works,
which often use ∥β∥p with p ≥ 1. This seemingly innocuous difference turns out to be
crucial for our novel results on the Wasserstein DRIVE. Lastly, compared to the proof in
Blanchet et al. (2019), our proof of Theorem 3.1 is based on a different argument using the
Sherman-Morrison formula instead of Hölder’s inequality, which provides an independent
proof of the important duality result for Wasserstein distributionally robust optimization.
The regularized regression formulation of the Wasserstein DRIVE problem in (13) resembles
the standard ridge regularized (Hoerl and Kennard, 1970) TSLS regression:
min_β (1/n) Σᵢ (Yi − X̃iᵀβ)² + ρ∥β∥²   ⟺   min_β (1/n)∥Y − ΠZ Xβ∥² + ρ∥β∥².   (14)
We therefore refer to (13) as the square root ridge regularized TSLS. However, there
are three major distinctions between (13) and (14) that are essential in guaranteeing the
statistical properties of Wasserstein DRIVE not enjoyed by the standard ridge regularized
TSLS. First, the presence of square root operations on both the risk term and the penalty
term; second, the presence of a constant in the regularization term; third, an additional projection of the outcome Y onto the space spanned by the instruments.
In the standard regression setting without instrumental variables, the square root ridge estimator solves

min_β √((1/n)∥Y − Xβ∥²) + √(ρ(1 + ∥β∥²)),   (15)

and is closely related to the square root LASSO (Belloni et al., 2011).
In particular, both can be written as dual problems of Wasserstein DRO problems (Blanchet
et al., 2019). However, the square root LASSO is motivated by high-dimensional regression
settings where the dimension of X is potentially larger than the sample size n, but β is
very sparse. In contrast, our study of the square root ridge is motivated by its robustness
properties in the IV estimation setting, where the dimension of the endogenous variable is
small (often one-dimensional). In other words, variable selection is not the main focus of
this paper. A variant of the square root ridge estimator in (15) was also considered in the
standard regression setting by Owen (2007), who instead uses the penalty term ∥β∥2 .
As is well-known in the regularized regression literature (Fu and Knight, 2000), when the regularization parameter decays to 0 at a rate Op(1/√n), the ridge estimator is consistent.
A similar result also holds for the square root ridge (15) in the standard regression setting as long as the regularization parameter vanishes asymptotically. However, uncertainties about model assumptions, such as the validity of instruments, could persist even in large samples. Recall that ρ is also the robustness parameter in the DRO formulation (12). Therefore, the usual regime of vanishing regularization may not provide adequate robustness in the IV estimation setting. In the next section, we study the asymptotic properties of Wasserstein DRIVE when ρ does not necessarily vanish. In particular, we establish the consistency of Wasserstein DRIVE leveraging the three distinct features of (13) that are absent in the standard ridge regularized TSLS regression (14). This asymptotic result is, to our knowledge, new to the literature on regularized regression.
4 Asymptotic Theory of Wasserstein DRIVE
In this section, we leverage distinct geometric features of the square root ridge regression to
study the asymptotic properties of the Wasserstein DRIVE. In Section 4.1, we show that
the Wasserstein DRIVE estimator is consistent for any ρ ∈ [0, ρ̄], where ρ̄ depends on the
first stage coefficient γ. This property is a consequence of the consistency of the square root
ridge estimator in settings where the objective value at the true parameter vanishes, such
as the GMM estimation setting. It ensures that Wasserstein DRIVE can achieve better
finite sample performance thanks to its ridge type regularization, while at the same time
retaining asymptotic validity when instruments are valid. In Section 4.2, we characterize
the asymptotic distribution of Wasserstein DRIVE, and discuss several special settings where it coincides with that of the standard TSLS estimator.
4.1 Consistency of Wasserstein DRIVE
Recall the linear IV model

Y = β0ᵀ X + ϵ,
X = γᵀ Z + ξ.

Throughout this section, we make the standard assumptions that the instruments satisfy the relevance and exogeneity conditions in (3) and (4), ϵ, ξ are homoskedastic, the instruments Z are not perfectly collinear, and that E∥Z∥^{2k} < ∞, E∥ξ∥^{2k} < ∞, E|ϵ|^{2k} < ∞ for some k > 2. The
results can be extended in a straightforward manner when we relax these assumptions, e.g.,
only requiring that exogeneity holds asymptotically. Given i.i.d. samples from the linear IV
model, recall the regularized regression formulation of the Wasserstein DRIVE objective
min_β √((1/n) Σᵢ (ΠZ Y − ΠZ Xβ)²ᵢ) + √(ρn(∥β∥² + 1)).   (17)
Theorem 4.1 (Consistency of Wasserstein DRIVE). Let β̂n^DRIVE be the unique minimizer of (17), where ρn → ρ ≥ 0. Then under the relevance and exogeneity conditions (3) and (4), the Wasserstein DRIVE estimator β̂n^DRIVE converges in probability to the unique minimizer β^DRIVE of

min_β √((β − β0)ᵀ γᵀΣZ γ (β − β0)) + √(ρ(∥β∥² + 1)).   (18)

Moreover, whenever ρ ≤ λp(γᵀΣZγ), where γᵀΣZγ ∈ R^{p×p}, the unique minimizer of the objective (18) is the true causal effect, i.e., β^DRIVE ≡ β0.
Theorem 4.1 therefore guarantees consistency as long as the robustness parameter is bounded above by ρ̄ = λp(γᵀΣZγ). In the case when ΣZ = σZ²Id for σZ² > 0,
the upper bound ρ̄ is proportional to the square of the smallest singular value of the first stage coefficient γ, which is positive under the relevance condition (3). Recall that ρn is the radius of the Wasserstein ball in the min-max formulation of DRIVE in (12). Theorem 4.1 therefore guarantees that even when the robustness parameter ρn ≡ ρ ≠ 0, which implies the solution to the min-max problem is different from the TSLS estimator (ρ = 0), the Wasserstein DRIVE estimator remains consistent. The upper bound ρ̄ depends on the model, more precisely the variance covariance matrix ΣZ of Z and the first stage coefficient γ in the IV regression model. The maximum amount of robustness that can be afforded while retaining consistency is therefore determined by the strength and variance of the instrument. This relation can be described more precisely when ΣZ = σZ²Id, in which case the bound is proportional to σZ² and λp(γᵀγ). Both quantities improve the quality of the instruments: σZ² improves the proportion of variance of X and Y explained by the instrument vs. noise, while a γ far from rank deficiency avoids the weak instrument problem. Therefore, the robustness of Wasserstein DRIVE depends on the quality of the instruments, which also governs the efficiency of the standard TSLS estimator when errors are homoskedastic. This observation suggests an intrinsic connection between robustness and efficiency in the IV setting. See the discussions following Theorem 4.2.
More importantly, Theorem 4.1 is the first consistency result for regularized regression
estimators where the regularization parameter does not vanish with sample size. Although
regularized regression such as ridge and LASSO is often associated with better finite sample
performance at the cost of introducing some bias, our work demonstrates that, in the
IV estimation setting, we can get the best of both worlds. On one hand, the ridge type
regularization in Wasserstein DRIVE improves upon the finite sample properties of the standard TSLS estimator. On the other hand, with a bounded level of the regularization parameter ρ, Wasserstein
DRIVE can still achieve consistency. This is in stark contrast to existing asymptotic results
on regularized regression (Fu and Knight, 2000). Therefore, in the context of IV estimation,
with Wasserstein DRIVE we can achieve consistency and a certain amount of robustness at
the same time, by leveraging additional information in the form of valid instruments. The
maximum degree of robustness that can be achieved also has a natural interpretation in terms of instrument quality.
Theorem 4.1 also suggests the following procedure to construct a feasible and valid robustness parameter. Let γ̂ be the OLS regression estimator of the first stage coefficient γ and Σ̂Z an estimator of ΣZ, such as the heteroskedasticity-robust estimator (White, 1980). We can use any ρ̂ ≤ λp(γ̂ᵀΣ̂Zγ̂) to construct the Wasserstein DRIVE objective, i.e., any value bounded above by the smallest eigenvalue of γ̂ᵀΣ̂Zγ̂. Under the assumptions in Theorem 4.1, λp(γ̂ᵀΣ̂Zγ̂) → λp(γᵀΣZγ), which guarantees that the Wasserstein DRIVE estimator with parameter ρ̂ is consistent. In Section 5, we demonstrate the validity and superior finite sample performance of DRIVE based on this procedure.
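A minimal sketch of this selection rule (assuming mean-zero instruments so that ZᵀZ/n estimates ΣZ; names are ours):

```python
import numpy as np

def rho_upper_bound(X, Z):
    """Feasible bound rho_hat = lambda_p(gamma_hat' Sigma_hat_Z gamma_hat).

    X: (n, p) endogenous regressors; Z: (n, d) instruments, assumed mean zero.
    """
    gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)   # first stage OLS coefficients (d, p)
    Sigma_hat = Z.T @ Z / len(Z)                    # plug-in estimate of Sigma_Z
    M = gamma_hat.T @ Sigma_hat @ gamma_hat         # (p, p) symmetric PSD matrix
    return np.linalg.eigvalsh(M).min()              # smallest eigenvalue
```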
One may wonder why Wasserstein DRIVE can achieve consistency with a non-zero
regularization ρ. Here we briefly discuss the phenomenon that the limiting objective (18)
min_β √((β − β0)ᵀ γᵀΣZ γ (β − β0)) + √(ρ(∥β∥² + 1))

has a unique minimizer at β0 for bounded ρ > 0. The first term √((β − β0)ᵀ γᵀΣZ γ (β − β0)) achieves its minimum value of 0 at β = β0. When ρ is small, the effect of adding the regularization term √(ρ(∥β∥² + 1)) does not overwhelm the first term, especially when its
Figure 1: Plot of |β − 1| + √ρ · √(β² + 1), which is the dual limit objective function in the one-dimensional case with σZ² = γ = β0 = 1, for ρ ∈ {0.5, 1, 2, 5}. We also plot the limit of the standard ridge loss (β − 1)² + β² (λ = 1). For ρ ≤ 2, the minimum is achieved at β = 1, while for ρ = 5 and for the standard ridge, the minimizer shrinks away from β = 1.
curvature at β0 is large. As a result, we may expect the minimizer to not deviate much from
β0 . While this intuition is reasonable qualitatively, it does not fully explain the fact that
the minimizer does not change for small ρ. In the standard regression setting, the same
intuition can be applied to the standard ridge regularization, but we know shrinkage occurs
as soon as ρ > 0. The key distinction of (17) turns out to be the square root operations we
apply to the loss and regularization terms, which endows the objective with a geometric
interpretation, and ensures that the minimizer does not deviate from β0 unless ρ is above
some positive threshold. We call this phenomenon the “delayed shrinkage” of the square root
ridge, as shrinkage does not happen until the regularization is large enough. We illustrate it
with a simple example in Fig. 1, where the minimizer of the limiting square root objective remains at β0 = 1 until ρ exceeds a positive threshold.
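The delayed shrinkage is easy to reproduce numerically by minimizing the one-dimensional limit objective over a grid (grid settings are ours):

```python
import numpy as np

beta = np.linspace(-2.0, 4.0, 60001)  # fine grid around beta0 = 1
for rho in [0.5, 1.0, 2.0, 5.0]:
    sqrt_ridge = np.abs(beta - 1) + np.sqrt(rho) * np.sqrt(beta**2 + 1)
    ridge = (beta - 1) ** 2 + rho * beta**2
    print(rho, beta[sqrt_ridge.argmin()], beta[ridge.argmin()])
# Square root ridge: minimizer stays at 1.0 for rho <= 2, then shrinks
# (to 1/sqrt(rho - 1) for rho > 2); standard ridge: 1/(1 + rho) for any rho > 0.
```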
A crucial feature of the Wasserstein DRIVE objective is that both the outcome and the
covariates are regressed on the instrument to compute their predicted values. In other words, the residual in (13) is ΠZ Y − ΠZ Xβ rather than Y − ΠZ Xβ. For standard TSLS estimation (ρ = 0), there is no substantial difference between the two objectives, since
their minimizers are exactly the same, due to the idempotent property ΠZ² = ΠZ. In fact,
in applications of TSLS, the outcome variable is often not regressed on the instrument.
However, Wasserstein DRIVE is consistent for positive ρ only if the outcome Y is also
projected onto the instrument space. In other words, the following problem does not yield a consistent estimator for non-vanishing ρ:

min_β √((1/n)∥Y − ΠZ Xβ∥²) + √(ρ(∥β∥² + 1)).

The reason behind this phenomenon is that (1/n)∥ΠZ Y − ΠZ Xβ∥² is a GMM objective with the optimal weighting matrix,

((1/n) Σᵢ Zi(Yi − βᵀXi))ᵀ ((1/n) ZᵀZ)⁻¹ ((1/n) Σᵢ Zi(Yi − βᵀXi)),

which vanishes at β = β0 in the large sample limit, whereas (1/n)∥Y − ΠZ Xβ∥² does not vanish even at β0. In the former case, the geometric properties of the square root ridge then deliver the delayed shrinkage phenomenon described above, and hence consistency with non-vanishing ρ.
4.2 Asymptotic Distribution of Wasserstein DRIVE
Having established the consistency of Wasserstein DRIVE with bounded ρ, we now turn to its asymptotic distribution, which in general differs from that of the TSLS estimator. We will also examine several special cases relevant in practice where they coincide.
Theorem 4.2 (Asymptotic Distribution of Wasserstein DRIVE). Under the assumptions of Theorem 4.1, the Wasserstein DRIVE estimator β̂n^DRIVE has asymptotic distribution characterized by the following optimization problem:
√n (β̂^DRIVE − β0) →d argmin_δ √((Z + ΣZ γδ)ᵀ ΣZ⁻¹ (Z + ΣZ γδ)) + (√ρ β0ᵀ / √(1 + ∥β0∥²)) δ.   (19)
In particular, when ρn → 0 at any rate, we have

√n (β̂^DRIVE − β0) →d N(0, σ²(γᵀΣZγ)⁻¹),
which is the asymptotic distribution for TSLS estimators with homoskedastic errors ϵ.
Recall that the maximal robustness parameter ρ̄ of Wasserstein DRIVE while still being consistent is λp(γᵀΣZγ), which is inversely related to the largest eigenvalue of the asymptotic variance of the TSLS estimator. Therefore, as the efficiency of TSLS increases, so does the robustness of the associated Wasserstein DRIVE estimator. The “price” to pay for robustness when ρ > 0 is an interesting question. It is clear from Fig. 1 that the curvature of the population objective decreases as ρ increases. Since the objective becomes flatter near the minimizer, we may expect larger finite sample variability, although our theory does not fully characterize this behavior. Note that the asymptotic distribution of the TSLS estimator is characterized by the minimizer of

(Z + ΣZ γδ)ᵀ ΣZ⁻¹ (Z + ΣZ γδ),

which corresponds to setting ρ = 0 in (19).
Theorem 4.2 implies that in general the asymptotic distributions of Wasserstein DRIVE and
TSLS are different when ρ > 0. However, there are still several cases relevant in practice
Corollary 4.3. In the following cases, the asymptotic distribution of Wasserstein DRIVE coincides with that of the standard TSLS estimator:
1. When ρ = 0;
2. When p = 1, i.e., there is a single endogenous variable;
3. When β0 ≡ 0.
The second case covers the most common setting of IV estimation, since in practice we are often interested in the causal effect of a single
endogenous variable, for which we have a single instrument. The case when β0 ≡ 0 is
also very relevant, since an important question in practice is whether the causal effect of a
variable is zero. Our theory suggests that the asymptotic distribution of Wasserstein DRIVE
should be the same as that of the TSLS when the causal effect is zero and d > 1, even for
ρ > 0. Based on this observation, intuitively, we should expect the Wasserstein DRIVE and TSLS estimators to be “close” to each other. If the estimators or their asymptotic variance
estimators differ significantly, then β0 may not be identically 0. We can design statistical
tests by leveraging this intuition. For example, we can construct test statistics using the
TSLS estimator and the DRIVE estimator with ρ > 0, such as the difference β̂ DRIVE − β̂ TSLS .
Then we can use bootstrap-based tests, such as a bootstrapped permutation test, to assess
the null hypothesis that β̂ DRIVE − β̂ TSLS = 0. If we fail to reject the null hypothesis, then we cannot rule out that the causal effect β0 is zero.
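A sketch of one such bootstrap procedure (purely illustrative; `tsls` and `drive` are estimator routines as sketched earlier, and the statistic and resampling scheme are one simple choice among many):

```python
import numpy as np

def bootstrap_difference(Y, X, Z, rho, n_boot=500, seed=0):
    """Bootstrap the statistic ||beta_DRIVE - beta_TSLS|| by resampling rows."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    observed = np.linalg.norm(drive(Y, X, Z, rho) - tsls(Y, X, Z))
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # nonparametric bootstrap sample
        stats.append(np.linalg.norm(drive(Y[idx], X[idx], Z[idx], rho)
                                    - tsls(Y[idx], X[idx], Z[idx])))
    # Compare `observed` against the bootstrap distribution in `stats`
    # to assess the null hypothesis that the two estimators coincide.
    return observed, np.asarray(stats)
```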
Corollary 4.3 can be seemingly pessimistic because it demonstrates that the asymptotic
distribution of Wasserstein DRIVE could be the same as that of the TSLS in special cases.
However, recall that Wasserstein DRIVE is formulated to minimize the worst-case risk
over a set of distributions that are designed to capture deviations from model assumptions.
Therefore, there is not actually any a priori reason that it should coincide with the TSLS
when ρ > 0. In this sense, the fact that the Wasserstein DRIVE is consistent with ρ > 0
and may even coincide with TSLS is rather surprising. In the latter case, the worst-case
distribution for Wasserstein DRIVE in the large sample limit must coincide with the reference distribution itself.
The asymptotic results we develop in this section provide the basis on which one can
perform estimation and inference with the Wasserstein DRIVE estimator. In the next section,
we study the finite sample properties of DRIVE in simulation studies and demonstrate that
it is superior in terms of estimation error and out of sample prediction compared to other
popular estimators.
5 Numerical Studies
In this section, we study the empirical performance of Wasserstein DRIVE. Our results
deliver three main messages. First, we demonstrate with simulations that Wasserstein
DRIVE, with non-zero robustness parameter ρ based on Theorem 4.1, has comparable
performance as the standard IV estimator whenever instruments are valid. Second, when instruments are invalid, Wasserstein DRIVE outperforms alternative estimators in terms of RMSE. Third, on the education dataset of Card (1993), Wasserstein DRIVE also has superior out-of-sample prediction performance under distributional shifts between training and test data. We generate data from the following model:

Y = Xβ0 + Zη + U,
X = γZ + U,
Z = U βUZ + ϵZ,
where U, ϵZ ∼ N (0, σ 2 ) and we allow a direct effect η from the instruments Z to the outcome
Y . Moreover, the instruments Z can also be correlated with the unobserved confounder U
(βU Z ̸= 0). We fix the true parameters and generate independent datasets from the model,
varying the degree of instrument invalidity. In Table 1, we report the MSE of estimators
averaged over 500 repeated experiments. We control the degree of instrument invalidity
by varying η, the direct effect of instruments on the outcome, and βU Z , the correlation
between unobserved confounder and instruments. Results in Table 1 are based on data
where ∥γ∥ ≫ 0, i.e., the instruments are strong. We see that in this case, Wasserstein DRIVE
performs as well as TSLS when instruments are valid, but performs significantly better than
OLS, TSLS, anchor, and TSLS ridge when instruments become invalid. This suggests that
DRIVE could be preferable in practice when we are concerned about instrument validity.
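A sketch of the data generating process used in this section (specific parameter values shown here are ours):

```python
import numpy as np

def simulate(n, beta0=1.0, gamma=1.0, eta=0.3, beta_uz=0.15, sigma=0.5, seed=0):
    """Draw one dataset from the potentially invalid instrument model of Section 5."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma, n)        # unobserved confounder
    eps_Z = rng.normal(0.0, sigma, n)
    Z = U * beta_uz + eps_Z              # instrument, confounded when beta_uz != 0
    X = gamma * Z + U                    # first stage
    Y = X * beta0 + Z * eta + U          # direct effect eta violates exclusion
    return Y, X, Z
```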
[Table 1 (numerical entries not recovered): MSEs of OLS, TSLS, anchor, TSLS ridge, and DRIVE under varying degrees of instrument invalidity (η, βUZ), with σ = 0.5. For TSLS ridge the regularization parameter is selected using cross validation; for anchor regression it is selected based on the proposal in Rothenhäusler et al. (2021); for DRIVE the regularization parameter is selected using nonparametric bootstrap of the score quantile (Appendix E).]
We next consider settings where instruments are potentially invalid or weak. We present box plots (omitting outliers) of MSEs in Fig. 2. DRIVE with regularization parameter selected using bootstrapped quantiles of the score function consistently outperforms OLS, TSLS, anchor
(k-class), and TSLS with ridge regularization. Moreover, the selected penalties increase as
the direct effect of Z on Y or the correlation between the unobserved confounder U and
the instrument Z increases, i.e., as the model assumption of valid instruments becomes
increasingly invalid. See Fig. 5 in Appendix Section E for more details. This property
is highly desirable, because based on the DRO formulation of DRIVE, ρ represents the
amount of robustness against distributional shifts associated with the estimator, which
should increase as the instruments become more invalid (larger distributional shift). Box
plots of estimation errors in Fig. 2 also verify that even when instruments are valid, the
finite sample performance of Wasserstein DRIVE is still better compared to the standard
IV estimator, suggesting that there is no additional cost in terms of MSE when applying Wasserstein DRIVE.
[Figure 2 panels: box plots of MSE for OLS, TSLS, k-class (anchor), TSLS+ridge, and DRIVE, across direct effects η ∈ {0.0, 0.15, 0.3, 0.45, 0.6} of Z on Y, with Corr(U, Z) ∈ {0.0, 0.15, 0.3}.]
Figure 2: MSEs of estimators when instruments are potentially invalid. Instrument Z can
have direct effects η on the outcome Y , or be correlated with the unobserved confounder U .
We now turn our attention to a different task that has received more attention in recent years,
especially in the context of policy learning and estimating causal effects across heterogeneous
populations (Dehejia et al., 2021; Adjaho and Christensen, 2022; Menzel, 2023). We study
the prediction performance of estimators when they are estimated on a dataset (training
set) that has a potentially different distribution from the dataset for which they are used
to make predictions (test set). We demonstrate that whenever the distributions between
training and test datasets have significant differences, the prediction error of Wasserstein
DRIVE is significantly smaller than that of OLS, IV, and anchor (k-class) estimators.
We conduct our numerical study using the classic dataset on the return of education
to wage compiled by David Card (Card, 1993). Here, the causal inference problem is
estimating the effect of additional school years on the increase in wage later in life. The
sample comes from one of nine regions in the United States, which differ in the average
number of years of schooling and other characteristics, i.e., there are covariate shifts in
data collected from different regions. Our strategy is to divide the dataset into a training
set and a test set based on the relative ranks of their average years of schooling, which is
the endogenous variable. We expect that if there are distributional shifts between different
regions, then predicting wages using education and other information using conventional
models trained on the training data may not yield a good performance on the test data.
Since each sample is labeled as working in 1976 in one of nine regions in the U.S., we
split the samples based on these labels, using number of years of education as the splitting
variable. For example, we can construct the training set by including samples from the top
6 regions with the highest average years of schooling, and the test set to consist of samples
coming from the bottom 3 regions with the lowest average years of schooling. In this case,
we would expect the training and test sets to have come from different distributions. Indeed,
the average years of schooling differs by more than one year between the two sets, and the difference is statistically significant.
In splitting the samples based on the distribution of the endogenous variable, we are also
motivated by the long-standing debates revolving around the use of instrumental variables
in classic economic studies (Card and Krueger, 1994). A leading concern is the validity of
instruments. In the case of the study on educational returns, the validity of estimation and
inference require that the instruments (proximity to college and quarter of birth) are not
correlated with unobserved characteristics that may also affect their earnings. The following quote illustrates this concern:
“In the case of quasi or natural experiments, however, inferences are based on […] of birth. The use of these differences to draw causal inferences about the effect […]”
[Table 2 (numerical entries not recovered): test set MSEs, with standard errors in parentheses, of OLS, TSLS, DRIVE, anchor regression, ridge, and TSLS ridge; one split trains on the top 3 educated regions (841 samples).]
Table 2: Comparison of estimation methods in terms of MSE on test data. Here the training and test datasets are split according to the 9 regions in the Card college proximity dataset based on their average education levels. In this specification, we did not include […].
If the instruments are invalid, the resulting estimates may be unreliable for the wider population, and we evaluate the performance based on how well they generalize to other groups of the population with potential distributional or covariate shifts. In Table 2, we compare the test set MSE of OLS, IV, Wasserstein DRIVE, anchor regression, ridge, and ridge regularized IV estimators. We see that Wasserstein DRIVE achieves the best out-of-sample prediction performance under these distributional shifts.
6 Concluding Remarks
In this paper, we propose Wasserstein DRIVE, a distributionally robust IV estimation framework. Our approach is motivated by two main considerations in practice. The first is
the concern about model mis-specification in IV estimation, most notably the validity of
instruments. Second, going beyond estimating the causal effect for the endogenous variable,
practitioners may also be interested in making good predictions with the help of instruments
when there is heterogeneity between training and test datasets, e.g., generalizing from one subpopulation to another.
We argue that both challenges can be naturally unified as problems of distributional shifts, which our DRO formulation explicitly accounts for. We show that Wasserstein DRIVE minimizes a square root ridge regularized TSLS problem, and reveal a distinct property of the resulting estimator: it is consistent even when the regularization parameter does not vanish. We also characterize the asymptotic distribution of the Wasserstein DRIVE, and establish a few special cases when it coincides with that
of the standard TSLS estimator. Numerical studies suggest that Wasserstein DRIVE has
superior finite sample performance in two regards. First, it has lower estimation error when
instruments are potentially invalid, but performs as well as the TSLS when instruments are
valid. Second, it outperforms existing methods at the task of predicting outcomes under
distributional shifts between training and test data. These findings provide support for the
appeal of our DRO approach to IV estimation, and suggest that Wasserstein DRIVE could
be preferable in practice to standard IV methods. Finally, there are many future research
directions of interest, such as further results on inference and testing, as well as connections between robustness, regularization, and causality.
Acknowledgements
We are indebted to Han Hong, Guido Imbens, and Yinyu Ye for invaluable advice and
guidance throughout this project, and to Agostino Capponi, Timothy Cogley, Rajeev
Dehejia, Yanqin Fan, Alfred Galichon, Rui Gao, Wenzhi Gao, Vishal Kamat, Samir Khan,
Frederic Koehler, Michal Kolesár, Simon Sokbae Lee, Greg Lewis, Elena Manresa, Konrad
Menzel, Axel Peytavin, Debraj Ray, Martin Rotemberg, Vasilis Syrgkanis, Johan Ugander,
and Ruoxuan Xiong for helpful discussions and suggestions. This work was supported in
References
Adjaho, C. and Christensen, T. (2022). Externally valid treatment choice. arXiv preprint
arXiv:2205.05561, 1.
Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of Mathematical Statistics, 20(1):46–63.
metrics, 4:2247–2294.
Andrews, D. W. (1999). Consistent moment selection procedures for generalized method of moments estimation. Econometrica, 67(3):543–564.
Econometrics, 3:122–173.
Andrews, I., Stock, J. H., and Sun, L. (2019). Weak instruments in instrumental variables regression: Theory and practice. Annual Review of Economics, 11:727–753.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using
659.
Baum, C. F., Schaffer, M. E., and Stillman, S. (2003). Instrumental variables and GMM: Estimation and testing. The Stata Journal, 3(1):1–31.
Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C., and Kato, K. (2018). High-
Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.
Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
Bennett, A. and Kallus, N. (2023). The variational method of moments. Journal of the
Berkowitz, D., Caner, M., and Fang, Y. (2008). Are “nearly exogenous instruments” reliable?
Bertsimas, D., Imai, K., and Li, M. L. (2022). Distributionally robust causal inference with
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
Blanchet, J., Kang, Y., and Murthy, K. (2019). Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857.
Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600.
Blanchet, J., Murthy, K., and Si, N. (2022). Confidence regions in wasserstein distributionally
Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables
estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443–450.
Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalid
instruments: effect estimation and bias detection through egger regression. International
Bowden, J., Davey Smith, G., Haycock, P. C., and Burgess, S. (2016). Consistent estimation
426.
Burgess, S., Bowden, J., Dudbridge, F., and Thompson, S. G. (2016). Robust instrumental
Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. (2020). A robust and
efficient method for mendelian randomization with hundreds of genetic variants. Nature
communications, 11(1):376.
Burgess, S., Thompson, S. G., and Collaboration, C. C. G. (2011). Avoiding bias from weak
40(3):755–764.
Caner, M. and Kock, A. B. (2018). High dimensional linear gmm. arXiv preprint
arXiv:1811.08779.
Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. NBER Working Paper No. 4483.
Card, D. (1999). The causal effect of education on earnings. Handbook of labor economics,
3:1801–1863.
Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the
fast-food industry in new jersey and pennsylvania. American Economic Review, 84(4).
Chamberlain, G. and Imbens, G. (2004). Random effects estimators with many instrumental
Chao, J. C. and Swanson, N. R. (2005). Consistent estimation with a large number of weak
Chen, X., Hansen, L. P., and Hansen, P. G. (2021). Robust inference for moment condition
Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Post-selection and post-
regularization inference in linear models with many controls and instruments. American
Conley, T. G., Hansen, C. B., and Rossi, P. E. (2012). Plausibly exogenous. Review of
Davey Smith, G. and Ebrahim, S. (2003). ‘mendelian randomization’: can genetic epidemi-
Dehejia, R., Pop-Eleches, C., and Samii, C. (2021). From local to global: External validity in
Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty
Duchi, J. C., Glynn, P. W., and Namkoong, H. (2021). Statistics of robust optimization: A
969.
20(1):73–88.
Emdin, C. A., Khera, A. V., and Kathiresan, S. (2017). Mendelian randomization. Jama,
318(19):1925–1926.
Fan, J., Fang, C., Gu, Y., and Zhang, T. (2024). Environment invariant linear least squares.
Fan, Y., Park, H., and Xu, G. (2023). Quantifying distributional model risk in marginal
Fu, W. and Knight, K. (2000). Asymptotics for lasso-type estimators. The Annals of
statistics, 28(5):1356–1378.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information
Guo, Z., Kang, H., Cai, T. T., and Small, D. S. (2018a). Testing endogeneity with high
Guo, Z., Kang, H., Tony Cai, T., and Small, D. S. (2018b). Confidence intervals for causal
effects with invalid instruments by using two-stage hard thresholding with voting. Journal
Hahn, J. and Hausman, J. (2005). Estimation with valid and invalid instruments. Annales
Hahn, J., Hausman, J., and Kuersteiner, G. (2004). Estimation with weak instruments:
7(1):272–306.
pages 230–255.
Hausman, J. A., Newey, W. K., Woutersen, T., Chao, J. C., and Swanson, N. R. (2012).
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical
Statistics, 35(1):73–101.
Huber, P. J. and Ronchetti, E. M. (2011). Robust statistics. John Wiley & Sons.
Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical
30(2):257–280.
Jiang, W. (2017). Have instrumental variables brought us closer to the truth. Review of
Kaji, T., Manresa, E., and Pouliot, G. (2020). An adversarial approach to structural
Kang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimation
with some invalid instruments and its application to mendelian randomization. Journal
Kantorovich, L. V. (1942). On the translocation of masses. In Dokl. Akad. Nauk. USSR
Kitamura, Y., Otsu, T., and Evdokimov, K. (2013). Robustness, infinitesimal neighborhoods,
Kolesár, M., Chetty, R., Friedman, J., Glaeser, E., and Imbens, G. W. (2015). Identification
and inference with many invalid instruments. Journal of Business & Economic Statistics,
33(4):474–484.
Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein
Operations research & management science in the age of analytics, pages 130–166. Informs.
Association, 75(371):693–700.
Lei, L., Sahoo, R., and Wager, S. (2023). Policy learning under biased sample selection.
preprint arXiv:1803.07164.
McDonald, J. B. (1977). The k-class estimators as least variance difference estimators.
Menzel, K. (2023). Transfer estimates for causal effects across heterogeneous sites. arXiv
preprint arXiv:2305.01435.
tion using the wasserstein metric: performance guarantees and tractable reformulations.
Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments.
Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the
pages 575–595.
estimator and its t-ratio when the instrument is a poor one. Journal of Business, pages
S125–S140.
Nelson, C. R. and Startz, R. (1990b). Some further results on the exact small sample prop-
Olkin, I. and Pukelsheim, F. (1982). The distance between two random vectors with given
Owen, A. B. (2007). A robust hybrid of lasso and ridge regression. Contemporary Mathe-
matics, 443(7):59–72.
Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal Inference by using Invariant
Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econo-
Prékopa, A. (2013). Stochastic programming, volume 324. Springer Science & Business
Media.
covariate in an observational study with binary outcome. Journal of the Royal Statistical
Rothenhäusler, D., Meinshausen, N., Bühlmann, P., Peters, J., et al. (2021). Anchor
regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society
Series B, 83(2):215–246.
Sahoo, R., Lei, L., and Wager, S. (2022). Learning from a biased sample. arXiv preprint
arXiv:2209.01754.
Sanderson, E. and Windmeijer, F. (2016). A weak instrument f-test in linear iv models with
Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables.
21(1):124–127.
Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2017). Certifying some distributional
Small, D. S. (2007). Sensitivity analysis for instrumental variables regression with overiden-
Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments.
68(5):1055–1096.
Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and
Statistics, 20(4):518–529.
Stock, J. H. and Yogo, M. (2002). Testing for weak instruments in linear iv regression.
Theil, H. (1953). Repeated least squares applied to complete equation systems. The Hague:
Theil, H. (1961). Economic forecasts and policy.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes:
VanderWeele, T. J., Tchetgen, E. J. T., Cornelis, M., and Kraft, P. (2014). Methodological
Villani, C. (2009). Optimal transport: old and new, volume 338. Springer.
Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior,
2nd rev.
Wang, Z., Glynn, P. W., and Ye, Y. (2016). Likelihood robust optimization for data-driven
Windmeijer, F., Farbmacher, H., Davies, N., and Smith, G. D. (2018). On the use of the
lasso for instrumental variables estimation with some invalid instruments. Journal of the
Windmeijer, F., Liang, X., Hartwig, F. P., and Bowden, J. (2021). The confidence interval
method for selecting valid instrumental variables. Journal of the Royal Statistical Society
Appendix A Background and Preliminaries
A.1 The Wasserstein Distance
We first formally define the Wasserstein distance and discuss relevant results useful in this
paper. The Wasserstein distance is a metric on the space of probability distributions defined
based on the optimal transport problem. More specifically, given any Polish space X with
metric d, let P(X ) be the set of Borel probability measures on X and P, Q ∈ P(X ). For
exposition, we assume they have densities f1 and f2 , respectively, although the Wasserstein
distance is well-defined for more general probability measures using the concept of push-
forwards (Villani, 2009). The optimal transport problem, whose study was pioneered by Kantorovich (1942, 1960), aims to find the joint probability distribution between P and Q that minimizes the expected transport cost:

inf_{π ∈ Π(P,Q)} ∫ d(x, y)^p dπ(x, y),   (20)

where p ≥ 1 and Π(P, Q) denotes the set of couplings of P and Q. The p-Wasserstein distance Wp(P, Q) is defined to be the p-th root of the optimal value of the optimal transport problem above. The Wasserstein distance is a metric on the space P(X) of probability measures, and the dual problem of (20) is given by the celebrated Kantorovich duality (Villani, 2009).
It should be noted that the term “Wasserstein metric” for the optimal transport distance is a historical accident: Kantorovich deserves the credit for pioneering the theory of optimal transport and proposing the metrics. However, due to a work of Wasserstein (Vaserstein, 1969), which briefly discussed the optimal transport distance, being more well-known in the West initially, the terminology of Wasserstein metric persisted until today (Vershik, 2013). The optimal transport problem has also been studied in many other contexts.
One of the appeals of the Wasserstein distance when formulating distributionally robust
optimization problems lies in the tractability of the dual DRO problem. Specifically, in (1) the inner supremum is an infinite dimensional optimization problem for every β, and is in general not tractable. However, if D is the Wasserstein distance, the inner problem has a tractable dual minimization problem, which when combined with the outer minimization problem over β, yields a simple and tractable minimization problem. This will allow us to efficiently compute the WDRO estimator. Moreover, it underlies the duality result in Theorem 3.1.
Let c ∈ L1(X) be a general loss function and P ∈ P(X) with density f1. The following general duality result (Gao and Kleywegt, 2023; Blanchet and Murthy, 2019) provides a dual reformulation of the inner maximization problem.
A.2 k-class Estimators and Anchor Regression
In the anchor regression framework of Rothenhäusler et al. (2021), the baseline distribution P0 is generated by a linear structural equation model (SEM), in which X are the endogenous regressors, and A are “anchors” that can be understood as potentially invalid instrumental variables that may violate the exclusion restriction. Under this SEM, Rothenhäusler et al. (2021) posit that the potential deviations from the reference distribution P0 are driven by
bounded uncertainties in the anchors A. Their main result provides a DRO interpretation of
a modified population version of the IV regression that interpolates between the IV and OLS objectives:

min_β E[(Y − Xᵀβ)²] + (γ − 1)E[(PA(Y − Xᵀβ))²] = min_β sup_{v ∈ Cγ} Ev[(Y − Xᵀβ)²].   (22)

The set of distributions Ev induced by v ∈ Cγ are defined via the following SEM with a bounded set Cγ:

(Xᵀ, Y, Uᵀ)ᵀ = (I − B)⁻¹ v,   Cγ := {v : vvᵀ ⪯ γ M E(AAᵀ) Mᵀ}.   (23)
In the interpolated objective in Eq. (22), PA(·) = E(· | A) and E[(PA(Y − Xᵀβ))²] is
the population version of the IV (TSLS) regression objective with A as the instrument.
Letting κ = 1 − 1/γ, we can rewrite the interpolated objective on the left hand side in (22)
equivalently as

min_β E[(PA(Y − Xᵀβ))²] + ((1 − κ)/κ) E[(Y − Xᵀβ)²],   (24)
which can be interpreted as “regularizing” the IV objective with the OLS objective, with
penalty parameter (1 − κ)/κ. Jakobsen and Peters (2022) observe that the finite sample
version of the objective in (24) is precisely that of a k-class estimator with parameter
κ (Theil, 1961; Nagar, 1959). This observation together with (22) therefore provides a DRO interpretation of k-class estimators, which is also extended by Jakobsen and Peters (2022) to the case where the distributional shifts v are unbounded, in which case the IV estimator has a distributionally robust interpretation via (22).
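To make the finite sample objective implied by (24) concrete, the following sketch (our illustration with a simulated data generating process, not code from the paper) implements the k-class estimator it yields: minimizing κ∥P_Z(Y − Xb)∥² + (1 − κ)∥Y − Xb∥² over b interpolates between OLS (κ = 0) and TSLS (κ = 1).

```python
# k-class estimator from the finite sample version of objective (24):
#   b(kappa) = [X'((1-kappa)I + kappa*P_Z)X]^{-1} X'((1-kappa)I + kappa*P_Z)Y.
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 5000, 1.0
U = rng.normal(size=n)                              # unobserved confounder
Z = rng.normal(size=(n, 1))                         # instrument
X = (Z[:, 0] + U + rng.normal(size=n))[:, None]     # endogenous regressor
Y = beta0 * X[:, 0] + U + rng.normal(size=n)

def k_class(kappa):
    ZtZ, ZtX, ZtY = Z.T @ Z, Z.T @ X, Z.T @ Y
    XtPzX = ZtX.T @ np.linalg.solve(ZtZ, ZtX)       # X'P_Z X without forming P_Z
    XtPzY = ZtX.T @ np.linalg.solve(ZtZ, ZtY)       # X'P_Z Y
    A = (1 - kappa) * (X.T @ X) + kappa * XtPzX
    b = (1 - kappa) * (X.T @ Y) + kappa * XtPzY
    return np.linalg.solve(A, b)

for kappa in [0.0, 0.5, 1.0]:   # OLS is biased upward here; TSLS is consistent
    print(kappa, k_class(kappa))
```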
The DRO interpretation (22) of k-class estimators sheds new light on some old wisdom on
IV estimation. As has already been observed and studied by a large literature (Richardson,
1968; Nelson and Startz, 1990a,b; Bound et al., 1995; Staiger and Stock, 1997; Hahn et al.,
2004; Burgess et al., 2011; Andrews et al., 2019; Young, 2022), when instruments are weak,
the usual normal approximation to the distribution of the IV estimator may be very poor,
and the IV estimator is biased in small samples and under weak instrument asymptotics.
Moreover, a small violation of the exclusion restriction, i.e., direct effect of instruments
on the outcome, can result in large bias when instruments are weak (Angrist and Pischke,
2009). Consequently, IV may not perform as well as the OLS estimator in such a setting.
Regularizing the IV objective by the OLS objective in (24) can therefore alleviate the
weak instrument problem. This improvement has also been observed for k-class estimators
with particular choices of κ (Fuller, 1977; Hahn et al., 2004). The DRO interpretation complements the intuition above based on regularizing the IV objective with the OLS objective: instead of treating weak instruments as deviations from standard modeling assumptions (strong first stage effects), a distributionally robust formulation incorporates them directly. In the case of the anchor regression, the distributional uncertainty set indexed by v ∈ Cγ always contains distributions under which the instruments are weak, obtained by selecting appropriate ∥v∥ ≈ 0. Therefore, the DRO formulation (22) of k-class estimators
demonstrates that they are robust against the weak instrument problem by design. An
additional insight of the DRO formulation is that k-class estimators and anchor regression
are also optimal in terms of predicting Y with X when the distribution of (X, Y ) could
change between the training and test datasets in a bounded manner induced by the anchors
A.
On the other hand, the DRO interpretation of k-class estimators also exposes their potential limitations. First of all, the ambiguity set in (23) does not in fact contain the reference distribution P0 itself for any finite robustness parameter, which is unsatisfactory. Moreover, the SEM in (23) also implies that the instruments (anchors) A cannot be influenced by the unobserved confounder U, which is a major source of uncertainty regarding the validity of instruments in practice. One can therefore only interpret k-class estimators as being robust against weak instruments (Young, 2022), since they minimize an objective that interpolates between OLS and IV. On the other hand, the DRIVE approach we propose in this paper is by design robust against invalid instruments, as the ambiguity set captures distributional shifts arising from conditional correlations between the instruments and unobserved confounders. Although DRO and causal inference have historically developed largely independently of each other, recent works have started to explore their interesting connections, and our work can be viewed as an effort in this direction.
Appendix B Related Literature on Distributionally Robust Optimization
DRO has been an important research area in operations research, and traces its origin to game theory (Von Neumann and Morgenstern, 1947). Scarf (1958) first studied DRO in the context of inventory control under uncertainties about future demand distributions. This work was followed by a line of research in min-max stochastic optimization models, notably the works of Shapiro and Kleywegt (2002), Calafiore and Ghaoui (2006), and Ruszczyński, as well as Dupačová (1987); Prékopa (2013); Bertsimas and Popescu (2005); Delage and Ye (2010); Iyengar (2005); Wang et al. (2016). In recent years, distributional uncertainty sets based on
the Wasserstein metric have gained traction, appearing in Mohajerin Esfahani and Kuhn
(2018); Blanchet et al. (2019); Blanchet and Murthy (2019); Duchi et al. (2021), partly due
to their close connections to regularized regression, such as the LASSO (Tibshirani, 1996;
Belloni et al., 2011) and regularized logistic regression. Other works employ alternative
divergence measures, such as the KL divergence (Hu and Hong, 2013) and more generally
ϕ-divergence (Ben-Tal et al., 2013). In this work, we focus on DRO based on the Wasserstein
metric, originally proposed by Kantorovich (1942) in the context of optimal transport, which
has also become a popular tool in economics in recent years (Galichon, 2018, 2021).
DRO has gained traction in causal inference problems in econometrics and statistics
very recently. For example, Kallus and Zhou (2021); Adjaho and Christensen (2022); Lei
et al. (2023) apply DRO in policy learning to handle distributional uncertainties. Chen et al. apply DRO in the estimation of structural models. Sahoo et al. (2022) use distributional shifts to model
sampling bias. Bertsimas et al. (2022) study DRO versions of classic causal inference
frameworks. Fan et al. (2023) study distributional model risk when data comes from
multiple sources and only marginal reference measures are identified. DRO is also connected
to the literature in macroeconomics on robust control (Hansen and Sargent, 2010). A related
recent line of works in econometrics also employs a min-max approach to estimation (Lewis
and Syrgkanis, 2018; Kaji et al., 2020; Metzger, 2022; Cigliutti and Manresa, 2022; Bennett
and Kallus, 2023), inspired by adversarial networks from machine learning (Goodfellow
et al., 2014). These works leverage adversarial learning to enforce a large, possibly infinite, set of moment conditions. In contrast, the emphasis of the min-max formulation in our paper is to capture the potential violations of model assumptions as distributional shifts.
The DRO approach that we propose in this paper is motivated by this recent line of works connecting causal inference with concepts of invariance and robustness (Peters et al., 2016; Meinshausen, 2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022). In particular, Rothenhäusler et al. (2021); Jakobsen and Peters (2022) provide DRO interpretations of anchor regression and k-class estimators. In contrast to their work, instead of constructing the distribution set based on marginal or joint distributions of the observed data, our ambiguity set is constructed from the distribution of the projected data, and the resulting DRO problem is then reformulated as a ridge type regularized IV estimation problem. In this regard, our approach is complementary to theirs.
Our work is also closely related to the classic literatures in econometrics and statistics on instrumental variables estimation, which dates back to the works of Theil (1953) and Nagar (1959), and became widely used in applied fields in economics. Since
then, many works have investigated potential challenges to instrumental variables estimation
and their solutions, including invalid instruments (Fisher, 1961; Hahn and Hausman, 2005;
Berkowitz et al., 2008; Kolesár et al., 2015) and weak instruments (Nelson and Startz,
1990a,b; Staiger and Stock, 1997; Murray, 2006; Andrews et al., 2019). Tests of weak
instruments have been proposed by Stock and Yogo (2002) and Sanderson and Windmeijer
(2016). Notably, the test of Stock and Yogo (2002) for multi-dimensional instruments is
based on the minimum eigenvalue rank test statistic of Cragg and Donald (1993). In our
Wasserstein DRIVE framework, the penalty/robustness parameter can also be selected
using the minimum eigenvalue of the first stage coefficient. It remains to further study
the connections between our work and the weak instrument literature in this regard. The
related econometric literature on many (weak) instruments studies the regime where the
number of instruments is allowed to diverge proportionally with the sample size (Kunitomo,
1980; Bekker, 1994; Chamberlain and Imbens, 2004; Chao and Swanson, 2005; Kolesár,
2018). In this work, we will assume a fixed number of instruments to best illustrate the main ideas of our framework.
Testing for invalid instruments is possible in the over-identified regime, where there
are more instruments than endogenous variables (Sargan, 1958; Kadane and Anderson,
1977; Hansen, 1982; Andrews, 1999). These tests have been used in combination with
variable selection methods, such as LASSO and thresholding, to select valid instruments
under certain assumptions (Kang et al., 2016; Windmeijer et al., 2018; Guo et al., 2018a;
Windmeijer et al., 2021). In our paper, we propose a regularization selection procedure for Wasserstein DRIVE based on bootstrapped quantiles of the score, and observe that the selected ρ increases with the degree of instrument invalidity. It remains to further
study the relation of this score quantile and test statistics for instrument invalidity in
the over-identified setting. Lastly, our framework can be viewed as complementary to the
post-hoc sensitivity analysis of invalid instruments (Angrist et al., 1996; Small, 2007; Conley
et al., 2012), where instead of bounding the potential bias of IV estimators arising from invalid instruments after the fact, robustness to invalidity is built into the estimation procedure itself.
Instrumental variables estimation has also gained wide adoption in epidemiology and genetics, where it is known as Mendelian randomization (MR) (Bowden et al., 2015; Sanderson and Windmeijer, 2016; Emdin et al., 2017). An important
consideration in MR is invalid instruments, because many genetic variants, which are used as instruments, may affect the outcome variable through unknown mechanisms that are either direct effects (horizontal pleiotropy)
or correlations with unobserved confounders. Methods have been proposed to address these
challenges, based on robust regression and regularization ideas (Bowden et al., 2015, 2016;
Burgess et al., 2016, 2020). Our proposed DRIVE framework contributes to this area by providing a distributionally robust and regularized IV method. In this regard, it complements the classic k-class estimators, which regularize
the IV objective with OLS (Rothenhäusler et al., 2021). Data-driven k-class estimators have
been shown to enjoy better finite sample properties. These include the LIML (Anderson and
Rubin, 1949) and the Fuller estimator (Fuller, 1977), which is a modification of LIML that
works well when instruments are weak (Stock et al., 2002). More recently, Jakobsen and Peters
(2022) proposed another data-driven k-class estimator called the PULSE, which minimizes
the OLS objective but with a constraint set defined by statistical tests of independence
between instrument and residuals. Kolesár et al. (2015) propose a modification of the k-class
estimator that is consistent with invalid instruments whose direct effects on the outcome are random and independent of the first stage effects.
There is a rich literature that explores the interactions and connections between regular-
ized regression and instrumental variables methods. One line of works seeks to improve the quality of IV estimation using techniques from regularized regression. For example, Windmeijer et al. (2018) apply LASSO regression
to the first stage, motivated by applications in genetics where one may have access to many
weak or invalid instruments. Belloni et al. (2012); Chernozhukov et al. (2015) also apply
LASSO, but the task is to select optimal instruments in the many instruments setting or
when covariates are high-dimensional. Caner (2009); Caner and Kock (2018); Belloni et al. (2018) apply LASSO to GMM estimators, generalizing regularized regression results from the
M-estimation setting to the moment estimation setting, which also includes IV estimation.
Another line of works on regularized regression, which is more closely related to our work, investigates the connections and equivalences between regularized regression and causal estimation methods. McDonald (1977) is the first to connect k-class estimators to regularized regressions. Rothenhäusler et al. (2021) and Jakobsen and Peters (2022) further study the distributionally robust properties of estimators that interpolate between the IV and OLS objectives. The Wasserstein DRIVE estimator that we propose in this work applies a different type of regularization, namely a square root ridge regularization, to the second stage coefficient in the TSLS objective. As such it has different behaviors compared to the anchor and k-class estimators, which regularize using the OLS objective. It is also different from the standard ridge penalty, thanks to the unique geometry of the square root ridge, which we study next.

Appendix C The Square Root Ridge in the Standard Regression Setting
In this section, we turn our attention to the square root ridge estimator in the standard
√
regression setting. We first establish the n-consistency of the square root ridge when the
regularization parameter vanishes at the appropriate rate. We then consider a novel regime
with non-vanishing regularization parameter and vanishing noise, revealing properties that
are strikingly different from the standard ridge regression. As we will see, these observations
in the standard setting help motivate and provide the essential intuitions for our results
in the IV estimation setting. In short, the interesting behaviors of the square root ridge
arise from its unique geometry in the regime of vanishing noise, where $\sum_{i=1}^n \epsilon_i^2 = o_p(n)$
as n → ∞. This regime is rarely studied in conventional regression settings in statistics,
but it precisely captures features of the instrumental variables estimation setting, where
projected residuals (Y − Xβ0 )T ΠZ (Y − Xβ0 ) = op (n) when instruments are valid and β0 is
the true effect coefficient. In addition to providing intuitions for the IV estimation setting,
the regularization parameter selection procedure proposed for the square root LASSO in
the standard regression setting by Belloni et al. (2011) also inspires us to propose a novel
procedure for the IV setting in Section E, which is shown to perform well in simulations.
C.1 √n-Consistency of the Square Root Ridge
We now consider the square root ridge estimator in the standard regression setting, and prove its √n-consistency. We will build on the results of Belloni et al. (2011) on the square root LASSO. With fixed design Xi ∈ Rp, and with Φ the CDF of ϵi, we consider the data generating process Yi = XiᵀΒ0 + σϵi, i.e., $Y_i = X_i^T\beta_0 + \sigma\epsilon_i$.
In this section, we rewrite the objective of the square root ridge estimation (15) as

$$\min_\beta \sqrt{\hat Q(\beta)} + \frac{\lambda}{n}\sqrt{\|\beta\|^2 + 1}, \qquad (25)$$

where

$$\hat Q(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2, \qquad (26)$$
and denote β̂ as the minimizer of the objective. Without loss of generality, we assume for all j,

$$\frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1.$$
In other words, each covariate (feature) is normalized to have unit norm. Similar to the square root LASSO case, we will show that by selecting λ = O(√n) properly, or equivalently ρ = O(n⁻¹) in (15), we can achieve, with probability 1 − α, a √n-consistency result:

$$\|\hat\beta - \beta_0\|_2 \lesssim \sigma\sqrt{p\log(2p/\alpha)/n}.$$
Compare this with the bound for the square root LASSO, which is

$$\|\hat\beta - \beta_0\|_2 \lesssim \sigma\sqrt{s\log(2p/\alpha)/n},$$

where s is the size of the support of β0. Although we do not impose assumptions on the size of the support of the p-dimensional vector β0, if s = p is finite in the square root LASSO framework, we achieve the same bound on the estimation error. Our bound for the square root ridge is therefore sharp in this sense.
An important quantity in the analysis is the score S̃, which is the gradient of $\sqrt{\hat Q(\beta)}$ evaluated at the true parameter β0, given (up to sign) by

$$\tilde S = \frac{\mathbb{E}_n(X\epsilon)}{\sqrt{\mathbb{E}_n(\epsilon^2)}},$$

where En denotes the empirical average of the quantities. Similar to the lower bound on the regularization parameter in terms of the score function λ ≥ cn∥S̃∥∞ in Belloni et al. (2011), we will aim to impose the condition that λ ≥ cn∥S̃∥2 for some c > 1. Conveniently, this condition is already implied by λ = √p λ*, where λ* follows the selection procedures proposed in that paper. To see this point, note that ∥S̃∥2 ≤ √p∥S̃∥∞, so that with high probability, √p λ* ≥ √p cn∥S̃∥∞ ≥ cn∥S̃∥2. Thus we may use the exact same selection procedure to achieve the desired bound, although there are other selection procedures for λ that would guarantee λ ≥ cn∥S̃∥2 with high probability; for example, the (1 − α)-quantile of n∥S̃∥2 given the Xi's. We will for now adopt the selection procedure based on λ*, under the assumption that λ ≥ cn∥S̃∥2 holds with high probability.
Under this assumption, and assuming that ϵ is normal, the selected regularization λ = √p λ* satisfies

$$\lambda \lesssim \sqrt{p\,n\log(2p/\alpha)}$$
with probability 1 − α for all large n, using the same argument as Lemma 1 of Belloni et al.
(2011).
An important quantity in deriving the bound on the estimation error is the “prediction” norm

$$\|\hat\beta - \beta_0\|_{2,n}^2 := \frac{1}{n}\sum_i \left(X_i^T(\hat\beta - \beta_0)\right)^2 = \frac{1}{n}\sum_i (\hat\beta - \beta_0)^T X_iX_i^T(\hat\beta - \beta_0),$$

which is related to the Euclidean norm ∥β̂ − β0∥2 through the Gram matrix $\frac{1}{n}\sum_i X_iX_i^T$.
Assumption C.2. There exists a constant κ and n0 such that for all n ≥ n0 , κ∥δ∥2 ≤ ∥δ∥2,n
for all δ ∈ Rp .
When p ≤ n, the Gram matrix $\frac{1}{n}\sum_i X_iX_i^T$ will be full rank (with high probability with random design), and concentrate around the population covariance matrix. This setting of p ≤ n is different from the high-dimensional setting in the square root LASSO paper, as LASSO-type penalties are able to achieve selection consistency when p > n under sparsity, whereas ridge-type penalties generally cannot. Note also that when p > n, restricted eigenvalues are necessary when defining κ, and it is necessary to prove that β̂ − β0 belongs to a restricted subset of Rp on which the bound with κ holds. When p ≤ n, the restricted subset and eigenvalues are not necessary, and κ can be understood as the minimum eigenvalue of the Gram matrix, which would be bounded away from 0 (with high probability). The exact value of κ is a function of the data generating process; for example, with i.i.d. isotropic Gaussian covariates, κ concentrates around 1 for large n.
Theorem C.3. Assume that p ≤ n but p is allowed to grow with n. Let the regularization λ = √p λ* where λ* = c√n Φ⁻¹(1 − α/2p). Then under Assumption C.1 and Assumption C.2, the solution β̂ to the square root ridge problem

$$\min_\beta \sqrt{\frac{1}{n}\sum_i (Y_i - X_i^T\beta)^2} + \frac{\lambda}{n}\sqrt{\|\beta\|^2 + 1}$$

satisfies

$$\|\hat\beta - \beta_0\|_2 \le \frac{2(\frac{1}{c} + 1)\frac{\lambda}{n}}{\kappa^2 - (\frac{\lambda}{n})^2}\cdot\sigma\sqrt{\mathbb{E}_n(\epsilon^2)} \lesssim \sigma\sqrt{p\log(2p/\alpha)/n}.$$
We remark that the quantile of the score function $\mathbb{E}_n(X\epsilon)/\sqrt{\mathbb{E}_n(\epsilon^2)}$ is not only critical for establishing the √n-consistency of the square root ridge. It is also important in practice, as it motivates a regularization selection procedure that uses nonparametric bootstrap to estimate the quantile of the score; we demonstrate in Section 5 that it has very good empirical performance.

C.2 Consistency with Non-Vanishing Regularization under Vanishing Noise

We next study the square root ridge under the novel vanishing noise regime.
Conventional wisdom on regressions with ridge type penalties is that they induce shrink-
age on parameter estimates, and this shrinkage happens for any non-zero regularization.
Asymptotically, if the regularization parameter does not vanish as the sample size increases,
the limit of the estimator, when it exists, is not equal to the true parameter. The same
behavior may be expected of the square root ridge regression. Indeed, this is the case in the
standard linear regression setting with constant variance, i.e., Var(ϵi ) = σ 2 > 0 and
Yi = XiT β0 + ϵi .
However, as we will see, when Var(ϵi ) depends on the sample size, and vanishes as n → ∞,
the square root ridge estimator can be consistent for non-vanishing penalties.
To best illustrate the intuition behind this property of the square root ridge, we start
with the following simple example. Consider the data generating process written in matrix
vector form:
Y = Xβ0 + ϵ, (27)
where the rows of X ∈ Rn×p are i.i.d. N(0, Ip) and independent of ϵ ∼ N(0, σn² In). Suppose
that the variance of the noises vanishes: σn2 → 0 as n → ∞. This is not a standard regression
setup, but captures the essence of the IV estimation setting, as we show in Section 4.
Recall the square root ridge regression problem in (15), which is strictly convex:

$$\min_\beta \sqrt{\frac{1}{n}\|Y - X\beta\|^2} + \sqrt{\rho(1 + \|\beta\|^2)}.$$
Let $\hat\beta^{(n)}_{\mathrm{sqrt}}$ be its unique minimizer. As the sample size n → ∞, we will fix the regularization parameter at ρ = 1. Since $\frac{1}{n}X^TX \to_p I_p$, the loss term satisfies

$$\frac{1}{n}\|Y - X\beta\|^2 = \frac{1}{n}\|X(\beta_0 - \beta) + \epsilon\|^2 \to_p \|\beta_0 - \beta\|^2,$$

where we have used the crucial property σn² → 0. Therefore, under standard conditions, we have

$$\hat\beta^{(n)}_{\mathrm{sqrt}} \to_p \beta_{\mathrm{sqrt}} := \arg\min_\beta \|\beta_0 - \beta\| + \sqrt{1 + \|\beta\|^2}. \qquad (28)$$
Note that the limiting objective above is strictly convex and hence has a unique minimizer.
Moreover,
$$\|\beta_0 - \beta\| + \sqrt{1 + \|\beta\|^2} = \|(\beta_0, -1) - (\beta, -1)\| + \|(\beta, -1)\| \ge \|(\beta_0, -1)\|,$$
using the triangle inequality. On the other hand, setting β = β0 in (28) achieves the lower
bound ∥(β0 , −1)∥. Therefore, βsqrt = β0 is the unique minimizer of the limiting objective,
and so $\hat\beta^{(n)}_{\mathrm{sqrt}} \to_p \beta_0$. In other words, even with a non-vanishing regularization parameter, the square root ridge regression can still produce a consistent estimator. This phenomenon holds more generally: the square root ridge estimator is consistent for any (limiting) regularization parameter $\rho \in [0, 1 + \frac{1}{\|\beta_0\|^2}]$, as long as the noise vanishes, in the sense that $\sum_{i=1}^n \epsilon_i^2 = o_p(n)$. This condition is achieved for a wide variety of noise distributions with vanishing variance.
Theorem C.4. In the linear model (27) where the rows of X are distributed i.i.d. N(0, Ip), if $\sum_{i=1}^n \epsilon_i^2 = o_p(n)$ as sample size n → ∞, then for any $\rho \in [0, 1 + \frac{1}{\|\beta_0\|^2}]$, the unique solution $\hat\beta^{(n)}_{\mathrm{sqrt}}$ of (15) satisfies $\hat\beta^{(n)}_{\mathrm{sqrt}} \to_p \beta_0$.
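The delayed shrinkage in Theorem C.4 is easy to verify numerically. The sketch below is our own simulation (with β0 = (1, −2), so the threshold is 1 + 1/∥β0∥² = 1.2); it solves the square root ridge problem under vanishing noise for a penalty below and above the threshold.

```python
# Numerical check of Theorem C.4: with vanishing noise, the square root
# ridge solution stays near beta_0 for rho below 1 + 1/||beta_0||^2,
# while shrinkage appears once rho exceeds this threshold.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, beta0 = 50_000, np.array([1.0, -2.0])    # threshold: 1 + 1/5 = 1.2
X = rng.normal(size=(n, 2))
Y = X @ beta0 + rng.normal(scale=n ** -0.5, size=n)   # vanishing noise

def sqrt_ridge(rho):
    obj = lambda b: (np.sqrt(np.mean((Y - X @ b) ** 2))
                     + np.sqrt(rho * (1.0 + b @ b)))
    return minimize(obj, x0=np.zeros(2), method="Nelder-Mead",
                    options={"xatol": 1e-9, "fatol": 1e-12}).x

print(sqrt_ridge(1.0))   # approximately (1, -2): no shrinkage yet
print(sqrt_ridge(3.0))   # approximately 0.32 * beta_0: visibly shrunk
```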
In Fig. 3, we plot the solution of the limiting square root ridge objective in a one-dimensional example, which shows that shrinkage only starts once ρ exceeds the limit $1 + \frac{1}{\|\beta_0\|^2}$ in the vanishing noise regime. This behavior is in stark contrast with the regular ridge regression estimator, for which shrinkage starts from the origin, even for arbitrarily small regularization. Although the delayed
First, even though the square root ridge shares similarities with the standard ridge
regression
$$\min_\beta \frac{1}{n}\|Y - X\beta\|^2 + \rho\|\beta\|^2,$$
Figure 3: Limit of the square root ridge estimator in a one-dimensional example with vanishing noise, as a function of the regularization parameter ρ. The value $\rho = 1 + \frac{1}{\|\beta_0\|^2}$ is the largest regularization level that guarantees consistency of the square root ridge.
only the former has delayed shrinkage: the square root operations applied to the mean
squared loss and the squared norm of the parameter above are essential. To see this, note
that with vanishing noise, the limit of the standard ridge objective for the model in (27) is $\|\beta_0 - \beta\|^2 + \rho\|\beta\|^2$, which results in the optimal solution β = β0/(1 + ρ). Therefore, with any non-zero ρ, the standard ridge estimator is asymptotically biased. Second, the additive constant 1 inside the square root of the regularization term is crucial in guaranteeing that βsqrt = β0 is the unique limit of the square root ridge estimator.
To see this, consider instead the following modified square root ridge problem, which appears equally natural:

$$\min_\beta \sqrt{\frac{1}{n}\|Y - X\beta\|^2} + \sqrt{\rho\|\beta\|^2},$$

where the regularization term does not include an additive constant in the square root, so it simplifies to $\sqrt{\rho}\,\|\beta\|$. Under model (27) with vanishing noise and ρ = 1, this objective has limit $\|\beta_0 - \beta_{\mathrm{sqrt}}\| + \|\beta_{\mathrm{sqrt}}\|$. Without the “curvature” guaranteed by the additional constant in the regularization term, the limiting objective is no longer strictly convex, and there is actually an infinite number of solutions that achieve the lower bound $\|\beta_0\|$ in the triangle inequality, namely every point on the segment between 0 and β0, including βsqrt = 0. This implies that the solution to the modified objective is no longer consistent. This difference between the two versions of the square root ridge estimator has also been corroborated by simulations.
Third, given that small penalties in the square root ridge objective could achieve
regularization in finite samples without sacrificing consistency, one may wonder why it
is not widely used. This is partly due to the standard ridge being easier to implement
computationally, but the main reason is that the delayed shrinkage of the square root
ridge estimator is only present in the vanishing noise regime. To see this, assume now that Var(ϵi) = σ² > 0 is fixed. The square root ridge objective then has limit

$$\sqrt{\|\beta - \beta_0\|^2 + \sigma^2} + \sqrt{\rho(1 + \|\beta\|^2)},$$

and as before $\hat\beta^{(n)}_{\mathrm{sqrt}} \to_p \beta_{\mathrm{sqrt}}$, the unique minimizer of the limiting objective above. The first order condition is

$$\frac{\beta - \beta_0}{\sqrt{\|\beta - \beta_0\|^2 + \sigma^2}} + \sqrt{\rho}\,\frac{\beta}{\sqrt{1 + \|\beta\|^2}} = 0,$$

and now only when ρ → 0 is $\hat\beta^{(n)}_{\mathrm{sqrt}}$ a consistent estimator of β0, unless β0 ≡ 0. For this reason, the consistency of the square root ridge with non-vanishing regularization may appear surprising at first: with non-vanishing noise, shrinkage happens for any non-zero regularization, which has also been confirmed in simulations.
Although the consistency of the square root ridge estimator with non-vanishing regular-
ization does not have immediate practical implications for conventional regression problems,
it is actually very well suited for the instrumental variables estimation setting. The rea-
son is that IV and TSLS regressions involve projecting the endogenous (and outcome)
variables onto the space spanned by the instrumental variables in the first stage. When
instruments are valid, this projection cancels out the noise terms asymptotically, resulting in vanishing noise. Regression with the projected variables therefore precisely corresponds to the vanishing noise regime, and we may expect a similar delayed shrinkage effect. This is indeed the case, and with non-zero regularization and (asymptotically) valid instruments, we show in Section 4 that the Wasserstein DRIVE estimator is consistent. This result suggests that we can introduce robustness and regularization to the standard IV estimation through the square root ridge objective without sacrificing consistency.
C.3 Square Root Ridge vs. Ridge for GMM and M-Estimators
We also remark on the distinction between the square root ridge and the standard ridge
in the case when ρn → 0. From Fu and Knight (2000), we know that if ρ approaches 0 at a rate of or slower than O(1/√n), then the ridge estimator has asymptotic bias, i.e., it is
not centered at β0 . However, for square root ridge (and DRIVE), as long as ρn → 0 at any
rate, the estimator will not have any bias. This feature is a result of the self-normalization
property of the square root ridge. In (19), the second term results from

$$\sqrt{n\rho(1 + \|\beta_0 + \delta/\sqrt{n}\|^2)} - \sqrt{n\rho(1 + \|\beta_0\|^2)} = \frac{n\rho\,\beta_0^T}{\sqrt{n\rho(1 + \|\beta_0\|^2)}}\cdot\frac{\delta}{\sqrt{n}} + o(\delta/\sqrt{n}) \to_p \frac{\sqrt{\rho}\,\beta_0^T\delta}{\sqrt{1 + \|\beta_0\|^2}},$$
which does not depend on n. In this sense, the parameter ρ in square root ridge is scale-free,
unlike the regularization parameter in the standard ridge case, whose natural scale is
√
O(1/ n). In the same spirit, when ρ does not vanish, the resulting square root ridge
estimator will have similar behaviors as that of a standard ridge estimator with a vanishing
regularization parameter with rate Θ(1/√n). Moreover, the amount of shrinkage essentially does not depend on the magnitude of β0, due to the normalization of β0 by $\sqrt{1 + \|\beta_0\|^2}$.
Lastly, we discuss the distinction between our work and that of Blanchet et al. (2022),
which analyzes the asymptotic properties of a general class of DRO estimators. In that
work, the original estimators are based on minimizing a sample loss of the form
$$\frac{1}{n}\sum_{i=1}^n \ell(X_i, Y_i, \beta),$$
which encompasses most M-estimators, including the maximum likelihood estimator, and
they focus on the case when ρn → 0. However, the IV (TSLS) estimator is different in that it is a GMM (Z-) estimator rather than an M-estimator. A key distinction between these estimators is that the objective function of GMM estimators is built from a moment function that evaluates to 0 at the true parameter β0, whereas the objectives of M-estimators tend to converge to a limit that does not vanish even at the true parameter. To see this
distinction more precisely, consider the limit of the OLS objective under the linear model:

$$\frac{1}{n}(Y - X\beta)^T(Y - X\beta) = \frac{1}{n}(X\beta_0 + \epsilon - X\beta)^T(X\beta_0 + \epsilon - X\beta) \to_p (\beta_0 - \beta)^T(\beta_0 - \beta) + \sigma^2,$$

assuming $\frac{1}{n}X^TX \to_p I_p$. The limit is minimized at β0, but with minimum value σ² > 0. Consider instead
the following GMM version of the OLS estimator, based on the moment condition that
E(Xi ϵi ) = 0:
$$\min_\beta \frac{1}{n}\left\{(Y - X\beta)^T X\right\} W \left\{X^T(Y - X\beta)\right\},$$
where W is a weighting matrix, with the optimal choice being (XT X)−1 in this setting. We
have, assuming again $\frac{1}{n}X^TX \to_p I_p$,

$$\frac{1}{n}\left\{(Y - X\beta)^TX\right\}(X^TX)^{-1}\left\{X^T(Y - X\beta)\right\} = \frac{1}{n}(Y - X\beta)^TX\left(\frac{1}{n}X^TX\right)^{-1}\frac{1}{n}X^T(Y - X\beta) \to_p (\beta_0 - \beta)^T(\beta_0 - \beta),$$
which is also minimized at β0 but achieves a minimum value of 0. This distinction between
M-estimators and Z-estimators (and GMM estimators) is negligible in the standard setting
without the distributionally robust optimization component, and in fact the standard OLS
estimator is preferable to the GMM version for being more stable (Hall, 2003). However,
when we apply square root ridge regularization to these estimators, they start behaving
differently. Only regularized regression based on GMM and Z-estimators enjoys consistency
with a non-zero ρ > 0. In Appendix D.1, we exploit this property to generalize our results to GMM estimation.
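The following small simulation (ours, under a simple Gaussian design) illustrates the distinction numerically: at β0, the OLS objective converges to σ² > 0, while the GMM objective with weight (XᵀX)⁻¹ converges to 0.

```python
# At the true parameter, the M-estimation (OLS) objective converges to
# sigma^2 > 0, while the GMM objective converges to 0.
import numpy as np

rng = np.random.default_rng(2)
n, beta0, sigma = 100_000, 1.5, 2.0
X = rng.normal(size=(n, 1))
Y = X[:, 0] * beta0 + sigma * rng.normal(size=n)

r = Y - X[:, 0] * beta0                    # residuals at the true beta_0
ols_obj = np.mean(r ** 2)                  # -> sigma^2 = 4
g = X.T @ r                                # sample moment X'(Y - X beta_0)
gmm_obj = g @ np.linalg.solve(X.T @ X, g) / n   # (1/n) r'X(X'X)^{-1}X'r
print(ols_obj, gmm_obj)    # ~4.0 versus ~4/n: only the latter vanishes
```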
Appendix D Generalizations to GMM Estimation and q-Wasserstein Distances
In this section, we consider generalizations of the framework and results in the main paper. We first consider general GMM estimation and generalize the asymptotic results on Wasserstein DRIVE in this setting. We then consider ambiguity sets based on q-Wasserstein distances with q ≠ 2, and show that the resulting estimator enjoys a similar consistency property with non-vanishing robustness/regularization parameter.

D.1 Distributionally Robust GMM Estimation

In this section, we consider general GMM estimation and propose a distributionally robust
GMM estimation framework. Let θ0 ∈ Rp be the true parameter vector in the interior of
some compact space Θ ⊆ Rp. Let ψ(W, θ) be a vector of moments that satisfy

$$\mathbb{E}[\psi(W_i, \theta_0)] = 0$$

for all i, where {W1, . . . , Wn} are independent but not necessarily identically distributed variables. Let ψi(θ) = ψ(Wi, θ). We consider the GMM estimators that minimize the objective

$$\min_\theta \left(\frac{1}{n}\sum_i \psi_i(\theta)\right)^T W_n(\theta)\left(\frac{1}{n}\sum_i \psi_i(\theta)\right),$$

where Wn is a positive definite weight matrix, e.g., the weight matrix corresponding to the two-step or continuous updating estimator, and $\frac{1}{n}\sum_i \psi_i(\theta)$ are the sample moments under the empirical distribution Pn on ψ(θ). Both the IV estimation and GMM formulation of OLS regression fall under this framework. When we are uncertain about the validity of the moment conditions, we can add a square root ridge penalty to the GMM objective, in analogy with Wasserstein DRIVE. We will study the asymptotic properties of this regularized GMM objective. We make use of the following sufficient technical conditions in Caner (2009) on GMM estimation to simplify the proof.
Assumption D.1.

1. For all i and θ1, θ2 ∈ Θ, we have |ψi(θ1) − ψi(θ2)| ≤ Bi|θ1 − θ2|, with $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \mathbb{E}B_i^d < \infty$ for some d > 2; $\sup_{\theta\in\Theta}\mathbb{E}|\psi_i(\theta)|^d < \infty$ for some d > 2.

2. Let $m_n(\theta) := \frac{1}{n}\sum_i \mathbb{E}\psi_i(\theta)$ and assume that mn(θ) → m(θ) uniformly over Θ, with mn(θ) and m(θ) continuous, and m(θ) = 0 if and only if θ = θ0.

3. Wn(θ) is positive definite and continuous on Θ, and Wn(θ) →p W(θ) uniformly in θ.

4. The population objective m(θ)ᵀW(θ)m(θ) is lower bounded by a squared distance: $m(\theta)^TW(\theta)m(\theta) \ge \rho\|\theta - \theta_0\|^2$.

See also Andrews (1994); Stock and Wright (2000), which assume similar conditions as 1–3 on the GMM estimation setup. Condition 4 requires that the weighted moment function is bounded below by a quadratic function of the distance to θ0. Under these conditions, we have the following result.
Theorem D.2. Under Assumption D.1, the unique solution θ̂GMM to

$$\min_\theta \sqrt{\left(\frac{1}{n}\sum_i \psi_i(\theta)\right)^T W_n(\theta)\left(\frac{1}{n}\sum_i \psi_i(\theta)\right)} + \sqrt{\rho_n(1 + \|\theta\|^2)}$$

converges in probability to θ0. Therefore, the square root ridge regularized GMM estimator also satisfies the consistency property with non-vanishing regularization.
D.2 Wasserstein DRIVE with q-Wasserstein Distances, q ≠ 2
The duality result in Theorem 3.1 can be generalized to q-Wasserstein ambiguity sets. The
resulting estimator can enjoy a similar consistency result as the square root Wasserstein
DRIVE (q = 2), but only when q ∈ (1, 2]. This is because the limiting objective can be
shown to be

$$\sqrt{(\beta - \beta_0)^T\gamma^T\gamma(\beta - \beta_0)} + \sqrt[p]{\rho(\|\beta\|_p^p + 1)},$$

where 1/p + 1/q = 1. When q ∈ (1, 2], p ∈ [2, ∞), and so ∥x∥2 ≥ ∥x∥p. As a result, using $\sqrt[p]{\|\beta\|_p^p + 1} = \|(\beta, -1)\|_p$,

$$\|\beta - \beta_0\|_2 + \sqrt[p]{\|\beta\|_p^p + 1} \ge \|(\beta, -1) - (\beta_0, -1)\|_p + \|(\beta, -1)\|_p \ge \|(\beta_0, -1)\|_p,$$
with equality holding in both inequalities if and only if β = β0, i.e., β0 is again the unique minimizer of the limiting objective.
Corollary D.3. Under the same assumptions as Theorem 4.1, the following regularized
regression problem
$$\min_\beta \sqrt{\frac{1}{n}\sum_i (\Pi_Z Y - \Pi_Z X\beta)_i^2} + \sqrt[p]{\rho_n(\|\beta\|_p^p + 1)} \qquad (30)$$
has a unique solution that converges in probability to β0 whenever q ∈ (1, 2] and limn→∞ ρn ≤
λp (γ T ΣZ γ).
Appendix E Regularization Parameter Selection for Wasserstein DRIVE
Many methods exist for selecting the penalty parameter in regularized regression problems. The most common approach is cross validation based on loss function minimization. However, for Wasserstein DRIVE, this standard cross validation procedure may not adequately address the challenges and goals of DRIVE. For example, from Theorem 4.1 we know that the Wasserstein DRIVE is only consistent when the penalty parameter is bounded above. We therefore need to take this result into account when selecting the penalty parameter. In this section, we discuss two selection procedures, one based on the first stage regression coefficient, and the other based on quantiles of the score estimated using a nonparametric bootstrap procedure, which is also of independent interest. We connect our procedures to existing works in the literature on weak and invalid IVs.

E.1 Selection Based on the First Stage Coefficient
Theorem 4.1 guarantees that as long as the regularization parameter converges to a value in
the interval [0, σmin (γ)], Wasserstein DRIVE is consistent. A natural procedure to select ρ
is thus to compute the minimum singular value ρmax := σmin (γ̂) of the first stage regression
coefficient γ̂ and then select a regularization parameter ρ = c · ρmax for c ∈ [0, 1]. In
Section 5, we verify that this procedure produces consistent DRIVE estimators whenever
instruments are valid. Moreover, when the instrument is invalid or weak, Wasserstein
DRIVE enjoys superior finite sample properties, outperforming the standard IV, OLS, and
related estimators in estimation accuracy and prediction accuracy under distributional shifts.
This approach is also related to the test of Cragg and Donald (1993), which was originally used to test for under-identification, and later used by Stock and Yogo (2002) to test for weak instruments. In the Cragg–Donald test, the minimum eigenvalue of the first stage rank test statistic is compared against critical values to assess instrument strength.
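A minimal sketch of this selection rule (our illustration; the function name and constants are ours): estimate the first stage by least squares and set ρ proportional to the smallest singular value of the estimated coefficient.

```python
# First stage based selection: rho = c * sigma_min(gamma_hat), c in [0, 1].
import numpy as np

def select_rho_first_stage(Z, X, c=0.5):
    gamma_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)   # first stage OLS of X on Z
    rho_max = np.linalg.svd(gamma_hat, compute_uv=False).min()
    return c * rho_max

rng = np.random.default_rng(3)
n = 2000
Z = rng.normal(size=(n, 3))
X = Z @ np.array([[1.0], [0.5], [0.2]]) + rng.normal(size=(n, 1))
print(select_rho_first_stage(Z, X))   # c times sigma_min of gamma_hat
```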
Although selecting ρ based on the first stage coefficient gives rise to Wasserstein DRIVE
estimators that perform well in practice, there is one important challenge that remains to
be addressed. Recall that violations of the exclusion restriction can be viewed as a form
of distributional shift. We therefore expect that as the degree of invalidity increases, the
distributional shift becomes stronger. From the DRO formulation of DRIVE in Eq. (12), we
know that the regularization parameter ρ is also the radius of the Wasserstein distribution
set. Therefore, ρ should adaptively increase with increasingly invalid instruments. However,
as the selection procedure proposed here only depends on the first stage estimate, it does
not take this consideration into account. More importantly, when the instruments are weak,
the smallest singular value of the first stage coefficient is likely to be very close to zero, which
results in a DRIVE estimate with a very small penalty parameter and may thus have similar
problems as the standard IV. We next introduce another parameter selection procedure for
ρ based on Theorem C.3 that is able to better handle invalid and weak instruments.
E.2 Selection Based on Bootstrapped Quantiles of the Score
Recall that the square root LASSO uses the penalty

$$\lambda^* = cn\|\tilde S\|_\infty, \qquad \tilde S = \frac{\mathbb{E}_n(X\epsilon)}{\sqrt{\mathbb{E}_n(\epsilon^2)}}, \qquad \hat Q(\beta) = \frac{1}{n}\sum_i (Y_i - X_i^T\beta)^2,$$

where c = 1.1 is a constant following Bickel et al. (2009). The intuition for this penalty level comes from the simplest case β0 ≡ 0, when the optimality condition requires λ ≥ n∥S̃∥∞. To estimate ∥S̃∥∞, Belloni et al. (2011) propose to estimate the empirical (1 − α)-quantile (conditional on Xi) of $\|\mathbb{E}_n(X\epsilon)\|_\infty/\sqrt{\mathbb{E}_n(\epsilon^2)}$ by sampling i.i.d. errors ϵ from the known error distribution Φ. With Gaussian errors, this leads to the choice

$$\lambda^* = c\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p). \qquad (31)$$
The consistency result in Theorem C.3 then suggests a natural choice of penalty parameter ρ for the square root ridge, given by $\sqrt{\rho} = \frac{\sqrt{p}}{n}\lambda^*$, where λ* is constructed from (31).
However, there are two main challenges when applying this regularization parameter selection procedure to the IV estimation setting. First, it requires prior knowledge of the type of distribution Φ, e.g., Gaussian, of the errors ϵ, even if we do not need its variance. Second, $\sqrt{\rho} = \frac{\sqrt{p}}{n}\lambda^*$ is only valid for the square root ridge problem in the standard regression setting without instruments. When applied to the IV setting, the data are first projected using ΠZ. This means that “observations” (Ỹi, X̃i) are no longer independent. Therefore, the i.i.d. sampling scheme of Belloni et al. (2011) is not directly applicable. We instead propose an iterative procedure based on the nonparametric bootstrap that simultaneously addresses the two challenges above. Given a starting estimate β(0) of β0 (say
the IV estimator), we compute the residuals $r_i^{(0)} = \tilde Y_i - \tilde X_i^T\beta^{(0)}$. Then we bootstrap these residuals to estimate the (1 − α)-quantile of the score norm. This bootstrapped quantile then replaces Φ⁻¹(1 − α/2p) in (31) to give a penalty level ρ, which we can use to solve the square root ridge problem to obtain a new estimate β(1). Then we use β(1) to compute new residuals $r_i^{(1)} = \tilde Y_i - \tilde X_i^T\beta^{(1)}$, and repeat the process. In practice, we can use the OLS or TSLS estimate as the starting point β(0). Fig. 4 shows that this procedure converges very quickly and does not depend on the initial β(0). Moreover, in Section 5 we demonstrate that the resulting Wasserstein DRIVE estimator has superior estimation performance.
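The iteration just described can be sketched as follows (our paraphrase of the procedure in runnable form; the helper name and defaults are ours, not from the paper).

```python
# One update of the penalty from bootstrapped score quantiles: resample the
# current residuals, recompute the normalized score, and take its quantile.
import numpy as np

def bootstrap_rho(X_t, Y_t, beta, alpha=0.05, B=500, rng=None):
    """X_t, Y_t: projected data (Pi_Z X, Pi_Z Y); beta: current estimate."""
    rng = rng or np.random.default_rng()
    n = len(Y_t)
    r = Y_t - X_t @ beta                       # residuals at current estimate
    quantiles = np.empty(B)
    for b in range(B):
        e = rng.choice(r, size=n, replace=True)        # bootstrap residuals
        score = (X_t.T @ e / n) / np.sqrt(np.mean(e ** 2))
        quantiles[b] = np.linalg.norm(score)
    return np.quantile(quantiles, 1 - alpha)

# Iterate: beta^(0) (OLS/TSLS) -> rho -> square root ridge -> beta^(1) -> ...
# until the penalty stabilizes; Fig. 4 suggests a few iterations suffice.
```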
E.3 Bootstrapped Score Quantiles and Instrument Validity

When instruments are valid, one should expect the bootstrapped quantiles to converge to 0. We next formalize this intuition in Proposition E.1 and also confirm it in numerical experiments.

Figure 4: Left: Penalty parameter convergence as a function of iteration number, with β(0) starting from OLS, TSLS, and TSLS ridge estimates. Right: Converged penalty values.

Figure 5: Penalty strength selected based on nonparametric bootstrap of score quantile vs. correlation between instrument and unobserved confounder.
Proposition E.1. The bootstrapped quantiles converge to 0 when instruments are valid.
More importantly, in practice we observe that the bootstrapped quantile increases with
the degree of instrument invalidity. Fig. 5 illustrates this phenomenon with increasing
correlation between the instruments and the unobserved confounder. The intuition behind
this observation is that the quantile is essentially describing the orthogonality (moment)
condition for valid IVs, and so should be close to zero with valid IV. A large value therefore
indicates possible violation. Thus, the bootstrapped quantile could potentially be used as a
test statistic for invalid instruments, using for example permutation tests. Equivalently, in
a sensitivity analysis it could be used as a sensitivity parameter, based on which we can assess the robustness of conclusions to instrument invalidity. Recall that in the standard regression setting, the quantity

$$\frac{\|\sum_i X_i\epsilon_i\|_\infty}{\sqrt{\sum_i \epsilon_i^2}}$$

is a test statistic for the orthogonality condition E[X(Y − Xβ)] = 0, which holds asymptotically for β equal to the OLS estimator. When it is not zero, it indicates a violation of the orthogonality condition, which means a non-zero penalty could be beneficial.
Similarly, the analogous quantity computed from the projected data,

$$\frac{\|\sum_i \tilde X_i\epsilon_i\|_\infty}{\sqrt{\sum_i \epsilon_i^2}},$$

is a test statistic for the orthogonality condition E[X̃(Ỹ − X̃β)] = 0, which holds asymptotically when β is the TSLS estimator and Z is a valid instrument. A large value therefore indicates potential violations of the IV assumptions. We may also compare this quantity with the Sargan test statistic (Sargan, 1958) for instrument invalidity in the over-identified setting.
Remark E.2. Besides the data-driven procedures discussed above, we can also consider selecting the penalty based on standard specification tests. For example, in over-identified settings, the Sargan–Hansen test (Sargan, 1958; Hansen, 1982) can be used to test for the exclusion restriction. We can use this test to provide evidence on the validity of the instrument. For testing weak instruments, the popular test of Stock and Yogo (2002) can be used. This proposal is also related to our observation that ρ based on bootstrapped quantiles increases with the degree of invalidity, i.e., direct effect on the outcome or correlation with confounders, and can therefore potentially be used as a test statistic for the reliability of the IV estimator. We leave a detailed investigation of this direction to future work.
Appendix F Proofs
Proof of Theorem 3.1. The proof relies on a general duality result on Wasserstein DRO, with different variations derived in (Gao and Kleywegt, 2023; Blanchet and Murthy, 2019; Sinha et al., 2017). We start with the inner problem in the objective in (12):

$$\sup_{\{Q : D(Q, \tilde P_n) \le \rho\}} \mathbb{E}_Q\left[(\tilde Y - \tilde X^T\beta)^2\right],$$

where D is the 2-Wasserstein distance and P̃n is the empirical distribution on the projected data $\{\tilde Y_i, \tilde X_i\}_{i=1}^n \equiv \{(\Pi_Z \mathbf{Y})_i, (\Pi_Z \mathbf{X})_i\}_{i=1}^n$. Proposition 1 of Sinha et al. (2017) yields the dual formulation

$$\inf_{\gamma \ge 0}\; \gamma\rho + \frac{1}{n}\sum_i \sup_W \left\{W^T\alpha\alpha^TW - \gamma\|W - \tilde W_i\|_2^2\right\},$$

with W = (X, Y), W̃ = (X̃, Ỹ) and α = (−β, 1). Note that γ is always chosen large
enough, i.e., γI − ααT ⪰ 0, so that the objective W T ααT W − γ∥W − W̃ ∥22 is concave in
W . Otherwise, the supremum over W in the inner problem is unbounded. Therefore, the
first order condition for the inner supremum over W is

$$\alpha\alpha^TW - \gamma(W - \tilde W) = 0,$$

so that $(\alpha\alpha^T - \gamma I)W = -\gamma\tilde W$, i.e., $W = (I - \alpha\alpha^T/\gamma)^{-1}\tilde W$. Substituting back, the supremum equals

$$\tilde W^T(I - \alpha\alpha^T/\gamma)^{-1}\alpha\alpha^T(I - \alpha\alpha^T/\gamma)^{-1}\tilde W - \gamma\,\tilde W^T\left((I - \alpha\alpha^T/\gamma)^{-1} - I\right)^2\tilde W \equiv \|\tilde W\|_A^2,$$

where

$$A = (I - \alpha\alpha^T/\gamma)^{-1}\alpha\alpha^T(I - \alpha\alpha^T/\gamma)^{-1} - \gamma\left((I - \alpha\alpha^T/\gamma)^{-1} - I\right)^2.$$
Using the Sherman–Morrison Lemma (Sherman and Morrison, 1950), whose condition is satisfied since γ > αᵀα,

$$(I - \alpha\alpha^T/\gamma)^{-1} = I + \frac{1}{\gamma - \alpha^T\alpha}\,\alpha\alpha^T,$$

and a direct computation simplifies A to

$$A = \frac{\gamma}{\gamma - \alpha^T\alpha}\,\alpha\alpha^T.$$
In summary, for each projected observation (for the IV estimate) W̃i = (X̃i, Ỹi), we can obtain a new “robustified” sample using the above operation, then minimize the objective

$$\phi_\gamma(\beta; \mathbf{X}, \mathbf{Y}) := \gamma\rho + \frac{1}{n}\sum_i \tilde W_i^T\left(\frac{\gamma}{\gamma - \alpha^T\alpha}\,\alpha\alpha^T\right)\tilde W_i$$

over β and γ, where for fixed β, γ ≥ 0 is always chosen large enough so that ϕγ(β; X, Y) is finite. Now, the inner minimization problem over γ,

$$\inf_{\gamma \ge 0}\; \gamma\rho + \frac{1}{n}\sum_i \tilde W_i^T\left(\frac{\gamma}{\gamma - \alpha^T\alpha}\,\alpha\alpha^T\right)\tilde W_i,$$

can be further solved explicitly. Write $\ell_{IV} := \frac{1}{n}\sum_i \tilde W_i^T\alpha\alpha^T\tilde W_i = \frac{1}{n}\sum_i (\tilde Y_i - \beta^T\tilde X_i)^2$.
Setting the derivative with respect to γ to zero yields the optimal $\gamma^* = \alpha^T\alpha + \sqrt{\ell_{IV}\,\alpha^T\alpha/\rho}$, and substituting back,

$$\inf_{\gamma \ge 0}\; \gamma\rho + \frac{1}{n}\sum_i \|(\tilde X_i, \tilde Y_i)\|_A^2 = \rho\,\alpha^T\alpha + \sqrt{\rho\,\alpha^T\alpha\cdot\ell_{IV}} + \ell_{IV} + \sqrt{\rho\,\alpha^T\alpha\cdot\ell_{IV}} = 2\sqrt{\rho\,\alpha^T\alpha\cdot\ell_{IV}} + \rho\,\alpha^T\alpha + \ell_{IV} = \left(\sqrt{\ell_{IV}} + \sqrt{\rho\,\alpha^T\alpha}\right)^2.$$
It follows that

$$\min_\beta \sup_{\{Q : D(Q, \tilde P_n) \le \rho\}} \mathbb{E}_Q\left[(\tilde Y - \tilde X^T\beta)^2\right] = \min_\beta \left(\sqrt{\ell_{IV}} + \sqrt{\rho(1 + \|\beta\|^2)}\right)^2,$$

using αᵀα = 1 + ∥β∥², which is the square root ridge regularized TSLS objective in Theorem 3.1, completing the proof.
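As a sanity check on the computation above (our own verification, not part of the proof), one can minimize the scalar dual objective numerically and compare with the closed form, with a > 0 playing the role of αᵀα and ℓ > 0 that of ℓIV.

```python
# Verify: inf_{gamma > a} gamma*rho + l*gamma/(gamma - a)
#         = (sqrt(l) + sqrt(rho*a))^2.
import numpy as np
from scipy.optimize import minimize_scalar

rho, a, l = 0.3, 2.0, 1.7
dual = lambda g: g * rho + l * g / (g - a)
res = minimize_scalar(dual, bounds=(a + 1e-9, a + 100.0), method="bounded")
closed_form = (np.sqrt(l) + np.sqrt(rho * a)) ** 2
print(res.fun, closed_form)   # both approximately 4.3199
```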
Proof of Theorem 4.1. Recall the instrumental variables model

$$Y = \beta_0^TX + \epsilon, \qquad X = \gamma^TZ + \xi,$$

with instrument relevance and exogeneity conditions

$$\mathrm{rank}\left(\mathbb{E}\left[ZX^T\right]\right) = p, \qquad \mathbb{E}[Z\epsilon] = 0, \quad \mathbb{E}\left[Z\xi^T\right] = 0.$$
First, we compute the limit of the objective function (17), reproduced below:

$$\sqrt{\frac{1}{n}\|\Pi_Z\mathbf{Y} - \Pi_Z\mathbf{X}\beta\|^2} + \sqrt{\rho_n(\|\beta\|^2 + 1)}. \qquad (32)$$

Note first that $\frac{1}{n}\epsilon^T\Pi_Z\mathbf{X}(\beta - \beta_0) = o_p(1)$ whenever the instruments are valid, since

$$\frac{1}{n}\epsilon^T\Pi_Z\mathbf{X}(\beta - \beta_0) = \left(\frac{1}{n}\sum_i \epsilon_iZ_i\right)^T\left(\frac{1}{n}\mathbf{Z}^T\mathbf{Z}\right)^{-1}\left(\frac{1}{n}\sum_i Z_iX_i^T(\beta - \beta_0)\right) \to_p \mathbb{E}[Z\epsilon]^T\cdot\Sigma_Z^{-1}\cdot\mathbb{E}[ZX^T]\cdot(\beta - \beta_0) = 0,$$
Similarly,

$$\frac{1}{n}(\beta - \beta_0)^T\mathbf{X}^T\Pi_Z\mathbf{X}(\beta - \beta_0) = \left(\frac{1}{n}\sum_i Z_iX_i^T(\beta - \beta_0)\right)^T\left(\frac{1}{n}\mathbf{Z}^T\mathbf{Z}\right)^{-1}\left(\frac{1}{n}\sum_i Z_iX_i^T(\beta - \beta_0)\right) \to_p (\beta - \beta_0)^T\gamma^T\Sigma_Z\Sigma_Z^{-1}\Sigma_Z\gamma(\beta - \beta_0) = (\beta - \beta_0)^T\gamma^T\Sigma_Z\gamma(\beta - \beta_0).$$
The most important part is the “vanishing noise” behavior, i.e.,

$$\frac{1}{n}\epsilon^T\Pi_Z\epsilon = \left(\frac{1}{n}\sum_i \epsilon_iZ_i\right)^T\left(\frac{1}{n}\mathbf{Z}^T\mathbf{Z}\right)^{-1}\left(\frac{1}{n}\sum_i \epsilon_iZ_i\right) \to_p \mathbb{E}[Z\epsilon]^T\Sigma_Z^{-1}\mathbb{E}[Z\epsilon] = 0.$$
It then follows that the regularized regression objective (17) of the Wasserstein DRIVE converges in probability to the population objective

$$\sqrt{(\beta - \beta_0)^T\gamma^T\Sigma_Z\gamma(\beta - \beta_0)} + \sqrt{\rho(\|\beta\|^2 + 1)}. \qquad (33)$$

For ρ > 0, the population objective (33) is continuous and strictly convex in β, and so has a unique minimizer βDRIVE. Applying the convexity lemma of Pollard (1991), since (32) is convex in β, the convergence is in fact uniform on compact sets that contain βDRIVE. Applying Corollary 3.2.3 of van der Vaart and Wellner (1996), we can therefore conclude that the minimizers of the empirical objectives converge in probability:

$$\hat\beta^{DRIVE} \to_p \beta^{DRIVE}.$$
Next, suppose that ρ ≤ λp(γᵀΣZγ), the smallest eigenvalue of γᵀΣZγ. Then

$$\sqrt{(\beta - \beta_0)^T\gamma^T\Sigma_Z\gamma(\beta - \beta_0)} + \sqrt{\rho(\|\beta\|^2 + 1)} \ge \sqrt{\rho}\,\|\beta - \beta_0\|_2 + \sqrt{\rho}\sqrt{\|\beta\|^2 + 1} = \sqrt{\rho}\,\|(\beta, 1) - (\beta_0, 1)\|_2 + \sqrt{\rho}\,\|(\beta, 1)\|_2 \ge \sqrt{\rho}\,\|(\beta_0, 1)\|_2,$$

where in the second line we augment the vectors β, β0 with an extra coordinate equal to 1. The last line follows from the triangle inequality, with equality if and only if β ≡ β0. We can verify that the lower bound $\sqrt{\rho}\,\|(\beta_0, 1)\|_2$ of the population objective is therefore achieved uniquely at β ≡ β0 due to strict convexity. We have thus proved that when 0 < ρ ≤ λp(γᵀΣZγ), the population objective has a unique minimizer at β0, and consequently $\hat\beta^{DRIVE} \to_p \beta_0$.
Proof. Define $H_n(\delta) := \sqrt{n}\left(\phi_n(\beta_0 + \delta/\sqrt{n}) - \phi_n(\beta_0)\right)$, where ϕn denotes the Wasserstein DRIVE objective in (17). Note that Hn(δ) is minimized at $\delta = \sqrt{n}(\hat\beta_n^{DRIVE} - \beta_0)$. The key components of the proof are to compute the uniform limit H(δ) of Hn(δ) on compact sets in the weak topology, and to verify that their minimizers are uniformly tight, i.e., $\sqrt{n}(\hat\beta_n^{DRIVE} - \beta_0) = O_p(1)$. We can then apply Theorem 3.2.2 of van der Vaart and Wellner (1996) to conclude that the sequence of minimizers $\sqrt{n}(\hat\beta_n - \beta_0)$ of Hn(δ) converges in distribution to the minimizer of H(δ). We decompose

$$H_n(\delta) = \underbrace{\sqrt{\|\Pi_Z\mathbf{Y} - \Pi_Z\mathbf{X}(\beta_0 + \delta/\sqrt{n})\|^2} - \sqrt{\|\Pi_Z\mathbf{Y} - \Pi_Z\mathbf{X}\beta_0\|^2}}_{I} + \underbrace{\sqrt{n\rho_n(1 + \|\beta_0 + \delta/\sqrt{n}\|^2)} - \sqrt{n\rho_n(1 + \|\beta_0\|^2)}}_{II}.$$
We first focus on I:

$$I = \sqrt{\|\Pi_Z\mathbf{Y} - \Pi_Z\mathbf{X}(\beta_0 + \delta/\sqrt{n})\|^2} - \sqrt{\|\Pi_Z\mathbf{Y} - \Pi_Z\mathbf{X}\beta_0\|^2} = \sqrt{F_n(\beta_0 + \delta/\sqrt{n})} - \sqrt{F_n(\beta_0)},$$

where, with $\psi_i(\beta) := Z_i(X_i^T\beta - Y_i)$,

$$F_n(\beta_0 + \delta/\sqrt{n}) = \|\Pi_Z\mathbf{Y} - \Pi_Z\mathbf{X}(\beta_0 + \delta/\sqrt{n})\|^2 = \left(\frac{1}{\sqrt{n}}\sum_i \psi_i(\beta_0 + \delta/\sqrt{n})\right)^T\left(\frac{1}{n}\mathbf{Z}^T\mathbf{Z}\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_i \psi_i(\beta_0 + \delta/\sqrt{n})\right).$$
Note that

$$\|\psi_i(\beta_1) - \psi_i(\beta_2)\| = \|Z_i(Z_i^T\gamma + \xi_i^T)(\beta_1 - \beta_2)\|,$$

where ∥·∥ denotes operator norm for matrices and Euclidean norm for vectors. We have $\mathbb{E}\|Z(Z^T\gamma + \xi^T)\|^k < \infty$, using the assumptions that $\mathbb{E}\|Z\|^{2k} < \infty$ and $\mathbb{E}\|\xi\|^{2k} < \infty$. Moreover, Eψi(β0 + δ/√n) is uniformly bounded on compact subsets in δ. The consistency result in Theorem 4.1
combined with the above bounds guarantees stochastic equicontinuity (Andrews, 1994), so that as n → ∞, uniformly in δ on compact sets that contain $\delta = \sqrt{n}(\hat\beta_n^{DRIVE} - \beta_0)$,

$$\frac{1}{\sqrt{n}}\sum_i \left[\psi_i(\beta_0 + \delta/\sqrt{n}) - \mathbb{E}\psi_i(\beta_0 + \delta/\sqrt{n})\right] \to_d N(0, \Omega(\beta_0)) \equiv Z,$$

where $\Omega(\beta) = \frac{1}{n}\sum_i \mathbb{E}(\psi_i(\beta)\psi_i^T(\beta))$, so that
$$\Omega(\beta_0) = \frac{1}{n}\sum_i \mathbb{E}(\psi_i(\beta_0)\psi_i^T(\beta_0)) = \frac{1}{n}\sum_i \mathbb{E}\left[(Y_i - X_i^T\beta_0)^2Z_iZ_i^T\right] = \frac{1}{n}\sum_i \mathbb{E}\left[\epsilon_i^2Z_iZ_i^T\right] = \sigma^2\Sigma_Z.$$
Similarly,

$$\frac{1}{\sqrt{n}}\sum_i \mathbb{E}\psi_i(\beta_0 + \delta/\sqrt{n}) = \sqrt{n}\,\mathbb{E}\left[\left(X^T(\beta_0 + \delta/\sqrt{n}) - Y\right)Z\right] = \sqrt{n}\,\mathbb{E}\left[\left(X^T(\beta_0 + \delta/\sqrt{n}) - (X^T\beta_0 + \epsilon)\right)Z\right] = \mathbb{E}[ZX^T]\delta = \Sigma_Z\gamma\delta.$$
Therefore,

$$\frac{1}{\sqrt{n}}\sum_i \psi_i(\beta_0 + \delta/\sqrt{n}) \to_d Z + \Sigma_Z\gamma\delta,$$

uniformly in δ on compact sets, so that

$$F_n(\beta_0 + \delta/\sqrt{n}) \to_d (Z + \Sigma_Z\gamma\delta)^T\Sigma_Z^{-1}(Z + \Sigma_Z\gamma\delta), \qquad F_n(\beta_0) \to_d Z^T\Sigma_Z^{-1}Z,$$
and applying the continuous mapping theorem to the square root function,
$$I = \sqrt{F_n(\beta_0 + \delta/\sqrt{n})} - \sqrt{F_n(\beta_0)} \to_d \sqrt{(Z + \Sigma_Z\gamma\delta)^T\Sigma_Z^{-1}(Z + \Sigma_Z\gamma\delta)} - \sqrt{Z^T\Sigma_Z^{-1}Z}.$$
Next we have

$$II = \sqrt{n\rho_n(1 + \|\beta_0 + \delta/\sqrt{n}\|^2)} - \sqrt{n\rho_n(1 + \|\beta_0\|^2)} = \frac{n\rho_n\beta_0^T}{\sqrt{n\rho_n(1 + \|\beta_0\|^2)}}\cdot\frac{\delta}{\sqrt{n}} + o(\delta/\sqrt{n}) \to_p \frac{\sqrt{\rho}\,\beta_0^T\delta}{\sqrt{1 + \|\beta_0\|^2}}$$
uniformly. Because Hn(δ) is convex and H(δ) has a unique minimum, $\arg\min_\delta H_n(\delta) = \sqrt{n}(\hat\beta_n^{DRIVE} - \beta_0) = O_p(1)$. Applying Theorem 3.2.2 of van der Vaart and Wellner (1996),

$$\sqrt{n}(\hat\beta_n^{DRIVE} - \beta_0) = \arg\min_\delta H_n(\delta) \to_d \arg\min_\delta \sqrt{(Z + \Sigma_Z\gamma\delta)^T\Sigma_Z^{-1}(Z + \Sigma_Z\gamma\delta)} + \frac{\sqrt{\rho}\,\beta_0^T\delta}{\sqrt{1 + \|\beta_0\|^2}}.$$
When ρ = 0, minimizing the limiting objective is equivalent to minimizing

$$(Z + \Sigma_Z\gamma\delta)^T\Sigma_Z^{-1}(Z + \Sigma_Z\gamma\delta) - Z^T\Sigma_Z^{-1}Z = 2\delta^T\gamma^TZ + \delta^T\gamma^T\Sigma_Z\gamma\delta,$$

which recovers the same minimizer as the TSLS objective. Indeed, the first order condition

$$\frac{\gamma^TZ + \gamma^T\Sigma_Z\gamma\delta}{\sqrt{(Z + \Sigma_Z\gamma\delta)^T(Z + \Sigma_Z\gamma\delta)}} = 0$$

reduces to γᵀZ + γᵀΣZγδ = 0, i.e., δ = −(γᵀΣZγ)⁻¹γᵀZ. We can therefore conclude that with vanishing ρn → ρ = 0, regardless of the rate, the asymptotic distribution of Wasserstein DRIVE coincides with that of the standard TSLS estimator.
With the normalization ΣZ = I, the first order condition reduces to

$$\gamma^TZ + \delta^T\gamma^T\gamma = 0.$$
Next, consider the one-dimensional case with ρ > 0. The objective is $Z + \gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ when γδ + Z ≥ 0 and $-Z - \gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ when γδ + Z ≤ 0. Recall that by assumption √ρ ≤ |γ|.

If β0 > 0 and γ > 0, then $\gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ on {γδ + Z ≥ 0} is minimized at δ = −γ⁻¹Z, and $-\gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ on {γδ + Z ≤ 0} is minimized at −γ⁻¹Z.

If β0 > 0 and γ < 0, then $\gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ on {γδ + Z ≥ 0} is again minimized at δ = −γ⁻¹Z (since δ ≤ −γ⁻¹Z on this set), and $-\gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ on {γδ + Z ≤ 0} is again minimized at −γ⁻¹Z.

If β0 < 0 and γ > 0, then $\gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ on {γδ + Z ≥ 0} is minimized at δ = −γ⁻¹Z, and $-\gamma\delta + \frac{\sqrt{\rho}\,\beta_0}{\sqrt{1 + \|\beta_0\|^2}}\delta$ on {γδ + Z ≤ 0} is minimized at −γ⁻¹Z.

If β0 < 0 and γ < 0, the same argument applies, with the minimum again attained at δ = −γ⁻¹Z. In all four cases the minimizer coincides with the TSLS limit −γ⁻¹Z.
Proof of Theorem D.2. We first decompose the sample moments as

$$\frac{1}{n}\sum_i \psi_i(\theta) = \frac{1}{n}\sum_i \left[\psi_i(\theta) - \mathbb{E}\psi_i(\theta)\right] + \frac{1}{n}\sum_i \mathbb{E}\psi_i(\theta).$$

Assumption D.1.1 guarantees a uniform law of large numbers over Θ, so that

$$\frac{1}{n}\sum_i \left[\psi_i(\theta) - \mathbb{E}\psi_i(\theta)\right] = o_p(1),$$

uniformly in θ.
Assumption D.1.3 further guarantees that

$$\sqrt{\left(\frac{1}{n}\sum_i \psi_i(\theta)\right)^TW_n(\theta)\left(\frac{1}{n}\sum_i \psi_i(\theta)\right)} + \sqrt{\rho_n(1 + \|\theta\|^2)} \to_p \sqrt{m(\theta)^TW(\theta)m(\theta)} + \sqrt{\rho(1 + \|\theta\|^2)}$$

uniformly in θ. Applying Corollary 3.2.3 of van der Vaart and Wellner (1996), we can conclude that θ̂GMM converges in probability to the minimizer of the population objective. Next, we consider the minimizer of the population objective. Applying Assumption D.1.4,

$$\sqrt{m(\theta)^TW(\theta)m(\theta)} + \sqrt{\rho(1 + \|\theta\|^2)} \ge \sqrt{\rho}\,\|\theta - \theta_0\|_2 + \sqrt{\rho(1 + \|\theta\|^2)} = \sqrt{\rho}\left(\|(\theta, 1) - (\theta_0, 1)\|_2 + \|(\theta, 1)\|_2\right) \ge \sqrt{\rho}\,\|(\theta_0, 1)\|_2,$$

where again the last inequality follows from the triangle inequality. We can verify that equalities are achieved if and only if θ = θ0, which guarantees that $\hat\theta_{GMM} \to_p \theta_0$.
Proof of Theorem C.3. By the optimality of β̂ for the square root ridge objective,

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \le \frac{\lambda}{n}\sqrt{\|\beta_0\|^2 + 1} - \frac{\lambda}{n}\sqrt{\|\hat\beta\|^2 + 1} \le \frac{\lambda}{n}\|\hat\beta - \beta_0\|_2.$$

On the other hand, by convexity of $\sqrt{\hat Q(\beta)}$,

$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge \tilde S^T(\hat\beta - \beta_0) \ge -\|\tilde S\|_2\|\hat\beta - \beta_0\|_2 \ge -\frac{\lambda}{cn}\|\hat\beta - \beta_0\|_2.$$
Now the estimation error in terms of the “prediction norm”

$$\|\hat\beta - \beta_0\|_{2,n}^2 := \frac{1}{n}\sum_i \left(X_i^T(\hat\beta - \beta_0)\right)^2 = \frac{1}{n}\sum_i (\hat\beta - \beta_0)^TX_iX_i^T(\hat\beta - \beta_0)$$

can be related to the loss difference:

$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \frac{1}{n}\sum_i (Y_i - X_i^T\hat\beta)^2 - \frac{1}{n}\sum_i (Y_i - X_i^T\beta_0)^2 = \frac{1}{n}\sum_i (Y_i - X_i^T\beta_0 + X_i^T\beta_0 - X_i^T\hat\beta)^2 - \frac{1}{n}\sum_i (Y_i - X_i^T\beta_0)^2 = \|\hat\beta - \beta_0\|_{2,n}^2 + \frac{2}{n}\sum_i (\sigma\epsilon_i)X_i^T(\beta_0 - \hat\beta).$$

Moreover,

$$2\mathbb{E}_n\left(\sigma\epsilon X^T(\hat\beta - \beta_0)\right) = \frac{2}{n}\sum_i (\sigma\epsilon_i)X_i^T(\hat\beta - \beta_0) = 2\sqrt{\frac{1}{n}\sum_i (\sigma\epsilon_i)^2}\cdot\frac{\frac{1}{n}\sum_i \sigma\epsilon_iX_i^T}{\sqrt{\frac{1}{n}\sum_i \sigma^2\epsilon_i^2}}(\hat\beta - \beta_0) = 2\sqrt{\hat Q(\beta_0)}\cdot\tilde S^T(\hat\beta - \beta_0) \le 2\sqrt{\hat Q(\beta_0)}\,\|\tilde S\|_2\,\|\hat\beta - \beta_0\|_2.$$
Combining these, we can bound the estimation error $\|\hat\beta - \beta_0\|_{2,n}^2$ as

$$\|\hat\beta - \beta_0\|_{2,n}^2 \le 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\left(\frac{1}{c} + 1\right)\|\hat\beta - \beta_0\|_2 + \left(\frac{\lambda}{n}\right)^2\|\hat\beta - \beta_0\|_2^2.$$

Now the norms $\|\hat\beta - \beta_0\|_{2,n}^2$ and $\|\hat\beta - \beta_0\|_2$ differ by the Gram matrix $\frac{1}{n}\sum_i X_iX_i^T$, which by the assumption $\frac{1}{n}\sum_i X_{ij}^2 = 1$ has diagonal entries equal to 1. Recall that κ is the tight constant in Assumption C.2 linking the two norms, so that

$$\|\hat\beta - \beta_0\|_2^2 \le \frac{1}{\kappa^2}\|\hat\beta - \beta_0\|_{2,n}^2 \le \frac{1}{\kappa^2}\,2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\left(\frac{1}{c} + 1\right)\|\hat\beta - \beta_0\|_2 + \frac{1}{\kappa^2}\left(\frac{\lambda}{n}\right)^2\|\hat\beta - \beta_0\|_2^2,$$

which yields

$$\|\hat\beta - \beta_0\|_2 \le \frac{1}{1 - \frac{1}{\kappa^2}(\frac{\lambda}{n})^2}\cdot\frac{1}{\kappa^2}\,2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\left(\frac{1}{c} + 1\right) = \frac{2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}(\frac{1}{c} + 1)}{\kappa^2 - (\frac{\lambda}{n})^2},$$

provided that

$$\left(\frac{\lambda}{n}\right)^2 \le \kappa^2.$$
As λ/n → 0 and κ is a universal constant linking the two norms, this condition will be satisfied for all n large enough under Assumption C.2, so that the rate of convergence of $\|\hat\beta - \beta_0\|_{2,n} \to 0$ is governed by that of λ/n → 0:

$$\|\hat\beta - \beta_0\|_2 \le \frac{2\frac{\lambda}{n}(\frac{1}{c} + 1)}{\kappa^2 - (\frac{\lambda}{n})^2}\cdot\sqrt{\hat Q(\beta_0)} \lesssim \sigma\sqrt{p\log(2p/\alpha)/n}.$$