
Distributionally Robust Instrumental Variables Estimation

Zhaonan Qu∗ Yongchan Kwon†

Columbia University, Data Science Institute
Columbia University, Department of Statistics

arXiv:2410.15634v2 [econ.EM] 22 Dec 2024

Abstract
Instrumental variables (IV) estimation is a fundamental method in econometrics
and statistics for estimating causal effects in the presence of unobserved confounding.
However, challenges such as untestable model assumptions and poor finite sample
properties have undermined its reliability in practice. Viewing common issues in
IV estimation as distributional uncertainties, we propose DRIVE, a distributionally
robust IV estimation method. We show that DRIVE minimizes a square root variant
of the ridge regularized two stage least squares (TSLS) objective when the ambiguity
set is based on a Wasserstein distance. In addition, we develop a novel asymptotic
theory for this estimator, showing that it achieves consistency without requiring the
regularization parameter to vanish. This novel property ensures that the estimator is
robust to distributional uncertainties that persist in large samples. We further derive
the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures
to select the regularization parameter based on theoretical results. Simulation studies
demonstrate the superior finite sample performance of Wasserstein DRIVE in terms of
estimation error and out-of-sample prediction. Due to its regularization and robustness
properties, Wasserstein DRIVE presents an appealing option when the practitioner is
uncertain about model assumptions or distributional shifts in data.

Keywords: Causal Inference; Distributionally Robust Optimization; Square Root Ridge; Invalid Instruments; Distribution Shift

∗ Corresponding author. zq2236@columbia.edu
† yk3012@columbia.edu

1 Introduction

Instrumental variables (IV) estimation, also known as IV regression, is a fundamental

method in econometrics and statistics to infer causal relationships in observational data

with unobserved confounding. It leverages access to additional variables (instruments) that

affect the outcome exogenously and exclusively through the endogenous regressor to yield

consistent causal estimates, even when the standard ordinary least squares (OLS) estimator

is biased by unobserved confounding (Imbens and Angrist, 1994; Angrist et al., 1996; Imbens

and Rubin, 2015). Over the years, IV estimation has become an indispensable tool for

causal inference in empirical works in economics (Card and Krueger, 1994), as well as in

the study of genetic and epidemiological data (Davey Smith and Ebrahim, 2003).

Despite the widespread use of IV in empirical and applied works, it has important

limitations and challenges, such as invalid instruments (Sargan, 1958; Murray, 2006), weak

instruments (Staiger and Stock, 1997), non-compliance (Imbens and Angrist, 1994), and

heteroskedasticity, especially in settings with weak instruments or highly leveraged datasets

(Andrews et al., 2019; Young, 2022). These issues could significantly impact the validity

and quality of estimation and inference using instrumental variables (Jiang, 2017). Many

works have since been devoted to assessing and addressing these issues, such as statistical

tests (Hansen, 1982; Stock and Yogo, 2002), sensitivity analysis (Rosenbaum and Rubin,

1983; Bonhomme and Weidner, 2022), and additional assumptions or structures on the data

generating process (Kolesár et al., 2015; Kang et al., 2016; Guo et al., 2018b).

Recently, an emerging line of work has highlighted interesting connections between

causality and the concepts of invariance and robustness (Peters et al., 2016; Meinshausen,

2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022; Fan et al.,

2024). Their guiding philosophy is that causal properties can be viewed as robustness against

changes across heterogeneous environments, represented by a set P of data distributions.

The robustness of an estimator against P is often represented in a distributionally robust

optimization (DRO) framework via the min-max problem

min_β sup_{P∈P} E_P [ℓ(W ; β)],    (1)

where ℓ(W ; β) is a loss function of data W and parameter β of interest.

In many estimation and regression settings, one assumes that the true data distribution

satisfies certain conditions, e.g., conditional independence or moment equations. Such

conditions guarantee that standard statistical procedures based on the empirical distribution

P0 of data, such as M-estimation and generalized method of moments (GMM), enable

valid estimation and inference. In practice, however, it is often reasonable to expect

that the distribution P0 of the observed data might deviate from that generated by the

ideal model that satisfies such conditions, e.g., due to measurement errors or model mis-

specifications. DRO addresses such distributional uncertainties by explicitly incorporating

possible deviations into an ambiguity set P = P(P0 , ρ) of distributions that are “close” to

P0 . The parameter ρ quantifies the degree of uncertainty, e.g., as the radius of a ball centered

around P0 and defined by some divergence measure between probability distributions. By

minimizing the worst-case loss over P(P0 , ρ) in the min-max optimization problem (1), the

DRO approach achieves robustness against deviations captured by P(P0 , ρ).

DRO provides a useful perspective for understanding the robustness properties of

statistical methods. For example, it is well-known that members of the family of k-class

estimators (Anderson and Rubin, 1949; Nagar, 1959; Theil, 1961) are more robust than

the standard IV estimator against weak instruments (Andrews, 2007). Recent works by

Rothenhäusler et al. (2021) and Jakobsen and Peters (2022) show that k-class estimators in

fact have a DRO representation of the form (1), where ℓ is the square loss, W = (X, Y ),

and X, Y are endogenous and outcome variables generated from structural equation models

parameterized by the natural parameter of k-class estimators. See Appendix A.2 for details.

The general robust optimization problem (1) can trace its roots to the classical robust

statistics literature (Huber, 1964; Huber and Ronchetti, 2011) as well as classic works on

robustness in economics (Hansen and Sargent, 2008). Drawing inspiration from them,

recent works in econometrics have also explored the use of robust optimization to account for

(local) deviations from model assumptions (Kitamura et al., 2013; Armstrong and Kolesár,

2021; Chen et al., 2021; Bonhomme and Weidner, 2022; Adjaho and Christensen, 2022; Fan

et al., 2023). These works, together with works on invariance and robustness, highlight the

emerging interactions between econometrics, statistics, and robust optimization.

Despite new developments connecting causality and robustness, many questions and

opportunities remain. An important challenge in DRO is the choice of the ambiguity set

P(P0 , ρ) to adequately capture distributional uncertainties. This choice is highly dependent

on the structure of the particular problem of interest. While some existing DRO approaches

use ambiguity sets P(P0 , ρ) based on marginal or joint distributions of data, such P(P0 , ρ)

may not effectively capture the structure of IV estimation models. In addition, as the

min-max problem (1) minimizes the loss function under the worst-case distribution in

P(P0 , ρ), a common concern is that the resulting estimator is too conservative when P(P0 , ρ)

is too large. In particular, although DRO estimators enjoy better empirical performance

in finite samples, their asymptotic validity typically requires the ambiguity set to vanish

to a singleton, i.e., ρ → 0 (Blanchet et al., 2019, 2022). However, in the context of IV

estimation, distributional uncertainties about untestable model assumptions could persist in

large samples, necessitating an ambiguity set that does not vanish to a singleton.

It is therefore important to ask whether and how one can construct an estimator in the

IV estimation setting that can sufficiently capture the distributional uncertainties about

model assumptions, and at the same time remains asymptotically valid with a non-vanishing

robustness parameter.

In this paper, we propose to view common challenges to IV estimation through the lens

of DRO, whereby uncertainties about model assumptions, such as the exclusion restriction

and homoskedasticity, are captured by a suitably chosen ambiguity set in (1). Based on

this perspective, we propose DRIVE, a general DRO approach to IV estimation. Instead of

constructing the ambiguity set based on marginal or joint distributions as in existing works,

we construct P(P0 , ρ) from distributions conditional on the instrumental variables. More

precisely, we construct P0 as the empirical distribution of outcome and endogenous variables

Y, X projected onto the space spanned by instrumental variables. When the ambiguity set of

DRIVE is based on the 2-Wasserstein metric, we show that the resulting estimator minimizes

a square root version of ridge regularized two stage least squares (TSLS) objective, where

the radius ρ of the ambiguity set becomes the regularization parameter. This regularized

regression formulation relies on the general duality of Wasserstein DRO problems (Gao and

Kleywegt, 2023; Blanchet et al., 2019; Kuhn et al., 2019).

We next reveal a surprising statistical property of the square root ridge by showing

that Wasserstein DRIVE is consistent as long as the regularization parameter ρ is bounded

above by an estimable constant, which depends on the first stage coefficient of the IV

model and can be interpreted as a measure of instrument quality. To our knowledge, this is

the first consistency result for regularized regression estimators where the regularization

parameter does not vanish as the sample size n → ∞. One implication of our results is that

Wasserstein DRIVE, being a regularized regression estimator, enjoys better finite sample

properties, but does not introduce bias asymptotically even for non-vanishing ρ, unlike

standard regularized regression estimators such as the ridge and LASSO.

We further characterize the asymptotic distribution of Wasserstein DRIVE and propose

data-driven procedures to select the regularization parameter. We demonstrate with

numerical experiments that Wasserstein DRIVE improves over the finite sample performance

of IV and k-class estimators, thanks to its ridge type regularization, while at the same time

retaining asymptotic validity whenever instruments are valid. In particular, Wasserstein

DRIVE achieves significant improvements in mean squared errors (MSEs) over IV and OLS

when instruments are moderately invalid. These findings suggest that Wasserstein DRIVE

can be an attractive option in practice when we are concerned about model assumptions.

The rest of the paper is organized as follows. In Section 2, we discuss the standard IV esti-

mation framework and common challenges. In Section 3, we propose the Wasserstein DRIVE

framework and provide the duality theory. In Section 4, we develop asymptotic results for the

Wasserstein DRIVE, including consistency under a non-vanishing robustness/regularization

parameter. Section 5 conducts numerical studies that compare Wasserstein DRIVE with

other estimators including IV, OLS, and k-class estimators. Background materials, proofs,

and additional results are included in the appendices in the supplementary material.

Notation. Throughout the paper, ∥v∥p denotes the p-norm of a vector v, while

∥v∥ := ∥v∥2 denotes the Euclidean norm. Tr(M ) denotes the trace of a matrix M . λk (M )

represents the k-th largest eigenvalue of a symmetric matrix M. A boldfaced variable, such as X, represents the matrix whose i-th row is the variable Xi.

2 Background and Challenges in IV Estimation

In this section, we first provide a brief review of the standard IV estimation framework. We

then motivate the DRO approach to IV estimation by viewing common challenges from

the perspective of distributional uncertainties. In Section 3, we propose the Wasserstein

distributionally robust instrumental variables estimation (DRIVE) framework.

2.1 Instrumental Variables Estimation

Consider the following standard linear instrumental variables regression model with X ∈ R^p, Z ∈ R^d where d ≥ p, and β0 ∈ R^p, γ ∈ R^{d×p}:

Y = β0^T X + ϵ,
X = γ^T Z + ξ.    (2)

In (2), X are the endogenous variables, Z are the instrumental variables, and Y is the

outcome variable. The error terms ϵ and ξ capture the unobserved (or residual) components

of Y and X, respectively. We are interested in estimating the causal effects β0 of the

endogenous variables X on the outcome variable Y given independent and identically

distributed (i.i.d.) samples {Xi , Yi , Zi }ni=1 . However, X and Y are confounded through some

unobserved confounders U that are correlated with both Y and X, represented graphically

in the directed acyclic graph (DAG) below:

[DAG: Z → X → Y, with the edge Z → X labeled γ and the edge X → Y labeled β0, and an unobserved confounder U with arrows into both X and Y.]

Mathematically, the unobserved confounding can be described by the moment condition

E [Xϵ] ̸= 0.

As a result of the unobserved confounding, the standard ordinary least squares (OLS)

regression estimator of β0 that regresses Y on X is biased. To address this problem, the

IV estimation approach leverages access to the instrumental variables Z, also often called

instruments, which satisfy the moment conditions

rank(E[ZX^T]) = p,    (3)
E[Zϵ] = 0,  E[Zξ^T] = 0.    (4)

Under these conditions, a popular IV estimator is the two stage least squares (TSLS,

sometimes also stylized as 2SLS) estimator (Theil, 1953). With ΠZ := Z(Z^T Z)^{-1} Z^T and

X, Y, Z matrix representations of {Xi , Yi , Zi }ni=1 whose i-th rows correspond to Xi , Yi , Zi ,

respectively, the TSLS estimator β̂IV := (X^T ΠZ X)^{-1} X^T ΠZ Y minimizes the objective

min_β (1/n) ∥Y − ΠZ Xβ∥².    (5)

In contrast, the standard OLS estimator β̂ OLS solves the problem

min_β (1/n) ∥Y − Xβ∥².    (6)
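For concreteness, both objectives have closed form minimizers; the following is a minimal numpy sketch (the array shapes X: n×p, Z: n×d, Y: length n are assumptions of the snippet, not notation from the paper).

    import numpy as np

    def ols(X, Y):
        # OLS estimator solving (6)
        return np.linalg.solve(X.T @ X, X.T @ Y)

    def tsls(X, Y, Z):
        # TSLS estimator solving (5): beta_hat = (X' Pz X)^{-1} X' Pz Y
        Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto the column space of Z
        Xhat = Pz @ X                            # first-stage fitted values
        return np.linalg.solve(Xhat.T @ X, Xhat.T @ Y)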

When the moment conditions (3) and (4) hold, the TSLS estimator is a consistent estimator

of the causal effect β0 under standard assumptions (Wooldridge, 2020), and valid inference

can be performed by constructing variance estimators based on the asymptotic distribution

of β̂ IV (Imbens and Rubin, 2015). Although not the most common presentation of TSLS,

the optimization formulation in (5) provides intuition on how IV estimation works: when

the instruments Z are uncorrelated with the unobserved confounders U affecting X and

Y , the projection operator ΠZ applied to X “removes” the confounding from X, so that

ΠZ X becomes (asymptotically) uncorrelated with ϵ. Regressing Y on ΠZ X then yields a

consistent estimator of β0 .

The validity of estimation and inference based on β̂ IV relies critically on the moment

conditions (3) and (4). Condition (3) is often called the relevance condition or rank
 
condition, and requires E ZX T to have full rank (recall p ≤ d). In the special case of

one-dimensional instrumental and endogenous variables, i.e., d = p = 1, it simply reduces

to E [ZX] ̸= 0. Intuitively, the relevant condition ensures that the instruments Z can

explain sufficient variations in the endogenous variables X. In this case, the instruments
 
are said to be relevant and strong. When E ZX T is close to being rank deficient, i.e.,
 
the smallest eigenvalue λp (E ZX T ) ≈ 0, IV estimation suffers from the so-called weak

instrument problem, which results in many issues in estimation and inference, such as small

sample bias and non-normal statistics (Stock et al., 2002). Some k-class estimators, such as

limited information maximum likelihood (LIML) (Anderson and Rubin, 1949), are partially

motivated to address these problems. Condition (4) is often referred to as the exclusion

restriction or instrument exogeneity (Imbens and Rubin, 2015), and instruments that

satisfy this condition are called valid instruments. When an instrument Z is correlated

with the unobserved confounder that confounds X, Y , or when Z affects the outcome Y

through an unobserved variable other than the endogenous variable X, the instrument

becomes invalid, resulting in biased estimation and invalid inference of β0 (Murray, 2006).

These issues can often be exacerbated when the instruments are weak, when there is

heteroskedasticity (Andrews et al., 2019), or the data is highly leveraged (Young, 2022).

Although many works have been devoted to addressing the problems of weak and invalid

instruments, there are fundamental limits on the extent to which one can test for these

issues. Given the popularity of IV estimation in practice, it is therefore desirable to have

estimation and inference procedures that are robust to the presence of such issues. Our

work is precisely motivated by these considerations. Compared to many existing robust

approaches to IV estimation, we take a more agnostic approach via distributionally robust

optimization. More precisely, we argue that many common challenges in IV estimation

can be viewed as uncertainties about the data distribution, i.e., deviations from the ideal

model that satisfies IV assumptions, which can be explicitly taken into account by choosing

an appropriate ambiguity set in a DRO formulation of the standard IV estimation. To

demonstrate this perspective more concretely, we now examine some common problems in

IV estimation and show that they can be viewed as distributional shifts under a suitable

metric, and therefore amenable to a DRO approach.

2.2 Challenges in IV Estimation as Distributional Uncertainties

Consider now the following one-dimensional version of the IV model in (2)

Y = Xβ0 + ϵ

X = Zγ + ξ,

where we assume that X, Y are confounded by an unobserved confounder U through

ϵ = Zη + U, ξ = U.

Note that in addition to U , there is also potentially a direct effect η from the instrument Z

to the outcome variable Y . We focus on the resulting model for our subsequent discussions:

Y = Xβ0 + Zη + U,
X = Zγ + U.    (7)

The standard IV assumptions can be succinctly summarized for (7). The relevance condition

(3) requires that γ ̸= 0, while the exclusion restriction (4) requires that Z is uncorrelated

with U and that in addition η = 0. Assume that U, Z are i.i.d. standard normal. X, Y are

then determined by (7). We are interested in the shifts in data distribution, appropriately

defined and measured, when the exogeneity and relevance conditions are violated.

Example 1 (Invalid Instruments). As U, Z are independent, Z becomes invalid if and only

if η ̸= 0, and |η| quantifies the degree of instrument invalidity. Let Pη denote the joint

distribution on (X, Y, Z) in the model (7) indexed by η ∈ R. Let P̃η,Z be the resulting

normal distribution on the conditional random variables (X̃, Ỹ ) = (X | Z, Y | Z), given

Z. We are interested in the (expected) distributional shift between P̃η,Z and P̃0,Z . We

choose the 2-Wasserstein distance W2 (·, ·) (Kantorovich, 1942, 1960), also known as the

Kantorovich metric, to measure this shift. Conveniently, the 2-Wasserstein distance between

two normal distributions Q1 = N (µ1 , Σ1 ) and Q2 = N (µ2 , Σ2 ) has an explicit formula due

to Olkin and Pukelsheim (1982):

W2(Q1, Q2)² = ∥µ1 − µ2∥² + Tr(Σ1 + Σ2 − 2(Σ2^{1/2} Σ1 Σ2^{1/2})^{1/2}).    (8)

Applying (8) to the conditional distributions P̃η,Z , P̃0,Z , and taking the expectation with

respect to Z, we obtain the simple formula

E W2(P̃η,Z, P̃0,Z) = √(2/π) · |η|.    (9)

This calculation shows that the degree of instrument invalidity, as measured by the strength

of direct effect of instrument on the outcome, is proportional to the expected distributional

shift of the distribution on (X̃, Ỹ ) from that under the valid IV assumption. Moreover,

the simple form of the expected distributional shift relies on our choice of the Wasserstein

distance to measure the distributional shift of the conditional random variables (X̃, Ỹ ). If we

instead measure shifts in the joint distribution Pη on (X, Y, Z), the resulting distributional

shift will depend on other model parameters in addition to η. This example therefore

suggests that the Wasserstein metric applied to the conditional distributional shift of (X̃, Ỹ )

could be an appropriate measure of distributional uncertainty in IV regression models.
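As a quick numerical illustration (not part of the paper), the closed form (8) and the scaling in (9) can be checked directly. The sketch below assumes numpy and scipy are available and uses the illustrative values β0 = γ = 1 and η = 0.3; conditional on Z, only the mean of Y shifts (by Zη), so W2 reduces to |Zη| and averaging over Z recovers √(2/π)·|η|.

    import numpy as np
    from scipy.linalg import sqrtm

    def w2_gaussian_sq(mu1, S1, mu2, S2):
        # Squared 2-Wasserstein distance between N(mu1, S1) and N(mu2, S2), formula (8)
        S2h = np.real(sqrtm(S2))
        cross = np.real(sqrtm(S2h @ S1 @ S2h))
        val = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * cross)
        return max(float(val), 0.0)              # guard against tiny negative round-off

    beta0, gamma, eta = 1.0, 1.0, 0.3            # illustrative parameter values
    S = np.array([[1.0, 1.0 + beta0],
                  [1.0 + beta0, (1.0 + beta0) ** 2]])   # Cov[(X, Y) | Z], the same for any eta
    rng = np.random.default_rng(0)
    zs = rng.standard_normal(2000)
    avg = np.mean([np.sqrt(w2_gaussian_sq(
        np.array([gamma * z, (gamma * beta0 + eta) * z]), S,
        np.array([gamma * z, gamma * beta0 * z]), S)) for z in zs])
    print(avg, np.sqrt(2 / np.pi) * abs(eta))    # the two numbers should be close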

Example 2 (Weak Instruments). Now consider another common problem with IV esti-

mation, which happens when the first stage coefficient γ is close to 0. Let Q̃γ,Z be the

distribution on (X̃, Ỹ ) indexed by γ ∈ R and η = 0 in (7). In this case, we can verify that

E W2(Q̃γ1,Z, Q̃γ2,Z) = √(2/π) · √(1 + β0²) · |γ1 − γ2|.

The expected distributional shift between the setting with a “strong” instrument with

γ = γ0 and a “weak” instrument with γ = δ · γ0 where δ → 0 in the limit, measured by the

2-Wasserstein metric, is equal to

√(2/π) · √(1 + β0²) · |γ0|.    (10)

Similar to the previous example, the degree of violation of the strong instrument assumption,

as measured by the presumed instrument strength |γ0 |, is proportional to the expected

distributional shift on (X̃, Ỹ ). Note, however, that the distance is also proportional to the

magnitude of the causal parameter β0 . This is reasonable because instrument strength is

relative, and should be measured relative to the scale of the true causal parameter.

Next, we consider the distributional shift resulting from heteroskedastic errors, which are

known to render the TSLS estimator inefficient and the standard variance estimator invalid

(Baum et al., 2003). Some k-class estimators, such as the LIML and the Fuller estimators,

also become inconsistent under heteroskedasticity (Hausman et al., 2012).

Example 3 (Heteroskedasticity). In this example, we assume η = 0 in (7) and that

the conditional distribution of U given Z is centered normal with standard deviation

α · |Z| + 1 where α ≥ 0. We are interested in the average distributional shift between the

heteroskedastic setting (α > 0) from the homoskedastic setting (α = 0). We can verify that

the expected 2-Wasserstein distance between the conditional distributions on (X̃, Ỹ ) is

√(2/π) · √(1 + (β0² + 1)²) · α,    (11)

which is proportional to the degree of heteroskedasticity α.

The preceding discussions demonstrate that distributional uncertainties resulting from

violations of common model assumptions in IV estimation are well captured by the 2-

Wasserstein distance on the distributions of the conditional variables (X̃, Ỹ ). We therefore

propose to construct an ambiguity set in (1) using a Wasserstein ball around the empirical

distribution on (X̃, Ỹ ). We provide details of this framework in the next section.

3 Wasserstein Distributionally Robust IV Estimation

In this section, we propose a distributionally robust IV estimation framework. We propose to

use Wasserstein ambiguity sets to account for distributional uncertainties in IV estimation.

We develop the dual formulation of Wasserstein DRIVE as regularized regression, and

discuss its connections and distinctions to other regularized regression estimators.

3.1 DRIVE

Motivated by the intuition that common challenges to IV estimation in practice, such

as violations of model assumptions, can be viewed as distributional uncertainties on the

conditional distributions of (X̃, Ỹ ) = (X | Z, Y | Z), we propose the Distributionally

Robust IV Estimation (DRIVE) framework, which solves the following DRO problem given

a dataset {(Xi , Yi , Zi )}ni=1 and robustness parameter ρ:
(DRIVE Objective)    min_β sup_{Q: D(Q, P̃n) ≤ ρ} E_Q[(Y − X^T β)²],    (12)

where P̃n (X × Y) is the empirical distribution on (X, Y ) induced by the projected samples

{X̃i , Ỹi }ni=1 ≡ {(ΠZ X)i , (ΠZ Y)i }ni=1 .

Here X ∈ R^{n×p}, Y ∈ R^n, Z ∈ R^{n×d} are the matrix representations of observations, and

ΠZ = Z(Z^T Z)^{-1} Z^T is the projection matrix onto the column space of Z. D(·, ·) is a metric

or divergence measure on the space of probability distributions on X × Y. Therefore, in our

DRIVE framework, we first regress both the outcome Y and covariate X on the instrument

Z to form the n predicted samples (ΠZ X, ΠZ Y). Then an ambiguity set is constructed

using D around the empirical distribution P̃n . This choice of the reference distribution

P0 is a key distinction of our work from previous works that leverage DRO in statistical

models. In the standard regression/classification setting, the reference distribution is often

chosen as the empirical distribution P̂n on {Xi , Yi }ni=1 (Blanchet et al., 2019). In the IV

estimation setting where we have additional access to instruments Z, we have the choice of

constructing ambiguity sets around the empirical distribution on the marginal quantities

{(Xi , Yi , Zi )}ni=1 , which is the approach taken in Bertsimas et al. (2022). In contrast, we

choose to use the empirical distribution on the conditional quantities {(X̃i , Ỹi )}ni=1 . This

choice is motivated by the intuition that violations of IV assumptions can be captured by

conditional distributional shifts, as illustrated by examples in the previous section.

The choice of the divergence measure D(·, ·) is also important, as it characterizes the

potential distributional uncertainties that DRIVE is robust to. In this paper, we propose

to use the 2-Wasserstein distance W2 (µ, ν) between two probability distributions µ, ν

(Mohajerin Esfahani and Kuhn, 2018; Gao and Kleywegt, 2023). One advantage of the

Wasserstein distance is the tractability of its associated DRO problems (Blanchet et al., 2019),

which can often be formulated as regularized regression problems with unique solutions.

See also Appendix A.1. In Section 2.2, we provided several examples that demonstrate the

2-Wasserstein distance is able to capture common distributional uncertainties in the IV

estimation setting. Alternative distance measures of probability distributions, such as the

class of ϕ-divergences (Ben-Tal et al., 2013), can also be used instead of the Wasserstein

distance. For example, Kitamura et al. (2013) use the Hellinger distance to model local

perturbations in robust estimation under moment restrictions, although not in the IV

estimation setting. In this paper, we focus on the Wasserstein DRIVE framework based

on D = W2 , and leave studies of DRIVE with other choices of D to future works.

We next begin our formal study of Wasserstein DRIVE. In Section 3.2, we will show

that the Wasserstein DRIVE objective is dual to a convex regularized regression problem.

As a result, the solution to the optimization problem (12) is well-defined, and we denote

this estimator by β̂DRIVE . In Section 4, we show β̂DRIVE is consistent with potentially

non-vanishing choices of the robustness parameter and derive its asymptotic distribution.

3.2 Dual Representation of Wasserstein DRIVE

It is well-known in the optimization literature that min-max optimization problems such

as (12) often have equivalent formulations as regularized regression problems. This cor-

respondence between regularization and robustness already manifests itself in the ridge

regression, which is equivalent to an ℓ2 -robust OLS regression (Bertsimas and Copenhaver,

2018). Importantly, the regularized regression formulations are often more tractable in

terms of solving the resulting optimization problem, and also facilitate the study of the

statistical properties of the estimators. We first show that the Wasserstein DRIVE objective

can also be written as a regularized regression problem similar to, but distinct from, the

standard TSLS objective with ridge regularization. Proofs can be found in Appendix F.

Theorem 3.1. The optimization problem in (12) is equivalent to the following convex regularized regression problem:

min_β √((1/n) ∥ΠZ Y − ΠZ Xβ∥²) + √(ρ(∥β∥² + 1)),    (13)

where ΠZ = Z(Z^T Z)^{-1} Z^T is the finite sample projection operator, and ΠZ Y and ΠZ X are

the OLS predictions of X, Y using instruments Z.

Note that the robustness parameter ρ of the DRO formulation (12) now has the dual

interpretation as the regularization parameter in (13). This convex regularized regression

formulation implies that the min-max problem (12) associated with Wasserstein DRIVE has a unique solution, thanks to the strict convexity of the regularization term √(ρ(∥β∥² + 1)),

and is easy to compute despite not having a closed form solution. In particular, (13) can be

reformulated as a standard second order conic program (SOCP) (El Ghaoui and Lebret,

1997), which can be solved efficiently with off-the-shelf convex optimization routines, such as

CVX. More importantly, we leverage this formulation of Wasserstein DRIVE as a regularized

regression problem to study its novel statistical properties in Section 4.
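As an illustration of this formulation, the following is a minimal sketch using cvxpy (an assumed dependency; any conic solver works, and the function name drive_sqrt_ridge is ours). It encodes (13) by augmenting β with a last coordinate constrained to equal 1, so that both the loss and the penalty are plain Euclidean norms; with rho = 0 it reduces to the TSLS objective.

    import numpy as np
    import cvxpy as cp

    def drive_sqrt_ridge(X, Y, Z, rho):
        # Square root ridge TSLS objective (13) written as a second order conic program
        n, p = X.shape
        Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto the column space of Z
        Xt, Yt = Pz @ X, Pz @ Y                  # predicted values of X and Y given Z
        b = cp.Variable(p + 1)                   # augmented vector, last entry pinned to 1
        loss = cp.norm(Yt - Xt @ b[:p], 2) / np.sqrt(n)
        penalty = np.sqrt(rho) * cp.norm(b, 2)   # equals sqrt(rho * (||beta||^2 + 1))
        problem = cp.Problem(cp.Minimize(loss + penalty), [b[p] == 1])
        problem.solve()
        return b.value[:p]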

The equivalence between Wasserstein DRO problems and regularized regression problems

is a familiar general result in recent works. For example, Blanchet et al. (2019) and Gao

and Kleywegt (2023) derive similar duality results for distributionally robust regression

with q-Wasserstein distances for q > 1. Compared to previous works, our work is distinct

in the following aspects. First, we apply Wasserstein DRO to the IV estimation setting

instead of standard regression settings, such as OLS or logistic regression. Although from

an optimization point of view there is no substantial difference, the IV setting motivates

a new asymptotic regime that uncovers interesting statistical properties of the resulting

estimators. Second, the regularization term in (13) is distinct from those in previous works,

which often use ∥β∥p with p ≥ 1. This seemingly innocuous difference turns out to be

crucial for our novel results on the Wasserstein DRIVE. Lastly, compared to the proof in

Blanchet et al. (2019), our proof of Theorem 3.1 is based on a different argument using the

Sherman-Morrison formula instead of Hölder’s inequality, which provides an independent

proof of the important duality result for Wasserstein distributionally robust optimization.

3.3 Wasserstein DRIVE and Regularized Regression

The regularized regression formulation of the Wasserstein DRIVE problem in (13) resembles

the standard ridge regularized (Hoerl and Kennard, 1970) TSLS regression:

min_β (1/n) Σ_i (Yi − X̃i^T β)² + ρ∥β∥²  ⇐⇒  min_β (1/n) ∥Y − ΠZ Xβ∥² + ρ∥β∥².    (14)

We therefore refer to (13) as the square root ridge regularized TSLS. However, there

are three major distinctions between (13) and (14) that are essential in guaranteeing the

statistical properties of Wasserstein DRIVE not enjoyed by the standard ridge regularized

TSLS. First, the presence of square root operations on both the risk term and the penalty

term; second, the presence of a constant in the regularization term; third, an additional

projection on the outcomes in ΠZ Y. We further elaborate on these features in Section 4.

In the standard regression setting without instrumental variables, the square root ridge
min_β √((1/n) ∥Y − Xβ∥²) + √(ρ(1 + ∥β∥²))    (15)

also resembles the “square root LASSO” of Belloni et al. (2011):

min_β √((1/n) ∥Y − Xβ∥²) + λ∥β∥₁.    (16)

In particular, both can be written as dual problems of Wasserstein DRO problems (Blanchet

et al., 2019). However, the square root LASSO is motivated by high-dimensional regression

settings where the dimension of X is potentially larger than the sample size n, but β is

very sparse. In contrast, our study of the square root ridge is motivated by its robustness

properties in the IV estimation setting, where the dimension of the endogenous variable is

small (often one-dimensional). In other words, variable selection is not the main focus of

this paper. A variant of the square root ridge estimator in (15) was also considered in the

standard regression setting by Owen (2007), who instead uses the penalty term ∥β∥2 .

As is well-known in the regularized regression literature (Fu and Knight, 2000), when the

regularization parameter decays to 0 at a rate Op(1/√n), the ridge estimator is consistent.

A similar result also holds for the square root ridge (15) in the standard regression setting as

ρ → 0. However, in the IV estimation setting, our distributional uncertainties about model

assumptions, such as the validity of instruments, could persist even in large samples. Recall

that ρ is also the robustness parameter in the DRO formulation (12). Therefore, the usual

requirement that ρ → 0 as n → ∞ cannot adequately capture distributional uncertainties

in the IV estimation setting. In the next section, we study the asymptotic properties of

Wasserstein DRIVE when ρ does not necessarily vanish. In particular, we establish the

consistency of Wasserstein DRIVE leveraging the three distinct features of (13) that are

absent in the standard ridge regularized TSLS regression (14). This asymptotic result is

in stark contrast to the conventional wisdom on regularized regression that regularized

regression achieves lower variance at the cost of non-zero bias.

4 Asymptotic Theory of Wasserstein DRIVE

In this section, we leverage distinct geometric features of the square root ridge regression to

study the asymptotic properties of the Wasserstein DRIVE. In Section 4.1, we show that

the Wasserstein DRIVE estimator is consistent for any ρ ∈ [0, ρ̄], where ρ̄ depends on the

first stage coefficient γ. This property is a consequence of the consistency of the square root

ridge estimator in settings where the objective value at the true parameter vanishes, such

as the GMM estimation setting. It ensures that Wasserstein DRIVE can achieve better

finite sample performance thanks to its ridge type regularization, while at the same time

retaining asymptotic validity when instruments are valid. In Section 4.2, we characterize

the asymptotic distribution of Wasserstein DRIVE, and discuss several special settings

particularly relevant in practice, such as the just-identified setting with one-dimensional

instrumental and endogenous variables.

4.1 Consistency of Wasserstein DRIVE

Recall the linear IV regression model in (2)

Y = β0^T X + ϵ,
X = γ^T Z + ξ,

where X ∈ R^p, Z ∈ R^d, and β0 ∈ R^p, γ ∈ R^{d×p} with d ≥ p to ensure identification. In this

section, we make the standard assumptions that the instruments satisfy the relevance and

exogeneity conditions in (3) and (4), ϵ, ξ are homoskedastic, the instruments Z are not

perfectly collinear, and that E∥Z∥^{2k} < ∞, E∥ξ∥^{2k} < ∞, E|ϵ|^{2k} < ∞ for some k > 2. The

results can be extended in a straightforward manner when we relax these assumptions, e.g.,

only requiring that exogeneity holds asymptotically. Given i.i.d. samples from the linear IV

model, recall the regularized regression formulation of the Wasserstein DRIVE objective
min_β √((1/n) Σ_i (ΠZ Y − ΠZ Xβ)_i²) + √(ρn(∥β∥² + 1)),    (17)

where ΠZ = Z(Z^T Z)^{-1} Z^T ∈ R^{n×n}, and ΠZ Y ∈ R^n and ΠZ X ∈ R^{n×p} are Y ∈ R^n, X ∈ R^{n×p}

projected onto the instrument space spanned by Z ∈ R^{n×d}.

Theorem 4.1 (Consistency of Wasserstein DRIVE). Let β̂nDRIVE be the unique minimizer of

the objective in (17). Let ρn → ρ ≥ 0 and (1/n) Z^T Z →p E[ZZ^T] = ΣZ. Under the relevance

and exogeneity conditions (3) and (4), the Wasserstein DRIVE estimator β̂nDRIVE converges

to β DRIVE in probability as n → ∞, where β DRIVE is the unique minimizer of

min_β √((β − β0)^T γ^T ΣZ γ (β − β0)) + √(ρ(∥β∥² + 1)).    (18)

Moreover, whenever ρ ∈ [0, ρ̄], where ρ̄ = λp(γ^T ΣZ γ) is the smallest eigenvalue of γ^T ΣZ γ ∈

Rp×p , the unique minimizer of the objective (18) is the true causal effect, i.e., β DRIVE ≡ β0 .

Therefore, Wasserstein DRIVE is consistent as long as the limit of the regularization

parameter is bounded above by ρ̄ = λp(γ^T ΣZ γ). In the case when ΣZ = σZ² Id for σZ² > 0,

the upper bound ρ̄ is proportional to the square of the smallest singular value of the first

stage coefficient γ, which is positive under the relevance condition (3). Recall that ρn is the

radius of the Wasserstein ball in the min-max formulation of DRIVE in (12). Theorem 4.1

therefore guarantees that even when the robustness parameter ρn ≡ ρ ̸= 0, which implies

the solution to the min-max problem is different from the TSLS estimator (ρ = 0), the

resulting estimator always has the same limit as long as ρ ≤ λp (γ T ΣZ γ).

The significance of Theorem 4.1 is twofold. First, it provides a meaningful interpretation

of the robustness parameter ρ in Wasserstein DRIVE in terms of problem parameters,

more precisely the variance covariance matrix ΣZ of Z and the first stage coefficient γ in

the IV regression model. The maximum amount of robustness that can be afforded by

Wasserstein DRIVE without sacrificing consistency is λp (γ T ΣZ γ), which directly depends on

the strength and variance of the instrument. This relation can be described more precisely

when ΣZ = σZ2 Id , in which case the bound is proportional to σZ2 and λp (γ T γ). Both

quantities improve the quality of the instruments: σZ2 improves the proportion of variance

of X and Y explained by the instrument vs. noise, while a γ far from rank deficiency avoids

the weak instrument problem. Therefore, the robustness of Wasserstein DRIVE depends

intrinsically on the strength of the instruments. The quantity γ T ΣZ γ is not unfamiliar

in the IV setting, as it is proportional to the inverse of the asymptotic variance of the

standard TSLS estimator when errors are homoskedastic. This observation suggests an

intrinsic connection between robustness and efficiency in the IV setting. See the discussions

after Theorem 4.2 for more on this point.

More importantly, Theorem 4.1 is the first consistency result for regularized regression

estimators where the regularization parameter does not vanish with sample size. Although

regularized regression such as ridge and LASSO is often associated with better finite sample

performance at the cost of introducing some bias, our work demonstrates that, in the

IV estimation setting, we can get the best of both worlds. On one hand, the ridge type

regularization in Wasserstein DRIVE improves upon the finite sample properties of the

standard IV estimators, which aligns with conventional wisdom on regularized regression.

On the other hand, with a bounded level of the regularization parameter ρ, Wasserstein

DRIVE can still achieve consistency. This is in stark contrast to existing asymptotic results

on regularized regression (Fu and Knight, 2000). Therefore, in the context of IV estimation,

with Wasserstein DRIVE we can achieve consistency and a certain amount of robustness at

the same time, by leveraging additional information in the form of valid instruments. The

maximum degree of robustness that can be achieved also has a natural interpretation in

terms of the strength and variance of the instruments.

Theorem 4.1 also suggests the following procedure to construct a feasible and valid

robustness/regularization parameter ρ̂ given data {Xi, Yi, Zi}_{i=1}^n. Let γ̂ = (Z^T Z)^{-1} Z^T X be

the OLS regression estimator of the first stage coefficient γ and Σ̂Z an estimator of ΣZ , such

as the heteroskedasticity-robust estimator (White, 1980). We can use ρ̂ ≤ λp (γ̂ T Σ̂Z γ̂) to

construct the Wasserstein DRIVE objective, i.e., any value bounded above by the smallest

eigenvalue of γ̂ T Σ̂Z γ̂. Under the assumptions in Theorem 4.1, λp (γ̂ T Σ̂Z γ̂) → λp (γ T ΣZ γ),

which guarantees that the Wasserstein DRIVE estimator with parameter ρ̂ is consistent. In

Appendix E, we discuss the construction of feasible regularization parameters in more detail.

We demonstrate the validity and superior finite sample performance of DRIVE based on

these proposals in simulation studies in Section 5.
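A minimal numpy sketch of this plug-in construction is given below (the bootstrap refinement of Appendix E is not shown); here Σ̂Z is taken to be the sample second moment matrix of Z, which is an assumption of the snippet.

    import numpy as np

    def drive_rho_bound(X, Z):
        # Plug-in estimate of the bound lambda_p(gamma' Sigma_Z gamma) from Theorem 4.1
        gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)   # first-stage OLS coefficients (d x p)
        Sigma_Z_hat = (Z.T @ Z) / Z.shape[0]            # sample second moment matrix of Z
        M = gamma_hat.T @ Sigma_Z_hat @ gamma_hat
        return float(np.linalg.eigvalsh(M).min())       # smallest eigenvalue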

One may wonder why Wasserstein DRIVE can achieve consistency with a non-zero

regularization ρ. Here we briefly discuss the phenomenon that the limiting objective (18)

min_β √((β − β0)^T γ^T ΣZ γ (β − β0)) + √(ρ(∥β∥² + 1))

has a unique minimizer at β0 for bounded ρ > 0. The first term √((β − β0)^T γ^T ΣZ γ (β − β0)) achieves its minimum value of 0 at β = β0. When ρ is small, the effect of adding the regularization term √(ρ(∥β∥² + 1)) does not overwhelm the first term, especially when its

Figure 1: Plot of |β − 1| + √(ρ(β² + 1)), the dual limit objective function in the one-dimensional case with σZ² = γ = β0 = 1, for ρ ∈ {0.5, 1, 2, 5}, together with the limit of the standard ridge loss (β − 1)² + β² (ρ = 1). For ρ ≤ 2, the minimum is achieved at β = 1, while for ρ = 5 and for the standard ridge, the minimum is achieved at β = 0.5.

curvature at β0 is large. As a result, we may expect the minimizer to not deviate much from

β0 . While this intuition is reasonable qualitatively, it does not fully explain the fact that

the minimizer does not change for small ρ. In the standard regression setting, the same

intuition can be applied to the standard ridge regularization, but we know shrinkage occurs

as soon as ρ > 0. The key distinction of (17) turns out to be the square root operations we

apply to the loss and regularization terms, which endows the objective with a geometric

interpretation, and ensures that the minimizer does not deviate from β0 unless ρ is above

some positive threshold. We call this phenomenon the “delayed shrinkage” of the square root

ridge, as shrinkage does not happen until the regularization is large enough. We illustrate it

with a simple example in Fig. 1, where the minimizer of the limiting square root objective

does not change for a bounded range of ρ.
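The delayed shrinkage can also be checked numerically; the following sketch evaluates the limiting one-dimensional objective of Fig. 1 on a fine grid and reports its minimizer for several values of ρ.

    import numpy as np

    betas = np.linspace(-1.0, 2.0, 300001)    # fine grid containing beta0 = 1 exactly

    def sqrt_ridge_argmin(rho):
        # Limiting objective of Fig. 1: |beta - 1| + sqrt(rho * (beta^2 + 1))
        obj = np.abs(betas - 1.0) + np.sqrt(rho * (betas ** 2 + 1.0))
        return betas[np.argmin(obj)]

    for rho in (0.5, 1.0, 2.0, 5.0):
        print(rho, round(sqrt_ridge_argmin(rho), 3))
    # the minimizer stays at 1.0 for rho <= 2 and only then shrinks (to 0.5 at rho = 5)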

Lastly, we comment on the importance of projection operations in Wasserstein DRIVE.

A crucial feature of the Wasserstein DRIVE objective is that both the outcome and the

covariates are regressed on the instrument to compute their predicted values. In other

words, the objective (12) uses ΠZ Y − ΠZ Xβ instead of Y − ΠZ Xβ. For standard IV

estimation (ρ = 0), there is no substantial difference between the two objectives, since

their minimizers are exactly the same, due to the idempotent property ΠZ² = ΠZ. In fact,

in applications of TSLS, the outcome variable is often not regressed on the instrument.

However, Wasserstein DRIVE is consistent for positive ρ only if the outcome Y is also

projected onto the instrument space. In other words, the following problem does not yield

a consistent estimator when ρ > 0:

min_β √((1/n) ∥Y − ΠZ Xβ∥²) + √(ρ(∥β∥² + 1)).

The reason behind this phenomenon is that (1/n) ∥ΠZ Y − ΠZ Xβ∥² is a GMM objective

((1/n) Σ_i Zi(Yi − β^T Xi))^T ((1/n) Z^T Z)^{-1} ((1/n) Σ_i Zi(Yi − β^T Xi)),

which when n → ∞ achieves a minimal value of 0 at β0, while the limit of (1/n) ∥Y − ΠZ Xβ∥²

does not vanish even at β0 . In the former case, the geometric properties of the square root

ridge ensure that the minimizer of the regularized objective is β0 .

4.2 Asymptotic Distribution of Wasserstein DRIVE

Having established the consistency of Wasserstein DRIVE with bounded ρ, we now turn to

the characterization of its asymptotic distribution. In general, the asymptotic distribution

of Wasserstein DRIVE is different from that of the standard IV estimator. However, we

will also examine several special cases relevant in practice where they coincide.

Theorem 4.2 (Asymptotic Distribution). When limn→∞ ρn = ρ ≤ λp (γ T ΣZ γ), the Wasser-

stein DRIVE estimator β̂nDRIVE has asymptotic distribution characterized by the following

optimization problem:
√n (β̂^DRIVE − β0) →d argmin_δ √((Z + ΣZ γδ)^T ΣZ^{-1} (Z + ΣZ γδ)) + (√ρ β0^T δ) / √(1 + ∥β0∥²),    (19)

where Z = N (0, σ 2 ΣZ ) and σ 2 = Eϵ2 .

In particular, when ρn → 0 at any rate, we have


√n (β̂^DRIVE − β0) →d N(0, σ² (γ^T ΣZ γ)^{-1}),

which is the asymptotic distribution for TSLS estimators with homoskedastic errors ϵ.

Recall that the maximal robustness parameter ρ of Wasserstein DRIVE while still being

consistent is equal to the smallest eigenvalue of γ T ΣZ γ, which is proportional to the inverse

of the asymptotic variance of the TSLS estimator. Therefore, as the efficiency of TSLS

increases, so does the robustness of the associated Wasserstein DRIVE estimator. The

“price” to pay for robustness when ρ > 0 is an interesting question. It is clear from Fig. 1

that the curvature of the population objective decreases as ρ increases. Since the objective

is not continuous at β0 , however, a generalized notion of curvature is needed to precisely

characterize this behavior. Since the asymptotic distribution of the TSLS estimator minimizes the objective

(Z + ΣZ γδ)^T ΣZ^{-1} (Z + ΣZ γδ),

Theorem 4.2 implies that in general the asymptotic distributions of Wasserstein DRIVE and

TSLS are different when ρ > 0. However, there are still several cases relevant in practice

where their asymptotic distributions do coincide, which we discuss next.

Corollary 4.3. In the following cases, the asymptotic distribution of Wasserstein DRIVE

is the same as that of the standard TSLS estimator:

1. When ρ = 0;

2. When ρ ≤ λp (γ T ΣZ γ) and β0 is identically 0;

3. When ρ ≤ λp (γ T ΣZ γ) and both β0 and γ are one-dimensional, i.e., d = p = 1.

In particular, the just-identified case with d = p = 1 covers many empirical applications

of IV estimation, since in practice we are often interested in the causal effect of a single

endogenous variable, for which we have a single instrument. The case when β0 ≡ 0 is

also very relevant, since an important question in practice is whether the causal effect of a

variable is zero. Our theory suggests that the asymptotic distribution of Wasserstein DRIVE

should be the same as that of the TSLS when the causal effect is zero and d > 1, even for

ρ > 0. Based on this observation, intuitively, we should expect Wasserstein DRIVE and

TSLS estimators to be “close” to each other. If the estimators or their asymptotic variance

estimators differ significantly, then β0 may not be identically 0. We can design statistical

tests by leveraging this intuition. For example, we can construct test statistics using the

TSLS estimator and the DRIVE estimator with ρ > 0, such as the difference β̂ DRIVE − β̂ TSLS .

Then we can use bootstrap-based tests, such as a bootstrapped permutation test, to assess

the null hypothesis that β̂ DRIVE − β̂ TSLS = 0. If we fail to reject the null hypothesis, then

there is evidence that the true causal effect β0 = 0.
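One simple variant of this idea is sketched below, assuming the hypothetical helpers tsls() and drive_sqrt_ridge() from the earlier sketches; it bootstraps the difference of the two estimators and returns a percentile interval, which can then be compared against 0.

    import numpy as np

    def bootstrap_difference(X, Y, Z, rho, B=500, seed=0):
        # Bootstrap the statistic beta_DRIVE - beta_TSLS; assumes the hypothetical helpers
        # tsls() and drive_sqrt_ridge() defined in the earlier sketches.
        rng = np.random.default_rng(seed)
        n = len(Y)
        observed = drive_sqrt_ridge(X, Y, Z, rho) - tsls(X, Y, Z)
        draws = np.array([
            drive_sqrt_ridge(X[idx], Y[idx], Z[idx], rho) - tsls(X[idx], Y[idx], Z[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(B))
        ])
        lo, hi = np.quantile(draws, [0.025, 0.975], axis=0)
        # percentile interval for the difference; an interval far from 0 speaks against beta0 = 0
        return observed, lo, hi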

Corollary 4.3 may seem pessimistic because it demonstrates that the asymptotic

distribution of Wasserstein DRIVE could be the same as that of the TSLS in special cases.

However, recall that Wasserstein DRIVE is formulated to minimize the worst-case risk

over a set of distributions that are designed to capture deviations from model assumptions.

Therefore, there is not actually any a priori reason that it should coincide with the TSLS

when ρ > 0. In this sense, the fact that the Wasserstein DRIVE is consistent with ρ > 0

and may even coincide with TSLS is rather surprising. In the latter case, the worst-case

distribution for Wasserstein DRIVE in the large sample limit must coincide with that of

the standard population distribution, which may be worth further investigation.

The asymptotic results we develop in this section provide the basis on which one can

perform estimation and inference with the Wasserstein DRIVE estimator. In the next section,

we study the finite sample properties of DRIVE in simulation studies and demonstrate that

it is superior in terms of estimation error and out of sample prediction compared to other

popular estimators.

5 Numerical Studies

In this section, we study the empirical performance of Wasserstein DRIVE. Our results

deliver three main messages. First, we demonstrate with simulations that Wasserstein

DRIVE, with non-zero robustness parameter ρ based on Theorem 4.1, has comparable

performance as the standard IV estimator whenever instruments are valid. Second, when

instruments become invalid, Wasserstein DRIVE outperforms other methods in terms of

RMSE. Third, on the education dataset of Card (1993), Wasserstein DRIVE also has

superior performance at prediction for a heterogeneous target population.

5.1 MSE of Wasserstein DRIVE

We use the data generating process

Y = Xβ0 + Zη + U,
X = γZ + U,
Z = βUZ U + ϵZ,

where U, ϵZ ∼ N (0, σ 2 ) and we allow a direct effect η from the instruments Z to the outcome

Y . Moreover, the instruments Z can also be correlated with the unobserved confounder U

(βU Z ̸= 0). We fix the true parameters and generate independent datasets from the model,

varying the degree of instrument invalidity. In Table 1, we report the MSE of estimators

averaged over 500 repeated experiments. We control the degree of instrument invalidity

by varying η, the direct effect of instruments on the outcome, and βU Z , the correlation

between unobserved confounder and instruments. Results in Table 1 are based on data

where ∥γ∥ ≫ 0 is large. We see that when instruments are strong, Wasserstein DRIVE

performs as well as TSLS when instruments are valid, but performs significantly better than

OLS, TSLS, anchor, and TSLS ridge when instruments become invalid. This suggests that

DRIVE could be preferable in practice when we are concerned about instrument validity.
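A minimal sketch of this data generating process is given below (the scalar case and the default parameter values are assumptions made for illustration); η and βUZ control the two forms of instrument invalidity varied in Table 1.

    import numpy as np

    def simulate(n, beta0=1.0, gamma=1.0, eta=0.0, beta_uz=0.0, sigma=0.5, seed=0):
        # One draw from the Section 5.1 design (scalar Z and X for simplicity)
        rng = np.random.default_rng(seed)
        U = rng.normal(0.0, sigma, size=n)        # unobserved confounder
        eps_Z = rng.normal(0.0, sigma, size=n)
        Z = beta_uz * U + eps_Z                   # instrument, invalid when beta_uz != 0
        X = gamma * Z + U                         # endogenous regressor
        Y = beta0 * X + eta * Z + U               # outcome; eta != 0 breaks the exclusion restriction
        return X.reshape(-1, 1), Y, Z.reshape(-1, 1)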

η     βUZ    OLS     TSLS    anchor    TSLS ridge    DRIVE
0     0      0.21    0.03    0.19      0.03          0.03
0.4   0      0.20    0.07    0.16      0.06          0.03
0.4   0.4    0.26    0.25    0.24      0.21          0.07
0.4   0.8    0.29    0.62    0.29      0.56          0.09
0.8   0      0.26    0.23    0.23      0.22          0.06
0.8   0.4    0.32    0.51    0.31      0.46          0.10
0.8   0.8    0.37    0.82    0.38      0.81          0.14

Table 1: MSE of estimators when instruments are potentially invalid. β0 = 1, n = 2000, σ = 0.5. For TSLS ridge the regularization parameter is selected using cross validation based on out-of-sample prediction errors. For anchor regression the regularization parameter is selected based on the proposal in Rothenhäusler et al. (2021). For DRIVE the regularization parameter is selected using nonparametric bootstrap of the score quantile (Appendix E).

We further investigate the empirical performance of Wasserstein DRIVE when instru-

ments are potentially invalid or weak. We present box plots (omitting outliers) of MSEs

in Fig. 2. The Wasserstein DRIVE estimator with regularization parameter ρ based on

bootstrapped quantiles of the score function consistently outperforms OLS, TSLS, anchor

(k-class), and TSLS with ridge regularization. Moreover, the selected penalties increase as

the direct effect of Z on Y or the correlation between the unobserved confounder U and

the instrument Z increases, i.e., as the model assumption of valid instruments becomes

increasingly invalid. See Fig. 5 in Appendix Section E for more details. This property

is highly desirable, because based on the DRO formulation of DRIVE, ρ represents the

amount of robustness against distributional shifts associated with the estimator, which

should increase as the instruments become more invalid (larger distributional shift). Box

plots of estimation errors in Fig. 2 also verify that even when instruments are valid, the

finite sample performance of Wasserstein DRIVE is still better compared to the standard

IV estimator, suggesting that there is no additional cost in terms of MSE when applying

[Figure 2: box plots of MSE, one panel for each of Corr(U, Z) = 0.0, 0.15, 0.3, with η (the direct effect of Z on Y) ∈ {0.0, 0.15, 0.3, 0.45, 0.6} on the horizontal axis and the estimators OLS, TSLS, k-class (anchor), TSLS+ridge, and DRIVE compared in each panel.]

Figure 2: MSEs of estimators when instruments are potentially invalid. Instrument Z can

have direct effects η on the outcome Y , or be correlated with the unobserved confounder U .

Wasserstein DRIVE consistently outperforms the other estimators.

Wasserstein DRIVE, even when instruments are valid.

5.2 Prediction under Distributional Shifts on Education Data

We now turn our attention to a different task that has received more attention in recent years,

especially in the context of policy learning and estimating causal effects across heterogeneous

populations (Dehejia et al., 2021; Adjaho and Christensen, 2022; Menzel, 2023). We study

the prediction performance of estimators when they are estimated on a dataset (training

set) that has a potentially different distribution from the dataset for which they are used

to make predictions (test set). We demonstrate that whenever the distributions between

training and test datasets have significant differences, the prediction error of Wasserstein

DRIVE is significantly smaller than that of OLS, IV, and anchor (k-class) estimators.

We conduct our numerical study using the classic dataset on the returns to education compiled by David Card (Card, 1993). Here, the causal inference problem is

estimating the effect of additional school years on the increase in wage later in life. The

dataset contains demographic information about interviewed subjects. Importantly, each

sample comes from one of nine regions in the United States, which differ in the average

number of years of schooling and other characteristics, i.e., there are covariate shifts in

data collected from different regions. Our strategy is to divide the dataset into a training

set and a test set based on the relative ranks of their average years of schooling, which is

the endogenous variable. We expect that if there are distributional shifts between different

regions, then predicting wages using education and other information using conventional

models trained on the training data may not yield a good performance on the test data.

Since each sample is labeled as working in 1976 in one of nine regions in the U.S., we

split the samples based on these labels, using number of years of education as the splitting

variable. For example, we can construct the training set by including samples from the top

6 regions with the highest average years of schooling, and the test set to consist of samples

coming from the bottom 3 regions with the lowest average years of schooling. In this case,

we would expect the training and test sets to have come from different distributions. Indeed,

the average years of schooling differs by more than 1 year, and is statistically significant.

In splitting the samples based on the distribution of the endogenous variable, we are also

motivated by the long-standing debates revolving around the use of instrumental variables

in classic economic studies (Card and Krueger, 1994). A leading concern is the validity of

instruments. In the case of the study on educational returns, the validity of estimation and

inference require that the instruments (proximity to college and quarter of birth) are not

correlated with unobserved characteristics that may also affect their earnings. The following

quote from Card (1999) illustrates this concern:

“In the case of quasi or natural experiments, however, inferences are based on

difference between groups of individuals who attended schools at different times,

or in different locations, or had differences in other characteristics such as month

of birth. The use of these differences to draw causal inferences about the effect

of schooling requires careful consideration of the maintained assumption that

the groups are otherwise identical.”

training set (size)               test set (size)                         OLS            TSLS           DRIVE          anchor regression   ridge          TSLS ridge

bottom 3 educated regions (1247)  top 3 educated regions (841)            0.444 (0.009)  0.537 (0.031)  0.364 (0.002)  0.444 (0.009)       0.444 (0.009)  0.421 (0.019)
                                  top 6 educated regions (1763)           0.451 (0.011)  1.064 (0.274)  0.371 (0.003)  0.451 (0.011)       0.451 (0.011)  0.430 (0.027)

top 6 educated regions (1763)     bottom 3 educated regions (1247)        0.390 (0.007)  0.584 (0.120)  0.356 (0.002)  0.390 (0.007)       0.390 (0.007)  0.377 (0.015)

top 3 educated regions (841)      bottom 3 educated regions (1247)        0.389 (0.013)  1.99 (0.775)   0.355 (0.005)  0.388 (0.013)       0.359 (0.014)  0.344 (0.004)
                                  middle 3 educated regions (922)         0.328 (0.001)  3.18 (1.213)   0.364 (0.005)  0.328 (0.001)       0.326 (0.001)  0.361 (0.005)

middle 3 educated regions (922)   top 3 educated regions (841)            0.332 (0.001)  0.410 (0.025)  0.332 (0.001)  0.332 (0.001)       0.333 (0.001)  0.332 (0.001)
                                  bottom 3 educated regions (1247)        0.416 (0.014)  0.538 (0.063)  0.363 (0.005)  0.416 (0.014)       0.409 (0.016)  0.386 (0.019)
                                  most+least educated regions (374)       0.395 (0.001)  2.47 (1.81)    0.362 (0.004)  0.395 (0.011)       0.392 (0.012)  0.355 (0.004)
                                  top 3+bottom 3 educated regions (2088)  0.396 (0.005)  0.451 (0.032)  0.358 (0.003)  0.396 (0.009)       0.382 (0.006)  0.366 (0.009)

Table 2: Comparison of estimation methods in terms of MSE on test data. Here the training and test datasets are split according to the 9 regions in the Card college proximity dataset based on their average education levels. In this specification, we did not include experience squared. Standard errors are obtained using 10 bootstrapped datasets.


When this assumption is violated, the estimates based on a particular subpopulation become

unreliable for the wider population, and we evaluate the performance based on how well

they generalize to other groups of the population with potential distributional or covariate

shifts. In Table 2, we compare the test set MSE of OLS, IV, Wasserstein DRIVE, anchor

regression, ridge, and ridge regularized IV estimators. We see that Wasserstein DRIVE

consistently outperforms other estimators commonly used in practice.
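For concreteness, a simplified sketch of this evaluation protocol is given below. Only OLS, ridge, and TSLS are included, intercepts and covariates are omitted, and the column names ("lwage", "educ", "nearc4") are hypothetical stand-ins for the outcome, endogenous regressor, and instrument in the Card data.

```python
# Sketch only: fit each estimator on the training regions and report test-set MSE.
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def tsls(X, y, Z):
    # First stage: fitted values of X from the instruments Z; second stage: OLS of y on them.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

def test_mse(beta, X_test, y_test):
    return float(np.mean((y_test - X_test @ beta) ** 2))

# X_tr, y_tr, Z_tr = train[["educ"]].values, train["lwage"].values, train[["nearc4"]].values
# X_te, y_te = test[["educ"]].values, test["lwage"].values
# for name, beta in [("OLS", ols(X_tr, y_tr)), ("ridge", ridge(X_tr, y_tr, 1.0)),
#                    ("TSLS", tsls(X_tr, y_tr, Z_tr))]:
#     print(name, test_mse(beta, X_te, y_te))
```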

6 Concluding Remarks

In this paper, we propose a distributionally robust instrumental variables estimation

framework. Our approach is motivated by two main considerations in practice. The first is

the concern about model mis-specification in IV estimation, most notably the validity of

instruments. Second, going beyond estimating the causal effect for the endogenous variable,

practitioners may also be interested in making good predictions with the help of instruments

when there is heterogeneity between training and test datasets, e.g., generalizing from

findings using samples from a particular population/geographical group to other groups.

We argue that both challenges can be naturally unified as problems of distributional shifts,

and then addressed using frameworks from distributionally robust optimization.

We provide a dual representation of our Wasserstein DRIVE framework as a regularized

TSLS problem, and reveal a distinct property of the resulting estimator: it is consistent

with a non-vanishing penalty parameter. We further characterize the asymptotic distribution

of the Wasserstein DRIVE, and establish a few special cases when it coincides with that

of the standard TSLS estimator. Numerical studies suggest that Wasserstein DRIVE has

superior finite sample performance in two regards. First, it has lower estimation error when

instruments are potentially invalid, but performs as well as the TSLS when instruments are

valid. Second, it outperforms existing methods at the task of predicting outcomes under

distributional shifts between training and test data. These findings provide support for the

appeal of our DRO approach to IV estimation, and suggest that Wasserstein DRIVE could

be preferable in practice to standard IV methods. Finally, there are many future research

directions of interest, such as further results on inference and testing, as well as connections

to sensitivity analysis. Extensions to nonlinear models would also be useful in practice.

Acknowledgements

We are indebted to Han Hong, Guido Imbens, and Yinyu Ye for invaluable advice and

guidance throughout this project, and to Agostino Capponi, Timothy Cogley, Rajeev

Dehejia, Yanqin Fan, Alfred Galichon, Rui Gao, Wenzhi Gao, Vishal Kamat, Samir Khan,

Frederic Koehler, Michal Kolesár, Simon Sokbae Lee, Greg Lewis, Elena Manresa, Konrad

Menzel, Axel Peytavin, Debraj Ray, Martin Rotemberg, Vasilis Syrgkanis, Johan Ugander,

and Ruoxuan Xiong for helpful discussions and suggestions. This work was supported in

part by a Stanford Interdisciplinary Graduate Fellowship (SIGF).

References

Adjaho, C. and Christensen, T. (2022). Externally valid treatment choice. arXiv preprint

arXiv:2205.05561, 1.

Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation

in a complete system of stochastic equations. The Annals of mathematical statistics,

20(1):46–63.

Andrews, D. W. (1994). Empirical process methods in econometrics. Handbook of econo-

metrics, 4:2247–2294.

Andrews, D. W. (1999). Consistent moment selection procedures for generalized method of

moments estimation. Econometrica, 67(3):543–563.

Andrews, D. W. (2007). Inference with weak instruments. Advances in Economics and

Econometrics, 3:122–173.

Andrews, I., Stock, J. H., and Sun, L. (2019). Weak instruments in instrumental variables

regression: Theory and practice. Annual Review of Economics, 11(1).

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using

instrumental variables. Journal of the American statistical Association, 91(434):444–455.

Angrist, J. D. and Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s

companion. Princeton university press.

Armstrong, T. B. and Kolesár, M. (2021). Sensitivity analysis using approximate moment

condition models. Quantitative Economics, 12(1):77–108.

Basmann, R. L. (1960a). On finite sample distributions of generalized classical linear

identifiability test statistics. Journal of the American Statistical Association, 55(292):650–

659.

Basmann, R. L. (1960b). On the asymptotic distribution of generalized linear estimators.

Econometrica, Journal of the Econometric Society, pages 97–107.

Baum, C. F., Schaffer, M. E., and Stillman, S. (2003). Instrumental variables and gmm:

Estimation and testing. The Stata Journal, 3(1):1–31.

Bekker, P. A. (1994). Alternative approximations to the distributions of instrumental

variable estimators. Econometrica: Journal of the Econometric Society, pages 657–681.

Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and

methods for optimal instruments with an application to eminent domain. Econometrica,

80(6):2369–2429.

Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C., and Kato, K. (2018). High-

dimensional econometrics and regularized gmm. arXiv preprint arXiv:1806.01888.

Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: pivotal recovery of

sparse signals via conic programming. Biometrika, 98(4):791–806.

Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013).

Robust solutions of optimization problems affected by uncertain probabilities. Management

Science, 59(2):341–357.

Bennett, A. and Kallus, N. (2023). The variational method of moments. Journal of the

Royal Statistical Society Series B: Statistical Methodology, 85(3):810–841.

Berkowitz, D., Caner, M., and Fang, Y. (2008). Are “nearly exogenous instruments” reliable?

Economics Letters, 101(1):20–23.

Bertsimas, D. and Copenhaver, M. S. (2018). Characterization of the equivalence of

robustification and regularization in linear and matrix regression. European Journal of

Operational Research, 270(3):931–942.

Bertsimas, D., Imai, K., and Li, M. L. (2022). Distributionally robust causal inference with

observational data. arXiv preprint arXiv:2210.08326.

Bertsimas, D. and Popescu, I. (2005). Optimal inequalities in probability theory: A convex

optimization approach. SIAM Journal on Optimization, 15(3):780–804.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and

Dantzig selector. The Annals of Statistics, 37(4):1705 – 1732.

Blanchet, J., Kang, Y., and Murthy, K. (2019). Robust wasserstein profile inference and

applications to machine learning. Journal of Applied Probability, 56(3):830–857.

Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal

transport. Mathematics of Operations Research, 44(2):565–600.

Blanchet, J., Murthy, K., and Si, N. (2022). Confidence regions in wasserstein distributionally

robust estimation. Biometrika, 109(2):295–315.

Bonhomme, S. and Weidner, M. (2022). Minimizing sensitivity to model misspecification.

Quantitative Economics, 13(3):907–954.

Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables

estimation when the correlation between the instruments and the endogenous explanatory

variable is weak. Journal of the American statistical association, 90(430):443–450.

Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalid

instruments: effect estimation and bias detection through egger regression. International

journal of epidemiology, 44(2):512–525.

Bowden, J., Davey Smith, G., Haycock, P. C., and Burgess, S. (2016). Consistent estimation

in mendelian randomization with some invalid instruments using a weighted median

estimator. Genetic epidemiology, 40(4):304–314.

Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3):404–

426.

Burgess, S., Bowden, J., Dudbridge, F., and Thompson, S. G. (2016). Robust instrumental

variable methods using multiple candidate instruments with application to mendelian

randomization. arXiv preprint arXiv:1606.03729.

Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. (2020). A robust and

efficient method for mendelian randomization with hundreds of genetic variants. Nature

communications, 11(1):376.

Burgess, S., Thompson, S. G., and Collaboration, C. C. G. (2011). Avoiding bias from weak

instruments in mendelian randomization studies. International journal of epidemiology,

40(3):755–764.

Calafiore, G. C. and Ghaoui, L. E. (2006). On distributionally robust chance-constrained

linear programs. Journal of Optimization Theory and Applications, 130:1–22.

Caner, M. (2009). Lasso-type gmm estimator. Econometric Theory, 25(1):270–290.

Caner, M. and Kock, A. B. (2018). High dimensional linear gmm. arXiv preprint

arXiv:1811.08779.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to

schooling. NBER Working Paper, (w4483).

Card, D. (1999). The causal effect of education on earnings. Handbook of labor economics,

3:1801–1863.

Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the

fast-food industry in new jersey and pennsylvania. American Economic Review, 84(4).

Chamberlain, G. and Imbens, G. (2004). Random effects estimators with many instrumental

variables. Econometrica, 72(1):295–306.

Chao, J. C. and Swanson, N. R. (2005). Consistent estimation with a large number of weak

instruments. Econometrica, 73(5):1673–1692.

Chen, X., Hansen, L. P., and Hansen, P. G. (2021). Robust inference for moment condition

models without rational expectations. Journal of Econometrics, forthcoming.

Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Post-selection and post-

regularization inference in linear models with many controls and instruments. American

Economic Review, 105(5):486–490.

Cigliutti, I. and Manresa, E. (2022). Adversarial method of moments.

Conley, T. G., Hansen, C. B., and Rossi, P. E. (2012). Plausibly exogenous. Review of

Economics and Statistics, 94(1):260–272.

Cragg, J. G. and Donald, S. G. (1993). Testing identifiability and specification in instrumental

variable models. Econometric Theory, 9(2):222–240.

Davey Smith, G. and Ebrahim, S. (2003). ‘mendelian randomization’: can genetic epidemi-

ology contribute to understanding environmental determinants of disease? International

journal of epidemiology, 32(1):1–22.

Dehejia, R., Pop-Eleches, C., and Samii, C. (2021). From local to global: External validity in

a fertility natural experiment. Journal of Business & Economic Statistics, 39(1):217–243.

Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty

with application to data-driven problems. Operations research, 58(3):595–612.

Duchi, J. C., Glynn, P. W., and Namkoong, H. (2021). Statistics of robust optimization: A

generalized empirical likelihood approach. Mathematics of Operations Research, 46(3):946–

969.

Dupačová, J. (1987). The minimax approach to stochastic programming and an illustrative

application. Stochastics: An International Journal of Probability and Stochastic Processes,

20(1):73–88.

El Ghaoui, L. and Lebret, H. (1997). Robust solutions to least-squares problems with

uncertain data. SIAM Journal on matrix analysis and applications, 18(4):1035–1064.

Emdin, C. A., Khera, A. V., and Kathiresan, S. (2017). Mendelian randomization. Jama,

318(19):1925–1926.

Fan, J., Fang, C., Gu, Y., and Zhang, T. (2024). Environment invariant linear least squares.

The Annals of Statistics, 52(5):2268–2292.

Fan, Y., Park, H., and Xu, G. (2023). Quantifying distributional model risk in marginal

problems via optimal transport. arXiv preprint arXiv:2307.00779.

Fisher, F. M. (1961). On the cost of approximate specification in simultaneous equation

estimation. Econometrica: journal of the Econometric Society, pages 139–170.

Fu, W. and Knight, K. (2000). Asymptotics for lasso-type estimators. The Annals of

statistics, 28(5):1356–1378.

Fuller, W. A. (1977). Some properties of a modification of the limited information estimator.

Econometrica: Journal of the Econometric Society, pages 939–953.

Galichon, A. (2018). Optimal transport methods in economics. Princeton University Press.

Galichon, A. (2021). The unreasonable effectiveness of optimal transport in economics.

arXiv preprint arXiv:2107.04700.

Gao, R. and Kleywegt, A. (2023). Distributionally robust stochastic optimization with

wasserstein distance. Mathematics of Operations Research, 48(2):603–655.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,

A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information

processing systems, 27.

Guo, Z., Kang, H., Cai, T. T., and Small, D. S. (2018a). Testing endogeneity with high

dimensional covariates. Journal of Econometrics, 207(1):175–187.

Guo, Z., Kang, H., Tony Cai, T., and Small, D. S. (2018b). Confidence intervals for causal

effects with invalid instruments by using two-stage hard thresholding with voting. Journal

of the Royal Statistical Society Series B: Statistical Methodology, 80(4):793–815.

Hahn, J. and Hausman, J. (2005). Estimation with valid and invalid instruments. Annales

d’Economie et de Statistique, pages 25–57.

Hahn, J., Hausman, J., and Kuersteiner, G. (2004). Estimation with weak instruments:

Accuracy of higher-order bias and mse approximations. The Econometrics Journal,

7(1):272–306.

Hall, A. R. (2003). Generalized method of moments. A companion to theoretical econometrics,

pages 230–255.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators.

Econometrica: Journal of the econometric society, pages 1029–1054.

Hansen, L. P. and Sargent, T. J. (2008). Robustness. Princeton university press.

Hansen, L. P. and Sargent, T. J. (2010). Wanting robustness in macroeconomics. In

Handbook of monetary economics, volume 3, pages 1097–1157. Elsevier.

Hausman, J. A., Newey, W. K., Woutersen, T., Chao, J. C., and Swanson, N. R. (2012).

Instrumental variable estimation with heteroskedasticity and many instruments. Quanti-

tative Economics, 3(2):211–255.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: applications to nonorthogonal

problems. Technometrics, 12(1):69–82.

Hu, Z. and Hong, L. J. (2013). Kullback-leibler divergence constrained distributionally

robust optimization. Available at Optimization Online, pages 1695–1724.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical

Statistics, 35(1):73–101.

Huber, P. J. and Ronchetti, E. M. (2011). Robust statistics. John Wiley & Sons.

Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average

treatment effects. Econometrica, 62(2):467–475.

Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical

sciences. Cambridge university press.

Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research,

30(2):257–280.

Jakobsen, M. E. and Peters, J. (2022). Distributional robustness of k-class estimators and

the pulse. The Econometrics Journal, 25(2):404–432.

Jiang, W. (2017). Have instrumental variables brought us closer to the truth. Review of

Corporate Finance Studies, 6(2):127–140.

Kadane, J. B. and Anderson, T. (1977). A comment on the test of overidentifying restrictions.

Econometrica: Journal of the Econometric Society, pages 1027–1031.

Kaji, T., Manresa, E., and Pouliot, G. (2020). An adversarial approach to structural

estimation. arXiv preprint arXiv:2007.06169.

Kallus, N. and Zhou, A. (2021). Minimax-optimal policy learning under unobserved

confounding. Management Science, 67(5):2870–2890.

Kang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimation

with some invalid instruments and its application to mendelian randomization. Journal

of the American statistical Association, 111(513):132–144.

Kantorovich, L. V. (1942). On the translocation of masses. In Dokl. Akad. Nauk. USSR

(NS), volume 37, pages 199–201.

Kantorovich, L. V. (1960). Mathematical methods of organizing and planning production.

Management science, 6(4):366–422.

Kitamura, Y., Otsu, T., and Evdokimov, K. (2013). Robustness, infinitesimal neighborhoods,

and moment restrictions. Econometrica, 81(3):1185–1201.

Kolesár, M. (2018). Minimum distance approach to inference with many instruments.

Journal of Econometrics, 204(1):86–100.

Kolesár, M., Chetty, R., Friedman, J., Glaeser, E., and Imbens, G. W. (2015). Identification

and inference with many invalid instruments. Journal of Business & Economic Statistics,

33(4):474–484.

Koopmans, T. C. (1949). Optimum utilization of the transportation system. Econometrica:

Journal of the Econometric Society, pages 136–146.

Kuhn, D., Esfahani, P. M., Nguyen, V. A., and Shafieezadeh-Abadeh, S. (2019). Wasserstein

distributionally robust optimization: Theory and applications in machine learning. In

Operations research & management science in the age of analytics, pages 130–166. Informs.

Kunitomo, N. (1980). Asymptotic expansions of the distributions of estimators in a linear

functional relationship and simultaneous equations. Journal of the American Statistical

Association, 75(371):693–700.

Lei, L., Sahoo, R., and Wager, S. (2023). Policy learning under biased sample selection.

arXiv preprint arXiv:2304.11735.

Lewis, G. and Syrgkanis, V. (2018). Adversarial generalized method of moments. arXiv

preprint arXiv:1803.07164.

McDonald, J. B. (1977). The k-class estimators as least variance difference estimators.

Econometrica: Journal of the Econometric Society, pages 759–763.

Meinshausen, N. (2018). Causality from a distributional robustness point of view. In 2018

IEEE Data Science Workshop (DSW), pages 6–10. IEEE.

Menzel, K. (2023). Transfer estimates for causal effects across heterogeneous sites. arXiv

preprint arXiv:2305.01435.

Metzger, J. (2022). Adversarial estimators. arXiv preprint arXiv:2204.10495.

Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimiza-

tion using the wasserstein metric: performance guarantees and tractable reformulations.

Mathematical Programming, 171(1-2):115–166.

Murray, M. P. (2006). Avoiding invalid instruments and coping with weak instruments.

Journal of economic Perspectives, 20(4):111–132.

Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the

parameters in simultaneous equations. Econometrica: Journal of the Econometric Society,

pages 575–595.

Nelson, C. R. and Startz, R. (1990a). The distribution of the instrumental variables

estimator and its t-ratio when the instrument is a poor one. Journal of Business, pages

S125–S140.

Nelson, C. R. and Startz, R. (1990b). Some further results on the exact small sample prop-

erties of the instrumental variable estimator. Econometrica: Journal of the Econometric

Society, pages 967–976.

Olkin, I. and Pukelsheim, F. (1982). The distance between two random vectors with given

dispersion matrices. Linear Algebra and its Applications, 48:257–263.

Owen, A. B. (2007). A robust hybrid of lasso and ridge regression. Contemporary Mathe-

matics, 443(7):59–72.

Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal Inference by using Invariant

Prediction: Identification and Confidence Intervals. Journal of the Royal Statistical

Society Series B: Statistical Methodology, 78(5):947–1012.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econo-

metric Theory, 7(2):186–199.

Prékopa, A. (2013). Stochastic programming, volume 324. Springer Science & Business

Media.

Richardson, D. H. (1968). The exact distribution of a structural coefficient estimator.

Journal of the American Statistical Association, 63(324):1214–1226.

Rosenbaum, P. R. and Rubin, D. B. (1983). Assessing sensitivity to an unobserved binary

covariate in an observational study with binary outcome. Journal of the Royal Statistical

Society: Series B (Methodological), 45(2):212–218.

Rothenhäusler, D., Meinshausen, N., Bühlmann, P., Peters, J., et al. (2021). Anchor

regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society

Series B, 83(2):215–246.

Ruszczyński, A. (2010). Risk-averse dynamic programming for markov decision processes.

Mathematical programming, 125:235–261.

Sahoo, R., Lei, L., and Wager, S. (2022). Learning from a biased sample. arXiv preprint

arXiv:2209.01754.

Sanderson, E. and Windmeijer, F. (2016). A weak instrument f-test in linear iv models with

multiple endogenous variables. Journal of econometrics, 190(2):212–221.

Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables.

Econometrica: Journal of the econometric society, pages 393–415.

Scarf, H. (1958). A min-max solution of an inventory problem. Studies in the mathematical

theory of inventory and production.

Shapiro, A. and Kleywegt, A. (2002). Minimax analysis of stochastic problems. Optimization

Methods and Software, 17(3):523–542.

Sherman, J. and Morrison, W. J. (1950). Adjustment of an inverse matrix corresponding

to a change in one element of a given matrix. The Annals of Mathematical Statistics,

21(1):124–127.

Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2017). Certifying some distributional

robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.

Small, D. S. (2007). Sensitivity analysis for instrumental variables regression with overiden-

tifying restrictions. Journal of the American Statistical Association, 102(479):1049–1058.

Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments.

Econometrica: Journal of the Econometric Society, pages 557–586.

Stock, J. H. and Wright, J. H. (2000). Gmm with weak identification. Econometrica,

68(5):1055–1096.

Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and

weak identification in generalized method of moments. Journal of Business & Economic

Statistics, 20(4):518–529.

Stock, J. H. and Yogo, M. (2002). Testing for weak instruments in linear iv regression.

Theil, H. (1953). Repeated least squares applied to complete equation systems. The Hague:

central planning bureau.

Theil, H. (1961). Economic forecasts and policy.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society Series B: Statistical Methodology, 58(1):267–288.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes:

With Applications to Statistics. Springer.

VanderWeele, T. J., Tchetgen, E. J. T., Cornelis, M., and Kraft, P. (2014). Methodological

challenges in mendelian randomization. Epidemiology (Cambridge, Mass.), 25(3):427.

Vaserstein, L. N. (1969). Markov processes over denumerable products of spaces, describing

large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72.

Vershik, A. M. (2013). Long history of the monge-kantorovich transportation problem:

(marking the centennial of lv kantorovich’s birth!). The Mathematical Intelligencer, 35:1–9.

Villani, C. (2009). Optimal transport: old and new, volume 338. Springer.

Von Neumann, J. and Morgenstern, O. (1947). Theory of games and economic behavior,

2nd rev.

Wang, Z., Glynn, P. W., and Ye, Y. (2016). Likelihood robust optimization for data-driven

problems. Computational Management Science, 13:241–261.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct

test for heteroskedasticity. Econometrica, pages 817–838.

Windmeijer, F., Farbmacher, H., Davies, N., and Smith, G. D. (2018). On the use of the

lasso for instrumental variables estimation with some invalid instruments. Journal of the

American Statistical Association.

Windmeijer, F., Liang, X., Hartwig, F. P., and Bowden, J. (2021). The confidence interval

method for selecting valid instrumental variables. Journal of the Royal Statistical Society

Series B: Statistical Methodology, 83(4):752–776.

Wooldridge, J. M. (2020). Introductory econometrics: a modern approach.

Young, A. (2022). Consistency without inference: Instrumental variables in practical

application. European Economic Review, page 104112.

Appendix A Background and Preliminaries

A.1 Wasserstein Distributionally Robust Optimization

We first formally define the Wasserstein distance and discuss relevant results useful in this

paper. The Wasserstein distance is a metric on the space of probability distributions defined

based on the optimal transport problem. More specifically, given any Polish space X with

metric d, let P(X ) be the set of Borel probability measures on X and P, Q ∈ P(X ). For

exposition, we assume they have densities f1 and f2 , respectively, although the Wasserstein

distance is well-defined for more general probability measures using the concept of push-

forwards (Villani, 2009). The optimal transport problem, whose study was pioneered by

Kantorovich (1942, 1960), aims to find the joint probability distribution between P and Q

with the smallest cost, as specified by the metric d:


$$\min_{\pi \in \mathcal{P}(\mathcal{X}\times\mathcal{X}):\; \int \pi(x_1,x_2)\,dx_1 = f_2(x_2),\; \int \pi(x_1,x_2)\,dx_2 = f_1(x_1)} \;\int_{\mathcal{X}\times\mathcal{X}} d^p(x_1,x_2)\,\pi(x_1,x_2)\,dx_1\,dx_2, \quad (20)$$

where p ≥ 1. The p-Wasserstein distance Wp (P, Q) is defined to be the p-th root of the

optimal value of the optimal transport problem above. The Wasserstein distance is a metric

on the space P(X ) of probability measures, and the dual problem of (20) yields the

following important duality result due to Kantorovich (Villani, 2009):


$$W_p^p(P,Q) = \sup_{u\in L^1(P),\, v\in L^1(Q):\; u(x_1)+v(x_2)\le d^p(x_1,x_2)} \Big\{ \int u(x_1)\,f_1(x_1)\,dx_1 + \int v(x_2)\,f_2(x_2)\,dx_2 \Big\}$$
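For two discrete distributions, the primal problem (20) is a finite-dimensional linear program and can be solved directly. The sketch below, for measures supported on the real line with cost $d(x_1,x_2)^p = |x_1-x_2|^p$, is a minimal illustration; the function name and example values are ours.

```python
# Sketch only: solve the discrete optimal transport problem (20) as a linear
# program; the p-Wasserstein distance is the p-th root of the optimal value.
import numpy as np
from scipy.optimize import linprog

def wasserstein_distance_lp(x1, w1, x2, w2, p=2):
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n, m = len(x1), len(x2)
    cost = np.abs(x1[:, None] - x2[None, :]) ** p
    # Equality constraints: row sums of the coupling equal w1, column sums equal w2.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.asarray(w1, dtype=float), np.asarray(w2, dtype=float)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)

# Example: W_2 between two empirical distributions with equal weights.
# print(wasserstein_distance_lp([0.0, 1.0], [0.5, 0.5], [0.5, 2.0], [0.5, 0.5], p=2))
```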

It should be noted that the term “Wasserstein metric” for the optimal transport distance

defined above is an unfortunate mistake, as Kantorovich (1942) should be credited with

pioneering the theory of optimal transport and proposing the metrics. However, because a work of Wasserstein (Vaserstein, 1969), which briefly discussed the optimal transport distance, was initially better known in the West, the terminology of the Wasserstein metric has persisted to this day (Vershik, 2013). The optimal transport problem has also been studied

in the seminal work of Koopmans (1949).

One of the appeals of the Wasserstein distance when formulating distributionally robust

optimization problems lies in the tractability of the dual DRO problem. Specifically, in

(12), the inner maximization problem requires solving an infinite-dimensional optimization

problem for every β, and is in general not tractable. However, if D is the Wasserstein

distance, the inner problem has a tractable dual minimization problem, which when combined

with the outer minimization problem over β, yields a simple and tractable minimization

problem. This will allow us to efficiently compute the WDRO estimator. Moreover, it

establishes connections with the popular statistical approach of ridge regression.

Let c ∈ L1 (X ) be a general loss function and P ∈ P(X ) with density f1 . The following

general duality result (Gao and Kleywegt, 2023; Blanchet and Murthy, 2019) provides a

tractable reformulation of the Wasserstein DRIVE objective introduced in Section 2:


$$\sup_{Q\in\mathcal{P}(\mathcal{X}):\, W_p(Q,P)\le\rho} \int f_2(x)\,c(x)\,dx \;=\; \inf_{\lambda\ge 0}\Big\{ \lambda\rho^p - \int \inf_{x_2\in\mathcal{X}}\big[\lambda d^p(x_1,x_2) - c(x_2)\big]\,f_1(x_1)\,dx_1 \Big\}, \quad (21)$$

where f2 is the density of Q.

A.2 Anchor Regression of Rothenhäusler et al. (2021)

In the anchor regression framework of Rothenhäusler et al. (2021), the baseline distribution

P0 on (X, Y, U, A) is prescribed by the following linear structural equation model (SEM),

given well-defined B, M and distributions of A, ϵ:


     
$$\begin{pmatrix} X \\ Y \\ U \end{pmatrix} = B \begin{pmatrix} X \\ Y \\ U \end{pmatrix} + MA + \epsilon \;\Longleftrightarrow\; \begin{pmatrix} X \\ Y \\ U \end{pmatrix} = (I - B)^{-1}(MA + \epsilon).$$

Here U represents unobserved confounders, Y is the outcome variable, X are observed

regressors, and A are “anchors” that can be understood as potentially invalid instrumental

variables that may violate the exclusion restriction. Under this SEM, Rothenhäusler et al.

(2021) posit that the potential deviations from the reference distribution P0 are driven by

bounded uncertainties in the anchors A. Their main result provides a DRO interpretation of

a modified population version of the IV regression that interpolates between the IV and

OLS objectives for γ > 1:

$$\min_\beta\; E[(Y - X^T\beta)^2] + (\gamma - 1)\,E[(P_A(Y - X^T\beta))^2] \;=\; \min_\beta\; \sup_{v\in C^\gamma} E_v[(Y - X^T\beta)^2]. \quad (22)$$

The distributions Ev induced by v ∈ C γ are defined via the following SEM with a

bounded set C γ :
 
$$\begin{pmatrix} X \\ Y \\ U \end{pmatrix} = (I - B)^{-1} v, \qquad C^\gamma := \{v : vv^T \preceq \gamma\, M\,E(AA^T)\,M^T\}. \quad (23)$$

In the interpolated objective in Eq. (22), PA (·) = E(· | A) and E[(PA (Y − X T β))2 ] is

the population version of the IV (TSLS) regression objective with A as the instrument.

Letting κ = 1 − 1/γ, we can rewrite the interpolated objective on the left hand side in (22)

equivalently as

$$\min_\beta\; E[(P_A(Y - X^T\beta))^2] + \frac{1-\kappa}{\kappa}\, E[(Y - X^T\beta)^2], \quad (24)$$

which can be interpreted as “regularizing” the IV objective with the OLS objective, with

penalty parameter (1 − κ)/κ. Jakobsen and Peters (2022) observe that the finite sample

version of the objective in (24) is precisely that of a k-class estimator with parameter

κ (Theil, 1961; Nagar, 1959). This observation together with (22) therefore provides a

DRO interpretation of k-class estimators, which is also extended by Jakobsen and Peters

(2022) to more general instrumental variables estimation settings. Moreover, when κ = 1,

or equivalently γ → ∞, we recover the standard IV objective in (24). Therefore, the IV

estimator has a distributionally robust interpretation via (22) when distributional shifts v

are unbounded.
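As an illustration, the finite-sample analogue of the interpolated objective in (22), $\|y - X\beta\|^2 + (\gamma-1)\|\Pi_A(y - X\beta)\|^2$, is minimized in closed form by $(X^T W X)^{-1} X^T W y$ with $W = I + (\gamma-1)\Pi_A$. A minimal sketch is given below; the function name and the default value of γ are ours for illustration.

```python
# Sketch only: closed-form anchor regression (k-class type) estimator for the
# interpolated objective ||y - Xb||^2 + (gamma - 1) * ||Pi_A (y - Xb)||^2.
import numpy as np

def anchor_regression(X, y, A, gamma=5.0):
    n = X.shape[0]
    Pi_A = A @ np.linalg.pinv(A)              # projection onto the span of the anchors
    W = np.eye(n) + (gamma - 1.0) * Pi_A
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# gamma = 1 recovers OLS, while letting gamma grow large approaches the
# IV/TSLS solution with A as instruments, mirroring the discussion of (24).
```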

The DRO interpretation (22) of k-class estimators sheds new light on some old wisdom on

IV estimation. As has already been observed and studied by a large literature (Richardson,

1968; Nelson and Startz, 1990a,b; Bound et al., 1995; Staiger and Stock, 1997; Hahn et al.,

2004; Burgess et al., 2011; Andrews et al., 2019; Young, 2022), when instruments are weak,

the usual normal approximation to the distribution of the IV estimator may be very poor,

and the IV estimator is biased in small samples and in the weak instruments asymptotics.

Moreover, a small violation of the exclusion restriction, i.e., direct effect of instruments

on the outcome, can result in large bias when instruments are weak (Angrist and Pischke,

2009). Consequently, IV may not perform as well as the OLS estimator in such a setting.

Regularizing the IV objective by the OLS objective in (24) can therefore alleviate the

weak instrument problem. This improvement has also been observed for k-class estimators

with particular choices of κ (Fuller, 1977; Hahn et al., 2004). The DRO interpretation

complements the intuition above based on regularizing the IV objective with the OLS

objective. In so far as weak instruments can be understood as a form of distributional shift

from standard modeling assumptions (strong first stage effects), a distributionally robust

regression approach is a natural solution to address the challenge of weak instruments. In

the case of the anchor regression, the distribution uncertainty set indexed by v ∈ C γ always

contains distributions on (X, Y, U, A) where the association between A and X is weak, by

selecting appropriate ∥v∥ ≈ 0. Therefore, the DRO formulation (22) of k-class estimators

demonstrates that they are robust against the weak instrument problem by design. An

additional insight of the DRO formulation is that k-class estimators and anchor regression

are also optimal in terms of predicting Y with X when the distribution of (X, Y ) could

change between the training and test datasets in a bounded manner induced by the anchors

A.

On the other hand, the DRO interpretation of k-class estimators also exposes their potential

limitations. First of all, the ambiguity set in (23) does not in fact contain the reference

distribution P0 itself for any finite robustness parameter, which is unsatisfactory. Moreover,

the SEM in (23) also implies that the instrument (anchors) A cannot be influenced by the

unobserved confounder U , which is a major source of uncertainty regarding the validity

of instruments in applications of IV estimation. In this sense, we may understand k-class

estimators as being robust against weak instruments (Young, 2022), since they minimize an

objective that interpolates between OLS and IV. On the other hand, the DRIVE approach

we propose in this paper is by design robust against invalid instruments, as the ambiguity set

captures distributional shifts arising from conditional correlations between the instrument

and the outcome variable, conditional on the endogenous variable.

Appendix B Related Works

Our work is related to several literatures, including distributionally robust optimization,

instrumental variables estimation, and regularized (penalized) regression. Although histori-

cally they developed largely independent of each other, recent works have started to explore

their interesting connections, and our work can be viewed as an effort in this direction.

B.1 Distributionally Robust Optimization and Min-max Opti-

mization

DRO has been an important research area in operations research, and traces its origin to game

theory (Von Neumann and Morgenstern, 1947). Scarf (1958) first studied DRO in the

context of inventory control under uncertainties about future demand distributions. This

work was followed by a line of research in min-max stochastic optimization models, notably

the works of Shapiro and Kleywegt (2002), Calafiore and Ghaoui (2006), and Ruszczyński

(2010). Distributional uncertainty sets based on moment conditions are considered by

Dupačová (1987); Prékopa (2013); Bertsimas and Popescu (2005); Delage and Ye (2010).

Distributional uncertainty sets based on distance or divergence measures are considered by

Iyengar (2005); Wang et al. (2016). In recent years, distributional uncertainty sets based on

the Wasserstein metric have gained traction, appearing in Mohajerin Esfahani and Kuhn

(2018); Blanchet et al. (2019); Blanchet and Murthy (2019); Duchi et al. (2021), partly due

to their close connections to regularized regression, such as the LASSO (Tibshirani, 1996;

Belloni et al., 2011) and regularized logistic regression. Other works employ alternative

divergence measures, such as the KL divergence (Hu and Hong, 2013) and more generally

ϕ-divergence (Ben-Tal et al., 2013). In this work, we focus on DRO based on the Wasserstein

metric, originally proposed by Kantorovich (1942) in the context of optimal transport, which

has also become a popular tool in economics in recent years (Galichon, 2018, 2021).

DRO has gained traction in causal inference problems in econometrics and statistics

very recently. For example, Kallus and Zhou (2021); Adjaho and Christensen (2022); Lei

et al. (2023) apply DRO in policy learning to handle distributional uncertainties. Chen et al.

(2021) apply DRO to address the possibility of mis-specification of rational expectation

in estimation of structural models. Sahoo et al. (2022) use distributional shifts to model

sampling bias. Bertsimas et al. (2022) study DRO versions of classic causal inference

frameworks. Fan et al. (2023) studies distributional model risk when data comes from

multiple sources and only marginal reference measures are identified. DRO is also connected

to the literature in macroeconomics on robust control (Hansen and Sargent, 2010). A related

recent line of works in econometrics also employs a min-max approach to estimation (Lewis

and Syrgkanis, 2018; Kaji et al., 2020; Metzger, 2022; Cigliutti and Manresa, 2022; Bennett

and Kallus, 2023), inspired by adversarial networks from machine learning (Goodfellow

et al., 2014). These works leverage adversarial learning to enforce a large, possibly infinite,

number of (conditional) moment constraints, in order to achieve efficiency gains. In contrast,

the emphasis of the min-max formulation in our paper is to capture the potential violations

of model assumptions using a distributional uncertainty set.

The DRO approach that we propose in this paper is motivated by a recent line of works that reveal interesting connections between causality and notions of invariance and distributional robustness (Peters et al., 2016; Meinshausen, 2018; Rothenhäusler et al., 2021; Bühlmann, 2020; Jakobsen and Peters, 2022). Our work is closely related to this line of works, whereby causality is interpreted as an invariance or robustness property under distributional shifts. In particular, Rothenhäusler et al. (2021); Jakobsen and Peters (2022)

provide a distributionally robust interpretation of the classic k-class estimators. In our

work, instead of constructing the distribution set based on marginal or joint distributions

as is commonly done in previous works, we propose a Wasserstein DRO version of the

IV estimation problem based on distributional shifts in conditional quantities, which is

then reformulated as a ridge type regularized IV estimation problem. In this regard, our

estimator is fundamentally different from the k-class estimators, which minimize an IV

regression objective regularized by an OLS objective.

B.2 Instrumental Variables Estimation

Our work is also closely related to the classic literatures in econometrics and statistics on

instrumental variables estimation (regression), which was originally proposed and developed by

Theil (1953) and Nagar (1959), and became widely used in applied fields in economics. Since

then, many works have investigated potential challenges to instrumental variables estimation

and their solutions, including invalid instruments (Fisher, 1961; Hahn and Hausman, 2005;

Berkowitz et al., 2008; Kolesár et al., 2015) and weak instruments (Nelson and Startz,

1990a,b; Staiger and Stock, 1997; Murray, 2006; Andrews et al., 2019). Tests of weak

instrument have been proposed by Stock and Yogo (2002) and Sanderson and Windmeijer

(2016). Notably, the test of Stock and Yogo (2002) for multi-dimensional instruments is

based on the minimum eigenvalue rank test statistic of Cragg and Donald (1993). In our

Wasserstein DRIVE framework, the penalty/robustness parameter can also be selected

using the minimum eigenvalue of the first stage coefficient. It remains to further study

the connections between our work and the weak instrument literature in this regard. The

related econometric literature on many (weak) instruments studies the regime where the

number of instruments is allowed to diverge proportionally with the sample size (Kunitomo,

1980; Bekker, 1994; Chamberlain and Imbens, 2004; Chao and Swanson, 2005; Kolesár,

2018). In this work, we will assume a fixed number of instruments to best illustrate the

Wasserstein DRIVE approach. However, it would be interesting to extend the framework

and analysis in the current work to the many instruments setting.

Testing for invalid instruments is possible in the over-identified regime, where there

are more instruments than endogenous variables (Sargan, 1958; Kadane and Anderson,

1977; Hansen, 1982; Andrews, 1999). These tests have been used in combination with

variable selection methods, such as LASSO and thresholding, to select valid instruments

under certain assumptions (Kang et al., 2016; Windmeijer et al., 2018; Guo et al., 2018a;

Windmeijer et al., 2021). In our paper, we propose a regularization selection procedure for

Wasserstein DRIVE based on bootstrapped score quantile. In simulations, we find that

the selected ρ increases with the degree of instrument invalidity. It remains to further

study the relation of this score quantile and test statistics for instrument invalidity in

the over-identified setting. Lastly, our framework can be viewed as complementary to the

post-hoc sensitivity analysis of invalid instruments (Angrist et al., 1996; Small, 2007; Conley

et al., 2012), where instead of bounding the potential bias of IV estimators arising from

violations of model assumptions after estimation, we incorporate such potential deviations

directly into the estimation procedure.

Instrumental variables estimation has also gained wide adoption in epidemiology and

genetics, where it is known as Mendelian randomization (MR) (VanderWeele et al., 2014;

Bowden et al., 2015; Sanderson and Windmeijer, 2016; Emdin et al., 2017). An important

consideration in MR is invalid instruments, because many genetic variants, which are

candidate instruments in Mendelian randomization, could be correlated with the outcome

variable through unknown mechanisms that are either direct effects (horizontal pleiotropy)

or correlations with unobserved confounders. Methods have been proposed to address these

challenges, based on robust regression and regularization ideas (Bowden et al., 2015, 2016;

Burgess et al., 2016, 2020). Our proposed DRIVE framework contributes to this area by

providing a novel regularization method robust against potentially invalid instruments.

B.3 Regularized Regression

Our Wasserstein DRIVE framework can be viewed as an instance of a data-driven regularized

IV method. In this regard, it complements the classic k-class estimators, which regularize

the IV objective with OLS (Rothenhäusler et al., 2021). Data-driven k-class estimators have

been shown to enjoy better finite sample properties. These include the LIML (Anderson and

Rubin, 1949) and the Fuller estimator (Fuller, 1977), which is a modification of LIML that

works well when instruments are weak (Stock et al., 2002). More recently, Jakobsen and Peters

(2022) proposed another data-driven k-class estimator called the PULSE, which minimizes

the OLS objective but with a constraint set defined by statistical tests of independence

between instrument and residuals. Kolesár et al. (2015) propose a modification of the k-class

estimator that is consistent with invalid instruments whose direct effects on the outcome

are independent of the first stage effect on the endogenous regressor.

There is a rich literature that explores the interactions and connections between regular-

ized regression and instrumental variables methods. One line of works seeks to improve the

finite-sample performance and asymptotic properties of IV type estimators using methods

from regularized regression. For example, Windmeijer et al. (2018) applies LASSO regression

to the first stage, motivated by applications in genetics where one may have access to many

weak or invalid instruments. Belloni et al. (2012); Chernozhukov et al. (2015) also apply

LASSO, but the task is to select optimal instruments in the many instruments setting or

when covariates are high-dimensional. (Caner, 2009; Caner and Kock, 2018; Belloni et al.,

2018) apply LASSO to GMM estimators, generalizing regularized regression results from the

M-estimation setting to the moment estimation setting, which also includes IV estimation.

Another line of works on regularized regressions, which is more closely related to our work,

have investigated the connections and equivalences between regularized regression and causal

effect estimators in econometrics based on instrumental variables. Basmann (1960a,b);

McDonald (1977) are the first to connect k-class estimators to regularized regressions.

Rothenhäusler et al. (2021) and Jakobsen and Peters (2022) further study the distributional

robustness of k-class estimators as minimizers of the TSLS objective regularized by the

OLS objective. The Wasserstein DRIVE estimator that we propose in this work applies a

different type of regularization, namely a square root ridge regularization, to the second

stage coefficient in the TSLS objective. As such it has different behaviors compared to the

anchor and k-class estimators, which regularize using the OLS objective. It is also different

from works that apply regularization to the first stage.

Appendix C Square Root Ridge Regression

In this section, we turn our attention to the square root ridge estimator in the standard

regression setting. We first establish the √n-consistency of the square root ridge when the

regularization parameter vanishes at the appropriate rate. We then consider a novel regime

with non-vanishing regularization parameter and vanishing noise, revealing properties that

are strikingly different from the standard ridge regression. As we will see, these observations

in the standard setting help motivate and provide the essential intuitions for our results

in the IV estimation setting. In short, the interesting behaviors of the square root ridge

arise from its unique geometry in the regime of vanishing noise, where $\sum_{i=1}^n \epsilon_i^2 = o_p(n)$

as n → ∞. This regime is rarely studied in conventional regression settings in statistics,

but it precisely captures features of the instrumental variables estimation setting, where

projected residuals (Y − Xβ0 )T ΠZ (Y − Xβ0 ) = op (n) when instruments are valid and β0 is

the true effect coefficient. In addition to providing intuitions for the IV estimation setting,

the regularization parameter selection procedure proposed for the square root LASSO in

the standard regression setting by Belloni et al. (2011) also inspires us to propose a novel

procedure for the IV setting in Section E, which is shown to perform well in simulations.


C.1 √n-Consistency of the Square Root Ridge

We now consider the square root ridge estimator in the standard regression setting, and

prove its √n-consistency. We will build on the results of Belloni et al. (2011) on the

non-asymptotic estimation error of the square root LASSO estimator. Conditional on a

fixed design Xi ∈ Rp , and with Φ the CDF of ϵi , we consider the data generating process,

Yi = XiT β0 + σϵi .

In this section, we rewrite the objective of the square root ridge estimation (15) as
$$\min_\beta\; \sqrt{\hat{Q}(\beta)} + \frac{\lambda}{n}\sqrt{\|\beta\|^2 + 1} \quad (25)$$
$$\hat{Q}(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2, \quad (26)$$

and denote β̂ as the minimizer of the objective. Without loss of generality, we assume for

all j,
$$\frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1.$$
In other words, each covariate (feature) is normalized to have unit norm. Similar to the

square root LASSO case, we will show that by selecting λ = O(√n) properly, or equivalently ρ = O(n−1 ) in (15), we can achieve, with probability 1 − α, a √n-consistency result:
$$\|\hat{\beta} - \beta\|_2 \lesssim \sigma\sqrt{p\log(2p/\alpha)/n}.$$

Compare this with the bound of the square root LASSO, which is

$$\|\hat{\beta} - \beta\|_2 \lesssim \sigma\sqrt{s\log(2p/\alpha)/n},$$

where s is the number of non-zero entries of β0 , and is allowed to diverge as n → ∞. Since

we do not impose assumptions on the size of the support of the p-dimensional vector β0 , if

s = p is finite in the square root LASSO framework, we achieve the same bound on the

estimation error. Our bound for the square root ridge is therefore sharp in this sense.
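For reference, the square root ridge problem (25) is convex and can be solved by generic numerical optimization. A minimal sketch is given below, with the penalty level left as an input; the function name is ours for illustration.

```python
# Sketch only: direct numerical minimization of the convex square root ridge
# objective (25); lam corresponds to the penalty level lambda discussed above.
import numpy as np
from scipy.optimize import minimize

def sqrt_ridge(X, y, lam):
    n, p = X.shape

    def objective(beta):
        resid = y - X @ beta
        return np.sqrt(np.mean(resid ** 2)) + (lam / n) * np.sqrt(beta @ beta + 1.0)

    return minimize(objective, x0=np.zeros(p), method="BFGS").x

# The selection rule discussed below suggests lam = sqrt(p) * c * sqrt(n) *
# norm.ppf(1 - alpha / (2 * p)) for some c > 1 (with norm.ppf from scipy.stats).
```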
An important quantity in the analysis is the score S̃, which is the gradient of $\sqrt{\hat{Q}(\beta)}$

evaluated at the true parameter value β = β0 :

$$\tilde{S} = \nabla\sqrt{\hat{Q}(\beta)}\,\Big|_{\beta_0} = \frac{\nabla\hat{Q}(\beta_0)}{2\sqrt{\hat{Q}(\beta_0)}} = \frac{E_n(X\sigma\epsilon)}{\sqrt{E_n(\sigma^2\epsilon^2)}} = \frac{E_n(X\epsilon)}{\sqrt{E_n(\epsilon^2)}},$$

where En denotes the empirical average of the quantities. Similar to the lower bound on

the regularization parameter in terms of the score function λ ≥ cn∥S̃∥∞ in Belloni et al.

(2011), we will aim to impose the condition that λ ≥ cn∥S̃∥2 for some c > 1. Conveniently,
this condition is already implied by λ = √p λ∗ , where λ∗ follows the selection procedures proposed in that paper. To see this point, note that ∥S̃∥2 ≤ √p ∥S̃∥∞ , so that with high probability, √p λ∗ ≥ √p cn∥S̃∥∞ ≥ cn∥S̃∥2 . Thus we may use the exact same selection

procedure to achieve the desired bound, although there are other selection procedures

for λ that would guarantee λ ≥ cn∥S̃∥2 with high probability. For example, choose the

(1 − α)-quantile of n∥S̃∥2 given Xi ’s. We will for now adopt the selection procedure and

the model assumptions in Belloni et al. (2011).

Assumption C.1. We have log2 (p/α) log(1/α) = o(n) and p/α → ∞ as n → ∞.

Under this assumption, and assuming that ϵ is normal, the selected regularization

λ = √p λ∗ satisfies
$$\lambda \lesssim \sqrt{p\,n\log(2p/\alpha)}$$

with probability 1 − α for all large n, using the same argument as Lemma 1 of Belloni et al.

(2011).

An important quantity in deriving the bound on the estimation error is the “prediction”

norm

$$\|\hat{\beta} - \beta_0\|_{2,n}^2 := \frac{1}{n}\sum_i \big(X_i^T(\hat{\beta} - \beta_0)\big)^2 = \frac{1}{n}\sum_i (\hat{\beta} - \beta_0)^T X_i X_i^T (\hat{\beta} - \beta_0),$$
which is related to the Euclidean norm ∥β̂ − β0 ∥2 through the Gram matrix $\frac{1}{n}\sum_i X_i X_i^T$.

We need to make an assumption on the modulus of continuity.

Assumption C.2. There exists a constant κ and n0 such that for all n ≥ n0 , κ∥δ∥2 ≤ ∥δ∥2,n

for all δ ∈ Rp .

When p ≤ n, the Gram matrix $\frac{1}{n}\sum_i X_i X_i^T$ will be full rank (with high probability with

random design), and concentrate around the population covariance matrix. This setting of

p ≤ n is different from the high-dimensional setting in the square root LASSO paper, as

LASSO-type penalties are able to achieve selection consistency when p > n under sparsity,

whereas ridge-type penalties generally cannot. Note also that when p > n, the restricted

eigenvalues are necessary when defining κ, and it is necessary to prove that β̂ − β0 belongs to

a restricted subset of Rp on which the bound with κ holds. When p ≤ n, the restricted subset

and eigenvalues are not necessary, and κ can be understood as the minimum eigenvalue of

the Gram matrix, which would be bounded away from 0 (with high probability). The exact

value of κ is a function of the data generating process. For example, if we assume covariates

are generated independently of each other, then κ ≈ 1.

Theorem C.3. Assume that p ≤ n but p is allowed to grow with n. Let the regularization
λ = √p λ∗ where λ∗ = c√n Φ−1 (1 − α/2p), and under Assumption C.1 and Assumption C.2, the solution β̂ to the square root ridge problem
$$\min_\beta\; \sqrt{\frac{1}{n}\sum_i (Y_i - X_i^T\beta)^2} + \frac{\lambda}{n}\sqrt{\|\beta\|^2 + 1}$$
satisfies
$$\|\hat{\beta} - \beta_0\|_2 \;\le\; \frac{2(\frac{1}{c}+1)}{1 - (\frac{\lambda}{n})^2\frac{1}{\kappa^2}} \cdot \frac{\lambda}{n}\,\sigma\sqrt{E_n(\epsilon^2)} \;\lesssim\; \sigma\sqrt{p\log(2p/\alpha)/n}$$

with probability at least 1 − α for all n large enough.

We remark that the quantile of the score function $\frac{E_n(X\epsilon)}{\sqrt{E_n(\epsilon^2)}}$ is not only critical for establishing the √n-consistency of the square root ridge. It is also important in practice as

the basis for regularization parameter selection. In Section E, we propose a data-driven

regularization selection procedure that uses nonparametric bootstrap to estimate the quantile

of the score, and demonstrate in Section 5 that it has very good empirical performance.

The nonparametric bootstrap procedure may be of independent interest as well. Before we discuss

regularization parameter selection in detail, we first focus on the statistical properties of

the square root ridge under the novel vanishing noise regime.
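A minimal sketch of a simulation/bootstrap estimate of the score quantile is given below. It illustrates the idea behind the rule λ ≥ cn∥S̃∥2 , drawing errors from a standard normal reference distribution (residuals could be resampled instead), and is not necessarily the exact procedure of Section E; the function name and defaults are ours.

```python
# Sketch only: estimate the (1 - alpha)-quantile of n * ||S_tilde||_2, where
# S_tilde = E_n(X eps) / sqrt(E_n(eps^2)), given the design X.
import numpy as np

def score_quantile(X, alpha=0.05, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = np.empty(B)
    for b in range(B):
        eps = rng.standard_normal(n)
        s = (X * eps[:, None]).mean(axis=0) / np.sqrt(np.mean(eps ** 2))
        scores[b] = n * np.linalg.norm(s)
    return np.quantile(scores, 1.0 - alpha)

# A penalty satisfying lambda >= c * n * ||S_tilde||_2 with probability about
# 1 - alpha can then be taken as c * score_quantile(X, alpha) for some c > 1.
```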

C.2 Delayed Shrinkage of Square Root Ridge

Conventional wisdom on regressions with ridge type penalties is that they induce shrink-

age on parameter estimates, and this shrinkage happens for any non-zero regularization.

Asymptotically, if the regularization parameter does not vanish as the sample size increases,

the limit of the estimator, when it exists, is not equal to the true parameter. The same

behavior may be expected of the square root ridge regression. Indeed, this is the case in the

standard linear regression setting with constant variance, i.e., Var(ϵi ) = σ 2 > 0 and

Yi = XiT β0 + ϵi .

However, as we will see, when Var(ϵi ) depends on the sample size, and vanishes as n → ∞,

the square root ridge estimator can be consistent for non-vanishing penalties.

To best illustrate the intuition behind this property of the square root ridge, we start

with the following simple example. Consider the data generating process written in matrix

vector form:

Y = Xβ0 + ϵ, (27)

where the rows of X ∈ Rn×p are i.i.d. N (0, Ip ) and independent of ϵ ∼ N (0, σn2 Ip ). Suppose

that the variance of the noises vanishes: σn2 → 0 as n → ∞. This is not a standard regression

setup, but captures the essence of the IV estimation setting, as we show in Section 4.

Recall the square root ridge regression problem in (15), which is strictly convex:
$$\min_\beta\; \sqrt{\frac{1}{n}\|Y - X\beta\|^2} + \sqrt{\rho(1 + \|\beta\|^2)}.$$
Let $\hat{\beta}^{(n)}_{\mathrm{sqrt}}$ be its unique minimizer. As the sample size n → ∞, we will fix the regularization

parameter ρ ≡ 1, instead of letting ρ → 0. Standard asymptotic theory implies that


$\hat{\beta}^{(n)}_{\mathrm{sqrt}} \to_p \beta_{\mathrm{sqrt}}$, where $\beta_{\mathrm{sqrt}}$ is the minimizer of the limit of the square root ridge objective.

For the simple model (27), we can verify that


$$\sqrt{\frac{1}{n}\|Y - X\beta\|^2} + \sqrt{1 + \|\beta\|^2} \;\to_p\; \|\beta_0 - \beta\| + \sqrt{1 + \|\beta\|^2},$$

where we have used the crucial property σn2 → 0. Therefore, under standard conditions, we

have

$$\hat{\beta}^{(n)}_{\mathrm{sqrt}} \to_p \beta_{\mathrm{sqrt}} := \arg\min_\beta\; \|\beta_0 - \beta\| + \sqrt{1 + \|\beta\|^2}. \quad (28)$$

Note that the limiting objective above is strictly convex and hence has a unique minimizer.

Moreover,

$$\|\beta_0 - \beta\| + \sqrt{1 + \|\beta\|^2} = \|(\beta_0, -1) - (\beta, -1)\| + \|(\beta, -1)\| \ge \|(\beta_0, -1)\|,$$

using the triangle inequality. On the other hand, setting β = β0 in (28) achieves the lower

bound ∥(β0 , −1)∥. Therefore, βsqrt = β0 is the unique minimizer of the limiting objective,

and so
$$\hat{\beta}^{(n)}_{\mathrm{sqrt}} \to_p \beta_0$$

with ρ = 1 non-vanishing. We have therefore demonstrated that with a non-vanishing

regularization parameter, the square root ridge regression can still produce a consistent

estimator. This phenomenon holds more generally: the square root ridge estimator is

consistent for any (limiting) regularization parameter ρ ∈ [0, 1 + 1/∥β0 ∥2 ], as long as the noise

vanishes, in the sense that $\sum_{i=1}^n \epsilon_i^2 = o_p(n)$. This condition is achieved for a wide variety of

empirical risk minimization objectives, including the IV estimation objective.

Theorem C.4. In the linear model (27) where the rows of X are distributed i.i.d. N (0, Ip ),

if $\sum_{i=1}^n \epsilon_i^2 = o_p(n)$ as sample size n → ∞, then for any ρ ∈ [0, 1 + 1/∥β0 ∥2 ], the unique solution

β̂sqrt to (15) is consistent:

$$\hat{\beta}^{(n)}_{\mathrm{sqrt}} \to_p \beta_0.$$

In Fig. 3, we plot the solution of the limiting square root ridge objective in a one-

dimensional example. As we can see, (asymptotic) shrinkage is delayed until regularization

ρ exceeds the limit 1 + 1/∥β0 ∥2 in the vanishing noise regime. This behavior is in stark contrast

with the regular ridge regression estimator, for which shrinkage starts from the origin, even

in the vanishing noise setting.
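This behavior can also be reproduced in a small simulation; the sketch below contrasts the square root ridge and standard ridge solutions under vanishing noise with ρ = 1, using the objective in (15). The specific dimensions and coefficient values are ours for illustration.

```python
# Sketch only: delayed shrinkage under vanishing noise. With rho = 1 below
# 1 + 1/||beta_0||^2, the square root ridge estimate stays close to beta_0,
# while the standard ridge estimate shrinks roughly to beta_0 / (1 + rho).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, rho = 5000, 3, 1.0
beta0 = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n) / np.sqrt(n)       # vanishing noise: sum of eps_i^2 = o_p(n)
y = X @ beta0 + eps

def sqrt_ridge_objective(b):
    return np.sqrt(np.mean((y - X @ b) ** 2)) + np.sqrt(rho * (1.0 + b @ b))

beta_sqrt = minimize(sqrt_ridge_objective, np.zeros(p), method="BFGS").x
beta_ridge = np.linalg.solve(X.T @ X / n + rho * np.eye(p), X.T @ y / n)

print("square root ridge:", np.round(beta_sqrt, 3))   # approximately beta_0
print("standard ridge:  ", np.round(beta_ridge, 3))   # approximately beta_0 / 2
```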

Remark C.5 (Necessary Requirements of Delayed Shrinkage). Although the delayed

shrinkage property of the square root ridge is essentially a simple consequence of the triangle

inequality, it relies crucially on three features of the square root ridge estimation procedure.

First, even though the square root ridge shares similarities with the standard ridge

regression

$$\min_\beta\; \frac{1}{n}\|Y - X\beta\|^2 + \rho\|\beta\|^2,$$

Figure 3: Limit of the square root ridge estimator in a one-dimensional example with vanishing noise, as a function of the regularization parameter ρ. Optimal ρ = 1 + 1/∥β0 ∥2 is the largest regularization level that guarantees consistency of square root ridge.

only the former has delayed shrinkage: the square root operations applied to the mean

squared loss and the squared norm of the parameter above are essential. To see this, note

that with vanishing noise, the limit of the standard ridge estimator for the model in (27) is

the solution to the following problem:

min ∥β0 − β∥2 + ρ∥β∥2 ,


β

which results in the optimal solution β = β0 /(1 + ρ). Therefore, with any non-zero ρ, the

ridge estimator exhibits shrinkage, even with vanishing noise.


p
Second, the inclusion of the extra constant 1 in the regularization term (1 + ∥β∥2 ) is

crucial in guaranteeing that βsqrt = β0 is the unique limit of the square root ridge estimator.

To see this, consider instead the following modified square root ridge problem, which appears

in Owen (2007); Blanchet et al. (2019):


r
1 √
min ∥Y − Xβ∥2 + ρ∥β∥2 ,
β n

where the regularization term does not include an additive constant in the square root, so

simplifies to ∥β∥. Under model (27) with vanishing noise and ρ = 1, this objective has limit

∥β0 − βsqrt ∥ + ∥βsqrt ∥. Without the “curvature” guaranteed by the additional constant in the

62
regularization term, the limiting objective is no longer strictly convex, and there is actually

an infinite number of solutions that achieves the lower bound in the triangle inequality

∥β0 − βsqrt ∥ + ∥βsqrt ∥ ≥ ∥β0 ∥,

including βsqrt = 0. This implies that the solution to the modified objective is no longer

guaranteed to be a consistent estimator of β0 . Indeed, the inconsistency of this curvature-less

version of the square root ridge estimator has also been corroborated by simulations.

Third, given that small penalties in the square root ridge objective could achieve

regularization in finite samples without sacrificing consistency, one may wonder why it

is not widely used. This is partly due to the standard ridge being easier to implement

computationally, but the main reason is that the delayed shrinkage of the square root

ridge estimator is only present in the vanishing noise regime. To see this, assume now

ϵ ∼ N (0, σ 2 I) with non-vanishing σ 2 > 0 in (27). As sample size n → ∞,


r r
1 p 1 p
∥Y − Xβ∥2 + ρ(1 + ∥β∥2 ) = (Xβ0 + ϵ − Xβ)T (Xβ0 + ϵ − Xβ) + ρ(1 + ∥β∥2 )
n n
p p
→p ∥β0 − β∥2 + σ 2 + ρ(1 + ∥β∥2 ),

(n)
and as before β̂sqrt →p βsqrt , the unique minimizer of the limiting objective above. The

optimal condition is given by

(β − β0 ) √ β
p + ρp = 0,
∥β − β0 ∥2 + σ 2 (1 + ∥β∥2 )
(n)
and now only when ρ → 0 is β̂sqrt a consistent estimator of β0 , unless β0 ≡ 0. For this reason,

the fact that square root ridge can be consistent with non-vanishing regularization may

not be particularly useful in standard regression settings. In the presence of non-vanishing

noise, shrinkage happens for any non-zero regularization, which has also been confirmed in

simulations.

Although the consistency of the square root ridge estimator with non-vanishing regular-

ization does not have immediate practical implications for conventional regression problems,

63
it is actually very well suited for the instrumental variables estimation setting. The rea-

son is that IV and TSLS regressions involve projecting the endogenous (and outcome)

variables onto the space spanned by the instrumental variables in the first stage. When

instruments are valid, this projection cancels out the noise terms asymptotically, resulting

in n1 (Y − Xβ0 )T ΠZ (Y − Xβ0 ) →p 0. The subsequent second-stage regression involving the

projected variables therefore precisely corresponds to the vanishing noise regime, and we

may expect a similar delayed shrinkage effect. This is indeed the case, and with non-zero

regularization and (asymptotically) valid instruments, we show in Section 4 that the Wasser-

stein DRIVE estimator is consistent. This result suggests that we can introduce robustness

and regularization to the standard IV estimation through the square root ridge objective

without sacrificing asymptotic validity, and has important implications in practice.

C.3 Square Root Ridge vs. Ridge for GMM and M-Estimators

We also remark on the distinction between the square root ridge and the standard ridge

in the case when ρn → 0. From Fu and Knight (2000), we know that if ρ approaches 0 at

a rate of or slower than O(1/ n), then the ridge estimator has asymptotic bias, i.e., it is

not centered at β0 . However, for square root ridge (and DRIVE), as long as ρn → 0 at any

rate, the estimator will not have any bias. This feature is a result of the self-normalization

property of the square root ridge. In (19), the second term results from
q √ p nρβ0T √ √
nρ(1 + ∥β0 + δ/ n∥2 ) − nρ(1 + ∥β0 ∥2 ) = p · δ/ n + o(δ/ n)
nρ(1 + ∥β0 ∥ ) 2
√ T
ρβ0 δ
→p ,
(1 + ∥β0 ∥2 )

which does not depend on n. In this sense, the parameter ρ in square root ridge is scale-free,

unlike the regularization parameter in the standard ridge case, whose natural scale is

O(1/ n). In the same spirit, when ρ does not vanish, the resulting square root ridge

estimator will have similar behaviors as that of a standard ridge estimator with a vanishing

64

regularization parameter with rate Θ(1/ n). Moreover, the amount of shrinkage essentially
p
does not depend on the magnitude of β0 due to the normalization of β0 by (1 + ∥β0 ∥2 ),

which is also different from the standard ridge setting.

Lastly, we discuss the distinction between our work and that of Blanchet et al. (2022),

which analyzes the asymptotic properties of a general class of DRO estimators. In that

work, the original estimators are based on minimizing a sample loss of the form
n
1X
ℓ(Xi , Yi , β),
n i=1

which encompasses most M-estimators, including the maximum likelihood estimator, and

they focus on the case when ρn → 0. However, the IV (TSLS) estimator is different in that

it is a moment-based estimator, more precisely a GMM estimator (Hansen, 1982). The

key distinction between these estimators is that the objective function of GMM estimators

(and Z-estimators based on estimating equations) usually converges to a weighted distance

function that evaluates to 0 at the true parameter β0 , whereas the objectives of M-estimators

tend to converge to a limit that does not vanish even at the true parameter. To see this

distinction more precisely, consider the limit of the OLS objective under the linear model

Yi = XiT β + ϵi with E(Xi ϵi ) = 0 and n1 XT X →p Ip :

1 1
(Y − Xβ)T (Y − Xβ) = (Xβ0 + ϵ − Xβ)T (Xβ0 + ϵ − Xβ)
n n
→ (β0 − β)T (β0 − β) + σ 2 (ϵ),

which is minimized at β0 , achieving a minimum of σ 2 (ϵ). On the other hand, consider

the following GMM version of the OLS estimator, based on the moment condition that

E(Xi ϵi ) = 0:

1n o n o
min (Y − Xβ)T X W XT (Y − Xβ) ,
β n

where W is a weighting matrix, with the optimal choice being (XT X)−1 in this setting. We

65
have, assuming again n1 XT X →p Ip ,
   
1n T
o
T −1
n
T
o 1 T 1 T −1 1 T
(Y − Xβ) X (X X) X (Y − Xβ) = (Y − Xβ) X ( X X) X (Y − Xβ)
n n n n

→ (β0 − β)T I(β0 − β) = ∥β0 − β∥2 ,

which is also minimized at β0 but achieves a minimum value of 0. This distinction between

M-estimators and Z-estimators (and GMM estimators) is negligible in the standard setting

without the distributionally robust optimization component, and in fact the standard OLS

estimator is preferable to the GMM version for being more stable (Hall, 2003). However,

when we apply square root ridge regularization to these estimators, they start behaving

differently. Only regularized regression based on GMM and Z-estimators enjoys consistency

with a non-zero ρ > 0. In Appendix D.1, we exploit this property to generalize our results

and develop asymptotic results for a general class of GMM estimators.

Appendix D Extensions to GMM Estimation and q-

Wasserstein Distances

In this section, we consider generalizations of the framework and results in the main paper.

We first formulate a Wasserstein Distributionally Robust GMM Estimation Framework,

and generalize the asymptotic results on Wasserstein DRIVE in this setting. We then

consider Wasserstein DRIVE with q-Wasserstein distance where q ̸= 2, and demonstrate

that the resulting estimator enjoys a similar consistency property with non-vanishing

robustness/regularization parameter.

D.1 Wasserstein Distributionally Robust GMM

In this section, we consider general GMM estimation and propose a distributionally robust

GMM estimation framework. Let θ0 ∈ Rp be the true parameter vector in the interior of

66
some compact space Θ ⊆ Rp . Let ψ(W, θ) be a vector of moments that satisfy

E[ψ(Wi , θ0 )] = 0,

for all i, where {W1 , . . . , Wn } are independent but not necessarily identically distributed

variables. Let ψi (θ0 ) = ψ(Wi , θ). We consider the GMM estimators that minimize the

objective
 T  
1 X 1 X
min  ψi (θ) Wn (θ)  ψi (θ)
θ n i n i

where Wn is a positive definite weight matrix, e.g., the weight matrix corresponding to

the two-step or continuous updating estimator, and n1 i ψi (θ) are the sample moments
P

under the empirical distribution Pn on ψ(θ). Both the IV estimation and GMM formulation

of OLS regression fall under this framework. When we are uncertain about the validity

of the moment conditions, similarly to the Wasserstein DRIVE, we consider a regularized

regression objective given by


v T  
u
u
u 1 X 1 X p
min t ψi (θ) Wn (θ)  ψi (θ) + ρ(1 + ∥θ∥2 ). (29)
θ n i n i

We will study the asymptotic properties of this regularized GMM objective. We make use of

the following sufficient technical conditions in Caner (2009) on GMM estimation to simplify

the proof.

Assumption D.1. The following conditions are satisfied:

1
Pn
1. For all i and θ1 , θ2 ∈ Θ, we have |ψi (θ1 )−ψi (θ2 )| ≤ Bt |θ1 , θ2 |, with limn→∞ n i=1 EBtd <

∞ for some d > 2; supθ∈Θ E|ψi (θ)|d < ∞ for some d > 2.

2. Let mn (θ) := n1 E
P
i ψ(θ) and assume that mn (θ) → m(θ) uniformly over Θ, mn (θ)

is continuously differentiable in θ, m1 (θ0 ) = 0 if and only if θ = θ0 , and m(θ) is

continuous in θ; The Jacobian matrix ∂mn (θ)/∂θ →p J(θ) in a neighborhood of θ,

and J(θ0 ) has full rank.

67
3. Wn (θ) is positive definite and continuous on Θ, and Wn (θ) →p W (θ) uniformly in θ.

W (θ) is continuous in θ and positive definite for all θ ∈ Θ.

4. The population objective m(θ)T W (θ)m(θ) is lower bounded by the squared distance

∥θ − θ0 ∥2 , i.e., m(θ)T W (θ)m(θ) ≥ ρ∥θ − θ0 ∥2 for all θ ∈ Θ and some ρ > 0.

See also Andrews (1994); Stock and Wright (2000) which assume similar conditions

as 1-3 on the GMM estimation setup. Condition 4 requires that the weighted moment

is bounded below by a quadratic function near θ0 . Under these conditions, we have the

following result.

Theorem D.2. Under the assumptions in, the unique solution θ̂GM M to
v T  
u
u
u 1 X 1 X p
min t ψi (θ) Wn (θ)  ψi (θ) + ρn (1 + ∥θ∥2 )
θ n i n i

converges to the solution θGM M of the population objective


p p
min m(θ)T W (θ)m(θ) + ρ(1 + ∥θ∥2 ).
θ

Moreover, whenever ρ ≤ ρ, θGM M = θ0 , so that θ̂GM M →p θ0 .

Therefore, the square root ridge regularized GMM estimator also satisfies the consistency

property with a non-zero regularization parameter ρ. Next, we consider general q-Wasserstein

distance with q ̸= 2.

D.2 Generalization to q-Wasserstein DRIVE

The duality result in Theorem 3.1 can be generalized to q-Wasserstein ambiguity sets. The

resulting estimator can enjoy a similar consistency result as the square root Wasserstein

DRIVE (q = 2), but only when q ∈ (1, 2]. This is because the limiting objective can be

written as (assuming ρ = 1 and λp (γ T γ) = 1)

p
p p
(β − β0 )T γ T γ(β − β0 ) + (∥β∥p + 1),

68
where 1/p + 1/q = 1. When q ∈ (1, 2], p ∈ [2, ∞), and so ∥x∥2 ≥ ∥x∥p . As a result, the

limiting objective is bounded below by

p p
p p
∥β − β0 ∥2 + (∥β∥p + 1) ≥ ∥(β, −1) − (β0 , −1)∥p + (∥β∥p + 1)

= ∥(β, −1) − (β0 , −1)∥p + ∥(β, −1)∥p

≥ ∥(β0 , −1)∥p ,

with equality holding in both inequalities if and only if β = β0 , i.e., β0 is again the unique

minimizer of the limiting objective. We therefore have the following result.

Corollary D.3. Under the same assumptions as Theorem 4.1, the following regularized

regression problem
s
1X p
p
min (ΠZ Y − ΠZ Xβ)2i + ρn (∥β∥p + 1) (30)
β n i

has a unique solution that converges in probability to β0 whenever q ∈ (1, 2] and limn→∞ ρn ≤

λp (γ T ΣZ γ).

Appendix E Regularization Parameter Selection for

Wasserstein DRIVE

The selection of penalty/regularization parameters is an important consideration for all

regularized regression problems. The most common approach is cross validation based on

loss function minimization. However, for Wasserstein DRIVE, this standard cross validation

procedure may not adequately address the challenges and goals of DRIVE. For example,

from Theorem 4.1 we know that the Wasserstein DRIVE is only consistent when the penalty

parameter is bounded above. We therefore need to take this result into account when

selecting the penalty parameter. In this section, we discuss two selection procedures, one

based on the first stage regression coefficient, and the other based on quantiles of the score

69
estimated using a nonparametric bootstrap procedure, which is also of independent interest.

We connect our procedures to existing works in the literature on weak and invalid IVs and

investigate their empirical performance in Section 5.

E.1 Selecting ρ Based on Estimate of First Stage Coefficient

Theorem 4.1 guarantees that as long as the regularization parameter converges to a value in

the interval [0, σmin (γ)], Wasserstein DRIVE is consistent. A natural procedure to select ρ

is thus to compute the minimum singular value ρmax := σmin (γ̂) of the first stage regression

coefficient γ̂ and then select a regularization parameter ρ = c · ρmax for c ∈ [0, 1]. In

Section 5, we verify that this procedure produces consistent DRIVE estimators whenever

instruments are valid. Moreover, when the instrument is invalid or weak, Wasserstein

DRIVE enjoys superior finite sample properties, outperforming the standard IV, OLS, and

related estimators at estimation accuracy and prediction accuracy under distributional shift.

This approach is also related to the test of Cragg and Donald (1993), which is originally

used to test for under-identification, and later used by Stock and Yogo (2002) to test for

weak instruments. In the Cragg-Donald test, the minimum eigenvalue of the first stage rank

matrix is used to construct the F -statistic.

Although selecting ρ based on the first stage coefficient gives rise to Wasserstein DRIVE

estimators that perform well in practice, there is one important challenge that remains to

be addressed. Recall that violations of the exclusion restriction can be viewed as a form

of distributional shift. We therefore expect that as the degree of invalidity increases, the

distributional shift becomes stronger. From the DRO formulation of DRIVE in Eq. (12), we

know that the regularization parameter ρ is also the radius of the Wasserstein distribution

set. Therefore, ρ should adaptively increase with increasingly invalid instruments. However,

as the selection procedure proposed here only depends on the first stage estimate, it does

not take this consideration into account. More importantly, when the instruments are weak,

70
the smallest singular of the first stage coefficient is likely to be very close to zero, which

results in a DRIVE estimate with a very small penalty parameter and may thus have similar

problems as the standard IV. We next introduce another parameter selection procedure for

ρ based on Theorem C.3 that is able to better handle invalid and weak instruments.

E.2 Selecting ρ Based on Nonparametric Bootstrap of Quantile

of Score

Recall that the square root LASSO uses the following valid penalty:

λ∗ = cn∥S̃∥∞ ,

where the score function S̃ = ∇Q̂1/2 (β0 ) = √En (xϵ)2 with


En (ϵ )

1X
Q̂(β) = (Yi − XiT β)2 ,
n i

and c = 1.1 is a constant of Bickel et al. (2009). The intuition for this penalty level comes

from the simplest case β0 ≡ 0, when the optimality condition requires λ ≥ n∥S̃∥∞ . To

estimate ∥S̃∥∞ , Belloni et al. (2011) propose to estimate the empirical (1 − α)-quantile

(conditional on Xi ) of ∥E
√n (xϵ)∥2∞ by sampling i.i.d. errors ϵ from the known error distribution
En (ϵ )

Φ with zero mean and variance 1, resulting in


λ∗ = c nΦ−1 (1 − α/2p), (31)

where the confidence level 1 − α is usually set to 0.95.

The consistency result in Theorem C.3 then suggests a natural choice of penalty parameter
√ √
p ∗
ρ for the square root ridge, given by ρ = n
λ, where λ∗ is constructed from (31).

However, there are two main challenges when applying this regularization parameter

selection procedure to Wasserstein DRIVE in the instrumental variables estimation setting.

First, it requires prior knowledge of the type of distribution Φ, e.g., Gaussian, of the errors

71
√ √
p ∗
ϵ, even if we do not need its variance. Second, ρ= n
λ is only valid for the square root

ridge problem in the standard regression setting without instruments. When applied to the

IV setting, the empirical risk is now


1X
Q̂(β) = (Ỹi − X̃iT β)2 ,
n i
where Ỹi = (ΠZ Y)i and X̃i = (ΠZ X)i are variables projected to the instrument space.

This means that “observations” (Ỹi , X̃i ) are no longer independent. Therefore, the i.i.d.

assumption on the errors in the standard regression setting no longer holds.

We propose the following iterative procedure based on nonparametric bootstrap that

simultaneous addresses the two challenges above. Given a starting estimate β (0) of β0 (say
(0)
the IV estimator), we compute the residuals ri = Ỹi − X̃iT β (0) . Then we bootstrap these

residuals to compute the empirical quantile of


∥En (x̃ϵ)∥∞
p ,
En (ϵ2 )
where ϵ is drawn uniformly with replacement from the residuals ri . The quantile based on

bootstrap then replaces Φ−1 (1 − α/2p) in (31) to give a penalty level ρ, which we can use

to solve the square root ridge problem to obtain a new estimate β (1) . Then we use β (1)
(1)
to compute new residuals ri = Ỹi − X̃iT β (1) , and repeat the process. In practice, we can

use the OLS or TSLS estimate as the starting point β (0) . Fig. 4 shows that this procedure

converges very quickly and does not depend on the initial β (0) . Moreover, in Section 5

we demonstrate that the resulting Wasserstein DRIVE estimator has superior estimation

performance in terms of ℓ2 error, as well as prediction under distributional shift.

E.3 Bootstrapped Score Quantile As Test Statistic for Invalid

Instruments

When instruments are valid, one should expect the boostrapped quantiles will converge

to 0. We next formalize this intuition in Proposition E.1 and also confirm it in numerical

72
Figure 4: Left: Penalty parameter convergence as a function of iteration number, with β (0)

starting from OLS, TSLS, and TSLS ridge estimates. Right: Converged penalty for the

standard linear regression model as a function of sample size.

Figure 5: Penalty strength selected based on nonparametric bootstrap of score quantile vs.

correlation strength between invalid instrument and unobserved confounders.

experiments.

Proposition E.1. The bootstrapped quantiles converge to 0 when instruments are valid.

More importantly, in practice we observe that the bootstrapped quantile increases with

the degree of instrument invalidity. Fig. 5 illustrates this phenomenon with increasing

correlation between the instruments and the unobserved confounder. The intuition behind

this observation is that the quantile is essentially describing the orthogonality (moment)

condition for valid IVs, and so should be close to zero with valid IV. A large value therefore

indicates possible violation. Thus, the bootstrapped quantile could potentially be used as a

test statistic for invalid instruments, using for example permutation tests. Equivalently, in

73
a sensitivity analysis it could be used as a sensitivity parameter, based on which we can

bound the worst bias of IV/OLS models.

We provide a more detailed discussion to further justify our proposal. In a linear

regression setting, the quantity


P
∥ i (xϵ)∥∞
pP
2
i (ϵ )

is a test statistic for the orthogonality condition E[X(Y − Xβ)] = 0 which holds asymp-
P
∥ i (xϵ)∥∞
totically for β equal to the OLS estimator. When √P 2
is not zero, it indicates a
i (ϵ )

violation of the orthogonality condition, which means a non-zero penalty could be beneficial.

Similarly, in a TSLS model


P
∥ i (x̃ϵ)∥∞
pP
2
i (ϵ )

is a test statistic for the orthogonality condition E[X̃(Ỹ − X̃β)] = 0 which is asymptotically
P
∥ i (x̃ϵ)∥∞
correct when β is the TSLS estimator and Z is valid instrument. A large √P 2
therefore
i (ϵ )

indicates potential violations of the IV assumptions. We may also compare this quantity

with the Sargan test statistic (Sargan, 1958) for instrument invalidity in the over-identified

setting and note similarities.

The penalty selection proposed in Belloni et al. (2011) can therefore be seen as a test

statistic for the moment condition E(X(Y − Xβ)) = 0 which should hold asymptotically for

β equal to the OLS estimator if the model assumption that X is independent of the error

term is correct. So if the penalty is large, it is evidence for potential violation of X ⊥ ϵ.

Similarly, in a TSLS model, the moment condition is E(X̃(Ỹ − X̃β)) = 0 for beta equal

to the TSLS estimator, so the penalty can be seen as assessing potential violation of IV

assumptions.

Remark E.2. Besides the data-driven procedures discussed above, we can also consider

incorporating information provided by statistical tests for IV estimation. For example,

74
in over-identified settings, the Sargan-Hasen test (Sargan, 1958; Hansen, 1982) can be

used to test for the exclusion restriction. We can use this test to provide evidence on the

validity of the instrument. For testing weak instruments, the popular test of Stock and

Yogo (2002) can be used. This proposal is also related to our observation that ρ based

on bootstrapped quantiles increase with the degree of invalidity, i.e., direct effect on the

outcome or correlation with confounders, and can therefore potentially be used as a test

statistic for the reliability of the IV estimator. We leave a detailed investigation of this

proposal to future work.

Appendix F Proofs

F.1 Proof of Theorem 3.1

Proof. The proof of Theorem 3.1 relies on a general duality result on Wasserstein DRO,

with different variations derived in (Gao and Kleywegt, 2023; Blanchet and Murthy, 2019;

Sinha et al., 2017). We start with the inner problem in the objective in (12):

h i
T 2
sup EQ (Ỹ − X̃ β) ,
{Q:D(Q,P̃n )≤ρ}

where D is the 2-Wasserstein distance and P̃n is the empirical distribution on the pro-

jected data {Ỹi , X̃i }ni=1 ≡ {(ΠZ Y)i , (ΠZ X)i }ni=1 . Proposition 1 of Sinha et al. (2017) and

Proposition 1 of Blanchet et al. (2019) both imply that


n
h
T 2
i 1X
sup EQ (Ỹ − X̃ β) = inf γρ + ϕγ (β; (X̃i , Ỹi )),
{Q:D(Q,P)≤ρ} γ≥0 n i=1

where the “robust” loss function is

ϕγ (β; (X̃, Ỹ )) = sup (Y − X T β)2 − γ∥X − X̃∥22 − γ(Y − Ỹ )2


(X,Y )

= sup W T ααT W − γ∥W − W̃ ∥22 ,


W

75
with W = (X, Y ), W̃ = (X̃, Ỹ ) and α = (−β, 1). Note that γ is always chosen large

enough, i.e., γI − ααT ⪰ 0, so that the objective W T ααT W − γ∥W − W̃ ∥22 is concave in

W . Otherwise, the supremum over W in the inner problem is unbounded. Therefore, the

first order condition is sufficient:

ααT W − γ(W − W̃ ) = 0,

so that

(ααT − γI)W = −γ W̃ ,

and

W = γ(γI − ααT )−1 W̃

= (I − ααT /γ)−1 W̃ ,

where I − ααT /γ is invertible if γI − ααT is positive definite, which is required to make

sure that the quadratic is concave in W . The supremum is then given by

W̃ T (I − ααT /γ)−1 ααT (I − ααT /γ)−1 W̃ − γ(W̃ T ((I − ααT /γ)−1 − I)2 W̃ )

= W̃ T ((I − ααT /γ)−1 ααT (I − ααT /γ)−1 − γ((I − ααT /γ)−1 − I)2 )W̃ ≡ ∥W̃ ∥2A ,

where

A = ((I − ααT /γ)−1 ααT (I − ααT /γ)−1 − γ((I − ααT /γ)−1 − I)2 ).

Using the Sherman-Morrison Lemma (Sherman and Morrison, 1950), whose condition is

satisfied if γI − ααT is positive definite,

1
(I − ααT /γ)−1 = I + ααT ,
γ − αT α

and A can be simplified as

γ
A= T
ααT .
γ−α α

76
In summary, for each projected observation (for the IV estimate) W̃i = (X̃i , Ỹi ), we can

obtain a new “robustified” sample using the above operation, then minimize the following

modified empirical risk constructed from the robustified samples:


n
h i 1X
min sup EQ (Ỹi − X̃iT β)2 ⇔ min inf γρ + (ϕγ (β; (X̃i , Ỹi ))
β {Q:D(Q,P)≤ρ} β γ≥0 n i=1
1X
⇔ min inf γρ + ∥(X̃i , Ỹi )∥2A ,
β γ≥0 n i

where for fixed β, γ ≥ 0 is always chosen large enough so that ϕγ (β; X, Y ) is finite.

Now, the inner minimization problem can be further solved explicitly. Recall that it is

equal to

1X T γ
inf γρ + W̃i ( T
ααT )W̃i ,
γ≥0 n i γ−α α

which is convex in γ hence minimized at the first order condition:

W̃iT ααT W̃i αT α


P
1 i
ρ= ,
n (γ − αT α)2
q
W̃iT ααT W̃i αT α
P
1
or γ = n
i
ρ
+ αT α, where we have chosen the larger root since only it is

guaranteed to satisfy γI − ααT ⪰ 0 for any α = (−β, 1).


1
P
Plugging this expression of γ into the objective, and using the notation ℓIV := n i (Ỹi −

77
β T X̃i )2 ,
s
1X 1X T T
γρ + ∥(X̃i , Ỹi )∥2A = ραT α · W̃i αα W̃i + ραT α
n i n i
1X T 1
+ W̃i ( αT α
ααT )W̃i
n i 1− r
T
P T T
1 i W̃i αα W̃i α α
n ρ
+αT α

p ℓIV
= ραT α · ℓIV + ραT α + r P T T
1 i W̃i αα W̃i
n ρ
r
1
P T T
i W̃i αα W̃i

n ρ
+ αT α

p αT α
= ραT α · ℓIV + ραT α + ℓIV + q ℓIV
W̃iT ααT W̃i
P
1 i
n ρ
p
= 2 ραT α · ℓIV + ραT α + ℓIV
p p
= ( ℓIV + ραT α)2 .

Therefore, we have proved that the Wasserstein DRIVE objective

h i
min sup EQ (Ỹ − X̃ T β)2
β
{Q:D(Q,P̃n )≤ρ}

is equivalent to the following square root ridge regularized IV objective:


s
1X p
min (Ỹi − β T X̃i )2 + ρ(∥β∥2 + 1).
β n i

F.2 Proof of Theorem 4.1


p
Proof. We will show that as n → ∞, β̂ DRIVE → β0 as long as ρn → ρ ≤ λp (γΣZ γ T ).

Recall the linear IV model (2)

Y = β0T X + ϵ,

X = γ T Z + ξ.

78
with instrument relevance and exogeneity conditions
h i
rank(E ZX T ) = p,
h i
T
E [Zϵ] = 0, E Zξ = 0.

First, we compute the limit of the objective function (17), reproduced below
r
1 p
∥ΠZ Y − ΠZ Xβ∥2 + ρn (∥β∥2 + 1). (32)
n

For the loss term, we have


s r
1X 1
(ΠZ Y − ΠZ Xβ)2i = (ΠZ Y − ΠZ Xβ)T (ΠZ Y − ΠZ Xβ)
n i n
r
1
= (ΠZ (Xβ0 + ϵ) − ΠZ Xβ)T (ΠZ (Xβ0 + ϵ) − ΠZ Xβ)
n
r
1
= (ΠZ X(β0 − β) + ϵ)T (ΠZ X(β0 − β) + ϵ)
n
r
1 T
= (ϵ ΠZ ϵ − 2ϵT ΠZ X(β − β0 ) + (β − β0 )T XT ΠZ X(β − β0 )).
n

Note first that n1 ϵT ΠZ X(β − β0 ) = op (1) whenever the instruments are valid, since

1 T 1 X X
ϵ ΠZ X(β − β0 ) = ( ϵi Zi )T (ZT Z)−1 ( Zi XiT (β − β0 ))
n n i i
1 X 1 1X
=( ϵi Zi )T ( ZT Z)−1 ( Zi XiT (β − β0 ))
n i n n i

→p E[Zϵ] · Σ−1 T
Z · E[ZX ] · (β − β0 ) = 0,

by the continuous mapping theorem. Similarly,

1 1 X X
(β − β0 )T XT ΠZ X(β − β0 ) = ( Zi XiT (β − β0 ))T (ZT Z)−1 ( (Zi XiT (β − β0 ))
n n i i
1X 1 1X
=( Zi XiT (β − β0 ))T ( ZT Z)−1 ( Zi XiT (β − β0 ))
n i n n i

→p (β − β0 )T E(Xi ZiT )Σ−1 T


Z E(Zi Xi )(β − β0 )

= (β − β0 )T γ T ΣZ Σ−1
Z ΣZ γ(β − β0 )

= (β − β0 )T γ T ΣZ γ(β − β0 ).

79
The most important part is the “vanishing noise” behavior, i.e.,

1 T 1 X X
ϵ ΠZ ϵ = ( ϵi Zi )T (ZT Z)−1 ( ϵi Zi )
n n i i
1 X 1 1X
= ( ϵi Zi )T ( ZT Z)−1 ( ϵi Zi )
n i n n i

→p (E(ϵi Zi ))T Σ−1


Z (E(ϵi Zi )) = 0.

It then follows that the regularized regression objective (17) of the Wasserstein DRIVE

estimator converges in probability to (18), reproduced below

p p
(β − β0 )T γ T ΣZ γ(β − β0 ) + ρ(∥β∥2 + 1). (33)

For ρ > 0, the population objective (33) is continuous and strictly convex in β, and so has

a unique minimizer β DRIVE . Applying the convexity lemma of Pollard (1991), since (32)

is also strictly convex in β, the convergence to (33) is uniform on compact sets B ⊆ Rp

that contain β DRIVE . Applying Corollary 3.2.3 of van der Vaart and Wellner (1996), we can

therefore conclude that the minimizers of the empirical objectives converge in probability

to the minimizer of the population objective, i.e.,

β̂ DRIVE →p β DRIVE .

Next, we consider minimizing the population objective (33). If ρ is bounded above by

the smallest singular value of γ T ΣZ γ, i.e.,

ρ ≤ λp (γ T ΣZ γ T ),

the population objective is lower bounded by

p p √ √ p
(β − β0 )T γ T ΣZ γ(β − β0 ) + ρ(∥β∥2 + 1) ≥ ρ∥β − β0 ∥2 + ρ ∥β∥2 + 1
√ √
= ρ∥(β, 1) − (β0 , 1)∥2 + ρ∥(β, 1)∥2

≥ ρ∥(β0 , 1)∥2 ,

80
where in the second line we augment the vectors β, β0 with an extra coordinate equal to

1. The last line follows from the triangle inequality, with equality if and only if β ≡ β0 .

We can verify that the lower bound ρ∥(β0 , 1)∥2 of the population objective is therefore

achieved uniquely at β ≡ β0 due to strict convexity. We have thus proved that when
p
0 < ρ ≤ λp (γ T ΣZ γ), the population objective has a unique minimizer at β0 . When

ρ = 0, the consistency of β̂ DRIVE can be similarly proved as long as λp (γ T ΣZ γ) > 0, which

guarantees that β0 is the unique minimizer of (33). Therefore, whenever ρ ≤ λp (γ T ΣZ γ T ),

we have β̂ DRIVE →p β0 .

F.3 Proof of Theorem 4.2

Proof. Define the objective function Hn (δ) of a local parameter δ ∈ Rp as follows:


r
1 p
ϕn (β) := ∥ΠZ Y − ΠZ Xβ∥2 + ρn (∥β∥2 + 1)
n
√  √ 
Hn (δ) := n ϕn (β0 + δ/ n) − ϕn (β0 ) .


Note that Hn (δ) is minimized at δ = n(β̂nDRIVE − β0 ). The key components of the proof

are to compute the uniform limit H(δ) of Hn (δ) on compact sets in the weak topology, and

to verify that their minimizers are uniformly tight, i.e., n(β̂nDRIVE − β0 ) = Op (1). We

can then apply Theorem 3.2.2 of van der Vaart and Wellner (1996) to conclude that the

sequence of minimizers n(β̂n − β0 ) of Hn (δ) converges in distribution to the minimizer of

the limit H(δ). We have

√ √ q √ p
Hn (δ) = n · (ϕn (β0 + δ/ n) − ϕn (β0 )) = ∥ΠZ Y − ΠZ X(β0 + δ/ n)∥2 − ∥ΠZ Y − ΠZ Xβ0 ∥2
| {z }
I
q √ p
+ nρn (1 + ∥β0 + δ/ n∥2 ) − nρn (1 + ∥β0 ∥2 ).
| {z }
II

81
We first focus on I:
q √ p
I= ∥ΠZ Y − ΠZ X(β0 + δ/ n)∥2 − ∥ΠZ Y − ΠZ Xβ0 ∥2
q √ p
= Fn (β0 + δ/ n) − Fn (β0 ),

where

Fn (β) = ∥ΠZ Y − ΠZ Xβ∥2 .

We have, with ψi (β) ≡ Zi (Yi − β T Xi ),

√ √
Fn (β0 + δ/ n) = ∥ΠZ Y − ΠZ X(β0 + δ/ n)∥2
1 X √ 1 1 X √
= (√ ψi (β0 + δ/ n))T ( ZT Z)−1 ( √ ψi (β0 + δ/ n)),
n i n n i

Fn (β0 ) = ∥ΠZ Y − ΠZ Xβ0 ∥2


1 X 1 1 X
= (√ ψi (β0 ))T ( ZT Z)−1 ( √ ψi (β0 )).
n i n n i

We compute the limits of Fn (β0 + δ/ n) and Fn (β0 ). We have

∥ψi (β1 ) − ψi (β2 )∥ ≤ ∥Zi XiT (β1 − β2 )∥

= ∥Z(Z T γ + ξ T )(β1 − β2 )∥

≤ (∥ZZ T ∥∥γ∥ + ∥Zξ T ∥) · ∥(β1 − β2 )∥

≤ (∥Z∥2 ∥γ∥ + ∥Z∥∥ξ∥) · ∥(β1 − β2 )∥

where ∥ · ∥ denotes operator norm for matrices and Euclidean norm for vectors. We have,

for some constant c that depends on k,

E(∥Z∥2 ∥γ∥ + ∥Z∥∥ξ∥)k ≤ c(E∥Z∥2k ∥γ∥k + E∥Z∥k ∥ξ∥k )


p p
≤ c(E∥Z∥2k ∥γ∥k + E∥Z∥2k · E∥ξ∥2k )

<∞

82
using the assumptions that E∥Z∥2k < ∞ and E∥ξ∥2k < ∞. Moreover, we have

E∥ψ(β)∥k = E∥Z(Y − β T X)∥k

= E∥Z(X T (β0 − β) + ϵ)∥k

= E∥Z((Z T γ + ξ T )(β0 − β) + ϵ)∥k


 
≤ c E∥ZZ T γ(β0 − β)∥k + E∥Zξ T (β0 − β)∥k + E∥Zϵ∥k
 p p p p 
≤ c E∥Z∥2k ∥γ(β0 − β)∥k + E∥Z∥2k E∥ξ∥2k ∥(β0 − β)∥k + E∥Z∥2k E∥ϵ∥2k

which is uniformly bounded on compact subsets. The consistency result in Theorem 4.1

combined with the above bounds guarantee stochastic equicontinuity (Andrews, 1994), so

that as n → ∞, uniformly in δ on compact sets that contain δ = n(β̂nDRIVE − β0 ),
 
1 X √ √
√ ψi (β0 + δ/ n) − Eψi (β0 + δ/ n) →d N (0, Ω(β0 )) ≡ Z,
n i

√1 E T
P
where Ω(β) = n i (ψi (β)ψi (β)), so that

1 X 1 X
Ω(β0 ) = √ E (ψi (β0 )ψiT (β0 )) = √ E (Yi − XiT β)2 Zi ZiT
n i
n i
1 X
=√ E ϵ2i Zi ZiT = σ 2 ΣZ ,
n i

using independence and homoskedasticity. Moreover,

1 X √ √ h √ i
√ Eψi (β0 + δ/ n) = nE X T (β0 + δ/ n) − Y Z
n i
√ h √ i
= nE X T (β0 + δ/ n) − (X T β0 + ϵ) Z

= EZX T δ = EZ(Z T γ + ξ)δ

= ΣZ γδ.

Combining these, we have

1 X √
√ ψi (β0 + δ/ n) →d Z + ΣZ γδ,
n i

83
uniformly in δ on compact sets, so that


Fn (β0 + δ/ n) →d (Z + ΣZ γδ)T Σ−1
Z (Z + ΣZ γδ)

Fn (β0 ) →d Z T Σ−1
Z Z,

and applying the continuous mapping theorem to the square root function,
q √ p q q
I= Fn (β0 + δ/ n) − Fn (β0 ) →d (Z + ΣZ γδ) ΣZ (Z + ΣZ γδ) − Z T Σ−1
T −1
Z Z.

Next we have
q √ p
II = nρn (1 + ∥β0 + δ/ n∥2 ) − nρn (1 + ∥β0 ∥2 )
nρn β0T √ √
=p · δ/ n + o(δ/ n)
nρn (1 + ∥β0 ∥2 )

ρn β0T
→p · δ.
(1 + ∥β0 ∥2 )

Combining the analyses of I and II, we have



ρβ0T
q q
Hn (δ) →d (Z + ΣZ γδ)T Σ−1
Z (Z + ΣZ γδ) − Z T Σ−1
Z Z +p ·δ
(1 + ∥β0 ∥2 )

uniformly. Because Hn (δ) is convex and H(δ) has a unique minimum, arg minδ Hn (δ) =

n(β̂nDRIVE − β0 ) = Op (1). Applying Theorem 3.2.2 of van der Vaart and Wellner (1996)

allows us to conclude that



√ ρβ0T
q q
n(β̂nDRIVE − β0 ) →d arg min (Z + ΣZ γδ)T Σ−1
Z (Z + ΣZ γδ) − Z T Σ−1
Z Z +p · δ.
δ (1 + ∥β0 ∥2 )
q
In fact, we may drop the term Z T Σ−1
Z Z since it does not depend on δ. Therefore,

√ T
√ q
ρβ0
n(β̂nDRIVE − β0 ) = arg min Hn (δ) →d arg min (Z + ΣZ γδ)T Σ−1
Z (Z + ΣZ γδ) + p · δ.
δ δ (1 + ∥β0 ∥2 )

Now when ρ = 0, the objective above reduces to


q
(Z + ΣZ γδ)T Σ−1
Z (Z + ΣZ γδ),

84
which recovers the same minimizer as the TSLS objective

(Z + ΣZ γδ)T Σ−1 T −1 T T T T
Z (Z + ΣZ γδ) − Z ΣZ Z = 2δ γ Z + δ γ ΣZ γδ,

since the first order condition of the former is

γ T Z + γ T ΣZ γδ
p = 0,
(Z + ΣZ γδ)T (Z + ΣZ γδ)

and of the latter is

γ T Z + γ T ΣZ γδ = 0.

We can therefore conclude that with vanishing ρn → ρ = 0, regardless of the rate, the

asymptotic distribution of Wasserstein DRIVE coincides with that of the standard TSLS

estimator.

F.4 Proof of Corollary 4.3


p
Proof. When ρn → 0, the limiting objective is (Z + γδ)T (Z + γδ) which is minimized at

the same δ that minimizes the standard limit (Z + γδ)T (Z + γδ).

If 0 < ρ ≤ |γ|, then FOC gives



γ T Z + δT γ T γ ρβ0
p +p = 0.
(Z + γδ)T (Z + γδ) (1 + ∥β0 ∥2 )

If β0 is one-dimensional (but γ can be a vector, i.e., multiple instruments), then FOC

reduces to

γ T Z + δ T γ T γ = 0,

which is the same FOC as the standard IV limiting objective.

If both γ and β0 are one-dimensional, but β0 is not necessarily 0, we have that


√ T √
p
T
ρβ0 ρβ0
(Z + γδ) (Z + γδ) + p · δ = |Z + γδ| + p ·δ
2
(1 + ∥β0 ∥ ) (1 + ∥β0 ∥2 )

85
√ √
ρβ0 ρβ0
The objective is Z + γδ + √ · δ when γδ + Z ≥ 0 and −Z − γδ + √ · δ when
(1+∥β0 ∥2 ) (1+∥β0 ∥2 )

γδ + Z ≤ 0. Recall that by assumption ρ ≤ |γ|.

ρβ0
If β0 > 0 and γ > 0, then γδ + √ ·δ when γδ +Z ≥ 0 is minimized at δ = −γ −1 Z,
(1+∥β0 ∥2 )

ρβ0
and −γδ + √ · δ when γδ + Z ≤ 0 is minimized at −γ −1 Z.
(1+∥β0 ∥2 )

ρβ0
If β0 > 0 and γ < 0, then γδ + √ · δ when γδ + Z ≥ 0 is again minimized
(1+∥β0 ∥2 )

ρβ0
at δ = −γ −1 Z (since δ ≤ −γ −1 Z), and −γδ + √ · δ when γδ + Z ≤ 0 is again
(1+∥β0 ∥2 )

minimized at −γ −1 Z.

ρβ0
If β0 < 0 and γ > 0, then γδ + √ ·δ when γδ +Z ≥ 0 is minimized at δ = −γ −1 Z,
(1+∥β0 ∥2 )

ρβ0
and −γδ + √ · δ when γδ + Z ≤ 0 is minimized at −γ −1 Z.
(1+∥β0 ∥2 )

ρβ0
If β0 < 0 and γ > 0, then γδ + √ ·δ when γδ +Z ≥ 0 is minimized at δ = −γ −1 Z,
(1+∥β0 ∥2 )

ρβ0
and −γδ + √ · δ when γδ + Z ≤ 0 is minimized at −γ −1 Z.
(1+∥β0 ∥2 )

We can therefore conclude that the objective is always minimized at δ = −γ −1 Z, which

is the limiting distribution of TSLS.

F.5 Proof of Theorem D.2

Proof. We can write

1X 1X 1X
ψi (θ) = [ψi (θ) − Eψi (θ)] + Eψi (θ).
n i n i n i

Assumption D.1.1 guarantees that

1X
[ψi (θ) − Eψi (θ)] = op (1),
n i

using for example Andrews.


1
P
Next, Assumption D.1.2 guarantees that n i Eψi (θ) → m(θ) uniformly in θ, and

86
Assumption D.1.3 further guarantees that
v T  
u
u
u 1X 1X p
t ψi (θ) Wn (θ)  ψi (θ) + ρn (1 + ∥θ∥2 ) →p
n i n i
p p
m(θ)T W (θ)m(θ) + ρ(1 + ∥θ∥2 )

uniformly in θ. Applying Corollary 3.2.3 of van der Vaart and Wellner (1996), we can

conclude θ̂GM M →p θGM M .

Next, we consider the minimizer of the population objective. Applying Assumption D.1.4,

when ρ ≤ ρ, it is lower bounded by

p p p p
m(θ)T W (θ)m(θ) + ρ(1 + ∥θ∥2 ) ≥ ρ∥θ − θ0 ∥2 + ρ(1 + ∥θ∥2 )
p p p
≥ ρ · ( ∥θ − θ0 ∥2 + (1 + ∥θ∥2 ))
p √
= ρ · (∥(θ, 1) − (θ0 , 1)∥2 + ρ∥(θ, 1)∥2 )

≥ ρ∥(θ0 , 1)∥2 ,

where again the last inequality follows from the triangle inequality. We can verify that

equalities are achieved if and only if θ = θ0 , which guarantees that θ̂GM M →p θ0 . The

condition m(θ)T W (θ)m(θ) ≥ ρ∥θ − θ0 ∥2 is satisfies by many GMM estimators, including

the TSLS, so this proof applies to Theorem 4.1 as well.

F.6 Proof of Theorem C.3

Proof. First, we use optimality condition of β̂ to bound

λp λ
q q q
Q̂(β̂) − Q̂(β0 ) ≤ ∥β0 ∥2 + 1 − ∥β̂∥2 + 1
n n
q
On the other hand, by convexity of Q̂(β),

λ
q q
Q̂(β̂) − Q̂(β0 ) ≥ S̃ T (β̂ − β0 ) ≥ −∥S̃∥2 ∥β̂ − β0 ∥2 ≥ − ∥β̂ − β0 ∥2
cn

87
Now the estimation error in terms of the “prediction norm” (which is just the norm

defined using the Gram matrix)

1X T
∥β̂ − β0 ∥22,n := (Xi (β̂ − β0 ))2
n i
1X
= (β̂ − β0 )T Xi XiT (β̂ − β0 )
n i

is related to the difference Q̂(β̂) − Q̂(β0 ) as follows:

1X 1X
Q̂(β̂) − Q̂(β0 ) = (Yi − XiT β̂)2 − (Yi − XiT β0 )2
n i n i
1X 1X
= (Yi − XiT β0 + XiT β0 − XiT β̂)2 − (Yi − XiT β0 )2
n i n i
1X
= ∥β̂ − β0 ∥22,n + 2 (Yi − XiT β0 )(XiT β0 − XiT β̂)
n i
1X
= ∥β̂ − β0 ∥22,n + 2 (σϵi )XiT (β0 − β̂)
n i

= ∥β̂ − β0 ∥22,n − 2En (σϵX T (β̂ − β0 ))

On the other hand,


q q  q q 
Q̂(β̂) − Q̂(β0 ) = Q̂(β̂) + Q̂(β0 ) · Q̂(β̂) − Q̂(β0 )

and using Holder’s inequality,

1X
2En (σϵX T (β̂ − β0 )) = 2 (σϵi )XiT (β̂ − β0 )
n i
s 1
P T
1X n i (σϵXi )
=2 (σϵi )2 q P (β̂ − β0 )
n i 1 2 2
(σ ϵ )
n i i
q
= 2 Q̂(β0 ) · S̃ T (β̂ − β0 )
q
≤ 2 Q̂(β0 )∥S̃∥2 ∥β̂ − β0 ∥2

88
Combining these, we can bound the estimation error ∥β̂ − β0 ∥22,n as

∥β̂ − β0 ∥22,n

= 2En (σϵX T (β̂ − β0 )) + Q̂(β̂) − Q̂(β0 )


λp λ
q q q q
≤ 2 Q̂(β0 )∥S̃∥2 ∥β̂ − β0 ∥2 + ( ∥β0 ∥2 + 1 − ∥β̂∥2 + 1) · ( Q̂(β̂) + Q̂(β0 ))
n n
λ λ λp λ
q p q q q
≤ 2 Q̂(β0 )∥S̃∥2 ∥β̂ − β0 ∥2 + ( ∥β0 ∥2 + 1 − ∥β̂∥2 + 1) · (2 Q̂(β0 ) + ∥β0 ∥2 + 1 − ∥β̂∥2 + 1)
n n n n
λp λ λp λ
q q q q
2 2 2 2
= 2 Q̂(β0 )∥S̃∥2 ∥β̂ − β0 ∥2 + ( ∥β0 ∥ + 1 − ∥β̂∥ + 1) + 2 Q̂(β0 )( ∥β0 ∥ + 1 − ∥β̂∥2 + 1)
n n n n
λ 2 λ
q q
2
≤ 2 Q̂(β0 )∥S̃∥2 ∥β̂ − β0 ∥2 + ( ) ∥β̂ − β0 ∥2 + 2 Q̂(β0 ) ∥β̂ − β0 ∥2
n n
λ 1 λ 2
q
≤ 2 Q̂(β0 ) ( + 1)∥β̂ − β0 ∥2 + ( ) ∥β̂ − β0 ∥22
n c n

Now the norms ∥β̂ − β0 ∥22,n and ∥β̂ − β0 ∥2 differ by the Gram matrix n1 i Xi XiT , which
P

by the assumption n1 i Xij2 = 1 has diagonal entries equal to 1. Recall that κ is the tight
P

constant such that

κ∥β̂ − β0 ∥2 ≤ ∥β̂ − β0 ∥2,n

for any β̂ − β0 , so we get

1 1 λ 1 1 λ
q
∥β̂ − β0 ∥22 ≤ 2 ∥β̂ − β0 ∥22,n ≤ 2 2 Q̂(β0 ) ( + 1)∥β̂ − β0 ∥2 + 2 ( )2 ∥β̂ − β0 ∥22
κ κ n c κ n

which yields

1 1 λ 1
q
∥β̂ − β0 ∥2 ≤ 1 λ 2 κ2
2 Q̂(β0 ) ( + 1)
1 − κ2 ( n ) n c
q
2 Q̂(β0 ) nλ ( 1c + 1)
=
κ2 − ( nλ )2

provided that

λ
( )2 ≤ κ2 .
n

As λ/n → 0 and κ is a universal constant linking the two norms, this condition will be

satisfied for all n large enough if Assumption 2 holds, so that the rate of convergence of

89
λ
∥β̂ − β0 ∥2,n → 0 is governed by that of n
→ 0:

2 nλ ( 1c + 1)
q p
∥β̂ − β0 ∥2 ≤ · Q̂(β0 ) ≲ σ p log(2p/α)/n.
κ2 − ( nλ )2

90

You might also like