
Maximum Likelihood Estimation of Optimal Receiver Operating Characteristic Curves from Likelihood Ratio Observations

Bruce Hajek and Xiaohan Kang
University of Illinois at Urbana–Champaign
Electrical and Computer Engineering and Coordinated Science Laboratory
Urbana, Illinois
Email: b-hajek@illinois.edu, xiaohan.kang1@gmail.com

(arXiv:2202.01956v1 [cs.IT] 4 Feb 2022)

Abstract—The optimal receiver operating characteristic (ROC) curve, giving the maximum probability of detection as a function of the probability of false alarm, is a key information-theoretic indicator of the difficulty of a binary hypothesis testing problem (BHT). It is well known that the optimal ROC curve for a given BHT, corresponding to the likelihood ratio test, is theoretically determined by the probability distribution of the observed data under each of the two hypotheses. In some cases, these two distributions may be unknown or computationally intractable, but independent samples of the likelihood ratio can be observed. This raises the problem of estimating the optimal ROC for a BHT from such samples. The maximum likelihood estimator of the optimal ROC curve is derived, and it is shown to converge to the true optimal ROC curve in the Lévy metric, as the number of observations tends to infinity. A classical empirical estimator, based on estimating the two types of error probabilities from two separate sets of samples, is also considered. The maximum likelihood estimator is observed in simulation experiments to be considerably more accurate than the empirical estimator, especially when the number of samples obtained under one of the two hypotheses is small. The area under the maximum likelihood estimator is derived; it is a consistent estimator of the true area under the optimal ROC curve.

This material is based upon work supported by the National Science Foundation under Grant No. CCF 19-00636.

I. INTRODUCTION

Consider a binary hypothesis testing problem (BHT) with observation X. The observation X could be high dimensional with continuous and/or discrete components. Suppose g_0 and g_1 are the probability densities of X with respect to some reference measure, under hypothesis H_0 or H_1, respectively. Then the likelihood ratio is R = g_1(X)/g_0(X). By the Neyman–Pearson lemma, the optimal decision rule for a specified probability of false alarm is to declare H_1 to be true if either R > τ, or if R = τ and a biased coin comes up heads, for a suitable threshold τ and bias of the coin. The optimal receiver operating characteristic (ROC) curve, giving the maximum probability of detection as a function of the probability of false alarm, is a key information-theoretic indicator of the difficulty of the BHT. Because we focus on the optimal ROC, which is determined by the BHT rather than the specific decision rule, we use the terms "optimal ROC" and "ROC" interchangeably.

This paper addresses the problem of estimating the ROC curve for a BHT from independent samples R_1, ..., R_n of the likelihood ratio. Specifically, we assume for some deterministic sequence (I_i : i ∈ [n]) that R_i is generated from an instance of the BHT such that hypothesis H_{I_i} is true. This problem can arise if the densities g_0 and g_1 are unknown, but can be factored as g_k(x) = u(x)h_k(x) for k ∈ {0, 1}, for some unknown (or very difficult-to-compute) function u and known functions h_0 and h_1. Then the likelihood ratio can be computed for an observation X using R = h_1(X)/h_0(X), but the distribution of the likelihood ratio depends on the unknown function u. So if it is possible, through simulation or repeated physical trials, to generate independent instances of the BHT, it may be possible to generate the independent samples R_1, ..., R_n as described.

To elaborate a bit more, we discuss a possible specific scenario related to Cox's notion of partial likelihood [1]. Suppose X = (Y_1, S_1, Y_2, S_2, ..., Y_T, S_T), where the components themselves may be vectors. The full likelihood under hypothesis H_k for k = 0, 1 is the product of two factors given below, each of which is a product of T factors:

( Π_{t=1}^T f_{Y_t | Y^{t−1}, S^{t−1}}(y_t | y^{t−1}, s^{t−1}; k) ) · ( Π_{t=1}^T f_{S_t | Y^t, S^{t−1}}(s_t | y^t, s^{t−1}; k) ),

where y^t ≜ (y_{t′} : t′ ∈ [t]). Cox defined the first factor to be the partial likelihood based on Y and the second factor to be the partial likelihood based on S. If the first factor is very complicated but does not depend on k, and the second factor is known and tractable, we arrive at the form of the total likelihood described above: g_k(x) = u(x)h_k(x) for k ∈ {0, 1}.

To avoid possible confusion, we emphasize that the problem considered is an inference problem with independent observations, where the ROC is to be estimated. The space of ROCs is infinite-dimensional. The observations R_1, ..., R_n are not used for a binary hypothesis testing problem.

There is a large classical literature on ROC curves dating to the early 1940s. Much of the emphasis relating to estimating ROC curves is focused on estimating the area under the ROC curve (AUC). A popular approach is the binormal model, such that the distribution of an observed score is assumed to be a monotonic transformation of a Gaussian under either hypothesis, and maximum likelihood estimates of the parameters of the Gaussians are found. See [2], [3] and references therein. The first estimator we consider for the ROC curve, which we call the "empirical ROC curve," is described by that name in [4]. The empirical ROC curve is the same up to a rotation as the "sample ordinal dominance graph" defined in [5], p. 400.

The paper is organized as follows. Some preliminaries about ROC curves are given in Section II. The empirical estimator of the optimal ROC curve, based on using the empirical estimators for the two types of error probabilities, is considered in Section III. A performance guarantee is derived based on a well-known bound for empirical estimators of CDFs. The ML estimator of the ROC curve is derived in Section IV. Consistency of the ML estimator with respect to the Lévy metric is demonstrated in Section V. The area under the ML estimator of the ROC curve is derived in Section VI and is shown to be a consistent estimator of AUC. Simulations comparing the accuracy of the empirical and ML estimators are given in Section VII, and discussion is in Section VIII. Proofs are found in the appendix of the full version of this paper posted to arXiv.
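As a concrete illustration of the sampling model just described (our sketch, not part of the original paper), suppose a simulator can generate instances of the BHT, and the known factors h_0, h_1 are hypothetical Gaussian location factors, so that R = h_1(X)/h_0(X) = exp(X − 1/2); the unknown factor u cancels and is never evaluated:

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x, k):
        # Known, tractable factors with g_k(x) = u(x) h_k(x); here a
        # hypothetical stand-in: h_k is an unnormalized N(k, 1) density.
        return np.exp(-0.5 * (x - k) ** 2)

    def draw_lr_samples(labels):
        # labels is the deterministic binary sequence (I_i : i in [n]).
        # X_i is drawn from an instance of the BHT with H_{I_i} true;
        # only the likelihood ratio R_i = h_1(X_i)/h_0(X_i) is kept.
        x = rng.normal(loc=np.asarray(labels, dtype=float), scale=1.0)
        return h(x, 1) / h(x, 0)

    print(draw_lr_samples([0, 0, 0, 1, 1]))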
II. PRELIMINARIES ABOUT OPTIMAL ROC CURVES

A. An extension of a cumulative distribution function (CDF)

The CDF F for a random variable R is defined by F(τ) = P{R ≤ τ} for τ ∈ R. In this paper ∞ always means +∞. Given a CDF F with F(0−) = 0 and possibly a point mass at ∞, we define an extended version of F, and abuse notation by using F to denote both F and its extension. The extension is defined for τ ∈ R ∪ {∞} and η ∈ [0, 1] by F(τ, η) = (1 − η)F(τ−) + ηF(τ), where F(∞−) = lim_{τ→∞} F(τ) and F(∞) = 1. Let F({τ}) = F(τ) − F(τ−) denote the mass at τ. Thus, if R is an extended random variable with CDF F, then F(τ, η) = P{R < τ} + η P{R = τ}. Note the extended version of F is continuous and nondecreasing in (τ, η) in the lexicographic order with F(0, 0) = 0 and F(∞, 1) = 1, and hence surjective onto [0, 1]. Also, let the extended complementary CDF for F be defined by F^c(τ, η) = 1 − F(τ, η), so that F^c(τ, η) = P{R > τ} + (1 − η) P{R = τ}.
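For concreteness, here is a small evaluation routine (ours, not from the paper) for the extension and its complement when the distribution has finitely many atoms, possibly including ∞:

    import numpy as np

    def extended_cdf(atoms, masses, tau, eta):
        # F(tau, eta) = (1 - eta) F(tau-) + eta F(tau)
        #             = P{R < tau} + eta P{R = tau},
        # for a distribution placing `masses` on `atoms` (np.inf allowed).
        atoms = np.asarray(atoms, dtype=float)
        masses = np.asarray(masses, dtype=float)
        below = masses[atoms < tau].sum()
        at = masses[atoms == tau].sum()
        return below + eta * at

    atoms, masses = [0.5, 2.0, np.inf], [0.3, 0.5, 0.2]
    print(extended_cdf(atoms, masses, 2.0, 0.0))      # F(2-) = 0.3
    print(extended_cdf(atoms, masses, 2.0, 1.0))      # F(2)  = 0.8
    print(1 - extended_cdf(atoms, masses, 2.0, 0.5))  # F^c(2, 0.5) = 0.45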
B. The optimal ROC curve for a BHT

Consider a BHT and let F_0 denote the CDF of the likelihood ratio R under hypothesis H_0 and let F_1 denote the CDF of R under hypothesis H_1. Then dF_1(r) = r dF_0(r) for r ∈ (0, ∞) (see Appendix A for details), and F_1(0) = F_0({∞}) = 0, while it is possible that F_0(0) > 0 and/or F_1({∞}) > 0.

The likelihood ratio test with threshold τ and randomization parameter η declares H_0 to be true if R < τ, declares H_1 to be true if R > τ, and declares H_1 to be true with probability η if R = τ. The optimal ROC curve is the graph of the function ROC(p) : 0 ≤ p ≤ 1 defined by ROC(p) = F_1^c(τ, η), where τ and η are selected such that F_0^c(τ, η) = p. Note this is well defined because F_0 is surjective and for any τ, τ′, η, and η′ we have F_0^c(τ, η) = F_0^c(τ′, η′) if and only if F_1^c(τ, η) = F_1^c(τ′, η′). Equivalently, the optimal ROC curve is the set of points traced out by P = (F_0^c(τ, η), F_1^c(τ, η)) as τ and η vary.

Proposition 1: Any one of the functions F_0, F_1, or ROC determines the other two. ROC is a continuous, concave function over [0, 1].
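As a worked example (ours, not from the paper), for a likelihood ratio supported on finitely many points the curve can be traced explicitly, using dF_1(r) = r dF_0(r) and assigning any deficit 1 − E_0[R] to F_1({∞}):

    import numpy as np

    def optimal_roc_vertices(atoms, masses0):
        # atoms: finite support points of the likelihood ratio under H0;
        # masses0: the F0-masses. Under H1 the mass at r is r * F0({r}),
        # and the deficit 1 - E0[R] is the mass F1({infinity}).
        atoms = np.asarray(atoms, dtype=float)
        p0 = np.asarray(masses0, dtype=float)
        p1 = atoms * p0
        f1_inf = 1.0 - p1.sum()           # F1({inf}) >= 0
        order = np.argsort(-atoms)        # declare H1 for largest R first
        pfa = np.concatenate([[0.0], np.cumsum(p0[order])])
        pd = f1_inf + np.concatenate([[0.0], np.cumsum(p1[order])])
        return pfa, pd                    # vertices of the piecewise-linear ROC

    pfa, pd = optimal_roc_vertices([0.5, 1.0, 1.5], [0.4, 0.2, 0.4])
    print(np.c_[pfa, pd])   # slopes 1.5, 1.0, 0.5: concave, as it must be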
C. The Lévy metric on the space of ROC curves

Given nondecreasing functions A, B mapping the interval [0, 1] into itself, the Lévy distance between them, L(A, B), is the infimum of ε > 0 such that

A(p − ε) − ε ≤ B(p) ≤ A(p + ε) + ε for all p ∈ R,

with the convention that A(p) = B(p) = 0 for p < 0 and A(p) = B(p) = 1 for p > 1. A geometric interpretation of L(A, B) is as follows. It is the smallest value of ε such that the graph of B is contained in the region bounded by the following two curves: an upper curve obtained by shifting the graph of A to the left by ε and up by ε, and a lower curve obtained by shifting the graph of A to the right by ε and down by ε.

Remark 1: It is easy to see the Lévy metric is dominated by the uniform norm ‖A − B‖_∞ ≜ sup_{p∈[0,1]} |A(p) − B(p)|. The Lévy metric is typically more suitable than the uniform norm for functions with jumps. To see this, consider a perfect ROC curve ROC ≡ 1 and an estimate ROC(p) = min{cp, 1}. Then for large c the uniform norm of the difference is 1, while the Lévy distance is small.

Lemma 1: Let F_{a,0}, F_{a,1}, F_{b,0}, F_{b,1} denote CDFs for probability distributions on [0, ∞]. Let A be the function defined on [0, 1] determined by F_{a,0}, F_{a,1} as follows. For any p ∈ [0, 1], A(p) = F_{a,1}^c(τ, η), where (τ, η) is the lexicographically smallest point in [0, ∞] × [0, 1] such that F_{a,0}^c(τ, η) = p. (If F_{a,0} and F_{a,1} are the CDFs of the likelihood ratio of a BHT, then A is the corresponding optimal ROC.) Let B be defined similarly in terms of F_{b,0} and F_{b,1}. Then

L(A, B) ≤ sup_{τ∈[0,∞)} max{|F_{a,0}(τ) − F_{b,0}(τ)|, |F_{a,1}(τ) − F_{b,1}(τ)|}. (1)

We remark that [6] introduces a topology on binary input channels that is related to the Lévy metric used in this paper.
III. THE EMPIRICAL ESTIMATOR OF THE ROC

Consider a BHT and let F_k denote the CDF of the likelihood ratio R under hypothesis H_k for k = 0, 1. Suppose for some positive integers n_0 and n_1, independent random variables R_{0,1}, ..., R_{0,n_0}, R_{1,1}, ..., R_{1,n_1} are observed such that R_{k,i} has CDF F_k for k = 0, 1 and 1 ≤ i ≤ n_k. A straightforward approach to estimate ROC is to estimate F_k using only the n_k observations having CDF F_k for k = 0, 1. In other words, let

F̂_k(τ) = (1/n_k) Σ_{i=1}^{n_k} I{R_{k,i} ≤ τ}

for k = 0, 1, and let ROC_E, the empirical estimator of ROC, have the graph swept out by the point (F̂_0^c(τ, η), F̂_1^c(τ, η)) as τ varies over [0, ∞] and η varies over [0, 1]. In general, ROC_E is a step function with all jump locations at multiples of 1/n_0 and the jump sizes being multiples of 1/n_1. Moreover, ROC_E depends on the numerical values of the observations only through the ranks (i.e., the order, with ties accounted for) of the observations.

The estimator ROC_E as we have defined it is typically not concave, and is hence typically not the optimal ROC curve for a BHT. This suggests the concavified empirical estimator ROC_CE, defined to be the least concave majorant of ROC_E. Equivalently, the region under the graph of ROC_CE is the convex hull of the region under ROC_E.
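Both estimators can be computed in a few lines (our sketch, assuming distinct finite sample values; not the authors' code):

    import numpy as np

    def empirical_roc(r0, r1):
        # ROC_E evaluated at its jump grid: false-alarm level j/n0 is
        # attained by thresholding at the j-th largest H0 sample.
        r0 = np.sort(np.asarray(r0))[::-1]
        taus = np.concatenate([[np.inf], r0])
        pfa = np.arange(len(r0) + 1) / len(r0)
        pd = np.array([np.mean(np.asarray(r1) >= t) for t in taus])
        return pfa, pd

    def concavify(pfa, pd):
        # Least concave majorant ROC_CE via an upper-hull (monotone
        # chain) scan over the vertices, sorted by increasing pfa.
        hull = []
        for pt in zip(pfa, pd):
            while len(hull) >= 2:
                (x1, y1), (x2, y2) = hull[-2], hull[-1]
                # pop the middle vertex if it lies on or below the chord
                if (x2 - x1) * (pt[1] - y1) >= (pt[0] - x1) * (y2 - y1):
                    hull.pop()
                else:
                    break
            hull.append(pt)
        return np.array(hull)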
The following provides some performance guarantees for the empirical and concavified empirical estimators.

Proposition 2: Let n = n_0 + n_1 and α = n_0/(n_0 + n_1). Then the empirical estimator satisfies

P{L(ROC, ROC_E) ≥ δ} ≤ 2e^{−2nαδ²} + 2e^{−2n(1−α)δ²}. (2)

Moreover, if α ∈ (0, 1) is fixed and n_k → ∞ for k = 0, 1 with n_0/n_1 = α/(1 − α), then L(ROC, ROC_E) → 0 a.s. as n → ∞. In other words, ROC_E is consistent in the Lévy metric. In general, L(ROC, ROC_CE) ≤ L(ROC, ROC_E), so the above statements are also true with ROC_E replaced by ROC_CE.

Proof: The Dvoretzky–Kiefer–Wolfowitz (DKW) inequality with the optimal constant proved by Massart implies:

P{ sup_{τ∈[0,∞)} |F_k(τ) − F̂_k(τ)| ≥ δ } ≤ 2e^{−2 n_k δ²}. (3)

Combining (3) with Lemma 1 implies (2). The consistency of ROC_E follows from the Borel–Cantelli lemma and the fact that the sum of the right-hand side of (2) over n is finite for any δ > 0.

The final inequality follows from the following observations: ROC_CE(p) ≥ ROC_E(p) for p ∈ [0, 1], and if ROC_E is less than or equal to the concave function p ↦ ROC(p + ε) + ε, then so is ROC_CE, by the definition of least concave majorant. □

Remark 2: A strong consistency result for the empirical estimator in terms of the uniform norm, with some restrictions on the distributions F_0 and F_1, has been developed in [3]. While the bound (2) seems reasonably tight for α near 1/2, the bound is degenerate if α is very close to zero or one. The maximum likelihood estimator derived in the next section is consistent even if all the observations are generated under a single hypothesis.
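The degeneracy noted in Remark 2 is easy to see by tabulating the right side of (2) directly (a few lines, ours):

    import numpy as np

    def roc_levy_bound(n, alpha, delta):
        # Right side of (2).
        return (2 * np.exp(-2 * n * alpha * delta**2)
                + 2 * np.exp(-2 * n * (1 - alpha) * delta**2))

    for alpha in [0.5, 0.1, 0.01]:
        print(alpha, roc_levy_bound(n=1000, alpha=alpha, delta=0.1))
    # alpha = 0.5 gives roughly 1.8e-4, but as alpha -> 0 the first term
    # approaches 2, so the bound exceeds 1 and becomes vacuous.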
IV. THE ML ESTIMATOR OF THE ROC

Consider a BHT and let F_k denote the CDF of the likelihood ratio R under hypothesis H_k for k = 0, 1, and suppose for some n ≥ 1 and deterministic binary sequence (I_i : i ∈ [n]), independent random variables R_1, ..., R_n are observed such that for each i ∈ [n], the distribution of R_i is F_{I_i}. The likelihood of the set of observations is the probability the observations take their particular values, and that is determined by F_0 and F_1, and hence, by Proposition 1, also by ROC or by F_0 alone or by F_1 alone. Hence, it makes sense to ask what is the maximum likelihood (ML) estimator of ROC, or equivalently, what is the ML estimator of the triplet (F_0, F_1, ROC), given (I_i : i ∈ [n]) and (R_i : i ∈ [n]). The answer is given by the proposition in this section.

Let ϕ_n be defined by

ϕ_n(λ) ≜ (1/n) Σ_{i=1}^n 1/(λ + (1 − λ)R_i) if 0 ≤ λ < 1, and ϕ_n(1) ≜ 1. (4)

Note that ϕ_n is finite over (0, 1], continuous over [0, 1), and convex over [0, 1]. Moreover, ϕ_n(0) = ∞ if and only if R_i = 0 for some i, and ϕ_n has a jump discontinuity at 1 if and only if R_i = ∞ for some i.

Proposition 3: The ML estimator (F̂_0, F̂_1, ROC_ML) is unique and is determined as follows. ROC_ML is the optimal ROC curve corresponding to F̂_0 and/or F̂_1, where:

1) If (1/n) Σ_{i=1}^n R_i ≤ 1, then for τ ∈ [0, ∞)

F̂_0(τ) = (1/n) Σ_{i=1}^n I{R_i ≤ τ}; F̂_1(τ) = (1/n) Σ_{i=1}^n I{R_i ≤ τ} R_i.

2) If (1/n) Σ_{i=1}^n 1/R_i ≤ 1, then for τ ∈ [0, ∞)

F̂_0^c(τ) = (1/n) Σ_{i=1}^n (1/R_i) I{R_i > τ}; F̂_1(τ) = (1/n) Σ_{i=1}^n I{R_i ≤ τ}.

3) If neither of the previous two cases holds, then for τ ∈ [0, ∞)

F̂_0(τ) = (1/n) Σ_{i=1}^n I{R_i ≤ τ} / (λ_n + (1 − λ_n)R_i)

and

F̂_1(τ) = (1/n) Σ_{i=1}^n I{R_i ≤ τ} R_i / (λ_n + (1 − λ_n)R_i),

where λ_n is the unique value in (0, 1) so that ϕ_n(λ_n) = 1.

Remark 3:
1) The estimator does not depend on the indicator variables (I_i : i ∈ [n]). That is, the estimator does not take into account which observations are generated using which hypothesis.
2) Cases 1) and 2) can both hold only if R_i = 1 for all i, because r + 1/r ≥ 2 for r ∈ [0, ∞] with equality if and only if r = 1.
3) If case 1) holds with strict inequality, then F̂_1({∞}) > 0, even though R_i < ∞ for all i.
4) Similarly, if case 2) holds with strict inequality, then F̂_0(0) > 0 even though R_i > 0 for all i.
5) Suppose case 3) holds. The existence and uniqueness of λ_n can be seen as follows. Since case 2) does not hold, ϕ_n(0) > 1. If R_i = ∞ for some i then ϕ_n(1−) < 1; and if R_i < ∞ for all i, then ϕ_n′(1) = (1/n) Σ_{i=1}^n (R_i − 1) > 0, where we have used the fact that case 1) does not hold. Thus, ϕ_n(λ) < 1 if λ < 1 and λ is sufficiently close to 1. Therefore the existence and uniqueness of λ_n in case 3) follow from the properties of ϕ_n.
6) The proof of Proposition 3 is in Appendix C. Maximizing the likelihood is reduced to a convex optimization problem and the KKT optimality conditions are used.

The following corollary presents an alternative version of Proposition 3 that consolidates the three cases of Proposition 3. It is used in the proof of consistency of the ML estimator.

Corollary 1: The ML estimator is unique and is determined as follows. For τ ∈ [0, ∞),

F̂_0^c(τ) = (1/n) Σ_{i=1}^n I{R_i > τ} / (λ_n + (1 − λ_n)R_i)

and

F̂_1(τ) = (1/n) Σ_{i=1}^n I{R_i ≤ τ} R_i / (λ_n + (1 − λ_n)R_i),

where λ_n = min{λ ∈ [0, 1] : ϕ_n(λ) ≤ 1}.

Remark 4: It is shown in the proof of Proposition 4 that λ_n → α a.s. as n → ∞ if F_0 is not identical to F_1. Thus, for n large, λ_n is approximately the prior probability that a given observation is generated under hypothesis H_0 and nλ_n is approximately the number of observations generated under H_0. The ML estimator F̂_0 can be written as

F̂_0^c(τ) = (1/(nλ_n)) Σ_{i=1}^n [λ_n / (λ_n + (1 − λ_n)R_i)] I{R_i > τ},

where λ_n / (λ_n + (1 − λ_n)R_i) can be interpreted as an estimate of the posterior probability that observation R_i was generated under H_0.
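Corollary 1 admits a direct implementation (our sketch, assuming 0 < R_i < ∞ for all i, so that λ_n can be found by bisection on the convex function ϕ_n):

    import numpy as np

    def phi_n(lam, r):
        return np.mean(1.0 / (lam + (1.0 - lam) * r))

    def lambda_n(r, tol=1e-12):
        # lambda_n = min{lambda in [0,1] : phi_n(lambda) <= 1}. Cases 1)
        # and 2) of Proposition 3 reduce to the endpoints 1 and 0.
        if phi_n(0.0, r) <= 1.0:
            return 0.0                     # case 2): mean of 1/R_i <= 1
        if np.mean(r) <= 1.0:
            return 1.0                     # case 1): mean of R_i <= 1
        lo, hi = 0.0, 1.0                  # case 3): phi_n crosses 1 once
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if phi_n(mid, r) > 1.0 else (lo, mid)
        return 0.5 * (lo + hi)

    def ml_cdfs(r, tau):
        # Corollary 1: returns (hat F_0^c(tau), hat F_1(tau)).
        lam = lambda_n(r)
        w = 1.0 / (lam + (1.0 - lam) * r)
        return np.mean(w * (r > tau)), np.mean(w * r * (r <= tau))

    rng = np.random.default_rng(0)
    r = np.exp(rng.normal(size=200) - 0.5)   # LR samples, all under H0 here
    print(lambda_n(r), ml_cdfs(r, 1.0))

Note that, consistent with Remark 3 item 1), the code never sees the labels I_i; only the sample values enter.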
V. CONSISTENCY OF THE ML ESTIMATE OF THE ROC

Suppose R has CDF F_0 under H_0 and CDF F_1 under H_1, such that R is also the likelihood ratio. Let α be fixed with α ∈ [0, 1] and suppose the observations R_1, R_2, ... are independent, identically distributed random variables with the mixture distribution αF_0 + (1 − α)F_1. We are considering the problem of estimating the ROC curve for the BHT for distributions F_0 and F_1 using the ML estimator (ROC_ML, F̂_0, F̂_1) based on the observations R_1, ..., R_n as n → ∞. For brevity we suppress n in the notation for F̂_0, F̂_1, and ROC_ML, and we write "X → c a.s. as n → ∞," where a.s. is the abbreviation for "almost surely," to mean P{lim_{n→∞} X = c} = 1.

Proposition 4 (Consistency of the ML estimator of the ROC curve): The ML estimator of the ROC curve for H_0 vs. H_1 is consistent. That is, L(ROC_ML, ROC) → 0 a.s. as n → ∞.

The proof of Proposition 4 is given in Appendix E. The first part of the proof is to establish that if F_0 is not identical to F_1, then λ_n → α a.s. as n → ∞. Thus, the estimators F̂_0 and F̂_1 are close to functions obtained by replacing λ_n by α, and those resulting functions converge to F_0 and F_1, respectively, by the law of large numbers.

VI. AREA UNDER THE ML ROC CURVE

The area under ROC_ML, which we denote by AUC_ML, is a natural candidate for an estimator of AUC, the area under ROC for the BHT. An expression for it is given in the following proposition. Let λ_n be defined as in Corollary 1 and for i, i′ ∈ [n], let

T_{i,i′} = max{R_i, R_{i′}} / (2(λ_n + (1 − λ_n)R_i)(λ_n + (1 − λ_n)R_{i′})),

with the following understanding. Recall that if R_i = 0 for some i ∈ [n] then λ_n > 0, so the denominator in T_{i,i′} is always strictly positive. Also recall that if R_i = ∞ for some i ∈ [n] then λ_n < 1, and the following is based on continuity: if R_i = R_{i′} = ∞, set T_{i,i′} = 0; if R_i < R_{i′} = ∞, set T_{i,i′} = 1/(2(λ_n + (1 − λ_n)R_i)(1 − λ_n)).

Proposition 5:
1) The area under ROC_ML is given by

AUC_ML = (1/n²) Σ_{i=1}^n Σ_{i′=1}^n T_{i,i′}. (5)

2) The estimator AUC_ML is consistent: AUC_ML → AUC a.s. as n → ∞.
3) Let R, R′ be independent random variables and use E_0 to denote expectation when they both have CDF F_0. Then

AUC = (1/2) E_0[max{R, R′}] + F_1({∞}) (6)
    = 1 − (1/2) E_0[min{R, R′}]. (7)

4) For i ≠ i′, E[T_{i,i′}^{(α)}] = AUC, where T_{i,i′}^{(α)} is the same as T_{i,i′} with λ_n replaced by α.

Remark 5:
1) The expression (5) can be verified by checking that it reduces to (6) in case E_0 is replaced by expectation using F̂_0 and F_1 is replaced by F̂_1. A more direct proof of (5) is given in Appendix F.
2) The true AUC for the BHT is invariant under swapping the two hypotheses. Similarly, AUC_ML is invariant under replacing λ_n by 1 − λ_n and R_i by 1/R_i for all i. If R_i = 1 for all i, then AUC_ML = 1/2.
3) Part 4) of the proposition is to be expected due to the consistency of AUC_ML and the law of large numbers, because if n is large, most of the n² terms in (5) are indexed by i, i′ with i ≠ i′, and we know, if F_0 is not identical to F_1, that λ_n → α a.s. as n → ∞.
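Expression (5) vectorizes directly (our sketch, finite samples assumed; it reuses the lambda_n helper from the Section IV sketch):

    import numpy as np

    def auc_ml(r):
        # Equation (5): (1/n^2) * sum over all ordered pairs of T_{i,i'}.
        r = np.asarray(r, dtype=float)
        lam = lambda_n(r)                  # Section IV sketch
        d = lam + (1.0 - lam) * r
        t = np.maximum.outer(r, r) / (2.0 * np.outer(d, d))
        return t.mean()

    rng = np.random.default_rng(0)
    r = np.exp(rng.normal(size=500) - 0.5)   # binormal LR samples under H0
    print(auc_ml(r))   # true AUC for the Section VII setting: Phi(1/sqrt(2)) ~ 0.76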
VII. SIMULATIONS

In this section we test the estimators in a simple binormal setting. Let X be distributed by N(0, 1) under H_0 and by N(1, 1) under H_1. Then the likelihood ratio is R = exp(X − 1/2) and the ROC curve is given by ROC(p) = 1 − Φ(Φ^{−1}(1 − p) − 1), where Φ is the CDF of the standard Gaussian distribution. Simulation results for the three ROC estimators are shown in Fig. 1 with various numbers of observations under the two hypotheses (n_0, n_1). For each pair (n_0, n_1) two figures are shown. The left figure shows the estimated ROC curves together with the true ROC curve for a single sample instance of n_0 + n_1 likelihood ratio observations. The right figure shows the average Lévy distances of the estimators over N = 500 such sample instances together with error bars (i.e., plus or minus sample standard deviations divided by √N). The simulation code can be found at [7].

The two empirical estimators have similar performance, while CE outperforms E slightly in terms of the Lévy distance. Note ROC_CE, as the least concave majorant of ROC_E, could be biased toward higher probability of detection, as evidenced by the sample instances.

It can be seen that the ML estimator (MLE) achieves much smaller Lévy distance than E or CE. The difference is more pronounced when the number of observations under one hypothesis is significantly smaller than that under the other, as seen in Figs. 1d–1f. This is because E and CE calculate the empirical distributions based on the likelihood ratio observations under the two hypotheses separately before combining the empirical distributions into an estimated ROC curve. As a result, having very few samples under either hypothesis results in errors in estimating the ROC curve regardless of how accurate the estimated distribution under the other hypothesis is. In contrast, every observation contributes to the joint estimation of the pair of distributions in MLE, so the ROC curve can be accurately estimated even when there are very few samples from one hypothesis. In fact, as Section V suggested, MLE works even if all samples are generated from the same hypothesis (see Fig. 1g), while E and CE do not work because one of the distributions cannot be estimated at all.

[Fig. 1: Sample instances and average Lévy distances. Each panel pairs a plot of the true and estimated ROC curves (probability of detection vs. probability of false alarm; curves labeled true, MLE, E, CE) with a bar chart of the average Lévy distance per estimator. Panels: (a) n_0 = 10, n_1 = 10; (b) n_0 = 100, n_1 = 100; (c) n_0 = 1000, n_1 = 1000; (d) n_0 = 10, n_1 = 100; (e) n_0 = 10, n_1 = 1000; (f) n_0 = 100, n_1 = 1000; (g) n_0 = 0, n_1 = 100.]
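A condensed version of this experiment can be written as follows (our sketch; the authors' released code at [7] may differ). It reuses the lambda_n and levy_distance helpers from the Section IV and Section II-C sketches:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def ml_roc_vertices(r):
        # Vertices of ROC_ML, sweeping tau over the sorted sample values.
        lam = lambda_n(r)                        # Section IV sketch
        w = 1.0 / (lam + (1.0 - lam) * r)
        taus = np.concatenate([np.sort(r), [np.inf]])
        pfa = np.array([np.mean(w * (r > t)) for t in taus])[::-1]
        pd = np.array([1.0 - np.mean(w * r * (r <= t)) for t in taus])[::-1]
        return np.append(pfa, 1.0), np.append(pd, 1.0)

    true_roc = lambda p: 1.0 - norm.cdf(norm.ppf(1.0 - p) - 1.0)

    n0, n1, trials, dists = 10, 1000, 50, []
    for _ in range(trials):
        x = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(1.0, 1.0, n1)])
        r = np.exp(x - 0.5)                      # likelihood ratio samples
        pfa, pd = ml_roc_vertices(r)
        dists.append(levy_distance(true_roc,     # Section II-C sketch
                                   lambda p: np.interp(p, pfa, pd)))
    print(np.mean(dists))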

VIII. DISCUSSION

The qualitative differences between the empirical estimator ROC_E and the ML estimator ROC_ML are striking. Only the rank ordering of the samples is used by the empirical estimator, not the numerical values. So it is important to track which samples are generated with which distribution. The ML estimator does not depend on which samples were generated with which distribution, and exact numerical values are used. We proved a consistency result for ROC_ML, but perhaps it also satisfies a bound similar to (2). It may be interesting to explore the accuracy of the ML estimator for large, fixed n as a function of the fraction, α, of observations that are taken under hypothesis H_0.

A BHT is the same as a binary input channel (BIC). Work of Blackwell and others working on the comparison of experiments has led to canonical channel descriptions that are equivalent to the ROC curve, such as the Blackwell measure. The Blackwell measure is the distribution of the posterior probability that hypothesis H_0 is true for equal prior probabilities 1/2 for the hypotheses. See [8] and references therein. It may be of interest to explore estimation of various canonical channel descriptions besides the ROC under various metrics.
REFERENCES

[1] D. R. Cox, "Partial likelihood," Biometrika, vol. 62, no. 2, pp. 269–276, 1975. [Online]. Available: http://www.jstor.org/stable/2335362
[2] C. Metz and X. Pan, "Proper binormal ROC curves: Theory and maximum-likelihood estimation," J. Math. Psychology, vol. 43, pp. 1–33, 1999.
[3] F. Hsieh and B. W. Turnbull, "Nonparametric and semiparametric estimation of the receiver operating characteristic curve," The Annals of Statistics, vol. 24, no. 1, Feb. 1996.
[4] E. DeLong, D. DeLong, and D. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach," Biometrics, vol. 44, pp. 837–845, 1988.
[5] D. Bamber, "The area above the ordinal dominance graph and the area below the receiver operating characteristic graph," J. Math. Psychology, vol. 12, pp. 387–415, 1975.
[6] R. Nasser, "Topological structures on DMC spaces," Entropy, vol. 20, no. 5, 2018. [Online]. Available: https://www.mdpi.com/1099-4300/20/5/343
[7] X. Kang, "ML estimator of optimal ROC curve simulations," Feb. 2022. [Online]. Available: https://github.com/Veggente/mleroc
[8] N. Goela and M. Raginsky, "Channel polarization through the lens of Blackwell measures," IEEE Transactions on Information Theory, vol. 66, no. 10, pp. 6222–6241, 2020.
[9] A. Ben-Tal and A. Nemirovski, "Optimization III: Convex analysis, nonlinear programming theory, nonlinear programming algorithms," 2013. [Online]. Available: https://www2.isye.gatech.edu/~nemirovs/OPTIII_LectureNotes2018.pdf
[10] K. B. Athreya and S. N. Lahiri, Measure Theory and Probability Theory. New York, NY: Springer, 2006.
APPENDIX A
RELATION OF F_0 AND F_1

Let P_k and g_k denote the probability distribution and the probability density function with respect to some reference measure μ of the observation X in a measurable space (X, Σ) under hypothesis H_k for k = 0, 1. In other words, P_k(A) = ∫_A g_k(x) μ(dx) for any A ∈ Σ. Let ρ : X → R̄ ≜ R ∪ {∞} be defined by

ρ(x) = g_1(x)/g_0(x) if g_0(x) > 0, and ρ(x) = ∞ if g_0(x) = 0.

Then ρ is a Borel measurable function denoting the likelihood ratio given an observation. The probability distribution of the extended random variable R = ρ(X) under H_k is the push-forward of the measure P_k induced by the function ρ for k = 0, 1, denoted by ν_k. The probability distribution ν_k restricted to R is also the unique Borel measure (known as the Lebesgue–Stieltjes (L–S) measure corresponding to F_k, the CDF of R) on [0, ∞) such that ν_k([0, τ]) = F_k(τ) for all τ ∈ [0, ∞).

Throughout this paper, integrals of the form ∫ h(r) dF(r) are understood to be Lebesgue–Stieltjes integrals (for the extended real numbers). That is,

∫_R̄ h(r) dF(r) ≜ ∫_R̄ h(r) ν_F(dr)

for any Borel measurable function h.

Proposition 6: For any Borel set A in R,

ν_1(A) = ∫_A r ν_0(dr).

In other words, when restricted to the Borel sets in R, ν_1 is absolutely continuous with respect to ν_0, and the Radon–Nikodym derivative is the identity function almost everywhere with respect to ν_0.

Proof: By the change-of-variables formula for push-forward measures, for any Borel set A in R,

ν_1(A) = ∫_R̄ I_A(r) ν_1(dr)
       = ∫_X I_A(ρ(x)) P_1(dx)
       = ∫_X I_A(ρ(x)) g_1(x) μ(dx)
       = ∫_X I_A(ρ(x)) ρ(x) g_0(x) μ(dx)
       = ∫_X I_A(ρ(x)) ρ(x) P_0(dx)
       = ∫_R̄ I_A(r) r ν_0(dr)
       = ∫_A r ν_0(dr),

implying the proposition. □
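Proposition 6 is easy to sanity-check by Monte Carlo (our check, not from the paper). Taking h = I_A, the identity ν_1(A) = ∫_A r ν_0(dr) says E_1[h(R)] = E_0[R h(R)]; in the binormal setting of Section VII:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10**6
    r0 = np.exp(rng.normal(0.0, 1.0, n) - 0.5)   # R = rho(X) under H0
    r1 = np.exp(rng.normal(1.0, 1.0, n) - 0.5)   # R = rho(X) under H1

    h = lambda r: (r <= 2.0).astype(float)       # indicator of A = [0, 2]
    print(np.mean(h(r1)), np.mean(r0 * h(r0)))   # both approach F1(2) ~ 0.577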
APPENDIX B
PROOFS FOR SECTION II

Proof of Proposition 1: The function F_0 determines F_1 by F_1(τ) = ∫_{[0,τ]} r dF_0(r) for τ ∈ [0, ∞). Conversely, F_1 determines F_0 by F_0^c(τ) = ∫_{(τ,∞)} (1/r) dF_1(r) for τ ∈ [0, ∞). So either one of F_0 or F_1 determines the other, and hence also determines ROC as described above. To complete the proof it suffices to show that ROC determines F_0. The function ROC is concave so it has a right-hand derivative, which we denote by ROC′. Then F_0^c(τ) = sup{p : ROC′(p) > τ} for τ ∈ [0, ∞). □

Proof of Lemma 1: Let the right-hand side of (1) be denoted by ε. Note that

ε = sup_{τ∈(0,∞), η∈[0,1]} max{|F_{a,0}^c(τ, η) − F_{b,0}^c(τ, η)|, |F_{a,1}^c(τ, η) − F_{b,1}^c(τ, η)|}, (8)

because for τ fixed, the quantity inside the supremum in (8) is a convex function of η, so its maximum over η ∈ [0, 1] is attained at η = 0 or η = 1, where it equals the quantity inside the supremum on the right-hand side of (1) at τ− and at τ, respectively. We appeal to the geometric interpretation of L(A, B). Consider any point (p, B(p)) on the graph of B. It is equal to (F_{b,0}^c(τ, η), F_{b,1}^c(τ, η)) for some choice of (τ, η). Let (p′, A(p′)) denote the point on the graph of A for the same choice of (τ, η). In other words, it is the point (F_{a,0}^c(τ, η), F_{a,1}^c(τ, η)). Then (p, B(p)) can be reached from (p′, A(p′)) by moving horizontally at most ε and moving vertically at most ε. So (p, B(p)) is contained in the region bounded between the upper and lower shifts of the graph of A, as claimed. □
APPENDIX C
DERIVATION OF ROC_ML

Proposition 3 and its corollary are proved in this section.

Proof of Proposition 3: Given the binary sequence (I_i : i ∈ [n]) and the likelihood ratio samples R_1, ..., R_n, let 0 = v_0 < v_1 < v_2 < ··· < v_m < v_{m+1} = ∞ be the set of unique values of the samples, augmented by v_0 = 0 and v_{m+1} = ∞ even if 0 and/or ∞ is not among the observed samples. Let (c_0^0, c_1^0, c_2^0, ..., c_m^0) denote the multiplicities of the values from among (R_i : I_i = 0) and let (c_1^1, c_2^1, ..., c_m^1, c_{m+1}^1) denote the multiplicities of the values from among (R_i : I_i = 1).

Let a_j = F_0({v_j}) for 0 ≤ j ≤ m and let b = F_1({∞}). Thus a_j is the probability mass at v_j under hypothesis H_0 for 0 ≤ j ≤ m. The corresponding probability mass at v_j under hypothesis H_1 is a_j v_j for 0 ≤ j ≤ m, and the probability mass at v_{m+1} under hypothesis H_1 is b.

The log-likelihood to be maximized is given by

Σ_{j=0}^m c_j^0 log a_j + Σ_{j=1}^m c_j^1 log(a_j v_j) + c_{m+1}^1 log b,

where 0 log 0 is understood as 0 and log 0 is understood as negative infinity. Equivalently, dropping the term Σ_{j=1}^m c_j^1 log(v_j), which does not depend on F_0 (or F_1 or ROC), the ML estimator is to maximize

Σ_{j=0}^m c_j log a_j + c_{m+1} log b,

where c_0 ≜ c_0^0, c_{m+1} ≜ c_{m+1}^1, and c_j ≜ c_j^0 + c_j^1 for 1 ≤ j ≤ m. In other words, c_j is the total multiplicity of v_j in all samples regardless of the hypothesis.

For any choice of F_0 (or F_1 or ROC), the probabilities satisfy the constraint:

Σ_{j=0}^m a_j ≤ 1 and Σ_{j=1}^m a_j v_j + b ≤ 1. (9)

The inequalities in (9) both hold with equality if the distribution F_0 (or equivalently F_1) assigns probability one to the set {v_0, ..., v_{m+1}}. Otherwise, both inequalities are strict. We claim and now prove that any ML estimator is such that both inequalities in (9) hold with equality. It is true in the degenerate special case that R_i ∈ {0, ∞} for all i, in which case an ML estimator is given by ROC(p) ≡ 1, F_0(0) = 1 and F_1({∞}) = 1. So we can assume m ≥ 1 and there is a value j_0 (for example, j_0 = 1) such that 1 ≤ j_0 ≤ m. If F_0 does not assign probability one to {v_0, ..., v_{m+1}} then the same is true for F_1, so that strict inequality must hold in both constraints in (9). Then the probability mass from F_0 (and F_1) that is not on the set {v_0, ..., v_{m+1}} can be removed, and mass can be added to F_0 at 0 and v_{j_0} and to F_1 at v_{j_0} and ∞ such that both constraints in (9) hold with equality and the likelihood is strictly increased. This completes the proof of the claim.

Therefore, any ML estimator is such that the distributions are supported on the set {v_0, ..., v_{m+1}}, and the probabilities assigned to the points give an ML estimator if and only if they are solutions to the following convex optimization problem:

max_{a≥0, b≥0} Σ_{j=0}^m c_j log a_j + c_{m+1} log b (10)
s.t. Σ_{j=0}^m a_j = 1 and Σ_{j=1}^m a_j v_j + b = 1.

The relaxed Slater constraint qualification condition is satisfied for (10), so there exists a solution and dual variables satisfying the KKT conditions (see Theorem 3.2.4 in [9]). The Lagrangian is

L(a, b, λ, μ) = Σ_{j=0}^m c_j log a_j + c_{m+1} log b − (Σ_{j=0}^m a_j − 1)λ − (Σ_{j=1}^m a_j v_j + b − 1)μ.

The KKT conditions on (a, b, λ, μ) are

a ≥ 0, b ≥ 0; Σ_{j=0}^m a_j = 1; Σ_{j=1}^m a_j v_j + b = 1;
∂L/∂a_0 ≤ 0; a_0 · (∂L/∂a_0) = 0; ∂L/∂a_j = 0 for j ∈ [m];
∂L/∂b ≤ 0; b · (∂L/∂b) = 0,

where

∂L/∂a_0 (a, b, λ, μ) = c_0/a_0 − λ if c_0 > 0, and −λ if c_0 = 0;
∂L/∂a_j (a, b, λ, μ) = c_j/a_j − λ − v_j μ for j ∈ [m];
∂L/∂b (a, b, λ, μ) = c_{m+1}/b − μ if c_{m+1} > 0, and −μ if c_{m+1} = 0.

Solving the KKT conditions yields:

1) If c_{m+1} = 0 and Σ_{j=1}^m v_j c_j ≤ Σ_{j=0}^m c_j, then

â_j = c_j / Σ_{k=0}^m c_k for 0 ≤ j ≤ m; b̂ = 1 − (Σ_{j=1}^m v_j c_j)/(Σ_{j=0}^m c_j); λ̂ = Σ_{j=0}^m c_j; μ̂ = 0.

2) Otherwise, if c_0 = 0 and Σ_{j=1}^m c_j/v_j ≤ Σ_{j=1}^{m+1} c_j, then

â_j = c_j/(v_j μ̂) for 1 ≤ j ≤ m; â_0 = 1 − (Σ_{j=1}^m c_j/v_j)/(Σ_{j=1}^{m+1} c_j); b̂ = c_{m+1}/μ̂; λ̂ = 0; μ̂ = Σ_{k=1}^{m+1} c_k.

3) Otherwise, λ̂ > 0, μ̂ > 0 are determined by solving

Σ_{j=0}^m c_j/(λ + v_j μ) = 1, (11)
Σ_{j=1}^m c_j/(λ/v_j + μ) + c_{m+1}/μ = 1, (12)

and for 0 ≤ j ≤ m,

â_j = c_j/(λ̂ + v_j μ̂), b̂ = c_{m+1}/μ̂.

Multiplying both sides of (11) by λ and both sides of (12) by μ and adding the respective sides of the two equations obtained yields λ + μ = Σ_{j=0}^{m+1} c_j = n. The above conditions can be expressed in terms of the variables R_i, and then replacing μ by n − nλ_n and λ by nλ_n yields the proposition. □

Proof of Corollary 1: Corollary 1 is deduced from Proposition 3 as follows. If R_i = 1 for 1 ≤ i ≤ n then the corollary gives that both F̂_0 and F̂_1 have all their mass at r = 1, in agreement with Proposition 3. So for the remainder of the proof suppose R_i ≠ 1 for some i.

Consider the three cases of Proposition 3. If case 1) holds then ϕ_n(1−) = 1 and ϕ_n′(1) = (1/n) Σ_{i=1}^n (R_i − 1) ≤ 0. Also, R_i < ∞ for 1 ≤ i ≤ n. Since R_i ∉ {1, ∞} for at least one value of i, ϕ_n is strictly convex over [0, 1]. Therefore, ϕ_n(λ) > 1 for λ ∈ [0, 1). Thus, λ_n defined in the corollary is given by λ_n = 1, and the corollary agrees with Proposition 3.

If case 2) holds then ϕ_n(0) ≤ 1. Thus, λ_n defined in the corollary is given by λ_n = 0, and the corollary agrees with Proposition 3.

If neither case 1) nor case 2) holds, then λ_n in the corollary is the same as λ_n in Proposition 3, and the corollary again agrees with Proposition 3. □
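A quick numerical sanity check (ours, not from the paper) of case 3): with λ = nλ_n and μ = n(1 − nλ_n)/n = n(1 − λ_n), both (11) and (12) should evaluate to 1. It reuses the lambda_n helper from the Section IV sketch:

    import numpy as np

    # Distinct finite samples, so c_j = 1 and v_j = R_j, with c_0 = c_{m+1} = 0;
    # these values satisfy mean(R) > 1 and mean(1/R) > 1, so case 3) applies.
    r = np.array([0.2, 0.5, 0.8, 1.0, 1.5, 2.5, 4.0])
    lam_n = lambda_n(r)
    lam, mu = len(r) * lam_n, len(r) * (1.0 - lam_n)
    print(np.sum(1.0 / (lam + r * mu)))   # left side of (11); expect 1.0
    print(np.sum(1.0 / (lam / r + mu)))   # left side of (12); expect 1.0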
APPENDIX D
FROM POINTWISE TO UNIFORM CONVERGENCE OF CDFS

The following basic lemma shows that uniform convergence of a sequence (F_n : n ≥ 1) of CDFs to a fixed limit is equivalent to pointwise convergence of both the sequence and the corresponding sequence of left limit functions, at each of a suitable countably infinite set of points. The CDFs in this section may correspond to probability distributions with positive mass at −∞ and/or ∞.

Lemma 2 (Finite net lemma for CDFs): Given a CDF F and any integer L ≥ 1, there exist c_1, ..., c_{L−1} ∈ R ∪ {−∞, ∞} such that for any CDF G, sup_{c∈R} |F(c) − G(c)| ≤ δ + 1/L, where

δ = max_{1≤ℓ≤L−1} max{|F(c_ℓ) − G(c_ℓ)|, |F(c_ℓ−) − G(c_ℓ−)|}.

Proof: Let c_ℓ = min{c ∈ R ∪ {−∞, ∞} : F(c) ≥ ℓ/L} for 1 ≤ ℓ ≤ L − 1. Also, let c_0 = −∞ and c_L = ∞. The fact F(c_{ℓ+1}−) − F(c_ℓ) ≤ 1/L for 0 ≤ ℓ ≤ L − 1 and the monotonicity of F and G imply the following. For 0 ≤ ℓ ≤ L − 1 and c ∈ (c_ℓ, c_{ℓ+1}),

G(c) ≥ G(c_ℓ) ≥ F(c_ℓ) − δ ≥ F(c) − δ − 1/L

and similarly

G(c) ≤ G(c_{ℓ+1}−) ≤ F(c_{ℓ+1}−) + δ ≤ F(c) + δ + 1/L.

Since R ⊂ {c_1, ..., c_{L−1}} ∪ (∪_{ℓ=0}^{L−1} (c_ℓ, c_{ℓ+1})), it follows that |F(c) − G(c)| ≤ δ + 1/L for all c ∈ R, as was to be proved. □

Corollary 2: If F is a CDF, there is a countable sequence (c_ℓ : ℓ ≥ 1) such that, for any sequence of CDFs (F_n : n ≥ 1), sup_{c∈R} |F(c) − F_n(c)| → 0 if and only if F_n(c_ℓ) → F(c_ℓ) and F_n(c_ℓ−) → F(c_ℓ−) as n → ∞ for all ℓ ≥ 1.

Proof: Given F, let (L_j : j ≥ 1) be a sequence of integers converging to ∞. For each j, Lemma 2 implies the existence of L_j − 1 values c_ℓ with a specified property. Let the infinite sequence (c_ℓ : ℓ ≥ 1) be obtained by concatenating those finite sequences. □

Alternatively, Corollary 2 is a consequence of Pólya's theorem, which states that uniform convergence of CDFs is equivalent to pointwise convergence on the union of a dense subset of R and {−∞, ∞} (see, e.g., [10]).
APPENDIX E
PROOF OF CONSISTENCY OF ML ESTIMATOR

The proof of Proposition 4 is given in this section after some preliminary results. Define ϕ(λ) for 0 ≤ λ ≤ 1 by

ϕ(λ) ≜ E[1/(λ + (1 − λ)R)] if 0 ≤ λ < 1, and ϕ(1) ≜ 1. (13)

For any fixed λ ∈ [0, 1], ϕ_n(λ) is the average of n independent random variables with mean ϕ(λ), so by the law of large numbers, ϕ_n(λ) → ϕ(λ) a.s. as n → ∞. Note that ϕ is finite over (0, 1], continuous over [0, 1), and convex over [0, 1].

Lemma 3: If F_0 is not identical to F_1, exactly one of the following happens:
1) ϕ(1−) < 1 and ϕ is convex;
2) ϕ(1−) = 1 and ϕ is strictly convex.

Proof: Note that ϕ(λ) ≤ sup_{r∈[0,∞]} 1/(λ + (1 − λ)r) = 1/λ for λ ∈ (0, 1], so ϕ(1−) ≤ 1. The function ϕ is convex because it is the expectation of a convex function. If ϕ(1−) = 1, then P{R_i = ∞} = 0, and since it is also assumed that F_0 is not identical to F_1, P{R_i ∉ {1, ∞}} > 0. Hence, the function in the expectation defining ϕ is strictly convex with positive probability, so ϕ is strictly convex. □

Lemma 4: Suppose F_0 is not identical to F_1 and let λ_n be defined as in Corollary 1. Then λ_n → α a.s. as n → ∞.

Proof: Suppose α = 0. Then

ϕ(0) = E_1[1/R] = ∫_{(0,∞)} (1/r) dF_1(r) = ∫_{(0,∞)} dF_0(r) = 1 − F_0(0) ≤ 1.

If ϕ(0) < 1, then, since ϕ_n(0) → ϕ(0) a.s. as n → ∞, ϕ_n(0) < 1 for all sufficiently large n. So λ_n = 0 = α for all sufficiently large n, with probability one.

If ϕ(0) = 1, then by Lemma 3 it follows that ϕ(λ) < 1 for λ ∈ (0, 1). So for any such λ fixed, ϕ_n(λ) < 1 for all sufficiently large n with probability one. Thus, for any fixed λ ∈ (0, 1), λ_n ≤ λ for all large n with probability one, so λ_n → 0 a.s. as n → ∞. This implies the lemma for α = 0.

Suppose α ∈ (0, 1). Note that

ϕ(α) = ∫_{[0,∞)} [1/(α + (1 − α)r)] (α + (1 − α)r) dF_0(r) = 1.

Therefore, Lemma 3 implies that, for any ε > 0 such that α + ε < 1 and α − ε ≥ 0, it holds that ϕ(α + ε) < 1 and ϕ(α − ε) > 1. Therefore, with probability one, ϕ_n(α + ε) < 1 and ϕ_n(α − ε) > 1 for all sufficiently large n, and therefore |λ_n − α| < ε for all sufficiently large n with probability one. This implies the lemma for α ∈ (0, 1).

Suppose α = 1. Since P{R_i < ∞} = P_0{R_i < ∞} = 1, it holds that ϕ(1−) = 1, so by Lemma 3, ϕ is strictly convex. Furthermore, ϕ′(1) = E_0[R] − 1 ≤ 0. Therefore, ϕ(λ) > 1 for λ ∈ [0, 1). Thus, for any fixed λ ∈ [0, 1), ϕ_n(λ) > 1 for all sufficiently large n, with probability one. This implies the lemma for α = 1, as needed. The proof of the lemma is complete. □

Define cumulative distribution functions Ĝ_0 and Ĝ_1 by

Ĝ_0^c(τ) = min{ (1/n) Σ_{i=1}^n [1/(α + (1 − α)R_i)] I{R_i > τ}, 1 },
Ĝ_1(τ) = min{ (1/n) Σ_{i=1}^n [R_i/(α + (1 − α)R_i)] I{R_i ≤ τ}, 1 }

for τ ∈ [0, ∞).

Lemma 5: As n → ∞,

sup_{τ∈[0,∞)} |F̂_k(τ) − Ĝ_k(τ)| → 0 a.s. for k ∈ {0, 1}. (14)

Proof: The following conditions are equivalent: F_0 is identical to F_1; F_0({1}) = 1; F_1({1}) = 1; P{R = 1} = 1. If any of these conditions hold then R_i = 1 for all i with probability one, so by Corollary 1, F̂_0({1}) = F̂_1({1}) = 1. Also, Ĝ_0({1}) = Ĝ_1({1}) = 1. So the lemma is true if F_0 is identical to F_1. For the remainder of the proof suppose F_0 is not identical to F_1, which by Lemma 4 implies that λ_n → α a.s. as n → ∞.

If 0 < α < 1, the convergence (14) follows immediately from the fact (based on λ_n → α) that, as n → ∞, the function r ↦ 1/(λ_n + (1 − λ_n)r) converges uniformly over all r ∈ [0, ∞] to 1/(α + (1 − α)r), and the function r ↦ r/(λ_n + (1 − λ_n)r) converges uniformly over all r ∈ [0, ∞] to r/(α + (1 − α)r).

The proof of (14) in case α = 0 or α = 1 is more subtle. Here we give the proof for α = 0 and k = 0. The other three possibilities for α and k follow in the same way. So consider the case α = 0. The random variables R_1, R_2, ... are independent and all have CDF F_1, and

Ĝ_0^c(τ) = min{ (1/n) Σ_{i=1}^n (1/R_i) I{R_i > τ}, 1 } (for α = 0).

Fix an arbitrary δ > 0 and let ε > 0 be so small that ε < 1 and 2(F_0(ε) − F_0(0)) < δ. For any CDF F and τ ∈ [0, ε], F^c(τ) = (F^c(τ) − F^c(ε)) + F^c(ε) and |F^c(τ) − F^c(ε)| ≤ F^c(0) − F^c(ε). Also note that (since ε < 1) F̂_0^c(0) − F̂_0^c(ε) ≤ Ĝ_0^c(0) − Ĝ_0^c(ε). Therefore, for τ ∈ [0, ε],

|F̂_0^c(τ) − Ĝ_0^c(τ)| ≤ |F̂_0^c(ε) − Ĝ_0^c(ε)| + 2|Ĝ_0^c(0) − Ĝ_0^c(ε)|.

Thus,

sup_{τ∈[0,∞)} |F̂_0^c(τ) − Ĝ_0^c(τ)| ≤ sup_{τ∈[ε,∞)} |F̂_0^c(τ) − Ĝ_0^c(τ)| + 2|Ĝ_0^c(0) − Ĝ_0^c(ε)|. (15)

Since λ_n → 0 with probability one, the function r ↦ 1/(λ_n + (1 − λ_n)r) converges uniformly over all r ∈ [ε, ∞] to 1/r. It follows that the supremum term on the right side of (15) converges to zero a.s. as n → ∞. Since

Ĝ_0^c(0) − Ĝ_0^c(ε) ≤ (1/n) Σ_{i=1}^n (1/R_i) I{0 < R_i ≤ ε}

and

E[(1/R_i) I{0 < R_i ≤ ε}] = ∫_{(0,ε]} (1/r) dF_1(r) = ∫_{(0,ε]} dF_0(r) = F_0(ε) − F_0(0),

the law of large numbers implies

limsup_{n→∞} 2|Ĝ_0^c(0) − Ĝ_0^c(ε)| ≤ 2(F_0(ε) − F_0(0)) < δ

with probability one. So sup_{τ∈[0,∞)} |F̂_0^c(τ) − Ĝ_0^c(τ)| ≤ δ for all sufficiently large n, with probability one. Since δ > 0 was selected arbitrarily, this completes the proof of (14) for k = 0 in case α = 0, and hence the proof of Lemma 5 overall. □

Lemma 6: As n → ∞,

sup_{τ∈[0,∞)} |Ĝ_k(τ) − F_k(τ)| → 0 a.s. for k ∈ {0, 1}. (16)

Proof: Note that

E[ (1/(α + (1 − α)R_i)) I{R_i > τ} ] = ∫_{(τ,∞)} [1/(α + (1 − α)r)] (α + (1 − α)r) dF_0(r) = F_0^c(τ)

and

E[ (R_i/(α + (1 − α)R_i)) I{R_i ≤ τ} ] = ∫_{(0,τ]} [r/(α + (1 − α)r)] (α/r + (1 − α)) dF_1(r) = F_1(τ).

Hence, by the law of large numbers, for any fixed τ ∈ [0, ∞), Ĝ_k(τ) → F_k(τ) with probability one as n → ∞, for k ∈ {0, 1}. It can similarly be shown that Ĝ_k(τ−) → F_k(τ−) with probability one as n → ∞, for k ∈ {0, 1}, for each τ fixed. Pointwise convergence of CDFs and their corresponding left limits implies uniform convergence (see Appendix D), implying (16). □

Proof of Proposition 4: Lemmas 5 and 6 and the triangle inequality:

|F̂_k(τ) − F_k(τ)| ≤ |F̂_k(τ) − Ĝ_k(τ)| + |Ĝ_k(τ) − F_k(τ)|,

imply that as n → ∞,

sup_{τ∈[0,∞)} |F̂_k(τ) − F_k(τ)| → 0 a.s. for k ∈ {0, 1}.

Application of Lemma 1 completes the proof. □
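The key step λ_n → α is easy to illustrate empirically (our sketch, not from the paper): draw iid samples from the mixture αF_0 + (1 − α)F_1 in the binormal setting of Section VII and watch λ_n, computed with the lambda_n helper from the Section IV sketch, approach α:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.3
    for n in [100, 1000, 10000, 100000]:
        under_h0 = rng.random(n) < alpha         # mixture label for each i
        x = rng.normal(loc=np.where(under_h0, 0.0, 1.0), scale=1.0)
        r = np.exp(x - 0.5)                      # likelihood ratio samples
        print(n, lambda_n(r))                    # should approach alpha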
APPENDIX F
DERIVATION OF EXPRESSIONS FOR AUC AND AUC_ML

Proof of Proposition 5: (Proof of 1) Let R_[1] ≤ R_[2] ≤ ··· ≤ R_[n] denote a reordering of the samples R_1, ..., R_n. Then the region under ROC_ML can be partitioned into a union of trapezoidal regions, such that there is one trapezoid for each R_[i] such that R_[i] < ∞. The trapezoids are numbered from right to left. If a value v_j ∈ (0, ∞) is taken on by c_j of the samples, then the union of the trapezoidal regions corresponding to those samples is also a trapezoidal region. The area of the i-th trapezoidal region is the width of the base times the average of the lengths of the two sides. The width of the base is (1/n) · 1/(λ_n + (1 − λ_n)R_[i]), corresponding to a term in F̂_0^c. The length of the left side is (1/n) · Σ_{i′: i′>i} R_[i′]/(λ_n + (1 − λ_n)R_[i′]), and the length of the right side is greater than the length of the left side by (1/n) · R_[i]/(λ_n + (1 − λ_n)R_[i]). Summing the areas of the trapezoids yields:

AUC_ML = (1/n²) Σ_{i=1}^n [1/(λ_n + (1 − λ_n)R_[i])] ( Σ_{i′=i+1}^n R_[i′]/(λ_n + (1 − λ_n)R_[i′]) + (1/2) R_[i]/(λ_n + (1 − λ_n)R_[i]) ),

which is equivalent to the expression given in 1) of the proposition.

(Proof of 2) The consistency of AUC_ML follows from Proposition 4, the consistency of ROC_ML.

(Proof of 3) Let τ(p) and η(p) denote values τ(p) ∈ [0, ∞) and η(p) ∈ [0, 1] such that F_0^c(τ(p), η(p)) = p. Then

AUC = ∫_0^1 ROC(p) dp = ∫_0^1 F_1^c(τ(p), η(p)) dp
    = ∫_0^1 ( η(p) F_1^c(τ(p)) + (1 − η(p)) F_1^c(τ(p)−) ) dp
 (a) = ∫_0^1 [F_1^c(τ(p)) + F_1^c(τ(p)−)]/2 dp
 (b) = E_0[ (F_1^c(R) + F_1^c(R−))/2 ]
    = E_0[ ( ∫_{(R,∞)} r′ dF_0(r′) + ∫_{[R,∞)} r′ dF_0(r′) )/2 ] + F_1({∞})
    = E_0[ R′ ( I{R′ > R} + (1/2) I{R′ = R} ) ] + F_1({∞})
    = (1/2) E_0[max{R, R′}] + F_1({∞})
    = (1/2) E_0[max{R, R′}] + 1 − E_0[R]
    = 1 − (1/2) E_0[R + R′ − max{R, R′}]
    = 1 − (1/2) E_0[min{R, R′}],

where (a) follows from the fact that ROC(p) is affine over the maximal intervals of p such that τ(p) is constant, so the integral is the same if ROC(p) is replaced over each such interval by its average over the interval, and (b) follows from the fact that if U is a random variable uniformly distributed on the interval (0, 1), then the CDF of τ(U) is F_0, because for any c ≥ 0, P{τ(U) > c} = P{U ≤ F_0^c(c)} = F_0^c(c). This establishes (6) and (7).

(Proof of 4) This follows from (6) and the fact that the CDF of R and R′ satisfies dF(r) = (α + (1 − α)r) dF_0(r) over [0, ∞) and F({∞}) = (1 − α)F_1({∞}). □