Abstract
We consider a variant of the pure exploration problem in Multi-Armed Bandits, where the goal is to find the arm for which the \(\lambda \)-quantile is maximal. Within the PAC framework, we provide a lower bound on the sample complexity of any \((\epsilon ,\delta )\)-correct algorithm, and propose algorithms with matching upper bounds. Our bounds sharpen existing ones by explicitly incorporating the quantile factor \(\lambda \). We further provide experiments that compare the sample complexity of our algorithms with that of previous works.
1 Introduction
In the classical multi-armed bandit (MAB) problem, the learning agent faces a set \(K\) of stochastic arms, from which it chooses arms sequentially. In each round, the agent observes a random reward that depends on the selected arm. The goal of the agent is to maximize the cumulative reward (in the regret formulation), or to identify the arm with the highest expected reward (in the pure exploration problem). The MAB model has been studied extensively in the statistical and learning literature; see [2] for a comprehensive survey.
In this paper, we consider a quantile-based variant of the pure exploration MAB problem (quantile-MAB). In this variant, for a given \(0<\lambda <1\), the goal is to identify the arm for which the \(\lambda \)-quantile is the largest among all arms (here, as usual, the \(\lambda \)-quantile is such that the probability of observing a larger reward is at least \(\lambda \)). More precisely, considering the PAC framework, the goal is to identify an \((\epsilon ,\delta )\)-correct arm, namely an arm for which the \((\lambda -\epsilon )\)-quantile is not smaller than the largest \(\lambda \)-quantile among all arms, with a probability larger than \(1-\delta \). In addition, we wish to minimize the sample complexity, i.e., the expected number of samples observed until the learning algorithm terminates.
For the standard MAB problem, algorithms that find the best arm (in terms of its expected reward) in the PAC sense were presented in [1, 5–8, 10], and lower bounds on the sample complexity were presented in [1, 9, 11].
Similar to the present quantile-MAB problem is the variant of the MAB problem in which the goal is to find the arm from which the largest possible sample can be obtained. This is known as the max k-armed bandit problem, and was first introduced in [3]. For this variant, algorithms that find the best arm in the PAC sense were provided in [4, 13], and a lower bound was presented in [4]. In contrast to the current quantile-MAB problem, in the max k-armed setting, it is necessary to assume a lower bound on the tail probabilities of the arms. When the tail functions of the arms are known, and \(\epsilon =\lambda \), the algorithms for the max k-armed bandit setting can be applied in the present quantile-MAB problem. However, their sample complexity upper bounds are larger than those of the algorithms presented in this paper.
More closely related to the present quantile-MAB problem is the work in [15], which considers a measure of risk called the value-at-risk (see [12]). The value-at-risk of a given random variable (R.V.) \(\varvec{X}\) is in fact the same as the quantile of the R.V. \(\varvec{-X}\). An algorithm with a sample complexity upper bound that grows as \(\frac{\lambda |K|}{\epsilon ^2\delta D}\) (where D is the upper bound on the density functions) was provided in [15]; however, that algorithm is computationally demanding, since at each iteration it solves a non-linear, constrained, integer-valued optimization problem. Recently, the quantile-MAB problem was studied in [14], which provided a lower bound for the case \(\lambda =3/4\) and an algorithm with a sample complexity upper bound of the order of \(\sum _{k\in K}\frac{1}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\ln (\frac{|K|}{\delta \max \left( \epsilon ,\varDelta _{k,\lambda }\right) })\), where \(\varDelta _{k,\lambda }\) is the difference between the \(\lambda \)-quantile of arm k and that of the best arm.
In this paper, for certain arm distributions, we provide a lower bound of the order of \(\sum _{k\in K}\frac{\lambda (1-\lambda )\ln (\frac{1}{\delta })}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\) on the sample complexity of every \((\epsilon ,\delta )\)-correct algorithm. This lower bound improves on the bound in [14] by incorporating the quantile factor \(\lambda (1-\lambda )\), which is significant when \(\lambda \) is close to 0 or to 1. Furthermore, for general distribution functions, we provide two algorithms that attain the lower bound up to the logarithmic terms \(\ln \left( |K|/\epsilon \right) \) and \(\ln \left( |K|\log _2(1/\epsilon )\right) \), respectively. The upper bounds of these algorithms are smaller than that of [14] by a factor of \(\lambda \), and, for the second algorithm, by an additional logarithmic factor in \(\epsilon \).
The paper proceeds as follows. In the next section we present our model. In Sect. 3, a lower bound on the sample complexity of every \((\epsilon ,\delta )\)-correct algorithm is presented. Then, in Sect. 4, we present our \((\epsilon ,\delta )\)-correct algorithms and provide upper bounds on their sample complexity; the second algorithm is based on applying the doubling trick to the first one. In Sect. 5 we provide experiments that illustrate the improved sample complexity of our algorithms compared with the results presented in [14]. In Sect. 6 we close the paper with some concluding remarks.
2 Model Definition
We consider a finite set of arms, denoted by \(K\). At each stage \(t=1,2,\dots \) the learning agent chooses an arm \(k\in K\), and a real valued reward is obtained from that arm. The rewards obtained from each arm k are independent and identically distributed, with a distribution function (CDF) \(F_{k}(x)\), \(x\in \mathbb {R}\). We denote the quantile function of arm \(k\in K\) by \(Q_k:[0,1]\rightarrow \mathbb {R}\), and define it as follows.
Definition 1
For every arm \(k\in K\), the quantile function \(Q_k(\lambda )\) is defined by
$$\begin{aligned} Q_{k}(\lambda )\triangleq \sup \left\{ x\in \mathbb {R}\,:\,P\left( \varvec{x}_k\ge x\right) \ge \lambda \right\} \, . \end{aligned}$$
Note that \(P\left( \varvec{x}_k\ge Q_k(\lambda ) \right) \ge \lambda \), where \(\varvec{x}_k\) stands for a random variable with distribution \(F_k\). Clearly, if \(F_k\) is continuous at the point \(Q_k(\lambda )\), we have equality, namely, \(P\left( \varvec{x}_k\ge Q_k(\lambda ) \right) = \lambda \).
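To make this definition concrete, the following is a minimal sketch (ours, not the paper's) of the natural empirical counterpart of \(Q_k(\lambda )\) under the tail convention above: the m-th largest of n samples with \(m=\lceil \lambda n\rceil \), so that the empirical tail probability at the returned value is at least \(\lambda \).

```python
import math
import random

def empirical_tail_quantile(samples, lam):
    """Empirical counterpart of Q(lam) under the tail convention above:
    the m-th largest of n samples with m = ceil(lam * n), so that the
    empirical tail probability at the returned value is at least lam."""
    xs = sorted(samples)              # ascending order
    n = len(xs)
    m = max(1, math.ceil(lam * n))    # index of the m-th largest sample
    return xs[n - m]

# For uniform samples on (0, 1), Q(lam) = 1 - lam; e.g., ~0.75 for lam = 0.25.
xs = [random.random() for _ in range(100_000)]
print(empirical_tail_quantile(xs, 0.25))
```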
An algorithm for the quantile-MAB problem samples an arm at each time step, based on the observed history so far (i.e., the previously selected arms and observed rewards). We require the algorithm to terminate after a random number T of samples, which is finite with probability 1, and return an arm \(k'\). An algorithm is said to be \((\epsilon ,\delta )\)-correct if the returned arm is \(\epsilon \)-optimal with a probability larger than \(1-\delta \) (see the precise definition later in this section). The expected number of samples E[T] taken by the algorithm is the sample complexity, which we wish to minimize.
We next provide some definitions and notations which we use later in this paper. A \(\lambda \)-quantile optimal arm is defined as follows.
Definition 2
Arm \(k\in K\) is \(\lambda \)-quantile optimal if
$$\begin{aligned} Q_{k}(\lambda )=\max _{k'\in K}Q_{k'}(\lambda )\triangleq x^*_{\lambda }\, . \end{aligned}$$
We use the following quantity, which represents the distance of an arm from being optimal:
$$\begin{aligned} \varDelta _{k,\lambda }\triangleq \lambda -P\left( \varvec{x}_k\ge x^*_{\lambda }\right) \, . \end{aligned}$$ (1)
If \(F_k\) is continuous, then \(\varDelta _{k,\lambda }=F_k(x^*_{\lambda })-(1-\lambda )\). Furthermore, note that for every suboptimal arm k, namely, an arm for which \(Q_k(\lambda )<x^*_{\lambda }\), it follows by the monotonicity of CDF functions that \(\varDelta _{k,\lambda }>0\).
Now we are ready to precisely define an \((\epsilon ,\delta )\)-correct algorithm.
Definition 3
For \(\lambda \) and \(\epsilon \) such that \(0<\epsilon<\lambda <1\) and \(\delta >0\), an algorithm is \((\epsilon ,\delta )\)-correct if
$$\begin{aligned} P\left( Q_{k'}(\lambda -\epsilon )\ge x^*_{\lambda }\right) >1-\delta \, , \end{aligned}$$
where \(k'\) stands for the arm returned by the algorithm.
3 Lower Bound
Before presenting our algorithms, we provide a lower bound on the sample complexity of any \((\epsilon ,\delta )\)-correct algorithm for certain arm distributions. The lower bound is stated in the following theorem.
Theorem 1
Assume that \(F_k\) is continuous for every \(k\in K\). Fix some \(\epsilon _0\) such that \(0<\epsilon _0<\frac{1}{4}\). For every \(\lambda \in [2\epsilon _0,1-2\epsilon _0]\), \(\epsilon \in (0,\epsilon _0]\) and \(\delta \le 0.15\), there exists a set of arm distributions \(\{F_k\}_{k\in K}\) such that for every \((\epsilon ,\delta )\)-correct algorithm,
where \(k^*\) denotes some optimal arm, with \(Q_{k^*}(\lambda )=x^*_{\lambda }\).
The above lower bound refines the one presented in [14] in that it accounts for the quantile level \(\lambda \) in the bound. To illustrate the lower bound, we provide an example.
Example 1
Let \(\{\mu _k\}_{k\in K}\) be a set of constants, and let \(\mu ^*=\max _{k\in K}\mu _{k}\). Suppose that the rewards of each arm \(k\in K\) are uniformly distributed on the interval \((\mu _k-1,\mu _k)\). Since \(\mu _{k}-x^*_{\lambda }\le 1\) for every arm \(k\in K\), it follows that
$$\begin{aligned} P\left( \varvec{x}_k\ge x^*_{\lambda }\right) =\max \left( \mu _k-x^*_{\lambda },\,0\right) \, . \end{aligned}$$
As \(x^*_{\lambda }=\mu ^*-\lambda \), Eq. (1) implies that
$$\begin{aligned} \varDelta _{k,\lambda }=\lambda -\max \left( \mu _k-\mu ^*+\lambda ,\,0\right) =\min \left( \mu ^*-\mu _k,\,\lambda \right) \, . \end{aligned}$$
Since \(\epsilon <\lambda \), the denominator term in Eq. (2) can be seen to be
$$\begin{aligned} \max \left( \epsilon ,\varDelta _{k,\lambda }\right) =\min \left( \max \left( \epsilon ,\mu ^*-\mu _k\right) ,\,\lambda \right) \, . \end{aligned}$$
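A quick numerical check of Example 1 (a sketch of ours; the arm locations are illustrative): for uniform arms on \((\mu _k-1,\mu _k)\), the \(\lambda \)-quantile is \(\mu _k-\lambda \) and the gap is \(\varDelta _{k,\lambda }=\min (\mu ^*-\mu _k,\lambda )\), which a Monte Carlo estimate recovers.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 0.3, 500_000
mus = np.array([0.0, -0.1, -0.5])         # illustrative arm locations
x_star = mus.max() - lam                  # x*_lam = mu* - lam

for mu in mus:
    x = rng.uniform(mu - 1.0, mu, size=n)
    q_hat = np.quantile(x, 1.0 - lam)     # tail quantile: P(X >= q) = lam
    gap_hat = lam - np.mean(x >= x_star)  # Delta = lam - P(X >= x*), Eq. (1)
    print(f"mu={mu:+.1f}: Q_hat={q_hat:+.3f} (exact {mu - lam:+.3f}), "
          f"Delta_hat={gap_hat:.3f} (exact {min(mus.max() - mu, lam):.3f})")
```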
Proof
(Theorem 1). First, we assume that the quantile value of the optimal arm, namely \(x^*_{\lambda }\), is known. Moreover, we assume that for every arm \(k\in K\), the conditional probabilities \(P\left( \varvec{x}_k| \varvec{x}_k\ge x^*_{\lambda } \right) \) and \(P\left( \varvec{x}_k| \varvec{x}_k< x^*_{\lambda } \right) \) are also known. Therefore, the learning algorithm needs only to estimate the parameters
$$\begin{aligned} p_k\triangleq P\left( \varvec{x}_k\ge x^*_{\lambda }\right) \, ,\quad k\in K\, . \end{aligned}$$
Now, by the continuity of the distribution functions it follows that \(\max _{k\in K}p_k=\lambda \). Also, by Eq. (1) it follows that
$$\begin{aligned} \varDelta _{k,\lambda }=\lambda -p_{k}\, . \end{aligned}$$
Therefore, finding an arm \(k'\) such that \(\varDelta _{k',\lambda }\le \epsilon \) is the same as finding a Bernoulli arm \(k'\), such that its expected value is \(\epsilon \)-optimal, namely, \(\max _{k\in K}p_k-p_{k'}\le \epsilon \). So, our problem is the same as the standard Bernoulli bandit problem with \(\{p_k\}_{k\in K}\) as the Bernoulli parameters.
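The reduction can be illustrated in two lines (a sketch of ours, with hypothetical names): once \(x^*_{\lambda }\) is known, every sample from arm k is collapsed to the indicator \(1\{x\ge x^*_{\lambda }\}\), which is a Bernoulli variable with mean \(p_k\).

```python
def bernoulli_view(draw_arm_k, x_star):
    """Illustration of the proof's reduction (names are ours): wrap an arm so
    that it emits the indicator 1{x >= x_star}, i.e., a Bernoulli arm whose
    mean is p_k = P(x_k >= x_star)."""
    return lambda: int(draw_arm_k() >= x_star)
```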
Then, by Remark 5 in [9], in which a lower bound for the standard MAB problem with Bernoulli arms is provided for \(\delta \le 0.15\), we have
where \(S_{\epsilon }\triangleq \{k|k\in K,\,p_k\ge \lambda -\epsilon \}\) and \(KL\left( p,q\right) \) stands for the Kullback-Leibler divergence between two Bernoulli distributions with parameters p and q respectively.
We note that \(\ln (1+x)\le x\). Hence, Eq. (4) follows.
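In detail (our rendering of this standard step, writing both logarithms in the KL divergence in the form \(\ln (1+x)\)):
$$\begin{aligned} KL\left( p,q\right) =p\ln \frac{p}{q}+(1-p)\ln \frac{1-p}{1-q}\le p\cdot \frac{p-q}{q}+(1-p)\cdot \frac{q-p}{1-q}=\frac{(p-q)^{2}}{q(1-q)}\, . \end{aligned}$$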
Therefore, by Eqs. (3) and (4) it follows that
Hence, by the facts that \(2\epsilon \le \lambda \) and \(2\epsilon \le (1-\lambda )\), which together imply \((\lambda -\epsilon )(1-\lambda +\epsilon )\ge \frac{\lambda (1-\lambda )}{2}\), and since \(\varDelta _{k,\lambda }=\lambda -p_k\), Eq. (2) is obtained. \(\quad \square \)
4 Algorithms
In this section we provide two related algorithms. The first is simpler and attains the lower bound of Theorem 1 up to a logarithmic term. The second algorithm is based on applying the doubling trick to the first one, and hence its upper bound matches Theorem 1 up to a double-logarithmic term.
4.1 The Max-Q Algorithm
Here we present our Max-Q algorithm. The algorithm is \((\epsilon ,\delta )\)-correct and based on sampling the arm which has the highest potential \(\lambda \)-quantile value.
The Max-Q algorithm starts by sampling a fixed number of times from each arm. Then, at each step, the algorithm associates with each arm a value that, with high probability, lies above its \(\lambda \)-quantile (an optimistic estimate of the quantile), and chooses the arm for which this value is maximal. If the number of times that arm has been sampled exceeds a certain threshold, the algorithm stops and returns that arm; otherwise, it samples that arm one more time.
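The following is a minimal Python sketch of this procedure (our own rendering; the authoritative pseudocode is Algorithm 1 in the paper). The index of each arm is its m-th largest observed value with \(m=\lfloor \lambda C(k)-\sqrt{3\lambda C(k)L}\rfloor +1\), and the stopping threshold \(\frac{10\lambda L}{\epsilon ^2}\) is the quantity appearing in the proof of Theorem 2. The initialization \(N_0=\lceil 3L/\lambda \rceil \) is our assumption, chosen so that \(m\ge 1\).

```python
import bisect
import math
import random

def max_q(arms, lam, eps, delta):
    """Sketch of the Max-Q algorithm: `arms` is a list of zero-argument
    callables, each returning one reward sample; returns an arm index."""
    K = len(arms)
    # Confidence term L from Theorem 2 (note that -log(delta) > 0 for delta < 1).
    L = 6 * math.log(K * (1 + (-10 * lam * math.log(delta)) / eps ** 2)) - math.log(delta)
    # ASSUMPTION: initial sample count, chosen here so the index m stays >= 1;
    # the paper's Algorithm 1 fixes the exact initialization.
    N0 = math.ceil(3 * L / lam)
    samples = [sorted(arms[k]() for _ in range(N0)) for k in range(K)]

    def index(k):
        # Optimistic quantile estimate: the m-th largest sample of arm k,
        # with m = floor(lam * n - sqrt(3 * lam * n * L)) + 1.
        n = len(samples[k])
        m = math.floor(lam * n - math.sqrt(3 * lam * n * L)) + 1
        return samples[k][n - m]

    while True:
        k = max(range(K), key=index)                   # highest potential quantile
        if len(samples[k]) > 10 * lam * L / eps ** 2:  # stopping rule (step 5)
            return k
        bisect.insort(samples[k], arms[k]())           # otherwise, one more sample

# Toy usage: uniform arms on (mu - 1, mu); arm 0 has the largest lam-quantile.
arms = [lambda mu=mu: random.uniform(mu - 1.0, mu) for mu in (0.0, -0.3, -0.6)]
print(max_q(arms, lam=0.5, eps=0.1, delta=0.05))       # should typically print 0
```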
The fundamental difference between the Max-Q algorithm and the algorithm presented in [14] is that in the latter the entire CDF is estimated, while here only the value of the quantile is estimated. This difference leads to a bound on the sample complexity of the Max-Q algorithm which is smaller by a factor of \(\lambda \) compared to that of [14].
Theorem 2
For every \(\lambda \in (0,1)\), \(\epsilon \in (0,\lambda )\) and \(\delta \in (0,1)\), Algorithm 1 is \((\epsilon ,\delta )\)-correct with a sample complexity bound of
where \(L=6\ln \left( |K|\left( 1+\frac{-10\lambda \ln (\delta )}{\epsilon ^2}\right) \right) -\ln \left( \delta \right) \) as defined in the algorithm.
It may be observed that for \(\lambda \le \frac{1}{2}\), the upper bound provided in Theorem 2 is of the same order as the lower bound in Theorem 1, up to a logarithmic factor.
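To spell this out (our paraphrase, using the per-arm term \(\varPhi _k(\epsilon )\approx \frac{10\lambda L}{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\) from the proof of Theorem 2): the ratio between the per-arm terms of the upper and lower bounds satisfies
$$\begin{aligned} \frac{10\lambda L}{\lambda (1-\lambda )\ln (\frac{1}{\delta })}=\frac{10L}{(1-\lambda )\ln (\frac{1}{\delta })}\le \frac{20L}{\ln (\frac{1}{\delta })}\quad \text {for }\lambda \le \tfrac{1}{2}\, , \end{aligned}$$
which grows only logarithmically with \(|K|\) and \(\frac{1}{\epsilon }\).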
To establish Theorem 2, we first bound the probability of the event under which the m-th largest sample of an optimal arm falls below its \(\lambda \)-quantile. Then, we bound the number of samples needed from each suboptimal arm so that the m-th largest value obtained from that arm falls below its \((\lambda -\epsilon )\)-quantile. To establish these bounds in a way that retains the multiplicative factor of \(\lambda \), we use Bernstein’s inequality to bound the difference between the empirical mean and the mean value of a Bernoulli R.V. which is one if the sampled value is above the quantile and zero otherwise.
Proof
(Theorem 2). We denote the time step of the algorithm by t, the value of the counter C(k) at time step t by \(C^{t}(k)\), and we use the notations \(L'=L+\ln \left( \delta \right) \) and \(x^*\) as shorthand for \(x^*_{\lambda }\). Recall that T stands for the random final time step. By the condition in step 5 of the algorithm, for every arm \(k\in K\), it follows that,
Note that since \(\frac{d}{dx}6\ln (x)=\frac{6}{x}\le 1\) for \(x\ge 6\), and since \(x_{0}=20\) satisfies \(x_{0}>6\ln (x_{0})+1\), it is obtained that
for \(L''\ge 20\). So, by the fact that \(T=\sum _{k\in K}C^{T-1}(k)+1\), for \(L''\ge 20\) it follows that
We proceed to establish the \((\epsilon ,\delta )\)-correctness of the algorithm. Let \(V^k_N(m)\) stand for the m-th largest value obtained from arm k after sampling it for N times and assume w.l.o.g. that \(Q_{1}(\lambda )=x^*_{\lambda }\). Then, for \(N\ge N_0\) and \(m=\lfloor \lambda N-\sqrt{3\lambda NL}\rfloor +1\), as stated in the algorithm, by Lemma 1 below it follows that
Hence, at every time step t, by Eqs. (7) and (8), applying the union bound yields
where \(V^{t,k}\) stands for the value of \(V^{k}\) at time step t.
Let \(k^*_T\) stand for the arm returned by the algorithm. Also, by Lemma 1, for \(N>\frac{10\lambda \left( L'-\ln (\delta )\right) }{\epsilon ^2}\), it follows that
So, since by the condition in step 5, it is obtained that \(C(k^*_T)>\frac{10\lambda \left( L'-\ln (\delta )\right) }{\epsilon ^2}\), it follows by Eq. (10) and the union bound that
Also, by Eq. (9) and the union bound it follows that
So, since by step 4 of the algorithm, \(V^{T,k^*_T}\ge V^{T,1}\), it follows by Eqs. (11) and (12) that
It follows that the algorithm returns an \(\epsilon \)-optimal arm with a probability larger than \(1-\delta \). Hence, it is \((\epsilon ,\delta )\)-correct.
To prove the bound on the expected sample complexity of the algorithm, we define the following sets:
As before, we assume w.l.o.g. that \(Q_1(\lambda )=x^*\). Then, for the case in which
occurs, for every arm \(k\in K\), a necessary condition for \(C^T(k)>N_k'\), where \(N_k'=\lfloor \frac{10\lambda \left( L'-\ln (\delta )\right) }{\varDelta _{k,\lambda }^2}\rfloor +1\) is
where \(m_k'=\lfloor \lambda N_k'-\sqrt{3\lambda N_k'\left( L'-\ln (\delta )\right) }\rfloor +1\).
Now, by using the bound in Eq. (6) and the fact that \(\sum _{k\in K}C^{T}(k)=\sum _{k\in K}C^{T-1}(k)+1\) for the arms in the set \(M(\epsilon )\), \(N_k'\) as a bound for the arms in the set \(N(\epsilon )\), and the bound in Eq. (7), it is obtained that
where \(\varPhi _k(\epsilon )=\lfloor \frac{10\lambda \left( L'-\ln (\delta )\right) }{\left( \max \left( \epsilon ,\varDelta _{k,\lambda }\right) \right) ^2}\rfloor +1\). But, by Eq. (9) it follows that
Also, since \(Q_k' \triangleq Q_k\left( \lambda -\sqrt{\frac{10\lambda \left( L'-\ln (\delta )\right) }{N_k'}}\right) <x^*\) for \(k\in N(\epsilon )\), it follows by Lemma 1 that
Therefore, by Eqs. (13), (14) and (15) and the definition of \(\varPhi _k(\epsilon )\), the bound on the sample complexity is obtained. \(\quad \square \)
Lemma 1
For every arm \(k\in K\), let \(V^k_N(m)\) stand for the m-th largest value obtained from arm k after sampling it for N times. Then, for any positive integers m and N such that \(m<N\), and every \(\lambda \in [0, 1]\), it follows that,
1. If \(\frac{m}{N}>\lambda \), then
$$\begin{aligned} P\left( V^k_{N}(m)> Q_k(\lambda )\right) \le f_{0}(m,N,\lambda )\, . \end{aligned}$$
2. If \(\frac{m}{N}<\lambda \), then
$$\begin{aligned} P\left( V^k_{N}(m)<Q_k(\lambda )\right) \le f_{0}(m,N,\lambda )\, , \end{aligned}$$
where \(f_{0}(m,N,\lambda )=\exp \left( -\frac{|m-N\lambda |^{2}}{2\left( N\lambda +|m-N\lambda |/3\right) }\right) \).
The proof is based on Bernstein’s inequality.
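For reference, the form of Bernstein’s inequality used here (as we read it: for i.i.d. zero-mean variables \(\widetilde{X}_{i}\) bounded by 1 in absolute value, with variance \(\sigma ^{2}\)) is
$$\begin{aligned} P\left( \sum _{i=1}^{N}\widetilde{X}_{i}\ge t\right) \le \exp \left( -\frac{t^{2}}{2\left( N\sigma ^{2}+t/3\right) }\right) \, , \end{aligned}$$
so that substituting \(\sigma ^{2}=\lambda (1-\lambda )\le \lambda \) and \(t=|m-N\lambda |\) bounds the right-hand side by \(f_{0}(m,N,\lambda )\).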
Proof
In this proof, we omit the arm index k for short. We start with claim (1). Let \(x_i\) stand for the i-th sampled value from the arm, and let \(\left\{ X_{i}(\lambda )\right\} \) and \(\left\{ Y_{i}(\lambda )\right\} \) be random variables for which
Note that the variables \(\left\{ Y_{i}(\lambda )\right\} \) are i.i.d. The variables \(\left\{ X_{i}(\lambda )\right\} \) are i.i.d. as well. Then, since \(P\left( Y_{i}(\lambda )=1\right) \le P\left( X_{i}(\lambda )=1\right) \), after sampling N times,
where \(\widetilde{X}_{i}(\lambda )=X_{i}(\lambda )-E[X_{1}(\lambda )]\). So, \(\left\{ \widetilde{X}_{i}(\lambda )\right\} \) satisfies the conditions of Bernstein’s inequality with \(\sigma ^{2}=\lambda \left( 1-\lambda \right) \), and \(E[X_{1}(\lambda )]=\lambda \). Therefore
Proceeding to claim (2), let \(\left\{ Z_{i}(\lambda )\right\} \) be random variables for which
Note that \(\left\{ Z_{i}\right\} \) are i.i.d. Then, since \(P\left( Z_{i}(\lambda )=1\right) \ge P\left( X_{i}(\lambda )=1\right) \),
and by symmetry
\(\quad \square \)
4.2 The Doubled Max-Q Algorithm
Here we improve on the previous algorithm by resorting to the doubling trick. The Doubled Max-Q algorithm is based on the same principle as the Max-Q algorithm. However, instead of observing one sample at each time step, the algorithm doubles the number of samples of the chosen arm. Consequently, the number of times at which the algorithm needs to choose an arm is roughly logarithmic compared to that of the previous algorithm, leading to a tighter bound. Algorithm 2 presents the proposed Doubled Max-Q algorithm.
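A sketch of the modification (again ours; Algorithm 2 in the paper is authoritative): relative to the Max-Q sketch of Sect. 4.1, only the confidence term and the sampling step change, with the stopping threshold \(\frac{10\lambda L_D}{\epsilon ^2}\) taken from the proof of Theorem 3 and the initialization again our assumption.

```python
import math
import random

def doubled_max_q(arms, lam, eps, delta):
    """Sketch of the Doubled Max-Q algorithm: identical to the Max-Q sketch
    except for the confidence term L_D and the doubling of the sample count
    of the chosen arm at each step."""
    K = len(arms)
    # Confidence term L_D from Theorem 3.
    L_D = 6 * math.log(K * math.log2((-20 * lam * math.log(delta)) / eps ** 2)) - math.log(delta)
    # ASSUMPTION: same style of initialization as in the Max-Q sketch.
    N0 = math.ceil(3 * L_D / lam)
    samples = [sorted(arms[k]() for _ in range(N0)) for k in range(K)]

    def index(k):
        n = len(samples[k])
        m = math.floor(lam * n - math.sqrt(3 * lam * n * L_D)) + 1
        return samples[k][n - m]

    while True:
        k = max(range(K), key=index)
        if len(samples[k]) > 10 * lam * L_D / eps ** 2:   # stopping rule
            return k
        # Doubling step: draw as many fresh samples as already collected,
        # so only O(log) selection rounds are spent on each arm.
        samples[k] = sorted(samples[k] + [arms[k]() for _ in range(len(samples[k]))])

arms = [lambda mu=mu: random.uniform(mu - 1.0, mu) for mu in (0.0, -0.3, -0.6)]
print(doubled_max_q(arms, lam=0.5, eps=0.1, delta=0.05))  # should typically print 0
```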
Theorem 3
For every \(\lambda \in (0,1)\), \(\epsilon \in (0,\lambda )\) and \(\delta \in (0,1)\), Algorithm 2 is \((\epsilon ,\delta )\)-correct with a sample complexity bound of
where \(L_D=6\ln \left( |K|\log _2\left( \frac{-20\lambda \ln (\delta )}{\epsilon ^2}\right) \right) -\ln \left( \delta \right) \) as defined in the algorithm.
Here, the upper bound is of the same order as the lower bound in Theorem 1, up to a double-logarithmic factor.
The proof of Theorem 3 is established by some adjustments of the proof of Theorem 2.
Proof
As before, we denote the time step of the algorithm by t, the value of the counter C(k) at time step t by \(C^{t}(k)\), and we use the notations \(L_D'=L_D+\ln \left( \delta \right) \) and \(x^*\) as shorthand for \(x^*_{\lambda }\). We note that here, at each time step, there may be more than a single sample, so T, the sample complexity, may differ from the final time step. Hence, we denote the (random) final time step by \(T_D\). By the condition in step 5 of the algorithm, for every arm \(k\in K\), it follows that,
Note that by the facts that for \(x\ge 6\) it follows that \(\frac{d6\ln (x)}{dx}\le 1\), and that for \(x_{0}=20\) it follows that \(x_{0}>6\ln (x_{0})+1\) it is obtained that
for \(L_D''\ge 20\). So, by the fact that \(T=\sum _{k\in K}\log _2\left( 2C^{T_D-1}(k)\right) \), for \(L_D''\ge 20\) it follows that
Recall that \(x^*\) is used as shorthand for \(x^*_{\lambda }\). Now, we begin by proving the \((\epsilon ,\delta )\)-correctness property of the algorithm. We let \(V^k_N(m)\) stand for the m-th largest value obtained from arm k after sampling it for N times, and we assume w.l.o.g. that \(Q_{1}(\lambda )=x^*\). Then, for \(N\ge N_0\) and \(m=\lfloor \lambda N-\sqrt{3\lambda NL_D}\rfloor +1\), as stated in the algorithm, by Lemma 1 it follows that
Hence, at every time step t, by Eqs. (24) and (25) and the union bound, for \(N_i=2^iN_0\) it follows that
where \(V^{t,k}\) stands for the value of \(V^{k}\) at time step t.
Now, let \(k^*_{T_D}\) stand for the arm returned by the algorithm. Also, by Lemma 1, for \(N>\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{\epsilon ^2}\), it follows that
So, since by the condition in step 5, it is obtained that \(C(k^*_{T_D})>\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{\epsilon ^2}\), it follows by Eq. (27) and the union bound that
Also, by Eq. (26) and applying the union bound it follows that
So, since by step 4 of the algorithm, \(V^{{T_D},k^*_{T_D}}\ge V^{{T_D},1}\), it follows by Eqs. (28) and (29) that
Therefore, it follows that the algorithm returns an \(\epsilon \)-optimal arm with a probability larger than \(1-\delta \). So, it is \((\epsilon ,\delta )\)-correct.
To prove the bound on the expected sample complexity of the algorithm, we define the following sets:
As before, we assume w.l.o.g. that \(Q_1(\lambda )=x^*\). For the case in which
occurs, for every arm \(k\in K\), a necessary condition for \(C^{T_D}(k)>N_{k,D}'\), where
is
where \(m_{k,D}'=\lfloor \lambda N_{k,D}'-\sqrt{3\lambda N_{k,D}'\left( L_D'-\ln (\delta )\right) }\rfloor +1\).
Then for
for \(k\in N(\epsilon )\) it follows that
So, by using the bound in Eq. (23) and the fact that \(\sum _{k\in K}C^{{T_D}}(k)=2\sum _{k\in K}C^{{T_D}-1}(k)\) for the arms in the set \(M(\epsilon )\), \(N_{k,D}'\) as a bound for the arms in the set \(N(\epsilon )\) and the bound in Eq. (24), it is obtained that
But, by Eq. (26) it follows that
Also, since \(Q_k\left( \lambda -\sqrt{\frac{10\lambda \left( L_D'-\ln (\delta )\right) }{N_{k,D}'}}\right) <x^*\) for \(k\in N(\epsilon )\), it follows by Lemma 1 that
Therefore, by Eqs. (30), (31) and (32) and the definition of \(\varPhi _{k,D}(\epsilon )\), the bound on the sample complexity is obtained. \(\quad \square \)
5 Experiments
In this section we investigate numerically the Max-Q and the Doubled Max-Q algorithms presented in this paper, and compare them with the QPAC algorithm presented in [14].
In Fig. 1, we present the average sample complexity over 10 runs vs. the quantile \(\lambda \) for \(\delta =0.01\) and various values of \(\epsilon \). As shown in Fig. 1, and detailed in Tables 1, 2, 3 and 4, the Max-Q and the Doubled Max-Q algorithms significantly outperform the QPAC algorithm. The arm distributions used here were uniform, with support intervals of length 1.
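For concreteness, a sketch of such a run (ours; it reuses the `max_q` sketch from Sect. 4.1, uses illustrative arm locations, and does not reproduce the QPAC baseline of [14]):

```python
import random
import statistics

def run_once(mus, lam, eps=0.1, delta=0.01):
    counts = [0] * len(mus)
    def make_arm(i, mu):
        def draw():
            counts[i] += 1
            return random.uniform(mu - 1.0, mu)   # uniform arm, interval length 1
        return draw
    arms = [make_arm(i, mu) for i, mu in enumerate(mus)]
    k = max_q(arms, lam, eps, delta)              # Max-Q sketch from Sect. 4.1
    return k, sum(counts)

mus = [0.0, -0.2, -0.4, -0.6]                     # illustrative arm locations
for lam in (0.2, 0.4, 0.6, 0.8):
    runs = [run_once(mus, lam) for _ in range(10)]
    avg = statistics.mean(n for _, n in runs)
    print(f"lam={lam}: average sample complexity over 10 runs: {avg:.0f}")
```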
6 Conclusion
In this paper we studied the pure exploration problem where the goal is to find the arm with the maximal \(\lambda \)-quantile. Under the PAC framework, we provided a lower bound and algorithms that attain it up to a logarithmic term (for the first algorithm) and a double-logarithmic term (for the second algorithm).
A challenge for future work is closing the logarithmic gap between the lower and upper bounds.
References
Audibert, J.Y., Bubeck, S.: Best arm identification in multi-armed bandits. In: Proceedings of the 23rd Conference on Learning Theory (COLT), pp. 41–53 (2010)
Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)
Cicirello, V.A., Smith, S.F.: The max k-armed bandit: a new model of exploration applied to search heuristic selection. In: Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), pp. 1355–1361 (2005)
David, Y., Shimkin, N.: PAC lower bounds and efficient algorithms for the max k-armed bandit problem. In: Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 878–887 (2016)
Even-Dar, E., Mannor, S., Mansour, Y.: PAC bounds for multi-armed bandit and Markov decision processes. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 255–270. Springer, Heidelberg (2002). doi:10.1007/3-540-45435-7_18
Gabillon, V., Ghavamzadeh, M., Lazaric, A.: Best arm identification: a unified approach to fixed budget and fixed confidence. Adv. Neural Inf. Process. Syst. 25, 3212–3220 (2012). Curran Associates, Inc.
Kalyanakrishnan, S., Tewari, A., Auer, P., Stone, P.: PAC subset selection in stochastic multi-armed bandits. In: Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 655–662 (2012)
Karnin, Z.S., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1238–1246 (2013)
Kaufmann, E., Cappé, O., Garivier, A.: On the complexity of best-arm identification in multi-armed bandit models. J. Mach. Learn. Res. 17(1), 1–42 (2016)
Kaufmann, E., Kalyanakrishnan, S.: Information complexity in bandit subset selection. In: Proceedings of the 26th Conference on Learning Theory (COLT), pp. 228–251 (2013)
Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 5, 623–648 (2004)
Schachter, B.: An irreverent guide to value at risk. Finan. Eng. News 1(1), 17–18 (1997)
Streeter, M.J., Smith, S.F.: An asymptotically optimal algorithm for the max k-armed bandit problem. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pp. 135–142 (2006)
Szörényi, B., Busa-Fekete, R., Weng, P., Hüllermeier, E.: Qualitative multi-armed bandits: a quantile-based approach. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1660–1668 (2015)
Yu, J.Y., Nikolova, E.: Sample complexity of risk-averse bandit-arm selection. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI) (2013)