Abstract
We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an \({{\mathcal {O}}}(1/\epsilon )\) (resp., \({{\mathcal {O}}}(1/\epsilon ^2)\)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where \(\epsilon \) denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by \({{\mathcal {O}}}\{(\log _\gamma \epsilon ) [(1-\gamma )L/\mu ]^{1/2}\log (1/\epsilon )\}\) (resp., \({{\mathcal {O}}} \{(\log _\gamma \epsilon ) (L/\epsilon )^{1/2}\}\)) for problems with strongly (resp., general) convex regularizers. Here \(\gamma \) denotes the discount factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility and thus expands the applicability of RL models.
Notes
It is worth noting that we do not enforce \(\pi (a|s) > 0\) when defining \(\omega (\pi (\cdot |s))\) as all the search points generated by our algorithms will satisfy this assumption.
References
Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: On the theory of policy gradient methods: optimality, approximation, and distribution shift. arXiv:1908.00261 (2019)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Bellman, R., Dreyfus, S.: Functional approximations and dynamic programming. Math. Tables Other Aids Comput. 13(68), 247–251 (1959)
Bhandari, J., Russo, D.: A note on the linear convergence of policy gradient methods. arXiv:2007.11120 (2020)
Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. arXiv:2007.06558 (2020)
Dang, C.D., Lan, G.: On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators. Comput. Optim. Appl. 60(2), 277–310 (2015)
Even-Dar, E., Kakade, S.M., Mansour, Y.: Online Markov decision processes. Math. Oper. Res. 34(3), 726–736 (2009)
Facchinei, F., Pang, J.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Volumes I and II. Comprehensive Study in Mathematics. Springer, New York (2003)
Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML) (2002)
Khodadadian, S., Chen, Z., Maguluri, S.T.: Finite-sample analysis of off-policy natural actor-critic algorithm. arXiv:2102.09318 (2021)
Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation. arXiv:2011.02987 (2020)
Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning. arXiv:2011.08434 (2020)
Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer, Switzerland (2020)
Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. arXiv:1906.10306 (2019)
Mei, J., Xiao, C., Szepesvari, C., Schuurmans, D.: On the global convergence rates of softmax policy gradient methods. arXiv:2005.06392 (2020)
Nemirovski, A.S., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
Nemirovski, A.S., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, New York (1983)
Nesterov, Y.E.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. AN SSSR 269, 543–547 (1983)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn. Wiley, New York (1994)
Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPS. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 5668–5675. AAAI Press (2020)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NIPS’99: Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 1057–1063 (1999)
Tomar, M., Shani, L., Efroni, Y., Ghavamzadeh, M.: Mirror descent policy optimization. arXiv:2005.09814 (2020)
Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018)
Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence. arXiv:1909.01150 (2020)
Wolfer, G., Kontorovich, A.: Statistical estimation of ergodic Markov chain kernel over discrete state space. arXiv:1809.05014v6 (2020)
Xu, T., Wang, Z., Liang, Y.: Improving sample complexity bounds for actor-critic algorithms. arXiv:2004.12956 (2020)
Acknowledgements
The author is very grateful to Caleb Ju, Sajad Khodadadian, Tianjiao Li, Yan Li, and two anonymous reviewers for their careful reading of earlier versions of this paper and for their suggested corrections.
This research was partially supported by the NSF grants 1909298 and 1953199 and NIFA grant 2020-67021-31526. The paper was first released at arXiv:2102.00135 on 01/30/2021.
Appendices
Appendix A: Concentration bounds for \(l_\infty \)-bounded noise
We first show how to bound the expectation of the maximum of finitely many sub-exponential random variables.
Lemma 25
Let \(\left\Vert X\right\Vert _{\psi _1}:= \inf \{t > 0: \mathbb {E}[\exp (|X|/t)] \le \exp (2) \}\) denote the sub-exponential norm of \(X\). For a given sequence of sub-exponential variables \(\{X_i\}_{i=1}^n\) with \(\mathbb {E}[X_i] \le v\) and \(\left\Vert X_i\right\Vert _{\psi _1} \le \sigma \), we have
\[ \mathbb {E}\big [\max _{1 \le i \le n} X_i\big ] \le C \sigma (\log n + 1) + v, \]
where C denotes an absolute constant.
Proof
By the property of sub-exponential random variables (Section 2.7 of [23]), we know that \(Y_i = X_i - \mathbb {E}\left[ X_i\right] \) is also sub-exponential with \(\left\Vert Y_i\right\Vert _{\psi _1} \le C_1 \left\Vert X_i\right\Vert _{\psi _1} \le C_1 \sigma \) for some absolute constant \(C_1 > 0\). Hence, by Proposition 2.7.1 of [23], there exists an absolute constant \(C > 0\) such that \( \mathbb {E}[\exp (\lambda Y_i)] \le \exp (C^2 \sigma ^2 \lambda ^2), ~ \forall |\lambda | \le 1/( C \sigma ). \) Using the previous observation, we have
\[ \exp \big (\lambda \, \mathbb {E}[\max _i Y_i]\big ) \le \mathbb {E}\big [\exp \big (\lambda \max _i Y_i\big )\big ] = \mathbb {E}\big [\max _i \exp (\lambda Y_i)\big ] \le \textstyle \sum _{i=1}^n \mathbb {E}[\exp (\lambda Y_i)] \le n \exp (C^2 \sigma ^2 \lambda ^2), \quad \forall \, 0 < \lambda \le 1/(C \sigma ), \]
which implies \( \mathbb {E}[\max _i Y_i] \le \log n / \lambda + C^2 \sigma ^2 \lambda , ~ \forall \, 0 < \lambda \le 1/(C \sigma ). \) Choosing \(\lambda = 1/(C \sigma )\), we obtain \( \mathbb {E}\left[ \max _i Y_i\right] \le C \sigma (\log n + 1) . \) By combining this relation with the definition of \(Y_i\), we conclude that \( \mathbb {E}[\max _i X_i] \le \mathbb {E}[\max _i Y_i ]+ v \le C \sigma (\log n + 1) + v. \)
\(\square \)
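As a quick numerical illustration of Lemma 25 (not part of the original argument), the following Python sketch estimates \(\mathbb {E}[\max _i X_i]\) for \(n\) i.i.d. Exp(1) variables, a standard sub-exponential family with mean \(v = 1\) and \(\psi _1\)-norm of order one, and shows that the empirical maximum tracks the \(\log n\) growth predicted by the bound \(C \sigma (\log n + 1) + v\); the sample sizes and number of trials are arbitrary choices.

```python
import numpy as np

# Hedged illustration of Lemma 25 with X_i ~ Exp(1): sub-exponential,
# E[X_i] = 1 and psi_1-norm of order one, so the lemma predicts
# E[max_i X_i] <= C*(log n + 1) + 1 for some absolute constant C.
rng = np.random.default_rng(0)
trials = 1000  # number of Monte Carlo repetitions (arbitrary)

for n in [10, 100, 1000, 10000]:
    samples = rng.exponential(scale=1.0, size=(trials, n))
    emp_max = samples.max(axis=1).mean()          # empirical E[max_i X_i]
    print(f"n = {n:6d}   E[max] ~ {emp_max:7.3f}   log(n) + 1 = {np.log(n) + 1:7.3f}")
```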
Proposition 7
For \(\delta ^{k} := Q^{\pi _k, \xi _k} - Q^{\pi _k} \in \mathbb {R}^{|{{\mathcal {S}}} | \times |{{\mathcal {A}}} |}\), we have
\[ \mathbb {E}\big [\Vert \delta ^k\Vert _\infty ^2\big ] \le \tfrac{\kappa ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \Big [ \tfrac{\log (|{{\mathcal {S}}}||{{\mathcal {A}}}|) + 1}{M_k} + \gamma ^{2 T_k}\Big ], \]
where \(\kappa >0\) denotes an absolute constant.
Proof
To proceed, we denote \(\delta ^{k}_{s,a} := Q^{\pi _k, \xi _k}(s,a) - Q^{\pi _k}(s,a) \), and hence \(\Vert \delta ^k\Vert _\infty = \max _{s,a} |\delta ^k_{s,a}|\). Note that, by definition, for each \((s, a)\) pair we have \(M_k\) independent trajectories of length \(T_k\) starting from \((s, a)\). Let us denote \(Z_i := \sum _{t = 0}^{T_k - 1} \gamma ^t \left[ c(s_t^i, a_t^i) + h^{\pi _k}(s_t^i) \right] \), \(i = 1, \ldots , M_k\). Hence,
\[ \delta ^k_{s,a} = \tfrac{1}{M_k} \textstyle \sum _{i=1}^{M_k} Z_i - Q^{\pi _k}(s,a). \]
Since the random variables \(Z_i - Q^{\pi _k}(s,a)\), \(i = 1, \ldots , M_k\), are independent, it is immediate to see that \(Y_{s,a} := (\delta ^k_{s,a})^2\) is sub-exponential with \(\left\Vert Y_{s,a}\right\Vert _{\psi _1} \le \frac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k}\). Also note that
\[ \mathbb {E}[Y_{s,a}] = \mathbb {E}\big [(\delta ^k_{s,a})^2\big ] \le \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}. \]
Thus, in view of Lemma 25 applied with \(n = |{{\mathcal {S}}}||{{\mathcal {A}}}|\), \(\sigma = \tfrac{ ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k}\), and \(v = \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \frac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}\), we conclude that
\[ \mathbb {E}\big [\Vert \delta ^k\Vert _\infty ^2\big ] = \mathbb {E}\big [\max _{s,a} Y_{s,a}\big ] \le C \sigma \big (\log (|{{\mathcal {S}}}||{{\mathcal {A}}}|) + 1\big ) + v \le \tfrac{\kappa ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \Big [ \tfrac{\log (|{{\mathcal {S}}}||{{\mathcal {A}}}|) + 1}{M_k} + \gamma ^{2 T_k}\Big ]. \]
\(\square \)
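For concreteness, here is a minimal Python sketch (not from the paper) of the truncated Monte Carlo estimator analyzed above: each \(Z_i\) is the discounted sum of \(c(s_t, a_t) + h^{\pi _k}(s_t)\) along an independent length-\(T_k\) trajectory started from \((s,a)\), and the \(M_k\) values are averaged. The simulator interface (`env.reset_to`, `env.step`) and the `policy`/`h` callables are assumptions made purely for illustration.

```python
import numpy as np

def estimate_Q(env, policy, h, s, a, gamma, M, T, rng):
    """Truncated Monte Carlo estimate of the regularized Q-function at (s, a).

    Returns (1/M) * sum_i Z_i with
        Z_i = sum_{t=0}^{T-1} gamma^t [ c(s_t^i, a_t^i) + h(s_t^i) ],
    each Z_i computed from an independent trajectory of length T started at (s, a).
    The `env`, `policy`, and `h` interfaces here are illustrative assumptions only.
    """
    z_values = []
    for _ in range(M):
        s_t, a_t = s, a                      # every trajectory starts from the same (s, a)
        env.reset_to(s_t)
        z = 0.0
        for t in range(T):
            cost, s_next = env.step(s_t, a_t)     # observe c(s_t, a_t) and the next state
            z += gamma**t * (cost + h(s_t))       # accumulate the discounted regularized cost
            s_t = s_next
            a_t = policy.sample(s_t, rng)         # a_{t+1} ~ pi(.|s_{t+1})
        z_values.append(z)
    return float(np.mean(z_values))
```

Truncating at length \(T\) introduces a bias of order \(\gamma ^{T}({\overline{c}} + {\overline{h}})/(1-\gamma )\), which is the source of the \(\gamma ^{2T_k}\) term in the choice of \(v\) above.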
Appendix B: Bias for conditional temporal difference methods
Proof of Lemma 18
For simplicity, let us denote \({\bar{\theta }}_t \equiv \mathbb {E}[\theta _t]\), \(\zeta _t \equiv (\zeta _t^1, \ldots , \zeta _t^\alpha )\) and \(\zeta _{\lceil t\rceil } \equiv (\zeta _1, \ldots , \zeta _t)\). Also let us denote \(\delta ^F_t := F^\pi (\theta _t) - \mathbb {E}[{\tilde{F}}^\pi (\theta _t,\zeta _t^\alpha )|\zeta _{\lceil t-1\rceil }]\) and \({\bar{\delta }}^F_t = \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[\delta ^F_t]\). It follows from Jensen's inequality and Lemma 17 that
Also by Jensen’s inequality, Lemma 16 and Lemma 17, we have
Notice that
Now, conditioning on \(\zeta _{\lceil t-1\rceil }\) and taking expectation w.r.t. \(\zeta _t\) in (5.11), we have \( \mathbb {E}[\theta _{t+1}|\zeta _{\lceil t-1\rceil }] = \theta _t - \beta _t F^\pi (\theta _t) + \beta _t \delta ^F_t. \) Taking a further expectation w.r.t. \(\zeta _{\lceil t-1\rceil }\) and using the linearity of \(F^\pi \), we have \( {\bar{\theta }}_{t+1} = {\bar{\theta }}_t - \beta _t F^\pi ({\bar{\theta }}_t) + \beta _t {\bar{\delta }}^F_t, \) which implies
The above inequality, together with (7.1), (7.2) and the facts that
then implies that
where the last inequality follows from
due to the selection of \(\beta _t\) in (5.13). Now let us denote \( \varGamma _t := {\left\{ \begin{array}{ll} 1 & t = 0,\\ (1 - \tfrac{3}{t + t_0 - 1})\varGamma _{t-1} & t \ge 1, \end{array}\right. } \) or equivalently, \(\varGamma _t = \tfrac{(t_0 - 1) (t_0 - 2) (t_0 - 3)}{(t + t_0 - 1) (t + t_0 - 2) (t + t_0 - 3)}\). Dividing both sides of (7.3) by \(\varGamma _t\) and taking the telescoping sum, we have
Noting that
we conclude
from which the result follows since \({\bar{\theta }}_1 = \theta _1\). \(\square \)
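As a small sanity check (not part of the original proof), the recursion defining \(\varGamma _t\) can be verified numerically against its closed form; the value \(t_0 = 6\) below is an arbitrary choice with \(t_0 > 3\), so that every factor \(1 - \tfrac{3}{t + t_0 - 1}\) stays positive.

```python
# Verify that Gamma_0 = 1, Gamma_t = (1 - 3/(t + t0 - 1)) * Gamma_{t-1} matches
# the closed form (t0-1)(t0-2)(t0-3) / ((t+t0-1)(t+t0-2)(t+t0-3)).
t0 = 6  # arbitrary choice with t0 > 3 (purely for illustration)
gamma_rec = 1.0
for t in range(1, 21):
    gamma_rec *= 1.0 - 3.0 / (t + t0 - 1)
    closed = (t0 - 1) * (t0 - 2) * (t0 - 3) / ((t + t0 - 1) * (t + t0 - 2) * (t + t0 - 3))
    assert abs(gamma_rec - closed) < 1e-12, (t, gamma_rec, closed)
print("recursion matches the closed form for t = 1, ..., 20")
```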
Cite this article
Lan, G. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Math. Program. 198, 1059–1106 (2023). https://doi.org/10.1007/s10107-022-01816-5