Abstract
We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an \({{\mathcal {O}}}(1/\epsilon )\) (resp., \({{\mathcal {O}}}(1/\epsilon ^2)\)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where \(\epsilon \) denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by \({{\mathcal {O}}}\{(\log _\gamma \epsilon ) [(1-\gamma )L/\mu ]^{1/2}\log (1/\epsilon )\}\) (resp., \({{\mathcal {O}}} \{(\log _\gamma \epsilon ) (L/\epsilon )^{1/2}\}\)) for problems with strongly (resp., general) convex regularizers. Here \(\gamma \) denotes the discount factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility and thus expands the applicability of RL models.
Notes
It is worth noting that we do not enforce \(\pi (a|s) > 0\) when defining \(\omega (\pi (\cdot |s))\) as all the search points generated by our algorithms will satisfy this assumption.
References
Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: On the theory of policy gradient methods: optimality, approximation, and distribution shift. arXiv:1908.00261 (2019)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Bellman, R., Dreyfus, S.: Functional approximations and dynamic programming. Math. Tables Other Aids Comput. 13(68), 247–251 (1959)
Bhandari, J., Russo, D.: A note on the linear convergence of policy gradient methods. arXiv:2007.11120 (2020)
Cen, S., Cheng, C., Chen, Y., Wei, Y., Chi, Y.: Fast global convergence of natural policy gradient methods with entropy regularization. arXiv:2007.06558 (2020)
Dang, C.D., Lan, G.: On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators. Comput. Optim. Appl. 60(2), 277–310 (2015)
Even-Dar, E., Kakade, S.M., Mansour, Y.: Online Markov decision processes. Math. Oper. Res. 34(3), 726–736 (2009)
Facchinei, F., Pang, J.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Volumes I and II. Comprehensive Study in Mathematics. Springer, New York (2003)
Kakade, S., Langford, J.: Approximately optimal approximate reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML) (2002)
Khodadadian, S., Chen, Z., Maguluri, S.T.: Finite-sample analysis of off-policy natural actor-critic algorithm. arXiv:2102.09318 (2021)
Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation. arXiv:2011.02987 (2020)
Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning. arXiv:2011.08434 (2020)
Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer, Switzerland (2020)
Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy. arXiv:1906.10306 (2019)
Mei, J., Xiao, C., Szepesvari, C., Schuurmans, D.: On the global convergence rates of softmax policy gradient methods. arXiv:2005.06392 (2020)
Nemirovski, A.S., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
Nemirovski, A.S., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, New York (1983)
Nesterov, Y.E.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. AN SSSR 269, 543–547 (1983)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn. Wiley, New York (1994)
Shani, L., Efroni, Y., Mannor, S.: Adaptive trust region policy optimization: global convergence and faster rates for regularized MDPS. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 5668–5675. AAAI Press (2020)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NIPS’99: Proceedings of the 12th International Conference on Neural Information Processing Systems, pp. 1057–1063 (1999)
Tomar, M., Shani, L., Efroni, Y., Ghavamzadeh, M.: Mirror descent policy optimization. arXiv:2005.09814 (2020)
Vershynin, R.: High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018)
Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence. arXiv:1909.01150 (2020)
Wolfer, G., Kontorovich, A.: Statistical estimation of ergodic Markov chain kernel over discrete state space. arXiv:1809.05014v6 (2020)
Xu, T., Wang, Z., Liang, Y.: Improving sample complexity bounds for actor-critic algorithms. arXiv:2004.12956 (2020)
Acknowledgements
The author is very grateful to Caleb Ju, Sajad Khodadadian, Tianjiao Li, Yan Li, and two anonymous reviewers for their careful reading of earlier versions of this paper and for their suggested corrections.
This research was partially supported by the NSF grants 1909298 and 1953199 and NIFA grant 2020-67021-31526. The paper was first released at arXiv:2102.00135 on 01/30/2021.
Appendices
Appendix A: Concentration bounds for \(l_\infty \)-bounded noise
We first show how to bound the expectation of the maximum of finitely many sub-exponential random variables.
Lemma 25
Let \(\left\Vert X\right\Vert _{\psi _1}:= \inf \{t > 0: \mathbb {E}[\exp (|X|/t)] \le \exp (2) \}\) denote the sub-exponential norm of \(X\). For a given sequence of sub-exponential variables \(\{X_i\}_{i=1}^n\) with \(\mathbb {E}[X_i] \le v\) and \(\left\Vert X_i\right\Vert _{\psi _1} \le \sigma \), we have
\[ \mathbb {E}\big [\max _{1 \le i \le n} X_i\big ] \le C \sigma (\log n + 1) + v, \]
where C denotes an absolute constant.
Proof
By the property of sub-exponential random variables (Section 2.7 of [23]), we know that \(Y_i = X_i - \mathbb {E}\left[ X_i\right] \) is also sub-exponential with \(\left\Vert Y_i\right\Vert _{\psi _1} \le C_1 \left\Vert X_i\right\Vert _{\psi _1} \le C_1 \sigma \) for some absolute constant \(C_1 > 0\). Hence, by Proposition 2.7.1 of [23], there exists an absolute constant \(C > 0\) such that \( \mathbb {E}[\exp (\lambda Y_i)] \le \exp (C^2 \sigma ^2 \lambda ^2), ~ \forall |\lambda | \le 1/( C \sigma ). \) Using the previous observation, we have
\[ \exp \big (\lambda \, \mathbb {E}[\max _i Y_i]\big ) \le \mathbb {E}\big [\exp \big (\lambda \max _i Y_i\big )\big ] = \mathbb {E}\big [\max _i \exp (\lambda Y_i)\big ] \le \textstyle \sum _{i=1}^n \mathbb {E}[\exp (\lambda Y_i)] \le n \exp (C^2 \sigma ^2 \lambda ^2), \quad \forall \, 0 < \lambda \le 1/(C \sigma ), \]
which implies \( \mathbb {E}[\max _i Y_i] \le \log n / \lambda + C^2 \sigma ^2 \lambda , ~ \forall \, 0 < \lambda \le 1/(C \sigma ). \) Choosing \(\lambda = 1/(C \sigma )\), we obtain \( \mathbb {E}\left[ \max _i Y_i\right] \le C \sigma (\log n + 1) . \) By combining this relation with the definition of \(Y_i\), we conclude that \( \mathbb {E}[\max _i X_i] \le \mathbb {E}[\max _i Y_i ]+ v \le C \sigma (\log n + 1) + v. \)
\(\square \)
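As a quick numerical illustration of Lemma 25 (not part of the original argument), the following Python sketch estimates \(\mathbb {E}[\max _i X_i]\) for \(n\) i.i.d. Exp(1) variables, a standard sub-exponential family with mean \(v = 1\) and \(\psi _1\)-norm of order one, and shows that the empirical maximum tracks the \(\log n\) growth predicted by the bound \(C \sigma (\log n + 1) + v\); the sample sizes and number of trials are arbitrary choices.

```python
import numpy as np

# Hedged illustration of Lemma 25 with X_i ~ Exp(1): sub-exponential,
# E[X_i] = 1 and psi_1-norm of order one, so the lemma predicts
# E[max_i X_i] <= C*(log n + 1) + 1 for some absolute constant C.
rng = np.random.default_rng(0)
trials = 1000  # number of Monte Carlo repetitions (arbitrary)

for n in [10, 100, 1000, 10000]:
    samples = rng.exponential(scale=1.0, size=(trials, n))
    emp_max = samples.max(axis=1).mean()          # empirical E[max_i X_i]
    print(f"n = {n:6d}   E[max] ~ {emp_max:7.3f}   log(n) + 1 = {np.log(n) + 1:7.3f}")
```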
Proposition 7
For \(\delta ^{k} := Q^{\pi _k, \xi _k} - Q^{\pi _k} \in \mathbb {R}^{|{{\mathcal {S}}} | \times |{{\mathcal {A}}} |}\), we have
\[ \mathbb {E}\big [\Vert \delta ^k\Vert _\infty ^2\big ] \le \tfrac{\kappa ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \Big [ \tfrac{\log (|{{\mathcal {S}}}||{{\mathcal {A}}}|) + 1}{M_k} + \gamma ^{2 T_k}\Big ], \]
where \(\kappa >0\) denotes an absolute constant.
Proof
To proceed, we denote \(\delta ^{k}_{s,a} := Q^{\pi _k, \xi _k}(s,a) - Q^{\pi _k}(s,a) \), and hence \(\Vert \delta ^k\Vert _\infty = \max _{s,a} |\delta ^k_{s,a}|\). Note that, by definition, for each \((s, a)\) pair we have \(M_k\) independent trajectories of length \(T_k\) starting from \((s, a)\). Let us denote \(Z_i := \sum _{t = 0}^{T_k - 1} \gamma ^t \left[ c(s_t^i, a_t^i) + h^{\pi _k}(s_t^i) \right] \), \(i = 1, \ldots , M_k\). Hence,
\[ \delta ^k_{s,a} = \tfrac{1}{M_k} \textstyle \sum _{i=1}^{M_k} Z_i - Q^{\pi _k}(s,a). \]
Since the random variables \(Z_i - Q^{\pi _k}(s,a)\), \(i = 1, \ldots , M_k\), are independent, it is immediate to see that \(Y_{s,a} := (\delta ^k_{s,a})^2\) is sub-exponential with \(\left\Vert Y_{s,a}\right\Vert _{\psi _1} \le \frac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k}\). Also note that
\[ \mathbb {E}[Y_{s,a}] = \mathbb {E}\big [(\delta ^k_{s,a})^2\big ] \le \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}. \]
Thus, in view of Lemma 25 applied with \(n = |{{\mathcal {S}}}||{{\mathcal {A}}}|\), \(\sigma = \tfrac{ ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k}\), and \(v = \tfrac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2 M_k} + \frac{({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \gamma ^{2T_k}\), we conclude that
\[ \mathbb {E}\big [\Vert \delta ^k\Vert _\infty ^2\big ] = \mathbb {E}\big [\max _{s,a} Y_{s,a}\big ] \le C \sigma \big (\log (|{{\mathcal {S}}}||{{\mathcal {A}}}|) + 1\big ) + v \le \tfrac{\kappa ({\overline{c}} + {\overline{h}})^2}{(1-\gamma )^2} \Big [ \tfrac{\log (|{{\mathcal {S}}}||{{\mathcal {A}}}|) + 1}{M_k} + \gamma ^{2 T_k}\Big ]. \]
\(\square \)
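For concreteness, here is a minimal Python sketch (not from the paper) of the truncated Monte Carlo estimator analyzed above: each \(Z_i\) is the discounted sum of \(c(s_t, a_t) + h^{\pi _k}(s_t)\) along an independent length-\(T_k\) trajectory started from \((s,a)\), and the \(M_k\) values are averaged. The simulator interface (`env.reset_to`, `env.step`) and the `policy`/`h` callables are assumptions made purely for illustration.

```python
import numpy as np

def estimate_Q(env, policy, h, s, a, gamma, M, T, rng):
    """Truncated Monte Carlo estimate of the regularized Q-function at (s, a).

    Returns (1/M) * sum_i Z_i with
        Z_i = sum_{t=0}^{T-1} gamma^t [ c(s_t^i, a_t^i) + h(s_t^i) ],
    each Z_i computed from an independent trajectory of length T started at (s, a).
    The `env`, `policy`, and `h` interfaces here are illustrative assumptions only.
    """
    z_values = []
    for _ in range(M):
        s_t, a_t = s, a                      # every trajectory starts from the same (s, a)
        env.reset_to(s_t)
        z = 0.0
        for t in range(T):
            cost, s_next = env.step(s_t, a_t)     # observe c(s_t, a_t) and the next state
            z += gamma**t * (cost + h(s_t))       # accumulate the discounted regularized cost
            s_t = s_next
            a_t = policy.sample(s_t, rng)         # a_{t+1} ~ pi(.|s_{t+1})
        z_values.append(z)
    return float(np.mean(z_values))
```

Truncating at length \(T\) introduces a bias of order \(\gamma ^{T}({\overline{c}} + {\overline{h}})/(1-\gamma )\), which is the source of the \(\gamma ^{2T_k}\) term in the choice of \(v\) above.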
Appendix B: Bias for conditional temporal difference methods
Proof of Lemma 18
For simplicity, let us denote \({\bar{\theta }}_t \equiv \mathbb {E}[\theta _t]\), \(\zeta _t \equiv (\zeta _t^1, \ldots , \zeta _t^\alpha )\) and \(\zeta _{\lceil t\rceil } \equiv (\zeta _1, \ldots , \zeta _t)\). Also let us denote \(\delta ^F_t := F^\pi (\theta _t) - \mathbb {E}[{\tilde{F}}^\pi (\theta _t,\zeta _t^\alpha )|\zeta _{\lceil t-1\rceil }]\) and \({\bar{\delta }}^F_t = \mathbb {E}_{\zeta _{\lceil t-1\rceil }}[\delta ^F_t]\). It follows from Jensen's inequality and Lemma 17 that
Also by Jensen’s inequality, Lemma 16 and Lemma 17, we have
Notice that
Now, conditioning on \(\zeta _{\lceil t-1\rceil }\) and taking expectation w.r.t. \(\zeta _t\) in (5.11), we have \( \mathbb {E}[\theta _{t+1}|\zeta _{\lceil t-1\rceil }] = \theta _t - \beta _t F^\pi (\theta _t) + \beta _t \delta ^F_t. \) Taking a further expectation w.r.t. \(\zeta _{\lceil t-1\rceil }\) and using the linearity of \(F^\pi \), we have \( {\bar{\theta }}_{t+1} = {\bar{\theta }}_t - \beta _t F^\pi ({\bar{\theta }}_t) + \beta _t {\bar{\delta }}^F_t, \) which implies
The above inequality, together with (7.1), (7.2) and the facts that
then implies that
where the last inequality follows from
due to the selection of \(\beta _t\) in (5.13). Now let us denote \( \varGamma _t := {\left\{ \begin{array}{ll} 1 & t = 0,\\ (1 - \tfrac{3}{t + t_0 - 1})\varGamma _{t-1} & t \ge 1, \end{array}\right. } \) or equivalently, \(\varGamma _t = \tfrac{(t_0 - 1) (t_0 - 2) (t_0 - 3)}{(t + t_0 - 1) (t + t_0 - 2) (t + t_0 - 3)}\). Dividing both sides of (7.3) by \(\varGamma _t\) and taking the telescoping sum, we have
Noting that
we conclude
from which the result follows since \({\bar{\theta }}_1 = \theta _1\). \(\square \)
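As a small sanity check (not part of the original proof), the recursion defining \(\varGamma _t\) can be verified numerically against its closed form; the value \(t_0 = 6\) below is an arbitrary choice with \(t_0 > 3\), so that every factor \(1 - \tfrac{3}{t + t_0 - 1}\) stays positive.

```python
# Verify that Gamma_0 = 1, Gamma_t = (1 - 3/(t + t0 - 1)) * Gamma_{t-1} matches
# the closed form (t0-1)(t0-2)(t0-3) / ((t+t0-1)(t+t0-2)(t+t0-3)).
t0 = 6  # arbitrary choice with t0 > 3 (purely for illustration)
gamma_rec = 1.0
for t in range(1, 21):
    gamma_rec *= 1.0 - 3.0 / (t + t0 - 1)
    closed = (t0 - 1) * (t0 - 2) * (t0 - 3) / ((t + t0 - 1) * (t + t0 - 2) * (t + t0 - 3))
    assert abs(gamma_rec - closed) < 1e-12, (t, gamma_rec, closed)
print("recursion matches the closed form for t = 1, ..., 20")
```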
Cite this article
Lan, G. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Math. Program. 198, 1059–1106 (2023). https://doi.org/10.1007/s10107-022-01816-5