Abstract

The problem of reinforcement learning (RL) in an unknown nonlinear dynamical system is equivalent to the search for an optimal feedback law using only data from simulations/rollouts of the dynamical system. Most RL techniques search over a complex global nonlinear feedback parametrization, which leads to long training times and high variance. Instead, we advocate searching over a local feedback representation consisting of an open-loop control sequence and an associated optimal linear feedback law that is completely determined by the open-loop sequence. We show that this alternative approach trains far more efficiently, that the solutions obtained are globally optimal and repeatable with negligible variance (and hence reliable), and that the resulting closed-loop performance is superior to that of state-of-the-art global RL techniques. Finally, replanning whenever required, which is feasible because the local solution is fast and reliable, allows us to recover the optimal global feedback law.
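To make the local feedback representation concrete, the following is a minimal sketch in Python of the two-stage idea: first optimize an open-loop control sequence using only rollouts, then compute time-varying LQR gains about the resulting nominal trajectory, so the closed-loop law is u_t = u_bar_t + K_t (x_t - x_bar_t). The pendulum model, cost weights, step sizes, and the finite-difference gradient/linearization scheme are illustrative assumptions, not the authors' exact algorithm.

import numpy as np

# Toy nonlinear system: a damped pendulum, x = [theta, theta_dot] (assumed model).
dt, T, n, m = 0.05, 60, 2, 1

def step(x, u):
    th, thd = x
    thdd = -9.81 * np.sin(th) - 0.1 * thd + u[0]
    return np.array([th + dt * thd, thd + dt * thdd])

def rollout_cost(x0, U):
    x, c = x0, 0.0
    for u in U:
        c += x @ x + 0.1 * (u @ u)
        x = step(x, u)
    return c + 10.0 * (x @ x)

# Stage 1: open-loop search. Gradient descent on the rollout cost, with
# gradients estimated purely from rollouts (finite differences), so only
# simulation data is used.
def open_loop(x0, iters=300, eps=1e-4, lr=1e-2):
    U = np.zeros((T, m))
    for _ in range(iters):
        base, g = rollout_cost(x0, U), np.zeros_like(U)
        for t in range(T):
            for j in range(m):
                Up = U.copy(); Up[t, j] += eps
                g[t, j] = (rollout_cost(x0, Up) - base) / eps
        U -= lr * g
    return U

# Stage 2: the linear feedback law determined by the open-loop sequence.
# Linearize the dynamics along the nominal trajectory (finite differences
# again) and run the time-varying LQR backward pass for the gains K_t.
def lqr_gains(x0, U, eps=1e-4):
    X = [x0]
    for u in U:
        X.append(step(X[-1], u))
    Q, R, P = np.eye(n), 0.1 * np.eye(m), 10.0 * np.eye(n)
    Ks = [None] * T
    for t in reversed(range(T)):
        A, B, fx = np.zeros((n, n)), np.zeros((n, m)), step(X[t], U[t])
        for i in range(n):
            dx = np.zeros(n); dx[i] = eps
            A[:, i] = (step(X[t] + dx, U[t]) - fx) / eps
        for j in range(m):
            du = np.zeros(m); du[j] = eps
            B[:, j] = (step(X[t], U[t] + du) - fx) / eps
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K   # Riccati recursion
        Ks[t] = K
    return X, Ks

# Closed loop from a perturbed initial state: u_t = u_bar_t + K_t (x_t - x_bar_t).
x0 = np.array([np.pi / 4, 0.0])
U = open_loop(x0)
X, Ks = lqr_gains(x0, U)
x = x0 + np.array([0.05, 0.0])
for t in range(T):
    x = step(x, U[t] + Ks[t] @ (x - X[t]))
print("terminal state under perturbation:", x)

In this sketch, replanning amounts to re-running both stages from the current state whenever the closed-loop trajectory drifts too far from the nominal, which is practical precisely because each local solve is fast and repeatable.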
