Modeling dynamic systems subject to stochastic disturbances in order to derive a control policy is a ubiquitous task in engineering. However, in some instances obtaining a model of a system may be impractical or impossible. Alternative approaches have been developed using a simulation-based stochastic framework, in which the system interacts with its environment in real time and obtains information that can be processed to produce an optimal control policy. In this context, the problem of developing a policy for controlling the system’s behavior is formulated as a sequential decision-making problem under uncertainty. This paper considers the problem of deriving a control policy in real time for a dynamic system with unknown dynamics, formulated as such a sequential decision-making problem under uncertainty. The evolution of the system is modeled as a controlled Markov chain. A new state-space representation model and a learning mechanism are proposed that can be used to improve system performance over time. The major difference between existing methods and the proposed learning model is that the latter utilizes an evaluation function that considers the expected cost achievable by state transitions forward in time. The model allows decision-making based on gradually enhanced knowledge of the system response as it transitions from one state to another, in conjunction with the actions taken at each state. The proposed model is demonstrated on the single cart-pole balancing problem and a vehicle cruise-control problem.
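The abstract does not give the model's equations, so the following is only a minimal illustrative sketch of the general idea it describes: estimating an unknown controlled Markov chain from real-time observations and ranking actions with an evaluation function that adds the expected cost reachable by transitions one step forward in time. The class name, the parameters `gamma` and `epsilon`, and the update rule are all assumptions for illustration, not the paper's actual formulation.

```python
import random
from collections import defaultdict

class ForwardLookingLearner:
    """Illustrative sketch: learn transition frequencies and costs of an
    unknown controlled Markov chain from observed interaction, and pick
    actions by an evaluation function that combines the estimated
    immediate cost with the expected cost of the states reachable one
    transition forward in time."""

    def __init__(self, states, actions, gamma=0.95, epsilon=0.1):
        self.states, self.actions = states, actions
        self.gamma, self.epsilon = gamma, epsilon          # assumed tuning parameters
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
        self.cost_sum = defaultdict(float)                    # (s, a) -> accumulated cost
        self.value = defaultdict(float)                       # s -> estimated cost-to-go

    def _expected_cost(self, s, a):
        """Estimated immediate cost plus expected forward cost of (s, a)."""
        visits = sum(self.counts[(s, a)].values())
        if visits == 0:
            return 0.0  # optimistic default for unexplored state-action pairs
        avg_cost = self.cost_sum[(s, a)] / visits
        forward = sum(n / visits * self.value[s2]
                      for s2, n in self.counts[(s, a)].items())
        return avg_cost + self.gamma * forward

    def act(self, s):
        """Choose the action with the lowest evaluation, exploring occasionally."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return min(self.actions, key=lambda a: self._expected_cost(s, a))

    def observe(self, s, a, cost, s_next):
        """Update the empirical model and the cost-to-go of the visited state."""
        self.counts[(s, a)][s_next] += 1
        self.cost_sum[(s, a)] += cost
        self.value[s] = min(self._expected_cost(s, b) for b in self.actions)
```

In a cart-pole or cruise-control setting, `states` would be a discretization of the continuous state space and `cost` a penalty on deviation from the balanced pole or the target speed.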
