The simulation method is based on averaging the information gained
incrementally from samples. Suppose that you receive a sequence of
costs, ,
,
, and would like to incrementally compute
their average. You are not told the total number of samples in
advance, and at any point you are required to report the current
average. Let
denote the average of the first
samples,
![]() |
(10.80) |
Now consider the problem of estimating the expected cost-to-go,
, at every
for some fixed plan,
. If
and the costs
were known, then it could be
computed by solving
![]() |
(10.83) |
From each , suppose that
trials are conducted, and
the resulting costs to get to the goal are recorded and averaged.
Each trial is an iterative process in which
selects the
action, and the simulator indicates the next state and its incremental
cost. Once the goal state is reached, the costs are totaled to yield
the measured cost-to-go for that trial (this assumes that
for all
). If
denotes this total cost at
trial
, then the average,
, over
trials provides an
estimate of
. As
tends to infinity, we expect
to converge to
. The update formula (10.83)
can be conveniently used to maintain the improving sequence of
cost-to-go estimates. Let
denote the current estimate of
. The update formula based on (10.83) can be
expressed as
It turns out that a single trial can actually yield update values for
multiple states [930,96]. Suppose that a trial is
performed from that results in the sequence
,
,
,
,
,
,
of visited states. For every
state,
, in the sequence, a cost-to-go value can be measured by
recording the cost that was accumulated from
to
:
![]() |
(10.85) |