A backward value iteration solution will be presented that follows
naturally from the method given in Section 10.2.1. For
notational convenience, let the first stage be designated as $k = 0$ so
that $\alpha^{k-1}$ may be replaced by $\alpha^{k}$. In the
probabilistic case, the expected optimal cost-to-go is
\[
G^*(x_0) = \lim_{K \to \infty} \left( \min_{\tilde{u}} \left\{
E_{\tilde{\theta}} \left[ \sum_{k=0}^{K} \alpha^{k} \,
l(x_k,u_k,\theta_k) \right] \right\} \right) .
\]
Since the probabilistic case is more common, it will be covered here. The nondeterministic version is handled in a similar way (see Exercise 17). As before, backward value iterations will be performed because they are simpler to express. The discount factor causes a minor complication that must be fixed to make the dynamic programming recurrence work properly.
One difficulty is that the stage index now appears in the cost
function, in the form of $\alpha^{k}$. This means that the
shift-invariant property from Section 2.3.1 is no longer
preserved. We must therefore be careful about assigning stage
indices. This is a problem because for backward value iteration the
final stage index has been unknown and unimportant.
Consider a sequence of discounted decision-making problems, obtained by
increasing the maximum stage index: $K = 0$, $K = 1$, $K = 2$, $K = 3$.
Look at the neighboring cost expressions in Figure 10.10.
What is the difference between finding the optimal cost-to-go for the
$(K+1)$-stage problem and the $K$-stage problem? In Figure
10.10, the last four terms of the cost for $K = 3$ can be
obtained by multiplying all terms for $K = 2$ by $\alpha$ and adding a
new term, $l(x_0,u_0,\theta_0)$. The only difference is that the stage
indices need to be shifted by one on each $l$ that was borrowed from the
$K = 2$ case. In general, the optimal costs of a $K$-stage optimization
problem can serve as the optimal costs of the $(K+1)$-stage problem if
they are first multiplied by $\alpha$. The $(K+1)$-stage optimization
problem can be solved by optimizing over the sum of the first-stage
cost plus the optimal cost for the $K$-stage problem, discounted by
$\alpha$.
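This multiply-and-shift relationship can be checked numerically. The sketch below uses a made-up sequence of per-stage costs and an assumed discount factor $\alpha = 0.9$; it verifies that the full discounted sum equals the new first-stage cost plus $\alpha$ times the discounted sum of the remaining terms.

```python
# Numeric check of the neighboring-cost relationship: the (K+1)-stage
# discounted sum equals the new first-stage cost plus alpha times the
# K-stage discounted sum formed from the remaining (shifted) terms.
# The per-stage costs are invented purely for illustration.
alpha = 0.9
stage_costs = [4.0, 2.0, 7.0, 1.0]  # l at stages 0, 1, 2, 3 (hypothetical)

def discounted_sum(costs, alpha):
    """Return the sum over k of alpha^k * costs[k]."""
    return sum(alpha**k * c for k, c in enumerate(costs))

full = discounted_sum(stage_costs, alpha)         # the K = 3 cost
shifted = discounted_sum(stage_costs[1:], alpha)  # the K = 2 cost (shifted)
assert abs(full - (stage_costs[0] + alpha * shifted)) < 1e-12
```

The same identity, applied to optimal costs rather than a fixed cost sequence, is what allows the $K$-stage solution to be reused for the $(K+1)$-stage problem.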
This can be derived using straightforward dynamic programming
arguments as follows. Suppose that $K$ is fixed. The cost-to-go can
be expressed recursively for $k$ from $0$ to $K$ as
\[
G^*_k(x_k) = \min_{u_k} \left\{ E_{\theta_k}\!\left[
\alpha^{k} \, l(x_k,u_k,\theta_k) + G^*_{k+1}(x_{k+1})
\right] \right\} , \tag{10.72}
\]
in which $x_{k+1} = f(x_k,u_k,\theta_k)$ and the recursion is started
with $G^*_{K+1}(x_{K+1}) = 0$.
The idea of using neighboring cost values as shown in Figure
10.10 can be applied by making a notational change. Let
$J^*_{K-k+1}(x_k) = \alpha^{-k} G^*_k(x_k)$. This reverses the
direction of the stage indices to avoid specifying the final stage
and also scales by $\alpha^{-k}$ to correctly compensate for the index
change. Substitution into (10.72) yields
\[
J^*_{K-k+1}(x_k) = \min_{u_k} \left\{ E_{\theta_k}\!\left[
l(x_k,u_k,\theta_k) + \alpha J^*_{K-k}(x_{k+1})
\right] \right\} . \tag{10.74}
\]
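The intermediate step of this substitution can be written out explicitly. Under the change of variables $J^*_{K-k+1}(x_k) = \alpha^{-k} G^*_k(x_k)$, with $x_{k+1} = f(x_k,u_k,\theta_k)$:

```latex
% Substituting G^*_k(x_k) = \alpha^{k} J^*_{K-k+1}(x_k) into the
% finite-horizon recurrence gives
\alpha^{k} J^*_{K-k+1}(x_k) =
  \min_{u_k} \left\{ E_{\theta_k}\!\left[
    \alpha^{k} \, l(x_k,u_k,\theta_k)
    + \alpha^{k+1} J^*_{K-k}(x_{k+1})
  \right] \right\} .
% Dividing both sides by \alpha^{k} > 0, which is a constant with
% respect to u_k and therefore does not affect the minimization,
% removes the stage-dependent weight from the recurrence.
```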
Value iteration proceeds by first letting $J^*_0(x) = 0$ for all
$x \in X$. Successive cost-to-go functions are computed by iterating
(10.74) over the state space. Unlike the value iterations under the
cycle-avoiding assumptions of Section 10.2.1, the
convergence here is usually asymptotic due to the infinite horizon. The
discounting gradually causes the cost differences to diminish until
they are within the desired tolerance. The stationary form of the
dynamic programming recurrence, which is obtained in the limit as $K$
tends to infinity, is
\[
J^*(x) = \min_{u} \left\{ E_{\theta}\!\left[
l(x,u,\theta) + \alpha J^*(f(x,u,\theta))
\right] \right\} .
\]
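As a concrete illustration, here is a minimal sketch of discounted value iteration in Python, run on a small two-state, two-action probabilistic problem whose costs and transition probabilities are invented for this example (they are not from the text). It iterates the recurrence until successive cost-to-go values agree to within a tolerance, as described above.

```python
# Discounted value iteration on a hypothetical two-state problem.
# cost[(x, u)] plays the role of l(x, u); P[(x, u)] gives the
# distribution over next states f(x, u, theta) induced by nature.
alpha = 0.9  # discount factor (assumed)
states = [0, 1]
actions = ['a', 'b']
cost = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5, (1, 'b'): 1.5}
P = {
    (0, 'a'): [(0, 0.8), (1, 0.2)],
    (0, 'b'): [(1, 1.0)],
    (1, 'a'): [(0, 0.5), (1, 0.5)],
    (1, 'b'): [(0, 1.0)],
}

def value_iteration(tol=1e-9):
    J = {x: 0.0 for x in states}  # J*_0 = 0 for all x
    while True:
        J_next = {
            x: min(cost[(x, u)]
                   + alpha * sum(p * J[xn] for xn, p in P[(x, u)])
                   for u in actions)
            for x in states
        }
        # Discounting shrinks the successive differences geometrically,
        # so this loop stops once they fall within the tolerance.
        if max(abs(J_next[x] - J[x]) for x in states) < tol:
            return J_next
        J = J_next

J_star = value_iteration()
```

At the returned fixed point, $J^*$ satisfies the stationary recurrence to within the tolerance; extracting the minimizing action at each state gives an approximately optimal stationary plan.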
Steven M LaValle 2020-08-14