A backward value iteration solution will be presented that follows naturally from the method given in Section 10.2.1. For notational convenience, let the first stage be designated as $k = 0$ so that the discount factor $\alpha^{k-1}$ may be replaced by $\alpha^k$. In the probabilistic case, the expected optimal cost-to-go is

$$G^*(x) = \lim_{K \to \infty} \Bigl( \min_{\tilde{u}} \Bigl\{ E_{\tilde{\theta}} \Bigl[ \sum_{k=0}^{K} \alpha^k \, l(x_k, u_k, \theta_k) \Bigr] \Bigr\} \Bigr),$$

in which $\tilde{u}$ and $\tilde{\theta}$ denote the action history and nature history, respectively.
Since the probabilistic case is more common, it will be covered here. The nondeterministic version is handled in a similar way (see Exercise 17). As before, backward value iterations will be performed because they are simpler to express. The discount factor causes a minor complication that must be fixed to make the dynamic programming recurrence work properly.
One difficulty is that the stage index now appears in the cost function, in the form of $\alpha^k$. This means that the shift-invariant property from Section 2.3.1 is no longer preserved. We must therefore be careful about assigning stage indices. This is a problem because in backward value iteration the final stage index has so far been unknown and unimportant.
Consider a sequence of discounted decision-making problems, obtained by increasing the maximum stage index: $K = 0$, $K = 1$, $K = 2$, $\ldots$. Look at the neighboring cost expressions in Figure 10.10. What is the difference between finding the optimal cost-to-go for the $(K+1)$-stage problem and for the $K$-stage problem? In Figure 10.10 the last four terms of the cost for $K = 4$ can be obtained by multiplying all of the terms for $K = 3$ by $\alpha$ and adding a new term, $l(x_0, u_0)$. The only difference is that the stage indices need to be shifted by one on each $l$ that was borrowed from the $K = 3$ case. In general, the optimal costs of a $K$-stage optimization problem can serve as the optimal costs of the $(K+1)$-stage problem if they are first multiplied by $\alpha$. The $(K+1)$-stage optimization problem can then be solved by optimizing over the sum of the first-stage cost plus the optimal cost for the $K$-stage problem, discounted by $\alpha$.
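The scaling pattern can be written out explicitly. The following worked identity (not part of Figure 10.10 itself) shows the $K = 4$ cost decomposed into a new first-stage term plus $\alpha$ times a sum that has exactly the form of the $K = 3$ cost, with each stage index raised by one:

$$\underbrace{l(x_0, u_0)}_{\text{new term}} \; + \; \alpha \underbrace{\bigl( l(x_1, u_1) + \alpha \, l(x_2, u_2) + \alpha^2 l(x_3, u_3) + \alpha^3 l(x_4, u_4) \bigr)}_{K = 3 \text{ cost, stage indices shifted by one}} \; = \; \sum_{k=0}^{4} \alpha^k \, l(x_k, u_k).$$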
This can be derived using straightforward dynamic programming arguments as follows. Suppose that $K$ is fixed. The cost-to-go can be expressed recursively for $k$ from $0$ to $K$ as

$$G^*_k(x_k) = \min_{u_k} \Bigl\{ E_{\theta_k} \Bigl[ \alpha^k \, l(x_k, u_k, \theta_k) + G^*_{k+1}(x_{k+1}) \Bigr] \Bigr\}, \qquad (10.72)$$

in which $x_{k+1} = f(x_k, u_k, \theta_k)$ and the recursion starts from $G^*_{K+1}(x_{K+1}) = 0$.
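As an illustration, the finite-horizon recursion (10.72) can be sketched in code. The two-state model below (states, actions, nature probabilities, and stage costs) is entirely hypothetical, chosen only so that the recursion has something to run on:

```python
ALPHA = 0.9                       # discount factor alpha (assumed value)
X = [0, 1]                        # hypothetical state space
U = [0, 1]                        # hypothetical action space
THETA = [(0, 0.7), (1, 0.3)]      # nature action and its probability

def f(x, u, th):
    """Hypothetical state transition x_{k+1} = f(x_k, u_k, theta_k)."""
    return (x + u + th) % 2

def l(x, u, th):
    """Hypothetical single-stage cost l(x_k, u_k, theta_k)."""
    return 1.0 + x + 0.5 * u + th

def finite_horizon(K):
    """Compute G*_0 by the backward recursion: starting from
    G*_{K+1} = 0, apply the stage-k update, whose cost term is
    weighted by ALPHA**k, for k = K down to 0."""
    G = {x: 0.0 for x in X}                    # G*_{K+1} = 0
    for k in range(K, -1, -1):
        G = {x: min(sum(p * (ALPHA**k * l(x, u, th) + G[f(x, u, th)])
                        for th, p in THETA)
                    for u in U)
             for x in X}
    return G                                   # expected optimal cost-to-go G*_0
```

For $K = 0$ this reduces to minimizing the expected single-stage cost over the action set.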
The idea of using neighboring cost values as shown in Figure 10.10 can be applied by making a notational change. Let $J^*_{K+1-k}(x_k) = \alpha^{-k} G^*_k(x_k)$. This reverses the direction of the stage indices to avoid specifying the final stage and also scales by $\alpha^{-k}$ to correctly compensate for the index change. Substitution into (10.72) yields

$$\alpha^k J^*_{K+1-k}(x_k) = \min_{u_k} \Bigl\{ E_{\theta_k} \Bigl[ \alpha^k \, l(x_k, u_k, \theta_k) + \alpha^{k+1} J^*_{K-k}(x_{k+1}) \Bigr] \Bigr\}. \qquad (10.73)$$

Dividing by $\alpha^k$ and letting $i = K+1-k$ produces

$$J^*_i(x_k) = \min_{u_k} \Bigl\{ E_{\theta_k} \Bigl[ l(x_k, u_k, \theta_k) + \alpha J^*_{i-1}(x_{k+1}) \Bigr] \Bigr\}. \qquad (10.74)$$
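The change of variables can be checked numerically: with $i = K+1-k$ and $k = 0$, the $(K+1)$-th iterate of the reindexed recurrence, started from $J^*_0 = 0$, must equal $G^*_0$ for the $K$-stage problem. A sketch of this check, using a hypothetical two-state model:

```python
ALPHA = 0.9                       # discount factor alpha (assumed value)
X = [0, 1]                        # hypothetical state space
U = [0, 1]                        # hypothetical action space
THETA = [(0, 0.7), (1, 0.3)]      # nature action and its probability

def f(x, u, th): return (x + u + th) % 2         # hypothetical transition
def l(x, u, th): return 1.0 + x + 0.5 * u + th   # hypothetical stage cost

def G0(K):
    """G*_0 via the stage-indexed recursion, whose stage-k cost term
    is weighted by ALPHA**k, starting from G*_{K+1} = 0."""
    G = {x: 0.0 for x in X}
    for k in range(K, -1, -1):
        G = {x: min(sum(p * (ALPHA**k * l(x, u, th) + G[f(x, u, th)])
                        for th, p in THETA) for u in U) for x in X}
    return G

def J_iter(n):
    """n applications of the reindexed recurrence, from J*_0 = 0."""
    J = {x: 0.0 for x in X}
    for _ in range(n):
        J = {x: min(sum(p * (l(x, u, th) + ALPHA * J[f(x, u, th)])
                        for th, p in THETA) for u in U) for x in X}
    return J
```

Comparing `J_iter(K + 1)` against `G0(K)` for several values of $K$ confirms the identity up to floating-point error.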
Value iteration proceeds by first letting $J^*_0(x) = 0$ for all $x \in X$. Successive cost-to-go functions are computed by iterating (10.74) over the state space. Unlike the value iterations of Section 10.2.1, which terminate under the cycle-avoiding assumptions given there, the convergence here is usually asymptotic due to the infinite horizon. The discounting gradually causes the cost-to-go differences between successive iterations to diminish until they fall within the desired tolerance. The stationary form of the dynamic programming recurrence, which is obtained in the limit as $i$ tends to infinity, is

$$J^*(x) = \min_{u} \Bigl\{ E_{\theta} \Bigl[ l(x, u, \theta) + \alpha J^*(f(x, u, \theta)) \Bigr] \Bigr\}.$$
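A minimal sketch of this iteration, again on a hypothetical two-state model, stops once successive cost-to-go functions agree to within a tolerance:

```python
ALPHA = 0.9                       # discount factor alpha (assumed value)
X = [0, 1]                        # hypothetical state space
U = [0, 1]                        # hypothetical action space
THETA = [(0, 0.7), (1, 0.3)]      # nature action and its probability

def f(x, u, th): return (x + u + th) % 2         # hypothetical transition
def l(x, u, th): return 1.0 + x + 0.5 * u + th   # hypothetical stage cost

def value_iteration(tol=1e-8):
    """Iterate J*_i(x) = min_u E[ l(x,u,theta) + ALPHA * J*_{i-1}(x') ]
    from J*_0 = 0 until the maximum change over X drops below tol."""
    J = {x: 0.0 for x in X}                      # J*_0 = 0 for all x in X
    while True:
        Jn = {x: min(sum(p * (l(x, u, th) + ALPHA * J[f(x, u, th)])
                         for th, p in THETA) for u in U) for x in X}
        if max(abs(Jn[x] - J[x]) for x in X) < tol:
            return Jn                            # approximate fixed point J*
        J = Jn
```

Because $\alpha < 1$, each iteration is a contraction with factor $\alpha$, so the loop terminates, and the returned values approximately satisfy the stationary recurrence.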
Steven M LaValle 2020-08-14