So far two different kinds of dynamic programming have been covered. The value-iteration method of Section 2.3.2 involves repeated computations over the entire state space. Dijkstra's algorithm from Section 2.2.2 flows only once through the state space, but with the additional overhead of maintaining which states are alive.
Dijkstra's algorithm can be derived by focusing on the forward value iterations, as in Example 2.5, and identifying exactly where the ``interesting'' changes occur. Recall that for Dijkstra's algorithm, it was assumed that all costs are nonnegative. For any states that are not reachable, their values remain at infinity. They are precisely the unvisited states. States for which the optimal cost-to-come has already become stationary are dead. For the remaining states, an initial cost is obtained, but this cost may be lowered one or more times until the optimal cost is obtained. All states for which the cost is finite, but possibly not optimal, are in the queue, .
After understanding value iteration, it is easier to understand why Dijkstra's form of dynamic programming correctly computes optimal solutions. It is clear that the unvisited states will remain at infinity in both algorithms because no plan has reached them. It is helpful to consider the forward value iterations in Example 2.5 for comparison. In a sense, Dijkstra's algorithm is very much like the value iteration, except that it efficiently maintains the set of states within which cost-to-go values can change. It correctly inserts any states that are reached for the first time, changing their cost-to-come from infinity to a finite value. The values are changed in the same manner as in the value iterations. At the end of both algorithms, the resulting values correspond to the stationary, optimal cost-to-come, .
If Dijkstra's algorithm seems so clever, then why have we spent time covering the value-iteration method? For some problems it may become too expensive to maintain the sorted queue, and value iteration could provide a more efficient alternative. A more important reason is that value iteration extends easily to a much broader class of problems. Examples include optimal planning over continuous state spaces (Sections 8.5.2 and 14.5), stochastic optimal planning (Section 10.2), and computing dynamic game equilibria (Section 10.5). In some cases, it is still possible to obtain a Dijkstra-like algorithm by focusing the computation on the ``interesting'' region; however, as the model becomes more complicated, it may be inefficient or impossible in practice to maintain this region. Therefore, it is important to have a good understanding of both algorithms to determine which is most appropriate for a given problem.
Dijkstra's algorithm belongs to a broader family of label-correcting algorithms, which all produce optimal plans by making small modifications to the general forward-search algorithm in Figure 2.4. Figure 2.16 shows the resulting algorithm. The main difference is to allow states to become alive again if a better cost-to-come is found. This enables other cost-to-come values to be improved accordingly. This is not important for Dijkstra's algorithm and search because they only need to visit each state once. Thus, the algorithms in Figures 2.4 and 2.16 are essentially the same in this case. However, the label-correcting algorithm produces optimal solutions for any sorting of , including FIFO (breadth first) and LIFO (depth first), as long as is finite. If is not finite, then the issue of systematic search dominates because one must guarantee that states are revisited sufficiently many times to guarantee that optimal solutions will eventually be found.
Another important difference between label-correcting algorithms and the standard forward-search model is that the label-correcting approach uses the cost at the goal state to prune away many candidate paths; this is shown in line 7. Thus, it is only formulated to work for a single goal state; it can be adapted to work for multiple goal states, but performance degrades. The motivation for including in line 7 is that there is no need to worry about improving costs at some state, , if its new cost-to-come would be higher than ; there is no way it could be along a path that improves the cost to go to . Similarly, is not inserted in line 10 because there is no need to consider plans that have as an intermediate state. To recover the plan, either pointers can be stored from to each time an update is made in line 7, or the final, optimal cost-to-come, , can be used to recover the actions using (2.20).
Steven M LaValle 2020-08-14