The value iterations of Section 10.2.1 work by iteratively updating cost-to-go values on the state space. The optimal plan can alternatively be obtained by iteratively searching in the space of plans. This leads to a method called policy iteration [84]; the term policy is synonymous with plan. The method will be explained for the case of probabilistic uncertainty, and it is assumed that $X$ is finite. With minor adaptations, a version for nondeterministic uncertainty can also be developed.
Policy iteration repeatedly requires computing the cost-to-go for a given plan, $\pi$. Recall the definition of $G_\pi$ from (10.32). First suppose that there are no uncertainties, and that the state transition equation is $x' = f(x,u)$. The dynamic programming equation (2.18) from Section 2.3.2 can be used to derive the cost-to-go for each state under the application of $\pi$. Make a copy of (2.18) for each $x \in X$, and instead of the min, use the given action $u = \pi(x)$, to yield

$$G_\pi(x) = l(x, \pi(x)) + G_\pi(f(x, \pi(x))) \qquad (10.50)$$
If there are $n$ states, (10.50) yields $n$ equations, each of which gives an expression of $G_\pi(x)$ for a different state. For the states in which $x \in X_G$, it is known that $G_\pi(x) = 0$. Now that this is known, the cost-to-go for all states from which $X_G$ can be reached in one stage can be computed using (10.50) with $G_\pi(f(x,\pi(x))) = 0$. Once these cost-to-go values are computed, another wave of values can be computed from states that can reach these in one stage. This process continues until the cost-to-go values are computed for all states. This is similar to the behavior of Dijkstra's algorithm.
This process of determining the cost-to-go should not seem too mysterious. Equation (10.50) indicates how the costs differ between neighboring states in the state transition graph. Since all of the differences are specified and an initial condition is given on $X_G$, all others can be derived by adding up the differences expressed in (10.50). Similar ideas appear in the Hamilton-Jacobi-Bellman equation and Pontryagin's minimum principle, which are covered in Section 15.2.
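To make the wave-like propagation concrete, here is a minimal sketch of deterministic plan evaluation via (10.50), working backward from the goal. The dictionary-based representation and the names (states, goal, plan, f, cost, evaluate_plan_deterministic) are illustrative assumptions, not notation from the text.

```python
# Sketch: deterministic policy evaluation by backward propagation of (10.50).
# All data structures (dicts keyed by state) are illustrative assumptions.

def evaluate_plan_deterministic(states, goal, plan, f, cost):
    """Compute G_pi(x) = cost(x, plan[x]) + G_pi(f(x, plan[x])) for all states.

    states: iterable of states;  goal: set of goal states
    plan:   dict state -> action; f: (state, action) -> next state
    cost:   (state, action) -> stage cost
    """
    G = {x: 0.0 for x in goal}          # initial condition: zero cost-to-go on the goal
    remaining = set(states) - set(goal)
    while remaining:
        progressed = False
        for x in list(remaining):
            x_next = f(x, plan[x])
            if x_next in G:             # successor's cost-to-go is already known
                G[x] = cost(x, plan[x]) + G[x_next]
                remaining.remove(x)
                progressed = True
        if not progressed:              # remaining states never reach the goal under this plan
            for x in remaining:
                G[x] = float('inf')
            break
    return G
```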
Now we turn to the case in which there are probabilistic uncertainties. The probabilistic analog of (2.18) is (10.49). For simplicity, consider the special case in which $l(x,u,\theta)$ does not depend on $\theta$, which results in

$$G_\pi(x) = l(x, \pi(x)) + \sum_{x' \in X} G_\pi(x') \, P(x' \mid x, \pi(x)) \qquad (10.53)$$
The cost-to-go $G_\pi(x)$ for each $x \in X$ under the application of $\pi$ can be determined by writing (10.53) for each state. Note that all quantities except $G_\pi$ are known. This means that if there are $n$ states, then there are $n$ linear equations and $n$ unknowns ($G_\pi(x)$ for each $x \in X$). The same was true when (10.50) was used, except the equations were much simpler. In the probabilistic setting, a system of $n$ linear equations must be solved to determine $G_\pi$. This may be performed using classical linear algebra techniques, such as singular value decomposition (SVD) [399,961].
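Concretely, if $P_\pi$ denotes the $n \times n$ matrix with entries $P(x' \mid x, \pi(x))$ and $l_\pi$ the vector of stage costs under $\pi$, then (10.53) reads $G_\pi = l_\pi + P_\pi G_\pi$, i.e., $(I - P_\pi) G_\pi = l_\pi$. The sketch below solves this system with NumPy; the array layout, the helper name evaluate_plan_probabilistic, and the way goal states are handled are assumptions made for illustration.

```python
import numpy as np

def evaluate_plan_probabilistic(P_pi, l_pi, goal_mask):
    """Solve (10.53) as the linear system (I - P_pi) G = l_pi.

    P_pi:      (n, n) array, P_pi[i, j] = P(x_j | x_i, pi(x_i))
    l_pi:      (n,) array of stage costs l(x_i, pi(x_i))
    goal_mask: (n,) boolean array marking goal states (G = 0 there)
    """
    n = len(l_pi)
    A = np.eye(n) - P_pi
    b = np.array(l_pi, dtype=float)
    # Encode the boundary condition G(x) = 0 at goal states by overwriting their equations.
    A[goal_mask, :] = 0.0
    A[goal_mask, goal_mask] = 1.0
    b[goal_mask] = 0.0
    return np.linalg.solve(A, b)   # an SVD-based least-squares solve could be used instead
```

Overwriting the goal rows simply builds the boundary condition $G_\pi(x) = 0$ for $x \in X_G$ directly into the linear system.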
Now that we have a method for evaluating the cost of a plan, the policy iteration method is straightforward, as specified in Figure 10.4. Note that in Step 3, the cost-to-go $G_\pi$, which was developed for one plan, $\pi$, is used to evaluate other plans. The result is the cost that will be obtained if a new action is tried in the first stage and then $\pi$ is used for all remaining stages. If a new action cannot reduce the cost, then $\pi$ must have already been optimal, because it means that (10.54) has become equivalent to the stationary dynamic programming equation, (10.49). If it is possible to improve $\pi$, then a new plan, $\pi'$, is obtained. The new plan must be strictly better than the previous plan, and there are only finitely many possible plans in total. Therefore, the policy iteration method converges after a finite number of iterations.
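The following sketch puts the two steps together in the spirit of Figure 10.4: repeatedly evaluate the current plan by solving the linear system, then improve it state by state, stopping when no action lowers the cost. It reuses the hypothetical evaluate_plan_probabilistic helper from the previous sketch; the transition representation (a dictionary P mapping each action to an $n \times n$ array, and l mapping each action to a vector of stage costs) is likewise an assumption for illustration.

```python
import numpy as np

def policy_iteration(P, l, goal_mask):
    """Policy iteration sketch in the spirit of Figure 10.4.

    P:         dict action -> (n, n) array, P[u][i, j] = P(x_j | x_i, u)
    l:         dict action -> (n,) array,   l[u][i]    = l(x_i, u)
    goal_mask: (n,) boolean array marking goal states
    """
    n = len(goal_mask)
    actions = list(P.keys())
    plan = [actions[0]] * n                   # Step 1: arbitrary initial plan
    while True:
        # Step 2: evaluate the current plan by solving (10.53) as a linear system.
        # (Assumes the evaluation system is nonsingular for every plan encountered.)
        P_pi = np.array([P[plan[i]][i] for i in range(n)])
        l_pi = np.array([l[plan[i]][i] for i in range(n)])
        G = evaluate_plan_probabilistic(P_pi, l_pi, goal_mask)

        # Step 3: try each action in the first stage and follow the old plan afterward.
        new_plan = list(plan)
        for i in range(n):
            if goal_mask[i]:
                continue
            costs = {u: l[u][i] + P[u][i] @ G for u in actions}
            new_plan[i] = min(costs, key=costs.get)

        if new_plan == plan:                  # no state can be improved: plan is optimal
            return plan, G
        plan = new_plan
```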
As a small illustration, suppose the problem has three states. After choosing an initial plan $\pi$, use (10.53) to compute $G_\pi$: writing it once for each state yields three linear equations in the three unknowns $G_\pi(x)$, which are solved to obtain the cost-to-go of $\pi$.
Now use (10.54) for each state, substituting the computed values of $G_\pi$, to find a better plan, $\pi'$: at each state, the action that minimizes the expression in (10.54) is selected.
Since an improved plan has been found, replace $\pi$ with $\pi'$ and return to Step 2. The new plan yields a new system of equations of the form (10.53), which is solved to obtain the updated cost-to-go values. The process repeats until (10.54) no longer yields an improvement, at which point the current plan is optimal.
Policy iteration may appear preferable to value iteration, especially because it usually converges in fewer iterations than value iteration. The equation solving that determines the cost of a plan effectively considers multiple stages at once. However, for most planning problems, $X$ is large, and the large linear system of equations that must be solved at every iteration can become unwieldy. In some applications, the state space may be small enough, or sparse matrix techniques may allow efficient solutions over larger state spaces. In general, value-based methods seem preferable for most planning problems.
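When $X$ is large but each state has only a few probable successors, the matrix $I - P_\pi$ in the policy-evaluation step is sparse, and an off-the-shelf sparse solver can replace the dense solve sketched earlier. A minimal sketch with SciPy, under the same assumed array conventions as before:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def evaluate_plan_sparse(P_pi, l_pi, goal_mask):
    """Sparse policy-evaluation step: solve (I - P_pi) G = l_pi.

    P_pi:      (n, n) scipy.sparse matrix of transition probabilities under the plan
    l_pi:      (n,) array of stage costs under the plan
    goal_mask: (n,) boolean array marking goal states
    """
    n = len(l_pi)
    # Zero the goal rows of P_pi and their stage costs so that those equations reduce to G(x) = 0.
    keep = sp.diags((~goal_mask).astype(float))
    A = sp.identity(n, format="csr") - keep @ P_pi.tocsr()
    b = np.array(l_pi, dtype=float)
    b[goal_mask] = 0.0
    return spla.spsolve(A.tocsr(), b)
```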