10.2.2 Policy Iteration

The value iterations of Section 10.2.1 work by iteratively updating cost-to-go values on the state space. The optimal plan can alternatively be obtained by iteratively searching in the space of plans. This leads to a method called policy iteration [84]; the term policy is synonymous with plan. The method will be explained for the case of probabilistic uncertainty, and it is assumed that $ X$ is finite. With minor adaptations, a version for nondeterministic uncertainty can also be developed.

Policy iteration repeatedly requires computing the cost-to-go for a given plan, $ \pi $. Recall the definition of $ G_\pi $ from (10.32). First suppose that there are no uncertainties, and that the state transition equation is $ x^\prime = f(x,u)$. The dynamic programming equation (2.18) from Section 2.3.2 can be used to derive the cost-to-go for each state $ x \in X$ under the application of $ \pi $. Make a copy of (2.18) for each $ x \in X$, and instead of the $ \min$, use the given action $ u = \pi (x)$, to yield

$\displaystyle {G_\pi }(x) = l(x,\pi (x)) + {G_\pi }(f(x,\pi (x))) .$ (10.50)

In (10.50), the $ G^*$ has been replaced by $ {G_\pi }$ because there are no variables to optimize (it is simply the cost of applying $ \pi $). Equation (10.50) can be thought of as a trivial form of dynamic programming in which the choice of possible plans has been restricted to a single plan, $ \pi $. If the dynamic programming recurrence (2.18) holds over the space of all plans, it must certainly hold over a space that consists of a single plan; this is reflected in (10.50).

If there are $ n$ states, (10.50) yields $ n$ equations, each of which gives an expression of $ {G_\pi }(x)$ for a different state. For each state $ x \in {X_{G}}$, it is known that $ {G_\pi }(x) = 0$. Now that this is known, the cost-to-go for all states from which $ {X_{G}}$ can be reached in one stage can be computed using (10.50) with $ {G_\pi }(f(x,\pi (x))) = 0$. Once these cost-to-go values are computed, another wave of values can be computed from states that can reach these in one stage. This process continues until the cost-to-go values are computed for all states. This is similar to the behavior of Dijkstra's algorithm.
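To illustrate, the wavefront computation can be sketched in Python as follows. The names $ X$, $ {X_{G}}$, $ f$, $ l$, and $ \pi $ are hypothetical stand-ins for the state space, goal set, state transition function, cost term, and plan; this is only a rough sketch, and states from which the plan never reaches $ {X_{G}}$ are assigned infinite cost.

\begin{verbatim}
def evaluate_plan_deterministic(X, XG, f, l, pi):
    """Compute G_pi(x) for all x by propagating (10.50) backward from XG.

    X  : iterable of states        (hypothetical finite state set)
    XG : set of goal states        (G_pi = 0 there)
    f  : f(x, u) -> next state     (deterministic transition)
    l  : l(x, u) -> stage cost
    pi : pi(x) -> action           (the fixed plan being evaluated)
    """
    G = {x: 0.0 for x in XG}          # initial condition on the goal set
    remaining = set(X) - set(XG)
    while remaining:
        progress = False
        for x in list(remaining):
            x_next = f(x, pi(x))
            if x_next in G:           # successor's cost-to-go already known
                G[x] = l(x, pi(x)) + G[x_next]   # apply (10.50)
                remaining.remove(x)
                progress = True
        if not progress:              # plan never reaches XG from these states
            for x in remaining:
                G[x] = float('inf')
            break
    return G
\end{verbatim}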

This process of determining the cost-to-go should not seem too mysterious. Equation (10.50) indicates how the costs differ between neighboring states in the state transition graph. Since all of the differences are specified and an initial condition is given for $ {X_{G}}$, all others can be derived by adding up the differences expressed in (10.50). Similar ideas appear in the Hamilton-Jacobi-Bellman equation and Pontryagin's minimum principle, which are covered in Section 15.2.

Now we turn to the case in which there are probabilistic uncertainties. The probabilistic analog of (2.18) is (10.49). For simplicity, consider the special case in which $ l(x,u,\theta)$ does not depend on $ \theta $, which results in

$\displaystyle \pi ^*(x) = \operatornamewithlimits{argmin}_{u \in U(x)} \bigg\{ l(x,u) + \sum_{x^\prime \in X} G^*(x^\prime) P(x^\prime\vert x,u) \bigg\},$ (10.51)

in which $ P(x^\prime\vert x,u)$ is the probability of reaching $ x^\prime$ when $ u$ is applied from $ x$. The cost-to-go function, $ G^*$, satisfies the dynamic programming recurrence

$\displaystyle G^*(x) = \min_{u \in U(x)} \bigg\{ l(x,u) + \sum_{x^\prime \in X} G^*(x^\prime) P(x^\prime\vert x,u) \bigg\}.$ (10.52)

The probabilistic analog to (10.50) can be made from (10.52) by restricting the set of actions to a single plan, $ \pi $, to obtain

$\displaystyle G_\pi (x) = l(x,\pi (x)) + \sum_{x^\prime \in X} G_\pi (x^\prime) P(x^\prime\vert x,\pi (x)),$ (10.53)

in which $ x^\prime$ is the next state.

The cost-to-go for each $ x \in X$ under the application of $ \pi $ can be determined by writing (10.53) for each state. Note that all quantities except $ G_\pi $ are known. This means that if there are $ n$ states, then there are $ n$ linear equations and $ n$ unknowns ($ G_\pi (x)$ for each $ x \in X$). The same was true when (10.50) was used, except the equations were much simpler. In the probabilistic setting, a system of $ n$ linear equations must be solved to determine $ G_\pi $. This may be performed using classical linear algebra techniques, such as singular value decomposition (SVD) [399,961].
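Concretely, since $ G_\pi $ vanishes on $ {X_{G}}$, the equations (10.53) over the non-goal states form the linear system $ (I - P_\pi) G_\pi = l_\pi$, which any standard solver can handle. A rough Python sketch, assuming the transition model is given as nested dictionaries P[x][u] of next-state probabilities (a hypothetical format, not one prescribed by the text):

\begin{verbatim}
import numpy as np

def evaluate_plan_probabilistic(X, XG, P, l, pi):
    """Solve the n linear equations (10.53) for G_pi.

    P[x][u] : dict mapping next state x' -> P(x' | x, u)  (assumed format)
    l(x, u) : stage cost
    pi(x)   : action chosen by the plan
    Goal states are fixed at G_pi = 0, so the system is solved
    over the non-goal states only:  (I - P_pi) g = l_pi.
    """
    states = [x for x in X if x not in XG]
    idx = {x: i for i, x in enumerate(states)}
    n = len(states)
    A = np.eye(n)
    b = np.zeros(n)
    for x in states:
        u = pi(x)
        b[idx[x]] = l(x, u)
        for x_next, prob in P[x][u].items():
            if x_next not in XG:            # goal terms contribute 0
                A[idx[x], idx[x_next]] -= prob
    g = np.linalg.solve(A, b)               # SVD or least squares also works
    G = {x: 0.0 for x in XG}
    G.update({x: g[idx[x]] for x in states})
    return G
\end{verbatim}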

Figure 10.4: The policy iteration algorithm iteratively searches the space of plans by evaluating and improving plans.
POLICY ITERATION ALGORITHM

1. Pick an initial plan $ \pi $, in which $ u_T$ is applied at each $ x \in {X_{G}}$ and all other actions are chosen arbitrarily.

2. Use (10.53) to compute $ G_\pi $ for each $ x \in X$ under $ \pi $.

3. Substitute the computed $ G_\pi $ values into

$\displaystyle \pi ^\prime(x) = \operatornamewithlimits{argmin}_{u \in U(x)} \bigg\{ l(x,u) + \sum_{x^\prime \in X} G_\pi (x^\prime) P(x^\prime\vert x,u) \bigg\}$ (10.54)

to compute an improved plan, $ \pi ^\prime$.

4. If $ \pi ^\prime$ improves the cost-to-go for at least one state, then let $ \pi = \pi ^\prime$ and go to Step 2. Otherwise, declare $ \pi $ to be the optimal plan, $ \pi ^*$.

Now that we have a method for evaluating the cost of a plan, the policy iteration method is straightforward, as specified in Figure 10.4. Note that in Step 3, the cost-to-go $ G_\pi $, which was developed for one plan, $ \pi $, is used to evaluate other plans. The result is the cost that will be obtained if a new action is tried in the first stage and then $ \pi $ is used for all remaining stages. If a new action cannot reduce the cost, then $ \pi $ must have already been optimal because it means that (10.54) has become equivalent to the stationary dynamic programming equation, (10.49). If it is possible to improve $ \pi $, then a new plan is obtained. The new plan must be strictly better than the previous plan, and there is only a finite number of possible plans in total. Therefore, the policy iteration method converges after a finite number of iterations.
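As a rough Python sketch, the loop of Figure 10.4 can be written as follows. It reuses the hypothetical evaluate_plan_probabilistic helper from the earlier sketch, transcribes (10.54) directly for the improvement step, and assumes the initial plan reaches $ {X_{G}}$ with probability one so that its cost-to-go is finite.

\begin{verbatim}
def policy_iteration(X, XG, U, P, l, term_action):
    """Sketch of the algorithm in Figure 10.4 (probabilistic case).

    U(x) gives the set of actions available at x; term_action is u_T.
    Relies on evaluate_plan_probabilistic from the earlier sketch.
    """
    # Step 1: initial plan; terminate on the goal, arbitrary elsewhere.
    pi = {x: term_action if x in XG else
             next(u for u in U(x) if u != term_action) for x in X}
    while True:
        # Step 2: evaluate the current plan by solving (10.53).
        G = evaluate_plan_probabilistic(X, XG, P, l, lambda x: pi[x])

        # Step 3: improve the plan using (10.54).
        def q(x, u):  # cost of trying u now, then following pi afterward
            return l(x, u) + sum(p * G[xn] for xn, p in P[x][u].items())

        improved = False
        for x in X:
            if x in XG:
                continue
            best = min((u for u in U(x) if u != term_action),
                       key=lambda u: q(x, u))
            if q(x, best) < q(x, pi[x]) - 1e-12:   # strict improvement only
                pi[x] = best
                improved = True

        # Step 4: no state improved, so pi is optimal.
        if not improved:
            return pi, G
\end{verbatim}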

Figure 10.5: The probabilistic state transition graphs for $ u=1$ and $ u=2$. Transitions out of $ c$ are not shown because it is assumed that the termination action is always applied from the goal state, $ c$.
[Two probabilistic state transition graphs are shown side by side: the left graph for $ u=1$ and the right graph for $ u=2$.]

Example 10.7 (An Illustration of Policy Iteration)   A simple example will now be used to illustrate policy iteration. Let $ X = \{a,b,c\}$ and $ U = \{1,2,u_T\}$. Let $ {X_{G}}= \{c\}$. Let $ l(x,u) = 1$ for all $ x \in X$ and $ u \in U \setminus \{u_T\}$ (if $ u_T$ is applied, there is no cost). The probabilistic state transition graphs for each action are shown in Figure 10.5. The first step is to pick an initial plan. Let $ \pi (a) = 1$ and $ \pi (b) = 1$; let $ \pi (c) = u_T$ because $ c \in {X_{G}}$.

Now use (10.53) to compute $ G_\pi $. This yields three equations:

$\displaystyle G_\pi (a)$ $\displaystyle = 1 + G_\pi (a) P(a\;\vert\;a,1) + G_\pi (b) P(b\;\vert\;a,1) + G_\pi (c) P(c\;\vert\;a,1)$ (10.55)
$\displaystyle G_\pi (b)$ $\displaystyle = 1 + G_\pi (a) P(a\;\vert\;b,1) + G_\pi (b) P(b\;\vert\;b,1) + G_\pi (c) P(c\;\vert\;b,1)$ (10.56)
$\displaystyle G_\pi (c)$ $\displaystyle = 0 + G_\pi (a) P(a\;\vert\;c,u_T) + G_\pi (b) P(b\;\vert\;c,u_T) + G_\pi (c) P(c\;\vert\;c,u_T) .$ (10.57)

Each equation represents a different state and uses the appropriate action from $ \pi $. The final equation reduces to $ G_\pi (c) = 0$ because of the basic rules of applying the termination action. After substituting values for $ P(x^\prime\vert x,u)$ and using $ G_\pi (c) = 0$, the other two equations become

$\displaystyle G_\pi (a) = 1 + \begin{matrix}\frac{1}{3} \end{matrix} G_\pi (a) + \begin{matrix}\frac{1}{3} \end{matrix} G_\pi (b)$ (10.58)

and

$\displaystyle G_\pi (b) = 1 + \begin{matrix}\frac{1}{3} \end{matrix} G_\pi (a) + \begin{matrix}\frac{1}{3} \end{matrix} G_\pi (b) .$ (10.59)

The solutions are $ G_\pi (a) = G_\pi (b) = 3$.

Now use (10.54) for each state with $ G_\pi (a) = G_\pi (b) = 3$ and $ G_\pi (c) = 0$ to find a better plan, $ \pi ^\prime$. At state $ a$, it is found by solving

$\displaystyle \pi ^\prime(a) = \operatornamewithlimits{argmin}_{u \in U} \Big\{ l(a,u) + \sum_{x^\prime \in X} G_\pi (x^\prime) P(x^\prime\vert a,u) \Big\}.$ (10.60)

The best action is $ u=2$, which produces cost $ 5/2$ and is computed as

$\displaystyle l(a,2) + \sum_{x^\prime \in X} G_\pi (x^\prime) P(x^\prime\vert a,2) = 1 + 3 \cdot \begin{matrix}\frac{1}{2}\end{matrix} + 0 \cdot \begin{matrix}\frac{1}{2}\end{matrix} = \begin{matrix}\frac{5}{2}\end{matrix}.$ (10.61)

Thus, $ \pi ^\prime(a) = 2$. Similarly, $ \pi ^\prime(b) = 2$ can be computed, which produces cost $ 7/4$. Once again, $ \pi ^\prime(c) =
u_T$, which completes the determination of an improved plan.

Since an improved plan has been found, replace $ \pi $ with $ \pi ^\prime$ and return to Step 2. The new plan yields the equations

$\displaystyle G_\pi (a) = 1 + \begin{matrix}\frac{1}{2} \end{matrix}G_\pi (b)$ (10.62)

and

$\displaystyle G_\pi (b) = 1 + \begin{matrix}\frac{1}{4} \end{matrix}G_\pi (a).$ (10.63)

Solving these yields $ G_\pi (a) = 12/7$ and $ G_\pi (b) = 10/7$. The next step attempts to find a better plan using (10.54), but it is determined that the current plan cannot be improved. The policy iteration method terminates by correctly reporting that $ \pi ^* = \pi $. $ \blacksquare$
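The numbers in this example can be reproduced with the sketches above. In the snippet below, the $ u=1$ probabilities are the $ 1/3$ values appearing in (10.58) and (10.59), while the $ u=2$ probabilities are those implied by (10.62) and (10.63) together with normalization; the dictionary format and the string 'uT' for the termination action are assumptions of the earlier sketches.

\begin{verbatim}
# Check the numbers of Example 10.7 using the earlier sketches.
# u_T is omitted from P because the goal state c is terminal.
X, XG = ['a', 'b', 'c'], {'c'}
P = {'a': {1: {'a': 1/3, 'b': 1/3, 'c': 1/3}, 2: {'b': 1/2, 'c': 1/2}},
     'b': {1: {'a': 1/3, 'b': 1/3, 'c': 1/3}, 2: {'a': 1/4, 'c': 3/4}}}
l = lambda x, u: 1.0
U = lambda x: {1, 2, 'uT'}

pi, G = policy_iteration(X, XG, U, P, l, term_action='uT')
print(pi)  # expected: {'a': 2, 'b': 2, 'c': 'uT'}
print(G)   # expected: G(a) = 12/7 = 1.714..., G(b) = 10/7 = 1.428..., G(c) = 0
\end{verbatim}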

Policy iteration may appear preferable to value iteration, especially because it usually converges in fewer iterations; the equation solving that determines the cost of a plan effectively considers multiple stages at once. However, for most planning problems $ X$ is large, and the linear system that must be solved at every iteration becomes unwieldy. In some applications the state space is small enough, or sparse matrix techniques allow efficient solutions over larger state spaces. In general, though, value iteration seems preferable for most planning problems.
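As an illustration of the sparse-matrix remark, the plan-evaluation system can be assembled in sparse form when most transition probabilities are zero. This is only a sketch using SciPy, with the same hypothetical P[x][u] dictionary format as the earlier code:

\begin{verbatim}
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def evaluate_plan_sparse(X, XG, P, l, pi):
    """Sparse assembly and solve of (I - P_pi) g = l_pi over non-goal states."""
    states = [x for x in X if x not in XG]
    idx = {x: i for i, x in enumerate(states)}
    n = len(states)
    A = lil_matrix((n, n))
    b = np.zeros(n)
    for x in states:
        u = pi(x)
        A[idx[x], idx[x]] = 1.0           # the identity part of (I - P_pi)
        b[idx[x]] = l(x, u)
        for x_next, prob in P[x][u].items():
            if x_next not in XG:          # goal states contribute G_pi = 0
                A[idx[x], idx[x_next]] -= prob
    g = spsolve(A.tocsr(), b)             # sparse direct solve
    G = {x: 0.0 for x in XG}
    G.update({x: float(g[idx[x]]) for x in states})
    return G
\end{verbatim}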
