2.3.2 Optimal Plans of Unspecified Lengths

The value-iteration method for fixed-length plans can be generalized nicely to the case in which plans of different lengths are allowed. There will be no bound on the maximal length of a plan; therefore, the current case is truly a generalization of Formulation 2.1 because arbitrarily long plans may be attempted in efforts to reach $ {X_{G}}$. The model for the general case does not require the specification of $ K$ but instead introduces a special action, $ u_T$.

Formulation 2.3 (Discrete Optimal Planning)  
  1. All of the components from Formulation 2.1 are inherited directly: $ X$, $ U(x)$, $ f$, $ {x_{I}}$, and $ {X_{G}}$. Also, the notion of stages from Formulation 2.2 is used.
  2. Let $ L$ denote a stage-additive cost functional, which may be applied to any $ K$-step plan, $ \pi _K$, to yield

    $\displaystyle L(\pi _K) = \sum_{k=1}^K l(x_k,u_k) + l_F(x_F) .$ (2.17)

    In comparison with $ L$ from Formulation 2.2, the present expression does not consider $ K$ as a predetermined constant. It will now vary, depending on the length of the plan. Thus, the domain of $ L$ is much larger.
  3. Each $ U(x)$ contains the special termination action, $ u_T$. If $ u_T$ is applied at $ x_k$, then the action is repeatedly applied forever, the state remains unchanged, and no more cost accumulates. Thus, for all $ i \geq k$, $ u_i = u_T$, $ x_i = x_k$, and $ l(x_i,u_T) = 0$.

The termination action is the key to allowing plans of different lengths. It will appear throughout this book. Suppose that value iterations are performed up to $ K=5$, and for the problem there exists a two-step solution plan, $ (u_1,u_2)$, that arrives in $ {X_{G}}$ from $ {x_{I}}$. This plan is equivalent to the five-step plan $ (u_1,u_2,u_T,u_T,u_T)$ because the termination action does not change the state, nor does it accumulate cost. The resulting five-step plan reaches $ {X_{G}}$ and costs the same as $ (u_1,u_2)$. With this simple extension, the forward and backward value iteration methods of Section 2.3.1 may be applied for any fixed $ K$ to optimize over all plans of length $ K$ or less (instead of fixing $ K$).
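To make the role of $ u_T$ concrete, consider the following sketch (in Python; the function names plan_cost, f, l, and l_F are illustrative placeholders, not notation from the formulation), which evaluates the stage-additive cost of a plan under the termination-action semantics:

\begin{verbatim}
TERMINATE = "u_T"

def plan_cost(x, plan, f, l, l_F):
    """Evaluate L(pi) for a plan executed from state x under
    Formulation 2.3: once TERMINATE is applied, the state is
    frozen and no further cost accumulates."""
    total = 0
    for u in plan:
        if u == TERMINATE:
            break                # x stays fixed; l(x, u_T) = 0 forever
        total += l(x, u)         # stage-additive term l(x_k, u_k)
        x = f(x, u)              # apply the state transition equation
    return total + l_F(x)        # final term l_F(x_F)
\end{verbatim}

Under this definition, $ (u_1,u_2)$ and $ (u_1,u_2,u_T,u_T,u_T)$ evaluate to exactly the same cost, which is the equivalence exploited above.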

The next step is to remove the dependency on $ K$. Consider running backward value iterations indefinitely. At some point, $ G^*_1$ will be computed, but there is no reason why the process cannot be continued onward to $ G^*_0$, $ G^*_{-1}$, and so on. Recall that $ {x_{I}}$ is not utilized in the backward value-iteration method; therefore, there is no concern regarding the initial state of the plans. Suppose that backward value iteration was applied for $ K = 16$ and was executed down to $ G^*_{-8}$. This considers all plans of length $ 25$ or less. Note that it is harmless to add $ 9$ to all stage indices to shift all of the cost-to-go functions. Instead of running from $ G^*_{-8}$ to $ G^*_{16}$, they can run from $ G^*_1$ to $ G^*_{25}$ without affecting their values. The index shifting is allowed because none of the costs depend on the particular index that is given to the stage. The only important aspect of the value iterations is that they proceed backward and consecutively from stage to stage.

Eventually, enough iterations will have been executed so that an optimal plan is known from every state that can reach $ {X_{G}}$. From that stage, say $ k$, onward, the cost-to-go values from one value iteration to the next will be stationary, meaning that $ G^*_{i-1}(x) = G^*_i(x)$ for all $ i \leq k$ and all $ x \in X$. Once the stationary condition is reached, the cost-to-go function no longer depends on a particular stage $ k$. In this case, the stage index may be dropped, and the recurrence becomes

$\displaystyle G^*(x) = \min_{u} \Big\{ l(x,u) + G^*(f(x,u)) \Big\} .$ (2.18)
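To illustrate, backward value iteration with a stationarity test can be sketched as follows (in Python; this is not code from the book, and it specializes $ l_F$ to be $ 0$ on $ {X_{G}}$ and $ \infty$ otherwise, which encodes the requirement of terminating in the goal set):

\begin{verbatim}
import math

def backward_value_iteration(states, U, f, l, goal):
    """Iterate G*_{i-1}(x) = min_u { l(x,u) + G*_i(f(x,u)) } until
    the cost-to-go values become stationary.  Assumes no negative
    cycles, so the iterations eventually terminate."""
    # G*_0 is obtained from l_F: 0 on the goal set, infinity elsewhere.
    G = {x: (0 if x in goal else math.inf) for x in states}
    while True:
        G_next = {}
        for x in states:
            # The termination action keeps the cost at 0 once the goal
            # set has been reached; elsewhere it is useless (infinite).
            best = 0 if x in goal else math.inf
            for u in U(x):
                best = min(best, l(x, u) + G[f(x, u)])
            G_next[x] = best
        if G_next == G:      # stationary: G*_{i-1} = G*_i for all x
            return G         # drop the stage index and call it G*
        G = G_next
\end{verbatim}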

Are there any conditions under which backward value iterations could be executed forever, with each iteration producing a cost-to-go function for which some values are different from the previous iteration? If $ l(x,u)$ is nonnegative for all $ x \in X$ and $ u \in U(x)$, then this could never happen. It could certainly be true that, for any fixed $ K$, longer plans will exist, but this cannot be said of optimal plans. From every $ x \in X$, there either exists a plan that reaches $ X_{G}$ with finite cost or there is no solution. For each state from which there exists a plan that reaches $ {X_{G}}$, consider the number of stages in the optimal plan. Consider the maximum number of stages taken from all states that can reach $ {X_{G}}$. This serves as an upper bound on the number of value iterations before the cost-to-go becomes stationary. Any further iterations will just consider solutions that are worse than the ones already considered (some may be equivalent due to the termination action and shifting of stages). Some trouble might occur if $ l(x,u)$ contains negative values. If the state transition graph contains a cycle for which total cost is negative, then it is preferable to execute a plan that travels around the cycle forever, thereby reducing the total cost to $ -\infty$. Therefore, we will assume that the cost functional is defined in a sensible way so that negative cycles do not exist. Otherwise, the optimization model itself appears flawed. Some negative values for $ l(x,u)$, however, are allowed as long as there are no negative cycles. (It is straightforward to detect and report negative cycles before running the value iterations.)
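Such a check can be written as a Bellman-Ford-style relaxation. The sketch below is an illustrative addition (not part of the original text) that reports whether the state transition graph contains a negative cycle:

\begin{verbatim}
def has_negative_cycle(states, U, f, l):
    """Report whether the state transition graph contains a cycle
    of negative total cost."""
    # Starting every state at 0 acts like a zero-cost super-source,
    # so negative cycles anywhere in the graph are detected.
    dist = {x: 0 for x in states}
    for _ in range(len(states)):
        changed = False
        for x in states:
            for u in U(x):
                if dist[x] + l(x, u) < dist[f(x, u)]:
                    dist[f(x, u)] = dist[x] + l(x, u)
                    changed = True
        if not changed:
            return False     # relaxations settled: no negative cycle
    return True              # still improving after |X| rounds
\end{verbatim}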

Since the particular stage index is unimportant, let $ k = 0$ be the index of the final stage, which is the stage at which the backward value iterations begin. Hence, $ G^*_0$ is the final stage cost, which is obtained directly from $ l_F$. Let $ -K$ denote the stage index at which the cost-to-go values all become stationary. At this stage, the optimal cost-to-go function, $ G^*: X \rightarrow {\mathbb{R}}\cup \{\infty\}$, is expressed by assigning $ G^*= G^*_{-K}$. In other words, the particular stage index no longer matters. The value $ G^*(x)$ gives the optimal cost to go from state $ x \in X$ into the goal set $ {X_{G}}$.

If the optimal actions are not stored during the value iterations, the optimal cost-to-go, $ G^*$, can be used to efficiently recover them. Consider starting from some $ x \in X$. What is the optimal next action? This is given by

$\displaystyle u^* = \operatornamewithlimits{argmin}_{u \in U(x)} \Big\{ l(x,u) + G^*(f(x,u)) \Big\} ,$ (2.19)

in which $ \operatornamewithlimits{argmin}$ denotes the argument that achieves the minimum value of the expression. The action minimizes an expression that is very similar to (2.11). The only differences between (2.19) and (2.11) are that the stage indices are dropped in (2.19) because the cost-to-go values no longer depend on them, and $ \operatornamewithlimits{argmin}$ is used so that $ u^*$ is selected. After applying $ u^*$, the state transition equation is used to obtain $ x' = f(x,u^*)$, and (2.19) may be applied again on $ x'$. This process continues until a state in $ {X_{G}}$ is reached. This procedure is based directly on the dynamic programming recurrence; therefore, it recovers the optimal plan. The function $ G^*$ serves as a kind of guide that leads the system from any initial state into the goal set optimally. This can be considered as a special case of a navigation function, which will be covered in Section 8.2.2.
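In code, the recovery procedure just described amounts to a greedy descent of $ G^*$. The sketch below follows the conventions of the earlier fragments and assumes that a solution exists from $ x$ (i.e., $ G^*(x) < \infty$):

\begin{verbatim}
def recover_plan(x, goal, U, f, l, G):
    """Apply (2.19) repeatedly: choose the action that minimizes
    l(x,u) + G*(f(x,u)) until the goal set is reached."""
    plan = []
    while x not in goal:
        u_star = min(U(x), key=lambda u: l(x, u) + G[f(x, u)])
        plan.append(u_star)
        x = f(x, u_star)     # x' = f(x, u*); apply (2.19) again at x'
    return plan
\end{verbatim}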

As in the case of fixed-length plans, the direction of the value iterations can be reversed to obtain a forward value-iteration method for the variable-length planning problem. In this case, the backward state transition equation, $ {f^{-1}}$, is used once again. Also, the initial cost term $ l_I$ is used instead of $ l_F$, as in (2.14). The forward value-iteration method starts at $ k=1$, and then iterates until the cost-to-come becomes stationary. Once again, the termination action, $ u_T$, preserves the cost of plans that arrived at a state in earlier iterations. Note that it is not required to specify $ {X_{G}}$. A counterpart to $ G^*$ may be obtained, from which optimal actions can be recovered. When the cost-to-come values become stationary, an optimal cost-to-come function, $ C^*: X \rightarrow {\mathbb{R}}\cup \{\infty\}$, may be expressed by assigning $ C^*= C^*_F$, in which $ F$ is the final stage reached when the algorithm terminates. The value $ C^*(x)$ gives the cost of an optimal plan that starts from $ {x_{I}}$ and reaches $ x$. The optimal action sequence for any specified goal $ {x_{G}}\in X$ can be obtained using

$\displaystyle \operatornamewithlimits{argmin}_{u^{-1}\in {U^{-1}}} \Big\{ C^*({f^{-1}}(x,u^{-1})) + l({f^{-1}}(x,u^{-1}),u') \Big\} ,$ (2.20)

which is the forward counterpart of (2.19). The $ u'$ is the action in $ U({f^{-1}}(x,u^{-1}))$ that yields $ x$ when the state transition function, $ f$, is applied. The iterations proceed backward from $ {x_{G}}$ and terminate when $ {x_{I}}$ is reached.
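The forward method can be sketched in the same style. Rather than representing $ {f^{-1}}$ explicitly, the fragment below (again an illustrative addition) scans all state-action pairs and relaxes each successor, which computes the same minimum over predecessors; carrying a state's previous finite value forward plays the role of the termination action:

\begin{verbatim}
import math

def forward_value_iteration(states, U, f, l, x_init):
    """Iterate the cost-to-come functions until they become
    stationary, then return the optimal cost-to-come C*."""
    # C*_1 is obtained from l_I: 0 at the initial state, infinity elsewhere.
    C = {x: (0 if x == x_init else math.inf) for x in states}
    while True:
        C_next = dict(C)     # keeping C[x] models termination at x
        for xp in states:
            for u in U(xp):
                x = f(xp, u)                 # successor of xp under u
                C_next[x] = min(C_next[x], C[xp] + l(xp, u))
        if C_next == C:      # stationary: rename the result C*
            return C
        C = C_next
\end{verbatim}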

Example 2.5 (Value Iteration for Variable-Length Plans)   Once again, Example 2.3 is revisited; however, this time the plan length is not fixed due to the termination action. Its effect is depicted in Figure 2.13 by the superposition of new edges that have zero cost. It might appear at first that there is no incentive to choose nontermination actions, but remember that any plan that does not terminate in state $ x_{G} = d$ will receive infinite cost.

Figure 2.13: Compare this figure to Figure 2.11, for which $ K$ was fixed at $ 4$. The effect of the termination action is depicted as dashed-line edges that yield 0 cost when traversed. This enables plans of all finite lengths to be considered. Also, the stages extend indefinitely to the left (for the case of backward value iteration).

Figure 2.14: The optimal cost-to-go functions computed by backward value iteration applied in the case of variable-length plans.
\begin{figure}\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
 & $a$ & $b$ & $c$ & $d$ & $e$ \\ \hline
$G^*_0$ & $\infty$ & $\infty$ & $\infty$ & 0 & $\infty$ \\ \hline
$G^*_{-1}$ & $\infty$ & 4 & 1 & 0 & $\infty$ \\ \hline
$G^*_{-2}$ & 6 & 2 & 1 & 0 & $\infty$ \\ \hline
$G^*_{-3}$ & 4 & 2 & 1 & 0 & $\infty$ \\ \hline
$G^*_{-4}$ & 4 & 2 & 1 & 0 & $\infty$ \\ \hline
$G^*$ & 4 & 2 & 1 & 0 & $\infty$ \\ \hline
\end{tabular}\end{center}
\end{figure}

See Figure 2.14. After a few backward value iterations, the cost-to-go values become stationary. After this point, the termination action is being applied from all reachable states and no further cost accumulates. The final cost-to-go function is defined to be $ G^*$. Since $ d$ is not reachable from $ e$, $ G^*(e) = \infty$.

As an example of using (2.19) to recover optimal actions, consider starting from state $ a$. The action that leads to $ b$ is chosen next because the total cost $ 2 + G^*(b) = 4$ is better than $ 2 + G^*(a) = 6$ (the $ 2$ comes from the action cost). From state $ b$, the optimal action leads to $ c$, which produces total cost $ 1 + G^*(c) = 2$. Similarly, the next action leads to $ d \in {X_{G}}$, which terminates the plan.
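The backward value iteration and plan-recovery sketches from earlier in this section reproduce these numbers. The edge costs below are read off from the narrative and from Figure 2.14; the full graph appears in Figure 2.13, so the encoding, including the action names, should be treated as an assumption:

\begin{verbatim}
import math

# Hypothetical encoding of the five-state example; the goal is d.
edges = {
    "a": {"stay": ("a", 2), "toB": ("b", 2)},
    "b": {"toC": ("c", 1), "toD": ("d", 4)},
    "c": {"toA": ("a", 1), "toD": ("d", 1)},
    "d": {"toE": ("e", 1)},
    "e": {},
}
states = list(edges)
U = lambda x: list(edges[x])          # available actions at x
f = lambda x, u: edges[x][u][0]       # state transition equation
l = lambda x, u: edges[x][u][1]       # stage cost l(x, u)

G = backward_value_iteration(states, U, f, l, goal={"d"})
# G == {"a": 4, "b": 2, "c": 1, "d": 0, "e": math.inf}   (Figure 2.14)
print(recover_plan("a", {"d"}, U, f, l, G))   # ['toB', 'toC', 'toD']
\end{verbatim}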

Figure 2.15: The optimal cost-to-come functions computed by forward value iteration applied in the case of variable-length plans.
\begin{figure}\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
 & $a$ & $b$ & $c$ & $d$ & $e$ \\ \hline
$C^*_1$ & $\infty$ & 0 & $\infty$ & $\infty$ & $\infty$ \\ \hline
$C^*_2$ & $\infty$ & 0 & 1 & 4 & $\infty$ \\ \hline
$C^*_3$ & 2 & 0 & 1 & 2 & 5 \\ \hline
$C^*_4$ & 2 & 0 & 1 & 2 & 3 \\ \hline
$C^*_5$ & 2 & 0 & 1 & 2 & 3 \\ \hline
$C^*$ & 2 & 0 & 1 & 2 & 3 \\ \hline
\end{tabular}\end{center}
\end{figure}

Using forward value iteration, suppose that $ {x_{I}}= b$. The cost-to-come functions shown in Figure 2.15 are obtained. For any finite value that remains constant from one iteration to the next, the termination action was applied. Note that the last value iteration is useless in this example. Once $ C^*_4$ is computed, the optimal cost-to-come to every possible state from $ {x_{I}}$ is determined, and all future cost-to-come functions are identical. Therefore, the final cost-to-come is renamed $ C^*$.
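Under the same hypothetical encoding introduced earlier in this example, the forward sketch reproduces these values:

\begin{verbatim}
C = forward_value_iteration(states, U, f, l, x_init="b")
# C == {"a": 2, "b": 0, "c": 1, "d": 2, "e": 3}   (Figure 2.15)
\end{verbatim}

$ \blacksquare$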
