Assume that there is no limit on the number of stages. After the value iterations terminate, cost-to-go functions are determined over $X$. This is not exactly a plan, because an action is required for each $x \in X$. The actions can be obtained by recording the $u \in U(x)$ that produced the minimum cost value in (10.45) or (10.39).
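To make the recording of actions concrete, the following is a minimal Python sketch of nondeterministic value iteration based on the backup (10.39). All names here ($X$, $U$, $\Theta$, $f$, $l$) are hypothetical stand-ins for the chapter's definitions; $X$ is taken as the set of nongoal states, goal states are held fixed in $G$, and a fixed number of sweeps replaces a proper convergence test.

\begin{verbatim}
def value_iteration_nondet(X, U, Theta, f, l, G, num_sweeps=100):
    """Apply the backup (10.39) repeatedly, recording the action
    that produced the minimum cost value at each state."""
    pi = {}
    for _ in range(num_sweeps):
        G_new = dict(G)  # goal states keep their (zero) cost-to-go
        for x in X:      # sweep over the nongoal states
            best_cost, best_u = float('inf'), None
            for u in U(x):
                # Worst case over nature's action: the inner max in (10.39).
                cost = max(l(x, u, th) + G[f(x, u, th)]
                           for th in Theta(x, u))
                if cost < best_cost:
                    best_cost, best_u = cost, u
            G_new[x] = best_cost
            pi[x] = best_u  # record the minimizing action
        G = G_new
    return G, pi
\end{verbatim}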
Assume that the value iterations have converged to a stationary cost-to-go function, denoted by $G^*$. Before uncertainty was introduced, the optimal actions were determined by (2.19). The nondeterministic and probabilistic versions of (2.19) are

$$\pi^*(x) = \operatorname*{argmin}_{u \in U(x)} \left\{ \max_{\theta \in \Theta(x,u)} \Big\{ l(x,u,\theta) + G^*(f(x,u,\theta)) \Big\} \right\} \qquad (10.48)$$

and

$$\pi^*(x) = \operatorname*{argmin}_{u \in U(x)} \left\{ E_\theta \Big[ l(x,u,\theta) + G^*(f(x,u,\theta)) \Big] \right\}, \qquad (10.49)$$

respectively.
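As a sketch of how these local operators translate into code (one plausible rendering, not a definitive implementation), the converged cost-to-go is stored as a dictionary G, and P(th, x, u) is an assumed stand-in for the distribution $P(\theta \mid x, u)$:

\begin{verbatim}
def pi_star_nondet(x, U, Theta, f, l, G):
    """Evaluate (10.48): worst-case action recovery from G*."""
    return min(U(x), key=lambda u: max(l(x, u, th) + G[f(x, u, th)]
                                       for th in Theta(x, u)))

def pi_star_prob(x, U, Theta, f, l, G, P):
    """Evaluate (10.49): expected-cost action recovery from G*."""
    return min(U(x), key=lambda u: sum(P(th, x, u) *
                                       (l(x, u, th) + G[f(x, u, th)])
                                       for th in Theta(x, u)))
\end{verbatim}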
Conveniently, the optimal action can be recovered directly during execution of the plan, rather than storing actions. Each time a state $x$ is obtained during execution, the appropriate action $u = \pi^*(x)$ is selected by evaluating (10.48) or (10.49) at $x$. This means that the cost-to-go function itself can be interpreted as a representation of the optimal plan, once it is understood that a local operator is required to recover the action. It may seem strange that
such a local computation yields the global optimum; however, this
works because the cost-to-go function already encodes the global
costs. This behavior was also observed for continuous state spaces in
Section 8.4.1, in which a navigation function served to
define a feedback motion plan. In that context, a gradient operator
was needed to recover the direction of motion. In the current
setting, (10.48) and (10.49) serve the
same purpose.
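For illustration, a hypothetical execution loop that reuses pi_star_prob from the sketch above; x_init, X_goal, and sample_nature are assumed stand-ins for the initial state, the goal region, and nature's actual run-time behavior. As with descending a navigation function, only a local computation is performed at each visited state.

\begin{verbatim}
def execute_plan(x_init, X_goal, U, Theta, f, l, G, P, sample_nature):
    """Execute online: each action is recovered locally via (10.49),
    so no action table needs to be stored."""
    x = x_init
    while x not in X_goal:
        u = pi_star_prob(x, U, Theta, f, l, G, P)  # local operator at x
        th = sample_nature(x, u)  # nature's action, unknown in advance
        x = f(x, u, th)           # resulting state transition
    return x
\end{verbatim}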