A simulation-based version of value iteration can be constructed from
Q-factors. The reason for their use instead of $G^*$ values is that a
minimization over $U(x)$ will be avoided in the dynamic programming.
Avoiding this minimization enables a sample-by-sample approach to
estimating the optimal values and ultimately obtaining the optimal
plan. The optimal cost-to-go can be obtained from the Q-factors as
If
and
were known, then (10.98)
would lead to an alternative, storage-intensive way to perform value
iteration. After convergence occurs, (10.97) can be used to
obtain the
values. The optimal plan is constructed as
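To make the model-based alternative concrete, the following sketch iterates (10.98) and then applies (10.97) and the argmin step to recover $G^*$ and $\pi^*$. It is illustrative only: the routine name q_value_iteration, the dictionary encodings of $X$, $U(x)$, $l(x,u)$, and $P(x' \mid x,u)$, and the fixed iteration count are choices made for this example, not details prescribed by the text.

\begin{verbatim}
# Illustrative sketch: value iteration over Q-factors when the cost
# l(x,u) and transition probabilities P(x'|x,u) are known.
#   X      : list of states
#   U[x]   : list of actions available at x (assumed nonempty;
#            goal/termination handling is omitted for brevity)
#   l[x,u] : stage cost of applying u from x
#   P[x,u] : dict mapping a next state x' to P(x'|x,u)

def q_value_iteration(X, U, l, P, num_iterations=100):
    Q = {(x, u): 0.0 for x in X for u in U[x]}   # initial Q-factors
    for _ in range(num_iterations):
        Q_new = {}
        for x in X:
            for u in U[x]:
                # (10.98): cost of u plus the expected optimal
                # cost-to-go from the next state.
                future = sum(p * min(Q[(xn, un)] for un in U[xn])
                             for xn, p in P[x, u].items())
                Q_new[(x, u)] = l[x, u] + future
        Q = Q_new
    # (10.97): optimal cost-to-go, and the optimal plan via argmin.
    G = {x: min(Q[(x, u)] for u in U[x]) for x in X}
    pi = {x: min(U[x], key=lambda u: Q[(x, u)]) for x in X}
    return Q, G, pi
\end{verbatim}

Note that a Q-factor is stored for every state-action pair, which is the storage cost mentioned above.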
Since the costs and transition probabilities are unknown, a
simulation-based approach is needed. The stochastic iterative
algorithm idea can be applied once again. Recall that
(10.96) estimated the cost of a plan by using individual
samples and required a convergence-rate parameter, $\rho$. Using the
same idea here, a simulation-based version of value iteration can be
derived as

$$ \hat{Q}^*(x,u) := (1-\rho)\,\hat{Q}^*(x,u) + \rho \Big( l(x,u) + \min_{u' \in U(x')} \hat{Q}^*(x',u') \Big) , \qquad (10.101) $$

in which $x'$ is the next state observed after $u$ is applied from $x$ in the simulation.
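The update in (10.101) is straightforward to express in code. The sketch below is only an illustration; the function name q_learning_update, the $\epsilon$-greedy action selection, and the way the simulator reports the cost and next state are assumptions made here, not details fixed by the text.

\begin{verbatim}
import random

# Illustrative sketch: applying the update of (10.101) to transitions
# produced by a simulator. Q maps (state, action) pairs to estimates,
# U[x] lists the actions at x, and rho is the convergence-rate parameter.

def q_learning_update(Q, U, x, u, cost, x_next, rho):
    # (10.101): blend the current estimate with the sampled value.
    best_next = min(Q[(x_next, un)] for un in U[x_next])
    Q[(x, u)] = (1.0 - rho) * Q[(x, u)] + rho * (cost + best_next)

def choose_action(Q, U, x, epsilon=0.1):
    # Assumed exploration rule (epsilon-greedy); the text leaves open
    # how actions are chosen while the estimates are being formed.
    if random.random() < epsilon:
        return random.choice(U[x])
    return min(U[x], key=lambda u: Q[(x, u)])
\end{verbatim}

A learning trial would repeatedly call choose_action, apply the chosen action in the simulator to obtain the incurred cost and the next state, and then call q_learning_update. As with (10.96), convergence depends on how $\rho$ is decreased and on every state-action pair being tried sufficiently often.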
In most literature, Q-learning is applied to the discounted cost model. This yields a minor variant of (10.101):

$$ \hat{Q}^*(x,u) := (1-\rho)\,\hat{Q}^*(x,u) + \rho \Big( l(x,u) + \alpha \min_{u' \in U(x')} \hat{Q}^*(x',u') \Big) , $$

in which $\alpha \in (0,1)$ is the discount factor.
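In code, the discounted variant differs from the sketch above only by a single factor; alpha is the name assumed here for the discount factor.

\begin{verbatim}
def discounted_q_learning_update(Q, U, x, u, cost, x_next, rho, alpha):
    # Discounted variant of (10.101): future costs are weighted by alpha.
    best_next = min(Q[(x_next, un)] for un in U[x_next])
    Q[(x, u)] = (1.0 - rho) * Q[(x, u)] + rho * (cost + alpha * best_next)
\end{verbatim}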