Value iteration

A simulation-based version of value iteration can be constructed from Q-factors. The reason for their use instead of is that a minimization over will be avoided in the dynamic programming. Avoiding this minimization enables a sample-by-sample approach to estimating the optimal values and ultimately obtaining the optimal plan. The optimal cost-to-go can be obtained from the Q-factors as

$\displaystyle G^*(x) = \min_{u \in U(x)} \Big\{ Q^*(x,u) \Big\} .$

(10.96)

This enables the dynamic programming recurrence in (10.46) to be expressed as

$\displaystyle Q^*(x,u) = l(x,u) + \sum_{x^\prime \in X} P(x^\prime\vert x,u) \min_{u^\prime \in U(x^\prime)} \Big\{ Q^*(x^\prime,u^\prime) \Big\} .$

(10.97)

By applying (10.97) to the right side of (10.98), it can also be expressed using

$\displaystyle Q^*(x,u) = l(x,u) + \sum_{x^\prime \in X} P(x^\prime\vert x,u) G^*(x^\prime) .$

(10.98)

If $P(x^\prime\vert x,u)$ and were known, then (10.98) would lead to an alternative, storage-intensive way to perform value iteration. After convergence occurs, (10.97) can be used to obtain the values. The optimal plan is constructed as

$\displaystyle \pi ^*(x) = \operatornamewithlimits{argmin}_{u \in U(x)} \Big\{ Q^*(x,u) \Big\} .$

(10.99)

Since the costs and transition probabilities are unknown, a simulation-based approach is needed. The stochastic iterative algorithm idea can be applied once again. Recall that (10.96) estimated the cost of a plan by using individual samples and required a convergence-rate parameter, $\rho$ . Using the same idea here, a simulation-based version of value iteration can be derived as

$\displaystyle \hat{Q}^*(x,u) := (1-\rho) \hat{Q}^*(x,u) + \rho \left(l(x,u) + \min_{u' \in U(x')} \Big\{ \hat{Q}^*(x',u') \Big\}\right) ,$

(10.100)

in which

is the next state and

is the cost obtained from the simulator when

is applied at

. Initially, all Q-factors are set to zero. Sample trajectories that arrive at the goal can be generated using simulation, and (10.101) is applied to the resulting states and costs in each stage. Once again, the update equation may appear to be incorrect because the transition probabilities are not explicitly mentioned, but this is taken into account automatically through the simulation.

In most literature, Q-learning is applied to the discounted cost model. This yields a minor variant of (10.101):

$\displaystyle \hat{Q}^*(x,u) := (1-\rho) \hat{Q}^*(x,u) + \rho \left(l(x,u) + \alpha \min_{u' \in U(x')} \Big\{ \hat{Q}^*(x',u') \Big\} \right) ,$

(10.101)

in which the discount factor $\alpha$ appears because the update equation is derived from (10.76).

Steven M LaValle 2020-08-14