Solutions for the average cost-per-stage model

A value iteration algorithm for the average cost model can be obtained by simply neglecting to divide by $ K$. Selecting actions that optimize the total cost also optimizes the average cost as the number of stages approaches infinity. This may cause costs to increase toward $ \pm \infty$; however, only a finite number of iterations can be executed in practice.

The backward value iterations of Section 10.2.1 can be followed with very little modification. Initially, let $ G^*(x_F) =
0$ for all $ x_F \in X$. The value iterations are computed using the standard form

$\displaystyle G^*_k({x_k}) = \min_{{u_k}\in U({x_k})} \bigg\{ \sum_{\theta \in ...
...k}) + G^*_{k+1}(f(x_k,u_k,\theta_k)) \Big) P({\theta_k}\vert x_k,u_k) \bigg\} .$ (10.78)

The iterations continue until convergence occurs. To determine whether a solution of sufficient quality has been obtained, a reasonable criterion for is

$\displaystyle \max_{x \in X} \Big\{ \big\vert G^*_k(x)/N - G^*_{k+1}(x)/(N-1) \big\vert \Big\} < \epsilon ,$ (10.79)

in which $ \epsilon$ is the error tolerance and $ N$ is the number of value iterations that have been completed (it is required in (10.80) that $ N > 1$). Once (10.80) has been satisfied, the iterations can be terminated.

A numerical problem may exist with the growing values obtained for $ G^*(x)$. This can be alleviated by periodically reducing all values by some constant factor to ensure that the numbers fit within the allowable floating point range. In [96], a method called relative value iteration is presented, which selects one state, $ s \in X$, arbitrarily and expresses the cost-to-go values by subtracting off the cost at $ s$. This trims down all values simultaneously to keep them bounded while still maintaining the convergence properties of the algorithm.

Policy iteration can alternatively be performed by using the method given in Figure 10.4 with only minor modification.

Steven M LaValle 2020-08-14