Value iteration for discounted cost

A backward value iteration solution will be presented that follows naturally from the method given in Section 10.2.1. For notational convenience, let the first stage be designated as $ k = 0$ so that $ \alpha^{k-1}$ may be replaced by $ \alpha^k$. In the probabilistic case, the expected optimal cost-to-go is

$\displaystyle G^*(x) = \lim_{K \to \infty} \left( \min_{{\tilde{u}}} \left\{ E_{{\tilde{\theta}}} \left[ \sum_{k=0}^K \alpha^k l(x_k,u_k,\theta_k) \right] \right\} \right).$ (10.69)

The expectation is taken over all nature histories, each of which is an infinite sequence of nature actions. The corresponding expression for the nondeterministic case is

$\displaystyle G^*(x) = \lim_{K \to \infty} \left( \min_{{\tilde{u}}} \left\{ \max_{{\tilde{\theta}}} \left\{ \sum_{k=0}^K \alpha^k l(x_k,u_k,\theta_k) \right\} \right\} \right).$ (10.70)

Since the probabilistic case is more common, it will be covered here. The nondeterministic version is handled in a similar way (see Exercise 17). As before, backward value iterations will be performed because they are simpler to express. The discount factor causes a minor complication that must be fixed to make the dynamic programming recurrence work properly.

One difficulty is that the stage index now appears in the cost function, in the form of $ \alpha^k$. This means that the shift-invariant property from Section 2.3.1 is no longer preserved, so stage indices must be assigned carefully. This is a problem because, in backward value iteration, the final stage index has so far been unknown and unimportant; with discounting, it matters.

Consider a sequence of discounted decision-making problems, obtained by increasing the maximum stage index: $ K=0$, $ K=1$, $ K=2$, $ \ldots $. Look at the neighboring cost expressions in Figure 10.10. What is the difference between finding the optimal cost-to-go for the $ (K+1)$-stage problem and for the $ K$-stage problem? In Figure 10.10 the cost for $ K=4$ can be obtained by multiplying all terms for $ K=3$ by $ \alpha$ and adding a new term, $ l_0$; the last four terms of the $ K=4$ cost are then exactly the $ K=3$ terms with the stage index shifted up by one on each $ l_i$. In general, the optimal costs of a $ K$-stage optimization problem can serve as the optimal costs of the $ (K+1)$-stage problem if they are first multiplied by $ \alpha$. The $ (K+1)$-stage problem can then be solved by optimizing over the sum of the first-stage cost and the optimal cost for the $ K$-stage problem, discounted by $ \alpha$.
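To make the pattern explicit, here is a small worked instance (using $ l_i$ as shorthand for $ l(x_i,u_i,\theta_i)$, in the style of Figure 10.10). The $ K=4$ cost can be written as

$\displaystyle l_0 + \alpha \left( l_1 + \alpha l_2 + \alpha^2 l_3 + \alpha^3 l_4 \right) = l_0 + \alpha l_1 + \alpha^2 l_2 + \alpha^3 l_3 + \alpha^4 l_4 ,$

in which the parenthesized expression is the $ K=3$ cost, $ l_0 + \alpha l_1 + \alpha^2 l_2 + \alpha^3 l_3$, with each stage index shifted up by one.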

This can be derived using straightforward dynamic programming arguments as follows. Suppose that $ K$ is fixed. The cost-to-go can be expressed recursively for $ k$ from 0 to $ K$ as

$\displaystyle G^*_k({x_k}) = \min_{{u_k}\in U({x_k})} \Big\{ E_{{\theta_k}} \Big[ \alpha^k l({x_k},{u_k},{\theta_k}) + G^*_{k+1}(x_{k+1}) \Big] \Big\} ,$ (10.71)

in which $ x_{k+1} = f(x_k,u_k,\theta_k)$. The problem, however, is that the recursion depends on $ k$ through $ \alpha^k$, which makes it appear nonstationary.

The idea of using neighboring cost values as shown in Figure 10.10 can be applied by making a notational change. Let $ J^*_{K-k}(x_k) = \alpha^{-k} G^*_k(x_k)$. This reverses the direction of the stage indices to avoid specifying the final stage and also scales by $ \alpha^{-k}$ to correctly compensate for the index change. Substitution into (10.71) yields

$\displaystyle \alpha^k J^*_{K-k}(x_k) = \min_{{u_k}\in U({x_k})} \Big\{ E_{{\theta_k}} \Big[ \alpha^k l({x_k},{u_k},{\theta_k}) + \alpha^{k+1} J^*_{K-k-1}(x_{k+1}) \Big] \Big\} .$ (10.72)

Dividing by $ \alpha^k$ and then letting $ i = K-k$ yields

$\displaystyle J^*_i(x_k) = \min_{{u_k}\in U({x_k})} \Big\{ E_{{\theta_k}} \Big[ l({x_k},{u_k},{\theta_k}) + \alpha J^*_{i-1}(x_{k+1}) \Big] \Big\} ,$ (10.73)

in which $ J^*_i$ represents the expected cost for a finite-horizon discounted problem in which $ K = i$. Note that (10.73) expresses $ J^*_i$ in terms of $ J^*_{i-1}$, but $ x_k$ is given, and the right-hand side uses $ x_{k+1}$. The indices appear to run in opposite directions because this is simply backward value iteration with a notational change that reverses some of the indices. The particular stage indices of $ x_k$ and $ x_{k+1}$ are not important in (10.73), as long as $ x_{k+1} = f(x_k,u_k,\theta_k)$ (for example, the substitutions $ x = x_k$, $ x^\prime = x_{k+1}$, $ u = u_k$, and $ \theta = \theta_k$ can be safely made).
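Under these substitutions, (10.73) can be restated equivalently (this restatement is added here for clarity; it is simply (10.73) in stage-free notation) as

$\displaystyle J^*_i(x) = \min_{u \in U(x)} \Big\{ E_{{\theta}} \Big[ l(x,u,\theta) + \alpha J^*_{i-1}(x^\prime) \Big] \Big\} ,$

in which $ x^\prime = f(x,u,\theta)$. This is the form that is iterated over the state space in the next paragraph.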

Value iteration proceeds by first letting $ J^*_0(x) = 0$ for all $ x \in X$. Successive cost-to-go functions are computed by iterating (10.73) over the state space. Whereas the cycle-avoiding assumptions of Section 10.2.1 led to convergence in a finite number of iterations, the convergence here is asymptotic due to the infinite horizon. The discounting gradually causes the cost differences to diminish until they are within the desired tolerance. The stationary form of the dynamic programming recurrence, which is obtained in the limit as $ i$ tends to infinity, is

$\displaystyle J^*(x) = \min_{u \in U(x)} \Big\{ E_{{\theta}} \Big[ l(x,u,\theta) + \alpha J^*(f(x,u,\theta)) \Big] \Big\} .$ (10.74)
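The asymptotic convergence can be seen from a standard contraction argument (this observation is supplied as an aside; it is not part of the derivation above). If $ T$ denotes the operator defined by the right-hand side of (10.74), then for any two candidate cost-to-go functions $ J$ and $ J^\prime$ and any $ x \in X$,

$\displaystyle \bigl\vert (TJ)(x) - (TJ^\prime)(x) \bigr\vert \;\leq\; \alpha \max_{x^\prime \in X} \bigl\vert J(x^\prime) - J^\prime(x^\prime) \bigr\vert ,$

so each iteration of (10.73) shrinks the largest difference between successive cost-to-go functions by at least a factor of $ \alpha < 1$.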

If the cost terms do not depend on nature, then the simplified form is

$\displaystyle J^*(x) = \min_{u \in U(x)} \Big\{ l(x,u) + \alpha \sum_{x^\prime \in X} J^*(x^\prime) P(x^\prime \vert x,u) \Big\} .$ (10.75)

As explained in Section 10.2.1, the optimal action, $ \pi ^*(x)$, is assigned as the $ u \in U(x)$ that satisfies (10.74) or (10.75) at $ x$.
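To illustrate how the recurrence might be implemented, the following is a minimal Python sketch of value iteration for the simplified form (10.75), assuming a finite state space, a known transition distribution $ P(x^\prime \vert x,u)$, and a cost term that does not depend on nature. The function names, the tolerance, and the two-state example at the end are hypothetical illustrations, not constructs from the text.

def value_iteration(X, U, P, l, alpha, tol=1e-6):
    # Compute J* and pi* for the discounted-cost model of (10.75).
    #   X     -- finite state space (list of hashable states)
    #   U     -- U(x): actions available at state x
    #   P     -- P(xp, x, u): transition probability P(x' | x, u)
    #   l     -- l(x, u): stage cost (independent of nature, as in (10.75))
    #   alpha -- discount factor, 0 < alpha < 1
    J = {x: 0.0 for x in X}  # J*_0(x) = 0 for all x in X
    while True:
        # One iteration of (10.73): compute J*_i from J*_{i-1}.
        J_next = {
            x: min(
                l(x, u) + alpha * sum(P(xp, x, u) * J[xp] for xp in X)
                for u in U(x)
            )
            for x in X
        }
        # Discounting shrinks successive differences; stop within tolerance.
        done = max(abs(J_next[x] - J[x]) for x in X) < tol
        J = J_next
        if done:
            break
    # Assign pi*(x) as the u in U(x) that satisfies (10.75) at x.
    pi = {
        x: min(U(x), key=lambda u: l(x, u)
               + alpha * sum(P(xp, x, u) * J[xp] for xp in X))
        for x in X
    }
    return J, pi

# Hypothetical two-state example: staying in state 'a' is free;
# every other state-action combination costs 1.
X = ['a', 'b']
U = lambda x: ['stay', 'move']
P = lambda xp, x, u: 1.0 if (xp == x) == (u == 'stay') else 0.0
l = lambda x, u: 0.0 if (x == 'a' and u == 'stay') else 1.0
J, pi = value_iteration(X, U, P, l, alpha=0.9)
# Expected: J == {'a': 0.0, 'b': 1.0} and pi == {'a': 'stay', 'b': 'move'}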
