A value-iteration method can be derived by adapting the derivation that was applied to (10.33) to instead apply to (10.108). This leads to the dynamic programming recurrence

$$\overline{G}^*_k(x_k) = \min_{u_k \in U(x_k)} \left\{ \max_{v_k \in V(x_k)} \Big\{ l(x_k,u_k,v_k) + \overline{G}^*_{k+1}(x_{k+1}) \Big\} \right\} \qquad (10.110)$$

for the upper value and

$$\underline{G}^*_k(x_k) = \max_{v_k \in V(x_k)} \left\{ \min_{u_k \in U(x_k)} \Big\{ l(x_k,u_k,v_k) + \underline{G}^*_{k+1}(x_{k+1}) \Big\} \right\} \qquad (10.111)$$

for the lower value, in which $x_{k+1} = f(x_k,u_k,v_k)$. Starting from the final stage, $F = K+1$, the upper and lower values are
determined directly from the cost function:

$$\overline{G}^*_F(x_F) = \underline{G}^*_F(x_F) = l_F(x_F).$$

Suppose for now that $\overline{G}^*_F(x_F) = \underline{G}^*_F(x_F)$ for all $x_F \in X$. The value iterations proceed in the usual way from $k = K$
down to $k = 1$. Again, suppose that at every stage, $\overline{G}^*_k(x_k) = \underline{G}^*_k(x_k)$ for all $x_k \in X$. Note that $G^*_{k+1}$ can
be written in the place of $\overline{G}^*_{k+1}$ and $\underline{G}^*_{k+1}$ in
(10.110) and (10.111) because it is assumed
that the upper and lower values coincide. If they do not, then the
method fails because randomized plans are needed to obtain a
randomized saddle point.
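For a finite game, the recurrences (10.110) and (10.111) can be evaluated directly by backward induction. The following is a minimal Python sketch, assuming some finite representation of the state set `X`, the state-dependent action sets `U(x)` and `V(x)`, the transition function `f`, the stage cost `l`, and the final cost `lF`; these names are placeholders rather than anything prescribed in the text. It also records the states and stages at which the upper and lower values disagree, which signals that no deterministic saddle point exists there.

```python
def value_iteration(X, U, V, f, l, lF, K):
    """Backward value iteration for a finite sequential zero-sum game.

    X     -- iterable of states
    U, V  -- functions giving the finite action sets of P1 and P2 at a state
    f     -- transition function f(x, u, v)
    l     -- stage cost l(x, u, v)
    lF    -- final cost lF(x)
    K     -- number of decision stages
    """
    upper = [None] * (K + 2)
    lower = [None] * (K + 2)
    # Final stage F = K + 1: both values equal the final cost.
    upper[K + 1] = {x: lF(x) for x in X}
    lower[K + 1] = {x: lF(x) for x in X}
    no_saddle = []  # (k, x) pairs where the upper and lower values differ

    for k in range(K, 0, -1):          # k = K down to k = 1
        upper[k], lower[k] = {}, {}
        for x in X:
            # Upper value: min over u of max over v, as in (10.110).
            upper[k][x] = min(
                max(l(x, u, v) + upper[k + 1][f(x, u, v)] for v in V(x))
                for u in U(x))
            # Lower value: max over v of min over u, as in (10.111).
            lower[k][x] = max(
                min(l(x, u, v) + lower[k + 1][f(x, u, v)] for u in U(x))
                for v in V(x))
            if upper[k][x] != lower[k][x]:
                no_saddle.append((k, x))
    return upper, lower, no_saddle
```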
Once the resulting values are computed from each initial state $x_1 \in X$, a
security plan $\pi_1$ for ${\rm P}_1$ is defined for each $k$ and $x_k$ as any action $u \in U(x_k)$ that satisfies the $\min$ in (10.110). A security plan $\pi_2$ is similarly
defined for ${\rm P}_2$ by applying any action $v \in V(x_k)$ that satisfies the $\max$ in (10.111).
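As an illustration, and continuing the placeholder setup sketched above, the security plans can be read off from the computed cost-to-go tables by recording which actions attain the $\min$ in (10.110) and the $\max$ in (10.111). The function below is illustrative only; it takes everything it needs as arguments.

```python
def extract_security_plans(X, U, V, f, l, upper, lower, K):
    """Derive security plans pi1, pi2 from the upper/lower cost-to-go tables.

    pi1[k][x] -- an action for P1 attaining the min in (10.110)
    pi2[k][x] -- an action for P2 attaining the max in (10.111)
    """
    pi1 = [None] * (K + 1)
    pi2 = [None] * (K + 1)
    for k in range(1, K + 1):
        pi1[k], pi2[k] = {}, {}
        for x in X:
            # P1: action whose worst-case (max over v) cost-to-go is smallest.
            pi1[k][x] = min(
                U(x),
                key=lambda u: max(l(x, u, v) + upper[k + 1][f(x, u, v)]
                                  for v in V(x)))
            # P2: action whose best-case (min over u) cost-to-go is largest.
            pi2[k][x] = max(
                V(x),
                key=lambda v: min(l(x, u, v) + lower[k + 1][f(x, u, v)]
                                  for u in U(x)))
    return pi1, pi2
```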
Now suppose that there exists no deterministic saddle point from one
or more initial states. To avoid regret, randomized security plans
must be developed. These follow by direct extension of the randomized
security strategies from Section 9.3.3. The vectors $w$ and $z$ will be used here to denote probability distributions over $U(x)$ and $V(x)$, respectively. The probability vectors are selected from $W$ and $Z$, which correspond to the set of all probability distributions over $U(x)$ and $V(x)$, respectively. For notational convenience, assume $U(x) = \{1, \ldots, m\}$ and $V(x) = \{1, \ldots, n\}$, in which $m$ and $n$ are positive integers.
Recall (9.61) and (9.62), which defined the
randomized upper and lower values of a single-stage game. This idea
is generalized here to the randomized upper and lower values of a sequential game. Their definitions are similar to (10.108)
and (10.109), except that: 1) the alternating $\min$'s and
$\max$'s are taken over probability distributions on the space of
actions, and 2) the expected cost is used.
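To make point 2 concrete: if, for a fixed state $x$, the one-stage costs are arranged in an $m \times n$ matrix $A$ with $A_{ij} = l(x,i,j)$, then the expected cost under $w$ and $z$ is $\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij} w_i z_j = w^\top A z$. A small numerical check, with arbitrarily chosen illustrative values:

```python
import numpy as np

# Hypothetical one-stage cost matrix for a fixed state: A[i, j] = l(x, i+1, j+1).
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

w = np.array([0.5, 0.5])    # P1's distribution over its two actions
z = np.array([0.25, 0.75])  # P2's distribution over its two actions

expected_cost = w @ A @ z   # sum over i, j of A[i, j] * w[i] * z[j]
print(expected_cost)        # 1.5
```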
The dynamic programming principle can be applied to the randomized upper value to derive

$$\overline{\cal G}^*_k(x_k) = \min_{w \in W} \left\{ \max_{z \in Z} \left\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \Big( l(x_k,i,j) + \overline{\cal G}^*_{k+1}(f(x_k,i,j)) \Big) w_i z_j \right\} \right\} \qquad (10.113)$$

and to the randomized lower value to derive

$$\underline{\cal G}^*_k(x_k) = \max_{z \in Z} \left\{ \min_{w \in W} \left\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \Big( l(x_k,i,j) + \underline{\cal G}^*_{k+1}(f(x_k,i,j)) \Big) w_i z_j \right\} \right\}. \qquad (10.114)$$

In many games, the cost term may depend only on the state: $l(x,u,v) = l(x)$ for all $x \in X$, $u \in U(x)$, and $v \in V(x)$. In this case,
(10.113) and (10.114) simplify to

$$\overline{\cal G}^*_k(x_k) = l(x_k) + \min_{w \in W} \left\{ \max_{z \in Z} \left\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \overline{\cal G}^*_{k+1}(f(x_k,i,j)) \, w_i z_j \right\} \right\}$$

and

$$\underline{\cal G}^*_k(x_k) = l(x_k) + \max_{z \in Z} \left\{ \min_{w \in W} \left\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \underline{\cal G}^*_{k+1}(f(x_k,i,j)) \, w_i z_j \right\} \right\}.$$
Value iteration can be performed over the equations above to obtain
the randomized values of the sequential game. Since the upper and
lower values are always the same, there is no need to check for
discrepancies between the two. In practice, it is best in every
evaluation of (10.113) and (10.114) (or their
simpler forms) to first check whether a deterministic saddle exists
from $x_k$. Whenever one does not exist, the linear programming
problem formulated in Section 9.3.3 must be solved to
determine the value and the best randomized plan for each player.
This can be avoided if a deterministic saddle exists from the current
state and stage.
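The following sketch shows what one evaluation of (10.113) might look like at a single state and stage, under the placeholder assumptions above: form the $m \times n$ matrix with entries $l(x_k,i,j) + {\cal G}^*_{k+1}(f(x_k,i,j))$, return its value directly if a deterministic saddle point exists, and otherwise solve the linear program for the minimizing player, in the spirit of Section 9.3.3. The use of `scipy.optimize.linprog` here is an implementation choice, not something specified in the text.

```python
import numpy as np
from scipy.optimize import linprog

def randomized_value(M):
    """Value and an optimal P1 distribution w for the matrix game with cost M.

    P1 (row player) minimizes, P2 (column player) maximizes.
    If a deterministic saddle point exists, it is returned directly; otherwise
    the LP  min t  s.t.  M^T w <= t 1,  sum(w) = 1,  w >= 0  is solved.
    """
    m, n = M.shape

    # Deterministic saddle check: lower value (max of row minima) equals
    # upper value (min of column maxima).
    lower = M.min(axis=1).max()
    upper = M.max(axis=0).min()
    if lower == upper:
        i = int(M.min(axis=1).argmax())
        w = np.zeros(m); w[i] = 1.0
        return upper, w

    # LP variables: (w_1, ..., w_m, t); minimize t.
    c = np.r_[np.zeros(m), 1.0]
    A_ub = np.c_[M.T, -np.ones(n)]   # for each column j: sum_i M[i,j] w_i - t <= 0
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:m]

# Example: entries l(x_k, i, j) + G*_{k+1}(f(x_k, i, j)) for a hypothetical state.
M = np.array([[3.0, 0.0],
              [1.0, 2.0]])
value, w = randomized_value(M)
print(value, w)   # value 1.5 with w = (0.25, 0.75); no deterministic saddle here
```

Applying the same routine to $-M^\top$ yields, after negating the returned value, the guaranteed value and an optimal distribution $z$ for ${\rm P}_2$.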