A value-iteration method can be derived by adapting the derivation that was applied to (10.33) to instead apply to (10.108). This leads to the dynamic programming recurrences (10.110) and (10.111) for the upper and lower values.
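A sketch of these recurrences in the chapter's notation (assuming the state transition equation $x_{k+1} = f(x_k,u_k,v_k)$ and stage-additive cost terms $l(x_k,u_k,v_k)$; the precise forms appear as (10.110) and (10.111)) is

$$\overline{L}^*_k(x_k) \;=\; \min_{u_k \in U(x_k)} \bigg\{ \max_{v_k \in V(x_k)} \Big\{ l(x_k,u_k,v_k) + \overline{L}^*_{k+1}\big(f(x_k,u_k,v_k)\big) \Big\} \bigg\} \qquad (10.110)$$

for the upper value, and, with the $\min$ and $\max$ interchanged,

$$\underline{L}^*_k(x_k) \;=\; \max_{v_k \in V(x_k)} \bigg\{ \min_{u_k \in U(x_k)} \Big\{ l(x_k,u_k,v_k) + \underline{L}^*_{k+1}\big(f(x_k,u_k,v_k)\big) \Big\} \bigg\} \qquad (10.111)$$

for the lower value.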
Starting from the final stage, $F$, the upper and lower values are determined directly from the cost function.
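A sketch of this boundary condition (assuming the terminal cost term is written $l_F(x_F)$, consistent with the chapter's notation) is

$$\overline{L}^*_F(x_F) = \underline{L}^*_F(x_F) = l_F(x_F). \qquad (10.112)$$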
Suppose for now that $\overline{L}^*_K(x_K) = \underline{L}^*_K(x_K)$ for all $x_K \in X$. The value iterations proceed in the usual way from $k = K$ down to $k = 1$. Again, suppose that at every stage, $\overline{L}^*_k(x_k) = \underline{L}^*_k(x_k)$ for all $x_k \in X$. Note that a single value, $L^*_k(x_k)$, can be written in the place of $\overline{L}^*_k(x_k)$ and $\underline{L}^*_k(x_k)$ in (10.110) and (10.111) because it is assumed that the upper and lower values coincide. If they do not, then the method fails because randomized plans are needed to obtain a randomized saddle point.
Once the resulting values are computed from each $x_k \in X$, a security plan for $P_1$ is defined for each $k$ and $x_k$ as any action $u_k \in U(x_k)$ that satisfies the $\min$ in (10.110). A security plan is similarly defined for $P_2$ by applying any action $v_k \in V(x_k)$ that satisfies the $\max$ in (10.111).
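Written out (with $\pi_1$ used here merely to denote $P_1$'s security plan, and $L^*_{k+1}$ the coinciding value from the next stage), the prescribed action is any

$$\pi_1(x_k) \in \mathop{\rm argmin}_{u_k \in U(x_k)} \bigg\{ \max_{v_k \in V(x_k)} \Big\{ l(x_k,u_k,v_k) + L^*_{k+1}\big(f(x_k,u_k,v_k)\big) \Big\} \bigg\},$$

and the plan for $P_2$ is obtained from the analogous $\mathop{\rm argmax}$ over $v_k \in V(x_k)$ in (10.111).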
Now suppose that there exists no deterministic saddle point from one or more initial states. To avoid regret, randomized security plans must be developed. These follow by direct extension of the randomized security strategies from Section 9.3.3. The vectors $w$ and $z$ will be used here to denote probability distributions over $U(x_k)$ and $V(x_k)$, respectively. The probability vectors are selected from $W$ and $Z$, which correspond to the sets of all probability distributions over $U(x_k)$ and $V(x_k)$, respectively. For notational convenience, assume $U(x_k) = \{1,\ldots,m\}$ and $V(x_k) = \{1,\ldots,n\}$, in which $m$ and $n$ are positive integers.
Recall (9.61) and (9.62), which defined the randomized upper and lower values of a single-stage game. This idea is generalized here to the randomized upper and lower values of a sequential game. Their definitions are similar to (10.108) and (10.109), except that: 1) the alternating $\min$'s and $\max$'s are taken over probability distributions on the space of actions, and 2) the expected cost is used.
The dynamic programming principle can be applied to the randomized upper and lower values to derive the recurrences (10.113) and (10.114).
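A sketch of these recurrences, under the indexing convention above and writing $\overline{\cal L}^*$ and $\underline{\cal L}^*$ for the randomized upper and lower values (these symbols are chosen here only for exposition), replaces the alternating $\min$ and $\max$ of (10.110) and (10.111) by optimizations over $W$ and $Z$ and weights the costs by $w_i z_j$:

$$\overline{\cal L}^*_k(x_k) \;=\; \min_{w \in W} \bigg\{ \max_{z \in Z} \bigg\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \Big( l(x_k,i,j) + \overline{\cal L}^*_{k+1}\big(f(x_k,i,j)\big) \Big) w_i z_j \bigg\} \bigg\} \qquad (10.113)$$

and

$$\underline{\cal L}^*_k(x_k) \;=\; \max_{z \in Z} \bigg\{ \min_{w \in W} \bigg\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \Big( l(x_k,i,j) + \underline{\cal L}^*_{k+1}\big(f(x_k,i,j)\big) \Big) w_i z_j \bigg\} \bigg\}. \qquad (10.114)$$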
In many games, the cost term may depend only on the state: $l(x_k,u_k,v_k) = l(x_k)$ for all $x_k \in X$, $u_k \in U(x_k)$, and $v_k \in V(x_k)$. In this case, (10.113) and (10.114) simplify because $l(x_k)$ can be moved outside of the $\min$ and $\max$.
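A sketch of the simplified upper-value recurrence (using the same hypothetical symbols as above, and the fact that $\sum_{i=1}^m \sum_{j=1}^n w_i z_j = 1$) is

$$\overline{\cal L}^*_k(x_k) \;=\; l(x_k) + \min_{w \in W} \bigg\{ \max_{z \in Z} \bigg\{ \sum_{i=1}^{m} \sum_{j=1}^{n} \overline{\cal L}^*_{k+1}\big(f(x_k,i,j)\big)\, w_i z_j \bigg\} \bigg\},$$

with the analogous simplification for the lower value.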
Value iteration can be performed over the equations above to obtain the randomized values of the sequential game. Since the upper and lower values are always the same, there is no need to check for discrepancies between the two. In practice, it is best in every evaluation of (10.113) and (10.114) (or their simpler forms) to first check whether a deterministic saddle exists from $x_k$. Whenever one does not exist, the linear programming problem formulated in Section 9.3.3 must be solved to determine the value and the best randomized plan for each player. This can be avoided if a deterministic saddle exists from the current state and stage.
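To make the procedure concrete, the following is a minimal sketch (not from the book) of one value-iteration backup step in Python. For each state it forms the matrix of one-stage cost plus cost-to-go, checks first for a deterministic saddle point, and only otherwise solves the linear program of Section 9.3.3 (here via scipy). The names states, U, V, f, l, and L_next are hypothetical placeholders for $X$, $U(x)$, $V(x)$, the state transition function, the cost term, and the values $L^*_{k+1}$ from the previous iteration.

import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Value and P1's (possibly randomized) strategy for min_w max_z w^T A z."""
    m, n = A.shape
    # Check for a deterministic saddle point first, which avoids the LP.
    upper = A.max(axis=1).min()   # min over rows of the row maxima
    lower = A.min(axis=0).max()   # max over columns of the column minima
    if np.isclose(upper, lower):
        w = np.zeros(m)
        w[int(A.max(axis=1).argmin())] = 1.0  # deterministic security action
        return upper, w
    # Linear program from Section 9.3.3 (P1 minimizes):
    #   minimize t  subject to  A^T w <= t * 1,  sum(w) = 1,  w >= 0.
    c = np.zeros(m + 1); c[-1] = 1.0          # variables are (w_1, ..., w_m, t)
    A_ub = np.hstack([A.T, -np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:m]

def backup(states, U, V, f, l, L_next):
    """One stage of value iteration; returns L*_k and a randomized plan for P1."""
    L_k, plan = {}, {}
    for x in states:
        A = np.array([[l(x, u, v) + L_next[f(x, u, v)] for v in V(x)]
                      for u in U(x)])
        L_k[x], plan[x] = solve_matrix_game(A)
    return L_k, plan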