Policy iteration

A simulation-based policy iteration algorithm can be derived using Q-factors. Recall from Section 10.2.2 that two steps are needed: 1) evaluate a given plan, $ \pi $, and 2) improve the plan by selecting better actions. Previously, plan evaluation involved solving a system of linear equations. Now any plan, $ \pi $, can be evaluated without even knowing $ P(x'\vert x,u)$ by using the methods of Section 10.4.2. Once $ \hat{G}_\pi $ is computed reliably for every $ x \in X$, further simulation can be used to compute $ Q_\pi (x,u)$ for each $ x \in X$ and $ u \in U(x)$. This can be achieved by defining a version of (10.99) that is constrained to $ \pi $:

$\displaystyle Q_\pi (x,u) = l(x,u) + \sum_{x^\prime \in X} P(x^\prime\vert x,u) G_\pi (x^\prime) .$ (10.102)
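
As a concrete illustration, the following Python sketch estimates $ Q_\pi (x,u)$ in the manner just described: for each state-action pair, a simulator is used to sample next states, and the sampled cost-to-go values are averaged to approximate the expectation in (10.102). The interfaces simulate_step, l, U, and G_hat are hypothetical stand-ins (not from the text) for the simulator, the cost term $ l(x,u)$, the action sets, and the estimate $ \hat{G}_\pi $ obtained by the methods of Section 10.4.2.

def estimate_q_pi(X, U, l, simulate_step, G_hat, num_trials=1000):
    """Estimate Q_pi(x, u) as in (10.102) by simulation and averaging.

    Hypothetical interfaces (assumptions, not from the text):
      X                   -- finite state set
      U(x)                -- actions available at state x
      l(x, u)             -- one-stage cost
      simulate_step(x, u) -- samples a next state x' according to P(x'|x,u)
      G_hat[x]            -- cost-to-go estimate for the plan pi (Section 10.4.2)
    """
    Q = {}
    for x in X:
        for u in U(x):
            total = 0.0
            for _ in range(num_trials):
                x_next = simulate_step(x, u)   # draw x' ~ P(x'|x,u) via simulation
                total += G_hat[x_next]         # accumulate the sampled cost-to-go
            # Monte Carlo estimate of the expectation in (10.102)
            Q[(x, u)] = l(x, u) + total / num_trials
    return Q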

The transition probabilities do not need to be known; the Q-factors are computed by simulation and averaging. The plan can then be improved by setting

$\displaystyle \pi '(x) = \operatornamewithlimits{argmin}_{u \in U(x)} \Big\{ Q_\pi (x,u) \Big\} ,$ (10.103)

which is based on (10.97).
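
To tie the two steps together, the sketch below alternates simulation-based plan evaluation with the improvement rule (10.103) until the plan stops changing. Here evaluate_plan is a hypothetical placeholder for the evaluation methods of Section 10.4.2, and estimate_q_pi is the sketch given after (10.102); both names are assumptions introduced only for illustration.

def improve_plan(X, U, Q):
    # Improvement step (10.103): choose the action minimizing Q_pi(x, u).
    return {x: min(U(x), key=lambda u: Q[(x, u)]) for x in X}

def simulation_based_policy_iteration(X, U, l, simulate_step, evaluate_plan,
                                      num_trials=1000, max_iters=100):
    # evaluate_plan(pi) is a hypothetical stand-in for the methods of
    # Section 10.4.2; it returns the estimate G_hat of the cost-to-go of pi.
    pi = {x: next(iter(U(x))) for x in X}          # arbitrary initial plan
    for _ in range(max_iters):
        G_hat = evaluate_plan(pi)                  # 1) evaluate pi by simulation
        Q = estimate_q_pi(X, U, l, simulate_step, G_hat, num_trials)
        pi_new = improve_plan(X, U, Q)             # 2) improve pi via (10.103)
        if pi_new == pi:                           # no action changed: stop
            return pi
        pi = pi_new
    return pi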
