Policy iteration

A simulation-based policy iteration algorithm can be derived using Q-factors. Recall from Section 10.2.2 that two steps are needed: 1) evaluate a given plan, $ \pi $, and 2) improve the plan by selecting better actions. Previously, plan evaluation involved solving a system of linear equations. Now any plan, $ \pi $, can be evaluated without even knowing $ P(x'\vert x,u)$ by using the methods of Section 10.4.2. Once $ \hat{G}_\pi $ is computed reliably for every $ x \in X$, further simulation can be used to compute $ Q_\pi (x,u)$ for each $ x \in X$ and $ u \in U(x)$. This can be achieved by defining a version of (10.99) that is constrained to $ \pi $:

$\displaystyle Q_\pi (x,u) = l(x,u) + \sum_{x^\prime \in X} P(x^\prime\vert x,u) G_\pi (x^\prime) .$ (10.102)
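
As a concrete illustration, the following Python sketch estimates $ Q_\pi (x,u)$ in the manner just described: for each state-action pair, a simulator is used to sample next states, and the sampled cost-to-go values are averaged to approximate the expectation in (10.102). The interfaces simulate_step, l, U, and G_hat are hypothetical stand-ins (not from the text) for the simulator, the cost term $ l(x,u)$, the action sets, and the estimate $ \hat{G}_\pi $ obtained by the methods of Section 10.4.2.

def estimate_q_pi(X, U, l, simulate_step, G_hat, num_trials=1000):
    """Estimate Q_pi(x, u) as in (10.102) by simulation and averaging.

    Hypothetical interfaces (assumptions, not from the text):
      X                   -- finite state set
      U(x)                -- actions available at state x
      l(x, u)             -- one-stage cost
      simulate_step(x, u) -- samples a next state x' according to P(x'|x,u)
      G_hat[x]            -- cost-to-go estimate for the plan pi (Section 10.4.2)
    """
    Q = {}
    for x in X:
        for u in U(x):
            total = 0.0
            for _ in range(num_trials):
                x_next = simulate_step(x, u)   # draw x' ~ P(x'|x,u) via simulation
                total += G_hat[x_next]         # accumulate the sampled cost-to-go
            # Monte Carlo estimate of the expectation in (10.102)
            Q[(x, u)] = l(x, u) + total / num_trials
    return Q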

The transition probabilities do not need to be known; the Q-factors are computed by simulation and averaging. The plan can then be improved by setting

$\displaystyle \pi '(x) = \operatornamewithlimits{argmin}_{u \in U(x)} \Big\{ Q_\pi (x,u) \Big\} ,$ (10.103)

which is based on (10.97).
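
To tie the two steps together, the sketch below alternates simulation-based plan evaluation with the improvement rule (10.103) until the plan stops changing. Here evaluate_plan is a hypothetical placeholder for the evaluation methods of Section 10.4.2, and estimate_q_pi is the sketch given after (10.102); both names are assumptions introduced only for illustration.

def improve_plan(X, U, Q):
    # Improvement step (10.103): choose the action minimizing Q_pi(x, u).
    return {x: min(U(x), key=lambda u: Q[(x, u)]) for x in X}

def simulation_based_policy_iteration(X, U, l, simulate_step, evaluate_plan,
                                      num_trials=1000, max_iters=100):
    # evaluate_plan(pi) is a hypothetical stand-in for the methods of
    # Section 10.4.2; it returns the estimate G_hat of the cost-to-go of pi.
    pi = {x: next(iter(U(x))) for x in X}          # arbitrary initial plan
    for _ in range(max_iters):
        G_hat = evaluate_plan(pi)                  # 1) evaluate pi by simulation
        Q = estimate_q_pi(X, U, l, simulate_step, G_hat, num_trials)
        pi_new = improve_plan(X, U, Q)             # 2) improve pi via (10.103)
        if pi_new == pi:                           # no action changed: stop
            return pi
        pi = pi_new
    return pi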
