A simulation-based policy iteration algorithm can be derived using Q-factors. Recall from Section 10.2.2 that methods are needed to: 1) evaluate a given plan, $\pi$, and 2) improve the plan by selecting better actions. The plan evaluation previously involved linear equation solving. Now any plan, $\pi$, can be evaluated without even knowing $P(x_{k+1}|x_k,u_k)$ by using the methods of Section 10.4.2. Once $G_\pi$ is computed reliably from every $x \in X$, further simulation can be used to compute $Q_\pi(x,u)$ for each $x \in X$ and $u \in U(x)$. This can be achieved by defining a version of (10.99) that is constrained to $\pi$:

$$Q_\pi(x_k,u_k) = l(x_k,u_k) + \sum_{x_{k+1} \in X} P(x_{k+1}|x_k,u_k) \, Q_\pi(x_{k+1},\pi(x_{k+1})).$$

The transition probabilities do not need to be known. The Q-factors are computed by simulation and averaging. The plan can be improved by setting

$$\pi'(x) = \operatorname*{argmin}_{u \in U(x)} \big\{ Q_\pi(x,u) \big\},$$

which is based on (10.97).
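
The evaluate-then-improve loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's algorithm verbatim. The toy problem (a chain of states 0 to 4 with goal 4), the `simulate` function, the action names `"step"` and `"run"`, and the sample counts are all hypothetical. Plain Monte Carlo rollouts stand in for the temporal-difference methods of Section 10.4.2 when estimating the cost-to-go of the current plan. The sketch also uses the fact that the Q-factor of the plan's own action equals the plan's cost-to-go, so each Q-factor is estimated by averaging the one-stage cost plus the estimated cost-to-go at the sampled next state.

```python
import random

GOAL = 4
STATES = range(GOAL)            # non-goal states 0..3 (hypothetical toy problem)
ACTIONS = ("step", "run")       # hypothetical action names

def simulate(x, u, rng):
    """Black-box simulator (assumed): returns (stage cost, next state).
    The planner never reads the probabilities hidden inside."""
    cost, p_advance = (1.0, 0.5) if u == "step" else (1.2, 0.9)
    return cost, (x + 1 if rng.random() < p_advance else x)

def estimate_G(plan, rng, rollouts=2000):
    """Estimate the cost-to-go of `plan` from every state by averaging
    simulated rollouts. Assumes every plan reaches GOAL with probability one."""
    G = {GOAL: 0.0}
    for x0 in STATES:
        total = 0.0
        for _ in range(rollouts):
            x = x0
            while x != GOAL:
                cost, x = simulate(x, plan[x], rng)
                total += cost
        G[x0] = total / rollouts
    return G

def estimate_Q(G, rng, samples=2000):
    """Estimate Q-factors of the evaluated plan by averaging
    (stage cost + G at the sampled next state) over simulated transitions."""
    Q = {}
    for x in STATES:
        for u in ACTIONS:
            total = 0.0
            for _ in range(samples):
                cost, x_next = simulate(x, u, rng)
                total += cost + G[x_next]
            Q[x, u] = total / samples
    return Q

def improve(Q):
    """Plan improvement: choose the action with the smallest Q-factor."""
    return {x: min(ACTIONS, key=lambda u: Q[x, u]) for x in STATES}

def policy_iteration(plan, rng, max_iters=10):
    """Alternate evaluation and improvement until the plan stops changing.
    Returns the final plan and the Q-factors from the last evaluation."""
    for _ in range(max_iters):
        Q = estimate_Q(estimate_G(plan, rng), rng)
        new_plan = improve(Q)
        if new_plan == plan:
            break
        plan = new_plan
    return plan, Q

rng = random.Random(0)
plan, Q = policy_iteration({x: "step" for x in STATES}, rng)
print(plan)
```

In this toy problem the expected cost per unit of progress is 1.0/0.5 = 2 for `"step"` and 1.2/0.9, about 1.33, for `"run"`, so the loop should switch every state to `"run"`. Because the Q-factors are sample averages, the sample counts have to be large enough that estimation noise does not flip the comparison made in `improve`.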

Steven M LaValle 2020-08-14