10.4.3 Q-Learning: Computing an Optimal Plan

This section moves from evaluating a plan to computing an optimal plan in the simulation-based framework. The most important idea is the computation of Q-factors, $Q^*(x,u)$. These extend the optimal cost-to-go, $G^*$, by recording the optimal cost for each combination of a state, $x \in X$, and action, $u \in U(x)$. The interpretation of $Q^*(x,u)$ is the expected cost incurred by starting from state $x$, applying $u$, and then following the optimal plan from the resulting next state, $x' = f(x,u,\theta)$. If $u$ happens to be the same action as would be selected by the optimal plan, $\pi^*(x)$, then $Q^*(x,u) = G^*(x)$; in general, $G^*(x) = \min_{u \in U(x)} Q^*(x,u)$. Thus, the Q-factor can be thought of as the cost of making an arbitrary choice in the first stage and then behaving optimally afterward.
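To make the idea concrete, the following sketch estimates Q-factors by simulation on a tiny toy problem (the states, transitions, and per-stage cost of 1 are invented for illustration; they are not from the text). Because this framework minimizes cost rather than maximizes reward, the update bootstraps with a minimum over the next state's actions: $Q(x,u) \leftarrow Q(x,u) + \alpha\,[\,l(x,u) + \min_{u'} Q(x',u') - Q(x,u)\,]$.

```python
import random

# Hypothetical toy problem (illustrative only): states 0 and 1, absorbing
# goal state 2 with zero cost-to-go, two actions per state, cost 1 per stage.
GOAL = 2
TRANSITION = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 0}

def q_learning(episodes=2000, alpha=0.2, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(x, u): 0.0 for x in (0, 1) for u in (0, 1)}
    for _ in range(episodes):
        x = rng.choice((0, 1))
        for _ in range(50):  # cap episode length
            if x == GOAL:
                break
            # epsilon-greedy: explore occasionally, else take the cheapest action
            if rng.random() < eps:
                u = rng.choice((0, 1))
            else:
                u = min((0, 1), key=lambda a: Q[(x, a)])
            xp = TRANSITION[(x, u)]
            # optimal cost-to-go estimate of the next state: 0 at the goal,
            # otherwise the minimum Q-factor over its actions
            g_next = 0.0 if xp == GOAL else min(Q[(xp, a)] for a in (0, 1))
            # Q-learning update in cost-minimization form, stage cost l(x,u) = 1
            Q[(x, u)] += alpha * (1.0 + g_next - Q[(x, u)])
            x = xp
    return Q
```

In this example the true Q-factors are $Q^*(1,0) = 1$ and $Q^*(0,0) = 2$ (one and two stages to the goal), and the greedy plan $\pi(x) = \operatorname{argmin}_u Q(x,u)$ recovers the optimal plan once the estimates converge.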

Steven M LaValle 2020-08-14