10.4.2 Evaluating a Plan via Simulation

The simulation method is based on averaging the information gained incrementally from samples. Suppose that you receive a sequence of costs, , , $\ldots$ , and would like to incrementally compute their average. You are not told the total number of samples in advance, and at any point you are required to report the current average. Let denote the average of the first samples,

$\displaystyle m_i = \frac{1}{i} \sum_{j=1}^i c_j .$

(10.80)

To efficiently compute

from $m_{i-1}$ , multiply $m_{i-1}$ by

to recover the total, add

, and then divide by

$\displaystyle m_i = {(i-1) m_{i-1} + c_i \over i} .$

(10.81)

This can be manipulated into

$\displaystyle m_i = m_{i-1} + \frac{1}{i}(c_i - m_{i-1}) .$

(10.82)

Now consider the problem of estimating the expected cost-to-go, $G_\pi (x)$ , at every $x \in X$ for some fixed plan, $\pi$ . If $P(x'\vert x,u)$ and the costs were known, then it could be computed by solving

$\displaystyle G_\pi (x) = l(x,u) + \sum_{x'} P(x'\vert x, u) G_\pi (x') .$

(10.83)

However, without this information, we must rely on the simulator.

From each $x \in X$ , suppose that trials are conducted, and the resulting costs to get to the goal are recorded and averaged. Each trial is an iterative process in which $\pi$ selects the action, and the simulator indicates the next state and its incremental cost. Once the goal state is reached, the costs are totaled to yield the measured cost-to-go for that trial (this assumes that $\pi (x) = u_T$ for all $x \in {X_{G}}$ ). If denotes this total cost at trial , then the average, , over trials provides an estimate of $G_\pi (x)$ . As tends to infinity, we expect to converge to $G_\pi (x)$ . The update formula (10.83) can be conveniently used to maintain the improving sequence of cost-to-go estimates. Let $\hat{G}_\pi (x)$ denote the current estimate of $G_\pi (x)$ . The update formula based on (10.83) can be expressed as

$\displaystyle \hat{G}_\pi (x) := \hat{G}_\pi (x) + \frac{1}{i}(l(x_1,u_1) + l(x_2,u_2) + \cdots + l(x_K,u_K) - \hat{G}_\pi (x)),$

(10.84)

in which

means assignment, in the sense used in some programming languages.

It turns out that a single trial can actually yield update values for multiple states [930,96]. Suppose that a trial is performed from that results in the sequence , , $\ldots$ , , $\ldots$ , , of visited states. For every state, , in the sequence, a cost-to-go value can be measured by recording the cost that was accumulated from to :

$\displaystyle c_k(x_k) = \sum_{j = k}^K l(x_j,u_j) .$

(10.85)

It is much more efficient to make use of (10.85) on every state that is visited along the path.

Subsections

- Temporal differences

Steven M LaValle 2020-08-14