10.4.2 Evaluating a Plan via Simulation

The simulation method is based on averaging the information gained incrementally from samples. Suppose that you receive a sequence of costs, $ c_1$, $ c_2$, $ \ldots $, and would like to incrementally compute their average. You are not told the total number of samples in advance, and at any point you are required to report the current average. Let $ m_i$ denote the average of the first $ i$ samples,

$\displaystyle m_i = \frac{1}{i} \sum_{j=1}^i c_j .$ (10.80)

To efficiently compute $ m_i$ from $ m_{i-1}$, multiply $ m_{i-1}$ by $ i-1$ to recover the total, add $ c_i$, and then divide by $ i$:

$\displaystyle m_i = {(i-1) m_{i-1} + c_i \over i} .$ (10.81)

This can be manipulated into

$\displaystyle m_i = m_{i-1} + \frac{1}{i}(c_i - m_{i-1}) .$ (10.82)

Now consider the problem of estimating the expected cost-to-go, $ G_\pi (x)$, at every $ x \in X$ for some fixed plan, $ \pi $. If $ P(x'\vert x,u)$ and the costs $ l(x,u)$ were known, then it could be computed by solving

$\displaystyle G_\pi (x) = l(x,u) + \sum_{x'} P(x'\vert x, u) G_\pi (x') .$ (10.83)

However, without this information, we must rely on the simulator.

From each $ x \in X$, suppose that $ 1000$ trials are conducted, and the resulting costs to get to the goal are recorded and averaged. Each trial is an iterative process in which $ \pi $ selects the action, and the simulator indicates the next state and its incremental cost. Once the goal state is reached, the costs are totaled to yield the measured cost-to-go for that trial (this assumes that $ \pi (x)
= u_T$ for all $ x \in {X_{G}}$). If $ c_i$ denotes this total cost at trial $ i$, then the average, $ m_i$, over $ i$ trials provides an estimate of $ G_\pi (x)$. As $ i$ tends to infinity, we expect $ m_i$ to converge to $ G_\pi (x)$. The update formula (10.83) can be conveniently used to maintain the improving sequence of cost-to-go estimates. Let $ \hat{G}_\pi (x)$ denote the current estimate of $ G_\pi (x)$. The update formula based on (10.83) can be expressed as

$\displaystyle \hat{G}_\pi (x) := \hat{G}_\pi (x) + \frac{1}{i}(l(x_1,u_1) + l(x_2,u_2) + \cdots + l(x_K,u_K) - \hat{G}_\pi (x)),$ (10.84)

in which $ :=$ means assignment, in the sense used in some programming languages.

It turns out that a single trial can actually yield update values for multiple states [930,96]. Suppose that a trial is performed from $ x$ that results in the sequence $ x_1 = x$, $ x_2$, $ \ldots $, $ x_k$, $ \ldots $, $ x_K$, $ x_F$ of visited states. For every state, $ x_k$, in the sequence, a cost-to-go value can be measured by recording the cost that was accumulated from $ x_k$ to $ x_K$:

$\displaystyle c_k(x_k) = \sum_{j = k}^K l(x_j,u_j) .$ (10.85)

It is much more efficient to make use of (10.85) on every state that is visited along the path.



Subsections
Steven M LaValle 2020-08-14