The EM algorithm

The problem is often solved in practice by applying the expectation-maximization (EM) algorithm [106]. In the general framework, there are three different spaces:

  1. A set of parameters, which are to be determined through some measurement and estimation process. In our problem, this represents $ E$, because the main goal is to determine the environment.
  2. A set of data, which provide information that can be used to estimate the parameters. In the localization and mapping problem, this corresponds to the history I-space $ {\cal I}_K$. Each history I-state $ {\eta}_K \in {\cal I}_K$ is $ {\eta}_K = (p(x),{\tilde{u}}_{K-1},{\tilde{y}}_K)$, in which $ p(x)$ is a prior probability density over $ X$.
  3. A set of hidden variables, which are unknown but need to be estimated to complete the process of determining the parameters. In the localization and mapping problem, this is the configuration space $ {\cal C}$.
Since both the parameters and the hidden variables are unknown, the choice between the two may seem arbitrary. It will turn out, however, that the probability density over the hidden variables can be expressed conveniently, whereas handling the parameters is much more complicated.

The EM algorithm involves an expectation step followed by a maximization step. The two steps are repeated as necessary until a solution with the desired accuracy is obtained. The method is guaranteed to converge under general conditions [269,977,978]. In practice, it appears to work well even in cases for which convergence is not theoretically guaranteed [940].
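To make the iteration concrete, the following Python sketch shows the overall loop structure; the e_step and m_step arguments are hypothetical placeholders for implementations of (12.30) and (12.32), which are developed below.

```python
def em(eta_K, e_init, e_step, m_step, max_iters=100):
    """Generic EM iteration: alternate expectation and maximization steps.

    e_step(eta_K, e_hat) -> a function Q(e), as in (12.30)  (user-supplied)
    m_step(Q)            -> an improved estimate of e, as in (12.32)
    """
    e_hat = e_init
    for _ in range(max_iters):
        Q = e_step(eta_K, e_hat)   # expectation step
        e_next = m_step(Q)         # maximization step
        if e_next == e_hat:        # no further improvement found
            break
        e_hat = e_next
    return e_hat
```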

From this point onward, let $ E$, $ {\cal I}_K$, and $ {\cal C}$ denote the three spaces for the EM algorithm because they pertain directly to the problem. Suppose that a robot has moved in the environment for $ K-1$ stages, resulting in a final stage, $ K$. At each stage, $ k \in \{1,\ldots,K\}$, an observation, $ y_k$, is made using its sensor. This could, for example, represent a set of distance measurements made by sonars or a range scanner. Furthermore, an action, $ u_k$, is applied for $ k=1$ to $ k = K-1$. A prior probability density function, $ p(x)$, is initially assumed over $ X$. This leads to the history I-state, $ {\eta}_k$, as defined in (11.14).
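For illustration only, the history I-state can be stored as a simple container; the field names below are not from the text, merely one possible encoding.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HistoryIState:
    """Container for eta_K = (p(x), u_1, ..., u_{K-1}, y_1, ..., y_K)."""
    prior: Callable      # prior density p(x) over X
    actions: List        # u_1, ..., u_{K-1}
    observations: List   # y_1, ..., y_K
```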

Now imagine that $ K$ stages have been executed, and the task is to estimate $ e$. From each $ q_k$, a measurement, $ y_k$, of part of the environment is taken. The EM algorithm generates a sequence of improved estimates of $ e$. In each execution of the two EM steps, a new estimate of $ e \in E$ is produced. Let $ \hat{e}_i$ denote this estimate after the $ i$th iteration. Let $ {\tilde q}_K$ denote the configuration history from stage $ 1$ to stage $ K$. The expectation step computes the expected likelihood of $ \eta_K$ given $ \hat{e}_{i-1}$. This can be expressed as

\begin{displaymath}\begin{split}Q(e,\hat{e}_{i-1}) = & E \left[ p({\eta}_K,{\tilde q}_K\vert\;e) \;\Big\vert\; {\eta}_K,\hat{e}_{i-1} \right] \\ = & \int_{{\tilde q}_K} p({\eta}_K,{\tilde q}_K\vert\;e) \, p({\tilde q}_K\vert\;{\eta}_K,\hat{e}_{i-1}) \, d{\tilde q}_K , \end{split}\end{displaymath} (12.30)

in which the expectation is taken over the configuration histories. Since $ {\eta}_K$ is given and the expectation integrates out $ {\tilde q}_K$, (12.30) is a function only of $ e$ and $ \hat{e}_{i-1}$. The term $ p({\eta}_K,{\tilde q}_K\vert\;e)$ can be expressed as

$\displaystyle p({\eta}_K,{\tilde q}_K\vert\;e) = p({\tilde q}_K \vert \;{\eta}_K,e) p({\eta}_K\vert e) ,$ (12.31)

in which $ p({\eta}_K\vert e)$ is a prior density over the I-space, given nothing but the environment $ e$. The factor $ p({\tilde q}_K \vert \;{\eta}_K,e)$ differs from the second factor of the integrand in (12.30) only by conditioning on $ e$ instead of $ \hat{e}_{i-1}$. The main difficulty in evaluating (12.30) is to evaluate $ p({\tilde q}_K\vert\;{\eta}_K,\hat{e}_{i-1})$ (or the version that uses $ e$). This is essentially a localization problem with a given map, as considered in Section 12.2.3. The information up to stage $ k$ can be applied to yield the probabilistic I-state $ p(q_k\vert\;{\eta}_k,\hat{e}_{i-1})$ for each $ q_k$; however, this neglects the information from the remaining stages. The later information can be used to make inferences about old configurations: based on current measurements and memory of the actions that were applied, better information is available regarding the configuration several stages ago. In [941], a method of computing $ p(q_k\vert\;{\eta}_K,\hat{e}_{i-1})$ is given that computes two terms: one is the forward I-state $ p(q_k\vert\;{\eta}_k,\hat{e}_{i-1})$, and the other is a backward probabilistic I-state that starts at stage $ K$ and runs down to $ k+1$.
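This combination of a forward and a backward pass is the same pattern as forward-backward smoothing for hidden Markov models. As a rough sketch, suppose the configuration space has been discretized into finitely many values, with transition and observation models given as arrays (an assumption made here only to keep the example concrete, not something prescribed by the text):

```python
import numpy as np

def smooth(transition, likelihood, prior):
    """Forward-backward smoothing over a discretized configuration space.

    transition[k]: (n x n) matrix of q -> next-q probabilities under u_{k+1}
    likelihood[k]: length-n vector of p(y_{k+1} | q) for each discretized q
    prior:         length-n initial density over configurations
    Returns p(q_k | full history) for every stage k.
    """
    K = len(likelihood)
    n = len(prior)
    fwd = np.zeros((K, n))
    bwd = np.ones((K, n))

    # Forward pass: incorporate observations and actions up to stage k.
    f = prior * likelihood[0]
    fwd[0] = f / f.sum()
    for k in range(1, K):
        f = (fwd[k - 1] @ transition[k - 1]) * likelihood[k]
        fwd[k] = f / f.sum()

    # Backward pass: propagate information from stages K down to k+1.
    for k in range(K - 2, -1, -1):
        b = transition[k] @ (likelihood[k + 1] * bwd[k + 1])
        bwd[k] = b / b.sum()

    # Smoothed density is proportional to the product of the two terms.
    smoothed = fwd * bwd
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```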
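Once smoothed densities over the configurations are available, the integral in (12.30) can be approximated by Monte Carlo: draw configuration histories from $ p({\tilde q}_K\vert\;{\eta}_K,\hat{e}_{i-1})$ and average the remaining factor of the integrand. In the sketch below, both the sampler and the joint density evaluator are assumed inputs, not constructions from the text:

```python
def estimate_Q(e, sampled_histories, joint_density):
    """Monte Carlo approximation of Q(e, e_hat_prev) in (12.30).

    sampled_histories: configuration histories q_1, ..., q_K drawn from
        p(q_tilde_K | eta_K, e_hat_prev), e.g., using the smoothed
        densities computed above (sampler assumed, not shown)
    joint_density(q_hist, e): evaluates p(eta_K, q_tilde_K | e)
    """
    total = 0.0
    for q_hist in sampled_histories:
        total += joint_density(q_hist, e)
    return total / len(sampled_histories)
```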

Note that once determined, (12.30) is a function only of $ e$ and $ \hat{e}_{i-1}$. The maximization step involves selecting an $ \hat{e}_i$ that maximizes (12.30):

$\displaystyle \hat{e}_i = \operatornamewithlimits{argmax}_{e \in E} Q(e,\hat{e}_{i-1}) .$ (12.32)

This optimization is often too difficult to perform exactly; however, convergence is still guaranteed if $ \hat{e}_i$ is chosen such that $ Q(\hat{e}_i,\hat{e}_{i-1}) > Q(\hat{e}_{i-1},\hat{e}_{i-1})$. Repeated iterations of the EM algorithm result in a kind of gradient ascent that arrives at a local maximum in $ E$.
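A sketch of this weaker, generalized maximization step is shown below; the generator of candidate environments near the current estimate is hypothetical (for instance, perturbed landmark positions):

```python
def generalized_m_step(Q, e_prev, local_proposals):
    """Return any e with Q(e, e_prev) > Q(e_prev, e_prev).

    local_proposals(e_prev): yields candidate environments near e_prev
    (hypothetical; the choice of proposal scheme depends on E).
    """
    best, best_val = e_prev, Q(e_prev, e_prev)
    for e in local_proposals(e_prev):
        val = Q(e, e_prev)
        if val > best_val:
            best, best_val = e, val
    return best   # equals e_prev if no improving candidate was found
```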

One important factor in the success of the method is the representation of $ E$. In the EM computations, one common approach is to use a set of landmarks, which were mentioned in Section 11.5.1. These are special places in the environment that can be identified by sensors and, if correctly classified, dramatically improve localization. In [941], the landmarks are indicated by a user as the robot travels. Classification and positioning errors can both be modeled probabilistically and incorporated into the EM approach. Another idea that dramatically simplifies the representation of $ E$ is to approximate environments with a fine-resolution grid. Probabilities are associated with grid cells, which leads to a data structure called an occupancy grid [307,685,850]. In any case, $ E$ must be carefully defined so that a reasonable prior distribution $ p(e)$ can be specified to initialize the EM algorithm as the robot first moves.
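As an illustration of the occupancy grid idea, each cell can store the log-odds of being occupied and be updated independently for each scan, which keeps the computation tractable. The inverse sensor model below is an assumed input, not something specified in the text:

```python
import numpy as np

def update_occupancy_grid(log_odds, scanned_cells, inverse_model):
    """Log-odds occupancy grid update for one range scan.

    log_odds:      2D array of per-cell log-odds of occupancy
    scanned_cells: (i, j) indices of cells touched by the current scan
    inverse_model(i, j): probability the cell is occupied given the
                         scan (assumed inverse sensor model)
    """
    for (i, j) in scanned_cells:
        p = inverse_model(i, j)
        log_odds[i, j] += np.log(p / (1.0 - p))
    return log_odds

def occupancy_probabilities(log_odds):
    """Recover per-cell occupancy probabilities from log-odds."""
    return 1.0 / (1.0 + np.exp(-log_odds))
```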
