10.1.1 Model Definition

The formulation presented in this section is an extension of Formulation 2.3 that incorporates the effects of nature at every stage. Let $ X$ denote a discrete state space, and let $ U(x)$ denote the set of actions available to the decision maker (or robot) from state $ x \in X$. At each stage $ k$ it is assumed that a nature action $ \theta_k$ is chosen from a set $ \Theta(x_k,u_k)$. This can be considered as a multi-stage generalization of Formulation 9.4, which introduced $ \Theta(u)$. Now $ \Theta$ may depend on the state in addition to the action because both $ x_k$ and $ u_k$ are available in the current setting. This implies that nature acts with the knowledge of the action selected by the decision maker. It is always assumed that during stage $ k$, the decision maker does not know the particular nature action that will be chosen. It does, however, know the set $ \Theta(x_k,u_k)$ for all $ x_k \in X$ and $ u_k \in U(x_k)$.

As in Section 9.2, there are two alternative nature models: nondeterministic or probabilistic. If the nondeterministic model is used, then it is only known that nature will make a choice from $ \Theta(x_k,u_k)$. In this case, making decisions using worst-case analysis is appropriate.

If the probabilistic model is used, then a probability distribution over $ \Theta(x_k,u_k)$ is specified as part of the model. The most important assumption to keep in mind for this case is that nature is Markovian. In general, this means that the probability depends only on local information. In most applications, this locality is with respect to time. In our formulation, it means that the distribution over $ \Theta(x_k,u_k)$ depends only on information obtained at the current stage. In other settings, Markovian could mean a dependency on a small number of stages, or even a local dependency in terms of spatial relationships, as in a Markov random field [231,377].

To make the Markov assumption more precise, the state and action histories as defined in Section 8.2.1 will be used again here. Let

$\displaystyle {\tilde{x}}_k = (x_1,x_2,\ldots,x_k)$ (10.1)

and

$\displaystyle {\tilde{u}}_k = (u_1,u_2,\ldots,u_k) .$ (10.2)

These represent all information that is available up to stage $ k$. Without the Markov assumption, it could be possible that the probability distribution for nature is conditioned on all of $ {\tilde{x}}_k$ and $ {\tilde{u}}_k$, to obtain $ P(\theta_k \vert {\tilde{x}}_k, {\tilde{u}}_k)$. The Markov assumption declares that for all $ \theta_k \in \Theta(x_k,u_k)$,

$\displaystyle P(\theta_k \vert {\tilde{x}}_k, {\tilde{u}}_k) = P(\theta_k \vert x_k, u_k) ,$ (10.3)

which drops all history except the current state and action. Once these two are known, there is no extra information regarding the nature action that could be gained from any portion of the histories.

The effect of nature is defined in the state transition equation, which produces a new state, $ x_{k+1}$, once $ x_k$, $ u_k$, and $ \theta_k$ are given:

$\displaystyle x_{k+1} = f(x_k,u_k,\theta_k) .$ (10.4)

From the perspective of the decision maker, $ \theta_k$ is not given. Therefore, it can only infer that a particular set of states will result from applying $ u_k$ from $ x_k$:

$\displaystyle X_{k+1}(x_k,u_k) = \{ x_{k+1} \in X \;\vert\; \exists \theta_k \in \Theta(x_k,u_k)$    such that $\displaystyle x_{k+1} = f(x_k,u_k,\theta_k)\} .$ (10.5)

In (10.5), the notation $ X_{k+1}(x_k,u_k)$ indicates a set of possible values for $ x_{k+1}$, given $ x_k$ and $ u_k$. The notation $ X_k(\cdot)$ will generally be used to indicate the possible values for $ x_k$ that can be derived using the information that appears in the argument.
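
To make the forward projection in (10.5) concrete, here is a minimal Python sketch, assuming the nature-action sets and the state transition function are available as ordinary functions; the names forward_projection, Theta, and f are illustrative placeholders rather than notation from the text.

\begin{verbatim}
# A minimal sketch of computing X_{k+1}(x_k, u_k) from (10.5).
# Theta(x, u) is assumed to return the finite set of nature actions,
# and f(x, u, theta) the resulting next state; both are placeholders.

def forward_projection(x, u, Theta, f):
    """Return X_{k+1}(x, u): all states reachable from x under u."""
    return {f(x, u, theta) for theta in Theta(x, u)}

# Example: move two units left on the integer line while nature
# perturbs the result by -1, 0, or +1.
Theta = lambda x, u: {-1, 0, 1}
f = lambda x, u, theta: x + u + theta
print(forward_projection(100, -2, Theta, f))   # {97, 98, 99}
\end{verbatim}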

In the probabilistic case, a probability distribution over $ X$ can be derived for stage $ k+1$, under the application of $ u_k$ from $ x_k$. As part of the problem, $ P(\theta_k \vert x_k,u_k)$ is given. Using the state transition equation, $ x_{k+1} = f(x_k,u_k,\theta_k)$,

$\displaystyle P(x_{k+1}\vert x_k,u_k) = \sum_{\theta_k \in \Theta^\prime} P(\theta_k \vert x_k,u_k)$ (10.6)

can be derived, in which

$\displaystyle \Theta^\prime = \{ \theta_k \in \Theta(x_k,u_k) \;\vert\; x_{k+1} = f(x_k,u_k,\theta_k) \} .$ (10.7)

The calculation of $ P(x_{k+1}\vert x_k,u_k)$ simply involves accumulating all of the probability mass that could lead to $ x_{k+1}$ from the application of various nature actions.
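
The accumulation in (10.6) and (10.7) can be sketched in the same style; here P_theta(x, u) is assumed to return the distribution $ P(\theta_k \vert x_k,u_k)$ as a dictionary, and the function name transition_distribution is again only illustrative.

\begin{verbatim}
# Sketch of deriving P(x_{k+1} | x_k, u_k) from P(theta_k | x_k, u_k)
# and the state transition equation, following (10.6)-(10.7).

from collections import defaultdict

def transition_distribution(x, u, P_theta, f):
    P_next = defaultdict(float)
    for theta, p in P_theta(x, u).items():
        # All nature actions that lead to the same next state have
        # their probability mass accumulated, as in (10.7).
        P_next[f(x, u, theta)] += p
    return dict(P_next)

# Example: uniform nature actions {-1, 0, 1} on the integer line.
P_theta = lambda x, u: {-1: 1/3, 0: 1/3, 1: 1/3}
f = lambda x, u, theta: x + u + theta
print(transition_distribution(100, -2, P_theta, f))
# {97: 0.33..., 98: 0.33..., 99: 0.33...}
\end{verbatim}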

Putting these parts of the model together and adding some of the components from Formulation 2.3 leads to the following formulation:

Formulation 10.1 (Discrete Planning with Nature)  
  1. A nonempty state space $ X$, which is a finite or countably infinite set of states.
  2. For each state, $ x \in X$, a finite, nonempty action space $ U(x)$. It is assumed that $ U$ contains a special termination action, which has the same effect as the one defined in Formulation 2.3.
  3. A finite, nonempty nature action space $ \Theta(x,u)$ for each $ x \in X$ and $ u \in U(x)$.
  4. A state transition function $ f$ that produces a state, $ f(x,u,\theta)$, for every $ x \in X$, $ u \in U$, and $ \theta \in \Theta(x,u)$.
  5. A set of stages, each denoted by $ k$, that begins at $ k=1$ and continues indefinitely. Alternatively, there may be a fixed, maximum stage $ k = K+1 = F$.
  6. An initial state $ {x_{I}}\in X$. For some problems, this may not be specified, in which case a solution plan must be found from all initial states.
  7. A goal set $ {X_{G}}\subset X$.
  8. A stage-additive cost functional $ L$. Let $ {\tilde{\theta}}_K$ denote the history of nature actions up to stage $ K$. The cost functional may be applied to any combination of state, action, and nature histories to yield

    $\displaystyle L({\tilde{x}}_F,{\tilde{u}}_K,{\tilde{\theta}}_K) = \sum_{k=1}^K l(x_k,u_k,\theta_k) + l_F(x_F) ,$ (10.8)

    in which $ F = K+1$. If the termination action $ u_T$ is applied at some stage $ k$, then for all $ i \geq k$, $ u_i = u_T$, $ x_i = x_k$, and $ l(x_i,u_T,\theta_i) = 0$.

Using Formulation 10.1, either a feasible or optimal planning problem can be defined. To obtain a feasible planning problem, let $ l(x_k,u_k,\theta_k) = 0$ for all $ x_k \in X$, $ u_k \in U$, and $ \theta_k \in \Theta(x_k,u_k)$. Furthermore, let

$\displaystyle l_F(x_F) = \left\{ \begin{array}{ll} 0 & \mbox{ if $x_F \in {X_{G}}$ } \\ \infty & \mbox{ otherwise. } \end{array}\right.$ (10.9)

To obtain an optimal planning problem, in general $ l(x_k,u_k,\theta_k)$ may assume any nonnegative, finite value if $ x_k \not \in {X_{G}}$. For problems that involve probabilistic uncertainty, it is sometimes appropriate to assign a high, finite value for $ l_F(x_F)$ if $ x_F \not \in {X_{G}}$, as opposed to assigning an infinite cost for failing to achieve the goal.
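
As a small illustration, the feasibility cost terms above can be written down directly; in this sketch the function names l and l_F mirror the notation, X_G is assumed to be a Python set of goal states, and the large finite penalty mentioned for probabilistic problems would simply replace math.inf.

\begin{verbatim}
# Cost terms for the feasibility version of Formulation 10.1, per (10.9).
import math

def l(x, u, theta):
    return 0.0                # zero cost at every stage

def l_F(x_F, X_G):
    # Zero if the goal was reached; infinite (or a large finite
    # penalty in probabilistic settings) otherwise.
    return 0.0 if x_F in X_G else math.inf
\end{verbatim}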

Note that in each stage, the cost term is generally allowed to depend on the nature action $ \theta_k$. If probabilistic uncertainty is used, then Formulation 10.1 is often referred to as a controlled Markov process or Markov decision process (MDP). If the actions are removed from the formulation, then it is simply referred to as a Markov process. In most statistical literature, the name Markov chain is used instead of Markov process when there are discrete stages (as opposed to continuous-time Markov processes). Thus, the terms controlled Markov chain and Markov decision chain may be preferable.

In some applications, it may be convenient to avoid the explicit characterization of nature. Suppose that $ l(x_k,u_k,\theta_k) = l(x_k,u_k)$. If nondeterministic uncertainty is used, then $ X_{k+1}(x_k,u_k)$ can be specified for all $ x_k \in X$ and $ u_k \in U(x_k)$ as a substitute for the state transition equation; this avoids having to refer to nature. The application of an action $ u_k$ from a state $ x_k$ directly leads to a specified subset of $ X$. If probabilistic uncertainty is used, then $ P(x_{k+1}\vert x_k,u_k)$ can be directly defined as the alternative to the state transition equation. This yields a probability distribution over $ X$, if $ u_k$ is applied from some $ x_k$, once again avoiding explicit reference to nature. Most of the time we will use a state transition equation that refers to nature; however, it is important to keep these alternatives in mind. They arise in many related books and research articles.

As used throughout Chapter 2, a directed state transition graph is sometimes convenient for expressing the planning problem. The same idea can be applied in the current setting. As in Section 2.1, $ X$ is the vertex set; however, the edge definition must change to reflect nature. A directed edge exists from state $ x$ to $ x^\prime$ if there exists some $ u \in U(x)$ and $ \theta \in \Theta(x,u)$ such that $ x^\prime = f(x,u,\theta)$. A weighted graph can be made by associating the cost term $ l(x_k,u_k,\theta_k)$ with each edge. In the case of a probabilistic model, the probability of the transition occurring may also be associated with each edge.
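
Assuming $ X$ is finite and the model components are available as functions, the weighted transition graph just described could be assembled as in the following sketch; one edge record is produced per $ (x,u,\theta)$ triple, so parallel edges between the same pair of states are preserved together with their cost terms.

\begin{verbatim}
# Sketch of building the directed state transition graph with nature.
# X is a finite iterable of states; U(x), Theta(x, u), f(x, u, theta),
# and l(x, u, theta) are the model components as Python callables.

def transition_graph(X, U, Theta, f, l):
    edges = []
    for x in X:
        for u in U(x):
            for theta in Theta(x, u):
                edges.append((x, f(x, u, theta), u, theta, l(x, u, theta)))
    return edges
\end{verbatim}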

Note that both the decision maker and nature are needed to determine which vertex will be reached. As the decision maker contemplates applying an action $ u$ from the state $ x$, it sees that there may be several outgoing edges due to nature. If a different action is contemplated, then this set of possible outgoing edges changes. Once nature applies its action, then the particular edge is traversed to arrive at the new state; however, this is not completely controlled by the decision maker.

Example 10.1 (Traversing the Number Line)   Let $ X = {\mathbb{Z}}$, $ U = \{-2,2,u_T\}$, and $ \Theta = \{-1,0,1\}$. The action sets of the decision maker and nature are the same for all states. For the state transition equation, $ x_{k+1} = f(x_k,u_k,\theta_k) = x_k+u_k+\theta_k$. For each stage, unit cost is received. Hence $ l(x,u,\theta) = 1$ for all $ x$, $ \theta $, and $ u \not = u_T$. The initial state is $ {x_{I}}= 100$, and the goal set is $ {X_{G}}= \{-1,0,1\}$.

Consider executing a sequence of actions, $ (-2,-2,\ldots,-2)$, under the nondeterministic uncertainty model. This means that we attempt to move left two units in each stage. After the first $ -2$ is applied, the set of possible next states is $ \{97,98,99\}$. Nature may slow down the progress to be only one unit per stage, or it may speed up the progress so that $ {X_{G}}$ is three units closer per stage. Note that after $ 100$ stages, the goal is guaranteed to be achieved, in spite of any possible actions of nature. Once $ {X_{G}}$ is reached, $ u_T$ should be applied. If the problem is changed so that $ {X_{G}}= \{0\}$, it becomes impossible to guarantee that the goal will be reached because nature may cause the goal to be overshot.

Now let $ U = \{-1,1,u_T\}$ and $ \Theta = \{-2,-1,0,1,2\}$. Under nondeterministic uncertainty, the problem can no longer be solved because nature is now powerful enough to move the state completely in the wrong direction in the worst case. A reasonable probabilistic version of the problem can, however, be defined and solved. Suppose that $ P(\theta) = 1/5$ for each $ \theta \in \Theta$. The transition probabilities can be defined from $ P(\theta)$. For example, if $ x_k = 100$ and $ u_k = -1$, then $ P(x_{k+1} \vert x_k,u_k) = 1/5$ if $ 97 \leq x_{k+1} \leq 101$, and $ P(x_{k+1} \vert x_k,u_k) = 0$ otherwise. With the probabilistic formulation, there is a nonzero probability that the goal, $ {X_{G}}= \{-1,0,1\}$, will be reached, even though reaching the goal is not guaranteed in the worst case. $ \blacksquare$
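
The probabilistic version of this example is easy to encode; the sketch below derives the transition probabilities for the non-termination actions from the uniform distribution over $ \Theta$, and the names THETA and P_next are placeholders chosen for this illustration.

\begin{verbatim}
# Sketch of the probabilistic number-line model with U = {-1, 1, u_T}
# and nature uniform over Theta = {-2, -1, 0, 1, 2}.

THETA = (-2, -1, 0, 1, 2)
P_THETA = {theta: 1 / len(THETA) for theta in THETA}

def P_next(x, u):
    """P(x_{k+1} | x_k = x, u_k = u) for a non-termination action u."""
    dist = {}
    for theta, p in P_THETA.items():
        x_next = x + u + theta
        dist[x_next] = dist.get(x_next, 0.0) + p
    return dist

print(P_next(100, -1))   # 1/5 each for 97, 98, 99, 100, 101
\end{verbatim}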

Figure 10.1: A grid-based shortest path problem with interference from nature.

Example 10.2 (Moving on a Grid)   A grid-based robot planning model can be made. A simple example is shown in Figure 10.1. The state space is a subset of a $ 15 \times 15$ integer grid in the plane. A state is represented as $ (i,j)$, in which $ 1 \leq i,j \leq 15$; however, the points in the center region (shown in Figure 10.1) are not included in $ X$.

Let $ A = \{0,1,2,3,4\}$ be a set of actions, which denote ``stay,'' ``right,'' ``up,'' ``left,'' and ``down,'' respectively. Let $ U = A \cup \{u_T\}$. For each $ x \in X$, let $ U(x)$ contain $ u_T$ and whichever actions are applicable from $ x$ (some are not applicable along the boundaries).

Let $ \Theta(x,u)$ represent the set of all actions in $ A$ that are applicable after performing the move implied by $ u$. For example, if $ x = (2,2)$ and $ u = 3$, then the robot is attempting to move to $ (1,2)$. From this state, there are three neighboring states, each of which corresponds to an action of nature. Thus, $ \Theta(x,u)$ in this case is $ \{0,1,2,4\}$. The action $ \theta = 3$ does not appear because there is no state to the left of $ (1,2)$. Suppose that the probabilistic model is used, and that every nature action is equally likely.

The state transition function $ f$ is formed by adding the effect of both $ u_k$ and $ \theta_k$. For example, if $ x_k = (i,j)$, $ u_k = 1$, and $ \theta_k = 2$, then $ x_{k+1} = (i+1,j+1)$. If $ \theta_k$ had been $ 3$, then the two actions would cancel and $ x_{k+1} = (i,j)$. Without nature, it would have been assumed that $ \theta_k = 0$. As always, the state never changes once $ u_T$ is applied, regardless of nature's actions.
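
A sketch of $ \Theta(x,u)$ and $ f$ for this grid model follows, assuming the blocked center cells are given as a set; the names MOVES, OBSTACLE, and apply_move are placeholders, and OBSTACLE is left empty here rather than matching the exact region drawn in Figure 10.1.

\begin{verbatim}
# Sketch of the grid model in Example 10.2. Action codes follow the text:
# 0 = stay, 1 = right, 2 = up, 3 = left, 4 = down.

MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}
OBSTACLE = set()   # placeholder for the blocked center cells
X = {(i, j) for i in range(1, 16) for j in range(1, 16)} - OBSTACLE

def apply_move(x, a):
    return (x[0] + MOVES[a][0], x[1] + MOVES[a][1])

def Theta(x, u):
    """Nature actions: all moves in A applicable after the move implied by u."""
    x_mid = apply_move(x, u)
    return {a for a in MOVES if apply_move(x_mid, a) in X}

def f(x, u, theta):
    """Combined effect of the robot action and the nature action."""
    return apply_move(apply_move(x, u), theta)

print(Theta((2, 2), 3))   # {0, 1, 2, 4}, as in the text
print(f((2, 2), 1, 2))    # (3, 3): right, then nature moves up
\end{verbatim}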

For the cost functional, let $ l(x_k,u_k) = 1$ (unless $ u_k = u_T$; in this case, $ l(x_k,u_T) = 0$). For the final stage, let $ l_F(x_F) = 0$ if $ x_F \in {X_{G}}$; otherwise, let $ l_F(x_F) = \infty$. A reasonable task is to get the robot to terminate in $ {X_{G}}$ in the minimum expected number of stages. A feedback plan is needed, which will be introduced in Section 10.1.3, and the optimal plan for this problem can be efficiently computed using the methods of Section 10.2.1.

This example can be easily generalized to moving through a complicated labyrinth in two or more dimensions. If the grid resolution is high, then an approximation to motion planning is obtained. Rather than forcing motions in only four directions, it may be preferable to allow any direction. This case is covered in Section 10.6, which addresses planning in continuous state spaces. $ \blacksquare$
