Monday, August 13, 2018

A brief introduction to Partially Observable Markov Decision Processes

 15 minute read

In this summary, I assume you are familiar with Markov Decision Processes. In a Markov Decision Process (MDP), an agent interacts with the environment, by taking actions that induce a change in the state of the environment.

MDP: The robot knows the position
of the fuze bottle.
Image courtesy of the 
Personal Robotics Lab [6, 7] 
For instance, a robotic arm may grasp a fuze bottle from the table and put it on the tray. The goal of the agent is represented in the form of a reward that the agent receives.  In our example, if the robot succeeds in picking up the object, it gets a positive reward. If it fails and drops the object, it receives a penalty. The goal of the agent is to maximize the reward it will receive throughout the task.

An important assumption in MDPs is that the agent knows the true state of the world. So, our robotic arm would know exactly where the soda can is located. Partially Observable Markov Decision Processes relax this assumption, in that the agent does not know the true state, but can receive observations through its sensors that enable it to estimate that state.

POMDP: The robot retains a
probability distribution over
possible fuze bottle locations.
Image courtesy of the 
Personal Robotics Lab [6, 7] 
Specifically, the agent will maintain a belief state $b$, which is a probability distribution over world states. The agent starts with an initial belief $b,$ it chooses an action $a$ based on that belief, and receives an observation $o$. It then computes a new belief $b'$ based on the previous action and observation, and so on.

For each state and action, the agent receives a reward $r_t$. As in the case of MDPs, the goal of the agent is to maximize the expected accumulated reward that it will receive over a time horizon $T$:


\mathbb{E}\left[ \sum_{t=0}^{T-1} r_t \right]


Formally, we define a partially observable Markov decision process as a tuple $<S, A, \mathcal{T}, \Omega, O, R>$, where:

$S$: finite set of world states; in our example, that is the set of possible robot configurations and positions of the fuze bottle.
$A$: set of a robot actions; for our robotic arm,  an action can be a displacement of the robot end-effector as well as a grasping action.
$\mathcal{T}$: $S \times A \rightarrow \Pi(S)$ is a state-transition function, mapping a world state and robot action to a probability distribution over states. It represents how the true world state changes given a robot action; in our example it encodes the uncertainty in the motion of the robot and of the fuze bottles.
$\Omega$ is a finite set of observations; for the object manipulation example, it can be a set of possible point clouds from a depth sensor or readings from the contact sensors in the robot's fingers.
$O: S \times A \rightarrow \Pi(\Omega)$ is the observation function, which maps a world state and robot action to a probability distribution over observations. It represents the uncertainty in the observations by the robot's sensors.
$R: S \times A \rightarrow \mathbb{R}$ is a reward function that the agent receives given an action at a particular world state, for instance a positive reward if the robot succeeds in picking up the object and a negative reward for robot failure.

After the robot takes an action $a$ and an observation $o$, it can compute a new belief $b'$ as follows:


b'(s') &= P(s'|o, a, b) \\

         &= \frac{P(o | s', a, b) P(s' | a, b)}{P(o | a,b)}  &(\textit{from Bayes' rule}) \\

         &= \frac{P(o|s',a,b) \sum_{s \in S} P(s' | a,b,s)P(s | a, b)}{P(o | a, b)} & (\textit{marginalization})\\

         &= \frac{P(o|s',a) \sum_{s \in S} P(s' | a, s)P(s |  b)}{P(o | a, b)} & (\textit{from conditional independence})\\

         & = \frac{O(s',a, o) \sum_{s \in S} \mathcal{T}(s, a, s') b(s)}{P(o|a,b)} & (\textit{by definition of $~\mathcal{T},O,b$})


Now, how should the agent optimize its expected accumulated reward? The agent can control only the action $a$ that it will take, given its belief $b$. We call the policy of the agent a mapping from beliefs to actions:

$\pi: B \rightarrow A$

The optimal policy $\pi_t^*(b)$ is the policy that returns the optimal action, that is the action that maximizes the expected reward that the agent will receive over the next $t$ steps.

To understand the computation of the optimal policy, we need to first note the concept of a policy tree.  A policy tree $p$ is a tree describing sequences of actions and observations. Each arc in the tree is labelled with an observation, and each node with an action to be performed when the observation leading to that node is received.

Now, to compute the optimal policy, we need a metric of how good a given policy tree p is. The value  of executing a policy tree p in state s is the immediate reward by executing the action $a(p)$ at the root node of the tree,  plus the expected value of the future.


V_t^p(s)&= R(s,a(p)) +(\textit{Expected value of the future})


The value of the future depends (i) on which state the agent will end up, and (ii) on which observation $o_i$ the agent will make. The agent does not have this information in advance, so we will take the expectation over subsequent states and observations: \begin{align}

V_t^p(s)&= R(s,a(p)) + \sum_{s'\in S} P(s'|s, a(p)) \sum_{o_i \in \Omega} P(o_i|s',a(p)) V_{t-1}^{p,o_i}(s')


Based on our definitions of the transition function $\mathcal{T}$ and observation function $O$, we have:


V_t^p(s)&= R(s,a(p)) + \sum_{s'\in S} \mathcal{T}(s, a(p),s') \sum_{o_i \in \Omega}O(s', a(p),o_i) V_{t-1}^{p,o_i}(s')


We see that the value of a state can be computed by one-step look-ahead using $V_{t-1}$. We can then compute the value for every state $s$ recursively, using dynamic programming [4] .

Sadly, the agents does not know the true state, but has a belief $b$, which is a probability distribution over states. Therefore, we need to compute the value of executing a policy tree $p$ given an agent belief $b$. We do this by marginalizing over the agent's belief $b$:


V_t^p(b) &=\sum_{s \in S} b(s) V_t^p(s)


We can write the above equation more compactly by writing the sum as an inner product of the belief with a set of vectors $\alpha_t^p = <V_t^p(s_1), ... , V_t^p(s_n)>$


V_t^p(b) = b \cdot \alpha_t^p


So, we see that a policy tree $p$ is associated with a set of vectors $\alpha_t^p$, which we call "alpha vectors." We have also computed the value of a given policy tree $p$ given a belief $b$.

So, given a set of  $t$ step policy trees $P_t$, we can compute the value of the optimal policy as the value of starting in belief $b$ and executing the best policy tree in that belief:


V_t(b) = \max_{p \in P_t} b \cdot \alpha_t^p \end{align}

This is equivalent to to executing the best root action at time $t$, and the best assignment of policy trees -- and their corresponding $\alpha$ vectors -- at $t-1$:

\begin{align} V_t(b) = \max_{a \in A}\left [ \sum_{s \in S}R(s,a)b(s) + \gamma \sum_{o_i \in \Omega} \max_{\alpha \in P_{t-1}} \sum_{s \in S} \sum_{s'\in S} P(o_i|s',a)P(s'|s,a)b(s) \alpha(s') \right ] \end{align}

The equation above can be solved with exact value iteration: start with a set of $t-1$ step policy trees, and iteratively use that to construct a superset of $t$ step policy trees. Unfortunately, the size of the policy trees grows exponentially at each step. We have |A| actions, and for every action we need to consider $|P_{t-1}|^{|\Omega|}$ new trees at each step. This makes solving it intractably large over time.

Instead, we can use sampling-based algorithms to find efficiently approximate solutions. Most such algorithms alternate between sampling in belief-space, and performing an one-step look-ahead, which propagates the information of the children of $b$ back to the point $b$.

Here is an example of how this is possible [3]: Let's  assume a sampled set of belief points $b \in B$, and substitute $\alpha^{a,*}=R(s,a)$ and $\alpha^{a,o_i}= \sum_{s \in S} \sum_{s'\in S} P(o_i|s',a)P(s'|s,a)$. Then, we can rewrite the value function as:

\begin{align} V_t(b) =\max_{a \in A}\left [ \alpha^{a,*}\cdot b + \sum_{o_i \in \Omega} \max_{\alpha^{a, o_i} \in P_{t-1}} \alpha^{a, o_i}\cdot b \right] \end{align}

Then, for each sampled belief $b$ we can compute the best $\alpha^{a, o_i}$ and store it ahead of time. Then, each time we update the value at $t$ from $t-1$ policy trees, we only need to consider |A| new trees!

This however begs the question, what beliefs should we consider to sample? Recent algorithms [5] have achieved dramatic performance improvements, by attempting to sample from the belief space that is reachable by the optimal policy, since it is usually much smaller than the full belief space.

We thank Siddhartha Srinivasa and Vaibhav Unhelkar for their feedback. 


[1] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. "Planning and acting in partially observable stochastic domains." Artificial intelligence, 1998.
[2] N. Roy, 16420: Planning Under Uncertainty, Massachusetts Institute of Technology, delivered Fall 2010.
[3] J. Pineau, G. Gordon, and S. Thrun. "Point-based value iteration: An anytime algorithm for POMDPs." IJCAI, 2003.
[4] D. Bertsekas, Dynamic Programming and Optimal Control, Athena scientific, 1995.
[5] H. Kurniawati, D. Hsu, W. Sun Lee. "SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces." Robotics: Science and Systems, 2008.
[6] Michael Koval. "Robust Manipulation via Contact Sensing." PhD Thesis. Carnegie Mellon University, 2016. 
[7] Personal Robotics Lab. University of Washington, 2018.