## Wednesday, December 26, 2018

### 7 Rules of the Grad School Game

In a dinner after a conference last spring in Chicago, several graduate students asked me if I had any advice about success in graduate school. I kept thinking of this question, mostly during my evening runs,  and I decided to compile my thoughts into a set of rules. I did not include things such as exercising, having a schedule, waking up at the same time etc, mostly because they are common knowledge and I have never been good at them anyway. I picked the rules system because (1) this seems to be the trend nowadays and (2) rules are easy to remember.

1. Go to the cocktail party.  In academia, information is the valued currency. A solid advice can save you months of unnecessary hustle. Students will speak about advisor habits, research, classes etc. Staff will tell you which rules of the institution are strictly enforced and which can be casually broken. Professors will speak about general research trends, their students, lab equipment that you could possibly use. Not only that, but by clearly articulating your research, your goals, your hurdles, things that you are stuck, you bring structure to thoughts and encourage others to do the same. Be social and go to the party.

2. Choose the right game. You joined a new lab, working on a direction that your advisor has little experience. You are asked to publish a high-quality paper in a year, competing in a domain where hundreds of senior researchers in large research institutions have been working in large teams for years. This is a game that you are -- more often than not -- not going to win. A game that you cannot win is rigged. Do not play a rigged game. Instead, design a new one: bring insights from different fields; find external collaborators; identify a niche that is less explored. Choose a game that is challenging but that you are likely to win.

3. Make a good first impression. A student that excelled in the first term but had bad performance in the second probably overextended by trying too hard. A student that had bad performance in the first term is a mediocre student who may improve -- if lucky enough to be given the chance. First impression defines the initial expectations, and expectations are not easy to change. Apply maximum effort in your first research project, your first review, your first submitted paper. Make a great first impression.

4. Be precise. Research is hard, and if you are pushing yourself, you will inevitably feel at some point overwhelmed. Things will break -- especially if you work on robots and you are running a demo and your advisor is nearby -- and progress may seem like climbing a shadowy mountain. To be prepared for this time, be precise about everything. What you want to achieve, how to formulate the problem, what method you use, when and how it breaks. Precision imposes boundaries and form to the undefined. The shadowy mountain will start having a concrete shape, and it may even reveal paths that you can hike. Be precise.

5. Run the extra mile. Once things start working, research can actually be incredibly rewarding and fun -- for the most part. The last 20% of the effort in any project is usually the hardest and most tedious part. Arnold Schwarzenegger said that the last three reps are what makes the muscle grow. It is then that you get a deep understanding of the topic.  It is also what makes a good paper into a great paper. And a great paper is orders of magnitude more valuable than a good paper. Rerun the statistical tests. Check for patterns in the data. Refine the figures. Don't be like everyone else. Run the extra mile.

6. Be aware of a distorted lens. Every scientific method has strengths and weaknesses. These are typically identified in an objective manner, by comparing to existing work, to an ideal baseline or to a well-defined goal. Focus, on the other hand, can be much more subjective and biased: a reviewer or an audience member can choose to disregard weaknesses and emphasize strengths, or vice versa. Every time you receive feedback that appears overly critical, ask yourself and others whether it is an objective assessment of the work or a result of shifted focus. Embrace and learn from all feedback, positive and negative, and adapt your work and your viewpoint.  But be unaffected from criticism through a distorted lens.

7.  Smile.  As a first-year graduate student, with little research experience and published track record, you are at the bottom of the academic pyramid. But you have youthful energy and excitement that is scarce in an environment where people can be easily stressed and overworked. Maximize that. Be the source of radiant energy, the person who is optimistic when things break, relaxed when everyone else is on edge and who is cracking jokes at 2am. Make your advisor happy to have meetings with you. Be polite and kind to everyone, including other students, administrative stuff, cleaners. You need them more than they need you, and by being pleasant to be around, you encourage everyone else to do the same. Have fun. Smile.

This post is inspired by past discussions with Siddhartha Srinivasa, David Hsu, Julie Shah, Seth Cooper, and many others that I cannot remember.

## Monday, August 13, 2018

### A brief introduction to Partially Observable Markov Decision Processes

In this summary, I assume you are familiar with Markov Decision Processes. In a Markov Decision Process (MDP), an agent interacts with the environment, by taking actions that induce a change in the state of the environment.

 MDP: The robot knows the position of the fuze bottle. Image courtesy of the  Personal Robotics Lab [6, 7]
For instance, a robotic arm may grasp a fuze bottle from the table and put it on the tray. The goal of the agent is represented in the form of a reward that the agent receives.  In our example, if the robot succeeds in picking up the object, it gets a positive reward. If it fails and drops the object, it receives a penalty. The goal of the agent is to maximize the reward it will receive throughout the task.

An important assumption in MDPs is that the agent knows the true state of the world. So, our robotic arm would know exactly where the soda can is located. Partially Observable Markov Decision Processes relax this assumption, in that the agent does not know the true state, but can receive observations through its sensors that enable it to estimate that state.

 POMDP: The robot retains a probability distribution over possible fuze bottle locations. Image courtesy of the  Personal Robotics Lab [6, 7]
Specifically, the agent will maintain a belief state $b$, which is a probability distribution over world states. The agent starts with an initial belief $b,$ it chooses an action $a$ based on that belief, and receives an observation $o$. It then computes a new belief $b'$ based on the previous action and observation, and so on.

For each state and action, the agent receives a reward $r_t$. As in the case of MDPs, the goal of the agent is to maximize the expected accumulated reward that it will receive over a time horizon $T$:

\mathbb{E}\left[ \sum_{t=0}^{T-1} r_t \right]

Formally, we define a partially observable Markov decision process as a tuple $<S, A, \mathcal{T}, \Omega, O, R>$, where:

$S$: finite set of world states; in our example, that is the set of possible robot configurations and positions of the fuze bottle.
$A$: set of a robot actions; for our robotic arm,  an action can be a displacement of the robot end-effector as well as a grasping action.
$\mathcal{T}$: $S \times A \rightarrow \Pi(S)$ is a state-transition function, mapping a world state and robot action to a probability distribution over states. It represents how the true world state changes given a robot action; in our example it encodes the uncertainty in the motion of the robot and of the fuze bottles.
$\Omega$ is a finite set of observations; for the object manipulation example, it can be a set of possible point clouds from a depth sensor or readings from the contact sensors in the robot's fingers.
$O: S \times A \rightarrow \Pi(\Omega)$ is the observation function, which maps a world state and robot action to a probability distribution over observations. It represents the uncertainty in the observations by the robot's sensors.
$R: S \times A \rightarrow \mathbb{R}$ is a reward function that the agent receives given an action at a particular world state, for instance a positive reward if the robot succeeds in picking up the object and a negative reward for robot failure.

After the robot takes an action $a$ and an observation $o$, it can compute a new belief $b'$ as follows:

\begin{align}

b'(s') &= P(s'|o, a, b) \\

&= \frac{P(o | s', a, b) P(s' | a, b)}{P(o | a,b)}  &(\textit{from Bayes' rule}) \\

&= \frac{P(o|s',a,b) \sum_{s \in S} P(s' | a,b,s)P(s | a, b)}{P(o | a, b)} & (\textit{marginalization})\\

&= \frac{P(o|s',a) \sum_{s \in S} P(s' | a, s)P(s |  b)}{P(o | a, b)} & (\textit{from conditional independence})\\

& = \frac{O(s',a, o) \sum_{s \in S} \mathcal{T}(s, a, s') b(s)}{P(o|a,b)} & (\textit{by definition of $~\mathcal{T},O,b$})

\end{align}

Now, how should the agent optimize its expected accumulated reward? The agent can control only the action $a$ that it will take, given its belief $b$. We call the policy of the agent a mapping from beliefs to actions:

$\pi: B \rightarrow A$

The optimal policy $\pi_t^*(b)$ is the policy that returns the optimal action, that is the action that maximizes the expected reward that the agent will receive over the next $t$ steps.

To understand the computation of the optimal policy, we need to first note the concept of a policy tree.  A policy tree $p$ is a tree describing sequences of actions and observations. Each arc in the tree is labelled with an observation, and each node with an action to be performed when the observation leading to that node is received.

Now, to compute the optimal policy, we need a metric of how good a given policy tree p is. The value  of executing a policy tree p in state s is the immediate reward by executing the action $a(p)$ at the root node of the tree,  plus the expected value of the future.

\begin{align}

V_t^p(s)&= R(s,a(p)) +(\textit{Expected value of the future})

\end{align}

The value of the future depends (i) on which state the agent will end up, and (ii) on which observation $o_i$ the agent will make. The agent does not have this information in advance, so we will take the expectation over subsequent states and observations: \begin{align}

V_t^p(s)&= R(s,a(p)) + \sum_{s'\in S} P(s'|s, a(p)) \sum_{o_i \in \Omega} P(o_i|s',a(p)) V_{t-1}^{p,o_i}(s')

\end{align}

Based on our definitions of the transition function $\mathcal{T}$ and observation function $O$, we have:

\begin{align}

V_t^p(s)&= R(s,a(p)) + \sum_{s'\in S} \mathcal{T}(s, a(p),s') \sum_{o_i \in \Omega}O(s', a(p),o_i) V_{t-1}^{p,o_i}(s')

\end{align}

We see that the value of a state can be computed by one-step look-ahead using $V_{t-1}$. We can then compute the value for every state $s$ recursively, using dynamic programming [4] .

Sadly, the agents does not know the true state, but has a belief $b$, which is a probability distribution over states. Therefore, we need to compute the value of executing a policy tree $p$ given an agent belief $b$. We do this by marginalizing over the agent's belief $b$:

\begin{align}

V_t^p(b) &=\sum_{s \in S} b(s) V_t^p(s)

\end{align}

We can write the above equation more compactly by writing the sum as an inner product of the belief with a set of vectors $\alpha_t^p = <V_t^p(s_1), ... , V_t^p(s_n)>$

\begin{align}

V_t^p(b) = b \cdot \alpha_t^p

\end{align}

So, we see that a policy tree $p$ is associated with a set of vectors $\alpha_t^p$, which we call "alpha vectors." We have also computed the value of a given policy tree $p$ given a belief $b$.

So, given a set of  $t$ step policy trees $P_t$, we can compute the value of the optimal policy as the value of starting in belief $b$ and executing the best policy tree in that belief:

\begin{align}

V_t(b) = \max_{p \in P_t} b \cdot \alpha_t^p \end{align}

This is equivalent to to executing the best root action at time $t$, and the best assignment of policy trees -- and their corresponding $\alpha$ vectors -- at $t-1$:

\begin{align} V_t(b) = \max_{a \in A}\left [ \sum_{s \in S}R(s,a)b(s) + \gamma \sum_{o_i \in \Omega} \max_{\alpha \in P_{t-1}} \sum_{s \in S} \sum_{s'\in S} P(o_i|s',a)P(s'|s,a)b(s) \alpha(s') \right ] \end{align}

The equation above can be solved with exact value iteration: start with a set of $t-1$ step policy trees, and iteratively use that to construct a superset of $t$ step policy trees. Unfortunately, the size of the policy trees grows exponentially at each step. We have |A| actions, and for every action we need to consider $|P_{t-1}|^{|\Omega|}$ new trees at each step. This makes solving it intractably large over time.

Instead, we can use sampling-based algorithms to find efficiently approximate solutions. Most such algorithms alternate between sampling in belief-space, and performing an one-step look-ahead, which propagates the information of the children of $b$ back to the point $b$.

Here is an example of how this is possible [3]: Let's  assume a sampled set of belief points $b \in B$, and substitute $\alpha^{a,*}=R(s,a)$ and $\alpha^{a,o_i}= \sum_{s'\in S} P(o_i|s',a)P(s'|s,a)$. Then, we can rewrite the value function as:

\begin{align} V_t(b) =\max_{a \in A}\left [ \alpha^{a,*}\cdot b + \sum_{o_i \in \Omega} \max_{\alpha^{a, o_i} \in P_{t-1}} \alpha^{a, o_i}\cdot b \right] \end{align}

Then, for each sampled belief $b$ we can compute the best $\alpha^{a, o_i}$ and store it ahead of time. Then, each time we update the value at $t$ from $t-1$ policy trees, we only need to consider |A| new trees!

This however begs the question, what beliefs should we consider to sample? Recent algorithms [5] have achieved dramatic performance improvements, by attempting to sample from the belief space that is reachable by the optimal policy, since it is usually much smaller than the full belief space.

We thank Siddhartha Srinivasa and Vaibhav Unhelkar for their feedback.

#### References

[1] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. "Planning and acting in partially observable stochastic domains." Artificial intelligence, 1998.
[2] N. Roy, 16420: Planning Under Uncertainty, Massachusetts Institute of Technology, delivered Fall 2010.
[3] J. Pineau, G. Gordon, and S. Thrun. "Point-based value iteration: An anytime algorithm for POMDPs." IJCAI, 2003.
[4] D. Bertsekas, Dynamic Programming and Optimal Control, Athena scientific, 1995.
[5] H. Kurniawati, D. Hsu, W. Sun Lee. "SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces." Robotics: Science and Systems, 2008.
[6] Michael Koval. "Robust Manipulation via Contact Sensing." PhD Thesis. Carnegie Mellon University, 2016.
[7] Personal Robotics Lab. University of Washington, 2018.