My research goal is to develop deep learning algorithms that not only advance robot intelligence but can also be deployed in real-world robotic applications. To this end, my current research focuses on scaling data collection for robotic applications.

We enable scalable robot data collection by assisting human teleoperators with a learned policy. Our approach estimates its uncertainty over future actions to determine when to request user input. In real-world user studies, we demonstrate that our system enables more efficient teleoperation with reduced mental load, allowing a single operator to supervise up to four robots in parallel.

Projects

Assisted Teleoperation for Scalable Robot Data Collection

Modern robot learning requires large amounts of data (e.g., offline RL, skill learning, learning from demonstrations).
Human teleoperation is the most popular method for collecting robot demonstrations. However, current teleoperation frameworks require humans to perform repetitive tasks that demand intense focus and a substantial time commitment. To collect large datasets, robots must not only be capable of collecting data autonomously but also be able to ask humans for help when unsure about what to do next. Hence, we propose a Policy Assisted TeleOperation (PATO) system that automates repetitive skills in long-horizon tasks using a learned policy. Importantly, PATO only queries the human for actions when the policy observes an unseen state or is uncertain about which behavior to perform next. We show that this enables humans to collect data on multiple robots simultaneously, improving their data collection throughput while minimizing the required supervision.

We achieve this with a hierarchical policy structure: a high-level subgoal predictor and an ensemble of low-level autoregressive goal-reaching policies. The hierarchical structure lets us represent task uncertainty as the variance of subgoals sampled from the subgoal predictor, and lets us detect unseen states by measuring the disagreement between actions predicted by the ensemble of low-level policies. These two signals are combined to decide when to query the human.
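A minimal sketch of this query gating (array shapes and threshold values here are my own illustrative choices, not the paper's):

```python
import numpy as np

def should_query_human(subgoal_samples, ensemble_actions,
                       subgoal_var_thresh=0.5, action_var_thresh=0.1):
    """Return True if control should be handed back to the human operator.

    subgoal_samples:  (K, D) subgoals sampled from the high-level predictor.
    ensemble_actions: (M, A) actions predicted by the M low-level policies.
    The thresholds are arbitrary illustrative values.
    """
    # Task uncertainty: variance across sampled subgoals
    # (the policy is unsure which behavior to perform next).
    task_uncertainty = subgoal_samples.var(axis=0).mean()
    # Unseen-state signal: disagreement within the policy ensemble.
    ensemble_disagreement = ensemble_actions.var(axis=0).mean()
    return bool(task_uncertainty > subgoal_var_thresh
                or ensemble_disagreement > action_var_thresh)

rng = np.random.default_rng(0)
# Confident case: subgoals and ensemble actions all agree.
confident = should_query_human(np.zeros((10, 3)), np.zeros((5, 4)))
# Uncertain case: widely scattered subgoal samples trigger a query.
uncertain = should_query_human(rng.normal(0, 5, (10, 3)), np.zeros((5, 4)))
```

In the real system both signals come from learned networks; the point here is only how the two variance measures combine into a single query decision.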

More details and experiment results can be found here.

Task-Induced Representation Learning

Representation learning has been a crucial element of image-based reinforcement learning: it not only reduces the dimensionality of the input data by extracting useful information, but can also induce useful biases in the extracted features, enabling more robust downstream learning.
Along these lines, in Task-Induced Representation Learning (Jun Yamada, Karl Pertsch, Anisha Gunja, Joseph Lim), the authors argue that common representation learning methods learn to model all information in a scene, including distractors, potentially impairing the agent's learning efficiency.
In their paper they compare such approaches to a class of methods they develop, called task-induced representation learning (TARP), which leverages task-specific information such as rewards and demonstrations from prior data to learn representations that focus only on the task-relevant parts of the scene and ignore distractors.

I demonstrate their framework by implementing a modified version of V-TARP on a toy environment.
In the environment (shown above), there is an agent (circle), a target (square), and distractors (triangles).
The previously collected data is labelled with various forms of reward functions, such as rewards proportional to the x/y-coordinate of the agent/target, and rewards inversely related to the horizontal/vertical distance between the agent and the target.
Our objective is to learn a representation from the given data that can efficiently solve the downstream task of following the target while ignoring the distractors.
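For concreteness, here are hedged sketches of such reward labellings; the exact functional forms used in the toy environment may differ:

```python
import numpy as np

def reward_agent_x(agent, target):
    """Reward proportional to the agent's x-coordinate."""
    return agent[0]

def reward_horizontal(agent, target):
    """Reward that decreases with horizontal agent-target distance."""
    return -abs(agent[0] - target[0])

def reward_vertical(agent, target):
    """Reward that decreases with vertical agent-target distance."""
    return -abs(agent[1] - target[1])

agent, target = np.array([0.2, 0.8]), np.array([0.5, 0.5])
r = reward_horizontal(agent, target)
```

Each labelling highlights a different slice of the state; training on several of them together is what lets the representation cover all task-relevant factors.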

The architecture, based on TARP, is shown above. Here the encoder network learns the representation that is used in the downstream task.
Without going into too much detail: the reward heads predict the rewards in the demonstrations, which provides the signal for what information in the data is relevant and what is not, and this signal is backpropagated to the encoder.
The LSTM is used so that the representation encoder can learn a state abstraction of the future return rather than of the single-step reward.
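A deliberately tiny, linear stand-in for this idea (dimensions, data, and the linear parameterization are all illustrative, and the LSTM/return part is omitted): a shared "encoder" trained only through per-task reward-prediction heads keeps the observation dimensions the rewards depend on and lets the distractor dimensions decay.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, n_tasks, n = 6, 2, 2, 512

obs = rng.standard_normal((n, obs_dim))
# Only dims 0 and 1 drive any reward; dims 2-5 act as distractors.
task_w = np.zeros((obs_dim, n_tasks))
task_w[0, 0] = 1.0        # task 0: reward ~ x-coordinate
task_w[1, 1] = 1.0        # task 1: reward ~ y-coordinate
rewards = obs @ task_w    # (n, n_tasks) reward labels from prior data

W_enc = rng.normal(0, 0.5, (obs_dim, feat_dim))    # shared encoder
W_heads = rng.normal(0, 0.5, (feat_dim, n_tasks))  # one head per task

lr = 0.1
for _ in range(1000):
    feats = obs @ W_enc              # shared representation
    err = feats @ W_heads - rewards  # reward-prediction error per head
    # The reward signal backpropagates through every head into the encoder.
    W_heads -= lr * feats.T @ err / n
    W_enc -= lr * obs.T @ (err @ W_heads.T) / n

mse = float(np.mean((obs @ W_enc @ W_heads - rewards) ** 2))
relevant = float(np.abs(W_enc[:2]).sum())    # weight on reward-relevant dims
distractor = float(np.abs(W_enc[2:]).sum())  # weight on distractor dims
```

After training, the encoder's weights concentrate on the reward-relevant dimensions, which is the mechanism by which TARP representations come to ignore distractors.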

We use the encoder as a feature extractor for PPO on the downstream RL task of following the target in the presence of distractors, and test it against the baselines shown on the right.
The oracle baseline is trained on the true xy poses of the agent and target (while all other methods use the raw image as the state) and shows the upper limit of the RL policy.
The reward_prediction(_finetune) baselines use the encoder trained with the above method, with the encoder either frozen during downstream learning or finetuned (_finetune).
The image_reconstruction(_finetune) baselines use an encoder trained with an autoencoder image-reconstruction loss. These capture all the information in the state, including distractors, which hurts their downstream learning.
Image_scratch is a smaller, randomly initialized encoder sized for RL applications.
We can see that the reward_prediction encoders learn faster than the others in the presence of distractors, since they learn to ignore task-irrelevant objects in their representations of the scene.

Here we show that the TARP encoder does learn to ignore task-irrelevant features based on the provided data.
First we train an encoder with rewards inversely related to both the horizontal and the vertical distance between the agent and the target (left).
Then we use the encoder embeddings to train a decoder to reconstruct the images.
The decoder outputs, shown above on the left, demonstrate that with both horizontal and vertical rewards all of the state information is captured.
Next we train another encoder in the same fashion but with only horizontal-distance rewards (right).
The decoder outputs from the embeddings of this encoder, shown above on the right, show that it only learns the horizontal coordinates of the agent and target, while the vertical coordinates (which were irrelevant to the task based on the prior data) are not learnt.
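The logic of this probe can be sketched with synthetic stand-ins (the data and embedding below are fabricated for illustration): if an encoder was trained with horizontal-distance rewards only, its embedding should encode x-coordinates but not y-coordinates, and a decoder fit on top of it can only recover the former.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1000
x = rng.uniform(size=n)   # horizontal coordinate (task-relevant)
y = rng.uniform(size=n)   # vertical coordinate (task-irrelevant)

# Hypothetical embedding: features that are functions of x only,
# mimicking an encoder trained with horizontal-distance rewards.
emb = np.stack([x, x ** 2], axis=1)

def probe_r2(emb, target):
    """R^2 of the best linear decoder from the embedding to the target."""
    A = np.hstack([emb, np.ones((len(emb), 1))])   # add a bias column
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()

r2_x = probe_r2(emb, x)   # close to 1: x is decodable from the embedding
r2_y = probe_r2(emb, y)   # close to 0: y was never encoded
```

In the actual experiment the decoder is an image decoder trained on the frozen encoder's embeddings; the linear probe here only illustrates why the vertical coordinates cannot be reconstructed.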

Inverse Reinforcement Learning

A reinforcement learning problem is formulated as an MDP in which the agent's objective is to maximize the expected return under a reward function.
Finding a good reward function to achieve some behaviour is hard, and sometimes providing sample demonstrations of the behaviour is easier.
Inverse reinforcement learning is a subfield of RL that aims to extract reward functions from given demonstrations. Here we take a look at two pioneering papers in this field:
Algorithms for Inverse Reinforcement Learning (Ng & Russell, 2000) and Maximum Entropy Inverse Reinforcement Learning (Ziebart et al., 2008).
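In symbols (my notation, not the papers'): forward RL takes the reward $R$ as given and solves

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \Big],$$

while inverse RL is given trajectories sampled from an (approximately optimal) $\pi^{*}$ and must recover a reward $R$ under which that behaviour is optimal.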

The paper Algorithms for Inverse Reinforcement Learning is one of the first papers on the topic of inverse reinforcement learning.
In it, the authors propose three different algorithms: the first two for when the policy is known, in discrete and continuous state spaces respectively, and the third for when only demonstrations are given.
They tackle the problem of degeneracy with a meaningful heuristic and formulate a linear program whose solution gives the reward function over the states.
Here we look at the results of their first and third algorithms.
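Roughly, the linear program for the discrete, known-policy case (in notation close to the paper's, with $a_1$ the policy's action and $\mathbf{P}_a$ the transition matrix of action $a$) is

$$\begin{aligned} \max_{\mathbf{R}} \quad & \sum_{i=1}^{N} \min_{a \in A \setminus \{a_1\}} \big\{ (\mathbf{P}_{a_1}(i) - \mathbf{P}_{a}(i)) (\mathbf{I} - \gamma \mathbf{P}_{a_1})^{-1} \mathbf{R} \big\} - \lambda \lVert \mathbf{R} \rVert_1 \\ \text{s.t.} \quad & (\mathbf{P}_{a_1} - \mathbf{P}_{a})(\mathbf{I} - \gamma \mathbf{P}_{a_1})^{-1} \mathbf{R} \succeq 0 \quad \forall a \in A \setminus \{a_1\}, \qquad |\mathbf{R}_i| \le R_{\max}, \end{aligned}$$

where the constraints encode that the given policy is optimal under $\mathbf{R}$, and the $\lambda$ term is the heuristic that breaks degeneracy by favouring simple reward functions.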

The first algorithm applies to discrete state spaces when the policy is given, as shown above.
The optimal policy for the given gridworld is to reach the top-right absorbing state as fast as possible, and the lambda term is a regularizer that penalizes large reward weights.
As we can see, a suitable lambda recovers the reward function quite accurately.
The third algorithm is useful when only demonstrations are provided, a more realistic setting for applying inverse reinforcement learning.
The results can be seen below, where the state space is [0,1]x[0,1], a continuous version of the gridworld from before.
The extracted rewards are again largely accurate, though somewhat noisy.
More results are provided here.

The paper Maximum Entropy Inverse Reinforcement Learning was published in 2008, and some of its principles are still relevant to the field of inverse reinforcement learning.
The authors employ the principle of maximum entropy to tackle the problem of degeneracy.
This principle leads them to choose behaviours that are constrained to match the demonstrated behaviour's state feature counts in expectation, while being no more committed to any particular path than this constraint requires.
On the right we can see the results of their algorithm on the same gridworld task as before.
While it recovers a fairly good reward function, this simple task is not a good basis for comparing the two papers.
This paper also presents results on a more complicated route planning task which haven't been replicated here.
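For reference, the core of the approach: the maximum-entropy distribution over paths $\zeta$ with feature counts $\mathbf{f}_{\zeta}$ and reward weights $\theta$ is

$$P(\zeta \mid \theta) = \frac{\exp(\theta^{\top} \mathbf{f}_{\zeta})}{Z(\theta)},$$

and maximizing the likelihood of the demonstrations gives the gradient

$$\nabla_{\theta} L(\theta) = \tilde{\mathbf{f}} - \sum_{s} D_{s} \mathbf{f}_{s},$$

the difference between the empirical expert feature expectations $\tilde{\mathbf{f}}$ and the model's expected feature counts, computed from the expected state visitation frequencies $D_s$ (for deterministic dynamics).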

More results and implementation can be found here.