Shivin Dass

Hi, I am an honors graduate student at the University of Southern California, USA, pursuing a degree in Computer Science with a specialization in Intelligent Robotics. I am also a graduate researcher at the Cognitive Learning for Vision and Robotics Lab (CLVR), where I work on deep learning, reinforcement learning and robotics with Dr. Joseph Lim.

I completed my B.Tech. (with honors) in CSE from IIIT Delhi. During my undergraduate studies, I wrote my thesis on path-planning algorithms for mobile robots at the Collaborative Robotics Lab (CORAL) under Dr. PB Sujit and Dr. Syamantak Das, and worked on inverse reinforcement learning under the guidance of Dr. Sanjit Kaul.

Email  /  Resume  /  LinkedIn  /  GitHub

profile photo

I am interested in machine learning, reinforcement learning and robotics. Currently I am working on imitation learning on real-world robots and leveraging the learnt policies for assisted teleoperation in long-horizon data collection tasks.

Assisted Teleoperation for Scalable Robot Data Collection
Shivin Dass*, Karl Pertsch*, Hejia Zhang, Youngwoon Lee, Joseph J. Lim, Stefanos Nikolaidis
Under review at ICRA, 2023

We estimate the aleatoric and epistemic uncertainty of a behavior policy using a hierarchical goal-reaching architecture and show how it can be leveraged for large-scale data collection.

Assisted Teleoperation for Scalable Robot Data Collection

Modern robot learning requires large amounts of data (e.g., offline RL, skill learning, learning from demonstrations). A common way to collect such data is human teleoperation of the robot, but this is time-consuming, demands the operator's constant attention, and controlling a robot is not always easy for humans. In this ongoing project, we are using task-agnostic demonstration data and imitation learning to make much of the data collection process autonomous. We believe this will not only ease the mental load on the teleoperator but also enable a single human operator to collect data on multiple robots simultaneously, improving the efficiency of data collection.

For teleoperation, we use a VR framework with the Oculus Quest 2, which enables us to control the robot in all 6 degrees of freedom. We use furniture assembly (left) as a simulation environment, and we base our real-world environment on GTI (right) with the Kinova Jaco robot.

For imitation learning, we use a behaviour cloning framework where the learnt action space is the delta Cartesian setpoint, i.e., the offset from the previous setpoint to the new one. Actions are communicated at a rate of 10Hz, and the joint-level controller uses PyBullet for inverse kinematics.
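The delta-setpoint scheme can be sketched as below. This is a minimal stand-in, not the project's actual code: the network here is a tiny state-based MLP with made-up sizes (the real policy is larger and image-conditioned), but the action space and control loop work the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 6-DoF setpoint in, 6-DoF delta setpoint out.
STATE_DIM, ACTION_DIM, HIDDEN = 6, 6, 32

# Tiny two-layer MLP policy (illustrative only).
W1 = rng.normal(0, 0.1, (STATE_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, ACTION_DIM))

def policy(state):
    """Predict a delta Cartesian setpoint from the current setpoint."""
    return np.tanh(state @ W1) @ W2

def bc_loss(states, expert_deltas):
    """Behaviour cloning objective: mean-squared error between the policy's
    predicted deltas and the teleoperator's demonstrated deltas."""
    return np.mean((np.tanh(states @ W1) @ W2 - expert_deltas) ** 2)

def rollout(start, steps):
    """Integrate delta actions: new setpoint = previous setpoint + delta.
    At deployment each step would be sent to the controller at 10 Hz."""
    setpoint = start.copy()
    trajectory = [setpoint.copy()]
    for _ in range(steps):
        setpoint = setpoint + policy(setpoint)
        trajectory.append(setpoint.copy())
    return np.stack(trajectory)

traj = rollout(np.zeros(STATE_DIM), steps=20)
print(traj.shape)  # (21, 6)
```

Training would minimize `bc_loss` over the demonstration dataset by gradient descent; the rollout shows how the learnt deltas are chained into absolute setpoints.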

More details to follow...

Task Induced Representation Learning

Representation learning has been a crucial element in image-based reinforcement learning: it not only reduces the dimensionality of the input data by extracting useful information, but also allows us to induce useful biases in the extracted features, enabling more robust downstream learning. Along these lines, in Task Induced Representation Learning (Jun Yamada, Karl Pertsch, Anisha Gunja, Joseph Lim), the authors argue that common representation learning methods learn to model all information in a scene, including distractors, potentially impairing the agent's learning efficiency. They compare such approaches to a class of methods they develop, called task-induced representation learning (TARP), which leverages task-specific information such as rewards and demonstrations from prior data to learn representations that focus only on the task-relevant parts of the scene and ignore distractors.

I demonstrate their framework by implementing a modified version of V-TARP on a toy environment. In the environment (shown above), there is an agent (circle), a target (square) and distractors (triangles). The prior data is labelled with rewards from various reward functions, such as rewards proportional to the x/y-coordinate of the agent/target and rewards inversely proportional to the horizontal/vertical distance between the agent and the target. Our objective is to learn a representation from this data that can efficiently solve the downstream task of following the target while ignoring the distractors.
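The reward functions described above can be written roughly as follows. These are hypothetical versions for illustration; the exact scales in my implementation may differ, and "inversely proportional" is rendered here as a negative distance.

```python
# Illustrative reward functions used to label the prior data.
# Positions are (x, y) pairs.

def reward_agent_x(agent, target):
    return agent[0]                    # proportional to the agent's x-coordinate

def reward_target_y(agent, target):
    return target[1]                   # proportional to the target's y-coordinate

def reward_horizontal(agent, target):
    return -abs(agent[0] - target[0])  # penalizes horizontal agent-target distance

def reward_vertical(agent, target):
    return -abs(agent[1] - target[1])  # penalizes vertical agent-target distance

agent, target = (0.2, 0.8), (0.5, 0.4)
print(reward_horizontal(agent, target))  # ≈ -0.3
```

Each function induces a different notion of "task-relevant" state, which is what lets TARP shape the representation.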

The TARP-based architecture is shown above. The encoder network learns the representations that are used in the downstream task. Without going into too much detail, the reward heads predict the rewards in the demonstrations; this prediction error signals which information in the data is relevant and which is not, and it is backpropagated to the encoder. The LSTM lets the representation encoder learn a state abstraction of the future return rather than the single-step reward.
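The forward pass can be sketched as below. All sizes are placeholders, the encoder is a linear layer standing in for a conv net, and a plain recurrent cell stands in for the LSTM; this is a structural sketch of the encoder/recurrence/reward-head pipeline, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, Z_DIM, H_DIM, N_HEADS = 16, 8, 8, 4  # hypothetical sizes

# Encoder: observation -> representation z (a conv net on images in the paper).
W_enc = rng.normal(0, 0.1, (OBS_DIM, Z_DIM))
# Simple recurrent cell standing in for the LSTM, so the representation can
# reflect the future return rather than a single-step reward.
W_zh = rng.normal(0, 0.1, (Z_DIM, H_DIM))
W_hh = rng.normal(0, 0.1, (H_DIM, H_DIM))
# One linear reward head per reward function in the prior data.
W_heads = rng.normal(0, 0.1, (N_HEADS, H_DIM))

def forward(obs_seq):
    """For each timestep, predict one reward value per head; the reward
    prediction error is the training signal backpropagated to the encoder."""
    h = np.zeros(H_DIM)
    preds = []
    for obs in obs_seq:
        z = np.tanh(obs @ W_enc)           # shared task-induced representation
        h = np.tanh(z @ W_zh + h @ W_hh)   # recurrent state over the sequence
        preds.append(W_heads @ h)          # N_HEADS reward predictions
    return np.stack(preds)

preds = forward(rng.normal(size=(5, OBS_DIM)))
print(preds.shape)  # (5, 4): 5 timesteps x 4 reward heads
```

Only the encoder is kept for the downstream task; the recurrent cell and reward heads exist purely to shape what the encoder extracts.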

We use the encoder as a feature extractor for PPO on the downstream RL task of following the target among distractors, and test it against the baselines shown on the right. The oracle baseline is trained on the true xy-positions of the agent and target (while all other methods use the raw image as state) and shows the upper limit of the RL policy. reward_prediction(_finetune) uses the encoder trained with the method above, with the encoder either frozen during downstream learning or finetuned. image_reconstruction(_finetune) uses an encoder trained with an autoencoder image-reconstruction loss; these encoders capture all the information in the state, including distractors, which hampers their learning. image_scratch is a smaller, randomly initialized encoder fit for RL applications. We can see that the reward_prediction encoders learn faster than the others in the presence of distractors, since they learn to ignore task-irrelevant objects in their representations of the scene.

Here we show that the TARP encoder does learn to ignore task-irrelevant features in the provided data. First, we train an encoder with rewards inversely proportional to both the horizontal and the vertical distance between the agent and target (left), then use its embeddings to train a decoder to reconstruct the images. The decoder outputs (above, left) show that with both horizontal and vertical rewards, all the state information is captured. We then train another encoder in the same fashion but with only horizontal-distance rewards (right). The decoder outputs from this encoder's embeddings (above, right) show that it only learns the horizontal coordinates of the agent and target, while the vertical coordinates (irrelevant to the task in the prior data) are not learnt.
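The idea behind this probe can be demonstrated in miniature: fit a decoder on frozen embeddings and check which state variables it can recover. Here the "horizontal-only" encoder is simulated by an embedding that keeps only the x-coordinate (a hypothetical stand-in for the trained encoder), and the decoder is a linear least-squares fit rather than a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy states are (x, y) pairs; the horizontal-only embedding discards y,
# mimicking an encoder trained only on horizontal-distance rewards.
states = rng.uniform(size=(500, 2))
embeddings = states[:, :1]  # only the horizontal coordinate survives

# Fit a linear decoder from the frozen embeddings back to the full state.
A = np.hstack([embeddings, np.ones((500, 1))])
coef, *_ = np.linalg.lstsq(A, states, rcond=None)
recon = A @ coef

x_err = np.abs(recon[:, 0] - states[:, 0]).mean()
y_err = np.abs(recon[:, 1] - states[:, 1]).mean()
print(x_err < 1e-6, y_err > 0.1)  # x is recoverable, y is not
```

The decoder perfectly reconstructs the coordinate the embedding retained and fails on the one it discarded, which is exactly the pattern the image reconstructions above show.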

An implementation can be found here.

Inverse Reinforcement Learning

A reinforcement learning problem is formulated as an MDP, where the objective for an agent is to maximize a reward function. Designing a good reward function to achieve some behaviour is hard, and providing sample demonstrations of the behaviour is sometimes easier. Inverse reinforcement learning is a sub-field of RL that aims to extract reward functions from given demonstrations, and here we take a look at two pioneering papers in the field, Algorithms for Inverse Reinforcement Learning (Ng & Russell, 2000) and Maximum Entropy Inverse Reinforcement Learning (Ziebart et al., 2008).

The paper Algorithms for Inverse Reinforcement Learning is one of the first papers on the topic of inverse reinforcement learning. In it, the authors propose three algorithms: the first two for when the policy is known, in discrete and continuous state spaces respectively, and the third for when only demonstrations are given. They tackle the problem of degeneracy with a meaningful heuristic and formulate a linear program whose solution gives the reward function over the states. Here we look at the results of their first and third algorithms.

The first algorithm applies to discrete state spaces when the policy is given, as shown above. The optimal policy in the given gridworld is to reach the top-right absorbing state as fast as possible, and the lambda term penalizes large weights (similar to L1 regularization). We can see that a well-chosen lambda recovers the reward function quite accurately. The third algorithm is useful when only demonstrations are provided, a more realistic setting for inverse reinforcement learning. The results are shown below, where the state space is [0,1]x[0,1], a continuous version of the gridworld from before. The extracted rewards are again accurate, though a little noisier. More results are provided here.
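The first algorithm's linear program can be sketched on a toy problem. Assuming SciPy is available, the example below uses a 5-state chain (not the gridworld above) where the known optimal policy always moves right; with only two actions, the minimum over non-optimal actions is trivial. The variables, constraints and L1 penalty mirror the paper's LP, but the problem sizes and lambda are made up.

```python
import numpy as np
from scipy.optimize import linprog

n, gamma, lam = 5, 0.9, 0.1  # small chain MDP; lam penalizes large rewards

# Deterministic transitions: the given optimal policy always moves right,
# toward the rewarding end state.
P_right, P_left = np.zeros((n, n)), np.zeros((n, n))
for s in range(n):
    P_right[s, min(s + 1, n - 1)] = 1.0
    P_left[s, max(s - 1, 0)] = 1.0

# Margin matrix from the LP: (P_a* - P_a) (I - gamma P_a*)^{-1}
M = (P_right - P_left) @ np.linalg.inv(np.eye(n) - gamma * P_right)

# Variables x = [R (n), t (n), u (n)]; maximize sum(t) - lam * sum(u),
# i.e. minimize -sum(t) + lam * sum(u).
c = np.concatenate([np.zeros(n), -np.ones(n), lam * np.ones(n)])
I, Z = np.eye(n), np.zeros((n, n))
A_ub = np.vstack([
    np.hstack([-M, Z, Z]),   # M R >= 0   (the given policy stays optimal)
    np.hstack([-M, I, Z]),   # t <= M R   (t is the per-state margin)
    np.hstack([I, Z, -I]),   # u >= R     (u = |R| for the L1 penalty)
    np.hstack([-I, Z, -I]),  # u >= -R
])
b_ub = np.zeros(4 * n)
bounds = [(-1, 1)] * n + [(None, None)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
R = res.x[:n]
print(np.argmax(R))  # 4: the rightmost state gets the highest recovered reward
```

The LP rewards solutions that make the demonstrated policy beat the alternatives by the largest margin, while the L1 term breaks degeneracy by preferring sparse rewards.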

The paper Maximum Entropy Inverse Reinforcement Learning was published in 2008, and some of its principles are still relevant to the field of inverse reinforcement learning. The authors employ the principle of maximum entropy to tackle the problem of degeneracy: they choose behaviours that are constrained to match the demonstrated behaviour's state feature counts in expectation, while being no more committed to any particular path than this constraint requires. On the right we can see the results of their algorithm on the same gridworld task as before. Even though it recovers a fairly good reward function, this is not a good basis for comparing the two papers, since the task is too simple. The paper also presents results on a more complicated route-planning task, which have not been replicated here.
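The feature-matching idea can be sketched as below, again on a toy 5-state chain rather than the gridworld. This is a simplified finite-horizon version with one-hot state features and made-up hyperparameters: a backward pass of soft value iteration gives the MaxEnt policy, a forward pass gives its expected state visitation counts, and the gradient step moves the reward weights until those counts match the expert's.

```python
import numpy as np

n, T, lr, iters = 5, 8, 0.1, 200  # hypothetical sizes and hyperparameters

# Deterministic chain dynamics: nxt[s, a] for actions {0: left, 1: right}.
nxt = np.array([[max(s - 1, 0), min(s + 1, n - 1)] for s in range(n)])

# Expert demonstration: always move right from state 0 for T steps.
expert_states = [0]
for _ in range(T - 1):
    expert_states.append(nxt[expert_states[-1], 1])
f_expert = np.bincount(expert_states, minlength=n).astype(float)

theta = np.zeros(n)  # reward weights; features are one-hot state indicators
for _ in range(iters):
    # Backward pass: finite-horizon soft value iteration under reward theta.
    V, policies = np.zeros(n), []
    for _ in range(T):
        Q = theta[:, None] + V[nxt]              # Q[s, a]
        V = np.logaddexp(Q[:, 0], Q[:, 1])       # soft maximum over actions
        policies.append(np.exp(Q - V[:, None]))  # stochastic MaxEnt policy
    policies.reverse()
    # Forward pass: expected state visitation counts from start state 0.
    D = np.zeros(n); D[0] = 1.0
    visits = np.zeros(n)
    for pi in policies:
        visits += D
        D_next = np.zeros(n)
        for s in range(n):
            for a in (0, 1):
                D_next[nxt[s, a]] += D[s] * pi[s, a]
        D = D_next
    # Gradient: match the expert's feature counts in expectation.
    theta += lr * (f_expert - visits)

print(np.argmax(theta))  # 4: highest reward at the chain's right end
```

Because the expert spends most of its time at the right end of the chain, feature matching drives the recovered reward up there, without ever committing to one particular expert path.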

More results and implementation can be found here.

The website template was inspired from here.