Problem and Setup
In order to apply a Reinforcement Learning algorithm, the goal has to be specified in terms of maximizing cumulative reward.
The objective of the robot was to drive for the longest time without hitting any obstacle. The objective was shaped by a reward function - the robot would receive a penalty when hitting an obstacle. Since the robot receives zero reward for just driving around and a negative reward (penalty) for hitting an obstacle, by maximizing cumulative reward a collision avoiding behavior should emerge.
The collisions were detected automatically using an accelerometer. Any detected collision would also trigger a “back down and turn around” recovery action, so that the robot could explore it’s environment largely unattended. It would drive with a small fixed speed (around 40cm/s) and the Reinforcement Learning Agent would be in control of the steering angle.
Reward Shaping and the Environment
Reinforcement Learning Agents make decisions based on current state. State usually consists of observations of the environment, sometimes extended by historical observations and some internal state of the Agent. For my setup, the only observation available to the Agent was a single (monocular) image from a camera and history of past actions, added so that velocity and “decision consistency” can be encoded in the state. The Agent’s steering control loop ran at 10Hz. There was also a much finer PID control loop for velocity control (maintaining a fixed speed in this case) running independently on the robot.
With the reward specified as in previous section, it was not surprising that the first learned behavior was to drive around in tight circles - as long as there is enough space, the robot would drive around forever. Although correct given the problem statement, this was not the behavior I was looking for. Therefore, I’ve added a small penalty that would keep increasing unless the robot was driving straight. I’ve also added a constraint on how much the steering angle can change between frames to reduce jerk.
This definition of the reward function is sparse - the robot can drive around for a long time before receiving any feedback when hitting an obstacle - especially when the Agent gets better. I improved my results by “backfilling” the penalty to a few frames before the collision to increase the number of negative examples. This is not and entirely correct thing to do, but worked well in my case, since the robot was driving at very low speeds and the Agent communicated with the robot via WiFi, so there was some lag involved anyway.
For a Deep Reinforcement Learning algorithm I chose Soft Actor-Critic (SAC)(specifically the tf-agents implementation). I picked this algorithm since it promises to be sample-efficient (therefore decreasing data collection time, an important feature when running on a real robot and not a simulation) and there were already some successful applications on simulated cars and real robots.
Following the method described in Learning to Drive Smoothly… and Learning to Drive in a Day, in order to speed up training I encoded the images provided to the Agent using a Variational Auto-Encoder (VAE). Data for the VAE model was collected using a random driving policy, and once pre-trained, the VAE model’s weights were fixed during SAC Agent training.
The Good, the Bad and the Next Steps
The Agent has successfully learned to navigate some obstacles in my apartment. The steering was smooth and the robot would generally prefer to drive straight for longer periods of time.
I was not able to consistently measure improvement in episode length (a common metric for sparse-reward-function problems) - it largely depended on the point in the apartment when the robot was started.
Unfortunately, the learned policy was not robust - the robot would not avoid previously unseen obstacles and was sensitive to lightning changes throughout the day.
I suspect the encoded representation was not enough for this type of task. I was counting on the Agent to learn to recognize the obstacles from cues such as visible floor shape (sort of horizon detection) but perhaps a VAE trained as a general image encoder-decoder was not powerful enough to expose such features. In the future I’d like to try both to encode some more useful information (such as predicted depth) and train the vision network along with the Agent. Which brings me to the main takeaway…
For experimentation, a simulator is a necessity. Training an Deep Learning Agent in real world is very time consuming and experimentation is tricky. On the other hand there is a lot to experiment with - from state representation to fine tuning of the hyperparameters of the actor and critic networks. And debugging a neural network running a real robot only sounds cool… Therefore, the next step for me will be building a car dynamics simulation.