Combining Five CNNs into One

As I mentioned in my previous log, performance could possibly be improved by combining the current five networks into one. This would allow the network to assess the use of each finger in the context of the whole hand rather than treating each finger independently. As a result, the network could learn different types of grasps and avoid unfortunate finger combinations.

The graphic below shows a simplified version of what I did. Each finger used to have its own CNN that assessed whether the finger would be used to grab the object in question, making it a binary classification problem of 'used' or 'not used'. For training, this meant associating a label of either 1 (used) or 0 (not used) with each training image. In this case the label can be fed in directly as a number from a text file. For the combined network, the CNN's output was changed from a single binary prediction to a 'used' or 'not used' prediction for every finger, given one input image. To achieve this, the last fully connected layer was duplicated five times, once per finger, so that each finger's use is assessed in relation to the others. Accordingly, the training labels changed from a single binary value to a five-element vector of 1s and 0s, e.g. 11000 meaning 'use thumb and index, don't use middle, ring, and little finger'. To feed in these multi-label vectors, I generated 5-pixel binary images with a value of either 0 or 1 for each pixel.
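The label encoding described above can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the function names, the finger order, and the choice of a binary PGM file as the 5-pixel image format are all assumptions made for the example.

```python
# Assumed finger order, matching the example '11000' = thumb and index used.
FINGERS = ("thumb", "index", "middle", "ring", "little")


def encode_label(used):
    """Turn a collection of used finger names into a five-element 0/1 vector."""
    return [1 if finger in used else 0 for finger in FINGERS]


def decode_label(label):
    """Return the names of the fingers marked as used (1) in a label vector."""
    return [finger for finger, bit in zip(FINGERS, label) if bit]


def label_to_pgm(label):
    """Serialise a label vector as a 5x1 binary PGM (P5) image.

    Each pixel holds 0 or 1; the header declares a maximum grey value of 1.
    The PGM container is an assumption, any image format with one pixel
    per finger would serve the same purpose.
    """
    header = f"P5\n{len(label)} 1\n1\n".encode("ascii")
    return header + bytes(label)


label = encode_label({"thumb", "index"})
print(label)                # [1, 1, 0, 0, 0]
print(decode_label(label))  # ['thumb', 'index']
pgm_bytes = label_to_pgm(label)  # could be written to disk with open(path, "wb")
```

Storing the label as an image means each training image is paired with a second, tiny image instead of a single number from a text file.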

This is the graph of the combined network as visualised in DIGITS:

The figure below shows the accuracy and loss for the combined network. Although the accuracies for the index and middle finger went down, overall accuracy across all fingers increased.
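To make the difference between per-finger and overall accuracy concrete, here is a minimal sketch. The predictions and labels are invented toy data, not my results, and taking the overall accuracy as the mean of the five per-finger accuracies is an assumption about how the two relate.

```python
def per_finger_accuracy(predictions, labels):
    """Fraction of samples for which each finger's prediction matches its label."""
    n_samples = len(labels)
    n_fingers = len(labels[0])
    return [
        sum(p[i] == t[i] for p, t in zip(predictions, labels)) / n_samples
        for i in range(n_fingers)
    ]


def overall_accuracy(predictions, labels):
    """Mean of the per-finger accuracies (assumed definition of 'overall')."""
    accuracies = per_finger_accuracy(predictions, labels)
    return sum(accuracies) / len(accuracies)


# Invented toy data: four samples, order thumb, index, middle, ring, little.
toy_labels = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
]
toy_predictions = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],  # little finger wrong
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],  # index finger wrong
]

print(per_finger_accuracy(toy_predictions, toy_labels))
print(overall_accuracy(toy_predictions, toy_labels))
```

With this definition, a drop for one or two fingers can be outweighed by gains for the others, which is the pattern described above.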

For the glue stick object that was used for training, grasp success has improved to 92%. I will follow this up with a post showcasing the results, and I will also collect some grasping data for objects the network has not been trained on.

The real success, however, is that the hand has learned to avoid strange finger combinations (like grabbing something with the middle and little finger only). It now favours pinch grasps using thumb, index (and middle) finger, wrap grasps using index, middle, ring (and little) finger, as well as grasps using all fingers.
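The grasp families listed above can be written down as a small rule set over the five-element output vector. This is a hypothetical helper for labelling predictions after the fact, not something the network computes; the function name and the exact rules are my reading of the grasp descriptions above.

```python
def grasp_type(label):
    """Name the grasp family for a [thumb, index, middle, ring, little] 0/1 vector."""
    thumb, index, middle, ring, little = label
    if all(label):
        return "all fingers"
    # Pinch: thumb and index, middle optional, ring and little not used.
    if thumb and index and not ring and not little:
        return "pinch"
    # Wrap: index, middle and ring without the thumb, little optional.
    if not thumb and index and middle and ring:
        return "wrap"
    return "other"


print(grasp_type([1, 1, 0, 0, 0]))  # pinch
print(grasp_type([0, 1, 1, 1, 1]))  # wrap
print(grasp_type([1, 1, 1, 1, 1]))  # all fingers
print(grasp_type([0, 0, 1, 0, 1]))  # other (middle and little finger only)
```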
