MNIST digit recognition is old and boring. Wikipedia is filled with outstanding error rates reached by complex ML algorithms and pre-processing trickery. Nevertheless, these are overkill: heavy, slow and... boring. They all rely on many moving parts, complex algorithms, expensive training and over-burdened execution.
This project takes a lean approach to object recognition, inspired by a recent paper from Mennel et al. published in the March 2020 issue of Nature.
How simple can we go?
* Is a low resolution camera simple enough? Can discrete pixel sensors recognize objects?
* What would be a good enough model? Would a shallow neural network be sufficient? An SVM? What about a simple decision tree?
* Is an Arduino too powerful? Can an array of voltage comparators beat the latest FPGA? What about a 555?
In the end, time constraints and the suitability and availability of components will determine how far we can go towards the goal set for this project.
The goal of this project is to build a simple machine vision system, trained to do one thing, do it as well as possible with the available resources, and do it fast. Really fast.
The idea came from a recent publication in Nature: https://www.nature.com/articles/d41586-020-00592-6, discussed in the Nature Podcast from 4th March 2020. This led me to wonder: how quick is quick, and is there anything in between the system developed by Mennel et al. and a typical camera + processor system?
The project will be divided into 4 parts as follows:
1. Determine minimum requirements for the sensor array
2. Determine minimum requirements for the ML model
3. Determine minimum requirements for the object recognition hardware
4. Prototyping and final design
The final objective is to have a system that is as close as possible to state of the art ML algorithms but implemented on discrete components for maximum speed at an acceptable complexity level.
The typical machine vision systems I'm familiar with need to get the camera sensor data encoded into a protocol, transferred from the camera to a microchip, then decoded, and finally fed to a pre-trained model solved by an IC, from which we can get a result.
The Nature paper describes an array of sensors which can be trained to recognize simple letters, i.e. it does the interpretation in the sensor array itself, forgoing most of the steps described above. The sensor array described in the paper is a complex setup, well suited for a research institute but not so easy for a hacker to reproduce. At least not for me.
This was, for me, ML on the very edge. Literally. I thought that maybe something similar could be achieved with a simpler setup, using off-the-shelf parts, while still forgoing some of the steps typically needed for a machine vision system.
My take on this was to try and make a prototype machine vision system that could identify images using discrete components, by training a system based on decision trees. The sensors would be an array of photoresistors and the decision trees would be built out of voltage comparators.
The training would all be carried out on a PC and the resulting decision tree implemented as an array of voltage comparators. Each comparator's trigger point would be trimmed by pairing the optical sensors with resistors to form voltage dividers, which in turn would drive the selection process.
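To make the mapping concrete, here is a small R sketch of one such comparator "node". All component values (supply voltage, fixed resistor, reference voltage) are hypothetical, picked only to illustrate how a voltage divider plus a comparator behaves like one split in a decision tree:

```r
# Sketch of one comparator "node": an LDR in a voltage divider,
# compared against a reference voltage. All values are hypothetical.
vcc     <- 5.0    # supply voltage (V)
r_fixed <- 10e3   # fixed divider resistor (ohm)
v_ref   <- 2.5    # comparator reference (V)

# Divider output for a given LDR resistance (bright pixel = low resistance)
v_out <- function(r_ldr) vcc * r_fixed / (r_fixed + r_ldr)

# The comparator output is a single boolean split, just like a tree node
comparator <- function(r_ldr) v_out(r_ldr) > v_ref

comparator(5e3)   # bright pixel -> TRUE
comparator(50e3)  # dark pixel  -> FALSE
```

Trimming the fixed resistor moves the threshold, which is the hardware equivalent of choosing the split value at a tree node.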
There is a good possibility that this has been done in the past. If it has, I could not find anything similar.
Very quickly I realised two important things by looking at the dataset:
1. Some photoresistors' signals would be ignored most of the time, i.e. my array would not be much of an array after all. It would be missing a few photoresistors.
2. Even with some photoresistors missing, a simple array to analyse, e.g. the MNIST digit dataset, would still need a few hundred photoresistors. I'm a father of 3 small kids with a full-time job. This project quickly became impossible.
Then something came to mind: what if I could simplify the pictures by averaging pixels? Would a lower-resolution picture reduce the efficacy of the decision tree?
The first objective was to develop some code to test this. All code was written in R using the RStudio IDE. Not extremely efficient, but a nice IDE I'm familiar with from previous forays into ML.
The target accuracy, in order to be at the level of a commercial solution, should be above 75%.
First decision tree
The first decision tree was produced using the rpart package on the full-resolution 28x28 digits, trained with the first 10000 images of the MNIST dataset. Below is a sample plot showing a digit represented by a single record of the MNIST matrix, with all 28 x 28 = 784 "pixels".
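The training step looked roughly like the sketch below. This is my reconstruction, not the exact project code, and the data frame is a tiny synthetic stand-in for the MNIST training frame (one row per image, pixel columns plus a factor `label` column):

```r
library(rpart)

# Toy stand-in for the MNIST data frame: 'label' plus pixel columns.
# The real project fitted the tree on the first 10000 MNIST images.
set.seed(42)
n     <- 200
train <- data.frame(
  label = factor(rep(c(0, 1), each = n / 2)),
  p1    = c(rnorm(n / 2, mean = 50), rnorm(n / 2, mean = 200)),
  p2    = runif(n, 0, 255)
)

# Fit a classification tree and check how it does on its own training set
fitt <- rpart(label ~ ., data = train, method = "class")
pred <- predict(fitt, train, type = "class")
mean(pred == train$label)  # training accuracy
```

With real MNIST data the formula and the predict/table calls stay the same; only the data frame changes.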
After training the decision tree, the results were as follows:
## Decision tree:
library(rpart.plot)
rpart.plot(fitt, extra = 103, roundint = FALSE, box.palette = "RdYlGn")
The diagonal shows how often the numbers are correctly identified; everything outside the diagonal is a misclassified digit. Not looking so good.
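Getting the overall accuracy out of such a confusion matrix is just the diagonal over the total. A quick sketch with a made-up 3x3 matrix (not the actual MNIST results):

```r
# Accuracy from a confusion matrix: correct counts (diagonal) over total.
# 'cm' is a made-up 3-class example, not the actual MNIST results.
cm <- matrix(c(80,  5,  3,
                6, 70,  9,
                4, 10, 75), nrow = 3, byrow = TRUE)

accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)  # 0.859
```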
The default cp parameter for the rpart package in R is 0.01. With successive iterative reductions we obtain no visible increase in accuracy with cp below 0.00025, and we're still quite a way away from the target accuracy obtained with a random forest.
Even settling for a cp of 0.0025, assuming there's a limit on what can be achieved with decision trees, the result is mind-boggling.
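The cp sweep can be sketched as below. Again the data frame is a small synthetic stand-in (the real runs used the 10000-image MNIST training set); a smaller cp lets rpart keep weaker splits and grow a deeper tree:

```r
library(rpart)

# Synthetic stand-in for the MNIST training frame (label + one pixel column)
set.seed(1)
train    <- data.frame(label = factor(sample(0:1, 500, replace = TRUE)))
train$p1 <- ifelse(train$label == 1, 180, 60) + rnorm(500, sd = 40)

# Smaller cp = less pruning = deeper tree (and higher training accuracy)
for (cp in c(0.01, 0.0025, 0.00025)) {
  fit  <- rpart(label ~ ., data = train, method = "class",
                control = rpart.control(cp = cp))
  pred <- predict(fit, train, type = "class")
  cat(sprintf("cp = %.5f  training accuracy = %.3f\n",
              cp, mean(pred == train$label)))
}
```

Training accuracy can only go up as cp shrinks; the point made above is that past some cp the gain flattens out while the tree (and the eventual comparator array) keeps growing.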
Could it be implemented using discrete components? Definitely. Maybe.
Decision trees can achieve a reasonable accuracy at recognizing the MNIST dataset, at a complexity cost.
The resulting tree can reach 73% accuracy, which is just shy of the 75% target we set for this project.
The goal of this task was to evaluate how simple a sensor array could be while still recognizing the MNIST database at commercial-level accuracy.
According to this site, commercial OCR systems have an accuracy of 75% or higher on manuscripts. Maybe they do better on numbers, but we'll keep this figure as our benchmark.
So there are two ways to test the minimum pixel density needed to identify the MNIST database with 75% accuracy:
1. Grab a handful of sensors, test them against the same algorithm and benchmark them. Time consuming and beyond my budget and time allowance.
2. Try and train a simple object recognition algorithm on a database of decreasing pixel density. Now, that's more up my alley.
The algorithm of choice was an easy sell. Since I'm focusing on the lowest common denominator, decision trees it is.
The database was obviously the MNIST database, and the model was a standard decision tree with the default parameters included with the rpart package in R.
The database was loaded using the code from Kory Becker's GitHub gist https://gist.github.com/primaryobjects/b0c8333834debbc15be4, and the matrix containing the data was transformed by averaging neighbouring cells as below. Code snippets can be found in the last section; full code to be uploaded as files.
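The averaging itself can be sketched like this (my reconstruction; the gist linked above only covers loading the data). Each f x f block of pixels is collapsed into its mean, so a factor of 2 gives 14x14, 4 gives 7x7 and 14 gives 2x2:

```r
# Downsample a square image matrix by averaging f x f blocks.
# For a 28x28 input: f = 2 -> 14x14, f = 4 -> 7x7, f = 14 -> 2x2.
downsample <- function(img, f) {
  n   <- nrow(img) / f
  out <- matrix(0, n, n)
  for (i in 1:n) {
    for (j in 1:n) {
      rows      <- ((i - 1) * f + 1):(i * f)
      cols      <- ((j - 1) * f + 1):(j * f)
      out[i, j] <- mean(img[rows, cols])
    }
  }
  out
}

img <- matrix(runif(28 * 28), 28, 28)  # dummy image
dim(downsample(img, 4))                # 7 7
```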
This is the original 28 x 28 matrix image showing a zero.
By applying a factor of 4, the matrix became a 7x7 matrix. A human could still tell this used to be a zero, with some effort.
Finally, the last factor was 14, i.e. the initial 28x28 matrix was averaged down to 2x2. This would be a really poor sensor, but I needed to find the point at which the model couldn't tell the numbers apart, and then start going back up in pixel density.
Once the matrix datasets were ready, it was time to see how the lower-resolution pictures fared against the full-resolution database when pitched against the standard decision trees, using a 10000-record training set.
So, the images could be simplified and the number of pixels reduced by averaging them. The decision tree model had already shown piss-poor accuracy to start with, so fewer pixels might not affect it much. Let's see how they fared.
As a reference, the full 28x28 matrix with the standard decision tree had the following results:
That is to say, 63% of the time it got the number right. Some digits like 0, 1, 7 and 4 fared quite well, whereas the others didn't really make the cut. Let's keep an open mind nevertheless: a monkey pulling bananas hanging on strings would have scored 10%. The model is doing something after all.
14 x 14 dataset
There was a clear improvement in classification for all digits with a marginal improvement in overall accuracy, i.e. fewer pixels gave a better classification criterion.
This is the equivalent of going out to the pub and starting to see more clearly after a few pints. I'm pretty sure our brains run on decision trees, not neural networks.