Motivation

This is potentially a long story, so I'll begin at the beginning. What prompted this project was the need to keep Layla out of Foxy's dry food. The project didn't gain real momentum until I stumbled upon a cat face detector article while researching object detection models. At that point I decided I would attempt to build an app to control a cat bowl.

Development Process

Platform and Tools

The first decision was that the app would run under Android on a smartphone. The reasoning for this was:

Cat Face Detection

Once the platform and tools were decided, the next task was to find a model that could detect cat faces, was fast enough for a video feed, and would output bounding boxes around each face in a video frame. A pre-trained Haar Cascade model called haarcascade_frontalcatface.xml was found that could detect cat faces. This model runs in any OpenCV build that supports a cascade classifier; the class I used was org.opencv.objdetect.CascadeClassifier. Tests with this model went well.
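
For anyone who wants to try the detector outside the app, here is a minimal sketch using OpenCV's Python bindings (the app itself uses the Java class above; the frame path and tuning parameters are just placeholders):

    import cv2

    # Load the pre-trained cat face cascade (the XML file ships with
    # OpenCV's data directory or can be downloaded from the OpenCV repo).
    cascade = cv2.CascadeClassifier("haarcascade_frontalcatface.xml")

    frame = cv2.imread("frame.jpg")                 # one video frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # cascades run on grayscale

    # detectMultiScale returns one (x, y, w, h) bounding box per face found.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)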

Cat Face Recognition

Next was to find a suitable face recognition model. The most compact and readily available model at the time was MobileFaceNet, which was found in TFLite format. This model was, of course, trained on human faces. Nothing pre-trained on cat faces was found at the time, and training a recognition model from scratch takes a huge amount of resources and labor. So, I reasoned, a cat face isn't all that different from a human face: two eyes, a nose, two ears. To improve the chances of it working I decided to do additional training of the model with cat faces. That's where the fun really started. The MobileFaceNet model found was already in TensorFlow Lite, a compact format for deployment on mobile devices. The catch is that a TFLite model can't be trained. So the model had to be reconstructed as a Keras model for training, then converted back to a TFLite model once training was done.

It took a considerable amount of time to replicate the structure of the MobileFaceNet model as a Keras model. The structure had to match exactly so that all of the trained parameters from the MobileFaceNet model could be loaded into the Keras model as the starting point for further training. This work was all done in Google Colab in Python. Colab can be used for free, with limits on the amount of compute consumed; fortunately, the additional training with cat faces fit within those limits.
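
A rough sketch of that weight transfer is below. It assumes the replica's layers line up one-for-one with the original; keras_replica and name_map are hypothetical names, and some tensors may need their axes reordered between TFLite and Keras layouts:

    import tensorflow as tf

    # Read the stored parameters out of the original TFLite model.
    interpreter = tf.lite.Interpreter(model_path="mobilefacenet.tflite")
    interpreter.allocate_tensors()
    tensors = {d["name"]: interpreter.get_tensor(d["index"])
               for d in interpreter.get_tensor_details()}

    # keras_replica is the hand-built model mirroring MobileFaceNet's structure.
    # name_map (hypothetical) pairs each Keras layer with its TFLite tensor names.
    for layer_name, tensor_names in name_map.items():
        layer = keras_replica.get_layer(layer_name)
        # Note: convolution weights may need transposing (TFLite stores OHWI,
        # Keras expects HWIO) before set_weights will accept them.
        layer.set_weights([tensors[n] for n in tensor_names])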

Training images were collected from the internet. The images had to be portrait-style crops of just the cat's face (ears included). A total of about 5000 cat faces were collected for the training, divided into training and test groups. Training was done with labelled Anchor/Positive/Negative (APN) triplets. This dataset is extremely small by model-training standards, but it still took a considerable amount of manual effort to compile.
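
For context, triplet training pulls the anchor image's fingerprint toward the positive (same cat) and pushes it away from the negative (a different cat). A minimal sketch of the standard triplet loss, with an assumed margin value:

    import tensorflow as tf

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # Squared Euclidean distances between the embedding pairs.
        pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
        # Penalize triplets where the positive pair isn't at least
        # `margin` closer than the negative pair.
        return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))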

Using Google Colab, the model was further trained with the cat face data and ultimately achieved reasonably good accuracy. The trained Keras model was then converted back to a TensorFlow Lite model for deployment.
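
The conversion step itself is short. A sketch using TensorFlow 2's converter (the model variable and output filename are placeholders):

    import tensorflow as tf

    # trained_model is the Keras replica after the additional cat face training.
    converter = tf.lite.TFLiteConverter.from_keras_model(trained_model)
    tflite_model = converter.convert()
    with open("mobilefacenet_cats.tflite", "wb") as f:
        f.write(tflite_model)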

It's important to note that the recognition model doesn't output a "yes/no" result. It wasn't trained to recognize specific cats; the training simply improved its ability to distinguish between cats. The actual output of the recognition model is a vector of 128 values. You can think of this as a fingerprint of the input image. Because of the model training, the more alike two cat faces are, the smaller the "distance" will be between their respective fingerprints. Distance here is the Euclidean distance: take the difference between corresponding elements of the two fingerprint vectors, square each difference, sum the squares, and take the square root of the sum.
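
In code, that distance calculation amounts to a few lines (a sketch; the app's actual implementation lives in its Android code):

    import numpy as np

    def fingerprint_distance(a, b):
        # a and b are the 128-value fingerprint vectors from the model.
        diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
        return float(np.sqrt(np.sum(diff * diff)))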

The App

This section provides an overview of the app. Refer to the User Manual PDF (in the Files section) if you want more detail.

Basic User Setup

In Basic Setup the user specifies the minimum information required for the app to operate:

Advanced Setup

Since this is a prototype, there are numerous parameters that can be tweaked to affect the models. An end user should not mess with these parameters.

Initial Startup

At initial startup the app knows about the cats but doesn't know what they look like. In this state whenever a cat face is detected its snapshot is placed in an Unclassified Images log. This log is later used for training. The app will always place images of unrecognized cats into this log.

Model Training

Training is done by associating a snapshot in the Unclassified Images log with a specific cat. The selected images are then stored in a log maintained for each cat. At first nothing is recognized, but as the user continues to associate images with specific cats the model will begin to recognize them, and it will get better as more images are added. Ultimately, each cat will have a collection of images (and their fingerprints) that represent that cat.

Multiple images are required mainly because of the different poses a cat may present when approaching the bowl. Doing it this way avoids having the app expend a lot of resources (and time) manipulating input frames (such as rotating images) to find a match. Instead, it simply scans each cat's list of images and calculates the distance between the current input image and each stored image. The shortest distance below a pre-set threshold wins, and the corresponding cat is recognized.
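
A minimal sketch of that matching scan, reusing the fingerprint_distance helper from earlier (the data structure and threshold handling are assumptions, not the app's exact code):

    def recognize(input_fp, cats, threshold):
        # cats maps each cat's name to the list of fingerprints stored for it.
        best_cat, best_dist = None, threshold
        for name, fingerprints in cats.items():
            for stored_fp in fingerprints:
                d = fingerprint_distance(input_fp, stored_fp)
                if d < best_dist:
                    best_cat, best_dist = name, d
        return best_cat  # None means the cat wasn't recognized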

Working strictly from snapshots taken by the app keeps the context of each snapshot (lighting, background, etc.) consistent. This helps to improve the reliability of face recognition.

Monitoring Progress

The app keeps a log of face recognition events so the user can confirm recognition is working properly. An interface is provided to review the log; each entry shows a snapshot along with the name of the cat recognized, and this information can be used to make corrections to the training if needed.

Web Interface

All user interface tasks can be done directly on the phone via the app. But this isn't very convenient, since the phone is mounted in a holder above the bowl and a landscape orientation has proven to be best. To get around this limitation, the app also provides a web interface, which can be accessed from any browser.

The Devilish Details

As with many projects, the devil was in the details. Over a period of several months there were a few problems to deal with.

The Bowl

A commercial bowl (Sure Petcare) was heavily modified for use in this project. The bowl's original function was simply to open when motion was detected. For this project the bowl needed to accept open/close commands via Bluetooth so the phone app could control it. To accomplish this, all of the original electronics in the bowl, except for the motor, were removed and replaced with a custom Arduino Nano controller. The bowl is a project unto itself; its software was developed using Microsoft Visual Studio 2019 with the Arduino IDE for Visual Studio extension.

More to come....