Artificial Stupidity

A project log for Somatic - Wearable Control Anywhere

Hand signs and gestures become keystrokes and mouse clicks

Zack Freedman, 05/05/2020 at 20:34

Last log, I was riding high because my ML model was getting 95% accuracy. It actually performed decently well. Buuut... 

When I walked around, the network barely worked at all. Worse, the speed and angle I was gesturing affected performance a lot. Some letters just never worked. It thought H's were N's, A's were Q's, and it straight-up ignored V's. Z's and K's were crapshoots, which is awesome when your name is Zack and you want to capture some footage for Instagram.

It turns out that I had made the most rookie move in machine learning - failing to normalize data. I will now punch myself in the ego for your education.

Normie Data! REEEE

Asking a bunch of floating-point math to recognize air-wiggles is hard enough, and having it also filter out timing, scaling, starting position, and more is just kicking it when it's down. Normalizing the data, or adjusting it to remove irrelevant parts, is a critical first step in any machine-learning system.

I thought I did a decent job normalizing the data. After all, the quaternions were already scaled -1 to 1, and I decimated and interpolated each sequence so every sample had 100 data points. So why didn't it work?

1) I didn't normalize the quaternions.

I read a bunch of articles and grokked that the data should have a magnitude of 1. The problem here is that each data point goes from -1 to +1. This means that in a recurrent network, a negative value then and a positive value now add up to zero. Zero values don't excite the next node at all - they do nothing. What I meant to do was excite the next node halfway between the minimum and maximum.

Instead, I should have scaled from 0 to 1, where -1 becomes 0, 0 becomes 0.5, and 1 keeps its 1-ness. That way, opposite values land on either side of the halfway point instead of canceling out the entire operation. Ugh.
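For the record, that remap is a one-liner. Here is a minimal sketch, assuming the inputs really are bounded to -1..+1; the function name is made up for illustration and is not from the project's code.

```python
def rescale_unit(values):
    """Map each value from the range [-1, 1] onto [0, 1].

    -1 becomes 0, 0 becomes 0.5, and 1 stays 1.
    Assumes every input already lies within [-1, 1].
    """
    return [(v + 1.0) / 2.0 for v in values]


print(rescale_unit([-1.0, 0.0, 1.0]))  # [0.0, 0.5, 1.0]
```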

I fixed that, but then discovered that...

2) I fitted the wrong data.

Quaternions were a baaad choice. Not only is the math a mind-melting nightmare, but they intrinsically include rotation data. That isn't relevant for this project - if I'm swiping my pointer finger leftwards, it doesn't matter if my thumb is up or down. I failed to filter this out, so hand rotation heavily influenced the outcome.

Even stupidlier, I realized that what I cared about was yaw and pitch, but what I fed into the model was a four-element quaternion. I was making the input layer twice as complex for literally no reason. I was worried about gimbal lock, but that was dumb because the human wrist can't twist a full continuous 180 degrees. Ugh.

I switched the entire project, firmware and software, to use AHRS (yaw, pitch, and roll) instead of quaternions. Not only did steam stop leaking out of my ears, but I now had half the data to collect and crunch. Buuut...
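The project made this switch in firmware, so the glove reports angles directly. If you are stuck with a sensor that only hands you quaternions, the same reduction can be done in software. Below is a rough sketch using the standard conversion from a unit quaternion to Z-Y-X (yaw, pitch, roll) angles, keeping only yaw and pitch. The function name and the (w, x, y, z) ordering are assumptions; check your IMU's documentation for its actual axis convention.

```python
import math


def quat_to_yaw_pitch(w, x, y, z):
    """Return (yaw, pitch) in radians from a unit quaternion.

    Uses the common Z-Y-X (yaw, pitch, roll) convention and simply
    drops roll, since wrist twist is irrelevant to the gesture.
    """
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    return yaw, pitch


# A 90-degree turn about the vertical axis: yaw is about pi/2, pitch is about 0.
half = math.sqrt(0.5)
print(quat_to_yaw_pitch(half, 0.0, 0.0, half))
```

Four inputs per data point become two, which is the "half the data" mentioned above.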

3) I didn't normalize the timescales.

This is the one that really ground my self-respect to powder. I wrote a complex interpolation algorithm to standardize every gesture sample to 100 data points spanning one second. I did this by interpolating using the timestamps sent along with the data from the glove.

The algorithm worked great, but I wasted all the time writing it because it was the absolute wrong approach.

See, I care about the shape I'm drawing in the air, not the speed my finger is moving. It doesn't matter if I spend 150ms on the top half of the B and 25ms on the lower half. It's a B.

By scaling the timestamps, I preserved the rate I was drawing. This is extra-dumb because your hand moves faster at the beginning and end of the gesture. Most of my data points were when I was getting my hand moving, and slowing it down, instead of the actual gesturing part with the letter in it.

With shame in my heart, I deleted the code and replaced it with a subdivision algorithm, right out of the programming interview playbook. It worked great, but...
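The post doesn't show the subdivision code, so the following is only one plausible reading of it: ignore the timestamps entirely and place a fixed number of points at equal distances along the drawn path. It is a sketch under that assumption, treating each point as a (yaw, pitch) pair and ignoring angle wraparound; the function name is hypothetical.

```python
import math


def resample_by_path_length(points, n=100):
    """Return n points evenly spaced along the polyline through `points`.

    Spacing depends only on distance traveled, so slow and fast parts
    of the gesture get the same point density.
    """
    if len(points) < 2 or n < 2:
        raise ValueError("need at least 2 input points and n >= 2")

    # Cumulative path length at each input point.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0:
        return [tuple(points[0])] * n  # the hand never moved

    out = []
    seg = 0  # index of the segment that contains the current target
    for i in range(n):
        target = total * i / (n - 1)
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        x0, y0 = points[seg]
        x1, y1 = points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# Input points bunch up near the start, as if the hand were still accelerating.
raw = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0)]
print(resample_by_path_length(raw, n=5))
```

In the example, three of the four raw points sit in the first fifth of the stroke, but the resampled points come out evenly spread from 0 to 1.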

4) I scaled the data like a dumbass.

Gestures of all sizes, big and small, should be recognized the same way. This wasn't an issue with quats, but it does come into play with angles. I standardized each sample by finding the lowest and highest yaw and scaling each yaw accordingly, then I did the same with each pitch.

Hopefully, you just facepalmed, because that approach discards the aspect ratio, stretching and squishing each sample into a square.

This meant that the tiny yaw (left/right) variance in the letter I was multiplied until the letter looked like a squiggly Z. Lower-case Q and A were squished in the pitch direction, which eliminated the Q's downstroke. Luckily I caught this before wasting too much time, but it was dumb.

Instead, I had to find the bounding box of the gesture, and scale it uniformly so that the longer side went from 0 to 1.
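A minimal sketch of that uniform scaling, assuming each sample is a list of (yaw, pitch) pairs. Anchoring the box's minimum corner at (0, 0) is a choice made here for illustration (the post doesn't say whether the shorter side gets centered), and the function name is hypothetical.

```python
def scale_uniform(points):
    """Scale points so the longer bounding-box side spans 0 to 1.

    Both axes are divided by the same factor, so the aspect ratio
    of the gesture is preserved.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    longer = max(max(xs) - min_x, max(ys) - min_y)
    if longer == 0.0:
        return [(0.0, 0.0) for _ in points]  # degenerate: a single spot
    return [((x - min_x) / longer, (y - min_y) / longer) for x, y in points]


# A tall, skinny stroke like the letter I: 1 unit of yaw wobble, 40 of pitch.
print(scale_uniform([(10.0, 0.0), (11.0, 20.0), (10.0, 40.0)]))
```

Here the pitch ends up spanning 0 to 1 while the yaw wobble stays at a small fraction of that, instead of being blown up to full width.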

Conclusion

When collecting data, make sure you know exactly what should be standardized, what should be kept, and which formats reduce the workload. If you struggle to understand a concept, like quaternions, there's no shame in switching to a substitute that actually lets you finish the project.

After all that dumbness, I finally started collecting decent data that made for great tests. Now, time to waggle my hand tens of thousands of times to collect enough data.
