Somatic - Wearable Control Anywhere

Hand signs and gestures become keystrokes and mouse clicks

The Somatic is a wearable keyboard and mouse. It translates hand signs and motions into actions, like the somatic component of a spell in Dungeons and Dragons.

Each knuckle has a Hall sensor, and the first segment of each finger has a magnet. Flexing a finger pivots its magnet out of position, allowing the Somatic to map your hand.

An EM7180SFP IMU near the thumb provides 9-DoF (nine-degree-of-freedom) orientation tracking. Eventually, this will allow you to move a mouse cursor by pointing and type letters by drawing them in midair.

Thanks to Alex Glow for modeling the Somatic!

All files are on GitHub!

The Somatic project's priorities are:

  • Control any wearable computer with a heads-up display
  • Ready to use all day, instantly, with no Internet
  • Doesn't cause fatigue or interfere with other tasks
  • Fast enough to do a quick search in less than 10 seconds

The Somatic will not:

  • Reproduce your hand in 3-D space
  • Let you type on a virtual keyboard
  • Use any cloud services at all

The project is still in a pretty rough state. The roadmap includes:

  • Collect gesture samples
  • Use artificial neural network to recognize letters
  • Implement gyro mouse
  • Lay out and fab circuit board
  • Make case smaller
  • Replace on/off Hall sensors with continuous sensors

The Somatic project is MIT licensed, copyright 2019 Zack Freedman and Voidstar Lab.

  • Bleeding for My Art

    Zack Freedman • 05/05/2020 at 22:19

    OK, more like enduring joint pain for my work, but that sounds lame.

    The glove's machine-learning model starts with an RNN (I'm using an LSTM for now) to pull discrete patterns out of the stream of incoming coordinates. I suspect I'll need a bunch of hidden layers after that, because letters are made of downstrokes, circles, and other mini-shapes, and I need a lot of abstraction to distinguish them.
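
    For the curious, here's roughly what that stack could look like in Keras. This is a sketch, not the final architecture - the layer sizes, the two-feature yaw/pitch input, and the 50-glyph output are placeholder assumptions.

      # Sketch of an LSTM-plus-dense-layers classifier, as described above.
      # All sizes are placeholders, not the final design.
      import tensorflow as tf
      from tensorflow.keras import layers

      NUM_TIMESTEPS = 100  # points per gesture after resampling
      NUM_FEATURES = 2     # e.g. yaw and pitch per point
      NUM_GLYPHS = 50      # letters, numbers, punctuation

      model = tf.keras.Sequential([
          # The LSTM pulls patterns out of the incoming coordinate sequence
          layers.LSTM(100, input_shape=(NUM_TIMESTEPS, NUM_FEATURES)),
          # Dense layers add the abstraction needed to tell mini-shapes apart
          layers.Dense(100, activation="relu"),
          layers.Dense(64, activation="relu"),
          layers.Dense(NUM_GLYPHS, activation="softmax"),
      ])
      model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])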

    Why does this matter? Recurrent and deep networks are easy to overtrain, and that means my wrist is gonna hurt.

    Overtraining is when the machine-learning model aces a test by writing the answers on its hand. The network has internalized the training data, and instead of recognizing an incoming gesture, it just outputs which training sample it looks like. It does great with the training data, but face-plants in the real world. It gets new data, looks at the answers on its hand, doesn't see any close matches, and takes an embarrassingly poor guess.

    LSTMs and deep networks have a lot of parameters and can encode a lot of data within themselves. In other words, the model has a really big hand and can't help writing answers on it. There's only one solution, and it's...

    More data.

    Each training run needs to consist of many samples, far more than the model can internalize. If the model tries to make a cheat sheet, it quickly runs out of paper and is forced to actually do the work. With 100 nodes in the densest layer, I estimate that I need about 100 samples for each of the 50 letters, numbers, and punctuation marks to do the job. 

    That's 5,000 times I need to waggle my hand for science.

    The problem is, the network will still inevitably overtrain. I need this glove to work in all kinds of circumstances - when I'm walking, sitting, jogging, have my hands full, can't move much, or need to make big gestures onstage. So, I need...

    Even more data.

    The machine-learning model, mad lad that it is, tells me to kiss its ass and it memorizes the answers anyways. But I'm a step ahead! I give it a second test with totally different questions on it.

    The model blows a wet raspberry and memorizes the answers to that test too. So, when it comes back to class, I give it a third totally different test.

    This repeats, the model memorizing tests and me making new tests, until the model starts failing. Its strategy is stretched too thin - the model is just not big enough to encode all those questions and answers, so the only way it can get good grades is to learn patterns, not answers.

    This is called k-fold cross-validation - you split your data into subsets so the same model has to ace many different tests at once, and it never gets asked a question from the practice test in the real test.

    A rule of thumb is that 10-fold cross-validation is the point of diminishing returns: you need about ten times the data required to train the model to keep it from memorizing answers.
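
    In code, the splitting is the easy part. Here's a minimal sketch of 10-fold cross-validation using scikit-learn's KFold - build_model, X, and y are hypothetical placeholders, not anything from the repo.

      # Hypothetical 10-fold cross-validation loop. X holds the gesture
      # sequences, y the glyph labels; build_model() returns a fresh,
      # untrained Keras model for each fold.
      import numpy as np
      from sklearn.model_selection import KFold

      kfold = KFold(n_splits=10, shuffle=True, random_state=42)
      fold_scores = []
      for train_idx, test_idx in kfold.split(X):
          model = build_model()
          model.fit(X[train_idx], y[train_idx], epochs=20, verbose=0)
          _, accuracy = model.evaluate(X[test_idx], y[test_idx], verbose=0)
          fold_scores.append(accuracy)  # each fold is a test it never studied for
      print(f"mean accuracy across folds: {np.mean(fold_scores):.3f}")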

    That means I need to flap my meat mittens 50,000 times to collect an appropriate training set.

    So say I do that. The machine-learning model eventually passes all ten tests, and I rejoice. But that little bastard is wearing an obnoxious grin, almost as if to say, "Stupid monkey! I memorized all ten tests! Eat my shorts!" There's only one way to know for sure, and it is...

    I got a fever, and the only prescription is more data.

    That's right. As that pissant mass of floating-point operations snickers to itself, I look it dead in the eye, reach into my back pocket, and whip out a double-super-top-secret eleventh test. 

    This set of data, the sanity check, is special. Unlike the training set, it's off-limits to the model while training. The model has no way to memorize the answers, because it's never allowed to see the grade. I see the grade, and if the model flunks the final exam, I drag it out back and put it out of its misery (reset and start over). If it passes, the...


  • Artificial Stupidity

    Zack Freedman • 05/05/2020 at 20:34

    Last log, I was riding high because my ML model was getting 95% accuracy. It actually performed decently well. Buuut... 

    When I walked around, the network barely worked at all. Worse, the speed and angle I was gesturing affected performance a lot. Some letters just never worked. It thought H's were N's, A's were Q's, and it straight-up ignored V's. Z's and K's were crapshoots, which is awesome when your name is Zack and you want to capture some footage for Instagram.

    It turns out that I had made the most rookie move in machine learning - failing to normalize data. I will now punch myself in the ego for your education.

    Normie Data! REEEE

    Asking a bunch of floating-point math to recognize air-wiggles is hard enough, and having it also filter out timing, scaling, starting position, and more is just kicking it when it's down. Normalizing the data, or adjusting it to remove irrelevant parts, is a critical first step in any machine-learning system.

    I thought I did a decent job normalizing the data. After all, the quaternions were already scaled -1 to 1, and I decimated and interpolated each sequence so every sample had 100 data points. So why didn't it work?

    1) I didn't normalize the quaternions.

    I read a bunch of articles and grokked that the data should have a magnitude of 1. The problem is that each data point ranges from -1 to +1. This means that in a recurrent network, a negative value then and a positive value now can sum to zero. Zero values don't excite the next node at all - they do nothing. What I meant was for a middle-of-the-range value to excite the next node halfway between the minimum and maximum.

    I should have scaled from 0 to 1, where -1 becomes 0, 0 becomes 0.5, and 1 keeps its 1-ness, so opposite values nudge the next node instead of canceling out the entire operation. Ugh.
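
    The fix itself is one line of arithmetic. A minimal sketch, assuming the readings arrive as NumPy-friendly floats in [-1, 1]:

      # Rescale values from [-1, 1] to [0, 1]: -1 -> 0, 0 -> 0.5, 1 -> 1.
      import numpy as np

      def rescale_to_unit_interval(values):
          return (np.asarray(values, dtype=float) + 1.0) / 2.0

      print(rescale_to_unit_interval([-1.0, 0.0, 1.0]))  # [0.  0.5 1. ]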

    I fixed that, but then discovered that...

    2) I fitted the wrong data.

    Quaternions were a baaad choice. Not only is the math a mind-melting nightmare, but they intrinsically encode wrist roll. That isn't relevant for this project - if I'm swiping my pointer finger leftwards, it doesn't matter if my thumb is up or down. I failed to filter this out, so hand rotation heavily influenced the outcome.

    Even stupidlier, I realized that what I cared about was yaw and pitch, but what I fed into the model was a four-element quaternion. I was making the input layer twice as complex for literally no reason. I was worried about gimbal lock, but that was dumb because the human wrist can't twist a full continuous 180 degrees. Ugh.

    I switched the entire project, firmware and software, to use AHRS (yaw, pitch, and roll) instead of quaternions. Not only did steam stop leaking out of my ears, but I now had half the data to collect and crunch. Buuut...
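
    For reference, here's one standard way to pull yaw, pitch, and roll out of a unit quaternion in Python. This is the textbook Tait-Bryan conversion, not necessarily the exact math the glove firmware or the EM7180 fusion coprocessor uses.

      # Convert a unit quaternion (w, x, y, z) to yaw/pitch/roll in radians.
      import math

      def quaternion_to_ypr(w, x, y, z):
          yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
          pitch = math.asin(max(-1.0, min(1.0, 2.0 * (w * y - z * x))))
          roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
          return yaw, pitch, roll

      print(quaternion_to_ypr(1.0, 0.0, 0.0, 0.0))  # identity rotation -> (0.0, 0.0, 0.0)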

    3) I didn't normalize the timescales.

    This is the one that really ground my self-respect to powder. I wrote a complex interpolation algorithm to standardize every gesture sample to 100 samples in one second. I did this by interpolating using the timestamps sent along with the data from the glove.

    The algorithm worked great, but I wasted all that time writing it because it was the absolute wrong approach.

    See, I care about the shape I'm drawing in the air, not the speed my finger is moving. It doesn't matter if I spend 150ms on the top half of the B and 25ms on the lower half. It's a B.

    By scaling the timestamps, I preserved the rate I was drawing. This is extra-dumb because your hand moves slowest at the beginning and end of a gesture. Most of my data points came from getting my hand moving and slowing it back down, instead of the actual gesturing part with the letter in it.

    With shame in my heart, I deleted the code and replaced it with a subdivision algorithm, right out of the programming interview playbook. It worked great, but...
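
    Here's the gist of that replacement, sketched in NumPy: resample the gesture to a fixed number of points spaced evenly along the drawn path, so timing drops out entirely. The (yaw, pitch) point format is an assumption, and this isn't the repo's exact algorithm.

      # Resample a gesture to n points spaced evenly by distance along the path.
      import numpy as np

      def resample_by_path_length(points, n=100):
          points = np.asarray(points, dtype=float)
          # cumulative distance travelled along the path
          steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
          dist = np.concatenate([[0.0], np.cumsum(steps)])
          if dist[-1] == 0.0:
              return np.repeat(points[:1], n, axis=0)  # degenerate: no movement
          # n evenly spaced targets along the path, interpolated per axis
          targets = np.linspace(0.0, dist[-1], n)
          return np.column_stack([np.interp(targets, dist, points[:, axis])
                                  for axis in range(points.shape[1])])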

    4) I scaled the data like a dumbass.

    Gestures of all sizes, big and small, should be recognized the same. This wasn't an issue with quats, but it does come into play...


  • Training Day

    Zack Freedman • 02/05/2020 at 22:01

    Fun fact: Everyone seems to be nuts over machine learning, but almost no one gathers their own data. My training utility isn't technically sophisticated - just Tkinter, Pillow, Numpy, and Pyserial - but apparently most ML enthusiasts haven't needed to make one.

    I'm pretty proud. I haven't made a GUI since Google Glass, and this one came out pretty nice.

     Here's what it looks like:

    The idea is that I want to collect a few hundred samples of all 70 glyphs by actually drawing them in midair, the same way the glove will work in real life. That's the training data.

    The left column shows the hand gesture up top and the currently-drawn path at the bottom. I made it fade from green to blue so I can easily verify the direction of the gesture. The utility automatically rejects any gesture that's too short or slow, to cut out false readings.

    The right column shows the number of samples for each glyph, and the bottom has little thumbnails of each gesture. I can right-click one to remove it, which will help clean up the data.

    At the bottom, you can see the status bar that shows the raw packets received from the glove, as well as the refresh rate. All samples are standardized to one second at 50Hz, but it's nice to keep an eye out for slowdowns that could affect the finished project.
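
    The "too short or slow" filter mentioned above is simple in concept. Here's a rough sketch of what such a check could look like - the thresholds and the (timestamp, yaw, pitch) sample format are made up for illustration.

      # Reject gestures that barely move or move too slowly to be deliberate.
      import numpy as np

      MIN_TRAVEL = 0.15  # total angular travel, radians (made-up threshold)
      MIN_SPEED = 0.30   # average radians per second (made-up threshold)

      def accept_gesture(samples):
          """samples: list of (timestamp_s, yaw, pitch) tuples from the glove."""
          if len(samples) < 2:
              return False
          t = np.array([s[0] for s in samples], dtype=float)
          path = np.array([s[1:] for s in samples], dtype=float)
          travel = np.linalg.norm(np.diff(path, axis=0), axis=1).sum()
          duration = max(t[-1] - t[0], 1e-6)
          return travel >= MIN_TRAVEL and travel / duration >= MIN_SPEED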

    I've wasted a billion hours wrangling the Android and Silverlight UI builders, so I was very surprised by the Tkinter library's productivity. The Python Imaging Library (packaged as the Pillow module) made the thumbnails and graphics easy, and serial comms are handled by the excellent Pyserial module. Numpy and the Pyquaternion extension simplify processing the incoming quaternion data, but goddammit, quaternions are such an incomprehensible pain in the ass.

    As an aside, it's nice to be able to use an up-to-date version of Python, after being stuck with 2.7 on other projects. It has some relevant improvements (TYPE CHECKING!!!) but the updated libraries are also great.

    Anyways, time to collect some data. When this baby hits 14,000 samples, you're gonna see some serious stuff.

  • Give 'em a Hand

    Zack Freedman • 01/04/2020 at 01:51

    The Supercon really loved the data glove - enough to convince me to put it online.

    The project looks pretty good, but inside, it's pandemonium. Below the clean-looking case (I'm pretty proud of that, it came out great) the Somatic prototype is a mess of perfboard and sloppily-soldered dev boards. The code does basically nothing, and the training utility does even less.

    But, the hardware works, and I have my tasks ahead of me. From easy to hard:

    • I have code to recognize hand signs, output keystrokes, and do the gyro mouse stuff from a previous data glove (a rough sketch of the gyro-mouse idea follows this list). Gotta port that over to this hardware.
    • Finish the training utility. This will be fed into TensorFlow to recognize gestures. I've never actually built a desktop GUI app, but the learning curve of Tkinter is pretty forgiving.
    • Gather and process training data. I estimate I'll need something like 100 samples of each of the about 100 glyphs. I'll need some serious podcasts (and wrist support) to draw ten thousand letters in the air. Preprocessing should be pretty simple - the IMU itself produces unit quaternions; my job will be to normalize the sequences so they're the same length.
    • Make the artificial neural network. I've never done anything close to this before, but it seems pretty straightforward. The priority is making a network that can be easily ported to the Teensy. I'm not going to train directly on the glove - it'll just run the model.
    • Build custom PCBs and all that jazz to make it smaller and perform better. Normally I'd call this easy, but the Teensy 4.0's new chip is BGA and looks like a nightmare to solder with hot air.
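
    As for the gyro mouse mentioned in the first bullet, the core idea is just turning changes in yaw and pitch into relative cursor motion. Here's a hypothetical desktop-side sketch - the finished glove would presumably send mouse events itself, so pyautogui and the sensitivity constant are stand-ins for illustration.

      # Map changes in wrist yaw/pitch to relative cursor movement.
      import pyautogui

      SENSITIVITY = 800  # pixels per radian of rotation (made-up value)

      class GyroMouse:
          def __init__(self):
              self.last = None

          def update(self, yaw, pitch):
              if self.last is not None:
                  d_yaw, d_pitch = yaw - self.last[0], pitch - self.last[1]
                  # yaw sweeps the cursor left/right, pitch moves it up/down
                  pyautogui.moveRel(int(-d_yaw * SENSITIVITY), int(-d_pitch * SENSITIVITY))
              self.last = (yaw, pitch)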

    Anyways, stay tuned, I think I'm pretty close to finishing the training utility... as soon as I can figure out how to render the quaternions and gesture path...





chinappk wrote 11/27/2022 at 06:05

This project is very interesting and beautiful. I am also working on a similar project. I have a question: why do you need four fingers just for a keyboard and mouse? I think one finger is enough.


Jay wrote 08/24/2022 at 02:57

This is so genius. Essentially you’ve made a new input device that theoretically could end up being better than your standard mouse. It reminds me of the sci-fi movies where they're pinching and doing all kinds of air hand gestures to control their hologram projection computers. Way to go!


VDfreesince1983 wrote 11/21/2021 at 20:24

Still working on this? I've been tinkering with a similar project for a while now and running into a lot of brick walls when it comes to my technical ability and the affordability of some parts. I'm a PhD student in Media Art and Text (focusing mainly on literature), and I think you could potentially get this thing writing faster than a human can possibly type if you think about the way we read instead of how we write.

For example, we don't read the sentence "The quick brown fox jumped over the lazy dog" character by character; we read it through a mix of cultural triggers as a complete sentence. Any English-speaking reader knows this sentence contains every letter in the alphabet, so we might see "The quick" or the shape of the sentence and immediately see/hear the whole thing in our mind.

So my idea was to train a neural network using gestures paired with a cleaned-up EEG signal, a photoelectric pulse monitor, and your HMD camera for eye movement. (Open BCI makes 'open source' EEG hardware, but they use proprietary and legacy parts in their designs, and buying from them is absurdly expensive. I've prototyped a 3-channel EEG circuit, but I've never quite been able to get it to work, and ran out of time.) You train the EEG/pulse/eye side of the network by just reading until it can reliably predict the text. Then use your gestures, EEG, pulse, and eye sensors to draw a connection between the beginning of a phrase and the corresponding data collected during the reading training.

I basically read for a living and can read about 300 pages in ~2 hours on a good day. To get that speed I am more or less reading the entire page (the shape of the text, white space, context clues, familiar sentence shapes, and keywords), so I don't see why predictive-text NNs couldn't emulate that speed with enough data.


LovelyAlien wrote 04/07/2021 at 03:23

Okay well, I must sayyyyy.... I WANT one!!! Like Wow! What a brilliant idea. You have given me so many ideas. How small do you think you can get the wristband? The finger attachments/sensors look amazing and seem like they would be easy to wear. Reminds me of the gauntlet glove from Avengers haha can't wait to see what new types of designs you can come up with. ps, love the color! It's like I am always telling my team to come up with new drywall designs.


the_3d6 wrote 12/14/2020 at 20:31

That's very interesting! I'm working on a somewhat similar interface now - but the main input source is EMG signal. For now I was focusing on properly recognizing muscle activations, then converting them into commands (like music and keystrokes). But what I didn't think about is motion - even though there is an IMU on board.

You seem to do the opposite, if I got it right - you are using motion as the primary source of information. And it's a huge challenge for processing - I was once part of a group trying to solve a similar task (an even simpler one: we had data from visual recognition and fewer gestures), and after months of effort it still wasn't robust enough.

I think the right answer is in fusing motion and finger states - so for each finger combination only a few spatial gestures are needed (and so recognition can be robust), but together they let you input all letters quickly.


