
Android offline speech recognition natively on PC

Porting the Android on-device speech recognition found in GBoard to TensorFlow Lite or LWTNN

On March 12, 2019 the Google AI blog posted progress on their on-device speech recognizer. It promises real-time, offline, character-by-character speech recognition, and the early reviews I could find are very positive.
Especially the offline part is very appealing to me, as it should be to any privacy-conscious mind. Unfortunately this speech recognizer is only available to Pixel owners at this time.

Since GBoard uses TensorFlow Lite, and the blog post also mentions the use of this library, I was wondering if I could get my hands on the model and import it into my own projects, maybe even using LWTNN.

I'm moderately versed in the world of machine learning, so besides reverse engineering the trained model itself, this project will also consist of me learning TensorFlow, lwtnn and how to apply trained models in new applications. It might all be over my head and result in a complete waste of time.

The workflow will be as follows:

  1. Find the trained models (DONE)
  2. Figure out how to import the model in TensorFlow (DONE)
  3. Figure out how to connect the different inputs and outputs to each other (in progress)
  4. (optional) export to lwtnn
  5. Write lightweight application for dictation
  6. (stretch goal) if importing into TensorFlow Lite is successful, try to get it to work on those cool new RISC-V K210 boards, which can be had including a 6-mic array for ~$20!

Finding the trained models was done by reverse engineering the GBoard app using apktool. Further analysis of the app is necessary to find the right parameters to the models, but the initial blog post also provides some useful info:

Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y(u-1), ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers. Image credit: Chris Thornton

Audio input

The audio input is probably 80 log-Mel channels, as described in this paper. Gauging from the number of inputs to the first encoder (enc0), 3 frames should be stacked and provided to enc0. Then three more frames should be captured to run enc0 again to obtain a second output. Both those outputs should be fed to the second encoder (enc1) to provide it with a tensor of length 1280. The output of the second encoder is fed to the joint.
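
To make this concrete, here is a minimal sketch of the feature pipeline as I currently picture it, assuming 16 kHz audio. The 80 log-Mel channels, 25 ms window, 10 ms step and 125-7500 Hz range come from the paper and the dictation config; using librosa to compute the filterbank is my own choice, not something recovered from GBoard.

import numpy as np
import librosa

# Assumed front end: 80 log-Mel channels per 25 ms frame, 10 ms step, 125-7500 Hz.
audio, sr = librosa.load('sample.wav', sr=16000)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr,
    n_fft=512, win_length=400, hop_length=160,   # 25 ms window, 10 ms step at 16 kHz
    n_mels=80, fmin=125, fmax=7500)
logmel = np.log(mel + 1e-6).T                    # shape (num_frames, 80)

def stack(frames, n):
    # Concatenate n consecutive frames into one input vector.
    usable = (len(frames) // n) * n
    return frames[:usable].reshape(-1, n * frames.shape[1])

enc0_in = stack(logmel, 3)   # (num_frames // 3, 240): three stacked frames per enc0 call
# enc0 yields a 640-length output per call; stacking two consecutive enc0 outputs
# gives the 1280-length input for enc1:
# enc1_in = stack(enc0_out, 2)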

Decoder

The decoder is fed with a tensor of zeros at t=0. Its output is fed to the joint. In the next iteration the decoder is fed with the output of the softmax layer, which is of length 128 and represents the probabilities of the symbol heard in the audio. This way the current symbol depends on all the previous symbols in the sequence.

Joint and softmax

The joint and softmax have the smallest number of tweakable parameters. The two inputs of the joint are just the outputs of the decoder and encoder, and the softmax only turns the joint's output into probabilities between 0 and 1.
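
Putting the three pieces together, the greedy decode loop I have in mind looks roughly like this. The run_dec and run_joint callables are hypothetical wrappers around the corresponding TFLite models, and feeding the full softmax output back into the prediction network is my reading of the blog post rather than confirmed behaviour.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def greedy_decode(enc1_outputs, run_dec, run_joint, symbols):
    # enc1_outputs: iterable of 640-length vectors from the second encoder.
    # run_dec, run_joint: hypothetical callables wrapping the corresponding TFLite models.
    # symbols: index -> string mapping recovered from the symbol table (128 entries).
    prev = np.zeros(len(symbols), dtype=np.float32)   # decoder input at t=0
    transcript = []
    for enc_out in enc1_outputs:
        dec_out = run_dec(prev)                       # prediction network, 640-length output
        probs = softmax(run_joint(enc_out, dec_out))  # 128 symbol probabilities
        transcript.append(symbols[int(np.argmax(probs))])
        prev = probs                                  # softmax output loops back into the decoder
    return ''.join(transcript)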

dictation.ascii_proto

The raw protobuf dictation.config, almost completely converted to ASCII format. Still a couple of ids missing.

ascii_proto - 11.14 kB - 03/17/2019 at 17:39


  • Google Open-Sources Live Transcribe's Speech Engine

    biemster, 08/19/2019 at 12:54

    This is mainly a log to indicate that this project is still very much alive. Google announced on August 16th that it has open-sourced the Live Transcribe speech engine, with an accompanying github repo. What is especially interesting for this project is the following line in the github Readme.md:

    • Extensible to offline models

    I'll be dissecting the code in this repository over the next weeks, and I expect to get some good hints on how to feed the models in my project with a correctly processed audio stream. And maybe I can even plug in the offline models directly and build an Android app with it, who knows? More to come.

  • Recovering the symbol table

    biemster, 03/27/2019 at 18:48

    One last thing that was still missing was the symbol table. Apparently the output of the softmax layer maps directly to the symbol table, and since this output is 128 in length, I was searching for a table of this same length.

    The prime suspect was of course the 'syms' binary, but I could not seem to open it. The blog post mentions FST, so I started my investigation with OpenFst. There is a nice Python wrapper that could open the file and returned a sensible name for it, but when I queried the keys it would return 0.

    In a hex editor I had already noticed a funny sentence somewhere in the body: "We love Marisa." Initially I thought it was some padding put in by a developer, and took no notice of it. However, this is actually the header of a file type used by the library libmarisa, an acronym for Matching Algorithm with Recursively Implemented StorAge.

    Extracting just this part of the binary is easy:

    import sys

    fname = sys.argv[1]
    b = bytearray(open(fname, 'rb').read())
    # The marisa trie starts at the "We love Marisa." magic header; write
    # everything from that point on to a new file. Note the bytes literal:
    # comparing the slice against a plain str would silently fail on Python 3.
    for i in range(len(b)):
        if b[i:i+15] == b"We love Marisa.":
            open(fname + '.marisa', 'wb').write(b[i:])
            break

    And the file syms.marisa can be read with the python package marisa-trie, presenting me with a nice symbol table:

    >>> import marisa_trie
    >>> trie = marisa_trie.Trie()
    >>> trie.load('syms.marisa')
    <marisa_trie.Trie object at 0x7f5ad44330b0>
    >>> trie.items()
    [(u'{', 0),
    (u'{end-quotation-mark}', 122),
    (u'{end-quote}', 123),
    (u'{exclamation-mark}', 124),
    (u'{exclamation-point}', 125),
    (u'{quotation-mark}', 126),
    (u'{quote}', 127),
    (u'{question-mark}', 103),
    (u'{sad-face}', 104),
    (u'{semicolon}', 105),
    (u'{smiley-face}', 106),
    (u'{colon}', 107),
    (u'{comma}', 108),
    (u'{dash}', 109),
    (u'{dot}', 110),
    (u'{forward-slash}', 111),
    (u'{full-stop}', 112),
    (u'{hashtag}', 113),
    (u'{hyphen}', 114),
    (u'{open-quotation-mark}', 115),
    (u'{open-quote}', 116),
    (u'{period}', 117),
    (u'{point}', 118),
    (u'{apostrophe}', 94),
    (u'{left-bracket}', 95),
    (u'{right-bracket}', 96),
    (u'{underscore}', 97),
    (u'<', 1),
    (u'<s>', 119),
    (u'<sorw>', 120),
    (u'<space>', 121),
    (u'</s>', 98),
    (u'<epsilon>', 99),
    (u'<noise>', 100),
    (u'<text_only>', 101),
    (u'<unused_epsilon>', 102),
    (u'!', 2),
    (u'"', 3),
    (u'#', 4),
    (u'$', 5),
    (u'%', 6),
    (u'&', 7),
    (u"'", 8),
    (u'(', 9),
    (u')', 10),
    (u'*', 11),
    (u'+', 12),
    (u',', 13),
    (u'-', 14),
    (u'.', 15),
    (u'/', 16),
    (u'0', 17),
    (u'1', 18),
    (u'2', 19),
    (u'3', 20),
    (u'4', 21),
    (u'5', 22),
    (u'6', 23),
    (u'7', 24),
    (u'8', 25),
    (u'9', 26),
    (u':', 27),
    (u';', 28),
    (u'=', 29),
    (u'>', 30),
    (u'?', 31),
    (u'@', 32),
    (u'A', 33),
    (u'B', 34),
    (u'C', 35),
    (u'D', 36),
    (u'E', 37),
    (u'F', 38),
    (u'G', 39),
    (u'H', 40),
    (u'I', 41),
    (u'J', 42),
    (u'K', 43),
    (u'L', 44),
    (u'M', 45),
    (u'N', 46),
    (u'O', 47),
    (u'P', 48),
    (u'Q', 49),
    (u'R', 50),
    (u'S', 51),
    (u'T', 52),
    (u'U', 53),
    (u'V', 54),
    (u'W', 55),
    (u'X', 56),
    (u'Y', 57),
    (u'Z', 58),
    (u'[', 59),
    (u'\\', 60),
    (u']', 61),
    (u'^', 62),
    (u'_', 63),
    (u'`', 64),
    (u'a', 65),
    (u'b', 66),
    (u'c', 67),
    (u'd', 68),
    (u'e', 69),
    (u'f', 70),
    (u'g', 71),
    (u'h', 72),
    (u'i', 73),
    (u'j', 74),
    (u'k', 75),
    (u'l', 76),
    (u'm', 77),
    (u'n', 78),
    (u'o', 79),
    (u'p', 80),
    (u'q', 81),
    (u'r', 82),
    (u's', 83),
    (u't', 84),
    (u'u', 85),
    (u'v', 86),
    (u'w', 87),
    (u'x', 88),
    (u'y', 89),
    (u'z', 90),
    (u'|', 91),
    (u'}', 92),
    (u'~', 93)]

    I believe I have recovered all the main components now, so what is left is just brute forcing how to present the data to the individual models. I earlier overlooked that the decoder should be initialised with a start-of-sequence symbol <sos> when an utterance starts, so I think I should get a proper endpointer implementation and experiment to find out which symbol this <sos> is (I could only find <sorw>, <s> and </s>).
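
    For reference, looking up those candidate start symbols and their softmax indices in the recovered table only takes a couple of lines:

    import marisa_trie

    trie = marisa_trie.Trie()
    trie.load('syms.marisa')
    # Candidate start-of-sequence markers and their softmax indices:
    for sym in [u'<s>', u'<sorw>', u'</s>']:
        if sym in trie:
            print(sym, trie[sym])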

  • Experiments with the endpointer

    biemster, 03/26/2019 at 19:25

    My focus at the moment is on the endpointer, because I can brute force its signal-processing parameters a lot faster than when I use the complete dictation graph. I added an endpointer.py script to the github repo which should initialize it properly. As a guide I am using a research paper which I believe details the endpointer used in these models, so I swapped to log-Mel filterbank energies instead of the plain power spectrum I used before.

    I believe the endpointer net outputs two probabilities, p(speech) and p(non-speech), as shown in a diagram in the paper.

    The results from endpointer.py are still a bit underwhelming, so some more experiments are needed. I'll update this log when there are more endpointer results.
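
    For reference, the 40-channel input I am currently feeding it looks roughly like the sketch below. Averaging adjacent pairs of the 80 log-Mel channels down to 40 is my own assumption, and getting this preprocessing wrong would go a long way towards explaining the underwhelming results.

    import numpy as np
    import librosa

    def endpointer_features(wav_path):
        # Assumed endpointer input: 80 log-Mel channels per 25 ms frame
        # (10 ms step, 125-7500 Hz), averaged pairwise down to 40 channels.
        audio, sr = librosa.load(wav_path, sr=16000)
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr, n_fft=512, win_length=400, hop_length=160,
            n_mels=80, fmin=125, fmax=7500)
        logmel = np.log(mel + 1e-6).T                           # (num_frames, 80)
        return logmel.reshape(len(logmel), 40, 2).mean(axis=2)  # (num_frames, 40)

    # Each endpointer output frame should then contain two values,
    # which I read as p(speech) and p(non-speech).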

  • First full model tests

    biemster, 03/22/2019 at 20:45

    The github repo is updated with the first full model test. This test just tries to run the RNNs with a sample wav file input.

    What this experiment does is the following:

    1. Split the incoming audio into 25 ms segments, with a step size of 10 ms (so the input buffers overlap). Compute the FFT to calculate the energies in 80 frequency bins between 125 and 7500 Hz. These values are taken from the dictation ascii_proto.
    2. Average those 80 channels to 40 channels to feed the EndPointer model. This model should decide if the end of a symbol is reached in the speech, and signal the rest of the RNNs to work their magic. Just print the output of the endpointer, since I don't know how to interpret the results.
    3. Feed the 80 channels to a stacker for the first encoder (enc0). This encoder takes 3 frames stacked as input, resulting in an input tensor of length 240.
    4. The output of the first encoder goes to a second stacker, since the input of the second encoder (enc1) is twice the length of the output of the first.
    5. The output of the second encoder goes to the joint network. This joint has two inputs of length 640, one of which is looped from the decoder. At first iteration a dummy input from the decoder is used, and the values from the second encoder are the second input.
    6. The output of the joint is fed to the decoder, which produces the final result of the model. This output is fed back into the joint network for the next iteration, and should also go to the next stage of the recognizer (probably an FST?). A minimal sketch of how these TFLite models are invoked is shown right after this list.
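
    This is not the actual test script from the repo, just a minimal sketch of the TFLite invoke pattern the steps above rely on; the input ordering for multi-input models like the joint is still guesswork on my side.

    import numpy as np
    import tensorflow as tf

    def load(name):
        interp = tf.lite.Interpreter(model_path=name + '.tflite')
        interp.allocate_tensors()
        return interp

    def run(interp, *inputs):
        # Feed the inputs in the order reported by get_input_details(),
        # then return all output tensors.
        for detail, value in zip(interp.get_input_details(), inputs):
            interp.set_tensor(detail['index'],
                              np.asarray(value, dtype=detail['dtype']).reshape(detail['shape']))
        interp.invoke()
        return [interp.get_tensor(d['index']) for d in interp.get_output_details()]

    # e.g. the joint takes two 640-length vectors, one from enc1 and one looped
    # back from the decoder:
    # joint = load('joint')
    # joint_out, = run(joint, enc1_out, dec_out)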

    In my initial runs the decoder outputs just NaNs, which is highly disappointing :(.

    When I feed both the first and second encoder with random values, the output of the decoder actually contains proper values, so my first guess is that the FFT energies are not calculated correctly. That will be my focus for now, in combination with the endpointer. My next experiments will search for the correct feeding of the endpointer, so that it gives sensible values at points in the audio sample where symbols should be produced.

    Make it so!

    *I just realized that that should be my test sample.wav*

    UPDATE: the NaN issue in the decoder output was easily solved by making sure no NaN values loop back from the decoder into the joint, so that the joint is fed with proper values in the next iteration.

  • Recovering tflite models from the binaries

    biemster, 03/21/2019 at 08:33

    After hours of looking at hex values, searching for constants or pointers or some sort of pattern, comparing it with known data structures, and making importers for both C++ and python to no avail, I finally hit the jackpot:

    When you look at the header of a proper tflite model, you will see the TFL3 descriptor near the start of the file, and it is present in all tflite model files.

    The binary files in the superpack zip that supposedly contain the models instead have an 'N\V)' string in the same spot as the tflite descriptor, and nowhere else in the 100MB+ files. Then I also remembered being surprised by all these 1a values throughout the binaries from the zip, and noticed they coincide with 00 values in the proper tflite models.

    Now anybody who ever dabbled a bit in reverse engineering probably immediately says: XOR!

    It took me a bit longer to realize that, but the tflite models are easily recovered by XOR'ing the files with the value that sits in place of the 00's:

    import sys

    fname = sys.argv[1]
    b = bytearray(open(fname, 'rb').read())
    # Undo the obfuscation: every byte was XOR'ed with 0x1a.
    for i in range(len(b)):
        b[i] ^= 0x1a
    open(fname + '.tflite', 'wb').write(b)

    This will deobfuscate the binaries, which can then be imported with your favorite TensorFlow Lite API. The following script will give you the inputs and outputs of the models:

    import tensorflow as tf
    
    models = ['joint','dec','enc0','enc1','ep']
    interpreters = {}
    
    for m in models:
        # Load TFLite model and allocate tensors.
        interpreters[m] = tf.lite.Interpreter(model_path=m+'.tflite')
        interpreters[m].allocate_tensors()
        
        # Get input and output tensors.
        input_details = interpreters[m].get_input_details()
        output_details = interpreters[m].get_output_details()
    
        print(m)
        print(input_details)
        print(output_details)

    Now I actually have something to work with! The above script prints the input and output details of the different models.

    The decoder and both encoders have an output of length 640, and the joint has two inputs of length 640. I will have to experiment a bit with what goes where, since the graph I made from the dictation.config and the diagram in the blog post don't seem to be consistent here.

    With the dictation.ascii_proto and those models imported in TensorFlow, I can start scripting the whole workflow. I hope the config has enough information on how to feed the models, but I'm now quite confident that some sort of working example can be made out of this.

  • Analysis of the dictation.config protobuf

    biemster, 03/17/2019 at 20:10

    The dictation.config seems to be the file used by GBoard to make sense of the models in the zipfile. It defines streams, connections, resources and processes. I made a graph of the streams and connections:

    It starts with a single input, which is, as expected, the audio stream. Some signal analysis is of course done before the audio is fed to the neural nets. If I compare this graph with the diagram in the blog post, there are a couple of things unclear to me at the moment:

    1. Where is the loop, that feeds the last character back into the predictor?
    2. Where does the joint network come in?

    The complexity of the above graph worries me a bit, since there will be a lot of variables in the signal analysis I will have to guess. It does however seem to indicate that my initial analysis on the 'enc{0,1}' and 'dec' binaries was incorrect, since they are simply called in series in the above diagram.

    This whole thing actually raises more questions than it answers; I will have to mull it over for a while. In the meantime I will focus on how to read the three binary nets I mentioned above.

  • Reverse engineering GBoard apk to learn how to read the models

    biemster, 03/15/2019 at 10:12

    This is my first endeavor in reversing Android APKs, so please comment below if you have any ideas on how to get more info out of this. I used the tool 'apktool', which gave me a directory full of human-readable stuff, mostly 'smali' files, which I had never heard of before.

    They seem to be some kind of pseudocode, but are still quite readable.

    When I started grepping through those files, again searching for keywords like "ondevice" and "recognizer" and for the filenames found in the zipfile containing the models, I found the following mention of "dictation" in smali/gpf.smali:

    smali/gpf.smali:    const-string v7, "dictation"

    Opening this file in an editor revealed that a const-string "config" was very close by, strengthening my suspicion that the app reads the "dictation.config" file to learn how to read the rest of the files in the package. This is promising, since then I don't have to figure out the file formats myself, and if a future update comes along with better models or different languages, I just need to load the new dictation.config!

    Next up is better understanding the smali files, to figure out how this dictation.config is read, and how it (hopefully) constructs TensorFlow objects from it.

    UPDATE: The dictation.config seems to be a binary protobuf file, which can be decoded with the following command:

    $ protoc --decode_raw < dictation.config

     The output I got is still highly cryptic, but it's progress nonetheless!

    UPDATE 2: I've used another {dictation.config, dictation.ascii_proto} pair I found somewhere to fill in most of the enums found in the decoded config file. This ascii_proto is uploaded in the file section, and it is a lot more readable now. The next step is to use this config to recreate the TensorFlow graph, which I will report on in a new log.

  • Finding the models

    biemster, 03/15/2019 at 09:55

    The on-device speech recognition update comes as an option in GBoard called "Faster voice typing", but it is only available on Pixel phones as of now. I downloaded the latest version of the GBoard app, extracted it with apktool and started grepping for words like "faster" and "ondevice".

    After a while the following link came up during my searches:

    https://www.gstatic.com/android/keyboard/ondevice_recognizer/superpacks-manifest-20181012.json

    Following this link presented me with a small json file containing a single link to an 82 MB zip file with the model files.

    Well this looks promising! The size is about correct as mentioned in the blog post, and there seem to be two encoders, a joint and a decoder, just like in the described model.

    More things that can be speculated are:

    The encoder network is supposed to be four times as large as the prediction network, which is called the decoder. In the file list the 'enc1' file is about four times the size of the 'dec' file, so my guess is that the 'dec' file is the prediction network and 'enc1' is the encoder at the bottom of the diagram. The 'joint' file is almost certainly the Joint Network in the middle, and that would leave the 'enc0' file being the Softmax layer on top.

    Fortunately the dictation.config file seems to specify certain parameters on how to read all files listed here, so my focus will be on how to interpret this config file with some TensorFlow Lite loader.



Discussions

Victor Sklyar wrote 04/08/2019 at 09:10 point

any news?..


biemster wrote 04/08/2019 at 18:42 point

I'm struggling with the inputs to the models. I suspect the mean and standard deviation of the inputs used during training are in the file "input_mean_stddev", which I presume is a hashtable_lookup binary, considering the file starts with 0a, which corresponds to that op entry in the TensorFlow Lite flatbuffer schema.

I have not yet figured out how to import and read this input_mean_stddev file however, and brute forcing the normalization of the inputs has unfortunately not yielded results yet either..


罗国强 wrote 04/08/2019 at 22:05 point

The contents of input_mean_stddev:

The file starts with a 3-byte header [0a c0 07], followed by 240 little-endian float values. This is followed by another 3-byte header [12 c0 07], followed by another 240 floats.

ep_mean_stddev is similar:

a 3-byte header [0a a0 01] followed by 40 little-endian floats, then another 3-byte header [12 a0 01] followed by another 40 little-endian floats.

I hope this helps!
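
If that layout holds, the values can be pulled out with a few lines of numpy. This is just a sketch based on the description above; treating the first block as means and the second as standard deviations is a guess:

import numpy as np

def read_mean_stddev(path, count):
    # Layout per the description above: a 3-byte header, <count> little-endian
    # float32 values, another 3-byte header, and <count> more floats.
    data = open(path, 'rb').read()
    first = np.frombuffer(data, dtype='<f4', count=count, offset=3)
    second = np.frombuffer(data, dtype='<f4', count=count, offset=3 + 4 * count + 3)
    return first, second

mean, stddev = read_mean_stddev('input_mean_stddev', 240)
ep_mean, ep_stddev = read_mean_stddev('ep_mean_stddev', 40)
# presumably used as: normalized = (features - mean) / stddev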


biemster wrote 04/09/2019 at 08:16 point

@罗国强: Awesome!! That really helps a lot! It's like walking around with your eyes closed for weeks, and suddenly remembering how to open them :)


Victor Sklyar wrote 06/10/2019 at 08:31 point

still nothing? :(


biemster wrote 06/10/2019 at 16:27 point

Work got in the way.. Almost all the deobfuscation work is done though; it is just a matter of feeding the correct frequency-domain energies to the model, be it plain FFT or log-Mel stuff. The _stddev files show the averages of those values, and all that is left is an (educated) brute force approach to get them right. I'd be more than happy to give pointers on how to do that, but unfortunately I do not have the spare time to dive into this myself at the moment. Maybe in a couple of weeks, but I can't promise anything.


biemster wrote 06/14/2019 at 19:40 point

Also, I just found a new version of the model. It has the same endpointer, but a much larger joint and slightly different encoders and decoder.

Most interestingly, the dictation.config is a lot larger, so the first thing for me in this project is to dissect this new config file. I hope it holds some more clues as to how to feed the audio to the models.


罗国强 wrote 03/22/2019 at 12:26 point

Awesome work.

Please consider mirroring the original apk and language files.

If the offline recogniser is as good as stated in the blog, Google will try to protect it further, by changing the crypto or introducing other measures.


parameter.pollution wrote 03/22/2019 at 14:32 point

I doubt that they are going to try to obfuscate the current version more, since it's already out. But just in case I have uploaded the original files here: https://mega.nz/#F!D5p1AQyQ!ZPpKdTpooHYNE3gl2EHJQA


biemster wrote 03/22/2019 at 17:14 point

I'm a bit worried too that future models, or maybe even just language updates, will be better protected/obfuscated. Let's not turn this project into a turnkey solution by also providing the tflite models directly, and credit the google research team where credit is due.

I doubt that they will go to the expense of overhauling their model distribution system just to stop small projects like this (fingers crossed).


Victor Sklyar wrote 08/08/2019 at 09:04 point

so... does it mean that is all?


biemster wrote 08/08/2019 at 14:31 point

@Victor Sklyar still on hold, but this will be the first project when I have time again.


parameter.pollution wrote 03/20/2019 at 13:40 point

I had the same idea and decided to google it first and found your project page.
Just to be sure I had an apk that actually contains all the code used by this new speech recognition, I pulled the GBoard apk from my Pixel 2 directly, then decompiled and deobfuscated it with DeGuard (from ETH Zürich); this is the result: http://apk-deguard.com/fetch?fp=48a5831fd3f102aead2390db117c39b70f2084fc4397249af370b08edca78498&q=src

It's quite readable Java code (though not all class/function/variable names are useful, of course), but the bad news is that all the interesting functions seem to point to native library functions ("nativeInitFromProto()" is in "libintegrated_shared_object.so").

I'll fire up a few static binary analysis tools and see if I can get something useful out of it, and I'll let you know when I do (but it's a ~20 MB ARM binary, so with the reverse engineering skills I have I am not very optimistic).


biemster wrote 03/20/2019 at 20:12 point

Awesome, nice work! That is a lot more readable than the smali files apktool is giving me. Keep me posted on your progress!


parameter.pollution wrote 03/21/2019 at 19:05 point

Great work with the XOR!

I first tried decompilation with radare2 + Cutter, but since the library files are so big it struggled. So I decided to try Ghidra (the disassembler/decompiler the NSA recently released) and it handled them very well.

I have uploaded the decompilation results (C code, created with Ghidra) of the two library files I think could contain the code we are looking for here: https://mega.nz/#F!D5p1AQyQ!ZPpKdTpooHYNE3gl2EHJQA
But it's a LOT of code. And these libraries were originally written in C++ and this is the result of decompilation to C, so that makes it even less readable.

But there are strings of error messages in there that point to a Google-internal library called "greco3", and I found a Google blog post that references it: https://ai.googleblog.com/2010/09/google-search-by-voice-case-study.html
So "greco3" might be the library they use for the fft/filterbank/... audio preprocessing stuff that you found in the protobuf file.

The JNI functions that are called by the java code can be found in the decompiled code by searching for functions that start with "Java_com_", e.g. "Java_com_google_speech_recognizer_ResourceManager_nativeInitFromProto" (in "libintegrated_shared_object.so_ghidra-decompiled.c"). But reading the code gets confusing really fast. So for actually jumping around in the code it's probably better to just load and analyze the library files with Ghidra.

I'll try to decompile it to C++, but again, I am not very optimistic that it will work well.

Maybe the better approach is to try to implement the preprocessing functions based on what their names suggest they do, but that's probably a brute force approach and could take a while.


biemster wrote 03/22/2019 at 09:18 point

The greco3 lib and ResourceManager were exactly what I was searching for too, since the ascii_proto does not say what to feed to the inputs of the networks. The audio preprocessing indeed looks quite straightforward from the config file, with 25 ms samples and 80 channels in the frequency domain between 125 and 7500 Hz. It seems that I should just feed the nets with the energies in these channels, and since the input of the first encoder is 240 in length, I should stack 3 frames from the filter bank? Something like that.

After that it will probably be a lot of brute forcing indeed, since I don't know yet which network comes next and how. Should I follow the diagram from the blog, and feed the output of the enc0 to the joint, or should the output go to enc1 according to the dictation.config diagram?

Also, I have to figure out which model is the softmax layer. I guess that is the dec, but the blog post calls the Prediction Layer the decoder..

And then the output of the softmax layer has to be translated to characters. I have a suspicion that this is done using OpenFst, a finite-state transducer library. This package shows up in some git commits, and the abbreviation FST shows up a few times in this context in the config file.

Still some work ahead!


parameter.pollution wrote 03/24/2019 at 20:52 point

It took quite some time and trial & error, but I managed to also decompile the libintegrated_shared_object.so binary to C with the RetDec decompiler (though only the 32-bit ARM binary, not the original from my Pixel which was 64-bit ARM, because RetDec only supports 32-bit ARM right now).

I think it's a little bit easier to read, but I haven't had time yet to take a closer look to see if we can find the info we are looking for in there.

(uploaded as "libintegrated_shared_object.so_retdec-decompiled.c" to the same folder as the other files: https://mega.nz/#F!D5p1AQyQ!ZPpKdTpooHYNE3gl2EHJQA )

