Description

On March 12, 2019 the Google AI blog posted about progress on their on-device speech recognizer. It promises real-time, offline, character-by-character speech recognition, and the early reviews I could find are very positive.
The offline part is especially appealing to me, as it should be to any privacy-conscious mind. Unfortunately this speech recognizer is only available to Pixel owners at this time.

Since GBoard uses TensorFlow Lite, and the blog post also mentions the use of this library, I was wondering if I could get my hands on the model and import it into my own projects, maybe even using lwtnn.

I'm moderately versed in the world of machine learning, so besides reverse engineering this specific trained model, the project will also consist of me learning TensorFlow, lwtnn and the application of trained models in new applications. It might all be over my head, and result in a complete waste of time.

Details

The workflow will be as follows:

1. Find the trained models (DONE)
2. Figure out how to import the model in TensorFlow (DONE)
3. Figure out how to connect the different inputs and outputs to each other (in progress)
4. (optional) Export to lwtnn
5. Write lightweight application for dictation (DONE)

Finding the trained models was done by reverse engineering the GBoard app using apktool. Further analysis of the app is necessary to find the right parameters to the models, but the initial blog post also provides some useful info:

Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y_u-1, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers. Image credit: Chris Thornton

Audio input

The audio input is probably 80 log-Mel channels, as described in this paper. Gauging from the number of inputs to the first encoder (enc0), 3 frames should be stacked and provided to enc0. Then three more frames should be captured to run enc0 again to obtain a second output. Both those outputs should be fed to the second encoder (enc1) to provide it with a tensor of length 1280. The output of the second encoder is fed to the joint.
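
The stacking arithmetic above can be sketched as follows. This is only an illustration using the sizes quoted in the text (80 channels, 3 frames for enc0, two 640-long outputs for enc1); the helper `stack_frames` and the zero-filled stand-in for the enc0 output are made up, and the real models are not involved.

```python
# Sketch of the stacking described above; sizes come from the text,
# the helper name stack_frames is hypothetical.
N_MEL = 80          # log-Mel channels per frame
ENC0_STACK = 3      # frames stacked for enc0 -> 240 inputs
ENC0_OUT = 640      # enc0 output length
ENC1_STACK = 2      # enc0 outputs stacked for enc1 -> 1280 inputs


def stack_frames(frames, n):
    """Concatenate every n consecutive frames into one flat vector.

    Trailing frames that do not fill a complete stack are dropped.
    """
    stacked = []
    for i in range(0, len(frames) - n + 1, n):
        flat = []
        for frame in frames[i:i + n]:
            flat.extend(frame)
        stacked.append(flat)
    return stacked


# Six dummy log-Mel frames give two enc0 inputs of length 240.
mel_frames = [[float(t)] * N_MEL for t in range(6)]
enc0_inputs = stack_frames(mel_frames, ENC0_STACK)

# Stand-in for running enc0: the real model would produce these vectors.
enc0_outputs = [[0.0] * ENC0_OUT for _ in enc0_inputs]
enc1_inputs = stack_frames(enc0_outputs, ENC1_STACK)

print(len(enc0_inputs), len(enc0_inputs[0]))  # 2 240
print(len(enc1_inputs), len(enc1_inputs[0]))  # 1 1280
```

So six audio frames are needed before enc1 can run once, if the stacking really works without overlap as assumed here.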

Decoder

The decoder is fed with a tensor of zeros at t=0. The output is fed to the joint. In the next iteration the decoder is fed with the output of the softmax layer, which is of length 128 and represents the probabilities of the symbol heard in the audio. This way the current symbol depends on all the previous symbols in the sequence.

Joint and softmax

The joint and softmax have the fewest tweakable parameters. The two inputs of the joint are just the outputs of the decoder and encoder, and the softmax only turns this output into probabilities between 0 and 1.
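
For reference, a softmax over a 128-long vector can be written in a few lines of plain Python. This is a generic sketch, not code taken from the models; the dummy input and the chosen index are arbitrary.

```python
import math


def softmax(logits):
    """Turn raw joint outputs into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Dummy joint output of length 128 with one clearly preferred entry.
logits = [0.0] * 128
logits[72] = 5.0
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(sum(probs), 6))  # 72 1.0
```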

Overview

I recently found a nice overview presentation of (almost) current research, with an interesting description starting on slide 81 explaining when to advance the encoder and retain the prediction network state.
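
The rule of when to advance the encoder and when to keep the prediction network state is the core of greedy RNN-T decoding, and can be sketched as below. This is the textbook greedy search, not something recovered from GBoard. The blank index, the per-frame symbol cap and the two toy functions are assumptions that only exist to make the loop runnable.

```python
BLANK = 0          # assumed blank index, not confirmed for these models
MAX_SYMBOLS = 5    # safety cap on symbols emitted per encoder frame


def greedy_decode(enc_frames, predict, joint):
    """Greedy RNN-T search over a list of encoder outputs.

    predict(symbol, state) -> (pred_out, new_state)
    joint(enc_out, pred_out) -> list of scores, one per symbol
    """
    hyp = []
    pred_out, state = predict(BLANK, None)   # start from an empty history
    for enc_out in enc_frames:
        for _ in range(MAX_SYMBOLS):
            scores = joint(enc_out, pred_out)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == BLANK:
                break                        # advance the encoder, keep prediction state
            hyp.append(best)
            pred_out, state = predict(best, state)   # update only on a real symbol
    return hyp


# Toy stand-ins so the loop can run without the real models.
def toy_predict(symbol, state):
    return symbol, state                     # "remembers" only the last symbol


def toy_joint(enc_out, pred_out):
    scores = [0.0] * 4
    # Emit the frame's symbol once, then answer blank so the encoder advances.
    target = BLANK if enc_out == pred_out else enc_out
    scores[target] = 1.0
    return scores


print(greedy_decode([1, 0, 2, 2, 3], toy_predict, toy_joint))  # [1, 2, 3]
```

With the real models, `predict` would wrap the decoder and `joint` the joint plus softmax; a beam search would replace the single `max` with several tracked hypotheses.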

Files

dictation.ascii_proto

raw protobuf dictation.config almost completely converted to ascii format. Still a couple ids missing.

ascii_proto - 11.14 kB - 03/17/2019 at 17:39

Project Logs

SODA finally landed, client working
biemster • 12/15/2020 at 18:03 • 23 comments
UPD: wine is not necessary anymore.
So SODA finally landed, sort of, and for a couple weeks already apparently. I've been on the lookout for the Linux library, since that is my preferred environment and I was under the impression that the development was taking place on that platform. But I was wrong, and the Windows and macOS libraries were available since late November.

Since I'm much more capable on a Linux machine, I've searched (and found!) a way to use either one of those available libraries. In my last post I reported on quite a successful project with the Google TTS library, which resulted in a very lightweight client for it. And fortunately the same can be said for the SODA client, resulting in a very small code base with only the library as dependency. This enabled me to work with wine, and have it pipe the data straight from whatever Linux application I wanted to use to the Windows DLL.

Just issue the following command:
```
$ ecasound -f:16,1,16000 -i alsa -o:stdout | wine gasr.exe
```
and watch your conversations roll over the screen:
```
W1215 22:58:43.683654      44 soda_async_impl.cc:390] Soda session starting (require_hotword:0, hotword_timeout_in_millis:0)
>>> hello
>>> hello from
>>> hello from
>>> hello from sod
>>> hello from soda
>>> hello from soda
>>> final: hello from soda
```
The SODA client I wrote is developed in a separate repository (gasr), as it will be mostly just a tool to do the full reverse engineering of the RNN and transducer. But having an actual working implementation will greatly improve my ability to figure out the inner workings of the models.
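
The `-f:16,1,16000` in the ecasound command means 16-bit, mono, 16 kHz audio. As a sketch of how a recorded file could be fed instead of a live microphone, the snippet below checks a WAV file against that format and yields raw PCM chunks. It assumes the client simply reads raw PCM from stdin, as the pipe above suggests; the function name and the chunk size are my own choices.

```python
import io
import wave

SAMPLE_RATE = 16000   # Hz, from -f:16,1,16000
CHANNELS = 1
SAMPLE_WIDTH = 2      # bytes, i.e. 16-bit


def pcm_chunks(wav_file, chunk_ms=100):
    """Yield raw PCM chunks from a WAV file that matches the format above."""
    with wave.open(wav_file, 'rb') as w:
        fmt = (w.getframerate(), w.getnchannels(), w.getsampwidth())
        if fmt != (SAMPLE_RATE, CHANNELS, SAMPLE_WIDTH):
            raise ValueError('expected 16 kHz mono 16-bit, got %r' % (fmt,))
        frames_per_chunk = SAMPLE_RATE * chunk_ms // 1000
        while True:
            data = w.readframes(frames_per_chunk)
            if not data:
                break
            yield data


# Self-test on one second of silence held in memory.
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(CHANNELS)
    w.setsampwidth(SAMPLE_WIDTH)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(b'\x00\x00' * SAMPLE_RATE)
buf.seek(0)
chunks = list(pcm_chunks(buf))
print(len(chunks), len(chunks[0]))  # 10 3200
```

Writing each chunk to `sys.stdout.buffer` would turn this into a drop-in replacement for the ecasound side of the pipe.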

Using wine as an intermediate is still far from ideal, but I guess that the Linux library will also pop up soon considering ChromeOS would depend on it.

UPDATE:

As @a1is pointed out, the Linux library is also out there already, so no need to go the wine way anymore. And as an added bonus, the GBoard models are working with these libraries as well! That opens up a whole world of experimentation, since there are already quite a few of those spotted in the wild!

UPDATE2:

Now with a python client in the repo, for easier integration with home automation and such.

ChromeVox Next offline TTS client, a sister project
biemster • 11/30/2020 at 15:23 • 2 comments
A large part of my reason to start this speech recognition project was home automation, and to get some feedback from my home a logical companion project is the reverse of speech recognition: Text to Speech (TTS). The offline TTS on Android has quite a pleasant voice and seemed to be running on Chrome on Desktop as well, so it made a perfect combination with the recognition pipeline.

With the recognition project waiting for SODA to be released for Chrome, I had a bit of time to tackle this TTS, which turned out to be quite a bit easier, and definitely quicker!

I started with getting a copy of Chrome OS in a virtual machine, courtesy of neverware.com. Incidentally I also found a page in the Chrome source repo mentioning that the TTS part of Chrome OS can be downloaded separately.

After a weekend of getting ltrace running on Chrome OS, and tracing the calls to the googletts library, I wrote a tiny (~50 lines of C) client which pipes any Google voice (there are quite a few languages available) to, for example, ALSA:
```
./gtts "Hello from Google Text to Speech!" | aplay -r22050 -fFLOAT_LE -c1
```
It's just a tiny proof of concept, which can easily be added to any C/C++ project out there. A python wrapper would also be awesome; maybe I'll get to that someday (or maybe you want to pick that up?). Code lives on github; if you decide to use it, don't forget to leave a comment below!
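
The aplay flags above (`-r22050 -fFLOAT_LE -c1`) say the client emits mono 32-bit float samples at 22050 Hz. If a file is more useful than live playback, the raw stream could be converted to a 16-bit WAV roughly like this. It is a sketch that assumes the samples lie in [-1, 1]; the function name is made up.

```python
import io
import struct
import wave

RATE = 22050   # matches aplay -r22050


def float_le_to_wav(raw, out_file):
    """Convert raw mono float32 little-endian samples to a 16-bit WAV."""
    count = len(raw) // 4
    samples = struct.unpack('<%df' % count, raw[:count * 4])
    # Clip to [-1, 1] (assumed range) before scaling to int16.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(out_file, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(RATE)
        w.writeframes(struct.pack('<%dh' % count, *ints))
    return count


raw = struct.pack('<4f', 0.0, 0.5, -1.0, 2.0)   # last value gets clipped
out = io.BytesIO()
n = float_le_to_wav(raw, out)
out.seek(0)
with wave.open(out, 'rb') as w:
    decoded = struct.unpack('<4h', w.readframes(4))
print(n, decoded)  # 4 (0, 16383, -32767, 32767)
```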

SODA: Speech On-Device API
biemster • 05/04/2020 at 16:13 • 10 comments
Progress has been slow lately, due to the difficulty of tracing the apps in how they use the models, and also my lack of free android hardware to run tests on. However, this may become completely irrelevant in the near future!

It seems Google is building speech recognition into Chromium, to bring a feature called Live Caption to the browser. To transcribe videos playing in the browser a new API is slowly being introduced: SODA. There is already a lot of code in the chromium project related to this, and it seems that it might be introduced this summer or even sooner. A nice overview of the code already in the codebase can be seen in the commit renaming SODA to speech recognition.

What is especially interesting is that it seems it will be using the same language packs and RNNT models as the Recorder and GBoard apks, since I recently found the following model zip:

It is from a link found in the latest GBoard app, but it clearly indicates that the model will be served via the soda API. It is still speculative though, but it seems only logical (to me at least) that the functionality in Chromium will be based on the same models. There is also this code:

which indicates that soda will come as a library, and reads the same dictation configs as the android apps do.

Since Chromium is open source, it will help enormously in figuring out how to talk to the models. This opens up a third way of getting them incorporated into my own projects:

1) import the tflite files directly in tensorflow (difficult to figure out how, especially the input audio and the beam search at the end)

2) create a java app for android and have the native library import the model for us (greatly limits the platform where it can run on, also requires android hardware which i don't have free atm)

3) Follow the code in Chromium, as it will likely use the same models

I actually came up with a straightforward 4th option as well recently, and can't believe I did not think of that earlier:

4) Patch GBoard so it also enables "Faster voice typing" on non-Pixel devices. Then build a simple app with a single text field that sends everything typed (by gboard voice typing) to something like MQTT or whatever your use case might be.

I'll keep a close eye on the Live Caption feature of Chromium, because I think that is the most promising path at the moment, and keep my fingers crossed it will show how to use those models in my own code. In the meantime I found a couple more RNNT models for the GBoard app, one of which was called "small" and was only 12MB in size:

Very curious how that one performs!
UPDATE June 8:
The following commit just went into the chromium project, stating:
```
This CL adds a speech recognition client that will be used by the Live
Caption feature. This is a temporary implementation using the Open
Speech API that will allow testing and experimentation of the Live
Caption feature while the Speech On-Device API (SODA) is under
development. Once SODA development is completed, the Cloud client will
be replaced by the SodaClient.
```
So I guess the SODA implementation is taking a bit longer than expected, and an online recognizer will be used initially when the Live Caption feature launches. On the upside, this temporary code should already enable me to write some boilerplate stuff to interface with the recognizer, so when SODA lands I can hit the ground running.

Pixel 4 Recorder app with offline transcribe
biemster • 10/16/2019 at 07:58 • 4 comments

On the 15th of October Google showcased the new Recorder app for Pixel 4 devices, with real-time transcription. After downloading the app and peeking inside, I found it contains the same type of RNNT models with 2 encoders, a decoder and a joint, so I assume it's the same model. It is considerably smaller though, so I expect it to be an update.
The tflite files are also not obfuscated, and the zip contains .ascii_proto files that are human readable. It even contains shell scripts to run the models on a local machine!
This is the third full model I'm analysing, and it seems to contain the most info thus far. I'll update this log if I find out more.

Google Open-Sources Live Transcribe's Speech Engine
biemster • 08/19/2019 at 12:54 • 0 comments
This is mainly a log to indicate that this project is still very much alive. Google announced August 16th that it open sourced the Live Transcribe speech engine, with an accompanying github repo. What is especially interesting for this project is the following line in the github Readme.md:
- Extensible to offline models
I'll be dissecting the code in this repository for the next weeks, and I expect to get some good hints how to feed the models in my project with a correctly processed audio stream. And maybe I can even plug in the offline models directly and build an android app with it, who knows? More to come.

Recovering the symbol table

biemster • 03/27/2019 at 18:48 • 0 comments

One last thing that was still missing was the symbol table. Apparently the output of the softmax layer maps directly to the symbol table, and since this output is 128 in length, I was searching for a table of this same length.

The prime suspect was of course the 'syms' binary, but I could not seem to open it. The blog post mentions FST, so I started my investigation with OpenFST. There is a nice python wrapper that could open the file, and returned some sensible name for the thing. But when I queried the keys, it would return 0.

In a hex editor I already noticed a funny sentence somewhere in the body: "We love Marisa." Initially I thought that was some padding built in by a developer, and took no notice of it. However, this is actually the header of a filetype for the library libmarisa, which is an acronym for Matching Algorithm with Recursively Implemented StorAge.

Extracting just this part of the binary is easy:

```
import sys

MAGIC = b'We love Marisa.'  # header of a libmarisa trie file

fname = sys.argv[1]
with open(fname, 'rb') as f:
    b = f.read()

# Everything from the header onwards is the trie
i = b.find(MAGIC)
if i >= 0:
    with open(fname + '.marisa', 'wb') as f:
        f.write(b[i:])
```

And the file syms.marisa can be read with the python package marisa-trie, presenting me with a nice symbol table:

import marisa_trie
trie = marisa_trie.Trie()
trie.load('syms.marisa')
<marisa_trie.Trie object at 0x7f5ad44330b0>

trie.items()
[(u'{', 0),
(u'{end-quotation-mark}', 122),
(u'{end-quote}', 123),
(u'{exclamation-mark}', 124),
(u'{exclamation-point}', 125),
(u'{quotation-mark}', 126),
(u'{quote}', 127),
(u'{question-mark}', 103),
(u'{sad-face}', 104),
(u'{semicolon}', 105),
(u'{smiley-face}', 106),
(u'{colon}', 107),
(u'{comma}', 108),
(u'{dash}', 109),
(u'{dot}', 110),
(u'{forward-slash}', 111),
(u'{full-stop}', 112),
(u'{hashtag}', 113),
(u'{hyphen}', 114),
(u'{open-quotation-mark}', 115),
(u'{open-quote}', 116),
(u'{period}', 117),
(u'{point}', 118),
(u'{apostrophe}', 94),
(u'{left-bracket}', 95),
(u'{right-bracket}', 96),
(u'{underscore}', 97),
(u'<', 1),
(u'<s>', 119),
(u'<sorw>', 120),
(u'<space>', 121),
(u'</s>', 98),
(u'<epsilon>', 99),
(u'<noise>', 100),
(u'<text_only>', 101),
(u'<unused_epsilon>', 102),
(u'!', 2),
(u'"', 3),
(u'#', 4),
(u'$', 5),
(u'%', 6),
(u'&', 7),
(u"'", 8),
(u'(', 9),
(u')', 10),
(u'*', 11),
(u'+', 12),
(u',', 13),
(u'-', 14),
(u'.', 15),
(u'/', 16),
(u'0', 17),
(u'1', 18),
(u'2', 19),
(u'3', 20),
(u'4', 21),
(u'5', 22),
(u'6', 23),
(u'7', 24),
(u'8', 25),
(u'9', 26),
(u':', 27),
(u';', 28),
(u'=', 29),
(u'>', 30),
(u'?', 31),
(u'@', 32),
(u'A', 33),
(u'B', 34),
(u'C', 35),
(u'D', 36),
(u'E', 37),
(u'F', 38),
(u'G', 39),
(u'H', 40),
(u'I', 41),
(u'J', 42),
(u'K', 43),
(u'L', 44),
(u'M', 45),
(u'N', 46),
(u'O', 47),
(u'P', 48),
(u'Q', 49),
(u'R', 50),
(u'S', 51),
(u'T', 52),
(u'U', 53),
(u'V', 54),
(u'W', 55),
(u'X', 56),
(u'Y', 57),
(u'Z', 58),
(u'[', 59),
(u'\\', 60),
(u']', 61),
(u'^', 62),
(u'_', 63),
(u'`', 64),
(u'a', 65),
(u'b', 66),
(u'c', 67),
(u'd', 68),
(u'e', 69),
(u'f', 70),
(u'g', 71),
(u'h', 72),
(u'i', 73),
(u'j', 74),
(u'k', 75),
(u'l', 76),
(u'm', 77),
(u'n', 78),
(u'o', 79),
(u'p', 80),
(u'q', 81),
(u'r', 82),
(u's', 83),
(u't', 84),
(u'u', 85),
(u'v', 86),
(u'w', 87),
(u'x', 88),
(u'y', 89),
(u'z', 90),
(u'|', 91),
(u'}', 92),
(u'~', 93)]
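
Once the table is known, turning a sequence of symbol ids back into text is a dictionary lookup. The sketch below uses a handful of entries copied from the table above. How the recognizer itself treats `<space>` and the other bracketed tags is my assumption, and `ids_to_text` is a made-up helper.

```python
# A few entries copied from the table above; the real table has 128 symbols.
symbols = {u'<space>': 121, u'h': 72, u'e': 69, u'l': 76, u'o': 79, u'</s>': 98}
id_to_symbol = {idx: sym for sym, idx in symbols.items()}


def ids_to_text(ids):
    """Map symbol ids to text, turning <space> into ' ' and skipping other tags."""
    out = []
    for i in ids:
        sym = id_to_symbol[i]
        if sym == '<space>':
            out.append(' ')
        elif len(sym) > 2 and sym.startswith('<') and sym.endswith('>'):
            continue        # control symbols such as </s> carry no text
        else:
            out.append(sym)
    return ''.join(out)


print(ids_to_text([72, 69, 76, 76, 79, 121, 72, 69, 76, 76, 79, 98]))  # hello hello
```

With the full trie loaded, `id_to_symbol` could be built the same way from `trie.items()`.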

I believe I have recovered all the main components now, so what is left is just brute forcing how to present the data to the individual models. I earlier overlooked that the decoder should be initialised with a <sos> (start of sequence) symbol when an utterance starts, so I think I should get a proper endpointer implementation and experiment to find out which symbol this <sos> is (I could only find a <sorw>, <s> and </s>).

Experiments with the endpointer
biemster • 03/26/2019 at 19:25 • 0 comments

My focus at the moment is on the endpointer, because I can bruteforce its parameters for the signal processing a lot faster than when I use the complete dictation graph. I added an endpointer.py script to the github repo which should initialize it properly. I'm using a research paper which I believe details the endpointer used in the models as a guide, so I swapped to using log-Mel filterbank energies instead of the plain power spectrum as before.

I believe the endpointer net outputs two probabilities: p(speech) and p(non speech) as given in this diagram from the paper:

The results from the endpointer.py are still a bit underwhelming:

so some more experiments are needed. I'll update this log when there are more endpointer results.
UPDATE:
Thanks to awesome work being done by thebabush in the github repo, the endpointer gives very good results now! The most relevant changes are the normalisation of the input by 32767, and the change of the upper bound of the log mel features from 7500Hz to 3800Hz. This gives excellent results:
The top plot is the wav file with some utterances, the middle is p(speech) and the bottom is p(non speech), both with 0 being high probability. (Or the other way around, actually not sure of the absolute meaning of the values)
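
The two changes mentioned in the update can be illustrated in isolation. The sketch below scales 16-bit samples by 32767 and places filter centres evenly on the mel scale up to 3800 Hz. The mel formula is one common variant and may not be the one the models were trained with. The 125 Hz lower bound and the 40 channels are taken from the full-model log and are only assumed to apply to the endpointer too.

```python
import math


def normalise(samples):
    """Scale signed 16-bit samples to roughly [-1, 1]."""
    return [s / 32767.0 for s in samples]


def hz_to_mel(f):
    # One common mel-scale variant; assumed, not confirmed for these models.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centres(n_bands, f_low, f_high):
    """Centre frequencies of n_bands filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


centres = mel_centres(40, 125.0, 3800.0)
print(normalise([32767, 0, -32767]))  # [1.0, 0.0, -1.0]
print(round(centres[0]), round(centres[-1]))
```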

First full model tests
biemster • 03/22/2019 at 20:45 • 0 comments
The github repo is updated with the first full model test. This test just tries to run the RNNs with a sample wav file input.

What this experiment does is the following:
1. Split the incoming audio in 25 ms segments, with a stepsize of 10ms (so the input buffers overlap). Compute the FFT to calculate the energies in 80 frequency bins between 125 and 7500 Hz. The above values are taken from the dictation ascii_proto.
2. Average those 80 channels to 40 channels to feed the EndPointer model. This model should decide if the end of a symbol is reached in the speech, and signal the rest of the RNNs to work their magic. Just print the output of the endpointer, since I don't know how to interpret the results.
3. Feed the 80 channels to a stacker for the first encoder (enc0). This encoder takes 3 frames stacked as input, resulting in an input tensor of length 240.
4. The output of the first encoder goes to a second stacker, since the input of the second encoder (enc1) is twice the length of the output of the first.
5. The output of the second encoder goes to the joint network. This joint has two inputs of length 640, one of which is looped from the decoder. At first iteration a dummy input from the decoder is used, and the values from the second encoder are the second input.
6. The output of the joint is fed to the decoder, which produces the final result of the model. This result is fed back into the joint network for the next iteration, and should go to the next stage of the recognizer (probably FST?)
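
Steps 1 and 2 above boil down to some simple bookkeeping, sketched here without the FFT itself. The 16 kHz sample rate is an assumption, and so is the idea that "averaging 80 channels to 40" means averaging neighbouring pairs; both helper names are made up.

```python
SAMPLE_RATE = 16000                      # Hz, assumed
WIN = SAMPLE_RATE * 25 // 1000           # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000           # 10 ms step   -> 160 samples


def split_frames(samples):
    """Return overlapping windows; a trailing partial window is dropped."""
    return [samples[i:i + WIN] for i in range(0, len(samples) - WIN + 1, HOP)]


def average_pairs(channels):
    """Average neighbouring channels, e.g. 80 energies -> 40."""
    return [(channels[i] + channels[i + 1]) / 2.0
            for i in range(0, len(channels) - 1, 2)]


frames = split_frames([0] * SAMPLE_RATE)         # one second of audio
ep_input = average_pairs([float(c) for c in range(80)])
print(len(frames), len(frames[0]), len(ep_input), ep_input[:2])  # 98 400 40 [0.5, 2.5]
```
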
In my initial runs the decoder outputs just NaNs, which is highly disappointing :(.

When I feed both the first and second encoder with random values, the output of the decoder is actually proper values, so my first guess is that the fft energies are not calculated correctly. That will be my focus for now, in combination with the endpointer. My next experiments will search for the correct feeding of the endpointer, so it gives sensible values at points in the audio sample where symbols should be produced.

Make it so!

*I just realize that that should be my test sample.wav*
UPDATE: the NaN issue in the decoder output was easily solved by making sure no NaN values loop back from the decoder into the joint, so that in the next iteration the joint is fed with proper values.
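
A guard of that kind can be as small as the sketch below. Replacing NaNs with zero is my assumption about a sensible fill value, not necessarily what the repo does.

```python
import math


def scrub_nans(values, fill=0.0):
    """Replace NaN entries so they are not fed back into the joint."""
    return [fill if math.isnan(v) else v for v in values]


dec_out = [0.25, float('nan'), -1.0]
print(scrub_nans(dec_out))  # [0.25, 0.0, -1.0]
```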

Recovering tflite models from the binaries
biemster • 03/21/2019 at 08:33 • 4 comments
After hours of looking at hex values, searching for constants or pointers or some sort of pattern, comparing it with known data structures, and making importers for both C++ and python to no avail, I finally hit the jackpot:

When you look at the header of a proper tflite model you will see something like this:

Especially the TFL3 descriptor is present in all model files. The binary files in the superpack zip supposedly containing the models look like this:

They all have this 'N\V)' string on the same spot as the tflite model's descriptor, and nowhere else in the 100MB+ files. Then I also remembered being surprised by all these 1a values throughout all the binaries from the zip, and noticed they coincide with 00 values from the proper tflite models.

Now anybody who ever dabbled a bit in reverse engineering probably immediately says: XOR!

It took me a bit longer to realize that, but the tflite models are easily recovered xor'ing the files with the value in place of the 00's:
```
import sys
fname = sys.argv[1]
b = bytearray(open(fname, 'rb').read())
for i in range(len(b)): b[i] ^= 0x1a
open(fname + '.tflite', 'wb').write(b)
```
This will deobfuscate the binaries, which can then be imported with your favorite tensorflow lite API. The following script will give you the inputs and outputs of the models:
```
import tensorflow as tf

models = ['joint','dec','enc0','enc1','ep']
interpreters = {}

for m in models:
    # Load TFLite model and allocate tensors.
    interpreters[m] = tf.lite.Interpreter(model_path=m+'.tflite')
    interpreters[m].allocate_tensors()
    
    # Get input and output tensors.
    input_details = interpreters[m].get_input_details()
    output_details = interpreters[m].get_output_details()

    print(m)
    print(input_details)
    print(output_details)
```
Now I actually have something to work with! The above script gives the following output, showing the input and output tensors of the different models:
The decoder and both encoders have an output with length 640, and the joint has two inputs of length 640. I will have to experiment a bit with what goes where, since the graph I made from the dictation.config and the diagram in the blog post don't seem to be consistent here.
With the dictation.ascii_proto and those models imported in tensorflow, I can start scripting the whole workflow. I hope the config has enough information on how to feed the models, but I'm quite confident now some sort of working example can be made out of this.
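
As a footnote to the XOR trick: the key does not have to be spotted by eye. A tflite file is a FlatBuffer whose file identifier `TFL3` sits at bytes 4 to 8, so XOR-ing the byte found there with `T` gives the key directly. The sketch below shows this on a fake 16-byte header; the function names are made up and the sample bytes are not from a real model.

```python
TFLITE_ID = b'TFL3'     # FlatBuffers file identifier, stored at bytes 4..8


def guess_xor_key(blob):
    """Return the single-byte XOR key that reveals the TFL3 identifier, or None."""
    key = blob[4] ^ TFLITE_ID[0]
    if bytes(b ^ key for b in blob[4:8]) == TFLITE_ID:
        return key
    return None


def deobfuscate(blob, key):
    return bytes(b ^ key for b in blob)


# The obfuscated files show 'N\V)' where a plain model has 'TFL3'.
sample = bytes([0x1a] * 4) + b'N\\V)' + bytes([0x1a] * 8)
key = guess_xor_key(sample)
print(hex(key), deobfuscate(sample, key)[4:8])  # 0x1a b'TFL3'
```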

Analysis of the dictation.config protobuf
biemster • 03/17/2019 at 20:10 • 0 comments
The dictation.config seems to be the file used by GBoard to make sense of the models in the zipfile. It defines streams, connections, resources and processes. I made a graph of the streams and connections:
It starts with a single input, as expected the audio stream. There is some signal analysis done of course, before it is fed to the neural nets. If I compare this diagram with the one in the blog post, there are a couple things unclear to me at the moment:
1. Where is the loop, that feeds the last character back into the predictor?
2. Where does the joint network come in?
The complexity of the above graph worries me a bit, since there will be a lot of variables in the signal analysis I will have to guess. It does however seem to indicate that my initial analysis on the 'enc{0,1}' and 'dec' binaries was incorrect, since they are simply called in series in the above diagram.
This whole thing actually raises more questions than it answers, so I will have to mull this over for a while. In the meantime I will focus on how to read the 3 binary nets I mention above.

Discussions

alopez247 wrote 02/09/2022 at 16:31

Hi! Awesome project! I'm currently trying to compile it. After downloading the models from Chrome (libsoda version: 1.1.0.1) I'm getting this error:

g++ -o gasr gasr.c -L. -Wl,-rpath,. -lsoda
/usr/bin/ld: /tmp/ccOy0vpm.o: in function `main':
gasr.c:(.text+0x172): reference to `CreateSodaAsync' undefined
/usr/bin/ld: gasr.c:(.text+0x22b): reference to `AddAudio' undefined
/usr/bin/ld: gasr.c:(.text+0x24f): reference to `DeleteSodaAsync' undefined
collect2: error: ld returned 1 exit status
make: *** [Makefile:5: gasr] Error 1

Might this be due to some backwards compatibility issue? Maybe the project source code requires an older libsoda version (1.0.X.X)? Should this be the case, any tips on how to get an older libsoda version?

Thanks!!

biemster wrote 04/03/2022 at 15:14

The newest soda libraries indeed do not support the "nonExtended" API anymore. The python code is the only example that works, because the Extended API requires protobuffers and it would take quite an effort for me to implement that in the C code.

I'm not sure if using an old version of the library is the way forward, since they don't seem to support the newer gboard models. So implementing protobuffers would be advised if you need this in C.

If you really need an old version I can probably find one on my disk, PM in that case.

biemster wrote 04/03/2022 at 17:50

I just found protobuf-c, which converts proto files into c code. So adding them to the C client should be quite straightforward.. (although I doubt I will get to it anytime soon)

Textmode wrote 04/15/2021 at 03:10

Hi, like your project and looking forward to any progress you will make in the future. Just curious as to where you got the French model from? Are there any other language models that are available? Sorry if you have already answered this, I couldn't find any responses.

biemster wrote 04/23/2021 at 06:21

The French model was misplaced in a GBoard superpack (see the other logs / comments how to obtain those), and there is a Spanish model available in Chrome. Besides that there are app ids for Japanese, Italian and German languages, but the updater service does not recognize those yet.

UPDATE: All six languages mentioned here are now available from the chrome updater.

Textmode wrote 06/22/2021 at 04:51

That's great! Is it possible to download the other languages and can they be used with this project?

biemster wrote 06/22/2021 at 10:38

@Textmode yes they can be downloaded with chrome, and used with this project

benhuang2018 wrote 04/06/2021 at 03:04

Hi Sir/Madam,

I'm new to hackaday, i found this project very interesting, and i'm trying to reproduce the result with the instructions in the description and readme file. but i can't make it work... please enlighten me if i did something wrong...

my environment is ubuntu 18.04.5, it's a clean environment, i install it on a new drive.

here are my steps:

1. download chrome beta from google (https://www.google.com/chrome/beta/), the version i got is 90.0.4430.51(64-bit)

2. enable live caption from setting:

setting -> advanced -> accessibility -> Caption -> live caption -> set to ON and wait for download to complete

3. clone the gasr repo: git clone https://github.com/biemster/gasr.git

4. copy the library file from chrome's profile folder:

> cd /home/<username>/.config/google-chrome-beta/SODA/1.0.3/SODAFiles

> cp libsoda.so <project_dir>

5. copy language file from chrome's profile folder:

> cd /home/<username>/.config/google-chrome-beta/SODALanguagePacks/en-US/1.0.0

> cp -r SODAModels/ <project_dir>

6. build the gasr.c

> make

result >>> g++ -o gasr gasr.c -L. -Wl,-rpath,. -lsoda

7. run the model with sample audio (https://raw.githubusercontent.com/Azure-Samples/cognitive-services-speech-sdk/f9807b1079f3a85f07cbb6d762c6b5449d536027/samples/cpp/windows/console/samples/whatstheweatherlike.wav):

> sudo ./gasr < samples_cpp_windows_console_samples_whatstheweatherlike.wav

benhuang2018 wrote 04/06/2021 at 03:06

and my output is below:

> sudo ./gasr < samples_cpp_windows_console_samples_whatstheweatherlike.wav

WARNING: Logging before InitGoogle() is written to STDERR
W0406 11:03:00.264677 22156 soda_async_impl.cc:282] Creating soda_impl
I0406 11:03:00.264819 22156 soda_impl.cc:285] Maximum audio history (ms): 30000
I0406 11:03:00.264840 22156 soda_impl.cc:306] Adding Resampler from 16000 to 16000
I0406 11:03:00.264977 22156 soda_impl.cc:506] Enabling power evaluator.
I0406 11:03:00.264983 22156 soda_impl.cc:516] Adding preamble processor.
I0406 11:03:00.264986 22156 soda_impl.cc:525] Enabling On Device ASR
I0406 11:03:00.265156 22156 terse_processor.cc:707] Config file: ./SODAModels/configs/ONDEVICE_MEDIUM_CONTINUOUS.config
I0406 11:03:00.265403 22156 terse_processor.cc:172] Loaded PipelineDef.
I0406 11:03:00.265416 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model.syms.compact
I0406 11:03:00.265421 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model.syms.compact
I0406 11:03:00.265424 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model.syms.compact
I0406 11:03:00.265429 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model.syms.compact
I0406 11:03:00.265547 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_dictation_frontend_params.mean_stddev
I0406 11:03:00.265555 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_dictation_frontend_params.mean_stddev
I0406 11:03:00.265557 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_dictation_frontend_params.mean_stddev
I0406 11:03:00.265561 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_dictation_frontend_params.mean_stddev
I0406 11:03:00.265602 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model.wpm.portable
I0406 11:03:00.265608 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model.wpm.portable
I0406 11:03:00.265610 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model.wpm.portable
I0406 11:03:00.265614 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model.wpm.portable
I0406 11:03:00.268811 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model.word_classifier
I0406 11:03:00.269136 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model.word_classifier
I0406 11:03:00.269139 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model.word_classifier
I0406 11:03:00.269145 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model.word_classifier
I0406 11:03:00.269186 22156 dir_path.cc:52] Checking FileExists: ./denorm/embedded_replace_annotated_punct_words_dash.mfar
I0406 11:03:00.269193 22156 dir_path.cc:57] Not Found FileExists: ./denorm/embedded_replace_annotated_punct_words_dash.mfar
I0406 11:03:00.269195 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/denorm/embedded_replace_annotated_punct_words_dash.mfar
I0406 11:03:00.269200 22156 dir_path.cc:54] Found FileExists: ./SODAModels/denorm/embedded_replace_annotated_punct_words_dash.mfar
I0406 11:03:00.269240 22156 dir_path.cc:52] Checking FileExists: ./denorm/embedded_fix_ampm.mfar
I0406 11:03:00.269246 22156 dir_path.cc:57] Not Found FileExists: ./denorm/embedded_fix_ampm.mfar
I0406 11:03:00.269248 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/denorm/embedded_fix_ampm.mfar
I0406 11:03:00.269252 22156 dir_path.cc:54] Found FileExists: ./SODAModels/denorm/embedded_fix_ampm.mfar
I0406 11:03:00.269277 22156 dir_path.cc:52] Checking FileExists: ./denorm/embedded_class_denorm.mfar
I0406 11:03:00.269284 22156 dir_path.cc:57] Not Found FileExists: ./denorm/embedded_class_denorm.mfar
I0406 11:03:00.269287 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/denorm/embedded_class_denorm.mfar
I0406 11:03:00.269291 22156 dir_path.cc:54] Found FileExists: ./SODAModels/denorm/embedded_class_denorm.mfar
I0406 11:03:00.269358 22156 dir_path.cc:52] Checking FileExists: ./denorm/embedded_normalizer.mfar
I0406 11:03:00.269363 22156 dir_path.cc:57] Not Found FileExists: ./denorm/embedded_normalizer.mfar
I0406 11:03:00.269366 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/denorm/embedded_normalizer.mfar
I0406 11:03:00.269369 22156 dir_path.cc:54] Found FileExists: ./SODAModels/denorm/embedded_normalizer.mfar
I0406 11:03:00.269407 22156 dir_path.cc:52] Checking FileExists: ./denorm/porn_normalizer_on_device.mfar
I0406 11:03:00.269413 22156 dir_path.cc:57] Not Found FileExists: ./denorm/porn_normalizer_on_device.mfar
I0406 11:03:00.269415 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/denorm/porn_normalizer_on_device.mfar
I0406 11:03:00.269419 22156 dir_path.cc:54] Found FileExists: ./SODAModels/denorm/porn_normalizer_on_device.mfar
I0406 11:03:00.269465 22156 dir_path.cc:52] Checking FileExists: ./acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_model
I0406 11:03:00.269472 22156 dir_path.cc:57] Not Found FileExists: ./acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_model
I0406 11:03:00.269475 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_model
I0406 11:03:00.269480 22156 dir_path.cc:54] Found FileExists: ./SODAModels/acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_model
I0406 11:03:00.269483 22156 neural_network_resource.cc:71] Initializing for TENSORFLOW_LITE
I0406 11:03:00.269643 22156 dir_path.cc:52] Checking FileExists: ./acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_mean_stddev
I0406 11:03:00.269650 22156 dir_path.cc:57] Not Found FileExists: ./acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_mean_stddev
I0406 11:03:00.269653 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_mean_stddev
I0406 11:03:00.269658 22156 dir_path.cc:54] Found FileExists: ./SODAModels/acousticmodel/MARBLE_DICTATION_EP.endpointer_portable_lstm_mean_stddev
I0406 11:03:00.269678 22156 dir_path.cc:52] Checking FileExists: ./magic_mic/MARBLE_V2_acoustic_model.int8.tflite
I0406 11:03:00.269683 22156 dir_path.cc:57] Not Found FileExists: ./magic_mic/MARBLE_V2_acoustic_model.int8.tflite
I0406 11:03:00.269686 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/magic_mic/MARBLE_V2_acoustic_model.int8.tflite
I0406 11:03:00.269691 22156 dir_path.cc:54] Found FileExists: ./SODAModels/magic_mic/MARBLE_V2_acoustic_model.int8.tflite
I0406 11:03:00.269694 22156 neural_network_resource.cc:71] Initializing for TENSORFLOW_LITE
I0406 11:03:00.269802 22156 dir_path.cc:52] Checking FileExists: ./magic_mic/MARBLE_V2_acoustic_meanstddev_vector
I0406 11:03:00.269808 22156 dir_path.cc:57] Not Found FileExists: ./magic_mic/MARBLE_V2_acoustic_meanstddev_vector
I0406 11:03:00.269811 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/magic_mic/MARBLE_V2_acoustic_meanstddev_vector
I0406 11:03:00.269816 22156 dir_path.cc:54] Found FileExists: ./SODAModels/magic_mic/MARBLE_V2_acoustic_meanstddev_vector
I0406 11:03:00.269832 22156 dir_path.cc:52] Checking FileExists: ./magic_mic/MARBLE_V2_vocabulary.syms
I0406 11:03:00.269836 22156 dir_path.cc:57] Not Found FileExists: ./magic_mic/MARBLE_V2_vocabulary.syms
I0406 11:03:00.269839 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/magic_mic/MARBLE_V2_vocabulary.syms
I0406 11:03:00.269844 22156 dir_path.cc:54] Found FileExists: ./SODAModels/magic_mic/MARBLE_V2_vocabulary.syms
I0406 11:03:00.271909 22156 dir_path.cc:52] Checking FileExists: ./magic_mic/MARBLE_V2_model.int8.tflite
I0406 11:03:00.271918 22156 dir_path.cc:57] Not Found FileExists: ./magic_mic/MARBLE_V2_model.int8.tflite
I0406 11:03:00.271921 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/magic_mic/MARBLE_V2_model.int8.tflite
I0406 11:03:00.271925 22156 dir_path.cc:54] Found FileExists: ./SODAModels/magic_mic/MARBLE_V2_model.int8.tflite
I0406 11:03:00.273563 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model-encoder.part_0.tflite
I0406 11:03:00.273572 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model-encoder.part_0.tflite
I0406 11:03:00.273575 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-encoder.part_0.tflite
I0406 11:03:00.273580 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-encoder.part_0.tflite
I0406 11:03:00.273582 22156 neural_network_resource.cc:71] Initializing for PLAIN_TENSORFLOW_LITE
I0406 11:03:00.279180 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model-encoder.part_1.tflite
I0406 11:03:00.279198 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model-encoder.part_1.tflite
I0406 11:03:00.279201 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-encoder.part_1.tflite
I0406 11:03:00.279207 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-encoder.part_1.tflite
I0406 11:03:00.279209 22156 neural_network_resource.cc:71] Initializing for PLAIN_TENSORFLOW_LITE
I0406 11:03:00.297939 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model-rnnt.decoder.tflite
I0406 11:03:00.297966 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model-rnnt.decoder.tflite
I0406 11:03:00.297971 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-rnnt.decoder.tflite
I0406 11:03:00.297980 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-rnnt.decoder.tflite
I0406 11:03:00.297984 22156 neural_network_resource.cc:71] Initializing for PLAIN_TENSORFLOW_LITE
I0406 11:03:00.304153 22156 dir_path.cc:52] Checking FileExists: ./endtoendmodel/marble_rnnt_model-rnnt.joint.tflite
I0406 11:03:00.304173 22156 dir_path.cc:57] Not Found FileExists: ./endtoendmodel/marble_rnnt_model-rnnt.joint.tflite
I0406 11:03:00.304176 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-rnnt.joint.tflite
I0406 11:03:00.304181 22156 dir_path.cc:54] Found FileExists: ./SODAModels/endtoendmodel/marble_rnnt_model-rnnt.joint.tflite
I0406 11:03:00.304184 22156 neural_network_resource.cc:71] Initializing for PLAIN_TENSORFLOW_LITE
I0406 11:03:00.305925 22156 dir_path.cc:52] Checking FileExists: ./voice_match/MARBLE_speakerid.tflite
I0406 11:03:00.305935 22156 dir_path.cc:57] Not Found FileExists: ./voice_match/MARBLE_speakerid.tflite
I0406 11:03:00.305937 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/voice_match/MARBLE_speakerid.tflite
I0406 11:03:00.305941 22156 dir_path.cc:54] Found FileExists: ./SODAModels/voice_match/MARBLE_speakerid.tflite
I0406 11:03:00.309580 22156 terse_processor.cc:189] Initialized ResourceManager.
I0406 11:03:00.309700 22156 terse_processor.cc:200] Initialized GoogleRecognizer.
I0406 11:03:00.309708 22156 context-module-factory.cc:35] ContextModuleFactory: Initializing ContextModule.
I0406 11:03:00.309824 22156 dir_path.cc:52] Checking FileExists: ./context_prebuilt/apps.txt
I0406 11:03:00.309837 22156 dir_path.cc:57] Not Found FileExists: ./context_prebuilt/apps.txt
I0406 11:03:00.309842 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/context_prebuilt/apps.txt
I0406 11:03:00.309847 22156 dir_path.cc:54] Found FileExists: ./SODAModels/context_prebuilt/apps.txt
I0406 11:03:00.309888 22156 dir_path.cc:52] Checking FileExists: ./context_prebuilt/en-US_android-auto_car_automation.action.union_STD_FST.fst
I0406 11:03:00.309894 22156 dir_path.cc:57] Not Found FileExists: ./context_prebuilt/en-US_android-auto_car_automation.action.union_STD_FST.fst
I0406 11:03:00.309897 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/context_prebuilt/en-US_android-auto_car_automation.action.union_STD_FST.fst
I0406 11:03:00.309902 22156 dir_path.cc:54] Found FileExists: ./SODAModels/context_prebuilt/en-US_android-auto_car_automation.action.union_STD_FST.fst
I0406 11:03:00.328810 22156 dir_path.cc:52] Checking FileExists: ./context_prebuilt/contacts.txt
I0406 11:03:00.328837 22156 dir_path.cc:57] Not Found FileExists: ./context_prebuilt/contacts.txt
I0406 11:03:00.328842 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/context_prebuilt/contacts.txt
I0406 11:03:00.328866 22156 dir_path.cc:54] Found FileExists: ./SODAModels/context_prebuilt/contacts.txt
I0406 11:03:00.329062 22156 dir_path.cc:52] Checking FileExists: ./context_prebuilt/songs.txt
I0406 11:03:00.329071 22156 dir_path.cc:57] Not Found FileExists: ./context_prebuilt/songs.txt
I0406 11:03:00.329073 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/context_prebuilt/songs.txt
I0406 11:03:00.329077 22156 dir_path.cc:54] Found FileExists: ./SODAModels/context_prebuilt/songs.txt
I0406 11:03:00.329116 22156 dir_path.cc:52] Checking FileExists: ./context_prebuilt/en-US_android-auto_top_radio_station_frequencies_STD_FST.fst
I0406 11:03:00.329122 22156 dir_path.cc:57] Not Found FileExists: ./context_prebuilt/en-US_android-auto_top_radio_station_frequencies_STD_FST.fst
I0406 11:03:00.329125 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/context_prebuilt/en-US_android-auto_top_radio_station_frequencies_STD_FST.fst
I0406 11:03:00.329143 22156 dir_path.cc:54] Found FileExists: ./SODAModels/context_prebuilt/en-US_android-auto_top_radio_station_frequencies_STD_FST.fst
I0406 11:03:00.329230 22156 dir_path.cc:52] Checking FileExists: ./context_prebuilt/en-US_android-auto_manual_fixes_STD_FST.fst
I0406 11:03:00.329238 22156 dir_path.cc:57] Not Found FileExists: ./context_prebuilt/en-US_android-auto_manual_fixes_STD_FST.fst
I0406 11:03:00.329240 22156 dir_path.cc:52] Checking FileExists: ./SODAModels/context_prebuilt/en-US_android-auto_manual_fixes_STD_FST.fst
I0406 11:03:00.329244 22156 dir_path.cc:54] Found FileExists: ./SODAModels/context_prebuilt/en-US_android-auto_manual_fixes_STD_FST.fst
W0406 11:03:00.329525 22156 terse_processor.cc:1753] SODA could not build Hotquery Matcher.
W0406 11:03:00.329545 22156 terse_processor.cc:288] TISID disabled.
I0406 11:03:00.329572 22156 terse_processor.cc:809] Domain: CAPTION
I0406 11:03:00.330211 22156 context-module-impl.cc:244] ContextModule starts to provide model resources: 2021-04-06T11:03:00.33021148+08:00
I0406 11:03:00.330333 22156 context-module-impl.cc:281] ContextModule finished providing model resources : 2021-04-06T11:03:00.330333605+08:00 elapsed: 122.125us
I0406 11:03:00.336258 22156 terse_processor.cc:1438] Resetting Terse Processor
I0406 11:03:00.336319 22156 terse_processor.cc:941] Cancelling session.
W0406 11:03:00.336492 22176 portable_intended_query_stream.cc:235] Exiting due to stream cancellation.
W0406 11:03:00.336942 22156 decoder_endpointer_stream.cc:40] Acoustic ep reader thread cancelled.
I0406 11:03:00.337181 22156 terse_processor.cc:850] Setup completed
I0406 11:03:00.337196 22156 soda_impl.cc:593] Server ASR Disabled
I0406 11:03:00.337201 22156 soda_impl.cc:653] Initializing audio logger
W0406 11:03:00.337240 22156 soda_async_impl.cc:442] SODA session starting (require_hotword:0, hotword_timeout_in_millis:0, trigger_type:TRIGGER_TYPE_UNSPECIFIED, hybrid_asr_config.mode:MODE_DEFAULT)
I0406 11:03:00.337380 22156 soda_async_impl.cc:639] Session parameters updated. Reconfiguring SODA.
W0406 11:03:00.337562 22157 soda_async_impl.cc:788] SODA stopped processing audio, mics audio processed in millis: 0, loopback audio processed in millis: 0
I0406 11:03:00.337794 22157 terse_processor.cc:1438] Resetting Terse Processor
I0406 11:03:00.337804 22157 terse_processor.cc:941] Cancelling session.
W0406 11:03:00.337817 22157 soda_async_impl.cc:840] SODA session stopped due to: STOP_CALLED
I0406 11:03:00.337855 22158 recognition_event_delegate.cc:37] Soda session stopped due to: STOP_CALLED
W0406 11:03:00.360020 22156 soda_async_impl.cc:911] Deleting soda_impl

biemster wrote 04/08/2021 at 15:25

Hi Ben, Welcome to hackaday! You got all the steps correct, except for the final thing: you need to patch the library to skip the API key and call stack verifications. I've added some extra info in the git issue you posted with this same question.

Also please trim down your other post with the log output? It makes the rest of the discussions difficult to find. Good luck with the patching, and don't hesitate to follow up in the git issue if needed!

am009 wrote 01/24/2021 at 05:26

It's such an awesome experience to try following such a great project!!!

biemster wrote 04/27/2021 at 16:07

Victor Sklyar wrote 05/19/2020 at 13:07

I replaced lp_en_us.zip with russian model.

But the result comes out without spaces.

Раздватричетырепятьвышелзайчикпогулять</S>.

biemster wrote 05/20/2020 at 11:56

That's odd. But yeah, it was never meant to work like this at all, so every model that works at least a bit is already a bonus.

AIFanatic wrote 05/10/2020 at 07:25

I have done some analysis and posted my findings here https://github.com/AIFanatic/google-offline-speech-recognition

biemster wrote 05/10/2020 at 11:32

Very impressive! I'm setting up an android-x86 env on my machine at the moment, so the frida scripts will come in very handy (have no experience with that yet).

In the readme.md you mention the unknown function of the *_mean_stdev files; they are very likely the input normalizations of the models.

Although my main focus in this project is the SODA approach for now, I'll be keeping a close eye on your git repo for sure!

theafien wrote 05/10/2020 at 12:22

excellent job, will help a lot in the future. I will learn tensorflow in the future. so it will be easier to understand some files.

Jude Ashly wrote 04/04/2020 at 23:20

Any updates.. ?

biemster wrote 04/05/2020 at 12:51

Not much, creating a java app that links with the lib and just reads the language pack as a whole still seems the most viable route to me, but I don't have spare hardware to work on this.

Jude Ashly wrote 04/14/2020 at 20:50

I found an x86 version of the Google speech library, will that be of any use to you?

biemster wrote 04/15/2020 at 15:51

I think I found the x86_64 lib in gboard or maybe it was some google search app as well, but they are linked against the bionic clib and don't work on plain linux. I'm starting to think it is easiest to keep most of the Recorder app intact, and try to capture the strings it writes to the GUI using something like frida and have that pass it to something like MQTT. But again, I need some arm hardware for that, which I don't have atm.

Jude Ashly wrote 03/06/2020 at 20:57

Have you tried using libhybris to port it to Raspberry Pi?

biemster wrote 03/08/2020 at 17:14

Did not yet, but it looks very promising if it does what it says it does! Gonna try that this week. Thanks!

biemster wrote 03/09/2020 at 09:37

Well that turned out to be quite a rabbit hole..

theafien wrote 10/17/2019 at 17:49

Today I put the model for my country (Brazil) from the Google App into Google Recorder, and it works fine. The engine is the same.

biemster wrote 10/18/2019 at 07:10

You mean the pixel 4 recorder app, with offline dictation? To my best recollection the Google App uses online dictation, could you check?

theafien wrote 10/18/2019 at 13:12

The Google App on other devices (I don't know about Pixel devices) handles offline dictation. Gboard (non-Pixel) uses a Google App intent for offline speech recognition.

What I did:

I took the Brazilian model from the Google App and used it to replace lp_en_us; this works fine.

Mike Tran wrote 03/27/2020 at 03:17

Hi @theafien, I'm trying to do the same thing as you did for the Brazilian model with the Google Recorder app, but for another language. Could you please show us in more detail how you did it with the Brazilian model? When you said "replace lp_en_us", what does that mean?

Huge thanks in advance.

ha wrote 04/02/2020 at 14:08

1. get APKs

- Google Recorder (com.google.android.apps.recorder):
https://www.apkmirror.com/apk/google-inc/google-recorder/google-recorder-1-1-289058594-release/
- Google App (com.google.android.googlequicksearchbox):
https://www.apkmirror.com/apk/google-inc/google-search/google-search-11-3-7-release/

2. unpack APKs
- for example with ApkStudio: https://github.com/vaibhavpandeyvpz/apkstudio

3. get language files
- in the unpacked Google App you find URLs in:
res/raw/default_voice_search_configuration

- for example Spanish:
https://dl.google.com/dl/android/voice/es-ES/v200/es-ES-v200-f19.zip

4. place downloaded file and checksum into unpacked Recorder
- calculate the md5 checksum of es-ES-v200-f19.zip and save it in lp_en_us_checksum.md5
md5sum es-ES-v200-f19.zip
757284aeecb08ba92a61cbdc4e04a301

- rename es-ES-v200-f19.zip to lp_en_us.zip
- replace lp_en_us.zip and lp_en_us_checksum.md5 with your new ones in folder res/raw of unpacked com.google.android.apps.recorder

5. (if not on google pixel) patch Recorder
- https://github.com/Xmader/google-recorder
- in ApkStudio:
- open: smali/com/google/android/apps/recorder/ui/application/RecorderApplication.smali
- replace "com.google.android.feature.PIXEL_2017_EXPERIENCE" with "android.hardware.microphone"

6. build + sign + deploy
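
The rename-and-checksum step above can also be scripted. This is a minimal sketch, not part of the original instructions: the helper names `md5_hex` and `install_language_pack` are made up for illustration, and it assumes the checksum file holds only the bare hex digest (the format Recorder actually expects is not confirmed here).

```python
import hashlib
import tempfile
from pathlib import Path

def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def install_language_pack(zip_path, raw_dir):
    """Copy zip_path to raw_dir/lp_en_us.zip and write its MD5 next to it."""
    raw_dir = Path(raw_dir)
    target = raw_dir / "lp_en_us.zip"
    target.write_bytes(Path(zip_path).read_bytes())
    checksum = md5_hex(target)
    # Assumption: the checksum file contains just the hex digest, no newline.
    (raw_dir / "lp_en_us_checksum.md5").write_text(checksum)
    return checksum

# Demo with a throwaway file standing in for the downloaded language pack.
with tempfile.TemporaryDirectory() as tmp:
    pack = Path(tmp) / "es-ES-v200-f19.zip"
    pack.write_bytes(b"not a real language pack")
    raw = Path(tmp) / "raw"  # stands in for res/raw of the unpacked Recorder
    raw.mkdir()
    checksum = install_language_pack(pack, raw)
    stored = (raw / "lp_en_us_checksum.md5").read_text()
    print(checksum == stored)
```

In real use, `raw_dir` would point at the res/raw folder of the unpacked com.google.android.apps.recorder.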

Mike Tran wrote 04/03/2020 at 20:20

Thanks a lot @ha. I will try your instructions right away.

Victor Sklyar wrote 04/08/2019 at 09:10

any news?..

biemster wrote 04/08/2019 at 18:42

I'm struggling with the inputs to the models. I suspect the mean and standard deviation of the inputs used during training are in the file "input_mean_stddev", which I presume is a hashtable_lookup binary considering the file starts with 0a and this corresponds to that OP entry in the tensorflow lite flatbuffer schema.

I did not figure out yet how to import and read this input_mean_stddev file however, and brute forcing the normalization of the inputs also did not yield results yet unfortunately..

罗国强 wrote 04/08/2019 at 22:05

the contents of input_mean_stddev:

file starts with a 3 byte header [0a c0 07], followed by 240 little-endian float values. this is followed by another 3 byte header [12 c0 07], followed by another 240 floats.

ep_mean_stddev is similar:

a 3 byte header [0a a0 01] followed by 40 little-endian floats. another 3 byte header [12 a0 01] followed by another 40-little endian floats.

i hope this helps!
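
A note on those headers: read as protobuf wire format, 0a and 12 are the tags for length-delimited fields 1 and 2, and the varints c0 07 and a0 01 decode to 960 and 160, which are exactly 240 × 4 and 40 × 4 bytes. Under that reading (an interpretation, not something confirmed from the app), a minimal parser sketch could look like this. Which field is the mean and which the stddev is a guess, and the function names are made up for illustration.

```python
import struct

def read_varint(buf, pos):
    """Decode a protobuf-style base-128 varint starting at buf[pos]."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def parse_mean_stddev(buf):
    """Return {field_number: [floats]} for each length-delimited block."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag = buf[pos]  # assumes single-byte tags, i.e. field numbers below 16
        pos += 1
        if tag & 0x07 != 2:
            raise ValueError("expected a length-delimited field")
        length, pos = read_varint(buf, pos)
        count = length // 4
        fields[tag >> 3] = list(struct.unpack_from("<%df" % count, buf, pos))
        pos += length
    return fields

# Synthetic stand-in shaped like ep_mean_stddev (2 x 40 floats), not real data.
demo = (b"\x0a\xa0\x01" + struct.pack("<40f", *[0.5] * 40)
        + b"\x12\xa0\x01" + struct.pack("<40f", *[2.0] * 40))
parsed = parse_mean_stddev(demo)
print(len(parsed[1]), len(parsed[2]))
```

For a real file, pass the bytes read from input_mean_stddev or ep_mean_stddev instead of `demo`.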

biemster wrote 04/09/2019 at 08:16

@罗国强: Awesome!! that helps really a lot! It's like walking around with your eyes closed for weeks, and suddenly remembering how to open them :)

Victor Sklyar wrote 06/10/2019 at 08:31

still nothing? :(

biemster wrote 06/10/2019 at 16:27

work got in the way.. Almost all the obfuscated stuff is done though, it is just a matter of feeding the correct frequency domain energies to the model, be it plain FFT or Mel-log stuff. The _stddev files show the averages of those values, all that is left is an (educated) brute force approach to get those right. I'd be more than happy to give pointers on how to do that, but unfortunately I do not have the spare time to dive into this myself at the moment. Maybe in a couple of weeks, but I can't promise anything.

biemster wrote 06/14/2019 at 19:40

Also i just found a new version of the model. It has the same endpointer, but a much larger joint and slightly different encoders and decoder.

But most interestingly the dictation.config is a lot larger, so first thing for me in this project is to dissect this new config file. I hope there are some more clues as to how to feed the audio to the models.

罗国强 wrote 03/22/2019 at 12:26

Awesome work.

Please consider mirroring the original apk and language files.

If the offline recogniser is as good as stated in the blog, Google will try to protect it further, by changing the crypto or introducing other measures.

parameter.pollution wrote 03/22/2019 at 14:32

I doubt that they are going to try to obfuscate the current version more, since it's already out. But just in case I have uploaded the original files here: https://mega.nz/#F!D5p1AQyQ!ZPpKdTpooHYNE3gl2EHJQA

theafien wrote 10/23/2019 at 13:49

@parameter.pollution what did you use to deobfuscate the apk?

biemster wrote 10/23/2019 at 14:55

@theafien the apk was not obfuscated, just the tflite model files. They were XORed with 0x1a.
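
For reference, a single-byte XOR like that is its own inverse, so the same function obfuscates and restores. This is a minimal sketch that assumes every byte of the file is XORed with 0x1a; `xor_bytes` is a made-up helper name, and the sample bytes are synthetic. A handy sanity check is that TFLite flatbuffers carry the file identifier "TFL3" at byte offset 4.

```python
def xor_bytes(data: bytes, key: int = 0x1A) -> bytes:
    """XOR every byte with a single-byte key; applying it twice is a no-op."""
    return bytes(b ^ key for b in data)

# Synthetic 8-byte header: a root offset followed by the "TFL3" identifier.
sample = b"\x1c\x00\x00\x00TFL3"
obfuscated = xor_bytes(sample)
restored = xor_bytes(obfuscated)

# After decoding, bytes 4..8 should spell the TFLite file identifier.
print(restored[4:8] == b"TFL3")
```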

theafien wrote 10/23/2019 at 15:00

@biemster I'm testing the engine .so files and classes (Google Recorder), so I can rely on them in my applications.

biemster wrote 10/24/2019 at 07:52

@theafien you mean you're trying to just link against the whole speech recognizer lib? That would be awesome if it works, although I don't know of any examples that manage such a feat with any of the android native libraries.. But don't let that discourage you, I'm far from an expert in android.

theafien wrote 10/24/2019 at 12:13

@biemster I'm already able to use the libs, I just need to understand some elements (like GoogleEndPoint) and some protobuf parameters. I'm getting events from the lib.

biemster wrote 10/24/2019 at 12:30

@theafien I find that really impressive! Do you see any chance to share your progress? I assume you're running it on ARM, like Raspberry Pi or so? If the latter, that would open up tons of possibilities to trace and reverse engineer this!

theafien wrote 10/24/2019 at 14:46

@biemster I'll separate the code later. I'm running it on my arm64 phone (but I have the x86 lib too) and using Google Recorder as a base.

biemster wrote 03/22/2019 at 17:14

I'm a bit worried too that future models, or maybe even just language updates, will be better protected/obfuscated. Let's not turn this project into a turnkey solution by also providing the tflite models directly, and credit the google research team where credit is due.

I doubt that they will put up the expense to overhaul their model distribution system to stop small projects like this (fingers crossed).

Victor Sklyar wrote 08/08/2019 at 09:04

so... does it mean that is all?

biemster wrote 08/08/2019 at 14:31

@Victor Sklyar still on hold, but this will be first project when i have time again.

parameter.pollution wrote 03/20/2019 at 13:40

I had the same idea and decided to google it first and found your project page.
Just to be sure I have an apk that actually contains all the code used by this new speech recognition, I decided to pull the gboard apk from my pixel 2 directly and then I decompiled and deobfuscated the apk with apk deguard (from eth zürich) and this is the result: http://apk-deguard.com/fetch?fp=48a5831fd3f102aead2390db117c39b70f2084fc4397249af370b08edca78498&q=src

It's quite readable java code (though not all class/function/variable names are useful of course), but the bad news is that all the interesting functions seem to point to native library functions ( "nativeInitFromProto()" is in "libintegrated_shared_object.so").

I'll fire up a few static binary analysis tools and see if I can get something useful out of it and I'll let you know when I do (but it's a ~20MB arm binary....., so I am not very optimistic with the reverse engineering skills I have).

biemster wrote 03/20/2019 at 20:12

Awesome, nice work! That is a lot more readable than the smali files apktool is giving me. Keep me posted on your progress!

parameter.pollution wrote 03/21/2019 at 19:05

Great work with the XOR!

I first tried decompilation with radare2 + cutter, but since the library files are so big it struggled. So I decided to try it with Ghidra (disassembler/decompiler the NSA recently released) and it handled it very well.

I have uploaded the decompilation (C code; created with Ghidra) results of the 2 library files I think could contain code we are looking for here: https://mega.nz/#F!D5p1AQyQ!ZPpKdTpooHYNE3gl2EHJQA
But it's a LOT of code. And these libraries were originally written in C++ and this is the result of decompilation to C, so that makes it even less readable.

But there are strings of error messages in there that point to a google internal library called "greco3" and I found this google blogpost that references it: https://ai.googleblog.com/2010/09/google-search-by-voice-case-study.html
So "greco3" might be the library they use for the fft/filterbank/... audio preprocessing stuff that you found in the protobuf file.

The JNI functions that are called by the java code can be found in the decompiled code by searching for functions that start with "Java_com_", e.g. "Java_com_google_speech_recognizer_ResourceManager_nativeInitFromProto" (in "libintegrated_shared_object.so_ghidra-decompiled.c"). But reading the code gets confusing really fast. So for actually jumping around in the code it's probably better to just load and analyze the library files with Ghidra.
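
For anyone who prefers a quick listing before opening Ghidra, the "Java_com_" search can be done with a few lines of Python. This is only a sketch: `find_jni_functions` is a made-up helper, and the sample text is a hand-written stand-in, not actual decompiler output.

```python
import re

# JNI exports are named Java_<package>_<class>_<method>, joined by underscores.
JNI_NAME = re.compile(r"\bJava_com_[A-Za-z0-9_]+")

def find_jni_functions(source_text):
    """Return the sorted, de-duplicated JNI symbol names found in the text."""
    return sorted(set(JNI_NAME.findall(source_text)))

# Hand-written stand-in for a fragment of decompiled C.
sample = """
undefined8 Java_com_google_speech_recognizer_ResourceManager_nativeInitFromProto(void)
{ return 0; }
void helper(void) { Java_com_google_speech_recognizer_ResourceManager_nativeInitFromProto(); }
"""
names = find_jni_functions(sample)
for name in names:
    print(name)
```

For the real thing, read the decompiled .c file into a string and pass it to the same function.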

I'll try to decompile it to C++, but again, am not very optimistic that it will work well.

Maybe the better approach is to try to implement the preprocessing functions based on what their names suggest they do, but that's probably a bruteforce approach and could take a while.

biemster wrote 03/22/2019 at 09:18

The greco3 lib and ResourceManager was exactly what I was searching for too, since the ascii_proto does not give info what to feed the inputs of the networks. The audio preprocessing looks quite straightforward from the config file indeed, with 25 ms samples and 80 channels in the frequency domain between 125 and 7500 Hz. It seems that I should just feed the nets with the energies in these channels, and since the input of the first encoder is 240 in length, I should stack 3 frames from the filter bank? Something like that.
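
To make the stacking idea concrete, here is a minimal sketch that turns 80-channel frames into 240-long vectors by concatenating 3 consecutive frames. The stride of 3 (non-overlapping groups) and the per-element normalization are assumptions, not values taken from the config file, and the frame values are dummies.

```python
def stack_frames(frames, stack=3, stride=3):
    """Concatenate `stack` consecutive frames, advancing `stride` frames each step."""
    stacked = []
    for start in range(0, len(frames) - stack + 1, stride):
        vector = []
        for frame in frames[start:start + stack]:
            vector.extend(frame)
        stacked.append(vector)
    return stacked

def normalize(vector, mean, stddev):
    """Per-element (x - mean) / stddev; assumed use of the mean/stddev data."""
    return [(x - m) / s for x, m, s in zip(vector, mean, stddev)]

# Dummy filterbank output: 7 frames of 80 channel energies each.
frames = [[float(t)] * 80 for t in range(7)]
inputs = stack_frames(frames)
print(len(inputs), len(inputs[0]))

# Dummy statistics, only to show the shape of the normalization step.
scaled = normalize(inputs[0], [1.0] * 240, [2.0] * 240)
```

With 7 frames this yields 2 vectors of length 240; the leftover seventh frame is simply dropped in this sketch.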

After that it will probably be a lot of brute forcing indeed, since I don't know yet which network comes next and how. Should I follow the diagram from the blog, and feed the output of the enc0 to the joint, or should the output go to enc1 according to the dictation.config diagram?

Also, I have to figure out which model is the softmax layer. I guess that is the dec, but the blog post calls the Prediction Layer the decoder..

And then the output of the softmax layer has to be translated to characters. I have a suspicion that this is done using OpenFST, a finite state transducer. This package shows up in some git commits, and the abbreviation FST shows up a few times in this context in the config file.

Still some work ahead!

parameter.pollution wrote 03/24/2019 at 20:52

It took quite some time and trial&error, but I managed to also decompile the libintegrated_shared_object.so binary to C with the retdec decompiler (though only the 32 bit arm binary, not the original from my pixel which was 64 bit arm, because retdec only supports 32bit arm right now).

I think it's a little bit easier to read, but didn't have time yet to take a closer look if we can find the infos we are looking for in there.

(uploaded as "libintegrated_shared_object.so_retdec-decompiled.c" to the same folder as the other files: https://mega.nz/#F!D5p1AQyQ!ZPpKdTpooHYNE3gl2EHJQA )
