The workflow will be as follows:
- Find the trained models (DONE)
- Figure out how to import the model in TensorFlow (in progress; a first probing sketch follows this list)
- (optional) export to lwtnn
- Write lightweight application for dictation
- (stretch goal) if importing to TensorFlow Lite is successful, try to get it to work on those cool new RISC-V K210 boards, which can be had with a 6-mic array included for around $20!
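To make progress on the import step, a natural first move is to probe the extracted files for their input and output signatures. Here is a minimal sketch, assuming the blobs turn out to be standard TensorFlow Lite flatbuffers; the file name is hypothetical until the format is confirmed:

```python
# Minimal probing sketch: assumes the extracted blob is a TensorFlow Lite
# flatbuffer; "encoder.tflite" is a hypothetical file name.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="encoder.tflite")
interpreter.allocate_tensors()

# Dump tensor names, shapes and dtypes to recover the expected input
# features and output dimensions of the model.
for detail in interpreter.get_input_details():
    print("input: ", detail["name"], detail["shape"], detail["dtype"])
for detail in interpreter.get_output_details():
    print("output:", detail["name"], detail["shape"], detail["dtype"])
```

If the files turn out to be frozen TensorFlow graphs instead, the same information can be recovered by parsing them as a GraphDef and listing the node names.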
Finding the trained models was done by reverse engineering the GBoard app using apktool.
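For reference, that step essentially amounts to decoding the APK and searching the output for references to the models; the file names and search term in this sketch are illustrative:

```sh
# Decode the APK into readable resources (APK file name is illustrative)
apktool d gboard.apk -o gboard_decoded

# Search the decoded tree for references to the speech models
grep -ri "rnnt" gboard_decoded
```

Further analysis of the app is necessary to find the right parameters for the models, but the initial blog post also provides some useful info: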
> Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y_{u-1}, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer; the Encoder Network comprises 8 such layers. Image credit: Chris Thornton
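Translated into code, the sizes in that caption suggest roughly the following structure. This is a hedged sketch only: the layer counts and dimensions come from the caption, while the vocabulary size is a placeholder and the per-layer Dense projection merely approximates the projected-LSTM cells of the paper, which plain Keras LSTMs do not implement.

```python
# Structural sketch of the RNN-T described in the caption above.
# Layer sizes come from the caption; the vocabulary size is a guess,
# and Dense projections only approximate projected-LSTM cells.
import tensorflow as tf

UNITS = 2048   # LSTM units per layer (from the caption)
PROJ = 640     # projection dimension (from the caption)
VOCAB = 128    # output symbol count: a placeholder, not from the source

def lstm_stack(num_layers: int) -> tf.keras.Sequential:
    """Stack of LSTM layers, each followed by a 640-dim projection."""
    layers = []
    for _ in range(num_layers):
        layers.append(tf.keras.layers.LSTM(UNITS, return_sequences=True))
        layers.append(tf.keras.layers.Dense(PROJ))
    return tf.keras.Sequential(layers)

encoder = lstm_stack(8)     # Encoder network: 8 layers
prediction = lstm_stack(2)  # Prediction network: 2 layers

# Joint network: a feedforward layer over the combined encoder and
# prediction outputs, ending in a softmax over the output symbols.
joint = tf.keras.Sequential([
    tf.keras.layers.Dense(PROJ, activation="tanh"),
    tf.keras.layers.Dense(VOCAB, activation="softmax"),
])
```

Comparing these shapes against whatever the probing script reports for the extracted files should help confirm which blob is which network.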