My focus at the moment is on the endpointer, because I can brute-force its signal-processing parameters much faster than I can with the complete dictation graph. I added an endpointer.py script to the GitHub repo, which should initialize it properly. As a guide I'm using a research paper that I believe describes the endpointer used in the models, so I switched from the plain power spectrum I used before to log-Mel filterbank energies.
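For anyone following along, here is a rough sketch of what computing log-Mel filterbank energies from raw audio looks like with plain NumPy. This is not the code from endpointer.py, and every parameter here is my assumption for illustration (16 kHz audio, 25 ms frames, 10 ms hop, 512-point FFT, 40 triangular filters, a Hann window, a 1e-6 floor before the log); the real model may well use different values, which is exactly what I'm brute-forcing.

```python
import numpy as np


def hz_to_mel(hz):
    # Common HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular filters with edges equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            num_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)
    bank = np.zeros((num_filters, bin_freqs.size))
    for i in range(num_filters):
        lo, mid, hi = hz_edges[i:i + 3]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Triangle = min of the two slopes, clipped at zero outside [lo, hi].
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_energies(signal, sample_rate=16000, frame_len=400, hop=160,
                     fft_size=512, num_filters=40):
    """Return an array of shape (num_frames, num_filters)."""
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    # Power spectrum per frame (what I fed to the net before).
    power = np.abs(np.fft.rfft(frames, n=fft_size, axis=1)) ** 2
    # Pool the power spectrum into mel bands, then take the log.
    mel = power @ mel_filterbank(num_filters, fft_size, sample_rate).T
    return np.log(mel + 1e-6)  # small floor avoids log(0)
```

Calling `log_mel_energies(samples)` on a float array of 16 kHz samples gives one 40-dimensional feature vector per 10 ms hop. The only difference from the earlier power-spectrum features is the last two steps: pooling into mel bands and taking the log.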
I believe the endpointer net outputs two probabilities, p(speech) and p(non-speech), as shown in this diagram from the paper:
The results from endpointer.py are still a bit underwhelming:
so some more experiments are needed. I'll update this log when there are more endpointer results.