The github repo is updated with the first full model test. This test just tries to run the RNNs with a sample wav file input.
What this experiment does is the following:
- Split the incoming audio in 25 ms segments, with a stepsize of 10ms (so the input buffers overlap). Compute the FFT to calculate the energies in 80 frequency bins between 125 and 7500 Hz. The above values are taken from the dictation ascii_proto.
- Average those 80 channels to 40 channels to feed the EndPointer model. This model should decide if the end of a symbol is reached in the speech, and signal the rest of the RNNs to work their magic. Just print the output of the endpointer, since I don't know how to interpret the results.
- Feed the 80 channels to a stacker for the first encoder (enc0). This encoder takes 3 frames stacked as input, resulting in an input tensor of length 240.
- The output of the first encoder goes to a second stacker, since the input of the second encoder (enc1) is twice the length of the output of the first.
- The output of the second encoder goes to the joint network. This joint has two inputs of length 640, one of which is looped from the decoder. At first iteration a dummy input from the decoder is used, and the values from the second encoder are the second input.
- The output of the joint is fed to the decoder, which produces the final result of the model. This model is fed back into the joint network for the next iteration, and should go to the next stage of the recognizer (probably FST?)
In my initial runs the decoder outputs just NaNs, which is highly disappointing :(.
When I feed both the first and second encoder with random values, the output of the decoder is actually proper values, so my first guess is that the fft energies are not calculated correctly. That will be my focus for now, in combination with the endpointer. My next experiments will search for the correct feeding of the endpointer, so it gives sensible values at points in the audio sample where symbols should be produced.
Make it so!
*I just realize that that should be my test sample.wav*
UPDATE: the nan issue in the decoder output was easily solved by making sure no NaN values loop back from the decoder into the joint, so the joint is next iteration fed with proper values.