Recovering the symbol table

The last missing piece was the symbol table. Apparently the output of the softmax layer maps directly onto the symbol table, and since that output has length 128, I went looking for a table of the same length.
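What "maps directly" means here can be sketched in a few lines of Python: the index of the largest softmax value is used as an index into the symbol table. This is only an illustration under that assumption. The names `best_symbol` and `id_to_symbol` are made up, and the three-entry table is a toy stand-in for the real one, which would need 128 entries.

```python
def best_symbol(probs, id_to_symbol):
    """Return the symbol whose index has the highest probability."""
    if len(probs) != len(id_to_symbol):
        raise ValueError("softmax output and symbol table differ in length")
    # argmax over the probabilities, written without numpy
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return id_to_symbol[best_id]

# Hypothetical toy table; the real one would have 128 entries.
id_to_symbol = ["a", "b", "<space>"]
frame = [0.1, 0.7, 0.2]  # made-up softmax output for a single frame
print(best_symbol(frame, id_to_symbol))
```

The length check is the same reasoning as in the text: a table that does not match the softmax length cannot be the right one.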

The prime suspect was of course the 'syms' binary, but I could not seem to open it. The blog post mentions FST, so I started my investigation with OpenFST. Its Python wrapper could open the file and returned a sensible name for it, but when I queried the keys, it returned 0.

In a hex editor I had already noticed a funny sentence somewhere in the body: "We love Marisa." Initially I thought a developer had put it there as padding, and took no notice of it. However, it is actually the file header used by the library libmarisa, where MARISA is an acronym for Matching Algorithm with Recursively Implemented StorAge.
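The same kind of discovery can be made without a hex editor by listing runs of printable ASCII in the file, much like the Unix `strings` tool does. The sketch below is an assumption about how one might do that. The function name `printable_strings` and the sample bytes are invented for illustration and are not the real contents of 'syms'.

```python
import re

def printable_strings(data, min_len=8):
    """Yield (offset, text) for runs of printable ASCII in a bytes object."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for m in pattern.finditer(data):
        yield m.start(), m.group().decode("ascii")

# Made-up blob standing in for the contents of the 'syms' file.
blob = b"\x00\x01\x02We love Marisa.\x00\x00\x08\x00"
for offset, text in printable_strings(blob):
    print(offset, text)
```

The reported offset is where the embedded file would start if the string is indeed its header.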

Extracting just this part of the binary is easy:

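import sys

MAGIC = b"We love Marisa."  # header of the embedded marisa file

fname = sys.argv[1]
with open(fname, 'rb') as f:
    data = f.read()

# Search with a bytes literal: in Python 3 a str never equals a bytes slice
offset = data.find(MAGIC)
if offset < 0:
    sys.exit("marisa header not found")
# Write everything from the header onwards to <fname>.marisa
with open(fname + '.marisa', 'wb') as f:
    f.write(data[offset:])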

The resulting file syms.marisa can be read with the Python package marisa-trie, which presented me with a nice symbol table:

>>> import marisa_trie
>>> trie = marisa_trie.Trie()
>>> trie.load('syms.marisa')
<marisa_trie.Trie object at 0x7f5ad44330b0>

>>> trie.items()
[(u'{', 0),
(u'{end-quotation-mark}', 122),
(u'{end-quote}', 123),
(u'{exclamation-mark}', 124),
(u'{exclamation-point}', 125),
(u'{quotation-mark}', 126),
(u'{quote}', 127),
(u'{question-mark}', 103),
(u'{sad-face}', 104),
(u'{semicolon}', 105),
(u'{smiley-face}', 106),
(u'{colon}', 107),
(u'{comma}', 108),
(u'{dash}', 109),
(u'{dot}', 110),
(u'{forward-slash}', 111),
(u'{full-stop}', 112),
(u'{hashtag}', 113),
(u'{hyphen}', 114),
(u'{open-quotation-mark}', 115),
(u'{open-quote}', 116),
(u'{period}', 117),
(u'{point}', 118),
(u'{apostrophe}', 94),
(u'{left-bracket}', 95),
(u'{right-bracket}', 96),
(u'{underscore}', 97),
(u'<', 1),
(u'<s>', 119),
(u'<sorw>', 120),
(u'<space>', 121),
(u'</s>', 98),
(u'<epsilon>', 99),
(u'<noise>', 100),
(u'<text_only>', 101),
(u'<unused_epsilon>', 102),
(u'!', 2),
(u'"', 3),
(u'#', 4),
(u'$', 5),
(u'%', 6),
(u'&', 7),
(u"'", 8),
(u'(', 9),
(u')', 10),
(u'*', 11),
(u'+', 12),
(u',', 13),
(u'-', 14),
(u'.', 15),
(u'/', 16),
(u'0', 17),
(u'1', 18),
(u'2', 19),
(u'3', 20),
(u'4', 21),
(u'5', 22),
(u'6', 23),
(u'7', 24),
(u'8', 25),
(u'9', 26),
(u':', 27),
(u';', 28),
(u'=', 29),
(u'>', 30),
(u'?', 31),
(u'@', 32),
(u'A', 33),
(u'B', 34),
(u'C', 35),
(u'D', 36),
(u'E', 37),
(u'F', 38),
(u'G', 39),
(u'H', 40),
(u'I', 41),
(u'J', 42),
(u'K', 43),
(u'L', 44),
(u'M', 45),
(u'N', 46),
(u'O', 47),
(u'P', 48),
(u'Q', 49),
(u'R', 50),
(u'S', 51),
(u'T', 52),
(u'U', 53),
(u'V', 54),
(u'W', 55),
(u'X', 56),
(u'Y', 57),
(u'Z', 58),
(u'[', 59),
(u'\\', 60),
(u']', 61),
(u'^', 62),
(u'_', 63),
(u'`', 64),
(u'a', 65),
(u'b', 66),
(u'c', 67),
(u'd', 68),
(u'e', 69),
(u'f', 70),
(u'g', 71),
(u'h', 72),
(u'i', 73),
(u'j', 74),
(u'k', 75),
(u'l', 76),
(u'm', 77),
(u'n', 78),
(u'o', 79),
(u'p', 80),
(u'q', 81),
(u'r', 82),
(u's', 83),
(u't', 84),
(u'u', 85),
(u'v', 86),
(u'w', 87),
(u'x', 88),
(u'y', 89),
(u'z', 90),
(u'|', 91),
(u'}', 92),
(u'~', 93)]
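The listing has 128 entries with ids running from 0 to 127, which matches the length of the softmax output. To use it for decoding, the (symbol, id) pairs have to be turned into a list indexed by id. Below is a minimal sketch of that step, with a check that no id is missing or used twice. The helper name `invert_symbol_table` is my own, and the example runs on a toy table of size 3 so that it works without the real file.

```python
def invert_symbol_table(items, size=128):
    """Turn (symbol, id) pairs into a list indexed by id, checking coverage."""
    table = [None] * size
    for symbol, idx in items:
        if not 0 <= idx < size:
            raise ValueError("id %d out of range" % idx)
        if table[idx] is not None:
            raise ValueError("duplicate id %d" % idx)
        table[idx] = symbol
    missing = [i for i, s in enumerate(table) if s is None]
    if missing:
        raise ValueError("ids without a symbol: %r" % missing)
    return table

# Toy example with size=3; with the real data the call would be
# invert_symbol_table(trie.items()).
toy = invert_symbol_table([("b", 1), ("<space>", 2), ("a", 0)], size=3)
print(toy)
```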

I believe I have now recovered all the main components, so what is left is working out by trial and error how to present the data to the individual models. Earlier I overlooked that the decoder should be initialised with a <sos> (start of sequence) symbol when an utterance starts, so I think I should get a proper endpointer implementation and experiment to find out which symbol acts as this <sos> (I could only find <sorw>, <s> and </s>).
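As a starting point for that experiment, the candidates can be pulled out of the table mechanically by selecting the symbols wrapped in angle brackets. This is only a sketch of the selection step. The function name `start_symbol_candidates` is hypothetical, the input is a hand-copied subset of the table listed above, and which of these (if any) the decoder accepts as <sos> remains to be tested.

```python
def start_symbol_candidates(items):
    """Return (symbol, id) pairs for the angle-bracket control symbols."""
    return sorted(
        (sym, idx) for sym, idx in items
        if len(sym) > 2 and sym.startswith("<") and sym.endswith(">")
    )

# Subset of the symbol table listed above.
items = [("<", 1), ("<s>", 119), ("<sorw>", 120), ("<space>", 121),
         ("</s>", 98), ("<epsilon>", 99), ("a", 65)]
for sym, idx in start_symbol_candidates(items):
    print(idx, sym)
```

Each id printed here could then be fed to the decoder as its initial symbol to see which one produces sensible output.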

