Text-to-Speech Rulez!

A project log for Look Who's Talking 0256

A BluePill Driver/Simulator/Emulator for the GI SP0256-AL2

ziggurat29 | 05/12/2020 at 20:18


For today's goose-chase, I am porting over the text-to-speech rules.  Some effort was put forth towards reducing their flash footprint.


Having realized the primary impetus of the project, I'm faced with several other directions to take it next.  Semi-arbitrarily, I decided to try getting text-to-speech capability in place.  As mentioned in a previous post, I have some old TTS code which I ported to Python for a sanity check, and now I am porting it to C for inclusion on the BluePill.  The first step is just encoding the rules as static data to be burned into flash.  Transcoding the rules took a little over a day of mundane reformatting and some considerations of how to work with the C language itself, e.g. there's a bunch of variable-length arrays -- in the other languages the length is an intrinsic property of the array object, but that's not the case in C.  For strings, there is the implicit NUL-terminator, but that is not the case for any other array.  Eventually, I worked out some macros that exploit string-merging to fake it enough to have a result that looks manageable.
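To make the "fake it with macros" idea concrete, here is a minimal sketch of one way it could be done. It is not necessarily the macro set used in the project: the PHONSEQ macro is hypothetical, and the phoneme code values are my reading of the SP0256-AL2 allophone table, so treat them as illustrative. The trick is that adjacent string literals merge into one array, and sizeof on that merged literal gives the length at compile time, even when a code of zero sits in the middle.

```c
#include <stddef.h>

/* Phoneme codes as one-character string literals.
   Assumption: values as I understand the SP0256-AL2 allophone table. */
#define PA1 "\x00"
#define EY  "\x14"
#define AE  "\x1a"

typedef struct PhonSeq {
    const char* _phone;
    size_t _len;
} PhonSeq;

/* Hypothetical helper: adjacent literals merge into one array, and sizeof
   counts the implicit terminator, so subtract 1 to get the payload length. */
#define PHONSEQ(lits) { lits, sizeof(lits) - 1 }

static const PhonSeq seq_one   = PHONSEQ(EY);          /* 1 code  */
static const PhonSeq seq_three = PHONSEQ(AE PA1 EY);   /* 3 codes */
```

Because PA1 is a zero byte, strlen() would stop early on seq_three; carrying an explicit length alongside the pointer avoids that.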

Straightforward inclusion of the rules as C-defined structures shows that they take 19,300 bytes of flash.  This is too much.  When I had originally written this code (and by 'written' I mean 'ported some existing work and extended'; credits in the source), it was for a platform called 'dotNet Micro Framework'.  It was somewhat interesting, but it lacked a lot of const-friendliness, and tended to put things in live objects (i.e. in RAM) no matter how many 'readonly' qualifiers you would apply.  So in that case I pre-processed the rules into an alternative form that would cause the compiler to leave almost all the stuff in flash.  On that platform, I had an abundance of flash (and comparatively an abundance of RAM, too) relative to here.  Those transformations are not meaningful here, but I wanted to see if a similar compactification could reduce the footprint.  The gist would then be that the desktop app would hold the 'master' copy of the text-to-speech rules, encoded in C structs/arrays in a straightforward way, and then they would be pre-processed into the compact form for embedded.  That way the rules can continue to be developed and maintained in a sane way, albeit with the additional pre-processing step.

First I did some basic statistics including raw counts and distinct counts:

Rules: 706
strs: 2118, bins: 706
dstrs: 484, dbins: 400

 So, 706 rules, 2118 strings (the various 'contexts') and 706 phoneme sequences.  Of the 2118 strings 484 were distinct, and of the 706 phoneme sequences 400 were distinct.  This makes it seem that the strings could be reduced to about 25% of their original number, but really that is just a count.  The devil is in the details.  Truthfully, a lot of the strings are for exception cases, and these tend to be longer.  So deduping short strings might not really squeeze that much.  Having the program tabulate the lengths showed:

Rules: 706
strs: 2118, bins: 706
dstrs: 484, dbins: 400
strlen: 2783, binlen: 1688
dstrlen: 1549, dbinlen: 1246

 So, 1549/2783 is really a reduction of about 44% rather than the hoped-for 75%.  But that's still an improvement.  A similar story is told for the phoneme binaries, at 26% rather than 43%.  But it occurred to me that this is not considering the nul-terminators, so I reworked it:

Rules: 706
strs: 2118, bins: 706
dstrs: 484, dbins: 400
strlen: 4901, binlen: 2394
dstrlen: 2033, dbinlen: 1646
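Tallies like the ones above can be produced with very little code. The following is a sketch, not the program that generated these numbers: the Tally struct and tally_strings function are names I made up, and it handles only the NUL-terminated context strings (the phoneme sequences can contain zero bytes, so they would need an explicit length rather than strlen).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result record: raw and distinct counts, raw and distinct
   byte totals (terminators included, as in the reworked figures). */
typedef struct Tally {
    size_t count, distinct;
    size_t len, distinct_len;
} Tally;

/* Quadratic scan for duplicates; adequate for a few thousand strings
   on the desktop. */
static Tally tally_strings(const char* const* strs, size_t n)
{
    Tally t = { 0, 0, 0, 0 };
    for (size_t i = 0; i < n; ++i) {
        size_t cost = strlen(strs[i]) + 1;      /* +1 for the terminator */
        int seen = 0;
        for (size_t j = 0; j < i && !seen; ++j)
            seen = (strcmp(strs[i], strs[j]) == 0);
        t.count += 1;
        t.len += cost;
        if (!seen) {
            t.distinct += 1;
            t.distinct_len += cost;
        }
    }
    return t;
}
```

Dropping the "+ 1" gives the earlier figures that ignore terminators.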

 Here, the space reduction is better (58% vs 44%, and 31% vs 26%), but wow! the size taking into consideration nul-terminators really added some overhead!  That's what a bunch of single-character strings/bins will do.  But another tale is to be told:  even disregarding de-duping, the total of strings and binaries is 4901+2394 = 7295.  But comparing the flash size before and after including the unmodified C ruleset showed just over 19,000 bytes.  So where did the other 12,000 or so bytes go?  Well, it's in pointers and padding.  The rule structure is straightforwardly defined like this:

//structures involved:

typedef struct PhonSeq {
    const char* _phone;
    size_t _len;
} PhonSeq;

typedef struct TTSRule {
    const char* _left;
    const char* _bracket;
    const char* _right;
    PhonSeq _phone;
} TTSRule;

//example rule:
const TTSRule r_a[] = {
    { "^^^", "a", "", { EY, 1 } },
    { "^.", "a", "^e", { EY, 1 } },
    { "^.", "a", "^i", { EY, 1 } },
    { "^^", "a", "", { AE, 1 } },
    { "^", "a", "^##", { EY, 1 } },
    //... (remaining rules for this group, ending with the all-zero sentinel rule)
};

Then what actually gets created is a contiguous array of TTSRule structs that look like this:

{ & "^^^", & "a", & "", { & "EY", 1 } },

So, the entries in the array are pointers to nul-terminated strings that are elsewhere -- more overhead.  As a quick calculation, if you take the 19,300 bytes known consumption, minus the 7295 expected consumption, you get 12,005 bytes, and dividing that by the 706 rules is just over 17 additional bytes per rule.  Since pointers are 32-bits on this platform, the 4 pointers in the rule would be 16 bytes, which jibes with the 17 bytes of the quick calculation.
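The quick calculation can be written down as compile-time constants, which makes it easy to redo when the ruleset changes. The names are mine; the numbers are the ones quoted above. One caveat worth keeping in mind: on a 32-bit target the size_t length field is presumably another 4 bytes per rule on top of the four pointers, and the toolchain may merge identical literals, so the split between "strings" and "structure" is only approximate.

```c
/* Figures taken from the text; identifiers are hypothetical. */
enum {
    FLASH_USED    = 19300,  /* measured flash growth from the ruleset  */
    STRING_BYTES  = 4901,   /* context strings incl. terminators       */
    BIN_BYTES     = 2394,   /* phoneme sequences incl. terminators     */
    RULE_COUNT    = 706,

    EXPECTED      = STRING_BYTES + BIN_BYTES,   /* 7295  */
    UNEXPLAINED   = FLASH_USED - EXPECTED,      /* 12005 */
    PER_RULE      = UNEXPLAINED / RULE_COUNT,   /* 17 (integer division) */

    POINTER_BYTES = 4 * 4   /* four 32-bit pointers per rule */
};
```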

So, if we instead were to pack all our de-duped strings into a contiguous blob (ostensibly 2033 + 1646 = 3679 bytes), and were to use 16-bit indexes (instead of 32-bit pointers) to reference into this blob, then we should have an overhead of 706 * 4 * 2 = 5,648 bytes.  So 3679 + 5,648 = 9,327.  That's less than 19,300!  It's still a bit disappointing since the majority of the ruleset blob is still structural overhead in the way of indices to data components.  Due to the statistical nature of the indices, they might be ripe for entropy encoding, though I'm not going to bother with that just now.
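As a sketch of what such a compact rule might look like (the CompactRule name and its field names are hypothetical, and how the phoneme sequence length is recovered from the blob is left open here):

```c
#include <stdint.h>

/* Hypothetical compact rule: four 16-bit offsets into one shared blob
   instead of four 32-bit pointers. */
typedef struct CompactRule {
    uint16_t left;      /* offset of the left context string  */
    uint16_t bracket;   /* offset of the bracket string       */
    uint16_t right;     /* offset of the right context string */
    uint16_t phone;     /* offset of the phoneme sequence     */
} CompactRule;

/* Size estimate using the figures from the text. */
enum {
    RULE_COUNT  = 706,
    BLOB_BYTES  = 2033 + 1646,                  /* 3679 */
    INDEX_BYTES = RULE_COUNT * 4 * 2,           /* 5648 */
    TOTAL_BYTES = BLOB_BYTES + INDEX_BYTES      /* 9327 */
};

/* Every offset must be representable in 16 bits. */
_Static_assert(BLOB_BYTES <= 65535, "blob too large for 16-bit offsets");

/* Resolve an offset back to a NUL-terminated context string. */
static const char* blob_str(const uint8_t* blob, uint16_t offset)
{
    return (const char*)(blob + offset);
}
```

With four uint16_t members there is no padding, so each rule is 8 bytes.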

The real results will be a little bit different than that exact number.  One thing is that for performance I segregate the rules into groups based on the initial character of the 'bracket' context.  That adds the overhead of another array of pointers; however, this is small in that there are only 27 elements, which is just 108 bytes.  One additional overhead is determining the lengths of the rule groups, which presently involves a terminal 'sentinel' value.  This is a full rule of zeros.  If instead of holding pointers to the rule groups I concatenate all rule groups together and just keep offsets to the starts of the rule groups, then I can reduce the size of the pointers and also eliminate the sentinels.  The size of a rule group is then calculated as the difference between the offsets of adjacent entries.  Something has to be done for the last entry (since there is no subsequent entry), so I add a 'dummy' entry which is the offset to where the next rule group would start if there were one.  That means that there is an additional (27+1) * sizeof(uint16_t) = 56 bytes, bringing the grand total to 9,383 bytes.  So, about half of what was originally there.  I'd like to squeeze it more, but I'd also like to get coding, so I'm running with this for now to see if it is good enough.
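The offset table with its extra trailing entry can be sketched as follows. The identifiers are hypothetical, and I am assuming the offsets count rules (indices into the concatenated rule array) rather than bytes.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    GROUP_COUNT       = 27,
    /* one extra 'dummy' entry marks where a further group would begin */
    GROUP_TABLE_BYTES = (GROUP_COUNT + 1) * sizeof(uint16_t)    /* 56 */
};

/* group_start[g] is the index of the first rule in group g, and
   group_start[g + 1] is where the next group begins, so the difference
   is the number of rules in group g.  No sentinel rule is needed. */
static size_t group_size(const uint16_t* group_start, size_t g)
{
    return (size_t)(group_start[g + 1] - group_start[g]);
}
```

An empty group simply has two equal adjacent entries, so it costs nothing beyond its slot in the table.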

Since I need to preprocess the rules into this compact form, I made a separate C++ application for that.  I will also have it implement the text-to-speech code against the compactified ruleset (which it will recompute each time it runs) so it can serve as a unit test of that code.
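The core of such a preprocessor is an 'intern' step: append a string to the blob only if it is not already there, and hand back its offset either way. The sketch below is written in C to match the snippets above rather than in C++, uses names I invented (Blob, blob_intern), and covers only NUL-terminated context strings; it is meant to illustrate the idea, not to reproduce the actual tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical blob under construction on the desktop side. */
typedef struct Blob {
    uint8_t data[65536];    /* capacity bounded by 16-bit offsets */
    size_t used;
} Blob;

/* Returns the offset of s within the blob, appending it (with its
   terminator) if it is not already present.  Returns -1 if full. */
static long blob_intern(Blob* b, const char* s)
{
    size_t need = strlen(s) + 1;

    /* Walk the existing entries, which are packed back to back. */
    for (size_t p = 0; p < b->used;
         p += strlen((const char*)b->data + p) + 1) {
        if (strcmp((const char*)b->data + p, s) == 0)
            return (long)p;
    }

    if (b->used + need > sizeof b->data)
        return -1;

    memcpy(b->data + b->used, s, need);
    b->used += need;
    return (long)(b->used - need);
}
```

Each rule's three contexts would be run through this in turn, with the returned offsets stored as the rule's 16-bit indices.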

