Summary

With the 'compact' form of the rules in hand, it is time to use them.

Deets

I ported the code that processes the rules into C. This was a bit more trouble than I anticipated because the Python version leans on some conveniences of that environment -- especially dynamically sized arrays and string concatenation. Since this code is going to be running in an embedded environment, I wanted to avoid copying to temporary and dynamically allocated buffers as much as possible, and instead process directly out of whatever buffers or constant definitions are supplied. Additionally, there was a hack in the original rules that required a space to be prepended and appended to the word. This hack allowed using the space as a meta-character for 'Nothing', which indicated that a context pattern needed to be at the very beginning or end of the text. I wound up creating a separate meta-character for that, '$', and updated all the rules accordingly. That addition caused me to generate a new distinct string, so I incurred a two-byte penalty, bringing the compactified rules to 9385 bytes.
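To illustrate the '$' meta-character, here is a minimal sketch of how a single context-pattern character could be tested against a word in place, without padding the word with spaces. The function name ctxCharMatches() and its parameters are hypothetical and are not taken from the project's code; the real matcher handles full patterns and other meta-characters.

```c
#include <stdbool.h>

/* Hypothetical helper (not the project's API): does one context-pattern
   character match the text at index nIdx?  nIdx may be -1 (just before the
   word) or nLen (just after it), so no padded copy of the word is needed. */
static bool ctxCharMatches ( char chPat, const char* pchWord, int nLen, int nIdx )
{
    bool bOutside = ( nIdx < 0 || nIdx >= nLen );
    if ( '$' == chPat )     /* 'Nothing': matches only off either end */
    {
        return bOutside;
    }
    if ( bOutside )         /* a literal can never match outside the word */
    {
        return false;
    }
    return pchWord[nIdx] == chPat;
}
```

The point of the design is that the word can stay in a constant buffer: the boundary condition is derived from the index instead of from a sentinel character written into a copy.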

Incrementally building the code shows these numbers for flash usage:

40816 baseline
50208 rules included; delta = 9392
51908 tts code; delta = 1700
51964 simple test code to use TTS to translate a sentence; delta = 56

So this is not too bad; about 1.7 KB for the actual code, and the simple test (which is fairly representative of how it would be used in practice) is quite small at 56 bytes.

This means that there is about 12 KB more flash for code growth before the next crisis. I think this might be OK for the remaining stuff I have planned. I've got a little more than 7 KB RAM left, and I think this will be enough, too, to finish things up.
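For anyone who wants to re-run the arithmetic, the following is a small sketch that derives the total cost of the TTS feature from the build figures listed above. The enum and function names are made up for illustration, and the capacity passed to flashHeadroom() is an assumption supplied by the caller (for example the nominal part size, which ignores any flash set aside for other purposes).

```c
#include <stdint.h>

/* Flash usage in bytes at each build step, as reported in the list above. */
enum
{
    FLASH_BASELINE   = 40816,
    FLASH_WITH_RULES = 50208,
    FLASH_WITH_TTS   = 51908,
    FLASH_WITH_TEST  = 51964
};

/* Total cost of the whole TTS feature: rules + code + test. */
static int32_t ttsTotalCost ( void )
{
    return FLASH_WITH_TEST - FLASH_BASELINE;
}

/* Bytes left below a given capacity.
   ASSUMPTION: the caller decides how much flash is actually usable. */
static int32_t flashHeadroom ( int32_t nCapacity )
{
    return nCapacity - FLASH_WITH_TEST;
}
```

Calling flashHeadroom ( 64 * 1024 ) models the nominal part size, and flashHeadroom ( 128 * 1024 ) models the larger flash discussed further down.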

The simple test code:

static const char achGettysburg[] = 
"four score and seven years ago our fathers brought forth on this continent \
a new nation, conceived in liberty, and dedicated to the proposition that all \
men are created equal.";

const char* pszText = achGettysburg;
int nTextLen = COUNTOF(achGettysburg);

//quicky test running through text
const char* pchWordStart, * pchWordEnd;
int eCvt;
while ( 0 == ( eCvt = pluckWord ( pszText, nTextLen, 
        &pchWordStart, &pchWordEnd ) ) )
{
    int nWordLen = pchWordEnd - pchWordStart;

    static uint8_t sl_abyPhon[64];    //semi-arbitrarily sized long word

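    int nProduced = ttsWord(pchWordStart, nWordLen,
            g_abyTTS, sl_abyPhon, COUNTOF(sl_abyPhon) );
    if ( nProduced < 0 )    //translation failed (e.g. buffer too small); skip this word
    {
        nProduced = 0;
    }
    //stick on a space between words if there is not already a pause
    //(and only if there is room left for the two pause codes)
    if ( nProduced > 0 && nProduced + 2 <= (int)COUNTOF(sl_abyPhon)
            && sl_abyPhon[nProduced-1] > 4 )    //all pauses are code 0 - 4
    {
        sl_abyPhon[nProduced++] = '\x03';
        sl_abyPhon[nProduced++] = '\x02';
    }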

    size_t nIdxPhon = 0;
    size_t nRemaining = nProduced;
    while ( nRemaining > 0 )
    {
        size_t nConsumed = SP0256_push ( &sl_abyPhon[nIdxPhon], nRemaining );
        nRemaining -= nConsumed;
        nIdxPhon += nConsumed;
        if ( 0 != nRemaining )
        {
            osDelay ( 200 );    //sleep a little to let the synth catch up
        }
    }

    //advance
    nTextLen -= pchWordEnd - pszText;
    pszText = pchWordEnd;
}

So the gist of using it is to crack the text word-by-word (there is a convenience function pluckWord() provided for this), and then for each word 'plucked' from the buffer, push it into ttsWord() to translate it into a phoneme sequence. You can then send this sequence off to the SP0256 task (or whatever).
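As a rough idea of what the word-cracking step involves, here is a simplified sketch with the same calling shape as the loop above. It is not the project's pluckWord(): the names sketchPluckWord() and isBreak() are hypothetical, and this version treats only whitespace and NUL as word breaks, whereas the real function has its own handling of punctuation. It reports failure when the buffer ends before a break character is seen, on the assumption that the rest of the word may not have arrived yet.

```c
#include <stddef.h>

/* Hypothetical break test: whitespace or NUL ends a word. */
static int isBreak ( char ch )
{
    return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n' || ch == '\0';
}

/* Returns 0 and sets *ppchStart / *ppchEnd (end is exclusive) when a complete
   word is found; returns -1 when no complete word is available. The text is
   only read, never copied or modified. */
static int sketchPluckWord ( const char* pchText, int nLen,
        const char** ppchStart, const char** ppchEnd )
{
    int nIdx = 0;
    while ( nIdx < nLen && isBreak ( pchText[nIdx] ) )      /* skip leading breaks */
    {
        ++nIdx;
    }
    int nStart = nIdx;
    while ( nIdx < nLen && !isBreak ( pchText[nIdx] ) )     /* scan the word itself */
    {
        ++nIdx;
    }
    if ( nIdx >= nLen )
    {
        /* ran off the end: either nothing is left, or the word has no
           terminating break yet and might still be growing */
        return -1;
    }
    *ppchStart = &pchText[nStart];
    *ppchEnd = &pchText[nIdx];
    return 0;
}
```

Because the end pointer stops on the break character, the caller can advance exactly as the test loop does: subtract the consumed length and restart from the end pointer.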

I added some debug code to make it send each plucked word and its text-to-speeched phoneme sequence to the serial port for debugging. E.g. for the first sentence of the Gettysburg Address:

four    28 35 33 03 02      
    FF OW ER2 PA4 PA3
score   37 08 35 33 03 02   
    SS KK3 OW ER2
and     1a 0b 15 03 02      
    AE NN1 DD1
seven   37 07 23 07 0b 03 02    
    SS EH VV EH NN1
years   0c 13 33 2b 03 02   
    IH IY ER2 ZZ
ago     1a 3d 35 03 02      
    AE GG2 OW
our     20 33 03 02         
    AW ER2
fathers 28 1a 36 01 34 2b 03 02     
    FF AE DH2 PA2 ER2 ZZ
brought 1c 27 17 0d 03 02   
    BB1 RR2 AO TT2
forth   28 17 17 33 1d 03 02    
    FF AO AO ER2 TH
on      17 0b 03 02         
    AO NN1
this    36 0c 0c 37 37 03 02    
    DH2 IH IH SS SS
continent   08 18 0b 0d 06 0b 07 0b 0d 03 02    
    KK3 AA NN1 TT2 AY NN1 EH NN1 TT2
a       07 14 03 02         
    EH EY
new     0b 1f 03 02         
    NN1 UW2
nation, 0b 14 00 25 0e 0b 04    
    NN1 EY PA1 SH RR1 NN1 PA5
conceived   08 18 0b 37 13 23 07 15 03 02   
   KK3 AA NN1 SS IY VV DD1
in      0c 0c 0b 03 02      
    IH IH NN1
liberty,    2d 0c 3f 34 0d 0c 04    
   LL IH BB2 ER2 TT2 IH PA5
and     1a 0b 15 03 02      
    AE NN1 DD1
dedicated   21 0c 21 0c 2a 1a 1a 00 0d 0c 15 03 02  
    DD2 IH DD2 IH KK1 AE AE PA1 TT2 IH DD1 
to      0d 1f 03 02         
    TT2 UW2
the     12 13 03 02         
    UW2 IY 
proposition 09 27 0e 0e 09 0e 2b 0c 00 25 0e 0b 03 02   
    PP RR2 RR1 RR1 PP RR1 ZZ IH PA1 SH RR1 NN1
that    36 1a 0d 03 02      
    DH2 AE TT2
all     17 2d 03 02         
    AO LL
men     10 07 0b 03 02      
    MM EH NN1
are     18 34 03 02         
    AA ER2
created 08 33 13 14 00 0d 0c 15 03 02   
    KK3 ER2 IY EY PA1 TT2 IH DD1
equal.  13 2a 2e 1a 2d 04   
    IY KK1 WW AE LL PA5 PA5 PA4
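A dump like the hex lines above can be produced without any dynamic allocation. The following is a minimal sketch, assuming a caller-supplied output buffer; the function name sketchFormatWord() is hypothetical, and only the word and its hex codes are formatted here (the mnemonic line would need a name table, which is not reproduced).

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical formatter: writes e.g. "four<TAB>28 35 33 03 02" into pszOut.
   Returns the number of characters stored (excluding the NUL); output is
   truncated if cbOut is too small. */
static int sketchFormatWord ( const char* pchWord, int nWordLen,
        const uint8_t* pbyPhon, int nPhon, char* pszOut, size_t cbOut )
{
    if ( 0 == cbOut )
    {
        return 0;
    }
    /* the word is not NUL-terminated, so print it with an explicit length */
    int nUsed = snprintf ( pszOut, cbOut, "%.*s\t", nWordLen, pchWord );
    for ( int i = 0; i < nPhon && nUsed >= 0 && (size_t)nUsed < cbOut; ++i )
    {
        int n = snprintf ( pszOut + nUsed, cbOut - (size_t)nUsed, "%s%02x",
                ( i > 0 ) ? " " : "", (unsigned)pbyPhon[i] );
        if ( n < 0 )
        {
            break;
        }
        nUsed += n;     /* may exceed cbOut on truncation; the loop then stops */
    }
    if ( nUsed < 0 )
    {
        return 0;
    }
    return ( (size_t)nUsed < cbOut ) ? nUsed : (int)( cbOut - 1 );
}
```

The resulting string can then be handed to whatever serial output routine is available.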

I did go ahead and wire in a command in the monitor for testing this stuff: 'sp' for 'speak'. You're meant to supply a sentence, and it will parse and translate it much as the code shown above does (with a little extra error checking).

Now I'm curious about simulating the SP0256-AL2 using a PWM output. In this way, you wouldn't need the physical chip to enjoy 1970's era speech synthesis output. This will be a challenge with the flash -- the audio files as-is are something like 144 KiB total -- /that/ won't fit! Also, although the chip (STM32F103C8) is designated and self-reports as having 64 KiB flash, it is an open secret that the device in fact has 128 KiB (same as the 'CB). I will exploit this to get the extra room I need if it all works out.

Next

Chasing another goose named 'SP0256-AL2 simulation'.

Discussions

deladriere wrote 05/19/2020 at 11:50

Ahh thanks !

I am also playing with the chip with an Arduino M0 (sorting out some fake chips)

I would like to try the text-to-phoneme part on the Arduino too


ziggurat29 wrote 05/19/2020 at 13:36

it should port over easily as there are no special libs beyond the standard C library (I think there is a strlen() call, and it's not strictly required).

* tts_rules_compact.h, .c are the blob of the TTS rules. As I mentioned in the post, to save space I compacted these into this form. The original rules in human readable form are in the Python PoC.
* text_to_speech.h, .c is what processes the rules, transforming English to phoneme sequences. It also has a 'word cracking' function for breaking up a sentence into words. (The algorithm uses a slightly non-obvious word separation technique with regards to punctuation.)

Also the two methods provided were defined such that they are suitable for directly processing from constant buffers, requiring no mallocs or read/write memory (other than the phoneme buffer which you provide). This was to reduce ram requirements, but also to facilitate streaming in data of indefinite length.

A consequence of this is that there does need to be a 'breaking' character at the end of a 'sentence'. (This could be a LF, which is ignored phonetically.) Imagine typing text into a terminal, with the algorithm processing it as it arrives. If the sentence 'I wasn't going to the store' happened to be processed at the moment only 'I was' had been received, then the 'I' part would be correctly transformed, but the 'was' part would not be, because in truth the word had not been fully received. So that's why it is required that there be some final word-breaking character at the end of the complete text -- to avoid spurious word breaking while streaming.

An undocumented feature (which I /think/ works) is that if you provide a 0-length phoneme buffer, the routine will fail and give you a negative result which is the number of phonemes required. I don't use this feature, but it seemed useful when I was writing the code.


ziggurat29 wrote 05/16/2020 at 13:34

Yup, in the 'project links' there are two github repos -- one is for the python PoC code, and the other is for the BluePill codebase.
You'll (almost) certainly also need an STLink-V2 programmer if you don't already have one. The uber cheap Chinese ones work fine. (I say 'almost' because there is a way to burn the firmware over the serial port, though I've never done this myself.)
Let me know how it goes. I'm expecting to be 'done' with this project in the next couple days, meaning it will drive both the physical SP0256-AL2 (as it does now), but also be able to simulate the chip standalone with PWM. When I'm completely done, I'll put a pre-built firmware in the 'files' section so folks don't have to install the toolchain if they just want to kick the tires.


deladriere wrote 05/16/2020 at 08:05

Nice work! I just ordered a Blue Pill to test your code

Is it published somewhere ?
