
When I Hear That Squealing, I Need Textual Healing

A project log for Look Who's Talking 0256

A BluePill Driver/Simulator/Emulator for the GI SP0256-AL2

ziggurat29 05/07/2020 at 00:09

Summary

While the voice synthesizer is cool, manually sequencing phonemes is not particularly fun.  Text-to-speech is imperfect, but it definitely lightens the load.  Though this was not a goal of the project, I thought I'd spend a couple of days on this bit of feature creep.

Deets

Some time back I had implemented a text-to-speech algorithm that was meant to be a software equivalent of the CTS256A-AL2.  This was a companion chip that accepted text data, implemented a derivative of the Naval Research Lab algorithm, and drove the SP0256-AL2.  I do happen to have one of those devices, but I have never used it myself.

The algorithm derives ultimately from work done by the Naval Research Lab in 1976, contained in 'NRL Report 7948', titled 'Automatic Translation of English Text to Phonetics by Means of Letter-to-Sound Rules'.  It describes a set of 329 rules and claims 90% accuracy on an average text sample.  It also asserts that 'most of the remaining 10% have single errors easily correctable by the listener', which I assume means you giggle and figure it out for yourself from context.

Further work was done in the 1980s by John A. Wasser and Tom Jennings -- I have lost that code, but it should be findable on the Internet, since that's where I got it.  I also added some additional rules of my own.

The gist of the algorithm is to crack incoming text into words.  Then each word is independently translated.  Each translation rule consists of three patterns:  a 'left context', a 'right context', and a middle context that I call the 'bracket context', simply because that is the notation the original rules used.  E.g.:

a[b]c => p

where 'a' is the left context, 'b' is the bracket context, 'c' is the right context, and 'p' is the phoneme sequence produced.
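
To make the notation concrete, here is one way such a rule table might look in Python.  These particular rules are illustrative only -- patterned after NRL-style rule sets, not copied from my actual table:

# Each rule is (left_context, bracket_context, right_context, phonemes).
# Illustrative examples; not taken from my actual rule table.
RULES = [
    ("",  "ar", "o", ["AXR"]),        # 'ar' followed by 'o'
    ("^", "as", "#", ["EY", "SS"]),   # a consonant, then 'as', then vowels
    ("",  "a",  "",  ["AE"]),         # catch-all for a bare 'a'
]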

To apply a rule, you match all three patterns against the text; if you get a match, you replace the 'bracket context' with the phoneme sequence and advance the position in the text you are processing.

Regular expressions come to mind for pattern matching; however, the NRL report uses some particular 'character classes' that make sense in this phonetic context, such as 'one or more vowels/consonants', 'front vowels', and 'e-related things at the end of the word'.  So for practical reasons I did not use regexes for this project.  For one, I would have needed an enormous number of compiled expressions, since there are hundreds of rules; for another, I wasn't going to have a regex library on this embedded processor anyway.  So I hand-coded the pattern matching logic.  Mercifully, it was straightforward, even if tedious.
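
As a sketch of what that hand-coded matching can look like -- the class symbols here follow conventions common to NRL-derived implementations, and my actual table may differ:

VOWELS = set("aeiou")

def match_class(sym, ch):
    """Match one pattern symbol against one character.
    '^' = any single consonant, '+' = front vowel (e, i, y);
    anything else is a literal."""
    if sym == '^':
        return ch.isalpha() and ch not in VOWELS
    if sym == '+':
        return ch in "eiy"
    return sym == ch

def match_right(pattern, text, pos):
    """Match 'pattern' against text starting at pos.
    Returns the position after the match, or None on failure.
    '#' greedily consumes one or more vowels."""
    for sym in pattern:
        if sym == '#':
            if pos >= len(text) or text[pos] not in VOWELS:
                return None
            while pos < len(text) and text[pos] in VOWELS:
                pos += 1
        else:
            if pos >= len(text) or not match_class(sym, text[pos]):
                return None
            pos += 1
    return pos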

You can view the processing as a cursor moving across the word being translated.  The cursor sits at the start of the 'bracket' pattern, so the 'left context' can be considered 'backtracking'.  If all three patterns match, the phoneme sequence is emitted and the cursor is advanced by the length of the 'bracket' pattern.
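
The cursor loop might look like this in Python -- assuming the match_right() helper sketched above, plus a match_left() that, for brevity, only handles literal left contexts:

def match_left(pattern, text, pos):
    """Match a (literal-only, for brevity) left context ending just before pos."""
    return text.endswith(pattern, 0, pos)

def translate_word(word, rules):
    """Walk a cursor across 'word', emitting phonemes by first-match rule."""
    phonemes = []
    pos = 0
    while pos < len(word):
        for left, bracket, right, phons in rules:
            if (word.startswith(bracket, pos)
                    and match_left(left, word, pos)
                    and match_right(right, word, pos + len(bracket)) is not None):
                phonemes.extend(phons)
                pos += len(bracket)   # advance past the bracket text
                break
        else:
            pos += 1                  # no rule matched; skip this character
    return phonemes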

Rules are tested sequentially until a successful match is made, so more specific rules should precede more general ones.  To expedite this linear search, I exploited the fact that the 'bracket' context contains only literals -- no patterns.  I separated the rules into groups keyed on the first character of the 'bracket' context.  This way, the majority of the rules never have to be tested at all, since it is known they have no hope of matching.
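
A sketch of that grouping, again assuming the rule tuples from earlier:

from collections import defaultdict

def index_rules(rules):
    """Bucket rules by the first character of their bracket context,
    preserving order so specific-before-general still holds per bucket."""
    buckets = defaultdict(list)
    for rule in rules:
        bracket = rule[1]
        buckets[bracket[0]].append(rule)
    return buckets

# At translation time, only the bucket for the character under the
# cursor needs to be scanned:
#   candidates = buckets.get(word[pos], [])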

In real time this actually took a few days to do (I'm behind on posts but nearly caught up), and I have possibly made some transcription errors, but I did get it running and generating some text-to-speech.  E.g., this time I am using Jefferson Airplane's "Somebody to Love": somebodytolove.wav

Not too bad; there are certainly some oddities, such as 'head' and 'breast' coming out more like 'heed' and 'breest'.  This could easily be errors in my transcription, perhaps bugs in my pattern matching, or maybe crappy phoneme recordings.  I'll investigate this more closely later, but punt on it for now in the interest of getting back on track with getting the BluePill interface to the physical chip running.

At any rate, I put this Python proof-of-concept prototype in a separate repo from the BluePill stuff, the link to which is in this project's 'links' section.

Next

Back to the BluePill...
