The current goal is to issue voice commands to the Wild Thumper robot. I chose the speech recognition engine Pocketsphinx for this task because it needs little CPU and memory. Since there does not seem to be an up-to-date ROS node for Pocketsphinx, I decided to write a simple one. Pocketsphinx includes a GStreamer element, so the modular GStreamer framework can handle the audio processing.
In GStreamer, complex tasks like playing a multimedia file are performed by chaining multiple elements into a pipeline. Each element executes a single task, e.g. reading a file, decompressing data or sending output to a monitor.
Pocketsphinx requires the audio to be 16-bit little-endian mono at a sample rate of 16000 Hz. The ALSA plughw interface can provide the input from the USB microphone in this format, so "plughw:1,0" is used as the input device.
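As a rough illustration of how such a chain could look, the Python sketch below assembles a gst-launch style description for the capture path. The element selection is my assumption, not the pipeline used in the node: audioconvert and audioresample are only a safety net in case the device delivers another format, fakesink is a placeholder sink, and the function names build_pipeline_description and launch are hypothetical.

```python
def build_pipeline_description(device="plughw:1,0", rate=16000):
    """Return a gst-launch style description of a simple capture chain."""
    elements = [
        f"alsasrc device={device}",   # read from the ALSA device
        "audioconvert",               # assumption: convert sample format if needed
        "audioresample",              # assumption: resample if needed
        # caps filter: 16-bit little-endian, mono, 16000 Hz
        f"audio/x-raw,format=S16LE,channels=1,rate={rate}",
        "pocketsphinx",               # recognizer element with default settings
        "fakesink",                   # placeholder sink, discards the audio
    ]
    return " ! ".join(elements)


def launch(description):
    """Build and start the pipeline. Requires PyGObject and GStreamer 1.0."""
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)
    pipeline = Gst.parse_launch(description)
    pipeline.set_state(Gst.State.PLAYING)
    return pipeline
```

Calling build_pipeline_description() only produces the description string, so it can be inspected or pasted into gst-launch-1.0 before anything is started with launch().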
Pocketsphinx is used in two voice recognition modes:
1. Keyword detection
The speech recognition should only react to commands when the robot is addressed by name, for example "wild thumper stop" rather than just "stop". Otherwise the robot could react when, for example, someone says "stop" in a movie running in the background. The robot should also react only to its exact name, not to something that sounds similar. Pocketsphinx provides a keyword spotting mode for this use case. The input to this mode is the file keywords.kws, which lists each keyword with its detection threshold:
wild thumper /1e-11/
2. Fixed grammar
After spotting the keyword, Pocketsphinx should recognize a command like "stop", "go forward one meter", "backward", "turn left" or "get voltage". Pocketsphinx is run with a grammar in the Java Speech Grammar Format (JSGF) to keep a spoken "go forward" from accidentally being recognized as "go four" (yes, this happens a lot). Since "go four" is not allowed by the grammar, it is discarded. This increased the recognition accuracy from ~40% to ~80%. As of today, the jsgf option of the Pocketsphinx GStreamer element is only available in unreleased git, so Pocketsphinx needs to be compiled from source. The robot.jsgf looks like this:
<bool> = (on | off);
<number> = minus* (zero | one | two | three | four | five | six | seven | eight | nine | ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety | hundred | thousand | million);
<misc_command> = (light | lights) [<bool>];
<engine> = (stop | forward | backward | increase speed | decrease speed);
<get> = get (temp | temperature | light | voltage | current | pressure | mute | mic | silence | speed | velocity | position | angle | compass | motion | secure | engine | odom | humidity);
<go> = go (forward | backward) <number>+ (meter | meters | centimeter | centimeters);
<turn> = turn (left | right | (to | by) <number>+ [(degree | degrees)]);
<speed> = set+ speed <number>+ | set default speed;
public <rules> = <misc_command> | <engine> | <get> | <go> | <turn> | <speed>;
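To see which phrases this grammar admits, the rules can be hand-translated into a regular expression. The Python sketch below is only an illustration of the grammar's effect and not how Pocketsphinx evaluates JSGF; the helper names words and accepts are hypothetical.

```python
import re


def words(*alternatives):
    """Match exactly one alternative; every word is followed by one space."""
    return "(?:" + "|".join(a + " " for a in alternatives) + ")"


NUMBER = "(?:minus )*" + words(*(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty "
    "thirty forty fifty sixty seventy eighty ninety hundred thousand million"
).split())
BOOL = words("on", "off")

MISC = words("light", "lights") + BOOL + "?"
ENGINE = words("stop", "forward", "backward", "increase speed", "decrease speed")
GET = "get " + words(*(
    "temp temperature light voltage current pressure mute mic silence speed "
    "velocity position angle compass motion secure engine odom humidity"
).split())
GO = ("go " + words("forward", "backward") + "(?:" + NUMBER + ")+"
      + words("meter", "meters", "centimeter", "centimeters"))
TURN = ("turn (?:" + words("left", "right") + "|" + words("to", "by")
        + "(?:" + NUMBER + ")+" + words("degree", "degrees") + "?)")
SPEED = "(?:set )+speed (?:" + NUMBER + ")+|set default speed "

# public <rules>: any one of the six command rules
RULES = re.compile("|".join("(?:" + rule + ")"
                            for rule in (MISC, ENGINE, GET, GO, TURN, SPEED)))


def accepts(utterance):
    """Return True if the whole utterance is a sentence of the grammar."""
    normalized = " ".join(utterance.lower().split()) + " "
    return RULES.fullmatch(normalized) is not None
```

With this translation, accepts("go forward one meter") is true while accepts("go four") is false, because the go rule requires a direction before the number and a unit after it.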
As the acoustic model, the default U.S. English continuous model of Pocketsphinx is used together with an MLLR adaptation for my specific accent and microphone. According to tests with word_align.pl, this improved the recognition accuracy to over 90%.
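For readers who have not used such a tool: an accuracy figure of this kind is typically derived from the word-level edit distance between what was said and what was recognized. The Python sketch below shows that calculation under my own assumptions; the exact accounting of word_align.pl may differ, and the function names word_errors and word_accuracy are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j]: distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(pairs):
    """Return 1 - (total word errors / total reference words)."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref.split()) for ref, _ in pairs)
    if total == 0:
        raise ValueError("no reference words given")
    return 1.0 - errors / total
```

For example, if "go forward one meter" is recognized as "go four one meter" and "turn left" is recognized correctly, that is one error in six reference words, so word_accuracy returns about 0.83.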
Pocketsphinx ROS node
The GStreamer pipeline in the ROS node uses two Pocketsphinx elements, one for the keyword spotting mode, one for the JSGF grammar mode. A preceding "cutter" element suppresses low background noise. The valve before the JSGF grammar node...