Speech recognition in ROS with PocketSphinx

A project log for Wild Thumper based ROS robot

My ROS (Robot Operating System) indoor & outdoor robot

Humpelstilzchen 12/08/2018 at 18:12

The current goal is to issue voice commands to the Wild Thumper robot. The speech recognition engine PocketSphinx was chosen for this task because it works with little CPU and memory. Since there does not seem to be an up-to-date ROS node for PocketSphinx, I decided to write a simple one. PocketSphinx includes a GStreamer element, so the modular GStreamer framework can help with the audio processing.

In GStreamer, complex tasks like playing a multimedia file are performed by chaining multiple elements into a pipeline. Each element executes a single task, e.g. reading a file, decompressing data or outputting data to a monitor.

PocketSphinx requires the audio to be 16 bit little endian mono at a rate of 16000 Hz. The ALSA plughw interface can provide the input from the USB microphone in this format, so "plughw:1,0" is used as the input device.

Pocketsphinx is used in two voice recognition modes:

1. Keyword detection

The speech recognition should only react to commands when addressed by name, for example "wild thumper stop" instead of just "stop", because the robot should not react when e.g. a movie is running in the background in which someone says "stop". The robot shall also only react to its exact name, not to something that sounds similar. PocketSphinx provides a keyword spotting mode for this use case. Input to this mode is the file keywords.kws with a threshold for each keyword:

wild thumper /1e-11/
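
The kws file format allows one keyphrase per line, each with its own threshold. The threshold trades missed detections against false alarms and has to be tuned per phrase on test recordings; a second keyphrase could be added like this (the "hey thumper" line and its threshold are only an illustration, not part of my setup):

```
wild thumper /1e-11/
hey thumper /1e-14/
```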

2. Fixed grammar

After spotting the keyword, PocketSphinx shall recognize a command like "stop", "go forward one meter", "backward", "turn left" or "get voltage". PocketSphinx is run with a given grammar in the Java Speech Grammar Format (JSGF) to avoid a spoken "go forward" accidentally getting recognized as "go four" (yes, this happens a lot). Since "go four" is not allowed by the grammar, it is discarded. This increases the recognition accuracy from ~40% to ~80%. As of today the jsgf option of the PocketSphinx GStreamer element is only supported in unreleased git, so it needs to be compiled from source. The file robot.jsgf looks like this:

#JSGF V1.0;

grammar robot;

<bool> = (on | off);
<number> = minus* (zero | one | two | three | four | five | six | seven | eight | nine | ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety | hundred | thousand | million);

<misc_command> = (light | lights) [<bool>];
<engine> = (stop | forward | backward | increase speed | decrease speed);
<get> = get (temp | temperature | light | voltage | current | pressure | mute | mic | silence | speed | velocity | position | angle | compass | motion | secure | engine | odom | humidity);
<go> = go (forward | backward) <number>+ (meter | meters | centimeter | centimeters);
<turn> = turn (left | right | (to | by) <number>+ [(degree | degrees)]);
<speed> = set+ speed <number>+ | set default speed;

public <rules> = <misc_command> | <engine> | <get> | <go> | <turn> | <speed>;
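
A sentence accepted by this grammar still arrives as plain words, so before acting on e.g. "go forward one hundred twenty centimeters" the robot has to turn the word sequence matched by the <number> rule back into an integer. A minimal sketch of such a conversion (this helper is not part of the node below; the function name and the handling of "minus" are my assumptions):

```ruby
# Values of the simple number words from the <number> grammar rule
WORD_VALUES = {
  'zero' => 0, 'one' => 1, 'two' => 2, 'three' => 3, 'four' => 4,
  'five' => 5, 'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9,
  'ten' => 10, 'eleven' => 11, 'twelve' => 12, 'thirteen' => 13,
  'fourteen' => 14, 'fifteen' => 15, 'sixteen' => 16, 'seventeen' => 17,
  'eighteen' => 18, 'nineteen' => 19, 'twenty' => 20, 'thirty' => 30,
  'forty' => 40, 'fifty' => 50, 'sixty' => 60, 'seventy' => 70,
  'eighty' => 80, 'ninety' => 90
}
MULTIPLIERS = { 'hundred' => 100, 'thousand' => 1000, 'million' => 1000000 }

# Collapse a word sequence like %w[one hundred twenty] into 120
def words_to_number(words)
  sign = 1
  total = 0   # finished groups, e.g. the "two thousand" in "two thousand five"
  group = 0   # group currently being built
  words.each do |w|
    if w == 'minus'
      sign = -sign
    elsif MULTIPLIERS.key?(w)
      group = 1 if group == 0     # a bare "hundred" counts as "one hundred"
      group *= MULTIPLIERS[w]
      if w != 'hundred'           # "thousand"/"million" close the group
        total += group
        group = 0
      end
    else
      group += WORD_VALUES.fetch(w)
    end
  end
  sign * (total + group)
end
```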

As acoustic model, the default U.S. English continuous model of PocketSphinx is used, together with an MLLR adaption for my specific accent and microphone. In my tests this improved the recognition accuracy to over 90%.

Pocketsphinx ROS node

The GStreamer pipeline in the ROS node uses two PocketSphinx elements, one for the keyword spotting mode and one for the JSGF grammar mode. A preceding "cutter" element suppresses low background noise. The valve before the JSGF grammar node is usually closed and only opened on a keyword match. After command detection the valve is closed again. This enables the JSGF grammar node only after a spoken keyword.

In GStreamer notation the pipeline looks like this:

alsasrc device="plughw:1,0" ! audio/x-raw,format=S16LE,channels=1,rate=16000
 ! cutter
 ! tee name=jsgf ! queue ! valve drop=true ! pocketsphinx ! fakesink jsgf.
 ! pocketsphinx ! fakesink
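
Assuming the GStreamer command-line tools are installed, a simplified single-branch version of this pipeline can be tried standalone before wiring it into ROS; the -m flag prints the bus messages, so the PocketSphinx hypotheses become visible on the console:

```
gst-launch-1.0 -m alsasrc device="plughw:1,0" \
 ! audio/x-raw,format=S16LE,channels=1,rate=16000 \
 ! cutter \
 ! pocketsphinx ! fakesink
```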

The audio is split into the two branches by the "tee" element. The "queue" element starts a separate thread for the second branch so both branches of the pipeline are independent of each other. The output of the two PocketSphinx elements is discarded by the fakesinks; the result of the speech detection is instead read from the message bus and published to the ROS topic "asr_result" as message type string. The complete code follows:


require 'gst'
require 'pry'
require 'logger'
require 'ros'
require 'std_msgs/String'

class Speak
    def initialize(node)
        @logger = Logger.new(STDOUT)
        @publisher = node.advertise('asr_result', Std_msgs::String)
        @pipeline = Gst.parse_launch('alsasrc device="plughw:1,0" ! audio/x-raw,format=S16LE,channels=1,rate=16000 ! cutter leaky=true name=cutter'\
                         ' ! tee name=jsgf ! queue leaky=downstream ! valve name=valve_jsgf drop=true ! pocketsphinx name=asr_jsgf ! fakesink async=false jsgf.'\
                         ' ! pocketsphinx name=asr_kws ! fakesink async=false')

        # Ignore everything below the configured volume
        cutter = @pipeline.get_by_name('cutter')
        cutter.set_property('threshold-dB', -20)
        cutter.set_property('pre-length', 100000000) # pocketsphinx needs about 0.1s before start
        cutter.set_property('run-length', 1300000000)

        asr_jsgf = @pipeline.get_by_name('asr_jsgf')
        asr_jsgf.set_property('hmm', 'pocketsphinx/adapt/cmusphinx-en-us-5.2')
        asr_jsgf.set_property('mllr', 'pocketsphinx/adapt/mllr_matrix')
        asr_jsgf.set_property('jsgf', 'data/robot.jsgf')

        asr_kws = @pipeline.get_by_name('asr_kws')
        asr_kws.set_property('hmm', 'pocketsphinx/adapt/cmusphinx-en-us-5.2')
        asr_kws.set_property('mllr', 'pocketsphinx/adapt/mllr_matrix')
        asr_kws.set_property('kws', 'data/keywords.kws')

        bus = @pipeline.bus()
        bus.add_watch do |bus, message|
            case message.type
            when Gst::MessageType::EOS
                @logger.warn "End of stream"
            when Gst::MessageType::ERROR
                p message.parse_error
                binding.pry # open console
            when Gst::MessageType::ELEMENT
                if message.src.name == "asr_kws"
                    if message.structure.get_value(:final).value
                        keyword_detect(message.structure.get_value(:hypothesis).value, message.structure.get_value(:confidence).value)
                    end
                elsif message.src.name == "asr_jsgf"
                    if message.structure.get_value(:final).value
                        final_result(message.structure.get_value(:hypothesis).value, message.structure.get_value(:confidence).value)
                    end
                elsif message.src.name == "cutter"
                    if message.structure.get_value(:above).value
                        @logger.debug "Start recording.."
                    else
                        @logger.debug "Stop recording"
                    end
                end
            end
            true # keep the bus watch installed
        end

        @pipeline.play
    end

    # Enables/Disables the jsgf pipeline branch
    def enable_jsgf(bEnable)
        valve = @pipeline.get_by_name('valve_jsgf')
        valve.set_property("drop", !bEnable)
    end

    # Result of jsgf pipeline branch
    def final_result(hyp, confidence)
        @logger.info "final: " + hyp + " " + confidence.to_s

        # Command recognized, close the valve again until the next keyword
        enable_jsgf(false)

        # Publish pocketsphinx result as ros message
        msg = Std_msgs::String.new
        msg.data = hyp
        @publisher.publish(msg)
    end

    # Result of keyword spotting branch: open the valve of the jsgf branch
    def keyword_detect(hyp, confidence)
        @logger.debug "Got keyword: " + hyp
        enable_jsgf(true)
    end

    def stop
        @pipeline.stop
    end
end

if __FILE__ == $0
    node = ROS::Node.new('pocketsphinx')
    app = Speak.new(node)
    loop = GLib::MainLoop.new(nil, false)
    begin
        Thread.new {
            loop.run
        }
        node.spin
    rescue Interrupt
    ensure
        app.stop
        loop.quit
        node.shutdown
    end
end
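
On the robot side something has to consume the "asr_result" topic. A minimal sketch of a dispatcher as it could run in the subscriber callback (the :cmd_* symbols and the function name are purely illustrative, not part of the actual robot code; the grammar guarantees the sentence matches one of the rules, so simple prefix checks are sufficient):

```ruby
# Hypothetical mapping from a recognized sentence to a robot action
def dispatch(hyp)
  case hyp
  when 'stop'                        then :cmd_stop     # <engine> rule
  when /\Ago (forward|backward) /    then :cmd_go       # <go> rule
  when /\Aturn (left|right|to |by )/ then :cmd_turn     # <turn> rule
  when /\Aget /                      then :cmd_query    # <get> rule
  when /\Aset( default)? speed/      then :cmd_speed    # <speed> rule
  when /\Alights? /                  then :cmd_lights   # <misc_command> rule
  else :cmd_unknown
  end
end
```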