.. _RefSphinxEngine:

CMU Pocket Sphinx engine back-end
============================================================================

This version of dragonfly contains an engine implementation using the open
source, cross-platform `CMU Pocket Sphinx speech recognition engine`_. You
can read more about the CMU Sphinx speech recognition projects on the
`CMU Sphinx wiki`_.


Setup
----------------------------------------------------------------------------

There are three Pocket Sphinx engine dependencies:

- `pyaudio <http://people.csail.mit.edu/hubert/pyaudio/>`_
- `pyjsgf <https://github.com/Danesprite/pyjsgf>`_
- `sphinxwrapper <https://github.com/Danesprite/sphinxwrapper>`_

You can install these by running the following command::

  pip install 'dragonfly2[sphinx]'

If you are installing to *develop* dragonfly, use the following instead::

  pip install -e '.[sphinx]'

Once the dependencies are installed, you'll need to copy the
`dragonfly/examples/sphinx_module_loader.py`_ script into the folder
with your grammar modules and run it using::

  python sphinx_module_loader.py


This file is the equivalent of the 'core' directory that NatLink uses to
load grammar modules. When run, it will scan the directory it's in for files
beginning with ``_`` and ending with ``.py``, then try to load them as
command-modules.
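
The scan described above can be sketched roughly as follows. This is only an
illustration of the file name rule, not the loader's actual code; the
``find_command_modules`` function name is hypothetical.

..  code:: Python

    import os


    def find_command_modules(directory):
        """Return sorted paths of files named like ``_*.py`` in *directory*."""
        names = sorted(
            name for name in os.listdir(directory)
            if name.startswith("_") and name.endswith(".py")
        )
        # Keep regular files only; sub-directories are ignored.
        return [
            os.path.join(directory, name) for name in names
            if os.path.isfile(os.path.join(directory, name))
        ]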


Cross-platform Engine
----------------------------------------------------------------------------

Pocket Sphinx runs on most platforms, including on architectures other than
x86, so it makes sense that the Pocket Sphinx dragonfly engine
implementation should also work on non-Windows platforms such as macOS and
Linux distributions. To this end, I've made an effort to mock Windows-only
functionality on non-Windows platforms for the time being, which allows the
engine components to work correctly regardless of the platform.

Using dragonfly with a non-Windows operating system can already be done with
`Aenea`_ using the existing *NatLink* engine. Aenea communicates with a
separate Windows system running *NatLink* and *DNS* over a network
connection and has server support for Linux (using X11), macOS, and Windows.


Engine configuration
----------------------------------------------------------------------------

This engine can be configured through its ``engine.config`` object.

You can make changes to the ``engine.config`` object directly in your
*sphinx_module_loader.py* file before :meth:`connect` is called, or create a
*config.py* module in the same directory.

The ``LANGUAGE`` option specifies the engine's user language. This is
English (``"en"``) by default.

Audio configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The audio configuration options are used to record from the microphone,
validate input wave files and write wave files if the training data
directory is set.

These options must match the requirements for the acoustic model being used.
The default values match the requirements for the 16kHz CMU US English
models.

- ``CHANNELS`` -- number of audio input channels (default: ``1``).
- ``SAMPLE_WIDTH`` -- sample width for audio input in bytes
  (default: ``2``).
- ``RATE`` -- sample rate for audio input in Hz (default: ``16000``).
- ``FRAMES_PER_BUFFER`` -- frames per recorded audio buffer
  (default: ``2048``).
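
As a rough illustration of what matching these options means for a wave
file, the sketch below compares a file's header with the default values
using Python's standard ``wave`` module. It is not the engine's own
validation code, and the ``wave_file_matches`` function name is
hypothetical.

..  code:: Python

    import wave

    # Default values taken from the options listed above.
    CHANNELS = 1
    SAMPLE_WIDTH = 2
    RATE = 16000


    def wave_file_matches(path, channels=CHANNELS,
                          sample_width=SAMPLE_WIDTH, rate=RATE):
        """Return True if the wave file at *path* uses the given format."""
        wav = wave.open(path, "rb")
        try:
            return (wav.getnchannels() == channels
                    and wav.getsampwidth() == sample_width
                    and wav.getframerate() == rate)
        finally:
            wav.close()

With the default values, one recorded buffer of 2048 frames at 16000 Hz
holds 0.128 seconds of audio.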


Keyphrase configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following configuration options control the engine's built-in
keyphrases:

- ``WAKE_PHRASE`` -- the keyphrase to listen for when in sleep mode
  (default: ``"wake up"``).
- ``WAKE_PHRASE_THRESHOLD`` -- threshold value* for the wake keyphrase
  (default: ``1e-20``).
- ``SLEEP_PHRASE`` -- the keyphrase to listen for to enter sleep mode
  (default: ``"go to sleep"``).
- ``SLEEP_PHRASE_THRESHOLD`` -- threshold value* for the sleep keyphrase
  (default: ``1e-40``).
- ``START_ASLEEP`` -- boolean value for whether the engine should start in
  a sleep state (default: ``True``).
- ``START_TRAINING_PHRASE`` -- keyphrase to listen for to start a training
  session where no processing occurs
  (default: ``"start training session"``).
- ``START_TRAINING_PHRASE_THRESHOLD`` -- threshold value* for the start
  training keyphrase (default: ``1e-48``).
- ``END_TRAINING_PHRASE`` -- keyphrase to listen for to end a training
  session if one is in progress (default: ``"end training session"``).
- ``END_TRAINING_PHRASE_THRESHOLD`` -- threshold value* for the end training
  keyphrase (default: ``1e-45``).

\* Threshold values need to be set for each keyphrase. The `CMU Sphinx LM
tutorial`_ has some advice on keyphrase threshold values.

If your language isn't set to English, any built-in keyphrase that is not
specified in your configuration will be disabled by default.

Any keyphrase can be disabled by setting the phrase and threshold values to
``""`` and ``0`` respectively.
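
As an example, a *config.py* module might contain a fragment like the one
below. This is a sketch using only the option names listed above; the chosen
values are illustrative, and whether options left out of the module fall
back to their defaults is not covered here.

..  code:: Python

    # Keep the default wake keyphrase, but start the engine awake.
    WAKE_PHRASE = "wake up"
    WAKE_PHRASE_THRESHOLD = 1e-20
    START_ASLEEP = False

    # Disable both training keyphrases by using an empty phrase and a
    # threshold of 0.
    START_TRAINING_PHRASE = ""
    START_TRAINING_PHRASE_THRESHOLD = 0
    END_TRAINING_PHRASE = ""
    END_TRAINING_PHRASE_THRESHOLD = 0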

Decoder configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``DECODER_CONFIG`` object initialised in the engine config module can be
used to set various Pocket Sphinx decoder options.

The following is the default decoder configuration:

..  code:: Python

    import os

    from sphinxwrapper import DefaultConfig

    # Configuration for the Pocket Sphinx decoder.
    DECODER_CONFIG = DefaultConfig()

    # Silence the decoder output by default.
    DECODER_CONFIG.set_string("-logfn", os.devnull)

    # Set voice activity detection configuration options for the decoder.
    # You may wish to experiment with these if noise in the background
    # triggers speech start and/or false recognitions (e.g. of short words)
    # frequently.
    # Descriptions for VAD configuration options were retrieved from:
    # https://cmusphinx.github.io/doc/sphinxbase/fe_8h_source.html

    # Number of silence frames to keep after the change from speech to
    # silence.
    DECODER_CONFIG.set_int("-vad_postspeech", 30)

    # Number of speech frames to keep before the change from silence to
    # speech.
    DECODER_CONFIG.set_int("-vad_prespeech", 20)

    # Number of speech frames to trigger vad from silence to speech.
    DECODER_CONFIG.set_int("-vad_startspeech", 10)

    # Threshold for decision between noise and silence frames.
    # Log-ratio between signal level and noise level.
    DECODER_CONFIG.set_float("-vad_threshold", 3.0)


There does not appear to be much documentation on these options outside of
the `pocketsphinx/cmdln_macro.h`_ and `sphinxbase/fe.h`_ header files.
If this is incorrect or has changed, feel free to suggest an edit.

The easiest way of seeing the available decoder options as well as their
default values is to run the ``pocketsphinx_continuous`` command with no
arguments.


Changing Models and Dictionaries
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``DECODER_CONFIG`` object can be used to configure the pronunciation
dictionary as well as the acoustic and language models. You can do this with
something like::

  DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
  DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
  DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')

The language model, acoustic model and pronunciation dictionary should all
use the same language or language variant. See the `CMU Sphinx wiki`_ for
a more detailed explanation of these components.
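
Before setting these options, it can help to check that the paths exist. The
helper below is a minimal sketch that is not part of dragonfly or
sphinxwrapper; the ``missing_model_paths`` name is hypothetical. It assumes,
as in the example above, that ``-hmm`` points at a folder and that ``-lm``
and ``-dict`` point at files.

..  code:: Python

    import os


    def missing_model_paths(hmm_dir, lm_file, dict_file):
        """Return a list of (option, path) pairs whose paths do not exist."""
        missing = []
        if not os.path.isdir(hmm_dir):
            missing.append(("-hmm", hmm_dir))
        if not os.path.isfile(lm_file):
            missing.append(("-lm", lm_file))
        if not os.path.isfile(dict_file):
            missing.append(("-dict", dict_file))
        return missing

An empty list means all three paths were found.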


Training configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The engine can save *.wav* and *.txt* training files into a directory for
later use. The following are the configuration options associated with this
functionality:

- ``TRAINING_DATA_DIR`` -- directory to save training files into
  (default: ``""``).
- ``TRANSCRIPT_NAME`` -- common name of files saved into the training data
  directory (default: ``"training"``).

Set ``TRAINING_DATA_DIR`` to a valid directory path to enable recording of
*.txt* and *.wav* files. If the path is a relative path, it will be
interpreted as relative to the module loader's directory.

The engine will **not** attempt to make the directory for you as it did in
previous versions of *dragonfly*.
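
Since the directory has to exist already, you may want to create it yourself
before starting the engine. The following is a minimal sketch of one way to
do that; the ``prepare_training_dir`` helper is hypothetical and simply
mirrors the relative path rule described above.

..  code:: Python

    import os


    def prepare_training_dir(training_data_dir, loader_dir):
        """Resolve *training_data_dir* against *loader_dir* and create it."""
        if not training_data_dir:
            # An empty value (the default) means nothing should be recorded.
            return ""
        path = training_data_dir
        if not os.path.isabs(path):
            # Relative paths are taken relative to the module loader.
            path = os.path.join(loader_dir, path)
        path = os.path.abspath(path)
        if not os.path.isdir(path):
            os.makedirs(path)
        return path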


Engine API
----------------------------------------------------------------------------

.. autoclass:: dragonfly.engines.backend_sphinx.engine.SphinxEngine
   :members:

.. automodule:: dragonfly.engines.backend_sphinx.timer
   :members:


Improving Speech Recognition Accuracy
----------------------------------------------------------------------------

CMU Pocket Sphinx can have some trouble recognising what was said
accurately. To remedy this, you may need to adapt the acoustic model that
Pocket Sphinx is using. This is similar to how Dragon sometimes requires
training. The CMU Sphinx `adaption tutorial`_ covers this topic. There is
also a `YouTube video on model adaption`_.

Adapting your model may not be necessary; there might be other issues with
your setup. There is more information on tuning the recognition accuracy in
the CMU Sphinx `tuning tutorial`_.

The engine can record what you say into *.wav* and *.txt* files if the
``TRAINING_DATA_DIR`` configuration option mentioned above is set to an
existing directory. To get files compatible with the Sphinx acoustic model
adaption process, you can use the :meth:`write_transcript_files` engine
method.

Mismatched words may use the engine decoder's default search, typically a
language model search.

There are built-in key phrases for starting and ending training sessions
where no grammar rule processing will occur. Key phrases will still be
processed. See the ``START_TRAINING_PHRASE`` and ``END_TRAINING_PHRASE``
engine configuration options. One use case for the training mode is training
potentially destructive commands or commands that take a long time to
execute their actions.

To use the training files, you will need to correct any incorrect phrases in
the *.transcription* or *.txt* files. You can then use the
`SphinxTrainingHelper`_ bash script to adapt your model. This script makes
the process considerably easier, although you may still encounter problems.
You should be able to play the wave files using most media players (e.g.
VLC, Windows Media Player, aplay) if you need to.

You will want to remove the training files after a successful adaption. This
must be done manually for the moment.
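
If you would rather script this clean-up, something like the sketch below
could be used. It is not part of dragonfly, the ``remove_training_files``
name is hypothetical, and it assumes that the directory is used only for
training data, because every *.wav* and *.txt* file in it is deleted.

..  code:: Python

    import os


    def remove_training_files(directory, extensions=(".wav", ".txt")):
        """Delete files in *directory* that have one of the *extensions*.

        Returns the sorted names of the files that were removed.
        """
        removed = []
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and name.lower().endswith(extensions):
                os.remove(path)
                removed.append(name)
        return removed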


Limitations
----------------------------------------------------------------------------

This engine has a few limitations, most notably with spoken language support
and dragonfly's :class:`Dictation` functionality.


Dictation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Mixing free-form dictation with grammar rules is difficult with the CMU
Sphinx decoders. It is either dictation or grammar rules, not both. For this
reason, Dragonfly's CMU Pocket Sphinx SR engine supports speaking free-form
dictation, but only on its own.

Parts of rules that have required combinations with :class:`Dictation` and
other basic Dragonfly elements such as :class:`Literal`, :class:`RuleRef` and
:class:`ListRef` will not be recognised properly using this SR engine via
speaking. They can, however, be recognised via the :meth:`engine.mimic`
method, the :class:`Mimic` action or the :class:`Playback` action.

.. note::

   This engine's previous :class:`Dictation` element support using utterance
   breaks has been removed because it didn't really work very well.


Unknown words
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

CMU Pocket Sphinx uses pronunciation dictionaries to look up phonetic
representations for words in grammars, language models and key phrases in
order to recognise them. If you use words in your grammars and/or key
phrases that are *not* in the dictionary, a message similar to the
following will be printed:

  *grammar 'name' used words not found in the pronunciation dictionary: 
  notaword*

If you get a message like this, try changing the words in your grammars/key
phrases by splitting up the words or using similar words,
e.g. changing "natlink" to "nat link".
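
To check a phrase against a dictionary file yourself, a sketch like the
following may help. It is not the engine's own check, and both function
names are hypothetical. It assumes a CMU-style dictionary file with one
entry per line, where the word comes first and alternative pronunciations
are written as ``word(2)``, ``word(3)`` and so on.

..  code:: Python

    import io
    import re


    def load_dictionary_words(dict_path):
        """Return the set of words defined in a CMU-style dictionary file."""
        words = set()
        with io.open(dict_path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                # Strip the '(2)' style suffix of alternative pronunciations.
                words.add(re.sub(r"\(\d+\)$", "", parts[0]).lower())
        return words


    def find_unknown_words(phrase, dictionary_words):
        """Return the words in *phrase* that are not in *dictionary_words*."""
        return [w for w in phrase.lower().split()
                if w not in dictionary_words]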

I hope to eventually have words and phoneme strings dynamically added to the
current dictionary and language model using the Pocket Sphinx `ps_add_word`_
function (from Python of course).

.. _RefSphinxSpokenLanguageSupport:

Spoken Language Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are only a handful of languages with models and dictionaries
`available from SourceForge <https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/>`_,
although it is possible to build your own language model `using lmtool
<http://www.speech.cs.cmu.edu/tools/lmtool-new.html>`_ or
pronunciation dictionary `using lextool
<http://www.speech.cs.cmu.edu/tools/lextool.html>`_.
There is also a CMU Sphinx tutorial on `building language models
<https://cmusphinx.github.io/wiki/tutoriallm/>`_.

If the language you want to use requires non-ASCII characters
(e.g. a Cyrillic language), you will need to use Python version 3.4 or
higher because of Unicode issues.


Dragonfly Lists and DictLists
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dragonfly :class:`Lists` and :class:`DictLists` function as normal, private
rules for the Pocket Sphinx engine. On updating a dragonfly list or
dictionary, the grammar they are part of will be reloaded. This is because
there is unfortunately no JSGF equivalent for lists.


Text-to-speech
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This isn't a limitation of CMU Pocket Sphinx; text-to-speech is simply not a
goal of that project. However, as the natlink and WSR engines both support
text-to-speech, here are some suggestions if this functionality is desired,
perhaps for use in a custom dragonfly action.

The Jasper project contains `a number of Python interface classes
<https://github.com/jasperproject/jasper-client/blob/master/client/tts.py>`_
to popular open source text-to-speech software such as `eSpeak`_,
`Festival`_ and `CMU Flite`_.


.. Links.
.. _Aenea: https://github.com/dictation-toolbox/aenea
.. _CMU Flite: http://www.festvox.org/flite/
.. _CMU Pocket Sphinx speech recognition engine: https://github.com/cmusphinx/pocketsphinx/
.. _CMU Sphinx LM tutorial: https://cmusphinx.github.io/wiki/tutoriallm/#keyword-lists
.. _CMU Sphinx wiki: https://cmusphinx.github.io/wiki/
.. _Festival: http://www.cstr.ed.ac.uk/projects/festival/
.. _SphinxTrainingHelper: https://github.com/ExpandingDev/SphinxTrainingHelper
.. _YouTube video on model adaption: https://www.youtube.com/watch?v=IAHH6-t9jK0
.. _adaption tutorial: https://cmusphinx.github.io/wiki/tutorialadapt/
.. _dragonfly/examples/sphinx_module_loader.py: https://github.com/dictation-toolbox/dragonfly/blob/master/dragonfly/examples/sphinx_module_loader.py
.. _eSpeak: http://espeak.sourceforge.net/
.. _pocketsphinx/cmdln_macro.h: https://github.com/cmusphinx/pocketsphinx/blob/master/include/cmdln_macro.h
.. _ps_add_word: https://cmusphinx.github.io/doc/pocketsphinx/pocketsphinx_8h.html#a5f3c4fcdbef34915c4e785ac9a1c6005
.. _sphinxbase/fe.h: https://github.com/cmusphinx/sphinxbase/blob/master/include/sphinxbase/fe.h
.. _tuning tutorial: https://cmusphinx.github.io/wiki/tutorialtuning/
