spokestack

IO

spokestack.io.pyaudio

This module uses pyaudio for input and output processing

class spokestack.io.pyaudio.PyAudioInput(sample_rate, frame_width, exception_on_overflow=True, **kwargs)[source]

This class retrieves audio from an input device

Parameters
  • sample_rate (int) – desired sample rate for input (Hz)

  • frame_width (int) – desired frame width for input (ms)

  • exception_on_overflow (bool) – raise an exception on input overflow

close()[source]

Closes the audio stream

Return type

None

property is_active

Stream active property

Returns

‘True’ if stream is active, ‘False’ otherwise

Return type

bool

property is_stopped

Stream stopped property

Returns

‘True’ if stream is stopped, ‘False’ otherwise

Return type

bool

read()[source]

Reads a single frame of audio

Returns

single frame of PCM-16 audio

Return type

np.ndarray

start()[source]

Starts the audio stream

Return type

None

stop()[source]

Stops the audio stream

Return type

None
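
Example

A minimal sketch of reading raw frames outside the speech pipeline; the sample_rate and frame_width values are illustrative.

from spokestack.io.pyaudio import PyAudioInput

# 16 kHz input in 20 ms frames (320 samples per frame)
mic = PyAudioInput(sample_rate=16000, frame_width=20)
mic.start()
try:
    for _ in range(50):  # roughly one second of audio
        frame = mic.read()  # np.ndarray of PCM-16 samples
        print(frame.shape, frame.dtype)
finally:
    mic.stop()
    mic.close()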

class spokestack.io.pyaudio.PyAudioOutput(num_channels=1, sample_rate=24000, frames_per_buffer=1024)[source]

Outputs audio to the default system output

Parameters
  • num_channels (int) – number of audio channels

  • sample_rate (int) – sample rate of the audio (Hz)

  • frames_per_buffer (int) – number of audio samples to buffer on the output device

write(frame)[source]

Writes a single frame of audio to output

Parameters

frame (bytes) – a single frame of audio

Return type

None
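
Example

A sketch that synthesizes one second of a 440 Hz tone and writes it to the default output; the tone generation is illustrative, and the PCM-16 bytes are assumed to match the stream's sample rate.

import numpy as np

from spokestack.io.pyaudio import PyAudioOutput

output = PyAudioOutput(num_channels=1, sample_rate=24000)

# one second of a 440 Hz sine tone encoded as PCM-16 bytes
t = np.linspace(0, 1, 24000, endpoint=False)
tone = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
output.write(tone.tobytes())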

Voice Activity Detection (VAD)

WebRTC

This module contains the WebRTC component for voice activity detection (VAD)

class spokestack.vad.webrtc.VoiceActivityDetector(sample_rate=16000, frame_width=20, vad_rise_delay=0, vad_fall_delay=0, mode=0, **kwargs)[source]

This class detects the presence of voice in a frame of audio.

Parameters
  • sample_rate (int) – sample rate of the audio (Hz)

  • frame_width (int) – width of the audio frame: 10, 20, or 30 (ms)

  • vad_rise_delay (int) – rising edge delay (ms)

  • vad_fall_delay (int) – falling edge delay (ms)

  • mode (int) – named constant to set mode for vad

close()[source]

Close interface for use in pipeline

Return type

None

reset()[source]

Resets the current state

Return type

None

class spokestack.vad.webrtc.VoiceActivityTrigger[source]

Voice Activity Detector trigger pipeline component
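
Example

A sketch of wiring the detector and trigger into a pipeline so that voice activity activates the speech context; the configuration values are illustrative.

from spokestack.io.pyaudio import PyAudioInput
from spokestack.pipeline import SpeechPipeline
from spokestack.vad.webrtc import VoiceActivityDetector, VoiceActivityTrigger

pipeline = SpeechPipeline(
    input_source=PyAudioInput(sample_rate=16000, frame_width=20),
    stages=[
        VoiceActivityDetector(sample_rate=16000, frame_width=20, mode=0),
        VoiceActivityTrigger(),
    ],
)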

Speech Context

This module contains a context class to manage state between members of the processing pipeline

class spokestack.context.SpeechContext[source]

Class for managing context of the speech pipeline.

add_handler(name, function)[source]

Adds a handler to the context

Parameters
  • name (str) – The name of the event handler

  • function (Callable) – event handler function

Return type

None

property confidence

This property contains the confidence of a classification result.

Returns

model confidence of classification

Return type

float

event(name)[source]

Calls the event handler

Parameters

name (str) – The name of the event handler

Return type

None

property is_active

This property manages activity of the context.

Returns

‘True’ if context is active, ‘False’ otherwise.

Return type

bool

property is_speech

This property indicates whether speech is present in the current state.

Returns

‘True’ if speech is present, ‘False’ otherwise

Return type

bool

reset()[source]

Resets the context state

Return type

None

property transcript

This property is the text representation of the audio buffer

Returns

the value of the transcript property

Return type

str
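
Example

A sketch of registering and firing an event handler, assuming handlers receive the context instance as their argument.

from spokestack.context import SpeechContext

context = SpeechContext()

# register a handler under a name, then fire the corresponding event
context.add_handler("activate", lambda ctx: print("context activated"))
context.event("activate")

# clear transcript, activity, and the rest of the state
context.reset()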

Wakeword

TFLite Model

This module contains the class for detecting the presence of keywords in an audio stream

class spokestack.wakeword.tflite.WakewordTrigger(pre_emphasis=0.0, sample_rate=16000, fft_window_type='hann', fft_hop_length=10, model_dir='', posterior_threshold=0.5, **kwargs)[source]

Detects the presence of a wakeword in the audio input

Parameters
  • pre_emphasis (float) – The value of the pre-emphasis filter

  • sample_rate (int) – The number of audio samples per second of audio (Hz)

  • fft_window_type (str) – The type of FFT window (only ‘hann’ is supported)

  • fft_hop_length (int) – Audio sliding window for STFT calculation (ms)

  • model_dir (str) – Path to the directory containing .tflite models

  • posterior_threshold (float) – Probability threshold for if a wakeword was detected

close()[source]

Close interface for use in the pipeline

Return type

None

reset()[source]

Resets the current WakewordDetector state

Return type

None
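
Example

A sketch of a wake-word-triggered pipeline; "tflite" stands in for the directory holding the downloaded .tflite models.

from spokestack.io.pyaudio import PyAudioInput
from spokestack.pipeline import SpeechPipeline
from spokestack.vad.webrtc import VoiceActivityDetector
from spokestack.wakeword.tflite import WakewordTrigger

pipeline = SpeechPipeline(
    input_source=PyAudioInput(sample_rate=16000, frame_width=20),
    stages=[
        VoiceActivityDetector(sample_rate=16000, frame_width=20),
        WakewordTrigger(model_dir="tflite"),
    ],
)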

Models

spokestack.models.tensorflow module

TFLite model base class

class spokestack.models.tensorflow.TFLiteModel(model_path, **kwargs)[source]

TFLite model base class for managing multiple inputs/outputs

Parameters

model_path (str) – path to the .tflite model file

property input_details

Property for accessing the TFLite model input_details

Returns

Input details for the TFLite model

Return type

List[Any]

property output_details

Property for accessing the TFLite model output_details

Returns

Output details for the TFLite model

Return type

List[Any]

resize(index, shape)[source]

Resize and allocate an input tensor

Parameters
  • index (int) – index of the input tensor to resize

  • shape (List[int]) – new shape of the input tensor

Return type

None
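
Example

A sketch of inspecting and resizing a model; "model.tflite" and the new shape are illustrative.

from spokestack.models.tensorflow import TFLiteModel

model = TFLiteModel(model_path="model.tflite")

# tensor shapes, dtypes, and indices for each input/output
print(model.input_details)
print(model.output_details)

# reallocate input tensor 0 for a batch of one 512-sample window
model.resize(0, [1, 512])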

Text to Speech (TTS)

Clients

Spokestack

This module contains the Spokestack client for text to speech

exception spokestack.tts.clients.spokestack.TTSError(response)[source]

Bases: Exception

Text to speech error wrapper

class spokestack.tts.clients.spokestack.TextToSpeechClient(key_id, key_secret, url='https://api.spokestack.io/v1')[source]

Bases: object

Spokestack Text to Speech Client

Parameters
  • key_id (str) – identity from spokestack api credentials

  • key_secret (str) – secret key from spokestack api credentials

  • url (str) – spokestack api url

synthesize(utterance, mode='text', voice='demo-male', profile='default')[source]

Converts the given utterance to speech.

Text can be formatted as plain text (mode=”text”), SSML (mode=”ssml”), or Speech Markdown (mode=”markdown”).

This method also supports different formats for the synthesized audio via the profile argument. The supported profiles and their associated formats are:

  • default: 24kHz, 64kbps mono MP3

  • alexa: 24kHz, 48kbps mono MP3

  • discord: 48kHz, 64kbps stereo OPUS

  • twilio: 8kHz, 64kbps mono MP3

Parameters
  • utterance (str) – string that needs to be rendered as speech.

  • mode (str) – synthesis mode to use with the utterance: text, ssml, or markdown.

  • voice (str) – name of the tts voice.

  • profile (str) – name of the audio profile used to create the resulting stream.

Returns

Encoded audio response in the form of a sequence of bytes

Return type

(Iterator[bytes])

synthesize_url(utterance, mode='text', voice='demo-male', profile='default')[source]

Converts the given utterance to speech accessible by a URL.

Text can be formatted as plain text (mode=”text”), SSML (mode=”ssml”), or Speech Markdown (mode=”markdown”).

This method also supports different formats for the synthesized audio via the profile argument. The supported profiles and their associated formats are:

  • default: 24kHz, 64kbps mono MP3

  • alexa: 24kHz, 48kbps mono MP3

  • discord: 48kHz, 64kbps stereo OPUS

  • twilio: 8kHz, 64kbps mono MP3

Parameters
  • utterance (str) – string that needs to be rendered as speech.

  • mode (str) – synthesis mode to use with the utterance: text, ssml, or markdown.

  • voice (str) – name of the tts voice.

  • profile (str) – name of the audio profile used to create the resulting stream.

Returns

URL of the audio clip

Return type

str
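
Example

A sketch of both synthesis calls; the credential strings are placeholders.

from spokestack.tts.clients.spokestack import TextToSpeechClient, TTSError

client = TextToSpeechClient("YOUR_KEY_ID", "YOUR_KEY_SECRET")

try:
    # stream the encoded MP3 response to a file
    with open("hello.mp3", "wb") as f:
        for chunk in client.synthesize("Hello world!"):
            f.write(chunk)

    # or request a URL to the audio clip instead of the audio itself
    url = client.synthesize_url("Hello world!", voice="demo-male")
    print(url)
except TTSError as error:
    print(error)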

Manager

This module contains the Spokestack text to speech manager which handles a text to speech client, decodes the returned audio, and writes the audio to the specified output.

class spokestack.tts.manager.SequenceIO(sequence)[source]

Wrapper that allows for incrementally received audio to be decoded.

class spokestack.tts.manager.TextToSpeechManager(client, output, format_='mp3')[source]

Manages tts client and io target.

Parameters
  • client (Any) – Text to speech client that returns encoded mp3 audio

  • output (Any) – Audio io target

  • format_ – Audio format, one of FORMAT_MP3 or FORMAT_PCM16

close()[source]

Closes the client and output.

Return type

None

synthesize(utterance, mode='text', voice='demo-male', profile='default')[source]

Synthesizes the given utterance with the voice and format provided.

Text can be formatted as plain text (mode=”text”), SSML (mode=”ssml”), or Speech Markdown (mode=”markdown”).

This method also supports different formats for the synthesized audio via the profile argument. The supported profiles and their associated formats are:

  • default: 24kHz, 64kbps mono MP3

  • alexa: 24kHz, 48kbps mono MP3

  • discord: 48kHz, 64kbps stereo OPUS

  • twilio: 8kHz, 64kbps mono MP3

Parameters
  • utterance (str) – string that needs to be rendered as speech.

  • mode (str) – synthesis mode to use with the utterance: text, ssml, or markdown.

  • voice (str) – name of the tts voice.

  • profile (str) – name of the audio profile used to create the resulting stream.

Return type

None
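
Example

A sketch pairing the Spokestack client with the default system output; the default MP3 format matches the client's default 24kHz profile, and the credential strings are placeholders.

from spokestack.io.pyaudio import PyAudioOutput
from spokestack.tts.clients.spokestack import TextToSpeechClient
from spokestack.tts.manager import TextToSpeechManager

manager = TextToSpeechManager(
    client=TextToSpeechClient("YOUR_KEY_ID", "YOUR_KEY_SECRET"),
    output=PyAudioOutput(sample_rate=24000),
)
manager.synthesize("Welcome to Spokestack!")
manager.close()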

TTS-Lite

Spokestack-Lite Speech Synthesizer

This module contains the SpeechSynthesizer class used to convert text to speech using local TTS models trained on the Spokestack platform. A SpeechSynthesizer instance can be passed to the TextToSpeechManager for playback.

Example

This example assumes that a TTS model was downloaded from the Spokestack platform and extracted to the model directory.

from spokestack.io.pyaudio import PyAudioOutput
from spokestack.tts.manager import TextToSpeechManager, FORMAT_PCM16
from spokestack.tts.lite import SpeechSynthesizer, BLOCK_LENGTH, SAMPLE_RATE

tts = TextToSpeechManager(
    SpeechSynthesizer("./model"),
    PyAudioOutput(sample_rate=SAMPLE_RATE, frames_per_buffer=BLOCK_LENGTH),
    format_=FORMAT_PCM16)

tts.synthesize("Hello world!")
class spokestack.tts.lite.SpeechSynthesizer(model_path)[source]

Initialize a new lightweight speech synthesizer

Parameters

model_path (str) – Path to the extracted TTS model downloaded from the Spokestack platform

synthesize(utterance, *_args, **_kwargs)[source]

Synthesize a text utterance to speech audio

Parameters

utterance (str) – The text string to synthesize

Returns

A generator that yields a sequence of PCM-16 NumPy audio blocks for playback, storage, etc.

Return type

Iterator[np.ndarray]

Automatic Speech Recognition (ASR)

spokestack.asr.spokestack.cloud_client module

This module contains the websocket logic used to communicate with Spokestack’s cloud-based ASR service.

exception spokestack.asr.spokestack.cloud_client.APIError(response)[source]

Spokestack api error pass through

Parameters

response (dict) – message from the api service

class spokestack.asr.spokestack.cloud_client.CloudClient(key_id, key_secret, socket_url='wss://api.spokestack.io', audio_format='PCM16LE', sample_rate=16000, language='en', limit=10, idle_timeout=None)[source]

Spokestack client for cloud based speech to text

Parameters
  • key_id (str) – identity from spokestack api credentials

  • key_secret (str) – secret key from spokestack api credentials

  • socket_url (str) – url for socket connection

  • audio_format (str) – format of input audio

  • sample_rate (int) – audio sample rate (Hz)

  • language (str) – language for recognition

  • limit (int) – Limit of messages per api response

  • idle_timeout (Any) – Time before client timeout. Defaults to None

connect()[source]

connects to websocket

Return type

None

disconnect()[source]

disconnects client socket connection

Return type

None

end()[source]

sends an empty binary message to indicate the last frame

Return type

None

property idle_count

current counter of idle time

Return type

int

property idle_timeout

property for maximum idle time

Return type

Any

initialize()[source]

sends/receives the initial api request

Return type

None

property is_connected

status of the socket connection

Return type

bool

property is_final

status of the most recent server response

Return type

bool

receive()[source]

receives the api response

Return type

None

property response

current response message

Return type

dict

send(frame)[source]

sends a single frame of audio

Parameters

frame (np.ndarray) – segment of PCM-16 encoded audio

Return type

None
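
Example

A sketch of driving the client by hand; real applications interleave send and receive until is_final is set. `frames` stands in for a source of PCM-16 audio frames, and the credential strings are placeholders.

import numpy as np

from spokestack.asr.spokestack.cloud_client import CloudClient

client = CloudClient("YOUR_KEY_ID", "YOUR_KEY_SECRET")
client.connect()
client.initialize()

frames = [np.zeros(320, np.int16)]  # placeholder for real audio frames
for frame in frames:
    client.send(frame)
    client.receive()

client.end()  # signal the final frame
while not client.is_final:
    client.receive()
print(client.response)
client.disconnect()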

spokestack.asr.spokestack.speech_recognizer module

This module contains the recognizer for cloud based ASR in the speech pipeline

class spokestack.asr.spokestack.speech_recognizer.CloudSpeechRecognizer(spokestack_id='', spokestack_secret='', language='en', sample_rate=16000, frame_width=20, idle_timeout=5000, **kwargs)[source]

Speech recognizer for use in the speech pipeline

Parameters
  • spokestack_id (str) – identity under spokestack api credentials

  • spokestack_secret (str) – secret key from spokestack api credentials

  • language (str) – language recognized

  • sample_rate (int) – audio sample rate (Hz)

  • frame_width (int) – frame width of the audio (ms)

  • idle_timeout (int) – the number of iterations before the connection times out

close()[source]

closes client connection

Return type

None

reset()[source]

resets client connection

Return type

None

spokestack.asr.google.speech_recognizer module

This module contains the google asr speech recognizer

class spokestack.asr.google.speech_recognizer.GoogleSpeechRecognizer(language, credentials=None, sample_rate=16000, **kwargs)[source]

Transforms speech into text using Google’s ASR.

Parameters
  • language (str) – The language of the given audio as a BCP-47 language tag (https://www.rfc-editor.org/rfc/bcp/bcp47.txt). Example: “en-US”

  • credentials (Union[None, str, dict]) – Dictionary of Google API credentials or path to a credentials file. If set to None, credentials are pulled from the GOOGLE_APPLICATION_CREDENTIALS environment variable.

  • sample_rate (int) – sample rate of the input audio (Hz)

  • **kwargs (optional) – additional keyword arguments

close()[source]

closes recognizer

Return type

None

reset()[source]

resets recognizer

Return type

None

spokestack.asr.keyword.tflite module

This module contains the Spokestack KeywordRecognizer, which identifies multiple keywords from an audio stream.

class spokestack.asr.keyword.tflite.KeywordRecognizer(classes, pre_emphasis=0.97, sample_rate=16000, fft_window_type='hann', fft_hop_length=10, model_dir='', posterior_threshold=0.5, **kwargs)[source]

Recognizes keywords in an audio stream.

Parameters
  • classes (List[str]) – Keyword labels

  • pre_emphasis (float) – The value of the pre-emphasis filter

  • sample_rate (int) – The number of audio samples per second of audio (Hz)

  • fft_window_type (str) – The type of FFT window (only ‘hann’ is supported)

  • fft_hop_length (int) – Audio sliding window for STFT calculation (ms)

  • model_dir (str) – Path to the directory containing .tflite models

  • posterior_threshold (float) – Probability threshold for detection

close()[source]

Close interface for use in the SpeechPipeline

Return type

None

reset()[source]

Resets the current KeywordDetector state

Return type

None

Natural Language Understanding (NLU)

Parsers

Digits

This module contains the parser that converts the string representation of a sequence of digits into the corresponding sequence of digits. These digits may be in the form of English cardinal number words, along with some homophones. The numbers from twenty through ninety-nine may be hyphenated or unhyphenated; unhyphenated numbers are joined automatically. The use of unhyphenated numbers introduces ambiguity: for example, "sixty five thousand" could be parsed as either "605000" or "65000". Our parser outputs the latter. However, this can be an issue with values such as "sixty five thousand one", which parses as "650001". This limitation should be acceptable for most multi-digit use cases such as telephone numbers, social security numbers, etc.

spokestack.nlu.parsers.digits.parse(metadata, raw_value)[source]

Digit Parser

Parameters
  • metadata (Dict[str, Any]) – digit slot metadata

  • raw_value (str) – value tagged by the model

Returns

string parsed digits

Return type

(str)
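
Example

A sketch of calling the parser directly; the metadata shape is an assumption, since real slot metadata comes from the trained model's export. The expected output follows the module notes above.

from spokestack.nlu.parsers import digits

metadata = {"name": "number"}  # illustrative slot metadata
print(digits.parse(metadata, "sixty five thousand"))  # "65000"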

Entity

This module contains the logic to parse entities from NLU results. The entity parser is a pass through for string values to allow custom logic to resolve the entities. For example, the entity can be used as a keyword in a database search.

spokestack.nlu.parsers.entity.parse(metadata, raw_value)[source]

Entity Parser

Parameters
  • metadata (Dict[str, Any]) – metadata for entity slot

  • raw_value (str) – tagged entity

Returns

tagged entity

Return type

(str)

Integer

This module contains the logic to parse integers from NLU results. Integers can be in the form of words (i.e., one, two, three) or numerals (i.e., 1, 2, 3). Either form will resolve to Python’s built-in ‘int’ type. The metadata must contain a range key holding the minimum and maximum values for the expected integer range. It is important to note the difference between digits and integers. Integers are counting numbers: 2 apples, a table for two. In contrast, digits can be used for sequences of numbers like phone numbers or social security numbers.

spokestack.nlu.parsers.integer.parse(metadata, raw_value)[source]

Integer Parser

Parameters
  • metadata (Dict[str, Any]) – metadata for the integer slot

  • raw_value (str) – value tagged by the model

Returns

integer if parsable, None if invalid

Return type

Union[int, None]
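
Example

A sketch of parsing an integer slot; the exact shape of the range metadata is an assumption.

from spokestack.nlu.parsers import integer

# "range" holds the minimum and maximum expected values (assumed shape)
metadata = {"range": [1, 100]}

print(integer.parse(metadata, "seven"))  # 7
print(integer.parse(metadata, "7"))      # 7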

Selset

This module contains the logic to parse selsets from NLU results. Selsets contain a name along with one or more aliases, which allows any of the listed aliases to be mapped to a single word. For example, if a selset’s name is "light" and its aliases are bulbs, light, beam, lamp, etc., occurrences of any alias will be parsed as "light".

spokestack.nlu.parsers.selset.parse(metadata, raw_value)[source]

Selset Parser

Parameters
  • metadata (Dict[str, Any]) – slot metadata

  • raw_value (str) – value tagged by the model

Returns

selset or None if invalid

Return type

Union[str, None]

TFLite Model

This module contains the class for using TFLite NLU models. In this case, an NLU model is a TFLite model which takes in an utterance and returns an intent along with any slots that are associated with that intent.

class spokestack.nlu.tflite.TFLiteNLU(model_dir)[source]

Abstraction for using TFLite NLU models

Parameters

model_dir (str) – path to the model directory containing nlu.tflite, metadata.json, and vocab.txt
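
Example

A sketch of classifying an utterance; "nlu_model" is an illustrative directory, and calling the instance directly on an utterance is an assumption based on the module description.

from spokestack.nlu.tflite import TFLiteNLU

nlu = TFLiteNLU("nlu_model")  # contains nlu.tflite, metadata.json, vocab.txt

# assumed invocation: returns the classified intent and any tagged slots
result = nlu("turn the lights off")
print(result)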

Activation Timeout

This module manages the timeout for speech pipeline activation.

class spokestack.activation_timeout.ActivationTimeout(frame_width=20, min_active=500, max_active=5000, **kwargs)[source]

Speech pipeline activation timeout

Parameters
  • frame_width (int) – frame width of the audio (ms)

  • min_active (int) – the minimum length of an activation (ms)

  • max_active (int) – the maximum length of an activation (ms)

close()[source]

Sets active length to zero

Return type

None

deactivate(context)[source]

Deactivates the speech pipeline

Return type

None

reset()[source]

Resets active length

Return type

None

Speech Pipeline

This module contains the speech pipeline which manages the components for processing speech.

class spokestack.pipeline.SpeechPipeline(input_source, stages, **kwargs)[source]

Pipeline for managing speech components.

Parameters
  • input_source (Any) – source of audio input

  • stages (List[Any]) – components desired in the pipeline

  • **kwargs – additional keyword arguments

activate()[source]

Activates the pipeline

Return type

None

close()[source]

Closes the running pipeline

Return type

None

property context

Current context

Return type

SpeechContext

deactivate()[source]

Deactivates the pipeline

Return type

None

event(function=None, name=None)[source]

Registers an event handler

Parameters
  • function (Optional[Any]) – event handler

  • name (Optional[str]) – name of event handler

Return type

Any

Returns

Default event handler if a function not specified
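
Example

A sketch of the decorator form, where `pipeline` stands in for a constructed SpeechPipeline; the handler name is assumed to determine the event it observes.

@pipeline.event
def on_activate(context):
    print("active:", context.is_active)

@pipeline.event
def on_recognize(context):
    print("transcript:", context.transcript)

pipeline.run()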

property is_running

State of the pipeline

Return type

bool

pause()[source]

Stops audio input until resume is called

Return type

None

resume()[source]

Resumes audio input after a pause

Return type

None

run()[source]

Runs the pipeline to process speech and cleans up after stop is called

Return type

None

start()[source]

Starts input source of the pipeline

Return type

None

step()[source]

Process a single frame with the pipeline

Return type

None

stop()[source]

Halts the pipeline

Return type

None

Automatic Noise Suppression

This module contains the class for WebRTC automatic noise suppression

class spokestack.nsx.webrtc.AutomaticNoiseSuppression(sample_rate=16000, policy=1, **kwargs)[source]

WebRTC Automatic Noise Suppression

Parameters
  • sample_rate (int) – audio sample rate. (Hz)

  • policy (int) – aggressiveness of the noise suppression:

      - POLICY_MILD: mild suppression (6dB)

      - POLICY_MEDIUM: medium suppression (10dB)

      - POLICY_AGGRESSIVE: aggressive suppression (15dB)

      - POLICY_VERY_AGGRESSIVE: very aggressive suppression

close()[source]

method for pipeline compliance

Return type

None

reset()[source]

method for pipeline compliance

Return type

None

Automatic Gain Control

This module contains the class for WebRTC’s automatic gain control

class spokestack.agc.webrtc.AutomaticGainControl(sample_rate=16000, frame_width=20, target_level_dbfs=3, compression_gain_db=15, limit_enable=True, **kwargs)[source]

WebRTC Automatic Gain Control

Parameters
  • sample_rate (int) – audio sample_rate. (Hz)

  • frame_width (int) – audio frame width. (ms)

  • target_level_dbfs (int) – target peak audio level. (dBFS)

  • compression_gain_db (int) – dynamic range compression rate. (dB)

  • limit_enable (bool) – enables limiter in compression.

close()[source]

method for pipeline compliance

Return type

None

reset()[source]

method for pipeline compliance

Return type

None