spokestack¶
IO¶
spokestack.io.pyaudio¶
This module uses pyaudio for input and output processing
class spokestack.io.pyaudio.PyAudioInput(sample_rate, frame_width, exception_on_overflow=True, **kwargs)[source]¶
This class retrieves audio from an input device
- Parameters
sample_rate (int) – desired sample rate for input (Hz)
frame_width (int) – desired frame width for input (ms)
exception_on_overflow (bool) – produce exception for input overflow
property is_active¶
Stream active property
- Returns
‘True’ if stream is active, ‘False’ otherwise
- Return type
bool
property is_stopped¶
Stream stopped property
- Returns
‘True’ if stream is stopped, ‘False’ otherwise
- Return type
bool
class spokestack.io.pyaudio.PyAudioOutput(num_channels=1, sample_rate=24000, frames_per_buffer=1024)[source]¶
Outputs audio to the default system output
- Parameters
num_channels (int) – number of audio channels
sample_rate (int) – sample rate of the audio (Hz)
frames_per_buffer (int) – number of audio samples to buffer on the output device
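As a quick illustration of these two classes, here is a minimal capture-and-playback sketch. It assumes PyAudioInput exposes start(), read(), and stop() methods and PyAudioOutput exposes write(), none of which are documented on this page:

from spokestack.io.pyaudio import PyAudioInput, PyAudioOutput

# 16 kHz mono input in 20 ms frames; playback at the same rate
mic = PyAudioInput(sample_rate=16000, frame_width=20)
speaker = PyAudioOutput(num_channels=1, sample_rate=16000)

mic.start()
for _ in range(500):       # 500 frames of 20 ms each, about 10 seconds
    frame = mic.read()     # assumed: returns one frame of PCM-16 audio
    speaker.write(frame)   # assumed: writes the frame to the default output
mic.stop()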
spokestack.io.sound_device¶
Voice Activity Detection (VAD)¶
WebRTC¶
This module contains the WebRTC component for voice activity detection (VAD)
class spokestack.vad.webrtc.VoiceActivityDetector(sample_rate=16000, frame_width=20, vad_rise_delay=0, vad_fall_delay=0, mode=0, **kwargs)[source]¶
This class detects the presence of voice in a frame of audio.
- Parameters
sample_rate (int) – sample rate of the audio (Hz)
frame_width (int) – width of the audio frame: 10, 20, or 30 (ms)
vad_rise_delay (int) – rising edge delay (ms)
vad_fall_delay (int) – falling edge delay (ms)
mode (int) – named constant to set mode for vad
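Within the speech pipeline, the detector runs as a stage that receives the shared SpeechContext and one frame of audio and sets context.is_speech. A minimal sketch, assuming the stage is invoked as detector(context, frame):

import numpy as np
from spokestack.context import SpeechContext
from spokestack.vad.webrtc import VoiceActivityDetector

context = SpeechContext()
detector = VoiceActivityDetector(sample_rate=16000, frame_width=20, mode=0)

frame = np.zeros(320, dtype=np.int16)   # one 20 ms frame at 16 kHz (silence)
detector(context, frame)                # assumed stage call signature
print(context.is_speech)                # expect False for a silent frame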
Speech Context¶
This module contains a context class to manage state between members of the processing pipeline
class spokestack.context.SpeechContext[source]¶
Class for managing context of the speech pipeline.
add_handler(name, function)[source]¶
Adds a handler to the context
- Parameters
name (str) – The name of the event handler
function (Callable) – event handler function
- Return type
None
property confidence¶
This property contains the confidence of a classification result.
- Returns
model confidence of classification
- Return type
float
event(name)[source]¶
Calls the event handler
- Parameters
name (str) – The name of the event handler
- Return type
None
property is_active¶
This property manages activity of the context.
- Returns
‘True’ if context is active, ‘False’ otherwise.
- Return type
bool
property is_speech¶
Indicates whether speech is present in the current state.
- Returns
‘True’ if speech is present, ‘False’ otherwise
- Return type
bool
property transcript¶
This property is the text representation of the audio buffer
- Returns
the value of the transcript property
- Return type
str
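The handler registry above can be exercised directly with add_handler and event. A small sketch, assuming each handler is called with the context when its event fires:

from spokestack.context import SpeechContext

context = SpeechContext()

def on_recognize(ctx):
    # assumed: handlers receive the SpeechContext when the event fires
    print("transcript:", ctx.transcript)

context.add_handler("recognize", on_recognize)
context.event("recognize")   # the pipeline normally fires this after ASR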
Wakeword¶
TFLite Model¶
This module contains the class for detecting the presence of keywords in an audio stream
class spokestack.wakeword.tflite.WakewordTrigger(pre_emphasis=0.0, sample_rate=16000, fft_window_type='hann', fft_hop_length=10, model_dir='', posterior_threshold=0.5, **kwargs)[source]¶
Detects the presence of a wakeword in the audio input
- Parameters
pre_emphasis (float) – The value of the pre-emphasis filter
sample_rate (int) – The number of audio samples per second of audio (Hz)
fft_window_type (str) – The type of FFT window (only hann is supported)
fft_hop_length (int) – Audio sliding window for STFT calculation (ms)
model_dir (str) – Path to the directory containing .tflite models
posterior_threshold (float) – Probability threshold for wakeword detection
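Construction mirrors the parameters above; detection itself happens inside the speech pipeline, where the trigger typically follows the VAD stage (see the pipeline sketch at the end of this page). The model directory below is a placeholder:

from spokestack.wakeword.tflite import WakewordTrigger

# model_dir must point at a directory containing the .tflite wakeword models
wake = WakewordTrigger(model_dir="path_to_wakeword_model", posterior_threshold=0.5)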
Models¶
spokestack.models.tensorflow module¶
TFLite model base class
class spokestack.models.tensorflow.TFLiteModel(model_path, **kwargs)[source]¶
TFLite model base class for managing multiple inputs/outputs
- Parameters
model_path (str) – Path to .tflite model file
**kwargs (Any) – Additional keyword arguments for the TFLite Interpreter. [https://www.tensorflow.org/api_docs/python/tf/lite/Interpreter]
property input_details¶
Property for accessing the TFLite model input_details
- Returns
Input details for the TFLite model
- Return type
List[Any]
property output_details¶
Property for accessing the TFLite model output_details
- Returns
Output details for the TFLite model
- Return type
List[Any]
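A sketch that loads a model and inspects the two documented properties. The final invocation is an assumption, since this page documents only the constructor, input_details, and output_details:

import numpy as np
from spokestack.models.tensorflow import TFLiteModel

model = TFLiteModel(model_path="model.tflite")

# each detail dict comes from the underlying TFLite Interpreter
for detail in model.input_details:
    print(detail["name"], detail["shape"], detail["dtype"])

# hypothetical invocation: zero input shaped like the first model input
x = np.zeros(model.input_details[0]["shape"], dtype=model.input_details[0]["dtype"])
outputs = model(x)   # assumed: the model instance is callable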
Text to Speech (TTS)¶
Clients¶
Spokestack¶
This module contains the Spokestack client for text to speech
exception spokestack.tts.clients.spokestack.TTSError(response)[source]¶
Bases: Exception
Text to speech error wrapper
class spokestack.tts.clients.spokestack.TextToSpeechClient(key_id, key_secret, url='https://api.spokestack.io/v1')[source]¶
Bases: object
Spokestack Text to Speech Client
- Parameters
key_id (str) – identity from spokestack api credentials
key_secret (str) – secret key from spokestack api credentials
url (str) – spokestack api url
synthesize(utterance, mode='text', voice='demo-male', profile='default')[source]¶
Converts the given utterance to speech.
Text can be formatted as plain text (mode=”text”), SSML (mode=”ssml”), or Speech Markdown (mode=”markdown”).
This method also supports different formats for the synthesized audio via the profile argument. The supported profiles and their associated formats are:
default: 24kHz, 64kbps mono MP3
alexa: 24kHz, 48kbps mono MP3
discord: 48kHz, 64kbps stereo OPUS
twilio: 8kHz, 64kbps mono MP3
- Parameters
utterance (str) – string that needs to be rendered as speech.
mode (str) – synthesis mode to use with utterance. text, ssml, markdown.
voice (str) – name of the tts voice.
profile (str) – name of the audio profile used to create the resulting stream.
- Returns
Encoded audio response in the form of a sequence of bytes
- Return type
Iterator[bytes]
synthesize_url(utterance, mode='text', voice='demo-male', profile='default')[source]¶
Converts the given utterance to speech accessible by a URL.
Text can be formatted as plain text (mode=”text”), SSML (mode=”ssml”), or Speech Markdown (mode=”markdown”).
This method also supports different formats for the synthesized audio via the profile argument. The supported profiles and their associated formats are:
default: 24kHz, 64kbps mono MP3
alexa: 24kHz, 48kbps mono MP3
discord: 48kHz, 64kbps stereo OPUS
twilio: 8kHz, 64kbps mono MP3
- Parameters
utterance (str) – string that needs to be rendered as speech.
mode (str) – synthesis mode to use with utterance. text, ssml, markdown.
voice (str) – name of the tts voice.
profile (str) – name of the audio profile used to create the resulting stream.
- Returns
URL of the audio clip
- Return type
str
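A usage sketch with placeholder credentials: synthesize streams the encoded audio in chunks, while synthesize_url returns a link to the rendered clip:

from spokestack.tts.clients.spokestack import TTSError, TextToSpeechClient

client = TextToSpeechClient("your-key-id", "your-key-secret")

try:
    # stream MP3 bytes to a file as they arrive
    with open("hello.mp3", "wb") as f:
        for chunk in client.synthesize("Hello world!"):
            f.write(chunk)

    # or request a URL for the clip instead
    print(client.synthesize_url("Hello world!", voice="demo-male"))
except TTSError as error:
    print(error)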
Manager¶
This module contains the Spokestack text to speech manager which handles a text to speech client, decodes the returned audio, and writes the audio to the specified output.
class spokestack.tts.manager.SequenceIO(sequence)[source]¶
Wrapper that allows for incrementally received audio to be decoded.
class spokestack.tts.manager.TextToSpeechManager(client, output, format_='mp3')[source]¶
Manages tts client and io target.
- Parameters
client (Any) – Text to speech client that returns encoded mp3 audio
output (Any) – Audio io target
format_ (str) – Audio format, one of FORMAT_MP3 or FORMAT_PCM16
synthesize(utterance, mode='text', voice='demo-male', profile='default')[source]¶
Synthesizes the given utterance with the voice and format provided.
Text can be formatted as plain text (mode=”text”), SSML (mode=”ssml”), or Speech Markdown (mode=”markdown”).
This method also supports different formats for the synthesized audio via the profile argument. The supported profiles and their associated formats are listed under TextToSpeechClient.synthesize above.
- Parameters
utterance (str) – string that needs to be rendered as speech.
mode (str) – synthesis mode to use with utterance. text, ssml, markdown, etc.
voice (str) – name of the tts voice.
profile (str) – name of the audio profile used to create the resulting stream.
- Return type
None
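Combining the client above with PyAudioOutput gives direct playback; the 24 kHz output rate below matches the default MP3 profile, and the credentials are placeholders:

from spokestack.io.pyaudio import PyAudioOutput
from spokestack.tts.clients.spokestack import TextToSpeechClient
from spokestack.tts.manager import TextToSpeechManager

client = TextToSpeechClient("your-key-id", "your-key-secret")
manager = TextToSpeechManager(client, PyAudioOutput(sample_rate=24000))
manager.synthesize("Welcome to Spokestack!")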
TTS-Lite¶
Spokestack-Lite Speech Synthesizer
This module contains the SpeechSynthesizer class used to convert text to speech using local TTS models trained on the Spokestack platform. A SpeechSynthesizer instance can be passed to the TextToSpeechManager for playback.
Example
This example assumes that a TTS model was downloaded from the Spokestack platform and extracted to the model directory.
from spokestack.io.pyaudio import PyAudioOutput
from spokestack.tts.manager import TextToSpeechManager, FORMAT_PCM16
from spokestack.tts.lite import SpeechSynthesizer, BLOCK_LENGTH, SAMPLE_RATE
tts = TextToSpeechManager(
    SpeechSynthesizer("./model"),
    PyAudioOutput(sample_rate=SAMPLE_RATE, frames_per_buffer=BLOCK_LENGTH),
    format_=FORMAT_PCM16)
tts.synthesize("Hello world!")
Automatic Speech Recognition (ASR)¶
spokestack.asr.spokestack.cloud_client module¶
This module contains the websocket logic used to communicate with Spokestack’s cloud-based ASR service.
exception spokestack.asr.spokestack.cloud_client.APIError(response)[source]¶
Spokestack api error pass through
- Parameters
response (dict) – message from the api service
class spokestack.asr.spokestack.cloud_client.CloudClient(key_id, key_secret, socket_url='wss://api.spokestack.io', audio_format='PCM16LE', sample_rate=16000, language='en', limit=10, idle_timeout=None)[source]¶
Spokestack client for cloud-based speech to text
- Parameters
key_id (str) – identity from spokestack api credentials
key_secret (str) – secret key from spokestack api credentials
socket_url (str) – url for socket connection
audio_format (str) – format of input audio
sample_rate (int) – audio sample rate (Hz)
language (str) – language for recognition
limit (int) – Limit of messages per api response
idle_timeout (Any) – Time before client timeout. Defaults to None
property idle_count¶
current counter of idle time
- Return type
int
property idle_timeout¶
property for maximum idle time
- Return type
Any
property is_connected¶
status of the socket connection
- Return type
bool
property is_final¶
status of most recent server response
- Return type
bool
property response¶
current response message
- Return type
dict
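The connect/send/receive lifecycle is not documented on this page, so the sketch below sticks to construction and the status properties (credentials are placeholders):

from spokestack.asr.spokestack.cloud_client import CloudClient

client = CloudClient(
    key_id="your-key-id",
    key_secret="your-key-secret",
    sample_rate=16000,
    language="en",
    idle_timeout=5000,
)
print(client.is_connected)   # expect False until a socket connection is opened
print(client.idle_timeout)   # 5000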
spokestack.asr.spokestack.speech_recognizer module¶
This module contains the recognizer for cloud-based ASR in the speech pipeline
class spokestack.asr.spokestack.speech_recognizer.CloudSpeechRecognizer(spokestack_id='', spokestack_secret='', language='en', sample_rate=16000, frame_width=20, idle_timeout=5000, **kwargs)[source]¶
Speech recognizer for use in the speech pipeline
- Parameters
spokestack_id (str) – identity from spokestack api credentials
spokestack_secret (str) – secret key from spokestack api credentials
language (str) – language recognized
sample_rate (int) – audio sample rate (Hz)
frame_width (int) – frame width of the audio (ms)
idle_timeout (int) – the number of iterations before the connection times out
spokestack.asr.google.speech_recognizer module¶
This module contains the google asr speech recognizer
class spokestack.asr.google.speech_recognizer.GoogleSpeechRecognizer(language, credentials=None, sample_rate=16000, **kwargs)[source]¶
Transforms speech into text using Google’s ASR.
- Parameters
language (str) – The language of given audio as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: “en-US”
credentials (Union[None, str, dict]) – Dictionary of Google API credentials or path to a credentials file. If set to None, credentials are pulled from the environment variable GOOGLE_APPLICATION_CREDENTIALS.
sample_rate (int) – sample rate of the input audio (Hz)
**kwargs (optional) – additional keyword arguments
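Construction with explicit credentials or via the environment variable; the file path below is a placeholder:

from spokestack.asr.google.speech_recognizer import GoogleSpeechRecognizer

# explicit service-account credentials (placeholder path)
asr = GoogleSpeechRecognizer(language="en-US", credentials="service_account.json")

# or leave credentials=None and set GOOGLE_APPLICATION_CREDENTIALS instead
asr = GoogleSpeechRecognizer(language="en-US")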
This module contains the Spokestack KeywordRecognizer which identifies multiple keywords from an audio stream.
class spokestack.asr.keyword.tflite.KeywordRecognizer(classes, pre_emphasis=0.97, sample_rate=16000, fft_window_type='hann', fft_hop_length=10, model_dir='', posterior_threshold=0.5, **kwargs)[source]¶
Recognizes keywords in an audio stream.
- Parameters
classes (List[str]) – Keyword labels
pre_emphasis (float) – The value of the pre-emphasis filter
sample_rate (int) – The number of audio samples per second of audio (Hz)
fft_window_type (str) – The type of FFT window (only hann is supported)
fft_hop_length (int) – Audio sliding window for STFT calculation (ms)
model_dir (str) – Path to the directory containing .tflite models
posterior_threshold (float) – Probability threshold for detection
Natural Language Understanding (NLU)¶
Parsers¶
Digits¶
This module contains the parser that converts the spoken-word representation of a sequence of digits into the corresponding numeric sequence. The digits may be English cardinal representations of numbers, along with some homophones. Numbers from twenty through ninety-nine can be hyphenated or unhyphenated; unhyphenated numbers are joined automatically, which introduces ambiguity. For example, “sixty five thousand” could be parsed as “605000” or “65000”; our parser outputs the latter. However, this can be an issue with values such as “sixty five thousand one”, which parses as “650001”. This limitation should be acceptable for most multi-digit use cases such as telephone numbers, social security numbers, etc.
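Assuming the parser lives at spokestack.nlu.parsers.digits and follows a module-level parse(metadata, raw_value) convention (neither is spelled out on this page), the ambiguity described above would look like:

from spokestack.nlu.parsers import digits

print(digits.parse({}, "sixty five thousand"))       # "65000" per the note above
print(digits.parse({}, "sixty five thousand one"))   # "650001", the ambiguous case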
Entity¶
This module contains the logic to parse entities from NLU results. The entity parser is a pass through for string values to allow custom logic to resolve the entities. For example, the entity can be used as a keyword in a database search.
Integer¶
This module contains the logic to parse integers from NLU results. Integers can be in the form of words (i.e., one, two, three) or numerals (i.e., 1, 2, 3). Either form resolves to Python’s built-in ‘int’ type. The metadata must contain a range key holding the minimum and maximum values for the expected integer range. It is important to note the difference between digits and integers: integers are counting numbers (2 apples, a table for two), while digits are used for sequences of numbers like phone numbers or social security numbers.
Selset¶
This module contains the logic to parse selsets from NLU results. Selsets contain a name along with one or more aliases. This allows one to map any of the listed aliases into a single word. For example, if a selset’s name is “light”, and its aliases are bulbs, light, beam, lamp, etc., occurrences of any alias will be parsed as “light”.
TFLite Model¶
This module contains the class for using TFLite NLU models. In this case, an NLU model is a TFLite model which takes in an utterance and returns an intent along with any slots that are associated with that intent.
Activation Timeout¶
This module manages the timeout for speech pipeline activation.
class spokestack.activation_timeout.ActivationTimeout(frame_width=20, min_active=500, max_active=5000, **kwargs)[source]¶
Speech pipeline activation timeout
- Parameters
frame_width (int) – frame width of the audio (ms)
min_active (int) – the minimum length of an activation (ms)
max_active (int) – the maximum length of an activation (ms)
Speech Pipeline¶
This module contains the speech pipeline which manages the components for processing speech.
class spokestack.pipeline.SpeechPipeline(input_source, stages)[source]¶
Pipeline for managing speech components.
- Parameters
input_source (Any) – source of audio input
stages (List[Any]) – components desired in the pipeline
**kwargs – additional keyword arguments
property context¶
Current context
- Return type
SpeechContext
event(function=None, name=None)[source]¶
Registers an event handler
- Parameters
function (Optional[Any]) – event handler
name (Optional[str]) – name of event handler
- Return type
Any
- Returns
Default event handler if a function is not specified
property is_running¶
State of the pipeline
- Return type
bool
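Putting the stages from earlier sections together. The stage ordering, the run() entry point, and the decorator’s event-name handling are assumptions, since this page documents only the constructor, context, event, and is_running; credentials and model paths are placeholders:

from spokestack.activation_timeout import ActivationTimeout
from spokestack.asr.spokestack.speech_recognizer import CloudSpeechRecognizer
from spokestack.io.pyaudio import PyAudioInput
from spokestack.pipeline import SpeechPipeline
from spokestack.vad.webrtc import VoiceActivityDetector
from spokestack.wakeword.tflite import WakewordTrigger

pipeline = SpeechPipeline(
    input_source=PyAudioInput(sample_rate=16000, frame_width=20),
    stages=[
        VoiceActivityDetector(),
        WakewordTrigger(model_dir="path_to_wakeword_model"),
        CloudSpeechRecognizer(
            spokestack_id="your-key-id", spokestack_secret="your-key-secret"
        ),
        ActivationTimeout(),
    ],
)

# event() returns the default handler when no function is given, so it can
# also be used as a decorator; the event name is assumed to come from the
# function when none is passed explicitly
@pipeline.event
def recognize(context):
    print(context.transcript)

pipeline.run()   # assumed blocking entry point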
Automatic Noise Suppression¶
This module contains the class for WebRTC automatic noise suppression
class spokestack.nsx.webrtc.AutomaticNoiseSuppression(sample_rate=16000, policy=1, **kwargs)[source]¶
WebRTC Automatic Noise Suppression
- Parameters
sample_rate (int) – audio sample rate (Hz)
policy (int) – aggressiveness of the noise suppression:
POLICY_MILD: mild suppression (6dB)
POLICY_MEDIUM: medium suppression (10dB)
POLICY_AGGRESSIVE: aggressive suppression (15dB)
POLICY_VERY_AGGRESSIVE: very aggressive suppression
Automatic Gain Control¶
This module contains the class for WebRTC’s automatic gain control
class spokestack.agc.webrtc.AutomaticGainControl(sample_rate=16000, frame_width=20, target_level_dbfs=3, compression_gain_db=15, limit_enable=True, **kwargs)[source]¶
WebRTC Automatic Gain Control
- Parameters
sample_rate (int) – audio sample rate (Hz)
frame_width (int) – audio frame width (ms)
target_level_dbfs (int) – target peak audio level (dBFS)
compression_gain_db (int) – dynamic range compression rate (dB)
limit_enable (bool) – enables limiter in compression