Auris & Vox — Audio Subsystem
Added in version 2.0.
Overview
Auris (Latin for ear), Vox (Latin for voice), and Live form the three complementary cores of Synthetic Heart’s unified audio framework:
Auris — file-based STT: accepts a complete audio file, returns a transcript string.
Vox — file-based TTS: accepts a text string, returns synthesised audio bytes.
Live — bidirectional streaming: persistent sessions with interleaved PCM-in / transcript-out and text-in / audio-out.
The three registries follow the same plug-and-play pattern as the LLM cortex engines. New engines can be added without touching any core code.
Architecture
┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
│   AurisRegistry   │  │    VoxRegistry    │  │   LiveRegistry    │
│  auris_registry   │  │   vox_registry    │  │   live_registry   │
└─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘
          │ file-based STT       │ file-based TTS       │ bidirectional
          ▼                      ▼                      ▼
┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
│    AurisPlugin    │  │     VoxPlugin     │  │ (webui WS /stream)│
│ transcribe_audio  │  │      speak()      │  │ LiveRegistry.load │
│  stt_transcribe   │  │     tts_speak     │  │ open/send/receive │
└─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘
          │                      │                      │
┌─────────┴─────────┐  ┌─────────┴─────────┐  ┌─────────┴─────────┐
│   Auris Engines   │  │    Vox Engines    │  │   Live Engines    │
│ gemini.py         │  │ http.py           │  │ silero.py (VAD)   │
│                   │  │ kitten.py         │  │ gemini.py (stub)  │
└───────────────────┘  └───────────────────┘  └───────────────────┘
Registry Pattern
All three registries follow the same conventions as core/cortex_registry.py.
Registering an engine
An Auris engine module must define:
ENGINE_CLASS: the class that extends AurisEngineBase
Optional call to register_auris_engine() (the class auto-registers via the base CAPABILITIES dict)
# plugins/auris_engines/my_engine.py
from plugins.auris_base import AurisEngineBase
from core.auris_registry import register_auris_engine
class MyAurisEngine(AurisEngineBase):
    def transcribe(self, file_path: str, mime_type: str | None = None) -> str | None:
        # ... call your STT service ...
        return transcribed_text
ENGINE_CLASS = MyAurisEngine
register_auris_engine(
"my_engine",
"plugins.auris_engines.my_engine",
{"file_based": True, "local": True},
)
A Vox engine follows the same pattern:
# plugins/vox_engines/my_engine.py
from plugins.vox_base import VoxEngineBase
from core.vox_registry import register_vox_engine
class MyVoxEngine(VoxEngineBase):
    def generate_tts(self, text: str, emotion: str | None = None, **kwargs) -> bytes | None:
        # ... synthesise audio, return WAV or raw PCM bytes ...
        return audio_bytes
ENGINE_CLASS = MyVoxEngine
register_vox_engine(
"my_engine",
"plugins.vox_engines.my_engine",
{"streaming": False, "local": True, "voice_cloning": False, "emotions": True},
)
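As a rough illustration of the plug-and-play pattern, the sketch below shows one way a registry could map an alias to a module path, import the module lazily, and cache one instance per alias. The names register_engine, find_engines, and load_engine are hypothetical stand-ins rather than the actual registry API, and the demo module is fabricated in memory so the snippet runs on its own.

```python
import importlib
import sys
import types

# Hypothetical stand-in for a registry module; not the project's real code.
_ENGINES: dict[str, dict] = {}      # alias -> {"module": ..., "capabilities": ...}
_INSTANCES: dict[str, object] = {}  # alias -> cached engine instance


def register_engine(alias: str, module_path: str, capabilities: dict) -> None:
    """Record where an engine lives; nothing is imported at this point."""
    _ENGINES[alias] = {"module": module_path, "capabilities": dict(capabilities)}


def find_engines(**required) -> list[str]:
    """Return the aliases whose capabilities match every required key/value."""
    return sorted(
        alias
        for alias, entry in _ENGINES.items()
        if all(entry["capabilities"].get(k) == v for k, v in required.items())
    )


def load_engine(alias: str):
    """Import the engine module on first use and cache one instance per alias."""
    if alias in _INSTANCES:
        return _INSTANCES[alias]
    module = importlib.import_module(_ENGINES[alias]["module"])
    engine_class = getattr(module, "ENGINE_CLASS", None)
    if engine_class is None:
        raise LookupError(f"{alias}: module defines no ENGINE_CLASS")
    instance = engine_class()
    _INSTANCES[alias] = instance
    return instance


class DemoEngine:
    """Toy engine used only to exercise the registry sketch."""

    def transcribe(self, file_path, mime_type=None):
        return f"transcript of {file_path}"


# Fabricate a module in memory so no files are needed for the demo.
demo_module = types.ModuleType("demo_engine_module")
demo_module.ENGINE_CLASS = DemoEngine
sys.modules["demo_engine_module"] = demo_module

register_engine("demo", "demo_engine_module", {"file_based": True, "local": True})
engine = load_engine("demo")
```

Deferring the import until load_engine() is called keeps registration cheap, which is one plausible reason to store a module path instead of a class object.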
Auris — STT Subsystem
Plugin: auris_plugin
Vosk-specific configuration
When the auris_vosk engine is selected, two additional variables control
language handling:
VOSK_LANGUAGE – language code for the Vosk model ("en-us", "it", "fr", etc.). Starting in version 2.?, this defaults to "auto" instead of English; in auto mode the first few seconds of audio are probed by a Whisper-tiny model (via the optional faster-whisper package) to identify the spoken language. If detection fails or the dependency is missing, the first downloaded Vosk model is used as a fallback. Explicitly setting VOSK_LANGUAGE overrides auto-detection.
VOSK_LID_CONFIDENCE – when Whisper LID is used, the minimum probability (0–1) required to accept the detected language. Defaults to 0.5. If the confidence is below this threshold the fallback path is taken.
The AurisPlugin is the single authoritative entry-point for all transcription. Other plugins, interfaces, and cortex engines must not call STT backends directly — they should always call AurisPlugin.transcribe_audio().
Configuration (all WebUI-configurable):
| Variable | Default | Description |
|---|---|---|
| | | Name of the active engine (e.g. |
| | | JSON string of engine-specific settings forwarded at load time. |
Public API:
# From any interface or plugin:
from core.core_initializer import PLUGIN_REGISTRY
auris = PLUGIN_REGISTRY.get("auris_plugin")
if auris:
    text = await auris.transcribe_audio(
        "/path/to/audio.ogg",
        mime_type="audio/ogg",  # optional MIME hint
        engine_name="gemini",  # optional per-call override
    )
LLM action stt_transcribe:
{
  "type": "stt_transcribe",
  "payload": {
    "audio_path": "/tmp/live_io/in_123.oga",
    "mime_type": "audio/ogg",
    "engine": "gemini"
  }
}
AurisEngineBase contract
All engines must extend plugins.auris_base.AurisEngineBase.
Auris engines are file-based only — for real-time / bidirectional streaming use LiveEngineBase instead.
from abc import ABC, abstractmethod

class AurisEngineBase(ABC):
    # Required
    @abstractmethod
    def transcribe(self, file_path: str, mime_type: str | None = None) -> str | None: ...

    # Lifecycle hooks
    def setup(self) -> None: ...
    def teardown(self) -> None: ...
Available Auris engines
Most Auris capabilities are now configured through the External Endpoints UI. Add a provider as an endpoint and enable the auris mapping when the endpoint supports STT.
The only built-in Auris engine currently shipped by default is:
| Alias | File | Notes |
|---|---|---|
| | | Local speech recognition engine. File-based, offline, and suitable for self-hosted deployments. |
Note
Silero has moved to the Live registry (live_engines/silero.py).
Importing auris_engines/silero will emit a deprecation warning and
register nothing.
Vox — TTS & Lip-sync Subsystem
Plugin: vox_plugin
Text language detection
The core plugin performs language detection on the input text using
lingua-language-detector (a considerably more accurate replacement for the
previously used detector). The detected ISO-639-1 code ("en", "it", etc.)
is logged and forwarded to the active Vox engine via a language keyword
argument. Engines can optionally use this hint to select an appropriate voice
or model. lingua is a required dependency; if detection confidence is below
the internal threshold, no hint is supplied and the plugin continues as before.
VoxPlugin owns the entire TTS pipeline:
Text cleaning (emoji removal, whitespace normalisation)
Engine selection and audio generation
WAV/PCM file writing to VOX_OUTPUT_DIR
Optional lip-sync data extraction (engine.get_lipsync_data())
Audio dispatch to the originating interface (WebUI synth:tts-play, Discord, Telegram)
Text fallback when the engine fails
No interface or other plugin should handle TTS audio files or dispatch lip-sync events directly.
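The text-cleaning step listed in the pipeline could look roughly like the sketch below. The emoji ranges are an assumption chosen for illustration and cover only a few common Unicode blocks; clean_for_tts is a hypothetical name, not the plugin's actual function.

```python
import re

# Approximate emoji coverage (an assumption, not an exhaustive list).
_EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def clean_for_tts(text: str) -> str:
    """Replace emoji with spaces, then collapse runs of whitespace."""
    without_emoji = _EMOJI.sub(" ", text)
    return re.sub(r"\s+", " ", without_emoji).strip()
```

Replacing emoji with a space instead of deleting them avoids gluing together two words that were separated only by an emoji.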
Configuration (all WebUI-configurable):
| Variable | Default | Description |
|---|---|---|
| | | Name of the active TTS engine. Set to |
| | | JSON string forwarded to the engine at load time. |
| | | Directory where generated audio files are written. |
| | | Maximum seconds to wait for a TTS engine response. |
| | | When |
Note
Legacy TTS_* config keys (TTS_ENABLED, TTS_ENDPOINTS, TTS_TIMEOUT_SECONDS, TTS_OUTPUT_DIR) are still supported by the built-in http Vox engine for backward compatibility, but the preferred configuration path for new deployments is to register external HTTP TTS servers through the External Endpoints system and map them to vox.
WebUI helper endpoints
Two read-only HTTP endpoints support voice selection and sample playback in the web interface. Engines may implement them if they expose multiple speakers or wish to supply short example clips.
GET /api/vox/speakers?engine=<name> returns a JSON array of speaker metadata for the specified engine (the configured ACTIVE_VOX_ENGINE is used when the query parameter is omitted). The format is engine-specific; kitten returns [{"code": "en_1", "name": "English Female 1", "language": "en"}, …]. If the engine is unknown, a 404 is returned.
GET /api/vox/sample?engine=<name>&speaker=<code> streams a short WAV file for the given speaker. Engines that cannot provide samples should raise NotImplementedError, which results in a 404. A missing speaker parameter produces a 400 error.
These helpers are used internally by res/synth_webui/js/main.js to
populate the Kitten voice selector and play sample audio.
Public API:
from core.core_initializer import PLUGIN_REGISTRY
vox = PLUGIN_REGISTRY.get("vox_plugin")
if vox:
    result = await vox.speak(
        "Hello, world!",
        interface_path="synth_webui/session_abc",
        emotion="joy",
        engine_name="http",  # optional per-call override
        merged_text="Hello, world!",  # plain-text fallback caption
    )
    # result = {"status": "success"|"skipped"|"error", "filename": ..., ...}
LLM action tts_speak:
{
  "type": "tts_speak",
  "payload": {
    "text": "Hello, world!",
    "emotion": "joy"
  }
}
When tts_speak is delivered alongside a standard message action (for
example the LLM returns both a message_telegram_bot and a tts_speak),
message_chain will automatically merge the text payload into the TTS action as
__merged_text. This ensures that users receive a single audio message with
a caption and prevents the duplicate text reply that would otherwise occur.
The merged_text field is also used as a fallback caption when the TTS
engine fails.
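The merge behaviour can be sketched as a function over a list of actions. This is an illustration only: merge_text_into_tts is a hypothetical helper, and it assumes that message actions have a type starting with "message_", carry their text in payload["text"], and are dropped once merged. None of these details are confirmed above.

```python
import copy


def merge_text_into_tts(actions: list[dict]) -> list[dict]:
    """Fold message text into the tts_speak action as __merged_text."""
    actions = copy.deepcopy(actions)  # leave the caller's list untouched
    tts = next((a for a in actions if a.get("type") == "tts_speak"), None)
    if tts is None:
        return actions  # nothing to merge into
    kept = []
    for action in actions:
        payload = action.get("payload", {})
        if action.get("type", "").startswith("message_") and "text" in payload:
            # Keep the first message text as the caption and drop the
            # separate text reply so the user is not answered twice.
            tts.setdefault("payload", {}).setdefault("__merged_text", payload["text"])
            continue
        kept.append(action)
    return kept


incoming = [
    {"type": "message_telegram_bot", "payload": {"text": "Hi there"}},
    {"type": "tts_speak", "payload": {"text": "Hi there", "emotion": "joy"}},
]
merged = merge_text_into_tts(incoming)
```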
VoxEngineBase contract
All engines must extend plugins.vox_base.VoxEngineBase:
from abc import ABC, abstractmethod

class VoxEngineBase(ABC):
    # Required
    @abstractmethod
    def generate_tts(self, text: str, emotion: str | None = None, **kwargs) -> bytes | None: ...

    # Optional: override to report your format
    @property
    def output_format(self) -> str: return "wav"  # or "pcm"
    @property
    def sample_rate(self) -> int: return 22050
    @property
    def channels(self) -> int: return 1

    # Optional: return mouth-shape data for the renderer
    def get_lipsync_data(self, audio_bytes: bytes) -> dict | None: return None
generate_tts must return:
bytes: WAV file bytes when output_format == "wav" (i.e. RIFF header present)
bytes: raw PCM samples when output_format == "pcm"
None: on failure (triggers fallback)
VoxPlugin wraps raw PCM in a valid WAV container before writing to disk, so engines only need to return the sample data.
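Wrapping raw PCM in a WAV container can be done with the standard-library wave module, as in the sketch below. It assumes 16-bit samples and uses the contract's default sample rate and channel count; wrap_pcm_as_wav is a hypothetical helper, not the plugin's actual code.

```python
import io
import wave


def wrap_pcm_as_wav(
    pcm: bytes,
    sample_rate: int = 22050,
    channels: int = 1,
    sample_width: int = 2,  # bytes per sample; 16-bit PCM is assumed
) -> bytes:
    """Return the PCM samples prefixed with a valid RIFF/WAVE header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of 16-bit mono silence at the default sample rate.
silence = b"\x00\x00" * 22050
wav_bytes = wrap_pcm_as_wav(silence)
```

The wave module fills in the header sizes when the writer is closed, so the caller only has to supply the sample data and the three format parameters.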
Available Vox engines
| Alias | File | Notes |
|---|---|---|
| | | Legacy built-in HTTP TTS engine. Supports |
| | | Neural KittenTTS engine; requires the |
Lip-sync Integration
Lip-sync is fully centralised in VoxPlugin. Engines that can produce viseme/mouth-shape data implement VoxEngineBase.get_lipsync_data(audio_bytes):
def get_lipsync_data(self, audio_bytes: bytes) -> dict | None:
    # Return a dict compatible with the WebUI synth:lipsync event,
    # or None to skip lipsync for this utterance.
    return {"mouths": [...], "duration": 3.14}
VoxPlugin.speak() calls this method automatically after writing the audio to disk and includes the result in the dispatched synth:tts-play payload.
No interface or plugin should call a lipsync API directly — all lipsync dispatch goes through VoxPlugin.
Migration from tts_lipsync
The legacy tts_lipsync plugin is maintained only for backward compatibility and should be avoided for new deployments. To move external HTTP TTS support to the modern external endpoint flow:
Add a new endpoint in the Web UI under Settings > External Engines / External Endpoints.
Choose Protocol: custom and set Base URL to the root URI of your HTTP TTS server.
In extra_config, add {"legacy_http_tts": true} and any optional adapter settings such as tts_voice_wav or tts_endpoint_path.
Enable the vox subsystem mapping for the endpoint.
Set ACTIVE_VOX_ENGINE to the endpoint Name you created.
This registers the endpoint as a first-class Vox engine and removes the need to manage TTS_ENDPOINTS manually. The built-in http engine and legacy TTS_* keys remain available only for compatibility with existing deployments.
Live — Bidirectional Streaming
The Live subsystem handles persistent sessions where audio and text flow in both directions simultaneously (e.g. a microphone feed producing transcripts while the system synthesises speech).
Configuration: select the active engine via LIVE_CORTEX in the WebUI. The components page dropdown is populated with both
cortex engines of kind live and any external endpoints that were added
and mapped to live. The currently-selected value is highlighted and
persists across page reloads; choosing disabled turns the subsystem off.
Note: Gemini Live is only available after adding it as an external endpoint and enabling the live mapping; it is not automatically exposed by default.
LiveEngineBase contract
All engines extend plugins.live_base.LiveEngineBase:
from abc import ABC
from typing import AsyncIterator

class LiveEngineBase(ABC):
    @property
    def supports_input(self) -> bool: return False  # PCM → transcript
    @property
    def supports_output(self) -> bool: return False  # text → audio

    # Session lifecycle
    async def open_session(self, session_id: str, **kwargs) -> None: ...  # abstract
    async def close_session(self, session_id: str) -> None: ...  # abstract

    # Data channels
    async def send_audio(self, session_id: str, chunk: bytes, sample_rate: int = 16000) -> None: ...
    async def receive_events(self, session_id: str) -> AsyncIterator[LiveEvent]: ...  # abstract
    async def send_text(self, session_id: str, text: str) -> None: ...

    # Lifecycle hooks
    def setup(self) -> None: ...
    def teardown(self) -> None: ...
LiveEvent carries typed payloads:
@dataclass
class LiveEvent:
    type: LiveEventType     # TRANSCRIPT | AUDIO | VAD | ERROR
    text: str | None        # transcript text
    audio: bytes | None     # synthesised audio chunk
    is_final: bool          # True = committed transcript segment
    vad_signal: str | None  # "speech_start" | "speech_end"
    detail: str | None      # error detail or free-form annotation
    metadata: dict          # engine-specific extras
WebSocket streaming endpoint
The WebUI exposes GET /api/audio/stream as a WebSocket.
It speaks the Live registry protocol directly:
Client sends a JSON config frame: {"sample_rate": 16000, "engine": "silero"}
Client streams binary PCM frames (raw 16-bit mono).
Server replies with JSON events:
{"type": "partial", "text": "..."}: interim transcript
{"type": "final", "text": "..."}: committed transcript segment
{"type": "vad", "signal": "speech_start"|"speech_end"}: VAD markers
Binary frames: synthesised TTS audio (AUDIO events)
Available Live engines
| Alias | File | Notes |
|---|---|---|
| | | Silero VAD with async queue per session. Local, CPU-friendly. Connect a real ASR model in |
Note
Gemini Live is not exposed automatically. Add a Gemini Live-capable endpoint through the External Endpoints UI and enable the live subsystem mapping if you want a Gemini-based live engine.
Adding a Live engine
# plugins/live_engines/my_engine.py
from plugins.live_base import LiveEngineBase, LiveEvent, LiveEventType
from core.live_registry import register_live_engine
class MyLiveEngine(LiveEngineBase):
    supports_input = True
    supports_output = True

    async def open_session(self, session_id, **kwargs): ...
    async def close_session(self, session_id): ...

    async def receive_events(self, session_id):
        yield LiveEvent(type=LiveEventType.TRANSCRIPT, text="hello", is_final=True)
ENGINE_CLASS = MyLiveEngine
register_live_engine("my_engine", __name__, {"input": True, "output": True, "local": True})
Testing
uv run pytest tests/test_auris_plugin.py tests/test_vox_plugin.py tests/test_live_registry.py -v
Test suites cover:
Auris registry: register, list, find-by-capabilities, missing ENGINE_CLASS, instance caching; AurisPlugin.transcribe_audio() paths; AurisEngineBase contract (file-based only)
Vox registry: same registry coverage; VoxPlugin.speak() disabled/success/fallback paths; VoxEngineBase property defaults
Live registry: register, list, find-by-capabilities, missing ENGINE_CLASS; LiveEngineBase defaults (supports_input/output); LiveEvent type enumeration