Auris & Vox — Audio Subsystem

Added in version 2.0.

Overview

Auris (Latin for ear), Vox (Latin for voice), and Live form the three complementary cores of Synthetic Heart’s unified audio framework:

  • Auris (file-based STT): accepts a complete audio file, returns a transcript string.

  • Vox (file-based TTS): accepts a text string, returns synthesised audio bytes.

  • Live (bidirectional streaming): persistent sessions with interleaved PCM-in / transcript-out and text-in / audio-out.

The three registries follow the same plug-and-play pattern as the LLM cortex engines. New engines can be added without touching any core code.

Architecture

   ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────┐
   │  AurisRegistry  │  │  VoxRegistry    │  │  LiveRegistry    │
   │ auris_registry  │  │  vox_registry   │  │  live_registry   │
   └───────┬─────────┘  └───────┬─────────┘  └────────┬─────────┘
           │ file-based STT     │ file-based TTS       │ bidirectional
           ▼                    ▼                      ▼
┌───────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│   AurisPlugin     │ │    VoxPlugin         │ │  (webui WS /stream) │
│  transcribe_audio │ │  speak()             │ │  LiveRegistry.load  │
│  stt_transcribe   │ │  tts_speak           │ │  open/send/receive  │
└─────────┬─────────┘ └──────────┬───────────┘ └──────────┬──────────┘
          │                      │                         │
 ┌────────┴────────┐   ┌─────────┴──────────┐   ┌─────────┴──────────┐
 │  Auris Engines  │   │   Vox Engines      │   │   Live Engines     │
 │  gemini.py      │   │   http.py          │   │   silero.py (VAD)  │
 │                 │   │   kitten.py        │   │   gemini.py (stub) │
 └─────────────────┘   └────────────────────┘   └────────────────────┘

Registry Pattern

All three registries follow the same conventions as core/cortex_registry.py.
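
The conventions can be illustrated with a minimal sketch. It is not the code in core/cortex_registry.py or any of the audio registries: the class name EngineRegistry and its method names are hypothetical, and only the ideas (name to module path mapping, lazy import, ENGINE_CLASS lookup, capability filtering, instance caching) are taken from this page.

```python
import importlib
from typing import Any


class EngineRegistry:
    """Hypothetical stand-in for a registry such as core/auris_registry.py."""

    def __init__(self) -> None:
        # name -> (module path, capabilities dict)
        self._entries: dict[str, tuple[str, dict[str, Any]]] = {}
        # name -> cached engine instance
        self._instances: dict[str, Any] = {}

    def register(self, name: str, module_path: str, capabilities: dict[str, Any]) -> None:
        self._entries[name] = (module_path, dict(capabilities))

    def list_engines(self) -> list[str]:
        return sorted(self._entries)

    def find_by_capabilities(self, **required: Any) -> list[str]:
        # An engine matches when every requested capability has the requested value.
        return sorted(
            name
            for name, (_, caps) in self._entries.items()
            if all(caps.get(key) == value for key, value in required.items())
        )

    def load(self, name: str) -> Any:
        # Return the cached instance if the engine was loaded before.
        if name in self._instances:
            return self._instances[name]
        module_path, _ = self._entries[name]
        module = importlib.import_module(module_path)  # imported only on first use
        engine_class = getattr(module, "ENGINE_CLASS", None)
        if engine_class is None:
            raise LookupError(f"{module_path} does not define ENGINE_CLASS")
        instance = engine_class()
        self._instances[name] = instance
        return instance
```

Because the module is imported only inside load(), registering an engine costs nothing until it is selected, which is what makes adding a new engine file possible without touching core code.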

Registering an engine

An Auris engine module must define:

  • ENGINE_CLASS — the class that extends AurisEngineBase

  • Optional call to register_auris_engine() (the class auto-registers via the base CAPABILITIES dict)

# plugins/auris_engines/my_engine.py
from plugins.auris_base import AurisEngineBase
from core.auris_registry import register_auris_engine


class MyAurisEngine(AurisEngineBase):
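    def transcribe(self, file_path: str, mime_type: str | None = None) -> str | None:
        # ... call your STT service and store its result here ...
        transcribed_text: str | None = None  # placeholder until a backend is wired in
        return transcribed_text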


ENGINE_CLASS = MyAurisEngine

register_auris_engine(
    "my_engine",
    "plugins.auris_engines.my_engine",
    {"file_based": True, "local": True},
)

A Vox engine follows the same pattern:

# plugins/vox_engines/my_engine.py
from plugins.vox_base import VoxEngineBase
from core.vox_registry import register_vox_engine


class MyVoxEngine(VoxEngineBase):
    def generate_tts(self, text: str, emotion: str | None = None, **kwargs) -> bytes | None:
        # ... synthesise audio, return WAV or raw PCM bytes ...
        audio_bytes: bytes | None = None  # placeholder until a backend is wired in
        return audio_bytes


ENGINE_CLASS = MyVoxEngine

register_vox_engine(
    "my_engine",
    "plugins.vox_engines.my_engine",
    {"streaming": False, "local": True, "voice_cloning": False, "emotions": True},
)

Auris — STT Subsystem

Plugin: auris_plugin

Vosk-specific configuration

When the auris_vosk engine is selected, two additional variables control language handling:

  • VOSK_LANGUAGE – language code for the Vosk model ("en-us", "it", "fr", etc.). Starting in version 2.?, this defaults to "auto" instead of English. In auto mode, the first few seconds of audio are probed by a Whisper‑tiny model (via the optional faster-whisper package) to identify the spoken language. If detection fails or the dependency is missing, the first downloaded Vosk model is used as a fallback. Explicitly setting VOSK_LANGUAGE overrides auto‑detection.

  • VOSK_LID_CONFIDENCE – when Whisper LID is used, the minimum probability (0–1) required to accept the detected language. Defaults to 0.5. If the confidence is below this threshold the fallback path is taken.
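
The decision rules for these two variables can be sketched as a small pure function. This is an illustration under stated assumptions, not the engine's code: the function name, the detected tuple, and the fallback parameter (standing in for "the first downloaded Vosk model") are hypothetical, and mapping a Whisper language code onto a Vosk model name is left out.

```python
from typing import Optional, Tuple


def resolve_vosk_language(
    configured: str,
    detected: Optional[Tuple[str, float]],
    min_confidence: float = 0.5,
    fallback: str = "en-us",
) -> str:
    """Pick a language following the VOSK_LANGUAGE / VOSK_LID_CONFIDENCE rules.

    configured     -- value of VOSK_LANGUAGE ("auto" or an explicit code)
    detected       -- (language, probability) from language identification,
                      or None if detection failed or the dependency is missing
    min_confidence -- value of VOSK_LID_CONFIDENCE
    fallback       -- hypothetical stand-in for the first downloaded model
    """
    # An explicit setting always overrides auto-detection.
    if configured != "auto":
        return configured
    # No detection result: take the fallback path.
    if detected is None:
        return fallback
    language, probability = detected
    # Below the threshold: take the fallback path.
    if probability < min_confidence:
        return fallback
    return language
```

For example, resolve_vosk_language("auto", ("fr", 0.3)) takes the fallback path, while the same call with a probability of 0.9 returns "fr".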

The AurisPlugin is the single authoritative entry-point for all transcription. Other plugins, interfaces, and cortex engines must not call STT backends directly — they should always call AurisPlugin.transcribe_audio().

Configuration (all WebUI-configurable):

Variable               Default   Description
---------------------  --------  -----------
ACTIVE_AURIS_ENGINE    disabled  Name of the active engine (e.g. gemini or vosk). Set to disabled to disable the Auris subsystem.
AURIS_ENGINE_SETTINGS  {}        JSON string of engine-specific settings forwarded at load time.
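
Since AURIS_ENGINE_SETTINGS is stored as a JSON string, it has to be decoded before it can be forwarded to an engine. The helper below is a hypothetical sketch of a tolerant decoder; how the real loader treats malformed or non-object values is not documented here, so falling back to an empty dict is an assumption.

```python
import json
from typing import Any, Optional


def parse_engine_settings(raw: Optional[str]) -> dict[str, Any]:
    """Decode a settings string such as AURIS_ENGINE_SETTINGS into a dict."""
    if not raw or not raw.strip():
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: invalid JSON is treated like "no settings".
        return {}
    # Only a JSON object makes sense as keyword-style settings.
    return parsed if isinstance(parsed, dict) else {}
```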

Public API:

# From any interface or plugin:
from core.core_initializer import PLUGIN_REGISTRY

auris = PLUGIN_REGISTRY.get("auris_plugin")
if auris:
    text = await auris.transcribe_audio(
        "/path/to/audio.ogg",
        mime_type="audio/ogg",        # optional MIME hint
        engine_name="gemini",         # optional per-call override
    )

LLM action stt_transcribe:

{
  "type": "stt_transcribe",
  "payload": {
    "audio_path": "/tmp/live_io/in_123.oga",
    "mime_type": "audio/ogg",
    "engine": "gemini"
  }
}

AurisEngineBase contract

All engines must extend plugins.auris_base.AurisEngineBase. Auris engines are file-based only — for real-time / bidirectional streaming use LiveEngineBase instead.

class AurisEngineBase(ABC):
    # Required
    @abstractmethod
    def transcribe(self, file_path: str, mime_type: str | None = None) -> str | None: ...

    # Lifecycle hooks
    def setup(self) -> None: ...
    def teardown(self) -> None: ...
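
Note that transcribe() is synchronous while AurisPlugin.transcribe_audio() is awaited. One way to bridge the two, shown here as a sketch and not as the plugin's actual implementation, is to run the blocking call in a worker thread. The base class below is a local stand-in that mirrors the contract above; FixedTextEngine and transcribe_in_thread are hypothetical names.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class AurisEngineBase(ABC):
    """Local stand-in mirroring the contract shown above."""

    @abstractmethod
    def transcribe(self, file_path: str, mime_type: Optional[str] = None) -> Optional[str]: ...

    def setup(self) -> None: ...
    def teardown(self) -> None: ...


class FixedTextEngine(AurisEngineBase):
    """Hypothetical engine that returns a canned transcript."""

    def transcribe(self, file_path: str, mime_type: Optional[str] = None) -> Optional[str]:
        return f"transcript of {file_path}"


async def transcribe_in_thread(
    engine: AurisEngineBase, file_path: str, mime_type: Optional[str] = None
) -> Optional[str]:
    # asyncio.to_thread keeps the event loop responsive while the
    # blocking transcribe() call runs in a worker thread.
    return await asyncio.to_thread(engine.transcribe, file_path, mime_type)
```

Usage: asyncio.run(transcribe_in_thread(FixedTextEngine(), "a.ogg")) returns the canned transcript, and instantiating AurisEngineBase directly raises TypeError because transcribe is abstract.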

Available Auris engines

Most Auris capabilities are now configured through the External Endpoints UI. Add a provider as an endpoint and enable the auris mapping when the endpoint supports STT.

The only built-in Auris engine currently shipped by default is:

Alias  File                                  Notes
-----  ------------------------------------  -----
vosk   plugins/auris_engines/vosk_engine.py  Local speech recognition engine. File-based, offline, and suitable for self-hosted deployments.

Note

Silero has moved to the Live registry (live_engines/silero.py). Importing auris_engines/silero will emit a deprecation warning and register nothing.

Vox — TTS & Lip-sync Subsystem

Plugin: vox_plugin

Text language detection

The core plugin detects the language of the input text using lingua-language-detector (a much more accurate replacement for the detector used previously). The detected ISO-639-1 code ("en", "it", etc.) is logged and forwarded to the active Vox engine via a language keyword argument. Engines can optionally use this hint to select an appropriate voice or model. lingua is a required dependency; if detection confidence is below the internal threshold, no hint is supplied and synthesis proceeds without one.

VoxPlugin owns the entire TTS pipeline:

  1. Text cleaning (emoji removal, whitespace normalisation)

  2. Engine selection and audio generation

  3. WAV/PCM file writing to VOX_OUTPUT_DIR

  4. Optional lip-sync data extraction (engine.get_lipsync_data())

  5. Audio dispatch to the originating interface (WebUI synth:tts-play, Discord, Telegram)

  6. Text fallback when the engine fails

No interface or other plugin should handle TTS audio files or dispatch lip-sync events directly.
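
Step 1 of the pipeline (text cleaning) can be sketched as follows. The function name and the emoji code-point ranges are assumptions chosen for illustration; the plugin's actual cleaning rules may cover more or fewer characters.

```python
import re

# Assumed emoji ranges; a real implementation may cover more code points.
_EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)


def clean_tts_text(text: str) -> str:
    """Remove emoji, then collapse every whitespace run into one space."""
    without_emoji = _EMOJI_RE.sub(" ", text)
    # str.split() with no arguments splits on any whitespace and drops
    # leading/trailing runs, so joining gives normalised spacing.
    return " ".join(without_emoji.split())
```

Replacing emoji with a space before normalising avoids gluing together the words on either side of a removed character.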

Configuration (all WebUI-configurable):

Variable              Default  Description
--------------------  -------  -----------
ACTIVE_VOX_ENGINE     http     Name of the active TTS engine. Set to disabled to disable the Vox subsystem.
VOX_ENGINE_SETTINGS   {}       JSON string forwarded to the engine at load time.
VOX_OUTPUT_DIR        tmp_tts  Directory where generated audio files are written.
VOX_TIMEOUT_SECONDS   10       Maximum seconds to wait for a TTS engine response.
VOX_FALLBACK_TO_TEXT  true     When true, sends a plain-text message if TTS generation fails.
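
How VOX_TIMEOUT_SECONDS and VOX_FALLBACK_TO_TEXT might interact is sketched below. This is an assumption-laden illustration, not VoxPlugin's code: speak_with_fallback, the generate callable, and the "audio" and "fallback_text" result keys are hypothetical; only the "success" and "error" status values appear in this document.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def speak_with_fallback(
    generate: Callable[[str], Awaitable[Optional[bytes]]],
    text: str,
    timeout_seconds: float = 10.0,   # mirrors VOX_TIMEOUT_SECONDS
    fallback_to_text: bool = True,   # mirrors VOX_FALLBACK_TO_TEXT
) -> dict[str, Any]:
    try:
        # Give up on the engine once the timeout elapses.
        audio = await asyncio.wait_for(generate(text), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        audio = None
    if audio is not None:
        return {"status": "success", "audio": audio}
    if fallback_to_text:
        # Hypothetical key: the caller would send this as a plain-text message.
        return {"status": "error", "fallback_text": text}
    return {"status": "error"}
```

A timeout and an engine that returns None are treated the same way here, so the text fallback covers both failure modes.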

Note

Legacy TTS_* config keys (TTS_ENABLED, TTS_ENDPOINTS, TTS_TIMEOUT_SECONDS, TTS_OUTPUT_DIR) are still supported by the built-in http Vox engine for backward compatibility, but the preferred configuration path for new deployments is to register external HTTP TTS servers through the External Endpoints system and map them to vox.

WebUI helper endpoints

Two read-only HTTP endpoints support voice selection and sample playback in the web interface. Engines may implement them if they expose multiple speakers or wish to supply short example clips.

  • GET /api/vox/speakers?engine=<name> returns a JSON array of speaker metadata for the specified engine (the configured ACTIVE_VOX_ENGINE is used when the query parameter is omitted). The format is engine-specific; kitten returns [{"code": "en_1", "name": "English Female 1", "language": "en"}, …]. If the engine is unknown a 404 is returned.

  • GET /api/vox/sample?engine=<name>&speaker=<code> streams a short WAV file for the given speaker. Engines that cannot provide samples should raise NotImplementedError which results in a 404. A missing speaker parameter produces a 400 error.

These helpers are used internally by res/synth_webui/js/main.js to populate the Kitten voice selector and play sample audio.
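
For scripts or tests that call the same endpoints outside the browser, the request URLs can be built as in this sketch. The helper names and the base URL are hypothetical; the paths and query parameters are the ones listed above.

```python
from typing import Optional
from urllib.parse import urlencode


def speakers_url(base_url: str, engine: Optional[str] = None) -> str:
    """URL for GET /api/vox/speakers; engine is optional."""
    url = f"{base_url.rstrip('/')}/api/vox/speakers"
    # Without the parameter the server uses the configured ACTIVE_VOX_ENGINE.
    return f"{url}?{urlencode({'engine': engine})}" if engine else url


def sample_url(base_url: str, speaker: str, engine: Optional[str] = None) -> str:
    """URL for GET /api/vox/sample; speaker is mandatory."""
    if not speaker:
        # The server would answer 400 for a missing speaker, so fail early.
        raise ValueError("speaker is required")
    params = {}
    if engine:
        params["engine"] = engine
    params["speaker"] = speaker
    return f"{base_url.rstrip('/')}/api/vox/sample?{urlencode(params)}"
```

urlencode takes care of escaping, so speaker codes containing spaces or other reserved characters stay valid in the query string.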

Public API:

from core.core_initializer import PLUGIN_REGISTRY

vox = PLUGIN_REGISTRY.get("vox_plugin")
if vox:
    result = await vox.speak(
        "Hello, world!",
        interface_path="synth_webui/session_abc",
        emotion="joy",
        engine_name="http",           # optional per-call override
        merged_text="Hello, world!",  # plain-text fallback caption
    )
    # result = {"status": "success"|"skipped"|"error", "filename": ..., ...}

LLM action tts_speak:

{
  "type": "tts_speak",
  "payload": {
    "text": "Hello, world!",
    "emotion": "joy"
  }
}

When tts_speak is delivered alongside a standard message action (for example the LLM returns both a message_telegram_bot and a tts_speak), message_chain will automatically merge the text payload into the TTS action as __merged_text. This ensures that users receive a single audio message with a caption and prevents the duplicate text reply that would otherwise occur. The merged_text field is also used as a fallback caption when the TTS engine fails.
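
The merge can be pictured with the following sketch. It is not the message_chain implementation: the function name is hypothetical, and it assumes that message actions have a type starting with "message_" (as in message_telegram_bot) and carry their text in payload["text"].

```python
from typing import Any


def merge_text_into_tts(actions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Fold message text into tts_speak actions and drop the duplicate messages."""
    if not any(a.get("type") == "tts_speak" for a in actions):
        return list(actions)  # nothing to merge into

    captions: list[str] = []
    kept: list[dict[str, Any]] = []
    for action in actions:
        # Assumption: message actions are recognisable by their type prefix.
        if str(action.get("type", "")).startswith("message_"):
            captions.append(str(action.get("payload", {}).get("text", "")))
        else:
            kept.append(action)
    if not captions:
        return kept

    caption = "\n".join(captions)
    merged: list[dict[str, Any]] = []
    for action in kept:
        if action.get("type") == "tts_speak":
            # Copy instead of mutating the caller's action.
            payload = {**action.get("payload", {}), "__merged_text": caption}
            action = {**action, "payload": payload}
        merged.append(action)
    return merged
```

Given one message_telegram_bot action and one tts_speak action, the result is a single tts_speak action whose payload carries the message text under __merged_text.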

VoxEngineBase contract

All engines must extend plugins.vox_base.VoxEngineBase:

class VoxEngineBase(ABC):
    # Required
    @abstractmethod
    def generate_tts(self, text: str, emotion: str | None = None, **kwargs) -> bytes | None: ...

    # Optional — override to report your format
    @property
    def output_format(self) -> str: return "wav"   # or "pcm"
    @property
    def sample_rate(self) -> int: return 22050
    @property
    def channels(self) -> int: return 1

    # Optional — return mouth-shape data for the renderer
    def get_lipsync_data(self, audio_bytes: bytes) -> dict | None: return None

generate_tts must return:

  • bytes — WAV file bytes when output_format == "wav" (i.e. RIFF header present)

  • bytes — raw PCM samples when output_format == "pcm"

  • None — on failure (triggers fallback)

VoxPlugin wraps raw PCM in a valid WAV container before writing to disk, so engines only need to return the sample data.
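
The wrapping step can be reproduced with the standard library's wave module. The sketch below is illustrative only: ensure_wav is a hypothetical name, and a sample width of 16 bits is an assumption, since the contract above exposes sample_rate and channels but not the sample width.

```python
import io
import wave


def ensure_wav(
    audio: bytes,
    output_format: str,
    sample_rate: int = 22050,
    channels: int = 1,
    sample_width_bytes: int = 2,  # assumption: 16-bit samples
) -> bytes:
    """Return WAV bytes, adding a RIFF container around raw PCM if needed."""
    if output_format == "wav":
        return audio  # already has a RIFF header
    if output_format != "pcm":
        raise ValueError(f"unsupported output_format: {output_format!r}")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width_bytes)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(audio)
    # The header sizes are finalised when the wave writer closes;
    # the BytesIO buffer itself stays open and readable.
    return buffer.getvalue()
```

An engine that reports output_format == "pcm" together with accurate sample_rate and channels values therefore gives the plugin everything it needs to produce a playable file.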

Available Vox engines

Alias   File                   Notes
------  ---------------------  -----
http    vox_engines/http.py    Legacy built-in HTTP TTS engine. Supports TTS_ENDPOINTS compatibility and is intended for backward-compatible deployments only. For new external HTTP TTS integrations, prefer adding a custom external endpoint and mapping it to vox.
kitten  vox_engines/kitten.py  Neural KittenTTS engine; requires the kittentts package or uses the vendored shim (vendor/kittentts). The shim is lightweight and lazily imports gtts/pydub – those two libraries are declared as normal project dependencies and must be installed (uv add gtts pydub) for audio output to work. Without them the engine will fall back to text and log an informative error. Produces higher-quality audio than the legacy system-voice implementation.

Lip-sync Integration

Lip-sync is fully centralised in VoxPlugin. Engines that can produce viseme/mouth-shape data implement VoxEngineBase.get_lipsync_data(audio_bytes):

def get_lipsync_data(self, audio_bytes: bytes) -> dict | None:
    # Return a dict compatible with the WebUI synth:lipsync event,
    # or None to skip lipsync for this utterance.
    return {"mouths": [...], "duration": 3.14}

VoxPlugin.speak() calls this method automatically after writing the audio to disk and includes the result in the dispatched synth:tts-play payload.

No interface or plugin should call a lipsync API directly — all lipsync dispatch goes through VoxPlugin.

Migration from tts_lipsync

The legacy tts_lipsync plugin is maintained only for backward compatibility and should be avoided for new deployments. To move external HTTP TTS support to the modern external endpoint flow:

  1. Add a new endpoint in the Web UI under Settings > External Engines / External Endpoints.

  2. Choose Protocol: custom and set Base URL to the root URI of your HTTP TTS server.

  3. In extra_config, add {"legacy_http_tts": true} and any optional adapter settings such as tts_voice_wav or tts_endpoint_path.

  4. Enable the vox subsystem mapping for the endpoint.

  5. Set ACTIVE_VOX_ENGINE to the endpoint Name you created.

This registers the endpoint as a first-class Vox engine and removes the need to manage TTS_ENDPOINTS manually. The built-in http engine and legacy TTS_* keys remain available only for compatibility with existing deployments.

Live — Bidirectional Streaming

The Live subsystem handles persistent sessions where audio and text flow in both directions simultaneously (e.g. a microphone feed producing transcripts while the system synthesises speech).

Configuration: select the active engine via LIVE_CORTEX in the WebUI. The components page dropdown is populated with both cortex engines of kind live and any external endpoints that were added and mapped to live. The currently-selected value is highlighted and persists across page reloads; choosing disabled turns the subsystem off.

Note: Gemini Live is only available after adding it as an external endpoint and enabling the live mapping; it is not automatically exposed by default.

LiveEngineBase contract

All engines extend plugins.live_base.LiveEngineBase:

class LiveEngineBase(ABC):
    @property
    def supports_input(self) -> bool: return False   # PCM → transcript
    @property
    def supports_output(self) -> bool: return False  # text → audio

    # Session lifecycle
    async def open_session(self, session_id: str, **kwargs) -> None: ...  # abstract
    async def close_session(self, session_id: str) -> None: ...           # abstract

    # Data channels
    async def send_audio(self, session_id: str, chunk: bytes, sample_rate: int = 16000) -> None: ...
    async def receive_events(self, session_id: str) -> AsyncIterator[LiveEvent]: ...  # abstract
    async def send_text(self, session_id: str, text: str) -> None: ...

    # Lifecycle hooks
    def setup(self) -> None: ...
    def teardown(self) -> None: ...

LiveEvent carries typed payloads:

@dataclass
class LiveEvent:
    type: LiveEventType          # TRANSCRIPT | AUDIO | VAD | ERROR
    text: str | None             # transcript text
    audio: bytes | None          # synthesised audio chunk
    is_final: bool               # True = committed transcript segment
    vad_signal: str | None       # "speech_start" | "speech_end"
    detail: str | None           # error detail or free-form annotation
    metadata: dict               # engine-specific extras

WebSocket streaming endpoint

The WebUI exposes GET /api/audio/stream as a WebSocket. It speaks the Live registry protocol directly:

  1. Client sends a JSON config frame: {"sample_rate": 16000, "engine": "silero"}

  2. Client streams binary PCM frames (raw 16-bit mono).

  3. Server replies with JSON events:

    • {"type": "partial", "text": "..."} — interim transcript

    • {"type": "final", "text": "..."} — committed transcript segment

    • {"type": "vad", "signal": "speech_start"|"speech_end"} — VAD markers

    • Binary frames — synthesised TTS audio (AUDIO events)
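
The mapping from LiveEvent objects to the frames listed above could look like the sketch below. The dataclass and enum are local stand-ins: the enum values and field defaults are assumptions, and event_to_frame is a hypothetical helper, not the WebUI handler. No frame is documented for ERROR events, so the sketch drops them.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Union


class LiveEventType(Enum):  # stand-in; the values are assumptions
    TRANSCRIPT = "transcript"
    AUDIO = "audio"
    VAD = "vad"
    ERROR = "error"


@dataclass
class LiveEvent:  # stand-in with assumed defaults
    type: LiveEventType
    text: Optional[str] = None
    audio: Optional[bytes] = None
    is_final: bool = False
    vad_signal: Optional[str] = None
    detail: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def event_to_frame(event: LiveEvent) -> Union[str, bytes, None]:
    """Return a JSON text frame, a binary frame, or None when nothing is sent."""
    if event.type is LiveEventType.TRANSCRIPT:
        # is_final separates committed segments from interim text.
        kind = "final" if event.is_final else "partial"
        return json.dumps({"type": kind, "text": event.text or ""})
    if event.type is LiveEventType.VAD:
        return json.dumps({"type": "vad", "signal": event.vad_signal})
    if event.type is LiveEventType.AUDIO:
        return event.audio  # sent as a binary frame
    return None  # ERROR: no frame format is documented
```

A str result would go out as a text frame and a bytes result as a binary frame, which matches the split between JSON events and synthesised audio in the protocol.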

Available Live engines

Alias   File                    Notes
------  ----------------------  -----
silero  live_engines/silero.py  Silero VAD with async queue per session. Local, CPU-friendly. Connect a real ASR model in _transcribe_segment.

Note

Gemini Live is not exposed automatically. Add a Gemini Live-capable endpoint through the External Endpoints UI and enable the live subsystem mapping if you want a Gemini-based live engine.

Adding a Live engine

# plugins/live_engines/my_engine.py
from plugins.live_base import LiveEngineBase, LiveEvent, LiveEventType
from core.live_registry import register_live_engine

class MyLiveEngine(LiveEngineBase):
    supports_input = True
    supports_output = True

    async def open_session(self, session_id, **kwargs): ...
    async def close_session(self, session_id): ...
    async def receive_events(self, session_id):
        yield LiveEvent(type=LiveEventType.TRANSCRIPT, text="hello", is_final=True)

ENGINE_CLASS = MyLiveEngine
register_live_engine("my_engine", __name__, {"input": True, "output": True, "local": True})
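
To show how a caller might drive such an engine, here is a self-contained sketch of the session lifecycle. StubLiveEngine and collect_final_transcripts are hypothetical; the stub only mimics the method names of LiveEngineBase and emits one finite batch of events, whereas a real engine would typically stream events while audio is still arriving.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator, Iterable


class StubLiveEngine:
    """Hypothetical stand-in that mimics the LiveEngineBase method names."""

    def __init__(self) -> None:
        self._buffers: dict[str, list[bytes]] = {}

    async def open_session(self, session_id: str, **kwargs) -> None:
        self._buffers[session_id] = []

    async def close_session(self, session_id: str) -> None:
        self._buffers.pop(session_id, None)

    async def send_audio(self, session_id: str, chunk: bytes, sample_rate: int = 16000) -> None:
        self._buffers[session_id].append(chunk)

    async def receive_events(self, session_id: str) -> AsyncIterator[SimpleNamespace]:
        # One fake "final" transcript per received chunk.
        for index, chunk in enumerate(self._buffers[session_id]):
            yield SimpleNamespace(text=f"segment {index} ({len(chunk)} bytes)", is_final=True)


async def collect_final_transcripts(engine, session_id: str, chunks: Iterable[bytes]) -> list[str]:
    await engine.open_session(session_id)
    try:
        for chunk in chunks:
            await engine.send_audio(session_id, chunk, sample_rate=16000)
        finals = []
        async for event in engine.receive_events(session_id):
            if event.is_final and event.text:
                finals.append(event.text)
        return finals
    finally:
        # Always release the session, even if sending or receiving fails.
        await engine.close_session(session_id)
```

The try/finally around the session is the part worth copying: close_session runs on every exit path, so a failed stream does not leak per-session state.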

Testing

uv run pytest tests/test_auris_plugin.py tests/test_vox_plugin.py tests/test_live_registry.py -v

Test suites cover:

  • Auris registry: register, list, find-by-capabilities, missing ENGINE_CLASS, instance caching; AurisPlugin.transcribe_audio() paths; AurisEngineBase contract (file-based only)

  • Vox registry: same registry coverage; VoxPlugin.speak() disabled/success/fallback paths; VoxEngineBase property defaults

  • Live registry: register, list, find-by-capabilities, missing ENGINE_CLASS; LiveEngineBase defaults (supports_input/output); LiveEvent type enumeration