Auris & Vox — Audio Subsystem
==============================

.. versionadded:: 2.0

Overview
--------

**Auris** (Latin for *ear*), **Vox** (Latin for *voice*), and **Live** form the
three complementary cores of Synthetic Heart's unified audio framework:

* **Auris** — *file-based STT*: accepts a complete audio file, returns a transcript string.
* **Vox** — *file-based TTS*: accepts a text string, returns synthesised audio bytes.
* **Live** — *bidirectional streaming*: persistent sessions with interleaved
  PCM-in / transcript-out and text-in / audio-out.

The three registries follow the same plug-and-play pattern as the LLM cortex
engines. New engines can be added without touching any core code.

Architecture
------------

.. code-block:: text

   ┌───────────────────┐  ┌───────────────────┐  ┌────────────────────┐
   │ AurisRegistry     │  │ VoxRegistry       │  │ LiveRegistry       │
   │ auris_registry    │  │ vox_registry      │  │ live_registry      │
   └─────────┬─────────┘  └─────────┬─────────┘  └─────────┬──────────┘
             │ file-based STT       │ file-based TTS       │ bidirectional
             ▼                      ▼                      ▼
   ┌───────────────────┐  ┌───────────────────┐  ┌────────────────────┐
   │ AurisPlugin       │  │ VoxPlugin         │  │ (webui WS /stream) │
   │ transcribe_audio  │  │ speak()           │  │ LiveRegistry.load  │
   │ stt_transcribe    │  │ tts_speak         │  │ open/send/receive  │
   └─────────┬─────────┘  └─────────┬─────────┘  └─────────┬──────────┘
             │                      │                      │
   ┌─────────┴─────────┐  ┌─────────┴─────────┐  ┌─────────┴──────────┐
   │ Auris Engines     │  │ Vox Engines       │  │ Live Engines       │
   │ gemini.py         │  │ http.py           │  │ silero.py (VAD)    │
   │                   │  │ kitten.py         │  │ gemini.py (stub)   │
   └───────────────────┘  └───────────────────┘  └────────────────────┘

Registry Pattern
----------------

All three registries follow the same conventions as ``core/cortex_registry.py``.

Registering an engine
~~~~~~~~~~~~~~~~~~~~~

An Auris engine module must define:

* ``ENGINE_CLASS`` — the class that extends ``AurisEngineBase``
* Optionally, a call to ``register_auris_engine()`` (the class auto-registers
  via the base ``CAPABILITIES`` dict)

.. code-block:: python

   # plugins/auris_engines/my_engine.py
   from plugins.auris_base import AurisEngineBase
   from core.auris_registry import register_auris_engine


   class MyAurisEngine(AurisEngineBase):
       def transcribe(self, file_path: str, mime_type: str | None = None) -> str | None:
           # ... call your STT service ...
           transcribed_text: str | None = None  # placeholder: replace with the STT result
           return transcribed_text


   ENGINE_CLASS = MyAurisEngine

   register_auris_engine(
       "my_engine",
       "plugins.auris_engines.my_engine",
       {"file_based": True, "local": True},
   )

A Vox engine follows the same pattern:

.. code-block:: python

   # plugins/vox_engines/my_engine.py
   from plugins.vox_base import VoxEngineBase
   from core.vox_registry import register_vox_engine


   class MyVoxEngine(VoxEngineBase):
       def generate_tts(self, text: str, emotion: str | None = None, **kwargs) -> bytes | None:
           # ... synthesise audio, return WAV or raw PCM bytes ...
           audio_bytes: bytes | None = None  # placeholder: replace with the synthesised audio
           return audio_bytes


   ENGINE_CLASS = MyVoxEngine

   register_vox_engine(
       "my_engine",
       "plugins.vox_engines.my_engine",
       {"streaming": False, "local": True, "voice_cloning": False, "emotions": True},
   )

Auris — STT Subsystem
----------------------

Plugin: ``auris_plugin``
~~~~~~~~~~~~~~~~~~~~~~~~

**Vosk-specific configuration**

When the ``auris_vosk`` engine is selected, two additional variables control
language handling:

* ``VOSK_LANGUAGE`` – language code for the Vosk model (``"en-us"``, ``"it"``,
  ``"fr"``, etc.). Starting in version 2.?, this defaults to ``"auto"`` instead
  of English. In ``auto`` mode the first few seconds of audio are probed by a
  Whisper‑tiny model (via the optional ``faster-whisper`` package) to identify
  the spoken language. If detection fails or the dependency is missing, the
  first downloaded Vosk model is used as a fallback. Explicitly setting
  ``VOSK_LANGUAGE`` overrides auto‑detection.
* ``VOSK_LID_CONFIDENCE`` – when Whisper LID is used, the minimum probability
  (0–1) required to accept the detected language. Defaults to ``0.5``. If the
  confidence is below this threshold, the fallback path is taken.
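The selection rules described by ``VOSK_LANGUAGE`` and ``VOSK_LID_CONFIDENCE``
can be sketched as a small pure function. This is only an illustration of the
documented behaviour, not the engine's actual code: the function name, its
parameters, and the idea of passing the downloaded models as a list are all
assumptions, and mapping a detector code such as ``"en"`` to a Vosk model name
such as ``"en-us"`` is deliberately left out.

.. code-block:: python

   from typing import Optional, Sequence


   def choose_vosk_language(
       configured: str,
       detected: Optional[str],
       confidence: float,
       downloaded_models: Sequence[str],
       min_confidence: float = 0.5,
   ) -> Optional[str]:
       """Pick a language code following the rules above (hypothetical helper)."""
       # An explicit VOSK_LANGUAGE always overrides auto-detection.
       if configured != "auto":
           return configured
       # Accept the detected language only at or above the threshold.
       if detected is not None and confidence >= min_confidence:
           return detected
       # Fallback path: the first downloaded model, or None when there is none.
       return downloaded_models[0] if downloaded_models else None


   print(choose_vosk_language("it", "en", 0.9, ["en-us"]))    # explicit setting wins
   print(choose_vosk_language("auto", "fr", 0.8, ["en-us"]))  # detection accepted
   print(choose_vosk_language("auto", "fr", 0.3, ["en-us"]))  # below threshold: fallback

Keeping the decision in one side-effect-free function like this makes the
threshold and fallback behaviour easy to unit-test without loading any model.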
The ``AurisPlugin`` is the single authoritative entry-point for all
transcription. Other plugins, interfaces, and cortex engines **must not** call
STT backends directly; they should always call ``AurisPlugin.transcribe_audio()``.

Configuration (all WebUI-configurable):

.. list-table::
   :header-rows: 1
   :widths: 30 10 60

   * - Variable
     - Default
     - Description
   * - ``ACTIVE_AURIS_ENGINE``
     - ``disabled``
     - Name of the active engine (e.g. ``gemini`` or ``vosk``). Set to
       ``disabled`` to disable the Auris subsystem.
   * - ``AURIS_ENGINE_SETTINGS``
     - ``{}``
     - JSON string of engine-specific settings forwarded at load time.

Public API:

.. code-block:: python

   # From any interface or plugin:
   from core.core_initializer import PLUGIN_REGISTRY

   auris = PLUGIN_REGISTRY.get("auris_plugin")
   if auris:
       text = await auris.transcribe_audio(
           "/path/to/audio.ogg",
           mime_type="audio/ogg",  # optional MIME hint
           engine_name="gemini",   # optional per-call override
       )

LLM action ``stt_transcribe``:

.. code-block:: json

   {
     "type": "stt_transcribe",
     "payload": {
       "audio_path": "/tmp/live_io/in_123.oga",
       "mime_type": "audio/ogg",
       "engine": "gemini"
     }
   }

``AurisEngineBase`` contract
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All engines must extend ``plugins.auris_base.AurisEngineBase``. Auris engines
are **file-based only**; for real-time / bidirectional streaming use
``LiveEngineBase`` instead.

.. code-block:: python

   class AurisEngineBase(ABC):
       # Required
       @abstractmethod
       def transcribe(self, file_path: str, mime_type: str | None = None) -> str | None: ...

       # Lifecycle hooks
       def setup(self) -> None: ...
       def teardown(self) -> None: ...

Available Auris engines
~~~~~~~~~~~~~~~~~~~~~~~

Most Auris capabilities are now configured through the External Endpoints UI.
Add a provider as an endpoint and enable the ``auris`` mapping when the endpoint
supports STT.

The only built-in Auris engine currently shipped by default is:

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Alias
     - File
     - Notes
   * - ``vosk``
     - ``plugins/auris_engines/vosk_engine.py``
     - Local speech recognition engine. File-based, offline, and suitable for
       self-hosted deployments.

.. note::

   **Silero** has moved to the Live registry (``live_engines/silero.py``).
   Importing ``auris_engines/silero`` will emit a deprecation warning and
   register nothing.

Vox — TTS & Lip-sync Subsystem
---------------------------------

Plugin: ``vox_plugin``
~~~~~~~~~~~~~~~~~~~~~~

*Text language detection*

The core plugin performs language detection on the input text using
``lingua-language-detector`` (a more accurate replacement for the detector used
previously). The detected ISO-639-1 code (``"en"``, ``"it"``, etc.) is logged
and forwarded to the active Vox engine via a ``language`` keyword argument.
Engines can optionally use this hint to select an appropriate voice or model.
``lingua`` is a required dependency; if detection confidence is below the
internal threshold, no hint is supplied and the plugin continues as before.

``VoxPlugin`` owns the **entire** TTS pipeline:

1. Text cleaning (emoji removal, whitespace normalisation)
2. Engine selection and audio generation
3. WAV/PCM file writing to ``VOX_OUTPUT_DIR``
4. Optional lip-sync data extraction (``engine.get_lipsync_data()``)
5. Audio dispatch to the originating interface (WebUI ``synth:tts-play``,
   Discord, Telegram)
6. Text fallback when the engine fails

No interface or other plugin should handle TTS audio files or dispatch lip-sync
events directly.

Configuration (all WebUI-configurable):

.. list-table::
   :header-rows: 1
   :widths: 30 10 60

   * - Variable
     - Default
     - Description
   * - ``ACTIVE_VOX_ENGINE``
     - ``http``
     - Name of the active TTS engine. Set to ``disabled`` to disable the Vox
       subsystem.
   * - ``VOX_ENGINE_SETTINGS``
     - ``{}``
     - JSON string forwarded to the engine at load time.
   * - ``VOX_OUTPUT_DIR``
     - ``tmp_tts``
     - Directory where generated audio files are written.
   * - ``VOX_TIMEOUT_SECONDS``
     - ``10``
     - Maximum seconds to wait for a TTS engine response.
   * - ``VOX_FALLBACK_TO_TEXT``
     - ``true``
     - When ``true``, sends a plain-text message if TTS generation fails.

.. note::

   Legacy ``TTS_*`` config keys (``TTS_ENABLED``, ``TTS_ENDPOINTS``,
   ``TTS_TIMEOUT_SECONDS``, ``TTS_OUTPUT_DIR``) are still supported by the
   built-in ``http`` Vox engine for backward compatibility, but the preferred
   configuration path for new deployments is to register external HTTP TTS
   servers through the External Endpoints system and map them to ``vox``.

WebUI helper endpoints
~~~~~~~~~~~~~~~~~~~~~~

Two read-only HTTP endpoints support voice selection and sample playback in the
web interface. Engines may implement them if they expose multiple speakers or
wish to supply short example clips.

* ``GET /api/vox/speakers?engine=`` returns a JSON array of speaker metadata
  for the specified engine (the configured ``ACTIVE_VOX_ENGINE`` is used when
  the query parameter is omitted). The format is engine-specific; ``kitten``
  returns ``[{"code": "en_1", "name": "English Female 1", "language": "en"}, …]``.
  If the engine is unknown, a ``404`` is returned.
* ``GET /api/vox/sample?engine=&speaker=`` streams a short WAV file for the
  given speaker. Engines that cannot provide samples should raise
  ``NotImplementedError``, which results in a ``404``. A missing ``speaker``
  parameter produces a ``400`` error.

These helpers are used internally by ``res/synth_webui/js/main.js`` to populate
the Kitten voice selector and play sample audio.

Public API:

.. code-block:: python

   from core.core_initializer import PLUGIN_REGISTRY

   vox = PLUGIN_REGISTRY.get("vox_plugin")
   if vox:
       result = await vox.speak(
           "Hello, world!",
           interface_path="synth_webui/session_abc",
           emotion="joy",
           engine_name="http",           # optional per-call override
           merged_text="Hello, world!",  # plain-text fallback caption
       )
       # result = {"status": "success"|"skipped"|"error", "filename": ..., ...}

LLM action ``tts_speak``:

.. code-block:: json

   {
     "type": "tts_speak",
     "payload": {
       "text": "Hello, world!",
       "emotion": "joy"
     }
   }

When ``tts_speak`` is delivered alongside a standard message action (for
example, the LLM returns both a ``message_telegram_bot`` and a ``tts_speak``),
``message_chain`` automatically merges the text payload into the TTS action as
``__merged_text``. This ensures that users receive a single audio message with
a caption and prevents the duplicate text reply that would otherwise occur. The
``merged_text`` field is also used as a fallback caption when the TTS engine
fails.

``VoxEngineBase`` contract
~~~~~~~~~~~~~~~~~~~~~~~~~~~

All engines must extend ``plugins.vox_base.VoxEngineBase``:

.. code-block:: python

   class VoxEngineBase(ABC):
       # Required
       @abstractmethod
       def generate_tts(self, text: str, emotion: str | None = None, **kwargs) -> bytes | None: ...

       # Optional: override to report your format
       @property
       def output_format(self) -> str:
           return "wav"  # or "pcm"

       @property
       def sample_rate(self) -> int:
           return 22050

       @property
       def channels(self) -> int:
           return 1

       # Optional: return mouth-shape data for the renderer
       def get_lipsync_data(self, audio_bytes: bytes) -> dict | None:
           return None

``generate_tts`` must return:

* ``bytes`` — WAV file bytes when ``output_format == "wav"`` (i.e. ``RIFF``
  header present)
* ``bytes`` — raw PCM samples when ``output_format == "pcm"``
* ``None`` — on failure (triggers fallback)

``VoxPlugin`` wraps raw PCM in a valid WAV container before writing to disk, so
engines only need to return the sample data.

Available Vox engines
~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Alias
     - File
     - Notes
   * - ``http``
     - ``vox_engines/http.py``
     - Legacy built-in HTTP TTS engine. Supports ``TTS_ENDPOINTS``
       compatibility and is intended for backward-compatible deployments only.
       For new external HTTP TTS integrations, prefer adding a custom external
       endpoint and mapping it to ``vox``.
   * - ``kitten``
     - ``vox_engines/kitten.py``
     - Neural KittenTTS engine; requires the ``kittentts`` package or uses the
       vendored shim (``vendor/kittentts``). The shim is lightweight and lazily
       imports ``gtts``/``pydub`` – those two libraries are declared as normal
       project dependencies and **must be installed** (``uv add gtts pydub``)
       for audio output to work. Without them the engine will fall back to text
       and log an informative error. Produces higher-quality audio than the
       legacy system-voice implementation.

Lip-sync Integration
---------------------

Lip-sync is **fully centralised** in ``VoxPlugin``. Engines that can produce
viseme/mouth-shape data implement ``VoxEngineBase.get_lipsync_data(audio_bytes)``:

.. code-block:: python

   def get_lipsync_data(self, audio_bytes: bytes) -> dict | None:
       # Return a dict compatible with the WebUI synth:lipsync event,
       # or None to skip lipsync for this utterance.
       return {"mouths": [...], "duration": 3.14}

``VoxPlugin.speak()`` calls this method automatically after writing the audio
to disk and includes the result in the dispatched ``synth:tts-play`` payload.

No interface or plugin should call a lipsync API directly; all lipsync dispatch
goes through ``VoxPlugin``.

Migration from ``tts_lipsync``
--------------------------------

The legacy ``tts_lipsync`` plugin is maintained only for backward compatibility
and should be avoided for new deployments.

To move external HTTP TTS support to the modern external endpoint flow:

1. Add a new endpoint in the Web UI under Settings > External Engines /
   External Endpoints.
2. Choose ``Protocol: custom`` and set ``Base URL`` to the root URI of your
   HTTP TTS server.
3. In ``extra_config``, add ``{"legacy_http_tts": true}`` and any optional
   adapter settings such as ``tts_voice_wav`` or ``tts_endpoint_path``.
4. Enable the ``vox`` subsystem mapping for the endpoint.
5. Set ``ACTIVE_VOX_ENGINE`` to the endpoint ``Name`` you created.
Completing these steps registers the endpoint as a first-class Vox engine and
removes the need to manage ``TTS_ENDPOINTS`` manually. The built-in ``http``
engine and legacy ``TTS_*`` keys remain available only for compatibility with
existing deployments.

Live — Bidirectional Streaming
-------------------------------

The **Live** subsystem handles persistent sessions where audio and text flow in
both directions simultaneously (e.g. a microphone feed producing transcripts
while the system synthesises speech).

Configuration: select the active engine via ``LIVE_CORTEX`` in the WebUI. The
components page dropdown is populated with both cortex engines of kind ``live``
and any external endpoints that were added and mapped to ``live``. The
currently selected value is highlighted and persists across page reloads;
choosing ``disabled`` turns the subsystem off.

Note: Gemini Live is only available after adding it as an external endpoint and
enabling the ``live`` mapping; it is not exposed by default.

``LiveEngineBase`` contract
~~~~~~~~~~~~~~~~~~~~~~~~~~~

All engines extend ``plugins.live_base.LiveEngineBase``:

.. code-block:: python

   class LiveEngineBase(ABC):
       @property
       def supports_input(self) -> bool:
           return False  # PCM → transcript

       @property
       def supports_output(self) -> bool:
           return False  # text → audio

       # Session lifecycle
       async def open_session(self, session_id: str, **kwargs) -> None: ...  # abstract
       async def close_session(self, session_id: str) -> None: ...  # abstract

       # Data channels
       async def send_audio(self, session_id: str, chunk: bytes, sample_rate: int = 16000) -> None: ...
       async def receive_events(self, session_id: str) -> AsyncIterator[LiveEvent]: ...  # abstract
       async def send_text(self, session_id: str, text: str) -> None: ...

       # Lifecycle hooks
       def setup(self) -> None: ...
       def teardown(self) -> None: ...

``LiveEvent`` carries typed payloads:

.. code-block:: python

   @dataclass
   class LiveEvent:
       type: LiveEventType      # TRANSCRIPT | AUDIO | VAD | ERROR
       text: str | None         # transcript text
       audio: bytes | None      # synthesised audio chunk
       is_final: bool           # True = committed transcript segment
       vad_signal: str | None   # "speech_start" | "speech_end"
       detail: str | None       # error detail or free-form annotation
       metadata: dict           # engine-specific extras

WebSocket streaming endpoint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The WebUI exposes ``GET /api/audio/stream`` as a WebSocket. It speaks the Live
registry protocol directly:

1. Client sends a JSON config frame: ``{"sample_rate": 16000, "engine": "silero"}``
2. Client streams binary PCM frames (raw 16-bit mono).
3. Server replies with JSON events and binary audio frames:

   * ``{"type": "partial", "text": "..."}`` — interim transcript
   * ``{"type": "final", "text": "..."}`` — committed transcript segment
   * ``{"type": "vad", "signal": "speech_start"|"speech_end"}`` — VAD markers
   * Binary frames — synthesised TTS audio (AUDIO events)

Available Live engines
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Alias
     - File
     - Notes
   * - ``silero``
     - ``live_engines/silero.py``
     - Silero VAD with async queue per session. Local, CPU-friendly. Connect a
       real ASR model in ``_transcribe_segment``.

.. note::

   Gemini Live is not exposed automatically. Add a Gemini Live-capable endpoint
   through the External Endpoints UI and enable the ``live`` subsystem mapping
   if you want a Gemini-based live engine.

Adding a Live engine
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # plugins/live_engines/my_engine.py
   from plugins.live_base import LiveEngineBase, LiveEvent, LiveEventType
   from core.live_registry import register_live_engine


   class MyLiveEngine(LiveEngineBase):
       # Plain class attributes override the base-class properties.
       supports_input = True
       supports_output = True

       async def open_session(self, session_id, **kwargs): ...

       async def close_session(self, session_id): ...

       async def receive_events(self, session_id):
           # Async generator: yield events as they become available.
           yield LiveEvent(type=LiveEventType.TRANSCRIPT, text="hello", is_final=True)


   ENGINE_CLASS = MyLiveEngine

   register_live_engine("my_engine", __name__, {"input": True, "output": True, "local": True})

Testing
-------

.. code-block:: bash

   uv run pytest tests/test_auris_plugin.py tests/test_vox_plugin.py tests/test_live_registry.py -v

Test suites cover:

* **Auris** registry: register, list, find-by-capabilities, missing
  ``ENGINE_CLASS``, instance caching; ``AurisPlugin.transcribe_audio()`` paths;
  ``AurisEngineBase`` contract (file-based only)
* **Vox** registry: same registry coverage; ``VoxPlugin.speak()``
  disabled/success/fallback paths; ``VoxEngineBase`` property defaults
* **Live** registry: register, list, find-by-capabilities, missing
  ``ENGINE_CLASS``; ``LiveEngineBase`` defaults (``supports_input/output``);
  ``LiveEvent`` type enumeration
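When testing a Vox engine that reports ``output_format == "pcm"`` by hand, it
can help to wrap its raw samples in a WAV container yourself, for example to
build a small audio fixture or to listen to the output. The sketch below does
this with the standard-library ``wave`` module. It is not ``VoxPlugin``'s
actual implementation: the helper name ``pcm_to_wav`` is hypothetical, the
defaults mirror the ``VoxEngineBase`` property defaults, and 16-bit samples are
an assumption.

.. code-block:: python

   import io
   import wave


   def pcm_to_wav(pcm: bytes, sample_rate: int = 22050, channels: int = 1,
                  sample_width: int = 2) -> bytes:
       """Wrap raw PCM samples in a WAV (RIFF) container (hypothetical helper)."""
       buffer = io.BytesIO()
       with wave.open(buffer, "wb") as wav_file:
           wav_file.setnchannels(channels)
           wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit (assumed)
           wav_file.setframerate(sample_rate)
           wav_file.writeframes(pcm)
       # wave does not close a file object it did not open, so the buffer is still readable.
       return buffer.getvalue()


   silence = b"\x00\x00" * 22050   # one second of 16-bit mono silence
   wav_bytes = pcm_to_wav(silence)
   assert wav_bytes[:4] == b"RIFF"  # the header check mentioned in the contract

The resulting bytes can be written to a temporary file and passed to a
file-based consumer, or compared against an engine's ``"wav"`` output in a test.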