Gemini Live API — SyntH Discord Voice Integration
=================================================

This document describes how Synthetic Heart integrates the Gemini Live API with Discord voice channels to enable real-time bidirectional voice conversation with the persona.

Overview
--------

The integration follows a **Hybrid Voice** architecture: Discord captures and plays back PCM audio, while the Gemini Live API handles speech recognition, reasoning, and speech synthesis over a persistent WebSocket session.

.. code-block:: text

         Discord User
               │
               ▼  (48 kHz stereo PCM)
    ┌──────────────────────────┐
    │  LiveVoiceAudioSink      │  ← discord-ext-voice-recv AudioSink
    │  48 kHz stereo → 16 kHz  │
    │  mono (audioop)          │
    └──────────┬───────────────┘
               │  16 kHz mono PCM
               ▼
    ┌──────────────────────────┐
    │  LiveSessionManager      │  ← core/live_session_manager.py
    │  WebSocket session       │
    │  send_realtime_input()   │
    │  receive() loop          │
    └──────────┬───────────────┘
               │  24 kHz mono PCM + tool calls
               ▼
    ┌──────────────────────────┐
    │  LiveAudioBuffer         │  ← interface/discord_interface.py
    │  24 kHz mono → 48 kHz    │
    │  stereo (audioop)        │
    └──────────┬───────────────┘
               │  48 kHz stereo PCM
               ▼
    ┌──────────────────────────┐
    │  LivePCMAudioSource      │  ← discord.AudioSource
    │  20 ms frames (3840 B)   │
    └──────────┬───────────────┘
               │
               ▼
         Discord User

Key files
---------

.. list-table:: Key files
   :header-rows: 1
   :widths: 20 80

   * - File
     - Purpose
   * - ``core/live_session_manager.py``
     - Session lifecycle, audio I/O, receive loop, tool call dispatch, reconnect logic
   * - ``cortex/llm_provider/gemini_api.py``
     - ``get_live_session_manager()`` factory, ``start_live_voice_session()`` / ``stop_live_voice_session()`` wrappers
   * - ``core/prompt_engine.py``
     - ``build_live_system_instruction()`` — condensed persona for live sessions
   * - ``interface/discord_interface.py``
     - Discord actions, audio pipeline classes, tool-calling bridge, voice state cleanup

Audio format details
--------------------

- **Discord → Gemini (input)** — 16 kHz; Mono; 16-bit signed LE; MIME: ``audio/pcm;rate=16000``
- **Gemini → Discord (output)** — 24 kHz; Mono; 16-bit signed LE; MIME: ``audio/pcm;rate=24000``
- **Discord voice (native)** — 48 kHz; Stereo; 16-bit signed LE

Resampling is performed with ``audioop.ratecv()``, ``audioop.tostereo()`` and ``audioop.tomono()`` in the ``LiveVoiceAudioSink`` (input) and ``LiveAudioBuffer`` (output) classes.

Session lifecycle
-----------------

Starting a session
~~~~~~~~~~~~~~~~~~

Triggered by the ``start_live_voice_discord`` action (the model may decide to start it based on conversation context).

1. **Join voice** — connect to the Discord voice channel using ``voice_recv.VoiceRecvClient`` (required for receiving user audio).
2. **Build system instruction** — ``build_live_system_instruction()`` creates a condensed persona prompt suitable for the live session's context window.
3. **Build tool declarations** — ``_build_gemini_tool_declarations()`` queries all plugins and interfaces via ``get_action_plugin_instructions()`` and converts each action's payload schema into ``genai.types.FunctionDeclaration`` objects.
4. **Open WebSocket** — ``LiveSessionManager.start_session()`` opens a connection to the Live model with ``response_modalities=["AUDIO"]``, the system instruction, and tool declarations.
5. **Start audio pipeline** — the ``LivePCMAudioSource`` begins playing buffered model audio, and ``LiveVoiceAudioSink`` begins forwarding user audio.

During a session
~~~~~~~~~~~~~~~~

- **User speaks** → ``LiveVoiceAudioSink.write()`` downsamples and forwards to ``send_realtime_input()``.
- **Model speaks** → ``_receive_loop()`` dispatches ``on_audio`` callback → ``LiveAudioBuffer.write()`` upsamples → ``LivePCMAudioSource.read()`` feeds Discord.
- **Model calls a function** → ``_receive_loop()`` dispatches ``on_tool_call`` → ``_handle_live_tool_call()`` → ``core.action_parser.run_action()`` → result sent back via ``send_tool_response()``.

Stopping a session
~~~~~~~~~~~~~~~~~~

Triggered by:

- ``stop_live_voice_discord`` action
- Bot kicked or disconnected from voice (``on_voice_state_update``)
- Bot moved to a different channel (``on_voice_state_update``)
- All human users leave the voice channel (``on_voice_state_update``)

Cleanup cancels the receive task, closes the WebSocket context, stops Discord audio playback and listening, and closes the audio buffer.

Automatic reconnection
~~~~~~~~~~~~~~~~~~~~~~

Sessions have a **15-minute limit** (audio-only). The manager checks ``should_reconnect`` on every ``send_audio()`` call and triggers ``_reconnect()`` 30 seconds before the limit.

Reconnection steps:

1. Stop the current session.
2. Rebuild the system instruction from the current persona state.
3. Re-discover tool declarations (so function calling persists).
4. Open a new session.

.. note::

   Conversation context is not preserved across reconnections yet. Future work could inject a conversation summary via ``send_client_content()`` or use the Live API session resumption feature.

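
The timing check behind automatic reconnection can be sketched as follows. This is a minimal illustration, not the code in ``core/live_session_manager.py``: ``SessionClock``, ``SESSION_LIMIT_S`` and ``RECONNECT_MARGIN_S`` are hypothetical names, and only the ``should_reconnect`` name and the 15-minute / 30-second figures come from the description above. The clock is injectable so the logic can be exercised without waiting.

.. code-block:: python

    import time
    from typing import Callable

    SESSION_LIMIT_S = 15 * 60   # audio-only session limit (from the text above)
    RECONNECT_MARGIN_S = 30     # reconnect this many seconds before the limit


    class SessionClock:
        """Hypothetical helper that tracks the age of one live session."""

        def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
            self._now = now
            self.started_at = now()

        @property
        def should_reconnect(self) -> bool:
            # True once the session is within the margin of the hard limit.
            elapsed = self._now() - self.started_at
            return elapsed >= SESSION_LIMIT_S - RECONNECT_MARGIN_S

        def restart(self) -> None:
            # Call after a replacement session has been opened.
            self.started_at = self._now()

A caller in the audio path would test ``clock.should_reconnect`` before each send and call ``clock.restart()`` once the new session is open. A monotonic clock is used here so wall-clock adjustments cannot trigger or delay a reconnect.
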
Tool / function calling
-----------------------

The Live API supports function calling, allowing the persona to execute SyntH actions (diary entries, emotion updates, sending messages to other interfaces, etc.) during a voice conversation.

How it works
~~~~~~~~~~~~

1. **At session start**, ``_build_gemini_tool_declarations()`` iterates all plugins and interfaces that implement ``get_prompt_instructions()``.
2. Each action's payload schema is converted to a ``genai.types.FunctionDeclaration`` with:

   - ``name`` = action name (e.g., ``update_diary``, ``message_discord_bot``)
   - ``description`` = from ``get_prompt_instructions()["description"]``
   - ``parameters`` = JSON Schema built from the payload field definitions

3. Declarations are wrapped in a ``genai.types.Tool`` and passed to ``LiveConnectConfig.tools``.
4. When the model emits a ``tool_call`` message, the receive loop calls ``_handle_live_tool_call()`` which:

   - Wraps the call as ``{"type": action_name, "payload": args}``
   - Routes it through ``core.action_parser.run_action()``
   - Returns the result dict to Gemini via ``send_tool_response()``

Limitations
~~~~~~~~~~~

- Tool declarations are a snapshot at session start. If plugins are loaded/unloaded mid-session, the declarations won't update until reconnection.
- The Live API may not support all JSON Schema features; complex nested schemas may need simplification.

Voice state cleanup
-------------------

The ``on_voice_state_update`` event handler in ``DiscordInterface`` monitors three scenarios:

- Bot disconnected/kicked from voice — ``_stop_live_voice()``
- Bot moved to a different channel — ``_stop_live_voice()``
- All human users leave the bot's channel — ``_stop_live_voice()``

This ensures Live API sessions are never left orphaned.

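
The three scenarios above reduce to one decision per voice state event. The sketch below shows that decision as a pure function. It is an assumption about how the check could be structured, not the actual ``DiscordInterface`` handler: ``Member`` and ``should_stop_live_voice`` are hypothetical names, channels are represented by plain integer IDs, and the caller is assumed to pass the members still present in the bot's channel after the event.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional, Sequence


    @dataclass(frozen=True)
    class Member:
        """Hypothetical stand-in for a Discord member."""
        id: int
        bot: bool = False


    def should_stop_live_voice(
        bot_id: int,
        member: Member,
        before_channel: Optional[int],
        after_channel: Optional[int],
        bot_channel: Optional[int],
        members_in_bot_channel: Sequence[Member],
    ) -> bool:
        """Return True when the live session should be torn down."""
        if member.id == bot_id:
            # Scenarios 1 and 2: the bot was disconnected (after is None)
            # or moved (after differs from before). A first join has
            # before_channel None and is ignored.
            return before_channel is not None and after_channel != before_channel

        # Scenario 3 only applies when someone else left the bot's channel.
        left_bot_channel = (
            bot_channel is not None
            and before_channel == bot_channel
            and after_channel != bot_channel
        )
        if not left_bot_channel:
            return False
        # Stop only if no human remains alongside the bot.
        return not any(not m.bot for m in members_in_bot_channel)

Keeping the decision separate from the teardown makes it testable without a Discord connection. The event handler would call ``_stop_live_voice()`` only when the function returns ``True``.
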
Dependencies
------------

- ``google-genai`` — Google GenAI SDK (WebSocket client, types)
- ``discord.py`` — Discord bot framework
- ``discord-ext-voice-recv`` — Audio reception from Discord voice channels

Install with::

    uv sync

Configuration
-------------

- ``GEMINI_API_KEY`` — Google AI API key (required)
- ``DISCORD_BOT_TOKEN`` — Discord bot token (required)

The Live API model is hardcoded as ``gemini-2.5-flash-native-audio-preview-12-2025`` in ``core/live_session_manager.py:LIVE_MODEL``.

Troubleshooting
---------------

- ``_HAS_VOICE_RECV is False`` — ``discord-ext-voice-recv`` not installed. Run ``uv add discord-ext-voice-recv``
- ``Live session manager unavailable`` — ``google-genai`` not installed or ``GEMINI_API_KEY`` not set
- No audio from model — Check ``LIVE_OUTPUT_SAMPLE_RATE`` matches actual model output; inspect ``on_audio_from_model`` logs
- WebSocket disconnects — 15-minute session limit hit; reconnection should fire automatically
- Tool calls not working — Check ``_build_gemini_tool_declarations()`` log output for declaration count
- Bot stays in voice after session ends — ``on_voice_state_update`` handler should clean up; check for exceptions in logs

Remaining work
--------------

- **Integration testing** against the real Gemini Live API (audio format verification, auth, WebSocket stability).
- **Audio output sample rate verification** — the assumed 24 kHz may differ; inspect ``mime_type`` of received audio blobs during testing.
- **Session resumption** — use ``send_client_content()`` to inject conversation summaries on reconnect, or implement Google's ``SessionResumptionConfig``.
- **Speech config** — add ``speech_config`` to ``LiveConnectConfig`` to explicitly request a voice and output format.

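
For the sample rate verification item, one possible check is to read the ``rate`` parameter from the MIME string of a received audio blob and compare it with the assumed value. The sketch below works on plain strings such as ``audio/pcm;rate=24000``. ``rate_from_mime``, ``rate_matches`` and ``ASSUMED_OUTPUT_RATE`` are hypothetical names for illustration; how the MIME string is obtained from the SDK response is left out, as it has not been verified here.

.. code-block:: python

    from typing import Optional

    ASSUMED_OUTPUT_RATE = 24_000  # the 24 kHz assumption stated in this document


    def rate_from_mime(mime_type: str) -> Optional[int]:
        """Return the ``rate`` parameter of a MIME string, or None if absent."""
        # Drop the media type itself and keep only the ';'-separated parameters.
        _, *params = (part.strip() for part in mime_type.split(";"))
        for param in params:
            key, sep, value = param.partition("=")
            value = value.strip()
            if sep and key.strip().lower() == "rate" and value.isdecimal():
                return int(value)
        return None


    def rate_matches(mime_type: str, expected: int = ASSUMED_OUTPUT_RATE) -> bool:
        """True when the MIME string declares exactly the expected rate."""
        return rate_from_mime(mime_type) == expected

Logging a warning whenever ``rate_matches()`` returns ``False`` during integration testing would surface a mismatch before it shows up as pitch-shifted playback.
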