Gemini Live API — SyntH Discord Voice Integration

This document describes how Synthetic Heart (SyntH) integrates the Gemini Live API with Discord voice channels to enable real-time, bidirectional voice conversation with the persona.

Overview

The integration follows a Hybrid Voice architecture: Discord captures and plays back PCM audio, while the Gemini Live API handles speech recognition, reasoning, and speech synthesis over a persistent WebSocket session.

Discord User
    │
    ▼ (48 kHz stereo PCM)
┌──────────────────────────┐
│  LiveVoiceAudioSink      │ ← discord-ext-voice-recv AudioSink
│  48 kHz stereo → 16 kHz  │
│  mono (audioop)          │
└──────────┬───────────────┘
           │ 16 kHz mono PCM
           ▼
┌──────────────────────────┐
│  LiveSessionManager      │ ← core/live_session_manager.py
│  WebSocket session       │
│  send_realtime_input()   │
│  receive() loop          │
└──────────┬───────────────┘
           │ 24 kHz mono PCM + tool calls
           ▼
┌──────────────────────────┐
│  LiveAudioBuffer         │ ← interface/discord_interface.py
│  24 kHz mono → 48 kHz    │
│  stereo (audioop)        │
└──────────┬───────────────┘
           │ 48 kHz stereo PCM
           ▼
┌──────────────────────────┐
│  LivePCMAudioSource      │ ← discord.AudioSource
│  20 ms frames (3840 B)   │
└──────────┬───────────────┘
           │
           ▼
      Discord User

Key files

File	Purpose
`core/live_session_manager.py`	Session lifecycle, audio I/O, receive loop, tool call dispatch, reconnect logic
`cortex/llm_provider/gemini_api.py`	`get_live_session_manager()` factory, `start_live_voice_session()` / `stop_live_voice_session()` wrappers
`core/prompt_engine.py`	`build_live_system_instruction()` — condensed persona for live sessions
`interface/discord_interface.py`	Discord actions, audio pipeline classes, tool-calling bridge, voice state cleanup

Audio format details

Discord → Gemini (input) — 16 kHz; Mono; 16-bit signed LE; MIME: audio/pcm;rate=16000
Gemini → Discord (output) — 24 kHz; Mono; 16-bit signed LE; MIME: audio/pcm;rate=24000
Discord voice (native) — 48 kHz; Stereo; 16-bit signed LE

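Format conversion is done with audioop. LiveVoiceAudioSink (input) uses audioop.tomono() and audioop.ratecv() to turn 48 kHz stereo into 16 kHz mono; LiveAudioBuffer (output) uses audioop.ratecv() and audioop.tostereo() to turn 24 kHz mono into 48 kHz stereo.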
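To make the two conversions concrete, here is a minimal sketch that reproduces them with only the struct module. It is not the project's code: the function names are hypothetical, and the box-filter decimation and sample duplication are crude stand-ins for audioop.ratecv(), chosen because the ratios (48 → 16 kHz and 24 → 48 kHz) are exact integers. One reason to have such a fallback in mind is that audioop was removed from the standard library in Python 3.13.

```python
import struct


def _unpack_s16le(pcm: bytes) -> list[int]:
    """Decode 16-bit signed little-endian PCM into a list of ints."""
    count = len(pcm) // 2
    return list(struct.unpack(f"<{count}h", pcm[: count * 2]))


def _pack_s16le(samples: list[int]) -> bytes:
    """Encode ints back into 16-bit signed little-endian PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


def discord_to_gemini(pcm_48k_stereo: bytes) -> bytes:
    """48 kHz stereo -> 16 kHz mono (hypothetical stand-in for the sink)."""
    samples = _unpack_s16le(pcm_48k_stereo)
    # Downmix: average each left/right pair into one mono sample.
    mono = [(samples[i] + samples[i + 1]) // 2
            for i in range(0, len(samples) - 1, 2)]
    # Decimate by 3: average each group of three mono samples.
    decimated = [sum(mono[i:i + 3]) // 3
                 for i in range(0, len(mono) - 2, 3)]
    return _pack_s16le(decimated)


def gemini_to_discord(pcm_24k_mono: bytes) -> bytes:
    """24 kHz mono -> 48 kHz stereo (hypothetical stand-in for the buffer)."""
    out: list[int] = []
    for sample in _unpack_s16le(pcm_24k_mono):
        # Two output frames per input sample (2x rate), each with L and R.
        out.extend((sample, sample, sample, sample))
    return _pack_s16le(out)
```

With these ratios, one 20 ms Discord frame (3840 bytes) becomes 640 bytes of input for Gemini, and 960 bytes of model output (20 ms at 24 kHz mono) become one 3840-byte Discord frame.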

Session lifecycle

Starting a session

Triggered by the start_live_voice_discord action (the model may decide to start it based on conversation context).

Join voice — connect to the Discord voice channel using voice_recv.VoiceRecvClient (required for receiving user audio).
Build system instruction — build_live_system_instruction() creates a condensed persona prompt suitable for the live session’s context window.
Build tool declarations — _build_gemini_tool_declarations() queries all plugins and interfaces via get_action_plugin_instructions() and converts each action’s payload schema into genai.types.FunctionDeclaration objects.
Open WebSocket — LiveSessionManager.start_session() opens a connection to the Live model with response_modalities=["AUDIO"], the system instruction, and tool declarations.
Start audio pipeline — the LivePCMAudioSource begins playing buffered model audio, and LiveVoiceAudioSink begins forwarding user audio.
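The "Open WebSocket" step can be sketched as building a configuration and handing it to the SDK. The helper name build_live_config is hypothetical, and the sketch assumes that the google-genai client accepts a plain dict as the config argument of client.aio.live.connect(); the actual LiveSessionManager may construct genai.types.LiveConnectConfig instead.

```python
LIVE_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"


def build_live_config(system_instruction: str,
                      function_declarations: list[dict]) -> dict:
    """Assemble a Live API config dict (hypothetical helper)."""
    config: dict = {
        "response_modalities": ["AUDIO"],
        "system_instruction": system_instruction,
    }
    # Only attach tools when at least one action was discovered.
    if function_declarations:
        config["tools"] = [{"function_declarations": function_declarations}]
    return config


# Intended use (not executed here; needs google-genai and GEMINI_API_KEY):
#
#   import os
#   from google import genai
#
#   client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
#   config = build_live_config(instruction, declarations)
#   async with client.aio.live.connect(model=LIVE_MODEL, config=config) as session:
#       ...  # send_realtime_input() / receive() loop
```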

During a session

User speaks → LiveVoiceAudioSink.write() downsamples and forwards to send_realtime_input().
Model speaks → _receive_loop() dispatches on_audio callback → LiveAudioBuffer.write() upsamples → LivePCMAudioSource.read() feeds Discord.
Model calls a function → _receive_loop() dispatches on_tool_call → _handle_live_tool_call() → core.action_parser.run_action() → result sent back via send_tool_response().
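The hand-off between the receive loop and Discord playback can be pictured as a small thread-safe byte buffer that always returns full 20 ms frames. The class below is an illustrative sketch, not the actual LiveAudioBuffer: its name and methods are assumptions. It pads with silence when the model has produced no audio, because discord.py treats an empty read() result as the end of the stream, and it uses a lock because discord.py reads audio sources from a separate player thread.

```python
import threading

# 48 kHz * 2 channels * 2 bytes per sample * 20 ms
FRAME_BYTES = 48_000 * 2 * 2 * 20 // 1000


class FrameBuffer:
    """Hypothetical buffer: written by the receive loop, read by the player."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._lock = threading.Lock()

    def write(self, pcm_48k_stereo: bytes) -> None:
        """Append already-upsampled model audio."""
        with self._lock:
            self._data.extend(pcm_48k_stereo)

    def read_frame(self) -> bytes:
        """Return exactly one frame, padded with silence if data runs short."""
        with self._lock:
            chunk = bytes(self._data[:FRAME_BYTES])
            del self._data[:FRAME_BYTES]
        return chunk + b"\x00" * (FRAME_BYTES - len(chunk))
```

An AudioSource.read() implementation would then simply return read_frame(), so playback keeps running during pauses in the model's speech.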

Stopping a session

Triggered by:

stop_live_voice_discord action
Bot kicked or disconnected from voice (on_voice_state_update)
Bot moved to a different channel (on_voice_state_update)
All human users leave the voice channel (on_voice_state_update)

Cleanup cancels the receive task, closes the WebSocket context, stops Discord audio playback and listening, and closes the audio buffer.

Automatic reconnection

Audio-only Live API sessions are limited to 15 minutes. The manager checks should_reconnect on every send_audio() call and triggers _reconnect() 30 seconds before the limit, that is, once the session has been open for 14 minutes 30 seconds.

Reconnection steps:

Stop the current session.
Rebuild the system instruction from the current persona state.
Re-discover tool declarations (so function calling persists).
Open a new session.
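The timing check behind should_reconnect amounts to comparing the session age against the limit minus a safety margin. The class below is a sketch under that assumption; SessionClock and its injectable clock are hypothetical names used for illustration, not the manager's real attributes.

```python
import time

SESSION_LIMIT_S = 15 * 60      # audio-only session limit
RECONNECT_MARGIN_S = 30        # reconnect this long before the limit


class SessionClock:
    """Tracks session age; the clock is injectable to make testing easy."""

    def __init__(self, now=time.monotonic) -> None:
        self._now = now
        self._started_at = now()

    @property
    def should_reconnect(self) -> bool:
        age = self._now() - self._started_at
        return age >= SESSION_LIMIT_S - RECONNECT_MARGIN_S

    def restart(self) -> None:
        """Call after a new session has been opened."""
        self._started_at = self._now()
```

Using a monotonic clock avoids spurious reconnects when the system wall clock is adjusted.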

Note

Conversation context is not preserved across reconnections yet. Future work could inject a conversation summary via send_client_content() or use the Live API session resumption feature.

Tool / function calling

The Live API supports function calling, allowing the persona to execute SyntH actions (diary entries, emotion updates, sending messages to other interfaces, etc.) during a voice conversation.

How it works

At session start, _build_gemini_tool_declarations() iterates all plugins and interfaces that implement get_prompt_instructions().
Each action’s payload schema is converted to a genai.types.FunctionDeclaration with:
- name = action name (e.g., update_diary, message_discord_bot)
- description = from get_prompt_instructions()["description"]
- parameters = JSON Schema built from the payload field definitions
Declarations are wrapped in a genai.types.Tool and passed to LiveConnectConfig.tools.
When the model emits a tool_call message, the receive loop calls _handle_live_tool_call(), which:
- Wraps the call as {"type": action_name, "payload": args}
- Routes it through core.action_parser.run_action()
- Returns the result dict to Gemini via send_tool_response()
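The two conversions in this flow (payload schema to declaration, and tool call to SyntH action) can be sketched with plain dicts. The payload field format used here ({"type", "description", "required"} per field) is an assumption made for illustration, since the real plugin schema is not shown in this document, and both function names are hypothetical.

```python
def payload_to_declaration(action_name: str, description: str,
                           payload_fields: dict) -> dict:
    """Build a function-declaration dict from assumed payload field specs."""
    properties = {}
    required = []
    for field, spec in payload_fields.items():
        properties[field] = {
            "type": spec.get("type", "string"),
            "description": spec.get("description", ""),
        }
        if spec.get("required", False):
            required.append(field)
    parameters = {"type": "object", "properties": properties}
    if required:
        parameters["required"] = required
    return {
        "name": action_name,
        "description": description,
        "parameters": parameters,
    }


def wrap_tool_call(name: str, args) -> dict:
    """Wrap a model tool call in the action format passed to run_action()."""
    return {"type": name, "payload": dict(args or {})}
```

For example, a tool call named update_diary with arguments {"text": "hi"} becomes {"type": "update_diary", "payload": {"text": "hi"}} before it is routed through core.action_parser.run_action().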

Limitations

Tool declarations are a snapshot at session start. If plugins are loaded/unloaded mid-session, the declarations won’t update until reconnection.
The Live API may not support all JSON Schema features; complex nested schemas may need simplification.

Voice state cleanup

The on_voice_state_update event handler in DiscordInterface monitors three scenarios:

Bot disconnected/kicked from voice — _stop_live_voice()
Bot moved to a different channel — _stop_live_voice()
All human users leave the bot’s channel — _stop_live_voice()

This keeps Live API sessions from being left orphaned when the bot leaves voice or the channel empties.
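The three scenarios reduce to a small decision function that a voice state handler could consult while a live session is active. The sketch below deliberately avoids discord.py types so it can be read and tested in isolation; the function name, its parameters, and the returned reason strings are all hypothetical.

```python
from typing import Optional


def stop_reason(is_bot: bool,
                before_channel_id: Optional[int],
                after_channel_id: Optional[int],
                bot_channel_id: Optional[int],
                humans_left_in_bot_channel: int) -> Optional[str]:
    """Return why the live session should stop, or None to keep it running."""
    if is_bot:
        # The bot's own voice state changed.
        if before_channel_id is not None and after_channel_id is None:
            return "bot disconnected"
        if (before_channel_id is not None and after_channel_id is not None
                and before_channel_id != after_channel_id):
            return "bot moved"
        return None
    # Another member's state changed: did they leave the bot's channel?
    left_bot_channel = (before_channel_id == bot_channel_id
                        and after_channel_id != bot_channel_id)
    if left_bot_channel and humans_left_in_bot_channel == 0:
        return "channel empty"
    return None
```

Any non-None result would lead to _stop_live_voice(); mute or deafen changes, where the channel stays the same, return None.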

Dependencies

google-genai — Google GenAI SDK (WebSocket client, types)
discord.py — Discord bot framework
discord-ext-voice-recv — Audio reception from Discord voice channels

Install with:

uv sync

Configuration

GEMINI_API_KEY — Google AI API key (required)
DISCORD_BOT_TOKEN — Discord bot token (required)

The Live API model is hardcoded as gemini-2.5-flash-native-audio-preview-12-2025 via LIVE_MODEL in core/live_session_manager.py.

Troubleshooting

_HAS_VOICE_RECV is False — discord-ext-voice-recv not installed. Run uv add discord-ext-voice-recv
Live session manager unavailable — google-genai not installed or GEMINI_API_KEY not set
No audio from model — Check LIVE_OUTPUT_SAMPLE_RATE matches actual model output; inspect on_audio_from_model logs
WebSocket disconnects — 15-minute session limit hit; reconnection should fire automatically
Tool calls not working — Check _build_gemini_tool_declarations() log output for declaration count
Bot stays in voice after session ends — on_voice_state_update handler should clean up; check for exceptions in logs

Remaining work

Integration testing against the real Gemini Live API (audio format verification, auth, WebSocket stability).
Audio output sample rate verification — the assumed 24 kHz may differ; inspect mime_type of received audio blobs during testing.
Session resumption — use send_client_content() to inject conversation summaries on reconnect, or implement Google’s SessionResumptionConfig.
Speech config — add speech_config to LiveConnectConfig to explicitly request a voice and output format.
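For the sample rate verification item, a small parser can read the rate out of a received blob's mime_type instead of assuming 24 kHz. This sketch assumes the model reports the same "audio/pcm;rate=NNNNN" form listed under Audio format details; the function name and the fallback default are hypothetical.

```python
def parse_pcm_rate(mime_type: str, default: int = 24_000) -> int:
    """Extract the sample rate from e.g. 'audio/pcm;rate=24000'."""
    parts = [part.strip() for part in mime_type.split(";")]
    # parts[0] is the media type; the rest are key=value parameters.
    for part in parts[1:]:
        key, _, value = part.partition("=")
        if key.strip().lower() == "rate" and value.strip().isdigit():
            return int(value)
    return default
```

Logging a warning whenever the parsed rate differs from LIVE_OUTPUT_SAMPLE_RATE would surface a mismatch early, before it shows up as sped-up or slowed-down playback.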