Gemini Live API — SyntH Discord Voice Integration
This document describes how Synthetic Heart integrates the Gemini Live API with Discord voice channels to enable real-time bidirectional voice conversation with the persona.
Overview
The integration follows a Hybrid Voice architecture: Discord captures and plays back PCM audio, while the Gemini Live API handles speech recognition, reasoning, and speech synthesis over a persistent WebSocket session.
      Discord User
           │
           ▼ (48 kHz stereo PCM)
┌──────────────────────────┐
│ LiveVoiceAudioSink       │ ← discord-ext-voice-recv AudioSink
│ 48 kHz stereo → 16 kHz   │
│ mono (audioop)           │
└──────────┬───────────────┘
           │ 16 kHz mono PCM
           ▼
┌──────────────────────────┐
│ LiveSessionManager       │ ← core/live_session_manager.py
│ WebSocket session        │
│ send_realtime_input()    │
│ receive() loop           │
└──────────┬───────────────┘
           │ 24 kHz mono PCM + tool calls
           ▼
┌──────────────────────────┐
│ LiveAudioBuffer          │ ← interface/discord_interface.py
│ 24 kHz mono → 48 kHz     │
│ stereo (audioop)         │
└──────────┬───────────────┘
           │ 48 kHz stereo PCM
           ▼
┌──────────────────────────┐
│ LivePCMAudioSource       │ ← discord.AudioSource
│ 20 ms frames (3840 B)    │
└──────────┬───────────────┘
           │
           ▼
      Discord User
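The frame size in the last box follows from the audio format. As a quick check, the sketch below computes the size of one 20 ms frame at each stage of the pipeline, assuming 16-bit (2-byte) samples throughout; pcm_frame_bytes is a hypothetical helper, not a function from the project.

```python
def pcm_frame_bytes(sample_rate_hz: int, channels: int,
                    frame_ms: int = 20, bytes_per_sample: int = 2) -> int:
    """Return the size in bytes of one PCM frame of the given duration."""
    samples_per_channel = sample_rate_hz * frame_ms // 1000
    return samples_per_channel * channels * bytes_per_sample


# One 20 ms frame at each stage of the pipeline.
discord_frame = pcm_frame_bytes(48_000, channels=2)     # Discord native audio
gemini_in_frame = pcm_frame_bytes(16_000, channels=1)   # sent to the Live API
gemini_out_frame = pcm_frame_bytes(24_000, channels=1)  # received from the Live API

print(discord_frame, gemini_in_frame, gemini_out_frame)
```

For 48 kHz stereo this gives 960 samples per channel × 2 channels × 2 bytes = 3840 bytes, matching the figure in the diagram.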
Key files
| File | Purpose |
|---|---|
| core/live_session_manager.py | Session lifecycle, audio I/O, receive loop, tool call dispatch, reconnect logic |
| interface/discord_interface.py | Discord actions, audio pipeline classes, tool-calling bridge, voice state cleanup |
Audio format details
Discord → Gemini (input): 16 kHz, mono, 16-bit signed LE, MIME audio/pcm;rate=16000
Gemini → Discord (output): 24 kHz, mono, 16-bit signed LE, MIME audio/pcm;rate=24000
Discord voice (native): 48 kHz, stereo, 16-bit signed LE
Resampling is performed with audioop.ratecv(), audioop.tostereo() and
audioop.tomono() in the LiveVoiceAudioSink (input) and
LiveAudioBuffer (output) classes.
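To make the two conversions concrete, here is a standard-library-only sketch of the same rate and channel changes. It is a stand-in rather than the project's code: it uses plain averaging and sample repetition instead of audioop.ratecv(), which keeps it runnable on Python 3.13 and later, where audioop is no longer part of the standard library. The function names are hypothetical, and the simple filtering would likely sound worse than a proper resampler.

```python
import array
import sys


def _to_samples(pcm: bytes) -> array.array:
    """Decode 16-bit signed little-endian PCM into an array of ints."""
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples


def _to_bytes(samples: array.array) -> bytes:
    """Encode an array of ints back to 16-bit signed little-endian PCM."""
    if sys.byteorder == "big":
        samples = array.array("h", samples)
        samples.byteswap()
    return samples.tobytes()


def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Downmix and decimate: 3 stereo frames (6 values) become 1 mono sample."""
    samples = _to_samples(pcm)
    usable = len(samples) - len(samples) % 6
    out = array.array("h")
    for i in range(0, usable, 6):
        # Averaging both channels over 3 frames is a crude low-pass filter.
        out.append(sum(samples[i:i + 6]) // 6)
    return _to_bytes(out)


def mono24k_to_stereo48k(pcm: bytes) -> bytes:
    """Upsample and duplicate channels: each mono sample becomes 2 stereo frames."""
    out = array.array("h")
    for sample in _to_samples(pcm):
        # Left and right for the first frame, then the same again.
        out.extend((sample, sample, sample, sample))
    return _to_bytes(out)
```

Because the ratios are exact integers (48 kHz to 16 kHz is 3:1, 24 kHz to 48 kHz is 1:2), a 3840-byte Discord frame maps to 640 bytes of model input, and 960 bytes of model output map back to one 3840-byte Discord frame.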
Session lifecycle
Starting a session
Triggered by the start_live_voice_discord action (the model may decide to
start it based on conversation context).
Join voice: connect to the Discord voice channel using voice_recv.VoiceRecvClient (required for receiving user audio).
Build system instruction: build_live_system_instruction() creates a condensed persona prompt suitable for the live session’s context window.
Build tool declarations: _build_gemini_tool_declarations() queries all plugins and interfaces via get_action_plugin_instructions() and converts each action’s payload schema into genai.types.FunctionDeclaration objects.
Open WebSocket: LiveSessionManager.start_session() opens a connection to the Live model with response_modalities=["AUDIO"], the system instruction, and tool declarations.
Start audio pipeline: LivePCMAudioSource begins playing buffered model audio, and LiveVoiceAudioSink begins forwarding user audio.
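The connection settings named in these steps can be pictured as a single configuration object. The sketch below assembles it as a plain dictionary, on the assumption that the google-genai SDK accepts a dict in place of LiveConnectConfig and that the keys match its field names. build_live_config and the sample values are hypothetical and are not taken from the project.

```python
from __future__ import annotations


def build_live_config(system_instruction: str,
                      function_declarations: list[dict]) -> dict:
    """Assemble a dict shaped like the Live API connection config."""
    config: dict = {
        # Audio-only responses, as described in the steps above.
        "response_modalities": ["AUDIO"],
        "system_instruction": system_instruction,
    }
    if function_declarations:
        # All declarations travel together inside one tool entry.
        config["tools"] = [{"function_declarations": function_declarations}]
    return config


# Illustrative values only; the real prompt and declarations are built at runtime.
example_config = build_live_config(
    "You are the persona, speaking in a Discord voice channel.",
    [{"name": "update_diary", "description": "Write a diary entry."}],
)
print(sorted(example_config))
```

Omitting the tools key when there are no declarations keeps the configuration minimal for sessions that only need conversation.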
During a session
User speaks → LiveVoiceAudioSink.write() downsamples and forwards to send_realtime_input().
Model speaks → _receive_loop() dispatches on_audio callback → LiveAudioBuffer.write() upsamples → LivePCMAudioSource.read() feeds Discord.
Model calls a function → _receive_loop() dispatches on_tool_call → _handle_live_tool_call() → core.action_parser.run_action() → result sent back via send_tool_response().
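On the playback side, model audio arrives in chunks of arbitrary size while Discord asks for fixed 20 ms frames. The sketch below shows one way a buffer between the two could behave. It is a hypothetical stand-in for LiveAudioBuffer, not the project's class, and it assumes the audio written to it has already been converted to 48 kHz stereo.

```python
import threading

FRAME_BYTES = 3840  # 20 ms of 48 kHz stereo 16-bit PCM


class FrameBuffer:
    """Collects PCM chunks of any size and hands out fixed-size frames."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._lock = threading.Lock()

    def write(self, pcm: bytes) -> None:
        """Append audio; may be called from the receive loop's thread."""
        with self._lock:
            self._data.extend(pcm)

    def read_frame(self) -> bytes:
        """Return exactly one frame, padding with silence when data runs short."""
        with self._lock:
            chunk = bytes(self._data[:FRAME_BYTES])
            del self._data[:FRAME_BYTES]
        # Zero bytes are silence in signed 16-bit PCM.
        return chunk.ljust(FRAME_BYTES, b"\x00")


buffer = FrameBuffer()
buffer.write(b"\x01" * 5000)
first = buffer.read_frame()   # full frame of real audio
second = buffer.read_frame()  # remaining 1160 bytes plus silence
third = buffer.read_frame()   # silence only
```

Padding with silence rather than returning an empty result matters because discord.py treats an empty read() from an audio source as the end of the stream, which would stop playback between model utterances.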
Stopping a session
Triggered by:
stop_live_voice_discord action
Bot kicked or disconnected from voice (on_voice_state_update)
Bot moved to a different channel (on_voice_state_update)
All human users leave the voice channel (on_voice_state_update)
Cleanup cancels the receive task, closes the WebSocket context, stops Discord audio playback and listening, and closes the audio buffer.
Automatic reconnection
Sessions have a 15-minute limit (audio-only). The manager checks
should_reconnect on every send_audio() call and triggers _reconnect()
30 seconds before the limit.
Reconnection steps:
Stop the current session.
Rebuild the system instruction from the current persona state.
Re-discover tool declarations (so function calling persists).
Open a new session.
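The timing check itself is simple arithmetic: with a 15-minute limit and a 30-second margin, reconnection is due 870 seconds into the session. The sketch below illustrates that check with an injectable clock so it can be exercised without waiting. SessionClock and the constant names are hypothetical; only should_reconnect mirrors a name used above.

```python
import time
from typing import Callable

SESSION_LIMIT_S = 15 * 60   # audio-only session limit
RECONNECT_MARGIN_S = 30     # reconnect this long before the limit


class SessionClock:
    """Tracks session age and reports when a reconnect is due."""

    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now
        self._started_at = now()

    @property
    def should_reconnect(self) -> bool:
        elapsed = self._now() - self._started_at
        return elapsed >= SESSION_LIMIT_S - RECONNECT_MARGIN_S

    def restart(self) -> None:
        """Call after a new session has been opened."""
        self._started_at = self._now()


# A fake clock makes the threshold easy to see.
fake_time = [0.0]
clock = SessionClock(now=lambda: fake_time[0])
fake_time[0] = 869.0
before_threshold = clock.should_reconnect
fake_time[0] = 870.0
at_threshold = clock.should_reconnect
```

Using a monotonic clock avoids spurious reconnects if the system wall clock is adjusted during a session.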
Note
Conversation context is not preserved across reconnections yet. Future work
could inject a conversation summary via send_client_content() or use the
Live API session resumption feature.
Tool / function calling
The Live API supports function calling, allowing the persona to execute SyntH actions (diary entries, emotion updates, sending messages to other interfaces, etc.) during a voice conversation.
How it works
At session start, _build_gemini_tool_declarations() iterates all plugins and interfaces that implement get_prompt_instructions().
Each action’s payload schema is converted to a genai.types.FunctionDeclaration with:
  - name = action name (e.g., update_diary, message_discord_bot)
  - description = from get_prompt_instructions()["description"]
  - parameters = JSON Schema built from the payload field definitions
Declarations are wrapped in a genai.types.Tool and passed to LiveConnectConfig.tools.
When the model emits a tool_call message, the receive loop calls _handle_live_tool_call(), which:
  - Wraps the call as {"type": action_name, "payload": args}
  - Routes it through core.action_parser.run_action()
  - Returns the result dict to Gemini via send_tool_response()
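The two data shapes involved can be sketched without the SDK. The example below builds a declaration as a plain dictionary and wraps an incoming call in the action format described above. The payload field layout (type, description, required) is an assumption made for illustration, since the document does not specify how actions describe their fields, and both function names are hypothetical.

```python
from __future__ import annotations


def to_function_declaration(action_name: str, instructions: dict,
                            payload_fields: dict[str, dict]) -> dict:
    """Convert one action description into a declaration-shaped dict."""
    properties = {}
    required = []
    for field_name, spec in payload_fields.items():
        properties[field_name] = {
            "type": spec.get("type", "string"),
            "description": spec.get("description", ""),
        }
        if spec.get("required"):
            required.append(field_name)
    return {
        "name": action_name,
        "description": instructions.get("description", ""),
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def wrap_tool_call(name: str, args: dict | None) -> dict:
    """Wrap a model tool call in the action format used by the action parser."""
    return {"type": name, "payload": dict(args or {})}


declaration = to_function_declaration(
    "update_diary",
    {"description": "Write a diary entry."},
    {"text": {"type": "string", "description": "Entry body", "required": True}},
)
action = wrap_tool_call("update_diary", {"text": "Talked in voice today."})
```

Here declaration["parameters"]["required"] is ["text"], and action is the dictionary that would be handed to core.action_parser.run_action() in the flow described above.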
Limitations
Tool declarations are a snapshot at session start. If plugins are loaded/unloaded mid-session, the declarations won’t update until reconnection.
The Live API may not support all JSON Schema features; complex nested schemas may need simplification.
Voice state cleanup
The on_voice_state_update event handler in DiscordInterface monitors
three scenarios:
Bot disconnected or kicked from voice: _stop_live_voice()
Bot moved to a different channel: _stop_live_voice()
All human users leave the bot’s channel: _stop_live_voice()
This ensures Live API sessions are never left orphaned.
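The three scenarios reduce to one decision per voice state event. The sketch below expresses that decision as a pure function over plain values so it can be tested without a Discord connection. The function, its parameters, and the Member type are hypothetical; in the real handler these values would come from the member and channel objects that discord.py passes to on_voice_state_update.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Member:
    id: int
    is_bot: bool


def should_stop_live_voice(bot_id: int, session_channel_id: int,
                           changed_member_id: int,
                           after_channel_id: Optional[int],
                           members_in_session_channel: Sequence[Member]) -> bool:
    """Decide whether a voice state change should end the live session."""
    if changed_member_id == bot_id:
        # None means the bot was disconnected or kicked; any other
        # channel id means it was moved away from the session channel.
        return after_channel_id != session_channel_id
    # Someone else changed state: stop once no human is left with the bot.
    return not any(not member.is_bot for member in members_in_session_channel)


BOT, HUMAN = Member(1, True), Member(2, False)
bot_kicked = should_stop_live_voice(1, 10, 1, None, [HUMAN])
bot_moved = should_stop_live_voice(1, 10, 1, 11, [HUMAN])
last_human_left = should_stop_live_voice(1, 10, 2, None, [BOT])
human_still_here = should_stop_live_voice(1, 10, 3, None, [BOT, HUMAN])
```

Events where the bot stays in the session channel, such as a mute toggle, fall through the first branch and return False.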
Dependencies
google-genai: Google GenAI SDK (WebSocket client, types)
discord.py: Discord bot framework
discord-ext-voice-recv: audio reception from Discord voice channels
Install with:
uv sync
Configuration
GEMINI_API_KEY: Google AI API key (required)
DISCORD_BOT_TOKEN: Discord bot token (required)
The Live API model is hardcoded as gemini-2.5-flash-native-audio-preview-12-2025
in core/live_session_manager.py:LIVE_MODEL.
Troubleshooting
_HAS_VOICE_RECV is False: discord-ext-voice-recv is not installed. Run uv add discord-ext-voice-recv.
Live session manager unavailable: google-genai is not installed or GEMINI_API_KEY is not set.
No audio from model: check that LIVE_OUTPUT_SAMPLE_RATE matches the actual model output; inspect the on_audio_from_model logs.
WebSocket disconnects: the 15-minute session limit was hit; reconnection should fire automatically.
Tool calls not working: check the _build_gemini_tool_declarations() log output for the declaration count.
Bot stays in voice after session ends: the on_voice_state_update handler should clean up; check for exceptions in the logs.
Remaining work
Integration testing against the real Gemini Live API (audio format verification, auth, WebSocket stability).
Audio output sample rate verification: the assumed 24 kHz may differ; inspect the mime_type of received audio blobs during testing.
Session resumption: use send_client_content() to inject conversation summaries on reconnect, or implement Google’s SessionResumptionConfig.
Speech config: add speech_config to LiveConnectConfig to explicitly request a voice and output format.
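For the sample rate verification item, the rate can be read directly from the MIME type string. The sketch below is a minimal parser, assuming the MIME type has the form shown under audio format details (for example audio/pcm;rate=24000); pcm_sample_rate and ASSUMED_OUTPUT_RATE are hypothetical names.

```python
from typing import Optional

ASSUMED_OUTPUT_RATE = 24_000  # the rate this integration currently assumes


def pcm_sample_rate(mime_type: str) -> Optional[int]:
    """Extract the rate parameter from a MIME type, or None if absent."""
    _media_type, *params = mime_type.split(";")
    for param in params:
        key, _sep, value = param.strip().partition("=")
        if key.lower() == "rate" and value.isdigit():
            return int(value)
    return None


observed = pcm_sample_rate("audio/pcm;rate=24000")
if observed is not None and observed != ASSUMED_OUTPUT_RATE:
    print(f"Model output is {observed} Hz, expected {ASSUMED_OUTPUT_RATE} Hz")
```

Returning None for a missing or malformed rate lets the caller decide whether to fall back to the assumed value or log a warning.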