AiHummer docs

Speech in & out (STT / TTS)

v1.0.x · updated 2026-06-26

AiHummer handles speech in and speech out of the box — no paid speech API required. An incoming audio clip is transcribed (STT), the text drives a normal agent turn, and the answer is spoken back (TTS). Both engines are free and local and run as sidecars: faster-whisper for transcription and edge-tts for synthesis.

How a voice turn works

A voice turn is a normal agent turn with audio on both ends:

  1. The gateway sends the inbound audio to the STT sidecar (:8001, faster-whisper) and gets back text.
  2. That text drives the usual function-calling turn — same router, orchestrator, tools and memory as a text message.
  3. The reply text is sent to the TTS sidecar (:8002, edge-tts) and the spoken audio is returned to the caller.

This is exposed as a single endpoint, POST /v1/voice/turn, so a channel or client can hand over audio and receive audio without orchestrating the steps itself.
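The three steps above can be sketched as one function. This is an illustration only, not the gateway's actual code: `transcribe`, `run_agent_turn` and `synthesize` are hypothetical callables standing in for the STT sidecar, the regular agent turn and the TTS sidecar.

```python
from typing import Callable


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],      # hypothetical: wraps the STT sidecar
    run_agent_turn: Callable[[str], str],    # hypothetical: the normal text turn
    synthesize: Callable[[str], bytes],      # hypothetical: wraps the TTS sidecar
) -> bytes:
    """Run one voice turn: audio in -> text -> agent reply -> audio out."""
    text = transcribe(audio_in)       # step 1: speech to text
    reply = run_agent_turn(text)      # step 2: same turn a text message would get
    return synthesize(reply)          # step 3: text to speech
```

Passing the three stages in as callables keeps the sketch independent of how the sidecars are actually called, which this page does not specify.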

The endpoints

| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/voice/turn | Audio in → agent turn → audio out |
| GET | /v1/voice/config | Current voice configuration (engines, tuning knobs) |

POST /v1/voice/turn HTTP/1.1
Host: localhost:8765
Content-Type: application/json

{
  "audio": "<base64-encoded clip>",
  "format": "wav"
}
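A client might build the request above along these lines. This is a minimal sketch using only the Python standard library; the helper name `build_voice_turn_request` is hypothetical, and the base URL simply mirrors the `Host` header in the example. The response format is not described on this page, so the sketch stops at building the request.

```python
import base64
import json
import urllib.request


def build_voice_turn_request(
    audio: bytes,
    fmt: str = "wav",
    base_url: str = "http://localhost:8765",  # assumed from the Host header above
) -> urllib.request.Request:
    """Build a POST /v1/voice/turn request carrying a base64-encoded clip."""
    payload = {
        "audio": base64.b64encode(audio).decode("ascii"),
        "format": fmt,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/voice/turn",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req)`; check the response shape against a running gateway before parsing it.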

[!NOTE] Speech features are part of core and are not tied to the SIP plugin. The SIP channel's local engine uses these same STT/TTS engines, but voice in/out works for any client that calls /v1/voice/turn.

The voice sidecars

| Port | Sidecar | Engine | Role | Wired with |
|---|---|---|---|---|
| 8001 | STT | faster-whisper | speech → text | AIHUMMER_STT_URL |
| 8002 | TTS | edge-tts | text → speech | AIHUMMER_TTS_URL |

Both engines are free and local: nothing is sent to a paid speech vendor. As with every sidecar, the gateway reaches them by URL, so you can run them next to the gateway or point at a shared instance.

Wiring

The host-native installer sets the STT and TTS URLs for you. The variables are:

# gateway.env — voice in/out (auto-set by the installer)
AIHUMMER_STT_URL=http://127.0.0.1:8001
AIHUMMER_TTS_URL=http://127.0.0.1:8002
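For tooling that needs to find the same sidecars, the two variables can be read as in the sketch below. The helper `sidecar_urls` is hypothetical, and falling back to the installer's values when a variable is unset is an assumption, not documented gateway behaviour.

```python
import os

# Assumed fallbacks: the values the installer writes in the snippet above.
DEFAULT_STT_URL = "http://127.0.0.1:8001"
DEFAULT_TTS_URL = "http://127.0.0.1:8002"


def sidecar_urls(env=None) -> dict:
    """Return the STT and TTS base URLs from the environment (or a given mapping)."""
    env = os.environ if env is None else env
    return {
        # Trailing slashes are stripped so paths can be appended uniformly.
        "stt": env.get("AIHUMMER_STT_URL", DEFAULT_STT_URL).rstrip("/"),
        "tts": env.get("AIHUMMER_TTS_URL", DEFAULT_TTS_URL).rstrip("/"),
    }
```

Pointing both variables at another host is all it takes to use a shared instance, for example `AIHUMMER_STT_URL=http://speech.internal:8001` (a made-up hostname).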

[!TIP] Because each sidecar is just a URL, several gateways can share one STT/TTS instance. Centralise the heavier speech services instead of running a copy per host.

Tuning the audio path

Voice quality knobs live in the “Media · Voice” settings group and are configurable from the web admin UI — no redeploy needed. They cover the real-time audio path:

| Knob | What it controls |
|---|---|
| Duplex | Full-duplex (listen while speaking) vs half-duplex |
| AEC | Acoustic echo cancellation |
| Noise suppression | Filters background noise before STT |
| AGC | Automatic gain control (level normalisation) |
| VAD threshold | Voice-activity detection sensitivity |
| Barge-in | Let the caller interrupt the agent mid-answer |

GET /v1/voice/config returns the effective configuration so a client can discover the active engines and these settings.
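A client could read the effective configuration roughly as follows. This sketch assumes the endpoint returns a JSON object and that the gateway listens on `localhost:8765` as in the request example; the field names in the response are not listed on this page, so none are assumed. `fetch_voice_config` and its `opener` parameter are hypothetical names.

```python
import json
import urllib.request


def fetch_voice_config(
    base_url: str = "http://localhost:8765",  # assumed gateway address
    opener=urllib.request.urlopen,            # injectable so it can be stubbed
) -> dict:
    """GET /v1/voice/config and decode the body as JSON (assumed format)."""
    with opener(f"{base_url}/v1/voice/config") as resp:
        return json.load(resp)
```

Inspect the returned dictionary on a live gateway to see which keys describe the engines and the tuning knobs.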

Where to next