Speech in & out (STT / TTS)

AiHummer handles speech in and speech out of the box, with no paid speech API required. An incoming audio clip is transcribed (STT), the transcript drives a normal agent turn, and the answer is spoken back (TTS). Both engines are free and run as self-hosted sidecars: faster-whisper for transcription and edge-tts for synthesis.

How a voice turn works

A voice turn is a normal agent turn with audio on both ends:

1. The gateway sends the inbound audio to the STT sidecar (:8001, faster-whisper) and gets back text.
2. That text drives the usual function-calling turn, with the same router, orchestrator, tools and memory as a text message.
3. The reply text is sent to the TTS sidecar (:8002, edge-tts) and the spoken audio is returned to the caller.

This is exposed as a single endpoint, POST /v1/voice/turn, so a channel or client can hand over audio and receive audio without orchestrating the steps itself.
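The three steps can be sketched as a simple composition. This is only an illustration of the flow, not the gateway's actual code: the `VoiceTurn` class and its three callables are hypothetical names, and the stubs below stand in for the STT sidecar, the agent turn and the TTS sidecar.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """Hypothetical container for the three stages of a voice turn."""

    transcribe: Callable[[bytes], str]   # stands in for the STT sidecar call
    run_agent: Callable[[str], str]      # stands in for the normal agent turn
    synthesize: Callable[[str], bytes]   # stands in for the TTS sidecar call

    def handle(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)   # step 1: audio -> text
        reply = self.run_agent(text)       # step 2: text -> reply text
        return self.synthesize(reply)      # step 3: reply text -> audio


# Stub stages so the sketch runs without any sidecar.
turn = VoiceTurn(
    transcribe=lambda audio: audio.decode("utf-8"),
    run_agent=lambda text: f"You said: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
)
audio_out = turn.handle(b"hello")
```

Because the middle stage only sees text, anything that works for a text message would work unchanged for a voice turn.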

The endpoints

| Method | Endpoint | Purpose |
| --- | --- | --- |
| `POST` | `/v1/voice/turn` | Audio in → agent turn → audio out |
| `GET` | `/v1/voice/config` | Current voice configuration (engines, tuning knobs) |

POST /v1/voice/turn HTTP/1.1
Host: localhost:8765
Content-Type: application/json

{
  "audio": "<base64-encoded clip>",
  "format": "wav"
}
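A client could build the same request in Python with the standard library alone. This is a minimal sketch: the host, port, path and the two JSON fields come from the request above, while the helper name `build_voice_turn_request` is made up for illustration. The response format is not specified here, so the sketch stops at building the request and leaves the reply unparsed.

```python
import base64
import json
import urllib.request

GATEWAY_URL = "http://localhost:8765"  # host and port from the sample request


def build_voice_turn_request(clip: bytes, fmt: str = "wav") -> urllib.request.Request:
    """Wrap raw audio bytes in the JSON body shown in the sample request."""
    body = json.dumps(
        {
            "audio": base64.b64encode(clip).decode("ascii"),
            "format": fmt,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"{GATEWAY_URL}/v1/voice/turn",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_voice_turn_request(b"fake-wav-bytes")
# To send it against a running gateway:
#   raw_reply = urllib.request.urlopen(request, timeout=60).read()
```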

[!NOTE] Speech features are part of core and are not tied to the SIP plugin. The SIP channel's local engine uses these same STT/TTS engines, but voice in/out works for any client that calls /v1/voice/turn.

The voice sidecars

| Port | Sidecar | Engine | Role | Wired with |
| --- | --- | --- | --- | --- |
| 8001 | STT | faster-whisper | speech → text | `AIHUMMER_STT_URL` |
| 8002 | TTS | edge-tts | text → speech | `AIHUMMER_TTS_URL` |

Both sidecars are free and self-hosted, so no audio is sent to a paid speech vendor. As with every sidecar, the gateway reaches them by URL, so you can run them next to the gateway or point the gateway at a shared instance.

Wiring

The host-native installer sets the STT and TTS URLs for you. The variables are:

# gateway.env — voice in/out (auto-set by the installer)
AIHUMMER_STT_URL=http://127.0.0.1:8001
AIHUMMER_TTS_URL=http://127.0.0.1:8002
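If you edit these values by hand, a small script can sanity-check them before restarting the gateway. The sketch below is an external helper, not how the gateway itself reads its configuration: `sidecar_endpoint` is a hypothetical function, and `speech.internal` is a made-up hostname standing in for a shared instance.

```python
import os
from typing import Mapping, Tuple
from urllib.parse import urlsplit


def sidecar_endpoint(
    var: str, default: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, int]:
    """Return (host, port) for a sidecar URL variable, or raise ValueError."""
    url = env.get(var, default)
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or parts.hostname is None:
        raise ValueError(f"{var} is not a valid HTTP URL: {url!r}")
    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


# Unset variable: falls back to the installer's value shown above.
stt = sidecar_endpoint("AIHUMMER_STT_URL", "http://127.0.0.1:8001", env={})

# Variable pointing at a shared instance (hypothetical hostname).
tts = sidecar_endpoint(
    "AIHUMMER_TTS_URL",
    "http://127.0.0.1:8002",
    env={"AIHUMMER_TTS_URL": "http://speech.internal:8002"},
)
```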

[!TIP] Because each sidecar is just a URL, several gateways can share one STT/TTS instance. Centralise the heavier speech services instead of running a copy per host.

Tuning the audio path

Voice quality knobs live in the “Media · Voice” settings group and are configurable from the web admin UI — no redeploy needed. They cover the real-time audio path:

| Knob | What it controls |
| --- | --- |
| Duplex | Full-duplex (listen while speaking) vs half-duplex |
| AEC | Acoustic echo cancellation |
| Noise suppression | Filters background noise before STT |
| AGC | Automatic gain control (level normalisation) |
| VAD threshold | Voice-activity detection sensitivity |
| Barge-in | Let the caller interrupt the agent mid-answer |
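To make the VAD threshold concrete, here is a generic energy-based detector. It is not AiHummer's detector, whose algorithm is not described here. It only illustrates the trade-off a threshold controls: a lower value treats quieter frames as speech, a higher value ignores them. The function names and sample values are illustrative.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def is_speech(frame: Sequence[float], threshold: float) -> bool:
    """Classify a frame as speech when its RMS level reaches the threshold."""
    return rms(frame) >= threshold


quiet_frame = [0.01, -0.01, 0.01, -0.01]   # background-noise level
loud_frame = [0.5, -0.5, 0.5, -0.5]        # speech-like level

strict = (is_speech(quiet_frame, 0.1), is_speech(loud_frame, 0.1))
sensitive = (is_speech(quiet_frame, 0.005), is_speech(loud_frame, 0.005))
```

With the stricter threshold only the loud frame counts as speech; with the more sensitive one both do, which is the kind of behaviour that can cause false triggers on a noisy line.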

GET /v1/voice/config returns the effective configuration so a client can discover the active engines and these settings.
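A client might read that configuration as follows. This is a sketch under two assumptions: that the gateway listens on localhost:8765 as in the request sample, and that the endpoint replies with a JSON object. The response schema is not documented here, so the keys in the offline demo are invented placeholders.

```python
import io
import json
import urllib.request
from typing import Any, Callable, Dict


def fetch_voice_config(
    base_url: str = "http://localhost:8765",
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Dict[str, Any]:
    """GET /v1/voice/config and decode the body, assuming it is JSON."""
    with opener(f"{base_url}/v1/voice/config", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline demo: a stub opener replaces the network call.
# The payload is a placeholder, not the real response shape.
def stub_opener(url: str, timeout: float) -> io.BytesIO:
    return io.BytesIO(b'{"example_key": "example_value"}')


config = fetch_voice_config(opener=stub_opener)
```

Against a running gateway, calling `fetch_voice_config()` with no arguments would perform the real request.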

Where to next

The transport model behind these services: Sidecars.
Diarization, translation, voice cloning and video: Diarization, translation & cloning.
Voice on a phone line: SIP telephony.