AiHummer handles **speech in and speech out** out of the box, with no paid speech
API required. An incoming audio clip is transcribed (STT), the text drives a normal
agent turn, and the answer is spoken back (TTS). Both engines are **free** and
run locally as [sidecars](/en/v1.0/architecture/sidecars): faster-whisper for
transcription and edge-tts for synthesis.
## How a voice turn works
A voice turn is a normal agent turn with audio on both ends:
1. The gateway sends the inbound audio to the **STT sidecar** (`:8001`,
faster-whisper) and gets back text.
2. That text drives the usual function-calling turn — same router, orchestrator,
tools and memory as a text message.
3. The reply text is sent to the **TTS sidecar** (`:8002`, edge-tts) and the
spoken audio is returned to the caller.
This is exposed as a single endpoint, `POST /v1/voice/turn`, so a channel or
client can hand over audio and receive audio without orchestrating the steps
itself.
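The three steps can be read as a simple composition of functions. The sketch below is illustrative only and is not the gateway's actual code: `VoiceTurn` and its three callables are hypothetical names, and the stand-in lambdas replace the real STT sidecar, agent turn and TTS sidecar so the flow can run offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """Hypothetical model of a voice turn: STT, then a text turn, then TTS."""

    transcribe: Callable[[bytes], str]   # stands in for the STT sidecar (audio -> text)
    run_agent: Callable[[str], str]      # stands in for the normal agent turn (text -> reply)
    synthesize: Callable[[str], bytes]   # stands in for the TTS sidecar (text -> audio)

    def handle(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)    # step 1: speech -> text
        reply = self.run_agent(text)     # step 2: same turn as a text message
        return self.synthesize(reply)    # step 3: text -> speech


# Stand-ins so the sketch runs without any sidecar (assumptions, not real engines).
turn = VoiceTurn(
    transcribe=lambda audio: audio.decode("utf-8"),
    run_agent=lambda text: f"You said: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
)
result = turn.handle(b"hello")
```

Because the middle step only sees text, anything that already works for a text message (routing, tools, memory) applies unchanged to a voice turn.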
## The endpoints
| Method | Endpoint | Purpose |
|---|---|---|
| `POST` | `/v1/voice/turn` | Audio in → agent turn → audio out |
| `GET` | `/v1/voice/config` | Current voice configuration (engines, tuning knobs) |
```http
POST /v1/voice/turn HTTP/1.1
Host: localhost:8765
Content-Type: application/json

{
  "audio": "<base64-encoded clip>",
  "format": "wav"
}
```
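As a rough client-side sketch, the same request can be built with the Python standard library. The `audio` and `format` fields and the `localhost:8765` host come from the example above; the helper name `build_voice_turn_request` is hypothetical, and the response format is not described here, so the sketch stops at building the request.

```python
import base64
import json
import urllib.request


def build_voice_turn_request(
    audio: bytes,
    fmt: str = "wav",
    base_url: str = "http://localhost:8765",  # host taken from the example request
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/voice/turn request."""
    payload = {
        "audio": base64.b64encode(audio).decode("ascii"),  # clip as base64 text
        "format": fmt,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/voice/turn",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder bytes; a real client would read an actual audio file here.
req = build_voice_turn_request(b"fake-wav-bytes")
```

Sending it is then a matter of `urllib.request.urlopen(req)`. How the returned audio is encoded is not specified on this page, so treat the response body as opaque until you have checked it against your deployment.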
> [!NOTE]
> Speech features are part of **core** — they are not tied to the SIP plugin. The
> SIP channel uses these same STT/TTS engines for its `local` engine, but voice
> in/out works for any client that calls `/v1/voice/turn`.
## The voice sidecars
| Port | Sidecar | Engine | Role | Wired with |
|---|---|---|---|---|
| 8001 | STT | faster-whisper | speech → text | `AIHUMMER_STT_URL` |
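| 8002 | TTS | edge-tts | text → speech | `AIHUMMER_TTS_URL` |

Both sidecars are free and run locally: nothing is sent to a paid speech vendor.
Note that edge-tts itself synthesizes through Microsoft Edge's free online speech
service, so the TTS sidecar needs outbound network access. As with every sidecar,
the gateway reaches them **by URL**, so you can run them next to the gateway or
point at a shared instance.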
## Wiring
The host-native installer sets the STT and TTS URLs for you. The variables are:
```ini
# gateway.env — voice in/out (auto-set by the installer)
AIHUMMER_STT_URL=http://127.0.0.1:8001
AIHUMMER_TTS_URL=http://127.0.0.1:8002
```
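A tool that needs to find the sidecars can read the same two variables. The following is a minimal sketch, not the gateway's own loader: `sidecar_urls` is a hypothetical helper, and falling back to the installer's values when a variable is unset is an assumption made for the example.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Values the installer writes, per the snippet above. Using them as fallbacks
# is an assumption of this sketch, not documented gateway behaviour.
INSTALLER_VALUES = {
    "AIHUMMER_STT_URL": "http://127.0.0.1:8001",
    "AIHUMMER_TTS_URL": "http://127.0.0.1:8002",
}


def sidecar_urls(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the STT/TTS URLs from the environment, validating each one."""
    env = os.environ if env is None else env
    urls = {}
    for name, fallback in INSTALLER_VALUES.items():
        url = env.get(name, fallback)
        parts = urlsplit(url)
        # Reject values that are not absolute http(s) URLs.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"{name} is not a valid http(s) URL: {url!r}")
        urls[name] = url
    return urls


# Example: STT on a shared host (hypothetical hostname), TTS left unset.
shared = sidecar_urls({"AIHUMMER_STT_URL": "http://speech.internal:8001"})
```

Passing the environment in as a mapping keeps the helper easy to test and makes it obvious which values a given gateway would pick up.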
> [!TIP]
> Because each sidecar is just a URL, several gateways can share one STT/TTS
> instance. Centralise the heavier speech services instead of running a copy per
> host.
## Tuning the audio path
Voice quality knobs live in the **"Media · Voice"** settings group and are
configurable from the web admin UI — no redeploy needed. They cover the
real-time audio path:
| Knob | What it controls |
|---|---|
| Duplex | Full-duplex (listen while speaking) vs half-duplex |
| AEC | Acoustic echo cancellation |
| Noise suppression | Filters background noise before STT |
| AGC | Automatic gain control (level normalisation) |
| VAD threshold | Voice-activity detection sensitivity |
| Barge-in | Let the caller interrupt the agent mid-answer |

`GET /v1/voice/config` returns the effective configuration so a client can
discover the active engines and these settings.
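A client could read that configuration along these lines. This is a sketch under assumptions: the response is taken to be a JSON object, the base URL is the `localhost:8765` host used in the request example, `fetch_voice_config` is a hypothetical helper, and the key in the stubbed response is invented purely so the example runs offline.

```python
import io
import json
import urllib.request


def fetch_voice_config(base_url: str = "http://localhost:8765",
                       opener=urllib.request.urlopen) -> dict:
    """GET /v1/voice/config and decode the body, assumed here to be JSON."""
    with opener(f"{base_url}/v1/voice/config") as resp:
        return json.load(resp)


# Offline demonstration with a stubbed opener. The key and value below are
# made up; the real response schema is not described on this page.
def _fake_opener(url: str) -> io.BytesIO:
    return io.BytesIO(b'{"example_key": "example_value"}')


config = fetch_voice_config(opener=_fake_opener)
```

With the default `opener`, the same call performs a real HTTP request against a running gateway, so a client can inspect whatever keys its own deployment returns.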
## Where to next
- The transport model behind these services:
[Sidecars](/en/v1.0/architecture/sidecars).
- Diarization, translation, voice cloning and video:
[Diarization, translation & cloning](/en/v1.0/voice/diarization-translation-clone).
- Voice on a phone line:
[SIP telephony](/en/v1.0/channels/sip).