Diarization, translation, cloning & video

Beyond speech in and out, AiHummer offers four additional, opt-in media capabilities: speaker diarization, speech translation, voice cloning and video understanding. Each one runs offline on free, local engines, in a sidecar that is reached by URL.

[!NOTE] These are core/sidecar features, not SIP features. The SIP channel itself supports only STT, TTS, barge-in, DTMF and recording. Diarization, translation and voice cloning belong to core voice and are exposed at /v1/voice/*; video understanding is exposed at /v1/video/understand.

Speaker diarization

Diarization answers “who spoke when” in an audio clip with multiple speakers. It runs pyannote speaker-diarization-3.1, offline, in the diarize sidecar.

Item	Value
Engine	pyannote `speaker-diarization-3.1` (offline)
Sidecar port	`:8003`
Endpoint	`POST /v1/voice/diarize`
Wired with	`AIHUMMER_DIARIZE_URL`
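To make "who spoke when" concrete, here is a minimal sketch of post-processing a diarization result into talk time per speaker. The response schema is not documented on this page, so the segment shape (`speaker`, `start`, `end` in seconds) and the `talk_time` helper are assumptions for illustration only, not the actual `/v1/voice/diarize` response format.

```python
from collections import defaultdict

def talk_time(segments):
    """Sum speaking time in seconds per speaker label.

    ASSUMPTION: each segment is a dict with "speaker", "start" and "end"
    keys. This shape is hypothetical; check the real response before use.
    """
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

# Hypothetical diarization output for a nine-second, two-speaker clip.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.5},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 7.0},
    {"speaker": "SPEAKER_00", "start": 7.0, "end": 9.0},
]

print(talk_time(segments))  # {'SPEAKER_00': 6.5, 'SPEAKER_01': 2.5}
```

A per-speaker summary like this is one simple thing to build on top of the segments; the same segments could also be used to attribute transcript lines to speakers.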

Speech translation

Translation converts speech (or its transcript) from one language into another so an agent can serve callers across languages.

Item	Value
Endpoint	`POST /v1/voice/translate`

Voice cloning

Voice cloning synthesises speech in a target voice. It runs OpenVoice V2 with MeloTTS, which is offline and MIT-licensed, in the voiceclone sidecar.

Item	Value
Engine	OpenVoice V2 + MeloTTS (offline, MIT)
Sidecar port	`:8004`
Endpoint	`POST /v1/voice/clone`
Wired with	`AIHUMMER_VOICECLONE_URL`

Video understanding

Video understanding prepares a clip for the agent turn: ffmpeg demuxes the audio track and samples keyframes. There is no ML model in this sidecar; it only produces the audio and frames that the agent turn then works with.

Item	Value
Backed by	ffmpeg (demux + keyframes, no ML)
Sidecar port	`:8005`
Endpoint	`POST /v1/video/understand`
Wired with	`AIHUMMER_VIDEO_URL`
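The two ffmpeg steps named above (audio demux and keyframe sampling) could look roughly like the sketch below. These are not the sidecar's actual commands, which this page does not show. The function names, output paths and the mono 16 kHz audio format are assumptions; the ffmpeg options themselves are standard ones.

```python
def demux_audio_cmd(src, wav_out):
    """Build an ffmpeg argv that extracts the audio track as a WAV file."""
    # -vn drops the video stream. Mono 16 kHz is an ASSUMED target format,
    # chosen because it is a common input for speech-to-text engines.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", wav_out]

def keyframes_cmd(src, pattern):
    """Build an ffmpeg argv that writes one image per keyframe (I-frame)."""
    # select keeps only I-frames; -fps_mode vfr stops ffmpeg from duplicating
    # frames to fill the gaps. -fps_mode needs FFmpeg 5.1 or newer; older
    # builds use "-vsync vfr" instead.
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", "select='eq(pict_type,I)'",
        "-fps_mode", "vfr",
        pattern,
    ]

audio_cmd = demux_audio_cmd("clip.mp4", "clip.wav")
frames_cmd = keyframes_cmd("clip.mp4", "frame_%04d.jpg")
print(" ".join(audio_cmd))
print(" ".join(frames_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a host where ffmpeg is installed. Building the argv as a list rather than a shell string avoids quoting problems with file names.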

Wiring

Each capability is active only when its sidecar URL is set. The host-native installer can provision these, or you can point them at instances you already run.

# gateway.env — optional voice/video sidecars
AIHUMMER_DIARIZE_URL=http://127.0.0.1:8003
AIHUMMER_VOICECLONE_URL=http://127.0.0.1:8004
AIHUMMER_VIDEO_URL=http://127.0.0.1:8005
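The "active only when its URL is set" rule can be sketched as a small check over the environment. The variable names come from this page; the `enabled_capabilities` helper is hypothetical, and treating an empty or whitespace-only value as unset is an assumption about the gateway's behaviour rather than something documented here.

```python
import os

# Capability name -> environment variable, as listed in the tables on this page.
SIDECAR_VARS = {
    "diarize": "AIHUMMER_DIARIZE_URL",
    "voiceclone": "AIHUMMER_VOICECLONE_URL",
    "video": "AIHUMMER_VIDEO_URL",
}

def enabled_capabilities(env=None):
    """Return {capability: sidecar URL} for every sidecar whose URL is set.

    ASSUMPTION: an empty or whitespace-only value counts as "not set".
    """
    env = os.environ if env is None else env
    enabled = {}
    for name, var in SIDECAR_VARS.items():
        url = env.get(var, "").strip()
        if url:
            enabled[name] = url
    return enabled

# Only the diarize sidecar is configured in this example environment.
example_env = {
    "AIHUMMER_DIARIZE_URL": "http://127.0.0.1:8003",
    "AIHUMMER_VIDEO_URL": "",
}
print(enabled_capabilities(example_env))  # {'diarize': 'http://127.0.0.1:8003'}
```

Calling `enabled_capabilities()` with no argument reads the process environment, which can be handy as a quick sanity check after editing gateway.env.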

[!TIP] These sidecars are independent, so enable only the ones you need. A deployment that just needs transcription and synthesis can run the STT/TTS pair alone and leave diarize, voiceclone and video unconfigured.

Where to next

The default speech path: Speech in & out.
How sidecars are wired and shared: Sidecars.