AiHummer docs

Diarization, translation, cloning & video

v1.0.x · updated 2026-06-26

Beyond speech in and out, AiHummer offers four additional, opt-in media capabilities: speaker diarization, speech translation, voice cloning and video understanding. Each runs offline on free, local engines, as a sidecar reached by URL.

[!NOTE] These are core / sidecar features, not SIP features. The SIP channel itself supports STT, TTS, barge-in, DTMF and recording only — diarization, translation and voice cloning belong to core voice, exposed at /v1/voice/*.

Speaker diarization

Diarization answers “who spoke when” in an audio clip with multiple speakers. It runs pyannote speaker-diarization-3.1, offline, in the diarize sidecar.

| Item | Value |
| --- | --- |
| Engine | pyannote speaker-diarization-3.1 (offline) |
| Sidecar port | :8003 |
| Endpoint | POST /v1/voice/diarize |
| Wired with | AIHUMMER_DIARIZE_URL |
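
As an illustration of what "who spoke when" data is good for, the sketch below totals speaking time per speaker. The response shape of POST /v1/voice/diarize is not documented on this page, so the `(speaker, start, end)` tuples, the `talk_time` helper and the `SPEAKER_00`-style labels are all assumptions made for the example.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum speaking time in seconds per speaker.

    `segments` is an iterable of (speaker, start, end) tuples. This shape is
    hypothetical; adapt it to whatever the diarize endpoint actually returns.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {speaker} {start} {end}")
        totals[speaker] += end - start
    return dict(totals)


# Made-up segments for a two-speaker clip (times in seconds).
segments = [
    ("SPEAKER_00", 0.0, 4.0),
    ("SPEAKER_01", 4.0, 6.5),
    ("SPEAKER_00", 6.5, 9.0),
]

print(talk_time(segments))
```

The same per-segment data could be aligned with a transcript to attribute each utterance to a speaker.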

Speech translation

Translation converts speech (or its transcript) from one language into another so an agent can serve callers across languages.

| Item | Value |
| --- | --- |
| Endpoint | POST /v1/voice/translate |

Voice cloning

Voice cloning synthesises speech in a target voice. It runs OpenVoice V2 with MeloTTS — offline and MIT-licensed — in the voiceclone sidecar.

| Item | Value |
| --- | --- |
| Engine | OpenVoice V2 + MeloTTS (offline, MIT) |
| Sidecar port | :8004 |
| Endpoint | POST /v1/voice/clone |
| Wired with | AIHUMMER_VOICECLONE_URL |

Video understanding

Video understanding prepares a clip for the agent turn by demuxing its audio and sampling keyframes with ffmpeg. There is no ML model in this sidecar; it only produces the audio and frames that the agent then works with.

| Item | Value |
| --- | --- |
| Backed by | ffmpeg (demux + keyframes, no ML) |
| Sidecar port | :8005 |
| Endpoint | POST /v1/video/understand |
| Wired with | AIHUMMER_VIDEO_URL |
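
To make "demux + keyframes" concrete, here is a minimal sketch of two ffmpeg invocations that would do that kind of work. The sidecar's real command lines are not documented here. The helper names, the file names and the 16 kHz mono WAV target are assumptions. The code only builds and prints the commands; it does not run ffmpeg.

```python
import shlex


def demux_audio_cmd(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that drops the video stream (-vn) and writes mono WAV."""
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                    # no video in the output
        "-ac", "1",               # downmix to one channel (assumed target)
        "-ar", str(sample_rate),  # resample (16 kHz is an assumption)
        wav_path,
    ]


def keyframes_cmd(video_path, frame_pattern):
    """Build an ffmpeg command that decodes only keyframes and writes one image each."""
    return [
        "ffmpeg", "-y",
        "-skip_frame", "nokey",   # input option: decoder skips non-keyframes
        "-i", video_path,
        "-fps_mode", "vfr",       # do not duplicate frames; older builds use -vsync vfr
        frame_pattern,
    ]


print(shlex.join(demux_audio_cmd("clip.mp4", "clip.wav")))
print(shlex.join(keyframes_cmd("clip.mp4", "frame_%04d.jpg")))
```

Passing the argument lists to `subprocess.run(cmd, check=True)` would execute them on a host where ffmpeg is installed.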

Wiring

Each capability is active only when its sidecar URL is set. The host-native installer can provision the sidecars, or you can point the URLs at instances you already run.

# gateway.env — optional voice/video sidecars
AIHUMMER_DIARIZE_URL=http://127.0.0.1:8003
AIHUMMER_VOICECLONE_URL=http://127.0.0.1:8004
AIHUMMER_VIDEO_URL=http://127.0.0.1:8005
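
The "active only when its URL is set" rule can be checked with a few lines of Python. The sketch below is a hypothetical helper, not part of AiHummer. It parses gateway.env-style text and reports which optional sidecars are configured, treating an empty value as unset, which is an assumption about how the gateway reads the file.

```python
# Variable names are taken from the wiring section above.
SIDECAR_VARS = {
    "diarize": "AIHUMMER_DIARIZE_URL",
    "voiceclone": "AIHUMMER_VOICECLONE_URL",
    "video": "AIHUMMER_VIDEO_URL",
}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def enabled_sidecars(env):
    """Return {capability: url} for every sidecar whose URL is non-empty."""
    return {name: env[var] for name, var in SIDECAR_VARS.items() if env.get(var)}


example = """\
# gateway.env (only diarization wired)
AIHUMMER_DIARIZE_URL=http://127.0.0.1:8003
AIHUMMER_VIDEO_URL=
"""

print(enabled_sidecars(parse_env(example)))
```

With the sample text, only the diarize entry is reported; voiceclone is absent and video is empty, so both count as unconfigured.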

[!TIP] These sidecars are independent — enable only the ones you need. A deployment that just needs transcription and synthesis can run the STT/TTS pair alone and leave diarize, voiceclone and video unconfigured.

Where to next