Beyond [speech in and out](/en/v1.0/voice/stt-tts), AiHummer offers four
additional, opt-in media capabilities: **speaker diarization**, **speech
translation**, **voice cloning** and **video understanding**. Every one runs
**offline on free/local engines** as a [sidecar](/en/v1.0/architecture/sidecars)
reached by URL.
> [!NOTE]
> These are **core / sidecar features**, not SIP features. The SIP channel itself
> supports STT, TTS, barge-in, DTMF and recording only. Diarization, translation
> and voice cloning belong to core voice, exposed at `/v1/voice/*`; video
> understanding is exposed at `/v1/video/understand`.
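As a rough illustration of how a client might address these endpoints, the sketch below maps each capability to the path listed on this page and builds a `POST` request without sending it. The gateway base URL (`http://127.0.0.1:8080`), the `build_request` helper and the raw-bytes body are assumptions made for the example; this page does not specify the request format.

```python
from urllib.request import Request

# Endpoint paths as listed on this page.
ENDPOINTS = {
    "diarize": "/v1/voice/diarize",
    "translate": "/v1/voice/translate",
    "clone": "/v1/voice/clone",
    "video": "/v1/video/understand",
}


def build_request(base_url: str, capability: str, body: bytes) -> Request:
    """Build (but do not send) a POST request for one capability.

    Hypothetical helper: the body encoding and content type are assumptions,
    not the documented AiHummer request format.
    """
    url = base_url.rstrip("/") + ENDPOINTS[capability]
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )


# The base URL below is a placeholder for wherever your gateway listens.
req = build_request("http://127.0.0.1:8080/", "diarize", b"<audio bytes>")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; check the API reference for the actual payload each endpoint expects before doing so.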
## Speaker diarization
Diarization answers "who spoke when" in an audio clip with multiple speakers. It
runs **pyannote `speaker-diarization-3.1`**, offline, in the diarize sidecar.
| Item | Value |
|---|---|
| Engine | pyannote `speaker-diarization-3.1` (offline) |
| Sidecar port | `:8003` |
| Endpoint | `POST /v1/voice/diarize` |
| Wired with | `AIHUMMER_DIARIZE_URL` |
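To make "who spoke when" concrete, here is a minimal sketch that totals talk time per speaker from a list of diarization segments. The segment shape (`speaker`, `start`, `end` in seconds) is an assumption for illustration and is not the documented response of `POST /v1/voice/diarize`; the `SPEAKER_00` style labels follow pyannote's usual naming.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum the duration of each speaker's segments, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Hypothetical segments, shaped as an assumption about the diarizer output.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker": "SPEAKER_01", "start": 4.0, "end": 6.5},
    {"speaker": "SPEAKER_00", "start": 6.5, "end": 8.0},
]

print(talk_time(segments))  # {'SPEAKER_00': 5.5, 'SPEAKER_01': 2.5}
```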
## Speech translation
Translation converts speech (or its transcript) from one language into another so
an agent can serve callers across languages.
| Item | Value |
|---|---|
| Endpoint | `POST /v1/voice/translate` |
## Voice cloning
Voice cloning synthesises speech in a target voice. It runs **OpenVoice V2 with
MeloTTS** — offline and **MIT-licensed** — in the voiceclone sidecar.
| Item | Value |
|---|---|
| Engine | OpenVoice V2 + MeloTTS (offline, MIT) |
| Sidecar port | `:8004` |
| Endpoint | `POST /v1/voice/clone` |
| Wired with | `AIHUMMER_VOICECLONE_URL` |
## Video understanding
Video understanding prepares a clip for the agent by demuxing its audio and
sampling keyframes with **ffmpeg**. There is **no ML model** in this sidecar; it
prepares audio and frames for the agent turn.
| Item | Value |
|---|---|
| Backed by | ffmpeg (demux + keyframes, no ML) |
| Sidecar port | `:8005` |
| Endpoint | `POST /v1/video/understand` |
| Wired with | `AIHUMMER_VIDEO_URL` |
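The demux and frame-sampling steps can be approximated with two plain ffmpeg invocations. The sketch below only builds the argument lists; the specific choices (16 kHz mono WAV, one frame per second via the `fps` filter, the file names) are assumptions for illustration and not necessarily what the video sidecar runs.

```python
def demux_audio_cmd(src: str, out_wav: str) -> list[str]:
    """ffmpeg arguments to extract the audio track as 16 kHz mono."""
    # -vn drops the video stream; -ac 1 downmixes to mono; -ar sets the sample rate.
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1", "-ar", "16000", out_wav]


def sample_frames_cmd(src: str, out_pattern: str, fps: int = 1) -> list[str]:
    """ffmpeg arguments to write `fps` still frames per second of video."""
    # The fps video filter resamples the stream to the requested frame rate.
    return ["ffmpeg", "-i", src, "-vf", f"fps={fps}", out_pattern]


print(demux_audio_cmd("clip.mp4", "audio.wav"))
print(sample_frames_cmd("clip.mp4", "frame_%04d.jpg"))
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a host with ffmpeg installed.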
## Wiring
Each capability is active only when its sidecar URL is set. The host-native
installer can provision the sidecars, or you can point each URL at an instance
you already run.
```ini
# gateway.env — optional voice/video sidecars
AIHUMMER_DIARIZE_URL=http://127.0.0.1:8003
AIHUMMER_VOICECLONE_URL=http://127.0.0.1:8004
AIHUMMER_VIDEO_URL=http://127.0.0.1:8005
```
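The "active only when its URL is set" rule can be sketched as a small check over the environment. The variable names come from the block above; the `enabled_sidecars` helper and the choice to treat blank values as unset are assumptions for illustration, not the gateway's actual logic.

```python
import os

# Environment variables named on this page, keyed by sidecar.
SIDECAR_VARS = {
    "diarize": "AIHUMMER_DIARIZE_URL",
    "voiceclone": "AIHUMMER_VOICECLONE_URL",
    "video": "AIHUMMER_VIDEO_URL",
}


def enabled_sidecars(env=os.environ):
    """Return {sidecar: url} for every sidecar whose URL variable is non-empty."""
    return {
        name: env[var].strip()
        for name, var in SIDECAR_VARS.items()
        if env.get(var, "").strip()
    }


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "AIHUMMER_DIARIZE_URL": "http://127.0.0.1:8003",
    "AIHUMMER_VIDEO_URL": "",
}
print(enabled_sidecars(example_env))  # {'diarize': 'http://127.0.0.1:8003'}
```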
> [!TIP]
> These sidecars are independent — enable only the ones you need. A deployment
> that just needs transcription and synthesis can run the STT/TTS pair alone and
> leave diarize, voiceclone and video unconfigured.
## Where to next
- The default speech path:
[Speech in & out](/en/v1.0/voice/stt-tts).
- How sidecars are wired and shared:
[Sidecars](/en/v1.0/architecture/sidecars).