Series 9 — Part 2 of 5
FFmpeg is the Swiss army knife of audio and video processing. For voice AI pipelines, you need one core conversion (WAV → OGG/OPUS) and a handful of diagnostic techniques. This article covers the full command, why each flag matters, bitrate choices for voice, and codec availability checks.
The WAV → OGG/OPUS Command — Annotated
opus_args=(
  -i input.wav            # Input file
  -c:a libopus            # Audio codec: Opus (requires libopus compiled in)
  -b:a 48k                # Target bitrate: 48 kbps, the voice sweet spot
  -vbr on                 # Variable bitrate: adapts to signal complexity
  -compression_level 10   # Encoding effort: 10 = maximum (slower, best quality at the target bitrate)
  -frame_duration 20      # Frame size in ms: 20 ms balances latency and quality
  -ar 48000               # Sample rate: 48000 Hz (Opus native rate)
  -ac 1                   # Channels: 1 (mono; voice needs no stereo)
  -application voip       # Codec optimisation: voip = optimised for speech
  -f ogg                  # Output container: Ogg (the usual container for Opus voice notes)
)
# A bash array lets each flag carry a comment; backslash continuations cannot.
ffmpeg "${opus_args[@]}" output.ogg
Bitrate Choices for Voice
| Use case | Codec | Bitrate | File size (1 min) | Quality |
|---|---|---|---|---|
| WhatsApp voice note | Opus/OGG | 48k VBR | ~360 KB | Excellent for voice |
| Podcast quality audio | MP3 | 128k CBR | ~960 KB | High, but overkill for voice |
| Voicemail quality | Opus/OGG | 16k VBR | ~120 KB | Acceptable, intelligible |
| Lowest viable voice | Opus/OGG | 8k VBR | ~60 KB | Recognisable but degraded |
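The file sizes in the table follow from simple arithmetic: kilobits per second times duration in seconds, divided by 8 to get kilobytes. A minimal sketch of that calculation follows. `estimate_kb` is a hypothetical helper name, and the result ignores Ogg container overhead and VBR fluctuation, so treat it as a rough figure rather than an exact prediction.

```bash
#!/usr/bin/env bash
# estimate_kb BITRATE_KBPS DURATION_SECONDS
# Rough audio payload size in KB (1 KB = 1000 bytes, matching the table).
# Hypothetical helper: ignores container overhead and VBR variation.
estimate_kb() {
  local kbps=$1 seconds=$2
  echo $(( kbps * seconds / 8 ))
}

estimate_kb 48 60   # one minute at 48 kbps
estimate_kb 16 60   # one minute at 16 kbps
```

The same function scales to longer clips: a five-minute note at 48 kbps is `estimate_kb 48 300`.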
Detecting Actual Format — Don't Trust the Extension
# The 'file' command reads magic bytes, not the extension
file input.wav
# → RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 22050 Hz
# → OR: Ogg data, Opus audio (Kokoro sometimes changes format without warning)
# In a script
input="input.wav"
format=$(file --mime-type -b "$input")
if [[ "$format" != "audio/x-wav" && "$format" != "audio/wav" ]]; then
  echo "Unexpected format: $format, aborting" >&2
  exit 1
fi
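On minimal container images the `file` utility may not be installed. As a fallback, the same magic-byte check can be sketched directly in bash. `detect_audio_format` is a hypothetical helper, and it identifies only the container (RIFF/WAVE or Ogg), not the codec inside it.

```bash
#!/usr/bin/env bash
# detect_audio_format PATH
# Prints "wav", "ogg", or "unknown" based on the file's leading bytes.
# Hypothetical helper for hosts without the 'file' utility.
detect_audio_format() {
  local path=$1 magic form
  # tr strips NUL bytes so bash does not warn about them in the substitution
  magic=$(head -c 4 "$path" 2>/dev/null | tr -d '\0')
  case "$magic" in
    RIFF)
      # RIFF is a generic container; bytes 8-11 must read "WAVE" for WAV audio
      form=$(dd if="$path" bs=1 skip=8 count=4 2>/dev/null | tr -d '\0')
      if [[ "$form" == "WAVE" ]]; then echo wav; else echo unknown; fi
      ;;
    OggS)
      # Ogg container; the payload may be Opus or another codec
      echo ogg
      ;;
    *)
      echo unknown
      ;;
  esac
}
```

A caller could then branch on the result, for example converting when it prints `wav` and passing the file through untouched when it prints `ogg`.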
Codec Availability Verification
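# Check libopus is available
ffmpeg -codecs 2>/dev/null | grep -i opus
# Should show: DEA.L. opus Opus (Opus Interactive Audio Codec) (decoders: opus libopus) (encoders: opus libopus)
# What matters for this pipeline is that libopus appears in the encoders list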
# Check libmp3lame (for MP3 output)
ffmpeg -codecs 2>/dev/null | grep -i mp3lame
# Test the conversion pipeline end-to-end before deploying (-y so a leftover /tmp/test.ogg cannot block it)
ffmpeg -y -i /tmp/test.wav -c:a libopus -b:a 48k -f ogg /tmp/test.ogg && echo "Conversion OK"
file /tmp/test.ogg # Confirm format
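For scripted deployments, the grep checks above can be turned into a preflight step that fails fast. The sketch below is one way to do it, assuming the usual `ffmpeg -encoders` layout where the encoder name is the second column. `has_encoder` and `preflight` are hypothetical names, not part of FFmpeg.

```bash
#!/usr/bin/env bash
# has_encoder NAME
# Reads `ffmpeg -encoders` output on stdin; succeeds if NAME is listed.
# Matches the second column exactly, so "opus" does not match "libopus".
has_encoder() {
  awk -v enc="$1" '$2 == enc { found = 1 } END { exit !found }'
}

# preflight NAME...
# Hypothetical startup check: report the first missing encoder and fail.
preflight() {
  local enc
  for enc in "$@"; do
    if ! ffmpeg -hide_banner -encoders 2>/dev/null | has_encoder "$enc"; then
      echo "Missing FFmpeg encoder: $enc" >&2
      return 1
    fi
  done
}
```

A deploy script might call `preflight libopus libmp3lame || exit 1` before accepting any work. Matching on the exact column avoids the false positive where a build ships only the native `opus` encoder.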
What to Watch For
- Input sample rate mismatch: Kokoro may output WAV at 22050 Hz. libopus encodes only at 8, 12, 16, 24, or 48 kHz, so the -ar 48000 flag makes the resample to 48 kHz explicit. Without it, the output rate is left to FFmpeg's automatic negotiation, and you may get playback issues on some devices.
- The -y flag: -y overwrites the output file without asking. Always use it in non-interactive scripts. Otherwise, when the output file already exists, FFmpeg either waits for a confirmation that never comes or refuses to overwrite and exits.
- Silencing stderr: FFmpeg is verbose and writes its log to stderr. In production you can discard it with ffmpeg ... 2>/dev/null, but capture stderr to a file in debug mode or you'll miss important error messages.
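The three points above can be combined into one wrapper. This is a minimal sketch, not a definitive implementation: `convert_voice` is a hypothetical function, and `FFMPEG` and `FFMPEG_LOG` are assumed environment overrides introduced here for illustration. It always passes -y and -ar 48000, and sends stderr to /dev/null unless a log path is set.

```bash
#!/usr/bin/env bash
# convert_voice INPUT OUTPUT
# Sketch of a non-interactive WAV -> Ogg/Opus conversion wrapper.
# FFMPEG (binary to run) and FFMPEG_LOG (stderr destination) are
# hypothetical overrides; both have safe defaults.
convert_voice() {
  local src=$1 dst=$2
  local log=${FFMPEG_LOG:-/dev/null}   # set FFMPEG_LOG=/path/to/file when debugging
  # -y: never prompt before overwriting; -nostdin: never read from the terminal;
  # -ar 48000: resample explicitly instead of relying on automatic negotiation
  if ! "${FFMPEG:-ffmpeg}" -y -nostdin -i "$src" \
        -c:a libopus -b:a 48k -ar 48000 -ac 1 -f ogg "$dst" 2>"$log"; then
    echo "Conversion failed for $src (stderr sent to $log)" >&2
    return 1
  fi
}
```

In debug mode, something like `FFMPEG_LOG=/tmp/ffmpeg.log convert_voice in.wav out.ogg` keeps the full FFmpeg output for inspection, while production calls leave the variable unset.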