Speech-to-Text Pipeline with Whisper

Series 6 — Part 4 of 10

Lawyers dictate voice notes on WhatsApp instead of typing. The WhatsApp AI agent downloads the audio, transcribes it with a local Whisper service, and processes the text as if it had been typed. This article covers the full Whisper STT pipeline: download, transcription, language hint, and failure handling.

Downloading the Voice Note

function download_voice_note(string $mediaId, string $accessToken): string
{
    // Step 1: Get the download URL from Meta
    $metaUrl  = "https://graph.facebook.com/v19.0/{$mediaId}";
    $response = meta_api_get($metaUrl, $accessToken);
    $dlUrl    = $response['url'] ?? null;

    if (!$dlUrl) {
        throw new \RuntimeException("Could not retrieve media URL for {$mediaId}");
    }

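    // Step 2: Download the audio bytes before creating any temp file
    $audio = meta_api_download($dlUrl, $accessToken);

    // tempnam() creates the file it names, so rename it to add the extension
    // instead of appending '.ogg' and leaving an empty orphan file behind.
    $tmpBase = tempnam(sys_get_temp_dir(), 'wa_audio_');
    $tmpFile = $tmpBase . '.ogg';
    if ($tmpBase === false || !rename($tmpBase, $tmpFile)
        || file_put_contents($tmpFile, $audio) === false) {
        throw new \RuntimeException("Could not write audio for {$mediaId}");
    }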

    return $tmpFile;
}
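
The function above relies on two helpers, meta_api_get and meta_api_download, whose bodies are not shown in this article. The sketch below is one plausible way to write them with cURL, assuming the access token is sent as a Bearer header on both the metadata request and the download request. The shared helper meta_api_request is a hypothetical name introduced here, and the 30-second timeout is an arbitrary choice.

```php
<?php
// Assumed implementation: only the names meta_api_get() and
// meta_api_download() come from the article; the bodies are a sketch.

/** Hypothetical shared helper: authenticated GET that returns the raw body. */
function meta_api_request(string $url, string $accessToken): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER     => ["Authorization: Bearer {$accessToken}"],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30, // assumed value
    ]);

    $body     = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // curl_exec() returns false on transport errors (HTTP code is then 0).
    if ($body === false || $httpCode !== 200) {
        throw new \RuntimeException("Meta API request failed: HTTP {$httpCode}");
    }

    return $body;
}

/** Fetches media metadata and decodes the JSON response into an array. */
function meta_api_get(string $url, string $accessToken): array
{
    $data = json_decode(meta_api_request($url, $accessToken), true);

    if (!is_array($data)) {
        throw new \RuntimeException('Meta API returned a non-JSON response');
    }

    return $data;
}

/** Downloads the raw audio bytes from the URL returned by meta_api_get(). */
function meta_api_download(string $url, string $accessToken): string
{
    return meta_api_request($url, $accessToken);
}
```

Both helpers throw a RuntimeException on any failure, so download_voice_note can let errors propagate to a single handler further up the call chain.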

Local Whisper Transcription

function transcribe_with_whisper(string $audioFile, string $languageHint = 'hi'): string
{
    // Call the local Whisper HTTP service (Python, listening on port 9011)
    $ch = curl_init('http://localhost:9011/transcribe');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => [
            'file'     => new \CURLFile($audioFile, 'audio/ogg', 'audio.ogg'),
            'language' => $languageHint,
            'model'    => 'base',  // 'small' for better accuracy on legal terminology
        ],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 60,
    ]);

    $result   = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode !== 200 || !$result) {
        throw new \RuntimeException("Whisper transcription failed: HTTP {$httpCode}");
    }

    $data = json_decode($result, true);
    return trim($data['text'] ?? '');
}

// Cleanup: always unlink after transcription
try {
    $text = transcribe_with_whisper($tmpFile, $persona->preferredLanguage());
} finally {
    if (file_exists($tmpFile)) {
        unlink($tmpFile);
    }
}

Language Hint Injection

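Whisper is multilingual and can detect the spoken language on its own, but a language hint removes that guesswork and tends to help on short or noisy clips. For the WhatsApp AI agent, use the user's stored preferred_lang as the hint. If the user has no stored preference, default to 'hi' (Hindi) for Indian legal contexts. Whisper generally still copes with English speech under this hint, though it is worth verifying on real recordings from your users.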

Language hint values follow ISO 639-1 codes: 'en' for English, 'hi' for Hindi, 'pa' for Punjabi.
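
The fallback rule above can be sketched as a small resolver. This is a minimal illustration rather than the agent's actual code: the function name resolve_language_hint and the SUPPORTED_LANGUAGE_HINTS list are assumptions, and the list contains only the three codes mentioned in this article.

```php
<?php
// Assumed whitelist: only the codes named in the article.
const SUPPORTED_LANGUAGE_HINTS = ['en', 'hi', 'pa'];

/**
 * Hypothetical helper: turns a stored preferred_lang value into a
 * Whisper language hint, falling back to the default when the value
 * is missing or not in the whitelist.
 */
function resolve_language_hint(?string $preferredLang, string $default = 'hi'): string
{
    if ($preferredLang === null) {
        return $default;
    }

    // Keep only the primary subtag, so "en-IN" or "en_IN" becomes "en".
    $primary = preg_split('/[-_]/', trim($preferredLang))[0];
    $code    = strtolower($primary);

    return in_array($code, SUPPORTED_LANGUAGE_HINTS, true) ? $code : $default;
}
```

Validating against a whitelist keeps malformed stored values from reaching the transcription service, at the cost of having to extend the list whenever a new language is supported.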

What to Watch For

Model size vs accuracy tradeoff — Whisper 'base' transcribes at ~10x realtime on a Raspberry Pi. Whisper 'small' is more accurate but ~4x realtime. Legal terminology (case numbers, section references) benefits from 'small'.
STT failure handling — Never silently drop a voice note. If transcription fails, send the user a message: "I received your voice note but couldn't transcribe it. Could you type your message instead?"
Audio duration limit — WhatsApp voice notes can be up to 16 MB. Long recordings (>5 minutes) will cause Whisper timeouts on low-powered hardware. Set a duration limit and warn users who exceed it.
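
The failure-handling and duration-limit points above can be combined in one wrapper. The sketch below is an assumption-laden illustration: process_voice_note, probe_audio_duration, the 300-second limit, and the wording of the "too long" message are all introduced here, and the duration probe assumes ffprobe is installed on a POSIX host. The transcriber and the probe are passed in as callables so the wrapper stays independent of how they are implemented.

```php
<?php
// Assumed limit: 5 minutes, following the guideline above.
const MAX_VOICE_NOTE_SECONDS = 300;

// The first message is assumed wording; the second is quoted from the article.
const MSG_TOO_LONG   = 'Your voice note is longer than 5 minutes. Could you send a shorter one or type your message instead?';
const MSG_STT_FAILED = "I received your voice note but couldn't transcribe it. Could you type your message instead?";

/** Returns the duration in seconds via ffprobe, or null if it cannot be read. */
function probe_audio_duration(string $audioFile): ?float
{
    $cmd = 'ffprobe -v error -show_entries format=duration '
         . '-of default=noprint_wrappers=1:nokey=1 '
         . escapeshellarg($audioFile) . ' 2>/dev/null';
    $out = shell_exec($cmd);

    return is_string($out) && is_numeric(trim($out)) ? (float) trim($out) : null;
}

/**
 * Runs the duration check and the transcription, and never fails silently.
 *
 * @param callable $probe      fn(string $file): ?float  duration in seconds
 * @param callable $transcribe fn(string $file): string  transcript text
 * @return array{ok: bool, text: ?string, reply: ?string}
 */
function process_voice_note(string $audioFile, callable $probe, callable $transcribe): array
{
    try {
        $seconds = $probe($audioFile);
        if ($seconds !== null && $seconds > MAX_VOICE_NOTE_SECONDS) {
            return ['ok' => false, 'text' => null, 'reply' => MSG_TOO_LONG];
        }

        $text = trim((string) $transcribe($audioFile));
        if ($text === '') {
            // An empty transcript is treated the same as a failed one.
            return ['ok' => false, 'text' => null, 'reply' => MSG_STT_FAILED];
        }

        return ['ok' => true, 'text' => $text, 'reply' => null];
    } catch (\Throwable $e) {
        error_log('Voice note processing failed: ' . $e->getMessage());
        return ['ok' => false, 'text' => null, 'reply' => MSG_STT_FAILED];
    }
}
```

A call might look like process_voice_note($tmpFile, 'probe_audio_duration', fn ($f) => transcribe_with_whisper($f, 'hi')), with the caller sending reply back to the user whenever ok is false. The limit and the HTTP timeout need to agree: at the ~4x realtime quoted for 'small', a 5-minute note takes roughly 300 / 4 = 75 seconds, which exceeds the 60-second cURL timeout used earlier, so either the limit has to drop or the timeout has to rise.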
