Text-to-Speech with Kokoro TTS — Govind Preet Singh

Series 6 — Part 5 of 10

The WhatsApp AI agent sends voice note responses using Kokoro TTS. Three production bugs lurk in this pipeline: the WAV-not-MP3 trap (Kokoro always outputs WAV regardless of the format you request), the UTF-8 corruption bug (a missing /u flag on regexes that touch multibyte text), and the audio type classification problem. This article covers all three.

Kokoro TTS Integration

function generate_tts(string $text, string $voice = 'af_heart', string $format = 'wav'): string
{
    $payload = json_encode([
        'model'  => 'kokoro',
        'input'  => prepareText($text),
        'voice'  => $voice,
        'response_format' => $format,  // TRAP: Kokoro ignores this and always returns WAV
        'speed'  => 1.0,
    ]);

    $ch = curl_init('http://localhost:9010/v1/audio/speech');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $payload,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);

    $audioBytes = curl_exec($ch);
    $httpCode   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode !== 200 || !$audioBytes) {
        throw new \RuntimeException("Kokoro TTS failed: HTTP {$httpCode}");
    }

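    // The WAV-not-MP3 trap: detect the actual format regardless of what we requested.
    // Given the behaviour described above, this is 'wav' even if you requested 'mp3'.
    $actualFormat = detect_audio_format($audioBytes);
    if ($actualFormat === 'unknown') {
        throw new \RuntimeException('Kokoro TTS returned an unrecognised audio format');
    }

    return $audioBytes;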
}

function detect_audio_format(string $bytes): string
{
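    // WAV files are RIFF containers with 'WAVE' at byte offset 8 (AVI and WebP are RIFF too)
    if (str_starts_with($bytes, 'RIFF') && substr($bytes, 8, 4) === 'WAVE') return 'wav';
    // MP3 files start with an ID3 tag or an MPEG Layer III frame header:
    // 11 sync bits set and layer bits 01 (0xFF 0xFB is only the most common variant)
    if (str_starts_with($bytes, 'ID3')
        || (strlen($bytes) >= 2 && $bytes[0] === "\xFF" && (ord($bytes[1]) & 0xE6) === 0xE2)) return 'mp3';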
    // OGG files start with 'OggS'
    if (str_starts_with($bytes, 'OggS')) return 'ogg';
    return 'unknown';
}
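
Once the real format is known, the file extension and content type used downstream should follow the detected bytes rather than the requested format. A minimal sketch, assuming a hypothetical helper named audio_file_info() that is not part of the original pipeline and that consumes the labels returned by detect_audio_format():

```php
<?php
// Hypothetical helper: map a detected format label to a file extension and MIME type.
// The labels match those returned by detect_audio_format(). The MIME strings are the
// commonly used ones; check which types your messaging API actually accepts.
function audio_file_info(string $detectedFormat): array
{
    $map = [
        'wav' => ['extension' => 'wav', 'mime' => 'audio/wav'],
        'mp3' => ['extension' => 'mp3', 'mime' => 'audio/mpeg'],
        'ogg' => ['extension' => 'ogg', 'mime' => 'audio/ogg'],
    ];

    if (!isset($map[$detectedFormat])) {
        throw new \InvalidArgumentException("Unsupported audio format: {$detectedFormat}");
    }

    return $map[$detectedFormat];
}

// Usage: name the file after what the bytes are, not after what was requested.
$info     = audio_file_info('wav');
$filename = 'reply.' . $info['extension'];
```

Throwing on an unknown label keeps a mislabelled file from reaching the upload step silently.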

The prepareText() Function

function prepareText(string $input): string
{
    // THE UTF-8 BUG: without the /u flag PCRE matches bytes, not characters, so
    // character classes, \b and \W can split multibyte characters apart
    // BAD:  preg_replace('/\*([^*]+)\*/', '$1', $input)   (no /u: byte-wise matching)
    // GOOD: preg_replace('/\*([^*]+)\*/u', '$1', $input)  (use /u on any regex that touches non-ASCII content)

    // Convert bullet lists to spoken connectives first, so that '*' bullets
    // are not mistaken for the opening marker of *bold* text
    $text = preg_replace('/^\s*[-•*]\s+/mu', 'Next, ', $input);
    $text = preg_replace('/^\s*\d+\.\s+/mu', 'Step: ', $text);

    // Strip WhatsApp markdown
    $text = preg_replace('/\*([^*]+)\*/u', '$1', $text);      // *bold*
    $text = preg_replace('/_([^_]+)_/u',  '$1', $text);       // _italic_
    $text = preg_replace('/~([^~]+)~/u',  '$1', $text);       // ~strikethrough~

    // Expand common legal abbreviations for natural TTS pronunciation
    $abbrevs = [
        'S/o'  => 'son of',
        'D/o'  => 'daughter of',
        'W/o'  => 'wife of',
        'vs.'  => 'versus',
        'HC'   => 'High Court',
        'SC'   => 'Supreme Court',
        'CPC'  => 'Code of Civil Procedure',
        'CrPC' => 'Code of Criminal Procedure',
        'IPC'  => 'Indian Penal Code',
    ];
    foreach ($abbrevs as $abbrev => $expansion) {
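        // \b never matches after a trailing '.' (as in 'vs.'), so use explicit lookarounds;
        // the /u flag makes \p{L} and \p{N} work on multibyte text
        $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($abbrev, '/') . '(?![\p{L}\p{N}])/u';
        $text = preg_replace($pattern, $expansion, $text);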
    }

    // Character cap with spoken truncation notice
    if (mb_strlen($text) > 800) {
        $text = mb_substr($text, 0, 800) . '… and more. Please check the app for the full details.';
    }

    return trim($text);
}
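
To see how a missing /u flag corrupts text, it helps to watch it happen on a character class like the bullet class used above. The following is a small standalone demonstration, not part of the original pipeline: the bullet and the ellipsis share their first two bytes, so a byte-wise character class damages the ellipsis.

```php
<?php
// '•' is the bytes E2 80 A2 and '…' is the bytes E2 80 A6.
$text = '… more';

// Without /u the class [•] means "any one of the bytes E2, 80, A2",
// so it strips two of the three bytes that make up '…'.
$broken = preg_replace('/[•]/', '', $text);

// With /u the class contains the single character U+2022 and leaves '…' alone.
$intact = preg_replace('/[•]/u', '', $text);

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump($intact === $text);                   // bool(true)
```

The broken result still looks plausible in logs, which is why this bug tends to surface only when a downstream JSON encoder or the TTS engine rejects the string.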

Audio Type Classification

Not every message deserves a voice response. Classify the text before calling TTS:

conversation — Short responses (<50 words): send as text, not audio
information — Case details, hearing dates: send as audio
instructions — Multi-step guidance: send as audio
evidence — Anything that could be used in a legal proceeding: send as text only (auditable)
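
The routing above can be sketched as a small function. This is a hedged illustration only: choose_delivery() is a hypothetical name, the category label is assumed to come from an upstream classifier that the article does not specify, and the precedence (evidence first, text as the fallback) is an assumption.

```php
<?php
// Hypothetical sketch: decide between a text reply and a voice note.
// $category is assumed to be one of the four labels listed above.
function choose_delivery(string $category, string $text): string
{
    // Evidence must stay auditable, so it is never sent as audio.
    if ($category === 'evidence') {
        return 'text';
    }

    // Count words in a multibyte-safe way (note the /u flag).
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Short conversational replies stay as text.
    if ($category === 'conversation' && count($words) < 50) {
        return 'text';
    }

    // Information and instructions go out as voice notes.
    if ($category === 'information' || $category === 'instructions') {
        return 'audio';
    }

    // Assumption: anything unrecognised falls back to text.
    return 'text';
}
```

Checking evidence before anything else means a misjudged word count can never turn an auditable message into audio.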

What to Watch For

The /u flag on all multibyte regex — This is the single most common source of corrupted text in PHP handling Indian language content. Add a linting rule for it.
Kokoro cold start — The first TTS call after a cold start takes 3-4 seconds for model loading. Keep a health-check cron that sends a short test string every 5 minutes to keep the model warm.
Voice selection — Kokoro's af_heart voice is appropriate for professional contexts. Expose voice selection in the client config but restrict to approved voices.
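
The linting rule for the /u flag can start as a short script built on PHP's tokenizer (token_get_all). The sketch below is an assumption about how such a rule might look, not an existing tool: find_preg_calls_missing_u() is a hypothetical name, it only inspects patterns written as a single string literal, and it skips concatenated patterns and namespace-qualified calls.

```php
<?php
// Hypothetical lint sketch: report preg_* calls whose literal pattern lacks the u modifier.
function find_preg_calls_missing_u(string $phpSource): array
{
    $checked  = ['preg_match', 'preg_match_all', 'preg_replace',
                 'preg_replace_callback', 'preg_split', 'preg_grep'];
    $tokens   = token_get_all($phpSource);
    $count    = count($tokens);
    $findings = [];

    for ($i = 0; $i < $count; $i++) {
        $tok = $tokens[$i];
        if (!is_array($tok) || $tok[0] !== T_STRING || !in_array($tok[1], $checked, true)) {
            continue;
        }

        // Skip whitespace and the opening parenthesis to reach the first argument.
        $j = $i + 1;
        while ($j < $count
            && (is_array($tokens[$j]) ? $tokens[$j][0] === T_WHITESPACE : $tokens[$j] === '(')) {
            $j++;
        }
        if ($j >= $count || !is_array($tokens[$j]) || $tokens[$j][0] !== T_CONSTANT_ENCAPSED_STRING) {
            continue; // pattern is a variable or an expression: cannot be checked statically
        }

        // A following '.' means the pattern is concatenated: skip it as well.
        $k = $j + 1;
        while ($k < $count && is_array($tokens[$k]) && $tokens[$k][0] === T_WHITESPACE) {
            $k++;
        }
        if ($k < $count && $tokens[$k] === '.') {
            continue;
        }

        // Strip the PHP quotes, then split off the modifiers after the closing delimiter.
        $pattern = substr($tokens[$j][1], 1, -1);
        if ($pattern === '') {
            continue;
        }
        $close = strtr($pattern[0], '([{<', ')]}>');
        $end   = strrpos($pattern, $close);
        if ($end === false || $end === 0) {
            continue;
        }
        if (!str_contains(substr($pattern, $end + 1), 'u')) {
            $findings[] = ['line' => $tok[2], 'function' => $tok[1]];
        }
    }

    return $findings;
}
```

A CI step could run this over each changed file (for example via file_get_contents()) and fail the build when the returned list is not empty.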
