Series 6 — Part 5 of 10

The WhatsApp AI agent sends voice-note responses using Kokoro TTS. Three production bugs lurk in this pipeline: the WAV-not-MP3 trap (Kokoro always outputs WAV regardless of the format you request), the UTF-8 corruption bug (a missing /u flag on regexes that touch multibyte text), and the audio type classification problem. This article covers all three.

Kokoro TTS Integration

function generate_tts(string $text, string $voice = 'af_heart', string $format = 'wav'): string
{
    $payload = json_encode([
        'model'  => 'kokoro',
        'input'  => prepareText($text),
        'voice'  => $voice,
        'response_format' => $format,  // TRAP: Kokoro ignores this and always returns WAV
        'speed'  => 1.0,
    ]);

    $ch = curl_init('http://localhost:9010/v1/audio/speech');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $payload,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);

    $audioBytes = curl_exec($ch);
    $httpCode   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($httpCode !== 200 || !$audioBytes) {
        throw new \RuntimeException("Kokoro TTS failed: HTTP {$httpCode}");
    }

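    // The WAV-not-MP3 trap: trust the bytes, not the format we requested.
    // $actualFormat will be 'wav' even if you requested 'mp3'.
    $actualFormat = detect_audio_format($audioBytes);
    if ($actualFormat !== $format) {
        error_log("Kokoro TTS returned {$actualFormat}, requested {$format}");
    }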

    return $audioBytes;
}

function detect_audio_format(string $bytes): string
{
    // WAV files start with 'RIFF'
    if (str_starts_with($bytes, 'RIFF')) return 'wav';
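    // MP3 files start with an ID3 tag or a Layer III frame sync:
    // 0xFF, then a byte with the top three bits set and layer bits 01 (0xFB, 0xF3, 0xFA, ...)
    if (str_starts_with($bytes, 'ID3')
        || (strlen($bytes) >= 2 && $bytes[0] === "\xFF" && (ord($bytes[1]) & 0xE6) === 0xE2)) return 'mp3';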
    // OGG files start with 'OggS'
    if (str_starts_with($bytes, 'OggS')) return 'ogg';
    return 'unknown';
}
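
Once the real format is known, the storage or upload step can take its MIME type and file extension from the detected format instead of the requested one. A minimal sketch, assuming a hypothetical helper named audio_file_meta() (not part of the original pipeline) that accepts the string returned by detect_audio_format():

```php
<?php
// Hypothetical helper: map a detected format to the MIME type and extension
// to use when storing or uploading the audio. The name is illustrative.
function audio_file_meta(string $detectedFormat): array
{
    return match ($detectedFormat) {
        'wav' => ['mime' => 'audio/wav',  'ext' => 'wav'],
        'mp3' => ['mime' => 'audio/mpeg', 'ext' => 'mp3'],
        'ogg' => ['mime' => 'audio/ogg',  'ext' => 'ogg'],
        // 'unknown' (or anything else) is rejected rather than guessed at
        default => throw new \InvalidArgumentException(
            "Unrecognised audio format: {$detectedFormat}"
        ),
    };
}

// In the pipeline this would be called as:
//   $meta = audio_file_meta(detect_audio_format($audioBytes));
$meta = audio_file_meta('wav');
echo $meta['mime'], PHP_EOL; // prints "audio/wav"
```

Throwing on an unrecognised format is a design choice for this sketch: a mislabelled file tends to fail later and less visibly than an exception at this point.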

The prepareText() Function

function prepareText(string $input): string
{
    // THE UTF-8 BUG: without the /u flag PCRE works on bytes, not characters, so \W
    // and character classes containing a multibyte character (the • below) can match
    // a single byte in the middle of another character.
    // BAD:  preg_replace('/[-•*]\s+/', '', $input)   (no /u: can split multibyte characters)
    // GOOD: always use the /u flag on any regex that touches non-ASCII content

    // Convert list markers to spoken connectives first, so that "* " bullets on
    // consecutive lines are not mistaken for a *bold* pair
    $text = preg_replace('/^\s*[-•*]\s+/mu', 'Next, ', $input);
    $text = preg_replace('/^\s*\d+\.\s+/mu', 'Step: ', $text);

    // Strip WhatsApp markdown (opening and closing markers must be on the same line)
    $text = preg_replace('/\*([^*\n]+)\*/u', '$1', $text);     // *bold*
    $text = preg_replace('/_([^_\n]+)_/u',  '$1', $text);      // _italic_
    $text = preg_replace('/~([^~\n]+)~/u',  '$1', $text);      // ~strikethrough~

    // Expand common legal abbreviations for natural TTS pronunciation
    $abbrevs = [
        'S/o'  => 'son of',
        'D/o'  => 'daughter of',
        'W/o'  => 'wife of',
        'vs.'  => 'versus',
        'HC'   => 'High Court',
        'SC'   => 'Supreme Court',
        'CPC'  => 'Code of Civil Procedure',
        'CrPC' => 'Code of Criminal Procedure',
        'IPC'  => 'Indian Penal Code',
    ];
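    foreach ($abbrevs as $abbrev => $expansion) {
        // \b never matches between a trailing "." and a space, so 'vs.' would be skipped.
        // Use letter/digit lookarounds instead; the /u flag keeps them multibyte-safe.
        $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($abbrev, '/') . '(?![\p{L}\p{N}])/u';
        $text = preg_replace($pattern, $expansion, $text);
    }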

    // Character cap with spoken truncation notice
    if (mb_strlen($text) > 800) {
        $text = mb_substr($text, 0, 800) . '… and more. Please check the app for the full details.';
    }

    return trim($text);
}
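
The /u bug described in the comments above is easy to reproduce in isolation. The sketch below is an illustration, not code from the production agent: it runs the same character class with and without the flag on a Hindi word whose first letter happens to share a byte with the bullet character.

```php
<?php
// ढ (U+0922) is encoded as E0 A4 A2; • (U+2022) is encoded as E2 80 A2.
// Both contain the byte A2, which is what a byte-mode character class trips over.
$word = "\u{0922}\u{093E}\u{0908}"; // ढाई

$byteMode    = preg_replace('/[-•*]/',  '', $word);  // no /u: the class holds raw bytes
$unicodeMode = preg_replace('/[-•*]/u', '', $word);  // /u: the class holds code points

var_dump(mb_check_encoding($byteMode, 'UTF-8'));   // bool(false): byte A2 was cut out of ढ
var_dump($unicodeMode === $word);                  // bool(true): nothing matched, text intact
```

One related point to keep in mind: with the /u flag, preg_replace() returns null when the subject is not valid UTF-8, so it is worth validating incoming text (for example with mb_check_encoding()) before it reaches prepareText().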

Audio Type Classification

Not every message deserves a voice response. Classify the text before calling TTS:

  • conversation — Short responses (<50 words): send as text, not audio
  • information — Case details, hearing dates: send as audio
  • instructions — Multi-step guidance: send as audio
  • evidence — Anything that could be used in a legal proceeding: send as text only (auditable)
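
The routing rules above can be sketched as a small function. This is a hedged illustration rather than the agent's actual classifier: the function name classify_response(), the $isEvidence flag (assumed to be set by the caller, since deciding what counts as evidence is domain-specific), and the "two or more list lines means instructions" heuristic are all assumptions.

```php
<?php
// Hypothetical sketch of the routing rules. The $isEvidence flag and the
// list-detection heuristic are assumptions, not part of the original agent.
function classify_response(string $text, bool $isEvidence = false): array
{
    // Evidence takes precedence: it must stay in an auditable text form
    if ($isEvidence) {
        return ['type' => 'evidence', 'channel' => 'text'];
    }

    // Count words by splitting on whitespace; /u keeps the split multibyte-safe
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    if (count($words) < 50) {
        return ['type' => 'conversation', 'channel' => 'text'];
    }

    // Two or more bulleted or numbered lines are treated as multi-step guidance
    $listLines = (int) preg_match_all('/^\s*(?:[-•*]|\d+\.)\s+/mu', $text);
    $type = $listLines >= 2 ? 'instructions' : 'information';

    return ['type' => $type, 'channel' => 'audio'];
}

$result = classify_response('Your hearing is on Monday.');
echo $result['type'], ' -> ', $result['channel'], PHP_EOL; // prints "conversation -> text"
```

Checking the evidence flag before the word count means a long evidentiary message is never routed to audio by accident.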

What to Watch For

  • The /u flag on all multibyte regex — This is the single most common source of corrupted text in PHP handling Indian language content. Add a linting rule for it.
  • Kokoro cold start — The first TTS call after a cold start takes 3-4 seconds for model loading. Keep a health-check cron that sends a short test string every 5 minutes to keep the model warm.
  • Voice selection — Kokoro's af_heart voice is appropriate for professional contexts. Expose voice selection in the client config but restrict to approved voices.
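
For the linting rule mentioned in the first point, a rough starting place is a script that scans PHP source for preg_* calls without the u modifier. The sketch below is a heuristic under stated assumptions, not a full linter: find_patterns_missing_u() is a hypothetical name, and it only inspects patterns written as one complete single-quoted literal, so patterns built by concatenation or held in variables are skipped.

```php
<?php
// Hypothetical lint check: report preg_* calls whose pattern is a complete
// single-quoted literal with no "u" modifier. Heuristic only.
function find_patterns_missing_u(string $phpSource): array
{
    // Capture the first argument when it is a single-quoted literal that is
    // immediately followed by a comma (i.e. not concatenated with anything)
    $call = '/\bpreg_(?:match_all|match|replace_callback|replace|split|grep)\s*\(\s*'
          . '\'((?:[^\'\\\\]|\\\\.)*)\'\s*,/u';
    preg_match_all($call, $phpSource, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $closers  = ['(' => ')', '[' => ']', '{' => '}', '<' => '>'];
    $findings = [];
    foreach ($matches as $m) {
        $literal = $m[1][0];   // pattern text as written in the source
        $offset  = $m[0][1];   // byte offset of the preg_* call
        if ($literal === '') {
            continue;
        }
        // Modifiers are whatever follows the last closing delimiter
        $closing = $closers[$literal[0]] ?? $literal[0];
        $end     = strrpos($literal, $closing);
        if ($end === false || $end === 0) {
            continue; // closing delimiter not found; leave it to a human
        }
        if (!str_contains(substr($literal, $end + 1), 'u')) {
            $findings[] = [
                'line'    => substr_count($phpSource, "\n", 0, $offset) + 1,
                'pattern' => $literal,
            ];
        }
    }
    return $findings;
}
```

Running this over each changed file in CI and failing the build on a non-empty result would approximate the rule; a token-based or AST-based check would be more robust if false negatives on concatenated patterns matter.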