Series 6 — Part 5 of 10
The WhatsApp AI agent sends voice note responses using Kokoro TTS. Three production bugs lurk in this pipeline: the WAV-not-MP3 trap (Kokoro always outputs WAV regardless of what you request), the UTF-8 corruption bug (a missing /u flag on a regex that touches multibyte text), and the audio type classification problem. This article covers all three.
Kokoro TTS Integration
function generate_tts(string $text, string $voice = 'af_heart', string $format = 'wav'): string
{
    // JSON_THROW_ON_ERROR raises an exception on invalid UTF-8 instead of letting json_encode() return false
    $payload = json_encode([
        'model' => 'kokoro',
        'input' => prepareText($text),
        'voice' => $voice,
        'response_format' => $format, // TRAP: Kokoro ignores this and always returns WAV
        'speed' => 1.0,
    ], JSON_THROW_ON_ERROR);
    $ch = curl_init('http://localhost:9010/v1/audio/speech');
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $payload,
        CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => 30,
    ]);
    $audioBytes = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpCode !== 200 || !$audioBytes) {
        throw new \RuntimeException("Kokoro TTS failed: HTTP {$httpCode}");
    }
    // The WAV-not-MP3 trap: detect the actual format from the bytes, not from $format
    $actualFormat = detect_audio_format($audioBytes);
    if ($actualFormat !== $format) {
        // Expected when 'mp3' was requested: the bytes are still WAV
        error_log("Kokoro TTS returned {$actualFormat}, requested {$format}");
    }
    return $audioBytes;
}
function detect_audio_format(string $bytes): string
{
    // WAV files start with 'RIFF'
    if (str_starts_with($bytes, 'RIFF')) return 'wav';
    // MP3 files start with an ID3 tag or an MPEG frame sync (0xFF 0xFB is the common MPEG-1 Layer III case)
    if (str_starts_with($bytes, "\xFF\xFB") || str_starts_with($bytes, 'ID3')) return 'mp3';
    // OGG files start with 'OggS'
    if (str_starts_with($bytes, 'OggS')) return 'ogg';
    return 'unknown';
}
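Because the requested format cannot be trusted, anything downstream that needs a file extension or MIME type should derive it from the bytes as well. The following is a minimal sketch of that idea, assuming PHP 8.0 or later; describe_audio() is a hypothetical helper, not part of the original pipeline, and it repeats the magic-byte checks so that it runs on its own.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: derives the file extension and MIME type from the
 * audio bytes themselves, ignoring whatever response_format was requested.
 */
function describe_audio(string $bytes): array
{
    // Same magic-byte checks as detect_audio_format(), inlined to keep the sketch standalone
    $format = match (true) {
        str_starts_with($bytes, 'RIFF') => 'wav',
        str_starts_with($bytes, 'OggS') => 'ogg',
        str_starts_with($bytes, 'ID3'), str_starts_with($bytes, "\xFF\xFB") => 'mp3',
        default => throw new \UnexpectedValueException('Unrecognised audio container'),
    };

    $mimeTypes = ['wav' => 'audio/wav', 'ogg' => 'audio/ogg', 'mp3' => 'audio/mpeg'];

    return ['extension' => $format, 'mime' => $mimeTypes[$format]];
}
```

A caller would then name the file from the result, for example "reply." . describe_audio($audioBytes)['extension'], so that a WAV payload is never saved or uploaded with an .mp3 label.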
The prepareText() Function
function prepareText(string $input): string
{
    // THE UTF-8 BUG: without the /u flag, PCRE treats pattern and subject as raw bytes,
    // so \b, \W and character classes can match in the middle of a multibyte character
    // BAD:  preg_replace('/\*([^*]+)\*/', '$1', $input)   (no /u flag)
    // GOOD: preg_replace('/\*([^*]+)\*/u', '$1', $input)  (always use /u on any regex that touches non-ASCII content)
    // Strip WhatsApp markdown
    $text = preg_replace('/\*([^*]+)\*/u', '$1', $input); // *bold*
    $text = preg_replace('/_([^_]+)_/u', '$1', $text); // _italic_
    $text = preg_replace('/~([^~]+)~/u', '$1', $text); // ~strikethrough~
    // Convert bullet lists to spoken connectives
    $text = preg_replace('/^\s*[-•*]\s+/mu', 'Next, ', $text);
    $text = preg_replace('/^\s*\d+\.\s+/mu', 'Step: ', $text);
    // Expand common legal abbreviations for natural TTS pronunciation
    $abbrevs = [
        'S/o' => 'son of',
        'D/o' => 'daughter of',
        'W/o' => 'wife of',
        'vs.' => 'versus',
        'HC' => 'High Court',
        'SC' => 'Supreme Court',
        'CPC' => 'Code of Civil Procedure',
        'CrPC' => 'Code of Criminal Procedure',
        'IPC' => 'Indian Penal Code',
    ];
    foreach ($abbrevs as $abbrev => $expansion) {
        // Lookarounds instead of \b: a trailing \b never matches after the '.' in 'vs.' when a space follows.
        // The /u flag ensures multibyte safety.
        $text = preg_replace('/(?<!\w)' . preg_quote($abbrev, '/') . '(?!\w)/u', $expansion, $text);
    }
    // Character cap with spoken truncation notice
    if (mb_strlen($text) > 800) {
        $text = mb_substr($text, 0, 800) . '… and more. Please check the app for the full details.';
    }
    return trim($text);
}
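To see the corruption concretely, it helps to run the bullet character class from prepareText() with and without the flag. This is a small demonstration rather than production code; the sample string is made up, and it assumes the source file is saved as UTF-8 and the mbstring extension is loaded.

```php
<?php
declare(strict_types=1);

// Sample input (an assumption for illustration): Hindi text plus a rupee amount
$input = 'ढाई ₹500';

// Byte mode: the class [-•*] is read as the bytes '-', E2, 80, A2, '*'.
// 'ढ' is E0 A4 A2 and '₹' is E2 82 B9, so single bytes get stripped out of both.
$withoutU = preg_replace('/[-•*]/', '', $input);

// UTF-8 mode: the class holds three code points, none of which occur in the input
$withU = preg_replace('/[-•*]/u', '', $input);

var_dump(mb_check_encoding($withoutU, 'UTF-8')); // bool(false): the result is no longer valid UTF-8
var_dump(mb_check_encoding($withU, 'UTF-8'));    // bool(true)
var_dump($withU === $input);                     // bool(true): nothing was removed
```

The first result is what ends up in the TTS payload when the flag is forgotten: a string with orphaned continuation bytes.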
Audio Type Classification
Not every message deserves a voice response. Classify the text before calling TTS:
- conversation — Short responses (<50 words): send as text, not audio
- information — Case details, hearing dates: send as audio
- instructions — Multi-step guidance: send as audio
- evidence — Anything that could be used in a legal proceeding: send as text only (auditable)
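The categories above can be turned into a small routing step in front of generate_tts(). The sketch below is one possible shape, not the article's implementation: both function names are hypothetical, the evidence flag is assumed to come from the caller, and the list-detection heuristic for instructions is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: picks one of the four categories using simple heuristics.
function classify_response(string $text, bool $isEvidence = false): string
{
    if ($isEvidence) {
        // Assumption: the caller knows when a reply could be used in a legal proceeding
        return 'evidence';
    }

    // Count whitespace-separated tokens; /u keeps non-ASCII words intact
    $wordCount = preg_match_all('/\S+/u', $text);
    if ($wordCount < 50) {
        return 'conversation';
    }

    // Assumed heuristic: numbered or bulleted lines indicate multi-step guidance
    if (preg_match('/^\s*(?:\d+\.|[-•*])\s+/mu', $text) === 1) {
        return 'instructions';
    }

    return 'information';
}

// Only information and instructions are sent as audio; everything else stays text.
function should_send_audio(string $category): bool
{
    return in_array($category, ['information', 'instructions'], true);
}
```

Run the classifier on the raw reply before prepareText(), since prepareText() rewrites the list markers the heuristic looks for.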
What to Watch For
- The /u flag on all multibyte regex — This is the single most common source of corrupted text in PHP handling Indian language content. Add a linting rule for it.
- Kokoro cold start — The first TTS call after a cold start takes 3-4 seconds for model loading. Keep a health-check cron that sends a short test string every 5 minutes to keep the model warm.
- Voice selection — Kokoro's af_heart voice is appropriate for professional contexts. Expose voice selection in the client config but restrict it to approved voices.
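The linting rule for the /u flag does not need a full static analyser to be useful. The following is a rough sketch of a check that could run in CI, assuming PHP 8.0 or later; find_regex_missing_u() is a hypothetical name, and the check only understands preg_* calls whose pattern is a single-quoted literal with '/' delimiters on one line. Patterns built by concatenation or stored in variables are skipped.

```php
<?php
declare(strict_types=1);

/**
 * Returns the 1-based line numbers of preg_* calls whose pattern is a
 * single-quoted literal with '/' delimiters and no 'u' among its flags.
 */
function find_regex_missing_u(string $phpSource): array
{
    // Captures the flag letters that follow the closing '/' of the pattern literal
    $callWithLiteral = <<<'RE'
    ~preg_(?:match|match_all|replace|replace_callback|split)\s*\(\s*'/(?:[^'\\]|\\.)*/([a-zA-Z]*)'~
    RE;

    $flagged = [];
    foreach (explode("\n", $phpSource) as $index => $line) {
        if (preg_match($callWithLiteral, $line, $m) === 1 && !str_contains($m[1], 'u')) {
            $flagged[] = $index + 1;
        }
    }

    return $flagged;
}
```

Feeding each file's contents through file_get_contents() and failing the build on a non-empty result is enough to catch the common case; pure-ASCII patterns can be given the flag anyway so the rule stays simple.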