Series 9 — Part 5 of 5
Text going into a TTS engine needs preparation: strip markdown, expand abbreviations, convert lists to spoken connectives, and cap length. Do this wrong and the TTS engine mispronounces legal terms, reads asterisks aloud, or truncates mid-sentence. This article covers a production prepareText() implementation.
The Full prepareText() Implementation
function prepareText(string $input, int $charCap = 800): string
{
    // 1. Strip WhatsApp markdown. ALWAYS use the /u flag for multibyte safety.
    //    Markers must touch their content and stay on one line, so "* item"
    //    bullets and snake_case names are not mistaken for formatting.
    $text = preg_replace('/\*{1,2}(?=\S)([^*\n]+)\*{1,2}/u', '$1', $input); // *bold* and **bold**
    $text = preg_replace('/(?<![\p{L}\p{N}])_(?=\S)([^_\n]+)_(?![\p{L}\p{N}])/u', '$1', $text); // _italic_
    $text = preg_replace('/~(?=\S)([^~\n]+)~/u', '$1', $text); // ~strikethrough~
    $text = preg_replace('/`([^`\n]+)`/u', '$1', $text); // `code`

    // 2. Convert bullet lists to spoken connectives
    //    ([ \t] instead of \s so blank lines before a list are preserved)
    $text = preg_replace('/^[ \t]*[-•*][ \t]+/mu', 'Next item: ', $text);

    // 3. Convert numbered lists to spoken steps
    $text = preg_replace('/^[ \t]*(\d+)\.[ \t]+/mu', 'Step $1: ', $text);

    // 4. Expand legal abbreviations for natural TTS pronunciation.
    //    Dotted forms are listed before their undotted variants.
    $expansions = [
        // Case parties
        'S/o' => 'son of', 'D/o' => 'daughter of',
        'W/o' => 'wife of', 'R/o' => 'resident of',
        // Court references
        'vs.' => 'versus', 'vs' => 'versus',
        'v.' => 'versus',
        'HC' => 'High Court', 'SC' => 'Supreme Court',
        'DC' => 'District Court',
        // Statutes: the spoken form avoids robotic letter-by-letter reading
        'IPC' => 'Indian Penal Code',
        'CPC' => 'Code of Civil Procedure',
        'CrPC' => 'Code of Criminal Procedure',
        'IEA' => 'Indian Evidence Act',
        'GST' => 'Goods and Services Tax',
        // Proceedings
        'Sec.' => 'Section', 'Sec' => 'Section',
        'Art.' => 'Article', 'cl.' => 'clause',
        'pg.' => 'page', 'pp.' => 'pages',
        'para' => 'paragraph',
    ];
    foreach ($expansions as $abbrev => $full) {
        // \b cannot match after a trailing "." that is followed by a space,
        // so "vs." or "Sec." would never be found. Unicode-aware lookarounds
        // handle dotted keys and stay safe for mixed scripts.
        $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($abbrev, '/') . '(?![\p{L}\p{N}])/u';
        $text = preg_replace($pattern, $full, $text);
    }

    // 5. Turn newlines into spoken pauses
    // Lines that already end in punctuation only need a space (avoids ".. ")
    $text = preg_replace('/(?<=[.!?:;,।])\s*\n\s*/u', ' ', $text);
    $text = preg_replace('/\s*\n{2,}\s*/u', '. ', $text); // paragraph break → full stop
    $text = preg_replace('/\s*\n\s*/u', ', ', $text);     // line break → short pause

    // 6. Character cap with graceful truncation notice
    $text = trim($text);
    if (mb_strlen($text, 'UTF-8') > $charCap) {
        $text = mb_substr($text, 0, $charCap, 'UTF-8');
        // Find the last sentence boundary before the cap:
        // "." or the Devanagari danda "।" (mb_strrpos returns false if absent)
        $lastStop = max(
            (int) mb_strrpos($text, '.', 0, 'UTF-8'),
            (int) mb_strrpos($text, '।', 0, 'UTF-8')
        );
        if ($lastStop > $charCap * 0.7) {
            $text = mb_substr($text, 0, $lastStop + 1, 'UTF-8');
        }
        // The notice is appended after the cut, so the result can exceed $charCap
        $text .= ' Please check the app for the complete details.';
    }
    return trim($text);
}
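Abbreviation keys that end in a period interact with word boundaries in a way that is easy to check in isolation. The sketch below compares a `\b`-delimited pattern with Unicode-aware lookarounds on the same input. `expandAbbrev` is a hypothetical helper written for this illustration and is not part of prepareText().

```php
<?php
// Hypothetical helper for illustration only.
// Builds the pattern either with \b or with lookarounds and applies it once.
function expandAbbrev(string $text, string $abbrev, string $full, bool $useLookarounds): string
{
    $quoted = preg_quote($abbrev, '/');
    $pattern = $useLookarounds
        ? '/(?<![\p{L}\p{N}])' . $quoted . '(?![\p{L}\p{N}])/u'
        : '/\b' . $quoted . '\b/u';
    return preg_replace($pattern, $full, $text);
}

$input = 'Ram Singh vs. State of Punjab';

// "." and " " are both non-word characters, so there is no \b between them
echo expandAbbrev($input, 'vs.', 'versus', false), "\n"; // Ram Singh vs. State of Punjab

// The lookarounds only require that no letter or digit touches the match
echo expandAbbrev($input, 'vs.', 'versus', true), "\n";  // Ram Singh versus State of Punjab
```

The lookaround form still refuses to expand 'Sec' inside "Section", because the character after the match is a letter.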
Audio Type Classification
function classify_audio_type(string $text, PersonaInterface $persona): string
{
    // str_word_count() is not multibyte-aware and miscounts Devanagari or
    // Gurmukhi text, so count whitespace-separated tokens instead.
    $wordCount = count(preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY));

    // Short conversational replies → text only (no TTS)
    if ($wordCount < 15) {
        return 'text_only';
    }
    // Evidence or anything that could be used in legal proceedings → text only
    if (str_contains($persona->getSystemPromptContext(), 'evidence')) {
        return 'text_only';
    }
    // Instructions (multi-step guidance) → audio
    // Word boundaries keep "first" from matching inside longer words
    if (preg_match('/\b(?:step \d+|first|second|third|next step)\b/iu', $text)) {
        return 'instructions';
    }
    // Information (case details, hearing dates) → audio
    return 'information';
}

function should_send_audio(string $type): bool
{
    // Strict comparison avoids loose type juggling
    return in_array($type, ['information', 'instructions'], true);
}
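Because the 15-word threshold decides whether a reply is spoken at all, the word count is worth checking against non-Latin text. The following is a minimal sketch, assuming that whitespace-separated tokens are an acceptable definition of a word. `countWords` is a hypothetical helper, not a function from the code above.

```php
<?php
// Hypothetical helper: counts whitespace-separated tokens, which works for
// any script that puts spaces between words.
function countWords(string $text): int
{
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? 0 : count($tokens);
}

echo countWords('Your next hearing is on 12 March'), "\n"; // 7
echo countWords('अगली सुनवाई 12 मार्च को है'), "\n";        // 6
```

Numbers count as tokens here, which suits a length heuristic for speech since the TTS engine reads them aloud too.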
What to Watch For
- Order of operations matters: steps 2 and 3 anchor on line starts, so list conversion and abbreviation expansion must run before step 5 collapses newlines. In that order, "CPC\nSection 9" becomes "Code of Civil Procedure, Section 9" rather than "Code of Civil Procedure\n Section 9" (which the TTS reads oddly).
- Test with actual Hindi/Punjabi content: ASCII-only unit tests will miss multibyte issues. Include Devanagari and Gurmukhi test strings in your test suite for prepareText().
- TTS mispronunciation is a user experience bug: "IPC" read as "I P C" sounds robotic, while "Indian Penal Code" sounds professional. Abbreviation expansion directly affects trust in the AI assistant.
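To make the multibyte point concrete, here is a small sketch of the kind of check a Devanagari test string can catch. It contrasts a byte-based cut with a character-based cut. `cutByBytes` and `cutByChars` are hypothetical helpers for illustration, and the sample sentence is an assumption rather than text from a real test suite.

```php
<?php
// Hypothetical helpers for illustration only.
function cutByBytes(string $s, int $n): string
{
    return substr($s, 0, $n); // counts bytes
}

function cutByChars(string $s, int $n): string
{
    return mb_substr($s, 0, $n, 'UTF-8'); // counts code points
}

$hindi = 'अगली सुनवाई 12 मार्च को है।';

// Each Devanagari code point takes 3 bytes in UTF-8, so a 4-byte cut
// lands inside the second character and produces invalid UTF-8.
var_dump(mb_check_encoding(cutByBytes($hindi, 4), 'UTF-8')); // bool(false)

// A 4-character cut keeps the first word intact.
var_dump(cutByChars($hindi, 4) === 'अगली');                  // bool(true)
```

An ASCII-only string passes both cuts, which is why such a test suite would not notice the difference.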