Series 9 — Part 5 of 5

Text going into a TTS engine needs preparation: strip markdown, expand abbreviations, convert lists to spoken connectives, and cap length. Do this wrong and the TTS engine mispronounces legal terms, reads asterisks aloud, or truncates mid-sentence. This article covers a production prepareText() implementation.

The Full prepareText() Implementation

function prepareText(string $input, int $charCap = 800): string
{
    // 1. Convert bullet lists to spoken connectives. Lists are handled before
    //    markdown is stripped so a "* " bullet is never mistaken for *bold*.
    //    ALWAYS use the /u flag for multibyte safety
    $text = preg_replace('/^[ \t]*[-•*][ \t]+/mu', 'Next item: ', $input);

    // 2. Convert numbered lists to spoken steps
    $text = preg_replace('/^[ \t]*(\d+)\.[ \t]+/mu', 'Step $1: ', $text);

    // 3. Strip WhatsApp markdown; excluding \n keeps a match from spanning lines
    $text = preg_replace('/\*\*?([^*\n]+)\*\*?/u', '$1', $text);  // *bold* and **bold**
    $text = preg_replace('/_([^_\n]+)_/u',         '$1', $text);  // _italic_
    $text = preg_replace('/~([^~\n]+)~/u',         '$1', $text);  // ~strikethrough~
    $text = preg_replace('/`([^`\n]+)`/u',         '$1', $text);  // `code`

    // 4. Expand legal abbreviations for natural TTS pronunciation
    $expansions = [
        // Case parties
        'S/o'  => 'son of',       'D/o'  => 'daughter of',
        'W/o'  => 'wife of',      'R/o'  => 'resident of',
        // Court references
        'vs.'  => 'versus',       'vs'   => 'versus',
        'v.'   => 'versus',
        'HC'   => 'High Court',   'SC'   => 'Supreme Court',
        'DC'   => 'District Court',
        // Statutes — spoken form avoids robotic letter-by-letter reading
        'IPC'  => 'Indian Penal Code',
        'CPC'  => 'Code of Civil Procedure',
        'CrPC' => 'Code of Criminal Procedure',
        'IEA'  => 'Indian Evidence Act',
        'GST'  => 'Goods and Services Tax',
        // Proceedings
        'Sec.' => 'Section',      'Sec'  => 'Section',
        'Art.' => 'Article',      'cl.'  => 'clause',
        'pg.'  => 'page',         'pp.'  => 'pages',
        'para' => 'paragraph',
    ];

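    foreach ($expansions as $abbrev => $full) {
        // Lookarounds instead of \b: a word boundary cannot match between a
        // trailing "." and a space, so "vs." and "Sec." would never fire and
        // the shorter "vs" / "Sec" entries would leave a stray period behind
        $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($abbrev, '/') . '(?![\p{L}\p{N}])/u';
        $text    = preg_replace($pattern, $full, $text);
    }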

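    // 5. Turn line breaks into spoken pauses: a paragraph break becomes a full
    //    stop and a single break a comma. Lines that already end in punctuation
    //    only get a space, so the output never contains ".." or ".,"
    $text = preg_replace('/(?<=[.!?:;,])\s*\n+\s*/u', ' ', $text);
    $text = preg_replace('/\s*\n{2,}\s*/u', '. ', $text);
    $text = preg_replace('/\s*\n\s*/u',     ', ', $text);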

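    // 6. Character cap with graceful truncation notice
    $text = trim($text);
    if (mb_strlen($text, 'UTF-8') > $charCap) {
        $text = mb_substr($text, 0, $charCap, 'UTF-8');
        // Prefer the last sentence boundary, as long as it keeps most of the text
        $lastPeriod = mb_strrpos($text, '.', 0, 'UTF-8');
        $lastSpace  = mb_strrpos($text, ' ', 0, 'UTF-8');
        if ($lastPeriod !== false && $lastPeriod > $charCap * 0.7) {
            $text = mb_substr($text, 0, $lastPeriod + 1, 'UTF-8');
        } elseif ($lastSpace !== false) {
            // No usable sentence boundary: cut at a word boundary instead
            $text = rtrim(mb_substr($text, 0, $lastSpace, 'UTF-8'), ' ,;:') . '.';
        }
        // The notice is appended after the cap, so the result can exceed $charCap
        $text .= ' Please check the app for the complete details.';
    }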

    return trim($text);
}
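
The boundary handling in step 4 is easy to get wrong, so here is a minimal sketch of the issue in isolation. It assumes a hypothetical helper, expandAbbreviations(), which is not part of the original function, and a cut-down map. The point it illustrates: \b needs a word character on one side, so it cannot match between the period of "vs." and the space that follows, whereas letter/digit lookarounds can.

```php
<?php
// Hypothetical helper for illustration only; not part of prepareText().
// Lookarounds assert "no letter or digit on either side" instead of \b,
// so abbreviations that end in "." still match before a space.
function expandAbbreviations(string $text, array $expansions): string
{
    foreach ($expansions as $abbrev => $full) {
        $pattern = '/(?<![\p{L}\p{N}])' . preg_quote($abbrev, '/') . '(?![\p{L}\p{N}])/u';
        $text = preg_replace($pattern, $full, $text);
    }
    return $text;
}

// Dotted forms are listed before their undotted variants.
$map = ['vs.' => 'versus', 'vs' => 'versus', 'Sec.' => 'Section', 'Sec' => 'Section'];

$out = expandAbbreviations('State vs. Sharma, Sec. 302 IPC', $map);

// For comparison: the \b version leaves the dotted form untouched.
$withWordBoundary = preg_replace('/\bvs\.\b/u', 'versus', 'State vs. Sharma');
```

Here $out should read "State versus Sharma, Section 302 IPC", while $withWordBoundary is returned unchanged. Listing dotted forms first matters either way: if "vs" ran before "vs.", the period would be left behind.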

Audio Type Classification

function classify_audio_type(string $text, PersonaInterface $persona): string
{
    // Count whitespace-separated tokens; str_word_count() is locale-dependent
    // and can miss Devanagari or Gurmukhi words entirely
    $wordCount = count(preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: []);

    // Short conversational replies → text only (no TTS)
    if ($wordCount < 15) return 'text_only';

    // Persona deals with evidence or anything that could be used in legal proceedings → text only
    if (str_contains($persona->getSystemPromptContext(), 'evidence')) return 'text_only';

    // Instructions (multi-step guidance) → audio; \b stops matches inside longer words
    if (preg_match('/\b(?:step \d|first|second|third|next step)\b/iu', $text)) return 'instructions';

    // Information (case details, hearing dates) → audio
    return 'information';
}

function should_send_audio(string $type): bool
{
    return in_array($type, ['information', 'instructions'], true);
}
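
The word-count threshold is the part of the classifier most sensitive to script. The PHP manual describes str_word_count() as locale-dependent and based on alphabetic characters, so a Hindi reply may be counted as far shorter than it is and fall under the 15-word cutoff. The sketch below shows one script-agnostic alternative. countWords() is a hypothetical helper and the sample sentence is made up for illustration.

```php
<?php
// Hypothetical helper for illustration only.
// Counts whitespace-separated tokens, which works for any script as long
// as words are separated by spaces (true for Hindi and Punjabi).
function countWords(string $text): int
{
    $tokens = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on invalid UTF-8 when /u is used.
    return $tokens === false ? 0 : count($tokens);
}

$hindi   = 'आपकी अगली सुनवाई सोमवार को है';
$english = "  Your next hearing\nis on Monday ";

$hindiCount   = countWords($hindi);    // six space-separated words
$englishCount = countWords($english);  // six words, extra whitespace ignored
```

Token counting treats punctuation attached to a word as part of that word, which is fine for a rough length threshold like this one.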

What to Watch For

  • Order of operations matters: The list conversions anchor on line starts (^ with the /m flag), so they must run before newlines are collapsed. Abbreviations are expanded before that step too, so "CPC\nSection 9" reaches the TTS engine as "Code of Civil Procedure, Section 9" rather than with a raw line break, which it reads oddly.
  • Test with actual Hindi/Punjabi content: ASCII-only unit tests will miss multibyte issues. Include Devanagari (and, for Punjabi, Gurmukhi) test strings in your test suite for prepareText().
  • TTS mispronunciation is a user experience bug: "IPC" read as "I P C" sounds robotic, while "Indian Penal Code" sounds professional. Abbreviation expansion directly affects trust in the AI assistant.
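
The multibyte point can be made concrete with a small sketch. It assumes a hypothetical helper, truncateAtSentence(), and a made-up Hindi sample, and it treats the danda (।, the Devanagari full stop) as a sentence terminator alongside ".", "!" and "?". Whether your TTS input actually contains dandas is an assumption to check against real messages.

```php
<?php
// Hypothetical helper for illustration only; not the article's prepareText().
// Caps text at $cap characters, preferring the last sentence terminator
// ('.', '!', '?' or the Devanagari danda) inside the capped text.
function truncateAtSentence(string $text, int $cap): string
{
    if (mb_strlen($text, 'UTF-8') <= $cap) {
        return $text;
    }

    $cut = mb_substr($text, 0, $cap, 'UTF-8');

    // Greedy .* keeps everything up to the LAST terminator in $cut.
    if (preg_match('/^.*[.!?।]/us', $cut, $m)) {
        return $m[0];
    }

    return $cut; // no terminator found: fall back to the hard cut
}

$hindi = 'सुनवाई सोमवार को है। कृपया समय पर पहुँचें।';
$short = truncateAtSentence($hindi, 30);

// Byte-based substr() can split a 3-byte Devanagari character in half;
// mb_substr() counts characters and always returns valid UTF-8.
$bytesOk = mb_check_encoding(substr($hindi, 0, 4), 'UTF-8');
$charsOk = mb_check_encoding(mb_substr($hindi, 0, 4, 'UTF-8'), 'UTF-8');
```

With a cap of 30 characters, $short ends at the first danda, so only the first sentence survives. $bytesOk is false because four bytes cover one full character plus a fragment of the next. An ASCII-only test suite would never exercise either path.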