UTF-8 Safety in PHP — The /u Flag and Multibyte Regex

Series 8 — PHP Engineering Patterns • Part 1 of 7

PHP's string and regex functions are byte-oriented by default. Without the /u flag, PCRE treats strings as sequences of bytes, not Unicode characters. In production, this can silently corrupt Devanagari text, break bullet-point stripping, and cause json_encode() to return false. This article explains why that happens and how to prevent it.

The Problem

// The bullet-point corruption bug (real production case)
$text = "• Status: अगली सुनवाई 15 जून को है\n• Action: दस्तावेज़ जमा करें";

// WITHOUT /u flag — WRONG
$cleaned = preg_replace('/^\s*•\s+/m', '', $text);
// Reported result: garbled output. In byte mode the engine works on single bytes, so a match can begin or end inside a multibyte character

// WITH /u flag — CORRECT
$cleaned = preg_replace('/^\s*•\s+/mu', '', $text);
// Result: "Status: अगली सुनवाई 15 जून को है\nAction: दस्तावेज़ जमा करें"

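The /u flag tells PCRE to treat both the pattern and the subject as UTF-8 and to use Unicode-aware character classes. Without it, \s, \w, \b and . all operate on bytes, not characters, so a 3-byte Devanagari character such as अ is treated as three separate "characters" by the regex engine. One caveat: with /u, a subject that is not valid UTF-8 makes the preg_* call fail (preg_replace() returns null, preg_match() returns false), so check the return value.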
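To make the byte-versus-character difference concrete, here is a minimal sketch that counts what . matches in a single Devanagari letter with and without /u, and then feeds a /u pattern a deliberately truncated string. The variable names are illustrative only.

```php
<?php
// "अ" (U+0905) is encoded in UTF-8 as three bytes: E0 A4 85.
$letter = "अ";

// Byte mode: "." matches one byte at a time, so it matches three times.
$byteMatches = preg_match_all('/./', $letter);

// UTF-8 mode: "." matches one code point, so it matches once.
$charMatches = preg_match_all('/./u', $letter);

// Splitting on the empty pattern with /u yields whole characters.
$chars = preg_split('//u', "अब", -1, PREG_SPLIT_NO_EMPTY);

// With /u, a subject that is not valid UTF-8 makes the call fail
// instead of matching: preg_match() returns false here.
$truncated = "\xE0\xA4";                 // first two bytes of "अ" only
$result    = preg_match('/./u', $truncated);
$isBadUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;
```

Checking preg_last_error() after a failed call is what distinguishes "no match" from "the input was not valid UTF-8".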

When /u Is Required

Any regex that runs on text that may contain non-ASCII characters
Any regex using \b (word boundary) — \b without /u does not understand Unicode word boundaries
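Any regex using \w, \d, or \s when the input may contain non-ASCII text. UTF-8 never encodes a multibyte character with ASCII-range bytes, so a plain [a-z] cannot match inside one, but without /u the shorthand classes know only ASCII (plus, under some locales, individual high bytes): they miss Unicode letters and can match a single byte in the middle of a character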
Any regex that replaces or captures variable-length patterns near multibyte characters
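The last case in the list is the easiest one to reproduce. The sketch below, using a phrase from the earlier sample text, captures "up to 5 characters" with and without /u and shows that the byte-mode capture is no longer valid UTF-8, which is exactly the kind of string that later makes json_encode() fail. Variable names are illustrative.

```php
<?php
$text = "अगली सुनवाई";

// Byte mode: "up to 5 characters" really means up to 5 bytes, which
// cuts the second letter in half and leaves invalid UTF-8 behind.
preg_match('/^.{0,5}/', $text, $m);
$byteCut = $m[0];

// UTF-8 mode: the same pattern captures 5 whole code points.
preg_match('/^.{0,5}/u', $text, $m);
$charCut = $m[0];

$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');   // false
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');   // true
$jsonOfByteCut  = json_encode(['message' => $byteCut]);   // false
```

Note that /u counts code points, not user-perceived characters: the vowel sign ी is its own code point, so the five captured code points here are अ, ग, ल, ी and the space.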

Diagnosing json_encode() Returning false

$data = ['message' => $corruptedText];
$json = json_encode($data);

if ($json === false) {
    // Not the json_encode call that's broken — it's the input string
    $error = json_last_error_msg();  // "Malformed UTF-8 characters, possibly incorrectly encoded"

    // Diagnose: find which field is corrupt
    foreach ($data as $key => $value) {
        if (!mb_check_encoding($value, 'UTF-8')) {
            error_log("Corrupt UTF-8 in field: {$key}");
        }
    }

    // Recover (lossy): replace invalid byte sequences with the substitute character ('?' by default)
    $clean = mb_convert_encoding($corruptedText, 'UTF-8', 'UTF-8');
    // Or use: preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7E])[\x80-\xBF]+|[\xC0-\xC1]|[\xF0-\xFF][\x80-\xBF]{0,2}|[\xE0-\xEF][\x80-\xBF]?|[\xC2-\xDF](?![\x80-\xBF])/S', '', $corruptedText)
}
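The diagnosis loop above only inspects top-level fields. As a sketch under that observation, the hypothetical helper findInvalidUtf8Paths() below (not part of PHP or of the original code) walks nested arrays and reports where the bad strings are. It also shows the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags, available from PHP 7.2, which let json_encode() cope with invalid bytes directly; both are lossy.

```php
<?php
/**
 * Hypothetical helper: returns the dotted paths of all string values
 * (at any nesting depth) that are not valid UTF-8.
 */
function findInvalidUtf8Paths(array $data, string $prefix = ''): array
{
    $paths = [];
    foreach ($data as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            $paths = array_merge($paths, findInvalidUtf8Paths($value, $path));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $paths[] = $path;
        }
    }
    return $paths;
}

$payload = [
    'id'   => 7,
    'note' => ['title' => 'ok', 'body' => "a\xFFb"],  // \xFF is never valid in UTF-8
];

$corruptPaths = findInvalidUtf8Paths($payload);

// PHP 7.2+: let json_encode() handle invalid bytes itself.
$substituted = json_encode($payload, JSON_INVALID_UTF8_SUBSTITUTE); // bad bytes become U+FFFD
$ignored     = json_encode($payload, JSON_INVALID_UTF8_IGNORE);     // bad bytes are dropped
```

Logging the paths before falling back to one of the flags keeps the corruption visible, so the upstream regex or truncation that produced it can still be found and fixed.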

A Lint Rule for the /u Flag

Add a custom PHPStan rule or a simple grep check to your CI pipeline:

# Heuristic: list preg_* calls whose pattern uses \b, \w, or \s but has no u among the flags after the closing delimiter
# Assumes "/" as the delimiter and a pattern written as a string literal on a single line
grep -rnE 'preg_(replace|match|split)' src/ | grep -E '\\[bws]' | grep -vE "/[a-zA-Z]*u[a-zA-Z]*['\"]"
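The same heuristic can be sketched in PHP, which may be easier to wire into an existing test suite than a shell pipeline. The function findMissingUFlag() below is hypothetical and deliberately simple: it looks at one line at a time, only understands "/" as the delimiter, and only sees patterns written as string literals, so treat its output as a list of candidates to review rather than a verdict.

```php
<?php
/**
 * Hypothetical lint helper: returns the 1-based line numbers of preg_* calls
 * whose "/"-delimited literal pattern uses \b, \w, \s, or \d without a u flag.
 * Heuristic only: it sees one line at a time and only "/" delimiters.
 */
function findMissingUFlag(string $phpSource): array
{
    $hits = [];
    foreach (explode("\n", $phpSource) as $index => $line) {
        // Capture the quote, the /pattern/ body, and the trailing flags.
        if (!preg_match('~preg_\w+\(\s*([\'"])(/.*?/)([a-zA-Z]*)\1~u', $line, $m)) {
            continue;
        }
        // A literal backslash followed by b, w, s, or d (either case).
        $usesShorthand = preg_match('~\\\\[bwsd]~i', $m[2]) === 1;
        $hasUFlag      = strpos($m[3], 'u') !== false;
        if ($usesShorthand && !$hasUFlag) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}
```

A CI script could pass each file's contents from file_get_contents() to this function and fail the build when the returned list is not empty.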

What to Watch For

mb_str_split vs str_split — str_split() splits on bytes. For characters, use mb_str_split() (PHP 7.4+).
strlen vs mb_strlen — strlen('अ') returns 3 (bytes). mb_strlen('अ') returns 1 (characters). Use mb_strlen for character counts.
The /u flag and performance — Unicode-aware PCRE is slightly slower than byte-mode. In tight loops over very large strings, measure the impact. In webhook handlers, it is negligible.
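The byte-versus-character pairs above can be checked in a few lines. This is a minimal sketch with illustrative variable names; the encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
$word = "अब";   // two Devanagari letters, three bytes each in UTF-8

$byteLength = strlen($word);                    // 6: counts bytes
$charLength = mb_strlen($word, 'UTF-8');        // 2: counts code points

$bytePieces = str_split($word);                 // six single-byte strings
$charPieces = mb_str_split($word, 1, 'UTF-8');  // ['अ', 'ब'] (PHP 7.4+)

// None of the single-byte pieces is a valid UTF-8 string on its own.
$firstByteIsValid = mb_check_encoding($bytePieces[0], 'UTF-8'); // false
```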
