PHP handles multibyte characters poorly by default. Without the /u flag, PCRE treats strings as sequences of bytes, not Unicode characters. In production, this silently corrupts Devanagari text, breaks bullet point stripping, and causes json_encode() to return false. This article explains why and how to prevent it.
The Problem
// The bullet-point corruption bug (real production case)
$text = "• Status: अगली सुनवाई 15 जून को है\n• Action: दस्तावेज़ जमा करें";
// WITHOUT /u flag — WRONG
$cleaned = preg_replace('/^\s*•\s+/m', '', $text);
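// Result (as seen in production): garbled output. In byte mode • is three bytes (E2 80 A2), and \s is locale-dependent, so it can match a single byte inside a multibyte character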
// WITH /u flag — CORRECT
$cleaned = preg_replace('/^\s*•\s+/mu', '', $text);
// Result: "Status: अगली सुनवाई 15 जून को है\nAction: दस्तावेज़ जमा करें"
The /u flag tells PCRE to treat both the pattern and the subject as UTF-8 and to use Unicode properties for character classes. Without it, \s, \w, \b and . all operate on bytes, not characters, so a 3-byte Devanagari character is seen as 3 separate "characters" by the regex engine.
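To make the byte-versus-character distinction concrete, here is a minimal sketch that counts what `.` matches with and without the flag. The sample character is written with a `\u{...}` escape (PHP 7.0+) so the result does not depend on how the source file is saved; the variable names are illustrative only.

```php
<?php
// U+0917 DEVANAGARI LETTER GA, encoded in UTF-8 as the three bytes E0 A4 97.
$char = "\u{0917}";

// Byte mode: "." matches one byte at a time, so one character yields three matches.
$byteMatches = preg_match_all('/./', $char);

// UTF-8 mode: "." matches one code point.
$charMatches = preg_match_all('/./u', $char);

// With /u the subject must be valid UTF-8. A truncated sequence makes the
// call fail (return false) instead of matching.
$truncated = substr($char, 0, 2);
$result = preg_match('/./u', $truncated);
$badUtf8 = ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR);

printf(
    "bytes: %d, characters: %d, invalid input rejected: %s\n",
    $byteMatches,
    $charMatches,
    $badUtf8 ? 'yes' : 'no'
);
```

The last check matters in practice: once /u is on, a `false` or `null` return from a preg_* function is worth testing for, because it usually means the input was already corrupt.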
When /u Is Required
- Any regex that runs on text that may contain non-ASCII characters
- Any regex using \b (word boundary): \b without /u does not understand Unicode word boundaries
- Any regex using [a-z] or \w when the input may contain non-ASCII letters: in byte mode these classes match ASCII only (or, under some locales, single high bytes), so they either miss such letters or match part of a multibyte character
- Any regex that replaces or captures variable-length patterns near multibyte characters
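The last case in the list above is the easiest to reproduce. The sketch below captures "up to four characters" of a Devanagari word; in byte mode the capture stops in the middle of the second character, and the resulting string is no longer valid UTF-8. The sample word and variable names are only for illustration.

```php
<?php
// "गली" written with escapes: 3 code points, 9 bytes in UTF-8.
$word = "\u{0917}\u{0932}\u{0940}";

// Byte mode: {0,4} counts bytes, so the match ends inside the second character.
preg_match('/^.{0,4}/', $word, $byteMode);

// UTF-8 mode: {0,4} counts code points, so the whole 3-character word is captured.
preg_match('/^.{0,4}/u', $word, $utf8Mode);

var_dump(strlen($byteMode[0]));                      // int(4)
var_dump(mb_check_encoding($byteMode[0], 'UTF-8'));  // bool(false)
var_dump($utf8Mode[0] === $word);                    // bool(true)
var_dump(json_encode(['text' => $byteMode[0]]));     // bool(false)
```

The final line shows how a regex bug surfaces far from its cause: the truncated capture only fails later, when something tries to serialize it.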
Diagnosing json_encode() Returning false
$data = ['message' => $corruptedText];
$json = json_encode($data);
if ($json === false) {
    // The json_encode() call itself is fine; the input string is broken
    $error = json_last_error_msg(); // "Malformed UTF-8 characters, possibly incorrectly encoded"
    // Diagnose: find which field is corrupt
    foreach ($data as $key => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            error_log("Corrupt UTF-8 in field: {$key}");
        }
    }
    // Recover (lossy): invalid byte sequences are replaced with the substitute character ("?" by default)
    $clean = mb_convert_encoding($corruptedText, 'UTF-8', 'UTF-8');
    // Or use: preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7E])[\x80-\xBF]+|[\xC0-\xC1]|[\xF0-\xFF][\x80-\xBF]{0,2}|[\xE0-\xEF][\x80-\xBF]?|[\xC2-\xDF](?![\x80-\xBF])/S', '', $corruptedText)
}
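Since PHP 7.2, json_encode() can also deal with invalid bytes itself through the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags. The sketch below wraps that in a fallback; `encodeLossy()` is a hypothetical helper name, and whether silent substitution is acceptable is an assumption that depends on your data.

```php
<?php
/**
 * Hypothetical helper: encode strictly first, then fall back to a lossy
 * encode that replaces invalid UTF-8 with U+FFFD, logging the fallback.
 */
function encodeLossy(array $data): string
{
    $json = json_encode($data);
    if ($json !== false) {
        return $json;
    }

    error_log('json_encode failed: ' . json_last_error_msg());

    // JSON_THROW_ON_ERROR (PHP 7.3+) turns any remaining failure into a JsonException.
    return json_encode($data, JSON_INVALID_UTF8_SUBSTITUTE | JSON_THROW_ON_ERROR);
}

// "ok" followed by a truncated three-byte sequence (invalid UTF-8).
$payload = ['message' => "ok\xE0\xA4"];

$json = encodeLossy($payload);
$decoded = json_decode($json, true);
```

If dropping the bad bytes is preferable to marking them, JSON_INVALID_UTF8_IGNORE can be swapped in. Either way the fix is lossy, so logging the fallback keeps the upstream bug visible.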
A Lint Rule for the /u Flag
Add a custom PHPStan rule or a simple grep check to your CI pipeline:
# Flag preg_* calls that use \b, \w or \s but carry no u modifier
# Heuristic only: line-based, and assumes "/" delimiters with the pattern literal on the same line
grep -rnE 'preg_(replace|match|split)' src/ | grep -vE "/[a-zA-Z]*u[a-zA-Z]*['\"]" | grep -E '\\[bws]'
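The same heuristic can be expressed in PHP, which makes it easier to unit-test or to grow into a proper static-analysis rule later. This is a minimal sketch, not a PHPStan rule: `findPregCallsMissingU()` is a hypothetical helper, and it only recognizes single-quoted, slash-delimited pattern literals that sit on the same line as the call.

```php
<?php
/**
 * Hypothetical helper: return the 1-based line numbers of preg_* calls whose
 * first argument is a single-quoted, slash-delimited literal without "u".
 */
function findPregCallsMissingU(string $source): array
{
    // Nowdoc keeps the regex free of PHP string escaping.
    // Group 1 captures the modifier letters after the closing slash.
    $finder = <<<'RE'
~\bpreg_\w+\(\s*'/(?:[^'\\]|\\.)*/([a-zA-Z]*)'~u
RE;

    $hits = [];
    foreach (explode("\n", $source) as $index => $line) {
        if (preg_match($finder, $line, $m) === 1 && strpos($m[1], 'u') === false) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}

$sample = <<<'PHP'
$a = preg_replace('/^\s*•\s+/m', '', $text);
$b = preg_replace('/^\s*•\s+/mu', '', $text);
$c = preg_match('/\bfoo\b/', $text);
PHP;

print_r(findPregCallsMissingU($sample)); // lines 1 and 3
```

Patterns built from variables, other delimiters, or multi-line calls are not detected, so treat a clean result as "nothing obvious" and not as proof.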
What to Watch For
- mb_str_split vs str_split: str_split() splits on bytes. For characters, use mb_str_split() (PHP 7.4+).
- strlen vs mb_strlen: strlen('अ') returns 3 (bytes). mb_strlen('अ') returns 1 (characters). Use mb_strlen for character counts.
- The /u flag and performance: Unicode-aware PCRE is slightly slower than byte-mode. In tight loops over very large strings, measure the impact. In webhook handlers, it is negligible.
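The first two points can be checked side by side. The sketch below assumes the mbstring extension is loaded and PHP 7.4+ for mb_str_split(); the sample word is only an illustration.

```php
<?php
// "गली" written with escapes: 3 code points, 9 bytes in UTF-8.
$word = "\u{0917}\u{0932}\u{0940}";

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts code points

$bytePieces = str_split($word);                 // one element per byte
$charPieces = mb_str_split($word, 1, 'UTF-8');  // one element per code point

printf(
    "strlen=%d mb_strlen=%d str_split=%d mb_str_split=%d\n",
    $byteLength,
    $charLength,
    count($bytePieces),
    count($charPieces)
);
// strlen=9 mb_strlen=3 str_split=9 mb_str_split=3
```

Note that the mb_* functions count code points, not what a reader perceives as one character: the third code point here is a vowel sign attached to the preceding consonant. If that distinction matters, the intl extension's grapheme_strlen() is the usual tool.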