I ran into an ugly issue: I had to discard invalid UTF-8 characters from a string before passing it to json_decode(), which otherwise fails to decode it. At first I discovered that it's possible to ignore invalid UTF-8 characters using:
iconv("UTF-8", "UTF-8//IGNORE", $text)
However, it turns out this has been broken for ages, and using //IGNORE produces an E_NOTICE. Luckily, I found a comment that suggests a workaround:
ini_set('mbstring.substitute_character', "none");
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
This, however, was not enough. Some of the characters I was getting were valid UTF-8 but non-printable, and json_decode() was failing on them as well. To work around this I used:
$text = preg_replace('/[^\pL\pN\pP\pS\pZ\pM]/u', '', $text);
This will remove newlines as well, which is fine for me. You can also try removing non-printable byte sequences.
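For convenience, the two steps can be wrapped in one function. This is only a minimal sketch: the name clean_for_json() is made up for illustration, and it uses mb_substitute_character() in place of the ini_set() call so the previous setting can be restored afterwards. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from any library): strips invalid UTF-8 bytes
// and non-printable characters before the string goes to json_decode().
function clean_for_json(string $text): string
{
    // Step 1: drop invalid byte sequences instead of substituting "?".
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the earlier setting

    // Step 2: keep only letters, numbers, punctuation, symbols,
    // separators and marks. Control characters such as "\n" are removed.
    $cleaned = preg_replace('/[^\pL\pN\pP\pS\pZ\pM]/u', '', $text);

    // preg_replace() returns null on error; fall back to an empty string.
    return $cleaned ?? '';
}

// Usage: a JSON string with a stray 0xFF byte and a trailing newline.
$raw = "{\"name\": \"caf\xC3\xA9\xFF\"}\n";
$data = json_decode(clean_for_json($raw), true);
var_dump($data);
```

Keep in mind that the regular expression also removes tabs and newlines, so this only fits cases where losing them is acceptable.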