Resolving charset encoding mix-ups / mojibake

I can share one other neat little thing that MySQL can do. This is really getting into the territory of “black magic”, but it can fix some things that other methods cannot. MySQL has a CONVERT function that can be used like this:

SELECT setting_value, convert(cast(convert(setting_value using latin1) as binary) using utf8) as fixed_setting_value
FROM `whatever_table`

And what this will result in is two printed columns in the output. the original mojibaked column (in this case setting_value), and the converted and cast column. For some mojibake instances, you’ll get something like this in the original column:

Accusé de réception de l’évaluation d’une soumi…

and this in the converted column:

Accusé de réception de l’évaluation d’une soumissi…

Notice that in the second case, the mojibake is resolved. You may be asking yourself why this isn’t just the way to fix it generally, since you could course do this in an UPDATE statement on problematic columns, but the issue here is that in some cases the CONVERT function occasionally encounters a character it can’t fix and truncates the string. It’s annoying since it does this silently and if you’re doing this on a large table you won’t know if a column was truncated in a particular record. But it can be awesome for at least looking at problematic columns and then manually copying the fixed string back into the broken record.

Cheers,
Jason