Hi everybody,
I need help solving this font problem. Several of my jobs show wrong letters like: ’ or [“. Do you have any idea how to correct this error?
mysql> SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'your-database' AND COLLATION_NAME IS NOT NULL AND COLLATION_NAME != 'utf8_general_ci';
If the above query returns any tables with collations that are not 'utf8_general_ci', these will need to be converted to 'utf8_general_ci':
mysql> ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Please let me know if you need any further assistance.
Best regards,
Josh Noronha (he/his)
Systems Specialist
PKP|PS Support Team
Hi @jnoronha, thanks for your help. I followed your advice, but I was already sure that all the tables were in utf8_general_ci, and I can confirm that they are. I still need your help with this, because it looks like an upgrade bug: we have many journals on OJS, and the others that I upgraded to version 3.3.0.8 did not have this result. All their characters are correct, and I could not tell what the difference is.
Hi @thelaris, @ojs_univie, no solution has been found yet. We are still working on it with our technicians, but there seems to be a bug in the update. Looking with a hex editor, I see this:
The wrong sequences are:
C3 83 C2 A8 which should be an accented lowercase e.
C3 83 C2 B9 which should be an accented lowercase u.
They are valid characters in UTF-8, or rather they can be interpreted as valid characters: in a 2-byte sequence the first byte has its most significant bits set to 110 (here the first bytes start with C, which has bits 7 and 6 set to 1), and the second byte has its 2 most significant bits equal to 10 (here the second bytes start with 8, A or B, which is fine). But each character ends up represented by 4 bytes, that is, two 2-byte sequences, and this is not good. The good thing is that we know what should be there instead of garbage. So I went to look here: https://www.utf8-chartable.de/. How is accented lowercase e encoded in UTF-8? C3 A8. What about accented lowercase u? C3 B9.
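To make the bit-pattern check above reproducible, here is a minimal Python sketch. The helper names (`is_lead_of_two`, `is_continuation`, `code_points`) are made up for this illustration and are not part of OJS or MySQL. It only shows that each 4-byte run decodes as two separate 2-byte characters instead of one.

```python
# The two suspicious byte runs seen in the hex editor.
broken_e = bytes.fromhex("C3 83 C2 A8")
broken_u = bytes.fromhex("C3 83 C2 B9")


def is_lead_of_two(b: int) -> bool:
    # 110xxxxx marks the first byte of a 2-byte UTF-8 sequence.
    return b & 0b11100000 == 0b11000000


def is_continuation(b: int) -> bool:
    # 10xxxxxx marks a continuation byte.
    return b & 0b11000000 == 0b10000000


def code_points(raw: bytes) -> list:
    # Decode as UTF-8 and list the resulting code points.
    return [f"U+{ord(ch):04X}" for ch in raw.decode("utf-8")]


for raw in (broken_e, broken_u):
    # Bytes 0 and 2 are lead bytes, bytes 1 and 3 are continuations.
    pattern_ok = (
        is_lead_of_two(raw[0]) and is_continuation(raw[1])
        and is_lead_of_two(raw[2]) and is_continuation(raw[3])
    )
    print(raw.hex(" ").upper(), pattern_ok, code_points(raw))
```

Each run decodes to U+00C3 followed by a second character (U+00A8 or U+00B9), which is why the text is technically valid UTF-8 but shows two wrong glyphs where one accented letter was expected.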
So:
C3 A8 has become C3 83 C2 A8
C3 B9 has become C3 83 C2 B9
If I remove the "middle" 83 C2 from both sequences, I get the right encoding. How the extra bytes were inserted I don't know, but it looks like the result of a bug, and suspicion obviously falls on the update procedure.
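One possible explanation, offered as an assumption and not a confirmed diagnosis of this installation: the same byte patterns appear when correct UTF-8 bytes are read as Latin-1 and then encoded as UTF-8 a second time ("double encoding"). The Python sketch below reproduces both sequences that way and reverses them. The function names `double_encode` and `repair` are hypothetical, and this is not what the OJS upgrade script does as far as I know.

```python
def double_encode(text: str) -> bytes:
    # Correct UTF-8 bytes misread as Latin-1, then encoded as UTF-8 again.
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair(raw: bytes) -> bytes:
    # Undo exactly one round of the misreading above.
    # Raises an error if the data was not double encoded in this way.
    return raw.decode("utf-8").encode("latin-1")


# U+00E8 is accented lowercase e, U+00F9 is accented lowercase u.
for ch in ("\u00e8", "\u00f9"):
    broken = double_encode(ch)
    fixed = repair(broken)
    print(f"U+{ord(ch):04X}", broken.hex(" ").upper(), "->", fixed.hex(" ").upper())
```

If this assumption holds, the fix is a character set conversion of the stored data and not a deletion of individual bytes. It should be tried on a copy of the database first.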
I would really like to take your word for it that the collations are all fine, but it would be good to see some proof of that. Maybe provide the output of a query where the broken character can be seen as well.