I wasn’t sure if I should open a new ticket or continue here but I have done some more research into this issue and the problem seems to be a more long term and wider database problem.
Current setup:
-
Latest stable OJS v3
-
The database is configured for UTF8
-
The database is an import and upgrade from an old OJS v2 database. The upgrade was successful
-
The data in the current database has characters such like small tildes, double quotes, em and en dashes, etc which are incorrectly encoded and look like ü in the database tables
-
The incorrect characters appear in the old OJS v2 database so this is a legacy issue from a few years back and isn’t the results of the new OJS v3 implementation
My findings:
-
I have exported the db text and it appears to be Windows 1252 encoding. There are two causes for this; 1) if the previous database and db client connection where latin1, MySQL’s interpretation of latin1 is actually Windows 1252 encoding not the more standard ISO 8859 1. 2) users have probably been pasting in from Windows applications which have been encoded in Windows 1252. Because the previous OJS v2 db was probably latin1 it has held on to the Windows 1252 (I know Windows now defaults to utf8 but the versions of Word etc probably being used will be old XP and 7 versions)
-
I have exported the data from the database and run iconv against it. It is hitting a couple of errors where it can’t convert some of the characters. Small tilde is an example of a Windows 1252 character which causes iconv to error
-
I have managed to use Notepad to convert Windows 1252 content to UTF8 although my concern is that Notepad simply saves a UTF8 wrapper around the Windows 1252 characters
-
I have managed to import the data into UTF8 mysql tables and the data in there looks correct (small tildes, em/en dashes, apotrophes, double quotes) display correctly when I select tables. So far I have converted the citations, email_templates_data and email_templates_default_data tables.
-
Running cmd apps such as file and chardetect are reporting the db dumps (I have used mysqldump to create db dump files) as almost certainly UTF8 encoded.
Now here is what is strange; so far my workaround is to set the connection_charset to utf8 (by default it is set to Off). This outputs the content from articles correctly. However, this breaks some of the backend code; for example if I try and edit the user registration email I get a JSON error. If I convert back to connection_charset Off, the JSON errors go but the articles now show incorrect data. For example, an em dash (a long -) will show as –.
This happens whether I have the broken data in MySQL or I have the cleaned data. Digging further I found this, if connection_charset is set to utf8, the JSON data (E.g. user registration emails) contains invalid characters. In particular the diamond with the ? inside it. Changing back to connection_charset Off, the data output is correct.
Now I would have thought that the data output would be the same; if I set connection_charset to utf8 then both the article and the json would be encoded the same way. So I set up a test harness using the php mysqli library with two SQL queries; one pulls data from the citations table and the other pulls from the email_template_data table. Sure enough, if I don’t set the connection character set using mysql_set_charset, the data displays special characters incorrectly. If I set mysql_set_charset to utf8, both outputs are correct.
So the next question is; why, if my test harness outputs data correctly from both citations and email_templates_data, then why is this not also the case in OJS?
My test harness would suggest the issue lies with the db connection, in particular what is returned by the db, as opposed to how the resulting output file (in my case HTML) has its encoding defined.
If anyone has any suggestions or has run into a similar problem I’d really appreciate any feedback. I’ve dug around the code but can’t work out where, for example, the citations table is queried when building the Article DAO output, so even if someone can point me in the direction of where this query is actually built.
My last test will be to try and debug the OJS db lib; I figure if I can intercept the query output I might be able to find out where the encoding is going wrong.