DB Charset breaking apostrophes

haydenyoung · March 20, 2018, 2:57pm

When editing the User Validate and User Register email we received this message: Failed Ajax request or invalid JSON returned.

It turned out changing the connection_charset = utf8 in config.inc.php fixed the problem. However, this change is now causing incorrect characters to show for the apostrophe, for example:

"The image was taken by Dr Charles von der Meden from the South African Environmental Observation Network (SAEON) using SAEONâ€™s SkiMonkey III deep-sea camera. "

This should read:

"The image was taken by Dr Charles von der Meden from the South African Environmental Observation Network (SAEON) using SAEON’s SkiMonkey III deep-sea camera.

Note the apostrophes in SAEON’s is changing to SAEONâ€™s.

Changing the connection_charset back to utf-8 fixes the problem but then we are back to where we started where the User Validate and User Register emails throw a JSON error.

Any help much appreciated.

asmecher · March 20, 2018, 3:10pm

Hi @haydenyoung,

A connection_charset setting of utf-8 is a typo – that setting goes to ADODB’s setCharSet function, and it won’t know what utf-8 means. It’s likely that the connection is left at a default of latin1.

To correct it, you’ll need not only to correct the connection_charset to utf8, you’ll also need to transcode your database to properly encode the data there. There are some suggestions to correct it on StackOverflow, and personally I’ve used both iconv and ftfy tools to correct encodings.

Regards,
Alec Smecher
Public Knowledge Project Team

haydenyoung · March 27, 2018, 6:28am

Thanks for the suggestion.

The db was migrated from a different server. When I create any db I always specify character set utf8

E.g.

create database my_db character set utf8;

I have also changed the connection_charset = utf8 in the config php.

Therefore, my last step should involve running a re-encoder against the existing data (E.g. iconv/ftfy)?

Thanks

asmecher · March 27, 2018, 3:09pm

Hi @haydenyoung,

Yes, you’ll probably need to transcode the data. I’ve used iconv in the past, and there are some good hints on StackOverflow for how to apply it to a database dump with similar encoding problems to yours.

I’ve heard good things about ftfy but haven’t used it myself. It attempts to automatically detect content garblings, and can handle different kinds of mixed encodings in the same file, so if that’s your situation it might work best.

If it were me, I’d probably start with iconv as it’s a more “predictable” transform – it doesn’t try to get too clever, but it’ll transform everything absolutely consistently.

Regards,
Alec Smecher
Public Knowledge Project Team

haydenyoung · March 28, 2018, 2:39pm

Hi @asmecher

Thanks for the feedback. I’ve done some preliminary runs of iconv without much success. I have been discussing this issue with my developer and we are of the thinking that there are a number of reasons as to why we could be seeing this issue (and it may not even be db related):

We imported from another db which which was configured with the charset latin1, meaning legacy content may have not been converted when migrating to the new charset = utf8 db,
The problem may be related to new records. Is it possible that the web form is incorrectly encoding input?
It could be an end-user problem. What if users are copying and pasting directly from a Word document with a different encoding and the special chars are being copied directly into the web form. I’ve seen this happen with apostrophes and other punctuation chars.

I’m thinking there could be a number of possible issues causing these problems so my thinking is:

I would have to establish how the data is being input into OJS and what the encoding of the source documents might be,
Whether the source of the problem is user-facing, backend or db related.

I’m wondering if anyone else has experienced these issues and whether it might not be db related?

Thanks

Hayden

asmecher · March 28, 2018, 2:47pm

Hi @haydenyoung,

This is indeed a tricky subject.

In my experience various parts of the web toolset are pretty good at making sure that invalid data doesn’t get into the system, as long as they’re configured correctly – a recent web server, PHP, TinyMCE, and OJS all configured to work with UTF-8 shouldn’t be introducing bad data into the database.

Some parts of that toolchain used to cause trouble because of Microsoft’s “Smart Quotes” non-standard character set additions, but I haven’t seen that lately.

If you think you’ve got mixed encodings in your database, ftfy might be a better tool to use.

Regards,
Alec Smecher
Public Knowledge Project Team

haydenyoung · March 28, 2018, 2:52pm

my feeling is that this could be indeed be the issue but it probably warrants further investigation.

haydenyoung · May 25, 2018, 6:30am

I wasn’t sure if I should open a new ticket or continue here but I have done some more research into this issue and the problem seems to be a more long term and wider database problem.

Current setup:

Latest stable OJS v3
The database is configured for UTF8
The database is an import and upgrade from an old OJS v2 database. The upgrade was successful
The data in the current database has characters such like small tildes, double quotes, em and en dashes, etc which are incorrectly encoded and look like Ã¼ in the database tables
The incorrect characters appear in the old OJS v2 database so this is a legacy issue from a few years back and isn’t the results of the new OJS v3 implementation

My findings:

I have exported the db text and it appears to be Windows 1252 encoding. There are two causes for this; 1) if the previous database and db client connection where latin1, MySQL’s interpretation of latin1 is actually Windows 1252 encoding not the more standard ISO 8859 1. 2) users have probably been pasting in from Windows applications which have been encoded in Windows 1252. Because the previous OJS v2 db was probably latin1 it has held on to the Windows 1252 (I know Windows now defaults to utf8 but the versions of Word etc probably being used will be old XP and 7 versions)
I have exported the data from the database and run iconv against it. It is hitting a couple of errors where it can’t convert some of the characters. Small tilde is an example of a Windows 1252 character which causes iconv to error
I have managed to use Notepad to convert Windows 1252 content to UTF8 although my concern is that Notepad simply saves a UTF8 wrapper around the Windows 1252 characters
I have managed to import the data into UTF8 mysql tables and the data in there looks correct (small tildes, em/en dashes, apotrophes, double quotes) display correctly when I select tables. So far I have converted the citations, email_templates_data and email_templates_default_data tables.
Running cmd apps such as file and chardetect are reporting the db dumps (I have used mysqldump to create db dump files) as almost certainly UTF8 encoded.

Now here is what is strange; so far my workaround is to set the connection_charset to utf8 (by default it is set to Off). This outputs the content from articles correctly. However, this breaks some of the backend code; for example if I try and edit the user registration email I get a JSON error. If I convert back to connection_charset Off, the JSON errors go but the articles now show incorrect data. For example, an em dash (a long -) will show as â€“.

This happens whether I have the broken data in MySQL or I have the cleaned data. Digging further I found this, if connection_charset is set to utf8, the JSON data (E.g. user registration emails) contains invalid characters. In particular the diamond with the ? inside it. Changing back to connection_charset Off, the data output is correct.

Now I would have thought that the data output would be the same; if I set connection_charset to utf8 then both the article and the json would be encoded the same way. So I set up a test harness using the php mysqli library with two SQL queries; one pulls data from the citations table and the other pulls from the email_template_data table. Sure enough, if I don’t set the connection character set using mysql_set_charset, the data displays special characters incorrectly. If I set mysql_set_charset to utf8, both outputs are correct.

So the next question is; why, if my test harness outputs data correctly from both citations and email_templates_data, then why is this not also the case in OJS?

My test harness would suggest the issue lies with the db connection, in particular what is returned by the db, as opposed to how the resulting output file (in my case HTML) has its encoding defined.

If anyone has any suggestions or has run into a similar problem I’d really appreciate any feedback. I’ve dug around the code but can’t work out where, for example, the citations table is queried when building the Article DAO output, so even if someone can point me in the direction of where this query is actually built.

My last test will be to try and debug the OJS db lib; I figure if I can intercept the query output I might be able to find out where the encoding is going wrong.

asmecher · May 25, 2018, 4:39pm

Hi @haydenyoung,

Database encoding correction is a subject of its own, with lots of discussion on e.g. StackOverflow, but I’d suggest…

Make absolutely sure that your config.inc.php parameters are correct. If you innocently corrected a utf8 to a utf-8 or vice versa, you might have confused one of the underlying tools OJS uses.
Work with the database dump to correct encodings, then reload it. Tools like FTFY and iconv can help. I often use vim.
Beware of mixing correctly-encoded data in with badly-encoded data! This will make it a lot harder to consistently correct everything. If your data is consistently encoded wrong, often a single pass with the right iconv command can fix the whole thing.

Regards,
Alec Smecher
Public Knowledge Project Team

haydenyoung · May 28, 2018, 9:04pm

Hi @asmecher

Thanks again for the assistance.

It looks like I’ve managed to make some progress…

I was relying on iconv to re-encode mysqldumps but was hitting encoding errors. Every combination I tried wouldn’t work. It’s difficult to determine what the original encoding was but I do know there is a difference between what most tools interpret as Latin1 and what MySQL uses (I believe it is CP1252). There were also issues getting mysqldump to output the correctly encoded data (this can be set using SET NAMES which I found sets the character_set_results mysql param which I needed to set to LATIN1 to get the correctly encoded output. Setting this is not possible (as far as I know) in mysqldump using the default-character-set flag and probably needs to be set in the my.cnf script.

So rather than persisting with mysql and iconv I decided to pull together some PHP scripts and build a reencoding/exporting tool. Turns out there is an excellent encoding library GitHub - neitanod/forceutf8: PHP Class Encoding featuring popular Encoding::toUTF8() function --formerly known as forceUTF8()-- that fixes mixed encoded strings. which will reencode latin1 output to utf8. Using this tool along with GitHub - ifsnop/mysqldump-php: PHP version of mysqldump cli that comes with MySQL to build the db inserts, I was able to cobble together a small cmd line app which can fix incorrectly encoded dbs from latin1 to utf8 and now I’m getting the correct output with my OJS config’s charset set to utf8.

I’m sure there is a way to do the above with mysqldump and iconv but I was unable to work it out and the neitanod/forceutf8 took the guess work out of which encoding I needed to use.