Resolving charset encoding mix-ups / mojibake

Extension of Charset problems (ISO-8859-1 x UTF-8) - #10 by wilsonw

After upgrading to 3.3 and setting the utf-8 charset according to the config template, we found that our database holds historical data with mixed charset encodings. New data is saved properly. To remedy the old data, I found titles, subtitles, and abstracts in the publication_settings table (among other places) with the â€* characters stored at the database level.

SELECT * FROM `publication_settings` WHERE `setting_value` LIKE '%â€%'

(Screenshot of the query results showing the mojibake characters in publication_settings.)

With ftfy I ran ftfy.fix_text() and resolved a few publications by updating the database manually. Since there are 600+ cases of mixed-encoding characters, I’m planning to loop through the resulting dataset and fix the text via ftfy, roughly as sketched below. Since this worked manually for a few publications, I’m fairly certain automating the rest should work. Is there anything I should be careful of before proceeding, or can you confirm that this should work in theory?
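
A minimal sketch of that loop, assuming mysql-connector-python (connection details are placeholders; the key columns are taken from our 3.3 publication_settings schema):

# Select mojibaked rows, fix them with ftfy, and write them back.
import ftfy
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="ojs",
                               password="***", database="ojs")
cur = conn.cursor()
cur.execute(
    "SELECT publication_id, locale, setting_name, setting_value "
    "FROM publication_settings WHERE setting_value LIKE %s",
    ("%â€%",),
)

for pub_id, locale, name, value in cur.fetchall():
    fixed = ftfy.fix_text(value)
    if fixed != value:
        # Parameterized query, so quotes in the text are escaped for us.
        cur.execute(
            "UPDATE publication_settings SET setting_value = %s "
            "WHERE publication_id = %s AND locale = %s AND setting_name = %s",
            (fixed, pub_id, locale, name),
        )

conn.commit()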

Thank you!

Edit: looks like we have these mixed encodings all over the place.

Search results for "%â€%" (matching at least one of the words):
3 matches in announcement_settings
9 matches in author_settings
30 matches in citations
9 matches in comments
10 matches in controlled_vocab_entry_settings
1143 matches in email_log
64 matches in email_templates_default_data
15 matches in email_templates_settings
41 matches in event_log_settings  
3 matches in issue_settings
25 matches in journal_settings
1 match in navigation_menu_item_settings
208 matches in notes
3 matches in publication_galleys
598 matches in publication_settings
21 matches in rt_searches
2 matches in section_settings
272 matches in submission_comments
36 matches in submission_file_settings
7967 matches in submission_search_keyword_list
4 matches in user_settings
Total: 10464 matches

Wilson

Hi Wilson,

Your plan seems reasonable to me. We use ftfy at PKP|PS to fix mojibaked database dumps we get from clients migrating to us. We’ve slightly modified our version of ftfy to leave single quotes alone; our workflow is to mysqldump the database, run the dump file through ftfy’s command-line tool, fix up the table character set definitions, and reimport it.
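
A rough sketch of that step (our actual script differs; here uncurl_quotes=False stands in for our quote-preserving modification):

# Run a mysqldump file through ftfy line by line. uncurl_quotes=False stops
# ftfy from turning curly apostrophes into bare ' marks, which would break
# the string escaping in the dump.
import ftfy

with open("ojs_dump.sql", encoding="utf-8") as src, \
        open("ojs_dump.fixed.sql", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(ftfy.fix_text(line.rstrip("\n"), uncurl_quotes=False) + "\n")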

I’d also recommend simply truncating your submission_search_* tables and then rebuilding your search index using the

php tools/rebuildSearchIndex.php

tool that comes with OJS. It’ll knock eight thousand matches off your list.
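
If you’d rather script the truncation, something like this should do it (table names assume a stock 3.x schema; connection details are placeholders):

# Empty the search tables; rebuildSearchIndex.php repopulates them.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="ojs",
                               password="***", database="ojs")
cur = conn.cursor()
# Child table first; if foreign key constraints still block TRUNCATE,
# use DELETE FROM instead.
for table in ("submission_search_object_keywords",
              "submission_search_objects",
              "submission_search_keyword_list"):
    cur.execute("TRUNCATE TABLE " + table)
conn.close()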

Best,
Jason

Oh, and I almost forgot. Please make a database backup before doing work like this, in case you need to roll back.

Best,
Jason

Thanks, Jason, for the tip. Re-running rebuildSearchIndex is helping. I did bump into one error, which I bypassed by changing the collation of the submission_search_* tables to utf8_general_ci. You mentioned fixing charset definitions – is utf8_general_ci the recommended one? It looks like all of ours are latin1_swedish_ci.

PHP Fatal error: Uncaught PDOException: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' in /var/www/html/live/ojs/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:115

For the other tables in the database, I ran into the problem you alluded to with unescaped single quotes breaking the mysqldump re-import. Is the ftfy adjustment code something your team is willing to share or shed more light on?

As an alternative, via Python I ran SELECT queries against each affected table, ran ftfy.fix_text(), then ran UPDATE queries to save the columns. This worked for the big workloads (journal_settings, notes, publication_settings, submission_comments), as the single quotes were automatically escaped by the Python mysql connector. Thank you!

Hello all …
I’m facing this very problem with our OJS 3.3.0.5 multilingual, multi-journal installation…

I had reports about the search not working, so I tried to rebuild the index as I have done in the past (Need help fixing keywords encoding or regenearting keywords table - OJS 3.2.1.1).

Sadly… now the process is not working

If I run
select table_name, table_schema, table_collation from information_schema.tables where table_schema='revistasuncu';
most of the tables are reported with a table_collation of utf8_general_ci; the rest have utf8mb4_general_ci.

The rebuildSearchIndex.php result:

php tools/rebuildSearchIndex.php 
Restablecer índice ... hecho
Indexando "Revista de Turismo e Identidad" ... PHP Warning:  Invalid argument supplied for foreach() in /var/www/html/ojs3/lib/pkp/classes/submission/SubmissionKeywordDAO.inc.php on line 57
PHP Warning:  Invalid argument supplied for foreach() in /var/www/html/ojs3/lib/pkp/classes/submission/SubmissionKeywordDAO.inc.php on line 57
PHP Warning:  Invalid argument supplied for foreach() in /var/www/html/ojs3/lib/pkp/classes/submission/SubmissionLanguageDAO.inc.php on line 57
14 artículos indexados
Indexando "Experticia" ... 23 artículos indexados
Indexando "Revista de Médica Universitaria" ... 0 artículos indexados
Indexando "Revista EUROPA" ... 21 artículos indexados
Indexando "ENCUENTRO EDUCATIVO. Revista de investigación del Instituto de Ciencias de la Educación" ... 49 artículos indexados
Indexando "Revista de Historia Universal" ... PHP Fatal error:  Uncaught PDOException: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' in /var/www/html/ojs3/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:115
Stack trace:
#0 /var/www/html/ojs3/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php(115): PDOStatement->execute(NULL)
#1 /var/www/html/ojs3/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(370): Doctrine\DBAL\Driver\PDOStatement->execute()
#2 /var/www/html/ojs3/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(664): Illuminate\Database\Connection->Illuminate\Database\{closure}(Object(Illuminate\Database\Query\Expression), Array)
#3 /var/www/html/ojs3/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(631): Illuminate\Database\Connection->runQueryCallback(Object(Illuminate\Database\Query\Expression), Array, Object(Closure))
#4 /var/www/html/ojs3/l in /var/www/html/ojs3/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php on line 671

Clearly something is wrong within the “Revista de Historia Universal” journal, but I haven’t been able to detect the problem. I have been working for several days hunting down and fixing all kinds of mojibake without success, at least on the search index front…

Could someone guide me towards which tables are involved in the search index building process? Maybe I’m focusing on the wrong ones :thinking:

What I mean is: which tables are parsed to fill, e.g., the submission_search_keyword_list?

Thanks in advance!

Hi everyone,

I think I missed an earlier reply. Use whatever character set you like, but make sure it’s consistent. We generally recommend UTF8 with some form of utf8 collation (utf8_general_ci, utf8_unicode_ci, or one of the newer utf8mb4 types), since most modern MySQL servers default to it or use it implicitly.

When you see an error like that, MySQL is telling you it cannot compare two strings because their collations differ. One is utf8_general_ci and the other is latin1; the latin1 collation is “implicit”, meaning it’s being used because that’s what MySQL defaults to in its environment. That default is set in the /etc/my.cnf file, and also in the various OJS character set/connection options in your config.inc.php file.
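
The relevant options in the OJS config template look like this (quoting the 3.3 names from memory; check config.TEMPLATE.inc.php for your version):

; Client output/input character set
client_charset = utf-8

; Database connection character set
connection_charset = utf8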

Fixing this sort of thing is not straightforward. Our approach is to export the MySQL database as a mysqldump file, then use tools like ftfy or iconv to convert the file to UTF-8. We then edit the file to make sure that the collations and character sets on the tables are utf8 and that the SET NAMES command at the top of the dump is utf8, and then re-import (or import into a new MySQL database) and see what we get. It’s rarely perfect; it usually involves trial and error, and even manual editing of database records to fix characters that don’t look right.
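
As an illustration of the dump-editing step, a crude sketch (review the diff before importing, since a blanket regex can touch more than table definitions, and column-level CHARACTER SET clauses may need attention too):

# Rewrite charset/collation declarations in a mysqldump file before re-import.
import re

with open("dump.sql", encoding="utf-8") as f:
    text = f.read()

text = re.sub(r"SET NAMES \w+", "SET NAMES utf8", text)
text = re.sub(r"DEFAULT CHARSET=\w+", "DEFAULT CHARSET=utf8", text)
text = re.sub(r"COLLATE=\w+", "COLLATE=utf8_general_ci", text)

with open("dump.utf8.sql", "w", encoding="utf-8") as f:
    f.write(text)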

Cheers,
Jason


Thanks a lot @jnugent for taking the time to answer!

I’ve actually been doing something similar to what you describe… although I’ve found that ftfy has a rough time fixing Greek characters when they’re packed inside words… Anyway, I’ll check the collations and the SET NAMES commands.

Any idea which tables I should look at to solve the index building problem? The plan is to end up fixing all the tables, but if I could prioritize the ones that will let me build the index, that would be awesome :slight_smile:

Hi @hilongo

If you turn on debugging in the config file, you’ll get the exact query that is failing, which will contain the table name:


; Enable database debug output (very verbose!)
debug = Off

But perhaps don’t use this on your live site, since turning it on for this will turn it on everywhere and database queries will be logged to the browser. OTOH, most of the tables used for submission searching are the various *_settings tables (publication_settings, author_settings, etc.) and the controlled vocabulary settings tables, which contain things like the keywords and subjects attached to a submission.

Cheers,
Jason


I can share one other neat little thing that MySQL can do. This is really getting into the territory of “black magic”, but it can fix some things that other methods cannot. MySQL has a CONVERT function that can be used like this:

SELECT setting_value, convert(cast(convert(setting_value using latin1) as binary) using utf8) as fixed_setting_value
FROM `whatever_table`

This will result in two printed columns in the output: the original mojibaked column (in this case setting_value) and the converted-and-cast column. For some mojibake instances, you’ll get something like this in the original column:

Accusé de réception de l’évaluation d’une soumi…

and this in the converted column:

Accusé de réception de l’évaluation d’une soumissi…

Notice that in the second case, the mojibake is resolved. You may be asking yourself why this isn’t just the way to fix things generally, since you could of course do this in an UPDATE statement on problematic columns. The issue is that the CONVERT function occasionally encounters a character it can’t fix and truncates the string. It’s annoying, since it does this silently, and if you’re working on a large table you won’t know whether a column was truncated in a particular record. But it can be awesome for at least looking at problematic columns and then manually copying the fixed string back into the broken record.
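
If you want the same double conversion with a loud failure instead of a silent truncation, Python’s codecs raise on unconvertible input, so a per-row guard is possible; a sketch:

# The same latin1 -> utf8 double conversion as MySQL's CONVERT trick, but
# failing loudly: if the string can't be reversed cleanly, return None and
# leave the row alone instead of silently truncating it.
def demojibake(value):
    try:
        return value.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None

print(demojibake("Accusé de réception"))  # -> "Accusé de réception"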

Cheers,
Jason

Hello again! I was finally able to solve the index building problem.

Aside from cleaning some mojibake characters from the exported data, I found that even when the global SET NAMES and collation were set to utf8, some of the columns still had their CHARACTER SET as ‘latin1’.
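
In case it helps someone else, this is roughly how the offending columns can be listed (connection details are placeholders; ‘revistasuncu’ is our database name):

# List columns still declared latin1, even in tables whose table-level
# collation already reports as utf8.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="ojs", password="***")
cur = conn.cursor()
cur.execute(
    "SELECT table_name, column_name, character_set_name "
    "FROM information_schema.columns "
    "WHERE table_schema = %s AND character_set_name = 'latin1'",
    ("revistasuncu",),
)
for table, column, charset in cur.fetchall():
    print(f"{table}.{column}: {charset}")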

tools/rebuildSearchIndex.php is working as expected and the site’s search function has been restored.

Thanks a million @jnugent !

Very glad to hear it!

Cheers,
Jason
