[OJS 3.3.0.15] Incorrect string value: '\xF0\x9D\x91\xA6ab...' for column 'keyword_text' when rebuilding search indexes

I have an issue with the rebuildSearchIndex.php script in OJS 3.3.0.15: it finishes with the following error:

Incorrect string value: '\xF0\x9D\x91\xA6ab...' for column 'keyword_text' at row 1 in /usr/share/nginx/ojs_install_directory/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:119
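
The escaped bytes in this message can be decoded to see which character is being rejected. The following is a minimal Python sketch, not part of OJS; it only uses the four bytes quoted in the error. They form a well-formed UTF-8 sequence for a character above U+FFFF, which takes 4 bytes in UTF-8, while utf8mb3 stores at most 3 bytes per character.

```python
import unicodedata

# Bytes copied from the error message (the trailing "ab..." is plain ASCII).
raw = b"\xF0\x9D\x91\xA6"

# A strict decode succeeds, so the sequence is valid UTF-8 rather than garbage.
ch = raw.decode("utf-8")

code_point = f"U+{ord(ch):04X}"
name = unicodedata.name(ch)
utf8_length = len(ch.encode("utf-8"))

# Code points above U+FFFF need 4 bytes in UTF-8, one more than utf8mb3 allows.
needs_utf8mb4 = ord(ch) > 0xFFFF

print(code_point, name, utf8_length, needs_utf8mb4)
# prints: U+1D466 MATHEMATICAL ITALIC SMALL Y 4 True
```

Mathematical italic letters like this one typically come from formulas in abstracts or galley files.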

I upgraded from an OJS 2.4.8.2 installation whose database used latin1 as its character set and latin1_swedish_ci as its collation. Looking at the original config.inc.php from OJS 2.4.8.2, I found:

client_charset = utf-8
connection_charset = Off
database_charset = Off
charset_normalization = Off

Is it correct to think that both the schema and data are in latin1?

I assumed this was the case and that the data was in latin1, so I made a dump with the following command:

mysqldump -u ojs_user -p --opt --default-character-set=latin1 --result-file=dump_latin1.sql ojs2482_database

On reviewing the dump, the data looks fine; at least the accented characters display correctly.

Afterwards I changed the engine and charset within the dump using the following sed command:

sed -i 's/) ENGINE=MyISAM.*/) ENGINE=InnoDB DEFAULT CHARSET=utf8;/' dump_latin1.sql

After that, I restored the modified dump_latin1.sql into a new database that has utf8mb3 and utf8mb3_general_ci as its default charset and collation.

I ran the upgrade.php script to migrate to OJS 3.2.1.5 and got the "Successfully upgraded to version 3.2.1.5" message. With the migration to 3.2.1.5 done, I ran the rebuildSearchIndex.php script, which ran to the end without errors.

So I then ran the upgrade.php script to migrate to OJS 3.3.0.15 and got the "Successfully upgraded to version 3.3.0.15" message. I ran the rebuildSearchIndex.php script in 3.3.0.15, and this is where the trouble appeared.

I got the message:

Incorrect string value: '\xF0\x9D\x91\xA6ab...' for column 'keyword_text' at row 1 in /usr/share/nginx/ojs_install_directory/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:119

This makes me think that I have a problem with garbled text (i.e. mojibake) and that I possibly did not convert the character set from latin1 to utf8 correctly. I have another doubt too: since rebuilding the search indexes works in 3.2.1.5, does that mean OJS 3.3.x is not compatible with utf8mb3? Is utf8mb4 necessary?
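
One way to tell these two suspicions apart is to check what kind of characters the text contains. The sketch below is an illustration only: `chars_needing_utf8mb4` and the sample strings are hypothetical and not taken from OJS or from the database in question. Mojibake from a latin1/UTF-8 mix-up usually produces pairs such as "Ã©" made of ordinary 2-byte characters, whereas the error above involves a single character that needs 4 bytes.

```python
def chars_needing_utf8mb4(text):
    """Return (offset, character) pairs for code points above U+FFFF.

    These are the characters that take 4 bytes in UTF-8 and therefore
    cannot be stored in a utf8mb3 column.
    """
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


# What classic mojibake looks like: UTF-8 bytes of "é" read back as latin1.
mojibake = "\u00e9".encode("utf-8").decode("latin-1")

# Hypothetical field value: correct accents plus one supplementary character.
sample = "caf\u00e9 \U0001D466ab"

print(repr(mojibake), chars_needing_utf8mb4(mojibake))
print(chars_needing_utf8mb4(sample))
```

The mojibake string yields no hits because its characters all fit in utf8mb3, while the sample reports the character at offset 5.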

I have read the troubleshooting section that covers character encoding problems. I thought my case matched common problem number 1, so I followed the steps explained there, and I now think OJS 3.3.0.x is compatible only with utf8mb4.

Hi @juancure,

Have you enabled the PDF indexing tools in config.inc.php? You might well be getting invalid UTF-8 from your PDFs, so you might consider disabling those for the moment. That way you’ll just be dealing with what’s in your database, which should be consistent UTF-8. (If you are publishing in HTML as well, note that you’ll still be indexing those files – that’s another possible source for invalid UTF8 content.)

The config.TEMPLATE.inc.php file contains some example PDF indexing lines that pass the content through a filter to attempt to remove bad UTF-8 contents, but the examples don’t always work with all PDFs. You might need to experiment a bit, if you determine that PDFs are the source of the bad data.
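
To illustrate what such a filter does, here is a rough Python sketch. It is not the filter from config.TEMPLATE.inc.php, and the function name `clean_for_utf8mb3` is made up for this example. It drops bytes that are not valid UTF-8 and, as an extra assumption for a utf8mb3 database, also drops valid characters above U+FFFF.

```python
def clean_for_utf8mb3(data: bytes) -> str:
    """Decode extracted text, discarding what a utf8mb3 column would reject."""
    # errors="ignore" silently skips byte sequences that are not valid UTF-8.
    text = data.decode("utf-8", errors="ignore")
    # Valid 4-byte characters (above U+FFFF) are removed as well.
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


# Hypothetical extractor output: one stray byte and one 4-byte character.
extracted = b"ok \xff\xf0\x9d\x91\xa6ab"
print(clean_for_utf8mb3(extracted))
```

Filtering like this loses the affected characters from the search index, so it is a workaround rather than a fix for the storage limit.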

Regards,
Alec Smecher
Public Knowledge Project Team

Added a follow-up from @juancure (thanks, Juan):

I would like to share how I resolved this. In the end it was necessary to convert the charset and collation of my database to utf8mb4 and utf8mb4_0900_ai_ci. After that I set collation = utf8mb4_0900_ai_ci and connection_charset = utf8mb4 in config.inc.php. With those changes, rebuilding the search indexes finally completed.
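
The follow-up does not say which statements were used for the conversion. As a hedged sketch, one common approach in MySQL is `ALTER DATABASE ... CHARACTER SET` plus `ALTER TABLE ... CONVERT TO CHARACTER SET` for each table. The helper below only builds those statements as strings; the function name, the database name and the table list are assumptions for illustration (the table shown is assumed to be the one holding `keyword_text`). Take a backup before running anything like this.

```python
def conversion_statements(database, tables,
                          charset="utf8mb4",
                          collation="utf8mb4_0900_ai_ci"):
    """Build MySQL statements that switch a database and its tables to utf8mb4."""
    # Changes the default for tables created later in this database.
    statements = [
        f"ALTER DATABASE `{database}` CHARACTER SET {charset} COLLATE {collation};"
    ]
    # Converts existing columns and data, one table at a time.
    for table in tables:
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements


# Hypothetical names; a real run would list every table in the OJS database.
for sql in conversion_statements("ojs_database",
                                 ["submission_search_keyword_list"]):
    print(sql)
```

The printed statements can be reviewed and then fed to the mysql client.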