Charset problems (ISO-8859-1 x UTF-8)

Hello all,

I recently started administering an OJS portal that has gone a long time without upgrades or maintenance. As a result, the system has problems with accented characters, typical of an incorrect charset configuration.

The portal runs on FreeBSD, whose default charset is ISO-8859-1, and the MySQL tables are also ISO-8859-1.
PHP's default charset is ISO-8859-1 as well.

Today config.inc.php is as follows:

locale = pt_BR
client_charset = iso-8859-1
connection_charset = iso-8859-1
database_charset = iso-8859-1
charset_normalization = Off

OJS pages are incorrectly accented but publications are not.
I believe the database contains records in both UTF-8 and ISO-8859-1, because these values have been changed several times.
How can I get this right? Should I migrate everything to utf-8 and recode the database?

Very soon I will need to migrate this site to another server that runs Debian, whose default charset is UTF-8.

Regards,

Renato L. Sousa

Hi @rensousa,

When you say that publications are not showing correctly, can you describe what you mean?

Thanks,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,

I am referring to the display of accented characters in Brazilian Portuguese (pt-BR).
I was able to fix it with the setting below:

locale = pt-BR
client_charset = utf-8
connection_charset = utf8
database_charset = utf8
charset_normalization = On

It took me a while to realize that I needed to clear the data cache for the settings to take effect.

Regards,

Renato

Hi guys,
We are facing issues with character encoding when trying to upgrade from 3.1.1.2 to 3.1.2.1. When we make a copy of the 3.1.1.2 database and run OJS 3.1.2.1 against it, even before running the upgrade, all the Spanish special characters like ó or Ñ are replaced by garbled sequences like Ã³ or Ã‘. On 3.1.1.2 it works fine. Did you change something related to this in the upgrade? Is anyone else facing similar issues?

We’ve tried every combination in configuration, even clearing cache files each time, but none of them works. This is the config.inc.php setting:

locale = es_ES
client_charset = utf-8
connection_charset = utf-8
database_charset = utf-8
charset_normalization = utf-8

Here you can see what I say: http://icono14.net/ojs-3121/index.php/icono14/index

This is taking us a lot of time and we haven't been able to upgrade the journal. Could you please help us?

Thanks
Daniel Becerra
ICONO14

Hi @celuloide,

Did you attempt to correct a utf8 to a utf-8? Look at e.g. post #3 by @rensousa above: the inconsistencies are important! Different libraries that OJS uses depend on UTF8 being written in different ways.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,
Yes I did! I've tried every single combination, even clearing the cache in between.
On our OJS 3.1.1-2 this setup works well…

Thanks for your help,

Daniel Becerra
ICONO14

Hi @celuloide,

The above settings posted by @rensousa are correct. You have invalid settings for both connection_charset and database_charset. The charset_normalization setting has been removed so it’s not doing anything.

The invalid settings are passed to the third-party ADODB library; I'm not sure what its behaviour is when it gets a setting it doesn't understand, but my guess is that it connects using the database's default character set.

I would suggest taking a complete backup before you tinker with character sets, since it’s really easy to mix two configurations together by experimenting with this, but very hard to resolve that once it’s happened. Consistency is key.

If you set everything as it's supposed to be, but you're still seeing garbled characters like Ã³, then it's likely that the content is incorrectly encoded in the database. This is more of a database management issue than an OJS issue, so you might have better luck looking e.g. on Stackoverflow.com, or maybe try a tool like ftfy.
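
To illustrate what "incorrectly encoded" means here, the following is a minimal Python sketch (not part of OJS) that reproduces this kind of garbling using only the standard library. It assumes the common case where UTF-8 bytes were read as Windows-1252, which is roughly what MySQL calls latin1; the sample word is made up for illustration.

```python
# Sketch: how UTF-8 text turns into mojibake, and how one round of it can be undone.
original = "publicación"  # illustrative sample text

# UTF-8 bytes misread as Windows-1252 produce the familiar two-character garbage.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

If the reverse step raises an error or produces more garbage, the stored data was probably damaged in some other way, or more than once.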

If you use the configuration recommended above, and your database contents are correctly encoded, then everything should work; if not, one of those two things is the problem.

Regards,
Alec Smecher
Public Knowledge Project Team

Thank you @asmecher,
It is very possible that our database has contents with the wrong encoding, but the same database on another server, with the settings I posted before, seems to work fine: http://icono14.dysing.es/ojs/index.php/icono14/index
However, if I use @rensousa's settings, mojibake shows up. Does this mean anything to you?

Regards,

Daniel Becerra
ICONO14

Hi @celuloide,

I would suggest looking at the process you’re using to move the database between servers – maybe a missing DEFAULT CHARACTER SET utf8 clause on the CREATE DATABASE statement, or a missing --default-character-set parameter on mysqldump (off the top of my head)? If you can identify one of the settings that’s causing you grief e.g. in the journal_settings table, one way to compare the contents between the two to ensure it’s the same is to call the SQL LENGTH function on it from each – that may help determine whether the SQL contents have gotten garbled during the transfer.
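
As a rough illustration of why the LENGTH comparison is useful: in MySQL, LENGTH counts bytes while CHAR_LENGTH counts characters, so a value that was double-encoded during a transfer takes more bytes than the clean one. The Python sketch below mimics that byte count for a UTF-8 column; the helper name and the sample value are assumptions for illustration only.

```python
def byte_length(text: str) -> int:
    """Bytes needed to store text as UTF-8 (comparable to MySQL LENGTH() on a utf8 column)."""
    return len(text.encode("utf-8"))

clean = "Comunicación"  # illustrative sample value
# Simulate a transfer that read the UTF-8 bytes as Windows-1252 and stored the result.
double_encoded = clean.encode("utf-8").decode("cp1252")

# Character count and byte count for each version of the value.
print(len(clean), byte_length(clean))
print(len(double_encoded), byte_length(double_encoded))
```

If the same setting reports different lengths on the two servers, the contents were most likely changed by the dump or import rather than by OJS.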

Just to re-iterate, if you have utf-8 where you should have utf8, that’s wrong and may cause problems – but if you have the same mistake consistently between OJS2 and OJS3, they should both behave the same.

Regards,
Alec Smecher
Public Knowledge Project Team

Hello @asmecher,

Thank you for sharing your leads on this. After upgrading to 3.3 and setting the utf-8 charset according to the config template, we found our database to have past/historical data with mixed charsets. New data is saved properly. To remedy old data, I found titles, subtitles, and abstracts in the publication_settings table (among other places) to have the â€* characters stored at the database level.

SELECT * FROM `publication_settings` WHERE `setting_value` LIKE '%â€%'

Following your reference to ftfy, I ran ftfy.fix_text() and resolved a few publications by updating the database manually. Since there are 600+ cases with characters of mixed encoding, I'm planning to loop through that result set and fix the text via ftfy. Since this worked manually for a few publications, I'm fairly certain automating the rest should work. Is there anything I should be careful of before proceeding, or can you confirm that this should work in theory?
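
A minimal sketch of the kind of loop I have in mind, written as a dry run that only lists what would change. To keep it self-contained it uses a standard-library round trip (Windows-1252 to UTF-8) as a stand-in for ftfy.fix_text(), which handles more cases; the function names and sample rows are hypothetical, and the real rows would come from the setting_value column after a full backup.

```python
def repair(value: str) -> str:
    """Undo one round of UTF-8-read-as-cp1252 damage; return the value unchanged if that does not apply."""
    try:
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-correct text usually fails the round trip, so leave it alone.
        return value

def plan_updates(rows):
    """Yield (row_id, old, new) only for rows whose text would actually change."""
    for row_id, value in rows:
        fixed = repair(value)
        if fixed != value:
            yield row_id, value, fixed

# Hypothetical sample rows standing in for (id, setting_value) pairs.
rows = [
    (1, "Revista de ComunicaciÃ³n"),
    (2, "Readerâ€™s guide"),
    (3, "Título ya correcto"),
    (4, "Plain ASCII title"),
]

for row_id, old, new in plan_updates(rows):
    print(row_id, repr(old), "->", repr(new))
```

Reviewing the dry-run output before writing anything back should catch rows where the fix would make things worse.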

Thank you!

Edit: looks like we have these mixed encodings all over the place.

Search results for "%â€%" at least one of the words:
3 matches in announcement_settings
9 matches in author_settings
30 matches in citations
9 matches in comments
10 matches in controlled_vocab_entry_settings
1143 matches in email_log
64 matches in email_templates_default_data
15 matches in email_templates_settings
41 matches in event_log_settings  
3 matches in issue_settings
25 matches in journal_settings
1 match in navigation_menu_item_settings
208 matches in notes
3 matches in publication_galleys
598 matches in publication_settings
21 matches in rt_searches
2 matches in section_settings
272 matches in submission_comments
36 matches in submission_file_settings
7967 matches in submission_search_keyword_list
4 matches in user_settings
Total: 10464 matches

Wilson

Hi @wilsonw,

Can you please create a new post with your issue (and link back to this one, if you'd like)? This is quite an old post, and creating a new one will help us keep the forum organized.

Thank you,

Roger
PKP Team