[OJS 2.4.5] Encoding problem exporting users

Hi,

We have an OJS 2.4.5 (ISO-8859-1) which we want to export its users to a 2.4.8-1 (which i believe its using utf8 but it is not relevant), and the exported xml from 2.4.5 comes with some problems like:

<?xml version="1.0" encoding="UTF-8"?>
.
.
.
<affiliation locale="pt_BR">Universidade Tecnol�gica Federal do Paran�</affiliation>
.
.
.
<first_name>Ad�o</first_name>
.
.
.

And a bunch of other � that truncates user data when imported.

Here’s the config.inc.php relevant part:

[i18n]

; Default locale
locale = pt_BR

; Client output/input character set
client_charset = ISO-8859-1

; Database connection character set
; Must be set to "Off" if not supported by the database server
; If enabled, must be the same character set as "client_charset"
; (although the actual name may differ slightly depending on the server)
connection_charset = ISO-8859-1

; Database storage character set
; Must be set to "Off" if not supported by the database server
database_charset = ISO-8859-1

; Enable character normalization to utf-8 (recommended)
; If disabled, strings will be passed through in their native encoding
; Note that client_charset and database collation must be set
; to "utf-8" for this to work, as characters are stored in utf-8
charset_normalization = ISO-8859-1
mysql> SHOW FULL COLUMNS from users;
+----------------------+--------------+-------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field                | Type         | Collation         | Null | Key | Default | Extra          | Privileges                      | Comment |
+----------------------+--------------+-------------------+------+-----+---------+----------------+---------------------------------+---------+
| user_id              | bigint(20)   | NULL              | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| username             | varchar(32)  | latin1_swedish_ci | NO   | UNI | NULL    |                | select,insert,update,references |         |
| password             | varchar(40)  | latin1_swedish_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| first_name           | varchar(40)  | latin1_swedish_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| middle_name          | varchar(40)  | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| last_name            | varchar(90)  | latin1_swedish_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| initials             | varchar(5)   | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| email                | varchar(90)  | latin1_swedish_ci | NO   | UNI | NULL    |                | select,insert,update,references |         |
| url                  | varchar(255) | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| phone                | varchar(24)  | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| fax                  | varchar(24)  | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| mailing_address      | varchar(255) | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| country              | varchar(90)  | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| locales              | varchar(255) | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| date_registered      | datetime     | NULL              | NO   |     | NULL    |                | select,insert,update,references |         |
| date_last_login      | datetime     | NULL              | NO   |     | NULL    |                | select,insert,update,references |         |
| must_change_password | tinyint(4)   | NULL              | YES  |     | NULL    |                | select,insert,update,references |         |
| auth_id              | bigint(20)   | NULL              | YES  |     | NULL    |                | select,insert,update,references |         |
| disabled             | tinyint(4)   | NULL              | NO   |     | 0       |                | select,insert,update,references |         |
| disabled_reason      | text         | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| salutation           | varchar(40)  | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| gender               | varchar(1)   | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| date_last_email      | datetime     | NULL              | YES  |     | NULL    |                | select,insert,update,references |         |
| date_validated       | datetime     | NULL              | YES  |     | NULL    |                | select,insert,update,references |         |
| auth_str             | varchar(255) | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| suffix               | varchar(40)  | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| billing_address      | varchar(255) | latin1_swedish_ci | YES  |     | NULL    |                | select,insert,update,references |         |
| inline_help          | tinyint(4)   | NULL              | YES  |     | NULL    |                | select,insert,update,references |         |
+----------------------+--------------+-------------------+------+-----+---------+----------------+---------------------------------+---------+
28 rows in set (0.00 sec)

I’m not really experienced in debugging encoding problems. So i would be glad if someone could help me.
Is there any more information i could give in order to help finding what’s the problem?

Thanks for your attention

Hi @samueloph,

I would strongly suggest converting everything over to UTF-8 – I haven’t even worked with a Latin1 install for years.

To do this you’ll need to transcode your database. Assuming that everything is consistent in your database in a Latin1 encoding (this may or may not be the case), you can run mysqldump through iconv to convert all the characters over to UTF-8. Then load the converted dump into another empty database – you can use this for testing purposes before performing the final conversion.

Then, adjust your config.inc.php to use the new database (or work with another copy of the code for testing purposes). Just be sure that you note a minor quirk – client_charset should be utf-8, and connection_charset and database_charset should both be utf8 (note the dash in the first).

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher, thanks for the hint.

I’m gonna try to convert a backup’ed 2.4.5 instance to see if there’s any problem.

But regarding the exportation, isn’t it a bug to export truncated data as utf-8? I think there’s more people using ISO-8859-1. I mean, shouldn’t the export plugin convert this data (iconv) while generating the xml?

Regards.

Hi @samueloph,

You’re likely correct that we’re assuming that the data is in UTF8, but that has become such a (welcome) standard that I suspect we’d drop support for other configurations rather than test and fix mixed approaches. There really isn’t a downside to using UTF8.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,

I totally agree with using UTF-8, in fact i have no idea why some people configure it to use ISO-8859-1, i guess most of the time it’s just someone who doesn’t know what are the side effects, and OJS (at least 2) doesn’t defaults to “charset_normalization” besides having the “recommended” flag commented. Maybe it’s time to change the defaults? (if it hasn’t already changed on OJS3).

OJS didn’t dropped support for ISO-8859-1 yet, if things like exporting/importing users won’t deal with this kind of things, some warnings should be made on config.inc.php and on the installation wizard to explain that OJS only has “full-support” for UTF-8, right?
Ideally exportation of data should work (should be fixed), since support for other encodings were already advertised, but i won’t be able to make a PR :frowning:

The others import/export plugins also suffers from this (eg. article metadata).