Database error after upgrade to Mariadb v 10.2

Original post is here: http://pkp.sfu.ca/support/forum/viewtopic.php?f=8&t=14435

Content of the original post:
Hi at PKP,

we have the following issue:
We upgraded our hardware and databases. Database was upgraded from MySQL to MariaDB version 10.2.
Having successfully restored the data base and after some minor UTF8 corrections we have the following problem:

Filling a form (reviewers comments, article title, article abstract etc.) with
special characters (especially from the eastern European language codes;
sample text: Jósvafő paleo-polje: morphology and relation to the
landform evolution of Aggtelek) results in an error message:

DB Error: Incorrect string value: ‘\xC5\x91 pal…’ for column ‘comments’ at row 1

Is this a known issue and is there a solution? This problem becomes more
and more dramatic as we have many authors and papers from eastern Europe
whow now fail to upload content to ojs.


		Re: Database error after upgrade to Mariadb v 10.2
		by asmecher » Wed Apr 01, 2015 5:30 pm 

		Hi Andreas,

What are your
connection_charset and database_charset in config.inc.php? Can you check
your database itself to see what the table encodings are set to?

Regards,
Alec Smecher
Public Knowledge Project Team


Reply dated May 27:

Dear Alec,

as I seem to be unable to post a reply to your
answer, please find my comments below. Sorry it took a while, but we had
to check our systems…

I looked at the default encoding of the DATABASE and it turns out to
be UTF-8:

MariaDB [zfgojs]> show variables like “character_set_database”;
±-----------------------±------+
| Variable_name | Value |
±-----------------------±------+
| character_set_database | utf8 |
±-----------------------±------+
1 row in set (0.00 sec)

The tables however, are encoded in latin1:

MariaDB [zfgojs]>
MariaDB [zfgojs]> SELECT table_name, CCSA.character_set_name FROM
information_schema.TABLES T,
information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE
CCSA.collation_name = T.table_collation AND T.table_schema = “zfgojs”;
±--------------------------------------±-------------------+ |
table_name | character_set_name |
±--------------------------------------±-------------------+ |
access_keys | latin1 | |
announcement_settings | latin1 | |
announcement_type_settings | latin1 | |
announcement_types | latin1 | |
announcements | latin1 | |
article_comments | latin1 | |
article_files | latin1 | |
article_galley_settings | latin1 | |
article_galleys | latin1 | |
article_html_galley_images | latin1 | |
article_notes | latin1 | |
article_search_keyword_list | latin1 | |
article_search_object_keywords | latin1 | |
article_search_objects | latin1 | |
article_settings | latin1 | |
article_supp_file_settings | latin1 | |
article_supplementary_files | latin1 | |
article_xml_galleys | latin1 | | articles
| latin1 | | auth_sources
| latin1 | | author_settings
| latin1 | | authors | latin1
| | books_for_review | latin1
| | books_for_review_authors | latin1 | |
books_for_review_settings | latin1 | | captchas
| latin1 | | citation_settings
| latin1 | | citations
| latin1 | | comments | latin1
| | completed_payments | latin1
| | controlled_vocab_entries | latin1 | |
controlled_vocab_entry_settings | latin1 | |
controlled_vocabs | latin1 | |
counter_monthly_log | latin1 | |
custom_issue_orders | latin1 | |
custom_section_orders | latin1 | |
data_object_tombstone_oai_set_objects | latin1 | |
data_object_tombstone_settings | latin1 | |
data_object_tombstones | latin1 | |
edit_assignments | latin1 | |
edit_decisions | latin1 | | email_log
| latin1 | | email_log_users
| latin1 | | email_templates
| latin1 | | email_templates_data | latin1
| | email_templates_default | latin1
| | email_templates_default_data | latin1 | |
event_log | latin1 | |
event_log_settings | latin1 | |
external_feed_settings | latin1 | |
external_feeds | latin1 | |
filter_groups | latin1 | |
filter_settings | latin1 | | filters
| latin1 | | gifts
| latin1 | | group_memberships
| latin1 | | group_settings | latin1
| | groups | latin1
| | institutional_subscription_ip | latin1 | |
institutional_subscriptions | latin1 | | issue_files
| latin1 | | issue_galley_settings
| latin1 | | issue_galleys
| latin1 | | issue_settings | latin1
| | issues | latin1
| | journal_settings | latin1 | |
journals | latin1 | |
metadata_description_settings | latin1 | |
metadata_descriptions | latin1 | | mutex
| latin1 | | notes
| latin1 | | notification_mail_list
| latin1 | | notification_settings | latin1
| | notification_subscription_settings | latin1
| | notifications | latin1 | |
oai_resumption_tokens | latin1 | |
paypal_transactions | latin1 | |
plugin_settings | latin1 | | processes
| latin1 | | published_articles
| latin1 | | queued_payments
| latin1 | | referral_settings | latin1
| | referrals | latin1
| | review_assignments | latin1 | |
review_form_element_settings | latin1 | |
review_form_elements | latin1 | |
review_form_responses | latin1 | |
review_form_settings | latin1 | |
review_forms | latin1 | |
review_rounds | latin1 | | roles
| latin1 | | rt_contexts
| latin1 | | rt_searches
| latin1 | | rt_versions | latin1
| | scheduled_tasks | latin1
| | section_editors | latin1 | |
section_settings | latin1 | | sections
| latin1 | | sessions
| latin1 | | signoffs
| latin1 | | site | latin1
| | site_settings | latin1
| | static_page_settings | latin1 | |
static_pages | latin1 | |
subscription_type_settings | latin1 | |
subscription_types | latin1 | |
subscriptions | latin1 | |
temporary_files | latin1 | | theses
| latin1 | | timed_views_log
| latin1 | | user_interests
| latin1 | | user_settings | latin1
| | users | latin1
| | versions | latin1 |
±--------------------------------------±-------------------+ 114 rows in
set (0.01 sec)

MariaDB [zfgojs]>

My question hence:

Is it okay to convert all tables, wholesale, to utf8, using:

ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8 COLLATE
utf8_general_ci;

without breaking anything?

Thanks for your help, please advise.

Please carefully read the documentation here:
https://dev.mysql.com/doc/refman/5.1/en/alter-table.html#idm140661332563136
especially the warning:

The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8).

The answer to your question will depend on:

  • What character set your original tables used (in mysql and in config.inc.php)
  • What “minor UTF8 corrections” you made in the meantime
  • What character set your config.inc.php is currently using, and whether new data has been added with that character set against the mysql latin1 configuration.

We have left the settings in config.inc.php untouched in the course of the
transfer to Mariadb:

Old system (mysql):
config.inc.php:
client_charset = utf-8
connection_charset = Off
database_charset = Off
charset_normalization = Off

The tables on the old system are encoded in latin1:
mysql> SELECT CCSA.*,table_name, T.table_collation,
CCSA.character_set_name FROM information_schema.TABLES T,
information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE
CCSA.collation_name = T.table_collation AND T.table_schema = “zfgojs”;
±------------------±-------------------±-------------------------------
-------±------------------±-------------------+ | COLLATION_NAME |
CHARACTER_SET_NAME | table_name |
table_collation | character_set_name |
±------------------±-------------------±-------------------------------
-------±------------------±-------------------+ | latin1_swedish_ci |
latin1 | access_keys |
latin1_swedish_ci | latin1 | | latin1_swedish_ci | latin1
| announcement_settings | latin1_swedish_ci |
latin1 | | latin1_swedish_ci | latin1 |
announcement_type_settings | latin1_swedish_ci | latin1
| …[more tables in latin1] …

New system (mariadb 10):
config.inc.ph:
client_charset = utf-8
connection_charset = Off
database_charset = Off
charset_normalization = Off

Mariadb Tables are still in latin1:
MariaDB [(none)]> SELECT CCSA.*,table_name, T.table_collation,
CCSA.character_set_name FROM information_schema.TABLES T,
information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE
CCSA.collation_name = T.table_collation AND T.table_schema = “zfgojs”;
±------------------±-------------------±-------------------------------
-------±------------------±-------------------+ | COLLATION_NAME |
CHARACTER_SET_NAME | table_name |
table_collation | character_set_name |
±------------------±-------------------±-------------------------------
-------±------------------±-------------------+ | latin1_swedish_ci |
latin1 | access_keys |
latin1_swedish_ci | latin1 | | latin1_swedish_ci | latin1
| announcement_settings | latin1_swedish_ci |
latin1 | | latin1_swedish_ci | latin1 |
announcement_type_settings | latin1_swedish_ci | latin1
| …[more tables in latin1] …

So it seems from what you suggested and the data above that the OJS
tables are still encoded in latin1 and that the database ‘polices’
this as soon as an attempt is made to insert non-latin1 (= utf8
characters unavailable in latin1) with an error.

Planned conversion procedure on the new system:

  1. dump the database in utf8:
    mysqldump -u${USER} -p${PWD} -c -e --default-character-set=utf8
    –single-transaction --skip-set-charset --add-drop-database -B ${DB} >
    dumpfile.sql

  2. fix dumpfile (create database statement) by changing character set
    statements to utf8: cat dumpfile.sql | sed ‘s/DEFAULT CHARACTER SET
    latin1/DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci/’ | sed
    ‘s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/’ > dumpfile-converted.sql

  3. restore database: mysql -u${USER} -p${PWD} < dumpfile-converted.sql

  4. CHANGE THE SETTINGS in config.inc.ph to:
    client_charset = utf-8
    connection_charset = utf-8
    database_charset = utf-8
    charset_normalization = On

Can you confirm this or do you have other suggestions ?

Thanks for your support

Sorry for the slow reply. I have been out this past week on holiday.

Your description of the problem sounds correct to me.

Your solution makes sense to me, but I’m just another set of eyes. I have not done this type of character set correction before.

Keep a backup and let us know here if it works!

Here is the solution for this problem:

Situation: old latin1 database from mysql 5.xx

Need: everything in utf-8 to avoid problems with utf-8 characters in
reviews etc. (mariadb 10.2.xx):

Suggested procedure…

  1. Dump database using mysqldump, wit utf8 as default character set.
  2. use sed to massage remaining charset settings into utf8
  3. drop database
  4. recreate database with charset=utf8;
  5. SET NAMES to utf8
  6. restore massaged dump (source)

mysqldump -u root -p -c -e --default-character-set=utf8
–single-transaction --skip-set-charset --add-drop-database -B
DATA_BASE_NAME > DATA_BASE_NAME.sql

sed ‘s/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8 COLLATE
utf8_general_ci/’ < DATA_BASE_NAME.sql | sed ‘s/DEFAULT
CHARSET=latin1/DEFAULT CHARSET=utf8/’ > DATA_BASE_NAME_converted.sql

to check what sed did:
diff DATA_BASE_NAME.sql DATA_BASE_NAME_converted.sql

mysql -u root -p

drop database DATA_BASE_NAME;
create database DATA_BASE_NAME charset=utf8;
use DATA_BASE_NAME;
SET NAMES ‘utf8’;
source DATA_BASE_NAME_converted.sql
quit;

Adjust settings of ojs’ config.inc.php accordingly:
client_charset = utf-8
connection_charset = utf-8
database_charset = utf-8
charset_normalization = On

Good luck
Andreas

Great job, @ndaustue! Thanks for the detailed update.

I wonder why there would have been DEFAULT CHARACTER SET commands still specifying latin1 in the dumpfile if the mysqldump --default-character-set option actually performs any character recoding. That seems odd.