[OJS-3.5.0] character encoding problem: sequence of \0 characters in MySQL dump

Describe the issue or problem
After upgrading from OJS 3.4.0 to OJS 3.5.0, we noticed that some form fields with user input contained nothing but illegible characters. This seemed limited to the review_form_responses table. In the OJS website, viewing such a review response produced the following output:

In the SQL dump, this appeared to be a series of null characters:

(260,15,46,'string','\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\.....')

When I first noticed this, I was able to restore the correct values from a pre-migration dump, after which the display in the OJS website was completely fixed.

After that, for a subsequent upgrade to OJS 3.5.0-3, I dumped the database again and restored that dump.

Yesterday I stumbled on a fresh case: two reviews submitted after the dump containing my earlier fix. Fortunately, I was able to retrieve the correct values from an intermediate SQL dump and a snapshot of the db files, so the damage has been repaired.

Still, I want to find the cause and avoid this at all costs in the future. I think it has to do with creating and/or restoring dumps, given the following observations:

  • the intermediary dump (taken after the first fix but before the latest corruption) didn’t contain any \0 characters at all, not even for one new review that did get corrupted in the latest dump
  • the latest dump (taken after the latest corruption) contained exactly the same fields consisting entirely of \0 characters, in exactly the same table as the first corrupt dump, plus 2 newly corrupted review responses

Since the intermediary SQL dump contained correct values for a review that got mangled in the latest dump, I suspect the problem doesn’t lie in character encoding when OJS stores the review forms in the review_form_responses table, but rather in the process of creating and/or restoring SQL dumps.

I realize this is probably a broader MySQL character encoding issue, but since the db is administered within an OJS context, I hope to get some remediation here. I’ll provide details of the character encoding settings under “Additional information” below.

Steps I took leading up to the issue

  1. create a SQL dump from the db
  2. import that SQL dump into the db

I haven’t checked where exactly the corruption occurs, unfortunately, but will keep a close eye on it in the future. After my latest fix the dump remains OK, so perhaps it’s step 2 that’s causing the issue.

I’m creating the dump with following command:

mysqldump --all-databases -u $OJS_DB_USER -p$OJS_DB_PASSWORD $OJS_DB_NAME > ojs_dump.sql

I’m importing the dump with following command:

mysql -u $OJS_DB_USER -p$OJS_DB_PASSWORD $OJS_DB_NAME < ojs_dump.sql

Could it be the case that these commands need an explicit encoding parameter, such as `--default-character-set=utf8mb4`?
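For reference, a sketch of the same two commands with the character set pinned explicitly (`--default-character-set` is a standard option of both clients; shown here for a single database, so without `--all-databases` — whether this helps is exactly the open question):

```shell
mysqldump --default-character-set=utf8mb4 -u $OJS_DB_USER -p$OJS_DB_PASSWORD $OJS_DB_NAME > ojs_dump.sql
mysql --default-character-set=utf8mb4 -u $OJS_DB_USER -p$OJS_DB_PASSWORD $OJS_DB_NAME < ojs_dump.sql
```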

What application are you using?
OJS 3.5.0-3

Additional information
Character encoding settings in the docker-compose.yml file:

  ojsdb:
    image: mariadb:10
    container_name: ojsdb
    command: --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci

Character encoding settings in the config/config.inc.php file:

[i18n]
client_charset=utf-8
database_charset=Off
connection_charset=utf8mb4
locale=en

[database]
driver=mysqli
connection_charset=utf8mb4
collation=utf8mb4_unicode_ci

Character encoding settings in the SQL dump file:

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `ojs` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */;

--
-- Table structure for table `review_form_responses`
--

DROP TABLE IF EXISTS `review_form_responses`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `review_form_responses` (
  `review_form_response_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `review_form_element_id` bigint(20) NOT NULL,
  `review_id` bigint(20) NOT NULL,
  `response_type` varchar(6) DEFAULT NULL,
  `response_value` text DEFAULT NULL,
  PRIMARY KEY (`review_form_response_id`),
  KEY `review_form_responses_unique` (`review_form_element_id`,`review_id`),
  KEY `review_form_responses_review_id` (`review_id`),
  CONSTRAINT `review_form_responses_review_form_element_id_foreign` FOREIGN KEY (`review_form_element_id`) REFERENCES `review_form_elements` (`review_form_element_id`) ON DELETE CASCADE,
  CONSTRAINT `review_form_responses_review_id_foreign` FOREIGN KEY (`review_id`) REFERENCES `review_assignments` (`review_id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3778 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;

Further info: I’ve been able to test with a development version of the database, and suspect it’s the import that is causing, or at least triggering, the corruption.

  1. I start from a cleaned-up version of the SQL dump, itself encoded as UTF-8
  2. I import that dump into the development version with mysql -u $OJS_DB_USER -p$OJS_DB_PASSWORD $OJS_DB_NAME < ojs_dump.sql
  3. The data in review_form_responses is corrupted again

For the export and import of the dump, it doesn’t matter whether I specify `--default-character-set=utf8mb4`.

Inspecting the db charset and collation, I’m seeing subtle differences, but I haven’t found a way to get everything to utf8mb4:

  • MariaDB [ojs]> SHOW VARIABLES LIKE 'character_set%';
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8mb3                    |
    | character_set_connection | utf8mb3                    |
    | character_set_database   | utf8mb4                    |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8mb3                    |
    | character_set_server     | utf8mb4                    |
    | character_set_system     | utf8mb3                    |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    8 rows in set (0.002 sec)
    
  • MariaDB [ojs]> SHOW VARIABLES LIKE 'collation%';
    +----------------------+--------------------+
    | Variable_name        | Value              |
    +----------------------+--------------------+
    | collation_connection | utf8mb3_general_ci |
    | collation_database   | utf8mb4_unicode_ci |
    | collation_server     | utf8mb4_unicode_ci |
    +----------------------+--------------------+
    
  • MariaDB [ojs]> SHOW CREATE DATABASE ojs;
    +----------+---------------------------------------------------------------------------------------------+
    | Database | Create Database                                                                             |
    +----------+---------------------------------------------------------------------------------------------+
    | ojs      | CREATE DATABASE `ojs` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */ |
    +----------+---------------------------------------------------------------------------------------------+
    1 row in set (0.000 sec)
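If the utf8mb3 session values above are a concern, the connection defaults can be overridden per session before inspecting or importing (a sketch using the standard SET NAMES statement; it affects only the current connection):

```sql
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Re-check: character_set_client/connection/results should now be utf8mb4
SHOW VARIABLES LIKE 'character_set%';
```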
    

So my question remains: what could be wrong, and why only with this table (which looks OK in the fixed dump)? A current workaround when exporting/importing the db would be to:

  1. create a dump
  2. upload the dump
  3. create a fresh dump
  4. check for \0 occurrences
  5. look up the uncorrupted values in the dump of step 1
  6. convert those rows to SQL update statements
  7. apply these SQL update statements to the db created in step 2

I’m hoping there must be a more solid way to fix this!
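In the meantime, step 4 of that workaround can be scripted: mysqldump escapes NUL bytes as the two-character sequence \0, so a run of them in the dump is greppable. A sketch (ojs_dump.sql is the dump from step 1):

```shell
# Flag dump lines containing runs of the \0 escape that mysqldump
# uses for NUL bytes; three or more in a row signals the corruption.
if grep -qE '(\\0){3,}' ojs_dump.sql 2>/dev/null; then
    echo "possible \\0 corruption found"
else
    echo "no \\0 runs found"
fi
```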

I’ve trimmed it down even more: when the SQL statements for creating and populating the problematic table are isolated in a separate file, that table can be imported correctly!

So:

  1. create a dump
  2. copy the SQL statements for creating and populating the review_form_responses table into a separate SQL file
  3. restore the full dump from step 1
    (now \0 values are introduced in rows in review_form_responses)
  4. restore the review_form_responses table from the separate SQL file from step 2
    (now everything is imported correctly)
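Step 2 can be automated using the comment banners mysqldump writes before each table. A sketch with a miniature stand-in dump (the real input would be the full dump file):

```shell
# Miniature stand-in for a full dump, with three table sections:
cat > ojs_dump.sql <<'EOF'
-- Table structure for table `review_assignments`
CREATE TABLE `review_assignments` (review_id int);
-- Table structure for table `review_form_responses`
CREATE TABLE `review_form_responses` (review_form_response_id int);
INSERT INTO `review_form_responses` VALUES (260);
-- Table structure for table `users`
CREATE TABLE `users` (user_id int);
EOF

# Print only the review_form_responses section: switch output on at its
# banner, off again at the next table's banner.
awk '/^-- Table structure for table/ { on = ($0 ~ /review_form_responses/) } on' \
    ojs_dump.sql > review_form_responses.sql
```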

I don’t have any clue what’s causing this difference!

Hi @rvdb,

As you’ve gathered, this seems to be a MySQL/backup/restore thing, and it’s hard to know exactly what’s happening here, but –

Years ago, the MySQL default character set changed from Latin1 to UTF8. If you didn’t create your database specifying UTF8, but told OJS to use UTF8, then OJS would happily load UTF8-encoded data into a Latin1-encoded database. Because OJS was doing both the loading and fetching, in this circumstance nothing would seem amiss.

However, after the MySQL default character set changed, creating a new database the same way would result in a UTF8 database, where previously it would’ve created a Latin1 database. Loading the old data into that database would result in character sets getting scrambled.

Maybe that’s what you’re encountering here, maybe not – but either way I would recommend taking a look at the database dump with a trustworthy text editor (I use vim, but there are plenty of options). You might find that your terminal emulator or text editor is trying to be helpful and hiding a character set encoding issue in the dump from you.

mysqldump puts some global configuration, including character set information, at the top of its output; if you’re seeing different behaviour when you split the file into different tables, it’s probably the presence/absence of that header that’s causing the discrepancy.
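For context, that header is a block of versioned comments near the very top of the dump; with a utf8mb4 session it typically looks like this (the exact charset value depends on the client settings at dump time):

```sql
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
```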

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks, @asmecher . Meanwhile I’ve been able to pinpoint the exact trigger for the corruption when loading just that single review_form_responses table from SQL: it’s this pair of FOREIGN_KEY_CHECKS instructions at the start and end of the SQL:

-- in the SQL header:
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;

-- in the SQL footer:
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;

  • When these instructions are present in the SQL, \0 values are imported into the table.
  • If I remove these instructions, the import succeeds without any data corruption.

Nothing else is changed whatsoever, and it seems the character encoding is just fine: when dumping that single table from a non-corrupted database instance, everything’s looking good, all is encoded as UTF-8 in the dump file (according to EditPad Lite).

Just adding these instructions causes the hiccups when loading the SQL.
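As a stopgap, those two statements can be stripped from a dump before loading it. A sketch with a miniature stand-in file:

```shell
# Miniature stand-in for the dump header/body/footer:
cat > ojs_dump.sql <<'EOF'
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
INSERT INTO `review_form_responses` VALUES (260,15,46,'string','a response');
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
EOF

# Drop only the FOREIGN_KEY_CHECKS statements; everything else passes through.
grep -v 'FOREIGN_KEY_CHECKS' ojs_dump.sql > ojs_dump_nofk.sql
```

Bear in mind that importing without these statements leaves foreign key checks enabled, so referenced tables must be loaded before the tables that point to them.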

Could that suggest some kind of OJS migration issue after all, with foreign keys that aren’t in place?

Best,

Ron

Hi @rvdb,

That the presence or absence of referential integrity checks during a dump/reload cycle should matter makes no sense to me, considering that it’s a string column (response_value) that’s getting corrupted. I’d see if you can isolate the corruption outside of the OJS environment – I don’t think OJS is causing this.

Thanks,
Alec Smecher
Public Knowledge Project Team

Thanks for your thoughts, @asmecher . It took me quite some hours of sleep and bewildered attempts at pinpointing the exact place in the dump that triggered seemingly unrelated (and silent) errors elsewhere in the table, and at finding at least some character encoding fault at that point in the dump (none found, eventually). But I’ve finally found a solution: upgrading the mariadb image in my docker-compose.yml file to 10.11, whose initdb script doesn’t choke on startup parameters that contain whitespace (see below).

This allowed me to maximize control over the startup charset and collation settings with the following parameters:

character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init-connect = 'SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
character-set-client-handshake = FALSE
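For reference, one way to express those four settings in the docker-compose.yml shown under “Additional information” (a sketch; the list form of command keeps the whitespace-containing init-connect value intact as a single argument):

```yaml
  ojsdb:
    image: mariadb:10.11
    container_name: ojsdb
    command:
      - --character-set-server=utf8mb4
      - --collation-server=utf8mb4_unicode_ci
      - --init-connect=SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci
      - --character-set-client-handshake=FALSE
```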

Now everything’s utf8mb4, and with these settings and this version of mariadb, the exact same dump that produced the null strings earlier loads without mangling the string columns. I’ve checked a fresh dump after that import and couldn’t find any \0 sequences. I’ve repeated that export/import cycle a couple of times and everything still looks OK, so I consider this fixed.

Best,

Ron