Resolving integrity constraint violation: duplicate entry for submission_search_keyword_text

@jnugent thank you for your help and suggestion to use the search index rebuild tool. After TRUNCATE’ing the three submission_search_* tables and running php tools/rebuildSearchIndex.php I realized the rebuild tool crashes after running into the issue of a duplicate entry key violation. I tried this multiple times but the issue persists.

PHP Fatal error: Uncaught PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry ‘covid’ for key ‘submission_search_keyword_text’

My first thought was this could be related to my post-mojibake cleanup. Your post Resolving charset encoding mix-ups / mojibake - #6 by jnugent was my last lead. My submission_search_* is utf8_general_ci but the majority of the other tables including submission_* and *_settings are latin1_swedish_ci. I changed the relevant ones you described to utf8_general_ci but the problem persists.

Screen Shot 2021-06-23 at 9.37.00 AM

My second thought, since our editors are stuck publishing the issue (500 server error on generating submission keywords), was to unschedule the publication in question with the covid keyword. After that, the publication process completed successfully (no error generating submission keywords).

I’m slightly unconvinced that the specific COVID submission is the problem, but would like your thoughts. Is it more likely to be the mojibake remnants and collation incompatibilities, or a specific metadata field that might be wrecking that single COVID submission?

PS. Running the index rebuild with the article unpublished still runs into the fatal error. Here’s the full stack trace output from the command line.

PHP Fatal error: Uncaught PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry ‘covid’ for key ‘submission_search_keyword_text’ in lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:115
Stack trace:
#0 lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php(115): PDOStatement->execute()
#1 lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(489): Doctrine\DBAL\Driver\PDOStatement->execute()
#2 lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(664): Illuminate\Database\Connection->Illuminate\Database{closure}()
#3 lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(631): Illuminate\Database\Connection->runQueryCallback()
#4 lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(496): Illuminate\Database\Connection->run()
#5 lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php on line 671

Hi @wilsonw,

Can you please include the version of OJS in your post? This is helpful if staff or community members are to assist in helping you to resolve your issue.

-Roger
PKP Team

Will edit, thanks! We’re on 3.3.0.5

Edit: er, I guess I can’t edit the topic post anymore.

Hi,

we experience a similar error on our journal.

PHP Fatal error: Uncaught PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry ‘???’ for key ‘submission_search_keyword_text’ in /var/www/virtual/ojs.chimia.ch/htdocs/ojs/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:115

Stack trace:

#0 /var/www/virtual/ojs.chimia.ch/htdocs/ojs/lib/pkp/lib/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php(115): PDOStatement->execute(NULL)

#1 /var/www/virtual/ojs.chimia.ch/htdocs/ojs/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(489): Doctrine\DBAL\Driver\PDOStatement->execute()

#2 /var/www/virtual/ojs.chimia.ch/htdocs/ojs/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(664): Illuminate\Database\Connection->Illuminate\Database{closure}(‘INSERT INTO sub…’, Array)

#3 /var/www/virtual/ojs.chimia.ch/htdocs/ojs/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php(631): Illuminate\Database in /var/www/virtual/ojs.chimia.ch/htdocs/ojs/lib/pkp/lib/vendor/laravel/framework/src/Illuminate/Database/Connection.php on line 671

OJS 3.3.0-8
MySQL database
Collation is utf8_general_ci

I’m trying to find the file that creates this keyword, this is what I have from submission_search_keyword_list:

| 77790 | ???????????????????????????????????????????????????????????? |

The corresponding object would be:

select * from submission_search_object_keywords where keyword_id=77790;

+-----------+------------+------+
| object_id | keyword_id | pos  |
+-----------+------------+------+
|      5135 |      77790 | 1051 |
+-----------+------------+------+

Now how do I get the corresponding file?

It looks like the keyword with id 77790 must be a very long string that was cut. Can those be filtered out by the indexing routine?

The keyword length is limited by 60 characters:

show full columns from submission_search_keyword_list ;

+--------------+-------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field        | Type        | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
+--------------+-------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+

| keyword_id | bigint(20) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| keyword_text | varchar(60) | utf8_general_ci | NO | UNI | NULL | | select,insert,update,references | |
±-------------±------------±----------------±-----±----±--------±---------------±--------------------------------±--------+

Thank you for pointers,

Martin

@asmecher @rcgillis Warming up this old issue … I had debugged for quite a while and found the reason for the Integrity constraint violation.

First, the error always happened with the same submissions. Usually, these were with PDFs that had some typeset mathematical formulas.

The reason is that the pdftotext utility (that can be configured in config.inc.php) delivers the extracted text in UTF-8 encoding, but the extracted terms may not only be uft8_general_ci (which is 3-byte UTF-8, as I understand, and will be deprecated in more recent MySQL versions), but also utf8mb4 (fixed 4-byte UTF-8). Actually, one may get a mixture of both character types.
Simply changing the collation of the submission_search_keyword_list table to utf8mb4 won’t help, because the collation of the remaining tables still requires utf8_general_ci, as the OJS schema is set up.

Actually, my solution was to filter out the utf8mb4 characters (which, in my opinion, bring no added value to the index) in the terms that get indexed.

I modified classes/search/ArticleSearchIndex.inc.php as follows:

    protected function _indexObjectKeywords($objectId, $text, &$position) {
            $searchDao = DAORegistry::getDAO('ArticleSearchDAO'); /* @var $searchDao ArticleSearchDAO */
            $keywords = $this->filterKeywords($text);
            for ($i = 0, $count = count($keywords); $i < $count; $i++) {
                    // CHIMIA CHANGE CHIMIA-12 2022/01/22/mb Substitute utf8mb4 characters
                    $keyword_subst = $this->_replace4byte($keywords[$i]);
                    if ($searchDao->insertObjectKeyword($objectId, $keyword_subst, $position) !== null) {
                            $position += 1;
                    }
                    // END CHIMIA CHANGE
            }
    }
    
    // CHIMIA CHANGE CHIMIA-12 2022/01/22/mb Substitute utf8mb4 characters
    // from https://stackoverflow.com/questions/16496554/can-php-detect-4-byte-encoded-utf8-chars
    /**
     * Detect 4byte utf8 chars and substitute them.
     * @param $string string
     * @param replacement string optional
     */

    protected function _replace4byte($string, $replacement = '') {
            return preg_replace('%(?:
                    \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
                    | [\xF1-\xF3][\x80-\xBF]{3}        # planes 4-15
                    | \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
            )%xs', $replacement, $string);
    }
    // END CHIMIA CHANGE

With this modification, all the PDFs were indexed fine.