OAI-PMH duplicate items in 2.4.x

,

Hi,

We have recently identified a bug/issue on the OAI-PMH interface of OJS 2.4.x (tested in 2.4.6 and 2.4.8), related to the duplication of OAI items in different resumptionTokens.

This issue has been identified because we harvest on a search portal many OJS journals, and after some accurate analysis of the items of the journal, the active items on the OAI and the harvested items, we found some differences.

Also, the error is not constant, but random. The duplicate items are sometimes different and appears on different resumptions tokens. In one test, the duplicate items appears on the first and last resumptionTokens. Also, this didn’t affect all journals of the same OJS installation.
We didn’t find any errors on the log files.

To exemplify, we have listed the first 400 items of the OAI-PMH interface and put everything on a spreadsheet and identified the duplicates on these two tests.

On the first one, we identify 2 duplicate items (oai:ojs.revistas.rcaap.pt:article/1269 and oai:ojs.revistas.rcaap.pt:article/2644), on the second test, wie found these 2 but with many more duplicates. Find here the 2 files for the 2 tests and the duplicates in red on the last sheet: https://drive.google.com/file/d/19Vlnxd6429DXNZ7Ellb525p6qQQSCqkV/view?usp=sharing

Here are the links for the journal and OAI:
http://revistas.rcaap.pt/app/oai
http://revistas.rcaap.pt/app/

So it seems that the creation of the OAI is based on some kind of random list or cache. Do you have any idea of what can be causing this problem?

Regards,

José Carvalho

Hi @josekarvalho,

I recently had something similar reported by another user but haven’t been able to replicate the behavior myself.

My best guess is that the cause is something like this:

OJS responds to OAI requests by executing something like…

SELECT abc FROM xyz LIMIT 200

…and then subsequent requests for resumption tokens using e.g…

SELECT abc FROM xyz LIMIT 200 OFFSET 200

…then OFFSET 400, etc. The queries should be the same, excepting the OFFSET.

We don’t specify a sort order for all the JOINed tables, so it’s possible that MySQL is sometimes responding with records in a different order.

To establish whether this is happening, I’d suggest capturing the SQL query that’s being used to respond to OAI requests, and using the SQL EXPLAIN function to try to capture OJS responding to the same request with results in a different order.

If this is indeed the cause, we might need to add some extra ORDER BY clauses to force MySQL to respond with the same ordering each time. I would view this behavior as “bad citizenship” by MySQL – but I’m not sure whether it’s officially incorrect according to the SQL spec, or whether different versions of MySQL will behave differently.

Regards,
Alec Smecher
Public Knowledge Project Team

I run a few tests with database logging activated and aside de fetch for article data, there are two important queries. The first one wich fetch token data:

SELECT * FROM oai_resumption_tokens WHERE token = '69e78b7d74a57f9d65de3a57e6ed002f'

I think it will return a serialized string. And a second one wich will fetch the list of articles to display. I think the used query will be something like this:

SELECT CASE 
         WHEN COALESCE(dot.date_deleted, a.last_modified) < i.last_modified THEN 
         i.last_modified 
         ELSE COALESCE(dot.date_deleted, a.last_modified) 
       END                                        AS last_modified, 
       COALESCE(a.article_id, dot.data_object_id) AS article_id, 
       COALESCE(j.journal_id, tsoj.assoc_id)      AS journal_id, 
       COALESCE(tsos.assoc_id, s.section_id)      AS section_id, 
       i.issue_id, 
       dot.tombstone_id, 
       dot.set_spec, 
       dot.oai_identifier 
FROM   mutex m 
       LEFT JOIN published_articles pa 
              ON ( m.i = 0 ) 
       LEFT JOIN articles a 
              ON ( a.article_id = pa.article_id 
                   AND a.journal_id = 6 ) 
       LEFT JOIN issues i 
              ON ( i.issue_id = pa.issue_id ) 
       LEFT JOIN sections s 
              ON ( s.section_id = a.section_id ) 
       LEFT JOIN journals j 
              ON ( j.journal_id = a.journal_id ) 
       LEFT JOIN data_object_tombstones dot 
              ON ( m.i = 1 ) 
       LEFT JOIN data_object_tombstone_oai_set_objects tsoj 
              ON ( tsoj.tombstone_id = dot.tombstone_id 
                   AND tsoj.assoc_type = 256 
                   AND tsoj.assoc_id = 6 ) 
       LEFT JOIN data_object_tombstone_oai_set_objects tsos 
              ON tsos.assoc_id = NULL 
WHERE  ( ( s.section_id IS NOT NULL 
           AND i.published = 1 
           AND j.enabled = 1 
           AND a.status <> 0 ) 
          OR dot.data_object_id IS NOT NULL ) 
ORDER  BY journal_id 

Perhaps I’ve refered the wrong query, but as I can see, there isn’t any SQL OFFSET or LIMIT definitions. If I’m correct, each request (pagination) will return a full articles list. Probably the offsets are done in the PHP code part (using arrays) but if there aren’t any kind of SORT besides the journal_id, the articles list will always have a random SORT for every each request.
I think the problem sits on missing params like last_modified SORT on this query or on the resumption token serialized data.

Hi @paulo-graca,

Ah, I see where this is happening. I think simply changing OAIDAO’s ORDER BY statement from…

 ORDER BY journal_id

…to…

ORDER BY journal_id, COALESCE(a.article_id, dot.data_object_id)`

…should add an additional ordering that’ll maintain coherence across requests.

Would you be able to try this and see if it improves the coherence of queries split across multiple requests?

FYI, OJS3 already specifies the submission ID in the ordering, so I don’t think it suffers from this. And if you’re curious, the code that does the “pagination” PHP-side is here.

Regards,
Alec Smecher
Public Knowledge Project Team

Yes, thank you, this doesn’t occur in OJS3.