We have recently identified a bug/issue on the OAI-PMH interface of OJS 2.4.x (tested in 2.4.6 and 2.4.8), related to the duplication of OAI items in different resumptionTokens.
This issue has been identified because we harvest on a search portal many OJS journals, and after some accurate analysis of the items of the journal, the active items on the OAI and the harvested items, we found some differences.
Also, the error is not constant, but random. The duplicate items are sometimes different and appears on different resumptions tokens. In one test, the duplicate items appears on the first and last resumptionTokens. Also, this didn’t affect all journals of the same OJS installation.
We didn’t find any errors on the log files.
To exemplify, we have listed the first 400 items of the OAI-PMH interface and put everything on a spreadsheet and identified the duplicates on these two tests.
On the first one, we identify 2 duplicate items (oai:ojs.revistas.rcaap.pt:article/1269 and oai:ojs.revistas.rcaap.pt:article/2644), on the second test, wie found these 2 but with many more duplicates. Find here the 2 files for the 2 tests and the duplicates in red on the last sheet: https://drive.google.com/file/d/19Vlnxd6429DXNZ7Ellb525p6qQQSCqkV/view?usp=sharing
Here are the links for the journal and OAI:
So it seems that the creation of the OAI is based on some kind of random list or cache. Do you have any idea of what can be causing this problem?
I recently had something similar reported by another user but haven’t been able to replicate the behavior myself.
My best guess is that the cause is something like this:
OJS responds to OAI requests by executing something like…
SELECT abc FROM xyz LIMIT 200
…and then subsequent requests for resumption tokens using e.g…
SELECT abc FROM xyz LIMIT 200 OFFSET 200
OFFSET 400, etc. The queries should be the same, excepting the
We don’t specify a sort order for all the
JOINed tables, so it’s possible that MySQL is sometimes responding with records in a different order.
To establish whether this is happening, I’d suggest capturing the SQL query that’s being used to respond to OAI requests, and using the SQL
EXPLAIN function to try to capture OJS responding to the same request with results in a different order.
If this is indeed the cause, we might need to add some extra
ORDER BY clauses to force MySQL to respond with the same ordering each time. I would view this behavior as “bad citizenship” by MySQL – but I’m not sure whether it’s officially incorrect according to the SQL spec, or whether different versions of MySQL will behave differently.
Public Knowledge Project Team
I run a few tests with database logging activated and aside de fetch for article data, there are two important queries. The first one wich fetch token data:
SELECT * FROM oai_resumption_tokens WHERE token = '69e78b7d74a57f9d65de3a57e6ed002f'
I think it will return a serialized string. And a second one wich will fetch the list of articles to display. I think the used query will be something like this:
WHEN COALESCE(dot.date_deleted, a.last_modified) < i.last_modified THEN
ELSE COALESCE(dot.date_deleted, a.last_modified)
END AS last_modified,
COALESCE(a.article_id, dot.data_object_id) AS article_id,
COALESCE(j.journal_id, tsoj.assoc_id) AS journal_id,
COALESCE(tsos.assoc_id, s.section_id) AS section_id,
FROM mutex m
LEFT JOIN published_articles pa
ON ( m.i = 0 )
LEFT JOIN articles a
ON ( a.article_id = pa.article_id
AND a.journal_id = 6 )
LEFT JOIN issues i
ON ( i.issue_id = pa.issue_id )
LEFT JOIN sections s
ON ( s.section_id = a.section_id )
LEFT JOIN journals j
ON ( j.journal_id = a.journal_id )
LEFT JOIN data_object_tombstones dot
ON ( m.i = 1 )
LEFT JOIN data_object_tombstone_oai_set_objects tsoj
ON ( tsoj.tombstone_id = dot.tombstone_id
AND tsoj.assoc_type = 256
AND tsoj.assoc_id = 6 )
LEFT JOIN data_object_tombstone_oai_set_objects tsos
ON tsos.assoc_id = NULL
WHERE ( ( s.section_id IS NOT NULL
AND i.published = 1
AND j.enabled = 1
AND a.status <> 0 )
OR dot.data_object_id IS NOT NULL )
ORDER BY journal_id
Perhaps I’ve refered the wrong query, but as I can see, there isn’t any SQL OFFSET or LIMIT definitions. If I’m correct, each request (pagination) will return a full articles list. Probably the offsets are done in the PHP code part (using arrays) but if there aren’t any kind of SORT besides the journal_id, the articles list will always have a random SORT for every each request.
I think the problem sits on missing params like last_modified SORT on this query or on the resumption token serialized data.
Ah, I see where this is happening. I think simply changing OAIDAO’s ORDER BY statement from…
ORDER BY journal_id
ORDER BY journal_id, COALESCE(a.article_id, dot.data_object_id)`
…should add an additional ordering that’ll maintain coherence across requests.
Would you be able to try this and see if it improves the coherence of queries split across multiple requests?
FYI, OJS3 already specifies the submission ID in the ordering, so I don’t think it suffers from this. And if you’re curious, the code that does the “pagination” PHP-side is here.
Public Knowledge Project Team
Yes, thank you, this doesn’t occur in OJS3.