Lucene/Solr Plugin indexing intermittent failure (OJS 3.1.2.4)

Platform: OJS 3.1.2.4
Plugin: Lucene/Solr Plugin (release 1.1.0.0)
Solr server: version 8.11.2 (embedded)
Jetty: 9.4.44.v20210927

I am testing the Lucene plugin on a large OJS installation:

  • 80% of journals (188596 articles) index successfully
  • 20% of journals fail with a similar error: "An error occurred while indexing: Processed 27 out of a batch of 200"

I inspected a number of the articles on which indexing fails. Most contain straightforward content with no unusual characters in the metadata, yet indexing fails with "Illegal character ((CTRL-CHAR, code 4)) at [row,col {unknown-source}]: [276,158146]". Given the large column offset, the failure may occur while indexing the galley file (PDF) rather than the metadata.
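Code 4 is U+0004 (End of Transmission), one of the control characters that XML 1.0 does not allow, which is why the Woodstox parser rejects the document. Below is a minimal Python sketch for checking whether text extracted from a galley contains such characters. It assumes you already have the text as a string (for example from a PDF-to-text tool); the function names are hypothetical and not part of OJS, the plugin, or Solr. The row and column it reports are relative to the text you pass in, so they will not necessarily match the [276,158146] position from solr.log.

```python
import re

# Characters that XML 1.0 forbids: control characters below U+0020 other than
# tab, line feed and carriage return, plus surrogates and U+FFFE/U+FFFF.
ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def find_illegal_chars(text):
    """Return a list of (row, col, code point) for each forbidden character.

    Rows and columns are 1-based. Lines are split on "\n" only, because
    str.splitlines() would also split on some of the control characters
    we are looking for.
    """
    hits = []
    for row, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            hits.append((row, match.start() + 1, ord(match.group())))
    return hits


def strip_illegal_chars(text):
    """Return the text with every forbidden character removed."""
    return ILLEGAL_XML_CHARS.sub("", text)


# Hypothetical sample standing in for text extracted from a galley.
sample = "Plain abstract text\nsecond line with \x04 inside\n"
for row, col, code in find_illegal_chars(sample):
    print(f"row {row}, col {col}: control character code {code}")
```

If the check finds hits in a galley's text, that points at the PDF text extraction rather than the article metadata, and stripping those characters before the text is placed into XML would be one possible workaround.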

Any ideas on how to troubleshoot or fix this problem would be greatly appreciated.

Below is an example of the failure extracted from solr.log (some journal-specific content redacted, replaced with descriptive text):

o.a.s.h.d.XPathEntityProcessor Parsing failed for xml, url:null rows processed:180 last row: {
 journal_id=336,
 etl_journalTitleList_locales=[en_US],
 loadAction=[replace],
 etl_authorList=[...........author names containing no unusual characters............],
 etl_titleList=[......title containing no unusual characters...........],
 etl_titleList_sortOnly=[false],
 etl_abstractList_locales=[en_US],
 submission_id=47515,
 etl_galley_xml=<galleyList><galley locale="en_US" fileName="......................................./article/download/xxxxx/xxxxx"/></galleyList>,
 $forEach=/articleList/article,
 section_id=336,
 inst_id=localhost,
 etl_journalTitleList_sortOnly=[false],
 etl_journalTitleList=[...........journal title containing no unusual characters............],
 etl_abstractList=[.......abstract containing no unusual characters...........],
 issuePublicationDate_dtsort=2009-11-05T15:40:45Z,
 etl_titleList_locales=[en_US]} => java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: 
Illegal character ((CTRL-CHAR, code 4))
at [row, col {unknown-source}]: [276,158146]
java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 4))
at [row, col {unknown-source}]: [276,158146] at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:183) ~[?:?]
.................

Hello @makouvlei,

Please note that the version of OJS that you’re using is no longer supported by PKP. I would recommend that you upgrade to the newest version of OJS, as it is possible that your issue will be resolved by upgrading (although this is not guaranteed). However, other community members may wish to offer assistance.

Upgrading instructions are available in the PKP Administrator’s Guide and as part of our [Upgrade Guide](https://docs.pkp.sfu.ca/dev/upgrade-guide/).

Information about the latest version of OJS can be found on the PKP website.

-Roger
PKP Team