Lucene/Solr Plugin indexing intermittent failure (OJS 3.1.2.4)

Platform: OJS 3.1.2.4
Plugin: Lucene/Solr Plugin (release 1.1.0.0)
Solr server: version 8.11.2 (embedded)
Jetty: 9.4.44.v20210927

I am testing the Lucene/Solr plugin on a large OJS installation:

  • 80% of journals (188596 articles) indexed successfully
  • 20% of journals fail with a similar error: "An error occurred while indexing: Processed 27 out of a batch of 200"

I inspected a number of the articles on which indexing fails. Most contain straightforward content with no unusual characters, yet indexing fails with "Illegal character ((CTRL-CHAR, code 4)) at [row,col {unknown-source}]: [276,158146]". Could this failure be occurring while the galley file (PDF) is indexed?

Any ideas on how to troubleshoot or fix this problem would be greatly appreciated.

Below is an example of the failure extracted from solr.log (some journal-specific content redacted, replaced with descriptive text):

o.a.s.h.d.XPathEntityProcessor Parsing failed for xml, url:null rows processed:180 last row: {
 journal_id=336,
 etl_journalTitleList_locales=[en_US],
 loadAction=[replace],
 etl_authorList=[...........author names containing no unusual characters............],
 etl_titleList=[......title containing no unusual characters...........],
 etl_titleList_sortOnly=[false],
 etl_abstractList_locales=[en_US],
 submission_id=47515,
 etl_galley_xml=<galleyList><galley locale="en_US" fileName="......................................./article/download/xxxxx/xxxxx"/></galleyList>,
 $forEach=/articleList/article,
 section_id=336,
 inst_id=localhost,
 etl_journalTitleList_sortOnly=[false],
 etl_journalTitleList=[...........journal title containing no unusual characters............],
 etl_abstractList=[.......abstract containing no unusual characters...........],
 issuePublicationDate_dtsort=2009-11-05T15:40:45Z,
 etl_titleList_locales=[en_US]} => java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: 
Illegal character ((CTRL-CHAR, code 4))
at [row, col {unknown-source}]: [276,158146]
java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 4))
at [row, col {unknown-source}]: [276,158146]
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:183) ~[?:?]
.................

Hello @makouvlei,

Please note that the version of OJS that you’re using is no longer supported by PKP. I would recommend that you upgrade to the newest version of OJS, as it is possible that your issue will be resolved by upgrading (although this is not guaranteed). However, other community members may wish to offer assistance.

Upgrading instructions are available in the PKP Administrator’s Guide and as part of our [Upgrade Guide](https://docs.pkp.sfu.ca/dev/upgrade-guide/).

Information about the latest version of OJS can be found on the PKP Website.

-Roger
PKP Team

@makouvlei did you ever find out what the problem was? I am seeing the same thing.

OK, so here is a way to debug this type of problem. For me, at least, the cause was neither the PDF nor the submission shown in the error message.

Do the following:
In SolrWebService.inc.php (or SolrWebService.php, depending on your plugin version), edit the function called _addArticleXml.

Add a simple line to the end of that function that prints the submission ID of each article being processed, for example: echo $article->getId() . PHP_EOL;

Start the indexing from the command line for the journal.

When you get the error, check the submission_id mentioned in it and compare it to the list of submission IDs the script has printed. The log entry appears to show the last row that was processed, not the one that failed, so the problem is with the submission that comes right after the one mentioned in the error. Copy that submission ID.
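
If the printed list is long, the comparison can be scripted. Here is a minimal sketch in Python, assuming you have collected the IDs printed by the echo line into a list, in the order they appeared. The function name and all sample IDs except 47515 (taken from the log above) are made up for illustration.

```python
def next_submission_id(printed_ids, reported_id):
    """Return the ID printed right after reported_id, or None if there is none."""
    try:
        position = printed_ids.index(reported_id)
    except ValueError:
        return None  # the reported ID was never printed
    if position + 1 < len(printed_ids):
        return printed_ids[position + 1]
    return None  # the reported ID was the last one printed


# Hypothetical output of the echo line; only 47515 comes from the log above.
printed = [47513, 47514, 47515, 47520, 47521]
suspect = next_submission_id(printed, 47515)
print(suspect)
```

The value in `suspect` is the submission to look up in the database.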

Now open your database and find the matching publication data in the publication_settings table. The problematic character is most likely in either the abstract or the citations.
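
Once you have the abstract or citation text, a short script can point at the offending character. This is only a sketch in Python, assuming you have exported the field value as a string. The function names and the sample text are illustrative. The character ranges follow the XML 1.0 definition of legal characters: tab, line feed and carriage return are allowed, while other code points below 0x20 (such as code 4 from the error) are not.

```python
import re

# Anything outside the XML 1.0 "Char" production is illegal in an XML document.
ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_chars(text):
    """Return (offset, code point) for each character XML 1.0 does not allow."""
    return [(m.start(), ord(m.group())) for m in ILLEGAL_XML_CHARS.finditer(text)]


def strip_illegal_chars(text):
    """Return text with all XML-illegal characters removed."""
    return ILLEGAL_XML_CHARS.sub("", text)


# Made-up abstract fragment containing a control character with code 4.
sample = "Results were signi\x04ficant.\tSee table 2.\n"
hits = find_illegal_chars(sample)
print(hits)
print(repr(strip_illegal_chars(sample)))
```

Whether you strip the character or correct the text by hand is up to you. Either way, it is probably wise to back up the row before changing it and then re-run the indexing for that journal.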