Platform: OJS 3.1.2.4
Plugin: Lucene/Solr Plugin (release 1.1.0.0)
Solr server: version 8.11.2 (embedded)
Jetty: 9.4.44.v20210927
I am testing the Lucene plugin on a large OJS installation:
- 80% of journals (188596 articles) indexed successfully
- 20% of journals fail with an error similar to this one:
An error occurred while indexing: Processed 27 out of a batch of 200
I inspected a number of the articles on which indexing fails. Most contain straightforward content with no unusual characters, yet indexing still fails with:
Illegal character ((CTRL-CHAR, code 4)) at [row,col {unknown-source}]: [276,158146]
Could this failure be occurring while the galley file (PDF) is being indexed?
Any ideas on how to troubleshoot or fix this problem would be greatly appreciated.
Below is an example of the failure extracted from solr.log (some journal-specific content redacted, replaced with descriptive text):
o.a.s.h.d.XPathEntityProcessor Parsing failed for xml, url:null rows processed:180 last row: {
journal_id=336,
etl_journalTitleList_locales=[en_US],
loadAction=[replace],
etl_authorList=[...........author names containing no unusual characters............],
etl_titleList=[......title containing no unusual characters...........],
etl_titleList_sortOnly=[false],
etl_abstractList_locales=[en_US],
submission_id=47515,
etl_galley_xml=<galleyList><galley locale="en_US" fileName="......................................./article/download/xxxxx/xxxxx"/></galleyList>,
$forEach=/articleList/article,
section_id=336,
inst_id=localhost,
etl_journalTitleList_sortOnly=[false],
etl_journalTitleList=[...........journal title containing no unusual characters............],
etl_abstractList=[.......abstract containing no unusual characters...........],
issuePublicationDate_dtsort=2009-11-05T15:40:45Z,
etl_titleList_locales=[en_US]} => java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException:
Illegal character ((CTRL-CHAR, code 4))
at [row, col {unknown-source}]: [276,158146]
java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 4))
at [row, col {unknown-source}]: [276,158146]
at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:183) ~[?:?]
.................
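
For anyone trying to reproduce this: code 4 is the ASCII EOT (End of Transmission) control character, which XML 1.0 does not allow anywhere in a document, so the parser rejects it regardless of how clean the rest of the text looks. Below is a minimal Python sketch for checking a piece of extracted text (for example, text pulled from a suspect PDF galley) for such characters and for stripping them. The function names are hypothetical and are not part of OJS, the Lucene/Solr plugin, or Solr; where such a check would need to hook into the indexing pipeline is exactly what I have not worked out.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return a list of (offset, code point) for characters XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in _ILLEGAL_XML_CHARS.finditer(text)]


def strip_illegal_xml_chars(text):
    """Return the text with all XML 1.0 forbidden characters removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Hypothetical sample: ordinary text with an embedded EOT (code 4).
    sample = "plain text\x04 with an EOT control character"
    print(find_illegal_xml_chars(sample))   # [(10, 4)]
    print(strip_illegal_xml_chars(sample))  # plain text with an EOT control character
```

Running the finder over text extracted from one of the failing galleys should at least confirm whether the control character originates in the PDF rather than in the article metadata.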