rebuildSearchIndex problems with PDF full text in OJS 3.2.1.1

Hi,
We are running OJS 3.2.1.1 on RHEL 7.9 with MySQL 5.6.
We recently upgraded from OJS 2.8 and noticed that full text indexing of PDFs is no longer working. For our migrated journals, older articles continue to be full text searchable, but articles added since migration to 3.2.1.1 are not full text searchable.
This line is uncommented in our config.inc.php:
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

When I run the above command manually on the command line against a newly added article, the expected output is sent to stdout.
The apache user owns all files and directories in the OJS files directory and has rwx permissions on them.
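
For anyone repeating this check, here is a minimal sketch of expanding the configured template for one file and running it as the apache user rather than as your own login. It assumes OJS replaces the %s placeholder with the file path; the helper name build_cmd and the 300-byte preview are illustrative choices, not anything OJS provides. Note that a command started from an interactive shell usually does not run in the same security context as the web server, so success here does not prove the same command works when OJS runs it.

```shell
#!/bin/sh
# Template copied from the index[application/pdf] line in config.inc.php.
template="/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

# Hypothetical helper: substitute a single-quoted file path for %s.
# Assumes the path contains no single quotes, '|' or '&' characters.
build_cmd() {
    printf '%s\n' "$template" | sed "s|%s|'$1'|"
}

# Run only when a PDF path is given as the first argument.
if [ -n "${1:-}" ]; then
    cmd=$(build_cmd "$1")
    echo "Running as apache: $cmd"
    # Show only the first 300 bytes of the extracted text.
    sudo -u apache sh -c "$cmd" | head -c 300
fi
```

Saved under a name of your choosing and called with the path to a galley PDF, this prints the expanded command followed by the start of the extracted text, or the error from pdftotext if extraction fails.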

I ran the rebuildSearchIndex.php tool on the command line on a test instance, and it reported that it had successfully indexed 47 articles. However, when I checked the database, I saw that the number of table entries had dropped as follows:
submission_search_keyword_list dropped from 18532 to 1029 records
submission_search_object_keywords dropped from 74491 to 2144 records
submission_search_objects dropped from 416 to 370 records

It seems that only the article metadata is now being indexed.

The only thing I see in the logs is that PHP records a Division by zero Warning whenever an editorial decision is recorded. But this does not stop us from being able to push an article through the editorial workflow and publish it, as expected. I saw no errors in the log when I ran rebuildSearchIndex… A couple of PHP Notices were output regarding Array to string conversion, and two PDFs were not found (path was given as journals/1//articles/… instead of journals/1/articles/… but this seems to be an unrelated issue).

Can anyone suggest where the problem might lie?

many thanks.

Hello! I am just going to tag @asmecher to take a look at this - Alec, this is related to a Coalition Publica journal.

Many thanks for tagging this @EmmaU
We are not related to Coalition Publica in any way (but many thanks anyway!)

@elt Sorry for any confusion! I had this forum request passed to me by a librarian at McGill as being Coalition Publica-related, but maybe a wire got crossed somewhere. :slightly_smiling_face: Either way, hopefully our developer team will be able to help soon.

Hi @EmmaU, I checked back with our librarians here and discovered that we have recently embarked on a joint initiative with Coalition Publica, so apologies for the earlier correction. You were absolutely right!

Hi @elt,

If you’d be willing to (privately of course) share a test copy of your journal and information about an example article (e.g. a search that should result in a specific article but does not) I can investigate further. Please send me a private message.

Thanks,
Alec Smecher
Public Knowledge Project Team

Hi all,

Just to document the resolution – it appears that this may be a problem with SELinux permissions not being open enough to allow OJS to execute the PDF full-text extraction program configured in config.inc.php.
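
For others who hit the same symptom, a rough sketch of how one might check whether SELinux is involved is below. It only reports the current mode and looks for recent denials that mention pdftotext; it does not change any policy. The function name selinux_mode is illustrative, reading the audit log normally requires root, and the right remedy for a denial depends on the local policy.

```shell
#!/bin/sh
# Report the SELinux mode, or "unavailable" when the tools are not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unavailable"
    fi
}

mode=$(selinux_mode)
echo "SELinux mode: $mode"

# When enforcing, list AVC denials from the last ten minutes whose
# command name is pdftotext (needs root and a running auditd).
if [ "$mode" = "Enforcing" ] && command -v ausearch >/dev/null 2>&1; then
    ausearch -m avc -ts recent -c pdftotext 2>/dev/null \
        || echo "no matching denials found (or audit log not readable)"
fi
```

Running this as root shortly after a rebuildSearchIndex.php run, or after uploading a new PDF galley through the web interface, should show whether the extraction program was blocked.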

Regards,
Alec Smecher
Public Knowledge Project Team
