rebuildSearchIndex problems with PDF full text in OJS 3.2.1.1

elt · March 17, 2021, 7:59pm

Hi,
We are running OJS 3.2.1.1 on RHEL 7.9 with MySQL 5.6.
We recently upgraded from OJS 2.8 and noticed that full text indexing of PDFs is no longer working. For our migrated journals, older articles continue to be full text searchable, but articles added since migration to 3.2.1.1 are not full text searchable.
This line is uncommented in our config.inc.php:
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

When I run the above command manually on the command line against a newly added article, the expected output is sent to stdout.
The apache user owns all files and directories in the OJS files directory and has rwx permissions on them.
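
For anyone reproducing this check: a successful manual run under your own login does not necessarily mean the same command succeeds under the web server's account or security context. The sketch below expands the %s placeholder in the config line and shows, in a comment, how the result could be run as the apache user. The helper name build_index_cmd and the PDF path are hypothetical, and the sed substitution only roughly mimics what OJS does with the template; it is not taken from the OJS source.

```shell
#!/bin/sh
# Expand the %s placeholder of the config.inc.php template with a file path.
# Simple sed substitution; paths containing "|" or "&" would need escaping.
build_index_cmd() {
    template="/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
    printf '%s\n' "$template" | sed "s|%s|$1|"
}

# Hypothetical path; substitute a real galley file from the OJS files directory.
PDF="/path/to/files_dir/example.pdf"

# Print the command that would be executed for this file.
build_index_cmd "$PDF"

# To run it as the web server account instead of your own login (assumes sudo
# is available and the account is named "apache", as in the post above):
#   sudo -u apache sh -c "$(build_index_cmd "$PDF")" | wc -c
```

A byte count of zero from the commented sudo line would suggest the extraction produces no text for that account, even though it works for an interactive user.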

I ran the rebuildSearchIndex.php tool on the command line on a test instance, and it reported that it had successfully indexed 47 articles. However, when I checked the database, I saw that the number of rows in the search tables had dropped as follows:
submission_search_keyword_list dropped from 18532 to 1029 records
submission_search_object_keywords dropped from 74491 to 2144 records
submission_search_objects dropped from 416 to 370 records

It seems that only the article metadata is now being indexed.
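
To put the drop in proportion, here is a small sketch that computes what percentage of rows remains in each table, using only the counts reported above. The helper name pct_remaining is made up for illustration, and the result uses integer division, so it rounds down.

```shell
#!/bin/sh
# Percentage of rows remaining after the rebuild (integer division, rounds down).
pct_remaining() {
    before=$1
    after=$2
    echo $(( after * 100 / before ))
}

# Row counts reported above: table, before, after.
for row in \
    "submission_search_keyword_list 18532 1029" \
    "submission_search_object_keywords 74491 2144" \
    "submission_search_objects 416 370"
do
    set -- $row
    echo "$1: $(pct_remaining "$2" "$3")% of rows remain"
done
```

The two keyword tables keep roughly 5% and 2% of their rows, while submission_search_objects keeps about 88%. That pattern fits the reading that most objects are still registered but far fewer keywords are being extracted for them.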

The only thing I see in the logs is that PHP records a Division by zero Warning whenever an editorial decision is recorded. But this does not stop us from being able to push an article through the editorial workflow and publish it, as expected. I saw no errors in the log when I ran rebuildSearchIndex… A couple of PHP Notices were output regarding Array to string conversion, and two PDFs were not found (path was given as journals/1//articles/… instead of journals/1/articles/… but this seems to be an unrelated issue).

Can anyone suggest where the problem might lie?

many thanks.

EmmaU · March 23, 2021, 4:40pm

Hello! I am just going to tag @asmecher to take a look at this - Alec, this is related to a Coalition Publica journal.

elt · March 23, 2021, 4:54pm

Many thanks for tagging this @EmmaU
We are not related to Coalition Publica in any way (but many thanks anyway!)

EmmaU · March 23, 2021, 5:05pm

@elt Sorry for any confusion! I had this forum request passed to me by a librarian at McGill as being Coalition Publica-related, but maybe a wire got crossed somewhere. Either way, hopefully our developer team will be able to help soon.

elt · March 23, 2021, 6:16pm

Hi @EmmaU, I checked back with our librarians here and discovered that we have recently embarked on a joint initiative with Coalition Publica, so apologies for the earlier correction. You were absolutely right!

asmecher · March 30, 2021, 11:01pm

Hi @elt,

If you’d be willing to (privately of course) share a test copy of your journal and information about an example article (e.g. a search that should result in a specific article but does not) I can investigate further. Please send me a private message.

Thanks,
Alec Smecher
Public Knowledge Project Team

asmecher · April 14, 2021, 8:31pm

Hi all,

Just to document the resolution – it appears that this may be a problem with SELinux permissions not being open enough to allow OJS to execute the PDF full-text extraction program configured in config.inc.php.
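
For others who land here, a rough sketch of how one might check whether SELinux is involved on a similar system. This is not the exact procedure used in this case; it only reports the current SELinux mode and the security context of the pdftotext binary named in the config line. The helper name selinux_mode is made up for illustration, and the script assumes the standard getenforce and ls -Z tools.

```shell
#!/bin/sh
# Report the SELinux mode, or "unavailable" where SELinux tools are not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unavailable"
    fi
}

echo "SELinux mode: $(selinux_mode)"

# Show the security context of the extraction binary from config.inc.php.
# (ls -Z prints SELinux contexts on SELinux-enabled systems.)
if [ -e /usr/bin/pdftotext ]; then
    ls -Z /usr/bin/pdftotext 2>/dev/null || true
fi

# Recent denials, if the audit tools are present (usually requires root):
#   ausearch -m avc -ts recent
```

If the mode is Enforcing and the audit log shows denials for the web server process around the time an article is indexed, that would be consistent with the diagnosis above.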

Regards,
Alec Smecher
Public Knowledge Project Team
