rebuildSearchIndex problems

Hello all,
I'm afraid I have to circle back to the rebuildSearchIndex problems with PDF full text in OJS 3.2.1.1.

I had hoped that the problem would be related to SELinux, but it seems this is not the case.
We are running OJS 3.2.1.1 on RHEL 7.9 with MySQL 5.6.
We recently upgraded from OJS 2.8 and noticed that full-text indexing of PDFs is no longer working. For our migrated journals, older articles are still full-text searchable, but articles added since the migration to 3.2.1.1 are not.

After confirming with Alec (thanks, Alec!) that the problem is NOT reproducible on other systems, I set SELinux on our server to permissive mode, but unfortunately that did not fix the problem.

I traced the problem to the regexp_replace function in the PKPString class. If I remove the PCRE_UTF8 constant on line 281, PDF indexing succeeds. I can confirm that the constant is set to ‘u’ when the class is called.
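For what it's worth, this matches PHP's documented preg behaviour: with the 'u' modifier, an invalid UTF-8 subject makes the call fail outright. A minimal standalone illustration (plain PHP, not OJS code):

$subject = "caf\xE9"; // 0xE9 is a Latin-1 byte, not valid UTF-8
var_dump(preg_replace('/\s+/u', ' ', $subject));      // NULL: PCRE aborts on the bad byte
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
var_dump(preg_replace('/\s+/', ' ', $subject));       // unchanged: without 'u', PCRE runs bytewise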

In the SubmissionSearchIndex class's filterKeywords method, a check of the input encoding (mb_check_encoding()) and the reported error (preg_last_error()) after the first call to PKPString::regexp_replace shows that the input is invalid UTF-8 and the error is PREG_BAD_UTF8_ERROR. So it seems that the problem is with the text extracted by pdftotext.
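The debug lines I added look roughly like this (my own instrumentation, not OJS code):

// inside SubmissionSearchIndex::filterKeywords(), immediately after the
// first PKPString::regexp_replace() call on the parser output $text
if (!mb_check_encoding($text, 'UTF-8')) {
    error_log('filterKeywords: input is not valid UTF-8');
}
if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
    error_log('filterKeywords: PREG_BAD_UTF8_ERROR');
}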

However, when I run the pdftotext command in the terminal, the output is valid UTF-8. And if I use the SearchFileParser class to parse the PDF and send the result to ArticleSearchIndex->filterKeywords, I cannot reproduce the error: filterKeywords returns an array of keywords as expected.

The steps for that test are:
$submissionFileDao = DAORegistry::getDAO('SubmissionFileDAO');
$file = $submissionFileDao->getLatestRevision($fileId);
$parser = SearchFileParser::fromFile($file);
$prog = Config::getVar('search', 'index[' . $parser->type . ']');
$articleSearchIndex = Application::getSubmissionSearchIndex();
if (isset($parser) && $parser->open()) {
    while (($text = $parser->read()) !== false) {
        $keywords = $articleSearchIndex->filterKeywords($text);
    }
}
Am I missing a step which might cause the encoding of the input to change?

Can anyone offer any pointers on where to look next?

thanks!

P.S. Our servers are behind load balancers. Would that affect any step of full-text indexing?

Hi @elt,

The command line you’re using in config.inc.php

index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

…passes the PDF contents extracted by pdftotext through the tr command in order to replace any control characters with spaces. Off the top of my head, I'd expect that to touch only ASCII control characters, so I'm confused about where the invalid UTF-8 is coming from.
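A quick way to see what that tr invocation does (assuming GNU tr, which operates on bytes):

printf 'a\tb\001c' | tr '[:cntrl:]' ' '   # prints "a b c": the tab and the 0x01 byte become spaces

Bytes at 0x80 and above, i.e. all multibyte UTF-8 data, pass through untouched, so tr should neither introduce nor strip invalid sequences.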

What's odd is that the same command, database, and files don't present any problems on my own machine, so I suspect a difference in the underlying tools on your server vs. mine (pdftotext, tr, PHP's UTF-8 support, and possibly other elements).

I'd suggest picking a single PDF to work with: maybe have your system dump file names while running the indexing command, so you can find one that triggers the PREG_BAD_UTF8_ERROR problem. Then try running the pdftotext command on that file manually to see what the output looks like; I suspect you'll see invalid UTF-8 content.
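A quick way to check the validity of the extracted text (with problem.pdf standing in for whichever file fails) is to pipe it through iconv, which stops with an "illegal input sequence" error at the first invalid byte:

/usr/bin/pdftotext -enc UTF-8 -nopgbrk problem.pdf - | /usr/bin/iconv -f UTF-8 -t UTF-8 > /dev/null && echo "valid UTF-8"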

If that's the case, you may be able to resolve it by adding a call to iconv into the command line in the configuration file; see e.g. this discussion for a similar case.
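Something along these lines should work (the -c flag tells iconv to silently discard any byte sequences it can't convert to UTF-8):

index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' ' | /usr/bin/iconv -c -f UTF-8 -t UTF-8"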

Regards,
Alec Smecher
Public Knowledge Project Team

Hi Alec,
Many thanks indeed for that suggestion. I had already checked the output of pdftotext manually and confirmed it was valid UTF-8, but I tried adding the iconv pipe to the config file anyway, and it worked! All our PDFs are now being indexed.
Thanks again.

Hi @elt,

Glad to hear it’s working!

Regards,
Alec Smecher
Public Knowledge Project Team