OJS 3.1.2.4 - problem with rebuildSearchIndex

I want to rebuild the Search Index.

After running the command “tools/rebuildSearchIndex.php”, the following messages appeared:

user@wwwnew:/home/www/ojs3/html$ /usr/local/php_7.2.25/bin/php tools/rebuildSearchIndex.php

Clearing index … done

Indexing “Mechanik SC TEST” … PHP Warning: Declaration of SubmissionDisciplineEntryDAO::getByControlledVocabId($controlledVocabId, $rangeInfo = NULL) should be compatible with ControlledVocabEntryDAO::getByControlledVocabId($controlledVocabId, $rangeInfo = NULL, $filter = NULL) in /home/www/ojs3/html/lib/pkp/classes/submission/SubmissionDisciplineEntryDAO.inc.php on line 20

PHP Warning: Declaration of SubmissionSubjectEntryDAO::getByControlledVocabId($controlledVocabId, $rangeInfo = NULL) should be compatible with ControlledVocabEntryDAO::getByControlledVocabId($controlledVocabId, $rangeInfo = NULL, $filter = NULL) in /home/www/ojs3/html/lib/pkp/classes/submission/SubmissionSubjectEntryDAO.inc.php on line 44

4 articles indexed

Now “search” does not find the words in some articles that I added before rebuilding the indexes.

In the article that I added after rebuilding the indexes, “search” works fine.

All articles are PDF documents.

Regards
Wojtek

Additional information.

I checked the tables: submission_search_keyword_list, submission_search_objects and submission_search_object_keywords. In these tables, only a few words remain for the article that I added before rebuilding the indexes.
I checked that “search” does find these remaining words.

Regards
Wojtek

Hi @WSMH,

Do you have the PDF text extraction tools configured in your config.inc.php?

Regards,
Alec Smecher
Public Knowledge Project Team

Yes, I have them configured.
I have the following entries in the config.inc.php file:

index[application/pdf] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

If I add a new article (QuickSubmit), it is indexed and “search” works fine.
After rebuilding the indexes, what I wrote in the first post happened.

Regards
Wojtek

New information.

The server administrator has run “rebuildSearchIndex” with administrator privileges.
Now there were no error messages:

root@wwwnew:/home/www/ojs3/html# /usr/local/php_7.2.25/bin/php tools/rebuildSearchIndex.php

Clearing index … done

Indexing “Mechanik SC TEST” … 5 articles indexed

But some entries have been removed from the tables and now “search” does not find some words in the latest article (before rebuilding the indexes, the “search” function in this article worked correctly).

The number of entries in the tables before rebuilding the indexes:

submission_search_keyword_list - 1810
submission_search_objects - 41
submission_search_object_keywords - 5456

And after rebuilding the indexes:

submission_search_keyword_list - 1343
submission_search_objects - 41
submission_search_object_keywords - 3483

Before rebuilding the indexes, I noted 3 keywords: “Symantec”, “Endpoint” and “Protection”. All three were in the “submission_search_keyword_list” table and “search” found those words.

After rebuilding the indexes, there is only “Endpoint” in the table. “Symantec” and “Protection” have been removed. The “search” function only finds “Endpoint”.

Summary:

  1. Before the first index rebuilding, I had 4 articles. In all four articles, the “search” function worked correctly.

  2. After the first index rebuilding, most of the words disappeared from the database and the “search” function did not work properly.

  3. I added a new (fifth) article. The “search” function worked correctly in this article.

  4. After the second index rebuilding, most of the words in the fifth article disappeared from the database. Now the “search” function works incorrectly in all five articles.

Below are the first 16 records of the “submission_search_keyword_list” table before and after the second index rebuild (in alphabetical order). All records shown in the pictures refer to the same (fifth) article.

[Image: przed_i_po (“before and after”)]

Regards
Wojtek

Hi @WSMH,

Submission keywords (that is, the “Keywords” metadata field) are currently not indexed and not searchable in OJS 3.x. I’ve filed that here: Submission keywords are not indexed/searchable · Issue #5388 · pkp/pkp-lib · GitHub

That appears to be the major part of your concern, correct?

Regards,
Alec Smecher
Public Knowledge Project Team

Thank you for your response.

Here is the “fifth” article: Artykuł testowy | Mechanik SC TEST
There are only two words in the “Keywords” metadata field: “słowa” and “kluczowe”. The words “Symantec”, “Endpoint” and “Protection” have never been in this field. There have never been words in these fields that were removed from the database when rebuilding indexes (look at the image in the previous post).

Regards
Wojtek

Edit.

I see that in the previous post I wrote: “I noted 3 keywords:”.
My mistake. I should have written, “I noted 3 words from article:”.
Sorry for the lack of precision.

Hi @WSMH,

Unfortunately this scenario is tough for me to replicate locally, since it’s quite data dependent. I’d suggest picking a word that you think should be indexed, but isn’t during the index rebuild process. Add some error_log commands to various steps in the index rebuild process to capture moments where it’s indexed (and presumably then removed); if those log entries never appear, then the issue will be that the content is never getting indexed.

I’d suggest starting with lib/pkp/classes/search/SubmissionSearchDAO.inc.php in the insertKeyword function – this should be called whenever a keyword is indexed.
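For anyone who wants to try this without much PHP experience, here is a minimal sketch of what such a log line could look like. The helper name traceKeyword, the stage label, and the log message format are all made up for illustration and are not part of OJS; how it would be called from inside insertKeyword depends on that function's actual parameters, which are not shown in this thread.

```php
<?php
// Hypothetical helper, NOT part of OJS: write a log line whenever the word
// being traced passes through a given step of the indexing process.
function traceKeyword(string $keyword, string $watchWord, string $stage): ?string {
    // ASCII case-insensitive comparison, since the indexer may lowercase words.
    if (strcasecmp($keyword, $watchWord) !== 0) {
        return null; // not the word we are tracing
    }
    $message = sprintf('[search-index] %s: "%s"', $stage, $keyword);
    // Goes to the PHP error log, or to stderr when run from the command line.
    error_log($message);
    return $message;
}

// Example call; a similar line could be placed near the top of insertKeyword
// (assumption: the keyword is available there as a string variable).
traceKeyword('Symantec', 'symantec', 'insertKeyword');
```

If the log line never shows up while tools/rebuildSearchIndex.php runs, the word is being lost before it reaches insertKeyword, and the same call can be moved to an earlier step.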

Regards,
Alec Smecher
Public Knowledge Project Team

Thank you for your interest and advice.
Unfortunately, I’m not a programmer, I don’t know PHP, and I don’t understand what I should do.

I’ll wait, maybe someday someone will have the same problem and be able to apply your advice.

Once again, thank you and sorry for the confusion.

Regards
Wojtek

Hi everyone - mentioning that this seems to be affecting an installation we have as well, running 3.2.1. With roughly 500 submissions, some have PDFs that get indexed and some do not. Permissions look good, pdftotext can extract content from any of the problem ones, but words that should be stored are not being stored. Stranger, some words are being stored for some PDFs, but those same words aren’t being stored for others.

I’ve spent time today debugging file paths and file revisions; those look good. Moving into the keyword part of this now, and will update if I learn anything.

Jason

Thank you for the information.
It’s nice that you got interested in the problem.

So, I continued to dig into this yesterday and I have a theory. When you configure a parser for a particular MIME type in your config.inc.php file, you’re providing the method through which text gets handed to the OJS indexer.

In lib/pkp/classes/search/SubmissionSearchIndex.inc.php there is a method called filterKeywords that runs the string of text through a series of regular expressions designed to remove characters that are not useful, like punctuation and HTML tags. During my testing I included debugging statements before and after these regular expressions and in some cases, the entire string of text was stripped away, leaving nothing to index. The issue is with the first two regular expressions:

$text = PKPString::regexp_replace('/[!"\#\$%\'\(\)\.\?@\[\]\^`\{\}~]/', '', $text);
$text = PKPString::regexp_replace('/[\+,:;&\/<=>\|\\\]/', ' ', $text);

since commenting out the first one meant that the second one stripped the string bare instead.

It’s not every PDF and not necessarily the regular expression (I tried [[:punct:]] which is a regex that matches all punctuation, and the same thing happened), so I suspect this may be related to the PDF itself, and how it is generated.
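One possible explanation for why every pattern fails the same way, offered as an assumption and not confirmed in this thread: if PKPString::regexp_replace appends PCRE's u (UTF-8) modifier to the pattern, then PHP's preg_replace returns NULL whenever the subject contains bytes that are not valid UTF-8, no matter what the pattern matches. The sketch below uses plain preg_replace and a made-up sample string to show that behaviour:

```php
<?php
// Made-up sample: "für" encoded as Windows-1252. The single byte 0xFC is not
// a valid UTF-8 sequence.
$text = "Protection f\xFCr Endpoint.";

// The first punctuation pattern quoted above, with the "u" modifier added.
// (Assumption: PKPString::regexp_replace adds this modifier itself.)
$withU = '/[!"\#\$%\'\(\)\.\?@\[\]\^`\{\}~]/u';
$result = preg_replace($withU, '', $text);
$error = preg_last_error();

// With "u", the whole string is lost, not just the punctuation.
var_dump($result === null);
// PCRE reports that the subject was malformed UTF-8.
var_dump($error === PREG_BAD_UTF8_ERROR);

// Without "u", the same input is treated as raw bytes and survives.
$withoutU = preg_replace('/[!"\#\$%\'\(\)\.\?@\[\]\^`\{\}~]/', '', $text);
var_dump($withoutU);
```

If this is what is happening, it would also explain why only some PDFs are affected: only those whose extracted text contains at least one invalid byte sequence would be wiped out.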

I’ve also tried adjusting the filter command in the config.inc.php file in the hope that a different set of replacement patterns to tr could provide cleaner input, but I’m not having a lot of luck with that either. Even doing something like this:

/usr/bin/tr '[:cntrl:][:punct:][:space:]' ' '

which should convert control characters, all white space, and punctuation, still results in the same thing.

Commenting out all of the regular expressions does mean that the data gets indexed, but it looks terrible and is not recommended.

So I feel better in that I think I know where the problem is, but I’m not quite sure what else to try at this point, short of digging into some messy regexes that do work on other files.

Cheers,
Jason

I don’t know if this information will help or make it harder.

OJS 2.4.8.5 indexed PDF files correctly. OJS 3.1.2.4 has problems with the same PDF files. More about it: OJS 3.1.2.4 - Another problem with indexing and word search in PDF articles

This might suggest that the problem is not the PDF but the OJS.

Unfortunately version 2.4.8.5 has already been removed from our server. But if you can, try indexing the same PDF in both versions.

Regards
Wojtek

Hi Wojtek,

2.4.8-5 is still available for download if you ever want to test it out again.

But, looking at the search indexing code in 2.4.8, the regular expressions are the same:

        $cleanText = Core::cleanVar($text);

        // Remove punctuation
        $cleanText = PKPString::regexp_replace('/[!"\#\$%\'\(\)\.\?@\[\]\^`\{\}~]/', '', $cleanText);
        $cleanText = PKPString::regexp_replace('/[\+,:;&\/<=>\|\\\]/', ' ', $cleanText);
        $cleanText = PKPString::regexp_replace('/[\*]/', $allowWildcards ? '%' : ' ', $cleanText);
        $cleanText = PKPString::strtolower($cleanText);

The problem may lie with the call to Core::cleanVar in 2.4.8 doing some cleanup work that OJS3 is not doing or perhaps doing in a different way. That method no longer exists in OJS3.

Cheers,
Jason

Please note that these two threads deal with two problems.

The thread OJS 3.1.2.4 - Another problem with indexing and word search in PDF articles is about mis-indexing new PDF articles.

And in this thread I wrote that the “rebuildSearchIndex” tool removes words from the database that are already indexed.

Could the mysterious “Core::cleanVar” be the cause of both problems?

Regards
Wojtek

rebuildSearchIndex should indeed remove words from the database, since the first thing it does is clear out the entire search index. If that wasn’t happening in previous versions of OJS, that would be a bug. The problem here is that rebuilding the search index doesn’t put them back. If a search index had been created, and then rebuildSearchIndex run in a newer version of OJS, and that new version mis-indexed the words, then yes, the words would disappear from the index.

I wrote briefly so as not to repeat everything I wrote in the first posts of this thread. :)

For me it helps to use

index[application/pdf] = "/usr/bin/pdftotext -enc Windows-1255 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

in config.inc.php

Even better (e.g. for German umlauts) is to use the original

index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

but with an additional line

$text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');

before applying the regexps in lib/pkp/classes/search/SubmissionSearchIndex.inc.php. The old cleanVar did something similar.

I’ve been wrestling with this issue too. I can see the extracted PDF text before the PKPString::regexp_replace lines, but then the output is blank.

I tried the above suggestion and it worked! I even dropped the Windows-1252 encoding because it was messing up diacritics; just converting the text to UTF-8 also allows the parser to work. This is the line that I used:

$text = mb_convert_encoding($text, 'UTF-8');

I’m not sure if this is an issue with SubmissionSearchIndex.inc.php or something with PKPString.inc.php or maybe I’m missing something on my server.
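The two variants above can be combined into a guarded conversion. This is only a sketch: the helper name toValidUtf8 is made up and not part of OJS, Windows-1252 as the fallback is a guess about what the extraction tool produced, and the mbstring extension is assumed to be installed. The idea is to leave text alone when it is already valid UTF-8, and only reinterpret it when it is not. Converting valid UTF-8 as if it were Windows-1252 treats each byte of a multi-byte character as a separate character, which would explain the damaged diacritics mentioned above.

```php
<?php
// Hypothetical helper, NOT part of OJS. Requires the mbstring extension.
// $fallbackEncoding is a guess about the encoding of the extracted text.
function toValidUtf8(string $text, string $fallbackEncoding = 'Windows-1252'): string {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8: leave diacritics untouched
    }
    // Reinterpret the bytes as the fallback encoding and transcode to UTF-8.
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

// Valid UTF-8 input (this source file is UTF-8) is returned unchanged.
$alreadyUtf8 = toValidUtf8("słowa kluczowe");

// A lone 0xFC byte (ü in Windows-1252) is converted to its UTF-8 form.
$fromLegacy = toValidUtf8("f\xFCr");
```

In SubmissionSearchIndex.inc.php this check would sit in the same place as the mb_convert_encoding lines above, before the regular expressions are applied.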