OJS 3.1.2.4 - Another problem with indexing and word search in PDF articles

WSMH · December 26, 2019, 3:10pm

In short: OJS 2 indexes and searches PDF files correctly, but OJS 3 indexes them only partially.

Details:

On the same server I have OJS 2.4.8.5 and 3.1.2.4 installed for testing.
I have the same article in both OJS versions. I added the article using the QuickSubmit plugin.
Both OJS versions use the same “pstotext” and “pdftotext” binaries.

Fragments of the config.inc.php file (identical in both OJS versions):

ojs2
; PDF
index[application/pdf] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

ojs3
; PDF
index[application/pdf] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

OJS 2 indexed the test article completely and finds every word I searched for.
OJS 3 indexed the test article only partially and finds only some of the words.

Article links
OJS 2 (title “Pneumtayczny 2”): http://ojs.mechanik.media.pl/index.php/ATiM/article/view/86
OJS 3 (title “Pneumatyczne”): Pneumatyczne | Mechanik SC TEST

Sample words:
works
vibrations
constructed
range
requirements

OJS 2 finds all of these words in the test article. OJS 3 finds only “range”.

Regards
Wojtek

asmecher · December 28, 2019, 6:02pm

Hi @WSMH,

Of the two lines in config.inc.php:

index[application/pdf] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

…only the first of the two will have any effect.

I would suggest running the PDF text extraction tool manually on your PDF to see what text is being extracted. Take the command from the configuration file and replace the %s with the path and filename to the PDF you expect to see indexed.
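As a rough illustration of that substitution, here is a minimal Python sketch. It is not part of OJS; the function names and the example path are hypothetical, and the template is simply the pdftotext line quoted above. It assumes the PDF path should be shell-quoted before it replaces %s.

```python
import shlex
import subprocess

# Command template copied from the index[application/pdf] line above.
TEMPLATE = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"


def build_extract_command(template: str, pdf_path: str) -> str:
    """Replace %s in the template with the shell-quoted PDF path."""
    return template.replace("%s", shlex.quote(pdf_path))


def run_extract(template: str, pdf_path: str) -> str:
    """Run the command through the shell and return the extracted text.

    Requires the tools named in the template to be installed.
    """
    command = build_extract_command(template, pdf_path)
    result = subprocess.run(
        command,
        shell=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical path; substitute the real location of the galley file.
    print(build_extract_command(TEMPLATE, "/path/to/article.pdf"))
```

Printing the command first makes it easy to copy into a terminal and compare its output with what the search actually finds.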

Depending on how your PDFs are being generated, the tool may have a hard time extracting text from them.

Regards,
Alec Smecher
Public Knowledge Project Team

WSMH · December 28, 2019, 7:13pm

Thank you for the hint.
I will forward this to the server administrator.

But note that both OJS versions are installed on the same server and use the same “pstotext” (or “pdftotext”) binary.

If the tool were the problem, I would expect the problem to appear in both OJS versions.

The case described is not the only one. I am transferring articles from our old website to the production OJS 3 installation, and I have more examples.

Regards
Wojtek

Edit

Could you look into this topic: “OJS 3.1.2.4 - problem with rebuildSearchIndex”? I added some information there.

WSMH · January 2, 2020, 4:43pm

The server administrator performed the tests.
The pdftotext tool processed a PDF file with the article I wrote about in the first post.
The administrator performed a test on a file in the OJS 2 directory and on a file in the OJS 3 directory.
In both cases an identical TXT file was created (both files had the same md5 checksum).
One of these files can be downloaded from here:

https://megawrzuta.pl/files/ac10baf88014be3deefb18739a236717.txt
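For anyone who wants to repeat the checksum comparison, a minimal Python sketch follows. The function names are hypothetical and the file paths passed in are whatever the two extraction runs produced; the administrator may well have used a command-line md5 tool instead.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Matching digests indicate that both extraction runs produced byte-identical text, so any difference in search results would have to arise after extraction.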

The TXT file contains all the words I wrote about.
From this same text, OJS 2 indexed all of the words, while OJS 3 indexed only some of them.
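The claim that the TXT file contains all the sample words can be checked mechanically. The sketch below is only an illustration: it splits the text on non-word characters and lowercases it, which is an assumption and not necessarily how OJS tokenizes text for its index. The function name is hypothetical.

```python
import re

# Sample words from the first post.
SAMPLE_WORDS = ["works", "vibrations", "constructed", "range", "requirements"]


def missing_words(text: str, words: list[str]) -> list[str]:
    """Return the words that do not occur as whole tokens in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [word for word in words if word.lower() not in tokens]
```

An empty result for the extracted text would confirm that every sample word reached the output of the extraction tool.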

Regards
Wojtek

WSMH · January 2, 2020, 6:04pm

I did one more test.
I uploaded the same article to the public test versions of OJS 2 and OJS 3 (title “TEST PNEUMATYCZNE”):

https://demo.publicknowledgeproject.org/ojs2/testdrive/index.php/testdrive/article/view/1

https://demo.publicknowledgeproject.org/ojs3/testdrive/index.php/testdrive-journal/article/view/971

The result is the same as on my server: OJS 2 indexes all words, and OJS 3 indexes only some words.
I think this test shows that the problem is not related to my server, my installation or my configuration.

Regards
Wojtek

WSMH · January 7, 2020, 5:00pm

Once again I sent the same article to the public test versions of OJS 2 and OJS 3 (title “PNEUMATIC TEST”):

https://demo.publicknowledgeproject.org/ojs2/testdrive/index.php/testdrive/article/view/1

https://demo.publicknowledgeproject.org/ojs3/testdrive/index.php/testdrive-journal/article/view/969

OJS 2 indexes all words, and OJS 3 indexes only some words.

Sample words:
works
vibrations
constructed
range
requirements

OJS 2 finds all of these words in the test article. OJS 3 finds only “range”.

This is not the only article with such problems.

Should I report this problem to some special section or is it enough to report it in the “Questions” section?

Regards
Wojtek

asmecher · January 8, 2020, 1:02am

Hi @WSMH,

I’m following this thread, but haven’t had time to look into it yet.

Regards,
Alec Smecher
Public Knowledge Project Team

WSMH · January 8, 2020, 9:08am

Thank you for the information.

If you need more articles for testing, I uploaded one more to our server (title “Hierarchical”):
OJS 2 - http://ojs.mechanik.media.pl/index.php/ATiM/article/view/89
OJS 3 - Hierarchical | Mechanik SC TEST

Sample words:
satisfactory
characteristics
deformed
specimens
considered
stretching
representation

OJS 2 finds all of these words in the test article. OJS 3 finds only “characteristics”.

I’ve checked that none of these words are on the “stopwords.txt” list.
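The stopword check can be scripted as well. This is a sketch under stated assumptions: it treats the list as one word per line and skips blank lines and lines starting with “#”, which may not match the real format of “stopwords.txt”. The function names are hypothetical.

```python
from typing import Iterable

# Sample words from the “Hierarchical” article.
SAMPLE_WORDS = [
    "satisfactory", "characteristics", "deformed", "specimens",
    "considered", "stretching", "representation",
]


def load_stopwords(lines: Iterable[str]) -> set[str]:
    """Build a lowercase stopword set; skip blank lines and '#' comments (assumed format)."""
    stopwords = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            stopwords.add(word)
    return stopwords


def stopword_hits(words: Iterable[str], stopwords: set[str]) -> list[str]:
    """Return the sample words that are on the stopword list."""
    return [word for word in words if word.lower() in stopwords]
```

To use it on the real list, pass an open file handle for “stopwords.txt” to load_stopwords; an empty result from stopword_hits matches the manual check described above.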

Regards
Wojtek

asmecher · January 28, 2020, 8:09pm

Hi @WSMH,

I’ve added some suggestions for debugging over at this thread: OJS 3.1.2.4 - problem with rebuildSearchIndex - #8 by asmecher

Regards,
Alec Smecher
Public Knowledge Project Team