OJS 3.2.1.x multi-journal site: search inside PDF files returns no results

I have an OJS 3.2.1.1 installation with many journals, using the bootstrap3 theme.
Search works fine across journals, articles, issues, etc.
The issue is that when I search for text that exists inside a PDF file, I get no results.
The tables submission_search_keyword_list, submission_search_objects and submission_search_object_keywords seem to be fine, and I have already run rebuildSearchIndex.php.
Does OJS 3 support searching inside PDF files?
Is there something else I can check?

Hi @Dimitris_Sioulas,

Are your PDF text extraction tools configured in config.inc.php?

Regards,
Alec Smecher
Public Knowledge Project Team

Hello @asmecher,
I have not edited anything regarding PDF text extraction.
Please let me know what I need to do.

I found this code in the config.inc.php file.
[search]

; Minimum indexed word length
min_word_length = 3

; The maximum number of search results fetched per keyword. These results
; are fetched and merged to provide results for searches with several keywords.
results_per_keyword = 500

; The number of hours for which keyword search results are cached.
result_cache_hours = 1

; Paths to helper programs for indexing non-text files.
; Programs are assumed to output the converted text to stdout, and "%s" is
; replaced by the file argument.
; Note that using full paths to the binaries is recommended.
; Uncomment applicable lines to enable (at most one per file type).
; Additional "index[MIME_TYPE]" lines can be added for any mime type to be
; indexed.

; PDF
; index[application/pdf] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
; index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

; PostScript
; index[application/postscript] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
; index[application/postscript] = "/usr/bin/ps2ascii %s | /usr/bin/tr '[:cntrl:]' ' '"

; Microsoft Word
; index[application/msword] = "/usr/bin/antiword %s"
; index[application/msword] = "/usr/bin/catdoc %s"

My config.inc.php file includes:

; PDF
; index[application/pdf] = "/usr/bin/pstotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"
index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

I am using a Windows environment and have installed pdftotext.
So I have to change the path, replacing /usr/bin/pdftotext with the equivalent path on my machine,
and run rebuildSearchIndex.php again. Is there anything more?

Hi @Dimitris_Sioulas,

Yes, for a Windows machine you'll have to change the path to pdftotext. I'd also recommend testing out the command line manually on your console, using any of your OJS PDF files in place of the %s. You'll probably also have to change the path to /usr/bin/tr; that appears to be available as part of Coreutils for Windows, or you can probably just remove that part of the command line (| /usr/bin/tr '[:cntrl:]' ' ') without causing problems.
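
The manual check described above can also be scripted. The following is only a rough sketch, not OJS code: it assumes the indexer substitutes the file path for %s and reads the helper's stdout, and the function names (build_command, extract_text) and the Windows paths in the comments are hypothetical. Wrapping the path in double quotes is likewise an assumption made here so that paths containing spaces work.

```python
import subprocess


def build_command(template: str, file_path: str) -> str:
    """Substitute the %s placeholder with the double-quoted file path."""
    if "%s" not in template:
        raise ValueError("template has no %s placeholder")
    return template.replace("%s", f'"{file_path}"')


def extract_text(template: str, file_path: str, timeout: int = 60) -> str:
    """Run the helper command through the shell and return its stdout as text."""
    command = build_command(template, file_path)
    result = subprocess.run(command, shell=True, capture_output=True, timeout=timeout)
    if result.returncode != 0:
        stderr = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"command failed ({result.returncode}): {command}\n{stderr}")
    return result.stdout.decode("utf-8", errors="replace")


# Example usage (hypothetical Windows paths, adjust to your machine):
# template = r'"C:\tools\xpdf\pdftotext.exe" -enc UTF-8 -nopgbrk %s -'
# text = extract_text(template, r"C:\ojs\files\example.pdf")
# print(len(text.split()), "words extracted")
```

If this returns an empty string or raises an error for a PDF that clearly contains text, the problem is likely in the command line or the path rather than in the search index itself.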

Regards,
Alec Smecher
Public Knowledge Project Team


Hello @asmecher,

I tried that as well, but the problem was still not solved.
rebuildSearchIndex.php was still not indexing the PDF files.
Since indexing PDF files is fundamental, I spent some time debugging.
I believe there is a mistake in the code, so I created a pull request with a short explanation.

With the code I added, indexing of PDF files works.
Please let me know whether there is actually an issue with the code or whether it is just my mistake.
Thank you for your help.