Indexing PDF fulltext for search does not work

I am still working with OJS 3.0.1 in a VirtualBox VM and have been testing the search functionality. The articles have PDF fulltext files, and I activated the corresponding option in config.inc.php for pdftotext extraction. To be sure, I rebuilt the search index via php rebuildSearchIndex.php. It prints some PHP warnings for several plugins, but the indexing itself seems to run fine: for each journal it reports e.g. 95 articles indexed. I flushed the data cache in the OJS administration afterwards.

Searching for metadata (author, title, abstract,…) works fine, but searching for a single word taken from one of the PDF files returns 0 hits.

I ran the pdftotext command on the command line against one of the PDFs, using the same syntax as in config.inc.php and substituting a filename for %s. It works fine and prints the extracted text to the terminal.
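A minimal sketch of that command-line check, written in Python so it can be repeated for several files. It assumes the index command from config.inc.php is a shell command template containing %s, as described above; the function names (build_command, extract_text, contains_word) are hypothetical helpers, not part of OJS.

```python
import re
import shlex
import subprocess


def build_command(template: str, pdf_path: str) -> str:
    """Substitute a safely quoted file path for %s in the command template."""
    return template.replace("%s", shlex.quote(pdf_path))


def extract_text(template: str, pdf_path: str, timeout: int = 60) -> str:
    """Run the extraction command through the shell and return its stdout."""
    result = subprocess.run(
        build_command(template, pdf_path),
        shell=True,            # the configured command may contain pipes
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"extraction failed: {result.stderr.strip()}")
    return result.stdout


def contains_word(text: str, word: str) -> bool:
    """Case-insensitive whole-word lookup in the extracted text."""
    tokens = {token.lower() for token in re.findall(r"\w+", text)}
    return word.lower() in tokens
```

For example, `contains_word(extract_text("/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s -", "article.pdf"), "markets")` would confirm that the word can be extracted at all. The template here is only an illustration; use the exact value from your own config.inc.php, and run the check as the same user that runs rebuildSearchIndex.php.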

The setup seems fine to me so far. Any suggestions where to start looking for the error?

Hi @florianruckelshausen,

OJS uses ADODB for database queries, and ADODB has a caching mechanism that OJS uses to speed up full-text searches. Make sure you clear your data cache (this is stored in cache/_db) when working with re-indexing.
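As a rough sketch, the cache directory can also be emptied from a script before re-indexing. This assumes the OJS installation root is known and that everything under cache/_db is safe to delete; clear_db_cache is a hypothetical helper name, not an OJS tool.

```python
import shutil
from pathlib import Path


def clear_db_cache(ojs_root: str) -> int:
    """Delete everything inside <ojs_root>/cache/_db, keeping the directory.

    Returns the number of top-level entries removed (0 if the directory
    does not exist).
    """
    cache_dir = Path(ojs_root) / "cache" / "_db"
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # cached query results may sit in subfolders
        else:
            entry.unlink()
        removed += 1
    return removed
```

Calling `clear_db_cache("/var/www/html/ojs")` with the path from your own setup would then leave an empty cache/_db behind. Run it as a user with write access to the cache directory.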

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,

I cleared the data cache via the console (so that cache/_db is empty) and re-ran the indexing, but it had no effect. Searching for a single word from one of the PDF files still returns 0 results.

Here are the PHP warnings I get when I run rebuildSearchIndex.php:

PHP Warning: Declaration of OrcidProfilePlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/orcidProfile/OrcidProfilePlugin.inc.php on line 413
PHP Warning: Declaration of PdfJsViewerPlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/pdfJsViewer/PdfJsViewerPlugin.inc.php on line 141
PHP Warning: Declaration of GoogleAnalyticsPlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/googleAnalytics/GoogleAnalyticsPlugin.inc.php on line 147
PHP Warning: Declaration of BrowsePlugin::manage($verb, $args, &$message, &$messageParams, &$pluginModalContent = NULL) should be compatible with Plugin::manage($args, $request) in /var/www/html/ojs/plugins/generic/browse/BrowsePlugin.inc.php on line 151
PHP Warning: Declaration of BrowsePlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/browse/BrowsePlugin.inc.php on line 151
PHP Warning: Declaration of RecommendByAuthorPlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/recommendByAuthor/RecommendByAuthorPlugin.inc.php on line 156
PHP Warning: Declaration of WebFeedPlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/webFeed/WebFeedPlugin.inc.php on line 190
PHP Warning: Declaration of LensGalleyPlugin::getTemplatePath() should be compatible with Plugin::getTemplatePath($inCore = false) in /var/www/html/ojs/plugins/generic/lensGalley/LensGalleyPlugin.inc.php on line 157
Clearing index ... done
Indexing "Test 4 Journal" ...
PHP Warning: Declaration of SubmissionFileDAO::fromRow($row) should be compatible with PKPSubmissionFileDAO::fromRow($row, $fileImplementation) in /var/www/html/ojs/classes/article/SubmissionFileDAO.inc.php on line 23
95 articles indexed
Indexing "Test3 Journal" ... 3 articles indexed
Indexing "Kult Online" ... 1 articles indexed
Indexing "Rationality, Markets and Morals" ... 95 articles indexed

All the PDF fulltexts are attached as galley files and the articles are published in a volume that is available.

Hi @florianruckelshausen,

Hmm, nothing comes to mind. Would you mind investigating a little in the index tables? These are (for OJS 3.x) submission_search_keyword_list (which lists all indexed keywords), submission_search_objects (which lists all objects that are indexed), and submission_search_object_keywords (which maps objects to keywords). Check to see whether the keywords you expect to see are indexed, and if so, what objects they map to.
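The lookup could be sketched roughly as follows. The three table names come from the description above; the column names (keyword_id, keyword_text, object_id, submission_id, type, assoc_id, pos) are assumptions about the OJS 3.x schema and should be checked against your own database. To keep the example runnable on its own it uses an in-memory SQLite database with made-up rows; against a real MySQL or PostgreSQL installation only the SQL in QUERY is relevant, and the `?` placeholder would need to match your driver.

```python
import sqlite3

# Assumed column names; verify them against your own schema first.
QUERY = """
SELECT o.submission_id, o.type, o.assoc_id, COUNT(*) AS occurrences
FROM submission_search_keyword_list AS k
JOIN submission_search_object_keywords AS ok ON ok.keyword_id = k.keyword_id
JOIN submission_search_objects AS o ON o.object_id = ok.object_id
WHERE k.keyword_text = ?
GROUP BY o.submission_id, o.type, o.assoc_id
ORDER BY o.submission_id, o.type
"""


def find_keyword_objects(conn, word):
    """Return (submission_id, type, assoc_id, occurrences) rows for a keyword.

    Assumes keywords are stored lowercased in the index.
    """
    return conn.execute(QUERY, (word.lower(),)).fetchall()


def make_demo_db():
    """Build an in-memory stand-in for the index tables (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE submission_search_keyword_list (
            keyword_id INTEGER PRIMARY KEY, keyword_text TEXT);
        CREATE TABLE submission_search_objects (
            object_id INTEGER PRIMARY KEY, submission_id INTEGER,
            type INTEGER, assoc_id INTEGER);
        CREATE TABLE submission_search_object_keywords (
            object_id INTEGER, keyword_id INTEGER, pos INTEGER);
        INSERT INTO submission_search_keyword_list VALUES
            (1, 'rationality'), (2, 'markets');
        -- type values are illustrative: 2 stands for a metadata field,
        -- 128 for a galley file (an assumption, see the note below)
        INSERT INTO submission_search_objects VALUES
            (10, 5, 2, 0), (11, 5, 128, 42);
        INSERT INTO submission_search_object_keywords VALUES
            (10, 1, 0), (11, 1, 3), (11, 1, 17), (11, 2, 4);
    """)
    return conn
```

If a word that appears only in a PDF returns no rows at all, the text never reached the index. If it returns rows whose type corresponds to metadata fields but none for galley files, the extraction step is the more likely culprit. I believe the type values map to the SUBMISSION_SEARCH_* constants in the OJS source, with galley files at 128, but treat that as an assumption and confirm it in your installation.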

Regards,
Alec Smecher
Public Knowledge Project Team