[OJS 3.02] Search for parts of search keywords?

On our site www.tatup.de we tested the search functionality of OJS. Because German relies heavily on compound words, it seems to us that the OJS search doesn’t handle this well. For example, a user might search for »Arznei« or »Arzneimittel«, but the search returns no hits even though there is an article about »Arzneimittelentwicklung«. I looked into the submission_search_keyword_list table in MySQL, and it contains only the keyword »arzneimittelentwicklung«.

Is there an option to modify the search so that OJS also matches parts of keywords?

Thanks
Tobias

BTW – the HTML and PDF versions of the article both contain the simple form »Arzneimittel« many times. Aren’t they indexed as well?

@asmecher, sorry for dragging you into this thread, but I’m hoping that you at least could help me with this …

Thanks
Tobias

Hi @twa,

For indexing PDFs, you’ll need to configure some command-line tools (like pdftotext) in config.inc.php. HTML files should already be getting indexed using OJS’s built-in tool, but before debugging this in detail, I’d suggest rebuilding the search index (php tools/rebuildSearchIndex.php) to make sure it’s clean. Note that searches use cached results, so you’ll have to flush the data cache if you want to run searches repeatedly under different server/index conditions.

OJS’s built-in search engine is fairly simple and doesn’t consider partial keyword matches. For that kind of functionality, I think it’s simply better to use a more comprehensive search engine. In OJS 2.x there was the Lucene/SOLR plugin, and I believe this is currently being ported forward to OJS 3.x by one of our contributing groups in Germany. If this is of interest, and you’re able to run Lucene/SOLR, I can put you in touch.
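
To illustrate the difference (a minimal sketch, not OJS’s actual query code): the shell functions below compare a whole-keyword match with a substring match over a small keyword list. Only »arzneimittelentwicklung« comes from this thread; the other keywords and the function names are made up for illustration.

```shell
# Keywords as they might sit in the index, one per line.
# Only the first one comes from the thread; the rest are placeholders.
keywords='arzneimittelentwicklung
technikfolgen
bewertung'

# Whole-keyword match: the search term must equal a stored keyword.
exact_match() { printf '%s\n' "$keywords" | grep -x -F -- "$1"; }

# Substring match: the search term may be any part of a stored keyword.
partial_match() { printf '%s\n' "$keywords" | grep -F -- "$1"; }

exact_match arzneimittel || echo "exact: no hit"
partial_match arzneimittel
```

The exact match finds nothing for »arzneimittel«, while the substring match returns the compound »arzneimittelentwicklung«, which is the behaviour asked for above.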

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks for your kind help! Yes, please put me in touch with the German group for Lucene/SOLR.

Thanks
Tobias

Is there a way to use OJS tools such as rebuildSearchIndex.php with an ISP-hosted OJS installation?

Thanks
Tobias

Hi @twa,

I think you’re asking about installations that don’t have access to command-line tools, correct? Currently no, you can’t rebuild the search index unless you have command-line access. It would be possible to temporarily disable the check that prevents command-line tools from being kicked off via the web, but part of the problem with web-based requests in this case is that there are often server time-outs that could interrupt the process when the index is half-built.

Regards,
Alec Smecher
Public Knowledge Project Team

I now have access via SSH. Running php tools/rebuildSearchIndex.php gives me the following error:
X-Powered-By: PHP/4.4.9
<b>Parse error</b>: syntax error, unexpected T_FUNCTION, expecting ')' in <b>OJSPATH/lib/pkp/includes/functions.inc.php</b> on line <b>319</b><br />

And may I ask for a contact in the German group working on Lucene/SOLR?

Thanks
Tobias

Hi @twa,

Woah, that appears to be PHP 4.4.9, which is 10 years old! You’ll need to have a command-line PHP that’s at least version 5.6. You probably already have this somewhere on your server, if it’s successfully running OJS 3.x.
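
One way to hunt for a suitable binary, sketched under assumptions: the candidate names below are guesses (hosts often install several PHP builds side by side under different names), and php_version_ok is a hypothetical helper, not part of OJS.

```shell
# Succeed if a dotted version string such as "5.6.30" is at least 5.6.
php_version_ok() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 6 ]; }
}

# Candidate names are assumptions; adjust them to your host.
for candidate in php php5 php7 /usr/bin/php /usr/local/bin/php; do
  command -v "$candidate" >/dev/null 2>&1 || continue
  # The first line of `php -v` looks like "PHP 5.6.30 (cli) ...".
  version=$("$candidate" -v 2>/dev/null | sed -n '1s/^PHP \([0-9][0-9.]*\).*/\1/p')
  [ -n "$version" ] || continue
  if php_version_ok "$version"; then
    echo "$candidate: $version (new enough)"
  else
    echo "$candidate: $version (too old)"
  fi
done
```

Whichever candidate is reported as new enough can then be used in place of plain php when running tools/rebuildSearchIndex.php.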

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @twa,

And I forgot to mention: the best way to get involved in the conversation about updating the Solr/Lucene plugin for OJS 3.x is to keep tabs on “[OJS] Upgrade Lucene Plugin” (issue #2575 in pkp/pkp-lib on GitHub).

Regards,
Alec Smecher
Public Knowledge Project Team

Has there been any movement on this? My users are having quite a bit of difficulty finding anything at the moment.

When I rebuild my search index I get these errors repeatedly:
Error: Mismatch between font type and embedded font file
Error: Mismatch between font type and embedded font file

Is there some way that I can fix this? Any help is much appreciated.

Hi @jamilj,

There are quite a few things discussed on this thread – can you be more specific?

Also, which of our apps are you using, and what version?

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher, I’m running 3.1.1.2. The errors have been appearing for some time now (since 2.x), every time I import a PDF. They also appear when I run this command:

php public_html/archives/tools/rebuildSearchIndex.php

These two errors come up, the first hundreds of times and the second thousands:

Fontconfig error: Cannot load default config file
Error: Mismatch between font type and embedded font file

I just let the rebuild run for 24 hours straight and it never finished (broken pipe). I attempted this because the search function was not working very well at all. But apparently searching the content of PDFs will not work at all until the Lucene plugin is updated.

I tried creating a mount for /etc/fonts in my user’s home directory, but that did not solve the problem. I also disabled charset_normalization because my DB is in latin1. That had no impact either.

Hi @jamilj,

The index rebuild can legitimately take a long time to complete – you might need to use a tool like nohup to prevent the re-index from dying if your terminal disconnects.
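
A sketch of the nohup approach, assuming a POSIX shell; run_detached and the log file name rebuild.log are hypothetical, chosen only for illustration.

```shell
# Run a command so that it keeps going if the terminal disconnects,
# with all of its output collected in a log file.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
  log=$1
  shift
  nohup "$@" > "$log" 2>&1 &
  echo "started PID $! (output in $log)"
}

# Example, run from the OJS installation directory:
#   run_detached rebuild.log php tools/rebuildSearchIndex.php
#   tail -f rebuild.log
```

Sending the output to a file also keeps the repeated PDF warnings in one place, so you can check afterwards whether the rebuild actually reached the end.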

The error messages you quote come from your PDF extraction tools, configured in config.inc.php under e.g. index[application/pdf]. It’s some interaction between that tool and your PDF that’s the problem; you might try running the tools on your PDFs manually to see if you still get decent full-text output. If you don’t, then that PDF will still be searchable via its metadata.
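
A sketch for checking extraction by hand, assuming the configured tool is pdftotext (from Poppler or Xpdf) and that it is on your PATH; check_pdf_text is a hypothetical helper. Substitute whatever command your index[application/pdf] line actually names.

```shell
# Print "<word count><TAB><file>" for each PDF given as an argument.
# A count of 0 (or close to it) means that file adds little full text to the index.
check_pdf_text() {
  for pdf in "$@"; do
    # Drop 2>/dev/null to see the extraction tool's own warnings.
    words=$(pdftotext -enc UTF-8 -nopgbrk "$pdf" - 2>/dev/null | wc -w | tr -d ' ')
    printf '%s\t%s\n' "$words" "$pdf"
  done
}

# Example (the path is a placeholder):
#   check_pdf_text path/to/your/pdfs/*.pdf | sort -n | head
```

Sorting by word count puts the files with the least extracted text first, which are the ones worth inspecting.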

OJS will index the text from PDFs without the use of Lucene, but if the tools you’ve configured for text extraction aren’t getting the full-text out, and if your indexing process is being killed before it finishes, your index won’t be complete.

Regards,
Alec Smecher
Public Knowledge Project Team