Fulltext search fails when short or stop words are included - OHS

We are using OHS 2.3.1 and are experiencing unexpected behaviour with a fulltext index. We have created a fulltext index on one of our tables, but when this index is searched via the PHP application, it returns zero results if a stop word or short word is included in the search string. We expected short words and stop words to be ignored when included in longer strings, rather than causing the search to fail completely.

The index was built using the following command:

alter table search_entries add fulltext search_fulltext(author,title,abstract,country,archive,institution,year,language,subject);

Some examples:
Full title: 3D hand pose regression with variants of decision forests
This search finds no results: 3D hand pose regression
This search works: hand pose regression

Full title: Adsorptive cellulose membranes for fluid separation
This search finds no results: Adsorptive cellulose membranes for fluid
This search works: Adsorptive cellulose membranes fluid

Searches that include stop words or words of fewer than 4 characters do work as a “phrase search”, but is the behaviour in the examples above expected, or is something wrong with our settings? We were expecting stop words and keywords of fewer than 4 characters to be ignored, rather than to break the search.



I just found one way to reproduce your problem:

  1. Setting the “min_word_length” option (available in the config.inc.php file) to a value higher than 3 characters
  2. Indexing an archive
  3. Updating the “min_word_length” to a smaller value, such as 2

If that’s your case, re-indexing the archives will fix the problem.

Brief explanation:
Both the indexing and the searching processes use this setting to filter words, and the problem lies in the indexing step (by not indexing such words we save some space), so words that were skipped at indexing time cannot be matched later. A quick fix that ignores this setting when indexing is possible, but I’m not sure it’s worth it, as we’re currently updating the Harvester.
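
To illustrate the mismatch described above, here is a minimal toy model in Python. It is not OHS code: the function names, the whitespace tokenizer, and the assumption that a search must match every query word are all hypothetical, chosen only to show what happens when the index was built with one minimum length and the search runs with a smaller one.

```python
def tokenize(text, min_word_length):
    """Lowercase, split on whitespace, drop words shorter than the limit."""
    return [w for w in text.lower().split() if len(w) >= min_word_length]


def build_index(titles, min_word_length):
    """Map each kept word to the set of document ids that contain it."""
    index = {}
    for doc_id, title in enumerate(titles):
        for word in tokenize(title, min_word_length):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query, min_word_length):
    """Return ids of documents that contain every kept query word (assumed AND matching)."""
    words = tokenize(query, min_word_length)
    if not words:
        return set()
    results = None
    for word in words:
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return results


titles = ["3D hand pose regression with variants of decision forests"]

# Steps 1 and 2: index while the minimum length is 4, so "3d" is never stored.
index = build_index(titles, min_word_length=4)

# Step 3: the setting is lowered to 2, so the search now looks for "3d" and fails.
stale = search(index, "3D hand pose regression", min_word_length=2)

# With the same value on both sides, "3d" is dropped from the query and the title is found.
consistent = search(index, "3D hand pose regression", min_word_length=4)
```

In this model, re-indexing with the new value puts both sides back in agreement, which is why it fixes this particular scenario.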

Next steps:
We’re currently updating the Harvester. The stable branch (GitHub - pkp/harvester at ohs-stable-2_3) just got a 10-fold improvement in indexing, and we already have some commits that use a full text index when searching (among other improvements).
What’s missing: resurrecting the master branch by updating the internal sub-modules/dependencies and, if possible, completely replacing the internal search engine with a full text index or another search engine, such as Lucene/Elasticsearch.

Jonas Raoni

Thanks for your response. We’re not sure this fixes the problem though. Our min_word_length is set to 3 and seems to be accurately indexed. Searching for strings containing 3-letter words behaves as expected.

However, searching for any string containing a 2-letter word or 1-letter word returns zero results, so it looks like the search is failing when words with length less than min_word_length are included, rather than the search ignoring them. The same applies to search strings including stop words (which are not indexed). How can we fix this?

Example 1
This search: comparative analysis of multicultural perspectives on leadership competencies
returns zero results
Whereas: comparative analysis multicultural perspectives leadership competencies
returns the correct hit

Example 2
This search: defining elite space where
returns zero hits (because where is a stop word)
Whereas: defining elite space
returns hits as expected
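
The behaviour expected in the examples above, where short words and stop words are dropped from the search string before the query runs, could be sketched as follows. This is an illustration only and not part of OHS: the `filter_query` helper, the stop word list, and the default minimum length of 3 are assumptions, and a real fix would need to use the same stop word list and “min_word_length” value as the indexer.

```python
# Hypothetical stop word list; the real list would have to match the indexer's.
STOP_WORDS = {"of", "on", "for", "where", "with"}

# Assumed to mirror the "min_word_length" value mentioned in the thread.
MIN_WORD_LENGTH = 3


def filter_query(query, min_word_length=MIN_WORD_LENGTH, stop_words=STOP_WORDS):
    """Drop words that the index would not contain, keeping the rest in order."""
    kept = []
    for word in query.split():
        if len(word) < min_word_length:
            continue  # too short to have been indexed
        if word.lower() in stop_words:
            continue  # stop words are not indexed
        kept.append(word)
    return " ".join(kept)


example_1 = filter_query(
    "comparative analysis of multicultural perspectives on leadership competencies"
)
example_2 = filter_query("defining elite space where")
```

Applied to the two examples, the filtered strings are the same as the searches that already return hits, so the remaining words can be passed to the existing search unchanged.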