Unicode normalization in search results

OJS version 3.0.2
We have encountered an issue with searching Unicode characters and would like to know if Unicode normalization can be enabled?

When a user searches for ‘Māori’, on our journal (Policy Quarterly), they are returned several results. However a search on ‘Maori’ returns nothing. Ideally both searches would return the same results.

Is there a solution to this problem?

Hi @Max_Sullivan

Our search index plans are broadly to support two approaches: a built-in search index, as is currently built into OJS; and options to integrate with Lucene/Solr, which was offered in OJS 2.x and is in the process of being ported to OJS 3.x. There are plenty of features that Lucene/Solr support that we’ll never add to OJS, like stemming, faceting, etc., because they’re complex and we don’t want to re-invent the wheel.

However, I agree that normalization (e.g. to ASCII) might be a good thing to add (and I can’t recall specifically, but it’s possible that we supported it in a previous version). If you’re willing to do a bit of experimentation it would help move this forward.

The code that’s responsible for cleaning up search keywords is in lib/pkp/classes/search/SubmissionSearchIndex.inc.php in the filterKeywords function. I’d suggest adding a call to iconv there to transcode any incoming content into plain ASCII.

After making a change like this you’ll need to rebuild your search index using tools/rebuildSearchIndex.php. And you’ll probably need to also flush your data cache, which is available as Site Administrator in the Administration area.

If you’re able to make any headway here, I can help turn it into a pull request etc. for inclusion in a forthcoming release!

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks @asmecher,

It’s good to hear that Lucene/Solr is being ported to OJS 3.x.

We are keen to investigate making the changes to lib/pkp/classes/search/SubmissionSearchIndex.inc.php that you suggested and will report back when we get further.

Kind regards,

Max

This is a valuable function for the Slavic languages. It’s strange that the utility rebuildSearchIndex.php does not rebuild the index.

$ php tools/rebuildSearchIndex.php
<h1>Database connection failed!</h1>ojs2: Database connection failed!
$ php tools/rebuildSearchIndex.php vestnik
<h1>Database connection failed!</h1>ojs2: Database connection failed!

Here is vestnik is internal stort name of my journal. I note that journal site at the same time fully works, there is access to the MySQL database and my database name is not ojs2 but ojs31. There are no errors in the MySQL log. Test configuration is local in MAMP on MacOS. What could be the problem?

Hi @crosfield,

Weird – the above quoted messages are on the same system as your production OJS, or are they on your local (testing) system?

Regards,
Alec Smecher
Public Knowledge Project Team

@asmecher All messages from the test and the system. Before production I fill the database on the local (MacOS) system now with articles archive.

Update. Now I installed OJS from the latest OJS GitHub branch and also tried to rebuild the index. The same result.

crosfield at Crosfields-MBP in ~/sites/ojs on master*
$ php tools/rebuildSearchIndex.php
<h1>Database connection failed!</h1>ojs2: Database connection failed!

So the problem is not in OJS but in my MacOS/MAMP system? But what is wrong with it if the site itself is fully functional? How to debug this cli utility?

Hi @crosfield,

I’m still not sure I’ve got the details of your environment straight, but here are a few things to check:

  • Make sure your database credentials in config.inc.php are correct
  • Check to see that your command-line php.ini is suitable (i.e. has the necessary database driver modules loaded) – note that the web-based php.ini may not be the same as the command-line php.ini.
  • Check our database server’s log to see if it’s recording “access denied” events

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks @asmecher! The reason for this behavior was the second point. When I started the script with PHP from MAMP App Folder it all worked.

UPDATE. Can it be that after updating the index (without Unicode normalization) when searching through the main menu by some authors surname his articles are found but for the surnames of other authors the list is empty? In this case, if you search specifically through the “Search by author” field, there are articles of both the first and second authors group. I did not reveal the patterns and I cleared the data cache. At the same time, I remember that before the search index was updated, the search for any surnames worked fine.

Hi @crosfield,

Are you sure the re-indexing completed successfully? Make sure that your php.ini (which may be different for your command-line PHP vs. your web-based PHP) doesn’t specify any time limits. You can also check for this in practice using php -i.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi, @asmecher,

I set max_execution_time = 0 through MAMP in the “right” php.ini. Some names that were not searched before were found, but not all. Is it possible to somehow look at the search index in the text file? (or insert somewhere echo command to redirect the output stream of rebuildSearchIndex.php to a file)
And I also checked all the articles are indexed at runtime of rebuildSearchIndex.php? Yes all

UPDATE. I inserted echo $cleanText; in lib/pkp/classes/search/SubmissionSearchIndex.inc.php before // Split into words section and redirected output stream to a text file. It has all the words including those that are not searchable. Is this the correct way to find the cause of the problem? Where to find the cause further?

Hi @crosfield,

Are you sure you’ve cleared the data cache between search tests? There’s SQL result caching on search operations, and it may be obstructing the new data.

Regards,
Alec Smecher
Public Knowledge Project Team

Yes of course every time if you mean clearing the cache through Administration - Clear Data Caches

Hi @crosfield,

Hmm. The next thing to debug, I think, is the query / set of queries that perform the search. Try (temporarily) turning on the debug option in config.inc.php – this will cause all SQL queries to be dumped to the browser (for all users). With that option turned on, perform the search and inspect the results page to see where it’s querying for the author name you haven’t been able to see in the results.

Regards,
Alec Smecher
Public Knowledge Project Team

Hmm. It’s all very strange. If we called unsearchable name as “wrong”, then it looks like this when searching for the “wrong” name that there are no queries to the database with this name. When requested by the “correct” name, a query is made to the database using the this name, and when repeated search the call to the ADOConnection.CacheExecute function is visible. For any search to the “wrong” name this function does not exist.

Debug session log with "wrong" name
(mysql): SELECT * FROM sessions WHERE session_id = '3b024fbfdc40c276707ca2fbc00d8c22'  
(mysql): SELECT * FROM users WHERE user_id = 1  
(mysql): SELECT * FROM user_settings WHERE user_id = '1'  
(mysql): SELECT * FROM journals WHERE path = 'vestnik'  
(mysql): SELECT * FROM journal_settings WHERE journal_id = '1'  
(mysql): SELECT * FROM site  
(mysql): SET time_zone = '+3:00'  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = '1' OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.generic.usageStats.UsageStatsLoader'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.crossref.CrossrefInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.datacite.DataciteInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.doaj.DOAJInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.medra.MedraInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'lib.pkp.classes.task.ReviewReminder'  
(mysql): SELECT count(*) FROM notifications WHERE user_id = 1 AND date_read IS NULL AND level = 3  
(mysql): SELECT setting_name, setting_value, setting_type FROM plugin_settings WHERE plugin_name = 'defaultthemeplugin' AND context_id = 1  
(mysql): SELECT setting_name, setting_value, setting_type FROM plugin_settings WHERE plugin_name = 'defaultmanuscriptchildthemeplugin' AND context_id = 1  
(mysql): SELECT * FROM journals WHERE path = 'vestnik'  
(mysql): SELECT * FROM journal_settings WHERE journal_id = '1'  
(mysql): SELECT	ug.* FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE uug.user_id = 1  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '1'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '2'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '3'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '7'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '11'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '14'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '16'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '17'  
(mysql): SELECT * FROM navigation_menus WHERE context_id = 1  
(mysql): SELECT nmi.* FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 2 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '8'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '11'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '9'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '12'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '10'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '13'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '14'  
(mysql): SELECT nmi.*,nmh.navigation_menu_id,nmh.parent_id,nmh.seq, nmh.navigation_menu_item_assignment_id FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 2 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '8'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '11'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '9'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '12'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '10'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '13'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '14'  
(mysql): SELECT	DISTINCT ug.role_id FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE	uug.user_id = 1 AND ug.context_id = 1  
(mysql): SELECT	DISTINCT ug.role_id FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE	uug.user_id = 1  
(mysql): SELECT c.* FROM journals c WHERE	c.enabled = 1 OR c.journal_id IN (SELECT DISTINCT ug.context_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE uug.user_id = 1) OR 1 IN (SELECT user_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE ug.role_id = 1) ORDER BY seq  
(mysql): SELECT * FROM journal_settings WHERE journal_id = '1'  
(mysql): SELECT nmi.* FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 3 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '15'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '16'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '25'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '23'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '26'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '20'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '18'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '24'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '21'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '22'  
(mysql): SELECT nmi.*,nmh.navigation_menu_id,nmh.parent_id,nmh.seq, nmh.navigation_menu_item_assignment_id FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 3 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '93'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '94'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '95'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '96'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '97'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '98'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '99'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '100'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '101'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '102'  
(mysql): SELECT * FROM notifications WHERE user_id = 1 AND level = 1 ORDER BY date_created DESC  
(mysql): SELECT c.* FROM journals c WHERE	c.enabled = 1 OR c.journal_id IN (SELECT DISTINCT ug.context_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE uug.user_id = 1) OR 1 IN (SELECT user_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE ug.role_id = 1) ORDER BY seq  
(mysql): SELECT * FROM static_pages WHERE context_id = 1 AND path = 'search/search'  
(mysql): SELECT	ug.* FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE uug.user_id = 1  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '1'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '2'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '3'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '7'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '11'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '14'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '16'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '17'  
(mysql): SELECT setting_name, setting_value, setting_type, locale FROM site_settings  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = 0 OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT setting_name, setting_value, setting_type, locale FROM site_settings  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = 0 OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT setting_name, setting_value, setting_type, locale FROM site_settings  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = 0 OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT	MAX(ps.date_published), MIN(ps.date_published) FROM	published_submissions ps, submissions s WHERE	ps.submission_id = s.submission_id AND s.context_id = 1

@asmecher Are there any options for further debugging problems?

Hi @crosfield,

Are you using the COinS plugin? Try disabling it and see if that changes things – if it does, you may need to apply the patch referred to in Problem with search after upgrading to OJS 3.1.0.1 - #25 by asmecher.

Regards,
Alec Smecher
Public Knowledge Project Team

I’ve never used COinS plugin.

Hi @crosfield,

Is your journal online? If so, can you send me the URL and a couple of example names that demonstrate the problem?

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks, @asmecher for answer. Not yet online. Now I’m filling out our journal with the archive for 15 years on my test machine. As soon as I rolled it in online, I’ll write a link here. While everything looks so there is a list of “stop words” in OJS which for some reason contains at least one the author’s name because when searching for this name there are no any queries to the database.
P.S. This could be due to the implementation of multilanguage author’s names by @litvinovg which I integrated into OJS 3.1.0-1 from its github?

Hi @crosfield,

There is a list of stopwords maintained in lib/pkp/registry/stopwords.txt – not sure if something there could explain the issue.

The multilingual author name modification could well explain the issue; I haven’t read that code. One of PKP’s developers has started assessing it but in the meantime you might ask @litvinovg directly whether he’s adapted the searching/indexing code to suit the changes.

Regards,
Alec Smecher
Public Knowledge Project Team