Unicode normalization in search results

Hi @crosfield,

Are you sure you’ve cleared the data cache between search tests? There’s SQL result caching on search operations, and it may be obstructing the new data.

Regards,
Alec Smecher
Public Knowledge Project Team

Yes of course every time if you mean clearing the cache through Administration - Clear Data Caches

Hi @crosfield,

Hmm. The next thing to debug, I think, is the query / set of queries that perform the search. Try (temporarily) turning on the debug option in config.inc.php – this will cause all SQL queries to be dumped to the browser (for all users). With that option turned on, perform the search and inspect the results page to see where it’s querying for the author name you haven’t been able to see in the results.

Regards,
Alec Smecher
Public Knowledge Project Team

Hmm. It’s all very strange. If we called unsearchable name as “wrong”, then it looks like this when searching for the “wrong” name that there are no queries to the database with this name. When requested by the “correct” name, a query is made to the database using the this name, and when repeated search the call to the ADOConnection.CacheExecute function is visible. For any search to the “wrong” name this function does not exist.

Debug session log with "wrong" name
(mysql): SELECT * FROM sessions WHERE session_id = '3b024fbfdc40c276707ca2fbc00d8c22'  
(mysql): SELECT * FROM users WHERE user_id = 1  
(mysql): SELECT * FROM user_settings WHERE user_id = '1'  
(mysql): SELECT * FROM journals WHERE path = 'vestnik'  
(mysql): SELECT * FROM journal_settings WHERE journal_id = '1'  
(mysql): SELECT * FROM site  
(mysql): SET time_zone = '+3:00'  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = '1' OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.generic.usageStats.UsageStatsLoader'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.crossref.CrossrefInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.datacite.DataciteInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.doaj.DOAJInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'plugins.importexport.medra.MedraInfoSender'  
(mysql): SELECT last_run FROM scheduled_tasks WHERE class_name = 'lib.pkp.classes.task.ReviewReminder'  
(mysql): SELECT count(*) FROM notifications WHERE user_id = 1 AND date_read IS NULL AND level = 3  
(mysql): SELECT setting_name, setting_value, setting_type FROM plugin_settings WHERE plugin_name = 'defaultthemeplugin' AND context_id = 1  
(mysql): SELECT setting_name, setting_value, setting_type FROM plugin_settings WHERE plugin_name = 'defaultmanuscriptchildthemeplugin' AND context_id = 1  
(mysql): SELECT * FROM journals WHERE path = 'vestnik'  
(mysql): SELECT * FROM journal_settings WHERE journal_id = '1'  
(mysql): SELECT	ug.* FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE uug.user_id = 1  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '1'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '2'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '3'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '7'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '11'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '14'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '16'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '17'  
(mysql): SELECT * FROM navigation_menus WHERE context_id = 1  
(mysql): SELECT nmi.* FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 2 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '8'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '11'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '9'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '12'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '10'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '13'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '14'  
(mysql): SELECT nmi.*,nmh.navigation_menu_id,nmh.parent_id,nmh.seq, nmh.navigation_menu_item_assignment_id FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 2 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '8'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '11'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '9'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '12'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '10'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '13'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '14'  
(mysql): SELECT	DISTINCT ug.role_id FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE	uug.user_id = 1 AND ug.context_id = 1  
(mysql): SELECT	DISTINCT ug.role_id FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE	uug.user_id = 1  
(mysql): SELECT c.* FROM journals c WHERE	c.enabled = 1 OR c.journal_id IN (SELECT DISTINCT ug.context_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE uug.user_id = 1) OR 1 IN (SELECT user_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE ug.role_id = 1) ORDER BY seq  
(mysql): SELECT * FROM journal_settings WHERE journal_id = '1'  
(mysql): SELECT nmi.* FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 3 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '15'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '16'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '25'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '23'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '26'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '20'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '18'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '24'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '21'  
(mysql): SELECT * FROM navigation_menu_item_settings WHERE navigation_menu_item_id = '22'  
(mysql): SELECT nmi.*,nmh.navigation_menu_id,nmh.parent_id,nmh.seq, nmh.navigation_menu_item_assignment_id FROM navigation_menu_item_assignments as nmh LEFT JOIN navigation_menu_items as nmi ON (nmh.navigation_menu_item_id = nmi.navigation_menu_item_id) WHERE nmh.navigation_menu_id = 3 ORDER BY nmh.seq  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '93'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '94'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '95'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '96'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '97'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '98'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '99'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '100'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '101'  
(mysql): SELECT * FROM navigation_menu_item_assignment_settings WHERE navigation_menu_item_assignment_id = '102'  
(mysql): SELECT * FROM notifications WHERE user_id = 1 AND level = 1 ORDER BY date_created DESC  
(mysql): SELECT c.* FROM journals c WHERE	c.enabled = 1 OR c.journal_id IN (SELECT DISTINCT ug.context_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE uug.user_id = 1) OR 1 IN (SELECT user_id FROM user_groups ug JOIN user_user_groups uug ON (ug.user_group_id = uug.user_group_id) WHERE ug.role_id = 1) ORDER BY seq  
(mysql): SELECT * FROM static_pages WHERE context_id = 1 AND path = 'search/search'  
(mysql): SELECT	ug.* FROM	user_groups ug JOIN user_user_groups uug ON ug.user_group_id = uug.user_group_id WHERE uug.user_id = 1  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '1'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '2'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '3'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '7'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '11'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '14'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '16'  
(mysql): SELECT * FROM user_group_settings WHERE user_group_id = '17'  
(mysql): SELECT setting_name, setting_value, setting_type, locale FROM site_settings  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = 0 OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT setting_name, setting_value, setting_type, locale FROM site_settings  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = 0 OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT setting_name, setting_value, setting_type, locale FROM site_settings  
(mysql): SELECT v.* FROM versions v LEFT JOIN plugin_settings ps ON lower(v.product_class_name) = ps.plugin_name AND ps.setting_name = 'enabled' AND (context_id = 0 OR v.sitewide = 1) WHERE v.current = 1 AND (ps.setting_value = '1' OR v.lazy_load <> 1)  
(mysql): SELECT	MAX(ps.date_published), MIN(ps.date_published) FROM	published_submissions ps, submissions s WHERE	ps.submission_id = s.submission_id AND s.context_id = 1

@asmecher Are there any options for further debugging problems?

Hi @crosfield,

Are you using the COinS plugin? Try disabling it and see if that changes things – if it does, you may need to apply the patch referred to in Problem with search after upgrading to OJS 3.1.0.1 - #25 by asmecher.

Regards,
Alec Smecher
Public Knowledge Project Team

I’ve never used COinS plugin.

Hi @crosfield,

Is your journal online? If so, can you send me the URL and a couple of example names that demonstrate the problem?

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks, @asmecher for answer. Not yet online. Now I’m filling out our journal with the archive for 15 years on my test machine. As soon as I rolled it in online, I’ll write a link here. While everything looks so there is a list of “stop words” in OJS which for some reason contains at least one the author’s name because when searching for this name there are no any queries to the database.
P.S. This could be due to the implementation of multilanguage author’s names by @litvinovg which I integrated into OJS 3.1.0-1 from its github?

Hi @crosfield,

There is a list of stopwords maintained in lib/pkp/registry/stopwords.txt – not sure if something there could explain the issue.

The multilingual author name modification could well explain the issue; I haven’t read that code. One of PKP’s developers has started assessing it but in the meantime you might ask @litvinovg directly whether he’s adapted the searching/indexing code to suit the changes.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher! It seems I found a “stop word”. “Stop-word” is a “stop-char” - in the Cyrillic alphabet the char “х” (Cyrillic letter ha) with code U+0445 (maybe in uppercase too). My OJS does not search for any words containing this letter. And if my theory is true the other “stop-char” will be “ъ” (Cyrillic letter hard sign) with code U+044A. Both these symbols are located on the Cyrillic keyboard on the keys of the left and right square brackets.

Hi @crosfield,

Ah, that’s helpful. I wonder whether your search terms (or parts of them) aren’t being cleared out as punctuation – see lib/pkp/classes/search/SubmissionSearchIndex.inc.php in the filterKeywords function, in particular these lines. Try selectively commenting/removing these lines until your missing content reappears, and let me know whether/which lines were at issue.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher! Unfortunately the problem is not in the filterKeywords function but somewhere deeper. If you mean the line 42 and especially the piece of regexp \[\] then the problem is not in it (as well as commenting this and the following lines). If its comment it will not change anything. Of course, every time I regenerate the index and I clear the cache not only in OJS but also in the browser.
How can I see what happens when I enter “wrong” word in the PHP code? I have an opportunity to temporarily use PHPStorm, but I could not configure XDebug for debugging. Perhaps there is some simpler way to debug by inserting echo commands into some piece of the code?

Hi @crosfield,

I tend to use plain old error_log calls for debugging – they don’t interfere with AJAX responses. But yes, please give that a try, and let me know if you’re able to narrow it down.

I’d suggest turning off charset_normalization in config.inc.php as you work with this. It keeps things simpler, and that option was more useful before infrastructure support for UTF8 was as good, going back 5-10 years. It’s an option I’d actually like to remove at some point.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi all,

As I understand, all entries from submission_search_keyword_list should give results in the search form, except those filtered by filterKeywords method, am I right?
Or, maybe, there are other filters. For example, those keywords that come from unpublished articles are ignored?

I’m asking, because in my case queries with letter x, e.g. широких give result. But, for example, keyword_text entry -forcible gives nothing.

Thanks @asmecher. While I realized that already in the filterKeywords function is passed the “wrong” string (for example мухин) in which the “stop-char” х is removed (муин is obtained). But if you delete line 42 then the “stop-char” replaced by a question char ? ( му?ин is obtained). And $text variable that is passed to the function already contains the wrong word му?ин. Therefore, at an earlier stage (in the _indexObjectKeywords function or in the _parseQueryInternal function), an incorrect value is passed to filterKeywords function.
And charset_normalization option does not seem to affect anything.

UPDATE: If you insert echo $cleanText; in the filterKeywords function and redirect the output of the tools/rebuildSearchIndex.php utility to the text file there will contain the full word мухин. However, when searching for the word мухин through the site, it will display the wrong word муин at the top of the site. This probably means that the filterKeywords function call comes from different places with different results.
The output from error_log($cleanText);:

[17-Mar-2018 13:22:13 Europe/Moscow] му
[17-Mar-2018 13:22:13 Europe/Moscow] ин

UPDATE2: The query string changes in the _parseQuery function in the classes/search/SubmissionSearch.inc.php module

Hi @Vitaliy! All articles on which a search is made are published. You have implemented multilingual fix by @litvinovg?

Nope. Just trying to understand this part of OJS code for no reason.

I have tried to repeat your problem, but the $cleanText variable after passing PKPString static methods always gives me мухин. Can you show me the link to those changes you have made? I suppose they are available as a pull request on GitHub.

Unfortunately, I do not know the github commands well, but I’ll try to fill somehow all my “project” to my github not as fork and I’ll write a link here.
UPDATE: https://github.com/crosfield/ojs-vestnik/blob/fdcec5562baeda8486efb2ffd567ae16f9a14005/lib/pkp/classes/search/SubmissionSearchIndex.inc.php#L50