Unicode normalization in search results

Hi @asmecher! It seems I have found a “stop word”, or rather a “stop char”: the Cyrillic letter “х” (Cyrillic letter ha, U+0445), and maybe its uppercase form too. My OJS does not find any word containing this letter. If my theory is right, the other “stop char” will be “ъ” (Cyrillic letter hard sign, U+044A). Both letters sit on the left and right square-bracket keys of the Cyrillic keyboard.

Hi @crosfield,

Ah, that’s helpful. I wonder whether your search terms (or parts of them) aren’t being cleared out as punctuation – see lib/pkp/classes/search/SubmissionSearchIndex.inc.php in the filterKeywords function, in particular these lines. Try selectively commenting/removing these lines until your missing content reappears, and let me know whether/which lines were at issue.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher! Unfortunately the problem is not in the filterKeywords function but somewhere deeper. If you mean line 42, and especially the \[\] part of the regexp, that is not the cause: commenting out this line and the following ones changes nothing. Of course, every time I rebuild the index and clear the cache, not only in OJS but also in the browser.
How can I see what happens in the PHP code when I enter a “wrong” word? I can temporarily use PhpStorm, but I could not configure Xdebug for debugging. Is there perhaps a simpler way to debug, by inserting echo statements into some piece of the code?

Hi @crosfield,

I tend to use plain old error_log calls for debugging – they don’t interfere with AJAX responses. But yes, please give that a try, and let me know if you’re able to narrow it down.
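
A minimal sketch of that kind of error_log debugging, for anyone following along. The helper name debugLogString is hypothetical (it is not part of OJS); it logs both the text and its raw bytes in hex, which helps when a multibyte character is being cut somewhere.

```php
<?php
// Hypothetical helper, not part of OJS: logs a label, the string itself,
// and its bytes in hex so that broken multibyte sequences become visible.
// Assumes this file is saved as UTF-8.
function debugLogString($label, $value) {
    $line = sprintf('%s: "%s" [%s]', $label, $value, bin2hex($value));
    error_log($line); // written to the PHP error log, not to the HTTP response
    return $line;     // returned only so the result can be inspected
}

// Example call, e.g. placed temporarily inside the function under suspicion:
debugLogString('cleanText', 'мухин');
```

For the intact word мухин the hex part reads d0bcd183d185d0b8d0bd; a missing or dangling byte in that sequence shows where the string was damaged.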

I’d suggest turning off charset_normalization in config.inc.php as you work with this. It keeps things simpler, and that option was more useful before infrastructure support for UTF8 was as good, going back 5-10 years. It’s an option I’d actually like to remove at some point.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi all,

As I understand it, every entry in submission_search_keyword_list should give results in the search form, except those filtered out by the filterKeywords method. Am I right?
Or are there perhaps other filters? For example, are keywords that come from unpublished articles ignored?

I’m asking because in my case queries containing the letter х, e.g. широких, do give results, whereas the keyword_text entry -forcible, for example, returns nothing.

Thanks @asmecher. I have now found that the filterKeywords function already receives a “wrong” string: for example, for мухин the “stop char” х has been removed (giving муин). If you delete line 42, the “stop char” is instead replaced by a question mark (giving му?ин), and the $text variable passed to the function already contains the wrong word му?ин. So the incorrect value is produced at an earlier stage (in the _indexObjectKeywords function or in the _parseQueryInternal function) and then passed to filterKeywords.
The charset_normalization option does not seem to affect anything either.

UPDATE: If you insert echo $cleanText; into the filterKeywords function and redirect the output of the tools/rebuildSearchIndex.php utility to a text file, the file contains the full word мухин. However, when you search for мухин through the site, the wrong word муин is displayed at the top of the page. This probably means that filterKeywords is called from different places with different input.
The output from error_log($cleanText);:

[17-Mar-2018 13:22:13 Europe/Moscow] му
[17-Mar-2018 13:22:13 Europe/Moscow] ин
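
One way to read these two log lines, offered as a hypothesis rather than a confirmed diagnosis: in UTF-8, х (U+0445) is encoded as the bytes D1 85, and if something treats the byte 0x85 as a separator, мухин falls apart into exactly these two pieces. The sketch below only simulates such a split with explode(); it does not call any OJS code, and it assumes the mbstring extension is available.

```php
<?php
// Assumes this file is saved as UTF-8.
$word = 'мухин';

// UTF-8 bytes: м = d0bc, у = d183, х = d185, и = d0b8, н = d0bd
echo bin2hex($word), "\n";

// Simulate a byte-oriented splitter that treats 0x85 as a separator.
// This is an assumption made for illustration, not OJS behaviour.
$pieces = explode("\x85", $word);

foreach ($pieces as $piece) {
    // mb_check_encoding() reports whether the piece is still valid UTF-8
    $valid = mb_check_encoding($piece, 'UTF-8') ? 'valid' : 'invalid';
    echo bin2hex($piece), ' (', $valid, " UTF-8)\n";
}
```

The first piece ends in a dangling lead byte d1, which might also explain the stray ? seen in му?ин when the broken string is displayed.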

UPDATE2: The query string changes in the _parseQuery function in the classes/search/SubmissionSearch.inc.php module

Hi @Vitaliy! All the articles being searched are published. Have you applied the multilingual fix by @litvinovg?

Nope. Just trying to understand this part of OJS code for no reason.

I have tried to reproduce your problem, but after passing through the PKPString static methods the $cleanText variable always gives me мухин. Can you show me a link to the changes you have made? I suppose they are available as a pull request on GitHub.

Unfortunately, I do not know the GitHub commands well, but I’ll try to upload my whole “project” to my GitHub account somehow (not as a fork) and post a link here.
UPDATE: https://github.com/crosfield/ojs-vestnik/blob/fdcec5562baeda8486efb2ffd567ae16f9a14005/lib/pkp/classes/search/SubmissionSearchIndex.inc.php#L50

Hmm, I don’t see how the regex inside the _parseQuery function can affect the letter х.

Neither do I. But the correct word is passed into function _parseQuery($query), while function _parseQueryInternal($signTokens, $tokens, &$pos, $total) already operates on the wrong word.

Unfortunately, I cannot reproduce the problem even with your installation.

Anyway, thanks @Vitaliy for trying to help)
And which version of PHP are you using? I suspect there may be a bug in one of the built-in PHP functions (although I doubt it; someone would have found it already).

I use PHP 7.2.

Thanks, @asmecher and @Vitaliy for your valuable ideas and help!
I found the cause of my problems with some search queries. The preg_match_all call in the _parseQuery function (lib/pkp/classes/search/SubmissionSearch.inc.php) does not process UTF-8 strings correctly, at least on PHP 5.6. It is enough to add the u modifier to the pattern passed to preg_match_all, and everything works perfectly for me. This call returns the number of “words” in the $query variable, but in my case it treated the Cyrillic letter х as a delimiter, along with the space, +, - and so on. PHP 7 probably handles this differently by default.
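
To make the effect of the u modifier concrete, here is a small sketch. The pattern is a simplified stand-in for the tokenizer in _parseQuery, not the actual OJS regex, and the variable names are made up for the example. Without u, PCRE works byte by byte, so the result for multibyte input can depend on the PHP/PCRE build and locale; with u, pattern and subject are treated as UTF-8 and х stays inside its word.

```php
<?php
// Assumes this file is saved as UTF-8.
$query = 'мухин широких';

// Simplified stand-in pattern: a token is a run of non-whitespace.
$bytePattern = '/[^\s]+/';   // byte mode: the subject is matched byte by byte
$utf8Pattern = '/[^\s]+/u';  // u modifier: pattern and subject are UTF-8

preg_match_all($bytePattern, $query, $byteMatches);
preg_match_all($utf8Pattern, $query, $utf8Matches);

// Print tokens as hex so that any split inside a character is visible.
echo "byte mode:  ", implode(' | ', array_map('bin2hex', $byteMatches[0])), "\n";
echo "utf-8 mode: ", implode(' | ', array_map('bin2hex', $utf8Matches[0])), "\n";

// In UTF-8 mode the two words come back intact.
var_dump($utf8Matches[0] === array('мухин', 'широких'));
```

If the byte-mode line shows a token ending in d1 followed by a token starting with d0b8, the х has been split in the way described above.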

Hi @crosfield,

Could you try the change proposed in “Use mbstring-capable regexp functions in searching” (issue #3491 in pkp/pkp-lib on GitHub)? It should resolve the issue (and is compatible with non-mbstring capable PHP, if that’s still a thing).
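
For readers without the issue at hand, a rough sketch of the general idea: a regex wrapper that uses UTF-8 mode when the PCRE build supports it and falls back to plain byte mode otherwise. The function names utf8MatchAll and pcreSupportsUtf8 are hypothetical, and this is not the code from the linked issue.

```php
<?php
// Hypothetical helpers for illustration; not the change from issue #3491.
// Assumes this file is saved as UTF-8.

// Returns true if this PCRE build accepts the u (UTF-8) modifier.
function pcreSupportsUtf8() {
    return @preg_match('/./u', 'a') === 1;
}

// Like preg_match_all(), but appends the u modifier when it is available.
// $pattern must be a delimited pattern that does not already end in u.
function utf8MatchAll($pattern, $subject, &$matches) {
    if (pcreSupportsUtf8()) {
        $pattern .= 'u';
    }
    return preg_match_all($pattern, $subject, $matches);
}

$count = utf8MatchAll('/[^\s]+/', 'мухин широких', $matches);
echo $count, "\n";
```

Keeping the capability check in one place means callers never have to decide about the modifier themselves.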

Regards,
Alec Smecher
Public Knowledge Project Team

On PHP 5.6 everything works now! Thanks @asmecher!

Merged – thanks, @crosfield!

Regards,
Alec Smecher
Public Knowledge Project Team