Hi @asmecher! It seems I have found a “stop word”, or rather a “stop char”: the Cyrillic letter “х” (ha, U+0445), and possibly its uppercase form too. My OJS does not find any word containing this letter. If my theory is right, the other “stop char” will be “ъ” (Cyrillic hard sign, U+044A). On the Cyrillic keyboard, these two letters sit on the keys for the left and right square brackets.
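One way to start testing this theory is to look at how the two letters are stored. The sketch below assumes the file and the OJS data are UTF-8 encoded; it only prints the raw bytes and does not claim that either byte is the cause.

```php
<?php
// Print the UTF-8 bytes of the two suspected "stop chars".
// Assumes this file itself is saved as UTF-8.
$suspects = ['х' /* U+0445, Cyrillic ha */, 'ъ' /* U+044A, hard sign */];

foreach ($suspects as $char) {
    // bin2hex() shows the raw bytes, independent of locale or mbstring settings.
    echo $char, ' => ', bin2hex($char), PHP_EOL;
}
```

Both letters are two bytes long and share the lead byte 0xD1; they differ only in the second byte (0x85 and 0x8A).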
Hi @crosfield,
Ah, that’s helpful. I wonder whether your search terms (or parts of them) aren’t being cleared out as punctuation; see the filterKeywords function in lib/pkp/classes/search/SubmissionSearchIndex.inc.php, in particular these lines. Try selectively commenting out or removing these lines until your missing content reappears, and let me know whether/which lines were at issue.
Regards,
Alec Smecher
Public Knowledge Project Team
Hi @asmecher! Unfortunately the problem is not in the filterKeywords function but somewhere deeper. If you mean line 42, and especially the \[\] part of the regexp, then that is not the cause: commenting out this line and the following ones changes nothing. Of course, each time I regenerate the index and clear the cache, not only in OJS but also in the browser.
How can I see what happens in the PHP code when I enter a “wrong” word? I have temporary access to PHPStorm, but I could not configure XDebug for debugging. Perhaps there is a simpler way to debug, by inserting echo commands into some piece of the code?
Hi @crosfield,
I tend to use plain old error_log calls for debugging; they don’t interfere with AJAX responses. But yes, please give that a try, and let me know if you’re able to narrow it down.
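A minimal sketch of that kind of error_log debugging. The helper name describeForLog and the label text are hypothetical and not part of OJS; the hex dump is added because a damaged multibyte character can look like “?” or vanish in a log viewer.

```php
<?php
// Hypothetical helper (not part of OJS): formats a value for the PHP error log.
function describeForLog($label, $value)
{
    // Show the text as-is plus its raw bytes, so a broken multibyte
    // character stays visible even if the log viewer cannot render it.
    return sprintf('%s: "%s" [hex %s]', $label, $value, bin2hex($value));
}

// Example call, as it might be placed temporarily inside the function under investigation.
$text = 'мухин';
error_log(describeForLog('filterKeywords input', $text));
```

Because error_log writes to the server log (or stderr on the command line) rather than to the response body, a temporary call like this can stay in place while testing searches through the site.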
I’d suggest turning off charset_normalization in config.inc.php as you work with this. It keeps things simpler, and that option was more useful before infrastructure support for UTF8 was as good, going back 5-10 years. It’s an option I’d actually like to remove at some point.
Regards,
Alec Smecher
Public Knowledge Project Team
Hi all,
As I understand it, all entries from submission_search_keyword_list should give results in the search form, except those filtered out by the filterKeywords method. Am I right?
Or maybe there are other filters. For example, are keywords that come from unpublished articles ignored?
I’m asking because in my case queries with the letter х, e.g. широких, give results. But, for example, the keyword_text entry -forcible gives nothing.
Thanks @asmecher. In the meantime I realized that the filterKeywords function is already passed a “wrong” string: for example, for мухин the “stop char” х is removed and муин is obtained. If you delete line 42, the “stop char” is instead replaced by a question mark (му?ин is obtained). So the $text variable that is passed to the function already contains the wrong word му?ин. Therefore an incorrect value is passed to filterKeywords at an earlier stage (in the _indexObjectKeywords function or in the _parseQueryInternal function).
And the charset_normalization option does not seem to affect anything.
UPDATE: If you insert echo $cleanText; into the filterKeywords function and redirect the output of the tools/rebuildSearchIndex.php utility to a text file, the file will contain the full word мухин. However, when searching for the word мухин through the site, the wrong word муин is displayed at the top of the page. This probably means that filterKeywords is called from different places with different results.
The output from error_log($cleanText); is:
[17-Mar-2018 13:22:13 Europe/Moscow] му
[17-Mar-2018 13:22:13 Europe/Moscow] ин
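These two log lines are what one would see if a single byte of х were consumed as a separator. The sketch below reproduces that outcome on purpose: it runs a byte-oriented pattern (no u modifier) in which the byte 0x85, the second byte of х in UTF-8, is listed explicitly as a delimiter. The pattern and the function name splitOnByte85 are illustrative assumptions, not the OJS code.

```php
<?php
// Hypothetical demonstration: split a UTF-8 word with a byte-oriented regex
// that explicitly treats byte 0x85 as a delimiter.
function splitOnByte85($word)
{
    // Without the "u" modifier PCRE works on single bytes, so \x85 matches
    // the second byte of "х" (0xD1 0x85) and the word is cut in the middle.
    preg_match_all('/[^\s\x85]+/', $word, $matches);
    return $matches[0];
}

foreach (splitOnByte85('мухин') as $fragment) {
    // Print the raw bytes of each fragment rather than the (partly invalid) text.
    echo bin2hex($fragment), PHP_EOL;
}
```

The first fragment is му followed by a lone 0xD1 byte, which is not valid UTF-8 on its own and may be dropped or shown as “?” when displayed; the second fragment is ин.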
UPDATE2: The query string changes in the _parseQuery function in the classes/search/SubmissionSearch.inc.php module.
Hi @Vitaliy! All the articles being searched are published. Have you implemented the multilingual fix by @litvinovg?
Nope. Just trying to understand this part of OJS code for no reason.
I have tried to reproduce your problem, but after passing through the PKPString static methods the $cleanText variable always gives me мухин. Can you show me a link to the changes you have made? I suppose they are available as a pull request on GitHub.
Unfortunately, I do not know the GitHub commands well, but I’ll try to upload my whole “project” to my GitHub account (not as a fork) and post a link here.
UPDATE: https://github.com/crosfield/ojs-vestnik/blob/fdcec5562baeda8486efb2ffd567ae16f9a14005/lib/pkp/classes/search/SubmissionSearchIndex.inc.php#L50
Hmm, I don’t see how the regex inside the _parseQuery function can affect the letter х.
I do not see it either. But the correct word is passed to function _parseQuery($query), while function _parseQueryInternal($signTokens, $tokens, &$pos, $total) already operates on the wrong word.
Unfortunately, I cannot reproduce the problem even with your installation.
Anyway, thanks @Vitaliy for trying to help)
Which version of PHP are you using? I am wondering about a bug in one of the built-in PHP functions (although I doubt it; someone would have found it before).
I use PHP 7.2.
Thanks, @asmecher and @Vitaliy for your valuable ideas and help!
I found the cause of my problems with some search queries. The preg_match_all call in the _parseQuery function (lib/pkp/classes/search/SubmissionSearch.inc.php) does not process UTF-8 strings correctly (at least in PHP 5.6). It was enough to add the u modifier to the regular expression passed to preg_match_all, and everything worked perfectly. This function returns the number of “words” in the $query variable, but in my case it treated the Cyrillic letter х as a delimiter along with the space, +, - etc. Perhaps in PHP 7 the PCRE functions behave differently by default.
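A minimal sketch of the fix described above. The pattern is an approximation of the one in _parseQuery, reconstructed for illustration, so treat it and the function name tokenizeQuery as assumptions; the point is only the trailing u modifier, which makes PCRE treat the pattern and the subject as UTF-8 rather than as single bytes.

```php
<?php
// Hypothetical stand-in for the tokenizing step in _parseQuery.
function tokenizeQuery($query)
{
    // Group 1: optional + or - sign. Group 2: quoted phrase, parenthesis, or word.
    // The trailing "u" makes PCRE read the subject as UTF-8 characters, so a
    // single byte inside a Cyrillic letter can no longer be matched by \s.
    $count = preg_match_all('/(\+|\-|)("[^"]+"|\(|\)|[^\s\)]+)/u', $query, $matches);
    return ['count' => $count, 'signs' => $matches[1], 'tokens' => $matches[2]];
}

$result = tokenizeQuery('+мухин -test');
echo $result['count'], ': ', implode(' | ', $result['tokens']), PHP_EOL;
```

With the modifier in place the word мухин comes back as one token. Note that preg_match_all returns false when the subject is not valid UTF-8 and the u modifier is used, so real code would want to check for that case.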
Hi @crosfield,
Could you try the change proposed in pkp/pkp-lib issue #3491, “Use mbstring-capable regexp functions in searching”? It should resolve the issue (and is compatible with non-mbstring capable PHP, if that’s still a thing).
Regards,
Alec Smecher
Public Knowledge Project Team