Indexing of strange strings from deleted submission files

Hi all.

I have encountered a strange problem. In one of our hosted journals, a handful of submissions were uploaded as HTML files. These files were created using RMarkdown and contain a lot of JS, CSS and base64-encoded data, which makes them pretty large and full of extraneous text.

When the issue was submitted, it took a very long time for the submission to complete. It looks like this was due to all the extraneous text being indexed. I found terms like ‘font-family’ and ‘bootstrap’ in the index database table. There were also lots of gibberish entries (strings like ‘safkdsfn’) that probably came from the base64-encoded images.

Anyway, we removed these files from the submission and replaced them with clean HTML files containing none of the CSS, JS, or base64 text. Yet when we submit the issue for publishing, it still takes a very long time to finish, and the same CSS, JS and base64 text is being indexed. I can see the INSERT statements when I watch the Apache log with MySQL verbose logging enabled.
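For reference, here is a minimal sketch of the kind of cleanup described above, using only Python's standard library. The function name `strip_embedded_assets` and the two patterns are assumptions made for illustration; this is not the procedure that was actually used on these files, and regular expressions are not a complete HTML sanitizer.

```python
import re

# Illustrative patterns only (hypothetical, not exhaustive).
# Matches inline <script>...</script> and <style>...</style> blocks.
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Matches base64 data URIs such as data:image/png;base64,iVBOR...
DATA_URI = re.compile(r"data:[^,\"'\s]*;base64,[A-Za-z0-9+/=]+")


def strip_embedded_assets(html: str) -> str:
    """Remove inline script/style blocks and base64 data URIs from HTML text."""
    without_blocks = SCRIPT_OR_STYLE.sub("", html)
    return DATA_URI.sub("", without_blocks)
```

Running the result through a text search for ‘font-family’ or ‘base64’ afterwards is a quick way to confirm that nothing was missed.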

Does anyone have any idea how we can publish this issue without that text being indexed? I don’t understand how it’s happening, since the offending files are no longer attached to the submission, at least on the front end of OJS.

Any suggestions would be very welcome!

Thanks.

-tim

Hi Tim,

Deleting a submission file does not remove its terms from the submission search index. If you’ve cleaned up the HTML files and re-uploaded them, the best way to get a clean search index is to run the command line tool for rebuilding the index:

php tools/rebuildSearchIndex.php

That may take a long time to run, but it clears out the index first and then re-indexes the content.

Cheers
Jason

Thanks Jason!

I’ve tried this, I think, or a version of this. What I’ve done is the following:

  • Unpublished the journal.
  • Deleted the rows from the tables below:

submission_search_objects;
submission_search_keyword_list;
submission_search_object_keywords;

  • Republished the journal with the clean HTML.

I still see the old HTML files being indexed on submission. I can actually see the insert operations in the MySQL logs.

It’s really strange. Do you know which table contains the files to be indexed? Maybe the files are still there and attached to the submission somehow. I’m completely baffled!

Hi Tim,

All submission files are stored in submission_files, but which files get indexed depends on the file stage they were uploaded to. If you know the submission id, you can look in the table for the files associated with that submission and then go out on disk and head down your files_dir directory to see what’s there. Given the steps you’ve already taken, I guess the question is: are you sure that the new files don’t contain the old strings again?
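One way to check that last point is to scan the submission's folder on disk for the strings that showed up in the index. The following is a rough sketch in Python; the function name `find_suspect_files` and the marker list are assumptions chosen to match the terms mentioned earlier in this thread, and nothing here is part of OJS itself.

```python
import re
from pathlib import Path

# Hypothetical marker list based on the terms reported in this thread.
MARKERS = re.compile(r"font-family|bootstrap|;base64,", re.IGNORECASE)


def find_suspect_files(root):
    """Return the sorted paths of files under root whose text matches MARKERS."""
    suspects = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Ignore undecodable bytes so binary files do not raise errors.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if MARKERS.search(text):
            suspects.append(path)
    return sorted(suspects)
```

Calling `find_suspect_files()` on the submission's folder under files_dir lists every file that still contains one of those strings. Any hit there would suggest that an old file is still on disk or that a new upload is not as clean as expected.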

Cheers,
Jason