Full-text not working after OJS-Import

I am using OJS 3.1.2.1. All articles and issues in this journal, including the PDFs, were generated by a custom XML generator and imported.

In the config.inc.php, the line

index[application/pdf] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

is uncommented and I cleared the cache/_db/ multiple times.

When I run php tools/rebuildSearchIndex.php, the script runs without errors, but words from the PDF files are not indexed. I checked the database table submission_search_keyword_list and also tested the search on the website.
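One way to rule out the helper itself is to run the configured command by hand on a single PDF. Below is a minimal Python sketch, assuming that %s in the index line simply stands for the path of the file to parse (as the config line suggests). The function names and the example path are made up for illustration and are not part of OJS.

```python
import shlex
import subprocess

# Command template copied from the index[application/pdf] line in config.inc.php.
INDEX_COMMAND = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"


def build_command(template, path):
    """Replace the %s placeholder with the shell-quoted file path (assumed behaviour)."""
    return template.replace("%s", shlex.quote(path))


def extract_text(template, path):
    """Run the helper pipeline through the shell and return whatever it prints."""
    # With a pipeline, check=True only sees the exit status of the last command (tr).
    result = subprocess.run(
        build_command(template, path),
        shell=True,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical path; point this at one of the imported galley files.
    print(extract_text(INDEX_COMMAND, "/path/to/one/galley.pdf")[:500])
```

If this prints readable text, the pdftotext line is probably fine and the problem lies in which files the indexer picks up.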

Then I checked the submission files: all files that were originally .pdf files are now .txt files in the submission folder. Changing the index line in config.inc.php to

index[plain/text] = "/usr/bin/pdftotext -enc UTF-8 -nopgbrk %s - | /usr/bin/tr '[:cntrl:]' ' '"

had no effect.

Subsequently, I checked the file_type column in the database table submission_files. For all submissions it says application/pdf (which is correct).

Now, I am running out of ideas. Where is the type defined to process the submission files? Or what can I do?

Any help is appreciated!

Cheers,

Adrian

So after importing the XML file, the submission files that were .pdf files are now .txt files? Can you share a sample of the XML import file?

An example would be:

<issues xmlns="http://pkp.sfu.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
    <issue xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" published="0" access_status="1">
            <id type="internal" advice="ignore">10624985</id>
            <issue_identification>
                    <volume>68</volume>
                    <year>1912</year>
                    <title locale="de_DE">10624985</title>
                    <title locale="en_US">10624985</title>
            </issue_identification>
            <date_published>1912-01-01</date_published>
            <last_modified>1912-01-01</last_modified>
            <sections>
                    <section ref="ART" seq="1" editor_restricted="0" meta_indexed="1" meta_reviewed="1" abstracts_not_required="1" hide_title="0" hide_author="0" abstract_word_count="0">
                            <id type="internal" advice="ignore">1</id>
                            <abbrev locale="de_DE">ART</abbrev>
                            <abbrev locale="en_US">ART</abbrev>
                            <title locale="de_DE">Artikeltext</title>
                            <title locale="en_US">Artikeltext</title>
                    </section>
            </sections>
            <articles xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
                    <article xmlns="http://pkp.sfu.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" locale="de_DE" section_ref="ART" xsi:schemaLocation="http://pkp.sfu.ca native.xsd" stage="production" date_published="1912-01-01">
                            <id type="internal" advice="ignore">10633615</id>
                            <title locale="de_DE">Report XY</title>
                            <title locale="en_US">Report XY</title>
                            <authors xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
                                    <author include_in_browse="true" user_group_ref="Autor/in">
                                            <givenname>N.</givenname>
                                            <familyname>N.</familyname>
                                            <email>dummy@mail.com</email>
                                    </author>
                            </authors>
                            <submission_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="final" id="106336151" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
                                    <revision number="1" filename="pdf_10633615" viewable="false" date_uploaded="2020-06-08" date_modified="2020-06-08" filesize="2776313" filetype="application/pdf" uploader="ojs_admin" genre="Artikeltext">
                                            <name locale="de_DE">pdf_10633615</name>
                                            <name locale="en_US">pdf_10633615</name>
                                            <embed encoding="base64">Here comes a lot of byte code</embed>
                                    </revision>
                            </submission_file>
                            <article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="false" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
                                    <name locale="de_DE">PDF</name>
                                    <seq>0</seq>
                                    <submission_file_ref id="106336151" revision="1"/>
                            </article_galley>
                            <pages>XLIV-LIV</pages>
                    </article>
.... (more articles)
</issue>
</issues>

Hi @NateWr ,

I drilled a little further and found out that OJS does not find any submission files associated with the articles.

The code is in classes/search/ArticleSearchIndex.inc.php (note that this is OJS 3.1.2.1 code!):

if ($hookResult === false || is_null($hookResult)) {
  $fileDao = DAORegistry::getDAO('SubmissionFileDAO');
  import('lib.pkp.classes.submission.SubmissionFile'); // Constants
  // Index galley files
  $files = $fileDao->getLatestRevisions( // <- this returns 0 elements
    $article->getId(), SUBMISSION_FILE_PROOF
  );
  foreach ($files as $file) {
    if ($file->getFileId()) {
      self::submissionFileChanged($article->getId(), SUBMISSION_SEARCH_GALLEY_FILE, $file->getFileId());
      // Index dependent files associated with any galley files.
      $dependentFiles = $fileDao->getLatestRevisionsByAssocId(ASSOC_TYPE_SUBMISSION_FILE, $file->getFileId(), $article->getId(), SUBMISSION_FILE_DEPENDENT);
      echo "Got dep files\n";
      foreach ($dependentFiles as $depFile) {
        if ($depFile->getFileId()) {
          self::submissionFileChanged($article->getId(), SUBMISSION_SEARCH_SUPPLEMENTARY_FILE, $depFile->getFileId());
        }
      }
    }
  }
}

I will investigate further, but perhaps you already know what the problem could be.

Edit: Okay, now I have an idea! This is a problem when importing articles with only galleys. Since there are no files in the "proof" stage (are these the files in the "production ready" segment?), OJS does not find the uploaded galleys. Is this a bug?

Edit 2: If this is a bug, then it is still present in the current version.

Edit 3: Ah, I found it! It’s not a bug! In my import file I set stage="final" on the submission file, but it should have been stage="proof". So, besides that I am stupid, what is the actual difference between proof and final?
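For anyone generating import files the same way, this kind of mistake can be caught before importing. The following is a rough Python sketch, not an official PKP tool. It assumes the native XML layout shown in the example above (default namespace http://pkp.sfu.ca, submission_file elements with id and stage attributes, galleys pointing to them via submission_file_ref). The function name and the file name import.xml are made up.

```python
import xml.etree.ElementTree as ET

# Default namespace used by the native import XML in the example above.
PKP_NS = "{http://pkp.sfu.ca}"


def galley_files_not_in_proof_stage(root):
    """Return (file_id, stage) for each galley-referenced file whose stage is not 'proof'."""
    # Map every submission_file id to its stage attribute.
    stages = {
        sf.get("id"): sf.get("stage")
        for sf in root.iter(PKP_NS + "submission_file")
    }
    problems = []
    for galley in root.iter(PKP_NS + "article_galley"):
        for ref in galley.iter(PKP_NS + "submission_file_ref"):
            file_id = ref.get("id")
            stage = stages.get(file_id)  # None if the referenced file is missing
            if stage != "proof":
                problems.append((file_id, stage))
    return problems


if __name__ == "__main__":
    # Hypothetical file name; note that ET.parse loads the whole file,
    # including the base64-embedded PDFs, into memory.
    root = ET.parse("import.xml").getroot()
    for file_id, stage in galley_files_not_in_proof_stage(root):
        print(f"submission_file {file_id}: stage={stage!r}, expected 'proof'")
```

Run against the example file above, this would report submission_file 106336151 with stage 'final'.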

Glad you were able to track it down!

what is the actual difference between proof and final?

We’ve just put up some documentation that describes each of the file stages.
