Importing PDFs from remote URL with Native XML Plugin

How can I import a PDF from a remote source in OJS 3.1.0.1 using the Native XML Plugin? When I enter

  <article_galley>
    <name>PDF</name>
    <seq>0</seq>
    <remote src="...remote URL..." />
  </article_galley>

in the XML file, the PDFs are not imported. In the “Galleys” section of the article, only the box “This galley will be available at a separate website.” is checked. The PDF should be uploaded to OJS during the import process. In OJS 2.x that was possible (see “Creating the XML Import File”, post #2 by bozana). How can this be done in OJS 3.x?

The file referenced can be imported to OJS (submission_file_ref), or can be remotely linked.

The <remote src="..." /> tag links an external file.

The <href src="..." mime_type="..." /> tag will import the file from an external reference.

Hi @ctgraham,

thanks for your answer. I don’t know exactly where to put the href-tag. Within “submission”?

I tried the following:

<submission_file stage="proof" id="1">
  <revision number="1" genre="Article Text" filename="xyz.pdf" viewable="true" date_uploaded="2009-01-01" date_modified="2009-01-01" filetype="application/pdf">
    <name>xyz.pdf</name>
    <href src="...remote url..." mime_type="application/pdf"/>
  </revision>
</submission_file>
<article_galley>
  <name>PDF</name>
  <seq>0</seq>
  <submission_file_ref id="1" revision="1"/>
</article_galley>  

When I try to upload this XML, I get an error saying that a temporary file could not be created. I wonder whether something is wrong in the XML, or whether some permissions are missing on our server. The folder the remote PDF comes from is accessible, and the upload folder in OJS is writable.

We did a similar import job via the Native XML plugin with OJS 3.0.2 and it worked fine. In the article galley section we used this syntax:

<article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="false" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
<id type="internal" advice="ignore">355</id>
<name locale="en_US">PDF</name>
<seq>0</seq>
<remote src="http://testserver.my/ojs/oldfiles/article.pdf"/>
</article_galley>

I can’t remember exactly why we used the internal id, but I think we needed it as a reference point for the DOIs that were also supposed to be generated. I also think you cannot influence this id when importing; OJS just uses the next available number. We tried to match that number during the import.

We don’t have OJS 3.1 yet, but I can try upgrading my personal test system to see whether the import file still works.

The <href> tag should go in the <revision>, within <submission_file>.

Here’s a sample article I’ve imported recently:

      <article stage="production" date_published="1964-01-01" section_ref="FM" seq="01">
        <title locale="en_US">Frontmatter</title>
        <copyrightHolder locale="en_US">The American Cleft Palate Association</copyrightHolder>
        <authors>
          <author user_group_ref="Author">
            <firstname>Editorial</firstname>
            <lastname>Staff</lastname>
            <email>cleftpalatejournal@pitt.edu</email>
          </author>
        </authors>
        <submission_file stage="submission" id="stage.e20986v01n1.01">
          <revision number="1" genre="Article Text" viewable="true" filetype="application/pdf" user_group_ref="Journal editor" uploader="admin" filename="e20986v01n1.01-scan.pdf">
            <name locale="en_US">Frontmatter</name>
            <href src="original/e20986v01n1.01.pdf"/>
          </revision>
        </submission_file>
        <submission_file stage="submission" id="ocr.e20986v01n1.01">
          <revision number="1" genre="Article Text" viewable="true" filetype="application/xml" user_group_ref="Journal editor" uploader="admin" filename="e20986v01n1.01-ocr.xml">
            <name locale="en_US">Frontmatter</name>
            <href src="xml/e20986v01n1.01.xml"/>
          </revision>
        </submission_file>
        <submission_file stage="production_ready" id="proof.e20986v01n1.01">
          <revision number="1" genre="Article Text" viewable="true" filetype="application/pdf" user_group_ref="Journal editor" uploader="admin" filename="e20986v01n1.01.pdf">
            <name locale="en_US">Frontmatter</name>
            <href src="ocr/e20986v01n1.01.pdf"/>
          </revision>
        </submission_file>
        <article_galley approved="true">
          <name locale="en_US">PDF</name>
          <seq>0</seq>
          <submission_file_ref revision="1" id="proof.e20986v01n1.01"/>
        </article_galley>
      </article>

The temporary file error is concerning. We try to create a temporary file in files_dir (per your config.inc.php) as part of the import. This could be a file permissions problem, or it could be an edge case not considered in relatively new code.

Hi @ctgraham,

I’ve noticed that the XML files are imported to our upload directory. Their file names always start with “xml…” (e.g. “xml0l1DWT”), but the error message is “Temporary file /srv/www/ojs/upload/temp/srcrGrq4y could not be created” (that file name always starts with “src”). We will also check our “NativeXmlSubmissionFileFilter.inc.php” file.

Did you configure your files_dir in config.inc.php to be “upload”?

If your files_dir is under your OJS document root, do be sure it is protected from web access.

Is OJS able to read and write other files to the files_dir, especially the “temp” folder?
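
One way to test this outside OJS is a small script run as the web server user. The following is only a minimal sketch in Python, not what the import itself does (OJS is PHP); the function name `can_create_temp_file` is hypothetical, and the “src” prefix is borrowed from the error message above purely for illustration.

```python
import os
import tempfile

def can_create_temp_file(files_dir, prefix="src"):
    """Return True if a temporary file can be created under files_dir/temp."""
    temp_dir = os.path.join(files_dir, "temp")
    try:
        # Create a file in the temp folder; it is deleted again when the block exits.
        with tempfile.NamedTemporaryFile(prefix=prefix, dir=temp_dir):
            pass
        return True
    except OSError:
        # Covers a missing directory as well as insufficient permissions.
        return False
```

Run it as the same user the web server runs as, for example `can_create_temp_file("/srv/www/ojs/upload")`; a result of False would point to a filesystem problem rather than a problem in the XML.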

Hi,

Yes, we have configured the files_dir in our config.inc.php:

files_dir = /srv/www/ojs/upload

OJS can read and write to the “upload” directory and all its subfolders, including “temp”.

That really seems like it ought to work.

Can you walk us through your step-by-step process? Can you share an example XML file? We can see if we can reproduce the error.

Hi @bibliothekswelt,

As @ctgraham noted above, please make sure that the contents of your files_dir aren’t available for direct access via the web server. This directory should either be moved outside the web root, or protected from direct access using a .htaccess file or similar mechanism.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi all, I’ve run into this exact error as well with a hosted client of ours. I’m in the middle of troubleshooting it and will report back what we find. (Clinton, I may ping you separately just to confirm our XML is indeed good.)

Cheers,
James / PKP

Any update on this @jmacgreg?

Hi all,

The issue appears to be that using <href src="…"/> to pull in local files (i.e. files specified by a path on the local filesystem) doesn’t work. We were trying this:

<href src="/home/journal/data/filename.pdf"/>

… which just doesn’t work. Moving the files to a web-accessible location, and referencing them that way, worked:

<href src="https://example.com/data/filename.pdf"/>

The 2.x version of the plugin allowed us to specify a local path, so I figure this is possibly just some missing code. If others are running into this, maybe it’s a feature worth returning?

The intent of the code is to handle this use case. See:

I used local filenames for an import here, though they may have been relative paths instead of absolute paths.

Are you confident that the user running the import (hopefully the web / apache user) has filesystem level access to /home/journal/data/filename.pdf?
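
A quick way to answer that is to run a check as the web server user. This is a minimal Python sketch; the function name `describe_access` is hypothetical, and it only reports what the current user can see, which is why it has to be run under the same account as the import.

```python
import os

def describe_access(path):
    """Report whether the current user can read the file at path."""
    # os.path.exists also returns False when a parent directory cannot be traversed.
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.access(path, os.R_OK):
        return "not readable"
    return "readable"
```

For the path discussed above, this would be called as `describe_access("/home/journal/data/filename.pdf")`.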

Can you confirm this errors out on this line:

?

Thanks for the information. I will check it out (sorry for not responding earlier; I was away from work for a couple of weeks).

@ctgraham Here’s a sample XML file

<?xml version="1.0"?>
<issue xmlns="http://pkp.sfu.ca" published="1" current="0" access_status="1">
  <issue_identification>
   <title>IFFOnZeit Nr. 1 (2009)</title>
  </issue_identification>
  <date_published>2009-01-01</date_published>
  <sections>
    <section ref="Edit" seq="1">
      <abbrev>Edit</abbrev>
      <title>Inhaltsverzeichnis</title>
    </section>
  </sections>
  <articles>
    <article section_ref="Edit" seq="1" stage="submission" date_submitted="2009-01-01" date_published="2009-01-01" access_status="1">
      <id>100</id>
      <id type="doi">10.4119/UNIBI/izgonzeit-100</id>     
      <title>Editorial</title>
      <authors>
        <author primary_contact="true" include_in_browse="true" user_group_ref="Autor/in">
          <firstname>Birgitta</firstname>
          <lastname>Wrede</lastname>
          <country>DE</country>
          <email>izg@uni-bielefeld.de</email>
        </author>
      </authors>
      <article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="false" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
        <id type="internal" advice="ignore">355</id>
        <name locale="en_US">PDF</name>
        <seq>0</seq>
        <remote src="https://www.ub.uni-bielefeld.de/div/ojs/izgonzeit/exim/2009_1 Editorial.pdf"/>
      </article_galley>   
      <pages>3-4</pages>
    </article>    
  </articles>
</issue>

The article is imported (via Tools → Import/Export → Native XML plugin → Upload File → Import), but without the PDF.

Just wanted to let you know that we have found the problem: there was a space in the URL that was not percent-encoded (%20).
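
For anyone generating import files by script, the URLs can be encoded before they are written into the XML. Below is a minimal sketch in Python using the standard library; the function name `encode_url_path` is hypothetical. It assumes that any “%” already in the path belongs to an existing escape sequence, so it leaves those untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url_path(url):
    """Percent-encode unsafe characters (such as spaces) in the path of a URL."""
    parts = urlsplit(url)
    # Keep "/" and "%" as they are; spaces and other unsafe characters are encoded.
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment))

print(encode_url_path("https://www.ub.uni-bielefeld.de/div/ojs/izgonzeit/exim/2009_1 Editorial.pdf"))
# prints https://www.ub.uni-bielefeld.de/div/ojs/izgonzeit/exim/2009_1%20Editorial.pdf
```

Because “%” is treated as safe, running the function twice on the same URL does not double-encode it.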

I’ve also noticed a problem with the “genre” attribute. If it’s missing or incorrect, the PDF will be uploaded, but not shown on the article frontdoor.
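
A pre-import check can catch this. The sketch below, in Python, lists every <revision> element whose “genre” attribute is missing or not in a set of names you supply. The function name `revisions_with_bad_genre` is hypothetical, and the set of valid genre names is an assumption you have to fill in from your own journal’s settings (“Article Text” is the one used in the examples above).

```python
import xml.etree.ElementTree as ET

PKP_NS = "{http://pkp.sfu.ca}"

def revisions_with_bad_genre(xml_text, known_genres):
    """Return (filename, genre) pairs for revisions with a missing or unknown genre."""
    root = ET.fromstring(xml_text)
    problems = []
    for element in root.iter():
        # Match <revision> with or without the PKP namespace.
        if element.tag in ("revision", PKP_NS + "revision"):
            genre = element.get("genre")
            if genre is None or genre not in known_genres:
                problems.append((element.get("filename"), genre))
    return problems
```

An empty result means every revision carries a genre from your list; it does not guarantee that OJS accepts the file in other respects.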