How do I import external HTML file to issue and article?

I am trying to import back-issues along with articles into OJS 3.x.
I am using the native importer inside OJS 3, but I have problems configuring the XML for import.
So far I have tried this XML to import the HTML file into an article:

<submission_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="submission" id="11" xsi:schemaLocation="http://dev.openjournal.tld hb_import.xsd">
  <revision number="1" genre="Artikeltext" filename="paper1.html" viewable="true" date_uploaded="2018-05-16" date_modified="2018-05-16" filesize="23155" filetype="text/html" user_group_ref="F&#xF6;rfattare" uploader="christerjohansson">
    <name locale="sv_SE">christerjohansson, F&#xF6;rfattare, paper1.html</name>
  </revision>
</submission_file>

But it seems the native importer requires the HTML to be Base64-encoded within the XML file. How do I import HTML files along with the issue and article? The documentation only describes version 2.x. The XML above comes from an export I made of content I had created manually.

Hi @Chrizze

You can see an example of a full issue import XML here: https://github.com/pkp/ojs/blob/master/tests/data/60-content/issue.xml. An article import would be similar, just starting with the element “article” (and if the article should be assigned to an issue, the issue_identification element should be added within the article element).
If your article file is online, i.e. accessible under a URL, then instead of the element “embed” (in the element “revision” in the element “submission_file”) you could use
<href src="http://..." />
OJS would then (try to) import the file from that URL (defined in the attribute “src” in the element “href”).
Maybe this would be an easier solution for you?
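For anyone generating such files programmatically, here is a minimal Python sketch of a submission_file element that uses href instead of embed. The attribute values are copied from the snippet in the question. The helper name `build_submission_file` and the example.org URL are made up for illustration, and the sketch assumes the enclosing import document declares the namespaces.

```python
import xml.etree.ElementTree as ET


def build_submission_file(file_id, filename, title, src_url, uploader, date,
                          genre="Artikeltext", locale="sv_SE"):
    """Build a <submission_file> whose content is fetched from src_url.

    Hypothetical helper: attribute names follow the snippets in this thread,
    they are not checked against native.xsd here.
    """
    sf = ET.Element("submission_file", {"stage": "submission", "id": str(file_id)})
    rev = ET.SubElement(sf, "revision", {
        "number": "1",
        "genre": genre,
        "filename": filename,
        "viewable": "true",
        "date_uploaded": date,
        "date_modified": date,
        "filetype": "text/html",
        "uploader": uploader,
    })
    name = ET.SubElement(rev, "name", {"locale": locale})
    name.text = title
    # <href src="..."/> takes the place of the Base64 <embed> element.
    ET.SubElement(rev, "href", {"src": src_url})
    return sf


# The URL below is a placeholder, not a real location.
element = build_submission_file(
    11, "paper1.html", "paper1.html",
    "http://example.org/imports/paper1.html",
    "christerjohansson", "2018-05-16",
)
print(ET.tostring(element, encoding="unicode"))
```

The printed element can then be pasted or written into the article element of the import file.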

Best,
Bozana

This is my current XML code, on PasteBin: https://pastebin.com/t7X0jKgd

When trying to import this, using the native XML importer in OJS 3.1.1-2 I get this message,

Validation errors:
Opening and ending tag mismatch: href line 34 and revision
Opening and ending tag mismatch: revision line 32 and submission_file
Opening and ending tag mismatch: submission_file line 31 and article
Premature end of data in tag article line 2
The document has no document element.


The files are in the right location, so why the tag mismatch? Am I setting the tags wrong?

Hi @Chrizze

Your href element is missing the closing />, i.e. it should be <href src="http://dev.ojs/arkivet/1-1/paper1.html" />
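This kind of mistake can be caught before uploading by running the file through any XML parser. Here is a minimal Python sketch using the standard library. The helper name `well_formed` is made up, and the two strings are reduced to the revision and href elements only.

```python
import xml.etree.ElementTree as ET


def well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
        return True, ""
    except ET.ParseError as err:
        return False, str(err)


# href is opened but never closed, as in the failing import.
broken = '<revision><href src="http://dev.ojs/arkivet/1-1/paper1.html"></revision>'
# href written as an empty-element tag.
fixed = '<revision><href src="http://dev.ojs/arkivet/1-1/paper1.html" /></revision>'

print(well_formed(broken))
print(well_formed(fixed))
```

This only checks well-formedness. It does not validate against native.xsd, so OJS may still reject a file that passes.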

Best,
Bozana

Haha, yes! I totally missed that one. Thank you for finding it. :slightly_smiling_face:
However, my imported html-file does not show up as a production galley on the site. Do I need to import it separately, or is that another element?

Thank you for helping, much appreciated!

I noticed that I can set a stage attribute on the submission_file element, but the file still doesn't show up as a published galley on the site. I am trying to generate an XML file that will import a few thousand HTML-based papers into a journal.

All the html-files will reside on the server once import is ready to go, and I need them to be imported along with any images they link to. (I can generate a list of images into the xml, but need some info on this)

Hi @Chrizze

I think you should have id="13" in the submission_file_ref element of your article_galley element, i.e. <submission_file_ref id="13" revision="1"/>, because your submission_file element has that id. The system has to know which file belongs to the galley, and that relationship is established through this id. The revision number seems to be OK.
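With thousands of generated articles, the id relationship can be checked mechanically before import. The following Python sketch lists galley references that point at no submission_file in the same document. The function name `dangling_galley_refs` and the reduced sample XML are made up for illustration, and the wrong value "12" is invented, since the thread does not show the original one.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix so the check works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def dangling_galley_refs(xml_text):
    """Return ids used in article_galley/submission_file_ref that match no submission_file."""
    root = ET.fromstring(xml_text)
    file_ids = {el.get("id") for el in root.iter() if local(el.tag) == "submission_file"}
    missing = []
    for galley in root.iter():
        if local(galley.tag) != "article_galley":
            continue
        for ref in galley:
            if local(ref.tag) == "submission_file_ref" and ref.get("id") not in file_ids:
                missing.append(ref.get("id"))
    return missing


# Reduced sample: only the elements and attributes needed for the check.
sample = """
<article>
  <submission_file stage="submission" id="13">
    <revision number="1" filename="paper1.html"/>
  </submission_file>
  <article_galley approved="true">
    <submission_file_ref id="{ref}" revision="1"/>
  </article_galley>
</article>
"""

print(dangling_galley_refs(sample.format(ref="12")))  # mismatching id
print(dangling_galley_refs(sample.format(ref="13")))  # matching id
```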

EDIT: I will have to look into the images embedded in the HTML…

Best,
Bozana

The images inside an HTML file are often embedded like so,
<p><img src="equatn.gif"></p>

The file names of the images are indexed and saved into a list inside my software, but I need to know how to construct the XML elements for importing these image files. All files are indexed, and the XML is programmatically generated. I just need to know how to configure the actual XML element to import an image file into OJS.

I forgot to mention that I managed to import a HTML galley into OJS, and it showed up on the website too. Now I only need the images and we’re good to go. :slight_smile:

Thank you for all help! :slight_smile:

Hi @Chrizze

I’ve just checked it and unfortunately we haven’t considered such a dependency for HTML images in the import/export format yet :frowning:
I opened a new issue for that: import/export of HTML galley images · Issue #3878 · pkp/pkp-lib · GitHub
Please follow the progress of the work there…

Best,
Bozana

Ok, thank you. :slight_smile:
Is there any other way to import images from a html galley then? Maybe manually import them over ftp to a specific folder where all images are located?

It is rather important that images follow these html galleys, as they represent research data and similar.

Hi @Chrizze

You could upload the images manually, but you would need to use the web UI, in the article production stage, galleys grid – in order for all information to be correctly saved in the DB and according to that in the files folder – so that the OJS knows that they belong to that HTML file…
Also, then you should not include them in the import XML file.
How many images do you have/would need to import?

Best,
Bozana

Thank you for your support. We’re currently in a suspended state. I will return later this year to continue this task. Your support is much appreciated.

The number of images is about 5000, in GIF, PNG and JPG formats. I am currently trying to retrace my steps and remember my thought process on this problem.
I am manually editing the XML for now, adding the links and so on needed for the import. I will then use this as a sort of template for the rest of the project's files. There are somewhere between 12000 and 15000 files that need to be imported.

@bozana I saw that there were some additions to the thread over at GitHub in regards to the issue of importing dependency files of the HTML. I can however not understand how to actually reference images that are in my HTML-file to be imported into OJS along with the HTML-file.

Here is my current submission file element in the XML-file,

<submission_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="submission" id="17" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <revision number="1" genre="Article Text" filename="paper1.html" viewable="true" date_uploaded="2016-02-11" date_modified="2016-02-11" filetype="text/html" uploader="christerjohansson">
    <name locale="en_US">Non-hierarchic document clustering using a genetic algorithm</name>
    <href src="http://dev.openjournal.tld/imports/html/paper1.html"></href>
  </revision>
</submission_file>

<article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="true" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <name locale="en_US">Non-hierarchic document clustering using a genetic algorithm</name>
  <seq>0</seq>
  <submission_file_ref id="17" revision="1"/>
</article_galley>

How do I make OJS include my images? Do I need to move them to a certain folder on server? Or can I reference them somehow in the XML?

Hi @Chrizze

For an image belonging to the HTML file in your example, you would need to add the following element i.e. something like:

<artwork_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="dependent" id="XXX" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
    <revision number="1" genre="Image" filename="images.png" viewable="false" date_uploaded="2018-12-11" date_modified="2018-12-11" filetype="image/png" uploader="christerjohansson">
      <name locale="en_US">image name</name>
      <submission_file_ref id="17" revision="1"/>
      <href src="http://dev.openjournal.tld/imports/html/image.png"></href>
    </revision>
</artwork_file>

Here you would need to adapt the id, filename, date_uploaded, date_modified, filetype, uploader as well as src attribute. Also the value for the image name element.
The ID of this artwork_file element is referenced nowhere.
The artwork_file element references your HTML submission_file with the id=17 and revision=1.
And your article_galley element also references your HTML submission_file with the id=17 and revision=1.
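Since the import XML is generated by software, the element above could be produced in a loop over the indexed image names. This is a rough Python sketch under several assumptions: the helper `build_artwork_file`, the id range starting at 100, and the file name figure1.png are all invented, and the namespace attributes are left to the enclosing document.

```python
import os
import xml.etree.ElementTree as ET

# Explicit map for the three formats mentioned in the thread.
MIME_TYPES = {".gif": "image/gif", ".png": "image/png", ".jpg": "image/jpeg"}


def build_artwork_file(file_id, html_file_id, filename, src_url, uploader, date):
    """Build an <artwork_file> that depends on the HTML file html_file_id (hypothetical helper)."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in MIME_TYPES:
        raise ValueError("unsupported image type: " + filename)
    art = ET.Element("artwork_file", {"stage": "dependent", "id": str(file_id)})
    rev = ET.SubElement(art, "revision", {
        "number": "1",
        "genre": "Image",
        "filename": filename,
        "viewable": "false",
        "date_uploaded": date,
        "date_modified": date,
        "filetype": MIME_TYPES[ext],
        "uploader": uploader,
    })
    name = ET.SubElement(rev, "name", {"locale": "en_US"})
    name.text = filename
    # Points back at the HTML submission_file, as described above.
    ET.SubElement(rev, "submission_file_ref", {"id": str(html_file_id), "revision": "1"})
    ET.SubElement(rev, "href", {"src": src_url})
    return art


images = ["equatn.gif", "figure1.png"]  # figure1.png is a made-up name
base_url = "http://dev.openjournal.tld/imports/html/"
elements = [
    # Each image gets its own id (100, 101, ... chosen arbitrarily here).
    build_artwork_file(100 + i, 17, img, base_url + img, "christerjohansson", "2018-12-11")
    for i, img in enumerate(images)
]
for el in elements:
    print(ET.tostring(el, encoding="unicode"))
```

Giving every artwork_file its own id keeps it distinct from the HTML file it depends on, which is the point of adapting the id in the example above.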

I hope that helps…

Best,
Bozana

Thank you very much for this snippet.
One question: does it matter where in the structure of my XML I put it?
Does it belong within or , or anywhere within ?

Regards,
Christer Johansson

Hi @Chrizze

It is provided in the same way as the submission_file element; see for example this sample: ojs/sample.xml at stable-3_1_2 · pkp/ojs · GitHub

Best,
Bozana

@bozana
First off, I would like to say thank you for all your help and your enormous patience! :slight_smile:

My OJS version is 3.1.1.4 using Mysqli on MariaDB 10.1.26 (Debian 9).

I have been trying and trying to import these HTML files and their dependent images. And I have seen some inconsistency in how OJS interprets the incoming XML (Native XML plugin).

In the first attempts, I followed your example and used the XML snippets from the mentioned sample (3.1.2 native). But OJS does not import the HTML file at all, nor does it show it.

This XML did not work (for me),

<submission_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="submission" id="17" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <revision number="1" genre="Article Text" filename="Paper1.html" viewable="true" date_uploaded="1995-04-11" date_modified="2019-02-20" filetype="text/html" uploader="christerjohansson">
    <name locale="en_US">Non-hierarchic document clustering using a genetic algorithm</name>
    <href src="http://hbojs.christerjohansson.net/import/html/Paper1.html"></href>
  </revision>
</submission_file>
<article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="true" galley_type="htmlarticlegalleyplugin" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <name locale="en_US">HTML</name>
  <seq>0</seq>
  <submission_file_ref id="17" revision="1"/>
</article_galley>

Then I reverted back to an old XML that I knew was working in a previous version. And now the HTML-file shows up in the article as intended.

This XML did work (for me),

<submission_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="submission" id="17" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <revision number="1" genre="Article Text" filename="paper1.html" viewable="true" date_uploaded="2016-02-11" date_modified="2016-02-11" filetype="text/html" uploader="christerjohansson">
    <name locale="en_US">Non-hierarchic document clustering using a genetic algorithm</name>
    <href src="http://hbojs.christerjohansson.net/import/html/Paper1.html"></href>
  </revision>
</submission_file>
<article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="true" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <name locale="en_US">Non-hierarchic document clustering using a genetic algorithm</name>
  <seq>0</seq>
  <submission_file_ref id="17" revision="1"/>
</article_galley>

I don’t see any other difference than the name of the article galley, is this why the html-file won’t show up on front end? I am sure it is not, I would like it to be named HTML for better usability. But why is the sample XML not working, while this “custom” snippet is?

I am confused, and very inexperienced in OJS.

Also, upon trying to import just one image like so,

<submission_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="submission" id="17" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <revision number="1" genre="Article Text" filename="Paper1.html" viewable="true" date_uploaded="1995-04-15" date_modified="2019-02-11" filetype="text/html" uploader="christerjohansson">
    <name locale="en_US">Non-hierarchic document clustering using a genetic algorithm</name>
    <name locale="sv_SE">En svensk rubrik</name>
    <href src="http://dev.openjournal.tld/imports/html/Paper1.html"></href>
  </revision>
</submission_file>
<artwork_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" stage="submission" id="17" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <revision number="1" genre="Image" filename="equatn.gif" viewable="true" date_uploaded="1995-04-15" date_modified="2019-02-11" filetype="image/gif" uploader="christerjohansson">
    <name locale="en_US">equatn.gif</name>
    <submission_file_ref id="17" revision="1"/>
    <href src="http://hbojs.christerjohansson.net/import/html/equatn.gif"></href>
  </revision>
</artwork_file>
<article_galley xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" approved="true" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
  <name locale="en_US">HTML</name>
  <seq>0</seq>
  <submission_file_ref id="17" revision="1"/>
</article_galley>

breaks the import somehow, and the HTML files no longer show up on the front end. Removing the artwork_file element makes OJS import the HTML again, but without images. When importing the XML file, OJS gives an error message stating “The revision “1” for submission file “17” would create a duplicate record”.

Am I doing something wrong? (Obviously)
Could you correct my code, please? :slight_smile:
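One thing that can be checked mechanically: in the last snippet both the submission_file and the artwork_file carry id="17" with revision 1, which looks consistent with the "duplicate record" message. Note also that the earlier artwork_file example used its own id together with stage="dependent" and viewable="false". Whether the shared id is the actual cause is an assumption that this thread has not confirmed. Here is a minimal Python sketch that flags such collisions, with an invented function name and a reduced sample:

```python
import xml.etree.ElementTree as ET
from collections import Counter

FILE_TAGS = {"submission_file", "artwork_file"}


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def duplicate_file_revisions(xml_text):
    """Return (id, revision) pairs used by more than one file element."""
    root = ET.fromstring(xml_text)
    seen = Counter()
    for el in root.iter():
        if local(el.tag) not in FILE_TAGS:
            continue
        for rev in el:
            if local(rev.tag) == "revision":
                seen[(el.get("id"), rev.get("number"))] += 1
    return sorted(pair for pair, count in seen.items() if count > 1)


# Reduced sample: the artwork id is a placeholder filled in below.
sample = """
<article>
  <submission_file stage="submission" id="17">
    <revision number="1" filename="Paper1.html"/>
  </submission_file>
  <artwork_file stage="dependent" id="{art_id}">
    <revision number="1" filename="equatn.gif"/>
  </artwork_file>
</article>
"""

print(duplicate_file_revisions(sample.format(art_id="17")))  # shared id
print(duplicate_file_revisions(sample.format(art_id="18")))  # distinct id
```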