Corrupt html file using native file import

Hi. I built an XML file to import HTML files into OJS 3.1.2.1. It works fine for 99% of the HTML files (all from the same website), but a couple of them get corrupted when they are pulled in.
The file contents look like this: ‹ í½a`I–
I checked the physical file as well, not just the file as served through the web browser, and it’s corrupted.

Everything else about the record looks fine (abstract, author data, etc.). Any idea what this could be?
Here is the XML:

<?xml version="1.0"?>
<issues xmlns="http://pkp.sfu.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pkp.sfu.ca native.xsd">
 <issue published="1">
  <issue_identification>
   <number>89</number>
   <year>2018</year>
   <title>Number 89: Spring 2018</title>
  </issue_identification>
  <date_published>2020-02-18</date_published>
  <sections>
   <section ref="ART">
    <abbrev>ART</abbrev>
    <policy/>
    <title>Articles</title>
   </section>
  </sections>
  <articles>
   <article section_ref="ART" stage="production" date_published="2020-02-18" seq="1" language="">
    <id type="doi" advice="update">10.5062/F4668BF1</id>
    <title locale="en_US">Citation Analysis of Ph.D. Theses at Faculty of Science, University of Ibadan, Nigeria</title>
    <abstract>The authors analyzed 21,005 ..... </abstract>
    <subjects><subject>Citation analysis</subject></subjects>
    <authors>
     <author user_group_ref="Author" include_in_browse="true" primary_contact="true">
      <givenname>Malik </givenname><familyname></familyname><email>removedForPrivacy@gmail.com</email>
     </author>
     <author user_group_ref="Author" include_in_browse="true">
      <givenname>Wole </givenname><familyname></familyname><email>removedForPrivacy@yahoo.co.uk</email>
     </author>
    </authors>
    <submission_file id="1" stage="proof">
     <revision genre="Article Text" number="1" filetype="text/html" filename="refereed3.html">
      <name>refereed3.html</name>
      <href src="http://www.istl.org/18-spring/refereed3.html" mime_type="text/html"/>
     </revision>
    </submission_file>
    <article_galley>
     <name>HTML</name>
     <seq>1</seq>
     <submission_file_ref id="1" revision="1"/>
    </article_galley>
    <pages></pages>
   </article>
  </articles>
 </issue>
</issues>

I noticed it’s not the XML; it has to do with the curl command, since
curl http://www.istl.org/18-spring/refereed3.html outputs corrupt characters.

Hi @jhennig,

That page has a text encoding of windows-1252; you’ll need to convert it to UTF8 using a tool like iconv.
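
A conversion along those lines might look like the minimal sketch below. The file names are hypothetical, and it assumes the page really is windows-1252; here a tiny sample file stands in for the downloaded page.

```shell
# Hypothetical file names; assumes the source really is windows-1252.
# Create a one-line sample containing byte 0xE9 (octal 351), which is "é" in windows-1252.
printf 'caf\351\n' > sample-1252.html

# Convert to UTF-8. iconv exits non-zero if it meets a byte it cannot map.
iconv -f WINDOWS-1252 -t UTF-8 sample-1252.html > sample-utf8.html
```

After the conversion, the single byte 0xE9 becomes the two-byte UTF-8 sequence c3 a9.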

Regards,
Alec Smecher
Public Knowledge Project Team

Thanks Alec,
I was playing around with curl outside of OJS and noticed that adding
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
fixed the problem as well.
Using mb_convert_encoding($data, 'HTML-ENTITIES', 'auto'); or utf8_decode($data);
also works. I’m not sure why the gzip option works.

Anyway, I don’t have access to the server with the HTML files on it.
I think that the gzip solution might be quicker at this point.
Do you know which OJS file contains the curl code?

Jeremy

Hi @jhennig,

It’s in lib/pkp/plugins/importexport/native/filter/NativeXmlSubmissionFileFilter.inc.php in the handleRevisionChildElement function. Look for case 'href'.

Regards,
Alec Smecher
Public Knowledge Project Team


After playing around, I’m not entirely certain that the windows-1252 encoding was the problem.
The web server seemed to be sending the file gzip-compressed, and OJS was not decompressing it.
After a few tests (not extensive), the code below seemed to work.
I slightly modified the href section of the handleRevisionChildElement function in lib/pkp/plugins/importexport/native/filter/NativeXmlSubmissionFileFilter.inc.php. It checks whether the contents need to be decoded, decodes them if so, and writes the result to $temporaryFilename.

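// Fetch the remote file, then look for the gzip magic bytes (1f 8b) plus the deflate method byte (08).
$contents = $wrapper->contents();
$is_gzip = strncmp($contents, "\x1f\x8b\x08", 3) === 0; // byte-wise comparison, safe for binary data
// Decompress only when the body is still gzip-encoded; otherwise write it unchanged.
$contents = $is_gzip ? gzdecode($contents) : $contents;
file_put_contents($temporaryFilename, $contents);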
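
The same check can be tried from the command line before touching OJS. This is only a sketch with hypothetical file names: it simulates a download whose body is still gzip-compressed, reads the first three bytes, and decompresses only when they match the gzip signature.

```shell
# Hypothetical file names. Simulate a download whose body is still gzip-compressed.
printf '<html>ok</html>\n' | gzip -c > download.html

# Read the first three bytes as hex; gzip streams start with 1f 8b 08.
magic=$(head -c 3 download.html | od -An -tx1 | tr -d ' \n')

if [ "$magic" = "1f8b08" ]; then
    # Still compressed: decompress into the final file.
    gzip -dc download.html > decoded.html
else
    # Already plain: copy it unchanged.
    cp download.html decoded.html
fi
```

When fetching a page by hand, curl's --compressed option asks the server for a compressed response and decodes it automatically, which is the command-line counterpart of setting CURLOPT_ENCODING.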