JATS XML to embedded HTML article

ajnyga · March 24, 2017, 2:03pm

I have been planning to build a plugin which parses JATS XML into a simple HTML article. Instead of presenting the HTML article on a separate page, I was thinking that you could use a hook in the article abstract page and make it appear there as a full-text article. I mean something like this: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170224

I am thinking of using the lens converter to do the job, but have not started yet. The other option could be a simple PHP-based parser, but either way I would like to have it done “on the fly” straight from the XML file.

I have seen you write a lot about XML, lensGalley and htmlGalley. Have you worked on anything similar to what I described?

Vitaliy · March 24, 2017, 4:05pm

Hi @ajnyga,

I have worked with Ambra (PLoS), where they use server-side transformation through Java and XSLT. For example, simple open-source JATS-to-HTML XSLT stylesheets can be found here: GitHub - ncbi/JATSPreviewStylesheets: JATS Preview Stylesheets.

So, if I were writing this code in Java, I would simply create an XSLT stylesheet and invoke the transformation with the Saxon-HE or Xalan library. Actually, we use this mechanism to produce HTML galley files for our journal. But I haven’t started learning PHP yet, and I don’t know how to use XSLT from this programming language.

According to the Saxon developers, this transformer works with PHP: http://www.saxonica.com/saxon-c/index.xml. So it can be done somehow.

As for the lens converter (Lens Viewer?), it transforms XML to HTML on the client side. It has one very severe limitation: Google does not index XML.

ajnyga · March 24, 2017, 4:09pm

Thanks, good to know about Google not indexing XML. I wonder if the developers of Lens are aware?

Vitaliy · March 24, 2017, 5:27pm

I think that they don’t care too much, because elifesciences uses Lens Viewer only as an addition to the main article galley, which they get through server-side XML to HTML transformation.

asmecher · March 24, 2017, 5:34pm

Hi all,

In reference to Google Scholar indexing – I think they may eventually want to have access to the JATS XML for high-quality indexing, but so far haven’t given clear guidance about this.

Regards,
Alec Smecher
Public Knowledge Project Team

Vitaliy · March 24, 2017, 5:46pm

I have just taken a look at the PHP library and have to say that it has all the necessary tools for parsing XML without XSLT: XPath and DOM.

For now I am writing a JATS to LaTeX converter in Java, where I first create a document object model and then write data from the XML to it. Nothing more than proper inheritance and getters with setters.

In PHP it is just a matter of copying these steps. I have also thought about a plugin that would show the HTML galley right on the article detail page. So if you are really planning to develop such a plugin, I can help you with the parser after getting familiar with PHP.

ajnyga · March 24, 2017, 7:29pm

There are a lot of tools to work with XML in PHP: http://php.net/manual/en/refs.xml.php
Especially: http://php.net/manual/en/class.xsltprocessor.php
That is definitely not my speciality however.
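
A minimal sketch of how the XSLTProcessor class linked above could be used, assuming the PHP xsl extension is enabled. The tiny inline stylesheet and the function name `transformJats` are made up for illustration; in practice a full stylesheet such as the NCBI preview stylesheets would be loaded from a file instead.

```php
<?php
// Sketch only: server-side JATS -> HTML with PHP's XSLTProcessor (needs ext-xsl).
// The stylesheet below is a toy example, not a complete JATS mapping.

function transformJats(string $jatsXml, string $xslt): string
{
    $xml = new DOMDocument();
    $xml->loadXML($jatsXml);

    $xsl = new DOMDocument();
    $xsl->loadXML($xslt);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($xsl);

    $html = $processor->transformToXml($xml);
    if (!is_string($html)) {
        throw new RuntimeException('XSLT transformation failed');
    }
    return $html;
}

// Hypothetical stylesheet covering only sections, titles, paragraphs and italics.
$xslt = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>
  <xsl:template match="/article">
    <div class="article-body"><xsl:apply-templates select="body/sec"/></div>
  </xsl:template>
  <xsl:template match="sec"><section><xsl:apply-templates/></section></xsl:template>
  <xsl:template match="sec/title"><h2><xsl:value-of select="."/></h2></xsl:template>
  <xsl:template match="p"><p><xsl:apply-templates/></p></xsl:template>
  <xsl:template match="italic"><i><xsl:apply-templates/></i></xsl:template>
</xsl:stylesheet>
XSL;

$jats = '<article><body><sec><title>Intro</title><p>Some <italic>text</italic>.</p></sec></body></article>';
$html = transformJats($jats, $xslt);
```

Because the transformation runs on the server, the resulting HTML can be embedded in the page that crawlers see, which is the point raised earlier about client-side conversion.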

Basically we would only need to read and parse the body and back parts. Pretty much everything in front part is already visible via OJS article metadata so no need to touch that. Also, the html we need is of course basically just the body part in html.
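
To illustrate reading only the body and back parts, here is a small sketch using DOMDocument and DOMXPath. The function name `extractBodyAndBack` is hypothetical, and it assumes the usual JATS layout with `front`, `body` and `back` as direct children of `article`.

```php
<?php
// Sketch only: pull out <body> and <back> and ignore <front>,
// since the front matter is already available as OJS metadata.

function extractBodyAndBack(string $jatsXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($jatsXml);
    $xpath = new DOMXPath($doc);

    $parts = [];
    foreach (['body', 'back'] as $name) {
        // item(0) is null when the element is missing
        $node = $xpath->query("/article/$name")->item(0);
        $parts[$name] = $node !== null ? $doc->saveXML($node) : null;
    }
    return $parts;
}
```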

What needs to be considered is how to choose the correct galley file for processing, that is, if there is more than one XML galley available. Also, I have not figured out yet how to add artwork files for XML galleys. That is probably only possible when an HTML galley is available? Which is of course a problem with lens as well.

I could create the basic plugin structure as far as printing out the raw XML to the abstract page. Then all we need is the filter for the first version.

Edit: GitHub - PeerJ/jats-conversion: Conversion and validation for JATS XML

Vitaliy · March 24, 2017, 8:18pm

I have created a custom web-accessible directory where all images are stored. To add them to XML or HTML we simply add a link to the corresponding file.

As for the parser, I see the following steps:

1. Import the XML.
2. Create an object representation. For example, create a class for the whole document; classes for front, body and back that extend the document class; a section class that extends the body class; paragraph, title and subsection classes that extend section, etc. Add setters and getters to the classes that will contain data.
3. Iterate through the XML node tree with XPath or DOM, retrieve the data and write it to the corresponding objects in our tree.
4. Retrieve metadata from OJS and write it to the corresponding objects in our tree.
5. Do whatever is needed with all the data: write to HTML, MySQL, PDF (through TeX, CaSSius etc.), or even write a PHP-based editor for article content.

The only problems may be with tables and formulas, as they require writing more complicated functions. On the other hand, tables in JATS XML and HTML are identical, so they can be transferred directly. Also, I don’t know the OJS back-end structure in particular or PHP in general, but the latter is quick to amend.

As for choosing the correct galley file, I think that there should be only one XML galley file available per article. So not a problem.
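
A rough sketch of transferring tables directly, assuming the JATS file uses the XHTML table model inside `table-wrap`. The function name `jatsTablesToHtml` is invented for this example; labels and captions of the table wrapper are not handled here.

```php
<?php
// Sketch only: copy <table> markup out of JATS <table-wrap> elements unchanged.

function jatsTablesToHtml(string $jatsXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($jatsXml);
    $xpath = new DOMXPath($doc);

    $tables = [];
    foreach ($xpath->query('//table-wrap/table') as $table) {
        // LIBXML_NOEMPTYTAG asks for <td></td> rather than <td/>,
        // which is safer when the markup is served as HTML.
        $tables[] = $doc->saveXML($table, LIBXML_NOEMPTYTAG);
    }
    return $tables;
}
```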

ajnyga · March 24, 2017, 8:45pm

Hi,

I think that there should be a similar way of handling artwork files as with HTML galleys. I asked this elsewhere, let’s see what Alec thinks. We have around 40 journals now, and if XML is something that is being picked up, there is no way for me to handle image uploads manually. I have to confess that I have not looked into HTML galleys and OJS that much, so I do not know if there are problems with images as well.

Regarding the parser, that sounds good. However, I still think that we do not need the JATS XML metadata (front) or the OJS metadata for that matter. But maybe you were thinking a wider use for the filter than I was.

With formulas, we could just go with https://www.mathjax.org/ and this could even be something that can be switched on/off in the plugin settings, because only some journals need it. This would mean that MathML would not need to be converted at all.

With tables I also think that no problem there, if of course the original in XML is well formed.

Edit: with several XML files I was mainly thinking of different language versions, but this is probably a marginal problem and possibly even easily checked against the chosen UI language.
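
A sketch of checking galleys against the chosen UI language. The galley records here are plain arrays with hypothetical keys (`label`, `locale`, `fileType`); real OJS galley objects expose this information differently, so this only shows the selection logic.

```php
<?php
// Sketch only: pick the XML galley matching the UI locale, else the first XML galley.
// The array shape is an assumption for illustration, not the OJS API.

function chooseXmlGalley(array $galleys, string $uiLocale): ?array
{
    $xmlGalleys = array_values(array_filter($galleys, function (array $galley): bool {
        return in_array($galley['fileType'], ['text/xml', 'application/xml'], true);
    }));

    foreach ($xmlGalleys as $galley) {
        if ($galley['locale'] === $uiLocale) {
            return $galley;
        }
    }
    // No galley in the UI language: fall back to the first XML galley, if any
    return $xmlGalleys[0] ?? null;
}
```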

Vitaliy · March 24, 2017, 9:15pm

OK, I will think about this idea for a while.

Also, maybe it is possible to start the transformation not from JATS, but from DOCX. That is what I have already done in Java, and in the case of an OJS plugin it could be much more useful…

ajnyga · March 25, 2017, 7:01am

Ok, I will definitely try this, I totally understand if you have other things.

I am working on a workflow that uses the markup plugin to generate JATS XML from DOCX. It works fairly well. It sends the DOCX from OJS to the Open Typesetting Stack, and OTS sends an XML file back. I also made a small plugin that generates the front part of the file and shows it on the article metadata page. You can easily copy/paste it from there into the XML file. The Open Typesetting Stack already adds some of the OJS metadata there, but at the moment it is very limited compared to the amount of data available. This will probably be enhanced, and my plugin will then no longer be needed.

The biggest challenge there is of course the references, but while testing this with five of our journals, some of the returned articles had promising results. So next week I will try to figure out the “holy grail” of marking references at the DOCX end.

I also got an answer from Alec regarding artwork with XML files: Artwork files for XML galleys - #2 by asmecher
This would probably help your workflow as well?

I will probably try to add the rewriting of artwork URLs, as is done with the HTML files. I was also wondering whether it would be a good idea to write the publication date (and also issue data if not yet available) to the JATS XML file when the issue is published. There should be a hook for publishing an issue, which should be easy to use from a plugin. I think that this is something PKP might not want in the core.

What remains a problem is PDF. I tried to print a few of the CaSSius examples I found, and although they usually look good on screen, the printed version has many problems. One of our journals (with an editor with excellent technical skills) produces XML by first converting to Markdown (with Pandoc) and then from Markdown to JATS XML. They get really nice PDFs by converting from Markdown to PDF. But the tools they use are mostly command-line tools, and the workflow I am trying to work out should be something that a normal journal editor could handle. Command-line tools are not really among the options.
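
One way the rewriting of artwork URLs could look for JATS, as a sketch: it assumes images are referenced through `graphic` elements with an `xlink:href` attribute and that all files are served from a single base URL. The function name `rewriteGraphicUrls` and the example URL are made up; this is not how the htmlGalley plugin does it.

```php
<?php
// Sketch only: point every <graphic xlink:href="..."> at a public base URL.

const XLINK_NS = 'http://www.w3.org/1999/xlink';

function rewriteGraphicUrls(string $jatsXml, string $baseUrl): string
{
    $doc = new DOMDocument();
    $doc->loadXML($jatsXml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('xlink', XLINK_NS);

    foreach ($xpath->query('//graphic[@xlink:href]') as $graphic) {
        // Keep only the file name and prepend the public base URL
        $file = basename($graphic->getAttributeNS(XLINK_NS, 'href'));
        $graphic->setAttributeNS(
            XLINK_NS,
            'xlink:href',
            rtrim($baseUrl, '/') . '/' . rawurlencode($file)
        );
    }
    return $doc->saveXML();
}
```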

Vitaliy · March 25, 2017, 5:05pm

Hi @ajnyga,

Thanks for the link.

I have checked the PHP syntax. Parsing JATS with this language will look like this:

// Load the JATS file and prepare an XPath context for querying it
$xml = new DOMDocument();
$xml->load("D:/workphp/JATSParser/test.xml");
$xpath = new DOMXPath($xml);
foreach ($xpath->evaluate("/article/body/sec") as $sec) {
	echo "\n";
	foreach ($xpath->evaluate("title|p|fig|sec|table-wrap|list", $sec) as $secContent) {
		if ($secContent->tagName == "title") {
			echo $secContent->nodeValue, "\n";
		} else if ($secContent->tagName == "list") {
			echo "List will appear next \n";
			foreach ($xpath->evaluate("list-item/p", $secContent) as $listItem) {
				echo "listItem: ", $listItem->nodeValue, "\n";
			}
		} else if ($secContent->tagName == "p") {
			echo "\n";
			foreach ($secContent->childNodes as $parContent) {
				if ($parContent->nodeType == XML_TEXT_NODE) {
					echo $parContent->nodeValue;
				} else if ($parContent->tagName == "xref") {
					if ($parContent->getAttribute("ref-type") == "bibr") {
						echo "Citation: ", $parContent->nodeValue;
					} else if ($parContent->getAttribute("ref-type") == "table") {
						echo "Table: ", $parContent->nodeValue;
					} else if ($parContent->getAttribute("ref-type") == "fig") {
						echo "Figure: ", $parContent->nodeValue;
					}
				} else if ($parContent->tagName == "italic") {
					echo "<i>", $parContent->nodeValue, "</i>";
				} else if ($parContent->tagName == "bold") {
					echo "<b>", $parContent->nodeValue, "</b>";
				}
			}
			
		} else if ($secContent->tagName == "sec") {
			echo "\n";
			foreach ($xpath->evaluate("title|p|fig|sec|table-wrap|list", $secContent) as $subSecContent) {
				if ($subSecContent->tagName == "title") {
					echo $subSecContent->nodeValue, "\n";
				} else if ($subSecContent->tagName == "list") {
					echo "List will appear next \n";
					foreach ($xpath->evaluate("list-item/p", $subSecContent) as $listItem) {
						echo "listItem: ", $listItem->nodeValue, "\n";
					}
				} else if ($subSecContent->tagName == "p") {
					echo "\n";
					foreach ($subSecContent->childNodes as $parContent) {
						if ($parContent->nodeType == XML_TEXT_NODE) {
							echo $parContent->nodeValue;
						} else if ($parContent->tagName == "xref") {
							if ($parContent->getAttribute("ref-type") == "bibr") {
								echo "Citation: ", $parContent->nodeValue;
							} else if ($parContent->getAttribute("ref-type") == "table") {
								echo "Table: ", $parContent->nodeValue;
							} else if ($parContent->getAttribute("ref-type") == "fig") {
								echo "Figure: ", $parContent->nodeValue;
							}
						} else if ($parContent->tagName == "italic") {
							echo "<i>", $parContent->nodeValue, "</i>";
						} else if ($parContent->tagName == "bold") {
							echo "<b>", $parContent->nodeValue, "</b>";
						}
					}
				}
			}
		}
	}
}

Also, I have managed to create classes for the document object model and fill them with data from the XML nodes. But I can’t figure out how to get the data back out…

require("classes/section.php"); // link to my custom classes with getters and setters

$xml = new DOMDocument();
$xml->load("D:/workphp/JATSParser/test.xml");
$xpath = new DOMXPath($xml);
// in Java I would have created an ArrayList here with Section objects
foreach ($xpath->evaluate("/article/body/sec") as $sec) {
   echo "\n";
   $section = new Section();
   // do something with $section here
 } // After this I could iterate through Array and receive back data with getter....

@asmecher, also maybe you know how to do this with PHP?

Vitaliy · March 25, 2017, 7:13pm

Found the ArrayList analog in PHP - ArrayObject.
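
For comparison, a plain PHP array also works as an ArrayList analog. The sketch below collects objects while parsing and reads them back through getters; the `Section` class is a simplified stand-in and not the actual class from classes/section.php.

```php
<?php
// Sketch only: store Section objects in an array and read them back later.

class Section
{
    private string $title = '';
    private array $paragraphs = [];

    public function setTitle(string $title): void { $this->title = $title; }
    public function getTitle(): string { return $this->title; }
    public function addParagraph(string $text): void { $this->paragraphs[] = $text; }
    public function getParagraphs(): array { return $this->paragraphs; }
}

function parseSections(string $jatsXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($jatsXml);
    $xpath = new DOMXPath($doc);

    $sections = []; // plays the role of Java's ArrayList<Section>
    foreach ($xpath->query('/article/body/sec') as $sec) {
        $section = new Section();
        $section->setTitle(trim($xpath->evaluate('string(title)', $sec)));
        foreach ($xpath->query('p', $sec) as $p) {
            $section->addParagraph(trim($p->textContent));
        }
        $sections[] = $section; // append, like ArrayList.add()
    }
    return $sections;
}
```

Reading the data back is then an ordinary `foreach ($sections as $section)` loop calling `getTitle()` and `getParagraphs()`.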

So it is not a problem to write a JATS parser that creates HTML and LaTeX as output.

ajnyga · March 25, 2017, 7:29pm

Sounds good, I will probably have time tomorrow to write the basic plugin for fetching the XML galley and hooking into the abstract page. Monday at the latest. I will upload it to GitHub when ready and let you know.

Vitaliy · March 25, 2017, 11:13pm

So, I have written an example and pushed it to GitHub: https://github.com/Vitaliy-1/JATSParser/blob/master/main/Main.php

First, data are transferred to objects and then from objects back to data (from line 91; it can be inserted into HTML, TeX or another text format). I think OOP principles are better here, because the code can be reused. To finish this work, several days are needed for mapping the entire JATS XML. I can do this after I finish another piece of work.

Also, it would be interesting to see what the plugin for fetching the XML file will look like. I have no experience with this at all.

ajnyga · March 26, 2017, 2:52pm

Hi,

The basic plugin is now here: GitHub - ajnyga/embedGalley: OJS3 plugin for visualizing JATS XML galleys

At the moment it searches for an XML galley and, if one is available, fetches the content and embeds it into the article page below the abstract. There is a function there that can be extended to include the parser. I already added a settings form for the plugin, but at the moment there are no settings yet. I think that adding MathJax is definitely one of the things to handle. The plugin should also consider the possibility of having several XML files.

With the full text available, the journal theme plugin should probably also be modified in some way, but that is beyond the scope of this plugin. I have been hoping to develop our theme plugin so that there would be a tabbed view.

I was originally hoping that there would be some sort of library ready for the heavy lifting you have now done from scratch. Are you sure something like that is not already available somewhere? I am worried about your workload.

A question to @asmecher: if several plugins are using the same template hook, is there a way of defining the order in which the plugin output is presented on the page? I had a Disqus plugin enabled with this new plugin and noticed that the Disqus view was shown before the article output, which is of course the wrong order in this case.

Vitaliy · March 26, 2017, 3:06pm

My main aim is to produce LaTeX. HTML will be a side effect. In the case of a simple XML->HTML transformation there is no need to create new objects (3 times less coding).
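
A sketch of what the simple XML->HTML route without intermediate objects could look like: one recursive function walks the DOM and maps JATS tags straight to HTML. The tag map is deliberately tiny and the names `TAG_MAP`, `nodeToHtml` and `bodyToHtml` are invented for this example; figures, tables, lists with labels and cross-references would still need their own handling.

```php
<?php
// Sketch only: direct JATS body -> HTML conversion by recursive DOM traversal.

const TAG_MAP = [
    'sec' => 'section', 'p' => 'p', 'italic' => 'i',
    'bold' => 'b', 'list' => 'ul', 'list-item' => 'li',
];

function nodeToHtml(DOMNode $node, int $level = 1): string
{
    if ($node instanceof DOMText) {
        // Escape text so that & and < from the XML stay valid in HTML
        return htmlspecialchars($node->nodeValue, ENT_NOQUOTES);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // skip comments and processing instructions
    }

    $name = $node->tagName;
    // Content of a <sec> sits one heading level deeper than the section itself
    $childLevel = $name === 'sec' ? $level + 1 : $level;
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= nodeToHtml($child, $childLevel);
    }

    if ($name === 'title') {
        $heading = 'h' . min($level, 6);
        return "<$heading>$inner</$heading>";
    }
    if (isset(TAG_MAP[$name])) {
        $tag = TAG_MAP[$name];
        return "<$tag>$inner</$tag>";
    }
    return $inner; // unknown element: keep its content, drop the tag
}

function bodyToHtml(string $jatsXml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($jatsXml);
    $xpath = new DOMXPath($doc);

    $html = '';
    foreach ($xpath->query('/article/body/sec') as $sec) {
        $html .= nodeToHtml($sec);
    }
    return $html;
}
```

Since sections nest recursively, sub-sections need no separate copy of the paragraph handling; the heading level just increases with depth.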

ajnyga · March 27, 2017, 6:27am

I added the PeerJ jats-conversion to this new version. GitHub - ajnyga/embedGalley: OJS3 plugin for visualizing JATS XML galleys

It is in a way a working plugin now. But there are several issues left:

- The conversion does a lot of unnecessary work, especially with the metadata. Possibly your implementation would be lighter.
- How would this support different citation styles?
- The article should have better CSS. Probably the best solution would be to allow custom CSS for each journal.
- The image URLs etc. need to be worked out. A lot of the details are probably ready in the htmlGalley plugin, and these would be good to handle already with the XML file.

Vitaliy · March 27, 2017, 7:46am

Supporting a different citation style requires rewriting the XSLT parts that handle that part of the conversion…

ajnyga · March 27, 2017, 9:22am

Yes, that should be in the plugin options. I have to look through the OJS code to see if something similar is already available.

I also noticed that for some reason the lib I used links all references to Google Scholar searches, which is a bit weird in my opinion. I will probably fork the lib and make a very simplified version of it. There is a lot of validation code there that the plugin does not need (validation should be done beforehand anyway).