Correct formatting of DOCX files before converting to JATS XML

ActaBiologica · April 17, 2019, 1:08pm

@Vitaliy
Hi, everybody! I tried to compile the .xml galley with docx2jats and need some assistance with proper DOCX formatting before compilation. I downloaded some examples from here: GitHub - Vitaliy-1/DOCX2JATS: Java project, aimed to facilitate DOCX to JATS XML transformation for scientific articles, but I still have some questions: must in-text citations be in brackets, do the references support only one style, and are headers and footers not supported? We, for instance, use APA for references and in-text citations, as in http://journal.asu.ru/biol/article/view/5184/3960
Does it mean we need to prepare two DOCX files, one for the PDF and one for the XML galley? And do we need to add metadata to the XML manually?

Vitaliy · April 17, 2019, 1:20pm

Yes, DOCX2JATS supports only in-text citations in square brackets and the AMA citation style (and not even all of its elements).
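A quick way to see which citation style a manuscript actually uses before conversion is to count both kinds of in-text citation in its plain text. The following is only a minimal sketch and is not part of DOCX2JATS: the function name `count_citation_styles` and both regular expressions are assumptions made for illustration, and the author-year pattern only catches simple single-source citations such as (Smith, 2019).

```python
import re

# Numeric citations in square brackets, e.g. [1], [2, 3], [4-6].
BRACKET_CITATION = re.compile(r"\[\d+(?:\s*[-–,]\s*\d+)*\]")

# Rough APA-like author-year citations, e.g. (Smith, 2019) or (Smith & Lee, 2018a).
# Assumption: one source per pair of parentheses; multi-source citations are not matched.
AUTHOR_YEAR_CITATION = re.compile(r"\([^()\d]+,\s*\d{4}[a-z]?\)")


def count_citation_styles(text: str) -> dict:
    """Count bracketed numeric and author-year citations in plain text."""
    return {
        "bracketed": len(BRACKET_CITATION.findall(text)),
        "author_year": len(AUTHOR_YEAR_CITATION.findall(text)),
    }


if __name__ == "__main__":
    sample = "As shown in [1] and [2, 3], but see (Smith, 2019)."
    print(count_citation_styles(sample))
```

A manuscript with a high `author_year` count would probably need its citations converted to square brackets before it is passed to the converter.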

I am currently developing a better tool: GitHub - Vitaliy-1/docxConverter: Plugin for OJS 3 that parses DOCX and converts it to JATS XML format
This is a plugin for OJS written in PHP with my own parser. I’m planning to make an alpha release this summer (closer to mid-August). I’m also planning to add support for MS Word native citations and the Zotero MS Word plugin, but probably only in the beta.

There are other conversion tools as well: meTypeset (DOCX → JATS) and Grobid (PDF → TEI XML). Grobid is actually quite good for citations and references, but the drawback is that it produces TEI XML as output, which needs additional conversion. Converters from TEI to JATS are available, though. Let me know if you want to try Grobid; I can provide some guidance.

ActaBiologica · April 17, 2019, 2:41pm

Thank you so much for your detailed explanation. I don’t have experience with this software and I’m a bit confused: which is better, DOCX → JATS or PDF → TEI XML? If you suggest it, I will try Grobid. I would appreciate any additional information from you.

Vitaliy · April 17, 2019, 3:04pm

Grobid is good at parsing metadata, such as references, authors, and in-text citations, but bad at parsing the actual text of the article. I believe @Dulip_Withanage knows about tools to convert TEI to JATS.
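To get a feeling for what the parsed references look like, the reference titles can be pulled out of a TEI file with the Python standard library. This is a minimal sketch: the function name `list_reference_titles` is hypothetical, and `SAMPLE_TEI` is a hand-written fragment that only imitates the `listBibl`/`biblStruct` layout of Grobid output; it is not real Grobid output.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written fragment for illustration only (assumption, not real Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic><title level="a" type="main">First example title</title></analytic>
    </biblStruct>
    <biblStruct xml:id="b1">
      <monogr><title level="m">Second example title</title></monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def list_reference_titles(tei_xml: str) -> list:
    """Return the first title found in each biblStruct of the reference list."""
    root = ET.fromstring(tei_xml)
    titles = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        title = bibl.find(".//tei:title", TEI_NS)
        if title is not None and title.text:
            titles.append(title.text.strip())
    return titles


if __name__ == "__main__":
    for title in list_reference_titles(SAMPLE_TEI):
        print(title)
```

Comparing such a list against the reference list in the PDF is a simple way to spot references that were missed or merged.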

The test instance of Grobid is here: http://cloud.science-miner.com/grobid/
Grobid can also be trained to parse better, but this requires some practice. The documentation is here: Home - GROBID Documentation
I’ve already trained Grobid, so if the documentation is not clear I can explain the process.
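For more than a few files, the Grobid web service can also be called from a script instead of the web form. The sketch below uses only the Python standard library and posts a PDF to the `/api/processFulltextDocument` endpoint as the multipart form field `input`, as described in the GROBID service documentation. The base URL `http://localhost:8070` assumes a local Grobid installation on its default port, and the helper names `build_multipart_pdf` and `pdf_to_tei` are made up for this example.

```python
import urllib.request
import uuid


def build_multipart_pdf(field: str, filename: str, data: bytes):
    """Encode one PDF file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_tei(pdf_path: str, base_url: str = "http://localhost:8070") -> str:
    """Send a PDF to a running Grobid service and return the TEI XML as text.

    Assumption: base_url points at your own Grobid instance.
    """
    with open(pdf_path, "rb") as fh:
        pdf_bytes = fh.read()
    body, content_type = build_multipart_pdf("input", "article.pdf", pdf_bytes)
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `pdf_to_tei("article.pdf")` needs a running Grobid server; the returned string can be saved as a `.xml` file for the next conversion step.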

ActaBiologica · April 17, 2019, 5:13pm

@Dulip_Withanage
So, I will examine the documentation and check. In this case, what is the workflow for PDF → XML (like PDF → TEI → XML)? Sometimes, when we deal with archive issues, only the PDFs are available. OJS will later convert the XML into HTML. What do you think?

Dulip_Withanage · April 18, 2019, 10:11am

@ActaBiologica

In this case what is the workflow for PDF->XML? (like pdf-tei-xml). Sometimes, when we deal with archive issues, only the pdfs are available. The OJS will later convert xml into html. What do you think?

As we use JATS as the primary XML language, we assume that all output formats are generated from JATS (or its derivative BITS, which is just an extension for books).

For PDF, your workflow might look like this:

PDF → TEI XML → JATS XML → HTML/PDF
Grobid → meTypeset → output stylesheets or plugins

TEI to JATS conversion using meTypeset:

meTypeset.py tei input_tei.xml <output_folder> [ --puretei or --prettytei ]
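For a whole archive issue, the same command can be generated once per TEI file. This is only a sketch under assumptions: the function name `metypeset_commands`, the folder layout (one output folder per article, named after the TEI file), and the way the script is invoked are illustrative and depend on your installation. The optional flags from the command above are left out here.

```python
import subprocess
from pathlib import Path


def metypeset_commands(tei_dir, output_root, script="meTypeset.py"):
    """Build one 'meTypeset.py tei <input> <output_folder>' command per TEI file.

    Assumption: every *.xml file in tei_dir is a TEI document, and each article
    gets its own output folder named after the file.
    """
    commands = []
    for tei_file in sorted(Path(tei_dir).glob("*.xml")):
        out_dir = Path(output_root) / tei_file.stem
        commands.append([script, "tei", str(tei_file), str(out_dir)])
    return commands


def run_all(tei_dir, output_root):
    """Run the generated commands one after another, stopping on the first failure."""
    for command in metypeset_commands(tei_dir, output_root):
        subprocess.run(command, check=True)
```

Printing the list returned by `metypeset_commands` first is a cheap way to check the paths before anything is executed with `run_all`.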

If you have only PDFs as input, Grobid is the best-known solution we can recommend, as @Vitaliy pointed out, for getting the TEI XML.

ActaBiologica · April 18, 2019, 12:07pm

@Vitaliy, @Dulip_Withanage,
Thanks to you both for your very valuable and detailed comments and explanations. I also appreciate you taking the time to answer!