Pandoc to convert Docx to Jats

Hi, I'm stuck on a project where I need to convert a Docx article into JATS XML.

While looking around I found Pandoc, but I don't know how to use it to convert a docx into JATS XML.

Does anyone have experience with this?

Thanks!

Hi @josuevalrob,

You might be interested in Open Typesetting Stack, which includes Pandoc among other tools.

Regards,
Alec Smecher
Public Knowledge Project Team

This answer may be a bit late, but in my experience converting the docx to HTML first and then to JATS gives better results (at least for me on my Mac).

pandoc --standalone -p -f docx -t html -o article.html input.docx
pandoc --standalone -p -f html -t jats -o article.xml article.html

You may try to convert from docx to jats in one step:
pandoc --standalone -p -f docx -t jats -o article.xml input.docx

If the section numbering of your article gets mangled by the above commands, try adding the -N option (--number-sections):
pandoc --standalone -Np -f docx -t jats -o article.xml input.docx

Me again, and again a little late!

Since I had to invest a lot of time into working with JATS and how to get there, I would like to share some hints here. Perhaps I will share more insight in the future, but this should do for the moment.

First, it is a very bad idea to take the author's docx, run it through a pipeline, and expect finished JATS XML at the other end. The written text and the article layout should be two clearly separated things!

However, if you really want to follow what is (in my opinion) the dark side, you may check out this: GitHub - withanage/heimpt: Heidelberg Monograph PublishingTool (heiMPT) is a stand-alone platform, as well as a plug-in application for OMP. It enables a high degree of automation in the digital publication process.

Recently, I worked out a workflow with an editor to get clean JATS with Texture, but learned that this tool, for all its benefits, is not sufficiently configurable yet. For example, you cannot configure the citation style; by default you get citations like “[3]” in the running text, which are then resolved in the reference list.
You also cannot include alt text for images, and the labels for tables and figures are in English by default, with no option to change them.

That being said, I would recommend going with Markdown. Normally I would suggest LaTeX, but it is a little too complex and most users are afraid of it. So, you can convert your docx to Markdown with pandoc. Subsequently, you can edit everything (insert citations properly, correct the pandoc conversion errors, etc.). In the end you will have a SINGLE file from which you can create both PDF and JATS XML. No post-processing necessary.
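
A minimal sketch of that single-source workflow, assuming pandoc 2.11 or later (where citation processing is built in as --citeproc; older versions used the separate pandoc-citeproc filter instead). The file names input.docx, article.md, article.xml, article.pdf and refs.bib are placeholders for illustration, not anything taken from this thread:

```shell
# Step 1: docx -> Markdown. Embedded images are extracted below the
# current directory so the Markdown file can link to them.
docx_to_markdown() {
  pandoc --from=docx --to=markdown --extract-media=. \
    --output="$2" "$1"
}

# Step 2 (after manual cleanup): the same Markdown file -> JATS and PDF.
# refs.bib is a hypothetical bibliography file; PDF output additionally
# needs a PDF engine such as a LaTeX installation.
markdown_to_outputs() {
  pandoc --standalone --citeproc --bibliography=refs.bib \
    --from=markdown --to=jats --output=article.xml "$1"
  pandoc --standalone --citeproc --bibliography=refs.bib \
    --from=markdown --output=article.pdf "$1"
}
```

Usage would be something like `docx_to_markdown input.docx article.md`, then editing article.md by hand, then `markdown_to_outputs article.md`.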

Of course you need some insight into how pandoc and pandoc-citeproc work and what input they expect, but it is not too complex. Pandoc has a whole syntax for writing scientific articles: Pandoc - Pandoc User’s Guide .
And, when converting to multiple output formats, you may find it useful that pandoc can emit content specifically for one output format and ignore it in the others: Pandoc - Pandoc User’s Guide .
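
The second link above has lost its target, so the following is only a guess at what was meant: pandoc's raw attribute syntax, where a snippet tagged with {=FORMAT} is passed through for that output format and dropped for all others. A small sketch that writes such a Markdown file (the function name write_demo and the file content are made up for illustration):

```shell
# Write a Markdown file with format-specific raw snippets.
# The {=latex} snippet should only survive in LaTeX/PDF output,
# the {=jats} snippet only in JATS output.
write_demo() {
  cat > "$1" <<'EOF'
# Results

Shared text that appears in every output format.

`\newpage`{=latex}

`<!-- only emitted in JATS output -->`{=jats}
EOF
}
```

After `write_demo demo.md`, running `pandoc --to=jats demo.md` and `pandoc --to=latex demo.md` should show each snippet only in its own format.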

I spent around one working week digging into the whole conversion issue and am currently figuring out a workflow that I hope to publish on GitHub.

Did you publish your workflow on GitHub?

Unfortunately, I never got to the point of having a drafted workflow. It is very hard to come up with a generic one.
However, some colleagues from Hamburg put a lot of work into this topic: Modern Publishing . Adapting their workflow will take some time (and technical expertise), but I would say that it is currently the best approach to make the whole process as automated as possible.

Thank you for your response. I could not find any workflow on their website, only a statement of intent. If you know where their workflow is described in more detail, I would appreciate a hint.

Hi @milan88 ,

sorry for the delay.

The project group worked out a process, where docx files are converted to Markdown by Pandoc. The Markdown file then is edited with Zettlr to include citations, images, and proper formatting (because docx is quite hard to convert). Subsequently, they established a GitLab continuous integration workflow that automatically converts your Markdown to all configured output formats (PDF, JATS, HTML etc.).
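
To give an idea of what such a CI job might run, here is a rough sketch of a build script that turns one Markdown source into several formats. It is not the project's actual configuration: the FORMATS list, the function name build_all and the output file naming are assumptions made for illustration.

```shell
# Output formats to build; illustrative, not the project's real setup.
FORMATS="jats html pdf"

# Convert one Markdown file into every format listed in FORMATS.
# Output files reuse the source name with a new extension.
build_all() {
  src=$1
  base=${src%.md}
  for fmt in $FORMATS; do
    case $fmt in
      jats) pandoc --standalone --citeproc --to=jats --output="$base.xml" "$src" ;;
      html) pandoc --standalone --citeproc --to=html --output="$base.html" "$src" ;;
      pdf)  pandoc --standalone --citeproc --output="$base.pdf" "$src" ;;
      *)    echo "unknown format: $fmt" >&2; return 1 ;;
    esac
  done
}
```

A CI job would then only need to call `build_all article.md` and publish the resulting files as artifacts.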

However, to set up this workflow you will need some technical expertise. Furthermore, all journal editors should be familiar with Git (or be willing to learn it), because it is currently essential to this workflow.

Unfortunately, the English documentation is only very basic. The German blog is (a little bit) more complete.

Although the project has ended, you should not hesitate to contact the project team, if you are willing to put the time and work into this. The team is still documenting their results.

Cheers,

Adrian

Hi @GrazingScientist,
Thank you, all of this is useful information and in line with my own plan to do the same.
Kindly,
Milan

Hi,
I just found this thread and would like to offer my help. I am part of the Hamburg team mentioned above and a pandoc dev. Improved JATS support is currently a priority (for a different project, but will be usable with the “modpub” workflow). We’re continuously improving pandoc output.
Cheers,
Albert

Hi! I’d love some help with this. I’m working on the conversion of a lot of issues from a journal. All we have are .doc files. I’ve already tried a lot of tools, but none of them seems to work correctly.
If I’m getting this right, in your method, I would need to convert the .doc file to Markdown, input the references and citations manually and then convert it to the JATS XML. Is there an easier way?
I have experience with Markdown, LaTeX and a little with JATS XML. Writing an article directly in XML would not be a problem, but I’d really like a good workflow to convert the .doc files to JATS XML.
Thanks in advance,
Lucas.

I’m not aware of a simpler solution, I’m afraid. We did a good bit of manual cleanup in our conversion process, especially for tables and citations. For some articles, authors were able to provide us with their citation databases from Zotero/Mendeley/etc. This made the conversion easier. Bibliography extraction from docx is not supported yet in pandoc, but you may have some luck with this ref-extractor. Often, authors wrote the bibliography by hand, in which case there is no way around manual editing.
The advantage of going via Markdown is that conversions from Markdown into other formats like JATS, HTML, EPUB, and PDF work very reliably, which was important for our project. If, however, all you really need is JATS, then there might be better options – but I only know about pandoc.
Is this the kind of info you were looking for?

Hi! Thanks for the reply.
I believe that the solutions that convert directly to JATS are not easier; some of them seem to be even harder, because of the more complex structure of the XML.
The workflow that I’m thinking about right now is:

  • Convert the plain-text references used by the authors to .bib files, which can be imported into Zotero and inserted as citations in the .odt/.doc file.
  • Convert the .odt/.doc file to Markdown using Pandoc. (I’m having trouble with the references, which can’t be converted. Is there an easy way to do this, or will I have to insert the references into the text manually?)
  • Clean up the Markdown file, looking for any problems and inserting JATS metadata.
  • Convert from Markdown to JATS using this tool. I’m not aware of any other way or tool for this step. (Also, I don’t know how to deal with multiple abstracts and keywords in different languages in Markdown. Have you ever done something like this?)
  • Clean up and check the final JATS file.

What do you think? Any tips?
Sorry for the flood of information, I’m really racking my brain over this.
Thanks again,
Lucas.

I’m not aware of a simpler method. We’ve used Zettlr for this, as it has good support for citations. I know people with a similar workflow who use RStudio.

Convert from Markdown to JATS using this tool.

JATS generation has been integrated into pandoc proper, so extra tools are no longer required. Just make sure you are using the latest version (2.11.4 at the time of writing).

Clean up the Markdown file, looking for eventual problems and inserting JATS metadata.

The list of supported metadata and its expected layout is documented on the pandoc website. I wrote these docs, so please let me know if something is amiss. And don’t hesitate to raise an issue if there is anything missing.

Something to watch out for: the BibTeX citation keys should not contain : or /, as pandoc currently has a bug that leads to invalid JATS XML when the keys are not valid XML identifiers.
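
One way to check for this before converting, sketched under the assumption that every entry in the .bib file starts on its own line in the form @type{key, (the function names sanitize_key and list_bad_keys are made up for this example):

```shell
# Replace ':' and '/' in a single citation key with '_'.
sanitize_key() {
  printf '%s\n' "$1" | tr ':/' '__'
}

# Print the opening lines of all BibTeX entries whose key
# contains ':' or '/'. Prints nothing if all keys are fine.
list_bad_keys() {
  grep -E '^@[A-Za-z]+[{][^,]*[:/]' "$1"
}
```

For example, `sanitize_key 'smith:2020/a'` gives smith_2020_a. Keep in mind that a renamed key also has to be changed wherever it is cited in the Markdown file.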

Let me know if you run into issues.

Thanks for the replies and tips! I’ll try it out.
Cheers!

Update: Pandoc can now read EndNote and Zotero citations in Word files when called with --from=docx+citations, so the process should be much simpler now. Citavi support is still pending.
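
A minimal sketch of such a call, assuming a pandoc version recent enough to support the citations extension for docx. The function name and the file names input.docx and article.md are placeholders:

```shell
# docx -> Markdown, keeping citations inserted by the reference
# manager plugin as real pandoc citations instead of plain text.
docx_citations_to_markdown() {
  pandoc --from=docx+citations --to=markdown --output="$2" "$1"
}
```

Called as `docx_citations_to_markdown input.docx article.md`, this should leave citations in the Markdown that can still be processed later, rather than the formatted text the plugin put into the Word file.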