DOCX to JATS XML converter

Greetings to all OJS users.

Lens Viewer is a great application for displaying scientific articles in JATS XML.
For our journal I have developed a DOCX to JATS converter, which makes the transformation process more convenient. The link to the project on GitHub: GitHub - Vitaliy-1/DOCX2JATS: Java project, aimed to facilitate DOCX to JATS XML transformation for scientific articles
Because the DOCX (OOXML) format is loosely structured, unlike JATS XML, input articles must be highly structured. Also, because OOXML does not carry the article metadata, the metadata must be entered manually.
My project uses the TEIC stylesheets for the hard work and Java for finer parsing of references (for now only in the AMA and Vancouver citation styles), in-text references, and table and figure labels, titles and captions. Examples of how articles must be formatted in DOCX are in the root directory of the project (article1.docx, article2.docx). For good results, articles MUST follow the same format:

  • 1st level and 2nd level titles for sections and subsections;

  • a separate reference section for the reference list; references in AMA or Vancouver style. Journal articles, books, chapters and conference papers are supported;

  • lists must be real lists in DOCX (ordered and unordered lists are supported);

  • bold and italic text is supported;

  • for in-text references square brackets must be used;

  • in-text references to tables and figures are parsed if they are marked as “tabl 1.”, “table 1”, “fig. 1”, “figure” or their Cyrillic equivalents;

  • the table label and title need to be placed above tables and figures, as: Table 1. Boring table title.

  • table and figure descriptions need to be placed under the table and start with the symbol *;

  • UTF-8 encoding is supported.
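
The in-text conventions above (square brackets for citations, "table 1" / "fig. 1" for floats) can be sketched with two regular expressions. This is only an illustration: the class name, the patterns and the "kind:number" output are assumptions, not the converter's real code, and Cyrillic forms and a bare "figure" without a number are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InTextRefDemo {
    // Bracketed numeric citations such as [1], [2, 3] or [4-6] (assumed shapes).
    private static final Pattern CITATION =
            Pattern.compile("\\[(\\d+(?:\\s*[,-]\\s*\\d+)*)\\]");
    // Table/figure mentions such as "table 1", "tabl 2.", "fig. 3", "Figure 4".
    private static final Pattern FLOAT_REF =
            Pattern.compile("\\b(table?|fig(?:ure)?)\\.?\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    // Returns the content of every bracketed citation, in order of appearance.
    static List<String> citations(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CITATION.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Returns mentions normalised to "table:N" or "figure:N".
    static List<String> floatRefs(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = FLOAT_REF.matcher(text);
        while (m.find()) {
            String kind = m.group(1).toLowerCase(Locale.ROOT).startsWith("t") ? "table" : "figure";
            found.add(kind + ":" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "As shown before [1], [2, 3] and [4-6], see table 1 and Fig. 2.";
        System.out.println(citations(sample));
        System.out.println(floatRefs(sample));
    }
}
```

Anything that does not fit these shapes would simply stay as plain text in this sketch, which is why consistent formatting in the DOCX matters.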

To run the program, Java 8 must be installed. Link to the archive: Releases · Vitaliy-1/DOCX2JATS · GitHub
The archive contains the 1.jar file and a stylesheets folder, which need to be unzipped into one directory. Because I am not a good programmer, the article in DOCX format has to be placed in this folder before the transformation. Suppose the archive is unzipped on drive C in the jats folder, and the input article article1.docx is also there. From the Windows cmd, the user needs to go to this folder and enter:
java -jar 1.jar C:\jats\article1.docx article1.xml
The converter does not parse metadata and formulas. Tables may also need some correction. If the DOCX article is formatted accordingly, the full process of manual correction takes about 30 minutes (in our case). Maybe someone else will find this converter useful.
It should be noted that we use the latest version of Lens Viewer, and the parser converts articles according to its JATS XML support.

Hi @Vitaliy,

Excellent work! I’ll pass this around the team.

Regards,
Alec Smecher
Public Knowledge Project Team

I have tried normally also, but none of them work.

Hi @varshilmehta,

Very good! You have mostly caught the concept. First of all, the references were not parsed as required. The reference section must be named References (lower case after the first letter), not REFERENCES. You can see this part of the code here: DOCX2JATS/transformerBiblAMA.java at master · Vitaliy-1/DOCX2JATS · GitHub
[Rr]eference means that the program will catch a word that starts with R or r (upper or lower case), while eference must be in lower case only. The letter s is optional.
I can make this part of the code case-insensitive in a future release.
After a simple renaming, the converter will parse them as required by the JATS standard.
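
To make the difference concrete, here is a minimal sketch of that pattern next to a case-insensitive variant. It is rebuilt from the description above, not copied from transformerBiblAMA.java; the class and method names are hypothetical, and whether the real code matches the whole heading or only searches inside it is an assumption.

```java
import java.util.regex.Pattern;

final class ReferenceHeadingDemo {
    // As described: R or r, then lower-case "eference", then an optional "s".
    static final Pattern STRICT = Pattern.compile("[Rr]eferences?");
    // Hypothetical relaxed variant: the same word, ignoring case entirely.
    static final Pattern RELAXED = Pattern.compile("references?", Pattern.CASE_INSENSITIVE);

    // True when the whole heading (ignoring surrounding spaces) fits the strict pattern.
    static boolean matchesStrict(String heading) {
        return STRICT.matcher(heading.trim()).matches();
    }

    // True when the whole heading fits the case-insensitive pattern.
    static boolean matchesRelaxed(String heading) {
        return RELAXED.matcher(heading.trim()).matches();
    }

    public static void main(String[] args) {
        for (String heading : new String[] {"References", "reference", "REFERENCES"}) {
            System.out.println(heading + ": strict=" + matchesStrict(heading)
                    + ", relaxed=" + matchesRelaxed(heading));
        }
    }
}
```

With the strict pattern REFERENCES is rejected, which is the behaviour described above; the relaxed variant accepts it.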

Secondly, the older Lens Viewer version that is used by default in OJS 3 is not really meant for public use. It checks the publisher name in the JATS XML, and if it is not eLife Sciences it blocks the transformation. So you need to update Lens Viewer to the newest version. It is really not hard. Take a look at this topic: Can't view XML file in OJS 3.0 - #30 by Vitaliy
Let me know if you encounter any problems.

Also check the external links in the reference list. They must be at the end of each reference, according to the Vancouver standard. For example, in the case of a DOI:
Kissane DW, McKenzie M, McKenzie DP, Forbes A, O'Neill I, Bloch S. Psychosocial morbidity associated with patterns of family functioning in palliative care: baseline data from the Family Focused Grief Therapy controlled trial. Palliative Medicine. 2003; 17(6):527-537. DOI: https://doi.org/10.1191/0269216303pm808oa
For a simple URL, just replace DOI with URL. If you do not explicitly state the type of the link (DOI, PMID or URL), the program will treat it as an ordinary URL; in JATS XML these 3 types of links are marked up differently.
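
A rough sketch of how a trailing link could be classified and mapped to JATS markup. The "DOI:", "PMID:" and "URL:" label syntax follows the example above; the class name, the regular expressions and the exact output strings are assumptions for illustration, and the converter's real output may differ. The value is kept exactly as written in the reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceLinkDemo {
    // Assumed label syntax: "DOI: value", "PMID: value" or "URL: value" at the end of a reference.
    private static final Pattern LABELED = Pattern.compile("\\b(DOI|PMID|URL):\\s*(\\S+)\\s*$");
    // Fallback: an unlabeled http(s) link at the end of a reference.
    private static final Pattern BARE_URL = Pattern.compile("(https?://\\S+)\\s*$");

    // Returns a JATS fragment for the trailing link, or "" when there is none.
    static String toJats(String reference) {
        Matcher labeled = LABELED.matcher(reference);
        if (labeled.find()) {
            String value = escape(labeled.group(2));
            switch (labeled.group(1)) {
                case "DOI":
                    return "<pub-id pub-id-type=\"doi\">" + value + "</pub-id>";
                case "PMID":
                    return "<pub-id pub-id-type=\"pmid\">" + value + "</pub-id>";
                default:
                    return extLink(value);
            }
        }
        Matcher bare = BARE_URL.matcher(reference);
        return bare.find() ? extLink(escape(bare.group(1))) : "";
    }

    // Ordinary links become <ext-link>; the value is already escaped by the caller.
    private static String extLink(String url) {
        return "<ext-link ext-link-type=\"uri\" xlink:href=\"" + url + "\">" + url + "</ext-link>";
    }

    // Minimal XML escaping so that characters like & in a URL stay well-formed.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(toJats("Palliative Medicine. 2003; 17(6):527-537. "
                + "DOI: https://doi.org/10.1191/0269216303pm808oa"));
    }
}
```

An unlabeled link falls through to the ext-link branch, which mirrors the "treated as an ordinary URL" rule.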

Thanks, got it. I will update here after going through everything once again. Thanks a lot, bro, for your guide and this Java program.

It is still not working. I replaced the entire Lens folder with Lens 2 and made other changes too.

Can you show me the resulting JATS in a private message or through any file-sharing service?

Actually, after completing the process, I can't see any XML file in the folder. Before, it used to produce an XML file. Now I am using version 1.1.

This error means that the program can't parse the author names in the reference list. Can you provide me with the article DOCX file, so that I can trace the problem and check whether the resulting XML will be rendered by Lens Viewer? For example, by sending a link to the file on Google Drive in a personal message, or by email?

JohnsonJG - Maybe this is because of this reference. The program expects the author surname and given names to be separate. But as I said, it would be better for me to take a look at this DOCX, transform it and check it with eLife Lens Viewer.
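
To illustrate why a token like JohnsonJG is a problem, here is a minimal sketch that splits a Vancouver-style author into surname and initials and returns an empty result when the two are not separated by a space. The class name and the pattern are assumptions, not the converter's actual parsing code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuthorNameDemo {
    // Assumed shape: surname (may contain spaces, apostrophes, hyphens),
    // one space, then 1 to 4 upper-case initials, e.g. "Kissane DW" or "O'Neill I".
    private static final Pattern AUTHOR =
            Pattern.compile("^(\\p{L}[\\p{L}'-]*(?: \\p{L}[\\p{L}'-]*)*) (\\p{Lu}{1,4})$");

    // Returns {surname, initials}, or empty when the author string does not fit the shape.
    static Optional<String[]> split(String author) {
        Matcher m = AUTHOR.matcher(author.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2)});
    }

    public static void main(String[] args) {
        for (String author : new String[] {"Kissane DW", "O'Neill I", "JohnsonJG"}) {
            Optional<String[]> parts = split(author);
            if (parts.isPresent()) {
                System.out.println(parts.get()[0] + " | " + parts.get()[1]);
            } else {
                // Report the problem instead of aborting the whole run.
                System.out.println("Notice: cannot split author \"" + author + "\"");
            }
        }
    }
}
```

Returning an empty Optional lets the caller print a notice and continue with the next reference rather than stop the transformation.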

@varshilmehta,

I will make the program more user-friendly in this case. There will only be a notice, without breaking the transformation.

Hi @varshilmehta,

I have made a new release with several fixes:

  • The previous error with author parsing is no longer fatal. The program now throws a notice showing where the problem is in the reference list.

  • The program throws a notice stating whether it has found the reference list section.

  • The program can now be run from any directory, and the article file may be anywhere on the file system, as long as the jar file and stylesheets are in the same parent directory and the given path is correct. Tested on Windows.

  • Removed the text that is specific to our journal (ISSN, publisher and journal name) from the nodes.
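
One common way to make a jar independent of the working directory is to locate the stylesheets relative to the jar itself and to resolve the article path against the current directory. The sketch below shows that idea only; the class and method names are hypothetical and this is not necessarily how the release does it.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathResolutionDemo {
    // The stylesheets folder is assumed to sit next to the jar file.
    static Path stylesheetsDir(Path jarFile) {
        return jarFile.toAbsolutePath().getParent().resolve("stylesheets");
    }

    // A relative article path is resolved against the directory the command was run from;
    // an absolute path is returned unchanged by Path.resolve.
    static Path resolveArticle(Path workingDir, String articleArg) {
        return workingDir.resolve(articleArg).normalize();
    }

    public static void main(String[] args) throws Exception {
        // Location of the running code: the jar when started with java -jar, otherwise
        // the classes directory. getCodeSource() can be null under unusual class loaders.
        Path jar = Paths.get(PathResolutionDemo.class.getProtectionDomain()
                .getCodeSource().getLocation().toURI());
        Path workingDir = Paths.get("").toAbsolutePath();
        System.out.println("stylesheets: " + stylesheetsDir(jar));
        if (args.length > 0) {
            System.out.println("article: " + resolveArticle(workingDir, args[0]));
        }
    }
}
```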

Also, if you find problems related to the app in the future, you can open an issue on its GitHub page. That would be more convenient to work with :)

Hmm, the third point in the list is not completely resolved. I will fix it in the future.

Can we show the journal's name, volume, issue and DOI in the info part? I tried adding volume and issue but it didn't work.

Lens Viewer doesn’t render these data.
We use them mostly because they are standard requirements for JATS XML. Also, the resulting JATS XML is transformed to LaTeX (PDF), and this info goes there too.

This version has support for OJS artwork files: GitHub - ajnyga/lensGalley: Galley viewer plugin integrating eLife Lens for OJS 3.0: fork using the latest develop branch of Lens

Hi @varshilmehta,
In the standard Lens Viewer, they (elifesciences.org) use their own links to images, like: https://cdn.elifesciences.org/articles/…
To override this, you need to give a full link to the image (as in our case) or add custom variables to the plugin (as in @ajnyga's plugin version).

Actually, we are planning to stop using Lens Viewer in the future and replace it with a plugin written fully in PHP that does the transformation server-side. As you may know, XML that is transformed in the browser is not indexed by Google. Also, Lens Viewer is not mobile-friendly.

But if we just hook the transformed XML onto the abstract (like ajnyga’s plugin), is it indexed by Google? I think it should be indexed, since the main part is the abstract. Everything else is hooked on!

I am not talking about Google Scholar, which parses article metadata. All article content is also important. Google simply does not see what is inside JATS XML. It sees only HTML.
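
As a rough illustration of what a server-side transformation means, the sketch below applies a toy XSLT stylesheet to a JATS string with the standard javax.xml.transform API, so that the server sends ready HTML instead of XML. The planned plugin is in PHP; Java is used here only to match the converter. The stylesheet and the class name are hypothetical and cover only the title and body paragraphs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ServerSideRenderDemo {
    // Toy stylesheet (an assumption): article title becomes <h1>, body paragraphs become <p>.
    static final String XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='/'><article>"
          + "<h1><xsl:value-of select='//article-title'/></h1>"
          + "<xsl:for-each select='//body//p'><p><xsl:value-of select='.'/></p></xsl:for-each>"
          + "</article></xsl:template></xsl:stylesheet>";

    // Transforms a JATS document into an HTML fragment on the server.
    static String toHtml(String jatsXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(jatsXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("JATS to HTML transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String jats = "<article><front><article-meta><title-group>"
                + "<article-title>Sample title</article-title></title-group></article-meta></front>"
                + "<body><sec><p>First paragraph.</p></sec></body></article>";
        System.out.println(toHtml(jats));
    }
}
```

The resulting markup is part of the page that a crawler receives, which is the difference from rendering the XML in the browser with JavaScript.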