Article fulltext in XML for PMC

Does anyone have experience exporting XML full text for PMC archive submissions? PMC recommends JATS, but I see no such plugin in our installation. Can the METS or ERUDIT plugin be used? Thanks.

Kind Regards


Hi @trace,

OJS by itself doesn’t have enough information to fully generate JATS XML – that would include, for example, semantically marked up XML representing the full text of the article. OJS will produce partial JATS XML – it’s available via e.g. the OAI interface – but you’d need to do quite a bit of work to get it ready for PubMed Central.
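
As a rough illustration of the OAI route mentioned above, the sketch below builds OAI-PMH request URLs for an OJS journal. The endpoint URL, the journal path, and the `jats` metadata prefix are all assumptions: which prefixes a site offers depends on the OJS version and installed plugins, so it is worth checking the `ListMetadataFormats` response first.

```python
from urllib.parse import urlencode


def oai_url(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL from an endpoint and query arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url.rstrip('/')}?{query}"


# Hypothetical endpoint; replace with your own journal's OAI URL.
BASE = "https://journal.example.org/index.php/myjournal/oai"

# Ask the server which metadata formats it actually offers.
formats_url = oai_url(BASE, "ListMetadataFormats")

# Request records in an assumed "jats" format (it may be named differently, or be absent).
records_url = oai_url(BASE, "ListRecords", metadataPrefix="jats")

print(formats_url)
print(records_url)
```

The returned records would still only carry the partial JATS described above, not marked-up full text.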

We have another project working on automated document parsing, and our goal for that is to produce high-quality JATS XML automatically. Tagging @axfelix, who can provide some specifics.

Alec Smecher
Public Knowledge Project Team

Hi Jan,

We’re working on a new experimental stack for parsing Word or PDF documents into fully structured JATS XML. You can try out a standalone version here, look at the code here, or obtain an (early) plugin for integrating that stack in OJS here: GitHub - pkp/ojs-markup (a Public Knowledge Project Open Journal Systems (OJS) plugin for converting various document types to XML, PDF and HTML).

Or, if you just want to dump out a JATS <front> stub (i.e. just article metadata) from the articles in an OJS site, I have some code that can easily be rigged up to do that here: GitHub - axfelix/metadump (dumps out OJS article metadata from the DB and puts it all in JATS stubs).
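
To give an idea of what a JATS <front> stub looks like, here is a minimal sketch using only the Python standard library. It is not the metadump code; the function name, its parameters, and the sample metadata (including the DOI) are hypothetical, and the output has not been validated against the JATS DTD. A real PMC submission would need more (journal identifiers, ISSN, publication dates, and so on).

```python
import xml.etree.ElementTree as ET


def jats_front_stub(journal_title, article_title, authors, doi=None):
    """Return a minimal JATS <article> element containing only a <front>."""
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")

    # Journal-level metadata
    journal_meta = ET.SubElement(front, "journal-meta")
    title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_group, "journal-title").text = journal_title

    # Article-level metadata: identifier, title, then contributors
    article_meta = ET.SubElement(front, "article-meta")
    if doi:
        ET.SubElement(article_meta, "article-id", {"pub-id-type": "doi"}).text = doi
    article_titles = ET.SubElement(article_meta, "title-group")
    ET.SubElement(article_titles, "article-title").text = article_title

    contrib_group = ET.SubElement(article_meta, "contrib-group")
    for surname, given_names in authors:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given_names
    return article


# Placeholder metadata for illustration only.
stub = jats_front_stub(
    "Example Journal",
    "An Example Article",
    [("Doe", "Jane")],
    doi="10.1234/example.1",
)
xml_text = ET.tostring(stub, encoding="unicode")
print(xml_text)
```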

Hope that helps! All of our new XML functionality is under active development.


Thanks. I’ve tried out the application by converting a PDF. The result is not bad, I think, but I have to analyze it further. Which DTD are you referring to here? This one: JATS: Journal Archiving and Interchange Tag Set: DTD?

Yup, that’s the one, green JATS. There might be a mismatch in the output DTD currently, I need to fix that (JATS updates a lot), but the differences should be very minor.

Unfortunately, the application doesn’t recognize subscript and superscript characters in the text.

The application doesn’t currently process PDF files. It always fails, even on files it processed before. Thanks for fixing.

Hi @trace,

There’s a CrossRef API issue at the moment that’s causing some problems for our stack – this seems to have been going on since Friday and was just brought to my attention this morning. I’m monitoring it and will be working on a better failover scenario for when this happens in the future.

Everything seems to be fixed now.

Greetings @axfelix,

After parsing a doc file, I see that all elements inside “( )” are transformed into citation references. Maybe that is not a good move? Our authors use these symbols extensively for other purposes. Can I somehow disable this option and leave only “[ ]” for references?

Hi Vitaliy,

It shouldn’t necessarily be all elements inside parentheses that are transformed to a citation reference, but it is quite possible that our citation parser is being too aggressive with your document formatting.

We don’t have a toggle for individual document elements, but if you’re interested in changing the code paths for classifying inline references in Word documents, you’ll find those here: meTypeset/ at master · MartinPaulEve/meTypeset · GitHub
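
For anyone experimenting with that idea, the sketch below shows one way a bracket-only heuristic could look. It is a standalone illustration, not meTypeset's actual classifier: the regular expression and function name are hypothetical, and it assumes numeric citations such as [3] or [1, 2].

```python
import re

# Assumed citation style: numbers in square brackets, e.g. [3], [1, 2], [4-6].
BRACKET_CITATION = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")


def find_bracket_citations(text):
    """Return the contents of square-bracket numeric citations, ignoring parentheses."""
    return BRACKET_CITATION.findall(text)


sample = "The yield (about 40%) was higher than reported [3], see also [1, 2]."
print(find_bracket_citations(sample))
```

Text in parentheses is left alone here, so ordinary parenthetical remarks would not be treated as references.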

Many Thanks!

This parser is a really great thing. How do you do this? :)

Years and years of work combining the output of lots of different open source libraries!

Many thanks for the fix!

Hi @trace, @Vitaliy, @nef and others – we’re going to be running a closed beta for the OTS and Substance Editor integration with OJS 3 pretty soon. If you’re interested in participating, please send me a private message on this forum with your email address, or email me. Thanks!