Article fulltext in XML for PMC

Hi
Does anyone have experience with exporting XML fulltext for PMC archive submissions? PMC recommends JATS, but I see no such plugin within our installation. Can the METS or ERUDIT plugin be used? Thanks.

Kind Regards

Jan

Hi @trace,

OJS by itself doesn’t have enough information to fully generate JATS XML – that would include, for example, semantically marked up XML representing the full text of the article. OJS will produce partial JATS XML – it’s available via e.g. the OAI interface – but you’d need to do quite a bit of work to get it ready for PubMed Central.
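To give a rough idea of what "partial JATS via the OAI interface" looks like in practice, here is a minimal Python sketch that pulls one record from an OJS OAI-PMH endpoint. The base URL, the identifier, and the `jats` metadataPrefix are assumptions for illustration; run the `ListMetadataFormats` verb first to see which formats your installation actually advertises.

```python
# Minimal sketch: fetch a single record from an OJS OAI-PMH endpoint.
# Base URL, identifier, and the "jats" metadataPrefix are placeholders --
# check ListMetadataFormats to see what your install really exposes.
import urllib.parse
import urllib.request

BASE = "https://example-journal.org/index.php/journal/oai"  # hypothetical endpoint

def oai_get(params):
    url = BASE + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# 1) See which metadata formats the server offers (e.g. oai_dc, jats, ...).
print(oai_get({"verb": "ListMetadataFormats"}))

# 2) Fetch one article's record in the JATS format, if it is advertised.
record = oai_get({
    "verb": "GetRecord",
    "metadataPrefix": "jats",                              # assumption: format named "jats"
    "identifier": "oai:example-journal.org:article/123",   # hypothetical identifier
})
print(record)
```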

We have another project working on automated document parsing, and our goal for that is to produce high-quality JATS XML automatically. Tagging @axfelix, who can provide some specifics.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi Jan,

We’re working on a new experimental stack for parsing Word or PDF documents into fully structured JATS XML. You can try out a standalone version at http://pkp-udev.lib.sfu.ca/ and look at the code at https://github.com/pkp/xmlps. There is an (early) plugin for integrating that stack into OJS at https://github.com/pkp/ojs-markup. Or, if you just want to dump a JATS <front> stub (i.e. just article metadata) for the articles in an OJS site, I have some code that can easily be rigged up to do that at https://github.com/axfelix/metadump.
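For anyone wondering what a bare JATS <front> stub amounts to, here is a small Python sketch that builds one with ElementTree from a metadata dictionary. The field names and values are invented for illustration; this is not the metadump code itself, just the general shape of the output.

```python
# Rough sketch of a JATS <front> metadata stub built with ElementTree.
# Field names/values are invented; real JATS allows many more optional elements.
import xml.etree.ElementTree as ET

meta = {
    "journal_title": "Example Journal",
    "issn": "1234-5678",
    "article_title": "An Example Article",
    "surname": "Doe",
    "given_names": "Jane",
}

article = ET.Element("article", {"article-type": "research-article"})
front = ET.SubElement(article, "front")

journal_meta = ET.SubElement(front, "journal-meta")
jtg = ET.SubElement(journal_meta, "journal-title-group")
ET.SubElement(jtg, "journal-title").text = meta["journal_title"]
ET.SubElement(journal_meta, "issn", {"pub-type": "epub"}).text = meta["issn"]

article_meta = ET.SubElement(front, "article-meta")
tg = ET.SubElement(article_meta, "title-group")
ET.SubElement(tg, "article-title").text = meta["article_title"]
cg = ET.SubElement(article_meta, "contrib-group")
contrib = ET.SubElement(cg, "contrib", {"contrib-type": "author"})
name = ET.SubElement(contrib, "name")
ET.SubElement(name, "surname").text = meta["surname"]
ET.SubElement(name, "given-names").text = meta["given_names"]

print(ET.tostring(article, encoding="unicode"))
```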

Hope that helps! All of our new XML functionality is under active development.


Thanks. I’ve tried out the application at http://pkp-udev.lib.sfu.ca/ by converting a PDF. The result is not bad, I think, but I have to analyze it further. Which DTD do you refer to here? The JATS Journal Archiving and Interchange Tag Set DTD?
Jan

Yup, that’s the one, green JATS. There might be a mismatch in the output DTD currently; I need to fix that (JATS updates a lot), but the differences should be very minor.
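If you want to check which DTD version a converted file actually declares (and optionally validate it against a local copy of the Archiving/green DTD), a small lxml sketch like the following should do it. The file names here are placeholders, not paths the converter produces.

```python
# Sketch: inspect the DOCTYPE of a converted JATS file and, optionally,
# validate it against a locally downloaded copy of the Archiving (green) DTD.
# File paths are placeholders.
from lxml import etree

doc = etree.parse("converted-article.xml")
info = doc.docinfo
print("Public ID:", info.public_id)   # the JATS DTD public identifier the file declares
print("System ID:", info.system_url)  # the .dtd file the document points at

# Optional strict validation against a local DTD copy:
# dtd = etree.DTD(open("JATS-archivearticle1.dtd", "rb"))
# print("Valid:", dtd.validate(doc))
```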

Unfortunately the application doesn’t recognize subscript and superscript characters in the text.

The converter at http://pkp-udev.lib.sfu.ca/ doesn’t process PDF files at the moment. It always fails, even with files it processed before. Thanks for fixing.

Hi @trace,

There’s a CrossRef API issue at the moment that’s causing some problems for our stack – this seems to have been going on since Friday and was just brought to my attention this morning. I’m monitoring it and will be working on a better failover scenario for when this happens in the future.
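I can't speak to exactly how the stack calls CrossRef internally, but as a general illustration of the kind of failover being described, here is a hedged Python sketch that retries a CrossRef works lookup a few times and degrades gracefully instead of failing the whole conversion. The endpoint is the public api.crossref.org works API; the function name, query, and retry policy are assumptions.

```python
# Illustration only: retry a CrossRef lookup and degrade gracefully (return None)
# instead of letting a CrossRef outage fail the whole conversion job.
# This is not the actual XMLPS code, just the general failover idea.
import json
import time
import urllib.parse
import urllib.request

def crossref_lookup(query, retries=3, delay=2.0):
    url = "https://api.crossref.org/works?" + urllib.parse.urlencode(
        {"query.bibliographic": query, "rows": 1}
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                items = json.load(resp)["message"]["items"]
                return items[0] if items else None
        except Exception:
            time.sleep(delay * (attempt + 1))  # simple backoff before retrying
    return None  # give up quietly; the caller keeps the citation unlinked

match = crossref_lookup("Smith 2010 Example article title")
print(match["DOI"] if match else "no match / CrossRef unavailable")
```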

Everything seems to be fixed now.


Greetings @axfelix,

After parsing a doc file I see that all elements inside “( )” are transformed into citation references. Maybe that is not a good move? Our authors extensively use these symbols for other purposes. Can I somehow disable this option and leave only “[ ]” for references?

Hi Vitaliy,

It shouldn’t necessarily be all elements inside parentheses that are transformed to a citation reference, but it is quite possible that our citation parser is being too aggressive with your document formatting.

We don’t have a toggle for individual document elements, but if you’re interested in changing the code paths for classifying inline references in Word documents, you’ll find those in referencelinker.py in the MartinPaulEve/meTypeset repository on GitHub.
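Not the meTypeset code itself, but to illustrate the kind of change that would involve, here is a toy classifier that only treats square-bracketed numbers like [1] or [5-7] as inline citation candidates and leaves parenthesised text alone. The function name and regex are made up for the example.

```python
# Toy illustration (not meTypeset): treat only square-bracketed numbers as
# citation candidates, so parenthesised text like "(Ivanov, 2012)" is left alone.
import re

BRACKET_CITATION = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")

def find_citation_candidates(text):
    """Return (span, label) pairs for things that look like [1] or [2,5-7]."""
    return [(m.span(), m.group(1)) for m in BRACKET_CITATION.finditer(text)]

sample = "Earlier work (Ivanov, 2012) disagrees [3], but later studies [5-7] concur."
print(find_citation_candidates(sample))
# Only the [3] and [5-7] spans are flagged; the parenthetical author-year is ignored.
```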


Many Thanks!

This parser is a really great thing. How do you do this? :slight_smile:

Years and years of work combining the output of lots of different open source libraries!

Many thanks for the fix!

Hi @trace, @Vitaliy, @nef and others – we’re going to be running a closed beta for the OTS and Substance Editor integration with OJS 3 pretty soon. If you’re interested in participating, please send me a private message on this forum with your email address, or email me at garnett@sfu.ca. Thanks!