OJS import plugin for LaTeX

About a year ago I looked at using OJS for our new journal, but I quickly gave up for several reasons that aren't relevant here. Our society currently has two journals that “sort of” run on OJS. They don’t use the reviewing part, and they don’t use the editing process; they use it only for hosting. As a result, they have to do a lot of manual work to import the papers into OJS, and the process often introduces errors along the way.

After I gave up on OJS, I started building a system to compile LaTeX in the cloud and extract metadata from it during the compilation process. A paper on this appeared on arXiv and in the TeX Users Group journal in 2023. The system is nearing completion, and it produces PDF/A along with Crossref XML for DOI registration (including the bibliographic references, multiple affiliations, ROR IDs, funding information, etc.). It also has a copy editing workflow built into it, but that is optional. I’m now considering whether to write an OJS import plugin. The way I imagine it happening is that authors would be directed from OJS to the external system to upload their final versions, and OJS would then import the PDF/A and metadata once compilation is complete. The journal could choose which copy editing workflow to use, so that the paper is imported either before or after copy editing. There are advantages to using the external copy editing platform, because it recognizes many problems that arise during the LaTeX compilation phase (e.g., missing DOIs on references, missing metadata, widows, orphans, etc.).
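Roughly, I imagine the client side of the handoff looking something like the sketch below. Everything in it is hypothetical: the endpoint paths, field names, and payload shapes are placeholders I made up for illustration, not a real OJS integration.

```python
import requests

# Hypothetical API bases: EXTERNAL is the LaTeX compilation platform,
# OJS is the journal's REST API. Neither reflects a real deployment.
EXTERNAL = "https://latex-platform.example.org/api"
OJS = "https://journal.example.org/index.php/journal/api/v1"
OJS_AUTH = {"Authorization": "Bearer <api-key>"}

def import_final_version(paper_id: str) -> None:
    # 1. Fetch the compiled PDF/A and the metadata extracted during compilation.
    meta = requests.get(f"{EXTERNAL}/papers/{paper_id}/metadata").json()
    pdfa = requests.get(f"{EXTERNAL}/papers/{paper_id}/pdfa").content

    # 2. Create the publication record in OJS (payload shape is a guess).
    resp = requests.post(f"{OJS}/submissions", headers=OJS_AUTH,
                         json={"title": meta["title"], "abstract": meta["abstract"]})
    submission_id = resp.json()["id"]

    # 3. Attach the PDF/A as the galley file.
    requests.post(f"{OJS}/submissions/{submission_id}/files", headers=OJS_AUTH,
                  files={"file": ("article.pdf", pdfa, "application/pdf")})
```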

I assume this would be useful for journals in computer science, mathematics, and physics (and possibly others, like economics). I know it would be useful for our journals. There are two questions, however:

  1. are there any successful examples of an import plugin? When I looked in the past, the only import plugin was barely functional and kept breaking when new versions of OJS rolled out. The sample plugin is an export plugin, and it’s not clear which schema is stable for imports. The external system is written in Python, but I can also write PHP. The plugin would use a REST API to talk to the external system.
  2. is there sufficient demand for direct OJS support for LaTeX?

We are very much interested in this feature!


Hi @kmccurley, we are trying to make all first “Feature Request” posts follow the same structure, to make the requests easier to understand and to ensure no relevant information is missing.

Would you mind re-editing your first post to follow this template?

Describe the problem you would like to solve
Example: Our editors need a way to […]

Describe the solution you’d like
Tell us how you would like this problem to be solved.

Who is asking for this feature?
Tell us what kind of users are requesting this feature. Example: Journal Editors, Journal Administrators, Technical Support, Authors, Reviewers, etc.

Additional information
Add any other information or screenshots about the feature request here.

You can use this post as a reference.

Please don’t reply to this post: as soon as you make the changes, we will remove it to avoid adding noise to your FR thread.

Thanks for your help.

@Dulip_Withanage and @ronste1 aren’t you both working on projects that are helping with this?

Yes, I have a LaTeX plugin for OJS 3.3.

I am currently on holiday, but I can answer further questions or help out once I am back.


The problem is bigger than just running LaTeX: ours is designed to extract metadata during the compilation without errors. LaTeX plugins also suffer from a serious security issue: it is unsafe to accept LaTeX sources from authors and run them outside a sandbox. See arXiv:2102.00856. Still, this plugin should be a good basis for me to start from. Thanks!
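For what it’s worth, the minimum precaution when compiling author-supplied sources is to disable shell escape and run the compiler in a throwaway directory with a timeout. Here is a minimal sketch of that; note it is not a full sandbox, since the paper cited above describes attacks these flags do not stop, so the whole thing should still run inside a container or VM.

```python
import pathlib
import subprocess
import tempfile

def compile_untrusted(tex_source: str) -> bytes:
    """Compile untrusted LaTeX with shell escape disabled, in a throwaway
    directory, with a timeout. A minimal sketch, not a complete sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "main.tex").write_text(tex_source)
        subprocess.run(
            ["pdflatex", "-no-shell-escape", "-interaction=nonstopmode", "main.tex"],
            cwd=tmp, timeout=60, check=True, capture_output=True,
        )
        return (pathlib.Path(tmp) / "main.pdf").read_bytes()
```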


I would also be interested in this.

@kmccurley
Yes, please update us if there are any developments on your side.

Just an update. I think it is very unlikely that I will implement a plugin to access our platform for compiling LaTeX and extracting metadata.

There are several reasons for this:

  1. I’ve written PHP since 1995, but I find that I like it less and less each year. I doubt that I could find anyone to maintain the code. All of our new systems are being written in Python, which means the interaction with OJS would have to go through external APIs.
  2. the import/export story for OJS looks like a complete mess. The native import XML format is unstable from one version to another, and the XML schema is poorly documented. There is also evidence that the native import plugin is being deprecated in favor of a REST API (which is also thinly documented). I would have to write a Python client for it.
  3. Our platform is separated into three distinct pieces with APIs between them:
    a) submission and review (we’re currently using HotCRP pretty much as a black box).
    b) production, copy editing, and metadata extraction. This is the part that is LaTeX-centric.
    c) public indexing, archiving, registration, and hosting.
    I can only imagine us using OJS for c) and frankly that is the easiest part for us to just replace.

Hi @kmccurley,

Constructive criticism is welcome; “xyz is a mess” is a bit hard for us to work with.

The XML import/export toolkit is not intended to be stable across releases; it was written to facilitate getting batch data into OJS, and shouldn’t be used e.g. as an archival format. We will continue to support it for the foreseeable future, but are not in a rush to improve it beyond quality-of-life and error fixing – our future interoperability/data exchange plans are to leverage the REST API, development on which will have much better knock-on benefits for other parts of the system.

If you ran into specific limitations in the REST API documentation, please note them here and I might be able to address them.

Regards,
Alec Smecher
Public Knowledge Project Team


I apologize for my previous statement that import/export is a complete mess. I should have had something more constructive to say. That comment was motivated in part by looking at the PLN plugin for long term preservation. The whole purpose of long term preservation is to write data in a format that will still be readable 10, 20, or a hundred years later. Such a format should not be dependent on the technology of the day, but the native import/export plugin seems to be at the other end of the spectrum, being completely dependent on a short-lived version of software.

Unfortunately the PLN plugin currently has a dependency on the native import/export plugin. Since you say that the native import/export plugin is not intended to be used across different versions, that seems to imply that the PKP Preservation Network has a built-in flaw.

I can appreciate that it was convenient to tie the object model of a given version to the export schema, but the whole point of digital communication and preservation is to preserve the scholarly record across time and technology. This is one reason why people who work in digital preservation work so hard on schemas that are independent of technology: the whole point is to convey the work to someone without a dependency on what they use to read it. Building such a thing is perhaps a big project, but several groups have already done the heavy lifting to define and evolve a schema, namely JATS, Crossref (for a journal issue), and PDF/A (for media). Some other media formats have their own weak efforts at technology-independent serialization, like docx.

JATS has been criticized by some because it’s hard to convert common document formats into JATS. Full conversion to JATS from another format like PDF or docx or LaTeX is problematic, in part because JATS embodies structure but not appearance. The main problem arises in the <body> tag, but that is optional. Moreover, the <body> tag has a <media> tag to enclose the raw document in whatever format is appropriate (PDF/A, Microsoft Word, LaTeX, or whatever). The <back> section mostly captures bibliographic references in a structured way, but that is also optional. What JATS would provide is a way to transmit the metadata and media about a publication in a well-thought-out XML schema. It’s possible that not all metadata elements in the native PKP format would map directly into the JATS <front> section, but for that there is a <custom-meta> tag that can hold arbitrary key-value pairs. These need not be recognizable by other technologies, but can be used to transmit things that are needed for OJS import.
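To make the shape concrete, here is a minimal sketch of that structure built with Python’s standard library. The element names follow JATS 1.3, but the metadata values and the custom key (“ojs-section”) are made up for illustration:

```python
import xml.etree.ElementTree as ET

ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")

article = ET.Element("article", {"dtd-version": "1.3"})

# <front>: the structured metadata about the publication.
front = ET.SubElement(article, "front")
meta = ET.SubElement(front, "article-meta")
ET.SubElement(meta, "article-id", {"pub-id-type": "doi"}).text = "10.1234/example.1"
titles = ET.SubElement(meta, "title-group")
ET.SubElement(titles, "article-title").text = "An example article"

# <custom-meta-group>: arbitrary key-value pairs, e.g. fields only OJS needs.
group = ET.SubElement(meta, "custom-meta-group")
custom = ET.SubElement(group, "custom-meta")
ET.SubElement(custom, "meta-name").text = "ojs-section"
ET.SubElement(custom, "meta-value").text = "Articles"

# <body> need not be a full conversion; <media> can wrap the raw document.
body = ET.SubElement(article, "body")
ET.SubElement(body, "media", {
    "mimetype": "application", "mime-subtype": "pdf",
    "{http://www.w3.org/1999/xlink}href": "article.pdf",
})

print(ET.tostring(article, encoding="unicode"))
```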

No schema is perfect, which is why people continue to evolve them. JATS is now on version 1.3, and there are proposals in discussion for 1.4. JATS also lacks a way to bundle multiple articles into an issue or a volume, but that part is covered by Crossref. Crossref is likewise evolving its standards for things like capturing funding and affiliation structure. If OJS is serious about building a long-term future for digital preservation and interoperability, then the team should consider embracing a well-documented serialization format like JATS more strongly. Clearly the strategy of trying to maintain the native XML import/export plugin isn’t serving the need. When I looked at how I would import documents into OJS, I couldn’t find a plausible strategy.

After looking at the REST API for a few minutes, I immediately spotted some deficiencies:

  1. when creating a publication, supportAgencies is an array of strings with no structure. This means there are no ROR or FundRef IDs, no department, no country, no grant ID, etc. Others are far ahead on this. (The sketch after this list makes the contrast concrete.)
  2. when creating a publication, the citations are also just raw strings without structure. There is not even a DOI, and no clue about how the raw strings would be displayed. Are they HTML? Is inline mathematics allowed in either MathML or LaTeX format?
  3. disciplines are just a list of strings, which apparently ignores any existing structured hierarchy of taxonomies like the Library of Congress, NCBI, ACM, AMS, or those maintained by other disciplines.
  4. when creating a contributor, affiliations also has no structure. At least this is now an array of strings instead of a single string.
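To illustrate the gap, here is the flat payload the API accepts today next to the kind of structure that ROR, the Crossref Funder Registry, and DOIs already carry. The structured field names and identifiers below are made up for illustration, not a proposal for a concrete OJS schema:

```python
# What the REST API accepts today: bare strings.
flat = {
    "supportAgencies": ["Example Science Foundation"],
    "citations": ["A. Author. Some paper. Some Journal, 2020."],
    "disciplines": ["Mathematics"],
}

# The structure external standards already provide (all values are fake).
structured = {
    "supportAgencies": [{
        "name": "Example Science Foundation",
        "ror": "https://ror.org/0example00",   # ROR ID
        "funderId": "10.13039/000000000",      # Crossref Funder Registry DOI
        "grantId": "ESF-12345",
        "country": "US",
    }],
    "citations": [{
        "raw": "A. Author. Some paper. Some Journal, 2020.",
        "doi": "10.1000/example",
    }],
    "disciplines": [{"scheme": "MSC2020", "code": "11T71"}],
}
```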

These things are related to the fact that the internal schema in OJS is lagging behind external standards for metadata about publications. It has now been over two years since I mentioned that OJS doesn’t adequately capture multiple affiliations per author, even though this is incredibly common in many disciplines. It may also be related to the fact that you’re trying to map everything to a PHP object, which then stores elements into a column in a relational database. Going forward, the complete mapping of fields to columns in tables of a relational database doesn’t scale well. I can only imagine what it must cost on average to fully populate a publication from the database, much less an issue with hundreds of articles.

OJS has a lot of technical debt due to the age of the project, and I know it would be pointless to have architectural discussions in this thread. On the other hand, I think your future schema migrations would be well served to pay more attention to external schema requirements and less attention to breaking every object apart into a column in a relational database. The API looks like it is designed to serve that rather than any external data exchange.

Hi @kmccurley,

I’ll try to break this into the right fragments:

The whole purpose of long term preservation is to write data in a format that will still be readable 10, 20, or a hundred years later…

Agreed in principle, with some mitigating arguments. When the PKP|PN got its start a decade ago, JATS wasn’t as prominent a candidate for archiving. PKP|PN chose the XML import/export format with a plan to use XSL documents to permit easy forward-porting of older XML to a current format. This made it quick to launch and get out into the wild. The PKP|PN is a “dark archive”, intended to meet an archiving gap in getting a large body of OJS journals out there without institutional support; for example, by contrast to something like the Wayback Machine, it’s designed with an expectation that someone from PKP will take action when a journal goes down in order to stand an archived copy back up.

We have had some brief conversations about a shift to a more preservation-friendly format, and JATS is a natural contender, but those haven’t been conclusive. After a decade on the same infrastructure, we’d probably want to change some other things about PKP|PN given the opportunity and funding. Meanwhile the current design serves a need.

Clearly the strategy of trying to maintain the native XML import/export plugin isn’t serving the need. When I looked at how I would import documents into OJS, I couldn’t find a plausible strategy.

The biggest use case for the XML import/export tools is the one it was originally written for: back-issue import.

We’ve opted not to expand the current XML import/export toolset beyond published content (e.g., to support more entities like peer reviews), despite significant demand, because we would like to draw together import/export, user interface, and 3rd-party integration needs around the REST API. This will be much more maintainable than a sprawling XML toolset maintained just for its own sake. As a result, while we do make quality-of-life improvements here and there, the XML implementation is fairly stagnant (especially around things like error handling, entity relationships, etc.). It’s still frequently used for back-issue migration, but it’s known to be fussy and incomplete. We are happy to review proposals for change, and there are third parties working on related items (see e.g. Extend native import/export plugin to include additional entities · Issue #3261 · pkp/pkp-lib · GitHub).

This has been explored at a few sprints; you might find context in past sprint reports.

…when creating a publication, supportAgencies is an array of strings with no structure. This means there are no ROR or FundRef IDs, no department, no country, no grant ID, etc. Others are far ahead on this.

There is a ROR plugin and a Funding plugin.

…when creating a publication, the citations are also just raw strings without structure.

There is 3rd-party work on this that I’ve seen demoed and which we hope to integrate into OJS 3.5. Some details here: https://projects.tib.eu/komet/en/

Is inline mathematics allowed in either MathML or LaTeX format?

In published articles, you can do this with e.g. https://www.mathjax.org/ or https://latex.js.org. Submission titles have limited formatting starting with OJS 3.4.0, but I’m not aware of a way we can include formulae in titles that would play well with downstream services.

disciplines are just a list of strings, which apparently ignores any existing structured hierarchy of taxonomies like the Library of Congress, NCBI, ACM, AMS, or those maintained by other disciplines.

This is explored in Support browsing by keyword or subject · Issue #4932 · pkp/pkp-lib · GitHub and a couple of related issues. Long story short, we haven’t been able to find a global vocabulary that we can just adopt wholesale: it needs to be well translated, openly licensed, and applicable to the community at large. That’s probably an impossible combination, so we would like to add better generalized support for swapping in vocabularies.

when creating a contributor, affiliations also has no structure. At least this is now an array of strings instead of a single string.

Yes, we’ve explored this in Need to support multiple author affiliations · Issue #7135 · pkp/pkp-lib · GitHub.

It may also be related to the fact that you’re trying to map everything to a PHP object, which then stores elements into a column in a relational database. Going forward, the complete mapping of fields to columns in tables of a relational database doesn’t scale well. I can only imagine what it must cost on average to fully populate a publication from the database, much less an issue with hundreds of articles.

I get the sense you’re coming at OJS from a different design culture, which is fine, and we have a lot to learn from other approaches. But there is a methodology here, and it’s a lot more nuanced than more columns vs. fewer columns (at the risk of oversimplifying that conversation). You might have to spend time engaging with the details before some of it starts to fit together.

Regards,
Alec Smecher
Public Knowledge Project Team
