docxConverter & texture compatibility table?

marc · September 9, 2019, 11:16am

Hi,

This summer I got time to test docxConverter, and (despite some issues), I can say, in general, it works like a charm. Impressive work @Vitaliy. Thanks a lot.

IMHO, docxConverter, combined with Texture plugin and oldGregg is, right now, the best toolbox to work with JATS (although all is still beta, I think this set is even easier and better than Marcalyc or SciELO toolsets).

The workflow is perfectly integrated in OJS and is as clean and simple as:

| Upload | > Convert | > Web edit | > Present |
|---|---|---|---|
| OJS | docxConverter | Texture plugin | oldGregg (JATSParser) |

I will test those tools in depth, but I’m wondering whether somebody has already started a compatibility table of the issues found with each plugin, like this:

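| Feature | docxConverter | texture plugin | oldGregg | texture desktop |
|---|---|---|---|---|
| Version | beta2 [0.5.1.0] | [v2.2.0.0] | [1.1.1] | [v2.3] |
| Format | JATS | JATS | JATS | dar (zipped JATS) |
| Bold | | | | |
| Italic | | | | |
| Formulas | (latex) | (klatex) | | (?) |
| … | … | … | … | … |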

I think it can help the community to know how mature the project is and decide whether to adopt the technology or wait.

If nobody did the job, I will publish my testing here in a couple of days, as a wiki, to let everybody update/contribute.

Cheers,
m.

Vitaliy · September 9, 2019, 12:05pm

Hi @marc,

Thanks for taking a look; I would appreciate the testing. Basically, DOCX Converter is still in beta because it requires testing. The only thing that I want to add is support for figures; I still need to look at how they are supported in Google Docs, MS Word and LibreOffice Writer. Unfortunately the format that they produce isn’t strictly OOXML, so extracting some data, like table/figure captions, isn’t an easy task: they are marked as simple paragraphs there.

Old Gregg uses JATS Parser and I’ll decouple them in the next release. The JATS Parser Plugin will allow showing JATS XML on the article landing page; this functionality is already on the master branch, but I need to make some other changes before the next release. This option is available on a per-galley basis as well as site-wide. It will also take some time to adapt the official themes.

marc · September 9, 2019, 6:26pm

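| Feature | docxConverter | texture plugin | oldGregg | texture desktop |
|---|---|---|---|---|
| Version | beta2 [0.5.1.0] | [v2.2.0.0] | [1.1.1] | [v2.3] |
| Format | docx | JATS | JATS | dar (zipped JATS) |
| Bold | | | | |
| Italic | | | | |
| underline | | | | |
| ~~strikethrough~~ | | | | |
| _sub | | | | |
| ^sup | | | | |
| hyperlink | [1] | | | |
| header styles (h1, h2) | [2] | | | |
| unsorted list | | | | |
| sorted list | | | | |
| tables | | | | |
| table styles | [3] | | | |
| table legend | [4] | | | |
| table mergecell/cols | | | | |
| … | … | … | … | … |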

Status legend: “working fine”, “partially working”, “not working”, “not tested yet”, “could not be fixed”.

Comments about the issues:

[1]: Link is shown, but not linked to its destination (href lost).
[2]: Heading styles are lost. Some of the headings are converted to a list in the middle of the document.
[3]: All table formatting is removed. Bold in the header disappears, as do justification, sub, super and lines.
[4]: Table legend is kept as simple text (not as a legend field). See this.

marc · September 9, 2019, 6:27pm

Our pleasure Vitaliy. It’s the least we can do to help.

I don’t know if I understood you. Are Google Docs and LibreOffice supported now, or is that something you would like to include? I won’t admit I said it, but… if you haven’t developed it yet, go first with M$ Word, which is the most common format from authors.

Here is a quick list of the main errors we found in docxConverter (I will report them on GitHub… as soon as I get time):

Headers (h1, h2…) from a converted M$Word document are not converted properly.
Figures: as you said, images are not yet imported.
Citations fail when you have multiple references (e.g. [1,2]).
Formulas are not imported.
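
To make the multiple-reference case concrete, here is a minimal Python sketch of how a grouped numeric citation such as [1,2] or [4-6] could be expanded into individual reference numbers. It is only an illustration: the function names and the regular expression are assumptions, not code from docxConverter, and it handles numeric bracket styles only.

```python
import re

# Assumption: in-text citations look like [1], [1,2] or [4-6] (numeric styles only).
CITATION = re.compile(r"\[(\d+(?:\s*[-–,]\s*\d+)*)\]")


def expand(group):
    """Turn the inside of one bracket, e.g. "1,4-6", into [1, 4, 5, 6]."""
    numbers = []
    for part in group.split(","):
        bounds = [int(n) for n in re.split(r"\s*[-–]\s*", part.strip())]
        # A single number yields a one-element range; "4-6" yields 4, 5, 6.
        numbers.extend(range(bounds[0], bounds[-1] + 1))
    return numbers


def find_citations(text):
    """Return one list of reference numbers per bracketed citation, in order."""
    return [expand(match.group(1)) for match in CITATION.finditer(text)]


print(find_citations("As shown in [1,2] and [4-6], but see [3]."))
```

Author-date styles such as APA would need a different approach, so this covers only one slice of the problem.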

Just in case it’s important… I don’t have M$Word, so I exported from libreOffice.
I attach the testing article I’m using.

This is what I have found so far in docxConverter… I will come back here with comments about Texture and oldGregg.

But I’d like to be more systematic in testing and do better reporting, to be sure I’m not missing something.

Cheers,
m.

Vitaliy · September 10, 2019, 10:23am

Yes, it should be compatible with Google Docs and LibreOffice. At the current stage the functionality is the same. I primarily check the functionality with LibreOffice documents, thus those documents should have fewer bugs.

Can you send me an MS Word document where headers aren’t currently recognized? Let me know if you need my email.

As for figures, citations and formulas, they indeed aren’t supported yet. Extracting images should be a more or less easy task, as they are stored inside the DOCX archive. So they can be not only parsed but also uploaded to the system and attached to the galley file.
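
As a rough illustration of that point, the following Python sketch lists and reads the images embedded in a DOCX file. It assumes the usual DOCX layout, where embedded pictures sit under `word/media/` inside the zip archive; the function names and the in-memory sample archive are made up for the example and are not part of the plugin.

```python
import io
import zipfile

# Assumption: embedded images live under word/media/ inside the DOCX archive.
MEDIA_PREFIX = "word/media/"
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def extract_images(docx_file):
    """Return {file name: raw bytes} for each embedded PNG/JPEG image."""
    images = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.startswith(MEDIA_PREFIX) and name.lower().endswith(IMAGE_EXTENSIONS):
                images[name[len(MEDIA_PREFIX):]] = archive.read(name)
    return images


def build_sample_docx():
    """Build a tiny in-memory archive that only mimics the DOCX layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/media/image1.png", b"fake-png-bytes")
    buffer.seek(0)
    return buffer


print(sorted(extract_images(build_sample_docx())))
```

Uploading the extracted bytes and attaching them to the galley is a separate step that depends on the OJS side and is not shown here.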

Parsing citations is a bit problematic as they can be in various formats. I plan to support Zotero, and probably native MS Word and LibreOffice citations. I’m not sure what to do if citations are just raw text. It would be a headache to parse those with regex: it would need regex patterns for each citation style and for each reference type (book, journal article, chapter, thesis, etc.). Machine learning could be a solution, but there aren’t any good PHP libraries for that, particularly ones that implement deep learning networks or even CRF/hidden Markov models.

I haven’t seen yet how formulas are implemented in OOXML compared to JATS. I need to dive into guidelines but as far as I know they are highly structured, making the support for them quite possible.

BTW, DOCX Converter uses its own parsing mechanisms; it doesn’t rely on any 3rd-party library, like the TEIC Stylesheets that are used by meTypeset or OxGarage. Thus, I hope, it doesn’t inherit their problems. The drawback is that it takes more time to develop.

marc · September 10, 2019, 9:42pm

> Yes, it should be compatible with Google Docs and LibreOffice.

I submitted a LibreOffice (ODT) file and the “Convert to JATS XML” button is not shown.
About Google Docs, OJS3 only lets me submit a file… not a URL.
I’m testing docxConverter 0.5.1.0.

If you can convert from all those sources, probably docxConverter is not the best name.
What about jatsConverter ?

> I primarily check the functionality with LibreOffice documents, thus those documents should have fewer bugs.

Waiting for your answer to discover how to test this… In confidence, my final goal is to find a way to cover the whole workflow with free software only.

> Can you send me an MS Word document where headers aren’t currently recognized? Let me know if you need my email.

I’d love to, but Discourse only lets us upload images; please mail me at marc.bria(spiral)gmail.com.

I will send you the ODT and DOCX that I use for testing.
Citations are in APA, inserted via the Zotero plugin (ODT with RefMarks, DOCX without).
I will also test Vancouver (which, as I read, is one of the styles already implemented).

I hope those documents will be good enough for testing (they cover common needs), but please let me know if you want me to include something else, or modify the files yourself (metadata in the doc? header/footers? Different citation format? TOC?..) and send them back to me.

> As for figures, citations and formulas, they indeed aren’t supported yet.

Yes, I noticed, and you have explained it once or twice in the forum. Take it easy.

> Extracting images should be a more or less easy task, as they are stored inside the DOCX archive. So they can be not only parsed but also uploaded to the system and attached to the galley file.

When you add a figure with the Texture plugin, the figure is attached to the XML document as a “Dependent file”. Both documents (ODT and DOCX) are zip files with pictures inside. So I suspect the plan is to parse the source, build the right JATS tags, then unzip the image and attach it as a “dependent” file, isn’t it?

I suggest focusing on embedded images (the usual and also the easiest case) and handling externally linked images in the future.

BTW… as a feature request (for the far, far future), what about including the DAR format in your converter’s source list? I mean, in the end it is JATS with files and a manifest, and it would make OJS compatible with texture-desktop (which in some contexts could be more comfortable to edit with than texture-web).

> Parsing citations is a bit problematic as they can be in various formats.

“A bit” is you being ironic, isn’t it?
IMHO, this is the most difficult task you will have in this project.

> I plan to support Zotero, and probably native MS Word and LibreOffice citations.

If you want a second opinion here… I think it is better to cover one of them and do it really well (covering all citation formats) than to go with all three at the same time (and cover them only partially).

Probably my bet would be Zotero, because it’s free software, multiplatform and a “must-have” tool for authors, but I called one of my editors and he said that most authors are more familiar with Word, so…

> It would need regex patterns for each citation style and for each reference type (book, journal article, chapter, thesis, etc.).

Time to quote Zawinski?

I doubt we need to do this in docxConverter. I mean, if authors deliver references in something more or less standard (BibTeX? JSON?), we can probably import them later with Texture.

Even better: if authors submit their DOCX and BibTeX files, would it be difficult for docxConverter to take both and do the job?

Trying to understand all the citation formats is crazy, so in the end I’m suggesting we rely on citation tools for this task… I don’t know if I explained myself.

> I haven’t seen yet how formulas are implemented in OOXML compared to JATS. I need to dive into guidelines but as far as I know they are highly structured, making the support for them quite possible.

And it looks like the JATS standard is not clear about this…
Texture reads a LaTeX variant, while JATS4R is working with MathML.
I think we need to clarify the direction here before starting to code.

> BTW, DOCX Converter uses its own parsing mechanisms; it doesn’t rely on any 3rd-party library, like the TEIC Stylesheets that are used by meTypeset or OxGarage. Thus, I hope, it doesn’t inherit their problems. The drawback is that it takes more time to develop.

All the projects you mention are great but you did an impressive job Vitaliy.
Thanks again.

PS: Testing table updated. It’s a wiki page, so anybody can join the testing or fix it if something is wrong.

Vitaliy · September 11, 2019, 10:26am

I meant that you can export them from LibreOffice and Google Docs as DOCX documents. I’m looking at the ODT format right now: it’s also an archive that contains XML, but in a different format. I can’t say that it’s impossible to support, but is it necessary, given that you can convert a document to DOCX with almost any document editor?

I’m developing with LibreOffice but using Save as… DOCX feature.

Exactly.

Yes. It wouldn’t be hard to support as I’m targeting the output to be compatible with Texture.

Agree.

Vitaliy · September 11, 2019, 10:40am

Unfortunately MS Word and LibreOffice don’t support table/figure legends as they should according to the OOXML standard. It’s basically simple text there, even if you mark it as a figure/table caption. If you look at the OOXML standard regarding the caption, it should look like:

<w:tblPr>
  <w:tblCaption w:val="This is the caption text"/>
</w:tblPr>

But MS Word and LibreOffice Writer save it as a simple text run, e.g.:

<w:p>
  <w:r>
    <w:rPr></w:rPr>
    <w:t>: This is caption</w:t>
  </w:r>
</w:p>

It’s not according to the guidelines and sits completely outside of the table structure.
Thus, I would say that DOCX Converter supports captions but MS Word and LibreOffice don’t. Although that doesn’t solve the problem.
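
The difference between the two forms can be sketched in a few lines of Python. This is an illustration under assumptions, not the converter’s code: it reads `w:tblCaption` from `w:tblPr` when present and returns `None` otherwise, which is what happens when the caption is only a neighbouring paragraph. The sample XML and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample: one table with a structured caption, one without,
# followed by a plain paragraph that merely looks like a caption.
BODY_XML = """<w:body xmlns:w="{w}">
  <w:tbl><w:tblPr><w:tblCaption w:val="This is the caption text"/></w:tblPr></w:tbl>
  <w:tbl><w:tblPr/></w:tbl>
  <w:p><w:r><w:t>: This is caption</w:t></w:r></w:p>
</w:body>""".format(w=W)


def table_caption(table):
    """Return the structured caption of a w:tbl element, or None if absent."""
    node = table.find("w:tblPr/w:tblCaption", NS)
    return node.get("{%s}val" % W) if node is not None else None


def captions(body_xml):
    """Return one caption (or None) per table, in document order."""
    body = ET.fromstring(body_xml)
    return [table_caption(table) for table in body.findall("w:tbl", NS)]


print(captions(BODY_XML))
```

Recovering the paragraph form would need a heuristic (for example, looking at the paragraph right before or after the table), which is exactly the part that cannot be done reliably.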

marc · September 11, 2019, 10:56am

Ok. I misunderstood this point.
Actually I’m doing exactly the same.

> Unfortunately MS Word and LibreOffice don’t support table/figure legends as they should according to the OOXML standard.

Then it’s something we can live with. No transformation will be perfect, and we have Texture for the final fixing.

About the preliminary issue list, I’m curious about links [1] (losing href) and headings (h1, h2…) [2]. Does it also happen to you?

Vitaliy · September 11, 2019, 4:23pm

Ahh yes, links are stored in a separate file inside the DOCX archive; I’ve added a fix.
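
For readers wondering why the href gets lost: in a DOCX file the `w:hyperlink` element in `word/document.xml` carries only a relationship id, and the actual URL lives in `word/_rels/document.xml.rels`. A minimal Python sketch of the lookup follows; the sample XML, the URL and the function name are illustrative assumptions, not the plugin’s implementation.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical stand-in for word/document.xml.
DOCUMENT_XML = """<w:document xmlns:w="{w}" xmlns:r="{r}">
  <w:body><w:p>
    <w:hyperlink r:id="rId5"><w:r><w:t>Example site</w:t></w:r></w:hyperlink>
  </w:p></w:body>
</w:document>""".format(w=W, r=R)

# Hypothetical stand-in for word/_rels/document.xml.rels.
RELS_XML = """<Relationships xmlns="{pkg}">
  <Relationship Id="rId5" Type="{r}/hyperlink"
                Target="https://example.org/" TargetMode="External"/>
</Relationships>""".format(pkg=PKG, r=R)


def resolve_links(document_xml, rels_xml):
    """Return (link text, target URL) pairs; the URL is None if unresolved."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter("{%s}Relationship" % PKG)
    }
    links = []
    for link in ET.fromstring(document_xml).iter("{%s}hyperlink" % W):
        text = "".join(t.text or "" for t in link.iter("{%s}t" % W))
        links.append((text, targets.get(link.get("{%s}id" % R))))
    return links


print(resolve_links(DOCUMENT_XML, RELS_XML))
```

A parser that reads only the document body sees the link text but never the target, which matches issue [1] in the table.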

Headings should be parsed normally. I’ll take a look at your example and let you know.

Vitaliy · September 19, 2019, 3:41pm

I’ve added support for images in JPEG and PNG formats on the master branch. I’ll put together the 3rd beta release soon, after other minor fixes are ready.

marc · September 19, 2019, 7:39pm

Today I made a presentation about JATS explaining all the plugins involved, bugs, alternatives…
A guy told me “looks like we are becoming Vitaliy-dependent”

Will this new beta include the fix for headers built with LibreOffice in languages other than English?

Let me know if you want me to do a new round of testing.

Thanks a lot, Vitaliy.

Vitaliy · September 20, 2019, 7:44am

Yes, I’ve made a fix based on your example, it will be included in the beta3 release. But it will require some modifications in the future based on other real examples. Maybe I’ll end up with something like a dictionary. I’m still exploring OOXML specifications and real LibreOffice/MS Word outputs.
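
One way such a dictionary might look, sketched in Python: normalise the style name (strip accents, lowercase) and look the word up in a set of known heading names. The entries below are illustrative guesses at localized style names, not values taken from the plugin or verified against real LibreOffice/MS Word output, and all names in the code are hypothetical.

```python
import re
import unicodedata

# Illustrative entries only (English, French, Spanish, Catalan, German);
# a real dictionary would have to be built from real documents.
HEADING_WORDS = {"heading", "titre", "titulo", "titol", "uberschrift"}

# One word made of letters, optional whitespace, then a single level digit.
STYLE_PATTERN = re.compile(r"^([^\W\d_]+)\s*(\d)$")


def normalise(name):
    """Strip accents and lowercase, so "Título" becomes "titulo"."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def heading_level(style_name):
    """Return the heading level (1-9) for a known style name, else None."""
    match = STYLE_PATTERN.match(normalise(style_name).strip())
    if match and match.group(1) in HEADING_WORDS:
        return int(match.group(2))
    return None


for style in ("Heading1", "Título 2", "Überschrift 1", "Caption"):
    print(style, heading_level(style))
```

A lookup like this fails silently for any language missing from the set, so it would need to grow with real examples, as mentioned above.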

Testing would be really useful after this new release.