DOCX to JATS XML converter

Hi @Vitaliy

No, I did not. Thank you for your help. It is now working.

Hi again @Vitaliy

The plugin successfully created the galleys. However, the Lens Viewer does not display the HTML and XML formats. Is there a problem with Lens Viewer?

I receive this error in error_log every time I click on HTML:
Declaration of DocxToJatsPlugin::register($category, $path) should be compatible with LazyLoadPlugin::register($category, $path, $mainContextId = NULL) in /home/ccb/public_html/plugins/generic/docxConverter/DocxToJatsPlugin.inc.php on line 122

This is only a cosmetic warning that will be fixed in the next release.
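
For anyone who wants to silence the warning before that release: it appears because the overriding method declares fewer parameters than the parent method. A minimal sketch of a compatible override follows. `ParentPlugin` and `ChildPlugin` are hypothetical stand-ins that only mirror the signature quoted in the error message; they are not the real PKP classes.

```php
<?php
// Hypothetical stand-in for the parent class; only the signature
// is taken from the error message above.
class ParentPlugin {
    public function register($category, $path, $mainContextId = null) {
        return true;
    }
}

// Declaring the same optional third parameter keeps the override
// compatible with the parent declaration.
class ChildPlugin extends ParentPlugin {
    public function register($category, $path, $mainContextId = null) {
        return parent::register($category, $path, $mainContextId);
    }
}

$plugin = new ChildPlugin();
$registered = $plugin->register('generic', 'plugins/generic/docxConverter');
var_dump($registered);
```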

Lens Viewer requires certain tags to be present in the XML for proper rendering.
You can try the JATS Parser Plugin once a new release for 3.2 is ready.

Thank you so much for your help @Vitaliy. I am eagerly waiting for the updates.

Hi
I have the same problem with the DOCX converter. Is there any solution?

[14-Oct-2022 10:59:01 UTC] PHP Fatal error: Uncaught Error: Class 'ZipArchive' not found in /-----------plugins/generic/docxConverter/docxToJats/src/docx2jats/DOCXArchive.php:14
Stack trace:
#0 /home/public_html/lib/pkp/lib/vendor/composer/ClassLoader.php(571): include()
#1 /home/public_html/lib/pkp/lib/vendor/composer/ClassLoader.php(428): Composer\Autoload\includeFile('/home/n7ttt5zjv…')
#2 [internal function]: Composer\Autoload\ClassLoader->loadClass('docx2jats\DOCXA…')
#3 /home//public_html/plugins/generic/docxConverter/DOCXConverterHandler.inc.php(55): spl_autoload_call('docx2jats\DOCXA…')
#4 /home//public_html/lib/pkp/classes/core/PKPRouter.inc.php(395): ConverterHandler->parse(Array, Object(Request))
#5 /home//public_html/lib/pkp/classes/core/PKPPageRouter.inc.php(246): PKPRouter->_authorizeInitializeAndCallRequest(Array, Object(Request), Array, false)
#6 /home//public_html/lib/pkp/classes/core/Dispatcher.inc.php(144): PKPPageRouter->route(Object in /home//public_html/plugins/generic/docxConverter/docxToJats/src/docx2jats/DOCXArchive.php on line 14

Hi @Rohaan123,

Check if PHP's zip extension is installed and enabled.
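
One quick way to check is a small script like the sketch below. It uses only built-in PHP functions. Run it through the web server rather than only from the command line, since the two can load different php.ini files (that they differ on your host is an assumption, not something visible in the log).

```php
<?php
// Report whether the zip extension (which provides ZipArchive) is available
// in the PHP environment that executes this script.
$zipLoaded = extension_loaded('zip');
$classAvailable = class_exists('ZipArchive');

echo 'zip extension loaded: ' . ($zipLoaded ? 'yes' : 'no') . PHP_EOL;
echo 'ZipArchive class available: ' . ($classAvailable ? 'yes' : 'no') . PHP_EOL;

// Show which php.ini is in use, so the extension is enabled in the right file.
echo 'Loaded php.ini: ' . (php_ini_loaded_file() ?: 'none') . PHP_EOL;
```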

Hi @Vitaliy, how are you doing? Hopefully great!

This week I started experimenting with docxToJats (the PHP one) through XAMPP (PHP 8.1.12). It was working fine, but today I'm getting some errors.

I am trying to convert the "samples/input/msword_zotero.docx" file into "samples/output" but don't know what I am doing wrong. Would you be so kind as to help me, please?

Below are the error messages:
Tikinet@DESKTOP-UQTF7ER c:\docxToJats

php docxtojats.php samples/input/msword_zotero.docx samples/output/msword_zotero/msword_zotero.xml

PHP Fatal error: Uncaught ArgumentCountError: DOMNode::appendChild() expects exactly 1 argument, 0 given in C:\docxToJats\src\docx2jats\jats\Figure.php:57
Stack trace:
#0 C:\docxToJats\src\docx2jats\jats\Figure.php(57): DOMNode->appendChild()
#1 C:\docxToJats\src\docx2jats\jats\Document.php(208): docx2jats\jats\Figure->setContent()
#2 C:\docxToJats\src\docx2jats\jats\Document.php(57): docx2jats\jats\Document->extractContent()
#3 C:\docxToJats\docxtojats.php(72): docx2jats\jats\Document->__construct(Object(docx2jats\DOCXArchive))
#4 C:\docxToJats\docxtojats.php(51): writeOutput('samples/input/m…', Array, Array, 'samples/output/…', false)
#5 {main}
thrown in C:\docxToJats\src\docx2jats\jats\Figure.php on line 57

Fatal error: Uncaught ArgumentCountError: DOMNode::appendChild() expects exactly 1 argument, 0 given in C:\docxToJats\src\docx2jats\jats\Figure.php:57
Stack trace:
#0 C:\docxToJats\src\docx2jats\jats\Figure.php(57): DOMNode->appendChild()
#1 C:\docxToJats\src\docx2jats\jats\Document.php(208): docx2jats\jats\Figure->setContent()
#2 C:\docxToJats\src\docx2jats\jats\Document.php(57): docx2jats\jats\Document->extractContent()
#3 C:\docxToJats\docxtojats.php(72): docx2jats\jats\Document->__construct(Object(docx2jats\DOCXArchive))
#4 C:\docxToJats\docxtojats.php(51): writeOutput('samples/input/m…', Array, Array, 'samples/output/…', false)
#5 {main}
thrown in C:\docxToJats\src\docx2jats\jats\Figure.php on line 57

Hi @Tiago_Manzato_de_Sou,

Adapting to power outages, but otherwise good.

That looks like a typo here: https://github.com/Vitaliy-1/docxToJats/blob/9c8579dd48bb6dda9957c65776afaf3c7f5969be/src/docx2jats/jats/Figure.php#L57
I believe it should be:

$captionNode->appendChild($title);

Can you test this change? Let me know if it doesn't work.
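
For context, `DOMNode::appendChild()` always needs the node to attach, which is why the empty call in the trace above fails on PHP 8. Below is a standalone sketch of the corrected call. The `fig`, `caption`, and `title` element names are an assumption about what Figure.php builds, and this is not the actual file.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$fig = $dom->appendChild($dom->createElement('fig'));
$captionNode = $fig->appendChild($dom->createElement('caption'));
$title = $dom->createElement('title', 'Sample figure title');

// Pass the node to attach; an empty appendChild() call raises the
// ArgumentCountError shown in the stack trace.
$captionNode->appendChild($title);

$xml = $dom->saveXML($fig);
echo $xml . PHP_EOL;
```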

Damn, bro. Every tech guy's kryptonite :sweat_smile:

It worked like a charm! Thank you!

Now I'm trying to convert another article. It works only partially: the .xml is generated, but without formatting tags such as <bold>, <italic>, <sup>, <sub>.

I'm getting the following notice:

# php docxtojats.php rus/1/1.docx rus/1/1.xml
PHP Notice:  Cannot find document inside the archive by the path word/numbering.xml in C:\docxToJats\src\docx2jats\DOCXArchive.php on line 212

Notice: Cannot find document inside the archive by the path word/numbering.xml in C:\docxToJats\src\docx2jats\DOCXArchive.php on line 212

Any clues?

I guess I've figured something out.

The article in question was exported to .rtf and then saved as .docx. Maybe that is why DocxToJats was ignoring all formatting styles: once I created the file as .docx from the start, the conversion worked perfectly (even though I still received the same notice):

PHP Notice: Cannot find document inside the archive by the path word/numbering.xml in C:\docxToJats\src\docx2jats\DOCXArchive.php on line 212

A file with the .docx extension is an archive with specific rules about its structure, which text editors should follow when exporting a document. This particular message means that the file containing the sequences for lists (word/numbering.xml) is missing from the archive but is referenced in the relationships file.

It's possible that the program that exported/imported the document didn't comply with these rules. But if the problem persists, you can send me a file to explore. Let me know and I'll send my email in a private message.
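
To inspect a file yourself, you can open it as a zip archive and check which parts it actually contains. The sketch below assumes the zip extension is enabled. `listMissingParts` is a hypothetical helper, not part of docxToJats, and the script builds a tiny stand-in archive so it runs without a real .docx.

```php
<?php
// Hypothetical helper: return the expected parts that are absent from the archive.
function listMissingParts(string $file, array $parts): array {
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        throw new RuntimeException("Cannot open archive: $file");
    }
    $missing = [];
    foreach ($parts as $part) {
        // locateName() returns false when the entry does not exist.
        if ($zip->locateName($part) === false) {
            $missing[] = $part;
        }
    }
    $zip->close();
    return $missing;
}

// Build a stand-in archive that has document.xml but no numbering.xml.
$path = tempnam(sys_get_temp_dir(), 'docx');
$zip = new ZipArchive();
$zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$zip->addFromString('word/document.xml', '<document/>');
$zip->close();

$missing = listMissingParts($path, ['word/document.xml', 'word/numbering.xml']);
print_r($missing);
unlink($path);
```

With a real document, replace `$path` with the path to the .docx and skip the stand-in archive step.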

@Vitaliy Thank you very much for all the info!

Since it works (even though it displays a notice), it's A-OK for me.

I've been messing around with your code and have made one improvement that I hope you can consider implementing in the near future (I tried to make the change directly on GitHub, but I'm a git noob :sweat_smile:).

It's in this file:

Line 158

I changed this:
$urlEl = $this->createAndAppendElement($elementCitationEl, 'ext-link', $url);

To this:
$urlEl = $this->createAndAppendElement($elementCitationEl, 'ext-link', $url, ['ext-link-type' => 'uri', 'xlink:href' => $url]);

So now DocxToJats parses the reference's URL and generates the <ext-link> tag along with the ext-link-type="uri" and xlink:href="$url" attributes.
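
One thing to watch with the `xlink:href` attribute is that the `xlink` prefix has to be bound to its namespace somewhere in the output. Below is a standalone sketch with plain DOM calls rather than the library's `createAndAppendElement` helper. Whether the real output already declares `xmlns:xlink` on the root element is an assumption I have not checked.

```php
<?php
const XLINK_NS = 'http://www.w3.org/1999/xlink';

$dom = new DOMDocument('1.0', 'UTF-8');
$citation = $dom->appendChild($dom->createElement('element-citation'));

$url = 'https://example.org/article?id=1&lang=en';
$link = $dom->createElement('ext-link');
// A text node escapes characters such as "&" in the URL.
$link->appendChild($dom->createTextNode($url));
$link->setAttribute('ext-link-type', 'uri');
// setAttributeNS() binds the xlink prefix to its namespace URI.
$link->setAttributeNS(XLINK_NS, 'xlink:href', $url);
$citation->appendChild($link);

$xml = $dom->saveXML($citation);
echo $xml . PHP_EOL;
```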

But I'd like to know a bit more about how Zotero references and <xref> are parsed from a .docx.

The articles I work with generally don't use the Vancouver reference style; they mostly use the ABNT standard. One rule of this standard is that in-text citations must take the form ($authorSurname, $year), e.g. (Manzato, 2023), but DocxToJats generates the <xref> content from the xref's RID. So in the XML <body> the xrefs end up looking like: <xref ref-type="bibr" rid="bib#">#</xref>

I tried changing the function to use the variables $surname and $year concatenated instead of $id, but it doesn't work like that, since $id is what drives the foreach loop.

So would you be so kind as to help me figure out how DocxToJats can generate <xref> content as follows:
<xref ref-type="bibr" rid="bib#">($surname, $year)</xref>

Thanks in advance!

Yes, according to the NISO JATS documentation, that is the right way.

Zotero and Mendeley don't provide any documentation on how they add references to the document. For example, the Zotero plugin even uses different mechanisms in LibreOffice vs. MS Word. I suspect this is because they use different libraries to interact with OOXML.

I'm interacting with the DOM directly, so it shouldn't be hard to retrieve this data, but I really don't know/remember where they store it. Everything I know, I found empirically :slight_smile: So it would require making a sample file (with Zotero references), unpacking it, and inspecting the content. Are you talking about the Zotero plugin for MS Word, for LibreOffice Writer, or something different?

The Zotero plugin for MS Word. I've had terrible experiences with LibreOffice's constant crashes.
Looking at your code further, I think I understood some bits and pieces of how DocxToJats deals with Zotero entries, and I was also able to tweak the code a little further.

But first things first: I should have said this earlier, but the end of the year is always a total rush for me. Better late than never… Congratulations!!! :tada: :confetti_ball: I'm a big fan of your work, man!! This automation has become a HUGE piece of my company's XML development process, especially because these PHP libraries enable me to produce several XML files in a batch.

It's so darn good that now I'm trying to adapt the output XML to match the SciELO standard, which is a little bit different.

At the moment I've made changes to Reference.php only, simple things like:

Changes from:

class Reference extends \DOMElement {

	public const JATS_REF_ID_PREFIX = 'bib';

to:

class Reference extends \DOMElement {

	public const JATS_REF_ID_PREFIX = 'b';

This changes <ref id="bib1"> to <ref id="b1">, which also changes the <xref rid=""> value (as documented by SciELO, they require citation and bibliographic reference ids to be like "b1, b2, b3…").

I also tried to make DocxToJats generate entries for translators in a reference (it works for Zotero references; I haven't tested raw text references) by adding the piece of code below:

$translator = $this->getStdClassPropertyValue($data, 'translator');
if ($translator && !$containerAuthors) {
	$this->extractCSLNames($elementCitationEl, $translator, 'author');
	$this->createAndAppendElement($elementCitationEl, 'role', 'Tradutor');
}

This addition creates the <person-group> tag and its child tags as well. But per SciELO's documentation, translators, reviewers, etc. in a bibliographic reference must be marked as person-group-type="author" with a <role> tag as a child of <person-group>, as in the example below:

<person-group person-group-type="author">
  <name>
    <surname>Bezerra</surname>
    <given-names>Paulo</given-names>
  </name>
  <role>Tradutor</role>
</person-group>

but my code is outputting it like:

<person-group person-group-type="author">
  <name>
    <surname>Bezerra</surname>
    <given-names>Paulo</given-names>
  </name>
</person-group>
<role>Tradutor</role>

I'm also trying to find a way to make DocxToJats generate bibr xrefs like:
<xref ref-type="bibr" rid="b1">SURNAME, YEAR</xref> instead of <xref ref-type="bibr" rid="b1">1</xref>, <xref ref-type="bibr" rid="b2">2</xref>, <xref ref-type="bibr" rid="b3">3</xref> etc…

I hope you can help me figure out how to code it properly (please bear with me: the last time I read or wrote PHP code was back in 2002, and I haven't touched a single PHP file since :sweat_smile:).
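
On the <role> placement described above: the element ends up outside <person-group> because it is appended to the citation element rather than to the group. A standalone sketch with plain DOM calls follows. It assumes the <person-group> has already been created (I don't know what `extractCSLNames` returns), so it looks up the most recently added group and appends <role> there.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$citation = $dom->appendChild($dom->createElement('element-citation'));

// Stand-in for the markup produced when the translator names are extracted.
$group = $citation->appendChild($dom->createElement('person-group'));
$group->setAttribute('person-group-type', 'author');
$name = $group->appendChild($dom->createElement('name'));
$name->appendChild($dom->createElement('surname', 'Bezerra'));
$name->appendChild($dom->createElement('given-names', 'Paulo'));

// Locate the last <person-group> and append <role> to it,
// instead of appending <role> to <element-citation>.
$groups = $citation->getElementsByTagName('person-group');
$lastGroup = $groups->item($groups->length - 1);
$lastGroup->appendChild($dom->createElement('role', 'Tradutor'));

echo $dom->saveXML($citation) . PHP_EOL;
```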


I'm glad that you are finding the parser useful :slight_smile:

Unfortunately, I've considered different citation styles only for the transformation to HTML (in the JATS Parser Plugin), for presentation purposes. So, indeed, it requires some tweaking.
The part where the JATS in-text citation is handled for MS Word is here: https://github.com/Vitaliy-1/docxToJats/blob/d976ab1b42e338aeba261718c4cf3a981afb74e2/src/docx2jats/jats/Par.php#L27
Probably the best approach would be to find the reference by ID, get the author data, and replace the text. Something like:

$references = $content->getOwnerDocument()->getReferences();
$refIds = $content->getRefIds();
$lastKey = array_key_last($refIds);
foreach ($refIds as $key => $id) {
  $firstAuthor = null; // reset so a value from the previous reference is not reused
  $ref = $references[$id]; // this is docx2jats\objectModel\Document\Reference containing the reference, in this case in CSL (https://citationstyles.org) format
  if ($csl = $ref->getCSL()) {
    // You'll need to extract the data, like author(s) and year, from this already decoded CSL.
    // The year part is tricky because it can be in structured or raw format.
    $firstAuthor = $csl->author[0]->family ?? null;
    // ...
  }
  $refEl = $this->ownerDocument->createElement('xref', $firstAuthor ?? $id);
  $refEl->setAttribute('ref-type', 'bibr');
  $refEl->setAttribute('rid', Reference::JATS_REF_ID_PREFIX . $id);
  $this->appendChild($refEl);
  if ($key !== $lastKey) {
    // separate consecutive xrefs with a space
    $separator = $this->ownerDocument->createTextNode(' ');
    $this->appendChild($separator);
  }
}
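
The data extraction left open above could look roughly like the sketch below. `cslYear` and `citationLabel` are hypothetical helper names, not part of docxToJats. The sketch assumes the CSL item is a decoded `stdClass` and that the year sits under `issued`, either as structured `date-parts` or as free text in `raw` or `literal`.

```php
<?php
// Hypothetical helper: pull a year out of a decoded CSL-JSON item.
function cslYear(stdClass $csl): ?string {
    $issued = $csl->issued ?? null;
    if (!$issued) {
        return null;
    }
    // Structured form: {"date-parts": [[2023, 1, 15]]}
    $parts = $issued->{'date-parts'}[0] ?? null;
    if (is_array($parts) && isset($parts[0]) && $parts[0] !== '') {
        return (string) $parts[0];
    }
    // Free-text form: take the first four-digit run from "raw" or "literal".
    $text = (string) ($issued->raw ?? $issued->literal ?? '');
    return preg_match('/\d{4}/', $text, $m) ? $m[0] : null;
}

// Hypothetical helper: "Surname, Year", falling back to the numeric id.
function citationLabel(stdClass $csl, string $fallback): string {
    $author = $csl->author[0] ?? null;
    $surname = $author->family ?? $author->literal ?? null;
    if ($surname === null) {
        return $fallback;
    }
    $year = cslYear($csl);
    return $year === null ? $surname : "$surname, $year";
}

$csl = json_decode('{"author":[{"family":"Manzato","given":"Tiago"}],"issued":{"date-parts":[[2023,1,15]]}}');
echo citationLabel($csl, '1') . PHP_EOL;
```

In the loop above, the result of `citationLabel($csl, (string) $id)` would take the place of `$firstAuthor ?? $id` as the xref text. Surrounding parentheses, if the citation style needs them, can be added as text nodes around the xref.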

Regarding the other question, I'll respond once I've tested that part.