PKP PN - NativeImportExport - inefficient memory usage

@jonasraoni

Describe the issue or problem
We have one issue of a journal that can’t be deposited to PKP PN because of its sheer size - the PKP PN plugin reports an Unknown processing state bag-error for this one.
(For all other journals and their issues the deposit process runs fine and we know what to do if there’s a problem.)

I have tested export of the issue with the command-line export tool and the NativeImportExport plugin. The export process is killed by the system because it uses too much memory. Even after tripling the available memory it still crashes.

The reason for this issue is that it provides audio and video in the articles (the HTML version of the article embeds the audio and video from a streaming platform). For migration and archiving reasons (sic!), compressed versions of the mp4 files had been added as supplements. And exactly these are the culprit.

The problem seems to be that the complete XML tree is built in memory before it is written to the output.

According to some debugging, the 36 mp4 files add up to a total file size of 6.6 GB (the largest one being about 1 GB). base64 encoding adds about 35% overhead (one third for the encoding itself, plus line breaks), so roughly 9 GB in total. That could still be held in memory, but the XML tree probably requires additional overhead as well.
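The size estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `base64_size` is made up for illustration, and the 76-character line length is an assumption (the common MIME convention), not something taken from the OJS export code.

```python
def base64_size(raw_bytes: int, line_length: int = 0) -> int:
    """Return the size in bytes of the base64 encoding of `raw_bytes` bytes.

    base64 turns every started group of 3 input bytes into 4 output
    characters. If `line_length` is non-zero, one newline is added per
    started output line (assumed wrapping convention).
    """
    encoded = 4 * ((raw_bytes + 2) // 3)
    if line_length:
        encoded += (encoded + line_length - 1) // line_length
    return encoded


raw = 6_600_000_000  # 6.6 GB of mp4 files, as reported above
for label, size in (
    ("unwrapped", base64_size(raw)),
    ("wrapped at 76 chars", base64_size(raw, line_length=76)),
):
    print(f"{label}: {size / 1e9:.2f} GB ({size / raw - 1:.1%} overhead)")
```

Whether the 35% figure or the plain one-third figure applies depends on whether the encoder wraps its output; either way the encoded payload alone is close to 9 GB before any XML tree overhead.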

There are several questions and suggestions:

  • Does it make sense in general to deposit such data, does PKP PN have size limits for large deposit files?
  • How to proceed further? Filter out the mp4 supplements for the moment, split the issue temporarily into two parts, create the bag file by other means (e.g. export single articles, then create a bag file from them and inject separately into deposit process, …)
  • Suggestion to optimize the native XML import/export for better memory management, e.g. write output in chunks to disk, deallocate memory after chunk has been written
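To illustrate the last suggestion (writing output in chunks instead of holding everything in memory), here is a minimal Python sketch. It is not how the NativeImportExport plugin works, which is PHP; the function name `write_base64_streamed` and the chunk size are assumptions chosen for the example. The idea is that base64 can be produced piece by piece as long as each encoded piece is a multiple of 3 bytes long, so only one chunk is in memory at a time.

```python
import base64
import io
from typing import BinaryIO


def write_base64_streamed(source: BinaryIO, sink: BinaryIO,
                          chunk_size: int = 3 * 1024 * 1024) -> int:
    """Encode `source` to base64 and write it to `sink` chunk by chunk.

    Only about one chunk is held in memory at a time. Bytes that do not
    fill a complete 3-byte group are carried over to the next round, so
    padding ("=") can only appear at the very end of the output.
    Returns the number of bytes written.
    """
    written = 0
    pending = b""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        data = pending + chunk
        usable = len(data) - len(data) % 3
        written += sink.write(base64.b64encode(data[:usable]))
        pending = data[usable:]
    if pending:
        # Final, possibly padded, group.
        written += sink.write(base64.b64encode(pending))
    return written


# Small self-check with an in-memory "file" and an awkward chunk size.
payload = bytes(range(256)) * 1000
out = io.BytesIO()
write_base64_streamed(io.BytesIO(payload), out, chunk_size=1000)
print(out.getvalue() == base64.b64encode(payload))
```

With real files, `source` and `sink` would be file objects opened in binary mode, and the surrounding XML would be written to the same output stream before and after the encoded data.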

BTW: I assume that similar problems may arise for the import process if there are large base64 encoded files.

What application are you using?
OJS 3.3.0-10 (+latest patches)

Hi @mpbraendle,

  1. The current limit for PKP PN is 1GB, but we’re probably going to get it upgraded to at least 2GB (this isn’t scheduled yet).

  2. Yes, with the current limitations you would be required to split the issue. Each package shouldn’t exceed 1GB.

  3. This should work fine in your version: see the no-embed argument of the NativeImportExport plugin, which is supposed to replace the base64 data with paths. The PKP PN plugin uses it as well.
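For point 2, splitting a set of files into packages that each stay under a size limit can be sketched as a simple first-fit grouping. This is a Python illustration only: the function name `split_into_packages` and the example file names are hypothetical, the limit is taken as 10^9 bytes for simplicity, and nothing here reflects how the PKP PN plugin actually builds its deposits.

```python
from typing import Dict, List, Tuple

LIMIT = 10**9  # assumed package limit of 1 GB, counted in bytes


def split_into_packages(sizes: Dict[str, int],
                        limit: int = LIMIT) -> Tuple[List[List[str]], List[str]]:
    """Group files into packages whose total size stays within `limit`.

    Uses first-fit on files sorted by decreasing size. Files that are
    larger than the limit on their own cannot be placed and are returned
    separately in the second list.
    """
    packages: List[List[str]] = []
    free: List[int] = []       # remaining room per package
    oversized: List[str] = []
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > limit:
            oversized.append(name)
            continue
        for i, room in enumerate(free):
            if size <= room:
                packages[i].append(name)
                free[i] -= size
                break
        else:
            # No existing package has room: start a new one.
            packages.append([name])
            free.append(limit - size)
    return packages, oversized


# Hypothetical file names and sizes, small numbers for readability.
example = {"a.mp4": 600, "b.mp4": 500, "c.mp4": 400, "d.mp4": 1200}
print(split_into_packages(example, limit=1000))
```

Note that a single file close to or above the limit (such as the largest mp4 of about 1 GB mentioned above) may not fit into any package, which is why the sketch reports such files separately instead of failing.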

Best,
Jonas

Thank you @jonasraoni - I tried with NativeImportExport --no-embed and that works (file size of the resulting issue XML is then 260kB).

I’m just wondering why the packaging/depositing then failed, if the default is no-embed => 1 …

To have at least a minimal deposit (HTML galley and PDF equivalent, which just has links to the videos/audios), we will probably decide to leave out the mp4 files for the moment. Splitting the issue is not a viable path in our opinion - the issue and its topic were a project of their own: https://www.psychoanalyse-journal.ch/issue/view/172

Hi!

What error did you get during packaging? Was it the base64 encoding?! Perhaps adding the files to the bag was the culprit.

In a perfect world, there shouldn’t be limits, but well… I can see some alternatives to address this issue, that’s something that will require some internal discussions.

Best,
Jonas

This is what last night’s PKPPNDepositorTask-63507361aa02f-20221020.log tells:

[2022-10-20 00:00:06] [Notice] Depositor processing for Journal für Psychoanalyse.
[2022-10-20 00:00:06] [Notice] Getting service document.
[2022-10-20 00:00:07] [Notice] Processing deposit status updates.
[2022-10-20 00:00:07] [Notice] Trying status update for 94 (Issue: 172) (Local Status: [Transferred], Processing Status: [Unknown], Lockss Status: [Unknown])
[2022-10-20 00:00:07] [Notice] Processing status got for 94 → (bag-error)
[2022-10-20 00:00:07] [Notice] Deposit 94 has unknown processing state bag-error

I didn’t find any error around this time in the PHP error log giving some clue.

Hi @mpbraendle,

I can confirm the deposit was received by us (it is around 6 GB), but something happened while extracting the “bag” file. Perhaps the file is corrupted; it will require further investigation :)

I’ll take a look at it, but as it’s not going to be preserved anyway, I won’t give it a high priority at this moment.

Best,
Jonas

In the meantime, we have decided to leave out the mp4 files (I will write a filter for that), so that a reduced package is sent (metadata, HTML and PDF files only).
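The core of such a filter is just a decision per file name. A minimal Python sketch, assuming the filter works on a list of file names and excludes by extension; the names `EXCLUDED_SUFFIXES` and `keep_for_deposit` are made up for illustration, and the actual filter in OJS would have to be written in PHP against the export code:

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Extensions to leave out of the deposit (assumption: only mp4 for now).
EXCLUDED_SUFFIXES = {".mp4"}


def keep_for_deposit(filenames: Iterable[str]) -> List[str]:
    """Return the file names whose extension is not excluded.

    The comparison is case-insensitive, so "video.MP4" is dropped too.
    """
    return [
        name for name in filenames
        if PurePosixPath(name).suffix.lower() not in EXCLUDED_SUFFIXES
    ]


# Hypothetical file names.
print(keep_for_deposit(["article.html", "article.pdf", "interview.mp4", "talk.MP4"]))
```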
