Metadata collection and embedded metadata in PDFs

alberto_moyano · November 23, 2023, 9:39pm

Hello, does anyone know whether metadata harvesters read the metadata embedded in PDFs and EPUBs when collecting metadata, or whether they only collect what is declared in OJS and OMP?

rcgillis · November 24, 2023, 1:06am

Hi @alberto_moyano,

I think this depends on the harvester. From what I’ve seen, many harvesters use the OAI-PMH functionality to harvest metadata, but some may take metadata from OJS/OMP’s export utilities (if provided to them). It is plausible that some harvesters extract metadata from PDFs or other galleys, but that would depend on the harvester and the mechanisms it uses.

-Roger
PKP Team
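
For readers unfamiliar with OAI-PMH harvesting, here is a minimal Python sketch of what such a harvester typically works with: it builds a `ListRecords` request and reads Dublin Core fields from the XML response. The endpoint URL, the sample response, and the function names are assumptions made for illustration, not taken from any particular harvester. Note that nothing in this flow opens the PDF or EPUB galleys; the harvester only sees what the platform exposes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:journal.example.org:article/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return one dict per record with its identifier, titles, and creators."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            ),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in record.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    # Placeholder endpoint; a real harvester would fetch this URL over HTTP.
    print(list_records_url("https://journal.example.org/index.php/demo/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also handle resumption tokens, deleted records, and other metadata formats, all of which are omitted here.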

alberto_moyano · November 24, 2023, 1:14am

@rcgillis, thank you for your response. My question stems from the fact that I am developing software that works with LaTeX and injects metadata into the PDF. I have been studying and analyzing the PDFs from about 20 university publishers, and none of them have embedded metadata in the PDFs, hence my question.
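
For anyone who wants to run a similar survey, a rough first-pass check could look like the Python sketch below. It is only a byte-level heuristic and not the method used for the analysis above: it looks for an `/Info` reference (the document information dictionary) and for an XMP packet in the raw file. The function name is hypothetical, and the check can miss metadata in encrypted files or compressed streams. A hit also only shows that a container exists, not that fields such as title or author are filled in.

```python
import re


def embedded_metadata_hints(pdf_bytes):
    """Report which metadata containers appear in the raw bytes of a PDF."""
    return {
        # The trailer references the document information dictionary
        # through an indirect reference such as "/Info 5 0 R".
        "info_dictionary": re.search(rb"/Info\s+\d+\s+\d+\s+R", pdf_bytes) is not None,
        # XMP metadata is XML wrapped in an <x:xmpmeta> element.
        "xmp_packet": b"<x:xmpmeta" in pdf_bytes,
    }


if __name__ == "__main__":
    # Tiny hand-made byte strings standing in for real files.
    with_metadata = (
        b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Info 5 0 R >>\n"
        b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>\n%%EOF'
    )
    without_metadata = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
    print(embedded_metadata_hints(with_metadata))
    print(embedded_metadata_hints(without_metadata))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function. A dedicated PDF library would give more reliable results and the actual field values.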

rcgillis · November 24, 2023, 1:22am

Hi @alberto_moyano,

Really interesting - thanks for sharing the context. Yes, based on what I’ve seen, I suspect that few publishers embed metadata in the PDF. FYI: I just modified the title of the post in case other members of the community wish to weigh in if they use this or have seen this practice.

-Roger
PKP Team

alberto_moyano · November 24, 2023, 1:57am

@rcgillis, thank you for the change in the title.

The software is licensed under GPL (https://gitlab.com/alberto.alejandro.moyano/gbtexpublisher), and I use it daily for my personal production. So far, I have completed the output for PDF and EPUB. For HTML and JATS, I am studying the best possible model for injecting metadata since they are processed from an SQLite database.
