OJS scraping package for R

Gaston_Becerra · January 2, 2020, 11:39pm

Hi there,

Does anyone know of any attempts or efforts to build an R (statistical software) package to scrape information from OJS pages? I’m thinking of a tool that would let you retrieve information about an article or an issue by passing its URL (e.g. metadata, URLs to download PDFs/galleys).

(I know the OJS REST API already makes this available… but perhaps this could be useful for OJS versions before v3… what do you think?)

Cheers!

nuest · January 7, 2020, 12:50pm

A cool idea! This would surely be useful, and if there is no such package yet I’d consider joining the development. I am a contributor to the suppdata package and have been wondering how to support downloading supplemental material based on DOIs pointing to OJS-based articles in a generic way (so far only JStatSoft is supported), and this package should surely make that easier.

Have you taken a look at other packages for accessing academic content from other sources? They might provide good boilerplates (functions, code) or may even be a place for the functionality you describe.

Are you thinking only of bibliographic metadata, or also of downloading content?

ctgraham · January 7, 2020, 10:16pm

OAI-PMH can readily be used for discovery and download of metadata. This will also give you a reference to the article’s abstract page via the Resource Identifier. Getting to the fulltext and supplementary resources is a bit trickier. The page’s meta elements list a “citation_pdf_url” if the fulltext is exposed as PDF. It is up to the theme(s), however, to semantically render the fulltext and supplemental file links, and most don’t. Getting some RDF or similar expression of the metadata surrounding the galleys would be a big benefit here.
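
As a rough illustration of the “citation_pdf_url” point, here is a minimal Python sketch that collects the citation_* meta elements from an article page using only the standard library. The sample HTML, the example.org URL, and the names CitationMetaParser and citation_meta are made up for the example; a real page would first have to be fetched, and its markup may differ by theme.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collects <meta name="citation_*" content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        content = attrs.get("content")
        # Keep every value: tags such as citation_author can repeat.
        if name.startswith("citation_") and content:
            self.meta.setdefault(name, []).append(content)


def citation_meta(html):
    """Return a dict mapping each citation_* name to its list of values."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.meta


# Hypothetical article page, trimmed down to the relevant tags.
SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="An example article">
<meta name="citation_pdf_url" content="https://journal.example.org/index.php/demo/article/download/123/456">
</head><body></body></html>
"""

meta = citation_meta(SAMPLE_HTML)
# An empty list here would mean the fulltext is not exposed as PDF.
pdf_urls = meta.get("citation_pdf_url", [])
print(pdf_urls)
```

Supplementary files would still need theme-specific handling, since nothing in these meta elements describes them.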

rshiggin · January 8, 2020, 4:39pm

I have a couple of simple Python scripts that can harvest and parse OAI-PMH, which we’re using to scrape OJS. But I would be interested in something more sophisticated, especially a way to download a journal’s PDF/Galley content in one process.
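
For readers who have not tried this, a minimal sketch of the parsing half might look like the following. It is not the scripts mentioned above, just an illustration using Python’s standard library on a hand-written ListRecords response in oai_dc format. The sample record, the example.org URLs, and the function name parse_records are made up, and the assumption that dc:source carries the journal, volume, and issue string should be checked against a real OJS response.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response; a real one would come from a request such as
# <journal base URL>/oai?verb=ListRecords&metadataPrefix=oai_dc
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:journal.example.org:article/123</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
          <dc:identifier>https://journal.example.org/index.php/demo/article/view/123</dc:identifier>
          <dc:source>Demo Journal; Vol. 4 No. 2 (2019)</dc:source>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per <record> with its OAI id, title, identifiers, sources."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "identifiers": [e.text for e in rec.iterfind(".//dc:identifier", NS)],
            "sources": [e.text for e in rec.iterfind(".//dc:source", NS)],
        })
    return records


for record in parse_records(SAMPLE_RESPONSE):
    print(record["oai_id"], record["sources"])
```

A full harvester would also need to follow resumptionToken elements to page through large result sets, which this sketch leaves out.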

A related question I’m working on right now: does the REST API include a way to match a large number of article IDs (such as those in OJS URLs) with each ID’s volume and issue info in a single process?

Thanks!

Gaston_Becerra · January 16, 2020, 7:22pm

Hi y’all, many thanks for your responses and cheers!
I checked a bunch of packages from rOpenSci. My goal is to contribute such a package.
I already have a few functions sketched to process OJS URLs (detect whether one points to an issue, article, or galley; retrieve the OAI URL) and to download metadata (both through OAI and by scraping the theme). I think I can have a first draft of the package by the end of February. Eager to collaborate, if you have the patience (it will be my first package).
Best,
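
To make the URL-processing idea above concrete (detecting whether a URL points to an issue, article, or galley, and deriving the OAI URL), here is a small Python sketch. It is not the package’s code. It assumes the conventional OJS path layout (…/article/view/<id>, …/article/view/<id>/<galleyId>, …/issue/view/<id>, and …/oai for the OAI endpoint) with numeric IDs; installations with custom paths would not match. The function name classify_ojs_url and the example.org URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed OJS path shape: <base>/(article|issue)/(view|download)/<id>[/<id>...]
OJS_PATH = re.compile(
    r"^(?P<base>.*?)/(?P<page>article|issue)/(?P<op>view|download)"
    r"/(?P<ids>\d+(?:/\d+)*)/?$"
)


def classify_ojs_url(url):
    """Guess what an OJS URL points to and where the journal's OAI endpoint is."""
    parts = urlsplit(url)
    match = OJS_PATH.match(parts.path)
    if not match:
        return {"type": "unknown", "ids": [], "oai_url": None}

    ids = match.group("ids").split("/")
    if match.group("page") == "issue":
        kind = "issue"
    elif len(ids) > 1 or match.group("op") == "download":
        # A second ID (or a download operation) is treated as a galley.
        kind = "galley"
    else:
        kind = "article"

    oai_url = f"{parts.scheme}://{parts.netloc}{match.group('base')}/oai"
    return {"type": kind, "ids": ids, "oai_url": oai_url}


for example in (
    "https://journal.example.org/index.php/demo/article/view/123",
    "https://journal.example.org/index.php/demo/article/view/123/456",
    "https://journal.example.org/index.php/demo/issue/view/45",
):
    print(classify_ojs_url(example))
```

Returning "unknown" instead of raising an error keeps batch processing simple when a list of URLs mixes OJS and non-OJS pages.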

Gaston_Becerra · April 24, 2020, 5:07pm

Hi there,
Just a quick message to announce that an R package for scraping OJS has been accepted on CRAN: CRAN - Package ojsr
Best,