Keep UI archivable by Heritrix web crawler

In a journal project where only HTML galleys are published, we have decided to use our library’s web archive for long term preservation of the journal site as a whole.

For OJS 2.4.5, the crawl using Heritrix and Web Curator Tool works really well. For OJS 3 (as taken from the GitHub branch), however, the only things visible on the archived front page are the “loading” spinners and the quick search field (see attached image).

We think crawler compatibility of the OJS reader frontend could be a general requirement, as journals might like to have their content archived on archive.org, which uses similar software.

Hi @ojsbsb,

We’ve just merged some changes that may help this situation – watch for the beta release (due in August) and see if that helps. If not, please contact us for a little more specific work on this. Indexing is obviously a very important requirement for journals.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,

thanks for your reply. We will watch out for the beta release and do another test crawl then.

Hi @asmecher,

having found this link on Twitter around #pkp5, I’d like to ask for permission to do a test crawl of that site. The results would not be visible to anyone, of course, but it would avoid false positives caused by installation problems on our side.

Of course, we would make the results public here and say whether problems like the one mentioned above persisted or have been solved with the new UI.

Hi @ojsbsb,

As long as the crawl results aren’t public, that’s fine. (The test data closely resembles scholarly content so I’d hate to see it get somehow added to a search engine; that’s why we’ve got no-index set in the .htaccess file.) The feedback would be welcome. Note that the current beta level front-end for readers is fairly minimal.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,

thanks for the reply. I have just conducted a test crawl (after confirming with my colleague that the data will definitely stay internal and cannot be accessed by our library patrons or other users on the web). It looks very good; in particular, the big problem with the loading spinners in the previous version has disappeared.

It was not possible to check the availability of some article files in the archive, as the view script for some reason redirected to the search function.

Also, I would like to point out one problem with the current layout: There are journals which put a lot of work into the translation of contents such as “About us”, abstracts and sometimes also articles. Using the dropdown language selector, only the site’s default language is archived. It would be necessary to have links for each language so that the crawler is able to follow them and get the content in all available languages. I believe this might be a desirable feature for multilanguage journals.

We will install the beta version on our own test server and carry out some more in-depth checks, but the overall result is much better than with the previous alpha version.

Regards,
ojsbsb

Hi @ojsbsb,

Thanks – that’s great news. I’ve filed the language switcher for attention: Ensure that all languages are indexable by crawlers · Issue #699 · pkp/pkp-lib · GitHub

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @alec,

after getting some more information about this from our web archiving specialist, I was told that the only way for multilingual content in journals to be included would be to use language-specific URLs for those parts.

For the web archiving system, each URL is one resource and will be crawled/indexed/archived once, on the assumption that exactly one content element lies behind it. Because OJS delivers different content for the same URL depending on the locale stored in the current session, the language choice is not part of an archivable resource but a kind of client-server interaction, which cannot be archived.
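
To illustrate the point, here is a minimal Python sketch of a crawler that keys its archive by URL alone. It is not how Heritrix is implemented; the page table, the `fetch` function and the `/about` path are hypothetical stand-ins for a site whose content depends on a session locale.

```python
# Hypothetical stand-in for a site that serves different content for the
# same URL depending on the locale stored in the visitor's session.
PAGES = {
    ("/ojs/index.php/about", "en_US"): "About us",
    ("/ojs/index.php/about", "hu_HU"): "Rólunk",
}
DEFAULT_LOCALE = "en_US"


def fetch(url, session_locale=None):
    """Return the page body; without a session, the default locale is served."""
    return PAGES[(url, session_locale or DEFAULT_LOCALE)]


def crawl(frontier):
    """Archive each URL exactly once, keyed by URL alone (no session state)."""
    archive = {}
    for url in frontier:
        if url not in archive:  # a URL already seen is never fetched again
            archive[url] = fetch(url)
    return archive


# Even if the same URL is discovered twice, only one version is stored.
archive = crawl(["/ojs/index.php/about", "/ojs/index.php/about"])
```

Under these assumptions the Hungarian text never reaches the archive, because no distinct URL leads to it.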

To solve this, journals would need an option to enable language-specific links, and points where the different language versions link to one another. These could be language switcher links that do not merely change a session setting but change a language segment in the URL (like /ojs/index.php/en_US → /ojs/index.php/hu_HU).
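
A rough Python sketch of what such switcher links could look like, assuming the locale appears as its own path segment as in the example above. The function names and the `/about` path are hypothetical and not part of OJS.

```python
import re

# Assumption: locales look like "en_US" or "hu_HU" and occupy one path segment.
LOCALE_RE = re.compile(r"^[a-z]{2}_[A-Z]{2}$")


def switch_locale(path, new_locale):
    """Replace the first locale-looking path segment with new_locale."""
    segments = path.split("/")
    for i, segment in enumerate(segments):
        if LOCALE_RE.match(segment):
            segments[i] = new_locale
            return "/".join(segments)
    raise ValueError(f"no locale segment in {path!r}")


def language_links(path, locales):
    """Build one plain <a> link per locale so a crawler can follow each of them."""
    return [
        '<a href="{}" hreflang="{}">{}</a>'.format(
            switch_locale(path, locale), locale.replace("_", "-"), locale
        )
        for locale in locales
    ]


links = language_links("/ojs/index.php/en_US/about", ["en_US", "hu_HU"])
```

Because every language version then has its own URL and is reachable through an ordinary link, the crawler would treat each one as a separate resource.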

Hi @ojsbsb,

Gotcha, thanks – we’ll track this from the github issue.

Regards,
Alec Smecher
Public Knowledge Project Team