OJS 3.1.0.1 Access and error log filling with references to old(?) citation plugin addresses

ajnyga · February 18, 2018, 7:24pm

Hi,

After upgrading to OJS 3.1.0.1, our access log and error log are filling up with 404 errors caused by requests to old citation plugin addresses.

There are so many of them that Google Webmaster Tools has also emailed me about the growing number of 404s twice now (edit: three times now). At first I thought they would eventually stop, but they just keep coming. Any idea what is causing them? I think this is connected to the change that occurred with the citationStylePlugin. Maybe there should be a 302 redirect of some sort? @NateWr @asmecher

Here are a few of them, all within just a couple of seconds. These are from the access log; each one also has a corresponding 404 line in the error_log.
54.xx.xx.70 - - [18/Feb/2018:21:13:04 +0200] "GET /journal1/article/cite/50582/CbeCitationPlugin HTTP/1.1" 404 22
54.xx.xx.85 - - [18/Feb/2018:21:13:18 +0200] "GET /journal2/article/cite/60023/RefWorksCitationPlugin HTTP/1.1" 404 22
54.xx.x.143 - - [18/Feb/2018:21:13:19 +0200] "GET /journal3/article/cite/58614/BibtexCitationPlugin HTTP/1.1" 404 22
54.xx.xx.39 - - [18/Feb/2018:21:13:20 +0200] "GET /journal4/article/cite/8313/RefManCitationPlugin HTTP/1.1" 404 22

All these errors are making the error log pretty useless, or at least hard to follow. I got around a hundred 404 errors within 5 minutes, so after a day or so you really have to search for the actual errors.

edit: ok, so it was what I suspected: "pkp/pkp-lib#723 Remove unused ArticleHandler::cite method" (pkp/ojs@0f9a658 on GitHub)

I am worried that the large number of 404s will affect our results in Google. Maybe when handlers are removed from OJS there should be a policy of first adding a 302 redirect and removing the actual handler only later?

asmecher · February 20, 2018, 8:15pm

Hi @ajnyga,

Hmm, yes, we removed these hand-written citation format plugins in favour of a CSL-based implementation. In other parts of OJS where previously used URLs changed, we introduced 301 Moved Permanently redirects, but those were to avoid breaking useful links, whereas I think this is more a case of a potential SEO issue, which is a lower priority. I’d be happy to review/merge a proposal on this.

Regards,
Alec Smecher
Public Knowledge Project Team

ajnyga · February 20, 2018, 8:27pm

Hi @asmecher,

Yes, I of course meant to say 301 not 302.

I guess I could reintroduce the cite handler and just add the redirects there.

Maybe:

Check if the Citation Style Language plugin is enabled.
If it is enabled, match requests to the new URLs. For example:
/journal3/article/cite/58614/BibtexCitationPlugin => /journal3/citationstylelanguage/download/bibtex?submissionId=58614
If there is no match (for example, RefWorksCitationPlugin seems to be missing from the new plugin), redirect to the submission abstract page.
If the Citation Style Language plugin is disabled, redirect all requests to the corresponding abstract page.
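
To make the steps above concrete, here is a rough sketch of the matching logic. It is written in Python purely for illustration and is not OJS handler code. The function name, the mapping table, and the `/article/view/<id>` form used for the abstract page are assumptions made for this example. Only the bibtex mapping comes from the example URL above, so any other entries would need checking against the new plugin.

```python
import re
from typing import Optional, Tuple

# Hypothetical table: old citation plugin name -> download format in the
# Citation Style Language plugin. Only "bibtex" is taken from the example
# in this thread; anything else would have to be verified before adding it.
OLD_PLUGIN_TO_CSL_DOWNLOAD = {
    "BibtexCitationPlugin": "bibtex",
}

# Old-style URL: /<journal>/article/cite/<submissionId>/<PluginName>
OLD_CITE_URL = re.compile(
    r"^/(?P<journal>[^/]+)/article/cite/(?P<sid>\d+)/(?P<plugin>\w+)$"
)


def resolve_old_cite_url(path: str, csl_enabled: bool) -> Optional[Tuple[int, str]]:
    """Return (status, target) for an old cite URL, or None if path is not one."""
    match = OLD_CITE_URL.match(path)
    if match is None:
        return None  # not an old citation URL, leave it to normal routing

    journal = match.group("journal")
    sid = match.group("sid")
    plugin = match.group("plugin")

    # Assumed form of the abstract page URL.
    abstract_url = f"/{journal}/article/view/{sid}"

    if csl_enabled:
        fmt = OLD_PLUGIN_TO_CSL_DOWNLOAD.get(plugin)
        if fmt is not None:
            target = f"/{journal}/citationstylelanguage/download/{fmt}?submissionId={sid}"
            return 301, target

    # No matching format, or the CSL plugin is disabled: fall back to the abstract.
    return 301, abstract_url
```

With this, `/journal3/article/cite/58614/BibtexCitationPlugin` resolves to the bibtex download URL when the plugin is enabled, and a RefWorksCitationPlugin request falls back to the abstract page.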

What do you think?

asmecher · February 21, 2018, 10:46pm

Hi @ajnyga,

I’d be tempted to have imperfect matches stay as 404s, as that’s technically more accurate – the content was here but now it doesn’t exist anymore – but it’s not a strong opinion.

Regards,
Alec Smecher
Public Knowledge Project Team

ajnyga · February 22, 2018, 8:31pm

You are probably right. I am having a hard time understanding where all those thousands of hits are coming from. They seem to be at least partly search engines, but you would think that they would learn within a few weeks that the resource is really gone. I will give it a few more weeks and see what happens.

ajnyga · August 15, 2018, 3:17pm

Just an update here. We are still getting those calls to the old citation plugins.

These are examples from our access logs:
article/cite/62892/RefManCitationPlugin HTTP/1.1" 404 22
article/cite/57552/BibtexCitationPlugin HTTP/1.1" 404 22
article/cite/51292/RefWorksCitationPlugin HTTP/1.1" 404 22
article/cite/51999/TurabianCitationPlugin HTTP/1.1" 404 22

We get 300-400 of those per hour, and each of them creates a line in the error_log.
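
For anyone wanting to tally these on their own server, a minimal sketch in Python follows. It assumes the access log lines look like the excerpts above (request path, then `HTTP/1.1" 404`). The function name is made up for this example and the regular expression may need adjusting for other log formats.

```python
import re
from collections import Counter
from typing import Iterable

# Matches a 404 response to an old cite URL and captures the plugin name.
# Assumes the log format shown in the excerpts above.
CITE_404 = re.compile(r'article/cite/\d+/(?P<plugin>\w+) HTTP/[\d.]+" 404 ')


def count_old_cite_404s(lines: Iterable[str]) -> Counter:
    """Count 404 hits on old citation plugin URLs, grouped by plugin name."""
    counts: Counter = Counter()
    for line in lines:
        match = CITE_404.search(line)
        if match:
            counts[match.group("plugin")] += 1
    return counts
```

Passing an open log file works, since file objects iterate line by line, and calling `.most_common()` on the result lists the plugins by number of hits.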

I still agree that 404 is the right way to handle this, but where the hell are these calls coming from?

asmecher · August 15, 2018, 4:50pm

Hi @ajnyga,

I’d assume it’s a bevy of crawlers that picked these URLs up from the publishing front-end, e.g. via the article view. Does the user agent give any indication?

Regards,
Alec Smecher
Public Knowledge Project Team

ajnyga · August 15, 2018, 5:09pm

hmm, I see now that our access_log is not saving the user agent (could be a GDPR setting they made with our server, have to ask).

But from the IPs I can see that at least some of them are coming from Google. So yeah, probably crawlers, but you would think that they would learn in 6 months that those URLs no longer exist.

ajnyga · April 25, 2024, 12:16pm

Just a note here, since I found my own thread when googling: we still get these hits to the old URLs.