Hello,
We’re using OJS 2.4.8 and trying to generate some stats from previous years. We have galleys in several different formats (PDF, HTML, and eBook). The stats for the HTML galleys of several articles seem way off. For instance, there are a few articles that have PDF galley views around 1,000 and eBook galley downloads around 200, but the HTML galley views in the report number around 30,000.
We’ve had a problem in the past with the usage stats script processing the same log file multiple times, but that doesn’t seem to be the cause here. The other galley formats look about right. It’s just the HTML galley views that are way off base.
Looking back through some of the log files, it appears some articles got hit repeatedly by bots, and for whatever reason the log-processing script did not accurately weed those requests out.
Does anyone have experience with cleaning up the view reports so they’re more accurate? Any tips?
Thanks!
Hi @mchladek,
Have you compared the user agents in your Apache access log against the list of excluded bots? See lib/pkp/registry/botAgents.txt.
Regards,
Alec Smecher
Public Knowledge Project Team
Thanks, @asmecher! We have OJS set to process OJS-generated log files rather than the Apache access logs, but I’m assuming the botAgents.txt file is still used to detect bots in that case, right?
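Just to make sure I understand how that check works, I mocked up a rough approximation of it (this is not the actual OJS code, and I’m assuming botAgents.txt holds one pattern per line, with lines starting with # treated as comments):

<?php
// Rough approximation of the bot check (not the actual OJS code).
// Assumes lib/pkp/registry/botAgents.txt holds one pattern per line,
// with lines starting with '#' treated as comments.
function looksLikeBot($userAgent, $botFile = 'lib/pkp/registry/botAgents.txt') {
    foreach (file($botFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] == '#') continue;
        // Treat entries that already look like delimited regexps as-is;
        // otherwise fall back to a case-insensitive substring match.
        $pattern = ($line[0] == '/') ? $line : '/' . preg_quote($line, '/') . '/i';
        if (@preg_match($pattern, $userAgent)) return true;
    }
    return false;
}

// Example: check a sample user agent string pulled from the access log.
var_dump(looksLikeBot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'));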
I’ve looked through some of our old log files, and while there are some bots we should probably add to the botAgents.txt file, I don’t know whether that will fix everything. We seem to get a lot of hits from users that are probably bots but don’t identify themselves as such. For instance, in one of our log files the following user requests the same HTML galley file for an article every couple of seconds for several hours, resulting in thousands of hits from this one IP:
122.15.22.45 - - "2015-12-26 16:59:42" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
There doesn’t seem to be anything in the user agent string I could use to mark it as a bot. I believe OJS is set up to ignore repeat requests for the same item from the same IP if they arrive within a set amount of time, right? But I think what might be happening is that when a bot keeps requesting the same file for several hours, OJS still ends up counting many of those requests as views.
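In case it’s useful to anyone else, here’s the quick-and-dirty script I’ve been using to flag IPs like that one (just a sketch; it assumes the log-line format shown above, and the threshold of 100 requests is arbitrary):

<?php
// Rough sketch: tally requests per (IP, URL) pair in an OJS usage stats
// log file and print the heaviest offenders.
// Usage: php <this script> path/to/usage-log-file
$counts = array();
foreach (file($argv[1]) as $line) {
    // Expected format: IP - - "date time" URL status "user agent"
    if (!preg_match('/^(\S+) - - "[^"]+" (\S+) \d+ "/', $line, $m)) continue;
    $key = $m[1] . ' ' . $m[2]; // IP + requested URL
    $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
}
arsort($counts);
foreach ($counts as $key => $n) {
    if ($n < 100) break; // arbitrary cut-off for "suspicious"
    echo "$n\t$key\n";
}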
Hi @mchladek,
I don’t know the COUNTER Code of Practice as well as e.g. @ctgraham might, but my understanding is that it requires double-downloads of the same article by the same “user” to be ignored, but not crawling of the whole set of articles. See e.g. Data Processing | Project Counter. We’ve generally kept to these guidelines.
Regards,
Alec Smecher
Public Knowledge Project Team
I think for HTML the COUNTER code suggests ignoring a double-download only when the same file is requested again within 10 seconds; a request arriving after a 10-second interval counts as another hit. So the issue might be that a burst of requests like the partial log below still ends up counting as four hits (lines 1, 11, 13, and 18 would be counted), and when that continues for several hours it adds up to several dozen or more views in a single day (see the sketch after the log excerpt).
Any ideas on how to clean up such log files? Thanks!
1 122.15.22.45 - - "2015-12-26 12:39:09" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
2 122.15.22.45 - - "2015-12-26 12:39:09" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
3 122.15.22.45 - - "2015-12-26 12:39:09" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
4 122.15.22.45 - - "2015-12-26 12:39:09" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
5 122.15.22.45 - - "2015-12-26 12:39:09" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
6 122.15.22.45 - - "2015-12-26 12:39:10" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
7 122.15.22.45 - - "2015-12-26 12:39:10" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
8 122.15.22.45 - - "2015-12-26 12:39:10" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
9 122.15.22.45 - - "2015-12-26 12:39:10" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
10 122.15.22.45 - - "2015-12-26 12:39:16" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
11 122.15.22.45 - - "2015-12-26 12:39:37" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
12 122.15.22.45 - - "2015-12-26 12:39:40" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
13 122.15.22.45 - - "2015-12-26 12:40:07" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
14 122.15.22.45 - - "2015-12-26 12:40:10" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
15 122.15.22.45 - - "2015-12-26 12:40:16" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
16 122.15.22.45 - - "2015-12-26 12:40:17" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
17 122.15.22.45 - - "2015-12-26 12:40:17" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
18 122.15.22.45 - - "2015-12-26 12:40:41" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
19 122.15.22.45 - - "2015-12-26 12:40:42" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
20 122.15.22.45 - - "2015-12-26 12:40:50" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
21 122.15.22.45 - - "2015-12-26 12:40:51" http://www.haujournal.org/index.php/hau/article/view/hau2.1.005/1049 200 "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
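To double-check that reading, I ran a rough simulation of the filter over that excerpt (a sketch only, not the actual OJS code; it assumes a request is discarded whenever it follows the previous request for the same IP and URL by 10 seconds or less, and it expects the raw log lines without the numbering I added above):

<?php
// Rough simulation of a 10-second double-click filter (not the actual
// OJS code): a request is discarded if it follows the previous request
// for the same IP+URL by 10 seconds or less.
$window = 10; // seconds
$lastSeen = array();
$counted = array();
$lineNo = 0;
foreach (file($argv[1]) as $line) {
    $lineNo++;
    if (!preg_match('/^(\S+) - - "([^"]+)" (\S+) \d+ "/', $line, $m)) continue;
    list(, $ip, $time, $url) = $m;
    $t = strtotime($time);
    $key = "$ip $url";
    if (!isset($lastSeen[$key]) || $t - $lastSeen[$key] > $window) {
        $counted[] = $lineNo; // this request would be recorded as a view
    }
    $lastSeen[$key] = $t;
}
echo implode(', ', $counted), "\n"; // prints 1, 11, 13, 18 for the excerpt above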
The COUNTER Code of Practice does require filtering out known bot traffic, and this is typically handled by a bot user-agent list, which OJS implements. Unfortunately, the examples you’ve copied above don’t identify a bot-based user agent; this may be the result of user-agent spoofing. In that case, it is appropriate to block or discount the usage by IP. EPrints is beginning to explore such an IP blacklist, and that approach may be something we would want to consider for the future of OJS.
Yes, for the future, more robust filtering of bots would be good. In the meantime, it looks like I’m going to have to clear out all entries in the metrics table for the past couple of years and reprocess all the log files. To prevent the inflated HTML galley views, it looks like I have a couple of options:
1. Increase the COUNTER_DOUBLE_CLICK_TIME_FILTER_SECONDS_HTML variable in UsageStatsLoader.php to something more than 10 seconds, maybe setting it equal to the 30 seconds used for other file formats.
2. Create an ad hoc list of IPs that are making high-frequency requests and skip them.
I’m thinking of going with #2, editing the UsageStatsLoader.php script to add an array that stores each $entryHash that has more than three double-downloads, so further requests with those hashes can be skipped.
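Something like the following is what I have in mind (a rough sketch of the idea only, not a drop-in patch for UsageStatsLoader.php; the shouldSkip() helper and the $wasDoubleClick flag are placeholders I made up for illustration):

<?php
// Rough sketch of option #2: once an $entryHash has triggered the
// double-click filter more than three times, treat every further request
// with that hash as bot traffic and skip it entirely.
$doubleClickCounts = array(); // $entryHash => filtered double-downloads so far
$suspectedBots = array();     // $entryHash => true once the threshold is passed

function shouldSkip($entryHash, $wasDoubleClick, &$doubleClickCounts, &$suspectedBots) {
    if (isset($suspectedBots[$entryHash])) return true; // already flagged
    if ($wasDoubleClick) {
        $doubleClickCounts[$entryHash] = isset($doubleClickCounts[$entryHash])
            ? $doubleClickCounts[$entryHash] + 1 : 1;
        if ($doubleClickCounts[$entryHash] > 3) $suspectedBots[$entryHash] = true;
    }
    return false;
}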