External stats problem

Hello,
There was a stats problem in OJS 2.4.7.1 where the total Galley Views would always be 0. We have since upgraded to 2.4.8.1 and the Galley Views are now working. However, I want to process the old log files to try to recover some of the galley view stats. My understanding is that we have to use the Apache access logs to do this, as the usage_event logs will not work.

I have moved the access logs into the usageStats/stage folder,
then I modified the scheduled_tasks.last_run value in the DB for the UsageStatsLoader so that it is more than 24 hours old.
Then I run:
php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasksExternalLogFiles.xml

However the stats are not processed. The access logs do get deleted from the stage folder.

I’m fairly sure the access logs are in the correct format:
68.180.230.102 - - [12/Sep/2016:04:42:29 -0600] "GET /index.php/ewjus/article/view/227/95 HTTP/1.1" 200 17026
66.249.79.166 - - [12/Sep/2016:05:05:30 -0600] "GET /index.php/ewjus/article/download/218/86 HTTP/1.1" 200 236740

Is there something I’m missing?

Did the access logs get moved to the ‘archive’ folder, or to the ‘reject’ folder, or are they still in the ‘processing’ folder?

If processing completed, there should be a log file in the ‘scheduledTaskLogs’ folder under your files_dir. This log file will detail what happened, or what went wrong.

Thanks. In the scheduledTaskLogs folder I get:
The line number 1 from the file /journals-data/uploads/www.ewjus.com/usageStats/processing/www.ewjus.com-access_log-20160814 is not a valid log entry and the file was rejected.

What does a valid log entry look like?

The regular expression for external log processing can be defined in:
User Home → Journal Manager → System Plugins → Generic Plugins → Settings
Does anything appear there?

If not, the default regex is used from:

What is the first line of
/journals-data/uploads/www.ewjus.com/usageStats/reject/www.ewjus.com-access_log-20160814
?

The regex is: /^(\S+) \S+ \S+ \[(.*?)\] "(\S+).*?" \d+ \d+ "(.*?)" "(.*?)"/

The first line of the log file is:
208.115.111.69 - - [08/Aug/2016:05:42:15 -0600] "GET /ojs/index.php/ewjus/search/authors?searchInitial=H HTTP/1.1" 404 232

Your Apache logs are not (or at least the first line of that log is not) in the standard “Combined” format. Do the rest of your lines in the file look similar? Your Apache LogFormat is probably something like:
LogFormat "%h %l %u %t \"%r\" %>s %b"
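
If it helps to check the rest of the file, here is a rough Python sketch. It relies on the fact that Combined is Common plus two trailing quoted fields (referer and user agent). The function name looks_combined is made up for illustration, and the Combined-style sample line is invented for comparison; only the first sample line comes from your log.

```python
import re

# Combined format = Common format plus two trailing quoted fields:
# "%{Referer}i" and "%{User-agent}i".
COMBINED_TAIL = re.compile(r' "[^"]*" "[^"]*"\s*$')


def looks_combined(line):
    """Return True if the line ends with quoted referer and user-agent fields."""
    return COMBINED_TAIL.search(line) is not None


# First line of the rejected file, as posted in this thread.
common_line = (
    '208.115.111.69 - - [08/Aug/2016:05:42:15 -0600] '
    '"GET /ojs/index.php/ewjus/search/authors?searchInitial=H HTTP/1.1" 404 232'
)
# Hypothetical Combined-style line, invented for comparison only.
combined_line = common_line + ' "-" "ExampleBot/1.0"'

print(looks_combined(common_line))    # False
print(looks_combined(combined_line))  # True
```

Running looks_combined over every line of a log would show whether the whole file is in the shorter format or only some of it.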

You should be able to construct a regular expression which will work with your existing logformat, with the exception that it doesn’t appear you are capturing the userAgent string in your logs. This information will be lost. A functional regex should be something like:
/^(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?)\] "\S+ (?P<url>\S+).*?" (?P<returnCode>\S+) \S+(?P<userAgent> *?)/
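
Before staging the files again, it may be worth trying the pattern against a few lines outside OJS. The sketch below uses Python only because its re module accepts the same (?P<name>...) syntax; it is not how the plugin itself runs the regex. The helper names parse_line and first_invalid_line are made up for illustration, and the sample lines are the ones posted in this thread.

```python
import re

# Same pattern as suggested above, split across two string literals.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?)\] "\S+ (?P<url>\S+).*?" '
    r'(?P<returnCode>\S+) \S+(?P<userAgent> *?)'
)


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None


def first_invalid_line(lines):
    """Return the 1-based number of the first line that fails to parse, else None."""
    for n, line in enumerate(lines, start=1):
        if parse_line(line.rstrip("\n")) is None:
            return n
    return None


sample = [
    '208.115.111.69 - - [08/Aug/2016:05:42:15 -0600] "GET /ojs/index.php/ewjus/search/authors?searchInitial=H HTTP/1.1" 404 232',
    '68.180.230.102 - - [12/Sep/2016:04:42:29 -0600] "GET /index.php/ewjus/article/view/227/95 HTTP/1.1" 200 17026',
    '66.249.79.166 - - [12/Sep/2016:05:05:30 -0600] "GET /index.php/ewjus/article/download/218/86 HTTP/1.1" 200 236740',
]

entry = parse_line(sample[0])
print(entry["ip"], entry["url"], entry["returnCode"])
print(repr(entry["userAgent"]))    # empty string: no user agent in these logs
print(first_invalid_line(sample))  # None when every line matches
```

To check a real file, pass the open file object to first_invalid_line, for example first_invalid_line(open(path)).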

Thanks I’ll work on that. I was doing this:
/^(\S+) \S+ \S+ [(.+)] "(\S+).*" \d+ \d+$/

but that didn’t work.

I think your regex would have matched if you had escaped the square brackets, but it would almost certainly have thrown "index not defined" warnings when the code went to look for the fifth match.
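
Both points can be reproduced with a small Python sketch (an illustration only, not the OJS code). Unescaped, [(.+)] is a character class that matches a single "(", ".", "+" or ")" character, so the pattern cannot get past the "[" in the log line. With the brackets escaped it matches, but it only defines three capture groups, so a lookup of a fifth group has nothing to return. In Python that shows up as an IndexError, which is the rough counterpart of the PHP warning.

```python
import re

line = (
    '208.115.111.69 - - [08/Aug/2016:05:42:15 -0600] '
    '"GET /ojs/index.php/ewjus/search/authors?searchInitial=H HTTP/1.1" 404 232'
)

# As posted: [(.+)] is a character class, not a literal "[...]" pair.
unescaped = re.compile(r'^(\S+) \S+ \S+ [(.+)] "(\S+).*" \d+ \d+$')
# Brackets escaped so they match the literal "[" and "]" around the date.
escaped = re.compile(r'^(\S+) \S+ \S+ \[(.+)\] "(\S+).*" \d+ \d+$')

print(unescaped.match(line))   # None: the pattern stops at the "["

m = escaped.match(line)
print(m.groups())              # three groups: IP, date, request method

try:
    m.group(5)                 # there is no fifth group
    fifth_group_missing = False
except IndexError:
    fifth_group_missing = True
print(fifth_group_missing)     # True
```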

Right. Thanks. I tried this:

/^(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?)\] "\S+ (?P<url>\S+).*?" (?P<returnCode>\S+) \S+(?P<userAgent> *?)/

again and it worked now. Didn’t work the first time, I must have missed something.

Thanks for your help.
Jeremy