Usage Statistics: Importing Apache logs but "pdf" and "times viewed" columns stay empty

netsensei · January 8, 2021, 4:32pm

Hello,

I am trying to import Apache logs into our OJS installation using the php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasksExternalLogFiles.xml command.

=> However, when I look at stats/publishedSubmissions page of a journal, the “pdf” and “times viewed” columns in the table below the graph are all showing 0’s (zero, nothing recorded) whereas the “abstract viewed” column does show positive integers.

Our OJS installation contains historic data (past years, since 2014) that was processed through the usage events and cron. Earlier years do contain data over all columns.

Using PHP’s var_dump, I’ve dug around in the UsageStatsLoader.inc.php file, feeling my way through the ingest process.

Potential issues I’ve scratched off my list:

It’s not the regex. The data is parsed correctly.
It’s not permissions to the directories or the files in the usageStats folder (stage, archive,…)
(I’m working in a local VM) I’ve tried setting the base_url to the domain of the server where the application is hosted.
The apache logs contain parsable data. And I can see how the data gets picked up by the loader.
I can see that the data is loaded in the metrics table with the ojs::counter metric_type. The load_id column refers to the log file I imported. Comparing with data from the usage_events logs, I don’t see anything missing (e.g. context_id, submission_id,… all contain data)

During import via runScheduledTasks.php, I do get this cryptic warning on the CLI, but it doesn’t break the process:

PHP Warning: assert(): assert($submissionId > 0 || (int)$uploaderUserId || (int)$fileId || (int)$assocId) failed in /vagrant/ojs.ugent.be/lib/pkp/classes/submission/SubmissionFileDAO.inc.php on line 1003

After a run via the CLI, the log file ends up in the Archive directory.

I have tried moving the same file back to the Stage directory and running php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasks.xml as a stopgap. But that just seems to execute the exact same logic, only using the regex matching the internal “usage event” log format. (So, doing this seems to be redundant.)

The scheduledTaskLogs directory does contain a log file for the run, which reads like this:

[2021-01-08 16:35:17] [Notice] Task process started.
[2021-01-08 16:35:19] [Warning] The line number 17607 from the file /opt/ojs-files/usageStats/processing/logfile_443_access_ssl.log-20200607 contains an url that the system can't remove the base url from.
[2021-01-08 16:35:22] [Warning] The line number 33611 from the file /opt/ojs-files/usageStats/processing/logfile_443_access_ssl.log-20200607 contains an url that the system can't remove the base url from.
...
[2021-01-08 16:36:41] [Notice] File /opt/ojs-files/usageStats/processing/logfile_443_access_ssl.log-20200607 was processed and archived.
[2021-01-08 16:36:41] [Notice] Task process stopped.

The warnings refer to only a very small fraction of the lines in the log file. That leads me to conclude that, barring those few lines, the entire log file was processed.
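To put a number on "a very small fraction", the warnings in the scheduled task log can be counted per access log. The following is a minimal Python sketch, not part of OJS; the regex is an assumption based only on the warning lines quoted above, and `skipped_lines` is a hypothetical helper name.

```python
import re

# Assumed format, taken from the warning lines quoted in this post:
# [timestamp] [Warning] The line number N from the file PATH contains ...
WARNING_RE = re.compile(
    r"^\[[^\]]+\] \[Warning\] The line number (\d+) from the file (\S+)"
)

def skipped_lines(task_log_text):
    """Map each access log path to the sorted line numbers flagged by a warning."""
    skipped = {}
    for line in task_log_text.splitlines():
        match = WARNING_RE.match(line)
        if match:
            skipped.setdefault(match.group(2), []).append(int(match.group(1)))
    return {path: sorted(numbers) for path, numbers in skipped.items()}

# Excerpt of the task log shown above (the "..." line is simply ignored).
SAMPLE_TASK_LOG = """\
[2021-01-08 16:35:17] [Notice] Task process started.
[2021-01-08 16:35:19] [Warning] The line number 17607 from the file /opt/ojs-files/usageStats/processing/logfile_443_access_ssl.log-20200607 contains an url that the system can't remove the base url from.
[2021-01-08 16:35:22] [Warning] The line number 33611 from the file /opt/ojs-files/usageStats/processing/logfile_443_access_ssl.log-20200607 contains an url that the system can't remove the base url from.
...
[2021-01-08 16:36:41] [Notice] Task process stopped.
"""

FLAGGED = skipped_lines(SAMPLE_TASK_LOG)
```

Dividing the number of flagged lines by the total line count of the access log (for example from `wc -l`) gives the fraction that was skipped.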

Background information:

This installation runs OJS 3.1.2.1
It’s an older installation that started out at 2.3.7.0 and got subsequent updates over the years according to the system information (I inherited this installation).
We migrated from MySQL to PostgreSQL at the end of 2019.

We disabled the logging via usage events in early 2020 because it caused performance issues on the PostgreSQL server, understanding that we could import the apache logs in good order.

The goal here is to generate COUNTER statistics (XML download via OJS) for 2020 based on the apache logs.

=> Are there any subsequent steps I have to take after importing the apache log to complete the parsing of the log data?

Kind regards!

netsensei · January 11, 2021, 2:08pm

Okay, I have whittled the problem down to how the base_url is configured in config.inc.php.

This is how it looks in my case:

base_url = https://site.com

base_url[journalA] = https://site.com/journalA
base_url[journalB] = https://site.com/journalB
base_url[journalC] = https://www.journalC.com

The Apache log file looks a bit like this:

WWW.XXX.YYY.ZZZ - - [01/Jan/2020:03:04:05 +0100] "GET /journalA/article/view/1234/4567 HTTP/1.1" 200 870 "https://site.com/journalA/article/view/1234/4567" "user agent string"

In the processFile function of UsageStatsLoader.inc.php, the importer tries to match the paths of the GET requests logged in the Apache log against the base_url configuration. This happens via the _getUrlMatches function.
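For reference, this is roughly the piece of each log line that ends up being matched. The sketch below is in Python and is not the regex that ships with the usageStats plugin; `COMBINED_RE` and `parse_line` are hypothetical names, and the pattern assumes the common Apache "combined" log format.

```python
import re

# Assumed layout: Apache "combined" log format.
COMBINED_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_line(line):
    """Return the named fields of one access log line, or None if it does not match."""
    match = COMBINED_RE.match(line.strip())
    return match.groupdict() if match else None

SAMPLE_LINE = (
    'WWW.XXX.YYY.ZZZ - - [01/Jan/2020:03:04:05 +0100] '
    '"GET /journalA/article/view/1234/4567 HTTP/1.1" 200 870 '
    '"https://site.com/journalA/article/view/1234/4567" "user agent string"'
)

RECORD = parse_line(SAMPLE_LINE)
```

Note that the request itself only carries the path (`RECORD["path"]`), without scheme or host. That is why the host part of a base_url cannot help to tell journals apart at this stage.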

Long story short, that function ultimately tries to match the /journalA/article/view/1234/4567 string against the list of base_url entries in the configuration. Deep down in core.inc.php, the _getBaseUrlAndPath function gets called, which tries to strip the base_url from the URL.

The logic breaks on the configuration above and ends up mismatching the base_url of journalC with each and every URL from the Apache log. It also incorrectly strips that base_url from the presented URL. As a consequence, the URLs aren’t handled correctly by the processing logic.
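My working assumption about why journalC gets in the way, sketched in Python: this is not the OJS code, and `candidates` is a hypothetical helper. If only the path component of each base_url is compared against the logged request path, a base_url without a path (journalC) is a prefix of everything.

```python
from urllib.parse import urlparse

# Same per-journal entries as in the config.inc.php excerpt above.
BASE_URLS = {
    "journalA": "https://site.com/journalA",
    "journalB": "https://site.com/journalB",
    "journalC": "https://www.journalC.com",
}

def candidates(request_path, base_urls):
    """List every journal whose base_url path is a prefix of the logged request path.

    Only the path component is compared, on the assumption that the access log
    records the request path and not the host.
    """
    matches = []
    for journal, base_url in base_urls.items():
        base_path = urlparse(base_url).path  # empty string for journalC
        if request_path.startswith(base_path):
            matches.append(journal)
    return matches

ARTICLE_MATCHES = candidates("/journalA/article/view/1234/4567", BASE_URLS)
```

Under that assumption, journalC is a candidate for every request path, so any matcher that does not prefer the longest matching prefix can end up picking it. With the journalC entry commented out, only journalA remains a candidate for the example path.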

So, what I have to do in order to make the matching work: I have to comment out the offending base_url from the configuration, like this:

base_url = https://site.com

base_url[journalA] = https://site.com/journalA
base_url[journalB] = https://site.com/journalB
; base_url[journalC] = https://www.journalC.com

Of course, there are all kinds of concerns that come with approaching the issue like this.