Old statistics not shown

OJS3 only shows statistics since a certain date.


  • Acron and Statistics plugins are enabled.
  • I use a regular (not modified) apache2 log.
  • Plenty of files in the “archived” folder.

What I tried to resolve the issue:

I tried to “reprocess” the logs by moving old files from the “archive” folder to the “stage” folder.
When I call the “tools/runScheduledTasks.php” script, OJS takes a long time to reprocess them… but then nothing is shown on the statistics page.

I cleared the cache (client and server side) and waited more than 24 hours to let the tool complete any possible internal process, but the result is the same.

Application Version: OJS 3.2.1-4

Is it possible that the base_url changed at this point? Or that published articles got new IDs? When processing old log files, the URL is matched against the base_url. And when calculating article stats, the full URL is used to match (including the urlPath).

A change in any of these can cause a log entry to point to a URL that is no longer recognized, and therefore discarded. We are adapting the log files for 3.4 to avoid this problem.
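
For anyone who wants to check this on their own logs, here is a minimal sketch (not part of OJS) that counts how many entries in a log file start with a given base URL. It assumes the requested URL is the sixth whitespace-separated field of each line, as in the sample entries quoted later in this thread; the function name `count_base_url_matches` is hypothetical.

```shell
# Hypothetical helper, not an OJS tool.
# $1 = base URL without trailing slash, $2 = log file.
# Assumption: the requested URL is the 6th whitespace-separated field.
count_base_url_matches() {
    awk -v base="$1" '
        {
            # The entry counts as a match only if the URL starts with base + "/"
            if (index($6, base "/") == 1) matched++
            else unmatched++
        }
        END { printf "matched=%d unmatched=%d\n", matched, unmatched }
    ' "$2"
}
```

A high `unmatched` count for an old file would suggest that its entries use a different protocol or domain than the current base_url.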


Thanks Nate for your answers. Comments about your comments:

Is it possible that the base_url changed at this point? Or that published articles got new IDs? When processing old log files, the URL is matched against the base_url. And when calculating article stats, the full URL is used to match (including the urlPath).

No change in base_url (I checked it twice) and, as far as I know, articles keep the same IDs and urlPath (DOIs keep working fine after the upgrade).

I don’t know if this is relevant to identifying the problem, but just after the migration I had stats since 2015… but when I “reprocessed” the logs, the older stats just disappeared.

More info:

Before the “reprocess”… I got a boa constrictor eating an elephant:

After reprocessing it… the boa is walking away:

If this is an issue with the reprocess script and won’t be fixed until 3.4, is there a way to recover the old statistics?
I have a copy of the old DB.

If not… what about removing all the data and letting OJS recalculate it all? My concern is that I don’t know whether I would lose the very old statistics (the ones from before the new statistics framework, which have no log files and exist only in the DB).

Thanks a lot for your time Nate.

Unfortunately, I’m not very familiar with the log file reprocessing. Looking at those two graphs, though, it looks a lot like something happened in April. The stats exist before or after. Have you inspected the log files before and after that month to see if you can identify a difference?

Sorry to resurrect this after so long, but this affects COUNTER statistics too, and today I found the solution.

@NateWr, I lied to you: looking into the logs, I found that the old entries used http, so it looks like the base_url did change from “http” to “httpS”.

If somebody else falls into the same hole:

  1. Check your archived logs (older vs. newer ones) and confirm that the URI (the URL including the protocol) is the same.
  2. Back up your old logs and replace the URIs with the current ones. For example:
$ find . -name '*.log' -exec sed -i -e 's/http:\/\/atheneadigital/https:\/\/atheneadigital/g' {} \;

  3. Move the log files from archive to stage.
  4. Reprocess as follows:
$ php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasks.xml
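
The check in the first step can be sketched as a small shell function that lists every distinct protocol-plus-host prefix found in a set of log files, with a count for each. This is only an illustration: the name `list_log_origins` is made up, and it assumes the requested URL is the sixth whitespace-separated field of each log line.

```shell
# Hypothetical helper, not an OJS tool.
# $@ = one or more log files.
# Assumption: the requested URL is the 6th whitespace-separated field.
list_log_origins() {
    awk '
        {
            # "https://host/path" splits into "https:", "", "host", ...
            split($6, part, "/")
            seen[part[1] "//" part[3]]++
        }
        END { for (origin in seen) print seen[origin], origin }
    ' "$@" | sort -k2
}
```

If the output shows more than one origin (for example both an http and an https variant of the same host), the older files need the replacement from step 2 before reprocessing.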

Ok… too early to claim victory.

Statistics from 2015 were recovered… but not those from 2016 until July 2021 (the moment we moved to OJS 3.2).

Graphically:

I took two random dates to compare…

First one is from 28/01/2015 and is correctly processed:

192.151.52.159 - - "2015-01-28 03:11:08" https://atheneadigital.net/article/download/349/408 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"

Second one, from 01/01/2018 is completely ignored:

82.158.21.185 - - "2018-01-01 14:05:38" https://atheneadigital.net/article/download/v11-n1-farias/826-pdf-es 200 "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"

The only difference I see is that the first URL uses numeric IDs and the second uses aliases…
Is the statistics script able to parse both, or does it only work with numbers?

Aliases being ignored could also explain the low statistics in 2015… because that log file is a mixture of both syntaxes. In the 2018 log file the syntax is uniform (only aliases) and the whole file is ignored.
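
To see how the two syntaxes are mixed in a given file, something like the following rough sketch could be used. It is not part of OJS: the function name `count_id_styles` is hypothetical, it assumes the URL is the sixth whitespace-separated field, and it only looks at the first segment after `article/view/` or `article/download/`.

```shell
# Hypothetical helper, not an OJS tool.
# $@ = one or more log files.
# Assumption: the requested URL is the 6th whitespace-separated field.
count_id_styles() {
    awk '
        {
            n = split($6, seg, "/")
            # Find ".../article/view/<id>" or ".../article/download/<id>"
            for (i = 1; i + 2 <= n; i++) {
                if (seg[i] == "article" && (seg[i+1] == "view" || seg[i+1] == "download")) {
                    if (seg[i+2] ~ /^[0-9]+$/) numeric++   # e.g. 349
                    else alias++                           # e.g. v11-n1-farias
                    break
                }
            }
        }
        END { printf "numeric=%d alias=%d\n", numeric, alias }
    ' "$@"
}
```

Lines that are not article views or downloads are skipped, so the two counts do not necessarily add up to the number of lines in the file.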

@bozana does this make any sense to you?
Any ideas I could try?

Thanks for your time,
m.

Hi @marc,

Oh, yes, that is surely the reason :frowning:
You are using Apache log files, right? – They can contain the urlPath, while our internal log files only contain IDs.
I’ve just checked: the feature to consider urlPath was lost with the 3.0 release, when the plugin was overhauled :frowning: I think we forgot to consider those URLs from Apache log files back then :frowning:
You do not have the internal, OJS log files (any more)? – to re-process and re-compile the stats from them…
I have considered the Apache URLs (and urlPath) in the newest work for 3.4, but… 3.4 is still to come…
What OJS version are you using?
I could eventually help with a patch, but I am not sure if this will be possible before 3.4 – 3.4 is high priority now…

Best,
Bozana

Thank you very much for your feedback, Bozana!

Oh, yes, that is surely the reason :frowning: You are using Apache log files, right? – They can contain the urlPath, while our internal log files only contain IDs.

Good news is this clarifies everything. :wink:

You are using Apache log files, right?

Yes, I am.

I’ve just checked: the feature to consider urlPath was lost with the 3.0 release, when the plugin was overhauled

Oops. I’m sorry to hear this. :frowning:

I have considered the Apache URLs (and urlPath) in the newest work for 3.4, but… 3.4 is still to come…

Hmm… could it be that people are getting wrong statistics and don’t know it?
And does it mean that users on Apache logs and an LTS release won’t have proper statistics until the next LTS, which will be 3.6 or so?

I mean, using Apache log files is not weird, so I’m wondering whether it would be a good idea to create a script in /tools to help with reprocessing the files.

You do not have the internal, OJS log files (any more)? – to re-process and re-compile the stats from them…

Yes, I have them. They are still stored in the “archive” folder, and I also have the original Apache log files (if that is useful in some way).

What OJS version are you using?

OJS 3.2.1-4 but moving to 3.3-LTS next month or so.

I could eventually help with a patch, but I am not sure if this will be possible before 3.4 – 3.4 is high priority now…

I fully understand. No worries.
Just forget about this until 3.4 is released… and let me know if you want me to file this as a GitHub issue to keep track of it.

Take care!
m.

Hi @marc,

Hmm… could it be that people are getting wrong statistics and don’t know it?
And does it mean that users on Apache logs and an LTS release won’t have proper statistics until the next LTS, which will be 3.6 or so?

I mean, using Apache log files is not weird, so I’m wondering whether it would be a good idea to create a script in /tools to help with reprocessing the files.

Yes, it seems there are not so many users using Apache log files, and probably even fewer using urlPaths (instead of IDs).
But, yes, once the major work on 3.4 is finished I can look into fixing the UsageStatsLoader to consider that too… See: Consider urlPaths in Apache log files in UsageStatsLoader · Issue #8599 · pkp/pkp-lib · GitHub.

Yes, I have them. They are still stored in the “archive” folder, and I also have the original Apache log files (if that is useful in some way).

If you would need to recalculate the old stats, you could try to do it using the internal log files from the archive folder. Maybe to do it step by step, always re-processing just a few, to see if all works well… Also, if domains/URLs have changed, you would need to adapt the old usage stats log files…
Let me know if you would need any help with that.
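
A minimal sketch of the "just a few at a time" approach, assuming the archive and stage folders are passed in as arguments (their real location depends on your files_dir, so they are placeholders here) and that a backup of the archive folder already exists. The function name `stage_batch` is hypothetical and not part of OJS.

```shell
# Hypothetical helper, not an OJS tool.
# $1 = archive folder, $2 = stage folder, $3 = number of files to move.
# Assumption: you have already backed up the archive folder.
stage_batch() {
    archive_dir=$1
    stage_dir=$2
    batch_size=$3
    moved=0
    # The glob expands in sorted order, so the oldest-named files go first
    for f in "$archive_dir"/*; do
        [ -f "$f" ] || continue                  # skip subfolders and empty globs
        [ "$moved" -lt "$batch_size" ] || break  # stop once the batch is full
        mv -- "$f" "$stage_dir"/ && moved=$((moved + 1))
    done
    echo "moved=$moved"
}
```

After each batch you would run `php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasks.xml` and check the statistics page before moving the next few files.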

Thanks and best wishes,
Bozana

Yes, it seems there are not so many users using Apache log files, and probably even fewer using urlPaths (instead of IDs).

I’m probably missing something here.

What are people using instead of Apache?
I mean… I know some use nginx, but I always thought Apache was the most widespread web server, and if you use Apache, you will have Apache logs, won’t you?

If you would need to recalculate the old stats, you could try to do it using the internal log files from the archive folder. Maybe to do it step by step, always re-processing just a few, to see if all works well… Also, if domains/URLs have changed, you would need to adapt the old usage stats log files…

Isn’t that what I did?

Explained here: Old statistics not shown - #5 by marc

Summarizing:

  1. Backed up the “usageStats/archive” folder.
  2. Replaced the URIs to move from http to httpS (the domain never changed).
  3. Moved the files from archive to stage.
  4. Asked OJS to reprocess them by calling php tools/runScheduledTasks.php plugins/generic/usageStats/scheduledTasks.xml

Let me know if you would need any help with that.

Thanks for your help!

But please… don’t worry about this now.
OJS 3.4 is the priority and we can talk about this once released.
I will mention you then.

Take care,
m.

Yes, it seems there are not so many users using Apache log files, and probably even fewer using urlPaths (instead of IDs).

@marc, regardless of the Web server, most people probably use the logs generated by OJS to process statistics, according to this option:

If you have these logs, generated by OJS and not the web server, that live in the files_dir folder, they will not contain the custom IDs.

Protocol/domain switching issue persists. There is still a need to standardize. But we don’t need to worry about custom IDs.


Thanks Diego.

I have this checkbox enabled, which probably explains why counting seems fine to me right now.
So the problem is only with old statistics… and only if you ask to recalculate them.
Now I understand why Bozana said it will only happen in very few situations.

If it is OK with you, let’s keep talking about this after the OJS 3.4 release.

Thank you both,
m.

Referring here a post where Clinton explains how metrics were stored in 2.x that could also be useful to recover data from old backups: