OJS page stalls - awesome delays

retostauffer · October 8, 2015, 4:36pm

Hi there

Me again. We have a very urgent problem. Since we started getting “a lot of traffic” on our OJS installation (OJS 2.4.6), the response time of OJS itself has been very jumpy, going from 0-1 seconds up to 30, 60, 9000 seconds. After a while, the system recovers. I’ve checked the system, and a guy from the IT service also checked the log files of the system, Apache, and the MySQL daemon. Nothing suspicious.

The server seems to stall on the index.php requests. Redirects are fine, but then the browser waits for index.php/ to respond. After restarting the httpd daemon everything looks OK … for a while.

For testing I set up a virtual host running a second instance (a direct copy) of the OJS installation on the same machine. Well, no problem on the virtual host.

Has anyone any idea where it comes from? It seems to be an OJS related problem, and not a server dependent problem. I’ve tried whatever was possible.

UPDATE: I just got a new idea and started another test. Both instances are now using the same database. The only difference is the OJS installation (and therefore the whole PKP-Caching-Smarty-Template-Backend). It seems to have something to do with that. Are there any OJS config.inc.php settings I missed that could cause my problem? Any ideas to increase the performance? Requests: the production version gets about 15,000 requests per hour.

Thanks very very much!
Reto

PS: marc wrote this summer about a similar problem, which was a browser-dependent thing (as I understood the topic) and, unfortunately, had no solution ( OJS 2.4.6 - Slow performance on firefox... flying with Chrome? - #2 by marc ). We tested with several browsers from several locations using several different systems - definitely not a browser problem.

retostauffer · October 8, 2015, 4:58pm

BTW: the delay jumps in 30 second intervals - whatever it is. Could not find any setting related to this interval.
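
One way to confirm such a pattern is to log response times and count how many land just above a multiple of 30 seconds. The following is only a minimal sketch, assuming the timings were already collected by some other means; the function name, the 2-second tolerance, and the sample values are illustrative and not taken from this thread.

```python
from collections import Counter


def bucket_by_step(latencies, step=30.0, tolerance=2.0):
    """Count latencies that sit at most `tolerance` seconds above k * step.

    Returns a Counter mapping k (0, 1, 2, ...) to the number of samples in
    [k * step, k * step + tolerance]. Samples that fit no bucket are counted
    under the key None.
    """
    counts = Counter()
    for seconds in latencies:
        k = int(seconds // step)
        offset = seconds - k * step
        counts[k if offset <= tolerance else None] += 1
    return counts


# Hypothetical measurements in seconds (not real data from the thread).
samples = [0.4, 0.9, 30.6, 1.1, 60.3, 30.2, 0.7, 90.8, 17.5]
buckets = bucket_by_step(samples)
for key, count in buckets.items():
    print(key, count)
```

If nearly all slow requests fall into buckets 1, 2, 3 and almost none under None, the delays are likely driven by a fixed timeout or retry interval rather than by gradual load.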

asmecher · October 8, 2015, 5:26pm

Hi @retostauffer,

The good news is that if you can observe a stalled request in action you can do a little bit of forensic work on it. I’d especially recommend checking for MySQL locks, e.g. using SHOW PROCESSLIST in MySQL.
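
As a rough illustration of what to look for in that output, here is a sketch that filters processlist rows for long-running or lock-waiting queries. It assumes the rows have already been fetched as dictionaries keyed by the standard SHOW PROCESSLIST columns (Id, User, Host, db, Command, Time, State, Info); the function name, the 10-second threshold, and the sample rows are hypothetical.

```python
def find_stalled(rows, min_seconds=10):
    """Return processlist rows that look stalled, longest-running first."""
    stalled = []
    for row in rows:
        if row["Command"] == "Sleep":
            continue  # idle connections are not waiting on anything
        state = row.get("State") or ""
        waiting_on_lock = "lock" in state.lower()
        if waiting_on_lock or int(row["Time"]) >= min_seconds:
            stalled.append(row)
    return sorted(stalled, key=lambda r: int(r["Time"]), reverse=True)


# Hypothetical rows for illustration only; the Info values are placeholders.
rows = [
    {"Id": 11, "User": "ojs", "Host": "localhost", "db": "ojs",
     "Command": "Sleep", "Time": 120, "State": "", "Info": None},
    {"Id": 12, "User": "ojs", "Host": "localhost", "db": "ojs",
     "Command": "Query", "Time": 31, "State": "Waiting for table level lock",
     "Info": "UPDATE ..."},
    {"Id": 13, "User": "root", "Host": "localhost", "db": None,
     "Command": "Query", "Time": 0, "State": "starting",
     "Info": "SHOW PROCESSLIST"},
    {"Id": 14, "User": "ojs", "Host": "localhost", "db": "ojs",
     "Command": "Query", "Time": 45, "State": "Sending data",
     "Info": "SELECT ..."},
]

for row in find_stalled(rows):
    print(row["Id"], row["Time"], row["State"])
```

An empty result while a page is visibly hanging would point away from MySQL and towards Apache/PHP or the filesystem.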

If you’re using the acron plugin to handle scheduled tasks, then the occasional long request is legitimate. If you’re running a high volume server, I’d suggest disabling that and using a proper cron job. See docs/README for details.

You can also check what Apache is doing by running apachectl fullstatus – though depending on what PHP SAPI you’re using this will probably not be terribly useful.

In any case, narrowing it down to MySQL vs. Apache/PHP is the first step. The fact that you’re seeing 30-second increments suggests a server timeout, so one of those is likely the culprit.

Regards,
Alec Smecher
Public Knowledge Project Team

retostauffer · October 8, 2015, 5:46pm

Good evening Alec

Thanks for the quick reply. As far as I can tell, I’ve checked “everything” (all I knew), and the backend should be fine. I increased the MySQL/Apache limits to “too huge” values which should never be reached with the current setup (and were never reached while monitoring Apache/system usage).

Acron sounds like a good hint. I disabled it a second ago and will keep monitoring the response times. Sounds possible. I’ll let you know if this solves the problem …

Greez
Reto

retostauffer · October 9, 2015, 11:20am

Well, it seems that Acron wasn’t the problem. apachectl fullstatus gives me:

apachectl fullstatus
                                 404 Not Found

    Stack Trace:

   File: /mnt/data/www/html/lib/pkp/classes/core/PKPPageRouter.inc.php line 184
   Function: Dispatcher->handle404()

   File: /mnt/data/www/html/lib/pkp/classes/core/Dispatcher.inc.php line 134
   Function: PKPPageRouter->route(Object(Request))

   File: /mnt/data/www/html/lib/pkp/classes/core/PKPApplication.inc.php line 178
   Function: Dispatcher->dispatch(Object(Request))

   File: /mnt/data/www/html/index.php line 64
   Function: PKPApplication->execute()

Seems that this weekend won’t be relaxing at all.

Is there any job deleting all old cache files? Could this lead to the long delays?

marc · October 9, 2015, 12:27pm

Hi @retostauffer,

Sorry for the silence.
I also had the feeling I had reviewed everything… it drove me crazy for a week or so.
The point is that I made a lot of changes and suddenly one day… the issue disappeared.

Now I’m without CDN, without extra caching (no xcache, no apc…) and it works fine.

I never worked with acron (I prefer crontab) and the issue was still there (in the end, I found it in every browser).

My best guess? A permissions conflict in the cache files or the registry folder.
Remove every file inside the cache folder and make sure the folder permissions are fine (e.g. 777 to test) to see what happens.
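
A quick way to spot such a permissions conflict is to walk the cache folder and list everything the current user cannot write to. This is only a sketch: the function name is made up, and it is only meaningful when run as the same user the web server runs as.

```python
import os


def unwritable_paths(cache_dir):
    """List entries under cache_dir that the current user cannot write to."""
    problems = []
    if not os.access(cache_dir, os.W_OK):
        problems.append(cache_dir)
    for root, dirs, files in os.walk(cache_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if not os.access(path, os.W_OK):
                problems.append(path)
    return problems


# Example call; replace the placeholder with the cache directory under the
# OJS root:
# print(unwritable_paths("/path/to/ojs/cache"))
```

An empty list means the cache tree is writable for that user, so the cause is more likely elsewhere.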

If not, take a close look at your PHP errors… along the way I fixed issues related to my multilanguage data.

Keep us up to date, please. I’m still very interested in discovering the origin of this.

Let me know if you want to compare config files or whatever.

Best wishes,
m.

ctgraham · October 9, 2015, 12:30pm

Via the interface, you can delete the cache files using the options in User Home → Site Administrator → Clear [Data / Template] Cache.

The expected outcome is that your cache directory under the OJS root should be emptied of the files in that directory and in subdirectories, but the subdirectories themselves will remain. It should look a lot like the clean install:

retostauffer · October 18, 2015, 8:22am

Dear community

We finally resolved the problem with the response times. It had something to do with OJS, but in the end it was actually a hardware problem. Our system is running on a Red Hat Enterprise virtual machine with a relatively small system disc (highly available disc) and a storage disc.

This storage is an NFS mount. The stalling problem occurred when we had relatively high traffic. OJS logs all views and downloads into daily usageEventLog files. To write the statistics, the file is locked for a short time, written, and unlocked again. During high traffic a lot of Apache processes want to write their log entries to the usageEventLog in parallel, and the NFS mount does not handle the file locks very quickly. The processes (users) therefore had to wait in a queue for the write lock, which led to the awesome delay times.
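
The lock/write/unlock cycle described above can be timed with a small script to compare an NFS mount against a local disc. This is a sketch under assumptions: it uses an advisory flock, which may or may not match what OJS does internally, it is Unix-only, and the function and file names are made up for the demo.

```python
import fcntl
import os
import tempfile
import time


def timed_locked_appends(path, count=200, line="usage event\n"):
    """Append `count` lines under an exclusive flock; return per-write seconds."""
    timings = []
    for _ in range(count):
        start = time.perf_counter()
        with open(path, "a") as handle:
            fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is granted
            try:
                handle.write(line)
                handle.flush()
            finally:
                fcntl.flock(handle, fcntl.LOCK_UN)
        timings.append(time.perf_counter() - start)
    return timings


# Demo on a local temporary directory; point `directory` at a folder on the
# NFS mount to compare the numbers.
directory = tempfile.mkdtemp()
log_path = os.path.join(directory, "usage_events_demo.log")
timings = timed_locked_appends(log_path)
print(f"max {max(timings):.4f}s, mean {sum(timings) / len(timings):.4f}s")
```

A single process only measures the raw cost of taking and releasing the lock. To approximate the contention seen under traffic, several copies would have to run in parallel against the same file.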

As the usageEventLog is the only file with this many parallel writes, we moved only this file (to avoid any problems with the available disc space) to the system disc, where Linux itself manages the file locks. As this is done far more efficiently, the queue is processed much more quickly, and this fix resolved the stalling problem.

Thanks for your help. As I haven’t any experience with NFS mounts, I could not find this problem earlier. However, I’m glad we could fix it and can now use OJS with the expected performance :).

Furthermore: we are now using web_cache=ON and have disabled scheduled_tasks in the OJS config.inc.php file. The scheduled tasks and the removal of cached files are handled by cron jobs, so Apache is not bothered with these tasks during user visits.
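
For anyone who wants to verify these two settings from a script, here is a minimal sketch that reads them with Python's configparser. It assumes that config.inc.php is INI-style and that scheduled_tasks sits in a [general] section and web_cache in a [cache] section; the sample text below is a stand-in for the real file, which may need more tolerant parsing.

```python
import configparser

# Stand-in for the relevant parts of config.inc.php (assumed layout).
SAMPLE = """\
; <?php exit(); // DO NOT DELETE ?>
[general]
scheduled_tasks = Off

[cache]
web_cache = On
"""


def read_flags(text):
    """Return the two settings as booleans (On/Off are parsed as True/False)."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    return {
        "scheduled_tasks": parser.getboolean("general", "scheduled_tasks"),
        "web_cache": parser.getboolean("cache", "web_cache"),
    }


flags = read_flags(SAMPLE)
print(flags)
```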

All best from Austria,
Reto