Lots of duplicate DB queries when getting submissions

Describe the issue or problem

Calls to the v1 submissions API generate thousands of database queries, with 40-90% of the queries being duplicates. Adding memcached does not change this.

Steps I took leading up to the issue

  1. Go to the dashboard of a journal, at a URL like https://ojs.example.com/ABCD/submissions. In particular, look at requests like https://ojs.example.com/ABCD/api/v1/_submissions?…

What application are you using?

OJS 3.3.0.11

Additional information

Sample

For some journals, it takes tens of seconds for the submissions page to load, as up to 15 000 queries are sent to the database. Typically only 40-50% of the database queries are unique. In extreme cases, only 10% of the database queries are unique.

To illustrate, here are the numbers of database queries generated by different calls to the v1 submissions API. Each is a request for at most 30 submissions; they cover different journals, statuses and offsets.

+---------------+----------------+
| Total queries | Unique queries |
+---------------+----------------+
|         15229 |           1512 |
|         14795 |           1557 |
|         13767 |           1359 |
|         13017 |           1404 |
|          7261 |           1628 |
|          7204 |           1170 |
|          6814 |           1330 |
|          1808 |           1126 |
|           902 |            569 |
+---------------+----------------+
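The total-versus-unique ratio is straightforward to compute from a general query log export. A minimal sketch in Python, assuming the logged queries have already been read into a list of strings (the CSV parsing is omitted):

```python
from collections import Counter

def query_stats(queries):
    """Return (total, unique) counts for a list of SQL query strings."""
    counts = Counter(q.strip() for q in queries)
    return sum(counts.values()), len(counts)

# Toy input: three queries, one of them duplicated.
queries = [
    "SELECT * FROM user_settings WHERE user_id = 10351",
    "SELECT * FROM user_settings WHERE user_id = 10351",
    "SELECT * FROM users WHERE user_id = 10351",
]
total, unique = query_stats(queries)
print(total, unique)  # 3 2
```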
Correlation

The total number of submissions doesn’t seem to matter.

The number of queries and the amount of duplication seem to be related to how many authors there are. When the submissions have few or no authors, the request completes faster with fewer database queries. It is requests where the submissions have many authors that take a long time to complete and generate excessive numbers of queries. The last two examples above are from a test journal, whose submissions rarely have any authors at all.
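The shape of these numbers suggests the classic N+1 pattern: the same rows (e.g. user group settings) appear to be fetched once per author rather than once per request. A schematic illustration of the difference a per-request cache would make, with hypothetical helpers that are not actual OJS code:

```python
class CountingDB:
    """Stand-in for the database; counts how many queries it receives."""
    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def fetch(self, key):
        self.queries += 1  # one simulated SELECT
        return self.rows[key]

def user_group(db, group_id, cache=None):
    """Fetch a user-group row, optionally memoized for the request."""
    if cache is not None and group_id in cache:
        return cache[group_id]
    row = db.fetch(group_id)
    if cache is not None:
        cache[group_id] = row
    return row

rows = {3: "Author", 5: "Reviewer"}
refs = [3, 5] * 150  # 300 author records referencing the same two groups

db = CountingDB(rows)
for g in refs:
    user_group(db, g)
print(db.queries)  # 300 queries without caching

db = CountingDB(rows)
cache = {}
for g in refs:
    user_group(db, g, cache)
print(db.queries)  # 2 queries with a per-request cache
```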

Detail of the top example

Taking the top example from above, I can see that most of the queries are for user group settings, controlled vocabularies and stage assignments.

+---------------------------------+---------------+----------------+
| Table                           | Total queries | Unique queries |
+---------------------------------+---------------+----------------+
| user_group_settings             |          2349 |             46 |
| controlled_vocab_entries        |          1950 |            150 |
| controlled_vocabs               |          1950 |            150 |
| controlled_vocab_entry_settings |          1564 |            122 |
| stage_assignments               |          1440 |            300 |
| author_settings                 |           756 |             57 |
| submissions                     |           753 |             63 |
| user_settings                   |           606 |              8 |
| categories                      |           390 |             30 |
| publication_galleys             |           390 |             30 |
| submission_settings             |           390 |             30 |
| publications                    |           390 |             30 |
| user_groups                     |           360 |              2 |
| users                           |           276 |              8 |
| (other)                         |          1665 |            486 |
+---------------------------------+---------------+----------------+
| Grand Total                     |         15229 |           1512 |
+---------------------------------+---------------+----------------+
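A per-table breakdown like this can be approximated by pulling the first table name out of each logged query. A best-effort sketch (a simple regex on the FROM clause; statements that don't match, such as SHOW or INSERT, fall into "(other)"):

```python
import re
from collections import Counter

# Best effort: take the first table name after FROM.
FROM_RE = re.compile(r"\bFROM\s+`?(\w+)`?", re.IGNORECASE)

def table_of(query):
    """Return the first table named after FROM, else '(other)'."""
    m = FROM_RE.search(query)
    return m.group(1) if m else "(other)"

queries = [
    "SELECT * FROM user_group_settings WHERE user_group_id = 3",
    "SELECT * FROM user_group_settings WHERE user_group_id = 5",
    "SELECT ug.* FROM user_groups ug JOIN user_group_stage ugs ON 1=1",
]
by_table = Counter(table_of(q) for q in queries)
print(by_table["user_group_settings"])  # 2
```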

Continuing with the same example, these were the most frequent queries, all from a single request to the v1 submissions API.

+---------------------------------------------------------------+-------+
| Query                                                         | Count |
+---------------------------------------------------------------+-------+
| SELECT * FROM user_settings WHERE user_id = 10351             |   405 |
| SELECT ug.* FROM user_groups ug JOIN user_group_stage ugs […] |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 3     |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 5     |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 6     |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 16    |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 9     |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 14    |   330 |
| SELECT * FROM user_group_settings WHERE user_group_id = 15    |   330 |
| SELECT * FROM users WHERE user_id = 10351                     |    75 |
| SELECT * FROM users WHERE user_id = 1730                      |    65 |
| SELECT * FROM user_settings WHERE user_id = 1730              |    65 |
| SELECT * FROM users WHERE user_id = 19850                     |    55 |
| SELECT * FROM user_settings WHERE user_id = 19850             |    55 |
+---------------------------------------------------------------+-------+
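A frequency table like the one above is a one-liner over the same log export with Counter.most_common (a hypothetical reconstruction of the analysis; a spreadsheet pivot works just as well):

```python
from collections import Counter

# Synthetic log matching three of the rows in the table above.
queries = (
    ["SELECT * FROM user_settings WHERE user_id = 10351"] * 405
    + ["SELECT * FROM user_group_settings WHERE user_group_id = 3"] * 330
    + ["SELECT * FROM users WHERE user_id = 1730"] * 65
)
for query, count in Counter(queries).most_common(2):
    print(count, query)
# 405 SELECT * FROM user_settings WHERE user_id = 10351
# 330 SELECT * FROM user_group_settings WHERE user_group_id = 3
```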
Memcached

After adding memcached, the numbers didn't change. I already had the file object cache enabled. The only thing I saw going into memcached was some basic data about the current issue of each journal. I didn't see any issue or submission data going into memcached when accessing either front-end or back-end pages.


Methodology

This is the method I used to capture the SQL queries OJS was generating.

I used a local dev environment running on my laptop.

  1. I turned on the general query log in MySQL (5.7), with output directed to a table.
  2. I went to the submissions page of a journal.
  3. I truncated the general query log table.
  4. In either the All Active or Archive tab, I clicked the link to page 2 of the submissions. This generated a single request to the v1 submissions API.
  5. I exported the general query log table to a CSV file.
  6. I truncated the general query log table again.
  7. I clicked the link to go back to page 1 of the submissions. This again generated a single request to the v1 submissions API.
  8. I again exported the general query log table to a second CSV file.
  9. I repeated steps 2-8 with other tabs and other journals.

Hi @Jim_Nicholls,

We’re aware that the query count for OJS/OMP/OPS is extremely high and I’ve got it flagged for some future work. However, we’re making some infrastructural changes around data storage/retrieval that should get completed and stabilized beforehand.

So far the high query count seems to only affect performance when the database is not on the same server as the PHP codebase. Is that your situation? If so, I’d recommend hosting with the database and web server on the same box for the moment.

There’s some discussion about caching on GitHub, see e.g.:

However, as you’ll see, that work is pending some infrastructural work.

Regards,
Alec Smecher
Public Knowledge Project Team

Hi @asmecher,

Yes, our database is not co-deployed with the application. Our cyber security and architecture standards do not allow me to put them on the same box.

I look forward to the fruits of the pending infrastructural work.

Hi @Jim_Nicholls,

The other option would be to deploy using Docker, if that’s a possibility. You’d be co-deploying, but it would be containerized.

Regards,
Alec Smecher
Public Knowledge Project Team