It seems the Linter workflow has many database-related issues, which are causing user complaints. Based on the Wikimedia logs, I see 2 kinds of issues:
- Count queries that take over 60 seconds to run and are therefore killed by the query killer. Those should either be cached and run as maintenance jobs, or every query should take less than 1 second to execute, no matter the size of the table
- Database contention due to multiple rows being written at the same time, creating Lock wait timeout errors
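For the first kind of issue, one possible way to keep a count bounded regardless of table size is to cap it: the inner query stops after a fixed number of rows, so the cost does not grow with the table. The following is only a minimal sketch run against an in-memory SQLite database. The table and column names (`linter`, `linter_cat`), the cap value and the helper `capped_count` are assumptions for illustration, not the extension's actual code.

```python
import sqlite3

COUNT_CAP = 1000  # hypothetical cap; beyond it the UI could show "1000+"


def capped_count(conn, category, cap=COUNT_CAP):
    """Count rows for one category, reading at most cap + 1 matching rows.

    Returns (count, is_capped). The inner LIMIT bounds the work done,
    so the cost no longer depends on the total size of the table.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT 1 FROM linter WHERE linter_cat = ? LIMIT ?"
        ") AS capped",
        (category, cap + 1),
    ).fetchone()
    n = row[0]
    return min(n, cap), n > cap


# Demo data: an assumed, simplified schema (not the real table definition).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE linter ("
    " linter_id INTEGER PRIMARY KEY,"
    " linter_page INTEGER,"
    " linter_cat INTEGER)"
)
conn.execute("CREATE INDEX linter_cat_idx ON linter (linter_cat)")
conn.executemany(
    "INSERT INTO linter (linter_page, linter_cat) VALUES (?, ?)",
    [(i, 1) for i in range(2500)] + [(i, 2) for i in range(10)],
)

print(capped_count(conn, 1))  # large category: reported as capped
print(capped_count(conn, 2))  # small category: exact count
```

A capped count trades exactness for predictability. Exact totals, if still needed, could come from a cached value refreshed by a maintenance job, as suggested above.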
These issues seem to be caused by multiple executions of job runners, which are probably retried and make the problem worse. Some way of coordinating multiple executions over the same data should be considered. For actual examples of both issues, look at:
https://logstash.wikimedia.org/goto/a5d9a903670cebc33251939f42e8c292 The errors seem to be increasing. The good news is that they do not seem to be affecting other functionality, but they could end up affecting the reliability of the host if many connections get stuck.
At least one wiki user has complained about job delays, which are probably related to these issues.
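As an illustration of what coordinating multiple executions over the same data could look like, the sketch below coalesces pending updates per page, so that a retried or duplicated job for the same page results in a single write. It also writes pages in a fixed order, which is a common way to reduce lock conflicts between concurrent writers. Everything here is hypothetical: `PageUpdateCoalescer`, `write_fn` and the sample error names are made up for the example, and the coordination is in-process only. Real job runners are separate processes and would need a shared queue or lock instead.

```python
import threading


class PageUpdateCoalescer:
    """Keep only the latest pending update per page and write each page once."""

    def __init__(self, write_fn):
        self._write_fn = write_fn  # hypothetical callback doing the DB write
        self._pending = {}         # page_id -> latest list of lint errors
        self._lock = threading.Lock()

    def submit(self, page_id, errors):
        # A later submission for the same page replaces the earlier one,
        # so retries and duplicates do not pile up as separate writes.
        with self._lock:
            self._pending[page_id] = errors

    def flush(self):
        # Take the whole batch under the lock, then write outside of it.
        with self._lock:
            batch, self._pending = self._pending, {}
        # A fixed write order (ascending page id) means concurrent writers
        # acquire row locks in the same sequence.
        for page_id in sorted(batch):
            self._write_fn(page_id, batch[page_id])
        return len(batch)


writes = []
coalescer = PageUpdateCoalescer(lambda page_id, errors: writes.append((page_id, errors)))
coalescer.submit(42, ["obsolete-tag"])
coalescer.submit(42, ["obsolete-tag", "missing-end-tag"])  # retry for the same page
coalescer.submit(7, [])
flushed = coalescer.flush()
print(flushed, writes)
```

In this run, three submissions turn into two writes, and page 42 is written once with its most recent result.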