Symptoms
We have two production, load-balanced nodes that show similar but not identical behaviour. We have been unable to identify a root cause. The following issues are occurring persistently (some on only one node, some on both):
- Management console does not load, or takes a long while to load and then asks for credentials (node 1)
- Worklist in the Management console does not load (node 2)
- Worklists are not displayed on the approvers' worklist page (nodes 1 and 2) -> solved on node 1 by uninstalling the .NET security patches KB3035490 and KB3023224 // the same fix will also be applied to node 2
- One of our forms loads for a long while on submit and then fails with "invalid authentication" error messages
Diagnosis
We found that one table appeared to be corrupt, specifically [K2].[Server].[WorklistHeader]. It took up 9 GB of space despite holding only a little over 2,700 rows, and a simple SQL SELECT against it kept spinning without ever completing; that is how we figured out something must be wrong with that table.
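For reference, a quick way to compare the table's allocated space against its row count is sp_spaceused next to a plain COUNT. A minimal sketch, assuming the SqlServer PowerShell module and the default K2 database name; the instance name is hypothetical:

```powershell
# Compare allocated space vs. row count for the worklist table.
# 'SQLNODE1' is a hypothetical instance name; 'K2' is the default K2 database.
Invoke-Sqlcmd -ServerInstance 'SQLNODE1' -Database 'K2' -Query @"
EXEC sp_spaceused N'[Server].[WorklistHeader]';
-- NOLOCK so the count returns even while other sessions block the table
SELECT COUNT(*) AS RowsInTable FROM [Server].[WorklistHeader] WITH (NOLOCK);
"@
```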
We eventually used a simple PowerShell script to interface with the K2 API and perform a Get List on the currently logged-in user's worklist. That call also never finished executing; it just hung.
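The script was essentially a thin wrapper around the K2 client API. A minimal sketch, assuming the standard SourceCode.Workflow.Client assembly from the K2 installation (the path and server name are assumptions, adjust to your environment):

```powershell
# Load the K2 client assembly (hypothetical default install path).
Add-Type -Path 'C:\Program Files (x86)\K2 blackpearl\Bin\SourceCode.Workflow.Client.dll'

$conn = New-Object SourceCode.Workflow.Client.Connection
try {
    $conn.Open('k2server')             # K2 host server name is an assumption
    $worklist = $conn.OpenWorklist()   # this is the call that never returned
    $worklist | ForEach-Object { $_.ProcessInstance.Folio }
}
finally {
    $conn.Close()
}
```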
So the next step was to increase the thread pool count in the K2Server.setup file to something generous like 60 and restart the K2 blackpearl service on both nodes (after changing the file on both).
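Restarting the service on both nodes can be scripted. A small sketch, assuming the default service name of a standard install; the node names are hypothetical:

```powershell
# Restart the K2 host server service on both load-balanced nodes.
Invoke-Command -ComputerName NODE1, NODE2 -ScriptBlock {
    Restart-Service -Name 'K2 blackpearl Server'
}
```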
We saw immediate results: things started happening again and K2 was very responsive.
Resolution
After initially resolving the issue by increasing the thread pool count, we needed to do some additional digging into why this happened in the first place, so we could reset the thread pool count to the default, which is generally recommended.
We found that certain events sent out emails. These events used a custom service broker that connected back through the K2 API to the K2 process, which in turn returned the data for the emails. Each email event therefore held a server thread while its broker call waited for another server thread to serve it. You can imagine that when 20 or more process instances hit this email event at the same time, every thread on the server is occupied by a waiting event, the callbacks can never be scheduled, and the processes start getting blocked.
Normally this never happens, since you will only very seldom have 20 email events firing at exactly the same time. However, this specific process has a use case where, when a user does a certain thing, all active process instances have to be cancelled and a mail has to be sent for each.
So when there are more than 20 active process instances, more than 20 emails are sent at once, the threads get used up, and the problem starts happening.
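This failure mode is classic re-entrant thread-pool starvation, and it can be reproduced outside K2. A minimal PowerShell sketch (not K2 code): a pool of 4 threads and 4 outer tasks, each holding a thread while waiting on a nested call into the same pool, so nothing can ever complete:

```powershell
# Re-entrant starvation in miniature: 4 pool threads, 4 tasks that each
# queue (and block on) a nested task in the same pool. Nothing completes.
$pool = [runspacefactory]::CreateRunspacePool(1, 4)
$pool.Open()

$tasks = foreach ($i in 1..4) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    [void]$ps.AddScript({
        param($p)
        # The "broker calling back into the K2 API": a nested call that
        # needs a pool thread while this task is still holding one.
        $inner = [powershell]::Create()
        $inner.RunspacePool = $p
        [void]$inner.AddScript({ 'broker call finished' })
        $inner.EndInvoke($inner.BeginInvoke())   # blocks forever
    }).AddArgument($pool)
    [pscustomobject]@{ Ps = $ps; Handle = $ps.BeginInvoke() }
}

Start-Sleep -Seconds 5
$tasks | ForEach-Object { $_.Handle.IsCompleted }   # all False: deadlocked
```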
Now here comes the elegant solution.
What we did was put a start rule on the activity that sends the email, using the "Random" inline function to delay execution by between 0 and 180 seconds. An email can therefore take anywhere from 0 seconds to 3 minutes to be sent, which spreads the load and frees up threads in the pool.
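The underlying pattern is plain randomized jitter. The K2 start rule itself is configured in the designer, not in code, but as a PowerShell illustration of the same idea (the mail parameters are hypothetical):

```powershell
# Jitter: sleep a random 0-180 seconds before doing the work, so a burst of
# simultaneous events is smeared out instead of all hitting at once.
Start-Sleep -Seconds (Get-Random -Minimum 0 -Maximum 181)  # Maximum is exclusive
Send-MailMessage @mailParams                               # hypothetical params
```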
In this specific instance it is not crucial that the user gets the email immediately, so the above works beautifully. With this change we were able to execute more than 500 process instances, whereas before we could only execute 20.