Symptoms
There is a SharePoint list with 2000 employee records. There is a status field that we set to start a K2 workflow we have developed. The workflow starts on-item-changed of the list. We kicked off about 2000 instances. The workflows were successful, but when there were only 70 left, the workflows stopped. We now cannot start any new ones or resume the 70 left to do. This needs to be resolved before employees return to work in the morning. It is impacting the entire company.
Diagnoses
This issue was investigated on an after-hours call. During the session we noticed that the MSMQ queue was clogged with messages. This queue contains all the process instances triggered by a SharePoint event, as well as client event notifications that are waiting to be sent out.
Resolution
After a while, we noticed that the number of messages in the queue was gradually going down. This could be the reason why it took a while for instances to appear in Workspace and why the remaining 70 had not been started: they were still in the queue. Likewise, client event notifications that had not yet been sent out were still sitting in the queue. If client event notifications fail to send, this usually throws an error in the host server logs and in the [Eventbus].[ClientRecorderError] table. In this case, no errors pertaining to the issue were found.
The messages were slowly processed out of the queue, and the issue resolved itself once the last set of messages had cleared. Since no errors were found in [Eventbus].[ClientRecorderError], this suggests that the messages themselves had been handed off for sending, but something on the Exchange or SMTP side was holding them up.
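
When a backlog like this is draining on its own, a rough estimate of how long it will take can be made from a few readings of the queue's message count. The following is a minimal Python sketch, not part of K2 or MSMQ: the function name `estimate_drain_seconds` and the sample readings are hypothetical, and it assumes the counts are noted by hand together with the time each was taken and that the queue keeps draining at a steady rate.

```python
from typing import Optional, Sequence, Tuple


def estimate_drain_seconds(samples: Sequence[Tuple[float, int]]) -> Optional[float]:
    """Estimate the seconds left until the queue is empty.

    samples: (seconds_since_first_reading, messages_in_queue) pairs,
             in the order they were taken. Hypothetical helper.
    Returns None when no estimate is possible (fewer than two readings,
    or the queue is not shrinking).
    """
    if len(samples) < 2:
        return None

    # Compare the first and the most recent reading.
    (t_first, n_first), (t_last, n_last) = samples[0], samples[-1]
    elapsed = t_last - t_first
    drained = n_first - n_last

    # A growing or static queue means it is not draining by itself.
    if elapsed <= 0 or drained <= 0:
        return None

    rate = drained / elapsed   # messages processed per second
    return n_last / rate       # seconds needed for the remaining messages


# Hypothetical readings: 900 messages at the start, 600 ten minutes later.
readings = [(0, 900), (600, 600)]
remaining = estimate_drain_seconds(readings)
```

If the function returns `None` because the count is flat or rising, the queue is not clearing by itself and waiting is unlikely to help.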