We are working with SharePoint 2010 and Nintex Workflow. Our role is similar to an advanced user, which means we cannot access Central Administration, so the only error information we can see when a Nintex workflow "misbehaves" is what is shown on the site itself.
We have several site collections with a number of subsites (around 40 and growing). Some of them run Nintex workflows, and in some cases those workflows fail at random, showing the "workflow failed to run" message, the "workflow failed to start" message, or sometimes both. These errors are more common (but not exclusive) in library workflows than in list workflows.
After reading about similar errors in the forums, we have tried several approaches (following the suggested best practices), including:
- Small workflows
- Dehydration techniques
- Reducing the number of workflows running simultaneously on the site by using "Pause" actions
- And so on
But the errors still persist (not always, but often). The most intriguing thing is that we are seeing different outcomes even for the same workflow:
- Some executions fail to run/start but later start/run automatically
- Some fail to run/start and never work
- Some work perfectly well from the beginning
We do not have any information about the farm/server configuration, and the system administrators do not seem very enthusiastic about checking the logs to find a reason for this behavior.
Has anyone faced something like this before? If so, any ideas on how to solve it, or at least where to look?
I assume you are familiar with checking the NintexWorkflowHistory list at the site level to see what's going on?
Note: It's worth remembering that this is a standard SharePoint list, so you can create custom views. I find this helps, especially when debugging: you can hide much of the "noisy" system-level information the list shows by default and display only the details you need.
This list is not always particularly helpful for diagnosing an individual workflow crash. SharePoint has a habit of leaving you with a content-free message like "workflow failed". However, you can look here for general indications of problems.
For instance, if you see "Pause" actions waiting far longer than their defined pause time (e.g. a 5-minute pause lasting 40+ minutes), this indicates either a general overload on the server or a problem with the SharePoint "Timer Job Queue".
We've had several issues with the Timer Job Queue on our servers recently; the symptoms include workflows failing to start, or getting stuck in Pause, Loop, or State Machine actions and never progressing.
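The pause-overrun check described above boils down to a simple rule: flag any Pause whose observed duration is far beyond its configured length. Below is a minimal Python sketch, assuming you can copy the pause start/resume times out of a custom view of the history list into records by hand. The `PauseRecord` type, the `overdue_pauses` function, and the default `factor` and `slack` thresholds are all hypothetical and are not part of Nintex or SharePoint. The slack is there because some delay is normal (resuming a paused workflow depends on SharePoint's timer service), so treat both thresholds as values to tune.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class PauseRecord:
    # Hypothetical record filled in from history-list data; not a Nintex type.
    workflow: str
    configured: timedelta        # pause length defined in the workflow design
    started: datetime            # when the Pause action began
    resumed: Optional[datetime]  # when the workflow moved on; None if still paused


def overdue_pauses(
    records: List[PauseRecord],
    now: datetime,
    factor: float = 3.0,                      # assumed threshold, tune for your farm
    slack: timedelta = timedelta(minutes=5),  # assumed allowance for timer latency
) -> List[Tuple[PauseRecord, timedelta]]:
    """Return (record, observed duration) for pauses far past their configured length."""
    flagged = []
    for r in records:
        # A pause that has not resumed yet is measured up to "now".
        end = r.resumed if r.resumed is not None else now
        observed = end - r.started
        if observed > r.configured * factor + slack:
            flagged.append((r, observed))
    return flagged


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    records = [
        # A 5-minute pause that took 42 minutes: flagged.
        PauseRecord("Approval", timedelta(minutes=5),
                    datetime(2024, 1, 1, 11, 10), datetime(2024, 1, 1, 11, 52)),
        # A 5-minute pause that took 7 minutes: within tolerance.
        PauseRecord("Notify", timedelta(minutes=5),
                    datetime(2024, 1, 1, 11, 40), datetime(2024, 1, 1, 11, 47)),
    ]
    for r, observed in overdue_pauses(records, now):
        print(r.workflow, observed)  # prints "Approval 0:42:00"
```

With these assumed defaults, a 5-minute pause is flagged only once it exceeds 20 minutes. If many different workflows show such overruns at the same time, that points at the server or the timer queue rather than at any single workflow.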
Thanks for the answer, Colin Evans.
Yes, we've checked the NintexWorkflowHistory list, but we only found the same information there:
- Event type: error
- Description: Workflow failed to run
Apart from that, there were only the IDs and dates of the items involved.
As you point out, we have sometimes seen the overload message while waiting for a workflow action to execute, but that has not always turned into a "failed to run/start" situation.
As for the Timer Job Queue, we'll ask the system administrators whether they can find anything there; we'll see...
We have found on a few occasions recently that when our IT department installs an upgrade or patch on SharePoint, typically one that doesn't appear to have anything to do with Nintex, the Nintex features start having performance issues.
Everything slows down. Small workflows might take 6 or 7 minutes to start (versus under 10 seconds normally) and larger workflows won't start at all. The Workflow and Forms editors also take ages to load, and publishing a workflow might take 6 or 7 minutes as well.
What often seems to clear the problem is disabling the Nintex Workflow features at the site collection level and then re-enabling them.
Unfortunately this becomes a pain in a large organisation where many separate sites use Nintex. Hence my request for an admin feature to cycle (stop and restart) Nintex across an entire farm.
I have found that when a search crawl is running, it can greatly affect the performance of Nintex workflows, especially when they are attempting to start. I had to create a separate indexing server to avoid this issue.
Yes, we've definitely seen this on our sites. The search indexer sometimes seems to "run amok" and use up all the available CPU and/or memory, until eventually workflows (even small ones) fail to start. Annoyingly, it seems to affect automatic starts (on create or on change) first. You can often see these workflows in the native SharePoint workflow interface, permanently stuck in the "Starting" state, but as far as Nintex is concerned they never existed, so they don't show up in the Nintex workflow status UI for the list item.
However, there's also a problem with Nintex becoming sluggish after (unrelated) site updates. A few weeks ago our ICT team installed a feature to update the theming/branding of our intranet sites, and since then I've had to stop and restart the Nintex WF features on every site collection I have with NF enabled.
Sorry for the late response.
No, we were not able to take any action ourselves to solve this, but everything is more or less working fine now.
I know the farm administrators' team put a lot of effort into it, but I don't know exactly what they did to get the workflows working again...
If you have a small site collection without too many subsites, you can disable and re-enable the NWF features in a few minutes. I have not seen any adverse effects from doing this, although in my case the sites are small enough that I can easily pick a time when no workflows are running.
If you have a large site collection with ~40 subsites, all of which have Nintex workflows implemented, then you have a more difficult problem. I would want confirmation from Nintex Support that cycling the NWF features is non-destructive before trying it on a large, active site collection.
If you do go through that discussion with support, please let us know the outcome.