andrew.beals@nintex.com

Tracking and Resolving Stuck Workflow Timer Jobs

Blog Post created by andrew.beals@nintex.com Support on Aug 25, 2015

Preface

 

I've been on a number of collaborative calls with Microsoft investigating paused/failing timer jobs and decided to share the process used to track a rogue workflow that is running amok.

 

If you are seeing workflow timer jobs across your entire farm fail one by one in succession then it's likely there is a particular workflow being rehydrated that is causing your workflow timer job to pause or fail. Once this timer job pauses/fails the workflow failover timer job will pick up this rogue workflow and execute it again on the next available timer job in the load balancer pool thus leaving you with no workflow timer jobs running on your farm.

 

PLEASE NOTE:

This guide should not be used to troubleshoot all issues with workflow timer jobs. Please review the following document first: Workflows not continuing - Timer Service and Delays

 

Additionally, a workflow timer job can pause, fail or run for a very long time if you have large amounts of workflow data. The following documentation should be addressed first before pointing the finger at a rogue workflow:

 

Defensive Workflow Design Part 1 - Workflow History Lists

Demystifying Workflow History (Part 1)

 

Large amounts of unmanaged workflow data generally make up for 90% of the failures we see with pausing or long running workflow timer jobs. This is due to how SharePoint writes/queries information to/from SQL when involving large SQL tables. (e.g. the larger the table, the longer it will take to query).

 

Now, after taking all that in to consideration, let's move on to the guide!

 

Gathering information

 

1) Set ULS logging levels on “Nintex Workflow 2010” and “Workflow Infrastructure” to Verbose.

image001.gif

2) Run the ‘New-SPLogfile’ (Technet: New-SPLogFile) cmdlet in a SharePoint Management Shell window afterwards to start a new log file using the newly configured settings (Verbose).

 

3) Wait until the issue (hung timer job) occurs again, give it about 10-15 minutes after the first job hangs. Then gather logs from each of the servers, or have the customer merge the logs using the ‘Merge-SPLogfile' (Technet: Merge-SPLogFile) command.

 

4) Once you have the logs, you will need to use ULS viewer (Download ULS Viewer from Official Microsoft Download Center) to view and filter them. Open ULS viewer and import each log file and applying a filter to each as shown below:

 

image002.gif

After the filter is applied, you will see one of two things;

 

1) You will see a workflow compile, start and never complete (as shown below):

image003.gif

   This is a good indication that this is the workflow that is causing your timer job to fail. Since the telemetry scope    never closes we can see that this is the last workflow being compiled and most likely the culprit.

 

2) You will see workflows compile and run. When finished, you will see a “closing workflow telemetry” message    indicating the job has completed. (as shown below)

image004.gif

 

If you find the first option, this is a good indication that this is the hung workflow timer job. The workflow that compiled right before the hang is likely the workflow that you should be focusing efforts on.

 

Identifying the workflow

 

If you are not familiar with the particular workflow name found in the logs or you don't know where to find this workflow, you can use the following process to gather more information.

 

First you will need to use the workflow instance ID found inside of the ULS logs:

 

Next insert the workflow instance ID into this SQL query and run it:

USE NW2010DB --Update to the name of your Nintex Content Database.

SELECT * FROM dbo.WorkflowInstance

WHERE WorkflowInstanceID = '408c7d6a-b919-4c84-b7b8-e410e6b3d2d3' --Change Workflow Instance ID to track additional workflows

This query will provide you with the workflow name, site id, list id, (sub) webid, item id, and other identifying information so that you can find the specific instance of the workflow. Your results should look like the following:

image006.gif

 

Tracking down the workflow

 

Now that you have something to work with you can use the below PowerShell Scripts/Commands to gather information necessary for tracking down the workflow if the customer isn’t sure exactly where this workflow exists.

 

Find all Site Collection IDs:

Add-PSSnapin Microsoft.Sharepoint.Powershell

Get-SPSite | SELECT URL, ID, RootWeb

Example output:

image007.gif

 

Find all Web ID's (sites/subsites) in a site collection:

Add-PSSnapin Microsoft.Sharepoint.Powershell
$site = Get-SPSite http://yoursitecollection
$site.AllWebs | select ID, URL, Title

Example output:

 

Find all ListIDs in a sub-site:

Add-PSSnapin Microsoft.Sharepoint.Powershell
$site = Get-SPSite http://YourWebUrl (change me!)
$web = $site.OpenWeb("subsitename") #<---- Add the subsite title here
write-host "Site: " + $site.id
write-host "Web: " + $web.id
$web.lists | SELECT Title, id

Example output:

image008.gif

 

Terminating the workflow Instance

 

Once you know which workflow you are dealing with, there are a couple of options to terminate the instance to restore your workflow timer jobs from a pausing state.

 

1) Terminate the workflow using Microsoft supported methods: Cancel or terminate a workflow instance

2) Using a web service call using the Nintex Workflow web service by passing the workflow instance ID. Endpoint: (http://SiteCollectionUrl/_vti_bin/nintexworkflow/workflow.asmx) Method: (TerminateWorkflow)

3) Using the following PowerShell script:

 

#Terminating specific workflow instances with PowerShell.

#For the problematic workflow instance IDs, you can now use them to begin cleaning things up.

#This is especially useful for runaway workflows generating lots of activity in the dbo.workflowprogress table.

#Open the SharePoint Management Console and for the PowerShell commands below, replace references like <Enter SiteID here> with the GUIDs in inverted commas from the above results. Eg: “61543AD-7562-TFR43D-76T5-7645535”

 

#using SiteID column

$site =Get-SPSite <Enter SiteID here>;

$siteurl = $site.url

 

#Site URL

$web = Get-SPWeb $siteurl;

$web.AllowUnsafeUpdates = $true;

 

#Get List by guid using listID from DB

$ListID = <Enter ListID GUID here>;

$list = $web.Lists.GetList($ListID,$true);

 

#use the ItemID from DB ..Note: in Powershell its zero based so itemID minus 1

$items = $list.Items; $myItemid = <Enter ItemID here>;

$item = $items[$myItemid - 1]; ## as per the table where the WF ran on, you can display $item.Title to make sure that you selected the right correct Item

 

#Get this Item WorkFlows collection

$WorkflowCollection = $item.Workflows;

 

#Get WF instance ID from DB

$InstanceWFId = <Insert WorkflowInstance ID GUID here>;

$wf = $WorkflowCollection | where { $_.InstanceId -eq $InstanceWFId}

 

#Cancel Workflows

[Microsoft.SharePoint.Workflow.SPWorkflowManager]::CancelWorkflow($wf);

$web.Dispose();

 

The above script is also attached to this article. Just be sure to change the file extension to .ps1 from .txt and follow the commenting inside the script to configure the script correctly.

 

Cheers,

Andrew Beals

Senior Service Engineer

Outcomes