Tracking and Resolving Stuck Workflow Timer Jobs

  • 24 March 2023
  • 0 replies
  • 850 views

Userlevel 5
Badge +20

Introduction

I've been on several collaborative calls with Microsoft investigating paused/failing timer jobs and decided to share the process used to track a rogue workflow that is running amok. If you are seeing workflow timer jobs across your entire farm fail one by one in succession, then it's likely there is a particular workflow being rehydrated that is causing your workflow timer job to pause or fail. Once this timer job pauses/fails, the workflow failover timer job will pick up this rogue workflow and execute it again on the next available timer job in the load balancer pool, thus leaving you with no workflow timer jobs running on your farm.

 

NOTE: This guide should not be used to troubleshoot all issues with workflow timer jobs. Please review the following document first: Workflows not continuing - Timer Service and Delays

 

Additionally, a workflow timer job can pause, fail, or run for a long time if you have large amounts of workflow data. The following documentation should be addressed first before pointing the finger at a rogue workflow:

Large amounts of unmanaged workflow data generally make up for 90% of the failures we see with pausing or long-running workflow timer jobs. This is due to how SharePoint writes/queries information to/from SQL when involving large SQL tables. (e.g., the larger the table, the longer it will take to query). Now, after taking all that into consideration, let's move on to the guide!

 

Gather information

  1. Set ULS logging levels on Nintex Workflow 2010 and Workflow Infrastructure to Verbose.

     

  2. Run the ‘New-SPLogfile’ (Technet: New-SPLogFile) cmdlet in a SharePoint Management Shell window afterward to start a new log file using the newly configured settings (Verbose).
  3. Wait until the issue (hung timer job) occurs again. Give it about 10-15 minutes after the first job hangs. Then, gather logs from each server or have the customer merge the logs using the ‘Merge-SPLogfile' (Technet: Merge-SPLogFile) command.
  4. Once you have the logs, you must use ULS Viewer (Download ULS Viewer from the Official Microsoft Download Center) to view and filter them. Open ULS Viewer, import each log file, and apply a filter to each, as shown below:

    After the filter is applied, you will see one of two things:

  • You will see a workflow compile, start, and never complete (as shown below):
     

    This is a good indication that this is the workflow that is causing your timer job to fail. Since the telemetry scope never closes, we can see that this is the last workflow being compiled and is most likely the culprit.

  • You will see workflows compile and run. When finished, you will see a closing workflow telemetry message indicating the job has been completed. (as shown below):
     

    If you find the first option, this indicates that this is the hung workflow timer job. The workflow compiled right before the hang is likely the one you should focus on.

Identify the workflow

If you are unfamiliar with the particular workflow name found in the logs or don't know where to find this workflow, you can use the following process to gather more information.

  1. First, you will need to use the workflow instance ID found inside the ULS logs, as shown below:

     

  2. Next, insert the workflow instance ID into the following SQL query and run it:

    USE NW2010DB --Update to the name of your Nintex Content Database. SELECT * FROM dbo.WorkflowInstance WHERE WorkflowInstanceID = '408c7d6a-b919-4c84-b7b8-e410e6b3d2d3' --Change Workflow Instance ID to track additional workflows

    This query will provide you with the workflow name, SiteID, ListID, ItemID, WebID,  and other identifying information to find the specific workflow instance. Your results should look like the following:

     

 

Locate the workflow

Now that you have something to work with, you can use the PowerShell Scripts/Commands below to gather information necessary for tracking down the workflow if the customer isn’t sure exactly where this workflow exists.

Find all Site Collection IDs
 

Add-PSSnapin Microsoft.Sharepoint.Powershell Get-SPSite | SELECT URL, ID, RootWeb


Example output:

 

Find all WebIDs (sites/subsites) in a site collection
 

Add-PSSnapin Microsoft.Sharepoint.Powershell $site = Get-SPSite http://yoursitecollection $site.AllWebs | select ID, URL, Title


Example output:

 

Find all ListIDs in a sub-site
 

Add-PSSnapin Microsoft.Sharepoint.Powershell $site = Get-SPSite http://YourWebUrl (change me!) $web = $site.OpenWeb("subsitename") #<---- Add the subsite title here write-host "Site: " + $site.id write-host "Web: " + $web.id $web.lists | SELECT Title, id


Example output:

Terminate the workflow Instance

Once you know which workflow you are dealing with, there are a couple of options to terminate the instance to restore your workflow timer jobs from a pausing state.

  1. Terminate the workflow using Microsoft-supported methods: Cancel or terminate a workflow instance
  2. Use a web service call using the Nintex Workflow web service by passing the workflow instance ID. Endpoint: (http://SiteCollectionUrl/_vti_bin/nintexworkflow/workflow.asmx) Method: (TerminateWorkflow)
  3. Use the following PowerShell script:

    #Terminating specific workflow instances with PowerShell.

    #For the problematic workflow instance IDs, you can now use them to begin cleaning things up.

    #This is especially useful for runaway workflows generating lots of activity in the dbo.workflowprogress table.

    #Open the SharePoint Management Console and for the PowerShell commands below, replace references like <Enter SiteID here> with the GUIDs in inverted commas from the above results. Eg: “61543AD-7562-TFR43D-76T5-7645535”

     

    #using SiteID column

    $site =Get-SPSite <Enter SiteID here>;

    $siteurl = $site.url

     

    #Site URL

    $web = Get-SPWeb $siteurl;

    $web.AllowUnsafeUpdates = $true;

     

    #Get List by guid using listID from DB

    $ListID = <Enter ListID GUID here>;

    $list = $web.Lists.GetList($ListID,$true);

     

    #use the ItemID from DB ..Note: in Powershell its zero based so itemID minus 1

    $items = $list.Items; $myItemid = <Enter ItemID here>;

    $item = $items[$myItemid - 1]; ## as per the table where the WF ran on, you can display $item.Title to make sure that you selected the right correct Item

     

    #Get this Item WorkFlows collection

    $WorkflowCollection = $item.Workflows;

     

    #Get WF instance ID from DB

    $InstanceWFId = <Insert WorkflowInstance ID GUID here>;

    $wf = $WorkflowCollection | where { $_.InstanceId -eq $InstanceWFId}

     

    #Cancel Workflows

    [Microsoft.SharePoint.Workflow.SPWorkflowManager]::CancelWorkflow($wf);

    $web.Dispose();


    The above script is also attached to this article. Just be sure to change the file extension to .ps1 from .txt and follow the commenting inside the script to configure the script correctly.

0 replies

Be the first to reply!

Reply