Symptoms
Most applications used in Microsoft Active Directory networks depend to some degree on the availability of Active Directory Domain Services (AD DS), both for authentication and for obtaining data about AD DS objects (users, groups, etc.) from the directory.
If your AD DS infrastructure suffers failures or other availability issues, you may observe symptoms on the K2 side similar to those described below.
Example scenario 1:
You may observe a growing queue of AD SmartObject (SmO) queries on the IIS side, up to the point at which all queries sent from K2 smartforms to AD DS fail or no longer return any information, and after a very long delay the following error is thrown:
A referral was returned from the server.
This error comes from AD DS (more specifically, from the DC that serves K2 application/K2 server queries to the specific domain) and is most likely caused by the fact that no DC for that domain is available to serve the query at all.
Example scenario 2:
You receive the following error on the K2 server:
System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: "DC-XYZ.domain.com" ---> System.Runtime.InteropServices.COMException: The server is not operational.
You have confirmed that the DC mentioned in the error message is down, but other DCs in this domain are up and running.
The two scenarios are different, but in both cases it may seem that one specific DC failed and K2 did not switch to an alternative available DC for the domain. The key question is how to switch to another available DC without any downtime, or with minimal downtime (no K2 server or K2 service restart).
Diagnosis
First of all, it is necessary to understand how K2 depends on AD DS. The most obvious dependencies are the AD Service SmOs and the User Role Manager (URM) service. Both depend on AD DS availability, but in different ways. The AD Service queries AD DS directly (so it is a good test of whether AD DS queries can be served without issues), whereas the URM service relies on the K2 identity cache and returns cached data from the K2 database. The URM service depends on AD DS only when the cached data is refreshed, so you may not notice an AD DS failure as long as the identity cache has not expired.
Next, we can distinguish two major AD DS failure scenarios:
1) WAN link failure, when no DCs are available to serve K2 requests to a specific domain because all of its DCs are behind the WAN link. This applies to multi-domain environments with remote domains.
2) Failure of the specific DC to which the K2 server is connected for querying a specific domain.
Given AD DS design best practices, neither of these scenarios should present a problem for applications that depend on AD DS:
(1) It is best practice to place extra DCs on remote sites so that there is no dependency on the WAN link, because of both bandwidth and availability concerns. At the very least, an RODC should be present locally on site for any remote domain if for some reason you cannot place an RWDC locally on each remote site.
Note: in a link failure scenario where there are no locally available DCs, nothing can be done from the K2 side. It is a matter of restoring the WAN link or placing a locally available RWDC/RODC to mitigate this scenario.
(2) The golden rule and requirement for any production AD DS deployment is to have at least two DCs per domain, so the failure of one domain controller should not present any issues.
Consequently, the first thing to do when you see any of the issues described in the Symptoms section of this article is to work with your AD DS team to clarify which specific issue you have and whether it has been addressed. There is no point in attempting to fix things from the K2 side while the AD DS issue remains unaddressed. The only possible workaround is to temporarily remove the connection string for the affected domain, if you can afford this (that is, if the domain with the issue is one of the less important ones).
Your AD DS support team may confirm that they have an issue with one specific DC, which has failed or is down for maintenance (the latter should be a very rare, exceptional case of planned maintenance during business hours), and that other locally available DCs can serve requests from the K2 server. If this is the case, you can try the following:
1) Use the AD Service SmO to check that you can query the affected domain. If it works, you should not have any issues in K2; if not, proceed with further checks.
2) Use the following command to verify which DC is currently being used by the K2 server for the specific domain:
nltest /dsgetdc:DomainName
If this command returns the failed DC, then this is an issue with your DC locator service/AD DS infrastructure or, to put it another way, a problem external to K2.
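The check in step 2 can be scripted. The following is a minimal Python sketch, assuming that nltest prints the selected controller on a line of the form "DC: \\name" (verify this against the output in your own environment). The helper names parse_dc_name and is_failed_dc are illustrative only and are not part of K2 or nltest.

```python
import re
from typing import Optional

# Assumed layout: nltest prints the chosen controller as "DC: \\host.domain".
_DC_LINE = re.compile(r"^\s*DC:\s*\\\\(\S+)", re.MULTILINE)


def parse_dc_name(nltest_output: str) -> Optional[str]:
    """Return the DC host name found in nltest output, or None if absent."""
    match = _DC_LINE.search(nltest_output)
    return match.group(1) if match else None


def is_failed_dc(nltest_output: str, failed_dc: str) -> bool:
    """True if the locator still points at the DC known to be down."""
    current = parse_dc_name(nltest_output)
    return current is not None and current.lower() == failed_dc.lower()


# On the K2 server the text could be captured with, for example:
#   subprocess.run(["nltest", "/dsgetdc:DomainName"],
#                  capture_output=True, text=True).stdout
```

If is_failed_dc returns True for the DC named in the error message, the locator has not yet moved away from it and the steps below apply.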
In general, AD DS is a very stable technology with decades of evolution and a high adoption rate, and there are no well-known cases where the DC locator fails to switch to an alternative available DC. However, depending on the configuration and issues of a specific environment, as well as on the implementation of the application code that interacts with AD DS, there can be cases where DC locator switching does not work properly.
3) If step 2 returns a failed/unavailable DC, try the following command:
nltest /dsgetdc:DomainName /force
This forces a DC locator cache refresh and may help you switch to another DC. Note that it is sometimes necessary to run this command a few times before another DC is selected.
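Because the forced refresh may need several runs, the retry can be automated. Below is a minimal Python sketch under stated assumptions: it is run on the K2 server where nltest is available, the output contains a "DC: \\name" line, and the names query_dc_forced and force_switch, the attempt count and the delay are illustrative choices, not K2 or Microsoft recommendations.

```python
import re
import subprocess
import time
from typing import Callable, Optional


def query_dc_forced(domain: str) -> Optional[str]:
    """Run `nltest /dsgetdc:<domain> /force` and return the DC name, if any."""
    result = subprocess.run(
        ["nltest", f"/dsgetdc:{domain}", "/force"],
        capture_output=True, text=True, check=False,
    )
    # Assumed output layout: a line such as "DC: \\host.domain".
    match = re.search(r"^\s*DC:\s*\\\\(\S+)", result.stdout, re.MULTILINE)
    return match.group(1) if match else None


def force_switch(domain: str, failed_dc: str, attempts: int = 5,
                 delay_seconds: float = 2.0,
                 query: Callable[[str], Optional[str]] = query_dc_forced,
                 ) -> Optional[str]:
    """Retry forced rediscovery until a DC other than failed_dc is returned.

    Returns the new DC name, or None if every attempt still yields the
    failed DC (or no DC at all).
    """
    for attempt in range(attempts):
        dc = query(domain)
        if dc and dc.lower() != failed_dc.lower():
            return dc
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return None
```

A call such as force_switch("domain.com", "DC-XYZ.domain.com") returns the name of the newly selected DC, or None if the locator never moved away from the failed one, in which case continue with the next step.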
4) If step 3 does not switch you to another available DC, you may try restarting the Netlogon service, because the DC locator cache is implemented as part of this service. Here is an example of how to do it with PowerShell:
Get-Service netlogon | Restart-Service
nltest.exe /sc_verify:<fully.qualified.domain.name.here>
Once this is done, verify whether you have been switched to an available DC by using the following command:
nltest /dsgetdc:DomainName
5) If the K2 AD Service SmOs still do not work after the DC locator has switched to an available DC, consider restarting the K2 service or rebooting the server.
Note that the only valid test here is using the AD Service SmOs to query the domain. If that works, nothing else needs to be done from the K2 side. If you see issues in areas that depend on the URM User service, it may simply be that the cached data has expired and new data is still being built up. Sometimes it may be necessary to force an identity cache refresh and wait until the cache builds up completely (this can take a very long time in large-scale production environments).
Resolution
See the Diagnosis section. Additional information:
K2 performs the bind with the DirectoryEntry class, e.g.:
new DirectoryEntry("LDAP://DC=Domain,DC=COM", "", "", AuthenticationTypes.ReadOnly)
This process relies on the Domain Controller Locator, which is an algorithm that runs in the context of the Net Logon service. Essentially, the Domain Controller Locator is the client-side part of AD DS that is responsible for selecting a specific DC for a specific domain. The Domain Controller Locator has its own cache: the Net Logon service caches the domain controller information so that it is not necessary to repeat the discovery process for subsequent requests. Caching this information encourages the consistent use of the same domain controller and, thus, a consistent view of Active Directory.
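To illustrate why the locator is involved: the path in the bind above names only the domain, not a server (a serverless bind), so the choice of DC is left to the Domain Controller Locator. The following minimal Python sketch shows how such a path can be derived from a DNS domain name. The function name serverless_path is hypothetical and is not part of K2 or System.DirectoryServices.

```python
def serverless_path(dns_domain: str, provider: str = "LDAP") -> str:
    """Build an ADSI path that names only the domain, never a specific DC."""
    # Split "domain.com" into its labels, ignoring empty ones.
    labels = [label for label in dns_domain.strip(".").split(".") if label]
    if not labels:
        raise ValueError("dns_domain must contain at least one label")
    # Each DNS label becomes one DC= component of the distinguished name.
    return f"{provider}://" + ",".join(f"DC={label}" for label in labels)
```

For example, serverless_path("domain.com") produces LDAP://DC=domain,DC=com. Because no host name appears in the path, a stale locator cache entry determines which DC actually receives the request.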
Refer to the Microsoft documentation for details:
Domain Controller Location Process
https://technet.microsoft.com/en-us/library/cc978011.aspx
Domain Controller Locator
https://technet.microsoft.com/en-us/library/cc961830.aspx
Additional recommendations:
1) Reconfigure K2 to use GC instead of LDAP.
The global catalog is a distributed data repository that contains a searchable, partial representation of every object in every domain in a multidomain Active Directory Domain Services (AD DS) forest. Essentially, a GC placed in the local domain can serve some of the queries that would otherwise go to DCs in another domain, potentially over a WAN link.
From a purely AD DS perspective, the GC has the following benefits:
- Forest-wide searches. The global catalog provides a resource for searching an AD DS forest.
- User logon. In a forest that has more than one domain, the GC is used during logon for universal group membership enumeration (Windows 2000 native domain functional level or higher) and for resolving the UPN when a UPN is used at logon.
- Universal Group Membership Caching. In a forest that has more than one domain, in sites that have domain users but no global catalog server, Universal Group Membership Caching can be used to enable caching of logon credentials so that the global catalog does not have to be contacted for subsequent user logons. This feature eliminates the need to retrieve universal group memberships across a WAN link from a global catalog server in a different site. Essentially, you can enable this feature to make use of the GC even more efficient.
To reconfigure K2 to use the GC, edit the RoleInit XML field of the HostServer.SecurityLabel table, replacing "LDAP://" with "GC://", and then restart the K2 service.
From the K2 perspective, this should improve the responsiveness of AD SmartObjects and slightly decrease reliance on the WAN link and the number of queries to DCs outside of the local domain.
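The text substitution itself can be sketched in Python as shown below. This only illustrates the replacement, under the assumption that the RoleInit value has already been read into a string. How the value is read from and written back to HostServer.SecurityLabel is not shown, and the function name switch_to_gc is hypothetical. Keep a copy of the original value before changing it.

```python
from typing import Tuple


def switch_to_gc(role_init_xml: str) -> Tuple[str, int]:
    """Replace each "LDAP://" prefix with "GC://".

    Returns the updated text and the number of replacements made, so the
    caller can confirm that something actually changed.
    """
    count = role_init_xml.count("LDAP://")
    return role_init_xml.replace("LDAP://", "GC://"), count
```

A returned count of 0 means the value contained no "LDAP://" prefix (for example, because it was already switched), in which case nothing needs to be written back.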
2) For example scenario 2, try forcing a DC locator cache refresh (see the details above: nltest /dsgetdc:DomainName /force) and verify whether it is a viable workaround. Use "nltest /dsgetdc:DomainName" to confirm which specific DC is being used by the K2 server, and verify the status and availability of that DC with your infrastructure team.
3) There is also an existing feature request to investigate the possibility of building DC failure detection/switching capabilities into the K2 code in future versions of our product (TFS 612867).