Symptoms
We are investigating an issue that was discovered a few weeks ago during UAT and need further insight or help from K2. When a user clicks the green refresh button on the Worklist (NOT the browser refresh button), the Worklist times out with no visible error. The only thing the user sees is the spinning refresh icon in the middle of the screen, and it never goes away.
Steps to reproduce:
- Open the Worklist
- Don't touch anything on this page for at least 15-20 minutes
- Press the green refresh button on the Worklist
Our users want to set up a custom filter on the Worklist so they can limit the tasks they see. Because refreshing the entire browser would reset that filter, they press the green refresh button on the Worklist itself. If they press it after the page has been idle for 15-20 minutes, however, the issue occurs and they cannot continue.
We have done some testing and found the following:
- I turned on network capture in the IE developer tools and reproduced the issue. When the green refresh button is clicked, the POST request returns a 500 (Internal Server Error) response, but the response body does not indicate the cause. I am attaching this capture to the ticket.
- Our environment expert has also done some testing and research and concluded that the issue is related to HTTPS caching. According to him, the Worklist is cached when it first loads; once that cache expires, the refresh button fails because the Worklist cannot request the data from the server again over HTTPS.
We can reproduce this in our Staging and Production environments, but not in my development environment. Staging and Production are on our production domain, so they may have extra security. They are also distributed environments (two application servers running the K2 service and one terminal server).
Diagnoses
- With no changes made, I turned on IE network capture and loaded the Worklist. Fifteen minutes after the page loads, the client sends a keep-alive to the K2 server. The keep-alive fails with a 500 server error (again with no details in the response body). After that, any attempt to refresh the page using the green refresh button causes the issue.
- I bypassed the load balancer using the hosts file, loaded the Worklist again, and turned on IE network capture. After 15 minutes, the keep-alive was sent and acknowledged successfully, and every 15 minutes after that another keep-alive succeeded. I could refresh the page at any time without issue.
- Our environment expert turned one K2 server off and disconnected it from the load balancer. Traffic was still routed through the load balancer, but the keep-alives now succeeded and I could not reproduce the issue. When he turned the second K2 server back on and reconnected it to the load balancer, the issue returned.
So the issue appears to be caused by the load balancer distributing traffic between the Worklist and K2 across the two K2 servers.
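The test results above fit a common load-balancing failure pattern, sketched below as a toy simulation. It assumes (this is not confirmed by K2) that the Worklist's server-side session is held in memory on the K2 server that served the initial page load, and that the load balancer sends each request to the next server with no persistence. `K2Server`, `RoundRobinBalancer`, and `run` are hypothetical names used only for illustration; none of them is part of any K2 or F5 API.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class K2Server:
    """Hypothetical stand-in for one K2 application server.

    Assumption: session state lives in memory on the server that created it
    and is not shared with the other server.
    """
    name: str
    sessions: set = field(default_factory=set)

    def handle(self, session_id: str, request: str) -> int:
        if request == "load":
            # The initial page load creates the session on this server only.
            self.sessions.add(session_id)
            return 200
        # Keep-alive or refresh: succeeds only where the session exists.
        return 200 if session_id in self.sessions else 500


class RoundRobinBalancer:
    """Sends each request to the next server in turn, ignoring the client."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, session_id: str, request: str) -> int:
        return next(self._cycle).handle(session_id, request)


def run(servers):
    """One client loads the Worklist, then sends three keep-alives."""
    lb = RoundRobinBalancer(servers)
    requests = ("load", "keepalive", "keepalive", "keepalive")
    return [lb.route("user-1", r) for r in requests]


print(run([K2Server("A")]))                 # one server behind the balancer
print(run([K2Server("A"), K2Server("B")]))  # two servers behind the balancer
```

With one server, every keep-alive returns 200 (`[200, 200, 200, 200]`), matching the single-server test. With two servers, the first keep-alive lands on the server that never saw the page load and returns 500 (`[200, 500, 200, 500]`), matching the failure in the two-server configuration.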
Resolution
The F5 load balancer environment was completely redesigned. The certificate was moved from the K2 servers to the F5 load balancer, and we switched the configuration profile from HTTP to HTTPS. As a result, a client's connection stays on the same K2 server for as long as it is open, and because the certificate now lives on the F5 load balancer, the certificate check can be performed there as well.
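The effect of the fix, keeping a client on the same K2 server, can be modelled with a toy "sticky" balancer. This is a minimal sketch, not the F5 configuration. It assumes server-side session state is held in memory per K2 server, and it keys persistence on a simple client identifier; the real F5 profile may key on the connection, a cookie, or the source address instead. `K2Server` and `StickyBalancer` are hypothetical names used only for illustration.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class K2Server:
    """Hypothetical K2 server; assumes session state is kept in memory per server."""
    name: str
    sessions: set = field(default_factory=set)

    def handle(self, client_id: str, request: str) -> int:
        if request == "load":
            self.sessions.add(client_id)
            return 200
        return 200 if client_id in self.sessions else 500


class StickyBalancer:
    """Picks a server round-robin for a client's first request, then keeps
    sending that client's requests to the same server (persistence)."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
        self._pinned = {}  # client_id -> K2Server chosen on first contact

    def route(self, client_id: str, request: str) -> int:
        server = self._pinned.get(client_id)
        if server is None:
            # Only advance the round-robin cycle for clients not yet pinned.
            server = next(self._cycle)
            self._pinned[client_id] = server
        return server.handle(client_id, request)


a, b = K2Server("A"), K2Server("B")
lb = StickyBalancer([a, b])
requests = ("load", "keepalive", "keepalive", "keepalive")
user1 = [lb.route("user-1", r) for r in requests]
user2 = [lb.route("user-2", r) for r in requests]
print(user1, user2)
```

Both clients now get 200 for every keep-alive: `user-1` is pinned to server A and `user-2` to server B, so load is still spread across both servers while no client's requests switch servers mid-session.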