This is a post about a couple of things I found at a customer site regarding PVS retries, and why I came to the conclusion that they do not matter as much as most people think. There is more you need to look at.
First, some background information about the environment:
- XenServer based. The pool hosting the XenApp VMs was running 6.1 and was upgraded to 6.5 SP1.
- Two PVS Servers, virtual, with 16GB RAM each.
- Virtual File Server hosting the vDisk on a CIFS share. 16GB RAM.
We noticed performance started to degrade once we upgraded the pool to XenServer 6.5 SP1. For some reason users would complain about their sessions freezing for a while. After some investigation, I found the PVS retries to be on the higher side. The first problem was deciding what should be considered high. Some servers did show 500 or more retries, but over a period of eight to ten hours. That works out to around 1 retry per minute, which could not be the culprit. Plenty of network devices retransmit data every single minute, at much worse rates, and no one notices, even when real-time audio and video are involved.
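As a quick sanity check on that estimate, here is a minimal Python sketch of the arithmetic. The function name is just something I made up for illustration; the 500 retries and the eight to ten hour window are the figures mentioned above.

```python
def retries_per_minute(retries: int, hours: float) -> float:
    """Average retry rate over an observation window given in hours."""
    return retries / (hours * 60)


# 500 retries observed over an eight to ten hour window.
for hours in (8, 10):
    rate = retries_per_minute(500, hours)
    print(f"{hours} h window: {rate:.2f} retries per minute")
```

Either way the rate stays at roughly one retry per minute, which is why the raw counter alone did not look alarming.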
That is when I decided to take a look at the servers themselves. A quick search of the Event Log under ‘System’ showed the following for ‘bnistack’:
[MIoWorkerThread] Too many retries Initiate reconnect.
And right after:
[IosReconnectHA] HA Reconnect in progress.
So what is happening here? After a certain number of retries (I have yet to find out what that number is), the PVS target device driver on the XenApp server automatically triggers a reconnection to another PVS server, which we know is not instant. It takes a couple of seconds, and that is exactly why users would experience their sessions ‘freezing’. After asking the users to write down the time when this happened, we could clearly see that it was exactly when the reconnection process was triggered.
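The matching of user-reported times against reconnect events can be sketched roughly as below. This is only an illustration of the idea, not the tooling we used: the timestamps are made-up sample data, and the two-minute tolerance is an assumption to allow for users writing down approximate times.

```python
from datetime import datetime, timedelta


def match_freezes(reconnects, freezes, window=timedelta(minutes=2)):
    """Pair each reported freeze with the first HA reconnect within `window` of it."""
    matches = []
    for freeze in freezes:
        for reconnect in reconnects:
            if abs(freeze - reconnect) <= window:
                matches.append((freeze, reconnect))
                break  # one matching reconnect per freeze is enough
    return matches


# Hypothetical sample data: times of "HA Reconnect in progress" events
# and times users wrote down for their frozen sessions.
reconnects = [datetime(2015, 1, 1, 9, 14, 30), datetime(2015, 1, 1, 13, 2, 10)]
freezes = [
    datetime(2015, 1, 1, 9, 15),
    datetime(2015, 1, 1, 11, 40),
    datetime(2015, 1, 1, 13, 2),
]

matched = match_freezes(reconnects, freezes)
print(f"{len(matched)} of {len(freezes)} reported freezes line up with an HA reconnect")
```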
If you right-click a vDisk in your PVS store and select ‘Show usage’, it shows the retries but, more importantly, it shows which server each device is connected to. That is when I started monitoring whether the connection would change during the day and bingo: whenever it changed, users would complain.
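If you note down the device-to-server mapping at two points in the day, spotting the devices that flipped is a simple comparison. The sketch below assumes you have copied the mapping by hand into dictionaries; the device and server names are hypothetical, and nothing here queries PVS itself.

```python
def moved_devices(before: dict, after: dict) -> dict:
    """Return {device: (old_server, new_server)} for devices whose PVS server changed."""
    return {
        device: (before[device], after[device])
        for device in before.keys() & after.keys()
        if before[device] != after[device]
    }


# Hypothetical snapshots taken from the 'Show usage' window.
morning = {"XA01": "PVS01", "XA02": "PVS01", "XA03": "PVS02"}
afternoon = {"XA01": "PVS01", "XA02": "PVS02", "XA03": "PVS02"}

moved = moved_devices(morning, afternoon)
for device, (old, new) in sorted(moved.items()):
    print(f"{device} moved from {old} to {new}")
```

Any device that shows up in the result went through at least one HA reconnect between the two snapshots.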
Now what? We knew what the issue was, but why was this happening?
We started thinking about the XenServer 6.5 SP1 upgrade, as that was the only thing that had changed. Our PVS image had only a few versions (three, I think): the base one had the XenServer 6.1 tools and the latest one had the 6.5 SP1 tools. That is when I decided to merge all versions (again, just a few, under the default threshold). Once I did that, retries dropped dramatically to under 20 per day for 90% of the devices. Even the remaining ones fell to under 50 a day. Much better, and no more HA reconnections.
The lesson learned here: if your base image has one version of the XenServer tools and a different version exists in one of the PVS image versions, you had better merge everything right after the upgrade is done.
The other really odd thing happened once I merged the image. I brought it back to the XenServer host as a new VM (so you can easily update the PVS Target Device to a newer one), and when I tried to start it, I got a blue screen. Once more, thinking the upgrade could have caused the problem, I decided to get the VM UUID and change its device ID using xe vm-param-set uuid= platform:device_id=0002. That fixed the BSOD.
I am still not sure why having different XenServer tools in different versions would cause much higher retries, but I know for sure that merging fixed all of it.
To summarize: PVS retries are something you do need to monitor, but just looking at the numbers may not tell you anything (unless you are seeing several retries per second). Also keep in mind it is all UDP based… What really matters is HA kicking in and flipping the PVS server the target is connected to. That is what causes the famous hangs and freezes on the devices.
And yes, ideally always merge your images after a major change such as a hypervisor tools upgrade. 🙂