Tales from the Trenches: the Case of the Missing Server.


As most of you know, even though I run WTSLabs, I also spend quite a lot of time doing consulting work across the globe, on all kinds of projects, from major App-V deployments to pure RDS Session Host setups. That has been the case for several years, and thanks to that I have been blessed to see all sorts of great and terrible things out there. So I decided to start a series of posts called “Tales from the Trenches” where not only I, but other great names in the industry, will share our best stories so we can all learn and realize there are indeed crazy people out there doing all sorts of unbelievable stuff.

So, starting the series: this week, at one of my customers, very strange things started to happen with their XenApp 5 environment. Regardless of how it was architected and deployed (by the way, probably one of the worst environments I have seen in a LONG time, where pretty much every single worst practice out there was followed), the reasons for and the outcome of this whole thing are worth a post.

A couple of weeks ago a maintenance window was scheduled due to some work on their electrical systems (generators, transformers, etc.) and something went wrong. Really wrong. As far as I know one person got injured (or killed, I do not remember – seriously) and power went out completely. No generators, nada. All gone.

This brought the whole thing down for a while, including all the Citrix servers. When power was restored, one of the six XenApp boxes (all Dell servers) had toasted hard drives and would not boot at all. They could access it remotely through the DRAC and it was indeed gone. So they let me know we had lost a Citrix server.

As I was away for the week after the power outage, I told them I would check when I got back. To my surprise, the farm was reporting the box as up and running and serving users. I checked my email for any alerts from Resource Manager (yes, I once set it up for that, which they never did – please do not even start asking why EdgeShite is not there…) expecting to see a server unreachable message but no, nothing, nada.

So I RDPed to that server's IP address and indeed I got a session, and it WAS for sure a Citrix box, with the proper name and IP address, and part of that farm. The funny thing, once I started digging, was that this was no Dell server but an HP box…

At the same time most users started complaining that their Outlook signatures had reverted to what they were eight or nine months ago, along with some other very odd things…

After further investigation, here is what happened… Back in July 2010, someone had set up a server for testing. As we had five boxes in the farm at the time, he created this sixth one and named it using the proper naming convention, just increasing the number at the end of the name, so it became whatever-6. He also gave it a proper IP address and made the server part of the farm. Once he was done with his testing (which included allowing all users to use the server for a couple of weeks) he simply shut it down, never removing it from the farm.

Later the need for a sixth server came up, and a new Dell box was set up and given the EXACT name and IP as the now powered-off HP one. When the power outage happened three to four weeks ago, the guys at the data center powered on all servers that were off. As Dell #6 had a disk failure it did not boot, but the HP one did, and guess what? It started serving users immediately. But as they keep the cached profiles on the servers, users started to get mixed things (meaning profiles started to get fucked up big time) thanks to nine-month-old cached copies and the fact that roaming profiles are not the most intelligent things in the world.

Thanks to the great documentation and procedures in place, no one knew or remembered about the HP server that was hiding somewhere in a rack. And of course, because profiles were not properly handled with a decent and robust solution, hundreds of users got screwed up big time.
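
For what it is worth, even a trivial check against a machine inventory would have flagged this one. The snippet below is only a sketch in Python of that idea, not something this customer had: the inventory format, the find_collisions helper, the asset tags and the addresses are all made up for illustration. The one thing that matters is that powered-off boxes stay in the list, because they can come back.

```python
from collections import defaultdict


def find_collisions(inventory):
    """Return {(field, value): [asset tags]} for names or IPs claimed by more than one machine."""
    claims = defaultdict(list)
    for machine in inventory:
        # Powered-off machines are counted on purpose: they can be switched back on.
        # Names are compared case-insensitively, as Windows host names are.
        claims[("name", machine["name"].lower())].append(machine["asset"])
        claims[("ip", machine["ip"])].append(machine["asset"])
    return {key: assets for key, assets in claims.items() if len(assets) > 1}


# Hypothetical inventory: asset tags, names and addresses are invented.
inventory = [
    {"asset": "HP-0042", "name": "whatever-6", "ip": "10.0.0.16", "powered_on": False},
    {"asset": "DELL-0107", "name": "WHATEVER-6", "ip": "10.0.0.16", "powered_on": True},
    {"asset": "DELL-0101", "name": "whatever-1", "ip": "10.0.0.11", "powered_on": True},
]

for (field, value), assets in find_collisions(inventory).items():
    print(f"Duplicate {field} {value}: {', '.join(assets)}")
```

With the made-up data above, the forgotten HP box and the new Dell box are reported as sharing both a name and an IP address, which is exactly the situation nobody knew about here.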

Next time you are done with your tests on a production environment (yes, this was production), try at least to disconnect the Ethernet cables at the back.

Oh and do not forget to disable the wireless card on it, in case your company does think it is a great idea to use laptops as Citrix XenApp servers, serving users over the wireless card.

Well that is another story for another great post…

CR
