A few years ago, a new project dropped into my lap at work. The project involved the setup and configuration of a “next-generation” Linux server for our Genetics department to run their sequencing analyses. The next-gen server was to replace an aging 1U Dell server with a pitiful single P4 processor and 4GB of RAM (the poor thing ran maxed out pretty much 24×7). So far, so good.
And then the IT manager got involved and decided that the money for the next-gen server should be invested in a decent VMware environment, with the Genetics department getting a virtual Linux server to use instead. And this is where the fun begins. To begin with, I had been getting quotes for a server with between 48 and 128GB of RAM. Our new virtual system has 32GB, total. Given that the Genetics server will be one of many virtual hosts running on this system, 32GB is likely to be insufficient. This is not the biggest issue, however.
Our IT manager then decided to pay for consultants to come out and set up the system. This seemed logical: while we have some experience with ESXi, having run a small ESXi environment for a year or so, we don’t have experience with complex VMware setups involving multiple blades and SANs.
Unfortunately, the IT manager also ordered two new switches that the SAN and blade chassis would connect to. Equally unfortunately, he didn’t consult his infrastructure engineer (that’d be me), and so we ended up with cheap switches that don’t support the features necessary for a fully fault-tolerant setup. The consultants weren’t too fazed by this. After a bit of head scratching, they set up a semi-working system and left. I say semi-working because as soon as a single link failed, the entire system fell over. Not exactly the level of fault tolerance we were looking for! The lack of proper Spanning Tree support meant we also couldn’t have a redundant link from the switches to the rest of the LAN.
One of my first tasks was to replace the two switches with better-specced switches. And that’s where my problems began. I immediately ran into an issue where one of the VLANs no longer worked on the new switches: the Spanning Tree feature we had planned to rely on to provide redundant links without creating a loop in the network was disabling the uplinks for one of the VLANs. A quick Google search suggested multiple Spanning Tree instances would be the answer, and a quick look through the ProCurve specification showed this should work. Except it didn’t. Apparently HP’s implementation of Multiple Spanning Tree leaves something to be desired. No problem; I simply disabled Spanning Tree on the uplink ports it had been blocking. This won’t cause problems, since the two uplinks are in separate broadcast domains, hence no loop.
My next conundrum was to work out why the system wasn’t surviving when one of the two physical switches died. We’d checked and double-checked the cabling; it was all correctly balanced across both switches. We checked and double-checked our switch configs. A test setup using a Dell server with two bonded NICs going into a couple of spare switches showed that the theory worked – when one switch failed, traffic was correctly routed over the remaining link.
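For reference, the bonded-NIC test rig looked something like this. It's a minimal sketch using the Linux active-backup bonding mode; the interface names (eth0/eth1) are illustrative, and on distros of that vintage you'd set the same options via the bonding module parameters in /etc/modprobe.conf rather than iproute2:

```shell
# Create an active-backup bond with MII link monitoring every 100 ms,
# then enslave one NIC cabled to each of the two test switches.
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down
ip link set eth0 master bond0
ip link set eth1 down
ip link set eth1 master bond0
ip link set bond0 up

# Pull the cable on the active switch and watch the standby take over:
cat /proc/net/bonding/bond0   # shows which slave is currently active
```

With miimon enabled, the kernel detects the dead link within a few hundred milliseconds and fails traffic over to the surviving NIC – exactly the behaviour we wanted from the blades.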
Now, at this point, you may be wondering why we hadn’t involved the consultants. They had specifically said we should email them if we needed assistance. So I did. I then emailed them a copy of our switch configuration at their request, and they told me their network engineer was looking into it. Fast forward a couple of weeks, and I’m told by the IT manager that a complaint had been made against me because I kept emailing them!! I emailed them twice over a period of two or three weeks, which I think you’ll agree hardly constitutes a DoS attack on their email server! The company we’d been dealing with have since tried to sell us some more days’ consultancy time, but suffice to say, they had not exactly impressed us with their customer service. In my opinion, the system was not left in a state suitable for a production environment, so we shouldn’t have paid the first consultancy fee, let alone have to pay extra for them to come back and finish the job.
Anyway, having spent a few hours/days banging my head against various brick walls, I believe I have the problem narrowed down to the type of network failure detection used by VMware. The default setting, Link Status Only, will only detect link failures further downstream if the physical switches support Uplink Failure Detection (HP’s term for Cisco’s Link State Tracking). Ours, typically, don’t.
Changing the failure detection to Beacon Probing should, in theory, resolve this. The theory says there will be a slight hit in CPU utilisation, and that network utilisation will increase due to the beacon packets VMware broadcasts out of each uplink. If it works, I can accept this. I have just about finished testing this on my lab setup, and plan to make the changes to the live system this weekend. Unfortunately my lab setup involves an old 2U Dell server, not a £40,000 HP blade enclosure and SAN. Oh well, what’s the worst that can happen…?
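For the record, the change itself can be made from the ESXi shell rather than the vSphere client. This is a sketch, assuming a standard vSwitch named vSwitch0 and an ESXi release whose esxcli exposes the failover policy namespace – check your version before trusting the exact flags:

```shell
# Show the current teaming/failover policy for the vSwitch
# (failure detection defaults to "link", i.e. Link Status Only)
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0

# Switch failure detection from link status to beacon probing
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --failure-detection=beacon
```

Worth noting that VMware recommends beacon probing only with three or more uplinks per vSwitch; with just two, a beacon failure can’t tell you which link is the bad one.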
Update: It’s now Sunday morning, and I’ve been here two hours working on this bloody system. My proposed fix worked fine, up until the point where it stopped working and the server fell over. After 30 minutes coaxing it back to life, I managed to reset the NIC teaming configuration, but now I’m having trouble with one of the guests hanging at 95% when powering on. And of course, it’s not one of the test servers we have on there, it’s the main VM used by Genetics. I am beginning to take a dislike to VMware. Bring back the days when all systems sat on their own physical boxes. Much simpler to work with!