As readers of my blog may recall, I posted about issues I’ve been having with our new VMware system running on an HP C3000 blade chassis.
As a quick reminder, the system had been set up by consultants who knew their way around VMware, but didn’t seem to have a clue about the networking side. Although the system was cabled correctly into two physical switches, failure of one switch caused the system to drop offline rather than use the second switch.
Political issues and various managers throwing their toys around meant the consultants were no longer willing to assist us, so the problem fell to me to resolve. My efforts were hampered by the fact that the IT Manager had allowed people to start accessing the VMs, which of course meant I couldn’t take the system offline during working hours.
As I’d correctly guessed, our issues were being caused by the fact that the failure of a physical switch was not being seen by the ESXi host connected to that switch (each ESXi host is connected to the C3000 interconnect switch, which connects to the physical switch).
Having tried (and failed) to use Beacon Probing to work around this, the solution would appear to be to enable Uplink Failure Detection (UFD) on the C3000 interconnects. This allows us to tell the interconnect to kill the downlinks to an ESXi host when it detects a failure on a physical switch uplink. This has the effect of alerting the ESXi host to the network failure, which will then start utilising the remaining network paths for outbound traffic.
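For reference, the host-side half of this can be checked from the ESXi shell. A rough sketch (the `esxcli network` namespace is available from ESXi 5.0 onwards; `vSwitch0` and the NIC names are examples, not necessarily our actual configuration):

```shell
# List the physical NICs and their link state - with UFD in place,
# an upstream switch failure should show the relevant vmnic as down
esxcli network nic list

# Show the current teaming/failover policy for the vSwitch
esxcli network vswitch standard policy failover get -v vSwitch0

# With the interconnect dropping the downlink for us, plain link-status
# detection is sufficient and beacon probing is no longer needed
esxcli network vswitch standard policy failover set -v vSwitch0 --failure-detection link
```

The point being that once UFD pulls the downlink, the host sees an ordinary link-down event, so the standard "link status only" failure detection does the rest.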
Unfortunately, even that wasn’t straightforward, as UFD only works on uplinks that share the same VLAN configuration. Our consultants had set up multiple uplinks, each in a different VLAN. This week, I managed to recable the system so that the VLANs are all trunked over the same uplinks, allowing me to enable UFD. Four days later, the system is still up, there are no signs of the new configuration causing any issues and (I think) we now have a fully fault-tolerant ESXi environment. Stay tuned for part 3, when I test my work and start pulling out various cables!
Oh, and the issue with one of the VMs hanging at 95% when powered on was due to it having been automatically migrated to another ESXi host during the maintenance period. It was waiting for me to respond to a question asking whether I had moved or copied the VM, but I hadn’t spotted this because the question is actually asked on a different tab; no indication of this is given on the main tab, so it simply looks like the VM has hung!
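As a footnote, a stuck question like this can also be spotted and answered from the ESXi shell, without hunting through client tabs. A sketch (the VM ID, message ID and answer number below are examples; take the real values from each command’s output on your own host):

```shell
# Find the numeric ID of the affected VM
vim-cmd vmsvc/getallvms

# List any pending question for VM ID 42 (example ID) - the
# "did you move or copy this VM?" prompt shows up here
vim-cmd vmsvc/message 42

# Answer it: the message ID (often "_vmx1") and the choice number
# (e.g. "I moved it") come from the previous command's output
vim-cmd vmsvc/message 42 _vmx1 2
```

Had I known this at the time, the VM would have finished powering on without anyone needing to find the right tab.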