Something I see out in the field a lot, or at least more than I should, is clusters running with all the default HA settings. This is a common cause of unexpected outages. HA was developed to help automate the recovery of VMs when they become unavailable. The thing we have to define is what "unavailable" means. Technically speaking, if something is isolated on the network it is unavailable, so host isolation detection is one feature VMware has built into HA. I'm not going to dig too deep on this because I wouldn't do nearly as good a job as Duncan Epping does on his blog yellow-bricks.com in his HA Deepdive section.
One of the most underutilized options in HA is the ability to control what is used to define host isolation. For this first part let's assume that we're talking vSphere 4.x and earlier. By default, host isolation is determined by the host's ability to simply ping the default gateway from a management interface. The image below shows the default settings you get if you just check the HA box on the cluster:
You'll notice that the default isolation response at the top of the page is set to "shut down" the VMs on the host. This is the most common cause of an accidental outage because the default gateway is commonly not controlled by the guys who manage the VMware environment. See the image below:
Imagine that the network team asked you to use the core switch as the default gateway on your ESXi hosts. If the network admin reboots the core switch for a code upgrade or some other maintenance, guess what happens: the ESXi hosts all think they are isolated and start shutting down VMs. So what should have been a four or five minute outage while the switch rebooted and initialized the new code turns into an hour or more of making sure all the VMs come back up and are powered on in the correct order. This is obviously something we want to avoid.
How do we fix this? The easiest way is to identify a secondary and/or tertiary device to use as a host isolation address. What device should we use? In the example above we have a few options. I would probably use the management interface on the top of rack switch, the firewall address, or something else in the same rack (like a pair of load balancers). This ensures that if the whole rack becomes isolated due to an upstream link failure, your VMs don't just shut down, since the hosts can still reach a device inside the rack. If the whole rack is isolated there's no point in shutting down the VMs. Having these three devices would provide the best level of protection we can get. If you plan to reboot the top of rack switch there's no getting around an isolation response, and you should disable HA when doing such maintenance.
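To make the reasoning concrete, here is a minimal sketch of the decision described above: a host only counts as isolated when none of its isolation addresses answer. This is a simplified model for illustration, not how the HA agent is actually implemented. The function names and the addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical.

```python
from typing import Callable, Iterable


def host_is_isolated(isolation_addresses: Iterable[str],
                     can_ping: Callable[[str], bool]) -> bool:
    """Return True only if no isolation address answers a ping."""
    return not any(can_ping(address) for address in isolation_addresses)


# Hypothetical addresses, for illustration only.
CORE_SWITCH = "192.0.2.1"   # default gateway, owned by the network team
TOR_SWITCH = "192.0.2.2"    # top of rack switch management interface
FIREWALL = "192.0.2.3"      # firewall address

# Simulate the core switch rebooting: it is the only device not answering.
reachable = {TOR_SWITCH, FIREWALL}


def can_ping(address: str) -> bool:
    """Stand-in for a real ping; looks the address up in the simulated set."""
    return address in reachable


# Default behaviour: only the gateway is checked, so the host looks isolated.
default_only = host_is_isolated([CORE_SWITCH], can_ping)

# With extra isolation addresses the same reboot is ridden out.
with_extras = host_is_isolated([CORE_SWITCH, TOR_SWITCH, FIREWALL], can_ping)

print(default_only)   # True
print(with_extras)    # False
```

With only the gateway configured, the switch reboot triggers the isolation response on every host. With the extra addresses in place, the hosts still get an answer from inside the rack and leave the VMs alone.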
Implementing HA advanced options is quite easy, and that is how we would remedy this issue. We would use a couple of HA options in this scenario:
1. das.usedefaultisolationaddress - we'll set this to false so HA stops using the default gateway automatically; if we do decide to keep the default gateway as one of the checks, we simply list it as one of the isolation addresses below
2. das.isolationaddress<X> - Where <X> is the numbered entry of the isolation address, so if we had three isolation addresses we'd use das.isolationaddress1, das.isolationaddress2, das.isolationaddress3
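As a rough sketch of how these two options fit together, the snippet below builds the key/value pairs you would type into the advanced options dialog. The helper `build_isolation_options` is a hypothetical name of my own, and the addresses are placeholders from the 192.0.2.0/24 documentation range. Only the option names come from HA itself.

```python
from typing import Dict, List


def build_isolation_options(addresses: List[str],
                            use_default_gateway: bool = False) -> Dict[str, str]:
    """Build the HA advanced option key/value pairs for a list of addresses."""
    options = {
        "das.usedefaultisolationaddress": "true" if use_default_gateway else "false",
    }
    # Number the entries starting at 1: das.isolationaddress1, 2, 3, ...
    for index, address in enumerate(addresses, start=1):
        options[f"das.isolationaddress{index}"] = address
    return options


# Hypothetical devices: top of rack switch, firewall, load balancer.
options = build_isolation_options(["192.0.2.2", "192.0.2.3", "192.0.2.4"])

for key, value in options.items():
    print(f"{key} = {value}")
```

Each printed line corresponds to one row in the advanced options window: the option name on the left and its value on the right.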
To set these values, go to the VMware Virtual Infrastructure Client, right click your HA cluster, choose Edit Settings, go to the HA section, and click the advanced options button on the bottom right of the window. We will then fill it out as depicted in the image below:
This will set your cluster to use these addresses to determine host isolation. Keep in mind that if any of these devices' IPs change, your cluster needs to be updated. Again, this applies mostly to vSphere 4.x and earlier; there are a few more protection mechanisms in vSphere 5 that I will cover in a later post.
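One way to keep on top of changing IPs is to periodically compare the configured isolation addresses against whatever inventory you keep of those devices. The sketch below is only an illustration of that idea. The function `find_stale_entries`, the device names, and the addresses are all hypothetical, and how you export the cluster options or maintain the inventory is up to you.

```python
from typing import Dict, List, Tuple


def find_stale_entries(cluster_options: Dict[str, str],
                       current_device_ips: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (option, address) pairs whose address matches no known device."""
    live_addresses = set(current_device_ips.values())
    return [
        (key, value)
        for key, value in sorted(cluster_options.items())
        if key.startswith("das.isolationaddress") and value not in live_addresses
    ]


# Hypothetical cluster settings, as entered in the advanced options dialog.
cluster_options = {
    "das.usedefaultisolationaddress": "false",
    "das.isolationaddress1": "192.0.2.2",
    "das.isolationaddress2": "192.0.2.3",
}

# Hypothetical inventory: the firewall has since been re-addressed.
current_device_ips = {
    "tor-switch": "192.0.2.2",
    "firewall": "192.0.2.30",
}

stale = find_stale_entries(cluster_options, current_device_ips)
print(stale)  # [('das.isolationaddress2', '192.0.2.3')]
```

Any entry reported here points at an address that no longer answers for the device you intended, so the cluster setting should be updated before it matters.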