As many of you reading my blog have realized I am a huge proponent of datacenter virtualization. So here we are with a nicely virtualized Hyper-V failover cluster, we’ve got excellent guest to host density and everything is going well….until it isn’t. I was taking some time off when I received a call in the middle of the night that one of my customer’s entire infrastructure was down. Upon logging in it was apparent something was quite wrong, the failover cluster had failed convergence and the local Hyper-V MMC showed that most over 75% of the VMs were in a paused critical state (see the screenshot below as an example that was reproduced in a lab).
As it turns out one of the cluster shared volumes had reached a point where > 200MB was remaining on the volume. Hyper-V will begin writing to the event log when the volume drops below 2GB with an event log entry for each VM. Once the space drops below 200MB all VMs on that volume will enter a paused critical state. This only applies to thin provisioned disks however at the time of writing this there is a bug that this will occur even with thick provisioned disks. The fun part in my scenario was that both domain controllers were residing on this volume. To remedy the issue I was able to forcibly power-off the domain controllers from their paused state and storage migrate them to another cluster volume. Upon bringing them back up cluster convergence was able to be restored and everything was up and running again. The other thing of note is that the clocks were all paused during this time resulting in major time drift that had to be manually reconciled.
- ALWAYS ensure your storage admin has monitoring on the SAN to send alerts when the LUNs reach a critical threshold and warn before that at a lower threshold
- If you have multiple LUNs keep your Domain Controllers spread across them