...
HA-enabled vCenter clusters may end up with multiple instances of the same VM running on different ESXi hosts. In the vCenter UI, the host that owns the VM may appear to flap between the hosts where the instances are running. VMs sharing host resources with the split-brain VMs may witness a "DRS storm," in which many VMs rapidly migrate between the hosts running the split-brain VMs. This can degrade VM performance during migration and can over-commit resources on the affected hosts.
This article explains how to identify and kill the VMX process that has lost control of the VM, that is, the instance that no longer holds the locks on the VM's files.
We can encounter this problem in a vSphere HA environment in the following ways. In each scenario, assume a host H1 is running a VM that is protected by HA.

Scenario 1: The host isolation response in the vSphere HA settings is set to "Disabled" and host H1 becomes network isolated. The HA primary in the other partition fails over the VM running on the isolated host to other hosts in the cluster. When the isolated host rejoins the network, two instances of the VM are running on different hosts in the cluster.

Scenario 2: Virtual machine VM1 is stored on datastore DS1 and is registered and running on host H1. DS1 hits an All Paths Down (APD) condition, and host H1 becomes network isolated from the other hosts in the cluster at around the same time. The FDM primary in the other partition marks host H1 as Dead and fails over its VMs to other hosts in the cluster. When a datastore hits APD, FDM waits for the APD timeout (140 seconds by default) plus the VMCP timeout (180 seconds by default) before taking any action, and it acts on the selected APD policy only after both timeouts expire. If the APD clears before the timeout, FDM does not act on the VMCP policy. If "vmReactionOnAPDCleared" is set to "none", FDM takes no action when the APD clears, and a split-brain scenario results when the partition is resolved.

Scenario 3: The host isolation address remains accessible to the ESXi host, but connectivity to the HA cluster is lost for a short period. When the isolated host rejoins the network, two instances of the VM are running on different hosts in the cluster.
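As a quick check of the host-side APD timeout referenced in Scenario 2, the value can be read with esxcli on any ESXi host. This is a minimal sketch; note that the VMCP timeout and the vmReactionOnAPDCleared policy are cluster-level vSphere HA settings configured in the vSphere Client, not host advanced options:

# Show the host-level APD timeout (defaults to 140 seconds)
esxcli system settings advanced list -o /Misc/APDTimeout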
No impact
A VMX process for the VM will be running on two different ESXi hosts, but only one of them holds the lock on the VM's files. To resolve the split-brain scenario, identify the host whose VMX process holds the lock on the VM files and power off the VM on the other host (the one that does not hold the lock).

1. Identify the host that holds the lock on the VM files

For VMFS, vSAN, and vVol datastores:

Method 1: Follow the steps in the KB article Investigating virtual machine file locks on ESXi.

Method 2:
a. Log in to an ESXi host.
b. Run the command vmkfstools -D <path-to-vmx-lck-file>, for example:

vmkfstools -D /vmfs/volumes/5df95a18-fd1e0b13-fc0d-02004b8f7cfd/New\ Virtual\ Machine/New\ Virtual\ Machine.vmx.lck

Lock [type 10c00001 offset 8003584 v 106, hb offset 3670016 gen 13115, mode 1, owner 5df9572c-7a1e42d7-074b-02004b84b281 mtime 2144440 num 0 gblnum 0 gblgen 0 gblbrk 0]
Addr <4, 0, 65>, gen 95, links 1, type reg, flags 0xa, uid 0, gid 0, mode 600
len 1073741824, nb 1024 tbz 1024, cow 0, newSinceEpoch 1024, zla 3, bs 1048576
affinityFD <4,0,62>, parentFD <4,0,62>, tbzGranularityShift 20, numLFB 0
lastSFBClusterNum 15, numPreAllocBlocks 0, numPointerBlocks 1

c. The output contains the "owner" of the file. In the output above, the owner is 5df9572c-7a1e42d7-074b-02004b84b281. The last section of the owner value (02004b84b281) is the MAC address of the host that owns the file.
d. Find the host that has a NIC with the MAC address 02004b84b281. One way is to log in to each ESXi host and run the following command (a sketch for matching the owner suffix against the NIC list follows this section):

esxcfg-nics -l

Name    PCI           Driver  Link  Speed     Duplex  MAC Address        MTU   Description
vmnic0  0000:0b:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:2e:10:fc  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic1  0000:13:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:85:ef:75  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic2  0000:1b:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:03:a8:29  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic3  0000:04:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:84:b2:81  1500  Intel Corporation Virtual 82574L Gigabit Ethernet

If one of the vmnics has the MAC address 02:00:4b:84:b2:81, you have found the host that holds the locks for the VM.

2. Power off the VM from the other host, which does not hold the lock

Log in to the ESXi host that does not hold the lock, then (see the example session after this section):
a. Run the vim-cmd vmsvc/getallvms command to display the names of the virtual machines registered on this host.
b. Take note of the affected virtual machine's ID (VMID).
c. Power off the virtual machine:

vim-cmd vmsvc/power.off VMID

d. Run the vim-cmd vmsvc/getallvms command again to check whether the stale VM still exists on the host.
e. If the powered-off copy of the VM still exists on the host, unregister it:

vim-cmd vmsvc/unregister VMID

For more information, refer to the VMware KB article Investigating virtual machine file locks on ESXi.

For NFS datastores:

To identify the host that has locked the VMX files, refer to the VMware KB article Understanding the NFS .lck lock file to determine the ESXi host and the NFS file name it refers to.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
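To speed up step 1d, the owner suffix from the vmkfstools output can be reformatted into colon-separated MAC notation and matched against the NIC list in one pass. This is a minimal sketch using the example value from the output above; substitute the last section of your own lock owner:

# Convert the lock owner's last section into MAC address notation,
# then search this host's NICs for it (run on each candidate host).
OWNER_SUFFIX=02004b84b281
MAC=$(echo "$OWNER_SUFFIX" | sed 's/../&:/g; s/:$//')   # 02:00:4b:84:b2:81
esxcfg-nics -l | grep -i "$MAC"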
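Putting step 2 together, a typical session on the host that does not hold the lock looks like the following. The VM name "New Virtual Machine" and the VMID 12 are hypothetical placeholders; use the ID returned by getallvms in your environment:

# Find the affected VM's ID (first column of the output).
vim-cmd vmsvc/getallvms | grep -i "New Virtual Machine"
# Power off the stale instance.
vim-cmd vmsvc/power.off 12
# Confirm whether the VM is still registered; unregister it if so.
vim-cmd vmsvc/getallvms | grep -i "New Virtual Machine"
vim-cmd vmsvc/unregister 12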
If these symptoms are seen on VMC on AWS, please contact VMware Support for assistance.