...
Description of problem:

Consider a cluster in which only some nodes are intended to be unfenced. For example, in OpenStack clusters, only the Pacemaker remote nodes need to be unfenced (by fence_compute).

If any one node needs to be unfenced, then all nodes must be unfenced using their stonith device or stonith topology when they join the cluster, even if unfencing serves no purpose for some nodes. This is redundant at best and problematic at worst.

To see why, let's look at a minimal reproducer.

~~~
Pacemaker Nodes:
 node1 node2

Resources:
 Resource: node3-rem (class=ocf provider=pacemaker type=remote)
  Attributes: server=node3-rem

Stonith Devices:
 Resource: remotefence (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node3-rem:fastvm-rhel-8.0-52
  Meta Attrs: provides=unfencing
 Resource: kdump (class=stonith type=fence_kdump)
  Attributes: pcmk_host_list="node1 node2"
~~~

The remote node's stonith device is the only one that provides unfencing. But every node has to be unfenced when it joins the cluster.

If node1 and node2 had a stonith device that supported the "on" action (e.g., fence_ipmilan or fence_xvm), then the unfencing would be redundant but wouldn't hurt anything. The node is already powered on if it's joining the cluster, so the "on" action would be a no-op.

If node1 and node2 only have a stonith device that does *not* support the "on" action (e.g., fence_kdump), then the unfencing will never succeed, and they will be unable to run resources. The same is true if their stonith device does support the "on" action but is not working (e.g., fence_ipmilan when the iLO is temporarily unreachable). Until pacemaker can successfully run the "on" action to unfence the nodes, they can't run resources.

Brief demo (node1 is in standby, but you can see that node2 can't run anything besides node3-rem):

    date && pcs cluster start --all
    Mon Dec 13 02:17:34 PST 2021
    node2: Starting Cluster...
    node1: Starting Cluster...

    date && pcs status
    Mon Dec 13 02:18:03 PST 2021
    ...
    Node List:
      Node node1: standby
      RemoteNode node3-rem: standby
      Online: [ node2 ]

    Full List of Resources:
      node3-rem (ocf::pacemaker:remote): Started node2
      dummy1 (ocf::heartbeat:Dummy): Stopped
      remotefence (stonith:fence_xvm): Stopped
      kdump (stonith:fence_kdump): Stopped

    Failed Fencing Actions:
      unfencing of node2 failed: delegate=node1, client=pacemaker-controld.693855, origin=node1, last-failed='2021-12-13 02:17:52 -08:00'
      unfencing of node1 failed: delegate=node1, client=pacemaker-controld.693855, origin=node1, last-failed='2021-12-13 02:17:52 -08:00'

NOTES:

In the above example, dummy1 won't start because the default `requires` becomes "unfencing" if unfencing is enabled in the cluster. It's possible to work around this and allow dummy1 to start by setting a cluster-wide resource default of `requires=fencing` (or by setting `requires=fencing` on dummy1 directly). While this workaround is valid, I don't believe it should be required in order to allow resources to start on nodes whose stonith devices are not configured with `provides=unfencing`.

*However*, this workaround still will not allow the other stonith devices (e.g., kdump in this example) to start. That seems to be because pcmk__is_unfence_device() doesn't do what its doc comment advertises, which is to check whether the device supports unfencing. Instead, it checks only whether the resource is a fence device and whether unfencing is enabled cluster-wide.

~~~
/*!
 * \internal
 * \brief Check whether a resource is a fencing device that supports unfencing
 *
 * \param[in] rsc       Resource to check
 * \param[in] data_set  Cluster working set
 *
 * \return true if \p rsc is a fencing device that supports unfencing,
 *         otherwise false
 */
bool
pcmk__is_unfence_device(const pe_resource_t *rsc,
                        const pe_working_set_t *data_set)
{
    return pcmk_is_set(rsc->flags, pe_rsc_fence_device)
           && pcmk_is_set(data_set->flags, pe_flag_enable_unfencing);
}
~~~

This function gets called by pcmk__order_vs_unfence(). The net effect is that a stonith device can't start until unfencing has occurred, seemingly no matter what. (This was the original issue that came to us via a support case.)

Ref: https://github.com/ClusterLabs/pacemaker/blob/57b49dcd1b0c90976f7ff610fecd5cf28eec824a/lib/pacemaker/pcmk_sched_fencing.c

Version-Release number of selected component (if applicable):

pacemaker-2.1.0-8.el8

How reproducible:

Always

Steps to Reproduce:

1. Configure a cluster in which one or more nodes have a stonith device with `meta provides=unfencing` and one or more other nodes have a stonith device that does not support the "on" action.
2. Start the cluster on all nodes.

Actual results:

The nodes whose stonith devices don't support the "on" action fail to be unfenced, and they cannot run resources.

Expected results:

The nodes whose stonith devices don't support the "on" action (or don't have `meta provides=unfencing` configured) either don't need to be unfenced when they join the cluster, or are unfenced by a dummy internal stonith device that performs a no-op and returns success.

Additional info:

N/A
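To illustrate the gap between what pcmk__is_unfence_device() checks and what its doc comment describes, here is a minimal, self-contained model. It is not Pacemaker code: `struct model_resource`, the `MODEL_*` flag constants, and both `model_is_unfence_device_*()` functions are hypothetical stand-ins for `pe_resource_t`, the scheduler flags, and the real function. It also assumes that a device's support for unfencing can be read from its `provides` meta-attribute, as in the reproducer's configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for this model (not Pacemaker's real constants) */
#define MODEL_RSC_FENCE_DEVICE  (1u << 0)  /* resource is a stonith device */
#define MODEL_ENABLE_UNFENCING  (1u << 1)  /* unfencing is enabled cluster-wide */

/* Hypothetical, simplified model of a resource */
struct model_resource {
    const char *id;
    unsigned int flags;
    const char *provides;  /* "provides" meta-attribute value, or NULL if unset */
};

/* Behavior described in this report: the result depends only on whether the
 * resource is a fence device and whether unfencing is enabled cluster-wide.
 */
bool
model_is_unfence_device_current(const struct model_resource *rsc,
                                unsigned int cluster_flags)
{
    assert(rsc != NULL);
    return ((rsc->flags & MODEL_RSC_FENCE_DEVICE) != 0)
           && ((cluster_flags & MODEL_ENABLE_UNFENCING) != 0);
}

/* Behavior the doc comment suggests (an assumption of this sketch): the
 * device itself must also be configured with provides=unfencing.
 */
bool
model_is_unfence_device_expected(const struct model_resource *rsc,
                                 unsigned int cluster_flags)
{
    return model_is_unfence_device_current(rsc, cluster_flags)
           && (rsc->provides != NULL)
           && (strcmp(rsc->provides, "unfencing") == 0);
}
```

Applied to the reproducer's devices, the first check treats kdump as an unfence device whenever the cluster-wide flag is set, while the second would treat only remotefence that way. How an actual fix should determine per-device support is a design question for the Pacemaker scheduler and is not settled by this sketch.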
Won't Do