Troubleshooting pacemaker: Discarding cib_apply_diff message (xxx) from server2: not in our membership

Yesterday, when we rebooted one of our high-availability servers (to update the goddamn vmware-tools, screwing my precious uptime), we faced a serious problem: pacemaker was not syncing, so we had lost high availability. The active node saw the cluster as if there was no problem:

[root@server2 ~]# crm status
============
Last updated: Wed Nov 7 12:36:01 2012
Last change: Tue Nov 6 18:33:15 2012 via crmd on server2
Stack: openais
Current DC: server2 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
6 Resources configured.
============

Online: [ server2 server1 ]
(...)

The passive node, instead, saw the cluster as if all nodes, including itself, were offline:

[root@server1 ~]# crm status
============
Last updated: Wed Nov 7 12:36:27 2012
Last change: Wed Nov 7 12:35:57 2012 via cibadmin on server1
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
6 Resources configured.
============

OFFLINE: [ server2 server1 ]

Looking at the log files on the passive node, we could see that startup was fine, but at a certain point there were errors like:

Nov 6 16:40:29 server1 cib[4607]: warning: cib_peer_callback: Discarding cib_replace message (776) from server2: not in our membership
Nov 6 16:40:29 server1 cib[4607]: warning: cib_peer_callback: Discarding cib_apply_diff message (777) from server2: not in our membership
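
A quick way to find these, assuming corosync and pacemaker log through syslog to /var/log/messages as on a stock RHEL 6 box, is something like:

grep "not in our membership" /var/log/messages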

On the active node there were no strange messages. Looking at corosync, we checked that the nodes were talking to each other without problems. Both of them returned the same output:

[root@server1 ~]# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.200.ip=r(0) ip(10.10.10.10)
runtime.totem.pg.mrp.srp.members.200.join_count=1
runtime.totem.pg.mrp.srp.members.200.status=joined
runtime.totem.pg.mrp.srp.members.201.ip=r(0) ip(10.10.10.11)
runtime.totem.pg.mrp.srp.members.201.join_count=1
runtime.totem.pg.mrp.srp.members.201.status=joined

We used tcpdump listening on the corosync port and confirmed there was traffic (as was obvious, but at this point we doubted everything), so it was clear that the problem was in pacemaker, and also pretty clear that we had no idea what it was. On further investigation we found some links (for instance, this one: http://comments.gmane.org/gmane.linux.highavailability.pacemaker/13185) pointing at this problem as a bug, fixed with this commit https://github.com/ClusterLabs/pacemaker/commit/03f6105592281901cc10550b8ad19af4beb5f72f which landed in pacemaker 1.1.8. And ours was 1.1.7. Crap.
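
If you want to reproduce those checks, something along these lines should do (the interface name and the multicast port are assumptions, take them from your own corosync.conf):

# watch corosync traffic on the cluster interface (eth0 and 5405 are examples)
tcpdump -n -i eth0 udp port 5405
# check the pacemaker version installed on each node
rpm -q pacemaker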

Trying to make things safer, instead of upgrading the existing machine we created a new one and installed pacemaker from scratch, following the instructions from http://www.clusterlabs.org/rpm-next/ and http://www.clusterlabs.org/wiki/Install. Basically, we did:

yum install -y corosync corosynclib.x86_64 corosynclib-devel.x86_64
wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm-next/rhel-6/clusterlabs.repo
yum install -y pacemaker cman
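
If you are starting from scratch, the services also need to be running and enabled at boot. A rough sketch for a RHEL 6 box like ours (whether pacemaker needs its own init script or is launched by the corosync plugin depends on the ver: setting of the plugin in corosync.conf):

chkconfig corosync on
service corosync start
# only needed when the pacemaker plugin runs with ver: 1
chkconfig pacemaker on
service pacemaker start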

We copied our corosync.conf over, adapting it (just changing the nodeid), and started everything; the new node joined the cluster without any problem:

[root@server3 ~]# crm status

Last updated: Wed Nov 7 18:14:08 2012
Last change: Wed Nov 7 18:07:01 2012 via cibadmin on server3
Stack: openais
Current DC: server3 - partition with quorum
Version: 1.1.8-1.el6-394e906
3 Nodes configured, 3 expected votes
6 Resources configured.

Online: [ server3 server2 ]
OFFLINE: [ server1 ]
(...)
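
For reference, the corosync.conf fragment we touched looks roughly like this (nodeid 202 and the addresses are made-up example values; the nodeid is the only thing that changes between nodes):

totem {
        version: 2
        # unique per node; the existing members showed up as 200 and 201
        nodeid: 202
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}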

We did a smooth migration to the new node and everything went well. We upgraded the remaining nodes and got our high availability back.
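
For the record, the kind of crm commands involved in that migration (resource and node names are placeholders) look like this:

# move a resource onto the new node, then drop the constraint the migrate creates
crm resource migrate some_resource server3
crm resource unmigrate some_resource
# put the node we are about to upgrade in standby
crm node standby server1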

But the new version has its inconveniences: crm, the tool we use to configure pacemaker, is now distributed separately, following the maintainer's decision. It has become a project of its own under the name crmsh, with its own website: http://savannah.nongnu.org/projects/crmsh/. The compiled package is available at http://download.opensuse.org/repositories/network:/ha-clustering/ but it depends on the pssh package, which has dependencies of its own. Bottom line, we did this:

wget http://apt.sw.be/redhat/el6/en/i386/rpmforge/RPMS/pssh-2.0-1.el6.rf.noarch.rpm
rpm -Uvh pssh-2.0-1.el6.rf.noarch.rpm
yum -y install python-dateutil.noarch
yum -y install redhat-rpm-config
wget http://download.opensuse.org/repositories/network:/ha-clustering/CentOS_CentOS-6/x86_64/crmsh-1.2.5-55.3.x86_64.rpm
rpm -Uvh crmsh-1.2.5-55.3.x86_64.rpm
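
A quick sanity check that the shell is installed and can talk to the cluster (assuming the cluster is already up):

rpm -q pssh crmsh
crm status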

And with that, we had everything online again.

Tomàs
