Troubleshooting pacemaker: Pacemaker IP doesn’t appear in ifconfig

Those of us who are used to managing network interfaces through ifconfig run into some trouble when we put a virtual IP under pacemaker: we can't see it. If we type ip addr there it is, but not with ifconfig. To make it visible we just need to use the iflabel="label" option, and the IP will show up; with ip addr we can now quickly tell which is the server's own IP and which are the service IPs managed by pacemaker:

# crm configure show
(...)
primitive IP_VIRTUAL ocf:heartbeat:IPaddr2 \
        params ip="10.0.0.11" cidr_netmask="32" iflabel="IP_VIRTUAL" \
        op monitor interval="3s" \
        meta target-role="Started"
(...)

IMPORTANT: the device label accepts only 10 characters. If we use more than 10, pacemaker won't be able to start the virtual IP and the resource will fail (this gave me some headaches :D). Make sure you use 10 characters at most.
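Under the hood the IPaddr2 agent is basically adding a labelled secondary address, which we can reproduce by hand to see the effect (the address, device and label below are just the ones from this example, adjust them to your setup; note that "eth0:IP_VIRTUAL" already fills the 15-character interface name limit, which is where the 10-character label limit comes from):

# add the virtual IP with a label so it shows up in ifconfig
ip addr add 10.0.0.11/32 dev eth0 label eth0:IP_VIRTUAL

# remove it again after testing
ip addr del 10.0.0.11/32 dev eth0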

Without iflabel it doesn’t appear in ifconfig and isn’t labeled in ip addr:

# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:9e:3c:94 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.1/24 brd 10.0.0.255 scope global eth0
inet 10.0.0.11/32 brd 10.0.0.11 scope global eth0
inet 10.0.0.12/32 brd 10.0.0.12 scope global eth0
inet 10.0.0.13/32 brd 10.0.0.13 scope global eth0
inet 10.0.0.14/32 brd 10.0.0.14 scope global eth0
inet6 fe80::250:56ff:fe9e:3c94/64 scope link
valid_lft forever preferred_lft forever
# ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:56:9E:3C:94
inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:fe9e:3c94/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1825681745 errors:0 dropped:0 overruns:0 frame:0
TX packets:2044189443 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:576307237739 (536.7 GiB) TX bytes:605505888813 (563.9 GiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:924190306 errors:0 dropped:0 overruns:0 frame:0
TX packets:924190306 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:415970933288 (387.4 GiB) TX bytes:415970933288 (387.4 GiB)


However, if we use iflabel, there they are:

# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:9e:3c:9c brd ff:ff:ff:ff:ff:ff
inet 10.0.0.1/24 brd 10.0.0.255 scope global eth1
inet 10.0.0.11/32 brd 10.0.0.11 scope global eth1:nginx-ncnp
inet 10.0.0.12/32 brd 10.0.0.12 scope global eth1:nginx-clnp
inet 10.0.0.13/32 brd 10.0.0.13 scope global eth1:hap-ncnp
inet 10.0.0.14/32 brd 10.0.0.14 scope global eth1:hap-clnp
inet6 fe80::250:56ff:fe9e:3c9c/64 scope link
valid_lft forever preferred_lft forever
# ifconfig
eth1 Link encap:Ethernet HWaddr 00:50:56:9E:3C:9C
inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:fe9e:3c9c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:322545491 errors:0 dropped:0 overruns:0 frame:0
TX packets:333825895 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:92667389749 (86.3 GiB) TX bytes:93365772607 (86.9 GiB)

eth1:hap-clnp Link encap:Ethernet HWaddr 00:50:56:9E:3C:9C
inet addr:10.0.0.14 Bcast:10.0.0.14 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth1:hap-ncnp Link encap:Ethernet HWaddr 00:50:56:9E:3C:9C
inet addr:10.0.0.13 Bcast:10.0.0.13 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth1:nginx-clnp Link encap:Ethernet HWaddr 00:50:56:9E:3C:9C
inet addr:10.0.0.12 Bcast:10.0.0.12 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth1:nginx-ncnp Link encap:Ethernet HWaddr 00:50:56:9E:3C:9C
inet addr:10.0.0.11 Bcast:10.0.0.11 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:4073 errors:0 dropped:0 overruns:0 frame:0
TX packets:4073 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1136055 (1.0 MiB) TX bytes:1136055 (1.0 MiB)

Much better this way :)

More info here: http://linux.die.net/man/7/ocf_heartbeat_ipaddr2

Installing graphite 0.9.10 on Debian Squeeze

The graphing system I like the most is graphite. It's very useful for a lot of things (it's fast, scalable, light on resources, etc.). Today I'll explain how to install graphite 0.9.10 on Debian Squeeze.

First of all, we install the requirements:

apt-get install python apache2 python-twisted python-memcache libapache2-mod-python python-django libpixman-1-0 python-cairo python-django-tagging

And then we make sure we don't have whisper installed from Debian's repository, which is old and may have incompatibilities with the latest version of graphite:

apt-get remove python-whisper

Then we install the application. I've built some .deb packages that can be used directly:

wget http://www.tomas.cat/blog/sites/default/files/python-carbon_0.9.10_all.deb
wget http://www.tomas.cat/blog/sites/default/files/python-graphite-web_0.9.10_all.deb
wget http://www.tomas.cat/blog/sites/default/files/python-whisper_0.9.10_all.deb
dpkg -i python-carbon_0.9.10_all.deb python-graphite-web_0.9.10_all.deb python-whisper_0.9.10_all.deb

But if you don't like mine, it's easy to make them yourself with fpm (Effing Package Management), a ruby app to build packages for different package managers. First we install ruby and fpm:

apt-get install ruby rubygems
gem install fpm

Then we download graphite and untar it:

wget http://pypi.python.org/packages/source/c/carbon/carbon-0.9.10.tar.gz#md5=1d85d91fe220ec69c0db3037359b691a
wget http://pypi.python.org/packages/source/w/whisper/whisper-0.9.10.tar.gz#md5=218aadafcc0a606f269b1b91b42bde3f
wget http://pypi.python.org/packages/source/g/graphite-web/graphite-web-0.9.10.tar.gz#md5=b6d743a254d208874ceeff0a53e825c1
tar zxf graphite-web-0.9.10.tar.gz
tar zxf carbon-0.9.10.tar.gz
tar zxf whisper-0.9.10.tar.gz

Finally we build the packages and install them:

/var/lib/gems/1.8/gems/fpm-0.4.22/bin/fpm --python-install-bin /opt/graphite/bin -s python -t deb carbon-0.9.10/setup.py
/var/lib/gems/1.8/gems/fpm-0.4.22/bin/fpm --python-install-bin /opt/graphite/bin -s python -t deb whisper-0.9.10/setup.py
/var/lib/gems/1.8/gems/fpm-0.4.22/bin/fpm --python-install-lib /opt/graphite/webapp -s python -t deb graphite-web-0.9.10/setup.py
dpkg -i python-carbon_0.9.10_all.deb python-graphite-web_0.9.10_all.deb python-whisper_0.9.10_all.deb

We now have the graphite app installed. Whisper doesn't need any configuration. Carbon does, but we can go with the default config files:

cp /opt/graphite/conf/carbon.conf.example /opt/graphite/conf/carbon.conf
cp /opt/graphite/conf/storage-schemas.conf.example /opt/graphite/conf/storage-schemas.conf

This default storage-schemas.conf stores data every minute for a day. As it's very likely that we need to keep data longer (a month, a year...), my storage-schemas.conf looks like this:

[default_1min_for_1month_15min_for_2years]
pattern = .*
retentions = 60s:30d,15m:2y

This way data is stored every minute for 30 days, and every 15 minutes for 2 years. This makes each metric's data file about 1.4 MB, which is reasonable. You can play with these numbers if you need more time or want to use less disk space (it's pretty intuitive).
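As a rough sanity check of that figure (assuming the usual 12 bytes per whisper datapoint and ignoring the small file header):

60s for 30d  ->  30 * 24 * 60      =  43,200 points  ->  ~518 KB
15m for 2y   ->  2 * 365 * 24 * 4  =  70,080 points  ->  ~841 KB
                                      total per metric: ~1.4 MB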

After that we need to initialize the database:

cd /opt/graphite/webapp/graphite
sudo python manage.py syncdb

And now we could start carbon to begin to collect data, executing:

cd /opt/graphite/
./bin/carbon-cache.py start

But we also want the service to start at boot, so we need to add it to init.d. As the application doesn't ship with an init file, I took an init.d file for graphite on RedHat and made small changes so it works on Debian:

#!/bin/bash
#
# Carbon (part of Graphite)
#
# chkconfig: 3 50 50
# description: Carbon init.d

. /lib/lsb/init-functions
prog=carbon
RETVAL=0

start() {
        log_progress_msg "Starting $prog: "

        PYTHONPATH=/usr/local/lib/python2.6/dist-packages/ /opt/graphite/bin/carbon-cache.py start
        status=$?
        log_end_msg $status
}

stop() {
        log_progress_msg "Stopping $prog: "

        PYTHONPATH=/usr/local/lib/python2.6/dist-packages/ /opt/graphite/bin/carbon-cache.py stop > /dev/null 2>&1
        status=$?
        log_end_msg $status
}

# See how we were called.
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        PYTHONPATH=/usr/local/lib/python2.6/dist-packages/ /opt/graphite/bin/carbon-cache.py status
        RETVAL=$?
        ;;
  restart)
        stop
        start
        ;;
  *)
        echo $"Usage: $prog {start|stop|restart|status}"
        exit 1
esac

exit $RETVAL

To install it we just need to put it where it belongs:

wget http://www.tomas.cat/blog/sites/default/files/carbon.initd -O /etc/init.d/carbon
chmod 0755 /etc/init.d/carbon
chkconfig --add carbon
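If chkconfig isn't available on your Debian box (it's a separate package there), the native update-rc.d does the same job:

update-rc.d carbon defaults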

Now we can start it from init.d (service carbon start, or /etc/init.d/carbon start). Finally, we configure the webapp to access the data. We create an apache virtualhost with this content:


<VirtualHost *:80>
        ServerName YOUR_SERVERNAME_HERE
        DocumentRoot "/opt/graphite/webapp"
        ErrorLog /opt/graphite/storage/log/webapp/error.log
        CustomLog /opt/graphite/storage/log/webapp/access.log common

        <Location "/">
                SetHandler python-program
                PythonPath "['/opt/graphite/webapp'] + sys.path"
                PythonHandler django.core.handlers.modpython
                SetEnv DJANGO_SETTINGS_MODULE graphite.settings
                PythonDebug Off
                PythonAutoReload Off
        </Location>

        Alias /content/ /opt/graphite/webapp/content/
        <Location "/content/">
                SetHandler None
        </Location>
</VirtualHost>
We enable the virtualhost and allow the apache user to access the whisper data:

wget http://www.tomas.cat/blog/sites/default/files/graphite-vhost.txt -O /etc/apache2/sites-available/graphite
a2ensite graphite
chown -R www-data:www-data /opt/graphite/storage/
/etc/init.d/apache2 reload

And that's it! One last detail... graphite ships with the Los Angeles timezone by default. In order to change it, we need to set the "TIME_ZONE" variable in the /opt/graphite/webapp/graphite/local_settings.py file. There is a file with lots of variables in /opt/graphite/webapp/graphite/local_settings.py.example, but as I just need to change the timezone, I run this command:

echo "TIME_ZONE = 'Europe/Madrid'" > /opt/graphite/webapp/graphite/local_settings.py

And with that we have everything. Now we just need to send data to carbon (port 2003) to be stored in whisper so the graphite webapp can show it. Have fun!
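By the way, a quick way to check that the whole chain works is to push a test value into carbon's plaintext port by hand (the metric name is made up, and this assumes netcat is installed):

echo "test.install.check 42 $(date +%s)" | nc -q0 localhost 2003

A minute or two later the new metric should appear in the webapp's tree under "test".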

Bibliography: I've followed the official documentation http://graphite.wikidot.com/installation and http://graphite.wikidot.com/quickstart-guide, but the Debian specifics came from http://slacklabs.be/2012/04/05/Installing-graphite-on-debian/

Troubleshooting pacemaker: Discarding cib_apply_diff message (xxx) from server2: not in our membership

Yesterday, when we rebooted one of our high availability servers (to update the goddamn vmware-tools, screwing my precious uptime), we faced a serious problem: pacemaker was not syncing, so we had lost high availability. The active node saw the cluster as if there were no problem:

[root@server2 ~]# crm status
============
Last updated: Wed Nov 7 12:36:01 2012
Last change: Tue Nov 6 18:33:15 2012 via crmd on server2
Stack: openais
Current DC: server2 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 2 expected votes
6 Resources configured.
============

Online: [ server2 server1 ]
(...)

The passive node, instead, saw the cluster as if all nodes, including itself, were offline:

[root@server1 ~]# crm status
============
Last updated: Wed Nov 7 12:36:27 2012
Last change: Wed Nov 7 12:35:57 2012 via cibadmin on server1
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
6 Resources configured.
============

OFFLINE: [ server2 server1 ]

Looking at the log files on the passive node we could see that startup was fine, but at a certain point there were errors like:

Nov 6 16:40:29 server1 cib[4607]: warning: cib_peer_callback: Discarding cib_replace message (776) from server2: not in our membership
Nov 6 16:40:29 server1 cib[4607]: warning: cib_peer_callback: Discarding cib_apply_diff message (777) from server2: not in our membership

On the active node there were no strange messages. Looking at corosync, we checked that the nodes were talking to each other with no problems. Both of them were returning the same message:

# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.200.ip=r(0) ip(10.10.10.10)
runtime.totem.pg.mrp.srp.members.200.join_count=1
runtime.totem.pg.mrp.srp.members.200.status=joined
runtime.totem.pg.mrp.srp.members.201.ip=r(0) ip(10.10.10.11)
runtime.totem.pg.mrp.srp.members.201.join_count=1
runtime.totem.pg.mrp.srp.members.201.status=joined

We used tcpdump listening on the corosync port and checked there was traffic (as was obvious, but at this point we doubted everything), so it was clear that the problem was in pacemaker, and also pretty clear that we had no idea what it was. On further investigation we found some links (for instance, this one: http://comments.gmane.org/gmane.linux.highavailability.pacemaker/13185) pointing at this problem as a bug, fixed with this commit https://github.com/ClusterLabs/pacemaker/commit/03f6105592281901cc10550b8ad19af4beb5f72f that entered in version 1.1.8 of pacemaker. And ours was 1.1.7. Crap.

To play it safe, instead of upgrading the existing machine we created a new one and installed pacemaker from scratch, following the instructions from http://www.clusterlabs.org/rpm-next/ and http://www.clusterlabs.org/wiki/Install. Basically we did:

yum install -y corosync corosynclib.x86_64 corosynclib-devel.x86_64
wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm-next/rhel-6/clusterlabs.repo
yum install -y pacemaker cman

We copied over our corosync.conf, adapting it (just changing the nodeid, see the sketch below), and started the new node, which joined the cluster without any problem.
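The only per-node difference is roughly this fragment of /etc/corosync/corosync.conf (the value here is illustrative, not our real one; with corosync 1.x the nodeid sits in the totem section):

totem {
        version: 2
        nodeid: 202   # the only value that changes from node to node
        (...)
}

And crm status confirmed the new node had joined: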

[root@server3 ~]# crm status

Last updated: Wed Nov 7 18:14:08 2012
Last change: Wed Nov 7 18:07:01 2012 via cibadmin on server3
Stack: openais
Current DC: server3 - partition with quorum
Version: 1.1.8-1.el6-394e906
3 Nodes configured, 3 expected votes
6 Resources configured.

Online: [ server3 server2 ]
OFFLINE: [ server1 ]
(...)

We did a smooth migration to the new node and everything went well. We upgraded the rest of the nodes and got our high availability back.
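The switchover itself can be done with the usual crm commands, run from a node that still has the crm shell available (the resource name below is made up for illustration; server3 is the new node):

# move the resource (and anything colocated with it) onto the new node
crm resource migrate IP_VIRTUAL server3

# once we are happy, drop the location constraint that migrate created
crm resource unmigrate IP_VIRTUAL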

But the new version has its inconveniences, because crm, the tool we use to configure pacemaker, is now distributed separately, per the maintainer's decision. It has become a project in itself under the name crmsh, with its own website: http://savannah.nongnu.org/projects/crmsh/. The compiled package is available at http://download.opensuse.org/repositories/network:/ha-clustering/ but it depends on the pssh package, which has dependencies of its own. Bottom line, we did this:


wget http://apt.sw.be/redhat/el6/en/i386/rpmforge/RPMS/pssh-2.0-1.el6.rf.noarch.rpm
rpm -Uvh pssh-2.0-1.el6.rf.noarch.rpm
yum -y install python-dateutil.noarch
yum -y install redhat-rpm-config
wget http://download.opensuse.org/repositories/network:/ha-clustering/CentOS_CentOS-6/x86_64/crmsh-1.2.5-55.3.x86_64.rpm
rpm -Uvh crmsh-1.2.5-55.3.x86_64.rpm

And with that we had everything online again.