Changing nagios face: nagios nuvola style

Let’s be honest. Nagios has lots of good things, but it also has bad ones: it stores data in non-indexed text files, it executes a compiled CGI, its configuration files are uncomfortable when adding and removing machines (mainly when removing them)… and, above all, it’s ugly. Maybe ugly is not the word… it’s austere, simple, unattractive.

I don’t give a damn (and I’m pretty sure I’m not the only one), because it’s a tool and it does its job. I’m not here to enjoy looking at it; I need it to warn me when things go wrong and to explain why they are going wrong. But in this world, the people who make decisions and buy things frequently look at appearance, sometimes ahead of functionality. So if you are trying to convince someone to use nagios, you’ll have a better chance if it looks pretty.

This is where nagios nuvola style comes in: it gives nagios a very different, nicer look (in the full article you’ll find two screenshots so you can compare). It’s made by the same people who made nagiosql, and although it has a nagiosexchange page, the downloadable file there is corrupt (one css file, status.css, is bad, and it looks very different). I took it from this website, and you can download it here, too.

Installing it couldn’t be easier: you just copy the files into the “html” directory of your nagios installation. On Debian, for instance, it’s as easy as:

wget http://tomas.cat/blog/sites/default/files/nagios-nuvola-1.0.3.tar_.gz
mkdir nuvola
cd nuvola
tar zxvf ../nagios-nuvola-1.0.3.tar_.gz
cp -a html/* /usr/share/nagios3/htdocs/
cp -a html/stylesheets/* /etc/nagios3/stylesheets/.

And that’s it, you’ve got your better-looking nagios!
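Since the cp commands overwrite the stock pages and stylesheets, it may be worth archiving them first. This is only a minimal sketch: the backup_dir function and the /root/nagios-backup destination are my own assumptions, not part of nagios or of the nuvola package.

```shell
#!/bin/bash
# Hypothetical helper: archive a directory before overwriting its contents.
# Usage: backup_dir SRC_DIR DEST_DIR  (prints the path of the archive created)
backup_dir() {
    local src=$1 dest=$2 archive
    [ -d "$src" ] || { echo "No such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # Archive name: directory name plus a timestamp
    archive="$dest/$(basename "$src")-$(date +%Y%m%d%H%M%S).tar.gz"
    # -C keeps only the last path component inside the archive
    tar czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Example with the Debian paths used above (destination is an assumption):
# backup_dir /usr/share/nagios3/htdocs /root/nagios-backup
# backup_dir /etc/nagios3/stylesheets /root/nagios-backup
```

If the new look doesn’t convince anybody, unpacking those archives over the same directories should bring the old one back.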

That’s how “original” nagios looks:

That’s how it looks with its face changed:

First example of remote pstools execution: winps.sh

The next mission was writing scripts to execute pstools remotely. I started making one for each tool, but I found out there was a lot of shared code, so I decided to create a generic script, psexec.sh (in honor of pstools), that receives the server and the tool with its parameters. After that, we only need to create a wrapper for every command to make our life easier.

The script must check whether the credentials in the file are valid, and ask for others if they aren’t. Once authenticated, it must check whether pstools are installed, and copy them if they aren’t.

In the full story you can see the code of psexec.sh and an example wrapper, winps.sh. Keep in mind that they need some files, winvars.sh and cp_pstools.sh, in order to work properly, as we saw in the […]

rc=$?

if [ $rc -eq 0 ]; then
    echo "Default credentials"
elif [ $rc -eq 1 ]; then
    echo "Default credentials don't authenticate. Try others."
    read -p "Type username (DOMAIN/user): " user
    read -s -p "Type password: " pass
    PSCREDENTIALS="--user $user --password $pass"
    # Doubled backslashes, so the shell passes single ones on to Windows
    winexe //$1 "$TOOLS_UNIT:\\$TOOLS_DIR\\pstools\\pslist.exe /accepteula -t" $PSCREDENTIALS
    rc=$?
    if [ $rc -eq 1 ]; then
        echo "Credentials error or unit not available (check smbmount errors)" && exit 1
    elif [ $rc -eq 99 ]; then
        echo "There's no pstools, I'll copy them"
        $PROGPATH/cp_pstools.sh $1 $user $pass
        [ $? -ne 0 ] && echo "There's no pstools and I couldn't copy them." && exit 1
        # Same path cp_pstools.sh copies to (see winvars.sh)
        winexe //$1 "$TOOLS_UNIT:\\$TOOLS_DIR\\pstools\\pslist.exe /accepteula -t" $PSCREDENTIALS
        rc=$?
        if [ $rc -eq 99 ]; then
            echo "Pstools are copied, but they don't work, something's going on." && exit 1
        fi
    fi
elif [ $rc -eq 99 ]; then
    echo "There's no pstools, I'll copy them"
    $PROGPATH/cp_pstools.sh $1
    [ $? -ne 0 ] && echo "There's no pstools and I couldn't copy them." && exit 1
    winexe //$1 "$TOOLS_UNIT:\\$TOOLS_DIR\\pstools\\pslist.exe /accepteula -t" $PSCREDENTIALS
    rc=$?
    if [ $rc -eq 99 ]; then
        echo "Pstools are copied, but they don't work, something's going on." && exit 1
    fi
fi
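The excerpt above always calls pslist, while the idea was a generic script that receives the tool and its parameters. One way that might look is to build the remote command line from the arguments. This is only a sketch: build_remote_cmd is a name I made up, and the unit:\dir\pstools layout is assumed from winvars.sh and cp_pstools.sh.

```shell
#!/bin/bash
# Hypothetical helper: build the Windows command line for any pstools binary.
# Usage: build_remote_cmd UNIT DIR TOOL [ARGS...]
build_remote_cmd() {
    local unit=$1 dir=$2 tool=$3
    shift 3
    # In a printf format, each \\ becomes one literal backslash
    printf '%s:\\%s\\pstools\\%s.exe /accepteula' "$unit" "$dir" "$tool"
    # Append the tool's own parameters, if any, separated by spaces
    [ $# -gt 0 ] && printf ' %s' "$@"
    printf '\n'
}

# Possible use inside a generic psexec.sh (server in $1, tool in $2):
# server=$1; tool=$2; shift 2
# winexe //$server "$(build_remote_cmd "$TOOLS_UNIT" "$TOOLS_DIR" "$tool" "$@")" $PSCREDENTIALS
```

With the values from winvars.sh, `build_remote_cmd d tools pslist -t` prints `d:\tools\pstools\pslist.exe /accepteula -t`.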

Wrapper example: winps.sh

#!/bin/bash

[ $# -ne 1 ] && echo "Error, I need one and only one argument" && exit 1
# Directory this script lives in
PROGPATH=$(echo "$0" | /bin/sed -e 's,[\/][^\/][^\/]*$,,')
# Quoted so the backslash reaches psexec.sh intact
"$PROGPATH/psexec.sh" "$1" 'pstools\pslist' -t
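The sed expression used for PROGPATH strips the last component of the path the script was called with. The small comparison below (both function names are mine) shows what it does and how dirname behaves on the same input. The one difference I’m aware of is a script called with no directory at all, for instance when it’s found through $PATH.

```shell
#!/bin/bash
# Directory part of a path, computed the way the wrapper does it.
dir_with_sed() {
    echo "$1" | sed -e 's,[\/][^\/][^\/]*$,,'
}

# Same idea with dirname, which also copes with a bare file name.
dir_with_dirname() {
    dirname -- "$1"
}

dir_with_sed /usr/local/bin/winps.sh      # /usr/local/bin
dir_with_dirname /usr/local/bin/winps.sh  # /usr/local/bin
dir_with_sed winps.sh                     # winps.sh (nothing to strip)
dir_with_dirname winps.sh                 # .
```

In the bare-name case the sed version leaves PROGPATH set to the script name itself, so `$PROGPATH/psexec.sh` would not be found, while the dirname version would look in the current directory.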

Automatically copying pstools to the windows server from your linux

Continuing down the same path… What if we want to use pstools on 50 servers? One idea is to create a shared unit and make all servers execute pstools from there. But what if some servers are in one network and some in others (including a DMZ), some in a domain and some not…? Couldn’t there be an easy way to copy them?

With this purpose I’ve made this little script, which does exactly that: it copies pstools to the server we want. First it mounts a cifs unit (with smbmount), then it copies the files, and then it unmounts it.

I’ve made it to be called from other scripts. For instance, if we write a “winps”, we can make it check first whether pstools are installed, and copy them if they aren’t.

In the full article you can see the code and download the file.

winvars.sh

#!/bin/bash

TOOLS_DIR=tools
TOOLS_UNIT=d
CREDENTIALS="/home/user/secretfile"
SMBCREDENTIALS="credentials=$CREDENTIALS"
PSCREDENTIALS="-A $CREDENTIALS"

The difference between SMBCREDENTIALS and PSCREDENTIALS is the way smbmount and winexe accept them.
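Both variables point at the same file, and that file holds a password in clear text, so it’s probably wise to make it readable only by its owner. Here is a minimal sketch of creating it that way. The write_credentials function is hypothetical, and the domain/username/password layout is the one used for “secretfile” in these posts.

```shell
#!/bin/bash
# Hypothetical helper: write a credentials file readable only by its owner.
# Usage: write_credentials FILE DOMAIN USER PASSWORD
write_credentials() {
    local file=$1 domain=$2 user=$3 pass=$4
    (
        # Inside the subshell only: new files are created without group/other access
        umask 077
        printf 'domain=%s\nusername=%s\npassword=%s\n' "$domain" "$user" "$pass" > "$file"
    ) || return 1
    # Also tighten the mode if the file already existed
    chmod 600 "$file"
}

# Example (values are placeholders):
# write_credentials /home/user/secretfile YOURDOMAIN user pass
```

Keep in mind that passing the password as an argument leaves it in the shell history, so reading it with `read -s` first, as the scripts here do, may be preferable.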

cp_pstools.sh

#!/bin/bash

# We get the script path
PROGPATH=$(echo "$0" | /bin/sed -e 's,[\/][^\/][^\/]*$,,')
# We load our vars
. "$PROGPATH/winvars.sh"

PSTOOLS_SRC=/home/user/pstools/
RAND_DIR=$1-$RANDOM

[ $# -lt 1 ] && echo "Error, too few parameters" && echo "Use: $0 server [user password]" && exit 1
# Optional second and third arguments override the default credentials
[ ! -z "$2" ] && SMBCREDENTIALS="username=$2,password=$3"

mkdir /tmp/$RAND_DIR
# \$ keeps the trailing dollar of the administrative share (d$) literal
smbmount "//$1/$TOOLS_UNIT\$" /tmp/$RAND_DIR -o $SMBCREDENTIALS

if [ $? -eq 0 ]; then
    echo "Default credentials"
else
    echo "Default credentials don't authenticate. Try others."
    read -p "Type username (DOMAIN/user): " user
    read -s -p "Type password: " pass
    smbmount "//$1/$TOOLS_UNIT\$" /tmp/$RAND_DIR -o username=$user,password=$pass
    [ $? -ne 0 ] && echo "Credentials error or unit not available (check smbmount errors)" && rmdir /tmp/$RAND_DIR && exit 1
fi

# -p: don't fail if the tools directory already exists on the share
mkdir -p /tmp/$RAND_DIR/$TOOLS_DIR
cp -av $PSTOOLS_SRC /tmp/$RAND_DIR/$TOOLS_DIR/.
smbumount /tmp/$RAND_DIR

rmdir /tmp/$RAND_DIR
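The script builds its mount point from the server name plus $RANDOM and removes it by hand at every exit. An alternative that could be tried is letting mktemp pick the directory and wrapping the cleanup in a function. This is only a sketch, and with_mount_point is a name of my own.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a private temporary mount point
# and remove the directory afterwards, whatever the command returned.
# Usage: with_mount_point COMMAND [ARGS...]
# COMMAND receives the directory as its last argument.
with_mount_point() {
    local dir rc
    dir=$(mktemp -d "${TMPDIR:-/tmp}/pstools.XXXXXX") || return 1
    "$@" "$dir"
    rc=$?
    # rmdir only removes empty directories, so a share that is still
    # mounted (or files copied locally by mistake) are never wiped
    rmdir "$dir" 2>/dev/null
    return $rc
}

# Possible use: copy_to_share would mount the share on "$1", copy, and unmount
# copy_to_share() { ...; }
# with_mount_point copy_to_share
```

The function returns the status of the command it ran, so a caller like psexec.sh can keep testing `$?` the way it already does.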

Disconnecting windows remote desktop (terminal server) users

You are trying to connect via remote desktop (terminal server) to a server, but you find out there are too many people already connected. You get the damn message:

You can't connect!

What can we do? It’s easy. As we already have our brand new tool winexe, we can write a little script to make our lives easier:

#!/bin/bash

[ $# -lt 1 ] && echo "Error: Missing argument" && echo "Use: $0 server [disc #session]" && exit 1

[ ! -z "$2" ] && [ "$2" != disc ] && echo "Error: Can't understand second argument" && echo "Use: $0 server [disc #session]" && exit 1
[ "$2" == "disc" ] && echo "Disconnecting session $3 from server $1..." && winexe //$1 "logoff $3" -A secretfile && exit
echo "Listing server $1 sessions:"
winexe //$1 "query session" -A secretfile

The “secretfile” file is optional; it’s there in case you don’t want to type the user and password. Its contents are:

domain=YOURDOMAIN
username=user
password=pass

It’s a script with poor error checking, but it lets you see who is connected:

$ ts.sh server2
Listing server server2 sessions:
 SESSIONNAME       USERNAME          ID  STATE   TYPE    DEVICE
>                  user1              0  Disc    rdpwd
 rdp-tcp                          65536  Listen  rdpwd
                   Administrator      3  Disc    rdpwd
                   user2              1  Disc    rdpwd
 console                              5  Conn    wdcon
$

You can’t log in to this server: there are too many users. We can see everybody is “disconnected”, so nobody is actually working. We choose the user we like the least, and we kick him out:

$ ts.sh server2 disc 1
Disconnecting session 1 from server server2...
$ ts.sh server2
Listing server server2 sessions:
 SESSIONNAME       USERNAME          ID  STATE   TYPE    DEVICE
>                  user1              0  Disc    rdpwd
 rdp-tcp                          65536  Listen  rdpwd
                   Administrator      3  Disc    rdpwd
 console                              5  Conn    wdcon

Et voilà, we’ve just got a free session to connect and administer this server.

Obviously, it’s way better if everybody logs off when they finish working. But if you have to share your servers with absent-minded admins, you have to look after yourself…
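If the session list gets long, picking the disconnected ones by eye gets tedious. Here is a sketch of a filter that prints only the IDs of sessions in the “Disc” state. The disconnected_ids function is my own invention, and it assumes the output keeps the column order shown in the listing in this post.

```shell
#!/bin/bash
# Print the IDs of sessions whose state is "Disc", reading the listing from stdin.
# The ID is taken as the numeric field right before the state, because the
# SESSIONNAME and USERNAME columns may each be empty.
disconnected_ids() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "Disc" && $(i - 1) ~ /^[0-9]+$/) { print $(i - 1); break }
    }'
}

# Possible use with the script above:
# ts.sh server2 | disconnected_ids
```

On the first listing shown above this would print 0, 3 and 1, one per line. Deciding which of those to kick out is still up to you.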

My server reboots when it hasn’t access to SAN disks!

You have your Linux boxes with an Oracle 10g RAC. Everything works perfectly, but suddenly one server reboots. You peek at the logfile and find this:

Sep 18 00:27:24 server1 kernel: SCSI error : <2 0 2 0> return code = 0x20000
Sep 18 00:27:24 server1 kernel: end_request: I/O error, dev sdae, sector 1672
Sep 18 00:27:24 server1 kernel: device-mapper: dm-multipath: Failing path 65:224.
Sep 18 00:34:14 server1 syslogd 1.4.1: restart.
Sep 18 00:34:14 server1 syslog: syslogd startup succeeded
Sep 18 00:34:14 server1 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Ok… the SAN disks failed… the server has lost part of its disks… But this doesn’t seem to be a big deal; it shouldn’t have rebooted, should it? The operating system (root filesystem “/”) is mounted on a local disk. In fact, nothing uses the SAN disks but Oracle’s ocfs… The only thing that should have failed was Oracle, and nothing more, right? Why was the whole machine rebooted?

It turns out that, long ago, when Oracle RAC found itself in this situation, it tried to pull the machine out of the cluster via “evict node”. But most of the time this didn’t work: the ocfs2 driver hung, and often it hung the whole cluster (every machine in it). Hence the drastic solution… What’s the safest way to get out of a cluster? You got it: rebooting the machine.

They could have made Oracle leave some messages in the logfile saying that it was the one that rebooted the machine, so things would be clearer. But you can’t always get what you want.

So if you find your machine rebooting when it loses its SAN disks, don’t blame the machine, and don’t blame Oracle… get your SAN fixed so it won’t happen again.
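Since nothing in the log says who rebooted the machine, the best clue is the sequence itself: a multipath path failure followed by a syslogd restart. A rough sketch for spotting that pattern in a syslog file follows. The function name is made up, the patterns are taken from the log lines quoted above, and it’s only a heuristic, because syslogd can restart for other reasons too.

```shell
#!/bin/bash
# Report syslogd restarts that come after a dm-multipath path failure.
# Reads syslog lines from stdin; the timestamp is the first three fields.
reboots_after_path_failure() {
    awk '
        /dm-multipath: Failing path/ { failed = $1 " " $2 " " $3 }
        /syslogd .*restart/ && failed != "" {
            print "path failure at " failed ", syslogd restart at " $1 " " $2 " " $3
            failed = ""   # report each failure only once
        }
    '
}

# Possible use (the log path is an assumption and depends on the distribution):
# reboots_after_path_failure < /var/log/messages
```

For the log excerpt in this post it would report the path failure at Sep 18 00:27:24 and the syslogd restart at Sep 18 00:34:14.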