My server reboots when it loses access to SAN disks!

You have your Linux boxes running an Oracle 10g RAC. Everything works perfectly, but suddenly one server reboots. You peek in the logfile and find this:

Sep 18 00:27:24 server1 kernel: SCSI error : <2 0 2 0> return code = 0x20000
Sep 18 00:27:24 server1 kernel: end_request: I/O error, dev sdae, sector 1672
Sep 18 00:27:24 server1 kernel: device-mapper: dm-multipath: Failing path 65:224.
Sep 18 00:34:14 server1 syslogd 1.4.1: restart.
Sep 18 00:34:14 server1 syslog: syslogd startup succeeded
Sep 18 00:34:14 server1 kernel: klogd 1.4.1, log source = /proc/kmsg started.

OK… the SAN disks failed and the server lost part of its disks… But this doesn’t seem to be a big deal; it shouldn’t have rebooted, should it? The operating system (root filesystem “/”) is mounted on a local disk. In fact, nothing uses the SAN disks but Oracle’s ocfs… The only thing that should have failed was Oracle and nothing more, right? Why was the whole machine rebooted?

It turns out that, long ago, when Oracle RAC found itself in this situation, it tried to pull the machine out of the cluster via “evict node”. But this didn’t work most of the time: the ocfs2 driver hung, and it often hung the whole cluster with it (every machine in the cluster). Hence the drastic solution… What’s the safest way to get a machine out of a cluster? You got it: rebooting the machine.

They could have made Oracle leave some messages in the logfile, stating that it was the one that rebooted the machine, so things would be clearer. But you can’t always get what you want.

So if you find your machine rebooting when it loses its SAN disks, don’t blame the machine, and don’t blame Oracle… get your SAN fixed so it won’t happen again.
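If you want to check whether a reboot matches this pattern, a small sketch like the following can help. It assumes the syslog format shown in the excerpt above (a "Failing path" line from dm-multipath, later followed by a syslogd restart); the function name is made up for illustration, and the patterns may need adjusting for your distribution.

```shell
# Sketch only: print every multipath "Failing path" line that is followed,
# later in the same log file, by a syslogd restart (i.e. a reboot).
# The patterns are taken from the log excerpt above and are an assumption
# about your syslog format.
failed_paths_before_restart() {
    awk '
        /dm-multipath: Failing path/ { pending[n++] = $0 }
        /syslogd .*restart/ {
            for (i = 0; i < n; i++) print pending[i]
            n = 0
        }
    ' "$1"
}
```

For example, `failed_paths_before_restart /var/log/messages` (the log path is an assumption) lists the path failures that preceded each restart and skips failures that were not followed by one.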

Checking a domain name expiration date: check_domain

We wouldn’t like our domain to expire and be bought by a cyberspeculator (also infamously known as “cybersquatters”) who asks us $1,000 for it when it’s actually worth 20 (I’ve lived through that situation with a personal domain name of my own).

It’s not a big deal; after all, registrars always warn users in advance, giving you every chance to renew (that’s in their interest). But… what if the email address you configured back then is no longer active? What if the boss’s new secretary mistakes the notice for spam and ignores it (a real case)? What if the company is so big that no one knows who reads that email address?

To make sure we stay on top of our domains, I’ve created a Nagios plugin named “check_domain”. It’s simple (if you look at the code, you’ll see there are more lines parsing parameters than doing things), but it covers our needs and warns you when a domain name is about to expire.

The code is listed below.

check_domain

#!/bin/bash

PROGPATH=$(echo "$0" | /bin/sed -e 's,[\/][^\/][^\/]*$,,')

. "$PROGPATH/utils.sh"

# Default values (days):
critical=7
warning=30

# Parse arguments
args=$(getopt -o hd:w:c:P: --long help,domain:,warning:,critical:,path: -u -n "$0" -- "$@")
[ $? != 0 ] && echo "$0: Could not parse arguments" && echo "Usage: $0 -h | -d <domain> [-c <critical>] [-w <warning>]" && exit $STATE_UNKNOWN
set -- $args

while true ; do
case "$1" in
-c|--critical) critical=$2;shift 2;;
-w|--warning) warning=$2;shift 2;;
-d|--domain) domain=$2;shift 2;;
-P|--path) whoispath=$2;shift 2;;
-h|--help) echo "check_domain - v1.01"
echo "Copyright (c) 2005 Tomàs Núñez Lirola under GPL License"
echo "This plugin checks the expiration date of a domain name."
echo ""
echo "Usage: $0 -h | -d <domain> [-c <critical>] [-w <warning>]"
echo "NOTE: -d must be specified"
echo ""
echo "Options:"
echo "-h"
echo " Print detailed help"
echo "-d"
echo " Domain name to check"
echo "-w"
echo " Days left until expiration to result in warning status"
echo "-c"
echo " Days left until expiration to result in critical status"
echo ""
echo "This plugin will use whois service to get the expiration date for the domain name. "
echo "Example:"
echo " $0 -d domain.tld -w 30 -c 10"
echo ""
exit;;
--) shift; break;;
*) echo "Internal error!" ; exit 1 ;;
esac
done

[ -z "$domain" ] && echo "UNKNOWN - There is no domain name to check" && exit $STATE_UNKNOWN

# Looking for whois binary
if [ -z "$whoispath" ]; then
type whois &> /dev/null || error="yes"
[ -n "$error" ] && echo "UNKNOWN - Unable to find whois binary in your path. Is it installed? Please specify path." && exit $STATE_UNKNOWN
whois=whois
else
[ ! -x "$whoispath/whois" ] && echo "UNKNOWN - Unable to find whois binary, you specified an incorrect path" && exit $STATE_UNKNOWN
# Use the binary given with -P instead of the one in PATH
whois="$whoispath/whois"
fi

# Calculate days until expiration
expiration=$("$whois" "$domain" | grep "Expiration Date:" | awk -F"Date:" '{print $2}' | cut -f 1)
expseconds=$(date +%s --date="$expiration")
nowseconds=$(date +%s)
((diffseconds=expseconds-nowseconds))
expdays=$((diffseconds/86400))

# Trigger alarms if applicable
[ -z "$expiration" ] && echo "UNKNOWN - Domain doesn't exist or no WHOIS server available." && exit $STATE_UNKNOWN
[ $expdays -lt 0 ] && echo "CRITICAL - Domain expired on $expiration" && exit $STATE_CRITICAL
[ $expdays -lt $critical ] && echo "CRITICAL - Domain will expire in $expdays days" && exit $STATE_CRITICAL
[ $expdays -lt $warning ] && echo "WARNING - Domain will expire in $expdays days" && exit $STATE_WARNING

# No alarms? Ok, everything is right.
echo "OK - Domain will expire in $expdays days"
exit $STATE_OK
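The two steps that do the real work (pulling the date out of the whois output and turning it into a number of days) can be tried on their own. The following is a minimal sketch, not part of the plugin: the function names are hypothetical, the list of labels is an assumption (some registries print "Registry Expiry Date:" instead of "Expiration Date:"), and it relies on GNU date, like the plugin itself.

```shell
# Sketch only: read whois output on stdin and print the first expiration date.
# The label list is an assumption; extend it for the registries you use.
extract_expiration() {
    grep -i -E -m 1 '(Expiration Date|Expiry Date):' \
        | sed -E 's/^[^:]*:[[:space:]]*//'
}

# Whole days from <now> (default: the current time) until <expiration>.
# Uses GNU date syntax and works in UTC to avoid daylight-saving surprises.
# Usage: days_until <expiration> [<now>]
days_until() {
    local exp_s now_s
    exp_s=$(date -u --date="$1" +%s) || return 1
    now_s=$(date -u --date="${2:-now}" +%s) || return 1
    echo $(( (exp_s - now_s) / 86400 ))
}
```

Something like `days_until "$(whois domain.tld | extract_expiration)"` then gives the number the plugin compares against the warning and critical thresholds.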

Executing Windows commands from your Linux box: winexe

When you see a stopped Windows service in your Nagios console, you would sometimes like to add an event_handler that tries to start the service automatically.

Samba has long had a feature for controlling services (net stop or net start), but I have never managed to get it to work.

There’s a useful tool: winexe. With it you can not only stop and start Windows services but also execute any shell command, and even get a Windows shell inside your Linux box, as simply as:

winexe -U HOME/Administrator%Pass123 //host cmd


It’s an open source project (free software) whose source code is published on the same website and which has had no modifications since 26/10/07. It surely hasn’t needed any, because it is fully functional, and I haven’t had any problems so far, beyond that bloody craze for using backslashes (\) everywhere, which forces us to escape characters every now and then…

Winexe turned out to be a useful complement as a Nagios event_handler tool.
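As a rough idea of what such an event handler could look like, here is a minimal sketch. The function name, the WINEXE and WIN_CREDS variables and the argument order are all assumptions made for illustration; only the winexe call itself follows the form shown above. It expects the values of the Nagios macros $SERVICESTATE$ and $SERVICESTATETYPE$, followed by the host and the Windows service name.

```shell
# Sketch only: start a Windows service through winexe from a Nagios event handler.
# WINEXE and WIN_CREDS are hypothetical variables; the credentials use the
# same DOMAIN/user%password format as the example above.
WINEXE=${WINEXE:-winexe}
WIN_CREDS=${WIN_CREDS:-'HOME/Administrator%Pass123'}

# Usage: restart_windows_service <state> <state_type> <host> <service>
restart_windows_service() {
    local state state_type host service
    state=$1
    state_type=$2
    host=$3
    service=$4
    # Act only once Nagios has confirmed the failure (HARD CRITICAL state),
    # so a single failed check does not trigger a restart.
    if [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]; then
        "$WINEXE" -U "$WIN_CREDS" "//$host" "net start \"$service\""
    fi
}
```

Saved as a script that ends with `restart_windows_service "$@"`, it could be called from a Nagios command definition that passes $SERVICESTATE$, $SERVICESTATETYPE$, $HOSTADDRESS$ and the service name as arguments.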

Getting system info from the command line: pstools

In the previous post, where we talked about winexe, we showed how to execute shell commands from our Linux console. Our first idea was to start and stop services (net start; net stop), but once we have a Windows shell we can go beyond that and do a lot more. To achieve that, we can use pstools.

With them we will feel like we were our windows console, because we can have ps (pslist), a kill (

Problem using pstools for the first time: /accepteula

I’ve found that when pstools are executed directly from the console, without a GUI (in my case, using winexe), they get stuck, as if hung. If you connect to the server via remote desktop and execute them by hand, a window shows up with an EULA, waiting for our “OK, I agree”.

If this happens to you on a single server there is no problem, but if you plan to use pstools on more than 50 servers, it’s a PITA to go through every server and every tool graphically, one by one, to accept the EULA.

Neither the tools’ “help” nor the official documentation shows any way to get rid of this. But on a website (that I’ve lost, so I can’t credit it properly, sorry) I found an undocumented feature: /accepteula

Just add this parameter to your scripts, and accepting the EULA won’t give you any more problems. For instance:

winexe -U HOME/Administrator%Pass123 //host "d:\\scripts\\pstools\\pslist.exe /accepteula"

This way you can run a ps on any server without worrying about whether it’s the first time you use pslist there or not.
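For the 50-server case, the same call can be wrapped in a loop. This is only a sketch under a few assumptions: the hosts file (one server name per line), the function name and the WINEXE, WIN_CREDS and PSLIST variables are hypothetical, and the pslist path assumes the tools are installed in the same place on every server.

```shell
# Sketch only: run pslist with /accepteula on every host listed in a file.
# WINEXE, WIN_CREDS and PSLIST are hypothetical variables for illustration.
WINEXE=${WINEXE:-winexe}
WIN_CREDS=${WIN_CREDS:-'HOME/Administrator%Pass123'}
PSLIST=${PSLIST:-'d:\scripts\pstools\pslist.exe'}

# Usage: pslist_all <hosts_file>
pslist_all() {
    local hosts_file host
    hosts_file=$1
    while IFS= read -r host; do
        # Skip blank lines and comment lines.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # stdin comes from /dev/null so the remote command cannot swallow
        # the rest of the host list.
        "$WINEXE" -U "$WIN_CREDS" "//$host" "$PSLIST /accepteula" </dev/null
    done < "$hosts_file"
}
```

Calling `pslist_all hosts.txt` then runs pslist once per server, and the EULA window never gets a chance to block the script.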