My server reboots when it loses access to its SAN disks!

You have your Linux boxes with an Oracle 10g RAC. Everything works perfectly, but suddenly one server reboots. You peek at the logfile and find this:

Sep 18 00:27:24 server1 kernel: SCSI error : <2 0 2 0> return code = 0x20000
Sep 18 00:27:24 server1 kernel: end_request: I/O error, dev sdae, sector 1672
Sep 18 00:27:24 server1 kernel: device-mapper: dm-multipath: Failing path 65:224.
Sep 18 00:34:14 server1 syslogd 1.4.1: restart.
Sep 18 00:34:14 server1 syslog: syslogd startup succeeded
Sep 18 00:34:14 server1 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Ok… the SAN disks failed… the server lost part of its disks… But that doesn’t seem like a big deal; it shouldn’t have rebooted, should it? The operating system (root filesystem “/”) is mounted on a local disk. In fact, nothing is using the SAN disks except Oracle’s OCFS… The only thing that should have failed was Oracle, and nothing more, right? So why did the whole machine reboot?

It turns out that long ago, when Oracle RAC found itself in this situation, it tried to pull the machine out of the cluster by evicting the node. But this didn’t work most of the time: the ocfs2 driver hung, and it often took the whole cluster down with it (every machine in the cluster). So they went for a drastic solution… What’s the safest way to get a node out of a cluster? You got it: rebooting the machine.
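To get a feel for how long the node survives without its disks before it self-fences, here is a minimal sketch. It assumes the stock ocfs2 behavior, where the O2CB_HEARTBEAT_THRESHOLD value (normally set in /etc/sysconfig/o2cb) counts 2-second heartbeat iterations, and the node fences itself after (threshold - 1) * 2 seconds without a disk heartbeat; the variable names below just mirror that config file.

```shell
# Sketch, not the actual fencing code: estimate the self-fence window
# from the o2cb heartbeat threshold (assumption: 2-second heartbeat
# iterations, as in stock ocfs2 setups).
O2CB_HEARTBEAT_THRESHOLD=31   # the ocfs2 default value
# Node reboots itself after (threshold - 1) * 2 seconds without heartbeat.
FENCE_SECONDS=$(( (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 ))
echo "node self-fences after ${FENCE_SECONDS}s without SAN heartbeat"
```

With the default threshold of 31 that works out to about a minute, which matches the roughly seven-minute gap in the log above only loosely: the SCSI layer and multipath retry for a while before the heartbeat actually goes silent.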

They could have made Oracle leave a message in the logfile warning that it was the one that rebooted the machine, so things would be clearer. But you can’t always get what you want.

So if you find your machine rebooting when it loses its SAN disks, don’t blame the machine, and don’t blame Oracle… get your SAN fixed so it won’t happen again.

Tomàs
