Thursday, February 26, 2009

RAC and Linux

Hi, for many years I have been supporting the Linux and Oracle combination.

After my last few experiences I'm starting to think that the newest RedHat 5 and Oracle 10g require more attention and work, and that you can run into some very strange issues.

1. Bonding

A new card (a 4-port Intel NIC) was added to the server. All the new interfaces were configured in modprobe.conf, and then a new bonding configuration was created. So far so good. Restart the network interfaces and hurray, we have bonding interfaces.

Then a simple RAC reconfiguration, and we have ClusterWare using the bonding interface.

But unfortunately only until the server reboot.

After the reboot the order of the “ethX” interfaces had changed :(

and eth0 became eth4 and so on.

OK, let's change the configuration to match the new order.

Next restart and .... ?

Yet another order of interfaces!!!

Luckily it was not my task; anyway, my colleague found a solution on the RH website.

After that it was OK.

Bonding was working ...

but RAC ....

see next point



2. RAC

So we have bonding up and running. According to Metalink, we have to change the interfaces using oifcfg and then change the VIP configuration.
I stopped the cluster, made all the changes, restarted the cluster ...
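
For readers who have not done this before, here is a rough sketch of what such a change can look like with oifcfg and srvctl on 10g. The post does not list the exact commands that were used, so the interface names, subnets, node name and VIP address below are all hypothetical. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: interface names, subnets, node name and VIP are hypothetical.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

move_cluster_interfaces() {
    # Show what is currently registered in the cluster.
    run oifcfg getif
    # Register the bonded interfaces (hypothetical names and subnets).
    run oifcfg setif -global bond0/10.0.0.0:public
    run oifcfg setif -global bond1/192.168.10.0:cluster_interconnect
    # Remove the old single-NIC entries.
    run oifcfg delif -global eth0
    run oifcfg delif -global eth1
    # Point the VIP of one node at the bonded public interface.
    run srvctl modify nodeapps -n node1 -A 10.0.0.101/255.255.255.0/bond0
}

move_cluster_interfaces
```

Review the printed commands against the Metalink note for your version before running anything with DRY_RUN=0.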

Hurray, it is working ... yes ... until the servers reboot.

After the reboot only one node was up and running; on the second one ClusterWare didn't start.

A quick check of the logs, and in ocssd.log I found:

[ CSSD]2009-02-20 19:43:44.758 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)
[ CSSD] 2009-02-20 19:43:44.758 [1220598112] >TRACE: clssnmRcfgMgrThread: Local Join

WTF? Node 1 is up and running, so why a local join?

Another review and ...

[ CSSD] 2009-02-20 19:43:44.758 [1147169120] >TRACE: clsc_send_msg: (0x7271b0) NS err (12571, 12560), transport (530, 113, 0)
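
The trace lines quoted so far can be pulled out of a long ocssd.log with a small filter. This is only a sketch: the function name is made up for illustration, and the log path you pass in depends on your CRS home.

```shell
#!/bin/sh
# Sketch: print the CSSD trace lines that pointed to the problem in this post.
# scan_ocssd_log is a hypothetical helper; pass the path to your ocssd.log.
scan_ocssd_log() {
    grep -E 'clssnmRcfgMgrThread: Local Join|clsc_send_msg: .*NS err|node\([0-9]+\) is down' "$1"
}
```

Call it as scan_ocssd_log followed by the full path of the log file.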

My favourite error: no network. A quick check with ping shows that both interfaces, private and public, are up. So what is wrong with that box?
The problem is that bonding needs some time after the interfaces come up before it really starts working.
If ClusterWare starts BEFORE bonding is 100% operational, Oracle checks only once over the network whether there is another node, and then decides to start ClusterWare in a local configuration. But after that it realizes that the other node is already registered on the voting disk, so a local join is impossible.
BTW, in my opinion ClusterWare should be more flexible and check for the other node more than once.

The solution from Metalink is to add a sleep of a few seconds before ClusterWare is started by /etc/init.d/init.crs:

'start')
    CMD=`$BASENAME $0`

    # If we are being invoked by the user, perform manual startup.
    # If we are being invoked as an RC script, check for autostart.
    if [ "$CMD" = "init.crs" ]; then
        $LOGMSG "Oracle Cluster Ready Services starting by user request."
        $ID/init.cssd manualstart
    else
        $ID/init.cssd autostart
    fi
    ;;


Add a 'sleep' before '$ID/init.cssd autostart', for example a sleep of 20 seconds:

    else
        sleep 20
        $ID/init.cssd autostart
    fi
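
A fixed sleep is a guess about how long bonding needs. As an alternative, not from Metalink and not something I have tested on the cluster, one could wait until the bonding driver itself reports the bond as up. The sketch below assumes the bond is called bond0 and reads its status from /proc/net/bonding/bond0; the function name and the 60 second default timeout are made up for illustration.

```shell
#!/bin/sh
# Sketch: wait until the bonding driver reports the bond as up,
# instead of sleeping for a fixed time. This is NOT the Metalink fix.
# The bond name (bond0) and the default timeout are assumptions.
BOND_STATUS_FILE=${BOND_STATUS_FILE:-/proc/net/bonding/bond0}

wait_for_bond() {
    timeout=${1:-60}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # The first "MII Status" line in the file describes the bond itself;
        # the later ones describe the individual slave interfaces.
        if grep -m 1 '^MII Status:' "$BOND_STATUS_FILE" 2>/dev/null \
            | grep -q 'MII Status: up'; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Timed out: the bond never reported "up".
    return 1
}
```

In init.crs the call "wait_for_bond 60" would then go where the "sleep 20" is, just before "$ID/init.cssd autostart". Note that "MII Status: up" only means the link is up; it does not prove that the other node is reachable.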


To finish, my personal opinion: previous versions of Linux and Oracle were more stable.
I hope Linux and Oracle will not end up like MS products.

regards,
Marcin
