Wednesday, June 01, 2011

Grid Infrastructure Startup Issues

We have a two-node RAC (Oracle 11gR2) configuration running on Linux. This is a test environment where a lot of testing happens. The second node of the cluster was down and refused to start. OHASD was up, but CRS, CSS, and EVMD were offline.

I started troubleshooting the issue in the following order:

1) I read the ohasd.log and found nothing of interest.

2) No file permission issues on the disks were reported.

3) The voting disks are located in ASM and were accessible from the second node.

4) Moving ahead, the following error was reported in the ocssd.log:

2011-04-06 21:13:57.781: [    CSSD][1111677248]clssnmvDHBValidateNCopy: node 1, tptrac1, has a disk HB, but no network HB, DHB has rcfg 183608157, wrtcnt, 41779628, LATS 3108751988, lastSeqNo 41779625, uniqueness 1294698909, timestamp 1302104569/3108688378
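A message like this is easy to miss in a busy log. As a minimal sketch, the helper below (the function name `no_network_hb` is made up for illustration) reads ocssd.log text on standard input and prints each node that is reported with a disk heartbeat but no network heartbeat. It assumes the message format matches the line above.

```shell
#!/bin/sh
# no_network_hb: hypothetical helper. Reads ocssd.log text on stdin and
# prints the unique node names that CSSD reports as having a disk
# heartbeat but no network heartbeat.
no_network_hb() {
    sed -n 's/.*clssnmvDHBValidateNCopy: node [0-9]*, \([^,]*\), has a disk HB, but no network HB.*/\1/p' | sort -u
}

# Usage (the log location depends on the installation and is not given here):
#   no_network_hb < /path/to/ocssd.log
```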

The error message clearly says there is no network heartbeat between the two nodes. Indeed, a ping to the private IP address failed. On node1, “eth1” had no IPv4 address (note the missing “inet addr” line), as shown below:

[root@tptrac1 ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr F4:CE:46:84:F7:CA
          inet6 addr: fe80::f6ce:46ff:fe84:f7ca/64 Scope:Link
          RX packets:3244634 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6800251 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2828581056 (2.6 GiB)  TX bytes:1669807875 (1.5 GiB)
          Interrupt:162 Memory:f6000000-f6012800

[root@tptrac1 ~]#
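The same check can be scripted. This is only a sketch: `has_ipv4` is a hypothetical helper that reads ifconfig output on standard input and succeeds only when an IPv4 address line is present. An "inet6" line alone, as in the output above, does not count.

```shell
#!/bin/sh
# has_ipv4: hypothetical helper. Reads ifconfig output on stdin and returns
# success only if an IPv4 address line is present. Older net-tools print
# "inet addr:a.b.c.d", newer versions print "inet a.b.c.d".
has_ipv4() {
    grep -Eq '^[[:space:]]*inet (addr:)?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# Usage on a live system (eth1 is the private interconnect in this post):
#   ifconfig eth1 | has_ipv4 || echo "eth1 has no IPv4 address"
```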

I reassigned the private IP address as shown below:

ifconfig eth1 netmask up

I was able to ping after setting the private IP address on Node1. It was now time to stop and start the cluster.

-bash-3.2$ ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Well, the cluster happily came up without reporting any errors.

I could have saved all that troubleshooting time if I had checked node reachability first. Anyway, it was a good troubleshooting exercise, and I also got something to share on my blog.
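For next time, a reachability pre-check could look something like the sketch below. `check_nodes` and the `PROBE` variable are made-up names for illustration, and the default probe assumes the Linux iputils `ping` flags (`-c` for count, `-W` for timeout in seconds).

```shell
#!/bin/sh
# check_nodes: hypothetical pre-check. Probes each address once and reports
# the ones that do not answer. PROBE defaults to a single ping with a
# 2 second timeout; override it to use a different probe command.
PROBE=${PROBE:-"ping -c 1 -W 2"}

check_nodes() {
    failed=0
    for addr in "$@"; do
        # $PROBE is intentionally unquoted so its flags split into words.
        if $PROBE "$addr" >/dev/null 2>&1; then
            echo "OK   $addr"
        else
            echo "FAIL $addr"
            failed=1
        fi
    done
    return $failed
}

# Example (the host names are placeholders, not taken from this cluster):
#   check_nodes node1-priv node2-priv || echo "fix the interconnect first"
```

Running this against the private addresses of both nodes before opening any Clusterware log would have pointed at the interconnect straight away.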
