
Friday, August 12, 2011

Confused SRVCTL in Oracle 11g R2

OS: RHEL 5.5 64-bit

DB: Oracle Database 11gR2

Grid: Oracle Clusterware 11gR2

A database “adverten” is running on a Linux (64-bit) box. It happens to be a two-node Oracle RAC database, with instance names “adverten1” and “adverten2” on the two nodes respectively.

I ran the RACcheck tool, but unfortunately it does not list this database, even though the other databases running on this server are listed.

Below is the excerpt from “raccheck” tool:

List of running databases registered in OCR

1. clonedb

2. frisdb

3. All

4. None

Believe me, the instance is there and running.

 
[oracle@node1 raccheck]$ ps -ef|grep pmon
grid      8585     1  0 Aug06 ?        00:00:17 asm_pmon_+ASM1
oracle   12351  8299  0 22:16 pts/7    00:00:00 grep pmon
oracle   21959     1  0 Aug09 ?        00:00:31 ora_pmon_clonedb1
oracle   24222     1  0 Aug09 ?        00:00:19 ora_pmon_adverten1
oracle   29006     1  0 Aug07 ?        00:00:32 ora_pmon_frisdb1
[oracle@node1 raccheck]$

It’s time to debug the “raccheck” tool.

The “is_rdbms_installed_crs ()” procedure of the “raccheck” script is of interest here; below are the statements from which it actually fetches the database names.

It grabs the list of databases using the following:

       $CRS/bin/srvctl config database > $db_list_fil

And checks for database status using the following:

 
crs_db_status=$($CRS/bin/srvctl status database -d $db_list|grep -i $localnode|grep -ic "is running")
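
Putting the two together, the relevant part of “is_rdbms_installed_crs ()” boils down to something like the following simplified sketch of the logic (not the exact raccheck code):

# for every database registered in OCR, keep it only if srvctl reports
# at least one of its instances as running on the local node
$CRS/bin/srvctl config database > $db_list_fil
while read db_list
do
    crs_db_status=$($CRS/bin/srvctl status database -d $db_list | grep -i $localnode | grep -ic "is running")
    [ "$crs_db_status" -ge 1 ] && echo "$db_list"
done < $db_list_fil

A database therefore only makes it into the menu when the “is running” count for the local node is non-zero.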

Okay, running the first command manually lists my database:

 
[grid@node1 ~]$ $CRS_HOME/bin/srvctl config database
advertence
clonedb
frisdb
[grid@node1 ~]$

If you notice, the database name listed above is “advertence” (a 10-character name), whereas the database name in V$DATABASE is the 8-character “adverten”, which the instance names are also built on.

 
SQL> select name from v$database;

NAME
---------
ADVERTEN

SQL>

However, the second command reports the instances as not running:

[grid@node1 ~]$ $CRS_HOME/bin/srvctl status database -d advertence
Instance adverten1 is not running on node node1
Instance adverten2 is not running on node node2
[grid@node1 ~]$

Of course, “adverten” is not a registered resource, so the following fails:

 
[grid@node1 ~]$ $CRS_HOME/bin/srvctl status database -d adverten
PRCD-1120 : The resource for database adverten could not be found.
PRCR-1001 : Resource ora.adverten.db does not exist
[grid@node1 ~]$

This is the reason “raccheck” is not able to list this database.

What might have happened is the question that needs to be answered.

At the time of database creation, “advertence” was keyed in as the database name, but Oracle created the database with the name trimmed to 8 characters (“adverten”), since DB_NAME is limited to 8 characters and the instance number is appended to it to form the instance names. For some reason, however, the database resource was registered with the cluster under the full name “advertence”.
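
A quick way to see the truncation on the affected database is to compare DB_NAME with DB_UNIQUE_NAME (a sketch; if the registration really used the full name, DB_UNIQUE_NAME may still read “advertence” while DB_NAME shows the truncated “adverten”):

# run against one of the adverten instances; DB_NAME is limited to 8 characters,
# while the OCR resource name ora.<db_unique_name>.db is derived from DB_UNIQUE_NAME
sqlplus -s / as sysdba <<'EOF'
show parameter db_name
show parameter db_unique_name
EOF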

Apparently, the “ora.advertence.db” resource registered with the cluster is in “OFFLINE” state.

 
[grid@node1 ~]$ crs_stat ora.advertence.db
NAME=ora.advertence.db
TYPE=ora.database.type
TARGET=OFFLINE
STATE=OFFLINE

[grid@node1 ~]$

It’s all messy here, but things are working.

I will have to create a Service Request with Oracle Support to resolve this issue for this client of mine.

Wednesday, June 01, 2011

Grid Infrastructure Startup Issues

We have a two-node RAC (Oracle 11gR2) configuration running in a Linux environment. This is a test environment where a lot of testing happens. The second node of the two-node configuration was down and refused to start. OHASD was up, but CRS, CSS, and EVMD were offline.

I started troubleshooting the issue in the following order:

1) Read ohasd.log and found nothing interesting.

2) No file permission issues on the disks were reported.

3) The voting disks are located in ASM and were accessible from the second node.

4) Moving ahead, the following error was reported in ocssd.log:

2011-04-06 21:13:57.781: [    CSSD][1111677248]clssnmvDHBValidateNCopy: node 1, tptrac1, has a disk HB, but no network HB, DHB has rcfg 183608157, wrtcnt, 41779628, LATS 3108751988, lastSeqNo 41779625, uniqueness 1294698909, timestamp 1302104569/3108688378

The error message clearly says node 1 has a disk heartbeat but no network heartbeat. Indeed, a ping over the private IP address failed. On node1, “eth1” had lost its IPv4 address, as shown below:

 
[root@tptrac1 ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr F4:CE:46:84:F7:CA
          inet6 addr: fe80::f6ce:46ff:fe84:f7ca/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3244634 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6800251 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2828581056 (2.6 GiB)  TX bytes:1669807875 (1.5 GiB)
          Interrupt:162 Memory:f6000000-f6012800

[root@tptrac1 ~]#

I reassigned the private IP address as shown below:

ifconfig eth1 10.28.177.1 netmask 255.255.255.128 up
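
Note that an ifconfig assignment does not survive a reboot; on RHEL the permanent setting lives in the interface configuration file, roughly as below (a sketch using the same address and netmask):

# /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.28.177.1
NETMASK=255.255.255.128
ONBOOT=yes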

I was able to ping after setting the private IP address on Node1. It was now time to stop and start the cluster.
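
The restart itself is the standard stop/start sequence, run as root on the node being restarted (a sketch of the commands rather than a capture from this system):

# as root, from the Grid Infrastructure home
$GRID_HOME/bin/crsctl stop crs
$GRID_HOME/bin/crsctl start crs
# then verify with crsctl check crs, as shown below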

-bash-3.2$ ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
-bash-3.2$

Well, the cluster happily came up without reporting any errors.

I would have saved all the troubleshooting time if I had checked node reachability in the first place. Anyway, it was a good troubleshooting exercise and I also got something to share on my blog.

Monday, June 28, 2010

Restoring Standby Controlfile results in ORA-00600

I received an ORA-00600 error while restoring the controlfile on a RAC standby database (physical standby). We had purposely disabled redo log shipping on the primary database, and as a result the standby database was lagging way behind production. Instead of applying all the archive logs, I took the shortest path of taking an SCN-based RMAN backup on the primary and applying it to the standby.
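
For reference, that kind of roll-forward relies on a standard incremental-from-SCN backup taken on the primary, something along these lines (a sketch; the SCN is obtained from the standby first, and the SCN value and backup destination below are placeholders):

# on the standby, note the SCN to roll forward from:
#   SQL> select current_scn from v$database;
# on the primary, take the incremental backup from that SCN:
rman target / <<'EOF'
BACKUP INCREMENTAL FROM SCN 1234567890 DATABASE FORMAT '/tmp/stby_roll_%U';
EOF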

While doing so, I took a backup of the controlfile from the primary database and shipped it to the standby site. I was able to restore the controlfile successfully, as shown below:


RMAN> restore controlfile from '/tmp/control01.ctl';

Starting restore at 08-JUN-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=1083 instance=qtprod1 devtype=DISK

channel ORA_DISK_1: copied control file copy
output filename=+DATADG/control01.ctl
output filename=+FRADG/control02.ctl
output filename=+DATADG/control03.ctl
Finished restore at 08-JUN-10

RMAN> exit

However, while mounting the database with this new controlfile, I was greeted with ORA-00600. Below is the error:


SQL> alter database mount standby database;
alter database mount standby database
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kccsga_update_ckpt_5], [1453], [0],
[62], [0], [], [], []

Searching for the above error in "Troubleshoot an ORA-600 or ORA-7445 Error Using the Error Lookup Tool" did not provide any information. :(

Looking back at the steps that led me to the ORA-00600 error, I found the solution to the problem: the controlfile should have been restored as a standby controlfile. After correcting the mistake, the database mounted happily.


RMAN> restore standby controlfile from '/tmp/control01.ctl';

Starting restore at 08-JUN-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=1082 instance=qtprod1 devtype=DISK

channel ORA_DISK_1: copied control file copy
output filename=+DATADG/control01.ctl
output filename=+FRADG/control02.ctl
output filename=+DATADG/control03.ctl
Finished restore at 08-JUN-10

RMAN> exit


SQL> alter database mount standby database;

Database altered.

After applying the RMAN backup, the standby database got in sync with the primary.
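
For completeness, applying such a backup on the standby generally amounts to cataloging the shipped backup pieces and running a NOREDO recovery before resuming managed recovery (a sketch; the path is a placeholder):

rman target / <<'EOF'
CATALOG START WITH '/tmp/stby_roll_' NOPROMPT;
RECOVER DATABASE NOREDO;
EOF
# then resume managed recovery from SQL*Plus:
#   alter database recover managed standby database disconnect from session;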

Tuesday, May 25, 2010

"node1:6200 already configured" while Installing Oracle Clusterware 10g

Recently while installing Oracle Clusterware 10g, I stumbled upon the following error:

"camrac1:6200 already configured"

This error occurred when Oracle Universal Installer was running the "Oracle Notification Server Configuration Assistant", and as usual OUI just stopped after displaying the error. If you come across this error, follow these guidelines:

[oracle@camrac1 bin]$ cd $ORA_CRS_HOME/bin
[oracle@camrac1 bin]$ racgons remove_config camrac1:6200
racgons: Existing key value on camrac1 = 6200.
racgons: camrac1:6200 removed from OCR.
[oracle@camrac1 bin]$ racgons remove_config camrac2:6200
racgons: Existing key value on camrac2 = 6200.
racgons: camrac2:6200 removed from OCR.
[oracle@camrac1 bin]$ 

After removing the configuration, retry the failed configuration assistant from OUI.
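
Once the configuration assistant completes successfully, ONS can be checked on each node (a sketch; onsctl lives in the same CRS home bin directory):

$ORA_CRS_HOME/bin/onsctl ping
# expected to report that ons is running on the node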

Saturday, July 28, 2007

How to find Cluster Interconnect IP address from Oracle Database

Hello,

The easiest way to find the cluster interconnect is to view the “hosts” file. The “hosts” file is located under:

UNIX .......... /etc

Windows ...... C:\WINDOWS\system32\drivers\etc

Following are the ways to find the cluster interconnect through Oracle database:

1) Query X$KSXPIA

The following query provides the interconnect IP address registered with Oracle database:

SQL> select IP_KSXPIA from x$ksxpia where PUB_KSXPIA = 'N';
IP_KSXPIA
----------------
192.168.10.11

This query should be run on all instances to find the private interconnect IP address used on their respective nodes.

2) Query GV$CLUSTER_INTERCONNECTS view

Querying GV$CLUSTER_INTERCONNECTS view lists the interconnect used by all the participating instances of the RAC database.

SQL> select INST_ID, IP_ADDRESS from GV$CLUSTER_INTERCONNECTS;

   INST_ID IP_ADDRESS
---------- ----------------
         1 192.168.10.11
         2 192.168.10.12

3) ASM and Database log

We can find out the cluster interconnect used by viewing the log file of both ASM instance and the database instance.

When the ASM instance is started it registers the private interconnect used. Following is the excerpt from the ASM alert log on node 1:

:
:
Cluster communication is configured to use the following interface(s) for this instance
192.168.10.11
:
:

Similarly, the database instance also registers the private interconnect with its instance and following message is recorded in the alert log:

:
:
Cluster communication is configured to use the following interface(s) for this instance
192.168.10.11
:
:
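
A quick way to pull this line straight out of either alert log is a simple grep (a sketch; the log file names are illustrative, adjust them to your environment):

grep -A1 "Cluster communication is configured" alert_+ASM1.log
grep -A1 "Cluster communication is configured" alert_orcl1.log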

4) CLUSTER_INTERCONNECTS initialization parameter

This parameter provides Oracle with information on the availability of additional cluster interconnects. It overrides the default interconnect settings at the OS level with a preferred one.

You can view the current value of this parameter by issuing the following in SQL*Plus, connected as SYS user:

SQL> show parameter cluster_interconnects
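
If you ever need to set it, for example to pin each instance to a specific private network, it is specified per instance (typically in the spfile) and takes effect after an instance restart. A sketch, where the instance names are illustrative and the addresses are the ones used in the examples above:

sqlplus -s / as sysdba <<'EOF'
alter system set cluster_interconnects='192.168.10.11' scope=spfile sid='racdb1';
alter system set cluster_interconnects='192.168.10.12' scope=spfile sid='racdb2';
EOF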

Thanks for reading :-)