Tuesday, June 21, 2011

Addressing Block corruption that is not part of any segment

Yesterday’s RMAN backup failed with ORA-19566 error reporting block corruption, here’s the error as reported by RMAN:

channel ORA_DISK_1: starting piece 1 at 20-JUN-11
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: =========================================================== 
RMAN-03009: failure of backup command on c1 channel at 20/06/2011 11:55:11
ORA-19566: exceeded limit of 0 corrupt blocks for file xxxxxxxxxx

The next logical step is to run DB Verify (dbv) utility to find out what blocks are corrupt. Dbv confirmed block corruption as shown below:

Corrupt block relative dba: 0x0346c3cf (file 113, block 229845)
Bad check value found during backing up datafile
Data in bad block -
type: 6 format: 2 rdba: 0x0346c3cf

:
:

DBVERIFY - Verification complete

Total Pages Marked Corrupt   : 1
Total Pages Influx           : 0
:
:

Below are the OS and database details:

-bash-3.2$ cat /etc/*-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
-bash-3.2$
-bash-3.2$ uname -r
2.6.18-194.el5
-bash-3.2$

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL>

Note that Block 229845 is corrupt in datafile 113. The next step in the process is to identify the object by querying DBA_EXTENTS view

SQL> select segment_name, segment_type, owner
  2  from dba_extents
  3  where file_id = 113
  4  and 229845 between block_id and block_id + blocks -1;

no rows selected

SQL>

Apparently, this block is not associated with any of the segments in the database. The next question to ask is where this corrupt block lies in the data file. Is it in between the allocated blocks after the HWM?

SQL> select max(block_id) from dba_extents where file_id = 113;

MAX(BLOCK_ID)
-------------
       182520

SQL>

So, the corrupt block is beyond the maximum allocated block. Think of the data file as shown below (blue=used data, white=empty blocks, red=corrupt block).

The easiest solution is to shrink the datafile below the corrupt block, this way we drop the corrupt block off the datafile.

Identify the datafile size:

SQL> select bytes/1024/1024 size_mb  from dba_data_files where file_id = 113;

   SIZE_MB
----------
      2048

Identify the location of the maximum block in the datafile by multiplying it with the blocksize (8K):

SQL>  select max(BLOCK_ID) * 8/1024 size_mb from dba_extents where file_id = 113;

   SIZE_MB
----------
 1425.9375

SQL>

Also, find out the location of the corrupt block in the datafile:

SQL> select 229845 * 8/1024 size_mb from dual;

   SIZE_MB
----------
1795.66406

SQL>

Resize the datafile beyond the maximum allocated data block and below the corrupt block:

SQL> alter database datafile 113 resize 1450m;

Database altered.

SQL>

After resizing the datafile run DBV utility to check for corrupt blocks:

DBVERIFY - Verification complete

Total Pages Examined         : 188160
Total Pages Processed (Data) : 62277
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 71630
Total Pages Failing   (Index): 0
Total Pages Processed (Lob)  : 9337
Total Pages Failing   (Lob)  : 0
Total Pages Processed (Other): 21867
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 23049
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Total Pages Encrypted        : 0
Highest block SCN            : 0 (0.0) 

There are no corrupt blocks in the datafile. Happy ending. 

But it wouldn’t have been this easy if the corrupt block were somewhere in between the allocated blocks. This solution will not be helpful in that case. Refer to following note from Oracle Support to address this issue:

How to Format Corrupted Block Not Part of Any Segment [ID 336133.1]

Wednesday, June 01, 2011

Grid Infrastructure Startup Issues

We have a two node RAC (Oracle 11gR2) configuration running on Linux environment. This is a test environment where lot of testing happens. The second node of the two node RAC configuration was down and it refused to start. The OHASD was up but CRS, CSS, and EVMD were offline.

I started troubleshooting the issue in the following order:

1) Reading the ohasd.log and found nothing worth interesting.

2) No file permission issues on the disks were reported.

3) Voting Disks are located in ASM and were also accessible to the second node.

4) Moving ahead, following error was reported in the ocssd.log

2011-04-06 21:13:57.781: [    CSSD][1111677248]clssnmvDHBValidateNCopy: node 1, tptrac1, has a disk HB, but no network HB, DHB has rcfg 183608157, wrtcnt, 41779628, LATS 3108751988, lastSeqNo 41779625, uniqueness 1294698909, timestamp 1302104569/3108688378

The error message clearly says there is no network heart beat between the two nodes. Indeed, it failed when I tried to ping using the private IP address. On node1, “eth1” was messed up as shown below:

 
[root@tptrac1 ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr F4:CE:46:84:F7:CA
          inet6 addr: fe80::f6ce:46ff:fe84:f7ca/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3244634 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6800251 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2828581056 (2.6 GiB)  TX bytes:1669807875 (1.5 GiB)
          Interrupt:162 Memory:f6000000-f6012800

[root@tptrac1 ~]#

I reassigned the private IP address as shown below:

ifconfig eth1 10.28.177.1 netmask 255.255.255.128 up

I was able to ping after setting the private IP address on Node1. It was now time to stop and start the cluster.

-bash-3.2$ ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
-bash-3.2$

Well, the cluster happily came up without reporting any errors.

I would have saved all the troubleshooting time if I had checked node reachability in the first place. Anyways, it was a good troubleshooting exercise and I also got something to share on my blog.