Tuesday, March 17, 2009

Strange Data Guard Issue

We have recently failed over one of our database to the DRC. At that time the production archivelog sequence was 80,000 plus. After staying there for couple of hours we recreated a physical standby database on the primary and switched back to original primary location. Everything went fine without any hoo-ha.

The failover and switchover took place on February 20, 2009. We have an automated job on all our standby databases to check whether the Standby is lagging behind the Primary, if so, then it automatically restores the missing archivelogs and applies them.

This job was working fine until day before yesterday (March 15, 2009) but started reporting errors thereafter. It failed with “RMAN-20242” error.

Starting restore at 15-MAR-09
released channel: ch12
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 03/15/2009 07:26:59
RMAN-06004: ORACLE error from recovery catalog database: 
RMAN-20242: specification does not match any archive log in the recovery catalog

When notified, I queried V$ARCHIVE_GAP and the output of this query took me to a surprise:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bi
PL/SQL Release 10.2.0.3.0 - Production
CORE    10.2.0.3.0      Production
TNS for IBM/AIX RISC System/6000: Version 10.2.0.3.0 - Production
NLSRTL Version 10.2.0.3.0 - Production

SQL>
SQL> select * from v$archive_gap;

   THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#
---------- ------------- --------------
         1           776          82468

SQL> 

Well, all of a sudden Oracle thinks that it is 81 thousand archive logs behind the primary. I tried to switching couple of archive logs on the production and the gap was still intact. I don't really know from where did Oracle took this information.

The automated job was trying to restore archives between 776 and 82468 and was kicked out by the recovery catalog database saying, there are no such archive logs (in fact there exist no such archivelogs).

I created a standby controlfile on the production database and replaced the controlfile on the standby database with the new one. The standby database seems to be happy with this version of controlfile and have stopped reporting any gap and the automated job is also leading a happy life.

SQL> select * from v$archive_gap;

no rows selected

SQL>