Sunday, May 17, 2009

Cardinality Analysis - A Review

After reading Michelle Deng’s “Tuning by Cardinality Feedback” paper on Jonathan's blog, I thought of re-creating the whole scenario to actually feel what is going on.

I did this test on Oracle 10g Release 2 (10.2.0.3) database with a block size of 16K. We first create the table and populate it in such a way that we match exactly with Michelle’s data.

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bi
PL/SQL Release 10.2.0.3.0 - Production
CORE    10.2.0.3.0      Production
TNS for IBM/AIX RISC System/6000: Version 10.2.0.3.0 - Productio
NLSRTL Version 10.2.0.3.0 - Production

Elapsed: 00:00:00.03
SQL>

SQL> show parameter db_block_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------
db_block_size                        integer     16384
SQL>

I performed couple of tests with different data distributions and here’s the first test case:

Test Case 1

Create and Populate Data

drop table prod_fctr_mv purge;

create table prod_fctr_mv(
 extract_mkt_ssk varchar2(5), 
 mkt_ssk     varchar2(20), 
 mkt_id      number(16), 
 hier_type    varchar2(20), 
 txt       varchar2(500));

insert into prod_fctr_mv 
 select '00006', '00006', 3, 'CORP', 
        rpad('*', 285, '*') 
   from dual 
  connect by level <= 42;


insert into prod_fctr_mv
  select
         lpad(mod(level, 100), 5, '0') extract_mkt_ssk ,
         lpad(mod(level, 73), 5, '0') mkt_ssk ,
         mod(level, 91) mkt_id ,
         decode(mod(level, 5), 0, 'CORP', 1, 'CALL', 2, 'CORP', 
                               3, 'WALL', 'CORP') hier_type ,
         rpad('*', 285, '*')
    from dual
   connect by level <= 4965 - 42 ;


update prod_fctr_mv set mkt_ssk = null 
 where mkt_ssk <> '00006' 
   and rownum <= 296;

update (select mkt_id 
          from prod_fctr_mv 
         where mkt_id <> 3) 
   set mkt_id = 3
 where rownum <= 72;

update (select extract_mkt_ssk  
          from prod_fctr_mv 
         where extract_mkt_ssk <> '00006'
           and mkt_id <> 3) 
   set extract_mkt_ssk = '00006' 
 where rownum <= 77;

select count(*) 
  from prod_fctr_mv
 where mkt_ssk  = '00006' ;

update (select mkt_ssk 
          from prod_fctr_mv 
         where mkt_ssk <> '00006' 
           and mkt_ssk is not null ) 
   set mkt_ssk = '00006' 
 where rownum <= 59;

update (select hier_type 
          from prod_fctr_mv 
         where hier_type = 'CORP' 
           and mkt_id <> 3
           and extract_mkt_ssk <> '00006')
   set hier_type = decode(mod(rownum, 2), 0, 'TALL', 'BALL') 
 where rownum <= 554;

commit;

Verify Data

Now that the data is populated, let’s see did we get on the right track.

SQL> exec dbms_stats.gather_table_stats(user, 'prod_fctr_mv');

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.42
SQL> 
SQL> select table_name, num_rows, blocks, empty_blocks, avg_space, 
  2         chain_cnt, avg_row_len
  3   from user_tables
  4  where table_name = 'PROD_FCTR_MV';

TABLE_NAME                       NUM_ROWS     BLOCKS EMPTY_BLOCKS  AVG_SPACE  CHAIN_CNT AVG_ROW_LEN
------------------------------ ---------- ---------- ------------ ---------- ---------- -----------
PROD_FCTR_MV                         4965        108            0          0          0         305

Elapsed: 00:00:00.01
SQL> 
SQL> 
SQL> select column_name, num_distinct, density, num_buckets, 
  2         num_nulls, histogram
  3    from user_tab_columns
  4   where table_name = 'PROD_FCTR_MV';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS  NUM_NULLS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------- ---------------
TXT                                       1          1           1          0 NONE
HIER_TYPE                                 5 .000100705           5          0 FREQUENCY
MKT_ID                                   91 .000100705          91          0 FREQUENCY
MKT_SSK                                  73 .000107089          73        296 FREQUENCY
EXTRACT_MKT_SSK                         100 .000100705         100          0 FREQUENCY

Elapsed: 00:00:00.07
SQL> 
SQL> 
SQL> select count(decode(mkt_id, 3, 1)) mkt_id,
  2         count(decode(extract_mkt_ssk, '00006', 1)) extract_mkt_ssk,
  3         count(decode(mkt_ssk, '00006', 1)) mkt_ssk,
  4         count(decode(hier_type, 'CORP', 1)) hier_type
  5    from prod_fctr_mv ;

    MKT_ID EXTRACT_MKT_SSK    MKT_SSK  HIER_TYPE
---------- --------------- ---------- ----------
       169             169        169       2441

Elapsed: 00:00:00.01
SQL> 
SQL> select count(*) 
  2    from prod_fctr_mv
  3   where mkt_id  = 3
  4     and extract_mkt_ssk = '00006'
  5     and mkt_ssk = '00006'
  6     and hier_type = 'CORP';

  COUNT(*)
----------
        42

Elapsed: 00:00:00.03
SQL> 
SQL>

Search an optimal value for OPTIMIZER_DYNAMIC_SAMPLING parameter

Yea, the total number of rows, the count for the key columns, count for the query in question, and the statistics, they all match to what was presented by Michelle.

The real fun starts here:

SQL> set autotrace traceonly exp
SQL> 
SQL> select count(*) 
  2    from prod_fctr_mv
  3   where mkt_id  = 3
  4     and extract_mkt_ssk = '00006'
  5     and mkt_ssk = '00006'
  6     and hier_type = 'CORP';
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    20 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    20 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |     1 |    20 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Delete the statistics and run the same query:

SQL> exec dbms_stats.delete_table_stats(user, 'prod_fctr_mv');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.39
SQL> 
SQL> select count(*) 
  2    from prod_fctr_mv
  3   where mkt_id  = 3
  4     and extract_mkt_ssk = '00006'
  5     and mkt_ssk = '00006'
  6     and hier_type = 'CORP';
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    41 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    41 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |    72 |  2952 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Note
-----
   - dynamic sampling used for this statement

SQL>

Optimizer has considered dynamic sampling but still the number of rows reported is now “72”, which is nearly twice than the actual number of rows. Let’s gradually increase the value of OPTIMIZER_DYNAMIC_SAMPLING initialization parameter and check the plan.

SQL> alter session set optimizer_dynamic_sampling=3;

Session altered.

Elapsed: 00:00:00.00
SQL> @t
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    41 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    41 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |    72 |  2952 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Note
-----
   - dynamic sampling used for this statement

SQL> alter session set optimizer_dynamic_sampling=4;

Session altered.

Elapsed: 00:00:00.00
SQL>  @t
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    41 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    41 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |    72 |  2952 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Note
-----
   - dynamic sampling used for this statement

SQL> alter session set optimizer_dynamic_sampling=5;

Session altered.

Elapsed: 00:00:00.00
SQL>   @t
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    41 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    41 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |    72 |  2952 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Note
-----
   - dynamic sampling used for this statement

SQL> alter session set optimizer_dynamic_sampling=6;

Session altered.

Elapsed: 00:00:00.01
SQL> @t
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    41 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    41 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |    42 |  1722 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Note
-----
   - dynamic sampling used for this statement

SQL>
SQL> set autotrace off
SQL>

Finally, when OPTIMIZER_DYNAMIC_SAMPLING parameter reached to a value of “6”, the plan is now reporting correct cardinality.

Test Case 2

Let’s run through the entire test again with a slight modification. Instead of inserting all the 42 rows together, we will insert them scattered in between.

Create and Populate Data

In this test case, we will insert one row of our interest after every 100 rows. So, here goes the script:

drop table prod_fctr_mv purge;

create table prod_fctr_mv(
 extract_mkt_ssk varchar2(5), 
 mkt_ssk     varchar2(20), 
 mkt_id      number(16), 
 hier_type    varchar2(20), 
 txt       varchar2(500));

begin
  for i in 1..42 loop
    insert into prod_fctr_mv 
     select '00006', '00006', 3, 'CORP', 
            rpad('*', 285, '*') 
       from dual 
      connect by level <= 1;

    insert into prod_fctr_mv
      select
            lpad(mod(level, 100), 5, '0') extract_mkt_ssk ,
            lpad(mod(level, 73), 5, '0') mkt_ssk ,
            mod(level, 91) mkt_id ,
            decode(mod(level, 5), 0, 'CORP', 1, 'CALL', 2, 'CORP', 
                                  3, 'WALL', 'CORP') hier_type ,
            rpad('*', 285, '*')
       from dual
      connect by level <= 100 ;
  end loop;

  commit;
end;
/

select count(*) 
  from prod_fctr_mv;

  insert into prod_fctr_mv
    select
          lpad(mod(level, 100), 5, '0') extract_mkt_ssk ,
          lpad(mod(level, 73), 5, '0') mkt_ssk ,
          mod(level, 91) mkt_id ,
          decode(mod(level, 5), 0, 'CORP', 1, 'CALL', 2, 'CORP', 
                                3, 'WALL', 'CORP') hier_type ,
          rpad('*', 285, '*')
     from dual
    connect by level <= 723 ;

update prod_fctr_mv set mkt_ssk = null 
 where mkt_ssk <> '00006' 
   and rownum <= 296;

update (select mkt_id 
          from prod_fctr_mv 
         where mkt_id <> 3
           and extract_mkt_ssk <> '00006') 
   set mkt_id = 3
 where rownum <= 35;

update (select extract_mkt_ssk  
          from prod_fctr_mv 
         where extract_mkt_ssk <> '00006'
           and mkt_id <> 3) 
   set extract_mkt_ssk = '00006' 
 where rownum <= 77;

update (select mkt_ssk 
          from prod_fctr_mv 
         where mkt_ssk <> '00006' 
           and mkt_ssk is not null ) 
   set mkt_ssk = '00006' 
 where rownum <= 33;

update (select hier_type 
          from prod_fctr_mv 
         where hier_type = 'CORP' 
           and mkt_id <> 3
           and extract_mkt_ssk <> '00006')
   set hier_type = decode(mod(rownum, 2), 0, 'TALL', 'BALL') 
 where rownum <= 554;

commit;

Verify Data

Let us run through the same queries to verify data population.

SQL> exec dbms_stats.gather_table_stats(user, 'prod_fctr_mv');

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.09
SQL> select table_name, num_rows, blocks, empty_blocks, avg_space, 
  2         chain_cnt, avg_row_len
  3   from user_tables
  4  where table_name = 'PROD_FCTR_MV';

TABLE_NAME                       NUM_ROWS     BLOCKS EMPTY_BLOCKS  AVG_SPACE  CHAIN_CNT AVG_ROW_LEN
------------------------------ ---------- ---------- ------------ ---------- ---------- -----------
PROD_FCTR_MV                         4965        108            0          0          0         305

Elapsed: 00:00:00.01
SQL> 
SQL> 
SQL> select column_name, num_distinct, density, num_buckets, 
  2         num_nulls, histogram
  3    from user_tab_columns
  4   where table_name = 'PROD_FCTR_MV';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS  NUM_NULLS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------- ---------------
TXT                                       1          1           1          0 NONE
HIER_TYPE                                 5 .000100705           5          0 FREQUENCY
MKT_ID                                   91 .000100705          91          0 FREQUENCY
MKT_SSK                                  73 .000107089          73        296 FREQUENCY
EXTRACT_MKT_SSK                         100 .000100705         100          0 FREQUENCY

Elapsed: 00:00:00.03
SQL> 
SQL> select count(decode(mkt_id, 3, 1)) mkt_id,
  2         count(decode(extract_mkt_ssk, '00006', 1)) extract_mkt_ssk,
  3         count(decode(mkt_ssk, '00006', 1)) mkt_ssk,
  4         count(decode(hier_type, 'CORP', 1)) hier_type
  5    from prod_fctr_mv ;

    MKT_ID EXTRACT_MKT_SSK    MKT_SSK  HIER_TYPE
---------- --------------- ---------- ----------
       169             169        169       2441

Elapsed: 00:00:00.01
SQL> 
SQL> select count(*) 
  2    from prod_fctr_mv
  3   where mkt_id  = 3
  4     and extract_mkt_ssk = '00006'
  5     and mkt_ssk = '00006'
  6     and hier_type = 'CORP';

  COUNT(*)
----------
        42

Elapsed: 00:00:00.01
SQL>

Search an optimal value for OPTIMIZER_DYNAMIC_SAMPLING parameter

We managed to get the number of records and other statistics same as that of the original document. Now, we will hunt for the optimal value of OPTIMIZER_DYNAMIC_SAMPLING parameter.

SQL> set autotrace traceonly exp
SQL> 
SQL> @t
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    20 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    20 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |     1 |    20 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

SQL> exec dbms_stats.delete_table_stats(user, 'prod_fctr_mv');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.25
SQL> 
SQL> @t
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 2899207340

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |              |     1 |    41 |    42   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |              |     1 |    41 |            |          |
|*  2 |   TABLE ACCESS FULL| PROD_FCTR_MV |    36 |  1476 |    42   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MKT_ID"=3 AND "EXTRACT_MKT_SSK"='00006' AND
              "MKT_SSK"='00006' AND "HIER_TYPE"='CORP')

Note
-----
   - dynamic sampling used for this statement

SQL>

We get a better cardinality reported by the Optimizer at the default level of dynamic sampling (OPTIMIZER_DYNAMIC_SAMPLING=2).

Conclusion

In the above test cases the Oracle optimizer started displaying correct number of rows when OPTIMIZER_DYNAMIC_SAMPLING was set to 6 with interested data being inserted together while the default setting for OPTIMIZER_DYNAMIC_SAMPLING worked when data was almost equally distributed.

Michelle’s data distribution might be such that the Optimizer was able report correct number of rows at OPTIMIZER_DYNAMIC_SAMPLING=4.

Lessons learned include:

1) Column correlation could really mislead Optimizer, and

2) Data distribution plays equally critical role.