07 February 2012

WTF - The star like schema

Sometimes when plundering through legacy systems, you come across things that make just about zero sense. When you ask around, people say there must be a reason, some explanation for the madness, and often there is, but often there isn't either.

I recently came across such an example, and it sounds like something right out of the Daily WTF. The application had a table like the following:

create sequence 
t_master_seq
start with
100000000
increment by 1;

-- generate 500K random rows
create table t_master
as
select t_master_seq.nextval         pk,
       lpad('a', 400, 'a')          padding,
       dbms_random.string('U', 20)  email,
       dbms_random.string('U', 20)  surname,
       dbms_random.string('U', 20)  forename,
       dbms_random.string('U', 20)  ship_to_address_hash,
       dbms_random.string('U', 20)  invoice_to_address_hash
from dual
connect by level <= 500000;

-- quadruple the data, giving 4 rows of each key
begin
  for i in 1..2 loop
    insert into t_master 
    select t_master_seq.nextval,
           padding,
           email,
           surname,
           forename,
           ship_to_address_hash,
           invoice_to_address_hash
    from t_master;
  end loop;
end;
/

create unique index t_master_pk on t_master(pk);

This table holds transactions, and I have filled each key field (email, forename, surname etc) with some random data. Then I ensure there are 4 rows that contain each key field by duplicating each row and replacing the PK value with a new unique one. Each email will appear in the table in 4 different transactions. Same goes for the other fields too.

There is nothing strange about that, but then there is a child table for each key field, with an index on the key field, along with carrying down the primary key:

-- child table 1
create table t_email
as select pk, email
from t_master;

create unique index t_email_pk on t_email(pk);
create index t_email_idx1 on t_email(email);

-- child table 2
create table t_surname
as select pk, surname
from t_master;

create unique index t_surname_pk on t_surname(pk);
create index t_surname_idx1 on t_surname(surname);

-- child table 3
create table t_forename
as select pk, forename
from t_master;

create unique index t_forename_pk on t_forename(pk);
create index t_forename_idx1 on t_forename(forename);

-- child table 4
create table t_ship_to_address_hash
as select pk, ship_to_address_hash
from t_master;

create unique index t_ship_to_address_hash_pk on t_ship_to_address_hash(pk);
create index t_ship_to_address_hash_idx1 on t_ship_to_address_hash(ship_to_address_hash);

-- child table 5
create table t_invoice_to_address_hash
as select pk, invoice_to_address_hash
from t_master;

create unique index t_invoice_to_address_hash_pk on t_invoice_to_address_hash(pk);
create index t_invoice_to_address_hash_idx1 on t_invoice_to_address_hash(invoice_to_address_hash);

Duplicate Data

At this point you may ask yourself, what is the point of all these child tables? All they do is hold pieces of the original table. The primary keys are there to prove uniqueness. Each row in each child table is just a piece of a corresponding row in the master table, with no additional information.

It turns out the data can be queried on any combination of the 5 key fields, so the problem is slightly tricky as the fields are not related. One index cannot efficiently solve the queries, and a bitmap index is out as the tables are updated in an OLTP fashion.

Without resorting to full text indexes, the best way to answer the application requirements is something like:

select *
  from t_master
  where email = :b1
union all
  select *
  t_master
  where forename = :b2
etc ...

Does it Perform Better this way?

Even with this tricky requirement, there is still no reason for these additional five tables. If you read this far, you are probably waiting for the exciting reason, so I resorted to asking the designer. It is a star schema, it was designed this way for performance, he told me

The problem is, is is not a star schema and the performance is worse than it should be, much worse. I don't think this would surprise many people, but it really hammers home the importance of benchmarking, and knowing how to prove one method is better than another. In this case, all we need is Autotrace to prove how much extra work this design causes.

Consider a query on just one of the fields, email. With the current design, to retrieve all the rows with a given email, the following query is required:

select m.*
from t_master m, t_email e
where e.email = 'BVVEDVZRCZHPHXEYDPHM'
and   e.pk = m.pk;

This takes 22 logical IOs, and on the real system, many of them will be from disk.

However, if an index was created on t_master.email, the query can be answered in 8 logical IOs.

select * 
from t_master m
where m.email = 'BVVEDVZRCZHPHXEYDPHM';

That is almost a third of the work, and things get worse. The cache becomes less efficient due to all this duplicated data and the extra 5 primary key indexes. That doesn't even address the overhead of maintaining the indexes, and the application logic to split those rows up into many and all the extra disk space required.

Seriously, tell me why is was designed this way!

Sorry, I cannot. There is no explanation I can find for the madness in this case, but I'm petty sure I can half the application response time with a few simple changes.

Better At Oracle

WTF - The star like schema

Duplicate Data

Does it Perform Better this way?

Seriously, tell me why is was designed this way!