

13 July 2011

Referential Integrity - Performance

All these constraints are well and good, but they sure don't come for free. All this checking must cost something in CPU time, but how much?
We all know that Java applications are easy to scale - as the number of users increases, just add more application servers and things will be fine. The database is a bit more tricky, so maybe offloading all this checking into the application tier makes sense? Well, maybe - until another application needs to access the database and doesn't do things the correct way.

Tom's advice on pushing all the data integrity checks into the database seemed good at the time, and he was confident they would be fast, but how can you be sure? Benchmark of course, and hopefully all the critics can be silenced with some hard evidence, at last!

Benchmarking Strategy

Lacking real world data and volumes to test with, we must improvise and attempt some scientific testing. If we use the same set of data for each test, and only change one thing in the table at a time, then we will have a fair test. As a control, we will always have a table with no constraints, and then can vary the constraints on a second table to see what the impact is. Our old friend SQL Trace will help us figure out the results.

There are three groups of tests:

  • Check Constraints
  • Unique Constraints
  • Foreign Keys

The first task is to generate some data to test with:

SQL11G> create table data_source as
          select level               id,
          'Somename '||level         name,
          sysdate - 10               sign_up_date,
          sysdate                    active_date,
          'someemail@betteratoracle' email,
          mod(level, 10)             status_id
          from dual connect by level <= 500000;

Now we have a table of 500,000 rows. In this table, we have two unique columns (id and name), a couple of dates, an email that passes our format test, and 10 different status_ids. Don't worry if you don't understand how this statement works, it is just an easy way to generate some data.
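
As a quick sanity check on the generated data (a hedged example - it simply confirms the properties described above):

SQL11G> select count(*)             total_rows,
               count(distinct name) distinct_names,
               count(distinct id)   distinct_ids,
               min(status_id)       min_status,
               max(status_id)       max_status
        from   data_source;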

Check Constraints

Check constraints are the easiest thing to test. Before starting, think about check constraints for a minute. All they do is look at the values being inserted, and see if the data passes a test. They don't write any data, or change the data in any way, so there should be no space overhead and no additional disk overhead. All they do is consume additional CPU cycles, so they are highly unlikely to lead to locks or contention in a multi-user environment.

The Cost of not NULL

Time for some actual tests at last! For the first test, let's create a table identical to the Data_Source table with no constraints, and another identical table with a single not null constraint present, and then insert all 500,000 rows one at a time:

SQL11G> create table null_test_no_constraints
        as 
        select * from data_source where 1 = 0;

SQL11G> create table null_test_with_constraints
        as 
        select * from data_source where 1 = 0;

SQL11G> alter table null_test_with_constraints modify name not null;

Now run the test, making sure to enable SQL Trace:

SQL11G> ALTER SESSION SET timed_statistics=TRUE;
SQL11G> ALTER SESSION SET max_dump_file_size=UNLIMITED;
SQL11G> ALTER SESSION SET tracefile_identifier='null_test';    
SQL11G> ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

SQL11G> begin
  for row in (select * from data_source) loop
    insert into null_test_no_constraints
    values (row.id, row.name, row.sign_up_date, row.active_date, row.email, row.status_id);

    insert into null_test_with_constraints
    values (row.id, row.name, row.sign_up_date, row.active_date, row.email, row.status_id);

  end loop;
  commit;
end;
/

SQL11G> ALTER SESSION SET EVENTS '10046 trace name context forever, level 1';
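
If you are not sure where the trace file ended up, on 11g you can ask the database for the current session's trace file location (the tracefile_identifier set above appears in the file name):

SQL11G> select value
        from   v$diag_info
        where  name = 'Default Trace File';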

After running the trace file through tkprof the results are:

INSERT INTO NULL_TEST_NO_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 500000      9.18      15.46       2157       4978     538323      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001      9.18      15.46       2157       4978     538323      500000


INSERT INTO NULL_TEST_WITH_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute 500000      9.01      12.00       2149       4915     537691      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001      9.03      12.00       2149       4915     537691      500000

The key stat to examine is the CPU used - with only about 0.1 of a CPU second between them, either approach performs as well as the other. Running this test many times will sometimes show the constrained table as faster, and other times the unconstrained one - it depends on what else the database machine was doing during each run. So the conclusion is:

Not NULL constraints add approximately zero overhead when inserting data.

Simple Check Constraints

Things are going to get repetitive now, repeating the same process for each of the check constraint tests in turn. In this case, let's check that the ID column is always greater than or equal to zero, and then run a similar test to the one above:

SQL11G> create table check_test_no_constraints
        as 
        select * from data_source where 1 = 0;

SQL11G> create table check_test_with_constraints
        as 
        select * from data_source where 1 = 0;

SQL11G>  alter table check_test_with_constraints add constraint check_test_id
           check (id >= 0);
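
As a quick aside before the timings: a value that fails the check is rejected immediately with ORA-02290 (check constraint violated), for example:

SQL11G> insert into check_test_with_constraints (id) values (-1);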

Again the key stat is the CPU used:

INSERT INTO CHECK_TEST_NO_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 500000      8.34      10.07          0       5656     545890      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001      8.34      10.07          0       5656     545890      500000    


INSERT INTO CHECK_TEST_WITH_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 500000     10.14      11.47          0       5658     545902      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001     10.14      11.47          0       5658     545902      500000

In this case, the constraint added a noticeable overhead - for 500,000 inserts, it required an additional 1.8 seconds of CPU time, or 0.0000036 seconds per row. While this is a measurable difference over 500,000 rows, it is unlikely this tiny overhead will significantly affect any application. Most importantly, no additional IO is required.

Simple check constraints add a small, but mostly negligible CPU overhead

Checks Between Columns

What about a check constraint that compares two columns, such as sign_up_date < active_date? Time to repeat the test again:

-- First create two new tables as before, then add a constraint to one of them
SQL11G> alter table check2_test_with_constraints add constraint check2_signup_active_ck
          check (sign_up_date < active_date);

The results indicate this test is about the same as a simple check constraint:

INSERT INTO CHECK2_TEST_NO_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 500000      8.34       9.39          0       5656     545890      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001      8.34       9.39          0       5656     545890      500000

INSERT INTO CHECK2_TEST_WITH_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute 500000     10.60      14.25          0       5657     545892      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001     10.62      14.25          0       5657     545892      500000

Check constraints that compare columns add a small, but mostly negligible CPU overhead

Regex Check Constraint

For the final check constraint test, let's try something more complex, like the email_regex example. Once again create two new tables, and then add the email check:

SQL11G> alter table check_rg_test_with_constraints add constraint email_regex_ck
          check(  regexp_like(email,'^[[:alpha:]]+@[[:alpha:]]+$'));
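
Before running the inserts, it is worth confirming that the generated email value actually passes the pattern (otherwise the benchmark would fail on the very first row):

SQL11G> select 'matches' result
        from   dual
        where  regexp_like('someemail@betteratoracle', '^[[:alpha:]]+@[[:alpha:]]+$');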

Finally the results show something worth mentioning:

INSERT INTO CHECK_RG_TEST_NO_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 500000      8.90       9.59          0       5653     545865      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001      8.90       9.59          0       5653     545865      500000


INSERT INTO CHECK_RG_TEST_WITH_CONSTRAINTS 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 500000     12.07      14.55          0       5654     545858      500000
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   500001     12.07      14.55          0       5654     545858      500000

Adding the email check to the column cost about 3 seconds of CPU time, increasing the CPU time by roughly a third. That may seem like a lot, but Oracle is still inserting 500,000 rows in only 12 seconds of CPU time. I would argue that few applications will even notice a difference.

Check constraints using regular expressions are more expensive than simple checks, but it is still unlikely a typical application will see an impact.

Check Constraint Conclusions

All these format checks are easy to do in any application tier, so off-loading them from the database could be done efficiently. As we have already established, this application will not be the last application that uses this database, so it is highly likely all the logic will need to be duplicated in several places which could lead to checks getting missed and the data becoming corrupted. Besides, if Oracle can insert 500,000 rows in under 12 seconds on my laptop with the most expensive constraint enabled, then your application can probably afford the overhead!

Unique Checks

Doing unique checks in the application tier is impossible. OK, maybe it is possible, with a lot of work, but it certainly won't perform well. Consider a web application that has a user table. In this application, user_name must be unique. Have a think about how you would enforce this unique constraint without using the database features.

Your first thought is probably something along the lines of:

For each username, I will quickly check if it exists, and if it is already there, I will tell the user to pick something else

Fine, that will work, if there is only ever one user. What if two users submit the username 'check_this' at almost the same time:

SESSION 1: Does user 'check_this' exist? No
SESSION 2: Does user 'check_this' exist? No
SESSION 1: Create user 'check_this' - success
SESSION 2: Create user 'check_this' - success

Oops! Session 1 did a lookup confirming that the user didn't exist, but before it could create the user, session 2 did the same check, finding no rows. Then both sessions successfully inserted the row, breaking the unique user_name rule.

Your next thought may be to create a lock to prevent two users being created at once. This is the only way an application can enforce this check, but it immediately creates a bottleneck in the application - only one user can be created at a time! So much for a scalable application tier, and this is a trivial example.

In this application, there are potentially millions of users - how do you even find out if the username already exists or not? You must use an index on the username column. As a unique index is what is used to enforce a unique constraint, letting the database do these checks is really the only option.
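
For completeness, here is roughly what the database-enforced version looks like, using a hypothetical users table. The second session to insert 'check_this' blocks until the first commits, and then receives ORA-00001 (unique constraint violated), which the application can simply report back to the user:

SQL11G> alter table users add constraint users_user_name_uk unique (user_name);

-- Session 1
SQL11G> insert into users (user_id, user_name) values (1, 'check_this');   -- succeeds

-- Session 2
SQL11G> insert into users (user_id, user_name) values (2, 'check_this');   -- waits, then fails with ORA-00001 once session 1 commits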

OK OK, I get it already, isn't this the section where we test the performance, not lecture on the whys of unique constraints?!

The point I am trying to make is that it doesn't matter what the performance of a unique constraint is - there is no other way to keep a unique value unique. Compared to check constraints, adding indexes to a table is expensive:

  • They use more disk space
  • They require additional IO to find the correct part of the index to check / modify
  • They require additional IO to add the new value to the index
  • IO is expensive compared to CPU, and the index can become a point of contention in a multi-user environment much more quickly

So how much does it cost? Back to our test case once again, this time the constrained table gets a unique index on name.
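
The setup follows the same pattern as the earlier tests - two empty copies of Data_Source (the table names below match those in the tkprof output), with the unique index added to one of them:

SQL11G> create table unq_test_no_index
        as 
        select * from data_source where 1 = 0;

SQL11G> create table unq_test_with_index
        as 
        select * from data_source where 1 = 0;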

SQL11G> create unique index unq_test_with_index_uk1 on unq_test_with_index (name);

As suspected, the results show that the unique index is more expensive than even the regex check constraint:

INSERT INTO UNQ_TEST_NO_INDEX 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query        current        rows
------- ------  -------- ---------- ---------- ----------     ----------  ----------
Parse        1      0.00       0.00          0          0              0           0
Execute 500000      9.06       9.84          0       5664         546788      500000
Fetch        0      0.00       0.00          0          0              0           0
------- ------  -------- ---------- ---------- ----------     ----------  ----------
total   500001      9.06       9.84          0       5664         546788      500000

INSERT INTO UNQ_TEST_WITH_INDEX 
VALUES
 (:B6 , :B5 , :B4 , :B3 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query        current        rows
------- ------  -------- ---------- ---------- ----------     ----------  ----------
Parse        1      0.00       0.00          0          0              0           0
Execute 500000     13.22      16.60        172      16373        2045341      500000
Fetch        0      0.00       0.00          0          0              0           0
------- ------  -------- ---------- ---------- ----------     ----------  ----------
total   500001     13.22      16.60        172      16373        2045341      500000

This time, CPU is not the entire story - notice how many more disk, query and current blocks were required, all to maintain the index.

Unique constraints are expensive as they require indexes, but if a business rule states something must be unique, the only option is to use a database constraint

Foreign Keys

Finally it is time to tackle foreign keys. Before diving into any tests, it is worth thinking about what a foreign key must do and think about how it should perform.

Consider the Products and Order_Items tables we designed for the online order portal.

The Products table has an indexed primary key on product_id - this is required to enforce the unique primary key and allow the details for a product to be retrieved efficiently.

It is also likely (but not essential) that the Order_Items table will have a non-unique index on product_id, allowing an efficient way of finding all the orders containing a specific product.

If there is a foreign key on Order_Items.Product_ID referencing Products.Product_ID, then for each row inserted into Order_Items, Oracle has to check the index on Products to see if the product exists. This sounds like a similar overhead to adding a unique index to a column in the Order_Items table - for each row inserted, the unique index has to be checked to see if the entry already exists.
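
As a rough sketch (the constraint and index names here are assumptions - the tables themselves come from the earlier design article), the relevant pieces look something like this:

-- Parent table: indexed primary key on product_id
SQL11G> alter table products add constraint products_pk primary key (product_id);

-- The foreign key whose cost we are about to measure
SQL11G> alter table order_items add constraint order_items_products_fk
          foreign key (product_id) references products (product_id);

-- Optional, but useful for finding all order items for a given product
SQL11G> create index order_items_product_id_idx on order_items (product_id);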

The Order_Items table also has a foreign key referencing the Orders table; for now, it will be disabled.

Let's test this theory out. First, fill the products table with some products:

SQL11G>  insert into products (product_id, sales_start_date, sales_end_date, product_name, price)
        select level, sysdate, null, 'Product_name_'||level, 9.99
        from dual 
        connect by level <= 1000;

Disable the foreign key constraint between Order_Items and Orders (to make this test simpler), and drop the primary key on the order_items table (so we are only testing the effect of the foreign key to Products):

SQL11G> alter table order_items disable constraint order_items_orders_fk;  

SQL11G> alter table order_items disable constraint order_items_pk;

SQL11G> drop index order_items_uk1;

To make this a scientific test, we need a control table, i.e. one that looks the same as the Order_Items table, but with no foreign key referencing Products:

SQL11G> create table order_items_no_fk
        as 
        select * from order_items where 1 = 0;

Now, run a test similar to the others:

SQL11G> ALTER SESSION SET timed_statistics=TRUE;

SQL11G> ALTER SESSION SET max_dump_file_size=UNLIMITED;

SQL11G> ALTER SESSION SET tracefile_identifier='simple_check'; 

SQL11G> ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

SQL11G> begin
  for row in (select id, product_id
              from (select id, mod(id, 999) product_id
                    from data_source) a
              where a.product_id > 0) loop
    insert into order_items (order_item_id, order_id, product_id)
    values (row.id, row.id, row.product_id);

    insert into order_items_no_fk (order_item_id, order_id, product_id)
    values (row.id, row.id, row.product_id);

  end loop;
  commit;
end;
/

SQL11G> ALTER SESSION SET EVENTS '10046 trace name context forever, level 1';

The results are interesting:

INSERT INTO ORDER_ITEMS_NO_FK (ORDER_ITEM_ID, ORDER_ID, PRODUCT_ID) 
VALUES (:B2 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 499500      8.73      10.34          1       1387     516876      499500
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   499501      8.73      10.34          1       1387     516876      499500


INSERT INTO ORDER_ITEMS (ORDER_ITEM_ID, ORDER_ID, PRODUCT_ID) 
VALUES (:B2 , :B2 , :B1 )


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 499500     10.77      11.94          0       1472    1517388      499500
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   499501     10.77      11.94          0       1472    1517388      499500

Performing the foreign key check cost about 2 CPU seconds - not far from the unique index - but that is not the whole story. The main performance concern is the increase in current gets: the inserts into the table with the foreign key constraint required about 1 million additional current mode gets, which are reads of the primary key index on the Products table to verify that each product exists. As the foreign key doesn't actually change the index on the Products table, resource-wise it is actually cheaper than a unique index.

Caching Effects

There are effectively two kinds of buffer gets - those satisfied from blocks already in the Oracle buffer cache, and those that first need to read the block from disk.

Consider the Products table. This table is small, and the primary key index on Product_ID will be very small. As users access the application, browsing products and adding them to shopping carts, the Products table will be queried frequently, and it will probably end up cached. Therefore, any time a row is inserted into Order_Items, it is very fast to check from the cache whether the Product_ID exists.

Now imagine a telephone billing system. It has 20M account records, and each day it must load all the phone calls made by those 20M customers. Ideally there will be a foreign key between the Phone_Calls and Accounts tables, ensuring that for each telephone call loaded, the account exists.

In this case, the index on the Accounts table may be too big to fit in the cache, so many of the lookups needed to check whether an account exists will have to read from disk, causing a noticeable impact on insert performance.
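
If you want a rough idea of how much of an index is currently in the buffer cache, you can count its cached blocks (a hedged sketch - ACCOUNTS_PK is an assumed index name, it needs access to v$bh and dba_objects, and the answer is only a point-in-time snapshot):

SQL11G> select count(distinct block#) cached_blocks
        from   v$bh
        where  objd = (select data_object_id
                       from   dba_objects
                       where  object_name = 'ACCOUNTS_PK')
        and    status != 'free';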

Foreign Key checks are cheaper than an index on a table, but the performance will vary depending on whether the referenced index can fit in the cache

Summary

Check constraints add almost no overhead, and can be used to do some pretty complex checking.

Unique constraints (and indexes in general) add significant overhead to a table. The size of this overhead depends on the number of indexes, their size and the data insert patterns. To ensure a value remains unique, there is no acceptable option other than a unique index in the database.

Foreign key constraints can have a noticeable impact on insert performance, but how much depends on the size of the referenced index and how much of it can fit in the cache. This is a similar problem to indexes (and unique constraints), but foreign keys will typically consume fewer resources than additional indexes on the table, as they don't make any changes to the referenced index.

For me, this is an interesting point - you often hear comments about the negative performance impact of adding foreign keys to tables, but less frequently hear concerns about the performance impact of adding extra indexes. The reason is probably that an index is seen as necessary for queries to run quickly enough, while foreign keys can be sacrificed.

If a table referenced by a foreign key is small, then it will likely be cached and is unlikely to be a problem. If the table is large then it might be a problem if the entire index cannot be cached.

If you need to insert many millions of records each day into a table that references other large tables, then I would start to worry about the overhead of foreign keys and do some thorough testing. For 99% of tables, foreign keys will not cause a noticeable issue and will ensure the data integrity is sound.

The final takeaway is this - you tend to write data once and read it many times. It is better to make sure the data is correct when it is written and suffer the small overhead, than to find out later that much of it is corrupt.

Other areas for investigation

  • What happens to insert /*+ append */ when there are foreign keys on the table (hint - it won't do direct path inserts)? A quick way to verify this is sketched below the list.
  • Bulk loading and creating foreign keys as NOVALIDATE.
  • How do constraints and foreign keys help the optimizer?
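
On that first point, a quick (hedged) way to check whether a direct-path insert really happened: after a genuine direct-path insert, the same session cannot query the table until it commits, so the select below raises ORA-12838. If the append hint was silently ignored - for example because of an enabled foreign key - the select runs normally:

-- Direct path insert into the table with no foreign keys
SQL11G> insert /*+ append */ into order_items_no_fk
        select * from order_items;

-- ORA-12838: cannot read/modify an object after modifying it in parallel
SQL11G> select count(*) from order_items_no_fk;

SQL11G> rollback;

-- Repeat against order_items, which still has its foreign key to products:
SQL11G> insert /*+ append */ into order_items
        select * from order_items_no_fk;

-- No error this time - the hint was ignored and a conventional insert was used
SQL11G> select count(*) from order_items;

SQL11G> rollback;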