Search This Blog

Sunday, June 8, 2008

Ever Wonder Why?

I've been doing this data warehouse thing for a few days now. I learned quite a bit during the process. Most of the more important learning came from the school of hard knocks. One of my favorite Data Warehouse Architecture lessons is around key management strategy. The great debate is; should we use Surrogate Keys or Natural Keys? Data Architects are an interesting breed. We will debate this topic with the kind of passion typically reserved for topics of significant social impact. Ever Wonder Why?

This is a simple decision. Ask the following questions to understand the characteristics of your environment.

Question: Can the source applications 100% guarantee there will never be a change in their key management practices? If you cannot get a resounding yes you DO NOT want to use the source application primary keys as your data warehouse architecture keys.
Question: Do you have only one source application that will be feeding data into the subject areas of your data warehouse? If you cannot get a resounding yes you DO NOT want to use the source application primary keys as your data warehouse architecture keys.
I get a lot of rhetoric around the value and promise of using "natural keys". I ask you, what is "natural" about the database management system keys used by any of your source applications? There is nothing natural about an order number that is generated from an Order Management system. It's a number, a code. Sometimes it is cooked up by an application developer as a means to try to ensure each order posted to the database will have a unique index address.
I also get heated opinions around assigning a surrogate key when the source application is already assigning surrogate keys. (Think Oracle Applications) If my concern is well behaved keys, well it's already a surrogate, why put a surrogate on top of a surrogate? Good point, but (behold the underlying truth) what should we do when the single instance of the source application simply cannot handle the workload of the entire enterprise and the source application team decides to clone the application for each major region? Same Code base, different physical instances each using Oracle Sequencers for IDs.
What about source data that is truly an event, once posted will never change, has a decent unique id and has extremely high volumes? Now this is a set of characteristics that make a case for a different PK management strategy reference pattern. I guess you can't ALWAYS say always or never.

Yes, Surrogate Key management requires some engineering. It is a valid and necessary component of your overall Data Warehouse Architecture. No, not in ALL cases will every table need to have a Surrogate key. There are exceptional reference patterns that will exist. Make sure you allow for that in your overall Data Warehouse Architecture.