Jeyanthi's Java Cafe: February 2010

Surrogate keys are the keys that serve as the join fields between the fact and dimension tables in a dimensional data warehouse. They are widely used in data warehouses, instead of merely referencing the operational natural keys, because of the following advantages:

Performance – a 4-byte integer usually makes a great surrogate key because it can represent 2³² distinct values (about 4.3 billion). This is compact and efficient compared to the bulky alphanumeric natural keys used by the operational systems. It translates into better performance through more efficient joins, smaller indices, and more fact rows per block.
Space saving – because surrogate keys are only 4 bytes long, they occupy less space than the bulky alphanumeric production keys. For example, if a date column occupies 8 bytes, replacing it with a 4-byte surrogate key saves 4 bytes per row. With 1 billion rows in the fact table, that is 4 × 1 billion bytes ≈ 3.73 GB of space saved!
Buffer from operational key management practices – operational systems sometimes reuse keys, which poses a problem in a data warehouse that stores historical data, because an old key may come to refer to a different entity. Using a surrogate key creates a buffer from potential problems like these.
Mapping to integrate disparate sources – systems developed independently in company divisions may not use the same keys, or they may use keys that conflict with data in the systems of other divisions. This situation may not cause problems when each division independently reports summary data, but it cannot be permitted in the data warehouse, where data is consolidated. Since data warehouses are populated from many different operational data sources, the surrogate key acts as an integration mechanism, and the ETL process uses a mapping table to uniquely identify the linked natural keys.
Handle unknown or not applicable conditions – there are cases where the data in the operational data source violates rules followed by the data warehouse. For example, the operational key could be null, which is a problem because these keys become foreign keys in the fact table, and those cannot be null. With surrogate keys, a dedicated dimension row (such as "Unknown" or "Not applicable") can stand in for such cases.
Track changes in dimension attribute values – though dimension attribute values are more static than the data in fact tables, they may still change. Using a surrogate key, we can generate a new key for each change and easily handle the versioning. This also makes it easier to change attributes when business requirements or rules change.
Easier implementation - Surrogate keys enable easier implementations of concurrency than natural keys. In the natural key case, if two users are concurrently working with the same information set, such as a customer row, and one of the users modifies the natural key value, then an update by the second user will fail because the customer code they are updating no longer exists in the database. In the surrogate key case, the update will process successfully because immutable ID values are used to identify the rows in the database, not mutable customer codes.
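
To make a few of these points concrete, here is a minimal in-memory sketch of how an ETL step might assign surrogate keys. It is only an illustration under stated assumptions: the class and field names (CustomerDimension, CustomerRow, customer_key), the source system names, and the convention of reserving key 0 for "Unknown" are all hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

# Assumed convention: key 0 is reserved for missing or null natural keys.
UNKNOWN_KEY = 0


@dataclass
class CustomerRow:
    customer_key: int            # surrogate key, never reused or changed
    source: str                  # originating operational system
    natural_key: Optional[str]   # operational key as delivered by the source
    city: str                    # a tracked dimension attribute
    valid_from: date
    is_current: bool = True


class CustomerDimension:
    """Hypothetical customer dimension with a natural-to-surrogate mapping."""

    def __init__(self) -> None:
        self._next_key = 1
        # The "Unknown" row lets fact rows avoid null foreign keys.
        self.rows: List[CustomerRow] = [
            CustomerRow(UNKNOWN_KEY, "n/a", None, "Unknown", date.min)
        ]
        # Mapping table: (source, natural key) -> current surrogate key.
        self._current: Dict[Tuple[str, str], int] = {}

    def lookup(self, source: str, natural_key: Optional[str]) -> int:
        """Return the current surrogate key, or UNKNOWN_KEY if none applies."""
        if natural_key is None:
            return UNKNOWN_KEY
        return self._current.get((source, natural_key), UNKNOWN_KEY)

    def upsert(self, source: str, natural_key: str, city: str, as_of: date) -> int:
        """Insert a new member, or a new version when a tracked attribute changes."""
        ident = (source, natural_key)
        existing = self._current.get(ident)
        if existing is not None:
            row = next(r for r in self.rows if r.customer_key == existing)
            if row.city == city:
                return existing          # nothing changed, keep the same key
            row.is_current = False       # retire the old version
        key = self._next_key
        self._next_key += 1
        self.rows.append(CustomerRow(key, source, natural_key, city, as_of))
        self._current[ident] = key
        return key


dim = CustomerDimension()
k1 = dim.upsert("sales", "C100", "Chennai", date(2010, 1, 1))
# The same code from another division is a different customer here.
k2 = dim.upsert("billing", "C100", "Mumbai", date(2010, 1, 1))
# An attribute change produces a new key; the old row is kept as history.
k3 = dim.upsert("sales", "C100", "Bangalore", date(2010, 2, 1))
print(k1, k2, k3, dim.lookup("sales", None))
```

Qualifying the natural key with its source system is one simple way to keep conflicting codes apart; a real mapping table might instead link several natural keys to one shared customer.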

A surrogate key may be meaningful or meaningless. A meaningful key stores information within the key itself. Meaningless keys are sometimes called system keys because they only have meaning to the system. The advantage of meaningful keys is that they can be constructed to cluster related data, which reduces retrieval times. The disadvantage is that they are inflexible and rely on a code structure that could break.
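
As a small illustration of the difference, the sketch below builds a meaningful key by encoding a calendar date as a YYYYMMDD integer, and a meaningless key from a plain counter. The function names are made up for this example, and the YYYYMMDD layout is just one common choice, not something the text above prescribes.

```python
from datetime import date
from itertools import count


def meaningful_date_key(d: date) -> int:
    """Encode the date inside the key itself, as YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day


# A meaningless (system) key: just the next number from a counter.
_sequence = count(start=1)


def meaningless_key() -> int:
    return next(_sequence)


feb_15 = meaningful_date_key(date(2010, 2, 15))
feb_16 = meaningful_date_key(date(2010, 2, 16))
print(feb_15, feb_16)        # keys sort in calendar order, so rows cluster by date
print(meaningless_key())     # carries no information beyond uniqueness
```

The meaningful key still fits in a 4-byte integer, but it depends on the code structure holding: anything that is not a real calendar date, such as an unknown date, needs a special value outside the scheme.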

A surrogate key can be generated by using the CREATE SEQUENCE command in databases that support sequences.
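
The sketch below shows the idea end to end using Python's built-in sqlite3 module. Note the substitution: SQLite has no CREATE SEQUENCE command, so an INTEGER PRIMARY KEY AUTOINCREMENT column stands in for the sequence here. The table names, column names, and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    -- Stand-in for a sequence: SQLite assigns the next integer itself.
    customer_key  INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_code TEXT NOT NULL,   -- natural key from the operational system
    name          TEXT NOT NULL
);
CREATE TABLE fact_sales (
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    amount       REAL NOT NULL
);
""")


def add_customer(code: str, name: str) -> int:
    """Insert a dimension row and return its generated surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_code, name) VALUES (?, ?)",
        (code, name),
    )
    return cur.lastrowid


asha = add_customer("CUST-IN-000123", "Asha")
ravi = add_customer("CUST-IN-000456", "Ravi")

# Fact rows carry only the compact integer key, not the alphanumeric code.
conn.executemany(
    "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
    [(asha, 100.0), (asha, 50.0), (ravi, 75.0)],
)

totals = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(totals)
```

In a database with real sequences, the generated key would come from the sequence's next value rather than from the auto-increment column, but the fact-to-dimension join looks the same.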
