What is SCD (Slowly Changing Dimension) and types in Data Warehouse
Slowly Changing Dimensions (SCD) are actual dimensions in data warehouse database and mainly used to maintaining or tracking different level of slowly changeable data from source.
Data warehouse is mainly focused to analyse the business statistics by reports which holds multiple subset information about the business such as products, locations, employees, customers, sales and so on. The data warehouse database consist of two objects such as Dimensions and Facts and those data actually coming from real time database (OLTP - OnLine Analytical Processing). Dimensions are hold textual description and Fact table contains measures of sales records. In order to view the historical report from data warehouse database we need to track the data changes in the dimension attributes. Say for example a customer has relocated his residential to many places, he has ordered the products from each location and now we have to analyse the sales metrics for the each location. So we need slowly changing dimension concept to identify the data changes from the source and we have six different options to track the changes.
Data warehouse is mainly focused to analyse the business statistics by reports which holds multiple subset information about the business such as products, locations, employees, customers, sales and so on. The data warehouse database consist of two objects such as Dimensions and Facts and those data actually coming from real time database (OLTP - OnLine Analytical Processing). Dimensions are hold textual description and Fact table contains measures of sales records. In order to view the historical report from data warehouse database we need to track the data changes in the dimension attributes. Say for example a customer has relocated his residential to many places, he has ordered the products from each location and now we have to analyse the sales metrics for the each location. So we need slowly changing dimension concept to identify the data changes from the source and we have six different options to track the changes.
SCD Type | Description |
---|---|
Type - 0 | Retaining Original Data |
Type - 1 | Overwriting on Existing Data |
Type - 2 | Adding new records on same table with flag |
Type - 3 | Adding new column on same table |
Type - 4 | Adding new records on new table (History Table) |
Type - 6 | Hybrid approach (Type 1 + Type 2 + Type 3) |
Each type is having some advantage and disadvantages.
SCD Type 0: Retaining Original
This dimension attributes never change and holds the static data as it was inserted at first time and the remaining attributes may be overwritten. For example, if we are handling customer data in our database and few data would not be changed after insertion such as Date of Birth and SSN number. This type mostly preferred for Date dimension attributes because the attributes value would not be modified after insertion.
SCD Type 1: Overwrite on existing
This type overwrites the old data with new data and this will not track the historical data in the dimension.
For Instance in customer dimension, the customer may change his residential from one place to another and the address information will be tracked based on the transferential.
For Instance in customer dimension, the customer may change his residential from one place to another and the address information will be tracked based on the transferential.
Old Data
Cust_ID | Customer Name | Address Line | City | Zip Code | State |
---|---|---|---|---|---|
1001 | BaskaranR | 122, East West Street | Chennai | 600001 | TamilNadu |
1002 | George | 13A, Church Circle Road | Chennai | 600051 | TamilNadu |
1xxx | --------- | -------- | ------ | ------ | -------- |
New Data
Cust_ID | Customer Name | Address Line | City | Zip Code | State |
---|---|---|---|---|---|
1001 | BaskaranR | 521, North South Street | Chennai | 600002 | TamilNadu |
1002 | George | 13A, Church Circle Road | Chennai | 600051 | TamilNadu |
1xxx | --------- | -------- | ------ | ------ | -------- |
In the above example, the customer has relocated from one place to another and the recent address has been overwritten on the existing records. But the negative thing is we cannot retrieve his previous address information from this scenario.
SCD Type 2: Adding new Records
This type is satisfying the previous minus that track the historical data by creating new entry on the same dimension table. The recent and previous data will be identified by flag or start date and end date fields on same dimension table.
Cust_ID | Customer Name | Address Line | City | Zip Code | State | Active |
---|---|---|---|---|---|---|
1001 | BaskaranR | 122, East West Street | Chennai | 600001 | TamilNadu | Y |
1001 | BaskaranR | 521, North South Street | Chennai | 600002 | TamilNadu | N |
1002 | George | 13A, Church Circle Road | Chennai | 600051 | TamilNadu | Y |
1xxx | --------- | -------- | ------ | ------ | -------- | Y |
From the above example, the customer recent address has been inserted as new record and previous address retained in the same table. Both records are identified by using active flag and the flag Y is indicating the customer current address and N is representing the previous address.
SCD Type 3: Adding new Fields
This type tracks the recent changes in separate field on same table and this type is preserving limited history of data.
Cust_ID | Customer Name | Current Address | Previous Address | City | Zip Code | State |
---|---|---|---|---|---|---|
1001 | BaskaranR | 122, East West Street | 521, North South Street | Chennai | 600001 | TamilNadu |
1002 | George | 13A, Church Circle Road | Chennai | 600051 | TamilNadu | |
1xxx | --------- | -------- | ------ | ------ | -------- |
Based on the above instance, the previous address has been moved from current address and the recent address placed on current address field. This type would track the most recent historical changes and we cannot retrieve few old records from this scenario.
SCD Type 4: Adding new Table (History Table)
This type of tracks having separate table to track the changes from source and usually this method called as history table. We can retrieve all the history changes from this table if required. This method is advantageous for high volatile or frequently used attributes in very large size dimension.
Customer Table
Cust_ID | Customer Name | Address Line | City | Zip Code | State |
---|---|---|---|---|---|
1001 | BaskaranR | 122, East West Street | Chennai | 600001 | TamilNadu |
1002 | George | 13A, Church Circle Road | Chennai | 600051 | TamilNadu |
1xxx | -------- | --------- | ------ | ----- | --------- |
Customer History Table
Cust_ID | Customer Name | Address Line | City | Zip Code | State | CreateDate |
---|---|---|---|---|---|---|
1001 | BaskaranR | 521, North South Street | Chennai | 600001 | TamilNadu | 25-JUN-2013 |
Based on the above example, the historical data will be tracked on separate table which holds recent and previous data identified by date field.
SCD Type 5: (Type 4 + Type 1)
This type is combination of type 1, type 4 methods and type 1 dimension table on top of the type 4 design of base and history dimension. The Type 1 dimension table is joined directly to the base dimension without having to join directly to the fact table. This will hold current state attributes such as current address for the customer. This constrains the ETL team to update this Type 1 table as opposed to the whole base table on any changes.
SCD Type 6: Hybrid Approach (Type 1 + Type 2 + Type 3)
It an another approach to handle the data changes from source which is hybrid of type 1, type 2 and type 3 methods.
Cust_ID | Customer Name | Address Line | City | Zip Code | State | EffectiveFrom | EffectiveTo | IsActive |
---|---|---|---|---|---|---|---|---|
1001 | BaskaranR | 122, East West Street | Chennai | 600001 | TamilNadu | 26-JUN-2013 | Y | |
1001 | BaskaranR | 521, North South Street | Chennai | 600001 | TamilNadu | 10-AUG-2009 | 25-JUN-2013 | N |
1002 | George | 13A, Church Circle Road | Chennai | 600051 | TamilNadu | 6-SEP-2009 | Y | |
1xxx | ------- | --------- | ------ | ------ | ---- | xx-xxx-xxxx | Y |
As shown in above example, the history data tracked in new row which is identified by start date and end date fields. Additionally we can have isActive flag field to identify the current record from the list. We can maintain a history of all the changes whilst simultaneously updating the current value on existing records.
Nice Blog.Thankyou for sharing.
ReplyDeleteDevOps Training
DevOps Online Training