Friday 5 October 2007

Datawarehousing Notes

Fact

A fact could contain raw data or aggregated data with quantifiable measures.
Measures are numerical values defined by the granularity of the fact.

The ‘grain’ represents the unit at which information is stored and defines the type of the fact.

(1) Transactional
Data is stored at the transaction level, typically with little manipulation from the source OLTP system except cleanup and assigning of keys.

(2) Periodic Snapshot
Data is a picture of what happened at a moment in time i.e. more rows are added as time passes.

(3) Accumulating Snapshot
Data record is amended over time e.g. for an hourly sales fact, the data record would be updated during the hour real time transactions were occurring.

Fact tables have many rows and fewer columns when compared to dimensions.



A fact is uniquely identified by a set of foreign keys (linking to dimension tables). The primary key of a fact may be some or all of the foreign keys.

Fact Examples
Sales Transactions, Website Visitor Sessions


Dimension

A dimension is a table representing a category by which we organize facts.
They provide labeling, grouping and filtering functionality.

It is made up of attributes, the combinations of which represent a unique level of data within a dimension.

Dimension tables should have many attributes and typically less rows than related fact tables.






Dimensions are independent of one another.
To query multiple dimensions, you join through fact tables.
A dimension used by multiple facts is known as a Conformed Dimension.
There are 3 types of dimension, defined by the way data updates are handled >

Type 1
Data changes replace the existing row. No record of the previous attribute values exist.
‘Address’ is a good example as storing old address details serves little use.

Type 2
Data changes are added to the table as a new row. Therefore the history of attribute values are stored. ‘Interest Rate’ is an example where the history is important.

Type 3
Data changes overwrite the attribute values, but the old values are stored in additional columns in the dimension. Only 1 level of history can be retrieved therefore.


Dimension Examples
Calendar, Customer List, Product List




Monster Dimensions

An extremely large Dimension needs to be divided for the sake of performance.
Examples of this are >
1) splitting a calendar dimension into separate Date and Time dimensions on account of it’s shear size.
2) splitting a customer dimension out into 2 dimensions for mostly static and changeable data respectively.


Junk Dimensions

These are a way of storing commonly used flags (y/n) and indicators that have a low number of possible values. It is a tidier solution to store them centrally, rather than create clutter i.e numerous small dimensional tables.


Nulls

Nulls have no place in a data warehouse.
To cope for null values in dimensions, implement an ‘Unknown Value’ row in dimensions and ensure foreign keys in facts point to it.
-1 is a suggested primary key value for the Unknown data row.


Keys

Natural Key
A Natural key is a meaningful value.
For example, a National Insurance (NI) number uniquely identifies employees in the UK

Surrogate Key
This is a sequential number assigned to records.
Primary and Foreign keys in data warehouse designs should use surrogate values.

No comments: