|
Home > Archive > MS SQL Data Warehousing > September 2005 > what is a data warehouse?
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
what is a data warehouse?
|
|
|
| I'm thinking I may have worked with a data warehouse before, but how can I
be sure? I am applying for jobs that ask for an understanding of multiple
data warehouses but the definition for data warehouse to me is unclear since
i'm not sure if i've seen one (or worked with one). Would it be the same as
an adhoc database (many views coming from different databases)?
If you answer, can you give me a basic example?
thanks.
| |
| Guilbert 2005-09-07, 8:23 pm |
| In simple terms a data warehouse consists of data that has been copied
from other data repositories (relational, flat files etc), and then
cleansed,
merged, summarized, or aggregated in some way.
The data is usually copied from the source on a regular ongoing basis,
daily, weekly, etc, so that gradually the target data builds up a history
(the sales of all products for the last year, broken down by week for
example).
This target data (the data warehouse) is then used by end-users to do
queries and reporting and other business intelligence so they can make
decisions on how to better run their company.
You may also hear the term data mart, and this is a small data warehouse,
usually a subset of the main data warehouse, based perhaps on one part
of the company.
Another term you may hear is OLAP cube. This is a type of data warehouse
where all the data is pre-aggregated (in other words the results of any
query
you may wish to do has already been done).
This allows you to get very fast query results, and you can use a product
like
Excel to look at your data in any way you wish (by year, product,
department,
country, price, colour etc).
So a data warehouse has many forms, but it is basically copying data from
something like an transaction (OLTP) database into a database that has been
built for query and reporting.
| |
| Ian Boyd 2005-09-12, 3:23 am |
| i asked this same question about a year ago. i kept reading about data
warehouses, but they would only use made up terms to describe data
warehouses. i had to way "in" to understand what they were talking about.
Someone then recommended "The Data Warehouse Toolkit" by Ralph Kimball. And
now i understand; and i recommend the same book.
Some of the concepts are so foreign and alien to someone who has only ever
done "normal" databases, that i wouldn't have understood it if i didn't see
his examples in the book.
The fundamental goal of a data warehouse is to arrange your data in a way
that normal non-technical end-users (i.e. managers) can peek into the
database themselves, and assemble whatever queries they want. The
fundamental goal is to let people "get at the data".
One major design difference between regular databases (your transactional
system), and a data warehouse database is denormalized "dimension" tables.
i'll give a limited example, that really hit me over the head with how
different data warehouse database design is from regular database design.
This example comes from the book (but the book is much more comprehensive).
Take a grocery store's database, that has a Transactions table. This table
holds all the items on customers receipts. In a regular database, you would
probably see a design like:
CREATE TABLE Transactions (
Date datetime,
ProductID int,
StoreID int,
TransactionNumber int,
UnitCost money
Quantity float,
Amount money)
The first change that is made when we copy this data into our warehouse, is
to make the "Date" column more useful. What if a manager wants to, group
sales by month, or by day of the week, or quarter. The only way you could do
would be to start assembling queries that involves clauses like:
GROUP BY DatePart('month', Date)
or
SELECT DatePart('dddd', Date)
or
SELECT DatePart('weekday', Date)
But DatePart("weekday") returns a number. If they want the report to
say"Monday", "Tuesday", ..., "Saturday", "Sunday" they'll have to do some
extra work. Besides, is 2 a Tuesday? Do the numbers start at 1 or zero? Is
Monday the first day of the week, or is it Sunday. Managers aren't going to
know these things. Plus, if you want to have SQL Server doing some grouping
based on parts of dates, the performance drops a lot.
The solution, is to change the date column in your transactions table into:
CREATE TABLE Transactions (
DateID int,
...)
And now you create a Date "dimension"
CREATE TABLE Dates (
DateID int,
Date datetime,
FullDateDescription varchar(200), --i.e. Saturday, February 14th, 1998
DayOfWeek varchar(50), --i.e. "Saturday"
DayNumberInEpoch int,
WeekNumberInEpoch int,
MonthNumberInEpoch int,
DayNumberInCalendarM
onth int, --i.e. 14
DayNumberInCalendarY
ear int, --i.e. 45
DayNumberInFiscalMon
th int,
DayNumberInFiscalYea
r int,
LastDayInWeekIndicat
or varchar(50), --("Last Day In Week", "Not Last
Day In Week")
LastDayInMonthIndica
tor varchar(50), --("Last Day In Month", "Not Last
Day In Month")
CalendarWeekEndingDa
te datetime, i.e. 2-15-1998
CalendarWeekNumberIn
Year int,
CalendarMonthName varchar(50), --i.e. "February"
CalendarYearMonth varchar(100), --i.e. "1997-02"
CalendarQuarter varchar(2), --i.e. "Q1"
CalendarYearQuarter varchar(10), --i.e. "1997-01"
CalendarHalfYear varchar(2), --"1H"
CalendarYear varchar(4), "1998"
FiscalWeek varchar(3), "F06"
FiscalWeekNumberInYe
ar int,
FiscalMonth varchar(20), --i.e. "February"
FiscalMonthNumberInY
ear int,
FiscalYearMonth varchar(50), --i.e. "F1998-06"
FiscalQuarter varchar(4), --i.e. "FQ01"
FiscalYearQuarter varchar(10), --i.e. "F1998-Q01"
FiscalHalfYear varchar(3), --"F1H"
FiscalYear int, --i.e. 1998
SellingSeason varchar(100), --("New Years", "Valentine's Day", "St.
Patrick's Day", "Easter", "Summer", "4th of July", "Memorial Day", "Labor
Day", "Back to School", "Halloween", "Thanksgiving", "President's Day",
"Christmas", "None")
MajorEvent varchar(100), (Similar to above, but also things like "Super
Bowl Sunday" or "Labor Strike")
HolidayIndicator varchar(50), --("Holiday", "Non-holiday")
WeekdayInsicator varchar(50), --("Weekday", "Weekend")
and more!
i'll add a few quotes from the book:
"we can cover the history we have stored, as well as several years in the
future. Even 10 years' worth of days is only about 3,650 rwos, which is a
relativly small dimension table"
"The holiday indicator takes on the values of Holiday or Nonholiday.
Remember that the dimension talb eattributes serve as report labels. Simply
populating the hiliday Insicator with a Y or N woudl be far less useful.
Imagine a report whre we're comparing hiliday sales for a given product
verses non-holiday sales. Obviously, it would be helpful if the columns has
meaningful values such as Holiday/Nonholiday verses a cryptic Y/N. Rather
than decoding crypting flags into understandable labels in a report
application, we prefer that the decode be stored in the database so that a
consitent value is available to all users regardless of their reporting
environment.
"Some designers pause at this point to ask why an explicit date dimension
talbe is needed. They reason that if the date key in the fact talbe is a
date-type field, then any SQL query can directly contrain on the fact table
date key and use natural SQL date semantics to filter on month or year while
avoiding a supposedly expensive join. This reasoning falls apart for several
reasons. First of all, if our relational database can't handle an efficient
join to the date dimension talbe, we're already in deep trouble. Most
database optimizers are quite efficient at resolving dimensional queries; it
is not necessary to avoid joins like the plague. Also, on the performance
fornt, most databases don't index SQL date calculations, so queries
contraining on an SQL-calculated field wouldn't take advantage of an index.
In terms of usability, the typical business users is not versed in SQL date
semantics, so he or she would be unable to directly leverage inherent
capabilities associated with a date data type. SQL date functions do not
support filtering by attributes such as weekdays verses weekends, holidays,
fiscal periods, seasons, or major events. Presuming that the business needs
to slice data by these nonstandard date attributes, then an explicit date
dimension table is essential. At the bottom line, calendar logic belongs in
a dimension table, not in the application code. Finally, we're going to
suggest that the date key is an integer rather than a date data type anyway.
An SQL-based date key typically is 8 bytes, so you're wasting 4 bytes in the
fact table for every date key in every row. "
He talks about how 10 years of dates is only a few thousand rows, so we can
afford the extra database space with all these denormalized fields. On the
other hand the "fact" table (in this example the Transactions table) is
millions or billions of rows. We want that table to be as narrow as
possible. One extra byte per row in a billion row table is an extra gigabyte
of database size.
Another example from the book, that helps to reinforce the point. When
talking about the "Product" dimension table:
Products
Product Key (Primary Key)
Product Description
SKU Number (Natural Key)
Brand Description
Category Description
Department Description (i.e. "Frozen Foods", "Pet Food", "Dairy", "Meat",
etc)
Package Type Description
Package Size
Fat Content
Diet Type
Weight
Weight Units Of Measure
Storage Type
Shelf Life Type
Shelf Width
Shelf Height
Shelf Depth
....and more
"A reasonable product dimension table would have 50 or more descriptive
attributes. Each attribute is a rich source for constraining and
constructing row headers."
In our normal database thinking, would would define a Departments table, and
put a key in the products table to departments. But we don't do that in a
warehoue. Imagine there are 150,000 products and only about 50 departments,
"Thus, on average, there are 3,000 repetitions of each unique value in
department attribute. This is all right! We do not need to separate these
repeated values into a second normalized table to save space. Remember that
dimension table space requirements pale in comparison with fact table space
considerations"
Data for the warehouse is usually extracted nightly from the live
transactional system, transformed into a format suitable for the warehouse,
and then loaded into the data warehouse; ready for use by managers come
morning. This process has a special name, ETL (Extract, Transform, Load).
You're already seen dimension tables; relativly short and very wide (large
number of columns). Dimension tables usually don't measure things that
"happen"; they usually measure things that are. (Customers, Patrons,
Airplanes, Products, Dates, Companies, Diseases, etc).
The central table, in this example the Transactions table, is known as a
"Fact" table. Facts are things that happen, or take place, or get measured
(items on a customer's bill, a line item on a doctor's bill, a boarding pass
to get on an airplace, a daily snapshot of inventory levels, a monthly
snapshot for each bank account).
To sum up:
A data warehouse is a specially designed database system, separate from the
"live" system, that is organized in such as way that makes it easy for
managers to get at the data.
So, have you ever used a data warehouse? It is not magically different from
any other database - only the way it is structured is special.
"Gen" <me@microsoft.com> wrote in message
news:%236Aewx%23sFHA
.3548@TK2MSFTNGP11.phx.gbl...
> I'm thinking I may have worked with a data warehouse before, but how can I
> be sure? I am applying for jobs that ask for an understanding of multiple
> data warehouses but the definition for data warehouse to me is unclear
> since i'm not sure if i've seen one (or worked with one). Would it be the
> same as an adhoc database (many views coming from different databases)?
>
> If you answer, can you give me a basic example?
>
> thanks.
>
| |
| Ian Boyd 2005-09-12, 3:23 am |
| Forgive the typos, i didn't check it before posting it.
| |
| Dave Wickert [MSFT] 2005-09-12, 3:23 am |
| There are actually two major approaches to data warehousing. Ralph Kimball's
dimensional modeling approach is one; the other is enterprise modeling. For
that, look at books by Bill Inmon and/or Claudia Imhoff, such as:
http://search.barnesandnoble.com/bo...isbn=0471399612
or
http://search.barnesandnoble.com/bo...isbn=0471081302
Both approaches are called "data warehouing" although they use *very*
different approaches, which is one of the causes of the misunderstandings
and challenges to understand the technology that you've mentioned below.
Before you fall in love with Kimball's dimensional modeling (which is also
my favorate), you should at least be aware of Inmon's and Imhoff's approach.
--
Dave Wickert [MSFT]
dwickert@online.microsoft.com
Program Manager
BI Systems Team
SQL BI Product Unit (Analysis Services)
--
This posting is provided "AS IS" with no warranties, and confers no rights.
"Ian Boyd" <admin@SWIFTPA.NET> wrote in message
news:eqvJgT0tFHA.904@tk2msftngp13.phx.gbl...
>i asked this same question about a year ago. i kept reading about data
>warehouses, but they would only use made up terms to describe data
>warehouses. i had to way "in" to understand what they were talking about.
>
> Someone then recommended "The Data Warehouse Toolkit" by Ralph Kimball.
> And now i understand; and i recommend the same book.
> Some of the concepts are so foreign and alien to someone who has only ever
> done "normal" databases, that i wouldn't have understood it if i didn't
> see his examples in the book.
>
> The fundamental goal of a data warehouse is to arrange your data in a way
> that normal non-technical end-users (i.e. managers) can peek into the
> database themselves, and assemble whatever queries they want. The
> fundamental goal is to let people "get at the data".
>
> One major design difference between regular databases (your transactional
> system), and a data warehouse database is denormalized "dimension" tables.
> i'll give a limited example, that really hit me over the head with how
> different data warehouse database design is from regular database design.
> This example comes from the book (but the book is much more
> comprehensive).
>
> Take a grocery store's database, that has a Transactions table. This table
> holds all the items on customers receipts. In a regular database, you
> would probably see a design like:
>
> CREATE TABLE Transactions (
> Date datetime,
> ProductID int,
> StoreID int,
> TransactionNumber int,
> UnitCost money
> Quantity float,
> Amount money)
>
> The first change that is made when we copy this data into our warehouse,
> is to make the "Date" column more useful. What if a manager wants to,
> group sales by month, or by day of the week, or quarter. The only way you
> could do would be to start assembling queries that involves clauses like:
>
> GROUP BY DatePart('month', Date)
> or
> SELECT DatePart('dddd', Date)
> or
> SELECT DatePart('weekday', Date)
>
> But DatePart("weekday") returns a number. If they want the report to
> say"Monday", "Tuesday", ..., "Saturday", "Sunday" they'll have to do some
> extra work. Besides, is 2 a Tuesday? Do the numbers start at 1 or zero? Is
> Monday the first day of the week, or is it Sunday. Managers aren't going
> to know these things. Plus, if you want to have SQL Server doing some
> grouping based on parts of dates, the performance drops a lot.
>
> The solution, is to change the date column in your transactions table
> into:
>
> CREATE TABLE Transactions (
> DateID int,
> ...)
>
>
> And now you create a Date "dimension"
>
> CREATE TABLE Dates (
> DateID int,
> Date datetime,
> FullDateDescription varchar(200), --i.e. Saturday, February 14th, 1998
> DayOfWeek varchar(50), --i.e. "Saturday"
> DayNumberInEpoch int,
> WeekNumberInEpoch int,
> MonthNumberInEpoch int,
> DayNumberInCalendarM
onth int, --i.e. 14
> DayNumberInCalendarY
ear int, --i.e. 45
> DayNumberInFiscalMon
th int,
> DayNumberInFiscalYea
r int,
> LastDayInWeekIndicat
or varchar(50), --("Last Day In Week", "Not Last
> Day In Week")
> LastDayInMonthIndica
tor varchar(50), --("Last Day In Month", "Not Last
> Day In Month")
> CalendarWeekEndingDa
te datetime, i.e. 2-15-1998
> CalendarWeekNumberIn
Year int,
> CalendarMonthName varchar(50), --i.e. "February"
> CalendarYearMonth varchar(100), --i.e. "1997-02"
> CalendarQuarter varchar(2), --i.e. "Q1"
> CalendarYearQuarter varchar(10), --i.e. "1997-01"
> CalendarHalfYear varchar(2), --"1H"
> CalendarYear varchar(4), "1998"
> FiscalWeek varchar(3), "F06"
> FiscalWeekNumberInYe
ar int,
> FiscalMonth varchar(20), --i.e. "February"
> FiscalMonthNumberInY
ear int,
> FiscalYearMonth varchar(50), --i.e. "F1998-06"
> FiscalQuarter varchar(4), --i.e. "FQ01"
> FiscalYearQuarter varchar(10), --i.e. "F1998-Q01"
> FiscalHalfYear varchar(3), --"F1H"
> FiscalYear int, --i.e. 1998
> SellingSeason varchar(100), --("New Years", "Valentine's Day", "St.
> Patrick's Day", "Easter", "Summer", "4th of July", "Memorial Day", "Labor
> Day", "Back to School", "Halloween", "Thanksgiving", "President's Day",
> "Christmas", "None")
> MajorEvent varchar(100), (Similar to above, but also things like "Super
> Bowl Sunday" or "Labor Strike")
> HolidayIndicator varchar(50), --("Holiday", "Non-holiday")
> WeekdayInsicator varchar(50), --("Weekday", "Weekend")
> and more!
>
> i'll add a few quotes from the book:
>
> "we can cover the history we have stored, as well as several years in the
> future. Even 10 years' worth of days is only about 3,650 rwos, which is a
> relativly small dimension table"
>
> "The holiday indicator takes on the values of Holiday or Nonholiday.
> Remember that the dimension talb eattributes serve as report labels.
> Simply populating the hiliday Insicator with a Y or N woudl be far less
> useful. Imagine a report whre we're comparing hiliday sales for a given
> product verses non-holiday sales. Obviously, it would be helpful if the
> columns has meaningful values such as Holiday/Nonholiday verses a cryptic
> Y/N. Rather than decoding crypting flags into understandable labels in a
> report application, we prefer that the decode be stored in the database so
> that a consitent value is available to all users regardless of their
> reporting environment.
>
> "Some designers pause at this point to ask why an explicit date dimension
> talbe is needed. They reason that if the date key in the fact talbe is a
> date-type field, then any SQL query can directly contrain on the fact
> table date key and use natural SQL date semantics to filter on month or
> year while avoiding a supposedly expensive join. This reasoning falls
> apart for several reasons. First of all, if our relational database can't
> handle an efficient join to the date dimension talbe, we're already in
> deep trouble. Most database optimizers are quite efficient at resolving
> dimensional queries; it is not necessary to avoid joins like the plague.
> Also, on the performance fornt, most databases don't index SQL date
> calculations, so queries contraining on an SQL-calculated field wouldn't
> take advantage of an index.
>
> In terms of usability, the typical business users is not versed in SQL
> date semantics, so he or she would be unable to directly leverage inherent
> capabilities associated with a date data type. SQL date functions do not
> support filtering by attributes such as weekdays verses weekends,
> holidays, fiscal periods, seasons, or major events. Presuming that the
> business needs to slice data by these nonstandard date attributes, then an
> explicit date dimension table is essential. At the bottom line, calendar
> logic belongs in a dimension table, not in the application code. Finally,
> we're going to suggest that the date key is an integer rather than a date
> data type anyway. An SQL-based date key typically is 8 bytes, so you're
> wasting 4 bytes in the fact table for every date key in every row. "
>
> He talks about how 10 years of dates is only a few thousand rows, so we
> can afford the extra database space with all these denormalized fields. On
> the other hand the "fact" table (in this example the Transactions table)
> is millions or billions of rows. We want that table to be as narrow as
> possible. One extra byte per row in a billion row table is an extra
> gigabyte of database size.
>
>
> Another example from the book, that helps to reinforce the point. When
> talking about the "Product" dimension table:
>
> Products
> Product Key (Primary Key)
> Product Description
> SKU Number (Natural Key)
> Brand Description
> Category Description
> Department Description (i.e. "Frozen Foods", "Pet Food", "Dairy",
> "Meat", etc)
> Package Type Description
> Package Size
> Fat Content
> Diet Type
> Weight
> Weight Units Of Measure
> Storage Type
> Shelf Life Type
> Shelf Width
> Shelf Height
> Shelf Depth
> ...and more
>
>
> "A reasonable product dimension table would have 50 or more descriptive
> attributes. Each attribute is a rich source for constraining and
> constructing row headers."
>
> In our normal database thinking, would would define a Departments table,
> and put a key in the products table to departments. But we don't do that
> in a warehoue. Imagine there are 150,000 products and only about 50
> departments,
>
> "Thus, on average, there are 3,000 repetitions of each unique value in
> department attribute. This is all right! We do not need to separate these
> repeated values into a second normalized table to save space. Remember
> that dimension table space requirements pale in comparison with fact table
> space considerations"
>
>
> Data for the warehouse is usually extracted nightly from the live
> transactional system, transformed into a format suitable for the
> warehouse, and then loaded into the data warehouse; ready for use by
> managers come morning. This process has a special name, ETL (Extract,
> Transform, Load).
>
> You're already seen dimension tables; relativly short and very wide (large
> number of columns). Dimension tables usually don't measure things that
> "happen"; they usually measure things that are. (Customers, Patrons,
> Airplanes, Products, Dates, Companies, Diseases, etc).
>
> The central table, in this example the Transactions table, is known as a
> "Fact" table. Facts are things that happen, or take place, or get measured
> (items on a customer's bill, a line item on a doctor's bill, a boarding
> pass to get on an airplace, a daily snapshot of inventory levels, a
> monthly snapshot for each bank account).
>
>
> To sum up:
> A data warehouse is a specially designed database system, separate from
> the "live" system, that is organized in such as way that makes it easy for
> managers to get at the data.
>
> So, have you ever used a data warehouse? It is not magically different
> from any other database - only the way it is structured is special.
>
>
>
>
> "Gen" <me@microsoft.com> wrote in message
> news:%236Aewx%23sFHA
.3548@TK2MSFTNGP11.phx.gbl...
>
>
| |
| Ian Boyd 2005-09-12, 11:23 am |
|
"Dave Wickert [MSFT]" <dwickert@online.microsoft.com> wrote in message
news:O3jlkP1tFHA.2072@TK2MSFTNGP14.phx.gbl...
> There are actually two major approaches to data warehousing. Ralph
> Kimball's dimensional modeling approach is one; the other is enterprise
> modeling. For that, look at books by Bill Inmon and/or Claudia Imhoff,
> such as:
> http://search.barnesandnoble.com/bo...isbn=0471399612
> or
> http://search.barnesandnoble.com/bo...isbn=0471081302
i'll give the 2nd one a look. But i think i'm already enamored with Kimball.
He is quote a handsome man.
ed2k:// |file|foo|4307725|CB
E8B4BA138D383CFD163A
66C8F68200|/
| |
|
| Is you will be in the Philidelphia, Atlanta, or Houston area within the next
few days ...
http://www.informatica.com/info/kim1
"Ian Boyd" <ian.msnews010@avatopia.com> wrote in message
news:eNNV%23J7tFHA.3424@tk2msftngp13.phx.gbl...
>
> "Dave Wickert [MSFT]" <dwickert@online.microsoft.com> wrote in message
> news:O3jlkP1tFHA.2072@TK2MSFTNGP14.phx.gbl...
>
> i'll give the 2nd one a look. But i think i'm already enamored with
> Kimball. He is quote a handsome man.
>
> ed2k:// |file|foo|4307725|CB
E8B4BA138D383CFD163A
66C8F68200|/
>
| |
| Peter Nolan 2005-09-15, 9:23 am |
| Hi Gen,
I have posted a lot of links and books on my personal site
www.peternolan.com...it is specifically for 'newbies'.
If you don't know that you were working on a DW then possibly you
weren't.. ;-)
There are various definitions of a DW and I prefer Bill Inmons
definition. This has now extended into the Corporate Information
Factory which is an architecture not a 'product' and not a
'solution'.....
Best Regards
Peter Nolan
www.peternolan.com
|
|
|
|
|