Info > Data Mining > 3-2. OLAP > OLAP toolÀÇ Àü¹ÝÀûÀÎ ¸é
¢¹¢º OLAP toolÀÇ Àü¹ÝÀûÀÎ ¸é (Text - Chap 16)
¨ç Introduction
OLAP tools : Role for answer for accessing large DB.
Hype : OLAP is a substitute for DM ¢¡ not true.
OLAP : powerful and fast tools for reporting on data.
DM : find pattern in data. ¢¡ OLAP and DM are complementary both are important parts
of exploiting data.
¨è Background of appearing OLAP : Fig 16.1>
business world - generating reports to meet business needs
The 1-st generation (manual methods)
- Use the mainframe report-generation tools whose output is traditionally
on green bar paper.
¢¡ too late and too inflexible for decision support. (automated paper-based methods)
response time is measured in weeks and months.
The 2-nd generation (off-the-shelf query)
- Can generate queries in SQL and talk to local or remote data sources
by standard protocol such as ODBC(Open DataBase Standard : MS) or DRDA(Distributed
Remote Data Access : IBM)
¢¡ available from many vendors and have flexible GUI (popular for the past decade)
can use about a day training often running as an SQL query on an already overburdened
DB response time is measured in minutes or hours. (much faster than former method
but still make it difficult to exploit the data) Long response time usually comes
from overburdened OS or saturated networks.
¢¡ User may have forgotten the important question or moved on to some other task
while waiting hours for the answer set
¢¡ Interactive response time(3-5 seconds) is required.
The 3-rd generation (OLAP)
- client-server tools that have an advanced GUI talking to a powerful
and efficient representation of the data, use cube which is suited for queries that
allow users to slice-and-dice data in any way they see fit. store either in a relational
DB using star-schema or multi-dimensional DB that optimizes OLAP. provide handy analysis
that is difficult or impossible to express in SQL
¢¡ very fast response time measured in seconds and provides handy analysis functions
that are difficult or impossible to express in SQL.
¨é An OLAP Example
OLAP - provide the slice-and-dice data to find the information
Table 16.1 (Moviegoers DB) : All the information for each person viewing a given movie is in a single row ¢¡ The data is denormalized and anathema when users update to data ¢¢ relational DB is normalized so only one row has to be updated for any such change efficient disk space than multiple copies of the same row.
perspective OLAP - two types of columns in the denormalized data
i) Dimension columns - information used for summarizations have fixed
number of values(location, product codes and dates)
ii) Aggregate columns - calculate amounts, counts, averages and sums have numeric
values, dollar amounts or cashes
¡Ø Star schema (Fig 16.2 & 16.3)
Â÷¿ø ¸ðµ¨ÀÇ ½ÇÁ¦ ±¸Á¶´Â "½ºÅ¸" µµÇØ·Î ³ªÅ¸³½´Ù. ½ºÅ¸ µµÇØ´Â
Â÷¿ø Å×À̺í·Î µÑ·¯ ½×¿© ÀÖ´Â ½ºÅ¸ Á߽ɿ¡ ÀÖ´Â ½ÇÁ¦(fact) Å×À̺í·Î ³ªÅ¸³¯ ¼ö ÀÖ´Ù.
Interesting point of movie goers DB (Fig 16.4)
i) number of times each movie was seen for movies seen more than 5
ii) for what movies is the average age of the viewers over 30?
iii) number of people and their ages by source
iv) number of people from each source by gender
¢¡ good candidates for dimensions :name of the movie, gender of movie
goers, source of information
¢¡ candidates for aggregations :number of times that each movie was seen and average
age of the movie goers
¨ê What's in a cube?
Cube in movie goers example - quite simple for purpose illustration(3
flat dimensions)
In general, 5-20 dimensions - very difficult to draw
MDB(Multidimensional Database) - a way of representing data that comes in tabular
form (data has row and colums) has a set of dimensions, each of which is divided
into discrete values
well formed cube - each record lands in exactly one subcube (The cardinal rule of MDD)
¡Ø Importance of cardinal rule
redundant dimensions (one for date and another for day of the week)
- the same record will land in two or more subcubes ¢¡ not accurate the sum
Things to keep in mind when using cubes
i) Handling continuous values
ii) Hierarchical dimensions
iii) Dimensions that span multiple fields
¡Ü Handling continuous values
usually quite inefficient and less useful for decision support often
appropriate for the aggregate columns instead of the dimension
- cannot be answer a question "What age group sees the most movies?"
- age must be a dimension
¢¡ to discretize the continuous dimension by defining ranges (bin, deciles)
¡Ü Hierarchical dimensions
a single column seems appropriate for multiple dimensions : a special
date - day of the week, month, quarter, calendar year
represent each of these as a different dimension : expensive and violate the cardinal
rule (a lot of redundancy)
¢¡ OLAP can describe a dimension as a hierarchy include a mechanism for describing
dimension schema, description of hierarchies of information and other properties
within a dimension. - a powerful feature that is also frequently applied to geography
product codes and SKU as well
¡Ø Warning!! (Beware of redundant dimensions)
useful to split a large dimension but be sure that obey the cardinal
rule : overall cube remains consistent otherwise dimension overlap occurred.
¢¡ ensure that record counts summed across all the subcubes in the entire cube will
be equal to the number of records inserted into the cube
¡Ü Dimensions that span multiple fields
multiple columns in the input correspond to a single dimension(date ¢¡ month, day and year)
ex) major bank, where DB of accounts has four different codes for account status.
- four dimensions are considered - cumbersome and makes MDD more difficult for end-users ¢¡ need to collapse down into a single dimension which requires a preprocessing step to consolidate the codes.
Info > Data Mining > 3-2. OLAP > OLAP toolÀÇ Àü¹ÝÀûÀÎ ¸é