Info > Data Mining3-2. OLAP > OLAP toolÀÇ Àü¹ÝÀûÀÎ ¸é


¢¹¢º OLAP toolÀÇ Àü¹ÝÀûÀÎ ¸é (Text - Chap 16)

 

¨ç Introduction

OLAP tools : Role for answer for accessing large DB.
Hype : OLAP is a substitute for DM ¢¡ not true.
OLAP : powerful and fast tools for reporting on data.
DM : find pattern in data. ¢¡ OLAP and DM are complementary both are important parts of exploiting data.

 

¨è Background of appearing OLAP : Fig 16.1>

business world - generating reports to meet business needs
The 1-st generation (manual methods)

- Use the mainframe report-generation tools whose output is traditionally on green bar paper.
¢¡ too late and too inflexible for decision support. (automated paper-based methods) response time is measured in weeks and months.

The 2-nd generation (off-the-shelf query)

- Can generate queries in SQL and talk to local or remote data sources by standard protocol such as ODBC(Open DataBase Standard : MS) or DRDA(Distributed Remote Data Access : IBM)
¢¡ available from many vendors and have flexible GUI (popular for the past decade) can use about a day training often running as an SQL query on an already overburdened DB response time is measured in minutes or hours. (much faster than former method but still make it difficult to exploit the data) Long response time usually comes from overburdened OS or saturated networks.
¢¡ User may have forgotten the important question or moved on to some other task while waiting hours for the answer set
¢¡ Interactive response time(3-5 seconds) is required.

The 3-rd generation (OLAP)

- client-server tools that have an advanced GUI talking to a powerful and efficient representation of the data, use cube which is suited for queries that allow users to slice-and-dice data in any way they see fit. store either in a relational DB using star-schema or multi-dimensional DB that optimizes OLAP. provide handy analysis that is difficult or impossible to express in SQL
¢¡ very fast response time measured in seconds and provides handy analysis functions that are difficult or impossible to express in SQL.

 

¨é An OLAP Example

OLAP - provide the slice-and-dice data to find the information

Table 16.1 (Moviegoers DB) : All the information for each person viewing a given movie is in a single row ¢¡ The data is denormalized and anathema when users update to data ¢¢ relational DB is normalized so only one row has to be updated for any such change efficient disk space than multiple copies of the same row.

perspective OLAP - two types of columns in the denormalized data

i) Dimension columns - information used for summarizations have fixed number of values(location, product codes and dates)
ii) Aggregate columns - calculate amounts, counts, averages and sums have numeric values, dollar amounts or cashes

 

¡Ø Star schema (Fig 16.2 & 16.3)

Â÷¿ø ¸ðµ¨ÀÇ ½ÇÁ¦ ±¸Á¶´Â "½ºÅ¸" µµÇØ·Î ³ªÅ¸³½´Ù. ½ºÅ¸ µµÇØ´Â Â÷¿ø Å×À̺í·Î µÑ·¯ ½×¿© ÀÖ´Â ½ºÅ¸ Á߽ɿ¡ ÀÖ´Â ½ÇÁ¦(fact) Å×À̺í·Î ³ªÅ¸³¯ ¼ö ÀÖ´Ù.
Interesting point of movie goers DB (Fig 16.4)
i) number of times each movie was seen for movies seen more than 5
ii) for what movies is the average age of the viewers over 30?
iii) number of people and their ages by source
iv) number of people from each source by gender

¢¡ good candidates for dimensions :name of the movie, gender of movie goers, source of information
¢¡ candidates for aggregations :number of times that each movie was seen and average age of the movie goers

 

¨ê What's in a cube?

Cube in movie goers example - quite simple for purpose illustration(3 flat dimensions)
In general, 5-20 dimensions - very difficult to draw
MDB(Multidimensional Database) - a way of representing data that comes in tabular form (data has row and colums) has a set of dimensions, each of which is divided into discrete values

well formed cube - each record lands in exactly one subcube (The cardinal rule of MDD)


¡Ø Importance of cardinal rule

redundant dimensions (one for date and another for day of the week) - the same record will land in two or more subcubes ¢¡ not accurate the sum
Things to keep in mind when using cubes
i) Handling continuous values
ii) Hierarchical dimensions
iii) Dimensions that span multiple fields

¡Ü Handling continuous values

usually quite inefficient and less useful for decision support often appropriate for the aggregate columns instead of the dimension
- cannot be answer a question "What age group sees the most movies?"
- age must be a dimension
¢¡ to discretize the continuous dimension by defining ranges (bin, deciles)

¡Ü Hierarchical dimensions

a single column seems appropriate for multiple dimensions : a special date - day of the week, month, quarter, calendar year
represent each of these as a different dimension : expensive and violate the cardinal rule (a lot of redundancy)
¢¡ OLAP can describe a dimension as a hierarchy include a mechanism for describing dimension schema, description of hierarchies of information and other properties within a dimension. - a powerful feature that is also frequently applied to geography product codes and SKU as well

¡Ø Warning!! (Beware of redundant dimensions)

useful to split a large dimension but be sure that obey the cardinal rule : overall cube remains consistent otherwise dimension overlap occurred.
¢¡ ensure that record counts summed across all the subcubes in the entire cube will be equal to the number of records inserted into the cube

¡Ü Dimensions that span multiple fields

multiple columns in the input correspond to a single dimension(date ¢¡ month, day and year)

ex) major bank, where DB of accounts has four different codes for account status.

- four dimensions are considered - cumbersome and makes MDD more difficult for end-users ¢¡ need to collapse down into a single dimension which requires a preprocessing step to consolidate the codes.


Info > Data Mining3-2. OLAP > OLAP toolÀÇ Àü¹ÝÀûÀÎ ¸é