

CS590D:  Data Mining 
Chris Clifton 

January 14, 2006

Data Preparation


 

2  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

3  

Why Data Preprocessing? 

  • Data in the real world is dirty
    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • e.g., occupation=“”
    • noisy: containing errors or outliers
      • e.g., Salary=“-10”
    • inconsistent: containing discrepancies in codes or names
      • e.g., Age=“42” Birthday=“03/07/1997”
      • e.g., Was rating “1,2,3”, now rating “A, B, C”
      • e.g., discrepancy between duplicate records

CS590D

 

4  

Why Is Data Dirty? 

  • Incomplete data comes from
    • n/a data value when collected
    • different considerations between the time when the data was collected and the time when it is analyzed
    • human/hardware/software problems
  • Noisy data comes from the process of data
    • collection
    • entry
    • transmission
  • Inconsistent data comes from
    • Different data sources
    • Functional dependency violation

CS590D

 

5  

Why Is Data Preprocessing Important? 

  • No quality data, no quality mining results!
    • Quality decisions must be based on quality data
      • e.g., duplicate or missing data may cause incorrect or even misleading statistics.
    • Data warehouse needs consistent integration of quality data
  • Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse. —Bill Inmon

CS590D

 

6  

Multi-Dimensional Measure of Data Quality 

  • A well-accepted multidimensional view:
    • Accuracy
    • Completeness
    • Consistency
    • Timeliness
    • Believability
    • Value added
    • Interpretability
    • Accessibility
  • Broad categories:
    • intrinsic, contextual, representational, and accessibility.

CS590D

 

7  

Major Tasks in Data Preprocessing 

  • Data cleaning
    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
    • Integration of multiple databases, data cubes, or files
  • Data transformation
    • Normalization and aggregation
  • Data reduction
    • Obtains reduced representation in volume but produces the same or similar analytical results
  • Data discretization
    • Part of data reduction but with particular importance, especially for numerical data

CS590D

 

8  

Forms of data preprocessing


CS590D

 

9  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

10  

Data Cleaning 

  • Importance
    • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
    • “Data cleaning is the number one problem in data warehousing”—DCI survey
  • Data cleaning tasks
    • Fill in missing values
    • Identify outliers and smooth out noisy data
    • Correct inconsistent data
    • Resolve redundancy caused by data integration

CS590D

 

11  

Missing Data 

  • Data is not always available
    • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to
    • equipment malfunction
    • inconsistent with other recorded data and thus deleted
    • data not entered due to misunderstanding
    • certain data may not be considered important at the time of entry
    • not register history or changes of the data
  • Missing data may need to be inferred.

CS590D

 

12  

How to Handle Missing Data? 

  • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
  • Fill in the missing value manually: tedious + infeasible?
  • Fill in it automatically with
    • a global constant: e.g., “unknown”, a new class?!
    • the attribute mean
    • the attribute mean for all samples belonging to the same class: smarter
    • the most probable value: inference-based such as Bayesian formula or decision tree
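
  A minimal pandas sketch of the three automatic fill strategies (illustrative only; the column and class names are invented):

    import pandas as pd

    df = pd.DataFrame({
        "income": [50.0, None, 75.0, None, 60.0, 90.0],
        "class":  ["A",  "A",  "B",  "B",  "A",  "B"],
    })

    # global constant: flag the hole rather than guess a value
    const_fill = df["income"].fillna(-1)

    # attribute mean over all samples
    mean_fill = df["income"].fillna(df["income"].mean())

    # attribute mean within the same class (smarter)
    class_fill = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))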

CS590D

CS590D:  Data Mining 
Chris Clifton 

January 17, 2006

Data Preparation


 

14  

What is Data? 

  • Collection of data objects and their attributes
  • An attribute is a property or characteristic of an object
    • Examples: eye color of a person, temperature, etc.
    • Attribute is also known as variable, field, characteristic, or feature
  • A collection of attributes describe an object
    • Object is also known as record, point, case, sample, entity, or instance
 

Attributes 

Objects


CS590D

 

15  

Attribute Values 

  • Attribute values are numbers or symbols assigned to an attribute
  • Distinction between attributes and attribute values
    • Same attribute can be mapped to different attribute values
      • Example: height can be measured in feet or meters
    • Different attributes can be mapped to the same set of values
      • Example: Attribute values for ID and age are integers
      • But properties of attribute values can be different
        • ID has no limit but age has a maximum and minimum value

CS590D

 

16  

Measurement of Length  

  • The way you measure an attribute may not match the attribute's properties.

CS590D

 

17  

Types of Attributes  

  • There are different types of attributes
    • Nominal
      • Examples: ID numbers, eye color, zip codes
    • Ordinal
      • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
    • Interval
      • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
    • Ratio
      • Examples: temperature in Kelvin, length, time, counts

CS590D

 

18  

Properties of Attribute Values  

  • The type of an attribute depends on which of the following properties it possesses:
    • Distinctness:    =, ≠
    • Order:           <, >
    • Addition:        +, -
    • Multiplication:  *, /
    • Nominal attribute: distinctness
    • Ordinal attribute: distinctness & order
    • Interval attribute: distinctness, order & addition
    • Ratio attribute: all 4 properties

CS590D

 

19  

Attribute Type | Description | Examples | Operations

Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test

Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests

Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

 


CS590D

 

20  

Attribute Level | Transformation | Comments

Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?

Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio | new_value = a * old_value | Length can be measured in meters or feet.

 


CS590D

 

21  

Discrete and Continuous Attributes  

  • Discrete Attribute
    • Has only a finite or countably infinite set of values
    • Examples: zip codes, counts, or the set of words in a collection of documents
    • Often represented as integer variables.  
    • Note: binary attributes are a special case of discrete attributes
  • Continuous Attribute
    • Has real numbers as attribute values
    • Examples: temperature, height, or weight. 
    • Practically, real values can only be measured and represented using a finite number of digits.
    • Continuous attributes are typically represented as floating-point variables.

CS590D

 

22  

Types of data sets  

  • Record
    • Data Matrix
    • Document Data
    • Transaction Data
  • Graph
    • World Wide Web
    • Molecular Structures
  • Ordered
    • Spatial Data
    • Temporal Data
    • Sequential Data
    • Genetic Sequence Data

CS590D

 

23  

Important Characteristics of Structured Data 

    • Dimensionality
      • Curse of Dimensionality
    • Sparsity 
      • Only presence counts
    • Resolution 
      • Patterns depend on the scale

CS590D

 

24  

Record Data  

  • Data that consists of a collection of records, each of which consists of a fixed set of attributes

CS590D

 

25  

Data Matrix  

  • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

CS590D

 

26  

Document Data 

  • Each document becomes a `term' vector,
    • each term is a component (attribute) of the vector,
    • the value of each component is the number of times the corresponding term occurs in the document.

CS590D

 

27  

Transaction Data 

  • A special type of record data, where
    • each record (transaction) involves a set of items. 
    • For example, consider a grocery store.  The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.

CS590D

 

28  

Graph Data  

  • Examples: Generic graph and HTML Links

CS590D

 

29  

Chemical Data  

  • Benzene Molecule: C6H6

CS590D

 

30  

Ordered Data  

  • Sequences of transactions
 

An element of the sequence 

Items/Events


CS590D

 

31  

Ordered Data  

  • Genomic sequence data

CS590D

 

32  

Ordered Data 

  • Spatio-Temporal Data
 

Average Monthly Temperature of land and ocean


CS590D

 

33  

Data Quality  

  • What kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems:
    • Noise and outliers
    • missing values
    • duplicate data

CS590D

 

34  

Noise 

  • Noise refers to modification of original values
    • Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
 
 
 

Two Sine Waves 

Two Sine Waves + Noise


CS590D

 

35  

Outliers 

  • Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

CS590D

 

36  

Missing Values 

  • Reasons for missing values
    • Information is not collected  
      (e.g., people decline to give their age and weight)
    • Attributes may not be applicable to all cases  
      (e.g., annual income is not applicable to children)
  • Handling missing values
    • Eliminate Data Objects
    • Estimate Missing Values
    • Ignore the Missing Value During Analysis
    • Replace with all possible values (weighted by their probabilities)

CS590D

 

37  

Duplicate Data 

  • Data set may include data objects that are duplicates, or almost duplicates of one another
    • Major issue when merging data from heterogeneous sources
  • Examples:
    • Same person with multiple email addresses
  • Data cleaning
    • Process of dealing with duplicate data issues

CS590D

 

38  

Noisy Data 

  • Noise: random error or variance in a measured variable
  • Incorrect attribute values may be due to
    • faulty data collection instruments
    • data entry problems
    • data transmission problems
    • technology limitation
    • inconsistency in naming convention
  • Other data problems which require data cleaning
    • duplicate records
    • incomplete data
    • inconsistent data

CS590D

 

39  

How to Handle Noisy Data? 

  • Binning method:
    • first sort data and partition into (equi-depth) bins
    • then one can smooth by bin means,  smooth by bin median, smooth by bin boundaries, etc.
  • Clustering
    • detect and remove outliers
  • Combined computer and human inspection
    • detect suspicious values and check by human (e.g., deal with possible outliers)
  • Regression
    • smooth by fitting the data into regression functions

CS590D

 

40  

Simple Discretization Methods: Binning 

  • Equal-width (distance) partitioning:
    • Divides the range into N intervals of equal size: uniform grid
    • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
    • The most straightforward, but outliers may dominate presentation
    • Skewed data is not handled well.
  • Equal-depth (frequency) partitioning:
    • Divides the range into N intervals, each containing approximately same number of samples
    • Good data scaling
    • Managing categorical attributes can be tricky.

CS590D

 

41  

Binning Methods for Data Smoothing 

  • Sorted data (e.g., by price)
    • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equi-depth) bins:
    • Bin 1: 4, 8, 9, 15
    • Bin 2: 21, 21, 24, 25
    • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
    • Bin 1: 9, 9, 9, 9
    • Bin 2: 23, 23, 23, 23
    • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
    • Bin 1: 4, 4, 4, 15
    • Bin 2: 21, 21, 25, 25
    • Bin 3: 26, 26, 26, 34
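
  The same smoothing as a short Python sketch (reproduces the three bins above):

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
    depth = 4
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # smoothing by bin means: each value becomes its bin's (rounded) mean
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # smoothing by bin boundaries: each value snaps to the nearer of min/max
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]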

CS590D

 

42  

Cluster Analysis


CS590D

 

43  

Regression 



y = x + 1 

X1 

Y1 

Y1’


CS590D

 

44  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

45  

Data Integration 

  • Data integration:
    • combines data from multiple sources into a coherent store
  • Schema integration
    • integrate metadata from different sources
    • Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
  • Detecting and resolving data value conflicts
    • for the same real world entity, attribute values from different sources are different
    • possible reasons: different representations, different scales, e.g., metric vs. British units

CS590D

 

46  

Handling Redundancy in Data Integration 

  • Redundant data occur often when integrating multiple databases
    • The same attribute may have different names in different databases
    • One attribute may be a “derived” attribute in another table, e.g., annual revenue
  • Redundant data can often be detected by correlation analysis
  • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

CS590D

 

47  

Data Transformation 

  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube construction
  • Generalization: concept hierarchy climbing
  • Normalization: scaled to fall within a small, specified range
    • min-max normalization
    • z-score normalization
    • normalization by decimal scaling
  • Attribute/feature construction
    • New attributes constructed from the given ones

CS590D

CS590D:  Data Mining 
Chris Clifton 

January 18, 2005

Data Preparation


 

49  

Data Transformation: Normalization 

  • min-max normalization:

     v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

  • z-score normalization:

     v' = (v - mean_A) / stand_dev_A

  • normalization by decimal scaling:

     v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
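
A NumPy sketch of the three normalizations (the sample values are made up):

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # min-max normalization to a new range [0, 1]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization
    v_zscore = (v - v.mean()) / v.std()

    # decimal scaling: divide by 10^j, with j the smallest integer
    # such that max(|v'|) < 1
    j = 0
    while np.abs(v).max() / 10 ** j >= 1:
        j += 1
    v_decimal = v / 10 ** j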


CS590D

 

50  

Z-Score (Example) 

[Table: worked z-score example — columns of raw values v and their normalized values v' = (v - Avg) / sdev; e.g., the first attribute has Avg = 34.3 and sdev = 55.9, so v = 20 maps to v' = -.26 and v = 250 maps to v' = 3.87.]


CS590D

 

51  

Data Preprocessing 

  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

 


CS590D

 

52  

Aggregation 

  • Combining two or more attributes (or objects) into a single attribute (or object)
  • Purpose
    • Data reduction
      • Reduce the number of attributes or objects
    • Change of scale
      • Cities aggregated into regions, states, countries, etc
    • More “stable” data
      • Aggregated data tends to have less variability

CS590D

 

53  

Aggregation 

Standard Deviation of Average Monthly Precipitation 

Standard Deviation of Average Yearly Precipitation 

Variation of Precipitation in Australia


CS590D

CS490D: 
Introduction to Data Mining 
Chris Clifton 

January 26, 2004

Data Preparation


 

55  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

56  

Data Reduction Strategies 

  • A data warehouse may store terabytes of data
    • Complex data analysis/mining may take a very long time to run on the complete data set
  • Data reduction
    • Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
  • Data reduction strategies
    • Data cube aggregation
    • Dimensionality reduction — remove unimportant attributes
    • Data Compression
    • Numerosity reduction — fit data into models
    • Discretization and concept hierarchy generation

CS590D

 

57  

Data Cube Aggregation 

  • The lowest level of a data cube
    • the aggregated data for an individual entity of interest
    • e.g., a customer in a phone calling data warehouse.
  • Multiple levels of aggregation in data cubes
    • Further reduce the size of data to deal with
  • Reference appropriate levels
    • Use the smallest representation which is enough to solve the task
  • Queries regarding aggregated information should be answered using data cube, when possible

CS590D

 

58  

Dimensionality Reduction 

  • Feature selection (i.e., attribute subset selection):
    • Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
    • reduces the number of patterns, which are then easier to understand
  • Heuristic methods (due to exponential # of choices):
    • step-wise forward selection
    • step-wise backward elimination
    • combining forward selection and backward elimination
    • decision-tree induction

CS590D

 

59  

Initial attribute set:

{A1, A2, A3, A4, A5, A6} 

[Figure: decision tree with A4? at the root, branching to A1? and A6?, whose leaves are Class 1 and Class 2.]


Reduced attribute set:  {A1, A4, A6} 

Example of 
Decision Tree Induction


CS590D

 

60  

Heuristic Feature Selection Methods 

  • There are 2^d possible sub-features of d features
  • Several heuristic feature selection methods:
    • Best single features under the feature independence assumption: choose by significance tests.
    • Best step-wise feature selection:
      • The best single-feature is picked first
      • Then the next best feature conditioned on the first, ...
    • Step-wise feature elimination:
      • Repeatedly eliminate the worst feature
    • Best combined feature selection and elimination:
    • Optimal branch and bound:
      • Use feature elimination and backtracking

CS590D

 

61  

Data Compression 

  • String compression
    • There are extensive theories and well-tuned algorithms
    • Typically lossless
    • But only limited manipulation is possible without expansion
  • Audio/video compression
    • Typically lossy compression, with progressive refinement
    • Sometimes small fragments of signal can be reconstructed without reconstructing the whole
  • Time sequence is not audio
    • Typically short and varies slowly with time

CS590D

 

62  

Data Compression 

Original Data 

Compressed

Data 

lossless 

Original Data

Approximated  

lossy


CS590D

 

63  

Wavelet Transformation  

  • Discrete wavelet transform (DWT): linear signal processing, multiresolutional analysis
  • Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
  • Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
  • Method:
    • Length, L, must be an integer power of 2 (padding with 0s, when necessary)
    • Each transform has 2 functions: smoothing, difference
    • Applies to pairs of data, resulting in two sets of data of length L/2
    • Applies the two functions recursively, until it reaches the desired length
 

  
 

Haar2 

Daubechie4
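
A minimal sketch of the smoothing/difference recursion for the Haar case (the input signal is illustrative):

    # one level pairs up the data: averages (smoothing) and differences;
    # recurse on the averages until length 1, keeping coefficients coarsest-first
    def haar(signal):
        if len(signal) == 1:
            return list(signal)
        avgs  = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]
        return haar(avgs) + diffs

    print(haar([2, 2, 0, 2, 3, 5, 4, 4]))
    # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]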


CS590D

 

64  

DWT for Image Compression 

  • Image
 

  Low Pass       High Pass 

     Low Pass       High Pass 

Low Pass    High Pass


CS590D

 

65  

Curse of Dimensionality 

  • When dimensionality increases, data becomes increasingly sparse in the space that it occupies
  • Definitions of density and distance between points, which is critical for clustering and outlier detection, become less meaningful
 
  • Randomly generate 500 points
  • Compute difference between max and min distance between any pair of points

CS590D

 

66  

Dimensionality Reduction 

  • Purpose:
    • Avoid curse of dimensionality
    • Reduce amount of time and memory required by data mining algorithms
    • Allow data to be more easily visualized
    • May help to eliminate irrelevant features or reduce noise
  • Techniques
    • Principal Component Analysis
    • Singular Value Decomposition
    • Others: supervised and non-linear techniques

CS590D

 

67  

X1 

X2 

Y1 

Y2 

Principal Component Analysis


CS590D

 

68  

Dimensionality Reduction: PCA 

  • Goal is to find a projection that captures the largest  amount of variation in data
 
 

x2 

x1 

e


CS590D

 

69  

Dimensionality Reduction: PCA 

  • Find the eigenvectors of the covariance matrix
  • The eigenvectors define the new space
 

x2 

x1 

e


CS590D

 

70  

Principal Component Analysis  

  • Given N data vectors from k-dimensions, find c ≤  k  orthogonal vectors that can be best used to represent data
    • The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
  • Each data vector is a linear combination of the c principal component vectors
  • Works for numeric data only
  • Used when the number of dimensions is large
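
A minimal NumPy sketch of PCA as described above (random data stands in for a real table):

    import numpy as np

    X = np.random.randn(200, 5)                  # 200 objects, 5 numeric attributes
    Xc = X - X.mean(axis=0)                      # center the data
    evals, evecs = np.linalg.eigh(np.cov(Xc.T))  # eigenvectors of the covariance matrix
    order = np.argsort(evals)[::-1]              # largest variance first
    c = 2                                        # keep c principal components
    X_reduced = Xc @ evecs[:, order[:c]]         # N x c reduced representation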

CS590D

 

71  

Dimensionality Reduction: PCA


CS590D

 

72  

Dimensionality Reduction: ISOMAP 

  • Construct a neighbourhood graph
  • For each pair of points in the graph, compute the shortest path distances – geodesic distances
 

By: Tenenbaum, de Silva, Langford (2000)


CS590D

 

73  

Feature Subset Selection 

  • Another way to reduce dimensionality of data
  • Redundant features
    • duplicate much or all of the information contained in one or more other attributes
    • Example: purchase price of a product and the amount of sales tax paid
  • Irrelevant features
    • contain no information that is useful for the data mining task at hand
    • Example: students' ID is often irrelevant to the task of predicting students' GPA

CS590D

 

74  

Feature Subset Selection 

  • Techniques:
    • Brute-force approach:
      • Try all possible feature subsets as input to data mining algorithm
    • Embedded approaches:
      • Feature selection occurs naturally as part of the data mining algorithm
    • Filter approaches:
      • Features are selected before data mining algorithm is run
    • Wrapper approaches:
      • Use the data mining algorithm as a black box to find best subset of attributes

CS590D

CS590D: 
Data Mining 
Chris Clifton 

January 24, 2006

Data Preparation


 

76  

Feature Creation 

  • Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
  • Three general methodologies:
    • Feature Extraction
      • domain-specific
    • Mapping Data to New Space
    • Feature Construction
      • combining features

CS590D

 

77  

Mapping Data to a New Space 
 
 
 

Two Sine Waves 

Two Sine Waves + Noise 

Frequency 

Fourier transform

Wavelet transform


CS590D

 

78  

Discretization Using Class Labels 

  • Entropy based approach
 
 

3 categories for both x and y 

5 categories for both x and y


CS590D

 

79  

Discretization Without Using Class Labels  

Data 

Equal interval width 

Equal frequency 

K-means


CS590D

 

80  

Attribute Transformation 

  • A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
    • Simple functions: x^k, log(x), e^x, |x|
    • Standardization and Normalization

CS590D

 

81  

Numerosity Reduction 

  • Parametric methods
    • Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
    • Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces
  • Non-parametric methods
    • Do not assume models
    • Major families: histograms, clustering, sampling

CS590D

 

82  

Regression and Log-Linear Models 

  • Linear regression: Data are modeled to fit a straight line
    • Often uses the least-square method to fit the line
  • Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector
  • Log-linear model: approximates discrete multidimensional probability distributions

CS590D

 

83  

Regress Analysis and Log-Linear Models 

  • Linear regression: Y = α + β X (see the sketch after this list)
    • Two parameters, α and β, specify the line and are to be estimated by using the data at hand
    • by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
    • Many nonlinear functions can be transformed into the above
  • Log-linear models:
    • The multi-way table of joint probabilities is approximated by a product of lower-order tables
    • Probability: p(a, b, c, d) = αab βac χad δbcd
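
A least-squares fit of Y = α + β X with NumPy (the data points are invented):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([2.1, 2.9, 4.2, 4.8])

    A = np.column_stack([np.ones_like(X), X])        # design matrix [1, X]
    (alpha, beta), *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(alpha, beta)                               # roughly 1.15 and 0.94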

CS590D

 

84  

Histograms 

  • A popular data reduction technique
  • Divide data into buckets and store average (sum) for each bucket
  • Can be constructed optimally in one dimension using dynamic programming
  • Related to quantization problems.

CS590D

 

85  

Clustering 

  • Partition data set into clusters, and one can store cluster representation only
  • Can be very effective if data is clustered but not if data is “smeared”
  • Can have hierarchical clustering and be stored in multi-dimensional index tree structures
  • There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

CS590D

 

86  

Hierarchical Reduction 

  • Use multi-resolution structure with different degrees of reduction
  • Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters”
  • Parametric methods are usually not amenable to hierarchical representation
  • Hierarchical aggregation
    • An index tree hierarchically divides a data set into partitions by value range of some attributes
    • Each partition can be considered as a bucket
    • Thus an index tree with aggregates stored at each node is a hierarchical histogram

CS590D

 

87  

Sampling 

  • Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
  • Choose a representative subset of the data
    • Simple random sampling may have very poor performance in the presence of skew
  • Develop adaptive sampling methods
    • Stratified sampling:
      • Approximate the percentage of each class (or subpopulation of interest) in the overall database
      • Used in conjunction with skewed data
  • Sampling may not reduce database I/Os (page at a time).

CS590D

 

88  

SRSWOR

(simple random

sample without

replacement) 

SRSWR 

Raw Data 

Sampling


CS590D

 

89  

Sampling 

Raw Data  

Cluster/Stratified Sample


CS590D

 

90  

Sampling  

  • Sampling is the main technique employed for data selection.
    • It is often used for both the preliminary investigation of the data and the final data analysis.
  • Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
  • Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

CS590D

 

91  

Sampling …  

  • The key principle for effective sampling is the following:
    • using a sample will work almost as well as using the entire data set, if the sample is representative
    • A sample is representative if it has approximately the same property (of interest) as the original set of data 

CS590D

 

92  

Types of Sampling 

  • Simple Random Sampling
    • There is an equal probability of selecting any particular item
  • Sampling without replacement
    • As each item is selected, it is removed from the population
  • Sampling with replacement
    • Objects are not removed from the population as they are selected for the sample.  
      • In sampling with replacement, the same object can be picked more than once
  • Stratified sampling
    • Split the data into several partitions; then draw random samples from each partition
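
Minimal sketches of the three schemes with Python's random module (data and strata are made up):

    import random

    data = list(range(100))

    srswor = random.sample(data, 10)                   # without replacement
    srswr  = [random.choice(data) for _ in range(10)]  # with replacement

    # stratified: partition first, then sample each stratum proportionally
    strata = {"low": data[:80], "high": data[80:]}
    stratified = [x for part in strata.values()
                    for x in random.sample(part, max(1, len(part) // 10))]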

CS590D

 

93  

Sample Size 
 

  

[Figure: the same data set drawn with 8000, 2000, and 500 sampled points.]


CS590D

 

94  

Sample Size 

  • What sample size is necessary to get at least one object from each of 10 groups? (See the sketch below.)
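
For a with-replacement sample and 10 equally likely groups, the coverage probability can be computed by inclusion-exclusion (a sketch, not from the slides):

    from math import comb

    def p_cover_all(n, g=10):
        # P(sample of size n contains every one of g equally likely groups)
        return sum((-1) ** k * comb(g, k) * ((g - k) / g) ** n
                   for k in range(g + 1))

    for n in (10, 20, 40, 60):
        print(n, round(p_cover_all(n), 3))
    # approximately: 10 -> 0.000, 20 -> 0.215, 40 -> 0.858, 60 -> 0.982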

CS590D

 

95  

Similarity and Dissimilarity 

  • Similarity
    • Numerical measure of how alike two data objects are.
    • Is higher when objects are more alike.
    • Often falls in the range [0,1]
  • Dissimilarity
    • Numerical measure of how different are two data objects
    • Lower when objects are more alike
    • Minimum dissimilarity is often 0
    • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

CS590D

 

96  

Similarity/Dissimilarity for Simple Attributes 

p and q are the attribute values for two data objects. The standard per-type measures:

  • Nominal:  d = 0 if p = q, d = 1 if p ≠ q;  s = 1 - d
  • Ordinal:  map the values to integers 0 .. n-1, then d = |p - q| / (n - 1);  s = 1 - d
  • Interval or Ratio:  d = |p - q|;  s = -d, or e.g. s = 1 / (1 + d)


CS590D

 

97  

Euclidean Distance 

  • Euclidean Distance:

     dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k - q_k)^2 )

    • Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
  • Standardization is necessary, if scales differ. 

CS590D

 

98  

Euclidean Distance 

Distance Matrix


CS590D

 

99  

Minkowski Distance 

  • Minkowski Distance is a generalization of Euclidean Distance:

     dist(p, q) = ( Σ_{k=1}^{n} |p_k - q_k|^r )^(1/r)

       Where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q

CS590D

 

100  

Minkowski Distance: Examples 

  • r = 1.  City block (Manhattan, taxicab, L1 norm) distance.
    • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
  • r = 2.  Euclidean distance
  • r → ∞.  “supremum” (Lmax norm, L∞ norm) distance.
    • This is the maximum difference between any component of the vectors
  • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
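
All three cases in one small function (the example points are invented):

    def minkowski(p, q, r):
        if r == float("inf"):                         # supremum / L-max norm
            return max(abs(a - b) for a, b in zip(p, q))
        return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

    p, q = (0, 2), (3, 6)
    print(minkowski(p, q, 1))             # 7.0  (city block)
    print(minkowski(p, q, 2))             # 5.0  (Euclidean)
    print(minkowski(p, q, float("inf")))  # 4    (supremum)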

CS590D

 

101  

Minkowski Distance 

Distance Matrix


CS590D

 

102  

Mahalanobis Distance 

mahalanobis(p, q) = (p - q) Σ^(-1) (p - q)^T

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Σ is the covariance matrix of the input data X


CS590D

 

103  

Mahalanobis Distance 

Covariance Matrix:

   Σ = | 0.3  0.2 |
       | 0.2  0.3 |

B 

A 

C 

A: (0.5, 0.5)

B: (0, 1)

C: (1.5, 1.5) 

Mahal(A,B) = 5

Mahal(A,C) = 4
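
A NumPy check of these numbers (the slide's distance is the quadratic form above):

    import numpy as np

    sigma = np.array([[0.3, 0.2],
                      [0.2, 0.3]])       # covariance matrix from this slide
    sigma_inv = np.linalg.inv(sigma)

    def mahal(p, q):
        d = np.asarray(p) - np.asarray(q)
        return d @ sigma_inv @ d

    A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
    print(mahal(A, B))   # 5.0
    print(mahal(A, C))   # 4.0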


CS590D

 

104  

Common Properties of a Distance 

  • Distances, such as the Euclidean distance, have some well known properties.
    1. d(p, q) ≥ 0   for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
    2. d(p, q) = d(q, p)   for all p and q. (Symmetry)
    3. d(p, r) ≤ d(p, q) + d(q, r)   for all points p, q, and r. (Triangle Inequality)

   where d(p, q) is the distance (dissimilarity) between points (data objects), p and q

  • A distance that satisfies these properties is a metric

CS590D

 

105  

Common Properties of a Similarity 

  • Similarities, also have some well known properties.
    1. s(p, q) = 1 (or maximum similarity) only if p = q.  
       
    2. s(p, q) = s(q, p)   for all p and q. (Symmetry) 

   where s(p, q) is the similarity between points (data objects), p and q.

 


CS590D

 

106  

Similarity Between Binary Vectors 

  • Common situation is that objects, p and q, have only binary attributes
  • Compute similarities using the following quantities 

   M01 = the number of attributes where p was 0 and q was 1

   M10 = the number of attributes where p was 1 and q was 0

   M00 = the number of attributes where p was 0 and q was 0

   M11 = the number of attributes where p was 1 and q was 1 

  • Simple Matching and Jaccard Coefficients

   SMC =  number of matches / number of attributes

            =  (M11 + M00) / (M01 + M10 + M11 + M00)

   J = number of 11 matches / number of not-both-zero attributes values

       = (M11) / (M01 + M10 + M11)


CS590D

 

107  

SMC versus Jaccard: Example 

p =  1 0 0 0 0 0 0 0 0 0      

q =  0 0 0 0 0 0 1 0 0 1  

M01 = 2   (the number of attributes where p was 0 and q was 1)

M10 = 1   (the number of attributes where p was 1 and q was 0)

M00 = 7   (the number of attributes where p was 0 and q was 0)

M11 = 0   (the number of attributes where p was 1 and q was 1) 

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7  

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
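
The same numbers, computed directly in Python:

    p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))   # 2
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))   # 1
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))   # 7
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))   # 0

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)          # 0.7
    jac = m11 / (m01 + m10 + m11)                        # 0.0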

 


CS590D

 

108  

Cosine Similarity 

  • If d1 and d2 are two document vectors, then

             cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),

   where • indicates the vector dot product and ||d|| is the length of vector d.

  • Example:  

       d1 =  3 2 0 5 0 0 0 2 0 0  

        d2 =  1 0 0 0 0 0 0 1 0 2  

        d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

       ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

        ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

         cos( d1, d2 ) = .3150
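
Recomputing the example in Python (note ||d2|| = sqrt(6) ≈ 2.449):

    import math

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

    dot  = sum(a * b for a, b in zip(d1, d2))   # 5
    len1 = math.sqrt(sum(a * a for a in d1))    # 6.481
    len2 = math.sqrt(sum(b * b for b in d2))    # 2.449
    print(dot / (len1 * len2))                  # 0.3150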

     


    CS590D

     

    109  

    Extended Jaccard Coefficient (Tanimoto) 

    • Variation of Jaccard for continuous or count attributes
      • Reduces to Jaccard for binary attributes

    CS590D

     

    110  

    Correlation 

    • Correlation measures the linear relationship between objects
    • To compute correlation, we standardize data objects, p and q, and then take their dot product

    CS590D

     

    111  

    Visually Evaluating Correlation 

    Scatter plots showing the similarity from –1 to 1.


    CS590D

     

    112  

    General Approach for Combining Similarities 

    • Sometimes attributes are of many different types, but an overall similarity is needed.

    CS590D

     

    113  

    Using Weights to Combine Similarities 

    • May not want to treat all attributes the same.
      • Use weights wk which are between 0 and 1 and sum to 1.

    CS590D

     

    114  

    Density 

    • Density-based clustering requires a notion of density
    • Examples: 
      • Euclidean density
        • Euclidean density = number of points per unit volume
      • Probability density  
      • Graph-based density 

    CS590D

     

    115  

    Euclidean Density – Cell-based 

    • Simplest approach is to divide region into a number of rectangular cells of equal volume and define density as # of points the cell contains

    CS590D

     

    116  

    Euclidean Density – Center-based 

    • Euclidean density is the number of points within a specified radius of the point

    CS590D

     

    117  

    Data Preprocessing 

    • Why preprocess the data?
    • Data cleaning
    • Data integration and transformation
    • Data reduction
    • Discretization and concept hierarchy generation
    • Summary

    CS590D

     

    118  

    Discretization 

    • Three types of attributes:
      • Nominal — values from an unordered set
      • Ordinal — values from an ordered set
      • Continuous — real numbers
    • Discretization:
      • divide the range of a continuous attribute into intervals
      • Some classification algorithms only accept categorical attributes.
      • Reduce data size by discretization
      • Prepare for further analysis

    CS590D

     

    119  

    Discretization and Concept Hierarchy

    • Discretization
      • reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
    • Concept hierarchies
      • reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)

    CS590D

    CS490D: 
    Introduction to Data Mining 
    Chris Clifton 

    January 28, 2004

    Data Preparation


     

    121  

    Discretization and Concept Hierarchy Generation for Numeric Data 

    • Binning (see sections before)
    • Histogram analysis (see sections before)
    • Clustering analysis (see sections before)
    • Entropy-based discretization
    • Segmentation by natural partitioning

    CS590D

     

    122  

    Definition of Entropy 

    • Entropy: H(X) = - Σ_x p(x) log2 p(x)
    • Example:  Coin Flip
      • A_X = {heads, tails}
      • P(heads) = P(tails) = ½
      • each outcome contributes -½ log2(½) = ½
      • H(X) = ½ + ½ = 1
    • What about a two-headed coin?
    • Conditional Entropy: H(Y|X) = Σ_x p(x) H(Y|X = x)

    CS590D

     

    123  

    Entropy-Based Discretization 

    • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

       E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

    • The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (see the sketch below).
    • The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., Ent(S) - E(S, T) < δ
    • Experiments show that it may reduce data size and improve classification accuracy
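
    A sketch of one split step (values and class labels are invented):

      import math
      from collections import Counter

      def entropy(labels):
          n = len(labels)
          return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

      def best_split(values, labels):
          # try the midpoint between each pair of adjacent sorted values as T
          pairs = sorted(zip(values, labels))
          best = (float("inf"), None)
          for i in range(1, len(pairs)):
              t = (pairs[i - 1][0] + pairs[i][0]) / 2
              left  = [l for v, l in pairs if v <= t]
              right = [l for v, l in pairs if v > t]
              e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
              best = min(best, (e, t))
          return best   # (E(S, T), boundary T)

      print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
      # (0.0, 6.5)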

    CS590D

     

    124  

    Segmentation by Natural Partitioning 

    • A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
      • If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
      • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
      • If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

    CS590D

     

    125  

    Example of 3-4-5 Rule 

    Step 1: examine the data range — for the profit example, Min = -$351, Low (5%-tile) = -$159, High (95%-tile) = $1,838, Max = $4,700
    Step 2: msd = 1,000; rounding Low and High to the msd gives Low' = -$1,000 and High' = $2,000
    Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 - 0], (0 - $1,000], ($1,000 - $2,000]
    Step 4: adjust the boundary intervals to cover Min and Max — the first interval shrinks to (-$400 - 0] and a new interval ($2,000 - $5,000] is added; the rule is then applied recursively, e.g., (-$400 - 0] into four $100-wide intervals, (0 - $1,000] and ($1,000 - $2,000] into five intervals each, and ($2,000 - $5,000] into three $1,000-wide intervals


    CS590D

     

    126  

    Concept Hierarchy Generation for Categorical Data 

    • Specification of a partial ordering of attributes explicitly at the schema level by users or experts
      • street < city < state < country
    • Specification of a portion of a hierarchy by explicit data grouping
      • {Urbana, Champaign, Chicago} < Illinois
    • Specification of a set of attributes.
      • System automatically generates partial ordering by analysis of the number of distinct values
      • E.g., street < city <state < country
    • Specification of only a partial set of attributes
      • E.g., only street < city, not others

    CS590D

     

    127  

    Automatic Concept Hierarchy Generation 

    • Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
      • The attribute with the most distinct values is placed at the lowest level of the hierarchy
      • Note: Exception—weekday, month, quarter, year
     

    country (15 distinct values)
      → province_or_state (65 distinct values)
        → city (3,567 distinct values)
          → street (674,339 distinct values)


    CS590D

     

    128  

    Data Preprocessing 

    • Why preprocess the data?
    • Data cleaning
    • Data integration and transformation
    • Data reduction
    • Discretization and concept hierarchy generation
    • Summary

    CS590D

     

    129  

    Summary 

    • Data  preparation is a big issue for both warehousing and mining
    • Data preparation includes
      • Data cleaning and data integration
      • Data reduction and feature selection
      • Discretization
    • A lot of methods have been developed, but this is still an active area of research

    CS590D

     

    130  

    References 

    • E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4
    • D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
    • H. V. Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
    • A. Maydanchik, Challenges of Efficient Data Cleansing (DM Review - Data Quality resource portal)
    • D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
    • D. Quass. A Framework for Research in Data Cleaning. (Draft 1999)
    • V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation, VLDB 2001.
    • T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
    • Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
    • R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.
    • http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf

    CS590D

    CS590D:  Data Mining 
    Chris Clifton 

    January 20, 2005

    Data Cubes


     

    132  

    A Sample Data Cube 

    [Figure: a 3-D data cube with dimensions Date (1Qtr-4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico); sum cells aggregate along each dimension up to the apex (All, All, All), and the highlighted cell holds the total annual sales of TV in U.S.A.]


    CS590D

     

    133  

    Cuboids Corresponding to the Cube 

    0-D (apex) cuboid:  all
    1-D cuboids:  product, date, country
    2-D cuboids:  (product, date), (product, country), (date, country)
    3-D (base) cuboid:  (product, date, country)


    CS590D

     

    134  

    Browsing a Data Cube 

    • Visualization
    • OLAP capabilities
    • Interactive manipulation

    CS590D

     

    135  

    Typical OLAP Operations 

    • Roll up (drill-up): summarize data
      • by climbing up hierarchy or by dimension reduction
    • Drill down (roll down): reverse of roll-up
      • from higher level summary to lower level summary or detailed data, or introducing new dimensions
    • Slice and dice:
      • project and select
    • Pivot (rotate):
      • reorient the cube, visualization, 3D to series of 2D planes.
    • Other operations
      • drill across: involving (across) more than one fact table
      • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

    CS590D

     

    136  

    A Star-Net Query Model 

      

    [Figure: a star-net query model — one radial line per dimension, where each circle on a line is called a footprint. Location: CITY, COUNTRY, REGION; Customer: ORDER, Customer Orders, CONTRACTS; Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE; Organization: SALES PERSON, DISTRICT, DIVISION; Time: DAILY, QTRLY, ANNUALY; Shipping Method: AIR-EXPRESS, TRUCK; plus a Promotion dimension.]


    CS590D

     

    137  

    Efficient Data Cube Computation 

    • Data cube can be viewed as a lattice of cuboids 
      • The bottom-most cuboid is the base cuboid
      • The top-most cuboid (apex) contains only one cell
      • How many cuboids in an n-dimensional cube with L levels?
    • Materialization of data cube
      • Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
      • Selection of which cuboids to materialize
        • Based on size, sharing, access frequency, etc.
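
    For the question above: with L_i hierarchy levels on dimension i, the total number of cuboids is T = Π_{i=1}^{n} (L_i + 1); with no hierarchies this is 2^n, e.g., 8 cuboids for a 3-dimensional cube.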

    CS590D

     

    138  

    Cube Operation 

    • Cube definition and computation in DMQL

      define cube sales[item, city, year]: sum(sales_in_dollars)

      compute cube sales

    • Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)

      SELECT item, city, year, SUM (amount)

      FROM SALES

      CUBE BY item, city, year

    • Compute the following Group-Bys

      (date, product, customer),

      (date,product),(date, customer), (product, customer),

      (date), (product), (customer)

      ()  

    [Figure: the resulting cuboid lattice — (); (item), (city), (year); (city, item), (city, year), (item, year); (city, item, year).]


    CS590D

     

    139  

    Cube Computation: ROLAP-Based Method 

    • Efficient cube computation methods
      • ROLAP-based cubing algorithms (Agarwal et al’96)
      • Array-based cubing algorithm (Zhao et al’97)
      • Bottom-up computation method (Beyer & Ramakrishnan’99)
      • H-cubing technique (Han, Pei, Dong & Wang:SIGMOD’01)
    • ROLAP-based cubing algorithms
      • Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
      • Grouping is performed on some sub-aggregates as a “partial grouping step”
      • Aggregates may be computed from previously computed aggregates, rather than from the base fact table

    CS590D

     

    140  

    Cube Computation: ROLAP-Based Method (2) 

    • This is not in the textbook but in a research paper
    • Hash/sort based methods (Agarwal et. al. VLDB’96)
      • Smallest-parent: computing a cuboid  from the smallest, previously computed cuboid
      • Cache-results:  caching results of a cuboid from which other cuboids are computed to reduce disk I/Os
      • Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads
      • Share-sorts:  sharing sorting costs across multiple cuboids when a sort-based method is used
      • Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used

    CS590D

     

    141  

    Multi-way Array Aggregation for Cube Computation 

    • Partition arrays into chunks (a small subcube which fits in memory).
    • Compressed sparse array addressing: (chunk_id, offset)
    • Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.
     

    What is the best traversing order to do multi-way aggregation? 



    [Figure: a 3-D array with dimensions A (a0-a3), B (b0-b3) and C (c0-c3), partitioned into 64 chunks numbered 1-64.]


    CS590D

     

    142  

    Multi-way Array Aggregation for Cube Computation 

    [Figure: the same chunked 3-D array (dimensions A, B, C), highlighting the order in which chunks are visited for multi-way aggregation.]


    CS590D

     

    143  

    Multi-way Array Aggregation for Cube Computation 

    [Figure: the same chunked 3-D array (dimensions A, B, C), continuing the multi-way aggregation traversal.]


    CS590D

     

    144  

    Multi-Way Array Aggregation for Cube Computation (Cont.) 

    • Method: the planes should be sorted and computed according to their size in ascending order.
      • See the details of Example 2.12 (pp. 75-78)
      • Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
    • Limitation of the method: works well only for a small number of dimensions
      • If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored

    CS590D

     

    145  

    Indexing OLAP Data: Bitmap Index 

    • Index on a particular column
    • Each value in the column has a bit vector: bit-op is fast
    • The length of the bit vector: # of records in the base table
    • The i-th bit is set if the i-th row of the base table has the value for the indexed column
    • not suitable for high cardinality domains
     

    [Figure: a base table with Region and Type columns, plus its bitmap index on Region and its bitmap index on Type.]
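
    A toy bitmap-index sketch (column values are invented):

      rows = ["Asia", "Europe", "Asia", "America", "Europe"]

      bitmaps = {}
      for i, v in enumerate(rows):
          bitmaps[v] = bitmaps.get(v, 0) | (1 << i)   # set bit i if row i has value v

      # "Region = Asia OR Region = Europe" becomes a single bit-op
      mask = bitmaps["Asia"] | bitmaps["Europe"]
      hits = [i for i in range(len(rows)) if mask >> i & 1]   # [0, 1, 2, 4]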


    CS590D

     

    146  

    Indexing OLAP Data: Join Indices 

    • Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
    • Traditional indices map the values to a list of record ids
      • It materializes relational join in JI file and speeds up relational join — a rather costly operation
    • In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.
      • E.g. fact table: Sales and two dimensions city and product
        • A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
      • Join indices can span multiple dimensions

    CS590D

     

    147  

    Efficient Processing OLAP Queries 

    • Determine which operations should be performed on the available cuboids:
      • transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
    • Determine to which materialized cuboid(s) the relevant operations should be applied.
    • Exploring indexing structures and compressed vs. dense array structures in MOLAP

    CS590D

     

    148  

    Iceberg Cube 

    • Computing only the cuboid cells whose count or other aggregate satisfies the condition:

    HAVING COUNT(*) >= minsup

    • Motivation
      • Only a small portion of cube cells may be “above the water’’ in a sparse cube
      • Only calculate “interesting” data—data above certain threshold
      • Suppose 100 dimensions, only 1 base cell.  How many aggregate (non-base) cells if count >= 1?  What about count >= 2?

    CS590D

     

    149  

    Bottom-Up Computation (BUC) 

    • BUC (Beyer & Ramakrishnan, SIGMOD’99)
    • Bottom-up vs. top-down?—depending on how you view it!
    • Apriori property:
      • Aggregate the data, then move to the next level
      • If minsup is not met, stop!
    • If minsup = 1 ⇒ compute full CUBE!

    CS590D

     

    150  

    Partitioning     

    • Usually, entire data set can’t fit in 
      main memory
    • Sort distinct values, partition into blocks 
      that fit
    • Continue processing
    • Optimizations
      • Partitioning
        • External Sorting, Hashing, Counting Sort
      • Ordering dimensions to encourage pruning
        • Cardinality, Skew, Correlation
      • Collapsing duplicates
        • Can’t do holistic aggregates anymore!

    CS590D

     

    151  

    Drawbacks of BUC 

    • Requires a significant amount of memory
      • On par with most other CUBE algorithms though
    • Does not obtain good performance with dense CUBEs
    • Overly skewed data or a bad choice of dimension ordering reduces performance
    • Cannot compute iceberg cubes with complex measures

      CREATE CUBE Sales_Iceberg AS

      SELECT month, city, cust_grp,

          AVG(price), COUNT(*)

      FROM Sales_Infor

      CUBE BY month, city, cust_grp

      HAVING AVG(price) >= 800 AND

          COUNT(*) >= 50

    CS590D

     

    152  

    Non-Anti-Monotonic Measures 

    • The cubing query with avg is non-anti-monotonic!
      • (Mar, *, *, 600, 1800) fails the HAVING clause
      • (Mar, *, Bus, 1300, 360) passes the clause
     

     Month | City | Cust_grp | Prod    | Cost | Price
     Jan   | Tor  | Edu      | Printer | 500  | 485
     Jan   | Tor  | Hld      | TV      | 800  | 1200
     Jan   | Tor  | Edu      | Camera  | 1160 | 1280
     Feb   | Mon  | Bus      | Laptop  | 1500 | 2500
     Mar   | Van  | Edu      | HD      | 540  | 520
     …     | …    | …        | …       | …    | …

    CREATE CUBE Sales_Iceberg AS

    SELECT month, city, cust_grp,

          AVG(price), COUNT(*)

    FROM Sales_Infor

    CUBE BY month, city, cust_grp

    HAVING AVG(price) >= 800 AND

          COUNT(*) >= 50


    CS590D

     

    153  

    Top-k Average 

    • Let (*, Van, *) cover 1,000 records
      • Avg(price) is the average price of those 1,000 sales
      • Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price)
    • Top-k average is anti-monotonic
      • If the top 50 sales in Van. have avg(price) <= 800, then the top 50 sales in Van. during Feb. must also have avg(price) <= 800
     



    CS590D

     

    154  

    Binning for Top-k Average 

    • Computing top-k avg is costly with large k
    • Binning idea
      • Avg50(c) >= 800
      • Large value collapsing: use a sum and a count to summarize records with measure >= 800
        • If count >= 50 (= k), no need to check the “small” records
      • Small value binning: a group of bins
        • One bin covers a range, e.g., 600~800, 400~600, etc.
        • Register a sum and a count for each bin

    CS590D

     

    155  

    Approximate top-k average 

     Suppose for (*, Van, *) we have the following binned summary:

       Range     Sum     Count
       Over 800  28000   20
       600~800   10600   15
       400~600   15200   30
       …         …       …

     Approximate avg50() = (28000 + 10600 + 600*15) / 50 = 952

     The top 50 records are the 20 records over 800 (exact sum 28000), the 15
     records in 600~800 (exact sum 10600), and 15 more drawn from the 400~600
     bin, each bounded above by 600. Since 952 >= 800, the cell may pass the
     HAVING clause, so it cannot be pruned.
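
     One way to compute this bound from the bins (a Python sketch under my
     own representation of the bins):

       def approx_avg_top_k(bins, k=50):
           # bins: (bin_sum, bin_count, upper_edge) triples, ordered from the
           # highest value range down; upper_edge is None for the exact
           # "large value" bin whose true sum we kept.
           remaining, total = k, 0.0
           for bin_sum, bin_cnt, upper in bins:
               take = min(remaining, bin_cnt)
               if take == bin_cnt:
                   total += bin_sum         # whole bin fits into the top k
               elif upper is not None:
                   total += take * upper    # bound a partial bin by its upper edge
               else:
                   total += bin_sum         # partial top bin: its full sum still bounds it
               remaining -= take
               if remaining == 0:
                   break
           return total / k

       bins = [(28000, 20, None), (10600, 15, 800), (15200, 30, 600)]
       print(approx_avg_top_k(bins))        # (28000 + 10600 + 15*600) / 50 = 952.0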


    CS590D

     

    156  

    Quant-info for Top-k Average Binning 

    • Accumulate quant-info for cells to compute average iceberg cubes efficiently
      • Three pieces: sum, count, top-k bins
      • Use top-k bins to estimate/prune descendants
      • Use sum and count to consolidate current cell
     

     Condition strength (strongest → weakest):

       avg() >= v:                 not anti-monotonic
       real avg50() >= v:          anti-monotonic, but computationally costly
       approximate avg50() >= v:   anti-monotonic and efficient to compute


    CS590D

     

    157  

    An Efficient Iceberg Cubing Method: Top-k H-Cubing 

    • One can revise Apriori or BUC to compute a top-k avg iceberg cube, giving top-k Apriori and top-k BUC
    • Can we compute iceberg cubes more efficiently?
    • Top-k H-Cubing: an efficient method for computing iceberg cubes with the average measure
    • H-tree: a hyper-tree structure
    • H-cubing: computing iceberg cubes using the H-tree

    CS590D

     

    158  

    H-tree: A Prefix Hyper-tree 

     Month  City  Cust_grp  Prod     Cost  Price
     Jan    Tor   Edu       Printer   500    485
     Jan    Tor   Hhd       TV        800   1200
     Jan    Tor   Edu       Camera   1160   1280
     Feb    Mon   Bus       Laptop   1500   2500
     Mar    Van   Edu       HD        540    520
     …      …     …         …         …      …

     [Figure: H-tree for the table above. Tuples are inserted as root-to-leaf
     paths in the attribute order Cust_grp → Month → City (e.g., root → edu →
     Jan → Tor). Each node carries quant-info (e.g., the edu/Jan/Tor node has
     Sum: 1765, Cnt: 2, plus bins). A header table lists every attribute
     value with its own quant-info (e.g., Edu: Sum 2285) and a side-link
     chaining all tree nodes labeled with that value.]
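
     A minimal construction sketch consistent with this figure (Python; the
     class and field names are my own, not from the paper):

       class HNode:
           def __init__(self, label):
               self.label, self.children = label, {}
               self.sum, self.cnt = 0.0, 0            # per-node quant-info

       def build_htree(rows, dims, measure):
           # rows: list of dicts; dims: attribute order, e.g. ["Cust_grp", "Month", "City"]
           root, header = HNode("root"), {}
           for r in rows:                              # a single database scan
               node = root
               for d in dims:
                   v = r[d]
                   child = node.children.get(v)
                   if child is None:
                       child = node.children[v] = HNode(v)
                       header.setdefault((d, v), []).append(child)  # side-link
                   child.sum += r[measure]             # roll quant-info into the path
                   child.cnt += 1
                   node = child
           return root, header

     Each tuple adds at most one node per attribute, which is where the
     node-count bound on the next slide comes from.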


    CS590D

     

    159  

    Properties of H-tree 

    • Construction cost: a single database scan
    • Completeness: It contains the complete information needed for computing the iceberg cube
    • Compactness: # of nodes ≤ n*m + 1
      • n: # of tuples in the table
      • m: # of attributes

    CS590D

     

    160  

    Computing Cells Involving Dimension City 

     [Figure: the H-tree and its City-level header table. Following the
     side-link for “Tor” visits every Toronto node in the tree; merging
     their quant-info yields the cell (*, *, Tor). A new local header
     table HTor, built from the nodes on those paths, is then used to
     step from (*, *, Tor) to (*, Jan, Tor).]
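
     Continuing the earlier sketch, the side-links make this aggregation a
     simple chain walk (hypothetical helper over the same assumed structures):

       def cell_quant_info(header, dim, value):
           # Merge quant-info from every tree node labeled `value` on `dim`,
           # giving the cell (*, ..., value, ..., *).
           nodes = header.get((dim, value), [])
           return sum(n.sum for n in nodes), sum(n.cnt for n in nodes)

       # e.g. cell_quant_info(header, "City", "Tor") -> quant-info of (*, *, Tor)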


    CS590D

     

    161  

    Computing Cells Involving Month But No City 

     [Figure: the same H-tree, with the city level rolled up into its month
     parents.]

     1. Roll up quant-info: merge each city node’s quant-info into its month
        parent
     2. Compute cells involving month but no city, using the month-level
        header table

     Top-k OK mark: if the quant-info in a child already passes the top-k avg
     threshold, so do its ancestors; no binning is needed for them!


    CS590D

     

    162  

    Computing Cells Involving Only Cust_grp 

     [Figure: the same H-tree and its header table.]

     Check the header table directly: the quant-info stored with each
     Cust_grp value (e.g., Edu: Sum 2285) already gives the cells
     (Edu, *, *), (Hhd, *, *), and (Bus, *, *).


    CS590D

     

    163  

    Properties of H-Cubing 

    • Space cost
      • one H-tree
      • a stack of up to (m-1) header tables
    • One database scan
    • Main-memory-based tree traversal and side-link updates
    • Top-k OK marking

    CS590D

     

    164  

    Scalability w.r.t. Count Threshold (No min_avg Setting)


    CS590D

     

    165  

    Computing Iceberg Cubes with Other Complex Measures 

    • Computing other complex measures
      • Key point: find a function which is weaker but ensures certain anti-monotonicity
    • Examples
      • Avg() ≤ v: avg_k(c) ≤ v (bottom-k avg)
      • Avg() ≥ v only (no count): max(price) ≥ v
      • Sum(profit) (profit can be negative):
        • p_sum(c) ≥ v if p_count(c) ≥ k; otherwise, sum_k(c) ≥ v
      • Others: conjunctions of multiple conditions
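
    A tiny illustration of the "weaker but anti-monotonic" idea (Python; my
    own toy framing):

      # If avg(price) >= v holds for a cell, then max(price) >= v holds too,
      # and max never grows when a cell is specialized. So "max(price) < v"
      # is a safe test for pruning the cell and all of its descendants.
      def prunable_for_avg(prices, v):
          return max(prices) < v     # assumes a non-empty cell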

    CS590D

     

    166  

    Discussion: Other Issues 

    • Computing iceberg cubes with more complex measures?
      • No general answer for holistic measures, e.g., median, mode, rank
      • Still a research theme even for complex algebraic functions, e.g., standard_dev, variance
    • Dynamic vs. static computation of iceberg cubes
      • v and k are only known at query time
      • Precompute with reasonably low parameters that cover most nontrivial queries
    • Memory hog? What if the cube is too big to fit in memory? Project first, then cube

    CS590D

     

    167  

    Condensed Cube 

    • W. Wang, H. Lu, J. Feng, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE’02
    • Iceberg cubes cannot solve all the problems
      • Suppose 100 dimensions and only 1 base cell with count = 10. How many aggregate (non-base) cells pass count >= 10? All 2^100 - 1 of them: every generalization of the base cell has count 10
    • Condensed cube
      • Only needs to store one cell (a1, a2, …, a100, 10), which represents all the corresponding aggregate cells
      • Advantage: a fully precomputed cube, with no decompression needed at query time
      • Efficient computation of the minimal condensed cube
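
    A quick sanity check of that count (Python; the figure is exact):

      # One base cell in d dimensions generalizes to every cell obtained by
      # replacing any nonempty subset of its values with '*': 2**d - 1 cells.
      d = 100
      print(2**d - 1)   # 1267650600228229401496703205375 aggregate cells, all with count 10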

    CS590D

     

    168  

    References (I) 

    • S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi.  On the computation of multidimensional aggregates.  VLDB’96
    • D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97.
    • R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases.  ICDE’97
    • K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD’99.
    • S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
    • OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.
    • G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB’2001
    • J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals.  Data Mining and Knowledge Discovery, 1:29-54, 1997.

    CS590D

     

    169  

    References (II) 

    • J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01
    • V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96
    • Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.
    • K. Ross and D. Srivastava.  Fast computation of sparse datacubes. VLDB’97.
    • K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98.
    • S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
    • E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons, 1997.
    • W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE’02.
    • Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD’97.

    CS590D
