CS590D:
Data Mining
Chris Clifton
January 14, 2006
Data Preparation
2
Data Preprocessing
- Why preprocess
the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept
hierarchy generation
- Summary
3
Why Data
Preprocessing?
- Data in the real world is
dirty
- incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names
- e.g., Age=“42” Birthday=“03/07/1997”
- e.g., Was rating “1,2,3”,
now rating “A, B, C”
- e.g., discrepancy between
duplicate records
4
Why Is
Data Dirty?
- Incomplete data comes from
- n/a data value when collected
- different considerations between the time when the data was collected and when it is analyzed
- human/hardware/software problems
- Noisy data comes from the
process of data
- collection
- entry
- transmission
- Inconsistent data comes from
- Different data sources
- Functional dependency violation
5
Why Is
Data Preprocessing Important?
- No quality data, no quality
mining results!
- Quality decisions must be
based on quality data
- e.g., duplicate or missing
data may cause incorrect or even misleading statistics.
- Data warehouse needs consistent
integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a
data warehouse. —Bill Inmon
6
Multi-Dimensional
Measure of Data Quality
- A well-accepted multidimensional
view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
- Broad categories:
- intrinsic, contextual, representational,
and accessibility.
7
Major
Tasks in Data Preprocessing
- Data cleaning
- Fill in missing values, smooth
noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
- Integration of multiple databases,
data cubes, or files
- Data transformation
- Normalization and aggregation
- Data reduction
- Obtains reduced representation
in volume but produces the same or similar analytical results
- Data discretization
- Part of data reduction but
with particular importance, especially for numerical data
8
Forms
of data preprocessing
9
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept
hierarchy generation
- Summary
10
Data Cleaning
- Importance
- “Data cleaning is one of
the three biggest problems in data warehousing”—Ralph Kimball
- “Data cleaning is the number
one problem in data warehousing”—DCI survey
- Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth
out noisy data
- Correct inconsistent data
- Resolve redundancy caused
by data integration
11
Missing
Data
- Data is not always available
- E.g., many tuples have no
recorded value for several attributes, such as customer income in sales
data
- Missing data may be due to
- equipment malfunction
- inconsistent with other recorded
data and thus deleted
- data not entered due to misunderstanding
- certain data may not be considered
important at the time of entry
- failure to register history or changes of the data
- Missing data may need to
be inferred.
12
How to
Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value
manually: tedious + infeasible?
- Fill it in automatically with
- a global constant : e.g.,
“unknown”, a new class?!
- the attribute mean
- the attribute mean for all
samples belonging to the same class: smarter
- the most probable
value: inference-based such as Bayesian formula or decision tree
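A minimal sketch of the automatic fill-in strategies above, assuming a pandas DataFrame df with a numeric column income and a class-label column class (the names are illustrative):

import pandas as pd

def fill_missing(df):
    out = df.copy()
    # attribute mean
    out['income_mean'] = out['income'].fillna(out['income'].mean())
    # attribute mean for all samples of the same class (smarter)
    out['income_cls'] = out['income'].fillna(
        out.groupby('class')['income'].transform('mean'))
    return out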
CS590D:
Data Mining
Chris Clifton
January 17, 2006
Data Preparation
14
What is
Data?
- Collection of data objects
and their attributes
- An attribute is a property
or characteristic of an object
- Examples: eye color of a person,
temperature, etc.
- Attribute is also known as
variable, field, characteristic, or feature
- A collection of attributes
describe an object
- Object is also known as record,
point, case, sample, entity, or instance
[Figure: a table whose columns are Attributes and whose rows are Objects.]
15
Attribute
Values
- Attribute values are numbers
or symbols assigned to an attribute
- Distinction between attributes
and attribute values
- Same attribute can be mapped
to different attribute values
- Example: height can be measured
in feet or meters
- Different attributes can be
mapped to the same set of values
- Example: Attribute values
for ID and age are integers
- But properties of attribute
values can be different
- ID has no limit but age has
a maximum and minimum value
16
Measurement
of Length
- The way you measure an attribute may not match the attribute's properties.
17
Types
of Attributes
- There are different types
of attributes
- Nominal
- Examples: ID numbers, eye
color, zip codes
- Ordinal
- Examples: rankings (e.g.,
taste of potato chips on a scale from 1-10), grades, height in {tall,
medium, short}
- Interval
- Examples: calendar dates,
temperatures in Celsius or Fahrenheit.
- Ratio
- Examples: temperature in Kelvin,
length, time, counts
18
Properties
of Attribute Values
- The type of an attribute
depends on which of the following properties it possesses:
- Distinctness: =
- Order: < >
- Addition: + -
- Multiplication: * /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness
& order
- Interval attribute: distinctness,
order & addition
- Ratio attribute: all 4 properties
19
Attribute Type / Description / Examples / Operations

- Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
- Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >) Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
- Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
- Ratio: For ratio variables, both differences and ratios are meaningful. (*, /) Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
20
Attribute Level / Transformation / Comments

- Nominal: Any permutation of values. If all employee ID numbers were reassigned, would it make any difference?
- Ordinal: An order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
- Interval: new_value = a * old_value + b, where a and b are constants. Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
- Ratio: new_value = a * old_value. Length can be measured in meters or feet.
21
Discrete
and Continuous Attributes
- Discrete Attribute
- Has only a finite or countably
infinite set of values
- Examples: zip codes, counts,
or the set of words in a collection of documents
- Often represented as integer
variables.
- Note: binary attributes are
a special case of discrete attributes
- Continuous Attribute
- Has real numbers as attribute
values
- Examples: temperature, height,
or weight.
- Practically, real values
can only be measured and represented using a finite number of digits.
- Continuous attributes are
typically represented as floating-point variables.
22
Types
of data sets
- Record
- Data Matrix
- Document Data
- Transaction Data
- Graph
- World Wide Web
- Molecular Structures
- Ordered
- Spatial Data
- Temporal Data
- Sequential Data
- Genetic Sequence Data
23
Important
Characteristics of Structured Data
- Resolution
- Patterns depend on the scale
24
Record
Data
- Data that consists of a collection
of records, each of which consists of a fixed set of attributes
25
Data Matrix
- If data objects have the
same fixed set of numeric attributes, then the data objects can be thought
of as points in a multi-dimensional space, where each dimension represents
a distinct attribute
- Such data set can be represented
by an m by n matrix, where there are m rows, one for each object, and
n columns, one for each attribute
26
Document
Data
- Each document becomes a `term'
vector,
- each term is a component
(attribute) of the vector,
- the value of each component
is the number of times the corresponding term occurs in the document.
27
Transaction
Data
- A special type of record
data, where
- each record (transaction)
involves a set of items.
- For example, consider a grocery
store. The set of products purchased by a customer during one
shopping trip constitute a transaction, while the individual products
that were purchased are the items.
28
Graph
Data
- Examples: Generic graph and
HTML Links
29
Chemical
Data
30
Ordered
Data
- Sequences of transactions
An element of the sequence
Items/Events
31
Ordered
Data
32
Ordered
Data
Average Monthly Temperature of land
and ocean
33
Data Quality
- What kinds of data quality
problems?
- How can we detect problems
with the data?
- What can we do about these
problems?
- Examples of data quality
problems:
- Noise and outliers
- missing values
- duplicate data
34
Noise
- Noise refers to modification
of original values
- Examples: distortion of a
person’s voice when talking on a poor phone and “snow” on television
screen
Two Sine Waves
Two Sine Waves + Noise
35
Outliers
- Outliers are data objects
with characteristics that are considerably different than most of the
other data objects in the data set
36
Missing
Values
- Reasons for missing values
- Information is not collected
(e.g., people decline to give their age and weight)
- Attributes may not be applicable
to all cases
(e.g., annual income is not applicable to children)
- Handling missing values
- Eliminate Data Objects
- Estimate Missing Values
- Ignore the Missing Value During
Analysis
- Replace with all possible
values (weighted by their probabilities)
37
Duplicate
Data
- Data set may include data
objects that are duplicates, or almost duplicates of one another
- Major issue when merging data from heterogeneous sources
- Examples:
- Same person with multiple
email addresses
- Data cleaning
- Process of dealing with duplicate
data issues
38
Noisy
Data
- Noise: random error or variance
in a measured variable
- Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitation
- inconsistency in naming convention
- Other data problems which
requires data cleaning
- duplicate records
- incomplete data
- inconsistent data
39
How to
Handle Noisy Data?
- Binning method:
- first sort data and partition
into (equi-depth) bins
- then one can smooth
by bin means, smooth by bin median, smooth by bin boundaries, etc.
- Clustering
- detect and remove outliers
- Combined computer and human
inspection
- detect suspicious values and
check by human (e.g., deal with possible outliers)
- Regression
- smooth by fitting the data
into regression functions
40
Simple
Discretization Methods: Binning
- Equal-width (distance) partitioning:
- Divides the range into N intervals
of equal size: uniform grid
- if A and B are the lowest
and highest values of the attribute, the width of intervals will be:
W = (B –A)/N.
- The most straightforward,
but outliers may dominate presentation
- Skewed data is not handled
well.
- Equal-depth (frequency) partitioning:
- Divides the range into N
intervals, each containing approximately same number of samples
- Good data scaling
- Managing categorical attributes
can be tricky.
41
Binning
Methods for Data Smoothing
- Sorted data (e.g., by price)
- 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
- Partition into (equi-depth)
bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
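A sketch of two of these smoothings on the same numbers, assuming numpy (equi-depth bins of four sorted values each, as above; bin means are shown rounded on the slide):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.sort(prices).reshape(3, 4)                     # equi-depth partition

means = np.repeat(bins.mean(axis=1), 4).reshape(3, 4)    # smooth by bin means
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
bounds = np.where(bins - lo <= hi - bins, lo, hi)        # smooth by bin boundaries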
42
Cluster
Analysis
43
Regression
[Figure: regression line y = x + 1 in the (x, y) plane; a value X1 with observed Y1 is smoothed to Y1' on the line.]
44
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration
and transformation
- Data reduction
- Discretization and concept
hierarchy generation
- Summary
45
Data Integration
- Data integration:
- combines data from multiple
sources into a coherent store
- Schema integration
- integrate metadata from different
sources
- Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data
value conflicts
- for the same real world entity,
attribute values from different sources are different
- possible reasons: different
representations, different scales, e.g., metric vs. British units
46
Handling
Redundancy in Data Integration
- Redundant data occur often when integrating multiple databases
- The same attribute may have
different names in different databases
- One attribute may be a “derived”
attribute in another table, e.g., annual revenue
- Redundant data may be able
to be detected by correlational analysis
- Careful integration of the
data from multiple sources may help reduce/avoid redundancies and inconsistencies
and improve mining speed and quality
47
Data Transformation
- Smoothing: remove noise from
data
- Aggregation: summarization,
data cube construction
- Generalization: concept hierarchy
climbing
- Normalization: scaled to
fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
- Attribute/feature construction
- New attributes constructed
from the given ones
CS590D:
Data Mining
Chris Clifton
January 18, 2005
Data Preparation
49
Data Transformation:
Normalization
- normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
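A compact sketch of the three normalizations named earlier (min-max, z-score, decimal scaling), assuming a 1-D numpy array v with a nonzero maximum absolute value:

import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # smallest integer j such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10.0 ** j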
50
Z-Score
(Example)
[Table: attribute values v and their z-scores v' = (v − Avg) / sdev for two example attributes. For the first attribute, Avg = 34.3 and sdev = 55.9; e.g., v = 250 gives v' = (250 − 34.3) / 55.9 = 3.87, and v = 4 gives v' = -.55.]
51
Data Preprocessing
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
52
Aggregation
- Combining two or more attributes
(or objects) into a single attribute (or object)
- Purpose
- Data reduction
- Reduce the number of attributes
or objects
- Change of scale
- Cities aggregated into regions,
states, countries, etc
- More “stable” data
- Aggregated data tends to
have less variability
53
Aggregation
Standard Deviation of Average Monthly
Precipitation
Standard Deviation of Average Yearly
Precipitation
Variation of Precipitation in Australia
CS490D:
Introduction to Data Mining
Chris Clifton
January 26, 2004
Data Preparation
55
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept
hierarchy generation
- Summary
56
Data Reduction
Strategies
- A data warehouse may store
terabytes of data
- Complex data analysis/mining
may take a very long time to run on the complete data set
- Data reduction
- Obtain a reduced representation
of the data set that is much smaller in volume but yet produce the same
(or almost the same) analytical results
- Data reduction strategies
- Data cube
aggregation
- Dimensionality
reduction — remove unimportant
attributes
- Data Compression
- Numerosity
reduction — fit data into models
- Discretization and concept hierarchy generation
57
Data Cube
Aggregation
- The lowest level of a data
cube
- the aggregated data for an individual entity of interest
- e.g., a customer in a phone
calling data warehouse.
- Multiple levels of aggregation
in data cubes
- Further reduce the size of
data to deal with
- Reference appropriate levels
- Use the smallest representation
which is enough to solve the task
- Queries regarding aggregated
information should be answered using data cube, when possible
58
Dimensionality
Reduction
- Feature selection (i.e.,
attribute subset selection):
- Select a minimum set of features
such that the probability distribution of different classes given the
values for those features is as close as possible to the original distribution
given the values of all features
- reduces the number of patterns, making them easier to understand
- Heuristic methods (due to
exponential # of choices):
- step-wise forward selection
- step-wise backward elimination
- combining forward selection
and backward elimination
- decision-tree induction
59
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: induced tree splitting on A4, then A1 and A6, with Class 1 / Class 2 leaves.]
Reduced attribute set: {A1, A4, A6}
60
Heuristic
Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature
selection methods:
- Best single features under
the feature independence assumption: choose by significance tests.
- Best step-wise feature selection:
- The best single-feature is
picked first
- Then the next best feature conditioned on the first, ...
- Step-wise feature elimination:
- Repeatedly eliminate the worst
feature
- Best combined feature selection
and elimination:
- Optimal branch and bound:
- Use feature elimination and
backtracking
61
Data Compression
- String compression
- There are extensive theories
and well-tuned algorithms
- Typically lossless
- But only limited manipulation
is possible without expansion
- Audio/video compression
- Typically lossy compression,
with progressive refinement
- Sometimes small fragments
of signal can be reconstructed without reconstructing the whole
- Time sequence is not audio
- Typically short and varies slowly with time
62
Data Compression
[Figure: lossless compression maps Original Data to Compressed Data and back exactly; lossy compression recovers only an approximation of the Original Data.]
63
Wavelet
Transformation
- Discrete wavelet transform
(DWT): linear signal processing, multiresolutional analysis
- Compressed approximation:
store only a small fraction of the strongest of the wavelet coefficients
- Similar to discrete Fourier
transform (DFT), but better lossy compression, localized in space
- Method:
- Length, L, must be an integer
power of 2 (padding with 0s, when necessary)
- Each transform has 2 functions:
smoothing, difference
- Applies to pairs of data, resulting in two sets of data of length L/2
- Applies the two functions recursively until reaching the desired length
Haar2
Daubechie4
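A toy sketch of the recursive smoothing/difference step for the Haar transform described above (input length assumed padded to a power of 2; normalization conventions vary by implementation):

import numpy as np

def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    detail = []
    while len(x) > 1:
        detail.append((x[0::2] - x[1::2]) / 2)   # difference function
        x = (x[0::2] + x[1::2]) / 2              # smoothing function, length halves
    return x, detail   # keep only the strongest coefficients for compression

approx, coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])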
64
DWT for
Image Compression
[Figure: a repeated low-pass/high-pass filter bank applied recursively for DWT-based image compression.]
65
Curse
of Dimensionality
- When dimensionality increases,
data becomes increasingly sparse in the space that it occupies
- Definitions of density and
distance between points, which is critical for clustering and outlier
detection, become less meaningful
- Randomly generate 500
points
- Compute difference between
max and min distance between any pair of points
66
Dimensionality
Reduction
- Purpose:
- Avoid curse of dimensionality
- Reduce amount of time and
memory required by data mining algorithms
- Allow data to be more easily
visualized
- May help to eliminate irrelevant
features or reduce noise
- Techniques
- Principal Component Analysis
- Singular Value Decomposition
- Others: supervised and non-linear
techniques
67
[Figure: data in the (X1, X2) plane with principal axes Y1 and Y2.]
Principal
Component Analysis
68
Dimensionality
Reduction: PCA
- Goal is to find a projection
that captures the largest amount of variation in data
[Figure: data in (x1, x2) with the first principal component e.]
69
Dimensionality
Reduction: PCA
- Find the eigenvectors of
the covariance matrix
- The eigenvectors define the
new space
[Figure: the same data; the eigenvector e defines the new space.]
70
Principal
Component Analysis
- Given N data vectors from
k-dimensions, find c ≤ k orthogonal vectors that can be
best used to represent data
- The original data set is reduced
to one consisting of N data vectors on c principal components (reduced
dimensions)
- Each data vector is a linear
combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions
is large
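A minimal numpy sketch of this procedure (center the data, take the covariance matrix's top-c eigenvectors, project):

import numpy as np

def pca(X, c):
    Xc = X - X.mean(axis=0)                  # center the N x k data matrix
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = np.argsort(vals)[::-1][:c]         # directions of largest variance
    return Xc @ vecs[:, top]                 # N data vectors on c principal components

X = np.random.default_rng(0).normal(size=(100, 5))
Y = pca(X, 2)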
71
Dimensionality
Reduction: PCA
72
Dimensionality
Reduction: ISOMAP
- Construct a neighbourhood
graph
- For each pair of points in
the graph, compute the shortest path distances – geodesic distances
By: Tenenbaum,
de Silva, Langford (2000)
73
Feature
Subset Selection
- Another way to reduce dimensionality
of data
- Redundant features
- duplicate much or all of the
information contained in one or more other attributes
- Example: purchase price of
a product and the amount of sales tax paid
- Irrelevant features
- contain no information that
is useful for the data mining task at hand
- Example: students' ID is often
irrelevant to the task of predicting students' GPA
74
Feature
Subset Selection
- Techniques:
- Brute-force approach:
- Try all possible feature subsets
as input to data mining algorithm
- Embedded approaches:
- Feature selection occurs
naturally as part of the data mining algorithm
- Filter approaches:
- Features are selected before
data mining algorithm is run
- Wrapper approaches:
- Use the data mining algorithm
as a black box to find best subset of attributes
CS590D:
Data Mining
Chris Clifton
January 24, 2006
Data Preparation
76
Feature
Creation
- Create new attributes that
can capture the important information in a data set much more efficiently
than the original attributes
- Three general methodologies:
- Feature Extraction
- Mapping Data to New Space
- Feature Construction
77
Mapping
Data to a New Space
Two Sine Waves
Two Sine Waves + Noise
Frequency
Fourier transform
Wavelet transform
78
Discretization
Using Class Labels
3 categories for both x and y
5 categories for both x and y
79
Discretization
Without Using Class Labels
Data
Equal interval width
Equal frequency
K-means
80
Attribute
Transformation
- A function that maps the
entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new
values
- Simple functions: x^k, log(x), e^x, |x|
- Standardization and Normalization
81
Numerosity
Reduction
- Parametric methods
- Assume the data fits some
model, estimate model parameters, store only the parameters, and discard
the data (except possible outliers)
- Log-linear models: obtain
value at a point in m-D space as the product on appropriate marginal
subspaces
- Non-parametric methods
- Do not assume models
- Major families: histograms,
clustering, sampling
82
Regression
and Log-Linear Models
- Linear regression: Data are
modeled to fit a straight line
- Often uses the least-square
method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates
discrete multidimensional probability distributions
83
Regress
Analysis and Log-Linear Models
- Linear regression: Y = α + β X
- Two parameters, α and β, specify the line and are to be estimated by using the data at hand
- using the least squares criterion on the known values of Y1, Y2, …, X1, X2, ….
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above.
- Log-linear models:
- The multi-way table of joint probabilities is approximated by a product of lower-order tables.
- Probability: p(a, b, c, d) = αab βac χad δbcd
84
Histograms
- A popular data reduction
technique
- Divide data into buckets
and store average (sum) for each bucket
- Can be constructed optimally
in one dimension using dynamic programming
- Related to quantization problems.
85
Clustering
- Partition data set into clusters,
and one can store cluster representation only
- Can be very effective if
data is clustered but not if data is “smeared”
- Can have hierarchical clustering
and be stored in multi-dimensional index tree structures
- There are many choices of
clustering definitions and clustering algorithms, further detailed in
Chapter 8
86
Hierarchical
Reduction
- Use multi-resolution structure
with different degrees of reduction
- Hierarchical clustering is
often performed but tends to define partitions of data sets rather than
“clusters”
- Parametric methods are usually
not amenable to hierarchical representation
- Hierarchical aggregation
- An index tree hierarchically
divides a data set into partitions by value range of some attributes
- Each partition can be considered
as a bucket
- Thus an index tree with aggregates
stored at each node is a hierarchical histogram
87
Sampling
- Allow a mining algorithm
to run in complexity that is potentially sub-linear to the size of the
data
- Choose a representative subset of the data
- Simple random sampling may
have very poor performance in the presence of skew
- Develop adaptive sampling
methods
- Stratified sampling:
- Approximate the percentage
of each class (or subpopulation of interest) in the overall database
- Used in conjunction with skewed
data
- Sampling may not reduce database
I/Os (page at a time).
88
Sampling
[Figure: Raw Data reduced by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement).]
89
Sampling
[Figure: Raw Data vs. a Cluster/Stratified Sample.]
90
Sampling
- Sampling is the main technique
employed for data selection.
- It is often used for both
the preliminary investigation of the data and the final data analysis.
- Statisticians sample because obtaining
the entire set of data of interest is too expensive or time consuming.
- Sampling is used in data
mining because processing the entire set of data of interest is too expensive
or time consuming.
91
Sampling
…
- The key principle for effective
sampling is the following:
- using a sample will work almost
as well as using the entire data sets, if the sample is representative
- A sample is representative
if it has approximately the same property (of interest) as the original
set of data
92
Types
of Sampling
- Simple Random Sampling
- There is an equal probability
of selecting any particular item
- Sampling without replacement
- As each item is selected,
it is removed from the population
- Sampling with replacement
- Objects are not removed from
the population as they are selected for the sample.
- In sampling with replacement,
the same object can be picked up more than once
- Stratified sampling
- Split the data into several
partitions; then draw random samples from each partition
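The three schemes above in a few lines of numpy (the strata here are an assumed, given partition of the data):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1000)

srswor = rng.choice(data, size=100, replace=False)  # without replacement
srswr = rng.choice(data, size=100, replace=True)    # same object may repeat

strata = data % 4                                   # assumed partition into 4 strata
strat_sample = np.concatenate(
    [rng.choice(data[strata == s], size=25, replace=False) for s in range(4)])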
93
Sample
Size
[Figure: the same data set sampled at 8000, 2000, and 500 points.]
94
Sample
Size
- What sample size is necessary to get at least one object from each of 10 groups?
95
Similarity
and Dissimilarity
- Similarity
- Numerical measure of how alike
two data objects are.
- Is higher when objects are
more alike.
- Often falls in the range [0,1]
- Dissimilarity
- Numerical measure of how different
are two data objects
- Lower when objects are more
alike
- Minimum dissimilarity is often
0
- Upper limit varies
- Proximity refers to a similarity
or dissimilarity
96
Similarity/Dissimilarity
for Simple Attributes
p and q are the attribute
values for two data objects.
97
Euclidean
Distance
- Euclidean Distance: dist(p, q) = sqrt( Σ_k (p_k − q_k)² ), summing over k = 1..n
- where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
- Standardization is necessary, if scales differ.
98
Euclidean
Distance
Distance Matrix
99
Minkowski
Distance
- Minkowski Distance is a generalization of Euclidean Distance: dist(p, q) = ( Σ_k |p_k − q_k|^r )^(1/r), summing over k = 1..n
- where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
100
Minkowski
Distance: Examples
- r = 1. City block (Manhattan,
taxicab, L1 norm) distance.
- A common example of this is
the Hamming distance, which is just the number of bits that are different
between two binary vectors
- r = 2. Euclidean distance
- r → ∞. “supremum” (L_max norm, L_∞ norm) distance.
- This is the maximum difference
between any component of the vectors
- Do not confuse r with n,
i.e., all these distances are defined for all numbers of dimensions.
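A sketch covering the three special cases above; r is the Minkowski parameter, not the number of dimensions:

import numpy as np

def minkowski(p, q, r):
    d = np.abs(np.asarray(p, float) - np.asarray(q, float))
    if np.isinf(r):
        return d.max()                 # supremum (L_inf) distance
    return (d ** r).sum() ** (1.0 / r)

minkowski([0, 0], [3, 4], 1)        # 7.0, city block
minkowski([0, 0], [3, 4], 2)        # 5.0, Euclidean
minkowski([0, 0], [3, 4], np.inf)   # 4.0, supremum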
101
Minkowski
Distance
Distance Matrix
102
Mahalanobis
Distance
For red points, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
mahalanobis(p, q) = (p − q) Σ^(-1) (p − q)^T, where Σ is the covariance matrix of the input data X
103
Mahalanobis
Distance
Covariance matrix Σ (from the figure; Σ = [0.3 0.2; 0.2 0.3] is consistent with the distances below)
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
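A sketch of the squared-form Mahalanobis distance used above, with Σ estimated from a data matrix X whose rows are points (the sample data below is illustrative, so it will not reproduce the slide's exact values):

import numpy as np

def mahalanobis(p, q, X):
    cov = np.cov(np.asarray(X, float), rowvar=False)   # Σ of the input data
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(d @ np.linalg.inv(cov) @ d)           # (p-q) Σ^-1 (p-q)^T

X = np.array([[0.5, 0.5], [0.0, 1.0], [1.5, 1.5], [1.0, 0.5]])
mahalanobis([0.5, 0.5], [0.0, 1.0], X)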
104
Common
Properties of a Distance
- Distances, such as the Euclidean
distance, have some well known properties.
- d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
- d(p, q) = d(q, p) for all p and q. (Symmetry)
- d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
- A distance that satisfies
these properties is a metric
105
Common
Properties of a Similarity
- Similarities, also have some
well known properties.
- s(p, q) = 1
(or maximum similarity) only if p = q.
- s(p, q) = s(q, p)
for all p and q. (Symmetry)
where s(p, q)
is the similarity between points (data objects), p and q.
106
Similarity
Between Binary Vectors
- Common situation is that
objects, p and q, have only binary attributes
- Compute similarities using
the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
- Simple Matching and Jaccard
Coefficients
SMC = number
of matches / number of attributes
= (M11 + M00) / (M01 + M10
+ M11 + M00)
J = number of 11 matches
/ number of not-both-zero attributes values
= (M11)
/ (M01 + M10 + M11)
107
SMC versus
Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01
+ M10 + M11 + M00) = (0+7) / (2+1+0+7)
= 0.7
J = (M11) / (M01
+ M10 + M11) = 0 / (2 + 1 + 0) = 0
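The same example, computed directly in numpy (binary vectors p and q from the slide):

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = int(((p == 1) & (q == 1)).sum())   # 0
m00 = int(((p == 0) & (q == 0)).sum())   # 7
m10 = int(((p == 1) & (q == 0)).sum())   # 1
m01 = int(((p == 0) & (q == 1)).sum())   # 2

smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # 0.7
jaccard = m11 / (m01 + m10 + m11)             # 0.0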
108
Cosine
Similarity
- If d1
and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
-
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos( d1,
d2 ) = .3150
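The same computation in numpy (with ||d2|| = sqrt(6) ≈ 2.449):

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], float)

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))   # ≈ 0.3150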
109
Extended
Jaccard Coefficient (Tanimoto)
- Variation of Jaccard for continuous or count attributes: EJ(p, q) = (p • q) / (||p||² + ||q||² − p • q)
- Reduces to Jaccard for binary attributes
110
Correlation
- Correlation measures the
linear relationship between objects
- To compute correlation, we
standardize data objects, p and q, and then take their dot product
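A sketch of exactly this recipe: standardize both objects, then take the dot product (dividing by n gives Pearson's correlation under the population standard deviation):

import numpy as np

def correlation(p, q):
    ps = (p - p.mean()) / p.std()
    qs = (q - q.mean()) / q.std()
    return float(ps @ qs) / len(p)

p = np.array([1.0, 2, 3, 4, 5])
q = np.array([2.0, 4, 5, 4, 6])
correlation(p, q)     # about 0.85: a strong positive linear relationship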
111
Visually
Evaluating Correlation
Scatter plots showing the similarity
from –1 to 1.
112
General
Approach for Combining Similarities
- Sometimes attributes are
of many different types, but an overall similarity is needed.
113
Using
Weights to Combine Similarities
- May not want to treat all
attributes the same.
- Use weights w_k which are between 0 and 1 and sum to 1.
114
Density
- Density-based clustering requires a notion of density
- Examples:
- Euclidean density
- Euclidean density = number
of points per unit volume
115
Euclidean
Density – Cell-based
- Simplest approach is to divide
region into a number of rectangular cells of equal volume and define
density as # of points the cell contains
116
Euclidean
Density – Center-based
- Euclidean density is the
number of points within a specified radius of the point
117
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
and concept hierarchy generation
- Summary
118
Discretization
- Three types of attributes:
- Nominal — values from an
unordered set
- Ordinal — values from an
ordered set
- Continuous — real numbers
- Discretization:
- divide the range of a continuous
attribute into intervals
- Some classification algorithms
only accept categorical attributes.
- Reduce data size by discretization
- Prepare for further analysis
119
Discretization
and Concept Hierarchy
- Discretization
- reduce the number of values
for a given continuous attribute by dividing the range of the attribute
into intervals. Interval labels can then be used to replace actual data
values
- Concept hierarchies
- reduce the data by collecting
and replacing low level concepts (such as numeric values for the attribute
age) by higher level concepts (such as young, middle-aged, or senior)
CS490D:
Introduction to Data Mining
Chris Clifton
January 28, 2004
Data Preparation
121
Discretization
and Concept Hierarchy Generation for Numeric Data
- Binning (see sections before)
- Histogram analysis (see sections
before)
- Clustering analysis (see
sections before)
- Entropy-based discretization
- Segmentation by natural partitioning
122
Definition
of Entropy
- Example: Coin Flip
- A_X = {heads, tails}
- P(heads) = P(tails) = ½
- each outcome contributes -½ log2(½) = ½
- H(X) = ½ + ½ = 1
- What about a two-headed coin?
- Conditional Entropy: H(Y|X) = Σ_x p(x) H(Y|X = x)
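A sketch of H(X) for the coin examples above:

import numpy as np

def entropy(probs):
    p = np.asarray(probs, float)
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())

entropy([0.5, 0.5])   # fair coin: 1.0 bit
entropy([1.0])        # two-headed coin: 0.0 bits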
123
Entropy-Based
Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to partitions obtained until some stopping criterion is met, e.g., when the information gain Ent(S) − E(S, T) falls below a threshold δ; a sketch of one split follows below.
- Experiments show that it may reduce data size and improve classification accuracy
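A sketch of one level of the recursion: scan candidate boundaries T and keep the one minimizing the weighted entropy E(S, T) (values and class labels below are illustrative):

import numpy as np

def ent(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_boundary(values, labels):
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                            # boundary must separate values
        e = (i * ent(y[:i]) + (len(v) - i) * ent(y[i:])) / len(v)
        if e < best_e:
            best_t, best_e = (v[i] + v[i - 1]) / 2, e
    return best_t, best_e

best_boundary([1, 2, 3, 10, 11, 12], ['a', 'a', 'a', 'b', 'b', 'b'])   # T = 6.5, E = 0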
124
Segmentation
by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals; a sketch follows after this list.
- If an interval covers 3, 6,
7 or 9 distinct values at the most significant digit, partition the
range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct
values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct
values at the most significant digit, partition the range into 5 intervals
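A sketch of the interval-count choice at the heart of the rule (the surrounding msd rounding and Min/Max adjustment, shown in the next slide's example, are omitted):

def rule_345(n_distinct_msd):
    # distinct values at the most significant digit -> number of equi-width intervals
    if n_distinct_msd in (3, 6, 7, 9):
        return 3
    if n_distinct_msd in (2, 4, 8):
        return 4
    if n_distinct_msd in (1, 5, 10):
        return 5
    return None

def equi_width(low, high, n):
    w = (high - low) / n
    return [(low + i * w, low + (i + 1) * w) for i in range(n)]

equi_width(-1000, 2000, rule_345(3))   # (-1000, 0), (0, 1000), (1000, 2000)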
125
Example
of 3-4-5 Rule
Step 1: from the profit data, Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, so Low rounds to -$1,000 and High to $2,000
Step 3: partition (-$1,000 - $2,000) into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000); since Max = $4,700 exceeds $2,000, add ($2,000 - $5,000), and trim the first interval to (-$400 - 0) to fit Min, giving the overall range (-$400 - $5,000)
Step 4: recursively apply the rule within each interval:
- (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
- (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
- ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
- ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
126
Concept
Hierarchy Generation for Categorical Data
- Specification of a partial
ordering of attributes explicitly at the schema level by users or experts
- street < city < state < country
- Specification of a portion
of a hierarchy by explicit data grouping
- {Urbana, Champaign, Chicago} < Illinois
- Specification of a set of
attributes.
- System automatically generates
partial ordering by analysis of the number of distinct values
- E.g., street < city < state < country
- Specification of only a partial
set of attributes
- E.g., only street < city,
not others
127
Automatic
Concept Hierarchy Generation
- Some concept hierarchies
can be automatically generated based on the analysis of the number of
distinct values per attribute in the given data set
- The attribute with the most
distinct values is placed at the lowest level of the hierarchy
- Note: Exception—weekday,
month, quarter, year
country (15 distinct values)
province_or_state (65 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
128
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept
hierarchy generation
- Summary
129
Summary
- Data preparation is
a big issue for both warehousing and mining
- Data preparation includes
- Data cleaning and data integration
- Data reduction and feature
selection
- Discretization
- A lot of methods have been developed, but this is still an active area of research
130
References
- E. Rahm and H. H. Do. Data
Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical
Committee on Data Engineering. Vol.23, No.4
- D. P. Ballou and G. K. Tayi.
Enhancing data quality in data warehouse environments. Communications
of ACM, 42:73-78, 1999.
- H.V. Jagadish et al., Special
Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), December 1997.
- A. Maydanchik, Challenges
of Efficient Data Cleansing (DM Review - Data Quality resource portal)
- D. Pyle. Data Preparation
for Data Mining. Morgan Kaufmann, 1999.
- D. Quass. A Framework for
research in Data Cleaning. (Draft 1999)
- V. Raman and J. Hellerstein.
Potters Wheel: An Interactive Framework for Data Cleaning and Transformation,
VLDB’2001.
- T. Redman. Data Quality:
Management and Technology. Bantam Books, New York, 1992.
- Y. Wand and R. Wang. Anchoring
data quality dimensions ontological foundations. Communications of ACM,
39:86-95, 1996.
- R. Wang, V. Storey, and C.
Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995.
- http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf
CS590D:
Data Mining
Chris Clifton
January 20, 2005
Data Cubes
132
A Sample
Data Cube
[Figure: a 3-D data cube with dimensions Date (1Qtr-4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico), with sum cells along each dimension up to the apex (All, All, All); one highlighted cell holds the total annual sales of TVs in the U.S.A.]
133
Cuboids
Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product, date, country
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)
134
Browsing
a Data Cube
- Visualization
- OLAP capabilities
- Interactive manipulation
135
Typical
OLAP Operations
- Roll up (drill-up): summarize data
- by climbing up hierarchy
or by dimension reduction
- Drill down
(roll down): reverse of roll-up
- from higher level summary
to lower level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate):
- reorient the cube, visualization,
3D to series of 2D planes.
- Other operations
- drill across: involving (across) more than one fact table
- drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
136
A Star-Net
Query Model
Each dimension line radiates from the center; each abstraction level on a line is a circle, called a footprint:
- Customer Orders: ORDER, CONTRACTS
- Shipping Method: AIR-EXPRESS, TRUCK
- Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE
- Organization: SALES PERSON, DISTRICT, DIVISION
- Promotion
- Location: CITY, COUNTRY, REGION
- Time: DAILY, QTRLY, ANNUALY
137
Efficient
Data Cube Computation
- Data cube can be viewed as
a lattice of cuboids
- The bottom-most cuboid is
the base cuboid
- The top-most cuboid (apex)
contains only one cell
- How many cuboids are there in an n-dimensional cube with L levels? T = Π_{i=1..n} (L_i + 1)
- Materialization of data cube
- Materialize every (cuboid)
(full materialization), none
(no materialization), or some
(partial materialization)
- Selection of which cuboids
to materialize
- Based on size, sharing, access
frequency, etc.
138
Cube Operation
- Cube definition and computation
in DMQL
define
cube sales[item, city, year]: sum(sales_in_dollars)
- Transform it into a SQL-like
language (with a new operator cube
by, introduced by Gray et al.’96)
SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year
- Compute the following Group-Bys
(date, product, customer),
(date,product),(date,
customer), (product, customer),
(date),
(product), (customer)
[Lattice of group-bys: (), (item), (city), (year), (city, item), (city, year), (item, year), (city, item, year).]
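A sketch of computing all 2^n group-bys of such a cube with pandas (a DataFrame sales with columns item, city, year, amount is an assumed example):

from itertools import combinations
import pandas as pd

def cube(df, dims, measure):
    cuboids = {}
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            if combo:
                cuboids[combo] = df.groupby(list(combo))[measure].sum()
            else:
                cuboids[()] = df[measure].sum()   # apex cuboid ()
    return cuboids

# cuboids = cube(sales, ['item', 'city', 'year'], 'amount')   # 8 group-bys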
139
Cube Computation:
ROLAP-Based Method
- Efficient cube computation
methods
- ROLAP-based cubing algorithms
(Agarwal et al’96)
- Array-based cubing algorithm
(Zhao et al’97)
- Bottom-up computation method (Beyer & Ramakrishnan'99)
- H-cubing technique (Han, Pei,
Dong & Wang:SIGMOD’01)
- ROLAP-based cubing algorithms
- Sorting, hashing, and grouping
operations are applied to the dimension attributes in order to reorder
and cluster related tuples
- Grouping is performed on some
sub-aggregates as a “partial grouping step”
- Aggregates may be computed
from previously computed aggregates, rather than from the base fact
table
140
Cube Computation:
ROLAP-Based Method (2)
- This is not
in the textbook but in a research paper
- Hash/sort based methods (Agarwal et al. VLDB'96)
- Smallest-parent: computing
a cuboid from the smallest, previously computed cuboid
- Cache-results:
caching results of a cuboid from which other cuboids are computed to
reduce disk I/Os
- Amortize-scans: computing
as many as possible cuboids at the same time to amortize disk reads
- Share-sorts:
sharing sorting costs cross multiple cuboids when sort-based method
is used
- Share-partitions: sharing
the partitioning cost across multiple cuboids when hash-based algorithms
are used
141
Multi-way
Array Aggregation for Cube Computation
- Partition arrays into chunks
(a small subcube which fits in memory).
- Compressed sparse array addressing:
(chunk_id, offset)
- Compute aggregates in “multiway”
by visiting cube cells in the order which minimizes the # of times to
visit each cell, and reduces memory access and storage cost.
What is
the best traversing order to do multi-way aggregation?
[Figure: a 4x4x4-chunked array with dimensions A (a0..a3), B (b0..b3), and C (c0..c3); the 64 chunks are numbered 1 through 64 in the scanning order.]
142-143
Multi-way Array Aggregation for Cube Computation
[The same chunked-array figure, repeated across two slides to animate the traversal order.]
144
Multi-Way
Array Aggregation for Cube Computation (Cont.)
- Method: the planes should
be sorted and computed according to their size in ascending order.
- See the details of Example
2.12 (pp. 75-78)
- Idea: keep the smallest plane
in the main memory, fetch and compute only one chunk at a time for the
largest plane
- Limitation of the method: it computes well only for a small number of dimensions
- If there are a large number
of dimensions, “bottom-up computation” and iceberg cube computation
methods can be explored
145
Indexing
OLAP Data: Bitmap Index
- Index on a particular column
- Each value in the column
has a bit vector: bit-op is fast
- The length of the bit vector:
# of records in the base table
- The i-th bit is
set if the i-th row of the base table has the value for the
indexed column
- not suitable for high cardinality
domains
[Figure: a base table, a bitmap index on Region, and a bitmap index on Type; each distinct value gets one bit vector.]
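A sketch of building the bit vectors described above, one per distinct value, each as long as the base table (the Region/Type values are illustrative):

import numpy as np

def bitmap_index(column):
    return {v: np.fromiter((x == v for x in column), dtype=bool)
            for v in set(column)}

region = ['Asia', 'Europe', 'Asia', 'America', 'Europe']
rtype = ['Retail', 'Dealer', 'Dealer', 'Retail', 'Dealer']
by_region, by_type = bitmap_index(region), bitmap_index(rtype)

# fast bit-op: rows with Region = 'Asia' AND Type = 'Dealer'
hits = by_region['Asia'] & by_type['Dealer']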
146
Indexing
OLAP Data: Join Indices
- Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
- Traditional indices map
the values to a list of record ids
- It materializes relational
join in JI file and speeds up relational join — a rather costly operation
- In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.
- E.g. fact table: Sales
and two dimensions city and product
- A join index on city
maintains for each distinct city a list of R-IDs of the tuples recording
the Sales in the city
- Join indices can span multiple
dimensions
147
Efficient
Processing OLAP Queries
- Determine which operations
should be performed on the available cuboids:
- transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
- Determine to which materialized
cuboid(s) the relevant operations should be applied.
- Exploring indexing structures
and compressed vs. dense array structures in MOLAP
148
Iceberg
Cube
- Computing only the cuboid cells whose count or other aggregates satisfy the condition:
HAVING COUNT(*) >= minsup
- Motivation
- Only a small portion of cube cells may be “above the water” in a sparse cube
- Only calculate “interesting”
data—data above certain threshold
- Suppose 100 dimensions, only
1 base cell. How many aggregate (non-base) cells if count >=
1? What about count >= 2?
149
Bottom-Up
Computation (BUC)
- BUC (Beyer & Ramakrishnan,
SIGMOD’99)
- Bottom-up vs. top-down?—depending
on how you view it!
- Apriori property:
- Aggregate the data,
then move to the next level
- If minsup is not met,
stop!
- If minsup = 1 ⇒ compute full CUBE!
150
Partitioning
- Usually, entire data set
can’t fit in
main memory
- Sort distinct values,
partition into blocks
that fit
- Continue processing
- Optimizations
- Partitioning
- External Sorting, Hashing,
Counting Sort
- Ordering dimensions to encourage
pruning
- Cardinality, Skew, Correlation
- Collapsing duplicates
- Can’t do holistic aggregates
anymore!
151
Drawbacks
of BUC
- Requires a significant amount
of memory
- On par with most other CUBE
algorithms though
- Does not obtain good performance
with dense CUBEs
- Overly skewed data or a
bad choice of dimension ordering reduces performance
- Cannot compute iceberg cubes
with complex measures
- e.g., an iceberg query with a complex measure:
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50
152
Non-Anti-Monotonic
Measures
- The cubing query with avg
is non-anti-monotonic!
- (Mar, *, *, 600, 1800) fails
the HAVING clause
- (Mar, *, Bus, 1300, 360) passes
the clause
Month | City | Cust_grp | Prod | Cost | Price
Jan | Tor | Edu | Printer | 500 | 485
Jan | Tor | Hld | TV | 800 | 1200
Jan | Tor | Edu | Camera | 1160 | 1280
Feb | Mon | Bus | Laptop | 1500 | 2500
Mar | Van | Edu | HD | 540 | 520
… | … | … | … | … | …
CREATE CUBE Sales_Iceberg
AS
SELECT month, city, cust_grp,
AVG(price),
COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND
COUNT(*) >= 50
153
Top-k
Average
- Let (*, Van, *) cover 1,000
records
- Avg(price) is the average
price of those 1000 sales
- Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price)
- Top-k average is anti-monotonic
- If the top 50 sales in Van. have avg(price) <= 800, then the top 50 deals in Van. during Feb. must have avg(price) <= 800
154
Binning
for Top-k Average
- Computing top-k avg is costly
with large k
- Binning idea
- Avg50(c) >=
800
- Large value collapsing: use
a sum and a count to summarize records with measure >= 800
- If count >= k (here, 50), no need to check “small” records
- Small value binning: a group
of bins
- One bin covers a range, e.g.,
600~800, 400~600, etc.
- Register a sum and a count
for each bin
155
Approximate
top-k average
Suppose for (*, Van, *), we have the bin summary:

Range | Sum | Count
Over 800 | 28000 | 20
600~800 | 10600 | 15
400~600 | 15200 | 30
… | … | …

Top 50: approximate avg50() = (28000 + 10600 + 600*15) / 50 = 952
The cell may pass the HAVING clause
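A sketch of this estimate: take whole bins from the highest range down; for the bin that is only partially needed, bound each remaining record by that bin's upper limit (bins as in the table above):

def approx_topk_avg(bins, k):
    # bins: (upper_bound, sum, count), ordered from the highest range down
    need, total = k, 0.0
    for upper, s, c in bins:
        if c <= need:
            total += s                 # a whole bin contributes its exact sum
            need -= c
        else:
            total += upper * need      # optimistic bound for a partial bin
            need = 0
            break
    return total / k

bins = [(None, 28000, 20), (800, 10600, 15), (600, 15200, 30)]
approx_topk_avg(bins, 50)   # (28000 + 10600 + 600*15) / 50 = 952.0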
156
Quant-info
for Top-k Average Binning
- Accumulate quant-info for
cells to compute average iceberg cubes efficiently
- Three pieces: sum, count,
top-k bins
- Use top-k bins to estimate/prune
descendants
- Use sum and count to consolidate
current cell
From the strongest condition to the weakest:
- avg(): not anti-monotonic
- real avg50(): anti-monotonic, but computationally costly
- approximate avg50(): anti-monotonic, can be computed efficiently
157
An Efficient
Iceberg Cubing Method: Top-k H-Cubing
- One can revise Apriori or
BUC to compute a top-k avg iceberg cube. This leads to top-k-Apriori
and top-k BUC.
- Can we compute iceberg cube
more efficiently?
- Top-k H-cubing: an efficient
method to compute iceberg cubes with average measure
- H-tree: a hyper-tree structure
- H-cubing: computing iceberg
cubes using H-tree
158
H-tree:
A Prefix Hyper-tree
Month | City | Cust_grp | Prod | Cost | Price
Jan | Tor | Edu | Printer | 500 | 485
Jan | Tor | Hhd | TV | 800 | 1200
Jan | Tor | Edu | Camera | 1160 | 1280
Feb | Mon | Bus | Laptop | 1500 | 2500
Mar | Van | Edu | HD | 540 | 520
… | … | … | … | … | …

[Figure: the H-tree built from this table. The root branches on Cust_grp (edu, hhd, bus), then Month (Jan, Mar; Jan; Feb), then City (Tor, Van; Tor; Mon). Each node carries quant-info (e.g., Sum: 1765, Cnt: 2, plus bins), and a header table (Attr. Val. / Quant-Info / Side-link, e.g., Sum: 2285 for Edu) threads side-links through all nodes with the same attribute value.]
159
Properties
of H-tree
- Construction cost: a single
database scan
- Completeness: It contains
the complete information needed for computing the iceberg cube
- Compactness: # of nodes ≤ n*m + 1
- n: # of tuples in the table
- m: # of attributes
160
Computing
Cells Involving Dimension City
[Figure: the H-tree from the previous slide. A new header table H_Tor is built for the nodes with City = Tor; its side-links make it possible to step from cell (*, *, Tor) to cells such as (*, Jan, Tor).]
From (*, *, Tor) to (*, Jan, Tor)
161
Computing
Cells Involving Month But No City
[Figure: the H-tree with quant-info rolled up from the City nodes into their Month parents.]
- Roll up quant-info
- Compute cells involving month but no city
- Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so does its parent. No binning is needed!
162
Computing
Cells Involving Only Cust_grp
[Figure: the H-tree and its header table (Attr. Val. / Quant-Info / Side-link), e.g., Sum: 2285 for Edu.]
- Check header table directly
163
Properties
of H-Cubing
- Space cost
- an H-tree
- a stack of up to (m-1) header
tables
- One database scan
- Main memory-based tree traversal
& side-links updates
- Top-k_OK marking
164
Scalability
w.r.t. Count Threshold (No min_avg Setting)
165
Computing
Iceberg Cubes with Other Complex Measures
- Computing other complex measures
- Key point: find a function
which is weaker but ensures certain anti-monotonicity
- Examples
- Avg() ≤ v: avg_k(c) ≤ v (bottom-k avg)
- Avg() ≥ v only (no count): max(price) ≥ v
- Sum(profit) (profit can be negative):
- p_sum(c) ≥ v if p_count(c) ≥ k; or otherwise, sum_k(c) ≥ v
- Others: conjunctions of multiple
conditions
166
Discussion:
Other Issues
- Computing iceberg cubes with
more complex measures?
- No general answer for holistic
measures, e.g., median, mode, rank
- A research theme even for
complex algebraic functions, e.g., standard_dev, variance
- Dynamic vs. static computation of iceberg cubes
- v and k are only available
at query time
- Setting reasonably low parameters
for most nontrivial cases
- Memory-hog? what if the cubing
is too big to fit in memory?—projection and then cubing
167
Condensed
Cube
- W. Wang, H. Lu, J. Feng, J.
X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size.
ICDE’02.
- Iceberg cube cannot solve all the problems
- Suppose 100 dimensions, only
1 base cell with count = 10. How many aggregate (non-base) cells
if count >= 10?
- Condensed cube
- Only need to store one cell
(a1, a2, …, a100, 10), which represents
all the corresponding aggregate cells
- Adv.
- Fully precomputed cube without
compression
- Efficient computation of the
minimal condensed cube
168
References
(I)
- S. Agarwal,
R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,
and S. Sarawagi. On the computation of multidimensional aggregates.
VLDB’96
- D. Agrawal, A. E. Abbadi,
A. Singh, and T. Yurek. Efficient view maintenance in data warehouses.
SIGMOD’97.
- R. Agrawal, A. Gupta, and
S. Sarawagi. Modeling multidimensional databases. ICDE’97
- K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99.
- S. Chaudhuri
and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997.
- OLAP council. MDAPI specification
version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.
- G. Dong,
J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained
Gradients in Data Cubes. VLDB’2001
- J. Gray, S. Chaudhuri, A.
Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh.
Data cube: A relational aggregation operator generalizing group-by,
cross-tab and sub-totals. Data Mining and Knowledge Discovery,
1:29-54, 1997.
169
References
(II)
- J. Han, J.
Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex
Measures. SIGMOD’01
- V. Harinarayan, A. Rajaraman,
and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96
- Microsoft. OLEDB for OLAP
programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap,
1998.
- K. Ross and D. Srivastava.
Fast computation of sparse datacubes. VLDB’97.
- K. A. Ross, D. Srivastava,
and D. Chatziantoniou. Complex aggregation at multiple granularities.
EDBT'98.
- S. Sarawagi,
R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data
cubes. EDBT'98.
- E. Thomsen. OLAP Solutions:
Building Multidimensional Information Systems. John Wiley & Sons,
1997.
- W. Wang,
H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing
Data Cube Size. ICDE’02.
- Y. Zhao,
P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous
multidimensional aggregates. SIGMOD’97.