

CS590D:  Data Mining 
Chris Clifton 

January 14, 2006

Data Preparation


 

2  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

3  

Why Data Preprocessing? 

  • Data in the real world is dirty
    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • e.g., occupation=“”
    • noisy: containing errors or outliers
      • e.g., Salary=“-10”
    • inconsistent: containing discrepancies in codes or names
      • e.g., Age=“42” Birthday=“03/07/1997”
      • e.g., Was rating “1,2,3”, now rating “A, B, C”
      • e.g., discrepancy between duplicate records

CS590D

 

4  

Why Is Data Dirty? 

  • Incomplete data comes from
    • n/a data value when collected
    • different considerations between the time when the data was collected and the time when it is analyzed
    • human/hardware/software problems
  • Noisy data comes from the process of data
    • collection
    • entry
    • transmission
  • Inconsistent data comes from
    • Different data sources
    • Functional dependency violation

CS590D

 

5  

Why Is Data Preprocessing Important? 

  • No quality data, no quality mining results!
    • Quality decisions must be based on quality data
      • e.g., duplicate or missing data may cause incorrect or even misleading statistics.
    • Data warehouse needs consistent integration of quality data
  • Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse. —Bill Inmon

CS590D

 

6  

Multi-Dimensional Measure of Data Quality 

  • A well-accepted multidimensional view:
    • Accuracy
    • Completeness
    • Consistency
    • Timeliness
    • Believability
    • Value added
    • Interpretability
    • Accessibility
  • Broad categories:
    • intrinsic, contextual, representational, and accessibility.

CS590D

 

7  

Major Tasks in Data Preprocessing 

  • Data cleaning
    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
    • Integration of multiple databases, data cubes, or files
  • Data transformation
    • Normalization and aggregation
  • Data reduction
    • Obtains reduced representation in volume but produces the same or similar analytical results
  • Data discretization
    • Part of data reduction but with particular importance, especially for numerical data

CS590D

 

8  

Forms of data preprocessing


CS590D

 

9  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

10  

Data Cleaning 

  • Importance
    • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
    • “Data cleaning is the number one problem in data warehousing”—DCI survey
  • Data cleaning tasks
    • Fill in missing values
    • Identify outliers and smooth out noisy data
    • Correct inconsistent data
    • Resolve redundancy caused by data integration

CS590D

 

11  

Missing Data 

  • Data is not always available
    • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to
    • equipment malfunction
    • inconsistent with other recorded data and thus deleted
    • data not entered due to misunderstanding
    • certain data may not be considered important at the time of entry
    • not register history or changes of the data
  • Missing data may need to be inferred.

CS590D

 

12  

How to Handle Missing Data? 

  • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
  • Fill in the missing value manually: tedious + infeasible?
  • Fill in it automatically with
    • a global constant: e.g., “unknown”, a new class?!
    • the attribute mean
    • the attribute mean for all samples belonging to the same class: smarter
    • the most probable value: inference-based such as Bayesian formula or decision tree
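
  A minimal pandas sketch of the three automatic fill strategies (illustrative only; the column and class names are invented):

    import pandas as pd

    df = pd.DataFrame({
        "income": [50.0, None, 75.0, None, 60.0, 90.0],
        "class":  ["A",  "A",  "B",  "B",  "A",  "B"],
    })

    # global constant: flag the hole rather than guess a value
    const_fill = df["income"].fillna(-1)

    # attribute mean over all samples
    mean_fill = df["income"].fillna(df["income"].mean())

    # attribute mean within the same class (smarter)
    class_fill = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))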

CS590D

CS590D:  Data Mining 
Chris Clifton 

January 17, 2006

Data Preparation


 

14  

What is Data? 

  • Collection of data objects and their attributes
  • An attribute is a property or characteristic of an object
    • Examples: eye color of a person, temperature, etc.
    • Attribute is also known as variable, field, characteristic, or feature
  • A collection of attributes describe an object
    • Object is also known as record, point, case, sample, entity, or instance
 

Attributes 

Objects


CS590D

 

15  

Attribute Values 

  • Attribute values are numbers or symbols assigned to an attribute
  • Distinction between attributes and attribute values
    • Same attribute can be mapped to different attribute values
      • Example: height can be measured in feet or meters
    • Different attributes can be mapped to the same set of values
      • Example: Attribute values for ID and age are integers
      • But properties of attribute values can be different
        • ID has no limit but age has a maximum and minimum value

CS590D

 

16  

Measurement of Length  

  • The way you measure an attribute may not match the attribute's properties.

CS590D

 

17  

Types of Attributes  

  • There are different types of attributes
    • Nominal
      • Examples: ID numbers, eye color, zip codes
    • Ordinal
      • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
    • Interval
      • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
    • Ratio
      • Examples: temperature in Kelvin, length, time, counts

CS590D

 

18  

Properties of Attribute Values  

  • The type of an attribute depends on which of the following properties it possesses:
    • Distinctness:    =, ≠
    • Order:           <, >
    • Addition:        +, -
    • Multiplication:  *, /
    • Nominal attribute: distinctness
    • Ordinal attribute: distinctness & order
    • Interval attribute: distinctness, order & addition
    • Ratio attribute: all 4 properties

CS590D

 

19  

Attribute Type | Description | Examples | Operations

Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test

Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests

Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

 


CS590D

 

20  

Attribute Level | Transformation | Comments

Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?

Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio | new_value = a * old_value | Length can be measured in meters or feet.

 


CS590D

 

21  

Discrete and Continuous Attributes  

  • Discrete Attribute
    • Has only a finite or countably infinite set of values
    • Examples: zip codes, counts, or the set of words in a collection of documents
    • Often represented as integer variables.  
    • Note: binary attributes are a special case of discrete attributes
  • Continuous Attribute
    • Has real numbers as attribute values
    • Examples: temperature, height, or weight. 
    • Practically, real values can only be measured and represented using a finite number of digits.
    • Continuous attributes are typically represented as floating-point variables.

CS590D

 

22  

Types of data sets  

  • Record
    • Data Matrix
    • Document Data
    • Transaction Data
  • Graph
    • World Wide Web
    • Molecular Structures
  • Ordered
    • Spatial Data
    • Temporal Data
    • Sequential Data
    • Genetic Sequence Data

CS590D

 

23  

Important Characteristics of Structured Data 

    • Dimensionality
      • Curse of Dimensionality
    • Sparsity 
      • Only presence counts
    • Resolution 
      • Patterns depend on the scale

CS590D

 

24  

Record Data  

  • Data that consists of a collection of records, each of which consists of a fixed set of attributes

CS590D

 

25  

Data Matrix  

  • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

CS590D

 

26  

Document Data 

  • Each document becomes a `term' vector,
    • each term is a component (attribute) of the vector,
    • the value of each component is the number of times the corresponding term occurs in the document.

CS590D

 

27  

Transaction Data 

  • A special type of record data, where
    • each record (transaction) involves a set of items. 
    • For example, consider a grocery store.  The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.

CS590D

 

28  

Graph Data  

  • Examples: Generic graph and HTML Links

CS590D

 

29  

Chemical Data  

  • Benzene Molecule: C6H6

CS590D

 

30  

Ordered Data  

  • Sequences of transactions
 

An element of the sequence 

Items/Events


CS590D

 

31  

Ordered Data  

  • Genomic sequence data

CS590D

 

32  

Ordered Data 

  • Spatio-Temporal Data
 

Average Monthly Temperature of land and ocean


CS590D

 

33  

Data Quality  

  • What kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems:
    • Noise and outliers
    • missing values
    • duplicate data

CS590D

 

34  

Noise 

  • Noise refers to modification of original values
    • Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
 
 
 

Two Sine Waves 

Two Sine Waves + Noise


CS590D

 

35  

Outliers 

  • Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

CS590D

 

36  

Missing Values 

  • Reasons for missing values
    • Information is not collected  
      (e.g., people decline to give their age and weight)
    • Attributes may not be applicable to all cases  
      (e.g., annual income is not applicable to children)
  • Handling missing values
    • Eliminate Data Objects
    • Estimate Missing Values
    • Ignore the Missing Value During Analysis
    • Replace with all possible values (weighted by their probabilities)

CS590D

 

37  

Duplicate Data 

  • Data set may include data objects that are duplicates, or almost duplicates of one another
    • Major issue when merging data from heterogeneous sources
  • Examples:
    • Same person with multiple email addresses
  • Data cleaning
    • Process of dealing with duplicate data issues

CS590D

 

38  

Noisy Data 

  • Noise: random error or variance in a measured variable
  • Incorrect attribute values may be due to
    • faulty data collection instruments
    • data entry problems
    • data transmission problems
    • technology limitation
    • inconsistency in naming convention
  • Other data problems which require data cleaning
    • duplicate records
    • incomplete data
    • inconsistent data

CS590D

 

39  

How to Handle Noisy Data? 

  • Binning method:
    • first sort data and partition into (equi-depth) bins
    • then one can smooth by bin means,  smooth by bin median, smooth by bin boundaries, etc.
  • Clustering
    • detect and remove outliers
  • Combined computer and human inspection
    • detect suspicious values and check by human (e.g., deal with possible outliers)
  • Regression
    • smooth by fitting the data into regression functions

CS590D

 

40  

Simple Discretization Methods: Binning 

  • Equal-width (distance) partitioning:
    • Divides the range into N intervals of equal size: uniform grid
    • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
    • The most straightforward, but outliers may dominate presentation
    • Skewed data is not handled well.
  • Equal-depth (frequency) partitioning:
    • Divides the range into N intervals, each containing approximately same number of samples
    • Good data scaling
    • Managing categorical attributes can be tricky.

CS590D

 

41  

Binning Methods for Data Smoothing 

  • Sorted data (e.g., by price)
    • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equi-depth) bins:
    • Bin 1: 4, 8, 9, 15
    • Bin 2: 21, 21, 24, 25
    • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
    • Bin 1: 9, 9, 9, 9
    • Bin 2: 23, 23, 23, 23
    • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
    • Bin 1: 4, 4, 4, 15
    • Bin 2: 21, 21, 25, 25
    • Bin 3: 26, 26, 26, 34
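
  The same smoothing as a short Python sketch (reproduces the three bins above):

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
    depth = 4
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # smoothing by bin means: each value becomes its bin's (rounded) mean
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # smoothing by bin boundaries: each value snaps to the nearer of min/max
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]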

CS590D

 

42  

Cluster Analysis


CS590D

 

43  

Regression 



y = x + 1 

X1 

Y1 

Y1’


CS590D

 

44  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

45  

Data Integration 

  • Data integration:
    • combines data from multiple sources into a coherent store
  • Schema integration
    • integrate metadata from different sources
    • Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
  • Detecting and resolving data value conflicts
    • for the same real world entity, attribute values from different sources are different
    • possible reasons: different representations, different scales, e.g., metric vs. British units

CS590D

 

46  

Handling Redundancy in Data Integration 

  • Redundant data occur often when integrating multiple databases
    • The same attribute may have different names in different databases
    • One attribute may be a “derived” attribute in another table, e.g., annual revenue
  • Redundant data can often be detected by correlation analysis
  • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

CS590D

 

47  

Data Transformation 

  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube construction
  • Generalization: concept hierarchy climbing
  • Normalization: scaled to fall within a small, specified range
    • min-max normalization
    • z-score normalization
    • normalization by decimal scaling
  • Attribute/feature construction
    • New attributes constructed from the given ones

CS590D

CS590D:  Data Mining 
Chris Clifton 

January 18, 2005

Data Preparation


 

49  

Data Transformation: Normalization 

  • min-max normalization:

     v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

  • z-score normalization:

     v' = (v - mean_A) / stand_dev_A

  • normalization by decimal scaling:

     v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
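
A NumPy sketch of the three normalizations (the sample values are made up):

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # min-max normalization to a new range [0, 1]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization
    v_zscore = (v - v.mean()) / v.std()

    # decimal scaling: divide by 10^j, with j the smallest integer
    # such that max(|v'|) < 1
    j = 0
    while np.abs(v).max() / 10 ** j >= 1:
        j += 1
    v_decimal = v / 10 ** j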


CS590D

 

50  

Z-Score (Example) 

[Table: worked z-score example — columns of raw values v and their normalized values v' = (v - Avg) / sdev; e.g., the first attribute has Avg = 34.3 and sdev = 55.9, so v = 20 maps to v' = -.26 and v = 250 maps to v' = 3.87.]


CS590D

 

51  

Data Preprocessing 

  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

 


CS590D

 

52  

Aggregation 

  • Combining two or more attributes (or objects) into a single attribute (or object)
  • Purpose
    • Data reduction
      • Reduce the number of attributes or objects
    • Change of scale
      • Cities aggregated into regions, states, countries, etc
    • More “stable” data
      • Aggregated data tends to have less variability

CS590D

 

53  

Aggregation 

Standard Deviation of Average Monthly Precipitation 

Standard Deviation of Average Yearly Precipitation 

Variation of Precipitation in Australia


CS590D

CS490D: 
Introduction to Data Mining 
Chris Clifton 

January 26, 2004

Data Preparation


 

55  

Data Preprocessing 

  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

CS590D

 

56  

Data Reduction Strategies 

  • A data warehouse may store terabytes of data
    • Complex data analysis/mining may take a very long time to run on the complete data set
  • Data reduction
    • Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
  • Data reduction strategies
    • Data cube aggregation
    • Dimensionality reduction — remove unimportant attributes
    • Data Compression
    • Numerosity reduction — fit data into models
    • Discretization and concept hierarchy generation

CS590D

 

57  

Data Cube Aggregation 

  • The lowest level of a data cube
    • the aggregated data for an individual entity of interest
    • e.g., a customer in a phone calling data warehouse.
  • Multiple levels of aggregation in data cubes
    • Further reduce the size of data to deal with
  • Reference appropriate levels
    • Use the smallest representation which is enough to solve the task
  • Queries regarding aggregated information should be answered using data cube, when possible

CS590D

 

58  

Dimensionality Reduction 

  • Feature selection (i.e., attribute subset selection):
    • Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
    • reduces the number of patterns, which are then easier to understand
  • Heuristic methods (due to exponential # of choices):
    • step-wise forward selection
    • step-wise backward elimination
    • combining forward selection and backward elimination
    • decision-tree induction

CS590D

 

59  

Initial attribute set:

{A1, A2, A3, A4, A5, A6} 

[Figure: decision tree with A4? at the root, branching to A1? and A6?, whose leaves are Class 1 and Class 2.]


Reduced attribute set:  {A1, A4, A6} 

Example of 
Decision Tree Induction


CS590D

 

60  

Heuristic Feature Selection Methods 

  • There are 2^d possible sub-features of d features
  • Several heuristic feature selection methods:
    • Best single features under the feature independence assumption: choose by significance tests.
    • Best step-wise feature selection:
      • The best single-feature is picked first
      • Then the next best feature conditioned on the first, ...
    • Step-wise feature elimination:
      • Repeatedly eliminate the worst feature
    • Best combined feature selection and elimination:
    • Optimal branch and bound:
      • Use feature elimination and backtracking

CS590D

 

61  

Data Compression 

  • String compression
    • There are extensive theories and well-tuned algorithms
    • Typically lossless
    • But only limited manipulation is possible without expansion
  • Audio/video compression
    • Typically lossy compression, with progressive refinement
    • Sometimes small fragments of signal can be reconstructed without reconstructing the whole
  • Time sequence is not audio
    • Typically short and varies slowly with time

CS590D

 

62  

Data Compression 

Original Data 

Compressed

Data 

lossless 

Original Data

Approximated  

lossy


CS590D

 

63  

Wavelet Transformation  

  • Discrete wavelet transform (DWT): linear signal processing, multiresolutional analysis
  • Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
  • Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
  • Method:
    • Length, L, must be an integer power of 2 (padding with 0s, when necessary)
    • Each transform has 2 functions: smoothing, difference
    • Applies to pairs of data, resulting in two sets of data of length L/2
    • Applies the two functions recursively, until it reaches the desired length
 

  
 

Haar2 

Daubechie4
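
A minimal sketch of the smoothing/difference recursion for the Haar case (the input signal is illustrative):

    # one level pairs up the data: averages (smoothing) and differences;
    # recurse on the averages until length 1, keeping coefficients coarsest-first
    def haar(signal):
        if len(signal) == 1:
            return list(signal)
        avgs  = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]
        return haar(avgs) + diffs

    print(haar([2, 2, 0, 2, 3, 5, 4, 4]))
    # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]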


CS590D

 

64  

DWT for Image Compression 

  • Image
 

  Low Pass       High Pass 

     Low Pass       High Pass 

Low Pass    High Pass


CS590D

 

65  

Curse of Dimensionality 

  • When dimensionality increases, data becomes increasingly sparse in the space that it occupies
  • Definitions of density and distance between points, which is critical for clustering and outlier detection, become less meaningful
 
  • Randomly generate 500 points
  • Compute difference between max and min distance between any pair of points

CS590D

 

66  

Dimensionality Reduction 

  • Purpose:
    • Avoid curse of dimensionality
    • Reduce amount of time and memory required by data mining algorithms
    • Allow data to be more easily visualized
    • May help to eliminate irrelevant features or reduce noise
  • Techniques
    • Principal Component Analysis
    • Singular Value Decomposition
    • Others: supervised and non-linear techniques

CS590D

 

67  

X1 

X2 

Y1 

Y2 

Principal Component Analysis


CS590D

 

68  

Dimensionality Reduction: PCA 

  • Goal is to find a projection that captures the largest  amount of variation in data
 
 

x2 

x1 

e


CS590D

 

69  

Dimensionality Reduction: PCA 

  • Find the eigenvectors of the covariance matrix
  • The eigenvectors define the new space
 

x2 

x1 

e


CS590D

 

70  

Principal Component Analysis  

  • Given N data vectors from k-dimensions, find c ≤  k  orthogonal vectors that can be best used to represent data
    • The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
  • Each data vector is a linear combination of the c principal component vectors
  • Works for numeric data only
  • Used when the number of dimensions is large
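
A minimal NumPy sketch of PCA as described above (random data stands in for a real table):

    import numpy as np

    X = np.random.randn(200, 5)                  # 200 objects, 5 numeric attributes
    Xc = X - X.mean(axis=0)                      # center the data
    evals, evecs = np.linalg.eigh(np.cov(Xc.T))  # eigenvectors of the covariance matrix
    order = np.argsort(evals)[::-1]              # largest variance first
    c = 2                                        # keep c principal components
    X_reduced = Xc @ evecs[:, order[:c]]         # N x c reduced representation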

CS590D

 

71  

Dimensionality Reduction: PCA


CS590D

 

72  

Dimensionality Reduction: ISOMAP 

  • Construct a neighbourhood graph
  • For each pair of points in the graph, compute the shortest path distances – geodesic distances
 

By: Tenenbaum, de Silva, Langford (2000)


CS590D

 

73  

Feature Subset Selection 

  • Another way to reduce dimensionality of data
  • Redundant features
    • duplicate much or all of the information contained in one or more other attributes
    • Example: purchase price of a product and the amount of sales tax paid
  • Irrelevant features
    • contain no information that is useful for the data mining task at hand
    • Example: students' ID is often irrelevant to the task of predicting students' GPA

CS590D

 

74  

Feature Subset Selection 

  • Techniques:
    • Brute-force approach:
      • Try all possible feature subsets as input to data mining algorithm
    • Embedded approaches:
      • Feature selection occurs naturally as part of the data mining algorithm
    • Filter approaches:
      • Features are selected before data mining algorithm is run
    • Wrapper approaches:
      • Use the data mining algorithm as a black box to find best subset of attributes

CS590D

CS590D: 
Data Mining 
Chris Clifton 

January 24, 2006

Data Preparation


 

76  

Feature Creation 

  • Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
  • Three general methodologies:
    • Feature Extraction
      • domain-specific
    • Mapping Data to New Space
    • Feature Construction
      • combining features

CS590D

 

77  

Mapping Data to a New Space 
 
 
 

Two Sine Waves 

Two Sine Waves + Noise 

Frequency 

Fourier transform

Wavelet transform


CS590D

 

78  

Discretization Using Class Labels 

  • Entropy based approach
 
 

3 categories for both x and y 

5 categories for both x and y


CS590D

 

79  

Discretization Without Using Class Labels  

Data 

Equal interval width 

Equal frequency 

K-means


CS590D

 

80  

Attribute Transformation 

  • A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
    • Simple functions: x^k, log(x), e^x, |x|
    • Standardization and Normalization

CS590D

 

81  

Numerosity Reduction 

  • Parametric methods
    • Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
    • Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces
  • Non-parametric methods
    • Do not assume models
    • Major families: histograms, clustering, sampling

CS590D

 

82  

Regression and Log-Linear Models 

  • Linear regression: Data are modeled to fit a straight line
    • Often uses the least-square method to fit the line
  • Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector
  • Log-linear model: approximates discrete multidimensional probability distributions

CS590D

 

83  

Regress Analysis and Log-Linear Models 

  • Linear regression: Y = α + β X (see the sketch after this list)
    • Two parameters, α and β, specify the line and are to be estimated by using the data at hand
    • by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
    • Many nonlinear functions can be transformed into the above
  • Log-linear models:
    • The multi-way table of joint probabilities is approximated by a product of lower-order tables
    • Probability: p(a, b, c, d) = αab βac χad δbcd
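
A least-squares fit of Y = α + β X with NumPy (the data points are invented):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([2.1, 2.9, 4.2, 4.8])

    A = np.column_stack([np.ones_like(X), X])        # design matrix [1, X]
    (alpha, beta), *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(alpha, beta)                               # roughly 1.15 and 0.94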

CS590D

 

84  

Histograms 

  • A popular data reduction technique
  • Divide data into buckets and store average (sum) for each bucket
  • Can be constructed optimally in one dimension using dynamic programming
  • Related to quantization problems.

CS590D

 

85  

Clustering 

  • Partition data set into clusters, and one can store cluster representation only
  • Can be very effective if data is clustered but not if data is “smeared”
  • Can have hierarchical clustering and be stored in multi-dimensional index tree structures
  • There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

CS590D

 

86  

Hierarchical Reduction 

  • Use multi-resolution structure with different degrees of reduction
  • Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters”
  • Parametric methods are usually not amenable to hierarchical representation
  • Hierarchical aggregation
    • An index tree hierarchically divides a data set into partitions by value range of some attributes
    • Each partition can be considered as a bucket
    • Thus an index tree with aggregates stored at each node is a hierarchical histogram

CS590D

 

87  

Sampling 

  • Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
  • Choose a representative subset of the data
    • Simple random sampling may have very poor performance in the presence of skew
  • Develop adaptive sampling methods
    • Stratified sampling:
      • Approximate the percentage of each class (or subpopulation of interest) in the overall database
      • Used in conjunction with skewed data
  • Sampling may not reduce database I/Os (page at a time).

CS590D

 

88  

SRSWOR

(simple random

sample without

replacement) 

SRSWR 

Raw Data 

Sampling


CS590D

 

89  

Sampling 

Raw Data  

Cluster/Stratified Sample


CS590D

 

90  

Sampling  

  • Sampling is the main technique employed for data selection.
    • It is often used for both the preliminary investigation of the data and the final data analysis.
  • Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
  • Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

CS590D

 

91  

Sampling …  

  • The key principle for effective sampling is the following:
    • using a sample will work almost as well as using the entire data set, if the sample is representative
    • A sample is representative if it has approximately the same property (of interest) as the original set of data 

CS590D

 

92  

Types of Sampling 

  • Simple Random Sampling
    • There is an equal probability of selecting any particular item
  • Sampling without replacement
    • As each item is selected, it is removed from the population
  • Sampling with replacement
    • Objects are not removed from the population as they are selected for the sample.  
      • In sampling with replacement, the same object can be picked more than once
  • Stratified sampling
    • Split the data into several partitions; then draw random samples from each partition
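
Minimal sketches of the three schemes with Python's random module (data and strata are made up):

    import random

    data = list(range(100))

    srswor = random.sample(data, 10)                   # without replacement
    srswr  = [random.choice(data) for _ in range(10)]  # with replacement

    # stratified: partition first, then sample each stratum proportionally
    strata = {"low": data[:80], "high": data[80:]}
    stratified = [x for part in strata.values()
                    for x in random.sample(part, max(1, len(part) // 10))]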

CS590D

 

93  

Sample Size 
 

  

[Figure: the same data set drawn with 8000, 2000, and 500 sampled points.]


CS590D

 

94  

Sample Size 

  • What sample size is necessary to get at least one object from each of 10 groups? (See the sketch below.)
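
For a with-replacement sample and 10 equally likely groups, the coverage probability can be computed by inclusion-exclusion (a sketch, not from the slides):

    from math import comb

    def p_cover_all(n, g=10):
        # P(sample of size n contains every one of g equally likely groups)
        return sum((-1) ** k * comb(g, k) * ((g - k) / g) ** n
                   for k in range(g + 1))

    for n in (10, 20, 40, 60):
        print(n, round(p_cover_all(n), 3))
    # approximately: 10 -> 0.000, 20 -> 0.215, 40 -> 0.858, 60 -> 0.982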

CS590D

 

95  

Similarity and Dissimilarity 

  • Similarity
    • Numerical measure of how alike two data objects are.
    • Is higher when objects are more alike.
    • Often falls in the range [0,1]
  • Dissimilarity
    • Numerical measure of how different are two data objects
    • Lower when objects are more alike
    • Minimum dissimilarity is often 0
    • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

CS590D

 

96  

Similarity/Dissimilarity for Simple Attributes 

p and q are the attribute values for two data objects. The standard per-type measures:

  • Nominal:  d = 0 if p = q, d = 1 if p ≠ q;  s = 1 - d
  • Ordinal:  map the values to integers 0 .. n-1, then d = |p - q| / (n - 1);  s = 1 - d
  • Interval or Ratio:  d = |p - q|;  s = -d, or e.g. s = 1 / (1 + d)


CS590D

 

97  

Euclidean Distance 

  • Euclidean Distance:

     dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k - q_k)^2 )

    • Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
  • Standardization is necessary, if scales differ. 

CS590D

 

98  

Euclidean Distance 

Distance Matrix


CS590D

 

99  

Minkowski Distance 

  • Minkowski Distance is a generalization of Euclidean Distance:

     dist(p, q) = ( Σ_{k=1}^{n} |p_k - q_k|^r )^(1/r)

       Where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q

CS590D

 

100  

Minkowski Distance: Examples 

  • r = 1.  City block (Manhattan, taxicab, L1 norm) distance.
    • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
  • r = 2.  Euclidean distance
  • r → ∞.  “supremum” (Lmax norm, L∞ norm) distance.
    • This is the maximum difference between any component of the vectors
  • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
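
All three cases in one small function (the example points are invented):

    def minkowski(p, q, r):
        if r == float("inf"):                         # supremum / L-max norm
            return max(abs(a - b) for a, b in zip(p, q))
        return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

    p, q = (0, 2), (3, 6)
    print(minkowski(p, q, 1))             # 7.0  (city block)
    print(minkowski(p, q, 2))             # 5.0  (Euclidean)
    print(minkowski(p, q, float("inf")))  # 4    (supremum)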

CS590D

 

101  

Minkowski Distance 

Distance Matrix


CS590D

 

102  

Mahalanobis Distance 

mahalanobis(p, q) = (p - q) Σ^(-1) (p - q)^T

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Σ is the covariance matrix of the input data X


CS590D

 

103  

Mahalanobis Distance 

Covariance Matrix:

   Σ = | 0.3  0.2 |
       | 0.2  0.3 |

B 

A 

C 

A: (0.5, 0.5)

B: (0, 1)

C: (1.5, 1.5) 

Mahal(A,B) = 5

Mahal(A,C) = 4
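
A NumPy check of these numbers (the slide's distance is the quadratic form above):

    import numpy as np

    sigma = np.array([[0.3, 0.2],
                      [0.2, 0.3]])       # covariance matrix from this slide
    sigma_inv = np.linalg.inv(sigma)

    def mahal(p, q):
        d = np.asarray(p) - np.asarray(q)
        return d @ sigma_inv @ d

    A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
    print(mahal(A, B))   # 5.0
    print(mahal(A, C))   # 4.0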


CS590D

 

104  

Common Properties of a Distance 

  • Distances, such as the Euclidean distance, have some well known properties.
    1. d(p, q) ≥ 0   for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
    2. d(p, q) = d(q, p)   for all p and q. (Symmetry)
    3. d(p, r) ≤ d(p, q) + d(q, r)   for all points p, q, and r. (Triangle Inequality)

   where d(p, q) is the distance (dissimilarity) between points (data objects), p and q

  • A distance that satisfies these properties is a metric

CS590D

 

105  

Common Properties of a Similarity 

  • Similarities, also have some well known properties.
    1. s(p, q) = 1 (or maximum similarity) only if p = q.  
       
    2. s(p, q) = s(q, p)   for all p and q. (Symmetry) 

   where s(p, q) is the similarity between points (data objects), p and q.

 


CS590D

 

106  

Similarity Between Binary Vectors 

  • Common situation is that objects, p and q, have only binary attributes
  • Compute similarities using the following quantities 

   M01 = the number of attributes where p was 0 and q was 1

   M10 = the number of attributes where p was 1 and q was 0

   M00 = the number of attributes where p was 0 and q was 0

   M11 = the number of attributes where p was 1 and q was 1 

  • Simple Matching and Jaccard Coefficients

   SMC =  number of matches / number of attributes

            =  (M11 + M00) / (M01 + M10 + M11 + M00)

   J = number of 11 matches / number of not-both-zero attributes values

       = (M11) / (M01 + M10 + M11)


CS590D

 

107  

SMC versus Jaccard: Example 

p =  1 0 0 0 0 0 0 0 0 0      

q =  0 0 0 0 0 0 1 0 0 1  

M01 = 2   (the number of attributes where p was 0 and q was 1)

M10 = 1   (the number of attributes where p was 1 and q was 0)

M00 = 7   (the number of attributes where p was 0 and q was 0)

M11 = 0   (the number of attributes where p was 1 and q was 1) 

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7  

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
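
The same numbers, computed directly in Python:

    p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))   # 2
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))   # 1
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))   # 7
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))   # 0

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)          # 0.7
    jac = m11 / (m01 + m10 + m11)                        # 0.0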

 


CS590D

 

108  

Cosine Similarity 

  • If d1 and d2 are two document vectors, then

             cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||),

   where • indicates the vector dot product and ||d|| is the length of vector d.

  • Example:  

       d1 =  3 2 0 5 0 0 0 2 0 0  

        d2 =  1 0 0 0 0 0 0 1 0 2  

        d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

       ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

        ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

         cos( d1, d2 ) = .3150
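
Recomputing the example in Python (note ||d2|| = sqrt(6) ≈ 2.449):

    import math

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

    dot  = sum(a * b for a, b in zip(d1, d2))   # 5
    len1 = math.sqrt(sum(a * a for a in d1))    # 6.481
    len2 = math.sqrt(sum(b * b for b in d2))    # 2.449
    print(dot / (len1 * len2))                  # 0.3150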

     


    CS590D

     

    109  

    Extended Jaccard Coefficient (Tanimoto) 

    • Variation of Jaccard for continuous or count attributes
      • Reduces to Jaccard for binary attributes

    CS590D

     

    110  

    Correlation 

    • Correlation measures the linear relationship between objects
    • To compute correlation, we standardize data objects, p and q, and then take their dot product

    CS590D

     

    111  

    Visually Evaluating Correlation 

    Scatter plots showing the similarity from –1 to 1.


    CS590D

     

    112  

    General Approach for Combining Similarities 

    • Sometimes attributes are of many different types, but an overall similarity is needed.

    CS590D

     

    113  

    Using Weights to Combine Similarities 

    • May not want to treat all attributes the same.
      • Use weights wk which are between 0 and 1 and sum to 1.

    CS590D

     

    114  

    Density 

    • Density-based clustering requires a notion of density
    • Examples: 
      • Euclidean density
        • Euclidean density = number of points per unit volume
      • Probability density  
      • Graph-based density 

    CS590D

     

    115  

    Euclidean Density – Cell-based 

    • Simplest approach is to divide region into a number of rectangular cells of equal volume and define density as # of points the cell contains

    CS590D

     

    116  

    Euclidean Density – Center-based 

    • Euclidean density is the number of points within a specified radius of the point

    CS590D

     

    117  

    Data Preprocessing 

    • Why preprocess the data?
    • Data cleaning
    • Data integration and transformation
    • Data reduction
    • Discretization and concept hierarchy generation
    • Summary

    CS590D

     

    118  

    Discretization 

    • Three types of attributes:
      • Nominal — values from an unordered set
      • Ordinal — values from an ordered set
      • Continuous — real numbers
    • Discretization:
      • divide the range of a continuous attribute into intervals
      • Some classification algorithms only accept categorical attributes.
      • Reduce data size by discretization
      • Prepare for further analysis

    CS590D

     

    119  

    Discretization and Concept Hierarchy

    • Discretization
      • reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
    • Concept hierarchies
      • reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)

    CS590D

    CS490D: 
    Introduction to Data Mining 
    Chris Clifton 

    January 28, 2004

    Data Preparation


     

    121  

    Discretization and Concept Hierarchy Generation for Numeric Data 

    • Binning (see sections before)
    • Histogram analysis (see sections before)
    • Clustering analysis (see sections before)
    • Entropy-based discretization
    • Segmentation by natural partitioning

    CS590D

     

    122  

    Definition of Entropy 

    • Entropy: H(X) = - Σ_x p(x) log2 p(x)
    • Example:  Coin Flip
      • A_X = {heads, tails}
      • P(heads) = P(tails) = ½
      • each outcome contributes -½ log2(½) = ½
      • H(X) = ½ + ½ = 1
    • What about a two-headed coin?
    • Conditional Entropy: H(Y|X) = Σ_x p(x) H(Y|X = x)

    CS590D

     

    123  

    Entropy-Based Discretization 

    • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

       E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

    • The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (see the sketch below).
    • The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., Ent(S) - E(S, T) < δ
    • Experiments show that it may reduce data size and improve classification accuracy
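
    A sketch of one split step (values and class labels are invented):

      import math
      from collections import Counter

      def entropy(labels):
          n = len(labels)
          return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

      def best_split(values, labels):
          # try the midpoint between each pair of adjacent sorted values as T
          pairs = sorted(zip(values, labels))
          best = (float("inf"), None)
          for i in range(1, len(pairs)):
              t = (pairs[i - 1][0] + pairs[i][0]) / 2
              left  = [l for v, l in pairs if v <= t]
              right = [l for v, l in pairs if v > t]
              e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
              best = min(best, (e, t))
          return best   # (E(S, T), boundary T)

      print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
      # (0.0, 6.5)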

    CS590D

     

    124  

    Segmentation by Natural Partitioning 

    • A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
      • If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
      • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
      • If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

    CS590D

     

    125  

    Example of 3-4-5 Rule 

    Step 1: examine the data range — for the profit example, Min = -$351, Low (5%-tile) = -$159, High (95%-tile) = $1,838, Max = $4,700
    Step 2: msd = 1,000; rounding Low and High to the msd gives Low' = -$1,000 and High' = $2,000
    Step 3: (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equi-width intervals: (-$1,000 - 0], (0 - $1,000], ($1,000 - $2,000]
    Step 4: adjust the boundary intervals to cover Min and Max — the first interval shrinks to (-$400 - 0] and a new interval ($2,000 - $5,000] is added; the rule is then applied recursively, e.g., (-$400 - 0] into four $100-wide intervals, (0 - $1,000] and ($1,000 - $2,000] into five intervals each, and ($2,000 - $5,000] into three $1,000-wide intervals


    CS590D

     

    126  

    Concept Hierarchy Generation for Categorical Data 

    • Specification of a partial ordering of attributes explicitly at the schema level by users or experts
      • street < city < state < country
    • Specification of a portion of a hierarchy by explicit data grouping
      • {Urbana, Champaign, Chicago} < Illinois
    • Specification of a set of attributes.
      • System automatically generates partial ordering by analysis of the number of distinct values
      • E.g., street < city <state < country
    • Specification of only a partial set of attributes
      • E.g., only street < city, not others

    CS590D

     

    127  

    Automatic Concept Hierarchy Generation 

    • Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
      • The attribute with the most distinct values is placed at the lowest level of the hierarchy
      • Note: Exception—weekday, month, quarter, year
     

    country (15 distinct values)
      → province_or_state (65 distinct values)
        → city (3,567 distinct values)
          → street (674,339 distinct values)


    CS590D

     

    128  

    Data Preprocessing 

    • Why preprocess the data?
    • Data cleaning
    • Data integration and transformation
    • Data reduction
    • Discretization and concept hierarchy generation
    • Summary

    CS590D

     

    129  

    Summary 

    • Data  preparation is a big issue for both warehousing and mining
    • Data preparation includes
      • Data cleaning and data integration
      • Data reduction and feature selection
      • Discretization
    • A lot of methods have been developed, but this is still an active area of research

    CS590D

     

    130  

    References 

    • E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4
    • D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
    • H. V. Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
    • A. Maydanchik, Challenges of Efficient Data Cleansing (DM Review - Data Quality resource portal)
    • D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
    • D. Quass. A Framework for Research in Data Cleaning. (Draft 1999)
    • V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation, VLDB 2001.
    • T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
    • Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
    • R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.
    • http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf

    CS590D

    CS590D:  Data Mining 
    Chris Clifton 

    January 20, 2005

    Data Cubes


     

    132  

    A Sample Data Cube 

    [Figure: a 3-D data cube with dimensions Date (1Qtr-4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico); sum cells aggregate along each dimension up to the apex (All, All, All), and the highlighted cell holds the total annual sales of TV in U.S.A.]


    CS590D

     

    133  

    Cuboids Corresponding to the Cube 

    0-D (apex) cuboid:  all
    1-D cuboids:  product, date, country
    2-D cuboids:  (product, date), (product, country), (date, country)
    3-D (base) cuboid:  (product, date, country)


    CS590D

     

    134  

    Browsing a Data Cube 

    • Visualization
    • OLAP capabilities
    • Interactive manipulation

    CS590D

     

    135  

    Typical OLAP Operations 

    • Roll up (drill-up): summarize data
      • by climbing up hierarchy or by dimension reduction
    • Drill down (roll down): reverse of roll-up
      • from higher level summary to lower level summary or detailed data, or introducing new dimensions
    • Slice and dice:
      • project and select
    • Pivot (rotate):
      • reorient the cube, visualization, 3D to series of 2D planes.
    • Other operations
      • drill across: involving (across) more than one fact table
      • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

    CS590D

     

    136  

    A Star-Net Query Model 

      

    [Figure: a star-net query model — one radial line per dimension, where each circle on a line is called a footprint. Location: CITY, COUNTRY, REGION; Customer: ORDER, Customer Orders, CONTRACTS; Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE; Organization: SALES PERSON, DISTRICT, DIVISION; Time: DAILY, QTRLY, ANNUALY; Shipping Method: AIR-EXPRESS, TRUCK; plus a Promotion dimension.]


    CS590D

     

    137  

    Efficient Data Cube Computation 

    • Data cube can be viewed as a lattice of cuboids 
      • The bottom-most cuboid is the base cuboid
      • The top-most cuboid (apex) contains only one cell
      • How many cuboids in an n-dimensional cube with L levels?
    • Materialization of data cube
      • Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
      • Selection of which cuboids to materialize
        • Based on size, sharing, access frequency, etc.
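
    For the question above: with L_i hierarchy levels on dimension i, the total number of cuboids is T = Π_{i=1}^{n} (L_i + 1); with no hierarchies this is 2^n, e.g., 8 cuboids for a 3-dimensional cube.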

    CS590D

     

    138  

    Cube Operation 

    • Cube definition and computation in DMQL

      define cube sales[item, city, year]: sum(sales_in_dollars)

      compute cube sales

    • Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)

      SELECT item, city, year, SUM (amount)

      FROM SALES

      CUBE BY item, city, year

    • Compute the following Group-Bys

      (date, product, customer),

      (date,product),(date, customer), (product, customer),

      (date), (product), (customer)

      ()  

    [Figure: the resulting cuboid lattice — (); (item), (city), (year); (city, item), (city, year), (item, year); (city, item, year).]


    CS590D

     

    139  

    Cube Computation: ROLAP-Based Method 

    • Efficient cube computation methods
      • ROLAP-based cubing algorithms (Agarwal et al’96)
      • Array-based cubing algorithm (Zhao et al’97)
      • Bottom-up computation method (Beyer & Ramakrishnan’99)
      • H-cubing technique (Han, Pei, Dong & Wang:SIGMOD’01)
    • ROLAP-based cubing algorithms
      • Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
      • Grouping is performed on some sub-aggregates as a “partial grouping step”
      • Aggregates may be computed from previously computed aggregates, rather than from the base fact table

    CS590D

     

    140  

    Cube Computation: ROLAP-Based Method (2) 

    • This is not in the textbook but in a research paper
    • Hash/sort based methods (Agarwal et. al. VLDB’96)
      • Smallest-parent: computing a cuboid  from the smallest, previously computed cuboid
      • Cache-results:  caching results of a cuboid from which other cuboids are computed to reduce disk I/Os
      • Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads
      • Share-sorts:  sharing sorting costs across multiple cuboids when a sort-based method is used
      • Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used

    CS590D

     

    141  

    Multi-way Array Aggregation for Cube Computation 

    • Partition arrays into chunks (a small subcube which fits in memory).
    • Compressed sparse array addressing: (chunk_id, offset)
    • Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost.
     

    What is the best traversing order to do multi-way aggregation? 



    [Figure: a 3-D array with dimensions A (a0-a3), B (b0-b3) and C (c0-c3), partitioned into 64 chunks numbered 1-64.]


    CS590D

     

    142  

    Multi-way Array Aggregation for Cube Computation 

    [Figure: the same chunked 3-D array (dimensions A, B, C), highlighting the order in which chunks are visited for multi-way aggregation.]


    CS590D

     

    143  

    Multi-way Array Aggregation for Cube Computation 

    [Figure: the same chunked 3-D array (dimensions A, B, C), continuing the multi-way aggregation traversal.]


    CS590D

     

    144  

    Multi-Way Array Aggregation for Cube Computation (Cont.) 

    • Method: the planes should be sorted and computed according to their size in ascending order.
      • See the details of Example 2.12 (pp. 75-78)
      • Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
    • Limitation of the method: works well only for a small number of dimensions
      • If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored

    CS590D

     

    145  

    Indexing OLAP Data: Bitmap Index 

    • Index on a particular column
    • Each value in the column has a bit vector: bit-op is fast
    • The length of the bit vector: # of records in the base table
    • The i-th bit is set if the i-th row of the base table has the value for the indexed column
    • not suitable for high cardinality domains
     

    [Figure: a base table with Region and Type columns, plus its bitmap index on Region and its bitmap index on Type.]
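
    A toy bitmap-index sketch (column values are invented):

      rows = ["Asia", "Europe", "Asia", "America", "Europe"]

      bitmaps = {}
      for i, v in enumerate(rows):
          bitmaps[v] = bitmaps.get(v, 0) | (1 << i)   # set bit i if row i has value v

      # "Region = Asia OR Region = Europe" becomes a single bit-op
      mask = bitmaps["Asia"] | bitmaps["Europe"]
      hits = [i for i in range(len(rows)) if mask >> i & 1]   # [0, 1, 2, 4]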


    CS590D

     

    146  

    Indexing OLAP Data: Join Indices 

    • Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
    • Traditional indices map the values to a list of record ids
      • It materializes relational join in JI file and speeds up relational join — a rather costly operation
    • In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.
      • E.g. fact table: Sales and two dimensions city and product
        • A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
      • Join indices can span multiple dimensions

    CS590D

     

    147  

    Efficient Processing OLAP Queries 

    • Determine which operations should be performed on the available cuboids:
      • transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
    • Determine to which materialized cuboid(s) the relevant operations should be applied.
    • Exploring indexing structures and compressed vs. dense array structures in MOLAP

    CS590D

     

    148  

    Iceberg Cube 

    • Computing only the cuboid cells whose count or other aggregate satisfies the condition:

    HAVING COUNT(*) >= minsup

    • Motivation
      • Only a small portion of cube cells may be “above the water’’ in a sparse cube
      • Only calculate “interesting” data—data above certain threshold
      • Suppose 100 dimensions, only 1 base cell.  How many aggregate (non-base) cells if count >= 1?  What about count >= 2?

    CS590D

     

    149  

    Bottom-Up Computation (BUC) 

    • BUC (Beyer & Ramakrishnan, SIGMOD’99)
    • Bottom-up vs. top-down?—depending on how you view it!
    • Apriori property:
      • Aggregate the data, then move to the next level
      • If minsup is not met, stop!
    • If minsup = 1 ⇒ compute full CUBE!

    CS590D

     

    150  

    Partitioning     

    • Usually, entire data set can’t fit in 
      main memory
    • Sort distinct values, partition into blocks 
      that fit
    • Continue processing
    • Optimizations
      • Partitioning
        • External Sorting, Hashing, Counting Sort
      • Ordering dimensions to encourage pruning
        • Cardinality, Skew, Correlation
      • Collapsing duplicates
        • Can’t do holistic aggregates anymore!

    CS590D

     

    151  

    Drawbacks of BUC 

    • Requires a significant amount of memory
      • On par with most other CUBE algorithms though
    • Does not obtain good performance with dense CUBEs
    • Overly skewed data or a bad choice of dimension ordering reduces performance
    • Cannot compute iceberg cubes with complex measures

      CREATE CUBE Sales_Iceberg AS

      SELECT month, city, cust_grp,

          AVG(price), COUNT(*)

      FROM Sales_Infor

      CUBE BY month, city, cust_grp

      HAVING AVG(price) >= 800 AND

          COUNT(*) >= 50

    CS590D

     

    152  

    Non-Anti-Monotonic Measures 

    • The cubing query with avg is non-anti-monotonic!
      • (Mar, *, *, 600, 1800) fails the HAVING clause
      • (Mar, *, Bus, 1300, 360) passes the clause
     

     Month | City | Cust_grp | Prod    | Cost | Price
     Jan   | Tor  | Edu      | Printer | 500  | 485
     Jan   | Tor  | Hld      | TV      | 800  | 1200
     Jan   | Tor  | Edu      | Camera  | 1160 | 1280
     Feb   | Mon  | Bus      | Laptop  | 1500 | 2500
     Mar   | Van  | Edu      | HD      | 540  | 520
     …     | …    | …        | …       | …    | …

    CREATE CUBE Sales_Iceberg AS

    SELECT month, city, cust_grp,

          AVG(price), COUNT(*)

    FROM Sales_Infor

    CUBE BY month, city, cust_grp

    HAVING AVG(price) >= 800 AND

          COUNT(*) >= 50


    CS590D

     

    153  

    Top-k Average 

    • Let (*, Van, *) cover 1,000 records
      • Avg(price) is the average price of those 1,000 sales
      • Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price)
    • Top-k average is anti-monotonic
      • If the top 50 sales in Van. have avg(price) <= 800, then the top 50 sales in Van. during Feb. must also have avg(price) <= 800
     



    CS590D

     

    154  

    Binning for Top-k Average 

    • Computing top-k avg is costly with large k
    • Binning idea
      • Avg50(c) >= 800
      • Large value collapsing: use a sum and a count to summarize records with measure >= 800
        • If count >= 50 (= k), no need to check the “small” records
      • Small value binning: a group of bins
        • One bin covers a range, e.g., 600~800, 400~600, etc.
        • Register a sum and a count for each bin

    CS590D

     

    155  

    Approximate top-k average 

     Suppose for (*, Van, *) we have the following binned summary:

       Range     Sum     Count
       Over 800  28000   20
       600~800   10600   15
       400~600   15200   30
       …         …       …

     Approximate avg50() = (28000 + 10600 + 600*15) / 50 = 952

     The top 50 records are the 20 records over 800 (exact sum 28000), the 15
     records in 600~800 (exact sum 10600), and 15 more drawn from the 400~600
     bin, each bounded above by 600. Since 952 >= 800, the cell may pass the
     HAVING clause, so it cannot be pruned.
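
     One way to compute this bound from the bins (a Python sketch under my
     own representation of the bins):

       def approx_avg_top_k(bins, k=50):
           # bins: (bin_sum, bin_count, upper_edge) triples, ordered from the
           # highest value range down; upper_edge is None for the exact
           # "large value" bin whose true sum we kept.
           remaining, total = k, 0.0
           for bin_sum, bin_cnt, upper in bins:
               take = min(remaining, bin_cnt)
               if take == bin_cnt:
                   total += bin_sum         # whole bin fits into the top k
               elif upper is not None:
                   total += take * upper    # bound a partial bin by its upper edge
               else:
                   total += bin_sum         # partial top bin: its full sum still bounds it
               remaining -= take
               if remaining == 0:
                   break
           return total / k

       bins = [(28000, 20, None), (10600, 15, 800), (15200, 30, 600)]
       print(approx_avg_top_k(bins))        # (28000 + 10600 + 15*600) / 50 = 952.0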


    CS590D

     

    156  

    Quant-info for Top-k Average Binning 

    • Accumulate quant-info for cells to compute average iceberg cubes efficiently
      • Three pieces: sum, count, top-k bins
      • Use top-k bins to estimate/prune descendants
      • Use sum and count to consolidate current cell
     

     Condition strength (strongest → weakest):

       avg() >= v:                 not anti-monotonic
       real avg50() >= v:          anti-monotonic, but computationally costly
       approximate avg50() >= v:   anti-monotonic and efficient to compute


    CS590D

     

    157  

    An Efficient Iceberg Cubing Method: Top-k H-Cubing 

    • One can revise Apriori or BUC to compute a top-k avg iceberg cube, giving top-k Apriori and top-k BUC
    • Can we compute iceberg cubes more efficiently?
    • Top-k H-Cubing: an efficient method for computing iceberg cubes with the average measure
    • H-tree: a hyper-tree structure
    • H-cubing: computing iceberg cubes using the H-tree

    CS590D

     

    158  

    H-tree: A Prefix Hyper-tree 

     Month  City  Cust_grp  Prod     Cost  Price
     Jan    Tor   Edu       Printer   500    485
     Jan    Tor   Hhd       TV        800   1200
     Jan    Tor   Edu       Camera   1160   1280
     Feb    Mon   Bus       Laptop   1500   2500
     Mar    Van   Edu       HD        540    520
     …      …     …         …         …      …

     [Figure: H-tree for the table above. Tuples are inserted as root-to-leaf
     paths in the attribute order Cust_grp → Month → City (e.g., root → edu →
     Jan → Tor). Each node carries quant-info (e.g., the edu/Jan/Tor node has
     Sum: 1765, Cnt: 2, plus bins). A header table lists every attribute
     value with its own quant-info (e.g., Edu: Sum 2285) and a side-link
     chaining all tree nodes labeled with that value.]
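
     A minimal construction sketch consistent with this figure (Python; the
     class and field names are my own, not from the paper):

       class HNode:
           def __init__(self, label):
               self.label, self.children = label, {}
               self.sum, self.cnt = 0.0, 0            # per-node quant-info

       def build_htree(rows, dims, measure):
           # rows: list of dicts; dims: attribute order, e.g. ["Cust_grp", "Month", "City"]
           root, header = HNode("root"), {}
           for r in rows:                              # a single database scan
               node = root
               for d in dims:
                   v = r[d]
                   child = node.children.get(v)
                   if child is None:
                       child = node.children[v] = HNode(v)
                       header.setdefault((d, v), []).append(child)  # side-link
                   child.sum += r[measure]             # roll quant-info into the path
                   child.cnt += 1
                   node = child
           return root, header

     Each tuple adds at most one node per attribute, which is where the
     node-count bound on the next slide comes from.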


    CS590D

     

    159  

    Properties of H-tree 

    • Construction cost: a single database scan
    • Completeness: It contains the complete information needed for computing the iceberg cube
    • Compactness: # of nodes ≤ n*m + 1
      • n: # of tuples in the table
      • m: # of attributes

    CS590D

     

    160  

    Computing Cells Involving Dimension City 

     [Figure: the H-tree and its City-level header table. Following the
     side-link for “Tor” visits every Toronto node in the tree; merging
     their quant-info yields the cell (*, *, Tor). A new local header
     table HTor, built from the nodes on those paths, is then used to
     step from (*, *, Tor) to (*, Jan, Tor).]
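
     Continuing the earlier sketch, the side-links make this aggregation a
     simple chain walk (hypothetical helper over the same assumed structures):

       def cell_quant_info(header, dim, value):
           # Merge quant-info from every tree node labeled `value` on `dim`,
           # giving the cell (*, ..., value, ..., *).
           nodes = header.get((dim, value), [])
           return sum(n.sum for n in nodes), sum(n.cnt for n in nodes)

       # e.g. cell_quant_info(header, "City", "Tor") -> quant-info of (*, *, Tor)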


    CS590D

     

    161  

    Computing Cells Involving Month But No City 

     [Figure: the same H-tree, with the city level rolled up into its month
     parents.]

     1. Roll up quant-info: merge each city node’s quant-info into its month
        parent
     2. Compute cells involving month but no city, using the month-level
        header table

     Top-k OK mark: if the quant-info in a child already passes the top-k avg
     threshold, so do its ancestors; no binning is needed for them!


    CS590D

     

    162  

    Computing Cells Involving Only Cust_grp 

     [Figure: the same H-tree and its header table.]

     Check the header table directly: the quant-info stored with each
     Cust_grp value (e.g., Edu: Sum 2285) already gives the cells
     (Edu, *, *), (Hhd, *, *), and (Bus, *, *).


    CS590D

     

    163  

    Properties of H-Cubing 

    • Space cost
      • one H-tree
      • a stack of up to (m-1) header tables
    • One database scan
    • Main-memory-based tree traversal and side-link updates
    • Top-k OK marking

    CS590D

     

    164  

    Scalability w.r.t. Count Threshold (No min_avg Setting)


    CS590D

     

    165  

    Computing Iceberg Cubes with Other Complex Measures 

    • Computing other complex measures
      • Key point: find a function which is weaker but ensures certain anti-monotonicity
    • Examples
      • Avg() ≤ v: avg_k(c) ≤ v (bottom-k avg)
      • Avg() ≥ v only (no count): max(price) ≥ v
      • Sum(profit) (profit can be negative):
        • p_sum(c) ≥ v if p_count(c) ≥ k; otherwise, sum_k(c) ≥ v
      • Others: conjunctions of multiple conditions
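
    A tiny illustration of the "weaker but anti-monotonic" idea (Python; my
    own toy framing):

      # If avg(price) >= v holds for a cell, then max(price) >= v holds too,
      # and max never grows when a cell is specialized. So "max(price) < v"
      # is a safe test for pruning the cell and all of its descendants.
      def prunable_for_avg(prices, v):
          return max(prices) < v     # assumes a non-empty cell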

    CS590D

     

    166  

    Discussion: Other Issues 

    • Computing iceberg cubes with more complex measures?
      • No general answer for holistic measures, e.g., median, mode, rank
      • Still a research theme even for complex algebraic functions, e.g., standard_dev, variance
    • Dynamic vs. static computation of iceberg cubes
      • v and k are only known at query time
      • Precompute with reasonably low parameters that cover most nontrivial queries
    • Memory hog? What if the cube is too big to fit in memory? Project first, then cube

    CS590D

     

    167  

    Condensed Cube 

    • W. Wang, H. Lu, J. Feng, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE’02
    • Iceberg cubes cannot solve all the problems
      • Suppose 100 dimensions and only 1 base cell with count = 10. How many aggregate (non-base) cells pass count >= 10? All 2^100 - 1 of them: every generalization of the base cell has count 10
    • Condensed cube
      • Only needs to store one cell (a1, a2, …, a100, 10), which represents all the corresponding aggregate cells
      • Advantage: a fully precomputed cube, with no decompression needed at query time
      • Efficient computation of the minimal condensed cube
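
    A quick sanity check of that count (Python; the figure is exact):

      # One base cell in d dimensions generalizes to every cell obtained by
      # replacing any nonempty subset of its values with '*': 2**d - 1 cells.
      d = 100
      print(2**d - 1)   # 1267650600228229401496703205375 aggregate cells, all with count 10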

    CS590D

     

    168  

    References (I) 

    • S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi.  On the computation of multidimensional aggregates.  VLDB’96
    • D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97.
    • R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases.  ICDE’97
    • K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD’99.
    • S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
    • OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.
    • G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB’2001
    • J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals.  Data Mining and Knowledge Discovery, 1:29-54, 1997.

    CS590D

     

    169  

    References (II) 

    • J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01
    • V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96
    • Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.
    • K. Ross and D. Srivastava.  Fast computation of sparse datacubes. VLDB’97.
    • K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98.
    • S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
    • E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons, 1997.
    • W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE’02.
    • Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD’97.

    CS590D
