Data Quality
- Beforehand
- Intended queries & mining tasks?
- Types
- Noise
- Problems
- Destroy regularities of data
- Outliers: data objects whose characteristics differ considerably from those of most other objects
- Problems
- Affect analysis of data e.g. calculating standard deviation
- Missing values
- Inconsistent or duplicate data
- Inconsistent data can be treated as missing data
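The effect of an outlier on the standard deviation can be sketched as follows. The measurements are hypothetical values chosen only for illustration, and the value 100.0 is assumed to be the outlier.

```python
import statistics

# Hypothetical measurements (not from any real data set).
clean = [10.0, 11.0, 9.0, 10.0, 10.0]
# Same data plus one assumed outlier.
with_outlier = clean + [100.0]

# Population standard deviation of each list.
std_clean = statistics.pstdev(clean)
std_outlier = statistics.pstdev(with_outlier)

print(f"std without outlier: {std_clean:.2f}")
print(f"std with outlier:    {std_outlier:.2f}")
```

A single extreme value is enough to inflate the standard deviation by well over an order of magnitude here, which is why outliers are usually inspected before summary statistics are trusted.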
Data Preprocessing
- Aggregation: combine two or more attributes (or objects) into a single one
- Purpose
- Data reduction
- Change of scale
- More stable data (aggregated values vary less)
- Sampling
- Purpose
- Preliminary investigation
- Processing the entire data set is too expensive
- Types
- Random
- Without replacement
- With replacement
- Each drawn object is returned to the pool, so it can be drawn again
- Stratified
- Split data into partitions, draw some from each
- Dimensionality Reduction
- Purpose
- Avoid curse of dimensionality
- Reduce the time & memory required
- Better visualization
- Feature subset selection
- Motivation
- Redundant features
- Irrelevant features
- Techniques
- Brute-force
- Embedded
- Selection happens as part of the data mining algorithm itself
- Filter
- Wrapper
- Feature creation
- Mapping data to new space
- Discretization & binarization
- Discretization
- Converts a continuous attribute into a categorical one
- Binarization
- New values retain independence
- Sparser space
- Attribute transformation
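The sampling types listed above can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the data set is a hypothetical imbalanced list of `(class, id)` pairs, and `stratified_sample` is an illustrative helper, not a library function.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, rng):
    """Split records into partitions by `key`, then draw up to
    `per_stratum` records from each partition (hypothetical helper)."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = min(per_stratum, len(members))
        # Draw without replacement inside each stratum.
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 95 objects of class "a", 5 objects of class "b".
data = [("a", i) for i in range(95)] + [("b", i) for i in range(5)]

# Random sampling without replacement: every drawn object is distinct.
without_replacement = rng.sample(data, 10)

# Random sampling with replacement: the same object may be drawn again.
with_replacement = rng.choices(data, k=10)

# Stratified sampling: draw 3 objects from each class.
stratified = stratified_sample(data, key=lambda r: r[0], per_stratum=3, rng=rng)
```

With a plain random sample of 10, the rare class "b" may not appear at all; the stratified draw guarantees that each class is represented.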
Curse of Dimensionality
As dimensionality increases, data becomes increasingly sparse in the space it occupies.
The ratio max distance / min distance between points shrinks toward 1, so near and far neighbors become hard to tell apart.
Clustering & outlier detection become less meaningful.
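A small simulation can illustrate this. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is only one possible setup; `max_min_ratio` is an illustrative helper name.

```python
import math
import random

def max_min_ratio(n_points, n_dims, rng):
    """Ratio of the largest to the smallest pairwise Euclidean distance
    among random points in the unit hypercube (hypothetical helper)."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return max(dists) / min(dists)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for n_dims in (2, 10, 100, 500):
    ratio = max_min_ratio(50, n_dims, rng)
    print(f"dims={n_dims:4d}  max/min distance ratio = {ratio:.2f}")
```

In low dimensions the ratio is large because some pairs of points are very close together; as the number of dimensions grows, all pairwise distances concentrate around the same value and the ratio approaches 1.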
Data Similarity & Dissimilarity
- Euclidean distance
- Minkowski distance
- Generalization of Euclidean distance
- Mahalanobis distance
- Accounts for the correlation between attributes (uses the covariance matrix)
- Simple matching & Jaccard coefficients
- For binary vectors
- Cosine similarity
- Mostly for documents
- Correlation
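Most of the measures above fit in a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function names are illustrative, vectors are assumed to have equal length, and Mahalanobis distance is left out because it needs the inverse of the covariance matrix.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance; r=1 is Manhattan, r=2 is Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def simple_matching(x, y):
    """Fraction of positions where two binary vectors agree (0-0 or 1-1)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Like simple matching, but 0-0 matches are ignored."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    # Returning 0.0 for two all-zero vectors is a convention assumed here.
    return both / either if either else 0.0

def cosine(x, y):
    """Cosine of the angle between x and y (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# Two sparse binary vectors: they agree mostly on shared zeros.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(simple_matching(p, q), jaccard(p, q))
```

For sparse binary vectors such as `p` and `q`, simple matching is dominated by the shared zeros while Jaccard ignores them, which is why Jaccard is usually preferred when only the 1s carry information.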