Data Quality

  • Ask beforehand
    • What are the intended queries & mining tasks?
  • Types
    • Noise
      • Problems
        • Distorts values and obscures regularities in the data
    • Outliers: data objects with characteristics considerably different from most other objects
      • Problems
        • Skew analysis of the data, e.g. inflate the standard deviation
    • Missing values
    • Inconsistent or duplicate data
      • Inconsistent data can be treated as missing data
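
A minimal sketch of the outlier problem noted above, using only Python's standard library. The readings are made-up illustrative values, not data from these notes; the point is that one extreme value inflates the standard deviation while the median stays put.

```python
import statistics

# Hypothetical, well-behaved measurements (illustrative values only).
readings = [10, 12, 11, 13, 12]
# The same measurements plus a single outlier.
with_outlier = readings + [100]

# Sample standard deviation reacts strongly to the outlier.
sd_clean = statistics.stdev(readings)
sd_outlier = statistics.stdev(with_outlier)

# The median is a robust statistic and is unaffected here.
median_clean = statistics.median(readings)
median_outlier = statistics.median(with_outlier)

print("stdev without outlier:", round(sd_clean, 2))
print("stdev with outlier:   ", round(sd_outlier, 2))
print("median without / with:", median_clean, median_outlier)
```

This is one reason robust statistics such as the median are often preferred when outliers are suspected.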

Data Preprocessing

  • Aggregation: combine two or more attributes (or objects) into one
    • Purpose
      • Data reduction
      • Change of scale
      • Smoother, more stable data (aggregates vary less than individual values)
  • Sampling
    • Purpose
      • Preliminary investigation
      • Managing & processing of entire data set is too expensive
    • Types
      • Simple random
        • Without replacement
          • Each drawn item is removed from the pool
        • With replacement
          • Each drawn item is placed back in the pool, so it can be drawn again
      • Stratified
        • Split data into partitions, draw random samples from each
  • Dimensionality Reduction
    • Purpose
      • Avoid curse of dimensionality
      • Save more time & space
      • Better visualization
  • Feature subset selection
    • Motivation
      • Redundant features
      • Irrelevant features
    • Techniques
      • Brute-force
      • Embedded
        • During data analysis
      • Filter
      • Wrapper
  • Feature creation
  • Mapping data to new space
  • Discretization & binarization
    • Discretization
      • Converts a continuous attribute into a categorical one
    • Binarization
      • New values retain independence
      • Sparser space
  • Attribute transformation
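
The sampling types listed above can be sketched as follows. This is a rough illustration using Python's standard `random` module; the function names, the `records` data, and the per-group sample size are all assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, k, rng):
    # Each item can be drawn at most once.
    return rng.sample(data, k)

def sample_with_replacement(data, k, rng):
    # Items go back to the pool, so repeats are possible.
    return rng.choices(data, k=k)

def stratified_sample(data, key, per_group, rng):
    # Partition the data by key, then draw from every partition.
    groups = defaultdict(list)
    for item in data:
        groups[key(item)].append(item)
    picked = []
    for members in groups.values():
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical imbalanced data: 90 records of class "a", 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

no_repl = sample_without_replacement(records, 10, rng)
with_repl = sample_with_replacement(records, 10, rng)
strat = stratified_sample(records, key=lambda r: r[0], per_group=5, rng=rng)

print("distinct items without replacement:", len(set(no_repl)))
print("distinct items with replacement:   ", len(set(with_repl)))
print("stratified class counts:",
      sum(1 for r in strat if r[0] == "a"),
      sum(1 for r in strat if r[0] == "b"))
```

Drawing a fixed number per partition guarantees the rare class is represented; drawing in proportion to partition size is the other common choice.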

Curse of Dimensionality

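As dimensionality increases, data become increasingly sparse in the space they occupy.
The relative gap between the largest and smallest distances, (max - min) / min, shrinks toward 0, so all points look roughly equally far apart.
Clustering & outlier detection become less meaningful as a result.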
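
A small experiment can make this concrete. The sketch below assumes uniformly random points in the unit hypercube and measures (max - min) / min for distances from one query point; the function name, point count, and chosen dimensions are illustrative assumptions. It needs Python 3.8+ for `math.dist`.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over distances from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# The contrast drops sharply as the number of dimensions grows.
for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to 0, "nearest" and "farthest" neighbours are nearly indistinguishable, which is why distance-based methods degrade.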

Data Similarity & Dissimilarity

  • Euclidean distance
  • Minkowski distance
    • Generalization of Euclidean distance
  • Mahalanobis distance
    • Accounts for correlation between attributes via the covariance matrix
  • Simple matching & Jaccard coefficients
    • For binary vectors
  • Cosine similarity
    • Mostly for documents
  • Correlation
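
Most of the measures above can be sketched in a few lines of plain Python. The function names and example vectors are assumptions for illustration. Mahalanobis distance is left out because it needs a covariance matrix inverse, and returning 0.0 from `jaccard` when neither vector has a 1 is a convention chosen here, not a standard.

```python
import math

def minkowski(x, y, r):
    # r = 1 gives Manhattan distance, r = 2 gives Euclidean distance.
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def euclidean(x, y):
    return minkowski(x, y, 2)

def simple_matching(x, y):
    # Fraction of positions where two binary vectors agree (0-0 or 1-1).
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def jaccard(x, y):
    # Like simple matching, but 0-0 matches are ignored.
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0  # assumed convention

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    # Pearson correlation equals the cosine of the mean-centered vectors.
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# Sparse binary vectors: many shared zeros, no shared ones.
bx = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
by = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

print("euclidean:  ", euclidean([0, 0], [3, 4]))
print("manhattan:  ", minkowski([0, 0], [3, 4], 1))
print("SMC:        ", simple_matching(bx, by))
print("jaccard:    ", jaccard(bx, by))
print("cosine:     ", cosine([1, 0], [0, 1]))
print("correlation:", correlation([1, 2, 3], [2, 4, 6]))
```

The binary example shows why Jaccard is preferred for sparse data: simple matching counts the shared zeros as agreement, while Jaccard ignores them.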
