Machine Learning

Gives computers the ability to learn without being explicitly programmed.

Machine Learning vs. Data Mining vs. Artificial Intelligence

  • Overlap significantly
  • ML: learning properties from data and adapting to new data
  • DM: discovering unknown properties in data
  • AI: machines performing tasks that are characteristic of human intelligence
    • ML is a way of achieving AI
    • In the 1980s, AI was all about expert systems
      • Expert system = knowledge base + inference engine
      • Problems
        • Knowledge created by hand
        • Things in the real world are not always strictly true/false
      • ML improvement
        • Knowledge learned from data
        • Uses probability to represent the real world

Types

Human supervision or not
  • Supervised learning
    • k-NN
    • Linear regression
    • Logistic regression
    • SVM
    • Decision trees & random forests
    • Neural networks
  • Unsupervised learning
    • Clustering
      • k-means
      • Hierarchical cluster analysis (HCA)
      • Expectation maximization
    • Visualization & dimensionality reduction
      • Principal component analysis (PCA)
      • Kernel PCA
      • Locally-linear embedding (LLE)
      • t-distributed stochastic neighbor embedding (t-SNE)
    • Anomaly detection
    • Association rule learning
      • Apriori
      • Eclat
  • Semi-supervised learning
  • Reinforcement learning
    • Observe environment, select & perform action, get reward or penalty, update policy
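The reinforcement learning loop above (observe environment, select & perform action, get reward or penalty, update policy) can be sketched with a toy multi-armed bandit; the reward probabilities, epsilon value, and function names below are made up for illustration.

```python
import random

def run_bandit(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: select an action, receive a reward,
    and update its value estimates (its 'policy')."""
    rng = random.Random(seed)
    n = len(true_probs)
    counts = [0] * n    # times each arm was pulled
    values = [0.0] * n  # estimated reward per arm
    for _ in range(steps):
        # Select & perform action: explore with prob. epsilon, else exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(n)
        else:
            arm = max(range(n), key=lambda a: values[a])
        # Get reward or penalty from the environment.
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        # Update policy: incremental average of observed rewards.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values

estimates = run_bandit([0.2, 0.5, 0.8])
# After many steps, the agent should rank arm 2 (p = 0.8) highest.
```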
Learning incrementally on the fly or not
  • Online
    • "Data in motion"
    • Model updated as data arrive
    • Advantage
      • Suitable for streaming data
      • Suitable for resource-constrained systems
      • Suitable for out-of-core learning (dataset cannot fit memory)
    • Disadvantage
      • Bad incoming data can degrade the model, so input quality must be monitored
  • Offline (batch)
    • "Data at rest"
    • Model estimated from all data at once
    • Advantage
      • Faster convergence
    • Disadvantage
      • Inefficient when new data arrive frequently
      • Not applicable to reinforcement learning
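The online/batch contrast above can be sketched with the simplest possible "model", a running mean: batch estimates it from all data at once, online updates it one sample at a time (a hypothetical minimal example, not a full learner).

```python
def batch_mean(data):
    """Offline/batch: estimate the model from all data at once."""
    return sum(data) / len(data)

def online_mean(stream):
    """Online: update the model incrementally as each sample arrives,
    never holding the full dataset in memory (out-of-core friendly)."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # incremental update
    return mean

data = [2.0, 4.0, 6.0, 8.0]
# Both reach the same estimate; only the data-access pattern differs.
```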
Try to find patterns to build a model or not
  • Instance-based
    • Memorize every training instance
    • Generalize to new instances using a similarity measure
  • Model-based
    • Build a model of the given training data, and use it to make predictions
    • Less vulnerable to bad data
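The two approaches above can be contrasted on 1-D data: an instance-based predictor keeps the training points and answers from the nearest one, while a model-based predictor fits a line and then no longer needs the data. A minimal sketch with made-up names and data:

```python
def predict_instance_based(x, train, k=1):
    """Instance-based: memorize training points, predict from the
    k nearest stored instances using a distance (similarity) measure."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def fit_model_based(train):
    """Model-based: fit a line y = a*x + b by least squares,
    then predictions no longer need the training data."""
    n = len(train)
    sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x
predict_model = fit_model_based(train)
```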

Data

  • More data
    • Generalize better
    • May have sampling bias
  • Less data
    • Sampling noise (non-representative data)
Data vs. Algorithm

Given enough data, very different ML algorithms perform almost identically well on complex problems.

But small- and medium-sized datasets are still very common.

How to better detect patterns?
  • Data cleaning
    • Remove outliers
    • Ignore/fill in missing values
  • Feature engineering
    • Select the most useful features
    • Combine features
    • Create new features
  • Prevent overfitting (regularization)
    • Simplify model
      • Fewer parameters
      • Reduce features
    • Gather more data
    • Reduce noise in data
    • Tune the amount of regularization via hyperparameters
  • Prevent underfitting
    • Build more powerful model
      • More parameters
    • Feed better features
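The data-cleaning and feature-engineering steps above can be sketched on a hypothetical housing-style dataset (field names, records, and thresholds are all made up for illustration):

```python
# Hypothetical raw records: (rooms, households, income); None = missing.
raw = [
    (6.0, 2.0, 3.5),
    (8.0, 4.0, None),   # missing income value
    (900.0, 2.0, 4.0),  # obvious outlier in rooms
    (4.0, 2.0, 2.5),
]

# Data cleaning: remove outliers, fill in missing values.
no_outliers = [r for r in raw if r[0] < 100.0]
incomes = [r[2] for r in no_outliers if r[2] is not None]
median_income = sorted(incomes)[len(incomes) // 2]
cleaned = [(rm, hh, inc if inc is not None else median_income)
           for rm, hh, inc in no_outliers]

# Feature engineering: combine features into a more useful one.
features = [{"rooms_per_household": rm / hh, "income": inc}
            for rm, hh, inc in cleaned]
```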

Testing & Validating

To see how well the derived model performs, try it on new data instances.

  • Split dataset
    • Training set
    • Test set
      • Generalization error (out-of-sample error)
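The split above can be sketched in a few lines: shuffle, then hold out a test set whose error serves as an estimate of the generalization (out-of-sample) error. The 80/20 ratio and seed are illustrative choices.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle, then hold out a test set for estimating the
    generalization (out-of-sample) error."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train_set, test_set = train_test_split(data)
# 80 instances for training, 20 held out for testing.
```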
No Free Lunch Theorem

No one model works best for every problem, so it is common in ML to try multiple models and compare them.
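Trying multiple models usually means fitting each candidate on the training set and keeping the one with the lowest held-out error. A toy sketch with two deliberately simple, made-up candidate models:

```python
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]  # toy linear data
train, test = data[:7], data[7:]

def fit_constant(tr):
    """Model A: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in tr) / len(tr)
    return lambda x: mean_y

def fit_line(tr):
    """Model B: line through the first and last training points."""
    (x0, y0), (x1, y1) = tr[0], tr[-1]
    a = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + a * (x - x0)

def mse(model, ts):
    """Mean squared error on a held-out set."""
    return sum((model(x) - y) ** 2 for x, y in ts) / len(ts)

scores = {f.__name__: mse(f(train), test) for f in (fit_constant, fit_line)}
best = min(scores, key=scores.get)
```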
