Find the optimal parameters analytically, by solving for the optimum in closed form instead of iterating with gradient descent.
Normal Equation
Take the derivative with respect to each $$\theta_j$$ and set it to zero.
$$ \theta = (X^TX)^{-1}X^Ty
$$
No need for feature scaling.
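A minimal sketch of the closed-form solve with NumPy; the data values below are made up for illustration (one feature plus an intercept column of ones):

```python
import numpy as np

# Design matrix X: m = 4 examples, intercept column plus one feature.
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically preferable to forming the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # optimal parameters in one step: no alpha, no iterations
```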
Comparison with Gradient Descent
| | Gradient Descent | Normal Equation |
|---|---|---|
| Choosing α | Yes | No |
| Iterate | Yes | No |
| Time | $$O(kn^2)$$ | $$O(n^3)$$, needs to calculate $$(X^TX)^{-1}$$ |
| When to use | n large | n small |
Noninvertibility
`pinv` computes the pseudoinverse: it returns a value even if the matrix is not invertible.
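As a sketch of the same behavior in NumPy (the data is made up, with a deliberately redundant column so $$X^TX$$ is singular):

```python
import numpy as np

# Third column = 2 * second column, so the columns are linearly dependent
# and X^T X is singular.
X = np.array([[1.0, 2.0,  4.0],
              [1.0, 3.0,  6.0],
              [1.0, 5.0, 10.0]])
y = np.array([1.0, 2.0, 3.0])

A = X.T @ X
# np.linalg.inv(A) would raise LinAlgError here, because A is singular.
theta = np.linalg.pinv(A) @ X.T @ y  # the pseudoinverse still returns a value
print(theta)
```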
- Causes
    - Redundant features causing linear dependency
        - Fix: remove the redundant features
    - Too many features, e.g. $$m \le n$$
        - Fix: delete some features
        - Fix: use regularization (see the sketch below)
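A sketch of the regularized normal equation, $$\theta = (X^TX + \lambda L)^{-1}X^Ty$$, where $$L$$ is the identity with its top-left entry zeroed so the intercept is not penalized; the function name and example data are my own:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    # theta = (X^T X + lambda * L)^{-1} X^T y, where L is the identity
    # with L[0, 0] = 0 so the intercept term is not regularized.
    # For lam > 0 the matrix becomes invertible even when m <= n,
    # assuming the usual all-ones intercept column in X.
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Made-up example with m = 2 examples and n = 2 features plus an intercept,
# so X^T X alone would be singular.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
y = np.array([1.0, 2.0])
print(regularized_normal_equation(X, y, lam=1.0))
```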