Normal Equation

A way to find the optimal parameters analytically: instead of iterating with gradient descent, solve for the optimum directly.
Take the derivative of the cost with respect to each $$\theta_j$$ and set it to zero.
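One way to see where the formula comes from: for the least-squares cost
$$
J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y)
$$
the gradient is
$$
\nabla_\theta J(\theta) = \frac{1}{m}X^T(X\theta - y)
$$
and setting it to zero gives $$X^TX\theta = X^Ty$$, so (when $$X^TX$$ is invertible):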
$$
\theta = (X^TX)^{-1}X^Ty
$$
No need for feature scaling.
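As a minimal sketch of the formula in NumPy (the data below is illustrative):

```python
import numpy as np

# Illustrative data: m = 4 examples, intercept column plus 2 features.
X = np.array([[1.0, 2104, 3],
              [1.0, 1416, 2],
              [1.0, 1534, 3],
              [1.0,  852, 2]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y; solve() avoids forming the inverse explicitly,
# which is cheaper and numerically more stable.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```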
Comparison with Gradient Descent
| | Gradient Descent | Normal Equation |
|---|---|---|
| Choosing $$\alpha$$ | Yes | No |
| Iterate | Yes | No |
| Time | $$O(kn^2)$$ | $$O(n^3)$$, needs to calculate the inverse of $$X^TX$$ |
| When to use | $$n$$ large | $$n$$ small |
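For contrast with the table, a minimal sketch of batch gradient descent on the same objective (`alpha`, `iters`, and the data are arbitrary illustrative choices, not tuned values):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Minimal batch gradient descent for linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of J = ||X theta - y||^2 / (2m)
        theta -= alpha * grad
    return theta

# On well-scaled features this converges to the normal-equation solution.
X = np.column_stack([np.ones(5), np.linspace(0, 1, 5)])
y = np.array([1.0, 2.0, 2.5, 3.5, 4.0])
print(gradient_descent(X, y))
print(np.linalg.solve(X.T @ X, X.T @ y))  # normal equation, for comparison
```

With features on very different scales, $$\alpha$$ must be tuned (or the features scaled) for this to converge, which is exactly the bookkeeping the normal equation avoids.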
Noninvertibility
`pinv` computes the pseudoinverse: it returns a value even if the matrix is not invertible.
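For example, with NumPy's `numpy.linalg.pinv` (the matrix below is deliberately singular):

```python
import numpy as np

# Singular matrix: the second column is twice the first (a redundant
# feature), so the ordinary inverse does not exist.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# np.linalg.inv(A) would raise LinAlgError: Singular matrix.
A_pinv = np.linalg.pinv(A)  # the pseudoinverse still returns a value
print(A_pinv)
```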
- Causes
  - Redundant features causing linear dependence
    - Remove the redundant features
  - Too many features, e.g. $$m \le n$$
    - Delete some features
    - Use regularization (see the sketch after this list)
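A minimal sketch of how regularization restores invertibility when $$m \le n$$ (the data and $$\lambda$$ below are arbitrary; this uses a plain $$\lambda I$$ penalty on all parameters):

```python
import numpy as np

# Hypothetical underdetermined case: m = 3 examples, n = 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # rank <= m = 3 < n = 5, so X^T X is singular

# Adding lambda * I (lambda > 0) makes the matrix positive definite,
# hence invertible, and the regularized normal equation has a solution.
lam = 1.0
theta = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)
print(theta)
```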