Problem Formulation
Content-Based Recommender System
- $$r(i, j)$$: 1 if user $$j$$ has rated movie $$i$$, 0 otherwise
- $$y^{(i, j)}$$: rating of user $$j$$ on movie $$i$$ (if defined)
- $$\theta^{(j)}$$: parameter vector for user $$j$$
- $$x^{(i)}$$: feature vector for movie $$i$$ (prepend 1 as intercept term)
- $$(\theta^{(j)})^Tx^{(i)}$$: predicted rating of user $$j$$ on movie $$i$$
- $$m^{(j)}$$: # of movies rated by user $$j$$
To learn $$\theta^{(j)}$$:
$$ \min_{\theta^{(j)}} \frac{1}{2m^{(j)}} \sum_{i:r(i, j) = 1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2m^{(j)}}\sum_{k=1}^n (\theta_k^{(j)})^2
$$
To learn $$\theta^{(1)}, ..., \theta^{(n_u)}$$:
$$ \min_{\theta^{(1)}, ..., \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i, j) = 1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n (\theta_k^{(j)})^2
$$
Gradient update:
$$ \begin{aligned} \theta_0^{(j)} &= \theta_0^{(j)} - \alpha \sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})x_0^{(i)} \\ \theta_k^{(j)} &= \theta_k^{(j)} - \alpha \left(\sum_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})x_k^{(i)} + \lambda \theta_k^{(j)}\right) \quad (k \neq 0) \end{aligned}
$$
The $$\frac{1}{m^{(j)}}$$ factors can actually be eliminated: each is a constant for its user, so dropping them does not change the minimizing $$\theta^{(j)}$$.
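As a minimal NumPy sketch (not from the course; the array names `X`, `Y`, `R` and the helper `user_cost_and_grad` are assumptions), the per-user cost and gradient with the $$m^{(j)}$$ factor dropped might look like:

```python
import numpy as np

def user_cost_and_grad(theta_j, X, Y, R, j, lam):
    """Regularized cost and gradient for one user's parameter vector theta_j.

    X   : (n_m, n+1) movie features, first column is the intercept (all ones)
    Y   : (n_m, n_u) ratings matrix
    R   : (n_m, n_u) indicator matrix, R[i, j] = 1 if movie i was rated by user j
    lam : regularization strength lambda
    """
    rated = R[:, j] == 1                      # movies rated by user j
    X_r = X[rated]                            # features of those movies
    y_r = Y[rated, j]                         # corresponding ratings
    err = X_r @ theta_j - y_r                 # (theta^(j))^T x^(i) - y^(i,j)

    cost = 0.5 * np.sum(err ** 2) + 0.5 * lam * np.sum(theta_j[1:] ** 2)

    grad = X_r.T @ err                        # sum_i err_i * x^(i)
    grad[1:] += lam * theta_j[1:]             # do not regularize the intercept term
    return cost, grad
```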
Collaborative Filtering
Given $$X$$, estimate $$\Theta$$:
$$ \min_{\theta^{(1)}, ..., \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i, j) = 1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n (\theta_k^{(j)})^2
$$
Given $$\Theta$$, estimate $$X$$:
$$ \min_{x^{(1)}, ..., x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i, j) = 1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n (x_k^{(i)})^2
$$
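Each of these two subproblems is a regularized least-squares fit, so one way to sketch the back-and-forth is with closed-form ridge solves. This alternating scheme is my illustration, not the course's method (the next section instead minimizes both parameter sets jointly with gradient descent), and the helper name `als_step` is an assumption:

```python
import numpy as np

def als_step(X, Theta, Y, R, lam):
    """One round of alternation: fix X to re-estimate Theta, then fix Theta to re-estimate X.

    X     : (n_m, n) movie feature matrix
    Theta : (n_u, n) user parameter matrix
    Y, R  : (n_m, n_u) ratings and rated-indicator matrices
    Assumes lam > 0 so every normal-equation matrix is invertible.
    """
    n = X.shape[1]
    # Given X, estimate each theta^(j) from the movies rated by user j.
    for j in range(Theta.shape[0]):
        rated = R[:, j] == 1
        A = X[rated].T @ X[rated] + lam * np.eye(n)
        b = X[rated].T @ Y[rated, j]
        Theta[j] = np.linalg.solve(A, b)
    # Given Theta, estimate each x^(i) symmetrically from the users who rated movie i.
    for i in range(X.shape[0]):
        rated = R[i, :] == 1
        A = Theta[rated].T @ Theta[rated] + lam * np.eye(n)
        b = Theta[rated].T @ Y[i, rated]
        X[i] = np.linalg.solve(A, b)
    return X, Theta
```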
Minimizing $$X$$ and $$\Theta$$ simultaneously:
$$ \min_{x^{(1)}, ..., x^{(n_m)},\, \theta^{(1)}, ..., \theta^{(n_u)}} J(\Theta, X), \quad \text{where } J(\Theta, X) = \frac{1}{2} \sum_{(i, j):r(i, j) = 1} ((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n (x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n (\theta_k^{(j)})^2
$$
- Initialize $$X$$ and $$\Theta$$ to small random values (no intercept term; the random initialization breaks symmetry so the learned features differ from one another)
- Minimize $$J(\Theta, X)$$ using gradient descent or other optimization algorithm
- $$x_k^{(i)} = x_k^{(i)} - \alpha\left(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})\theta_k^{(j)} + \lambda x_k^{(i)}\right)$$
- $$\theta_k^{(j)} = \theta_k^{(j)} - \alpha \left(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})x_k^{(i)} + \lambda \theta_k^{(j)}\right)$$
- Predict the rating of user $$j$$ on movie $$i$$ with $$(\theta^{(j)})^Tx^{(i)}$$ (a vectorized sketch of the cost and gradients follows this list)
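A vectorized NumPy sketch of $$J(\Theta, X)$$ and the two gradients used in the updates above; the function and variable names are my own, with `Y` of shape $$n_m \times n_u$$ and `R` the rated-indicator matrix:

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Joint cost J(Theta, X) and its gradients with respect to X and Theta.

    X     : (n_m, n) movie feature matrix (no intercept column)
    Theta : (n_u, n) user parameter matrix
    Y, R  : (n_m, n_u) ratings and rated-indicator matrices
    """
    err = (X @ Theta.T - Y) * R               # zero out unrated entries
    J = 0.5 * np.sum(err ** 2) \
        + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lam * X            # partial of J w.r.t. each x_k^(i)
    Theta_grad = err.T @ X + lam * Theta      # partial of J w.r.t. each theta_k^(j)
    return J, X_grad, Theta_grad

# Small random initialization, then one gradient-descent step on both parameter sets:
# X, Theta = 0.01 * np.random.randn(n_m, n), 0.01 * np.random.randn(n_u, n)
# J, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y, R, lam)
# X -= alpha * X_grad
# Theta -= alpha * Theta_grad
```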
Low Rank Matrix Factorization
Predicted ratings:
$$ \begin{bmatrix} (\theta^{(1)})^Tx^{(1)} & (\theta^{(1)})^Tx^{(2)} & \dots & (\theta^{(1)})^Tx^{(n_m)}\\ (\theta^{(2)})^Tx^{(1)} & (\theta^{(2)})^Tx^{(2)} & \dots & (\theta^{(2)})^Tx^{(n_m)}\\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{(n_u)})^Tx^{(1)} & (\theta^{(n_u)})^Tx^{(2)} & \dots & (\theta^{(n_u)})^Tx^{(n_m)} \end{bmatrix}
$$
$$ X = \begin{bmatrix} (x^{(1)})^T\\ (x^{(2)})^T\\ \vdots\\ (x^{(n_m)})^T \end{bmatrix}
$$
$$ \Theta = \begin{bmatrix} (\theta^{(1)})^T\\ (\theta^{(2)})^T\\ \vdots\\ (\theta^{(n_u)})^T \end{bmatrix}
$$
so the predicted-rating matrix above is simply $$\Theta X^T$$, a product of two low-rank matrices.
Related movies:
The smaller $$\|x^{(i)} - x^{(j)}\|$$ is, the more similar movies $$i$$ and $$j$$ are.
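A short sketch of both steps, forming the predicted-rating matrix as $$\Theta X^T$$ (users × movies, matching the matrix above) and finding related movies by feature distance; `most_similar_movies` is a hypothetical helper, not from the course:

```python
import numpy as np

def most_similar_movies(X, i, k=5):
    """Indices of the k movies whose learned feature vectors are closest to movie i;
    a small ||x^(i) - x^(j)|| means the movies are similar."""
    dists = np.linalg.norm(X - X[i], axis=1)   # distance from movie i to every movie
    dists[i] = np.inf                          # exclude movie i itself
    return np.argsort(dists)[:k]

# Predicted ratings for every (user, movie) pair, as laid out in the matrix above:
# pred = Theta @ X.T          # shape (n_u, n_m)
```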
Mean Normalization
Subtract each movie's mean rating $$\mu_i$$ (computed over the users who rated it) from $$Y$$ before learning, then add it back when predicting: $$(\theta^{(j)})^Tx^{(i)} + \mu_i$$. A user who has rated nothing is then predicted each movie's mean rating instead of 0.
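A sketch of the normalization step; the helper name `mean_normalize` and the handling of movies with no ratings are assumptions:

```python
import numpy as np

def mean_normalize(Y, R):
    """Per-movie mean normalization: subtract each movie's mean over its rated entries.

    Returns the normalized ratings Y_norm (unrated entries stay 0) and the means mu."""
    mu = np.sum(Y * R, axis=1) / np.maximum(np.sum(R, axis=1), 1)  # avoid divide-by-zero
    Y_norm = (Y - mu[:, None]) * R
    return Y_norm, mu

# Train on Y_norm; then predict user j's rating of movie i as
# Theta[j] @ X[i] + mu[i], so a brand-new user defaults to each movie's mean rating.
```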