%pylab inline
import pandas as pd
%load_ext rpy2.ipython
import os
from ipypublish import nb_setup
Populating the interactive namespace from numpy and matplotlib
A recommendation algorithm predicts what you like or want, and can rank many such items in order of preference. It learns your preferences from recorded data on your likes and dislikes. This class of machine learning models is often known as "collaborative filtering".
Netflix has a recommendation engine for movies: it tries to show you movies that you prefer. If you think of the $N$ users of a movie service, each having preferences over $K$ attributes of a movie, we can collect these preferences in a matrix of weights, with users on the columns and attributes on the rows. This is a matrix $u \in {\cal R}^{K \times N}$, with elements indexed as $u_{ki}$, where $i=1,2,...,N$ and $k=1,2,...,K$.
Likewise, imagine another matrix $m$ with $M$ movies on the columns and the same $K$ attributes on the rows, giving $m \in {\cal R}^{K \times M}$, with elements indexed as $m_{kj}$, where $j=1,2,...,M$ and $k=1,2,...,K$.
For any user $i$, we may rank movies based on the predicted score $r_{ij}$ for movie $j$, easily calculated as
$$ r_{ij} = \sum_{k=1}^K u_{ki} m_{kj} = u_i^\top m_j $$where $u_i$ is a column vector of size $K \times 1$, and $m_j$ is also a column vector of size $K \times 1$. The elements $r_{ij}$ form a matrix $r$ of dimension $N \times M$. Some, but not all, of these elements are actually observed, because each user rates only some movies. While matrix $r$ is partially observable from ratings data, matrices $u$ and $m$ are latent, because the $K$ attributes are unknown. We do know that
$$ r = u^\top m $$Therefore, we would like to factorize matrix $r$ into the two matrices $u,m$.
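To make the shapes concrete, here is a minimal NumPy sketch (the sizes and random values are purely illustrative) verifying that $r = u^\top m$ is $N \times M$ and that each entry is the dot product $u_i^\top m_j$:
import numpy as np
N, M, K = 4, 3, 2                              # small illustrative sizes
u = np.random.rand(K, N)                       # latent user matrix, one column per user
m = np.random.rand(K, M)                       # latent movie matrix, one column per movie
r = u.T @ m                                    # predicted ratings matrix, N x M
print(r.shape)                                 # (4, 3)
print(np.isclose(r[0, 1], u[:, 0] @ m[:, 1]))  # r_ij equals u_i' m_j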
If the true score for movie $j$, user $i$, is $y_{ij}$, then we want to find $u,m$ that deliver the closest value of $r_{ij}$ to its true value. This is done using a technique known as Alternating Least Squares (ALS).
The best fit recommender system is specified as the solution to the following problem, where we minimize loss function $L$. Since the notation gets hairy here, remember that any variable with two subscripts is scalar, with one subscript is a vector, and with no subscripts is a matrix.
$$ \begin{align} L &= \sum_{i=1}^N \sum_{j=1}^M (y_{ij}-r_{ij})^2 + \lambda_u \sum_{i=1}^N \parallel u_i \parallel^2 + \lambda_m \sum_{j=1}^M \parallel m_j \parallel^2 \\ &= \sum_{i=1}^N \sum_{j=1}^M (y_{ij}-u_i^\top m_j)^2 + \lambda_u \sum_{i=1}^N u_i^\top u_i + \lambda_m \sum_{j=1}^M m_j^\top m_j \end{align} $$We wish to find matrices $u,m$ that solve this minimization problem. We begin with some starting guess for both matrices. Then, we differentiate function $L$ with respect to just one matrix, say $u$, and solve for its optimal values. Next, we take the new values of $u$, and old values of $m$, and differentiate $L$ with respect to $m$ to solve for the new optimal values of $m$, holding $u$ fixed. We then repeat the process for a chosen number of epochs, alternating between the $u$ and $m$ subproblems. This eventually converges to the optimal $u$ and $m$ matrices, completing the factorization that minimizes the loss function.
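As a sanity check on the objective, the loss $L$ is easy to evaluate for candidate $u$ and $m$. A minimal NumPy sketch (the function name and storage convention, $u$ as $K \times N$ and $m$ as $K \times M$, are my own):
import numpy as np
def als_loss(y, u, m, lam_u, lam_m):
    # y: N x M observed ratings; u: K x N users; m: K x M movies
    resid = y - u.T @ m   # all the (y_ij - u_i' m_j) terms at once
    return (resid**2).sum() + lam_u*(u**2).sum() + lam_m*(m**2).sum()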
Differentiate $L$ with respect to each user, obtaining $N$ first-order equations, which may then be set to zero and solved.
$$ \frac{\partial L}{\partial u_i} = \sum_{j=1}^M (y_{ij}-u_i^\top m_j)(-2 m_j) + 2 \lambda_u u_i = 0 \quad \in {\cal R}^{K \times 1} $$which gives the following equation:
$$ \sum_{j=1}^M (y_{ij}-u_i^\top m_j)m_j = \lambda_u u_i $$If you work it out carefully you can write this purely in matrix form as follows:
$$ m \cdot (y_i - m^\top u_i) = \lambda_u u_i \quad \in {\cal R}^{K \times 1} $$(Note that $y_i \in {\cal R}^{M \times 1}$; $m \in {\cal R}^{K \times M}$; $u_i \in {\cal R}^{K \times 1}$. And so the LHS is ${\cal R}^{K \times 1}$, as is the RHS.)
We rewrite this economically as
$$ \begin{align} m \cdot (y_i - m^\top u_i) &= \lambda_u u_i \quad \in {\cal R}^{K \times 1} \\ m \cdot y_i &= (\lambda_u I + m \cdot m^\top) \cdot u_i \\ \mbox{ } \\ u_i &= (\lambda_u I + m \cdot m^\top)^{-1} \cdot m \cdot y_i \quad \in {\cal R}^{K \times 1} \end{align} $$where $I \in {\cal R}^{K \times K}$ is an identity matrix. This gives one column $u_i$ of the $u$ matrix, and we compute this over all $i=1,2,...,N$. Therefore, holding $m$ fixed, we can get all the $u$ matrix values. And then holding $u$ fixed, we get all the $m$ matrix values, as shown next.
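In code, each $u_i$ is obtained from a single $K \times K$ regularized linear solve. A minimal NumPy sketch (the function name is mine); using np.linalg.solve rather than an explicit matrix inverse is the standard, numerically safer choice:
import numpy as np
def update_user(y_i, m, lam_u):
    # y_i: length-M vector of user i's ratings; m: K x M latent movie matrix
    K = m.shape[0]
    A = lam_u * np.eye(K) + m @ m.T      # (lambda_u I + m m'), a K x K system
    return np.linalg.solve(A, m @ y_i)   # u_i = A^{-1} m y_i, a K-vector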
This is analogous to the $u$ matrix, and the answer is
$$ m_j = (\lambda_m I + u \cdot u^\top)^{-1} \cdot u \cdot y_j \quad \in {\cal R}^{K \times 1} $$This gives one column $m_j$ of the $m$ matrix, and we compute this over all $j=1,2,...,M$. Note that $I \in {\cal R}^{K \times K}$ is an identity matrix, and $y_j \in {\cal R}^{N \times 1}$.
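Putting the two closed-form updates together gives a bare-bones ALS loop. This is a sketch under the simplifying assumption that every $y_{ij}$ is observed (a production recommender would restrict each sum to rated entries only); all names and defaults are illustrative. Because the same $K \times K$ matrix appears in every column's solve, all columns can be solved in one call:
import numpy as np
def als_factorize(y, K=2, lam=0.1, epochs=20, seed=0):
    # y: fully observed N x M ratings matrix
    N, M = y.shape
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((K, N))   # starting guess for latent user matrix
    m = rng.standard_normal((K, M))   # starting guess for latent movie matrix
    I = np.eye(K)
    for _ in range(epochs):
        # u-step: hold m fixed; columns of the solution are the u_i
        u = np.linalg.solve(lam*I + m @ m.T, m @ y.T)
        # m-step: hold u fixed; columns of the solution are the m_j
        m = np.linalg.solve(lam*I + u @ u.T, u @ y)
    return u, m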
For those students who are uncomfortable with this sort of matrix algebra, I strongly recommend taking a small system of $N=4$ users and $M=3$ movies, and factorizing matrix $r \in {\cal R}^{4 \times 3}$. Maybe set $K=2$. Rework the calculus and algebra above to get comfortable with these mathematical objects.
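For instance, running the als_factorize sketch above on such a toy system (the ratings below are made up purely for illustration):
import numpy as np
y_small = np.array([[5., 3., 1.],
                    [4., 4., 2.],
                    [1., 2., 5.],
                    [2., 1., 4.]])   # N=4 users, M=3 movies
u, m = als_factorize(y_small, K=2)   # als_factorize from the sketch above
print(np.round(u.T @ m, 2))          # reconstructed 4 x 3 ratings matrix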
The alternating least squares algorithm is similar to the Expectation-Maximization (EM) algorithm of @10.2307/2984875.
In R, we have the ALS package to do this matrix factorization.
%%R
library(ALS)
Loading required package: nnls
Loading required package: Iso
Iso 0.0-17
Suppose we have 50 users who rate 200 movies. This gives us the $y$ matrix of true ratings. We want to factorize this matrix into two latent matrices $u$ and $m$, by choosing $K=2$ latent attributes. The code for this is simple.
%%R
N=50; M=200; K=2
y = matrix(ceiling(runif(N*M)*5),N,M) #Matrix (i,j) to be factorized
u0 = matrix(rnorm(K*N),N,K) #Guess for u matrix
m0 = matrix(rnorm(K*M),M,K) #Guess for m matrix
res = als(CList=list(u0),S=m0,PsiList=list(y))
Initial RSS 129531.2
Iteration (opt. S): 1, RSS: 104657.8, RD: 0.1920259
Iteration (opt. C): 2, RSS: 27240.16, RD: 0.7397217
Iteration (opt. S): 3, RSS: 19093.6, RD: 0.2990642
Iteration (opt. C): 4, RSS: 18923.13, RD: 0.008928302
Iteration (opt. S): 5, RSS: 18852.69, RD: 0.003722389
Iteration (opt. C): 6, RSS: 18800.45, RD: 0.002771116
Iteration (opt. S): 7, RSS: 18763.54, RD: 0.001963151
Iteration (opt. C): 8, RSS: 18734.92, RD: 0.001525233
Iteration (opt. S): 9, RSS: 18713.99, RD: 0.001117186
Iteration (opt. C): 10, RSS: 18697.59, RD: 0.0008763283
Initial RSS / Final RSS = 129531.2 / 18697.59 = 6.927693
We now extract the two latent matrices, and also generate the predicted ratings matrix $r$. We compute the mean squared error (MSE) of the ratings predictions.
%%R
#Results
u = t(res$CList[[1]]); print(dim(u)) #Put in K x N format
m = t(res$S); print(dim(m)) #In K x M format
r = t(u) %*% m; print(dim(r)) #Should be N x M
e = (r-y)^2; print(mean(e))
#Check
print(mean(res$resid[[1]]^2))
[1] 2 50
[1] 2 200
[1] 50 200
[1] 1.869759
[1] 1.869759
What does the $u$ matrix tell us? It tells us how each of the 50 users weights the two latent attributes. For example, the first user has the following weights:
%%R
print(u[,1])
[1] 5.399144 4.247651
You can see that this user places different weights on the two latent attributes.
Likewise, you can take matrix $m$ and see how much each movie offers on each attribute dimension. For example, the first movie's loadings are
%%R
print(m[,1])
[1] 0.01732872 0.65090215
How do we use this decomposition? For a new user, we can find the closest existing user by computing cosine similarity on some other observed characteristics, and then use that user's weights in matrix $u$ to rank the movies (a sketch of this matching step follows the example below). Suppose, for illustration, that the new user's weights happen to be the mean of all the other users'. Then we have
%%R
#Find new user's weights on attributes
u_new = as.matrix(rowMeans(u))
print(u_new)
#Find predicted ratings for all M movies for the new user
pred_ratings = t(m) %*% u_new
sol = sort(pred_ratings,decreasing=TRUE,index.return=TRUE)
print(head(sol$ix))
[,1]
[1,] 6.432936
[2,] 4.302349
[1] 36 157 137 67 142 71
The indices of the top 6 recommended movies are listed.
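The cosine-similarity matching step mentioned above was not needed here, since we simply averaged the user weights. A hypothetical Python sketch of it might look as follows (here X and x_new are assumed side characteristics of the users, not part of the factorization output):
import numpy as np
def closest_user_weights(x_new, X, u):
    # x_new: feature vector for the new user (hypothetical side data)
    # X: N x d matrix of the same characteristics for existing users
    # u: K x N latent user-weight matrix from the factorization
    sims = (X @ x_new) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x_new))
    return u[:, np.argmax(sims)]   # borrow the most similar user's weights
# With a K x M latent movie matrix m, ranking movies for the new user is then:
#   pred = m.T @ closest_user_weights(x_new, X, u)
#   top6 = np.argsort(-pred)[:6]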
We revisit the same approach using Python. A good reference is here: https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe; https://drive.google.com/file/d/1M5mHqLfiBDJWqefQRT-d-Y3301RYqW-z/view?usp=sharing
See also this nice article on matrix factorization and collaborative filtering in the realm of ALS: https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1; https://drive.google.com/file/d/1LZN0kDoCGvErof9xAoOSLjU6NpG4C4bm/view?usp=sharing
We do a very basic implementation using non-negative matrix factorization (NMF) from scikit-learn, which likewise factorizes the ratings matrix into two low-rank components.
from numpy.random import randint
N=50; M=200; K=2
y = 1 + randint(5,size=(N,M)) #Matrix (i,j) to be factorized
y
array([[4, 5, 4, ..., 2, 4, 1], [2, 1, 4, ..., 2, 4, 1], [4, 1, 5, ..., 2, 1, 2], ..., [1, 5, 4, ..., 4, 4, 4], [1, 3, 5, ..., 4, 4, 4], [5, 3, 1, ..., 4, 2, 3]])
from sklearn.decomposition import NMF
model = NMF(n_components=2, init='random', random_state=0)
u = model.fit_transform(y)
m = model.components_
print(u.shape, m.shape)
#User 1's weights
u[1,:]
array([1.16589846, 1.37151716])
#Recommended movies for User 1
r = m.T.dot(u[1,:])   #Predicted ratings for all M movies
argsort(-r)[:6]       #Indices of the six highest predicted ratings
This lists the indices of the six movies with the highest predicted ratings for this user.