Chapter 14 The Machine Knows What You Want: Recommender Systems

14.1 Introduction

A recommendation algorithm tells you what you like or want. It may tell you about many things you like, sorted in order as well. It tries to understand your preferences using recorded data on your likes and dislikes.

Netflix has a recommendation engine for movies. It tries to show you movies that you prefer. If you think about all the \(N\) users of a movie service, each one having preferences over \(K\) attributes of a movie, then we can represent this matrix as a collection of weights, with each user on the columns, and the attributes on the rows. This would be a matrix \(u \in R^{K \times N}\). Each element of the matrix is indexed as \(u_{ki}\), where \(i=1,2,...,N\), and \(k=1,2,...,K\).

Likewise imagine another matrix \(m\) of \(M\) movies on the columns and the same \(K\) attributes on the rows. We get a matrix \(m \in R^{K \times M}\). Each element of the matrix is indexed as \(m_{kj}\), where \(j=1,2,...,M\), and \(k=1,2,...,K\).

For any user \(i\), we may rank movies based on the predicted score \(r_{ij}\) for movie \(j\), easily calculated as

\[ r_{ij} = \sum_{k=1}^K u_{ki} m_{kj} = u_i^\top m_j \]

where \(u_i\) is a column vector of size \(K \times 1\), and \(m_j\) is a column vector of size \(K \times 1\) as well. The elements of \(r_{ij}\) form a matrix \(r\) of dimension \(N \times M\). Some, but not all of these elements are actually observed, because users rate movies. While matrix \(r\) may be observable with ratings data, matrices \(u\) and \(m\) are latent, because the \(K\) attributes are unknown. We do know that

\[ r = u^\top m \]

Therefore, we would like to factorize matrix \(r\) into the two matrices \(u,m\).

If the true score for movie \(j\), user \(i\), is \(y_{ij}\), then we want to find \(u,m\) that deliver the closest value of \(r_{ij}\) to its true value. This is done using a technique known as Alternating Least Squares (ALS).

14.2 Alternating Least Squares

The best fit recommender system is specified as the solution to the following problem, where we minimize loss function \(L\). Since the notation gets hairy here, remember that any variable with two subscripts is scalar, with one subscript is a vector, and with no subscripts is a matrix.

\[ \begin{align} L &= \sum_{i=1}^N \sum_{j=1}^M (y_{ij}-r_{ij})^2 + \lambda_u \sum_{i=1}^N \parallel u_i \parallel^2 + \lambda_m \sum_{j=1}^M \parallel m_j \parallel^2 \\ &= \sum_{i=1}^N \sum_{j=1}^M (y_{ij}-u_i^\top m_j)^2 + \lambda_u \sum_{i=1}^N u_i^\top u_i + \lambda_m \sum_{j=1}^M m_j^\top m_j \end{align} \]

We wish to find matrices \(u,m\) that solve this minimization problem. We begin with some starting guess for both matrices. Then, we differentiate function \(L\) with respect to just one matrix, say \(u\), and solve for its optimal values. Next, we take the new values of \(u\), and old values of \(m\), and differentiate \(L\) with respect to \(m\) to solve for the new optimal values of \(m\), holding \(u\) fixed. We then repeat the process for a chosen number of epochs, alternating between the \(u\) and \(m\) subproblems. This eventually converges to the optimal \(u\) and \(m\) matrices, completing the factorization that minimizes the loss function.

14.3 Solve \(u\) matrix

Differentiate \(L\) with respect to each user, obtaining \(N\) first-order equations, which may then be set to zero and solved.

\[ \frac{\partial L}{\partial u_i} = \sum_{i=1}^N \sum_{j=1}^M (y_{ij}-u_i^\top m_j)(-2 m_j) + 2 \lambda_u u_i = 0 \quad \in {\cal R}^{K \times 1} \]

which gives the following equation:

\[ \sum_{j=1}^M (y_{ij}-u_i^\top m_j)m_j = \lambda_u u_i \]

If you work it out carefully you can write this purely in matrix form as follows:

\[ (y_i - m^\top u_i)^\top = \lambda_u u_i^\top \quad \in {\cal R}^{1 \times K} \]

(Note that \(y_i \in {\cal R}^{1 \times M}\); \(m^\top \in {\cal R}^{M \times K}\); \(u_i \in {\cal R}^{K \times 1}\). And so the LHS is \({\cal R}^{1 \times K}\), as is the RHS.)

We rewrite this economically as

\[ \begin{align} m \cdot (y_i^\top - m^\top u_i) &= \lambda_u u_i \quad \in {\cal R}^{K \times 1} \\ m \cdot y_i^\top &= (\lambda_u I + m \cdot m^\top) \cdot u_i \\ \mbox{ } \\ u_i &= (\lambda_u I + m \cdot m^\top)^{-1} \cdot m \cdot y_i^\top \quad \in {\cal R}^{K \times 1} \end{align} \]

where \(I \in {\cal R}^{K \times K}\) is an identity matrix. This gives one column \(u_i\) of the \(u\) matrix, and we compute this over all \(i=1,2,...,N\).

14.4 Solve \(m\) matrix

This is analogous to the \(u\) matrix, and the answer is

\[ m_j = (\lambda_m I + u \cdot u^\top)^{-1} \cdot u \cdot y_j \quad \in {\cal R}^{K \times 1} \]

This gives one column \(m_j\) of the \(m\) matrix, and we compute this over all \(j=1,2,...,M\). Note that \(I \in {\cal R}^{K \times K}\) is an identity matrix, and \(y_j \in {\cal R}^{N \times 1}\).

For those students who are uncomfortable with this sort of matrix algebra, I strongly recommend taking a small system of \(N=4\) users and \(M=3\) movies, and factorize matrix \(r \in {\cal R}^{3 \times 2}\). Maybe set \(K=2\). Rework the calculus and algebra above to get comfortable with these mathematical objects.

The alternating least squares algorithm is similar to the Expectations-Maximization (EM) algorithm of Dempster, Laird, and Rubin (1977).

14.5 ALS package

In R, we have the ALS package to do this matrix factorization.

library(ALS)

## Loading required package: nnls

## Loading required package: Iso

## Iso 0.0-17

Suppose we have 50 users who rate 200 movies. This gives us the \(y\) matrix of true ratings. We want to factorize this matrix into two latent matrices \(u\) and \(m\), by choosing \(K=2\) latent attributes. The code for this is simple.

N=50; M=200; K=2
y = matrix(ceiling(runif(N*M)*5),N,M)    #Matrix (i,j) to be factorized
u0 = matrix(rnorm(K*N),N,K)     #Guess for u matrix
m0 = matrix(rnorm(K*M),M,K)     #Guess for m matrix
res = als(CList=list(u0),S=m0,PsiList=list(y))

## Initial RSS 129355.7 
## Iteration (opt. S): 1, RSS: 109957.2, RD: 0.1499624
## Iteration (opt. C): 2, RSS: 67315.62, RD: 0.3878016
## Iteration (opt. S): 3, RSS: 19471.09, RD: 0.7107493
## Iteration (opt. C): 4, RSS: 19029.84, RD: 0.02266185
## Iteration (opt. S): 5, RSS: 18992.98, RD: 0.001936723
## Iteration (opt. C): 6, RSS: 18966.15, RD: 0.001412627
## Iteration (opt. S): 7, RSS: 18947.37, RD: 0.0009902627
## Initial RSS / Final RSS = 129355.7 / 18947.37 = 6.827104

We now extract the two latent matrices. The predicted ratings matrix is also generated, i.e., \(r\). We compute the RMSE of ratings predictions.

#Results
u = t(res$CList[[1]]); print(dim(u))   #Put in K x N format

## [1]  2 50

m = t(res$S); print(dim(m))    #In K x M format

## [1]   2 200

r = t(u) %*% m; print(dim(r))  #Should be N x M

## [1]  50 200

e = (r-y)^2; print(mean(e))

## [1] 1.894737

#Check
print(mean(res$resid[[1]]^2))

## [1] 1.894737

14.6 Interpretation and Use

What does the \(u\) matrix tell us? It says for each of the 50 users, how they weight the two attributes. For example, the first user has the following weights:

print(u[,1])

## [1] 11.46459  5.84150

You can see that each attribute is given different weights.

Likewise, you can take matrix \(m\) and see how much each movie offers on each attribute dimension. For example, the first movie’s loadings are

print(m[,1])

## [1] 0.2137344 0.0583482

How do we use this decomposition? We can take a new user and find out which existing user is closest, using cosine similarity on some other characteristics. Then you can use that user’s weights in matrix \(u\), to get the movie rankings ordering. Suppose the new user’s weights just happen to be the mean of all the other users’. Then we have

#Find new user's weights on attributes
u_new = as.matrix(rowMeans(u))
print(u_new)

##          [,1]
## [1,] 9.576559
## [2,] 9.489472

#Find predicted ratings for all M movies for the new user
pred_ratings = t(m) %*% u_new
sol = sort(pred_ratings,decreasing=TRUE,index.return=TRUE)
print(head(sol$ix))

## [1]  87 162 158 165 112  97

The top 6 movie numbers are listed.

References

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the Em Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1). [Royal Statistical Society, Wiley]: 1–38. http://www.jstor.org/stable/2984875.