lycpaul@home:~$

Movie Recommendation with Item-based Collaborative Filtering



Background

I have completed the CS598 Practical Statistical Learning course last semester (2024 Fall). This course covers the basic concepts of statistical learning, including linear models, non-linear models, and ensemble methods. The last project of the course is to build a movie recommendation system using item-based collaborative filtering (IBCF). It is quite a fun project to implement from algorithm to deployment on web.

Movie Recommendation with Item-based Collaborative Filtering

Item-based Collaborative Filtering

The project used the MovieLens Dataset with ~1 million anonymous ratings for 3,706 movies, provided by 6,040 MovieLens users who joined the platform in 2000.

Matrix variables used for the project and their dimensions:

  • $\mathbf{R}$: denote the 6040-by-3706 user-movie rating matrix.
  • $\mathbf{S}$: denote the 3706-by-3706 similarity matrix.
  • $\mathbf{I}$: denote the 3706-by-3706 cardinality matrix.

We first compute the (transformed) Cosine similarity among the 3,706 movies. For movies $i$ and $j$, let \(\mathcal{I}_{ij}\) denote the set of users who rated both movies $i$ and $j$. We ignore similarities computed based on less than three user ratings. Thus, define the similarity between movie $i$ and movie $j$ as follows, when the cardinality of \(\mathcal{I}_{ij}\) is bigger than two,

\[S_{ij} = \frac{1}{2} + \frac{1}{2} \frac{\sum_{l \in \mathcal{I}_{ij}} R_{li}{R_{lj}}}{\sqrt{\sum_{l \in \mathcal{I}_{ij}} R_{li}^2} \sqrt{\sum_{l \in \mathcal{I}_{ij}} R_{lj}^2}}\]

Python code to compute the similarity matrix.

def compute_cos_similarity(rating: pd.DataFrame, cardinality: pd.DataFrame):
    # recenter the rating matrix
    R = rating.subtract(rating.mean(axis=1), axis='rows').fillna(0)
    
    # compute the dot product
    dot_product = R.T @ R
    
    # compute the euclidean distance
    euclidean_dist = (R.T ** 2) @ (R != 0)
    euclidean_dist = euclidean_dist.apply(np.vectorize(np.sqrt))
    euclidean_dist = euclidean_dist.T * euclidean_dist
    
    # compute the cosine similarity
    cosine_similarity = dot_product / euclidean_dist
    
    # transformed Cosine similarity (1 + cos)/2 ensures the similarity is between 0 and 1
    S = (1 + cosine_similarity) / 2
    
    # set the diagonal to nan
    np.fill_diagonal(S.values, np.nan)

    # ignore similarities computed based on less than three user ratings
    S[cardinality < 3] = None
    
    return S

Display the similarity matrix for the selected movies.

movies_ids = ["m1", "m10", "m100", "m1510", "m260", "m3212"]
display(similarity.loc[movies_ids, movies_ids])
  m1 m10 m100 m1510 m260 m3212
m1 NaN 0.512106 0.392000 NaN 0.741597 NaN
m10 0.512106 NaN 0.547458 NaN 0.534349 NaN
m100 0.392000 0.547458 NaN NaN 0.329694 NaN
m1510 NaN NaN NaN NaN NaN NaN
m260 0.741597 0.534349 0.329694 NaN NaN NaN
m3212 NaN NaN NaN NaN NaN NaN

Then we need to create a function named myIBCF, which takes a newuser a 3706-by-1 vector and similarity matrix as input, and output the top ten movies to this new use using the following formula as the prediction for movie 𝑖:

\[\frac{1}{\sum_{j \in S(i) \cap \{l: w_l \ne NA\}} S_{ij}} \sum_{j \in S(i) \cap \{l: w_l \ne NA\}} S_{ij} w_{j}\]

The prediction is missing if the set \(S(i) \cap \{l: w_l \ne NA\}\) is empty, meaning if none of the similar items have been rated by the user.

Here’s the Python code to implement the myIBCF function.

def myIBCF(newuser: pd.Series, similarity: pd.DataFrame) -> pd.Series:
    n = 10  # default number of recommendations
    w = newuser
    S = similarity.fillna(0)

    # compute the movie recommendations
    rated_movies = w[w.notna()].index
    rated_matrix = (~w.isna()).astype(int)
    w = w.fillna(0)
    recomendations = S.dot(w) / S.dot(rated_matrix)
    recomendations = recomendations.sort_values(ascending=False)[0:n].dropna()
    return recomendations
tags: Statistical Learning

Back to Blogs | Top | Tags