15 minute read



Hybrid recommendation system

Origin of dataset

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

  • 100,000 ratings (1-5) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies.
  • Simple demographic info for the users (age, gender, occupation, zip).

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed. Detailed descriptions of the data files used are given below.

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the conditions listed in the dataset's README.

Definition of a recommendation system

Credits: Wikipedia

A recommender system is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item. Recommender systems are primarily used in commercial applications.

Recommender systems are utilized in a variety of areas, and are most commonly recognized as playlist generators for video and music services like Netflix, YouTube and Spotify, product recommenders for services such as Amazon, or content recommenders for social media platforms such as Facebook and Twitter. These systems can operate using a single input, like music, or multiple inputs within and across platforms like news, books, and search queries. There are also popular recommender systems for specific topics like restaurants and online dating.

Recommendation engine design

There are several approaches; among them are the following.

Collaborative filtering

It is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items. By locating peer users/items with a rating history similar to the current user or item, they generate recommendations using this neighborhood […].

Advantage: it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an “understanding” of the item itself […].
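
To make the neighborhood idea concrete, here is a toy, from-scratch sketch in NumPy (invented data; the real pipeline below uses LightFM instead): find the user whose rating vector is most similar to the target user's, then recommend what that peer liked.

import numpy as np

# Toy ratings matrix: rows = users, columns = items, 0 = unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = -1.0                 # exclude the user themselves
peer = sims.argmax()                # the most similar "peer" user
# Recommend items the peer rated that the target has not rated yet,
# best-rated first.
candidates = np.where((R[target] == 0) & (R[peer] > 0))[0]
print(candidates[np.argsort(-R[peer, candidates])])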

Content-based filtering

It is based on a description of the item and a profile of the user’s preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on product features.

In this system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. These algorithms try to recommend items that are similar to those that a user liked in the past, or is examining in the present. They do not rely on a user sign-in mechanism to generate this often temporary profile. In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended. […]
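
A minimal sketch of this idea (toy genre vectors, invented for illustration): represent items by binary feature vectors and recommend the item closest to one the user liked.

import numpy as np

# Toy item-feature matrix: rows = items, columns = genres (1 = has genre).
items = np.array([[1, 0, 1],   # item 0: Action + Comedy
                  [1, 1, 0],   # item 1: Action + Thriller
                  [0, 0, 1]])  # item 2: Comedy only

liked = items[0]                # the user liked item 0
scores = items @ liked          # genre overlap with the liked item
scores[0] = -1                  # don't recommend the liked item itself
print(scores.argmax())          # the best-matching item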

Hybrid recommender systems

Most recommender systems now use a hybrid approach, combining collaborative filtering, content-based filtering, and other approaches.
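
One very simple way to hybridise, sketched here with hypothetical per-item score arrays (LightFM, used later in this post, instead folds user/item metadata directly into a matrix factorization model):

import numpy as np

# Hypothetical score vectors produced by a collaborative and a
# content-based model respectively (names invented for this sketch).
cf_scores = np.array([0.9, 0.1, 0.4])
cb_scores = np.array([0.2, 0.8, 0.5])

alpha = 0.7                                         # weight of the collaborative part
hybrid_scores = alpha * cf_scores + (1 - alpha) * cb_scores
print(hybrid_scores.argsort()[::-1])                # items ranked by blended score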


Data exploration & preparation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy import sparse

Detailed description of the data files used

rating_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
df_ratings = pd.read_csv('../input/u.data', sep='\t', names=rating_cols)
df_ratings.shape
(100000, 4)
df_ratings.head()
user_id movie_id rating unix_timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
df_users = pd.read_csv('../input/u.user', sep='|', names=users_cols, parse_dates=True)
df_users.head()
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
items_cols = ['movie_id' , 'movie_title' , 'release_date' , 'video_release_date' , 'IMDb_URL' , 'unknown|' , 'Action|' , 'Adventure|', 'Animation|', "Children's|", 'Comedy|', 'Crime|', 'Documentary|', 'Drama|',\
              'Fantasy|', 'Film-Noir|', 'Horror|', 'Musical|', 'Mystery|', 'Romance|', 'Sci-Fi|', 'Thriller|', \
              'War|', 'Western|']
df_items = pd.read_csv('../input/u.item', sep='|', encoding='latin-1', names=items_cols, parse_dates=True, index_col='movie_id')
df_items.head()
movie_title release_date video_release_date IMDb_URL unknown| Action| Adventure| Animation| Children's| Comedy| ... Fantasy| Film-Noir| Horror| Musical| Mystery| Romance| Sci-Fi| Thriller| War| Western|
movie_id
1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
2 GoldenEye (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(... 0 1 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 Four Rooms (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%... 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%... 0 1 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995) 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 23 columns

movie_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
df_movies = pd.read_csv('../input/u.item', sep='|', names=movie_cols, usecols=range(5), encoding='latin-1', index_col='movie_id')
df_movies.shape
(1682, 4)
df_movies.head()
title release_date video_release_date imdb_url
movie_id
1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2...
2 GoldenEye (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(...
3 Four Rooms (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%...
4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%...
5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995)

df_movies provides information about each movie (one movie per row).

df_ratings provides one row for each rating a user has made.

u.data – The full u data set: 100,000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. Each line is a tab-separated list of: user id | item id | rating | timestamp. The timestamps are Unix seconds since 1/1/1970 UTC.
u.item – Information about the items (movies). Each line is a pipe-separated list of: movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western. The last 19 fields are the genres; a 1 indicates the movie is of that genre, a 0 indicates it is not. Movies can be in several genres at once. The movie ids are the ones used in the u.data data set.

Data Visualizations

sns.countplot(x='rating', data=df_ratings)
Figure: count plot of rating values (1-5).

plt.figure(figsize=(14, 5))
sns.countplot(x='age', data=df_users)
Figure: count plot of user ages.

sns.countplot(x='sex', data=df_users)
Figure: count plot of user sex (M/F).

plt.figure(figsize=(5, 8))
sns.countplot(y='occupation', data=df_users)
Figure: count plot of user occupations.

genre_list = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama',
              'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
              'War', 'Western']
genre_sum = []
for g in genre_list:
    genre_sum.append(df_items[g + '|'].sum())  # the df_items column names carry a trailing '|'

genre_df = pd.DataFrame({'genres' : genre_list, 'sum' : genre_sum})

plt.figure(figsize=(5, 8))
sns.barplot(y='genres', x='sum', data=genre_df)
Figure: bar plot of the number of movies per genre.

Data preparation

df_items.columns
Index(['movie_title', 'release_date', 'video_release_date', 'IMDb_URL',
       'unknown|', 'Action|', 'Adventure|', 'Animation|', 'Children's|',
       'Comedy|', 'Crime|', 'Documentary|', 'Drama|', 'Fantasy|', 'Film-Noir|',
       'Horror|', 'Musical|', 'Mystery|', 'Romance|', 'Sci-Fi|', 'Thriller|',
       'War|', 'Western|'],
      dtype='object')
df_items = df_items.drop(columns=['movie_title', 'release_date', 'video_release_date', 'IMDb_URL'])
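# String trick: dotting the 0/1 genre matrix with the column names works because
# 1 * "Action|" == "Action|" and 0 * "Action|" == "", so the sum concatenates the
# active genre names into e.g. "Animation|Children's|Comedy|"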
df_items = df_items.assign(genres=df_items.values.dot(df_items.columns.values))
df_items = df_items.drop(columns=['unknown|', 'Action|', 'Adventure|', 'Animation|', "Children's|",\
       'Comedy|', 'Crime|', 'Documentary|', 'Drama|', 'Fantasy|', 'Film-Noir|',\
       'Horror|', 'Musical|', 'Mystery|', 'Romance|', 'Sci-Fi|', 'Thriller|',\
       'War|', 'Western|'])

df_movies = pd.concat([df_movies, df_items], axis=1)
df_movies = df_movies.drop(columns=['release_date', 'video_release_date', 'imdb_url'])
df_movies.head()
title genres
movie_id
1 Toy Story (1995) Animation|Children's|Comedy|
2 GoldenEye (1995) Action|Adventure|Thriller|
3 Four Rooms (1995) Thriller|
4 Get Shorty (1995) Action|Comedy|Drama|
5 Copycat (1995) Crime|Drama|Thriller|
df_ratings = df_ratings.drop(columns=['unix_timestamp'])
df_ratings.head()
user_id movie_id rating
0 196 242 3
1 186 302 3
2 22 377 1
3 244 51 2
4 166 346 1

In order to fit a LightFM model, the DataFrame must be transformed into a sparse matrix (i.e. a big matrix in which most of the values are zero or empty). Pandas DataFrames and NumPy arrays are not suited to manipulating this kind of data, so we use SciPy sparse matrices.

In doing so, the id information (user_id and movie_id) is lost: from then on we only deal with indices (row number and column number). Therefore, the df_to_matrix function also returns dictionaries mapping indices to ids (e.g. user_id_to_idx, which maps each user_id to its row index in the matrix).
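
As a tiny standalone illustration of the format (toy data, not the MovieLens set; numpy and scipy.sparse were imported above):

rows = np.array([0, 0, 1])   # row indices (think: user indices)
cols = np.array([0, 2, 1])   # column indices (think: movie indices)
vals = np.ones(3)            # one entry per interaction
toy = sparse.coo_matrix((vals, (rows, cols)), shape=(2, 3)).tocsr()
toy.toarray()
array([[1., 0., 1.],
       [0., 1., 0.]])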

# Reference : https://github.com/EthanRosenthal/rec-a-sketch

def threshold_interactions_df(df, row_name, col_name, row_min, col_min):
    """Limit interactions df to minimum row and column interactions.
    Parameters
    ----------
    df : DataFrame
        DataFrame which contains a single row for each interaction between
        two entities. Typically, the two entities are a user and an item.
    row_name : str
        Name of column in df which corresponds to the eventual row in the
        interactions matrix.
    col_name : str
        Name of column in df which corresponds to the eventual column in the
        interactions matrix.
    row_min : int
        Minimum number of interactions that the row entity has had with
        distinct column entities.
    col_min : int
        Minimum number of interactions that the column entity has had with
        distinct row entities.
    Returns
    -------
    df : DataFrame
        Thresholded version of the input df. Order of rows is not preserved.
    Examples
    --------
    df looks like:
    user_id | item_id
    =================
      1001  |  2002
      1001  |  2004
      1002  |  2002
    thus, row_name = 'user_id', and col_name = 'item_id'
    If we were to set row_min = 2 and col_min = 1, then the returned df would
    look like
    user_id | item_id
    =================
      1001  |  2002
      1001  |  2004
    """

    n_rows = df[row_name].unique().shape[0]
    n_cols = df[col_name].unique().shape[0]
    sparsity = float(df.shape[0]) / float(n_rows*n_cols) * 100
    print('Starting interactions info')
    print('Number of rows: {}'.format(n_rows))
    print('Number of cols: {}'.format(n_cols))
    print('Sparsity: {:4.3f}%'.format(sparsity))

    done = False
    while not done:
        starting_shape = df.shape[0]
        col_counts = df.groupby(row_name)[col_name].count()
        df = df[~df[row_name].isin(col_counts[col_counts < col_min].index.tolist())]
        row_counts = df.groupby(col_name)[row_name].count()
        df = df[~df[col_name].isin(row_counts[row_counts < row_min].index.tolist())]
        ending_shape = df.shape[0]
        if starting_shape == ending_shape:
            done = True

    n_rows = df[row_name].unique().shape[0]
    n_cols = df[col_name].unique().shape[0]
    sparsity = float(df.shape[0]) / float(n_rows*n_cols) * 100
    print('Ending interactions info')
    print('Number of rows: {}'.format(n_rows))
    print('Number of columns: {}'.format(n_cols))
    print('Sparsity: {:4.3f}%'.format(sparsity))
    return df

def get_df_mappings(df, row_name, col_name):
    """Map entities in interactions df to row and column indices
    Parameters
    ----------
    df : DataFrame
        Interactions DataFrame.
    row_name : str
        Name of column in df which contains row entities.
    col_name : str
        Name of column in df which contains column entities.
    Returns
    -------
    rid_to_idx : dict
        Maps row ID's to the row index in the eventual interactions matrix.
    idx_to_rid : dict
        Reverse of rid_to_idx. Maps row index to row ID.
    cid_to_idx : dict
        Same as rid_to_idx but for column ID's
    idx_to_cid : dict
        Reverse of cid_to_idx. Maps column index to column ID.
    """


    # Create mappings
    rid_to_idx = {}
    idx_to_rid = {}
    for (idx, rid) in enumerate(df[row_name].unique().tolist()):
        rid_to_idx[rid] = idx
        idx_to_rid[idx] = rid

    cid_to_idx = {}
    idx_to_cid = {}
    for (idx, cid) in enumerate(df[col_name].unique().tolist()):
        cid_to_idx[cid] = idx
        idx_to_cid[idx] = cid

    return rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid


def df_to_matrix(df, row_name, col_name):
    """Take interactions dataframe and convert to a sparse matrix
    Parameters
    ----------
    df : DataFrame
    row_name : str
    col_name : str
    Returns
    -------
    interactions : sparse csr matrix
    rid_to_idx : dict
    idx_to_rid : dict
    cid_to_idx : dict
    idx_to_cid : dict
    """
    rid_to_idx, idx_to_rid,\
        cid_to_idx, idx_to_cid = get_df_mappings(df, row_name, col_name)

    def map_ids(row, mapper):
        return mapper[row]

    I = df[row_name].apply(map_ids, args=[rid_to_idx]).values
    J = df[col_name].apply(map_ids, args=[cid_to_idx]).values
    V = np.ones(I.shape[0])
    interactions = sparse.coo_matrix((V, (I, J)), dtype=np.float64)
    interactions = interactions.tocsr()
    return interactions, rid_to_idx, idx_to_rid, cid_to_idx, idx_to_cid
ratings_matrix, user_id_to_idx, idx_to_user_id, movie_id_to_idx, idx_to_movie_id = \
    df_to_matrix(df_ratings, row_name='user_id', col_name='movie_id')

ratings_matrix.toarray()
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

This leads to the creation of 5 new variables:

  • a final sparse matrix ratings_matrix (this will be the data used to train the model), and the following utility mappers:
  • user_id_to_idx
  • idx_to_user_id
  • movie_id_to_idx
  • idx_to_movie_id
How to use those mappers?

# For instance, which movies did user_id 4 rate?
movies_user4 = ratings_matrix.toarray()[user_id_to_idx[4], :]
movies_id_user4 = np.sort(np.vectorize(idx_to_movie_id.get)(np.argwhere(movies_user4>0).flatten()))
df_movies.loc[movies_id_user4, :]['title']
movie_id
11                          Seven (Se7en) (1995)
50                              Star Wars (1977)
210    Indiana Jones and the Last Crusade (1989)
258                               Contact (1997)
260                         Event Horizon (1997)
264                                 Mimic (1997)
271                     Starship Troopers (1997)
288                                Scream (1996)
294                             Liar Liar (1997)
300                         Air Force One (1997)
301                              In & Out (1997)
303                           Ulee's Gold (1997)
324                          Lost Highway (1997)
327                              Cop Land (1997)
328                     Conspiracy Theory (1997)
329                    Desperate Measures (1998)
354                   Wedding Singer, The (1998)
356                           Client, The (1994)
357       One Flew Over the Cuckoo's Nest (1975)
358                                 Spawn (1997)
359                       Assignment, The (1997)
360                            Wonderland (1997)
361                             Incognito (1997)
362                   Blues Brothers 2000 (1998)
Name: title, dtype: object
# Conversely, what does ratings_matrix contain for user_id 4 across a given list of movies?
movieId_list = [11, 50, 210, 324, 8, 9, 10]

movieId_idx = [movie_id_to_idx[i] for i in movieId_list]
movieId_idx

ratings_user4 = ratings_matrix.toarray()[user_id_to_idx[4], movieId_idx]
ratings_user4
# the values in ratings_matrix tell whether a user rated a movie, but they do not carry the rating value
array([1., 1., 1., 1., 0., 0., 0.])

Recommendation model

from lightfm.cross_validation import random_train_test_split
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

Introduction to LightFM

LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback.

It also makes it possible to incorporate both item and user metadata into the traditional matrix factorization algorithms. It represents each user and item as the sum of the latent representations of their features, thus allowing recommendations to generalise to new items (via item features) and to new users (via user features).
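
Although the rest of this post fits a pure matrix factorization model, here is a hedged sketch of how the genre metadata could be supplied to make the model truly hybrid. It assumes a hypothetical genre_matrix, a (n_items, n_genres) binary array whose row order matches the interaction matrix's item indices (idx_to_movie_id); building it is left out here.

# Hedged sketch, not run in this notebook. `genre_matrix` is hypothetical (see above).
genre_features = sparse.csr_matrix(genre_matrix)
eye = sparse.identity(genre_features.shape[0], format='csr')
# Stacking an identity block keeps a dedicated latent vector per item
# in addition to the shared genre features.
item_features = sparse.hstack([eye, genre_features]).tocsr()

hybrid_model = LightFM(loss='warp', no_components=40)
hybrid_model.fit(ratings_matrix, item_features=item_features, epochs=30)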

Data expected by the LightFM fit method

From the documentation, ‘Usage’ chapter:

model = LightFM(no_components=30)

Assuming train is a (no_users, no_items) sparse matrix (with 1s denoting positive, and -1s negative interactions),

you can fit a traditional matrix factorization model by calling:

model.fit(train, epochs=20)

This will train a traditional MF model, as no user or item features have been supplied.

Splitting data

The dataset is slightly different from what we are used to with scikit-learn (X as features, y as target).

LightFM provides a random_train_test_split function in its cross_validation module, dedicated to this use case.

Let’s split the data randomly into a train matrix and a test matrix, with 20% of the interactions in the test set.

train, test = random_train_test_split(ratings_matrix, test_percentage=0.2)
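
As an aside, the split can be made reproducible by seeding it (random_train_test_split accepts a NumPy RandomState):

train, test = random_train_test_split(ratings_matrix, test_percentage=0.2,
                                      random_state=np.random.RandomState(42))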

Metric and model performance evaluation

  • The optimized metric is precision@k: the fraction of the top k recommended items that the user has actually interacted with, i.e. how good the top of the ranking produced by the model is (see the sketch after this list).
  • We’ll train with WARP, the Weighted Approximate-Rank Pairwise loss. It maximises the rank of positive examples by repeatedly sampling negative examples until a rank-violating one is found. It is useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired. (The code below actually uses ‘warp-kos’, the k-th order statistic variant of WARP.)
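
As a quick sketch of what precision@k measures for a single user (toy function, independent of LightFM’s own implementation):

# `ranked` = item indices ordered best-first by the model;
# `relevant` = the items the user actually interacted with in the test set.
def precision_at_k_single(ranked, relevant, k=5):
    return len(set(ranked[:k]) & set(relevant)) / k

print(precision_at_k_single(ranked=[3, 7, 1, 9, 4], relevant={7, 9, 42}))  # 0.4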

Model training and precision

model = LightFM(loss='warp-kos', no_components=40, k=3, learning_rate=0.03)
model.fit(train, epochs=100)
print("Train precision: %.2f" % precision_at_k(model, train, k=5).mean())
print("Test precision: %.2f" % precision_at_k(model, test, train_interactions=train, k=5).mean())
Train precision: 0.91
Test precision: 0.36

What does the model's item_embeddings attribute contain?

ratings_matrix.toarray().shape
(943, 1682)
# equivalent of the Q matrix with no_components = 40, i.e. the number of latent features found by the model
model.item_embeddings.shape
(1682, 40)
# equivalent of the P matrix with no_components = 40, i.e. the number of latent features found by the model
model.user_embeddings.shape
(943, 40)
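
These two embedding matrices are what model.predict combines to score items for a user. As a sketch of a direct use (predict is part of the LightFM API; this post instead uses item-item similarity below), we could rank all items for user_id 4:

n_items = ratings_matrix.shape[1]
scores = model.predict(user_id_to_idx[4], np.arange(n_items))
top_idx = np.argsort(-scores)[:10]                     # 10 best-scored item indices
df_movies.loc[[idx_to_movie_id[i] for i in top_idx], 'title']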

Similarity scores between pairs of movies

Previously, we trained a model that factorized our ratings matrix into a U matrix of shape (n_users, no_components), model.user_embeddings, and a V matrix of shape (n_movies, no_components), model.item_embeddings.

Now we would like to compute the similarity between each pair of movies. To calculate a similarity score, there are two common choices:

  • cosine_similarity function
    • from sklearn.metrics.pairwise import cosine_similarity
    • cosine_similarity(X, Y)
  • or Pearson correlation:
    • import numpy as np
    • np.corrcoef(X, Y)
# Compute similarity_scores, of shape (n_movies, n_movies): element (i, j) is the similarity between the movies at index i and index j.
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(model.item_embeddings, model.item_embeddings)
similarity_scores
array([[ 0.9999999 ,  0.3090872 , -0.3577202 , ...,  0.04141378,
         0.10872173,  0.04886636],
       [ 0.3090872 ,  1.        , -0.53334486, ..., -0.2657429 ,
        -0.18552116, -0.18115026],
       [-0.3577202 , -0.53334486,  1.        , ...,  0.48775673,
         0.33573252,  0.3887553 ],
       ...,
       [ 0.04141378, -0.2657429 ,  0.48775673, ...,  1.0000001 ,
         0.8434785 ,  0.90591186],
       [ 0.10872173, -0.18552116,  0.33573252, ...,  0.8434785 ,
         1.0000001 ,  0.86416286],
       [ 0.04886636, -0.18115026,  0.3887553 , ...,  0.90591186,
         0.86416286,  1.        ]], dtype=float32)
# pairwise similarities between the learned feature vectors of all movies
cosine_similarity(model.item_embeddings, model.item_embeddings).shape
(1682, 1682)
np.corrcoef(model.item_embeddings)
array([[ 1.        ,  0.30896349, -0.36367854, ...,  0.04217422,
         0.10564704,  0.0499952 ],
       [ 0.30896349,  1.        , -0.53797214, ..., -0.26560836,
        -0.18870322, -0.18094361],
       [-0.36367854, -0.53797214,  1.        , ...,  0.49398712,
         0.32494301,  0.39578214],
       ...,
       [ 0.04217422, -0.26560836,  0.49398712, ...,  1.        ,
         0.85572397,  0.90588932],
       [ 0.10564704, -0.18870322,  0.32494301, ...,  0.85572397,
         1.        ,  0.87858968],
       [ 0.0499952 , -0.18094361,  0.39578214, ...,  0.90588932,
         0.87858968,  1.        ]])
np.corrcoef(model.item_embeddings).shape
(1682, 1682)

Practical use of the recommendation engine

# For instance, what are the 10 most similar movies to the movie at idx 21?
df_movies.loc[np.vectorize(idx_to_movie_id.get)(similarity_scores[21].argsort()[::-1][1:11])]
title genres
movie_id
1568 Vermont Is For Lovers (1992) Comedy|Romance|
1054 Mr. Wrong (1996) Comedy|
785 Only You (1994) Comedy|Romance|
868 Hearts and Minds (1996) Drama|
852 Bloody Child, The (1996) Drama|Thriller|
1315 Inventing the Abbotts (1997) Drama|Romance|
703 Widows' Peak (1994) Drama|
1452 Lady of Burlesque (1943) Comedy|Mystery|
847 Looking for Richard (1996) Documentary|Drama|
1343 Lotto Land (1995) Drama|

Assuming a user likes Scream (movie_id = 288), what other movies would be recommended? (i.e. which movies are the most similar?)

Retrieve the top 10 recommendations.

# similarity_scores is indexed by matrix idx, so first map the movie_id to its idx.
df_movies.loc[np.vectorize(idx_to_movie_id.get)(similarity_scores[movie_id_to_idx[288]].argsort()[::-1][1:11])]
title genres
movie_id
333 Game, The (1997) Mystery|Thriller|
294 Liar Liar (1997) Comedy|
328 Conspiracy Theory (1997) Action|Mystery|Romance|Thriller|
475 Trainspotting (1996) Drama|
307 Devil's Advocate, The (1997) Crime|Horror|Mystery|Thriller|
147 Long Kiss Goodnight, The (1996) Action|Thriller|
245 Devil's Own, The (1997) Action|Drama|Thriller|War|
300 Air Force One (1997) Action|Thriller|
156 Reservoir Dogs (1992) Crime|Thriller|
282 Time to Kill, A (1996) Drama|