Building a movie recommender system

Titan Tutorial #8: Building and deploying a collaborative-filtering recommender service from scratch

Recommender systems are information filtering systems that customize and personalize the experience of the users of a service.

In order to achieve this, recommender systems make predictions about user preferences based on multiple sources of information (interests, past actions, similar users, context…).

These systems are now pervasive in many tools that we use on a daily basis. Some examples would be:

  • Netflix: Customized recommendations about movies and TV shows
  • Spotify: Automatic Playlist Generator
  • Amazon: Automatic shopping recommendations

Recommender systems can be built in different ways which can be basically classified into 3 different approaches:

  • Collaborative Filtering: Technique based on filtering out the items a user might like based on the ratings of similar users.
  • Content-Based Filtering: Technique based on recommending items based on a comparison between the content of the items and the profile and preferences of a user.
  • Hybrid Recommendation Engines: Mixed approach combining collaborative and content-based filtering.
Collaborative vs. Content-Based Filtering

In this tutorial, we will build a recommender system using a collaborative filtering scheme.

Collaborative filtering models can be built using different approaches such as:

  • Memory based
  • Model based
  • Matrix Factorization
  • Clustering
  • Deep Learning

In this case, we will be using a Matrix Factorization model to make a basic movie recommendation engine.

Before diving into the model implementation, it is useful to get a basic understanding of what Matrix Factorization is. Matrix Factorization algorithms for recommendation work by decomposing the user-item interaction matrix into the product of two rectangular matrices of lower dimensionality. The next figure illustrates this idea at a high level:

Intuition of Matrix Factorization

The mathematical intuition of this technique is to represent both users and items in a lower dimensional space in order to find latent relations and patterns. If you are interested in knowing more about these techniques you can check this series of posts about Matrix Factorization for recommendation.

These latent features and relations allow these models to estimate whether a user is going to like a movie they have not seen yet.
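To make the idea of latent features more concrete, here is a minimal sketch on a tiny, made-up rating matrix (the values and the names `user_factors` and `item_factors` are hypothetical, not part of the tutorial's model). It uses NumPy's full SVD and keeps only k latent features to build a lower-rank approximation:

```python
import numpy as np

# Hypothetical 4-user x 5-movie rating matrix (0 = not rated).
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

# Full SVD: R = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # number of latent features to keep
user_factors = U[:, :k] * s[:k]    # shape (4, k): each user described by k features
item_factors = Vt[:k, :]           # shape (k, 5): each movie described by k features
R_k = user_factors @ item_factors  # rank-k approximation of R

# The Frobenius error of the approximation equals the norm of the dropped singular values.
error = np.linalg.norm(R - R_k)
print(user_factors.shape, item_factors.shape, round(error, 3))
```

The two small matrices are the "lower dimensionality" factors: multiplying them back together gives an estimate for every user/movie pair, including the ones that were empty in the original matrix.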

There are also several ways of computing matrix factorizations or decompositions, and which one to use depends on the final application of the model. Common factorizations are the following:

  • LU (Lower-Upper) Matrix Decomposition
  • QR Matrix Decomposition
  • Cholesky Decomposition
  • SVD (Singular Value Decomposition)

For the model of this tutorial we will use the SVD included in scipy to build our recommendation system.

Now we are ready to get into the model. As we saw in previous tutorials, it is possible to provision the required hardware and environment details for the deployment. It is as easy as creating a markdown cell in the Notebook with this YAML specification:

```yaml
titan: v1
service:
  image: scipy
  machine:
    cpu: 2
    memory: 1024MB
```

For this model we will be using the scipy environment since it includes the required functions to compute the SVD (Singular Value Decomposition).

The required imports for this model will be the following:

import os
import time
import json
import functools
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

Regarding the dataset, MovieLens, one of the most famous movie datasets, will be used to build the system:

df_movies = pd.read_csv("https://raw.githubusercontent.com/jfuentesibanez/datasets/master/movies.csv", usecols=['movieId', 'title', 'genre'], sep=';', dtype={'movieId': 'int32', 'title': 'str', 'genre': 'str'})
df_ratings = pd.read_csv("https://raw.githubusercontent.com/jfuentesibanez/datasets/master/ratings.csv", usecols=['userId', 'movieId', 'rating'], sep=';', dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

Once we have the data, and to better understand it, we transform the dataframe to set userId as rows and movieId as columns. In addition, we fill all the null values with 0.0.

df_movie_features = df_ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)

👉 It is important to remark that this will result in a sparse matrix, since an average user will have seen only a small fraction of all the available movies!
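As a quick way to see how sparse such a matrix is, the following sketch pivots a tiny, made-up ratings table (the `toy_ratings` values are hypothetical) and measures the fraction of empty cells before they are filled with zeros:

```python
import pandas as pd

# Hypothetical ratings in the same long format as df_ratings.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3, 3],
    'movieId': [10, 20, 20, 10, 30, 40],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5, 1.0],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_features = toy_ratings.pivot(index='userId', columns='movieId', values='rating')

# Fraction of user/movie pairs without a rating, measured before fillna.
sparsity = toy_features.isna().to_numpy().mean()
toy_features = toy_features.fillna(0)
print(f"{sparsity:.0%} of the cells are empty")
```

Running the same two lines on `df_movie_features` (before the `fillna`) would give the sparsity of the real dataset. Note that `pivot` raises an error if the same user/movie pair appears more than once.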

After this transformation, we will transform the pandas dataframe into a numpy array to compute the SVD and we will calculate the average rating of each user:

R = df_movie_features.to_numpy()  # as_matrix() was removed in pandas 1.0
user_ratings_mean = np.mean(R, axis = 1)

At this point, we can make the matrix factorization using the provided function in scipy.

(a) Full reconstruction (b) Reconstruction using just the k largest singular values
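# Factorize the mean-centred ratings so that adding the user means back is consistent
U, sigma, Vt = svds(R - user_ratings_mean.reshape(-1, 1), k=50)
sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)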

The code above does the following:

  1. Decomposes A (named R in our code) into three matrices named U, sigma and Vt using the k largest singular values (50 in this case). Please note that the higher the value of k, the higher the computational cost of the decomposition.
  2. Creates a diagonal matrix from sigma
  3. Reconstructs Ak with the specified value of k and adds the average rating of each user

This process can be seen as a dimensionality reduction that uncovers underlying patterns in the data, which we will use to obtain the recommendations for each user.
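The three steps can be tried on a small matrix before running them on the full dataset. The sketch below uses hypothetical ratings and names (`R_toy`, `user_means`), and it assumes that each user's row is centred on its mean before the factorization, so that adding the means back afterwards is consistent:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical ratings matrix: 5 users x 6 movies, 0 = not rated.
R_toy = np.array([
    [5, 4, 0, 0, 1, 0],
    [4, 5, 0, 1, 0, 0],
    [0, 0, 5, 4, 0, 1],
    [0, 1, 4, 5, 0, 0],
    [1, 0, 0, 0, 5, 4],
], dtype=float)

# Centre each user's row on that user's mean rating before factorizing.
user_means = R_toy.mean(axis=1, keepdims=True)
R_centered = R_toy - user_means

k = 2  # svds needs k to be smaller than min(R_toy.shape)
U, sigma, Vt = svds(R_centered, k=k)

# Rebuild the rank-k matrix and add the user means back.
predicted = U @ np.diag(sigma) @ Vt + user_means
print(predicted.round(2))
```

Unlike `np.linalg.svd`, `svds` computes only the k requested singular triplets, which is what keeps the cost manageable on a large ratings matrix.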

Now we go back to a pandas dataframe for a simpler and better data handling of the results for every user.

preds_df = pd.DataFrame(all_user_predicted_ratings, columns = df_movie_features.columns)

This matrix contains the estimated rating of every user for every movie, regardless of whether the user has seen it:

Our results matrix with the estimated ratings

At this point, we can proceed to create the structure to:

a) Make the recommendation prediction

b) Create helper functions

c) Prepare the model endpoints

Let’s see step by step how to prepare the model:

a) Make the recommendation prediction

For the recommendation prediction we will define a function to sort and extract the top rated movies of a user and the top rated recommendations:

def recommend_movies(preds_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    # Retrieve and sort the predictions of the user (rows are 0-based, userId starts at 1)
    user_row_number = userID - 1
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)

    # Movies already rated by the user, best rated first
    user_data = original_ratings_df[original_ratings_df.userId == userID]
    user_full = (
        user_data.merge(movies_df, how='left', left_on='movieId', right_on='movieId')
        .sort_values(['rating'], ascending=False)
    )

    # Best predicted movies among those the user has not rated yet
    predictions = (
        movies_df[~movies_df['movieId'].isin(user_full['movieId'])]
        .merge(pd.DataFrame(sorted_user_predictions).reset_index(), how='left', left_on='movieId', right_on='movieId')
        .rename(columns={user_row_number: 'Predictions'})
        .sort_values('Predictions', ascending=False)
        .iloc[:num_recommendations, :-1]
    )
    return user_full, predictions

b) Create helper functions

In order to make better and more manageable models, this tutorial introduces helper functions. For this model we will use two helper functions implemented as Python decorators:

  • measure: To track the execution time of a function
  • endpoint: To facilitate the processing of the request data

# Store total elapsed time, total requests and the most recent processing times in milliseconds
metrics = {'total': 0, 'requests': 0, 'recent': []}

# Maximum number of processing times to keep, from most recent to oldest
max_recent_items = 20

def store_metrics(start):
    elapsed = int((time.time() - start) * 1000)
    metrics['requests'] += 1
    metrics['total'] += elapsed
    # Insert the newest time at the front and drop the oldest ones
    metrics['recent'][0:0] = [elapsed]
    metrics['recent'] = metrics['recent'][0:max_recent_items]

def measure(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        start = time.time()
        try:
            return fn(*args, **kwds)
        finally:
            # Runs even if fn raises, so failed calls are measured too
            store_metrics(start)
    return wrapper

def endpoint(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        # The first positional argument is the JSON request body
        req = args[0] if len(args) > 0 else '{}'
        request = json.loads(req)
        args = request.get('args', {})
        return fn(args, **kwds)
    return wrapper

👉 NOTE: These and many other helper functions will be available in the upcoming release of titanio, the utility package of titan.
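To see a timing decorator of this kind in action outside the notebook, here is a small self-contained sketch. It is a hypothetical variant, not the titanio implementation: it keeps the recent timings in a `collections.deque` with `maxlen`, which discards the oldest entries automatically, and the decorated function `slow_square` is made up for the demonstration:

```python
import functools
import time
from collections import deque

# Hypothetical store: keeps only the 3 most recent timings (newest first).
recent_times = deque(maxlen=3)

def measure(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        start = time.perf_counter()
        try:
            return fn(*args, **kwds)
        finally:
            # Runs even if fn raises, so failed calls are timed too.
            elapsed_ms = int((time.perf_counter() - start) * 1000)
            recent_times.appendleft(elapsed_ms)
    return wrapper

@measure
def slow_square(x):
    time.sleep(0.01)  # simulate some work
    return x * x

results = [slow_square(i) for i in range(5)]
print(results, list(recent_times))
```

Thanks to `functools.wraps`, the decorated function keeps its original name and docstring, which helps when several endpoints are wrapped with the same helper.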

c) Preparing the model endpoints

We will create four different endpoints to interact with the model once it has been deployed and transformed into a service:

The endpoints to operate the model

Before instrumenting the endpoints that will be processed by titan, we can define the code to be executed for each endpoint.

/recommended

This function will process the arguments of the request (just the userId in this case) and will return the top recommended movies for a user.

👉 Note that we are using the @endpoint decorator to facilitate the processing of the data in the incoming requests.

@endpoint
def recommended(args):
    user_id_txt = args.get('param', args.get('001', None))
    # Keep the first all-digit element of the parameter value
    user_id = int(list(filter(str.isdigit, user_id_txt))[0])
    already_rated, predictions = recommend_movies(preds_df, user_id, df_movies, df_ratings, 10)
    return predictions.title.to_string(index=False)

/viewed

This endpoint is quite similar to /recommended. The only difference is that it will return the top rated movies of the userId passed as an argument.
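The code of this endpoint is not listed in the post, so the following is only a sketch of what it could look like. It runs on small stand-in tables (`toy_movies` and `toy_ratings` are hypothetical), and it assumes that the user id arrives under the 'param' key as a list of strings. In the tutorial's notebook, the same list is the first value returned by recommend_movies.

```python
import functools
import json
import pandas as pd

# Hypothetical stand-ins for df_movies and df_ratings.
toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A (1990)', 'B (1995)', 'C (2000)']})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2],
    'movieId': [1, 2, 3, 1],
    'rating':  [3.0, 5.0, 4.0, 2.0],
})

def endpoint(fn):
    # Same idea as the @endpoint helper: decode the JSON request and pass its 'args'.
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        req = args[0] if len(args) > 0 else '{}'
        return fn(json.loads(req).get('args', {}), **kwds)
    return wrapper

@endpoint
def viewed(args, num_movies=10):
    # Assumption: the user id arrives under 'param' as a list of strings, e.g. ['1'].
    user_id = int(list(filter(str.isdigit, args.get('param', [])))[0])
    rated = (
        toy_ratings[toy_ratings.userId == user_id]
        .merge(toy_movies, how='left', on='movieId')
        .sort_values('rating', ascending=False)
    )
    return rated.title.head(num_movies).to_string(index=False)

output = viewed(json.dumps({'args': {'param': ['1']}}))
print(output)
```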

/recompute_svd

As we saw before, we are using the SVD to factorize the matrix. We will use this function to create an endpoint that recomputes the factorization using an arbitrary value of k passed as a parameter. With this function we will be able to tune the performance of the model on demand.

👉 For this function we will be using both decorators: @endpoint to process the request and @measure to calculate the execution time depending on the value of k.

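@measure
@endpoint
def recompute_svd(args):
    global preds_df
    k_txt = args.get('param1', args.get('50', None))
    k = int(list(filter(str.isdigit, k_txt))[0])
    # Factorize the mean-centred ratings again with the requested k
    U, sigma, Vt = svds(R - user_ratings_mean.reshape(-1, 1), k=k)
    sigma = np.diag(sigma)
    all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    # Refresh the predictions used by /recommended
    preds_df = pd.DataFrame(all_user_predicted_ratings, columns=df_movie_features.columns)
    return 'SVD recomputed with k={}'.format(k)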

/metrics

This final endpoint will return the latest execution times of the functions decorated with @measure (just recompute_svd in our case). The function is named get_metrics so that it does not shadow the metrics dictionary defined with the helper functions.

def get_metrics():
    # Time units in milliseconds
    total = metrics['total']
    requests = metrics['requests']
    average_time = int(total / requests) if requests > 0 else 0
    data = {
        'requests': requests,
        'average': average_time,
        'total_time': total,
        'last_calls': metrics['recent']
    }
    return json.dumps(data, indent=2)

As we have already seen in previous tutorials, the last step is to instrument the notebook and specify which functions titan shall expose. They correspond to the four endpoints detailed in the previous section.

Note that each endpoint must be placed in a different Notebook cell:

/metrics endpoint

# GET /metrics
print(get_metrics())

/viewed endpoint

# POST /viewed
print(viewed(REQUEST))

/recommended endpoint

# POST /recommended
print(recommended(REQUEST))

/recompute_svd endpoint

# POST /recompute_svd
print(recompute_svd(REQUEST))

Finally, once the model is ready, we just need to run the titan deploy command from the CLI:

$ titan deploy

Now that the model has been successfully deployed, we can start using it through the defined endpoints.

Defined endpoints for our service

Here’s an example:

  1. We retrieve the top rated movies of a user (e.g. userId: 500) using the /viewed endpoint:

Wizard of Oz  The (1939)
Muppet Movie The (1979)
Shawshank Redemption The (1994)
It's a Wonderful Life (1946)
Matrix The (1999)
South Park: Bigger Longer and Uncut (1999)
Erin Brockovich (2000)
Singin' in the Rain (1952)
Creature Comforts (1990)
Christmas Story A (1983)

  2. We check the recommended movies for the same user using the /recommended endpoint and we get the following:

Toy Story (1995)
Silence of the Lambs The (1991)
Sleeping Beauty (1959)
West Side Story (1961)
Toy Story 2 (1999)
Lady and the Tramp (1955)
Best in Show (2000)
Star Wars: Episode IV - A New Hope (1977)
Bug's Life A (1998)
Saving Private Ryan (1998)

Imagine now that we want to recompute the SVD to check whether the model can yield better recommendations using a higher value for k, let’s say k=100 (the default value is k=50).

In order to do that, we can use the /recompute_svd endpoint, passing k=100 as a parameter in our POST call.

After making the call, and since we are using the @measure decorator in this function, we can check the execution time in milliseconds using the /metrics endpoint. As pointed out, the value of k will drastically change the execution time of the matrix factorization.

Making some tests for different values of k provides the following execution times:

+---------+----------------+
| k value | Exec time (ms) |
+---------+----------------+
|      10 |             41 |
|      50 |            379 |
|     100 |            830 |
|     500 |           3836 |
|    1000 |           5084 |
+---------+----------------+
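A similar measurement can be reproduced locally, without deploying anything. The sketch below times `svds` for a few values of k on a random matrix. The matrix size and the k values are arbitrary assumptions, so the absolute numbers will not match the table above:

```python
import time
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical dense matrix standing in for R; the real ratings matrix is much larger.
rng = np.random.default_rng(0)
R_random = rng.random((300, 400))

timings_ms = {}
for k in (10, 50, 100):
    start = time.perf_counter()
    svds(R_random, k=k)
    timings_ms[k] = (time.perf_counter() - start) * 1000

for k, ms in timings_ms.items():
    print(f"k={k:>4}: {ms:.1f} ms")
```

Repeating each measurement several times and averaging would give more stable figures, since a single run is sensitive to whatever else the machine is doing.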

Imagine now that, for any reason, the refactorization needs to be faster. With titan, we would just need to provision additional hardware resources in the YAML specification as follows:

```yaml
titan: v1
service:
  image: scipy
  machine:
    cpu: 4
    memory: 2096MB
```

If we run our tests again using the new hardware, the results are as follows:

+---------+----------------+
| k value | Exec time (ms) |
+---------+----------------+
|      10 |             39 |
|      50 |            322 |
|     100 |            656 |
|     500 |           3320 |
|    1000 |           4367 |
+---------+----------------+

You can check the whole code of the model here or by cloning this GitHub repository:

Wrap-up

In this post we have built a basic recommender system based on matrix factorization from scratch. In addition, we have seen how to use helper functions to improve the quality of our code and to provide interesting capabilities such as offering basic performance metrics through an endpoint.

Finally, we have seen how to improve execution times by provisioning more hardware resources for computationally expensive tasks.

Thanks for reading this far, we really hope you enjoyed the tutorial!

Next Tutorial

In the next tutorial, we will see how to make a first approach to MLOps using titan. I’m sure you will find it interesting!

Foreword

Titan can help you to radically reduce and simplify the effort required to put AI/ML models into production, enabling Data Science teams to be agile, more productive and closer to the business impact of their developments.

If you want to know more about how to start using Titan or getting a free demo, please visit our website or drop us a line at info@akoios.com.

If you prefer, you can schedule a meeting with us here.

Akoios: Frictionless solutions for modern data science.
