Building a movie recommender system

Titan Tutorial #8: Building and deploying a collaborative-filtering recommender service from scratch

Introduction

Collaborative vs. Content-Based Filtering
Intuition of Matrix Factorization

Building the model

```yaml
titan: v1
service:
  image: scipy
  machine:
    cpu: 2
    memory: 1024MB
```
import os
import time
import json
import functools
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds
df_movies = pd.read_csv(
    "https://raw.githubusercontent.com/jfuentesibanez/datasets/master/movies.csv",
    usecols=['movieId', 'title', 'genre'],
    sep=';',
    dtype={'movieId': 'int32', 'title': 'str', 'genre': 'str'})
df_ratings = pd.read_csv(
    "https://raw.githubusercontent.com/jfuentesibanez/datasets/master/ratings.csv",
    usecols=['userId', 'movieId', 'rating'],
    sep=';',
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})
df_movie_features = df_ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
R = df_movie_features.to_numpy()
user_ratings_mean = np.mean(R, axis=1)
(a) Full reconstruction (b) Reconstruction using only the k largest singular values
U, sigma, Vt = svds(R, k = 50)
sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = df_movie_features.columns)
Our results matrix with the estimated ratings
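
To make the effect of k more concrete, the following minimal sketch rebuilds a small, made-up ratings matrix from its k largest singular values and prints the reconstruction error for each k. It uses NumPy's dense `np.linalg.svd` instead of `svds` so that every rank up to the full one can be tried. The toy matrix `R_toy` and the helper `truncated_reconstruction` are illustrative assumptions, not part of the tutorial's data or code.

```python
import numpy as np

# Toy user x movie ratings matrix (made-up values, 0 = not rated)
R_toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
    [5.0, 5.0, 0.0, 0.0],
])

def truncated_reconstruction(matrix, k):
    """Rebuild `matrix` from its k largest singular values (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # np.linalg.svd returns the singular values in descending order
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

errors = {}
for k in (1, 2, 3, 4):
    approx = truncated_reconstruction(R_toy, k)
    # The Frobenius norm of the difference measures the reconstruction error
    errors[k] = float(np.linalg.norm(R_toy - approx))
    print(f"k={k}: reconstruction error = {errors[k]:.4f}")
```

The error shrinks as k grows and vanishes (up to floating-point precision) once k reaches the full rank; a smaller k trades accuracy for a more compact model.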

Preparing the model for its deployment

def recommend_movies(preds_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    # Row position of the user in preds_df (assumes user IDs start at 1 and are contiguous)
    user_row_number = userID - 1
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)

    # Movies the user has already rated, best rated first
    user_data = original_ratings_df[original_ratings_df.userId == userID]
    user_full = (user_data.merge(movies_df, how='left', left_on='movieId', right_on='movieId')
                 .sort_values(['rating'], ascending=False))

    # Top predictions among the movies the user has not rated yet
    predictions = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])]
                   .merge(pd.DataFrame(sorted_user_predictions).reset_index(),
                          how='left', left_on='movieId', right_on='movieId')
                   .rename(columns={user_row_number: 'Predictions'})
                   .sort_values('Predictions', ascending=False)
                   .iloc[:num_recommendations, :-1])
    return user_full, predictions
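
`recommend_movies` locates a user through `userID - 1`, which only holds when user IDs start at 1 and have no gaps. As a hedged alternative, the sketch below keeps the user IDs as the index of the predictions frame and looks rows up by label. The toy data and the names `ratings_toy`, `features_toy` and `preds_toy` are made up for illustration, and the "predictions" are simply the pivoted ratings standing in for the SVD output.

```python
import pandas as pd

# Made-up ratings with non-contiguous user IDs (illustrative data only)
ratings_toy = pd.DataFrame({
    'userId':  [3, 3, 7, 7, 42],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 5.0, 3.0, 2.0, 1.0],
})

features_toy = ratings_toy.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Stand-in for the predicted ratings: here simply the pivoted values themselves.
# Passing index= keeps the user IDs as row labels instead of a 0..n-1 range.
preds_toy = pd.DataFrame(features_toy.to_numpy(),
                         index=features_toy.index,
                         columns=features_toy.columns)

# Label-based lookup: works even though there is no user 1 or 2
user_42 = preds_toy.loc[42].sort_values(ascending=False)
print(user_42)
```

With this layout the function could use `preds_df.loc[userID]` directly, at the cost of also changing the column name passed to `rename`.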
# Store total elapsed time, total requests and the most recent processing times, in milliseconds
metrics_store = {'total': 0, 'requests': 0, 'recent': []}
# Maximum number of processing times to keep, ordered from most recent to oldest
max_recent_items = 20

def store_metrics(start):
    elapsed = int((time.time() - start) * 1000)
    metrics_store['requests'] += 1
    metrics_store['total'] += elapsed
    # Newest measurement goes first; drop anything beyond the limit
    metrics_store['recent'].insert(0, elapsed)
    del metrics_store['recent'][max_recent_items:]

def measure(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        start = time.time()
        try:
            return fn(*args, **kwds)
        finally:
            store_metrics(start)
    return wrapper

def endpoint(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        # The first positional argument is the JSON-encoded request
        req = args[0] if len(args) > 0 else '{}'
        request = json.loads(req)
        args = request.get('args', {})
        return fn(args, **kwds)
    return wrapper

The endpoints to operate the model

@endpoint
def recommended(args):
    user_id_txt = args.get('param', args.get('001', None))
    # filter() keeps the entries made only of digits; the first one is used as the ID
    user_id = int(list(filter(str.isdigit, user_id_txt))[0])
    already_rated, predictions = recommend_movies(preds_df, user_id, df_movies, df_ratings, 10)
    return predictions.title.to_string(index=False)

@measure
@endpoint
def recompute_svd(args):
    global preds_df
    k_txt = args.get('param1', args.get('50', None))
    k = int(list(filter(str.isdigit, k_txt))[0])
    U, sigma, Vt = svds(R, k=k)
    sigma = np.diag(sigma)
    all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    # Refresh the predictions served by recommended()
    preds_df = pd.DataFrame(all_user_predicted_ratings, columns=df_movie_features.columns)
    return json.dumps({'k': k})

# The dictionary above is named metrics_store so that this endpoint can keep the name metrics
def metrics():
    # Time units in milliseconds
    total = metrics_store['total']
    requests = metrics_store['requests']
    average_time = int(total / requests) if requests > 0 else 0
    data = {
        'requests': requests,
        'average': average_time,
        'total_time': total,
        'last_calls': metrics_store['recent']
    }
    return json.dumps(data, indent=2)

Instrumentalizing and deploying the model

# GET /metrics
print(metrics())
# POST /viewed
print(viewed(REQUEST))
# POST /recommended
print(recommended(REQUEST))
# POST /recompute_svd
print(recompute_svd(REQUEST))
$ titan deploy
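
Outside Titan the `REQUEST` variable is not defined, so the endpoint calls above cannot be run locally as they stand. For a local smoke test, one option is to build the payload by hand. The sketch below assumes, based only on how the `endpoint` decorator unpacks its input, that a request is a JSON string containing an `args` object. The exact payload Titan sends, the list-valued `param`, the default value and the `echo_user` function are all assumptions made for illustration.

```python
import functools
import json

def endpoint(fn):
    """Same unpacking logic as the tutorial's decorator: JSON string in, `args` dict out."""
    @functools.wraps(fn)
    def wrapper(*args, **kwds):
        req = args[0] if len(args) > 0 else '{}'
        request = json.loads(req)
        return fn(request.get('args', {}), **kwds)
    return wrapper

@endpoint
def echo_user(args):
    # Hypothetical endpoint: parses the user ID with the same filter() idiom as `recommended`
    user_id_txt = args.get('param', ['1'])  # the default value is an assumption
    return int(list(filter(str.isdigit, user_id_txt))[0])

# Hand-built stand-in for the REQUEST value (assumed shape, not Titan's documented format)
REQUEST = json.dumps({'args': {'param': ['15']}})

print(echo_user(REQUEST))  # user ID parsed from the simulated request
print(echo_user())         # falls back to the default when no request is given
```

The same hand-built `REQUEST` string can be passed to the real endpoints once the model cells have been executed.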

Using the service and checking its performance

Defined endpoints for our service
Wizard of Oz  The (1939)
Muppet Movie The (1979)
Shawshank Redemption The (1994)
It's a Wonderful Life (1946)
Matrix The (1999)
South Park: Bigger Longer and Uncut (1999)
Erin Brockovich (2000)
Singin' in the Rain (1952)
Creature Comforts (1990)
Christmas Story A (1983)
Toy Story (1995)
Silence of the Lambs The (1991)
Sleeping Beauty (1959)
West Side Story (1961)
Toy Story 2 (1999)
Lady and the Tramp (1955)
Best in Show (2000)
Star Wars: Episode IV - A New Hope (1977)
Bug's Life A (1998)
Saving Private Ryan (1998)
+---------+----------------+
| k value | Exec time (ms) |
+---------+----------------+
|      10 |             41 |
|      50 |            379 |
|     100 |            830 |
|     500 |           3836 |
|    1000 |           5084 |
+---------+----------------+
```yaml
titan: v1
service:
  image: scipy
  machine:
    cpu: 4
    memory: 2096MB
```
+---------+----------------+
| k value | Exec time (ms) |
+---------+----------------+
|      10 |             39 |
|      50 |            322 |
|     100 |            656 |
|     500 |           3320 |
|    1000 |           4367 |
+---------+----------------+
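
The two sets of timings can be compared directly. As a quick check of how much the larger machine helps, the snippet below computes the speedup for each k from the reported figures. It assumes the first table was measured on the 2-CPU configuration and the second on the 4-CPU one; the variable names are new.

```python
# Execution times in milliseconds, copied from the two timing tables
times_2cpu = {10: 41, 50: 379, 100: 830, 500: 3836, 1000: 5084}
times_4cpu = {10: 39, 50: 322, 100: 656, 500: 3320, 1000: 4367}

# Speedup = time on the smaller machine / time on the larger machine
speedups = {k: times_2cpu[k] / times_4cpu[k] for k in times_2cpu}

for k, s in speedups.items():
    saved = times_2cpu[k] - times_4cpu[k]
    print(f"k={k:>4}: {s:.2f}x faster ({saved} ms saved)")
```

For these measurements, doubling the CPU count gives speedups between roughly 1.05x and 1.27x, well short of 2x.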

Wrap-up

Next Tutorial

Foreword

Akoios: Frictionless solutions for modern data science.
