Introducing Titan Jobs

Akoios
6 min read · Jun 2, 2020


Titan Tutorial #11: Building a “batch-mode” churn prediction model

Introduction

Historically, batch processing has referred to running computational tasks (arbitrary code execution) on demand or on a schedule defined by the user, with minimal or no human interaction at all.

Batch processing @ 1950s

In our Data Science world, not all Machine Learning models are meant to be consumed in real time through an API interface, as we have seen in previous tutorials. There are many use cases where we just need to run our predictions synchronously or asynchronously (e.g. on a monthly basis) to obtain the desired results.

In this tutorial, we will see how our new feature, Titan Jobs, can help Data Scientists in a wide variety of situations, from the consumption of a batch model to the construction of powerful CI/CD pipelines for building full-fledged and robust AI/ML systems in our companies.

As we will also see, the ability to run arbitrary pieces of code is not only useful for running these models, but also for executing repetitive and/or computationally costly tasks during the development phase of a model (e.g. model training or hyperparameter optimization).
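
For instance, here is a minimal sketch of a hyperparameter optimization script that could be launched as a Job. The file name hyperparam-search.py, the bundled scikit-learn dataset and the parameter grid are illustrative assumptions, not part of this tutorial's model:

# hyperparam-search.py: a sketch of a tuning task to run as a Titan Job
# (assumes scikit-learn is available in the Job environment)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate hyperparameters for the Decision Tree
param_grid = {"max_depth": [3, 5, 7, 10],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X, y)

# Printing to stdout makes the results visible in the Titan Dashboard
print("Best parameters:", search.best_params_)
print("Best CV score:", round(search.best_score_, 3))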

Building the model

In this tutorial we will build a churn prediction model for a telco using this well-known public dataset. The aim of the model is to predict in advance which of the current customers are more likely to discontinue their subscriptions.

To this end, and to illustrate how Titan Jobs work, we will develop a model based on a Decision Tree algorithm, following the approach of this Kaggle kernel.

The first thing to note about Titan is that we do not require Jupyter Notebooks to create or run Jobs. Titan Jobs can run any Python script, as we will see later in the tutorial.

As usual, we start with our imports for the model:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

We proceed by getting the dataset we will work with:

df = pd.read_csv("https://storage.googleapis.com/tutorial-datasets/telco.csv")

Once we have the data, it’s time to perform some data processing and transformation before training.

We start with some basic actions, like lowercasing the values of all text columns and converting some relevant yes/no features to a binary format.

# (1) Lowercase transformation
for item in df.columns:
    try:
        df[item] = df[item].str.lower()
    except:
        print(item, "Unable to convert")

# (2) Binary conversion of relevant features so we can use them for the classification
columns_to_convert = ['Partner',
                      'Dependents',
                      'PhoneService',
                      'PaperlessBilling',
                      'Churn']

for item in columns_to_convert:
    df[item].replace(to_replace='yes', value=1, inplace=True)
    df[item].replace(to_replace='no', value=0, inplace=True)

Next, we clean the existing null data points and balance the labels so that we have the same number of churners and non-churners.

# (4) Check for null data points
df.isnull().sum(axis=0)
df = df.fillna(value=0)

# (5) Balance the labels
churners_number = len(df[df['Churn'] == 1])
print("Number of churners", churners_number)

churners = df[df['Churn'] == 1]
non_churners = df[df['Churn'] == 0].sample(n=churners_number)
df2 = pd.concat([churners, non_churners])

try:
    customer_id = df2['customerID'] # Store this as customer_id variable
    del df2['customerID'] # Not needed
except:
    print("already removed customerID")

Finally, we perform one-hot encoding of the categorical variables and remove the label column, which must not be part of the training features.

# (6) One-hot encoding
ml_dummies = pd.get_dummies(df2)
ml_dummies.fillna(value=0, inplace=True)
df2.head()

# (7) Remove labels
try:
    label = ml_dummies['Churn'] # We remove labels before training
    del ml_dummies['Churn']
except:
    print("label already removed.")

Once the data is prepared, we can split the dataset as usual:

feature_train, feature_test, label_train, label_test = train_test_split(ml_dummies, label, test_size=0.3)

After the split, it is time to train the model with a 5-level Decision Tree:

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(feature_train, label_train)
pred = clf.predict(feature_test)
score = clf.score(feature_test, label_test)
print(round(score, 3), "\n", "- - - - - ", "\n")

Then, we use a simple function to preprocess the original dataframe so we can build the resulting table:

# Preprocessing original dataframe
def preprocess_df(dataframe):
    x = dataframe.copy()
    try:
        customer_id = x['customerID']
        del x['customerID'] # Don't need in ML DF
    except:
        print("customerID already removed")
    ml_dummies = pd.get_dummies(x)
    ml_dummies.fillna(value=0, inplace=True)
    try:
        label = ml_dummies['Churn']
        del ml_dummies['Churn']
    except:
        print("label already removed.")
    return ml_dummies, customer_id, label

original_df = preprocess_df(df)

Finally, we build an output dataframe, keep only the customers still at risk of churning (i.e. those who have not churned yet), sort them by their predicted churn probability and print the result:

# Prepare output
output_df = original_df[0].copy()
output_df['prediction'] = clf.predict_proba(output_df)[:,1]
output_df['churn'] = original_df[2]
output_df['customerID'] = original_df[1]
activate = output_df[output_df['churn'] == 0] # Keep customers who have not churned yet
output = activate[['customerID','churn','prediction']]
output = output.sort_values(by=['prediction'], ascending=False)
# Show output
print(output.to_string())
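
Since a batch Job runs unattended, we may also want to persist the results instead of only printing them. A minimal sketch, where results.csv is just an illustrative destination:

# Persist the results so they can be collected once the Job finishes
# (results.csv is an illustrative name; any reachable path works)
output.to_csv("results.csv", index=False)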

As usual, you can get the whole code here.

Creating and running the Job

Now that the model is ready, let’s see how we could create our first Titan Job.

Assuming our Python script is named churn-tutorial.py, creating a Titan Job is as easy as running this command from the CLI:

$ titan jobs run churn-tutorial.py

With that simple command, Titan will automagically create and run the task on the cloud where Titan is deployed. As easy as that!

Once it has been deployed, its status can be checked from the dashboard:

Job detail view from Titan Dashboard

Apart from checking the status, and since we printed the output to stdout, we can also see the results of the script straight from the dashboard:

Script Output

As expected, the script shows the user IDs and their estimated churn prediction.

Environment variables

It is possible to easily pass environment variables to the Job using the following format:

$ titan jobs run --env FOO=bar churn-tutorial.py
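
Inside the script, the variable can then be read with Python’s standard os module (FOO being the illustrative variable name used above):

import os

# Read the variable passed with --env, with a fallback default
foo = os.environ.get("FOO", "bar")
print("FOO =", foo)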

Provisioning hardware

As in the case of Titan Services, it is also possible to define and provision the hardware on which our Job will be executed.

To this end, you just need to provide the specification in a YAML file with the same format we saw for Titan Services.
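
As a purely hypothetical sketch (the actual keys follow the Titan Services specification, so please refer to that tutorial for the authoritative format), such a file could look like this:

# job.yml: hypothetical example, not the authoritative schema
hardware:
  cpu: 2       # number of vCPUs (illustrative)
  memory: 4Gi  # amount of RAM (illustrative)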

When to use Jobs

As already mentioned, since Jobs allow running arbitrary code on Titan’s infrastructure, they are a powerful tool for Data Scientists during the development phase of a model.

Using Jobs, it is possible to run computationally expensive workloads (e.g. training) with a very simple command, which allows Data teams to reduce development and delivery time.

Apart from their use in the training phase, Jobs are especially suitable for scheduling the execution of a model on a regular basis (daily, weekly, monthly…). This scheduling can be done using tools like cron, as in the following example, which launches the Job every Sunday at midnight:

0 0 * * 0 titan jobs run path/churn-tutorial.py

Likewise, this recurrent execution can also be set up from a CI tool like GitLab CI. It would be as simple as:

Step 1) Creating a CI job as we showed in a past tutorial:

sample-job:
  image: python:3.8
  script:
    # Install Titan CLI
    - curl -sf https://install.akoios.com/beta | sh
    # Deploy and run the Job
    - titan jobs run churn-tutorial.py

Step 2) Creating a CI Schedule:

Creating a schedule for a Gitlab CI Job

Finally, and taking into account the easy integration of Titan Jobs in CI/CD pipelines, they can also be used to build more complex and powerful CI/CD tasks. Using Jobs, it is possible to automate certain tasks (data retrieval, re-training…) and include them in any pipeline.

Wrap-up

In this tutorial we introduced Titan Jobs and saw how easy it is to create and run them.

Titan Jobs allow the execution, scheduled or on demand, of arbitrary scripts in the Cloud. This feature is useful for several purposes:

  • Help Data Teams in the development phase of their models by allowing them to run heavy tasks in the Cloud.
  • Create and schedule recurrent tasks to operate a model in batch-mode.
  • Enrich CI/CD pipelines by combining Titan Jobs, Titan Services and other tasks and integrations with external tools.

Thanks for reading!

Afterword

Titan can help you to radically reduce and simplify the effort required to put AI/ML models into production, enabling Data Science teams to be agile, more productive and closer to the business impact of their developments.

If you want to know more about how to start using Titan or getting a free demo, please visit our website or drop us a line at info@akoios.com.

If you prefer, you can schedule a meeting with us here.
