Combining Jobs, Services and CI

Titan Tutorial #12: Building a production-ready ML pipeline to predict hotel cancellations

As we have seen in previous tutorials, Titan offers building blocks (Services and Jobs) to allow Data Science Teams to build their own pipelines and solutions in a simple yet powerful manner.

In this tutorial, we will see how to build a complete, real-world ML pipeline to predict hotel cancellations based on historical data.

This tutorial will help us illustrate how to combine the different capabilities of Titan in order to deploy and maintain a prediction service.

NOTE: To run this tutorial, a Google Cloud account is needed.

The following figure depicts the structure of the pipeline:

Proposed pipeline
  • Step 1: We will have our data stored as a table in Google BigQuery which will serve as our Data Warehouse in this example.
  • Step 2: We will create a Titan Job to execute an ETL (Extract, Transform and Load) process to regularly prepare the data for its further use for prediction purposes.
  • Step 3: Using another Titan Job, we will train several prediction models (Logistic Regression, Gradient Boosting and Random Forest) and calculate their main performance metrics.
  • Step 4: Depending on their performance metrics, the trained models will (or will not) be automatically deployed or redeployed as API services using Titan Services.

Let’s go into detail with each of the steps.

As mentioned, we will be using Google BigQuery to store the data. First of all, we need to upload the full dataset to BigQuery, as explained here.

Once the data is uploaded, we can run some SQL queries to confirm that it has been loaded correctly. For example, this query returns the 10,000 most recent rows, ordered by ReservationStatusDate:

SELECT 
*
FROM
`datasets.hotel_reservations`
ORDER BY
ReservationStatusDate DESC
LIMIT 10000;

Once the data is uploaded to BigQuery, we can now create a Titan Job to process the data and prepare it to be used by our ML prediction model.

The aim of this Job is:

  1. Access Google BigQuery.
  2. Retrieve the desired information (number of rows and selected columns) and save it in CSV format.
  3. Upload the CSV file to Google Storage for later use. NOTE: You will need access to a Google Storage bucket to save the data, as shown in the code.

The code of this Titan Job is quite simple and is shown below:

The query_string in the code shows which features we will be using for our model:

  • Country
  • MarketSegment
  • ArrivalDateMonth
  • DepositType
  • CustomerType
  • LeadTime
  • ArrivalDateYear
  • ArrivalDateWeekNumber
  • ArrivalDateDayOfMonth
  • RequiredCarParkingSpaces
  • IsCanceled
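As a rough sketch of what such a Job could look like, the snippet below builds the query from the feature list above, saves the result as a CSV file and uploads it to Google Storage. The function names, the default file name and the bucket argument are illustrative assumptions, not the code of the original Job. It also assumes that the google-cloud-bigquery and google-cloud-storage client libraries (plus pandas) are installed and that Google Cloud credentials are available in the environment.

```python
# Hypothetical ETL sketch: names and defaults are assumptions for illustration.
FEATURES = [
    "Country", "MarketSegment", "ArrivalDateMonth", "DepositType",
    "CustomerType", "LeadTime", "ArrivalDateYear", "ArrivalDateWeekNumber",
    "ArrivalDateDayOfMonth", "RequiredCarParkingSpaces", "IsCanceled",
]


def build_query(columns, table="datasets.hotel_reservations", limit=10000):
    """Build the SQL string that selects the chosen columns and row count."""
    return f"SELECT {', '.join(columns)} FROM `{table}` LIMIT {limit}"


def run_etl(bucket_name, blob_name="hotel_reservations.csv",
            local_path="hotel_reservations.csv"):
    """Query BigQuery, write a CSV file and upload it to a Storage bucket."""
    # Imported lazily so the query builder can be used without these libraries.
    from google.cloud import bigquery, storage

    # 1. Access BigQuery and 2. retrieve the selected rows and columns.
    dataframe = bigquery.Client().query(build_query(FEATURES)).to_dataframe()
    dataframe.to_csv(local_path, index=False)

    # 3. Upload the CSV file to Google Storage for later use.
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
```

Keeping the query construction in its own small function makes it easy to check the selected columns without touching Google Cloud at all.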

In this step of the pipeline, we will train the different prediction models that we will later turn into API services. These are the prediction models we are going to use:

  • Logistic Regression
  • Gradient Boosting
  • Random Forest

The current Job will perform the following tasks:

  1. Read the .csv file with the data
  2. Identify and convert the categorical variables
  3. Define the predicted variable (IsCanceled) and the predictors (the rest of the variables)
  4. Split the dataset
  5. Train the 3 different models
  6. Calculate the accuracy score for each of the models
  7. Save the trained models in Google Storage for later use

The code of this Job is shown below:
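A minimal sketch of these tasks, assuming scikit-learn is installed, could look like the following. The function names, the integer encoding of the categorical columns, the 80/20 split and the hyperparameters are assumptions made for illustration; the models are saved locally with pickle, and the upload to Google Storage is left out.

```python
import csv
import pickle

TARGET = "IsCanceled"
# Assumed categorical columns among the selected features.
CATEGORICAL = ["Country", "MarketSegment", "ArrivalDateMonth",
               "DepositType", "CustomerType"]


def read_rows(path):
    """Read the CSV file into a list of dictionaries (all values are strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def encode_categoricals(rows, columns):
    """Return copies of the rows with each categorical value replaced by an integer code."""
    rows = [dict(row) for row in rows]
    for column in columns:
        codes = {value: i for i, value in enumerate(sorted({r[column] for r in rows}))}
        for row in rows:
            row[column] = codes[row[column]]
    return rows


def train_and_score(rows, test_size=0.2, seed=42):
    """Train the three models, save them to disk and return their accuracy scores."""
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    predictors = [column for column in rows[0] if column != TARGET]
    X = [[float(row[column]) for column in predictors] for row in rows]
    y = [int(row[TARGET]) for row in rows]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "gradient_boosting": GradientBoostingClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
        with open(f"{name}.pkl", "wb") as handle:
            pickle.dump(model, handle)
    return scores
```

A typical call would be train_and_score(encode_categoricals(read_rows("hotel_reservations.csv"), CATEGORICAL)), which returns a dictionary mapping each model name to its accuracy on the held-out split.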

In this last step, we will use Titan Services to deploy the different prediction models that have been previously trained.

The Jupyter Notebook for this step does the following:

  1. Load the trained models
  2. Define a separate endpoint for each prediction model (Logistic Regression, Random Forest & Gradient Boosting)
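The logic behind those two tasks can be sketched in plain Python as shown below. How Titan Services maps notebook cells to HTTP endpoints is not covered by this sketch; the function names, the request format (a JSON body with a "features" list) and the response field are hypothetical.

```python
import json
import pickle


def load_models(paths):
    """Load each pickled model from disk, keyed by model name."""
    models = {}
    for name, path in paths.items():
        with open(path, "rb") as handle:
            models[name] = pickle.load(handle)
    return models


def predict_cancellation(model, request_body):
    """Handle one prediction request and return a JSON response string.

    Assumed request format: {"features": [<encoded feature values>]}.
    """
    features = json.loads(request_body)["features"]
    prediction = model.predict([features])[0]
    return json.dumps({"is_canceled": int(prediction)})
```

Each endpoint would then call predict_cancellation with its own model, for example the Logistic Regression endpoint with models["logistic_regression"], so the three services share the same request handling.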

Now that the components of the pipeline are ready, we can put them all together in the CI/CD platform of our choice.

For this example, we will be using GitLab CI:

The CI Pipeline

On a regular, scheduled basis (daily, weekly or monthly), this pipeline will bring together the aforementioned Titan Jobs and Services.

One interesting feature of this pipeline is that, thanks to the Evaluation step, models are deployed only if their accuracy is above a predetermined threshold. This way, we avoid automatically deploying poorly performing models.
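The check performed by the Evaluation step can be sketched as a small function that compares each accuracy score against the threshold. The function name, the example scores and the 0.8 threshold are hypothetical values chosen for illustration.

```python
def select_models_to_deploy(scores, threshold):
    """Return the names of the models whose accuracy is strictly above the threshold."""
    return sorted(name for name, accuracy in scores.items() if accuracy > threshold)


# Example with made-up accuracy scores and a made-up threshold.
example_scores = {
    "logistic_regression": 0.75,
    "gradient_boosting": 0.85,
    "random_forest": 0.9,
}
print(select_models_to_deploy(example_scores, threshold=0.8))
# prints ['gradient_boosting', 'random_forest']
```

In a CI pipeline, the deployment stage would only run for the names returned by this function, and an empty list would mean that nothing is redeployed in that run.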

The structure of the pipeline is depicted in its YAML specification:

Note: In order to create and run the pipeline, make sure to enter the required credentials (Google Cloud and Titan) in the CI/CD settings of the project:

Google Cloud and Titan credentials

You can find all the code in this GitHub repository.

Wrap-up

In this tutorial, we have combined many of the features we saw in previous tutorials in order to build a more complex, production-ready ML pipeline.

Combining Titan building blocks (Titan Services and Titan Jobs) with any sort of data source makes it really easy to create and maintain robust data-based services for all types of projects.

Thanks for reading this far!

Foreword

Titan can help you to radically reduce and simplify the effort required to put AI/ML models into production, enabling Data Science teams to be agile, more productive and closer to the business impact of their developments.

If you want to know more about how to start using Titan or to get a free demo, please visit our website or drop us a line at info@akoios.com.

If you prefer, you can schedule a meeting with us here.

Akoios: Frictionless solutions for modern data science.
