Titan Tutorial #12: Building a production-ready ML pipeline to predict hotel cancellations
Introduction
As we have seen in previous tutorials, Titan offers building blocks (Services and Jobs) to allow Data Science Teams to build their own pipelines and solutions in a simple yet powerful manner.
In this tutorial, we will see how to build a complete, real-world ML pipeline to predict hotel cancellations based on historical data.
This tutorial will help us illustrate how to combine the different capabilities of Titan in order to deploy and maintain a prediction service.
NOTE: To run this tutorial, a Google Cloud account is needed.
The following figure depicts the structure of the pipeline:
- Step 1: We will have our data stored as a table in Google BigQuery which will serve as our Data Warehouse in this example.
- Step 2: We will create a Titan Job to execute an ETL (Extract, Transform and Load) process that regularly prepares the data for its later use for prediction purposes.
- Step 3: Using another Titan Job, we will train several prediction models (Logistic Regression, Gradient Boosting and Random Forest) and calculate their main performance metrics.
- Step 4: Depending on the performance metrics of the previously trained models, they will (or will not) be automatically (re)deployed as API services using Titan Services.
Let’s go into detail with each of the steps.
Step 1: Setting up our Data Warehouse
As mentioned, we will be using Google BigQuery to store the data. First of all, you need to upload the full dataset to BigQuery as explained here.
Once uploaded to BigQuery, you can run some SQL queries to confirm that the data has been correctly loaded. For example, this query will return the last 10,000 rows:
SELECT
*
FROM
`datasets.hotel_reservations`
ORDER BY
ReservationStatusDate DESC
LIMIT 10000;
Step 2: Processing the data
Once the data is uploaded to BigQuery, we can now create a Titan Job to process the data and prepare it to be used by our ML prediction model.
The aim of this Job is:
- Access Google BigQuery.
- Retrieve the desired information (number of rows and selected columns) and save it in CSV format.
- Upload the CSV file to Google Storage for later use. NOTE: You will need access to a Google Storage bucket to save the data, as shown in the code.
The code of this Titan Job is quite simple and is shown below:
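The exact code is available in the repository linked at the end of this tutorial; the following is only a minimal sketch of such a Job, assuming the google-cloud-bigquery and google-cloud-storage client libraries and illustrative names (the my-project project, the titan-tutorial-data bucket and the hotel_data.csv file) that you should replace with your own:

# etl_job.py -- minimal sketch of the ETL Titan Job (illustrative names, adjust to your project)
from google.cloud import bigquery, storage

PROJECT_ID = "my-project"            # assumption: replace with your Google Cloud project
BUCKET_NAME = "titan-tutorial-data"  # assumption: replace with your Google Storage bucket

# Select only the columns we will use as features, plus the target variable (IsCanceled)
query_string = """
    SELECT Country, MarketSegment, ArrivalDateMonth, DepositType, CustomerType,
           LeadTime, ArrivalDateYear, ArrivalDateWeekNumber, ArrivalDateDayOfMonth,
           RequiredCarParkingSpaces, IsCanceled
    FROM `datasets.hotel_reservations`
    ORDER BY ReservationStatusDate DESC
    LIMIT 100000
"""

def main():
    # Retrieve the desired rows and columns from BigQuery as a DataFrame
    bq_client = bigquery.Client(project=PROJECT_ID)
    df = bq_client.query(query_string).to_dataframe()

    # Save the result locally in CSV format
    df.to_csv("hotel_data.csv", index=False)

    # Upload the CSV file to Google Storage for its later use by the training Job
    gcs_client = storage.Client(project=PROJECT_ID)
    gcs_client.bucket(BUCKET_NAME).blob("hotel_data.csv").upload_from_filename("hotel_data.csv")

if __name__ == "__main__":
    main()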
The query_string in the code shows which features we will be using for our model:
- Country
- MarketSegment
- ArrivalDateMonth
- DepositType
- CustomerType
- LeadTime
- ArrivalDateYear
- ArrivalDateWeekNumber
- ArrivalDateDayOfMonth
- RequiredCarParkingSpaces
- IsCanceled
Step 3: Training the models
In this step of the pipeline, we will train the different prediction models that we will later turn into API services. These are the prediction models we are going to use:
- Logistic Regression
- Gradient Boosting
- Random Forest
The current Job will perform the following tasks:
- Read the .csv file with the data
- Identify and convert the categorical variables
- Define the predicted variable (IsCanceled) and the predictors (the rest of the variables)
- Split the dataset
- Train the 3 different models
- Calculate the accuracy score for each of the models
- Save the trained models to Google Storage for later use
The code of this Job is shown below:
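As before, the code below is only a minimal sketch of what this Job could look like, assuming scikit-learn and joblib and the same illustrative bucket and file names used in the previous step; the actual Job in the repository may differ in details such as hyperparameters:

# train_job.py -- minimal sketch of the training Titan Job (illustrative names and parameters)
import json

import joblib
import pandas as pd
from google.cloud import storage
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

BUCKET_NAME = "titan-tutorial-data"  # assumption: replace with your Google Storage bucket

# Download the .csv file produced by the ETL Job and read it
bucket = storage.Client().bucket(BUCKET_NAME)
bucket.blob("hotel_data.csv").download_to_filename("hotel_data.csv")
df = pd.read_csv("hotel_data.csv")

# Identify and convert the categorical variables (one-hot encoding)
categorical_cols = ["Country", "MarketSegment", "ArrivalDateMonth", "DepositType", "CustomerType"]
df = pd.get_dummies(df, columns=categorical_cols)

# Define the predicted variable (IsCanceled) and the predictors (the rest of the variables)
y = df["IsCanceled"]
X = df.drop(columns=["IsCanceled"])

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the 3 models and calculate the accuracy score of each one
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
    "random_forest": RandomForestClassifier(),
}

metrics = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    metrics[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {metrics[name]:.3f}")

    # Save the trained model to Google Storage for its later use by the prediction service
    joblib.dump(model, f"{name}.joblib")
    bucket.blob(f"models/{name}.joblib").upload_from_filename(f"{name}.joblib")

# Save the accuracy scores so the Evaluation step of the CI pipeline can decide whether to deploy
with open("metrics.json", "w") as f:
    json.dump(metrics, f)
bucket.blob("models/metrics.json").upload_from_filename("metrics.json")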
Step 4: Defining the prediction endpoints
In this last step, we will use Titan Services to deploy the different prediction models that have been previously trained.
The Jupyter Notebook for this step does the following:
- Load the trained models
- Define a different endpoint for each prediction model (Logistic Regression, Random Forest and Gradient Boosting)
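The mechanism Titan Services uses to turn notebook functions into HTTP endpoints was covered in previous tutorials and is not reproduced here; the following sketch only illustrates the model-loading and prediction logic, reusing the illustrative bucket and model names from the previous step:

# Sketch of the prediction logic inside the service notebook (illustrative names)
import joblib
import pandas as pd
from google.cloud import storage

BUCKET_NAME = "titan-tutorial-data"  # assumption: replace with your Google Storage bucket

# Load the trained models from Google Storage
bucket = storage.Client().bucket(BUCKET_NAME)
models = {}
for name in ("logistic_regression", "gradient_boosting", "random_forest"):
    bucket.blob(f"models/{name}.joblib").download_to_filename(f"{name}.joblib")
    models[name] = joblib.load(f"{name}.joblib")

def predict(model_name, features):
    """Return the cancellation prediction (0 or 1) for a single reservation.

    features: dict of one-hot-encoded columns, encoded the same way as at training time.
    """
    row = pd.DataFrame([features]).reindex(columns=models[model_name].feature_names_in_, fill_value=0)
    return int(models[model_name].predict(row)[0])

# Each model gets its own prediction endpoint; exposing these functions as HTTP endpoints
# is done with the Titan Services annotations covered in previous tutorials.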
Putting the pieces together through CI
Now that the components of the pipeline are ready, we can put them all together in the CI/CD platform of our choice.
For this example, we will be using GitLab CI:
On a regular, scheduled (daily, weekly or monthly) basis, this pipeline will bring together the aforementioned Titan Jobs and Services.
One interesting feature of this pipeline is that, thanks to the Evaluation step, the models will only be deployed if their accuracy is above a predetermined threshold. This way, we avoid the automatic deployment of poorly performing models.
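As an illustration, the Evaluation step could be implemented with a small script like the following sketch, which assumes the metrics.json file written by the training Job sketch above and an arbitrary accuracy threshold of 0.75:

# evaluate.py -- sketch of the Evaluation step (illustrative threshold and file names)
import json
import sys

from google.cloud import storage

BUCKET_NAME = "titan-tutorial-data"  # assumption: replace with your Google Storage bucket
ACCURACY_THRESHOLD = 0.75            # assumption: choose a threshold that suits your use case

# Download the accuracy scores written by the training Job
storage.Client().bucket(BUCKET_NAME).blob("models/metrics.json").download_to_filename("metrics.json")
with open("metrics.json") as f:
    metrics = json.load(f)

# Exit with a non-zero code if any model is below the threshold, so the CI job fails
# and the deployment stage is skipped for poorly performing models
below = {name: acc for name, acc in metrics.items() if acc < ACCURACY_THRESHOLD}
if below:
    print(f"Models below the {ACCURACY_THRESHOLD} accuracy threshold: {below}")
    sys.exit(1)

print("All models are above the accuracy threshold; deployment can proceed")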
The structure of the pipeline is depicted in its YAML specification:
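The actual specification can be found in the repository linked below; as a reference, here is a minimal sketch of what it could look like, using the illustrative script names from the sketches above and a placeholder deployment command (in the real pipeline, each stage triggers the corresponding Titan Job or Service):

# .gitlab-ci.yml -- minimal sketch of the pipeline (illustrative stage and script names)
stages:
  - etl
  - train
  - evaluate
  - deploy

etl:
  stage: etl
  script:
    - python etl_job.py       # Step 2: extract the data from BigQuery and upload it to Google Storage

train:
  stage: train
  script:
    - python train_job.py     # Step 3: train the models and save them (and their metrics) to Google Storage

evaluate:
  stage: evaluate
  script:
    - python evaluate.py      # fails the pipeline if any model is below the accuracy threshold

deploy:
  stage: deploy
  script:
    # Placeholder: replace with the Titan Services deployment command used in previous tutorials
    - echo "Deploying the prediction service with Titan Services"

# The pipeline can be run on a daily, weekly or monthly basis using GitLab's pipeline schedules.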
Note: In order to create and run the pipeline, make sure to enter the required credentials (Google Cloud and Titan) in the CI/CD settings of the project:
You can find all the code in this GitHub repository.
Wrap-up
In this tutorial, we have combined many of the features we saw in previous tutorials in order to build a more complex and production-ready ML pipeline.
Combining Titan building blocks (Titan Services and Titan Jobs) with any sort of data source makes it really easy to create and maintain robust data-based services for all types of projects.
Thanks for reading this far!
Afterword
Titan can help you radically reduce and simplify the effort required to put AI/ML models into production, enabling Data Science teams to be more agile, more productive and closer to the business impact of their developments.
If you want to know more about how to start using Titan or get a free demo, please visit our website or drop us a line at info@akoios.com.
If you prefer, you can schedule a meeting with us here.