Spark ML to H2O Migration for Machine Learning in iyzico

Yalçın Yenigün
5 min read · Nov 1, 2017

On the iyzico data science team, we use an ensemble of different machine learning algorithms to detect credit card fraud while increasing conversion rates at the same time. We have many technical and non-technical requirements for our fraud detection product:

  • Continuous Delivery: Data scientists develop and test machine learning models that need to be continuously deployed to production in an agile way. Each week we train and deploy ML models to the production environment.
  • Real-Time Fraud Detection: The fraud detection system (an ensemble of ML models) should score the risk of a payment transaction within 100 milliseconds (0.1 seconds) at most.
  • High Availability & Scalability: The system should be highly available, at a 99.99% rate. It should also scale up quickly, because iyzico is a fast-growing startup (iyzico's transaction count increased 800% last year).
  • Learning Curve: Our technology stack should have a low learning curve. The stack should also make communication between a data scientist and a software developer easy. Our engineers constantly guard against over-engineering.
  • Open Source: Because we are a startup and we love open source, the technology stack should be completely open source. We try to contribute to open source projects.
  • Fast: Our company motto is “easy, fast and happy”. We have to be fast while prototyping and deploying MVPs of our products.
  • On Premise: Due to payment regulations, we cannot run our software in the cloud. Our solution should be deployed on premises.

With the support of the engineering and data science teams, we initially started to build ML models with the prediction.io framework on top of Spark ML. Why did we choose prediction.io and Spark ML? The reasons were:

  • Our primary criterion was time-to-market, and we needed to deploy ML models easily and quickly. Training a model with prediction.io via its REST API is a very simple task:
Prediction.io Prediction Engines
  • Spark is a scalable data processing engine. It can scale easily while data is growing.

We developed and tested our fraud detection models and moved to production with prediction.io and Spark 2.0. We automated all training, testing and deployment processes to enable continuous delivery. Tens of features were developed and many ML models were deployed each week. The data pipeline was working, but new problems started to emerge.

The main problems:

  • It is impossible to deploy an existing model to the production environment with prediction.io. You have to retrain on production, because the framework is heavyweight and can only run models it trained in the same environment.
  • Prediction.io and Spark need too much hardware even when you work with small amounts of data. For example, training a decision tree on 1 million transactions with 100 features needs almost 16 GB of RAM. A public cloud with auto-scaling cannot be used due to regulation, so we had to dedicate additional servers with large amounts of RAM just for training.
  • As the number of features increases, the prediction time on prediction.io increases exponentially.
  • Experimentation is difficult with Spark and prediction.io. Data scientists need to build models with a lightweight framework (Python, R, IPython, etc.) for rapid experimentation. Moreover, developers would also like to be able to deploy models they created in a sandbox environment to production.
  • Spark has a good developer community, but prediction.io does not have a stable community and contribution is difficult. It is implemented in Scala, and maintaining and scaling Scala code is yet another challenge.

Once these problems became a real pain, we started to look for an open source, lightweight machine learning framework. Migrating to a different ML platform was easy for us, since we had implemented feature engineering and the data pipeline in Java 8 and did not need to change the feature engineering side. We only needed to change the model training and deployment infrastructure.

The data science and development teams tested TensorFlow, Spark ML (the existing framework) and H2O, and analysed the benchmarks. The benchmark criteria were:

  • Simplicity of deploying an existing ML model to production.
  • Hardware requirements for training an ML model.
  • Algorithm support (decision trees and Bayesian models were tested).
  • Python, R and SQL support.
  • Simplicity of experimentation in a local environment.
  • Prediction time in milliseconds.

After thorough analysis, we decided to migrate to the H2O framework. The reasons for this migration were:

  • ML models can be converted to a simple POJO, and it is easy to deploy this model in any Java environment. A data scientist can build a model with R or Python and hand it over to the development team in POJO format.
  • Release management and devops cycle are easy with a POJO ML model.
  • Hardware needs decreased. For example, training a random forest (with 64 trees) on 1 million transactions with 100 features needs almost 16 GB of RAM on Spark ML, 10 GB on TensorFlow and 2 GB on H2O.
  • Experimenting in a local environment is easy with H2O. You can just start the framework with the command “java -jar h2o.jar” and experiment with Python and R in a browser notebook.
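
To illustrate that last point, a local experiment with H2O’s Python API might look like the following sketch. This is only an illustration of the typical train-then-export flow: the file name, the “fraud” column and the tree count are hypothetical, and it assumes the h2o Python package is installed and a local H2O instance can be started.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Connect to (or start) a local H2O instance
h2o.init()

# Hypothetical training data with a binary "fraud" label
frame = h2o.import_file("transactions.csv")
frame["fraud"] = frame["fraud"].asfactor()

# Train a GBM; 64 trees mirrors the forest size mentioned above
model = H2OGradientBoostingEstimator(ntrees=64)
model.train(x=[c for c in frame.columns if c != "fraud"],
            y="fraud",
            training_frame=frame)

# Export the trained model as a plain-Java POJO for the dev team
h2o.download_pojo(model, path="/tmp")
```

The same experiment can be run interactively from the Flow notebook in the browser once the framework is started.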

We finally migrated from Spark ML to H2O, and the results are:

  • 60 GB of RAM saved. Spark ML and prediction.io needed this RAM during model training.
  • 12 cores saved. Spark ML and prediction.io needed these cores to reduce model training time.
  • Response time decreased almost 10-fold (from 300 milliseconds to 35 milliseconds):
Weekly Average Response Time
  • Fraud engine performance (precision, recall and F-measure) did not change:
Average Fraud Engine Score (Blue: Spark ML, Yellow: H2O)
  • Deploying an existing ML model to production is really easy now. With the code below, you can make predictions using an existing model:
Simple Java Prediction Code With An Existing ML Model (gbm_pojo_test)
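
A minimal sketch of that POJO scoring pattern is shown below. It assumes the generated gbm_pojo_test class and the h2o-genmodel library are on the classpath; the feature names and values are hypothetical placeholders, not iyzico’s actual features.

```java
import hex.genmodel.GenModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

public class Main {
  public static void main(String[] args) throws Exception {
    // Load the generated POJO; the class name matches the exported model
    GenModel rawModel = (GenModel) Class.forName("gbm_pojo_test").newInstance();
    EasyPredictModelWrapper model = new EasyPredictModelWrapper(rawModel);

    // Fill a row with feature values (names here are illustrative)
    RowData row = new RowData();
    row.put("amount", "120.50");
    row.put("card_country", "TR");

    // Score the transaction and print the predicted class probabilities
    BinomialModelPrediction p = model.predictBinomial(row);
    System.out.println("Predicted label: " + p.label);
    System.out.println("Class probabilities: "
        + java.util.Arrays.toString(p.classProbabilities));
  }
}
```

Because the POJO is plain Java, the same scoring code runs inside any JVM service with no Spark or H2O cluster at prediction time.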

For more details you can send your questions to ai@iyzico.com and our data engineering team (Mustafa Kıraç, Kadriye Doğan, Kemal Toprak Uçar, Yalçın Yenigün) will contact you.
