AutoML using H2O
Automated machine learning (AutoML) is the process of automating the end-to-end application of machine learning to real-world problems. AutoML automates most of the steps in an ML pipeline with a minimum of human effort and without compromising performance.
Automatic machine learning broadly includes the following steps:
- Data Preparation and Ingestion: Real-world data may arrive raw or in an arbitrary format. In this step, the data must be converted into a format that can be processed easily. This also requires deciding the data type of each column in the dataset, as well as a clear understanding of the task to be performed on the data (e.g. classification, regression, etc.).
- Feature Engineering: This includes the various steps required to clean the dataset, such as dealing with NULL/missing values, selecting the most important features, removing low-correlation features, and dealing with skewed data.
- Hyperparameter Optimization: To obtain the best results from any model, AutoML needs to carefully tune the hyperparameter values.
- Model Selection: H2O AutoML trains a large number of models in order to produce the best results. It also trains stacked ensembles of these models to get the best performance out of the training data.
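The data-preparation and feature-engineering steps above can be sketched in plain pandas before any AutoML tool enters the picture. This is a minimal illustrative example on a hypothetical toy dataset; the column names and the correlation threshold are assumptions for illustration, not part of H2O:

```python
import numpy as np
import pandas as pd

# hypothetical toy dataset to illustrate the cleaning steps
df = pd.DataFrame({
    'rooms':  [2.0, 3.0, np.nan, 5.0],   # contains a missing value
    'income': [1.5, 2.3, 3.1, 4.0],
    'noise':  [0.1, 0.1, 0.1, 0.1],      # constant, low-information column
    'price':  [100, 150, 200, 260],      # regression target
})

# deal with NULL/missing values, e.g. by imputing the column median
df['rooms'] = df['rooms'].fillna(df['rooms'].median())

# drop features with (near) zero correlation to the target
corr = df.corr()['price'].abs()
df = df[corr[corr > 0.1].index]
```

A constant column like `noise` has undefined correlation with the target, so the filter removes it, while informative features survive.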
H2O AutoML contains cutting-edge, distributed implementations of many machine learning algorithms. These algorithms are available in Java, Python, Spark, Scala, and R. H2O also provides a web GUI that uses JSON to drive these algorithms. Models trained with H2O AutoML can be easily deployed on a Spark server, AWS, etc.
The main advantage of H2O AutoML is that it automates steps like basic data processing, model training and tuning, and the ensembling and stacking of various models to deliver the best-performing model, so that developers can focus on other steps like data collection, feature engineering, and model deployment.
Functionalities of H2O AutoML
- H2O AutoML provides necessary data processing capabilities. These are also included in all of the H2O algorithms.
- Trains a random grid of algorithms such as GBMs, DNNs, GLMs, etc. over a carefully chosen hyperparameter space.
- Individual models are tuned using cross-validation.
- Two Stacked Ensembles are trained. One ensemble contains all the models (optimized for model performance), and the other contains just the best-performing model from each algorithm class/family (optimized for production use).
- Returns a sorted “Leaderboard” of all models.
- All models can be easily exported to production.
Architecture:
H2O AutoML is built on the H2O architecture, which can be divided into layers: the top layer exposes the various APIs, and the bottom layer is the H2O JVM.
H2O provides REST API clients for Python, R, Excel, Tableau, and Flow Web UI using socket connections.
The bottom layer contains different components that will run on the H2O JVM process.
An H2O cluster consists of one or more nodes. Each node is a single JVM process. Each JVM process is split into three layers: language, algorithms, and core infrastructure.
- The first layer in the bottom section is the language layer. It consists of an expression evaluation engine for R and a Scala layer.
- The second layer is the algorithm layer. This layer contains the algorithms that ship with H2O, such as XGBoost, GBM, Random Forest, K-Means, etc.
- The third layer is the core infrastructure layer, which deals with resource management such as memory and CPU management.
Implementation:
- In this code, we will be using the California Housing dataset, which is readily available in Colab. First, we need to import the necessary packages.
Code:
python3
# import the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
- Now, we load the California Housing dataset. It is already available in the sample_data folder when we load the Colab environment.
Code:
python3
# load the training data
df = pd.read_csv('sample_data/california_housing_train.csv')
- Let’s look at the dataset, we use the head function to list the first few rows of the dataset.
Code:
python3
# print the first 5 rows of the dataframe
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0
- Now, let’s check for null values in the dataset. As we can see, there are no null values.
Code:
python3
# count the null values in every column
df.isna().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
- Now we need to install h2o, which we can do using pip. Note that if you are running H2O in a local environment, you also need to install the Java Development Kit (JDK). After installing the JDK and H2O, we initialize H2O; if everything works, this starts an H2O instance on localhost. There are many arguments we can pass, such as:
- nthreads: Number of CPU cores the H2O server can use; by default it uses all cores.
- ip: IP address of the server where the H2O server will run. By default, it uses localhost.
- port: port on which the H2O server will run.
- max_mem_size: A character string specifying the maximum size, in bytes, of the memory allocation pool for H2O. This value must be a multiple of 1024 greater than 2 MB. Append the letter m or M to indicate megabytes, or g or G to indicate gigabytes. Similarly, there is a min_mem_size parameter. For more details, see the H2O docs.
Code:
python3
# install and import H2O
!pip install h2o
import h2o

# we will use the default parameters of the H2O init method here
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpebz1_45i
  JVM stdout: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.

H2O_cluster_uptime:         03 secs
H2O_cluster_timezone:       Etc/UTC
H2O_data_parsing_timezone:  UTC
H2O_cluster_version:        3.30.0.6
H2O_cluster_version_age:    13 days
H2O_cluster_name:           H2O_from_python_unknownUser_h4lj71
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    3.180 Gb
H2O_cluster_total_cores:    2
H2O_cluster_allowed_cores:  2
H2O_cluster_status:         accepting new members, healthy
H2O_connection_url:         http://127.0.0.1:54321
H2O_connection_proxy:       {"http": null, "https": null}
H2O_internal_security:      False
H2O_API_Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version:             3.6.9 final
- The H2O instance can also be accessed at localhost:54321, where it serves a web GUI called Flow. Now, we need to convert the training DataFrame into an H2O Frame.
python3
# convert the pandas DataFrame into an H2O Frame
train_df = h2o.H2OFrame(df)

# describe the training H2OFrame
train_df.describe()
Parse progress: |████████████████████████████████████████████████████████| 100%
Rows:17000 Cols:9

         longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
type     real       real      int                 int          int             int         int         real           int
mins     -124.35    32.54     1.0                 2.0          1.0             3.0         1.0         0.4999         14999.0
mean     -119.5621  35.6252   28.5894             2643.6644    539.4108        1429.5739   501.2219    3.8836         207300.9124
maxs     -114.31    41.95     52.0                37937.0      6445.0          35682.0     6082.0      15.0001        500001.0
sigma    2.0052     2.1373    12.5869             2179.9471    421.4995        1147.8530   384.5208    1.9082         115983.7644
zeros    0          0         0                   0            0               0           0           0              0
missing  0          0         0                   0            0               0           0           0              0

0        -114.31    34.19     15.0                5612.0       1283.0          1015.0      472.0       1.4936         66900.0
1        -114.47    34.4      19.0                7650.0       1901.0          1129.0      463.0       1.82           80100.0
2        -114.56    33.69     17.0                720.0        174.0           333.0       117.0       1.6509         85700.0
3        -114.57    33.64     14.0                1501.0       337.0           515.0       226.0       3.1917         73400.0
4        -114.57    33.57     20.0                1454.0       326.0           624.0       262.0       1.925          65500.0
5        -114.58    33.63     29.0                1387.0       236.0           671.0       239.0       3.3438         74000.0
6        -114.58    33.61     25.0                2907.0       680.0           1841.0      633.0       2.6768         82400.0
7        -114.59    34.83     41.0                812.0        168.0           375.0       158.0       1.7083         48500.0
8        -114.59    33.61     34.0                4789.0       1175.0          3134.0      1056.0      2.1782         58400.0
9        -114.6     34.83     46.0                1497.0       309.0           787.0       271.0       2.1908         48100.0
- Now, we load our test dataset into a pandas DataFrame and convert it into an H2O Frame.
Code:
python3
# load the test data and convert it to an H2O Frame
test = pd.read_csv('sample_data/california_housing_test.csv')
test = h2o.H2OFrame(test)

# select the feature and label columns
x = test.columns
y = 'median_house_value'

# remove the label column from the feature list
x.remove(y)
Parse progress: |████████████████████████████████████████████████████████| 100%
- Now, we run AutoML and start training.
Code:
python3
# import AutoML from H2O
from h2o.automl import H2OAutoML

# configure the H2OAutoML run
aml = H2OAutoML(max_runtime_secs=600,
                # exclude_algos=['DeepLearning'],
                seed=1,
                # stopping_metric='logloss',
                # sort_metric='logloss',
                balance_classes=False,
                project_name='Project 1')

# train the models and record the time taken
%time aml.train(x=x, y=y, training_frame=train_df)
AutoML progress: |███████████████████████████████████████████████████████| 100%
CPU times: user 40 s, sys: 1.24 s, total: 41.2 s
Wall time: 9min 39s
- In this step, we look at the leaderboard to find the best-performing model; it will most likely be one of the two stacked ensemble models.
python3
# view the H2O AutoML leaderboard
lb = aml.leaderboard

# print all rows instead of the default 10
lb.head(rows=lb.nrows)
model_id                                              mean_residual_deviance     rmse          mse      mae     rmsle
StackedEnsemble_AllModels_AutoML_20200714_173719                 2.04045e+09  45171.3  2.04045e+09  29642.1  0.221447
StackedEnsemble_BestOfFamily_AutoML_20200714_173719              2.06576e+09  45450.6  2.06576e+09  29949.4  0.223522
GBM_3_AutoML_20200714_173719                                     2.15623e+09  46435.2  2.15623e+09  30763.8  0.227577
GBM_4_AutoML_20200714_173719                                     2.15913e+09  46466.4  2.15913e+09  30786.7  0.228627
XGBoost_grid__1_AutoML_20200714_173719_model_5                   2.16562e+09  46536.2  2.16562e+09  31075.9  0.233288
GBM_2_AutoML_20200714_173719                                     2.17639e+09  46651.8  2.17639e+09  31014.8  0.229731
GBM_grid__1_AutoML_20200714_173719_model_2                       2.2457e+09   47388.8  2.2457e+09   31717.9  0.236673
GBM_grid__1_AutoML_20200714_173719_model_4                       2.24615e+09  47393.6  2.24615e+09  31533.6  0.235206
GBM_grid__1_AutoML_20200714_173719_model_5                       2.30368e+09  47996.7  2.30368e+09  31888    0.234582
GBM_grid__1_AutoML_20200714_173719_model_3                       2.31412e+09  48105.3  2.31412e+09  32428.7  0.241596
GBM_1_AutoML_20200714_173719                                     2.38155e+09  48801.2  2.38155e+09  32817.8  0.241261
GBM_5_AutoML_20200714_173719                                     2.38712e+09  48858.1  2.38712e+09  32730.3  0.238373
XGBoost_grid__1_AutoML_20200714_173719_model_2                   2.41444e+09  49137    2.41444e+09  33359.3  nan
XGBoost_grid__1_AutoML_20200714_173719_model_1                   2.43811e+09  49377.2  2.43811e+09  33392.7  nan
XGBoost_grid__1_AutoML_20200714_173719_model_6                   2.44549e+09  49451.8  2.44549e+09  33620.7  nan
XGBoost_grid__1_AutoML_20200714_173719_model_7                   2.46672e+09  49666.1  2.46672e+09  33264.5  nan
XGBoost_3_AutoML_20200714_173719                                 2.47346e+09  49733.9  2.47346e+09  33829    nan
XGBoost_grid__1_AutoML_20200714_173719_model_3                   2.53867e+09  50385.2  2.53867e+09  33713.1  0.252152
XGBoost_grid__1_AutoML_20200714_173719_model_4                   2.61998e+09  51185.8  2.61998e+09  34084.3  nan
GBM_grid__1_AutoML_20200714_173719_model_1                       2.63332e+09  51315.9  2.63332e+09  35218.1  nan
XGBoost_1_AutoML_20200714_173719                                 2.64565e+09  51435.9  2.64565e+09  34900.5  nan
XGBoost_2_AutoML_20200714_173719                                 2.67031e+09  51675    2.67031e+09  35556.1  nan
DRF_1_AutoML_20200714_173719                                     2.90447e+09  53893.1  2.90447e+09  36925.5  0.263639
XRT_1_AutoML_20200714_173719                                     2.92071e+09  54043.6  2.92071e+09  37116.6  0.264397
XGBoost_grid__1_AutoML_20200714_173719_model_8                   4.32541e+09  65767.9  4.32541e+09  43502.3  0.287448
DeepLearning_1_AutoML_20200714_173719                            5.06767e+09  71187.6  5.06767e+09  49467.4  nan
DeepLearning_grid__2_AutoML_20200714_173719_model_1              6.01537e+09  77558.8  6.01537e+09  56478.1  0.386805
DeepLearning_grid__3_AutoML_20200714_173719_model_1              7.85515e+09  88629.3  7.85515e+09  64133.5  0.448841
GBM_grid__1_AutoML_20200714_173719_model_6                       8.44986e+09  91923.1  8.44986e+09  71726.4  0.483173
DeepLearning_grid__1_AutoML_20200714_173719_model_2              8.72689e+09  93417.8  8.72689e+09  65346.1  nan
DeepLearning_grid__1_AutoML_20200714_173719_model_1              8.9643e+09   94680    8.9643e+09   68862.6  nan
GLM_1_AutoML_20200714_173719                                     1.34525e+10  115985   1.34525e+10  91648.3  0.592579
- In this step, we explore the base learners of the stacked ensemble model and select the best-performing base learner.
Code:
python3
# get the top model of the leaderboard
se = aml.leader

# get the metalearner model of the top model
metalearner = h2o.get_model(se.metalearner()['name'])

# list the base learner models
metalearner.varimp()
[('XGBoost_grid__1_AutoML_20200714_173719_model_5', 36607.81502851827, 1.0, 0.3400955145231931),
 ('GBM_4_AutoML_20200714_173719', 33538.168782584005, 0.9161477885652846, 0.311577753531396),
 ('GBM_3_AutoML_20200714_173719', 27022.573640463357, 0.7381640674105295, 0.25104628830851705),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_3', 7512.2319349954105, 0.2052084214570911, 0.06979046367994166),
 ('GBM_2_AutoML_20200714_173719', 1221.399944930078, 0.03336445903637191, 0.011347102862762904),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_4', 897.9511180098376, 0.024528945999926915, 0.008342184510556763),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_2', 839.6650323257486, 0.022936769967604773, 0.007800692583632669),
 ('GBM_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_4', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_5', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_3', 0.0, 0.0, 0.0),
 ('GBM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('GBM_5_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_7', 0.0, 0.0, 0.0),
 ('XGBoost_3_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('XGBoost_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_2_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('DRF_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XRT_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_8', 0.0, 0.0, 0.0),
 ('DeepLearning_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__2_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__3_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('GLM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0)]
- Now, we evaluate this base learner on the test dataset and plot its feature importance.
python3
# evaluate model performance on the test dataset
model = h2o.get_model('XGBoost_grid__1_AutoML_20200714_173719_model_5')
model.model_performance(test)
ModelMetricsRegression: xgboost
** Reported on test data. **

MSE: 2194912948.887177
RMSE: 46849.89806698812
MAE: 31039.50846508789
RMSLE: 0.24452804591616809
Mean Residual Deviance: 2194912948.887177
Code:
python3
# plot the variable importance graph
model.varimp_plot(num_of_features=9)
- Finally, we can save this model using h2o.save_model; the saved model can then be deployed on various platforms.
Code:
python3
# save the base learner model
model_path = h2o.save_model(model=model, path='sample_data/', force=True)
References:
- H2O AI architecture doc
- H2O AutoML blog