Bayesian optimization for Hyperparameter Tuning of XGboost classifier

In this approach, we will use a data set for which we have already completed an initial analysis and exploration of a small train_sample set (100K observations) and developed some initial expectations. It is a binary classification problem in which crude web traffic data from an application download portal is provided, along with an outcome, which indicates whether a given web user eventually downloaded a product from the website. For our purpose, the binary outcome variable is encoded as "is_attributed" in the data set. Other variables include the ip address of a given hit, the operating system, device or channels used, apps being examined, the time spent on each click event, and if a download has occured, the time that took place.

First we are going to build a feature engineering pipeline in order to effectively process data sets for predictive modeling.

Feature Engineering Pipeline Design

  1. We will prepare 3 data sets from the large training set. The entire data set contains 200 million samples. In order to illustrate Bayesian Optimization, we will extract 3 data sets each contain ~1 million observations. We will refer these sets as:

    • training_set (1:100,000th rows of the original training set)
    • validation_set1 (100,001:200,000th rows of the original training set)
    • validation_set2 (200,001:300,000th rows of the original training set)
  2. Build the feature extraction and selection pipeline using the training set:

    • Using the insights we obtained from initial data exploration, the following features will be used to create dummy variables: device, app, os and channel. We will engineer these features by converting them to string, tokenization and selecting 300 best features.
    • We will use hashing trick (hashingtokenizer) to improve computational efficiency.
    • We will write custom processing functions to add the log_total_clicks and log_total_click_time features, and remove the unwanted base features
  3. We will prepare the remainder of the pipeline to incorporate interaction terms and perform scaling and standardization.

Partition training and validation sets

In [19]:
import pandas as pd
training_set = pd.read_csv("train.csv",
                           nrows=1000000,
                           dtype = "str")
print("Finished training_set")
validation_set1 = pd.read_csv("train.csv",
                           skiprows = 1000000,names = list(training_set.columns),
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set1")
validation_set2 = pd.read_csv("train.csv",
                           skiprows = 2000000,names = list(training_set.columns),
                           nrows=1000000,
                           dtype = "str")
print("Finished validation_set2")
Finished training_set
Finished validation_set1
Finished validation_set2
In [20]:
validation_set1.head()
Out[20]:
ip app device os channel click_time attributed_time is_attributed
0 121848 24 1 19 105 2017-11-06 16:21:51 NaN 0
1 2698 25 1 30 259 2017-11-06 16:21:51 NaN 0
2 5729 2 1 19 237 2017-11-06 16:21:51 NaN 0
3 122891 3 1 35 280 2017-11-06 16:21:51 NaN 0
4 105433 15 2 25 245 2017-11-06 16:21:51 NaN 0
In [2]:
training_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
ip                 1000000 non-null object
app                1000000 non-null object
device             1000000 non-null object
os                 1000000 non-null object
channel            1000000 non-null object
click_time         1000000 non-null object
attributed_time    1693 non-null object
is_attributed      1000000 non-null object
dtypes: object(8)
memory usage: 61.0+ MB
In [23]:
training_set.head()
Out[23]:
ip app device os channel click_time attributed_time is_attributed
0 83230 3 1 13 379 2017-11-06 14:32:21 NaN 0
1 17357 3 1 19 379 2017-11-06 14:33:34 NaN 0
2 35810 3 1 13 379 2017-11-06 14:34:12 NaN 0
3 45745 14 1 13 478 2017-11-06 14:34:52 NaN 0
4 161007 3 1 13 379 2017-11-06 14:35:08 NaN 0

Seperate target labels from feature matrix

We will seperate target labels from features for each of these data sets and pickle them for future use:

In [25]:
import os
import pandas as pd
import numpy as np
import pickle

X_train = training_set.drop(["is_attributed","attributed_time"], axis = 1)
y_train = pd.to_numeric(training_set.is_attributed) 

X_train.to_pickle("X_train.pkl")
y_train.to_pickle("y_train.pkl")

X_val1 = validation_set1.drop(["is_attributed","attributed_time"], axis = 1)
y_val1 = pd.to_numeric(validation_set1.is_attributed) 

X_val1.to_pickle("X_val1.pkl")
y_val1.to_pickle("y_val1.pkl")

X_val2 = validation_set2.drop(["is_attributed","attributed_time"], axis = 1)
y_val2 = pd.to_numeric(validation_set2.is_attributed) 

X_val2.to_pickle("X_val2.pkl")
y_val2.to_pickle("y_val2.pkl")

Build the feature extraction and selection pipeline using training set

Note that this is a bit verbose. If you are not familiar with writing skelarn pipelines, don't worry. Just try to get the overall idea, which is to make entire processing steps consistent and ensure same features are being extracted based on the onservations in the trainig set, without leakage from the validation sets. I also tried to explain how several custom functions are being integrated in the pipeline, just refer to comments in the code.

In [217]:
import pandas as pd
import pickle

# Read the pickled training set
X_train = pd.read_pickle("X_train.pkl")
y_train = pd.read_pickle("y_train.pkl")

# Label text features
Text_features = ["app","device","os","channel"]

##############################################################
# Define utility function to parse and process text features
##############################################################
# Note we avoid lambda functions since they don't pickle when we want to save the pipeline later   
def column_text_processer_nolambda(df,text_columns = Text_features):
    import pandas as pd
    import numpy as np
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Concatenate feature name to each category encoding for each row
    # E.g: encoding 3 at device column will read as device3 to make each encoding unique for a given feature
    for col_index in list(text_data.columns):
        text_data[col_index] = col_index + text_data[col_index].astype(str)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector)

#######################################################################
# Define custom processing functions to add the log_total_clicks and 
# log_total_click_time features, and remove the unwanted base features
#######################################################################
def column_time_processer(X_train):
    import pandas as pd
    import numpy as np

    # Convert click_time to datetime64 dtype 
    X_train.click_time = pd.to_datetime(X_train.click_time)

    # Calculate the log_total_clicks for each ip and add as a new feature to temp_data
    temp_data = pd.DataFrame(np.log(X_train.groupby(["ip"]).size()),
                                    columns = ["log_total_clicks"]).reset_index()


    # Calculate the log_total_click_time for each ip and add as a new feature to temp_data
    # First define a function to process selected ip group 
    def get_log_total_click_time(group):
        diff = (max(group.click_time) - min(group.click_time)).seconds
        return np.log(diff+1)

    # Then apply this function to each ip group and extract the total click time per ip group
    log_time_frame = pd.DataFrame(X_train.groupby(["ip"]).apply(get_log_total_click_time),
                                  columns=["log_total_click_time"]).reset_index()

    # Then add this new feature to the temp_data
    temp_data = pd.merge(temp_data,log_time_frame, how = "left",on = "ip")

    # Combine temp_data with X_train to maintain X_train key order
    temp_data = pd.merge(X_train,temp_data,how = "left",on = "ip")

    # Drop features that are not needed
    temp_data = temp_data[["log_total_clicks","log_total_click_time"]]

    # Return only the numeric features as a tensor to integrate into the numeric feature branch of the pipeline
    return temp_data


#############################################################################
# We need to wrap these custom utility functions using FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
# FunctionTransformer wrapper of utility functions to parse text and numeric features
# Note how we avoid putting any arguments into column_text_processer or column_time_processer
#############################################################################
get_numeric_data = FunctionTransformer(func = column_time_processer, validate=False) 
get_text_data = FunctionTransformer(func = column_text_processer_nolambda,validate=False) 

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
# #Note this regex will match either a whitespace or a punctuation to tokenize 
# the string vector on these preferences, in our case we only have white spaces in our text  
#############################################################################
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   

###############################################
# Construct our feature extraction pipeline
###############################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer
from sklearn.feature_selection import SelectKBest, chi2 # We will use chi-squared as a scoring function to select features for classification
from sklearn.metrics import auc
from SparseInteractions import * #Load SparseInteractions (from : https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py) as a module since it was saved into working directory as SparseInteractions.py

userclick_pipeline1 = Pipeline([
    
    ("union",FeatureUnion(
        # Note that FeatureUnion() also accepts list of tuples, the first half of each tuple 
        # is the name of the transformer within the FeatureUnion
        
        transformer_list = [
            
            ("numeric_subpipeline",Pipeline([        # Note we have subpipeline branches inside the main pipeline
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()) # Step2: impute any missing data using default (mean), note we don't expect missing values in this case. 
            ])), # End of: numeric_subpipeline
            
            ("text_subpipeline",Pipeline([
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC, # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                             ngram_range = (1,1),
                                             non_negative=True, 
                                             norm=None, binary=True )), # Note here we use binary=True since our hack is to use tokenization to generate dummy variables  
                ('dim_red', SelectKBest(chi2,300)) # Step3: use dimension reduction to select 300 best features using chi2 as scoring function
            ]))
        ]
        
    )),# End of step: union, this is the fusion point to main pipeline, all features are numeric at this stage
    
    # Common steps:
            
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms up to the second degree polynomial
    ("scaler",MaxAbsScaler()) # Scale the features between 0 and 1.       
            
])# End of: userclick_pipeline1

Develop the userclick_pipeline1 by using the training set:

In [ ]:
import datetime
start = datetime.datetime.now()

userclick_pipeline1.fit(X_train,y_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")

Having trained our pipeline using the training set, we will pickle it and store for reuse. This will ensure the consistency every time we want to process a data set, and we will extract the same set of features.

In [240]:
# Pickle and store the userclick_pipeline1
import pickle
with open("userclick_pipeline1.pkl","wb") as f:
    pickle.dump(userclick_pipeline1,f)    
    

Transform the features in the training set using the established pipeline

In [ ]:
# Re-load the userclick_pipeline1 to work with
import pickle
with open("userclick_pipeline1.pkl","rb") as f:
    userclick_pipeline1 = pickle.load(f)

import datetime
start = datetime.datetime.now()

# Transform the training set features
X_train_trans_pl1 = userclick_pipeline1.transform(X_train)

end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")
In [245]:
X_train_trans_pl1.shape
Out[245]:
(1000000, 45753)
In [246]:
type(X_train_trans_pl1)
Out[246]:
scipy.sparse.csc.csc_matrix

We will pickle and save this transformed version of the features from the training set. We spent about 3 minutes by training the pipeline and an additional 3 minutes for transforming the features.

In the future, we will only use this pipeline with .transform method to process any datasets we would like to use in our models.

Training a boosted tree classifier using XGboost

Let's get started by training a XGboost classifier with the training data so obtain a benchmark performance with an untuned model. .

In [1]:
# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
    X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
    y_train = pickle.load(f)
    
In [7]:
# Initially trying an untuned classifier to test the performance
import datetime
import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb
start = datetime.datetime.now()

# Instantiate the classifier
xgbc = xgb.XGBClassifier(nthread=3)
xgbc.fit(X_train_trans_pl1,y_train)

end = datetime.datetime.now()
process_time = end - start
print("Training XGB classifier took: " + str(process_time.seconds/60) + " minutes.")
Training XGB classifier took: 3.683333333333333 minutes.
In [16]:
# Looking at the performance of the untuned xgb classifier
from sklearn.metrics import roc_auc_score

probs = xgbc.predict_proba(X_train_trans_pl1)
probs = probs[:,1]

print("Untuned XGB classifier roc score : " + str(roc_auc_score(y_train,probs)))
Untuned XGB classifier roc score : 0.938548478522

We have some idea about the performance of the XGboost in the training set, what we actually want to know is the "out-of-the-box" performance of the classifier in the validation sets model have not seen before:

Transform validation sets using the established pipeline

It is time to prepare the validation sets to monitor out of the box performance of our classifier. Remember that we need to transform validation sets using the feature extraction pipeline we established.

In [31]:
# Re-load the userclick_pipeline1 to work with
import pickle
with open("userclick_pipeline1.pkl","rb") as f:
    userclick_pipeline1 = pickle.load(f)

# Re-load the validation set features to work with
with open("X_val1.pkl","rb") as f:
    X_val1 = pickle.load(f)   

with open("X_val2.pkl","rb") as f:
    X_val2 = pickle.load(f)     
    
import datetime
start = datetime.datetime.now()

# Transform the validation set 1 features
X_val1_trans_pl1 = userclick_pipeline1.transform(X_val1)

# Save 
import pickle
with open("X_val1_trans_pl1.pkl","wb") as f:
    pickle.dump(X_val1_trans_pl1,f)

end = datetime.datetime.now()
process_time = end - start

print("Finished processing val1.")
print("It took: " + str(process_time.seconds/60) + " minutes.")
print("_" * 60)

# Transform the validation set 1 features
X_val2_trans_pl1 = userclick_pipeline1.transform(X_val2)

# Save 
import pickle
with open("X_val2_trans_pl1.pkl","wb") as f:
    pickle.dump(X_val2_trans_pl1,f)

end1 = datetime.datetime.now()
process_time = end1 - end

print("Finished processing val2.")
print("It took: " + str(process_time.seconds/60) + " minutes.")
print("_" * 60)
Finished processing val1.
It took: 3.05 minutes.
____________________________________________________________
Finished processing val2.
It took: 3.066666666666667 minutes.
____________________________________________________________

Out-of-the-box performance of untuned xgboost classifier

In [38]:
# Load transformed validation sets:
import pickle

# Features:
with open("X_val1_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl1 = pickle.load(f)
with open("X_val2_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl2 = pickle.load(f) 
    
# Target labels:    
with open("y_val1.pkl","rb") as f:
    y_val1= pickle.load(f)
with open("y_val2.pkl","rb") as f:
    y_val2= pickle.load(f)
    
In [39]:
# calculate out-of-the-box roc_score using validation set 1
probs = xgbc.predict_proba(X_val1_trans_pl1)
probs = probs[:,1]
val1_roc = roc_auc_score(y_val1,probs)
    
# calculate out-of-the-box roc_score using validation set 2
probs = xgbc.predict_proba(X_val2_trans_pl1)
probs = probs[:,1]
val2_roc = roc_auc_score(y_val2,probs)
In [42]:
print(val1_roc)
print(val2_roc)
print(np.array([val1_roc,val2_roc]).mean())
0.935513062488
0.931801812273
0.93365743738

This is a good start for an untuned model!

Let's now try to perform Bayesian Optimization to tune the hyperparameters of the XGB classifier and try to improve the performance. Instead of using cross-validation, we will use the average of roc scores using predictions from the two validation sets we prepared, since each fit step took about 3.68 mins using 3 of the 4 available CPU cores.

Bayesian optimization to tune hyperparameters of XGBoost classifier

Let's recall the xgboost hyperparameters that are available for tuning:

In [17]:
help(xgb.XGBClassifier)
Help on class XGBClassifier in module xgboost.sklearn:

class XGBClassifier(XGBModel, sklearn.base.ClassifierMixin)
 |  Implementation of the scikit-learn API for XGBoost classification.
 |  
 |      Parameters
 |  ----------
 |  max_depth : int
 |      Maximum tree depth for base learners.
 |  learning_rate : float
 |      Boosting learning rate (xgb's "eta")
 |  n_estimators : int
 |      Number of boosted trees to fit.
 |  silent : boolean
 |      Whether to print messages while running boosting.
 |  objective : string or callable
 |      Specify the learning task and the corresponding learning objective or
 |      a custom objective function to be used (see note below).
 |  nthread : int
 |      Number of parallel threads used to run xgboost.
 |  gamma : float
 |      Minimum loss reduction required to make a further partition on a leaf node of the tree.
 |  min_child_weight : int
 |      Minimum sum of instance weight(hessian) needed in a child.
 |  max_delta_step : int
 |      Maximum delta step we allow each tree's weight estimation to be.
 |  subsample : float
 |      Subsample ratio of the training instance.
 |  colsample_bytree : float
 |      Subsample ratio of columns when constructing each tree.
 |  colsample_bylevel : float
 |      Subsample ratio of columns for each split, in each level.
 |  reg_alpha : float (xgb's alpha)
 |      L1 regularization term on weights
 |  reg_lambda : float (xgb's lambda)
 |      L2 regularization term on weights
 |  scale_pos_weight : float
 |      Balancing of positive and negative weights.
 |  
 |  base_score:
 |      The initial prediction score of all instances, global bias.
 |  seed : int
 |      Random number seed.
 |  missing : float, optional
 |      Value in the data which needs to be present as a missing value. If
 |      None, defaults to np.nan.
 |  
 |  Note
 |  ----
 |  A custom objective function can be provided for the ``objective``
 |  parameter. In this case, it should have the signature
 |  ``objective(y_true, y_pred) -> grad, hess``:
 |  
 |  y_true: array_like of shape [n_samples]
 |      The target values
 |  y_pred: array_like of shape [n_samples]
 |      The predicted values
 |  
 |  grad: array_like of shape [n_samples]
 |      The value of the gradient for each sample point.
 |  hess: array_like of shape [n_samples]
 |      The value of the second derivative for each sample point
 |  
 |  Method resolution order:
 |      XGBClassifier
 |      XGBModel
 |      sklearn.base.BaseEstimator
 |      sklearn.base.ClassifierMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0, missing=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  evals_result(self)
 |      Return the evaluation results.
 |      
 |      If eval_set is passed to the `fit` function, you can call evals_result() to
 |      get evaluation results for all passed eval_sets. When eval_metric is also
 |      passed to the `fit` function, the evals_result will contain the eval_metrics
 |      passed to the `fit` function
 |      
 |      Returns
 |      -------
 |      evals_result : dictionary
 |      
 |      Example
 |      -------
 |      param_dist = {'objective':'binary:logistic', 'n_estimators':2}
 |      
 |      clf = xgb.XGBClassifier(**param_dist)
 |      
 |      clf.fit(X_train, y_train,
 |              eval_set=[(X_train, y_train), (X_test, y_test)],
 |              eval_metric='logloss',
 |              verbose=True)
 |      
 |      evals_result = clf.evals_result()
 |      
 |      The variable evals_result will contain:
 |      {'validation_0': {'logloss': ['0.604835', '0.531479']},
 |       'validation_1': {'logloss': ['0.41965', '0.17686']}}
 |  
 |  fit(self, X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True)
 |      Fit gradient boosting classifier
 |      
 |      Parameters
 |      ----------
 |      X : array_like
 |          Feature matrix
 |      y : array_like
 |          Labels
 |      sample_weight : array_like
 |          Weight for each instance
 |      eval_set : list, optional
 |          A list of (X, y) pairs to use as a validation set for
 |          early-stopping
 |      eval_metric : str, callable, optional
 |          If a str, should be a built-in evaluation metric to use. See
 |          doc/parameter.md. If callable, a custom evaluation metric. The call
 |          signature is func(y_predicted, y_true) where y_true will be a
 |          DMatrix object such that you may need to call the get_label
 |          method. It must return a str, value pair where the str is a name
 |          for the evaluation and value is the value of the evaluation
 |          function. This objective is always minimized.
 |      early_stopping_rounds : int, optional
 |          Activates early stopping. Validation error needs to decrease at
 |          least every <early_stopping_rounds> round(s) to continue training.
 |          Requires at least one item in evals.  If there's more than one,
 |          will use the last. Returns the model from the last iteration
 |          (not the best one). If early stopping occurs, the model will
 |          have three additional fields: bst.best_score, bst.best_iteration
 |          and bst.best_ntree_limit.
 |          (Use bst.best_ntree_limit to get the correct value if num_parallel_tree
 |          and/or num_class appears in the parameters)
 |      verbose : bool
 |          If `verbose` and an evaluation set is used, writes the evaluation
 |          metric measured on the validation set to stderr.
 |  
 |  predict(self, data, output_margin=False, ntree_limit=0)
 |  
 |  predict_proba(self, data, output_margin=False, ntree_limit=0)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  feature_importances_
 |      Returns
 |      -------
 |      feature_importances_ : array of shape = [n_features]
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from XGBModel:
 |  
 |  __setstate__(self, state)
 |  
 |  apply(self, X, ntree_limit=0)
 |      Return the predicted leaf every tree for each sample.
 |      
 |      Parameters
 |      ----------
 |      X : array_like, shape=[n_samples, n_features]
 |          Input features matrix.
 |      
 |      ntree_limit : int
 |          Limit number of trees in the prediction; defaults to 0 (use all trees).
 |      
 |      Returns
 |      -------
 |      X_leaves : array_like, shape=[n_samples, n_trees]
 |          For each datapoint x in X and for each tree, return the index of the
 |          leaf x ends up in. Leaves are numbered within
 |          ``[0; 2**(self.max_depth+1))``, possibly with gaps in the numbering.
 |  
 |  booster(self)
 |      Get the underlying xgboost Booster of this model.
 |      
 |      This will raise an exception when fit was not called
 |      
 |      Returns
 |      -------
 |      booster : a xgboost booster of underlying model
 |  
 |  get_params(self, deep=False)
 |      Get parameter.s
 |  
 |  get_xgb_params(self)
 |      Get xgboost type parameters.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.

The beauty of Bayesian Optimization process is the flexibility of defining the estimator function you wish to optimize. We can literally define any function here. Once we define this function and pass ranges for a defined set of hyperparameters, Bayesian optimization will endeavor to maximize the output of this function.

Again there are critical points to understand during implementation, refer to my comments in the code:

In [77]:
# Load transformed validation sets:
import pickle

# Features:
with open("X_val1_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl1 = pickle.load(f)
with open("X_val2_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl2 = pickle.load(f) 
    
# Target labels:    
with open("y_val1.pkl","rb") as f:
    y_val1= pickle.load(f)
with open("y_val2.pkl","rb") as f:
    y_val2= pickle.load(f)    
    
# We start by defining the score we want to be maximized using Bayesian Optimization
# Return MEAN validated 'roc_auc' score from xgb Classifier
# Note that the 4 parameters we will optimize are called as generic arguments

seed = 112 # Random seed

def xgbc_cv(max_depth,learning_rate,n_estimators,reg_alpha):
    from sklearn.metrics import roc_auc_score
    import numpy as np
    
    estimator_function = xgb.XGBClassifier(max_depth=int(max_depth),
                                           learning_rate= learning_rate,
                                           n_estimators= int(n_estimators),
                                           reg_alpha = reg_alpha,
                                           nthread = -1,
                                           objective='binary:logistic',
                                           seed = seed)
    # Fit the estimator
    estimator_function.fit(X_train_trans_pl1,y_train)
    
    # calculate out-of-the-box roc_score using validation set 1
    probs = estimator_function.predict_proba(X_val1_trans_pl1)
    probs = probs[:,1]
    val1_roc = roc_auc_score(y_val1,probs)
    
    # calculate out-of-the-box roc_score using validation set 2
    probs = estimator_function.predict_proba(X_val2_trans_pl1)
    probs = probs[:,1]
    val2_roc = roc_auc_score(y_val2,probs)
    
    # return the mean validation score to be maximized 
    return np.array([val1_roc,val2_roc]).mean()

Once we defined the estimator function we want to optimize, we can start actual optimization process:

In [78]:
import warnings
warnings.filterwarnings('ignore')

from bayes_opt import BayesianOptimization

# alpha is a parameter for the gaussian process
# Note that this is itself a hyperparemter that can be optimized.
gp_params = {"alpha": 1e-10}

# We create the BayesianOptimization objects using the functions that utilize
# the respective classifiers and return cross-validated scores to be optimized.

seed = 112 # Random seed

# We create the bayes_opt object and pass the function to be maximized
# together with the parameters names and their bounds.
# Note the syntax of bayes_opt package: bounds of hyperparameters are passed as two-tuples

hyperparameter_space = {
    'max_depth': (5, 20),
    'learning_rate': (0, 1),
    'n_estimators' : (10,100),
    'reg_alpha': (0,1)
}

xgbcBO = BayesianOptimization(f = xgbc_cv, 
                             pbounds =  hyperparameter_space,
                             random_state = seed,
                             verbose = 10)

# Finally we call .maximize method of the optimizer with the appropriate arguments
# kappa is a measure of 'aggressiveness' of the bayesian optimization process
# The algorithm will randomly choose 3 points to establish a 'prior', then will perform 
# 10 interations to maximize the value of estimator function
xgbcBO.maximize(init_points=3,n_iter=10,acq='ucb', kappa= 3, **gp_params)
Initialization
--------------------------------------------------------------------------------------------
 Step |   Time |      Value |   learning_rate |   max_depth |   n_estimators |   reg_alpha | 
    1 | 02m13s |    0.84898 |          0.0757 |     10.6259 |        14.9325 |      0.7223 | 
    2 | 13m59s |    0.92926 |          0.7769 |     14.6046 |        83.5910 |      0.0026 | 
    3 | 20m39s |    0.92728 |          0.8327 |     19.2502 |        89.6816 |      0.9812 | 
Bayesian Optimization
--------------------------------------------------------------------------------------------
 Step |   Time |      Value |   learning_rate |   max_depth |   n_estimators |   reg_alpha | 
    4 | 06m28s |    0.93059 |          1.0000 |      5.0000 |       100.0000 |      1.0000 | 
    5 | 05m00s |    0.94576 |          0.2612 |      5.0525 |        75.5942 |      0.9925 | 
    6 | 03m54s |    0.93574 |          0.9271 |      5.0621 |        56.1691 |      0.9564 | 
    7 | 04m59s |    0.93641 |          0.9909 |      5.0240 |        74.6725 |      0.8502 | 
    8 | 12m22s |    0.86083 |          0.0313 |     19.6624 |        54.0783 |      0.9691 | 
    9 | 22m42s |    0.85820 |          0.0129 |     18.4734 |        99.8471 |      0.0674 | 
   10 | 02m33s |    0.50000 |          0.0000 |      5.0000 |        32.1893 |      1.0000 | 
   11 | 03m03s |    0.92671 |          1.0000 |     20.0000 |        10.0000 |      0.0000 | 
   12 | 01m21s |    0.92781 |          1.0000 |      5.0000 |        10.0000 |      0.0000 | 
   13 | 16m29s |    0.50000 |          0.0000 |     20.0000 |        70.3654 |      1.0000 | 

Note that each step requires a bit of patience and waiting, since the data set is fairly large for this CPU power, and complexity of our clasifier is different at each step.

The output demonstrates how optimization was working. Colored steps mark improvement. Value column refers to the output of the estimator function we defined, which is the mean ROC validation score from two independent validation sets.

As you can see, it looks like we already imporved the model by finding optimal values for these 4 hyperparameters. Can we set these parameters and search for additional hyperparameters?

In [79]:
# Load transformed validation sets:
import pickle

# Features:
with open("X_val1_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl1 = pickle.load(f)
with open("X_val2_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl2 = pickle.load(f) 
    
# Target labels:    
with open("y_val1.pkl","rb") as f:
    y_val1= pickle.load(f)
with open("y_val2.pkl","rb") as f:
    y_val2= pickle.load(f)    
    
# We start by defining the score we want to be maximized using Bayesian Optimization
# Return MEAN validated 'roc_auc' score from xgb Classifier
# Note that the next 3 parameters we will optimize are called as generic arguments

seed = 112 # Random seed

def xgbc_cv(min_child_weight,colsample_bytree,gamma):
    from sklearn.metrics import roc_auc_score
    import numpy as np
    
    estimator_function = xgb.XGBClassifier(max_depth=int(5.0525),
                                           colsample_bytree= colsample_bytree,
                                           gamma=gamma,
                                           min_child_weight= int(min_child_weight),
                                           learning_rate= 0.2612,
                                           n_estimators= int(75.5942),
                                           reg_alpha = 0.9925,
                                           nthread = -1,
                                           objective='binary:logistic',
                                           seed = seed)
    # Fit the estimator
    estimator_function.fit(X_train_trans_pl1,y_train)
    
    # calculate out-of-the-box roc_score using validation set 1
    probs = estimator_function.predict_proba(X_val1_trans_pl1)
    probs = probs[:,1]
    val1_roc = roc_auc_score(y_val1,probs)
    
    # calculate out-of-the-box roc_score using validation set 2
    probs = estimator_function.predict_proba(X_val2_trans_pl1)
    probs = probs[:,1]
    val2_roc = roc_auc_score(y_val2,probs)
    
    # return the mean validation score to be maximized 
    return np.array([val1_roc,val2_roc]).mean()


import warnings
warnings.filterwarnings('ignore')

from bayes_opt import BayesianOptimization

# alpha is a parameter for the gaussian process
# Note that this is itself a hyperparemter that can be optimized.
gp_params = {"alpha": 1e-10}

# We create the BayesianOptimization objects using the functions that utilize
# the respective classifiers and return cross-validated scores to be optimized.

seed = 112 # Random seed

# We create the bayes_opt object and pass the function to be maximized
# together with the parameters names and their bounds.

hyperparameter_space = {
    'min_child_weight': (1, 20),
    'colsample_bytree': (0.1, 1),
    'gamma' : (0,10)
}

xgbcBO = BayesianOptimization(f = xgbc_cv, 
                             pbounds =  hyperparameter_space,
                             random_state = seed,
                             verbose = 10)

# Finally we call .maximize method of the optimizer with the appropriate arguments

xgbcBO.maximize(init_points=3,n_iter=10,acq='ucb', kappa= 3, **gp_params)
Initialization
-----------------------------------------------------------------------------------
 Step |   Time |      Value |   colsample_bytree |     gamma |   min_child_weight | 
    1 | 01m36s |    0.94742 |             0.1681 |    0.5481 |             8.1261 | 
    2 | 03m59s |    0.94392 |             0.7992 |    8.1768 |            13.1658 | 
    3 | 04m09s |    0.94233 |             0.8495 |    8.8535 |            19.0503 | 
Bayesian Optimization
-----------------------------------------------------------------------------------
 Step |   Time |      Value |   colsample_bytree |     gamma |   min_child_weight | 
    4 | 11m50s |    0.94626 |             0.9761 |    0.0848 |             1.1526 | 
    5 | 01m25s |    0.94397 |             0.1037 |    9.9315 |             1.3312 | 
    6 | 05m02s |    0.94194 |             0.9722 |    0.0010 |            18.4324 | 
    7 | 01m32s |    0.94784 |             0.1266 |    0.0537 |             1.1680 | 
    8 | 01m42s |    0.94710 |             0.1515 |    0.0153 |             2.7921 | 
    9 | 01m35s |    0.94772 |             0.1360 |    4.2282 |             1.0670 | 
   10 | 01m29s |    0.94253 |             0.1015 |    4.1146 |            16.6411 | 
   11 | 05m09s |    0.94548 |             0.9972 |    1.6464 |            11.7380 | 
   12 | 01m33s |    0.94712 |             0.1027 |    4.9164 |             7.0935 | 
   13 | 01m33s |    0.94886 |             0.1121 |    1.4987 |             1.0007 | 

As you can see, this further helped to improve the model performance. Depending on the computational power you have and time you can sacrifice, you can follow additional steps and combination of hyperparameters and may come up with a different set of parameters that might further improve the performance.

A critical point here to keep in mind; the more you push to optimize the model, the rate of return gradually decreases. In other words, you spent mode time but get back much smaller improvements in the model performance. If you are hoping to have a generalizable model to larger data sets, there is a trade off between over-fitting to validation sets you are using to tune your model and model's out-of-the-box performance.