In this approach, we will use a data set for which we have already completed an initial analysis and exploration of a small train_sample set (100K observations) and developed some initial expectations. It is a binary classification problem: crude web-traffic data from an application download portal is provided, along with an outcome indicating whether a given web user eventually downloaded a product from the website. For our purpose, the binary outcome variable is encoded as "is_attributed" in the data set. Other variables include the IP address of a given hit, the operating system, the device and channel used, the app being examined, the time of each click event, and, if a download occurred, the time at which it took place.
First, we will build a feature engineering pipeline so that we can process data sets consistently for predictive modeling.
We will prepare 3 data sets from the large training set, which contains about 200 million samples in total. To illustrate Bayesian optimization, we will extract 3 data sets, each containing ~1 million observations. We will refer to these sets as the training set (training_set) and two validation sets (validation_set1 and validation_set2).
Build the feature extraction and selection pipeline using the training set:
We will prepare the remainder of the pipeline to incorporate interaction terms and perform scaling and standardization.
import pandas as pd

training_set = pd.read_csv("train.csv",
                           nrows=1000000,
                           dtype=str)
print("Finished training_set")

validation_set1 = pd.read_csv("train.csv",
                              skiprows=1000000,
                              names=list(training_set.columns),
                              nrows=1000000,
                              dtype=str)
print("Finished validation_set1")

validation_set2 = pd.read_csv("train.csv",
                              skiprows=2000000,
                              names=list(training_set.columns),
                              nrows=1000000,
                              dtype=str)
print("Finished validation_set2")
validation_set1.head()
training_set.info()
training_set.head()
We will separate the target labels from the features for each of these data sets and pickle them for future use:
import os
import pandas as pd
import numpy as np
import pickle
X_train = training_set.drop(["is_attributed","attributed_time"], axis = 1)
y_train = pd.to_numeric(training_set.is_attributed)
X_train.to_pickle("X_train.pkl")
y_train.to_pickle("y_train.pkl")
X_val1 = validation_set1.drop(["is_attributed","attributed_time"], axis = 1)
y_val1 = pd.to_numeric(validation_set1.is_attributed)
X_val1.to_pickle("X_val1.pkl")
y_val1.to_pickle("y_val1.pkl")
X_val2 = validation_set2.drop(["is_attributed","attributed_time"], axis = 1)
y_val2 = pd.to_numeric(validation_set2.is_attributed)
X_val2.to_pickle("X_val2.pkl")
y_val2.to_pickle("y_val2.pkl")
Note that this is a bit verbose. If you are not familiar with writing sklearn pipelines, don't worry. Just try to get the overall idea, which is to make the entire processing consistent and to ensure that the same features are extracted based on the observations in the training set, without leakage from the validation sets. I also tried to explain how several custom functions are integrated into the pipeline; just refer to the comments in the code.
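The overall pattern looks like this (a minimal runnable sketch with a toy pipeline, not the actual pipeline we build below):
# Schematic of the leakage-free pattern: fit on training rows only, transform everything
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
toy_pipeline = Pipeline([("scaler", MaxAbsScaler())])
toy_pipeline.fit([[1.0], [4.0]])        # scaling statistics are learned from training rows only
print(toy_pipeline.transform([[2.0]]))  # validation rows reuse those statistics -> [[0.5]]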
import pandas as pd
import pickle
# Read the pickled training set
X_train = pd.read_pickle("X_train.pkl")
y_train = pd.read_pickle("y_train.pkl")
# Label text features
Text_features = ["app","device","os","channel"]
##############################################################
# Define utility function to parse and process text features
##############################################################
# Note we avoid lambda functions since they don't pickle when we want to save the pipeline later
def column_text_processer_nolambda(df, text_columns=Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization.
    - This function should take care of converting missing values to empty strings.
    - It should also convert the text to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """
    import pandas as pd
    import numpy as np
    # Select only the text columns that are in the df (copy to avoid SettingWithCopyWarning)
    text_data = df[text_columns].copy()
    # Fill the missing values in text_data using empty strings
    text_data.fillna("", inplace=True)
    # Concatenate the feature name to each category encoding in each row
    # E.g.: encoding 3 in the device column reads as device3, making each encoding unique for a given feature
    for col_index in list(text_data.columns):
        text_data[col_index] = col_index + text_data[col_index].astype(str)
    # Join all the strings in a given row to make a single text item per row
    # text_vector = text_data.apply(lambda x: " ".join(x), axis=1)
    text_vector = []
    for index, rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)
    # Return text_vector as a pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector)
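# Quick sanity check (hypothetical toy row, just to illustrate the output format):
# toy = pd.DataFrame({"app": ["3"], "device": ["1"], "os": ["13"], "channel": ["497"]})
# column_text_processer_nolambda(toy) -> pd.Series(["app3 device1 os13 channel497"])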
#######################################################################
# Define custom processing functions to add the log_total_clicks and
# log_total_click_time features, and remove the unwanted base features
#######################################################################
def column_time_processer(X_train):
    import pandas as pd
    import numpy as np
    # Convert click_time to datetime64 dtype
    X_train.click_time = pd.to_datetime(X_train.click_time)
    # Calculate the log_total_clicks for each ip and add as a new feature to temp_data
    temp_data = pd.DataFrame(np.log(X_train.groupby(["ip"]).size()),
                             columns=["log_total_clicks"]).reset_index()
    # Calculate the log_total_click_time for each ip and add as a new feature to temp_data
    # First define a function to process a selected ip group
    def get_log_total_click_time(group):
        diff = (max(group.click_time) - min(group.click_time)).seconds
        return np.log(diff + 1)
    # Then apply this function to each ip group and extract the total click time per ip group
    log_time_frame = pd.DataFrame(X_train.groupby(["ip"]).apply(get_log_total_click_time),
                                  columns=["log_total_click_time"]).reset_index()
    # Then add this new feature to the temp_data
    temp_data = pd.merge(temp_data, log_time_frame, how="left", on="ip")
    # Combine temp_data with X_train to maintain X_train key order
    temp_data = pd.merge(X_train, temp_data, how="left", on="ip")
    # Drop features that are not needed
    temp_data = temp_data[["log_total_clicks", "log_total_click_time"]]
    # Return only the numeric features to feed the numeric branch of the pipeline
    return temp_data
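# Quick sanity check (hypothetical example): an ip with 3 clicks whose first and
# last click are 10 seconds apart gets log_total_clicks = ln(3) ~ 1.10 and
# log_total_click_time = ln(10 + 1) ~ 2.40, repeated for each of its rows.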
#############################################################################
# We need to wrap these custom utility functions using FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
# FunctionTransformer wrapper of utility functions to parse text and numeric features
# Note how we avoid putting any arguments into column_text_processer or column_time_processer
#############################################################################
get_numeric_data = FunctionTransformer(func = column_time_processer, validate=False)
get_text_data = FunctionTransformer(func = column_text_processer_nolambda,validate=False)
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
# Note this regex matches runs of alphanumeric characters that are each followed
# by whitespace; in our case the text vector is simply whitespace-separated tokens
#############################################################################
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'
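# Quick sanity check of the pattern (illustrative only):
# import re
# re.findall(TOKENS_ALPHANUMERIC, "app3 device1 os13 channel497")
# -> ['app3', 'device1', 'os13']
# Note the lookahead requires trailing whitespace, so the final token of each row
# is not matched; append a trailing space in the text parser (or drop the lookahead)
# if you want to keep it.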
###############################################
# Construct our feature extraction pipeline
###############################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer # Note: newer sklearn versions replace Imputer with sklearn.impute.SimpleImputer
from sklearn.feature_selection import SelectKBest, chi2 # We will use chi-squared as the scoring function to select features for classification
from SparseInteractions import * # Load SparseInteractions (from: https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py) as a module, since it was saved into the working directory as SparseInteractions.py
userclick_pipeline1 = Pipeline([
    ("union", FeatureUnion(
        # Note that FeatureUnion() accepts a list of tuples; the first half of each tuple
        # is the name of the transformer within the FeatureUnion
        transformer_list=[
            ("numeric_subpipeline", Pipeline([ # Note we have subpipeline branches inside the main pipeline
                ("parser", get_numeric_data), # Step 1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer", Imputer()) # Step 2: impute any missing data using the default (mean); we don't expect missing values in this case
            ])), # End of: numeric_subpipeline
            ("text_subpipeline", Pipeline([
                ("parser", get_text_data), # Step 1: parse the text data
                ("tokenizer", HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC, # Step 2: use HashingVectorizer for automated tokenization and feature extraction
                                                ngram_range=(1, 1),
                                                non_negative=True, # Newer sklearn versions replace this argument with alternate_sign=False
                                                norm=None,
                                                binary=True)), # Note we use binary=True since our hack is to use tokenization to generate dummy variables
                ("dim_red", SelectKBest(chi2, k=300)) # Step 3: select the 300 best features using chi2 as the scoring function
            ]))
        ]
    )), # End of step: union; this is the fusion point to the main pipeline, and all features are numeric at this stage
    # Common steps:
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms up to the second degree
    ("scaler", MaxAbsScaler()) # Scale the features between 0 and 1
]) # End of: userclick_pipeline1
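For intuition about the final common steps, here is a toy illustration (made-up 2x2 matrix, independent of the pipeline above):
# X = [[1.0, 2.0], [0.0, 4.0]] with columns a and b
# SparseInteractions(degree=2) appends the elementwise product a*b -> [[1.0, 2.0, 2.0], [0.0, 4.0, 0.0]]
# MaxAbsScaler then divides each column by its maximum absolute value, giving
# [[1.0, 0.5, 1.0], [0.0, 1.0, 0.0]], so all non-negative features land in [0, 1].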
Fit the userclick_pipeline1 using the training set:
import datetime
start = datetime.datetime.now()
userclick_pipeline1.fit(X_train,y_train)
end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")
Having trained our pipeline on the training set, we will pickle and store it for reuse. This ensures consistency: every time we process a data set, we will extract the same set of features.
# Pickle and store the userclick_pipeline1
import pickle
with open("userclick_pipeline1.pkl","wb") as f:
    pickle.dump(userclick_pipeline1,f)
# Re-load the userclick_pipeline1 to work with
import pickle
with open("userclick_pipeline1.pkl","rb") as f:
    userclick_pipeline1 = pickle.load(f)
import datetime
start = datetime.datetime.now()
# Transform the training set features
X_train_trans_pl1 = userclick_pipeline1.transform(X_train)
end = datetime.datetime.now()
process_time = end - start
print("It took: " + str(process_time.seconds/60) + " minutes.")
X_train_trans_pl1.shape
type(X_train_trans_pl1)
We will pickle and save this transformed version of the features from the training set. Fitting the pipeline took about 3 minutes, and transforming the features took another 3 minutes.
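For completeness, the save step looks like this (writing to the file name we load again below):
# Pickle the transformed (sparse) training features for reuse
import pickle
with open("X_train_trans_pl1.pkl","wb") as f:
    pickle.dump(X_train_trans_pl1,f)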
From this point on, we will only use this pipeline's .transform method to process any data set we would like to use in our models.
Let's get started by training an XGBoost classifier on the training data to obtain a benchmark performance with an untuned model.
# Read the transformed features and target labels from the training set
import pickle
with open("X_train_trans_pl1.pkl","rb") as f:
X_train_trans_pl1 = pickle.load(f)
with open("y_train.pkl","rb") as f:
y_train = pickle.load(f)
# Initially trying an untuned classifier to test the performance
import datetime
import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb
start = datetime.datetime.now()
# Instantiate the classifier
xgbc = xgb.XGBClassifier(nthread=3)
xgbc.fit(X_train_trans_pl1,y_train)
end = datetime.datetime.now()
process_time = end - start
print("Training XGB classifier took: " + str(process_time.seconds/60) + " minutes.")
# Looking at the performance of the untuned xgb classifier
from sklearn.metrics import roc_auc_score
probs = xgbc.predict_proba(X_train_trans_pl1)
probs = probs[:,1]
print("Untuned XGB classifier roc score : " + str(roc_auc_score(y_train,probs)))
We now have some idea about the performance of XGBoost on the training set; what we actually want to know is the "out-of-the-box" performance of the classifier on validation sets the model has not seen before.
It is time to prepare the validation sets to monitor the out-of-the-box performance of our classifier. Remember that we need to transform the validation sets using the feature extraction pipeline we established.
# Re-load the userclick_pipeline1 to work with
import pickle
with open("userclick_pipeline1.pkl","rb") as f:
userclick_pipeline1 = pickle.load(f)
# Re-load the validation set features to work with
with open("X_val1.pkl","rb") as f:
X_val1 = pickle.load(f)
with open("X_val2.pkl","rb") as f:
X_val2 = pickle.load(f)
import datetime
start = datetime.datetime.now()
# Transform the validation set 1 features
X_val1_trans_pl1 = userclick_pipeline1.transform(X_val1)
# Save
import pickle
with open("X_val1_trans_pl1.pkl","wb") as f:
pickle.dump(X_val1_trans_pl1,f)
end = datetime.datetime.now()
process_time = end - start
print("Finished processing val1.")
print("It took: " + str(process_time.seconds/60) + " minutes.")
print("_" * 60)
# Transform the validation set 2 features
X_val2_trans_pl1 = userclick_pipeline1.transform(X_val2)
# Save
import pickle
with open("X_val2_trans_pl1.pkl","wb") as f:
    pickle.dump(X_val2_trans_pl1,f)
end1 = datetime.datetime.now()
process_time = end1 - end
print("Finished processing val2.")
print("It took: " + str(process_time.seconds/60) + " minutes.")
print("_" * 60)
# Load transformed validation sets:
import pickle
# Features:
with open("X_val1_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl1 = pickle.load(f)
with open("X_val2_trans_pl1.pkl","rb") as f:
    X_val2_trans_pl1 = pickle.load(f)
# Target labels:
with open("y_val1.pkl","rb") as f:
    y_val1 = pickle.load(f)
with open("y_val2.pkl","rb") as f:
    y_val2 = pickle.load(f)
# calculate out-of-the-box roc_score using validation set 1
probs = xgbc.predict_proba(X_val1_trans_pl1)
probs = probs[:,1]
val1_roc = roc_auc_score(y_val1,probs)
# calculate out-of-the-box roc_score using validation set 2
probs = xgbc.predict_proba(X_val2_trans_pl1)
probs = probs[:,1]
val2_roc = roc_auc_score(y_val2,probs)
print(val1_roc)
print(val2_roc)
print(np.array([val1_roc,val2_roc]).mean())
This is a good start for an untuned model!
Let's now perform Bayesian optimization to tune the hyperparameters of the XGBoost classifier and try to improve the performance. Instead of using cross-validation, we will use the average of the ROC AUC scores obtained from predictions on the two validation sets we prepared, since each fit step took about 3.68 minutes using 3 of the 4 available CPU cores.
Let's recall the xgboost hyperparameters that are available for tuning:
help(xgb.XGBClassifier)
The beauty of the Bayesian optimization process is the flexibility of defining the estimator function you wish to optimize. We can literally define any function here. Once we define this function and pass ranges for a chosen set of hyperparameters, Bayesian optimization will endeavor to maximize the output of this function.
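For instance, here is a minimal sketch that maximizes an arbitrary toy function with the same bayes_opt API we use below (the function and bounds are made up purely for illustration):
from bayes_opt import BayesianOptimization
# A toy black-box function to maximize; its optimum is 1, attained at x=0, y=1
def black_box(x, y):
    return -x**2 - (y - 1)**2 + 1
toyBO = BayesianOptimization(f=black_box,
                             pbounds={"x": (-2, 2), "y": (-3, 3)},
                             random_state=1)
toyBO.maximize(init_points=2, n_iter=5)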
Again, there are critical points to understand during implementation; refer to my comments in the code:
# Load transformed validation sets:
import pickle
# Features:
with open("X_val1_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl1 = pickle.load(f)
with open("X_val2_trans_pl1.pkl","rb") as f:
    X_val2_trans_pl1 = pickle.load(f)
# Target labels:
with open("y_val1.pkl","rb") as f:
    y_val1 = pickle.load(f)
with open("y_val2.pkl","rb") as f:
    y_val2 = pickle.load(f)
# We start by defining the score we want to be maximized using Bayesian Optimization
# Return MEAN validated 'roc_auc' score from xgb Classifier
# Note that the 4 parameters we will optimize are called as generic arguments
seed = 112 # Random seed
def xgbc_cv(max_depth, learning_rate, n_estimators, reg_alpha):
    from sklearn.metrics import roc_auc_score
    import numpy as np
    estimator_function = xgb.XGBClassifier(max_depth=int(max_depth),
                                           learning_rate=learning_rate,
                                           n_estimators=int(n_estimators),
                                           reg_alpha=reg_alpha,
                                           nthread=-1,
                                           objective='binary:logistic',
                                           seed=seed)
    # Fit the estimator
    estimator_function.fit(X_train_trans_pl1, y_train)
    # Calculate the out-of-the-box roc score using validation set 1
    probs = estimator_function.predict_proba(X_val1_trans_pl1)[:,1]
    val1_roc = roc_auc_score(y_val1, probs)
    # Calculate the out-of-the-box roc score using validation set 2
    probs = estimator_function.predict_proba(X_val2_trans_pl1)[:,1]
    val2_roc = roc_auc_score(y_val2, probs)
    # Return the mean validation score to be maximized
    return np.array([val1_roc, val2_roc]).mean()
Once we have defined the estimator function we want to optimize, we can start the actual optimization process:
import warnings
warnings.filterwarnings('ignore')
from bayes_opt import BayesianOptimization
# alpha is a parameter for the gaussian process
# Note that this is itself a hyperparameter that can be optimized.
gp_params = {"alpha": 1e-10}
# We create the BayesianOptimization object using the function that wraps
# our classifier and returns the mean validation score to be optimized.
seed = 112 # Random seed
# We create the bayes_opt object and pass the function to be maximized
# together with the parameters names and their bounds.
# Note the syntax of bayes_opt package: bounds of hyperparameters are passed as two-tuples
hyperparameter_space = {
'max_depth': (5, 20),
'learning_rate': (0, 1),
'n_estimators' : (10,100),
'reg_alpha': (0,1)
}
xgbcBO = BayesianOptimization(f = xgbc_cv,
pbounds = hyperparameter_space,
random_state = seed,
verbose = 10)
# Finally we call the .maximize method of the optimizer with the appropriate arguments
# kappa controls the 'aggressiveness' (exploration vs. exploitation) of the Bayesian optimization process
# The algorithm will randomly choose 3 points to establish a 'prior', then will perform
# 10 iterations to maximize the value of the estimator function
xgbcBO.maximize(init_points=3,n_iter=10,acq='ucb', kappa= 3, **gp_params)
Note that each step requires a bit of patience, since the data set is fairly large for this CPU power and the complexity of our classifier differs at each step.
The output demonstrates how the optimization progressed. Colored steps mark an improvement. The Value column is the output of the estimator function we defined: the mean ROC AUC score over the two independent validation sets.
As you can see, it looks like we already improved the model by finding optimal values for these 4 hyperparameters. Can we fix these parameters and search over additional hyperparameters?
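The hard-coded values in the next cell come from the best step of the previous run; depending on your bayes_opt version, you can read them off the optimizer directly:
# Inspect the best result found so far (attribute depends on the bayes_opt version):
# print(xgbcBO.res['max'])   # older releases: {'max_val': ..., 'max_params': {...}}
# print(xgbcBO.max)          # newer releases expose the same via the .max property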
# Load transformed validation sets:
import pickle
# Features:
with open("X_val1_trans_pl1.pkl","rb") as f:
    X_val1_trans_pl1 = pickle.load(f)
with open("X_val2_trans_pl1.pkl","rb") as f:
    X_val2_trans_pl1 = pickle.load(f)
# Target labels:
with open("y_val1.pkl","rb") as f:
    y_val1 = pickle.load(f)
with open("y_val2.pkl","rb") as f:
    y_val2 = pickle.load(f)
# We start by defining the score we want to be maximized using Bayesian Optimization
# Return MEAN validated 'roc_auc' score from xgb Classifier
# Note that the next 3 parameters we will optimize are called as generic arguments
seed = 112 # Random seed
def xgbc_cv(min_child_weight, colsample_bytree, gamma):
    from sklearn.metrics import roc_auc_score
    import numpy as np
    # The fixed values below are the optimal hyperparameters found in the previous round
    estimator_function = xgb.XGBClassifier(max_depth=int(5.0525),
                                           colsample_bytree=colsample_bytree,
                                           gamma=gamma,
                                           min_child_weight=int(min_child_weight),
                                           learning_rate=0.2612,
                                           n_estimators=int(75.5942),
                                           reg_alpha=0.9925,
                                           nthread=-1,
                                           objective='binary:logistic',
                                           seed=seed)
    # Fit the estimator
    estimator_function.fit(X_train_trans_pl1, y_train)
    # Calculate the out-of-the-box roc score using validation set 1
    probs = estimator_function.predict_proba(X_val1_trans_pl1)[:,1]
    val1_roc = roc_auc_score(y_val1, probs)
    # Calculate the out-of-the-box roc score using validation set 2
    probs = estimator_function.predict_proba(X_val2_trans_pl1)[:,1]
    val2_roc = roc_auc_score(y_val2, probs)
    # Return the mean validation score to be maximized
    return np.array([val1_roc, val2_roc]).mean()
import warnings
warnings.filterwarnings('ignore')
from bayes_opt import BayesianOptimization
# alpha is a parameter for the gaussian process
# Note that this is itself a hyperparameter that can be optimized.
gp_params = {"alpha": 1e-10}
# We create the BayesianOptimization object using the function that wraps
# our classifier and returns the mean validation score to be optimized.
seed = 112 # Random seed
# We create the bayes_opt object and pass the function to be maximized
# together with the parameters names and their bounds.
hyperparameter_space = {
'min_child_weight': (1, 20),
'colsample_bytree': (0.1, 1),
'gamma' : (0,10)
}
xgbcBO = BayesianOptimization(f = xgbc_cv,
pbounds = hyperparameter_space,
random_state = seed,
verbose = 10)
# Finally we call the .maximize method of the optimizer with the appropriate arguments
xgbcBO.maximize(init_points=3,n_iter=10,acq='ucb', kappa= 3, **gp_params)
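At this point you would typically refit a single final classifier combining the winning values from both rounds before scoring new data (a sketch; plug in the values your own runs report):
# final_xgbc = xgb.XGBClassifier(max_depth=5, learning_rate=0.2612, n_estimators=75,
#                                reg_alpha=0.9925, min_child_weight=..., colsample_bytree=...,
#                                gamma=..., nthread=-1, objective='binary:logistic', seed=seed)
# final_xgbc.fit(X_train_trans_pl1, y_train)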
As you can see, this further helped to improve the model performance. Depending on the computational power you have and the time you can spare, you can run additional rounds with different combinations of hyperparameters and may come up with a set of parameters that further improves the performance.
A critical point to keep in mind: the more you push to optimize the model, the more the rate of return diminishes. In other words, you spend more time but get back much smaller improvements in model performance. If you are hoping for a model that generalizes to larger data sets, there is also a trade-off between over-fitting to the validation sets you use for tuning and the model's out-of-the-box performance.