%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
Let's start with loading the training data:
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
import numpy as np
train = pd.read_csv("train.csv")
train.head()
train.info()
First, let's remove the train_id from the training set and identify the numeric and character variables to begin with:
train = train.drop("train_id",axis = 1)
train.head()
train.info()
Numeric_features_select = np.logical_or((pd.Series(train.dtypes)) == "int64",(pd.Series(train.dtypes) == "float64")).tolist()
Numeric_features = pd.Series(train.columns)[Numeric_features_select].tolist()
Numeric_features.remove('price')
Numeric_features
Text_features = pd.Series(train.columns)[np.logical_not(Numeric_features_select)].tolist()
Text_features
Since the model evaluation will be based on the log of the target variable price, we convert it at this stage:
train.price = pd.Series(np.log(train.price + 1))
train.head()
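As an aside, numpy has dedicated helpers for this transform and its inverse, which come in handy when converting predictions back to prices; a small self-contained check (the sample prices below are made up for illustration):
import numpy as np

prices = np.array([10.0, 35.0, 120.0])   # hypothetical raw prices
log_prices = np.log1p(prices)            # identical to the np.log(prices + 1) used above
recovered = np.expm1(log_prices)         # inverse transform, back to the original scale
print(np.allclose(recovered, prices))    # True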
Now it is time to split the training data set into training and holdout sets in order to start building our pipeline:
from sklearn.model_selection import train_test_split
X = train.drop("price", axis = 1)
y = train.price
X_train,X_holdout,y_train,y_holdout = train_test_split(X,y,test_size = 0.4, random_state = 425)
type(X_train)
type(y_holdout)
X_train.shape
X_holdout.shape
y_train.shape
y_holdout.shape
At this stage, let's lock down and save the datasets on which we will train and validate our model:
X_train.to_csv("X_train.csv",index=False)
X_holdout.to_csv("X_holdout.csv", index = False)
y_train.to_csv("y_train.csv", index= False)
y_holdout.to_csv("y_holdout.csv", index= False)
# Write the pickled feature names
import pickle
with open("Numeric_features.pkl", 'wb') as f:
pickle.dump(Numeric_features,f)
f.close()
with open("Text_features.pkl", 'wb') as f:
pickle.dump(Text_features,f)
f.close()
At this stage we will make a high-level design for our data processing and model training pipeline (a minimal structural sketch follows this list). At minimum, our pipeline will need to include the following steps:
We need to process text and numeric features separately, then combine them using FeatureUnion()
The text subpipeline should include: parsing the text columns, tokenization/vectorization, and feature selection
The numeric subpipeline should include: parsing the numeric columns and imputing missing values
After merging the numeric and text features we will add the following common steps: scaling and, later, interaction terms
Once our pipeline is ready, our goals will be: tuning its hyperparameters and validating performance on the holdout set
We will then repeat these steps using the same pipeline, but changing the model to train other model families (such as random forests).
After these steps, we will explore ensembling these models to see if we can get a better model.
At the end of this exercise, we expect to become more competent in building end-to-end sklearn pipelines for mixed text and numeric data, and in tuning them efficiently.
Finally, we will try to use the same sets to train deep neural networks to see how they compare to the shallow learning approaches.
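To make the plan concrete, here is a minimal structural sketch of that design; the placeholder transformers and the Ridge regressor are stand-ins for the real components defined in the sections below:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.linear_model import Ridge

# Identity FunctionTransformers stand in for the real column parsers defined later.
pick_numeric = FunctionTransformer(validate=False)   # placeholder for get_numeric_data
pick_text = FunctionTransformer(validate=False)      # placeholder for get_text_data

skeleton = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("numeric_subpipeline", Pipeline([("parser", pick_numeric)])),  # + imputation
        ("text_subpipeline", Pipeline([("parser", pick_text)])),        # + vectorizer + SelectKBest
    ])),
    ("scaler", MaxAbsScaler()),   # common step after the features are merged
    ("model", Ridge()),           # estimator slot (Ridge first; other models later)
])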
import os
import pandas as pd
import numpy as np
import pickle
# Re-read the training data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
def column_text_processer(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
# Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
return text_vector.apply(lambda x: x.lower())
We will start by loading the necessary functions from sklearn submodules:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
First we build two utility functions to parse numeric and text data, and wrap them using FunctionTransformer, so that they can be integrated into a sklearn pipeline:
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full input DataFrame flowing through the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass the function itself into FunctionTransformer without calling it
We also need to create a regex token pattern for the vectorizer. The vectorizer will use this pattern to create the tokens and n-grams we specify, automatically convert them into dummy features, and store them as a sparse matrix. Note that we use HashingVectorizer rather than CountVectorizer to improve computational efficiency.
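To see what this token pattern actually captures, here is a quick self-contained check on a made-up product description:
import re

sample = "brand new iphone 7 case "   # trailing space added so the final token also matches
pattern = r'[A-Za-z0-9]+(?=\s+)'
print(re.findall(pattern, sample))
# ['brand', 'new', 'iphone', '7', 'case']
# Without trailing whitespace the last token would be dropped,
# because the lookahead requires whitespace after each alphanumeric run.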
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
We also need to redefine the default f_regression feature-selection function (switching to center=False) so that it can be dropped into our pipeline:
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
We can now start building the actual pipeline:
pl1 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
#("int", SparseInteractions(degree=2)), # Add polynomial interaction terms :POSTPONED
("scaler",MaxAbsScaler()), # Scale the features
])
# We fit_transform X outside of the pipeline to obtain transformed X for hyperparameter search,
# since the transformation step takes a long time and we want to avoid repeating it every time
X_train_transformed = pl1.fit_transform(X_train,y_train)
# We start with ridge regression
model1 = Ridge(alpha=0.5)
model1.fit(X_train_transformed, y_train)
y_pred1 = model1.predict(X_train_transformed)
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_train,y_pred1))
import matplotlib.pyplot as plt
plt.scatter(y_pred1,y_train, s = 2, c = "r", alpha = 0.4)
plt.show()
This simple pipeline already looks very promising!
First, we compile the data loading, utility functions and pipeline steps for easy starting up later:
import os
import pandas as pd
import numpy as np
import pickle
#############################################################################
# Re-read the training data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
def column_text_processer(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
# Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
return text_vector.apply(lambda x: x.lower())
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full input DataFrame flowing through the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass the function itself into FunctionTransformer without calling it
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
#############################################################################
# Redefine f_regression for feature selection with center = False (sklearn's default is center = True)
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
# Prepare the actual pipeline:
pl1 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
#("int", SparseInteractions(degree=2)), # Add polynomial interaction terms :POSTPONED
("scaler",MaxAbsScaler()), # Scale the features
])
# We fit_transform X outside of the pipeline to obtain transformed X for hyperparameter search,
# since the transformation step takes a long time and we want to avoid repeating it every time
X_train_transformed = pl1.fit_transform(X_train,y_train)
We can now start hyperparameter optimization. Let's first start by looking into the tunable hyperparameters in ridge model:
# looking into parameters of our first estimator:
model1.get_params()
Let's start by tuning alpha, the most important hyperparameter, which controls the strength of regularization; a sensible value lies between 0 and 1. We will use GridSearchCV, which performs an exhaustive search over the specified parameter values with 5-fold cross-validation. Since we only search over one hyperparameter, this is a one-dimensional grid search:
# We prepare 20 grids of alpha in a linear space
alpha_space = np.linspace(1,0,20)
alpha_space
param_grid = {"alpha":alpha_space}
A list of valid scoring strings is available in the scikit-learn documentation: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
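If you prefer to inspect the options programmatically, one way (assuming the sklearn version used in this notebook, which still exposes the SCORERS registry) is:
from sklearn.metrics import SCORERS

# The keys of this registry are the strings GridSearchCV accepts for `scoring`,
# including the 'neg_mean_squared_error' used below.
sorted(SCORERS.keys())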
# Instantiate the GridSearchCV object:
from sklearn.model_selection import GridSearchCV
ridgemodel = Ridge()
ridge_cv = GridSearchCV(estimator= ridgemodel,
param_grid= param_grid,
scoring='neg_mean_squared_error',
cv = 5,
n_jobs=-1)
# Fit the GridSearchCV object to training data to start parameter search:
ridge_cv.fit(X_train_transformed,y_train)
# Print the tuned parameters and score
print("Tuned Ridge Regression Parameters: {}".format(ridge_cv.best_params_))
print("Best score is {}".format(ridge_cv.best_score_))
Note that the score is the negative mean squared error. We found that the optimal alpha is close to 0.68, higher than the alpha we used in model1. Let's make predictions on the training set to compare this tuned model with the earlier one:
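Since GridSearchCV reports the negative MSE, flipping the sign and taking the square root expresses the best cross-validated score as an RMSE, directly comparable to the training RMSE values computed below:
# Best cross-validated RMSE from the grid search
np.sqrt(-ridge_cv.best_score_)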
y_pred2 = ridge_cv.predict(X_train_transformed)
np.sqrt(mean_squared_error(y_train,y_pred2))
import matplotlib.pyplot as plt
plt.scatter(y_pred2,y_train, s = 0.2, c = "r", alpha = 0.8)
plt.show()
plt.scatter(y_pred2,y_pred1, s = 0.5, c = "b", alpha = 0.8)
plt.show()
Interestingly, even after hyperparameter tuning, the predictive performance of the Ridge model remained at the same level; the two models make essentially the same predictions.
Next, we will add interaction terms to the pipeline to see if we can improve model performance.
This is a custom transformer class that is compatible with sparse matrices:
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
from itertools import combinations
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
class SparseInteractions(BaseEstimator, TransformerMixin):
def __init__(self, degree=2, feature_name_separator="_"):
self.degree = degree
self.feature_name_separator = feature_name_separator
def fit(self, X, y=None):
return self
def transform(self, X):
if not sparse.isspmatrix_csc(X):
X = sparse.csc_matrix(X)
if hasattr(X, "columns"):
self.orig_col_names = X.columns
else:
self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])
spi = self._create_sparse_interactions(X)
return spi
def get_feature_names(self):
return self.feature_names
def _create_sparse_interactions(self, X):
out_mat = []
self.feature_names = self.orig_col_names.tolist()
for sub_degree in range(2, self.degree + 1):
for col_ixs in combinations(range(X.shape[1]), sub_degree):
# add name for new column
name = self.feature_name_separator.join(self.orig_col_names[list(col_ixs)])
self.feature_names.append(name)
# get column multiplications value
out = X[:, col_ixs[0]]
for j in col_ixs[1:]:
out = out.multiply(X[:, j])
out_mat.append(out)
return sparse.hstack([X] + out_mat)
We save this class as SparseInteractions.py in our working directory so we can import it easily later.
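One convenient way to create that module from inside the notebook is the %%writefile cell magic; a hypothetical cell would look like this, with the class body from above pasted where the comment indicates:
%%writefile SparseInteractions.py
from itertools import combinations

import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

# ... paste the SparseInteractions class definition from the cell above here ...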
We will now modify our pipeline to incorporate the step that adds interaction terms. We will also load and process the holdout data for final validation.
import os
import pandas as pd
import numpy as np
import pickle
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
def column_text_processer(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
# Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
return text_vector.apply(lambda x: x.lower())
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full input DataFrame flowing through the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass the function itself into FunctionTransformer without calling it
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
#############################################################################
# Redefine f_regression for feature selection with center = False (sklearn's default is center = True)
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
# Prepare the modified pipeline (pl2):
pl2 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
('dim_red2', SelectKBest(f_regression, 400)) # Add another dimension reduction step at the end
])
Next, we will use the new pipeline (pl2) to transform X_train and X_holdout:
# We fit_transform X outside of the pipeline to obtain transformed X for hyperparameter search,
# since the transformation step takes a long time and we want to avoid repeating it every time
X_train_transformed = pl2.fit_transform(X_train,y_train)
X_holdout_transformed = pl2.fit_transform(X_holdout, y_holdout)
print(X_train_transformed.shape)
print(X_holdout_transformed.shape)
Note that, as designed, the pipeline produces data sets with 400 features. Next, we fit a new Ridge model using the training set transformed by the modified pipeline (pl2):
from sklearn.metrics import mean_squared_error
model2 = Ridge(alpha=0.5)
model2.fit(X_train_transformed,y_train)
y_pred2 = model2.predict(X_train_transformed)
np.sqrt(mean_squared_error(y_train,y_pred2))
To calculate the RMSE in the holdout set:
y_pred2 = model2.predict(X_holdout_transformed)
np.sqrt(mean_squared_error(y_holdout,y_pred2))
It looks like the addition of interaction terms did not improve model performance on this problem. Finally, let's try hyperparameter tuning in this scenario:
from sklearn.model_selection import GridSearchCV
alphas = np.linspace(1,0,10)
param_grid = {"alpha":alphas}
# Instantiate the GridSearchCV object:
ridgemodel = Ridge()
ridge_cv = GridSearchCV(estimator= ridgemodel,
param_grid= param_grid,
scoring='neg_mean_squared_error',
cv = 5,
n_jobs=-1)
# Fit the GridSearchCV object to training data to start parameter search:
ridge_cv.fit(X_train_transformed,y_train)
print(ridge_cv.best_params_)
print(ridge_cv.best_score_)
from sklearn.metrics import mean_squared_error
y_pred2 = ridge_cv.predict(X_train_transformed)
np.sqrt(mean_squared_error(y_train,y_pred2))
y_pred2 = ridge_cv.predict(X_holdout_transformed)
np.sqrt(mean_squared_error(y_holdout,y_pred2))
Hyperparameter tuning had only a marginal impact on model performance. As a final step, let's modify pipeline 2 by removing the final feature selection step and see whether that changes model performance:
# Prepare the modified pipeline (pl3):
pl3 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
])
X_train_transformed = pl3.fit_transform(X_train, y_train)
X_holdout_transformed = pl3.fit_transform(X_holdout, y_holdout)
print(X_train_transformed.shape)
print(X_holdout_transformed.shape)
Note that by creating interaction terms we increased the dimensionality of the data enormously. We will now rely on Ridge regularization to see if we can improve on the previous model's performance:
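To see why the dimensionality explodes, recall that SparseInteractions with degree=2 keeps the d original columns and adds all d-choose-2 pairwise products. A quick count; the value of d below is an assumption for illustration, roughly 300 selected text features plus a couple of numeric ones:
from scipy.special import comb

def interaction_feature_count(d, degree=2):
    # Original columns plus all products of 2..degree distinct columns
    return int(d + sum(comb(d, k) for k in range(2, degree + 1)))

interaction_feature_count(302)   # 302 + 302*301/2 = 45753 columns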
from sklearn.metrics import mean_squared_error
model3 = Ridge(alpha = 0.5)
model3.fit(X_train_transformed,y_train)
ypred3 = model3.predict(X_train_transformed)
np.sqrt(mean_squared_error(y_train,ypred3))
import matplotlib.pyplot as plt
plt.scatter(ypred3,y_train, s = 0.3, c = 'r', alpha = 0.7)
plt.show()
# To see the performance in the holdout set
ypred3_holdout = model3.predict(X_holdout_transformed)
np.sqrt(mean_squared_error(y_holdout,ypred3_holdout))
Model performance is remarkably better. So once we add interaction terms, we can keep all of the resulting features, as long as we apply regularization afterwards.
However, one issue we notice here: we called fit_transform on the holdout set outside of the pipeline, which can leak information because the transformers (imputer, feature selector, scaler) are refit on the holdout data. Instead, we should fit (train) the entire pipeline using X_train and y_train, and then only call predict on X_holdout.
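Conceptually, the leakage-free pattern looks like the sketch below; we do not re-run it here because the fits are expensive, and the cells that follow achieve the same thing more cleanly by folding the Ridge step into the pipeline itself:
# Fit the transformers and the model on the training split only...
pl3.fit(X_train, y_train)
model3.fit(pl3.transform(X_train), y_train)
# ...then the holdout only flows through transform + predict, and is never used for fitting:
holdout_rmse = np.sqrt(mean_squared_error(y_holdout, model3.predict(pl3.transform(X_holdout))))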
Let's incorporate the model step into our best performing pipeline, pl3:
import os
import pandas as pd
import numpy as np
import pickle
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
def column_text_processer(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
# Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
return text_vector.apply(lambda x: x.lower())
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full input DataFrame flowing through the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass the function itself into FunctionTransformer without calling it
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
#############################################################################
# Redefine f_regression for feature selection with center = False (sklearn's default is center = True)
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
pl3 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
])
# Train our pipeline using training set
pl3.fit(X_train,y_train)
# Make predictions first using the training set
y_pred3 = pl3.predict(X_train)
np.sqrt(mean_squared_error(y_train,y_pred3))
As expected, we were able to reproduce the training RMSE from the pipeline object's predictions.
# Make predictions using the holdout set
y_pred3 = pl3.predict(X_holdout)
np.sqrt(mean_squared_error(y_holdout,y_pred3))
Very nice! We have now learned that it is essential to include the modeling step inside the pipeline and to fit the entire pipeline only on the training data. The performance of our pipeline on the holdout set is actually quite good, better than any model we trained before, and this is data the pipeline has never seen.
plt.scatter(y_pred3,y_holdout, s = 0.3, c = "r", alpha = 0.4)
plt.show()
Next, let's try to perform GridSearchCV over the pipeline 3 we established. Along the way we will learn how to address nested pipeline parameters and how expensive an exhaustive search becomes.
Let's get started by loading our data sets, utility functions and the latest pipeline using which we would like to perform hyperparameter tuning:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
def column_text_processer(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
# Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
return text_vector.apply(lambda x: x.lower())
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full input DataFrame flowing through the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass the function itself into FunctionTransformer without calling it
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
#############################################################################
# Redefine f_regression for feature selection with center = False (sklearn's default is center = True)
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
pl3 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
])
Let's look into the potential parameters we can tune in our pipeline, using .get_params() method of the pipeline object:
pl3.get_params()
# We notice that the following parameters would be intuitive to tune through GridSearchCV:
'reg__alpha': 0.5
'union__text_subpipeline__dim_red1__k': 300,
'union__text_subpipeline__tokenizer__ngram_range': (1, 3)
'int__degree': 2
# Let's set up a hyperparameter space to use GridSearchCV:
param_grid = {
'reg__alpha': np.linspace(1,0,10),
'union__text_subpipeline__dim_red1__k': [200,300,400],
'union__text_subpipeline__tokenizer__ngram_range': [(1,3),(1,4)], # We will tokenize for up to 4-grams
'int__degree': [2,3] # We will add interactions up to the third polynomial degree
}
Now we will set up our GridSearchCV estimator using the hyperparameter space we just defined.
As a slight modification, we will use sklearn's pipeline memory caching to save some computation time.
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.externals.joblib import Memory
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# We first create a temporary folder to store the transformers of our pipeline
temp_folder = mkdtemp()
memory = Memory(cachedir= temp_folder, verbose= 10) # Create our memory
# Next we need to redefine our pipeline3 with a memory argument to pass the memory we created
memorized_pipeline3 = Pipeline(steps=[
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
], memory = memory)
# Now we use the cached (memoized) pipeline for GridSearchCV,
# using 3-fold cross-validation
# using the parameter grid space we defined above
pl3grid = GridSearchCV(memorized_pipeline3, cv = 3, n_jobs= 1, param_grid= param_grid,scoring='neg_mean_squared_error')
# Finally, we start training the pipeline with GridSearchCV, using the .fit() method and training set:
pl3grid.fit(X_train,y_train)
# We need to delete the temporary folder when we are done, so the cached data on disk is released back to the system
rmtree(temp_folder)
Based on this discussion (https://github.com/scikit-learn/scikit-learn/issues/1645), it appears that the serialization we attempt here uses pickle, which cannot handle lambda functions. The lambdas appear in the custom utility functions we wrote for our pipeline, so we need to rewrite those utility functions as named functions to make them compatible with the serialization process:
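A tiny, self-contained demonstration of the underlying problem (the two throwaway functions below are defined only for illustration):
import pickle

square_lambda = lambda x: x ** 2          # anonymous function

def square_named(x):                      # equivalent module-level named function
    return x ** 2

try:
    pickle.dumps(square_lambda)
except Exception as e:                    # pickling a lambda fails
    print("lambda:", type(e).__name__, "-", e)

print("named function pickles to", len(pickle.dumps(square_named)), "bytes")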
def column_text_processer_nolambda(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
# text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
text_vector = []
for index,rows in text_data.iterrows():
text_item = " ".join(rows).lower()
text_vector.append(text_item)
# return text_vector as pd.Series object to enter the tokenization pipeline
return pd.Series(text_vector)
def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
return df[numeric_columns]
#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False)
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
Now, let's redefine our entire data reading and pre-processing pipeline with these custom functions:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data
def column_text_processer_nolambda(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
# text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
text_vector = []
for index,rows in text_data.iterrows():
text_item = " ".join(rows).lower()
text_vector.append(text_item)
# return text_vector as pd.Series object to enter the tokenization pipeline
return pd.Series(text_vector)
def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
return df[numeric_columns]
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False)
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
#############################################################################
# Redefine f_regression for feature selection with center = False (sklearn's default is center = True)
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
Finally, we redefine our memoized pipeline together with the hyperparameter search space:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.externals.joblib import Memory
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# We first create a temporary folder to store the transformers of our pipeline
temp_folder = mkdtemp()
memory = Memory(cachedir= temp_folder, verbose= 10) # Create our memory
# Next we need to redefine our pipeline3 with a memory argument to pass the memory we created
memorized_pipeline3 = Pipeline(steps=[
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
], memory = memory)
# Our hyperparameter grid
param_grid = {
'reg__alpha': np.linspace(1,0,10),
'union__text_subpipeline__dim_red1__k': [200,300,400],
'union__text_subpipeline__tokenizer__ngram_range': [(1,3),(1,4)], # We will tokenize for up to 4-grams
'int__degree': [2,3] # We will add interactions up to the third polynomial degree
}
# Now we use the cached (memoized) pipeline for GridSearchCV,
# using 3-fold cross-validation
# using the parameter grid space we defined above
pl3grid = GridSearchCV(memorized_pipeline3,
cv = 3,
n_jobs= 1,
param_grid= param_grid,
scoring='neg_mean_squared_error')
# Finally, we start training the pipeline with GridSearchCV, using the .fit() method and training set:
pl3grid.fit(X_train,y_train)
# We need to delete the temporary folder when we are done, so the cached data on disk is released back to the system
rmtree(temp_folder)
from multiprocessing import cpu_count
cpu_count()
Our take-home messages:
Therefore, our next solution could be the Dask package or spark-sklearn for parallelization, although they appear to come with their own problems.
Until we have a good understanding of those problems, we should avoid ambitious hyperparameter optimization with GridSearchCV and instead tune a few parameters at a time.
So yes, an exhaustive GridSearchCV is not feasible given the size of the training data and our limited computational power. However, we should still try RandomizedSearchCV: with 10 or 20 iterations it randomly picks parameter combinations from our hyperparameter space, and we can see whether one of them yields a better-performing model. To some extent this is a matter of luck and patience, but nothing stops us from trying, given the possibility of finding a slightly better model.
First we get our Ridge pipeline and hyperparameter space as usual:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data
def column_text_processer_nolambda(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
# text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
text_vector = []
for index,rows in text_data.iterrows():
text_item = " ".join(rows).lower()
text_vector.append(text_item)
# return text_vector as pd.Series object to enter the tokenization pipeline
return pd.Series(text_vector)
def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
return df[numeric_columns]
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False)
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Matches runs of alphanumeric characters that are followed by whitespace; punctuation never becomes part of a token
#############################################################################
# Redefine f_regression for feature selection with center = False (sklearn's default is center = True)
def f_regression(X,Y):
import sklearn
return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
# Next we define our pipeline:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
pl3 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)), # Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
])
# We define our hyperparameter grid, from which one combination is randomly drawn per RandomizedSearchCV iteration
param_grid = {
'reg__alpha': np.linspace(1,0,10),
'union__text_subpipeline__dim_red1__k': [200,300,400],
'union__text_subpipeline__tokenizer__ngram_range': [(1,3),(1,4)], # We will tokenize for up to 4-grams
}
Finally, we perform RandomizedSearchCV over the defined hyperparameter space:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))
# Now we are using the pipeline 3 for RandomizedSearchCV,
# using 2-fold cross-validation at this stage
# using the parameter grid space we defined above
pl3randomized= RandomizedSearchCV(pl3,
cv = 2,
param_distributions= param_grid,
n_jobs= 3,
scoring='neg_mean_squared_error',
verbose = 10)
# Finally, we start training the pipeline with RandomizedSearchCV, using the .fit() method and training set:
pl3randomized.fit(X_train,y_train)
end = datetime.datetime.now()
print("train end :"+ str(end))
These 20 fits took about 5 hours using 3 of the 4 available CPU cores, which is quite reasonable. Let's look at the performance of these attempts:
pl3randomized.best_params_
It turns out that the best parameters we have found are still the ones we used in our original pipeline. This is a good example to demonstrate that a limited hyperparameter search does not necessarily yield a better performing model.
-pl3randomized.best_score_ # Negate because the scoring was neg_mean_squared_error; this is the cross-validated MSE
y_pred_pl3_randomized = pl3randomized.predict(X_train)
np.sqrt(mean_squared_error(y_train,y_pred_pl3_randomized))
import matplotlib.pyplot as plt
plt.scatter(x = y_pred_pl3_randomized, y = y_train, c = "r", s = 0.02,alpha = 0.3)
plt.show()
Let's now modify our best-performing pipeline 3 by changing the regressor to a RandomForestRegressor with mostly default parameters, and see whether we can get better performance without any hyperparameter tuning. Note that in this case we add another feature selection step to the pipeline before the random forest, since we are not applying regularization here:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
def column_text_processer(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only non-text columns that are in the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
# Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
return text_vector.apply(lambda x: x.lower())
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full feature DataFrame passed through the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Note: this pattern matches alphanumeric tokens that are followed by at least one whitespace character
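As a quick illustration (not part of the pipeline), the pattern can be tested with re.findall; because of the whitespace lookahead, the final token of a string is dropped unless it is followed by a space:
import re
# Illustrative check of the token pattern: the last word has no trailing whitespace, so it is not matched
re.findall(r'[A-Za-z0-9]+(?=\s+)', "brand new vintage leather bag")
# -> ['brand', 'new', 'vintage', 'leather']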
#############################################################################
# Define f_regression for feature selection to convert center = False default
def f_regression(X,Y):
from sklearn import feature_selection
return feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
pl4 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)),# Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
('dim_red2', SelectKBest(f_regression, 300)),
("reg",RandomForestRegressor(n_jobs = -1, max_depth= 50)) # We start with these parameters of randomforest
])
Next we train our pipeline using .fit method as before:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))
pl4.fit(X_train,y_train)
end = datetime.datetime.now()
print("train end :"+ str(end))
Using all available CPU cores (n_jobs = -1 in RandomForestRegressor), it took about 21 hours to train the RandomForest model, even without any cross-validation, on the ~800K samples and 300 selected features.
This demonstrates that cross-validation or a hyperparameter search with this model on this amount of training data is not feasible within our computational capacity.
Let's have a look at our predictions and compare to the Ridge pipeline we trained above:
y_pred4_train = pl4.predict(X_train)
np.sqrt(mean_squared_error(y_train,y_pred4_train))
It looks like RandomForest fits the training set better than the untuned Ridge. Let's look at the performance on the holdout set:
# Make predictions using the holdout set
y_pred4_holdout = pl4.predict(X_holdout)
np.sqrt(mean_squared_error(y_holdout,y_pred4_holdout))
This is quite interesting: the untuned RandomForest pipeline overfits the training set compared to the untuned Ridge model. This is another nice example of a simple regularized model performing better than an untuned, more complex algorithm like RandomForest.
We should remember that if we had the computational power to tune RandomForest's hyperparameters, we would probably find a better-performing model. In this case, however, computational limitations determine the most feasible model. One tractable compromise is sketched below.
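A hedged sketch of that compromise (not run in this notebook): tune RandomForest on a random subsample of the training data first, and only refit the winning parameters on the full set. The subsample size and the parameter grid below are illustrative assumptions, not values used elsewhere in this analysis.
# Hedged sketch: RandomForest hyperparameter search on a ~50K-row subsample
from sklearn.model_selection import RandomizedSearchCV
rng = np.random.RandomState(425)
sub_idx = rng.choice(len(X_train), size=50000, replace=False)
X_sub, y_sub = X_train.iloc[sub_idx], y_train[sub_idx]
rf_param_space = {
    "reg__n_estimators": [10, 30, 50],
    "reg__max_depth": [10, 30, 50],
    "reg__max_features": ["sqrt", 0.3, 0.5]
}
# n_jobs=1 here because the forest inside pl4 already parallelizes with n_jobs=-1
pl4_search = RandomizedSearchCV(pl4, param_distributions=rf_param_space,
                                n_iter=10, cv=2, n_jobs=1,
                                scoring="neg_mean_squared_error", verbose=10)
# pl4_search.fit(X_sub, y_sub)  # the winning parameters could then be refit on the full training set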
In the next experiment, we will try to develop an intuition for boosted trees on the same problem, using the XGBoost package.
XGBoost has a sklearn-compatible API, so it integrates into sklearn pipelines. Therefore, we can simply start from our existing Ridge pipeline, replacing the regressor estimator and choosing a regularization parameter.
We will start by loading the data and defining our new pipeline:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data
def column_text_processer_nolambda(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only the text columns of the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
# text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
text_vector = []
for index,rows in text_data.iterrows():
text_item = " ".join(rows).lower()
text_vector.append(text_item)
# return text_vector as pd.Series object to enter the tokenization pipeline
return pd.Series(text_vector)
def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
return df[numeric_columns]
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from xgboost import XGBRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False)
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Note: this pattern matches alphanumeric tokens that are followed by at least one whitespace character
#############################################################################
# Define f_regression for feature selection to convert center = False default
def f_regression(X,Y):
from sklearn import feature_selection
return feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
# Next we define our pipeline:
pl5 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)),# Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()), # Scale the features
("reg",XGBRegressor(reg_alpha = 0.5,
nthread=3,
learning_rate=0.2,
objective="reg:linear",
n_estimators = 10)) # We start with some sensible parameters and tune later if feasible and necessary
])
Let's start training our new pipeline using the training set:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))
pl5.fit(X_train,y_train)
end = datetime.datetime.now()
print("train end :"+ str(end))
The XGBoost regressor took only 7 minutes to train, which is remarkably fast. Let's look at the performance:
pl5testpred = pl5.predict(X_train)
np.sqrt(mean_squared_error(y_true=y_train,y_pred= pl5testpred))
This performance is not as good as that of the Ridge pipeline 3. Let's also look at the performance on the holdout set:
pl5testholdout = pl5.predict(X_holdout)
np.sqrt(mean_squared_error(y_true=y_holdout,y_pred= pl5testholdout))
This is still higher than the holdout rmse of 0.6 we obtained from the Ridge pipeline. Let's see whether we can tune the model.
The XGBoost pipeline trains reasonably fast, so let's perform a random hyperparameter search to see if we can improve the model performance.
We will first prepare a hyperparameter space for the parameters we can tune within the XGBoost regressor, and perform 2-fold cross-validation to fit as many models as possible. Since a single fit took about 7 minutes, trying up to 30 model fits sounds like a reasonable starting point; a rough estimate of this budget is sketched below.
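This is only a back-of-the-envelope calculation; it ignores that each CV fold trains on half the data, that n_jobs = 3 runs fits in parallel, and that different hyperparameters change the fit time:
# Rough serial time budget for the search (assumes ~7 minutes per fit, as observed above)
n_fits, minutes_per_fit = 30, 7   # roughly n_iter x cv folds
print("estimated serial search time: ~%.1f hours" % (n_fits * minutes_per_fit / 60.0))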
# The list of tunable parameters in our XGBoost pipeline:
pl5.get_params()
Our initial hyperparameter space:
param_space = {
'reg__learning_rate': np.arange(0.05,1.05,.05),
'reg__n_estimators': [20],
'reg__max_depth': [3,6,9],
}
Our RandomizedSearchCV estimator becomes:
from sklearn.model_selection import RandomizedSearchCV
pl5RandomSearch = RandomizedSearchCV(estimator=pl5,cv= 2,n_jobs = 3,
param_distributions= param_space,
n_iter = 15,
scoring = "neg_mean_squared_error",verbose = 10)
Training the RandomizedSearchCV estimator using the training set:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))
pl5RandomSearch.fit(X_train, y_train)
end = datetime.datetime.now()
print("train end :"+ str(end))
Note that the hyperparameter search took about 6 hours, a little shorter than we expected.
pl5RandomSearch.best_params_
pl5RandomSearch.best_score_
np.sqrt(abs(pl5RandomSearch.best_score_))
pl5RandomSearchholdout = pl5RandomSearch.predict(X_holdout)
np.sqrt(mean_squared_error(y_true=y_holdout,y_pred= pl5RandomSearchholdout))
This looks very interesting. The best model found by the hyperparameter search slightly overfits the training set (compared to the Ridge pipeline, pl3), but performs almost the same as the Ridge pipeline on the holdout set.
Since the XGBoost model has improved, we can try another round of random search, keeping the learning rate fixed at its best value and varying the remaining parameters.
param_space2 = {
'reg__learning_rate': [0.75000000000000011],
'reg__n_estimators': [20,50,100],
'reg__max_depth': [9,12,15],
'reg__reg_alpha': np.arange(0.05,1.05,.05)
}
from sklearn.model_selection import RandomizedSearchCV
pl5RandomSearch2 = RandomizedSearchCV(estimator=pl5,cv= 2,n_jobs = 3,
param_distributions= param_space2,
n_iter = 20,
scoring = "neg_mean_squared_error",verbose = 10)
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))
pl5RandomSearch2.fit(X_train, y_train)
end = datetime.datetime.now()
print("train end :"+ str(end))
That was a long search! Let's look at the results:
pl5RandomSearch2.best_params_
pl5RandomSearch2.best_score_
np.sqrt(abs(pl5RandomSearch2.best_score_))
predict_pl5RandomSearch2_holdout = pl5RandomSearch2.predict(X_holdout)
np.sqrt(mean_squared_error(y_holdout,predict_pl5RandomSearch2_holdout))
It was worth the wait! The rmse on the holdout set is 0.595, an improvement over the Ridge pipeline (pl3), which gave an rmse of 0.602.
Next, we will try to form an ensemble model using the predictions of these two pipelines.
We now have two best-performing pipelines, one a linear learner and the other a tree-based learner. We can try to ensemble their predictions to see if we can get better performance on the holdout set.
To begin with, we can use the average of the two models' predictions as the ensemble prediction.
Let's average the predictions of the two pipelines to produce ensemble predictions for the holdout set:
# Combine the two predictions in a new dataframe
preds = pd.concat([pd.DataFrame(pl3.predict(X_holdout), columns=['ridge_predictions']),
pd.DataFrame(pl5RandomSearch2.predict(X_holdout), columns= ["xgboost_predictions"])], axis = 1)
print(preds.head())
print(preds.shape)
print(y_holdout.shape)
It looks like the predictions of the two pipelines are close, although not identical.
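To quantify how closely they agree, we can also look at the correlation between the two prediction columns (a quick check, not required for the ensemble):
# Correlation between the two pipelines' holdout predictions
preds.corr()
Next, let's take the mean of these predictions and add it as another column.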
preds["mean_predictions"] = preds.apply(np.mean,axis=1)
preds.head()
Let's also add the true values into the same data frame:
preds["y_holdout"] = y_holdout
preds.head()
Finally, let's look at the rmse:
rmse = pd.DataFrame(data = {
"ridge_rmse":np.sqrt(mean_squared_error(y_holdout, preds.ridge_predictions)),
"xgboost_rmse":np.sqrt(mean_squared_error(y_holdout, preds.xgboost_predictions)),
"mean_rmse":np.sqrt(mean_squared_error(y_holdout, preds.mean_predictions))
}, columns = ["ridge_rmse","xgboost_rmse","mean_rmse"], index = [0])
rmse.head()
Nice work! Just by averaging the predictions of our two best pipelines, we reduced the rmse of the holdout set predictions to 0.584.
This illustrates that when we combine predictions from 'orthogonal' models, we may be able to obtain better predictive performance simply by averaging their predictions.
Note that we assumed equal weights for each pipeline's predictions, since we averaged them. Another approach would be to use different weights for each prediction; this effectively treats each pipeline's predictions as predictors and fits a new model against the holdout labels (true values). Since that approach uses the holdout set for training, the performance of the resulting ensemble model would ideally need to be validated on a separate validation set. Since we don't have one, we would rely on cross-validation to estimate the error and arrive at a final ensemble model; a minimal sketch of this idea follows after we save and reload the predictions.
Before going further, let's save our predictions data frame for future use:
import pickle
with open("pipeline_holdout_predictions.pkl","wb") as f:
pickle.dump(preds,f)
Reload the predictions data set:
with open("pipeline_holdout_predictions.pkl","rb") as f:
preds = pickle.load(f)
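With the predictions reloaded, here is the hedged sketch of the weighted-ensemble (stacking) idea described above: the two pipelines' holdout predictions are used as features for a linear model fit against the holdout labels, with cross-validation standing in for the missing extra validation set. This is illustrative only, not a final model:
# Hedged stacking sketch: learn weights for the two pipelines' predictions
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
stack_X = preds[["ridge_predictions", "xgboost_predictions"]]
stack_y = preds["y_holdout"].values
stacker = LinearRegression()
# Cross-validated rmse of the stacked ensemble (the scorer returns negative MSE)
cv_mse = -cross_val_score(stacker, stack_X, stack_y, cv=5, scoring="neg_mean_squared_error")
print("cross-validated stacked rmse: %.3f" % np.sqrt(cv_mse.mean()))
# Fit on the full holdout predictions to inspect the learned weights
stacker.fit(stack_X, stack_y)
print("weights:", stacker.coef_, "intercept:", stacker.intercept_)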
Let's complete our work by training neural networks, making use of our existing NLP and feature extraction pipeline.
Let's get started!
We first repeat our data loading steps along with the modified pipeline. Note that we only removed the regressor step from the pipeline and kept everything else the same.
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py
#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values
X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values
#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
Numeric_features = pickle.load(f)
f.close()
with open("Text_features.pkl", 'rb') as f:
Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data
def column_text_processer_nolambda(df,text_columns = Text_features):
""""A function that will merge/join all text in a given row to make it ready for tokenization.
- This function should take care of converting missing values to empty strings.
- It should also convert the text to lowercase.
df= pandas dataframe
text_columns = names of the text features in df
"""
# Select only the text columns of the df
text_data = df[text_columns]
# Fill the missing values in text_data using empty strings
text_data.fillna("",inplace=True)
# Join all the strings in a given row to make a vector
# text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
text_vector = []
for index,rows in text_data.iterrows():
text_item = " ".join(rows).lower()
text_vector.append(text_item)
# return text_vector as pd.Series object to enter the tokenization pipeline
return pd.Series(text_vector)
def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
return df[numeric_columns]
#############################################################################
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False)
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' # Note: this pattern matches alphanumeric tokens that are followed by at least one whitespace character
#############################################################################
# Define f_regression for feature selection to convert center = False default
def f_regression(X,Y):
from sklearn import feature_selection
return feature_selection.f_regression(X,Y,center = False) # default is center = True
#############################################################################
# Next we define our pipeline:
pl6 = Pipeline([
("union",FeatureUnion( #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
transformer_list = [
("numeric_subpipeline", Pipeline([ #Note we have subpipeline branches inside the main pipeline
("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
("imputer",Imputer()), # Step2: impute missing values
])), # Branching point of the FeatureUnion
("text_subpipeline",Pipeline([
("parser",get_text_data), # Step1: parse the text data
("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
stop_words = "english",# We will remove English stop words before tokenization
ngram_range = (1,3),
non_negative=True, norm=None, binary=False
)), # Step2: use HashingVectorizer for automated tokenization and feature extraction
('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
]))
]
)),# Branching point to the main pipeline: at this point all features are numeric
("int", SparseInteractions(degree=2)), # Add polynomial interaction terms
("scaler",MaxAbsScaler()) # Scale the features
])
Next we .fit(), i.e. train, pipeline 6 with our training set. Note that the NLP/tokenization and feature selection steps are fit on this data, and we will reuse this pipeline to reproducibly extract the same features from the holdout data set later on. The important point is to call the .fit method only on the training set, to ensure the same features are extracted:
pl6.fit(X_train,y_train)
Once we have obtained our pipeline, we lock it down and save it as a pickle object to load later on:
import pickle
with open("pl6.pkl","wb") as f:
pickle.dump(pl6,f)
# Reload the pipeline 6
import pickle
with open("pl6.pkl","rb") as f:
pl6 = pickle.load(f)
We will use this pipeline to transform the training set, performing the initial feature extraction and feature selection:
X_train_pl6_transformed = pl6.transform(X_train)
X_train_pl6_transformed.shape
This is the shape of our transformed training set. Note that we performed tokenization, extracted up to 3-grams, and selected the 300 best text features; we then added interaction terms between these features up to the second polynomial degree, which increased the number of features dramatically.
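As a rough sanity check (assuming the two numeric columns plus the 300 selected text features enter the interaction step, and that SparseInteractions(degree=2) adds only pairwise products of distinct columns), the expected width can be computed directly:
# Sanity check of the transformed width: base columns plus all pairwise products
n_base = 2 + 300
print(n_base + n_base * (n_base - 1) // 2)   # 45753, matching the feature count reported for the transformed holdout set later on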
Recall that we previously applied regularization after this step, since this is a lot of feature space to work with. When using neural networks, we can also add regularization. It might even seem that we do not need the interaction terms at all, since neural networks are supposed to extract useful representations of the data, so such interactions are likely to be captured implicitly. Nevertheless, we will move forward using these features to build an intuition about the performance of a neural network.
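For completeness, here is a hedged sketch of how such regularization could be added in Keras, via L2 weight penalties and dropout; the penalty strength and dropout rate are illustrative, untuned values, and this network is not used in what follows:
# Hedged sketch (not used below): a regularized variant of the small network
from keras import models, layers, regularizers
regularized_net = models.Sequential()
regularized_net.add(layers.Dense(16, activation="relu",
                                 kernel_regularizer=regularizers.l2(0.001),
                                 input_shape=(X_train_pl6_transformed.shape[1],)))
regularized_net.add(layers.Dropout(0.5))
regularized_net.add(layers.Dense(16, activation="relu",
                                 kernel_regularizer=regularizers.l2(0.001)))
regularized_net.add(layers.Dense(1))
regularized_net.compile(optimizer="adam", loss="mean_squared_error")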
We will divide this set into two sets, and train the network using one half and validate the network with the other half:
from sklearn.model_selection import train_test_split
X_net_train,X_net_validation,y_net_train,y_net_validation = train_test_split(X_train_pl6_transformed,
y_train,
test_size = 0.5,
random_state = 423)
print(X_net_train.shape,X_net_validation.shape,y_net_train.shape, y_net_validation.shape)
Let's use these sets to train our network. We will start with a simple network:
from keras import models, metrics, layers
network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_net_train.shape[1],)))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dense(1))
network1.compile(optimizer= "adam", loss= "mean_squared_error", metrics= [metrics.mse])
history_net1 = network1.fit(X_net_train,y_net_train,
epochs=20,batch_size=200,
validation_data=(X_net_validation,y_net_validation))
Calling the .evaluate method on the training set gives the loss and metric of the final network on the training data:
network1.evaluate(X_net_train,y_net_train)
Similarly, for the validation set:
network1.evaluate(X_net_validation,y_net_validation)
We note that the performance of the final network shows overfitting. Therefore, we need to find the epoch at which the model starts overfitting, by extracting the loss and metric (equivalent here, since both are the mean squared error) at each epoch:
import matplotlib.pyplot as plt
history = history_net1.history
train_rmse = np.sqrt(history["mean_squared_error"])
val_rmse = np.sqrt(history["val_mean_squared_error"])
epochs = np.arange(1,21,1)
plt.plot(epochs,train_rmse,"rD:",label = "Training rmse") # See help(plt.plot) for many other available plotting options
plt.plot(epochs,val_rmse,"bo-",label = "Validation rmse")
plt.title("Training vs. Validation RMSE")
plt.xlabel("epochs")
plt.ylabel("RMSE")
plt.legend()
plt.show()
val_rmse
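Rather than reading the best epoch off the plot, it can also be computed directly (a small convenience check, not part of the original run):
# Epoch with the lowest validation rmse (epochs are 1-indexed in the plot above)
int(np.argmin(val_rmse)) + 1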
We found that the network starts overfitting after the 4th epoch. Therefore, we will re-fit the network using only 4 epochs before assessing performance on the holdout set:
from keras import models, metrics, layers
network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_net_train.shape[1],)))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dense(1))
network1.compile(optimizer= "adam", loss= "mean_squared_error", metrics= [metrics.mse])
history_net1 = network1.fit(X_net_train,y_net_train,
epochs=4,batch_size=200,
validation_data=(X_net_validation,y_net_validation))
Let's save this trained network for future use. We can save the network as an HDF5 file using the .save method:
network1.save("network1.h5")
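The saved file can later be read back with Keras's load_model; here we simply round-trip the model we just saved:
# Reload the saved network (e.g. in a fresh session)
from keras.models import load_model
network1 = load_model("network1.h5")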
Now we need to prepare the holdout set for predictions with our trained network. The holdout set must be transformed in exactly the same way, using the same pipeline 6 that produced the training features.
To do this we call pipeline 6's .transform method, which ensures the same features are extracted from the holdout set so it is ready to feed into the network we trained:
X_net_holdout = pl6.transform(X_holdout)
X_net_holdout.shape
Note that .transform only applies the already-fitted feature extraction (no re-fitting), turning the holdout set into the format the network expects: exactly the same 45753 features were created from the X_holdout data set.
Let's evaluate the rmse on the holdout set with the network we trained:
np.sqrt(network1.evaluate(X_net_holdout,y_holdout))
The rmse we obtained from the first network is comparable to that of the tuned XGBoost model. Before finalizing a network, we can check whether we can improve the performance of network 1 by increasing model complexity.
Hoping to increase model performance, we simply add another layer with the same number of neurons, and also include an EarlyStopping monitor to stop training once the validation score stops improving.
from keras import models, metrics, layers
from keras.callbacks import EarlyStopping
estop_monitor = EarlyStopping(patience= 1)
network2 = models.Sequential()
network2.add(layers.Dense(16,activation="relu", input_shape = (X_net_train.shape[1],)))
network2.add(layers.Dense(16,activation="relu"))
network2.add(layers.Dense(16,activation="relu"))
network2.add(layers.Dense(1))
network2.compile(optimizer= "adam", loss= "mean_squared_error", metrics= [metrics.mse])
history_net2 = network2.fit(X_net_train,y_net_train,
epochs=20,batch_size=200,
callbacks = [estop_monitor],
validation_data=(X_net_validation,y_net_validation))
It looks like we have some improvement of the network based on the validation score. Let's evaluate the new model using the transformed holdout set:
np.sqrt(network2.evaluate(X_net_holdout,y_holdout))
Network 2 is a slight improvement over network 1, but perhaps not a substantial enough difference to justify the wait. Nevertheless, we will save network 2 and use it to get the predictions:
network2.save("network2.h5")
network2.summary()
holdout_pred_network2 = network2.predict(X_net_holdout)
Now we have 3 best-performing models that we believe are 'orthogonal'; in other words, they are likely to explain different portions of the variation in the data. They are fundamentally different kinds of learners: linear, tree-based, and neural network. Therefore, we hope to improve our predictive capacity by ensembling these 3 models.
We will keep it simple and use the average of the 3 models' predictions as our ensemble prediction.
# Load back the predictions from other models
with open("pipeline_holdout_predictions.pkl","rb") as f:
preds = pickle.load(f)
preds.head()
preds3 = preds.iloc[:,0:2]
preds3.head()
preds3["network2_predictions"] = holdout_pred_network2
preds3.head()
preds3["mean_predictions"] = preds3.apply(np.mean,axis = 1)
preds3.head()
np.sqrt(mean_squared_error(y_holdout,preds3.mean_predictions))
rmse = pd.DataFrame(data = {
"ridge_rmse":np.sqrt(mean_squared_error(y_holdout, preds3.ridge_predictions)),
"xgboost_rmse":np.sqrt(mean_squared_error(y_holdout, preds3.xgboost_predictions)),
"network2_rmse":np.sqrt(mean_squared_error(y_holdout,preds3.network2_predictions)),
"mean_rmse":np.sqrt(mean_squared_error(y_holdout, preds3.mean_predictions))
}, columns = ["ridge_rmse","xgboost_rmse","network2_rmse","mean_rmse"], index = [0])
rmse.head()
Very nice! By including the neural network in our ensemble, we were able to reduce the rmse to 0.579. This demonstrates how we can improve predictive performance by training and optimizing orthogonal models and forming a model ensemble for the final prediction.
# Save the dataframe containing holdout predictions from 3 models
with open("pipeline_holdout_predictions.pkl","wb") as f:
pickle.dump(preds3,f)