Retail Price Suggestion

Table of Contents

In [7]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

Loading and processing the data sets

Let's start with loading the training data:

In [15]:
import warnings
warnings.filterwarnings('ignore')
In [1]:
import os
import pandas as pd
import numpy as np
In [2]:
train = pd.read_csv("train.csv")
In [4]:
train.head()
Out[4]:
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
In [5]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id             1482535 non-null int64
name                 1482535 non-null object
item_condition_id    1482535 non-null int64
category_name        1476208 non-null object
brand_name           849853 non-null object
price                1482535 non-null float64
shipping             1482535 non-null int64
item_description     1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB

First, let's remove the train_id from the training set, and identify the numeric and character variables to begin with:

In [3]:
train = train.drop("train_id",axis = 1)
train.head()
Out[3]:
name item_condition_id category_name brand_name price shipping item_description
0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
In [7]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 7 columns):
name                 1482535 non-null object
item_condition_id    1482535 non-null int64
category_name        1476208 non-null object
brand_name           849853 non-null object
price                1482535 non-null float64
shipping             1482535 non-null int64
item_description     1482531 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 79.2+ MB
In [4]:
Numeric_features_select = np.logical_or((pd.Series(train.dtypes)) == "int64",(pd.Series(train.dtypes) == "float64")).tolist()
In [5]:
Numeric_features = pd.Series(train.columns)[Numeric_features_select].tolist()
In [6]:
Numeric_features.remove('price')
In [7]:
Numeric_features
Out[7]:
['item_condition_id', 'shipping']
In [8]:
Text_features = pd.Series(train.columns)[np.logical_not(Numeric_features_select)].tolist()
In [9]:
Text_features
Out[9]:
['name', 'category_name', 'brand_name', 'item_description']

Since the model evaluation will be based on the log of the target variable price, we convert it at this stage:

In [10]:
train.price = pd.Series(np.log(train.price + 1))
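As a quick side check, np.log1p computes the same log(1 + x) transform and could be used interchangeably here:

# log1p(x) == log(x + 1); a small check on two example prices
assert np.allclose(np.log1p([10.0, 52.0]), np.log(np.array([10.0, 52.0]) + 1))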
In [11]:
train.head()
Out[11]:
name item_condition_id category_name brand_name price shipping item_description
0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 2.397895 1 No description yet
1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 3.970292 0 This keyboard is in great condition and works ...
2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 2.397895 1 Adorable top with a hint of lace and a key hol...
3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 3.583519 1 New with tags. Leather horses. Retail for [rm]...
4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 3.806662 0 Complete with certificate of authenticity

Now it is time to split the training data set into training and holdout sets in order to start building our pipeline:

In [12]:
from sklearn.model_selection import train_test_split

X = train.drop("price", axis = 1)
y = train.price
In [13]:
X_train,X_holdout,y_train,y_holdout = train_test_split(X,y,test_size = 0.4, random_state = 425)
In [17]:
type(X_train)
Out[17]:
pandas.core.frame.DataFrame
In [18]:
type(y_holdout)
Out[18]:
pandas.core.series.Series
In [19]:
X_train.shape
Out[19]:
(889521, 6)
In [20]:
X_holdout.shape
Out[20]:
(593014, 6)
In [21]:
y_train.shape
Out[21]:
(889521,)
In [22]:
y_holdout.shape
Out[22]:
(593014,)

At this stage, let's lock down and save the datasets on which we will train and validate our model:

In [14]:
X_train.to_csv("X_train.csv",index=False)
X_holdout.to_csv("X_holdout.csv", index = False)
y_train.to_csv("y_train.csv", index= False)
y_holdout.to_csv("y_holdout.csv", index= False)
In [91]:
# Write the pickled feature names
import pickle
with open("Numeric_features.pkl", 'wb') as f:
    pickle.dump(Numeric_features,f)
f.close()

with open("Text_features.pkl", 'wb') as f:
    pickle.dump(Text_features,f)
f.close()

Data processing and model training pipeline

At this stage we will lay out a high-level design for our data processing and model training pipeline. At minimum, the pipeline will need to include the following steps:

  1. We need to process text and numeric features separately, then combine them using FeatureUnion()

    1. Text subpipeline should include:

      • A function that will merge all text in a given row to make it ready for tokenization.
        • This function should take care of converting missing values to empty strings.
        • It should also convert the text to lowercase.
      • We should remove common stop words and perform text stemming
      • We should tokenize using alphanumeric characters only (white space + punctuation are used as delimiters)
      • We try to include up to 3-grams
    2. Numeric subpipeline should include:

      • An imputation step to fill any missing values using mean of the column
      • A scaling step that will scale the numeric values between -1 and 1
  2. After merging the numeric and text features we will add the following common steps:

    1. Add interaction terms
    2. Perform a simple feature selection using the F-regression method
    3. Include a hashing step to improve computational efficiency
  3. Finally, we will add a model training step, starting with a regularized linear regression.

Once our pipeline is ready, our goals will be:

  • to get some idea of the initial performance of the pipeline on the holdout set, using the simple model with default hyperparameters
  • then try hyperparameter tuning (such as GridSearchCV or RandomizedSearchCV) to see if we can come up with a better model
  • save the model that has the best overall performance (based on RMSE on the holdout set)

Finally, we will repeat these steps using the same pipeline, but swapping the model step to train:

  • Elastic Net model
  • Support vector machines
  • RandomForest regression
  • Gradient boosting

After these steps, we will explore ensembling these models to see if we can get a better model.

At the end of this exercise, we expect to become more competent in:

  1. developing sklearn pipelines and incorporating custom functions
  2. basic NLP tasks we can perform with our current knowledge
  3. developing an intuition about the performance of different models
  4. using the sklearn API for hyperparameter tuning with important algorithms

Finally, we will try to use the same sets to train Deep Neural Networks to see how they compare to the performance of shallow learning approaches.

In [2]:
import os
import pandas as pd
import numpy as np
import pickle

# Re-read the training data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])

y_train = y_train.price.values

# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()

A custom function for text pre-processing

In [3]:
def column_text_processer(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings (assign rather than fillna(inplace=True) to avoid modifying a slice of df)
    text_data = text_data.fillna("")
    
    # Join all the strings in a given row to make a single document per row
    text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    
    # Convert all the text to lowercase and return as a pd.Series object to enter the tokenization pipeline
    return text_vector.apply(lambda x: x.lower())
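Before wiring this function into a pipeline, a quick sanity check on a toy DataFrame (illustrative values only, assuming the four text features loaded above):

import pandas as pd

toy = pd.DataFrame({
    "name": ["Razer BlackWidow Chroma Keyboard", "AVA-VIV Blouse"],
    "category_name": ["Electronics/Computers & Tablets", None],
    "brand_name": ["Razer", "Target"],
    "item_description": ["This keyboard is in great condition", None],
})

# Each row becomes a single lowercase string; the missing category and description become empty strings
print(column_text_processer(toy))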

Building the model training pipeline

We will start by loading the necessary functions from sklearn submodules:

In [4]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression

First we build two utility functions to parse numeric and text data, and wrap them using FunctionTransformer, so that they can be integrated into a sklearn pipeline:

In [5]:
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the DataFrame holding all features
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass column_text_processer itself rather than calling it

We also need to create a regex token pattern to use in the vectorizer. The vectorizer will use this pattern to create the tokens and n-grams we specified, and it will automatically convert these into features stored as a sparse matrix. Note that we will use HashingVectorizer rather than CountVectorizer to improve computational efficiency.

In [6]:
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex keeps runs of alphanumeric characters that are immediately followed by whitespace; tokens followed by punctuation or the end of the string are dropped
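A quick illustrative check of what this pattern actually keeps, using re.findall on a toy string:

import re

sample = "new with tags! retail for 50 "
print(re.findall(TOKENS_ALPHANUMERIC, sample))
# ['new', 'with', 'retail', 'for', '50'] -- 'tags' is dropped because it is followed by '!' rather than whitespace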

We also wrap the default f_regression feature selection function so that it is called with center = False, which lets us place it into our pipeline:

In [7]:
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

We can now start building the actual pipeline:

In [8]:
pl1 = Pipeline([
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        transformer_list = [
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature hashing
                                            ('dim_red', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    #("int", SparseInteractions(degree=2)), # Add polynomial interaction terms :POSTPONED
    ("scaler",MaxAbsScaler()), # Scale the features
])
In [ ]:
# We fit_transform X outside of the pipeline to obtain transformed X for hyperparameter search, 
# since the transformation step takes a long time and we want to avoid repeating it every time 
X_train_transformed = pl1.fit_transform(X_train,y_train) 
In [6]:
# We start with ridge regression
model1 = Ridge(alpha=0.5)
model1.fit(X_train_transformed, y_train)
Out[6]:
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
In [7]:
y_pred1 = model1.predict(X_train_transformed)
In [8]:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_train,y_pred1))
Out[8]:
0.63839928042826788
In [10]:
import matplotlib.pyplot as plt
plt.scatter(y_pred1,y_train, s = 2, c = "r", alpha = 0.4)
plt.show()

This simple pipeline already looks very promising!

Ridge Hyperparameter tuning

First, we compile the data loading, utility functions, and pipeline steps into one cell for an easy start later:

In [73]:
import os
import pandas as pd
import numpy as np
import pickle

#############################################################################
# Re-read the training data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])

y_train = y_train.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################

def column_text_processer(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings (assign rather than fillna(inplace=True) to avoid modifying a slice of df)
    text_data = text_data.fillna("")
    
    # Join all the strings in a given row to make a single document per row
    text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    
    # Convert all the text to lowercase and return as a pd.Series object to enter the tokenization pipeline
    return text_vector.apply(lambda x: x.lower())

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression

#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the DataFrame holding all features
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass column_text_processer itself rather than calling it
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex keeps runs of alphanumeric characters that are immediately followed by whitespace; tokens followed by punctuation or the end of the string are dropped

#############################################################################
# Redefine f_regression for feature selection so that it is called with center = False
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################
# Prepare the actual pipeline:

pl1 = Pipeline([
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        transformer_list = [
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature hashing
                                            ('dim_red', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    #("int", SparseInteractions(degree=2)), # Add polynomial interaction terms :POSTPONED
    ("scaler",MaxAbsScaler()), # Scale the features
])
In [ ]:
# We fit_transform X outside of the pipeline to obtain transformed X for hyperparameter search, 
# since the transformation step takes a long time and we want to avoid repeating it every time 
X_train_transformed = pl1.fit_transform(X_train,y_train) 

We can now start hyperparameter optimization. Let's first look into the tunable hyperparameters of the ridge model:

In [11]:
# looking into parameters of our first estimator:

model1.get_params()
Out[11]:
{'alpha': 0.5,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'normalize': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.001}

Let's start by tuning alpha, the most important hyperparameter, which sets the strength of regularization. A sensible value lies between 0 and 1. We will use an exhaustive search over specified parameter values, performed by GridSearchCV with 5-fold cross-validation. Since we only search over one hyperparameter, this is a one-dimensional grid search:

In [12]:
# We prepare a grid of 20 alpha values in a linear space
alpha_space = np.linspace(1,0,20)
alpha_space
Out[12]:
array([ 1.        ,  0.94736842,  0.89473684,  0.84210526,  0.78947368,
        0.73684211,  0.68421053,  0.63157895,  0.57894737,  0.52631579,
        0.47368421,  0.42105263,  0.36842105,  0.31578947,  0.26315789,
        0.21052632,  0.15789474,  0.10526316,  0.05263158,  0.        ])
In [22]:
param_grid = {"alpha":alpha_space}
  • To select a scoring parameter in sklearn, see:

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

  • GridSearchCV and RandomizedSearchCV evaluate each parameter setting independently. Computations can be run in parallel if your OS supports it, by using the keyword n_jobs=-1.
In [29]:
# Instantiate the GridSearchCV object:
from sklearn.model_selection import GridSearchCV

ridgemodel = Ridge()
ridge_cv = GridSearchCV(estimator= ridgemodel,
                        param_grid= param_grid,
                        scoring='neg_mean_squared_error',
                        cv = 5, 
                        n_jobs=-1)

# Fit the GridSearchCV object to training data to start parameter search:
ridge_cv.fit(X_train_transformed,y_train)
Out[29]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'alpha': array([ 1.     ,  0.94737,  0.89474,  0.84211,  0.78947,  0.73684,
        0.68421,  0.63158,  0.57895,  0.52632,  0.47368,  0.42105,
        0.36842,  0.31579,  0.26316,  0.21053,  0.15789,  0.10526,
        0.05263,  0.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)
In [30]:
# Print the tuned parameters and score
print("Tuned Ridge Regression Parameters: {}".format(ridge_cv.best_params_)) 
print("Best score is {}".format(ridge_cv.best_score_))
Tuned Ridge Regression Parameters: {'alpha': 0.68421052631578949}
Best score is -0.40786863934749973

Note that the score is the negative mean squared error. We found that the optimal alpha is close to 0.68, which is higher than the alpha we used in model1. Let's make predictions on the training set to compare this newly tuned model with the earlier model:

In [31]:
y_pred2 = ridge_cv.predict(X_train_transformed)
In [35]:
np.sqrt(mean_squared_error(y_train,y_pred2))
Out[35]:
0.63839912322227732
In [34]:
import matplotlib.pyplot as plt
plt.scatter(y_pred2,y_train, s = 0.2, c = "r", alpha = 0.8)
plt.show()
In [37]:
plt.scatter(y_pred2,y_pred1, s = 0.5, c = "b", alpha = 0.8)
plt.show()

Interestingly, we note that even after hyperparameter tuning, the predictive performance of the Ridge model remained at the same level; the two models are making essentially the same predictions.

Next, we will include interaction terms in the pipeline to see if we can improve model performance.

Adding feature interaction terms into pipeline

This is a custom transformer class that is compatible with sparse matrices:

https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py

In [1]:
from itertools import combinations

import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class SparseInteractions(BaseEstimator, TransformerMixin):
    def __init__(self, degree=2, feature_name_separator="_"):
        self.degree = degree
        self.feature_name_separator = feature_name_separator

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not sparse.isspmatrix_csc(X):
            X = sparse.csc_matrix(X)

        if hasattr(X, "columns"):
            self.orig_col_names = X.columns
        else:
            self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])

        spi = self._create_sparse_interactions(X)
        return spi

    def get_feature_names(self):
        return self.feature_names

    def _create_sparse_interactions(self, X):
        out_mat = []
        self.feature_names = self.orig_col_names.tolist()

        for sub_degree in range(2, self.degree + 1):
            for col_ixs in combinations(range(X.shape[1]), sub_degree):
                # add name for new column
                name = self.feature_name_separator.join(self.orig_col_names[list(col_ixs)])
                self.feature_names.append(name)

                # get column multiplications value
                out = X[:, col_ixs[0]]
                for j in col_ixs[1:]:
                    out = out.multiply(X[:, j])

                out_mat.append(out)

        return sparse.hstack([X] + out_mat)
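
A quick illustrative check on a tiny sparse matrix (toy values) shows how degree=2 appends one extra column per pairwise product:

import numpy as np
from scipy import sparse

toy = sparse.csr_matrix(np.array([[1.0, 2.0, 0.0],
                                  [0.0, 3.0, 4.0]]))

interacted = SparseInteractions(degree=2).fit_transform(toy)
print(interacted.shape)      # (2, 6): the 3 original columns plus 3 pairwise products
print(interacted.toarray())
# [[ 1.  2.  0.  2.  0.  0.]
#  [ 0.  3.  4.  0.  0. 12.]]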

We save this class as SparseInteractions.py in our working directory so that we can import it later.

We will now modify our pipeline to incorporate the step that adds interaction terms. We will also load and process the holdout data for final validation.

In [4]:
import os
import pandas as pd
import numpy as np
import pickle

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################

def column_text_processer(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings (assign rather than fillna(inplace=True) to avoid modifying a slice of df)
    text_data = text_data.fillna("")
    
    # Join all the strings in a given row to make a single document per row
    text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    
    # Convert all the text to lowercase and return as a pd.Series object to enter the tokenization pipeline
    return text_vector.apply(lambda x: x.lower())

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression

#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the DataFrame holding all features
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass column_text_processer itself rather than calling it
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex keeps runs of alphanumeric characters that are immediately followed by whitespace; tokens followed by punctuation or the end of the string are dropped

#############################################################################
# Redefine f_regression for feature selection so that it is called with center = False
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################
# Prepare the modified pipeline (pl2):

pl2 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature hashing
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ('dim_red2', SelectKBest(f_regression, 400)) # Add another dimension reduction step at the end
])

Next, we will use the new pipeline (pl2) to transform X_train and X_holdout:

In [ ]:
# We fit_transform X outside of the pipeline to obtain transformed X for hyperparameter search, 
# since the transformation step takes a long time and we want to avoid repeating it every time 
X_train_transformed = pl2.fit_transform(X_train,y_train)
X_holdout_transformed = pl2.fit_transform(X_holdout, y_holdout)
In [6]:
print(X_train_transformed.shape)
print(X_holdout_transformed.shape)
(889521, 400)
(593014, 400)

Note that, as designed, our pipeline produces data sets with 400 features. Next, we fit a new Ridge model using the training set transformed by the modified pipeline (pl2):

In [9]:
from sklearn.metrics import mean_squared_error
model2 = Ridge(alpha=0.5)
model2.fit(X_train_transformed,y_train)
y_pred2 = model2.predict(X_train_transformed) 
np.sqrt(mean_squared_error(y_train,y_pred2))
Out[9]:
0.64640722157857955

To calculate the RMSE in the holdout set:

In [10]:
y_pred2 = model2.predict(X_holdout_transformed) 
np.sqrt(mean_squared_error(y_holdout,y_pred2))
Out[10]:
1.0222330776410204

It looks like the addition of interaction terms did not improve model performance for our problem. Let's finally try hyperparameter tuning in this scenario:

In [13]:
from sklearn.model_selection import GridSearchCV
alphas = np.linspace(1,0,10)
param_grid = {"alpha":alphas}
# Instantiate the GridSearchCV object:
ridgemodel = Ridge()
ridge_cv = GridSearchCV(estimator= ridgemodel,
                        param_grid= param_grid,
                        scoring='neg_mean_squared_error',
                        cv = 5, 
                        n_jobs=-1)

# Fit the GridSearchCV object to training data to start parameter search:
ridge_cv.fit(X_train_transformed,y_train)
Out[13]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'alpha': array([ 1.     ,  0.88889,  0.77778,  0.66667,  0.55556,  0.44444,
        0.33333,  0.22222,  0.11111,  0.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)
In [14]:
print(ridge_cv.best_params_)
print(ridge_cv.best_score_)
{'alpha': 0.11111111111111116}
-0.418277136487
In [15]:
from sklearn.metrics import mean_squared_error
y_pred2 = ridge_cv.predict(X_train_transformed) 
np.sqrt(mean_squared_error(y_train,y_pred2))
Out[15]:
0.64638127028157599
In [16]:
y_pred2 = ridge_cv.predict(X_holdout_transformed) 
np.sqrt(mean_squared_error(y_holdout,y_pred2))
Out[16]:
1.0569031945809519

Hyperparameter tuning had a marginal impact on model performance. As a final step, let's modify pipeline 2 by removing the final feature selection step and see whether that changes model performance:

In [17]:
# Prepare the modified pipeline (pl3):
pl3 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature hashing
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
])
In [ ]:
X_train_transformed = pl3.fit_transform(X_train, y_train)
X_holdout_transformed = pl3.fit_transform(X_holdout, y_holdout)
In [20]:
print(X_train_transformed.shape)
print(X_holdout_transformed.shape)
(889521, 45753)
(593014, 45753)

Note that by creating interaction terms, we increased the dimensionality of the data enormously. We will now try Ridge regularization to see if we can improve on our previous model's performance:

In [ ]:
from sklearn.metrics import mean_squared_error
model3 = Ridge(alpha = 0.5)
model3.fit(X_train_transformed,y_train)
ypred3 = model3.predict(X_train_transformed)
In [22]:
np.sqrt(mean_squared_error(y_train,ypred3))
Out[22]:
0.57418678100605181
In [23]:
import matplotlib.pyplot as plt
plt.scatter(ypred3,y_train, s = 0.3, c = 'r', alpha = 0.7)
plt.show()
In [24]:
# To see the performance in the holdout set
ypred3_holdout = model3.predict(X_holdout_transformed)
np.sqrt(mean_squared_error(y_holdout,ypred3_holdout))
Out[24]:
0.95170781413293182

Model performance is remarkably better. Therefore, we learned that once the interaction terms are added, we can leave the features as they are, as long as we perform regularization afterwards.

However, one point we noticed here is that if we call fit_transform on the holdout set outside of the pipeline, we can bias the evaluation on X_holdout, because the transformers are refit and different features are extracted. Instead, we should fit (train) the entire pipeline using X_train and y_train, and then only call predict on X_holdout.
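If we do want transformed matrices outside of a full modeling pipeline, a minimal sketch of the leak-free pattern (using the transformer-only pl3 defined above) would be:

# Fit the transformers on the training data only, then reuse them for the holdout set
X_train_transformed = pl3.fit_transform(X_train, y_train)   # fit the transformers on the training set
X_holdout_transformed = pl3.transform(X_holdout)            # transform only, no refitting on holdout data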

Let's incorporate the model step into our best performing pipeline, pl3:

In [33]:
import os
import pandas as pd
import numpy as np
import pickle

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################

def column_text_processer(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings (assign rather than fillna(inplace=True) to avoid modifying a slice of df)
    text_data = text_data.fillna("")
    
    # Join all the strings in a given row to make a single document per row
    text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    
    # Convert all the text to lowercase and return as a pd.Series object to enter the tokenization pipeline
    return text_vector.apply(lambda x: x.lower())

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the DataFrame holding all features
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass column_text_processer itself rather than calling it
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex keeps runs of alphanumeric characters that are immediately followed by whitespace; tokens followed by punctuation or the end of the string are dropped

#############################################################################
# Redefine f_regression for feature selection so that it is called with center = False
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################

pl3 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature hashing
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
])
In [ ]:
# Train our pipeline using training set
pl3.fit(X_train,y_train)
In [ ]:
# Make predictions first using the training set
y_pred3 = pl3.predict(X_train)
np.sqrt(mean_squared_error(y_train,y_pred3))

As expected, we were able to reproduce the training-set RMSE using the pipeline object's predictions.

In [ ]:
# Make predictions using the holdout set
y_pred3 = pl3.predict(X_holdout)
np.sqrt(mean_squared_error(y_holdout,y_pred3))

Very nice! We have now learned that it is essential to keep the modeling step inside the pipeline, so that the transformers are fit only on the training data. The performance of our pipeline on the holdout set is actually quite good, better than any model we trained before, and this is a data set the pipeline has not seen.

In [29]:
plt.scatter(y_pred3,y_holdout, s = 0.3, c = "r", alpha = 0.4)
plt.show()

Next, let's perform a grid search over the pipeline3 we established. This way we will try to learn:

  1. How to perform a grid search using an entire pipeline: tuning multiple hyperparameters
  2. How to improve the computational efficiency of this process (see the Pipeline documentation on caching)
  3. Whether we can further improve the performance of the existing pipeline!

GridSearch using the entire Ridge pipeline

Let's get started by loading our data sets, utility functions, and the latest pipeline on which we would like to perform hyperparameter tuning:

In [2]:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################

def column_text_processer(df,text_columns = Text_features):
    """A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df = pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only the text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings (assign rather than fillna(inplace=True) to avoid modifying a slice of df)
    text_data = text_data.fillna("")
    
    # Join all the strings in a given row to make a single document per row
    text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    
    # Convert all the text to lowercase and return as a pd.Series object to enter the tokenization pipeline
    return text_vector.apply(lambda x: x.lower())

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the DataFrame holding all features
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we pass column_text_processer itself rather than calling it
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex keeps runs of alphanumeric characters that are immediately followed by whitespace; tokens followed by punctuation or the end of the string are dropped

#############################################################################
# Redefine f_regression for feature selection so that it is called with center = False
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################

pl3 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature hashing
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
])

Let's look into the potential parameters we can tune in our pipeline, using the .get_params() method of the pipeline object:

In [3]:
pl3.get_params()
Out[3]:
{'int': SparseInteractions(degree=2, feature_name_separator='_'),
 'int__degree': 2,
 'int__feature_name_separator': '_',
 'memory': None,
 'reg': Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
    normalize=False, random_state=None, solver='auto', tol=0.001),
 'reg__alpha': 0.5,
 'reg__copy_X': True,
 'reg__fit_intercept': True,
 'reg__max_iter': None,
 'reg__normalize': False,
 'reg__random_state': None,
 'reg__solver': 'auto',
 'reg__tol': 0.001,
 'scaler': MaxAbsScaler(copy=True),
 'scaler__copy': True,
 'steps': [('union', FeatureUnion(n_jobs=1,
          transformer_list=[('numeric_subpipeline', Pipeline(memory=None,
        steps=[('parser', FunctionTransformer(accept_sparse=False,
             func=<function <lambda> at 0x111756f28>, inv_kw_args=None,
             inverse_func=None, kw_args=None, pass_y='deprecated',
             validate=False)), ('imputer',...izer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1036cd6a8>))]))],
          transformer_weights=None)),
  ('int', SparseInteractions(degree=2, feature_name_separator='_')),
  ('scaler', MaxAbsScaler(copy=True)),
  ('reg', Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001))],
 'union': FeatureUnion(n_jobs=1,
        transformer_list=[('numeric_subpipeline', Pipeline(memory=None,
      steps=[('parser', FunctionTransformer(accept_sparse=False,
           func=<function <lambda> at 0x111756f28>, inv_kw_args=None,
           inverse_func=None, kw_args=None, pass_y='deprecated',
           validate=False)), ('imputer',...izer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1036cd6a8>))]))],
        transformer_weights=None),
 'union__n_jobs': 1,
 'union__numeric_subpipeline': Pipeline(memory=None,
      steps=[('parser', FunctionTransformer(accept_sparse=False,
           func=<function <lambda> at 0x111756f28>, inv_kw_args=None,
           inverse_func=None, kw_args=None, pass_y='deprecated',
           validate=False)), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))]),
 'union__numeric_subpipeline__imputer': Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0),
 'union__numeric_subpipeline__imputer__axis': 0,
 'union__numeric_subpipeline__imputer__copy': True,
 'union__numeric_subpipeline__imputer__missing_values': 'NaN',
 'union__numeric_subpipeline__imputer__strategy': 'mean',
 'union__numeric_subpipeline__imputer__verbose': 0,
 'union__numeric_subpipeline__memory': None,
 'union__numeric_subpipeline__parser': FunctionTransformer(accept_sparse=False,
           func=<function <lambda> at 0x111756f28>, inv_kw_args=None,
           inverse_func=None, kw_args=None, pass_y='deprecated',
           validate=False),
 'union__numeric_subpipeline__parser__accept_sparse': False,
 'union__numeric_subpipeline__parser__func': <function __main__.<lambda>>,
 'union__numeric_subpipeline__parser__inv_kw_args': None,
 'union__numeric_subpipeline__parser__inverse_func': None,
 'union__numeric_subpipeline__parser__kw_args': None,
 'union__numeric_subpipeline__parser__pass_y': 'deprecated',
 'union__numeric_subpipeline__parser__validate': False,
 'union__numeric_subpipeline__steps': [('parser',
   FunctionTransformer(accept_sparse=False,
             func=<function <lambda> at 0x111756f28>, inv_kw_args=None,
             inverse_func=None, kw_args=None, pass_y='deprecated',
             validate=False)),
  ('imputer',
   Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))],
 'union__text_subpipeline': Pipeline(memory=None,
      steps=[('parser', FunctionTransformer(accept_sparse=False,
           func=<function column_text_processer at 0x12b8d3268>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', validate=False)), ('tokenizer', HashingVectorizer(alternate_sign=True, analyzer='word...kenizer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1036cd6a8>))]),
 'union__text_subpipeline__dim_red1': SelectKBest(k=300, score_func=<function f_regression at 0x1036cd6a8>),
 'union__text_subpipeline__dim_red1__k': 300,
 'union__text_subpipeline__dim_red1__score_func': <function __main__.f_regression>,
 'union__text_subpipeline__memory': None,
 'union__text_subpipeline__parser': FunctionTransformer(accept_sparse=False,
           func=<function column_text_processer at 0x12b8d3268>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', validate=False),
 'union__text_subpipeline__parser__accept_sparse': False,
 'union__text_subpipeline__parser__func': <function __main__.column_text_processer>,
 'union__text_subpipeline__parser__inv_kw_args': None,
 'union__text_subpipeline__parser__inverse_func': None,
 'union__text_subpipeline__parser__kw_args': None,
 'union__text_subpipeline__parser__pass_y': 'deprecated',
 'union__text_subpipeline__parser__validate': False,
 'union__text_subpipeline__steps': [('parser',
   FunctionTransformer(accept_sparse=False,
             func=<function column_text_processer at 0x12b8d3268>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', validate=False)),
  ('tokenizer',
   HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
            decode_error='strict', dtype=<class 'numpy.float64'>,
            encoding='utf-8', input='content', lowercase=True,
            n_features=1048576, ngram_range=(1, 3), non_negative=True,
            norm=None, preprocessor=None, stop_words='english',
            strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
            tokenizer=None)),
  ('dim_red1',
   SelectKBest(k=300, score_func=<function f_regression at 0x1036cd6a8>))],
 'union__text_subpipeline__tokenizer': HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
          decode_error='strict', dtype=<class 'numpy.float64'>,
          encoding='utf-8', input='content', lowercase=True,
          n_features=1048576, ngram_range=(1, 3), non_negative=True,
          norm=None, preprocessor=None, stop_words='english',
          strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
          tokenizer=None),
 'union__text_subpipeline__tokenizer__alternate_sign': True,
 'union__text_subpipeline__tokenizer__analyzer': 'word',
 'union__text_subpipeline__tokenizer__binary': False,
 'union__text_subpipeline__tokenizer__decode_error': 'strict',
 'union__text_subpipeline__tokenizer__dtype': numpy.float64,
 'union__text_subpipeline__tokenizer__encoding': 'utf-8',
 'union__text_subpipeline__tokenizer__input': 'content',
 'union__text_subpipeline__tokenizer__lowercase': True,
 'union__text_subpipeline__tokenizer__n_features': 1048576,
 'union__text_subpipeline__tokenizer__ngram_range': (1, 3),
 'union__text_subpipeline__tokenizer__non_negative': True,
 'union__text_subpipeline__tokenizer__norm': None,
 'union__text_subpipeline__tokenizer__preprocessor': None,
 'union__text_subpipeline__tokenizer__stop_words': 'english',
 'union__text_subpipeline__tokenizer__strip_accents': None,
 'union__text_subpipeline__tokenizer__token_pattern': '[A-Za-z0-9]+(?=\\s+)',
 'union__text_subpipeline__tokenizer__tokenizer': None,
 'union__transformer_list': [('numeric_subpipeline', Pipeline(memory=None,
        steps=[('parser', FunctionTransformer(accept_sparse=False,
             func=<function <lambda> at 0x111756f28>, inv_kw_args=None,
             inverse_func=None, kw_args=None, pass_y='deprecated',
             validate=False)), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))])),
  ('text_subpipeline', Pipeline(memory=None,
        steps=[('parser', FunctionTransformer(accept_sparse=False,
             func=<function column_text_processer at 0x12b8d3268>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', validate=False)), ('tokenizer', HashingVectorizer(alternate_sign=True, analyzer='word...kenizer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1036cd6a8>))]))],
 'union__transformer_weights': None}
In [ ]:
# We notice that the following parameters would be intuitive to tune through GridSearchCV:
#
#   'reg__alpha': 0.5
#   'union__text_subpipeline__dim_red1__k': 300
#   'union__text_subpipeline__tokenizer__ngram_range': (1, 3)
#   'int__degree': 2
#
# Let's set up a hyperparameter space to use GridSearchCV:
In [4]:
param_grid = {
    'reg__alpha': np.linspace(1,0,10),
    'union__text_subpipeline__dim_red1__k': [200,300,400],
    'union__text_subpipeline__tokenizer__ngram_range': [(1,3),(1,4)], # We will tokenize for up to 4-grams 
    'int__degree': [2,3] # We will add interaction terms up to the third polynomial degree 
}

Now we will set up our GridSearchCV estimator using the hyperparameter space we just defined.

As a slight modification, here we will use sklearn's memory/caching support to avoid recomputing the transformer steps and save some time.

Caching the transformers within a pipeline

In [ ]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.externals.joblib import Memory
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


# We first create a temporary folder to store the transformers of our pipeline
temp_folder = mkdtemp()
memory = Memory(cachedir= temp_folder, verbose= 10) # Create our memory 

# Next we need to redefine our pipeline3 with a memory argument to pass the memory we created
memorized_pipeline3 = Pipeline(steps=[
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
    
], memory = memory) 


# Now we are using the cached (memoized) pipeline for GridSearchCV, 
    # using 3-fold cross-validation
    # using the parameter grid space we defined above
    
pl3grid = GridSearchCV(memorized_pipeline3, cv = 3, n_jobs= 1, param_grid= param_grid,scoring='neg_mean_squared_error')

# Finally, we start training the pipeline with GridSearchCV, using the .fit() method and training set:
pl3grid.fit(X_train,y_train)

# We delete the temporary cache folder once training is finished to release the disk space back to the system
rmtree(temp_folder)

Based on this discussion [https://github.com/scikit-learn/scikit-learn/issues/1645], the serialization used by the caching step relies on pickle, which cannot serialize lambda functions. The lambda functions in our pipeline come from the custom utility functions we wrote.
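As a quick illustration of the problem (a minimal check, not part of the original notebook; it assumes Numeric_features is still in memory), pickle rejects a lambda but happily serializes an equivalent named function:

In [ ]:
import pickle

select_numeric_lambda = lambda x: x[Numeric_features]   # the style used in our earlier utility functions

def select_numeric_named(x):
    return x[Numeric_features]

try:
    pickle.dumps(select_numeric_lambda)
except Exception as e:
    print("lambda failed to pickle:", type(e).__name__)

print("named function pickles to", len(pickle.dumps(select_numeric_named)), "bytes")

Therefore, we need to rewrite these utility functions as named functions to make them compatible with the serialization process: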

In [43]:
def column_text_processer_nolambda(df,text_columns = Text_features):
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector) 

def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
    return df[numeric_columns]

#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False) 
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################
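As a quick sanity check (a minimal sketch, assuming the named functions above are defined at the top level of the session), we can verify that the rewritten FunctionTransformer wrappers now pickle without errors, which is exactly what the caching mechanism needs:

In [ ]:
import pickle

# Both wrappers should now serialize, since they reference named, top-level functions
for name, transformer in [("get_numeric_data", get_numeric_data),
                          ("get_text_data", get_text_data)]:
    print(name, "pickles to", len(pickle.dumps(transformer)), "bytes")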

Now, let's redefine our entire data reading and pre-processing pipeline with these custom functions:

In [83]:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data

def column_text_processer_nolambda(df,text_columns = Text_features):
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector) 

def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
    return df[numeric_columns]

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False) 
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer  
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex matches runs of alphanumeric characters that are followed by whitespace

#############################################################################
# Redefine f_regression for feature selection with center=False (sklearn's default is center=True)
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################
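As a side note (a quick illustration outside the pipeline), HashingVectorizer essentially tokenizes with re.findall on token_pattern, so the lookahead in TOKENS_ALPHANUMERIC means a final token with no trailing whitespace is dropped:

In [ ]:
import re

sample = "Razer BlackWidow Chroma Keyboard in great condition"
print(re.findall(TOKENS_ALPHANUMERIC, sample))       # 'condition' is dropped: no whitespace follows it
print(re.findall(TOKENS_ALPHANUMERIC, sample + " ")) # with trailing whitespace, the last token is kept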

Finally, we re-define our memorized pipeline together with hyperparameter search space:

In [ ]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.externals.joblib import Memory
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


# We first create a temporary folder to store the transformers of our pipeline
temp_folder = mkdtemp()
memory = Memory(cachedir= temp_folder, verbose= 10) # Create our memory 


# Next we need to redefine our pipeline3 with a memory argument to pass the memory we created
memorized_pipeline3 = Pipeline(steps=[
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
    
], memory = memory) 

# Our hyperparameter grid
param_grid = {
    'reg__alpha': np.linspace(1,0,10),
    'union__text_subpipeline__dim_red1__k': [200,300,400],
    'union__text_subpipeline__tokenizer__ngram_range': [(1,3),(1,4)], # We will tokenize for up to 4-grams 
    'int__degree': [2,3] # We will add up to the third polymonial degree interactions 
}


# Now we are using the cached (memoized) pipeline for GridSearchCV, 
    # using 3-fold cross-validation
    # using the parameter grid space we defined above

  
pl3grid = GridSearchCV(memorized_pipeline3, 
                       cv = 3, 
                       n_jobs= 1, 
                       param_grid= param_grid,
                       scoring='neg_mean_squared_error')


# Finally, we start training the pipeline with GridSearchCV, using the .fit() method and training set:
pl3grid.fit(X_train,y_train)

# We delete the temporary cache folder once training is finished to release the disk space back to the system
rmtree(temp_folder)
In [79]:
from multiprocessing import cpu_count
cpu_count()
Out[79]:
4

Our take-home messages:

  1. Caching (memoizing) transformer steps is not feasible with a data set this large and a pipeline this complex; the run crashes the machine.
  2. Parallelization with n_jobs = -1 through sklearn's joblib interface is not functional for us; the error seems to persist as of March 2018.

Therefore, our next solution could be to use the Dask package or spark-sklearn for parallelization, although these appear to have their own problems.

Until we understand the nature of these problems better, we should avoid ambitious hyperparameter optimization with GridSearchCV and tune a few parameters at a time instead.
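For reference, one possible direction is to route sklearn's internal joblib calls to a Dask cluster; the sketch below is illustrative only (it assumes dask.distributed and a joblib version with the Dask backend are installed, and it was not run as part of these experiments):

In [ ]:
# Hypothetical sketch: distribute the cross-validation fits over a local Dask cluster
from dask.distributed import Client
import joblib

client = Client(n_workers=3)             # local cluster; could also point at a remote scheduler

with joblib.parallel_backend("dask"):    # sklearn's internal parallelism is dispatched to Dask workers
    pl3grid.fit(X_train, y_train)

client.close()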

Searching for a better Ridge pipeline using RandomizedSearchCV

We realized it is not feasible to use an exhaustive GridSearchCV for hyperparameter optimization given the size of the training data and our limited computational power. However, we should still give RandomizedSearchCV a try: with 10 or 20 iterations it randomly picks combinations of parameters from our hyperparameter bucket, and we can see whether any of them yields a better-performing model. This is in a way a matter of luck and a waiting game, but nothing should stop us from trying, given the possibility of finding a slightly better model.
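To make the idea of "randomly picking combinations" concrete, sklearn's ParameterSampler (shown here purely as an illustration, not part of the training code) draws n_iter candidate settings from the same dictionary we pass to RandomizedSearchCV:

In [ ]:
from sklearn.model_selection import ParameterSampler
import numpy as np

demo_grid = {
    'reg__alpha': np.linspace(1, 0, 10),
    'union__text_subpipeline__dim_red1__k': [200, 300, 400],
    'union__text_subpipeline__tokenizer__ngram_range': [(1, 3), (1, 4)],
}

# Five randomly sampled parameter settings; RandomizedSearchCV fits one model per setting and CV fold
for candidate in ParameterSampler(demo_grid, n_iter=5, random_state=42):
    print(candidate)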

First we get our Ridge pipeline and hyperparameter space as usual:

In [1]:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data

def column_text_processer_nolambda(df,text_columns = Text_features):
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector) 

def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
    return df[numeric_columns]

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False) 
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer  
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex matches runs of alphanumeric characters that are followed by whitespace

#############################################################################
# Redefine f_regression for feature selection with center=False (sklearn's default is center=True)
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################
# Next we define our pipeline:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline


pl3 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ("reg",Ridge(alpha = 0.5)) # Add the RidgeRegression step using alpha = 0.5
])

# We define our hyperparameter grid from which we will randomly select a combination per RandomizedSearchCV iteration
param_grid = {
    'reg__alpha': np.linspace(1,0,10),
    'union__text_subpipeline__dim_red1__k': [200,300,400],
    'union__text_subpipeline__tokenizer__ngram_range': [(1,3),(1,4)], # We will tokenize for up to 4-grams 
}

Finally we perform RandomizedSearchCV over the defined hyperparameter space:

In [ ]:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))

# Now we are using the pipeline 3 for RandomizedSearchCV, 
    # using 2-fold cross-validation at this stage
    # using the parameter grid space we defined above

pl3randomized= RandomizedSearchCV(pl3, 
                                  cv = 2, 
                                  param_distributions= param_grid, 
                                  n_jobs= 3,
                                  scoring='neg_mean_squared_error',
                                  verbose = 10)


# Finally, we start training the pipeline with RandomizedSearchCV, using the .fit() method and training set:
pl3randomized.fit(X_train,y_train)


end = datetime.datetime.now()
print("train end :"+ str(end))

These 20 fits (10 sampled parameter settings x 2 CV folds) took about 5 hours using 3 of the 4 available CPU cores, which is quite reasonable. Let's look at the performance of these attempts:

In [3]:
pl3randomized.best_params_
Out[3]:
{'reg__alpha': 1.0,
 'union__text_subpipeline__dim_red1__k': 300,
 'union__text_subpipeline__tokenizer__ngram_range': (1, 3)}

It turns out that the best parameters we have found are still the ones we used in our original pipeline. This is a good example to demonstrate that a limited hyperparameter search does not necessarily yield a better performing model.
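To double-check this, we can inspect the full search log; cv_results_ keeps one row per sampled parameter combination (a quick optional diagnostic, not part of the original run):

In [ ]:
results = pd.DataFrame(pl3randomized.cv_results_)
results[["params", "mean_test_score", "std_test_score", "rank_test_score"]].sort_values("rank_test_score")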

In [5]:
-pl3randomized.best_score_
Out[5]:
0.3756496823066291
In [ ]:
y_pred_pl3_randomized = pl3randomized.predict(X_train)
np.sqrt(mean_squared_error(y_train,y_pred_pl3_randomized))
In [15]:
import matplotlib.pyplot as plt
plt.scatter(x = y_pred_pl3_randomized, y = y_train, c = "r", s = 0.02,alpha = 0.3)
plt.show()

Training the pipeline using RandomForest regressor

Let's modify our best-performing pipeline 3 above by swapping the regressor for a RandomForest. Starting from a simple parameter setting, we will see whether we can get better model performance without any hyperparameter tuning. Note that in this case we add another feature selection step to the pipeline before the RandomForest model, since we are not performing regularization here:

In [5]:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################

def column_text_processer(df,text_columns = Text_features):
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Join all the strings in a given row to make a vector
    text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    
    # Convert all the text to lowercase and return as pd.Series object to enter the tokenization pipeline
    return text_vector.apply(lambda x: x.lower())  

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# Utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = lambda x: x[Numeric_features], validate=False) # Note: x is the full feature DataFrame passed into the pipeline
get_text_data = FunctionTransformer(column_text_processer,validate=False) # Note how we avoid putting any arguments into column_text_processer
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex matches runs of alphanumeric characters that are followed by whitespace

#############################################################################
# Redefine f_regression for feature selection with center=False (sklearn's default is center=True)
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################

pl4 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ('dim_red2', SelectKBest(f_regression, 300)),
    ("reg",RandomForestRegressor(n_jobs = -1, max_depth= 50)) # We start with these parameters of randomforest
])

Next we train our pipeline using .fit method as before:

In [ ]:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))

pl4.fit(X_train,y_train)

end = datetime.datetime.now()
print("train end :"+ str(end))

Using all the CPU power we have (the n_jobs = -1 argument in RandomForestRegressor), it took about 21 hours to train the RandomForest model, even without any cross-validation, on roughly 890K samples and 300 selected features.

This demonstrates that performing cross-validation or a hyperparameter search with this model and this amount of training data is not feasible with our computational capacity.
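For future experiments, a cheap way to gauge feasibility before committing to a long run (a rough sketch, not executed here) is to time a fit on a small random subsample and extrapolate:

In [ ]:
import datetime
from sklearn.base import clone

# Hypothetical feasibility check: time a fit on ~5% of the training rows
rng = np.random.RandomState(42)
sample_idx = rng.choice(len(X_train), size=int(0.05 * len(X_train)), replace=False)
X_small, y_small = X_train.iloc[sample_idx], y_train[sample_idx]

pl4_small = clone(pl4)                   # fresh, unfitted copy of the pipeline
start = datetime.datetime.now()
pl4_small.fit(X_small, y_small)
print("5% sample fit time:", datetime.datetime.now() - start)
# This only gives a rough (optimistic) lower bound, since tree building scales worse than linearly with sample size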

Let's have a look at our predictions and compare to the Ridge pipeline we trained above:

In [ ]:
y_pred4_train = pl4.predict(X_train)
np.sqrt(mean_squared_error(y_train,y_pred4_train))

It looks like RandomForest fits the training set better than the untuned Ridge. Let's look at the performance on the holdout set:

In [ ]:
# Make predictions using the holdout set
y_pred4_holdout = pl4.predict(X_holdout)
np.sqrt(mean_squared_error(y_holdout,y_pred4_holdout))

This is quite interesting. It appears that the untuned RandomForest pipeline overfits the training set compared to the untuned Ridge model. This is another nice example of how a simple regularized model can perform better than an untuned complex algorithm like RandomForest.

We should remember that if we had the computational power to perform hyperparameter tuning on the RandomForest, we would probably find a better-performing model. In this case, however, computational limitations dictate the most feasible model.

In the next experiment, we will try boosted trees for the same problem using the XGBoost package.

Extreme Gradient Boosting Pipeline

XGBoost has an sklearn-compatible API that integrates into sklearn pipelines. Therefore, we can start from our existing Ridge pipeline, replace only the regressor estimator, and choose a regularization parameter.
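Because a Pipeline exposes each named step as a parameter, the swap itself can be expressed in one call; the snippet below is illustrative only (it assumes the Ridge pipeline object pl3 is still in memory), and in the next cell we rebuild the full pipeline for a clean session anyway:

In [ ]:
from sklearn.base import clone
from xgboost import XGBRegressor

# Hypothetical one-liner swap: clone the Ridge pipeline and replace its final 'reg' step
pl5_sketch = clone(pl3)
pl5_sketch.set_params(reg=XGBRegressor(reg_alpha=0.5, nthread=3, learning_rate=0.2,
                                       objective="reg:linear", n_estimators=10))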

We will start by loading the data and defining our new pipeline:

In [5]:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data

def column_text_processer_nolambda(df,text_columns = Text_features):
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector) 

def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
    return df[numeric_columns]

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from xgboost import XGBRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False) 
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer  
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex matches runs of alphanumeric characters that are followed by whitespace

#############################################################################
# Redefine f_regression for feature selection with center=False (sklearn's default is center=True)
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################
# Next we define our pipeline:


pl5 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use HashingVectorizer for automated tokenization and feature extraction
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()), # Scale the features
    ("reg",XGBRegressor(reg_alpha = 0.5,
                        nthread=3,
                        learning_rate=0.2,
                        objective="reg:linear",
                        n_estimators = 10)) # We start with some sensible parameters and tune later if feasible and necessary
])

Let's start training our new pipeline using the training set:

In [ ]:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))

pl5.fit(X_train,y_train)

end = datetime.datetime.now()
print("train end :"+ str(end))

The XGBoost regressor took only about 7 minutes to train, which is remarkably fast. Let's look at the performance:

In [ ]:
pl5testpred = pl5.predict(X_train)
np.sqrt(mean_squared_error(y_true=y_train,y_pred= pl5testpred))

This performance is not as good as the Ridge pipeline 3. Let's also look at the performance using the holdout set:

In [9]:
pl5testholdout = pl5.predict(X_holdout)
np.sqrt(mean_squared_error(y_true=y_holdout,y_pred= pl5testholdout))
Out[9]:
0.73678521851981316

This is still higher than the holdout rmse of 0.6 we obtained from the Ridge pipeline. Let's try to tune the model if we can.

Hyperparameter tuning using XGBoost pipeline

The XGBoost pipeline trains reasonably fast, so let's perform a random hyperparameter search to see if we can improve the model performance.

We will first prepare a hyperparameter space for the parameters we can tune within the XGBoost regressor, and we will use 2-fold cross-validation to fit as many models as we can afford. Since a single fit took about 7 minutes, trying up to 30 model fits sounds like a reasonable starting budget.
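A quick back-of-envelope check of this budget (simple arithmetic, using the roughly 7-minute single fit observed above):

In [ ]:
minutes_per_fit = 7            # observed single full fit above
n_iter, cv = 15, 2             # sampled candidates and folds per candidate
total_fits = n_iter * cv
print(total_fits, "fits, roughly", total_fits * minutes_per_fit / 60.0, "hours of sequential compute")
# This is only a lower bound: each CV fit also repeats the preprocessing, and the search ends with a final refit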

In [11]:
# The list of tunable parameters in our XGBoost pipeline:
pl5.get_params()
Out[11]:
{'int': SparseInteractions(degree=2, feature_name_separator='_'),
 'int__degree': 2,
 'int__feature_name_separator': '_',
 'memory': None,
 'reg': XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
        learning_rate=0.2, max_delta_step=0, max_depth=3,
        min_child_weight=1, missing=None, n_estimators=10, nthread=3,
        objective='reg:linear', reg_alpha=0.5, reg_lambda=1,
        scale_pos_weight=1, seed=0, silent=True, subsample=1),
 'reg__base_score': 0.5,
 'reg__colsample_bylevel': 1,
 'reg__colsample_bytree': 1,
 'reg__gamma': 0,
 'reg__learning_rate': 0.2,
 'reg__max_delta_step': 0,
 'reg__max_depth': 3,
 'reg__min_child_weight': 1,
 'reg__missing': None,
 'reg__n_estimators': 10,
 'reg__nthread': 3,
 'reg__objective': 'reg:linear',
 'reg__reg_alpha': 0.5,
 'reg__reg_lambda': 1,
 'reg__scale_pos_weight': 1,
 'reg__seed': 0,
 'reg__silent': True,
 'reg__subsample': 1,
 'scaler': MaxAbsScaler(copy=True),
 'scaler__copy': True,
 'steps': [('union', FeatureUnion(n_jobs=1,
          transformer_list=[('numeric_subpipeline', Pipeline(memory=None,
        steps=[('parser', FunctionTransformer(accept_sparse=False,
             func=<function column_numeric_processer_nolambda at 0x1a1eca27b8>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', val...zer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1a1eca2730>))]))],
          transformer_weights=None)),
  ('int', SparseInteractions(degree=2, feature_name_separator='_')),
  ('scaler', MaxAbsScaler(copy=True)),
  ('reg',
   XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
          learning_rate=0.2, max_delta_step=0, max_depth=3,
          min_child_weight=1, missing=None, n_estimators=10, nthread=3,
          objective='reg:linear', reg_alpha=0.5, reg_lambda=1,
          scale_pos_weight=1, seed=0, silent=True, subsample=1))],
 'union': FeatureUnion(n_jobs=1,
        transformer_list=[('numeric_subpipeline', Pipeline(memory=None,
      steps=[('parser', FunctionTransformer(accept_sparse=False,
           func=<function column_numeric_processer_nolambda at 0x1a1eca27b8>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', val...zer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1a1eca2730>))]))],
        transformer_weights=None),
 'union__n_jobs': 1,
 'union__numeric_subpipeline': Pipeline(memory=None,
      steps=[('parser', FunctionTransformer(accept_sparse=False,
           func=<function column_numeric_processer_nolambda at 0x1a1eca27b8>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', validate=False)), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))]),
 'union__numeric_subpipeline__imputer': Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0),
 'union__numeric_subpipeline__imputer__axis': 0,
 'union__numeric_subpipeline__imputer__copy': True,
 'union__numeric_subpipeline__imputer__missing_values': 'NaN',
 'union__numeric_subpipeline__imputer__strategy': 'mean',
 'union__numeric_subpipeline__imputer__verbose': 0,
 'union__numeric_subpipeline__memory': None,
 'union__numeric_subpipeline__parser': FunctionTransformer(accept_sparse=False,
           func=<function column_numeric_processer_nolambda at 0x1a1eca27b8>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', validate=False),
 'union__numeric_subpipeline__parser__accept_sparse': False,
 'union__numeric_subpipeline__parser__func': <function __main__.column_numeric_processer_nolambda>,
 'union__numeric_subpipeline__parser__inv_kw_args': None,
 'union__numeric_subpipeline__parser__inverse_func': None,
 'union__numeric_subpipeline__parser__kw_args': None,
 'union__numeric_subpipeline__parser__pass_y': 'deprecated',
 'union__numeric_subpipeline__parser__validate': False,
 'union__numeric_subpipeline__steps': [('parser',
   FunctionTransformer(accept_sparse=False,
             func=<function column_numeric_processer_nolambda at 0x1a1eca27b8>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', validate=False)),
  ('imputer',
   Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))],
 'union__text_subpipeline': Pipeline(memory=None,
      steps=[('parser', FunctionTransformer(accept_sparse=False,
           func=<function column_text_processer_nolambda at 0x1a1eca2950>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', validate=False)), ('tokenizer', HashingVectorizer(alternate_sign=True, anal...enizer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1a1eca2730>))]),
 'union__text_subpipeline__dim_red1': SelectKBest(k=300, score_func=<function f_regression at 0x1a1eca2730>),
 'union__text_subpipeline__dim_red1__k': 300,
 'union__text_subpipeline__dim_red1__score_func': <function __main__.f_regression>,
 'union__text_subpipeline__memory': None,
 'union__text_subpipeline__parser': FunctionTransformer(accept_sparse=False,
           func=<function column_text_processer_nolambda at 0x1a1eca2950>,
           inv_kw_args=None, inverse_func=None, kw_args=None,
           pass_y='deprecated', validate=False),
 'union__text_subpipeline__parser__accept_sparse': False,
 'union__text_subpipeline__parser__func': <function __main__.column_text_processer_nolambda>,
 'union__text_subpipeline__parser__inv_kw_args': None,
 'union__text_subpipeline__parser__inverse_func': None,
 'union__text_subpipeline__parser__kw_args': None,
 'union__text_subpipeline__parser__pass_y': 'deprecated',
 'union__text_subpipeline__parser__validate': False,
 'union__text_subpipeline__steps': [('parser',
   FunctionTransformer(accept_sparse=False,
             func=<function column_text_processer_nolambda at 0x1a1eca2950>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', validate=False)),
  ('tokenizer',
   HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
            decode_error='strict', dtype=<class 'numpy.float64'>,
            encoding='utf-8', input='content', lowercase=True,
            n_features=1048576, ngram_range=(1, 3), non_negative=True,
            norm=None, preprocessor=None, stop_words='english',
            strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
            tokenizer=None)),
  ('dim_red1',
   SelectKBest(k=300, score_func=<function f_regression at 0x1a1eca2730>))],
 'union__text_subpipeline__tokenizer': HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
          decode_error='strict', dtype=<class 'numpy.float64'>,
          encoding='utf-8', input='content', lowercase=True,
          n_features=1048576, ngram_range=(1, 3), non_negative=True,
          norm=None, preprocessor=None, stop_words='english',
          strip_accents=None, token_pattern='[A-Za-z0-9]+(?=\\s+)',
          tokenizer=None),
 'union__text_subpipeline__tokenizer__alternate_sign': True,
 'union__text_subpipeline__tokenizer__analyzer': 'word',
 'union__text_subpipeline__tokenizer__binary': False,
 'union__text_subpipeline__tokenizer__decode_error': 'strict',
 'union__text_subpipeline__tokenizer__dtype': numpy.float64,
 'union__text_subpipeline__tokenizer__encoding': 'utf-8',
 'union__text_subpipeline__tokenizer__input': 'content',
 'union__text_subpipeline__tokenizer__lowercase': True,
 'union__text_subpipeline__tokenizer__n_features': 1048576,
 'union__text_subpipeline__tokenizer__ngram_range': (1, 3),
 'union__text_subpipeline__tokenizer__non_negative': True,
 'union__text_subpipeline__tokenizer__norm': None,
 'union__text_subpipeline__tokenizer__preprocessor': None,
 'union__text_subpipeline__tokenizer__stop_words': 'english',
 'union__text_subpipeline__tokenizer__strip_accents': None,
 'union__text_subpipeline__tokenizer__token_pattern': '[A-Za-z0-9]+(?=\\s+)',
 'union__text_subpipeline__tokenizer__tokenizer': None,
 'union__transformer_list': [('numeric_subpipeline', Pipeline(memory=None,
        steps=[('parser', FunctionTransformer(accept_sparse=False,
             func=<function column_numeric_processer_nolambda at 0x1a1eca27b8>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', validate=False)), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0))])),
  ('text_subpipeline', Pipeline(memory=None,
        steps=[('parser', FunctionTransformer(accept_sparse=False,
             func=<function column_text_processer_nolambda at 0x1a1eca2950>,
             inv_kw_args=None, inverse_func=None, kw_args=None,
             pass_y='deprecated', validate=False)), ('tokenizer', HashingVectorizer(alternate_sign=True, anal...enizer=None)), ('dim_red1', SelectKBest(k=300, score_func=<function f_regression at 0x1a1eca2730>))]))],
 'union__transformer_weights': None}

Our initial hyperparameter space:

In [14]:
param_space = {
    'reg__learning_rate': np.arange(0.05,1.05,.05),
    'reg__n_estimators': [20],
    'reg__max_depth': [3,6,9],
}

Our RandomizedSearchCV estimator becomes:

In [15]:
from sklearn.model_selection import RandomizedSearchCV
pl5RandomSearch = RandomizedSearchCV(estimator=pl5,cv= 2,n_jobs = 3,
                                     param_distributions= param_space,
                                     n_iter = 15,
                                     scoring = "neg_mean_squared_error",verbose = 10)

Training the RandomizedSearchCV estimator using the training set:

In [ ]:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))

pl5RandomSearch.fit(X_train, y_train)

end = datetime.datetime.now()
print("train end :"+ str(end))

Note that the hyperparameter search took about 6 hours, a little shorter than we expected.

In [19]:
pl5RandomSearch.best_params_
Out[19]:
{'reg__learning_rate': 0.75000000000000011,
 'reg__max_depth': 9,
 'reg__n_estimators': 20}
In [ ]:
 
In [20]:
pl5RandomSearch.best_score_
Out[20]:
-0.37109474934025283
In [21]:
np.sqrt(abs(pl5RandomSearch.best_score_))
Out[21]:
0.60917546679118062
In [ ]:
pl5RandomSearchholdout = pl5RandomSearch.predict(X_holdout)
np.sqrt(mean_squared_error(y_true=y_holdout,y_pred= pl5RandomSearchholdout))

This looks very interesting. The best model we could find with the hyperparameter search slightly overfits the training set (compared to the Ridge pipeline, pl3), but performs almost the same as the Ridge pipeline on the holdout set.

Since the XGBoost model has improved, we can try another round of random search, keeping the learning rate fixed at its best value while widening the other parameters.

In [26]:
param_space2 = {
    'reg__learning_rate': [0.75000000000000011],
    'reg__n_estimators': [20,50,100],
    'reg__max_depth': [9,12,15],
    'reg__reg_alpha': np.arange(0.05,1.05,.05)
}
In [27]:
from sklearn.model_selection import RandomizedSearchCV
pl5RandomSearch2 = RandomizedSearchCV(estimator=pl5,cv= 2,n_jobs = 3,
                                     param_distributions= param_space2,
                                     n_iter = 20,
                                     scoring = "neg_mean_squared_error",verbose = 10)
In [ ]:
import datetime
start = datetime.datetime.now()
print("train start :"+ str(start))

pl5RandomSearch2.fit(X_train, y_train)

end = datetime.datetime.now()
print("train end :"+ str(end))

That was a long search! Let's look at the results:

In [29]:
pl5RandomSearch2.best_params_
Out[29]:
{'reg__learning_rate': 0.7500000000000001,
 'reg__max_depth': 9,
 'reg__n_estimators': 50,
 'reg__reg_alpha': 0.75000000000000011}
In [30]:
pl5RandomSearch2.best_score_
Out[30]:
-0.36246170126796323
In [31]:
np.sqrt(abs(pl5RandomSearch2.best_score_))
Out[31]:
0.6020479227337
In [ ]:
predict_pl5RandomSearch2_holdout = pl5RandomSearch2.predict(X_holdout)
np.sqrt(mean_squared_error(y_holdout,predict_pl5RandomSearch2_holdout))

It was worth the wait! The rmse on the holdout set is 0.595, an improvement over the Ridge pipeline (pl3), which gave an rmse of 0.602.

Next, we try to form an ensemble model using the predictions of these two pipelines.

Ridge and xgboost Model Ensemble

Now we have our two best-performing models (or pipelines): one linear, the other a tree-based learner. We can try to ensemble the predictions of these models to see if we can get better performance on the holdout set.

To begin with, we can use the average of the two models' predictions as the ensemble prediction.

Average model Ensemble

Let's average the predictions of the two pipelines and come up with ensemble predictions for the holdout set:

In [ ]:
# Combine the two predictions in a new dataframe
preds = pd.concat([pd.DataFrame(pl3.predict(X_holdout), columns=['ridge_predictions']),
                   pd.DataFrame(pl5RandomSearch2.predict(X_holdout), columns= ["xgboost_predictions"])], axis = 1)
print(preds.head())
print(preds.shape)
print(y_holdout.shape)

It looks like the predictions of the two pipelines are close, although not identical. Let's take the mean of these predictions and add it as another column.

In [78]:
preds["mean_predictions"] = preds.apply(np.mean,axis=1)
preds.head()
Out[78]:
ridge_predictions xgboost_predictions mean_predictions
0 3.999295 4.336337 4.167816
1 3.030860 2.975331 3.003095
2 2.837267 2.861459 2.849363
3 2.913534 2.901714 2.907624
4 3.040877 3.098519 3.069698

Let's also add the true values into the same data frame:

In [80]:
preds["y_holdout"] = y_holdout
preds.head()
Out[80]:
ridge_predictions xgboost_predictions mean_predictions y_holdout
0 3.999295 4.336337 4.167816 4.521789
1 3.030860 2.975331 3.003095 3.295837
2 2.837267 2.861459 2.849363 2.302585
3 2.913534 2.901714 2.907624 3.044522
4 3.040877 3.098519 3.069698 2.708050

Finally, let's look at the rmse:

In [93]:
rmse = pd.DataFrame(data = {
    "ridge_rmse":np.sqrt(mean_squared_error(y_holdout, preds.ridge_predictions)),
    "xgboost_rmse":np.sqrt(mean_squared_error(y_holdout, preds.xgboost_predictions)),
    "mean_rmse":np.sqrt(mean_squared_error(y_holdout, preds.mean_predictions))
}, columns = ["ridge_rmse","xgboost_rmse","mean_rmse"], index = [0])

rmse.head()
Out[93]:
ridge_rmse xgboost_rmse mean_rmse
0 0.602095 0.59559 0.584237

Nice work! Just by averaging the predictions of the two best pipelines, we were able to reduce the rmse of the holdout-set predictions to 0.584.

This illustrates that when we combine predictions from 'orthogonal' models, we may be able to obtain better predictive performance simply by averaging their predictions.

Note that here we assumed equal weights for the predictions of each pipeline, since we averaged them. Another approach would be to use different weights for each prediction. This amounts to using the predictions of each pipeline as predictors and fitting a new model against the holdout labels (the true values). Since this approach uses the holdout set as a training set, the performance of the resulting ensemble model would ideally need to be validated on another validation set. Since we don't have such a set, we will use cross-validation to estimate the error and come up with a final ensemble model.

Before performing this ensemble pipeline, let's save our predictions data frame for future use:

In [94]:
import pickle
with open("pipeline_holdout_predictions.pkl","wb") as f:
    pickle.dump(preds,f)

Model ensemble using pipeline predictions and cross-validation

Reload the predictions data set:

In [95]:
with open("pipeline_holdout_predictions.pkl","rb") as f:
    preds = pickle.load(f)
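As a sketch of the weighted-ensemble idea described above (illustrative only; it assumes the preds frame and numpy are in memory, and it was not part of the original run), we can treat the two prediction columns as features, fit a small linear blender against the holdout labels, and report a cross-validated rmse since we have no separate validation set:

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_meta = preds[["ridge_predictions", "xgboost_predictions"]].values
y_meta = preds["y_holdout"].values

blender = LinearRegression()
cv_mse = -cross_val_score(blender, X_meta, y_meta, cv=5, scoring="neg_mean_squared_error")
print("cross-validated blended rmse:", np.sqrt(cv_mse.mean()))

blender.fit(X_meta, y_meta)
print("learned weights:", blender.coef_, "intercept:", blender.intercept_)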

Training Deep Neural Networks (Deep Learning) for prediction

Let's complete our work by training a neural network that makes use of our existing NLP and feature extraction pipeline.

  1. We will .fit the transformation steps of pipeline 5 above on the training data. This will be our transformer object.
  2. We will use this pipeline to transform the training set, performing the initial feature extraction and feature selection.
  3. We will divide this set into two halves, training the network on one half and validating it on the other.
  4. Once we obtain a well-performing network, we will prepare the holdout set with the same pipeline (using .transform).
  5. We will feed the transformed holdout set to the trained network, obtain predictions, and evaluate the performance on the holdout set.

Let's get started!

Modified transformer pipeline for feature extraction and selection

We first repeat our data loading steps along with the modified pipeline. Note that we only removed the regressor step from the pipeline and kept everything else the same.

In [2]:
import os
import pandas as pd
import numpy as np
import pickle
from SparseInteractions import * #Load SparseInteractions as a module since it was saved into working directory as SparseInteractions.py

#############################################################################
# Re-read the training and hold out data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv", header=None, names= ["price"])
y_train = y_train.price.values

X_holdout = pd.read_csv("X_holdout.csv")
y_holdout = pd.read_csv("y_holdout.csv", header=None, names= ["price"])
y_holdout = y_holdout.price.values

#############################################################################
# Re-read the pickled feature names
import pickle
with open("Numeric_features.pkl", 'rb') as f:
    Numeric_features = pickle.load(f)
f.close()

with open("Text_features.pkl", 'rb') as f:
    Text_features = pickle.load(f)
f.close()
#############################################################################
# Custom utility functions to parse out numeric and text data

def column_text_processer_nolambda(df,text_columns = Text_features):
    """"A function that will merge/join all text in a given row to make it ready for tokenization. 
    - This function should take care of converting missing values to empty strings. 
    - It should also convert the text to lowercase.
    df= pandas dataframe
    text_columns = names of the text features in df
    """ 
    # Select only non-text columns that are in the df
    text_data = df[text_columns]
    
    # Fill the missing values in text_data using empty strings
    text_data.fillna("",inplace=True)
    
    # Join all the strings in a given row to make a vector
    # text_vector = text_data.apply(lambda x: " ".join(x), axis = 1)
    text_vector = []
    for index,rows in text_data.iterrows():
        text_item = " ".join(rows).lower()
        text_vector.append(text_item)

    # return text_vector as pd.Series object to enter the tokenization pipeline
    return pd.Series(text_vector) 

def column_numeric_processer_nolambda(df,numeric_columns = Numeric_features):
    return df[numeric_columns]

#############################################################################

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.preprocessing import MaxAbsScaler, Imputer, FunctionTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

#############################################################################
# FunctionTransformer wrapper of utility functions to parse text and numeric features
get_numeric_data = FunctionTransformer(func = column_numeric_processer_nolambda, validate=False) 
get_text_data = FunctionTransformer(column_text_processer_nolambda,validate=False) # Note how we avoid putting any arguments into column_text_processer  
#############################################################################

#############################################################################
# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'   # Note: this regex matches runs of alphanumeric characters that are followed by whitespace

#############################################################################
# Redefine f_regression for feature selection with center=False (sklearn's default is center=True)
def f_regression(X,Y):
    import sklearn
    return sklearn.feature_selection.f_regression(X,Y,center = False) # default is center = True

#############################################################################
# Next we define our pipeline:


pl6 = Pipeline([
    
    ("union",FeatureUnion(        #Note that FeatureUnion() accepts list of tuples, the first half of each tuple is the name of the transformer
        
        transformer_list = [
            
            ("numeric_subpipeline", Pipeline([        #Note we have subpipeline branches inside the main pipeline
                
                ("parser",get_numeric_data), # Step1: parse the numeric data (note how we avoid () when using FunctionTransformer objects)
                ("imputer",Imputer()), # Step2: impute missing values
            
            ])), # Branching point of the FeatureUnion
            
            ("text_subpipeline",Pipeline([
            
                ("parser",get_text_data), # Step1: parse the text data 
                ("tokenizer",HashingVectorizer(token_pattern= TOKENS_ALPHANUMERIC,
                                             stop_words = "english",# We will remove English stop words before tokenization
                                             ngram_range = (1,3),
                                             non_negative=True, norm=None, binary=False  
                                            )), # Step2: use CountVectorizer for automated tokenization and feature extraction
                                            ('dim_red1', SelectKBest(f_regression, 300)) # Step3: use dimension reduction to select 300 best features
                
            ]))
        ]
    
    )),# Branching point to the main pipeline: at this point all features are numeric
    
    ("int", SparseInteractions(degree=2)), # Add polynomial interaction terms 
    ("scaler",MaxAbsScaler()) # Scale the features
])

Next we .fit() (i.e. train) pipeline 6 on our training set. Note that the NLP/tokenization and feature-selection steps are fitted on this data, and we will reuse the pipeline to reproducibly extract the same features from the holdout set later on. The important point is to call .fit only on the training set, so that exactly the same features are extracted from any other data set with .transform:

In [ ]:
pl6.fit(X_train,y_train)

Once we have fitted our pipeline, we lock it down and save it as a pickle object to load later on:

In [7]:
import pickle
with open("pl6.pkl","wb") as f:
    pickle.dump(pl6,f)
In [8]:
# Reload the pipeline 6
import pickle
with open("pl6.pkl","rb") as f:
    pl6 = pickle.load(f)

We will use this pipeline to transform the training set, performing the initial feature extraction and feature selection:

In [ ]:
X_train_pl6_transformed = pl6.transform(X_train)
In [10]:
X_train_pl6_transformed.shape
Out[10]:
(889521, 45753)

This is the shape of our transformed training set. Note that we performed tokenization, extracted up to 3-grams, and selected the 300 best features. Finally, we added interaction terms between these features up to the second polynomial degree, which increased the number of features dramatically.
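
To see where the 45,753 columns come from, here is a quick back-of-the-envelope check (assuming SparseInteractions with degree=2 simply appends one column per unordered pair of its input columns):

n_base = 300 + 2                              # 300 selected text features + 2 numeric features
n_interactions = n_base * (n_base - 1) // 2   # 45,451 pairwise interaction columns
n_base + n_interactions                       # 45,753 -- matches the shape above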

Recall that in the earlier models we applied regularization after this step, since this is a lot of feature space to work with. When using neural networks, we can also add regularization. It might even appear that we do not need the interaction terms at all, since neural networks are supposed to learn useful representations of the data, with interactions effectively captured by the hidden layers. Nevertheless, we will move forward with these features to build an intuition about the performance of a neural network.
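
For reference, a minimal sketch of what adding regularization to a Keras layer could look like (not used in the networks below; the penalty of 0.001 and the dropout rate of 0.5 are arbitrary illustrative values):

from keras import layers, regularizers

# an L2-penalized dense layer and a dropout layer, for illustration only
regularized_layer = layers.Dense(16, activation="relu", kernel_regularizer=regularizers.l2(0.001))
dropout_layer = layers.Dropout(0.5)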

We will divide this set into two sets, and train the network using one half and validate the network with the other half:

In [12]:
from sklearn.model_selection import train_test_split

X_net_train,X_net_validation,y_net_train,y_net_validation = train_test_split(X_train_pl6_transformed,
                                                                            y_train,
                                                                            test_size = 0.5,
                                                                            random_state = 423)
In [19]:
print(X_net_train.shape,X_net_validation.shape,y_net_train.shape, y_net_validation.shape)
(444760, 45753) (444761, 45753) (444760,) (444761,)

Let's use these sets to train our network. We will start with a simple network:

In [29]:
from keras import models, metrics, layers

network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_net_train.shape[1],)))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dense(1))
network1.compile(optimizer= "adam", loss= "mean_squared_error", metrics= [metrics.mse])
history_net1 = network1.fit(X_net_train,y_net_train, 
                             epochs=20,batch_size=200,
                             validation_data=(X_net_validation,y_net_validation))
Train on 444760 samples, validate on 444761 samples
Epoch 1/20
444760/444760 [==============================] - 370s 833us/step - loss: 0.7495 - mean_squared_error: 0.7495 - val_loss: 0.3778 - val_mean_squared_error: 0.3778
Epoch 2/20
444760/444760 [==============================] - 370s 832us/step - loss: 0.3483 - mean_squared_error: 0.3483 - val_loss: 0.3622 - val_mean_squared_error: 0.3622
Epoch 3/20
444760/444760 [==============================] - 374s 841us/step - loss: 0.3284 - mean_squared_error: 0.3284 - val_loss: 0.3591 - val_mean_squared_error: 0.3591
Epoch 4/20
444760/444760 [==============================] - 379s 853us/step - loss: 0.3103 - mean_squared_error: 0.3103 - val_loss: 0.3587 - val_mean_squared_error: 0.3587
Epoch 5/20
444760/444760 [==============================] - 393s 883us/step - loss: 0.2934 - mean_squared_error: 0.2934 - val_loss: 0.3620 - val_mean_squared_error: 0.3620
Epoch 6/20
444760/444760 [==============================] - 369s 830us/step - loss: 0.2793 - mean_squared_error: 0.2793 - val_loss: 0.3685 - val_mean_squared_error: 0.3685
Epoch 7/20
444760/444760 [==============================] - 2011s 5ms/step - loss: 0.2671 - mean_squared_error: 0.2671 - val_loss: 0.3758 - val_mean_squared_error: 0.3758
Epoch 8/20
444760/444760 [==============================] - 374s 842us/step - loss: 0.2566 - mean_squared_error: 0.2566 - val_loss: 0.3800 - val_mean_squared_error: 0.3800
Epoch 9/20
444760/444760 [==============================] - 382s 858us/step - loss: 0.2476 - mean_squared_error: 0.2476 - val_loss: 0.3870 - val_mean_squared_error: 0.3870
Epoch 10/20
444760/444760 [==============================] - 374s 842us/step - loss: 0.2393 - mean_squared_error: 0.2393 - val_loss: 0.3941 - val_mean_squared_error: 0.3941
Epoch 11/20
444760/444760 [==============================] - 370s 831us/step - loss: 0.2323 - mean_squared_error: 0.2323 - val_loss: 0.3967 - val_mean_squared_error: 0.3967
Epoch 12/20
444760/444760 [==============================] - 371s 834us/step - loss: 0.2258 - mean_squared_error: 0.2258 - val_loss: 0.4052 - val_mean_squared_error: 0.4052
Epoch 13/20
444760/444760 [==============================] - 373s 839us/step - loss: 0.2199 - mean_squared_error: 0.2199 - val_loss: 0.4125 - val_mean_squared_error: 0.4125
Epoch 14/20
444760/444760 [==============================] - 29154s 66ms/step - loss: 0.2147 - mean_squared_error: 0.2147 - val_loss: 0.4176 - val_mean_squared_error: 0.4176
Epoch 15/20
444760/444760 [==============================] - 368s 829us/step - loss: 0.2098 - mean_squared_error: 0.2098 - val_loss: 0.4225 - val_mean_squared_error: 0.4225
Epoch 16/20
444760/444760 [==============================] - 8453s 19ms/step - loss: 0.2052 - mean_squared_error: 0.2052 - val_loss: 0.4277 - val_mean_squared_error: 0.4277
Epoch 17/20
444760/444760 [==============================] - 368s 827us/step - loss: 0.2011 - mean_squared_error: 0.2011 - val_loss: 0.4301 - val_mean_squared_error: 0.4301
Epoch 18/20
444760/444760 [==============================] - 3933s 9ms/step - loss: 0.1972 - mean_squared_error: 0.1972 - val_loss: 0.4315 - val_mean_squared_error: 0.4315
Epoch 19/20
444760/444760 [==============================] - 371s 834us/step - loss: 0.1937 - mean_squared_error: 0.1937 - val_loss: 0.4407 - val_mean_squared_error: 0.4407
Epoch 20/20
444760/444760 [==============================] - 2175s 5ms/step - loss: 0.1902 - mean_squared_error: 0.1902 - val_loss: 0.4452 - val_mean_squared_error: 0.4452

Calling the .evaluate method on the training set gives the loss and metric of the final network on that set:

In [30]:
network1.evaluate(X_net_train,y_net_train)
444760/444760 [==============================] - 214s 482us/step
Out[30]:
[0.17616860371869356, 0.17616860371869356]

Similarly, for the validation set:

In [31]:
network1.evaluate(X_net_validation,y_net_validation)
444761/444761 [==============================] - 219s 492us/step
Out[31]:
[0.44524100941222505, 0.44524100941222505]

Monitoring the performance of the network across epochs

We note that the final network overfits. Therefore, we need to find the epoch at which the model starts overfitting by extracting the loss and the metric (here they are identical, since both are the mean squared error) at each epoch:

In [49]:
import matplotlib.pyplot as plt

history = history_net1.history
train_rmse = np.sqrt(history["mean_squared_error"])
val_rmse = np.sqrt(history["val_mean_squared_error"])
epochs = np.arange(1,21,1)

plt.plot(epochs,train_rmse,"rD:",label = "Training rmse")  # See help(plt.plot) for many other available plotting options
plt.plot(epochs,val_rmse,"bo-",label = "Validation rmse")
plt.title("Training vs. Validation RMSE")
plt.xlabel("epochs")
plt.ylabel("RMSE")
plt.legend()
plt.show()
In [50]:
val_rmse
Out[50]:
array([ 0.6146642 ,  0.60180312,  0.59925333,  0.59893295,  0.60165737,
        0.60703129,  0.61301899,  0.61645183,  0.6220756 ,  0.62775259,
        0.62986071,  0.63654109,  0.64225062,  0.64623833,  0.65003685,
        0.65398622,  0.65581378,  0.6568958 ,  0.66387125,  0.66726382])
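
As a quick check, the best epoch can also be read off programmatically from the val_rmse array computed above (a small sketch; epochs are 1-indexed):

best_epoch = int(np.argmin(val_rmse)) + 1
best_epoch   # 4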

Re-train the network for sufficient epochs to avoid overfitting

We found that the network starts overfitting after the 4th epoch. Therefore, we re-fit the network for only 4 epochs before evaluating its performance on the holdout set:

In [51]:
from keras import models, metrics, layers

network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_net_train.shape[1],)))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dense(1))
network1.compile(optimizer= "adam", loss= "mean_squared_error", metrics= [metrics.mse])
history_net1 = network1.fit(X_net_train,y_net_train, 
                             epochs=4,batch_size=200,
                             validation_data=(X_net_validation,y_net_validation))
Train on 444760 samples, validate on 444761 samples
Epoch 1/4
444760/444760 [==============================] - 407s 915us/step - loss: 0.6064 - mean_squared_error: 0.6064 - val_loss: 0.3668 - val_mean_squared_error: 0.3668
Epoch 2/4
444760/444760 [==============================] - 389s 875us/step - loss: 0.3438 - mean_squared_error: 0.3438 - val_loss: 0.3610 - val_mean_squared_error: 0.3610
Epoch 3/4
444760/444760 [==============================] - 358s 806us/step - loss: 0.3207 - mean_squared_error: 0.3207 - val_loss: 0.3549 - val_mean_squared_error: 0.3549
Epoch 4/4
444760/444760 [==============================] - 522s 1ms/step - loss: 0.3014 - mean_squared_error: 0.3014 - val_loss: 0.3573 - val_mean_squared_error: 0.3573

Let's save this trained network for future use. We can save the network as an HDF5 file using the .save method:

In [54]:
network1.save("network1.h5")

Prepare the holdout set using the pipeline (using .transform)

Now we need to prepare the holdout set so that we can make predictions with the trained network. The holdout set must be transformed in exactly the same way as the training set, using pipeline 6.

Calling pipeline 6's .transform method ensures that the same features are extracted from the holdout set, making it ready to feed into the network we trained:

In [ ]:
X_net_holdout = pl6.transform(X_holdout)
In [56]:
X_net_holdout.shape
Out[56]:
(593014, 45753)

Note that the .transform method uses only the features (not the target) and converts the holdout set into the format expected by the network: exactly the same 45,753 features were created from the X_holdout data set.
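
A quick way to verify this (a small sketch using the objects defined above):

assert X_net_holdout.shape[1] == X_train_pl6_transformed.shape[1]   # both 45753 columns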

Let's evaluate the RMSE of the trained network on the holdout set:

In [57]:
np.sqrt(network1.evaluate(X_net_holdout,y_holdout))
593014/593014 [==============================] - 9086s 15ms/step
Out[57]:
array([ 0.5977123,  0.5977123])

The RMSE we obtained from the first network is comparable with that of the tuned xgboost model. Before finalizing a network, let's see whether we can improve on network 1 by increasing model complexity.

Increasing network complexity to test for improved performance

In the hope of increasing model performance, we simply add another layer with the same number of neurons, and include an early stopping monitor so that training stops once the validation score no longer improves.

In [58]:
from keras import models, metrics, layers
from keras.callbacks import EarlyStopping

estop_monitor = EarlyStopping(patience= 1)
network2 = models.Sequential()
network2.add(layers.Dense(16,activation="relu", input_shape = (X_net_train.shape[1],)))
network2.add(layers.Dense(16,activation="relu"))
network2.add(layers.Dense(16,activation="relu"))
network2.add(layers.Dense(1))
network2.compile(optimizer= "adam", loss= "mean_squared_error", metrics= [metrics.mse])
history_net2 = network2.fit(X_net_train,y_net_train, 
                             epochs=20,batch_size=200,
                             callbacks = [estop_monitor],
                             validation_data=(X_net_validation,y_net_validation))
Train on 444760 samples, validate on 444761 samples
Epoch 1/20
444760/444760 [==============================] - 375s 842us/step - loss: 0.5658 - mean_squared_error: 0.5658 - val_loss: 0.3691 - val_mean_squared_error: 0.3691
Epoch 2/20
444760/444760 [==============================] - 3958s 9ms/step - loss: 0.3396 - mean_squared_error: 0.3396 - val_loss: 0.3579 - val_mean_squared_error: 0.3579
Epoch 3/20
444760/444760 [==============================] - 357s 802us/step - loss: 0.3195 - mean_squared_error: 0.3195 - val_loss: 0.3552 - val_mean_squared_error: 0.3552
Epoch 4/20
444760/444760 [==============================] - 367s 826us/step - loss: 0.3019 - mean_squared_error: 0.3019 - val_loss: 0.3567 - val_mean_squared_error: 0.3567

It looks like the validation score improved slightly. Let's evaluate the new model using the transformed holdout set:

In [59]:
np.sqrt(network2.evaluate(X_net_holdout,y_holdout))
593014/593014 [==============================] - 9851s 17ms/step
Out[59]:
array([ 0.59729053,  0.59729053])

Network 2 is a slight improvement over network 1, though the difference may not be substantial enough to justify the extra training time. Nevertheless, we will save network 2 and use it to generate the predictions:

In [60]:
network2.save("network2.h5")
In [65]:
network2.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_15 (Dense)             (None, 16)                732064    
_________________________________________________________________
dense_16 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_17 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 17        
=================================================================
Total params: 732,625
Trainable params: 732,625
Non-trainable params: 0
_________________________________________________________________
In [66]:
holdout_pred_network2 = network2.predict(X_net_holdout)

Ensemble model using Neural Network, Xgboost and Ridge

We now have our best-performing models from three families that we believe to be orthogonal; in other words, they are likely to explain different portions of the variation in the data. They are fundamentally different forms of learners: linear, tree-based, and neural network. Therefore, we hope to improve our predictive capacity by ensembling these 3 models.

We will keep it simple and use the average of the three models' predictions as our ensemble prediction.

In [71]:
# Load back the predictions from other models

with open("pipeline_holdout_predictions.pkl","rb") as f:
    preds = pickle.load(f)
In [72]:
preds.head()
Out[72]:
ridge_predictions xgboost_predictions mean_predictions y_holdout
0 3.999295 4.336337 4.167816 4.521789
1 3.030860 2.975331 3.003095 3.295837
2 2.837267 2.861459 2.849363 2.302585
3 2.913534 2.901714 2.907624 3.044522
4 3.040877 3.098519 3.069698 2.708050
In [87]:
preds3 = preds.iloc[:,0:2]
In [88]:
preds3.head()
Out[88]:
ridge_predictions xgboost_predictions
0 3.999295 4.336337
1 3.030860 2.975331
2 2.837267 2.861459
3 2.913534 2.901714
4 3.040877 3.098519
In [89]:
preds3["network2_predictions"] = holdout_pred_network2
In [90]:
preds3.head()
Out[90]:
ridge_predictions xgboost_predictions network2_predictions
0 3.999295 4.336337 3.808022
1 3.030860 2.975331 3.030390
2 2.837267 2.861459 2.871538
3 2.913534 2.901714 3.203488
4 3.040877 3.098519 3.046734
In [93]:
preds3["mean_predictions"] = preds3.apply(np.mean,axis = 1)
In [94]:
preds3.head()
Out[94]:
ridge_predictions xgboost_predictions network2_predictions mean_predictions
0 3.999295 4.336337 3.808022 4.047885
1 3.030860 2.975331 3.030390 3.012194
2 2.837267 2.861459 2.871538 2.856755
3 2.913534 2.901714 3.203488 3.006245
4 3.040877 3.098519 3.046734 3.062043
In [95]:
np.sqrt(mean_squared_error(y_holdout,preds3.mean_predictions))
Out[95]:
0.57945783280159124
In [97]:
rmse = pd.DataFrame(data = {
    "ridge_rmse":np.sqrt(mean_squared_error(y_holdout, preds3.ridge_predictions)),
    "xgboost_rmse":np.sqrt(mean_squared_error(y_holdout, preds3.xgboost_predictions)),
    "network2_rmse":np.sqrt(mean_squared_error(y_holdout,preds3.network2_predictions)),
    "mean_rmse":np.sqrt(mean_squared_error(y_holdout, preds3.mean_predictions))
}, columns = ["ridge_rmse","xgboost_rmse","network2_rmse","mean_rmse"], index = [0])

rmse.head()
Out[97]:
ridge_rmse xgboost_rmse network2_rmse mean_rmse
0 0.602095 0.59559 0.597291 0.579458

Very nice! By including the neural network in our ensemble, we reduced the RMSE to 0.579. This demonstrates how predictive performance can be improved by training and optimizing orthogonal models and combining them into an ensemble for the final prediction.

In [99]:
# Save the dataframe containing holdout predictions from 3 models
with open("pipeline_holdout_predictions.pkl","wb") as f:
    pickle.dump(preds3,f)