1 Introduction and summary

Here I will explore the Titanic data set using simple, parsimonious linear models, then perform feature engineering, missing value imputation and dimension reduction. Finally, I train classifiers using the random forest, support vector machine and gradient boosted tree algorithms. I will also explore stacking these classifiers and tuning the models to improve the accuracy of the survival predictions.

Thanks for reading my Kernel and have a great time here!

2 Loading and partitioning the data

Load the data sets:

training <- read.csv("train.csv", stringsAsFactors = FALSE, na.strings = "")
testing <- read.csv("test.csv", stringsAsFactors = FALSE,na.strings = "")

Randomly partition the labelled training set into a training subset (70%) and a tuning/test subset (30%):

library(caret)
set.seed(1234)
InTrain <- createDataPartition(y=training$Survived,p = 0.7,list = FALSE)
tune.test.set <- training[-InTrain,]
training <- training[InTrain,]

Summarize the training data set:

summary(training)
##   PassengerId       Survived         Pclass          Name          
##  Min.   :  1.0   Min.   :0.000   Min.   :1.000   Length:624        
##  1st Qu.:232.5   1st Qu.:0.000   1st Qu.:2.000   Class :character  
##  Median :450.5   Median :0.000   Median :3.000   Mode  :character  
##  Mean   :448.5   Mean   :0.383   Mean   :2.303                     
##  3rd Qu.:663.2   3rd Qu.:1.000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.000   Max.   :3.000                     
##                                                                    
##      Sex                 Age            SibSp            Parch       
##  Length:624         Min.   : 0.75   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :28.50   Median :0.0000   Median :0.0000  
##                     Mean   :29.92   Mean   :0.5096   Mean   :0.3782  
##                     3rd Qu.:38.50   3rd Qu.:1.0000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :5.0000  
##                     NA's   :129                                      
##     Ticket               Fare            Cabin          
##  Length:624         Min.   :  0.000   Length:624        
##  Class :character   1st Qu.:  7.896   Class :character  
##  Mode  :character   Median : 14.458   Mode  :character  
##                     Mean   : 31.219                     
##                     3rd Qu.: 31.621                     
##                     Max.   :512.329                     
##                                                         
##    Embarked        
##  Length:624        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
table(training$Sex)
## 
## female   male 
##    216    408
table(training$Embarked)
## 
##   C   Q   S 
## 122  51 450
table(training$Survived)
## 
##   0   1 
## 385 239

The data is imbalanced in terms of gender as well as the outcome class. This will not prevent us from developing classifiers, but it will probably make the task harder!
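To quantify the imbalance rather than just eyeballing the raw counts, a quick sketch of the class proportions (the exact numbers depend on the random partition) could look like this:

# Proportions of the outcome class and of gender in the training partition
prop.table(table(training$Survived))
prop.table(table(training$Sex))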


3 Developing expectations from the data: basic exploratory data analysis

Let’s start developing expectations from the data. Note that we will only use the training data set for exploration and feature engineering. This is important to avoid over-fitting and to keep our final out-of-sample accuracy estimate unbiased.


3.1 Relationships between Age and Family size

#Exploration by pairs plot:

pairs(Survived ~ Age+ SibSp+Parch+Fare,pch =19, cex = 0.4,data=training,
      col= ifelse(training$Survived == 1, "navy","red"))

Without concluding too much from this plot, we can immediately start noticing certain relationships.

  • Age and SibSp, and Age and Parch, are inversely related. This is intuitive: children tend to travel with several siblings and with their parents, while older passengers tend to travel with fewer family members aboard.
  • There seems to be a relationship between Age, Fare (and perhaps Pclass) and Survival. We will explore this further.

3.2 The survival probability of lower-class passengers is significantly lower than that of 1st class passengers

An intuitive hypothesis is that passengers travelling in the higher cabin classes were more likely to survive.

table(training$Survived,training$Pclass)
##    
##       1   2   3
##   0  56  69 260
##   1  97  60  82

We see that clearly fewer people survived from the 2nd and 3rd classes. We can go ahead and test whether that difference is statistically significant. Fitting a logistic regression model between these two variables shows that, indeed, the probability of survival is significantly lower in 2nd class than in 1st class (p < 0.01), and lower still in 3rd class compared to 1st class (p < 0.001).

summary(glm(factor(Survived) ~ factor(Pclass), data = training, family = "binomial"))
## 
## Call:
## glm(formula = factor(Survived) ~ factor(Pclass), family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4178  -0.7404  -0.7404   0.9547   1.6900  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.5494     0.1678   3.273  0.00106 ** 
## factor(Pclass)2  -0.6891     0.2436  -2.829  0.00467 ** 
## factor(Pclass)3  -1.7033     0.2103  -8.101 5.44e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 830.57  on 623  degrees of freedom
## Residual deviance: 755.94  on 621  degrees of freedom
## AIC: 761.94
## 
## Number of Fisher Scoring iterations: 4

Therefore, the Pclass feature is an important one to keep in our data set.


3.3 Females have a significantly higher probability of survival

In the event of a crisis, women and children are expected to get priority in the rescue efforts. We indeed notice that far more females survived, despite making up a smaller proportion of the passengers.

table(training$Survived,training$Sex)
##    
##     female male
##   0     52  333
##   1    164   75

We can also investigate whether this difference is statistically significant by fitting a logistic regression model. The beauty of logistic regression is that we can tailor our scientific question by including potential confounders in the model. In this case we suspect that Age could be a confounder for the observed relationship between Sex and Survival. We also add the interaction term to see whether the effect of Sex is modified by Age.

summary(glm(Survived ~  Age*Sex, data = training, family = "binomial"))
## 
## Call:
## glm(formula = Survived ~ Age * Sex, family = "binomial", data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9548  -0.7126  -0.5914   0.7092   2.1966  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.79165    0.40478   1.956  0.05050 . 
## Age          0.01918    0.01403   1.367  0.17164   
## Sexmale     -1.57094    0.51413  -3.056  0.00225 **
## Age:Sexmale -0.03842    0.01718  -2.237  0.02529 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 669.39  on 494  degrees of freedom
## Residual deviance: 500.37  on 491  degrees of freedom
##   (129 observations deleted due to missingness)
## AIC: 508.37
## 
## Number of Fisher Scoring iterations: 4

Fascinating! First, we notice that males indeed have a significantly lower probability of survival (p < 0.01), even after adjusting for Age and the Age:Sex interaction. It is also worth noting that the interaction between the Age and Sex predictors is statistically significant (p < 0.05).

Therefore, sex is another feature we will definitely keep to train our classifiers.


3.4 Fare variable: does it explain any variability beyond Pclass?

As we expected from the Pclass - Survival relationship, passengers who paid higher fares were more likely to survive:

qplot(y = log(Fare), x = factor(Survived), data = training, geom = "boxplot", fill = factor(Survived))+theme_bw()+scale_fill_manual(values = c("red","blue"))

The question is: does the Fare feature explain any variability beyond Pclass? If not, we should drop it, since two collinear features add redundancy and can destabilize the estimates when training our classifiers.

When we fit Pclass together with Fare:

summary(glm(Survived ~ factor(Pclass) + Fare, data = training, family = "binomial"))
## 
## Call:
## glm(formula = Survived ~ factor(Pclass) + Fare, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0701  -0.7697  -0.7248   1.0894   1.7390  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.026382   0.285894  -0.092   0.9265    
## factor(Pclass)2 -0.274404   0.293648  -0.934   0.3501    
## factor(Pclass)3 -1.236542   0.279913  -4.418 9.98e-06 ***
## Fare             0.007772   0.003309   2.349   0.0188 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 830.57  on 623  degrees of freedom
## Residual deviance: 749.26  on 620  degrees of freedom
## AIC: 757.26
## 
## Number of Fisher Scoring iterations: 4

Interestingly, Fare remains significant (p < 0.05) in the presence of the Pclass feature. This implies that there is additional variability in the Survival outcome explained by Fare, perhaps because Fare is continuous while Pclass is categorical; the two therefore appear to explain some non-overlapping variance.
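For a more formal check than the individual Wald p-value, a likelihood ratio test comparing the two nested models could be used; a small sketch that refits the two models above under new names (fit.pclass and fit.pclass.fare are my own labels) and compares them:

# Does adding Fare significantly reduce the residual deviance beyond Pclass alone?
fit.pclass <- glm(Survived ~ factor(Pclass), data = training, family = "binomial")
fit.pclass.fare <- glm(Survived ~ factor(Pclass) + Fare, data = training, family = "binomial")
anova(fit.pclass, fit.pclass.fare, test = "Chisq")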

We will also keep the Fare feature to train our models.


4 Feature Engineering

At this stage we will continue to explore the training data set, with the difference that we will try to extract/engineer different features that are not immediately available to us. We will perform preliminary analyses to test the potential utility of these features.

As always, we will perform all feature engineering, imputation and dimension reduction on the training data set and apply exactly the same transformations to the test data sets.


4.1 Generate categorical variables

It is more useful to model a few features as categorical variables:

training <- transform(training,Survived = factor(Survived), Pclass = factor(Pclass),
          Sex = factor(Sex),SibSp = factor(SibSp), Parch = factor(Parch), Embarked = factor(Embarked))
tune.test.set <- transform(tune.test.set,Survived = factor(Survived), Pclass = factor(Pclass),Sex = factor(Sex),SibSp = factor(SibSp), Parch = factor(Parch), Embarked = factor(Embarked))

testing <- transform(testing, Pclass = factor(Pclass),
          Sex = factor(Sex),SibSp = factor(SibSp), Parch = factor(Parch), Embarked = factor(Embarked))

4.2 Remove the passenger ID column from the training sets

This feature has no classification power and would only add noise to our classifiers:

library(dplyr)
training <- dplyr::select(training,-PassengerId)
tune.test.set <- dplyr::select(tune.test.set,-PassengerId)

4.3 Investigate the missing values: explore opportunities for imputation

training.na <- as.data.frame(is.na(training));names(training.na) <- names(training)
apply(training.na,2,sum)
## Survived   Pclass     Name      Sex      Age    SibSp    Parch   Ticket 
##        0        0        0        0      129        0        0        0 
##     Fare    Cabin Embarked 
##        0      476        1

We notice that missing values appear in 3 variables: Age, Cabin and Embarked.


4.3.1 Cabin feature: more than 75% missing data

From the available observations, we can infer that the first letter represents the cabin section (deck). Let's extract the first letter from the available values and test whether it has any predictive value:

library(ggplot2); library(dplyr)
Cabin.letter <-substr(training$Cabin[!training.na$Cabin],1,1)
Cabin.survival <- training$Survived[!training.na$Cabin]
Cabin.Pclass <- training$Pclass[!training.na$Cabin]
qplot(x = factor(Cabin.letter), fill = Cabin.survival)+scale_fill_manual(values = c("red","navy"))+theme_bw()

We notice that this feature might have some predictive value. Let's check whether the cabin letter relates to Pclass:

qplot(x = factor(Cabin.letter), fill = Cabin.Pclass)+scale_fill_manual(values = c("red","navy","green"))+theme_bw()

We indeed notice that cabin letters A, B and C are exclusively first class and could therefore be inferred from the Pclass variable. D and E are also more likely to be 1st class, while the G cabins all come from 3rd class.

We can also check whether Fare relates to the cabin letter:

Cabin.Fare <- training$Fare[!training.na$Cabin]
qplot(x = factor(Cabin.letter), y=Cabin.Fare, color = Cabin.survival)+scale_color_manual(values = c("red","navy"))+theme_bw()

The relationship between Fare and the cabin letter is not strong enough to allow reliable imputation.

The problem with imputing the Cabin feature from Pclass is that we don't know whether the missing values are MCAR (Missing Completely At Random) or whether the reason they are missing is related to the outcome (Survival).
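One informal way to probe this concern (a quick sketch, not a formal test of the missingness mechanism) is to check whether the missingness indicator itself is associated with the outcome:

# Is the fact that Cabin is missing associated with Survival?
table(Cabin.missing = is.na(training$Cabin), Survived = training$Survived)
chisq.test(is.na(training$Cabin), training$Survived)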

Therefore, it would be more sensible to drop this variable and not use it when building our classifiers.

Remove Cabin feature from all data sets:

library(dplyr)
training <- dplyr::select(training, - Cabin)
tune.test.set <- dplyr::select(tune.test.set, -Cabin)
testing <- dplyr::select(testing, -Cabin)

4.3.2 Age feature: imputing missing data by using a multivariate linear model

We recall the relationship between Age and family size from our earlier pairs plot. Therefore, one intuitive candidate for the imputation model is the SibSp feature; let's compare it with Age:

qplot(x = SibSp, y = Age, color = Survived, data = training)+
       theme_bw()+scale_color_manual(values = c("red","navy"))+theme_bw()

We notice that as SibSp increases, the typical age decreases. Let's look at the distribution of Age across the SibSp levels.

ggplot(data = training, aes(x = Age, fill= Survived))+
        geom_histogram(bins = 40)+facet_grid(. ~ SibSp)+ scale_fill_manual(values = c("red","navy"))+
        theme_bw()

Aha! Most passengers have either zero or one sibling/spouse aboard. For the one-sibling/spouse group the mean age appears higher; in both cases Age is approximately normally distributed. The problem with SibSp is that some of its factor levels are only present in observations where Age is missing, making this predictor unbalanced across the missing and complete cases.

It would also be interesting to explore gender differences in Age:

ggplot(data = training, aes(x = Age, fill= Survived))+
        geom_histogram(bins = 40)+facet_grid(. ~ Sex)+ scale_fill_manual(values = c("red","navy"))+
        theme_bw()

How about the Age distribution in different passenger classes? (Pclass):

ggplot(data = training, aes(x = Age, fill= Survived))+
        geom_histogram(bins = 40)+facet_grid(. ~ Pclass)+ scale_fill_manual(values = c("red","navy"))+
        theme_bw()

as.data.frame(training %>% group_by(Pclass,Survived) %>% summarise(mean.Age = mean(Age,na.rm=T)))

This is quite interesting! In all passenger classes, the mean age of the passengers who survived is lower than that of those who perished. We also notice that the mean age decreases as the class number increases, i.e. older passengers are, on average, in the better classes. Therefore, Pclass should also be included in the imputation model.

It would therefore be sensible to impute Age with a Gaussian, regression-based imputation that uses the mean Age within the individual factor levels of these predictors. We will include Sex in the model to account for gender-specific differences in Age, as well as Pclass to account for the interesting separation of Age by passenger class we noted above.

In order to do this more rigorously, we first fit a linear model to the complete Age, Sex and Pclass data:

4.3.2.1 Fitting multivariate Age imputation model

The model will become:

Age ~ Sex + Pclass + e (random Gaussian error)

lmAge = lm(Age ~ Sex + Pclass, data = training, na.action = "na.omit")
summary(lmAge)
## 
## Call:
## lm(formula = Age ~ Sex + Pclass, data = training, na.action = "na.omit")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.261  -8.119  -0.634   7.819  47.366 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   35.605      1.339  26.600  < 2e-16 ***
## Sexmale        5.577      1.257   4.436 1.13e-05 ***
## Pclass2       -8.949      1.658  -5.397 1.06e-07 ***
## Pclass3      -14.548      1.439 -10.112  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.14 on 491 degrees of freedom
##   (129 observations deleted due to missingness)
## Multiple R-squared:  0.1838, Adjusted R-squared:  0.1788 
## F-statistic: 36.86 on 3 and 491 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lmAge)[1:4]

## NULL

This model is just OK, but we don't need a perfect model for this type of imputation. It is nice to see that both the constant-variance and normality assumptions of the model hold, and that all levels of the covariates have a significant impact on the mean outcome in the presence of each other. This should give us a good estimate for the missing values of Age.

Just to check whether our imputation model yields a distribution similar to the original Age:

complete.cases.Age <- training$Age[complete.cases(training$Age)]
after.imputation.Age <-c(complete.cases.Age,predict(lmAge, newdata = training[is.na(training$Age),]))
just.imputed.observations.Age <- after.imputation.Age[!complete.cases(training$Age)]

par(mfrow = c(1,3))
hist(complete.cases.Age,breaks = 20,col = "navy");
hist(after.imputation.Age,breaks = 20,col = "lightgreen");
hist(just.imputed.observations.Age,breaks = 20, col = "purple")

Our imputation model does a nice job!
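As an aside, predicting only the conditional mean shrinks the imputed values toward the group means. If we wanted the imputed ages to retain the observed spread, a stochastic variant of the same model (a sketch using the residual standard error of lmAge; not used below) could add Gaussian noise to each prediction:

# Sketch of stochastic regression imputation: prediction + noise ~ N(0, residual SD)
set.seed(2017)
missing.idx <- is.na(training$Age)
stochastic.Age <- predict(lmAge, newdata = training[missing.idx, ]) +
        rnorm(sum(missing.idx), mean = 0, sd = sigma(lmAge))
stochastic.Age <- pmax(stochastic.Age, 0.5)  # keep imputed ages positive

We will stick with the simpler deterministic version here.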

Using the model to impute missing values of Age:

training$Age[is.na(training$Age)] = predict(lmAge, newdata = training[is.na(training$Age),])
tune.test.set$Age[is.na(tune.test.set$Age)] = predict(lmAge, newdata = tune.test.set[is.na(tune.test.set$Age),])
testing$Age[is.na(testing$Age)] = predict(lmAge, newdata = testing[is.na(testing$Age),])

Note that we use the same model object we derived from the training data set to impute the missing values of Age for all data sets. This will prevent us from over-fitting to the test data set.

Re-investigate the missing values:

training.na <- as.data.frame(is.na(training));names(training.na) <- names(training)
tune.testing.na <- as.data.frame(is.na(tune.test.set));names(tune.testing.na) <- names(tune.test.set)
testing.na <- as.data.frame(is.na(testing));names(testing.na) <- names(testing)
data.frame(apply(training.na,2,sum),apply(tune.testing.na ,2,sum),apply(testing.na ,2,sum))

It appears that only one case is left missing in each of the data sets. We will treat this as random missingness and keep only the complete cases in each data set:

training <- training[complete.cases(training),]
tune.test.set <- tune.test.set[complete.cases(tune.test.set),]
testing <- testing[complete.cases(testing),]

Now we have completed processing the missing values.

4.4 Generate Dummy Variables with Categorical variables

Before we start building our classifiers, it is desirable to convert the categorical variables to dummy variables (0 or 1). This keeps the models parsimonious if we later need to interpret the results of any decision trees, and many machine learning packages work better with dummy variables.

library(caret)
which(sapply(training[,-1],is.factor))
##   Pclass      Sex    SibSp    Parch Embarked 
##        1        3        5        6        9
# There are 5 categorical predictors in our data set
factors.training <- which(sapply(training,is.factor))
factors.tune.test.set <- which(sapply(tune.test.set,is.factor))
factors.testing <- which(sapply(testing,is.factor))

dummies.training <- dummyVars(Survived ~ Pclass + Sex + SibSp + Parch + Embarked, data = training)
dummies.tune.test.set <- dummyVars(Survived ~ Pclass + Sex + SibSp + Parch + Embarked, data = tune.test.set)
dummies.testing <- dummyVars(PassengerId ~ Pclass + Sex + SibSp + Parch + Embarked, data = testing)

# Add the dummy variables to both training and test data sets, simultaneously removing the existing factor variables:
training <- cbind(training[,-factors.training[-1]], predict(dummies.training,newdata = training))
tune.test.set <- cbind(tune.test.set[,-factors.tune.test.set[-1]], predict(dummies.tune.test.set,newdata = tune.test.set))
testing <- cbind(testing[,-factors.testing], predict(dummies.testing,newdata = testing))

That was a lot of code! Still, using the caret package makes the process less cumbersome.

After doing that, we notice that some factor levels of certain variables (namely Parch and SibSp) are not present in all of the training and test sets. The lower levels of these variables are conserved in all sets and represent most of the data. At this stage we will only keep features that are present in all sets:

training <- training[,names(training) %in% names(tune.test.set)]
tune.test.set <- tune.test.set[,names(tune.test.set) %in% names(training)]
identical(names(training),names(tune.test.set))
## [1] TRUE
PassengerId <- testing$PassengerId
testing <- testing[,names(testing) %in% names(training)]
testing$PassengerId <- PassengerId

4.5 Feature engineering with Name and Ticket variables

4.5.1 Ticket feature: sub-clusters with classification value

It would be interesting to just convert the ticket feature to numeric and see if it has any classification value:

Ticket <- toupper(training$Ticket)
# Keep only the part after the last white space
w <- grep(" ", Ticket)
last.space <- sapply(gregexpr(" ",Ticket[w]), function(y){
        max(y)
})

Ticket[w] <- substring(Ticket[w],last.space+1)

Ticket <- gsub(" ","", Ticket)
Ticket <- gsub("[A-Z]","",Ticket)
Ticket <- gsub("\\.","",Ticket)
Ticket <- gsub("\\/","", Ticket)
Ticket <- as.numeric(Ticket)
Ticket[is.na(Ticket)] = 0
Ticket <- as.numeric(Ticket)

qplot(x=Ticket, y = Fare, data = training, color = Survived)+theme_bw()+scale_color_manual(values = c("red","navy"))

We note that at least two groups of ticket numbers are associated with higher fares, and to some extent they separate the outcome classes. Therefore, we will engineer this numeric feature, add it to all sets, and later remove the original variable.

# Feature Engineering function
num.Ticket <- function(x){
        Ticket <- toupper(x$Ticket)
        # Keep only the part after the last white space
        w <- grep(" ", Ticket)
        last.space <- sapply(gregexpr(" ", Ticket[w]), function(y){
                max(y)
        })
        Ticket[w] <- substring(Ticket[w], last.space + 1)
        # Strip remaining spaces, letters and punctuation, then convert to numeric
        Ticket <- gsub(" ", "", Ticket)
        Ticket <- gsub("[A-Z]", "", Ticket)
        Ticket <- gsub("\\.", "", Ticket)
        Ticket <- gsub("\\/", "", Ticket)
        Ticket <- as.numeric(Ticket)
        return(Ticket)
}

training$num.Ticket <- num.Ticket(training)
tune.test.set$num.Ticket <- num.Ticket(tune.test.set)
testing$num.Ticket <- num.Ticket(testing)

sum(is.na(training$num.Ticket ))
## [1] 2
sum(is.na(tune.test.set$num.Ticket ))
## [1] 2
sum(is.na(testing$num.Ticket))
## [1] 0

After this conversion, only 2 missing values were introduced into each of the training and tune.test data sets.
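A quick way to see which tickets failed the conversion (a small check; these are typically tickets such as "LINE" that contain no digits at all):

# Which original ticket strings could not be converted to a number?
training$Ticket[is.na(training$num.Ticket)]
tune.test.set$Ticket[is.na(tune.test.set$num.Ticket)]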

4.5.2 Name feature: getting ‘titles’ out of it

Looking at the names of the passengers, we notice that we can at least attempt to extract a "title" feature from the name strings and explore its relationship with the outcome:

Pass.Names <- training$Name
first.comma<- sapply(gregexpr(",",Pass.Names), function(y){
        y[1][1]
})
first.dot <- sapply(gregexpr("\\.",Pass.Names), function(y){
        y[1][1]
})
Titles <- substr(Pass.Names,first.comma+2,first.dot-1)
qplot(x = factor(Titles), y = Age ,color = Survived, data = training)

This feature can also be useful, so we will add it to the data sets:

# Feature engineering function:
Titles <- function(x){
        Pass.Names <- x$Name
        # The title sits between the first comma and the first dot in the name string
        first.comma <- sapply(gregexpr(",", Pass.Names), function(y){
                y[1]
        })
        first.dot <- sapply(gregexpr("\\.", Pass.Names), function(y){
                y[1]
        })
        Titles <- substr(Pass.Names, first.comma + 2, first.dot - 1)
        return(Titles)
}

training$Titles <- factor(Titles(training))
tune.test.set$Titles <- factor(Titles(tune.test.set))
testing$Titles <- factor(Titles(testing))

Next, we need to create dummy variables from each of these titles:

factors.training <- which(sapply(training,is.factor))
factors.tune.test.set <- which(sapply(tune.test.set,is.factor))
factors.testing <- which(sapply(testing,is.factor))

dummies.training <- dummyVars(Survived ~ Titles, data = training)
dummies.tune.test.set <- dummyVars(Survived ~ Titles, data = tune.test.set)
dummies.testing <- dummyVars(PassengerId ~ Titles, data = testing)

We then add the dummy variables to both training and test data sets, simultaneously removing the existing factor variables:

training <- cbind(training[,-factors.training[-1]], predict(dummies.training,newdata = training))
tune.test.set <- cbind(tune.test.set[,-factors.tune.test.set[-1]], predict(dummies.tune.test.set,newdata = tune.test.set))
testing <- cbind(testing[,-factors.testing], predict(dummies.testing,newdata = testing))

Not surprisingly, after doing that we notice that some levels of the Titles feature are not present in all sets. At this stage we will only keep features that are present in all sets:

training <- training[,names(training) %in% names(tune.test.set)]
tune.test.set <- tune.test.set[,names(tune.test.set) %in% names(training)]
identical(names(training),names(tune.test.set))
## [1] TRUE
PassengerId <- testing$PassengerId
testing <- testing[,names(testing) %in% names(training)]
testing$PassengerId <- PassengerId

We should also remove the original Ticket and Name features from all data sets since we already have the engineered versions included:

training <- dplyr::select(training,-Ticket,-Name)
tune.test.set <- dplyr::select(tune.test.set ,-Ticket,-Name)
testing <- dplyr::select(testing,-Ticket,-Name)

Finally, since the Ticket feature engineering generated 2 missing values, we keep only the complete cases in each data set:

training <- training[complete.cases(training),]
tune.test.set <- tune.test.set[complete.cases(tune.test.set),]
testing <- testing[complete.cases(testing),]
testing <- data.frame(PassengerId = testing$PassengerId, testing[,-27])

This completes the feature engineering for all data sets.


4.6 Collinearity, Near Zero Variance and Dimension Reduction (Principal Components Analysis)

We generated lots of features! While we had reasons for building them, now is the time to drop or transform any that might hurt us down the road.

One reason we might want to reduce the dimension is collinearity, where a few features are highly correlated with each other. Another issue is features with near-zero variance.

4.6.1 Near zero variance features

Let’s explore if there are near zero variance features in the training data set:

nsv <- nearZeroVar(x = training, saveMetrics = TRUE)
sum(!nsv$nzv)
## [1] 20

Good that we checked: indeed, 7 features have near-zero variance! We will drop them and keep the remaining 20 in our data sets.

training <- training[,!nsv$nzv]
tune.test.set <- tune.test.set[,!nsv$nzv]
testing <- testing[,!nsv$nzv]

4.6.2 Testing for collinearity

Let’s have a look at the collinearity in our training set:

M <- abs(cor(training[,-1])) # absolute-value correlation matrix of all pairwise correlations between predictors
diag(M) <- 0 # zero out the diagonal (the correlation of each variable with itself is not of interest)
which(M > 0.8, arr.ind = TRUE) # which variables are the most highly correlated?
##            row col
## Sex.male     7   6
## Titles.Mr   18   6
## Sex.female   6   7
## Titles.Mr   18   7
## SibSp.1      9   8
## SibSp.0      8   9
## Sex.female   6  18
## Sex.male     7  18
unique(row.names(which(M > 0.8, arr.ind = TRUE)))
## [1] "Sex.male"   "Titles.Mr"  "Sex.female" "SibSp.1"    "SibSp.0"
cor.variables <- training[,unique(row.names(which(M > 0.8, arr.ind = TRUE)))]
cor.variables$Survived <- training$Survived

We notice that the expected features correlate with each other, for example male gender and the title Mr.

The next question is: should we replace these features with statistically orthogonal features by using a matrix transformation?

4.6.3 Principal Components Analysis (PCA)

A PCA would be handy to visualize whether indeed it is worth reducing the dimension in the space of highly correlated features in our training data set:

prePCA <- preProcess(cor.variables[,-6],method = "pca")
PCAcor <- predict(prePCA,cor.variables[,-6])
qplot(PCAcor$PC1,PCAcor$PC2, color = Survived, data = cor.variables)+theme_bw()+scale_color_manual(values = c("red","navy"))

qplot(PCAcor$PC3,PCAcor$PC2, color = Survived, data = cor.variables)+theme_bw()+scale_color_manual(values = c("red","navy"))

This was not too helpful, since even the first 3 principal components do not separate the outcome classes well. At least we gave it a try!

We will use the remaining 20 features to train our classifiers.

5 Training Classifiers

Given the classification nature of the problem, we will use tree-based algorithms and support vector machines to make our predictions. We will perform 10-fold cross-validation in the initial training to prevent over-fitting, and we will set the random number generator seed for reproducibility.

5.1 Simple classification tree with cross validation

set.seed(125745)
RPART <- train(Survived ~ ., data = training,method = "rpart", trControl = trainControl(method = "cv", number = 10))
RPART
## CART 
## 
## 621 samples
##  19 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01898734  0.8117872  0.5982508
##   0.04008439  0.7926628  0.5628837
##   0.45991561  0.6793521  0.2299384
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01898734.

5.2 Random Forest classifier with cross validation

set.seed(125745)
RF <- train(Survived ~ ., data = training,method = "rf", trControl = trainControl(method = "cv", number = 10)) 
RF
## Random Forest 
## 
## 621 samples
##  19 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8164186  0.6018132
##   10    0.8247698  0.6237537
##   19    0.8230801  0.6234377
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 10.

5.3 Boosted tree classifier with cross validation

set.seed(125745)
GBM <- train(Survived ~ ., data = training,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE) 
GBM
## Stochastic Gradient Boosting 
## 
## 621 samples
##  19 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8005985  0.5758879
##   1                  100      0.7973991  0.5681702
##   1                  150      0.7941213  0.5585009
##   2                   50      0.8037739  0.5781097
##   2                  100      0.8150147  0.6012232
##   2                  150      0.8197519  0.6118256
##   3                   50      0.8071063  0.5876910
##   3                  100      0.8101769  0.5947381
##   3                  150      0.8037500  0.5814624
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

5.4 Support vector machines with a radial kernel

set.seed(125745)
SVM <- train(Survived ~ ., data = training,method = "svmRadial", trControl = trainControl(method = "cv", number = 10))
SVM
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 621 samples
##  19 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8165227  0.6070723
##   0.50  0.8180836  0.6065842
##   1.00  0.8084309  0.5840500
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0545379
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.0545379 and C = 0.5.

5.5 Summarizing the in-sample accuracy metrics from the individual classifiers

# Using resamples function from the caret package to summarize the data
modelsummary <- resamples(list(RPART=RPART,RF=RF,GBM=GBM,SVM=SVM))
# In-sample accuracy values for each model
summary(modelsummary)$statistics$Accuracy
##         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## RPART 0.7619  0.7849 0.8065 0.8118  0.8327 0.8689    0
## RF    0.7581  0.7890 0.8306 0.8248  0.8566 0.8852    0
## GBM   0.7581  0.7823 0.8161 0.8198  0.8497 0.9016    0
## SVM   0.7903  0.8065 0.8175 0.8181  0.8334 0.8548    0

If we take the median accuracy as our benchmark, the Random Forest classifier has the best accuracy amongst the classifiers we trained. Our out-of-sample accuracy on the tune.test.set will likely be lower, though.
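caret can also visualize the full resampling distributions, which is more informative than a single summary number; a brief sketch using the lattice-based plotting methods for resamples objects:

# Box-and-whisker and dot plots of the cross-validated accuracies
bwplot(modelsummary, metric = "Accuracy")
dotplot(modelsummary, metric = "Accuracy")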

5.6 ‘Variable importance’ plots for each classifier

dotPlot(varImp(RPART))

dotPlot(varImp(RF))

dotPlot(varImp(SVM))

dotPlot(varImp(GBM))

It is great to see that some of the features we engineered are amongst the most important features for training the classifiers!

5.7 Tuning and stacking the classifiers for improved accuracy

5.7.1 Random Forest classifier with bootstrap and 632 correction

set.seed(125745)
RF.tune <- train(Survived ~ ., data = training,method = "rf", trControl = trainControl(method = "boot632", number = 10) )
median(RF.tune$results$Accuracy)
## [1] 0.878137

Great! Using this approach increased the in-sample accuracy to almost 88%!
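For context, the .632 bootstrap estimator blends the optimistic apparent (training) accuracy with the pessimistic out-of-bag bootstrap accuracy using fixed weights; a purely illustrative sketch of the weighting with hypothetical numbers:

# Illustrative only: hypothetical accuracies, not values computed above
apparent.acc <- 0.95   # accuracy on the data the model was fit on
oob.acc <- 0.82        # accuracy on the bootstrap out-of-bag samples
0.368 * apparent.acc + 0.632 * oob.acc   # the boot632 blend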

5.7.2 Stacking classifiers

Let’s try to build a ‘classifier of classifiers’ using the predictions from the individual classifiers. Ideally we would do this on a separate data set; however, our data set is relatively small, so we will have to reuse the predictions made on the training set.

in.sample.predictions <- data.frame(RF.tune = predict(RF.tune,training), RF = predict(RF, training), GBM = predict(GBM,training), RPART = predict(RPART, training), SVM = predict(SVM, training))

# Add the outcome
in.sample.predictions$Survived <- training$Survived

We will stack these classifiers and train two more classifiers using the Random Forest and gradient boosted tree algorithms.

set.seed(125745)
RF.stack <- train(Survived ~ ., data = in.sample.predictions,method = "rf", trControl = trainControl(method = "cv", number = 10)) 
RF.stack
## Random Forest 
## 
## 621 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9967742  0.9931397
##   3     0.9967742  0.9931397
##   5     0.9967742  0.9931397
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
set.seed(125745)
GBM.stack <- train(Survived ~ ., data = in.sample.predictions,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE) 
GBM.stack
## Stochastic Gradient Boosting 
## 
## 621 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9967742  0.9931397
##   1                  100      0.9967742  0.9931397
##   1                  150      0.9967742  0.9931397
##   2                   50      0.9967742  0.9931397
##   2                  100      0.9967742  0.9931397
##   2                  150      0.9967742  0.9931397
##   3                   50      0.9967742  0.9931397
##   3                  100      0.9967742  0.9931397
##   3                  150      0.9967742  0.9931397
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 1, shrinkage = 0.1 and n.minobsinnode = 10.

That is great! Stacking the classifiers really improved the accuracy: we now have over 99% in-sample accuracy. There is probably some over-fitting, since we reused the training set multiple times, but we will assess our out-of-sample accuracy using the tune.test.set.

6 Unbiased evaluation of the out-of-sample accuracy using the tune test set

Before making our predictions on the actual test data set, it is a good idea to get an estimate of our out-of-sample accuracy using the tune.test.set we randomly partitioned at the start. Since we haven't touched this data set while training our classifiers, it provides a good first-hand estimate of the out-of-sample accuracy.

Since we will use the stacked classifiers we constructed, we have to follow the same procedure on the tune.test.set and first generate predictions from each trained base classifier:

pred.tune.test.RF <- predict(RF, tune.test.set)
pred.tune.test.RF.tune <- predict(RF.tune, tune.test.set)
pred.tune.test.GBM <- predict(GBM, tune.test.set)
pred.tune.test.RPART <- predict(RPART, tune.test.set)
pred.tune.test.SVM <- predict(SVM, tune.test.set)

tune.test.predictions <- data.frame(RF.tune = pred.tune.test.RF.tune, RF = pred.tune.test.RF, GBM = pred.tune.test.GBM, RPART = pred.tune.test.RPART, SVM = pred.tune.test.SVM)

# Add the outcome from tune.test set
tune.test.predictions$Survived <-tune.test.set$Survived

Next, make the predictions:

pred.tune.test.RF.stack <- predict(RF.stack,tune.test.predictions)
pred.tune.test.GBM.stack <- predict(GBM.stack,tune.test.predictions)

confusionMatrix(pred.tune.test.RF.stack, tune.test.set$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 149  32
##          1  13  70
##                                           
##                Accuracy : 0.8295          
##                  95% CI : (0.7786, 0.8729)
##     No Information Rate : 0.6136          
##     P-Value [Acc > NIR] : 2e-14           
##                                           
##                   Kappa : 0.6277          
##  Mcnemar's Test P-Value : 0.00729         
##                                           
##             Sensitivity : 0.9198          
##             Specificity : 0.6863          
##          Pos Pred Value : 0.8232          
##          Neg Pred Value : 0.8434          
##              Prevalence : 0.6136          
##          Detection Rate : 0.5644          
##    Detection Prevalence : 0.6856          
##       Balanced Accuracy : 0.8030          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(pred.tune.test.GBM.stack, tune.test.set$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 148  33
##          1  14  69
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.7704, 0.8662)
##     No Information Rate : 0.6136          
##     P-Value [Acc > NIR] : 1.796e-13       
##                                           
##                   Kappa : 0.6111          
##  Mcnemar's Test P-Value : 0.00865         
##                                           
##             Sensitivity : 0.9136          
##             Specificity : 0.6765          
##          Pos Pred Value : 0.8177          
##          Neg Pred Value : 0.8313          
##              Prevalence : 0.6136          
##          Detection Rate : 0.5606          
##    Detection Prevalence : 0.6856          
##       Balanced Accuracy : 0.7950          
##                                           
##        'Positive' Class : 0               
## 

7 Conclusions

The out-of-sample accuracy of the stacked classifiers is ~83%, well below their in-sample accuracy. This is probably because we did not use a third, held-out data set when stacking our classifiers, which likely caused over-fitting of the stacked classifiers and an overestimation of their in-sample accuracy.

Nevertheless, it was worth building the stacked classifiers, as they still show improved accuracy relative to the stand-alone classifiers.

8 Final prediction and preparation of the submission

Our final step is to make predictions on the unlabeled testing data set and prepare a submission file that matches our predictions with the passenger IDs. We will use the stacked Random Forest classifier, as it gave us the highest out-of-sample accuracy:

pred.test.RF <- predict(RF, testing[,-1])
pred.test.RF.tune <- predict(RF.tune, testing[,-1])
pred.test.GBM <- predict(GBM, testing[,-1])
pred.test.RPART <- predict(RPART, testing[,-1])
pred.test.SVM <- predict(SVM, testing[,-1])

test.predictions <- data.frame(RF.tune = pred.test.RF.tune, RF = pred.test.RF , GBM = pred.test.GBM, RPART = pred.test.RPART, SVM = pred.test.SVM)

pred.test.RF.stack <- as.numeric(as.character(predict(RF.stack,test.predictions)))
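To match these predictions with the passenger IDs and write the submission file, a minimal sketch (assuming the standard Kaggle column names and a file name of our choosing) could be:

# Combine the stacked predictions with the passenger IDs and write the submission file
submission <- data.frame(PassengerId = testing$PassengerId, Survived = pred.test.RF.stack)
write.csv(submission, "stacked_RF_submission.csv", row.names = FALSE)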

9 Revisiting the classification problem

In this new section, I will revisit the data with a fresh look and see if I can try alternative approaches to improve the classification accuracy.

9.1 Using Principal Components to train classifiers: beating the curse of dimensionality

Perhaps it was a little wasteful to separate a validation set from the already small training set. Since we already perform resampling (cross-validation or the bootstrap) while training our models, in the rest of my attempts I decided to use the entire training set to train the classifiers.

9.1.1 Feature selection and engineering

library(dplyr);library(ggplot2); library(caret)

training <- read.csv("train.csv")
final_testing <- read.csv("test.csv")


training <- dplyr::select(training, -PassengerId)
training$Survived <- factor(training$Survived)

# Fare is important to keep, Age perhaps not
qplot(x = Age, y = Fare, data = training, color = Survived)

# Also keep Pclass (good at classifying victims)
qplot(x = Pclass, y = Fare, data = training, color = Survived, alpha = I(0.2))

# Also keep SibSp and Parch (together they give family size)
F.size = training$Parch + training$SibSp
qplot(x = F.size , y = Fare, data = training, color = Survived, alpha = I(0.4))

# Definitely keep Gender
qplot(x = Sex , y = Fare, data = training, color = Survived, alpha = I(0.4))

# remove other features:

training <- dplyr::select(training, -Name, -Age, -Ticket,-Cabin, - Embarked)
final_testing <- dplyr::select(final_testing, -Name, -Age, -Ticket,-Cabin, - Embarked)

summary(training)
##  Survived     Pclass          Sex          SibSp           Parch       
##  0:549    Min.   :1.000   female:314   Min.   :0.000   Min.   :0.0000  
##  1:342    1st Qu.:2.000   male  :577   1st Qu.:0.000   1st Qu.:0.0000  
##           Median :3.000                Median :0.000   Median :0.0000  
##           Mean   :2.309                Mean   :0.523   Mean   :0.3816  
##           3rd Qu.:3.000                3rd Qu.:1.000   3rd Qu.:0.0000  
##           Max.   :3.000                Max.   :8.000   Max.   :6.0000  
##       Fare       
##  Min.   :  0.00  
##  1st Qu.:  7.91  
##  Median : 14.45  
##  Mean   : 32.20  
##  3rd Qu.: 31.00  
##  Max.   :512.33
summary(final_testing)
##   PassengerId         Pclass          Sex          SibSp       
##  Min.   : 892.0   Min.   :1.000   female:152   Min.   :0.0000  
##  1st Qu.: 996.2   1st Qu.:1.000   male  :266   1st Qu.:0.0000  
##  Median :1100.5   Median :3.000                Median :0.0000  
##  Mean   :1100.5   Mean   :2.266                Mean   :0.4474  
##  3rd Qu.:1204.8   3rd Qu.:3.000                3rd Qu.:1.0000  
##  Max.   :1309.0   Max.   :3.000                Max.   :8.0000  
##                                                                
##      Parch             Fare        
##  Min.   :0.0000   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:  7.896  
##  Median :0.0000   Median : 14.454  
##  Mean   :0.3923   Mean   : 35.627  
##  3rd Qu.:0.0000   3rd Qu.: 31.500  
##  Max.   :9.0000   Max.   :512.329  
##                   NA's   :1
# There is one missing value in the Fare feature in the testing set, thus we need to find a way to impute by using the 
# training set

lmFit1 <- lm(Fare ~ factor(Pclass)*factor(Sex), data = training)

summary(lmFit1)
## 
## Call:
## lm(formula = Fare ~ factor(Pclass) * factor(Sex), data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80.20  -8.37  -4.77   4.03 445.10 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      106.126      4.013  26.443  < 2e-16 ***
## factor(Pclass)2                  -84.156      6.003 -14.020  < 2e-16 ***
## factor(Pclass)3                  -90.007      5.160 -17.444  < 2e-16 ***
## factor(Sex)male                  -38.900      5.340  -7.284 7.16e-13 ***
## factor(Pclass)2:factor(Sex)male   36.671      7.903   4.640 4.01e-06 ***
## factor(Pclass)3:factor(Sex)male   35.442      6.588   5.380 9.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.91 on 885 degrees of freedom
## Multiple R-squared:  0.3903, Adjusted R-squared:  0.3869 
## F-statistic: 113.3 on 5 and 885 DF,  p-value: < 2.2e-16
par(mfrow= c(2,2))
plot(lmFit1)[1:4]

## NULL
# lmFit1 Fair model to impute Fare in the testing set

sum(is.na(training$Fare)) # No missing values in the training set
## [1] 0
sum(is.na(final_testing$Fare)) # One missing value in the test set
## [1] 1
final_testing$Fare[is.na(final_testing$Fare)] <- predict(lmFit1, newdata = final_testing[is.na(final_testing$Fare),])

sum(is.na(final_testing$Fare)) # Imputed
## [1] 0
# Final check confirms that there is no remaining missing data in either of the data sets
apply(is.na(training),2,sum);apply(is.na(final_testing),2,sum)
## Survived   Pclass      Sex    SibSp    Parch     Fare 
##        0        0        0        0        0        0
## PassengerId      Pclass         Sex       SibSp       Parch        Fare 
##           0           0           0           0           0           0
# Convert Sex to dummy variable

training$Female <- ifelse(training$Sex == "female", 1,0)
final_testing$Female <- ifelse(final_testing$Sex == "female", 1,0)

training <- dplyr::select(training, - Sex)
final_testing <- dplyr::select(final_testing, -Sex)

This leaves us with a fairly small set of features that are likely to be important for classification.

9.1.2 Dimension reduction with the training set

Let’s now see whether Principal Components can aid classification:

prePCA <- preProcess(training[,-1],method = "pca")
PCAtraining <- predict(prePCA,newdata = training)




qplot(x = PC1, y = PC2, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
        scale_color_manual(values = c("red","navy"))

qplot(x = PC2, y = PC3, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
        scale_color_manual(values = c("red","navy"))

qplot(x = PC1, y = PC3, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
        scale_color_manual(values = c("red","navy"))

The PCA predictors separate the classes pretty well!
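To see how much variance each component captures (preProcess with method = "pca" retains enough components to explain 95% of the variance by default), a quick sketch using prcomp on the same centered and scaled predictors:

# Proportion of variance explained by each principal component
pca.check <- prcomp(training[,-1], center = TRUE, scale. = TRUE)
summary(pca.check)$importance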

9.1.3 Training classifiers using Principal Components as predictors

What happens if we train our classifiers on these principal components, which capture most of the variability in the existing predictors?

set.seed(1234)
PCAknn <- train(Survived ~ ., data = PCAtraining,method = "knn",
                trControl= trainControl(method = "boot632"))



# So far gives the best accuracy (benchmark: 0.83) ##############
set.seed(1234)
PCA.rf <- train(Survived ~ ., data = PCAtraining,method = "rf",
                   trControl= trainControl(method = "boot632"))
##################################################################
# Not as good as the PCA.rf
set.seed(1234)
PCA.gbm <- train(Survived ~ ., data = PCAtraining,method = "gbm",
                trControl= trainControl(method = "boot632", number = 200),verbose =F)

# Not as good as the PCA.rf
set.seed(1234)
PCA.lda <- train(Survived ~ ., data = PCAtraining,method = "lda",
                 trControl= trainControl(method = "boot632"), verbose =F)

# Not as good as the PCA.rf
set.seed(1234)
PCA.svm <- train(Survived ~ ., data = PCAtraining,method = "svmRadial",
                 trControl= trainControl(method = "boot632"), verbose =F)

# Let's make a prediction by using final testing set:

PCAtesting <- predict(prePCA,newdata = final_testing)

predictions <- NULL

models <- list(PCA.lda,
               PCA.gbm,
               PCA.rf,
               PCAknn,
               PCA.svm)

predictions <- sapply(models, function(x){
        temp <- as.numeric(as.character(predict(x,newdata = PCAtesting[,-1])))
})
prediction.table <- NULL
for(i in seq_len(ncol(predictions))){   # one submission file per model
        prediction.table <- data.frame(PassengerId = final_testing$PassengerId, 
                                          Survived = predictions[,i])   
        write.csv(prediction.table,paste0("PCA_predictions",i,".csv"), row.names = F)
}
  • PCA.svm performed better than any other classifier I trained so far! The test accuracy was 0.77990.

  • Surprisingly, PCA.lda was the second best one!

Take home message: reduce the dimension as much as possible, try to explain more of the variability in the data with a smaller set of features, and keep trying different models regardless of their training accuracy; they might perform better on the test set.

This brings a slight rise on the leaderboard, but we certainly have a long way to go.

9.2 Using Principal Components to train classifiers: what if we added those few more features while training the classifiers?

Let’s go back and:

  • a few more feature engineering steps to add some features I was able to extract previously
  • repeat PCA
  • fit PCA.svm again
  • Test whether adding a few more features before the PCA can improve the accuracy (hoping that we can capture more variance in the direction of the response, in the form of principal components)

9.2.1 Feature selection and engineering

library(dplyr);library(ggplot2); library(caret)

training <- read.csv("train.csv")
final_testing <- read.csv("test.csv")


training <- dplyr::select(training, -PassengerId)
training$Survived <- factor(training$Survived)

# remove some features:

training <- dplyr::select(training,-Cabin, - Embarked, -Age)
final_testing <- dplyr::select(final_testing,-Cabin, - Embarked, -Age)

summary(training)
##  Survived     Pclass                                         Name    
##  0:549    Min.   :1.000   Abbing, Mr. Anthony                  :  1  
##  1:342    1st Qu.:2.000   Abbott, Mr. Rossmore Edward          :  1  
##           Median :3.000   Abbott, Mrs. Stanton (Rosa Hunt)     :  1  
##           Mean   :2.309   Abelson, Mr. Samuel                  :  1  
##           3rd Qu.:3.000   Abelson, Mrs. Samuel (Hannah Wizosky):  1  
##           Max.   :3.000   Adahl, Mr. Mauritz Nils Martin       :  1  
##                           (Other)                              :885  
##      Sex          SibSp           Parch             Ticket   
##  female:314   Min.   :0.000   Min.   :0.0000   1601    :  7  
##  male  :577   1st Qu.:0.000   1st Qu.:0.0000   347082  :  7  
##               Median :0.000   Median :0.0000   CA. 2343:  7  
##               Mean   :0.523   Mean   :0.3816   3101295 :  6  
##               3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6  
##               Max.   :8.000   Max.   :6.0000   CA 2144 :  6  
##                                                (Other) :852  
##       Fare       
##  Min.   :  0.00  
##  1st Qu.:  7.91  
##  Median : 14.45  
##  Mean   : 32.20  
##  3rd Qu.: 31.00  
##  Max.   :512.33  
## 
summary(final_testing)
##   PassengerId         Pclass     
##  Min.   : 892.0   Min.   :1.000  
##  1st Qu.: 996.2   1st Qu.:1.000  
##  Median :1100.5   Median :3.000  
##  Mean   :1100.5   Mean   :2.266  
##  3rd Qu.:1204.8   3rd Qu.:3.000  
##  Max.   :1309.0   Max.   :3.000  
##                                  
##                                         Name         Sex     
##  Abbott, Master. Eugene Joseph            :  1   female:152  
##  Abelseth, Miss. Karen Marie              :  1   male  :266  
##  Abelseth, Mr. Olaus Jorgensen            :  1               
##  Abrahamsson, Mr. Abraham August Johannes :  1               
##  Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1               
##  Aks, Master. Philip Frank                :  1               
##  (Other)                                  :412               
##      SibSp            Parch             Ticket         Fare        
##  Min.   :0.0000   Min.   :0.0000   PC 17608:  5   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   113503  :  4   1st Qu.:  7.896  
##  Median :0.0000   Median :0.0000   CA. 2343:  4   Median : 14.454  
##  Mean   :0.4474   Mean   :0.3923   16966   :  3   Mean   : 35.627  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   220845  :  3   3rd Qu.: 31.500  
##  Max.   :8.0000   Max.   :9.0000   347077  :  3   Max.   :512.329  
##                                    (Other) :396   NA's   :1
# There is one missing value in the Fare feature in the testing set, thus we need to find a way to impute by using the 
# training set

lmFit1 <- lm(Fare ~ factor(Pclass)*factor(Sex), data = training)

summary(lmFit1)
## 
## Call:
## lm(formula = Fare ~ factor(Pclass) * factor(Sex), data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80.20  -8.37  -4.77   4.03 445.10 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      106.126      4.013  26.443  < 2e-16 ***
## factor(Pclass)2                  -84.156      6.003 -14.020  < 2e-16 ***
## factor(Pclass)3                  -90.007      5.160 -17.444  < 2e-16 ***
## factor(Sex)male                  -38.900      5.340  -7.284 7.16e-13 ***
## factor(Pclass)2:factor(Sex)male   36.671      7.903   4.640 4.01e-06 ***
## factor(Pclass)3:factor(Sex)male   35.442      6.588   5.380 9.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.91 on 885 degrees of freedom
## Multiple R-squared:  0.3903, Adjusted R-squared:  0.3869 
## F-statistic: 113.3 on 5 and 885 DF,  p-value: < 2.2e-16
par(mfrow= c(2,2))
plot(lmFit1)[1:4]

## NULL
# lmFit1 Fair model to impute Fare in the testing set

sum(is.na(training$Fare)) # No missing values in the training set
## [1] 0
sum(is.na(final_testing$Fare)) # One missing value in the test set
## [1] 1
final_testing$Fare[is.na(final_testing$Fare)] <- predict(lmFit1, newdata = final_testing[is.na(final_testing$Fare),])

sum(is.na(final_testing$Fare)) # Imputed
## [1] 0
# Final check confirms that there is no remaining missing data in either of the data sets
apply(is.na(training),2,sum);apply(is.na(final_testing),2,sum)
## Survived   Pclass     Name      Sex    SibSp    Parch   Ticket     Fare 
##        0        0        0        0        0        0        0        0
## PassengerId      Pclass        Name         Sex       SibSp       Parch 
##           0           0           0           0           0           0 
##      Ticket        Fare 
##           0           0
# Convert Sex to dummy variable
training$Female <- ifelse(training$Sex == "female", 1,0)
final_testing$Female <- ifelse(final_testing$Sex == "female", 1,0)

training <- dplyr::select(training, - Sex)
final_testing <- dplyr::select(final_testing, -Sex)


# Ticket
Ticket <- toupper(training$Ticket)
# Better to remove anything left of the last white space
w <- grep(" ", Ticket)
last.space <- sapply(gregexpr(" ",Ticket[w]), function(y){
        max(y)  # position of the last white space
})

Ticket[w] <- substring(Ticket[w],last.space+1)

Ticket <- gsub(" ","", Ticket)
Ticket <- gsub("[A-Z]","",Ticket)
Ticket <- gsub("\\.","",Ticket)
Ticket <- gsub("\\/","", Ticket)
Ticket <- as.numeric(Ticket)
Ticket[is.na(Ticket)] <- 0


plot(x=log10(Ticket), y = training$Fare, col = ifelse(training$Survived == 0, "red","navy"))
plot(x=log10(Ticket), y = training$SibSp, col = ifelse(training$Survived == 0, "red","navy"))
plot(x=log10(Ticket), y = training$Parch, col = ifelse(training$Survived == 0, "red","navy"))

# Remove original Ticket and add new Ticket feature

# Feature Engineering function
num.Ticket <- function(x){
        Ticket <- toupper(x$Ticket)
        # Better to remove anything left of the last white space
        w <- grep(" ", Ticket)
        last.space <- sapply(gregexpr(" ",Ticket[w]), function(y){
                max(y)  # position of the last white space
        })
        Ticket[w] <- substring(Ticket[w],last.space+1)
        
        Ticket <- gsub(" ","", Ticket)
        Ticket <- gsub("[A-Z]","",Ticket)
        Ticket <- gsub("\\.","",Ticket)
        Ticket <- gsub("\\/","", Ticket)
        Ticket <- as.numeric(Ticket)
        return(log10(Ticket))
}

training$num.Ticket <- num.Ticket(training)
final_testing$num.Ticket <- num.Ticket(final_testing)


sum(is.na(training$num.Ticket )) # 4 Missing values introduced
## [1] 4
sum(is.na(final_testing$num.Ticket)) # No missing values
## [1] 0
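# Quick check (sketch, not part of the original pipeline): inspect which raw ticket
# strings produced the NAs; by construction these are tickets without any digits
training$Ticket[is.na(training$num.Ticket)]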
# remove the Ticket feature

training <- dplyr::select(training,-Ticket)
final_testing <- dplyr::select(final_testing,-Ticket)

# Title feature from Name:

# Feature engineering function:
Titles.func <- function(x){
        Pass.Names <- x$Name
        
        first.comma <- sapply(gregexpr(",",Pass.Names), function(y){
                y[1]  # position of the first comma
        })
        first.dot <- sapply(gregexpr("\\.",Pass.Names), function(y){
                y[1]  # position of the first dot
        })
        
        Titles <- substr(Pass.Names, first.comma + 2, first.dot - 1)  
        return(Titles)
}

Titles <- Titles.func(training)

qplot(x = training$Fare[training$Female == 0], y = factor(Titles[training$Female == 0]), data = training[training$Female == 0,], color = Survived, alpha = I(0.2))+
        theme_bw()

# We notice that "Master" title explains quite a few of the survived males. This would be a useful feature to add into the data sets
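# A quick cross-tabulation (sketch, not part of the original pipeline) makes this
# concrete: survival counts by title among male passengers
table(Title = Titles[training$Female == 0], Survived = training$Survived[training$Female == 0])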

training$Titles.Master <- ifelse(Titles == "Master",1,0)
final_testing$Titles.Master <- ifelse(Titles.func(final_testing) == "Master",1,0)

# Remove the original Name feature:

training <- dplyr::select(training,-Name)
final_testing <- dplyr::select(final_testing,-Name)

apply(is.na(training),2,sum)
##      Survived        Pclass         SibSp         Parch          Fare 
##             0             0             0             0             0 
##        Female    num.Ticket Titles.Master 
##             0             4             0
apply(is.na(final_testing),2,sum)
##   PassengerId        Pclass         SibSp         Parch          Fare 
##             0             0             0             0             0 
##        Female    num.Ticket Titles.Master 
##             0             0             0
training <- training[complete.cases(training),]

We have now added two more features (num.Ticket and Titles.Master) compared to our previous attempt. Let's build our classifiers to see whether that makes any difference.
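For reference, here is a quick look (added as a small aside) at the columns of the engineered training set, including the two new features:

names(training)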

9.2.2 Dimension reduction with the training set

prePCA <- preProcess(training[,-1],method = "pca")
PCAtraining <- predict(prePCA,newdata = training)

qplot(x = PC1, y = PC2, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
        scale_color_manual(values = c("red","navy"))

The decision boundary is still not an easy one for any classifier. But we can’t tell if we have improved the accuracy until we make some predictions!
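Before training on the rotated data, it is worth a quick check (added here as a small aside) of how many principal components preProcess kept; by default, caret retains enough components to capture 95% of the variance:

prePCA              # printing the preProcess object reports how many PCs were retained
dim(PCAtraining)    # the rotated training set: Survived plus the retained PCs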

set.seed(1234)
PCA.svm <- train(Survived ~ ., data = PCAtraining,method = "svmRadial",
                 trControl= trainControl(method = "boot632"), verbose =F)

# Let's perform a prediction:

PCAtesting <- predict(prePCA,newdata = final_testing)
prediction.table <- data.frame(PassengerId = final_testing$PassengerId, 
                               Survived = predict(PCA.svm,PCAtesting))   
write.csv(prediction.table,paste0("PCA_predictions_3",".csv"), row.names = F)

Great! These two new features alone have increased the testing accuracy from 0.77 to 0.79426!

Take-home message: even small tweaks can make a noticeable difference!

10 Using Shrinkage methods (Regularization)

It would also be interesting to perform linear model selection and see how well it performs with the new training set we have obtained.

10.1 Fitting a lasso logistic regression model to data

library(glmnet)

# We convert Pclass, SibSp and Parch into factor variables

training$Pclass <- factor(training$Pclass)
training$SibSp <- factor(training$SibSp)
training$Parch <- factor(training$Parch)

final_testing$Pclass <- factor(final_testing$Pclass)
final_testing$SibSp <- factor(final_testing$SibSp)
final_testing$Parch <- factor(final_testing$Parch)
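# One caveat worth checking (sketch, not part of the original pipeline): the factor
# levels can differ between the two sets (e.g. Parch reaches 9 only in the test set),
# which is why the design matrices will need to be aligned before prediction later on
setdiff(levels(final_testing$Parch), levels(training$Parch))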

# We create a matrix of predictors
x.training = model.matrix(Survived ~ . -1, data = training)
x.testing = model.matrix( ~ . -1, data = final_testing)


y = training$Survived

# Fit lasso regression to training data
fit.lasso <- glmnet(x.training,y, family = "binomial")
plot(fit.lasso, xvar = "lambda", label = TRUE)

plot(fit.lasso, xvar = "dev", label = TRUE)

The lasso is a nice way of reducing the number of features while still explaining much of the variance, and therefore of managing the bias-variance trade-off. In this case we notice that only 4 features already explain about 30% of the deviance, while including all the features only raises this to about 40%. It is therefore reasonable to use the restricted model, with its few features and shrunken coefficients.
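To make this concrete, we can inspect the coefficients at a single, relatively large value of lambda along the path; the value 0.05 below is only an illustrative choice, not a tuned one:

coef(fit.lasso, s = 0.05)   # the non-zero rows are the features that enter the path earliest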

10.2 Choosing the optimal lambda by using cross-validation

It is best to choose the optimal lambda using cross-validation. Note that by default cv.glmnet minimizes the cross-validated binomial deviance; setting type.measure = "class" would minimize the misclassification error instead:

cv.lasso <- cv.glmnet(x.training,y, family = "binomial")
plot(cv.lasso)

Notice that what we are doing here is computationally intensive:

  1. We construct a fine grid of lambda values.
  2. For each lambda value, we perform 10-fold cross-validation:
  • We fit the penalized model (a convex optimization of the binomial log-likelihood with an L1 penalty) and record the cross-validated error (the average over the 10 fold-wise fits).
  3. We repeat step 2 for every lambda value in the grid.

Two vertical lines are produced in the plot: the left one marks the model with the minimum cross-validated error, and the right one marks a slightly more restricted model whose error is within one standard error of that minimum.

When choosing the optimal model, we might prefer the more restricted model indicated by the second vertical line; this is the default in the glmnet package (s = "lambda.1se"), following the one-standard-error convention.
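Both lambda values marked by the vertical lines are stored on the cross-validation object and can be inspected directly:

cv.lasso$lambda.min   # lambda giving the minimum cross-validated error
cv.lasso$lambda.1se   # largest lambda within one standard error of that minimum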

Here are the final model coefficients obtained by cross-validation, and the non-zero features that remain:

coef(cv.lasso)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)   -1.283806e+00
## Pclass1        7.169612e-01
## Pclass2        .           
## Pclass3       -7.886747e-01
## SibSp1         .           
## SibSp2         .           
## SibSp3        -7.647953e-01
## SibSp4        -6.354454e-01
## SibSp5        -5.233464e-01
## SibSp8        -6.220902e-01
## Parch1         .           
## Parch2         .           
## Parch3         .           
## Parch4        -3.456830e-01
## Parch5         .           
## Parch6         .           
## Fare           4.924973e-05
## Female         2.475991e+00
## num.Ticket     .           
## Titles.Master  1.938509e+00

10.3 Making predictions using the best lasso model fit

#Training accuracy
pred <- predict(cv.lasso, type = "response", newx = x.training)
# Turning probabilities into predictions
pred.training <- ifelse(pred > 0.5,1,0)

# Mean training accuracy
mean(pred.training == training$Survived)
## [1] 0.8139797
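# A confusion matrix on the training set (sketch, not part of the original pipeline)
# shows where the misclassifications occur, beyond the single accuracy number
table(predicted = as.vector(pred.training), actual = training$Survived)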
###### Testing predictions for submission

# These steps align the test design matrix with the training one, otherwise predict() will complain
x.testing <- data.frame(x.testing)  # Convert the test matrix to a data frame
x.testing$Survived <- NULL  # Defensive: ensure no response column sits among the predictors (the test set has none)

# Finally, retain only the features that also exist in the training design matrix
x.testing <- x.testing[,which(colnames(x.testing) %in% colnames(x.training))]

# Calculate the probabilities:
pred.t <- predict(cv.lasso, type = "response", newx = data.matrix(x.testing))
# Turn them into predictions
pred.testing <- ifelse(pred.t > 0.5,1,0)
# Print to submit into Kaggle to obtain mean test accuracy:
prediction.table <- data.frame(PassengerId = final_testing$PassengerId, 
                               Survived = as.vector(pred.testing))   
write.csv(prediction.table,paste0("Lasso_predictions_1",".csv"), row.names = F)

This submission scored 0.77990, which is not an improvement over our benchmark PCA.svm classifier. Still, it was certainly worth trying regularization to see how far it could take us; in fact, it does fairly well.