Here I will explore the Titanic data set using simple, parsimonious linear models, then perform feature engineering, missing-value imputation and dimension reduction. Finally, I train classifiers using the random forest, support vector machine and gradient boosted tree algorithms. I will also explore stacking these classifiers and tuning the models to improve the accuracy of the survival predictions.
Thanks for reading my Kernel and have a great time here!
Load the data sets:
training <- read.csv("train.csv", stringsAsFactors = FALSE, na.strings = "")
testing <- read.csv("test.csv", stringsAsFactors = FALSE,na.strings = "")
Randomly partition the training set into a 70% training subset and a 30% tuning/test subset (tune.test.set):
library(caret)
set.seed(1234)
InTrain <- createDataPartition(y=training$Survived,p = 0.7,list = FALSE)
tune.test.set <- training[-InTrain,]
training <- training[InTrain,]
Summarize the training data set:
summary(training)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.000 Min. :1.000 Length:624
## 1st Qu.:232.5 1st Qu.:0.000 1st Qu.:2.000 Class :character
## Median :450.5 Median :0.000 Median :3.000 Mode :character
## Mean :448.5 Mean :0.383 Mean :2.303
## 3rd Qu.:663.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :891.0 Max. :1.000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:624 Min. : 0.75 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :28.50 Median :0.0000 Median :0.0000
## Mean :29.92 Mean :0.5096 Mean :0.3782
## 3rd Qu.:38.50 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.0000 Max. :5.0000
## NA's :129
## Ticket Fare Cabin
## Length:624 Min. : 0.000 Length:624
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.458 Mode :character
## Mean : 31.219
## 3rd Qu.: 31.621
## Max. :512.329
##
## Embarked
## Length:624
## Class :character
## Mode :character
##
##
##
##
table(training$Sex)
##
## female male
## 216 408
table(training$Embarked)
##
## C Q S
## 122 51 450
table(training$Survived)
##
## 0 1
## 385 239
The data is imbalanced in terms of gender as well as the outcome class. This will not prevent us from developing classifiers, but it will probably make the task harder!
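To quantify the imbalance before moving on, we can look at the class proportions directly (a quick check on the same training split; nothing new is assumed here):
prop.table(table(training$Survived)) # proportion of victims vs survivors
prop.table(table(training$Sex))      # proportion of female vs male passengers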
Let’s start developing expectations from the data. Note that we will only use the training data set for exploration and feature engineering. This is important to avoid over-fitting and to keep our final out-of-sample accuracy estimate unbiased.
#Exploration by pairs plot:
pairs(Survived ~ Age+ SibSp+Parch+Fare,pch =19, cex = 0.4,data=training,
col= ifelse(training$Survived == 1, "navy","red"))
Without concluding too much from this plot, we can immediately start noticing certain relationships.
An intuitive possibility is that passengers travelling in the higher cabin classes were more likely to survive.
table(training$Survived,training$Pclass)
##
## 1 2 3
## 0 56 69 260
## 1 97 60 82
We see that clearly fewer people survived from the 2nd and 3rd classes. We can go ahead and test whether that difference is statistically significant. Fitting a logistic regression model between these two variables shows that indeed the probability of survival is significantly lower in 2nd class than in 1st class (p < 0.01), and lower still for 3rd class passengers (p < 0.001).
summary(glm(factor(Survived) ~ factor(Pclass), data = training, family = "binomial"))
##
## Call:
## glm(formula = factor(Survived) ~ factor(Pclass), family = "binomial",
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4178 -0.7404 -0.7404 0.9547 1.6900
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5494 0.1678 3.273 0.00106 **
## factor(Pclass)2 -0.6891 0.2436 -2.829 0.00467 **
## factor(Pclass)3 -1.7033 0.2103 -8.101 5.44e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 830.57 on 623 degrees of freedom
## Residual deviance: 755.94 on 621 degrees of freedom
## AIC: 761.94
##
## Number of Fisher Scoring iterations: 4
Therefore, the Pclass feature is an important one to keep in our data set.
In the event of a crisis, women and children are expected to get priority in the rescue efforts. We indeed notice that far more females survived, despite their smaller share of the passengers.
table(training$Survived,training$Sex)
##
## female male
## 0 52 333
## 1 164 75
We can also investigate whether this difference is statistically significant by fitting a logistic regression model. The beauty of logistic regression is that we can tailor our scientific question by including potential confounders in the model. In this case we suspect that Age could be a confounder for the observed relationship between Sex and Survival. We also add the interaction term to see whether the effect of Sex is modified by Age.
summary(glm(Survived ~ Age*Sex, data = training, family = "binomial"))
##
## Call:
## glm(formula = Survived ~ Age * Sex, family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9548 -0.7126 -0.5914 0.7092 2.1966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.79165 0.40478 1.956 0.05050 .
## Age 0.01918 0.01403 1.367 0.17164
## Sexmale -1.57094 0.51413 -3.056 0.00225 **
## Age:Sexmale -0.03842 0.01718 -2.237 0.02529 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 669.39 on 494 degrees of freedom
## Residual deviance: 500.37 on 491 degrees of freedom
## (129 observations deleted due to missingness)
## AIC: 508.37
##
## Number of Fisher Scoring iterations: 4
Fascinating! First, we notice that males indeed have a significantly lower probability of survival (p < 0.01), even after adjusting for Age and the interaction between Sex and Age. It is also worth noting that there is a statistically significant interaction between the Age and Sex predictors (p < 0.05).
Therefore, sex is another feature we will definitely keep to train our classifiers.
As we expected from the Pclass-Survival relationship, passengers who paid higher fares were mostly those who survived:
qplot(y = log(Fare), x = factor(Survived), data = training, geom = "boxplot", fill = factor(Survived))+theme_bw()+scale_fill_manual(values = c("red","blue"))
The question is: does the Fare feature explain any variability beyond Pclass? If not, we should drop it, since keeping two collinear features when training our classifiers adds redundancy and can destabilize the estimates.
When we fit Pclass together with Fare:
summary(glm(Survived ~ factor(Pclass) + Fare, data = training, family = "binomial"))
##
## Call:
## glm(formula = Survived ~ factor(Pclass) + Fare, family = "binomial",
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0701 -0.7697 -0.7248 1.0894 1.7390
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.026382 0.285894 -0.092 0.9265
## factor(Pclass)2 -0.274404 0.293648 -0.934 0.3501
## factor(Pclass)3 -1.236542 0.279913 -4.418 9.98e-06 ***
## Fare 0.007772 0.003309 2.349 0.0188 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 830.57 on 623 degrees of freedom
## Residual deviance: 749.26 on 620 degrees of freedom
## AIC: 757.26
##
## Number of Fisher Scoring iterations: 4
Interestingly, we notice that Fare remains significant (p < 0.05) in the presence of the Pclass feature. This implies that there is additional variability in the Survival outcome explained by Fare, perhaps because Fare is continuous while Pclass is categorical. Therefore, it seems plausible that they explain some non-overlapping variance.
We will also keep the Fare feature to train our model.
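As a complementary check, one way to formalize this "non-overlapping variance" claim is a likelihood-ratio test between the nested models. A minimal sketch that refits the two models above under illustrative names (fit.pclass, fit.pclass.fare) purely for the comparison:
# Nested-model comparison: does adding Fare significantly reduce the deviance?
fit.pclass      <- glm(Survived ~ factor(Pclass),        data = training, family = "binomial")
fit.pclass.fare <- glm(Survived ~ factor(Pclass) + Fare, data = training, family = "binomial")
anova(fit.pclass, fit.pclass.fare, test = "Chisq")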
At this stage we will continue to explore the training data set, with the difference that we will try to extract/engineer different features that are not immediately available to us. We will perform preliminary analyses to test the potential utility of these features.
As always, we will perform all feature engineering, imputation and dimension reduction on the training data set and apply exactly the same transformations to the test data sets.
It is more useful to model a few features as categorical variables:
training <- transform(training,Survived = factor(Survived), Pclass = factor(Pclass),
Sex = factor(Sex),SibSp = factor(SibSp), Parch = factor(Parch), Embarked = factor(Embarked))
tune.test.set <- transform(tune.test.set,Survived = factor(Survived), Pclass = factor(Pclass),Sex = factor(Sex),SibSp = factor(SibSp), Parch = factor(Parch), Embarked = factor(Embarked))
testing <- transform(testing, Pclass = factor(Pclass),
Sex = factor(Sex),SibSp = factor(SibSp), Parch = factor(Parch), Embarked = factor(Embarked))
The PassengerId feature has no classification power and would only introduce noise into our classifiers, so we remove it:
library(dplyr)
training <- dplyr::select(training,-PassengerId)
tune.test.set <- dplyr::select(tune.test.set,-PassengerId)
training.na <- as.data.frame(is.na(training));names(training.na) <- names(training)
apply(training.na,2,sum)
## Survived Pclass Name Sex Age SibSp Parch Ticket
## 0 0 0 0 129 0 0 0
## Fare Cabin Embarked
## 0 476 1
We notice that missing values appear in 3 variables: Age, Cabin and Embarked.
From the available observations, we can infer that the first letter of Cabin represents the cabin section (deck). Let’s extract the first letter from the available values and test whether it has any predictive value:
library(ggplot2); library(dplyr)
Cabin.letter <-substr(training$Cabin[!training.na$Cabin],1,1)
Cabin.survival <- training$Survived[!training.na$Cabin]
Cabin.Pclass <- training$Pclass[!training.na$Cabin]
qplot(x = factor(Cabin.letter), fill = Cabin.survival)+scale_fill_manual(values = c("red","navy"))+theme_bw()
We notice that this feature might have some predictive value. Let’s check whether the cabin letter relates to Pclass:
qplot(x = factor(Cabin.letter), fill = Cabin.Pclass)+scale_fill_manual(values = c("red","navy","green"))+theme_bw()
We indeed notice that Cabin letters A,B,C are absolutely first class and therefore can be inferred from the Pclass variable. D and E are also more likely to be in 1st class. Gs are all coming from the 3rd Class.
We can also check whether Fare relates to the cabin letter:
Cabin.Fare <- training$Fare[!training.na$Cabin]
qplot(x = factor(Cabin.letter), y=Cabin.Fare, color = Cabin.survival)+scale_color_manual(values = c("red","navy"))+theme_bw()
The relationship between Fare and the cabin letter is not strong enough to support imputation.
The problem with imputing the Cabin feature from Pclass is that we don’t know whether the missing values are MCAR (Missing Completely at Random) or whether the reason they are missing is related to the outcome (Survival).
Therefore, it is more sensible to drop this variable and not use it when building our classifiers.
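Before dropping it, a rough sanity check is whether the missingness of Cabin is itself associated with the outcome; if it is, the values are unlikely to be MCAR. A minimal sketch (the helper vector cabin.missing is introduced only for this check):
# Is having a recorded Cabin related to survival?
cabin.missing <- is.na(training$Cabin)
table(cabin.missing, training$Survived)
summary(glm(training$Survived ~ cabin.missing, family = "binomial"))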
Remove Cabin feature from all data sets:
library(dplyr)
training <- dplyr::select(training, - Cabin)
tune.test.set <- dplyr::select(tune.test.set, -Cabin)
testing <- dplyr::select(testing, -Cabin)
We recall the relationship between Age and family size from our earlier pairs plot. Therefore, one intuitive route for imputation is to compare Age with the SibSp feature:
qplot(x = SibSp, y = Age, color = Survived, data = training)+
theme_bw()+scale_color_manual(values = c("red","navy"))+theme_bw()
We notice that as SibSp increases, Age tends to decrease. Let’s look at the distribution of Age across the SibSp bins.
ggplot(data = training, aes(x = Age, fill= Survived))+
geom_histogram(bins = 40)+facet_grid(. ~ SibSp)+ scale_fill_manual(values = c("red","navy"))+
theme_bw()
Aha! Most passengers have either no siblings/spouses aboard or only one. For the one-sibling/spouse group the mean age seems to be higher. In both cases Age is approximately normally distributed. The problem with SibSp is that some of its factor levels are only present in observations where Age is missing, making this predictor unbalanced across the missing and complete cases.
It would be also interesting to explore gender differences when considering Age:
ggplot(data = training, aes(x = Age, fill= Survived))+
geom_histogram(bins = 40)+facet_grid(. ~ Sex)+ scale_fill_manual(values = c("red","navy"))+
theme_bw()
How about the Age distribution in different passenger classes? (Pclass):
ggplot(data = training, aes(x = Age, fill= Survived))+
geom_histogram(bins = 40)+facet_grid(. ~ Pclass)+ scale_fill_manual(values = c("red","navy"))+
theme_bw()
as.data.frame(training %>% group_by(Pclass,Survived) %>% summarise(mean.Age = mean(Age,na.rm=T)))
This is quite interesting! In all passenger classes, the mean age of the passengers who survived is lower than that of those who died. We also notice that the mean age decreases as the class number increases, i.e. older passengers are in the better classes on average. Therefore, Pclass should also be included in the imputation model.
It would therefore be sensible to impute Age by random Gaussian imputation using the mean and standard deviation within the individual factor levels of these predictors. We will include Sex in this model to account for gender-specific differences in Age, as well as Pclass to account for the interesting separation of Age by passenger class we noted above.
In order to do this more rigorously, we can first fit a linear model with the existing Age, Sex and Pclass data:
The model will become:
Age ~ Sex + Pclass + e (random Gaussian error)
lmAge = lm(Age ~ Sex + Pclass, data = training, na.action = "na.omit")
summary(lmAge)
##
## Call:
## lm(formula = Age ~ Sex + Pclass, data = training, na.action = "na.omit")
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.261 -8.119 -0.634 7.819 47.366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.605 1.339 26.600 < 2e-16 ***
## Sexmale 5.577 1.257 4.436 1.13e-05 ***
## Pclass2 -8.949 1.658 -5.397 1.06e-07 ***
## Pclass3 -14.548 1.439 -10.112 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.14 on 491 degrees of freedom
## (129 observations deleted due to missingness)
## Multiple R-squared: 0.1838, Adjusted R-squared: 0.1788
## F-statistic: 36.86 on 3 and 491 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lmAge)[1:4]
## NULL
This model is just OK, but we don’t need a perfect model for this type of imputation. It is nice to see that both the constant-variance and normality assumptions of the model hold, and that all levels of the covariates have a significant effect on the mean outcome in the presence of each other. This should give us a good estimate for the missing values of Age.
Just to check whether our imputation model yields a distribution similar to the original Age:
complete.cases.Age <- training$Age[complete.cases(training$Age)]
after.imputation.Age <-c(complete.cases.Age,predict(lmAge, newdata = training[is.na(training$Age),]))
just.imputed.observations.Age <- after.imputation.Age[!complete.cases(training$Age)]
par(mfrow = c(1,3))
hist(complete.cases.Age,breaks = 20,col = "navy");
hist(after.imputation.Age,breaks = 20,col = "lightgreen");
hist(just.imputed.observations.Age,breaks = 20, col = "purple")
Therefore, our imputation model does a nice job!
Using the model to impute missing values of Age:
training$Age[is.na(training$Age)] = predict(lmAge, newdata = training[is.na(training$Age),])
tune.test.set$Age[is.na(tune.test.set$Age)] = predict(lmAge, newdata = tune.test.set[is.na(tune.test.set$Age),])
testing$Age[is.na(testing$Age)] = predict(lmAge, newdata = testing[is.na(testing$Age),])
Note that we use the same model object we derived from the training data set to impute the missing values of Age for all data sets. This will prevent us from over-fitting to the test data set.
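For reference, a fully stochastic version of this imputation would add a random Gaussian error term with the model’s residual standard deviation, as described above. A minimal sketch (age.draws is an illustrative object; the deterministic conditional means imputed above are what we actually carry forward):
# Stochastic variant: conditional mean plus random Gaussian noise
# training.na$Age (computed earlier) records which Age values were originally missing
set.seed(1234)
age.draws <- predict(lmAge, newdata = training[training.na$Age, ]) +
  rnorm(sum(training.na$Age), mean = 0, sd = sigma(lmAge))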
Re-investigate the missing values:
training.na <- as.data.frame(is.na(training));names(training.na) <- names(training)
tune.testing.na <- as.data.frame(is.na(tune.test.set));names(tune.testing.na) <- names(tune.test.set)
testing.na <- as.data.frame(is.na(testing));names(testing.na) <- names(testing)
data.frame(apply(training.na,2,sum),apply(tune.testing.na ,2,sum),apply(testing.na ,2,sum))
It appears that only one missing case is left in each of the data sets. We will treat this as random missingness and remove the incomplete rows from each data set:
training <- training[complete.cases(training),]
tune.test.set <- tune.test.set[complete.cases(tune.test.set),]
testing <- testing[complete.cases(testing),]
Now we have completed processing the missing values.
Before we start building our classifiers, it is desirable to convert the categorical variables into dummy variables (0 or 1). This preserves parsimony if we later need to interpret the results of any decision trees. Furthermore, many machine learning packages work better with dummy variables.
library(caret)
which(sapply(training[,-1],is.factor))
## Pclass Sex SibSp Parch Embarked
## 1 3 5 6 9
# There are 5 categorical predictors in our data set
factors.training <- which(sapply(training,is.factor))
factors.tune.test.set <- which(sapply(tune.test.set,is.factor))
factors.testing <- which(sapply(testing,is.factor))
dummies.training <- dummyVars(Survived ~ Pclass + Sex + SibSp + Parch + Embarked, data = training)
dummies.tune.test.set <- dummyVars(Survived ~ Pclass + Sex + SibSp + Parch + Embarked, data = tune.test.set)
dummies.testing <- dummyVars(PassengerId ~ Pclass + Sex + SibSp + Parch + Embarked, data = testing)
# Add the dummy variables to both training and test data sets, simultaneously removing the existing factor variables:
training <- cbind(training[,-factors.training[-1]], predict(dummies.training,newdata = training))
tune.test.set <- cbind(tune.test.set[,-factors.tune.test.set[-1]], predict(dummies.tune.test.set,newdata = tune.test.set))
testing <- cbind(testing[,-factors.testing], predict(dummies.testing,newdata = testing))
That was a lot of code! Still, using the caret package makes the process less cumbersome.
After doing that, we notice that some factor levels of certain variables, namely Parch and SibSp, are not present in every set. The lower levels of these variables are conserved in all sets and represent most of the data. At this stage we will only keep the features that are present in all sets:
training <- training[,names(training) %in% names(tune.test.set)]
tune.test.set <- tune.test.set[,names(tune.test.set) %in% names(training)]
identical(names(training),names(tune.test.set))
## [1] TRUE
PassengerId <- testing$PassengerId
testing <- testing[,names(testing) %in% names(training)]
testing$PassengerId <- PassengerId
It would be interesting to simply convert the Ticket feature to numeric and see whether it has any classification value:
Ticket <- toupper(training$Ticket)
# Better to remove anything left of the last white space
w <- grep(" ", Ticket)
last.space<- sapply(gregexpr(" ",Ticket[w]), function(y){
max(y[1])
})
Ticket[w] <- substring(Ticket[w],last.space+1)
Ticket <- gsub(" ","", Ticket)
Ticket <- gsub("[A-Z]","",Ticket)
Ticket <- gsub("\\.","",Ticket)
Ticket <- gsub("\\/","", Ticket)
Ticket <- as.numeric(Ticket)
Ticket[is.na(Ticket)] = 0
Ticket <- as.numeric(Ticket)
qplot(x=Ticket, y = Fare, data = training, color = Survived)+theme_bw()+scale_color_manual(values = c("red","navy"))
We note that at least two groups of ticket numbers are associated with higher fares, and to some extent they separate the outcome classes. Therefore, we will engineer this numeric feature, add it to all sets and remove the original variable.
# Feature Engineering function
num.Ticket <- function(x){
Ticket <- toupper(x$Ticket)
# Better to remove anything left of the last white space
w <- grep(" ", Ticket)
last.space<- sapply(gregexpr(" ",Ticket[w]), function(y){
max(y[1])
})
Ticket[w] <- substring(Ticket[w],last.space+1)
Ticket <- gsub(" ","", Ticket)
Ticket <- gsub("[A-Z]","",Ticket)
Ticket <- gsub("\\.","",Ticket)
Ticket <- gsub("\\/","", Ticket)
Ticket <- as.numeric(Ticket)
Ticket <- as.numeric(Ticket)
return(Ticket)
}
training$num.Ticket <- num.Ticket(training)
tune.test.set$num.Ticket <- num.Ticket(tune.test.set)
testing$num.Ticket <- num.Ticket(testing)
sum(is.na(training$num.Ticket ))
## [1] 2
sum(is.na(tune.test.set$num.Ticket ))
## [1] 2
sum(is.na(testing$num.Ticket))
## [1] 0
After this conversion, only 2 missing values were introduced into each of the training and tune.test data sets.
Looking at the names of the passengers, we notice that we can at least attempt to extract a “title” feature from the name strings and explore its relationship with the outcome:
Pass.Names <- training$Name
first.comma<- sapply(gregexpr(",",Pass.Names), function(y){
y[1][1]
})
first.dot <- sapply(gregexpr("\\.",Pass.Names), function(y){
y[1][1]
})
Titles <- substr(Pass.Names,first.comma+2,first.dot-1)
qplot(x = factor(Titles), y = Age ,color = Survived, data = training)
This feature can also be useful, so we will add it to the data sets:
# Feature engineering function:
Titles <- function(x){
Pass.Names <- x$Name
first.comma<- sapply(gregexpr(",",Pass.Names), function(y){
y[1][1]
})
first.dot <- sapply(gregexpr("\\.",Pass.Names), function(y){
y[1][1]
})
Titles <- substr(Pass.Names,first.comma+2,first.dot-1)
return(Titles)
}
training$Titles <- factor(Titles(training))
tune.test.set$Titles <- factor(Titles(tune.test.set))
testing$Titles <- factor(Titles(testing))
Next, we need to create dummy variables from each of these titles:
factors.training <- which(sapply(training,is.factor))
factors.tune.test.set <- which(sapply(tune.test.set,is.factor))
factors.testing <- which(sapply(testing,is.factor))
dummies.training <- dummyVars(Survived ~ Titles, data = training)
dummies.tune.test.set <- dummyVars(Survived ~ Titles, data = tune.test.set)
dummies.testing <- dummyVars(PassengerId ~ Titles, data = testing)
We then add the dummy variables to both training and test data sets, simultaneously removing the existing factor variables:
training <- cbind(training[,-factors.training[-1]], predict(dummies.training,newdata = training))
tune.test.set <- cbind(tune.test.set[,-factors.tune.test.set[-1]], predict(dummies.tune.test.set,newdata = tune.test.set))
testing <- cbind(testing[,-factors.testing], predict(dummies.testing,newdata = testing))
Not surprisingly, after doing that we notice that there are differences in the factor levels of the Titles feature across the sets. At this stage we will again only keep the features that are present in all sets:
training <- training[,names(training) %in% names(tune.test.set)]
tune.test.set <- tune.test.set[,names(tune.test.set) %in% names(training)]
identical(names(training),names(tune.test.set))
## [1] TRUE
PassengerId <- testing$PassengerId
testing <- testing[,names(testing) %in% names(training)]
testing$PassengerId <- PassengerId
We should also remove the original Ticket and Name features from all data sets since we already have the engineered versions included:
training <- dplyr::select(training,-Ticket,-Name)
tune.test.set <- dplyr::select(tune.test.set ,-Ticket,-Name)
testing <- dplyr::select(testing,-Ticket,-Name)
Finally, since we introduced 2 missing values during the Ticket feature engineering, we keep only the complete cases in each data set:
training <- training[complete.cases(training),]
tune.test.set <- tune.test.set[complete.cases(tune.test.set),]
testing <- testing[complete.cases(testing),]
testing <- data.frame(PassengerId = testing$PassengerId, testing[,-27])
This completes the feature engineering for all data sets.
We generated a lot of features! While we had reasons for building them, now is the chance to drop or transform any that might hurt us down the road.
One reason we might want to reduce the dimension is collinearity, where a few features are highly correlated with each other. Another concern is features with near-zero variance.
Let’s explore if there are near zero variance features in the training data set:
nsv <- nearZeroVar(x = training, saveMetrics = TRUE)
sum(!nsv$nzv)
## [1] 20
Great, we checked: indeed 7 features have near-zero variance! We will drop them and keep the remaining 20 in our data sets.
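To see exactly which features are being dropped, we can pull their names out of the metrics returned by nearZeroVar:
rownames(nsv)[nsv$nzv] # names of the near-zero-variance features we are about to remove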
training <- training[,!nsv$nzv]
tune.test.set <- tune.test.set[,!nsv$nzv]
testing <- testing[,!nsv$nzv]
Let’s have a look at the collinearity in our training set:
M <- abs(cor(training[,-1])) # M is an absolute-value correlation matrix of the pairwise correlations between all predictors
diag(M) <- 0 # Replace the diagonal with zero (the correlations of the variables with themselves are not of interest)
which(M > 0.8, arr.ind = TRUE) # Which variables are the most highly correlated?
## row col
## Sex.male 7 6
## Titles.Mr 18 6
## Sex.female 6 7
## Titles.Mr 18 7
## SibSp.1 9 8
## SibSp.0 8 9
## Sex.female 6 18
## Sex.male 7 18
unique(row.names(which(M > 0.8, arr.ind = TRUE)))
## [1] "Sex.male" "Titles.Mr" "Sex.female" "SibSp.1" "SibSp.0"
cor.variables <- training[,unique(row.names(which(M > 0.8, arr.ind = TRUE)))]
cor.variables$Survived <- training$Survived
We notice that the expected features correlate with each other, for example male Sex and the title Mr.
The next question is: should we replace these features with more statistically orthogonal features by using a matrix transformation?
A PCA would be handy to visualize whether indeed it is worth reducing the dimension in the space of highly correlated features in our training data set:
prePCA <- preProcess(cor.variables[,-6],method = "pca")
PCAcor <- predict(prePCA,cor.variables[,-6])
qplot(PCAcor$PC1,PCAcor$PC2, color = Survived, data = cor.variables)+theme_bw()+scale_color_manual(values = c("red","navy"))
qplot(PCAcor$PC3,PCAcor$PC2, color = Survived, data = cor.variables)+theme_bw()+scale_color_manual(values = c("red","navy"))
This was not too helpful, since even the first 3 principal components are unable to separate the outcome classes cleanly. At least we gave it a try!
We will use the remaining 20 features to train our classifiers.
Given the classification nature of the problem, we will use tree-based algorithms and support vector machines to make predictions. We will use 10-fold cross-validation in our initial training to prevent over-fitting, and we will set the random number generator seed for reproducibility.
set.seed(125745)
RPART <- train(Survived ~ ., data = training,method = "rpart", trControl = trainControl(method = "cv", number = 10))
RPART
## CART
##
## 621 samples
## 19 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01898734 0.8117872 0.5982508
## 0.04008439 0.7926628 0.5628837
## 0.45991561 0.6793521 0.2299384
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01898734.
set.seed(125745)
RF <- train(Survived ~ ., data = training,method = "rf", trControl = trainControl(method = "cv", number = 10))
RF
## Random Forest
##
## 621 samples
## 19 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8164186 0.6018132
## 10 0.8247698 0.6237537
## 19 0.8230801 0.6234377
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
set.seed(125745)
GBM <- train(Survived ~ ., data = training,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
GBM
## Stochastic Gradient Boosting
##
## 621 samples
## 19 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8005985 0.5758879
## 1 100 0.7973991 0.5681702
## 1 150 0.7941213 0.5585009
## 2 50 0.8037739 0.5781097
## 2 100 0.8150147 0.6012232
## 2 150 0.8197519 0.6118256
## 3 50 0.8071063 0.5876910
## 3 100 0.8101769 0.5947381
## 3 150 0.8037500 0.5814624
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
set.seed(125745)
SVM <- train(Survived ~ ., data = training,method = "svmRadial", trControl = trainControl(method = "cv", number = 10))
SVM
## Support Vector Machines with Radial Basis Function Kernel
##
## 621 samples
## 19 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8165227 0.6070723
## 0.50 0.8180836 0.6065842
## 1.00 0.8084309 0.5840500
##
## Tuning parameter 'sigma' was held constant at a value of 0.0545379
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.0545379 and C = 0.5.
# Using resamples function from the caret package to summarize the data
modelsummary <- resamples(list(RPART=RPART,RF=RF,GBM=GBM,SVM=SVM))
# In-sample accuracy values for each model
summary(modelsummary)$statistics$Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPART 0.7619 0.7849 0.8065 0.8118 0.8327 0.8689 0
## RF 0.7581 0.7890 0.8306 0.8248 0.8566 0.8852 0
## GBM 0.7581 0.7823 0.8161 0.8198 0.8497 0.9016 0
## SVM 0.7903 0.8065 0.8175 0.8181 0.8334 0.8548 0
If we take the median accuracy as our benchmark, the random forest classifier has the best accuracy amongst the classifiers we trained. Our out-of-sample accuracy on the tune.test set will likely be lower though.
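For a quick side-by-side ranking on that benchmark, the median accuracies can be extracted directly from the resamples summary computed above:
# Median cross-validated accuracy per model, best first
sort(summary(modelsummary)$statistics$Accuracy[, "Median"], decreasing = TRUE)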
dotPlot(varImp(RPART))
dotPlot(varImp(RF))
dotPlot(varImp(SVM))
dotPlot(varImp(GBM))
It is great to see that some of the features we engineered are amongst the most important ones for the classifiers! Next, let’s see whether a different resampling scheme improves the random forest.
set.seed(125745)
RF.tune <- train(Survived ~ ., data = training,method = "rf", trControl = trainControl(method = "boot632", number = 10) )
median(RF.tune$results$Accuracy)
## [1] 0.878137
Great! Using the 0.632 bootstrap ("boot632") resampling approach increased the estimated in-sample accuracy to almost 88%!
Let’s try to build a ‘classifier of classifiers’ by using the predictions from the individual models. Ideally we would do this on a separate hold-out set; however, the data set we have is relatively small, so we will have to reuse the in-sample predictions from the training data set.
in.sample.predictions <- data.frame(RF.tune = predict(RF.tune,training), RF = predict(RF, training), GBM = predict(GBM,training), RPART = predict(RPART, training), SVM = predict(SVM, training))
# Add the outcome
in.sample.predictions$Survived <- training$Survived
We will stack these classifiers and train two meta-classifiers on their predictions, using the random forest and gradient boosted tree algorithms:
set.seed(125745)
RF.stack <- train(Survived ~ ., data = in.sample.predictions,method = "rf", trControl = trainControl(method = "cv", number = 10))
RF.stack
## Random Forest
##
## 621 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9967742 0.9931397
## 3 0.9967742 0.9931397
## 5 0.9967742 0.9931397
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
set.seed(125745)
GBM.stack <- train(Survived ~ ., data = in.sample.predictions,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
GBM.stack
## Stochastic Gradient Boosting
##
## 621 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 558, 560, 559, 559, 559, 559, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9967742 0.9931397
## 1 100 0.9967742 0.9931397
## 1 150 0.9967742 0.9931397
## 2 50 0.9967742 0.9931397
## 2 100 0.9967742 0.9931397
## 2 150 0.9967742 0.9931397
## 3 50 0.9967742 0.9931397
## 3 100 0.9967742 0.9931397
## 3 150 0.9967742 0.9931397
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 1, shrinkage = 0.1 and n.minobsinnode = 10.
That looks great! Stacking the classifiers really improved the accuracy and we now have over 99% in-sample accuracy. There is likely some over-fitting since we reused the training set multiple times, but we will assess our out-of-sample accuracy using the tune.test.set.
Before making our predictions on the actual test data set, it is a good idea to get an estimate of our out-of-sample accuracy using the tune.test.set we randomly partitioned at the start. Since we haven’t touched this data set while training our classifiers, it gives a good first-hand estimate of the out-of-sample accuracy.
Since we will use the stacked classifiers we constructed, we first follow the same procedure on the tune.test data set to generate predictions from each of the trained base classifiers:
pred.tune.test.RF <- predict(RF, tune.test.set)
pred.tune.test.RF.tune <- predict(RF.tune, tune.test.set)
pred.tune.test.GBM <- predict(GBM, tune.test.set)
pred.tune.test.RPART <- predict(RPART, tune.test.set)
pred.tune.test.SVM <- predict(SVM, tune.test.set)
tune.test.predictions <- data.frame(RF.tune = pred.tune.test.RF.tune, RF = pred.tune.test.RF, GBM = pred.tune.test.GBM, RPART = pred.tune.test.RPART, SVM = pred.tune.test.SVM)
# Add the outcome from tune.test set
tune.test.predictions$Survived <-tune.test.set$Survived
Next, make the predictions:
pred.tune.test.RF.stack <- predict(RF.stack,tune.test.predictions)
pred.tune.test.GBM.stack <- predict(GBM.stack,tune.test.predictions)
confusionMatrix(pred.tune.test.RF.stack, tune.test.set$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 149 32
## 1 13 70
##
## Accuracy : 0.8295
## 95% CI : (0.7786, 0.8729)
## No Information Rate : 0.6136
## P-Value [Acc > NIR] : 2e-14
##
## Kappa : 0.6277
## Mcnemar's Test P-Value : 0.00729
##
## Sensitivity : 0.9198
## Specificity : 0.6863
## Pos Pred Value : 0.8232
## Neg Pred Value : 0.8434
## Prevalence : 0.6136
## Detection Rate : 0.5644
## Detection Prevalence : 0.6856
## Balanced Accuracy : 0.8030
##
## 'Positive' Class : 0
##
confusionMatrix(pred.tune.test.GBM.stack, tune.test.set$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 148 33
## 1 14 69
##
## Accuracy : 0.822
## 95% CI : (0.7704, 0.8662)
## No Information Rate : 0.6136
## P-Value [Acc > NIR] : 1.796e-13
##
## Kappa : 0.6111
## Mcnemar's Test P-Value : 0.00865
##
## Sensitivity : 0.9136
## Specificity : 0.6765
## Pos Pred Value : 0.8177
## Neg Pred Value : 0.8313
## Prevalence : 0.6136
## Detection Rate : 0.5606
## Detection Prevalence : 0.6856
## Balanced Accuracy : 0.7950
##
## 'Positive' Class : 0
##
The out-of-sample accuracy of the stacked classifiers is ~83%, well below the in-sample estimate. This is probably because we did not use a third, intermediate data set when stacking our classifiers, which caused the stacked classifiers to over-fit and the in-sample accuracy to be overestimated.
Nevertheless, it was worth building the stacked classifiers, as they still improve on the stand-alone classifiers.
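For reference, a cleaner stacking setup would hold out a separate layer-two set so the meta-learner never sees predictions made on the base learners’ own training data. A minimal sketch of such a split, with illustrative object names (idx.base, base.set, meta.set); it is not used in this kernel:
# Sketch of a three-way setup for stacking
set.seed(1234)
idx.base <- createDataPartition(training$Survived, p = 0.6, list = FALSE)
base.set <- training[idx.base, ]  # train RF, GBM, RPART, SVM here
meta.set <- training[-idx.base, ] # predict with the base learners here, then train the stacker on those predictions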
Our final step is to make predictions on the unlabeled testing data set and prepare a submission file that matches our predictions with the passenger IDs. We will use the stacked random forest classifier, as it gave us the highest out-of-sample accuracy:
pred.test.RF <- predict(RF, testing[,-1])
pred.test.RF.tune <- predict(RF.tune, testing[,-1])
pred.test.GBM <- predict(GBM, testing[,-1])
pred.test.RPART <- predict(RPART, testing[,-1])
pred.test.SVM <- predict(SVM, testing[,-1])
test.predictions <- data.frame(RF.tune = pred.test.RF.tune, RF = pred.test.RF , GBM = pred.test.GBM, RPART = pred.test.RPART, SVM = pred.test.SVM)
pred.test.RF.stack <- as.numeric(as.character(predict(RF.stack,test.predictions)))
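To turn this stacked prediction into an actual submission, we pair it with the PassengerId column and write it out; a minimal sketch, with the file name chosen arbitrarily:
# Write the Kaggle submission file for the stacked random forest predictions
submission <- data.frame(PassengerId = testing$PassengerId, Survived = pred.test.RF.stack)
write.csv(submission, "stacked_RF_predictions.csv", row.names = FALSE)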
In this new section, I will revisit the data with fresh eyes and try alternative approaches to improve the classification accuracy.
Perhaps it was a little wasteful to separate a validation set from the small training set. Since we already perform cross-validation while training our models, in the rest of my attempts I will use the entire training set to train the classifiers.
library(dplyr);library(ggplot2); library(caret)
training <- read.csv("train.csv")
final_testing <- read.csv("test.csv")
training <- dplyr::select(training, -PassengerId)
training$Survived <- factor(training$Survived)
# Fare is important to keep, Age perhaps not
qplot(x = Age, y = Fare, data = training, color = Survived)
# Also keep Pclass (good at classifying victims)
qplot(x = Pclass, y = Fare, data = training, color = Survived, alpha = I(0.2))
# Keep SibSp and Parch as well (combined here into family size)
F.size = training$Parch + training$SibSp
qplot(x = F.size , y = Fare, data = training, color = Survived, alpha = I(0.4))
# Definitely keep Gender
qplot(x = Sex , y = Fare, data = training, color = Survived, alpha = I(0.4))
# remove other features:
training <- dplyr::select(training, -Name, -Age, -Ticket,-Cabin, - Embarked)
final_testing <- dplyr::select(final_testing, -Name, -Age, -Ticket,-Cabin, - Embarked)
summary(training)
## Survived Pclass Sex SibSp Parch
## 0:549 Min. :1.000 female:314 Min. :0.000 Min. :0.0000
## 1:342 1st Qu.:2.000 male :577 1st Qu.:0.000 1st Qu.:0.0000
## Median :3.000 Median :0.000 Median :0.0000
## Mean :2.309 Mean :0.523 Mean :0.3816
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :3.000 Max. :8.000 Max. :6.0000
## Fare
## Min. : 0.00
## 1st Qu.: 7.91
## Median : 14.45
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
summary(final_testing)
## PassengerId Pclass Sex SibSp
## Min. : 892.0 Min. :1.000 female:152 Min. :0.0000
## 1st Qu.: 996.2 1st Qu.:1.000 male :266 1st Qu.:0.0000
## Median :1100.5 Median :3.000 Median :0.0000
## Mean :1100.5 Mean :2.266 Mean :0.4474
## 3rd Qu.:1204.8 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :1309.0 Max. :3.000 Max. :8.0000
##
## Parch Fare
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 7.896
## Median :0.0000 Median : 14.454
## Mean :0.3923 Mean : 35.627
## 3rd Qu.:0.0000 3rd Qu.: 31.500
## Max. :9.0000 Max. :512.329
## NA's :1
# There is one missing value in the Fare feature in the testing set, thus we need to find a way to impute by using the
# training set
lmFit1 <- lm(Fare ~ factor(Pclass)*factor(Sex), data = training)
summary(lmFit1)
##
## Call:
## lm(formula = Fare ~ factor(Pclass) * factor(Sex), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.20 -8.37 -4.77 4.03 445.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.126 4.013 26.443 < 2e-16 ***
## factor(Pclass)2 -84.156 6.003 -14.020 < 2e-16 ***
## factor(Pclass)3 -90.007 5.160 -17.444 < 2e-16 ***
## factor(Sex)male -38.900 5.340 -7.284 7.16e-13 ***
## factor(Pclass)2:factor(Sex)male 36.671 7.903 4.640 4.01e-06 ***
## factor(Pclass)3:factor(Sex)male 35.442 6.588 5.380 9.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 885 degrees of freedom
## Multiple R-squared: 0.3903, Adjusted R-squared: 0.3869
## F-statistic: 113.3 on 5 and 885 DF, p-value: < 2.2e-16
par(mfrow= c(2,2))
plot(lmFit1)[1:4]
## NULL
# lmFit1 Fair model to impute Fare in the testing set
sum(is.na(training$Fare)) # No missing values in the training set
## [1] 0
sum(is.na(final_testing$Fare)) # One missing value in the test set
## [1] 1
final_testing$Fare[is.na(final_testing$Fare)] <- predict(lmFit1, newdata = final_testing[is.na(final_testing$Fare),])
sum(is.na(final_testing$Fare)) # Imputed
## [1] 0
# Final check confirms that there is no remaining missing data in either of the data sets
apply(is.na(training),2,sum);apply(is.na(final_testing),2,sum)
## Survived Pclass Sex SibSp Parch Fare
## 0 0 0 0 0 0
## PassengerId Pclass Sex SibSp Parch Fare
## 0 0 0 0 0 0
# Convert Sex to dummy variable
training$Female <- ifelse(training$Sex == "female", 1,0)
final_testing$Female <- ifelse(final_testing$Sex == "female", 1,0)
training <- dplyr::select(training, - Sex)
final_testing <- dplyr::select(final_testing, -Sex)
This provided us with a fairly small set of features that are likely to be important for classification.
Let’s now see whether Principal Components can aid classification:
prePCA <- preProcess(training[,-1],method = "pca")
PCAtraining <- predict(prePCA,newdata = training)
qplot(x = PC1, y = PC2, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
scale_color_manual(values = c("red","navy"))
qplot(x = PC2, y = PC3, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
scale_color_manual(values = c("red","navy"))
qplot(x = PC1, y = PC3, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
scale_color_manual(values = c("red","navy"))
The PCA predictors separate the classes pretty well!
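To see how much of the total variance these components capture, we can run a plain PCA on the same predictors and inspect the cumulative proportions; a quick sketch (pca.check is an illustrative object, and we center and scale just as preProcess does internally):
# Cumulative proportion of variance explained by the principal components
pca.check <- prcomp(training[, -1], center = TRUE, scale. = TRUE)
summary(pca.check)$importance["Cumulative Proportion", ]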
Having captured most of the variability in the existing predictors, what happens if we train our classifiers on these principal components?
set.seed(1234)
PCAknn <- train(Survived ~ ., data = PCAtraining,method = "knn",
trControl= trainControl(method = "boot632"))
# So far this gives the best accuracy (benchmark: 0.83) ##############
set.seed(1234)
PCA.rf <- train(Survived ~ ., data = PCAtraining,method = "rf",
trControl= trainControl(method = "boot632"))
##################################################################
# Not as good as the PCA.rf
set.seed(1234)
PCA.gbm <- train(Survived ~ ., data = PCAtraining,method = "gbm",
trControl= trainControl(method = "boot632", number = 200),verbose =F)
# Not as good as the PCA.rf
set.seed(1234)
PCA.lda <- train(Survived ~ ., data = PCAtraining,method = "lda",
trControl= trainControl(method = "boot632"), verbose =F)
# Not as good as the PCA.rf
set.seed(1234)
PCA.svm <- train(Survived ~ ., data = PCAtraining,method = "svmRadial",
trControl= trainControl(method = "boot632"), verbose =F)
# Let's make a prediction by using final testing set:
PCAtesting <- predict(prePCA,newdata = final_testing)
predictions <- NULL
models <- list(PCA.lda,
PCA.gbm,
PCA.rf,
PCAknn,
PCA.svm)
predictions <- sapply(models, function(x){
temp <- as.numeric(as.character(predict(x,newdata = PCAtesting[,-1])))
})
prediction.table <- NULL
for(i in seq_along(predictions)){
prediction.table <- data.frame(PassengerId = final_testing$PassengerId,
Survived = predictions[,i])
write.csv(prediction.table,paste0("PCA_predictions",i,".csv"), row.names = F)
}
PCA.svm performed better than any other classifier I trained so far! The test accuracy was 0.77990.
Surprisingly, PCA.lda was the second best one!
**Take home message:** reduce the dimension as much as possible, try to explain more variability in the data with a smaller set of features, and keep trying different models regardless of their training accuracy; they might perform better on the test set.
This brings us a slight rise on the leaderboard, but we certainly have further to go.
Let’s go back and repeat the preprocessing, this time keeping the Ticket and Name features so we can engineer new features from them:
library(dplyr);library(ggplot2); library(caret)
training <- read.csv("train.csv")
final_testing <- read.csv("test.csv")
training <- dplyr::select(training, -PassengerId)
training$Survived <- factor(training$Survived)
# remove some features:
training <- dplyr::select(training,-Cabin, - Embarked, -Age)
final_testing <- dplyr::select(final_testing,-Cabin, - Embarked, -Age)
summary(training)
## Survived Pclass Name
## 0:549 Min. :1.000 Abbing, Mr. Anthony : 1
## 1:342 1st Qu.:2.000 Abbott, Mr. Rossmore Edward : 1
## Median :3.000 Abbott, Mrs. Stanton (Rosa Hunt) : 1
## Mean :2.309 Abelson, Mr. Samuel : 1
## 3rd Qu.:3.000 Abelson, Mrs. Samuel (Hannah Wizosky): 1
## Max. :3.000 Adahl, Mr. Mauritz Nils Martin : 1
## (Other) :885
## Sex SibSp Parch Ticket
## female:314 Min. :0.000 Min. :0.0000 1601 : 7
## male :577 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7
## Median :0.000 Median :0.0000 CA. 2343: 7
## Mean :0.523 Mean :0.3816 3101295 : 6
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6
## Max. :8.000 Max. :6.0000 CA 2144 : 6
## (Other) :852
## Fare
## Min. : 0.00
## 1st Qu.: 7.91
## Median : 14.45
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
summary(final_testing)
## PassengerId Pclass
## Min. : 892.0 Min. :1.000
## 1st Qu.: 996.2 1st Qu.:1.000
## Median :1100.5 Median :3.000
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Name Sex
## Abbott, Master. Eugene Joseph : 1 female:152
## Abelseth, Miss. Karen Marie : 1 male :266
## Abelseth, Mr. Olaus Jorgensen : 1
## Abrahamsson, Mr. Abraham August Johannes : 1
## Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## Aks, Master. Philip Frank : 1
## (Other) :412
## SibSp Parch Ticket Fare
## Min. :0.0000 Min. :0.0000 PC 17608: 5 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 113503 : 4 1st Qu.: 7.896
## Median :0.0000 Median :0.0000 CA. 2343: 4 Median : 14.454
## Mean :0.4474 Mean :0.3923 16966 : 3 Mean : 35.627
## 3rd Qu.:1.0000 3rd Qu.:0.0000 220845 : 3 3rd Qu.: 31.500
## Max. :8.0000 Max. :9.0000 347077 : 3 Max. :512.329
## (Other) :396 NA's :1
# There is one missing value in the Fare feature in the testing set, thus we need to find a way to impute by using the
# training set
lmFit1 <- lm(Fare ~ factor(Pclass)*factor(Sex), data = training)
summary(lmFit1)
##
## Call:
## lm(formula = Fare ~ factor(Pclass) * factor(Sex), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.20 -8.37 -4.77 4.03 445.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.126 4.013 26.443 < 2e-16 ***
## factor(Pclass)2 -84.156 6.003 -14.020 < 2e-16 ***
## factor(Pclass)3 -90.007 5.160 -17.444 < 2e-16 ***
## factor(Sex)male -38.900 5.340 -7.284 7.16e-13 ***
## factor(Pclass)2:factor(Sex)male 36.671 7.903 4.640 4.01e-06 ***
## factor(Pclass)3:factor(Sex)male 35.442 6.588 5.380 9.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 885 degrees of freedom
## Multiple R-squared: 0.3903, Adjusted R-squared: 0.3869
## F-statistic: 113.3 on 5 and 885 DF, p-value: < 2.2e-16
par(mfrow= c(2,2))
plot(lmFit1)[1:4]
## NULL
# lmFit1 Fair model to impute Fare in the testing set
sum(is.na(training$Fare)) # No missing values in the training set
## [1] 0
sum(is.na(final_testing$Fare)) # One missing value in the test set
## [1] 1
final_testing$Fare[is.na(final_testing$Fare)] <- predict(lmFit1, newdata = final_testing[is.na(final_testing$Fare),])
sum(is.na(final_testing$Fare)) # Imputed
## [1] 0
# Final check confirms that there is no remaining missing data in either of the data sets
apply(is.na(training),2,sum);apply(is.na(final_testing),2,sum)
## Survived Pclass Name Sex SibSp Parch Ticket Fare
## 0 0 0 0 0 0 0 0
## PassengerId Pclass Name Sex SibSp Parch
## 0 0 0 0 0 0
## Ticket Fare
## 0 0
# Convert Sex to dummy variable
training$Female <- ifelse(training$Sex == "female", 1,0)
final_testing$Female <- ifelse(final_testing$Sex == "female", 1,0)
training <- dplyr::select(training, - Sex)
final_testing <- dplyr::select(final_testing, -Sex)
# Ticket
Ticket <- toupper(training$Ticket)
# Better to remove anything left of the last white space
w <- grep(" ", Ticket)
last.space<- sapply(gregexpr(" ",Ticket[w]), function(y){
max(y[1])
})
Ticket[w] <- substring(Ticket[w],last.space+1)
Ticket <- gsub(" ","", Ticket)
Ticket <- gsub("[A-Z]","",Ticket)
Ticket <- gsub("\\.","",Ticket)
Ticket <- gsub("\\/","", Ticket)
Ticket <- as.numeric(Ticket)
Ticket[is.na(Ticket)] = 0
Ticket <- as.numeric(Ticket)
plot(x=log10(Ticket), y = training$Fare, col = ifelse(training$Survived == 0, "red","navy"))
plot(x=log10(Ticket), y = training$SibSp, col = ifelse(training$Survived == 0, "red","navy"))
plot(x=log10(Ticket), y = training$Parch, col = ifelse(training$Survived == 0, "red","navy"))
# Remove original Ticket and add new Ticket feature
# Feature Engineering function
num.Ticket <- function(x){
Ticket <- toupper(x$Ticket)
# Better to remove anything left of the last white space
w <- grep(" ", Ticket)
last.space<- sapply(gregexpr(" ",Ticket[w]), function(y){
max(y[1])
})
Ticket[w] <- substring(Ticket[w],last.space+1)
Ticket <- gsub(" ","", Ticket)
Ticket <- gsub("[A-Z]","",Ticket)
Ticket <- gsub("\\.","",Ticket)
Ticket <- gsub("\\/","", Ticket)
Ticket <- as.numeric(Ticket)
Ticket <- as.numeric(Ticket)
return(log10(Ticket))
}
training$num.Ticket <- num.Ticket(training)
final_testing$num.Ticket <- num.Ticket(final_testing)
sum(is.na(training$num.Ticket )) # 4 Missing values introduced
## [1] 4
sum(is.na(final_testing$num.Ticket)) # No missing values
## [1] 0
# remove the Ticket feature
training <- dplyr::select(training,-Ticket)
final_testing <- dplyr::select(final_testing,-Ticket)
# Title feature from Name:
# Feature engineering function:
Titles.func <- function(x){
Pass.Names <- x$Name
first.comma<- sapply(gregexpr(",",Pass.Names), function(y){
y[1][1]
})
first.dot <- sapply(gregexpr("\\.",Pass.Names), function(y){
y[1][1]
})
Titles <- substr(Pass.Names,first.comma+2,first.dot-1)
return(Titles)
}
Titles <- Titles.func(training)
qplot(x = training$Fare[training$Female == 0], y = factor(Titles[training$Female == 0]), data = training[training$Female == 0,], color = Survived, alpha = I(0.2))+
theme_bw()
# We notice that "Master" title explains quite a few of the survived males. This would be a useful feature to add into the data sets
training$Titles.Master <- ifelse(Titles == "Master",1,0)
final_testing$Titles.Master <- ifelse(Titles.func(final_testing) == "Master",1,0)
# Remove the original Name feature:
training <- dplyr::select(training,-Name)
final_testing <- dplyr::select(final_testing,-Name)
apply(is.na(training),2,sum)
## Survived Pclass SibSp Parch Fare
## 0 0 0 0 0
## Female num.Ticket Titles.Master
## 0 4 0
apply(is.na(final_testing),2,sum)
## PassengerId Pclass SibSp Parch Fare
## 0 0 0 0 0
## Female num.Ticket Titles.Master
## 0 0 0
training <- training[complete.cases(training),]
Now we have added 2 more features (num.Ticket and Titles.Master) compared to our previous attempt. Let’s build our classifiers to see if that makes any difference.
prePCA <- preProcess(training[,-1],method = "pca")
PCAtraining <- predict(prePCA,newdata = training)
qplot(x = PC1, y = PC2, data = PCAtraining, color = Survived, alpha = I(0.3))+theme_bw()+
scale_color_manual(values = c("red","navy"))
The decision boundary is still not an easy one for any classifier. But we can’t tell if we have improved the accuracy until we make some predictions!
set.seed(1234)
PCA.svm <- train(Survived ~ ., data = PCAtraining,method = "svmRadial",
trControl= trainControl(method = "boot632"), verbose =F)
# Let's perform a prediction:
PCAtesting <- predict(prePCA,newdata = final_testing)
prediction.table <- data.frame(PassengerId = final_testing$PassengerId,
Survived = predict(PCA.svm,PCAtesting))
write.csv(prediction.table,paste0("PCA_predictions_3",".csv"), row.names = F)
Great! These two new features alone have increased the testing accuracy from 0.77 to 0.79426!
Take home message: every small twist makes a huge difference!
It would also be interesting to perform linear model selection and see how well it does with the new training set we obtained.
library(glmnet)
# We convert Pclass, SibSp and Parch into factor variables
training$Pclass <- factor(training$Pclass)
training$SibSp <- factor(training$SibSp)
training$Parch <- factor(training$Parch)
final_testing$Pclass <- factor(final_testing$Pclass)
final_testing$SibSp <- factor(final_testing$SibSp)
final_testing$Parch <- factor(final_testing$Parch)
# We create a matrix of predictors
x.training = model.matrix(Survived ~ . -1, data = training)
x.testing = model.matrix( ~ . -1, data = final_testing)
y = training$Survived
# Fit lasso regression to training data
fit.lasso <- glmnet(x.training,y, family = "binomial")
plot(fit.lasso, xvar = "lambda", label = TRUE)
plot(fit.lasso, xvar = "dev", label = TRUE)
Lasso is a nice way of reducing the number of features while still explaining most of the variance, which helps us manage the bias-variance trade-off. In this case we notice that only 4 features explain about 30% of the deviance, while adding all features only raises that to about 40%. It seems preferable to use the restricted model with a few features and their shrunken coefficients.
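These deviance figures can be read directly off the fitted glmnet object, where each row of the path corresponds to one lambda value:
# Number of non-zero coefficients and fraction of deviance explained along the lambda path
head(data.frame(df = fit.lasso$df, dev.ratio = fit.lasso$dev.ratio, lambda = fit.lasso$lambda), 20)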
It is best to choose the optimal lambda using cross-validation, with the aim of minimizing the cross-validated error:
cv.lasso <- cv.glmnet(x.training,y, family = "binomial")
plot(cv.lasso)
Notice that what we are doing here is computationally intensive: for each value of lambda on a fine grid, the model is refit on part of the data and its prediction error is estimated on the held-out folds, and this is repeated across the whole range of lambda values.
Two vertical lines appear in the plot: the left one marks the model with the minimum cross-validated error, and the other marks a slightly more restricted model whose error is within one standard error of that minimum.
When choosing the optimal model, we might prefer the more restricted model indicated by the second vertical line; this one-standard-error rule is the default in the glmnet package.
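The two lambda values marked by the vertical lines are stored on the cv.glmnet object, and the less restricted fit can be inspected explicitly if we prefer it:
cv.lasso$lambda.min # lambda with the minimum cross-validated error
cv.lasso$lambda.1se # more restricted lambda within one standard error of the minimum (the default used below)
coef(cv.lasso, s = "lambda.min") # coefficients at the minimum-error lambda, for comparison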
The final model coefficients we obtained by cross-validation and the non-zero features remaining:
coef(cv.lasso)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.283806e+00
## Pclass1 7.169612e-01
## Pclass2 .
## Pclass3 -7.886747e-01
## SibSp1 .
## SibSp2 .
## SibSp3 -7.647953e-01
## SibSp4 -6.354454e-01
## SibSp5 -5.233464e-01
## SibSp8 -6.220902e-01
## Parch1 .
## Parch2 .
## Parch3 .
## Parch4 -3.456830e-01
## Parch5 .
## Parch6 .
## Fare 4.924973e-05
## Female 2.475991e+00
## num.Ticket .
## Titles.Master 1.938509e+00
#Training accuracy
pred <- predict(cv.lasso, type = "response", newx = x.training)
# Turning probabilities into predictions
pred.training <- ifelse(pred > 0.5,1,0)
# Mean training accuracy
mean(pred.training == training$Survived)
## [1] 0.8139797
###### Testing predictions for submission
# These steps are needed otherwise predict() will complain!
x.testing <- data.frame(x.testing) # Convert test matrix to data frame
x.testing$Survived <- NULL # Set the response to NULL (otherwise model object will keep looking at it and can't find one from the test set, throws error)
# Finally retain only the features that exists in the training data matrix
x.testing <- x.testing[,which(colnames(x.testing) %in% colnames(x.training))]
# Calculate the probabilities:
pred.t <- predict(cv.lasso, type = "response", newx = data.matrix(x.testing))
# Turn them into predictions
pred.testing <- ifelse(pred.t > 0.5,1,0)
# Print to submit into Kaggle to obtain mean test accuracy:
prediction.table <- data.frame(PassengerId = final_testing$PassengerId,
Survived = data.frame(Survived = pred.testing))
write.csv(prediction.table,paste0("Lasso_predictions_1",".csv"), row.names = F)
This submission scored 0.77990, which is not an improvement over our benchmark PCA.svm classifier. Still, it was certainly worth trying regularization to see how far it could take us; in fact, it does fairly well.