Here I present a detailed analysis and predictive modeling of the human activity recognition data set generated by Velloso et al. (http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises). Using the training data set, I built several non-linear models (decision tree, random forest, boosted tree, and support vector machine) to classify 5 types of activity execution from features collected by wearable activity recognition devices. I also demonstrate stacking of combinations of these models, in which second-level classifiers are trained on the predictions of the stand-alone classifiers. All individual and stacked classifiers were then evaluated on an independent validation set, where the random forest model gave the highest accuracy (99.8%). Finally, the random forest model was used to predict 20 independent test cases, resulting in 100% accuracy (verified/data not shown).
Download the data sets:
fileURLTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileURLTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileURLTrain,"train.csv")
download.file(fileURLTest,"test.csv")
Load the data sets:
training <- read.csv("train.csv")
final_testing <- read.csv("test.csv")
Understanding the classes:
Based on the original data description, we notice that the classe variable is the outcome variable representing: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
We will use the training set to build a predictive model in order to classify the cases in the final_testing set.
We therefore partition the training set into a building set (60%), a tune.testing set (20%), and a validation set (20%):
library(caret);library(ggplot2); set.seed(23445)
INbuilding <- createDataPartition(y = training$classe,p=0.6,list = FALSE)
building <- training[INbuilding,]
rest <- training[-INbuilding,]
INtune <- createDataPartition(y = rest$classe,p=0.5,list = FALSE)
tune.testing <- rest[INtune,]
validation <- rest[-INtune,]
All data processing and exploration was initially performed on the building set, and then applied with exactly the same parameters to the tune.testing, validation, and final_testing sets.
Note that the classes are slightly unbalanced: class A is the most frequent.
table(building$classe)
##
## A B C D E
## 3348 2279 2054 1930 2165
length(which(apply(is.na(building),2,sum)>0))
## [1] 67
Note that 11548 data points (about 98% of the rows) are consistently missing in each of 67 variables.
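As a quick check (a minimal sketch, not part of the original processing), the fraction of missing values per column can be computed directly:
na.frac <- colMeans(is.na(building)) # Fraction of NA values in each column of the building set
summary(na.frac[na.frac > 0]) # The 67 affected columns are expected to be ~98% missing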
table(building$classe[is.na(building$amplitude_pitch_forearm)])/table(building$classe)
##
## A B C D E
## 0.9829749 0.9815709 0.9771178 0.9823834 0.9778291
The missing values appear to be balanced across the classes: a similar fraction of each class falls within the rows that contain missing values.
Therefore, we drop the columns containing missing values, as they are unlikely to contribute to our predictive power.
tune.testing <- tune.testing[,-which(apply(is.na(building),2,sum)>0)]
validation <- validation[,-which(apply(is.na(building),2,sum)>0)]
final_testing <- final_testing[,-which(apply(is.na(building),2,sum)>0)]
building <- building[,-which(apply(is.na(building),2,sum)>0)]
Next, we identify and remove features with near-zero variance, as they are unlikely to be informative for classification:
nsv <- nearZeroVar(x = building, saveMetrics = TRUE)
sum(!nsv$nzv)
## [1] 59
59 of the 93 remaining features do not have near-zero variance and will be kept in the data sets.
tune.testing <- tune.testing[,!nsv$nzv]
validation <- validation[,!nsv$nzv]
final_testing <- final_testing[,!nsv$nzv]
building <- building[,!nsv$nzv]
cont <- !sapply(building,is.factor) # Identify the continuous (non-factor) variables in the building set
M <- abs(cor(building[,cont])) # Absolute pairwise correlation matrix of all continuous variables
diag(M) <- 0 # Zero out the diagonal (the correlation of each variable with itself is not of interest)
head(which(M > 0.8, arr.ind = TRUE)) # Which pairs of variables are the most highly correlated?
## row col
## yaw_belt 7 5
## total_accel_belt 8 5
## accel_belt_y 13 5
## accel_belt_z 14 5
## accel_belt_x 12 6
## magnet_belt_x 15 6
head(unique(row.names(which(M > 0.8, arr.ind = TRUE))))
## [1] "yaw_belt" "total_accel_belt" "accel_belt_y"
## [4] "accel_belt_z" "accel_belt_x" "magnet_belt_x"
length(unique(row.names(which(M > 0.8, arr.ind = TRUE))))
## [1] 18
We find that there are 18 highly correlated predictors in the data set.
We perform PCA to see if dimension reduction might help to resolve classes in the case of collinear features:
library(caret)
cor.variables <- building[,unique(row.names(which(M > 0.8, arr.ind = TRUE)))]
cor.variables$classe <- building$classe
prePCA <- preProcess(cor.variables[,-19],method = "pca")
PCAcor <- predict(prePCA,cor.variables[,-19])
qplot(PCAcor$PC1,PCAcor$PC2, color = classe, data = cor.variables) +theme_bw()
We concluded that there is no obvious advantage gained by calculating principal components for the collinear features. Importantly, we noticed that the correlated predictors already form clusters. We will therefore not attempt further dimension reduction, because the data appear well suited to classification algorithms (rather than linear models).
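As an additional check (a minimal sketch, not part of the original pipeline), the proportion of variance captured by the leading principal components of the correlated predictors can be inspected with base R's prcomp:
pc <- prcomp(cor.variables[,-19], center = TRUE, scale. = TRUE) # PCA on the 18 correlated predictors only
summary(pc)$importance[,1:5] # Standard deviation, proportion of variance and cumulative proportion of the first 5 PCs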
Converting factor variables into dummy variables.
which(sapply(building[,-59],is.factor)) # Only two factor variables remained in the datasets
## user_name cvtd_timestamp
## 2 5
# Note that one of them is a time stamp with 20 unique values; at this point we will model both as categorical variables
factors <- which(sapply(building[,-59],is.factor))
# Convert these variables into dummy variables:
dummies <- dummyVars(classe ~ user_name + cvtd_timestamp, data = building)
# Add them into building and all test and validation sets and drop the original factor variables
building <- cbind(building[,-factors],predict(dummies,building))
tune.testing <- cbind(tune.testing[,-factors],predict(dummies,tune.testing))
validation <- cbind(validation[,-factors],predict(dummies,validation))
names(final_testing)[59] <- "classe" # Column names in newdata must match those used to create the dummyVars object for predict() to work
final_testing <- cbind(final_testing[,-factors],predict(dummies,final_testing))
We notice that the variable X is just an indexing variable that can perfectly separate the classes:
qplot(X,classe,data = building,color = classe)+theme_bw()
Such a variable has no real predictive power because the real samples will not be ordered based on such an index. Therefore, we remove the variable X from all sets before building the models:
building <- subset(building,select = -c(X))
tune.testing <- subset(tune.testing,select = -c(X))
validation <- subset(validation,select = -c(X))
final_testing <- subset(final_testing,select = -c(X))
When performing feature creation, particularly when creating the dummy variables, we generated complex feature names. The classification algorithms in R cannot handle feature names that contain special characters such as "/". Therefore, we sanitize all feature names before model training:
names(building) <- make.names(names(building))
names(tune.testing) <- make.names(names(tune.testing))
names(validation) <- make.names(names(validation))
names(final_testing) <- make.names(names(final_testing))
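As a small illustration of what make.names() does (the name below is hypothetical, but has the same form as the dummy-variable names generated above), characters such as "/" and spaces are replaced with dots:
make.names("cvtd_timestamp.05/12/2011 11:24") # returns something like "cvtd_timestamp.05.12.2011.11.24"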
Since this is a very intuitive classification problem and there are clear non-linear trends/structures/clusters in the data, we will build tree-based classifiers and support vector machines to perform predictions.
These classifiers will be: a simple classification tree (rpart), a random forest (rf), a gradient boosted tree (gbm), and a support vector machine with a radial kernel (svmRadial).
We will first build stand-alone classifiers, then stack combinations of them using the same algorithms. We will evaluate the predictive performance of the individual as well as the stacked classifiers using the validation data set.
In order to avoid overfitting, we will perform 10-fold cross-validation while building each individual classifier, as well as when stacking them together. Given the large number of features in the data set, this adds substantial computational expense to building our classifiers. However, it was necessary to obtain the highly accurate classifiers demonstrated below.
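Since each of the four models below uses the same 10-fold cross-validation, the trainControl object could be defined once and reused. The sketch below also shows an optional parallel backend (assuming the doParallel package is available), which caret picks up automatically and which can shorten the training time considerably:
library(caret)
library(doParallel) # Optional parallel backend (assumption: package installed)
fitControl <- trainControl(method = "cv", number = 10) # Reusable 10-fold cross-validation specification
cl <- makePSOCKcluster(4) # 4 worker processes is an arbitrary choice
registerDoParallel(cl) # train() will now run the cross-validation folds in parallel
# ... the train() calls below would pass trControl = fitControl ...
stopCluster(cl) # Release the workers once training is finished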
Simple classification tree (method = “rpart”) with cross validation:
set.seed(125745)
RPART <- train(classe ~ ., data = building,method = "rpart", trControl = trainControl(method = "cv", number = 10))
saveRDS(RPART,"RPART.rds") #Save model object for future loading if necessary
Random Forest (method = "rf") with cross validation
set.seed(125745)
RF <- train(classe ~ ., data = building,method = "rf", trControl = trainControl(method = "cv", number = 10)) #Takes very long time to run!!
saveRDS(RF,"RF.rds") #Save model object for future loading if necessary
Gradient boosted tree (method = “gbm”) with cross validation
set.seed(125745)
GBM <- train(classe ~ ., data = building,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE) # Takes very long time to run!!
saveRDS(GBM,"GBM.rds") #Save model object for future loading if necessary
Support vector machine (method = "svmRadial") with cross validation
set.seed(125745)
SVM <- train(classe ~ ., data = building,method = "svmRadial", trControl = trainControl(method = "cv", number = 10))
# Takes very long time to run!!
saveRDS(SVM,"SVM.rds")#Save model object for future loading if necessary
Loading back the earlier classifiers:
RPART = readRDS("RPART.rds")
RF = readRDS("RF.rds")
GBM = readRDS("GBM.rds")
SVM = readRDS("SVM.rds")
Summarizing the in-sample accuracies from the individual classifiers:
# Using resamples function from the caret package to summarize the data
modelsummary <- resamples(list(RPART=RPART,RF=RF,GBM=GBM,SVM=SVM))
# In-sample accuracy values for each model
summary(modelsummary)$statistics$Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPART 0.4911 0.5522 0.5924 0.5842 0.6088 0.6644 0
## RF 0.9975 0.9983 0.9987 0.9987 0.9992 1.0000 0
## GBM 0.9932 0.9958 0.9958 0.9961 0.9970 0.9992 0
## SVM 0.9210 0.9354 0.9376 0.9382 0.9437 0.9499 0
We notice that 3 out of the 4 classifiers we built perform quite well on the building set.
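As an optional visual comparison (a minimal sketch), the cross-validated accuracies collected in the resamples object can be plotted directly:
bwplot(modelsummary, metric = "Accuracy") # Box-and-whisker comparison of the four classifiers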
# Make independent predictions
pred.tune.RPART <- predict(RPART,newdata = tune.testing)
pred.tune.RF <- predict(RF,newdata = tune.testing)
pred.tune.SVM <- predict(SVM,newdata = tune.testing)
pred.tune.GBM <- predict(GBM,newdata = tune.testing)
# Collect accuracy values in a data.frame:
tune.testing.predictions <-data.frame(pred.tune.RPART,pred.tune.RF,pred.tune.SVM,pred.tune.GBM)
tune.testing.accuracy <- apply(tune.testing.predictions, 2, function(x){
  # apply() coerces the data frame to a character matrix, so convert back to a factor
  x <- factor(x, levels = levels(tune.testing$classe))
  confusionMatrix(x, tune.testing$classe)$overall["Accuracy"]
})
print(tune.testing.accuracy)
## pred.tune.RPART pred.tune.RF pred.tune.SVM pred.tune.GBM
## 0.4863625 0.9989804 0.9370380 0.9959215
Based on the independent predictions on the tune.testing set, the random forest classifier gave the highest accuracy, which is remarkable (0.999). The boosted tree also gave a comparable accuracy.
At this stage we will carry over the predictions of the individual models above and see if we can stack combinations of these models to further improve the accuracy.
To this end, we will stack the individual classifiers built above, training the combining (second-level) model with each of the same 4 classification algorithms.
# Add the outcome variable to the tune.testing predictions of the individual classifiers above
set.seed(125745)
tune.testing.predictions$classe <- tune.testing$classe
# Train stacked classifiers by using 4 algorithms using the tune.testing predictions data set
stacked.model.rpart <- train(classe ~ ., method = "rpart", data = tune.testing.predictions,trControl = trainControl(method = "cv", number = 10) )
stacked.model.rf <- train(classe ~ ., method = "rf", data = tune.testing.predictions,trControl = trainControl(method = "cv", number = 10) )
stacked.model.svm <- train(classe ~ ., method = "svmRadial", data = tune.testing.predictions,trControl = trainControl(method = "cv", number = 10) )
stacked.model.gbm <- train(classe ~ ., method = "gbm", data = tune.testing.predictions,trControl = trainControl(method = "cv", number = 10), verbose = FALSE )
# Using resamples function from the caret package to summarize the data
stacked.modelsummary <- resamples(list(RPART=stacked.model.rpart,RF=stacked.model.rf,GBM=stacked.model.gbm ,SVM=stacked.model.svm))
# In-sample accuracy values for each model
summary(stacked.modelsummary)$statistics$Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPART 0.6599 0.6611 0.6629 0.7304 0.8334 0.8346 0
## RF 0.9974 0.9975 1.0000 0.9990 1.0000 1.0000 0
## GBM 0.9974 0.9975 1.0000 0.9990 1.0000 1.0000 0
## SVM 0.9974 0.9975 1.0000 0.9990 1.0000 1.0000 0
# Individual model predictions by using the validation data set
pred.validation.RPART <- predict(RPART,newdata = validation)
pred.validation.RF <- predict(RF,newdata = validation)
pred.validation.SVM <- predict(SVM,newdata = validation)
pred.validation.GBM <- predict(GBM,newdata = validation)
# Stacked model predictions by using the validation data set:
# First, combine the predictions of the individual classifiers from the validation set:
validation.predictions <-data.frame(pred.tune.RPART = pred.validation.RPART,
pred.tune.RF = pred.validation.RF,
pred.tune.SVM=pred.validation.SVM,
pred.tune.GBM = pred.validation.GBM)
# Then use the stacked classifiers to perform independent predictions in this new data set:
pred.validation.Stacked.RPART <- predict(stacked.model.rpart,newdata = validation.predictions)
pred.validation.Stacked.RF <- predict(stacked.model.rf,newdata = validation.predictions)
pred.validation.Stacked.SVM <- predict(stacked.model.svm,newdata = validation.predictions)
pred.validation.Stacked.GBM <- predict(stacked.model.gbm,newdata = validation.predictions)
# Collect accuracy values in a data.frame:
collected.validation.predictions<-data.frame(pred.validation.RPART,pred.validation.RF,pred.validation.SVM,pred.validation.GBM,pred.validation.Stacked.RPART,pred.validation.Stacked.RF,pred.validation.Stacked.SVM,pred.validation.Stacked.GBM)
validation.accuracy <- apply(collected.validation.predictions, 2, function(x){
  # apply() coerces the data frame to a character matrix, so convert back to a factor
  x <- factor(x, levels = levels(validation$classe))
  confusionMatrix(x, validation$classe)$overall["Accuracy"]
})
print(validation.accuracy)
## pred.validation.RPART pred.validation.RF
## 0.4934999 0.9982157
## pred.validation.SVM pred.validation.GBM
## 0.9393321 0.9964313
## pred.validation.Stacked.RPART pred.validation.Stacked.RF
## 0.6609737 0.9982157
## pred.validation.Stacked.SVM pred.validation.Stacked.GBM
## 0.9982157 0.9982157
We found that stacking the individual models with SVM and GBM increased their overall accuracy relative to the corresponding stand-alone classifiers. However, the stacked models only matched, and did not exceed, the accuracy of the stand-alone random forest classifier.
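For a closer look at the best stand-alone model (a minimal sketch, output not shown in the original report), the per-class performance on the validation set can be inspected with a confusion matrix:
confusionMatrix(pred.validation.RF, validation$classe) # Per-class sensitivity and specificity of the random forest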
We further explored the relative importance of the available features for building the random forest classifier.
dotPlot(varImp(RF), main = "Random forest classifier feature importance")
We noticed that the raw_timestamp_part_1 feature appears to be the most important for the performance of the random forest classifier.
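To list the top-ranked features by name (a minimal sketch), the importance scores can be extracted from the varImp object and sorted:
imp <- varImp(RF)$importance # Data frame with an "Overall" importance score for each feature
head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10) # The 10 most important features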
Therefore, we concluded that the stand-alone random forest classifier we built provides the best predictive performance. We will perform the predictions on the final_testing data set using this classifier.
Finally, we tested the prediction accuracy of our best classifier by performing a one-time test using the final_testing set:
final.prediction.RF <- predict(RF,newdata = final_testing)
By comparing the 20 predicted classes to the actual class labels (data not shown), we found that the random forest classifier correctly classified all 20 cases (100% accuracy). The code provided in this analysis report is sufficient to reproduce and confirm these findings.
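For completeness, a minimal sketch of how the 20 predictions could be written out, one file per test case (the file naming below is an assumption, not prescribed by the analysis):
answers <- as.character(final.prediction.RF)
for (i in seq_along(answers)) {
  writeLines(answers[i], paste0("problem_id_", i, ".txt")) # One prediction per file
}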