Here I used the Ames housing data set to perform a fairly comprehensive exploratory analysis and predictive modeling exercise to estimate house prices. I first explored the training set extensively, then performed feature engineering, missing value imputation and feature selection using the training set. Finally, I trained linear models (lasso regression and principal components regression) as well as more complex algorithms, including gradient boosting, extreme gradient boosting, random forests and support vector machines. The result is a model that predicts house sale prices fairly well, with an RMSE of 0.12717 on the test data set.
train <- read.csv("train.csv"); test <- read.csv("test.csv")
Here, I will briefly look at the data just to get familiar with the features, missing values and anything strange in general.
The outcome variable SalePrice is complete, but a number of predictors have many missing values, so we will need to deal with quite a bit of missing data. The median house price is $163,000.
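As a quick check of these claims, one could confirm directly that SalePrice has no missing values and recover its median (a minimal sketch; the summary below shows the same information):
# Quick check: SalePrice has no NAs and its median is 163,000
sum(is.na(train$SalePrice))
median(train$SalePrice)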
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
par(mfrow = c(1,2))
boxplot(train$SalePrice,main = "SalePrice")
boxplot(log10(train$SalePrice), main = "Log10(SalePrice)")
library(pls)
Which features have more than 30% missing data?
na.status <- is.na(train)
na.sum <- apply(na.status,2,sum)
names(na.sum) <- colnames(train)
mostly_missing <- which(na.sum > (0.3 * nrow(train)))
na.sum[mostly_missing]
## Alley FireplaceQu PoolQC Fence MiscFeature
## 1369 690 1453 1179 1406
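The same check can be written more compactly with colMeans(); an equivalent sketch:
# Fraction of NAs per column, keeping only columns with more than 30% missing
na.frac <- colMeans(is.na(train))
sort(na.frac[na.frac > 0.3], decreasing = TRUE)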
Five variables have a substantial amount of missing data. Do they have an impact on house prices?
Based on the data description:
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
library(ggplot2)
ggplot(data = train, aes(x = Alley, y = log(SalePrice), fill= Alley))+
geom_boxplot()
Alley might be important: homes with gravel alley access have significantly lower prices. We convert the NAs to "NoAlley" so the level is more informative.
train$Alley <- as.character(train$Alley)
train$Alley[which(is.na(train$Alley))] <- "NoAlley"
train$Alley <- factor(train$Alley)
#Also transform the test set
test$Alley <- as.character(test$Alley)
test$Alley[which(is.na(test$Alley))] <- "NoAlley"
test$Alley <- factor(test$Alley)
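Since the same three-line recoding is repeated below for FireplaceQu, PoolQC, Fence and MiscFeature, a small helper could reduce the repetition. A sketch (recode_na_level is a hypothetical name, not part of the original code):
# Helper sketch: recode NA as an explicit factor level
recode_na_level <- function(x, label) {
  x <- as.character(x)
  x[is.na(x)] <- label
  factor(x)
}
# e.g. train$FireplaceQu <- recode_na_level(train$FireplaceQu, "NoFireplace")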
Based on the data description:
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
ggplot(data = train, aes(x = FireplaceQu, y = log10(SalePrice), fill= FireplaceQu))+
geom_boxplot()
Let's look at this through a simple linear model:
par(mfrow=c(1,2))
hist(train$SalePrice)
hist(log10(train$SalePrice)) # Looks more gaussian
FirePlaceFit <- lm(log10(train$SalePrice) ~ FireplaceQu, data = train)
summary(FirePlaceFit)
##
## Call:
## lm(formula = log10(train$SalePrice) ~ FireplaceQu, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38691 -0.09351 -0.01374 0.09604 0.57966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.50252 0.03013 182.644 < 2e-16 ***
## FireplaceQuFa -0.28729 0.03959 -7.256 9.82e-13 ***
## FireplaceQuGd -0.17957 0.03106 -5.781 1.08e-08 ***
## FireplaceQuPo -0.40442 0.04469 -9.050 < 2e-16 ***
## FireplaceQuTA -0.21003 0.03126 -6.719 3.58e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1476 on 765 degrees of freedom
## (690 observations deleted due to missingness)
## Multiple R-squared: 0.118, Adjusted R-squared: 0.1134
## F-statistic: 25.59 on 4 and 765 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(FirePlaceFit)
It appears that all levels of this feature also have significantly different mean sale prices. Again, we convert the NAs to "NoFireplace" to be more informative.
train$FireplaceQu <- as.character(train$FireplaceQu)
train$FireplaceQu[which(is.na(train$FireplaceQu))] <- "NoFireplace"
train$FireplaceQu <- factor(train$FireplaceQu)
#Also transform the test set
test$FireplaceQu <- as.character(test$FireplaceQu)
test$FireplaceQu[which(is.na(test$FireplaceQu))] <- "NoFireplace"
test$FireplaceQu <- factor(test$FireplaceQu)
Based on the data description:
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
library(ggplot2)
ggplot(data = train, aes(x = PoolQC, y = log10(SalePrice), fill = PoolQC))+
geom_boxplot()
Note that very few houses actually have pools. Houses with an excellent pool condition have significantly higher sale prices. We will keep this feature as well, converting NAs to "NoPool":
train$PoolQC <- as.character(train$PoolQC)
train$PoolQC[which(is.na(train$PoolQC))] <- "NoPool"
train$PoolQC <- factor(train$PoolQC)
#Also transform the test set
test$PoolQC <- as.character(test$PoolQC)
test$PoolQC[which(is.na(test$PoolQC))] <- "NoPool"
test$PoolQC <- factor(test$PoolQC)
Based on the data description:
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
library(ggplot2)
ggplot(data = train, aes(x = Fence, y = log10(SalePrice), fill = Fence))+
geom_boxplot()
FenceFit <- lm(log10(train$SalePrice) ~ Fence, data = train)
summary(FenceFit)
##
## Call:
## lm(formula = log10(train$SalePrice) ~ Fence, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57500 -0.06747 -0.00520 0.05611 0.72551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.23634 0.01816 288.272 < 2e-16 ***
## FenceGdWo -0.11851 0.02628 -4.510 9.58e-06 ***
## FenceMnPrv -0.08969 0.02131 -4.210 3.46e-05 ***
## FenceMnWw -0.11305 0.04582 -2.467 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1395 on 277 degrees of freedom
## (1179 observations deleted due to missingness)
## Multiple R-squared: 0.08211, Adjusted R-squared: 0.07217
## F-statistic: 8.26 on 3 and 277 DF, p-value: 2.781e-05
Interestingly, this feature might have a complex relationship, if any at all, with the response variable. The effect of each level is deemed significant, so we will keep this feature at this point as well.
train$Fence <- as.character(train$Fence)
train$Fence[which(is.na(train$Fence))] <- "NoFence"
train$Fence <- factor(train$Fence)
#Also transform the test set
test$Fence <- as.character(test$Fence)
test$Fence[which(is.na(test$Fence))] <- "NoFence"
test$Fence <- factor(test$Fence)
Based on the data description:
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
We will deal with the remaining features that contain missing values in the later stages of the analysis.
table(train$MiscFeature)
##
## Gar2 Othr Shed TenC
## 2 2 49 1
ggplot(data = train, aes(x = MiscFeature, y = log10(SalePrice), fill = MiscFeature))+
geom_boxplot()
Again, very few houses have these features, and it is not clear whether they have a real impact on the house value once the other measured features are accounted for. Nevertheless, I will keep this feature as well, converting NAs to "NoMiscFeature" to be clearer:
train$MiscFeature <- as.character(train$MiscFeature)
train$MiscFeature[which(is.na(train$MiscFeature))] <- "NoMiscFeature"
train$MiscFeature <- factor(train$MiscFeature)
#Also transform the test set
test$MiscFeature <- as.character(test$MiscFeature)
test$MiscFeature[which(is.na(test$MiscFeature))] <- "NoMiscFeature"
test$MiscFeature <- factor(test$MiscFeature)
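A quick sanity check (sketch): none of the five recoded factors should have any NAs left.
# Count remaining NAs in the recoded factors (all should be zero)
recoded <- c("Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature")
sapply(train[, recoded], function(x) sum(is.na(x)))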
Here I will explore the data further, in order to develop a better understanding of the relationships between the features and home prices.
Let’s look at the continuous features in the data set:
cont.var <-NULL
for(i in 1:ncol(train)){
if(class(train[,i]) == "integer" | class(train[,i]) == "numeric"){
cont.var <- c(cont.var,i)
}
}
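The same selection can be written in one line; an equivalent sketch:
# Indices of integer/numeric columns (equivalent to the loop above)
cont.var.alt <- which(sapply(train, is.numeric))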
pairs(log10(train$SalePrice) ~ ., data = train[,cont.var[1:10]], cex = 0.05, pch = 19, col = "navy")
par(mfrow=c(3,2))
hist(train$LotArea, breaks = 50, col = "navy");hist(log10(train$LotArea+1),breaks=50,col = "navy")
hist(train$MasVnrArea, breaks = 50,col = "navy");hist(log10(train$MasVnrArea+1),breaks=50,col = "navy")
hist(train$BsmtFinSF1, breaks = 50,col = "navy");hist(log10(train$BsmtFinSF1+1),breaks=50,col = "navy")
Indeed, one of these features (LotArea) looks more Gaussian when it is log10 transformed.
pairs(log10(train$SalePrice) ~ log10(train$LotArea+1) + log10(train$MasVnrArea+1) + log10(train$BsmtFinSF1+1),cex = 0.1, pch = 19, col = "navy")
When transformed, LotArea seems to do a better job of explaining the variability in SalePrice. The other two variables, however, contain a substantial number of zero values, and it is questionable how much this transformation would help them.
Therefore, I will only transform LotArea at this stage and leave the other features as they are for the feature selection steps.
pairs(log10(train$SalePrice) ~ ., data = train[,cont.var[22:31]], cex = 0.05, pch = 19, col = "navy")
Arguably, these features are all related to the overall area of the house.
pairs(log10(train$SalePrice) ~ ., data = train[,cont.var[32:38]], cex = 0.05, pch = 19, col = "navy")
Processing the identified features:
# Remove ID variable
train.ID <- train$Id #Hold it in a new variable
train <- train[, -1]
library(dplyr)
train <- dplyr::mutate(train, log10.LotArea = log10(LotArea))
train <- dplyr::select(train, -LotArea)
# Also transform in the test set
test <- dplyr::mutate(test, log10.LotArea = log10(LotArea))
test <- dplyr::select(test, -LotArea)
Now let’s look at the categorical variables to see if certain ones are more powerful than others:
cat.var <-NULL
for(i in 1:ncol(train)){
if(class(train[,i]) == "factor"){
cat.var <- c(cat.var,i)
}
}
length(cat.var)
## [1] 43
We have 43 categorical variables, some of which have only a few levels. This is a challenge in many ways: it is not feasible to look at each of them individually. Therefore, we will first convert them to binary (dummy) variables and use regularization or decision tree-based approaches to select the most useful features.
# Write a function that gets a data set and converts all factor variables to dummy variables
convert.to.dummy <- function(data.set){
cat.var <-NULL
temp.data <- data.frame(1:nrow(data.set))
for(i in 1:ncol(data.set)){
if(class(data.set[,i]) == "factor"){
cat.var <- c(cat.var,i)
factor.levels <- levels(data.set[,i]) # Try to find a way to classify NA's as "NO" otherwise they generate problem downstream
# First check if there is any 'NA-level'
if(any(is.na(data.set[,i]))){
dummy.vector = ifelse(is.na(data.set[,i]),1,0)
dummy.vector <- data.frame(dummy.vector)
colnames(dummy.vector)[1] = paste("NO",names((data.set)[i]),sep = ".")
temp.data <- cbind(temp.data,dummy.vector)
}
for(j in seq_along(factor.levels)){ # Then deal with normal factor levels
dummy.vector = ifelse(data.set[,i] == factor.levels[j],1,0)
#Since we already dealt with NAs above
if(any(is.na(dummy.vector))){dummy.vector[is.na(dummy.vector)] <- 0}
dummy.vector <- data.frame(dummy.vector)
colnames(dummy.vector)[1] = paste(names((data.set)[i]),
factor.levels[j],sep = ".")
temp.data <- cbind(temp.data,dummy.vector)
}
}
}
#Remove the original categorical variables from data.set
data.set <- data.set[,-cat.var]
#Add the dummy.variable set
temp.data <- temp.data[,-1] # remove the unnecessary column
data.set <- cbind(data.set,temp.data)
return(data.set)
}
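As an aside, caret provides dummyVars() for one-hot encoding, although, unlike convert.to.dummy() above, it does not add the NO.<feature> indicator columns for missing factor values. A minimal sketch, for comparison only:
# One-hot encoding with caret::dummyVars (no NA-indicator columns are created)
library(caret)
dv <- dummyVars(~ ., data = train)
train.dummies <- data.frame(predict(dv, newdata = train))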
# Keep test.IDs aside
test.ID <- test$Id
# Process the training set
training.processed <- convert.to.dummy(train)
# Process the test set
test.processed <- convert.to.dummy(test)
# Note that not all factor levels are present in the test set, therefore we need to make the feature sets identical.
training.processed.SalePrice <- training.processed$SalePrice #Keep the original outcome variable
consensus.features1 <- which(colnames(training.processed) %in% colnames(test.processed))
training.processed <- training.processed[,consensus.features1] #Keep the same features as in the test set
consensus.features2 <- which(colnames(test.processed) %in% colnames(training.processed))
test.processed <- test.processed[,consensus.features2] #Keep the same features as in the training set
identical(colnames(test.processed), colnames(training.processed))
## [1] TRUE
# Same features in both sets!
training.processed <- data.frame(Log.SalePrice = log(training.processed.SalePrice),training.processed) # Add the Log of the response variable
test.processed <- data.frame(ID = test.ID,test.processed) # Add the ID variable
We still have missing values in some of our features, which will be a problem down the road.
apply(is.na(training.processed), 2, sum)[apply(is.na(training.processed), 2, sum) != 0]
## LotFrontage MasVnrArea GarageYrBlt
## 259 8 81
apply(is.na(test.processed), 2, sum)[apply(is.na(test.processed), 2, sum) != 0]
## LotFrontage MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 227 15 1 1 1
## TotalBsmtSF BsmtFullBath BsmtHalfBath GarageYrBlt GarageCars
## 1 2 2 78 1
## GarageArea
## 1
We have 81 missing GarageYrBlt values in the training set; these are the houses that do not actually have a garage.
plot(y =training.processed$Log.SalePrice, x =training.processed$GarageYrBlt, pch = 19, col = "navy", cex = 0.5)
It looks like this feature has some relationship with the outcome, as we would generally expect newer houses to be more expensive. Therefore, we should make an attempt to impute the missing values in this feature.
plot(training.processed$YearBuilt, training.processed$GarageYrBlt, pch = 19, col = "navy", cex = 0.5)
As expected, for most houses the year the garage was built is the same as the year the house was built. Therefore, for imputation purposes it is sensible to use the YearBuilt feature to fill the missing values in the GarageYrBlt feature. While this is not strictly consistent with the fact that these houses do not have a garage, we have already captured that information in the categorical features we engineered, so at this point imputing the years seems more appropriate than removing these data points altogether:
#Impute in the training set
training.processed$GarageYrBlt[is.na(training.processed$GarageYrBlt)] <- training.processed$YearBuilt[is.na(training.processed$GarageYrBlt)]
#Impute in the test set
test.processed$GarageYrBlt[is.na(test.processed$GarageYrBlt)] <- test.processed$YearBuilt[is.na(test.processed$GarageYrBlt)]
Let's look at the LotFrontage feature more closely:
plot(x = training.processed$LotFrontage, y = training.processed$Log.SalePrice, cex = 0.5, col = "navy" , pch = 19)
abline(lm(training.processed$Log.SalePrice ~ training.processed$LotFrontage))
Again, this feature seems to have some degree of relationship with the response. We suspect that it is also related to other features that describe the lot or the general area of the house:
cor(training.processed$LotFrontage,training.processed$log10.LotArea, use = "complete.obs")
## [1] 0.6540587
cor(training.processed$LotFrontage,training.processed$GarageArea, use = "complete.obs")
## [1] 0.3449967
cor(training.processed$LotFrontage,training.processed$MasVnrArea, use = "complete.obs")
## [1] 0.1934581
# Note that correlation gets better when log10 transformed:
cor(log10(training.processed$LotFrontage), training.processed$log10.LotArea, use = "complete.obs")
## [1] 0.7455501
summary(lm(training.processed$Log.SalePrice ~ log10(training.processed$LotFrontage) + training.processed$log10.LotArea ))
##
## Call:
## lm(formula = training.processed$Log.SalePrice ~ log10(training.processed$LotFrontage) +
## training.processed$log10.LotArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51402 -0.23961 -0.04072 0.25528 1.22485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.88631 0.20139 44.125 < 2e-16
## log10(training.processed$LotFrontage) 0.30471 0.10638 2.864 0.00425
## training.processed$log10.LotArea 0.65296 0.07572 8.623 < 2e-16
##
## (Intercept) ***
## log10(training.processed$LotFrontage) **
## training.processed$log10.LotArea ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3763 on 1198 degrees of freedom
## (259 observations deleted due to missingness)
## Multiple R-squared: 0.1833, Adjusted R-squared: 0.1819
## F-statistic: 134.4 on 2 and 1198 DF, p-value: < 2.2e-16
plot(y = log10(training.processed$LotFrontage), x = training.processed$log10.LotArea, cex = 0.5, col = "navy" , pch = 19)
abline(lm(log10(training.processed$LotFrontage) ~ training.processed$log10.LotArea))
It looks like this feature is largely collinear with the LotArea feature, especially when log10 transformed. Therefore, we will remove it from both the training and test data sets.
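Before dropping it, it is worth noting an alternative one could try (a sketch only, not used in this analysis; lf.fit, miss and lf.imputed are hypothetical names): impute the missing LotFrontage values from log10.LotArea using the linear fit above.
# Alternative sketch: impute LotFrontage from log10.LotArea instead of dropping it
lf.fit <- lm(log10(LotFrontage) ~ log10.LotArea, data = training.processed)
miss <- is.na(training.processed$LotFrontage)
lf.imputed <- training.processed$LotFrontage
lf.imputed[miss] <- 10^predict(lf.fit, newdata = training.processed[miss, ])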
training.processed <- dplyr::select(training.processed, -LotFrontage)
test.processed <- dplyr::select(test.processed, -LotFrontage)
Looking more closely at the MasVnrArea feature:
plot(x = training.processed$MasVnrArea, y = training.processed$Log.SalePrice,cex = 0.5, col = "navy" , pch = 19)
Note that a large number of these values are zero, which probably reflects the houses with no masonry veneer. What happens if we set these aside and look at the rest of the values?
plot(x = log10(training.processed$MasVnrArea[training.processed$MasVnrArea != 0]), y = training.processed$Log.SalePrice[training.processed$MasVnrArea != 0],cex = 0.5, col = "navy" , pch = 19)
There might be 'some' relationship with the response, though not a strong one. It is also a little confusing, since MasVnrType has only 8 missing values while MasVnrArea has a large number of zeros; let's see how the MasVnrArea == 0 cases map onto the MasVnrType levels:
qplot(data = training.processed,x = log10(training.processed$MasVnrArea), y = training.processed$Log.SalePrice, col = train$MasVnrType)
It is clearer now: the "None" group indeed corresponds to MasVnrArea = 0, whereas the NAs appear as missing values in both variables.
ggplot(data = training.processed,aes(x = train$MasVnrType, y = training.processed$Log.SalePrice, fill = train$MasVnrType))+geom_boxplot()
We note that the sale price of the missing cases in the training set is closest to that of the houses with the Stone masonry veneer type. It is likely that the veneer area in some of these houses was simply not measured, perhaps because it is more difficult to do so. This is our assumption.
Based on this assumption, we will merge the NA cases with the Stone group and remove the MasVnrArea feature from both the training and test data sets:
training.processed$MasVnrType.Stone[training.processed$NO.MasVnrType == 1] <- 1 # merge the NA cases into the Stone group, as stated above
training.processed <- dplyr::select(training.processed, -NO.MasVnrType, -MasVnrArea)
test.processed$MasVnrType.Stone[test.processed$NO.MasVnrType == 1] <- 1 # merge the NA cases into the Stone group, as stated above
test.processed <- dplyr::select(test.processed, -NO.MasVnrType, -MasVnrArea)
We still need to deal with the few missing data points in the test set. Since we need to make a SalePrice prediction for each of these cases, we have to decide how to impute these values at this point:
apply(is.na(test.processed), 2, sum)[apply(is.na(test.processed), 2, sum) != 0]
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath
## 1 1 1 1 2
## BsmtHalfBath GarageCars GarageArea
## 2 1 1
Some features regarding Basement and Garage have missing values:
test.processed[which(apply(is.na(test.processed), 1, sum) != 0),which(apply(is.na(test.processed), 2, sum) != 0 )]
Fortunately, the missing values are restricted to only 3 house records. It is likely that these unusual houses lack either a basement or a garage. Since all of these features are numeric, we will simply use zero instead of NA for these values. This is, of course, based on our assumption.
for (i in 1:ncol(test.processed)) {
if(any(is.na(test.processed[,i]))){
test.processed[is.na(test.processed[,i]),i] <- 0
}
}
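Since every remaining column of test.processed is numeric at this point, the same imputation can also be written in one line; an equivalent sketch:
# Replace every remaining NA in the (all-numeric) test set with 0
test.processed[is.na(test.processed)] <- 0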
This completes missing value imputation, as no missing values are left in either the training or the test data set.
It would be interesting to perform linear model selection and see how well it performs on the new training set we obtained.
library(glmnet)
# We create a matrix of predictors
x.training = model.matrix(Log.SalePrice ~ . -1, data = training.processed)
x.testing = model.matrix( ~ . -1, data = test.processed)
y = training.processed$Log.SalePrice
# Fit lasso regression to training data
fit.lasso <- glmnet(x.training,y, family = "gaussian")
plot(fit.lasso, xvar = "lambda", label = TRUE)
plot(fit.lasso, xvar = "dev", label = TRUE)
Lasso is a nice way of reducing the number of features while still explaining the variance, and therefore of managing the bias-variance trade-off. In this case we notice that only 3 features explain 60% of the deviance, while 13 features explain 80% of it. It is more practical to use the restricted model, with its few features and their shrunken coefficients.
It is best to choose the optimal lambda using cross-validation, with the aim of minimizing the mean squared error:
#cv.lasso will be the list containing the coefficients and information of our optimal model fit using cross-validation (10 fold by default)
set.seed(123)
cv.lasso <- cv.glmnet(x.training,y, family = "gaussian")
plot(cv.lasso)
Notice that what we are doing here is computationally intensive.
Two vertical lines are drawn in the plot: the first marks the model with the minimum cross-validated error, and the second marks a slightly more restricted model within 1 standard error of that minimum.
When choosing the optimal model, we may prefer the more restricted model indicated by the second vertical line, which is the default in the glmnet package, following the 1-standard-error convention.
The final (optimal) model coefficients obtained by cross-validation; features printed with a '.' have been shrunk to zero, while the rest remain non-zero:
head(coef(cv.lasso))
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.436609175
## MSSubClass .
## OverallQual 0.096744689
## OverallCond 0.004338550
## YearBuilt 0.001045407
## YearRemodAdd 0.001458188
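To see how many features were actually selected, rather than just the first few coefficients, one can count the non-zero entries at the 1-SE lambda; a minimal sketch:
# Number of features with non-zero coefficients at lambda.1se (intercept excluded)
lasso.coefs <- coef(cv.lasso, s = "lambda.1se")
sum(lasso.coefs[-1, 1] != 0)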
Note that our final model includes ~30 features with non-zero coefficients. We have therefore significantly reduced the size of the data set by selecting features through lasso regression. The resulting model explains more than 80% of the deviance in the outcome, which is quite good for the training set.
We also note that some of the dummy variables remained in the model, while the majority of them were removed. It seems to be a good idea to keep all levels as separate dummy variables until regularization, since we don't know in advance which ones may be important in the final model.
Let’s first start with the training set:
pred <- predict(cv.lasso, newx = x.training)
#Let's write a function to streamline our analyses of the various models we will train:
log.rmse.training <- function(pred,observed,method){
#pred: vector of predicted values (in log scale)
#observed: vector of observed values (in log scale)
#method: the method name used to build the predictive model
#Calculating the root mean squared error:
rmse.training = sqrt(mean((pred - observed)^2))
#Calculating the Pearson correlation:
cor.training = cor(training.processed$Log.SalePrice,pred)
plot(x = pred, y = observed, cex = 0.5, col = "navy", pch = 19,
main = method,xlab = "Log(Predicted)", ylab = "Log(Observed)")
text(x = max(pred), y = min(observed)+0.4,
labels = paste0("RMSE: ",rmse.training))
text(x = max(pred), y = min(observed)+0.2,
labels = paste0("Pr.cor: ",cor.training))
}
log.rmse.training(pred = pred, observed = training.processed$Log.SalePrice, method = "Lasso")
Therefore, the RMSE we obtained on the training set is ~0.15 (on the log scale).
Next, we will use the test set to estimate house prices with our model and prepare our predictions for submission:
# These steps are needed otherwise predict() will complain!
x.testing <- data.frame(x.testing) # Convert test matrix to data frame
x.testing$Log.SalePrice <- NULL # Set the response to NULL (otherwise model object will keep looking at it and can't find one from the test set, throws error)
# Finally retain only the features that exists in the training data matrix
x.testing <- x.testing[,which(colnames(x.testing) %in% colnames(x.training))]
pred.lasso <- predict(cv.lasso, newx = data.matrix(x.testing))
# Note that these are predictions in log scale, we need to exponentiate them to get actual SalePrice predictions in the right scale;
pred.lasso <- exp(pred.lasso)
We can write a function to streamline our submissions:
prepare.submission <- function(id,pred,method){
#pred: vector of predicted values from the test set (in linear scale)
#id: vector of indices in the test set
#method: the method name used to build the predictive model
prediction.table <- data.frame(Id = id,SalePrice = pred)
colnames(prediction.table)[2] <- "SalePrice"
write.csv(prediction.table,paste0(method,"_predictions",".csv"), row.names = F)
}
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.lasso, method = "Lasso")
The lasso model gave us a log RMSE of 0.15597, which is very close to the error we observed in the training set.
library(caret)
#First get a PCA processing object
prePCA <- preProcess(training.processed[,-1],method = "pca")
#Generate the principal components from the training set
PCAtraining <- predict(prePCA,newdata = training.processed)
dim(PCAtraining)
## [1] 1460 171
# Note that we reduced the data set to 170 principal components (plus the Log.SalePrice column)
qplot(x = PC1, y = Log.SalePrice, data = PCAtraining, alpha = I(0.3))+theme_bw()
qplot(x = PC2, y = Log.SalePrice, data = PCAtraining, alpha = I(0.3))+theme_bw()
qplot(x = PC3, y = Log.SalePrice, data = PCAtraining, alpha = I(0.3))+theme_bw()
qplot(x = PC4, y= Log.SalePrice, data = PCAtraining, alpha = I(0.3))+theme_bw()
Now let's see if we can further reduce the dimensions by looking at how much variation we can cumulatively explain with the principal components:
var.vector <- apply(PCAtraining[-1], 2, var)
var.cumulative.percent <- NULL
for(i in 1:length(var.vector)){
var.cumulative.percent[i] <- sum(var.vector[1:i])/sum(var.vector[1:length(var.vector)])*100
}
plot(y = var.cumulative.percent, x = 1:length(var.cumulative.percent), pch= 19, cex = 0.5,
col = "navy", ylab = "% variance explained", xlab = "# of PCs")
It looks like roughly the first 100 PCs are able to explain 80% of the variance in the entire data set. This is still quite a large number of dimensions.
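To pin down the exact count rather than reading it off the plot, the cumulative variance can also be computed in a vectorized way; an equivalent sketch:
# Cumulative % variance and the number of PCs needed to reach 80%
var.cum <- cumsum(var.vector) / sum(var.vector) * 100
min(which(var.cum >= 80))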
Let's look at the RMSE estimated by cross-validation on the training set when we train regression models using these features.
We will use the pcr function in the pls library to perform Principal Components Regression:
library(pls)
set.seed(2)
pcr.fit <- pcr(Log.SalePrice ~ . , data = training.processed, scale = FALSE, validation = "CV", ncomp = 170)
The summary function is used to inspect the resulting pcr.fit object:
summary(pcr.fit)
## Data: X dimension: 1460 282
## Y dimension: 1460 1
## Fit method: svdpc
## Number of components considered: 170
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.3996 0.2523 0.2536 0.2538 0.2552 0.2553 0.2460
## adjCV 0.3996 0.2517 0.2531 0.2529 0.2546 0.2546 0.2464
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.2346 0.2328 0.2290 0.2287 0.2286 0.2286 0.2284
## adjCV 0.2340 0.2322 0.2284 0.2281 0.2279 0.2280 0.2281
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 0.2319 0.2219 0.2008 0.1982 0.1958 0.1958
## adjCV 0.2310 0.2219 0.1999 0.1973 0.1949 0.1950
## 20 comps 21 comps 22 comps 23 comps 24 comps 25 comps
## CV 0.1955 0.1877 0.1768 0.1628 0.1624 0.1617
## adjCV 0.1946 0.1878 0.1759 0.1619 0.1615 0.1608
## 26 comps 27 comps 28 comps 29 comps 30 comps 31 comps
## CV 0.1629 0.1631 0.1633 0.1636 0.1633 0.1633
## adjCV 0.1620 0.1622 0.1623 0.1626 0.1623 0.1623
## 32 comps 33 comps 34 comps 35 comps 36 comps 37 comps
## CV 0.1632 0.1632 0.1633 0.1622 0.1621 0.1619
## adjCV 0.1622 0.1622 0.1625 0.1612 0.1611 0.1609
## 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## CV 0.1617 0.1616 0.1617 0.1615 0.1619 0.1617
## adjCV 0.1609 0.1605 0.1606 0.1605 0.1610 0.1607
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps
## CV 0.1618 0.1613 0.1615 0.1618 0.1619 0.162
## adjCV 0.1608 0.1602 0.1604 0.1607 0.1609 0.161
## 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 0.1617 0.1607 0.1604 0.1604 0.1603 0.1602
## adjCV 0.1607 0.1595 0.1592 0.1592 0.1590 0.1590
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps
## CV 0.1602 0.1603 0.1608 0.1606 0.1606 0.1604
## adjCV 0.1590 0.1591 0.1597 0.1594 0.1594 0.1593
## 62 comps 63 comps 64 comps 65 comps 66 comps 67 comps
## CV 0.1598 0.1589 0.1591 0.1594 0.1583 0.1583
## adjCV 0.1586 0.1576 0.1582 0.1580 0.1575 0.1568
## 68 comps 69 comps 70 comps 71 comps 72 comps 73 comps
## CV 0.1582 0.1582 0.1569 0.1560 0.1550 0.1544
## adjCV 0.1575 0.1560 0.1552 0.1544 0.1536 0.1531
## 74 comps 75 comps 76 comps 77 comps 78 comps 79 comps
## CV 0.1545 0.1542 0.1542 0.1543 0.1542 0.1544
## adjCV 0.1530 0.1529 0.1528 0.1529 0.1529 0.1533
## 80 comps 81 comps 82 comps 83 comps 84 comps 85 comps
## CV 0.1545 0.1540 0.1541 0.1540 0.1540 0.1540
## adjCV 0.1528 0.1526 0.1528 0.1527 0.1525 0.1526
## 86 comps 87 comps 88 comps 89 comps 90 comps 91 comps
## CV 0.1542 0.1542 0.1542 0.1541 0.1543 0.1537
## adjCV 0.1530 0.1526 0.1527 0.1527 0.1529 0.1519
## 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps
## CV 0.1539 0.1534 0.1537 0.1541 0.1541 0.1541
## adjCV 0.1522 0.1519 0.1521 0.1525 0.1526 0.1525
## 98 comps 99 comps 100 comps 101 comps 102 comps 103 comps
## CV 0.1542 0.1546 0.1549 0.1550 0.1551 0.1552
## adjCV 0.1526 0.1530 0.1533 0.1534 0.1536 0.1537
## 104 comps 105 comps 106 comps 107 comps 108 comps 109 comps
## CV 0.1551 0.1554 0.1553 0.1554 0.1554 0.1552
## adjCV 0.1535 0.1537 0.1536 0.1536 0.1537 0.1535
## 110 comps 111 comps 112 comps 113 comps 114 comps 115 comps
## CV 0.1556 0.1554 0.1556 0.1559 0.1560 0.1561
## adjCV 0.1540 0.1537 0.1539 0.1541 0.1544 0.1544
## 116 comps 117 comps 118 comps 119 comps 120 comps 121 comps
## CV 0.1561 0.1560 0.1557 0.1557 0.1559 0.1553
## adjCV 0.1546 0.1543 0.1538 0.1539 0.1545 0.1536
## 122 comps 123 comps 124 comps 125 comps 126 comps 127 comps
## CV 0.1548 0.1548 0.1550 0.1548 0.1545 0.1542
## adjCV 0.1533 0.1529 0.1531 0.1528 0.1526 0.1524
## 128 comps 129 comps 130 comps 131 comps 132 comps 133 comps
## CV 0.1542 0.1541 0.1543 0.1543 0.1546 0.1544
## adjCV 0.1524 0.1522 0.1524 0.1523 0.1526 0.1525
## 134 comps 135 comps 136 comps 137 comps 138 comps 139 comps
## CV 0.1546 0.1548 0.1548 0.1544 0.1546 0.1541
## adjCV 0.1527 0.1529 0.1529 0.1527 0.1528 0.1524
## 140 comps 141 comps 142 comps 143 comps 144 comps 145 comps
## CV 0.1540 0.1540 0.1538 0.1529 0.1528 0.1524
## adjCV 0.1519 0.1521 0.1520 0.1509 0.1508 0.1502
## 146 comps 147 comps 148 comps 149 comps 150 comps 151 comps
## CV 0.1524 0.1524 0.1525 0.1525 0.1523 0.1525
## adjCV 0.1504 0.1503 0.1503 0.1503 0.1502 0.1505
## 152 comps 153 comps 154 comps 155 comps 156 comps 157 comps
## CV 0.1524 0.1529 0.1531 0.1534 0.1536 0.1536
## adjCV 0.1504 0.1509 0.1510 0.1513 0.1515 0.1515
## 158 comps 159 comps 160 comps 161 comps 162 comps 163 comps
## CV 0.1536 0.1536 0.1537 0.1539 0.1543 0.1546
## adjCV 0.1515 0.1515 0.1516 0.1518 0.1522 0.1525
## 164 comps 165 comps 166 comps 167 comps 168 comps 169 comps
## CV 0.1547 0.1548 0.1551 0.1549 0.1548 0.1548
## adjCV 0.1525 0.1527 0.1529 0.1528 0.1527 0.1527
## 170 comps
## CV 0.1546
## adjCV 0.1525
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 34.94 57.01 75.11 90.74 94.00 95.98
## Log.SalePrice 61.45 61.79 62.47 62.50 62.81 64.35
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 97.9 98.81 99.07 99.30 99.49 99.68
## Log.SalePrice 68.6 69.13 70.18 70.44 70.53 70.57
## 13 comps 14 comps 15 comps 16 comps 17 comps 18 comps
## X 99.77 99.87 99.93 99.98 99.99 100.00
## Log.SalePrice 70.60 70.89 73.14 78.38 78.97 79.52
## 19 comps 20 comps 21 comps 22 comps 23 comps 24 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 79.56 79.64 81.66 83.81 86.42 86.51
## 25 comps 26 comps 27 comps 28 comps 29 comps 30 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 86.58 86.61 86.61 86.63 86.63 86.67
## 31 comps 32 comps 33 comps 34 comps 35 comps 36 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 86.67 86.69 86.74 86.74 86.96 87.04
## 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps
## X 100.00 100.00 100.00 100.00 100.00 100.0
## Log.SalePrice 87.07 87.07 87.15 87.18 87.19 87.2
## 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 87.29 87.29 87.41 87.41 87.42 87.42
## 49 comps 50 comps 51 comps 52 comps 53 comps 54 comps
## X 100.00 100.00 100.0 100.00 100.00 100.00
## Log.SalePrice 87.44 87.55 87.8 87.82 87.84 87.92
## 55 comps 56 comps 57 comps 58 comps 59 comps 60 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 87.92 87.92 87.93 87.93 87.96 87.97
## 61 comps 62 comps 63 comps 64 comps 65 comps 66 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 88.01 88.12 88.23 88.23 88.47 88.48
## 67 comps 68 comps 69 comps 70 comps 71 comps 72 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 88.68 88.68 89.13 89.13 89.19 89.19
## 73 comps 74 comps 75 comps 76 comps 77 comps 78 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 89.21 89.26 89.27 89.31 89.33 89.38
## 79 comps 80 comps 81 comps 82 comps 83 comps 84 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 89.39 89.63 89.64 89.64 89.64 89.69
## 85 comps 86 comps 87 comps 88 comps 89 comps 90 comps
## X 100.0 100.0 100.0 100.0 100.00 100.00
## Log.SalePrice 89.7 89.7 89.8 89.8 89.82 89.84
## 91 comps 92 comps 93 comps 94 comps 95 comps 96 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 89.96 89.97 89.98 90.03 90.03 90.03
## 97 comps 98 comps 99 comps 100 comps 101 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.04 90.05 90.05 90.05 90.05
## 102 comps 103 comps 104 comps 105 comps 106 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.06 90.06 90.07 90.11 90.14
## 107 comps 108 comps 109 comps 110 comps 111 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.18 90.18 90.18 90.18 90.22
## 112 comps 113 comps 114 comps 115 comps 116 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.23 90.26 90.27 90.28 90.29
## 117 comps 118 comps 119 comps 120 comps 121 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.37 90.41 90.41 90.42 90.56
## 122 comps 123 comps 124 comps 125 comps 126 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.59 90.75 90.76 90.82 90.83
## 127 comps 128 comps 129 comps 130 comps 131 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.86 90.88 90.91 90.92 90.96
## 132 comps 133 comps 134 comps 135 comps 136 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.97 90.97 90.97 90.98 90.98
## 137 comps 138 comps 139 comps 140 comps 141 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 90.99 91.07 91.12 91.21 91.22
## 142 comps 143 comps 144 comps 145 comps 146 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 91.22 91.34 91.35 91.43 91.43
## 147 comps 148 comps 149 comps 150 comps 151 comps
## X 100.00 100.0 100.0 100.0 100.0
## Log.SalePrice 91.45 91.5 91.5 91.5 91.5
## 152 comps 153 comps 154 comps 155 comps 156 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 91.53 91.53 91.53 91.54 91.55
## 157 comps 158 comps 159 comps 160 comps 161 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 91.55 91.56 91.56 91.57 91.57
## 162 comps 163 comps 164 comps 165 comps 166 comps
## X 100.00 100.00 100.00 100.00 100.00
## Log.SalePrice 91.57 91.57 91.59 91.59 91.61
## 167 comps 168 comps 169 comps 170 comps
## X 100.00 100.00 100.00 100.00
## Log.SalePrice 91.61 91.63 91.65 91.66
This output pretty much explains what is going on when we perform PCR: the first section of the summary table gives the cross-validation estimated error, and the second gives the percentage of variance explained.
We can use the validationplot() function to plot the cross-validation estimated RMSE (or MSE) scores:
validationplot(pcr.fit,val.type = "RMSEP")
We note that after ~25 PCs the RMSE stabilizes around 0.16, and adding more components beyond that has minimal impact. This cross-validated RMSE is not better than what we obtained from lasso regression.
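The number of components can also be picked programmatically from the cross-validated RMSEP values rather than by eye; a sketch:
# Literal minimizer of the CV RMSEP (the curve is nearly flat beyond ~25 components)
cv.rmsep <- RMSEP(pcr.fit)$val["CV", 1, ] # first entry is the intercept-only model
which.min(cv.rmsep) - 1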
Let's train a random forest (RF) to see if we can get a better RMSE:
set.seed(123)
RF <- train(Log.SalePrice ~ ., data = training.processed,method = "rf", trControl = trainControl(method = "cv", number = 10))
RF
## Random Forest
##
## 1460 samples
## 282 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.2176368 0.8123522
## 142 0.1383240 0.8833084
## 282 0.1431040 0.8733710
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 142.
It looks like we get a major improvement in the training RMSE using the RF algorithm:
log.rmse.training(pred = predict(RF,training.processed),observed = training.processed$Log.SalePrice,method = " Random Forest")
imp <- data.frame(RF$finalModel$importance)
imp$Feature <- row.names(imp)
imp <- dplyr::arrange(imp, dplyr::desc(IncNodePurity)) # sort by decreasing importance
head(imp, 10) # the 10 most important features
pred.rf <- exp(predict(RF, newdata = test.processed)) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.rf, method = "RandomForest")
RF predictions gave us an RMSE of 0.14117 on the test data set, which is better than lasso but the same as the SVM prediction below.
set.seed(123)
GBM <- train(Log.SalePrice ~ ., data = training.processed,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
GBM
## Stochastic Gradient Boosting
##
## 1460 samples
## 282 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 0.1747361 0.8338095
## 1 100 0.1474842 0.8670602
## 1 150 0.1392099 0.8786576
## 2 50 0.1519649 0.8622421
## 2 100 0.1373730 0.8813529
## 2 150 0.1342455 0.8865987
## 3 50 0.1422702 0.8754971
## 3 100 0.1336038 0.8876556
## 3 150 0.1326268 0.8893477
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Gradient boosting provided a model with a further improvement in RMSE over lasso.
log.rmse.training(pred = predict(GBM,training.processed),observed = training.processed$Log.SalePrice,method = " Gradient Boosting")
dotPlot(varImp(GBM))
It is interesting to note that gradient boosting found the OverallQual feature to be the most important, as we noticed previously. It is also worth noting that the continuous features appear to be the more important predictors in this regression.
pred.gbm <- exp(predict(GBM, newdata = test.processed)) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.gbm, method = "GBM")
The GBM prediction gave us an RMSE of 0.12764 on the test data set, which is better (lower) than lasso, RF or SVM.
set.seed(123)
xGB <- train(Log.SalePrice ~ ., data = training.processed,method = "xgbLinear", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
xGB
## eXtreme Gradient Boosting
##
## 1460 samples
## 282 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## lambda alpha nrounds RMSE Rsquared
## 0e+00 0e+00 50 0.1412797 0.8746213
## 0e+00 0e+00 100 0.1413552 0.8746935
## 0e+00 0e+00 150 0.1413554 0.8747918
## 0e+00 1e-04 50 0.1417705 0.8742478
## 0e+00 1e-04 100 0.1414839 0.8748611
## 0e+00 1e-04 150 0.1415823 0.8747320
## 0e+00 1e-01 50 0.1395686 0.8770151
## 0e+00 1e-01 100 0.1397304 0.8769822
## 0e+00 1e-01 150 0.1397366 0.8769945
## 1e-04 0e+00 50 0.1414864 0.8742877
## 1e-04 0e+00 100 0.1411982 0.8748763
## 1e-04 0e+00 150 0.1413166 0.8746859
## 1e-04 1e-04 50 0.1434487 0.8708219
## 1e-04 1e-04 100 0.1433909 0.8709634
## 1e-04 1e-04 150 0.1433326 0.8710759
## 1e-04 1e-01 50 0.1411844 0.8749075
## 1e-04 1e-01 100 0.1413519 0.8748503
## 1e-04 1e-01 150 0.1413538 0.8748580
## 1e-01 0e+00 50 0.1430947 0.8711727
## 1e-01 0e+00 100 0.1429144 0.8717897
## 1e-01 0e+00 150 0.1429291 0.8718746
## 1e-01 1e-04 50 0.1426570 0.8726593
## 1e-01 1e-04 100 0.1426381 0.8729438
## 1e-01 1e-04 150 0.1425038 0.8732002
## 1e-01 1e-01 50 0.1399740 0.8766468
## 1e-01 1e-01 100 0.1399758 0.8765913
## 1e-01 1e-01 150 0.1399521 0.8766161
##
## Tuning parameter 'eta' was held constant at a value of 0.3
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 50, lambda = 0, alpha
## = 0.1 and eta = 0.3.
Extreme gradient boosting provided a model with a training RMSE of 0.029, which is quite good!
log.rmse.training(pred = predict(xGB,training.processed),observed = training.processed$Log.SalePrice,method = "Extreme Gradient Boosting")
### Evaluate the extreme gradient boosting (xGB) RMSE in the test data set
pred.xgb <- exp(predict(xGB, newdata = test.processed)) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.xgb, method = "xGB")
Despite the remarkable performance on the training set, the xGB model gave us an RMSE of 0.14634 on the test set. It seems that xGB was over-fitting to the training set.
set.seed(123)
SVM <- train(Log.SalePrice ~ ., data = training.processed,method = "svmRadial", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
SVM
## Support Vector Machines with Radial Basis Function Kernel
##
## 1460 samples
## 282 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 0.3120216 0.3282845
## 0.50 0.3074168 0.3364527
## 1.00 0.3044189 0.3418351
##
## Tuning parameter 'sigma' was held constant at a value of 0.002617709
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.002617709 and C = 1.
log.rmse.training(pred = predict(SVM,training.processed),observed = training.processed$Log.SalePrice,method = " Support Vector Machines")
Support vector machines also provided some improvement beyond lasso.
pred.svm <- exp(predict(RF, newdata = test.processed)) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.svm, method = "SVM")
The SVM prediction gave us an RMSE of 0.14117 on the test data set, which is better than lasso.
Next, let's see what happens when we train the machine learning algorithms using the principal components of the training set:
set.seed(123)
GBM.PCA <- train(Log.SalePrice ~ ., data = PCAtraining,method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
GBM.PCA
## Stochastic Gradient Boosting
##
## 1460 samples
## 170 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 0.1854606 0.8079470
## 1 100 0.1647943 0.8370229
## 1 150 0.1566702 0.8495593
## 2 50 0.1659383 0.8359099
## 2 100 0.1541557 0.8536853
## 2 150 0.1496547 0.8609910
## 3 50 0.1596203 0.8463063
## 3 100 0.1482250 0.8648430
## 3 150 0.1465758 0.8668248
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Using principal components instead of the original features did not improve the cross-validated RMSE for gradient boosting (0.147, versus 0.133 with the original features).
log.rmse.training(pred = predict(GBM.PCA,PCAtraining),observed = training.processed$Log.SalePrice,method = " Gradient Boosting using PCs")
#Prepare principal components of the test set
library(caret)
#First get a PCA processing object
prePCA.test <- preProcess(test.processed[,-1],method = "pca")
#Generate the principal components from the training set
PCAtest <- predict(prePCA.test,newdata = test.processed)
pred.GBM.pca <- exp(predict(GBM.PCA, newdata = PCAtest)) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.GBM.pca, method = "GBM_pca")
This approach yields an RMSE of 0.2 on the test data set, which is not as good as the GBM trained on the original features.
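One possible contributor to this drop in performance is that prePCA.test above is estimated on the test set itself, so its principal components are not guaranteed to align with the components GBM.PCA was trained on. The more common approach is to apply the training-set preProcess object to the test features; a minimal sketch, assuming the predictor columns of test.processed (after dropping the ID) match those used to build prePCA:
# Sketch: apply the PCA rotation learned from the training set to the test features
PCAtest.from.train <- predict(prePCA, newdata = test.processed[, -1])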
set.seed(123)
SVM.PCA <- train(Log.SalePrice ~ ., data = PCAtraining,method = "svmRadial", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
SVM.PCA
## Support Vector Machines with Radial Basis Function Kernel
##
## 1460 samples
## 170 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 0.2394946 0.7080861
## 0.50 0.2104113 0.7539796
## 1.00 0.1951820 0.7771754
##
## Tuning parameter 'sigma' was held constant at a value of 0.004405027
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.004405027 and C = 1.
In the case of SVM, using principal components improves the cross-validated RMSE considerably (0.195, versus 0.304 on the original features), although it is still behind the GBM models:
log.rmse.training(pred = predict(SVM.PCA,PCAtraining),observed = training.processed$Log.SalePrice,method = " Support Vector Machines using PCs")
Let's also train the RF algorithm on the principal components to see if we can get a better RMSE:
set.seed(123)
RF.PCA <- train(Log.SalePrice ~ ., data = PCAtraining,method = "rf", trControl = trainControl(method = "cv", number = 10))
RF.PCA
## Random Forest
##
## 1460 samples
## 170 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.3071923 0.6300562
## 86 0.1576456 0.8490237
## 170 0.1574013 0.8451992
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 170.
In the case of RF, using principal components does not improve the cross-validated RMSE either (0.157, versus 0.138 with the original features):
log.rmse.training(pred = predict(RF.PCA,PCAtraining),observed = training.processed$Log.SalePrice,method = "Random Forest using PCs")
So far, the best predictions we were able to make were with the GBM model. Next, we ask whether we can make this model more flexible in order to reduce the RMSE. Perhaps we had far more features than we needed and were over-fitting to the data.
Let’s inspect the variable importance for this model once again:
dotPlot(varImp(GBM))
It looks like variable importance has already dropped off sharply by the time we reach the top 20 features. Now the question is: what if we use only the top few features to train a more flexible model?
#Choose top10 features to train the model again
library(dplyr)
#It is hard to access the varImp elements
# This trick helps:
col_index <- varImp(GBM)$importance %>%
mutate(names=row.names(.)) %>%
arrange(-Overall)
# The feature name vector including the response variable to fetch these features from the data set:
top10.features = c("Log.SalePrice",col_index$names[1:10])
#Train the GBM model using only these features
set.seed(123)
GBM.top10 <- train(Log.SalePrice ~ ., data = training.processed[,top10.features],method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
GBM.top10
## Stochastic Gradient Boosting
##
## 1460 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1313, 1314, 1314, 1314, 1315, 1314, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 0.1763772 0.8307680
## 1 100 0.1490263 0.8639535
## 1 150 0.1432947 0.8715325
## 2 50 0.1537599 0.8585428
## 2 100 0.1408027 0.8749777
## 2 150 0.1396576 0.8767283
## 3 50 0.1454400 0.8698230
## 3 100 0.1404652 0.8762992
## 3 150 0.1394657 0.8776850
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
This gave us a model with a slightly higher cross-validated RMSE than the full GBM:
log.rmse.training(pred = predict(GBM.top10,training.processed[,top10.features]),observed = training.processed$Log.SalePrice,method = "GBM.top10")
pred.GBM.top10 <- exp(predict(GBM.top10, newdata = test.processed[,top10.features[-1]])) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.GBM.top10 , method = "GBM_top10")
This gave us a test RMSE of 0.14176, which is not as good as the original model.
#Only perform once
rmse.table <- data.frame(Number_of_features = 1:282, rmse = 1:282)
for(i in 1:282){ # one model per feature ranking (col_index has 282 features)
set.seed(123)
#Add a less important feature, one at a time, and fit new models
features <- c("Log.SalePrice",col_index$names[1:i])
GBM.temp <- train(Log.SalePrice ~ ., data = training.processed[,features],method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
pred.temp <- predict(GBM.temp,training.processed[,features])
rmse.temp <-sqrt(mean((pred.temp - training.processed$Log.SalePrice)^2))
rmse.table$Number_of_features[i] = i
rmse.table$rmse[i] = rmse.temp
}
saveRDS(rmse.table, file = "rmse.table.rds")
#Find the model with minimum RMSE
rmse.table <- readRDS("rmse.table.rds")
rmse.table[which(rmse.table$rmse == min(rmse.table$rmse)),]
plot(y = rmse.table$rmse, x = rmse.table$Number_of_features, pch = 19, cex = 0.6, col = "navy",
xlab = "Number of features in the GBM", ylab = "RMSE", main = "Scanning restricted GBM models for reduced RMSE")
Note that the minimum RMSE is obtained with the model that uses the 79 most important features; including additional features beyond these does not reduce the RMSE further. Let's train and use the model formed by the top 79 most important features:
top79.features = c("Log.SalePrice",col_index$names[1:79])
#Train the GBM model using only these features
set.seed(123)
GBM.top79 <- train(Log.SalePrice ~ ., data = training.processed[,top79.features],method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
pred.GBM.top79 <- exp(predict(GBM.top79, newdata = test.processed[,top79.features[-1]])) #Note we exponentiate the predicted values to get back into linear scale
Finally, prepare our submission:
prepare.submission(id = test.ID, pred = pred.GBM.top79 , method = "GBM_top79")
This approach, using the top 79 most important predictors, reduced the test RMSE to 0.12717. So the model does indeed seem to become more flexible (less over-fit) with fewer features.
We have obtained a fairly good model for predicting the sale prices in the Ames housing data set. An RMSE of 0.12717 is not trivial, but it can perhaps be improved further by using model ensembling as well as additional feature engineering. I plan to revisit this problem in the near future to apply and test more sophisticated approaches in order to reduce the RMSE on the test data set.