Neural Network Optimization Using Dropout Layers

Let's develop some intuition for tuning a given neural network using dropout layers, with a binary classification problem as the case study. The process is largely empirical; however, a working understanding of overfitting and regularization is needed to fully embrace the concept.

In [118]:
# Load the training set: 100 engineered features, about 800,000 training samples, saved as a sparse matrix
import pickle
with open("X_train_balanced_trans_100.pkl","rb") as f:
    X_train_balanced_trans_100 = pickle.load(f) 

with open("y_train_balanced.pkl", "rb") as f:
    y_train_balanced = pickle.load(f) 
    

Benchmark Neural Networks

Start with a 'medium-complexity' model to develop expectations:

In [120]:
from keras import models, metrics, layers
# Note that csr type of sparse matrix runs significantly faster in keras neural network implementation

network1 = models.Sequential()
network1.add(layers.Dense(32,activation="relu", input_shape = (X_train_balanced_trans_100.shape[1],)))
network1.add(layers.Dense(32,activation="relu"))
network1.add(layers.Dense(32,activation="relu"))
network1.add(layers.Dense(1,activation= "sigmoid"))
network1.compile(optimizer= "adam", loss= "binary_crossentropy", metrics= ["acc"])
history_net1 = network1.fit(X_train_balanced_trans_100.tocsr(),y_train_balanced,  
                             epochs=20,batch_size=500,validation_split= 0.5)
Train on 417611 samples, validate on 417611 samples
Epoch 1/20
417611/417611 [==============================] - 6s 14us/step - loss: 0.2536 - acc: 0.9017 - val_loss: 0.2184 - val_acc: 0.9166
Epoch 2/20
417611/417611 [==============================] - 5s 12us/step - loss: 0.2214 - acc: 0.9134 - val_loss: 0.2149 - val_acc: 0.9184
Epoch 3/20
417611/417611 [==============================] - 6s 13us/step - loss: 0.2194 - acc: 0.9138 - val_loss: 0.2136 - val_acc: 0.9189
Epoch 4/20
417611/417611 [==============================] - 6s 14us/step - loss: 0.2181 - acc: 0.9144 - val_loss: 0.2134 - val_acc: 0.9192
Epoch 5/20
417611/417611 [==============================] - 5s 12us/step - loss: 0.2173 - acc: 0.9146 - val_loss: 0.2128 - val_acc: 0.9190
...

Now reduce the complexity to monitor performance:

In [142]:
network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_train_balanced_trans_100.shape[1],)))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dense(1,activation= "sigmoid"))
network1.compile(optimizer= "adam", loss= "binary_crossentropy", metrics= ["acc"])
history_net1 = network1.fit(X_train_balanced_trans_100.tocsr(),y_train_balanced,  
                             epochs=40,batch_size=200,validation_split= 0.5)
Train on 417611 samples, validate on 417611 samples
Epoch 1/40
417611/417611 [==============================] - 12s 28us/step - loss: 0.2455 - acc: 0.9005 - val_loss: 0.2160 - val_acc: 0.9177
Epoch 2/40
417611/417611 [==============================] - 10s 25us/step - loss: 0.2204 - acc: 0.9133 - val_loss: 0.2152 - val_acc: 0.9183
Epoch 3/40
417611/417611 [==============================] - 10s 25us/step - loss: 0.2187 - acc: 0.9140 - val_loss: 0.2150 - val_acc: 0.9185
Epoch 4/40
417611/417611 [==============================] - 11s 26us/step - loss: 0.2175 - acc: 0.9147 - val_loss: 0.2153 - val_acc: 0.9184
Epoch 5/40
417611/417611 [==============================] - 11s 25us/step - loss: 0.2170 - acc: 0.9148 - val_loss: 0.2135 - val_acc: 0.9188
...

In [173]:
import matplotlib.pyplot as plt
epochs = list(range(1,len(history_net1.history["loss"]) +1))
print("Min loss: " + str(min(history_net1.history["loss"])) + " at epoch " + str(history_net1.history["loss"].index(min(history_net1.history["loss"])) + 1))
print("Min val loss: " + str(min(history_net1.history["val_loss"]))+ " at epoch " + str(history_net1.history["val_loss"].index(min(history_net1.history["val_loss"])) + 1))
plt.plot(epochs,history_net1.history["loss"],marker = "o", color = "b", label = "loss")
plt.plot(epochs,history_net1.history["val_loss"],marker = "o", color = "r", label = "val_loss")
plt.legend(fontsize = 20)
plt.xlabel("Number of epochs",fontsize=20)
plt.ylabel("Loss",fontsize=20)
plt.title("Monitoring neural network performance",fontsize=25)
plt.show()
Min loss: 0.211369023644 at epoch 40
Min val loss: 0.21146884427 at epoch 35

Note how the training loss keeps decreasing while the validation loss (i.e., the model's out-of-sample performance) does not follow: after epoch 35, the network starts overfitting the training set.
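Rather than reading the best epoch off the plot by eye, we could stop training automatically once validation loss stalls. Here is a minimal sketch of the patience rule that a callback like Keras's `EarlyStopping(monitor="val_loss")` applies, written against a plain list of validation losses (the numbers below are illustrative, not from this run):

```python
def best_epoch(val_losses, patience=5):
    # Return the 1-indexed epoch with the lowest validation loss, plus
    # whether a patience rule would have halted the run: stop once
    # `patience` epochs pass without a new minimum.
    best = min(val_losses)
    best_idx = val_losses.index(best)
    would_stop = (len(val_losses) - best_idx - 1) >= patience
    return best_idx + 1, would_stop

# Illustrative val_loss curve that bottoms out early, then creeps up:
epoch, stopped = best_epoch([0.225, 0.216, 0.212, 0.214, 0.215,
                             0.217, 0.219, 0.221, 0.224], patience=5)
```

In practice the same effect comes from passing an early-stopping callback to `fit`; the sketch just makes the stopping criterion explicit.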

Dropout regularization

We are going to apply dropout regularization between the hidden layers of the network. Dropout randomly drops neurons (at a specified dropout rate) from a given layer at each training step, temporarily masking the contribution of their weights to the final prediction. The remaining neurons must compensate for the dropped ones to achieve the same loss, so no unit can rely too heavily on any particular co-adaptation. Used properly, this leads to better generalization and reduces the impact of overfitting.
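To make the mechanism concrete, here is a NumPy sketch of "inverted" dropout, the scheme Keras's `Dropout` layer uses: zero out activations with probability `rate` during training and rescale the survivors by 1/(1-rate), so the expected activation matches test time, when dropout is a no-op (the array shapes and rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, rate, training=True):
    # Inverted dropout: at training time, zero each activation with
    # probability `rate` and scale survivors by 1/(1-rate); at test
    # time, pass activations through unchanged.
    if not training or rate == 0.0:
        return a
    mask = rng.random(a.shape) >= rate  # keep each unit with prob. 1-rate
    return a * mask / (1.0 - rate)

activations = np.ones((4, 5))
dropped = dropout_forward(activations, rate=0.2)
# Surviving entries are scaled to 1/0.8 = 1.25; dropped entries are 0.
```

Because the rescaling keeps the expected activation unchanged, the network needs no special correction at prediction time.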

In [181]:
# Add dropout between hidden layers
from keras import models, metrics, layers
import matplotlib.pyplot as plt

network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_train_balanced_trans_100.shape[1],)))
network1.add(layers.Dropout(0.2))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dropout(0.2))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dropout(0.2))
network1.add(layers.Dense(8,activation="relu"))
network1.add(layers.Dropout(0.2))
network1.add(layers.Dense(1,activation= "sigmoid"))
network1.compile(optimizer= "adam", loss= "binary_crossentropy", metrics= ["acc"])
history_net1 = network1.fit(X_train_balanced_trans_100.tocsr(),y_train_balanced,  
                             epochs=100,batch_size=200,validation_split= 0.5)


epochs = list(range(1,len(history_net1.history["loss"]) +1))
print("Min loss: " + str(min(history_net1.history["loss"])) + " at epoch " + str(history_net1.history["loss"].index(min(history_net1.history["loss"])) + 1))
print("Min val loss: " + str(min(history_net1.history["val_loss"]))+ " at epoch " + str(history_net1.history["val_loss"].index(min(history_net1.history["val_loss"])) + 1))
plt.plot(epochs,history_net1.history["loss"],marker = "o", color = "b", label = "loss")
plt.plot(epochs,history_net1.history["val_loss"],marker = "o", color = "r", label = "val_loss")
plt.legend(fontsize = 20)
plt.xlabel("Number of epochs",fontsize=20)
plt.ylabel("Loss",fontsize=20)
plt.title("Monitoring neural network performance",fontsize=25)
plt.show()
Train on 417611 samples, validate on 417611 samples
Epoch 1/100
417611/417611 [==============================] - 17s 41us/step - loss: 0.2806 - acc: 0.8939 - val_loss: 0.2187 - val_acc: 0.9186
Epoch 2/100
417611/417611 [==============================] - 14s 34us/step - loss: 0.2427 - acc: 0.9104 - val_loss: 0.2191 - val_acc: 0.9184
Epoch 3/100
417611/417611 [==============================] - 14s 33us/step - loss: 0.2389 - acc: 0.9111 - val_loss: 0.2157 - val_acc: 0.9197
Epoch 4/100
417611/417611 [==============================] - 14s 33us/step - loss: 0.2377 - acc: 0.9109 - val_loss: 0.2169 - val_acc: 0.9179
Epoch 5/100
417611/417611 [==============================] - 14s 33us/step - loss: 0.2365 - acc: 0.9116 - val_loss: 0.2155 - val_acc: 0.9192
...
Min loss: 0.228760845657 at epoch 98
Min val loss: 0.211608010965 at epoch 64
In [ ]:
# Decrease the dropout rate (the model will fit the training set more strongly again)
from keras import models, metrics, layers
import matplotlib.pyplot as plt
# Note that csr type of sparse matrix runs significantly faster in keras neural network implementation

network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_train_balanced_trans_100.shape[1],)))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(8,activation="relu"))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(1,activation= "sigmoid"))
network1.compile(optimizer= "adam", loss= "binary_crossentropy", metrics= ["acc"])
history_net1 = network1.fit(X_train_balanced_trans_100.tocsr(),y_train_balanced,  
                             epochs=100,batch_size=200,validation_split= 0.5)


epochs = list(range(1,len(history_net1.history["loss"]) +1))
print("Min loss: " + str(min(history_net1.history["loss"])) + " at epoch " + str(history_net1.history["loss"].index(min(history_net1.history["loss"])) + 1))
print("Min val loss: " + str(min(history_net1.history["val_loss"]))+ " at epoch " + str(history_net1.history["val_loss"].index(min(history_net1.history["val_loss"])) + 1))
plt.plot(epochs,history_net1.history["loss"],marker = "o", color = "b", label = "loss")
plt.plot(epochs,history_net1.history["val_loss"],marker = "o", color = "r", label = "val_loss")
plt.legend(fontsize = 20)
plt.xlabel("Number of epochs",fontsize=20)
plt.ylabel("Loss",fontsize=20)
plt.title("Monitoring neural network performance",fontsize=25)
plt.show()
In [185]:
epochs = list(range(1,len(history_net1.history["loss"]) +1))
print("Min loss: " + str(min(history_net1.history["loss"])) + " at epoch " + str(history_net1.history["loss"].index(min(history_net1.history["loss"])) + 1))
print("Min val loss: " + str(min(history_net1.history["val_loss"]))+ " at epoch " + str(history_net1.history["val_loss"].index(min(history_net1.history["val_loss"])) + 1))
plt.plot(epochs,history_net1.history["loss"],marker = "o", color = "b", label = "loss")
plt.plot(epochs,history_net1.history["val_loss"],marker = "o", color = "r", label = "val_loss")
plt.legend(fontsize = 20)
plt.xlabel("Number of epochs",fontsize=20)
plt.ylabel("Loss",fontsize=20)
plt.title("Monitoring neural network performance",fontsize=25)
plt.show()
Min loss: 0.210405489929 at epoch 95
Min val loss: 0.211566713362 at epoch 25

Now we have an intuition for the impact of dropout regularization on the existing network. Since we know how the model performs with and without regularization, we can try increasing its complexity until it overfits once again.
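When comparing runs "with and without regularization," it can help to reduce each run to a single number. A hedged sketch: the gap between the best validation loss and the best training loss over a run, computed from a dict shaped like Keras's `History.history` (the loss values below are illustrative, not from these runs):

```python
def generalization_gap(history):
    # Positive gap: validation loss never reaches the training loss,
    # the usual overfitting signature. Near-zero or negative gap: the
    # model is not (yet) exploiting the training set beyond what
    # generalizes.
    return min(history["val_loss"]) - min(history["loss"])

run = {"loss": [0.245, 0.220, 0.211], "val_loss": [0.216, 0.215, 0.213]}
gap = generalization_gap(run)
```

Tracking this gap across the experiments below makes the "overfit, then regularize" cycle easier to compare at a glance.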

In [186]:
# Increase model complexity to overfit
from keras import models, metrics, layers
import matplotlib.pyplot as plt
# Note that csr type of sparse matrix runs significantly faster in keras neural network implementation

network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_train_balanced_trans_100.shape[1],)))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(32,activation="relu"))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(8,activation="relu"))
network1.add(layers.Dropout(0.001))
network1.add(layers.Dense(1,activation= "sigmoid"))
network1.compile(optimizer= "adam", loss= "binary_crossentropy", metrics= ["acc"])
history_net1 = network1.fit(X_train_balanced_trans_100.tocsr(),y_train_balanced,  
                             epochs=100,batch_size=200,validation_split= 0.5)


epochs = list(range(1,len(history_net1.history["loss"]) +1))
print("Min loss: " + str(min(history_net1.history["loss"])) + " at epoch " + str(history_net1.history["loss"].index(min(history_net1.history["loss"])) + 1))
print("Min val loss: " + str(min(history_net1.history["val_loss"]))+ " at epoch " + str(history_net1.history["val_loss"].index(min(history_net1.history["val_loss"])) + 1))
plt.plot(epochs,history_net1.history["loss"],marker = "o", color = "b", label = "loss")
plt.plot(epochs,history_net1.history["val_loss"],marker = "o", color = "r", label = "val_loss")
plt.legend(fontsize = 20)
plt.xlabel("Number of epochs",fontsize=20)
plt.ylabel("Loss",fontsize=20)
plt.title("Monitoring neural network performance",fontsize=25)
plt.show()
Train on 417611 samples, validate on 417611 samples
Epoch 1/100
417611/417611 [==============================] - 16s 38us/step - loss: 0.2470 - acc: 0.9032 - val_loss: 0.2200 - val_acc: 0.9179
Epoch 2/100
417611/417611 [==============================] - 14s 34us/step - loss: 0.2219 - acc: 0.9128 - val_loss: 0.2158 - val_acc: 0.9185
Epoch 3/100
417611/417611 [==============================] - 15s 36us/step - loss: 0.2197 - acc: 0.9137 - val_loss: 0.2144 - val_acc: 0.9182
Epoch 4/100
417611/417611 [==============================] - 15s 35us/step - loss: 0.2184 - acc: 0.9143 - val_loss: 0.2140 - val_acc: 0.9178
Epoch 5/100
417611/417611 [==============================] - 15s 35us/step - loss: 0.2176 - acc: 0.9144 - val_loss: 0.2198 - val_acc: 0.9147
...
Min loss: 0.209121280931 at epoch 99
Min val loss: 0.213720297473 at epoch 6

Overfitting after epoch 6 is obvious. Now, let's apply our dropout regularization trick once again:

In [187]:
# Increase Dropout rate
from keras import models, metrics, layers
import matplotlib.pyplot as plt
# Note that csr type of sparse matrix runs significantly faster in keras neural network implementation

network1 = models.Sequential()
network1.add(layers.Dense(16,activation="relu", input_shape = (X_train_balanced_trans_100.shape[1],)))
network1.add(layers.Dropout(0.1))
network1.add(layers.Dense(32,activation="relu"))
network1.add(layers.Dropout(0.1))
network1.add(layers.Dense(16,activation="relu"))
network1.add(layers.Dropout(0.1))
network1.add(layers.Dense(8,activation="relu"))
network1.add(layers.Dropout(0.1))
network1.add(layers.Dense(1,activation= "sigmoid"))
network1.compile(optimizer= "adam", loss= "binary_crossentropy", metrics= ["acc"])
history_net1 = network1.fit(X_train_balanced_trans_100.tocsr(),y_train_balanced,  
                             epochs=100,batch_size=200,validation_split= 0.5)


epochs = list(range(1,len(history_net1.history["loss"]) +1))
print("Min loss: " + str(min(history_net1.history["loss"])) + " at epoch " + str(history_net1.history["loss"].index(min(history_net1.history["loss"])) + 1))
print("Min val loss: " + str(min(history_net1.history["val_loss"]))+ " at epoch " + str(history_net1.history["val_loss"].index(min(history_net1.history["val_loss"])) + 1))
plt.plot(epochs,history_net1.history["loss"],marker = "o", color = "b", label = "loss")
plt.plot(epochs,history_net1.history["val_loss"],marker = "o", color = "r", label = "val_loss")
plt.legend(fontsize = 20)
plt.xlabel("Number of epochs",fontsize=20)
plt.ylabel("Loss",fontsize=20)
plt.title("Monitoring neural network performance",fontsize=25)
plt.show()
Train on 417611 samples, validate on 417611 samples
Epoch 1/100
417611/417611 [==============================] - 18s 43us/step - loss: 0.2596 - acc: 0.9001 - val_loss: 0.2164 - val_acc: 0.9187
Epoch 2/100
417611/417611 [==============================] - 15s 36us/step - loss: 0.2309 - acc: 0.9119 - val_loss: 0.2174 - val_acc: 0.9182
Epoch 3/100
417611/417611 [==============================] - 25s 59us/step - loss: 0.2278 - acc: 0.9128 - val_loss: 0.2133 - val_acc: 0.9198
Epoch 4/100
417611/417611 [==============================] - 18s 42us/step - loss: 0.2263 - acc: 0.9129 - val_loss: 0.2129 - val_acc: 0.9200
Epoch 5/100
417611/417611 [==============================] - 16s 38us/step - loss: 0.2250 - acc: 0.9135 - val_loss: 0.2126 - val_acc: 0.9195
...
Min loss: 0.218788871383 at epoch 98
Min val loss: 0.210070606734 at epoch 51

Notice the boost in network performance? Increasing regularization after deliberately overfitting yields a model that predicts better than the benchmark, while also avoiding the overfitting we observed.

Therefore, the take home message from this exercise is:

  1. Start with benchmark models of low, medium, and high complexity. Develop expectations about your model and its overfitting behavior.
  2. Apply regularization to your "medium complexity" network and monitor performance.
  3. Tune down regularization and slightly increase network complexity by adding neurons and/or layers; observe the overfitting.
  4. Turn regularization back on and monitor for any noticeable boost in out-of-sample network performance.
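The cells above repeat the same Dense/Dropout scaffold with different widths and rates, so the cycle is easier to run if the architecture is parameterized. A Keras-free sketch of the layer plan the experiments follow (`plan_network` is a helper name introduced here, and the widths and rate are illustrative):

```python
def plan_network(hidden_units, rate):
    # Mirror the notebook's scaffold: each hidden Dense(units, "relu")
    # is followed by Dropout(rate) when rate > 0, then a single
    # Dense(1, "sigmoid") output. Returns (layer, argument) pairs you
    # could feed to models.Sequential via layers.Dense / layers.Dropout.
    plan = []
    for units in hidden_units:
        plan.append(("Dense", units))
        if rate > 0:
            plan.append(("Dropout", rate))
    plan.append(("Dense", 1))
    return plan

plan = plan_network([16, 32, 16, 8], rate=0.1)
```

With a builder like this, each step of the recipe becomes a one-line change to `hidden_units` or `rate` rather than a copy-pasted cell.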

By performing a few empirical cycles involving these steps, you might be able to approach a better-tuned neural network and advance your Deep Learning journey!