Studying Principal Component Analysis at coffee break, Cambridge, MA. Photo: Ozan Aygun

Using streaming data and ensemble classifiers for incremental learning

Imagine you have a predictive analytics problem and data is accumulating in real time. You want to train a model for decision making, but you are not sure whether it is capturing changing trends in the data. When the relationship between the features and the target shifts over time, this is often called concept drift, and keeping a model up to date with it is sometimes referred to as trend adaptation. Alternatively, you may have a fairly big data set but not enough computational power or memory to fit all of it into your model, and you worry that training on a sample, even a supposedly representative one, might introduce bias.

How are you going to use streaming data to continuously train a model and reduce bias? Some models (such as decision trees) need to see the entire data set at once, so they are not a practical solution here. Fortunately, more flexible and adaptive machine learning algorithms exist, including Naive Bayes and models trained by gradient descent. Both can update their parameters one batch at a time without revisiting earlier data. These fairly simple algorithms can be extremely useful when dealing with streaming data, especially when the data can be reduced to a few dimensions, and they tend to generalize well.
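To make the idea of incremental updates concrete, here is a minimal sketch in plain Python of a logistic regression trained by stochastic gradient descent, one mini-batch at a time. It is not the code used in this article: the class `OnlineLogisticRegression`, the helper `make_batch`, the learning rate, and the toy two-feature stream are all hypothetical and chosen only for illustration.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogisticRegression:
    """Logistic regression trained by stochastic gradient descent, batch by batch."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return sigmoid(z)

    def partial_fit(self, batch_x, batch_y):
        # One gradient step per example; earlier batches are never revisited.
        for x, y in zip(batch_x, batch_y):
            error = self.predict_proba(x) - y
            self.weights = [w - self.learning_rate * error * v
                            for w, v in zip(self.weights, x)]
            self.bias -= self.learning_rate * error

def make_batch(rng, size):
    """Hypothetical data stream: the label is 1 when x0 + x1 > 0."""
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(size)]
    ys = [1 if x[0] + x[1] > 0 else 0 for x in xs]
    return xs, ys

def accuracy(model, xs, ys):
    hits = sum((model.predict_proba(x) > 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(ys)

rng = random.Random(0)
test_x, test_y = make_batch(rng, 500)   # fixed holdout for monitoring
model = OnlineLogisticRegression(n_features=2)
accuracy_before = accuracy(model, test_x, test_y)

# Only one batch of 100 rows is held in memory at any time.
for _ in range(50):
    batch_x, batch_y = make_batch(rng, 100)
    model.partial_fit(batch_x, batch_y)

accuracy_after = accuracy(model, test_x, test_y)
print(f"holdout accuracy before: {accuracy_before:.3f}, after: {accuracy_after:.3f}")
```

The model state is just the weights and the bias, so memory use stays constant no matter how many batches arrive. Libraries such as scikit-learn expose the same pattern through a `partial_fit` method on estimators like `SGDClassifier` and `GaussianNB`.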

In this example, I would like to demonstrate this approach using a binary classification problem, where the entire data set contains 200 million training samples. You can find more information about this data set and the feature engineering pipeline here.

You will learn the following approach:

1. Train first tier classifiers using a small (~1 million examples) training set.

2. Establish the out-of-the-box performance of these classifiers.

3. Pick a few of the best performing first tier classifiers and get their predictions. Using these predictions as the new features reduces the dimension of the data significantly.

4. Train a second tier (ensemble) classifier on the predictions of the first tier classifiers.

5. Stream the entire training set in small batches through steps 3 and 4 in order to perform incremental (online) learning, and monitor the performance of two distinct ensemble classifiers.
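The five steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not the pipeline used in this article: the synthetic four-feature stream, the classes `SGDLogistic` and `OnlineGaussianNB`, the choice of feature subsets for the first tier candidates, and all batch sizes are hypothetical. Each second tier model is scored on a batch before it trains on that batch, which is one simple way to monitor performance on a stream.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

class SGDLogistic:
    """Logistic regression on selected columns, updated by stochastic gradient descent."""
    def __init__(self, columns, learning_rate=0.1):
        self.columns = columns
        self.w = [0.0] * len(columns)
        self.b = 0.0
        self.lr = learning_rate

    def _select(self, x):
        return [x[j] for j in self.columns]

    def predict_proba(self, x):
        return sigmoid(self.b + sum(w * v for w, v in zip(self.w, self._select(x))))

    def partial_fit(self, xs, ys):
        for x, y in zip(xs, ys):
            err = self.predict_proba(x) - y
            self.w = [w - self.lr * err * v for w, v in zip(self.w, self._select(x))]
            self.b -= self.lr * err

class OnlineGaussianNB:
    """Gaussian Naive Bayes that stores only per-class running sums."""
    def __init__(self, n_features):
        self.n = [0, 0]
        self.s = [[0.0] * n_features for _ in range(2)]   # sums of x
        self.ss = [[0.0] * n_features for _ in range(2)]  # sums of x squared

    def partial_fit(self, xs, ys):
        for x, y in zip(xs, ys):
            self.n[y] += 1
            for j, v in enumerate(x):
                self.s[y][j] += v
                self.ss[y][j] += v * v

    def _log_joint(self, x, c):
        n = self.n[c]
        total = math.log(n / sum(self.n))  # log prior of class c
        for j, v in enumerate(x):
            mean = self.s[c][j] / n
            var = max(self.ss[c][j] / n - mean * mean, 1e-6)
            total -= 0.5 * math.log(2 * math.pi * var) + (v - mean) ** 2 / (2 * var)
        return total

    def predict_proba(self, x):
        if min(self.n) == 0:  # no estimate until both classes have been seen
            return 0.5
        return sigmoid(self._log_joint(x, 1) - self._log_joint(x, 0))

def make_batch(rng, size):
    """Hypothetical stream: four features, the last one is pure noise."""
    xs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(size)]
    ys = [int(x[0] + x[1] + 0.5 * x[2] + rng.gauss(0, 0.5) > 0) for x in xs]
    return xs, ys

def accuracy(model, xs, ys):
    return sum((model.predict_proba(x) > 0.5) == (y == 1) for x, y in zip(xs, ys)) / len(ys)

rng = random.Random(0)

# Steps 1-2: train first tier candidates on a small set, score them on a holdout.
small_x, small_y = make_batch(rng, 2000)
hold_x, hold_y = make_batch(rng, 500)
candidates = [SGDLogistic(cols) for cols in ([0, 1], [1, 2], [0, 2], [3])]
for model in candidates:
    model.partial_fit(small_x, small_y)
scores = [accuracy(model, hold_x, hold_y) for model in candidates]

# Step 3: keep the best two; their predictions become a two-column feature space.
ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
first_tier = [candidates[i] for i in ranked[:2]]

def reduce_features(xs):
    return [[model.predict_proba(x) for model in first_tier] for x in xs]

# Steps 4-5: stream batches through the first tier and update two second tier models.
second_tier = {"sgd": SGDLogistic([0, 1]), "naive_bayes": OnlineGaussianNB(2)}
history = {name: [] for name in second_tier}
for _ in range(40):
    batch_x, batch_y = make_batch(rng, 250)
    meta_x = reduce_features(batch_x)
    for name, model in second_tier.items():
        history[name].append(accuracy(model, meta_x, batch_y))  # test first
        model.partial_fit(meta_x, batch_y)                      # then train

for name, accs in history.items():
    print(name, "first batch:", round(accs[0], 3), "last batch:", round(accs[-1], 3))
```

In this sketch the first tier is frozen after the small training set and only the second tier keeps learning from the stream. That is one reading of step 5; updating the first tier on each batch as well would be an equally plausible variant.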