Spring at Cadillac Mountain, Acadia National Park, Maine. Photo: Ozan Aygun

Bayesian Optimization for Hyperparameter Tuning of an XGBoost Classifier

Once you see the first signs of a sub-optimally performing predictive model, the natural next question is: "How can I make this model work even better?" A more refined form of this question is: "Is this the best performance I can get from this model?"

This question leads to efforts we collectively refer to as "optimization". More specifically, we want to tune the "hyperparameters" of a machine learning algorithm so that it provides optimal performance for a given problem. These hyperparameters are different from the model parameters that are 'learned' from the training data: hyperparameters must be defined before training, and they have a key impact on model complexity. Simple examples of hyperparameters are regularization parameters such as alpha and lambda, used for L1 and L2 regularization. Other examples include the number of trees in a tree ensemble, the learning rate, or the number of neurons in a deep neural network. All of these hyperparameters should be fine-tuned in order to optimize the performance of the model.

Among hyperparameter tuning approaches, two algorithms are the most common: Grid Search and Randomized Search. Both are typically coupled with cross-validation to obtain reliable estimates of the cost function.

Grid Search is an exhaustive search over hyperparameters in a predefined space. This algorithm ensures that all possible combinations of the various hyperparameters are tested, and the set of hyperparameters that yields the best performing model is selected. This may sound very robust; however, for large data sets performing Grid Search may not be practical (can you guess why? Hint: think about the added computational cost of cross-validation).
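A minimal sketch of what this looks like with scikit-learn's GridSearchCV (the dataset, grid, and scoring metric below are illustrative assumptions, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Every combination in this grid will be evaluated
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 300],
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,        # each of the 3 * 3 * 2 = 18 combinations is refit 5 times
    n_jobs=-1,
)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)
```

Note how the cross-validation multiplies the cost: 18 combinations times 5 folds is already 90 model fits for a very small grid.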

Randomized Search is a viable alternative to Grid Search. As the name implies, in each iteration a set of hyperparameters is randomly drawn from predefined hyperparameter distributions. The maximum number of iterations can be set, which limits the computational cost to whatever is feasible given the available resources. The main disadvantage of Randomized Search is that it is not exhaustive: we cannot guarantee that the entire hyperparameter space will be tested, so a potentially better performing set of hyperparameters may be missed due to the predefined iteration limit.
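A sketch of the same idea with RandomizedSearchCV (the distributions and iteration budget below are illustrative assumptions):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Distributions to sample from, rather than an explicit grid
param_distributions = {
    "max_depth": randint(2, 10),
    "learning_rate": uniform(0.01, 0.29),   # samples from [0.01, 0.30)
    "subsample": uniform(0.5, 0.5),         # samples from [0.5, 1.0)
    "n_estimators": randint(100, 500),
}

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=30,            # hard cap on the number of sampled combinations
    scoring="roc_auc",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X, y)
print(random_search.best_params_, random_search.best_score_)
```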

When we pause for a second and think, we realize that both of these approaches have their own weaknesses. For a given problem, using Grid Search may or may not be feasible. Fortunately, there are other optimization alternatives that can strike a balance between the need for an exhaustive search and computational constraints.

Bayesian Optimization is one such approach. The exact theory behind Bayesian Optimization is too complex to explain here, but the basic idea is to build a probabilistic 'prior' for the cost value as a function of the hyperparameters in the defined space. Once the prior is set, the Bayesian Optimization process uses each new evaluation to decide which 'regions' of the hyperparameter space to try next in order to minimize the cost. Since the number of iterations can be set, this optimization approach enables balancing exploration and exploitation. Here I wrote up a basic example of Bayesian Optimization to optimize the hyperparameters of an XGBoost classifier. Once you study this example, try to understand the flexibility of this approach and how you can use it in other classification or regression problems. Keep in mind that, ironically, Bayesian Optimization has hyperparameters itself; therefore, just like any other approach in machine learning, it does not guarantee the optimal solution for every optimization problem.
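As a minimal sketch of what such an example can look like (this is not necessarily the exact code from the original post; the bayes_opt package, the dataset, and the parameter bounds are my assumptions):

```python
from bayes_opt import BayesianOptimization
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

def xgb_cv(max_depth, learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda):
    """Cross-validated ROC AUC for one set of hyperparameters (the 'cost' we maximize)."""
    model = XGBClassifier(
        n_estimators=200,
        max_depth=int(max_depth),          # bayes_opt proposes floats; cast to int
        learning_rate=learning_rate,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        reg_alpha=reg_alpha,
        reg_lambda=reg_lambda,
    )
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# Bounds of the hyperparameter space the optimizer is allowed to explore
pbounds = {
    "max_depth": (2, 8),
    "learning_rate": (0.01, 0.3),
    "subsample": (0.5, 1.0),
    "colsample_bytree": (0.5, 1.0),
    "reg_alpha": (0.0, 5.0),
    "reg_lambda": (0.0, 5.0),
}

optimizer = BayesianOptimization(f=xgb_cv, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=5, n_iter=25)  # initial random probes, then guided iterations
print(optimizer.max)  # best score and the hyperparameters that achieved it
```

The init_points and n_iter arguments are examples of Bayesian Optimization's own hyperparameters mentioned above: they control how much of the budget goes to initial exploration versus guided exploitation of promising regions.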