Spring at Cadillac Mountain, Acadia National Park, Maine. Photo: Ozan Aygun
                    
					
						
							
								 Bayesian optimization for Hyperparameter Tuning of XGboost classifier 
							
							Once you have a first signs of a sub-optimally performing predictive model, the natural next question
							you ask: "How can I make this model to work even better?" A refined form of this question: " Is it
							the best performance I can get from this model?"
							
							This question leads to efforts we collectively refer to "optimization". More specifically, we want to
							tune the "hyperparameters" of the machine learning algorithms so that the provide the optimal performance
							for a given problem. These hyperparameters are different from other model parameters that are 'learned'
							from the training data. Hyperparameters should be pre-defined and they have a key impact on adjusting model
							complexity. Simple examples of hyperparameters are regularization parameters such as alpha and lambda, used for L1
							or L2 regularization. Other examples include number of trees in a decision tree, learning rate or number
							of neurons in a Deep Neural Network. All of these hyperparameters should be fine-tuned in order to optimize
							performance of the model.
							
							Amongst the hyperparameter tuning approaches, two algorithms are the most common: Grid Search and Randomized Search.
							Both of these approaches are coupled with cross-validation to obtain reliable estimates of the cost function.
							
							Grid Search is an exhaustive search of hyperparameters in a predefined space. This algorithm ensures that all
							possible combinations of various hyperparameters will be tested and the set of hyperparameters that yield the 
							best performing model will be selected. This may sound very robust, however in large data sets performing
							Grid Search may not be practical (can you guess why? Hint: think about the added computational cost of cross-validation).
							
							Randomized Search is a viable alternative for Grid Search. As the name implies, for each iteration a set of
							hyperparameters will be randomly chosen from the predefined hyperparameter distributions. The maximum number of iterations 
							can be set, which can limit the computational cost to whatever that is feasible based on the resources. The main disadvantage of
							Randomized Search is that it is not exhaustive, and we can not guarantee the entire hyperparameter space will be tested. Therefore,
							potentially better performing set of hyperparameters may be missed due to the iteration constraints predefined.
							
							When we pause a second and think, we realize that both of these approaches have their own weaknesses. For a given problem,
							using Grid Search may or may not be feasible. Fortunately, there are other optimization alternatives that can balance between the
							need of a exhaustive search and computational constraints.  
							
							Bayesian Optimization is such an approach. The exact theory behind Bayesian Optimization is too complex to explain here. However, the
							basic idea involves generating a robust 'prior' for the cost value as a function of various hyperparameters in the defined space. 
							Once the 'prior' is set, Bayesian Optimization process will actively work to minimize different 'regions' of the cost by adjusting
							strategic hyperparameters. Since the number of iterations can be set, this optimization approach enables balancing exploration and exploitation.
							Here I wrote up a basic example of Bayesian Optimization to optimize Hyperparameters of a XGboost classifier. Once you study
							this example try to understand the flexibility of this approach and how you can use it in other classification or regression problems.
							Keep in mind that -ironically- Bayesian Optimization has hyperparameters itself, therefore, just like any other approach in machine learning,
							it does not guarantee the optimal solution for any optimization problem.