Hyperparameter Tuning in XGBoost Using Genetic Algorithm
Mohit Jain
Sep 17, 2018
Introduction
A genetic algorithm, as defined on Wikipedia, takes its inspiration from the process of
natural selection proposed by Charles Darwin. In more general terms, we can
understand the natural process and how it relates to the genetic algorithm using
the following description:
We start with an initial population, which will have certain traits, as shown in Figure 1.
This initial population will be tested in a certain environment to observe how well the
individuals (parents) in this population perform, based on a predefined fitness criterion.
In machine learning, the fitness can be any performance metric: accuracy, precision,
recall, F1-score, AUC, among others. Based on the fitness value, we select the
top-performing parents (“survival of the fittest”) as the survived population (Figure 2).
Now the parents in the survived population will mate to produce offspring, using a
combination of two steps: crossover/recombination and mutation. In the case of
crossover, the genes (parameters) from the mating parents are recombined to
produce offspring, with each child inheriting some genes (parameters) from each
parent (Figure 3).
Finally, in the case of mutation, some of the values of the genes (parameters) are
altered to maintain genetic diversity (Figure 4). This typically allows nature, and the
genetic algorithm, to arrive at a better solution.
Figure 5 shows the second generation of the population, which will contain both the
survived parents and the children. We keep the survived parents so as to retain the
best-fitness parameters in case the offspring's fitness values turn out to be worse than
those of the parents.
The module will have functions for the four steps: (i) initialization, (ii) selection,
(iii) crossover, and (iv) mutation, similar to what is discussed above (a small
section of this code is inspired by the post here).
Initialization:
The first step is initialization, where the parameters are randomly initialized to create
the population. It is similar to the first generation of the population shown in Figure 1.
The code below shows the initialization process, where we generate a vector containing
the parameters. In the case of XGBoost we have selected 7 parameters to optimize:
learning_rate, n_estimators, max_depth, min_child_weight, subsample,
colsample_bytree, and gamma. A detailed description of these parameters can be
found here.
import random
import numpy as np

def initilialize_poplulation(numberOfParents):
    learningRate = np.empty([numberOfParents, 1])
    nEstimators = np.empty([numberOfParents, 1], dtype=np.uint16)  # uint8 would overflow above 255
    maxDepth = np.empty([numberOfParents, 1], dtype=np.uint8)
    minChildWeight = np.empty([numberOfParents, 1])
    gammaValue = np.empty([numberOfParents, 1])
    subSample = np.empty([numberOfParents, 1])
    colSampleByTree = np.empty([numberOfParents, 1])

    for i in range(numberOfParents):
        learningRate[i] = round(random.uniform(0.01, 1), 2)
        nEstimators[i] = random.randrange(10, 1500, step=25)
        maxDepth[i] = int(random.randrange(1, 10, step=1))
        minChildWeight[i] = round(random.uniform(0.01, 10.0), 2)
        gammaValue[i] = round(random.uniform(0.01, 10.0), 2)
        subSample[i] = round(random.uniform(0.01, 1.0), 2)
        colSampleByTree[i] = round(random.uniform(0.01, 1.0), 2)

    # stack the parameter columns into a single (numberOfParents x 7) population array
    population = np.concatenate((learningRate, nEstimators, maxDepth, minChildWeight,
                                 gammaValue, subSample, colSampleByTree), axis=1)
    return population
The limits on the parameters are either based on the ranges described in the XGBoost
documentation or on a reasonable guess when the upper limit is set to infinity. We
start by creating an empty array for each parameter and then populate it with
random values.
Parent selection (survival of the fittest)
In the second step we train our model using the initial population and calculate the
fitness value. In this case we will calculate the F1-score.
We will define how many parents we would like to select and create an array with the
selected parents, based on their fitness value.
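The fitness-evaluation and selection code is not shown at this point, so below is a minimal sketch of what these two helpers might look like. The train_population signature mirrors its usage later in the post, and new_parent_selection matches the name referenced in the crossover docstring; the internals (binary objective, 0.5 prediction threshold, treating n_estimators as the number of boosting rounds) are assumptions rather than the author's exact implementation.
import numpy as np
import xgboost as xgb
from sklearn.metrics import f1_score

def train_population(population, dMatrixTrain, dMatrixtest, y_test):
    # train one XGBoost model per individual and score it with the F1-score
    fScore = []
    for i in range(population.shape[0]):
        param = {'objective': 'binary:logistic',          # assumption: binary classification
                 'learning_rate': population[i][0],
                 'max_depth': int(population[i][2]),
                 'min_child_weight': population[i][3],
                 'gamma': population[i][4],
                 'subsample': population[i][5],
                 'colsample_bytree': population[i][6]}
        numRound = int(population[i][1])                  # n_estimators used as boosting rounds
        model = xgb.train(param, dMatrixTrain, numRound)
        preds = (model.predict(dMatrixtest) > 0.5).astype(int)
        fScore.append(f1_score(y_test, preds))
    return np.array(fScore)

def new_parent_selection(population, fitness, numParents):
    # keep the numParents individuals with the highest fitness (F1-score)
    selectedParents = np.empty((numParents, population.shape[1]))
    bestIndexes = np.argsort(fitness)[::-1][:numParents]
    for i, idx in enumerate(bestIndexes):
        selectedParents[i, :] = population[idx, :]
    return selectedParents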
Crossover
There are various ways to define crossover in genetic algorithms, such as
single-point, two-point and k-point crossover, uniform crossover, and crossover for
ordered lists. We are going to use uniform crossover, where each parameter for the
child will be independently selected from one of the two parents, based on a certain
distribution. In our case we will use the “discrete uniform” distribution from NumPy's
random module.
'''
Mate these parents to create children having parameters from these
parents (we are using uniform crossover method)
'''
def crossover_uniform(parents, childrenSize):
    # randomly split the parameter indexes into two sets: genes at the first
    # set of indexes come from parent 1, the rest from parent 2
    crossoverPointIndex = np.arange(0, np.uint8(childrenSize[1]), 1, dtype=np.uint8)
    crossoverPointIndex1 = np.random.randint(0, np.uint8(childrenSize[1]),
                                             np.uint8(childrenSize[1] / 2))
    crossoverPointIndex2 = np.array(list(set(crossoverPointIndex) - set(crossoverPointIndex1)))

    children = np.empty(childrenSize)
    '''
    Create each child by choosing parameters from two parents selected
    using the new_parent_selection function. The parameter values
    will be picked from the indexes that were randomly selected above.
    '''
    for i in range(childrenSize[0]):
        # pair up parents cyclically
        parent1_index = i % parents.shape[0]
        parent2_index = (i + 1) % parents.shape[0]
        # inherit one subset of parameters from each parent
        children[i, crossoverPointIndex1] = parents[parent1_index, crossoverPointIndex1]
        children[i, crossoverPointIndex2] = parents[parent2_index, crossoverPointIndex2]
    return children
Mutation
The final step is to introduce diversity in the children by randomly selecting one of
the parameters and altering its value by a random amount. We will also introduce some
limits, in order to constrain the altered values within a certain range. Skipping
these constraints might lead to errors.
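No mutation code is shown here, so the following is a minimal sketch of the step, assuming we perturb one randomly chosen parameter per child and clip it back into the ranges used at initialization; the perturbation scale is an arbitrary choice, not the author's exact implementation.
import random
import numpy as np

def mutation(crossover, numberOfParameters):
    # (lower, upper) bounds per parameter, in the same column order as the
    # population: learning_rate, n_estimators, max_depth, min_child_weight,
    # gamma, subsample, colsample_bytree
    bounds = [(0.01, 1.0), (10, 1500), (1, 10), (0.01, 10.0),
              (0.01, 10.0), (0.01, 1.0), (0.01, 1.0)]
    for idx in range(crossover.shape[0]):
        parameterSelect = random.randrange(numberOfParameters)   # pick one gene at random
        lower, upper = bounds[parameterSelect]
        # nudge the value by up to +/- 5% of the parameter's range (assumed scale)
        mutationValue = np.random.uniform(-0.05, 0.05) * (upper - lower)
        crossover[idx, parameterSelect] = np.clip(
            crossover[idx, parameterSelect] + mutationValue, lower, upper)
    return crossover
With the GA helpers in place, we prepare the data and wrap it in DMatrix objects so that we can use the native XGBoost API: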
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(723)

# Split the dataset (X, y loaded earlier) into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=97)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Use the native XGBoost API: wrap the train and test sets in DMatrix objects
xgDMatrix = xgb.DMatrix(X_train, y_train)
xgbDMatrixTest = xgb.DMatrix(X_test, y_test)
We start with 8 parents and select the 4 fittest parents for mating. We will
create 4 generations and monitor the fitness (F1-score). Half of the parents in each
subsequent generation will be the fittest parents selected from the previous generation.
This allows us to keep the best fitness score at least as good as in the previous
generation, in case the children's fitness scores turn out to be worse.
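Pulling the pieces together, here is a sketch of the set-up and the first half of the generation loop; the snippet that follows it continues inside the same loop. The constants mirror the description above (8 parents, 4 kept for mating, 7 parameters, 4 generations), while the history-array shapes and the assumption that the helper functions are collected in a geneticXGboost module are guided by how they are used below.
import numpy as np
import geneticXGboost  # assumed module collecting the GA helper functions above

numberOfParents = 8          # individuals per generation
numberOfParentsMating = 4    # fittest parents kept for mating
numberOfParameters = 7       # hyperparameters being tuned
numberOfGenerations = 4

# initial population plus history arrays for later visualization
population = geneticXGboost.initilialize_poplulation(numberOfParents)
fitnessHistory = np.empty([numberOfGenerations + 1, numberOfParents])
populationHistory = np.empty([(numberOfGenerations + 1) * numberOfParents,
                              numberOfParameters])
populationHistory[0:numberOfParents, :] = population

for generation in range(numberOfGenerations):
    print("Generation %d" % generation)

    # evaluate the current population (F1-score on the test set)
    fitness = geneticXGboost.train_population(population=population,
        dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
    fitnessHistory[generation, :] = fitness

    # survival of the fittest, then mating and mutation
    parents = geneticXGboost.new_parent_selection(population=population,
        fitness=fitness, numParents=numberOfParentsMating)
    children = geneticXGboost.crossover_uniform(parents=parents,
        childrenSize=(numberOfParents - parents.shape[0], numberOfParameters))
    children_mutated = geneticXGboost.mutation(children, numberOfParameters)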
    '''
    We will create the new population, which will contain the parents that
    were selected previously based on the fitness score; the rest will be
    the mutated children.
    '''
    population[0:parents.shape[0], :] = parents          # fittest parents
    population[parents.shape[0]:, :] = children_mutated  # children

    # store parameter information for this generation
    populationHistory[(generation + 1) * numberOfParents:
                      (generation + 1) * numberOfParents + numberOfParents, :] = population

# after the final generation, evaluate the last population once more
fitness = geneticXGboost.train_population(population=population,
                                          dMatrixTrain=xgDMatrix,
                                          dMatrixtest=xgbDMatrixTest,
                                          y_test=y_test)
fitnessHistory[generation + 1, :] = fitness

# index of the individual with the highest F1-score
bestFitnessIndex = np.where(fitness == np.max(fitness))[0][0]
#Best fitness
print("Best fitness is =", fitness[bestFitnessIndex])
#Best parameters
print("Best parameters are:")
print('learning_rate', population[bestFitnessIndex][0])
print('n_estimators', population[bestFitnessIndex][1])
print('max_depth', int(population[bestFitnessIndex][2]))
print('min_child_weight', population[bestFitnessIndex][3])
print('gamma', population[bestFitnessIndex][4])
print('subsample', population[bestFitnessIndex][5])
print('colsample_bytree', population[bestFitnessIndex][6])
Now let us visualize the change in fitness in the population with each generation
(figure below; a sketch of the plotting code follows at the end of this section). While
we already started with a high F1-score (~0.98) in two of the parents in the randomly
generated initial population, we were able to improve it further in the final generation.
The lowest F1-score was 0.9143, for one parent in the initial population, and the best
score was 0.9947, for one of the parents in the final generation. This demonstrates that
we can improve the performance metric in XGBoost with a simple implementation of a
genetic algorithm. The final code can be found at my github account. It also contains
code that will allow you to observe the change in various parameters with each
generation.
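As a closing aside, a quick way to produce such a fitness-versus-generation plot from the fitnessHistory array built above might look like this; the matplotlib code is a sketch of mine, not taken from the post.
import numpy as np
import matplotlib.pyplot as plt

# fitnessHistory has one row per generation (plus the final evaluation)
generations = np.arange(fitnessHistory.shape[0])
plt.plot(generations, fitnessHistory.max(axis=1), label='best F1-score')
plt.plot(generations, fitnessHistory.mean(axis=1), label='mean F1-score')
plt.plot(generations, fitnessHistory.min(axis=1), label='worst F1-score')
plt.xlabel('Generation')
plt.ylabel('F1-score')
plt.legend()
plt.show()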