XGBoost: Release 1.0.0-SNAPSHOT
xgboost developers
Contents

1.1 Installation Guide
1.2 Get Started with XGBoost
1.3 XGBoost Tutorials
1.4 Frequently Asked Questions
1.5 XGBoost GPU Support
1.6 XGBoost Parameters
1.7 XGBoost Python Package
1.8 XGBoost R Package
1.9 XGBoost JVM Package
1.10 XGBoost.jl
1.11 XGBoost Command Line version
1.12 Contribute to XGBoost

Index
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree
boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same
code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
1.1 Installation Guide
• The binary wheel will support GPU algorithms (gpu_exact, gpu_hist) on machines with NVIDIA GPUs. Please
note that training with multiple GPUs is only supported on Linux. See XGBoost GPU Support.
• Currently, we provide binary wheels for 64-bit Linux and Windows.
This page gives instructions on how to build and install XGBoost from scratch on various systems. It consists of two
steps:
1. First build the shared library from the C++ code (libxgboost.so for Linux/OSX and xgboost.dll for
Windows). (For R-package installation, please refer directly to R Package Installation.)
2. Then install the language packages (e.g. the Python Package).
For Windows users who use GitHub tools, you can open the Git Shell and type the following commands:

git submodule init
git submodule update
Please refer to the Troubleshooting section first if you have any problems during installation. If the instructions do not
work for you, please feel free to ask questions at the user forum.
Contents
• Building the Shared Library
– Building on Ubuntu/Debian
– Building on OSX
– Building on Windows
– Building with GPU support
– Customized Building
• Python Package Installation
• R Package Installation
• Troubleshooting
• Building the documentation
Building on Ubuntu/Debian
Building on OSX
First, obtain gcc-8 with Homebrew (https://brew.sh/) to enable multi-threading (i.e. using multiple CPU threads for
training). The default Apple Clang compiler does not support OpenMP, so using the default compiler would disable
multi-threading.
You might need to run the command with --user flag if you run into permission errors.
Create the build/ directory and invoke CMake. Make sure to add CC=gcc-8 CXX=g++-8 so that Homebrew
GCC is selected. After invoking CMake, you can build XGBoost with make:
mkdir build
cd build
CC=gcc-8 CXX=g++-8 cmake ..
make -j4
Building on Windows
You need to first clone the XGBoost repo with the --recursive option to clone the submodules. We recommend you
use Git for Windows, as it comes with a standard Bash shell. This greatly simplifies the installation process.
To build with Visual Studio, we will need CMake. Make sure to install a recent version of CMake. Then run the
following from the root of the XGBoost directory:
mkdir build
cd build
cmake .. -G"Visual Studio 14 2015 Win64"
This specifies an out of source build using the Visual Studio 64 bit generator. (Change the -G option appropriately
if you have a different version of Visual Studio installed.) Open the .sln file in the build directory and build with
Visual Studio.
After the build process successfully ends, you will find an xgboost.dll library file inside the ./lib/ folder.
After installing Git for Windows, you should have a shortcut named Git Bash. You should run all subsequent steps
in Git Bash.
In MinGW, the make command comes under the name mingw32-make. You can add the following line to your
.bashrc file:

alias make='mingw32-make'
(On 64-bit Windows, you should get MinGW64 instead.) Make sure that the path to MinGW is in the system PATH.
To build with MinGW, type:
cp make/mingw64.mk config.mk; make -j4
See Building XGBoost library for Python for Windows with MinGW-w64 (Advanced) for building XGBoost for
Python.
Building with GPU support

XGBoost can be built with GPU support for both Linux and Windows using CMake. GPU support works with the
Python package as well as the CLI version. See Installing R package with GPU support for special instructions for R.
An up-to-date version of the CUDA toolkit is required.
From the command line on Linux starting from the XGBoost directory:
mkdir build
cd build
cmake .. -DUSE_CUDA=ON
make -j4
On Windows, run CMake with the Visual Studio generator and -DUSE_CUDA=ON. (Change the -G option appropriately
if you have a different version of Visual Studio installed.)
To speed up compilation, the compute version specific to your GPU can be passed to CMake, e.g.
-DGPU_COMPUTE_VER=50. The above CMake configuration run will create an xgboost.sln solution file in
the build directory. Build this solution in Release mode as an x64 build, either from Visual Studio or from the command
line:

cmake --build . --config Release
Customized Building
We recommend the use of CMake for most use cases. See the full range of building options in CMakeLists.txt.
Alternatively, you may use the Makefile. The Makefile uses a configuration file config.mk, which lets you modify
several compilation flags:
• Whether to enable support for various distributed filesystems such as HDFS and Amazon S3
• Which compiler to use
• And some more
To customize, first copy make/config.mk to the project root and then modify the copy.
Python Package Installation

The Python package is located at python-package/. There are several ways to install the package:
1. Install system-wide, which requires root permission:

cd python-package; sudo python setup.py install
You will however need the Python distutils module for this to work. It is often part of the core Python package, or it
can be installed using your package manager, e.g. the python-setuptools package on Debian.
2. Only set the environment variable PYTHONPATH to tell Python where to find the library. For example, assume
we cloned xgboost in the home directory ~; then we can add the following line to ~/.bashrc. This
option is recommended for developers who change the code frequently. The changes are reflected immediately
once you pull the code and rebuild the project (no need to call setup again).
export PYTHONPATH=~/xgboost/python-package
Building XGBoost library for Python for Windows with MinGW-w64 (Advanced)
Windows versions of Python are built with Microsoft Visual Studio. Usually Python binary modules are built with the
same compiler the interpreter is built with. However, you may not be able to use Visual Studio, for the following reasons:
1. VS is proprietary and commercial software. Microsoft provides a freeware “Community” edition, but its licens-
ing terms impose restrictions as to where and how it can be used.
2. Visual Studio contains telemetry, as documented in Microsoft Visual Studio Licensing Terms. Running software
with telemetry may be against the policy of your organization.
So you may want to build XGBoost with GCC at your own risk. This presents some difficulties because MSVC
uses the Microsoft runtime and MinGW-w64 uses its own runtime, and the runtimes have different, incompatible memory
allocators. But in fact this setup is usable if you know how to deal with it. Here is some experience.
1. The Python interpreter will crash on exit if XGBoost was used. This is usually not a big issue.
2. -O3 is OK.
3. -mtune=native is also OK.
4. Don’t use -march=native gcc flag. Using it causes the Python interpreter to crash if the DLL was actually
used.
5. You may need to bundle the runtime libraries with the library. If mingw32/bin is not in PATH, build a wheel
(python setup.py bdist_wheel), open it with an archiver and put the needed DLLs in the directory
where xgboost.dll is situated. Then you can install the wheel with pip.
R Package Installation
You can install xgboost from CRAN just like any other R package:
install.packages("xgboost")
For OSX users, a single-threaded version will be installed, so only one thread will be used for training. To enable the use
of multiple threads (and utilize the capacity of multi-core CPUs), see the section Installing R package on Mac OSX with
multi-threading to install XGBoost from source.
Make sure you have installed git and a recent C++ compiler supporting C++11 (e.g., g++-4.8 or higher). On Windows,
Rtools must be installed, and its bin directory has to be added to PATH during the installation.
Due to the use of git submodules, devtools::install_github can no longer be used to install the latest version
of the R package. Thus, one has to run git to check out the code first:

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost/R-package
R CMD INSTALL .
If the last line fails because of the error R: command not found, it means that R was not set up to run from
command line. In this case, just start R as you would normally do and run the following:
setwd('wherever/you/cloned/it/xgboost/R-package/')
install.packages('.', repos = NULL, type="source")
The package can also be built and installed with CMake (and Visual C++ 2015 on Windows) using the instructions from
Installing R package with GPU support, but without GPU support (omit the -DUSE_CUDA=ON cmake parameter).
If all fails, try Building the Shared Library to see whether the problem is specific to the R package or not.
Installing R package on Mac OSX with multi-threading

First, obtain gcc-8 with Homebrew (https://brew.sh/) to enable multi-threading (i.e. using multiple CPU threads for
training). The default Apple Clang compiler does not support OpenMP, so using the default compiler would disable
multi-threading.
Create the build/ directory and invoke CMake with option R_LIB=ON. Make sure to add CC=gcc-8
CXX=g++-8 so that Homebrew GCC is selected. After invoking CMake, you can install the R package by running
make and make install:
mkdir build
cd build
CC=gcc-8 CXX=g++-8 cmake .. -DR_LIB=ON
make -j4
make install
Installing R package with GPU support

The procedure and requirements are similar to those in Building with GPU support, so make sure to read it first.
On Linux, starting from the XGBoost directory type:
mkdir build
cd build
cmake .. -DUSE_CUDA=ON -DR_LIB=ON
make install -j
When the default target is used, an R package shared library is built in the build area. The install target,
in addition, assembles the package files with this shared library under build/R-package and runs R CMD
INSTALL.
On Windows, CMake with Visual C++ Build Tools (or Visual Studio) has to be used to build an R package with GPU
support. Rtools must also be installed (perhaps some other MinGW distributions with gendef.exe and dlltool.exe
would work, but that was not tested).
mkdir build
cd build
cmake .. -G"Visual Studio 14 2015 Win64" -DUSE_CUDA=ON -DR_LIB=ON
cmake --build . --target install --config Release
When --target xgboost is used, an R package DLL is built under build/Release. The --target
install, in addition, assembles the package files with this DLL under build/R-package and runs R CMD
INSTALL.
If cmake can’t find your R during the configuration step, you might provide the location of its executable to cmake
like this: -DLIBR_EXECUTABLE="C:/Program Files/R/R-3.4.1/bin/x64/R.exe".
If on Windows you get a "permission denied" error when trying to write to ...Program Files/R/... during the package
installation, create a .Rprofile file in your personal home directory (if you don't already have one there), and
add a line to it which specifies the location of your R packages user library via .libPaths(). You can find the exact
location by running .libPaths() in the R GUI or RStudio.
Troubleshooting

Building the documentation

XGBoost uses Sphinx for documentation. To build it locally, you need an installed XGBoost with all its dependencies,
along with:
• System dependencies
– git
– graphviz
• Python dependencies
– sphinx
– breathe
– guzzle_sphinx_theme
– recommonmark
– mock
Under xgboost/doc directory, run make <format> with <format> replaced by the format you want. For a
list of supported formats, run make help under the same directory.
1.2 Get Started with XGBoost

This is a quick start tutorial showing snippets for you to quickly try out XGBoost on the demo dataset on a binary
classification task.
1.2.2 Python
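A minimal Python quick start, following the same pattern as the R, Julia, and Scala snippets below; the file paths
assume the demo data shipped in the XGBoost repository and are illustrative only:

import xgboost as xgb

# read in the demo data (LibSVM text format)
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')

# specify parameters via a dict
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 2

# fit model
bst = xgb.train(param, dtrain, num_round)

# predict
preds = bst.predict(dtest)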
1.2.3 R
# load data
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
# fit model
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
               nrounds = 2, objective = "binary:logistic")
# predict
pred <- predict(bst, test$data)
1.2.4 Julia
using XGBoost
# read data
train_X, train_Y = readlibsvm("demo/data/agaricus.txt.train", (6513, 126))
test_X, test_Y = readlibsvm("demo/data/agaricus.txt.test", (1611, 126))
# fit model
num_round = 2
bst = xgboost(train_X, num_round, label=train_Y, eta=1, max_depth=2)
# predict
pred = predict(bst, test_X)
1.2.5 Scala
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.XGBoost
object XGBoostScalaExample {
def main(args: Array[String]) {
// read training data, available at xgboost/demo/data
val trainData =
new DMatrix("/path/to/agaricus.txt.train")
// define parameters
val paramMap = List(
"eta" -> 0.1,
"max_depth" -> 2,
"objective" -> "binary:logistic").toMap
// number of iterations
val round = 2
// train the model
val model = XGBoost.train(trainData, paramMap, round)
// run prediction
val predTrain = model.predict(trainData)
// save model to the file.
model.saveModel("/local/path/to/model")
}
}
1.3 XGBoost Tutorials

This section contains official tutorials inside the XGBoost package. See Awesome XGBoost for more resources.

Introduction to Boosted Trees
XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper
Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. This is a tutorial on gradient boosted
trees, and most of the content is based on these slides by Tianqi Chen, the original author of XGBoost.
Gradient boosted trees have been around for a while, and there are a lot of materials on the topic. This tutorial
will explain boosted trees in a self-contained and principled way using the elements of supervised learning. We think
this explanation is cleaner, more formal, and motivates the model formulation used in XGBoost.
XGBoost is used for supervised learning problems, where we use the training data (with multiple features) 𝑥𝑖 to predict
a target variable 𝑦𝑖 . Before we learn about trees specifically, let us start by reviewing the basic elements in supervised
learning.
The model in supervised learning usually refers to the mathematical structure by which the prediction $\hat{y}_i$ is made
from the input $x_i$. A common example is a linear model, where the prediction is given as $\hat{y}_i = \sum_j \theta_j x_{ij}$, a linear
combination of weighted input features. The prediction value can have different interpretations, depending on the task,
i.e., regression or classification. For example, it can be logistic transformed to get the probability of the positive class in
logistic regression, and it can also be used as a ranking score when we want to rank the outputs.
The parameters are the undetermined part that we need to learn from data. In linear regression problems, the param-
eters are the coefficients 𝜃. Usually we will use 𝜃 to denote the parameters (there are many parameters in a model, our
definition here is sloppy).
With judicious choices for $y_i$, we may express a variety of tasks, such as regression, classification, and ranking. The
task of training the model amounts to finding the best parameters $\theta$ that best fit the training data $x_i$ and labels $y_i$. In
order to train the model, we need to define the objective function to measure how well the model fits the training data.

A salient characteristic of objective functions is that they consist of two parts: training loss and regularization term:

$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $\Omega$ is the regularization term. The training loss measures how predictive our
model is with respect to the training data. A common choice of $L$ is the mean squared error, which is given by
$$L(\theta) = \sum_i (y_i - \hat{y}_i)^2$$

Another commonly used loss function is logistic loss, to be used for logistic regression:

$$L(\theta) = \sum_i \left[ y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right) \right]$$
The regularization term is what people usually forget to add. The regularization term controls the complexity of the
model, which helps us to avoid overfitting. This sounds a bit abstract, so let us consider the problem illustrated in the
following picture. You are asked to visually fit a step function given the input data points on the upper left corner of
the image. Which solution among the three do you think is the best fit?

The correct answer is marked in red. Please consider if this visually seems a reasonable fit to you. The general principle
is that we want both a simple and predictive model. The tradeoff between the two is also referred to as the bias-variance
tradeoff in machine learning.
The elements introduced above form the basic elements of supervised learning, and they are natural building blocks
of machine learning toolkits. For example, you should be able to describe the differences and commonalities between
gradient boosted trees and random forests. Understanding the process in a formalized way also helps us to understand
the objective that we are learning and the reason behind the heuristics such as pruning and smoothing.
Now that we have introduced the elements of supervised learning, let us get started with real trees. To begin with, let
us first learn about the model choice of XGBoost: decision tree ensembles. The tree ensemble model consists of a
set of classification and regression trees (CART). Here’s a simple example of a CART that classifies whether someone
will like a hypothetical computer game X.
We classify the members of a family into different leaves, and assign them the score on the corresponding leaf. A
CART is a bit different from decision trees, in which the leaf only contains decision values. In CART, a real score is
associated with each of the leaves, which gives us richer interpretations that go beyond classification. This also allows
for a principled, unified approach to optimization, as we will see in a later part of this tutorial.
Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which
sums the prediction of multiple trees together.
Here is an example of a tree ensemble of two trees. The prediction scores of each individual tree are summed up to
get the final score. If you look at the example, an important fact is that the two trees try to complement each other.
Mathematically, we can write our model in the form

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where $K$ is the number of trees, $f_k$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs. The
objective function to be optimized is given by

$$\text{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\Omega(f_k)$ is the complexity of the tree $f_k$, defined in detail later.
Now here comes a trick question: what is the model used in random forests? Tree ensembles! So random forests and
boosted trees are really the same model; the difference arises from how we train them. This means that, if you write
a predictive service for tree ensembles, you only need to write one and it should work for both random forests and
gradient boosted trees. (See Treelite for an actual example.) This is one example of why the elements of supervised
learning rock.
Tree Boosting
Now that we introduced the model, let us turn to training: how should we learn the trees? The answer is, as always
for supervised learning models: define an objective function and optimize it!
Let the following be the objective function (remember it always needs to contain training loss and regularization):

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i)$$
Additive Training
The first question we want to ask: what are the parameters of trees? You can find that what we need to learn are those
functions $f_i$, each containing the structure of the tree and the leaf scores. Learning the tree structure is much harder than
a traditional optimization problem where you can simply take the gradient. It is intractable to learn all the trees at once.
Instead, we use an additive strategy: fix what we have learned, and add one new tree at a time. Writing the prediction
value at step $t$ as $\hat{y}_i^{(t)}$, we have

$$\begin{aligned}
\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)\\
&\;\;\vdots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}$$
It remains to ask: which tree do we want at each step? A natural thing is to add the one that optimizes our objective.
$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i)\\
&= \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}
\end{aligned}$$

If we consider using mean squared error (MSE) as our loss function, the objective becomes

$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^{n} \left(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\right)^2 + \sum_{i=1}^{t} \Omega(f_i)\\
&= \sum_{i=1}^{n} \left[ 2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2 \right] + \Omega(f_t) + \text{constant}
\end{aligned}$$
The form of MSE is friendly, with a first order term (usually called the residual) and a quadratic term. For other losses
of interest (for example, logistic loss), it is not so easy to get such a nice form. So in the general case, we take the
Taylor expansion of the loss function up to the second order:
$$\text{obj}^{(t)} = \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}$$

where the $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to the previous prediction:

$$g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$$
After we remove all the constants, the specific objective at step 𝑡 becomes
$$\sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
This becomes our optimization goal for the new tree. One important advantage of this definition is that the value of the
objective function only depends on 𝑔𝑖 and ℎ𝑖 . This is how XGBoost supports custom loss functions. We can optimize
every loss function, including logistic regression and pairwise ranking, using exactly the same solver that takes 𝑔𝑖 and
ℎ𝑖 as input!
Model Complexity
We have introduced the training step, but wait, there is one important thing, the regularization term! We need to
define the complexity of the tree $\Omega(f)$. In order to do so, let us first refine the definition of a tree $f(x)$ as

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \; q: \mathbb{R}^d \rightarrow \{1, 2, \dots, T\}$$

Here $w$ is the vector of scores on leaves, $q$ is a function assigning each data point to the corresponding leaf, and $T$ is
the number of leaves. In XGBoost, we define the complexity as
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
Of course, there is more than one way to define the complexity, but this one works well in practice. The regularization
is one part most tree packages treat less carefully, or simply ignore. This was because the traditional treatment of
tree learning only emphasized improving impurity, while the complexity control was left to heuristics. By defining it
formally, we can get a better idea of what we are learning and obtain models that perform well in the wild.
Here is the magical part of the derivation. After re-formulating the tree model, we can write the objective value with
the 𝑡-th tree as:
$$\begin{aligned}
\text{obj}^{(t)} &\approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2\\
&= \sum_{j=1}^{T} \Big[ \big(\textstyle\sum_{i \in I_j} g_i\big) w_j + \frac{1}{2} \big(\textstyle\sum_{i \in I_j} h_i + \lambda\big) w_j^2 \Big] + \gamma T
\end{aligned}$$
where $I_j = \{ i \mid q(x_i) = j \}$ is the set of indices of data points assigned to the $j$-th leaf. Notice that in the second line
we have changed the index of the summation because all the data points on the same leaf get the same score. We could
further compress the expression by defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$:
$$\text{obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$
In this equation, the $w_j$ are independent of each other, the form $G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2$ is quadratic, and the
best $w_j$ for a given structure $q(x)$ and the best objective reduction we can get are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

$$\text{obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
The last equation measures how good a tree structure 𝑞(𝑥) is.
If all this sounds a bit complicated, let’s take a look at the picture, and see how the scores can be calculated. Basically,
for a given tree structure, we push the statistics 𝑔𝑖 and ℎ𝑖 to the leaves they belong to, sum the statistics together, and
use the formula to calculate how good the tree is. This score is like the impurity measure in a decision tree, except that
it also takes the model complexity into account.
Now that we have a way to measure how good a tree is, ideally we would enumerate all possible trees and pick the
best one. In practice this is intractable, so we will try to optimize one level of the tree at a time. Specifically we try to
split a leaf into two leaves, and the score it gains is
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
This formula can be decomposed as 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on
the original leaf, and 4) regularization on the additional leaf. We can see an important fact here: if the gain is smaller than
$\gamma$, we would do better not to add that branch. This is exactly the pruning technique in tree-based models! By using
the principles of supervised learning, we can naturally come up with the reason these techniques work :)
For real valued data, we usually want to search for an optimal split. To efficiently do so, we place all the instances in
sorted order, like the following picture.
A left to right scan is sufficient to calculate the structure score of all possible split solutions, and we can find the best
split efficiently.
Now that you understand what boosted trees are, you may ask, where is the introduction for XGBoost? XGBoost is
exactly a tool motivated by the formal principle introduced in this tutorial! More importantly, it is developed with both
deep consideration in terms of systems optimization and principles in machine learning. The goal of this library
is to push the extreme of the computation limits of machines to provide a scalable, portable and accurate library.
Make sure you try it out, and most importantly, contribute your piece of wisdom (code, examples, tutorials) to the
community!
The Kubeflow community provides XGBoost Operator to support distributed XGBoost training and batch prediction in
a Kubernetes cluster. It provides an easy and efficient way to train XGBoost models and run batch prediction in a
distributed fashion.
How to use
In order to run an XGBoost job in a Kubernetes cluster, carry out the following steps:
1. Install XGBoost Operator in Kubernetes.
a. XGBoost Operator is designed to manage XGBoost jobs, including job scheduling, monitoring, pods and
services recovery etc. Follow the installation guide to install XGBoost Operator.
2. Write application code to interface with the XGBoost operator.
a. You'll need to furnish a few scripts to interface with the XGBoost operator. Refer to the Iris classification
example.
b. Data reader/writer: you need to implement a data source reader and writer based on your requirements. For
example, if your data is stored in a Hive table, you have to write your own code to read/write the Hive table
based on the ID of the worker.
c. Model persistence: in this example, the model is stored in OSS storage. If you want to store your model
in Amazon S3, Google NFS or other storage, you'll need to specify the model reader and writer based
on the requirements of the storage system.
3. Configure the XGBoost job using a YAML file.
a. The YAML file is used to configure the computational resources and environment for your XGBoost job to run,
e.g. the number of workers and masters. A template YAML file is provided for reference.
4. Submit the XGBoost job to the Kubernetes cluster.
a. The kubectl command is used to submit an XGBoost job, and then you can monitor the job status.
Work in progress
DART booster

XGBoost mostly combines a huge number of regression trees with a small learning rate. In this situation, trees added
early are significant and trees added late are unimportant.

Vinayak and Gilad-Bachrach proposed a new method to add dropout techniques from the deep neural net community
to boosted trees, and reported better results in some situations.

This is an introduction to the dart tree booster.
Original paper
Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive Regression Trees.” JMLR.
Features
Because of the randomness introduced in the training, expect the following few differences:
• Training can be slower than gbtree because the random dropout prevents usage of the prediction buffer.
• Early stopping might not be stable, due to the randomness.
How it works
Parameters
The booster dart inherits gbtree booster, so it supports all parameters that gbtree does, such as eta, gamma,
max_depth etc.
Additional parameters are noted below:
• sample_type: type of sampling algorithm.
– uniform: (default) dropped trees are selected uniformly.
– weighted: dropped trees are selected in proportion to weight.
• normalize_type: type of normalization algorithm.
– tree: (default) New trees have the same weight as each of the dropped trees.

$$a\left(\sum_{i \in \mathbf{K}} F_i + \frac{1}{k} F_m\right) = a\left(\sum_{i \in \mathbf{K}} F_i + \frac{\eta}{k} \tilde{F}_m\right) \sim a\left(1 + \frac{\eta}{k}\right) D = a \frac{k + \eta}{k} D = D, \qquad a = \frac{k}{k + \eta}$$

– forest: New trees have the same weight as the sum of the dropped trees (forest).

$$a\left(\sum_{i \in \mathbf{K}} F_i + F_m\right) = a\left(\sum_{i \in \mathbf{K}} F_i + \eta \tilde{F}_m\right) \sim a(1 + \eta) D = a(1 + \eta) D = D, \qquad a = \frac{1}{1 + \eta}$$
Sample Script
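A minimal sketch of training with the dart booster from Python; the data paths are illustrative (the demo data shipped
with the XGBoost repository), and the parameter values are only examples:

import xgboost as xgb

dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')

param = {'booster': 'dart',          # use the dart booster instead of gbtree
         'max_depth': 5,
         'learning_rate': 0.1,
         'objective': 'binary:logistic',
         'sample_type': 'uniform',   # how dropped trees are selected
         'normalize_type': 'tree',   # weighting scheme described above
         'rate_drop': 0.1,           # fraction of previous trees to drop
         'one_drop': 1}              # always drop at least one tree
num_round = 50
bst = xgb.train(param, dtrain, num_round)

# pass ntree_limit so that dropout is not applied at prediction time
preds = bst.predict(dtest, ntree_limit=num_round)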
Monotonic Constraints

It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained
in some way. This may happen due to business considerations, or because of the type of scientific question being
investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality,
constraints can be used to improve the predictive performance of the model.
A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted
response: an increasing constraint requires that, with all other features held fixed, $x \leq x'$ implies
$f(x_1, \ldots, x, \ldots, x_n) \leq f(x_1, \ldots, x', \ldots, x_n)$ for the constrained feature, while a decreasing
constraint reverses the inequality.
A Simple Example
To illustrate, let’s create some simulated data with two features and a response according to the following scheme
The response generally increases with respect to the $x_1$ feature, but a sinusoidal variation has been superimposed,
resulting in the true effect being non-monotonic. For the $x_2$ feature the response is generally decreasing, again with
a sinusoidal variation superimposed.

Let's fit a boosted tree model to this data without imposing any monotonic constraints.
The black curve shows the trend inferred from the model for each feature. To make these plots the distinguished
feature 𝑥𝑖 is fed to the model over a one-dimensional grid of values, while all the other features (in this case only one
other feature) are set to their average values. We see that the model does a good job of capturing the general trend with
the oscillatory wave superimposed.
Here is the same model, but fit with monotonicity constraints.
We see the effect of the constraint. For each variable the general direction of the trend is still evident, but the oscillatory
behaviour no longer remains as it would violate our imposed constraints.
It is very simple to enforce monotonicity constraints in XGBoost. Here we will give an example using Python, but the
same general idea generalizes to other platforms.
Suppose the following code fits your model without monotonicity constraints
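As a sketch (the names params, dtrain and evallist are placeholders for your own parameter dict, training DMatrix
and evaluation list), the unconstrained baseline might look like:

import xgboost as xgb

# baseline fit without monotonicity constraints
model_no_constraints = xgb.train(params, dtrain,
                                 num_boost_round=1000,
                                 evals=evallist,
                                 early_stopping_rounds=10)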
Then fitting with monotonicity constraints only requires adding a single parameter
params_constrained = params.copy()
params_constrained['monotone_constraints'] = "(1,-1)"
In this example the training data X has two columns, and by using the parameter values (1,-1) we are telling
XGBoost to impose an increasing constraint on the first predictor and a decreasing constraint on the second.
Some other examples:
• (1,0): An increasing constraint on the first predictor and no constraint on the second.
• (0,-1): No constraint on the first predictor and a decreasing constraint on the second.
Choice of tree construction algorithm. To use monotonic constraints, be sure to set the tree_method parameter
to one of exact, hist, and gpu_hist.
Note for the ‘hist’ tree construction algorithm. If tree_method is set to either hist or gpu_hist, enabling
monotonic constraints may produce unnecessarily shallow trees. This is because the hist method reduces the number
of candidate splits to be considered at each split. Monotonic constraints may wipe out all available split candidates, in
which case no split is made. To reduce the effect, you may want to increase the max_bin parameter to consider more
split candidates.
Random Forests in XGBoost

XGBoost is normally used to train gradient-boosted decision trees and other gradient boosted models. Random forests
use the same model representation and inference as gradient-boosted decision trees, but a different training algorithm.
One can use XGBoost to train a standalone random forest, or use a random forest as a base model for gradient boosting.
Here we focus on training a standalone random forest.

We have had native APIs for training random forests since the early days, and a new Scikit-Learn wrapper was added
after 0.82 (not included in 0.82). Please note that the new Scikit-Learn wrapper is still experimental, which means we
might change the interface whenever needed.
params = {
'colsample_bynode': 0.8,
'learning_rate': 1,
'max_depth': 5,
'num_parallel_tree': 100,
'objective': 'binary:logistic',
'subsample': 0.8,
'tree_method': 'gpu_hist'
}
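A brief sketch of how a parameter set like the one above might be used with the native API: a standalone forest is grown
in a single boosting round, with num_parallel_tree controlling the forest size (tree_method is switched to hist here
so the sketch does not require a GPU, and the DMatrix path is illustrative):

import xgboost as xgb

dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')

params = {
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'binary:logistic',
    'subsample': 0.8,
    'tree_method': 'hist',
}

# one boosting round containing num_parallel_tree trees = a random forest
bst = xgb.train(params, dtrain, num_boost_round=1)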
XGBRFClassifier and XGBRFRegressor are SKL-like classes that provide random forest functionality. They
are basically versions of XGBClassifier and XGBRegressor that train random forest instead of gradient boost-
ing, and have default values and meaning of some of the parameters adjusted accordingly. In particular:
• n_estimators specifies the size of the forest to be trained; it is converted to num_parallel_tree,
instead of the number of boosting rounds
• learning_rate is set to 1 by default
• colsample_bynode and subsample are set to 0.8 by default
• booster is always gbtree
For a simple example, you can train a random forest regressor with:
import xgboost as xgb
from sklearn.model_selection import KFold

# X, y: feature matrix and labels of your dataset
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X, y):
    xgb_model = xgb.XGBRFRegressor(random_state=42).fit(
        X[train_index], y[train_index])
Note that these classes have a smaller selection of parameters compared to using train(). In particular, it is impos-
sible to combine random forests with gradient boosting using this API.
Caveats
• XGBoost uses 2nd order approximation to the objective function. This can lead to results that differ from a
random forest implementation that uses the exact value of the objective function.
• XGBoost does not perform replacement when subsampling training cases. Each training case can occur in a
subsampled set either 0 or 1 time.
Feature Interaction Constraints

The decision tree is a powerful tool to discover interactions among independent variables (features). Variables that
appear together in a traversal path are interacting with one another, since the condition of a child node is predicated on
the condition of the parent node. For example, the highlighted red path in the diagram below contains three variables:
$x_1$, $x_7$, and $x_{10}$, so the highlighted prediction (at the highlighted leaf node) is the product of interaction between $x_1$,
$x_7$, and $x_{10}$.
When the tree depth is larger than one, many variables interact on the sole basis of minimizing training loss, and the
resulting decision tree may capture a spurious relationship (noise) rather than a legitimate relationship that generalizes
across different datasets. Feature interaction constraints allow users to decide which variables are allowed to interact
and which are not.
Potential benefits include:
• Better predictive performance from focusing on interactions that work – whether through domain specific knowl-
edge or algorithms that rank interactions
• Less noise in predictions; better generalization
• More control to the user on what the model can fit. For example, the user may want to exclude some interactions
even if they perform well due to regulatory constraints
A Simple Example
Feature interaction constraints are expressed in terms of groups of variables that are allowed to interact. For example,
the constraint [0, 1] indicates that variables 𝑥0 and 𝑥1 are allowed to interact with each other but with no other
variable. Similarly, [2, 3, 4] indicates that 𝑥2 , 𝑥3 , and 𝑥4 are allowed to interact with one another but with no
other variable. A set of feature interaction constraints is expressed as a nested list, e.g. [[0, 1], [2, 3, 4]],
where each inner list is a group of indices of features that are allowed to interact with each other.
In the following diagram, the left decision tree is in violation of the first constraint ([0, 1]), whereas the right
decision tree complies with both the first and second constraints ([0, 1], [2, 3, 4]).
It is very simple to enforce feature interaction constraints in XGBoost. Here we will give an example using Python,
but the same general idea generalizes to other platforms.
Suppose the following code fits your model without feature interaction constraints:
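As a sketch (params, dtrain and num_boost_round here are placeholders for your own configuration), the
unconstrained baseline might look like:

import xgboost as xgb

# baseline fit without feature interaction constraints
model_no_constraints = xgb.train(params, dtrain,
                                 num_boost_round=1000)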
Then fitting with feature interaction constraints only requires adding a single parameter:
params_constrained = params.copy()
# Use nested list to define feature interaction constraints
params_constrained['interaction_constraints'] = '[[0, 2], [1, 3, 4], [5, 6]]'
# Features 0 and 2 are allowed to interact with each other but with no other feature
# Features 1, 3, 4 are allowed to interact with one another but with no other feature
# Features 5 and 6 are allowed to interact with each other but with no other feature
Choice of tree construction algorithm. To use feature interaction constraints, be sure to set the tree_method
parameter to one of the following: exact, hist or gpu_hist. Support for gpu_hist was added after version 0.90
(it is not available in 0.90).
Advanced topic

The intuition behind interaction constraints is simple. Users may have prior knowledge about relations between different
features, and can encode it as constraints during model construction. But there are also some subtleties around specifying
constraints. Take the constraint [[1, 2], [2, 3, 4]] as an example: the second feature appears in two different
interaction sets, [1, 2] and [2, 3, 4], so the union set of features allowed to interact with feature 2 is {1, 3, 4}.
In the following diagram, the root splits at feature 2. Because all its descendants should be able to interact with it, at the
second layer all 4 features are legitimate split candidates for further splitting, disregarding the specified constraint sets.
This leads to some interesting implications of feature interaction constraints. Take [[0, 1], [0, 1, 2],
[1, 2]] as another example. Assuming we have only 3 available features in our training datasets for presentation
purposes, careful readers might have figured out that the above constraint is the same as [[0, 1, 2]]: no matter
which feature is chosen for the split in the root node, all its descendants have to include every feature as legitimate split
candidates to avoid violating the interaction constraints.
For one last example, we use [[0, 1], [1, 3, 4]] and choose feature 0 as the split for the root node. At the second
layer of the built tree, 1 is the only legitimate split candidate except for 0 itself, since they belong to the same constraint
set. Following the grow path of our example tree below, the node at the second layer splits at feature 1. But due to the fact
that 1 also belongs to the second constraint set [1, 3, 4], at the third layer we need to include all features as candidates
to comply with its ancestors.
Text Input Format of DMatrix

XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will
describe the LibSVM format. (See this Wikipedia article for a description of the CSV format.)
For training or predicting, XGBoost takes an instance file with the format as below:
Listing 1: train.txt
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 0:1.3 1:0.3
1 0:0.01 1:0.3
0 0:0.2 1:0.3
Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and
'1.2' and '0.03' are feature values. In the binary classification case, '1' is used to indicate positive samples, and '0' is
used to indicate negative samples. We also support probability values in [0,1] as labels, to indicate the probability of
the instance being positive.
Note: all information below is applicable only to single-node version of the package. If you’d like to perform
distributed training with multiple nodes, skip to the section Embedding additional information inside LibSVM file.
For the ranking task, XGBoost supports the group input format. In the ranking task, instances are categorized into query
groups in real world scenarios. For example, in the learning-to-rank web pages scenario, the web page instances are
grouped by their queries. XGBoost requires a file that indicates the group information. For example, if the instance
file is the train.txt shown above, the group file should be named train.txt.group and be of the following
format:
Listing 2: train.txt.group
2
3
This means that the data set contains 5 instances: the first two instances are in one group and the other three are
in another group. The numbers in the group file indicate the number of instances in each group, in order. At the time
of configuration, you do not have to indicate the path of the group file. If the instance file name is xxx, XGBoost will
check whether there is a file named xxx.group in the same directory.
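When loading data through the Python package rather than the text-file convention, the same group information can be
attached to a DMatrix directly; a brief sketch (the file name is the same illustrative one as above):

import xgboost as xgb

dtrain = xgb.DMatrix('train.txt')
# group sizes, in order: first 2 instances, then the remaining 3
dtrain.set_group([2, 3])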
Instances in the training data may be assigned weights to differentiate relative importance among them. For example,
if we provide an instance weight file for the train.txt file in the example as below:
Listing 3: train.txt.weight
1
0.5
0.5
1
0.5
This means that XGBoost will put more emphasis on the first and fourth instances (i.e. the positive instances) during
training. The configuration is similar to configuring the group information. If the instance file name is xxx, XGBoost
will look for a file named xxx.weight in the same directory. If the file exists, the instance weights will be extracted
and used at the time of training.
XGBoost supports providing each instance an initial margin prediction. For example, if we have an initial prediction
using logistic regression for the train.txt file, we can create the following file:
Listing 4: train.txt.base_margin
-0.4
1.0
3.4
XGBoost will take these values as the initial margin prediction and boost from there. An important note about base_margin
is that it should be the margin prediction before transformation, so if you are doing logistic loss, you will need to put in
values before the logistic transformation. If you are using the XGBoost predictor, use pred_margin=1 to output margin
values.
Query ID Columns
This is most useful for ranking task, where the instances are grouped into query groups. You may embed query group
ID for each instance in the LibSVM file by adding a token of form qid:xx in each row:
Listing 5: train.txt
1 qid:1 101:1.2 102:0.03
0 qid:1 1:2.1 10001:300 10002:400
0 qid:2 0:1.3 1:0.3
1 qid:2 0:0.01 1:0.3
0 qid:3 0:0.2 1:0.3
1 qid:3 3:-0.1 10:-0.3
0 qid:3 6:0.2 10:0.15
Instance weights
You may specify instance weights in the LibSVM file by appending each instance label with the corresponding weight
in the form of [label]:[weight], as shown by the following example:
Listing 6: train.txt
1:1.0 101:1.2 102:0.03
0:0.5 1:2.1 10001:300 10002:400
0:0.5 0:1.3 1:0.3
1:1.0 0:0.01 1:0.3
0:0.5 0:0.2 1:0.3
where the negative instances are assigned half weights compared to the positive instances.
Notes on Parameter Tuning

Parameter tuning is a dark art in machine learning; the optimal parameters of a model can depend on many scenarios,
so it is impossible to create a comprehensive guide for doing so.

This document tries to provide some guidelines for parameters in XGBoost.
Understanding Bias-Variance Tradeoff

If you take a machine learning or statistics course, this is likely to be one of the most important concepts. When
we allow the model to get more complicated (e.g. more depth), the model has better ability to fit the training data,
resulting in a less biased model. However, such a complicated model requires more data to fit.
Most of the parameters in XGBoost are about the bias-variance tradeoff. The best model should trade the model complexity
with its predictive power carefully. The Parameters Documentation will tell you whether each parameter will make the
model more conservative or not. This can be used to help you turn the knob between a complicated model and a simple
model.
Control Overfitting
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.
There are in general two ways that you can control overfitting in XGBoost:
• The first way is to directly control model complexity.
– This includes max_depth, min_child_weight and gamma.
• The second way is to add randomness to make training robust to noise.
– This includes subsample and colsample_bytree.
– You can also reduce stepsize eta. Remember to increase num_round when you do so.
Handle Imbalanced Dataset

For common cases such as ads clickthrough logs, the dataset is extremely imbalanced. This can affect the training of
the XGBoost model, and there are two ways to improve it.
• If you care only about the overall performance metric (AUC) of your prediction
– Balance the positive and negative weights via scale_pos_weight (see the sketch after this list)
– Use AUC for evaluation
• If you care about predicting the right probability
– In such a case, you cannot re-balance the dataset
– Set parameter max_delta_step to a finite number (say 1) to help convergence
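For the scale_pos_weight suggestion above, a minimal sketch of the common heuristic of weighting by the
negative/positive ratio (y is assumed to be a NumPy array of binary labels; everything else is illustrative):

import numpy as np

# y: binary labels, 1 = positive (rare) class, 0 = negative class
ratio = float(np.sum(y == 0)) / np.sum(y == 1)

params = {
    'objective': 'binary:logistic',
    'scale_pos_weight': ratio,   # balance positive and negative weights
    'eval_metric': 'auc',        # evaluate with AUC as suggested above
}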
Using XGBoost External Memory Version

There is no big difference between using the external memory version and the in-memory version. The only difference
is the filename format.

The external memory version takes in the following filename format:
filename#cacheprefix
The filename is the normal path to the LibSVM file you want to load in, and cacheprefix is a path to a cache file that
XGBoost will use for external memory cache.

dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

You can see that there is an additional #dtrain.cache following the LibSVM file; this is the name of the cache file.
For the CLI version, simply add the cache suffix, e.g. "../data/agaricus.txt.train#dtrain.cache".
Performance Note
Distributed Version
The external memory mode naturally works in the distributed version; you can simply set the path like

data = "hdfs://path-to-data/#dtrain.cache"

XGBoost will cache the data to local storage. When you run on YARN, the current folder is temporary, so you
can directly use dtrain.cache to cache to the current folder.
Usage Note
Custom Objective and Evaluation Metric

XGBoost is designed to be an extensible library. One way to extend it is by providing our own objective function for
training and a corresponding metric for performance monitoring. This document introduces implementing a customized
elementwise evaluation metric and objective for XGBoost. Although the introduction uses Python for demonstration,
the concepts should be readily applicable to other language bindings.

Note:
• The ranking task does not support customized functions.
• The customized functions defined here are only applicable to single node training. A distributed environment
requires syncing with xgboost.rabit; the interface is subject to change and hence beyond the scope of this
tutorial.
• We also plan to re-design the interface for multi-class objectives in the future.
In the following sections, we will provide a step-by-step walkthrough of implementing the Squared Log Error (SLE)
objective function:

$$\frac{1}{2}\left[\log(pred + 1) - \log(label + 1)\right]^2$$

and its default metric Root Mean Squared Log Error (RMSLE):

$$\sqrt{\frac{1}{N}\left[\log(pred + 1) - \log(label + 1)\right]^2}$$
Although XGBoost has native support for said functions, using it for demonstration provides us the opportunity of
comparing the result from our own implementation and the one from XGBoost internal for learning purposes. After
finishing this tutorial, we should be able to provide our own functions for rapid experiments.
During model training, the objective function plays an important role: it provides gradient information, both first and
second order gradients, based on model predictions and observed data labels (or targets). Therefore, a valid objective
function should accept two inputs, namely predictions and labels. For implementing SLE, we define:
import numpy as np
import xgboost as xgb
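from typing import Tuple

# A sketch of the SLE objective described above; the helper names below are
# illustrative. The functions return the first- and second-order gradients
# of 1/2 * [log(pred + 1) - log(label + 1)]^2, element-wise.

def gradient(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
    '''Compute the gradient of squared log error.'''
    y = dtrain.get_label()
    return (np.log1p(predt) - np.log1p(y)) / (predt + 1)

def hessian(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
    '''Compute the hessian of squared log error.'''
    y = dtrain.get_label()
    return ((-np.log1p(predt) + np.log1p(y) + 1) /
            np.power(predt + 1, 2))

def squared_log(predt: np.ndarray,
                dtrain: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
    '''Squared Log Error objective: gradient and hessian for each prediction.'''
    predt[predt < -1] = -1 + 1e-6   # keep log1p well defined
    grad = gradient(predt, dtrain)
    hess = hessian(predt, dtrain)
    return grad, hess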
In the above code snippet, squared_log is the objective function we want. It accepts a numpy array predt as
model prediction, and the training DMatrix for obtaining required information, including labels and weights (not used
here). This objective is then used as a callback function for XGBoost during training by passing it as an argument to
xgb.train:
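A sketch of that call, assuming a DMatrix named dtrain built from your training data (the other parameter values are
illustrative):

xgb.train({'tree_method': 'hist', 'seed': 1994},
          dtrain=dtrain,
          num_boost_round=10,
          obj=squared_log)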
Notice that in our definition of the objective, whether we subtract the labels from the prediction or the other way
around is important. If you find the training error goes up instead of down, this might be the reason.
So after having a customized objective, we might also need a corresponding metric to monitor our model's performance.
As mentioned above, the default metric for SLE is RMSLE. Similarly, we define another callback-like function
as the new metric:
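A sketch of such a metric (the returned name 'PyRMSLE' is arbitrary; numpy, xgboost and Tuple are assumed to be
imported as above):

def rmsle(predt: np.ndarray, dtrain: xgb.DMatrix) -> Tuple[str, float]:
    '''Root mean squared log error metric.'''
    y = dtrain.get_label()
    predt[predt < -1] = -1 + 1e-6
    elements = np.power(np.log1p(y) - np.log1p(predt), 2)
    return 'PyRMSLE', float(np.sqrt(np.sum(elements) / len(y)))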
Since we are demonstrating in Python, the metric or objective need not be a function; any callable object should
suffice. Similarly to the objective function, our metric also accepts predt and dtrain as inputs, but returns the
name of the metric itself and a floating point value as the result. We pass it into XGBoost as the argument of the feval
parameter:
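A sketch of the training call with both the custom objective and the custom metric (dtrain and dtest are assumed
DMatrix objects; the parameter values are illustrative):

results = {}
xgb.train({'tree_method': 'hist', 'seed': 1994,
           'disable_default_eval_metric': 1},
          dtrain=dtrain,
          num_boost_round=10,
          obj=squared_log,
          feval=rmsle,
          evals=[(dtrain, 'dtrain'), (dtest, 'dtest')],
          evals_result=results)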
Notice that the parameter disable_default_eval_metric is used to suppress the default metric in XGBoost.
For fully reproducible source code and comparison plots, see custom_rmsle.py.
1.4 Frequently Asked Questions

XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fits into your memory
(this usually means millions of instances). If you are running out of memory, check out the external memory version or
the distributed version of XGBoost.
The distributed version of XGBoost is designed to be portable to various environments. Distributed XGBoost can be
ported to any platform that supports rabit. You can directly run XGBoost on YARN. In theory Mesos and other resource
allocation engines can be easily supported as well.
The first fact we need to know is going distributed does not necessarily solve all the problems. Instead, it creates more
problems such as more communication overhead and fault tolerance. The ultimate question will still come back to how
to push the limit of each computation node and use less resources to complete the task (thus with less communication
and chance of failure).
To achieve this, we decided to reuse the optimizations in the single node XGBoost and build the distributed version on
top of it. The demand for communication in machine learning is rather simple, in the sense that we can depend on a
limited set of APIs (in our case rabit). Such a design allows us to reuse most of the code, while being portable to major
platforms such as Hadoop/Yarn, MPI, SGE. Most importantly, it pushes the limit of the computation resources we can
use.
The model and data format of XGBoost are exchangeable, which means a model trained in one language can be
loaded in another. This means you can train the model using R, while running prediction using Java or C++, which are
more common in production systems. You can also train the model using distributed versions, and load them from
Python to do some interactive analysis.
XGBoost supports missing value by default. In tree algorithms, branch directions for missing values are learned during
training. Note that the gblinear booster treats missing values as zeros.
1.4.9 Slightly different result between runs

This could happen, due to non-determinism in floating point summation order and multi-threading, though the general
accuracy will usually remain the same.
1.4.10 Why do I see different results with sparse and dense data?
“Sparse” elements are treated as if they were “missing” by the tree booster, and as zeros by the linear booster. For tree
models, it is important to use consistent data formats during training and scoring.
1.5 XGBoost GPU Support

This page contains information about GPU algorithms supported in XGBoost. To install GPU support, check out the
Installation Guide.

Tree construction (training) and prediction can be accelerated with CUDA-capable GPUs.
Usage
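A brief sketch of enabling GPU acceleration from Python using the parameters documented below (values are
illustrative; a CUDA-capable GPU and a GPU-enabled build are assumed):

import xgboost as xgb

params = {
    'tree_method': 'gpu_hist',      # GPU-accelerated histogram algorithm
    'gpu_id': 0,                    # device ordinal; defaults to the first GPU
    'objective': 'binary:logistic',
}

# dtrain: any DMatrix; path shown only as an example
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
bst = xgb.train(params, dtrain, num_boost_round=10)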
Algorithms
tree_method Description
gpu_exact The standard XGBoost tree construction algorithm. Performs exact search for splits. Slower and
(deprecated) uses considerably more memory than gpu_hist.
gpu_hist Equivalent to the XGBoost fast histogram algorithm. Much faster and uses considerably less
memory. NOTE: Will run very slowly on GPUs older than Pascal architecture.
Supported parameters
GPU accelerated prediction is enabled by default for the above mentioned tree_method parameters but can be
switched to CPU prediction by setting predictor to cpu_predictor. This could be useful if you want to
conserve GPU memory. Likewise when using CPU algorithms, GPU accelerated prediction can be enabled by setting
predictor to gpu_predictor.
The experimental parameter single_precision_histogram can be set to True to enable building histograms
using single precision. This may improve speed, in particular on older architectures.
The device ordinal (which GPU to use if you have many of them) can be selected using the gpu_id parameter, which
defaults to 0 (the first device reported by CUDA runtime).
The GPU algorithms currently work with CLI, Python and R packages. See Installation Guide for details.
Note: Single node multi-GPU training with n_gpus parameter is deprecated after 0.90. Please use distributed GPU
training with one process per GPU.
XGBoost supports fully distributed GPU training using Dask. See Python documentation Dask API and worked
examples here.
Objective functions
Most of the objective functions implemented in XGBoost can be run on GPU. Following table shows current support
status.
Objectives will run on the GPU if a GPU updater (gpu_hist) is used; otherwise they will run on the CPU by default.
For unsupported objectives, XGBoost will fall back to the CPU implementation by default.
Metric functions
Following table shows current support status for evaluation metrics on the GPU.
Similar to objective functions, default device for metrics is selected based on tree updater and predictor (which is
selected based on tree updater).
Benchmarks
python tests/benchmark/benchmark.py
Training time on 1,000,000 rows x 50 columns with 500 boosting iterations and a 0.25/0.75 test/train split on an
i7-6700K CPU @ 4.00GHz and a Pascal Titan X yields the following results:
See GPU Accelerated XGBoost and Updates to the XGBoost GPU algorithms for additional performance benchmarks
of the gpu_exact and gpu_hist tree methods.
Developer notes
The application may be profiled with annotations by specifying USE_NTVX to cmake and providing the path to
the stand-alone nvtx header via NVTX_HEADER_DIR. Regions covered by the ‘Monitor’ class in cuda code will
automatically appear in the nsight profiler.
1.5.2 References
Mitchell R, Frank E. (2017) Accelerating the XGBoost algorithm using GPU computing. PeerJ Computer Science
3:e127 https://doi.org/10.7717/peerj-cs.127
Nvidia Parallel Forall: Gradient Boosting, Decision Trees and XGBoost with CUDA
Contributors
1.6 XGBoost Parameters

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task
parameters.
• General parameters relate to which booster we are using to do boosting, commonly tree or linear model
• Booster parameters depend on which booster you have chosen
• Learning task parameters decide on the learning scenario. For example, regression tasks may use different
parameters than ranking tasks.
• Command line parameters relate to behavior of CLI version of XGBoost.
• General Parameters
– Parameters for Tree Booster
– Additional parameters for Dart Booster (booster=dart)
– Parameters for Linear Booster (booster=gblinear)
– Parameters for Tweedie Regression (objective=reg:tweedie)
• Learning Task Parameters
• Command Line Parameters
– colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for
every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the
current tree.
– colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once
every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current
level.
– colsample_by* parameters work cumulatively. For instance, the combination
{'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with
64 features will leave 8 features to choose from at each split.
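As a sketch, the example above translates directly into a parameter dictionary (the data here is random and hypothetical; only the colsample_by* names are taken from the text):

import numpy as np
import xgboost as xgb

# 64 hypothetical features; 64 * 0.5 * 0.5 * 0.5 = 8 features remain available at each split.
X = np.random.rand(100, 64)
y = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'colsample_bytree': 0.5,
    'colsample_bylevel': 0.5,
    'colsample_bynode': 0.5,
    'objective': 'binary:logistic',
}
bst = xgb.train(params, dtrain, num_boost_round=5)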
• lambda [default=1, alias: reg_lambda]
– L2 regularization term on weights. Increasing this value will make model more conservative.
• alpha [default=0, alias: reg_alpha]
– L1 regularization term on weights. Increasing this value will make model more conservative.
• tree_method string [default= auto]
– The tree construction algorithm used in XGBoost. See the description in the reference paper.
– XGBoost supports hist and approx for distributed training and only supports approx for the external
memory version.
– Choices: auto, exact, approx, hist, gpu_hist
– A comma-separated string defining the sequence of tree updaters to run, providing a modular way to
construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending
on some other parameters. However, it can also be set explicitly by a user. The following updater plugins
exist:
* prune: prunes the splits where loss < min_split_loss (or gamma).
– In a distributed setting, the implicit updater sequence value would be adjusted to grow_histmaker,
prune by default, and you can set tree_method as hist to use grow_histmaker.
• refresh_leaf [default=1]
– This is a parameter of the refresh updater plugin. When this flag is 1, tree leaves as well as tree node
stats are updated. When it is 0, only node stats are updated.
• process_type [default= default]
– A type of boosting process to run.
– Choices: default, update
– Increasing this number improves the optimality of splits at the cost of higher computation time.
• predictor [default= cpu_predictor]
– The type of predictor algorithm to use. Provides the same results but allows the use of GPU or CPU.
• normalize_type [default= tree]
* tree: new trees have the same weight as each of the dropped trees.
· The weight of new trees is 1 / (k + learning_rate).
· Dropped trees are scaled by a factor of k / (k + learning_rate).
* forest: new trees have the same weight as the sum of the dropped trees (forest).
· The weight of new trees is 1 / (1 + learning_rate).
· Dropped trees are scaled by a factor of 1 / (1 + learning_rate).
• rate_drop [default=0.0]
– Dropout rate (a fraction of previous trees to drop during the dropout).
– range: [0.0, 1.0]
• one_drop [default=0]
– When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one
or epsilon-dropout from the original DART paper).
• skip_drop [default=0.0]
– Probability of skipping the dropout procedure during a boosting iteration.
* If a dropout is skipped, new trees are added in the same manner as gbtree.
* Note that non-zero skip_drop has higher priority than rate_drop or one_drop.
– range: [0.0, 1.0]
* shotgun: Parallel coordinate descent algorithm based on the shotgun algorithm. Uses 'hogwild' paral-
lelism and therefore produces a nondeterministic solution on each run.
* coord_descent: Ordinary coordinate descent algorithm. Also multithreaded, but still produces a
deterministic solution.
• feature_selector [default= cyclic]
– Feature selection and ordering method
* thrifty: Thrifty, approximately-greedy feature selector. Prior to cyclic updates, reorders features
in descending magnitude of their univariate weight changes. This operation is multithreaded and is a
linear complexity approximation of the quadratic greedy selection. It allows restricting the selection
to top_k features per group with the largest magnitude of univariate weight change, by setting the
top_k parameter.
• top_k [default=0]
– The number of top features to select in the greedy and thrifty feature selectors. The value of 0 means
using all the features.
• tweedie_variance_power [default=1.5]
– Parameter that controls the variance of the Tweedie distribution var(y) ~
E(y)^tweedie_variance_power
– range: (1,2)
– Set closer to 2 to shift towards a gamma distribution
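A hedged sketch of a Tweedie regression setup (the data is synthetic and purely illustrative; reg:tweedie and tweedie_variance_power are the parameters described above):

import numpy as np
import xgboost as xgb

# Hypothetical non-negative targets, e.g. claim amounts.
X = np.random.rand(500, 10)
y = np.random.gamma(shape=2.0, scale=1.0, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'reg:tweedie',
    'tweedie_variance_power': 1.5,  # between 1 (Poisson-like) and 2 (gamma-like)
}
bst = xgb.train(params, dtrain, num_boost_round=10)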
Specify the learning task and the corresponding learning objective. The objective options are below:
• objective [default=reg:squarederror]
– reg:squarederror: regression with squared loss.
– reg:squaredlogerror: regression with squared log loss 1/2 * [log(pred + 1) - log(label + 1)]^2. All input
labels are required to be greater than -1. Also, see metric rmsle for a possible issue with this objective.
– reg:logistic: logistic regression
– binary:logistic: logistic regression for binary classification, output probability
– binary:logitraw: logistic regression for binary classification, output score before logistic transfor-
mation
– binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than
producing probabilities.
– count:poisson: Poisson regression for count data, output mean of Poisson distribution
* error@t: a binary classification threshold different from the default of 0.5 can be specified by providing a
numerical value through 't'.
* ndcg-, map-, ndcg@n-, map@n-: In XGBoost, NDCG and MAP evaluate the score of a list
without any positive samples as 1. By adding '-' to the evaluation metric name, XGBoost will evaluate these
scores as 0 to be consistent under some conditions.
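A small sketch of how such metric strings are passed (the surrounding parameter values are hypothetical):

param = {'objective': 'binary:logistic'}
# Classify as positive when the predicted probability exceeds 0.7 instead of 0.5.
param['eval_metric'] = 'error@0.7'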
The following parameters are only used in the console version of XGBoost
• num_round
– The number of rounds for boosting
• data
This page contains links to all the Python-related documents on the Python package. To install the package,
check out the Installation Guide.
1.7.1 Contents
Install XGBoost
Data Interface
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
# label_column specifies the index of the column containing the true label
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
Currently, the DMLC data parser cannot parse CSV files with headers. Use Pandas (see below) to read CSV
files with headers.
• Saving DMatrix into an XGBoost binary file will make loading faster:
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
When performing ranking tasks, the number of weights should be equal to the number of groups.
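Putting the pieces above together, a self-contained sketch (with random data standing in for a real dataset) could look like this:

import numpy as np
import xgboost as xgb

# 5 rows x 10 features of hypothetical data; -999.0 marks missing entries.
data = np.random.rand(5, 10)
label = np.random.randint(2, size=5)
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)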
Setting Parameters
XGBoost can use either a list of pairs or a dictionary to set parameters. For instance:
• Booster parameters
# alternatively:
# plst = param.items()
# plst += [('eval_metric', 'ams@0')]
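A hedged sketch of a typical parameter setup used in the training step below (the specific values are illustrative only, and dtrain/dtest are assumed to be existing DMatrix objects):

# Booster parameters as a dictionary ...
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'

# ... and evaluation sets as (DMatrix, name) pairs.
evallist = [(dtest, 'eval'), (dtrain, 'train')]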
Training
num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)
bst.save_model('0001.model')
The model and its feature map can also be dumped to a text file.
# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')
Methods including update and boost from xgboost.Booster are designed for internal usage only. The wrapper function
xgboost.train does some pre-configuration including setting up caches and some other parameters.
Early Stopping
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping
requires at least one set in evals. If there’s more than one, it will use the last.
The model will train until the validation score stops improving. Validation error needs to decrease at least every
early_stopping_rounds to continue training.
If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration
and bst.best_ntree_limit. Note that xgboost.train() will return a model from the last iteration, not the
best one.
This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). Note that if
you specify more than one evaluation metric the last one in param['eval_metric'] is used for early stopping.
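For example, a sketch re-using the param, dtrain and evallist objects from above (early_stopping_rounds is an argument of xgboost.train):

bst = xgb.train(param, dtrain, num_boost_round=100, evals=evallist,
                early_stopping_rounds=10)
print(bst.best_score, bst.best_iteration, bst.best_ntree_limit)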
Prediction
A model that has been trained or loaded can perform predictions on data sets.
If early stopping is enabled during training, you can get predictions from the best iteration with bst.
best_ntree_limit:
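For instance (assuming bst was trained with early stopping and dtest is a DMatrix):

ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)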
Plotting
You can use the plotting module to plot feature importance and the output tree.
To plot importance, use xgboost.plot_importance(). This function requires matplotlib to be installed.
xgb.plot_importance(bst)
To plot the output tree via matplotlib, use xgboost.plot_tree(), specifying the ordinal number of the target
tree. This function requires graphviz and matplotlib.
xgb.plot_tree(bst, num_trees=2)
When you use IPython, you can use the xgboost.to_graphviz() function, which converts the target tree to
a graphviz instance. The graphviz instance is automatically rendered in IPython.
xgb.to_graphviz(bst, num_trees=2)
This page gives the Python API reference of xgboost; please also refer to the Python Package Introduction for more
information about the Python package.
• data (string/numpy.array/scipy.sparse/pd.DataFrame/dt.Frame) –
Data source of DMatrix. When data is of string type, it represents the path to a libsvm format
txt file or a binary file that xgboost can read from.
• label (list or numpy 1-D array, optional) – Label of the training data.
• missing (float, optional) – Value in the data which needs to be present as a miss-
ing value. If None, defaults to np.nan.
• weight (list or numpy 1-D array , optional) – Weight for each instance.
Parameters
• field (str) – The field name of the information
• data (numpy array) – The array of data to be set
set_group(group)
Set group size of DMatrix (used for ranking).
Parameters group (array like) – Group size of each group
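For example, a sketch for a hypothetical ranking DMatrix of 60 rows whose first 20 rows belong to query 1, the next 20 to query 2, and the last 20 to query 3:

dtrain.set_group([20, 20, 20])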
set_label(label)
Set label of dmatrix
Parameters label (array like) – The label information to be set into DMatrix
set_label_npy2d(label)
Set label of dmatrix
Parameters label (array like) – The label information to be set into DMatrix from
numpy 2D array
set_uint_info(field, data)
Set uint type property into the DMatrix.
Parameters
• field (str) – The field name of the information
• data (numpy array) – The array of data to be set
set_weight(weight)
Set weight of each instance.
Parameters weight (array like) – Weight for each data point
set_weight_npy2d(weight)
Set weight of each instance for numpy 2D array
Parameters weight (array like) – Weight for each data point in numpy 2D array
slice(rindex, allow_groups=False)
Slice the DMatrix and return a new DMatrix that only contains rindex.
Parameters
• rindex (list) – List of indices to be selected.
• allow_groups (boolean) – Allow slicing of a matrix with a groups attribute
Returns res – A new DMatrix containing only selected indices.
Return type DMatrix
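For example (dtrain is a hypothetical DMatrix):

# Keep only the first three rows.
dsub = dtrain.slice([0, 1, 2])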
get_score(fmap='', importance_type='weight')
Get feature importance of each feature. Importance type can be defined as:
• ‘weight’: the number of times a feature is used to split the data across all trees.
• ‘gain’: the average gain across all splits the feature is used in.
• ‘cover’: the average coverage across all splits the feature is used in.
• ‘total_gain’: the total gain across all splits the feature is used in.
• ‘total_cover’: the total coverage across all splits the feature is used in.
Parameters
• fmap (str (optional)) – The name of feature map file.
• importance_type (str, default 'weight') – One of the importance types
defined above.
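For example (bst is a hypothetical trained Booster):

# Feature importance ranked by average gain per split.
scores = bst.get_score(importance_type='gain')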
Parameters
• data (DMatrix) – The dmatrix storing the input.
• output_margin (bool) – Whether to output the raw untransformed margin value.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all
trees).
• pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample,
ntrees) with each record indicating the predicted leaf index of each sample in each tree.
Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1
and tree 0.
• pred_contribs (bool) – When this is True the output will be a matrix of size (nsam-
ple, nfeats + 1) with each record indicating the feature contributions (SHAP values) for
that prediction. The sum of all feature contributions is equal to the raw untransformed
margin value of the prediction. Note the final column is the bias term.
• approx_contribs (bool) – Approximate the contributions of each feature
• pred_interactions (bool) – When this is True the output will be a matrix of size
(nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of
features. The sum of each row (or column) of the interaction values equals the corre-
sponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the
raw untransformed margin value of the prediction. Note the last row and column corre-
spond to the bias term.
• validate_features (bool) – When this is True, validate that the Booster’s and
data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the
same.
Returns prediction
Return type numpy array
save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost
interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved. To
preserve all attributes, pickle the Booster object.
Parameters fname (string) – Output file name
save_rabit_checkpoint()
Save the current booster to rabit checkpoint.
save_raw()
Save the model to an in-memory buffer representation.
Returns
Learning API
Returns Booster
Return type a trained booster model
xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)
Cross-validation with given parameters.
Parameters
• params (dict) – Booster params.
• dtrain (DMatrix) – Data to be trained.
• num_boost_round (int) – Number of boosting iterations.
• nfold (int) – Number of folds in CV.
Scikit-Learn API
Note: A custom objective function can be provided for the objective parameter. In this case, it should have
the signature objective(y_true, y_pred) -> grad, hess:
y_true: array_like of shape [n_samples] The target values
y_pred: array_like of shape [n_samples] The predicted values
grad: array_like of shape [n_samples] The value of the gradient for each sample point.
hess: array_like of shape [n_samples] The value of the second derivative for each sample point
apply(X, ntree_limit=0)
Return the predicted leaf for every tree for each sample.
Parameters
• X (array_like, shape=[n_samples, n_features]) – Input features matrix.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all
trees).
Returns X_leaves – For each datapoint x in X and for each tree, return the index of the leaf
x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly
with gaps in the numbering.
Return type array_like, shape=[n_samples, n_trees]
property coef_
Coefficients property
Returns coef_
Return type array of shape [n_features] or [n_classes, n_features]
evals_result()
Return the evaluation results.
If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all
passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the
eval_metrics passed to the fit function.
Returns evals_result
Return type dictionary
Example
clf = xgb.XGBModel(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
property feature_importances_
Feature importances property
Returns feature_importances_
Return type array of shape [n_features]
get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called
Returns booster
Return type a xgboost booster of underlying model
get_num_boosting_rounds()
Gets the number of xgboost boosting rounds.
get_params(deep=False)
Get parameters.
get_xgb_params()
Get xgboost type parameters.
property intercept_
Intercept (bias) property
Returns intercept_
Return type array of shape (1,) or [n_classes]
load_model(fname)
Load the model from a file.
The model is loaded from an XGBoost internal binary format which is universal among the various XG-
Boost interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be
loaded. Label encodings (text labels to numeric labels) will be also lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string or a memory buffer) – Input file name or memory
buffer (see also save_raw)
predict(data, output_margin=False, ntree_limit=None, validate_features=True)
Predict with data.
Parameters
• data (numpy.array/scipy.sparse) – Data to predict with
• output_margin (bool) – Whether to output the raw untransformed margin value.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to
best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use
all trees).
• validate_features (bool) – When this is True, validate that the Booster’s and
data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the
same.
Returns prediction
Return type numpy array
save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost
interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be saved.
Label encodings (text labels to numeric labels) will also be lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string) – Output file name
set_params(**params)
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This
allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid
search. Returns: self
class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
Bases: xgboost.sklearn.XGBModel, object
Implementation of the scikit-learn API for XGBoost classification.
Parameters
• max_depth (int) – Maximum tree depth for base learners.
• learning_rate (float) – Boosting learning rate (xgb’s “eta”)
• n_estimators (int) – Number of trees to fit.
• verbosity (int) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
• silent (boolean) – Whether to print messages while running boosting. Deprecated.
Use verbosity instead.
• objective (string or callable) – Specify the learning task and the correspond-
ing learning objective or a custom objective function to be used (see note below).
• booster (string) – Specify which booster to use: gbtree, gblinear or dart.
• nthread (int) – Number of parallel threads used to run xgboost. (Deprecated, please use
n_jobs)
• n_jobs (int) – Number of parallel threads used to run xgboost. (replaces nthread)
• gamma (float) – Minimum loss reduction required to make a further partition on a leaf
node of the tree.
• min_child_weight (int) – Minimum sum of instance weight(hessian) needed in a
child.
• max_delta_step (int) – Maximum delta step we allow each tree’s weight estimation
to be.
• subsample (float) – Subsample ratio of the training instance.
• colsample_bytree (float) – Subsample ratio of columns when constructing each
tree.
• colsample_bylevel (float) – Subsample ratio of columns for each level.
• colsample_bynode (float) – Subsample ratio of columns for each split.
• reg_alpha (float (xgb's alpha)) – L1 regularization term on weights
• reg_lambda (float (xgb's lambda)) – L2 regularization term on weights
• scale_pos_weight (float) – Balancing of positive and negative weights.
• base_score – The initial prediction score of all instances, global bias.
• seed (int) – Random number seed. (Deprecated, please use random_state)
• random_state (int) – Random number seed. (replaces seed)
• missing (float, optional) – Value in the data which needs to be present as a miss-
ing value. If None, defaults to np.nan.
• importance_type (string, default "gain") – The feature importance type for
the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “to-
tal_cover”.
• **kwargs (dict, optional) – Keyword arguments for XGBoost Booster object.
Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/
blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and
**kwargs dict simultaneously will result in a TypeError.
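A quick usage sketch (the scikit-learn dataset below is used purely for illustration):

from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = XGBClassifier(max_depth=3, n_estimators=100, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict(X[:5]))        # class labels
print(clf.predict_proba(X[:5]))  # class probabilities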
Note: A custom objective function can be provided for the objective parameter. In this case, it should have
the signature objective(y_true, y_pred) -> grad, hess:
y_true: array_like of shape [n_samples] The target values
y_pred: array_like of shape [n_samples] The predicted values
grad: array_like of shape [n_samples] The value of the gradient for each sample point.
hess: array_like of shape [n_samples] The value of the second derivative for each sample point
apply(X, ntree_limit=0)
Return the predicted leaf for every tree for each sample.
Parameters
• X (array_like, shape=[n_samples, n_features]) – Input features matrix.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all
trees).
Returns X_leaves – For each datapoint x in X and for each tree, return the index of the leaf
x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly
with gaps in the numbering.
Return type array_like, shape=[n_samples, n_trees]
property coef_
Coefficients property
Returns coef_
Return type array of shape [n_features] or [n_classes, n_features]
evals_result()
Return the evaluation results.
If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all
passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the
eval_metrics passed to the fit function.
Returns evals_result
Return type dictionary
Example
clf = xgb.XGBClassifier(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
property feature_importances_
Feature importances property
Returns feature_importances_
Return type array of shape [n_features]
get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called
Returns booster
Return type a xgboost booster of underlying model
get_num_boosting_rounds()
Gets the number of xgboost boosting rounds.
get_params(deep=False)
Get parameters.
get_xgb_params()
Get xgboost type parameters.
property intercept_
Intercept (bias) property
Returns intercept_
Return type array of shape (1,) or [n_classes]
load_model(fname)
Load the model from a file.
The model is loaded from an XGBoost internal binary format which is universal among the various XG-
Boost interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be
loaded. Label encodings (text labels to numeric labels) will be also lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string or a memory buffer) – Input file name or memory
buffer (see also save_raw)
predict(data, output_margin=False, ntree_limit=None, validate_features=True)
Predict with data.
Parameters
• data (DMatrix) – The dmatrix storing the input.
• output_margin (bool) – Whether to output the raw untransformed margin value.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to
best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use
all trees).
• validate_features (bool) – When this is True, validate that the Booster’s and
data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the
same.
Returns prediction
Return type numpy array
Parameters
• data (DMatrix) – The dmatrix storing the input.
save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost
interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be saved.
Label encodings (text labels to numeric labels) will also be lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string) – Output file name
set_params(**params)
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This
allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid
search. Returns: self
class xgboost.XGBRanker(max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='rank:pairwise', booster='gbtree', n_jobs=-1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
Bases: xgboost.sklearn.XGBModel
Implementation of the Scikit-Learn API for XGBoost Ranking.
Parameters
• max_depth (int) – Maximum tree depth for base learners.
• learning_rate (float) – Boosting learning rate (xgb’s “eta”)
• n_estimators (int) – Number of boosted trees to fit.
• verbosity (int) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
• silent (boolean) – Whether to print messages while running boosting. Deprecated.
Use verbosity instead.
• objective (string) – Specify the learning task and the corresponding learning objec-
tive. The objective name must start with “rank:”.
• booster (string) – Specify which booster to use: gbtree, gblinear or dart.
• nthread (int) – Number of parallel threads used to run xgboost. (Deprecated, please use
n_jobs)
• n_jobs (int) – Number of parallel threads used to run xgboost. (replaces nthread)
• gamma (float) – Minimum loss reduction required to make a further partition on a leaf
node of the tree.
Note: A custom objective function is currently not supported by XGBRanker. Likewise, a custom metric
function is not supported either.
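A minimal sketch (the toy data and group sizes are hypothetical; fit receives the per-query group sizes through the group argument):

import numpy as np
from xgboost import XGBRanker

# Two hypothetical query groups of sizes 3 and 2 (5 documents in total).
X = np.random.rand(5, 10)
y = np.array([2, 1, 0, 1, 0])  # relevance labels
ranker = XGBRanker(objective='rank:pairwise', n_estimators=10)
ranker.fit(X, y, group=[3, 2])
scores = ranker.predict(X)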
apply(X, ntree_limit=0)
Return the predicted leaf for every tree for each sample.
Parameters
• X (array_like, shape=[n_samples, n_features]) – Input features matrix.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all
trees).
Returns X_leaves – For each datapoint x in X and for each tree, return the index of the leaf
x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly
with gaps in the numbering.
Return type array_like, shape=[n_samples, n_trees]
property coef_
Coefficients property
Returns coef_
Return type array of shape [n_features] or [n_classes, n_features]
evals_result()
Return the evaluation results.
If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all
passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the
eval_metrics passed to the fit function.
Returns evals_result
Return type dictionary
Example
clf = xgb.XGBModel(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
property feature_importances_
Feature importances property
Returns feature_importances_
Return type array of shape [n_features]
• eval_set (list, optional) – A list of (X, y) tuple pairs to use as validation sets,
for which metrics will be computed. Validation metrics will help us track the performance
of the model.
• sample_weight_eval_set (list, optional) – A list of the form [L_1, L_2,
. . . , L_n], where each L_i is a list of group weights on the i-th validation set.
get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called
Returns booster
Return type a xgboost booster of underlying model
get_num_boosting_rounds()
Gets the number of xgboost boosting rounds.
get_params(deep=False)
Get parameters.
get_xgb_params()
Get xgboost type parameters.
property intercept_
Intercept (bias) property
Returns intercept_
Return type array of shape (1,) or [n_classes]
load_model(fname)
Load the model from a file.
The model is loaded from an XGBoost internal binary format which is universal among the various XG-
Boost interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be
loaded. Label encodings (text labels to numeric labels) will be also lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string or a memory buffer) – Input file name or memory
buffer (see also save_raw)
predict(data, output_margin=False, ntree_limit=0, validate_features=True)
Predict with data.
Parameters
• data (numpy.array/scipy.sparse) – Data to predict with
• output_margin (bool) – Whether to output the raw untransformed margin value.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to
best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use
all trees).
• validate_features (bool) – When this is True, validate that the Booster’s and
data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the
same.
Returns prediction
Return type numpy array
save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost
interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be saved.
Label encodings (text labels to numeric labels) will also be lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string) – Output file name
set_params(**params)
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This
allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid
search. Returns: self
Note: A custom objective function can be provided for the objective parameter. In this case, it should have
the signature objective(y_true, y_pred) -> grad, hess:
y_true: array_like of shape [n_samples] The target values
y_pred: array_like of shape [n_samples] The predicted values
grad: array_like of shape [n_samples] The value of the gradient for each sample point.
hess: array_like of shape [n_samples] The value of the second derivative for each sample point
apply(X, ntree_limit=0)
Return the predicted leaf for every tree for each sample.
Parameters
• X (array_like, shape=[n_samples, n_features]) – Input features matrix.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all
trees).
Returns X_leaves – For each datapoint x in X and for each tree, return the index of the leaf
x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly
with gaps in the numbering.
Return type array_like, shape=[n_samples, n_trees]
property coef_
Coefficients property
Returns coef_
Return type array of shape [n_features] or [n_classes, n_features]
evals_result()
Return the evaluation results.
If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all
passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the
eval_metrics passed to the fit function.
Returns evals_result
Return type dictionary
Example
clf = xgb.XGBModel(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
property feature_importances_
Feature importances property
Returns feature_importances_
Return type array of shape [n_features]
get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called
Returns booster
Return type a xgboost booster of underlying model
get_num_boosting_rounds()
Gets the number of xgboost boosting rounds.
get_params(deep=False)
Get parameters.
get_xgb_params()
Get xgboost type parameters.
property intercept_
Intercept (bias) property
Returns intercept_
Return type array of shape (1,) or [n_classes]
load_model(fname)
Load the model from a file.
The model is loaded from an XGBoost internal binary format which is universal among the various XG-
Boost interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be
loaded. Label encodings (text labels to numeric labels) will be also lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string or a memory buffer) – Input file name or memory
buffer (see also save_raw)
predict(data, output_margin=False, ntree_limit=None, validate_features=True)
Predict with data.
Parameters
• data (numpy.array/scipy.sparse) – Data to predict with
• output_margin (bool) – Whether to output the raw untransformed margin value.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to
best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use
all trees).
• validate_features (bool) – When this is True, validate that the Booster’s and
data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the
same.
Returns prediction
Return type numpy array
save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost
interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be saved.
Label encodings (text labels to numeric labels) will also be lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string) – Output file name
set_params(**params)
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This
allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid
search. Returns: self
class xgboost.XGBRFClassifier(max_depth=3, learning_rate=1, n_estimators=100, verbosity=1, silent=None, objective='binary:logistic', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=0.8, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=0.8, reg_alpha=0, reg_lambda=1e-05, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
Bases: xgboost.sklearn.XGBClassifier
Experimental implementation of the scikit-learn API for XGBoost random forest classification.
Parameters
• max_depth (int) – Maximum tree depth for base learners.
• learning_rate (float) – Boosting learning rate (xgb’s “eta”)
• n_estimators (int) – Number of trees to fit.
• verbosity (int) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
• silent (boolean) – Whether to print messages while running boosting. Deprecated.
Use verbosity instead.
• objective (string or callable) – Specify the learning task and the correspond-
ing learning objective or a custom objective function to be used (see note below).
• booster (string) – Specify which booster to use: gbtree, gblinear or dart.
• nthread (int) – Number of parallel threads used to run xgboost. (Deprecated, please use
n_jobs)
• n_jobs (int) – Number of parallel threads used to run xgboost. (replaces nthread)
• gamma (float) – Minimum loss reduction required to make a further partition on a leaf
node of the tree.
• min_child_weight (int) – Minimum sum of instance weight(hessian) needed in a
child.
• max_delta_step (int) – Maximum delta step we allow each tree’s weight estimation
to be.
• subsample (float) – Subsample ratio of the training instance.
• colsample_bytree (float) – Subsample ratio of columns when constructing each
tree.
• colsample_bylevel (float) – Subsample ratio of columns for each level.
• colsample_bynode (float) – Subsample ratio of columns for each split.
• reg_alpha (float (xgb's alpha)) – L1 regularization term on weights
• reg_lambda (float (xgb's lambda)) – L2 regularization term on weights
• scale_pos_weight (float) – Balancing of positive and negative weights.
• base_score – The initial prediction score of all instances, global bias.
• seed (int) – Random number seed. (Deprecated, please use random_state)
• random_state (int) – Random number seed. (replaces seed)
• missing (float, optional) – Value in the data which needs to be present as a miss-
ing value. If None, defaults to np.nan.
• importance_type (string, default "gain") – The feature importance type for
the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “to-
tal_cover”.
• **kwargs (dict, optional) – Keyword arguments for XGBoost Booster object.
Full documentation of parameters can be found here: https://github.com/dmlc/xgboost/
blob/master/doc/parameter.rst. Attempting to set a parameter via the constructor args and
**kwargs dict simultaneously will result in a TypeError.
Note: A custom objective function can be provided for the objective parameter. In this case, it should have
the signature objective(y_true, y_pred) -> grad, hess:
y_true: array_like of shape [n_samples] The target values
y_pred: array_like of shape [n_samples] The predicted values
grad: array_like of shape [n_samples] The value of the gradient for each sample point.
hess: array_like of shape [n_samples] The value of the second derivative for each sample point
apply(X, ntree_limit=0)
Return the predicted leaf for every tree for each sample.
Parameters
• X (array_like, shape=[n_samples, n_features]) – Input features matrix.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all
trees).
Returns X_leaves – For each datapoint x in X and for each tree, return the index of the leaf
x ends up in. Leaves are numbered within [0; 2**(self.max_depth+1)), possibly
with gaps in the numbering.
Return type array_like, shape=[n_samples, n_trees]
property coef_
Coefficients property
Returns coef_
Return type array of shape [n_features] or [n_classes, n_features]
evals_result()
Return the evaluation results.
If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all
passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the
eval_metrics passed to the fit function.
Returns evals_result
Return type dictionary
Example
clf = xgb.XGBClassifier(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
property feature_importances_
Feature importances property
Returns feature_importances_
Return type array of shape [n_features]
get_booster()
Get the underlying xgboost Booster of this model.
This will raise an exception when fit was not called
Returns booster
Return type a xgboost booster of underlying model
get_num_boosting_rounds()
Gets the number of xgboost boosting rounds.
get_params(deep=False)
Get parameters.
get_xgb_params()
Get xgboost type parameters.
property intercept_
Intercept (bias) property
Returns intercept_
Return type array of shape (1,) or [n_classes]
load_model(fname)
Load the model from a file.
The model is loaded from an XGBoost internal binary format which is universal among the various XG-
Boost interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be
loaded. Label encodings (text labels to numeric labels) will be also lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string or a memory buffer) – Input file name or memory
buffer (see also save_raw)
predict(data, output_margin=False, ntree_limit=None, validate_features=True)
Predict with data.
Parameters
• data (DMatrix) – The dmatrix storing the input.
• output_margin (bool) – Whether to output the raw untransformed margin value.
• ntree_limit (int) – Limit number of trees in the prediction; defaults to
best_ntree_limit if defined (i.e. it has been trained with early stopping), otherwise 0 (use
all trees).
• validate_features (bool) – When this is True, validate that the Booster’s and
data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the
same.
Returns prediction
Return type numpy array
Parameters
• data (DMatrix) – The dmatrix storing the input.
save_model(fname)
Save the model to a file.
The model is saved in an XGBoost internal binary format which is universal among the various XGBoost
interfaces. Auxiliary attributes of the Python Booster object (such as feature names) will not be saved.
Label encodings (text labels to numeric labels) will also be lost. If you are using only the Python
interface, we recommend pickling the model object for best results.
Parameters fname (string) – Output file name
set_params(**params)
Set the parameters of this estimator. Modification of the sklearn method to allow unknown kwargs. This
allows using the full range of xgboost parameters that are not defined as member variables in sklearn grid
search. Returns: self
Plotting API
Plotting Library.
xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', importance_type='weight', max_num_features=None, grid=True, show_values=True, **kwargs)
Plot importance based on fitted trees.
Parameters
• booster (Booster, XGBModel or dict) – Booster or XGBModel instance, or dict
taken by Booster.get_fscore()
• ax (matplotlib Axes, default None) – Target axes instance. If None, new figure
and axes will be created.
• grid (bool, default True) – Turn the axes grids on or off.
• importance_type (str, default "weight") – How the importance is calcu-
lated: either “weight”, “gain”, or “cover”
– ”weight” is the number of times a feature appears in a tree
– ”gain” is the average gain of splits which use the feature
– ”cover” is the average coverage of splits which use the feature where coverage is defined
as the number of samples affected by the split
• max_num_features (int, default None) – Maximum number of top features dis-
played on plot. If None, all features will be displayed.
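A short usage sketch (bst is a hypothetical trained Booster; matplotlib must be installed):

import matplotlib.pyplot as plt
import xgboost as xgb

xgb.plot_importance(bst, max_num_features=10, importance_type='gain')
plt.show()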
Callback API
xgboost.callback.print_evaluation(period=1, show_stdv=True)
Create a callback that prints the evaluation result.
We print the evaluation results every period iterations and on the first and the last iterations.
Parameters
• period (int) – The period to log the evaluation results
• show_stdv (bool, optional) – Whether show stdv if provided
Returns callback – A callback that prints the evaluation result every period iterations.
Return type function
xgboost.callback.record_evaluation(eval_result)
Create a call back that records the evaluation history into eval_result.
Parameters eval_result (dict) – A dictionary to store the evaluation results.
Returns callback – The requested callback function.
Return type function
xgboost.callback.reset_learning_rate(learning_rates)
Reset the learning rate after iteration 1.
NOTE: the initial learning rate will still take effect on the first iteration.
Parameters learning_rates (list or function) – List of learning rate for each boosting
round or a customized function that calculates eta in terms of current number of round and the
total number of boosting round (e.g. yields learning rate decay)
• list l: eta = l[boosting_round]
• function f: eta = f(boosting_round, num_boost_round)
Returns callback – The requested callback function.
Return type function
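A sketch of both forms (param, dtrain and the number of rounds are hypothetical):

# As a list: one learning rate per boosting round.
rates = [0.3 * (0.99 ** i) for i in range(10)]
bst = xgb.train(param, dtrain, num_boost_round=10,
                callbacks=[xgb.callback.reset_learning_rate(rates)])

# As a function of (current_round, total_rounds).
def decay(current_round, num_boost_round):
    return 0.3 * (0.99 ** current_round)

bst = xgb.train(param, dtrain, num_boost_round=10,
                callbacks=[xgb.callback.reset_learning_rate(decay)])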
Dask API
Parameters
• client – Dask client representing the cluster
• func – Python function to be executed by each worker. Typically contains xgboost training
code.
• args – Arguments to be forwarded to func
Returns Dict containing the function return value for each worker
xgboost.dask.create_worker_dmatrix(*args, **kwargs)
Creates a DMatrix object local to a given worker. Simply forwards arguments onto the standard DMatrix
constructor; if one of the arguments is a dask dataframe, it unpacks the data frame to get the local components.
All dask dataframe arguments must use the same partitioning.
Parameters args – DMatrix constructor args.
Returns DMatrix object containing data local to current dask worker
xgboost.dask.get_local_data(data)
Unpacks a distributed data object to get the rows local to this worker
Parameters data – A distributed dask data object
Returns Local data partition e.g. numpy or pandas
• Check out the Installation Guide for instructions to install xgboost, and the Tutorials for examples on how to
use XGBoost for various tasks.
• Read the API documentation.
• Please visit Walk-through Examples.
1.8.2 Tutorials
XGBoost R Tutorial
Introduction
Installation
Github version
install.packages("drat", repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source")
CRAN version
install.packages("xgboost")
Learning
require(xgboost)
Dataset presentation
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example
data are the same as you will use in your everyday life :-).
Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
Dataset loading
We will load the agaricus datasets embedded with the package and will link them to variables.
The datasets are already split in:
• train: will be used to build the model ;
• test: will be used to assess the quality of our model.
Why split the dataset in two parts?
In the first part we will build our model. In the second part we will want to test it and assess its quality. Without
dividing the dataset, we would test the model on data which the algorithm has already seen.
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
In the real world, it would be up to you to make this division between train and test data. The way
to do it is out of scope for this article, however caret package may help.
Each variable is a list containing two things, label and data:
str(train)
## List of 2
## $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
## .. ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
## .. ..@ Dim : int [1:2] 6513 126
## .. ..@ Dimnames:List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
label is the outcome of our dataset meaning it is the binary classification we will try to predict.
Let’s discover the dimensionality of our datasets.
dim(train$data)
dim(test$data)
This dataset is kept very small so that the R package does not become too heavy; however, XGBoost is built to manage
huge datasets very efficiently.
As seen below, the data are stored in a dgCMatrix, which is a sparse matrix, and the label vector is a numeric
vector ({0,1}):
class(train$data)[1]
## [1] "dgCMatrix"
class(train$label)
## [1] "numeric"
This step is the most critical part of the process for the quality of our model.
Basic training
We are using the train data. As explained above, both data and label are stored in a list.
In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly made of 0, memory
size is reduced. It is very common to have such a dataset.
We will train decision tree model using the following parameters:
• objective = "binary:logistic": we will train a binary classification model ;
• max.depth = 2: the trees won’t be deep, because our case is very simple ;
• nthread = 2: the number of cpu threads we are going to use;
• nrounds = 2: there will be two passes on the data, the second one will enhance the model by further reducing
the difference between ground truth and prediction.
## [0] train-error:0.046522
## [1] train-error:0.022263
The more complex the relationship between your features and your label is, the more passes you need.
Parameter variations
Dense matrix
Alternatively, you can put your dataset in a dense matrix, i.e. a basic R matrix.
## [0] train-error:0.046522
## [1] train-error:0.022263
xgb.DMatrix
XGBoost offers a way to group them in a xgb.DMatrix. You can even add other meta data in it. This will be useful
for the most advanced features we will discover later.
## [0] train-error:0.046522
## [1] train-error:0.022263
Verbose option
XGBoost has several features to help you view the learning progress internally. The purpose is to help you to set the
best parameters, which is the key of your model quality.
One of the simplest ways to see the training progress is to set the verbose option (see below for more advanced
techniques).
# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic", verbose = 0)
## [0] train-error:0.046522
## [1] train-error:0.022263
## [0] train-error:0.046522
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots,
˓→ 4 extra nodes, 0 pruned nodes, max_depth=2
## [1] train-error:0.022263
The purpose of the model we have built is to classify new data. As explained before, we will use the test dataset for
this step.
## [1] 1611
These numbers doesn’t look like binary classification {0,1}. We need to perform a simple transformation before
being able to use these results.
The only thing that XGBoost does is a regression. XGBoost is using label vector to build its regression model.
How can we use a regression model to perform a binary classification?
If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum
will be classified as 1. Therefore, we will set the rule that if this probability for a specific datum is > 0.5 then the
observation is classified as 1 (or 0 otherwise).
## [1] 0 1 0 0 0 1
To measure the model performance, we will compute a simple metric, the average error.
Note that the algorithm has not seen the test data during the model construction.
Steps explanation:
1. as.numeric(pred > 0.5) applies our rule that when the probability (<=> regression <=> prediction) is
> 0.5 the observation is classified as 1 and 0 otherwise ;
2. probabilityVectorPreviouslyComputed != test$label computes the vector of error between
true data and computed probabilities ;
3. mean(vectorOfErrors) computes the average error itself.
The most important thing to remember is that to do a classification, you just do a regression to the label and then
apply a threshold.
Multiclass classification works in a similar way.
This metric is 0.02 and is pretty low: our yummly mushroom model works well!
Advanced features
Most of the features below have been implemented to help you to improve your model by offering a better understand-
ing of its content.
Dataset preparation
For the following advanced features, we need to put data in xgb.DMatrix as explained above.
One of the special features of xgb.train is the capacity to follow the progress of the learning after each round.
Because of the way boosting works, there is a point at which having too many rounds leads to overfitting. You can see
this feature as a cousin of a cross-validation method. The following techniques will help you to avoid overfitting and
to optimize training time by stopping it as soon as possible.
One way to measure progress in the learning of a model is to provide to XGBoost a second dataset already classified.
Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each
round during the learning.
in some way it is similar to what we have done above with the average error. The main difference is that
above it was after building the model, and now it is during the construction that we measure errors.
For the purpose of this example, we use watchlist parameter. It is a list of xgb.DMatrix, each of them tagged
with a name.
XGBoost has computed at each round the same average error metric seen above (we set nrounds to 2, that is why
we have two lines). Obviously, the train-error number is related to the training dataset (the one the algorithm
learns from) and the test-error number to the test dataset.
Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned
from the training dataset matches the observations from the test dataset.
If you do not get such results with your own dataset, you should think about how you divided your dataset into training
and test sets. Maybe there is something to fix. Again, the caret package may help.
For a better understanding of the learning progression, you may want to have some specific metric or even use multiple
evaluation metrics.
˓→"binary:logistic")
eval.metric allows us to monitor two new metrics for each round, logloss and error.
Linear boosting
Until now, all the learnings we have performed were based on boosting trees. XGBoost implements a second al-
gorithm, based on linear boosting. The only difference with the previous command is booster = "gblinear"
parameter (and removing eta parameter).
In this specific case, linear boosting gets slightly better performance metrics than a decision tree based algorithm.
In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However,
decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver
bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
Manipulating xgb.DMatrix
Save / Load
Like models, an xgb.DMatrix object (which groups both the dataset and the outcome) can also be saved, using the
xgb.DMatrix.save function.
xgb.DMatrix.save(dtrain, "dtrain.buffer")
## [1] TRUE
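A minimal sketch of reloading it later (the file name matches the save call above):
# reload the saved DMatrix from disk
dtrain2 <- xgb.DMatrix("dtrain.buffer")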
Information extraction
Information can be extracted from an xgb.DMatrix using the getinfo function. Hereafter we will extract the label
data.
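The extraction call itself is not visible in this extract; a minimal sketch, assuming dtest and pred exist from the earlier prediction step:
# extract the true labels stored in the DMatrix
label <- getinfo(dtest, "label")
# recompute the average error from the thresholded predictions
err <- mean(as.numeric(pred > 0.5) != label)
print(paste("test-error=", err))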
You can dump the trees you learned into a text file using xgb.dump.
xgb.dump(bst, with_stats = T)
## [1] "booster[0]"
## [2] "0:[f28<-1.00136e-05] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
## [3] "1:[f55<-1.00136e-05] yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
## [4] "3:leaf=1.71218,cover=812"
## [5] "4:leaf=-1.70044,cover=112.5"
## [6] "2:[f108<-1.00136e-05] yes=5,no=6,missing=5,gain=198.174,cover=703.75"
## [7] "5:leaf=-1.94071,cover=690.5"
## [8] "6:leaf=1.85965,cover=13.25"
## [9] "booster[1]"
## [10] "0:[f59<-1.00136e-05] yes=1,no=2,missing=1,gain=832.545,cover=788.852"
## [11] "1:[f28<-1.00136e-05] yes=3,no=4,missing=3,gain=569.725,cover=768.39"
## [12] "3:leaf=0.784718,cover=458.937"
## [13] "4:leaf=-0.96853,cover=309.453"
## [14] "2:leaf=-6.23624,cover=20.4624"
You can plot the trees from your model using xgb.plot.tree:
xgb.plot.tree(model = bst)
If you provide a path to the fname parameter, you can save the trees to your hard drive.
Maybe your dataset is big and it takes time to train a model on it. Maybe you are not a big fan of losing time
redoing the same task again and again. In these cases, you will want to save your model and load it when
required.
Helpfully, XGBoost implements such functions.
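The save call itself is not visible in this extract; a minimal sketch (the file name "xgboost.model" is only an example):
# save the trained booster to a binary local file
xgb.save(bst, "xgboost.model")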
## [1] TRUE
The xgb.save function returns TRUE if everything goes well, and raises an error otherwise.
An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
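A minimal sketch of that comparison, assuming pred holds the predictions of the original model and reusing the example file name above:
# load the previously saved model and predict again
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)
# the two prediction vectors should be identical
print(sum(abs(pred2 - pred)))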
# save the model to R's in-memory raw vector
rawVec <- xgb.save.raw(bst)
# print class
print(class(rawVec))
## [1] "raw"
Introduction
The purpose of this Vignette is to show you how to use Xgboost to discover and understand your own dataset better.
This Vignette is not about predicting anything (see Xgboost presentation). We will explain how to use Xgboost to
highlight the link between the features of your data and the outcome.
Package loading:
require(xgboost)
require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd')
A categorical variable has a fixed number of different values. For instance, if a variable called Colour can have only
one of these three values, red, blue or green, then Colour is a categorical variable.
In R, a categorical variable is called factor.
Type ?factor in the console for more information.
To answer the question above, we will convert the categorical variables to numeric ones.
In this Vignette we will see how to transform a dense data.frame (dense = few zeroes in the matrix) with categorical
variables to a very sparse matrix (sparse = lots of zeroes in the matrix) of numeric features.
The method we are going to see is usually called one-hot encoding.
The first step is to load Arthritis dataset in memory and wrap it with data.table package.
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F)
data.table is 100% compliant with R data.frame but its syntax is more consistent and its performance
for large datasets is best in class (including dplyr from R and Pandas from Python). Some parts
of the Xgboost R package use data.table.
The first thing we want to do is to have a look at the first lines of the data.table:
head(df)
For the first feature we create groups of age by rounding the real age.
Note that we transform it to a factor so the algorithm treats these age groups as independent values.
Therefore, 20 is not closer to 30 than it is to 60. In short, the distance between ages is lost in this transformation.
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
Following is an even stronger simplification of the real age, with an arbitrary split at 30 years old. I chose this value
based on nothing. We will see later whether simplifying the information based on arbitrary values is a good strategy (you
may already have an idea of how well it will work. . . ).
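The creation of this second feature is not shown in this extract; a minimal sketch (the column name AgeCat matches the Chi-squared test output later in this vignette):
# arbitrary binary split of Age at 30 years old
head(df[, AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))])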
These new features are highly correlated with the Age feature because they are simple transformations of it.
For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make predictions
less accurate, and most of the time it makes interpretation of the model almost impossible. GLMs, for instance, assume
that the features are uncorrelated.
Fortunately, decision tree algorithms (including boosted trees) are very robust to correlated features. Therefore we
don't need to do anything to manage this situation.
Cleaning data
We remove ID as there is nothing to learn from this feature (it would just add some noise).
df[,ID:=NULL]
levels(df[,Treatment])
One-hot encoding
Next step, we will transform the categorical data to dummy variables. This is the one-hot encoding step.
The purpose is to transform each value of each categorical feature into a binary feature {0, 1}.
For example, the column Treatment will be replaced by two columns, Placebo and Treated, each of them
binary. Therefore, an observation which had the value Placebo in column Treatment before the transformation
will, after the transformation, have the value 1 in the new column Placebo and the value 0 in the new column
Treated. The column Treatment disappears during the one-hot encoding.
Column Improved is excluded because it will be our label column, the one we want to predict.
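A minimal sketch of the encoding (sparse.model.matrix comes from the Matrix package loaded earlier; the formula is explained in the note that follows):
# one-hot encode every categorical feature except the label column Improved
sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
head(sparse_matrix)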
The formula Improved ~ . - 1 used above means: transform all categorical features except the column Improved
to binary values. The -1 removes the first column, which is full of 1s (this intercept column is generated
by the conversion). For more information, you can type ?sparse.model.matrix in the console.
Create the output numeric vector (not as a sparse Matrix):
1. set Y vector to 0;
2. set Y to 1 for rows where Improved == Marked is TRUE;
3. return Y vector.
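A minimal sketch of these three steps in data.table syntax (Y is a temporary helper column; output_vector is the name reused by the later examples):
# build the binary label vector from the Improved column
output_vector <- df[, Y := 0][Improved == "Marked", Y := 1][, Y]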
The code below is fairly standard. For more information, you can look at the documentation of the xgboost function (or at
the vignette Xgboost presentation).
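The training call itself is not visible in this extract; a minimal sketch consistent with the ten train-error lines below (the exact values of max.depth, eta and nthread are illustrative assumptions):
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")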
## [0] train-error:0.202381
## [1] train-error:0.166667
## [2] train-error:0.166667
## [3] train-error:0.166667
## [4] train-error:0.154762
## [5] train-error:0.154762
## [6] train-error:0.154762
## [7] train-error:0.166667
## [8] train-error:0.166667
## [9] train-error:0.166667
You can see the train-error: 0.XXXXX lines. The error decreases. Each line shows how well
the model explains your data. Lower is better.
A model which fits too well may overfit (meaning it copies the past too closely, and won't be that good at predicting
the future).
Here you can see the numbers decrease until line 7 and then increase.
It probably means we are overfitting. To fix that, I should reduce the number of rounds to nrounds =
4. I will leave things as they are because I don't really care for the purpose of this example :-)
Feature importance
In the code below, sparse_matrix@Dimnames[[2]] represents the column names of the sparse matrix. These
names are the original values of the features (remember, each binary column == one value of one categorical feature).
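The importance computation referred to here is not shown in this extract; a minimal sketch (feature_names and model are arguments of xgb.importance; the object names reuse those defined earlier):
# compute the per-feature importance of the trained model
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
head(importance)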
We can go deeper in the analysis of the model. In the data.table above, we have discovered which features count
in predicting whether the illness will go away or not. But we don't yet know the role of these features. For instance, one of the
questions we may want to answer is: does receiving a placebo treatment help to recover from the illness?
One simple solution is to count the co-occurrences of a feature and a class of the classification.
For that purpose we will execute the same function as above but using two more parameters, data and label.
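A minimal sketch of that call; the data and label arguments are the ones this vignette describes (they may be deprecated or removed in newer releases of the package), and the cleaning step simply drops columns we don't need here:
# importance enriched with co-occurrence statistics against the label
importanceRaw <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst,
                                data = sparse_matrix, label = output_vector)
# drop the columns we don't need for this analysis
importanceClean <- importanceRaw[, `:=`(Cover = NULL, Frequency = NULL)]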
head(importanceClean)
In the table above we have removed two unneeded columns and selected only the first lines.
The first thing you notice is the new column Split. It is the split applied to the feature on a branch of one of the trees.
Each split is present, therefore a feature can appear several times in this table. Here we can see the feature Age is used
several times with different splits.
How is the split applied to count the co-occurrences? It is always <. For instance, in the second line, we measure the
number of persons under 61.5 years with the illness gone after the treatment.
The two other new columns are RealCover and RealCover %. The first one measures the number of
observations in the dataset where the split is respected and the label is 1. The second one is the percentage
of the whole population that RealCover represents.
Therefore, according to our findings, getting a placebo doesn't seem to help, but being younger than 61 years may help
(which seems logical).
You may wonder how to interpret the < 1.00001 on the first line. Basically, in a sparse Matrix there
are no 0s; therefore, for a one-hot encoded categorical feature, looking for observations that validate the rule < 1.00001
amounts to looking for the value 1 for this feature.
All these things are nice, but it would be even better to plot the results.
xgb.plot.importance(importance_matrix = importanceRaw)
Features have automatically been divided into 2 clusters: the interesting features. . . and the others.
Depending on the dataset and the learning parameters, you may have more than two clusters. The default
is to limit them to 10, but you can increase this limit. Look at the function documentation for more
information.
According to the plot above, the most important features in this dataset for predicting whether the treatment will work are:
• the Age;
• having received a placebo or not;
• sex, which comes third but is already part of the "not interesting" features group;
• then come our generated features (AgeDiscret). We can see that their contribution is very low.
Let's run a Chi-squared test between each of these features and the label.
A higher Chi-squared value means a stronger association with the label.
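The test calls themselves are not shown in this extract; a minimal sketch that produces output of the kind shown below (chisq.test is base R; df and output_vector come from the earlier steps):
# association between the raw Age feature and the label
print(chisq.test(df$Age, output_vector))
# association between the discretised age and the label
print(chisq.test(df$AgeDiscret, output_vector))
# association between the arbitrary young/old split and the label
print(chisq.test(df$AgeCat, output_vector))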
##
## Pearson's Chi-squared test
##
## data: df$Age and output_vector
## X-squared = 35.475, df = 35, p-value = 0.4458
##
## Pearson's Chi-squared test
##
## data: df$AgeDiscret and output_vector
## X-squared = 8.2554, df = 5, p-value = 0.1427
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$AgeCat and output_vector
## X-squared = 2.3571, df = 1, p-value = 0.1247
The perfectly arbitrary split I made between young and old at 30 years old has a low Chi-squared value of 2.36. This is a result we
might expect: maybe in my mind being over 30 means being old (I am 32 and starting to feel old, which may explain it), but
for the illness we are studying, the vulnerable age is not the same.
Moral of the story: don't let your gut feeling lower the quality of your model.
In the expression "data science", there is the word science :-)
Conclusion
As you can see, in general, destroying information by simplifying it won't improve your model. Chi2 just demonstrates
that.
But in more complex cases, creating a new feature from an existing one which makes the link with the outcome more
obvious may help the algorithm and improve the model.
The case studied here is not complex enough to show that. Check the Kaggle website for some challenging datasets.
However, it's almost always worse when you add arbitrary rules.
Moreover, you can notice that even though we added some useless new features highly correlated with other features,
the boosted tree algorithm was still able to choose the best one, which in this case is the Age.
Linear models may not be that smart in this scenario.
As you may know, the Random Forests™ algorithm is a cousin of boosting; both are part of the ensemble learning
family.
Both train several decision trees on one dataset. The main difference is that in Random Forests™ the trees are independent,
while in boosting tree N+1 focuses its learning on the loss, i.e. on what has not been well modeled by tree N.
This difference has an impact on a corner case of feature importance analysis: correlated features.
Imagine two features perfectly correlated, feature A and feature B. For one specific tree, if the algorithm needs one of
them, it will choose randomly (true in both boosting and Random Forests™).
However, in Random Forests™ this random choice is made for each tree, because each tree is independent from
the others. Therefore, approximately (depending on your parameters), 50% of the trees will choose feature A and
the other 50% will choose feature B. The importance of the information contained in A and B (which is the same,
because they are perfectly correlated) is thus diluted between A and B, so you won't easily see that this information is important
for predicting what you want to predict! It is even worse when you have 10 correlated features. . .
In boosting, when a specific link between a feature and the outcome has been learned by the algorithm, it will try not to
refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on
feature A or on feature B (but not both). You will know that one feature has an important role in the link between the
observations and the label. It is still up to you to search for the features correlated with the one detected as important, if
you need to know all of them.
If you want to try the Random Forests™ algorithm, you can tweak the Xgboost parameters!
Warning: this is still an experimental parameter.
For instance, to compute a model with 1000 trees, with a 0.5 factor on sampling rows and columns:
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
˓→"binary:logistic")
## [0] train-error:0.002150
#Boosting - 3 rounds
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nrounds = 3,
˓→objective = "binary:logistic")
## [0] train-error:0.006142
## [1] train-error:0.006756
## [2] train-error:0.001228
1.9.1 Installation
Building XGBoost4J using Maven requires Maven 3 or newer, Java 7+ and CMake 3.2+ for compiling the JNI bind-
ings.
Before you install XGBoost4J, you need to define environment variable JAVA_HOME as your JDK directory to ensure
that your compiler can find jni.h correctly, since XGBoost4J relies on JNI to implement the interaction between the
JVM and native libraries.
After your JAVA_HOME is defined correctly, installing XGBoost4J is as simple as running mvn package under the jvm-packages directory.
You can also skip the tests by running mvn -DskipTests=true package, if you are sure
about the correctness of your local setup.
To publish the artifacts to your local maven repository, run
mvn install
This command will publish the xgboost binaries, the compiled java classes as well as the java sources to your local
repository. Then you can use XGBoost4J in your Java projects by including the following dependency in pom.xml:
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_source_version_num</version>
</dependency>
Alternatively, you can use the official release from Maven Central by adding the dependency to pom.xml (Maven) or build.sbt (sbt) as follows:
Listing 8: maven
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_version_num</version>
</dependency>
Listing 9: sbt
"ml.dmlc" % "xgboost4j" % "latest_version_num"
This will check out the latest stable version from Maven Central.
For the latest release version number, please check here.
If you want to use XGBoost4J-Spark, replace xgboost4j with xgboost4j-spark.
If you are on Mac OS and using a compiler that supports OpenMP, you need to go to the file xgboost/
jvm-packages/create_jni.py and comment out the line
CONFIG["USE_OPENMP"] = "OFF"
1.9.2 Contents
Data Interface
Like the XGBoost python module, XGBoost4J uses DMatrix to handle data. LIBSVM txt format file, sparse matrix
in CSR/CSC format, and dense matrix are supported.
• The first step is to import DMatrix:
import ml.dmlc.xgboost4j.java.DMatrix;
• Use DMatrix constructor to load data from a libsvm text format file:
• You can also pass arrays to the DMatrix constructor to load data from a sparse matrix in memory. Suppose we have the following sparse matrix:
1 0 2 0
4 0 0 3
3 1 2 0
We can express the sparse matrix in Compressed Sparse Row (CSR) format:
• You may also load your data from a dense matrix. Let’s assume we have a matrix of form
1 2
3 4
5 6
• To set weight:
Setting Parameters
Training Model
With parameters and data, you are able to train a booster model.
• Import Booster and XGBoost:
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.XGBoost;
• Training
• Saving model
After training, you can save the model and dump it out.
booster.saveModel("model.bin");
• Load a model
Prediction
After training and loading a model, you can use it to make predictions for other data. The result will be a two-
dimensional float array of shape (nsample, nclass); for predictLeaf(), the result would be of shape (nsample,
nclass*ntrees).
XGBoost4J-Spark is a project aiming to seamlessly integrate XGBoost and Apache Spark by fitting XGBoost into
Apache Spark's MLLIB framework. With the integration, users can not only use the high-performance algorithm
implementation of XGBoost, but also leverage the powerful data processing engine of Spark for:
• Feature Engineering: feature extraction, transformation, dimensionality reduction, and selection, etc.
• Pipelines: constructing, evaluating, and tuning ML Pipelines
• Persistence: persist and load machine learning models and even whole Pipelines
This tutorial covers the end-to-end process of building a machine learning pipeline with XGBoost4J-Spark. We will
discuss:
• Using Spark to preprocess data to fit to XGBoost/XGBoost4J-Spark’s data interface
• Training a XGBoost model with XGBoost4J-Spark
• Serving XGBoost model (prediction) with Spark
• Building a Machine Learning Pipeline with XGBoost4J-Spark
• Running XGBoost4J-Spark in Production
* Early Stopping
* Training with Evaluation Sets
– Prediction
* Batch Prediction
* Single instance prediction
– Model Persistence
– Basic ML Pipeline
– Pipeline with Hyper-parameter Tuning
• Run XGBoost4J-Spark in Production
– Parallel/Distributed Training
– Gang Scheduling
– Checkpoint During Training
Before we go on a tour of how to use XGBoost4J-Spark, we give a brief introduction to building a
machine learning application with XGBoost4J-Spark. The first thing you need to do is to add the dependency from
Maven Central.
You can add the following dependency in your pom.xml.
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark</artifactId>
<version>latest_version_num</version>
</dependency>
<repository>
<id>XGBoost4J-Spark Snapshot Repo</id>
<name>XGBoost4J-Spark Snapshot Repo</name>
<url>https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/</url>
</repository>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark</artifactId>
<version>next_version_num-SNAPSHOT</version>
</dependency>
distribution of XGBoost.
Data Preparation
As aforementioned, XGBoost4J-Spark seamlessly integrates Spark and XGBoost. The integration enables users to
apply various types of transformation over the training/test datasets with the convenient and powerful data processing
framework, Spark.
In this section, we use Iris dataset as an example to showcase how we use Spark to transform raw dataset and make it
fit to the data interface of XGBoost.
The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width", "petal length"
and "petal width". In addition, it contains the "class" column, which is essentially the label with three possible
values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".
The first thing in data transformation is to load the dataset as Spark’s structured data abstraction, DataFrame.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
At the first line, we create an instance of SparkSession, which is the entry point of any Spark program working with
DataFrame. The schema variable defines the schema of the DataFrame wrapping the Iris data. With this explicitly set
schema, we can define the columns' names as well as their types; otherwise the column names would be the default
ones derived by Spark, such as _col0, etc. Finally, we can use Spark's built-in CSV reader to load the Iris CSV file as a
DataFrame named rawInput.
Spark also contains many built-in readers for other formats. The latest version of Spark supports CSV, JSON, Parquet,
and LIBSVM.
To make the Iris dataset recognizable to XGBoost, we need to: 1. Convert the String-typed label to Double; 2. Assemble the feature columns as a vector to fit the data interface of the Spark ML framework.
To convert the String-typed label to Double, we can use Spark's built-in feature transformer StringIndexer.
import org.apache.spark.ml.feature.StringIndexer
val stringIndexer = new StringIndexer().
setInputCol("class").
setOutputCol("classIndex").
fit(rawInput)
val labelTransformed = stringIndexer.transform(rawInput).drop("class")
import org.apache.spark.ml.feature.VectorAssembler
val vectorAssembler = new VectorAssembler().
setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features",
˓→"classIndex")
Now, we have a DataFrame containing only two columns, “features” which contains vector-represented “sepal length”,
“sepal width”, “petal length” and “petal width” and “classIndex” which has Double-typed labels. A DataFrame like
this (containing vector-represented features and numeric labels) can be fed to XGBoost4J-Spark’s training engine
directly.
We introduce the following approaches for dealing with missing values, and the scenarios they fit:
1. Skip VectorAssembler (using setHandleInvalid = “skip”) directly. Used in (2), (3).
2. Keep it (using setHandleInvalid = “keep”), and set the “missing” parameter in XGBClassifier/XGBRegressor as
the value representing missing. Used in (2) and (4).
3. Keep it (using setHandleInvalid = “keep”) and transform to other irregular values. Used in (3).
4. Nothing to be done, used in (1).
Then, XGBoost will automatically learn what’s the ideal direction to go when a value is missing, based on that value
and strategy.
Example of setting a missing value (e.g. -999) to the “missing” parameter in XGBoostClassifier:
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgbParam = Map("eta" -> 0.1f,
"missing" -> -999,
"objective" -> "multi:softprob",
"num_class" -> 3,
"num_round" -> 100,
"num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
setFeaturesCol("features").
setLabelCol("classIndex")
Training
XGBoost supports both regression and classification. While we use the Iris dataset in this tutorial to show how we use
XGBoost/XGBoost4J-Spark to resolve a multi-class classification problem, the usage for regression is very similar
to classification.
To train an XGBoost model for classification, we need to create an XGBoostClassifier first:
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgbParam = Map("eta" -> 0.1f,
"max_depth" -> 2,
"objective" -> "multi:softprob",
"num_class" -> 3,
"num_round" -> 100,
"num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
setFeaturesCol("features").
setLabelCol("classIndex")
The available parameters for training an XGBoost model can be found here. In XGBoost4J-Spark, we support not
only the default set of parameters but also the camel-case variants of these parameters, to keep consistency with Spark's
MLLIB parameters.
Specifically, each parameter on that page has an equivalent camel-case form in XGBoost4J-Spark. For example, to
set max_depth for each tree, you can pass the parameter just like what we did in the above code snippet (as max_depth
wrapped in a Map), or you can do it through the corresponding setter on XGBoostClassifier.
After we set XGBoostClassifier parameters and feature/label column, we can build a transformer, XGBoostClassifi-
cationModel by fitting XGBoostClassifier with the input DataFrame. This fit operation is essentially the training
process and the generated model can then be used in prediction.
Early Stopping
You can also monitor the performance of the model during training with multiple evaluation datasets. By specifying
eval_sets or calling setEvalSets on an XGBoostClassifier or XGBoostRegressor, you can pass in multiple
evaluation datasets typed as a Map from String to DataFrame.
Prediction
XGBoost4j-Spark supports two ways for model serving: batch prediction and single instance prediction.
Batch Prediction
Batch prediction expects the user to pass the test set in the form of a DataFrame. XGBoost4J-Spark starts an XGBoost
worker for each partition of the DataFrame for parallel prediction and generates prediction results for the whole DataFrame
in a batch.
Calling transform on the test DataFrame with the trained model produces a result DataFrame containing the margin,
the probability for each class, and the prediction for each instance:
+-----------------+----------+--------------------+--------------------+----------+
| features|classIndex| rawPrediction| probability|prediction|
+-----------------+----------+--------------------+--------------------+----------+
|[5.1,3.5,1.4,0.2]| 0.0|[3.45569849014282...|[0.99579632282257...| 0.0|
|[4.9,3.0,1.4,0.2]| 0.0|[3.45569849014282...|[0.99618089199066...| 0.0|
|[4.7,3.2,1.3,0.2]| 0.0|[3.45569849014282...|[0.99643349647521...| 0.0|
|[4.6,3.1,1.5,0.2]| 0.0|[3.45569849014282...|[0.99636095762252...| 0.0|
|[5.0,3.6,1.4,0.2]| 0.0|[3.45569849014282...|[0.99579632282257...| 0.0|
|[5.4,3.9,1.7,0.4]| 0.0|[3.45569849014282...|[0.99428516626358...| 0.0|
|[4.6,3.4,1.4,0.3]| 0.0|[3.45569849014282...|[0.99643349647521...| 0.0|
|[5.0,3.4,1.5,0.2]| 0.0|[3.45569849014282...|[0.99579632282257...| 0.0|
|[4.4,2.9,1.4,0.2]| 0.0|[3.45569849014282...|[0.99618089199066...| 0.0|
|[4.9,3.1,1.5,0.1]| 0.0|[3.45569849014282...|[0.99636095762252...| 0.0|
|[5.4,3.7,1.5,0.2]| 0.0|[3.45569849014282...|[0.99428516626358...| 0.0|
|[4.8,3.4,1.6,0.2]| 0.0|[3.45569849014282...|[0.99643349647521...| 0.0|
|[4.8,3.0,1.4,0.1]| 0.0|[3.45569849014282...|[0.99618089199066...| 0.0|
|[4.3,3.0,1.1,0.1]| 0.0|[3.45569849014282...|[0.99618089199066...| 0.0|
|[5.8,4.0,1.2,0.2]| 0.0|[3.45569849014282...|[0.97809928655624...| 0.0|
|[5.7,4.4,1.5,0.4]| 0.0|[3.45569849014282...|[0.97809928655624...| 0.0|
|[5.4,3.9,1.3,0.4]| 0.0|[3.45569849014282...|[0.99428516626358...| 0.0|
|[5.1,3.5,1.4,0.3]| 0.0|[3.45569849014282...|[0.99579632282257...| 0.0|
|[5.7,3.8,1.7,0.3]| 0.0|[3.45569849014282...|[0.97809928655624...| 0.0|
|[5.1,3.8,1.5,0.3]| 0.0|[3.45569849014282...|[0.99579632282257...| 0.0|
+-----------------+----------+--------------------+--------------------+----------+
Model Persistence
A data scientist produces an ML model and hands it over to an engineering team for deployment in a production
environment. Conversely, a trained model may be used by data scientists, for example as a baseline, across the process
of data exploration. So it's important to support model persistence to make models available across usage scenarios
and programming languages.
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
xgbClassificationModel.write.overwrite().save("/tmp/xgbClassificationModel") // example path for illustration
val xgbClassificationModel2 = XGBoostClassificationModel.load("/tmp/xgbClassificationModel")
xgbClassificationModel2.transform(xgbInput)
With regards to ML pipeline save and load, please refer to the next section.
After we train a model with XGBoost4J-Spark on a massive dataset, sometimes we want to do model serving on a single
machine or integrate it with other single-node libraries for further processing. XGBoost4J-Spark supports exporting the model
to a local file through the model's nativeBooster.saveModel method (see the note below).
The exported model can then be loaded with the single-node Python XGBoost package.
Note: Using HDFS and S3 for exporting the models with nativeBooster.saveModel()
When interacting with other language bindings, XGBoost also supports saving-models-to and loading-models-from
file systems other than the local one. You can use HDFS and S3 by prefixing the path with hdfs:// and s3://
respectively. However, for this capability, you must do one of the following:
1. Build XGBoost4J-Spark with the steps described here, but with the USE_HDFS (or USE_S3, etc., in the same
place) switch turned on. With this approach, you can reuse the above code example by replacing "nativeModelPath"
with an HDFS path.
• However, if you build with USE_HDFS, etc. you have to ensure that the involved shared object file, e.g.
libhdfs.so, is put in the LIBRARY_PATH of your cluster. To avoid the complicated cluster environment
configuration, choose the other option.
2. Use bindings of HDFS, S3, etc. to pass model files around. Here are the steps (taking HDFS as an example):
• Create a new file with
xgbClassificationModel.nativeBooster.saveModel(outputStream)
• Download file in other languages from HDFS and load with the pre-built (without the requirement of lib-
hdfs.so) version of XGBoost. (The function “download_from_hdfs” is a helper function to be implemented
by the user)
spark.read.format("libsvm").load("trainingset_libsvm")
Spark assumes that the dataset is using 1-based indexing (feature indices starting with 1). However, when you do
prediction with other bindings of XGBoost (e.g. the Python API of XGBoost), XGBoost assumes that the dataset is using
0-based indexing (feature indices starting with 0) by default. This creates a pitfall for users who train a model with
Spark but predict with a dataset in the same format in other bindings of XGBoost. The solution is to transform the
dataset to 0-based indexing before you predict with, for example, the Python API, or to append ?indexing_mode=1
to your file path when loading it with DMatrix. For example in Python:
xgb.DMatrix('test.libsvm?indexing_mode=1')
Basic ML Pipeline
A Spark ML pipeline can combine multiple algorithms or functions into a single pipeline. It covers everything from feature
extraction, transformation and selection to model training and prediction. XGBoost4J-Spark makes it feasible to embed
XGBoost into such a pipeline seamlessly. The following example shows how to build such a pipeline consisting of
Spark MLlib feature transformers and the XGBoostClassifier estimator.
We still use Iris dataset and the rawInput DataFrame. First we need to split the dataset into training and test dataset.
We need to organize these steps as a Pipeline in Spark ML framework and evaluate the whole pipeline to get a
PipelineModel:
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
After we get the PipelineModel, we can make prediction on the test dataset and evaluate the model accuracy.
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
The most critical operation to maximize the power of XGBoost is to select the optimal parameters for the model.
Tuning parameters manually is a tedious and labor-intensive process. With the latest version of XGBoost4J-Spark,
we can utilize Spark's model selection tools to automate this process.
The following example shows the code snippet utilizing CrossValidation and MulticlassClassificationEvaluator to
search the optimal combination of two XGBoost parameters, max_depth and eta. (See XGBoost Parameters.) The
model producing the maximum accuracy defined by MulticlassClassificationEvaluator is selected and used to generate
the prediction for the test set.
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.PipelineModel
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
XGBoost4J-Spark is one of the most important steps toward bringing XGBoost to production environments more easily. In this
section, we introduce three key features for running XGBoost4J-Spark in production.
Parallel/Distributed Training
Massive training datasets are one of the most significant characteristics of production environments. To
ensure that training in XGBoost scales with the data size, XGBoost4J-Spark bridges the distributed/parallel processing
framework of Spark and the parallel/distributed training mechanism of XGBoost.
In XGBoost4J-Spark, each XGBoost worker is wrapped by a Spark task, and the training dataset in Spark's memory
space is fed to the XGBoost workers in a way that is transparent to the user.
In the code snippet where we build XGBoostClassifier, we set parameter num_workers (or numWorkers). This
parameter controls how many parallel workers we want to have when training a XGBoostClassificationModel.
Gang Scheduling
XGBoost uses the AllReduce algorithm to synchronize the stats, e.g. histogram values, of each worker during training.
Therefore XGBoost4J-Spark requires that all of the nthread * numWorkers cores be available before the
training runs.
In a production environment where many users share the same cluster, it's hard to guarantee that your XGBoost4J-
Spark application can get all the requested resources for every run. By default, the communication layer in XGBoost will
block the whole application while it waits for more resources to become available. This usually causes unnecessary
resource waste, as it holds the ready resources while trying to claim more. Additionally, this usually happens silently and
does not attract the attention of users.
XGBoost4J-Spark allows the user to set up a timeout threshold for claiming resources from the cluster. If the application
cannot get enough resources within this time period, the application fails instead of wasting resources by
hanging for a long time. To enable this feature, you can set it with XGBoostClassifier/XGBoostRegressor:
xgbClassifier.setTimeoutRequestWorkers(60000L)
If XGBoost4J-Spark cannot get enough resources for running two XGBoost workers, the application will fail. Users
can set up an external mechanism to monitor the status of the application and get notified in such a case.
Checkpoint During Training
Transient failures are also commonly seen in production environments. To simplify the design of XGBoost, we stop
training if any of the distributed workers fail. However, if the training fails after it has been running for a long time, it
is a great waste of resources.
We support creating checkpoints during training to facilitate more efficient recovery from failure. To enable this feature,
you can set how many iterations elapse between checkpoints with setCheckpointInterval and the location
of the checkpoints with setCheckpointPath:
xgbClassifier.setCheckpointInterval(2)
xgbClassifier.setCheckpointPath("/checkpoint_path")
If the training fails during these 100 rounds, the next run of training starts by reading the latest checkpoint
file in /checkpoint_path and resumes from the iteration at which the checkpoint was built, continuing until the next failure or the
specified 100 rounds.
1.10 XGBoost.jl
XGBoost has been developed by community members. Everyone is welcome to contribute. We value all forms of
contributions, including, but not limited to:
• Code reviews for pull requests
• Documentation and usage examples
• Community participation in forums and issues
• Code readability and developer guide
– We welcome contributions that add code comments to improve readability.
– We also welcome contributions to docs to explain the design choices of the XGBoost internals.
• Test cases to make the codebase more robust.
• Tutorials, blog posts, talks that promote the project.
Here are guidelines for contributing to various aspects of the XGBoost project:
XGBoost adopts the Apache-style model and governs by merit. We believe that it is important to create an inclusive
community where everyone can use, contribute to, and influence the direction of the project. See CONTRIBUTORS.md
for the current list of contributors.
Everyone in the community is welcome to send patches and documents, and to propose new directions for the project.
The key guideline here is to enable everyone in the community to get involved and participate in decisions and
development. When major changes are proposed, an RFC should be sent to allow discussion by the community. We
encourage public discussion in archivable channels such as issues and the discussion forum, so that everyone in the community
can participate and review the process later.
Code reviews are one of the key ways to ensure the quality of the code. High-quality code reviews prevent long-term
technical debt and are crucial to the success of the project. A pull request needs to be reviewed before it gets
merged. A committer who has expertise in the corresponding area moderates the pull request and merges
the code when it is ready. The corresponding committer could request multiple reviewers who are familiar with the
area of the code. We encourage contributors to request code reviews themselves and to help review each other's code;
remember everyone is volunteering their time to the community, a high-quality code review itself costs as much as the
actual code contribution, and you can get your code reviewed quickly if you do others the same favor.
The community should strive to reach a consensus on technical decisions through discussion. We expect commit-
ters and PMCs to moderate technical discussions in a diplomatic way, and provide suggestions with clear technical
reasoning when necessary.
Committers
Committers are individuals who are granted write access to the project. A committer is usually responsible for
a certain area or several areas of the code, where they oversee the code review process. Contributions can
take all forms, including code contributions and code reviews, documents, education, and outreach. Committers are
essential for a high-quality and healthy project. The community actively looks for new committers from contributors.
Here is a list of useful traits that help the community to recognize potential committers:
• Sustained contribution to the project, demonstrated by discussion over RFCs, code reviews, proposals of
new features, and other development activities. Being familiar with, and being able to take ownership of, one or
several areas of the project.
• Quality of contributions: High-quality, readable code contributions indicated by pull requests that can be merged
without a substantial code review. History of creating clean, maintainable code and including good test cases.
Informative code reviews to help other contributors that adhere to a good standard.
• Community involvement: active participation in the discussion forum and promotion of the project via tutorials, talks
and outreach. We encourage committers to collaborate broadly, e.g. to do code reviews and discuss designs with
community members they do not interact with physically.
The Project Management Committee (PMC) consists of a group of active committers that moderate the discussion, manage
the project releases, and propose new committer/PMC members. Potential candidates are usually proposed via an
internal discussion among PMC members, followed by a consensus approval, i.e. at least 3 +1 votes and no vetoes. Any veto
must be accompanied by reasoning. PMC members should serve the community by upholding the community practices and
guidelines in order to make XGBoost a better community for everyone. PMC members should strive to only nominate new candidates
from outside their own organization.
Reviewers
Reviewers are individuals who have actively contributed to the project and are willing to participate in the code review of
new contributions. We identify reviewers from active contributors. Committers should explicitly solicit reviews
from reviewers. High-quality code reviews prevent long-term technical debt and are crucial to the success of the
project. A pull request to the project has to be reviewed by at least one reviewer in order to be merged.
Contents
• Follow PEP 8: Style Guide for Python Code. We use PyLint to automatically enforce PEP 8 style across our
Python codebase. Before submitting your pull request, you are encouraged to run PyLint on your machine. See
R Coding Guideline.
• Docstrings should be in NumPy docstring format.
R Coding Guideline
Code Style
make rcpplint
• When needed, you can disable the linter warning on a certain line with // NOLINT(*) comments.
• We use roxygen for documenting the R package.
Rmarkdown Vignettes
Rmarkdown vignettes are placed in R-package/vignettes. These Rmarkdown files are not compiled. We host the
compiled version on doc/R-package.
The following steps are used to add a new Rmarkdown vignette:
• Add the original Rmarkdown file to R-package/vignettes.
• Modify doc/R-package/Makefile to add the markdown files to be built.
• Clone the dmlc/web-data repo to folder doc.
• Now type the following command on doc/R-package:
make the-markdown-to-make.md
make html
The reason we do this is to avoid an exploded repo size due to generated images.
Registering native routines in R
According to the R extension manual, it is good practice to register native routines and to disable symbol search. When
any changes or additions are made to the C++ interface of the R package, please make corresponding changes in
src/init.c as well.
Once you submit a pull request to dmlc/xgboost, we perform two automatic checks to enforce coding style conventions.
To expedite the code review process, you are encouraged to run the checks locally on your machine prior to submitting
your pull request.
Linter
We use pylint and cpplint to enforce style conventions and find potential errors. Linting is especially useful for Python,
as we can catch many errors that would otherwise have occurred at run-time.
To run this check locally, run the following command from the top level source tree:
cd /path/to/xgboost/
make lint
Clang-tidy
Clang-tidy is an advanced linter for C++ code, made by the LLVM team. We use it to conform our C++ codebase to
modern C++ practices and conventions.
To run this check locally, run the following command from the top level source tree:
cd /path/to/xgboost/
python3 tests/ci_build/tidy.py --gtest-path=/path/to/google-test
where --gtest-path option specifies the full path of Google Test library.
Also, the script accepts two optional integer arguments, namely --cpp and --cuda. By default they are both set to
1, meaning that both C++ and CUDA code will be checked. If the CUDA toolkit is not installed on your machine,
you’ll encounter an error. To exclude CUDA source from linting, use:
cd /path/to/xgboost/
python3 tests/ci_build/tidy.py --cuda=0 --gtest-path=/path/to/google-test
Similarly, to exclude C++ source from linting:
cd /path/to/xgboost/
python3 tests/ci_build/tidy.py --cpp=0 --gtest-path=/path/to/google-test
A high-quality suite of tests is crucial in ensuring the correctness and robustness of the codebase. Here, we provide
instructions on how to run unit tests, and also how to add new ones.
Contents
Add your test under the directory tests/python/ or tests/python-gpu/ (if you are testing GPU code). Refer to the PyTest
tutorial to learn how to write tests for Python code.
You may try running your test by following instructions in this section.
Add your test under the directory tests/cpp/. Refer to this excellent tutorial on using Google Test.
You may try running your test by following instructions in this section.
The JVM packages for XGBoost (XGBoost4J / XGBoost4J-Spark) use the Maven Standard Directory Layout. Specif-
ically, the tests for the JVM packages are located in the following locations:
• jvm-packages/xgboost4j/src/test/
• jvm-packages/xgboost4j-spark/src/test/
To write a test for Java code, see JUnit 5 tutorial. To write a test for Scala, see Scalatest tutorial.
You may try running your test by following instructions in this section.
R package: testthat
Add your test under the directory R-package/tests/testthat. Refer to this excellent tutorial on testthat.
You may try running your test by following instructions in this section.
R package
Run
make Rcheck
JVM packages
mvn package
Then compile XGBoost according to instructions in Building the Shared Library. Finally, invoke pytest at the project
root directory:
(For this step, you should have compiled XGBoost with CUDA enabled.)
To build and run C++ unit tests, install Google Test library with headers and then enable tests while running CMake:
mkdir build
cd build
cmake -DGOOGLE_TEST=ON -DGTEST_ROOT=/path/to/google-test ..
make
make test
To enable tests for CUDA code, add -DUSE_CUDA=ON and -DUSE_NCCL=ON (CUDA toolkit required):
mkdir build
cd build
cmake -DGOOGLE_TEST=ON -DGTEST_ROOT=/path/to/google-test -DUSE_CUDA=ON -DUSE_NCCL=ON ..
make
make test
One can also run all unit tests using the ctest tool, which provides higher flexibility. For example:
ctest --verbose
By default, sanitizers are bundled in GCC and Clang/LLVM. One can enable sanitizers with GCC >= 4.8 or LLVM >=
3.1, but some distributions might package sanitizers separately. Here is a list of supported sanitizers with the corresponding
library names:
• Address sanitizer: libasan
• Leak sanitizer: liblsan
• Thread sanitizer: libtsan
Memory sanitizer is exclusive to LLVM, hence not supported in XGBoost.
One can build XGBoost with sanitizer support by specifying -DUSE_SANITIZER=ON. By default, address sanitizer
and leak sanitizer are used when you turn the USE_SANITIZER flag on. You can always change the default by provid-
ing a semicolon separated list of sanitizers to ENABLED_SANITIZERS. Note that thread sanitizer is not compatible
with the other two sanitizers.
By default, CMake will search regular system paths for sanitizers; you can also supply a specific path via SANITIZER_PATH.
Running XGBoost on CUDA with the address sanitizer (asan) will raise memory errors. To use asan with CUDA correctly,
you need to configure asan via the ASAN_OPTIONS environment variable:
ASAN_OPTIONS=protect_shadow_gap=0 ${BUILD_DIR}/testxgboost
Contents
• Documents
• Examples
Documents
make html
Examples
Contents
• Git may show some conflicts it cannot merge, say conflicted.py.
– Manually modify the file to resolve the conflict.
– After you resolved the conflict, mark it as resolved by
• Finally push to your fork, you may need to force push here.
Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones, to
create a PR with a set of meaningful commits. You can do it with the following steps.
• Before doing so, configure the default editor of git if you haven't done so before.
• It will pop up a text editor. Set the first commit as pick, and change later ones to squash.
• After you save the file, it will pop up another text editor to ask you to modify the combined commit message.
• Push the changes to your fork; you need to force push.
The previous two tips require force push because we altered the path of the commits. It is fine to force push to
your own fork, as long as the commits changed are only yours.
Versioning Policy
Starting from XGBoost 1.0.0, each XGBoost release will be versioned as [MAJOR].[FEATURE].[MAINTENANCE]
• MAJOR: We guarantee API compatibility across releases with the same major version number. We expect to
have a 1+ year development period for a new MAJOR release version.
• FEATURE: We ship new features, improvements and bug fixes through feature releases. The cycle length of a
feature release is decided by the size of the feature roadmap. The roadmap is decided right after the previous release.
• MAINTENANCE: Maintenance versions only contain bug fixes. This type of release only occurs when we
find significant correctness and/or performance bugs that would be a barrier for users to upgrade to a new version of
XGBoost smoothly.
x
xgboost.core, 53
xgboost.dask, 93
xgboost.plotting, 90
xgboost.sklearn, 63
xgboost.training, 61
A
apply() (xgboost.XGBClassifier method), 70
apply() (xgboost.XGBRanker method), 76
apply() (xgboost.XGBRegressor method), 65
apply() (xgboost.XGBRFClassifier method), 86
apply() (xgboost.XGBRFRegressor method), 81
attr() (xgboost.Booster method), 57
attributes() (xgboost.Booster method), 57
B
boost() (xgboost.Booster method), 57
Booster (class in xgboost), 56
C
coef_() (xgboost.XGBClassifier property), 70
coef_() (xgboost.XGBRanker property), 76
coef_() (xgboost.XGBRegressor property), 65
coef_() (xgboost.XGBRFClassifier property), 86
coef_() (xgboost.XGBRFRegressor property), 81
copy() (xgboost.Booster method), 57
create_worker_dmatrix() (in module xgboost.dask), 93
cv() (in module xgboost), 62
D
DMatrix (class in xgboost), 53
dump_model() (xgboost.Booster method), 57
E
early_stop() (in module xgboost.callback), 92
eval() (xgboost.Booster method), 57
eval_set() (xgboost.Booster method), 58
evals_result() (xgboost.XGBClassifier method), 70
evals_result() (xgboost.XGBRanker method), 76
evals_result() (xgboost.XGBRegressor method), 65
evals_result() (xgboost.XGBRFClassifier method), 86
evals_result() (xgboost.XGBRFRegressor method), 81
F
feature_importances_() (xgboost.XGBClassifier property), 71
feature_importances_() (xgboost.XGBRanker property), 77
feature_importances_() (xgboost.XGBRegressor property), 66
feature_importances_() (xgboost.XGBRFClassifier property), 87
feature_importances_() (xgboost.XGBRFRegressor property), 82
feature_names() (xgboost.DMatrix property), 54
feature_types() (xgboost.DMatrix property), 54
fit() (xgboost.XGBClassifier method), 71
fit() (xgboost.XGBRanker method), 77
fit() (xgboost.XGBRegressor method), 66
fit() (xgboost.XGBRFClassifier method), 87
fit() (xgboost.XGBRFRegressor method), 82
G
get_base_margin() (xgboost.DMatrix method), 54
get_booster() (xgboost.XGBClassifier method), 72
get_booster() (xgboost.XGBRanker method), 78
get_booster() (xgboost.XGBRegressor method), 67
get_booster() (xgboost.XGBRFClassifier method), 88
get_booster() (xgboost.XGBRFRegressor method), 83
get_dump() (xgboost.Booster method), 58
get_float_info() (xgboost.DMatrix method), 54
get_fscore() (xgboost.Booster method), 58
get_label() (xgboost.DMatrix method), 54
get_local_data() (in module xgboost.dask), 93
get_num_boosting_rounds() (xgboost.XGBClassifier method), 72
get_num_boosting_rounds() (xgboost.XGBRanker method), 78
get_num_boosting_rounds() (xgboost.XGBRegressor method), 67
get_num_boosting_rounds() (xgboost.XGBRFClassifier method), 88
T
to_graphviz() (in module xgboost), 91
train() (in module xgboost), 61
trees_to_dataframe() (xgboost.Booster method), 61
U
update() (xgboost.Booster method), 61
X
XGBClassifier (class in xgboost), 69
xgboost.core (module), 53
xgboost.dask (module), 93
xgboost.plotting (module), 90
xgboost.sklearn (module), 63
xgboost.training (module), 61
XGBRanker (class in xgboost), 74
XGBRegressor (class in xgboost), 63
XGBRFClassifier (class in xgboost), 85
XGBRFRegressor (class in xgboost), 79