XGBoost Hyperparameters — Explained

Image Credits: https://www.chaussurevip.fr/

The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost — Tianqi Chen.

XGBoost, or eXtreme Gradient Boosting, is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
The XGBoost algorithm exposes a large range of hyperparameters. To take full advantage of the model, we should know what they do and how to tune them.
Let us get straight down to business!

Hyperparameters

Hyperparameters are settings chosen before training that control the learning process of an algorithm. The XGBoost parameters can be classified into four distinct categories:

  1. General Parameters: guide the overall functioning of the model and relate to the booster being used.
  2. Booster Parameters: depend on the choice of the booster.
  3. Learning Task Parameters: specify the learning task and the corresponding learning objective, i.e. the loss to be optimized, along with the metric used for evaluation.
  4. Command Line Parameters: needed only for the command line version of XGBoost.

The command line parameters are only used in the console version of XGBoost, so we will limit this article to the first three categories.

We will now deep dive into each of the parameter categories and go through the available hyperparameters.

General Hyperparameters

  1. booster
    ◙ Default = gbtree
    ◙ Helps to select the type of model to run at each iteration.
    ◙ It has three options: gbtree, gblinear or dart. gbtree and dart use tree-based models, while gblinear uses linear functions.
  2. verbosity
    ◙ Default = 1
    ◙ Verbosity of printing messages.
    ◙ Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug).
  3. validate_parameters
    ◙ Default = False
    ◙ Performs validation of input parameters to check whether a parameter is used or not.
    ◙ Still under experimentation.
  4. nthread
    ◙ Default = maximum number of threads available if not set
    ◙ It sets the number of parallel threads used to run XGBoost, typically the number of cores in the system.
    ◙ To run on all cores, leave it unset and the algorithm will detect the number automatically.
  5. disable_default_eval_metric
    ◙ Default = False
    ◙ If set to True, it disables the default metric.
  6. num_pbuffer
    ◙ It is set automatically by the algorithm.
    ◙ Size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.
  7. num_feature
    ◙ It is the number of features (the feature dimension) to be used in boosting.
    ◙ It is set automatically by the algorithm to the maximum number of features present.
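
As a rough illustration, the general parameters above could be collected in a plain Python dictionary of the kind the xgboost Python package accepts as its params argument. The check_general_params helper is hypothetical, written for this article only, and simply encodes the valid values listed above.

```python
# General parameters with the defaults listed above.
# nthread is left out so that XGBoost can detect the number of cores itself.
general_params = {
    "booster": "gbtree",
    "verbosity": 1,
    "validate_parameters": False,
    "disable_default_eval_metric": False,
}

VALID_BOOSTERS = {"gbtree", "gblinear", "dart"}
VALID_VERBOSITY = {0, 1, 2, 3}  # silent, warning, info, debug


def check_general_params(params):
    """Hypothetical helper: return a list of problems found in params."""
    problems = []
    if params.get("booster", "gbtree") not in VALID_BOOSTERS:
        problems.append("booster must be gbtree, gblinear or dart")
    if params.get("verbosity", 1) not in VALID_VERBOSITY:
        problems.append("verbosity must be 0, 1, 2 or 3")
    return problems


print(check_general_params(general_params))
print(check_general_params({"booster": "forest", "verbosity": 5}))
```

The first call reports no problems; the second flags both the unknown booster and the out-of-range verbosity.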

Booster Parameters

There are three sub-categories here — parameters for Tree Booster, Linear Booster and Tweedie Regression. We will cover the Tree Booster parameters as they are used widely.

  1. eta
    ◙ Default = 0.3
    ◙ This is the learning rate of the algorithm.
    ◙ It is the step size shrinkage used in update to prevent overfitting.
    ◙ It makes the model more conservative by shrinking the weights on each step.
    ◙ Range of eta is [0,1].
  2. gamma
    ◙ Default = 0
    ◙ A node is split only when the resulting split gives a positive reduction in the loss function.
    ◙ Gamma specifies the minimum loss reduction required to make a split.
    ◙ The larger the gamma value, the more conservative the algorithm.
  3. max_depth
    ◙ Default = 6
    ◙ It is the maximum depth of a tree, same as GBM.
    ◙ It is used to control over-fitting, because a deeper tree allows the model to learn relations very specific to a particular sample.
    ◙ It should be tuned using cross validation.
    ◙ We should be careful when setting a large value of max_depth because XGBoost aggressively consumes memory when training a deep tree.
    ◙ Typical values range from 3 to 10.
  4. min_child_weight
    ◙ Default = 1
    ◙ It defines the minimum sum of instance weights (hessian) of all observations required in a child.
    ◙ It is used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    ◙ Too high values can lead to under-fitting. So, it should be tuned using cross validation.
    ◙ The larger min_child_weight is, the more conservative the algorithm will be.
    ◙ It ranges from 0 to infinity.
  5. max_delta_step
    ◙ Default = 0
    ◙ This is the maximum delta step allowed for each tree’s weight estimation. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
    ◙ Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
    ◙ Setting it to a value between 1 and 10 might help control the update.
    ◙ Its value ranges from 0 to infinity.
  6. subsample
    ◙ Default = 1
    ◙ It denotes the fraction of observations to be randomly sampled for each tree.
    ◙ Lower values make the algorithm more conservative and prevent overfitting, but too small values might lead to under-fitting.
    ◙ Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting.
    ◙ Subsampling occurs once in every boosting iteration.
    ◙ Typical values range from 0.5 to 1.
    ◙ Its valid range is (0, 1].
  7. colsample_bytree, colsample_bylevel, colsample_bynode
    ◙ Default = 1
    ◙ This is a family of parameters for subsampling of columns.
    ◙ All colsample_by parameters have a range of (0, 1], the default value of 1, and specify the fraction of columns to be subsampled.
    colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
    colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
    colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
    ◙ colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree': 0.5, 'colsample_bylevel': 0.5, 'colsample_bynode': 0.5} with 64 features will leave 8 features to choose from at each split.
  8. lambda
    ◙ Default = 1
    ◙ This is used to handle the regularization part of XGBoost.
    ◙ L2 regularization term on weights (analogous to Ridge regression).
    ◙ Increasing this value will make the model more conservative.
    ◙ Though many data scientists don’t use it often, it should be explored to reduce overfitting.
  9. alpha
    ◙ Default = 0
    ◙ L1 regularization term on weights (analogous to Lasso regression).
    ◙ It can be used in case of very high dimensionality so that the algorithm runs faster.
    ◙ Increasing this value will make the model more conservative.
  10. scale_pos_weight
    ◙ Default = 1
    ◙ It controls the balance of positive and negative weights.
    ◙ It is useful for imbalanced classes.
    ◙ A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
    ◙ A typical value to consider: sum(negative instances) / sum(positive instances).
  11. max_leaves
    ◙ Default = 0
    ◙ It controls the maximum number of nodes to be added.
    ◙ Only relevant when grow_policy=lossguide is set.
  12. tree_method
    ◙ Default = auto
    ◙ The tree construction algorithm used in XGBoost.
    ◙ XGBoost supports approx, hist and gpu_hist for distributed training. Experimental support for external memory is available for approx and gpu_hist.
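
To see how gamma, lambda, min_child_weight and eta interact, here is a minimal sketch of the split-gain and leaf-weight formulas from the XGBoost paper, written in plain Python. It assumes alpha = 0 (no L1 term) and takes the summed gradients G and hessians H of each child as given; the function names are made up for this illustration and are not part of the xgboost API.

```python
def leaf_weight(G, H, reg_lambda=1.0, eta=0.3):
    """Optimal leaf weight -G / (H + lambda), shrunk by the learning rate eta."""
    return eta * (-G / (H + reg_lambda))


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split after subtracting the gamma penalty."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma


def should_split(G_left, H_left, G_right, H_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject splits whose children are too light or whose gain is not positive."""
    if H_left < min_child_weight or H_right < min_child_weight:
        return False
    return split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma) > 0


# Toy numbers: the same candidate split under increasing gamma.
for gamma in (0.0, 2.0, 5.0):
    print(gamma, should_split(-4.0, 2.0, 3.0, 2.0, gamma=gamma))
```

With these toy numbers the split is accepted for gamma = 0 and gamma = 2 but rejected for gamma = 5, which is the sense in which a larger gamma makes the algorithm more conservative. Raising lambda or min_child_weight has a similar effect through the denominator and the child-weight check.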

There are other hyperparameters as well, such as sketch_eps, updater, refresh_leaf, process_type, grow_policy, max_bin, predictor and num_parallel_tree.
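
Two pieces of arithmetic from the tree booster list above, the cumulative effect of the colsample_by* parameters and the typical scale_pos_weight value, can be checked with a few lines of Python. Both helper functions are hypothetical and ignore any rounding XGBoost applies internally; the class counts are invented for the example.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Approximate number of columns left to choose from at each split."""
    return n_features * bytree * bylevel * bynode


def typical_scale_pos_weight(n_negative, n_positive):
    """Rule of thumb: sum(negative instances) / sum(positive instances)."""
    return n_negative / n_positive


# 64 features with all three ratios at 0.5 leave 8 features per split.
print(features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5))  # 8.0

# Invented, imbalanced data set: 900 negatives and 100 positives.
tree_params = {
    "eta": 0.3,
    "max_depth": 6,
    "subsample": 1,
    "colsample_bytree": 0.5,
    "colsample_bylevel": 0.5,
    "colsample_bynode": 0.5,
    "scale_pos_weight": typical_scale_pos_weight(900, 100),
}
print(tree_params["scale_pos_weight"])  # 9.0
```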

Learning Task Parameters

  1. objective
    ◙ Default = reg:squarederror (named reg:linear in older releases)
    ◙ It defines the loss function to be minimized. Most commonly used values are given below -
    reg:squarederror: regression with squared loss.
    reg:squaredlogerror: regression with squared log loss $\frac{1}{2}\left[\log(pred+1) - \log(label+1)\right]^2$. All input labels are required to be greater than -1.
    reg:logistic: logistic regression.
    binary:logistic: logistic regression for binary classification, output probability.
    binary:logitraw: logistic regression for binary classification, output score before logistic transformation.
    binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
    multi:softmax: set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes).
    multi:softprob: same as softmax, but outputs a vector of ndata * nclass values, which can be further reshaped into an ndata x nclass matrix. The result contains the predicted probability of each data point belonging to each class.
  2. eval_metric
    ◙ Default = according to objective
    ◙ The metric to be used for validation data.
    ◙ The default values are rmse for regression, error for classification and mean average precision for ranking (from XGBoost 1.3.0 onward, binary:logistic defaults to logloss instead of error).
    ◙ We can add multiple evaluation metrics.
    Python users must pass the metrics as a list of parameter pairs instead of a map, so that a later eval_metric does not override an earlier one.
    ◙ The most common values are given below -
    rmse: root mean square error
    mae: mean absolute error
    logloss: negative log-likelihood
    error: Binary classification error rate (0.5 threshold). It is calculated as [#(wrong cases)/#(all cases)]. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
    merror: Multiclass classification error rate. It is calculated as [#(wrong cases)/#(all cases)].
    mlogloss: Multiclass logloss
    auc: Area under the curve
    aucpr: Area under the PR curve
  3. seed
    ◙ Default = 0
    ◙ The random number seed.
    ◙ It can be used for generating reproducible results and also for parameter tuning.
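
To make the metric definitions concrete, the following plain-Python sketch computes error, rmse, mae and logloss by hand for a toy set of predictions. These are straightforward textbook implementations with invented data, not the code XGBoost runs internally.

```python
import math


def error_rate(preds, labels, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions above the threshold count as positive."""
    wrong = sum((p > threshold) != bool(y) for p, y in zip(preds, labels))
    return wrong / len(labels)


def rmse(preds, labels):
    """Root mean square error."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))


def mae(preds, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def logloss(preds, labels):
    """Negative log-likelihood, averaged over the instances."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, labels))
    return -total / len(labels)


probs = [0.9, 0.4, 0.6, 0.2]   # predicted probabilities (invented)
labels = [1, 1, 0, 0]
print(error_rate(probs, labels))  # 0.5: the second and third predictions are wrong
print(logloss(probs, labels))
print(rmse([2.0, 4.0], [1.0, 1.0]), mae([2.0, 4.0], [1.0, 1.0]))
```

Note that error only looks at which side of the threshold a prediction falls on, while logloss also penalizes how confident a wrong prediction was.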

If you’ve been using Scikit-Learn till now, these parameter names might not look familiar. The good news is that the xgboost module in Python has an sklearn wrapper called XGBClassifier, which uses the sklearn-style naming convention. The parameter names that change are:

  1. eta -> learning_rate
  2. lambda -> reg_lambda
  3. alpha -> reg_alpha

You might be wondering why we have defined everything except something similar to the “n_estimators” parameter in GBM. It exists as a parameter in XGBClassifier, but in the standard xgboost implementation it has to be passed as “num_boost_round” when calling the train function.
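
The renaming can be sketched as a small translation step. The to_sklearn_kwargs helper below is hypothetical (it is not part of xgboost); it only maps the native names discussed above to their sklearn-wrapper equivalents, including the number of boosting rounds, which the native API takes as the num_boost_round argument of xgboost.train rather than as an entry in the parameter dictionary.

```python
# Native name -> sklearn-wrapper name, for the parameters discussed above.
RENAMES = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
}


def to_sklearn_kwargs(native_params, num_boost_round):
    """Hypothetical helper: build keyword arguments for XGBClassifier."""
    kwargs = {RENAMES.get(name, name): value for name, value in native_params.items()}
    kwargs["n_estimators"] = num_boost_round
    return kwargs


native_params = {"eta": 0.1, "max_depth": 4, "lambda": 1, "alpha": 0}
sk_kwargs = to_sklearn_kwargs(native_params, num_boost_round=200)
print(sk_kwargs)

# With xgboost installed, the two styles would then look roughly like:
#   booster = xgboost.train(native_params, dtrain, num_boost_round=200)
#   model = xgboost.XGBClassifier(**sk_kwargs).fit(X, y)
```

Parameters such as max_depth keep the same name in both interfaces, so the helper passes them through unchanged.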

If you wish to explore this algorithm further, please refer to the official documentation.

I hope you find this useful.
Thanks for reading! Stay Safe!
