How XGBoost Enforces Global Monotonicity for Features
XGBoost can train monotonicity for user-specified features, in the Gradient Boosted Decision Trees model. I have used linear programming to verify that the trained monotonicity is sound. This is how they enforce it.
Preliminary: To learn a node in a decision tree of the ensemble, the training procedure enumerates all possible splits for all features. Each split is in the form: feature j< threshold, and the training enumerates all j and all threshold values. Then, it picks the split with the best score, which is the gain metric.
WLOG, let’s assume that we specify the first feature to be monotonically increasing. The goal is to achieve:
∀ x₁≤x₁’, f(x₁, x₂, …, xₙ) ≤ f(x₁’, x₂, …, xₙ)
XGBoost does the following.
- Local monotonic split: Whenever the training considers a node: x₁ < some threshold, it computes the optimal left leaf and right leaf values, as if that is the final split. The split is only taken if the left leaf ≤ right leaf, otherwise it is discarded and the algorithm considers other splits.
- Local monotonic tree: If there are further children splits, any leaves of the left (right) descendants will be smaller (greater) than the current optimal (left leaf + right leaf)/2, so the local monotonic split is preserved down the road. This value constraint is enforced throughout the trees if the monotone constraint is specified. So within a decision tree, left descendant leaves of x₁ < right descendant leaves of x₁.
- Global monotonic sum-ensemble: XGBoost uses sum of all leaves as the final prediction of the ensemble. This preserves monotonicity for the ensemble.
- If we use fᵢ to represent tree i: ∀ x₁<x₁’, fᵢ(x₁, x₂, …, xₙ) ≤ fᵢ(x₁’, x₂, …, xₙ), where fᵢ maps the input to a concrete value.
- Therefore, we have Σᵢ fᵢ(x₁, x₂, …, xₙ) ≤ Σᵢ fᵢ(x₁’, x₂, …, xₙ).
For code, see src/tree/split_evaluator.h.