Why is XGBoost predicting unseen values?

gbm
xgboost
gradient_boosting

#1

I have an XGBoost model. It runs well enough with decent CV scores. The dataset has around 500,000 rows with 30 features, a mix of categorical, continuous, and binary. My test set is 250,000 rows. There are absolutely no negative values in the target variable; however, when predicting on the test set I get around 4,000 instances where a negative value is predicted. There are also 7,000 or so on the training set. Why is this?

Should this happen?

Are there parameters that can be tuned to prevent this?


#2

@c3josh, I have observed this phenomenon when using XGBoost for regression as well. I don't know exactly how it happens, but this is common behavior for other regression techniques, like linear regression, which can extrapolate beyond the observed values. So, depending on the inputs, the model can predict outcomes outside the training range.
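My rough understanding of why (not something I can state definitively): boosting sums the outputs of many trees fitted to residuals, so even though each individual tree's leaf values come from the training data, the accumulated sum can land below the target's minimum. Here is a small synthetic sketch (made-up data, not your dataset) that typically shows it:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Non-negative target: clipped at 0, so min(y) == 0 by construction.
y = np.maximum(X[:, 0] + rng.normal(scale=0.5, size=1000), 0)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y)

preds = model.predict(X)
print("min(y):", y.min())                           # exactly 0
print("negative predictions:", (preds < 0).sum())   # usually > 0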

One easy fix that you can apply is to use a threshold. If a value is beyond that threshold, it’s assigned the min/max acceptable value.
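In code, that threshold fix is just a clip. A minimal sketch, assuming a fitted `model`, a test matrix `X_test`, and 0 as the acceptable minimum:

```python
import numpy as np

# `model` and `X_test` are placeholders for your own fitted model and test set.
preds = model.predict(X_test)
preds = np.clip(preds, 0, None)  # floor negatives at the minimum acceptable value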


#3

Hi @c3josh

Reading your question raises several questions in my mind, but there isn't enough information to answer them.
Such questions are best answered when supported by a reproducible example. It would be good
if you could share your code along with one.

Regards
Manish


#4

I am glad it is not just me who has seen this. My understanding of trees, GBMs, and specifically XGBoost does not square with extrapolation to unseen values. I would love to find out why or how this can happen!

With the threshold, that is exactly what I am doing. I am replacing all negatives with zeros for now, and will then replace them with the min or the average of the bottom n values. The issue is that there is no way of applying those replacement values inside xgb.cv. It would be great if there were a postprocessing option, like the preproc attribute within XGBoost. Is there any way of doing this?


#5

Hi Manish, unfortunately the dataset I am using is probably too big to share as an example. I might try to reproduce this with a subset. Is there any post-hoc analysis I can do to give you some insight? Or something I can look for within the XGBoost model?


#6

@c3josh, I am not aware of any built-in post-processing for XGB. To fix the values within the CV, the easiest solution is to implement your own CV loop and call xgb.train instead of xgb.cv, but it's unlikely to matter much, as the number of negative outcomes is rather small. See the sketch below.
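A minimal sketch of what I mean (assuming `X` and `y` are NumPy arrays and `params` is a dict you already have; the fold count and round count are placeholders):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dvalid = xgb.DMatrix(X[valid_idx], label=y[valid_idx])
    booster = xgb.train(params, dtrain, num_boost_round=500)
    preds = np.clip(booster.predict(dvalid), 0, None)  # post-process inside the fold
    scores.append(mean_squared_error(y[valid_idx], preds) ** 0.5)  # RMSE

print("CV RMSE with clipped predictions:", np.mean(scores))
```

Clipping inside each fold keeps the reported CV score consistent with the post-processed predictions you would actually submit.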


#7

Yes, the impact would probably be small. I hadn't thought of that. Although in a competition this could be an important gain!


#8

I really doubt it would make much difference, even for competitions. It only affects about 1.6% of the test data. Assuming that XGB is doing a reasonably decent job, even the negative values should be close to 0, which translates to a low contribution to RMSE or other regression metrics when the true value is near 0.

If you want to work around it, you can add a flag marking whether the first model's output is negative as a feature to a second XGB model. This new feature should have a positive contribution, and hopefully the second model will fix the negative outputs.
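Something like this, as a sketch (variable names are placeholders, not from your pipeline):

```python
import numpy as np
import xgboost as xgb

# `X_train`, `y_train`, `X_test` are assumed to exist as NumPy arrays.
first = xgb.XGBRegressor()
first.fit(X_train, y_train)

# Flag the rows where the first model goes negative, and append it as a feature.
neg_flag_train = (first.predict(X_train) < 0).astype(int)
neg_flag_test = (first.predict(X_test) < 0).astype(int)

second = xgb.XGBRegressor()
second.fit(np.column_stack([X_train, neg_flag_train]), y_train)
final_preds = second.predict(np.column_stack([X_test, neg_flag_test]))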


#9

Here is an issue I raised on XGBoost's GitHub repo. The answers from Laurae2 were very helpful. It's probably worth a read for anyone interested in XGBoost regression, and probably classification too.


#10

@c3josh, Excellent! Thank you for posting this :thumbsup:

