What is the maximum dataset size for XGBoost?

xgboost

#1

So the question is twofold. Firstly, what is a reasonable upper bound on the number of training samples for XGBoost?

Secondly, what is a reasonable upper bound for the number of features for XGBoost?


#2

XGBoost has no hard limit on the number of training samples. As long as the data fits in your RAM, you are good to go. And even if it doesn't, if you are willing to add a layer of complexity, you can use XGBoost's interfaces to Spark or Flink to distribute the learning task across multiple machines.
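To make that concrete, here is a minimal in-memory training sketch using the native API. The synthetic data and parameter values are purely illustrative, not recommendations:

```python
import numpy as np
import xgboost as xgb

# Synthetic data standing in for a large training set; the only real
# constraint is that the arrays (and the DMatrix built from them) fit in RAM.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))
y = (X[:, 0] + rng.normal(size=100_000) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100)
```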

There is no limit on the number of features either, since tree algorithms do feature selection on their own. Working in very high dimensions can be troublesome, though: training time will increase, and the regularization hyperparameters may need finer tuning.
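For example, these are the kinds of knobs one might tighten on a wide, mostly-noisy dataset. Again a sketch with made-up data; the exact values would come from cross-validation, not from here:

```python
import numpy as np
import xgboost as xgb

# Wide synthetic data: 2,000 rows, 5,000 mostly-irrelevant features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5_000))
y = (X[:, 0] - X[:, 1] + rng.normal(size=2_000) > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# Regularization knobs that tend to matter more as dimensionality grows;
# the values are illustrative starting points, not recommendations.
params = {
    "objective": "binary:logistic",
    "eta": 0.05,
    "max_depth": 4,           # shallower trees overfit less
    "lambda": 2.0,            # L2 penalty on leaf weights
    "alpha": 1.0,             # L1 penalty, drives weak leaf weights to zero
    "colsample_bytree": 0.3,  # each tree sees only 30% of the features
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```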


#3

Great, thank you Caiotaniguchi.

Other than the time taken to train the model (and any parameter tuning), are there any negative effects on the model? Is there any way in which accuracy could decrease?


#4

@c3josh, accuracy can decrease on the test set because of overfitting, but if you tune the model properly, it should work just the same.
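One practical way to catch that kind of overfitting is to watch a held-out set and stop boosting when it stops improving. A sketch with synthetic data and illustrative settings:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 100))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=20_000) > 0).astype(int)

# Hold out part of the data so test-set degradation shows up during training.
split = 15_000
dtrain = xgb.DMatrix(X[:split], label=y[:split])
dvalid = xgb.DMatrix(X[split:], label=y[split:])

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6}
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=1_000,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=20,  # stop once validation loss stops improving
)
```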

As for whether dimensionality reduction helps, it usually doesn't. The only case where it can help is when the features are very noisy. Throwing away the garbage keeps XGBoost from picking up spurious patterns that do not generalize.
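If you want to check that claim on your own data, a quick comparison is cheap to run. A sketch using sklearn's PCA as the reduction step (the data and the component count are made up):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 200))
y = (X[:, 0] + rng.normal(size=5_000) > 0).astype(int)

model = XGBClassifier(n_estimators=100, max_depth=4)

# Cross-validated accuracy on raw features vs. PCA-reduced features.
# (Fitting PCA on all rows leaks a little across folds; fine for a rough check.)
raw = cross_val_score(model, X, y, cv=5).mean()
pca = cross_val_score(model, PCA(n_components=50).fit_transform(X), y, cv=5).mean()
print(f"raw: {raw:.3f}  pca: {pca:.3f}")
```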


#5

Brilliant. Thank you.

I did read, or hear, this about dimensionality reduction somewhere. I think it was Owen Zhang who said he has tried it before and it has never improved a model for him… though that is probably because he has already squeezed every ounce of information from the data through feature engineering beforehand!