Overfitting with R xgboost

Tags: r, xgboost

#1

Below are the three R calls I use with XGBoost:

params <- list(booster = "gbtree", objective = "binary:logistic", eta = 0.3,
               gamma = 5, max_depth = 3, min_child_weight = 1,
               subsample = 1, colsample_bytree = 1)

xgbcv <- xgb.cv(params = params, data = dtrain, nrounds = 100, nfold = 5,
                showsd = TRUE, stratified = TRUE, print_every_n = 10,
                early_stopping_rounds = 20, maximize = FALSE)

# Note: with two eval_metric entries, early stopping uses the last one ("logloss").
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = xgbcv$best_iteration,
                  watchlist = list(val = dtest, train = dtrain), print_every_n = 10,
                  early_stopping_rounds = 10, maximize = FALSE,
                  eval_metric = "error", eval_metric = "logloss")

Which parameters should be tuned to avoid overfitting, that is, roughly, too large a gap between the train and test sets?
With mlogit logistic regression I can achieve 25% success on the training set and 23% on the test set, which is very little degradation,
but with my XGBoost code, while I can easily reach more than 30% on the training set, the score on the test set always drops to at most half of that.
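
For reference, the parameters most often tightened against this kind of train/test divergence are eta (lowered, with more rounds), max_depth, min_child_weight, subsample/colsample_bytree and the lambda/alpha penalties. A minimal sketch follows; the values are illustrative assumptions, not tuned results, and dtrain is the DMatrix from the code above:

# Sketch only: illustrative anti-overfitting settings, not tuned for any dataset.
library(xgboost)

params_reg <- list(
  booster          = "gbtree",
  objective        = "binary:logistic",
  eta              = 0.05,  # smaller learning rate; compensate with more rounds
  max_depth        = 3,     # shallower trees generalise better
  min_child_weight = 5,     # require more evidence before creating a leaf
  gamma            = 5,     # minimum loss reduction needed to split
  subsample        = 0.8,   # row subsampling per tree
  colsample_bytree = 0.8,   # column subsampling per tree
  lambda           = 2,     # L2 penalty on leaf weights
  alpha            = 1      # L1 penalty on leaf weights
)

# Let cross-validation pick the number of rounds via early stopping.
xgbcv_reg <- xgb.cv(params = params_reg, data = dtrain, nrounds = 1000,
                    nfold = 5, stratified = TRUE, print_every_n = 50,
                    early_stopping_rounds = 20, maximize = FALSE)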


#2

@sandoz it’s difficult to suggest anything since you have not specified the problem you are working on, and even the size and structure of the data are unknown. Anyway, you can try some other, simpler algorithms to check whether the overfitting still happens, for example with the baseline sketch below.
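
A minimal baseline sketch; train_df, test_df and the 0/1 response column y are hypothetical names standing in for your data:

# Sketch: train/test accuracy of a plain logistic regression baseline.
# 'train_df', 'test_df' and the 0/1 response 'y' are hypothetical names.
fit <- glm(y ~ ., data = train_df, family = binomial())

acc <- function(model, df) {
  pred <- as.integer(predict(model, newdata = df, type = "response") > 0.5)
  mean(pred == df$y)
}

acc(fit, train_df)  # training accuracy
acc(fit, test_df)   # test accuracy; a similar gap here would point past XGBoost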


#3

Some other, simpler algorithms? Please read what I said about mlogit.
XGBoost is said to give better results than logistic regression. Is that true?


#4

> It’s difficult to suggest anything since you have not specified the problem you are working on, and even the size and structure of the data are unknown.

@sandoz The reason I said this is that without the dataset or the problem statement, it is difficult to suggest a better alternative or to pinpoint the issue you might be facing.

If you could share them, it would help me (or the community) help you solve the problem.

As for whether XGBoost always beats logistic regression: this is not always the case; let me explain why.

Suppose, for a regression problem, you have training data with a clear linear trend, as in the plot below:

[scatter plot: training data following a roughly linear trend; image not recovered]

In this case, a linear model (e.g. linear regression) would likely perform better than a non-linear model (a decision tree, XGBoost). The algorithm you choose largely depends on the data you are modelling, so it is not the case that XGBoost will always be better than logistic regression. The small simulation below makes this concrete.
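
A sketch of that claim, assuming simulated data with a purely linear trend (not from the original post):

# Sketch: on purely linear data, linear regression tends to beat a tree ensemble.
library(xgboost)

set.seed(1)
n   <- 500
x   <- runif(n, 0, 10)
y   <- 2 * x + rnorm(n)        # linear trend plus noise
idx <- sample(n, 0.7 * n)      # 70/30 train/test split

lm_fit <- lm(y ~ x, data = data.frame(x, y)[idx, ])

dtrain_sim <- xgb.DMatrix(matrix(x[idx]), label = y[idx])
xgb_fit <- xgb.train(params = list(objective = "reg:squarederror",
                                   eta = 0.3, max_depth = 3),
                     data = dtrain_sim, nrounds = 100)

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
rmse(y[-idx], predict(lm_fit, data.frame(x = x[-idx])))  # linear model
rmse(y[-idx], predict(xgb_fit, matrix(x[-idx])))         # XGBoost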

My suggestion would be to explore the data and find the trends/insights in it. That would give you a better perspective on the approach or algorithm to take for the problem at hand.