Below are the three R statements I use with XGBoost:
library(xgboost)

# Model parameters
params <- list(booster = "gbtree", objective = "binary:logistic", eta = 0.3, gamma = 5,
               max_depth = 3, min_child_weight = 1, subsample = 1, colsample_bytree = 1)

# Cross-validation to pick the number of boosting rounds
xgbcv <- xgb.cv(params = params, data = dtrain, nrounds = 100, nfold = 5,
                showsd = TRUE, stratified = TRUE, print_every_n = 10,
                early_stopping_rounds = 20, maximize = FALSE)

# Final model, monitored on both the validation and the training set
xgb1 <- xgb.train(params = params, data = dtrain, nrounds = xgbcv$best_iteration,
                  watchlist = list(val = dtest, train = dtrain), print_every_n = 10,
                  early_stopping_rounds = 10, maximize = FALSE,
                  eval_metric = "error", eval_metric = "logloss")
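For reference, dtrain and dtest are assumed to be xgb.DMatrix objects built before the calls above; a minimal sketch of how they might be constructed (train_df, test_df and the column "y" are placeholder names, not my actual data):

# Hypothetical construction of the DMatrix objects referenced above;
# train_df / test_df and the label column "y" are placeholders.
dtrain <- xgb.DMatrix(data = as.matrix(train_df[, setdiff(names(train_df), "y")]),
                      label = train_df$y)
dtest  <- xgb.DMatrix(data = as.matrix(test_df[, setdiff(names(test_df), "y")]),
                      label = test_df$y)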
Which parameters should I tune to avoid overfitting, that is, roughly speaking, too large a gap between the train and test sets?
With mlogit logistic regression I can achieve a 25% success rate on the training set and 23% on the test set, which is very little degradation,
but with my xgboost code, while I can easily reach more than 30% on the training set, the test-set rate always drops to less than half of that.
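To quantify that gap per boosting round, I look at the evaluation log that xgb.train fills in from the watchlist above (a minimal sketch; the column names follow the val/train watchlist names and the error/logloss metrics in my call):

# Per-iteration train vs. validation metrics stored in the fitted model
head(xgb1$evaluation_log)

# Plot both logloss curves to see where they start to diverge (overfitting)
with(xgb1$evaluation_log, {
  plot(iter, train_logloss, type = "l",
       ylim = range(c(train_logloss, val_logloss)),
       xlab = "boosting round", ylab = "logloss")
  lines(iter, val_logloss, col = "red")
  legend("topright", legend = c("train", "val"), col = c("black", "red"), lty = 1)
})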