harry
September 29, 2015, 4:12pm
I am currently working on a classification problem using the xgboost algorithm. There are four necessary attributes for model specification:
data - the input data
label - the target variable
nround - the number of trees (boosting rounds) in the model
objective - for regression use 'reg:linear' and for binary classification use 'binary:logistic'
I want to know how to decide the value of nround so that the model does not overfit.
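For context, a minimal version of my setup looks roughly like this (a sketch only; X and y are placeholders for my feature matrix and target, and 100 is an arbitrary round count):

import xgboost as xgb

# data and label go into the DMatrix
dtrain = xgb.DMatrix(X, label=y)

# objective is passed via the parameter dict
params = {"objective": "binary:logistic"}

# nround corresponds to the num_boost_round argument of xgb.train
model = xgb.train(params, dtrain, num_boost_round=100)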
Hi @harry
Try cross-validation (e.g. 4-fold or 7-fold) and evaluate the error metric for each boosting round. Example code is given below -
import xgboost as xgb

# model parameters
params = {}
params["objective"] = "binary:logistic"
params["eta"] = 0.01                  # learning rate
params["min_child_weight"] = 7
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 0
params["max_depth"] = 4
params["seed"] = 0
params["eval_metric"] = "auc"

# wrap the training and test sets in DMatrix objects
xgtrain = xgb.DMatrix(x_train, label=y_train, missing=-999)
xgtest = xgb.DMatrix(x_test, missing=-999)

num_rounds = 3000
cv_results = xgb.cv(params, xgtrain, num_rounds, nfold=4,
                    metrics={'auc'}, seed=0)
xgb.cv returns the mean and standard deviation of the train and test AUC for every boosting round, so you can pick as your optimal nround the round at which the test CV score is maximum.
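As a rough sketch of how to read off that optimal round (assuming a recent xgboost version where xgb.cv returns a pandas DataFrame with a 'test-auc-mean' column):

# rows of cv_results are 0-indexed, one per boosting round
best_round = cv_results["test-auc-mean"].idxmax() + 1

# alternatively, stop automatically once the test AUC has not
# improved for 50 consecutive rounds; the returned DataFrame is
# then truncated at the best iteration
cv_results = xgb.cv(params, xgtrain, num_rounds, nfold=4,
                    metrics={'auc'}, seed=0,
                    early_stopping_rounds=50)
best_round = len(cv_results)

The early-stopping variant saves you from running all 3000 rounds when the score plateaus early.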
Hope this helps.
Regards,
Aayush