Error in XGBoost Cross-Validation and Prediction Output in R

r
xgboost
crossvalidation

#1

Hi

I am working on a data set in R that requires predicting a categorical variable. The output variable has two categories, 1 and 0. In XGBoost, I have set the num_class parameter to 2.

There are 600 rows in the training set and 350 rows in the test set.

**I am facing multiple issues.**

First Problem
After I run the XGBoost model with cross-validation:

xg_model <- xgb.cv(data = data.matrix(dum_train[,-1]), label = x,
                   objective = "multi:softprob", num_class = 2,
                   nfold = 10, nrounds = 200, eta = 0.1,
                   subsample = 0.5, colsample_bytree = 0.5,
                   max_depth = 6, min_child_weight = 1,
                   eval_metric = "merror", prediction = TRUE)

The result shows up like this:
[179] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[180] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[181] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[182] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[183] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[184] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[185] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[186] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[187] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[188] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[189] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[190] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[191] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[192] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[193] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[194] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[195] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[196] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[197] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[198] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000
[199] train-merror:0.000000+0.000000 test-merror:0.000000+0.000000

Question 1: Does this validation result suggest I am overfitting too much? If yes, what can I do to avoid overfitting?

Second Problem

After running this model, I predicted values on my test set. As mentioned above, my test set has 350 rows, so I expect 350 predicted values from the model. But the prediction I get has 700 values, double the number of rows in the test set.

Question 2: Why is this happening? What am I doing wrong here?


#2

Is there a specific reason to use eval_metric = "merror"? Why don't you use "error" or "mlogloss", since you only have 2 categories in your target variable?


#3

Hi @Akash_Haldankar

Thanks for pointing that out. I just checked; for binary classification, "error" is a better choice.
I would appreciate it if you could answer both of my questions. Why am I getting double the number of predicted values compared to the rows in my test set?

Regards
Supra


#4

@supra_minion, as you have used multi:softprob, you are getting the predicted probability of each class. The target has two classes, 0 and 1, and the data has 350 rows, so the prediction returns 700 data points (2 × 350). If you want the model to return class labels instead of probabilities, you can use multi:softmax. However, as your problem is binary classification, not multi-class classification, you can use binary:logistic as the objective for the model. Please go through this link, which will help you decide on the parameters and which evaluation metric to use for an XGBoost model: https://xgboost.readthedocs.org/en/latest/parameter.html
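
For illustration, here is a minimal sketch of how the flat multi:softprob output maps back to one row per observation (bst is a model refit with xgboost(), and dum_test stands in for your test feature matrix; both names are illustrative, not from your post):

library(xgboost)
# refit on the training data with the same multi-class objective (sketch)
bst <- xgboost(data = data.matrix(dum_train[,-1]), label = x,
               objective = "multi:softprob", num_class = 2,
               nrounds = 50, verbose = 0)
pred <- predict(bst, data.matrix(dum_test))    # flat vector of length 2 * 350 = 700
prob <- matrix(pred, ncol = 2, byrow = TRUE)   # 350 x 2: P(class 0) and P(class 1) per row
pred_label <- max.col(prob) - 1                # class with the higher probability, per row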

Hope this helps.


#5

Hello @supra_minion,

Please try and use the code below (modify it accordingly):

library(xgboost)
library(car)    # provides recode(), used below

dtrain <- xgb.DMatrix(data.matrix(df_train), label = x)
dtest <- xgb.DMatrix(data.matrix(df_test))

# xgboost parameters
param <- list("objective" = "binary:logistic",    # binary classification
              "eval_metric" = "auc",              # evaluation metric
              "nthread" = 6,                      # number of threads to be used
              "max_depth" = 15,                   # maximum depth of tree
              "eta" = 0.08,                       # step size shrinkage
              "subsample" = 0.9,                  # fraction of data instances used to grow each tree
              "colsample_bytree" = 0.9)           # subsample ratio of columns for each tree

# set random seed, for reproducibility
set.seed(1234)
# k-fold cross-validation
nround.cv = 200    # number of boosting rounds to evaluate (1 would be too few to pick a maximum)
bst.cv <- xgb.cv(param = param, data = dtrain, nfold = 10,
                 nrounds = nround.cv, prediction = TRUE, verbose = FALSE)

# index of maximum auc
# (note: in recent xgboost versions the CV log lives in
#  bst.cv$evaluation_log$test_auc_mean rather than bst.cv$dt[, test.auc.mean])
max.auc.idx = which.max(bst.cv$dt[, test.auc.mean])
max.auc.idx
## [1] 7
# max auc:
bst.cv$dt[max.auc.idx,]

# real model fit on the full training data
xgb.bst <- xgboost(param = param, data = dtrain, nrounds = max.auc.idx, verbose = 1)
pred <- predict(xgb.bst, dtest)
prediction <- as.factor(as.numeric(pred > 0.5))
prediction <- recode(prediction, "0 = 'N'; 1 = 'Y'")
# Create submission file:
submission <- data.frame(Loan_ID = loan_id$Loan_ID, Loan_Status = prediction)
write.csv(submission, 'xgb_pred.csv', row.names = F)

Hope this helps!


#6

Hi @shuvayan

Thanks for helping me with the code.

Predictions have improved now. Earlier I was getting 700 predicted values; now I am getting 350, which is absolutely fine, since I have 350 observations in the test set. But the problem is that all the predicted probabilities are > 0.5, i.e. every prediction is "Y".

I couldn’t figure out the error. Need help.
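
For reference, here is the quick check I ran on the predicted probabilities (pred comes from the snippet above):

summary(pred)        # five-number summary of the predicted probabilities
table(pred > 0.5)    # how many predictions fall on each side of the 0.5 cut-off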


#7

I think there is overfitting; learn about techniques to avoid it, such as feature selection/extraction. Also try various combinations of missing-value treatment, like imputing with the mean/mode, replacing with -1, or using the VIM package to impute, as sketched below. Hope this will help.
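
A minimal sketch of the VIM-based option (df stands in for your combined feature data frame; the name is illustrative, not from your posts):

library(VIM)
# k-nearest-neighbour imputation; imp_var = FALSE drops the extra
# TRUE/FALSE indicator columns that kNN() appends by default
df_imputed <- kNN(df, k = 5, imp_var = FALSE)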


#8

bst.cv$dt is returning NULL. I have tried to run the code from "Practical Machine Learning Project with XGBoost" by soesilo wijono, which your code appears to be based on.
Please help.