Claim Rejection Classification Problem


I am trying to solve a classification problem. The dependent variable is Claim Rejected (Yes/No).
There are various independent variables like age at claim, claim amount, education level, number of days since the policy commenced, premium, sum assured, etc. I tried various algorithms, but a decision tree gave me the best result.

In the data set there are 72000 records in total (Yes: 5000, No: 67000). The decision tree's overall accuracy is 95%. However, the false negatives are too high, which means most of the actual Yes cases are not predicted as Yes.
My accuracy for the Yes class alone is 43% (67% of Yes are predicted as No), which I think is not good. Other algorithms I tried, like logistic regression, gave lower accuracy than the decision tree.

Can you guide me on what I should do further? Or is this an indication that I must be missing some independent variable, without which accuracy (for Yes) cannot be improved?




Hi there,
This is a problem of an unbalanced dataset, since you have around 93% No and around 7% Yes. That is why your model is heavily biased towards the No class. In my view, you can take two approaches:

1- Undersampling, i.e. reduce the number of data points with No to balance the counts of Yes and No. This is generally not preferred because you lose a lot of information in the process.
2- Oversampling, i.e. creating synthetic samples of the Yes class that match its distribution.

I would recommend you look into oversampling techniques.
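To make the idea concrete, here is a minimal sketch of the simplest form of oversampling (random oversampling with replacement) in base R. The data frame, column names, and class sizes below are made up for illustration; the real data would be your 72000-record claims set.

```r
# Minimal random-oversampling sketch in base R (no extra packages).
# `df` is a toy stand-in for the real claims data.
set.seed(42)
df <- data.frame(
  ClaimAmount = runif(100),
  ClaimStatus = c(rep("No", 93), rep("Yes", 7))  # ~93% / 7% imbalance
)

yes_rows <- df[df$ClaimStatus == "Yes", ]
no_rows  <- df[df$ClaimStatus == "No", ]

# Sample the minority class WITH replacement until it matches the majority.
yes_upsampled <- yes_rows[sample(nrow(yes_rows), nrow(no_rows), replace = TRUE), ]
balanced <- rbind(no_rows, yes_upsampled)

table(balanced$ClaimStatus)
```

Note that this just duplicates minority rows; methods like SMOTE instead interpolate new synthetic points, which often generalises better.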

Hope this helps.



Hi Neeraj,
Thanks for your time and input. Since the data is close to real, and I know that in practice the number of Yes will always be very low compared to No, would oversampling have any negative impact? Could it cause poor accuracy when real data comes in? Secondly, can you briefly explain how oversampling is done? Can I do this in R?




For Yes you have to do oversampling as @NSS said, i.e. you have to create some dummy observations for "yes". I think you are using the confusion matrix for accuracy; it's better to use AUC in these cases. For the algorithm, you should try a boosting method like AdaBoost or XGBoost. They are usually better.
One other thing: it's a little tricky to choose the threshold for "yes" or "no" in these cases. The model usually says that if the score is greater than 0.5 then it's 1 ("yes"), else 0 ("no"). But sometimes, if you set the threshold to 0.45 or 0.4 instead of 0.5, it may work better. Check the ROC/AUC curve to find the optimal value.
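The threshold idea above can be sketched in a few lines of base R. The scores and labels here are simulated stand-ins for the model's predicted probabilities and the true Yes/No labels:

```r
# Sketch: sweep the decision threshold instead of fixing it at 0.5.
# `truth` and `score` are toy stand-ins for real labels and predictions.
set.seed(1)
truth <- c(rep(1, 20), rep(0, 80))                    # 1 = "Yes"
score <- c(runif(20, 0.3, 0.9), runif(80, 0.0, 0.6))  # fake model scores

for (t in c(0.5, 0.45, 0.4)) {
  pred <- ifelse(score > t, 1, 0)
  tpr  <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # recall on "Yes"
  fpr  <- sum(pred == 1 & truth == 0) / sum(truth == 0)
  cat(sprintf("threshold %.2f: TPR = %.2f, FPR = %.2f\n", t, tpr, fpr))
}
```

Lowering the threshold can only increase the true positive rate (more cases get called "yes"), at the cost of a higher false positive rate; the ROC curve is exactly this trade-off plotted over all thresholds.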
Check this link for imbalanced classification:



Thanks Tapojyoti. I just wanted to ask: XGBoost, when tried on the original (training) data, gives 100% accuracy on the testing data, which I think is over-fitting. Any idea why? Also, I haven't installed caret, though the xgboost package is installed. And when I tried the oversampled data with the decision tree, the TP rate improved, but at the cost of the negatives. So is it now an optimization problem of setting a trade-off, or can the overall accuracy really be improved while keeping the TP rate high?



Can you share the code? In xgboost, we first need to separate out the target variable, and then we need to eliminate the target variable from the training set.


I am using the below code.

Start xgbm

bst <- xgboost(data = data.matrix(training_data[,]),
               label = data.matrix(training_data$ClaimStatus),
               max.depth = 2, eta = 0.5,
               nround = 2, objective = "binary:logistic")

End xgbm

pred <- predict(bst, data.matrix(testing_data[,]))



You have to eliminate ClaimStatus from data.matrix(training_data[,]) before training — that is why you are seeing 100% test accuracy: the model can read the answer directly from the features. Suppose the index of this column is "a"; then you have to write the code this way:
pred <- predict(bst, data.matrix(testing_data[,-a]))
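A minimal runnable version of that fix, using toy data and the classic xgboost R interface (the column names and data here are made up for illustration; in the real code `a` would be the index of ClaimStatus):

```r
library(xgboost)

# Toy data standing in for the real claims training set.
set.seed(7)
training_data <- data.frame(
  Age         = rnorm(50, 40, 10),
  Premium     = rnorm(50, 1000, 200),
  ClaimStatus = rbinom(50, 1, 0.3)
)

# Find the target column's index and drop it from the feature matrix.
a <- which(names(training_data) == "ClaimStatus")

bst <- xgboost(data = data.matrix(training_data[, -a]),
               label = training_data$ClaimStatus,
               max.depth = 2, eta = 0.5,
               nrounds = 2, objective = "binary:logistic",
               verbose = 0)

# Predictions are probabilities for the positive class.
pred <- predict(bst, data.matrix(training_data[, -a]))
```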

But one suggestion:
try different values of eta and nround. A lower value of eta (the learning rate) is usually preferable, e.g. 0.1 or 0.05.
For nround, decide the value by cross validation (via the function xgb.cv).
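A sketch of choosing the number of rounds with xgb.cv, again on simulated data (the matrix, labels, and parameter values below are illustrative, not tuned):

```r
library(xgboost)

# Toy feature matrix and binary labels standing in for the claims data.
set.seed(7)
X <- matrix(rnorm(200 * 4), nrow = 200)
y <- rbinom(200, 1, 0.3)

# 5-fold cross validation over 20 boosting rounds, tracking AUC.
cv <- xgb.cv(data = X, label = y,
             nrounds = 20, nfold = 5,
             eta = 0.1, max.depth = 2,
             objective = "binary:logistic",
             metrics = "auc", verbose = 0)

# evaluation_log holds per-round train/test metrics;
# a reasonable nrounds is where the held-out AUC peaks.
best_round <- which.max(cv$evaluation_log$test_auc_mean)
```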


Thanks Tapo, it was really helpful…