Kaggle Titanic: Logistic Regression - higher cross-validation score results in lower accuracy on submission

crossvalidation
logistic_regression
python
scikit-learn

#1

Hi,

I’m working on the Titanic problem at Kaggle, focusing on getting a reasonably good solution using Logistic Regression. I’ve created some features, and the training and test sets I’m using are:
test_modified.csv (50.4 KB)
train_modified.csv (109.0 KB)

I’m facing a peculiar issue. When I train the model using only “Sex” as the predictor, I get an accuracy of 78.67% and a 10-fold cross-validation mean score of 78.67% with a standard deviation of 0.04. This gives 0.76555 on submission.
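
For reference, a minimal sketch of this single-feature setup in scikit-learn could look like the code below (assuming “Sex” is already numerically encoded in train_modified.csv and the target column is “Survived” — those names are my assumption, not necessarily the exact script used):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train_modified.csv")
X = train[["Sex"]]       # single predictor
y = train["Survived"]    # target (assumed column name)

model = LogisticRegression()
model.fit(X, y)
print("Accuracy on training data: %.4f" % model.score(X, y))

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("10-fold CV: mean=%.4f, std=%.4f" % (scores.mean(), scores.std()))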

Now I’ve run RFECV (recursive feature elimination with cross-validation) on my dataset and I get the following graph:

Here, the top 10 variables and their coefficients are:

These give an accuracy of 82.72%. The 10-fold cross-validation mean accuracy is 81.59% with a standard deviation of 0.0255. When I submit this solution on Kaggle, I get a score of 0.75598, which is about 0.01 lower than the model with only Sex as a predictor.
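
A rough sketch of how such an RFECV run and its CV evaluation could be set up is below (the column names, 10-fold stratified CV, and accuracy scoring are assumptions for illustration):

import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

train = pd.read_csv("train_modified.csv")
y = train["Survived"]
X = train.drop("Survived", axis=1)

logreg = LogisticRegression()
selector = RFECV(logreg, step=1, cv=StratifiedKFold(10), scoring="accuracy")
selector.fit(X, y)

selected = X.columns[selector.support_]
print("Selected features:", list(selected))

# Coefficients of the model refit on the selected features
coefs = pd.Series(selector.estimator_.coef_[0], index=selected)
print(coefs.sort_values())

scores = cross_val_score(logreg, X[selected], y, cv=10, scoring="accuracy")
print("10-fold CV: mean=%.4f, std=%.4f" % (scores.mean(), scores.std()))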

This is really strange. If the model is overfitting, why am I getting a high cross-validation score? Is there something I’m missing here? What other metrics can I use to diagnose the model?

Please help.

Thanks,
Aarshay


#2

Hi @Aarshay,

The Kaggle public LB for this problem is scored on only 50% of the test data, so it is possible that your submission actually has higher accuracy yet still scores lower on the public LB. In this case you should mostly trust your CV.

To avoid overfitting, don’t rely on just one model. Try an ensemble of 3-4 models to get a relatively stable result, along the lines of the sketch below.
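
For example, a simple soft-voting ensemble in scikit-learn could look something like this (the choice of base models is purely illustrative, and X/y stand for your prepared training features and target):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average the predicted probabilities
)

scores = cross_val_score(ensemble, X, y, cv=10, scoring="accuracy")
print("Ensemble 10-fold CV: mean=%.4f, std=%.4f" % (scores.mean(), scores.std()))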

Hope this helps.

Regards,
Aayush


#3

Thanks @aayushmnit,

I’ll trust my CV then. Actually, the score was dropping significantly (~0.2). Also, since I’m working on Titanic, they will probably show the private LB at the end of this year.

I’ll try on other competitions.

Cheers,
Aarshay