I’m working on the Titanic problem on Kaggle. I’m focusing on getting a reasonably good solution using Logistic Regression. I’ve created some features, and the training and test sets I’m using are:
test_modified.csv (50.4 KB)
train_modified.csv (109.0 KB)
I’m facing a peculiar issue. When I train the model using only “Sex” as a predictor, I get 78.67% training accuracy and a 10-fold cross-validation mean score of 78.67% with a standard deviation of 0.04. This scores 0.76555 on submission.
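For reference, the Sex-only baseline looks roughly like this. This is a sketch on synthetic data standing in for train_modified.csv (the 0/1 encoding and the per-group survival rates here are invented for illustration, not my actual file):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_modified.csv (891 rows, like the Titanic train set):
# Sex encoded 0/1; the survival rates per group are made up for this sketch.
rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=891)
survived = (rng.random(891) < np.where(sex == 1, 0.74, 0.19)).astype(int)

# Single-feature logistic regression, scored with 10-fold CV.
X = sex.reshape(-1, 1)
scores = cross_val_score(LogisticRegression(), X, survived,
                         cv=10, scoring="accuracy")
print(f"mean={scores.mean():.4f} std={scores.std():.4f}")
```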
Now I’ve run RFECV (recursive feature elimination with cross validation) on my dataset and I get the following graph:
Here, the top 10 variables and their coefficients are:
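The selection step I ran is essentially scikit-learn’s RFECV wrapped around LogisticRegression; the snippet below sketches it on synthetic data (make_classification stands in for my engineered training set), and also shows where the selected features and their coefficients come from:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the engineered training set.
X, y = make_classification(n_samples=891, n_features=20,
                           n_informative=5, random_state=0)

# Recursive feature elimination with 10-fold cross-validation.
selector = RFECV(LogisticRegression(max_iter=1000), step=1,
                 cv=StratifiedKFold(10), scoring="accuracy")
selector.fit(X, y)

# Indices of the features RFECV keeps, and the coefficients of the
# logistic regression refit on just those features.
kept = np.flatnonzero(selector.support_)
coefs = selector.estimator_.coef_.ravel()
print("kept features:", kept.tolist())
print("coefficients:", coefs.round(3).tolist())
```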
These give a training accuracy of 82.72% and a 10-fold cross-validation mean accuracy of 81.59% with a standard deviation of 0.0255. When I submit this solution to Kaggle, I get a score of 0.75598, which is about 0.01 lower than the model with only Sex as a predictor.
This is really strange. If the model is overfitting, why am I getting a high cross-validation score? Is there something I’m missing here? What other metrics can I use to diagnose the model?
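So far the only extra check I’ve tried is comparing train-fold accuracy against CV accuracy, plus AUC and log loss, roughly like this (again sketched on synthetic data in place of my actual training set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the engineered training set.
X, y = make_classification(n_samples=891, n_features=10,
                           n_informative=4, random_state=0)

# return_train_score=True lets us compare train vs CV performance;
# a large gap is a cleaner overfitting signal than CV accuracy alone.
res = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10,
                     scoring=["accuracy", "roc_auc", "neg_log_loss"],
                     return_train_score=True)

gap = res["train_accuracy"].mean() - res["test_accuracy"].mean()
print(f"train-vs-CV accuracy gap: {gap:.4f}")
print(f"CV AUC: {res['test_roc_auc'].mean():.4f}")
print(f"CV log loss: {-res['test_neg_log_loss'].mean():.4f}")
```

The gap here is small, which is exactly my confusion: by this check the RFECV model doesn’t look overfit either, yet the leaderboard score drops.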