Need Help Improving Accuracy of Model



This is what I followed:

  1. Loaded the dataset and separated the features and the target variable.
  2. Separated numeric, categorical and ordinal variables.
  3. Imputed numeric variables with the median (using groupby) and categorical/ordinal variables with the most frequent value.
  4. Encoded categorical variables with one-hot encoding (pd.get_dummies) and ordinal variables with label encoding.
  5. Used GridSearchCV to tune the hyperparameters of LogisticRegression, RandomForest, SVM, KNN and XGBoost. The highest accuracy was 0.784 with XGBoost; LogisticRegression, SVM and RandomForest gave 0.77.

I'm new to data science and have completed DataCamp courses and read Analytics Vidhya blog posts. I've spent a considerable amount of time on this problem but the accuracy is not increasing. I tried standardizing and scaling features for the algorithms that need it, and using a subset of the most significant features. Any help will be appreciated.
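For reference, steps 3-5 above can be sketched roughly like this. This is a minimal toy example with made-up column names (the real dataset's schema will differ), and only LogisticRegression is shown in the grid search to keep it short:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny synthetic stand-in for the real data (column names are assumptions)
df = pd.DataFrame({
    "income":    [5000, None, 3200, 4100, None, 2800, 6100, 3900],
    "gender":    ["M", "F", None, "M", "F", "M", None, "F"],
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate",
                  "Graduate", "Not Graduate", "Graduate", "Graduate"],
    "approved":  [1, 0, 1, 1, 0, 0, 1, 1],
})

# Step 3: median imputation per group for the numeric column,
# mode imputation for the categorical/ordinal ones
df["income"] = df.groupby("gender", dropna=False)["income"].transform(
    lambda s: s.fillna(s.median()))
for col in ["gender", "education"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Step 4: one-hot encode the nominal column, map the ordinal one
X = pd.get_dummies(df[["income", "gender"]], columns=["gender"])
X["education"] = df["education"].map({"Not Graduate": 0, "Graduate": 1})
y = df["approved"]

# Step 5: grid search (cv=2 only because this toy set is tiny)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.1, 1.0, 10.0]}, cv=2, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern extends to the other four estimators by swapping the model and its parameter grid.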


Hi @kakashi

There are various things you can try:

Feature engineering


Also, you can do some visualization beforehand to get insights, which can then be useful for creating new features.
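As a concrete (hypothetical) example of feature engineering on this kind of loan data, one common trick is to combine related columns and transform skewed ones. The column names below are assumptions, not the competition's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for the real dataset
df = pd.DataFrame({
    "applicant_income":   [5000, 3200, 4100],
    "coapplicant_income": [0, 1500, 0],
    "loan_amount":        [130, 110, 160],
})

# Combine the two income columns, then log-transform the skewed total
df["total_income"] = df["applicant_income"] + df["coapplicant_income"]
df["log_total_income"] = np.log1p(df["total_income"])

# Ratio feature: how large is the loan relative to income?
df["loan_to_income"] = df["loan_amount"] / df["total_income"]
print(df[["total_income", "loan_to_income"]])
```

Features like these often help tree models and linear models alike because they encode relationships the raw columns only express jointly.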

Further, you can read some winners' approaches from here to get more ideas.

Hope this helps.


Thanks Shubham. I tried ensembling but the accuracy is the same as logistic regression. I ensembled 6 models. I'm going to move on to the next problem and check out the winners' code once the competition is over.
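For anyone else trying the ensembling suggestion: a minimal soft-voting sketch with three models looks like this (synthetic data here; the six models and dataset from the post above are not shown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the competition data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # average predicted probabilities across models
)
scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Note that voting only helps when the base models make *different* errors; if all six models learn essentially the same decision boundary, the ensemble will match the best single model, which may explain the result above.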


Hello, I am new to data science and I'm having trouble loading the dataset. Could somebody please help me?


Hi @dayomitchell

You can refer to this article.
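In short, loading usually comes down to `pd.read_csv` on the downloaded file. A minimal sketch, where an in-memory sample stands in for the real file and the column names are assumptions:

```python
import io
import pandas as pd

# In practice you would pass the downloaded file's path, e.g.
# train = pd.read_csv("train.csv"); here a small in-memory CSV stands in.
sample = io.StringIO(
    "Loan_ID,ApplicantIncome,Loan_Status\n"
    "LP001,5849,Y\n"
    "LP002,4583,N\n"
)
train = pd.read_csv(sample)

print(train.shape)          # (rows, columns)
print(train.head())         # first few rows
print(train.isna().sum())   # missing values per column
```

If `read_csv` fails, the usual culprits are a wrong file path or an unextracted zip archive.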



Hi Shubham, thanks a lot, well appreciated.




Is an accuracy of 83.22% good for this problem? I tried feature selection and ensembling as suggested by Shubham here, but I can't improve the accuracy beyond 83.22%. I am thinking of submitting the code now and moving on to another problem.



I've tried the three techniques shown in the introductory learning material: linear regression, decision tree and random forest, filling in the missing values with the mean. Each of them gave 77-78% accuracy.

What is your strategy to achieve >80% accuracy?

  1. Change the method used to fill in missing values?
  2. Try different columns (predictors) in the prediction?
  3. Explore other techniques for prediction? (I'd appreciate it if you would share which technique you use.)
  4. Explore the parameters of linear regression / decision tree / random forest?

I have been trying #1 and #2, but I am still stuck at <78%.
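One way to make strategy #1 systematic is to compare imputation strategies by cross-validated accuracy rather than guessing. A sketch on synthetic data (the real columns and missingness pattern will differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with ~10% of values knocked out at random
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    # Pipeline ensures imputation is fit only on each training fold
    model = make_pipeline(SimpleImputer(strategy=strategy),
                          RandomForestClassifier(random_state=0))
    results[strategy] = cross_val_score(model, X, y, cv=5).mean()
    print(strategy, round(results[strategy], 3))
```

Putting the imputer inside the pipeline matters: imputing before the split leaks test-fold statistics into training and inflates the score.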



Hi - I am stuck at 78.4% as well, even with ensemble methods and non-linear classifiers. I get a sense that it's because of the significant missing-value imputation I did. Although I used most of the best practices, like median imputation, my algorithms are suffering from GIGO (garbage in, garbage out).

I need to check the logic used by the data scientists with >90% accuracy for handling missing data. Any tips? Especially for Credit History?
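One approach people often try for a mostly-binary column like Credit History is filling missing entries with the mode *within a related group* rather than a single global value. A sketch with assumed column names:

```python
import pandas as pd

# Hypothetical columns standing in for the real dataset
df = pd.DataFrame({
    "education":      ["Graduate", "Graduate", "Not Graduate",
                       "Not Graduate", "Graduate", "Not Graduate"],
    "credit_history": [1.0, None, 0.0, 0.0, 1.0, None],
})

def group_mode_fill(s):
    # Fill NaNs with the most common value inside this group
    m = s.mode()
    return s.fillna(m.iloc[0]) if not m.empty else s

df["credit_history"] = (df.groupby("education")["credit_history"]
                          .transform(group_mode_fill))
print(df["credit_history"].tolist())
```

A step further is treating the column itself as a prediction target: train a small classifier on the rows where Credit History is known and predict it for the rows where it is missing.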


I am struggling with the same issue. I'm not able to raise the score above 0.79.
Any peers … please help.