These are the steps I followed:
- Loaded the dataset and separated the features from the target variable.
- Separated the numeric, categorical and ordinal variables.
- Imputed the numeric variables with the median (using groupby) and the categorical/ordinal variables with the most frequent value.
- One-hot encoded the categorical variables (pd.get_dummies) and label-encoded the ordinal ones.
- Used GridSearchCV to tune the hyperparameters of LogisticRegression, RandomForest, SVM, KNN and XGBoost. The highest accuracy was 0.784, with XGBoost; LogisticRegression, SVM and RandomForest each gave 0.77.
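For reference, the steps above can be sketched roughly as follows. This is not my exact code: it is a minimal sketch on made-up data, the column names (`income`, `city`, `grade`, `target`) are hypothetical placeholders, and it swaps in scikit-learn's `SimpleImputer`, `OneHotEncoder` and `OrdinalEncoder` for the groupby median, `pd.get_dummies` and label encoding described above so that everything fits in one pipeline. Only LogisticRegression is tuned here to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data; the columns are placeholders, not the real dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.normal(50, 10, n),                                     # numeric
    "city": pd.Series(rng.choice(["a", "b", "c"], n), dtype=object),     # categorical
    "grade": pd.Series(rng.choice(["low", "mid", "high"], n), dtype=object),  # ordinal
})
df["target"] = (df["income"] + rng.normal(0, 5, n) > 50).astype(int)

# Knock out some values so the imputers have something to do.
df.loc[df.sample(frac=0.1, random_state=0).index, "income"] = np.nan
df.loc[df.sample(frac=0.1, random_state=1).index, "city"] = np.nan

# Separate features and target.
X, y = df.drop(columns="target"), df["target"]

# One preprocessing branch per variable type.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["income"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["low", "mid", "high"]])),
    ]), ["grade"]),
])

# Preprocessing and model are tuned together, so each CV fold
# fits the imputers and encoders on its own training split only.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the imputation and encoding inside the `Pipeline` means they are re-fit within each cross-validation fold. If they are instead applied to the whole dataset before GridSearchCV, as in my steps, the cross-validated accuracy may be slightly optimistic.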
I’m new to data science; so far I have completed DataCamp courses and read Analytics Vidhya blog posts. I have spent a considerable amount of time on this problem, but the accuracy is not increasing. I tried standardizing and scaling the features for the algorithms that need it, and I tried using a subset of the features (the most significant ones). Any help will be appreciated.
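The scaling and feature-subset attempts looked roughly like the sketch below. It is illustrative only: the data comes from `make_classification` rather than my dataset, and `SelectKBest` with `f_classif` is an assumed stand-in for "most significant features", since there are several ways to rank them.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 20 features, of which 5 are informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # SVM is sensitive to feature scale
    ("select", SelectKBest(score_func=f_classif)),  # keep the k highest-scoring features
    ("clf", SVC()),
])

# Tune the number of kept features together with the model's own parameter.
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Treating the number of selected features as one more hyperparameter lets the grid search decide whether dropping features helps at all, instead of fixing the subset by hand beforehand.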