Discussions - The Data Identity Hackathon [Student DataFest 2018]


Use this category for discussions related to the contest The Data Identity: Student DataFest 2018 Hackathon, which starts on 15th May 2018. Feel free to share your approach and ask your questions here.

For more information, visit:


Why is my submission not being shown on the leaderboard? I got a score of 6.33.


Is there any limit on the number of submissions?


I think your account is not verified for DataFest 2018. I had the same problem earlier, but after my account was verified it works fine.


No limit; you can make any number of submissions.


Thank you @sngupta


What are the 5 categories in education?


Hello. I can’t join the Slack chat; it shows me this error: “already_in_team”. What can I do?


There’s a channel for Student DataFest; join it if you haven’t already (the channel name is studentdatafest2018).
Otherwise, the error is self-explanatory: you are already a member of that channel.


You can get them yourself with a few lines of code:

 import numpy as np


'Bachelors', 'High School Diploma', 'Masters', 'Matriculation', 'No Qualification'
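
For illustration, here is a minimal sketch of how such a list can be produced with NumPy. The `education` array below is made-up sample data, not the contest file; in practice you would pass the education column of the training set instead.

```python
import numpy as np

# Hypothetical sample standing in for the education column of the training data.
education = np.array([
    "Masters", "Bachelors", "No Qualification", "Matriculation",
    "High School Diploma", "Bachelors", "Masters",
])

# np.unique returns the distinct values in sorted order.
categories = np.unique(education)
print(categories.tolist())
```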


A 0.758 baseline for beginners! Let’s make it interesting. The-Data-Identity baseline


Would you share any ideas on how to improve the accuracy of the model?
I have already used scaling and parameter tuning.


It’s not so much about feature selection as feature engineering. Adding new features is often the best way to increase both the diversity and the quality of models.

Simply trying to tune the parameters won’t take you to #1 so easily, as that can be done by everyone.

Don't bother predicting if you can't validate that your model is learning.

So, focus on feature engineering, as this is what ML…
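
To make the two points above concrete (add features, and validate that the model is learning), here is a rough sketch using scikit-learn. The data is synthetic and the interaction feature `a * b` is only an illustrative assumption; it is not taken from the contest dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: the label depends on the sign of a * b,
# which a linear model cannot express from a and b alone.
a = rng.uniform(-1, 1, size=500)
b = rng.uniform(-1, 1, size=500)
y = (a * b > 0).astype(int)

X_raw = np.column_stack([a, b])
X_engineered = np.column_stack([a, b, a * b])  # one added interaction feature

model = LogisticRegression()

# 5-fold cross-validated accuracy, so the comparison uses held-out data.
raw_scores = cross_val_score(model, X_raw, y, cv=5, scoring="accuracy")
eng_scores = cross_val_score(model, X_engineered, y, cv=5, scoring="accuracy")

print(f"raw features:     {raw_scores.mean():.3f}")
print(f"with a*b feature: {eng_scores.mean():.3f}")
```

Comparing cross-validated scores before and after adding a feature is one simple way to check whether the feature actually helps, rather than relying on the leaderboard alone.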


Hi all,

Here is the benchmark solution (Python) for The Data Identity hackathon to get you all started with the problem:

Happy learning!!


How should I treat the NA values in the train and test data sets?


Hi @deepam,

You can use the mean, median or mode to impute the missing values in the dataset. For categorical variables you can use mode and for numerical variables you can use mean or median.
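
As a small sketch of this with pandas (the frame and its column names `age` and `education` are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "education": ["Masters", "Bachelors", None, "Bachelors"],
})

# Numerical column: fill with the median (the mean works the same way).
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the mode (most frequent value).
df["education"] = df["education"].fillna(df["education"].mode()[0])

print(df)
```

When imputing the test set, it is usually safer to reuse the values computed on the training set rather than recomputing them on the test data.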

To learn advanced methods for treating missing values in a dataset, you can refer to the article mentioned below:


Okay! Thanks for the help.


Which model should I choose for the given problem?
Please suggest.


I am using the Decision Tree algorithm. I am getting decimal values as classes after prediction. Initially there are two classes, ‘0’ and ‘1’, but after prediction I get four classes with decimal values: ‘0.56’, ‘0.58’, ‘0.84’, ‘0.79’.
How can this be solved?
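
One possible cause, assuming scikit-learn is being used (the post does not say): a regression tree, or a call to `predict_proba`, returns fractional values, whereas `DecisionTreeClassifier.predict` returns the original class labels. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up binary data, noisy enough that a shallow tree has impure leaves.
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# A regression tree predicts the mean target of each leaf, so outputs can be fractional.
reg_pred = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y).predict(X)

# A classification tree's predict() returns one of the original class labels.
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
clf_pred = clf.predict(X)

# predict_proba() returns per-class probabilities, which are also fractional.
clf_proba = clf.predict_proba(X)[:, 1]

print("regressor:  ", np.unique(reg_pred))
print("classifier: ", np.unique(clf_pred))
print("probability:", np.unique(clf_proba))
```

If the output is a probability and hard labels are required, thresholding it (for example at 0.5) gives back ‘0’ and ‘1’; whether labels or probabilities are expected depends on the contest's evaluation metric.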


There are leaks in this dataset. When present, leaks are generally exploited in most data competitions. Can leaks be used here as well?