A look into the Hackathon



Hi all,

I was looking at the problem statement of the hackathon being conducted from Feb 26 to Feb 28. I find that the training dataset contains more entries than the testing dataset. This is the first time I am joining a hackathon, and the problem statement is not detailed enough for me to understand why the other data files are given; I am totally blank on how to approach this problem. Could anyone please help me clarify the problem statement?


Yes, the training dataset has more entries than the testing dataset so that you have enough data to build a model.
This is a binary classification problem, and you need to predict the outcome for Is_Shortlisted (note: this column is not present in the TEST dataset).

This dataset has a lot of text that you will need to clean up. I believe the existing features are good enough to build a first-cut model, and then you can proceed with other feature engineering and model selection approaches.
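For the text cleanup mentioned above, a minimal sketch might look like the following. The example strings and the idea of applying it to a profile/skills column are assumptions for illustration; adapt it to whichever text columns the dataset actually has.

```python
import re
import pandas as pd

def clean_text(s):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)   # drop punctuation and symbols
    s = re.sub(r"\s+", " ", s).strip()   # collapse repeated spaces
    return s

# Toy stand-in for a free-text column from the dataset
profiles = pd.Series(["  Web Development, HTML/CSS!!", "Data-Science & ML "])
cleaned = profiles.apply(clean_text)
print(cleaned.tolist())
```

From here the cleaned strings can be fed into something like a bag-of-words vectorizer, or simply used as cleaner categorical values.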

Hope this helps!


While submitting my solution with 272792 entries, it says that 1***** extra data are there. What does this mean? Please help.


The TEST data contains 107428 rows to predict. You seem to be uploading additional entries.
Please check your submission file.
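A quick way to catch this before uploading is to validate the submission file's row count locally. This is a sketch, assuming a pandas DataFrame submission; the ID column name `Internship_ID` is hypothetical, so use whatever ID column your submission format actually requires.

```python
import pandas as pd

EXPECTED_ROWS = 107428  # number of rows in the TEST set, per this thread

def validate_submission(sub, id_col="Internship_ID"):
    """Drop duplicate IDs and verify the row count matches the test set.

    `id_col` is a hypothetical column name -- replace it with the real one.
    """
    sub = sub.drop_duplicates(subset=id_col)
    if len(sub) != EXPECTED_ROWS:
        raise ValueError(f"Expected {EXPECTED_ROWS} rows, got {len(sub)}")
    return sub

# Example: a correctly sized dummy submission passes the check
sub = pd.DataFrame({"Internship_ID": range(EXPECTED_ROWS), "Is_Shortlisted": 0})
validate_submission(sub)
```

Duplicated test IDs (e.g. from joining auxiliary files before predicting) are a common cause of "extra data" errors.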



I went through the problem statement, and I understand that we have to predict Is_Shortlisted, a binary classification problem. I am used to the case where a single training file and a single test file are given, but here we have three files: student.csv, internship.csv, and train.csv. I am totally stuck on how to proceed. I have studied machine learning and data mining, so I have theoretical knowledge, but I am stuck on how to apply it to a problem like this one. How do I clean the data? What are the features? What feature engineering should I do? This is the first hackathon I am participating in. Before this, I tried the Kaggle Titanic tutorial in Python, so I thought I could solve this problem, but I feel completely stuck. How do I proceed?


Go slow at first. Just try a basic benchmark submission.

To do this, just take the train.csv file, train a simple decision tree algorithm (watch out for categorical columns; if you don't know what to do with them, just drop those columns), generate predictions on test.csv, and try submitting your solution. If all goes well, you are doing great!
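The benchmark described above might be sketched as follows. The toy DataFrame here only stands in for train.csv, and apart from the column names quoted later in this thread (Minimum_Duration, Is_Part_Time, Preferred_location, Is_Shortlisted), the values are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for train.csv; in practice, use pd.read_csv("train.csv")
train = pd.DataFrame({
    "Minimum_Duration":   [2, 6, 3, 4, 2, 6],
    "Is_Part_Time":       [1, 0, 0, 1, 1, 0],
    "Preferred_location": ["Delhi", "Mumbai", None, "Delhi", "Pune", None],
    "Is_Shortlisted":     [0, 1, 0, 1, 0, 1],
})

# Keep only numeric columns, i.e. drop the categorical ones for now
X = train.drop(columns=["Is_Shortlisted"]).select_dtypes(include="number")
y = train["Is_Shortlisted"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
preds = clf.predict(X)  # on the real data, predict on test.csv instead
```

Dropping categorical columns throws away information, but it gets a first submission on the board; encoding those columns is the natural next step.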

Everybody goes through this phase. Just push through it. You will eventually do well :slightly_smiling:

Also, try studying other people's benchmark code (refer to the tagged notes of the #date-your-data channel on Slack). Good luck :+1:


Hi jaiFaizy,

I thought logistic regression would work here, because this is a binary classification problem. But does a decision tree classifier work better? And what features would you use? All features except "Minimum_Duration" and "Is_Part_Time" are categorical. Also, the "Preferred_location" attribute is missing in many places. Do we need to impute it, leave it as missing, or treat missing as a category of its own?


Hi @ajayram198,
As far as I know, a decision tree algorithm can be used for a binary classification problem, but there is no reason to assume it is better. To test this hypothesis, implement both logistic regression and a decision tree and compare their accuracy. Check out this discussion
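Comparing the two models could be as simple as the sketch below. It uses synthetic data from `make_classification` in place of the hackathon features, so the numbers themselves mean nothing; the point is the side-by-side cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

results = {}
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    # 5-fold cross-validated accuracy for each candidate model
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```

On the real data, swap in the prepared feature matrix and the Is_Shortlisted labels, and whichever model cross-validates better is the one to build on.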

Also, I have not actively taken part in the competition, so I can't comment on which features I would use. But what I would do is first give all the features to the algorithm (categorical and numerical, from all three files, viz. train.csv, student.csv, and internship.csv) and make a benchmark. Regarding missing values, I would impute numeric columns with the mean. (For categorical columns, imputation is easy: just treat "missing" as a category of its own.) This is just for benchmarking; the rest will follow.
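Combining the three files and imputing as described might look like this sketch. The key columns (`Student_ID`, `Internship_ID`) and the auxiliary columns are assumptions made for illustration, as are the tiny in-memory DataFrames standing in for the real CSVs.

```python
import pandas as pd

# Tiny stand-ins for train.csv, student.csv, and internship.csv;
# the key and feature column names here are assumed, not from the contest.
train = pd.DataFrame({"Student_ID": [1, 2, 3], "Internship_ID": [10, 10, 11],
                      "Is_Shortlisted": [0, 1, 0]})
student = pd.DataFrame({"Student_ID": [1, 2, 3],
                        "Current_year": [2, None, 3]})
internship = pd.DataFrame({"Internship_ID": [10, 11],
                           "Preferred_location": ["Delhi", None]})

# Left-join the auxiliary files onto train so every training row keeps its label
full = (train.merge(student, on="Student_ID", how="left")
             .merge(internship, on="Internship_ID", how="left"))

# Impute: mean for numeric columns, a "Missing" category for text columns
for col in full.columns:
    if full[col].dtype == object:
        full[col] = full[col].fillna("Missing")
    else:
        full[col] = full[col].fillna(full[col].mean())
```

The left joins guarantee the merged table has exactly one row per training example, which keeps the features aligned with the Is_Shortlisted labels.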