Megastar prediction: Share your maximum accuracy & the method adopted


Hello everyone, I hope you all had as much fun as I did in this small but good contest.
Please share your highest accuracy and the approach you adopted.
It will help the rest of the contestants to think about and work on their models before getting the final solutions from Kunal.
I am sure you will agree that we learn more by doing things ourselves than by simply receiving a ready-made solution.

Used bag of words on skills in addition to vintage conversion to number of months
Algorithm: Random Forest
I have also tried GBM & the accuracy was again around 47%
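
For anyone who has not tried these two steps before, here is a minimal base R sketch of both. It assumes Vintage looks like "8 year(s) 6 month(s)" and that skills are separated by commas; the sample values and the names `vintage`, `skills`, `bow` and `features` are made up for illustration and are not the contest data.

```r
# Hypothetical sample rows; the real contest columns may be formatted differently.
vintage <- c("8 year(s) 6 month(s)", "2 year(s) 0 month(s)", "0 year(s) 11 month(s)")
skills  <- c("R, SAS, SQL", "Excel, SQL", "R, Python, Excel")

# Vintage text -> total months (assumes the "<y> year(s) <m> month(s)" pattern).
parts  <- regmatches(vintage, gregexpr("[0-9]+", vintage))
months <- vapply(parts, function(p) 12 * as.numeric(p[1]) + as.numeric(p[2]), numeric(1))

# Bag of words: one 0/1 column per distinct skill token.
tokens <- lapply(strsplit(tolower(skills), "[^a-z0-9+#]+"),
                 function(tok) unique(tok[tok != ""]))
vocab  <- sort(unique(unlist(tokens)))
bow    <- t(vapply(tokens, function(tok) as.integer(vocab %in% tok),
                   integer(length(vocab))))
colnames(bow) <- paste0("skill_", vocab)

# Numeric feature table that a random forest or GBM could consume.
features <- data.frame(vintage_months = months, bow)
```

On a real skills column the vocabulary gets large quickly, so rare tokens would probably need to be dropped before modelling.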


Hi all,

I have not submitted a solution, as I could not complete the task on time in spite of spending a solid 6 hours. Here are my approach and the lessons learned. I kindly request the participants / moderators to advise me wherever my approach was wrong.

My approach:

Data transformation:

  1. For work experience - I converted it into buckets: 0-2 years, 2-5 years, 5-10 years, 15-20 years and 20+.

  2. For UG and PG education, I converted the values into the following buckets.
    a. All B.A and commerce --> Arts
    b. All B.E/ --> Engineering
    c. B.Sc, Comp.Sci / IT --> Computer Science
    d. BBA / MBA etc. --> Management
    e. Math / Stat / Analytics --> Analytics
    f. All others --> Others

  3. Institution
    1. All Tier 1 --> Tier 1
    2. All others --> Others
    3. No PG --> Not a PG

Note - I referred to the list containing the tier 1 colleges published in AV.

  4. Skill set
    This was the most challenging part for me, as I had never done text mining or used bag of words. I had to manually convert all the skill sets to Analytics / BI / Others according to the keywords, which took a lot of time.

  5. I split the “train” dataset into 70% training and 30% validation data.

  6. Applied the C5.0 algorithm and got around 46% accuracy. I would have got a better accuracy if I had done the pre-processing steps more judiciously or followed the approach that was published in the hackathon guide.
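
The experience buckets and the 70/30 split described above could be sketched in base R roughly like this. The data frame, the column name `experience_years` and the seed are invented for illustration, and the 10-15 band is my assumption because the bucket list above does not mention it.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the contest "train" file.
train <- data.frame(
  experience_years = c(0.5, 1.5, 3, 4, 6, 8, 12, 16, 18, 25),
  Category = factor(rep(c("A", "B"), 5))
)

# Bucket work experience. The 10-15 band is an assumption; the post does not list it.
train$exp_bucket <- cut(
  train$experience_years,
  breaks = c(0, 2, 5, 10, 15, 20, Inf),
  labels = c("0-2", "2-5", "5-10", "10-15", "15-20", "20+"),
  right = FALSE  # intervals are [lower, upper), so 2 years falls in "2-5"
)

# 70/30 split: sample 70% of the row indices for training, keep the rest for validation.
idx        <- sample(nrow(train), size = round(0.7 * nrow(train)))
training   <- train[idx, ]
validation <- train[-idx, ]
```

Setting `right = FALSE` decides which bucket a boundary value such as exactly 2 years lands in, which is worth fixing explicitly since the buckets in the list overlap at their edges.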

Lessons learned:

  1. I initially used SPSS Modeler, and as it later stopped working for some technical reasons, I had to move to R. So I learned that I should not juggle between tools.

  2. Not knowing about bag of words and text mining. This is probably my biggest takeaway from participating in this hackathon. I never knew that such a method could be applied for data transformation / cleaning.

  3. I should never take competitions lightly. As soon as I saw the dataset, I took it for granted because it looked simple. But I underestimated the power of this “so called simple dataset”. No dataset is simple until I bring out the real insights from it.

  4. Feature engineering. This is again something I don’t know much about. In spite of this being listed as the top thing to do in the hackathon guide, which was published on Saturday, I took it lightly.

  5. Exploring the data before bucketing the categories. Though I knew this, I did not follow it in this hackathon. I should do this from now on, as it matters most for getting a good distribution among the buckets.
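
As a small illustration of exploring before bucketing, a frequency table of the raw labels shows which values are common enough to keep and which should be merged. The labels below are hypothetical, and the 15% cut-off for "rare" is an arbitrary choice for this sketch.

```r
# Hypothetical raw labels; the real UG_Education values are not shown in this thread.
ug <- c("B.E", "B.Tech", "B.E", "BA", "B.A", "B.Sc", "B.E", "BBA", "B.Com", "B.Tech")

# Count how often each raw label occurs, most frequent first.
freq <- sort(table(ug), decreasing = TRUE)
print(freq)

# Share of rows per label, to spot levels too rare to deserve their own bucket.
share <- round(prop.table(freq), 2)
rare  <- names(share)[share < 0.15]  # 0.15 is an arbitrary illustrative threshold
```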

End notes:

On the whole, I was very happy to participate in this competition. Though I did not even qualify to send the final sheet to the organizers, I felt happy at the end, realizing that I have to travel many more miles just to make my first submission. It taught me to be humble. Thanks to the organizers for bringing in such a wonderful event.

Approaches for 'Do you know who's a Megastar?'

I used multinom from the nnet package in R.
Received 46.06% accuracy. Indeed a great learning experience.
Still working to improve it; I will share the method then. :blush:
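
For readers who have not used it, a multinom fit looks roughly like the sketch below. It runs on the built-in iris data purely as a stand-in, since the contest data and the actual formula used are not shown here; the seed and the 70/30 split are also just assumptions for the example.

```r
library(nnet)  # recommended package that ships with standard R distributions

set.seed(1)  # arbitrary seed for a reproducible split
idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[idx, ]
valid <- iris[-idx, ]

# Multinomial logistic regression on a multi-class target.
fit  <- multinom(Species ~ ., data = train, trace = FALSE)
pred <- predict(fit, newdata = valid)  # predicted class labels

# Share of validation rows classified correctly.
accuracy <- mean(pred == valid$Species)
```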


@karthe1 thanks for sharing your approach! The road taken is far more interesting than the final destination!
I have just started taking part in data hackathons, so I have a steep learning curve to climb.

My approach (the one that worked, which I submitted with 45% accuracy on the 30% sample):

  1. Work exp (Vintage) was converted into a numeric variable, e.g. 8 year(s) 6 month(s) became 8.5.
  2. For UG_education & PG_education, I copied the data into a separate sheet & used the Remove Duplicates option in Excel to get the unique values. These were then reduced: B.A & BA are the same, B.E, B.Tech & B.Engg are the same, and MBA / PGDM were brought into the same bucket.
  3. Domains is a factor variable and I kept it as such.
  4. Var2, Var3 & Var4 are categorical variables & I didn’t touch them.
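
The degree clean-up above can also be done in R instead of Excel. This is only a sketch under assumptions: the raw spellings and the `synonyms` mapping are invented for illustration, and `normalise` and `merge_synonyms` are hypothetical helper names.

```r
# Hypothetical raw values; the spellings in the real file may differ.
ug_raw <- c("B.A", "BA", "B.E", "B.Tech", "B.Engg", "b.tech", "B.Sc")
pg_raw <- c("MBA", "PGDM", "M.B.A", "M.Sc", "MBA", "pgdm", "M.Tech")

# The distinct raw values, the equivalent of Excel's Remove Duplicates.
unique(ug_raw)

# Step 1: drop punctuation and spaces and upper-case, so "B.A" and "BA" collapse.
normalise <- function(x) toupper(gsub("[^A-Za-z]", "", x))

# Step 2: map synonyms onto one label (this mapping is illustrative only).
synonyms <- c(BTECH = "BE", BENGG = "BE", PGDM = "MBA")
merge_synonyms <- function(x) {
  key <- normalise(x)
  hit <- key %in% names(synonyms)
  key[hit] <- synonyms[key[hit]]
  unname(key)
}

UG_New <- factor(merge_synonyms(ug_raw))
PG_New <- factor(merge_synonyms(pg_raw))
```

Keeping the mapping in one named vector makes it easy to review which labels were merged, which is harder to audit after manual edits in a spreadsheet.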

After these transformations I ran a random forest over the following variables:
=> Var2 + Var3 + Var4 + UG_New + PG_New + Domain + Skills + VintageNumeric

Things that didn’t work:

  1. I used the tm package on the skills column, but the output was very sparse and I couldn’t get any result from it, since the algorithm kept running on my machine for the entire day!
  2. CART models gave me around 36% accuracy, while a tuned SVM model gave me around 37%. This was after running the train() function on it for half a day!
  3. Ensembles were not much help, as they failed the validation test. I was getting >70% accuracy in the in-sample test, but on validation this dropped to 34~35%. Basically overfitting!

This is just my second hackathon, and it was a good experience! Big thumbs up to @kunal & team for putting this up! Eagerly waiting for your next hackathon in Mumbai (in July!)


Solution of Data Hackathon Online Held on 7th June 2015

Congratulations @Nalin on winning this contest. Nalin has won an Amazon voucher worth Rs. 5000.
Nalin is from Mumbai, India. He happens to be a Chartered Accountant, Investment Banker and Equities Trader.

Below is the code he used to achieve an accuracy of 49%. He used R to solve this problem, relying mainly on the text mining technique called ‘bag of words’.






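library(tm)            # text mining: Corpus, tm_map, DocumentTermMatrix
library(SnowballC)     # stemming backend used by stemDocument
library(randomForest)  # randomForest and na.roughfix

# Keep the first two characters of Vintage, which hold the leading year count
combined$Vintage <- substr(combined$Vintage, 1, 2)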



CorpusSkills = Corpus(VectorSource(combined$Skills))
# content_transformer() keeps the corpus structure intact, so the separate PlainTextDocument step is not needed
CorpusSkills = tm_map(CorpusSkills, content_transformer(tolower))
CorpusSkills = tm_map(CorpusSkills, removePunctuation)
CorpusSkills = tm_map(CorpusSkills, removeWords, stopwords("english"))
CorpusSkills = tm_map(CorpusSkills, stemDocument)

dtm = DocumentTermMatrix(CorpusSkills)
sparse = removeSparseTerms(dtm, 0.99)
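# Right-hand side was missing from the post; as.data.frame(as.matrix(sparse)) is an assumed reconstruction
SkillsWords = as.data.frame(as.matrix(sparse))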

colnames(SkillsWords) = paste("Skills", colnames(SkillsWords))
colnames(SkillsWords) = make.names(colnames(SkillsWords))



CorpusUGE = Corpus(VectorSource(combined$UG_Education))
# content_transformer() keeps the corpus structure intact, so the separate PlainTextDocument step is not needed
CorpusUGE = tm_map(CorpusUGE, content_transformer(tolower))
CorpusUGE = tm_map(CorpusUGE, removePunctuation)
CorpusUGE = tm_map(CorpusUGE, removeWords, stopwords("english"))
CorpusUGE = tm_map(CorpusUGE, stemDocument)

dtm = DocumentTermMatrix(CorpusUGE)
sparse = removeSparseTerms(dtm, 0.99)
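# Right-hand side was missing from the post; as.data.frame(as.matrix(sparse)) is an assumed reconstruction
UGEWords = as.data.frame(as.matrix(sparse))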

colnames(UGEWords) = paste("UGE", colnames(UGEWords))
colnames(UGEWords) = make.names(colnames(UGEWords))



CorpusUGC = Corpus(VectorSource(combined$UG_College))
# content_transformer() keeps the corpus structure intact, so the separate PlainTextDocument step is not needed
CorpusUGC = tm_map(CorpusUGC, content_transformer(tolower))
CorpusUGC = tm_map(CorpusUGC, removePunctuation)
CorpusUGC = tm_map(CorpusUGC, removeWords, stopwords("english"))
CorpusUGC = tm_map(CorpusUGC, stemDocument)

dtm = DocumentTermMatrix(CorpusUGC)
sparse = removeSparseTerms(dtm, 0.99)
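# Right-hand side was missing from the post; as.data.frame(as.matrix(sparse)) is an assumed reconstruction
UGCWords = as.data.frame(as.matrix(sparse))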

colnames(UGCWords) = paste("UGC", colnames(UGCWords))
colnames(UGCWords) = make.names(colnames(UGCWords))



CorpusPGE = Corpus(VectorSource(combined$PG_Education))
# content_transformer() keeps the corpus structure intact, so the separate PlainTextDocument step is not needed
CorpusPGE = tm_map(CorpusPGE, content_transformer(tolower))
CorpusPGE = tm_map(CorpusPGE, removePunctuation)
CorpusPGE = tm_map(CorpusPGE, removeWords, stopwords("english"))
CorpusPGE = tm_map(CorpusPGE, stemDocument)

dtm = DocumentTermMatrix(CorpusPGE)
sparse = removeSparseTerms(dtm, 0.99)
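# Right-hand side was missing from the post; as.data.frame(as.matrix(sparse)) is an assumed reconstruction
PGEWords = as.data.frame(as.matrix(sparse))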

colnames(PGEWords) = paste("PGE", colnames(PGEWords))
colnames(PGEWords) = make.names(colnames(PGEWords))



CorpusPGC = Corpus(VectorSource(combined$PG_College))
# content_transformer() keeps the corpus structure intact, so the separate PlainTextDocument step is not needed
CorpusPGC = tm_map(CorpusPGC, content_transformer(tolower))
CorpusPGC = tm_map(CorpusPGC, removePunctuation)
CorpusPGC = tm_map(CorpusPGC, removeWords, stopwords("english"))
CorpusPGC = tm_map(CorpusPGC, stemDocument)

dtm = DocumentTermMatrix(CorpusPGC)
sparse = removeSparseTerms(dtm, 0.99)
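# Right-hand side was missing from the post; as.data.frame(as.matrix(sparse)) is an assumed reconstruction
PGCWords = as.data.frame(as.matrix(sparse))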

colnames(PGCWords) = paste("PGC", colnames(PGCWords))
colnames(PGCWords) = make.names(colnames(PGCWords))






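# Fit a random forest on all predictors; na.roughfix imputes missing values with medians / modes
rf <- randomForest(Category ~ ., data = trainP, na.action = na.roughfix)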




write.csv(sub, "sub.csv", row.names=FALSE)
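
A note on the removeSparseTerms(dtm, 0.99) calls in the code above: as far as I understand tm, a term is kept only if it appears in more than (1 - sparse) of the documents, so 0.99 keeps terms present in more than roughly 1% of rows. The base R function below is a hypothetical illustration of that rule on a made-up count matrix, not tm's own implementation.

```r
# Toy document-term counts: 10 documents (rows) x 4 terms (columns). Values are made up.
m <- matrix(0, nrow = 10, ncol = 4,
            dimnames = list(NULL, c("sql", "excel", "hadoop", "cobol")))
m[1:8, "sql"]    <- 1   # appears in 8 of 10 documents
m[1:3, "excel"]  <- 1   # appears in 3 of 10 documents
m[1:2, "hadoop"] <- 1   # appears in 2 of 10 documents
m[1,   "cobol"]  <- 1   # appears in 1 of 10 documents

# Keep terms present in more than (1 - sparse) of the documents.
keep_dense_terms <- function(m, sparse) {
  doc_count <- colSums(m > 0)
  m[, doc_count > nrow(m) * (1 - sparse), drop = FALSE]
}

kept <- keep_dense_terms(m, sparse = 0.75)  # cutoff: more than 2.5 documents
colnames(kept)
```

Lowering the threshold drops more rare terms, which gives the random forest fewer columns to work with and shortens training time.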


Congrats @Nalin, Thanks AV for sharing the R code as well.


Congrats @Nalin! I get 48.06% when I submit the solution on the solution checker. That must be because it's checking against only a sample of the test set.
Loved the idea of throwing the UG & PG variables to form a bag of words, that was a neat trick!


Hi Akash,

Do you have the datasets for this Hackathon? If yes, could you please share them with me at ""?

That would be very helpful.