Revealing your approach: Predict the gem of Auxesia for Magazino!


#1

This is a thread to share, discuss and brainstorm the approaches you tried in the last 2 days.

The more we share, the more we will learn! Over to you guys!


#2

Nice initiative…

I used R with the caret (preprocessing), flexclust (clustering) and randomForest (modeling) packages.

Using k-means, I created a few clusters in the training set (and assigned the test set rows to the same clusters), then used random forest to fit a model for each cluster.

Since the shares variable was skewed, I used a log transformation before fitting the model.
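In code, the idea looks roughly like this; the data frame names, k = 3 and the log1p choice are illustrative, not my exact settings:

```r
library(flexclust)       # kcca: k-means that can assign clusters to new data
library(randomForest)

feats <- setdiff(names(train), "shares")

# k-means on the training predictors; the same centroids then assign
# each test row to its nearest cluster
km <- kcca(as.matrix(train[, feats]), k = 3, family = kccaFamily("kmeans"))
train$cluster <- clusters(km)
test$cluster  <- predict(km, newdata = as.matrix(test[, feats]))

# one random forest per cluster, fit on the log-transformed target
models <- lapply(split(train, train$cluster), function(d)
  randomForest(log1p(shares) ~ . - cluster, data = d))

# score each test row with its own cluster's model, then undo the log
test$pred <- NA
for (cl in names(models)) {
  rows <- test$cluster == as.numeric(cl)
  test$pred[rows] <- expm1(predict(models[[cl]], newdata = test[rows, ]))
}
```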

At least on the 40% test set, it gave the best RMSE so far (6689.25906869). Hoping that it remains best on the 100% test set :wink:

Fingers crossed :smile:

Update: Managed to repeat the performance on the 100% data set!


#3

I started simple with the complete data set, converting the two categorical columns in pandas, and built a benchmark script with both RF and XGBoost.

After that, the following steps:

  1. Divided the train data set into 2 parts, train and test; test was some 50% of the data. (As there was no leaderboard in the beginning, I was scared of overfitting the data.)

  2. Tried to run both RF and XGBoost on it.

  3. Did some parameter optimization for both of them. Got some decent scores.

  4. Removed outliers from the training set: some top 15-20 rows with the largest “shares” value. This improved the score greatly.

  5. After this I did some feature selection: removed features to see if RMSE improved, and dropped some 5-6 features based on that.

  6. After doing all this I checked the XGBoost score, the RF score and their average; all 3 were giving similar scores on the PLB (public leaderboard).

  7. So I thought of trying something random. I tried KNN and Lasso: no improvement in score and a negative impact on the ensemble, so I ditched them.

Finally went with the RF and XGBoost ensemble (sketched below). Happy with the learnings. Sad that I couldn’t build two distinct models that give similar performance and whose ensemble would give better results…
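A rough sketch of the outlier trim (step 4) and that final averaging, written in R here for consistency with the rest of the thread (my actual script was Python/pandas; names and cutoffs are illustrative):

```r
library(randomForest)
library(xgboost)

# step 4: drop the ~20 training rows with the largest `shares` values
train <- train[order(-train$shares), ][-(1:20), ]

x_tr <- as.matrix(train[, setdiff(names(train), "shares")])
x_te <- as.matrix(test[, colnames(x_tr)])

rf  <- randomForest(x_tr, train$shares, ntree = 500)
xgb <- xgboost(data = x_tr, label = train$shares, nrounds = 200,
               objective = "reg:squarederror", verbose = 0)

# step 6: a simple average of the two models' predictions
pred <- (predict(rf, x_te) + predict(xgb, x_te)) / 2
```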

PS: Would love to know if someone was able to build two or more distinct models which were equally good and whose ensemble did better.


#4

Thanks AnalyticsVidhya team for conducting this Hackathon!!
It was a very interesting scenario & a wonderful learning experience.
I am a newbie and was stuck with linear regression for more than a day.
Then I learnt about PCA, reduced some of the features with it, and discarded a few with the least correlation with the outcome.
Tried ridge & lasso regression. At the end I heard about ensembles & tried using RF and SVM. For me the best outcome was with Lasso.
It was fun learning!!

I used the glmnet, leaps and e1071 R packages during the competition.
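For anyone curious, the lasso piece with glmnet looks roughly like this; the data names and the cross-validated lambda are illustrative assumptions:

```r
library(glmnet)

x <- as.matrix(train[, setdiff(names(train), "shares")])
y <- train$shares

# alpha = 1 gives the lasso penalty; cv.glmnet picks lambda by cross-validation
cv   <- cv.glmnet(x, y, alpha = 1)
fit  <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
pred <- predict(fit, newx = as.matrix(test[, colnames(x)]))
```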

Thanks!
Chidam


#5

Hi All,

Following is my approach to solving the problem:

  • I started with a simple benchmark file using linear regression, then compared results with random forest. I realized random forest was likely to over-fit the data set while linear regression was giving good results, so I used linear regression throughout. (I also tried SVM, gamboost and glmboost, but got no improvement in results due to over-fitting, even after tuning the parameters.)

  • I looked at the distribution of share values. As it was heavily skewed, I removed every row with a share value > 90,000 (experimentally optimized, ~80 rows) to get the maximum boost, since linear regression is prone to outliers and gives a bad fit.

  • Then I started feature engineering on most of the variables which were not coming out as significant as business logic said they should; this gave an additional boost in the scores (LDA, number of videos, global subjectivity, and bucketing of positive and negative connotations worked well for me).

  • I made individual features out of both categorical variables, like is_monday 1/0 (a sketch follows after this list). It didn’t give me any boost, but I like doing it as it gave me more business understanding of which categories are significant for the number of shares.

  • Last, I tried penalizing the linear regression using Lasso regression, which gave me a little boost in performance, so I went with Lasso in the end.
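A tiny sketch of that is_monday-style encoding, assuming an illustrative factor column `weekday`:

```r
# one 0/1 column per weekday level, e.g. "is_monday"
dummies <- model.matrix(~ weekday - 1, data = train)
colnames(dummies) <- sub("^weekday", "is_", colnames(dummies))
train <- cbind(train, dummies)
```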

Regards,
Aayush Agrawal


#6
  • Created dummy variables from day of publishing & article category.
  • Capped outlier values at a maximum.
  • Used bucketing on a few variables based on their quantiles.
  • Used a Box-Cox transformation to find lambda, as the distribution is highly skewed (see the sketch after this list).
  • Used stepwise regression to find the significant variables.
  • Applied various models such as linear regression, random forest, SVM & GBM, and later created an ensemble out of all tried models.

The whole exercise did not give me very fruitful results, but I have learned far more than by reading & watching online content on data analytics.
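A minimal version of the Box-Cox step with MASS::boxcox; the intercept-only formula and names are illustrative, and shares must be strictly positive:

```r
library(MASS)

# profile the Box-Cox log-likelihood over a grid of lambda values
bc     <- boxcox(shares ~ 1, data = train, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# apply the transformation (plain log when lambda is ~0)
train$shares_bc <- if (abs(lambda) < 1e-6) log(train$shares) else
  (train$shares^lambda - 1) / lambda
```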

#7

After performing the preprocessing steps, I was initially using random forest. But then I realized that shares is count data and hence used Poisson regression. The results were not satisfactory. Later I realized that the problem has zero truncation and hence moved to zero-truncated (positive) Poisson regression.

I was facing this error when applying the zero-truncated Poisson regression:

```r
mod1 <- vglm(shares ~ ., family = pospoisson(), data = dataset)
```

```
Error in if ((temp <- sum(wz[, 1:M, drop = FALSE] < wzepsilon))) warning(paste(temp, :
  argument is not interpretable as logical
```

I could not find a solution for this online. Did anyone try this, please?

Forums were suggesting tryCatch to trace the error, but I don’t know how to use that command effectively.

Can anyone suggest how tryCatch can be used in R?
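From what I could piece together so far, the basic pattern wraps the failing call and hands any condition to a handler; a minimal sketch around the vglm call above (the handler bodies are illustrative):

```r
library(VGAM)

# the handlers capture the condition message and return NULL, so the
# script keeps running instead of stopping at the error
mod1 <- tryCatch(
  vglm(shares ~ ., family = pospoisson(), data = dataset),
  warning = function(w) { message("warning: ", conditionMessage(w)); NULL },
  error   = function(e) { message("error: ",   conditionMessage(e)); NULL }
)
if (is.null(mod1)) message("vglm fit failed; see the condition message above")
```

But I still don’t see why the fit fails in the first place.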


#9

Hello Friends

I achieved ~5559 on the 20% LB; I can’t recall the score on 40%. Here is the broad approach I adopted:

  1. Feature Engineering: I observed that some variables, like keyword and polarity, had min, max and average versions. I created a new feature as abs((max-min)/2 - average), hoping to capture more information about the variation (see the sketch after this list). This did prove helpful. Next, I dropped variables with high multicollinearity and very poor/close-to-zero correlation with shares. I was left with about 24 predictors.
  2. Data Transformation: This one I am still looking for answers on. When I took the log of shares plus the log of the predictors, the model seemed to show a great fit and great correlations, but the RMSE worsened to 5680 (on the 20% LB). So I left shares as-is and just transformed the predictors. Maybe that itself is not the best thing to do, and selective log transformation was required!
  3. Modelling: I am still learning the ropes with R, so I had to use SPSS Modeler. I used an ensemble of bagged CHAID decision trees (depth of tree = 6, stopping rule: parent = 100, child = 50, number of bagged samples = 30). NN and SVM were performing poorly due to extreme values/outliers; for the same reason, boosting was not helping either. With the bagged ensemble I was earlier using the mean to predict the final value, but when I changed that to the median, the result improved. I did not investigate this further, though. If I were proficient enough with R, I would have tried a random forest built with conditional inference trees (party package).
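For step 1, the new feature is just the following, with illustrative column names kw_min / kw_max / kw_avg:

```r
# abs((max - min)/2 - average), as described in step 1
train$kw_spread <- abs((train$kw_max - train$kw_min) / 2 - train$kw_avg)
```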

Enjoyed the contest and the learning. Look forward to the winning approach. Thanks Analytics Vidhya :slight_smile:


#10

Please help me understand how you applied feature engineering.


#11

Dear Kunal,

I could not check the accuracy of my work as the URL to upload the .csv was not working.

Now, can you please tell me whether we will be having a hackathon on the coming Sunday, 19th July, here in Kolkata?

If that is the case can you please share the details with me.

Thanks and regards with best wishes
Debanjan


#12

@aayushmnit
Interesting results with Lasso regression. It’s on my to-learn list.


#13

Did you do a log transform only on the number of shares, or also on some input variables?


#14

It’s time to felicitate the new champion of the Data Hackathon Online.
Heartiest congratulations to Mr. Aatish Kumar @aatishk. He has won an Amazon voucher worth Rs. 10,000. Good job, Aatish. Well done!

The final rankings can be seen here --> http://goo.gl/SlR09o


#15

The results are out, as announced by Manish.

For one more day, you can access the solution upload with 100% of the data and see the final standings on the leaderboard:

http://www.datahack.club:8000/leaderboard

Thanks everyone for all the fun! Congrats @aatishk

@kunal, @BALAJI_SR, @Steve, @gauravkantgoel, @anon, @karthiv, @Ishan, @devanshrising, @poornaramakrishnan, @karthe1, @praveen766, @kesavkulkarni, @Rishabh0709, @santu_rcc014, @rahul, @shuvayan, @VIKAS12APRIL, @dkanand86, @rahul29, @Mohammed, @dharmesh_manglani029, @sekhar_chandra630, @Venkata, @amrrs, @mail1, @nayan1247, @ankitbhargava06, @dada_kishore, @adityashrm21, @Jegan_Venkatasamy, @Himanshu_Khattar, @GaurKaps, @sumit, @Atul_Sharma, @Tushar_Kakkar, @Basu, @rithwik, @Aksgupta123_1, @saimadhup, @Vivek_Agarwal, @Sajan_Kedia, @vikash, @Ram_Marthi, @Uday_Bhan_Singh, @explorer, @ParindDhillon, @gauravkumar37, @Ramesh_Ramachandran, @Swapnil_Sharma, @sarthak93, @Malavika, @aatishk, @Abhishek_Nagarjuna, @abhijit7000, @litankumar, @Nalin, @pooja10, @umeshnrao, @sahil030288, @Kaushik_Roy_Chowdhur, @agentJay, @shaz1985, @Avioboy, @Vignesh, @Debanjan_Banerjee, @raghava_r4u, @Sarangam, @nknithinabc, @Sagar_Pardeshi, @rajendra_belwal, @ParulChoudhary, @punardeep, @abhishek_krishna, @tvmanikandan, @Sumangla, @Jitesh_Khandelwal, @vajravi, @manojprabhakar9090, @ramanindya55, @Subarna, @sreeramindia, @Amaresh_Murthiraju, @Shyam_Naren, @Neel_Biswas, @Srikanth_Adepu, @krishna_cse37, @Mounika, @santoshgsk, @metalosaur, @Veldandi_Karthik_Kum, @Sidhraj, @Deepak_Singhal, @Hankarthu, @manizoya_1, @numb3r303, @saiki0684, @kanigalupula, @satya23k, @ashu, @singhrobin1302, @san_sri12, @sreak1089, @dhuwalia_sumit, @Just_Rahul, @Orca, @Rupert, @Mohit_Bansal, @jitin_kapila, @gaurav_ism, @Malav_Shah, @sandeepgupta2, @yadhu621, @pavitrakumar, @Neeraj_Kumar, @mrsan22, @sorabhkalra, @abimannan, @Arunkumar_t, @SRK, @pradeep_1209, @yh18190, @debarati_dutta8, @architv07, @anshulkgupta93_1, @prateek_gs, @bvdrao, @karthikdoesit, @debapriya28, @zs_master, @nathkundan, @0_rishabh, @atul0017, @promila, @vikashsinghy2k, @Amber_Saxena, @ayesha20, @softwarmechanic, @aliasZero, @Akshay_Tiwari, @vijayavel, @sumitgupta07, @Anilkumar_Panda, @anuragbaid, @Debanjan_Chaudhuri, @dalvi2u, @Sai_Krishna_Bala, @dipanjandeb79, @Dev_Anand, @vishnuvardhanreddy40, @mat04460, @savioseb_1, @Bhargavi_Singaraiah, @Abinaya, @sidhusmart, @Prabakaran, @Kushagra_Sharma, @seaem, @Priyabrata_Dash, @Puneet_Rathore, @Anurag, @yash05, @uditsaini88, @Sitaram, @amarjeetpatra, @katakwar_shashank, @G9hardik, @AbhishekSingh, @akbism, @Amol_Pol, @bhavyaghai, @animeesh2007asansol, @kparitoshprasad, @JohnFelix, @abhi, @Mvijay, @Imsy, @Karthigeyan_Thiyagar, @badrinath_mn, @Santhosh_Kumar, @chidambara_natarajan, @varun_khandelwal08, @Shiva_Akella, @kirubakumaresh, @Mohd_Junaid, @ABHISHEK_SINGLA, @Apoorv_Anand, @Sarah_Masud, @Prerit_Khandelwal, @raghavyadav990, @Rajiv_Iyer, @mahanteshimath, @Rashmi1311, @preethyvarma, @Risoni, @kripton99, @santhudr, @Dhruv_Nigam, @Shubham_Jain, @amogh_kaliwal, @Anurag_GV, @Deepesh87, @Ankit_Samantara, @MJInen, @Prarthana_Bhat, @crptick, @Shobhit_Mittal_de_Ro, @Naseer, @Ved_Klawat, @barnava7, @shrawankumarhari, @KP1, @anujp, @NM8185, @Number_Cruncher, @Ajay_Kumawat, @bala_io, @raj_d, @himansu979, @ankurv857, @Abhijay_Ghildyal, @Rajib_Deb, @Akshay_Kher, @Akhilesh_Arora, @mahadevan_gopal1, @akbuddana, @Sachin_Sharma, @ktsreddy, @Raman_Sharma, @aaathi_bala, @preeti_123, @srinivas_analystsas, @smbilal, @rajendraprasadchepur


#16

Congratulations @aatishk


#17

Congrats @aatishk, really great to see you win!


#18

Congrats @aatishk
Great Win :slight_smile:


#19

I used an ensemble of lasso, linear regression and boosted linear regression (using the mboost package) in R.
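The boosted linear regression piece with mboost looks roughly like this; the data names and the mstop value are illustrative:

```r
library(mboost)

# componentwise L2 boosting with linear base-learners
fit  <- glmboost(shares ~ ., data = train,
                 control = boost_control(mstop = 200))
pred <- predict(fit, newdata = test)
```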

I tried a log transformation of the shares variable, but that didn’t help the results.

SVM and RF were also not giving me good results.

After reading this thread, I think I should have tried treating outliers and also a ‘cluster and then predict’ approach.


#20

@aatishk congrats! Looking forward to learning from your code.