This is a thread to share, discuss and brainstorm the approaches you tried in the last two days.
The more we share, the more we will learn! Over to you guys!
Nice initiative…
I used R with caret (for preprocessing), flexclust (for clustering) and randomForest (for modeling) packages.
Using k-means, I created a few clusters in the training set (and assigned test-set rows to the same clusters), then used random forest to fit a model for each cluster.
Since the shares variable was skewed, I used log transformation before fitting the model.
At least on the 40% test set, it gives the best RMSE so far (6689.25906869). Hoping that it remains the best on the 100% test set
Fingers crossed
Update: Managed to repeat the performance for 100% of data set!
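The cluster-then-predict idea above (the poster used R's flexclust and randomForest packages) can be sketched in Python. Everything here, the synthetic data, the cluster count and the forest settings, is invented for illustration:

```python
# Sketch of "cluster then predict" with a log-transformed skewed target.
# Synthetic data; the original work was done in R with flexclust/randomForest.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = np.exp(rng.normal(loc=5, size=200))  # skewed, share-like target
X_test = rng.normal(size=(50, 5))

# 1. Cluster the training set; assign test rows to the same clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
train_labels = km.labels_
test_labels = km.predict(X_test)

# 2. Fit one random forest per cluster on the log-transformed target.
models = {}
for c in np.unique(train_labels):
    mask = train_labels == c
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_train[mask], np.log1p(y_train[mask]))
    models[c] = rf

# 3. Predict with each row's cluster-specific model, inverting the log.
preds = np.empty(len(X_test))
for c, rf in models.items():
    mask = test_labels == c
    if mask.any():
        preds[mask] = np.expm1(rf.predict(X_test[mask]))
```

Fitting a separate model per cluster lets each forest specialise on one regime of the data, at the cost of less training data per model.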
I started simple with the complete data set, converting the two categorical columns in pandas, and built a benchmark script with both RF and XGBoost.
After that, the following steps:
Divided the train data set into two parts, train and test; test was about 50% of the data. (As there was no leaderboard in the beginning, I was scared of overfitting.)
Ran both RF and XGBoost on it.
Did some parameter optimization for both of them. Got some decent scores.
Removed outliers from the training set: the top 15-20 rows with the largest 'shares' values. This improved the score greatly.
After this I did some feature selection: removed features to see if the RMSE improved, and dropped some 5-6 features based on that.
After doing all this I compared the XGBoost score, the RF score and their average; all three gave similar scores on the PLB (public leaderboard).
So I thought of trying something different. I tried KNN and Lasso: no improvement in score and a negative impact on the ensemble, so I ditched them.
Finally went with the RF and XGBoost ensemble. Happy with the learnings. Sad that I couldn't build two distinct models that give similar performance and whose ensemble would give better results…
PS: would love to know if someone was able to build two+ distinct models which were equally good and whose ensemble did better.
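As a rough illustration of the RF + XGBoost average ensemble described above, here is a Python sketch on synthetic data. scikit-learn's GradientBoostingRegressor stands in for XGBoost, and all data and settings are invented:

```python
# Average ensemble of a random forest and a boosted-tree model,
# with a ~50% holdout as described in the post. Synthetic data throughout.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Hold out ~50% of the training data as a local test set (no leaderboard yet).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)  # XGBoost stand-in

# Simple averaging ensemble, scored with RMSE.
pred_avg = (rf.predict(X_te) + gb.predict(X_te)) / 2.0
rmse = mean_squared_error(y_te, pred_avg) ** 0.5
```

Averaging helps most when the two models make uncorrelated errors, which is also why KNN or Lasso with worse standalone scores can drag the ensemble down.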
Thanks AnalyticsVidhya team for conducting this Hackathon!!
It was a very interesting scenario & wonderful learning experience.
I am a newbie and was stuck with linear regression for more than a day.
Then I learnt about PCA, reduced some of the features with it, and discarded a few with the least correlation with the outcome.
Tried ridge and lasso regression. At the end I heard about ensembles and tried RF and SVM. For me the best outcome was with Lasso.
It was fun learning!!
Used glmnet, leaps, e1071 R packages during the competition.
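The poster worked in R (glmnet, leaps, e1071); a hypothetical Python sketch of the PCA-then-Lasso route, on synthetic data with invented dimensions, might look like:

```python
# PCA for feature reduction followed by a Lasso fit, chained in a pipeline.
# Synthetic stand-in data; the original work used R's glmnet.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Reduce 20 raw features to 10 principal components, then fit the Lasso.
model = make_pipeline(PCA(n_components=10), Lasso(alpha=1.0))
model.fit(X, y)
preds = model.predict(X)
```

A pipeline keeps the PCA projection learned on training data applied consistently at prediction time.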
Thanks!
Chidam
Hi All,
Following is my approach to solving the problem:
I started with a simple benchmark file using linear regression and then compared results with random forest. I realised random forest was likely to overfit the data set, and since linear regression was giving good results I used it throughout. (I also tried SVM, gamboost and glmboost, but saw no improvement in results due to overfitting, even after tuning the parameters.)
I looked at the distribution of share values; as it was heavily skewed, I removed every row with a share value > 90,000 (experimentally optimised, ~80 rows). Linear regression is prone to outliers and gives a bad fit, so this gave the maximum boost.
Then I started feature engineering on the variables that were not coming out as significant as business logic suggested they should; this gave an additional boost in the scores (bucketing LDA, number of videos, global subjectivity, and positive and negative connotations worked well for me).
I made individual features from both categorical variables, like is_monday (1/0). It didn't give me any boost, but I like doing it as it gave me more business understanding of which categories are significant for the number of shares.
Last, I tried penalising the linear regression using Lasso regression, which gave me a little boost in performance, so I went with Lasso in the end.
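A hypothetical pandas/scikit-learn sketch of the outlier trim, day-of-week dummies and Lasso steps described above; the data frame, column names and everything except the 90,000-share cut are invented for illustration:

```python
# Outlier trimming + one-hot day features + Lasso, on a toy stand-in frame.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_videos": rng.integers(0, 10, 500),
    "global_subjectivity": rng.random(500),
    "weekday": rng.choice(["monday", "tuesday", "wednesday"], 500),
    "shares": rng.exponential(3000, 500),
})
df.loc[:4, "shares"] = 200_000  # plant a few extreme outliers

# 1. Trim extreme outliers, as in the post (share value > 90,000).
df = df[df["shares"] <= 90_000]

# 2. One-hot encode the categorical day column (is_monday-style 1/0 features).
df = pd.get_dummies(df, columns=["weekday"], prefix="is")

# 3. Fit a Lasso-penalised linear regression on the remaining rows.
X = df.drop(columns="shares")
y = df["shares"]
model = Lasso(alpha=10.0).fit(X, y)
```

Trimming before fitting matters here because squared-error linear models chase extreme targets; the dummies mostly aid interpretability, as the post notes.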
Regards,
Aayush Agrawal
After performing the preprocessing steps, I was initially using Random Forest. But then I realised it is count data and hence used Poisson regression. Results were not satisfactory. Later I realised the problem has zero truncation, so I moved to a zero-truncated Poisson regression (VGAM's pospoisson).
I was facing this error when applying it:
mod1 <- vglm(shares ~., family = pospoisson(), data = dataset)
Error in if ((temp <- sum(wz[, 1:M, drop = FALSE] < wzepsilon))) warning(paste(temp, :
argument is not interpretable as logical
I could not find a solution for this online. Has anyone tried this?
Forums suggested using tryCatch to track down the error, but I don't know how to use that command effectively.
Can anyone suggest how tryCatch can be used in R?
Hello Friends
I achieved ~5559 on the 20% LB; I can't recall the score on 40%. Here is the broad approach I adopted
Enjoyed the contest and the learning. Look forward to the winning approach. Thanks Analytics Vidhya
Dear Kunal,
I could not check the accuracy of my work as the URL to upload the .csv was not working.
Also, can you please tell me whether we will be having a hackathon this coming Sunday, 19th July, in Kolkata?
If that is the case can you please share the details with me.
Thanks and regards with best wishes
Debanjan
It's time to felicitate the new champion of the Data Hackathon Online.
Heartiest Congratulations to Mr. Aatish Kumar @aatishk. He has won Amazon voucher worth Rs. 10,000. Good Job Aatish. Well done!
The final rankings can be seen here --> http://goo.gl/SlR09o
The results are out, as announced by Manish.
The solution upload (scored on 100% of the data) will remain open for another day, and you can see the final standings on the leaderboard:
http://www.datahack.club:8000/leaderboard
Thanks everyone for all the fun! Congrats @aatishk
@kunal, @BALAJI_SR, @Steve, @gauravkantgoel, @anon, @karthiv, @Ishan, @devanshrising, @poornaramakrishnan, @karthe1, @praveen766, @kesavkulkarni, @Rishabh0709, @santu_rcc014, @rahul, @shuvayan, @VIKAS12APRIL, @dkanand86, @rahul29, @Mohammed, @dharmesh_manglani029, @sekhar_chandra630, @Venkata, @amrrs, @mail1, @nayan1247, @ankitbhargava06, @dada_kishore, @adityashrm21, @Jegan_Venkatasamy, @Himanshu_Khattar, @GaurKaps, @sumit, @Atul_Sharma, @Tushar_Kakkar, @Basu, @rithwik, @Aksgupta123_1, @saimadhup, @Vivek_Agarwal, @Sajan_Kedia, @vikash, @Ram_Marthi, @Uday_Bhan_Singh, @explorer, @ParindDhillon, @gauravkumar37, @Ramesh_Ramachandran, @Swapnil_Sharma, @sarthak93, @Malavika, @aatishk, @Abhishek_Nagarjuna, @abhijit7000, @litankumar, @Nalin, @pooja10, @umeshnrao, @sahil030288, @Kaushik_Roy_Chowdhur, @agentJay, @shaz1985, @Avioboy, @Vignesh, @Debanjan_Banerjee, @raghava_r4u, @Sarangam, @nknithinabc, @Sagar_Pardeshi, @rajendra_belwal, @ParulChoudhary, @punardeep, @abhishek_krishna, @tvmanikandan, @Sumangla, @Jitesh_Khandelwal, @vajravi, @manojprabhakar9090, @ramanindya55, @Subarna, @sreeramindia, @Amaresh_Murthiraju, @Shyam_Naren, @Neel_Biswas, @Srikanth_Adepu, @krishna_cse37, @Mounika, @santoshgsk, @metalosaur, @Veldandi_Karthik_Kum, @Sidhraj, @Deepak_Singhal, @Hankarthu, @manizoya_1, @numb3r303, @saiki0684, @kanigalupula, @satya23k, @ashu, @singhrobin1302, @san_sri12, @sreak1089, @dhuwalia_sumit, @Just_Rahul, @Orca, @Rupert, @Mohit_Bansal, @jitin_kapila, @gaurav_ism, @Malav_Shah, @sandeepgupta2, @yadhu621, @pavitrakumar, @Neeraj_Kumar, @mrsan22, @sorabhkalra, @abimannan, @Arunkumar_t, @SRK, @pradeep_1209, @yh18190, @debarati_dutta8, @architv07, @anshulkgupta93_1, @prateek_gs, @bvdrao, @karthikdoesit, @debapriya28, @zs_master, @nathkundan, @0_rishabh, @atul0017, @promila, @vikashsinghy2k, @Amber_Saxena, @ayesha20, @softwarmechanic, @aliasZero, @Akshay_Tiwari, @vijayavel, @sumitgupta07, @Anilkumar_Panda, @anuragbaid, @Debanjan_Chaudhuri, @dalvi2u, @Sai_Krishna_Bala, 
@dipanjandeb79, @Dev_Anand, @vishnuvardhanreddy40, @mat04460, @savioseb_1, @Bhargavi_Singaraiah, @Abinaya, @sidhusmart, @Prabakaran, @Kushagra_Sharma, @seaem, @Priyabrata_Dash, @Puneet_Rathore, @Anurag, @yash05, @uditsaini88, @Sitaram, @amarjeetpatra, @katakwar_shashank, @G9hardik, @AbhishekSingh, @akbism, @Amol_Pol, @bhavyaghai, @animeesh2007asansol, @kparitoshprasad, @JohnFelix, @abhi, @Mvijay, @Imsy, @Karthigeyan_Thiyagar, @badrinath_mn, @Santhosh_Kumar, @chidambara_natarajan, @varun_khandelwal08, @Shiva_Akella, @kirubakumaresh, @Mohd_Junaid, @ABHISHEK_SINGLA, @Apoorv_Anand, @Sarah_Masud, @Prerit_Khandelwal, @raghavyadav990, @Rajiv_Iyer, @mahanteshimath, @Rashmi1311, @preethyvarma, @Risoni, @kripton99, @santhudr, @Dhruv_Nigam, @Shubham_Jain, @amogh_kaliwal, @Anurag_GV, @Deepesh87, @Ankit_Samantara, @MJInen, @Prarthana_Bhat, @crptick, @Shobhit_Mittal_de_Ro, @Naseer, @Ved_Klawat, @barnava7, @shrawankumarhari, @KP1, @anujp, @NM8185, @Number_Cruncher, @Ajay_Kumawat, @bala_io, @raj_d, @himansu979, @ankurv857, @Abhijay_Ghildyal, @Rajib_Deb, @Akshay_Kher, @Akhilesh_Arora, @mahadevan_gopal1, @akbuddana, @Sachin_Sharma, @ktsreddy, @Raman_Sharma, @aaathi_bala, @preeti_123, @srinivas_analystsas, @smbilal, @rajendraprasadchepur
I used an ensemble of lasso, linear regression and boosted linear regression (using mboost package) in R.
I tried a log transformation of the shares variable, but that didn't help the results.
SVM and random forests were also not giving me good results.
After reading this thread, I think I should have tried treating outliers and also a 'cluster and then predict' approach.