I am trying to normalize the data in the entire dataframe. I tried packages like vegan and caret, but I keep running into errors. Caret works only if I first convert all the integer columns to numeric. If someone knows a simple way of normalizing the data in train, please let me know.
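For anyone who wants to see the idea without a package, here is a minimal sketch of min-max normalization in plain Python (my choice of language; the question above is about R). It casts integer columns to float first, which mirrors the integer-to-numeric step mentioned above. The table layout and the column names are made up for illustration.

```python
# Minimal sketch: rescale every numeric column of a small table to [0, 1].
# The table is a dict of column name -> list of values (an assumed layout).

def min_max_normalize(table):
    """Return a copy of `table` with numeric columns rescaled to [0, 1]."""
    result = {}
    for name, values in table.items():
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        # Leave non-numeric (or empty) columns untouched.
        if not is_numeric:
            result[name] = list(values)
            continue
        # Cast integers to float before scaling.
        as_float = [float(v) for v in values]
        lo, hi = min(as_float), max(as_float)
        span = hi - lo
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        result[name] = [(v - lo) / span if span else 0.0 for v in as_float]
    return result


# Hypothetical columns, for illustration only.
train = {
    "url": ["a", "b", "c"],
    "word_count": [100, 300, 500],
    "image_count": [2, 2, 2],
}
normalized = min_max_normalize(train)
```

Whatever minimum and maximum you compute on train should be reused on the test set, otherwise the two sets end up on different scales.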
Should we enter all 46 numerical variables in the model, or is it advisable to first identify a smaller set of important variables (around 10)? Also, any idea how to do that?
@abhijit7000 Yes, you should definitely look at reducing the number of variables, using dimensionality reduction techniques such as feature selection or Principal Component Analysis, and also by thinking analytically about whether each variable makes sense for the target you are trying to predict, i.e. the number of shares. Furthermore, you should also consider engineering new features!
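As one very simple form of feature selection, here is a sketch that ranks variables by the absolute Pearson correlation with the target and keeps the top k. It is only a first-pass filter under my own assumptions: it scores one variable at a time, so it misses interactions and non-linear effects. The feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(features, target, k):
    """Names of the k features with the largest absolute correlation to target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical data, for illustration only.
features = {
    "feat_a": [1, 2, 3, 4, 5],  # rises with the target
    "feat_b": [5, 3, 4, 1, 2],  # mostly falls as the target rises
    "feat_c": [7, 7, 7, 7, 7],  # constant, carries no information
}
shares = [10, 20, 30, 40, 50]
selected = top_k_features(features, shares, k=2)
```

With 46 variables you would call this with k=10 and then sanity-check the survivors against your own understanding of the problem.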
Maybe people will find this useful… http://www.slideshare.net/DataRobot/final-10-r-xc-36610234
Nice overview of possible approaches. Best of luck!
Ensemble Learning is also useful. Combine results from multiple approaches… https://en.wikipedia.org/wiki/Ensemble_learning
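The simplest ensemble is a plain average of the predictions from two or more models. A minimal sketch follows; the numbers are contrived so that the two models' errors cancel exactly, which real models will not do this neatly.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def average_predictions(*prediction_lists):
    """Element-wise mean of several models' predictions."""
    return [sum(vals) / len(vals) for vals in zip(*prediction_lists)]


# Made-up values, for illustration only.
actual = [100.0, 200.0, 300.0]
model_a = [110.0, 190.0, 310.0]  # errors: +10, -10, +10
model_b = [90.0, 210.0, 290.0]   # errors: -10, +10, -10
blended = average_predictions(model_a, model_b)
```

Averaging tends to help most when the models make different kinds of mistakes; blending two near-identical models gains little.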
@kunal Hi, is it possible to display the logged-in usernames on the leaderboard? The same participant can enter multiple IDs here, and some do. It's very confusing. Forum members, please share your thoughts.
@kunal At least make it mandatory that the email ID must exist to qualify for submitting a solution.
Why is my RMSE score not reflected on the leaderboard? Does it show up after a lag?
The RMSE score only gets updated if the new one is lower. The update should be instantaneous.
Also, in the description
n_unique_tokens: Rate of unique words in the content
What does “Rate” specify here?
The update happens only if the new score is lower, and it is instantaneous.
We will implement it in the next version of the platform. For now, it is best to keep pseudo IDs, as there might be email snoopers fishing around.
Is rate = percentage?
n_non_stop_unique_tokens: Rate of unique non-stop words in the content
What does “non-stop word” mean?
Should the number of shares be rounded off to the nearest integer, or are decimal values fine as they are?
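My reading of these two descriptions, which the organizers have not confirmed: “rate” is the ratio of unique words to total words (a fraction between 0 and 1, not a percentage), and “non-stop words” are the words left after dropping very common stop words such as “the” or “and”. A small sketch under that assumption, with a tiny made-up stop-word list and a hypothetical helper name:

```python
# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to"}


def unique_token_rate(text, drop_stop_words=False):
    """Fraction of distinct words among all words in `text`."""
    tokens = text.lower().split()
    if drop_stop_words:
        # Keep only the non-stop words before counting.
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


content = "the cat and the dog and the bird"
rate_all = unique_token_rate(content)  # 5 distinct words out of 8
rate_non_stop = unique_token_rate(content, drop_stop_words=True)  # 3 out of 3
```

How the actual dataset tokenized the text and which stop-word list it used is not stated in the description, so treat this only as a way to interpret the columns.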
@abhijit7000 Look at the train file and you should be able to figure it out!
It wouldn’t matter much either way.
Hi Analytics Vidhya team,
http://www.datahack.club:8000/ is not working. Can you please check it?
OK, it’s working now.
Yes, I’m also facing the same issue.
Thanks, Kunal, for the prompt response. Now I can access the leaderboard.
I have created a function to compare the LDAs.
URL: https://github.com/akshaykher/Data-Hackathon---Predict-the-gem-of-Auxesia-for-Magzino/blob/master/LDA_comparison.R
From this we can easily see which LDA bolsters sharing.
After going through the data set, I found that some variables are highly skewed.
A quick primer on skewness: http://www.mu-sigma.com/analytics/thought_leadership/cafe-cerebral-basic-statistics.html
Reducing skewness can help improve the overall model (and hence reduce RMSE).
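A common way to reduce right skew is a log transform. The sketch below measures skewness before and after applying log(1 + x) to a made-up, heavily right-skewed column; the transform and the sample values are my assumptions, not something taken from the competition data.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0.0 for a constant list)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Made-up, heavily right-skewed values, for illustration only.
shares = [10, 100, 1000, 10000, 100000]

# log1p(x) = log(1 + x), which is safe for zero counts as well.
logged = [math.log1p(v) for v in shares]

skew_before = skewness(shares)
skew_after = skewness(logged)
```

If you transform the target itself this way, remember to map the predictions back with math.expm1 before computing RMSE on the original scale.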