Black Friday Data Hack - Reveal your approach




I completely agree with you and understand what you are trying to do, but I was talking about the test data. As we know, the Purchase variable is not present in the test data, but in your code I see that ‘Product_Mean’, which you calculated on the train data, is used with the same values in the test data as well! To add to the point, Product_Mean is nothing but the average of Purchase. For the train data this new feature is perfect, but for the test data, where Purchase is not given to us (it is to be predicted), you are using the product mean calculated from the train data and putting the same values on the test rows.
If we were given the purchase for the test data as well, then we would have calculated the product mean for the test data in the same fashion!


If we were given the purchase amount for the test data, you could submit that as a perfect prediction and have a very relaxed weekend :smile:

I’m not sure you understood the challenge: it is to predict the purchase for the test data. That is not given.


In my very first comment I mentioned the problem statement, i.e. to predict the purchase in the test data. Anyway, my point was that in the test data you have added a product mean column which was calculated from the purchases of the users and the corresponding product IDs in the train data!


Ok, in simple words, I’ve replaced the Product_ID column with Product_Mean, because the IDs themselves are meaningless. Hence, the same values have to be used in train as well as test, since they are the same products.

This was the most important variable in my model, and gave the maximum boost in accuracy.
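
For readers following along, here is a minimal sketch of the idea described above. It is not the original competition code: the toy rows, the helper names `product_means` and `add_product_mean`, and the fallback to the overall train mean for products that never appear in train are all assumptions made for illustration. It is written in plain Python so it runs without any libraries.

```python
from collections import defaultdict

# Hypothetical toy rows; column names follow the thread, values are made up.
train = [
    {"Product_ID": "P001", "Purchase": 100.0},
    {"Product_ID": "P001", "Purchase": 300.0},
    {"Product_ID": "P002", "Purchase": 50.0},
]
test = [
    {"Product_ID": "P001"},
    {"Product_ID": "P002"},
    {"Product_ID": "P999"},  # product never seen in train
]


def product_means(rows):
    """Average Purchase per Product_ID, computed from labelled rows only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["Product_ID"]] += row["Purchase"]
        counts[row["Product_ID"]] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}


def add_product_mean(rows, means, fallback):
    """Return copies of rows with a Product_Mean column looked up by Product_ID."""
    return [
        dict(row, Product_Mean=means.get(row["Product_ID"], fallback))
        for row in rows
    ]


# The lookup table is built from train only, because test has no Purchase.
means = product_means(train)
# Assumption: unseen products fall back to the overall train mean.
global_mean = sum(row["Purchase"] for row in train) / len(train)

train_enc = add_product_mean(train, means, global_mean)
test_enc = add_product_mean(test, means, global_mean)
```

The test rows never need their own Purchase values: they only borrow the per-product average learned from train, which is why the same product gets the same Product_Mean in both files. With pandas, the same lookup can be built with a `groupby` on Product_ID followed by a `map` onto the test column.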


Hi Rohan, good job. So you want to fix the Purchase amount for each Product_ID and use it in both train and test, right? Makes sense.


How did you impute those missing values?


How did you impute missing values?
@tukai007, why did you directly replace the values with -999?


@ved I also used the same approach as tukai007.

Decision tree based models give us some extra ease with respect to missing values. Instead of imputing the mean/median/mode, we denote the missing value with a label, say -999. Random forests then treat it as a separate group and figure out whether this missing group of data has useful information hidden in it, using measures like entropy reduction or Gini impurity. I suggest you read up on these measures and how they are used to build a simple decision tree, in order to completely understand what is happening under the hood.
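
A small sketch of why the sentinel works, under stated assumptions: the data are made up, `fill_missing` and `best_split` are hypothetical helpers, and the split score is the weighted variance of the target, which is the usual criterion for a regression target like Purchase (entropy and Gini play the same role for classification). It searches a single feature for the best threshold, the way one node of a tree would.

```python
MISSING = -999  # sentinel label used in the thread


def fill_missing(values, sentinel=MISSING):
    """Replace None with the sentinel instead of imputing mean/median/mode."""
    return [sentinel if v is None else v for v in values]


def variance(ys):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def best_split(xs, ys):
    """Return (threshold, score) for the best 'x <= threshold' split.

    The score is the size-weighted variance of the two sides; lower is better.
    """
    best = None
    levels = sorted(set(xs))
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up feature with gaps, where the rows with a missing value
# happen to have much higher purchases than the rest.
raw_category = [None, 2, None, 8, 5, None]
purchase = [900.0, 210.0, 950.0, 190.0, 200.0, 920.0]

filled = fill_missing(raw_category)
threshold, score = best_split(filled, purchase)
```

In this toy case the best threshold falls between -999 and the smallest real value, so the split isolates exactly the missing rows. If missingness carried no signal, another threshold would score better and the sentinel group would simply be ignored.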


Hi guys,

I am working on a random forest model. I have many categorical variables as independent variables. Should I create dummies or convert them to numbers? For example, in the city variable I have values such as mumbai, kolkatta, chennai, bangaluru.
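
To make the two options in this question concrete, here is a small sketch in plain Python. The list of rows is made up, and the alphabetical ordering of levels is an arbitrary assumption; it shows what each encoding produces, not which one is better for a given model.

```python
# Made-up column of city values, using the spellings from the post.
cities = ["mumbai", "kolkatta", "chennai", "bangaluru", "mumbai"]

# Fix an order for the distinct levels (alphabetical here, an arbitrary choice).
levels = sorted(set(cities))

# Option 1: change them to numbers, one integer code per level.
code_of = {city: index for index, city in enumerate(levels)}
int_codes = [code_of[city] for city in cities]

# Option 2: create dummies, one 0/1 column per level.
dummies = [[1 if city == level else 0 for level in levels] for city in cities]
```

Integer codes keep a single column but impose an order on the cities that has no real meaning, while dummies avoid that order at the cost of one column per level.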


‘replaced missing values with 9999’

What is the logic behind replacing missing values with 9999?


@aayushmnit Hi, can you please elaborate on how you got the feature importance by running a random forest? I can’t really find anything about this in the code you linked. Thanks


@BeautifulCodes : Please go through this link.


Thank you


Hey Kunal,

Is there any connection between Product_ID and the product categories? Does Product_ID correspond to any of the variables like Product_Category_1, 2 or 3?
Kindly help me.



@ank_dsm: I am a learner here. Can you share your code as well? (I wanted to see how we can run a DT.)



Hi, I’m a student of Data Science. For my final project I’m working on prediction and sentiment analysis of the Black Friday dataset that was used for the competition. I would like to know what the variables mean. For example, in the Occupation variable, what do the numbers mean? It would be really helpful.
Thank you


hi @dalia_lucia,

They are simply masked variables. Each number represents a different occupation. For example, 10 might mean teacher, 16 could be consultant, 7 could be doctor and so on. Treat these numbers as categories.


I know that this is not quite the objective of the exercise, and that we are meant to take the dataset at face value, but I am trying to make sense of it. The table contains various entries of User_ID and Product_ID, the latter of which I take to mean the products the user bought. It also has some demographic information on the user, and some categorical information on the product (supposedly) bought. But if the table contains the Product_ID, I don’t really get why the other information is necessary, since we are supposed to know the price of each product and the quantity purchased (if not in the table, then elsewhere). What is the rationale behind this prediction model? Any ideas?