Need help understanding the data: Item MRP vs Item Outlet Sales scatterplot is interesting!

r

#1

Hello Analysts out there!

I created some interesting scatterplots - might be helpful to discuss with you guys.

A scatterplot of Item MRP vs Item Outlet Sales, colored by Outlet Type, turned out like Plot 1.

I then applied a log transformation to Item Outlet Sales, and the result is even more interesting: see Plot 2.

(Both are done in ggplot2 with R.)
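
For anyone who wants to reproduce plots of this kind, here is a minimal sketch. It assumes a data frame called `train` with columns named `Item_MRP`, `Item_Outlet_Sales` and `Outlet_Type`; those names and the few toy rows are assumptions for illustration, not the real data. `scale_y_log10()` is only one way to apply the log transformation.

```r
library(ggplot2)

# Toy rows standing in for the real training data (values are made up).
train <- data.frame(
  Item_MRP          = c(40, 90, 150, 230, 40, 90, 150, 230),
  Item_Outlet_Sales = c(60, 150, 240, 380, 600, 1400, 2300, 3500),
  Outlet_Type       = rep(c("Grocery Store", "Supermarket Type1"), each = 4)
)

# Plot 1: raw sales against MRP, one colour per outlet type.
p1 <- ggplot(train, aes(x = Item_MRP, y = Item_Outlet_Sales,
                        colour = Outlet_Type)) +
  geom_point(alpha = 0.5)

# Plot 2: same points, with the sales axis on a log10 scale.
p2 <- p1 + scale_y_log10()

print(p1)
print(p2)
```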

Questions:
Why are there gaps in the data at Item MRP values of around 65, 130 and 205?
It looks like sales are a linear function of price with some random noise. Is that right?

Please, let’s discuss!

Plot 2:


#2

Plot 1:


#3

Thinking a little more about it, it is quite straightforward:

Sales = MRP * volume

So volume is the slope of that line. Grocery stores sell smaller volumes than SM1… SM3 stores.

So I will go and add an ‘average volume’ engineered feature to my model…
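
A rough sketch of what that could look like, assuming columns named `Item_MRP`, `Item_Outlet_Sales` and `Outlet_Type` (the names and the toy values are assumptions). The implied volume is simply sales divided by MRP, so it can only be computed on rows where sales are known.

```r
# Toy rows standing in for the real training data (values are made up).
train <- data.frame(
  Outlet_Type       = c("Grocery Store", "Grocery Store",
                        "Supermarket Type1", "Supermarket Type1"),
  Item_MRP          = c(50, 200, 50, 200),
  Item_Outlet_Sales = c(100, 400, 500, 2000)
)

# Sales = MRP * volume, so the implied volume is Sales / MRP.
train$Volume <- train$Item_Outlet_Sales / train$Item_MRP

# Average implied volume per outlet type.
avg_volume <- aggregate(Volume ~ Outlet_Type, data = train, FUN = mean)
print(avg_volume)
```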


#4

I guess you figured it out.

Also, the vertical gaps could be due to product categories. You can try plotting the different product types in different colors and see the trend.
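
Besides coloring the plot, one quick way to test this idea is to tabulate the MRP range of each product type. This is only a sketch: the column names `Item_Type` and `Item_MRP` are assumptions, and the toy values below are made up and say nothing about what the real data will show.

```r
# Toy rows standing in for the real training data (values are made up).
train <- data.frame(
  Item_Type = c("Dairy", "Dairy", "Snack Foods", "Snack Foods",
                "Household", "Household"),
  Item_MRP  = c(35, 60, 75, 120, 140, 190)
)

# Lowest and highest MRP seen for each product type.
mrp_min <- tapply(train$Item_MRP, train$Item_Type, min)
mrp_max <- tapply(train$Item_MRP, train$Item_Type, max)

bands <- data.frame(
  Item_Type = names(mrp_min),
  min_mrp   = as.vector(mrp_min),
  max_mrp   = as.vector(mrp_max)
)

# Sort by the lower edge so neighbouring price bands sit next to each other.
bands <- bands[order(bands$min_mrp), ]
print(bands)
```

If the bands of the different types do not overlap and their edges line up with the gaps, categories would explain them. If every type spans the whole price range, they would not.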

Regarding the extra feature, I generally prefer not to use the outcome variable in any form for making a new feature. This way you’re keeping a function of the output as input and the model will end up putting too much emphasis on that.

But you should try it once and see what you get. Let me know how it goes.


#5

Ideally we shouldn’t extract any features from the response variable (Sales), as we don’t have it in the test set. Such a feature only works in the training set.


#6

Hi @vamsi_d,

I agree, but only partially. It depends on which variable you are creating. For example, in this case you can create a variable for the average sales of a product or outlet. Since all products exist in both the train and test sets, you can compute this variable on train and apply the same values to test. In my opinion, though, this is not a good thing to do.
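
Mechanically, that could be sketched as below. The column names `Item_Identifier` and `Item_Outlet_Sales`, the new column `Item_Avg_Sales` and the toy values are all assumptions for illustration. Note that each training row's own sales figure goes into the average it is then given as an input.

```r
# Toy rows standing in for the real train and test sets (values are made up).
train <- data.frame(
  Item_Identifier   = c("A", "A", "B", "B"),
  Item_Outlet_Sales = c(100, 300, 1000, 2000)
)
test <- data.frame(Item_Identifier = c("B", "A", "A"))

# Average sales per item, computed on the training set only.
item_avg <- aggregate(Item_Outlet_Sales ~ Item_Identifier,
                      data = train, FUN = mean)
names(item_avg)[2] <- "Item_Avg_Sales"

# Look the averages up by item; match() keeps the original row order.
train$Item_Avg_Sales <- item_avg$Item_Avg_Sales[
  match(train$Item_Identifier, item_avg$Item_Identifier)]
test$Item_Avg_Sales <- item_avg$Item_Avg_Sales[
  match(test$Item_Identifier, item_avg$Item_Identifier)]

print(test)
```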


#7

Hello Aarshay,

Thank you for your answer - I tried adding average sales volumes per Item Type but it did not help much.

It made the results worse and in variable importance this new variable skyrocketed - i.e. the model did put too much emphasis on it.

Thanks for your reply by the way :slight_smile:

M


#8

@aktakukac,

That’s exactly what I was warning you against. To be honest, I learned it the hard way by trying it on my own :grin:

The idea is simple. If you try to predict a variable using some function of itself, the model will tend to put too much emphasis on that function. It’s a very good lesson for life :slightly_smiling:

Cheers,
Aarshay


#9

Hi, how did you check variable importance?


#11

@Amit_Sood You can use the Boruta package in R for variable importance.
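
A minimal sketch of how that might look, assuming the Boruta package is installed. The toy data frame and its column names (`x1`, `x2`, `noise`, `y`) are made up for illustration; with the real data you would pass your own training frame and use the sales column as the response.

```r
library(Boruta)

set.seed(1)

# Toy data: y depends on x1 and x2, while 'noise' is unrelated.
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n, sd = 0.5)

# Run Boruta; doTrace = 0 keeps the console output quiet.
b <- Boruta(y ~ ., data = toy, doTrace = 0)

# One row per predictor, with importance statistics and the final decision.
imp <- attStats(b)
print(imp[order(-imp$meanImp), c("meanImp", "decision")])
```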