How to deal with more than 1000 levels in a factor variable in R?



Hi everyone,

I am working on a dataset to predict a retail store's sales for the next 6 weeks. You can get the datasets from here.

There are two datasets, ‘train’ and ‘store’. The ‘store’ dataset has information about the different stores owned by the company. There are 1115 stores in total, and every store matters for the sales prediction.

The ‘train’ dataset has the sales records, the number of customers, the date, whether the store was open that day, whether it ran any promotional scheme that day, etc.

Here, the variable ‘store’ has 1115 levels. When applying algorithms like linear regression, it is advisable to one-hot encode categorical variables so that the model can use them as numeric inputs. But since there are so many levels and every one of them is significant, I am confused. How should I deal with this variable? Should I convert it to a one-hot encoding or leave it as it is in the model?

Help would be really appreciated.


Hi @shashwat.2014:

I believe this is the Rossmann Kaggle challenge.

When a column has this many categories, I believe there are 2 options we can go for:

  1. Use business knowledge to build a hierarchy over the levels. For example, say I have 2000 CCD outlets in Bangalore: I can group them by area, by some higher level of the hierarchy, by whether the outlet is a lounge or another category of outlet, or by any other classification you can come up with.

  2. Convert the levels to a numeric order. I know this does not sound right statistically, but I can assure you it will help a lot: the model then reads a unit change in store as an increase or decrease in sales. This becomes interesting when your data is able to capture the variation in sales.


Hi @Swapnil_Sharma

Yes, the problem is the Rossmann Kaggle challenge.

Following option 1, I plan to group the stores by StoreType and Assortment, since stores of the same type should affect sales in a similar way because of the range of products and the kind of facilities they have.

Now I have the following question:
Should I build a separate model for each of the groups I end up with?
Or should I just create another variable showing the category of the store? Here ‘category’ means a unique combination of store type and assortment, e.g. Category 1 could be all stores with StoreType=‘a’ and Assortment=‘b’.

Thanks in advance!
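
For reference, the "category" variable described above could be built roughly like this. This is only a sketch in Python with pandas (the thread itself works in R), using tiny made-up tables. `StoreType` and `Assortment` come from the post; the `Store` and `Sales` column names, the `Category` label format and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the 'store' table; values are invented for illustration.
store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "StoreType": ["a", "a", "b", "c"],
    "Assortment": ["b", "b", "a", "c"],
})

# One label per unique (StoreType, Assortment) combination, e.g. "a_b".
store["Category"] = store["StoreType"] + "_" + store["Assortment"]

# Toy stand-in for the 'train' table (column names assumed).
train = pd.DataFrame({
    "Store": [1, 2, 3, 4, 1],
    "Sales": [5263, 6064, 8314, 4822, 5020],
})

# Attach the category to every sales record through the store id.
train = train.merge(store[["Store", "Category"]], on="Store", how="left")
print(train)
```

With the real data, the number of distinct `Category` values is at most the number of store types times the number of assortment types, which is far smaller than 1115.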


Hi @shashwat.2014:
I had the same thought while attempting the problem. I considered treating it as a variable and building a single model, but I don't know how it performs, as I didn't finish solving it.
From a business point of view it does not make sense to create a separate model for each of the 1115 stores; it is better to treat the store as a variable.
Try it out; that is as much as I can tell you for this problem.


Thanks a lot for your inputs.
I have 9 categories for now (using the above criterion), so I am planning to build 9 separate models and then combine the results. I will also try using the category as a variable, probably one-hot encoding it.
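
A rough sketch of the "one model per category, then combine" plan, in Python with pandas and NumPy rather than R. Everything here is an assumption for illustration: the toy data, the `Category`, `Customers` and `Sales` column names, and the choice of a straight-line fit of sales on customers as the per-category model. It is not the model anyone in this thread actually used.

```python
import numpy as np
import pandas as pd

# Toy data with two categories; values are invented for illustration.
train = pd.DataFrame({
    "Category":  ["a_b", "a_b", "a_b", "c_c", "c_c", "c_c"],
    "Customers": [100, 200, 300, 100, 200, 300],
    "Sales":     [1000, 2000, 3000, 500, 1000, 1500],
})

# Fit one straight line (Sales ~ Customers) per category.
models = {}
for category, group in train.groupby("Category"):
    slope, intercept = np.polyfit(group["Customers"], group["Sales"], deg=1)
    models[category] = (slope, intercept)

def predict(row):
    """Predict sales for one record using its own category's model."""
    slope, intercept = models[row["Category"]]
    return slope * row["Customers"] + intercept

# "Combining the results": each row is scored by the model of its category.
train["Predicted"] = train.apply(predict, axis=1)
print(train)
```

The alternative from the same post, a single model with the category as a one-hot encoded variable, shares the other coefficients across categories, whereas separate models let every coefficient differ per category.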



Hi @shashwat.2014:

Just a suggestion, if you are working in Python:

Use get_dummies in pandas to create the dummy variables, then drop whichever one you want to treat as the base level.
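
A minimal sketch of that suggestion. The `Category` column and its values are made up for illustration; `pd.get_dummies` and its `prefix` and `drop_first` parameters are part of pandas.

```python
import pandas as pd

# Toy categorical column; values are invented for illustration.
df = pd.DataFrame({"Category": ["a_b", "b_a", "c_c", "a_b"]})

# One indicator column per level.
all_dummies = pd.get_dummies(df["Category"], prefix="Category")

# Drop the level chosen as the base; here "a_b" is picked by hand.
dummies = all_dummies.drop(columns=["Category_a_b"])

# Shortcut: drop_first=True drops the first level in sorted order instead.
dummies_auto = pd.get_dummies(df["Category"], prefix="Category", drop_first=True)

print(dummies)
```

Dropping one level keeps the dummy columns from being perfectly collinear with the intercept in a linear regression; the dropped level becomes the baseline the other coefficients are measured against.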



Hi @Swapnil_Sharma,

I am currently working in R, but I am sure I can implement this technique there as well.