Categorical Data Feature Engineering



I have a data set for house price prediction. I am unable to encode some categorical features, such as availability and society. How can I find the correlation between these features and the price output, and measure their importance? I am attaching the data set. Please help me.
Bengaluru_House_Data.csv (916.0 KB)


Hi @amitabha_joy

For the ‘availability’ feature, most values are in the form of a month and a year. You can combine them year-wise and then encode them. For the ‘society’ feature, can you explain more clearly what problem you are facing?


First of all, thanks @AishwaryaSingh

I am new to the data science world.
Could you please give an example of year-wise encoding? All the values fall in the current year: according to the data owner, 18th April means the year 2018, and 17th Jan also denotes the year 2018.

In the case of society, I ran the following code:


It gives a lot of different values. My question is: how should I encode a categorical variable with so many levels?


Hi @amitabha_joy

For the ‘availability’ feature, since it is not possible to group the values year-wise, group them quarter-wise instead. Then you will have 6 levels.
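A minimal sketch of the quarter-wise grouping, in plain Python. It assumes the raw values look like "18-Apr" (day, then a month abbreviation); the exact string format and the "Ready To Move" sample are assumptions for illustration, so adjust the parsing to whatever the column really contains. The helper name availability_to_quarter is hypothetical.

```python
# Month abbreviation -> quarter label.
MONTH_TO_QUARTER = {
    "jan": "Q1", "feb": "Q1", "mar": "Q1",
    "apr": "Q2", "may": "Q2", "jun": "Q2",
    "jul": "Q3", "aug": "Q3", "sep": "Q3",
    "oct": "Q4", "nov": "Q4", "dec": "Q4",
}


def availability_to_quarter(value):
    """Map a raw availability string to a quarter label.

    Assumes date-like values have the form "<day>-<month abbreviation>".
    Anything that does not match is returned unchanged, so non-date
    values stay as their own levels.
    """
    text = str(value).strip()
    parts = text.split("-")
    if len(parts) == 2:
        month = parts[1].strip().lower()[:3]
        if month in MONTH_TO_QUARTER:
            return MONTH_TO_QUARTER[month]
    return text


# Hypothetical sample values, not taken from the attached file.
samples = ["18-Apr", "17-Jan", "19-Dec", "Ready To Move"]
encoded = [availability_to_quarter(s) for s in samples]
print(encoded)
```

If the data is in a pandas DataFrame, the same function could be applied with something like df["availability"].map(availability_to_quarter), and the resulting handful of levels can then be one-hot encoded.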

Regarding the levels in ‘society’, a lot of them, such as MavanK and Wharl P, are present in only one row. You can combine all the society names that appear in fewer than 10 or 20 rows into a single ‘others’ level.
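Here is a rough sketch of combining the rare society names, again in plain Python. The function name group_rare_levels, the min_count parameter, and the "Common Society" sample are hypothetical; the threshold of 10 is just one of the two values suggested above.

```python
from collections import Counter


def group_rare_levels(values, min_count=10, other_label="others"):
    """Replace every level seen in fewer than min_count rows with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical column: one frequent society plus two names that occur once.
society = ["Common Society"] * 12 + ["MavanK", "Wharl P"]
grouped = group_rare_levels(society, min_count=10)

print(Counter(grouped))
```

With pandas, one way to do the same thing is to compute per-row counts with df["society"].map(df["society"].value_counts()) and then keep the original name only where that count reaches the threshold. Missing values are not handled in this sketch and would need their own rule.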