How does regression with categorical variables work?



Say I have a feature city with values {Delhi, Mumbai, Kolkata, etc.} and a feature population containing numerical data, and I want to predict a third feature (say pollution) from city and population using multiple regression. I can encode each city as a number: Delhi -> 0, Mumbai -> 1, Kolkata -> 2, and so on. But if I then apply regression, won't that code be treated as an ordinary numerical value rather than as a category?
That does not seem correct: if Kolkata is coded as 2, Mumbai as 1 and Delhi as 0, regression will always assume that the order of impact on the answer is Kolkata > Mumbai > Delhi or Kolkata < Mumbai < Delhi.

What is the mathematics behind regression with categorical variables? Do we need to create new features like is_Delhi, is_Mumbai and is_Kolkata, each with a 0 or 1 value for every training example?



The integer encoding you describe is useful when the categories of a feature have a natural order. Since the city categories have no order, you need a different encoding technique, such as one-hot encoding (OHE), which creates exactly the kind of 0/1 features like is_Delhi and is_Mumbai that you mention.
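
To make the idea concrete, here is a minimal sketch of one-hot encoding followed by an ordinary least-squares fit, using only NumPy. The data and the coefficient values are made up for illustration and are not from the question. The sketch also assumes one common convention: one city (Delhi here) is left out as the baseline, because with an intercept in the model the full set of indicator columns would sum to the intercept column. The fitted model then has the form pollution = b0 + b1*is_Kolkata + b2*is_Mumbai + b3*population, where b1 and b2 are each city's offset relative to Delhi.

```python
import numpy as np

# Made-up illustrative data (not from the question); population in millions.
cities = ["Delhi", "Mumbai", "Kolkata", "Delhi", "Mumbai", "Kolkata"]
population = np.array([1.0, 1.5, 0.5, 2.0, 3.0, 2.5])

# One 0/1 indicator column per city, except the baseline city (an assumption
# of this sketch). Dropping one category keeps the indicator columns from
# summing to the intercept column.
baseline = "Delhi"
levels = [c for c in sorted(set(cities)) if c != baseline]  # Kolkata, Mumbai
dummies = np.array(
    [[1.0 if city == level else 0.0 for level in levels] for city in cities]
)

# Design matrix columns: intercept, is_Kolkata, is_Mumbai, population.
X = np.column_stack([np.ones(len(cities)), dummies, population])

# Synthetic target generated from known (hypothetical) coefficients,
# so the fitted values can be checked against them.
true_coef = np.array([10.0, -3.0, 5.0, 2.0])
pollution = X @ true_coef

# Ordinary least squares: each city gets its own offset from the baseline,
# with no ordering imposed between the cities.
coef, *_ = np.linalg.lstsq(X, pollution, rcond=None)

names = ["intercept"] + [f"is_{level}" for level in levels] + ["population"]
for name, value in zip(names, coef):
    print(f"{name}: {value:.2f}")
```

Because every city has its own coefficient, the model is free to give Kolkata a lower effect than Delhi and Mumbai a higher one, which a single 0/1/2 column could not express. In practice, helpers such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` are commonly used to build the indicator columns.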

Ankit Gupta