Say I have a feature city with values {Delhi, Mumbai, Kolkata, etc.} and a feature population with numerical data. I want to predict a third variable (say pollution) from city and population using multiple regression. I can code each city as a number: Delhi–>0, Mumbai–>1, Kolkata–>2, and so on. But if I then apply regression, won't the city code be treated as an ordinary numerical value rather than a categorical one?

This does not seem correct: if Kolkata is coded as 2, Mumbai as 1 and Delhi as 0, regression will always assume that the order of impact on the answer is either Kolkata > Mumbai > Delhi or Kolkata < Mumbai < Delhi.
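
To make the concern concrete, here is a minimal sketch of the integer coding using NumPy least squares. The data values and the names `city_code`, `predict`, `gap_mumbai_delhi` and `gap_kolkata_mumbai` are made up for illustration. Because the city enters the model as a single numeric column with one coefficient, the predicted gap between Mumbai and Delhi is forced to equal the gap between Kolkata and Mumbai, whatever the data say.

```python
import numpy as np

# Made-up toy data: two observations per city (illustrative only).
city_code = np.array([0, 0, 1, 1, 2, 2], dtype=float)  # Delhi=0, Mumbai=1, Kolkata=2
population = np.array([1.0, 2.0, 1.5, 2.5, 1.2, 2.2])
pollution = np.array([50.0, 60.0, 20.0, 30.0, 90.0, 100.0])

# Design matrix: intercept, integer city code, population.
X = np.column_stack([np.ones_like(city_code), city_code, population])
coef, *_ = np.linalg.lstsq(X, pollution, rcond=None)
b0, b_city, b_pop = coef

def predict(code, pop):
    """Prediction of the fitted model for a given city code and population."""
    return b0 + b_city * code + b_pop * pop

# Compare cities at the same population: both gaps equal the single slope b_city.
gap_mumbai_delhi = predict(1, 2.0) - predict(0, 2.0)
gap_kolkata_mumbai = predict(2, 2.0) - predict(1, 2.0)
print(gap_mumbai_delhi, gap_kolkata_mumbai, b_city)
```

In this toy data Mumbai has the lowest pollution and Kolkata the highest, so no single slope on the code 0, 1, 2 can describe the city effect.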

What is the mathematics behind regression with categorical variables? Do we need to create new features like is_Delhi, is_Mumbai and is_Kolkata, with a 0 or 1 value for each training example?
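
For reference, this is roughly what I mean by the indicator features, as a sketch under my own assumptions: the data are made up, and I drop the is_Delhi column because, with an intercept in the model, the three indicators always sum to 1 and would be perfectly collinear with it. Delhi then acts as the baseline, and the coefficients on `is_mumbai` and `is_kolkata` are offsets relative to Delhi.

```python
import numpy as np

# Made-up toy data: two observations per city (illustrative only).
cities = np.array(["Delhi", "Delhi", "Mumbai", "Mumbai", "Kolkata", "Kolkata"])
population = np.array([1.0, 2.0, 1.5, 2.5, 1.2, 2.2])
pollution = np.array([50.0, 60.0, 20.0, 30.0, 90.0, 100.0])

# 0/1 indicator columns; Delhi is the omitted baseline category.
is_mumbai = (cities == "Mumbai").astype(float)
is_kolkata = (cities == "Kolkata").astype(float)

# Design matrix: intercept, two indicators, population.
X = np.column_stack([np.ones(len(cities)), is_mumbai, is_kolkata, population])
coef, *_ = np.linalg.lstsq(X, pollution, rcond=None)
b0, b_mumbai, b_kolkata, b_pop = coef

# Each city gets its own intercept: b0 for Delhi,
# b0 + b_mumbai for Mumbai, b0 + b_kolkata for Kolkata.
print(b0, b_mumbai, b_kolkata, b_pop)
```

With this coding each city gets its own offset, so the fit is free to place Mumbai below Delhi and Kolkata above it, with no ordering implied by the labels.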