Simpson's Paradox in Regression - Solution?


#1

Hi Everyone,

I am working on a linear regression model where the output variable is the 'Salary' of individuals, based on two input variables: 1) Department and 2) Job_Level.

When I fit a simple linear regression model to predict "Salary" using "Department", it gives coefficients that make sense with the data. However, when I also add "Job_Level" to the model, it produces coefficients that look incorrect. Since I cannot share the data set here, I have used the Big Mart data (the train data after removing all rows with missing values) for simulation purposes.

Below is the R code of the model which I built:

#First model Simple Linear:

model1 <- lm(Item_Outlet_Sales ~ Outlet_Size, data = Big_mart) ## Predicting the Sales based on Outlet size only

coef(model1)

(Intercept)          2298.99526
Outlet_SizeMedium    -126.87866
Outlet_SizeSmall       59.34781

I would interpret the coefficients as follows: if Outlet_Size is Medium, average sales are about 126 lower than for the reference category, Outlet_Size High; if Outlet_Size is Small, average sales are about 59 higher than for Outlet_Size High. This makes sense, because the mean sales by Outlet_Size match the coefficients (highest for Small, lowest for Medium).

Next I added one more variable, Outlet_Location_Type, and refit the model:
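A minimal sketch of this interpretation, using made-up numbers rather than the Big Mart data (the `toy` data frame and its values are hypothetical): with a single categorical predictor, the intercept should equal the mean of the reference level, and each other coefficient should equal that level's mean minus the reference mean.

```r
# Hypothetical toy data, NOT the Big Mart data
toy <- data.frame(
  size  = factor(c("High", "High", "Medium", "Medium", "Small", "Small")),
  sales = c(2200, 2400, 2100, 2250, 2300, 2420)
)

# One-factor model; "High" is the reference level (first alphabetically)
fit <- lm(sales ~ size, data = toy)

# Plain group means as a named numeric vector
group_means <- sapply(split(toy$sales, toy$size), mean)

# Coefficients from the model
print(coef(fit))

# Differences of each group mean from the reference mean;
# the Medium and Small entries should match the two size coefficients
print(group_means - group_means[["High"]])
```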

model2 <- lm(Item_Outlet_Sales ~ Outlet_Size + Outlet_Location_Type, data = Big_mart)

coef(model2)

(Intercept)                   2651.8512
Outlet_SizeMedium             -303.4965
Outlet_SizeSmall              -374.0069
Outlet_Location_TypeTier 2     160.9976
Outlet_Location_TypeTier 3    -352.8559

Now the problem: in the simple model, where only Outlet_Size was used to predict sales, the coefficient for Outlet_SizeSmall was positive (59.35). After adding Outlet_Location_Type, it flipped to negative (-374.01), which does not seem to make sense when I manually compare the coefficients with the raw data.

The same thing happens when I include the Job_Level variable along with the Department variable to predict Salary.

After some research on Google, I learned that this phenomenon is known as Simpson's Paradox. So I know the cause, but my question is: how can I resolve it and fit a regression model whose coefficients have signs (+ or -) that match the data used to train the model? I also want to share the results with the business owners, so I will need to report the coefficients to them.

If you have a solution to this, please share your input.
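A small sketch of how such a flip can arise, again on made-up numbers (the `toy` data frame, its column names, and its values are hypothetical, not the Big Mart data). It assumes outlet size and location type are unevenly mixed, which is the usual setting for this kind of reversal: the one-variable coefficient compares overall group means, while the two-variable coefficient compares sizes within the same location type.

```r
# Hypothetical toy data: High outlets sit mostly in Tier 1 (low sales),
# Small outlets sit mostly in Tier 2 (high sales)
toy <- data.frame(
  size     = factor(c("High", "High", "High", "High",
                      "Small", "Small", "Small", "Small")),
  location = factor(c("Tier 1", "Tier 1", "Tier 1", "Tier 2",
                      "Tier 1", "Tier 2", "Tier 2", "Tier 2")),
  sales    = c(95, 100, 105, 200, 80, 175, 180, 185)
)

# How the two factors overlap: the counts are unbalanced
print(table(toy$size, toy$location))

# Mean sales per size/location cell: Small is lower than High in both tiers
cell_means <- tapply(toy$sales, list(toy$size, toy$location), mean)
print(cell_means)

m1 <- lm(sales ~ size, data = toy)             # size only
m2 <- lm(sales ~ size + location, data = toy)  # size adjusted for location

# Positive: Small has the higher overall mean
print(coef(m1)[["sizeSmall"]])
# Negative: within a given location, Small sells less than High
print(coef(m2)[["sizeSmall"]])
```

Running `table(Big_mart$Outlet_Size, Big_mart$Outlet_Location_Type)` on the real data would show whether a similar imbalance is present there.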

Thanks.


#2

Hi Guys - @PulkitS, @AishwaryaSingh, @NSS,

If you have any idea, please share.

Thanks,
Manoj


#3

Hi @manoj09990,

The coefficients of a linear regression model are affected by the scale of the variables. So try bringing your predictor variables onto the same scale; you can normalize the variables to do this.
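A minimal sketch of normalizing a predictor in R, on made-up numbers (the `toy` data frame and its `experience_months` and `salary` columns are hypothetical). It assumes a numeric predictor, since `scale()` works on numeric columns; factor predictors such as Department are dummy-coded by `lm()` instead. In this sketch, standardizing changes the size of the coefficient but leaves its sign and the fitted values unchanged.

```r
# Hypothetical toy data with one numeric predictor
toy <- data.frame(experience_months = c(6, 12, 24, 36, 60, 120))
toy$salary <- 30000 + 250 * toy$experience_months +
  c(-500, 300, 200, -400, 600, -200)

# Model on the raw scale
raw <- lm(salary ~ experience_months, data = toy)

# Standardize the predictor to mean 0 and standard deviation 1
toy$experience_z <- as.numeric(scale(toy$experience_months))
std <- lm(salary ~ experience_z, data = toy)

# Slopes differ in magnitude (per month vs. per standard deviation)
print(coef(raw))
print(coef(std))

# The fitted values are the same for both models
print(all.equal(fitted(raw), fitted(std)))
```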


#4

Thank you @PulkitS, I will try this.