I am working on a linear regression model where the output variable is the “Salary” of individuals and there are two input variables: 1) Department and 2) Job_Level.
When I fit a simple linear regression to predict “Salary” from “Department” alone, the coefficients make sense given the data. However, when I also add “Job_Level” to the model, the coefficients look incorrect to me. Since I cannot share that data set here, I have reproduced the issue with the Big Mart data (the train data, after removing all rows with missing values).
Below is the R code for the models I built:
# First model: simple linear regression, predicting sales from outlet size only
model1 <- lm(Item_Outlet_Sales ~ Outlet_Size, data = Big_mart)
I interpret the coefficients as follows: if Outlet_Size is Medium, average sales are 126 lower than for the reference category, Outlet_Size High; if Outlet_Size is Small, average sales are 59 higher than for Outlet_Size High. This makes sense, because the mean sales by Outlet_Size agree with the coefficients (highest for Small, lowest for Medium).
Next I added one more variable, Outlet_Location_Type, and refit the model:
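As a quick check of this interpretation, here is a minimal sketch on made-up numbers (not the Big Mart data; the `toy` data frame and its values are hypothetical). With a single categorical predictor, the intercept should equal the mean of the reference level and each coefficient should equal the difference between that level's mean and the reference mean.

```r
# Hypothetical toy data standing in for Big_mart (values are made up)
toy <- data.frame(
  Outlet_Size = factor(c("High", "High", "Medium", "Medium", "Small", "Small")),
  Item_Outlet_Sales = c(100, 120, 90, 100, 130, 150)
)

# Raw mean sales per outlet size, as a named numeric vector
group_means <- sapply(split(toy$Item_Outlet_Sales, toy$Outlet_Size), mean)

# Differences from the reference level ("High" comes first alphabetically)
diffs <- group_means[c("Medium", "Small")] - group_means[["High"]]

# Single-predictor model: coefficients are differences in group means
fit <- lm(Item_Outlet_Sales ~ Outlet_Size, data = toy)

print(group_means)
print(diffs)
print(coef(fit))
```

On the real data the same comparison would use `Big_mart` in place of `toy`.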
model2 <- lm(Item_Outlet_Sales ~ Outlet_Size + Outlet_Location_Type, data = Big_mart)
The problem is this: in the simple model, where only Outlet_Size was used to predict sales, the coefficient for Outlet_SizeSmall was positive. After adding Outlet_Location_Type, its sign flipped from positive to negative, which doesn’t make sense to me when I compare the coefficient with the raw group means.
The same thing happens when I include the Job_Level variable along with the Department variable to predict Salary.
After doing some research on Google, I learned that this phenomenon is known as Simpson’s Paradox. So I now know the cause of the problem, but my question is: how can I fit a regression model whose coefficients have signs (+ or -) that match the data used to train the model? I also want to share the results with the business owners, so I need to report the coefficients to them.
If you have a solution to this, please share your inputs.
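To make the sign flip concrete, here is a minimal sketch on made-up data (the `toy2` data frame, its two-level factors, and all values are hypothetical, not taken from Big Mart or the salary data). Small outlets are slightly lower than High outlets within each location type, but they sit mostly in the higher-selling location, so their overall mean is higher.

```r
# Hypothetical toy data: sales = 100, minus 10 if Small, plus 50 if Tier 2
toy2 <- data.frame(
  Outlet_Size = factor(c("High", "High", "High", "High",
                         "Small", "Small", "Small", "Small")),
  Outlet_Location_Type = factor(c("Tier 1", "Tier 1", "Tier 1", "Tier 2",
                                  "Tier 1", "Tier 2", "Tier 2", "Tier 2")),
  Item_Outlet_Sales = c(100, 100, 100, 150, 90, 140, 140, 140)
)

# The two predictors are unbalanced: Small is mostly Tier 2, High mostly Tier 1
print(table(toy2$Outlet_Size, toy2$Outlet_Location_Type))

# Model with size only: compares overall means (High 112.5 vs Small 127.5)
m1 <- lm(Item_Outlet_Sales ~ Outlet_Size, data = toy2)

# Model with both predictors: compares sizes within the same location type
m2 <- lm(Item_Outlet_Sales ~ Outlet_Size + Outlet_Location_Type, data = toy2)

print(coef(m1)["Outlet_SizeSmall"])  # about +15
print(coef(m2)["Outlet_SizeSmall"])  # about -10
```

In this toy case both signs agree with the data: the first matches the overall means, the second matches the within-location comparison.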