Hi Everyone,

I am working on a linear regression model where the output variable is the ‘Salary’ of individuals and there are two input variables: 1) Department and 2) Job_Level.

When I fit a simple linear regression model to predict “Salary” from “Department”, the coefficients make sense given the data. However, when I also add “Job_Level” to the model, the coefficients look incorrect. Since I cannot share that data set here, I have used the Big Mart data (*train data after removing all rows with missing values*) to reproduce the issue.

Below is the R code of the model which I built:

#First model Simple Linear:

model1 <- lm(Item_Outlet_Sales ~ Outlet_Size, data = Big_mart)  # Predict sales from outlet size only

coef(model1)

(Intercept): 2298.99526

Outlet_SizeMedium: -126.87866

Outlet_SizeSmall: 59.34781

So I would interpret the coefficients as follows: if Outlet_Size is Medium, average sales are about 126 lower than for the reference category (Outlet_Size High); similarly, if Outlet_Size is Small, average sales are about 59 higher than for Outlet_Size High. This makes sense, because the mean sales by Outlet_Size match the coefficients (*i.e. they are highest for Small and lowest for Medium*).

Now I added one more variable, Outlet_Location_Type, and refit the regression:
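To double-check that reading of a one-factor model, here is a minimal sketch on made-up numbers (the data frame `toy` and its values are invented for illustration and are not the Big Mart data). It assumes the default treatment contrasts in R, with "High" as the reference level, and compares the raw group means with the means rebuilt from the coefficients.

```r
# Toy data, invented for illustration only (not the Big Mart data)
toy <- data.frame(
  size  = factor(c("High", "High", "Medium", "Medium", "Small", "Small"),
                 levels = c("High", "Medium", "Small")),
  sales = c(2200, 2400, 2100, 2250, 2300, 2420)
)

fit <- lm(sales ~ size, data = toy)

# Group means computed directly from the raw data
group_means <- tapply(toy$sales, toy$size, mean)

# With treatment contrasts, the intercept is the mean of the reference
# level ("High") and each coefficient is that level's mean minus it
b <- coef(fit)
from_model <- c(
  High   = unname(b["(Intercept)"]),
  Medium = unname(b["(Intercept)"] + b["sizeMedium"]),
  Small  = unname(b["(Intercept)"] + b["sizeSmall"])
)

# The two rows should agree
print(rbind(group_means, from_model))
```

With only one categorical predictor the two rows agree, which is why the signs in the first model line up with the raw means.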

model2 <- lm(Item_Outlet_Sales ~ Outlet_Size + Outlet_Location_Type, data = Big_mart)

coef(model2)

(Intercept): 2651.8512

Outlet_SizeMedium: -303.4965

Outlet_SizeSmall: -374.0069

Outlet_Location_TypeTier 2: 160.9976

Outlet_Location_TypeTier 3: -352.8559

Now the problem is this: in the simple model, where only Outlet_Size was used to predict sales, the coefficient for Outlet_SizeSmall was positive. After adding Outlet_Location_Type, it flipped sign from positive to negative, which doesn’t make sense when I manually compare the coefficients with the raw data.

The same thing happens when I include the Job_Level variable along with the Department variable to predict Salary.

After doing some research on Google, I learned that this phenomenon is known as **Simpson’s Paradox**. So now I know the cause of the problem, but my question is: how can I resolve it and fit a regression model whose coefficients have signs (+ or -) that match the data used to train the model? I also want to share the results with the business owners, so I would need to report the coefficients to them.

If you have any solution to this, please share your valuable inputs.
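Since I can't post the real data, here is a minimal sketch that reproduces the same kind of sign flip on made-up numbers (the data frame `toy`, its two factors and all values are invented for illustration). The only assumption built into it is that the sizes are unevenly spread across the tiers.

```r
# Toy data, invented for illustration only: most "Small" outlets sit in
# the higher-selling "Tier 2", most "High" outlets in "Tier 1"
toy <- data.frame(
  size  = factor(rep(c("High", "Small"), each = 5),
                 levels = c("High", "Small")),
  tier  = factor(c("Tier 1", "Tier 1", "Tier 1", "Tier 1", "Tier 2",
                   "Tier 1", "Tier 2", "Tier 2", "Tier 2", "Tier 2")),
  sales = c(1950, 2050, 1950, 2050, 2500,
            1900, 2350, 2450, 2350, 2450)
)

m1 <- lm(sales ~ size, data = toy)         # size only
m2 <- lm(sales ~ size + tier, data = toy)  # size, adjusted for tier

# Coefficient for Small vs. High in each model
small_m1 <- unname(coef(m1)["sizeSmall"])  # positive: raw difference in means
small_m2 <- unname(coef(m2)["sizeSmall"])  # negative: difference within a tier

print(c(size_only = small_m1, size_plus_tier = small_m2))

# How the sizes are distributed across tiers
print(table(toy$size, toy$tier))
```

In this toy example the first coefficient compares all Small outlets with all High outlets, while the second compares them within the same tier, so the two numbers answer different questions.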

Thanks.