Why can't backward selection be used when n < p?



I am currently studying the forward stepwise selection and backward stepwise selection methods for selecting features in a model.

Forward stepwise selection - begins with a model containing no predictors and then adds predictors to the model, one at a time, until all of the predictors are in the model.

Backward stepwise selection - begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

I want to know why we cannot use the backward stepwise selection method when the number of data points is less than the number of features per data point.


Here is what I think the reason is.
Say X1, X2, … Xn are the features in the problem and S1, S2, … Sm are the data points available, where n > m.

To solve a system of linear equations with 'n' unknown variables, we need at least 'n' equations. With fewer than 'n' equations, the system is underdetermined and we cannot pin down a unique solution.

A similar argument applies here: only when n = m or n < m (i.e., at least as many data points as features) will we be able to build a unique least squares linear model.

Note: this does not apply to gradient descent methods.
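The "too few equations" argument above can be seen directly in R. A minimal sketch with simulated data (the names `m`, `n`, `X`, `y` are made up for illustration): with m = 3 data points and n = 5 features, the design matrix cannot have full column rank, so lm() marks the extra coefficients as NA.

```r
set.seed(1)
m <- 3; n <- 5                       # m data points, n features (n > m)
X <- matrix(rnorm(m * n), nrow = m)  # 3 x 5 design matrix
y <- rnorm(m)

qr(X)$rank            # rank is at most m = 3, never n = 5
fit <- lm(y ~ X - 1)  # fit without intercept
coef(fit)             # coefficients beyond the rank come back NA
```

In other words, the data only pin down as many coefficients as there are rows; the rest are unidentifiable, which is exactly why a full model (the starting point of backward selection) cannot be fit.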


@SRK, Nice explanation/reasoning

This reasoning is not limited to backward selection; it applies to most standard models whenever the number of data points is less than the number of features, perhaps with a few exceptions as SRK pointed out.

Here is a small code segment (in R) to see the results.

library(MASS)   # Boston housing data: 506 rows, medv plus 13 predictors
str(Boston)     # check the number of rows and columns

# Linear model on only 10 rows: fewer rows than the 14 coefficients
# (intercept + 13 predictors), so some coefficients come back NA.
summary(lm1 <- lm(medv ~ ., data = Boston[1:10, ]))

# Linear model on only 13 rows: still fewer rows than coefficients,
# so the same problem appears.
summary(lm1 <- lm(medv ~ ., data = Boston[1:13, ]))

## You can repeat this with different numbers of rows and compare the coefficients.

# Linear model on 150 rows: now n is comfortably larger than p,
# so we get a sensible fit.
summary(lm1 <- lm(medv ~ ., data = Boston[1:150, ]))

Does it provide any insight?


Hello @harry

Let’s understand what happens if we build a regression model with n (number of observations) <= p (number of predictors).

This happens:

Using linear regression, I get SSE (sum of squared errors) = 0. Zero SSE means the predicted values turned out to be exactly the same as the observed values. So, what went wrong?

When the number of observations is less than or equal to the number of predictors, the magnitude of both the reducible and the irreducible error becomes zero. A perfect fit to the data is practically impossible: whenever we make a prediction there remains an irreducible error, which reminds us that predicted values can never be exactly the same as observed values.

A perfect fit to the training data is a sign of overfitting: such a model will perform extremely poorly on an independent test data set.
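This "SSE = 0" behaviour is easy to reproduce. A small sketch with simulated data (all names here are made up): the response is pure noise with no relation to the predictors, yet with 10 rows and 10 fitted coefficients (intercept + 9 predictors) least squares interpolates the data exactly.

```r
set.seed(42)
n <- 10; p <- 9                             # 10 rows, 9 predictors + intercept
d <- data.frame(matrix(rnorm(n * p), nrow = n))
d$y <- rnorm(n)                             # response is independent noise

fit <- lm(y ~ ., data = d)
sum(residuals(fit)^2)    # SSE is (numerically) zero
summary(fit)$r.squared   # R^2 = 1: a perfect but meaningless fit
```

R^2 = 1 here tells us nothing about predictive power; a fresh test set drawn from the same noise would expose the model immediately.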


hello @harry,

Specifically for multiple linear regression, look at the formula for adjusted R^2:

Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)

When n = k + 1 the denominator n - k - 1 is zero, so the statistic is undefined, and when n <= k it turns negative and the statistic becomes meaningless. This is another reason why n must not be <= the number of predictors.
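As a quick numeric check of that denominator behaviour (`adj_r2` is a hypothetical helper implementing the standard formula 1 - (1 - R^2)(n - 1)/(n - k - 1), with made-up R^2 values):

```r
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_r2(0.9, n = 100, k = 10)  # sensible: close to R^2
adj_r2(0.9, n = 12,  k = 10)  # denominator is 1: heavy penalty, goes negative
adj_r2(0.9, n = 11,  k = 10)  # denominator is 0: division by zero gives -Inf
```

So even slightly before n reaches k, the adjustment already breaks down.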

Below is a piece of code that fits the same model for four different combinations of n and k:

fitmodel <- function(n, k) {
  # n: sample size
  # k: number of predictors
  # returns a linear model fit for the given sample size and k predictors
  x <- data.frame(matrix(rnorm(n * k), nrow = n))
  names(x) <- paste("x", seq(k), sep = "")
  x$y <- rnorm(n)
  lm(y ~ ., data = x)
}

summary(fitmodel(n = 9,  k = 10))
summary(fitmodel(n = 10, k = 10))
summary(fitmodel(n = 11, k = 10))
# Only this will give a valid model.
summary(fitmodel(n = 12, k = 10))

If you run this, you will see that a valid model is generated only when n = 12 and k = 10. For the other three combinations, the output either reports NA coefficients (n = 9 and n = 10) or has zero residual degrees of freedom, so no standard errors, t-statistics, or p-values can be computed (n = 11).

So, as you can see, you cannot fit a meaningful multiple linear regression (with or without backward/forward selection) when n <= p.
Hope this helps!