What are the techniques to improve the performance of a linear/logistic model?


Let’s say I create a basic regression model and get an R-square of 60%. What steps could I try to improve its performance and raise the R-square to, say, 80%?

Similarly for logistic regression.


It is very important to understand what the output of a linear regression means in order to choose which variables to include in the model. Common methods for making this choice are forward selection, backward elimination and stepwise regression, as explained in:
this document - click here

Alternatively, if you would rather jump straight to the implementation and learn the techniques that way, you can see:
this example - click here
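
To make forward selection a little more concrete, here is a minimal sketch using only NumPy. It is not taken from the linked material: the function names, the `min_gain` threshold, the use of adjusted R-square as the selection criterion and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-square of an ordinary least squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(X, y):
    """R-square penalised for the number of predictors p."""
    n, p = X.shape
    return 1.0 - (1.0 - r_squared(X, y)) * (n - 1) / (n - p - 1)

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the column that raises adjusted R-square the most.

    Stops when the best remaining column improves the score by less
    than min_gain (an arbitrary, illustrative threshold).
    """
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: adjusted_r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Synthetic data (assumption): only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected, score = forward_selection(X, y)
print("selected columns:", selected, "adjusted R-square:", round(score, 3))
```

Backward elimination works the other way round: start with all columns and repeatedly drop the one whose removal hurts the score least.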



The first thing to understand is that R-square is a relative metric: while an R-square of 70% is better than an R-square of 60% on the same data, there is no universal benchmark for a good or bad R-square.
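
For reference, the usual definition of R-square shows what it is relative to: it compares the model's squared error with that of a baseline that always predicts the mean. The notation below (observed values y_i, predictions \hat{y}_i, mean \bar{y}) is standard textbook notation, not something from the original post.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

So an R-square of 60% means the model removes 60% of the squared error of the mean-only baseline on that dataset. Whether that is good depends on how noisy the problem is.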

As for making improvements, the best advice is to develop a deep understanding of the data and the business problem at hand. Specifically:

  1. Look for variables with missing values. Is there any way you can impute them? Using the overall average or a segment-wise average can be a good strategy.

  2. Remove outliers. Regression is known to be sensitive to outliers, and removing them from the dataset can improve R-square.

  3. Look for new features / derived variables. For example, if you think taking ratios can help, take them. If you can build more hypotheses by talking to the business, do that.

  4. Transforming variables so that their relationship with the target is closer to linear can also help. Log transforms in particular are quite popular.
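
The four steps above could be sketched roughly as follows with pandas and NumPy. Everything specific here is an assumption for illustration: the column names (`segment`, `income`, `spend`), the synthetic data, and the 1.5 * IQR rule used to flag outliers (one common convention, not the only one).

```python
import numpy as np
import pandas as pd

# Hypothetical data: spend depends on log(income), not on income itself.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "segment": rng.choice(["A", "B"], size=n),
    "income": rng.lognormal(mean=10.0, sigma=0.8, size=n),
})
df["spend"] = 200.0 * np.log(df["income"]) + rng.normal(scale=40.0, size=n)

# Knock out 10% of the income values to simulate missing data.
df.loc[df.sample(frac=0.1, random_state=0).index, "income"] = np.nan

# 1. Impute missing values with the segment-wise average.
segment_mean = df.groupby("segment")["income"].transform("mean")
df["income"] = df["income"].fillna(segment_mean)

# 2. Drop rows whose spend falls outside 1.5 * IQR of the quartiles.
q1, q3 = df["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# 3. Derived variable: a simple ratio feature.
clean["spend_to_income"] = clean["spend"] / clean["income"]

# 4. Log-transform the skewed predictor.
clean["log_income"] = np.log(clean["income"])

def r_squared(x, y):
    """R-square of a one-variable least squares fit (squared correlation)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

r2_raw = r_squared(clean["income"], clean["spend"])
r2_log = r_squared(clean["log_income"], clean["spend"])
print(f"R-square with income: {r2_raw:.3f}, with log(income): {r2_log:.3f}")
```

Because the target is the same in both fits, the two R-square values are directly comparable. Here the log transform helps only because the synthetic data was built that way; on real data it is worth plotting the variable against the target first.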

Hope these help.



Thanks @kunal and @Harshita_Dudhe for your inputs