Regression analysis

regression

#1

Hi everyone!

I have a dataset of 10k rows and 350 columns. 97% of the independent variables are standard normal (mean 0 and standard deviation 1). The remaining 3% are uniformly distributed, unordered categorical variables. The dependent variable follows a symmetric Gaussian distribution with mean 1 and standard deviation 5.

The goal is to build a predictive model.

My approach is to use regression instead of classification.

My question: could anybody give me some tips on how to proceed? I don’t want to rush into gradient boosting algorithms before a proper analysis. I am looking for a sound statistical approach (with appropriate statistical tests).

  • I have done the data exploration and computed correlations. The most strongly correlated variables have a correlation of 0.5, so it is ambiguous which variables to drop.
  • Of all the variables, I found that three have a standard deviation of 0.7 instead of 1 like the rest. Interestingly, these three are among the five variables most strongly correlated with the target variable.
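The two checks in the bullets above can be sketched roughly as follows. This is only an illustration on synthetic data: the array sizes, the 0.15 tolerance, and the way the target is generated are all assumptions, not properties of the real dataset.

```python
import numpy as np

# Synthetic stand-in for the real data (assumption): 20 standard normal
# columns, three of them rescaled to a standard deviation of 0.7.
rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 20
X = rng.standard_normal((n_rows, n_cols))
X[:, :3] *= 0.7
y = 1.0 + 3.0 * X[:, :3].sum(axis=1) + rng.normal(0.0, 3.0, n_rows)

# Check 1: which columns have a spread that differs from 1?
stds = X.std(axis=0, ddof=1)
odd_std = np.flatnonzero(np.abs(stds - 1.0) > 0.15)  # tolerance is arbitrary

# Check 2: Pearson correlation of every column with the target,
# ranked by absolute value.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)
order = np.argsort(-np.abs(corr))

print("columns with unusual std:", odd_std)
print("top 5 by |correlation|:", order[:5])
```

On the real data, `X` and `y` would simply be replaced by the numeric columns and the target.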

#2

Hello @kthouz,

I am no expert in data science, but in my view the first step for regression is to determine whether the relationship between the predictors and the target is roughly linear. This helps you choose between a parametric algorithm such as linear regression and a nonparametric one such as the KNN regressor. You can check this by fitting a linear model and examining the diagnostic plots to see whether the assumptions of linear regression hold.
You can find help on this at the following link: https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/
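As a rough illustration of that check, the sketch below fits a straight line by least squares and looks at the residuals. The data are synthetic, and the helper `linear_fit_diagnostics` is a hypothetical name introduced here. It replaces the residual plot with a crude numeric stand-in: the correlation between the residuals and the squared predictor, which should be near zero when a linear fit is adequate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.standard_normal(n)
y_linear = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)       # truly linear
y_curved = 1.0 + 2.0 * x ** 2 + rng.normal(0.0, 1.0, n)  # quadratic

def linear_fit_diagnostics(x, y):
    """Fit y = a + b*x by least squares; return (R^2, curvature score)."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    # Residuals that track x^2 suggest the linear form is missing curvature.
    curvature = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]
    return r2, curvature

r2_lin, curv_lin = linear_fit_diagnostics(x, y_linear)
r2_cur, curv_cur = linear_fit_diagnostics(x, y_curved)
print(f"linear data: R^2={r2_lin:.2f}, curvature score={curv_lin:.2f}")
print(f"curved data: R^2={r2_cur:.2f}, curvature score={curv_cur:.2f}")
```

In practice a residuals-versus-fitted plot shows the same thing visually; the numeric score is only a convenience for this example.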

If the relationship does not look linear, you can try transforming the variables and checking again. If, even after considerable effort, you see no hint of linearity, a linear model is probably too simple for the data and a more flexible method is needed. You can use the KNN regressor. When you do, make sure the value of K is chosen with the bias-variance trade-off in mind; it can be tuned using cross-validation.
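One way to tune K by cross-validation is sketched below with scikit-learn. The data, the candidate grid `k_grid`, and the choice of 5 folds are assumptions made for the example. Small K gives low bias but high variance, large K the reverse, and the cross-validated error is used to pick a compromise.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic nonlinear data (assumption), kept small so the search is quick.
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.3, 300)

k_grid = [1, 3, 5, 10, 20, 40]  # candidate values of K (arbitrary choice)
model = Pipeline([
    ("scale", StandardScaler()),   # KNN is distance based, so scale first
    ("knn", KNeighborsRegressor()),
])
search = GridSearchCV(
    model,
    {"knn__n_neighbors": k_grid},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
cv_mse = -search.cv_results_["mean_test_score"]  # one value per K in k_grid
print("cross-validated MSE per K:", dict(zip(k_grid, cv_mse.round(3))))
print("selected K:", best_k)
```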

If you are satisfied with the linear model, a reasonable next step is forward selection for choosing variables. Since the number of variables is quite large, I would use the Boruta package to determine the most important variables and then fit a linear regression on those.
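Forward selection can be sketched with scikit-learn's `SequentialFeatureSelector`. Note that this is not the Boruta package mentioned above, which takes a different, random-forest-based approach. The data and the decision to keep exactly two features are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only columns 0 and 3 influence the target.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
y = 1.0 + 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(0.0, 1.0, 200)

# Start from no features and greedily add the one that most improves
# the cross-validated score, until two features are selected.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
selected = np.flatnonzero(selector.get_support()).tolist()

# Refit the linear model on the selected columns only.
final_model = LinearRegression().fit(X[:, selected], y)
print("selected columns:", selected)
print("coefficients:", final_model.coef_.round(2))
```

With 350 columns the greedy search fits many models per step, so the number of features to keep and the number of folds would need to be chosen with run time in mind.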

You can then move on to bagging and boosting algorithms.

Regards,
Shashwat


#3

Thanks a lot for your input, Shashwat. I have checked and there is no linear relationship between the dependent and independent variables. I will definitely proceed with the KNN regressor as you suggest.


#4

Just a reminder: KNN is a bad idea for this dataset because of the curse of dimensionality. To apply it with any degree of success, some form of dimensionality reduction is required. PCA and t-SNE are my weapons of choice.
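A minimal sketch of combining PCA with KNN in one scikit-learn pipeline is shown below, compared against KNN on all columns. The synthetic data are built with an assumed low-dimensional latent structure (5 factors behind 50 columns). If the real predictors are independent, PCA may not help much. t-SNE is left out because scikit-learn's `TSNE` has no `transform` method for unseen rows, which makes it awkward inside a predictive pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 50 noisy columns driven by 5 latent factors.
rng = np.random.default_rng(4)
n_rows, n_latent, n_cols = 400, 5, 50
Z = rng.standard_normal((n_rows, n_latent))
W = rng.standard_normal((n_latent, n_cols))
X = Z @ W + rng.standard_normal((n_rows, n_cols))
y = np.sin(Z[:, 0]) + Z[:, 1] * Z[:, 2] + rng.normal(0.0, 0.3, n_rows)

# KNN on all columns versus KNN on the leading principal components.
knn_raw = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
knn_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=n_latent),
    KNeighborsRegressor(n_neighbors=10),
)

mse_raw = -cross_val_score(knn_raw, X, y, cv=5, scoring="neg_mean_squared_error")
mse_pca = -cross_val_score(knn_pca, X, y, cv=5, scoring="neg_mean_squared_error")
print("KNN on all columns, mean CV MSE:", mse_raw.mean().round(3))
print("PCA + KNN, mean CV MSE:", mse_pca.mean().round(3))
```

Putting PCA inside the pipeline means it is refit on each training fold, so the cross-validated error is not contaminated by the held-out rows.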


#5

I found a handbook of statistical tricks and techniques that may be useful: http://www.biostathandbook.com/outline.html