I have a dataset of 10k rows and 350 columns. 97% of the independent variables follow a standard normal distribution (mean 0 and unit standard deviation). The remaining 3% are uniformly distributed, unordered categorical variables. The dependent variable follows a symmetric Gaussian distribution with mean 1 and standard deviation 5.
The goal is to build a predictive model.
My approach is to use regression instead of classification.
My question: Could anybody give me some tips on how to proceed? I don’t want to rush into gradient boosting algorithms before a proper analysis. I am looking for a sound statistical approach (appropriate statistical tests).
- I have done the data exploration and computed correlations. The most strongly correlated variables have a correlation of 0.5, so it is unclear which variables to drop.
- Of all the variables, I found that three have a standard deviation of 0.7 instead of 1 like the rest. Interestingly, these three are among the top 5 variables most strongly correlated with the target variable.
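
For reference, here is a minimal sketch of the kind of checks described in the two points above. It does not use my real data: it simulates a dataset of roughly the same shape, and the column names (`x0`, `c0`, ...), the number of category levels, the effect sizes, and the 0.1 tolerance are all assumptions made up for illustration. It flags numeric columns whose standard deviation departs from 1, screens numeric columns with a Pearson correlation test, screens categorical columns with a one-way ANOVA, and adjusts the p-values with a hand-written Benjamini-Hochberg correction because 350 tests are run at once.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n_rows, n_num, n_cat = 10_000, 340, 10  # ~97% numeric, ~3% categorical (assumed split)

# Synthetic stand-in for the real data (all values below are illustrative).
X = pd.DataFrame(rng.standard_normal((n_rows, n_num)),
                 columns=[f"x{i}" for i in range(n_num)])
X[["x0", "x1", "x2"]] *= 0.7  # three columns with std 0.7, as observed
cats = pd.DataFrame(rng.integers(0, 4, size=(n_rows, n_cat)),
                    columns=[f"c{i}" for i in range(n_cat)])

# Assumed target: mean 1, std about 5, driven by the three low-variance columns.
signal = X[["x0", "x1", "x2"]].sum(axis=1).to_numpy()
noise_sd = np.sqrt(5.0**2 - 3.0**2 * 3 * 0.7**2)
y = 1.0 + 3.0 * signal + rng.normal(0.0, noise_sd, n_rows)


def flag_unusual_std(frame, expected=1.0, tol=0.1):
    """Return columns whose sample std differs from `expected` by more than `tol`."""
    stds = frame.std(ddof=1)
    return stds[(stds - expected).abs() > tol]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# 1) Columns whose spread departs from the expected unit standard deviation.
flagged = flag_unusual_std(X)

# 2) Pearson correlation test of every numeric column against the target.
rows = {c: stats.pearsonr(X[c].to_numpy(), y) for c in X.columns}
screen = pd.DataFrame({"r": {c: v[0] for c, v in rows.items()},
                       "p": {c: v[1] for c, v in rows.items()}})
screen["p_adj"] = benjamini_hochberg(screen["p"])
screen = screen.reindex(screen["r"].abs().sort_values(ascending=False).index)

# 3) One-way ANOVA of the target across the levels of each categorical column.
anova_p = pd.Series({
    c: stats.f_oneway(*[y[cats[c].to_numpy() == k] for k in range(4)]).pvalue
    for c in cats.columns
})

print(flagged)
print(screen.head())
print(anova_p.round(3))
```

These are univariate screens only, so they say nothing about interactions or about redundancy between predictors; the adjusted p-values are meant as a first filter rather than a final variable selection.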