Machine Learning Algorithm Selection

Dear All,

I have an ML problem with 130 features and 3000 records. The number of records seems small relative to the number of features. Which ML algorithm would work well for this, and why?

This is a regression problem.

Shankar R

Apply PCA, then give Random Forest a try. @shankarthebest
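A rough sketch of what that suggestion could look like with scikit-learn. The data here is synthetic (generated to match the 130 features and 3000 records in the question), and the 95% variance cutoff and the number of trees are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 3000 records, 130 features.
X, y = make_regression(n_samples=3000, n_features=130, n_informative=20,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Scale first, because PCA is sensitive to feature scale.
# n_components=0.95 keeps enough components to explain 95% of the variance.
pca_rf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=-1),
)
pca_rf.fit(X_train, y_train)

n_kept = pca_rf.named_steps["pca"].n_components_
r2 = pca_rf.score(X_test, y_test)
print("components kept:", n_kept)
print("held-out R^2:", r2)
```

It is worth comparing the held-out score against a Random Forest without PCA on the same split, since PCA does not always help tree models.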

Apply Support Vector Regression, which can handle a large number of features.
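A minimal sketch of Support Vector Regression with scikit-learn, again on synthetic data standing in for the real set. The RBF kernel and the small grid of C values are assumptions; SVR usually needs scaled inputs and some tuning of C and epsilon before it performs well.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real data: 3000 records, 130 features.
X, y = make_regression(n_samples=3000, n_features=130, n_informative=20,
                       noise=10.0, random_state=0)

# SVR is distance based, so the features are standardized inside the pipeline.
svr_pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Small illustrative grid over the regularization strength C.
search = GridSearchCV(svr_pipe, {"svr__C": [1, 10, 100]}, cv=3, scoring="r2")
search.fit(X, y)

print("best C:", search.best_params_["svr__C"])
print("best mean CV R^2:", search.best_score_)
```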

First check whether the predictor variables are strongly correlated with each other (linear regression works best when there is no severe multicollinearity among the predictors).
Take your numerical variables, check for correlated features, and remove those that are highly correlated. There are also many statistical techniques for choosing the right variables.
For categorical variables: check the distribution of each category within the variable as a whole and against the response variable. If any category looks highly skewed, consider removing it.
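The correlation check for numerical variables could be sketched like this with pandas. The helper name `drop_highly_correlated`, the 0.9 threshold, and the toy columns are all made up for illustration; the right threshold depends on the data.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: "a_copy" is nearly identical to "a", while "b" is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=500),
    "b": rng.normal(size=500),
})

reduced, dropped = drop_highly_correlated(demo)
print("dropped:", dropped)
print("remaining:", list(reduced.columns))
```

Of each correlated pair, this keeps the column that appears first, which is an arbitrary choice; domain knowledge may suggest keeping the other one.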


A good method for a dataset with many features and little data is Support Vector Regression.

However, first check for correlations between features and remove those that are highly correlated. After that, apply PCA.

I believe the best approach here would be to first drop the non-significant variables from the dataset.

You can try:
a correlation check
different feature selection techniques (p-value based, feature importance, etc.)
Also, out of the 130 variables, how many are categorical and how many are continuous?

If you are not able to drop a significant number of variables, the next step depends on the main purpose of your model.
If you need interpretability, the model becomes harder to explain after using dimension reduction techniques (PCA etc.).

I have encountered this scenario while working on clinical data where interpretation was important, so I tried regularization with a linear model.
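As one possible illustration of regularization with a linear model, here is a sketch using Lasso (L1) regression from scikit-learn. The choice of Lasso is an assumption, since the reply does not say which penalty was used, and the data is synthetic. Lasso shrinks some coefficients exactly to zero, so the fitted model stays interpretable in terms of the original features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 3000 records, 130 features,
# of which only 20 actually influence the target.
X, y = make_regression(n_samples=3000, n_features=130, n_informative=20,
                       noise=10.0, random_state=0)

# Standardize so the penalty treats all features on the same scale;
# LassoCV picks the penalty strength alpha by cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
kept = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print("chosen alpha:", lasso.alpha_)
print("features kept:", kept.size, "of", X.shape[1])
```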

  1. Check the correlation between numeric variables.
  2. In case of high correlation, check whether some feature engineering helps, such as summing two or more variables to create a new variable.
  3. Use dummy variable coding for categorical variables.
  4. See if you can create interaction variables from the dummy-coded categorical variables and the numeric variables.
  5. Do a feature importance ranking.
  6. Then check again for any other feature engineering that can help to understand the pattern and explain the insight better.
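Steps 3 to 5 above could be sketched as follows. The column names (`dose`, `age`, `site`), the toy target, and the use of Random Forest importances are all hypothetical choices for illustration; other importance measures, such as permutation importance, would work as well.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data with two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "dose": rng.normal(size=n),
    "age": rng.normal(size=n),
    "site": rng.choice(["north", "south", "east"], size=n),
})
# Toy target: driven mostly by "dose", with a shift for the "south" site.
y = 3.0 * df["dose"] + 2.0 * (df["site"] == "south") + rng.normal(scale=0.1, size=n)

# Step 3: dummy coding; drop_first avoids a redundant reference column.
X = pd.get_dummies(df, columns=["site"], drop_first=True, dtype=float)

# Step 4: one interaction between a numeric variable and a dummy variable.
X["dose_x_site_south"] = X["dose"] * X["site_south"]

# Step 5: rank features by impurity-based importance.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = pd.Series(forest.feature_importances_, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Impurity-based importances can be split between correlated columns, so the ranking is a starting point for step 6, not a final answer.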
© Copyright 2013-2019 Analytics Vidhya