Boruta Feature Selection Package - Implementation

virtual_machine
python

#1

Hi,
I am trying to implement the Boruta feature selection technique in Python to select only the important features.

I need your help understanding the points below.

  1. Does Boruta handle NA values, or do we need to replace or remove them before feeding the data into Boruta?
  2. Do we need to convert categorical variables to numeric before feeding the data into Boruta?
  3. How do we implement Boruta for a regression problem? Do the steps remain the same?

Thank you.


#2

Hi @kks2105,

Boruta is an all-relevant feature selection method. This means it tries to find all features carrying information usable for prediction, rather than a possibly compact subset of features on which some classifier has minimal error.

  1. Does Boruta handle NA values, or do we need to replace or remove them before feeding the data into Boruta?

Boruta does not handle missing values, so you have to impute or remove them before running Boruta.
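
As a rough illustration of the two options, here is a minimal sketch using pandas and scikit-learn. The column names and values are made up for the example, and median imputation is only one possible strategy, not a recommendation specific to Boruta.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 48.0],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped.shape)
print(imputed)
```

Dropping rows is simpler but discards data; imputing keeps every row at the cost of introducing estimated values.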

  2. Do we need to convert categorical variables to numeric before feeding the data into Boruta?

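In R, Boruta works with factor (categorical) variables directly. The Python port (BorutaPy) is built on scikit-learn estimators, which generally expect numeric arrays, so categorical columns usually need to be encoded first. If this does not work, please let me know.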
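
If the Python implementation you use expects a numeric array (as scikit-learn based estimators generally do), one common way to prepare categorical columns is one-hot encoding. The sketch below uses `pandas.get_dummies`; the `city` and `rooms` columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "rooms": [2, 3, 1, 4],
})

# One-hot encode the categorical column; numeric columns are left unchanged.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# A plain numeric array that can be passed to an estimator.
X = encoded.to_numpy()

print(list(encoded.columns))
print(X)
```

Note that one-hot encoding turns each category into its own feature, so the selection step will then judge each category level separately rather than the original column as a whole.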

  3. How do we implement Boruta for a regression problem? Do the steps remain the same?

The implementation of Boruta is much the same for regression problems; the main change is that the underlying model is a regressor rather than a classifier. The algorithm can be used on any classification or regression problem to arrive at a subset of meaningful features.
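
To show why the steps carry over, here is a hand-rolled, simplified sketch of the idea behind Boruta: compare each real feature's importance against shuffled "shadow" copies. This is not the Boruta package itself (the full procedure also applies a statistical test over the iterations), and the helper name `shadow_hit_counts` and the synthetic data are assumptions made for illustration. Only the estimator would change between classification and regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def shadow_hit_counts(estimator, X, y, n_rounds=10, seed=0):
    """Count how often each real feature beats the best shuffled shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # Shadow features: each column permuted on its own, breaking its link to y.
        shadows = np.column_stack([rng.permutation(col) for col in X.T])
        estimator.fit(np.hstack([X, shadows]), y)
        importances = estimator.feature_importances_
        real, shadow = importances[:n_features], importances[n_features:]
        # A "hit" means the real feature was more important than every shadow.
        hits += real > shadow.max()
    return hits


# Synthetic regression data: only the first two of six features drive y.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
hits = shadow_hit_counts(forest, X, y)
print(hits)
```

For a classification problem, the same function could be called with a classifier that exposes `feature_importances_`, such as `RandomForestClassifier`, and a categorical target.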