How do I make sure my data is ready before choosing an algorithm for modelling and analysis?

r
machine_learning
data_wrangling
data_science

#1

I'm looking for an intuitive answer rather than one that lists the technical activities to be done (outlier treatment, missing value treatment, …). Also, if I know at the outset that a certain algorithm can be tried, how do I make sure the data is ready for that particular algorithm? I would then make some transformations and switch to a different algorithm.

Is it too subjective?


#2

Hi @akshay.kotha,

It depends on the type of algorithm you want to apply and the type of problem you are solving. Some algorithms only take numerical inputs, so in that case you have to convert any string values to numerical values (you can use dummy variables, one-hot encoding or label encoding to do so).
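To make the two encodings concrete, here is a minimal sketch in plain Python. The `colors` column and the helper names `label_encode` and `one_hot_encode` are made up for illustration; in practice you would probably use whatever encoder your modelling library provides.

```python
# Hypothetical toy column of string categories.
colors = ["red", "green", "blue", "green"]


def label_encode(values):
    """Map each distinct category to an integer, using sorted order."""
    categories = sorted(set(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Build one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


labels, mapping = label_encode(colors)
rows, categories = one_hot_encode(colors)

print(mapping)     # category -> integer code
print(labels)      # the column as integer codes
print(categories)  # column order of the indicator matrix
for row in rows:
    print(row)     # exactly one 1 per row
```

Note that label encoding introduces an ordering (here blue < green < red) that the data does not really have, which can mislead models that treat the input as a magnitude. One-hot encoding avoids that at the cost of more columns.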

Preprocessing of the data varies from algorithm to algorithm.


#3

Hi @akshay.kotha

Yes, it is subjective and depends on the algorithm. For instance, if you wish to apply logistic regression to your data, you should try scaling or normalizing the variables and removing features that are highly correlated. With a tree-based model, you can choose not to scale the features.
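As a rough illustration of those two steps, the sketch below standardizes each feature and drops one feature from any highly correlated pair, using only the Python standard library. The toy `features` table, the helper names and the `THRESHOLD` of 0.9 are assumptions chosen for the example, not a recommended setting.

```python
import statistics

# Hypothetical toy features; height_cm and height_in carry the same information.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40.0, 22.0, 35.0, 28.0, 31.0],
}


def standardize(values):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Step 1: scale every feature.
scaled = {name: standardize(col) for name, col in features.items()}

# Step 2: for each highly correlated pair, keep the first feature and drop the second.
THRESHOLD = 0.9  # assumed cut-off, tune for your own data
names = list(features)
to_drop = set()
for i, first in enumerate(names):
    if first in to_drop:
        continue
    for second in names[i + 1:]:
        if second in to_drop:
            continue
        if abs(pearson(features[first], features[second])) > THRESHOLD:
            to_drop.add(second)

kept = [name for name in names if name not in to_drop]
print(kept)
```

Here `height_in` is dropped because it is almost perfectly correlated with `height_cm`, while `age` stays. For a tree-based model you could skip the scaling step, since splits depend only on the ordering of values within a feature.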


#4

Thanks @PulkitS and @AishwaryaSingh :slight_smile:. Is there a checklist for verifying the preprocessing of the data for the most popular algorithms? Or is this also subjective?

Akshay


#5

I would suggest you read about the algorithms in detail and check which preprocessing techniques work well for each algorithm. This would help you learn more.


#6

Try boosting and other ensemble methods, which often give strong results, though no single method is guaranteed to give the best solution on every dataset.