How to decide which algorithm to use for a given dataset?



Suppose I have a classification problem. How do I know which algorithm to use, e.g. CatBoost, LightGBM, XGBoost, random forest, SVM, etc.? Say my dataset is 50% categorical features and 50% continuous features. I am not considering training speed. Is there any rule of thumb to follow?


Hi @Shrikantai,

Which algorithm works best depends entirely on your dataset. For instance, if it has a large number of categorical variables, you might want to go for CatBoost, which handles categorical features natively. When the dataset is very large, LightGBM is expected to perform well. You can choose XGBoost for an imbalanced dataset, since it lets you reweight the positive class (the `scale_pos_weight` parameter).

You can read about the algorithms and find out which one fits well with the distribution of your data.
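
Since the best choice depends on the data, one practical way to apply this advice is to score each candidate on held-out folds and compare. Below is a minimal sketch in plain Python, not a definitive recipe. It assumes every candidate exposes `fit` and `predict` methods (the convention used by scikit-learn-style classifiers). The helper names `k_fold_indices` and `cross_val_accuracy`, the two toy models, and the synthetic data are all hypothetical and exist only for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """Mean held-out accuracy of a fit/predict model over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = make_model()  # fresh, untrained model for every fold
        model.fit([X[i] for i in train], [y[i] for i in train])
        predictions = model.predict([X[i] for i in held_out])
        hits = sum(p == y[i] for p, i in zip(predictions, held_out))
        scores.append(hits / len(held_out))
    return mean(scores)


class MajorityClass:
    """Toy baseline (hypothetical): always predicts the most frequent label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OneNearestNeighbour:
    """Toy model (hypothetical): predicts the label of the closest training row."""

    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        def closest_label(row):
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, other)) for other in self.X_
            ]
            return self.y_[distances.index(min(distances))]

        return [closest_label(row) for row in X]


# Synthetic data (an assumption for the demo): two well-separated clusters.
rng = random.Random(42)
X = [[rng.gauss(c * 6, 1), rng.gauss(c * 6, 1)] for c in (0, 1) for _ in range(30)]
y = [c for c in (0, 1) for _ in range(30)]

# Map a display name to a zero-argument factory that builds the model.
candidates = {"majority": MajorityClass, "1-nn": OneNearestNeighbour}
results = {name: cross_val_accuracy(make, X, y) for name, make in candidates.items()}
best = max(results, key=results.get)
print(results, "best:", best)
```

In a real comparison the toy classes would be replaced by the actual candidates (for example the scikit-learn-style wrappers that CatBoost, LightGBM, and XGBoost provide), and accuracy would likely be swapped for a metric that suits the problem, such as F1 or AUC on an imbalanced dataset.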


Thank you very much @AishwaryaSingh