On what basis is an algorithm selected?



Hello all,

I have a basic doubt: which algorithm should we select for a given problem?

Suppose I am going to work on a Kaggle problem. On what basis should I select an algorithm?

I have participated in two Kaggle problems: Bike Sharing and Rossmann.

In Rossmann, most people used XGBoost and Random Forest.

So my doubt is: on what basis are they selecting those algorithms?

@Lesaffrea @kunal @shuvayan @Aarshay Please help me with this doubt :slightly_smiling:



This is a good place to start



Kunal has shared a very good infographic. I would just like to add my thoughts.

I generally choose a predictive model based on the complexity of the problem, the number of data points, the number of features available, the type of problem, etc. Kaggle problems are generally heavy on data and require modeling non-intuitive relations, so people generally go for RF or XGBoost.

But I believe one should also start from the ground up. I first implement a baseline model, then go for regression, a decision tree, SVM, then an RF and then XGBoost. You should always try to extract a good output from a given model before moving on. This helps in testing whether the more complex model actually gives better results. Trust me, I have seen situations where a random forest is only a very slight improvement over a logistic regression or a decision tree. Try the “Loan Prediction” problem at AV Datahacks. It's a good example of this.
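The ground-up approach described above can be sketched roughly as follows. This is only an illustration: it assumes scikit-learn is installed, uses a synthetic dataset from `make_classification` in place of real data, and the hyperparameters are arbitrary choices. SVM and XGBoost are left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with your own X and y.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Models ordered from simplest to most complex.
models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every model with the same 5-fold cross-validation so the
# numbers are directly comparable.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name:20s} {score:.3f}")
```

If the more complex model beats the simpler one by only a small margin, the simpler one may well be the better choice.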

Hope this makes sense. I’ll be happy to discuss further if needed. :slightly_smiling:



Hello @Rohit_Nair,

Just to add to what is already mentioned:
The requirement is also a very important part of selecting an algorithm. For example, if you need to draw inferences from the data, say you want to find out how each of the predictors (TV advertising, radio advertising, etc.) affects your response variable (say, sales), it will be more useful to use a parametric technique like linear regression, which is easy to interpret.

On the other hand, if accuracy is all that is needed and prediction is the most important thing, i.e. you need to predict sales as accurately as possible without caring about what affects it and by how much, a more complex method like splines or SVM is probably preferable.

This is from the book ISLR by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
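To make the inference case concrete, here is a minimal sketch using NumPy's least-squares solver. The data is synthetic and the "true" effects (0.05 for TV, 0.20 for radio) are invented for illustration, not taken from the book. The point is only that each fitted coefficient can be read directly as "change in sales per unit of that predictor".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic advertising budgets (assumed ranges, for illustration only).
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)

# Invented relationship: sales = 5 + 0.05*TV + 0.20*radio + noise.
sales = 5.0 + 0.05 * tv + 0.20 * radio + rng.normal(0, 1.0, n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), tv, radio])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, coef_tv, coef_radio = coef

print(f"intercept:    {intercept:.3f}")
print(f"TV effect:    {coef_tv:.3f} sales units per unit of TV spend")
print(f"radio effect: {coef_radio:.3f} sales units per unit of radio spend")
```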

Hope this helps!!


@Aarshay - Thanks for the explanation. Could you also briefly explain how you classify a dataset as complex based on the data points, features, etc.? Is there any standard for quantifying whether a problem is simple or complex?


Hi @karthe1,

Very valid point, I must say. But honestly, things are a bit fuzzy in this case. There is no real definition or boundary defining a ‘complex’ model. Model selection depends on many more factors than #data points or #features, such as the problem at hand, the relation between variables, and the type of variables (categorical/continuous).

I think @shuvayan has a very valid point. The problem at hand plays a big role. For instance, if you practice on Kaggle datasets, you mostly go for higher accuracy and use as good a model as you can.

But that’s not always the case. I used to work in pharmaceutical analytics some time back, and there we had an interesting problem. We had to build a predictive model that could be coded into an MS Excel application so that the medical representatives could use it to make predictions on the fly. In this case, we were limited to logistic regression and decision trees, because more complex models are not easy to code from scratch. Also, interpretability was a major concern. The model should make practical sense in such applications.
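To illustrate why logistic regression ports so easily to a spreadsheet: once fitted, the model is just a weighted sum passed through a sigmoid. The sketch below uses made-up coefficients and feature names (they are hypothetical, not from the actual project) to show how little logic is involved.

```python
import math

# Hypothetical fitted coefficients, invented purely for illustration.
INTERCEPT = -1.5
COEFFICIENTS = {"visits_last_quarter": 0.30, "is_specialist": 0.80}


def predict_probability(features, intercept=INTERCEPT, coefficients=COEFFICIENTS):
    """Score one record with a fitted logistic regression model."""
    # Linear part: intercept plus the weighted sum of the feature values.
    z = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    # Sigmoid turns the linear score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))


# A spreadsheet version of the same logic is a single cell formula along
# the lines of: =1/(1+EXP(-(b0 + b1*A2 + b2*B2)))
example = {"visits_last_quarter": 5, "is_specialist": 0}
print(predict_probability(example))
```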

To summarise, there is no definite answer here. You should also check out ensemble techniques:

These involve building a variety of models and then combining their results to get a better prediction.
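As a rough sketch of what "combining results" can mean, here are the two simplest schemes in plain Python: majority voting over predicted class labels and averaging over predicted probabilities. The function names and the toy predictions are made up for illustration. Real ensembles such as bagging, boosting or stacking are more involved.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class labels: one inner list per model, one entry per sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # Pick the label predicted by the most models for this sample.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


def average_probabilities(probabilities_per_model):
    """Combine predicted probabilities by taking the mean across models."""
    n_models = len(probabilities_per_model)
    return [sum(sample) / n_models for sample in zip(*probabilities_per_model)]


# Toy predictions from three hypothetical models on four samples.
labels = [
    [1, 0, 1, 1],  # model A
    [1, 1, 0, 1],  # model B
    [0, 0, 1, 1],  # model C
]
print(majority_vote(labels))

# Toy probabilities from the same three models on two samples.
probs = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
print(average_probabilities(probs))
```

An odd number of models avoids ties in the binary voting case.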

Hope this helps!



Hi All !

My 2 cents.

Ooops, actually 1 Image, not 2 cents. :smiley:

This is a very large image. It takes some patience to work through it.


Can you mail me (rahul91.aggarwal@gmail.com) the flow chart? It is awesome, but difficult to navigate because of its size.

Thanks for putting it up on the portal.


I hope this will help you