Convert continuous variables into categorical using Decision Trees

decision_trees
spark

#1

As we all know, decision trees work well with continuous variables too; they use variance reduction as the split criterion. I just want to know how I can convert my continuous variables into categorical ones without having to build or train a model on the dataset. In other words, how can I bin my continuous column into categories using decision trees? It would come in handy with some datasets, but I am not getting any good idea about it. Also, I am trying to do this in Spark, so please give me an idea.


#2

I don’t know about Spark, but in R you don’t really have to convert your continuous variable to a categorical one (not necessarily). Rather, let the algorithm decide the best splitting point(s).
If you wish to come up with a splitting point manually, you may read more about the Gini index, information gain, and other node-splitting criteria for continuous variables.
However, this is generally left for the algorithm to decide by itself. Hope this helps (y)
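
For example, a minimal sketch in R, assuming the rpart package (one common tree implementation, not mentioned above) and a made-up data frame df with a continuous predictor x and a numeric target y; you feed the raw continuous column to the tree and it chooses the split point(s) by variance reduction:

# Minimal sketch: the tree picks the split point on the raw continuous column.
# The rpart package and the names df, x, y are example choices.
library(rpart)

set.seed(1)
df <- data.frame(x = runif(200, 0, 10))          # continuous variable
df$y <- ifelse(df$x < 3, 5, 12) + rnorm(200)     # numeric target

# method = "anova" uses variance reduction to choose splits
fit <- rpart(y ~ x, data = df, method = "anova")

# Printing the tree shows which cut point(s) on x the algorithm selected
print(fit)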


#3

So if I feed my continuous variable to the algorithm, it will give me categorical variables based on the splitting criteria we decide. But then, for that, we have to train on our dataset, I guess.


#4

You are roughly on the right track. However, I should point out that the algorithm converts the continuous variable into a categorical one by itself, internally, when you give it the continuous variable in the training data, i.e. NO new variable is explicitly created for you to view.
Having said that, what you wish to obtain can be found by printing the final rules or the final tree structure obtained after training; there you can see which splitting point(s) the model has chosen.
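
Something along these lines, as a sketch in R with rpart (the data frame df and the columns x, y, x_binned are made-up names; the thresholds come from the fitted tree itself):

# Sketch: fit a tree on the continuous column, read the thresholds it chose,
# and turn them into an explicit categorical (factor) column.
library(rpart)

set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- ifelse(df$x < 3, 5, 12) + rnorm(200)

fit <- rpart(y ~ x, data = df, method = "anova")

# fit$splits has one row per split; for a continuous variable the "index"
# column holds the chosen threshold (with a single predictor these are the
# primary splits of the tree)
thresholds <- sort(unique(fit$splits[rownames(fit$splits) == "x", "index"]))

# Bin the original column with the tree's own cut points
df$x_binned <- cut(df$x, breaks = c(-Inf, thresholds, Inf))
table(df$x_binned)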


#5

Hi,

You can do several things:

  • Use the package “binst”, a fairly recent one that performs this kind of transformation from a continuous variable to a categorical one using different kinds of splitting criteria: entropy, k-means, etc.

  • Some algorithms (tree-based models) use this kind of separation internally to split each variable homogeneously; in these models you can control which measurement criterion is used for splitting (in particular in the “partykit” package).

  • And it is true that you can do this in Spark too; even from R, with the package “sparklyr”, you can access Spark’s function “ft_bucketizer”, which does exactly what you need (see the sketch after this list).
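
A rough sketch of the sparklyr route (assuming a local Spark connection and a made-up table with a continuous column x; the split points shown are arbitrary and could instead be thresholds learned by a decision tree beforehand):

# Sketch: bucketize a continuous Spark column with ft_bucketizer via sparklyr.
# The connection, table name and split points are example values.
library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
tbl <- copy_to(sc, data.frame(x = runif(100, 0, 10)), "toy", overwrite = TRUE)

# Splits must cover the whole range; -Inf/Inf keep the outer buckets open-ended
binned <- ft_bucketizer(tbl,
                        input_col  = "x",
                        output_col = "x_bucket",
                        splits     = c(-Inf, 3, 6, Inf))

binned %>% count(x_bucket)
spark_disconnect(sc)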

Regards,
Carlos Ortega.


#6

You can try the R package ‘smbinning’. It does exactly what you are looking for.
More details on www.scoringmodeling.com
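
For instance, a rough sketch (smbinning expects a binary 0/1 target, if I recall correctly; the data frame and the column names target, score, score_bin are made up here):

# Sketch of smbinning usage; df and its columns are illustrative only.
library(smbinning)

set.seed(1)
df <- data.frame(score = rnorm(1000, 600, 50))
df$target <- rbinom(1000, 1, plogis((df$score - 600) / 50))

# Find optimal cut points for "score" with respect to the binary "target"
result <- smbinning(df, y = "target", x = "score")

result$ivtable                                           # bins, counts, information value
df <- smbinning.gen(df, result, chrname = "score_bin")   # add the categorical column
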
Greetings.