Techniques to collapse the categorical variables in R



Hi all,

I have close to 200 categories in a variable.
What are all the techniques available in R or in statistics in general to collapse the values?
I generally collapse by relative weight percentage. Is there a better method?

Thank you.


How you collapse a categorical variable depends heavily on the data itself. If you want to do it mechanically, without domain knowledge of the data, you can use the frequency distribution of the categories and group them into buckets.

Here is some sample code that builds a data frame with 100 rows. col1 holds the categories a to e, and col2 holds numeric values.
# Start with every row in category 'c'; col1 stays character, col2 stays numeric
data <- data.frame(col1 = rep('c', 100), col2 = seq(1, 100), stringsAsFactors = FALSE)
# Assign the other categories by row range (rows 21 to 60 remain 'c')
data$col1[1:10] <- 'a'
data$col1[11:20] <- 'b'
data$col1[21:60] <- 'c'
data$col1[61:80] <- 'd'
data$col1[81:100] <- 'e'

# We will now see the frequency and then combine
freq <- table(data$col1)
freq <- sort(freq)
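
One possible way to go from a frequency table to collapsed categories is to lump every level whose relative share falls below a cutoff into a single bucket. The sketch below is only an illustration, not necessarily what the answer went on to do: the toy vector `x`, the 5% `threshold`, and the "Other" label are all assumptions chosen for the example.

```r
# Toy data (assumed for illustration): five levels with uneven frequencies
x <- c(rep("a", 3), rep("b", 4), rep("c", 50), rep("d", 23), rep("e", 20))

# Relative frequency (share) of each level
share <- prop.table(table(x))

# Levels whose share is below the assumed 5% cutoff are treated as rare
threshold <- 0.05
rare <- names(share)[as.vector(share) < threshold]

# Replace every rare level with a single "Other" bucket
collapsed <- ifelse(x %in% rare, "Other", x)

# Inspect the collapsed frequency distribution
print(table(collapsed))
```

With 200 categories the cutoff usually needs tuning, since a threshold that is too high can push most of the data into "Other". If the forcats package is available, `fct_lump_prop()` and `fct_lump_n()` offer similar lumping for factors.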