How to do one hot encoding in R


#1

hello,

In Python we can do one-hot encoding like this:

#One-hot-encoding features
ohe_feats = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']
for f in ohe_feats:
    df_all_dummy = pd.get_dummies(df_all[f], prefix=f)
    df_all = df_all.drop([f], axis=1)
    df_all = pd.concat((df_all, df_all_dummy), axis=1)
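As an aside, the same loop can be written as a single call in recent pandas versions. A minimal sketch on a made-up toy frame (using just two of the features listed above):

```python
import pandas as pd

# Toy frame with two of the listed features (values are hypothetical).
df_all = pd.DataFrame({"gender": ["M", "F", "F"],
                       "signup_app": ["Web", "iOS", "Web"]})

# get_dummies with columns= prefixes each new column with the original
# column name and drops the original columns, like the loop above.
df_all = pd.get_dummies(df_all, columns=["gender", "signup_app"])
print(list(df_all.columns))
```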

However, I could not find a package in R that does the same thing as simply.
Can someone point me to an apt library in R to achieve this?


#2

hello @pagal_guy,

Something like the code below should do it:

#One-hot-encoding features:
library(ade4)
library(data.table)
ohe_feats = c('gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 
             'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser')
for (f in ohe_feats){
  df_all_dummy = acm.disjonctif(df_all[f])  # create dummy columns for f
  df_all[f] = NULL                          # drop the original column
  df_all = cbind(df_all, df_all_dummy)      # bind the dummies back on
}

Hope this helps!!


#3

@shuvayan Can you be more clear about this code? I didn't get what the code you wrote does. Can you help me with a code in R for this?

Thanks,
Rohit


#4

hello @Rohit_Nair,

For each categorical variable in the list ohe_feats, acm.disjonctif creates the dummy columns. The next line drops that categorical variable from the original data, and the following line binds all the dummy variables back onto the original data.
Hope this helps!!


#5

ok thanks @shuvayan :slightly_smiling:


#6

Hi @Rohit_Nair,

From one of the code snippets shared by @Rohan_Rao, I learnt the following way of doing one-hot encoding.

Using the dummies library:

library(dummies)
df <- dummy.data.frame(df, names=c("MyField1"), sep="_")

Note: this splits the original field into as many dummy columns as it has unique values. The original field is no longer available in the data frame.

Example: starting from a data frame whose categorical column MyField1 takes the values A, B and C, after

df <- dummy.data.frame(df, names=c("MyField1"), sep="_")

MyField1 is replaced by the dummy columns MyField1_A, MyField1_B and MyField1_C.
In the method shown by @shuvayan, the original field is still available to you. Hope this helps.


#7

@sadashivb Can you help me with this error? I googled it but couldn't find any solution.

library(dummies)
df <- dummy.data.frame(Clean_data, names=c("Gender"), sep="_")
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

$ Gender : chr “Male” “Male” “Male” “Male” …


#8

One-hot encoding can be done in R using model.matrix; it is simple and easy.
Here is an example:

FactoredVariable = factor(df$Any)                           # ensure the column is a factor
dumm = as.data.frame(model.matrix(~FactoredVariable)[,-1])  # build dummies; [,-1] drops the intercept column
dfWithDummies = cbind(df, dumm)                             # bind the dummies onto the original data
str(dfWithDummies)

You can also try looking into the caret package; it offers various data preprocessing and modelling tools to make our lives easier.
Thanks, hope it helps!


#9

@pagal_guy
I think this should work for you:

df1 <- within(df, newcolumnname <- match(df$columnname, unique(df$columnname)))

(Note that this gives a single column of integer codes, i.e. label encoding, rather than one dummy column per level.)


#10

I am using one-hot encoding in Python, but the final result does not have the same dimension as the original data: the number of columns has increased. Please tell me how to get the original dimension back.


#11

Hi @naveed56,

If you use one hot encoding, the dimensions will obviously change. As explained by @sadashivb in the thread above, this is how one hot encoding works -

For given data with two columns, MyField1 and MyField2, where the first variable is categorical

[image: sample data with the categorical column MyField1]

and on applying one-hot encoding, it will look like the image below, so the dimension will increase.

[image: the same data with MyField1 split into one dummy column per value]

The problem I assume you are facing is that the dimensions of the training and test datasets differ after encoding. You must combine both before applying one-hot encoding, so the encoded columns match and you won't have a problem. Also, in case your categories have an order, you can go for label encoding, which does not affect the dimension.
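To illustrate, here is a minimal sketch of that combine-then-encode step in pandas (the frames and the MyField1 column are made-up for illustration):

```python
import pandas as pd

# Made-up train/test frames; test contains a category ("C") that train lacks.
train = pd.DataFrame({"MyField1": ["A", "B", "A"], "MyField2": [1, 2, 3]})
test = pd.DataFrame({"MyField1": ["C", "A"], "MyField2": [4, 5]})

# Concatenate so both parts see the same set of categories, encode once,
# then split back apart: both halves now have identical columns.
combined = pd.concat([train, test], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=["MyField1"])
train_enc, test_enc = combined.loc["train"], combined.loc["test"]
```

Encoding the two frames separately would instead give mismatched dummy columns (train would lack MyField1_C, test would lack MyField1_B), which is exactly the dimension mismatch described above.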


#12

Here is an excellent article to help you understand one hot encoding, label encoding and the difference between the two.


#13

Thank you @AishwaryaSingh, I got your point.
You are absolutely right.
Is there any way to keep exactly the same dimension, or to invert the encoding?
I want the same dimension in both cases, training and testing,
because Random Forest assigns importance values separately to MyField1_A, B and C, not a single importance value to MyField1.
If that is not possible, then tell me how to combine the importance values of a categorical feature, e.g. MyField1?


#14

I just want to implement one hot encoding


#15

If you don't want to change the dimension, you can go for label encoding (although that would be wrong, since your variables do not have an order between them).

The best way would be to combine the train and test set, and then apply one-hot encoding. Also, go through the article mentioned above; that would help.


#16

Thank you so much @AishwaryaSingh.
I have read that article and my problem has been resolved. Now I am facing another issue: dummy variable importance values in random forest.

For example, Field is a categorical feature and Field_A, Field_B and Field_C are dummy variables with importance values 0.03, 0.2 and 0.1 respectively. How do I add these dummy variable importance values of Field_A, Field_B and Field_C? Random forest gives an importance value to each dummy variable separately, not to the categorical feature.

A brief explanation of adding those values would help, because I have read the article given below but am not able to understand it.


#17

A random forest model will treat each variable separately. Now this could be because that particular category (field_A) might be more important. You will have to make inferences from this information.

I would suggest you go through Jeremy Howard's machine learning course (MOOC). He has explained this concept in detail. I have covered the same in the form of an article. Here is the link. The link to his videos is within the article.


#18

Thank you @AishwaryaSingh, I have read that article thoroughly, but I don't understand how to simply add the importance values of the dummy variables (Field_A, Field_B and Field_C) to get the importance value of the categorical feature (Field).
If Field is a categorical feature and Field_A, Field_B and Field_C are dummy variables with importance values 0.03, 0.2 and 0.1 respectively,
please tell me how to calculate it? Thank you.


#19

@naveed56, adding the values for these three columns to get the importance of Field would be incorrect. The model now treats these as three separate features, so simply summing their importance values is not meaningful.

If you have a size value or something that says field_a>field_b>field_c, then label encode them so that you have only one column.
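A quick sketch of that label-encoding idea in pandas (the column name, its values and their ordering are all hypothetical):

```python
import pandas as pd

# Hypothetical ordered categories: field_a > field_b > field_c.
df = pd.DataFrame({"field": ["field_a", "field_c", "field_b", "field_a"]})

# Map each category to an integer that reflects its order; the result
# is a single numeric column instead of three dummy columns.
order = {"field_c": 0, "field_b": 1, "field_a": 2}
df["field_encoded"] = df["field"].map(order)
```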


#20

Thank you. You have already helped me a lot.
I really understand your point.
I can't use label encoding.

Now, I am using one-hot encoding

Can you tell me how to combine the dummy variable importance values? In most research papers, authors give a single importance value for a categorical variable.

Please visit this page: https://stats.stackexchange.com/questions/314567/feature-importance-with-dummy-variables - a formula is given there for combining the importance values of dummy variables. I don't understand that formula. Can you explain it?

Thank you again