How to create Target Encoding or Mean Encoding in R?

r
target_encoding

#1

In a recent Analytics Vidhya hackathon, I came across a concept called target encoding for handling categorical variables that have too many levels (e.g., ZIP code, phone number). I want to implement this in R, but I couldn’t find any leads on Google. Can anybody help me build this in R with cross-validation (CV)?


#2

Hi @Satish_Chilloji, Target Encoding is used when we have to encode a categorical variable which suffers from high cardinality, i.e., too many levels.

Suppose you have a ZIP code feature with 100 levels in your data, and the target variable is continuous. You can create a lookup table by grouping on the ZIP code variable and taking the group-wise mean of the target variable. You can then use this lookup table to replace each level of the ZIP code feature in the original dataset with the corresponding mean of the target variable.

You can refer to the following sample code to understand the concept.

library(dplyr)

# creating a dummy dataset
data = data.frame(ZIP_CODE = c("110001", "110003", "110021", "110003", "110001", "110021"),
                  Target = c(23,34,56,78,33,65))

data # print data

ZIP_CODE Target
1 110001 23
2 110003 34
3 110021 56
4 110003 78
5 110001 33
6 110021 65

# creating a lookup table
lookup = data %>%
  group_by(ZIP_CODE) %>%
  summarise(mean_target = mean(Target))

lookup # print lookup data

# A tibble: 3 x 2
ZIP_CODE mean_target
<fct> <dbl>
110001 28.0
110003 56.0
110021 60.5

# adding mean target values from the lookup table to the original data
data = left_join(data, lookup, by = "ZIP_CODE") # join on the ZIP_CODE key explicitly
data # print data with encoded ZIP_CODE
ZIP_CODE Target mean_target
110001 23 28
110003 34 56
110021 56 60.5
110003 78 56
110001 33 28
110021 65 60.5

mean_target is the encoded form of the ZIP_CODE.
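
If the same lookup is later applied to data that was not used to build it, any ZIP code missing from the lookup comes back as NA from left_join. Below is a minimal sketch of one common fallback, the overall mean of the training target. The test data frame, the level "110022" and the choice of fallback are assumptions made for illustration.

```r
library(dplyr)

# Training data (same toy values as above) and hypothetical new data
train <- data.frame(
  ZIP_CODE = c("110001", "110003", "110021", "110003", "110001", "110021"),
  Target   = c(23, 34, 56, 78, 33, 65),
  stringsAsFactors = FALSE
)
test <- data.frame(
  ZIP_CODE = c("110001", "110021", "110022"),  # "110022" never appears in train
  stringsAsFactors = FALSE
)

# Lookup table and fallback value are both computed from the training data only
lookup <- train %>%
  group_by(ZIP_CODE) %>%
  summarise(mean_target = mean(Target))
global_mean <- mean(train$Target)

# left_join leaves NA for unseen levels; replace those with the global mean
test_encoded <- test %>%
  left_join(lookup, by = "ZIP_CODE") %>%
  mutate(mean_target = ifelse(is.na(mean_target), global_mean, mean_target))

test_encoded
```

Other fallbacks (for example a median, or a separate "unknown" value) are equally possible; the global mean is just a simple default.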


#3

That’s a very clean example and easy to understand. Thank you, Joshi.
I have a few questions before building this.

  1. Let’s say I build the target encoding for ZIP code on the training dataset and then map these values onto the ZIP codes in the test dataset, since we don’t have the target for the test data. Is that correct? If so, how should I handle unseen levels in the test data (for example, a new ZIP code 110022) that never appeared in the training dataset?

  2. If we simply replace ZIP code with its target-encoded value, will it cause overfitting? I have gone through the Kaggle article below. The code is written in Python (hard for me to understand), and they use cross-validation to counter overfitting.
https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features/notebook

Let me know your thoughts.
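
For reference, here is a rough sketch of what a cross-validated (out-of-fold) version might look like in R. It is not a translation of the Kaggle notebook. The number of folds, the seed, the column name zip_encoded and the global-mean fallback are all assumptions for illustration.

```r
library(dplyr)

set.seed(42)  # fold assignment is random; fix the seed for reproducibility

train <- data.frame(
  ZIP_CODE = c("110001", "110003", "110021", "110003", "110001", "110021"),
  Target   = c(23, 34, 56, 78, 33, 65),
  stringsAsFactors = FALSE
)

k <- 3                                                      # number of folds (assumed)
fold <- sample(rep(seq_len(k), length.out = nrow(train)))   # random fold id per row
global_mean <- mean(train$Target)                           # fallback value (assumed)

train$zip_encoded <- NA_real_
for (i in seq_len(k)) {
  in_fold <- fold == i
  # Means come from the other folds only, so a row's own target
  # never contributes to its encoded value
  lookup <- train[!in_fold, ] %>%
    group_by(ZIP_CODE) %>%
    summarise(mean_target = mean(Target))
  idx <- match(train$ZIP_CODE[in_fold], lookup$ZIP_CODE)
  train$zip_encoded[in_fold] <- lookup$mean_target[idx]
}

# ZIP codes that do not occur in the other folds fall back to the global mean
train$zip_encoded[is.na(train$zip_encoded)] <- global_mean

train
```

In this sketch the test set would be encoded with a lookup built from the full training data, with the same fallback for ZIP codes that were never seen.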