In a recent Analytics Vidhya hackathon, I came across a concept called target encoding for handling categorical variables with too many levels (e.g., ZIP code, phone number). I want to implement this in R, but I couldn't find any leads on Google. Can anybody help me build this in R with cross-validation (CV)?
Hi @Satish_Chilloji, Target Encoding is used when we have to encode a categorical variable which suffers from high cardinality, i.e., too many levels.
Suppose you have a ZIP code feature with 100 levels in your data, and the target variable is continuous. You can create a lookup table by grouping on the ZIP code variable and taking the group-wise mean of the target variable. You can then use this lookup table to replace each level of the ZIP code feature in the original dataset with the corresponding mean of the target.
You can refer to the following sample code to understand the concept.
library(dplyr)

# creating a dummy dataset
data = data.frame(ZIP_CODE = c("110001", "110003", "110021", "110003", "110001", "110021"),
                  Target = c(23,34,56,78,33,65))
data # print data
  ZIP_CODE Target
1   110001     23
2   110003     34
3   110021     56
4   110003     78
5   110001     33
6   110021     65
# creating a lookup table
lookup = data %>%
  group_by(ZIP_CODE) %>%
  summarise(mean_target = mean(Target))
lookup # print lookup data
# A tibble: 3 x 2
# adding mean target values from the lookup table to the original data
data = left_join(data, lookup, by = "ZIP_CODE")
data # print data with encoded ZIP_CODE
mean_target is the encoded form of the ZIP_CODE.
That's a very clean example to understand. Thank you, Joshi.
I have a few questions before building this.
1) Let's say I have built the target encoding for ZIP code on the training data set, and we then map these values onto the ZIP code in the test data set, since we don't have the target for the test data. Is that correct? If so, how do we handle unseen levels in the test data (for example, a new ZIP code 110022) that never appeared in the training data set?
2) If we simply replace the ZIP code with its target-encoded value, will it cause overfitting?
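For the first question, one common approach (an assumption here, not something confirmed in this thread) is to build the lookup on the training data only, join it onto the test data, and replace the missing encodings of unseen levels with the overall training mean. A minimal sketch, where `train`, `test`, and `global_mean` are hypothetical names and the test ZIP codes are made up for illustration:

```r
library(dplyr)

# hypothetical training data (same values as the dummy dataset above)
train <- data.frame(
  ZIP_CODE = c("110001", "110003", "110021", "110003", "110001", "110021"),
  Target   = c(23, 34, 56, 78, 33, 65),
  stringsAsFactors = FALSE
)

# hypothetical test data: no Target column, and 110022 never appears in train
test <- data.frame(ZIP_CODE = c("110001", "110022"), stringsAsFactors = FALSE)

# the lookup table and the overall mean come from the training data only
lookup <- train %>%
  group_by(ZIP_CODE) %>%
  summarise(mean_target = mean(Target))
global_mean <- mean(train$Target)

# unseen levels get NA from the join; fall back to the global training mean
test_encoded <- test %>%
  left_join(lookup, by = "ZIP_CODE") %>%
  mutate(mean_target = ifelse(is.na(mean_target), global_mean, mean_target))

test_encoded # print encoded test data
```

The global mean is only one possible fallback; a separate "unknown" value or a smoothed estimate could be used instead, depending on the model.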
I have gone through an article from Kaggle on this. The code is written in Python (which is hard for me to follow), and they use cross-validation to counter overfitting.
Let me know your thoughts.
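On the overfitting question, a common remedy is out-of-fold encoding: each training row is encoded with means computed from the other folds only, so a row's own target never contributes to its encoding. The sketch below is a rough base-R illustration of that idea and not a translation of the Kaggle code; the function name `oof_target_encode`, the default `k`, the `seed` argument, and the fallback to the mean of the other folds are all assumptions.

```r
# out-of-fold (K-fold) target encoding, base R only
# x: categorical vector, y: numeric target, k: number of folds (k >= 2)
oof_target_encode <- function(x, y, k = 5, seed = 42) {
  stopifnot(length(x) == length(y), k >= 2)
  set.seed(seed)
  x <- as.character(x)
  # randomly assign each row to one of k folds
  fold <- sample(rep(seq_len(k), length.out = length(x)))
  encoded <- numeric(length(x))
  for (f in seq_len(k)) {
    held_out <- fold == f
    # level means and the fallback mean use the other folds only
    fold_means <- tapply(y[!held_out], x[!held_out], mean)
    fallback <- mean(y[!held_out])
    enc <- as.numeric(fold_means)[match(x[held_out], names(fold_means))]
    # levels that do not occur in the other folds get the fallback mean
    enc[is.na(enc)] <- fallback
    encoded[held_out] <- enc
  }
  encoded
}

# usage on the dummy data from the example above
train <- data.frame(
  ZIP_CODE = c("110001", "110003", "110021", "110003", "110001", "110021"),
  Target   = c(23, 34, 56, 78, 33, 65),
  stringsAsFactors = FALSE
)
train$ZIP_ENC <- oof_target_encode(train$ZIP_CODE, train$Target, k = 3)
train # print data with out-of-fold encoded ZIP_CODE
```

With this scheme only the training rows are encoded fold by fold; the test set would still be encoded from a lookup built on the full training data.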