How many columns should be used in dummy coding & one hot encoding in Python

one-hot-encoding
dummy_variable
python

#1

Hello,

While reading today’s article on dealing with categorical variables, I came across this dummy coding example:

I think that the sex_female and sex_male columns both carry the same information, so it wouldn’t make sense to use both in the model. Please correct me if I am wrong.
I also took a look at one-hot encoding in Python, but I am not able to understand the code:

Here the levels are 0, 1, 2, 3 and the new levels are 2, 3, 4, right?
Which levels have been combined?
What does enc.feature_indices_ capture, and why are there 4 indices?
What is enc.transform doing?
I am sorry if these are very basic questions, but can someone kindly guide me on this?


#2

@pagal_guy,

I have used the example of “Male” and “Female” to keep it simple to understand.
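On the question of whether both sex_female and sex_male are needed, here is a minimal sketch in plain Python. The sample values in `sex` are made up for illustration. It shows that each column is fully determined by the other, which is why a variable with k levels usually needs only k-1 dummy columns when the model has an intercept.

```python
# Hypothetical sample data, not taken from the article.
sex = ["male", "female", "female", "male"]

# Build one 0/1 indicator column per level.
sex_female = [1 if s == "female" else 0 for s in sex]
sex_male = [1 if s == "male" else 0 for s in sex]

# Every row satisfies sex_male == 1 - sex_female, so the second
# column adds no new information once the first is known.
redundant = all(m == 1 - f for m, f in zip(sex_male, sex_female))

print(sex_female)  # [0, 1, 1, 0]
print(sex_male)    # [1, 0, 0, 1]
print(redundant)   # True
```

Keeping only sex_female (or only sex_male) is therefore enough for a two-level variable.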

In the output above, n_values_ is the number of unique values in each feature (see the snapshot below).

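enc.feature_indices_ holds the cumulative number of dummy variables created, starting from zero. The first feature has two unique values, so the next entry is 0+2=2. The second feature has three unique values, so the entry after that is 0+2+3=5. The third feature has four, which gives the final entry, 0+2+3+4=9. That is why there are four indices for three features: one starting offset plus one entry per feature.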
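The cumulative count can be reproduced in plain Python without calling scikit-learn. This is only a sketch: the training rows in `X` are an assumption chosen so that the three features have 2, 3 and 4 unique values as discussed in this thread, and the names `n_values` and `feature_indices` are illustrative stand-ins for the encoder attributes.

```python
from itertools import accumulate

# Assumed training rows (the original snapshot is not reproduced here).
# Each inner list is one sample with three categorical features.
X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Number of unique values in each feature (column).
n_values = [len(set(column)) for column in zip(*X)]

# Running total of dummy columns, starting from zero.
feature_indices = [0] + list(accumulate(n_values))

print(n_values)         # [2, 3, 4]
print(feature_indices)  # [0, 2, 5, 9]
```

Feature i occupies the output columns from feature_indices[i] up to, but not including, feature_indices[i + 1].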

Transformation methods

Now look at the transformation below. It creates one variable for each unique value of a feature, with 1 marking the presence of that value and 0 its absence.

Look at the image below for the transformation of [[0,1,1]].
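The same transformation can be sketched by hand in plain Python. The helper `one_hot_row` is hypothetical (it is not part of scikit-learn), and it assumes the three features have 2, 3 and 4 levels numbered from 0, as in the example discussed here.

```python
def one_hot_row(row, n_values):
    """Encode one sample: each feature becomes a block of 0s with a single 1."""
    encoded = []
    for value, n in zip(row, n_values):
        block = [0] * n   # one slot per possible level of this feature
        block[value] = 1  # mark the level that is present
        encoded.extend(block)
    return encoded

# Assumed number of levels per feature.
n_values = [2, 3, 4]

encoded = one_hot_row([0, 1, 1], n_values)
print(encoded)  # [1, 0, 0, 1, 0, 0, 1, 0, 0]
```

The first two columns encode the first feature (value 0), the next three encode the second feature (value 1), and the last four encode the third feature (value 1), for nine columns in total.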

Hope this helps!

Regards,
Sunil