OneHotEncoding in Python

ipython

#1

Hi…

I am new to Python. Is OneHotEncoding similar to creating dummy variables?
I tried the following code with OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

enc.transform([[0, 1, 1]])
# <1x9 sparse matrix of type '<type 'numpy.float64'>' with 3 stored elements in Compressed Sparse Row format>

enc.transform encodes [[0, 1, 1]] using one-hot encoding.
How does the following code produce the array in its output?

In [1] : enc.transform([[0, 1, 1]]).toarray()
Out[1] : array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])

An explanation with steps will be helpful.

Thanks


#2

Hi @shan4224,

Yes, one-hot encoding is similar to creating dummy variables.

The difference here is that transform returns a sparse matrix, which is why you need .toarray() to see the values. Let me explain. Your input is a matrix like this:
0 0 3
1 1 0
0 2 1
1 0 2

This has 3 columns (features) and 4 rows. Each column has a different number of unique values. If you run:
enc.n_values_
It gives: array([2, 3, 4])

So categories for each feature are:

  1. feature 1: 0 1
  2. feature 2: 0 1 2
  3. feature 3: 0 1 2 3
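
A note for anyone on a newer scikit-learn: as far as I know, n_values_ was deprecated in 0.20 and removed in 0.22, so the call above may fail. The fitted categories_ attribute holds the same information. A minimal sketch, assuming a recent scikit-learn is installed (the name sizes is only for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

# Same training data as in the question: 4 rows, 3 categorical features.
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# categories_ is a list with one array of sorted unique values per feature.
for i, cats in enumerate(enc.categories_, start=1):
    print("feature", i, ":", list(cats))

# The length of each array is the per-feature count that n_values_ used to report.
sizes = [len(cats) for cats in enc.categories_]
print(sizes)  # [2, 3, 4]
```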

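When you pass [0 1 1] to transform, it returns a sparse matrix with one column per category, marking which category each element falls in. There are 9 categories in total (2 + 3 + 4):

0 1 0 1 2 0 1 2 3 (actual categories)
1 0 0 1 0 0 1 0 0 (output for [0 1 1])
The first 2 columns belong to feature 1, where the value 0 sets the first of them to 1. The next 3 belong to feature 2, where the value 1 sets the second of them to 1. The last 4 belong to feature 3, where the value 1 again sets the second of them to 1. Hope this makes sense.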
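
To make the steps concrete, here is a small sketch that rebuilds the same row by hand in plain Python, without scikit-learn. It is not how OneHotEncoder is implemented internally, only an illustration of the mapping; the names categories, row and encoded are made up for this example.

```python
# Training data from the question: 4 rows, 3 categorical features.
X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Step 1: collect the sorted unique values of each column (feature).
categories = [sorted(set(column)) for column in zip(*X)]
print(categories)  # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]

# Step 2: for each feature, emit one indicator per category and
# concatenate the blocks (2 + 3 + 4 = 9 columns in total).
row = [0, 1, 1]
encoded = []
for value, cats in zip(row, categories):
    encoded.extend(1.0 if value == c else 0.0 for c in cats)

print(encoded)  # [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Each feature contributes exactly one 1, which is why the sparse matrix reports 3 stored elements.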

If you’re using Pandas, you can do the same thing with the get_dummies function. It’s much easier and works in a jiffy!
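
A rough sketch of the Pandas route on the same data, assuming pandas is installed. The column names f1, f2 and f3 are invented for this example, and the columns argument is passed so that the integer columns are treated as categorical.

```python
import pandas as pd

# Same 4x3 data as in the question, with made-up column names.
df = pd.DataFrame(
    [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]],
    columns=["f1", "f2", "f3"],
)

# One indicator column per (feature, value) pair: 2 + 3 + 4 = 9 columns.
dummies = pd.get_dummies(df, columns=["f1", "f2", "f3"])
print(list(dummies.columns))
print(dummies.astype(int))
```

One difference to keep in mind: get_dummies only creates columns for the values present in the frame you give it, so calling it on a new row such as [0, 1, 1] by itself would not produce all 9 columns. OneHotEncoder remembers the categories from fit, which is why its transform does.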

Hope this helps.

Cheers!