Dummy variables: is it necessary to standardize them?

machine_learning
data_science
python

#1

I have the following dataset represented as a numpy array:

direccion_viento_pos

    Out[32]:

    array([['S'],
           ['S'],
           ['S'],
           ...,
           ['SO'],
           ['NO'],
           ['SO']], dtype=object)

The shape of this array is:

direccion_viento_pos.shape
(17249, 1)

I am using Python and scikit-learn to encode these categorical variables, in this way:

from __future__ import unicode_literals
import pandas as pd
import numpy as np
# from sklearn import preprocessing
# from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Then I create a label encoder object:

labelencoder_direccion_viento_pos = LabelEncoder()

I take column position 0 (the only column) of direccion_viento_pos and apply the fit_transform() method to all of its rows:

direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])

My direccion_viento_pos is of this way:

direccion_viento_pos[:, 0]
array([5, 5, 5, ..., 7, 3, 7], dtype=object)
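
As a side note on these numbers: LabelEncoder assigns the integer codes in alphabetical order of the categories, and the learned mapping can be inspected through the classes_ attribute:

labelencoder_direccion_viento_pos.classes_
# array(['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO'], dtype=object)
# so 'S' -> 5, 'NO' -> 3 and 'SO' -> 7, which matches the output above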

Up to this point, each row/observation of direccion_viento_pos has a numeric value, but I want to solve the problem of implied weight: some rows now carry a higher value than others, even though the categories have no order.

Because of this, I create dummy variables, which, according to this reference, are:

A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels

Then, in my direccion_viento_pos context, I have 8 values:

  • SO - Southwest (Sur oeste)
  • SE - Southeast (Sur este)
  • S - South (Sur)
  • N - North (Norte)
  • NO - Northwest (Nor oeste)
  • NE - Northeast (Nor este)
  • O - West (Oeste)
  • E - East (Este)

That means 8 categories.
Next, I create a OneHotEncoder object with the categorical_features parameter, which specifies which features will be treated as categorical variables.

onehotencoder = OneHotEncoder(categorical_features = [0])

And I apply this onehotencoder to the direccion_viento_pos matrix:

direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()

My direccion_viento_pos, with its categories one-hot encoded, now looks like this:

direccion_viento_pos

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

So, up to this point, I've created a dummy variable for each category.

I wanted to narrate this whole process in order to arrive at my question.

If these dummy-encoded variables are already in a 0-1 range, is it necessary to apply MinMaxScaler feature scaling?

Some say that it is not necessary to scale these dummy variables. Others say that it is necessary, because we want accuracy in predictions.

I ask this question because when I apply MinMaxScaler with feature_range=(0, 1),
my values change in some positions … despite still keeping this scale.
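
To make my doubt concrete: as far as I understand, on columns that already contain both a 0 and a 1, MinMaxScaler with feature_range=(0, 1) should be an identity transform, as in this toy sketch (toy matrix, not my real data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy dummy matrix: every column contains both a 0 and a 1
X = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
np.allclose(X, X_scaled)
# True: each column's min is 0 and max is 1, so nothing changes

That is why I don't understand why some of my positions changed.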

What is the best option to choose with respect to my dataset direccion_viento_pos?


#2

Hi @bgarcial,

Label encoding is simply converting each value in a column to a number. For example, direccion_viento_pos has 8 categories, so it will give a separate number to each of the 8 categories. One disadvantage of this approach is that the numeric values can be misinterpreted by the algorithms: the value 0 is obviously less than the value 7, but that does not correspond to anything in the real data. So label encoding is beneficial when we have ordinal variables.

The basic strategy in One Hot Encoding is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column.
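
As a tiny illustration (toy data, not your dataset):

import pandas as pd

s = pd.Series(['S', 'SO', 'N', 'S'])
pd.get_dummies(s)
# one 0/1 column per category (N, S, SO); each row has a single "hot" entry
# marking its category (depending on your pandas version the columns may be
# integer 0/1 or boolean True/False)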

If you use One Hot Encoding, it always returns either a value of 0 or 1. So yes, the range is 0-1.

Is there any order in the direccion_viento_pos variable? If yes, go for label encoding; otherwise use One Hot Encoding. And if you use One Hot Encoding, there is no need to standardize the result.


#3

Let me summarize this from a different perspective, in question-answer format:

  1. When to use LabelEncoder and when OneHotEncoder?
  2. Is scaling beneficial in either case?

First, both methods are proven to improve the accuracy of the model, but often one is preferred over the other. Let’s assume we have a data set with 1000 unique values in a column. It definitely makes sense to use label encoding rather than one hot encoding there, since one hot encoding would make the dataframe very sparse (a greater number of zeros in the matrix). Such a sparse matrix can still be used as input to a PCA decomposition, but it often fails when used raw.

The other perspective depends on the algorithm you choose. If it’s a boosted tree algorithm, it doesn’t really make much difference, as these algorithms are ensembles that split on conditions rather than magnitudes. At the same time, if you are using a linear algorithm, then one hot encoding will perform better than label encoding, as otherwise the weights are shifted towards the higher-numbered labels (0 vs 7, as @PulkitS explains).
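
A tiny sketch of that distortion (hypothetical weight, just to illustrate):

# with label encoding, a linear model computes weight * code, so 'SO'
# (code 7) contributes seven times what 'N' (code 1) does for the same
# weight -- purely an artifact of the alphabetical coding, not of the data
w = 0.5
w * 1   # contribution of 'N'  -> 0.5
w * 7   # contribution of 'SO' -> 3.5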

For the second question, it really depends on the dataset and the other features. If 80% of the features are bounded (0-1), MinMax scaling should definitely work. If the features are not bounded, then scaling won’t have as big an impact. In my implementation, scaling worked when I transformed my target variable with a log1p conversion.

Hope this helps you @bgarcial!

Feel free to post any follow up questions.


#4

@PulkitS and @Shaz13, first of all, thanks a lot for your guidance.
From what you tell me, it is necessary to analyze when to use LabelEncoder or OneHotEncoder, but not to use them together?

LabelEncoder simply encodes the values into numbers according to how many categories I have. But this solution brings the numerical weight problem (in my case 0 to 7), and my wind direction values do not have any order.
It is for this reason that, after using LabelEncoder, I proceed to use OneHotEncoder, and that is when I apply the basic strategy of converting each category value into a new column and assigning a 1 or 0 (True/False) value to that column.

Let me explain in more detail what I’m doing:

# I read my dataset.
direccion_viento = pd.read_csv('Direccion del viento.csv', )
print(direccion_viento)
Out[2]: 
                Fecha                         Direccion del viento (Pos)
0      2017-04-01 00:24:17                          S
1      2017-04-01 00:54:16                          S
2      2017-04-01 01:24:17                          S
3      2017-04-01 01:54:17                          S
4      2017-04-01 02:24:16                          S
5      2017-04-01 02:54:15                          S
6      2017-04-01 03:24:16                          S
7      2017-04-01 03:54:14                         SO
8      2017-04-01 04:24:17                          S
...
17248

I remove the Fecha column and reshape my numpy array

direccion_viento_pos = direccion_viento.iloc[:, 1].values
direccion_viento_pos = direccion_viento_pos.reshape(-1, 1)
direccion_viento_pos
Out[2]: 
array([['S'],
       ['S'],
       ['S'],
       ...,
       ['SO'],
       ['NO'],
       ['SO']], dtype=object)

I encode these categorical variables of direccion_viento_pos using LabelEncoder.
This is where my distinct values (SO, SE, S, N, NO, NE, O, E) are converted into values from 0 to 7:

# Create the LabelEncoder object
labelencoder_direccion_viento_pos = LabelEncoder()

# Apply the transformation
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])

# My new direccion_viento_pos categorized values
direccion_viento_pos
Out[2]: 
array([[5],
       [5],
       [5],
       ...,
       [7],
       [3],
       [7]], dtype=object)

Up to here, I have the encoded values in the 0-7 range …

Then I proceed to use OneHotEncoder, and that is when I apply the basic strategy of converting each category value into a new column and assigning a 1 or 0 (True/False) value to that column:

onehotencoder = OneHotEncoder(categorical_features = [0]) 
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()

Then, my new values of direccion_viento_pos are

direccion_viento_pos
Out[2]: 
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

I’ve typed out this whole process because if I don’t apply LabelEncoder first and apply OneHotEncoder only, my direccion_viento_pos values are not encoded, and the OneHotEncoder process does not convert them into 1 or 0 (True/False) column values, as follows:

from __future__ import unicode_literals
import pandas as pd
import numpy as np
from sklearn import preprocessing
from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# I read my dataset.
direccion_viento = pd.read_csv('Direccion del viento.csv', )
# Select column of interest and reshape my array
direccion_viento_pos = direccion_viento.iloc[:, 1].values
direccion_viento_pos = direccion_viento_pos.reshape(-1, 1)

# Apply OneHotEncoder only without apply LabelEncoding previously
onehotencoder = OneHotEncoder(categorical_features = [0])
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()
direccion_viento_pos
Out[2]: 
 array([['S'],
        ['S'],
        ['S'],
        ...,
        ['SO'],
        ['NO'],
        ['SO']], dtype=object)

My direccion_viento_pos values (SO, SE, S, N, NO, NE, O, E) do not follow any order …
So is it possible that I am misunderstanding something? In this process, should LabelEncoder and OneHotEncoder really be used together?
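
(As an aside, I’ve read that newer scikit-learn versions, 0.20 and later, let OneHotEncoder handle string categories directly, so the LabelEncoder step would not be needed at all. I haven’t verified this on my version, but the sketch would be something like:)

from sklearn.preprocessing import OneHotEncoder

# sketch for scikit-learn >= 0.20: OneHotEncoder accepts string categories
# directly; note the sparse flag was renamed sparse_output in version 1.2
onehotencoder = OneHotEncoder(sparse=False)
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos)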

direccion_viento_pos has many values, but they are not unique, and I apply OneHotEncoder to avoid the weights being shifted towards the higher-numbered labels (0 vs 7).

According to this process, I am using them together … but I’ve understood from @PulkitS and @Shaz13 that often one is preferred over the other. (I don’t know if I am misinterpreting your answers.)

At the end of the day, the goal is to get these direccion_viento_pos values into a 0-1 range/scale in order to apply K-Means clustering (where distances are important for good convergence …).

In this sense, do I really not need to normalize/standardize my data?
And if I did decide to normalize/standardize my data, would this change anything? Or does it not matter?

My apologies, @PulkitS and @Shaz13, if my questions and considerations are newbie ones, but I am a bit confused here.

Best regards and many thanks for your time and effort :slight_smile:


#5

Now I see your concern. I appreciate the code and the long reply. I see that the coding part is not efficient here, although the end result is the expected and correct transformation. To get to one hot encoding, you are first label encoding and then fitting the one hot encoder. This is not a practiced approach. Instead, try this handy code:

direccion_viento_pos = direccion_viento.iloc[:, 1].values
direccion_viento_pos = pd.get_dummies(direccion_viento_pos)

Complete documentation on pd.get_dummies is available here
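
One option worth knowing about: get_dummies also takes a drop_first parameter that drops one redundant column (with k categories, k-1 dummy columns already carry all the information, the so-called dummy variable trap). That mainly matters for linear models; for K-Means, keeping all k columns preserves the symmetry between categories, so the call above is fine as is.

# optional variant, mainly relevant for linear models:
direccion_viento_pos = pd.get_dummies(direccion_viento.iloc[:, 1].values, drop_first=True)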

Feel free to revert back if you have other queries.


#6

Hi @bgarcial,

Yes, there is no point in using both LabelEncoder and OneHotEncoder together. You should use LabelEncoder when you want to change the values into numbers according to how many categories you have, and OneHotEncoder when you want to convert each category into a new column.

If you have applied OneHotEncoding, there is no need to normalize or standardize the data.
There is an even simpler way to get the OneHotEncoded variables: you can use the get_dummies() command.

After these steps you can add the following code to get the OneHotEncoded results:

direccion_viento_pos=pd.get_dummies(direccion_viento_pos)

This will help you achieve your goal of getting these direccion_viento_pos values into a 0-1 range/scale for K-Means clustering.

You don’t have to normalize/standardize your data once you have the dummy variables. Even if you standardize the data, it will not affect the results.
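
One way to see why, given your K-Means goal: with one-hot encoding, any two different categories differ in exactly two positions, so every pair of distinct categories is already equidistant, and no rescaling can improve on that. A small sketch using your alphabetical coding:

import numpy as np

s  = np.array([0, 0, 0, 0, 0, 1, 0, 0])  # 'S'  (column index 5)
so = np.array([0, 0, 0, 0, 0, 0, 0, 1])  # 'SO' (column index 7)
n  = np.array([0, 1, 0, 0, 0, 0, 0, 0])  # 'N'  (column index 1)
np.linalg.norm(s - so)  # sqrt(2) ~ 1.414
np.linalg.norm(s - n)   # sqrt(2) as well: no category is accidentally "closer"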


#7

Hi @PulkitS and @Shaz13, thanks so much for the orientation and explanations. I see that the pd.get_dummies function performs, behind the scenes, all the work I did manually.

Best Regards.

