Dealing with categorical variables - Looking for recommendations

categorical

#1

I have the following dataset, in which the Direccion del viento (Pos) column has categorical values.

In total, Direccion del viento (Pos) has 8 categories:

  • SO - Southwest (Sur oeste)
  • SE - Southeast (Sur este)
  • S - South (Sur)
  • N - North (Norte)
  • NO - Northwest (Nor oeste)
  • NE - Northeast (Nor este)
  • O - West (Oeste)
  • E - East (Este)

Then I convert this DataFrame to a NumPy array and get:

direccion_viento_pos
dtype: object
[['S']
 ['S']
 ['S']
 ...
 ['SO']
 ['NO']
 ['SO']]

Since I have string values and I want them to be numeric, I need to encode the categorical variable, that is, convert the text values into numbers.

Then I perform two activities:

  1. I use LabelEncoder() to simply encode the values into numbers according to how many categories I have.

Label encoding is simply converting each value in a column to a number:

    from sklearn.preprocessing import LabelEncoder

    labelencoder_direccion_viento_pos = LabelEncoder()
    direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
  2. I use one-hot encoding to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column:

    from sklearn.preprocessing import OneHotEncoder

    # The categorical_features parameter worked in scikit-learn < 0.20
    onehotencoder = OneHotEncoder(categorical_features=[0])
    direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()

This way, I get these new values:

direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

Then I convert this direccion_viento_pos array to a DataFrame to visualize it more easily:

# Turn array to dataframe with columns indexes
cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
df_direccion_viento = pd.DataFrame(direccion_viento_pos, columns=cols)

Then, for each category value I get a new column that is assigned a 1 or 0 (True/False) value.

If I use the pandas.get_dummies() function, I get the same result.
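A minimal sketch of that get_dummies() equivalent (the sample labels here are made up for illustration):

    import pandas as pd

    # Small made-up sample of wind-direction labels
    df = pd.DataFrame({'Direccion del viento (Pos)': ['S', 'S', 'SO', 'NO', 'SO']})

    # One indicator column per category, same result as one-hot encoding
    df_dummies = pd.get_dummies(df['Direccion del viento (Pos)'])
    print(df_dummies)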

My question is:
Is this the best way to deal with these categorical variables?
Doesn't having a column for each category, with zeros in most of them, introduce bias or noise into the data when machine learning algorithms are applied?

I've recently started reading about it in this article, but I'd appreciate any guidance on this.


#2

Hi @bgarcial

pandas.get_dummies() is used quite often. However, if the categorical variable is ordinal in nature, you can manually encode it as a numerical variable. For example, [“small”, “large”, “very large”] can be encoded as [0, 1, 2].
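A minimal sketch of that manual mapping (the size column is made up for illustration):

    import pandas as pd

    # Hypothetical ordinal feature: the categories have a natural order
    df = pd.DataFrame({'size': ['small', 'large', 'very large', 'small']})

    # Manual mapping that preserves the order
    size_mapping = {'small': 0, 'large': 1, 'very large': 2}
    df['size_encoded'] = df['size'].map(size_mapping)
    print(df)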

There is one more encoding method called target-based encoding. It is used when the number of categories in a categorical variable is quite high. It's worth exploring.
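A minimal sketch of the idea, with a made-up city column and a binary target:

    import pandas as pd

    df = pd.DataFrame({'city': ['A', 'B', 'A', 'C', 'B', 'A'],
                       'target': [1, 0, 1, 0, 1, 0]})

    # Each category is replaced by the mean of the target for that category
    # (in practice, compute the means on training data only, and use
    # cross-validation or smoothing to avoid target leakage)
    means = df.groupby('city')['target'].mean()
    df['city_encoded'] = df['city'].map(means)
    print(df)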


#3

Hi @pjoshi15
I have been reading about other ways to manage the categorical variables mentioned above, and I found the following:

In this link to the Jupyter notebook exercises (cell number 59), belonging to the Hands-On Machine Learning with Scikit-Learn and TensorFlow book, the author says the following about LabelEncoder:

Warning: earlier versions of the book used the LabelEncoder class or Pandas’ Series.factorize() method to encode string categorical attributes as integers. However, the OrdinalEncoder class that is planned to be introduced in Scikit-Learn 0.20 (see PR #10521) is preferable since it is designed for input features (X instead of labels y)

This means that LabelEncoder is meant for encoding the dependent variable, not the input features. My direccion_viento categorical variables are input features.

Initially, the scikit-learn 0.20 dev version included a CategoricalEncoder class.
I copied this class into a categorical_encoder.py file and applied it:

    from __future__ import unicode_literals
    import pandas as pd

    # Import the CategoricalEncoder locally from my project environment
    from notebooks.DireccionDelViento.sklearn.preprocessing.categorical_encoder import CategoricalEncoder

    # Read the dataset
    direccion_viento = pd.read_csv('Direccion del viento.csv')

    # Check that there are no null values
    print(direccion_viento.isnull().any())
    direccion_viento.isnull().values.any()

    # Select only the Direccion del viento (Pos) column
    direccion_viento = direccion_viento[['Direccion del viento (Pos)']]

    encoder = CategoricalEncoder(encoding='onehot-dense', handle_unknown='ignore')
    dir_viento_encoder = encoder.fit_transform(direccion_viento[['Direccion del viento (Pos)']])
    print("These are the categories:", encoder.categories_)

    cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
    df_direccion_viento = pd.DataFrame(dir_viento_encoder, columns=cols)

And the resulting dataset is similar to the one obtained using LabelEncoder and OneHotEncoder:

[image: the resulting one-hot encoded DataFrame]

Is the difference between using OneHotEncoder() and CategoricalEncoder() that, with CategoricalEncoder(), it is not necessary to apply LabelEncoder() first?

That is, is CategoricalEncoder essentially the same as OneHotEncoder, or at least is the result of applying them the same?

After reading and searching further about the CategoricalEncoder() class, I found that Aurélien Géron tells us in his book that CategoricalEncoder would be deprecated in the scikit-learn 0.20 stable version.

In fact, the scikit-learn team notes in their current master branch that:

CategoricalEncoder briefly existed in 0.20dev. Its functionality has been rolled into the OneHotEncoder and OrdinalEncoder.
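In other words, with scikit-learn >= 0.20 the same 'onehot-dense' result should be obtainable from OneHotEncoder alone; a minimal sketch (assuming sparse=False for a dense output, and direccion_viento loaded as above):

    from sklearn.preprocessing import OneHotEncoder

    # In scikit-learn >= 0.20, OneHotEncoder accepts string columns directly,
    # so no LabelEncoder step is needed; this mirrors
    # CategoricalEncoder(encoding='onehot-dense', handle_unknown='ignore')
    encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
    dir_viento_encoded = encoder.fit_transform(direccion_viento[['Direccion del viento (Pos)']])
    print(encoder.categories_)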

The pull request named “Rethinking the CategoricalEncoder API?” also describes the workflow for deprecating CategoricalEncoder().

Then, according to the above, I applied OrdinalEncoder, and the result I get is the same as when I applied LabelEncoder alone:

    from __future__ import unicode_literals
    # from .future_encoders import OrdinalEncoder
    from sklearn.preprocessing import OrdinalEncoder
    import pandas as pd

    # Read the dataset
    direccion_viento = pd.read_csv('Direccion del viento.csv')

    # Check that there are no null values
    print(direccion_viento.isnull().any())
    direccion_viento.isnull().values.any()

    # Select only the Direccion del viento (Pos) column
    direccion_viento = direccion_viento[['Direccion del viento (Pos)']]
    print(direccion_viento.head(10))

    ordinal_encoder = OrdinalEncoder()
    direccion_viento_cat_encoded = ordinal_encoder.fit_transform(direccion_viento)

And I get this array, which is a similar result to the one I got with LabelEncoder():

[image: the resulting array of ordinal codes]

What is the difference between OrdinalEncoder and LabelEncoder, taking these descriptions as a reference?

LabelEncoder() to simply encode the values into numbers according to how many categories I have. Label encoding is simply converting each value in a column to a number.

and

OrdinalEncoder: Encode categorical features as an integer array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.
The features are converted to ordinal integers.
This results in a single column of integers (0 to n_categories - 1) per feature.
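A minimal sketch illustrating the difference in expected input (the sample labels are made up):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    labels = np.array(['S', 'SO', 'N', 'S'])

    # LabelEncoder is meant for the target y: it takes a 1-D array
    # and returns a 1-D array
    le = LabelEncoder()
    print(le.fit_transform(labels))                 # [1 2 0 1], shape (4,)

    # OrdinalEncoder is meant for the input features X: it takes a 2-D
    # array (n_samples, n_features) and can encode several columns at once
    oe = OrdinalEncoder()
    print(oe.fit_transform(labels.reshape(-1, 1)))  # shape (4, 1)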

Should I choose the dataset created by applying the one-hot encoding technique, or the one created by applying the OrdinalEncoder technique? Which is the most appropriate?

Somebody told me the following before:

Depends on what you plan to do with the data. There are various ways to work with categorical variables. You need to pick the most appropriate one for the model/situation you are working on, by investigating if the approach you are taking is right for the model you are using.

I will work with models like clustering, linear regression, and neural networks.

How can I know whether OrdinalEncoder or OneHotEncoder is the most appropriate?


#4

Ordinal features can be understood as categorical values that can be sorted or ordered, but my direccion_viento values ('E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO') do not have any order; no value is greater than or less than another.
So it would not make sense to consider them as ordinal in nature, right?


#5

Is this the same as what is called mean encoding here?

This link talks about the “target mean” concept.


#6

It is not ordinal. It should be treated as a simple categorical variable.


#7

Using LabelEncoder and OneHotEncoder as in the example from my original question?


#8

You can use OneHotEncoder in this case.
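For models like clustering, linear regression, and neural networks, an integer code would impose an artificial order and distance between the directions (for example, it would make 'SO' look greater than 'N'), while one-hot columns keep the categories independent. A minimal sketch, assuming scikit-learn >= 0.20:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Made-up sample of wind directions
    X = np.array([['S'], ['SO'], ['N'], ['NO']])

    # Each direction becomes its own 0/1 column; no direction is
    # "greater" than another
    encoder = OneHotEncoder(sparse=False)
    print(encoder.fit_transform(X))
    print(encoder.categories_)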