Encoding Categorical Attributes

machine_learning
data_science

#1

What if there are already many columns, and the categorical columns each have around 30 categories? Is it a good idea to use one-hot encoding, given that it will lead to too many columns?

What is the best way to encode in such cases?


#2

Hi,

There is a question you need to ask: does the categorical variable have an inherent order or not? If there is an order, you are better off label encoding it, as this preserves the order (provided the integer codes are assigned in that order). However, if there is no order, you should go with one-hot encoding. If the dataset is large enough for memory to become a concern, you could store the encoded columns as a sparse matrix.
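As a rough sketch of both cases, the snippet below uses a made-up dataframe (the column names `size` and `city` and the order in `size_order` are assumptions for illustration only). The ordinal column gets integer codes in an explicitly stated order, and the nominal column is one-hot encoded into sparse columns via the `sparse=True` option of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy data: "size" has a natural order, "city" does not.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Pune", "Delhi", "Mumbai", "Delhi"],
})

# Ordinal column: state the order explicitly so the integer codes follow it.
size_order = ["small", "medium", "large"]  # assumed order, for illustration
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# Nominal column: one-hot encode, stored as sparse columns to save memory.
city_dummies = pd.get_dummies(df["city"], prefix="city", sparse=True)
```

Without the explicit `categories` list, pandas would sort the labels alphabetically, which would not match the intended small < medium < large order.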

Regards
Ankit


#3

Hi there!

First of all you need an automated way of spotting the categorical features instead of tracking them yourself. One way to do this is with the following code:

cat_feat = []
for f in df.columns:
    # string columns are stored with the 'object' dtype
    if df[f].dtype == 'object':
        cat_feat.append(f)
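The same list can probably be built more compactly with `DataFrame.select_dtypes`. This is only a sketch on a made-up dataframe (the column names are hypothetical), and it assumes strings are stored with the `object` dtype, as the loop above does. It also picks up columns that already have the `category` dtype.

```python
import pandas as pd

# Hypothetical toy data; "city" is forced to object dtype to match the
# assumption that string columns are stored as 'object'.
df = pd.DataFrame({
    "city": pd.Series(["Pune", "Delhi"], dtype=object),
    "grade": pd.Categorical(["a", "b"]),
    "age": [31, 45],
})

# Select every column whose dtype is object or category.
cat_feat = df.select_dtypes(include=["object", "category"]).columns.tolist()
```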

That will provide you with a list of all categorical (string) variables. Then you must decide what kind of encoding you need. For ONE-HOT I would suggest:

df = pd.get_dummies(df, columns = cat_feat)
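Before running this on a wide dataset, it may help to estimate how many columns it will create. The sketch below uses a small made-up dataframe (all names are hypothetical) and counts the distinct values per categorical column, since each distinct value becomes one new column.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "shape": ["circle", "square", "circle", "circle"],
    "weight": [1.0, 2.5, 0.7, 1.2],
})
cat_feat = ["color", "shape"]

# Each categorical column contributes one new column per distinct value.
n_new_cols = int(df[cat_feat].nunique().sum())

# The original categorical columns are replaced by their dummy columns.
encoded = pd.get_dummies(df, columns=cat_feat)
```

Here `color` has 3 distinct values and `shape` has 2, so 5 dummy columns are added next to `weight`. With several columns of around 30 categories each, the same count shows quickly whether the result is still manageable.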

If you think this will turn your initial dataframe into a very large one and you will run out of memory, you can also label encode your features (not as good a solution as ONE-HOT though, because the integer codes suggest an order that nominal categories do not have):

for col in cat_feat:
    # object columns must be converted to 'category' before .cat is available
    df[col] = df[col].astype('category').cat.codes

Make sure the code above is properly indented when you copy it.
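To see what the label encoding actually produces, here is a minimal sketch on a made-up column (the name `color` is hypothetical). It also keeps the code-to-label mapping, which is otherwise lost once the column is overwritten with integers.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Convert to the category dtype; categories are sorted alphabetically.
as_cat = df["color"].astype("category")
df["color_code"] = as_cat.cat.codes

# Keep the mapping so the integer codes can be translated back later.
code_to_label = dict(enumerate(as_cat.cat.categories))
```

The codes follow the alphabetical order of the labels (blue, green, red), which is arbitrary for a nominal feature.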


#4

The categorical attributes are nominal, so label encoding is just not the solution.