Python variable data type understanding

I have an object-type column named gender containing 3 distinct values: male, female and other.
When I feed this column to my linear regression, it throws an error: "ValueError: could not convert string to float: 'other'"

I'm new to Python, so I just want to ask: does Python need all columns to be in numeric (float64) format? Can't it handle object or category type variables? If not, why do these data types exist? Earlier I was working in R, where I used to feed in multiple types of variables, including numeric and categorical, and it treated them accordingly without an error. So I'm confused about whether I should look into my code or whether this error is caused by the variable type alone.

Thanks in Advance

Hi @hemantsain55, yes, you need to convert the data. You can do this by changing the datatype to category and using the category codes (.cat.codes) as a feature, or you can try pd.get_dummies to convert categorical data to numeric.
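A minimal sketch of the category-codes route mentioned above, assuming a toy DataFrame with a gender column (the data and the gender_code column name are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the real column.
df = pd.DataFrame({"gender": ["male", "female", "other", "female"]})

# Convert the object column to the pandas category dtype.
df["gender"] = df["gender"].astype("category")

# .cat.codes gives one integer per category; string categories are
# numbered in sorted order (female=0, male=1, other=2).
df["gender_code"] = df["gender"].cat.codes

print(df)
```

Note that the codes are only labels; a linear model will still read them as ordered numbers, which is why dummy variables are often preferred for unordered categories.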

Below is the link which might help a little more :

Thanks a lot for your input @palbha, but the data I'm working on has 4 variables with more than 200 levels each. If I perform one-hot encoding, the data file will be too large for the system to handle.
I'm thinking of performing label encoding instead, but I'm not sure: if I apply label encoding, will it assign the same number to the same category in both the training and testing datasets?

Hi @hemantsain55, can we group those 200 levels into some subgroups?
Also, do we have plenty of data for each and every level?
If we fit a label encoder on the training data and reuse that same fitted encoder on the test data, each category gets the same number in both.
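Both suggestions can be sketched together. This is only an illustration under assumptions: the toy series, the min_count threshold and the "other" catch-all label are made up, and the grouping rule (keep levels seen often enough in training) is one possible choice, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for a high-cardinality column.
train = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="level")
test = pd.Series(["b", "a", "z", "c"], name="level")

# Keep only levels seen at least min_count times in training
# (the threshold is an assumption chosen for this example).
min_count = 2
counts = train.value_counts()
keep = set(counts[counts >= min_count].index)


def group_rare(s):
    # Replace anything outside the kept set (rare or unseen) with "other".
    return s.where(s.isin(keep), "other")


train_grouped = group_rare(train)
test_grouped = group_rare(test)

# Fit the encoder on training data only, then reuse it on the test data
# so every category maps to the same integer in both sets.
le = LabelEncoder()
train_codes = le.fit_transform(train_grouped)
test_codes = le.transform(test_grouped)

print(list(le.classes_))
print(list(train_codes))
print(list(test_codes))
```

Calling le.transform on a label the encoder never saw raises a ValueError, so the catch-all only helps if "other" actually occurs in the training data, as it does here.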

If you have categorical data, you can create dummy variables with 0/1 values for each possible value.
E.g.:

idx color
0   blue
1   green
2   green
3   red

This can easily be done with pandas:

import pandas as pd

data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
# dtype=int keeps the 0/1 output; recent pandas versions default to True/False
print(pd.get_dummies(data, dtype=int))

will result in:

   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
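The same dummy-variable idea can also be sketched with scikit-learn's OneHotEncoder, which returns a sparse matrix by default and so may ease the memory concern raised earlier for columns with many levels. The toy data and target values below are assumptions made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the target values are invented.
X_train = pd.DataFrame(
    {"gender": ["male", "female", "other", "female", "male", "other"]}
)
y_train = [10.0, 20.0, 30.0, 20.0, 10.0, 30.0]
X_test = pd.DataFrame({"gender": ["female", "other"]})

# handle_unknown="ignore" encodes levels not seen during fit as all zeros
# instead of raising an error at prediction time.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LinearRegression(),
)
model.fit(X_train, y_train)

# The encoder learned during fit is reused automatically on the test data.
predictions = model.predict(X_test)
print(predictions)
```

Wrapping the encoder and the regression in one pipeline means the training-time categories are applied unchanged to any later data.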

First, you can convert the categorical column to numeric with LabelEncoder():

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Convert the column to numeric:

data["gender"] = le.fit_transform(data["gender"])

and then you can use Linear Regression on it.
