Changing categorical into numerical variables


#1

I am trying to change categorical variables into numerical ones, but I am getting this warning:

FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

The code is:

train['Married'][train['Married']=="No"]=0
train['Married'][train['Married']=="Yes"]=1
train['Gender'][train['Gender']=="Male"]=1
train['Gender'][train['Gender']=="Female"]=0
train['Dependents'][train['Dependents']==0]=0
train['Dependents'][train['Dependents']==1]=1
train['Dependents'][train['Dependents']==2]=2
train.Dependents[train.Dependents>=3]=3
train['Education'][train['Education']=="Graduate"]=1
train['Education'][train['Education']=="Not Graduate"]=0
train['Self_Employed'][train['Self_Employed']=="Yes"]=1
train['Self_Employed'][train['Self_Employed']=="No"]=0
train['Property_Area'][train['Property_Area']=="Rural"]=1
train['Property_Area'][train['Property_Area']=="Semiurban"]=2
train['Property_Area'][train['Property_Area']=="Urban"]=3

How can I correct it?


#2

Hey @ASHISH_17,
Rather than using such an ad hoc approach, consider one of the more standard ways.

For example, to encode the Married column:

from sklearn.preprocessing import LabelEncoder
l = LabelEncoder()
l.fit(train.Married)
# Assign the encoded array directly so it lines up with train's own index
train["Married"] = l.transform(train.Married)

You may also refer to this AV Article.
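
To see what the encoder is doing, here is a minimal sketch on a made-up Married column (the toy values are an assumption, not the real training data). LabelEncoder sorts the distinct labels and numbers them in that order, so "No" becomes 0 and "Yes" becomes 1; classes_ records the mapping and inverse_transform reverses it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for train["Married"]; the values are made up for illustration.
married = pd.Series(["Yes", "No", "Yes", "Yes", "No"])

encoder = LabelEncoder()
codes = encoder.fit_transform(married)  # learns the labels, returns integer codes

# Labels in sorted order; the position of a label is its code.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is what makes it possible to translate predictions back into the original labels later.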

Hope that this helped,
Pavleen


#3

@ASHISH_17,
If you want to stick with the ad hoc approach in pandas, you will have to modify your code a little. For example:
I take it that you want to denote "Yes" as 1 and "No" as 0 in the Married column. You can do it with:

train["Married"] = train["Married"].apply(lambda x: 1 if x=="Yes" else 0)

You are basically looping through all values of the "Married" column using the .apply(…) function of pandas. For each value x, you check whether x is "Yes"; if it is, you set the value to 1, otherwise you set it to 0.

Sometimes this approach is a great time saver, though in this case it is better to try one of the ways @pavscorp1911 mentioned.
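
As a related sketch, Series.map with an explicit dictionary does the same job while keeping the whole mapping visible in one place. Unlike the lambda above, which turns anything that is not "Yes" (including missing values) into 0, map leaves unmapped entries as NaN, so they can be filled deliberately afterwards. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
train = pd.DataFrame({
    "Married": ["Yes", "No", None, "Yes"],
    "Property_Area": ["Rural", "Urban", "Semiurban", "Urban"],
})

married_codes = {"No": 0, "Yes": 1}
area_codes = {"Rural": 1, "Semiurban": 2, "Urban": 3}

# map() looks each value up in the dictionary; anything not found becomes NaN.
train["Married"] = train["Married"].map(married_codes)
train["Property_Area"] = train["Property_Area"].map(area_codes)

print(train)
print(train["Married"].isnull().sum())  # how many rows still need a value
```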

Thanks.
Sanad


#4

Thanks again, it worked.

But I am now facing a new kind of problem.
I keep getting this error even though I have kept only the Credit_History column, which contains binary values, just to keep things simple and see how regression works.

Error -
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').


#5

Thanks, I tried using this method as well.


#6

Well, the answer to this would lie in the code you are using to fill in the missing values. Can you share that here please?
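
In the meantime, a quick check like the following sketch can show which feature columns still hold missing or infinite values. The toy frame and its column names are made up; on the real data the same checks would be run on the columns passed to the model. Note that np.log gives -inf for an input of zero and NaN for a negative one, so log-transformed columns are worth checking too.

```python
import numpy as np
import pandas as pd

# Toy features; one NaN and one infinity planted on purpose.
features = pd.DataFrame({
    "Credit_History": [1.0, 0.0, np.nan, 1.0],
    "LoanAmount_Log": [4.8, np.inf, 5.1, 4.9],
})

# Missing values per column.
nan_counts = features.isnull().sum()

# Positive or negative infinity per column.
inf_counts = pd.Series(np.isinf(features.values).sum(axis=0),
                       index=features.columns)

print(nan_counts)
print(inf_counts)
```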


#7

Sure

Below is the code:
# Imports used by the code below
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

# Cleaning the train dataset
train['Gender']=train['Gender'].fillna('Male')
train['Gender'].notnull().value_counts()
train['Married']=train['Married'].fillna('Yes')
train['Married'].notnull().value_counts()
train['Self_Employed']=train['Self_Employed'].fillna('No')
train['Self_Employed'].notnull().value_counts()
train['Credit_History']=train['Credit_History'].fillna(1)
train['Credit_History'].notnull().value_counts()
train['Dependents']=train['Dependents'].fillna(0)
train['Dependents'].notnull().value_counts()
train['LoanAmount']=train['LoanAmount'].fillna(np.mean(train.LoanAmount))
train['LoanAmount_Log']=np.log(train['LoanAmount'])
train['Loan_Amount_Term']=train['Loan_Amount_Term'].fillna(360)
train['Loan_Amount_Term_Log']=np.log(train['Loan_Amount_Term'])
train['TotalIncome']=train['LoanAmount_Log']+train['Loan_Amount_Term_Log']

# Cleaning the test dataset
train['Gender']=train['Gender'].fillna('Male')
train['Gender'].notnull().value_counts()
train['Married']=train['Married'].fillna('Yes')
train['Married'].notnull().value_counts()
train['Self_Employed']=train['Self_Employed'].fillna('No')
train['Self_Employed'].notnull().value_counts()
train['Credit_History']=train['Credit_History'].fillna(1)
train['Credit_History'].value_counts()
train['Dependents']=train['Dependents'].fillna(0)
train['Dependents'].notnull().value_counts()
train['LoanAmount']=train['LoanAmount'].fillna(np.mean(train.LoanAmount))
train['LoanAmount_Log']=np.log(train['LoanAmount'])
train['Loan_Amount_Term']=train['Loan_Amount_Term'].fillna(360)
train['Loan_Amount_Term_Log']=np.log(train['Loan_Amount_Term'])
train['TotalIncome']=train['LoanAmount_Log']+train['Loan_Amount_Term_Log']

# Converting categorical to numerical variables of train data
var_mod = ['Gender','Married','Education','Self_Employed','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    train[i] = le.fit_transform(train[i])
print(train.dtypes)

# Converting categorical to numerical variables of test data
var_mod = ['Gender','Married','Education','Self_Employed']
le = LabelEncoder()
for i in var_mod:
    test[i] = le.fit_transform(test[i])
print(test.dtypes)

# Splitting the data into features(X) and targets(y)
X_train=train[['Gender','Married','Education','Self_Employed','Credit_History','TotalIncome']].values
y_train=train['Loan_Status'].values

X_test=test[['Gender','Married','Education','Self_Employed','Credit_History','TotalIncome']].values

# logistic regression model
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
accuracy=logreg.predict(X_test)
print(accuracy)

solution=pd.DataFrame({'Loan_Status':np.array(accuracy)},index=test.Loan_ID)
print(solution)


#8

Hey @ASHISH_17,

Looking at the code, it seems that you made a typo and haven't actually cleaned the test data: the block labelled as cleaning the test dataset repeats the same computations on the training set. So do take care of that.
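
One way to avoid this kind of slip is to put the cleaning steps in a single function and call it once per dataset. The sketch below is an assumption about how that could look, not your exact code: the function name clean and the toy frames are hypothetical, and only a few of your fill rules are reproduced. The mean used for LoanAmount is computed on the training data and passed in, so both sets are filled with the same number.

```python
import numpy as np
import pandas as pd

def clean(df, loan_amount_fill):
    """Return a copy of df with missing values filled (hypothetical helper)."""
    df = df.copy()
    df["Gender"] = df["Gender"].fillna("Male")
    df["Married"] = df["Married"].fillna("Yes")
    df["Credit_History"] = df["Credit_History"].fillna(1)
    df["LoanAmount"] = df["LoanAmount"].fillna(loan_amount_fill)
    return df

# Toy data, made up for illustration.
train = pd.DataFrame({"Gender": ["Male", None], "Married": ["No", "Yes"],
                      "Credit_History": [1.0, np.nan],
                      "LoanAmount": [100.0, 140.0]})
test = pd.DataFrame({"Gender": [None, "Female"], "Married": [None, "No"],
                     "Credit_History": [np.nan, 0.0],
                     "LoanAmount": [np.nan, 90.0]})

loan_mean = train["LoanAmount"].mean()  # statistic taken from train only
train = clean(train, loan_mean)
test = clean(test, loan_mean)

# Number of missing values left in the test data after cleaning.
print(test.isnull().sum().sum())
```

Because both calls go through the same function, a train/test mix-up can only happen in the two lines that call it.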

Hope it works,
Pavleen


#9

I made a typing mistake.
Thanks for correcting it.

I have successfully progressed to the modelling part, using LabelEncoder and filling in the missing values as well, but the LoanAmount column still comes out as object dtype when it should be float.
Can you have a look at the code below?

table=train.pivot_table(index=['Gender','Education'],values='LoanAmount',aggfunc=np.median)
def fill(x):
    if pd.isnull(x['LoanAmount']):
        return table.loc[x['Gender'],x['Education']]
    else:
        return x['LoanAmount']
train['LoanAmount']=train.apply(lambda x:format(fill(x)),axis=1)
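
A possible explanation, sketched on made-up data rather than confirmed on the real dataset: the built-in format() returns a string, so wrapping the result of fill(x) in it stores text in every row, and pandas typically reports a column of Python strings as object. The sketch below first shows that effect, then one alternative that keeps the column numeric by filling each gap with the median of its Gender/Education group via groupby(...).transform. This swaps in a different technique from the pivot table above, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
train = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Education": ["Graduate"] * 5,
    "LoanAmount": [100.0, 140.0, np.nan, 80.0, np.nan],
})

# format() turns each number into text, so the result is not a float column.
as_text = train["LoanAmount"].apply(lambda v: format(v))
print(type(as_text.iloc[0]))

# Median LoanAmount of each row's (Gender, Education) group,
# returned as a Series aligned to train's index.
group_median = (train.groupby(["Gender", "Education"])["LoanAmount"]
                .transform("median"))

# fillna only touches the missing rows, so the column stays float.
train["LoanAmount"] = train["LoanAmount"].fillna(group_median)

print(train["LoanAmount"].dtype)
print(train["LoanAmount"].tolist())
```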