Showing error in categorical missing value treatment

missing_values
scipy

#1

Can anyone help me with this code? I am trying to fill the missing values in a categorical column with it, and it is showing an error.

my code:
from scipy.stats import mode
mode(train['Workclass']).mode[0]

This is my error:
C:\Users\SKHK634\Anaconda3\lib\site-packages\scipy\stats\stats.py:257: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored.
"values. nan values will be ignored.", RuntimeWarning)

Can anyone help me with this?


#2

Hi @shankarj67,

That's a warning, not an error. Your code will probably run fine in spite of it.

Also, you would be better off using pandas’s inbuilt function “fillna” for imputing missing values.

The code would look like

train.loc[ : , 'Workclass'].fillna(train.Workclass.mode[0], inplace=True)

#3

Thanks for your answer, but do you have any idea why it is showing:

TypeError: 'method' object is not subscriptable


#4

Is this for the previously asked question, or is it a new question? If they are not related, you should probably post it as a new thread.

Answering your question,

In Python, subscriptable objects are those that implement the __getitem__() method, which is what lets you index them with square brackets, as in obj[0]. Examples of subscriptable objects are lists, tuples, dicts and strings. Sets are not subscriptable.

So your problem is most likely that you are applying [...] to a method itself instead of to the value it returns, for example because the call parentheses after the method name are missing.
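To make this concrete for the code in this thread, here is a minimal sketch. The small `train` data frame below is a made-up stand-in for the real data, so treat the values as assumptions. It shows that indexing `mode` without calling it raises the error from post #3, while `mode()[0]` works. It also assigns the filled column back rather than relying on `inplace=True`.

```python
import pandas as pd

# Hypothetical stand-in for the thread's `train` data frame.
train = pd.DataFrame(
    {"Workclass": ["Private", None, "Private", "State-gov", None]}
)

# Indexing the method itself (no parentheses) reproduces the TypeError.
try:
    train["Workclass"].mode[0]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Calling the method returns a Series of modes, which can be indexed.
most_common = train["Workclass"].mode()[0]

# Assign the filled column back to the data frame.
train["Workclass"] = train["Workclass"].fillna(most_common)
```

Series.mode() skips missing values by default, so no separate dropna() step is needed here.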


#5

Hi @jalFaizy,

I am experiencing @shankarj67's problem in imputing missing values.

As you correctly wrote, that is a warning, not an error. However, I get an error followed by that warning every time I impute a column with some NaN values.

Now, the problem I face is the following:

  • I have tried to replace the NaN values with ' '. This is the piece of code:

data = pd.read_csv("/path/train.csv", na_filter=False, index_col="Id")

  • When I check the number of empty values per column, I get 0. This is the code:

      #Create a new function:
      def num_missing(x):
        return sum(x.isnull())
    
      #Applying per column:
      print("Missing values per column:")
      print(data.apply(num_missing, axis=0))
    

This is the output:

Missing values per column:
Age               0
Workclass         0
Education         0
Marital.Status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Hours.Per.Week    0
Native.Country    0
Income.Group      0
dtype: int64

As you can see, they are all zero now. And they weren’t before using na_filter of course.

  • NaN has been replaced BUT it seems that there are no more “empty cells”. Unfortunately, if I check the csv file, I still have missing values where once there were NaN. Indeed, if I write

data['Workclass'].fillna(mode(data['Workclass']).mode[0], inplace=True)

This won’t change the csv file because the modified version apparently hasn’t got any missing values!

I can’t understand how to clean the csv getting rid of the NaN and leaving blank space so that this line of code

data.apply(num_missing, axis=0)

can return the real amount of missing value for each column.

Many Thanks for your help.


#6

@GdC,

  1. Regarding missing values per column, you don't need to write a separate function; you can simply do this:

    df.isnull().sum()
    

  2. After you successfully make the changes in your data frame (like imputing missing values), you need to write them back to the csv for the csv file to be updated. For example, if your data frame is named “df”, you would do something like this:

    df.to_csv("/path/train.csv")
    

This will overwrite the previous csv with the updated one.
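Both steps can be sketched together as below. This is only an illustration: the three-row data frame and the temporary file are assumptions standing in for the real train.csv, and the imputation uses the pandas mode() method.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data frame standing in for the real training data.
df = pd.DataFrame(
    {"Id": [1, 2, 3], "Workclass": ["Private", None, "Private"]}
).set_index("Id")

# Count missing values per column without a helper function.
missing_before = df.isnull().sum()

# Impute the missing entries with the most frequent value.
df["Workclass"] = df["Workclass"].fillna(df["Workclass"].mode()[0])

# Write to a temporary csv, then read it back to confirm the change persisted.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "train.csv")
    df.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, index_col="Id")

missing_after = reloaded.isnull().sum()
```

The csv on disk only changes when to_csv is called; editing the data frame in memory never touches the file.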

Hope this helps,
Sanad.


#7

@mohdsanadzakirizvi

Thank you so much for your reply. It helps a lot!

However, the issue I had was mainly about the changes in the data frame.

I have managed to fix the problem in this way:

  • Removing the filter: data = pd.read_csv("path/train.csv",index_col="Id")

  • Imputing with dropna(), for example: data['Workclass'].fillna(mode(data['Workclass'].dropna()).mode[0], inplace=True)
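For anyone hitting the same thing, here is a small sketch of why removing the filter matters. The inline csv text is made up for illustration. With na_filter=False, pandas reads empty cells as empty strings, so isnull() finds nothing and fillna() has nothing to fill. This sketch uses the pandas mode() method instead of scipy, which is an assumption on my part and not the exact code above.

```python
import io

import pandas as pd

# Made-up csv content with one empty Workclass cell.
csv_text = "Id,Age,Workclass\n1,39,Private\n2,50,\n3,38,Private\n"

# With na_filter=False the empty cell is kept as an empty string.
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False, index_col="Id")
nulls_unfiltered = int(unfiltered["Workclass"].isnull().sum())
empty_strings = int((unfiltered["Workclass"] == "").sum())

# With the default settings the empty cell becomes NaN and can be imputed.
filtered = pd.read_csv(io.StringIO(csv_text), index_col="Id")
nulls_filtered = int(filtered["Workclass"].isnull().sum())

filtered["Workclass"] = filtered["Workclass"].fillna(
    filtered["Workclass"].mode()[0]
)
nulls_after_fill = int(filtered["Workclass"].isnull().sum())
```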

Thank you for your reply!

Giovanni


#8

train['Workclass'].fillna(train['Workclass'].mode()[0], inplace=True)

Try the above code; it should work.


#9

I am using this code to find the mode of the values.
It runs properly for every attribute except those with missing values:

    from scipy.stats import mode
    mode(train['Workclass']).mode[0]

but instead it is showing this error


#10

Can anyone help me out, please?


#11

The null values are represented by numpy.nan, which is of float type.
While sorting (inside the stats.mode() call), the comparator is unable to compare 'float' with 'str'.

Try this:
mode(train['Workclass'].astype('str')).mode[0]

It converts the ‘Workclass’ data type from object to string before calculating the mode.
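The comparison problem can be reproduced without scipy. The sketch below uses a made-up list of values as an assumption about what the column looks like. It also shows one caveat: once everything is a string, a missing value turns into the text 'nan' and is counted like any other category. The pandas mode() method is shown as an alternative that skips missing values.

```python
import pandas as pd

# Hypothetical column contents: strings mixed with a float NaN.
values = ["Private", float("nan"), "Private", "State-gov"]

# Sorting mixed str and float values fails, as described above.
try:
    sorted(values)
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)

# After converting to strings the sort works, but NaN is now the text 'nan'.
as_strings = [str(v) for v in values]
sorted_strings = sorted(as_strings)

# pandas skips missing values by default when computing the mode.
mode_value = pd.Series(values).mode()[0]
```

If missing values happened to be the most frequent entry, the string-conversion route would report 'nan' as the mode, so it is worth checking the result before using it to fill anything.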