Custom for loop very slow on a DataFrame

python
loop

#1

Hello,
I have a dataframe of 3 lakh (300,000) rows and multiple columns.
Suppose my data is :

ID    PAN    DOB
1     x      a
2     y      b
3     z      c
2     x      d

I want to find rows where the ID is the same but the corresponding PAN or DOB is different, and store that ID and PAN/DOB in a separate dataframe. In the sample above, ID 2 appears twice with different PAN and DOB values, so those rows should be flagged.

I coded the following in Python:

import pandas as pd

clt = pd.read_csv("Sample client1.csv")

ids = clt.loc[:, "ID"]
dob = clt.loc[:, "DOB"]
pan = clt.loc[:, "PAN"]

dataerror = pd.DataFrame(columns=["Error ID", "Error type", "Error value"])   # Empty dataframe to store erroneous data

def IDmatch(ids, dob, pan, i, j):
    global dataerror
    if ids[i] == ids[j]:
        if dob[i] != dob[j] or pan[i] != pan[j]:
            dataerror = dataerror.append({"Error ID": ids[i],
                                          "Error type": "Same ID, different PAN or DOB",
                                          "Error value": pan[i]},
                                         ignore_index=True)

for i in range(0, 300000):
    for j in range(i + 1, 300000):
        IDmatch(ids, dob, pan, i, j)

The above code runs fine on a sample of 10 rows, but it just does not finish on 3 lakh rows. Please suggest what I can do about this.
It would be of great help.


#2

Hi @itijain

Can you tell me what error it shows?


#3

It is not showing any error. I think the code is fine, but the execution time is too long. I waited for 2 hours but it didn't give any output.
Is there a way I can refine my code so that it executes faster?


#4

I am not a Python expert, but when you have a huge data set, a general guideline is to reduce string comparisons and do numerical comparisons instead. You can try using a LabelEncoder, if it works, and then do the match conditions.
PS: this is a thought but not tested.
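
A minimal sketch of that idea, assuming scikit-learn is installed and the column names from the first post; the "_code" column names are just illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

clt = pd.read_csv("Sample client1.csv")

# Encode the string columns as integers so later comparisons are numeric
for col in ["PAN", "DOB"]:
    clt[col + "_code"] = LabelEncoder().fit_transform(clt[col].astype(str))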


#5

Hi @itijain

You can try to find out the IDs which do not repeat, and remove those rows. This should reduce the size of the dataset. (If almost all IDs are repeated, this certainly will fail.)
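
A minimal sketch of that filtering step, assuming the dataframe from the first post:

import pandas as pd

clt = pd.read_csv("Sample client1.csv")

# IDs that appear only once cannot conflict with anything, so keep
# only the rows whose ID occurs more than once
repeated = clt[clt["ID"].duplicated(keep=False)]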

Another possible solution:
For each ID, locate the rows where it is present and check PAN and DOB for only those rows. So for example, you find out that ID=2 is in row 2, 4 then you can check PAN and DOB values for these rows.

So right now your code compares every single ID to 3 lakh IDs. But if you know the row numbers where the ID repeats, you're not comparing the IDs anymore, just the PAN and DOB, for a limited set of rows.
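
One way to express that per-ID check is with a pandas groupby; a sketch, continuing from the repeated frame above:

# Within each ID, keep the group only if PAN or DOB is not constant
def has_conflict(group):
    return group["PAN"].nunique() > 1 or group["DOB"].nunique() > 1

dataerror = repeated.groupby("ID").filter(has_conflict)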


#6

hi,

You can use the approach below.

1. remove duplicate rows (where all column values are repeating)

data = data.drop_duplicates(keep='first')

2. identify and copy rows where the ID column is the same but PAN / DOB are different

dataerror = data.loc[data.ID.duplicated(keep='first')]
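
Putting the two steps together on the sample file from the first post; a sketch, noting that keep=False in step 2 would also retain the first occurrence of each conflicting ID:

import pandas as pd

data = pd.read_csv("Sample client1.csv")

# Step 1: drop exact duplicates, so any ID that still repeats
# must differ in PAN or DOB
data = data.drop_duplicates(keep='first')

# Step 2: rows whose ID still repeats are the conflicting ones;
# keep=False marks every occurrence, not just the later ones
dataerror = data.loc[data["ID"].duplicated(keep=False)]
print(dataerror)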