How to compare rows of data

python

#1

For the Black Friday case, I am assuming that the gender, age, occupation, …, and marital status are the same in every row for a given User_ID.

How can I go about verifying this hypothesis?

Here’s what I mean:
How can I verify that the Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, and Marital_Status are F, 0-17, 10, A, 2 and 0 respectively in every row where the User_ID is 1000001?

Thank you.


#2

Hi @fehsuccess,

You will have to create a for loop that compares the columns for every set of User_ID. Here is a basic approach.

  1. Take two variables i and j. Suppose i has the User_ID at index 0 and j has User_ID at index 1.
  2. Compare i and j.
  3. When i and j are equal, compare the six columns. If they are the same, move to the next index; if not, print the index value.
  4. When i and j are not equal, move to the next index.
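
The steps above could be sketched roughly as follows. This is only an illustration on a small made-up DataFrame standing in for the real `train` data. The sort by User_ID is an added assumption, since comparing neighbouring rows only works when a user's rows sit next to each other.

```python
import pandas as pd

# Toy stand-in for the Black Friday training data (values are made up).
train = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000002],
    "Gender": ["F", "F", "M", "M"],
    "Age": ["0-17", "0-17", "55+", "26-35"],
    "Occupation": [10, 10, 16, 16],
    "City_Category": ["A", "A", "C", "C"],
    "Stay_In_Current_City_Years": ["2", "2", "4+", "4+"],
    "Marital_Status": [0, 0, 0, 0],
})

cols = ["Gender", "Age", "Occupation", "City_Category",
        "Stay_In_Current_City_Years", "Marital_Status"]

# Assumption: sort so that all rows of one user are adjacent.
df = train.sort_values("User_ID", kind="stable").reset_index(drop=True)

mismatched_rows = []
for idx in range(len(df) - 1):
    i = df.loc[idx, "User_ID"]       # User_ID at the current index
    j = df.loc[idx + 1, "User_ID"]   # User_ID at the next index
    if i == j:
        # Same user: the six columns should agree between the two rows.
        same = (df.loc[idx, cols] == df.loc[idx + 1, cols]).all()
        if not same:
            mismatched_rows.append(idx + 1)
    # Different users: nothing to compare, move on.

print(mismatched_rows)
```

Here the second row of user 1000002 has a different Age, so its (sorted) index is the one reported.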

PS: The Black Friday dataset has a large number of rows and columns, so this iteration will take a lot of time (unless you have good computational power). If you can optimize the loop, do share your approach.


#3

Thanks for your response, AishwaryaSingh

Right! I’m definitely not going to do it for all the IDs – that’ll take too much time.
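
For a single ID, such as 1000001 from the first post, a check might look something like the sketch below. The DataFrame is a small made-up stand-in for `train`, and the `expected` dictionary is a hypothetical helper. It also assumes that Stay_In_Current_City_Years is stored as text and that Occupation and Marital_Status are stored as integers, so adjust the values to the actual dtypes.

```python
import pandas as pd

# Toy stand-in for the Black Friday training data (values are made up).
train = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002],
    "Gender": ["F", "F", "M"],
    "Age": ["0-17", "0-17", "55+"],
    "Occupation": [10, 10, 16],
    "City_Category": ["A", "A", "C"],
    "Stay_In_Current_City_Years": ["2", "2", "4+"],
    "Marital_Status": [0, 0, 0],
})

# Values we expect in every row of this user (hypothetical helper dict).
expected = {
    "Gender": "F",
    "Age": "0-17",
    "Occupation": 10,
    "City_Category": "A",
    "Stay_In_Current_City_Years": "2",
    "Marital_Status": 0,
}

user_rows = train.loc[train["User_ID"] == 1000001]

# Start with "every row matches" and narrow it down column by column.
matches = pd.Series(True, index=user_rows.index)
for col, value in expected.items():
    matches = matches & (user_rows[col] == value)

print(bool(matches.all()))       # whether every row of this user matches
print(user_rows.loc[~matches])   # rows that deviate, if any
```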


#4

I finally figured it out.

Basically, I used the .nunique() method to count the unique values of Gender, Age, Occupation, etc. for each User_ID, converted the result to a list, and used that for my comparisons. See the code below:

# train is assumed to be the Black Friday training DataFrame, already loaded with pandas

# unique User_IDs, in order of first appearance
list_of_IDs = train['User_ID'].unique().tolist()

# columns from User_ID through Marital_Status
needed_columns = train.loc[:, 'User_ID':'Marital_Status']

# IDs whose rows disagree in at least one column
likely_erratic = []

for ID in list_of_IDs:
    # number of distinct values in each column for this user
    a = needed_columns.loc[needed_columns.User_ID == ID, :].nunique().values.tolist()
    # a consistent user has one distinct value per column, so the counts add up
    # to 7 (this assumes the slice holds exactly seven columns)
    if sum(a) != 7:
        likely_erratic.append(ID)

print(likely_erratic)
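
If the loop over IDs turns out to be slow, one possible alternative is to let pandas do the counting per group in a single call. This is a sketch, not the approach used above, and the DataFrame is a small made-up stand-in for `train`. Listing the six columns explicitly (the `cols` list is my own addition) avoids depending on which columns happen to fall inside the 'User_ID':'Marital_Status' slice.

```python
import pandas as pd

# Toy stand-in for the Black Friday training data (values are made up).
train = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000002],
    "Gender": ["F", "F", "M", "M"],
    "Age": ["0-17", "0-17", "55+", "26-35"],
    "Occupation": [10, 10, 16, 16],
    "City_Category": ["A", "A", "C", "C"],
    "Stay_In_Current_City_Years": ["2", "2", "4+", "4+"],
    "Marital_Status": [0, 0, 0, 0],
})

cols = ["Gender", "Age", "Occupation", "City_Category",
        "Stay_In_Current_City_Years", "Marital_Status"]

# One row per User_ID, one column per attribute, holding the number of
# distinct values that user has for that attribute.
counts = train.groupby("User_ID")[cols].nunique()

# A user is inconsistent if any attribute has more than one distinct value.
inconsistent = counts[(counts > 1).any(axis=1)].index.tolist()

print(inconsistent)
```

Note that nunique() ignores missing values by default, so a user with one real value and some NaNs in a column would still count as consistent here.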

#5

Great approach! :+1: