For the Black Friday case, I am assuming that the gender, age, occupation,…, and marital status are the same in every row for a given User_ID.
How can I go about verifying this hypothesis?
Here’s what I mean:
How can I verify that the Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, and Marital_Status are F, 0-17, 10, A, 2 and 0 respectively in every row where the User_ID is 1000001?
Thank you.

Hi @fehsuccess,
You will have to create a for loop that compares the columns for every set of User_ID. Here is a basic approach.
- Take two variables, i and j. Suppose i holds the User_ID at index 0 and j holds the User_ID at index 1.
- Compare i and j.
- When i and j are equal, compare the columns in question. If they are the same, move to the next index; if not, print the index value.
- When i and j are not equal, move to the next index.
PS: The Black Friday dataset has a large number of rows and columns, so this iteration will take a lot of time unless you have good computational power. If you can optimize the loop, do share your approach.
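The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy DataFrame, the helper name find_mismatched_rows, and the choice to sort by User_ID first (so that each user's rows sit next to each other) are not from the original post.

```python
import pandas as pd

# Hypothetical toy data standing in for the Black Friday train set.
toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000002, 1000003],
    "Gender": ["F", "F", "M", "M", "M"],
    "Age": ["0-17", "0-17", "55+", "26-35", "26-35"],
    "Marital_Status": [0, 0, 0, 0, 1],
})


def find_mismatched_rows(df, id_col, cols):
    """Return index labels of rows that differ from the previous row of the same user."""
    # Sort so that all rows of one user are adjacent; a stable sort keeps
    # the original row order within each user.
    ordered = df.sort_values(id_col, kind="stable")
    # Each tuple is (index label, user id, value of cols[0], value of cols[1], ...).
    rows = list(ordered[[id_col] + cols].itertuples(index=True, name=None))
    mismatched = []
    for prev, curr in zip(rows, rows[1:]):
        same_user = prev[1] == curr[1]
        # Only compare the columns when both rows belong to the same user.
        if same_user and prev[2:] != curr[2:]:
            mismatched.append(curr[0])
    return mismatched


mismatches = find_mismatched_rows(toy, "User_ID", ["Gender", "Age", "Marital_Status"])
print(mismatches)
```

In the toy data only the second row of user 1000002 disagrees with the row before it, so only that row's index is reported. On the full dataset this is still a Python-level loop over every row, so it will be slow.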
Thanks for your response, AishwaryaSingh
Right! I’m definitely not going to do it for all the IDs – that’ll take too much time.
I finally figured it out.
Basically, I used the .nunique function to count how many unique Gender, Age, Occupation, etc. values each User_ID has, converted the result to a list, and used that for my comparisons. See the code below:
# train is assumed to be the Black Friday training DataFrame, already loaded

# sum up list elements
def sum_list(listname):
    sum_of_element = 0
    for element in listname:
        sum_of_element += element
    return sum_of_element

# remove duplicates, keeping first-seen order
def Remove(duplicate):
    final_list = []
    for num in duplicate:
        if num not in final_list:
            final_list.append(num)
    return final_list

list_of_IDs = Remove(train.loc[:, 'User_ID'].values.tolist())
needed_columns = train.loc[:, 'User_ID':'Marital_Status']
likely_erratic = []
for ID in list_of_IDs:
    # number of distinct values in each column for this user
    a = needed_columns.loc[needed_columns.User_ID == ID, :].nunique().values.tolist()
    # assuming the slice holds 7 columns, a consistent user has exactly one
    # distinct value per column, so the counts sum to 7
    if sum_list(a) != 7:
        likely_erratic.append(ID)
print(likely_erratic)
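The same check can probably be done without looping over IDs by letting pandas group the rows. A minimal sketch, assuming a small made-up DataFrame in place of train and an explicit, shortened list of columns to check (both are illustrative, not from the post above):

```python
import pandas as pd

# Hypothetical toy data standing in for the Black Friday train set.
toy = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000002, 1000003],
    "Gender": ["F", "F", "M", "M", "M"],
    "Age": ["0-17", "0-17", "55+", "26-35", "26-35"],
    "Marital_Status": [0, 0, 0, 0, 1],
})

# Columns expected to be constant within a user (shortened for the example).
cols = ["Gender", "Age", "Marital_Status"]

# One row per User_ID, one column per attribute, holding the number of
# distinct values that user has for that attribute.
unique_counts = toy.groupby("User_ID")[cols].nunique()

# A user is inconsistent if any attribute has more than one distinct value.
inconsistent_ids = unique_counts[(unique_counts > 1).any(axis=1)].index.tolist()
print(inconsistent_ids)
```

Naming the columns explicitly also avoids depending on how many columns happen to fall inside a 'User_ID':'Marital_Status' slice, so no hard-coded total is needed.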