R/Python Script needed to assign Cluster ID to each PO records based on multiple conditions across columns

r
machine_learning
python

#1

Dear All,

I am working on a test scenario/procedure for split purchase orders such that I would like to create a flag for cluster of POs in the data tables based on multiple conditions as follows:

  1. For each PO records, to check the records issued by the same Vendor ID within the 14 days window period

  2. Total sum PO value within the 14 days window period by the same Vendor ID should be greater than 15,000 (each PO should be less than 15,000 in this exercise)

  3. Assign a Cluster ID to the series of POs in order for the user to be able to deep dive into the details of these POs to determine whether it is a genuine case of split purchase (non-compliance risk of procurement)

Note that column C-G are the raw data available where I would like to create column A & B based on the above conditions

Raw data attached for reference.Assigning a flag based on conditions.csv (18.3 KB)

Many thanks in advance!

Best,
Eng Xiong


#2

Hi, could you explain more about this 14 day window period? How did you arrive at it?


#3

Hi, thanks for clarifying. For each PO record, i would check the PO creation date with a list of PO records for the next 14 days by the same vendor and sum them cumulatively. Once it hits the 15K threshold, I would assign a cluster ID to the set. I would do the same for each PO records. While I did it manually for a small sample data set, I would like to explore how to do so using a R/Python script.

Hope it clarifies. Thanks!

Best,
Eng Xiong
New to programming


#4

Hi @engxiongster,

Can you please clarify whether you are looking for a hard-coded solution or a machine learning approach?
For example, if window amount of 14 days then… x cluster otherwise y cluster. Or you are looking for a solution through machine learning. In the second case, this would be an unsupervised problem whereas in first its a simple if else condition script. Please let us know your requirement.


#5

Hi Shaz13,

I would prefer to start with a hard-coded solution whereby i am able to assign a cluster ID for the set of data with the rolling 14 days criteria (i.e. 1 PO number can exist in more than 1 cluster ID)

Thanks!

Best,
Eng Xiong


#6

Hello @engxiongster,

I am afraid that we cannot help you with complete scripting as that is out of scope for this discussion forum. Please let us know what you have tried already and share that code perhaps so that we can give suggestions and inferences.


#7

Dear Shaz13,

As i am very new to programming, it took me a while to have an attempt on coding a solution for the problem.

Following is my thought process to the problem (I have managed to break down the problem to focus on batches of 1 DeptVendor and need help to apply across all the DeptVendor):

  1. I check the min and max PO date to determine the # of 14 days batches i will need to create
    2.I will then assign a batch number in the while loop procedure i have created below
    3.I then summarize the total batch amount and count using groupby function
  2. Finally, i filter the result based on the split purchase test conditions such as the batch amt must be greater than 15K and (batch amt > batch count - 2)*15K

Please advise me if there is a more efficient way to create the solution as well as on how to apply it across the different DeptVendor. - attached excel batchtest for reference batchtest.zip (69.3 KB)

Following is the full set of code I have done so far.

import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

df = pd.read_excel("batchtest.xlsx","Sheet1")

# converting PO No to string type from integer
df['PO_No'] = df['PO_No'].astype(str)

# using a subset of entire data to run the script - need help to apply the for each DeptVendor
df_1grp = df[df['DeptVendor'] == "computer1234"]

# start of procedure to create batches of datasets to run split purchase test next by filtering with conditions
from datetime import timedelta
# for each DeptVendor, to find the # of moving 14 days window batches based on min and max dates
minDate = df_1grp['PO_Date'].min()
maxDate = df_1grp['PO_Date'].max()
# setting a date counter for below script to loop
endDate = minDate + timedelta(days=14)
# setting a batch counter to conduct split purchase test later
n=1
# setting temp dataframe for appending data in the while loop
df_final_temp = df_1grp[(df_1grp.PO_Date > maxDate)]
df_final_temp['Batch'] = 0
df_final = df_final_temp

# begin of while loop script to create batches based on moving 14 days window
while minDate < maxDate:
    endDate = minDate + timedelta(days=14)
    df_2grp = df_1grp[(df_1grp.PO_Date >= minDate) & (df_1grp.PO_Date <= endDate)]
    df_2grp['Batch'] = n
    df_final = df_2grp.append(df_final,sort=False)
    if endDate == maxDate:
        break
    minDate = minDate + timedelta(days=1)
    n=n+1

# create a dataframe from the groupby object (batch) from df_final
df_final2 = pd.DataFrame({'BatchAmt' :
df_final.groupby(['Batch'])['Sum PO Amt (SGD)'].sum(),'BatchCount' :
df_final.groupby(['Batch'])['Distinct Count PO'].sum()}).reset_index()

# merge the 2 dataframes - something ilke vlookup on excel
result =df_final.merge(df_final2,on='Batch')

# filter batches which fulfill the split purchase test conditions

filtered_result = result[(result['BatchAmt']>(result['BatchCount']-2)*15000) & (result['BatchAmt']>15000)]

result.to_excel('batch2.xlsx')
filtered_result.to_excel('batch2_filtered.xlsx')

Any help is greatly appreciated. Thanks!

Best,
Eng Xiong