R/Python Script needed to assign Cluster ID to each PO records based on multiple conditions across columns

r
machine_learning
python

#1

Dear All,

I am working on a test scenario/procedure for split purchase orders such that I would like to create a flag for cluster of POs in the data tables based on multiple conditions as follows:

  1. For each PO records, to check the records issued by the same Vendor ID within the 14 days window period

  2. Total sum PO value within the 14 days window period by the same Vendor ID should be greater than 15,000 (each PO should be less than 15,000 in this exercise)

  3. Assign a Cluster ID to the series of POs in order for the user to be able to deep dive into the details of these POs to determine whether it is a genuine case of split purchase (non-compliance risk of procurement)

Note that column C-G are the raw data available where I would like to create column A & B based on the above conditions

Raw data attached for reference.Assigning a flag based on conditions.csv (18.3 KB)

Many thanks in advance!

Best,
Eng Xiong


#2

Hi, could you explain more about this 14 day window period? How did you arrive at it?


#3

Hi, thanks for clarifying. For each PO record, i would check the PO creation date with a list of PO records for the next 14 days by the same vendor and sum them cumulatively. Once it hits the 15K threshold, I would assign a cluster ID to the set. I would do the same for each PO records. While I did it manually for a small sample data set, I would like to explore how to do so using a R/Python script.

Hope it clarifies. Thanks!

Best,
Eng Xiong
New to programming


#4

Hi @engxiongster,

Can you please clarify whether you are looking for a hard-coded solution or a machine learning approach?
For example, if window amount of 14 days then… x cluster otherwise y cluster. Or you are looking for a solution through machine learning. In the second case, this would be an unsupervised problem whereas in first its a simple if else condition script. Please let us know your requirement.


#5

Hi Shaz13,

I would prefer to start with a hard-coded solution whereby i am able to assign a cluster ID for the set of data with the rolling 14 days criteria (i.e. 1 PO number can exist in more than 1 cluster ID)

Thanks!

Best,
Eng Xiong


#6

Hello @engxiongster,

I am afraid that we cannot help you with complete scripting as that is out of scope for this discussion forum. Please let us know what you have tried already and share that code perhaps so that we can give suggestions and inferences.