Clustering in R with a time component

r
clustering

#1

Looking for some advice here:

My data has these columns: product, booking_weekly_date, qty, revenue, cost, and two more numerical variables.

Initially I clustered the products with k-means on qty, revenue, cost, and the other two numerical variables, after aggregating all the numerical values at the PID level.

But if we bring in booking_weekly_date, the numerical columns will be aggregated at the (PID, bookingDate) level. What will the clusters convey then? Will they capture the seasonality pattern of each product? I'd appreciate any help.
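For readers following along, the PID-level approach described above might look roughly like the sketch below. The data is randomly generated toy data, and the column names, the use of sum as the aggregate, and the choice of 2 clusters are all assumptions made for illustration, not details from the actual dataset.

```r
# Hypothetical toy data: 6 products with 4 weekly bookings each.
# Column names and values are assumptions for illustration only.
set.seed(42)
bookings <- data.frame(
  PID     = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  qty     = c(rpois(12, 5), rpois(12, 50)),
  revenue = c(rnorm(12, 1000, 100), rnorm(12, 5000, 300)),
  cost    = c(rnorm(12, 50, 5), rnorm(12, 400, 20))
)

# Aggregate every numeric column to one row per product (PID level).
per_pid <- aggregate(cbind(qty, revenue, cost) ~ PID, data = bookings, FUN = sum)

# Scale the features so that revenue does not dominate the distance,
# then run k-means with an assumed number of clusters.
features <- scale(per_pid[, c("qty", "revenue", "cost")])
fit <- kmeans(features, centers = 2, nstart = 10)

per_pid$cluster <- fit$cluster
print(per_pid)
```

With this layout each product gets exactly one cluster label, because there is one row per PID.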


#2

Hi @rumsinha.

I’ll be able to assist you better if you can provide me with a few rows of your dataset.


#3

Thanks, Saurav.

A few rows of dummy data are below:
PID, week_start_date, Sales Order Count, Ship Set Count, Revenue, Booking Quantity, Booking Cost
A, 03-jan-2016,5,2,1000,10,5
B, 07-jan-2016,15,10,100,12,51
C, 10-jan-2016,10,5,2000,1,35
D, 17-jan-2016,12,6,5000,2,50
E, 24-jan-2016,4,1,3000,3,51

Can you please suggest which algorithm I can use to get the best clusters? There are roughly 20,000+ records.

That's 2 years of data for 1,000 PIDs. Booking dates are weekly, but a given PID is not necessarily booked every week.


#4

Ok @rumsinha.

So don’t add the date as a clustering dimension straight away. Instead, extract the day, month and year from the date and use them as clustering features to capture seasonality.

Hope it helps. :slight_smile:


#5

Thanks, Saurav. So with day, month and year, should I use PAM clustering?


#6

Saurav, if one PID has booking dates like these, how can I do the clustering with the booking date included?
A, 03-jan-2016,5,2,1000,10,5
A, 07-jan-2016,15,10,100,12,51
A, 10-jan-2016,10,5,2000,1,35
A, 17-jan-2016,12,6,5000,2,50
A, 24-jan-2016,4,1,3000,3,51

So what information do I extract from the booking date, and how do I proceed with clustering?

Without the date I used k-means, but once the weekly booking date comes into the picture, how should I proceed?


#7

@rumsinha

Use as.numeric(format(date1, "%m")) to extract the month, and extract the year and day in the same way. Then drop the original date column and use the extracted values as clustering features.
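A minimal sketch of that suggestion, assuming the dates arrive as strings like 03-jan-2016. The rows, the column names, the manual month lookup via month.abb, the scaling step and the choice of 2 clusters are all assumptions for illustration; in an English locale, as.Date(x, format = "%d-%b-%Y") should parse the strings directly.

```r
# Hypothetical rows in the same date format as the dummy data in the thread.
bookings <- data.frame(
  PID = c("A", "A", "A", "B", "B", "B"),
  week_start_date = c("03-jan-2016", "10-jul-2016", "01-jan-2017",
                      "17-jan-2016", "24-jul-2016", "08-jan-2017"),
  revenue = c(1000, 100, 2000, 5000, 3000, 1500),
  qty     = c(10, 12, 1, 2, 3, 7),
  stringsAsFactors = FALSE
)

# Parse "dd-mon-yyyy" without relying on the locale's month names:
# split on "-", look the month up in month.abb, rebuild an ISO date string.
parts   <- do.call(rbind, strsplit(bookings$week_start_date, "-"))
month_n <- match(tolower(parts[, 2]), tolower(month.abb))
date1   <- as.Date(sprintf("%s-%02d-%s", parts[, 3], month_n, parts[, 1]))

# Extract month, year and day as numeric features, as suggested above.
bookings$month <- as.numeric(format(date1, "%m"))
bookings$year  <- as.numeric(format(date1, "%Y"))
bookings$day   <- as.numeric(format(date1, "%d"))

# Drop the original date column.
bookings$week_start_date <- NULL

# Scale so no single feature dominates, then cluster (k = 2 is arbitrary here).
features <- scale(bookings[, c("month", "year", "day", "revenue", "qty")])
set.seed(1)
fit <- kmeans(features, centers = 2, nstart = 10)
bookings$cluster <- fit$cluster
print(bookings)
```

Note that this clusters individual (PID, week) rows, so the same PID can end up in more than one cluster. Also, scale() produces NaN for a column with zero variance (for example, year when all the data falls in a single year), so such columns would need to be dropped before calling kmeans.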


#8

Thanks, Saurav.