There is a stream of events such as A, B, C, D, … Some of these events are related to each other, for example A->B, A->B->C, B->D->C.
The order of events is not known beforehand. We need to find associations among the events and then identify the root event that triggered the others. For example, if we find the pattern A->B->D, then A should be reported as the root event.
As mentioned, the order is not known in advance, so it’s an unsupervised problem. I could use the Apriori association algorithm here, but it won’t give me the order of the events; it only tells me which events group together.
We can’t assume that the event that arrives first is always the parent event, which rules out combining Apriori with event arrival times.
Please suggest what we can try here to achieve this.
Have you investigated if TraMineR is fit for purpose?
To illustrate, consider the sequence of events below, where:
id identifies a group of related events
timestamp is when the event took place
event is the event itself
   id           timestamp event
1:  1 2018-01-01 12:00:00     A
2:  1 2018-01-02 12:00:00     B
3:  1 2018-01-03 12:00:00     C
4:  2 2018-02-01 12:00:00     A
5:  2 2018-02-02 12:00:00     B
6:  2 2018-02-03 12:00:00     C
7:  3 2018-04-01 12:00:00     B
8:  3 2018-04-02 12:00:00     C
This is a sample script to analyse the events listed above:
library(data.table)  # data.table()
library(lubridate)   # ymd_hms()
library(TraMineR)    # seqecreate(), seqefsub()
sample_events <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3),
  timestamp = ymd_hms(c('2018-01-01 12:00:00', '2018-01-02 12:00:00', '2018-01-03 12:00:00', '2018-02-01 12:00:00',
                        '2018-02-02 12:00:00', '2018-02-03 12:00:00', '2018-04-01 12:00:00', '2018-04-02 12:00:00')),
  event = c('A', 'B', 'C', 'A', 'B', 'C', 'B', 'C')
)
# Build the event sequence object
sample_events.seq <- seqecreate(sample_events, use.labels = TRUE)
# Keep subsequences with a minimum support of 5%
fsubseq <- seqefsub(sample_events.seq, pmin.support = 0.05)
plot(fsubseq, col = "cyan")
This is what you get. The plot shows the event subsequences found in your data and how frequently each of them occurs. The package is very well documented, so you shouldn’t have any issues finding your way around.
Hopefully this helps.
Thanks for the inputs, but this is not what I was looking for. The problem is that it requires sequences that are already grouped, whereas in my case I need to find the grouping itself. Something similar has been implemented in Spark as well (FPGrowth).
Please let me know if you have any other ideas.
You mention that the order is not known (although you are interested in ensuring the correctness of the order).
You probably need to spend a bit of time figuring out what metadata (e.g. the timestamp or the event characteristics themselves) can be used to infer or extrapolate the correct order of events. At this point, this sounds more like a subject/domain problem than a computational one.
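As a rough illustration of what "using the timestamp as metadata" could look like, here is a minimal base R sketch. It is not a recommended method, only one possible starting point. It assumes that events close together in time may be related (the 15-minute `window_secs` is an arbitrary, hypothetical choice) and that, although any single arrival order can be wrong, the order seen most often across many occurrences is informative. The `events` data frame and all variable names are made up for the example.

```r
# Hypothetical ungrouped event stream: one row per event, no group id.
events <- data.frame(
  timestamp = as.POSIXct(c(
    "2018-01-01 12:00:00", "2018-01-01 12:05:00", "2018-01-01 12:09:00",
    "2018-02-01 08:00:00", "2018-02-01 08:03:00", "2018-02-01 08:10:00",
    "2018-04-01 09:00:00", "2018-04-01 09:04:00"
  ), tz = "UTC"),
  event = c("A", "B", "D", "A", "B", "D", "B", "D"),
  stringsAsFactors = FALSE
)

# Assumption: events at most this many seconds apart may be related.
window_secs <- 15 * 60

events <- events[order(events$timestamp), ]
types <- sort(unique(events$event))

# precedes[x, y] counts how often x occurred before y inside the window.
precedes <- matrix(0, length(types), length(types),
                   dimnames = list(types, types))

n <- nrow(events)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    gap <- as.numeric(difftime(events$timestamp[j], events$timestamp[i],
                               units = "secs"))
    if (gap > window_secs) break  # rows are sorted, so later rows are further away
    from <- events$event[i]
    to <- events$event[j]
    if (from != to) precedes[from, to] <- precedes[from, to] + 1
  }
}

# Keep a directed edge x -> y only if x came before y more often than after it.
edges <- precedes > t(precedes)

# Candidate roots: events with outgoing edges and no incoming edges.
roots <- types[colSums(edges) == 0 & rowSums(edges) > 0]
```

On this toy data `roots` contains only "A". Whether a majority vote over arrival order is acceptable, and what window to use, are exactly the domain questions raised above; if the timestamps are not trustworthy even in aggregate, some other attribute of the events would have to play this role.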