Regarding Text Analytics and Association Rules


#1

I created the DTM and extracted the most frequent words for (component, comment),

then put them into a data frame with 2 columns (freqwordsset, components).

When I read the data back, I get freqwordsset as

“c(“aipc”, “aipcv1”, “api’s”, “code”, “commit”, “fix”, “ncs6k”, “pie”, “test”)”

I need the individual words like “aipc”, “code”, etc. How can I extract them from this string, or how can I avoid the issue and get a proper word list?


#2

One more question,

Once I have done the text mining, how can I apply association rules to these frequent words and another column in the data frame?


#3

Hi @rumsinha

You can use the gsub() function in R for this specific task. Keep in mind that you may need several gsub() passes.

Another option is to convert the column to a factor and inspect its levels to get the desired output.
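
A minimal sketch of the gsub() approach, assuming the cell holds the whole vector as one string with plain double quotes (the variable names are illustrative). A string like this typically appears when a character vector is coerced with as.character(), so the last lines show one possible way to avoid it: store the words space-separated instead.

```r
# Assumed input: one freqwordsset cell as read back from the data frame.
freqwordsset <- "c(\"aipc\", \"aipcv1\", \"api's\", \"code\", \"commit\", \"fix\", \"ncs6k\", \"pie\", \"test\")"

# Pass 1: drop the leading c( and the trailing ).
stripped <- gsub("^c\\(|\\)$", "", freqwordsset)
# Pass 2: drop the double quotes around each word.
stripped <- gsub("\"", "", stripped, fixed = TRUE)
# Split on commas (and any spaces after them) to get the individual words.
words <- strsplit(stripped, ", *")[[1]]
print(words)

# To avoid the issue, collapse the words into one space-separated string
# before storing them, and split again after reading. This assumes that
# no word contains a space.
stored <- paste(words, collapse = " ")
restored <- strsplit(stored, " ", fixed = TRUE)[[1]]
```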


#4

Thank you, Saurav. Apriori will not fit here, right? I am asking because I have thought about this a lot but didn’t get anywhere.


#5

Hi @rumsinha

In order for me to comment on this question, can you tell me a bit more about how exactly you are planning to use it?


#6

My requirement is as below:

I have data for component, developer, code fix comments, defect, and a few other fields.

Step1: I need to show the most frequent words from the code fix comments at the component level. So I combined all the comments at the component level, did the text mining analysis, and got the frequent terms. I can also build a word cloud and a histogram to represent this visually.

Step2: Associate the frequent words with the developer at the component level. For this, after a lot of thinking, I clubbed all the comments together at the (Component, Developer) level.
Is my thinking right? I could not understand how to apply apriori analysis in this scenario.

Thanks and Regards


#7

Hi @rumsinha

As it looks to me, you can use apriori analysis to find which (Component, Developer) pairs and frequent words exhibit the strongest relationship. You might have to set the threshold from your domain experience. To point out a resource, you can have a look at:

Before discarding the apriori algorithm, it might be a good idea to try it. I think it could be useful; the rest depends on the results.

Best,
Saurav.


#8

Thank you so much. I will read it and try for sure.

Just let me know: will getting the frequent words at the component and developer level, the way I did in Step2, not help?

Regards


#9

Hi, @rumsinha

Based on my understanding of the data and the problem statement, the way you are doing it in Step2 is appropriate.

Best,
Saurav.


#10

Hi Saurav,

I went through the blog and understood the concept. What I am looking for from you is your thoughts on where I have got to so far with that concept and this problem:
Apriori algorithm application:

Component, Developer, Comments
C1,D1,H1
C1,D1,H2
C1,D2,H3
C1,D3,H4
C1,D1,H5
C1,D2,H6
C1,D3,H7

From the above data, say I form the basket of words from each comment:
C1,D1,H1, {basket of words say w1,w2,w3}
C1,D1,H2, {basket of words say w2,w3}
C1,D2,H3, {basket of words say w1,w3}
C1,D3,H4, {basket of words say w2,w3}
C1,D1,H5, {basket of words say w1,w3,w4}
C1,D2,H6, {basket of words say w2,w4}
C1,D3,H7, {basket of words say w1,w2,w5}

Please help me understand: if I need to associate a developer with a particular word, do I have to create a basket in which each item has the form D1w2, D1w3 in each row so that the association can be done?
Here D1 is the developer and w2 is the word from the comment.

I am kind of lost here. I would appreciate your help in taking this forward.

Regards
Ruma


#11

Hi @rumsinha

Based on the problem statement, shouldn’t you be grouping the words (w) from the comments at the same (Component, Developer) level before applying apriori analysis?

Something like this:
Component, Developer, Comments
C1,D1,H1+H2+H5
C1,D2,H3+H6
C1,D3,H4+H7

And then

C1,D1,H1+H2+H5, {basket of words say w1,w2,w3,w4}
C1,D2,H3+H6, {basket of words say w1,w2,w3,w4}
C1,D3,H4+H7, {basket of words say w1,w2,w3,w5}
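
The grouping above could be sketched in R roughly as follows. This is only an illustration under assumptions: it relies on the arules package, adds the developer to each basket as an item named like `dev=D1` (a made-up naming convention), and uses arbitrary support and confidence thresholds.

```r
library(arules)  # assumes the arules package is installed

# Words per comment H1..H7, copied from the example earlier in the thread.
comment_words <- list(
  H1 = c("w1", "w2", "w3"), H2 = c("w2", "w3"),
  H3 = c("w1", "w3"),       H4 = c("w2", "w3"),
  H5 = c("w1", "w3", "w4"), H6 = c("w2", "w4"),
  H7 = c("w1", "w2", "w5")
)
# (Component, Developer) key of each comment, in the same order.
key <- c("C1/D1", "C1/D1", "C1/D2", "C1/D3", "C1/D1", "C1/D2", "C1/D3")

# One basket per key: the union of the words of all its comments.
baskets <- lapply(split(comment_words, key),
                  function(w) sort(unique(unlist(w, use.names = FALSE))))

# Add the developer as an extra item so rules can link words to developers.
items <- Map(function(k, w) c(w, paste0("dev=", sub(".*/", "", k))),
             names(baskets), baskets)
trans <- as(items, "transactions")

# Only allow developer items on the right-hand side of the rules.
dev_items <- grep("^dev=", itemLabels(trans), value = TRUE)
rules <- apriori(trans,
                 parameter = list(support = 0.3, confidence = 0.6, minlen = 2),
                 appearance = list(rhs = dev_items, default = "lhs"),
                 control = list(verbose = FALSE))
inspect(sort(rules, by = "lift"))
```

With only three grouped baskets per component the rules rest on very little evidence, so building one transaction per comment instead may be worth comparing.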

Best,
Saurav.


#12

Saurav,

At this step:
C1,D1,H1+H2+H5, {basket of words say w1,w2,w3,w4}
C1,D2,H3+H6, {basket of words say w1,w2,w3,w4}
C1,D3,H4+H7, {basket of words say w1,w2,w3,w5}

I will already have the frequent words at the Component and Developer level. So why do I need apriori analysis to associate developers with words? What will I achieve from the apriori analysis?

Please help.


#13

Reframing the question (sorry if it is too basic):

If I need to get the frequent associations between the Components and the Developers,
how do I proceed with apriori analysis? These 2 are different variables in the data frame.


#14

I was able to create the transactions from the 2 data frame fields, Components and Developers.

I am getting 1:1 rules, that is, {LHS} ==> {RHS} maps one component to one developer or vice versa.

Is there any way to get 1:many rules, that is, with a component on the LHS and all the developers for that component on the RHS, such as
C1 ==> D1, D2
C1 ==> D1
C1 ==> D1, D2, D3

#15

Any help on calculating the minimum support threshold?


#16

Hi @rumsinha

I understand that they are two separate variables in your data, but what I don’t understand is why you are not able to run apriori analysis on them.

I’d advise you to look at a few more examples of it on the Wiki page:

Also, to understand it better, you can have a look at this presentation as well:

Apriori Algorithm from International School of Engineering

On the minimum support threshold, you’ll have to try different values and find the most appropriate one for your case. Had I been in your place, I would have used trial and error.
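
The trial and error could look something like the sketch below, which assumes the arules package, one transaction per comment (the words from the earlier example plus a made-up `dev=...` item), and an arbitrary confidence of 0.6. It only counts how many rules survive at each support value.

```r
library(arules)  # assumes the arules package is installed

# One transaction per comment: its words plus the developer as an item.
baskets <- list(
  c("w1", "w2", "w3", "dev=D1"),
  c("w2", "w3", "dev=D1"),
  c("w1", "w3", "dev=D2"),
  c("w2", "w3", "dev=D3"),
  c("w1", "w3", "w4", "dev=D1"),
  c("w2", "w4", "dev=D2"),
  c("w1", "w2", "w5", "dev=D3")
)
trans <- as(baskets, "transactions")

# Candidate minimum support values to try.
supports <- c(0.1, 0.2, 0.3, 0.5)

# Mine rules at each support value and record how many come back.
n_rules <- sapply(supports, function(s) {
  rules <- apriori(trans,
                   parameter = list(support = s, confidence = 0.6, minlen = 2),
                   control = list(verbose = FALSE))
  length(rules)
})
print(data.frame(support = supports, n_rules = n_rules))
```

A higher support can only shrink the rule set, so one reasonable approach is to start low and raise the value until the number of rules is small enough to review by hand.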

Best,
Saurav.


#17

Thanks, Saurav. I could finally generate the rules, using a minimum support threshold found through trial and error.
It took some time, and I had to go through multiple links to understand how to form the transaction inputs from 2 different variables in a data frame.

I will also read these reference materials. You have been a big help with this.
I appreciate your help.

Regards
Ruma