I have a few 1-column datasets of free-text taken from 1000-1500 users in a survey. They were asked to write whether they would recommend a certain company/product, and their reason for it. A small sample of it looks like this:
| ID | Text |
|----|------|
| 1 | Because 1. information is clearer, there is a notification letter regarding the policy, so we were not aware. 2. I used to list agents, but did not succeed, because I did not take the exam. Because the upline moved the office address only, so confused where to call, did not have time to take the test. So I did not succeed in becoming an agent, just take the insurance. |
| 2 | It is safe to claim that the agent is very easy to explain according to the benefits taken by the customer |
| 3 | the first aspect of the service was very good, according to what was received by the agent. Both staff are friendly and nice. The three agents provide explanations, provide a pretty good guarantee. |
| 4 | all relatives already have similar products from other companies |
| 5 | I have no time |
| 6 | I can, as a hobby, recommend to friends and family, because the program is indeed good, but I feel disappointed, because I am sick and trying to claim, why is my claim long ?, so how do I answer my friend's question, if for my claim there is no certainty itself. When I pay, I feel dissatisfied because I have already paid and my child has paid, but the next month I still get the bill. |
| 7 | Depends on the program. |
| 8 | I still recommend, but sometimes some would agree and others disagree with my recommendations. |
| 9 | The problem is that we are also from a farming family, do not want to recommend to friends / relatives, fear that the payment will be difficult. Fear of not being paid, because the income is unclear. |
| 10 | good trusted company |
For each row, I need to find a way to extract:
- its sentiment
- the main topic(s)/keyword(s) being talked about
- the opinion(s) of the user about the topic(s)/keyword(s)
So, for the example data above, I need the output to be something like this:
About the sentiment part, I can do sentiment analysis with libraries like NLTK, but the results are not always accurate. For example, if a sentence contains words like “no” or “not”, NLTK sometimes considers it to be negative sentiment, as with “No issues whatsoever” or “I did not find any problems with the service”. Nonetheless, it’s still a comparatively easier task than topic+opinion extraction, and if needed, I can spend some time manually tagging the sentiment of each row and train a model on that for other datasets.
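To make the negation problem concrete, here is a minimal pure-Python sketch. It is not how NLTK works internally; the tiny `POLARITY` lexicon, the `NEGATORS` set and the 4-token `WINDOW` are all assumptions made up for illustration. It contrasts a naive scorer that treats “no”/“not” as negative words with one that flips the polarity of the words following a negator.

```python
import re

# Hypothetical toy lexicon; a real sentiment lexicon is far larger.
POLARITY = {"good": 1, "friendly": 1, "easy": 1, "trusted": 1,
            "issues": -1, "problems": -1, "disappointed": -1, "difficult": -1}
NEGATORS = {"no", "not", "never", "without"}
WINDOW = 4  # assumption: a negator affects the next 4 tokens


def naive_score(text):
    """Sum word polarities, counting each negator as a negative word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(POLARITY.get(t, 0) - (t in NEGATORS) for t in tokens)


def negation_aware_score(text):
    """Sum word polarities, flipping those that closely follow a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score, flip_left = 0, 0
    for tok in tokens:
        if tok in NEGATORS:
            flip_left = WINDOW  # start (or restart) the negation window
            continue
        polarity = POLARITY.get(tok, 0)
        score += -polarity if flip_left > 0 else polarity
        flip_left = max(flip_left - 1, 0)
    return score


for sentence in ("No issues whatsoever",
                 "I did not find any problems with the service"):
    print(sentence, naive_score(sentence), negation_aware_score(sentence))
```

With this toy lexicon, both example sentences get a negative naive score and a positive negation-aware score. A fixed window is a crude heuristic, so it would still misjudge longer or nested negations.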
The topic+opinion extraction part is much harder. I tried several different methods:
- n-gram analysis (tried with n=1, 2 and 3)
- LDA and LSA
- n-gram collocation analysis with NLTK (tried with n=1, 2 and 3)
- segregate into positive, neutral and negative based on sentiment, and then use TF-IDF to extract top n keywords, and find their most collocated words
- using grammar rules to find patterns like adjective-noun-verb, noun-adj/verb, etc., and visualize it in a network graph
- Scattertext visualization
- Manually specifying keywords to look for, and categorizing based on that
- Counting frequencies of adjacent words with part of speech filters
- Pointwise Mutual Information
- Word2Vec, both the CBOW and the Skip-Gram Model
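As a reference point for the collocation-style methods in this list, here is a minimal pure-Python sketch of Pointwise Mutual Information over adjacent word pairs. The function name `bigram_pmi`, the `min_count` threshold and the sample texts are assumptions for illustration; NLTK's collocation tools may normalise counts differently.

```python
import math
import re
from collections import Counter


def bigram_pmi(texts, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs within one response only
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / n_bi
        p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Made-up sample responses, not taken from the survey data.
sample = ["the claim process is slow", "the claim process took long",
          "good agent", "the agent is good"]
for pair, pmi in sorted(bigram_pmi(sample).items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

PMI rewards pairs that co-occur more often than their individual frequencies predict. On 1000-1500 short responses, many pairs are too rare to score reliably, which is why the sketch applies a minimum count.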
With LDA and LSA, there are always words that shouldn’t be grouped together, or words that belong to multiple classes. Not to mention, you’ll arbitrarily get better or worse results depending on the number of classes you pre-specify.
With the grammar-specification approach, I was hoping to see clusters of keywords form that I could then build on. But it looks like this (red for nouns, yellow for adjectives and blue for verbs):
Anyway, the point is, none of them were really helpful to the task, which is ultimately to label each row with the topic(s)/keyword(s) and opinion(s) in an unsupervised manner.
At this point, I’m too mentally exhausted to think of any other way than to hunker down and manually label each row with the topic(s)/keyword(s) and opinion(s) as I showed for the 10 rows in the beginning. That would be an extremely tedious and boring 2 days of work, but at least I’ll get extremely accurate results. I’ve already wasted about 3-4 days on this so far. And then, for future datasets, I could train a model on this manually tagged dataset, and hopefully it’ll be able to predict topics and opinions (?), provided the free-texts in those datasets are similar in writing style.
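The "label manually, then train" idea could be sketched roughly as follows, using a small multinomial Naive Bayes written in pure Python. Everything here is an assumption for illustration: the class name `NaiveBayesTopicTagger`, the topic labels ("claims", "agent", "payment") and the restriction to one topic per row. A real setup would more likely use an existing library and allow several topics per row.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTopicTagger:
    """Multinomial Naive Bayes with add-one smoothing, one topic per row."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_rows = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_rows in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_rows / total_rows)  # log prior of the topic
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    logp += math.log((counts[tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical hand-assigned topic labels for a few survey-like rows.
train_texts = ["I am sick and trying to claim, why is my claim long",
               "the agent is very easy to explain",
               "the three agents provide explanations",
               "fear that the payment will be difficult",
               "I have already paid but still get the bill"]
train_labels = ["claims", "agent", "agent", "payment", "payment"]

tagger = NaiveBayesTopicTagger().fit(train_texts, train_labels)
print(tagger.predict("my claim took too long"))
```

A bag-of-words model like this only transfers to new datasets to the extent that respondents use similar vocabulary, which matches the caveat about writing style.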
Any suggestions are welcome.