Machine Learning Fundamentals on Text Mining Multi Label Classification!



Hi All,

Today I want to pose a fundamental and quintessential problem in the field of AI and machine learning. To be honest, you should be happy to hear that this is a real-world problem I’m working on for a client, so every relevant answer or suggestion may go into the production implementation.

Here is the Objective and then the Challenge!

I have data from the shipping industry containing free text: the comments a field inspector writes in a report about the quantity and quality of the inspected shipment. Inspectors verify that the shipped item has arrived properly at its destination, so they check the weight and perform other analyses to confirm correct delivery. If a shipment does not meet the expected delivery, they raise an issue against the relevant metric and also write down its root cause.

To give insight into the shipments and their respective issues, we extracted the information manually and with keyword matching. For example, depending on whether a comment mentions issue ‘X’ or not, we labelled that comment 1 or 0. Similarly, a comment can have multiple issues (i.e. multiple labels). This method works for a few comments but not for all, because it fails to detect the context. And because I don’t have pre-labelled data, I can’t go with supervised modelling directly.

As I said, we labelled a few rows manually or by keyword match, so I now have some labelled data for modelling, and I can perform text mining and classification on it. But I feel that this labelled data (produced by us) is not enough to train a model that captures the context and the unstructured information. Furthermore, even if I train a model and make some predictions, how do I automate the process so that the model learns incrementally over time, i.e. is retrained on data containing new patterns?
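
For concreteness, the keyword-matching step described above might look roughly like the sketch below. The issue names, keywords and comments are all made up for illustration and are not taken from the real data. The second comment shows the context problem: a negated mention is still flagged.

```python
import re

# Hypothetical issue names and trigger keywords (assumptions, not real data).
ISSUE_KEYWORDS = {
    "weight_shortage": ["short weight", "underweight", "weight shortage"],
    "damaged_packaging": ["torn", "damaged", "crushed"],
    "contamination": ["contaminated", "moisture", "mould"],
}


def label_comment(comment):
    """Return a {issue: 0/1} dict; one comment can receive several 1s."""
    text = comment.lower()
    labels = {}
    for issue, keywords in ISSUE_KEYWORDS.items():
        # Whole-word (or whole-phrase) match for any keyword of this issue.
        hit = any(
            re.search(r"\b" + re.escape(kw) + r"\b", text) for kw in keywords
        )
        labels[issue] = int(hit)
    return labels


# Made-up inspector comments.
comments = [
    "Bags torn on arrival and cargo found underweight by 2%.",
    "No bags were damaged; weight as per bill of lading.",
]
labels = [label_comment(c) for c in comments]

for comment, row in zip(comments, labels):
    print(row, "<-", comment)
```

The second comment is labelled as having a packaging issue only because the word "damaged" appears, even though the inspector reports the opposite.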

I hope you can see the difficulty: it’s a catch-22 problem!

Thanks in advance!

Happy Learning!


I don’t have much experience with text data, so take my suggestions with a grain of salt.

The first thing I’d do is classify whether a comment contains an issue or not. That should be a lot less costly than classifying every possible issue, and it will help a lot.

Also, it seems to me that in your post there is already a hint. You “can’t go with supervised modeling directly”, so that leaves us with unsupervised models. I’d begin with a bag-of-words and then would apply PCA to it and look at the first two principal components, searching for possible clusters. If there are no obvious groups, apply tf-idf and try PCA again. If it still fails, replace PCA with a non-linear dimensionality reduction method like t-SNE. Hopefully, this will enable clustering methods to work well. Alternatively, a more costly (and powerful) approach is to use N-grams instead of single terms.
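
A minimal sketch of that first exploratory step, assuming scikit-learn is available. The comments are invented placeholders, and the data is converted to a dense array before PCA, which is only reasonable for a small corpus.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up inspector comments standing in for the real free text.
comments = [
    "bags torn on arrival, cargo underweight",
    "cargo underweight by two percent at discharge",
    "packaging torn and crushed in hold",
    "moisture found in cargo, mould on bags",
    "cargo contaminated with moisture",
    "weight as per bill of lading, no issue",
]

# Plain bag-of-words counts.
bow = CountVectorizer().fit_transform(comments)

# tf-idf weighting; ngram_range=(1, 2) adds bigrams to the single terms.
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(comments)


def first_two_components(matrix):
    """Project a sparse document-term matrix onto its first two PCs."""
    dense = matrix.toarray()  # acceptable only because the corpus is tiny
    return PCA(n_components=2, random_state=0).fit_transform(dense)


bow_2d = first_two_components(bow)
tfidf_2d = first_two_components(tfidf)

print(bow_2d.shape, tfidf_2d.shape)
```

Scatter-plotting `bow_2d` or `tfidf_2d` is then the visual check for clusters. If neither shows obvious groups, `sklearn.manifold.TSNE` could be swapped in for `PCA` in the helper, and for a large corpus `TruncatedSVD` avoids the dense conversion.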

Having segmented the data, train supervised models for each of the clusters to identify whether a comment has an issue or not, using the previously labeled data. At this point, if the clustering and classification steps worked well, you should have a fairly good multi-label classification pipeline. One thing to note is that the resulting models won’t say that a given comment has an X issue, but rather that a given comment has one or more issues that resemble {X, Y, Z}.


Thanks @caiotaniguchi for taking the time to write a solution.
As you mentioned in your reply, whether a comment has an issue or not has already been labelled (using keyword matching). For the comments that have issues, there are more than 20 root causes, and a particular document (comment) can be classified into more than one root cause. So unless we can tell which root cause or set of root causes belongs to a particular comment, it won’t help much. I didn’t get your approach with PCA: do you want to cluster the words that indicate the issue?
Could you please be more specific, with an example if needed?

Thanks again!


My line of thinking is the following:

  • You want to segment the data into multiple classes, but you have only labeled two classes (with or without issue). This rules out supervised multi-class classification methods
  • To deal with the above, we need some form of clustering. But text data generally has high dimensionality, which is a deal breaker for clustering
  • To solve the high dimensionality problem we apply a dimensionality reduction method, in this case PCA or alternatively t-SNE
  • Finally, find the appropriate clusters and train a binary classification model for each. If all goes well, you will end up with a multi-class classification model using only data labeled with issue/non-issue, without having to specify the kind of issue in the training data.
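
The steps above could be sketched as follows, assuming scikit-learn. The comments, the issue/non-issue labels, the choice of KMeans and logistic regression, and the number of clusters are all illustrative assumptions rather than a recommendation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up comments with keyword-derived issue (1) / no-issue (0) labels.
comments = [
    "bags torn on arrival, cargo underweight",
    "cargo underweight by two percent at discharge",
    "weight checked and found as per bill of lading",
    "weight verified, quantity in order",
    "moisture found in cargo, mould on bags",
    "cargo contaminated with moisture in hold",
    "cargo dry and clean, quality in order",
    "quality checked, cargo clean",
]
has_issue = np.array([1, 1, 0, 0, 1, 1, 0, 0])

# Steps 1-3: vectorize, reduce dimensionality, then cluster.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments).toarray()  # dense: tiny corpus only
reducer = PCA(n_components=2, random_state=0)
X_2d = reducer.fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_2d)

# Step 4: one binary issue/no-issue model per cluster.
models = {}
for c in np.unique(cluster_ids):
    mask = cluster_ids == c
    y = has_issue[mask]
    if len(np.unique(y)) < 2:
        # Only one class in this cluster: store the constant answer.
        models[int(c)] = int(y[0])
    else:
        models[int(c)] = LogisticRegression().fit(X[mask], y)


def predict(comment):
    """Return (cluster id, issue flag) for a new comment."""
    x = vectorizer.transform([comment]).toarray()
    c = int(kmeans.predict(reducer.transform(x))[0])
    model = models[c]
    flag = model if isinstance(model, int) else int(model.predict(x)[0])
    return c, flag


cluster, flag = predict("cargo found underweight, bags torn")
print(cluster, flag)
```

The cluster id plays the role of the "kind of issue": it is never named in the training data, so each cluster still has to be inspected by hand to see which issues it resembles.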

There is plenty of material about PCA around since it’s a classical tool; you can just google it. I particularly like this one: Principal Component Analysis Explained Visually

What I’d hope to find is something like this: Manifold learning on handwritten digits