How to perform multivariate multi-class text classification?




I have been working on a text classification problem that has three outcome variables, each of which is a multi-class variable.

The dataset is described below.

The dataset covers industrial accidents that happened over the years, classified by Degree, Nature and Occupation.

My requirement is to build a classifier that classifies documents and assigns each one a score, so that any new document coming from news, Google search or Twitter is classified by degree, nature and occupation.
Further, I want to use the score to rank the documents and categorize them as high, medium or low risk.

The main goal is to build an intelligent classifier that helps prevent and mitigate industrial accidents beforehand. It is a kind of risk prediction model.

Data set summary:

Independent feature (input variable):

Description => Information about the cause of the particular accident (this is my text document)

Dependent/outcome features:

Degree => Hospitalized, Non Hospitalized, Fatality (3 classes)
Nature => has many types/classes
Occupation => Occupation of the employees

I have read many papers that mention how to approach this problem:

  1. Combine the outcome variables into one (I don’t know how this is going to work).
  2. Use a two-level model: first build a classifier for the first outcome variable, and then one for the second.

How can I solve this problem? I am really struggling with how to approach it.

I have built a classifier for a single outcome variable, but how do I do it for two or more outcome variables that are themselves multi-class?

I have done all the preprocessing: feature extraction, stemming and stop-word removal.

I am using the tf-idf approach (and am also thinking of using word2vec).
I am using Python.

Machine learning algorithms:
scikit-learn’s Naive Bayes and SVM algorithms.

Any similar example or reference would be helpful.

Please find a sample data set attached.

Example_Dataset.csv (15.7 KB)



Hello @Niranjanp,

From my understanding, you have multiple response variables. It does not matter whether the dataset has a single response variable or several; each of them has to be treated individually.

The first option,
"Combine the outcome variables in one (I don’t know how it is going to work)",
will not be a good solution.
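
To make option 1 concrete, here is a minimal sketch of what combining the outcomes means (this is sometimes called a label powerset transformation). The rows, the `SEP` separator and the helper names `combine` and `split` are made up for illustration. One likely reason it scales poorly: every distinct (degree, nature, occupation) triple becomes its own class, so the number of classes can grow multiplicatively while each class keeps only a few training examples.

```python
from collections import Counter

# Hypothetical rows: (degree, nature, occupation). The values are invented.
rows = [
    ("Hospitalized", "Fracture", "Welder"),
    ("Fatality", "Burn", "Electrician"),
    ("Non Hospitalized", "Cut", "Welder"),
    ("Hospitalized", "Fracture", "Welder"),
]

SEP = "|"  # assumed separator that never occurs inside a label


def combine(degree, nature, occupation):
    """Merge the three outcomes into one class label."""
    return SEP.join((degree, nature, occupation))


def split(combined):
    """Recover the three outcomes from a combined label."""
    return tuple(combined.split(SEP))


combined_labels = [combine(*row) for row in rows]

# Each distinct triple is now a separate class for a single classifier.
class_counts = Counter(combined_labels)
print(len(class_counts), "combined classes from", len(rows), "rows")
```

A single multi-class classifier would then be trained on `combined_labels`, and `split` would turn each prediction back into the three outcomes.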

The second option,
"two level model like first build a classifier for first outcome variable and then second one",
is a better solution.
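
As a rough illustration of the two-level flow, the sketch below trains one classifier for Degree on the tf-idf features, then trains a second classifier for Nature on the tf-idf features plus the one-hot encoded Degree. The toy texts and labels are invented, `LinearSVC` is just one possible choice, and the function name `predict_two_level` is hypothetical. At prediction time the second level receives the Degree predicted by the first level.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Made-up toy data; replace with the real Description/Degree/Nature columns.
texts = [
    "worker fell from ladder and broke his arm",
    "employee burned by steam leak and died",
    "operator cut finger on metal sheet",
    "worker fell from scaffold and fractured leg",
    "technician died after electric burn",
    "mechanic cut hand on sharp blade",
]
degree = ["Hospitalized", "Fatality", "Non Hospitalized",
          "Hospitalized", "Fatality", "Non Hospitalized"]
nature = ["Fracture", "Burn", "Cut", "Fracture", "Burn", "Cut"]

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)

# Level 1: text -> Degree
clf_degree = LinearSVC().fit(X_text, degree)

# Level 2: text + Degree -> Nature (trained on the true Degree labels)
encoder = OneHotEncoder(handle_unknown="ignore")
degree_onehot = encoder.fit_transform([[d] for d in degree])
clf_nature = LinearSVC().fit(hstack([X_text, degree_onehot]).tocsr(), nature)


def predict_two_level(docs):
    """Predict Degree first, then feed that prediction into the Nature model."""
    X = vectorizer.transform(docs)
    pred_degree = clf_degree.predict(X)
    d_onehot = encoder.transform([[d] for d in pred_degree])
    pred_nature = clf_nature.predict(hstack([X, d_onehot]).tocsr())
    return list(zip(pred_degree, pred_nature))


print(predict_two_level(["worker fell from a ladder"]))
```

A third level for Occupation could be added the same way. One trade-off of this design is that a mistake at level 1 propagates into level 2.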

I will have to come up with an example data set during the weekend.



@Anant. Thanks.

I was wondering how the second option works (the flow). Any similar example or online reference would be helpful.

I have attached an example data set (20 rows).

Example_Dataset.csv (15.7 KB)



Hi @Niranjanp,

If I understand your question right, this is a case of ‘Multi-Label’ classification. Please check out this link. The page discusses two approaches: a) Problem Transformation, which transforms the problem into a set of binary classification problems, and b) Algorithm Adaptation methods, which treat the problem as a whole to find the outputs.

Also, I am curious what the output would be if we ran the apriori (association) algorithm, setting the LHS to the different combinations of dependent variables.



Hello @skkeyan

Sorry for my late reply. Thanks for your reply.

I also first thought that it is a case of ‘Multi-Label’ classification, but it is actually a multivariate case: there are multiple outcome variables, and each outcome variable has multiple classes.

Multi-label => categorizing news articles into categories such as politics, sports, technology etc., where one article can carry several labels at once.

In my case it is multi-outcome => n observations, each with p independent variables (I have only one independent variable) and q dependent variables.

I have been wondering whether I could build a classifier independently for each outcome variable.

classifier 1 => x = text-input, Y = outcome variable 1
classifier 2 => x = text-input, Y = outcome variable 2.

But the question is how to chain these classifiers, or make them act as one single model?
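
One possible way to make independent per-outcome classifiers act as a single model, sketched under assumptions: scikit-learn's `MultiOutputClassifier` fits one copy of a base estimator per outcome column, and a `Pipeline` bundles it with the tf-idf step so the whole thing has a single `fit`/`predict`. The toy texts and labels are invented, only two of the three outcomes are shown, and `MultinomialNB` is just one choice of base estimator.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data; replace with the real Description column.
texts = [
    "worker fell from ladder and broke his arm",
    "employee burned by steam leak and died",
    "operator cut finger on metal sheet",
    "worker fell from scaffold and fractured leg",
    "technician died after electric burn",
    "mechanic cut hand on sharp blade",
]

# One row per document, one column per outcome variable (Degree, Nature).
Y = np.array([
    ["Hospitalized", "Fracture"],
    ["Fatality", "Burn"],
    ["Non Hospitalized", "Cut"],
    ["Hospitalized", "Fracture"],
    ["Fatality", "Burn"],
    ["Non Hospitalized", "Cut"],
])

# tf-idf features feed one independent Naive Bayes classifier per column of Y.
model = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(MultinomialNB()),
)
model.fit(texts, Y)

new_docs = ["worker fell and broke his leg"]
pred = model.predict(new_docs)         # shape: (n_documents, n_outcomes)
proba = model.predict_proba(new_docs)  # list: one probability array per outcome
print(pred)
```

The per-class probabilities in `proba` might serve as the score mentioned in the question (for example the probability of Fatality), though how to map them to high, medium and low risk bands is a separate design decision.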