I have been working on a text classification problem that has three outcome variables, each of which is multi-class.
The dataset is described below.
The dataset covers accidents that happened in industry over the years, classified according to their Degree, Nature and Occupation.
My requirement is to build a classifier that classifies documents and assigns them a score, so that any new document coming from news, Google search or Twitter is classified according to its
degree, nature and occupation. Further, I want to categorize the documents into high, medium and low risk using the score value (i.e. rank them).
The main goal is to build an intelligence tool / classifier to help prevent and mitigate accidents in industry beforehand. It is a kind of risk prediction model.
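To illustrate what I mean by scoring and ranking, here is a rough sketch. It assumes the score is an expected severity computed from the predicted Degree probabilities; the weights, the cut-offs and the function names are placeholders I made up, not something derived from the data.

```python
# Hypothetical severity weights per Degree class (placeholders, not from data).
SEVERITY = {"Non Hospitalized": 1.0, "Hospitalized": 2.0, "Fatality": 3.0}


def risk_score(class_probs):
    """Expected severity: sum of P(class) * weight over the Degree classes."""
    return sum(SEVERITY[label] * p for label, p in class_probs.items())


def risk_bucket(score, low_cut=1.5, high_cut=2.3):
    """Map a score to low / medium / high using assumed cut-offs."""
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


# Class probabilities in the form a probabilistic classifier might give
# for one document (values invented for illustration).
doc_probs = {"Non Hospitalized": 0.1, "Hospitalized": 0.3, "Fatality": 0.6}
score = risk_score(doc_probs)
print(round(score, 2), risk_bucket(score))
```

Sorting documents by this score would give the ranking; the buckets are just a coarser view of the same number.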
Data set summary:
Independent feature (input variable):
Description => Information about the cause of the particular accident (this is my text document)
Dependent features (outcome variables):
Degree => Hospitalized, Non Hospitalized, Fatality (3 classes)
Nature => Has many types/classes
Occupation => Occupation of the employees
I have read many papers that mention how to approach this problem:
- Combine the outcome variables into one (I don’t know how this would work in practice).
- Build a two-level model: first a classifier for the first outcome variable, then one for the second.
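As far as I can tell, the two options would look roughly like the sketch below in scikit-learn. This is only my understanding and may well be wrong: the toy texts and labels are invented for illustration (they are not from my data set), LinearSVC is just one possible classifier, and the `||` separator and the helper names are my own.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
texts = [
    "worker fell from ladder and died",
    "employee cut finger on saw blade",
    "worker fell from scaffold and broke leg",
    "operator burned hand on hot pipe",
    "employee crushed by forklift and died",
    "technician fell from roof and fractured arm",
]
degree = ["Fatality", "Non Hospitalized", "Hospitalized",
          "Non Hospitalized", "Fatality", "Hospitalized"]
nature = ["Fall", "Cut", "Fall", "Burn", "Crush", "Fall"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Option 1: combine the outcomes into one label and train one classifier.
SEP = "||"  # assumed separator that does not occur inside any label
combined = [d + SEP + n for d, n in zip(degree, nature)]
combined_clf = LinearSVC().fit(X, combined)


def predict_combined(new_texts):
    """Predict the joint label, then split it back into (degree, nature)."""
    preds = combined_clf.predict(vectorizer.transform(new_texts))
    return [tuple(str(p).split(SEP)) for p in preds]


# Option 2: two-level model. Predict Degree first, then feed Degree
# (one-hot encoded) as extra features into the Nature classifier.
degree_clf = LinearSVC().fit(X, degree)
encoder = OneHotEncoder(handle_unknown="ignore")
degree_onehot = encoder.fit_transform(np.array(degree).reshape(-1, 1))
nature_clf = LinearSVC().fit(hstack([X, degree_onehot]).tocsr(), nature)


def predict_two_level(new_texts):
    """Predict Degree from text, then Nature from text plus predicted Degree."""
    X_new = vectorizer.transform(new_texts)
    d_pred = degree_clf.predict(X_new)
    d_onehot = encoder.transform(d_pred.reshape(-1, 1))
    n_pred = nature_clf.predict(hstack([X_new, d_onehot]).tocsr())
    return [(str(d), str(n)) for d, n in zip(d_pred, n_pred)]


print(predict_combined(["worker fell from ladder and died"]))
print(predict_two_level(["worker fell from ladder and died"]))
```

One thing I notice with option 1 is that it can only ever predict combinations that appear in the training data, and the number of combined classes grows quickly when Nature and Occupation have many values.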
How can I achieve this? I am really struggling with how to approach this problem.
I have built a classifier for a single outcome variable, but how do I do it for two or more outcome variables that are themselves multi-class?
I have done all the feature extraction, stemming and stop-word removal.
I am using the tf-idf approach (and am also thinking of trying word2vec).
I am using Python.
Machine learning algorithms:
scikit-learn's Naive Bayes and SVM algorithms.
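For context, a minimal sketch of this kind of setup, extended in the most naive way to several outcomes by fitting one independent tf-idf pipeline per outcome variable. The example texts, labels and helper names are invented for illustration and are not my actual code or data; `MultinomialNB` and `LinearSVC` stand in for the Naive Bayes and SVM classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
texts = [
    "roofer fell from ladder and died",
    "carpenter cut finger on saw blade",
    "painter fell from scaffold and broke leg",
    "welder burned hand on hot pipe",
    "driver crushed by forklift and died",
    "roofer fell from roof and fractured arm",
]
targets = {
    "Degree": ["Fatality", "Non Hospitalized", "Hospitalized",
               "Non Hospitalized", "Fatality", "Hospitalized"],
    "Nature": ["Fall", "Cut", "Fall", "Burn", "Crush", "Fall"],
    "Occupation": ["Roofer", "Carpenter", "Painter",
                   "Welder", "Driver", "Roofer"],
}


def build_models(make_classifier):
    """Fit one tf-idf + classifier pipeline per outcome variable."""
    models = {}
    for name, labels in targets.items():
        model = make_pipeline(TfidfVectorizer(stop_words="english"),
                              make_classifier())
        models[name] = model.fit(texts, labels)
    return models


nb_models = build_models(MultinomialNB)
svm_models = build_models(LinearSVC)


def classify(models, document):
    """Return the predicted class of every outcome for one document."""
    return {name: str(model.predict([document])[0])
            for name, model in models.items()}


print(classify(nb_models, "painter fell from ladder and broke his arm"))
print(classify(svm_models, "painter fell from ladder and broke his arm"))
```

This treats the three outcomes as unrelated, which is exactly the part I am unsure about. scikit-learn's `MultiOutputClassifier` appears to wrap the same one-estimator-per-target idea.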
Any similar example or reference would be helpful.
Please find a sample data set attached.
Example_Dataset.csv (15.7 KB)