A question on Solving Multi-Label Classification problems

machine_learning
multi_label_class

#1

Have a good day, everybody. Some time ago I read an interesting post on solving multi-label classification problems by a member of the Analytics Vidhya community; you can find it at https://www.analyticsvidhya.com/blog/2017/08/introduction-to-multi-label-classification/. I have tried the different approaches and techniques described in the post, using not the artificial dataset from the article but two datasets from the MULAN repository: emotions.arff and yeast.arff. Here is a screenshot of yeast.arff:

So far I have gotten very poor accuracy scores. Any hints about this? Are these reasonable scores given that the dataset has a lot of features (100+)? I also tried with emotions.arff, but the results are not much different.

Here is an example of what I'm doing, for yeast.arff:

import pandas as pd
import numpy as np
import scipy
from scipy.io import arff
from scipy import sparse
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data, meta = scipy.io.arff.loadarff('/home/victorl/Documents/DATAScience/MachineLearning-AND-DATA-Mining/MULTICLASS-MULTILABEL/DATASETS/yeast.arff')
df = pd.DataFrame(data)
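
To sanity-check the load (if I remember the MULAN description correctly, yeast has 103 numeric feature columns plus 14 binary label columns):

print(df.shape)                  # expecting (2417, 117) for yeast
print(df.dtypes.value_counts())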

The class columns are loaded with object dtype, which is why I do this:

# convert the 14 Class columns from object dtype to numeric
for i in range(1, 15):
    df[f'Class{i}'] = pd.to_numeric(df[f'Class{i}'])

Separating the data into features and labels:

features=list(df.columns[0:103])
labels=list(df.columns[103:])
X=df[features]
Y=df[labels]

# splitting into train and test sets
X_train,X_test, y_train,y_test=train_test_split(X, Y, random_state=0)

A sparse representation of the input matrices is preferred:

X_train_sp=sparse.csr_matrix(X_train.values)
y_train_sp=sparse.csr_matrix(y_train.values)
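
For symmetry the test split could be converted too, although I believe MLkNN also accepts dense arrays, which is what I pass below:

X_test_sp=sparse.csr_matrix(X_test.values)
y_test_sp=sparse.csr_matrix(y_test.values)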

#1st approach
from skmultilearn.adapt import MLkNN

classifier1 = MLkNN(k=5)

# train

classifier1.fit(X_train_sp, y_train_sp)

# predict

predictions = classifier1.predict(X_test.values)
print(accuracy_score(y_test,predictions.todense()))
0.186776859504
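
As far as I understand, with multi-label targets accuracy_score computes the subset accuracy, i.e. a sample only counts as correct when all 14 labels match exactly, so maybe these numbers are not as bad as they look. The Hamming loss would be a gentler metric for comparison:

from sklearn.metrics import hamming_loss

# fraction of individual label assignments that are wrong (lower is better)
print(hamming_loss(y_test, predictions.todense()))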

#2nd approach
from skmultilearn.neurofuzzy import MLARAM
classifier2=MLARAM(vigilance=0.9, threshold=0.02)
classifier2.fit(X_train.values, y_train.values)
predictions=classifier2.predict(X_test.values)
print(accuracy_score(y_test,predictions))
0.181818181818
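
One thing I am not sure about with MLARAM: as far as I know, ART-style networks expect features scaled into [0, 1], so it might be worth rescaling before fitting (just a guess on my part):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train.values)  # fit the scaler on train only
X_test_scaled = scaler.transform(X_test.values)
classifier2b = MLARAM(vigilance=0.9, threshold=0.02)
classifier2b.fit(X_train_scaled, y_train.values)
print(accuracy_score(y_test, classifier2b.predict(X_test_scaled)))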

#3rd try
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# initialize a Label Powerset multi-label classifier
# with a Gaussian Naive Bayes base classifier

classifier = LabelPowerset(GaussianNB())

# train

classifier.fit(X_train.values, y_train.values)

# predict

predictions = classifier.predict(X_test.values)
print(accuracy_score(y_test,predictions))

0.17520661157
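
The blog post also describes Binary Relevance and Classifier Chains; with skmultilearn they would follow the same pattern (sketched here, untested on my side):

from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain

# one independent binary classifier per label
br = BinaryRelevance(GaussianNB())
br.fit(X_train.values, y_train.values)
print(accuracy_score(y_test, br.predict(X_test.values)))

# like Binary Relevance, but each classifier also sees the previous labels
cc = ClassifierChain(GaussianNB())
cc.fit(X_train.values, y_train.values)
print(accuracy_score(y_test, cc.predict(X_test.values)))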

#finally
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
predictions=multi_target_forest.fit(X_train.values, y_train.values).predict(X_test.values)
print(accuracy_score(y_test,predictions))
0.104132231405
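
To check whether the strict subset accuracy is what makes everything look so bad, one could also look at the per-label accuracy (predictions here is the dense array that MultiOutputClassifier returns):

# fraction of samples where each individual label is predicted correctly
per_label_acc = (np.asarray(predictions) == np.asarray(y_test)).mean(axis=0)
print(per_label_acc)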

Thanks in advance …