Improving Supervised Algos Using Clustering

machine_learning

#1

Hello Experts,

I used RandomForestClassifier to solve a classification problem, then, to improve the accuracy, applied K-Means clustering and re-applied the RandomForestClassifier with the cluster labels added as a feature. However, rather than improving, the accuracy is dropping.

Note: The aforementioned technique for improving the result is discussed in the following Analytics Vidhya article.

One can download the dataset from this link (search for “Download”).

In the above link the code is written in R; I’ve re-written it in Python.
I’d appreciate it if anyone could let me know where I’m making a mistake. Below is my code:

#Importing Libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

#Importing Dataset
df = pd.read_csv("stock_data.csv")

#Splitting the Dataset into independent & dependent variables
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

"""
BEFORE CLUSTERING 

"""
#Importing Libraries
#Note: sklearn.cross_validation is deprecated/removed; model_selection is the current module
from sklearn.model_selection import train_test_split

#Splitting dataset into training & test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#Creating the object of RandomForestClassifier
randomFClassifier = RandomForestClassifier()
#Training the model
randomFClassifier.fit(X_train, y_train)
#Checking accuracy on training data (printed so the score is visible when run as a script)
print(accuracy_score(y_train, randomFClassifier.predict(X_train)))
#Predicting values based on test data
y_predict = randomFClassifier.predict(X_test)
#Checking accuracy on test data
print(accuracy_score(y_test, y_predict))

"""
Implementing clustering using KMeans

"""
#Importing Libraries
from sklearn.cluster import KMeans
#Setting the number of clusters to 5
kmeans = KMeans(n_clusters=5)
#Fitting the model
kmeans.fit(X)
#Predicting the cluster labels
labels = kmeans.predict(X)
#Adding the cluster labels to the independent-variable dataset as a new feature
X["Labels"] = pd.Series(labels, index=df.index)

#Splitting dataset into training & test dataset (now including the cluster labels)
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.2)
#Creating the object of RandomForestClassifier
randomFClassifier_1 = RandomForestClassifier()
#Training the model
randomFClassifier_1.fit(X_train_1, y_train_1)
#Checking accuracy on training data
print(accuracy_score(y_train_1, randomFClassifier_1.predict(X_train_1)))
#Predicting values based on test data
y_predict_1 = randomFClassifier_1.predict(X_test_1)
#Checking accuracy on test data
print(accuracy_score(y_test_1, y_predict_1))

#2

Any pointers are highly appreciated…


#3

Hi,

Could you please share the exact link you are using to download the dataset? The links are not redirecting to any dataset at the moment.

Regards
Ankit


#4

Hi Ankit,

Could you please try the following link and let me know whether you are able to download the dataset?
https://drive.google.com/file/d/0ByPBn4rtMQ5HaVFITnBObXdtVUU/view


#5

Yes, I can download the dataset.


#6

Hi
What do you achieve with the K-Means? If it does not increase the differentiation, it is a waste. Check that the clusters are not just a random mix of 1 and -1: if a cluster is 51% one class and 49% the other, there is not a lot of gain.
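For example, something like this (just a rough sketch, reusing the labels and y from your code above) would show the class mix per cluster:

#Rough check of cluster purity: rows near 0.5 / 0.5 carry little class signal
import pandas as pd
mix = pd.crosstab(pd.Series(labels, name="cluster"), y, normalize="index")
print(mix)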
Another point: you leave the variable selection on auto, so when does your Labels variable actually get picked up to build trees? With ten trees (the default value) and your 100 other variables, the cluster variable could appear in only one tree, so I am surprised the result is not at least as good as your first run. But to compare fairly, you must use the same seed in the second random forest as in the first; this could be the reason, as the out-of-bag samples could be different.
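To see how much the forest actually uses the cluster variable, you could look at the feature importances (again just a rough sketch, reusing randomFClassifier_1 and X_train_1 from your code):

#If "Labels" is near zero here, the trees barely use it
import pandas as pd
importances = pd.Series(randomFClassifier_1.feature_importances_, index=X_train_1.columns)
print(importances.sort_values(ascending=False).head(10))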
My advice: first set the same seed for both, then check the heterogeneity of your clusters; then you will know better the gain of your procedure. My guess is that it brings nothing, sorry to tell you this, and the difference is a random effect… but I could be wrong…
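For the seed, something like this (a rough sketch; 42 and n_estimators=100 are arbitrary choices) makes the two runs differ only in the extra column:

#Fix the split seed and the forest seed so both experiments are comparable
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)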
Hope this helps a little.
Alain


#7

Hi Alain,

Many thanks for your pointers. I’ll definitely work on that & share the outcomes.

Rishi