Improving Supervised Algos Using Clustering



Hello Experts,

I used RandomForestClassifier to solve a classification problem. Then, to improve the accuracy, I applied K-Means clustering and re-applied the RandomForestClassifier with the cluster labels added as a feature. However, rather than improving, the accuracy is dropping.

Note: The aforementioned technique for improving the result is discussed in the following Analytics Vidhya article.

One can download the dataset from this link (Search Download)

In the above link the code is written in R; I have rewritten it in Python.
I would appreciate it if anyone could point out where I am making a mistake. Below is my code:

#Importing Libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

#Importing Dataset
df = pd.read_csv("stock_data.csv")

#Splitting the dataset into independent & dependent variables
#(the target column name below is a placeholder; substitute the actual one)
X = df.drop("target", axis=1)
y = df["target"]

#Importing Libraries
from sklearn.model_selection import train_test_split

#Splitting dataset into training & test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#Creating the object of RandomForestClassifier
rfc = RandomForestClassifier()
#Training the model
rfc.fit(X_train, y_train)
#Checking accuracy on training data
print(accuracy_score(y_train, rfc.predict(X_train)))
#Predicting values based on test data
y_pred = rfc.predict(X_test)
#Checking accuracy on test data
print(accuracy_score(y_test, y_pred))

Implementing clustering using KMeans

#Importing Libraries
from sklearn.cluster import KMeans

#Setting cluster size to 5
kmeans = KMeans(n_clusters=5)
#Fitting the model
kmeans.fit(X)
#Predicting the labels
labels = kmeans.predict(X)
#Adding Labels to independent variable dataset
X["cluster"] = labels

#Splitting dataset into training & test dataset
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.2)
#Creating the object of RandomForestClassifier
rfc_1 = RandomForestClassifier()
#Training the model
rfc_1.fit(X_train_1, y_train_1)
#Checking accuracy on training data
print(accuracy_score(y_train_1, rfc_1.predict(X_train_1)))
#Predicting values based on test data
y_pred_1 = rfc_1.predict(X_test_1)
#Checking accuracy on test data
print(accuracy_score(y_test_1, y_pred_1))
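For reference, here is a self-contained sketch of the same workflow on synthetic data (`make_classification` stands in for the stock dataset, and all names here are illustrative, not from the original post), with fixed seeds so the two forests are actually comparable:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the stock dataset
X_arr, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

# Baseline random forest (fixed split and forest seeds)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
base_acc = accuracy_score(y_test, rfc.predict(X_test))

# Add the K-Means cluster label as an extra feature
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
X_clustered = X.copy()
X_clustered["cluster"] = kmeans.fit_predict(X)

# Same split seed and same forest seed for a fair comparison
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    X_clustered, y, test_size=0.2, random_state=0)
rfc_1 = RandomForestClassifier(random_state=42)
rfc_1.fit(X_train_1, y_train_1)
cluster_acc = accuracy_score(y_test_1, rfc_1.predict(X_test_1))

print(f"baseline: {base_acc:.3f}, with cluster feature: {cluster_acc:.3f}")
```

Because both splits and both forests use the same seeds, any remaining gap between the two accuracies reflects the cluster feature rather than sampling noise.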


Any pointers are highly appreciated…



Can you please clearly mention the link you are using for downloading the dataset? The links are not redirecting to any dataset at the moment.



Hi Ankit,

Could you please try the following link and let me know whether you are able to download the dataset.


Yes, I can download the dataset.


What do you achieve with the K-Means? If it does not increase the differentiation, it is a waste. Check that the clusters are not just a random mix of 1 and -1: if a cluster is split .51/.49 between the two classes, there is not much to gain.
Another point: you leave variable selection on auto, so when does the forest actually pick up your cluster-label variable to build its trees? With ten trees (the default value) and 100 other variables, your cluster variable might appear in only one tree, so I am surprised it is not at least as good as your first run. Also, to compare the two you must use the same seed in the second random forest as in the first; otherwise the out-of-bag samples differ, and that alone could explain the difference.
My advice: first set the same seed for both, then check the heterogeneity of your clusters; then you will know the real gain of your procedure. My guess, sorry to tell you, is that it brings nothing and the difference is a random effect… but I could be wrong.
Hope this helps a little.
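To make the heterogeneity check concrete, one quick way (a sketch with illustrative names; a synthetic -1/1 target stands in for the stock data) is to cross-tabulate the cluster labels against the class labels and look at the class mix per cluster:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Illustrative data with a -1/1 target, as in the stock dataset
X, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)

labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X)

# Class proportions within each cluster: a ~.51/.49 row means that
# cluster separates the classes barely better than chance
mix = pd.crosstab(labels, y, normalize="index")
print(mix.round(2))
```

Rows far from a 50/50 split indicate clusters that actually carry class information and might help the forest.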


Hi Alain,

Many thanks for your pointers. I’ll definitely work on that & share the outcomes.