Is it okay not to split the dataset into training and testing sets for text classification? (Python code for stacking/ensemble learning)



I found the following Python code for ensemble machine learning (stacking) on GitHub. It increased the accuracy of my classifiers, but it differs from the other common methods of text classification, which involve splitting the dataset into training and testing sets. Is it okay not to split the dataset into training and test sets? What are the disadvantages of doing so?

Here's the code:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

iris = datasets.load_iris()
# Use only two features, as in the original example
X, y =[:, 1:3],

def CalculateAccuracy(y_test,pred_label):
    nnz = np.shape(y_test)[0] - np.count_nonzero(pred_label - y_test)
    acc = 100*nnz/float(np.shape(y_test)[0])
    return acc

clf1 = KNeighborsClassifier(n_neighbors=2)
clf2 = RandomForestClassifier(n_estimators=2, random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()

# Fit each base classifier on the full dataset, y), y), y)

f1 = clf1.predict(X)
acc1 = CalculateAccuracy(y, f1)
print("accuracy from KNN: "+str(acc1) )
f2 = clf2.predict(X)
acc2 = CalculateAccuracy(y, f2)
print("accuracy from Random Forest: "+str(acc2) )
f3 = clf3.predict(X)
acc3 = CalculateAccuracy(y, f3)
print("accuracy from Naive Bayes: "+str(acc3) )
f = [f1, f2, f3]
f = np.transpose(f)

# Train the meta-classifier on the base classifiers' predictions, y)
final = lr.predict(f)
acc4 = CalculateAccuracy(y, final)
print("accuracy from Stacking: "+str(acc4) )


Hi @divisha,

Datasets are split into train and test sets to check how well a model generalizes to data it has not seen. For instance, if I only have a training set and want to build a model, I can split that set into train and test portions, train the model on the training portion, and then check its score on the test portion. If you skip the split and evaluate on the same data you trained on, as the code above does, the reported accuracy is optimistically biased: the model may simply have memorized the training examples (overfitting), so the score tells you little about how it will perform on new data.