Splitting the Data in KFold CV according to label

crossvalidation
python
sklearn

#1

Hi all,
I was going through the documentation for KFold CV in sklearn, and it gives the following signature:
class sklearn.cross_validation.KFold(n, n_folds=3, shuffle=False, random_state=None)

I am aware that the split is done sequentially or randomly depending on the shuffle and random_state parameters. Is there a method that splits the data into train and test sets so that both have the same proportion of 1’s and 0’s in the label?
Say the label in the data set has a distribution like:

[1,1,1,0,0,0,1,1,1]

so the desired split would be:

train : [1,1,1,0,0,1] test : [0,1,1]

The ratio of 1’s to 0’s (i.e., 2:1) is maintained in both the train and test sets. One more question: is this type of splitting effective when forming the k folds?
Thanks in advance


#2

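Hi @ravi_6767, what you are looking for is a stratified split, which keeps the proportion of each label value the same in every fold. As for effectiveness, stratified splitting is most useful on an imbalanced dataset, where a plain split can leave a fold with very few (or no) samples of the minority class.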
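
A minimal sketch of a stratified split on the label vector from the question, assuming a recent scikit-learn release where StratifiedKFold is imported from sklearn.model_selection (the sklearn.cross_validation module quoted above was removed in later releases). The feature matrix X is a placeholder invented for illustration; only y determines the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Label vector from the question: six 1's and three 0's (ratio 2:1).
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# Placeholder features (an assumption for this sketch); split() needs
# something with one row per sample, but stratification uses only y.
X = np.zeros((len(y), 1))

# Three folds, so each test fold holds 3 samples and each train fold 6.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

folds = []
for train_idx, test_idx in skf.split(X, y):
    # Keep the labels of each split so the class ratio can be inspected.
    folds.append((y[train_idx], y[test_idx]))
    print("train:", y[train_idx], "test:", y[test_idx])
```

Every test fold should contain two 1's and one 0, and every train fold four 1's and two 0's, so the 2:1 ratio holds on both sides. Note that n_splits cannot exceed the number of samples in the smallest class, which is why three folds is the maximum for this toy example.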