Need Help with StratifiedShuffleSplit

crossvalidation
sss
stratifiedshuffle

#1

Hello,

  1. I have one (“single”) data file with approx 20K features (rows / fields).

  2. I want to split this file into training and test data in an 80:20 ratio using StratifiedShuffleSplit.

  3. The file has an attribute named "Income", and I would like the training and test data to each retain the percentage distribution of income from the original file.

  4. I converted Income to discrete values in the income_cat field.

Here is the code snippet:

SSS = StratifiedShuffleSplit(housing["income_cat"], test_size=0.2)
train_indices, test_indices = next(iter(SSS))

I get the following error on running this command:

TypeError: 'StratifiedShuffleSplit' object is not iterable

I also tried

for train_index, test_index in enumerate(SSS.split(housing["income_cat"])):

but I still get the same error.

How can I get the indices from SSS when I have just one file to work with?

Thanks,

Mohit


#2

Follow this link:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
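
For reference, here is a minimal sketch of the usage described on that page, assuming scikit-learn 0.18 or later, where the class lives in sklearn.model_selection. The `housing` frame below is a small synthetic stand-in for the real file, and `random_state=42` is an arbitrary choice. In this API the labels go to `split()` rather than to the constructor, and the splitter object itself is not iterable, which is what the TypeError in the question is reporting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real data: 1000 rows, 5 income categories.
rng = np.random.default_rng(0)
housing = pd.DataFrame({
    "income": rng.uniform(0, 15, size=1000),
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

# The constructor takes only the split settings, not the labels.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split(X, y) yields (train_indices, test_indices) pairs; take the first one.
train_idx, test_idx = next(sss.split(housing, housing["income_cat"]))

# The indices are positional, so select rows with iloc.
strat_train = housing.iloc[train_idx]
strat_test = housing.iloc[test_idx]
```

If more than one split is wanted, raise `n_splits` and loop with `for train_idx, test_idx in sss.split(X, y):` instead of calling `next()`.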


#3

If I understand your question correctly, you want to split your dataset 80:20 while retaining the distribution of the Income field.

For that you can do the following (assuming you are using Python):

from sklearn.model_selection import train_test_split as tts

# test_size=0.2 gives the 80:20 split; stratify keeps the income distribution
x_train, x_test, income_train, income_test = tts(other_columns, income_column,
                         test_size=0.2, shuffle=True, stratify=income_column)

Source
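
To confirm that the `stratify` argument really preserved the distribution, the category shares of the full, training and test sets can be compared side by side. This is only a sketch on made-up data: the frame `df`, the `rooms` column and the category labels are placeholders, not the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real file.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "rooms": rng.integers(1, 10, size=500),
    "income_cat": rng.choice(["low", "mid", "high"], size=500,
                             p=[0.5, 0.3, 0.2]),
})

# Passing the whole frame returns two frames: train and test.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["income_cat"], random_state=0
)

# Share of each category in the full, training and test sets.
shares = pd.DataFrame({
    "full": df["income_cat"].value_counts(normalize=True),
    "train": train_df["income_cat"].value_counts(normalize=True),
    "test": test_df["income_cat"].value_counts(normalize=True),
})
print(shares.round(3))
```

The three columns should agree closely; any small differences come from rounding the per-category row counts.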


#4

You can also do this with the pandas library.
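
One way to do it in pandas alone is to sample the same fraction from each income category and treat the remaining rows as the training set. This is a sketch that assumes pandas 1.1 or later (for `groupby(...).sample`) and a unique row index. The data is synthetic and `random_state=42` is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real file.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000),
})

# Draw 20% of the rows from each income category for the test set.
test_df = df.groupby("income_cat").sample(frac=0.2, random_state=42)

# Every row that was not sampled goes to the training set.
train_df = df.drop(test_df.index)
```

Because the 20% is rounded within each category, the test set may differ from an exact 20% of the file by a few rows.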