When do we split the data into train and test data?

Hello community,

To avoid overfitting or underfitting, a common practice is to divide the data into train and test samples. My question is: at what stage should we do it, before EDA or after EDA? I am asking because EDA includes missing value treatment and outlier treatment, for which we look at the whole data. So if we split after EDA, the model might have already seen the whole data, compromising the basic concept of splitting. But in many places I have seen the split done after EDA. Please help me clear up this ambiguity.


To answer your query: we treat the data before training the model. You then give the model only one part of the data (the training set), and it learns from that. You keep the remaining data (the test set) aside to evaluate the model's performance. Note that the data is divided randomly into training and test sets.
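As a rough illustration of what "divided randomly" means, here is a minimal sketch using only NumPy. The function name `random_split`, the 80/20 ratio and the seed are my own assumptions for the example, not something fixed by the method.

```python
import numpy as np

def random_split(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx): a random partition of the row numbers 0..n_rows-1."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)          # row numbers in random order
    n_test = int(round(n_rows * test_fraction))  # how many rows to hold out
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 10 rows: 8 go to training, 2 are held out for testing.
train_idx, test_idx = random_split(n_rows=10, test_fraction=0.2, seed=0)
```

Every row lands in exactly one of the two sets, so the model is fitted on the training rows only and scored on rows it has not been fitted on.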

So at no point do we feed the complete data to the model. It’s like driving to an unfamiliar place: you don’t know the route, so you use your previously learnt skills to find the location, haha!

I hope this clears your doubt.


We should split the data after the EDA process because:

  1. We should clearly understand the data before splitting. Understanding the data helps in deciding which attributes to consider when building the model.
  2. Another reason to do EDA before model building is, as you mentioned, that we have to treat the missing values and outliers in the data, which would otherwise strongly influence the model.
  3. Model building is a later stage of the process than EDA.
  4. Doing EDA does not tell the model what the data looks like, because EDA and model building are very different stages of the process.
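To make the missing-value and outlier checks in the points above concrete, here is a small sketch with pandas. The helper name `eda_summary`, the toy columns and the 1.5 × IQR rule for flagging outliers are assumptions chosen for illustration; other outlier rules are equally valid.

```python
import numpy as np
import pandas as pd

def eda_summary(df):
    """Count missing values and IQR-rule outliers for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Compare each column against its own bounds; NaN values are never flagged.
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return pd.DataFrame({
        "missing": numeric.isna().sum(),
        "outliers": outside.sum(),
    })

# Hypothetical toy data: one missing age and one extreme income.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 29, 35],
    "income": [30, 32, 31, 29, 33, 500],
})
summary = eda_summary(df)
print(summary)
```

A summary like this only describes the data; it does not fit any model.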


First, simplify your data as much as you can. Do data preprocessing, manipulation, imputation and wrangling; then, after EDA, go ahead and split the data into train and test. If you are using Python, split the data with the help of “from sklearn.model_selection import train_test_split”.
For reference, go through my GitHub projects to get a basic idea - https://github.com/akhiljamdar?tab=repositories.com
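For anyone who has not used it yet, a minimal sketch of `train_test_split` could look like this. The toy arrays, the 25% test size, the random seed and the use of `stratify` are assumptions for the example; pick values that suit your own data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical prepared dataset: 20 rows, 2 features, balanced binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Setting `random_state` makes the split reproducible, so you get the same train and test rows every time you rerun the code.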

EDA first, always. Simple as that.
Other preprocessing should also be done as soon as possible.
Only split into train and test once your dataset is ready for the modelling methods.

Thanks everyone.

You should do splitting after EDA.

EDA includes pre-processing of the data such as treating missing values and outliers, scaling, transforming variables, etc. Only after that do the splitting.
When new test data arrives, it must pass through all the same preprocessing and data preparation steps that you applied to the data before splitting.
This will give you consistent output from your model most of the time.
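One way to make sure new data goes through exactly the same preparation steps is to bundle them with the model in a scikit-learn `Pipeline`. This is only a sketch: the median imputation, standard scaling, logistic regression and the toy arrays are assumptions picked for illustration, not the only way to do it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data used to fit the steps; one value is missing.
X_train = np.array([
    [1.0, 10.0], [2.0, np.nan], [3.0, 30.0],
    [4.0, 40.0], [5.0, 50.0], [6.0, 60.0],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # scale the features
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# New rows arriving later: predict() reuses the imputation and scaling
# values learned in fit(), so the same steps are applied automatically.
X_new = np.array([[2.5, np.nan], [5.5, 55.0]])
predictions = pipe.predict(X_new)
```

The imputation and scaling values are learned from whatever rows are passed to `fit`, and `predict` only applies them, so every new batch is prepared in the same way.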

Thanks And Regards
Wishy verma

© Copyright 2013-2019 Analytics Vidhya