Transforming the test set

pandas
python
sklearn

#1

Hello,

I am doing some data analysis using Python and pandas.

I have a basic question to ask.

Suppose we do the standard data cleaning on the training set, such as replacing NaN values with the column mean and label-encoding the categorical columns, and then train a model on it.

Do we have to perform the same cleaning on the test set as well? If not, how will our model handle the NaN and other missing values that might be in the test set?

If yes, can we put all the cleaning steps inside a Python function and apply the same function to the test set so that it is cleaned in one step?

Is my understanding fine or am I missing something?

Thanks in advance for the help.


#2

Hi @siddharth185,

You will have to perform the same pre-processing on the test set as well, because the model needs its input in the same form it received during training.
I don’t know about Python, but in R I reuse the existing pre-processing code, replacing the word ‘train’ with ‘test’ after a few modifications. This performs the entire pre-processing for the test set in a single step.
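For Python, a rough pandas equivalent might look like the sketch below. It assumes two hypothetical columns, `age` (numeric) and `city` (categorical), and the helper names `fit_cleaning` and `clean` are made up for illustration. The idea is to learn the mean and the label codes from the training set once, then apply the same function to both sets.

```python
import pandas as pd

def fit_cleaning(train: pd.DataFrame) -> dict:
    """Learn the cleaning parameters from the training set only."""
    categories = sorted(train["city"].dropna().unique())
    return {
        "age_mean": train["age"].mean(),
        # Map each category seen in training to an integer code.
        "city_codes": {c: i for i, c in enumerate(categories)},
    }

def clean(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply the learned cleaning steps to any frame with the same columns."""
    out = df.copy()
    out["age"] = out["age"].fillna(params["age_mean"])
    # Missing values and categories not seen in training become -1.
    out["city"] = out["city"].map(params["city_codes"]).fillna(-1).astype(int)
    return out

# Hypothetical toy data.
train = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["Pune", "Delhi", "Pune"]})
test = pd.DataFrame({"age": [None, 50.0], "city": ["Delhi", "Mumbai"]})

params = fit_cleaning(train)
train_clean = clean(train, params)
test_clean = clean(test, params)  # same steps, same parameters
```

Keeping the learned values in `params` means the test set is filled with the training mean rather than its own, so both sets end up in the form the model was trained on.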

Regards.


#3

Thanks a lot. Can anyone tell me how to do it in Python?


#4

Any pandas people here? I’m really confused.


#5

@siddharth185 I’m here!

I’m sorry, but could you ask again? I didn’t understand your problem.


#6

I got the answer. Thanks anyway :)


#7

Hi @siddharth185,

You need to apply the same treatment to both the training and test sets.
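Since the thread is tagged sklearn, here is a minimal sketch of that idea with scikit-learn transformers: `fit_transform` on the training set, then `transform` on the test set. The column names and toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "city": ["Pune", "Delhi", "Pune"]})
test = pd.DataFrame({"age": [np.nan, 50.0], "city": ["Delhi", "Mumbai"]})

imputer = SimpleImputer(strategy="mean")
# Categories that were not seen during fit are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# fit_transform learns the mean and the category list from the training set.
train_age = imputer.fit_transform(train[["age"]])
train_city = encoder.fit_transform(train[["city"]])

# transform reuses those learned values on the test set.
test_age = imputer.transform(test[["age"]])
test_city = encoder.transform(test[["city"]])
```

Here the missing test age is filled with the training mean, and the unseen city "Mumbai" gets the placeholder code -1 instead of raising an error.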

One option is to merge the two dataframes and apply all the transformations to the combined frame. When you build the model, you can split it back into train and test.

It should not be too difficult to code, but let me know if you need any help.
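A minimal sketch of the merge-and-split option in pandas, assuming hypothetical `age`, `city` and `target` columns and a helper flag column named `is_train`:

```python
import pandas as pd

# Hypothetical toy data; the test set has no target column.
train = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city": ["Pune", "Delhi", "Pune"],
    "target": [0, 1, 0],
})
test = pd.DataFrame({"age": [None, 50.0], "city": ["Delhi", "Mumbai"]})

# Tag each row so the two frames can be separated again later.
combined = pd.concat(
    [train.assign(is_train=True), test.assign(is_train=False)],
    ignore_index=True,
)

# Apply the transformations once to the combined frame.
# Note: this mean is computed over train and test rows together.
combined["age"] = combined["age"].fillna(combined["age"].mean())
combined["city"] = combined["city"].astype("category").cat.codes

# Split back into train and test before building the model.
train_clean = combined[combined["is_train"]].drop(columns="is_train")
test_clean = combined[~combined["is_train"]].drop(columns=["is_train", "target"])
```

One thing to keep in mind with this approach: the imputed mean and the label codes are derived from test rows too. If the test set should stay completely unseen, the statistics can be computed from the training rows only and then applied to both.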

Regards,
Kunal