# Missing Value Imputation in a very large dataset

I have a doubt. I read somewhere that if we have a large dataset we should try to build a model on a smaller subset of it and evaluate its performance. But if our data has missing values, should we impute them considering only the smaller subset, or should we impute them on the original dataset and then create the smaller subset to work on?

In my opinion, it is **best to impute on the large data set** before deriving the subset.

**Why?**

For example, if missing values are imputed with the mean/mode, the mean/mode computed on the subset will differ from the one computed on the whole dataset. The subset would then not represent the original dataset, which affects the predictions.
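The point above can be sketched numerically: the fill value computed from a subset generally differs from the one computed on the full data. A minimal sketch with made-up synthetic numbers (the dataset, sizes, and distribution here are all hypothetical):

```python
import random
import statistics

random.seed(0)

# Hypothetical column from a large dataset, and a smaller working subset of it.
full = [random.gauss(50, 10) for _ in range(100_000)]
subset = random.sample(full, 1_000)

# The two candidate fill values differ; imputing the subset with
# subset_mean makes it drift from the full data's distribution.
full_mean = statistics.mean(full)
subset_mean = statistics.mean(subset)

print(full_mean, subset_mean)
```

Both means estimate the same quantity, but the subset's estimate is noisier, so imputing with it introduces a bias relative to the whole dataset.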

Thanks, Jprakash

But this holds if we impute with the mean/median. What happens if we use algorithms like kNN or MICE for imputation? Do we still need to run the imputation on the whole dataset?

Yes, you can see this method of calculating the mean on the whole dataset in competitions (in case you are trying different options like mean, median…).

But for business applications, always choose an imputation that makes business sense.

If it was decided to go with the mean, for example, then the imputation value (i.e. the mean) will change at every model refresh.

Sorry Jprakash, but you haven't answered my question exactly. What I mean to ask is: are imputation methods like kNN and MICE supposed to be applied on the whole dataset, or should they be applied separately on the training and test sets?
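The distinction being asked about can be made concrete with a fit/transform imputer. This sketch uses a deliberately simple mean imputer with made-up numbers (the class and data are hypothetical), but the same pattern applies to kNN- or MICE-style imputers such as scikit-learn's `KNNImputer` and `IterativeImputer`: either the imputer is fit on the combined data, or it is fit on the training split only and then applied to both splits.

```python
class MeanImputer:
    """Toy imputer: learns one fill value per column from whatever it is fit on."""

    def fit(self, rows):
        n_cols = len(rows[0])
        self.fill = []
        for j in range(n_cols):
            observed = [r[j] for r in rows if r[j] is not None]
            self.fill.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        return [[self.fill[j] if v is None else v for j, v in enumerate(r)]
                for r in rows]

# Hypothetical one-column splits; None marks a missing value.
train = [[1.0], [3.0], [None]]
test = [[9.0], [None]]

# Option A: fit on the whole dataset (train + test together) --
# the test values influence the learned fill statistic.
whole = MeanImputer().fit(train + test)

# Option B: fit on the training set only, then apply to both splits.
train_only = MeanImputer().fit(train)

print(whole.fill, train_only.fill)  # the learned fill values differ
```

The two options produce different fill values here (the whole-data fit averages over 1, 3, and 9; the train-only fit averages over 1 and 3), which is exactly the choice the question is about.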