Approach for Missing Value Imputation in Big Mart Sales Data



Hi, I am working on the Big Mart Sales data and I have noticed that people use different methods to impute the missing values.

For example, for the Item_Weight column, someone used the grepl function to find a pattern in the item ID, such as:

fd_id <- grepl("FD", data$Item_Identifier)

# filter FD*
fdw <- data$Item_Weight[fd_id]
meanfdw <- mean(fdw, na.rm = TRUE)

# fill only the missing FD weights with the FD mean
data$Item_Weight[fd_id & is.na(data$Item_Weight)] <- meanfdw

Someone else simply took the mean of the non-NA weight values and used it to fill in the NAs.

I understand that approaches may differ, but I believe domain knowledge is important, and without it, it is difficult to move in the right direction.

Also, someone is using KNN imputation for Item_visibility where it is 0, and someone else is using linear regression to solve the same issue.

How do I decide which approach to take, and does the choice have an impact on RMSE?

Kindly suggest. Thanks


Hi @mukul.mschauhan

There is no single defined way to impute missing values. There are various methods that can be used, each with its own pros and cons. Which technique to use depends on the dataset and the variable.

For example, you would not use the mean when you have outliers in your dataset, or the median when the variable is categorical.
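
To make this concrete, here is a minimal sketch in R. The values and the names weights and sizes are made up for illustration and are not taken from the Big Mart data.

```r
# Hypothetical toy values: one extreme outlier (100) and one missing entry
weights <- c(5, 6, 7, 8, 100, NA)

mean_fill   <- mean(weights, na.rm = TRUE)    # pulled upward by the outlier
median_fill <- median(weights, na.rm = TRUE)  # stays near the typical values

# For a categorical variable neither statistic applies,
# so the most frequent level (the mode) is the usual simple choice
sizes <- c("Small", "Medium", "Medium", NA, "High", "Medium")
mode_fill <- names(which.max(table(sizes)))   # table() ignores NA by default
sizes[is.na(sizes)] <- mode_fill
```

Here the mean lands far above every ordinary value because of the single outlier, while the median does not.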

Consider a problem where you have the columns Age, Gender and Married_status, and there are missing values in the Married_status column. You would use age and gender to impute these values, right? In this case, using the overall mode would not give the best results.
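
A rough sketch of that idea in R, under some assumptions: the data frame people is invented, the age cut-off of 30 is arbitrary, and for brevity only Age is used for grouping (Gender could be added as a second grouping factor in the same way).

```r
# Hypothetical toy data; column names follow the example above
people <- data.frame(
  Age = c(22, 24, 23, 25, 45, 50, 48, 21, 47),
  Gender = c("M", "F", "M", "F", "F", "M", "F", "F", "M"),
  Married_status = c("No", "No", "No", "No", "Yes", "Yes", "Yes", NA, NA),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of a vector
mode_value <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  names(which.max(table(x)))
}

# Overall mode: every missing row would get the same value
global_mode <- mode_value(people$Married_status)

# Mode within an age band (the cut-off of 30 is an assumption)
people$Age_band <- ifelse(people$Age < 30, "under_30", "30_plus")
mode_by_band <- tapply(people$Married_status, people$Age_band, mode_value)

# Fill each missing row with the mode of its own age band
missing <- is.na(people$Married_status)
people$Married_imputed <- people$Married_status
people$Married_imputed[missing] <- mode_by_band[people$Age_band[missing]]
```

With these toy values the overall mode is "No", so the 47-year-old would be filled as "No", whereas the age-band mode fills that row as "Yes".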

Basically, while performing EDA, you should try to figure out the possible ways in which you can fill the missing values, and which of them is most appropriate for the situation.
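
One way to compare candidate methods, sketched below, is to hide some values you do know, impute them with each method, and compare the error. Everything here is an assumption for illustration: the data is simulated, not the real Big Mart file, and the helper rmse is defined in the snippet. Note that this scores the imputation itself, not the final sales model; to see the effect on the model's RMSE you would have to rerun your model validation with each imputation.

```r
set.seed(42)  # reproducible toy data

# Simulated items: the identifier prefix determines the typical weight
n <- 200
prefix <- sample(c("FD", "DR"), n, replace = TRUE)
items <- data.frame(
  Item_Identifier = paste0(prefix, sprintf("%03d", seq_len(n))),
  Item_Weight = ifelse(prefix == "FD", rnorm(n, 15, 1), rnorm(n, 5, 1))
)

# Hide 40 known weights so each imputation can be scored against the truth
masked <- sample(n, size = 40)
truth <- items$Item_Weight[masked]
observed <- items$Item_Weight
observed[masked] <- NA

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Strategy 1: overall mean of the observed weights
global_fill <- rep(mean(observed, na.rm = TRUE), length(masked))

# Strategy 2: mean within each identifier prefix (first two characters)
grp <- substr(items$Item_Identifier, 1, 2)
group_means <- tapply(observed, grp, mean, na.rm = TRUE)
group_fill <- as.numeric(group_means[grp[masked]])

rmse_global <- rmse(truth, global_fill)
rmse_group <- rmse(truth, group_fill)
print(c(global = rmse_global, group = rmse_group))
```

On data built this way the prefix-based mean should come out well ahead, but on real data the ranking can differ, which is why it is worth checking rather than assuming.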


Thank you so much. Appreciate it.

Kindest Regards