Hi, I am working on the Big Mart Sales data and have noticed that people use different methods to impute the missing values.
For example, in the Item_Weight column, someone used the grepl function to find a pattern in the ID, such as:
fd_id <- grepl("FD", data$Item_Identifier)
fdw <- data$Item_Weight[fd_id]
meanfdw <- mean(fdw, na.rm = TRUE)
# Replace NAs in FD* weights with the mean FD* weight
data$Item_Weight[fd_id & is.na(data$Item_Weight)] <- meanfdw
Someone else simply took the mean of the non-NA weight values and used it to fill in the NAs.
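For comparison, here is a minimal sketch of that second approach as I understand it. The data frame below is made-up toy data standing in for the real table, and overall_mean is just a name I picked for illustration:

```r
# Toy data standing in for the real Big Mart table (values are made up).
data <- data.frame(
  Item_Identifier = c("FDA15", "DRC01", "FDN15", "NCD19", "FDX07"),
  Item_Weight     = c(9.3, 5.92, NA, 8.93, NA)
)

# Mean of all observed weights, ignoring the NAs.
overall_mean <- mean(data$Item_Weight, na.rm = TRUE)

# Fill every missing weight with that single value,
# regardless of which item category the row belongs to.
data$Item_Weight[is.na(data$Item_Weight)] <- overall_mean
```

Unlike the FD* version, this ignores the item prefix entirely, so every missing weight gets the same number.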
I understand that the approaches may differ, but I believe domain knowledge is important, and without it it is difficult to move in the right direction.
Also, some people use KNN imputation for Item_visibility where it is 0, while others use linear regression to solve the same issue.
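As far as I can tell, the regression version looks roughly like the sketch below. This is only my reading of it, on made-up toy data: the choice of Item_MRP as the predictor is an assumption for illustration, and the names zero_vis and fit are hypothetical. It uses only base R (lm and predict); a KNN version would need an extra package, so I have not sketched it here.

```r
# Toy data (made up); a visibility of 0 is treated as "missing".
data <- data.frame(
  Item_MRP        = c(50, 100, 150, 200, 250, 120),
  Item_visibility = c(0.021, 0.039, 0.062, 0.079, 0.101, 0)
)

# Flag the rows whose visibility is recorded as 0.
zero_vis <- data$Item_visibility == 0

# Fit a linear model on the rows with a usable visibility value.
# Using Item_MRP as the only predictor is an assumption for this sketch.
fit <- lm(Item_visibility ~ Item_MRP, data = data[!zero_vis, ])

# Replace the zeros with the model's predictions.
data$Item_visibility[zero_vis] <- predict(fit, newdata = data[zero_vis, , drop = FALSE])
```

One thing I am unsure about is that a linear model can predict negative values, which would not make sense for visibility, so the predictions probably need checking.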
How do I decide which approach to take, and does the choice have an impact on RMSE?
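One check I can think of is to hide some weights that are actually known, impute them with each method, and compare the errors. The sketch below does that on simulated data; everything in it is an assumption (the prefixes, the group centres, the 20% hold-out, and the helper name rmse), and the simulated weights depend on the prefix by construction, so the result says nothing about the real data. It also scores only the imputation itself, not the RMSE of the final sales model.

```r
set.seed(1)

# Simulated data: weights are drawn around a centre that depends on
# a two-letter ID prefix (hypothetical values, not the real data).
prefix <- sample(c("FD", "DR", "NC"), 300, replace = TRUE)
centre <- c(FD = 12, DR = 8, NC = 15)
data <- data.frame(
  Item_Identifier = paste0(prefix, sample(10:99, 300, replace = TRUE)),
  Item_Weight     = rnorm(300, mean = centre[prefix], sd = 1)
)

# Hide 20% of the known weights so the imputations can be scored.
hidden <- sample(nrow(data), 60)
truth  <- data$Item_Weight[hidden]
masked <- data$Item_Weight
masked[hidden] <- NA

# Root mean squared error between imputed and true values.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Approach A: one overall mean for every hidden value.
pred_overall <- rep(mean(masked, na.rm = TRUE), length(hidden))

# Approach B: mean within each two-letter prefix.
grp         <- substr(data$Item_Identifier, 1, 2)
group_means <- tapply(masked, grp, mean, na.rm = TRUE)
pred_group  <- as.numeric(group_means[grp[hidden]])

rmse_overall <- rmse(pred_overall, truth)
rmse_group   <- rmse(pred_group, truth)
print(c(overall = rmse_overall, by_prefix = rmse_group))
```

To see the effect on the sales model's RMSE, I assume the model would have to be refit on each imputed dataset and evaluated on the same validation split.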
Kindly suggest. Thanks