In the training dataset, the 4th column, named value, has some 0 values. Should we treat these as missing data and use imputation methods to fill them in? Kindly reply ASAP. --regards vishwa.
If all the values are zero in that column, then I don’t think it’ll make any sense to keep it or impute any value.
If you have any labeled data for that column that has other values, then you can go with imputation.
Check if your test set also has the same column with all zero values. If the values are zero in that column as well, you can drop this column. But if the test set has some meaningful values, try to figure out what the column could represent.
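A rough sketch of that check in pandas, in case it helps. The column name "value" comes from the question; the toy frames, the "id" column and the helper is_all_zero are made up for illustration.

```python
import pandas as pd

# Toy data: only the column name "value" is from the question, the rest is invented.
train = pd.DataFrame({"id": [1, 2, 3], "value": [0, 0, 0]})
test = pd.DataFrame({"id": [4, 5, 6], "value": [0, 0, 0]})


def is_all_zero(df: pd.DataFrame, column: str) -> bool:
    """Return True if every entry in the given column equals zero."""
    return bool((df[column] == 0).all())


if is_all_zero(train, "value") and is_all_zero(test, "value"):
    # The column is constant in both sets, so it carries no information.
    train = train.drop(columns=["value"])
    test = test.drop(columns=["value"])
```

If the test set turns out to have non-zero values, the condition is False and both frames are left untouched.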
I framed my question wrong. My bad. It's not all 0 values; only some of them are 0. I guess I need to use imputation here. Please confirm. Also, how do I go about finding corrupt or malformed data and fixing it?
If you have only a few missing values, you can impute them with the mean or mode of the column, or maybe use some other columns to fill in values in this column. Are you sure that the values in this column cannot be zero? (Maybe zero is a valid input.)
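Here is a minimal sketch of mean imputation with pandas, assuming you have already decided that 0 really means "missing" in this column. The numbers are invented; only the column name "value" is from the question.

```python
import numpy as np
import pandas as pd

# Toy column with two zeros standing in for missing entries (invented data).
df = pd.DataFrame({"value": [10.0, 0.0, 12.0, 14.0, 0.0]})

# Share of zeros, to judge whether "only a few" values are affected.
zero_share = (df["value"] == 0).mean()

# Mark zeros as missing so they do not pull the mean down.
df["value"] = df["value"].replace(0, np.nan)

# Mean of the remaining values (NaN is skipped by default).
fill_value = df["value"].mean()

# Fill the missing entries with that mean.
df["value"] = df["value"].fillna(fill_value)
```

You could swap mean() for median() or mode().iloc[0] depending on the column. It is usually safer to compute the fill value on the training set only and reuse it on the test set.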
I believe you are referring to the outliers in your dataset. You can plot charts to see if some values in your column are very different from the others, for example an age column with a value of 400. You can replace such values or delete the row. (Personally I'd prefer replacing the values, because the second option makes you lose data points.)
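If you want a numeric check alongside the chart, one common option is the 1.5 x IQR rule that box plots use for their whiskers. This is just one possible sketch, not the only way to do it; the age values below are invented, with 400 as the bad entry from the example above.

```python
import pandas as pd

# Toy age column with one implausible value (invented data).
df = pd.DataFrame({"age": [23.0, 35.0, 41.0, 29.0, 400.0, 52.0, 38.0]})

# Interquartile range of the column.
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["age"] < lower) | (df["age"] > upper)

# Replace flagged values with the median of the remaining rows,
# so no data points are dropped.
median_age = df.loc[~is_outlier, "age"].median()
df.loc[is_outlier, "age"] = median_age
```

To delete the rows instead, df = df[~is_outlier] would do it, at the cost of losing those data points.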
I’d recommend you to go through this article:
Thanks for the immediate reply.