While preparing a pivot table to fill in the missing values of Loan Amount (using 'SELF_EMPLOYED' as the index and 'EDUCATION' as the columns), I noticed that Kunal Sir, in his blog, replaced the missing values with the median of each group rather than the mean, i.e. he passed np.median as the aggfunc. Why did he use np.median instead of np.mean?


Does it have anything to do with the fact that the mean is computed from the sum of all values, outliers included?


Yes, your intuition is correct: when the data contains outliers, we generally prefer to impute missing values with the median, because it is far less affected by extreme values than the mean.
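
To make the difference concrete, here is a minimal sketch on a made-up toy table (the values and the "LoanAmount" column name are assumptions for illustration, not the real dataset). One very large loan in the self-employed group pulls the group mean far above the typical value, while the group median barely moves.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); one outlier (900) in the "Yes" group.
df = pd.DataFrame({
    "SELF_EMPLOYED": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
    "EDUCATION": ["Graduate"] * 8,
    "LoanAmount": [100.0, 120.0, 110.0, np.nan, 150.0, 160.0, 900.0, np.nan],
})

# Pivot table of group medians, laid out as in the question.
pivot_median = df.pivot_table(
    values="LoanAmount", index="SELF_EMPLOYED",
    columns="EDUCATION", aggfunc="median",
)

# Per-row group statistics, aligned with the original rows.
keys = ["SELF_EMPLOYED", "EDUCATION"]
group_mean = df.groupby(keys)["LoanAmount"].transform("mean")
group_median = df.groupby(keys)["LoanAmount"].transform("median")

# Fill the missing values with each statistic and compare.
imputed_mean = df["LoanAmount"].fillna(group_mean)
imputed_median = df["LoanAmount"].fillna(group_median)

print(pivot_median)
print(pd.DataFrame({"mean_fill": imputed_mean, "median_fill": imputed_median}))
```

In the "Yes" group the missing value is filled with about 403 under the mean but 160 under the median, which is much closer to the two ordinary loans in that group. In the "No" group, which has no outlier, both statistics give 110.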

That said, you should always experiment and use cross-validation to check which approach suits your data better.
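
One possible way to run that experiment is sketched below with scikit-learn. Everything here is an assumption for illustration: the data is synthetic, the skewed feature stands in for something like a loan amount, and the linear model and the mean-absolute-error metric are arbitrary choices. The idea is simply to put the imputer inside a pipeline so it is refit on each training fold, then compare the cross-validated scores of the two strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (hypothetical): one right-skewed feature, one normal feature.
rng = np.random.default_rng(0)
n = 200
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=n)
other = rng.normal(size=n)
y = 2.0 * skewed - 1.0 * other + rng.normal(scale=0.5, size=n)

# Knock out roughly 20% of the skewed feature to simulate missing values.
X = np.column_stack([skewed, other])
missing = rng.random(n) < 0.2
X[missing, 0] = np.nan

# Score each imputation strategy with 5-fold cross-validation.
cv_scores = {}
for strategy in ("mean", "median"):
    model = make_pipeline(SimpleImputer(strategy=strategy), LinearRegression())
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_scores[strategy] = scores.mean()

for strategy, score in cv_scores.items():
    print(f"{strategy}: mean CV score = {score:.3f}")
```

Scores closer to zero are better with this metric. Which strategy wins depends on the data and the model, so treat the result on your own dataset as the deciding factor rather than this toy run.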



