In my mind, there are multiple ways to treat dataset outliers
-> Delete the data -> Transform using log or binning -> Replace with the mean/median -> Treat the outliers separately
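For the transform and replace options above, here is a minimal sketch of what they might look like in pandas. The toy series and the 1.5*IQR fence rule are my own assumptions for illustration, not something fixed by the question:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one extreme value.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Option: log-transform to compress large values (needs non-negative data).
x_log = np.log1p(x)

# Option: replace values outside the 1.5*IQR fences with the median.
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
x_median = x.where(x.between(lower, upper), x.median())
```

The IQR fence is only one common rule of thumb; the cutoff (1.5 vs 3.0, say) is a judgment call per column.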
I have a dataset of around 50,000 observations, and each observation has quite a few outlier values (some variables have only a handful of outliers, others have 100-200), so excluding data is not what I'm looking for, as it would cause me to lose a huge chunk of data.
I read somewhere that mean/median replacement is meant for artificial outliers, but in my case I think the outliers are natural.
I was actually about to use the median to get rid of the outliers and then the mean to fill in missing values, but that no longer seems right. So I'm really confused about which technique to use, since the data is giving me 97% accuracy right now with all the outliers included. Should I bin the data in each column into ranks from 1-10? Also, should I normalize or standardize my data before applying any model? Any guidance would be appreciated.
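For the 1-10 binning idea, one common way is quantile binning, which turns each column into ranks so outliers just land in the top bin instead of dominating the scale; standardization is then a separate z-score step. A sketch, assuming a made-up skewed column of 50,000 values (the lognormal data and the train-only caveat are my assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column of 50,000 values.
rng = np.random.default_rng(0)
col = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=50_000))

# Quantile binning into 10 equal-frequency ranks labeled 1-10:
# extreme values simply fall into bin 10 rather than stretching the scale.
bins = pd.qcut(col, q=10, labels=range(1, 11)).astype(int)

# Standardization (z-score). In practice, fit the mean/std on the
# training split only and apply the same parameters to the test split.
standardized = (col - col.mean()) / col.std()
```

Whether to normalize or standardize depends on the model: distance- and gradient-based models (k-NN, SVM, linear/logistic regression, neural nets) usually need it, while tree-based models generally do not.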