Data Anonymization



Is there a need for data anonymization, especially for datasets that the person doing the analysis is also part of? Doesn't that increase the probability of selection bias when the analyst knows that they or their acquaintances are part of the dataset?

Also, does anonymizing data lead to poorer results, since we no longer have a complete view of the data?


Why would there be no need for data anonymization? Sometimes the data are restricted by legal contracts; sometimes they are too important to be out in the open. Either way, there are situations where anonymizing the data is unavoidable.

Regarding your second question, anonymizing data means throwing away information, so it's reasonable to assume that a model built on the open data will almost always outperform one built on the anonymized dataset.
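
To make the "throwing away information" point concrete, here is a minimal sketch of one common anonymization step, generalization of quasi-identifiers. The records, the field names, and the helpers `generalize_age` and `anonymize` are all made up for illustration; real anonymization schemes are more involved than this.

```python
# Toy, made-up records; every value here is hypothetical.
records = [
    {"name": "Alice", "age": 34, "zip": "90210", "diagnosis": "flu"},
    {"name": "Bob", "age": 37, "zip": "90213", "diagnosis": "asthma"},
    {"name": "Carol", "age": 52, "zip": "10001", "diagnosis": "flu"},
    {"name": "Dave", "age": 58, "zip": "10003", "diagnosis": "diabetes"},
]


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record: dict) -> dict:
    """Drop the direct identifier and coarsen the quasi-identifiers."""
    return {
        "age_band": generalize_age(record["age"]),
        "zip_prefix": record["zip"][:3] + "**",
        "diagnosis": record["diagnosis"],
    }


anonymized = [anonymize(r) for r in records]

# Count how many distinct (age, zip) combinations survive the step.
distinct_before = len({(r["age"], r["zip"]) for r in records})
distinct_after = len({(r["age_band"], r["zip_prefix"]) for r in anonymized})

print(distinct_before, distinct_after)
```

On this toy data the four distinct (age, zip) pairs collapse into two groups. That is good for privacy, but any model trained on the anonymized rows can no longer tell the members of a group apart by age or location, which is where the loss in predictive power comes from.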