I am trying to find outliers using clustering analysis.
Data size: > 50 million records
Total columns: 50 [39 categorical, 12 numerical]
Domain: Healthcare
- About 5-6 categorical variables have more than 10,000 possible values
- About 12-14 have around 15 possible categories
Is clustering the right way to look for outliers in this scenario?
What are the best feature engineering methods (feature selection and dimensionality reduction) in this case?
Is it advisable to run k-means after converting all the categorical variables to numerical ones? If so, any ideas and pointers on how to do that would be welcome.
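To make the k-means question concrete, here is a rough sketch of the kind of pipeline I have in mind, on small synthetic data. Everything in it is an assumption for illustration: the column names (`diagnosis`, `cost`), the choice of frequency encoding for the categorical column, k = 3, and using the distance to the nearest centroid as the outlier score. It uses a plain NumPy k-means only to stay self-contained; at 50 million rows something like scikit-learn's `MiniBatchKMeans` would presumably be needed instead.

```python
import numpy as np

def frequency_encode(column):
    """Replace each category by its relative frequency in the column."""
    _, inverse, counts = np.unique(column, return_inverse=True, return_counts=True)
    return counts[inverse] / len(column)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the fitted centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random (fancy indexing copies them).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every row to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def outlier_scores(X, centroids):
    """Euclidean distance from each row to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))

# Synthetic stand-in for the real data (hypothetical columns).
rng = np.random.default_rng(42)
n = 1000
diagnosis = rng.choice(["A", "B", "C"], size=n, p=[0.6, 0.3, 0.1])
cost = rng.normal(100.0, 10.0, size=n)
cost[0] = 500.0  # one planted extreme value

# Encode the categorical column, then standardize all features.
X = np.column_stack([frequency_encode(diagnosis), cost])
X = (X - X.mean(axis=0)) / X.std(axis=0)

centroids = kmeans(X, k=3)
scores = outlier_scores(X, centroids)

# Rows with the largest scores are the outlier candidates.
top = np.argsort(scores)[::-1][:5]
print(top, scores[top])
```

Frequency encoding keeps a 10,000-level column as a single numeric feature rather than 10,000 one-hot columns, but it also maps different categories with the same frequency to the same value, so I am not sure whether it is appropriate here.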
Is it advisable to use K-prototypes? If so, is it reliable and mature enough to work with? Any theory and pointers to a code base would be appreciated.
Any other sample code would help.
I am looking for ideas and direction on how to approach this problem, using Python for the coding.