Clustering using mixed variables, with categorical variables having about 10,000 categories

outliers
categorical
machine_learning
data_science
python
#1

I am trying to find outliers using clustering analysis.

Data size: more than 50 million records

Total columns: 50 [39 categorical, 12 numerical]

Domain: Healthcare

Problem:

  • About 5-6 categorical variables have more than 10,000 possible values each
  • About 12-14 categorical variables have about 15 possible categories each

  1. Is clustering the right way to look for outliers in this scenario?

  2. What are the best feature engineering methods (feature selection and dimensionality reduction) in this case?

  3. Is it advisable to run k-means after converting all the categorical variables to numerical ones? If yes, any ideas and pointers on how to do the conversion would help.

  4. Is it advisable to use K-prototypes? If yes, is it reliable and mature enough to work with? Any theory references and pointers to the code base are appreciated.
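
To make question 3 concrete, here is a minimal sketch of one possible conversion, frequency encoding, where each category is replaced by its relative frequency in the column. It is an illustration only and not necessarily the right choice for this data; the function name `frequency_encode` and the sample codes are made up.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Hypothetical high-cardinality column; the codes below are invented for illustration.
codes = ["A10", "B20", "A10", "C30", "A10", "B20"]
encoded = frequency_encode(codes)
print(encoded)
```

This keeps a 10,000-category column as a single numeric column instead of 10,000 one-hot columns, but two different categories with the same frequency become indistinguishable, so it may hide exactly the structure that matters.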

K-prototypes: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py
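
As far as I understand it, K-prototypes measures the distance between a record and a cluster prototype as the squared Euclidean distance on the numerical attributes plus a weight (usually called gamma) times the number of mismatching categorical attributes. The sketch below is my own pure-Python illustration of that measure, not the code from the kmodes repository; the function name, the gamma value, and the sample record are all assumptions.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma times
    the count of categorical attributes that differ."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(1 for x, y in zip(a_cat, b_cat) if x != y)
    return numeric_part + gamma * categorical_part


# Hypothetical record and prototype: (numeric values, categorical values).
record = ([0.2, 1.5], ["cardiology", "inpatient"])
prototype = ([0.0, 1.0], ["cardiology", "outpatient"])

d = mixed_dissimilarity(record[0], record[1], prototype[0], prototype[1], gamma=0.5)
print(d)
```

With this measure a mismatch on a 10,000-value attribute costs the same as a mismatch on a 15-value attribute, which is part of why I am unsure whether it suits this data.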

Any other sample code would help.

I am looking for ideas and direction on how to approach this problem. I am using Python for the implementation.