Hello,

I have a question about the validity of, and the problems associated with, applying clustering methods (e.g., k-means, spectral clustering, DBSCAN, …) to a data set whose dimensionality has been reduced using the t-SNE algorithm.

An earlier blog post on this site (t-sne-implementation) states that t-SNE is a good dimension-reduction method to apply prior to clustering, but it gives no real details on why, or on whether there are limitations or concerns to take into consideration.

On other sites (k-means-clustering-on-the-output-of-t-sne) I see very strong statements against using t-SNE as preparation for k-means clustering. The basic caution is that distance- and density-based information from the data set is lost in the t-SNE mapping process.

Is there a way to think about this from a quantitative point of view, i.e., are there tests or evaluations that help decide whether clustering on t-SNE-mapped data is appropriate? Is there a way to quantify whether the t-SNE mapping has caused some loss of information or (more significantly) created information that did not exist in the original data set, from the perspective of the information desired from a cluster analysis?
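To make the question concrete, here is the kind of check I have in mind. It is only a minimal sketch, assuming scikit-learn is available and using synthetic `make_blobs` data as a stand-in for my real data; the helper name `neighborhood_scores` is my own. It uses `sklearn.manifold.trustworthiness`, which penalizes points that are neighbors in the embedding but were not neighbors in the original space (roughly, "created" structure). Calling it with the arguments swapped gives the complementary continuity score, which penalizes original neighbors that were pulled apart (roughly, "lost" structure).

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness


def neighborhood_scores(X, n_neighbors=10, perplexity=30.0, seed=0):
    """Embed X with t-SNE and score how well k-nearest neighborhoods survive."""
    emb = TSNE(
        n_components=2, perplexity=perplexity, random_state=seed
    ).fit_transform(X)
    # Low trustworthiness: the embedding shows neighbors that were not
    # neighbors in the original feature space.
    trust = trustworthiness(X, emb, n_neighbors=n_neighbors)
    # Swapping the roles of the two spaces yields continuity:
    # low values mean original neighbors were separated by the mapping.
    cont = trustworthiness(emb, X, n_neighbors=n_neighbors)
    return emb, trust, cont


# Synthetic stand-in (assumption): 300 rows, 60 features, 4 blobs.
X, _ = make_blobs(n_samples=300, n_features=60, centers=4, random_state=0)
emb, trust, cont = neighborhood_scores(X)
print(f"trustworthiness={trust:.3f}  continuity={cont:.3f}")
```

Both scores lie in [0, 1] and only describe local neighborhoods of size `n_neighbors`; they say nothing about whether distances between clusters or cluster densities are preserved, which is exactly the part I am unsure about.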

In my current case, I am working with a moderate-size data set (40,000 rows, 60 features). Standard clustering techniques (k-means, DBSCAN, spectral clustering) applied directly to the original features do not produce clusters with interpretable results. Yet when I run basic k-means on the 2-D mapping from t-SNE, it identifies ~8-10 clusters (the range with the highest silhouette scores) that do seem to be interpretable.
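One sanity check I could imagine for this workflow is to take the cluster labels found in the t-SNE plane and score them again in the original 60-dimensional space. The sketch below is an assumption-laden illustration, not my actual pipeline: it uses scikit-learn, synthetic `make_blobs` data instead of my data set, and a helper name (`silhouette_in_both_spaces`) of my own invention.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def silhouette_in_both_spaces(X, n_clusters, seed=0):
    """Cluster the 2-D t-SNE embedding, then score those labels twice."""
    emb = TSNE(n_components=2, random_state=seed).fit_transform(X)
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(emb)
    # Silhouette measured in the embedding the labels were fitted on.
    sil_embedded = silhouette_score(emb, labels)
    # Silhouette of the same labels measured on the original features.
    sil_original = silhouette_score(X, labels)
    return labels, sil_embedded, sil_original


# Synthetic stand-in (assumption): 300 rows, 60 features, 4 blobs.
X, _ = make_blobs(n_samples=300, n_features=60, centers=4, random_state=0)
labels, sil_emb, sil_orig = silhouette_in_both_spaces(X, n_clusters=4)
print(f"silhouette in t-SNE space: {sil_emb:.3f}")
print(f"silhouette in original space: {sil_orig:.3f}")
```

My tentative reading would be that a high silhouette in the t-SNE plane paired with a near-zero or negative silhouette on the original features suggests the separation owes a lot to the mapping itself, but I do not know whether that interpretation is sound, since silhouette in 60 dimensions has its own problems.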

I am interested in thoughts, concerns, and any mathematical understanding of the validity or invalidity of this approach.

Thank you.