Which dimensionality reduction technique to use for the given problem?


An advertising company wants to predict the likelihood of purchase, using a training data set containing hundreds of columns of demographic data such as age, location, and income. The large dimensionality of this data set poses a problem for training the model, and no features that represent broader groups of people are available. What would be a reasonable approach to reduce the dimensionality of the training data?


There are a lot of approaches for dimensionality reduction, each with its pros and cons:

1) Principal Component Analysis (PCA)
2) Feature importance
3) Statistical removal of variables, e.g. lasso regression or dropping variables based on p-values

See, there are two aspects of machine learning or prediction:

  1. which variables are statistically important for interpreting the model
  2. which variables are most important for the prediction aspect of the model

When using linear or logistic regression, use lasso regularization, and VIF for multicollinearity treatment.
When using other ML algorithms, go for feature importance selection.
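As a minimal sketch of the lasso idea above (assuming scikit-learn; the dataset here is synthetic and purely illustrative), the L1 penalty drives the coefficients of uninformative columns to exactly zero, which acts as feature selection:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # 10 demographic-style columns
# only columns 0 and 1 actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# the L1 penalty zeroes out coefficients of irrelevant features
kept = np.flatnonzero(lasso.coef_ != 0)
print("features kept:", kept)
```

Only the columns with non-zero coefficients would be carried forward into the model.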

3) Suppose the dimensionality of your dataset is very large, with thousands of columns and rows (as in an NLP scenario). Don't go for the above-mentioned methods; go for PCA, because it is very fast for dimensionality reduction.
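A minimal PCA sketch, assuming scikit-learn (the low-rank synthetic matrix is an assumption standing in for the wide demographic table): passing a float to `n_components` keeps just enough components to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# synthetic wide dataset: 300 correlated columns generated from 20 latent factors
latent = rng.normal(size=(500, 20))
W = rng.normal(size=(20, 300))
X = latent @ W + 0.1 * rng.normal(size=(500, 300))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90)        # keep components explaining 90% of variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```

The reduced matrix can then be fed to the purchase-likelihood model in place of the original columns.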

Thanks and Regards
Wishy Verma

Can we use K-Means for dimensionality reduction in this case? We stated that we don't have training data that includes known demographic groups.

Please let me know your thoughts.

Kindly explain the problem statement clearly: is it supervised, semi-supervised, or unsupervised?
If you want reduced features, then go for what I mentioned above.
However, the number of training rows should not be reduced, because machine learning always requires lots of data.

However, if you have a sufficient amount of training data that explains the target variable (i.e. the target variable does not have large variance), then you can filter that data.


Please take a look at the below

Given that you want a real likelihood, a probabilistic method like probabilistic graphical models or Bayesian inference would be the best strategy, provided you have the right amount of data points (not considering whether they are labeled or not; you just need the same amount of labels for each data class).
For example, 10,000 to about 1,000,000 points is a small but relatively representative dataset.
For a single variable, 10,000 points would indicate a 99.999% chance of correctness in the frequentist approach. (Parameters are often also counted as variables, as are latents.)

For example, you can reduce your data using a GMM or PMM to cluster it into a number of clusters equal to your number of classes, and then create a dummy categorical variable to represent your classes, e.g. using a delta distribution.
Even though the Gaussian model is the most complex model, with far more parameters, the delta distribution is, in the case of minimal labels, also quite demanding to learn, since you always have to define the probability, given a cluster, that it belongs to the specific label.
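To illustrate the GMM step (a sketch assuming scikit-learn; the two synthetic blobs are an assumption standing in for unknown demographic groups), the fitted mixture assigns each row a hard cluster label that can serve as the dummy categorical variable:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two well-separated synthetic "demographic" blobs in 5 features
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 5)),
    rng.normal(loc=5.0, scale=1.0, size=(150, 5)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)   # hard assignment -> dummy categorical variable
print(np.bincount(labels))
```

`gmm.predict_proba(X)` would give the soft class probabilities instead of the hard labels.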

If there is a lack of data, K-Means or K-Nearest Neighbors could be a good choice.
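On the earlier question of whether K-Means can do dimensionality reduction here: one common trick (a sketch assuming scikit-learn; the random matrix is illustrative) is to use the distances to the k cluster centroids as a new, much narrower feature set:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 100))  # wide demographic-style matrix

# KMeans.transform maps each row to its distances from the k centroids,
# turning 100 columns into k "group affinity" features
kmeans = KMeans(n_clusters=8, n_init=10, random_state=1).fit(X)
X_new = kmeans.transform(X)
print(X.shape, "->", X_new.shape)
```

Note that this needs no labeled demographic groups; the clusters are found unsupervised.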

You could consider a neural-network-like method such as PCA, non-negative matrix factorisation, or sparse coding, but I am not sure how well you can refer back to the original data points at that point, since it is a transformation into another latent dimension which, especially in a greedy approach, still has to be invertible.
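On the invertibility point: with scikit-learn's NMF, `inverse_transform` maps the latent codes back to the original feature space, so the round-trip reconstruction error can be checked directly (a sketch on a synthetic non-negative matrix; the error tolerance is an assumption):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
X = rng.random(size=(100, 50))   # NMF requires non-negative input

nmf = NMF(n_components=10, init="random", random_state=2, max_iter=500)
W = nmf.fit_transform(X)         # reduced 10-dimensional representation
X_back = nmf.inverse_transform(W)  # map latent codes back to feature space
rel_err = np.linalg.norm(X - X_back) / np.linalg.norm(X)
print(W.shape, round(rel_err, 3))
```

PCA supports the same `inverse_transform` round-trip, so "referring back" to the original space is lossy but straightforward for these linear methods.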

Nevertheless, using feature importance methods with an ensemble model, or covariance analysis, is a good start and proves that you can actually work in an econometric context.
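A minimal sketch of ensemble feature importance (assuming scikit-learn; the synthetic target with two real drivers is an assumption for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))
# only columns 2 and 7 actually drive the target
y = 4 * X[:, 2] + 2 * X[:, 7] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=3).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("top features:", ranking[:2])
```

Keeping only the top-ranked columns is then a simple, model-driven dimensionality reduction.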

For the rest, it is totally up to you and the properties of your data.
Keep in mind that specific features can influence multiple classes without being significant in a mixture model, but they could have a much more prominent role using neural networks or sparse coding. Sparse coding, however, does not build one-hot latent variables/clusters.
