I am unable to understand the concept of the ‘Curse of Dimensionality’. How can a large number of variables negatively affect model performance? If we have more variables to understand the data and make predictions, then why is it common practice to reduce the dimensionality of the dataset?
When we have a large number of variables (say more than 1000), it is practically impossible to analyze all of them, and the training time in such cases will be huge. So, to overcome these problems, we try to reduce the dimensions of our data so that it can be analyzed easily and the computation time will be less.
While reducing the dimensions of our data, we try to preserve as much information as we can. There are various methods to reduce the dimensions of our data. You can have a look at the below-mentioned article to have a better understanding of what dimensionality reduction is and how one can reduce the dimensions of the dataset without losing much information.
The simple answer is: if 10 variables are giving you what you want, why go for 20?
Dimensionality reduction is the process of reducing the number of variables under consideration. By obtaining a set of principal variables (a reduced set of variables that preserves most of the relevant information), we can speed up the training time and avoid overfitting when learning a model.
Dimensionality reduction can be done by feature selection (e.g., forward elimination, backward elimination) or feature extraction (Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Kernel PCA, etc.).
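As a rough sketch of feature extraction (assuming scikit-learn is available; the synthetic data here is purely for illustration), PCA can compress 10 correlated features down to the 2 directions that carry almost all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 10 features, but only 2 underlying directions of variation
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # 200 x 2 instead of 200 x 10
print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # nearly all variance kept
```

A model trained on `X_reduced` sees far fewer columns, which is exactly the speed/overfitting trade mentioned above.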
That makes sense, but my question is: if the model is able to work with 20 variables, then why should I bother removing any of them?
With a large number of variables, the computation time will be higher. And when you are getting a similar result with a smaller number of variables, why should you bother to keep all the variables and increase the computational time?
Hope it is clear!!
Let’s take an example to clarify…
Take a toy dataset with one independent variable and one dependent variable, say (x, y). Plot a 2D graph; now add another variable x1 = x + 4 and plot a 3D plot…
Now build the same model on both sets and compare accuracy…
The accuracy decreases when we add the extra variable because…
The model tries to fit the data such that it explains the maximum variance…
When you add a linearly dependent independent variable, you increase the variance in the data that your model tries to fit, and it fails…
Thus, when two variables are explaining the same thing, you need to drop one so that the model will not have the headache of fitting that extra variance…
It’s my point of view; if anyone finds a flaw, please reply.
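One way to make the toy example above concrete (a numpy-only sketch; the variable names are my own): adding x1 = x + 4 makes the design matrix linearly dependent, which shows up as an exploding condition number, the standard measure of how unstable a least-squares fit will be:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Design matrix with an intercept column
X1 = np.column_stack([np.ones(100), x])
# Add a linearly dependent variable x1 = x + 4
X2 = np.column_stack([np.ones(100), x, x + 4])

# Condition number measures how unstable fitting the matrix is
print(np.linalg.cond(X1))  # small: well-conditioned
print(np.linalg.cond(X2))  # enormous: the matrix is numerically singular
```

The third column is an exact linear combination of the first two, so the fit has no unique solution, which is the instability the answer above describes.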
You can think of the curse of dimensionality in terms of the number of data points you would need to explain the variance across all the dimensions. As the dimensions increase, the same number of data points will occupy a very small fraction of the whole data space.
Here’s a brief overview of why you will need an exponentially larger number of training data points as you add more dimensions:
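A quick back-of-the-envelope sketch of that exponential growth (the choice of 10 bins per axis is my own, for illustration): if you want to keep the same resolution along every axis, the number of cells you must populate with data multiplies with every dimension you add.

```python
# To keep a fixed resolution of 10 bins along every axis, the number
# of cells (and hence the data points needed to populate them)
# grows exponentially with dimension.
bins_per_axis = 10
for d in (1, 2, 3, 5, 10):
    cells = bins_per_axis ** d
    print(f"{d} dimensions -> {cells:,} cells to cover")
```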
But not all algorithms are affected by the curse of dimensionality in the same way; some are affected much worse than others. Models like kNN and kMeans are affected the most, while ensemble models like xgboost or random forests are least affected. The following link has a more detailed description of how a few popular models are affected:
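To illustrate why distance-based models like kNN suffer the most, here is a simulation sketch (my own construction, using uniform random data): in high dimensions, the nearest and farthest neighbours of a query point end up at almost the same distance, so "nearest" loses its meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=1000):
    """(max dist - min dist) / min dist from one query to random points."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

low = relative_contrast(2)
high = relative_contrast(1000)
print(low, high)  # the contrast collapses as dimension grows
```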
(Disclosure: the above question was asked by me a few years back)
Let me give a basic, intuitive example:
say that I like the following movies:
Batman Begins, The Dark Knight, Avengers, Inception, Interstellar
And I don’t like the following movies:
Spiderman, X-men, Dunkirk
Now, as scenario 1, you have only the following features:
Director, Genre, Production house.
From these, it appears that I like movies from Marvel, or sci-fi movies from Christopher Nolan. It is simple.
As scenario 2, you also have 10 other features such as:
main characters, IMDb rating, total budget, year of release, etc…
When you have so many features, you will need more data points (movies that I like/don’t like) to make similar inferences. This shows how having more features with the same number of data points can be problematic.
The curse of dimensionality is the idea that the more columns you have, the emptier the space they create becomes. There is a fascinating mathematical result that the more dimensions you have, the more all of the points sit on the edge of that space.
If you just have a single dimension where things are random, then points are spread out all over. But if it is a square, a point in the middle must avoid the edge in both dimensions, so it is a little less likely that a point is away from every edge. With each dimension you add, it becomes multiplicatively less likely that the point is away from the edge in every dimension, so in high dimensions, everything sits on the edge.
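A small simulation makes this concrete (the setup is my own: uniform points in a unit hypercube, with "near the edge" meaning within 0.02 of some face):

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_near_edge(n_dims, margin=0.02, n_points=10_000):
    """Fraction of uniform random points within `margin` of some face."""
    pts = rng.uniform(size=(n_points, n_dims))
    near = ((pts < margin) | (pts > 1 - margin)).any(axis=1)
    return near.mean()

for d in (1, 10, 100):
    print(d, fraction_near_edge(d))
# P(a point is not near any face) = 0.96**d, which vanishes as d grows,
# so in 100 dimensions nearly every point sits near the edge.
```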
What that means in theory is that the distance between points is much less meaningful. If that mattered in practice, it would suggest that when you have lots of columns and you just use them without being careful to remove the ones you do not care about, things will not work. This turns out not to be the case, for a number of reasons:
- Points still have different distances from each other. Even though they are all on the edge, they still vary in how far apart they are, so one point can still be more similar to a second point than to a third.
- So things like k-nearest neighbours actually work really well in high dimensions, despite what the theoreticians claimed. What really happened here was that in the 90’s, theory took over machine learning.
There was this concept of support vector machines that were theoretically well justified, extremely easy to analyze mathematically, and you can prove things about them — and we lost a decade of real practical development. And all these theories became very popular like the curse of dimensionality. Nowadays the world of machine learning has become very empirical and it turns out that in practice, building models on lots of columns works really well.
This was explained in a lecture by Jeremy Howard http://course.fast.ai/lessonsml1/lesson1.html
You can jump to [38:16] in the video for more details.