- Can you guys please explain PCA in detail with a simple example (how orthogonal components are formed , Covariance Matrix).
- The analytical workflow after PCA. Like if I have to do K-Means or run GAM after reducing the dimensions.
Principal component analysis works by trying to find ‘m’ orthogonal dimensions in an ‘n’-dimensional space such that m < n, chosen so that they capture the maximum variation in the data.
First, each data point is projected onto a dimension and the variance along that dimension is calculated. A dimension here is simply a linear combination of the variables, with the coefficients capturing the correlation between each variable and the dimension. Each such dimension, or principal component, is called an eigenvector.
While creating this linear combination there is a constraint that the length of the eigenvector is 1, i.e. it is a unit vector.
Similarly, variances along other candidate dimensions are calculated, and the dimension that captures the maximum variation in the data, i.e. has the maximum variance, is the first principal component. The variance along each principal component (PC) is called its eigenvalue. After the first PC has been found, an orthogonal dimension that explains the maximum proportion of the remaining variation is found, and so on, until n principal components have been generated, which together explain 100% of the variation in the data. This is the crux of how PCA works. Let’s understand the terms eigenvector/eigenvalue more deeply.
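The steps above can be sketched in a few lines of NumPy (the data here is synthetic, just for illustration): standardize, build the covariance matrix, take its eigendecomposition, and sort the components by variance.

```python
import numpy as np

# A minimal PCA sketch via eigendecomposition; the data is made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 observations, 3 variables
X[:, 1] += 0.8 * X[:, 0]               # introduce some correlation

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
cov = np.cov(Xs, rowvar=False)              # covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]           # sort PCs by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()         # proportion of variance per PC
print(explained)                            # all n PCs together sum to 100%
```

Note that each column of `eigvecs` has length 1 (the unit-length constraint) and the columns are mutually orthogonal.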
2. The concept of eigenvectors and eigenvalues:
In the above diagram the matrix is the covariance matrix of the standardized variables x1 and x2.
What this diagram depicts is that when a vector (-1, 1) is multiplied by the covariance matrix, it rotates in a certain direction. Multiplying by the covariance matrix again rotates it a little more, and in this way, after a certain number of multiplications, it lies along the dimension e2. Further multiplication only increases its length and does not change its direction. As you can see, the slopes slowly converge to 0.454. At this point the eigenvector, i.e. the principal component, has been found.
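This repeated-multiplication idea is known as power iteration, and it is easy to try directly. The 2x2 covariance matrix below is made up for illustration (the post's actual matrix is not reproduced here), so the limiting slope will differ from the 0.454 in the diagram:

```python
import numpy as np

# Power iteration: repeatedly multiply a vector by the covariance matrix.
# The direction converges to the leading eigenvector (the first PC).
cov = np.array([[1.0, 0.4],
                [0.4, 0.7]])            # illustrative covariance matrix

v = np.array([-1.0, 1.0])               # start from the vector (-1, 1)
for _ in range(50):
    v = cov @ v                         # multiply by the covariance matrix
    v = v / np.linalg.norm(v)           # renormalize: only direction matters

lead = np.linalg.eigh(cov)[1][:, -1]    # reference: leading eigenvector
print(v, lead)                          # same direction, up to sign
```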
Note: the covariance matrix is used here because the standard deviations of the two columns x1 and x2 were similar. In general this will not be the case, and the correlation matrix is used instead.
To find the eigenvalues and eigenvectors we have to solve the characteristic equation det(Σ − λI) = 0 for the eigenvalues λ, and then Σv = λv for the corresponding eigenvectors v, where Σ is the covariance matrix.
This is the first principal component, i.e. the eigenvector corresponding to the first PC. Similarly the second principal component is found, which is orthogonal to the first PC.
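For a 2x2 matrix the characteristic equation det(Σ − λI) = 0 reduces to a quadratic, λ² − trace(Σ)·λ + det(Σ) = 0, so the whole calculation can be done by hand. A sketch with an illustrative matrix (not the one from the post):

```python
import numpy as np

# Solve the 2x2 characteristic equation by hand and recover the first PC.
cov = np.array([[1.0, 0.4],
                [0.4, 0.7]])                    # illustrative values

tr, det = np.trace(cov), np.linalg.det(cov)
disc = np.sqrt(tr**2 - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2   # the two eigenvalues

# Each eigenvector v solves (cov - lam*I) v = 0; parametrize v = (1, slope).
slope1 = (lam1 - cov[0, 0]) / cov[0, 1]
v1 = np.array([1.0, slope1])
v1 = v1 / np.linalg.norm(v1)                    # unit length, as required

print(lam1, lam2, v1)
```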
That was quite some vector algebra, but what it actually did was capture part of the variation in x1 and x2 within each eigenvector. The share of the variation captured is the eigenvalue of that eigenvector.
I hope this gives you a basic understanding of how PCA works.
After you have done PCA, the next step is to extract the rotated components.
For example, suppose you have 100 variables and you get 15 principal components, of which only 5 are needed to explain 70% of the variation in the data. You would then rotate only these 5 components and check which variables load onto which component. This is helpful in naming the components as new variables. In our example, since 2 components explain 95% of the variation, we take only those two.
Here Comp.2 is mainly correlated with Sepal.Width, and the other 3 variables with Comp.1.
Now we take only these 2 components and use them for further analysis, whatever that might be: linear regression, clustering, etc.
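As a sketch of this post-PCA workflow (including the K-Means case asked about above): project the standardized data onto the top 2 PCs, then cluster the scores. The data and the bare-bones k-means below are illustrative; in practice you would typically use a library implementation such as sklearn's.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)),       # two synthetic groups,
               rng.normal(3, 1, (50, 4))])      # e.g. two "species"

Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
scores = Xs @ eigvecs[:, ::-1][:, :2]           # scores on the top 2 PCs

def kmeans(Z, k, iters=20):
    """Bare-bones Lloyd's algorithm on the PC scores."""
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([Z[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(scores, k=2)                    # cluster in the reduced space
```

The same `scores` matrix could just as well be fed into a regression or any other downstream model in place of the original variables.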
Hope this helps!!
Please explain the difference between SVD and PCA
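One way to see the connection numerically: for a centered data matrix X, the right singular vectors from the SVD of X are the eigenvectors of its covariance matrix, and the squared singular values divided by (n − 1) are the eigenvalues. A quick sketch (synthetic data):

```python
import numpy as np

# SVD vs PCA: both recover the same principal directions and variances.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                      # center the columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

print(np.sort(s**2 / (len(X) - 1)))         # matches the eigenvalues
print(np.sort(eigvals))
```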