Clustering and PCA Session
Principal Component Analysis
* Applications of PCA:
1. Dimensionality reduction - explain with one example.
2. Data visualization - we can't visualize beyond 3D, so if we want to visualize multidimensional data, say 10-dimensional data, we can use PCA to bring it down to 2 or 3 dimensions.
3. Data anonymization - if you don't want to share confidential information, you can share the PCA-transformed features instead of the raw ones. Give the example of credit card fraud detection: https://www.kaggle.com/mlg-ulb/creditcardfraud
4. Factor analysis - a PCA component is basically a linear combination of multiple features, so you can use it for factor analysis.
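For points 1 and 2, here is a minimal sketch of reducing 10-dimensional data to 2 dimensions so it can be plotted. The data is synthetic and the names (`hidden`, `mixing`, `reduced`) are made up for illustration. It uses NumPy's SVD on the centered data, which is one common way to compute PCA, not necessarily what the case study code does.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 samples with 10 features, driven by 2 hidden factors plus a little noise.
hidden = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
data = hidden @ mixing + 0.05 * rng.normal(size=(100, 10))

# PCA works on centered data.
centered = data - data.mean(axis=0)

# SVD: the rows of vt are the principal directions, ordered by decreasing singular value.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Keep the first 2 directions, so each sample becomes a 2-D point we could scatter-plot.
reduced = centered @ vt[:2].T

# Fraction of the total variance that the 2 kept components retain.
variance_kept = (s[:2] ** 2).sum() / (s ** 2).sum()

print(reduced.shape)
print(round(float(variance_kept), 3))
```

Each column of `reduced` is a linear combination of all 10 original features, which is also why the raw feature values are no longer directly visible in the transformed data.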
How PCA does dimensionality reduction:
Now, how can I represent this same 2D data in only 1 dimension without losing any information?
We can do this by rotating the axes. If I rotate the axes like this, all the points lie on a single axis, so all the information is preserved on just 1 axis.
So we have converted 2D into 1D.
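A small sketch of the rotation idea, assuming hypothetical points that lie exactly on the line x2 = 2 * x1. After rotating the axes so that the first new axis runs along that line, the second coordinate of every point is zero, so one number per point is enough.

```python
import numpy as np

# Hypothetical 2-D points lying exactly on the line x2 = 2 * x1.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
points = np.column_stack([x1, 2.0 * x1])

# Angle between the line and the original x1 axis.
theta = np.arctan2(2.0, 1.0)

# Rows are the new (rotated) axes written in the old coordinates.
new_axes = np.array([
    [np.cos(theta), np.sin(theta)],    # new axis 1: along the line
    [-np.sin(theta), np.cos(theta)],   # new axis 2: perpendicular to the line
])

# Coordinates of every point in the rotated axes.
rotated = points @ new_axes.T

print(np.round(rotated, 6))
```

The second column of `rotated` is zero up to rounding, so dropping it loses nothing: the 2D data has become 1D.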
Suppose this time the data points do not all lie on one line; they look like the image below.
I can still convert the data into one dimension, but here I'll lose some information (the blue lines, which are the distances between each point and the red line).
How can I put the points on x1? I'll take the projection: suppose light is coming from one direction, then wherever each point's shadow falls on axis x1 is where I'll place that point.
If I want to put the points on axis x2 instead, I'll shine the light from the other side and take the projection on x2.
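The shadow idea can be sketched with a dot product. The points and the direction of the "red line" below are made-up values for illustration. The residuals play the role of the blue lines, and their squared lengths add up to the information lost.

```python
import numpy as np

# Hypothetical 2-D points that are close to, but not exactly on, a line.
points = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.4], [4.0, 3.8], [5.0, 5.2]])
centered = points - points.mean(axis=0)

# Assumed direction of the red line, as a unit vector.
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)

# 1-D coordinate of each point's shadow along the line.
coords_1d = centered @ direction

# The shadows written back in 2-D, and the gap between each point and its shadow.
shadows = np.outer(coords_1d, direction)
residuals = centered - shadows

kept = np.sum(coords_1d ** 2)   # spread preserved on the line
lost = np.sum(residuals ** 2)   # spread thrown away (the blue lines)
total = np.sum(centered ** 2)

print(kept, lost, total)
```

Here `kept + lost` equals `total`, so choosing the direction that keeps the most spread is the same as choosing the one that loses the least.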
Each principal component has two parts: 1. direction 2. variance
Variance is how much the data is spread out along that axis.
So first we create the covariance matrix. Then, if we apply eigendecomposition to the covariance matrix, it gives us two things: the eigenvectors and the eigenvalues.
The eigenvectors give us the directions, and the eigenvalues give us the explained variance.
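These steps can be sketched with NumPy as follows. The two-feature data is synthetic, and using `np.linalg.eigh` is an assumption on my side (it suits symmetric matrices such as a covariance matrix); the case study may compute this differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two correlated features, 200 samples.
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
data = np.column_stack([x1, x2])

# Step 1: center the data and build the covariance matrix (features in columns).
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Step 2: eigendecomposition. eigh returns eigenvalues in ascending order,
# so reorder to put the largest variance first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the direction of component i

# Eigenvalues are the variances along each direction; normalize to get ratios.
explained_ratio = eigenvalues / eigenvalues.sum()

# Step 3: project onto the first direction to go from 2-D to 1-D.
projected = centered @ eigenvectors[:, :1]

print(eigenvectors[:, 0])
print(explained_ratio)
```

The variance of `projected` equals the largest eigenvalue, which is what "the eigenvalue gives the explained variance" means in practice.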
* Show a real-world case of PCA - https://covidscholar.org/word-embeddings
Here we have 10,000 data points, i.e. 10k words, and each word is represented by 100 dimensions.
Using PCA, the researchers have reduced these to 3 dimensions.
Here you can see the words that are similar to any given word.
The same can be done with t-SNE, but t-SNE will find non-linear patterns in the data.
Here Andrej Karpathy has taken high-dimensional images and converted them to a lower dimension.
In the same blog, he has created a 50-dimensional word embedding.
Here you can see that similar words appear together.
Case Study -
Try to explain the code line by line.
This is the correlation heat map.
It shows how strongly each feature is correlated with every other feature.
In PCA we have orthogonal axes, i.e. the components are decorrelated with each other.
PC1 captures the features that are correlated with each other, so they all end up in one dimension or component.
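A quick way to check the decorrelation claim is to compare the correlation matrix before and after PCA. The four features below are synthetic (three built from the same hypothetical `base` signal, one independent) and stand in for the case study data, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical features: the first three are strongly correlated, the fourth is independent.
base = rng.normal(size=300)
features = np.column_stack([
    base + 0.1 * rng.normal(size=300),
    2.0 * base + 0.1 * rng.normal(size=300),
    -base + 0.1 * rng.normal(size=300),
    rng.normal(size=300),
])

# Correlation between the original features (what the heat map displays).
feature_corr = np.corrcoef(features, rowvar=False)

# PCA via eigendecomposition of the covariance matrix.
centered = features - features.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]

# Principal component scores and their correlation matrix.
scores = centered @ components
score_corr = np.corrcoef(scores, rowvar=False)

print(np.round(feature_corr, 2))
print(np.round(score_corr, 2))
print(np.round(components[:, 0], 2))   # weights of each feature in PC1
```

The second matrix comes out as the identity up to rounding, and the PC1 weights are large for the three correlated features and small for the independent one.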
45 min