Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
- Feature selection: Here we look for a subset of the original variables (features) that is still sufficient to model the problem. There are three common approaches:
- Filter
- Wrapper
- Embedded
- Feature extraction: This maps the data from a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
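To make feature selection concrete, here is a minimal sketch of one filter-style approach: dropping columns whose variance does not exceed a threshold. The function name select_by_variance, the threshold value and the toy data are assumptions made for illustration only; the post itself does not prescribe a particular filter criterion.

```python
import numpy as np

def select_by_variance(X, threshold):
    """Filter-style selection: keep columns whose variance exceeds `threshold`.

    Hypothetical helper for illustration; returns the reduced data and
    the indices of the columns that were kept.
    """
    variances = X.var(axis=0)                     # per-feature variance
    keep = np.flatnonzero(variances > threshold)  # indices of surviving features
    return X[:, keep], keep

# Toy data (assumed): two varying features and one constant feature.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),             # varying feature
    np.full(100, 3.0),                # constant feature, variance 0
    rng.normal(scale=5.0, size=100),  # varying feature with a larger spread
])

X_selected, kept = select_by_variance(X, threshold=1e-8)
print(kept, X_selected.shape)
```

Note that the result is still a subset of the original columns. Feature extraction, by contrast, builds new variables from combinations of the original ones.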
The various methods used for dimensionality reduction include:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Generalized Discriminant Analysis (GDA)
Principal Component Analysis (PCA)
Condition: When the data is mapped from a higher-dimensional space to a lower-dimensional space, the variance of the data in the lower-dimensional space should be as large as possible.
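This condition can be written as an optimization problem. The symbols below (the data points x_i, their mean \bar{x}, the covariance matrix \Sigma and the direction w) are notation introduced here for illustration and do not appear in the original text.

```latex
\Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top},
\qquad
w_1 = \arg\max_{\lVert w \rVert = 1} \; w^{\top} \Sigma \, w
```

The maximizer w_1 is the eigenvector of \Sigma with the largest eigenvalue \lambda_1, and the variance of the data projected onto w_1 equals \lambda_1. This is why the procedure works with the eigenvectors and eigenvalues of the covariance matrix.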
Steps Involved in PCA
- Standardize the data (mean = 0 and variance = 1 for each feature).
- Compute the covariance matrix of the features.
- Obtain the eigenvectors and eigenvalues of the covariance matrix (the correlation matrix or singular value decomposition can also be used; this post focuses on the covariance matrix).
- Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues (k is the number of dimensions of the new feature subspace, with k ≤ d, where d is the number of original dimensions).
- Construct the projection matrix W from the selected k eigenvectors.
- Transform the original data set X via W to obtain the new k-dimensional feature subspace Y.
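The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name pca, the toy data, and the choice of np.linalg.eigh (which suits symmetric matrices such as a covariance matrix) are assumptions, and the code assumes no feature is constant, so the standard deviation is never zero.

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    # Step 1: standardize each feature to mean 0 and variance 1.
    # Assumes no column is constant (std would be 0).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the features (d x d).
    cov = np.cov(Z, rowvar=False)

    # Step 3: eigenvalues and eigenvectors; eigh is meant for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: sort in descending order of eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 5: projection matrix W from the top-k eigenvectors (d x k).
    W = eigvecs[:, :k]

    # Step 6: transform the standardized data into the k-dimensional subspace.
    Y = Z @ W
    return Y, W, eigvals

# Toy data (assumed): the second feature is almost a multiple of the first.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], 2 * base[:, 0] + 0.1 * base[:, 1], base[:, 1]])

Y, W, eigvals = pca(X, k=2)
print(Y.shape)   # (200, 2)
print(eigvals)   # eigenvalues in descending order
```

Here the projection is applied to the standardized data, since the covariance matrix was computed from it. The sample variance of each column of Y equals the corresponding eigenvalue.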
Advantages of Dimensionality Reduction
- It helps with data compression and hence reduces the storage space required.
- It reduces computation time.
- It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
- It may lead to some amount of data loss.
- PCA tends to find linear correlations between variables, which is sometimes undesirable.
- PCA fails in cases where the mean and covariance are not enough to describe the dataset.
- We may not know how many principal components to keep; in practice, some rules of thumb are applied.
- PCA is very sensitive to the variances of the original variables.
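One commonly used rule of thumb for choosing the number of components is to keep the smallest k whose eigenvalues account for a chosen fraction of the total variance. The sketch below illustrates that idea; the function name choose_k, the 0.85 target and the example eigenvalues are assumptions for illustration, not values from the post.

```python
import numpy as np

def choose_k(eigenvalues, target=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches `target`.

    Hypothetical helper for illustration.
    """
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratio = eigenvalues / eigenvalues.sum()   # explained-variance ratio per component
    cumulative = np.cumsum(ratio)             # running total, ends at (about) 1.0
    k = int(np.searchsorted(cumulative, target)) + 1
    k = min(k, len(eigenvalues))              # guard against floating-point round-off
    return k, cumulative

# Example eigenvalues (assumed): cumulative ratios are roughly 0.4, 0.7, 0.9, 1.0.
k, cumulative = choose_k([4.0, 3.0, 2.0, 1.0], target=0.85)
print(k)           # 3
print(cumulative)
```

The target fraction itself is a judgment call and depends on how much information loss is acceptable for the task.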
