Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.




Components of Dimensionality Reduction

There are two components of dimensionality reduction:

  • Feature selection: Here we look for a subset of the original variables (features) that is sufficient to model the problem. It is usually done in one of three ways:
    1. Filter
    2. Wrapper
    3. Embedded
  • Feature extraction: This maps data from a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
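
As a rough illustration of the filter approach to feature selection, the sketch below keeps only the columns whose variance exceeds a cutoff. The function name `variance_filter`, the cutoff value, and the synthetic data are assumptions made for this example; variance is just one of many possible filter criteria.

```python
import numpy as np

def variance_filter(X, threshold):
    """Filter-style feature selection: keep columns whose variance exceeds threshold.

    Returns the reduced data and the indices of the columns that were kept.
    """
    variances = X.var(axis=0)
    keep = np.flatnonzero(variances > threshold)
    return X[:, keep], keep

# Hypothetical data: two informative columns and one constant column.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, 200),  # varies, so it carries information
    np.full(200, 3.0),          # constant, so its variance is zero
    rng.normal(0.0, 2.0, 200),  # varies, so it carries information
])

X_selected, kept = variance_filter(X, threshold=1e-8)
print(kept, X_selected.shape)
```

Note that the selected columns are original features, unchanged. Feature extraction, by contrast, builds new features as combinations of the original ones.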

The various methods used for dimensionality reduction include:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be either linear or non-linear, depending on the method used.

Principal Component Analysis (PCA)

Condition: when the data is mapped from a higher-dimensional space to a lower-dimensional space, the variance of the data in the lower-dimensional space should be as large as possible.
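
As a sketch, this condition can be written as an optimization problem for the first projection direction. The symbols are assumptions introduced here for illustration: x_i are the n data points, x̄ is their mean, S is the sample covariance matrix, and w is a unit-length direction.

```latex
\[
w_1 = \arg\max_{\|w\| = 1} \; w^{\top} S \, w,
\qquad
S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}
\]
```

Here w^T S w is the variance of the data projected onto w. The maximizer is the eigenvector of S with the largest eigenvalue, and the maximum value is that eigenvalue, which is why PCA works with the eigenvectors and eigenvalues of the covariance matrix.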

Steps Involved in PCA
  1. Standardize the data (mean = 0 and variance = 1 for each feature).
  2. Compute the covariance matrix of the features.
  3. Obtain the eigenvectors and eigenvalues of the covariance matrix (the correlation matrix or singular value decomposition can also be used; this post focuses on the covariance matrix).
  4. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues (k is the number of dimensions of the new feature subspace, with k ≤ d, where d is the number of original dimensions).
  5. Construct the projection matrix W from the selected k eigenvectors.
  6. Transform the original data set X via W to obtain the data Y in the new k-dimensional feature subspace.
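
The six steps can be sketched in NumPy as follows. This is a minimal illustration rather than a production implementation: the function name `pca` and the synthetic data are assumptions made for the example, and the code assumes no feature is constant (otherwise the standardization would divide by zero).

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    # Step 1: standardize each feature to mean 0 and variance 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the features (d x d).
    cov = np.cov(Z, rowvar=False)

    # Step 3: eigen-decomposition; eigh is suited to symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: sort eigenvalues (and matching eigenvectors) in descending order.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Step 5: projection matrix W from the top-k eigenvectors (d x k).
    W = eigvecs[:, :k]

    # Step 6: transform the standardized data into the k-dimensional subspace.
    Y = Z @ W
    return Y, W, eigvals

# Hypothetical data: 4 features, where the last is nearly a copy of the first.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[:, 3] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

Y, W, eigvals = pca(X, k=2)
print(Y.shape)   # (100, 2)
print(eigvals)   # eigenvalues in descending order
```

The variance of each column of Y equals the corresponding eigenvalue, so the sorted eigenvalues show how much variance each retained dimension carries.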

Advantages of Dimensionality Reduction

  • It helps with data compression and therefore reduces the storage space required.
  • It reduces computation time.
  • It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction

  • It may lead to some amount of data loss.
  • PCA tends to find linear correlations between variables, which is sometimes undesirable.
  • PCA fails in cases where the mean and covariance are not enough to describe the dataset.
  • We may not know how many principal components to keep; in practice, some rules of thumb are applied.
  • PCA is very sensitive to the variances of the features.
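
One commonly used rule of thumb for choosing the number of components is to keep the smallest k whose eigenvalues account for a chosen fraction of the total variance. The sketch below assumes that rule; the function name `choose_k`, the 90% target, and the example eigenvalues are illustrative assumptions, not values from this post.

```python
import numpy as np

def choose_k(eigvals, target=0.9):
    """Smallest k whose top-k eigenvalues explain at least `target` of the variance."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals / eigvals.sum())
    # First position where the cumulative explained variance reaches the target.
    k = int(np.searchsorted(cumulative, target)) + 1
    # Guard against floating-point round-off pushing k past the number of components.
    return min(k, len(eigvals)), cumulative

# Hypothetical eigenvalues of a covariance matrix.
k, cumulative = choose_k([4.0, 2.0, 1.0, 0.5, 0.5], target=0.9)
print(k)           # 4
print(cumulative)  # [0.5 0.75 0.875 0.9375 1.]
```

The right target depends on the application; a higher target keeps more information at the cost of more dimensions.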


