The PCA Guidebook: Practical, Intuitive, and Thorough
Quick start
- Paste your table above (rows = observations, columns = numeric variables). Headers are auto‑detected.
- Pick Basis: use Correlation when variables have different units or spreads; use Covariance when scales are comparable and you want large‑variance features to dominate.
- Choose how to treat missing values: Drop rows (listwise deletion) or Impute means (simple baseline).
- Click Run PCA. Read the Summary, then scan the Scree (elbow), Loadings (variable weights), Scores (observation positions), and the Correlation Circle (quality of representation on PC1–PC2).
- Export Scores and Loadings as CSV for downstream analysis (see the loading sketch below).
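If you plan to keep working with the exports outside this tool, here is a minimal pandas sketch for loading them. The file and column names are hypothetical; match them to your actual exports.

```python
# Minimal sketch: load the exported CSVs for downstream analysis.
# File and column names here are hypothetical; adjust to your exports.
import pandas as pd

scores = pd.read_csv("scores.csv")      # rows = observations, columns = PC1, PC2, ...
loadings = pd.read_csv("loadings.csv")  # rows = original variables, columns = PCs

print(scores.head())
print(loadings.head())
```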
PCA in plain language
PCA is a smart rotation of your data cloud. Imagine plotting every observation in p‑dimensional space. PCA finds perpendicular axes (principal components) that capture as much spread (variance) as possible, one after another. PC1 points where the data is widest; PC2 is the next widest direction orthogonal to PC1, and so on. Because these axes are uncorrelated, they remove redundancy and make structure easier to see.
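To make the "smart rotation" concrete, here is a minimal NumPy sketch on synthetic 2‑D data showing that PC1 lines up with the widest direction of the cloud:

```python
# Minimal sketch: PC1 is the direction of greatest spread in a 2-D cloud.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated data: points scattered around the line y = 0.6 x.
x = rng.normal(size=500)
y = 0.6 * x + 0.2 * rng.normal(size=500)
X = np.column_stack([x, y])

Xc = X - X.mean(axis=0)              # center the cloud
C = Xc.T @ Xc / (len(Xc) - 1)        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C) # eigh returns eigenvalues in ascending order
pc1 = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

print("PC1 direction:", pc1)         # aligned (up to sign) with the (1, 0.6) trend
print("variance along PC1:", eigvals[-1])
```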
Pre‑processing: what to do before PCA
- Centering (subtract column means) is essential. This tool centers automatically for both covariance and correlation PCA.
- Scaling: Correlation PCA divides each variable by its sample standard deviation (computed with n−1 in the denominator), putting all variables on the same footing. Use this when units differ (cm vs. kg vs. $) or when some variables are naturally high‑variance.
- Missing values: We support listwise deletion or mean imputation. For serious work, consider EM, PPCA, or k‑NN imputation; the choice can change components. (A preprocessing sketch follows this list.)
- Outliers: A few extreme points can rotate PCs. Inspect scores and consider robust methods (see Outliers & robustness below).
- Collinearity: Highly correlated variables are fine—PCA thrives on correlation—but perfect duplicates yield zero‑variance directions.
- Categorical variables: PCA expects numeric, continuous inputs. If needed, one‑hot encode categories, then consider correlation PCA and interpret carefully.
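For reference, here is a minimal NumPy sketch of the preprocessing choices above (mean imputation as a simple baseline, centering, and optional standardizing). It mirrors the steps described in this list rather than this tool's exact implementation:

```python
# Minimal preprocessing sketch: mean imputation, centering, optional scaling.
import numpy as np

def preprocess(X, standardize=True):
    """X: (n, p) array of floats; NaN marks a missing value."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)             # column means ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]               # mean imputation (simple baseline)

    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=1)                  # sample std (n - 1), used for correlation PCA
    Xc = X - means                                # centering is always applied
    if standardize:
        Xc = Xc / stds
    return Xc, means, stds                        # keep means/stds to project new data later
```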
Mathematics of PCA (clear but compact)
Let X be the n×p data matrix after centering (and optional standardizing). The sample covariance (or correlation) matrix is C = XᵀX / (n − 1).
PCA solves the eigenproblem and orthonormality conditions: C v_j = λ_j v_j with λ_1 ≥ λ_2 ≥ … ≥ λ_p ≥ 0, and VᵀV = I, where the columns of V are the eigenvectors (loadings).
The scores and a low‑rank reconstruction are T = X · V and X̂_k = T_k · V_kᵀ, where T_k and V_k keep only the first k columns.
(Add back the column means you subtracted to return to the original space.)
Explained variance ratio for PC j is EVR_j = λ_j / (λ_1 + λ_2 + … + λ_p).
SVD view (why it’s numerically stable)
Singular Value Decomposition factors the centered/standardized matrix as X = U Σ Vᵀ, where U and V have orthonormal columns and Σ is diagonal with singular values σ_1 ≥ σ_2 ≥ … ≥ 0.
The eigenvalues of C relate to the singular values via λ_j = σ_j² / (n − 1), and the right singular vectors (the columns of V) are the loadings.
Scores are equivalently T = X · V = U Σ.
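For concreteness, here is a minimal NumPy sketch verifying that the eigendecomposition and SVD routes give the same eigenvalues and scores (up to sign):

```python
# Minimal sketch: PCA via the covariance eigendecomposition vs. via the SVD of X.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                     # centered data (n x p)
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix C.
C = Xc.T @ Xc / (n - 1)
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]           # sort eigenvalues in descending order
eigvals, V = eigvals[order], V[:, order]
T_eig = Xc @ V                              # scores T = X V

# Route 2: thin SVD of the centered matrix, X = U S V^T.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = S**2 / (n - 1)                    # lambda_j = sigma_j^2 / (n - 1)
T_svd = U * S                               # scores T = U Sigma

print(np.allclose(eigvals, lam_svd))                 # True
print(np.allclose(np.abs(T_eig), np.abs(T_svd)))     # True up to sign flips
```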
Scores, Loadings, Communalities & Contributions
Scores (T = X · V) place each observation in PC space; loadings (the columns of V) give the weight of each original variable in each PC; a variable's communality is the share of its variance captured by the retained PCs; its contribution to a PC is its squared loading, i.e., its share of that component.
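Here is a minimal sketch of the standard formulas behind these quantities, assuming correlation PCA (standardized variables), with V the p×p loading matrix and eigvals the eigenvalues in descending order:

```python
# Minimal sketch: communalities and variable contributions from loadings and eigenvalues.
# Assumes correlation PCA (standardized variables); V has shape (p, p).
import numpy as np

def communalities(V, eigvals, k):
    """Share of each variable's (unit) variance captured by the first k PCs."""
    return (V[:, :k] ** 2 * eigvals[:k]).sum(axis=1)

def variable_contributions(V):
    """Contribution of each variable to each PC (columns sum to 1)."""
    return V ** 2                  # columns of V are unit-length eigenvectors

# corr(variable i, PC j) = V[i, j] * sqrt(eigvals[j]) -- the correlation-circle coordinates.
```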
Choosing the number of components (k)
- Scree elbow: Look for the bend in the scree plot where additional PCs add little variance.
- Cumulative EVR: Keep PCs until 80–95% of variance is explained (context‑dependent: scientific data often uses ≥90%).
- Kaiser criterion (only for correlation PCA): keep PCs with λ > 1 (they explain more variance than a single standardized variable). Use with caution.
- Broken‑stick: Compare eigenvalues to a random “broken‑stick” distribution; keep those exceeding the expectation (see the selection sketch after this list).
- Parallel analysis: Compare to eigenvalues from random data with the same shape; keep PCs exceeding random. (Not implemented here; run offline for rigor.)
- Cross‑validation: For predictive tasks (PCR), pick k that maximizes validation performance.
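Here is a minimal sketch of two of these rules (cumulative EVR threshold and broken‑stick), given eigenvalues sorted in descending order:

```python
# Minimal sketch: pick k by cumulative explained variance and by the broken-stick rule.
import numpy as np

def k_by_cumulative_evr(eigvals, threshold=0.90):
    """Smallest k whose cumulative explained-variance ratio reaches the threshold."""
    cum = np.cumsum(eigvals) / eigvals.sum()
    return min(int(np.searchsorted(cum, threshold)) + 1, len(eigvals))

def k_by_broken_stick(eigvals):
    """Count leading PCs whose EVR exceeds the broken-stick expectation
    E_j = (1/p) * sum_{i=j}^{p} 1/i."""
    p = len(eigvals)
    expectation = np.array([np.sum(1.0 / np.arange(j, p + 1)) / p for j in range(1, p + 1)])
    keep = (eigvals / eigvals.sum()) > expectation
    return p if keep.all() else int(np.argmin(keep))   # index of first failure = count kept

eigvals = np.array([3.1, 1.2, 0.4, 0.2, 0.1])
print(k_by_cumulative_evr(eigvals), k_by_broken_stick(eigvals))
```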
Interpreting components like a pro
- Read loadings first. Identify which variables drive PC1, PC2, … High positive vs. negative weights can indicate meaningful trade‑offs (e.g., price ↑ while efficiency ↓).
- Use the scores plot to detect groups/outliers. Clusters in PC1–PC2 space often correspond to meaningful segments; extreme scores flag outliers or novel cases.
- Check the correlation circle. Variables close together are positively correlated; opposite sides indicate negative correlation; near‑orthogonal ≈ weakly related.
- Relate back to the domain. Components are combinations of variables—name them by what they measure (e.g., “overall size”, “sweetness vs. acidity”, “market risk”).
- Remember non‑uniqueness. If λ’s are tied or nearly equal, the corresponding PCs can rotate within their subspace. Focus on the subspace, not exact axes.
High‑dimensional case (p ≫ n)
When variables outnumber observations, at most n−1 eigenvalues are non‑zero. PCA still works and is often essential. Computation is faster via the SVD of X or by eigendecomposing the small n×n matrix XXᵀ and mapping its eigenvectors back to variable space (see the sketch below). Interpretation is the same.
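Here is a minimal NumPy sketch of that Gram‑matrix route: eigendecompose the n×n matrix XXᵀ, then map its eigenvectors back to variable space to recover loadings and scores:

```python
# Minimal sketch: PCA when p >> n via the n x n Gram matrix X X^T.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 500))                # n = 20 observations, p = 500 variables
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

G = Xc @ Xc.T                                 # small n x n Gram matrix
mu, U = np.linalg.eigh(G)
order = np.argsort(mu)[::-1][: n - 1]         # at most n - 1 non-zero eigenvalues
mu, U = mu[order], U[:, order]

eigvals = mu / (n - 1)                        # same non-zero eigenvalues as C
V = Xc.T @ U / np.sqrt(mu)                    # map back to variable space (unit-norm loadings)
T = U * np.sqrt(mu)                           # scores, identical to Xc @ V

print(np.allclose(T, Xc @ V))                 # True
```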
Outliers & robustness
- Diagnostics: Inspect score distances (e.g., Mahalanobis) to find leverage points that can twist PCs (a distance sketch follows this list).
- Mitigations: Winsorize/extreme‑clip, transform (log), or use robust PCA variants (e.g., M‑estimators, S‑estimators, median‑based methods). This tool performs classical PCA; pair with robust prep if needed.
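Here is a minimal sketch of the score‑distance diagnostic: because scores are uncorrelated, a Mahalanobis‑style distance in the space of the first k PCs reduces to rescaling each score column by √λ_j.

```python
# Minimal sketch: flag high-leverage observations via score distances in PC space.
import numpy as np

def score_distances(T, eigvals, k):
    """Mahalanobis-style distance using the first k PCs (scores are uncorrelated)."""
    return np.sqrt(((T[:, :k] ** 2) / eigvals[:k]).sum(axis=1))

# Observations with unusually large distances deserve a closer look before
# trusting the fitted components. Example cutoff, assuming roughly Gaussian scores:
#   from scipy.stats import chi2; cutoff = np.sqrt(chi2.ppf(0.975, df=k))
```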
Advanced PCA topics
- Whitening: Map to uncorrelated, unit‑variance features via X_white = X · V · Λ^(−1/2), where Λ = diag(λ_1, …, λ_p) (equivalently, divide each score column by √λ_j). Useful for some ML pipelines; beware of amplifying noise for tiny eigenvalues.
- PCR (Principal Components Regression): Regress a target on PC scores to mitigate multicollinearity (see the sketch after this list).
- Sparse PCA: Encourages loadings with many zeros for interpretability.
- Kernel PCA: Applies PCA in a nonlinear feature space via kernels (RBF, polynomial) for curved manifolds.
- t‑SNE/UMAP vs. PCA: Nonlinear methods are great for visualization of clusters but are not linear, global, or easily invertible; start with PCA.
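As an illustration of the PCR idea above, here is a minimal scikit‑learn sketch on synthetic data (the number of retained components is a tuning choice, e.g. via cross‑validation):

```python
# Minimal sketch: Principal Components Regression with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize -> project onto k PCs -> ordinary least squares on the scores.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))          # R^2 on the training data
```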
Domain‑specific tips
- Finance: Yield‑curve data typically produce PCs interpretable as level, slope, and curvature. Use correlation PCA if mixing units; otherwise covariance can highlight dominant risk factors.
- Biology/Genomics: Center and standardize; PC1/PC2 often capture batch effects or population structure. Always check for confounders.
- Manufacturing/QC: PCA detects process drift and latent failure modes; monitor scores over time.
- Imaging/Signals: PCA ≈ Karhunen–Loève transform—great for denoising/compression; mind spatial structure when interpreting loadings.
Common pitfalls
- Mixing standardized and unstandardized variables when using covariance PCA.
- Over‑interpreting signs (they may flip between runs or tools).
- Assuming PCs imply causality—they summarize variance, not mechanisms.
- Keeping too many PCs (overfitting) or too few (information loss). Use scree + EVR + domain sense.
- Projecting new data without applying the same centering/standardization as training.
FAQ (extended)
How do I apply these loadings to new data? Store your training means (and stds for correlation PCA). For a new row x, compute x′ = (x − mean)/std as appropriate, then scores = x′ · V_k. This tool reports means and stds so you can replicate preprocessing.
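Here is a minimal sketch of that projection step (variable names are placeholders; V_k is the p×k matrix of retained loadings):

```python
# Minimal sketch: project new observations with the training-time preprocessing.
import numpy as np

def project_new_data(X_new, train_means, train_stds, V_k, standardized=True):
    """Apply training centering/scaling, then map onto the retained loadings V_k (p x k)."""
    X_new = np.asarray(X_new, dtype=float)
    Xp = X_new - train_means                 # same centering as training
    if standardized:                         # only if correlation PCA was used
        Xp = Xp / train_stds
    return Xp @ V_k                          # scores of the new observations
```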
Why don’t my results match another package exactly? PCA is unique up to sign flips; small differences arise from numerical methods, missing‑value handling, and whether covariance or correlation was used.
Can I rotate PCs (e.g., varimax)? Rotation is a factor analysis concept. PCA already yields orthogonal components that maximize variance; rotated solutions optimize different criteria.
Does scaling change scores? Yes—correlation PCA gives each variable equal variance, shifting both loadings and scores; covariance PCA lets high‑variance variables dominate.
Glossary
- Scores (T): coordinates of observations in PC space (T = X · V).
- Loadings (V): weights defining PCs in terms of original variables (eigenvectors of C).
- Eigenvalue (λ): variance captured by a PC.
- Explained Variance Ratio (EVR): the share of total variance captured by PC j, λ_j / (λ_1 + … + λ_p).
- Communality: how much of a variable’s variance is captured by the retained PCs.
- Whitening: rescaling scores to unit variance and zero correlation.