Chapter 5 — Unsupervised Learning and Dimensionality Reduction
The previous chapters worked with labelled data — every training example came with a known target. This chapter turns to the unlabelled case. Given a dataset of inputs without any output labels, what can a learning algorithm say about the structure of the data? Two main families of answers exist: clustering, which groups similar examples together, and dimensionality reduction, which finds a compact representation that preserves the data's essential structure. Both are widely used — clustering for customer segmentation, anomaly detection, and exploratory analysis; dimensionality reduction for visualisation, denoising, and feeding into downstream supervised algorithms. The chapter also covers Linear Discriminant Analysis (LDA), a supervised method for dimensionality reduction that uses class labels to find the most class-discriminating directions.
5.1 Introduction to clustering and criterion functions
Clustering
Clustering is the unsupervised-learning task of partitioning a set of objects into groups (clusters) such that objects within the same group are similar to one another and objects in different groups are dissimilar, based on some chosen measure of similarity or distance.
Clustering has no "right answer." The same dataset can be clustered many different ways depending on what features are used, what distance metric is chosen, what algorithm is run, and what hyperparameters are set. The goal is a clustering that is useful for the downstream task — segmenting customers for targeted marketing, finding anomalies, organising documents, compressing data.
Common applications
- Customer segmentation. Group eSewa users by their transaction patterns — frequent small-payments users, occasional large-payment users, business accounts, dormant accounts. Each segment can be served by appropriate features and marketing.
- Document organisation. Cluster news articles into topics without pre-defined categories. Cluster research papers, court rulings, customer-support tickets.
- Image segmentation. Group pixels of an image into regions corresponding to objects.
- Anomaly detection. Examples that do not fit any cluster well are anomalies. Useful for fraud detection, intrusion detection, equipment failure prediction.
- Recommendation systems. Cluster users into similar groups; recommend what others in the same cluster liked.
- Genetics. Cluster gene-expression patterns to find disease subtypes.
- Network analysis. Cluster social-network users to find communities.
Types of clustering
Hard vs soft clustering.
- Hard clustering assigns each example to exactly one cluster.
- Soft (fuzzy) clustering assigns each example a probability or weight for belonging to each cluster.
Partitional vs hierarchical.
- Partitional clustering produces a single division of the data into clusters (K-means is the canonical example).
- Hierarchical clustering produces a tree of clusterings at all levels of granularity.
Density-based vs centroid-based vs distribution-based.
- Centroid-based — each cluster is defined by its centre point (K-means).
- Density-based — clusters are dense regions of the data space separated by sparse regions (DBSCAN).
- Distribution-based — each cluster is modelled as a probability distribution (Gaussian Mixture Models).
- Connectivity-based — clusters are formed by connecting nearby points (hierarchical clustering with single linkage).
Criterion functions for clustering
A criterion function (or objective function) for clustering is a mathematical measure that quantifies the quality of a proposed clustering, used to guide algorithms toward good clusterings and to compare different clusterings of the same data.
The two key intuitions:
Intra-cluster similarity should be high. Points within the same cluster should be close to each other.
Inter-cluster similarity should be low. Points in different clusters should be far from each other.
Different criterion functions formalise this differently.
Sum of Squared Errors (SSE). Used by K-means.
where is the set of points in cluster and is the cluster's centroid. SSE measures how tightly points cluster around their centroids. Lower is better. Also called inertia in some software libraries.
Silhouette coefficient. Measures both cohesion (within-cluster tightness) and separation (between-cluster distance) for each point:
where is the average distance from point to other points in its cluster and is the average distance from to points in the nearest other cluster.
- close to 1: point is well-clustered.
- near 0: point is on the boundary between two clusters.
- negative: point may be in the wrong cluster.
The silhouette of a clustering is the average across all points. Useful for comparing clusterings with different numbers of clusters.
Davies-Bouldin index. Ratio of within-cluster scatter to between-cluster separation. Lower is better.
Calinski-Harabasz index. Ratio of between-cluster variance to within-cluster variance. Higher is better.
Dunn index. Ratio of minimum inter-cluster distance to maximum intra-cluster distance.
The criterion function matters because it determines what the algorithm optimises. K-means with SSE finds spherical clusters. Other criteria favour other shapes.
5.2 Algorithms for clustering
K-means clustering
K-means is a centroid-based clustering algorithm that partitions examples into clusters by iteratively assigning each example to the cluster with the nearest centroid and then recomputing each centroid as the mean of its assigned examples, minimising the sum of squared errors.
K-means is the most widely used clustering algorithm. Fast, simple, scales to large datasets, easy to interpret.
The K-means algorithm
The standard procedure (Lloyd's algorithm):
- Choose . The number of clusters is a hyperparameter that must be set in advance.
- Initialise centroids. Pick initial centroid positions, often by randomly choosing training examples.
- Assignment step. For each example, assign it to the cluster with the nearest centroid:
- Update step. Recompute each centroid as the mean of its assigned examples:
- Repeat steps 3 and 4 until centroids stop changing (or change less than a threshold).
The algorithm is guaranteed to converge — each iteration decreases or maintains the SSE — but it may converge to a local minimum rather than the global one.
A worked example
Suppose six points in 2D representing transaction (amount, hour-of-day):
| Point | Amount (NPR) | Hour |
|---|---|---|
| A | 100 | 9 |
| B | 150 | 10 |
| C | 120 | 9 |
| D | 5000 | 22 |
| E | 4500 | 23 |
| F | 5200 | 22 |
Try clusters. Initialise centroids randomly: , .
Iteration 1 — Assignment:
| Point | Distance to | Distance to | Cluster |
|---|---|---|---|
| A | 0 | 1 | |
| B | 1 | ||
| C | 1 | ||
| D | 0 | 2 | |
| E | 2 | ||
| F | 2 |
Update:
. .
Iteration 2 — Assignment: The assignments stay the same (A, B, C remain in cluster 1; D, E, F in cluster 2). The algorithm has converged.
Interpretation: cluster 1 is "small daytime transactions," cluster 2 is "large late-night transactions." A new transaction can be assigned to its nearest cluster — useful for flagging unusual transactions (a small late-night transaction would not fit either cluster well).
Initialisation matters
K-means is sensitive to initial centroid positions. Different initialisations can produce different final clusterings. Standard remedies:
Multiple random initialisations. Run K-means many times with different initial centroids, keep the result with the lowest SSE.
K-means++. A clever initialisation that picks initial centroids spread out across the data:
- Pick the first centroid uniformly at random from the data points.
- For each remaining point, compute the distance to the nearest already-chosen centroid.
- Pick the next centroid with probability proportional to the squared distance — points far from existing centroids are more likely to be picked.
- Repeat until centroids are chosen.
K-means++ typically converges in fewer iterations and finds better local minima than random initialisation.
Choosing K
K is the most important hyperparameter. Several methods help choose it.
Elbow method. Run K-means for . Plot SSE against . SSE always decreases with more clusters, but the rate of decrease slows. The "elbow" of the curve — where the marginal benefit drops sharply — suggests a good .
Silhouette analysis. Compute the average silhouette coefficient for each . Pick the with the highest average silhouette.
Gap statistic. Compares the SSE to what would be expected under a null distribution (uniform random data). Larger gap = better clustering.
Domain knowledge. Often the most reliable. If the business problem suggests three customer segments, use even if mathematical metrics suggest otherwise.
Strengths and weaknesses of K-means
Strengths:
- Simple and fast — where is the number of iterations.
- Scales to large datasets (mini-batch K-means scales further).
- Easy to interpret — centroids are themselves data points in the feature space.
- Works well when clusters are roughly spherical and similar in size.
Weaknesses:
- Must specify in advance.
- Sensitive to initialisation.
- Assumes spherical clusters — fails on elongated, ring-shaped, or oddly-shaped clusters.
- Sensitive to outliers — a single far-away point can drag a centroid.
- Sensitive to feature scaling — features must be scaled before applying K-means.
- Assumes clusters of similar size.
Hierarchical clustering
Hierarchical clustering is a family of clustering algorithms that build a tree (dendrogram) of nested clusterings, either by starting with each point as its own cluster and successively merging the closest pair (agglomerative), or by starting with one cluster and successively splitting it (divisive).
Hierarchical clustering produces a clustering at every level of granularity simultaneously. Cutting the dendrogram at different heights gives different numbers of clusters.
Agglomerative hierarchical clustering — bottom-up:
- Start with each point as its own cluster — singleton clusters.
- Find the two closest clusters and merge them.
- Repeat until only one cluster remains.
The order of merges defines the dendrogram.
Divisive hierarchical clustering — top-down:
- Start with all points in one cluster.
- Split the cluster (using some criterion, often K-means with ).
- Recurse on each child cluster.
Agglomerative is more common.
Linkage methods
When merging clusters, the algorithm needs to know the distance between two clusters of points. Different choices:
Single linkage. Distance between two clusters = distance between their closest pair of points. Produces long, chain-like clusters. Sensitive to noise.
Complete linkage. Distance = distance between their farthest pair. Produces compact, spherical clusters. Sensitive to outliers.
Average linkage. Distance = average distance between all cross-cluster pairs. Compromise.
Ward's linkage. Merge the pair that increases SSE the least. Produces clusters of similar size, often considered the best general-purpose choice.
Dendrograms
A dendrogram is a tree diagram showing the merge history. Vertical axis shows the distance at which clusters were merged. Cutting the dendrogram with a horizontal line at some height gives a clustering — clusters that have not yet been merged at that height become the final clusters.
Useful properties:
- No need to specify in advance — the dendrogram shows all levels.
- Visual representation reveals natural cluster structure.
- Can spot outliers (points that merge late, at high distances).
Downside: memory and time for naive implementations. Does not scale to very large datasets without approximations.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that defines clusters as dense regions of points separated by regions of low density, classifying points as core, border, or noise based on local density.
DBSCAN has two parameters:
- — the neighbourhood radius.
- — the minimum number of points required to form a dense region.
Each point is classified as:
- Core point — has at least neighbours within distance .
- Border point — within of a core point but does not itself have enough neighbours.
- Noise — neither core nor border.
Clusters form by chaining core points and including their border points. Noise points are not assigned to any cluster.
Strengths:
- Does not require specifying the number of clusters.
- Handles arbitrarily-shaped clusters.
- Robust to outliers (they are labelled as noise).
- Good for spatial data.
Weaknesses:
- Sensitive to and .
- Struggles with clusters of varying density.
- Slow on very high-dimensional data.
DBSCAN is the go-to choice for clustering geographic data, anomaly detection in low-dimensional feature spaces, and identifying clusters in noisy datasets.
Gaussian Mixture Models (GMM)
A Gaussian Mixture Model is a probabilistic clustering model in which the data is assumed to come from a mixture of Gaussian distributions, with parameters fit by the Expectation-Maximisation algorithm, producing soft cluster assignments through posterior probabilities.
Each cluster is modelled as a Gaussian (multivariate normal) distribution with its own mean and covariance. Each point has a probability of belonging to each cluster, rather than a hard assignment.
The model parameters (means, covariances, mixing weights) are estimated by the Expectation-Maximisation (EM) algorithm:
- E step. Given current parameters, compute the posterior probability that each point belongs to each Gaussian.
- M step. Given current posteriors, update each Gaussian's parameters (mean, covariance) as weighted averages.
GMM generalises K-means — when all covariances are spherical and equal, GMM with hard assignments reduces to K-means.
Strengths:
- Soft assignments quantify uncertainty.
- Handles elliptical (not just spherical) clusters.
- Probabilistic foundation — supports model selection (Bayesian Information Criterion, Akaike Information Criterion).
Weaknesses:
- Slower than K-means.
- Sensitive to initialisation.
- Assumes Gaussian clusters; fails on non-Gaussian distributions.
Comparison
| Algorithm | Cluster shape | Need K? | Handles noise? | Scales? |
|---|---|---|---|---|
| K-means | Spherical | Yes | Poorly | Excellent |
| K-means++ | Spherical | Yes | Poorly | Excellent |
| Hierarchical (Ward) | Roughly spherical | No (use dendrogram) | Poorly | Poor ( memory) |
| Hierarchical (single) | Any | No | Poorly | Poor |
| DBSCAN | Any | No (uses , minPts) | Yes (noise label) | Moderate |
| GMM | Elliptical | Yes | No | Moderate |
The right algorithm depends on the data shape, the size, and what is needed downstream.
5.3 Dimensionality reduction — techniques and need
The need for dimensionality reduction
High-dimensional data is hard. As dimensions grow, problems multiply:
The curse of dimensionality. In high dimensions:
- Distances between all pairs of points become similar — "nearest" loses meaning.
- The volume of a hypersphere grows faster than the volume of the embedding hypercube — most volume sits near the corners, far from the centre.
- The amount of data needed to cover the space grows exponentially with dimensions.
- Many algorithms (K-NN, K-means) degrade in effectiveness.
Computational cost. Algorithms that scale as or become expensive. Storage requirements grow linearly with .
Visualisation. Humans can see 2D and 3D plots. Anything higher must be reduced to be visualised.
Redundancy. Many features may carry overlapping information. A dataset with 1000 features may have only 20 "real" dimensions of variation.
Noise. Some dimensions carry mostly noise rather than signal. Removing them improves downstream learning.
What dimensionality reduction does
Dimensionality reduction is the process of transforming data from a high-dimensional space to a lower-dimensional space while preserving the structure that matters for the downstream task, used for visualisation, denoising, compression, and improving the performance of subsequent learning algorithms.
A dimensionality-reduction technique maps to where — often much less. The mapping is chosen so that captures as much useful information as possible.
Categories of dimensionality reduction
Feature selection vs feature extraction.
- Feature selection — pick a subset of the original features. The reduced data is in a subspace of the original.
- Feature extraction — construct new features as functions of the originals. The reduced data is in a transformed space.
Linear vs non-linear.
- Linear methods — the new features are linear combinations of the originals. PCA, LDA, factor analysis.
- Non-linear methods — the new features are non-linear functions. t-SNE, UMAP, autoencoders, kernel PCA.
Supervised vs unsupervised.
- Unsupervised — use only the inputs . PCA, autoencoders, t-SNE.
- Supervised — use labels as well. LDA.
Standard methods
A summary of common methods:
- PCA (Principal Component Analysis). Linear, unsupervised. Finds directions of maximum variance. Section 5.4.
- LDA (Linear Discriminant Analysis). Linear, supervised. Finds directions that best separate classes. Section 5.5.
- t-SNE (t-distributed Stochastic Neighbour Embedding). Non-linear, unsupervised. Preserves local structure for visualisation.
- UMAP (Uniform Manifold Approximation and Projection). Non-linear, unsupervised. Faster than t-SNE, often preserves global structure better.
- Autoencoders. Neural networks trained to reconstruct their input through a low-dimensional bottleneck layer. The bottleneck representation is the reduced data.
- Random projections. Project to a random subspace. Surprisingly effective due to the Johnson-Lindenstrauss lemma.
- Factor analysis. Statistical model where observed variables are combinations of latent factors plus noise.
- Kernel PCA. PCA applied in a higher-dimensional kernel feature space. Captures non-linear structure.
5.4 Principal Component Analysis (PCA)
Principal Component Analysis
Principal Component Analysis is a linear, unsupervised dimensionality-reduction technique that finds the directions (principal components) along which the data has maximum variance, projecting the data onto the top such directions to produce a -dimensional representation that preserves as much variance as possible.
PCA was introduced by Karl Pearson in 1901. It remains the most widely used dimensionality-reduction method.
The intuition
Imagine a cloud of points in 2D forming an elongated ellipse. The data has a clear "main direction" — the long axis of the ellipse — along which most of the variation occurs. A second, perpendicular direction captures the residual variation. The main direction is the first principal component; the perpendicular direction is the second.
In higher dimensions, the same idea applies. PCA finds the directions of largest variance, ordered from most to least. By keeping only the top directions, we reduce the data to dimensions while losing as little variance as possible.
The mathematics
Given a dataset with rows (examples) and columns (features), centred so each column has mean zero:
-
Compute the covariance matrix. is the covariance matrix.
-
Compute eigenvectors and eigenvalues of . Each eigenvector satisfies . The eigenvalue measures the variance along the direction .
-
Sort eigenvalues in descending order. Let with corresponding eigenvectors .
-
Pick the top eigenvectors to form a projection matrix .
-
Project the data. The reduced data is , an matrix.
The th column of is the projection of the data onto the th principal component.
Why does this maximise variance?
The variance of the data projected onto a unit vector is . Maximising this subject to gives the eigenvector of with the largest eigenvalue. The second-largest variance, subject to being orthogonal to the first, gives the second eigenvector. And so on.
How many components to keep
Two common criteria:
Cumulative variance explained. The fraction of total variance captured by the first components is . Pick the smallest that captures a target fraction — 80%, 90%, 95%.
Scree plot. Plot eigenvalues against component index. The "elbow" — where eigenvalues drop sharply — suggests where to cut off.
Kaiser's criterion. Keep components with eigenvalues greater than 1 (when working with standardised data with unit variance per feature).
A worked example
Suppose a 2D dataset with three points:
| Feature 1 | Feature 2 |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
The data lies exactly on the line . The variance is entirely along this line.
Mean: . Centred data:
| -1 | -1 |
| 0 | 0 |
| 1 | 1 |
Covariance matrix:
Eigenvalues: solve , giving , .
Eigenvectors:
- For : . The direction .
- For : . Perpendicular.
The first principal component captures 100% of the variance (eigenvalue ratio ). The second captures 0%. Keeping just the first PC reduces the data to 1D without loss.
Projecting the centred data onto :
The 1D representation: — exactly recovering the original linear structure.
Practical considerations
Standardisation. PCA is sensitive to feature scale. A feature with large range dominates the covariance. Standardise features (zero mean, unit variance) before PCA unless there's a specific reason not to.
Centering. PCA requires data to be centred (zero mean per feature). Standardisation handles this.
Interpretation. Each principal component is a linear combination of original features. The weights (loadings) tell you which features contribute to each component. PC1 might be "overall financial activity" (positive loadings on all transaction features); PC2 might be "time-of-day pattern" (positive on evening transactions, negative on morning).
Limitations.
- PCA is linear. Non-linear structure (curves, clusters) is not captured. Use kernel PCA, t-SNE, or autoencoders for non-linear reduction.
- PCA captures variance, not class-discriminative information. The direction of maximum variance is not necessarily the direction that best separates classes. For supervised problems, LDA may be better.
- Outliers heavily influence PCA results.
Applications
PCA is used in:
- Visualisation. Reduce high-dimensional data to 2D or 3D for plotting.
- Compression. Image compression uses related methods (DCT, SVD).
- Feature engineering. PCA-reduced features can feed into supervised learning.
- Denoising. Reconstruct data from its top components to remove noise.
- Genomics. Reduce gene-expression data to a manageable number of components.
- Finance. Identify common factors driving asset prices.
- Sensor fusion. Combine many sensor readings into a few principal signals.
For Nepali contexts:
- NEA smart-meter data with hundreds of features can be reduced to a few principal components for clustering customers by usage profile.
- Macroeconomic indicators (NRB data — interest rates, inflation, exchange rates, remittance flows) can be PCA-reduced to identify common economic factors.
- High-dimensional sensor data from telecom network monitoring can be compressed for anomaly detection.
5.5 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis
Linear Discriminant Analysis is a supervised, linear dimensionality-reduction technique that projects data onto a lower-dimensional space chosen to maximise the separation between classes while minimising the variation within each class, useful for both dimensionality reduction and as a classifier.
LDA was developed by Ronald Fisher in 1936, originally as a method for classifying iris flowers. It has been a staple of statistical classification ever since.
LDA vs PCA
The key difference:
- PCA finds directions of maximum variance, ignoring class labels.
- LDA finds directions of maximum class separability, using class labels.
For an unlabelled or single-class problem, only PCA applies. For a labelled multi-class problem, LDA can find directions where the classes are well-separated, which PCA might miss.
A classic example: suppose two classes form two clusters that are elongated parallel to each other. PCA's first component points along the elongation direction — the direction of maximum overall variance — which is the direction in which the classes are not separated. LDA's first component points perpendicular to the elongation, the direction in which the classes are separated. LDA finds the discriminative direction; PCA misses it.
The mathematics
For a dataset with classes:
Within-class scatter matrix:
The total variation within classes — how spread out points are around their own class mean.
Between-class scatter matrix:
The variation between class means and the overall mean , weighted by class size .
Fisher's criterion. LDA finds the projection direction that maximises:
The numerator is between-class scatter in the projected space; the denominator is within-class scatter. Maximising means classes are well-separated relative to their internal spread.
The solution: is the eigenvector of corresponding to the largest eigenvalue. For a -class problem, LDA produces at most discriminant directions — because has rank at most .
LDA as a classifier
Once data is projected onto the LDA directions, classification is straightforward — use the nearest class mean, or apply any standard classifier. LDA itself often serves as a complete classification method (assuming Gaussian class-conditionals with equal covariances).
The LDA decision rule:
This is linear in — the decision boundary is a hyperplane.
Assumptions of LDA
LDA assumes:
- Class-conditional distributions are Gaussian.
- All classes share the same covariance matrix.
- Features are continuous.
Violations of these assumptions weaken LDA but do not always break it — LDA is fairly robust in practice.
When the covariance assumption is violated, Quadratic Discriminant Analysis (QDA) allows each class its own covariance, producing quadratic decision boundaries.
A worked intuition
For a two-class problem (), LDA produces one discriminant direction. The data is projected onto a single line. Along this line, the two classes are as separated as possible.
If two classes have means and and share covariance :
The direction is roughly "from class 1 mean to class 2 mean," adjusted by the inverse covariance. If the features have very different scales or are correlated, corrects for it.
Strengths and weaknesses
Strengths:
- Supervised — uses labels to find discriminative directions.
- Simple and fast.
- Often a strong baseline for classification.
- Provides interpretable directions (the LDA components are class-discriminative axes).
- Closed-form solution, no iteration required.
Weaknesses:
- Assumes Gaussian class-conditionals with equal covariances.
- Limited to dimensions for classes — fewer than PCA can produce.
- Sensitive to outliers.
- Cannot capture non-linear class boundaries.
- Requires the within-class scatter matrix to be invertible — fails when the number of features exceeds the number of samples (regularised LDA addresses this).
PCA vs LDA — comparison
| PCA | LDA | |
|---|---|---|
| Supervision | Unsupervised | Supervised (needs labels) |
| Objective | Maximise variance | Maximise class separability |
| Output dimensions | Up to | Up to |
| Best when | Data has a low-dimensional subspace of variation | Classes have distinct means in some directions |
| Use as classifier | No (just reduces dimensions) | Yes (built-in classifier) |
| Sensitivity to scaling | Very sensitive | Less sensitive |
| Assumptions | None about distributions | Gaussian classes, equal covariance |
In practice, the methods are often used together — PCA first to remove very-low-variance noise dimensions, then LDA on the PCA output for class-discriminative reduction.
Applications
LDA appears in:
- Face recognition. Classic "Fisherfaces" approach uses LDA on the pixel features (or on PCA-reduced features).
- Speech recognition. LDA-based feature transformations in HMM-based ASR.
- Bioinformatics. Disease subtype classification from gene-expression data.
- Marketing. Customer-segment classification from behavioural features.
- Document classification. When class labels are available (topic, sentiment).
In Nepal-relevant contexts:
- Loan-approval features can be reduced via LDA to a small number of class-discriminative scores.
- Network-traffic classification (normal vs intrusion) can use LDA on flow features.
- Medical-diagnosis features (clinical data, lab tests) can be LDA-reduced for screening systems at major hospitals.
The chapter has covered the main unsupervised-learning tasks (clustering and dimensionality reduction) along with the supervised variant of dimensionality reduction (LDA). Together with the supervised methods of Chapters 3 and 4, these complete the core algorithmic toolkit. The next chapter turns to evaluation — how to measure whether the algorithms covered so far are actually working.