Chapter 7 — Deep Learning Basics
The previous chapters built up the classical machine-learning toolkit — linear models, decision trees, Bayesian methods, SVMs, K-means, PCA. Each works well for specific problem shapes. Deep learning is the modern descendant of the neural-network ideas introduced in Chapter 3, scaled up by orders of magnitude in depth, data, and computation. It has displaced classical methods on most large-scale problems involving images, speech, text, and complex sensor data, and has reshaped what machine learning can do. This chapter covers the foundations — what makes a network "deep," how backpropagation actually computes gradients, the activation functions that make non-linearity possible, the convolutional and recurrent architectures that dominate vision and sequence tasks, and the specific applications of ML in cybersecurity that draw together the threads of the whole subject.
7.1 Definition of deep networks
Deep neural network
A deep neural network is an artificial neural network with multiple hidden layers between the input and output layers — typically three or more — that learns hierarchical representations of data by composing simple non-linear transformations into increasingly abstract features.
The MLP introduced in Chapter 3 is a neural network. A deep neural network is the same idea, taken further. The threshold for "deep" is conventional — typically three or more hidden layers — but modern networks routinely have tens, hundreds, or even thousands of layers.
Why depth helps
The Universal Approximation Theorem (Chapter 3) says a single hidden layer is enough to approximate any continuous function — given enough neurons. So why bother with depth?
Parameter efficiency. Some functions require an exponentially large shallow network but only a polynomially large deep network. The classic example: the parity function (XOR generalised to $n$ bits) requires $O(2^n)$ neurons in a shallow network but only $O(n)$ in a deep one.
Hierarchical features. Deep networks learn features at multiple levels of abstraction. In a vision network, the first layer detects edges; the next detects corners and short contours; the next detects motifs (eyes, wheels); the top layers detect whole objects. Each layer's features are constructed from the layer below.
Better generalisation. With proper regularisation, deep networks often generalise better than equally-parameterised shallow ones. The hierarchical structure seems to match the structure of natural data.
Data-efficient learning. Deep networks transfer well — a network pre-trained on a large dataset can be fine-tuned on a smaller related dataset with much less data than training from scratch would require.
Parameter counts
Modern deep networks have many parameters. A reference table:
| Network | Year | Parameters | Notable |
|---|---|---|---|
| LeNet-5 | 1998 | 60,000 | First major CNN |
| AlexNet | 2012 | 60 million | First ImageNet deep-learning win |
| VGG-16 | 2014 | 138 million | Deeper architecture |
| ResNet-50 | 2015 | 25 million | Skip connections |
| BERT-base | 2018 | 110 million | Language model |
| GPT-3 | 2020 | 175 billion | Large language model |
| GPT-4 (estimated) | 2023 | ~1 trillion+ | Multimodal LLM |
| GPT-5 / Claude 4.x | 2024-26 | undisclosed | State-of-the-art |
Among the leading-edge models, each generation has had roughly 10× more parameters than the last, although the table shows exceptions such as ResNet-50. The pattern is unlikely to continue forever (the marginal returns to scale are diminishing), but the trend has dominated the past decade.
What changed in 2012
Deep learning's modern era began with AlexNet in 2012 — a CNN that won the ImageNet competition by a large margin. Three factors made it possible:
Large labelled datasets. ImageNet provided 1.2 million labelled training images across 1000 categories.
GPU computing. General-purpose GPU computation made the matrix multiplications underlying neural networks fast enough.
Algorithmic improvements. ReLU activation (faster training than sigmoid/tanh), dropout (regularisation), and large-scale stochastic gradient descent enabled training of deeper networks.
The combination of these factors is the recipe behind every deep-learning advance since.
Representation learning
A key idea: deep networks learn their own representations. Classical ML relies on hand-engineered features — a fraud-detection model uses features like "transaction count in past 30 days." A deep network can learn its own features directly from raw inputs (raw transaction sequences, raw images, raw text) and often learns features that hand engineering misses.
This is why deep learning has succeeded on unstructured data — images, text, audio — where hand-engineering features is hard. On structured tabular data, gradient boosting and other classical methods still often match or beat deep learning.
7.2 Feed-forward and backpropagation
Feed-forward computation
The forward pass through a network was introduced in Chapter 3. To recap:
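For a network with $L$ layers:
- Input: $a^{(0)} = x$.
- For each layer $l = 1, \dots, L$:
  - Pre-activation: $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
  - Activation: $a^{(l)} = f(z^{(l)})$
- Output: $\hat{y} = a^{(L)}$.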
The computation flows from input through layers in one direction — forward. Each layer's output is the next layer's input. No loops, no recurrence (for a basic feed-forward network).
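The forward pass can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names `dense` and `forward`, the 2-3-1 layer sizes, and the weight values are all hypothetical, and sigmoid is assumed as the activation in every layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, a_prev, b):
    """Pre-activation z = W a_prev + b for one layer (W is a list of rows)."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, layers, activation=sigmoid):
    """Run the forward pass; `layers` is a list of (W, b) pairs.

    Returns the pre-activations and activations of every layer,
    with activations[0] being the input itself.
    """
    activations = [list(x)]
    pre_activations = []
    for W, b in layers:
        z = dense(W, activations[-1], b)
        pre_activations.append(z)
        activations.append([activation(z_i) for z_i in z])
    return pre_activations, activations

# Hypothetical 2-3-1 network with hand-picked weights.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2]], [0.05]),
]
zs, activations = forward([1.0, 2.0], layers)
y_hat = activations[-1]
```

Keeping every intermediate value, rather than only the final output, is a deliberate choice: training needs those intermediates again when gradients are computed.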
The training problem
To train a network, we need gradients of the loss function with respect to every parameter — every weight and every bias in every layer. A network with 100 million parameters needs 100 million partial derivatives, recomputed every gradient update.
Computing these naively (one parameter at a time, with numerical differentiation) is infeasible. Backpropagation computes all of them efficiently with one forward pass and one backward pass.
Backpropagation
Backpropagation is the algorithm for computing the gradient of a neural network's loss function with respect to every parameter, applying the chain rule of calculus layer-by-layer from the output backward, enabling efficient training of multi-layer networks via gradient descent.
Backpropagation was developed independently several times — Paul Werbos in 1974, Rumelhart-Hinton-Williams in 1986 (the paper that made it widely known). It is the engine of modern deep learning.
The chain rule applied to networks
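For a loss $\mathcal{L}$ that depends on parameters through multiple layers, the chain rule lets us decompose the gradient. Consider a simple two-layer network:
$$\hat{y} = a^{(2)} = f\left(W^{(2)} f\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}\right)$$
The loss depends on $W^{(1)}$ through $z^{(1)}$, then $a^{(1)}$, then $z^{(2)}$, then $a^{(2)} = \hat{y}$.
By the chain rule:
$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}}$$
Each factor is a local derivative computable from the layer's operation. The chain multiplies them.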
The backpropagation algorithm
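Define $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$, the error signal at layer $l$, where $z^{(l)}$ and $a^{(l)}$ are the pre-activation and activation of layer $l$ and $L$ is the output layer. The algorithm:
- Forward pass. Compute $z^{(l)}$ and $a^{(l)}$ for all layers, ending with the loss $\mathcal{L}$.
- Compute output-layer error:
$$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot f'(z^{(L)})$$
(where $\odot$ is element-wise multiplication and $f'$ is the derivative of the activation).
- Backward pass. For each layer $l = L-1, L-2, \dots, 1$:
$$\delta^{(l)} = \left( (W^{(l+1)})^\top \delta^{(l+1)} \right) \odot f'(z^{(l)})$$
- Compute parameter gradients:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^\top, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$$
The algorithm uses each forward-pass intermediate $z^{(l)}$ and $a^{(l)}$, plus the error signals $\delta^{(l)}$ propagated backward through the network.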
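The four steps can be sketched for a small fully-connected network in plain Python. This is an illustrative sketch under stated assumptions, not a production implementation: every layer is assumed to use sigmoid, the loss is assumed to be squared error with a factor of 1/2, and the function names, the 2-2-1 layer sizes, and the weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Return activations for every layer; acts[0] is the input itself."""
    acts = [list(x)]
    for W, b in layers:
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b_i
             for row, b_i in zip(W, b)]
        acts.append([sigmoid(v) for v in z])
    return acts

def loss(y_hat, y):
    """Squared-error loss with the conventional factor of 1/2 (an assumption)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

def backprop(x, y, layers):
    """Return [(dW, db), ...], one pair per layer, for a sigmoid network."""
    acts = forward(x, layers)
    # Output-layer error; for sigmoid, f'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        a_prev = acts[l]
        # dL/dW is the outer product of delta and a_prev; dL/db is delta.
        grads.append(([[d * a for a in a_prev] for d in delta], list(delta)))
        if l > 0:
            W = layers[l][0]
            # Error for the layer below: (W^T delta) * f'(z).
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    grads.reverse()
    return grads

# Hypothetical 2-2-1 network and a single training example.
layers = [([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),
          ([[0.3, -0.1]], [0.05])]
x, y = [1.0, 2.0], [1.0]
grads = backprop(x, y, layers)
```

A useful sanity check for any hand-written backward pass is to compare one entry of `grads` with a finite-difference estimate: perturb that single weight by a small amount in each direction, recompute the loss, and divide the difference by twice the perturbation.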
Small worked example
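A tiny network: one input $x$, one hidden neuron with weight $w_1$ and bias $b_1$ and sigmoid activation, one output neuron with weight $w_2$ and bias $b_2$ and sigmoid activation, target $y$, squared error loss.
Forward:
- $z_1 = w_1 x + b_1$.
- $a_1 = \sigma(z_1)$.
- $z_2 = w_2 a_1 + b_2$.
- $\hat{y} = a_2 = \sigma(z_2)$.
- $\mathcal{L} = \tfrac{1}{2}(\hat{y} - y)^2$.
Backward:
- $\partial \mathcal{L} / \partial \hat{y} = \hat{y} - y$.
- $\delta_2 = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})$.
- $\partial \mathcal{L} / \partial w_2 = \delta_2\, a_1$.
- $\partial \mathcal{L} / \partial b_2 = \delta_2$.
- $\delta_1 = \delta_2\, w_2\, a_1 (1 - a_1)$.
- $\partial \mathcal{L} / \partial w_1 = \delta_1\, x$.
With learning rate $\eta$:
- $w_2 \leftarrow w_2 - \eta\, \delta_2\, a_1$.
- $w_1 \leftarrow w_1 - \eta\, \delta_1\, x$.
Both weights nudge in the right direction, toward producing output closer to the target $y$.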
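The same one-hidden-neuron network can be traced numerically in a short script. The concrete numbers below (input, target, initial weights, learning rate) are hypothetical values chosen only for illustration and are not taken from the example above; the loss is assumed to be squared error with a factor of 1/2.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only for illustration.
x, y = 1.0, 1.0
w1, b1 = 0.5, 0.0
w2, b2 = 0.5, 0.0
eta = 0.5

def predict(w1, b1, w2, b2):
    """Forward pass: returns the hidden activation and the output."""
    a1 = sigmoid(w1 * x + b1)
    return a1, sigmoid(w2 * a1 + b2)

# Forward pass.
a1, y_hat = predict(w1, b1, w2, b2)
loss = 0.5 * (y_hat - y) ** 2

# Backward pass, using sigma'(z) = sigma(z) * (1 - sigma(z)).
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
grad_w2 = delta2 * a1
grad_b2 = delta2
delta1 = delta2 * w2 * a1 * (1.0 - a1)
grad_w1 = delta1 * x
grad_b1 = delta1

# One gradient-descent step on all four parameters.
w1_new, b1_new = w1 - eta * grad_w1, b1 - eta * grad_b1
w2_new, b2_new = w2 - eta * grad_w2, b2 - eta * grad_b2
_, y_hat_new = predict(w1_new, b1_new, w2_new, b2_new)
loss_new = 0.5 * (y_hat_new - y) ** 2
```

With these values the prediction starts below the target, so both weight gradients are negative and the update increases both weights; `loss_new` is smaller than `loss` after the step.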
Modern automatic differentiation
In practice, no one implements backpropagation by hand. Automatic differentiation (autodiff) libraries — PyTorch, TensorFlow, JAX — compute gradients automatically from the forward-pass code. The user writes the forward computation; the library computes the gradients.
This frees practitioners to experiment with arbitrary network architectures. As long as the forward pass is written in terms of differentiable operations, the backward pass is free.
Gradient flow problems
Backpropagation has known difficulties in deep networks.
Vanishing gradients. Sigmoid and tanh activations have small derivatives: sigmoid's is at most 0.25, and tanh's is at most 1 and falls well below 1 away from zero. Multiplying many such derivatives along a long chain gives a very small product, so the gradient at lower layers approaches zero. Those layers train extremely slowly or not at all.
Exploding gradients. Symmetrically, if the weights are large enough that each layer multiplies the gradient by a factor greater than 1, the gradient grows exponentially along the chain. Training diverges.
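Both effects can be seen in a deliberately simplified scalar model. The sketch below assumes every layer contributes the same local factor, a single weight times the sigmoid derivative at a fixed pre-activation; real networks have different factors per layer, and the function name `chain_factor` and the weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def chain_factor(depth, z=0.0, weight=1.0):
    """Product of `depth` identical local factors weight * sigma'(z)."""
    return (weight * sigmoid_grad(z)) ** depth

# sigma'(0) = 0.25, so weight 1 gives 0.25 per layer (vanishing)
# and weight 8 gives 2.0 per layer (exploding).
for depth in (1, 5, 10, 20):
    print(depth, chain_factor(depth), chain_factor(depth, weight=8.0))
```

The first column of results shrinks geometrically with depth while the second grows geometrically, which is the behaviour the two paragraphs above describe.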
Mitigations:
- Careful initialisation. Xavier/Glorot, He initialisation, scaled to keep activations and gradients well-behaved.
- ReLU activation. Derivative is 0 or 1; does not shrink gradients.
- Batch normalisation. Normalises activations across the batch, keeping them well-conditioned.
- Skip connections (ResNet, 2015). Add the input of a block to its output. Gradients flow through the skip path without going through the block's transformations.
- Gradient clipping. Cap the gradient's magnitude during backpropagation.
Skip connections were the breakthrough that allowed networks of 100+ layers to train at all.
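Of the mitigations listed above, gradient clipping is the simplest to show in isolation. The sketch below clips by global L2 norm, which is one common variant; the function name `clip_by_norm` and the flat-list representation of the gradient are assumptions made for illustration.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` (a flat list of floats) so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left alone
```

Rescaling the whole vector, rather than capping each component separately, preserves the direction of the update and only limits its size.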
7.3 Activation functions
Activation functions are the non-linear functions applied at each neuron. Without them, a deep network would collapse to a single linear transformation, because a composition of linear maps is itself linear. With them, deep networks can approximate arbitrary continuous functions.
Sigmoid
The sigmoid activation function is $\sigma(z) = \frac{1}{1 + e^{-z}}$, mapping any real number to the open interval $(0, 1)$ with an S-shape, historically the standard activation in early neural networks but largely replaced by ReLU in modern deep learning due to gradient-saturation problems.
Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.
Properties:
- Output in $(0, 1)$, useful for representing probabilities.
- Smooth and differentiable.
- Saturates at extremes: $\sigma(z) \to 1$ as $z \to +\infty$, and $\sigma(z) \to 0$ as $z \to -\infty$. The derivative becomes very small in these regions, causing vanishing gradients.
- Output is not zero-centred, which slows training.
Sigmoid is still used in:
- The output layer for binary classification (predicting probability of the positive class).
- Gates in LSTM and GRU networks (Section 7.5).
It is rarely used in hidden layers of modern deep networks because of the vanishing-gradient problem.
Tanh
The hyperbolic tangent activation $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ maps real numbers to $(-1, 1)$ with an S-shape symmetric around zero, similar to sigmoid but zero-centred, used in some recurrent network architectures.
Derivative: $\tanh'(z) = 1 - \tanh^2(z)$.
Properties:
- Output in $(-1, 1)$, zero-centred (an advantage over sigmoid).
- Smooth and differentiable.
- Saturates at extremes (same vanishing-gradient problem as sigmoid).
Tanh is the standard activation in vanilla RNNs and remains common in LSTM and GRU cells. For feed-forward networks, ReLU has displaced it.
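The saturation behaviour of sigmoid and tanh is easy to check numerically. The sketch below uses only Python's standard `math` module; the function names are illustrative, and the two-branch form of `sigmoid` is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function, written so that exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

# Both derivatives peak at z = 0 and shrink rapidly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_grad(z), tanh_grad(z))
```

At `z = 0` the derivatives are 0.25 for sigmoid and 1.0 for tanh, their largest possible values; by `z = 10` both are close to zero.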
ReLU
ReLU (Rectified Linear Unit) is the activation function $\mathrm{ReLU}(z) = \max(0, z)$, outputting the input unchanged if positive and zero otherwise, the dominant activation in modern deep neural networks because its non-saturating derivative for positive inputs allows fast training of very deep models.
Derivative: 1 if $z > 0$, 0 if $z < 0$, undefined at $z = 0$ (conventionally taken as 0).
Properties:
- No saturation for positive inputs. Gradient flows unchanged for positive activations, dramatically reducing vanishing gradients.
- Computationally cheap. A single comparison.
- Sparse activation. Many neurons output 0 for any given input, producing sparse representations.
Dead ReLU problem. A neuron whose weights have moved such that it outputs 0 for every input always has gradient 0 and never updates. It is "dead." Large learning rates and unfavourable initialisation can kill many neurons. Variants address this:
Leaky ReLU. Small positive slope for negative inputs:
$$f(z) = \max(\alpha z, z)$$
with $\alpha$ typically 0.01. Prevents dead neurons.
Parametric ReLU (PReLU). Same as leaky ReLU but $\alpha$ is learned.
ELU (Exponential Linear Unit). Smooth variant that replaces the negative-side line with the curve $\alpha(e^{z} - 1)$.
GELU (Gaussian Error Linear Unit). Smooth approximation of ReLU, used in transformer models (BERT, GPT family).
In 2026, GELU is dominant in transformer architectures; ReLU and its variants remain common in CNNs and basic MLPs.
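The ReLU family can be written down directly from the definitions. This is a scalar sketch using only the standard library; the default `alpha` values are common choices rather than anything fixed by the text, and `gelu` uses the exact form based on the Gaussian CDF (many libraries also offer a tanh-based approximation, not shown here).

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU; passing a learned alpha gives the PReLU forward pass."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Identity for positive inputs, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# All four agree closely for large positive inputs and differ for negative ones.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, relu(z), leaky_relu(z), elu(z), gelu(z))
```

Unlike plain ReLU, the three variants return a small non-zero value for negative inputs; leaky ReLU and ELU also keep a non-zero gradient across the whole negative range, which is what lets a neuron recover instead of staying dead.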
Softmax
Softmax is the activation function applied to the output layer of multi-class classification networks, converting a vector of real numbers (logits) into a probability distribution over classes by exponentiating and normalising, producing outputs that sum to 1.
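For $K$ classes with logits $z_1, \dots, z_K$:
$$p_k = \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
The output is a probability distribution: each $p_k \in (0, 1)$ and $\sum_{k=1}^{K} p_k = 1$.
Softmax is the multi-class generalisation of sigmoid. For $K = 2$, softmax is equivalent to sigmoid on the logit difference.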
The softmax output combined with cross-entropy loss produces well-behaved gradients and is the standard for multi-class classification.
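A short sketch makes the "well-behaved gradients" claim concrete: with cross-entropy loss, the gradient with respect to the logits is simply the predicted probabilities minus the one-hot target. The function names below are illustrative, and subtracting the maximum logit before exponentiating is a standard trick to avoid overflow, added here as an implementation detail.

```python
import math

def softmax(logits):
    """Softmax with max-subtraction so exp never overflows."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])

def softmax_ce_grad(logits, target_index):
    """Gradient of cross-entropy w.r.t. the logits: p - one_hot(target)."""
    p = softmax(logits)
    return [p_k - (1.0 if k == target_index else 0.0)
            for k, p_k in enumerate(p)]

# Hypothetical logits for a 3-class problem whose true class is index 0.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
grad = softmax_ce_grad(logits, target_index=0)
```

For two classes, `softmax([a, b])[0]` equals the sigmoid of `a - b`, which is the equivalence noted above.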
Choosing activation functions
Standard practice in 2026:
| Position | Function |
|---|---|
| Hidden layers (CNN, MLP) | ReLU or its variants |
| Hidden layers (transformer) | GELU |
| Hidden layers (RNN cells) | Tanh and sigmoid (in gates) |
| Output (binary classification) | Sigmoid |
| Output (multi-class classification) | Softmax |
| Output (regression) | None (linear) |
These are starting points. Specific applications may call for other choices.
7.4 Convolutional Neural Networks (CNNs)
Convolutional neural network
A Convolutional Neural Network is a deep neural-network architecture designed for grid-structured data (images, audio spectrograms, time series) that uses learnable filters applied across the input spatially, followed by pooling and fully-connected layers, dramatically reducing parameter count compared to fully-connected networks while exploiting local structure.
CNNs are the architecture that made deep learning succeed on images. They have largely been displaced by Vision Transformers (ViT) on the largest scales since 2020, but CNNs remain the workhorse for most image-related applications.
The convolution operation
The core idea: instead of connecting every input pixel to every neuron in the next layer (an enormous number of parameters), connect each neuron only to a small local receptive field of the input. Use the same weights across all positions — weight sharing.
A 2D convolution applies a learnable filter (or kernel) — a small grid of weights — at every position of the input. At each position, multiply the filter by the corresponding input region and sum.
For input $X$ and filter $W$ of size $k \times k$:
$$Y[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} W[m, n]\, X[i + m,\, j + n]$$
The output is a feature map — a 2D array of responses to the filter across the input.
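The sliding-window computation can be sketched in pure Python for a single-channel input. The sketch assumes no padding and a stride of 1, and it does not flip the filter (the convention deep-learning libraries generally follow); the image and filter values are hypothetical, chosen so that the filter responds to a vertical edge.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D convolution on lists of lists."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(kernel[m][n] * image[i + m][j + n]
                 for m in range(k_h) for n in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]

# Hypothetical 4x4 image containing a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# Hypothetical 2x2 filter that responds to dark-to-bright transitions.
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d(image, kernel)   # every row is [0, 2, 0]
```

The feature map is non-zero only in the column where the filter straddles the edge, and the same four weights are reused at all nine positions.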
Multiple filters produce multiple feature maps stacked into a tensor. A typical convolutional layer has dozens or hundreds of filters.
Why convolution works for images
Locality. Image features are local: edges, textures, parts. A 3×3 filter detects local patterns; stacking layers extends the receptive field.
Translation invariance. A cat is a cat whether at the top-left or bottom-right of the image. Weight sharing across positions means the same filter detects the pattern wherever it appears. Strictly, convolution alone is translation-equivariant (shifting the input shifts the feature map by the same amount); pooling then adds a degree of invariance.
Parameter efficiency. A fully-connected layer from a 224×224 RGB image (150,528 inputs) to even a modest 1000-neuron layer would have about 150 million parameters. A convolutional layer with 64 filters of size 3×3×3 has only 1,792 parameters (1,728 weights plus 64 biases). Vastly fewer parameters, same expressiveness for local pattern detection.
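The two parameter counts in this comparison can be reproduced with a pair of small helper functions. The function names are hypothetical, and both counts include one bias per output unit or filter.

```python
def dense_params(n_in, n_out, bias=True):
    """Weights (plus optional biases) of a fully-connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def conv_params(k_h, k_w, c_in, n_filters, bias=True):
    """Weights (plus optional biases) of a 2D convolutional layer."""
    return k_h * k_w * c_in * n_filters + (n_filters if bias else 0)

fc = dense_params(224 * 224 * 3, 1000)   # 150,529,000
conv = conv_params(3, 3, 3, 64)          # 1,792
ratio = fc / conv
```

The convolutional count does not depend on the image size at all, which is why the gap widens further at higher resolutions.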
Pooling
A pooling layer reduces the spatial size of feature maps by summarising small regions, typically with max-pooling (take the maximum in each region) or average-pooling, providing translation invariance and reducing computational cost in subsequent layers.
Max pooling 2×2. Take the maximum value in each 2×2 region. Halves the spatial dimensions.
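A minimal pure-Python sketch of 2×2 max pooling follows. It assumes a stride of 2 (non-overlapping windows), which is what halves each spatial dimension; the function name and the sample feature map are hypothetical.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    out_h, out_w = len(fmap) // 2, len(fmap[0]) // 2
    return [[max(fmap[2 * i][2 * j], fmap[2 * i][2 * j + 1],
                 fmap[2 * i + 1][2 * j], fmap[2 * i + 1][2 * j + 1])
             for j in range(out_w)]
            for i in range(out_h)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(fmap)   # [[4, 2], [2, 8]]
```

Each output value keeps only the strongest response in its window, so a small shift of a feature within the window leaves the output unchanged.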
After several alternating convolution and pooling layers, the feature maps become small enough to flatten and feed into fully-connected layers for the final classification.
A typical CNN architecture
A standard image-classification CNN:
Input image (224 × 224 × 3)
↓
Conv layer 1: 64 filters of 3×3, ReLU
↓
Max pool 2×2 → 112 × 112 × 64
↓
Conv layer 2: 128 filters of 3×3, ReLU
↓
Max pool 2×2 → 56 × 56 × 128
↓
Conv layer 3: 256 filters of 3×3, ReLU
↓
Max pool 2×2 → 28 × 28 × 256
↓
Conv layer 4: 512 filters of 3×3, ReLU
↓
Max pool 2×2 → 14 × 14 × 512
↓
Conv layer 5: 512 filters of 3×3, ReLU
↓
Max pool 2×2 → 7 × 7 × 512
↓
Flatten → 25,088
↓
Dense layer: 4096 neurons, ReLU
↓
Dense layer: 4096 neurons, ReLU
↓
Dense output: K classes, Softmax
This is approximately the VGG-16 shape.
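The shapes in the diagram can be checked by tracing them programmatically. The sketch below follows the diagram as drawn (one convolution per block), and assumes 3×3 convolutions with padding that preserves height and width, 2×2 pooling with stride 2, and 1000 output classes in place of the diagram's K; the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(height, width, channels, conv_filters, n_classes,
                 dense_units=(4096, 4096)):
    """Follow (H, W, C) through conv -> pool blocks and count parameters."""
    total = 0
    for f in conv_filters:
        total += 3 * 3 * channels * f + f          # conv weights + biases
        channels = f                               # conv keeps H and W
        height, width = height // 2, width // 2    # pooling halves H and W
    flat = height * width * channels               # flatten
    n = flat
    for units in (*dense_units, n_classes):
        total += n * units + units                 # dense weights + biases
        n = units
    return (height, width, channels), flat, total

shape, flat, params = trace_shapes(
    224, 224, 3, conv_filters=[64, 128, 256, 512, 512], n_classes=1000)
# shape is (7, 7, 512) and flat is 25088, matching the diagram.
```

Under these assumptions the fully-connected layers account for the large majority of the parameters, even though the convolutional layers do most of the feature extraction.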
Landmark CNN architectures
LeNet-5 (1998, Yann LeCun). The first major CNN. Designed for handwritten-digit recognition on cheques. Tiny by modern standards (60,000 parameters) but established the convolutional template.
AlexNet (2012). The breakthrough. 60 million parameters, 8 layers deep, trained on two GPUs. Used ReLU and dropout — both important innovations. Won ImageNet by a massive margin.
VGG (2014). Showed that depth matters. Simple architecture (only 3×3 convolutions) with 16 or 19 layers. Heavy parameter count.
GoogLeNet / Inception (2014). "Inception modules" — parallel convolutions of different sizes within each block. More parameter-efficient than VGG.
ResNet (2015). Introduced residual connections, skip paths that let gradients flow through. Enabled training of networks with 50, 101, or 152 layers. The architecture underlying most modern CNNs.
DenseNet (2017). Each layer receives inputs from all previous layers. Very parameter-efficient.
EfficientNet (2019). Systematic scaling of depth, width, and resolution. State-of-the-art accuracy with fewer parameters than VGG-style networks.
Vision Transformers (2020-). Apply the transformer architecture (originally for language) to images. At very large scales, ViTs match or exceed CNN performance. Hybrid architectures (combining convolutions and transformers) are now common.
Applications of CNNs
CNNs power:
- Image classification. What is in this image?
- Object detection. Where are the objects in this image? (YOLO, Faster R-CNN, DETR, modern alternatives.)
- Semantic segmentation. Which pixels belong to which object class? (U-Net, DeepLab.)
- Face recognition. Identifying individuals from face images.
- Medical imaging. Cancer detection on X-rays, CT, MRI scans.
- Satellite imagery analysis. Land-use classification, building detection.
- Speech recognition (combined with other architectures).
- Video analysis. Action recognition, scene understanding.
In Nepali contexts:
- Crop-disease detection from smartphone photos for farmers.
- Satellite-imagery analysis for the Ministry of Agriculture and CBS.
- Document classification (citizenship cards, passports, voter ID forms) at government counters.
- Manuscript digitisation projects for Nepali heritage texts.
- License plate recognition at toll plazas and parking facilities.
- Medical image analysis at major hospitals (TUTH, Bir, Patan) — though deployment lags research.
7.5 Recurrent Neural Networks (RNNs)
Recurrent neural network
A Recurrent Neural Network is a neural-network architecture designed for sequential data in which the network maintains a hidden state that is updated at each time step from the current input and the previous state, allowing the network to capture temporal dependencies through learned memory.
CNNs handle grid-structured data. RNNs handle sequence data — text, audio, time series, anything where order matters. Each input arrives at a specific time step; the network processes them in order, carrying information through time via the hidden state.
The basic RNN
For input sequence $x_1, x_2, \dots, x_T$: