Gradient-based learning applied to document recognition

Y. Lecun; L. Bottou; Y. Bengio; P. Haffner (1998)

Historical Paper Review

Patrick Li

BDSI, ANU

Overview

  1. The 1998 computer vision landscape
  2. The problem with traditional pattern recognition
  3. Gradient-based learning revolution
  4. Convolutional neural network foundations
  5. LeNet-5
  6. Beyond LeNet: Graph transformer networks
  7. GTN Applications

The 1998 computer vision landscape

%%{init: {'themeVariables': { 'fontSize': '20px'}}}%%
timeline
    title Evolution of Pattern Recognition (Pre-1998)
    1950s : Early Pattern Recognition in Cybernetics : Perceptron
    1960s : Linear Classifiers & Perceptrons dominate : Expose limitations (XOR problem), triggering "AI winter" for neural networks
    1970s : Statistical Decision Theory (Bayesian methods, k-NN, Fisher discriminant analysis) : Hidden Markov Models (HMMs)
    1980-1984 : Symbolic AI & Expert Systems peak : Hand-crafted feature engineering dominates : Neocognitron introduced (early CNN)
    1985-1989 : Backpropagation popularized : Neural networks revived
    1990s : Support Vector Machines (SVMs) : Neural Networks struggle (Limited compute/data, but backpropagation improves) : Boosting algorithms
    1998 : LeNet-5 (First practical CNN for digit recognition) : End-to-End Learning

Traditional pattern recognition

%%{init: {'themeVariables': { 'fontSize': '15px'}}}%%
graph TD
    A["Input Image"] --> B["Feature Extractor <br> (Handcrafted, Task-Specific)"]
    B --> C["Low-Dimensional Representation <br> (Invariant to Distortions)"]
    C --> D["Classifier <br> (General-Purpose, Trainable)"]
    D --> E["Prediction / Label"]

    subgraph " "
        B
        C
        D
    end

Two-module system

  1. Feature Extractor
  2. Classifier

Problem

Recognition accuracy heavily depends on the designer’s ability to create good features, which is time-consuming and problem-specific.

Challenges in handwriting recognition

Separating characters (Segmentation) within words/sentences is a major challenge.

Real world handwriting will be 100x worse than this.

Heuristic Oversegmentation (HOS)

Generates potential cuts between characters based on heuristics, then selects best combination based on recognizer scores.

Shift to learning-based systems

%%{init: {'themeVariables': { 'fontSize': '20px'}}}%%
graph TB
        subgraph "Learning-Based Approach"
        F[Raw Image] --> G[End-to-End<br/>Learning System]
        G --> H[Output]
        G -.-> I[("🤖 Automatic<br/>Feature Discovery")]
    end
    
    subgraph "Traditional Approach"
        A[Raw Image] --> B[Hand-crafted<br/>Feature Extractor]
        B --> C[Trainable<br/>Classifier]
        C --> D[Output]
        B -.-> E[("👨‍💻 Human Expert<br/>Designs Features")]
    end
    
    style E fill:#ffcccc
    style I fill:#ccffcc

  1. Faster hardware enables brute-force numerical methods.
  2. Large datasets (e.g., MNIST) allow reliance on real data.
  3. Powerful learning techniques handle high-dimensional inputs and complex decision functions.

Gradient-based learning

%%{init: {'themeVariables': { 'fontSize': '25px'}}}%%
graph LR
    A[Input Z] --> B[Learning Machine F]
    B --> C[Output Y]
    D[Parameters W] --> B
    E[Desired Output D] --> F[Loss Function E]
    C --> F
    F --> G["Gradient $$\;\partial E/ \partial W$$"]
    G --> H[Parameter Update]
    H --> D
    
    style F fill:#ffffcc
    style G fill:#ccffff

  • Learning machine computes: \(Y^p = F(Z^p, W)\), where\(Z^p\) is input, \(W\) is parameters.
  • Loss function: \(E^p = D(D^p, F(W, Z^p))\), measuring discrepancy between desired and actual output.
  • Average loss: \(E_{\text{train}}(W)\) over training set.

Generalization gap

\[E_{\text{test}} - E_{\text{train}} = k(h/P)^\alpha,\] where \(P\) is the number of training samples, \(h\) is the model capacity, \(\alpha \in [0.5, 1]\) and \(k \in \mathbb{R}\) are constants.

Tradeoff

Increasing capacity reduces \(E_{\text{train}}\) but increases the gap. Optimal capacity minimizes \(E_{\text{test}}\).

Structural risk minimization

\[\arg\min_{W} \; E_{\text{train}} + \beta H(W),\] where \(H(W)\) penalizes high-capacity parameters (regularization) and \(\beta\) is a constant.

In practice, structural risk minimization is usually achieved by:

  • Using models of increasing complexity (e.g., polynomial degree, network size)
  • Adding regularization penalties (e.g., L1/L2 regularization)
  • Selecting models via validation

Optimization techniques

%%{init: {'themeVariables': { 'fontSize': '18px'}}}%%
graph TD
    A[Optimization]

    A --> B[Gradient Descent]
    B --> B1["$$W_k = W_{k-1} - \varepsilon \; \frac{\partial E(W)}{\partial W}$$"]
    B --> B2["`ε can be scalar, 
    matrix, or adaptive`"]
    B2 --> B2a[Examples: <br>quasi-Newton methods]

    A --> C[Stochastic Gradient]
    C --> C1["$$W_k = W_{k-1} - \varepsilon \; \frac{\partial E^{p_k}(W)}{\partial W}$$"]
    C --> C2[Faster convergence for large, <br>redundant datasets]
    C2 --> C2a[Examples: speech, <br>character recognition]

Popularization

The popularization of gradient-based learning was driven by three key events:

  1. Local minima were found not to be a major issue in practice.

  2. Backpropagation algorithm was introduced as a simple and efficient method to compute gradients in multilayer nonlinear systems.

  3. Backpropagation + multilayer neural networks + sigmoidal units was demonstrated to effectively solve complex learning problems.

Backpropagation

Gradients are computed by propagating errors backward from the output layer to the input layer, using the chain rule.

\[\begin{align*} \frac{\partial E^p}{\partial W_n} &= \frac{\partial F}{\partial W}(W_n, X_{n-1})\frac{\partial E^p}{\partial X_n} \\ \frac{\partial E^p}{\partial X_{n-1}} &= \frac{\partial F}{\partial X}(W_n, X_{n-1})\frac{\partial E^p}{\partial X_n}, \\ \end{align*}\]

where \(W_n\) and \(X_n\) denote the weights and outputs of the \(n\)th layer, respectively. And \(X_n = F_n(W_n, X_{n-1})\).

Why fully connected NNs struggle with character recognition?

  1. High dimensionality: Large input sizes lead to an explosion of parameters.
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None, 4096)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 256)            │     1,048,832 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 2)              │           514 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,049,346 (4.00 MB)
 Trainable params: 1,049,346 (4.00 MB)
 Non-trainable params: 0 (0.00 B)

Why fully connected NNs struggle with character recognition?

  1. No translation invariance: The same object in a different place looks completely different.
keras_mod$predict(verbar, verbose = 0L)
          [,1]      [,2]
[1,] 0.8466364 0.1533636
keras_mod$predict(verbar_shift, verbose = 0L)
          [,1]      [,2]
[1,] 0.6208922 0.3791077

Why fully connected NNs struggle with character recognition?

  1. Ignored input topology: The spatial structure of images and local correlations are not inherently utilized.

Convolutional neural networks

A convolutional neural network typically consists of:

  1. Convolutional layers
  2. Subsampling (or pooling) layers
  3. Fully connected layers

Local receptive fields

  • Convolutional layers connect each unit to a small local region of the previous layer.
  • Local receptive fields extract basic features (e.g. edges, corners), which are combined in deeper layers to detect higher order patterns.

Weight sharing

Weight sharing

\[\text{Weight sharing} = \text{shift invariance} + \text{less parameters} + \text{smaller generalization gap}\]

  • Local feature detectors can be applied across the image using identical weights.
  • Multiple filters extract different features at each location.
  • Shifted inputs produce shifted feature maps, making CNNs robust to shifts and distortions.
  • Weight sharing reduces model capacity, helping to reduce the train-test error gap.

Spatial subsampling

Exact feature location is less important than relative position. Pooling reduces resolution and sensitivity to shifts and distortions by averaging over local neighbourhoods.

LeNet-5

%%{init: {'themeVariables': { 'fontSize': '33px'}}}%%
graph LR
    Input["Input<br/>32×32×1<br/>Grayscale Image"] --> C1["Convolution<br/>6 filters, 5×5<br/>Output: <br/>28×28×6"]
    C1 --> S2["Subsampling<br/>Average Pooling 2×2<br/>Output: <br/>14×14×6"]
    S2 --> C3["Convolution<br/>16 filters, 5×5<br/>Output: <br/>10×10×16"]
    C3 --> S4["Subsampling<br/>Average Pooling 2×2<br/>Output: <br/>5×5×16"]
    S4 --> C5["Convolution<br/>120 filters, 5×5<br/>Output: <br/>1×1×120<br/>"]
    C5 --> F6["Dense<br/>84 units<br/>tanh activation"]
    F6 --> Output["Output Layer<br/>10 units<br/>Classes 0-9"]

    classDef convLayer fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef poolLayer fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef fcLayer fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef inputOutput fill:#fff3e0,stroke:#e65100,stroke-width:2px

    class Input,Output inputOutput
    class C1,C3,C5 convLayer
    class S2,S4 poolLayer
    class F6 fcLayer

Main characteristics

  • Feature maps increases as resolution decreases, building invariance through richer representations.
  • Standardised inputs accelerates training.
  • No padding is used, so feature maps shrink after convolution.
  • The final layer compares against 10 prototype vectors (RBF units) using Euclidean distance (prototype-based classification). It will produce low activations for out-of-distribution inputs.

Discriminative MAP criterion

Maximizes the posterior probability of the correct class while penalizing incorrect classes.

\[E(W) = \frac{1}{P} \sum_{p=1}^{P} \left( y_{D^p}(Z^p, W) + \log \left( e^{-j} + \sum_{i} e^{-y_i(Z^p, W)} \right) \right)\]

  • The output layer uses RBF units, where a lower \(y_{D^p}\) indicates better matching.
  • The term \(e^{-j}\) accounts for a “rubbish” class to reject non-character inputs, where \(j\) is a positive constant.

Results

  • LeNet-5 achieves high accuracy on MNIST, with LeNet-1 outperforming larger fully connected networks.
  • Data augmentation and boosting (e.g., boosted LeNet-4) significantly improve performance.
  • CNNs scale well with data and preserve spatial structure, outperforming SVMs and KNN-based methods like tangent distance classifier.
  • SVMs perform surprisingly well despite lacking spatial priors, but are less efficient on large datasets.

Multimodule systems

%%{init: {'themeVariables': { 'fontSize': '30px'}}}%%
graph LR
    A[Input Data] --> B[Module 1: <br/>Neural Network]
    A --> C[Module 2: <br/>Frozen <br/>Embedding]
    A --> D[Module 3: <br/>Nondifferentiable <br/>Function]
    
    B --> E[Intermediate <br/>Output 1]
    C --> F[Intermediate <br/>Output 2]
    D --> G[Intermediate <br/>Output 3]
    
    E --> H[Combiner <br/>Module]
    F --> H
    G --> H
    
    H --> I[Final Output]
    I --> J[Loss Function]
    
    J -.->|∂L/∂θ| B
    J -.->|No gradients| C
    J -.->|∂L/∂θ <br/> via <br/>differentiable <br/>params| D
    classDef trainable fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef frozen fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef nondiff fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef gradient stroke:#d32f2f,stroke-width:3px,stroke-dasharray: 5 5
    
    class B,H,K trainable
    class C,L frozen
    class D,M nondiff
    class J gradient

Gradient-based learning (backpropagation) can be generalized to systems composed of heterogeneous modules, not just traditional neural networks.

Graph transformer network (GTN)

Variable-length inputs can be represented using directed graphs, where each path corresponds to a possible sequence and arcs carry vectors containing information such as probabilities, features, or penalties.

In GTN, each module takes graph as input and output a graph.

Segmentation graph

HOS is a traditional segmentation method that generates candidate cuts using heuristics and selects the best one.

  • nodes represent cuts
  • arcs represent ink segments between cuts that could form characters

Recognition transformer

The recognition transformer converts the segmentation graph into an interpretation graph.

It processes each segmentation arc to generate all possible interpretations.

Penalties from segmentation and recognition are combined on interpretation arcs.

Viterbi transformer

The Viterbi transformer extracts the best interpretation using dynamic programming.

It finds the lowest-penalty path in the interpretation graph (Viterbi path).

Constrained interpretation graph

During training, a path selector is inserted between the interpretation graph and the Viterbi transformer.

This selector filters for all paths that contain the correct label sequence, producing what is known as the constrained interpretation graph.

Viterbi training

Viterbi algorithm finds the best path, and we want it close to the actual label sequence.

Loss = sum of penalties on the Viterbi-selected path.

Only the final path receives gradient updates; others are zero.

Competing low-penalty paths are ignored.

Discriminative Viterbi training

Compute the difference in penalties between the Viterbi path and the constrained Viterbi path, and use it as a loss term.

This penalizes incorrect predictions, but lacks a margin between classes. Once the penalties match, the gradient vanishes.

Ideally, wrong paths should be penalized more if they are too close to the correct one.

Discriminative forward training

Computes the “soft” version of the min function over all path penalties in both the constrained graph and unconstrained graph:

\[f_n = \text{logadd}_{i \in U_n}(c_i + f_{s_i}),\] where \(\text{logadd}(x_1, x_2, ..., x_n) = -\log(\sum_{i=1}^n e^{-x_i})\), \(U_n\) denotes the set of upstream arcs of node \(n\), \(c_i\) is the penalty on arc \(i\), and \(s_i\) is the source node of the arc.

Unlike Viterbi, which selects the best path, this smoothly incorporates all possible paths into the computation.

Space displacement neural network (SDNN)

We can avoid segmentation heuristics by applying a CNN at every position in the image, similar to how filters operate, to produce character probabilities at each location.

This output is then fed into a Character Model Transducer to construct the interpretation graph.

Space displacement neural network (SDNN)

SDNN can be used for brute-force object detection by training a CNN to output the probability of an object (e.g., a dog) in a fixed-size image.

Sweeping this CNN over a large image produces a 2D probability heatmap.

Repeating this at multiple scales allows detection of objects at different sizes.

AMAP representation

AMAP encodes pen trajectories as low-resolution images, with each pixel holding a five-element feature vector: four features are associated to four orientations of the pen trajectory in the area around the pixel and one for curvature. It is invariant to stroke order and writing speed.

Ex.1: online handwriting recognition

Ex.2: check reading system

Workflow

  1. Identify fields likely to contain the courtesy amount.
  2. Segment each field into characters, score interpretations, and select the most probable field and amount.

Takeaways

  1. Neural networks can handle inputs and outputs of varying structure (e.g. sequences, graphs) as long as the computation is differentiable. The model parameters can remain fixed in size.
  2. The same training approach applies when choosing between discrete interpretations, such as in structured prediction tasks.
  3. This 1998 paper laid the foundation for modern deep learning systems.
  • Replacing handcrafted features with end-to-end learning
  • Leveraging spatial structure via convolutional design
  • Scaling models effectively, shown by LeNet-5’s success in document recognition.

Thanks! Any questions?