Perfect! Here's a detailed, thematically grouped reading roadmap with 10+ key papers per topic to build your computer vision knowledge progressively. Topics include:
- Image Classification
- Object Detection
- Image Segmentation
- Self-Supervised & Representation Learning
- Vision Transformers
- Image Generation (GANs & Diffusion)
1. Image Classification
These papers help you understand how CNNs evolved and how deep models classify entire images.
| Year | Title | Key Idea |
|---|---|---|
| 1998 | LeNet (LeCun et al.) | First CNN, digit recognition |
| 2012 | AlexNet (Krizhevsky et al.) | Deep CNN + ReLU + GPU |
| 2014 | VGGNet (Simonyan & Zisserman) | Deep networks with 3x3 filters |
| 2014 | GoogLeNet (Szegedy et al.) | Inception modules for efficiency |
| 2015 | ResNet (He et al.) | Residual connections enable depth |
| 2016 | Wide ResNet | Trade-off between depth and width |
| 2017 | DenseNet | Layer connections via dense blocks |
| 2018 | SENet | Channel-wise attention mechanism |
| 2019 | EfficientNet | Compound model scaling |
| 2022 | ConvNeXt | CNN re-architected with Transformer tricks |
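The single most influential idea in this table is ResNet's residual connection: a block learns only a residual F(x) and adds its input back, so the identity path keeps gradients flowing through very deep stacks. A minimal pure-Python sketch (the `transform` argument is a stand-in for the block's weight layers, not the paper's exact architecture):

```python
def residual_block(x, transform):
    """y = F(x) + x: the shortcut lets the layers learn only the residual F(x).

    `transform` stands in for the block's weight layers (e.g. two 3x3 convs).
    If it learns to output zeros, the block reduces to an identity mapping,
    which is why very deep stacks of these blocks remain trainable.
    """
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]

# With a zero transform the block is exactly the identity:
identity_out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
```

DenseNet's dense connectivity is the same intuition pushed further: instead of adding the input back, each layer concatenates all earlier feature maps.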
2. Object Detection
These papers trace the evolution from sliding window methods to anchor-free detectors.
| Year | Title | Key Idea |
|---|---|---|
| 2001 | Viola-Jones | Haar features + Adaboost |
| 2014 | R-CNN (Girshick) | CNN on region proposals |
| 2015 | Fast R-CNN | RoI pooling, end-to-end training |
| 2015 | Faster R-CNN | Region Proposal Network |
| 2016 | YOLOv1 | Single-stage, real-time detection |
| 2017 | YOLOv2 / v3 | Better accuracy, multi-scale |
| 2017 | RetinaNet | Focal loss for class imbalance |
| 2018 | Cascade R-CNN | Multi-stage refinement |
| 2019 | CenterNet | Keypoint-based, anchor-free |
| 2020 | YOLOv4 / v5 | Industrial-friendly improvements |
| 2020 | DETR | Transformer-based detection |
| 2022 | YOLOv7 | Cutting-edge real-time detector |
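RetinaNet's focal loss, which the table credits for handling class imbalance, down-weights well-classified examples by a factor (1 − p_t)^γ so that the flood of easy background boxes doesn't swamp the rare objects. A small sketch (α and γ defaults follow the paper; the scalar interface is a simplification):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: label in {0, 1}.
    gamma=0 recovers alpha-weighted cross-entropy; gamma>0 shrinks the loss
    of confident predictions so training focuses on hard examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p=0.9) contributes far less than a hard one (p=0.1):
easy, hard = focal_loss(0.9, 1), focal_loss(0.1, 1)
```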
3. Image Segmentation
Focus shifts to per-pixel classification: semantic and instance segmentation.
| Year | Title | Key Idea |
|---|---|---|
| 2015 | FCN (Long et al.) | First end-to-end segmentation |
| 2015 | DeepLab | Atrous convolution for context |
| 2015 | U-Net | Encoder-decoder for biomedical images |
| 2017 | PSPNet | Pyramid scene parsing |
| 2017 | Mask R-CNN | Instance segmentation with RoIAlign |
| 2018 | DeepLabv3+ | Encoder-decoder, better boundary handling |
| 2019 | HRNet | High-resolution feature learning |
| 2019 | Panoptic FPN | Unified semantic + instance (panoptic) segmentation |
| 2020 | PointRend | Point-based refinement for boundaries |
| 2021 | SegFormer | Lightweight ViT-based segmentation |
| 2022 | Mask2Former | Unified model for all segmentation tasks |
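All of these models are evaluated with essentially the same per-pixel metric: mean Intersection-over-Union across classes. A minimal sketch over flat label lists (real benchmarks accumulate a confusion matrix over whole datasets, but the arithmetic is the same):

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over the classes present in pred or target.

    pred/target are flat lists of per-pixel class ids. Per class,
    IoU = |pred ∩ target| / |pred ∪ target|; classes absent from both
    prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

perfect = mean_iou([0, 1, 1, 2], [0, 1, 1, 2], num_classes=3)
```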
4. Self-Supervised Learning
Self-supervised methods learn representations without labels, crucial for pretraining.
| Year | Title | Key Idea |
|---|---|---|
| 2006 | Autoencoders | Unsupervised feature learning |
| 2018 | CPC (Oord et al.) | Contrastive predictive coding |
| 2020 | SimCLR | Contrastive learning with augmentations |
| 2020 | MoCo (v1–v3) | Momentum encoder + queue of negatives, no huge batches needed |
| 2020 | BYOL | Bootstrap latent views, no negatives |
| 2020 | SwAV | Online clustering of representations |
| 2021 | DINO | ViTs learn features without labels |
| 2021 | Barlow Twins | Redundancy reduction |
| 2022 | MAE | Masked autoencoding with ViTs |
| 2022 | VICReg | Variance-Invariance-Covariance trade-off |
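Most of the contrastive methods above (CPC, SimCLR, MoCo) optimize a variant of the InfoNCE loss: pull two augmented views of the same image together in embedding space, push other images away. A toy sketch over precomputed embedding vectors (the vectors and temperature here are illustrative, not from any paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of picking the positive among all candidates.

    Small when anchor and positive are similar and the negatives are not.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    denom = sum(math.exp(s) for s in logits)
    return -math.log(math.exp(logits[0]) / denom)

# An aligned positive with an orthogonal negative gives a small loss;
# swapping the two makes the loss large:
low = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

BYOL, Barlow Twins, and VICReg then show that the negatives can be dropped entirely if collapse is prevented by other means (asymmetry, redundancy reduction, variance terms).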
5. Vision Transformers (ViTs)
Transformers, originally for NLP, now achieve top performance in vision.
| Year | Title | Key Idea |
|---|---|---|
| 2020 | ViT (Dosovitskiy et al.) | Patch-based image transformers |
| 2021 | DeiT | Data-efficient ViT with distillation |
| 2021 | Swin Transformer | Hierarchical and shifted windows |
| 2021 | CvT | Hybrid convolution + attention |
| 2021 | Focal Transformer | Efficient long-range modeling |
| 2021 | CrossViT | Multi-scale patch attention |
| 2022 | PVTv2 | Lightweight backbone for dense prediction |
| 2023 | InternImage | Large-scale backbone built on deformable convolutions |
| 2023 | SAM (Meta AI) | Segment anything model |
| 2023 | DINOv2 | Self-supervised ViTs for many tasks |
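ViT's core move, "an image is worth 16x16 words," is simply cutting the image into non-overlapping patches and flattening each into a token for a standard Transformer. A sketch on a nested-list "image" (patch size 2 instead of 16 for readability; the linear embedding and position encodings that follow are omitted):

```python
def patchify(image, patch=2):
    """Split an H x W image (list of rows) into flattened, non-overlapping
    patch vectors in row-major order — the token sequence a ViT then embeds."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return tokens

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
tokens = patchify(img)  # 4 tokens, each a flattened 2x2 patch
```

Swin's contribution is then to restrict attention to local windows over such tokens and shift the windows between layers, restoring a CNN-like hierarchy.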
6. Image Generation (GANs & Diffusion)
Generative models synthesize realistic images from noise or prompts.
GANs
| Year | Title | Key Idea |
|---|---|---|
| 2014 | GANs (Goodfellow) | Minimax adversarial learning |
| 2016 | DCGAN | Deep CNN for generative modeling |
| 2017 | WGAN / WGAN-GP | Better stability with Wasserstein loss |
| 2018 | Progressive GAN | Progressive growing of image resolution |
| 2018 | BigGAN | Large-scale, high-quality class-conditional images |
| 2019 | StyleGAN | Style-based generator for faces |
| 2020 | StyleGAN2 | Fewer artifacts, better realism |
| 2021 | StyleGAN3 | Alias-free generation |
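All of the GANs above descend from Goodfellow et al.'s two-player objective, in which a discriminator D and a generator G play a minimax game over real data x and noise z:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]
```

In practice the generator usually maximizes log D(G(z)) instead (the "non-saturating" loss from the same paper), since the original term gives weak gradients early in training; WGAN's Wasserstein loss is a further fix for that instability.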
Diffusion
| Year | Title | Key Idea |
|---|---|---|
| 2020 | DDPM (Ho et al.) | Denoising diffusion probabilistic models |
| 2021 | Improved DDPM | Learned variances, better log-likelihoods |
| 2022 | Imagen (Google) | Text-to-image with T5 encoder |
| 2022 | Stable Diffusion | Open-source text-to-image |
| 2023 | Consistency Models | Speed up diffusion sampling |
| 2024 | Sora (OpenAI) | Video generation from text |
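The diffusion models above all build on DDPM's forward process: Gaussian noise is added over T steps, and the closed form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε lets any step be sampled directly from the clean image. A sketch of that schedule arithmetic (linear β schedule as in the DDPM paper, with T shortened for illustration):

```python
import math

def alpha_bar_schedule(T=10, beta_start=1e-4, beta_end=0.02):
    """Cumulative products ᾱ_t = Π_{s<=t} (1 - β_s) for a linear β schedule.

    ᾱ_t decays monotonically from near 1 toward 0: early steps barely
    perturb x_0, late steps are nearly pure noise.
    """
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars

def noisy_sample(x0, eps, alpha_bar):
    """x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε, per scalar pixel."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

bars = alpha_bar_schedule()
```

The reverse model is then trained to predict ε from x_t; Consistency Models compress the many reverse steps this implies into one or a few.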
7. Other topics generated by Claude
Neural Network Building Blocks: A Comprehensive Timeline
| Year | Title | Key Idea | Associated Paper |
|---|---|---|---|
| 1943 | McCulloch-Pitts Neuron | First mathematical model of an artificial neuron. | A logical calculus of the ideas immanent in nervous activity (McCulloch & Pitts, 1943) |
| 1957 | Perceptron | Single-layer neural network with a step activation function. | The Perceptron: A Perceiving and Recognizing Automaton (Rosenblatt, 1957) |
| 1960 | Adaline/Widrow-Hoff Rule | Adaptive linear neuron with continuous output and delta rule learning. | Adaptive switching circuits (Widrow & Hoff, 1960) |
| 1969 | Perceptron Limitations | Proof that single-layer perceptrons cannot represent XOR, motivating multi-layer networks. | Perceptrons: An Introduction to Computational Geometry (Minsky & Papert, 1969) |
| 1974 | Backpropagation (Early) | Early formulation of gradient descent for neural networks. | Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (Werbos, 1974) |
| 1980 | Neocognitron | Hierarchical neural network inspired by visual cortex, precursor to CNNs. | Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (Fukushima, 1980) |
| 1982 | Hopfield Networks | Recurrent neural network with associative memory properties. | Neural networks and physical systems with emergent collective computational abilities (Hopfield, 1982) |
| 1986 | Backpropagation | Efficient training of multi-layer neural networks using gradient descent. | Learning representations by back-propagating errors (Rumelhart et al., 1986) |
| 1986 | Weight Decay | L2 regularization to prevent overfitting by penalizing large weights. | Learning representations by back-propagating errors (Rumelhart et al., 1986) |
| 1988 | Radial Basis Functions | Networks using radially symmetric activation functions. | Multivariable functional interpolation and adaptive networks (Broomhead & Lowe, 1988) |
| 1989 | Convolutional Layers | Weight sharing and local connectivity for spatial data processing. | Backpropagation Applied to Handwritten Zip Code Recognition (LeCun et al., 1989) |
| 1990 | Early Stopping | Regularization technique using a validation set to halt training before overfitting. | Generalization and Parameter Estimation in Feedforward Nets: Some Experiments (Morgan & Bourlard, 1990) |
| 1991 | Vanishing Gradient Problem | Identification of gradient decay in deep networks. | Untersuchungen zu dynamischen neuronalen Netzen (Hochreiter, 1991) |
| 1992 | Max Pooling | Downsampling operation taking the maximum value in local regions. | Cresceptron: A Self-Organizing Neural Network Which Grows Adaptively (Weng et al., 1992) |
| 1995 | Support Vector Machines | Alternative approach using kernel methods and margin maximization. | Support-Vector Networks (Cortes & Vapnik, 1995) |
| 1997 | LSTM (Long Short-Term Memory) | Recurrent neural network with gating mechanisms to handle long-term dependencies. | Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) |
| 1998 | LeNet-5 (CNN Architecture) | Complete CNN architecture for handwritten digit recognition. | Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) |
| 1999 | LSTM Forget Gate | Addition of forget gate to LSTM for better long-term memory control. | Learning to Forget: Continual Prediction with LSTM (Gers et al., 1999) |
| 2000 | Tanh Activation | Hyperbolic tangent activation function with zero-centered output. | Efficient BackProp (LeCun et al., 1998) |
| 2001 | Random Forests | Ensemble method combining multiple decision trees. | Random Forests (Breiman, 2001) |
| 2006 | Deep Belief Networks | Layer-wise pretraining for deep neural networks. | A Fast Learning Algorithm for Deep Belief Nets (Hinton et al., 2006) |
| 2006 | Autoencoders | Unsupervised learning for dimensionality reduction and feature learning. | Reducing the Dimensionality of Data with Neural Networks (Hinton & Salakhutdinov, 2006) |
| 2009 | Curriculum Learning | Training strategy presenting examples in order of increasing difficulty. | Curriculum Learning (Bengio et al., 2009) |
| 2010 | ReLU (Rectified Linear Unit) | Simple, non-saturating activation function to mitigate vanishing gradients. | Deep Sparse Rectifier Neural Networks (Glorot et al., 2011) |
| 2011 | Xavier/Glorot Initialization | Weight initialization strategy for stable gradient flow. | Understanding the difficulty of training deep feedforward neural networks (Glorot & Bengio, 2010) |
| 2012 | AlexNet | Deep CNN that won ImageNet, popularizing deep learning. | ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012) |
| 2012 | Dropout | Randomly deactivating neurons during training to prevent overfitting. | Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al., 2012) |
| 2013 | Maxout Networks | Activation function that generalizes ReLU by taking the max of linear functions. | Maxout Networks (Goodfellow et al., 2013) |
| 2013 | DropConnect | Regularization by randomly setting weights to zero instead of activations. | Regularization of Neural Networks using DropConnect (Wan et al., 2013) |
| 2014 | GRU (Gated Recurrent Unit) | Simplified alternative to LSTM with fewer parameters. | Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014) |
| 2014 | Adam Optimizer | Adaptive momentum-based optimization for faster convergence. | Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014) |
| 2014 | Generative Adversarial Networks | Two-network framework for generative modeling. | Generative Adversarial Networks (Goodfellow et al., 2014) |
| 2014 | VGGNet | Very deep CNN with small 3x3 convolution filters. | Very Deep Convolutional Networks for Large-Scale Image Recognition (Simonyan & Zisserman, 2014) |
| 2015 | Batch Normalization | Normalizing layer inputs to stabilize and accelerate training. | Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, 2015) |
| 2015 | Residual Connections (ResNet) | Skip connections to enable training of very deep networks. | Deep Residual Learning for Image Recognition (He et al., 2015) |
| 2015 | He Initialization | Weight initialization for ReLU networks. | Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He et al., 2015) |
| 2015 | U-Net | Encoder-decoder architecture with skip connections for segmentation. | U-Net: Convolutional Networks for Biomedical Image Segmentation (Ronneberger et al., 2015) |
| 2015 | ELU (Exponential Linear Unit) | Activation function with negative saturation to push mean closer to zero. | Fast and Accurate Deep Network Learning by Exponential Linear Units (Clevert et al., 2015) |
| 2015 | Spatial Transformer Networks | Learnable spatial transformations within neural networks. | Spatial Transformer Networks (Jaderberg et al., 2015) |
| 2016 | Leaky ReLU | ReLU variant with small negative slope to address dying neurons. | Rectifier Nonlinearities Improve Neural Network Acoustic Models (Maas et al., 2013) |
| 2016 | DenseNet | Dense connectivity between layers for feature reuse. | Densely Connected Convolutional Networks (Huang et al., 2016) |
| 2016 | Layer Normalization | Normalization across features instead of batch dimension. | Layer Normalization (Ba et al., 2016) |
| 2016 | Squeeze-and-Excitation | Channel attention mechanism for adaptive feature recalibration. | Squeeze-and-Excitation Networks (Hu et al., 2017) |
| 2017 | Depthwise Separable Convolutions | Efficient convolution decomposition reducing parameters. | MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (Howard et al., 2017) |
| 2017 | Self-Attention (Transformer) | Attention mechanism to model long-range dependencies without recurrence. | Attention Is All You Need (Vaswani et al., 2017) |
| 2017 | Multi-Head Attention | Parallel attention mechanisms with different learned projections. | Attention Is All You Need (Vaswani et al., 2017) |
| 2017 | Positional Encoding | Adding position information to transformer inputs. | Attention Is All You Need (Vaswani et al., 2017) |
| 2017 | Feature Pyramid Networks | Multi-scale feature extraction using pyramidal hierarchy. | Feature Pyramid Networks for Object Detection (Lin et al., 2017) |
| 2017 | Dilated/Atrous Convolutions | Convolutions with gaps to increase receptive field without pooling. | Rethinking Atrous Convolution for Semantic Image Segmentation (Chen et al., 2017) |
| 2018 | Group Normalization | Normalization alternative to BatchNorm for small batch sizes. | Group Normalization (Wu & He, 2018) |
| 2018 | GELU (Gaussian Error Linear Unit) | Smooth activation function with probabilistic interpretation. | Gaussian Error Linear Units (GELUs) (Hendrycks & Gimpel, 2016) |
| 2018 | 1x1 Convolutions | Point-wise convolutions for channel mixing and dimensionality control. | Network In Network (Lin et al., 2013) |
| 2018 | BERT (Bidirectional Transformers) | Bidirectional transformer for language understanding. | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) |
| 2019 | Swish/SiLU Activation | Smooth, non-monotonic activation function (x * sigmoid(x)). | Searching for Activation Functions (Ramachandran et al., 2017) |
| 2019 | GPT Architecture | Autoregressive transformer for language generation. | Language Models are Unsupervised Multitask Learners (Radford et al., 2019) |
| 2019 | EfficientNet | Compound scaling of network depth, width, and resolution. | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Tan & Le, 2019) |
| 2020 | Vision Transformer (ViT) | Applying pure transformers to image recognition tasks. | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020) |
| 2020 | DETR (Detection Transformer) | End-to-end object detection with transformers. | End-to-End Object Detection with Transformers (Carion et al., 2020) |
| 2021 | Mish Activation | Self-regularized non-monotonic activation function. | Mish: A Self Regularized Non-Monotonic Neural Activation Function (Misra, 2019) |
| 2021 | Swin Transformer | Hierarchical vision transformer with shifted windows. | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021) |
| 2022 | ConvNeXt | Modernized CNN architecture competitive with transformers. | A ConvNet for the 2020s (Liu et al., 2022) |
| 2023 | Rotary Position Embedding | Relative position encoding for better length extrapolation. | RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021) |
| 2023 | Flash Attention | Memory-efficient attention computation for long sequences. | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022) |
| 2024 | Mamba (State Space Models) | Efficient alternative to transformers using state space models. | Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) |
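Since self-attention underpins most entries from 2017 onward, here is a minimal pure-Python sketch of scaled dot-product attention for a single head (toy 2-d vectors; real implementations batch this as matrix multiplies, and FlashAttention reorganizes the same computation for memory efficiency):

```python
import math

def attention(queries, keys, values):
    """softmax(QK^T / sqrt(d)) V for one head, over lists of vectors.

    Each query attends to every key; dividing by sqrt(d) keeps dot products
    from saturating the softmax as the dimensionality d grows.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                          # subtract max for a
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# A query matching the first key almost exclusively copies the first value:
res = attention([[10.0, 0.0]],
                [[10.0, 0.0], [0.0, 10.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Multi-head attention runs several such maps in parallel with different learned projections and concatenates the results.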