Perfect! Here’s a detailed and thematically grouped reading roadmap with 10+ key papers per topic to build your computer vision knowledge progressively. Topics include:

  1. Image Classification
  2. Object Detection
  3. Image Segmentation
  4. Self-Supervised & Representation Learning
  5. Vision Transformers
  6. Image Generation (GANs & Diffusion)

📦 1. Image Classification

These papers help you understand how CNNs evolved and how deep models classify entire images.

| Year | Title | Key Idea |
|------|-------|----------|
| 1998 | LeNet (LeCun et al.) | Pioneering CNN for digit recognition |
| 2012 | AlexNet (Krizhevsky et al.) | Deep CNN + ReLU + GPU training |
| 2014 | VGGNet (Simonyan & Zisserman) | Deep networks built from 3x3 filters |
| 2014 | GoogLeNet (Szegedy et al.) | Inception modules for efficiency |
| 2015 | ResNet (He et al.) | Residual connections enable depth |
| 2016 | Wide ResNet | Trade-off between depth and width |
| 2017 | DenseNet | Layer connections via dense blocks |
| 2018 | SENet | Channel-wise attention mechanism |
| 2019 | EfficientNet | Compound model scaling |
| 2022 | ConvNeXt | CNN re-architected with Transformer tricks |
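The residual connection at the heart of ResNet can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the linear maps `W1`/`W2` stand in for convolution + batch norm, and the dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection gives gradients a direct path."""
    return x + W2 @ relu(W1 @ x)

d = 4
x = rng.normal(size=d)
# With near-zero weights the block starts out close to the identity,
# which is the property that makes very deep stacks trainable.
W1 = 0.01 * rng.normal(size=(d, d))
W2 = 0.01 * rng.normal(size=(d, d))
y = residual_block(x, W1, W2)
print(np.max(np.abs(y - x)))  # tiny residual on top of the identity
```

Stacking many such blocks keeps the loss surface well-behaved because each block only has to learn a small correction F(x), not the whole mapping.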

🎯 2. Object Detection

These papers trace the evolution from sliding window methods to anchor-free detectors.

| Year | Title | Key Idea |
|------|-------|----------|
| 2001 | Viola-Jones | Haar features + AdaBoost |
| 2014 | R-CNN (Girshick et al.) | CNN on region proposals |
| 2015 | Fast R-CNN | RoI pooling, end-to-end training |
| 2015 | Faster R-CNN | Region Proposal Network |
| 2016 | YOLOv1 | Single-stage, real-time detection |
| 2017–18 | YOLOv2 / v3 | Better accuracy, multi-scale predictions |
| 2017 | RetinaNet | Focal loss for class imbalance |
| 2018 | Cascade R-CNN | Multi-stage refinement |
| 2019 | CenterNet | Keypoint-based, anchor-free |
| 2020 | YOLOv4 / v5 | Industry-friendly improvements |
| 2020 | DETR | Transformer-based, end-to-end detection |
| 2022 | YOLOv7 | State-of-the-art real-time detector |
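Nearly every detector above relies on intersection-over-union (IoU) and non-maximum suppression (NMS) to deduplicate overlapping predictions. A minimal NumPy sketch (the box format and the 0.5 threshold are common but illustrative choices):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep boxes in score order, dropping any box that
    overlaps an already-kept box by more than `thresh`."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(int(i))
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the two overlapping boxes collapse to one
```

DETR's appeal is precisely that its set-based loss makes this hand-written NMS post-processing unnecessary.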

🧩 3. Image Segmentation

Focus shifts to per-pixel classification: semantic and instance segmentation.

| Year | Title | Key Idea |
|------|-------|----------|
| 2015 | FCN (Long et al.) | First end-to-end fully convolutional segmentation |
| 2015 | U-Net (Ronneberger et al.) | Encoder-decoder for biomedical images |
| 2015 | DeepLab | Atrous convolution for context |
| 2017 | PSPNet | Pyramid scene parsing |
| 2017 | Mask R-CNN | Instance segmentation with RoIAlign |
| 2018 | DeepLabv3+ | Encoder-decoder, better boundary handling |
| 2019 | HRNet | High-resolution feature learning |
| 2019 | Panoptic FPN | Unified segmentation (semantic + instance) |
| 2020 | PointRend | Point-based refinement for boundaries |
| 2021 | SegFormer | Lightweight ViT-based segmentation |
| 2022 | Mask2Former | Unified model for all segmentation tasks |
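Semantic segmentation is usually scored with mean IoU over classes. A small NumPy sketch of the metric (the tiny label maps are made up for illustration):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes, the standard
    semantic-segmentation metric; pred/target are integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred   = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
target = np.array([[0, 0, 0, 1],
                   [0, 0, 1, 1]])
print(mean_iou(pred, target, num_classes=2))  # → 0.775
```

Averaging per class (rather than per pixel) is what keeps small or rare classes from being drowned out by large background regions.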

🧠 4. Self-Supervised Learning

Self-supervised methods learn representations without labels, crucial for pretraining.

| Year | Title | Key Idea |
|------|-------|----------|
| 2006 | Autoencoders (Hinton & Salakhutdinov) | Unsupervised feature learning |
| 2018 | CPC (van den Oord et al.) | Contrastive predictive coding |
| 2019–21 | MoCo (v1–v3) | Momentum encoder with a queue of negatives |
| 2020 | SimCLR | Contrastive learning with strong augmentations |
| 2020 | BYOL | Bootstrap latent views, no negatives |
| 2020 | SwAV | Online clustering of representations |
| 2021 | DINO | ViTs learn features without labels (self-distillation) |
| 2021 | Barlow Twins | Redundancy reduction |
| 2022 | MAE | Masked autoencoding with ViTs |
| 2022 | VICReg | Variance-invariance-covariance regularization |
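The contrastive objective behind SimCLR and MoCo (the InfoNCE loss) pulls embeddings of two views of the same image together while pushing all other pairs apart. A simplified single-batch NumPy sketch; the temperature `tau` and the toy embeddings are illustrative, and the real losses symmetrize over both views.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Simplified SimCLR-style contrastive loss: row i of z1 and row i
    of z2 are two views of the same image (positives); all other rows
    of z2 act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                       # 8 "images", 16-dim embeddings
views = z + 0.01 * rng.normal(size=(8, 16))        # lightly augmented second views
aligned = info_nce(z, views)
mismatched = info_nce(z, np.roll(views, 1, axis=0))  # every positive is wrong
print(aligned < mismatched)  # True: matched views score much lower loss
```

BYOL, DINO, Barlow Twins, and VICReg all drop the explicit negatives in this loss and prevent collapse by other means (predictors, stop-gradients, or redundancy/variance terms).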

πŸ” 5. Vision Transformers (ViTs)

Transformers, originally for NLP, now achieve top performance in vision.

| Year | Title | Key Idea |
|------|-------|----------|
| 2020 | ViT (Dosovitskiy et al.) | Patch-based image transformers |
| 2021 | DeiT | Data-efficient ViT training via distillation |
| 2021 | Swin Transformer | Hierarchical, shifted-window attention |
| 2021 | CvT | Hybrid convolution + attention |
| 2021 | Focal Transformer | Efficient long-range modeling |
| 2021 | CrossViT | Multi-scale patch attention |
| 2022 | PVTv2 | Lightweight backbone for dense prediction |
| 2023 | InternImage | Unified visual backbone |
| 2023 | SAM (Meta AI) | Promptable "segment anything" model |
| 2023 | DINOv2 | Self-supervised ViT features for many tasks |
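ViT's "an image is worth 16x16 words" idea starts by cutting the image into non-overlapping patches and flattening each into a token. A NumPy sketch of just that step (the learned linear projection and position embeddings would come next):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an image (H, W, C) into flattened non-overlapping patches,
    the first step of ViT: each patch becomes one input token."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # → (196, 768): 14x14 patches of 16x16x3 values each
```

From here on, the image is just a sequence of 196 tokens, so the standard Transformer encoder applies unchanged.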

🎨 6. Image Generation (GANs & Diffusion)

Generative models synthesize realistic images from noise or prompts.

📌 GANs

| Year | Title | Key Idea |
|------|-------|----------|
| 2014 | GANs (Goodfellow et al.) | Minimax adversarial learning |
| 2016 | DCGAN | Deep CNN for generative modeling |
| 2017 | WGAN / WGAN-GP | Better stability with Wasserstein loss |
| 2018 | Progressive GAN | Progressively grown image resolution |
| 2018 | BigGAN | High-quality class-conditional images at scale |
| 2019 | StyleGAN | Style-based generator for faces |
| 2020 | StyleGAN2 | Fewer artifacts, better realism |
| 2021 | StyleGAN3 | Alias-free generation |
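The minimax game from the original GAN paper reduces to two binary cross-entropy objectives. A NumPy sketch with made-up discriminator scores (no networks are trained here; it only illustrates the two losses):

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for a batch of probabilities p and labels y."""
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# Discriminator scores: estimated probability that a sample is real.
d_real = np.array([0.9, 0.8, 0.95])   # scores on real images
d_fake = np.array([0.1, 0.2, 0.05])   # scores on generated images

# Discriminator objective: label real as 1, fake as 0.
d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))
# Generator objective (non-saturating form): make fakes look real.
g_loss = bce(d_fake, np.ones(3))
print(d_loss, g_loss)  # a confident discriminator means low d_loss, high g_loss
```

WGAN's contribution was to replace these cross-entropy terms with a Wasserstein-distance estimate, which gives smoother gradients when the discriminator gets too good.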

📌 Diffusion

| Year | Title | Key Idea |
|------|-------|----------|
| 2020 | DDPM (Ho et al.) | Denoising diffusion probabilistic models |
| 2021 | Improved DDPM | Learned variances, better log-likelihoods |
| 2022 | Imagen (Google) | Text-to-image with a frozen T5 encoder |
| 2022 | Stable Diffusion | Open-source latent-space text-to-image |
| 2023 | Consistency Models | One/few-step sampling to speed up diffusion |
| 2024 | Sora (OpenAI) | Video generation from text |
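DDPM's forward (noising) process has a closed form, which is what makes training tractable: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ᾱ_t the cumulative product of (1 − β). A NumPy sketch using the paper's linear β schedule (the 8x8 "image" is a stand-in):

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # the DDPM paper's linear schedule
x0 = rng.normal(size=(8, 8))            # stand-in for an image
x_early = ddpm_forward(x0, t=10, betas=betas, rng=rng)
x_late = ddpm_forward(x0, t=999, betas=betas, rng=rng)
c_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
c_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
print(c_early > c_late)  # early steps barely perturb x0; the last step is near pure noise
```

The trained model learns the reverse of this process, denoising step by step; consistency models collapse those many reverse steps into one or a few.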

Would you like:

  • A downloadable version (PDF/Quarto/Notion)?
  • Links to open-source codebases or implementations?
  • Slide-style summaries for each?

Let me know your preferred format and next topic (e.g., medical CV, multimodal, 3D vision, etc.).


7. Other topics generated by Claude

Neural Network Building Blocks: A Comprehensive Timeline

| Year | Title | Key Idea | Associated Paper |
|------|-------|----------|------------------|
| 1943 | McCulloch-Pitts Neuron | First mathematical model of an artificial neuron | A Logical Calculus of the Ideas Immanent in Nervous Activity (McCulloch & Pitts, 1943) |
| 1957 | Perceptron | Single-layer neural network with a step activation function | The Perceptron: A Perceiving and Recognizing Automaton (Rosenblatt, 1957) |
| 1960 | Adaline / Widrow-Hoff Rule | Adaptive linear neuron with continuous output and delta-rule learning | Adaptive Switching Circuits (Widrow & Hoff, 1960) |
| 1969 | Sigmoid Activation | S-shaped activation function enabling gradient-based learning | Perceptrons: An Introduction to Computational Geometry (Minsky & Papert, 1969) |
| 1974 | Backpropagation (early) | Early formulation of gradient descent for neural networks | Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences (Werbos, 1974) |
| 1980 | Neocognitron | Hierarchical neural network inspired by the visual cortex, precursor to CNNs | Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position (Fukushima, 1980) |
| 1982 | Hopfield Networks | Recurrent neural network with associative memory properties | Neural Networks and Physical Systems with Emergent Collective Computational Abilities (Hopfield, 1982) |
| 1986 | Backpropagation | Efficient training of multi-layer networks via gradient descent | Learning Representations by Back-Propagating Errors (Rumelhart et al., 1986) |
| 1986 | Weight Decay | L2 regularization penalizing large weights to prevent overfitting | Learning Representations by Back-Propagating Errors (Rumelhart et al., 1986) |
| 1988 | Radial Basis Functions | Networks using radially symmetric activation functions | Multivariable Functional Interpolation and Adaptive Networks (Broomhead & Lowe, 1988) |
| 1989 | Convolutional Layers | Weight sharing and local connectivity for spatial data | Backpropagation Applied to Handwritten Zip Code Recognition (LeCun et al., 1989) |
| 1991 | Vanishing Gradient Problem | Identification of gradient decay in deep networks | Untersuchungen zu dynamischen neuronalen Netzen (Hochreiter, 1991) |
| 1992 | Max Pooling | Downsampling taking the maximum value in local regions | Cresceptron: A Self-Organizing Neural Network Which Grows Adaptively (Weng et al., 1992) |
| 1993 | Early Stopping | Halting training on validation error to prevent overfitting | Improving Model Selection by Nonconvergent Methods (Finnoff et al., 1993) |
| 1995 | Support Vector Machines | Alternative approach using kernel methods and margin maximization | Support-Vector Networks (Cortes & Vapnik, 1995) |
| 1997 | LSTM (Long Short-Term Memory) | Recurrent network with gating to handle long-term dependencies | Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) |
| 1998 | LeNet-5 (CNN Architecture) | Complete CNN architecture for handwritten digit recognition | Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) |
| 1998 | Tanh Activation | Hyperbolic tangent activation with zero-centered output | Efficient BackProp (LeCun et al., 1998) |
| 1999 | LSTM Forget Gate | Forget gate added to LSTM for better long-term memory control | Learning to Forget: Continual Prediction with LSTM (Gers et al., 1999) |
| 2001 | Random Forests | Ensemble method combining multiple decision trees | Random Forests (Breiman, 2001) |
| 2006 | Deep Belief Networks | Layer-wise pretraining for deep neural networks | A Fast Learning Algorithm for Deep Belief Nets (Hinton et al., 2006) |
| 2006 | Autoencoders | Unsupervised dimensionality reduction and feature learning | Reducing the Dimensionality of Data with Neural Networks (Hinton & Salakhutdinov, 2006) |
| 2009 | Curriculum Learning | Presenting training examples in order of increasing difficulty | Curriculum Learning (Bengio et al., 2009) |
| 2010 | ReLU (Rectified Linear Unit) | Simple, non-saturating activation mitigating vanishing gradients | Rectified Linear Units Improve Restricted Boltzmann Machines (Nair & Hinton, 2010) |
| 2010 | Xavier/Glorot Initialization | Weight initialization strategy for stable gradient flow | Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio, 2010) |
| 2012 | AlexNet | Deep CNN that won ImageNet, popularizing deep learning | ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012) |
| 2012 | Dropout | Randomly deactivating neurons during training to prevent overfitting | Improving Neural Networks by Preventing Co-adaptation of Feature Detectors (Hinton et al., 2012) |
| 2013 | Maxout Networks | Generalizes ReLU by taking the max of several linear functions | Maxout Networks (Goodfellow et al., 2013) |
| 2013 | DropConnect | Randomly zeroing weights instead of activations | Regularization of Neural Networks using DropConnect (Wan et al., 2013) |
| 2013 | Leaky ReLU | ReLU variant with a small negative slope to avoid dying neurons | Rectifier Nonlinearities Improve Neural Network Acoustic Models (Maas et al., 2013) |
| 2013 | 1x1 Convolutions | Point-wise convolutions for channel mixing and dimensionality control | Network In Network (Lin et al., 2013) |
| 2014 | GRU (Gated Recurrent Unit) | Simplified alternative to LSTM with fewer parameters | Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014) |
| 2014 | Adam Optimizer | Adaptive momentum-based optimization for faster convergence | Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014) |
| 2014 | Generative Adversarial Networks | Two-network framework for generative modeling | Generative Adversarial Networks (Goodfellow et al., 2014) |
| 2014 | VGGNet | Very deep CNN with small 3x3 convolution filters | Very Deep Convolutional Networks for Large-Scale Image Recognition (Simonyan & Zisserman, 2014) |
| 2015 | Batch Normalization | Normalizing layer inputs to stabilize and accelerate training | Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, 2015) |
| 2015 | Residual Connections (ResNet) | Skip connections enabling very deep networks | Deep Residual Learning for Image Recognition (He et al., 2015) |
| 2015 | He Initialization | Weight initialization for ReLU networks | Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He et al., 2015) |
| 2015 | U-Net | Encoder-decoder with skip connections for segmentation | U-Net: Convolutional Networks for Biomedical Image Segmentation (Ronneberger et al., 2015) |
| 2015 | ELU (Exponential Linear Unit) | Negative saturation pushing mean activations closer to zero | Fast and Accurate Deep Network Learning by Exponential Linear Units (Clevert et al., 2015) |
| 2015 | Spatial Transformer Networks | Learnable spatial transformations within networks | Spatial Transformer Networks (Jaderberg et al., 2015) |
| 2016 | DenseNet | Dense connectivity between layers for feature reuse | Densely Connected Convolutional Networks (Huang et al., 2016) |
| 2016 | Layer Normalization | Normalization across features instead of the batch dimension | Layer Normalization (Ba et al., 2016) |
| 2016 | GELU (Gaussian Error Linear Unit) | Smooth activation with a probabilistic interpretation | Gaussian Error Linear Units (GELUs) (Hendrycks & Gimpel, 2016) |
| 2017 | Squeeze-and-Excitation | Channel attention for adaptive feature recalibration | Squeeze-and-Excitation Networks (Hu et al., 2017) |
| 2017 | Depthwise Separable Convolutions | Efficient convolution decomposition reducing parameters | MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (Howard et al., 2017) |
| 2017 | Self-Attention (Transformer) | Long-range dependency modeling without recurrence | Attention Is All You Need (Vaswani et al., 2017) |
| 2017 | Multi-Head Attention | Parallel attention heads with different learned projections | Attention Is All You Need (Vaswani et al., 2017) |
| 2017 | Positional Encoding | Injecting position information into transformer inputs | Attention Is All You Need (Vaswani et al., 2017) |
| 2017 | Feature Pyramid Networks | Multi-scale feature extraction via a pyramidal hierarchy | Feature Pyramid Networks for Object Detection (Lin et al., 2017) |
| 2017 | Dilated/Atrous Convolutions | Convolutions with gaps to enlarge the receptive field without pooling | Rethinking Atrous Convolution for Semantic Image Segmentation (Chen et al., 2017) |
| 2017 | Swish/SiLU Activation | Smooth, non-monotonic activation (x * sigmoid(x)) | Searching for Activation Functions (Ramachandran et al., 2017) |
| 2018 | Group Normalization | BatchNorm alternative for small batch sizes | Group Normalization (Wu & He, 2018) |
| 2018 | BERT (Bidirectional Transformers) | Bidirectional transformer for language understanding | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) |
| 2019 | GPT-2 Architecture | Autoregressive transformer for language generation | Language Models are Unsupervised Multitask Learners (Radford et al., 2019) |
| 2019 | EfficientNet | Compound scaling of depth, width, and resolution | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Tan & Le, 2019) |
| 2019 | Mish Activation | Self-regularized non-monotonic activation function | Mish: A Self Regularized Non-Monotonic Neural Activation Function (Misra, 2019) |
| 2020 | Vision Transformer (ViT) | Pure transformers applied to image recognition | An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020) |
| 2020 | DETR (Detection Transformer) | End-to-end object detection with transformers | End-to-End Object Detection with Transformers (Carion et al., 2020) |
| 2021 | Swin Transformer | Hierarchical vision transformer with shifted windows | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021) |
| 2021 | Rotary Position Embedding | Relative position encoding with better length extrapolation | RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021) |
| 2022 | ConvNeXt | Modernized CNN architecture competitive with transformers | A ConvNet for the 2020s (Liu et al., 2022) |
| 2022 | FlashAttention | Memory-efficient, IO-aware exact attention for long sequences | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022) |
| 2023 | Mamba (State Space Models) | Efficient transformer alternative using selective state spaces | Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) |
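Several building blocks above (self-attention, multi-head attention, positional encoding) reduce to one core formula, softmax(QKᵀ / sqrt(d_k))·V. A single-head NumPy sketch with random, untrained weights (dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head:
    softmax(Q K^T / sqrt(d_k)) V, as in "Attention Is All You Need"."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) token-to-token weights
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8                                    # 5 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # → (5, 8): one updated embedding per token
```

Multi-head attention simply runs several such heads with different projections and concatenates their outputs; the 1/sqrt(d_k) scaling keeps the logits from saturating the softmax as d_k grows.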