Seeing Through Layers: The Ultimate Guide to Convolutional Neural Networks

Tavish Aggarwal

April 18, 2025

Convolutional Neural Networks (CNNs) are specialized neural networks that have revolutionized computer vision and image processing. While they share similarities with traditional Artificial Neural Networks (see the post In-depth understanding of Neural Networks and its working for an introduction to ANNs), CNNs have unique architectural elements that make them particularly effective for visual data processing.

Challenges in Image Processing

Let's consider an object detection task to identify the vehicles on a road. The human brain performs this task effortlessly because it is well trained for it, but machines struggle with such a seemingly basic task for the following reasons:

  1. Viewpoint Variation: Objects appear differently depending on the angle of view.
  2. Scale Variation: Objects can appear larger or smaller depending on distance.
  3. Illumination Conditions: Lighting changes can dramatically alter an image's appearance.
  4. Background Clutter: Distinguishing objects from complex backgrounds.
  5. Occlusion: Objects may be partially hidden.
  6. Deformation: Many objects can deform in various ways.
  7. Intra-class Variation: Objects within the same category can look very different.

A task that is very easy for humans is therefore hard for machines, which is why we need to train our algorithms on a large number of images.

ILSVRC challenges

ImageNet is an open-source image dataset covering a wide variety of classes such as vehicles, everyday products, and much more. Built on it is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition with image classification as its core objective. In the ILSVRC challenges, CNNs have shown impressive results.

Reference: https://image-net.org/challenges/LSVRC/

Introduction to CNNs

CNNs are specialized neural networks primarily designed for processing visual data, though they can also be effectively applied to text, audio, and video data. Originally inspired by the "animal visual cortex", CNNs have become the de-facto standard in deep learning approaches to computer vision.

Unlike traditional neural networks, CNNs are specifically designed to handle the "spatial hierarchy of features" (discussed later) in images through specialized layers that detect local conjunctions of features from the previous layer.

CNNs can be used in a large variety of applications apart from image classification. These are:

  • Object localization: Identifying the location of objects (as a rectangular bounding box) and classifying them.
  • Semantic segmentation: Identifying the exact shapes of the objects (pixel by pixel) and classifying them.
  • Optical Character Recognition (OCR): Recognises characters in an image.

The three main building blocks of the CNN architecture are convolutions, pooling, and feature maps, and we will cover each of them in this post.

Before that, let's understand the input to the CNN network.

Input to CNNs

Like any neural network, the inputs to a CNN must be numeric. Here, the inputs to the network are images, which are made up of pixels.

  • Pixels are nothing but intensity, which ranges from 0 to 255, where a white pixel is represented as 255 and black as 0.
  • The number of pixels is 'width x height' and is independent of the depth (number of channels).
  • Each pixel in a color image is an array representing the intensities of red, blue, and green. The red, blue, and green layers are called channels.
  • In a grayscale image (a 'black and white' image), only one number is required to represent the intensity of white. Thus, grayscale images have only one channel. The sketch below illustrates these shapes.
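To make these shapes concrete, here is a minimal NumPy sketch that builds a dummy color image and a dummy grayscale image; the 224 x 224 size is just an assumed example, not a requirement.

    import numpy as np

    # A color image: height x width x 3 channels (R, G, B), intensities 0-255
    color_image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

    # A grayscale image: height x width, a single channel
    gray_image = np.random.randint(0, 256, size=(224, 224), dtype=np.uint8)

    print(color_image.shape)  # (224, 224, 3)
    print(gray_image.shape)   # (224, 224)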

Processing Videos with CNNs

Similar to an image, we can also input video (which is nothing but a collection of images where each image is called a frame) to CNN networks. Let's understand how video processing works.

Consider a video classification task with a 1-minute video, and suppose we sample it at 4 frames per second (FPS). That gives us 240 frames to process.

  1. We input each of these frames to the CNN network, and we get a feature vector of each image.
  2. We will have 240 feature vectors.
  3. These 240 feature vectors representing the video are fed sequentially to an RNN (or processed with a 3D CNN), and the classification task is performed, as sketched below.
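One common way to wire this up in Keras is to wrap a small per-frame CNN in a TimeDistributed layer and feed the resulting sequence of feature vectors to an LSTM. The sketch below is only illustrative; the 64 x 64 frame size, the tiny CNN, and the 10 output classes are assumptions, not values from any specific architecture.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_FRAMES, NUM_CLASSES = 240, 10  # 1 minute at 4 FPS; class count is a placeholder

    # Per-frame feature extractor (any CNN backbone could be used here)
    frame_cnn = models.Sequential([
        layers.Conv2D(16, 3, activation='relu', input_shape=(64, 64, 3)),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation='relu'),
        layers.GlobalAveragePooling2D(),   # one feature vector per frame
    ])

    # Apply the CNN to every frame, then aggregate the sequence with an RNN
    video_model = models.Sequential([
        layers.TimeDistributed(frame_cnn, input_shape=(NUM_FRAMES, 64, 64, 3)),
        layers.LSTM(64),
        layers.Dense(NUM_CLASSES, activation='softmax'),
    ])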

In this way, we can process videos using a combination of CNN and RNN networks. Having a basic understanding of how image and video processing happens in CNN networks, let's move forward and discuss the techniques used in image processing.

Understanding CNN Networks Using VGGNet

With this knowledge and background in place, let's move ahead and explore CNN networks in detail. We have seen that many architectures have been proposed in the ILSVRC challenges.

VGGNet is one of the most straightforward CNN architectures, making it ideal for understanding the fundamental concepts. The most popular variants are VGG-16 and VGG-19, with 16 and 19 layers, respectively.

The basic building blocks of VGGNet include:

  1. Convolutional Layers: Extract features from the input image.
  2. Pooling Layers: Reduce spatial dimensions and create spatial invariance.
  3. Fully Connected Layers: Perform classification based on the extracted features.

Let's see how VGGNet is implemented and understand the implementation of the above concepts in VGGNet architecture.

VGGNet Architecture

The VGGNet was specially designed for the ImageNet challenge, which is a classification task with 1000 categories. Thus, the softmax layer at the end has 1000 categories. In the diagram shown above, the legends show convolutional layers, pooling layers, and fully connected layers.

The most important point to notice is that the network acts as a feature extractor for images. For example, the CNN above extracts a 4096-dimensional feature vector representing each input image. In this case, the feature vector is fed to a softmax layer for classification, but we can use the feature vector to do other tasks as well (such as video analysis, object detection, image segmentation, etc.).

Let's further understand the building blocks of convolutional layers, pooling, and feature maps.

    Convolutional Layers

    Convolution is one of the main building blocks of a CNN. In image processing, convolution is used to detect features (such as vertical or horizontal edges), sharpen or blur the image, etc. The input data is convolved using a filter or kernel, which produces the feature maps.

    Mathematically, the convolution operation is the summation of the element-wise product of two matrices.

    Let’s understand this via an example and take two matrices, X and Y. If we convolve the image X using the filter Y, this operation will produce the matrix Z. 

    $$[\begin{matrix} 1 & 2 & 3\\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{matrix}] \text{ * } [\begin{matrix} 1 & 0 & -1\\ 1 & 0 & -1 \\ 1 & 0 & -1\end{matrix}] \text{ =  } [\begin{matrix} 1 * 1 & 2 * 0 & 3 * -1\\ 4 * 1 & 5 * 0 & 6 * -1 \\ 7 * 1 & 8 * 0 & 9 * -1 \end{matrix}] \text{ =  } [\begin{matrix} 1 & 0 & -3 \\ 4 & 0 & -6 \\ 7 & 0 & -9 \end{matrix}]$$

    Finally, we compute the sum of all the elements in Z to get a scalar number. 

    $$1 + 0 -3 + 4 + 0 -6+7 +0 -9 = -6$$
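    The same calculation can be reproduced in a couple of lines of NumPy as a quick sanity check:

    import numpy as np

    X = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
    Y = np.array([[1, 0, -1],
                  [1, 0, -1],
                  [1, 0, -1]])

    Z = X * Y          # element-wise product
    print(Z.sum())     # -6, matching the result above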

    The above example shows the convolution. By using appropriate filters, we can detect edges in the image.

    Edge detection

    In the example above, the 3 x 3 filter moves across the 6 x 6 image, computing a dot product for each region.

    • The filter is placed over the first 3 x 3 section of the image.
    • Each pixel in the filter is multiplied with the corresponding pixel in the image.
    • The results are summed up to get a single value.
    • This value is placed in the corresponding position of the output matrix.

    The filter moves across the image, typically shifting one pixel at a time, either horizontally or vertically. The process continues until every possible 3 x 3 section of the 6 x 6 image has been covered.

    Different filters can detect different features in an image. For example, the filter shown above is designed to detect vertical edges. In practice, CNNs learn these filters automatically during training rather than using predefined ones.
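    As a rough illustration of this sliding operation, the hand-rolled NumPy sketch below convolves a 6 x 6 image containing a vertical edge (bright left half, dark right half) with the vertical-edge filter from above. The pixel values are made up for the example, and real frameworks implement this far more efficiently.

    import numpy as np

    image = np.array([[10, 10, 10, 0, 0, 0]] * 6)   # 6 x 6, vertical edge in the middle
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]])                  # vertical edge detector

    output = np.zeros((4, 4))
    for i in range(4):                               # shift one pixel at a time (stride 1)
        for j in range(4):
            patch = image[i:i + 3, j:j + 3]
            output[i, j] = np.sum(patch * kernel)

    print(output)   # large values in the middle columns mark the detected edge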

    Also, you might have noticed that convolving the image reduces the output dimensions.

    NOTE: The VGGNet architecture uses filters of the same size (3 x 3) throughout the network, and VGG-16 has 16 weight layers.

    One question is yet to be answered: Can any filter convolve any image? For this, we need to understand the concept of stride and padding.

    Stride and Padding

    Two important concepts in convolution operations are stride and padding.

    Stride refers to the number of pixels by which the filter moves at each step. A stride of 1 means the filter moves one pixel at a time (as in the example of the 3 x 3 filter moving across the 6 x 6 image), while a stride of 2 means it moves two pixels at a time.

    For example, with a stride of 2, the filter would process only every other pixel position, resulting in a smaller output feature map.

    $$[\begin{matrix} 3 & 0 & 1 & 2 & 7 & 4\\ 1 & 5 & 8 & 9 & 3 & 1 \\ 2 & 7 & 2 & 5  & 1  & 3 \\ 0 & 1 & 3 & 1 & 7 & 8 \\ 4 & 2 & 1 & 6 & 2 & 8 \\ 2 & 4 & 5 & 2 & 3 & 9 \end{matrix}] \text{ * } [\begin{matrix} 1 & 0 & -1\\ 1 & 0 & -1 \\ 1 & 0 & -1\end{matrix}] \text{ = } [\begin{matrix} -5 & 0 \\ 0 & -4 \end{matrix}]$$

    If we don't need finer-level details from an image, we can use a larger stride. However, we cannot convolve an image with an arbitrary stride, as the filter may not fit the input patch.

    For example, we cannot convolve a (4, 4) image with a (3, 3) filter using a stride of 2. Similarly, we also cannot convolve a (5, 5) image with a (2, 2) filter and a stride of 2. Why don't you try it and convince yourself? 

    To resolve the issue of stride with any dimension of an image, we use a concept called padding. Padding involves adding extra pixels around the border of the input image. This serves two purposes:

    • It allows the filter to be applied to border pixels.
    • It helps maintain the spatial dimensions of the output.

    The padded pixels can be filled with either zeros (zero padding) or neighboring pixel values. We can also do the reverse and shrink the image, but this is not commonly used in the industry.

    We saw earlier that convolving an image decreases the output size. If we don't want this to happen, we can add appropriate padding to the image so that the output has the same dimensions as the input.

    Padding is therefore important in large CNNs with many layers, where the output would otherwise shrink rapidly and border information would be lost.

    We can generalize the output image given an input image (having n x n pixels), filter (k x k), padding (P), and stride (S) as:

    $$({{n + 2P - k}\over S} + 1), ({{n + 2P - k}\over S} + 1)$$
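    This formula translates directly into a small helper function. The sketch below follows the convention used in this post, where a non-integer result indicates that the filter does not fit the image for that stride and padding, as in the (4, 4) example above.

    def conv_output_size(n, k, padding=0, stride=1):
        """(n + 2P - k) / S + 1 for an n x n input, k x k filter, padding P, stride S."""
        return (n + 2 * padding - k) / stride + 1

    print(conv_output_size(6, 3))               # 4.0: a 6 x 6 image shrinks to 4 x 4
    print(conv_output_size(6, 3, padding=1))    # 6.0: padding of 1 keeps the output size unchanged
    print(conv_output_size(4, 3, stride=2))     # 1.5: not an integer, so the filter does not fit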

    We have applied convolution only on 2D arrays, but most images are colored and thus have multiple channels (e.g., RGB). Let's see how we can convolve colored images.

    Convolution on colored Images

    Most real-world industry problems involve colored images. For images with multiple channels (like RGB), we use 3D filters. Each channel of the filter convolves with the corresponding channel of the input image, and the results are summed to produce a single output feature map.

    For example, a 3×3×3 filter applied to an RGB image would involve:

    1. Convolving the R channel of the image with the R channel of the filter.
    2. Convolving the G channel of the image with the G channel of the filter.
    3. Convolving the B channel of the image with the B channel of the filter.
    4. Summing these three results to get a single value in the output feature map.

    Convolution on colored images

    As we see in the diagram, we are using a 3D filter where each channel of the filter convolves the corresponding channel of the image.
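    The four steps above collapse into a single element-wise product over the full 3-D volume followed by a sum, as the small NumPy sketch below shows for one patch (the random values are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    patch = rng.integers(0, 256, size=(3, 3, 3))    # one 3 x 3 patch of an RGB image
    kernel = rng.standard_normal((3, 3, 3))         # a 3 x 3 x 3 filter

    # Convolve each channel separately, then sum the three results
    value = sum(np.sum(patch[:, :, c] * kernel[:, :, c]) for c in range(3))

    # Equivalent: element-wise product over the whole 3-D volume, then one sum
    assert np.isclose(value, np.sum(patch * kernel))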

    The filters are learnt during training (i.e., during backpropagation). Hence, the individual values of the filters are often called the weights of a CNN.

    Along with the filter weights, there are biases to be trained as well. By convention, all the units of a feature map share the same bias value; these are called tied biases. The alternative is untied biases, where each element of the bias vector is different.

    Patches

    The filter (also called a kernel or feature detector) looks at only one chunk, or patch, of the image at a time and then moves on to the next patch. Filters process small pieces of the image rather than the whole image at once because we want them to detect local features (edges, etc.).

    Feature Map

    We have seen so far that filters are learned during the training of the CNN network. The outputs of the convolution and activation functions are called feature maps. Let's deepen this understanding and see how multiple filters are used to detect various features in images.

    A neuron is a filter whose weights are learnt during training. For example, a (3, 3, 3) filter (or neuron) has 27 weights. Each neuron looks at a particular region in the input (i.e., its 'receptive field'. More about this later in the post).

    A feature map is a collection of multiple neurons, each of which looks at different regions of the input with the same weights. All neurons in a feature map extract the same feature but from different regions of the input.

    Each filter (or neuron) produces a feature map, and each feature map tries to identify certain features, such as edges, textures, etc., in the input image.

    The figure below shows two neurons in two feature maps (the output) along with the regions in the input from which the neurons extract features. 

    Feature map

    NOTE: Feature map refers to the output of the activation function, which is usually non-linear (such as ReLU), not what goes into the activation function (i.e., the output of the convolution).
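    A short Keras sketch makes these counts concrete. The 32 x 32 x 3 input size is an assumption for illustration; with 10 filters of size (3, 3, 3), the layer below produces 10 feature maps and has 10 x (27 weights + 1 bias) = 280 trainable parameters.

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = tf.keras.Input(shape=(32, 32, 3))                       # RGB input (assumed size)
    feature_maps = layers.Conv2D(filters=10, kernel_size=3,
                                 activation='relu')(inputs)          # 10 feature maps

    model = tf.keras.Model(inputs, feature_maps)
    model.summary()   # output shape (None, 30, 30, 10), 280 parameters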

    Pooling

    After extracting features as feature maps, the next task is to aggregate extracted features using the pooling layer. 

    Pooling tries to figure out whether a particular region in the image has the feature we are interested in or not. It essentially looks at larger regions (having multiple patches) of the image and captures an aggregate statistic (max, average, etc.) of each region. In other words, it makes the network invariant to local transformations.

    Pooling operates on each feature map independently; only the width and height are reduced, while the number of feature maps remains the same.

    The two most popular aggregate functions used in pooling are:

    Max pooling: If any one of the patches says something strongly about the presence of a certain feature, then the pooling layer counts that feature as 'detected.'

    Pooling operations

    The example shown above performs max pooling on the input with a 2 x 2 filter and a stride of 2.

    Average pooling: If one patch signals a feature strongly but the others disagree, the pooling layer takes the average of the responses.

    $$[\begin{matrix} 1 & 3 & 2 & 9 \\ 5 & 6 & 1 & 7  \\ 4 & 2 & 8 & 6 \\ 3 & 5 & 7 & 2 \end{matrix}]  \text{ -> Average Pooling with a 2 x 2 filter and stride 2 -> } [\begin{matrix} 3.75 & 4.75 \\ 3.5 & 5.75 \end{matrix}]$$

    The example shown above performs average pooling on the input with a 2 x 2 filter and a stride of 2.
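    Both pooling operations can be reproduced on the 4 x 4 input above with a few lines of NumPy; splitting the map into non-overlapping 2 x 2 patches is one way to mimic a 2 x 2 window with stride 2.

    import numpy as np

    x = np.array([[1, 3, 2, 9],
                  [5, 6, 1, 7],
                  [4, 2, 8, 6],
                  [3, 5, 7, 2]], dtype=float)

    # Group the 4 x 4 map into non-overlapping 2 x 2 patches (window 2 x 2, stride 2)
    patches = x.reshape(2, 2, 2, 2).swapaxes(1, 2)

    print(patches.max(axis=(2, 3)))    # max pooling:     [[6. 9.] [5. 8.]]
    print(patches.mean(axis=(2, 3)))   # average pooling: [[3.75 4.75] [3.5 5.75]]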

    Pooling offers several advantages:

    • Reduces computational complexity by decreasing the number of parameters.
    • Provides a form of translation invariance.
    • Helps prevent overfitting by reducing the model's sensitivity to small shifts in the input.

    However, it also has disadvantages, particularly the loss of precise spatial information, which led to the development of alternative architectures like Capsule Networks.

    Putting Together the CNN Network

    To summarise, a typical CNN layer (or unit) involves the following two components in sequence:

    • We start with an original image and do convolutions using multiple filters to get multiple feature maps.
    • A pooling layer takes the statistical aggregate of the feature maps.

    A deep CNN has multiple such CNN units (i.e., feature map-pooling pairs) arranged sequentially. To summarise:

    • We have an input image that is convolved using multiple filters to create multiple feature maps.
    • Each feature map, of size (c, c), is pooled to generate a (c/2, c/2) output (for a standard 2 x 2 pooling). 
    • The above pattern is called a CNN layer or unit. Multiple such CNN layers are stacked on top of one another to create deep CNN networks.

    Pooling reduces only the height and the width of a feature map, not the depth (i.e., the number of channels).
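    A minimal Keras sketch of such a stack is shown below; the input size, filter counts, and the final 10-way classifier are illustrative assumptions rather than any particular published architecture.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, 3, padding='same', activation='relu',
                      input_shape=(64, 64, 3)),
        layers.MaxPooling2D(2),           # 64 x 64 x 32 -> 32 x 32 x 32
        layers.Conv2D(64, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),           # 32 x 32 x 64 -> 16 x 16 x 64
        layers.Flatten(),
        layers.Dense(10, activation='softmax'),
    ])
    model.summary()   # pooling halves width and height; the channel depth is untouched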

    Other CNN Architectures

    While VGGNet provides a clear understanding of CNN fundamentals, several other architectures have made significant contributions to the field:

    1. AlexNet
    2. VGGNet
    3. GoogleNet
    4. ResNet

    Let's start discussing this with AlexNet.

    AlexNet

    AlexNet, developed by Krizhevsky et al. in 2012, was the breakthrough architecture that demonstrated the power of deep CNNs on large-scale image recognition tasks. This network won the ILSVRC-2012 competition with a top-5 error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

    Key features include:

    • Eight layers in total: Five convolutional layers followed by three fully connected layers.
    • First use of ReLU (Rectified Linear Unit) activation functions instead of tanh, which accelerated training by a factor of 6.
    • Implementation of "dropout" regularization in the fully connected layers to reduce overfitting.
    • Local Response Normalization (LRN) after the first and second convolutional layers.
    • Overlapping max-pooling with stride 2 and kernel size 3.
    • Data augmentation techniques, including image translations, horizontal reflections, and RGB color variations.
    • Trained on two GPUs with a specialized parallelization scheme.

    AlexNet Architecture
    AlexNet Architecture

    AlexNet contained 60 million parameters and was trained on 1.2 million images from ImageNet. The network achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, on the ILSVRC-2010 test set, significantly outperforming previous state-of-the-art approaches.

    Reference: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

    VGGNet

    VGGNet, developed by the Visual Geometry Group at Oxford in 2014, focused on investigating the effect of network depth on accuracy. The key innovation was using very small (3 x 3) convolutional filters throughout the entire network.

    Key features include:

    • Multiple configurations (A-E) with increasing depth from 11 to 19 weight layers.
    • Consistent use of small 3 x 3 filters with stride 1 and padding 1 to preserve spatial resolution.
    • Stacking multiple 3 x 3 convolutional layers to achieve the same effective receptive field as larger filters (e.g., three 3 x 3 layers have the same effective receptive field as one 7 x 7 layer) while incorporating more non-linearities and using fewer parameters (a quick parameter count follows this list).
    • Five max-pooling layers with 2×2 window and stride 2.
    • Three fully connected layers (4096, 4096, and 1000 units).
    • No Local Response Normalization (found to increase memory consumption without improving accuracy).
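    To make the parameter saving concrete: for a layer with \(C\) input and \(C\) output channels (ignoring biases), three stacked 3 x 3 convolutions use far fewer weights than a single 7 x 7 convolution with the same effective receptive field:

    $$3 \times (3 \times 3 \times C \times C) = 27C^2 \quad \text{vs} \quad 7 \times 7 \times C \times C = 49C^2$$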

    VGGNet Architecture
    VGGNet Architecture

    The best-performing VGG ConvNet models (configurations D with 16 layers and E with 19 layers) achieved top-5 error rates of 7.5% on the ILSVRC-2014 validation set and 7.3% on the test set. Despite having a more uniform and simpler architecture than other networks, VGGNet demonstrated that depth is crucial for high performance.

    The models have been made publicly available and have become popular choices for transfer learning due to their excellent generalization capabilities.

    Reference: https://arxiv.org/pdf/1409.1556

    GoogleNet

    After AlexNet and VGGNet, GoogleNet was released, which outperformed both.

    Unlike the previous architectures, which mainly decreased the filter size and added more layers, GoogLeNet, introduced by Szegedy et al. in 2014, increased the depth using a new type of convolution technique built around the Inception module. It won the ILSVRC-2014 competition with a top-5 error rate of 6.7%.

    Key features include:

    • 22 layers deep (27 including pooling layers) with Inception modules.
    • Inception modules that perform multiple convolutions in parallel (1 x 1, 3 x 3, 5 x 5) and concatenate the results, allowing the network to capture features at different scales simultaneously.
    • The use of 1 x 1 convolutions as "bottleneck" layers to reduce dimensionality before expensive 3 x 3 and 5 x 5 convolutions significantly reduces computational cost.
    • Auxiliary classifiers during training to combat the vanishing gradient problem in deep networks.
    • Global average pooling instead of fully connected layers at the end of the network.
    • Only 5 million parameters (compared to AlexNet's 60 million and VGG's 138 million).

    GoogleNet Architecture
    GoogleNet Architecture

    Despite having significantly fewer parameters than previous architectures, GoogLeNet achieved superior performance through its efficient design. The Inception module allowed the network to be both wider and deeper while keeping computational requirements manageable, demonstrating that architectural innovations could be more important than simply scaling up existing designs.

    Reference: https://arxiv.org/pdf/1409.4842

    ResNet

    Before ResNet, most significant improvements had come from increasing the depth of the network. Does this mean that the greater the depth of the network, the greater the accuracy?

    No. Deeper networks are harder to train because of problems such as exploding or vanishing gradients.

    Let's look at the improvements that ResNet (Residual Network) brought in to improve accuracy compared to the GoogLeNet architecture.

    Since adding more layers does not guarantee a performance improvement, the ResNet architecture, developed by He et al. in 2015, introduced skip connections, which made it possible to train networks of unprecedented depth (up to 152 layers). The architecture won the ILSVRC-2015 competition with a top-5 error rate of 3.57%.

    The key features of the ResNet architecture are:

    • A 152-layer model for ImageNet, with shallower variants (such as ResNet-34, ResNet-50, and ResNet-101).
    • Every 'residual block' has two 3 x 3 convolution layers.
    • No fully connected (FC) layers, except one final 1000-way FC softmax layer for classification.
    • Global average pooling layer after the last convolution.
    • Batch Normalization after every convolution layer.
    • SGD + momentum (0.9) with no dropout used.

    ResNet Architecture
    Comparison of VGG-19 (left), a 34-layer plain network (middle), and ResNet (right)

    The elegance of the residual learning framework is that it allows for extremely deep networks without degradation in performance. ResNet demonstrated that depth is crucial for high performance, with the 152-layer variant achieving state-of-the-art results while still having lower complexity than VGGNet.

    Reference: https://arxiv.org/pdf/1512.03385

    Transfer Learning

    We have seen that many architectures exist for training a CNN network. Does this mean that you should develop your own CNN architecture for every problem statement you need to solve?

    As we know by now, CNNs are feature extractors. The convolutional layers learn features from an image, which can then be used for any task such as classification, object detection, object tracking, etc.

    Transfer learning is the practice of reusing the skills learned from solving one problem to learn to solve a related problem. Let's take a simple example to understand it intuitively:

    Consider a person who knows how to drive a bike. Can this person easily learn how to drive a car using the knowledge he has?

    Yes, it would be relatively easy. Similarly, when training a model, it helps greatly if we already have a set of trained weights that we can build on for a specific task.

    For example, if we have a model that classifies images of animals, we can further train this model to classify images of birds rather than starting from scratch.

    Transfer Learning helps:

    • When we have a huge dataset but limited computational power to train from scratch.
    • When we don't have a huge dataset but do have data specific to our task.

    For computer vision problems, we can use pre-trained models like AlexNet, VGGNet, GoogleNet, ResNet, etc. Transfer learning is based on the idea that the initial layers of a network extract the basic features, and the later layers extract more abstract features.

    Therefore, we can reuse the initial layers of the already-trained network and train the later, more abstract layers for the specific task at hand.

    There are two separate ways in which we can use transfer learning:

    • Freeze the (weights of) the initial few layers and train only a few later layers
    • Retrain the entire network (all the weights), initializing from the learned weights and choosing a low learning rate (Since we don't want to retrain the existing weight to a great extent)

    While implementing transfer learning, how should we choose the number of layers to reuse from the already-trained model?

    The decision for this can depend on the following factors:

    • Is the task very similar to that of the pre-trained model? If yes, we can reuse most of the layers and retrain only the last few.
    • If not, we can reuse only the first few layers of pre-trained weights and train the rest of the network.

    Let's look at Python code on how to perform transfer learning.

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.optimizers import Adam

    num_classes = 10  # placeholder: set this to the number of classes in your task

    # Load the pre-trained model without the classification layers
    base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

    # Freeze the convolutional layers so the pre-trained weights are kept
    for layer in base_model.layers:
        layer.trainable = False

    # Add new classification layers on top of the frozen feature extractor
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(256, activation='relu')(x)
    predictions = Dense(num_classes, activation='softmax')(x)

    # Create the new model
    model = Model(inputs=base_model.input, outputs=predictions)

    # Fine-tuning: unfreeze the last few convolutional layers for further training
    for layer in base_model.layers[-4:]:
        layer.trainable = True

    # Compile the model
    model.compile(
        optimizer=Adam(learning_rate=0.00001),  # low learning rate for fine-tuning
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    # Train; train_generator and validation_generator are assumed to be data
    # generators (e.g. created with ImageDataGenerator.flow_from_directory)
    model.fit(
        train_generator,
        steps_per_epoch=len(train_generator),
        epochs=5,
        validation_data=validation_generator,
        validation_steps=len(validation_generator)
    )

    The above code shows an example of freezing the initial layers of the VGG16 CNN network. Then, the remaining layers are trained for the downstream task.

    The Animal Visual System and CNNs

    You might be wondering how anyone got the idea to build a network in this manner to process images in the first place. It was not a coincidence but a well-grounded motivation drawn from the animal visual system.

    CNNs draw inspiration from biological visual systems, particularly the visual cortex of cats. A bunch of experiments were conducted to understand the visual system of a cat.

    The pioneering work of Hubel and Wiesel

    In the late 1950s and early 1960s, two neuroscientists, David Hubel and Torsten Wiesel, conducted revolutionary experiments that would later earn them a Nobel Prize. Their findings were reported in the paper "Receptive fields of single neurones in the cat's striate cortex".

    In the experiments, spots of light (of various shapes and sizes) were made to fall on the retina of a cat, and the responses of the neurons were recorded using an appropriate mechanism. This provided a way to observe which types of spots make particular neurons 'fire', how groups of neurons respond to spots of certain shapes, and so on.

    Key similarities between visual cortex and CNNs

    1. Receptive fields

    Hubel and Wiesel found that receptive fields in the cat's visual cortex covered areas subtending about 4° of visual angle, with some as small as 1° and others as large as 10°. Similarly, CNN filters typically cover small regions of the input, gradually increasing in their effective receptive field size as we go deeper into the network.

    Just as each neuron in the visual cortex responds to stimuli only within a specific region of the visual field, each unit in a CNN processes only a small patch of the input image. This "local connectivity" allows both systems to detect local patterns regardless of where they appear in the image.

    They also showed how receptive fields contained both excitatory and inhibitory regions. The neurons only ‘fire’ when there is a contrast between the excitatory and the inhibitory regions. If we splash light over the excitatory and inhibitory regions together, because of no contrast between them, the neurons don’t ‘fire’ (respond). If we splash light just over the excitatory region, neurons respond because of the contrast.

    The figure below shows a certain region of the receptive field of a cat. The excitatory region (denoted by the triangular marks) is at the centre, surrounded by the inhibitory region marked by the crosses.

    Receptive fields

    The strength of the response is proportional to the summation over only the excitatory region (not the inhibitory region). The pooling layer in CNNs corresponds to this observation.

    This concept is discussed in detail at the end of the post in the bonus section.

    2. Feature detection

    Hubel and Wiesel discovered that neurons in the visual cortex act as feature detectors, with different neurons responding to different features like edges or lines with specific orientations. Similarly, CNN filters learn to detect specific features in the input data.

    In their experiments, they found that some neurons would fire rapidly when presented with a vertical line but remain silent when shown a horizontal line. This orientation selectivity is precisely what CNN filters develop during training.

    3. Hierarchical processing

    Hubel and Wiesel's work suggested that visual information is processed in stages: initial stages detect basic features (edges, corners), while later stages identify more complex patterns. This directly parallels how CNNs work, with early convolutional layers detecting edges and later layers detecting more complex shapes and objects.

    Both systems hierarchically process visual information, with early layers detecting simple features and deeper layers combining these to recognize more complex patterns.

    Hierarchical processing

    The image above illustrates the hierarchy in units. The first level extracts low-level features (such as vertical edges) from the image, while the second level calculates the statistical aggregate of the first layer to extract higher-level features (such as texture, colour schemes, etc.).

    4. Spatial invariance

    Hubel and Wiesel observed that "complex cells" in the visual cortex would respond to a preferred stimulus (like a line with a specific orientation) regardless of its exact position within the receptive field. This is similar to how pooling layers in CNNs provide translation invariance.

    Both systems develop some degree of position invariance, allowing them to recognize features regardless of their exact location in the visual field.

    From Biology to Technology

    The design of CNNs wasn't an accident. It was deliberately inspired by these biological principles. By mimicking how the brain processes visual information, researchers created an architecture that has revolutionized computer vision and many other fields. This is a beautiful example of how understanding biology can lead to breakthroughs in artificial intelligence.

    This biological inspiration is reflected in the architecture of CNNs, particularly in how convolutional layers process local regions of input data.

    Bonus: Understanding receptive field in detail - The Building Blocks of Vision

    The receptive field is a fundamental concept in convolutional neural networks that has its roots in neuroscience, which we have briefly touched on earlier in this post. Understanding this concept further can help dive deeper into CNN architectures and make better design decisions.

    What is a Receptive Field?

    In biological systems, the receptive field of a neuron is defined as "the portion of the sensory space that can elicit neuronal responses when stimulated". For example, in the human visual system, each neuron in the retina responds to stimuli only within a specific region of the total field of view.

    Similarly, in convolutional neural networks, the receptive field refers to the region in the input space that influences a particular feature in the output. It's essentially the "window" of input data that a particular neuron "sees" and responds to.

    Key Insight: The concept of receptive fields applies only to local operations like convolution and pooling. Fully connected layers don't have a limited receptive field since each unit has access to the entire input.

    Theoretical vs. Effective Receptive Field (ERF)

    Before diving any further, let us clarify the difference between theoretical and effective receptive fields.

    Not all pixels within a receptive field contribute equally to the output. Pixels at the center typically have a much larger impact because they have more paths to contribute to the output. This leads to the concept of the Effective Receptive Field (ERF).

    The ERF can be measured via the partial derivative of the output with respect to the input pixels. Research has shown that the ERF typically follows a 2D Gaussian distribution, with central pixels having the strongest influence.

    Non-linearities like ReLU can cause the ERF to deviate from a perfect Gaussian distribution. When a pixel value is zeroed by ReLU, no path from that part of the receptive field can reach the output, resulting in a zero gradient.

    Critical Finding: The effective receptive field in deep convolutional networks grows much slower than theoretical calculations suggest. However, after training, the ERF typically increases, reducing the gap between theoretical and effective receptive fields.

    Why Receptive Fields Matter

    Understanding the receptive field of your CNN is crucial for several reasons:

    Image Segmentation: In segmentation tasks, each output pixel needs a large enough receptive field to "see" the entire relevant object. Without this, the model might miss crucial contextual information.

    Object Detection: A small receptive field may not be able to recognize large objects, which is why multi-scale approaches are common in object detection architectures.

    Motion Analysis: For tasks like optical flow estimation or video prediction, the receptive field needs to be large enough to capture significant pixel displacements between frames.

    Classification Accuracy: Research shows a logarithmic relationship between classification accuracy and receptive field size, suggesting that larger receptive fields are necessary for high-level recognition tasks.

    Receptive field size and accuracy

    Calculating the Receptive Field

    For a single-path network (without skip connections), we can calculate the receptive field size using a closed-form equation. For a convolutional layer \(i\) with kernel size \(k_i\) and stride \(s_i\), the receptive field \(r_{i-1}\) at its input can be written in terms of the receptive field \(r_i\) at its output as:

    $${r_{i-1} = s_i *r_i + (k_i - s_i)}$$

    This can be generalized to a network with L layers:

    $${r_0 = \sum_{i=1}^{L} \left( (k_i - 1) \prod_{j=1}^{i-1} s_j \right) + 1}$$

    Where:

    • \(r_0\) is the receptive field size of the entire network
    • \(k_i\)  is the kernel size of layer \(i\) 
    • \(s_j\)  is the stride of layer \(j\) 
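    The closed-form expression above can be implemented in a few lines; the sketch below assumes a simple single-path network described by its kernel sizes and strides.

    def receptive_field(kernel_sizes, strides):
        """Theoretical receptive field r_0 of a single-path network (closed form above)."""
        r, jump = 1, 1
        for k, s in zip(kernel_sizes, strides):
            r += (k - 1) * jump    # (k_i - 1) times the product of the preceding strides
            jump *= s
        return r

    # Two stacked 3 x 3 convolutions with stride 1 -> the familiar 5 x 5 receptive field
    print(receptive_field([3, 3], [1, 1]))         # 5
    # 3 x 3 conv, 2 x 2 max pool (stride 2), 3 x 3 conv
    print(receptive_field([3, 2, 3], [1, 2, 1]))   # 8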

    Strategies to Increase the Receptive Field

    As we have seen earlier, accuracy increases with the size of the receptive field. Therefore, let's look at several effective approaches to increase the receptive field of a CNN network.

    1. Adding More Convolutional Layers

    Each additional layer increases the receptive field linearly by the kernel size. However, this approach has diminishing returns as the network gets deeper.

    NOTE: While the theoretical receptive field increases with depth, research shows that the effective receptive field (ERF) ratio actually decreases with more layers.

    2. Pooling and Strided Convolutions

    Subsampling techniques like pooling or using strided convolutions increase the receptive field multiplicatively. This makes them very efficient for quickly expanding the receptive field.

    3. Dilated (Atrous) Convolutions

    Dilated convolutions introduce "holes" in the convolutional kernel, allowing it to cover a larger area without increasing the number of parameters. The effective kernel size with dilation rate r becomes:

    $$k' = r(k-1) + 1$$

    For example, a 3 x 3 kernel with a dilation rate of 2 has the same receptive field as a 5 x 5 kernel, while only using 9 parameters instead of 25.

    When used sequentially, dilated convolutions can increase the receptive field exponentially while the number of parameters only grows linearly.
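    A tiny sketch of the effective kernel size formula illustrates this growth; the dilation rates below are arbitrary examples.

    def effective_kernel(k, dilation):
        """k' = r(k - 1) + 1 for a k x k kernel with dilation rate r."""
        return dilation * (k - 1) + 1

    for rate in (1, 2, 4, 8):
        print(rate, effective_kernel(3, rate))   # 3, 5, 9, 17 -- parameters stay at 9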

    4. Depth-wise Convolutions

    Depth-wise convolutions don't directly increase the receptive field, but they allow for more efficient computation. This means we can add more layers with the same computational budget, indirectly enabling a larger receptive field.

    Key Insight: Pooling operations and dilated convolutions are the most effective ways to quickly increase the receptive field size.

    Skip Connections and Receptive Field

    As we discussed earlier in the post, skip connections, as used in architectures like ResNet, create multiple paths through the network. This results in a range of different receptive fields rather than a single fixed size.

    For example, a network with n skip-residual blocks utilizes \(2^n\) different paths, each with potentially different receptive field sizes. This creates a distribution of receptive fields that can be visualized as a histogram.

    While skip connections provide more paths and potentially larger maximum receptive fields, research suggests they tend to make the effective receptive field smaller.

    Impact of Other Operations on Receptive Field

    Transposed Convolutions & Upsampling

    For receptive field calculation purposes, upsampling can be considered to have a kernel size equal to the number of input features involved in computing an output feature. When doubling spatial dimensions, this is typically k=1.

    Separable Convolutions

    The receptive field properties of separable convolutions are identical to their equivalent non-separable convolutions. There's no change in terms of receptive field size.

    Batch Normalization

    During training, batch normalization parameters are computed based on all channel elements of the feature map. In this sense, its receptive field could be considered to be the whole input image.

    Key takeaways of receptive field

    • The concept of receptive fields applies only to local operations like convolution and pooling.
    • Design the model so its receptive field covers the entire relevant input region for the given task.
    • Sequential dilated convolutions grow the receptive field exponentially while parameters grow linearly.
    • Pooling and dilated convolutions are the most effective ways to quickly increase receptive field size.
    • Skip connections provide multiple paths but may make the effective receptive field smaller.
    • The effective receptive field increases after training.

    Understanding the receptive field concept is crucial for designing effective CNN architectures. By carefully considering how different layers and operations affect the receptive field, we can create models that better capture the relevant features for the specific computer vision task at hand.

    Conclusion

    Convolutional Neural Networks have revolutionized computer vision by providing a powerful framework for image analysis and understanding. The CNN architecture, inspired by the biological visual system, enables these networks to automatically learn hierarchical representations of visual data.

    • CNNs use convolutional layers to extract features from images.
    • Pooling layers reduce spatial dimensions while preserving important information.
    • Deeper networks like AlexNet, VGGNet, GoogLeNet, and ResNet have pushed the boundaries of what's possible in computer vision.
    • Transfer learning allows us to leverage pre-trained models for new tasks, even with limited data.

    As research continues, CNNs and their variants remain at the forefront of computer vision, enabling applications from facial recognition to autonomous driving to medical image analysis and much more.

    Author Info

    Tavish Aggarwal

    Website: http://tavishaggarwal.com

    He lives in Hyderabad and works as a research-focused Data Scientist specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data-wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on delivering high-quality results.