Seeing Through AI Eyes: The Technology Behind Style Transfer and Object Detection

Tavish Aggarwal

April 18, 2025

In this blog, we'll dive deep into two computer vision applications that are changing how we interact with images: style transfer and object detection. We will explore how each of them works and the technology behind them.

In recent months, social media has been transformed into a Ghibli wonderland, with users across platforms sharing photos of themselves and their surroundings reimagined with a Ghibli filter. We will explore how similar outputs can be obtained using style transfer.

We will also explore the algorithms that enable computers to identify and locate objects within images, focusing on modern deep learning approaches to object detection such as Faster R-CNN.

This blog assumes that you have a good understanding of Convolutional Neural Networks (CNNs). I suggest referring to the post Seeing Through Layers: The Ultimate Guide to Convolutional Neural Networks if you want to revise or understand how CNNs work.

Style Transfer

Have you ever wondered how those apps that transform your photos into the style of famous paintings work, like the recently viral Ghibli filter? The technology behind this fascinating visual transformation is called Neural Style Transfer, and it represents one of the most visually striking applications of deep learning in computer vision. In this section, we'll dive into how style transfer works, breaking down the complex algorithms into understandable concepts.

Problem statement

Style transfer is an application of transfer learning where we have a 'content image' and a 'style image', and the objective is to transfer the style from the style image onto the content image.

The input image is called the content image, and the output image with the style embedded is called the candidate image.

Style Transfer

The notations used to denote the images are Content Image: T, Style Image: S, and Candidate Image: C.

Note that there are many possible candidate images one can generate using the same content and style images, and the task is to generate a 'suitable' candidate image. That is, the candidate image should resemble the content from the content image and the style from the style image.

Understanding Content and Style Representations

To understand how neural style transfer works, we need to first understand how neural networks "see" images. When a deep convolutional neural network (CNN) processes an image, it creates different representations at different layers:

  • Lower layers capture basic features like edges, colors, and textures
  • Higher layers capture more complex features and, eventually, entire objects

The breakthrough insight from the 2015 paper "A Neural Algorithm of Artistic Style" by Gatys et al. was that these different layers of representation could be separated into content and style components.

Content Representation

The content of an image is captured by the activations of the higher layers in a CNN. These layers identify what objects are present and their arrangement, but aren't concerned with pixel-level details.

Style Representation

What does 'style' even mean? How do we extract the style of an image and write an algorithm that can transfer the style to the content?

The style of an image (its textures, colors, and visual patterns) is captured by the correlations between different feature maps within multiple layers of the network. These correlations form what's called a Gram matrix.

Example explaining the difference between content and style

Let's understand the difference between content and style with the help of an example of a web browser. On a web browser, whether a button is present on the screen is governed by the content. And how the button appears relative to the menu bar is governed by the style, i.e., the theme of the browser.

In general, style can be thought of as 'how different features of an image interact with each other'. For example, how a button in a specific color will look based on the background and the border of the button is governed by the style.

Mathematical Approach: Objective or loss function of the problem

Let's see next how to translate content and style into a mathematical form.

Content loss

The 'content' part of the loss function ensures that the feature vectors of the content and candidate images are as close as possible.

When we feed the content image to a pre-trained network such as VGGNet, we expect a high-level feature vector in the last layers of the network. Likewise, if we feed a candidate image, we will again expect a feature vector that represents the candidate image.

Let's assume the feature vector from the content image is represented as \(F_T\) and the feature vector from the candidate image is represented as \(F_C\). The objective here is that the feature vectors of the content and candidate images should be as similar as possible. (After all, they represent the same underlying content.)

So, the loss function, known as the content loss, is represented as:

$$\text{Content Loss} = ||F_T - F_C||^2$$
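To make this concrete, here is a minimal sketch in PyTorch (assuming the images are already preprocessed into normalized tensors of shape 1 x 3 x H x W; the layer index 21, i.e., conv4_2 of VGG-19, is a common but not mandatory choice):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Load a pre-trained VGG-19 and freeze it; we only use it as a fixed feature extractor.
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features_at(x, layer_idx=21):
    """Run x through VGG-19 up to layer_idx (conv4_2 by default) and return the activations."""
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == layer_idx:
            return x

def content_loss(content_img, candidate_img, layer_idx=21):
    # ||F_T - F_C||^2, averaged over the elements of the feature map
    f_t = features_at(content_img, layer_idx)
    f_c = features_at(candidate_img, layer_idx)
    return F.mse_loss(f_c, f_t)
```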

So far, we have seen the content loss. But there also has to be a style loss, which measures how much of the style has been absorbed. Let's now understand this second part of the loss function, called the style loss.

Style loss

In general, we can see style as the correlation between different parts of an image. Consider the output of some layer \(l\) of the CNN, and let's assume it contains 256 feature maps, each of shape 32 x 56. Each of these feature maps captures a particular feature of the image: 256 feature maps will capture 256 features, which could be edges, color gradients, textures, etc. Now, we want to define a loss function that measures the correlation between these features and, therefore, helps us capture "style" mathematically.

We use the Gram matrix, which captures the style of the image. Let's understand how.

Gram Matrix

Let’s understand the Gram matrix with an example. Consider an image of size 150 x 250 x 3. Now, imagine feeding this image to a CNN, which results in a feature vector of size 75 x 125 x 256 after a few convolution and pooling operations. This feature size implies that there are 256 feature maps, each of size 75 x 125.

Now, we flatten each of these 256 feature maps to get 256 feature vectors. If we flatten a feature map of size 75 x 125, we will get a vector of size 75 x 125 = 9375. So, after flattening the 256 feature maps, we have 256 feature vectors, each of size 9375. Each of these 256 vectors represents a 'feature' in the input image, i.e., the color combination, the texture, etc.

We represent this as a matrix \(F^l\) of size 256 x 9375 (where \(l\) is the layer). The \(i^{th}\) row of this matrix, \(F^l_i\), represents a 'feature' in the input image.

Now, we want to compute the correlations between features (which we said is a good proxy for 'style'). Thus, we multiply the matrix \(F^l\) with its transpose. The size of the transpose matrix will be 9375 x 256, so the product of the two matrices will be a square matrix of size 256 x 256.

This 256 x 256 matrix is the Gram matrix, where each element represents the correlation between a pair of the 256 features in the image. We will capture style by calculating the Gram matrix at multiple layers of the VGGNet.

Let \(F^l_i\) denote the \(i^{th}\) flattened feature map at layer \(l\) of the network, and \(F^l_j\) the \(j^{th}\) one. For a given layer \(l\), each entry of the Gram matrix is the dot product of a pair of flattened feature maps:

$$G^l_{ij} = F^l_i \cdot F^l_j$$
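Here is a tiny sketch of this computation in PyTorch, using the shapes from the running example (the normalization by the number of elements is a common practical addition, not part of the definition above):

```python
import torch

def gram_matrix(feature_map):
    """feature_map: tensor of shape (channels, height, width), e.g. (256, 75, 125)."""
    if feature_map.dim() == 4:              # allow an optional leading batch dimension of size 1
        feature_map = feature_map.squeeze(0)
    c, h, w = feature_map.shape
    flat = feature_map.view(c, h * w)       # 256 x 9375: one row per flattened feature map
    gram = flat @ flat.t()                  # 256 x 256: dot products between pairs of features
    return gram / (c * h * w)               # optional normalization, common in practice

feats = torch.randn(256, 75, 125)           # random activations shaped like the running example
print(gram_matrix(feats).shape)             # torch.Size([256, 256])
```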

While replicating the style, we are interested in how closely the feature correlations of the candidate image match those of the style image. \(G_C\) denotes the Gram matrix of the candidate image's feature matrix, and \(G_S\) denotes the Gram matrix of the style image. The style loss is the squared L2 norm of the difference between the two Gram matrices:

$$\text{Style Loss} = ||G_C - G_S||^2$$

The above loss is for a single layer. In practice, we calculate the style loss at several layers of the network and combine them using layer-specific weights, as sketched below.
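As a hedged sketch (reusing the `gram_matrix` helper above; `style_feats` and `candidate_feats` are placeholder dictionaries mapping layer names to feature maps, and the five layers with equal weights are a typical but not mandatory choice):

```python
import torch.nn.functional as F

def style_loss(style_feats, candidate_feats, layer_weights):
    """style_feats / candidate_feats: dicts mapping layer name -> feature map tensor."""
    loss = 0.0
    for layer, w_l in layer_weights.items():
        g_s = gram_matrix(style_feats[layer])       # Gram matrix of the style image at this layer
        g_c = gram_matrix(candidate_feats[layer])   # Gram matrix of the candidate image
        loss = loss + w_l * F.mse_loss(g_c, g_s)    # ||G_C - G_S||^2, up to a constant factor
    return loss

# Equal weights over the five style layers commonly used with VGG-19
layer_weights = {name: 0.2 for name in
                 ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]}
```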

Finally, the two parts are combined into the loss that drives the optimization (a short code sketch of this combination follows the list of terms below). The total loss (L) is:

$$L = \sum_i(F^l_{Ti} - F^l_{Ci})^2 + \lambda\sum^L_{l=1}W_l(G_S^l - G_C^l)^2$$

where, 

  • \(F^l_{Ti}\) is the \(i^{th}\) feature vector at layer \(l\) generated by the content image (T). For example, if a layer's output is 25 x 25 x 256, each \(F^l_{Ti}\) will be a flattened vector of size 25 x 25 = 625, and \(i\) will run from 0 to 255.
  • \(F^l_{Ci}\) is the \(i^{th}\) feature vector at layer \(l\) generated by candidate image (C).
  • \(G_S^l\) is the Gram matrix at layer \(l\) for Style Image (S)
  • \(G_C^l\) is the Gram matrix at layer \(l\) for Candidate Image (C)
  • \(W_l\) is the weight for the layer \(l\) in case of the Gram matrix. It is a hyperparameter.
  • \(\lambda\) controls the relative importance of the style loss with respect to the content loss.
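Here is that combination as a short sketch, reusing the `content_loss` and `style_loss` helpers from above (the value of `lambda_style` is illustrative; in practice it is tuned):

```python
def total_loss(content_img, candidate_img, style_feats, candidate_feats,
               layer_weights, lambda_style=1e4):
    # L = content term + lambda * weighted sum of per-layer style terms
    return (content_loss(content_img, candidate_img)
            + lambda_style * style_loss(style_feats, candidate_feats, layer_weights))
```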

The Optimization Process

Unlike traditional neural network training, where we update the network weights, in style transfer, we keep the network fixed and instead optimize the pixel values of our candidate image. We start with random noise (or sometimes the content image itself) and iteratively update it to minimize our total loss function.

The process works like this:

  1. Initialize a candidate image (often with random noise).
  2. Feed it through a pre-trained CNN (typically VGG-19).
  3. Calculate the content and style losses.
  4. Compute the gradient of the loss via backpropagation with respect to the image pixels.
  5. Update the image to reduce the loss.
  6. Repeat until satisfied with the result.
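For completeness, here is a hedged sketch of that loop in PyTorch, reusing the helpers from the earlier sketches (`layer_indices` is a hypothetical mapping from style-layer names to VGG-19 layer indices, `style_feats` is assumed to have been computed once from the style image, and the optimizer, learning rate, and step count are illustrative choices):

```python
import torch

# Start from a copy of the content image (random noise also works) and make its pixels trainable.
candidate = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([candidate], lr=0.01)

for step in range(300):
    optimizer.zero_grad()
    # Extract candidate features at the style layers; the network weights stay frozen throughout.
    candidate_feats = {name: features_at(candidate, layer_indices[name])
                       for name in layer_weights}
    loss = total_loss(content_img, candidate, style_feats, candidate_feats, layer_weights)
    loss.backward()                      # gradients flow to the candidate image's pixels
    optimizer.step()
    with torch.no_grad():
        candidate.clamp_(0.0, 1.0)       # keep pixel values in a valid range
```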

I am also sharing a Jupyter Notebook link with code to perform style transfer on the image.

Conclusion

Neural style transfer represents a beautiful intersection of art and technology. By leveraging the power of deep neural networks, we can now mathematically define and manipulate the "style" of an image.

As this technology continues to evolve, we can expect even more sophisticated applications that will blur the line between human and machine creativity. Whether you're an artist looking for new tools or a developer interested in computer vision, neural style transfer offers a fascinating glimpse into how AI can help us see and create in new ways.

Object Detection

Ever wondered how self-driving cars detect pedestrians, or how security cameras spot suspicious items? The technology behind these examples is known as object detection, a field of computer vision that enables machines to not only see but also understand what they are looking at.

Object detection is a comprehensive area of computer vision that builds on image classification (which answers the question "what is in this image?"). It tells us not only "what" is in an image but also "where" it is.

This section thoroughly explains object detection and addresses one of the leading object detection algorithms: Faster R-CNN.

What is Object Detection?

Object detection is the task of locating objects in an image. For example, take a look at the following image, where three objects are detected. Each detected object has an associated probability or confidence.

Object Detection

The Three Tasks of Object Detection

An object detection problem can be divided into 3 subtasks: 

1. Region proposal generation

The first challenge is to find the possible regions, of varying shapes and sizes, that potentially contain an object. This subtask is where most object detection algorithms introduce novel ideas to improve on the state of the art. For example, the image shown below presents the region proposals for a car.

Region proposal generation

Early object detection systems used techniques like sliding windows (checking every possible location) or selective search (grouping similar pixels). However, modern approaches like Faster R-CNN (which we will cover later in the post) use neural networks to generate these proposals more efficiently.
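To make the sliding-window idea concrete, here is a tiny illustrative sketch in plain Python (the window sizes and stride are arbitrary, hypothetical choices):

```python
def sliding_window_proposals(img_width, img_height,
                             window_sizes=((64, 64), (128, 128)), stride=32):
    """Enumerate candidate boxes (x1, y1, x2, y2) covering the image at a few scales."""
    proposals = []
    for win_w, win_h in window_sizes:
        for y in range(0, img_height - win_h + 1, stride):
            for x in range(0, img_width - win_w + 1, stride):
                proposals.append((x, y, x + win_w, y + win_h))
    return proposals

boxes = sliding_window_proposals(640, 480)
print(len(boxes))   # hundreds of candidate regions, even for a small image and only two scales
```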

2. Object classification

Once we have potential regions, we need to determine what's in them. This is where convolutional neural networks (CNNs) shine.

Each proposed region is fed into a CNN, which acts as a powerful image classifier. The CNN extracts features from the region and predicts:

  • What class does the object belong to (car, person, dog, etc)?
  • A confidence score indicating how certain the model is about its prediction.

Regions with confidence scores below a certain threshold are discarded, leaving only the most promising candidates.

3. Non-maximum suppression

There is a negligible chance that a region generated in the first step will capture the object perfectly; typically, several overlapping regions each cover the object partially. In the end, we want just one region per object that captures it as completely as possible. This is where the non-maximum suppression (NMS) technique is used.

For example, consider the four region proposals for the car shown below, each of which is classified as containing a car with confidence above the stipulated threshold.

Non-maximum suppression


In the above picture, we can see that only the red box captures most of the car, and the other boxes are redundant. How does an algorithm find this region?

NMS solves this problem by keeping only the most confident detection and eliminating redundant overlapping boxes using Intersection over Union.

IoU (Intersection over Union)

Intersection over Union (IoU) measures how much two bounding boxes overlap. It is the ratio of the area of intersection to the area of union between two given regions.

IoU = Area of Overlap / Area of Union
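Here is a minimal sketch of IoU for two axis-aligned boxes, each given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes, each given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # ~0.14: a modest overlap
```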

NOTE: If two boxes have a high IoU (meaning they substantially overlap) and detect the same class, the one with the lower confidence score is removed. This ensures we end up with just one detection per object.

We compute the IoU metric between the ground-truth region of interest and each of the generated regions of interest. The ground-truth region of interest is the ideal bounding box around the car, provided as the label in the training data.
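Putting the two ideas together, here is a hedged sketch of greedy non-maximum suppression that reuses the `iou` helper above (the 0.5 IoU threshold is a typical but arbitrary choice):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: matching confidence scores. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 110, 110), (15, 12, 112, 108), (200, 200, 260, 260)]
scores = [0.92, 0.85, 0.70]
print(non_max_suppression(boxes, scores))    # -> [0, 2]: the overlapping, lower-scoring box is dropped
```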

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Network

Now that we understand the fundamentals, let's explore one of the most influential object detection algorithms, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Network, which revolutionized object detection by making it both more accurate and significantly faster than previous approaches.

The Evolution to Faster R-CNN

Before Faster R-CNN, object detection systems like R-CNN and Fast R-CNN relied on external region proposal methods like Selective Search, which were computationally expensive and created a bottleneck in the detection pipeline.

Faster R-CNN's key innovation was integrating the region proposal mechanism directly into the neural network architecture, creating an end-to-end trainable system.

How Faster R-CNN Works

Faster R-CNN consists of two main components:

  1. Region Proposal Network (RPN): A specialized neural network that generates region proposals
  2. Fast R-CNN Detector: A network that classifies the proposed regions and refines their bounding boxes

Faster R-CNN architecture

Here's how the process works:

Step 1: Feature Extraction

The image is first processed through a pre-trained CNN (like VGG-16 or ResNet) to extract feature maps - rich representations of the image content.

Step 2: Region Proposal Network

The feature maps generated in the previous step are fed to the RPN, which is itself a part of the overall network. The RPN uses these feature maps to generate regions of interest (RoIs). Unlike R-CNN and Fast R-CNN, which rely on an external method such as Selective Search to generate RoIs, Faster R-CNN generates RoIs smartly within the network itself.

Region proposal is thus learned while training the network. The RPN scans the feature maps generated in the previous step to find potential RoIs, and for each detected RoI it generates two things.

  • The first is an "objectness score", which is the probability that the RoI box contains an object. For example, an objectness score of 0.83 tells us that the detected RoI very likely has an object inside it. It doesn't classify the object yet; it just tells us whether the box contains an object or not.
  • The second is a "pair of coordinates" that gives the location of the RoI box in the image. For example, the RPN can generate a pair of coordinates, as shown below. The two coordinates are diagonally opposite corners, which uniquely identify a box.
    Region Proposal Network

This approach is much more efficient than previous methods because:

  • It shares the convolutional computations across the entire image
  • It learns to generate high-quality proposals through training
  • It operates on feature maps rather than the original image

Step 3: Proposal Refinement and Classification

Next, we give the output of RPN and the features extracted by the initial convolutional layers to the classifier present at the end of the network.

NOTE: We push the output of the RPN to the classifier only if the objectness score is above a certain threshold, such as 0.5, which means the RPN thinks there is a high probability that an object is present inside the RoI box.

The classifier performs two tasks.

  • First, it classifies the object present in the RoI box into one of the classes, such as car, cat, dog, etc. This classification is trained with a log loss that compares the predicted class with the ground-truth class.
  • The second task is to calculate the regression loss, which compares the 4 coordinates generated by the RPN with the ground truth bounding box coordinates. Both of these losses are backpropagated to update the weights in the entire network, including those of the RPN.

Faster R-CNN represented a significant advancement by integrating region proposals into the network and enabling end-to-end training where the entire system can be trained together.
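If you want to experiment with Faster R-CNN without training anything yourself, torchvision ships a model pre-trained on COCO. Here is a minimal inference sketch (the image path is a placeholder, and the 0.5 score threshold is illustrative):

```python
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
from torchvision.transforms.functional import to_tensor

# Load a Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO.
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = Image.open("street_scene.jpg").convert("RGB")   # placeholder image path
with torch.no_grad():
    predictions = model([to_tensor(img)])[0]           # one dict per input image: boxes, labels, scores

# Keep only confident detections and print human-readable class names.
categories = weights.meta["categories"]
for box, label, score in zip(predictions["boxes"], predictions["labels"], predictions["scores"]):
    if score >= 0.5:
        print(categories[label.item()], [round(v, 1) for v in box.tolist()], round(score.item(), 2))
```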

Challenges and Future Directions

Despite appreciable progress, however, object detection still faces several challenges:

  • Small object detection: Many systems struggle with objects that are very small.
  • Occlusion handling: Detecting partially hidden objects remains difficult for most systems.
  • Computational efficiency: Balancing accuracy with speed for real-time applications.
  • Domain adaptation: Transferring knowledge to new environments or conditions.

Conclusion

Object detection represents significant progress in the field of computer vision, allowing machines to interpret images much as humans do. By framing the problem as region proposal, followed by classification and bounding-box refinement, Faster R-CNN has made great strides in teaching computers to understand objects.

As object detection technology continues to develop, we can expect ever more accurate, efficient, and general-purpose detection systems.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Living in Hyderabad and working as a research-focused Data Scientist specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on delivering high-quality results.