Standard Convolution to Deformable Convolution — Part 1

Bhargav Vadlamudi

5 min read · Feb 24, 2021

Part 1 → Basic Convolution and Receptive Field (code implementation in OpenCV and PyTorch)

Part 2 → Deformable Convolution (will be out next week)

Prerequisites

Matrix Operation, Dot product
Basic understanding of convolution

Let’s review some terms before proceeding:

What is convolution?
What’s a kernel, a filter, weights?
What are feature maps?

Convolution & Filters & Feature Map :

Let’s not get into technical definitions; our goal is to understand the process and its purpose.

Consider a greyscale image with a vertical line. As humans, we recognize objects based on their structure and other high-level attributes. Something as simple as a vertical line we identify by its edges.

The first basic step is to find a way to identify edges in images. Edges are the places where there’s an abrupt change in pixel values.

Let’s look at the above image in terms of pixels. The image is of shape 5x3; let’s call this image matrix (X).

Now let’s pick a 3x3 matrix with the values shown below. For now, let’s call this matrix (W).

Our input image (X) is of shape 5x3 and W is 3x3, so we sample the input in 3x3 patches. As you can see below, we get 3 matrices when we sample the input image (X) with step size = 1 and sample size 3x3.

3 Samples each of size 3x3 from input Image (X)

On each sample we multiply element by element with the matrix (W) and sum the result to get a single value. This is the same as a dot product of the two matrices once they are flattened into vectors.

Overall Convolution operation on the input

Finally, for 3 samples of the input we have 3 outputs:

Sample 1 → 3

Sample 2 → 0

Sample 3 → -3

Now notice that the resulting value is far from zero (3 and -3) if the sample contains an abrupt change in pixel values from one side to the other, and close to zero if no such change is observed.
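The sampling and multiply-and-sum steps above can be sketched in a few lines of NumPy. The pixel values here are assumptions chosen to reproduce the outputs 3, 0 and -3: X is taken as 3 rows by 5 columns with a one-pixel vertical line in the middle column, and W is a simple vertical-edge filter. The helper name `convolve_valid` is hypothetical.

```python
import numpy as np

# Assumed input X: 3 rows x 5 columns, vertical line in the middle column.
X = np.array([[0, 0, 1, 0, 0]] * 3)

# Assumed filter W: responds to left-to-right changes in pixel value.
W = np.array([[-1, 0, 1]] * 3)

def convolve_valid(image, kernel, step=1):
    """Slide the kernel over the image; multiply element-wise and sum each sample."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // step + 1
    out_w = (image.shape[1] - kw) // step + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            sample = image[i * step:i * step + kh, j * step:j * step + kw]
            out[i, j] = np.sum(sample * kernel)
    return out

feature_map = convolve_valid(X, W)
print(feature_map.tolist())  # [[3.0, 0.0, -3.0]]
```

With these assumed values the three samples give 3, 0 and -3, one value per 3x3 sample.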

That is all. The steps we have performed so far are nothing but the convolution operation performed in CNNs. Now let’s relate the matrices and results above to some standard terms used in the ML community.

X → Our input image
W → This could be called Filter / Weights / Kernel Matrix
Size of W → Kernel Size
Our final output values → Called the Feature Map

By now you must have understood that the matrix that plays the crucial role in identifying edges or other high-level features is the filter matrix (W).

Now that you have some basic understanding, you should be able to infer what’s going on in the animation and relate each matrix to the above terms we have defined.

What’s the input matrix here? What’s the filter or weight matrix? What’s the final feature map matrix?

In the past, these filters were predefined by humans to detect specific features. There’s a wide variety of predefined filters for edge detection, smoothing and many other purposes. (Check the OpenCV docs.)

With the advent of deep learning, it became possible to learn these filters automatically using gradient descent.
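To make the idea of learning a filter concrete, here is a minimal PyTorch sketch. Everything in it is an assumption for illustration: we pretend a vertical-edge filter is unknown, generate random images and the responses that filter produces, and let plain gradient descent recover a 3x3 filter that reproduces those responses. The image size, learning rate and step count are arbitrary choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed "unknown" target filter (a vertical-edge detector).
target_w = torch.tensor([[-1., 0., 1.],
                         [-1., 0., 1.],
                         [-1., 0., 1.]]).view(1, 1, 3, 3)

# Random training images and the feature maps the target filter produces.
images = torch.rand(64, 1, 8, 8)
targets = F.conv2d(images, target_w)

# Learnable 3x3 filter, starting from all zeros.
w = torch.zeros(1, 1, 3, 3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.2)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = F.mse_loss(F.conv2d(images, w), targets)
    loss.backward()   # gradient of the loss with respect to the filter
    optimizer.step()  # one gradient descent update of the filter values
    losses.append(loss.item())

print(losses[0], losses[-1])
print(w.detach().view(3, 3))
```

The loss should fall as training proceeds, and the learned values in `w` should move toward the target filter. In a real CNN the targets come from a task loss (for example classification) instead of a known filter.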

Before proceeding to Deformable Convolution, we need to cover the terms below:

Receptive Field
Bilinear Interpolation

Receptive Field

The receptive field of a cell in a feature map is simply the set of input pixels that contribute to the calculation of that cell.

Let’s break it down. Observe the animation below: the blue grid is our input, and by performing convolution with a 3x3 filter we get the green grid, which is the feature map.

Each cell in the feature map is calculated from some cells of the input, and those input cells are called the receptive field of that feature map cell.

I hope it’s clear. Next, let’s look at convolution performed over 2 layers and analyze the receptive field.

Observe that the cells in feature map 2 have a much larger receptive field than the cells in feature map 1. So feature maps of higher layers have a larger receptive field than lower-level maps.

Note: Because of their limited receptive field, lower-level layers can only learn low-level features of an object, such as edges and curves, while higher-level layers, which have access to almost the entire image, can learn high-level features.
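The growth of the receptive field across layers can be computed with a short helper. This is a sketch under stated assumptions: the function name `receptive_field` is hypothetical, layers are plain convolutions described by (kernel size, stride), and dilation is ignored.

```python
def receptive_field(layers):
    """Receptive field size (along one axis) of a cell in the last feature map.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring cells of the current map
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))          # 3 -> feature map 1 sees 3x3 input pixels
print(receptive_field([(3, 1), (3, 1)]))  # 5 -> feature map 2 sees 5x5 input pixels
```

So stacking two 3x3 convolutions with stride 1 lets each cell of the second feature map see a 5x5 patch of the input, which is why deeper layers see more of the image.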

OpenCV implementation

Let’s take a simple greyscale image with a horizontal edge and check the output of the Sobel filters in OpenCV.

Sobel filters are edge detection filters: the sobel_x filter detects vertical edges, whereas sobel_y detects horizontal edges.

Code below to create the sample image:
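A minimal sketch for such a sample image, assuming an 8x8 greyscale image whose top half is black and bottom half is white (the size and pixel values are illustrative choices):

```python
import numpy as np

# Assumed 8x8 greyscale image: top half black (0), bottom half white (255).
# The boundary between row 3 and row 4 is the horizontal edge.
image = np.zeros((8, 8), dtype=np.uint8)
image[4:, :] = 255

print(image)
```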

We consider sobel_x and sobel_y filters of size 3x3, i.e. our W matrix is of shape 3x3. You can see below that the sobel_x filter cannot detect any edges, because it is designed to detect vertical edges, not horizontal ones.

PyTorch Implementation

Now let’s use the same input image and perform the convolution using exactly the same Sobel filters. The results should be the same as OpenCV’s, and indeed they are; check below.
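A sketch of the same experiment with `torch.nn.functional.conv2d`, again assuming an 8x8 image with one horizontal edge. The two 3x3 Sobel kernels are written out by hand and stacked as two output channels.

```python
import torch
import torch.nn.functional as F

# Assumed sample image in (batch, channels, height, width) layout.
image = torch.zeros(1, 1, 8, 8)
image[:, :, 4:, :] = 255.0

# Standard 3x3 Sobel kernels.
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
sobel_y = torch.tensor([[-1., -2., -1.],
                        [ 0.,  0.,  0.],
                        [ 1.,  2.,  1.]])

# conv2d expects weights of shape (out_channels, in_channels, kH, kW).
weight = torch.stack([sobel_x, sobel_y]).unsqueeze(1)

out = F.conv2d(image, weight)  # no padding, so the output is 6x6 per channel
response_x, response_y = out[0, 0], out[0, 1]

print(response_x.abs().max().item())  # 0.0, no vertical edge
print(response_y.abs().max().item())  # non-zero at the horizontal edge
```

Note that `conv2d` slides the kernel without flipping it, like the sampling procedure described earlier. Because no padding is used here, the 6x6 output should correspond to the interior of an 8x8 OpenCV result, not to its border pixels.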

Great, take a break and move on to Deformable Convolution :) If you find this post useful, feel free to drop a clap.