In an earlier post, I explained convolution and deconvolution in deep neural networks. The purpose of this post is to demonstrate these operations using PyTorch. Before doing that, we will go over the different operations and parameters associated with a convolution layer.
Convolution is an operation on two functions of real-valued arguments. One of these functions, considered the signal, is an n-dimensional array of numbers, for example a 3-dimensional array of numbers representing a color image. The second function is a kernel or filter of identical dimension whose size is typically much smaller than the input array. The array representing the kernel function is called the kernel mask. The purpose of the convolution operation is to transform the input into a new array that highlights some property of the input. Thus, convolution can be viewed as feature extraction, and the transformed array is often called a feature map, where feature refers to a particular characteristic of the input extracted by the kernel.
The convolution operation is performed by moving the kernel mask over the signal array and calculating the kernel response at each location. To understand the operation, let's consider a 3-dimensional input array representing the red, green, and blue channels of a color image patch and a 3×3 convolution filter as shown below. [I am using identical filters for the three input channels for convenience. In practice, each channel has its own mask weights.] To perform convolution at a particular position of the input array, we place the center of the convolution mask at the desired position and perform element-by-element multiplication between the signal array elements and the convolution mask elements, followed by summation, for each input channel as shown in the figure below. The responses from the three channels are then added to produce the output of the convolution operation. The response over the entire input array is obtained by moving the mask center one step at a time and repeating the calculation.
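As a quick sanity check of this per-position calculation, the minimal sketch below computes the response at one position by hand (element-by-element multiplication and summation over the three channels) and compares it with the value produced by PyTorch's F.conv2d. The input patch and mask values are random and purely illustrative.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
patch = torch.randn(1, 3, 8, 8)     # one 3-channel 8x8 input patch
mask = torch.randn(3, 3)            # a single 3x3 mask shared by all three channels
weight = mask.repeat(1, 3, 1, 1)    # shape (out_channels=1, in_channels=3, 3, 3)

# Response with the mask centered at input position (4, 4): multiply each channel's
# 3x3 neighborhood element-by-element with the mask and sum everything up.
manual = (patch[0, :, 3:6, 3:6] * mask).sum()

# The same position in the F.conv2d output (no padding, so output indices are offset by 1)
out = F.conv2d(patch, weight)       # output shape (1, 1, 6, 6)
print(manual.item(), out[0, 0, 3, 3].item())   # the two values agree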
Looking at the above figure, we see that we cannot place the center of the kernel mask anywhere in the top or bottom row, or in the leftmost or rightmost column; doing so would place part of the mask outside the input array. However, if we pad the input array with an additional row at the top and bottom, and an additional column on the left and right, with all added elements set to zero, then we can place the convolution mask at every position of the input array, including the top and bottom rows and the leftmost and rightmost columns. Adding such extra rows/columns is what is meant by padding in convolution. Without padding, the result of convolution for the above example would be a 6×6 feature map. With padding, the result would be 8×8, the same as the input array size. Although the mask used in this example is square, the mask height (H) need not equal the mask width (W). It is easy to see that we must add (H-1)/2 rows at the top and bottom of the input, and (W-1)/2 columns on each side, to keep the feature map size identical to the input array size. [These padding amounts assume H and W are odd integers, which is common.]
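The padding arithmetic above can be verified with a small sketch. For an 8×8 input and a 3×3 mask, (H-1)/2 = (W-1)/2 = 1, so a padding of 1 keeps the output the same size as the input, while no padding produces a 6×6 feature map:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)                                        # 3-channel 8x8 input
no_pad = nn.Conv2d(3, 1, kernel_size=3, padding=0, bias=False)
with_pad = nn.Conv2d(3, 1, kernel_size=3, padding=1, bias=False)   # padding = (3-1)/2 = 1

print(no_pad(x).shape)     # torch.Size([1, 1, 6, 6])
print(with_pad(x).shape)   # torch.Size([1, 1, 8, 8]), same spatial size as the input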
We generally move the kernel mask over the input array one pixel at a time. However, we can skip a pixel or two in between when moving the mask. The stride parameter determines how the mask is moved during convolution. A stride of 1 means moving to the next pixel with no skipping of pixels/cells, and a stride of 2 means moving by two pixels. A stride value other than the default value of 1 means the convolution response is calculated at fewer positions, so the resulting feature map will be smaller than the input even with padding. Thus, setting a suitable stride value allows us to down sample the convolution result. The figure below shows the positions where a 3×3 mask would be placed with the default stride value of 1 (blue cells) and with a stride value of 2 (cells marked with X), when there is no padding. Clearly, a stride of 2 down samples the input to produce a smaller feature map.
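The effect of stride on the feature map size can be checked with a similar sketch; the input and mask sizes match the 8×8 example above and are otherwise arbitrary:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)
stride1 = nn.Conv2d(3, 1, kernel_size=3, stride=1, bias=False)
stride2 = nn.Conv2d(3, 1, kernel_size=3, stride=2, bias=False)

print(stride1(x).shape)   # torch.Size([1, 1, 6, 6])
print(stride2(x).shape)   # torch.Size([1, 1, 3, 3]), down sampled by the larger stride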
Another convolution layer parameter is dilation. This parameter is used to enlarge the mask so that convolution is applied over a larger area. This is different from starting with a larger kernel mask. The figure below illustrates how a 3×3 mask would be enlarged for a dilation of 2. The original 3×3 mask is considered to have a dilation of 1, which means the mask elements are adjacent to each other. The mask on the right is the dilated version of the mask on the left. As you can see, dilating the convolution mask skips a certain number of input array elements while computing the convolution response. The main use of dilation is to produce better quality output in semantic segmentation.
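A small sketch, again with an arbitrary 8×8 input, shows that a 3×3 mask with a dilation of 2 behaves like a 5×5 mask as far as the output size is concerned:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)
# dilation=2 spreads the 3x3 mask over a 5x5 area (effective size = dilation*(k-1) + 1)
dilated = nn.Conv2d(3, 1, kernel_size=3, dilation=2, bias=False)

print(dilated(x).shape)   # torch.Size([1, 1, 4, 4]), as if a 5x5 mask had been applied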
A typical convolutional neural network (CNN) is used for classification. In such a network, you will find a large number of convolution layers. Since convolution is a linear operation, we need to insert some nonlinearity between two consecutive convolution layers. Thus, the output of a convolution layer is rectified by passing it through a ReLU (Rectified Linear Unit). The rectified output of each convolution layer is followed by a pooling layer, whose task is to down sample the convolution result. This is done by replacing a block of convolution layer cells with a single cell. For example, the convolution layer output can be divided into adjacent 2×2 blocks, each replaced by its block average. This is called average pooling. When each 2×2 block is replaced by the maximum value of the block, the resulting pooling is known as max pooling. Irrespective of the type of pooling used, the basic advantage of pooling is the resulting down sampling, which speeds up the computation and reduces the variance in the data moving forward.
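The sketch below illustrates rectification and the two pooling variants on a small hand-made 4×4 array; the values are arbitrary and chosen only to show how each 2×2 block is collapsed to a single cell:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[[[ 1., -2.,  3.,  0.],
                    [ 4.,  5., -6.,  2.],
                    [-1.,  0.,  7.,  8.],
                    [ 2.,  3., -4., -5.]]]])

rectified = F.relu(x)                   # negative values are clipped to zero
print(nn.MaxPool2d(2, 2)(rectified))    # each 2x2 block replaced by its maximum
print(nn.AvgPool2d(2, 2)(rectified))    # each 2x2 block replaced by its average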
With the above introduction to the different operations involved in a single convolution layer, let's try to put together a demo showing the effect of the different parameters on the convolution operation. To do the demo, let's first get an image to work with.
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

pil_image = Image.open('data/panda.jpg')
plt.imshow(pil_image)
The image size is 3×235×180. Next, we import the necessary libraries. Since PyTorch accepts tensors, the image read earlier will be converted to a tensor. Furthermore, the input to a convolutional layer should be of size batch_size × number_of_input_channels × input_height × input_width. Since our batch size is going to be one, we need to add a batch dimension to our demo image tensor as well. We are going to use four convolution filters. These will not be learned; instead, their weights are set from a NumPy array. The code for this part, including the visualization of the filters, is shown below.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torchvision import datasets, models, transforms

# Transform PIL image to a tensor
transform = transforms.ToTensor()
img = transform(pil_image)
img = img.unsqueeze(0)

# Define filters
filter_array = np.array([[-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1]])
filter_1 = filter_array
filter_2 = -filter_1
filter_3 = filter_1.T
filter_4 = -filter_3
filters = np.array([filter_1, filter_2, filter_3, filter_4])

# Visualize filters
fig = plt.figure(figsize=(12, 6))
fig.subplots_adjust(left=0, right=0.5, bottom=0.8, top=1, hspace=0.05, wspace=0.05)
for i in range(4):
    ax = fig.add_subplot(1, 4, i+1, xticks=[], yticks=[])
    ax.imshow(filters[i], cmap='hot')
    ax.set_title('Filter %s' % str(i+1))
Now, we will set up a two-layer convolution network to perform convolution. The code for this is given below.
class DemoNet(nn.Module):
    def __init__(self, wt1, wt2):
        super(DemoNet, self).__init__()
        # First convolution layer: 3 input channels, 4 output channels, 5x5 kernel
        self.conv1 = nn.Conv2d(3, 4, kernel_size=5, stride=1, dilation=1, bias=False)
        # Define a pooling layer
        self.pool1 = nn.MaxPool2d(2, 2)
        # Define another conv layer followed by pooling
        self.conv2 = nn.Conv2d(4, 4, kernel_size=5, bias=False)
        self.pool2 = nn.MaxPool2d(2, 2)
        # Initialize the weights of the convolution layers with the predefined filters
        with torch.no_grad():
            self.conv1.weight = torch.nn.Parameter(wt1)
            self.conv2.weight = torch.nn.Parameter(wt2)

    def forward(self, x):
        # Output of the first convolution layer, pre- and post-activation
        conv1_x = self.conv1(x)
        activated1_x = F.relu(conv1_x)
        # Apply pooling
        pooled1_x = self.pool1(activated1_x)
        # Second convolution layer, pre- and post-activation, followed by pooling
        conv2_x = self.conv2(pooled1_x)
        activated2_x = F.relu(conv2_x)
        pooled2_x = self.pool2(activated2_x)
        # Return the output of every layer
        return conv1_x, activated1_x, pooled1_x, conv2_x, activated2_x, pooled2_x
Next, we define a function that will be used to visualize the output of the convolution layer filters.
def visualize_layer(layer, n_filters=4):
    fig = plt.figure(figsize=(12, 12))
    for i in range(n_filters):
        ax = fig.add_subplot(1, n_filters, i+1)
        ax.imshow(np.squeeze(layer[0, i].data.numpy()))
        ax.set_title('Filter %s' % str(i+1))
Now, we are ready to instantiate our network, feed the input image, and compute the output at different layers. We will use the same filter set for both convolution layers. Further, we will use identical kernel weights for every input channel by repeating the weights.
wt1 = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor).repeat(1, 3, 1, 1)
wt2 = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor).repeat(1, 4, 1, 1)
model = DemoNet(wt1, wt2)

# Compute output
conv1_x, activated1_layer, pooled1_layer, conv2_x, activated2_layer, pooled2_layer = model.forward(img)
Let's now visualize the output of the first convolution layer. The first row below shows the outputs of the four filters before rectification. The second row of four images shows the output of the first convolution layer after rectification.
visualize_layer(conv1_x)
visualize_layer(activated1_layer)
We should remember that the convolution output (images in the top row) has both positive and negative values, while the rectified output (images in the bottom row) has only positive values. This is the reason the two rows of images look so different. Next, we visualize the second convolution layer in a similar manner.
visualize_layer(conv2_x)
visualize_layer(activated2_layer)
We see that the second layer output appears to highlight some image features, such as eyes, as short linear segments. The complexity of such features increases with additional convolution layers. This is why we need multiple convolution layers for better accuracy. Although all images are displayed at the same size, the tick marks on the axes indicate that the images at the output of the second layer filters are half the input image size because of pooling. In this case, we also notice much more variation in the rectified output.
To see how changing the stride value from 1 to 2 changes the output, we set the stride to 2 for both layers and run the network again. The first row of images below shows the second layer output before rectification and the second row after rectification. With a stride of 2, the output of the second layer is heavily down sampled and we get a coarser representation of the features.
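For reference, one way to make this change is to redefine the two convolution layers of DemoNet with a stride of 2, leaving everything else unchanged; this is a sketch of the modified layer definitions, not code shown earlier:

import torch.nn as nn

# The two convolution layers of DemoNet redefined with a stride of 2
conv1 = nn.Conv2d(3, 4, kernel_size=5, stride=2, dilation=1, bias=False)
conv2 = nn.Conv2d(4, 4, kernel_size=5, stride=2, bias=False)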
Now, let's see the effect of dilation. With a dilation value of 3, the result at the first layer before and after rectification is shown below. In this case, image features appear more prominent compared to the output without dilation.
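Again as a sketch of the change (not code shown earlier), the corresponding modification is to enlarge the first layer's mask via the dilation parameter:

import torch.nn as nn

# The first convolution layer of DemoNet redefined with dilation=3, rest unchanged
conv1 = nn.Conv2d(3, 4, kernel_size=5, stride=1, dilation=3, bias=False)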
A 1×1 convolution is often confusing because its utility is not obvious. Applied to a single-channel image, a 1×1 convolution merely scales the pixel values by the single convolution weight; thus, it is unclear what benefit such a convolution could offer. To understand the benefit, let's consider m input channels over which the 1×1 convolution is applied. In this case, the 1×1 convolution operation can be expressed by the following equation, where a_k is the scaling factor, or weight, assigned to the k-th input channel input_k:

output(i, j) = a_1 * input_1(i, j) + a_2 * input_2(i, j) + … + a_m * input_m(i, j)
As this equation indicates, 1×1 convolution performs a weighted aggregation of the input channel values along the depth axis; thus it is often called the depth convolution. This is also illustrated in the figure below.
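The equivalence between a 1×1 convolution and a weighted sum of the channels is easy to verify; in the minimal sketch below, the channel count and the weights are arbitrary:

import torch
import torch.nn as nn

m = 4
x = torch.randn(1, m, 6, 6)                       # m input channels
a = torch.randn(m)                                # one weight a_k per channel

conv1x1 = nn.Conv2d(m, 1, kernel_size=1, bias=False)
conv1x1.weight = nn.Parameter(a.view(1, m, 1, 1))

# Weighted aggregation along the depth axis, as in the equation above
weighted_sum = (a.view(1, m, 1, 1) * x).sum(dim=1, keepdim=True)
print(torch.allclose(conv1x1(x), weighted_sum))   # True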
The main use of 1×1 convolution is to reduce computation through dimensionality reduction, by reshaping the input before filtering. Suppose at some intermediate stage in your convolution network, you have 64 filtered images, or feature maps, of size 28×28 pixels. You want to apply 16 different convolution masks of size 3×3 to this 28×28×64 input. This requires 28*28*16*3*3*64 (7,225,344) operations. Instead of directly applying the 16 3×3 masks to the 64 incoming channels, we first reshape the incoming feature maps to 28×28×4 via four 1×1 convolution filters. This requires 28*28*4*1*1*64 (200,704) operations. Next, applying the 16 3×3 filters to the reshaped input requires 28*28*4*3*3*16 (451,584) operations. Adding these two sets of operations, we can see that reshaping via 1×1 convolution requires about 90% fewer operations.
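The arrangement described above can be sketched as follows; the layer names are illustrative, and a padding of 1 is used so that the feature maps stay 28×28 as assumed in the operation counts:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)                   # 64 feature maps of size 28x28

# Direct approach: 16 filters of size 3x3 applied to all 64 channels
direct = nn.Conv2d(64, 16, kernel_size=3, padding=1, bias=False)

# Bottleneck approach: reduce 64 channels to 4 with 1x1 convolution, then apply the 3x3 filters
reduce_1x1 = nn.Conv2d(64, 4, kernel_size=1, bias=False)
conv_3x3 = nn.Conv2d(4, 16, kernel_size=3, padding=1, bias=False)

print(direct(x).shape)                  # torch.Size([1, 16, 28, 28])
print(conv_3x3(reduce_1x1(x)).shape)    # torch.Size([1, 16, 28, 28]), with far fewer multiplications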
Let's now perform 1×1 convolution on the output of our demo network. To do this, we add another convolution layer to our network and make the necessary changes to the network definition. Instead of giving equal weight to all four channels, I am simply using [1.0, 0.5, 0.25, -1.0] as the weight values, for no particular reason. The result of the 1×1 convolution is the feature map shown below, before and after rectification. Thus, 1×1 convolution allows us to reduce the dimensionality (the number of channels) while retaining the features of the input images.
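Rather than reproducing the full modified network definition, here is a sketch of the extra layer applied as a standalone step to the pooled output of the second convolution layer of the demo above; the layer and variable names are illustrative, while the weight values [1.0, 0.5, 0.25, -1.0] are the ones mentioned in the text:

# A 1x1 convolution collapsing the 4 feature maps into one
conv_1x1 = nn.Conv2d(4, 1, kernel_size=1, bias=False)
with torch.no_grad():
    conv_1x1.weight = nn.Parameter(torch.tensor([1.0, 0.5, 0.25, -1.0]).view(1, 4, 1, 1))

out_1x1 = conv_1x1(pooled2_layer)
rectified_1x1 = F.relu(out_1x1)
visualize_layer(out_1x1, n_filters=1)
visualize_layer(rectified_1x1, n_filters=1)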
The use of the term deconvolution in deep learning is different from its meaning in signal and image processing. While convolution without padding results in a smaller sized output, deconvolution increases the output size. With stride values greater than 1, deconvolution can be used as a way of up sampling the data stream. This appears to be its main usage in deep learning.
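As a quick illustration of this up-sampling behavior, the sketch below applies a transposed convolution with a stride of 2 to an arbitrary single-channel input and doubles its spatial size:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
up = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
print(up(x).shape)   # torch.Size([1, 1, 16, 16]), spatial size doubled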
Both the convolution and deconvolution operations in deep learning are implemented as matrix multiplication operations, and deconvolution is actually transposed convolution. [The term transposed convolution is gaining usage in the deep learning literature to minimize confusion with the actual deconvolution operation.] You can visit my earlier post on this topic where I explain how convolution and deconvolution operations are carried out as matrix multiplications. With transposed convolution and learned weights, it is possible to recover the original input. Here I will just show the result of the deconvolution operation using a single-channel image that is first put through convolution and then through deconvolution. By selecting a 5×5 kernel of all zeros except for a center element of value -1, we will see that the sequence of convolution and deconvolution recovers our original image as shown below. The first image below is the input image, the center image is the output of the convolution, and the last image is the result of deconvolution.
# A 5x5 kernel that is all zeros except for -1 at the center
ker = torch.zeros(5, 5)
ker[2, 2] = -1.0
ker = ker.unsqueeze(0).unsqueeze(0)      # shape (out_channels, in_channels, 5, 5)

# Convolve a single-channel image (one channel of the demo image) with the kernel
con = nn.Conv2d(1, 1, kernel_size=5, stride=1, bias=False)
con.weight = torch.nn.Parameter(ker)
con_out = con(img[0, 0].unsqueeze(0).unsqueeze(0))

# Apply transposed convolution (deconvolution) with the same kernel
decon = nn.ConvTranspose2d(1, 1, kernel_size=5, stride=1, bias=False)
decon.weight = torch.nn.Parameter(ker)
decon_out = decon(con_out)
As we can see from the above, there are various parameter choices available in the convolution layer that can be used to control up or down sampling of the data as it moves through the numerous layers of a deep convolutional neural network. Before closing this post, I want to point out that the actual operation in the convolution layer is not really convolution but cross-correlation. However, the term convolution has come to be accepted and used because the convolution masks are not specified beforehand, as we did in this example, but are instead learned. Since the only difference between convolution and correlation is whether the kernel mask is flipped before being applied, one can argue that the masks used are flipped versions of the actual learned masks.