Forward and Backward Propagation Using the Convolution Operation
For the forward pass, we move through the CNN layer by layer and, at the end, obtain the loss using the loss function. When we work the loss backwards, layer by layer, each layer receives the gradient of the loss with respect to its output, ∂L/∂z, from the layer above it. In order for the loss to be propagated to the other gates, we need to find ∂L/∂x and ∂L/∂y.
Now, let's assume the function f is a convolution between an input X and a filter F. The basic difference between convolution and correlation is that convolution rotates the filter by 180 degrees before sliding it over the input. Input X is a 3x3 matrix and filter F is a 2x2 matrix, as shown below:
Convolution between input X and filter F gives us an output O. This can be represented as:
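Since the original figure is not reproduced here, a compact way to write this (for the 3x3 input and 2x2 filter above, with the filter flipped as convolution requires) is:

$$O(i, j) = \sum_{m=0}^{1} \sum_{n=0}^{1} X(i+m,\; j+n)\, F(1-m,\; 1-n), \qquad i, j \in \{0, 1\}$$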
To derive the equations for the gradients of the filter values and the input matrix values, we will treat the convolution operation as a correlation operation, just for simplicity.
Therefore, the convolution operation can be written out and visualized as in the figure below:
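As the original figures are not reproduced here, the following is a minimal NumPy sketch of the forward pass under the simplification above (convolution implemented as a 'valid' correlation); the concrete values of X and F are illustrative assumptions, since the original matrices are not shown:

```python
import numpy as np

# Illustrative 3x3 input and 2x2 filter (values assumed, not from the original text).
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
F = np.array([[1., 0.],
              [0., -1.]])

def correlate2d_valid(x, f):
    """'Valid' cross-correlation: slide f over x without flipping it."""
    h = x.shape[0] - f.shape[0] + 1
    w = x.shape[1] - f.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + f.shape[0], j:j + f.shape[1]] * f)
    return out

# Forward pass: O[i, j] = sum over the (i, j) patch of X multiplied elementwise by F.
O = correlate2d_valid(X, F)   # 2x2 output
print(O)
```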
We can find the gradients of the error E with respect to the input matrix X as below:
Now, the above computation can be obtained by a different type of convolution operation, known as full convolution. In order to obtain the gradients with respect to the input matrix, we rotate the filter by 180 degrees and compute the full convolution of the rotated filter with the gradient of the error with respect to the output, as represented in the image below.
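Continuing the sketch above, and assuming an illustrative 2x2 upstream gradient ∂E/∂O, the filter gradient is a 'valid' correlation of X with ∂E/∂O, while the input gradient is the full convolution of ∂E/∂O with the 180-degree-rotated filter:

```python
# Upstream gradient flowing back from the next layer (values illustrative).
dE_dO = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

# Gradient w.r.t. the filter F: correlate the input with the output gradient.
dE_dF = correlate2d_valid(X, dE_dO)                     # 2x2, same shape as F

# Gradient w.r.t. the input X: full convolution with the 180-degree-rotated filter.
F_rot = np.rot90(F, 2)                                  # rotate the filter by 180 degrees
pad = F.shape[0] - 1
dE_dX = correlate2d_valid(np.pad(dE_dO, pad), F_rot)    # 3x3, same shape as X
```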
Dropout
Dropout is a technique patented by Google to reduce overfitting in deep networks. It randomly removes neurons in the hidden layers along with their connections, which forces the network to distribute its learning across the whole network. A dropout probability of 0.5 is commonly reported to work well. Dropout is computationally cheap and remarkably effective, but it is not applied during inference.
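A minimal sketch of how this can be applied to a layer's activations, using the 'inverted dropout' formulation (the scaling by 1/(1 − p) is an implementation detail of that variant, not something stated above):

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """Randomly zero out activations during training; do nothing at inference."""
    if not training:
        return activations                          # dropout is disabled at inference
    mask = np.random.rand(*activations.shape) > p_drop
    # Scale the surviving activations so their expected value stays the same.
    return activations * mask / (1.0 - p_drop)

h = np.array([0.5, 1.2, -0.3, 0.8])                 # example hidden-layer activations
print(dropout_forward(h, p_drop=0.5, training=True))
```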
Vanishing Gradient Problem
As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.
Certain activation functions, like the sigmoid function, squash a large input space into a small output range between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output, and hence the derivative is small.
By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial) to compute the derivatives of the initial layers.
However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the initial layers. A small gradient means that the weights and biases of the initial layers will not be updated effectively during training. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to inaccuracy of the whole network.
The usual solutions to this problem are to add skip connections or to use different activation functions.
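A small numeric illustration of the argument above (the sigmoid's derivative never exceeds 0.25, so a chain of such factors shrinks exponentially; the weight terms are ignored here for simplicity):

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

grad = 1.0
for layer in range(20):                  # 20 hidden layers, pre-activations near zero
    grad *= sigmoid_derivative(0.0)      # 0.25, the sigmoid derivative's maximum
print(grad)                              # ~9e-13: the gradient has effectively vanished
```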
Data Augmentation
Data augmentation is the process where we increase the quantity of data we have, or augment it, by taking existing data and applying transformations to it. Some common data augmentation techniques include (see the sketch after this list):
1. Shear
2. Rotate
3. Crop
4. Blur
5. Translate
6. Zoom
7. Flip
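A hedged sketch of how these transformations might be wired together with torchvision (assuming it is installed; the parameter values are illustrative, not prescribed by the text):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=15,               # rotate
                            translate=(0.1, 0.1),     # translate
                            scale=(0.8, 1.2),         # zoom
                            shear=10),                # shear
    transforms.RandomCrop(224, padding=8),            # crop
    transforms.GaussianBlur(kernel_size=3),           # blur
    transforms.RandomHorizontalFlip(p=0.5),           # flip
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # pil_image: a PIL.Image loaded elsewhere
```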
ImageNet
The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories with a typical category, such as "balloon" or "strawberry", consisting of several hundred images.
AlexNet
AlexNet is a convolutional neural network and an incredibly powerful model capable of achieving high accuracy on very challenging datasets.
However, removing any of the convolutional layers drastically degrades AlexNet's performance. AlexNet is a leading architecture for object-recognition tasks and may have huge applications in the computer vision sector of artificial intelligence.
The architecture consists of eight layers: five convolutional layers and three fully-connected layers (a rough sketch of this layer stack follows the list below). But this isn't what makes AlexNet special; these are some of the features it used that were new approaches to convolutional neural networks:
- ReLU Nonlinearity. AlexNet uses Rectified Linear Units (ReLU) instead of the tanh function, which was standard at the time. ReLU’s advantage is in training time; a CNN using ReLU was able to reach a 25% error on the CIFAR-10 dataset six times faster than a CNN using tanh.
- Multiple GPUs. Back in the day, GPUs were still rolling around with 3 gigabytes of memory (nowadays those kinds of memory would be rookie numbers). This was especially bad because the training set had 1.2 million images. AlexNet allows for multi-GPU training by putting half of the model’s neurons on one GPU and the other half on another GPU. Not only does this mean that a bigger model can be trained, but it also cuts down on the training time.
- Overlapping Pooling. CNNs traditionally “pool” outputs of neighboring groups of neurons with no overlapping. However, when the authors introduced overlap, they saw a reduction in error by about 0.5% and found that models with overlapping pooling generally find it harder to overfit.
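A rough PyTorch sketch of the eight-layer stack described above (five convolutional and three fully-connected layers, with ReLU activations and overlapping 3x3/stride-2 max pooling); the channel counts follow the commonly cited configuration and assume a 227x227 RGB input, so treat the exact numbers as approximate:

```python
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),              # overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),                              # 1000 ImageNet classes
)
```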
ZFNet
ZFNet is a modified version of AlexNet that gives better accuracy. One major difference between the two approaches is that ZFNet used 7x7 filters in its first convolutional layer, whereas AlexNet used 11x11 filters. The intuition behind this is that the bigger filters lose a lot of pixel information, which can be retained by using smaller filter sizes in the earlier conv layers. The number of filters increases as we go deeper. This network also used ReLUs for activation and was trained using batch stochastic gradient descent.
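For illustration only, here is how the two first convolutional layers could be written side by side in PyTorch (the stride values follow the commonly cited configurations and are assumptions here):

```python
import torch.nn as nn

alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)   # AlexNet: large 11x11 filters
zfnet_conv1   = nn.Conv2d(3, 96, kernel_size=7, stride=2)    # ZFNet: smaller 7x7 filters, smaller stride
```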
ResNet
There are 152 layers in Microsoft's ResNet. The authors showed empirically that with residual connections you can keep adding layers and the error rate keeps decreasing, in contrast to "plain nets", where adding a few more layers resulted in higher training and test errors. Training took two to three weeks on an 8-GPU machine. One intuitive reason why residual blocks improve classification is the direct step from one layer to the next: together, these skip steps form a gradient highway, so the computed gradients can directly affect the weights of the first layers and updates have more effect.
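A minimal sketch of a residual block in PyTorch, illustrating the skip step described above; the channel count is illustrative and the shortcut assumes the input and output shapes already match:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: the block's input is added back in
```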