The number of available open source libraries making Deep learning easier to use is spreading fast as hype continuous to build. However, without understanding the background principles, it just feels like poking around a black box.

In this post (or several, most likely) will try to give an introduction to Convolution Neural Networks (CNNs). Note that, for the sake of brevity, I assume that you already know the basics about Neural Networks. If not, I would suggest you go through the following introduction.

This post is part of a tutorial series:

- Getting through Deep Learning – CNNs (part 1)
- Getting through Deep Learning – TensorFlow intro (part 2)
- Getting through Deep Learning – TensorFlow intro (part 3)

Disclaimer: this post uses images and formulas from distinct sources. I would suggest to have a look over the complete list of sources at the end of the post, as usual.

**Inspiration**

In 1958 and 1959 David H. Hubel and Torsten Wiesel performed a series of experiments, whereby they concluded that many neurons in the visual cortex focus on a limited region in the vision field.

This insight provided the notion of a *local receptive field* – a narrow sub-region of what is available in the whole visual field which serves as input – thus giving rise for a different architecture than the previously fully connected neural network architecture.

**Basics – Convolution Layer**

The first thing to realize is that Convolution networks are simply the application of “mini-neural networks” to segments of input space. In the case of images, that results in that neurons in the first convolutional layer are not connected to every single pixel in their Receiptive Field (RF). The following image (source) shows an illustration of how a a a convolution layer is built using an image from the famous MNIST dataset – whereby the goal consists in identifyying the digits from handwritten numbers pictures.

OK, let’s break this down. In the MNIST dataset each, each image is 28 by 28 pixels – respectively **height** and **width.** For a fully connected neural network, the input space for the first layer of 28 x 28 = 728px if we were only to include height and width.

However, in a so-called convolution, you would instead apply a **mini-neural network** to just a single portion/segment of the image – let’s say a 3×3 px (width and height) rectangle. This 3×3 receptive field is also often refered to as a **filter** or **kernel** in Deep Learning (DL) lingo.

Next you would slide that kernel over the image, let us say 1px to the left in each step until we reach the far right end, and then 1px down until we reach the lower bound. The following animated illustration – credits go to Vincent Dumoulin and Francesco Visin with awesome CNN architectual overview available here – shows the building of a convolution layer using this 3×3 Receptive Field/Filter/Kernel step-by-step building what is called a **Feature Map –** a layer full of neurons using the same filter.

Thus a Feature Map can also be thought of a multitude of these mini-Neural Networks – whereby each filter – the dark blue rectangle – has its own weights, bias term, and activation function, and produces one output (darker green square).

The following illustration shows a detailed view of the progressive application of these mini neural network across filters of the initial input space -again credits go to Vincent Dumoulin and Francesco Visin – and producing a Feature map.

Note that the illustrations used an iterative sliding of a kernel of 3×3 in a step of 1 px. However this is not a requirement, as one can use for example a step size of 2, reducing the dimention of output space. By now you should be realizing that indeed this step size – called the **stride** in DL lingo – is yet another hyper parameter, just as the filter size.

OK, we are almost completed with the basic concepts related to a CNNs: we take a local receptive field – which we call **filter/kernel – **slide it through an image in given step size – which we call **stride** – and produce a set of mini neural networks – which we call a **feature map**.

The missing detail that builds uppon the previous knowledge is the fact that we can simultaneously use different filters to the same input space, thus producing several feature maps as a result. The only restriction is that feature maps in a given layer have the same dimention. The following illustration from the excellent book “Hands-On Machine Learning with Scikit-Learn and TensorFlow” gives more insight to the intuition of stacking multiple feature maps.

Until so far we have represented a convolution layer in a thin 2-D output space (1 single feature map). However, if one produces several feature maps in one layer, the result is a 3-D volume. And this is when the notion of convolution comes into place: the output of this convolution will have a new height, width and depth.

It is thus frequent to see convolutional networks illustrated as “tall and skinny” boxes (aka with high values of height and low of depth), and progressively getting “shorter and fatter” (smaller height and bigger depth).

**Basics – Pooling Layer**

Before we go into more detail about different architectures a Convolution Layer may have (in the next post of this series), it is important to cover some ground aspects. To start, one is that you are not restricted to use **Convolution Layer **when creating a new hidden layer. In fact there are two more main types of layers: **Pooling Layer** (which we will cover in just a bit) and **Fully-Connected layer** (exactly as a regular Neural Network).

Also note that similarly to regular Neural Networks, where you can have as many hidden layers as you want, the same goes the CNN. That is, you can build convolutions that serve as input space for a next convolution layer or a pooling layer for example, and so on.

Finally, and again similarly to Neural Networks, the last fully-connected layer will always contain as many neurons as the number of classes to be predicted.

**Typical Activation functions**

Again to be perfectly clear, what we do in each step of the sliding filter in a convolution is to a apply the dot product of a given set of weights (which we are trying to tune with training), plus a given bias. This is effectively a linear function represented in the following form:

where the weights is a vector represented by W, Xi would be the pixel values in our example inside a given filter, and b the bias. So what usually happens (except from pooling) is that we pass the output of this function to a neuron, which will then apply given activation function. The typical activation functions that are usually implemented are:

Softmax – is a generalized form of the logistic function/sigmoid function, which turns the outputs into probabilities (thus comprising in interval between 0 and 1).

ReLU – Rectified Linear Unit functions have a smoothing effect on the output, making results always bigger than zero.

Tanh – hyperbolic tangent function, which enables activation functions to range from -1 to +1.

**Typical Cost Functions**

As you know, cost functions are what makes the whole training of models possible. Here are three of the main, where cross entropy is probably the most frequently used.

Mean Squared Error – used to train linear regression models

Hinge Loss – used to train Support Vector Machines (SVM) models

Cross entropy – used to train logistic regression models

**Regularization**

Beyond activation functions applied to convolutions, there are also some other useful tricks applied to build a Deep Neural Network (DNN), which address the well known problem of over-fitting.

Pooling Layer – bottom line, pooling layers are used to reduce dimension. They sample from input space also using a filter/kernel with a given output dimension, and simply applying a reduce function. Usually a **max**() – usually called **max pooling** – or **mean**() – **average pooling** – functions.

For example, if one of kernels was a 2×2 matrix with the values [ [1,2], [3,4]], then max pooling would yield 4 as an output, where average pooling would yield 2.5 .

Dropouts -dropouts goal is exactly the same as regularization (it is after all a regularization technique); that is, it is intended to reduce over-fitting on outer sample. Initially it was used to simply turn off passing a portion of output of neurons at every iteration during training. That is, instead of passing all weights dot product computed against the input layer, we randomly (with a given specifiable probability) consciously decide to not add a given set of weights to the output layer.

In case you are wondering if this is similar trick to bagging that Random Forests does to Decision Trees, the answer would be not quite. The operation of averaging through lots of Decision Trees (wich have high propensity to over-fit data) using sampling with replacement is computationally doable (at today’ standards). However the same does not hold true to train distinct DNNs. So the dropouts technique is a practical method to average internally the outputs among layers of a network.

**Final Notes**

In part 2 I plan to get into more detail about convolution architectures, as well as provide a coding example in order to bring all these concepts home.

However, I didn’t want to terminate this post without any code snippet. So even though not going to go through it, just as a demonstration that with the concepts covered before you can already have some intuition around it, here is a code snippet where a training session with Google’s deep learning library TensorFlow is defined with two convolution layers:

As usual, here are the sources used for this post: