# Getting through Deep Learning – Tensorflow intro (part 3)

This post is part of a tutorial series:

Alright, lets move on to more interesting stuff: linear regression. Since the main focus in TensorFlow, and given the abundancy of online resources on the subject, I’ll just assume you are familiar with Linear Regressions.

As previously mentioned, a linear regression has the following formula:

Where Y is the dependent variable, X is the independent variable, and b0 and b1 being the parameters we want to adjust.

Let us generate random data, and feed that random data into a linear function. Then, as opposed to using the closed-form solution, we use an iterative algorithm to progressively become closer to a minimal cost, in this case using gradient descent to fit a linear regression.

We start by initializing the weights – b0 and b1 – with random variables, which naturally results in a poor fit. However, as one can see through the print statements as the model trains, b0 approaches the target value of 3, and b1 of 5, with the last printed step: [3.0229037, 4.9730182]

The next figure illustrates the progressive fitting of lines of the model, as weights change:

Sources:

# Getting through Deep Learning – Tensorflow intro (part 2)

Yes, I kind of jumped the guns on my initial post on Deep Learning straight into CNNs. For me this learning path works the best, as I dive straight into the fun part, and eventually stumble upon the fact that maybe I’m not that good of a swimmer, and it might be good to practice a bit before going out in deep waters. This post attempts to be exactly that: going back to the basics.

This post is part of a tutorial series:

TensorFlow is a great starting point for Deep Learning/ Machine Learning, as it provides a very concise yet extremely powerful API. It is an open-source project created by Google initially with numerical computation tasks in mind, and used for Machine Learning/Deep Learning.

TensorFlow provides APIs for both Python and C++, but it’s backend is written in C/C++, allowing it to achieve much greater performance milestones. Moreover, it supports CPU, GPU, as well as distributed computing in a cluster.

The first thing to realize is that TensorFlow uses the concept of a session. A session is nothing more than a series of operations to manipulate tensors, organized in a structure of a data flow graph. This graph building activity pretty much works like Lego building, by matching nodes and edges. Nodes represent mathematical operations, and edges multi-dimensional arrays – aka: Tensors. As the name hints, a tensor is the central data structure in TensorFlow, and is described by its shape. For example, one would characterize a 2 row by 3 columns matrix as a tensor with shape of [2,3].

Important also to note is that the graph is lazy loaded, meaning that computation will only be triggered by an explicit run order for that session graph. OK, enough talking, let us get into coding, by exemplifying how a basic graph Session is built:

In the previous example, variables “a” and “b” are the nodes in the graph, and the summation is the operation connecting both of them.

A TensorFlow program is typically split into two parts: construction phase – where the graph is built – and a second one called execution phase, when actually resources (CPU and/or GPU, RAM and disk) are allocated until the session is closed.

Typically machine learning applications strive to iteratively update model weights. So, of course, one can also specify tensors of variable type, and even combine those constants.

By defining the dtype of a node, one can gain/loose precision, and at the same time impact on memory utilization and computation times.

Note lines 35 to 37 now:

```r1 = op1.eval()
r2 = op2.eval()
result = f.eval()```

TensorFlow automatically detects which operations depend on each other. In this case, TensorFlow will know that op1 depends on x and y evaluation, op2 on a, b and c evaluation,  and finally that f depends on both op1 and op2. Thus internally the lazy evaluation is also aligned with the computation graph. So far so good.

However, all nodes values are dropped between graph runs, except for variable values, which are maintained by the session across graph runs. This has the import implication that op1 and op2 evaluation will not be reused upon f graph run – meaning the code will eveluate op1 and op2 twice.

To overcome this limitation, one needs to instruct TensorFlow to run those operations in a single graph:

And yes, that is all for today. I want to blog more frequently, and instead of writing just once every couple of months (and in the meanwhile pilling up a lot of draft posts that never see the light of day), I decided to keep it simple. See you soon 🙂

Sources:

# Getting through Deep Learning – CNNs (part 1)

The number of available open source libraries making Deep learning easier to use is spreading fast as hype continuous to build. However, without understanding the background principles, it just feels like poking around a black box.

In this post (or several, most likely) will try to give an introduction to Convolution Neural Networks (CNNs). Note that, for the sake of brevity, I assume that you already know the basics about Neural Networks. If not, I would suggest you go through the following introduction.

This post is part of a tutorial series:

Disclaimer: this post uses images and formulas from distinct sources. I would suggest to have a look over the complete list of sources at the end of the post, as usual.

Inspiration

In 1958 and 1959 David H. Hubel and Torsten Wiesel performed a series of experiments, whereby they concluded that many neurons in the visual cortex focus on a limited region in the vision field.

This insight provided the notion of a local receptive field – a narrow sub-region of what is available in the whole visual field which serves as input – thus giving rise for a different architecture than the previously fully connected neural network architecture.

Basics – Convolution Layer

The first thing to realize is that Convolution networks are simply the application of “mini-neural networks” to segments of input space. In the case of images, that results in that neurons in the first convolutional layer are not connected to every single pixel in their Receiptive Field (RF).  The following image (source) shows an illustration of how a a a convolution layer is built using an image from the famous MNIST dataset – whereby the goal consists in identifyying the digits from handwritten numbers pictures.

OK, let’s break this down. In the MNIST dataset each, each image is 28 by 28 pixels – respectively height and width. For a fully connected neural network, the input space for the first layer of 28 x 28 = 728px if we were only to include height and width.

However, in a so-called convolution, you would instead apply a mini-neural network to just a single portion/segment of the image – let’s say a 3×3 px (width and height) rectangle. This 3×3 receptive field is also often refered to as a filter or kernel in Deep Learning (DL) lingo.

Next you would slide that kernel over the image, let us say 1px to the left in each step until we reach the far right end, and then 1px down until we reach the lower bound. The following animated illustration – credits go to Vincent Dumoulin and Francesco Visin with awesome CNN architectual overview available here  – shows the building of a convolution layer using this 3×3 Receptive Field/Filter/Kernel step-by-step building what is called a Feature Map – a layer full of neurons using the same filter.

Thus a Feature Map can also be thought of a multitude of these mini-Neural Networks – whereby each filter – the dark blue rectangle – has its own weights, bias term, and activation function, and produces one output (darker green square).

The following illustration shows a detailed view of the progressive application of these mini neural network across filters of the initial input space -again credits go to Vincent Dumoulin and Francesco Visin – and producing a Feature map.

Note that the illustrations used an iterative sliding of a kernel of 3×3 in a step of 1 px. However this is not a requirement, as one can use for example a step size of 2, reducing the dimention of output space. By now you should be realizing that indeed this step size – called the stride in DL lingo – is yet another hyper parameter, just as the filter size.

OK, we are almost completed with the basic concepts related to a CNNs: we take a local receptive field  – which we call filter/kernel – slide it through an image in given step size – which we call stride – and produce a set of mini neural networks – which we call a feature map.

The missing detail that builds uppon the previous knowledge is the fact that we can simultaneously use different filters to the same input space, thus producing several feature maps as a result. The only restriction is that feature maps in a given layer have the same dimention. The following illustration from the excellent book “Hands-On Machine Learning with Scikit-Learn and TensorFlow”  gives more insight to the intuition of stacking multiple feature maps.

Until so far we have represented a convolution layer in a thin 2-D output space (1 single feature map). However, if one produces several feature maps in one layer, the result is a 3-D volume.  And this is when the notion of convolution comes into place: the output of this convolution will have a new height, width and depth.

It is thus frequent to see convolutional networks illustrated as “tall and skinny” boxes (aka with high values of height and low of depth), and progressively getting “shorter and fatter” (smaller height and bigger depth).

Basics – Pooling Layer

Before we go into more detail about different architectures a Convolution Layer may have (in the next post of this series), it is important to cover some ground aspects. To start, one is that you are not restricted to use Convolution Layer when creating a new hidden layer. In fact there are two more main types of layers: Pooling Layer (which we will cover in just a bit) and Fully-Connected layer (exactly as a regular Neural Network).

Also note that similarly to regular Neural Networks, where you can have as many hidden layers as you want, the same goes the CNN. That is, you can build convolutions that serve as input space for a next convolution layer or a pooling layer for example,  and so on.

Finally, and again similarly to Neural Networks, the last fully-connected layer will always contain as many neurons as the number of classes to be predicted.

Typical Activation functions

Again to be perfectly clear, what we do in each step of the sliding filter in a convolution is to a apply the dot product of a given set of weights (which we are trying to tune with training), plus a given bias. This is effectively a linear function represented in the following form:

where the weights is a vector represented by W, Xi would be the pixel values in our example inside a given filter, and b the bias. So what usually happens (except from pooling)  is that we pass the output of this function to a neuron, which will then apply given activation function. The typical activation functions that are usually implemented are:

Softmax – is a generalized form of the logistic function/sigmoid function, which turns the outputs into probabilities (thus comprising in interval between 0 and 1).

ReLU – Rectified Linear Unit functions have a smoothing effect on the output, making results always bigger than zero.

Tanh – hyperbolic tangent function, which enables activation functions to range from -1 to +1.

Typical Cost Functions

As you know, cost functions are what makes the whole training of models possible. Here are three of the main, where cross entropy is probably the most frequently used.

Mean Squared Error –  used to train linear regression models

Hinge Loss – used to train Support Vector Machines (SVM) models

Cross entropy – used to train logistic regression models

Regularization

Beyond activation functions applied to convolutions, there are also some other useful tricks applied to build a Deep Neural Network (DNN), which address the well known problem of over-fitting.

Pooling Layer – bottom line, pooling layers are used to reduce dimension. They sample from input space also using a filter/kernel with a given output dimension, and simply applying a reduce function. Usually a max()  – usually called max pooling – or mean()  – average pooling – functions.

For example, if one of kernels was a 2×2 matrix with the values [ [1,2], [3,4]], then max pooling would yield 4 as an output, where average pooling would yield 2.5 .

Dropouts -dropouts goal is exactly the same as regularization (it is after all a regularization technique); that is, it is intended to reduce over-fitting on outer sample. Initially it was used to simply turn off passing a portion of output of neurons at every iteration during training. That is, instead of passing all weights dot product computed against the input layer, we randomly (with a given specifiable probability) consciously decide to not add a given set of weights to the output layer.

In case you are wondering if this is similar trick to bagging that Random Forests does to Decision Trees, the answer would be not quite. The operation of averaging through lots of Decision Trees (wich have high propensity to over-fit data) using  sampling  with replacement is computationally doable (at today’ standards). However the same does not hold true to train distinct DNNs. So the dropouts technique is a practical method to average internally the outputs among layers of a network.

Final Notes

In part 2 I plan to get into more detail about convolution architectures, as well as provide a coding example in order to bring all these concepts home.

However, I didn’t want to terminate this post without any code snippet. So even though not going to go through it, just as a demonstration that with the concepts covered before you can already have some intuition around it, here is a code snippet where a training session with Google’s deep learning library TensorFlow is defined with two convolution layers:

As usual, here are the sources used for this post: