1. Introduction
2. Forward Propagation
3. Gradient Descent
4. Backpropagation of Errors
5. Checking gradient
6. Training via BFGS
7. Overfitting & Regularization
8. Deep Learning I : Image Recognition (Image uploading)
9. Deep Learning II : Image Recognition (Image classification)
10. Deep Learning III : Theano, TensorFlow, and Keras



An Artificial Neural Network (ANN) is an interconnected group of nodes, similar to the network of neurons in our brain.

Here, we have three layers, and each circular node represents a neuron and a line represents a connection from the output of one neuron to the input of another.

The first layer has input neurons which send data via synapses to the second layer of neurons, and then via more synapses to the third layer of output neurons.


Suppose we want to predict our test score based on how many hours we sleep and how many hours we study the night before.

In other words, we want to predict an output value $y$ (the test score) for a given set of input values $X$ (hours of sleep and hours of study).

X (sleep, study)    y (test score)
(3, 5)              75
(5, 1)              82
(10, 2)             93
(8, 3)              ?


In our machine learning approach, we’ll use Python to store our data in 2-dimensional numpy arrays.
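As a minimal sketch, the three known samples from the table could be stored like this (the variable names `X` and `y` follow the text; the `(8, 3)` row is left out because its score is unknown):

```python
import numpy as np

# Each row of X is one sample: (hours of sleep, hours of study)
X = np.array([[3, 5], [5, 1], [10, 2]], dtype=float)

# Each row of y is the test score for the matching row of X
y = np.array([[75], [82], [93]], dtype=float)

print(X.shape)  # 3 samples, 2 features
print(y.shape)  # 3 samples, 1 target
```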


We’ll use the data to train a model to predict how we will do on our next test.

This is a supervised regression problem.

It’s supervised because our examples have outputs ($y$).

It’s a regression because we’re predicting the test score, which is a continuous output.

If we were predicting the grade (A, B, etc.), however, this would be a classification problem rather than a regression problem.


We may want to scale our data so that all input and output values lie in [0,1].
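One possible way to do the scaling, sketched below, is to divide each input column by its own maximum and to divide the scores by 100. Both choices are assumptions made for illustration (in particular, that 100 is the highest possible test score); the text does not prescribe a specific scaling.

```python
import numpy as np

X = np.array([[3, 5], [5, 1], [10, 2]], dtype=float)
y = np.array([[75], [82], [93]], dtype=float)

# Divide each column of X by that column's maximum
# (10 hours of sleep, 5 hours of study in this data set)
X_scaled = X / np.amax(X, axis=0)

# Assumption: test scores are out of 100
y_scaled = y / 100.0

print(X_scaled)
print(y_scaled)
```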


Now we can start building our Neural Network.

We know our network must have 2 inputs ($X$) and 1 output ($y$).

We’ll call our output $\hat{y}$, because it’s an estimate of $y$.

Any layer between our input and output layer is called a hidden layer. Here, we’re going to use just one hidden layer with 3 neurons.

Neurons and synapses

As explained in the earlier section, circles represent neurons and lines represent synapses.

Synapses have a really simple job.

They take a value from their input, multiply it by a specific weight, and output the result. In other words, the synapses store parameters called “weights” which are used to manipulate the data.

Neurons are a little more complicated.

A neuron’s job is to add together the outputs of all its synapses, and then apply an activation function.

Certain activation functions allow neural nets to model complex non-linear patterns.

For our neural network, we’re going to use sigmoid activation functions.

Forward Propagation

Our network has 2 inputs, 3 hidden units, and 1 output.

This time we’ll build our network as a python class.

The __init__() method of the class will take care of instantiating constants and variables.


Each input value in matrix $X$ should be multiplied by a corresponding weight and then added together with all the other results for each neuron.

$z^{(2)}$ is the activity of our second layer and it can be calculated as follows:

$$
z^{(2)} = XW^{(1)}
= \begin{bmatrix} 3 & 5 \\ 5 & 1 \\ 10 & 2 \end{bmatrix}
\begin{bmatrix} W_{11}^{(1)} & W_{12}^{(1)} & W_{13}^{(1)} \\ W_{21}^{(1)} & W_{22}^{(1)} & W_{23}^{(1)} \end{bmatrix}
= \begin{bmatrix}
3W_{11}^{(1)}+5W_{21}^{(1)} & 3W_{12}^{(1)}+5W_{22}^{(1)} & 3W_{13}^{(1)}+5W_{23}^{(1)} \\
5W_{11}^{(1)}+W_{21}^{(1)} & 5W_{12}^{(1)}+W_{22}^{(1)} & 5W_{13}^{(1)}+W_{23}^{(1)} \\
10W_{11}^{(1)}+2W_{21}^{(1)} & 10W_{12}^{(1)}+2W_{22}^{(1)} & 10W_{13}^{(1)}+2W_{23}^{(1)}
\end{bmatrix}
$$

Note that each entry in $z^{(2)}$ is a sum of weighted inputs to one hidden neuron. $z^{(2)}$ is a $3 \times 3$ matrix, with one row for each sample and one column for each hidden unit.

Activation function – sigmoid

Now that we have the activities for our second layer, $z^{(2)} = XW^{(1)}$, we need to apply the activation function.

We’ll independently apply the sigmoid function to each entry in matrix $z^{(2)}$:

$$f(z) = \frac{1}{1 + e^{-z}}$$

By using numpy we’ll apply the activation function element-wise, and return a result of the same dimension as it was given:

Let’s see how sigmoid() takes an input and how it returns the result:

The following calls pass sigmoid() three kinds of arguments: a number (scalar), a 1-D array (vector), and a 2-D array (matrix).
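A minimal sketch of such a function and the three calls is shown here. It is written as a standalone function for brevity; in the class it would be a method. The sample inputs are arbitrary illustration values.

```python
import numpy as np

def sigmoid(z):
    """Apply the sigmoid function element-wise to a scalar or numpy array."""
    return 1.0 / (1.0 + np.exp(-z))

# Scalar in, scalar out
print(sigmoid(1))

# 1-D array in, 1-D array of the same length out
print(sigmoid(np.array([-1, 0, 1])))

# 2-D array in, 2-D array of the same shape out
print(sigmoid(np.random.randn(3, 3)))
```

Because `np.exp` works element-wise, the result always has the same shape as the input.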

Weight matrices: $W^{(1)}$ and $W^{(2)}$

We initialize our weight matrices ($W^{(1)}$ and $W^{(2)}$) in our __init__() method with random numbers.
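The initialization might look like the sketch below. The class name `NeuralNetwork` and the attribute names are hypothetical, and drawing the weights from a standard normal distribution with `np.random.randn` is an assumption; the text only says "random numbers".

```python
import numpy as np

class NeuralNetwork:
    def __init__(self):
        # Layer sizes from the text: 2 inputs, 3 hidden units, 1 output
        self.inputLayerSize = 2
        self.hiddenLayerSize = 3
        self.outputLayerSize = 1

        # W1 connects the input layer to the hidden layer: shape (2, 3)
        self.W1 = np.random.randn(self.inputLayerSize, self.hiddenLayerSize)

        # W2 connects the hidden layer to the output layer: shape (3, 1)
        self.W2 = np.random.randn(self.hiddenLayerSize, self.outputLayerSize)

nn = NeuralNetwork()
print(nn.W1.shape, nn.W2.shape)
```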


Implementing forward propagation

We now have our second formula for forward propagation. Using our activation function ($f$), we can write our second layer activity as $a^{(2)} = f(z^{(2)})$. $a^{(2)}$ will be a matrix of the same size as $z^{(2)}$ ($3 \times 3$).


To finish forward propagation we want to propagate $a^{(2)}$ all the way to the output, $\hat{y}$.

All we have to do now is multiply $a^{(2)}$ by our second layer weights $W^{(2)}$ and apply one more activation function. $W^{(2)}$ will be of size $3 \times 1$, one weight for each synapse:

$$z^{(3)} = a^{(2)}W^{(2)}$$

Multiplying $a^{(2)}$ (a $3 \times 3$ matrix) by $W^{(2)}$ (a $3 \times 1$ matrix) results in a $3 \times 1$ matrix $z^{(3)}$, the activity of our third layer. $z^{(3)}$ has three activity values, one for each sample.

Then, we’ll apply our activation function to $z^{(3)}$, yielding our estimate of the test score, $\hat{y}$:

$$\hat{y} = f(z^{(3)})$$


Now we are ready to implement forward propagation in our forwardPropagation() method, using numpy’s built-in dot() function for matrix multiplication:
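Putting the pieces together, a sketch of the whole class could look like this. Only `forwardPropagation()` and `sigmoid()` are named in the text; the class name, attribute names, and the input scaling are assumptions made so the example runs on its own.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self):
        # 2 inputs, 3 hidden units, 1 output
        self.inputLayerSize = 2
        self.hiddenLayerSize = 3
        self.outputLayerSize = 1
        # Random initial weights (assumed to be standard normal)
        self.W1 = np.random.randn(self.inputLayerSize, self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize, self.outputLayerSize)

    def sigmoid(self, z):
        # Element-wise sigmoid activation
        return 1.0 / (1.0 + np.exp(-z))

    def forwardPropagation(self, X):
        # z2 = X W1 : activity of the hidden layer, shape (samples, 3)
        self.z2 = np.dot(X, self.W1)
        # a2 = f(z2)
        self.a2 = self.sigmoid(self.z2)
        # z3 = a2 W2 : activity of the output layer, shape (samples, 1)
        self.z3 = np.dot(self.a2, self.W2)
        # yHat = f(z3)
        return self.sigmoid(self.z3)

# Inputs scaled column-wise to [0, 1] (an assumed scaling)
X = np.array([[3, 5], [5, 1], [10, 2]], dtype=float)
X = X / np.amax(X, axis=0)

nn = NeuralNetwork()
yHat = nn.forwardPropagation(X)
print(yHat)  # one estimate per sample; values differ from run to run
```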


Getting estimate of test score

Now we have a class capable of estimating our test score given how many hours we sleep and how many hours we study. We pass in our input data ($X$) and get real outputs ($\hat{y}$).

Note that our estimates ($\hat{y}$) look quite terrible when compared with our target ($y$). That’s because we have not yet trained our network; that’s what we’ll work on in the next article.


Continued from Artificial Neural Network (ANN) 2 – Forward Propagation where we built a neural network.

However, it gave us quite terrible predictions of our score on a test based on how many hours we slept and how many hours we studied the night before.

In this article, we’ll focus on the theory of making those predictions better.


Here are the equations for each layer:

$$z^{(2)} = XW^{(1)}$$
$$a^{(2)} = f(z^{(2)})$$
$$z^{(3)} = a^{(2)}W^{(2)}$$
$$\hat{y} = f(z^{(3)})$$



$\hat{y}$ and $y$

Here are the values of $\hat{y}$ and $y$:


Plot looks like this:


We can see our predictions ($\hat{y}$) are pretty inaccurate!


Cost function J

To improve our poor model, we first need to find a way of quantifying exactly how wrong our predictions are.

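One way of doing it is to use a cost function. For a given sample, a cost function tells us how costly our model’s error is.

We’ll use the sum of squared errors to compute an overall cost and we’ll try to minimize it. In fact, training a network means minimizing a cost function:

$$J = \sum_{i=1}^{N} \frac{1}{2}\left(y_i - \hat{y}_i\right)^2$$

where $N$ is the number of training samples. The factor of $\frac{1}{2}$ is a common convention that simplifies the derivative later.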

As we can see from the equation, the cost is a function of two things: our sample data and the weights on our synapses. Since we don’t have much control over our data, we’ll try to minimize our cost by changing the weights.
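As a small sketch, a sum-of-squared-errors cost can be computed in one line with numpy. The form used here, $J = \sum \frac{1}{2}(y - \hat{y})^2$, includes a conventional factor of one half, which is an assumption. The `yHat` values are made-up illustration numbers, not outputs of the network above.

```python
import numpy as np

def cost(y, yHat):
    """Sum of squared errors with the conventional 1/2 factor."""
    return 0.5 * np.sum((y - yHat) ** 2)

# Targets scaled to [0, 1] (assuming scores out of 100)
y = np.array([[0.75], [0.82], [0.93]])

# Hypothetical predictions, for illustration only
yHat = np.array([[0.5], [0.5], [0.5]])

J = cost(y, yHat)
print(J)
```

A perfect prediction gives a cost of zero, and the cost grows with the square of each error.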

We have a collection of 9 weights:

$$
W^{(1)} = \begin{bmatrix} W_{11}^{(1)} & W_{12}^{(1)} & W_{13}^{(1)} \\ W_{21}^{(1)} & W_{22}^{(1)} & W_{23}^{(1)} \end{bmatrix},
\qquad
W^{(2)} = \begin{bmatrix} W_{11}^{(2)} \\ W_{21}^{(2)} \\ W_{31}^{(2)} \end{bmatrix}
$$

and we’re going to make our cost ($J$) as small as possible with an optimal combination of the weights.

Curse of dimensionality

Well, we’re not there yet. Considering the 9 weights, finding the right combination that gives us minimum J may be costly.

Let’s try the case where we tweak only one weight ($W_{11}^{(1)}$) over the range [-5,5] with 1000 tries. The other weights keep the values they were randomly initialized with in the __init__() method:


Here is the code for the 1-weight:
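The original listing is not reproduced here. The sketch below shows one way such a sweep could be written. The class and helper names (`NeuralNetwork`, `costFunction`), the data scaling, and the cost with its 1/2 factor are assumptions, and the measured time will depend on the machine.

```python
import time
import numpy as np

class NeuralNetwork:
    def __init__(self):
        self.W1 = np.random.randn(2, 3)  # input -> hidden weights
        self.W2 = np.random.randn(3, 1)  # hidden -> output weights

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def forwardPropagation(self, X):
        a2 = self.sigmoid(np.dot(X, self.W1))
        return self.sigmoid(np.dot(a2, self.W2))

    def costFunction(self, X, y):
        # Sum of squared errors with the conventional 1/2 factor
        yHat = self.forwardPropagation(X)
        return 0.5 * np.sum((y - yHat) ** 2)

# Scaled training data (assumed scaling: column maxima and scores out of 100)
X = np.array([[3, 5], [5, 1], [10, 2]], dtype=float)
X = X / np.amax(X, axis=0)
y = np.array([[75], [82], [93]], dtype=float) / 100.0

nn = NeuralNetwork()
weightsToTry = np.linspace(-5, 5, 1000)
costs = np.zeros(1000)

start = time.perf_counter()
for i, w in enumerate(weightsToTry):
    nn.W1[0, 0] = w                  # change only W11 of the first layer
    costs[i] = nn.costFunction(X, y)
elapsed = time.perf_counter() - start

# Keep the value that produced the smallest cost
bestWeight = weightsToTry[np.argmin(costs)]
nn.W1[0, 0] = bestWeight
print(f"best W11 = {bestWeight:.3f}, cost = {costs.min():.5f}, time = {elapsed:.3f} s")
```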


It takes about 0.11 seconds to check 1000 different weight values for our neural network. Since we’ve computed the cost for a wide range of values of the weight, we can just pick the one with the smallest cost, let that be our weight, and we’ve trained our network.

Here is the plot for the 1000 weights:


Note that we have 9 weights! But this time, let’s do just 2 of them. To maintain the same precision we now need to check 1000 times 1000, or one million values:

For 1 million evaluations, it took 100 seconds! The real curse of dimensionality kicks in as we continue to add dimensions. Searching through three weights would take a billion evaluations, or 100 × 1000 s = 100,000 s, which is about 28 hours!

For all 9 of our weights, it could take 1,268,391,679,350 millennia, which is more than a trillion millennia!
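The arithmetic behind these estimates can be sketched as follows. The function name and parameters are hypothetical. The three-weight case uses the 100 seconds per million evaluations quoted above (0.1 s per 1000). The nine-weight figure is reproduced by assuming 0.04 s per 1000 evaluations, which is an assumption chosen only because it matches the quoted number; it differs from the 0.11 s measured earlier.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365

def brute_force_seconds(num_weights, seconds_per_1000, values_per_weight=1000):
    """Estimated time to try every combination of weight values on a grid."""
    evaluations = values_per_weight ** num_weights
    return seconds_per_1000 * evaluations / 1000

# Three weights at 100 s per million evaluations (0.1 s per 1000)
hours = brute_force_seconds(3, seconds_per_1000=0.1) / 3600
print(f"3 weights: about {hours:.1f} hours")

# Nine weights, assuming 0.04 s per 1000 evaluations
millennia = brute_force_seconds(9, seconds_per_1000=0.04) / SECONDS_PER_YEAR / 1000
print(f"9 weights: about {millennia:.3e} millennia")
```

Each additional weight multiplies the number of evaluations by 1000, which is why the search becomes hopeless so quickly.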


Gradient descent method

So, we may want to use the gradient descent algorithm to find the weights that minimize $J$. Though it may not seem so impressive in one dimension, it is capable of incredible speedups in higher dimensions.

Actually, I wrote a couple of articles on the gradient descent algorithm:

  1. Batch gradient descent algorithm
  2. Batch gradient descent versus stochastic gradient descent (SGD)
  3. Single Layer Neural Network – Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method
  4. Single Layer Neural Network : Adaptive Linear Neuron using linear (identity) activation function with stochastic gradient descent (SGD)

Though we have two choices of gradient descent, batch (standard) or stochastic, we’re going to use batch gradient descent to train our Neural Network.

The batch gradient descent method sums up the derivatives of $J$ over all samples:

$$\sum \frac{\partial J}{\partial W}$$

while the stochastic gradient descent (SGD) method uses the derivative at one sample and then moves on to another sample point:

$$\frac{\partial J}{\partial W}$$
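To make the difference concrete, here is a rough sketch of both update rules on a simple linear model rather than on our network. The synthetic data, the "true" weights, the learning rate, and the function names are all hypothetical choices for illustration.

```python
import numpy as np

# Synthetic, noise-free data for a linear model y = X w (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

def sample_gradient(w, x_i, y_i):
    # dJ_i/dw for one sample, with J_i = 1/2 * (y_i - x_i . w)^2
    return -(y_i - x_i @ w) * x_i

def batch_gd(X, y, lr=0.01, epochs=500):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Sum the derivatives over ALL samples, then update once
        grad = sum(sample_gradient(w, x_i, y_i) for x_i, y_i in zip(X, y))
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, epochs=500):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Update after EACH sample, using only that sample's derivative
        for x_i, y_i in zip(X, y):
            w -= lr * sample_gradient(w, x_i, y_i)
    return w

w_batch = batch_gd(X, y)
w_sgd = sgd(X, y)
print(w_batch, w_sgd)
```

Batch gradient descent makes one update per pass over the data, while SGD makes one update per sample.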
