Deriving Neural Networks

By Jyotir Sai, Engineering Student

Neural networks encompass a wide array of predictive algorithms that are the backbone of many of the latest advances in artificial intelligence and machine learning. I'm going to explain and derive the math behind a very basic type of neural network: a single-hidden-layer neural network.

Notation & Symbols

The goal of a neural network is the same as any other supervised learning technique. We have a set of predictors, or input data, $\mathbf{X_{p}}$, and are trying to predict one or more responses, $\mathbf{Y_{k}}$. $\mathbf{X_{p}}$ is a matrix consisting of $P$ columns, where each column represents a different predictor variable. $\mathbf{Y_{k}}$ holds the responses, with one column for each of the $K$ classes.

| $X_{1}$ | $X_{2}$ | ... | $X_{P}$ |
| --- | --- | --- | --- |
| $x_{11}$ | $x_{21}$ | ... | $x_{P1}$ |
| $x_{12}$ | $x_{22}$ | ... | $x_{P2}$ |
| ... | ... | ... | ... |
| $x_{1N}$ | $x_{2N}$ | ... | $x_{PN}$ |

The subscript $i$ represents the row, from $i = 1,\dots,N$, where $N$ is the number of rows, i.e. data points in our dataset. The subscript $j$ represents the column, from $j = 1,\dots,P$, where $P$ is the number of columns, i.e. the variables in our dataset. If we had 3 variables consisting of age, gender, and weight, as well as 3 observations for each variable, then the above table would look like the following:

| age | gender | weight |
| --- | --- | --- |
| 18 | M | 177 |
| 24 | M | 164 |
| 44 | F | 159 |

For the above table, $N = 3$ and $P = 3$. Now, $\mathbf{Y_{k}}$ is a matrix of responses.

| $Y_{1}$ | $Y_{2}$ | ... | $Y_{K}$ |
| --- | --- | --- | --- |
| $Y_{11}$ | $Y_{21}$ | ... | $Y_{K1}$ |
| $Y_{12}$ | $Y_{22}$ | ... | $Y_{K2}$ |
| ... | ... | ... | ... |
| $Y_{1N}$ | $Y_{2N}$ | ... | $Y_{KN}$ |

Again, the subscript $i$ represents the row, from $i = 1,\dots,N$, and $k$ represents the column, i.e. the class, from $k = 1,\dots,K$. Let's say we are trying to predict salary from age, gender, and weight. Since we are only trying to predict one response, we have $K = 1$. So, the above table is simplified to:

| salary |
| --- |
| 60,000 |
| 39,000 |
| ... |
| 121,000 |

Hopefully, you now understand the notation for representing the data. It's important to keep this notation in mind, or things will get very confusing.
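
To make the notation concrete, here is a minimal sketch, assuming Python with NumPy, of how the example data could be stored as an $N \times P$ input matrix and a response vector. The numeric gender encoding and the pairing of the rows with specific salary values are my own illustrative choices, not something specified above.

```python
import numpy as np

# N = 3 observations, P = 3 predictors: age, gender, weight.
# Gender is encoded numerically (M = 0, F = 1) -- an assumption, since the
# table above only shows the raw labels.
X = np.array([
    [18.0, 0.0, 177.0],
    [24.0, 0.0, 164.0],
    [44.0, 1.0, 159.0],
])                                              # shape (N, P) = (3, 3)

# K = 1 response: salary. Pairing these rows with three of the salary values
# shown above is purely illustrative.
Y = np.array([60_000.0, 39_000.0, 121_000.0])   # shape (N,) = (3,)

print(X.shape, Y.shape)                         # (3, 3) (3,)
```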

Network Diagram

We represent neural networks using a network diagram.

Network diagrams help us visualize the input layer, hidden layer(s), and the output layer. The input layer consists of nodes that represent our input variables, $\mathbf{X_{p}}$. The hidden layer consists of hidden nodes, which we'll represent as $Z_{m}$ where $m = 1,\dots,M$. The output layer represents the classes or responses, $\mathbf{Y_{k}}$. The lines between nodes represent the connections between the nodes. In a basic neural network, each node in a layer is connected to all of the nodes from the previous layer. For example, the hidden node $Z_{1}$ receives connections from $X_{1}, X_{2}, X_{3}, \dots, X_{P}$.

The middle layer is called the hidden layer because we don't directly observe its values. In a neural network, all we really "see" are the values of the inputs and the values of the outputs. In a single-hidden-layer neural network, we only have one hidden layer. Multi-layer neural networks have multiple hidden layers.

Forward Propagation

I'll be using the same notation that is used in the Elements of Statistical Learning textbook. Each node in the hidden layer represents a linear combination of the inputs. Let's start with the inputs:

$$X_{j}, \qquad j = 1,\dots,P$$

The lines in the above network diagram represent different linear combinations.

$$\alpha_{0m} + \alpha_{m}^{T}X$$

The $\alpha_{0m}$ and $\alpha_{m}^{T}$ terms are called the bias and weights, respectively. Each hidden node, $Z_{m}$, has its own bias and weight vector. The bias term, $\alpha_{0m}$, is a single value, while the weights $\alpha_{m}$ form a vector of length $P$ (the number of input variables). The superscript $T$ means we take the transpose of the weight vector. Each weight in the weight vector represents how much we scale the corresponding input variable, $X_{j}$.

After we create a linear combination, the hidden node then applies an activation function, $\sigma$:

$$Z_{m} = \sigma\left(\alpha_{0m} + \alpha_{m}^{T}X\right), \qquad m = 1,\dots,M$$
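
As a quick illustration of this step for a single observation, here is a sketch assuming a sigmoid activation for $\sigma$ and randomly initialized biases and weights; the variable names and shapes are illustrative.

```python
import numpy as np

def sigmoid(v):
    """A common choice for the activation function sigma."""
    return 1.0 / (1.0 + np.exp(-v))

P, M = 3, 4                        # number of inputs and hidden nodes
rng = np.random.default_rng(0)

alpha0 = rng.normal(size=M)        # bias alpha_0m, one per hidden node
alpha  = rng.normal(size=(M, P))   # weight vector alpha_m (length P) per hidden node

x = np.array([18.0, 0.0, 177.0])   # a single observation x_i

Z = sigmoid(alpha0 + alpha @ x)    # Z_m = sigma(alpha_0m + alpha_m^T x), shape (M,)
```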

Now we need to transform our $Z$ values into the output. To do this, we take another linear combination, except this time it's of the $Z$ values:

$$T_{k} = \beta_{0k} + \beta_{k}^{T}Z, \qquad k = 1,\dots,K$$

Remember that $k$ represents which output node we are dealing with. If we only have one output node, then $k = 1$. The $\beta$ symbols again represent our weights and biases. We use a different symbol for them here to show that these are the weights and biases specifically for transforming the values in our hidden nodes to the output nodes.

Now, our output is equal to:

$$Y_{k} = g_{k}\left(T\right)$$

Our output $Y_{k}$ depends on whether we are dealing with a regression or classification problem. If we are dealing with regression, then the output is simply equal to $T_{k}$:

$$Y_{k} = f_{k}\left(x\right) = g_{k}\left(T\right) = T_{k}$$

When making predictions for a classification problem, we need to use the softmax function to transform the $T_{k}$ values into values between 0 and 1:

$$Y_{k} = f_{k}\left(x\right) = g_{k}\left(T\right) = \frac{e^{T_{k}}}{\sum_{l=1}^{K} e^{T_{l}}}$$

The above gives us an array of size $K$, with each value between 0 and 1. The class with the largest value is our class prediction.
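
Putting the whole forward pass together, here is a minimal sketch that computes the hidden layer, the output-layer linear combination $T_{k}$, and then either the identity output for regression or the softmax output for classification. The sigmoid activation, the random initialization, and all variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(t):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(t - np.max(t))
    return e / e.sum()

def forward(x, alpha0, alpha, beta0, beta, task="regression"):
    Z = sigmoid(alpha0 + alpha @ x)   # hidden layer: Z_m = sigma(alpha_0m + alpha_m^T x)
    T = beta0 + beta @ Z              # output layer: T_k = beta_0k + beta_k^T Z
    if task == "regression":
        return T                      # g_k(T) = T_k
    return softmax(T)                 # g_k(T) = e^{T_k} / sum_l e^{T_l}

P, M, K = 3, 4, 3
rng = np.random.default_rng(0)
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, P))
beta0,  beta  = rng.normal(size=K), rng.normal(size=(K, M))

x = np.array([18.0, 0.0, 177.0])
probs = forward(x, alpha0, alpha, beta0, beta, task="classification")
print(probs, probs.argmax())          # class probabilities and the predicted class
```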

Back Propagation

Fitting a neural network to a dataset involves finding the optimal values for the weights and biases, i.e. the values that minimize a loss function. For regression problems, the loss function is the residual sum of squares (RSS):

$$L\left(y_{ik}, f_{k}(x_{i})\right) = \sum_{i=1}^{N}\sum_{k=1}^{K}\left(y_{ik} - f_{k}\left(x_{i}\right)\right)^{2}$$
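
In code, the RSS loss is a one-liner; here `Y` and `F` are assumed to be $N \times K$ arrays of targets and network outputs, and the numbers below are made up purely to show the calculation.

```python
import numpy as np

def rss_loss(Y, F):
    """Residual sum of squares, summed over all N observations and K outputs."""
    return np.sum((Y - F) ** 2)

Y = np.array([[60_000.0], [39_000.0], [121_000.0]])   # targets y_ik, shape (N, K)
F = np.array([[58_000.0], [41_000.0], [118_500.0]])   # made-up predictions f_k(x_i)
print(rss_loss(Y, F))                                 # 14250000.0
```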

For the sake of simplicity, we'll absorb each bias into its corresponding weight vector and treat it as just another weight. So we have:

$$\alpha_{m} = [\alpha_{0m}, \alpha_{m1}, \alpha_{m2}, \dots, \alpha_{mP}]$$
$$\beta_{k} = [\beta_{0k}, \beta_{k1}, \beta_{k2}, \dots, \beta_{kM}]$$

We use gradient descent to iteratively update these weights and find their optimal values.

$$\beta_{km}^{\text{new}} = \beta_{km}^{\text{old}} - \gamma \sum_{i=1}^{N} \frac{\partial L}{\partial \beta_{km}}$$
$$\alpha_{ml}^{\text{new}} = \alpha_{ml}^{\text{old}} - \gamma \sum_{i=1}^{N} \frac{\partial L}{\partial \alpha_{ml}}$$

The constant $\gamma$ is the learning rate and is usually a small value like 0.01. Let's calculate the derivatives with respect to each weight, looking at the contribution of a single observation $i$ (the sums above add these contributions over all $N$ observations):
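
The update rule itself is simple to express in code; this sketch assumes the gradient passed in has already been summed over the $N$ observations, and the numbers are made up.

```python
import numpy as np

def gradient_descent_step(w, grad, learning_rate=0.01):
    """One update: w_new = w_old - gamma * (gradient summed over observations)."""
    return w - learning_rate * grad

# Illustrative usage with made-up numbers:
w = np.array([0.5, -1.2])
grad = np.array([0.3, 0.1])                  # pretend this is sum_i dL/dw
print(gradient_descent_step(w, grad))        # [ 0.497 -1.201]
```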

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial f_{k}\left(x_{i}\right)}{\partial \beta_{km}}$$

As seen in the forward propagation section, $f_{k}(x_{i}) = g_{k}(T_{k})$, so we can rewrite the above equation as:

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial g_{k}\left(T_{k}\right)}{\partial \beta_{km}}$$

We can split the partial derivative further since $T_{k} = \beta_{0k} + \beta_{k}^{T}Z$:

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial g_{k}}{\partial T_{k}}\frac{\partial T_{k}}{\partial \beta_{km}}$$

We will simply rewrite $\frac{\partial g_{k}}{\partial T_{k}}$ as $g'_{k}\left(T_{k}\right)$. We can also see that $\frac{\partial T_{k}}{\partial \beta_{km}} = z_{m}$, the value of the $m$-th hidden node.

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)g'_{k}\left(T_{k}\right)z_{mi}$$

We write $z_{mi}$ to indicate the value of the $m$-th hidden node for the $i$-th observation. In the regression case, where $g_{k}\left(T_{k}\right) = T_{k}$, the derivative $g'_{k}\left(T_{k}\right)$ is simply 1. Now, we follow a similar procedure for the other weights, $\alpha_{ml}$:

$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial f_{k}\left(x_{i}\right)}{\partial \alpha_{ml}}$$
$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial g_{k}\left(T_{k}\right)}{\partial \alpha_{ml}}$$
$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)g'_{k}\left(T_{k}\right)\frac{\partial T_{k}}{\partial z_{mi}}\frac{\partial z_{mi}}{\partial \alpha_{ml}}$$
$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)g'_{k}\left(T_{k}\right)\beta_{km}\,\sigma'\left(\alpha_{m}^{T}x_{i}\right)x_{il}$$

That's all we need to run gradient descent and update our weights! For classification, we use the cross-entropy loss function, and the derivation can be carried out in the same way.
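
To tie the derivation together, here is a sketch of one full gradient computation and update for the regression case, where $g_{k}(T) = T_{k}$ and hence $g'_{k} = 1$, using a sigmoid hidden activation whose derivative is $\sigma'(v) = \sigma(v)\left(1 - \sigma(v)\right)$. The biases are kept separate rather than absorbed into the weight vectors, and all names, shapes, and the random data are illustrative assumptions rather than the textbook's implementation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, Y, alpha0, alpha, beta0, beta, lr=0.01):
    """One gradient-descent update of all weights on the RSS loss (regression case)."""
    grad_alpha0 = np.zeros_like(alpha0)
    grad_alpha  = np.zeros_like(alpha)
    grad_beta0  = np.zeros_like(beta0)
    grad_beta   = np.zeros_like(beta)

    for i in range(X.shape[0]):              # sum the per-observation gradients
        x, y = X[i], Y[i]
        A = alpha0 + alpha @ x               # pre-activations alpha_0m + alpha_m^T x_i
        Z = sigmoid(A)                       # hidden values z_mi
        T = beta0 + beta @ Z                 # outputs f_k(x_i) = T_k for regression

        d = -2.0 * (y - T)                   # -2 (y_ik - f_k(x_i)), with g'_k = 1
        grad_beta0 += d                      # dT_k / dbeta_0k = 1
        grad_beta  += np.outer(d, Z)         # dL / dbeta_km  = d_k * z_mi
        s = (beta.T @ d) * Z * (1.0 - Z)     # sum_k d_k * beta_km * sigma'(A_m)
        grad_alpha0 += s                     # dA_m / dalpha_0m = 1
        grad_alpha  += np.outer(s, x)        # dL / dalpha_ml = s_m * x_il

    # Gradient-descent updates with learning rate gamma = lr
    return (alpha0 - lr * grad_alpha0, alpha - lr * grad_alpha,
            beta0 - lr * grad_beta0, beta - lr * grad_beta)

# Illustrative usage with made-up, standardized data:
rng = np.random.default_rng(0)
N, P, M, K = 5, 3, 4, 1
X, Y = rng.normal(size=(N, P)), rng.normal(size=(N, K))
params = (rng.normal(size=M), rng.normal(size=(M, P)),
          rng.normal(size=K), rng.normal(size=(K, M)))
params = backprop_step(X, Y, *params)
```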