Deriving Neural Networks

By Jyotir Sai, Engineering Student

Neural networks encompass a wide array of predictive algorithms that are the backbone of many of the latest advances in artificial intelligence and machine learning. I'm going to explain and derive the math behind a very basic type of neural network: a single-hidden-layer neural network.

Notation & Symbols

The goal of a neural network is the same as any other supervised learning technique. We have a set of predictors, or input data, $\mathbf{X_{p}}$, and are trying to predict one or more responses, $\mathbf{Y_{k}}$. $\mathbf{X_{p}}$ is a matrix consisting of $P$ columns, where each column represents a different predictor variable. $\mathbf{Y_{k}}$ holds the responses, with one column for each of the $K$ classes.

| $X_{1}$ | $X_{2}$ | ... | $X_{P}$ |
| --- | --- | --- | --- |
| $x_{11}$ | $x_{21}$ | ... | $x_{P1}$ |
| $x_{12}$ | $x_{22}$ | ... | $x_{P2}$ |
| ... | ... | ... | ... |
| $x_{1N}$ | $x_{2N}$ | ... | $x_{PN}$ |

The subscript $i$ represents the row, from $i = 1,\dots,N$, where $N$ is the number of rows, i.e. data points in our dataset. The subscript $j$ represents the column, from $j = 1,\dots,P$, where $P$ is the number of columns, i.e. the variables in our dataset. If we had 3 variables consisting of age, gender, and weight, as well as 3 observations for each variable, then the above table would look like the following:

| age | gender | weight |
| --- | --- | --- |
| 18 | M | 177 |
| 24 | M | 164 |
| 44 | F | 159 |

For the above table, $N = 3$ and $P = 3$. Now, $\mathbf{Y_{k}}$ is a matrix of responses.

| $Y_{1}$ | $Y_{2}$ | ... | $Y_{K}$ |
| --- | --- | --- | --- |
| $Y_{11}$ | $Y_{21}$ | ... | $Y_{K1}$ |
| $Y_{12}$ | $Y_{22}$ | ... | $Y_{K2}$ |
| ... | ... | ... | ... |
| $Y_{1N}$ | $Y_{2N}$ | ... | $Y_{KN}$ |

Again, the subscript $i$ represents the row, from $i = 1,\dots,N$, and $k$ represents the column, i.e. the class, from $k = 1,\dots,K$. Let's say we are trying to predict salary from age, gender, and weight. Since we are only trying to predict one response, we have $K = 1$. So, the above table is simplified to:

| salary |
| --- |
| 60,000 |
| 39,000 |
| ... |
| 121,000 |

Hopefully, you now understand the notation for representing the data. It's important to keep this notation in mind, or things will get very confusing.
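
To make the notation concrete, here is a minimal sketch, assuming Python with NumPy, of how the example data could be stored as an $N \times P$ input matrix and a response vector. The numeric gender encoding and the pairing of the rows with specific salary values are my own illustrative choices, not something specified above.

```python
import numpy as np

# N = 3 observations, P = 3 predictors: age, gender, weight.
# Gender is encoded numerically (M = 0, F = 1) -- an assumption, since the
# table above only shows the raw labels.
X = np.array([
    [18.0, 0.0, 177.0],
    [24.0, 0.0, 164.0],
    [44.0, 1.0, 159.0],
])                                              # shape (N, P) = (3, 3)

# K = 1 response: salary. Pairing these rows with three of the salary values
# shown above is purely illustrative.
Y = np.array([60_000.0, 39_000.0, 121_000.0])   # shape (N,) = (3,)

print(X.shape, Y.shape)                         # (3, 3) (3,)
```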

Network Diagram

We represent neural networks using a network diagram.

Network diagrams help us visualize the input layer, hidden layer(s), and the output layer. The input layer consists of nodes that represent our input variables, $\mathbf{X_{p}}$. The hidden layer consists of hidden nodes, which we'll represent as $Z_{m}$ where $m = 1,\dots,M$. The output layer represents the classes or responses, $\mathbf{Y_{k}}$. The lines between nodes represent the connections between the nodes. In a basic neural network, each node in a layer is connected to all of the nodes from the previous layer. For example, the hidden node $Z_{1}$ receives connections from $X_{1}, X_{2}, X_{3}, \dots, X_{P}$.

The middle layer is called the hidden layer because we don't directly observe its values. In a neural network, all we really "see" are the values of the inputs and the values of the outputs. In a single-hidden-layer neural network, we only have one hidden layer. Multi-layer neural networks have multiple hidden layers.

Forward Propagation

I'll be using the same notation that is used in the Elements of Statistical Learning textbook. Each node in the hidden layer represents a linear combination of the inputs. Let's start with the inputs:

$$X_{j}, \qquad j = 1,\dots,P$$

The lines in the above network diagram represent different linear combinations.

$$\alpha_{0m} + \alpha_{m}^{T}X$$

The $\alpha_{0m}$ and $\alpha_{m}^{T}$ terms are called the bias and weights, respectively. Each hidden node, $Z_{m}$, has its own bias and weight vector. The bias term, $\alpha_{0m}$, is a single value, while the weights $\alpha_{m}$ form a vector of length $P$ (the number of input variables). The superscript $T$ means we take the transpose of the weight vector. Each weight in the weight vector represents how much we scale the corresponding input variable, $X_{j}$.

After we create a linear combination, the hidden node then applies an activation function, $\sigma$:

$$Z_{m} = \sigma\left(\alpha_{0m} + \alpha_{m}^{T}X\right), \qquad m = 1,\dots,M$$
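
As a quick illustration of this step for a single observation, here is a sketch assuming a sigmoid activation for $\sigma$ and randomly initialized biases and weights; the variable names and shapes are illustrative.

```python
import numpy as np

def sigmoid(v):
    """A common choice for the activation function sigma."""
    return 1.0 / (1.0 + np.exp(-v))

P, M = 3, 4                        # number of inputs and hidden nodes
rng = np.random.default_rng(0)

alpha0 = rng.normal(size=M)        # bias alpha_0m, one per hidden node
alpha  = rng.normal(size=(M, P))   # weight vector alpha_m (length P) per hidden node

x = np.array([18.0, 0.0, 177.0])   # a single observation x_i

Z = sigmoid(alpha0 + alpha @ x)    # Z_m = sigma(alpha_0m + alpha_m^T x), shape (M,)
```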

Now we need to transform our $Z$ values into the output. To do this, we take another linear combination, except this time it's of the $Z$ values:

$$T_{k} = \beta_{0k} + \beta_{k}^{T}Z, \qquad k = 1,\dots,K$$

Remember that $k$ represents which output node we are dealing with. If we only have one output node, then $k = 1$. The $\beta$ symbols again represent our weights and biases. We use a different symbol for them here to show that these are the weights and biases specifically for transforming the values in our hidden nodes to the output nodes.

Now, our output is equal to:

$$Y_{k} = g_{k}\left(T\right)$$

Our output $Y_{k}$ depends on whether we are dealing with a regression or classification problem. If we are dealing with regression, then the output is simply equal to $T_{k}$:

$$Y_{k} = f_{k}\left(x\right) = g_{k}\left(T\right) = T_{k}$$

When making predictions for a classification problem, we need to use the softmax function to transform the $T_{k}$ values into values between 0 and 1:

$$Y_{k} = f_{k}\left(x\right) = g_{k}\left(T\right) = \frac{e^{T_{k}}}{\sum_{l=1}^{K} e^{T_{l}}}$$

The above gives us an array of size $K$, with each value between 0 and 1. The class with the largest value is our class prediction.
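
Putting the whole forward pass together, here is a minimal sketch that computes the hidden layer, the output-layer linear combination $T_{k}$, and then either the identity output for regression or the softmax output for classification. The sigmoid activation, the random initialization, and all variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(t):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(t - np.max(t))
    return e / e.sum()

def forward(x, alpha0, alpha, beta0, beta, task="regression"):
    Z = sigmoid(alpha0 + alpha @ x)   # hidden layer: Z_m = sigma(alpha_0m + alpha_m^T x)
    T = beta0 + beta @ Z              # output layer: T_k = beta_0k + beta_k^T Z
    if task == "regression":
        return T                      # g_k(T) = T_k
    return softmax(T)                 # g_k(T) = e^{T_k} / sum_l e^{T_l}

P, M, K = 3, 4, 3
rng = np.random.default_rng(0)
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, P))
beta0,  beta  = rng.normal(size=K), rng.normal(size=(K, M))

x = np.array([18.0, 0.0, 177.0])
probs = forward(x, alpha0, alpha, beta0, beta, task="classification")
print(probs, probs.argmax())          # class probabilities and the predicted class
```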

Back Propagation

Fitting a neural network to a dataset involves finding the optimal values for the weights and biases, i.e. the values that minimize a loss function. For regression problems, the loss function is the residual sum of squares (RSS):

$$L\left(y_{ik}, f_{k}(x_{i})\right) = \sum_{i=1}^{N}\sum_{k=1}^{K}\left(y_{ik} - f_{k}\left(x_{i}\right)\right)^{2}$$
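
In code, the RSS loss is a one-liner; here `Y` and `F` are assumed to be $N \times K$ arrays of targets and network outputs, and the numbers below are made up purely to show the calculation.

```python
import numpy as np

def rss_loss(Y, F):
    """Residual sum of squares, summed over all N observations and K outputs."""
    return np.sum((Y - F) ** 2)

Y = np.array([[60_000.0], [39_000.0], [121_000.0]])   # targets y_ik, shape (N, K)
F = np.array([[58_000.0], [41_000.0], [118_500.0]])   # made-up predictions f_k(x_i)
print(rss_loss(Y, F))                                 # 14250000.0
```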

For the sake of simplicity, we'll absorb each bias into its corresponding weight vector and treat it as just another weight. So we have:

$$\alpha_{m} = [\alpha_{0m}, \alpha_{m1}, \alpha_{m2}, \dots, \alpha_{mP}]$$
$$\beta_{k} = [\beta_{0k}, \beta_{k1}, \beta_{k2}, \dots, \beta_{kM}]$$

We use gradient descent to iteratively update these weights and find their optimal values.

$$\beta_{km}^{\text{new}} = \beta_{km}^{\text{old}} - \gamma \sum_{i=1}^{N} \frac{\partial L}{\partial \beta_{km}}$$
$$\alpha_{ml}^{\text{new}} = \alpha_{ml}^{\text{old}} - \gamma \sum_{i=1}^{N} \frac{\partial L}{\partial \alpha_{ml}}$$

The constant $\gamma$ is the learning rate and is usually a small value like 0.01. Let's calculate the derivatives with respect to each weight, looking at the contribution of a single observation $i$ (the sums above add these contributions over all $N$ observations):
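
The update rule itself is simple to express in code; this sketch assumes the gradient passed in has already been summed over the $N$ observations, and the numbers are made up.

```python
import numpy as np

def gradient_descent_step(w, grad, learning_rate=0.01):
    """One update: w_new = w_old - gamma * (gradient summed over observations)."""
    return w - learning_rate * grad

# Illustrative usage with made-up numbers:
w = np.array([0.5, -1.2])
grad = np.array([0.3, 0.1])                  # pretend this is sum_i dL/dw
print(gradient_descent_step(w, grad))        # [ 0.497 -1.201]
```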

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial f_{k}\left(x_{i}\right)}{\partial \beta_{km}}$$

As seen in the forward propagation section, $f_{k}(x_{i}) = g_{k}(T_{k})$, so we can rewrite the above equation as:

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial g_{k}\left(T_{k}\right)}{\partial \beta_{km}}$$

We can split the partial derivative further since $T_{k} = \beta_{0k} + \beta_{k}^{T}Z$:

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial g_{k}}{\partial T_{k}}\frac{\partial T_{k}}{\partial \beta_{km}}$$

We will simply rewrite $\frac{\partial g_{k}}{\partial T_{k}}$ as $g'_{k}\left(T_{k}\right)$. We can also see that $\frac{\partial T_{k}}{\partial \beta_{km}} = z_{m}$, the value of the $m$-th hidden node.

$$\frac{\partial L}{\partial \beta_{km}} = -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)g'_{k}\left(T_{k}\right)z_{mi}$$

We write $z_{mi}$ to indicate the value of the $m$-th hidden node for the $i$-th observation. In the regression case, where $g_{k}\left(T_{k}\right) = T_{k}$, the derivative $g'_{k}\left(T_{k}\right)$ is simply 1. Now, we follow a similar procedure for the other weights, $\alpha_{ml}$:

$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial f_{k}\left(x_{i}\right)}{\partial \alpha_{ml}}$$
$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)\frac{\partial g_{k}\left(T_{k}\right)}{\partial \alpha_{ml}}$$
$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)g'_{k}\left(T_{k}\right)\frac{\partial T_{k}}{\partial z_{mi}}\frac{\partial z_{mi}}{\partial \alpha_{ml}}$$
$$\frac{\partial L}{\partial \alpha_{ml}} = \sum_{k=1}^{K} -2\left(y_{ik} - f_{k}\left(x_{i}\right)\right)g'_{k}\left(T_{k}\right)\beta_{km}\,\sigma'\left(\alpha_{m}^{T}x_{i}\right)x_{il}$$

That's all we need to run gradient descent and update our weights! For classification, we use the cross-entropy loss function, and the derivation can be carried out in the same way.
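
To tie the derivation together, here is a sketch of one full gradient computation and update for the regression case, where $g_{k}(T) = T_{k}$ and hence $g'_{k} = 1$, using a sigmoid hidden activation whose derivative is $\sigma'(v) = \sigma(v)\left(1 - \sigma(v)\right)$. The biases are kept separate rather than absorbed into the weight vectors, and all names, shapes, and the random data are illustrative assumptions rather than the textbook's implementation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, Y, alpha0, alpha, beta0, beta, lr=0.01):
    """One gradient-descent update of all weights on the RSS loss (regression case)."""
    grad_alpha0 = np.zeros_like(alpha0)
    grad_alpha  = np.zeros_like(alpha)
    grad_beta0  = np.zeros_like(beta0)
    grad_beta   = np.zeros_like(beta)

    for i in range(X.shape[0]):              # sum the per-observation gradients
        x, y = X[i], Y[i]
        A = alpha0 + alpha @ x               # pre-activations alpha_0m + alpha_m^T x_i
        Z = sigmoid(A)                       # hidden values z_mi
        T = beta0 + beta @ Z                 # outputs f_k(x_i) = T_k for regression

        d = -2.0 * (y - T)                   # -2 (y_ik - f_k(x_i)), with g'_k = 1
        grad_beta0 += d                      # dT_k / dbeta_0k = 1
        grad_beta  += np.outer(d, Z)         # dL / dbeta_km  = d_k * z_mi
        s = (beta.T @ d) * Z * (1.0 - Z)     # sum_k d_k * beta_km * sigma'(A_m)
        grad_alpha0 += s                     # dA_m / dalpha_0m = 1
        grad_alpha  += np.outer(s, x)        # dL / dalpha_ml = s_m * x_il

    # Gradient-descent updates with learning rate gamma = lr
    return (alpha0 - lr * grad_alpha0, alpha - lr * grad_alpha,
            beta0 - lr * grad_beta0, beta - lr * grad_beta)

# Illustrative usage with made-up, standardized data:
rng = np.random.default_rng(0)
N, P, M, K = 5, 3, 4, 1
X, Y = rng.normal(size=(N, P)), rng.normal(size=(N, K))
params = (rng.normal(size=M), rng.normal(size=(M, P)),
          rng.normal(size=K), rng.normal(size=(K, M)))
params = backprop_step(X, Y, *params)
```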