Neural Networks from scratch

Using and implementing a neural network could hardly be easier these days. Not only are there optimised, GPU-capable ML libraries like PyTorch and TensorFlow that can run your code many times faster, these libraries also make writing the code much easier, with built-in functions that take care of many aspects under the hood. The biggest of these is probably gradient tracking and backpropagation, which essentially happens without you even thinking about it. And that's before mentioning the additional libraries built on top (e.g. FastAI) which make it even easier, or sites like Hugging Face which let you essentially plug and play models. This raises the question: why would anyone want to start from scratch and code up a neural net in NumPy? Well, in reality, no one does. However, it's an excellent exercise for improving your understanding of what goes on under the hood, and I would highly recommend it to anyone starting out. It also allows you to dive deep into the network and explore what's going on: a great way to understand what your model is doing and to make it seem like less of a black box.

What are we building?

This article walks you through how to build a simple multilayer perceptron with a customisable size. We’ll include two activation functions (Sigmoid and Leaky ReLU) and we’ll add the option for Batch Normalisation. We’ll also include two loss functions: Mean Square Error and Cross Entropy, to use depending on what task we want to assign our model to. We’ll then examine how it does on MNIST and compare it to an identical model built in PyTorch.

The code for this project is available on GitHub in a Jupyter notebook.

This article assumes that you are already familiar with Machine Learning and understand the general process (for a high level introduction see here). The focus is mainly on implementation and so some mathematical aspects, e.g. derivatives, are not explained fully. However, I have tried to include links for further reading.

Step 1: Creating the Class

First, import the dependencies:

              import numpy as np
              import pandas as pd
              import matplotlib.pyplot as plt
              from PIL import Image

Our model class will need to take a number of parameters, most importantly the dimensions of the network and the learning rate. In this case the dimensions should be passed as a list of integers determining the input size for each layer (and the final output size). We’ll also pass some additional settings so that we can select the loss and activation functions we want to use, as well as if we want to use batch normalisation. We then store these variables and select the correct loss function and activation function (along with its derivative function). We will define all of these functions later. Finally, we’ll also call a function to create the model, which initialises all of the weights and creates variables to store the gradients and activations as they get passed through the network.

              class Model():
                def __init__(self, model_dimensions, lr=0.01, activation="Leaky ReLU", loss="Cross Entropy", use_batch_norm=True):
                  self.dims = model_dimensions
                  self.lr = lr
                  self.layers = len(model_dimensions) - 1
                  self.use_batch_norm = use_batch_norm
                  self.train_mode_on = True
                  if loss == "Cross Entropy":
                    self.loss = self.cross_entropy
                  elif loss == "MSE":
                    self.loss = self.MSE_loss
                  else:
                    raise Exception("Loss must be either 'Cross Entropy' or 'MSE'")
                  if activation == "Sigmoid":
                    self.activation = self.sigmoid
                    self.activation_backwards = self.sigmoid_backwards
                  elif activation == "Leaky ReLU":
                    self.activation = self.Leaky_ReLU
                    self.activation_backwards = self.Leaky_ReLU_backwards
                  else:
                    raise Exception("Activation function must be either 'Sigmoid' or 'Leaky ReLU'")
                  self.parameters, self.activations, self.gradients, self.bn = self.create_model()

We only want to store gradients when we are training the model, and batch normalisation behaves differently in training and test modes. We therefore need a variable (train_mode_on) to define the state of the model. The first two functions we'll create, train_mode() and test_mode(), simply allow us to toggle easily between the two.

                def train_mode(self):
                    self.train_mode_on = True

                def test_mode(self):
                    self.train_mode_on = False
Step 2: Create the Model

Here, we create several dictionaries to store and keep track of the weights, activations and gradients for our network. We iterate through each layer and create a corresponding weight and bias matrix, along with an identical matrix to store the gradients. We also need to initialise the weights. We’ll follow how PyTorch does this.

Weight Initialisation

Following PyTorch's default for linear layers, each parameter is drawn uniformly from (-1/√k, 1/√k), where k is the number of inputs to the layer (see pytorch.org).

              def initialise_weights(self, size):
                  param = 2*np.random.random(size) - 1
                  stdv = 1. / np.sqrt(size[0])
                  return param * stdv
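As a quick sanity check, here is a standalone sketch (duplicating the function outside the class) confirming that the initialised values have the right shape and stay within ±1/√fan_in:

```python
import numpy as np

np.random.seed(0)

def initialise_weights(size):
    # uniform in (-1, 1), scaled by 1 / sqrt(fan_in)
    param = 2 * np.random.random(size) - 1
    return param / np.sqrt(size[0])

W = initialise_weights((784, 100))
bound = 1 / np.sqrt(784)
print(W.shape)                                  # (784, 100)
print(W.min() >= -bound and W.max() <= bound)   # True
```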

We also do the same for the gamma and beta factors in the batch normalisation layer. These are initialised to 1 and 0 respectively so that initially they have no effect on the normalisation. The running estimates of the mean and variance are initialised to 0 and 1 respectively (more on batch normalisation later).

              def create_model(self):
                  parameters, activations, gradients, bn = {}, {}, {}, {}
                  gradients['L'] = None
                  for layer in range(1,len(self.dims)):
                    input = self.dims[layer-1]
                    output = self.dims[layer]
                    parameters['W'+ str(layer)] = self.initialise_weights((input,output))
                    parameters['B'+ str(layer)] = self.initialise_weights((output,))
                    parameters['gamma' + str(layer)] = np.ones(output)
                    parameters['beta' + str(layer)] = np.zeros(output)
                    bn['mu' + str(layer)] = np.zeros(output)
                    bn['var' + str(layer)] = np.ones(output)
                    gradients['W'+ str(layer)] = np.zeros((input,output))
                    gradients['B'+ str(layer)] = np.zeros(output)
                    gradients['gamma' + str(layer)] = np.zeros(output)
                    gradients['beta' + str(layer)] = np.zeros(output)

                  return parameters, activations, gradients, bn
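To make the shapes concrete, here is a minimal standalone sketch of the same loop (weights and biases only) for a toy network with dims = [4, 3, 2]:

```python
import numpy as np

dims = [4, 3, 2]   # 4 inputs -> 3 hidden units -> 2 outputs
parameters = {}
for layer in range(1, len(dims)):
    n_in, n_out = dims[layer - 1], dims[layer]
    parameters['W' + str(layer)] = np.zeros((n_in, n_out))
    parameters['B' + str(layer)] = np.zeros(n_out)

print({k: v.shape for k, v in parameters.items()})
# {'W1': (4, 3), 'B1': (3,), 'W2': (3, 2), 'B2': (2,)}
```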

Finally, following the same convention as PyTorch, the default behaviour on the backward pass will be to add new gradients to the stored gradients rather than always replacing them. As a result, we'll need a function to zero the gradients.

              def zero_grad(self):
                  self.gradients['L'] = None
                  for layer in range(1,self.layers+1):
                    self.gradients['W'+ str(layer)] = self.gradients['W'+ str(layer)] * 0
                    self.gradients['B'+ str(layer)] = self.gradients['B'+ str(layer)] * 0
                    self.gradients['gamma' + str(layer)] = self.gradients['gamma' + str(layer)] * 0
                    self.gradients['beta' + str(layer)] = self.gradients['beta' + str(layer)] * 0
Step 3: Activation layers

We’ll include the option to use two activation functions: Sigmoid and Leaky ReLU (the latter can also function as a simple ReLU by setting the alpha value to 0). The sigmoid function takes the input and squeezes it between 0 and 1. It is defined as follows:

Sigmoid function: σ(x) = 1 / (1 + e⁻ˣ)
                def sigmoid(self, x):
                    return 1/(1+np.exp(-x))

We will also need the derivative function for the backward step:

Sigmoid derivative: σ′(x) = σ(x)(1 − σ(x))
                def sigmoid_backwards(self, dx, x):
                    sig = self.sigmoid(x)
                    return dx*sig*(1-sig)
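We can sanity-check the analytic derivative against a central finite difference with a standalone sketch:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_backwards(dx, x):
    sig = sigmoid(x)
    return dx * sig * (1 - sig)

# finite-difference check at a few points
x = np.array([-2.0, -0.5, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_backwards(np.ones_like(x), x)
print(np.max(np.abs(numeric - analytic)))   # tiny, ~1e-10 or less
```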

The ReLU (Rectified Linear Unit) function leaves values greater than zero unchanged but returns zero for all values less than zero. Leaky ReLU does the same but instead multiplies negative values by an alpha value (usually very small, e.g. 0.01). This helps the model to train by ensuring that every value has a non-zero gradient and can therefore be updated during the backpropagation step (avoiding the “dying ReLU” problem). Leaky ReLU is defined as follows:

Leaky ReLU function: f(x) = max(αx, x)
                def Leaky_ReLU(self, x, alpha=0.01):
                    return np.maximum(alpha*x,x)

The derivative is simply 1 when x > 0 and alpha when x < 0. Technically it is undefined at 0 but we will simply define it to be 1 in order to avoid numerical issues.

                def Leaky_ReLU_backwards(self, dx, x, alpha=0.01):
                    return dx*np.where(x<0,alpha,1)
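A few spot checks of both functions in a standalone sketch:

```python
import numpy as np

def Leaky_ReLU(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def Leaky_ReLU_backwards(dx, x, alpha=0.01):
    # gradient is 1 for x >= 0 and alpha for x < 0 (we define f'(0) = 1)
    return dx * np.where(x < 0, alpha, 1)

x = np.array([-100.0, -1.0, 0.0, 2.0])
print(Leaky_ReLU(x))                             # -1, -0.01, 0, 2
print(Leaky_ReLU_backwards(np.ones_like(x), x))  # 0.01, 0.01, 1, 1
```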
Step 4: Batch normalisation

Batch normalisation greatly improves the training stability of deep neural networks. If we imagine a network designed to classify images, the top left-hand pixel of every image always follows the same path through the network, regardless of which image we use. Therefore, when we pass a batch of images through the network during training, all of the values in the same position along the batch dimension are transformed in the same way. Normalising these values at each layer keeps their distribution stable and greatly improves training.

In short: it normalises along the batch dimension and then re-scales the distribution with two learnable factors: gamma and beta. The equation is given below.

Batch normalisation: x̂ = (x − μ) / √(σ² + ε), followed by out = γx̂ + β

(Ioffe and Szegedy, 2015)

The intermediate steps need to be cached for the backward pass. We also need to differentiate between training and test modes. This is because, during testing, we want to normalise according to the distribution the model learned during training. As a result, we keep running estimates of the mean and variance during training and store them for use at test time (‘mu’ and ‘var’ in the code).

                def batch_norm_forward(self, x, layer, eps=1e-5):
                    N, D = x.shape
                    gamma = self.parameters['gamma' + str(layer)]
                    beta = self.parameters['beta' + str(layer)]
                    if self.train_mode_on:
                      mu = 1./N * np.sum(x, axis = 0)
                      xmu = x - mu
                      var = 1./N * np.sum(xmu ** 2, axis = 0)
                      self.bn['mu' + str(layer)] = 0.9 * self.bn['mu' + str(layer)] + 0.1 * mu
                      self.bn['var' + str(layer)] = 0.9 * self.bn['var' + str(layer)] + 0.1 * var
                    else:
                      mu = self.bn['mu' + str(layer)]
                      xmu = x - mu
                      var = self.bn['var' + str(layer)]
                    sqrtvar = np.sqrt(var + eps)
                    ivar = 1./sqrtvar
                    xhat = xmu/sqrtvar
                    gammax = gamma * xhat
                    out = gammax + beta
                    if self.train_mode_on:
                      self.bn['cache' + str(layer)] = (xhat,xmu,ivar,sqrtvar,var,eps)
                    return out
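A quick standalone check that the normalisation step does what we expect: with gamma = 1 and beta = 0, each feature ends up with roughly zero mean and unit variance across the batch:

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(256, 8) * 5 + 3      # batch of 256, 8 features, shifted and scaled
eps = 1e-5

mu = x.mean(axis=0)
var = x.var(axis=0)
xhat = (x - mu) / np.sqrt(var + eps)     # normalised activations (gamma=1, beta=0)

print(np.abs(xhat.mean(axis=0)).max())   # ~0
print(xhat.var(axis=0))                  # ~1 for every feature
```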

The backward pass involves numerous steps and can be best understood by drawing a computational graph to represent the whole process. An excellent overview can be found here.

                def batch_norm_backward(self, dx, layer):
                    (xhat,xmu,ivar,sqrtvar,var,eps) = self.bn['cache' + str(layer)]
                    gamma = self.parameters['gamma' + str(layer)]
                    N,D = dx.shape
                    self.gradients['gamma' + str(layer)] += np.sum(dx*xhat, axis=0)
                    self.gradients['beta' + str(layer)] += np.sum(dx, axis=0)
                    dxhat = dx * gamma
                    divar = np.sum(dxhat*xmu, axis=0)
                    dxmu1 = dxhat * ivar
                    dsqrtvar = -1. /(sqrtvar**2) * divar
                    dvar = 0.5 * 1. /np.sqrt(var+eps) * dsqrtvar
                    dsq = 1. /N * np.ones((N,D)) * dvar
                    dxmu2 = 2 * xmu * dsq
                    dx1 = (dxmu1 + dxmu2)
                    dmu = -1 * np.sum(dxmu1+dxmu2, axis=0)
                    dx2 = 1. /N * np.ones((N,D)) * dmu
                    dx = dx1 + dx2
                    return dx
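The backward derivation is easy to get wrong, so it is worth checking numerically. This standalone sketch mirrors the forward and backward steps above and compares the analytic input gradient against a finite-difference estimate of the gradient of Σ(out·dout):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    xmu = x - mu
    var = (xmu ** 2).mean(axis=0)
    sqrtvar = np.sqrt(var + eps)
    xhat = xmu / sqrtvar
    return gamma * xhat + beta, (xhat, xmu, 1. / sqrtvar, sqrtvar, var, eps)

def bn_backward(dout, gamma, cache):
    xhat, xmu, ivar, sqrtvar, var, eps = cache
    N, D = dout.shape
    dxhat = dout * gamma
    divar = np.sum(dxhat * xmu, axis=0)
    dxmu1 = dxhat * ivar
    dsqrtvar = -1. / (sqrtvar ** 2) * divar
    dvar = 0.5 / np.sqrt(var + eps) * dsqrtvar
    dxmu2 = 2 * xmu * dvar / N
    dmu = -np.sum(dxmu1 + dxmu2, axis=0)
    return dxmu1 + dxmu2 + dmu / N

np.random.seed(2)
x = np.random.randn(16, 4)
gamma, beta = np.ones(4), np.zeros(4)
dout = np.random.randn(16, 4)

out, cache = bn_forward(x, gamma, beta)
dx = bn_backward(dout, gamma, cache)

# finite-difference estimate, one input element at a time
fd = 1e-5
dx_num = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp, xm = x.copy(), x.copy()
        xp[i, j] += fd
        xm[i, j] -= fd
        dx_num[i, j] = (np.sum(bn_forward(xp, gamma, beta)[0] * dout)
                        - np.sum(bn_forward(xm, gamma, beta)[0] * dout)) / (2 * fd)

print(np.max(np.abs(dx - dx_num)))   # should be tiny (~1e-8 or less)
```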
Step 5: Forward pass

Now we are ready to use our model and create the forward pass. Now that we have everything in place this is relatively straightforward. For every layer, we multiply the previous activations by the weights and add the biases. We then apply batch normalisation (if activated) and finally pass through the activation layer to give the new activations. We need to make sure we store these activations for the backward pass.

                def forward(self, x):
                    self.activations['A0'] = x
                    for layer in range(1,self.layers+1):
                      Z = self.activations['A'+ str(layer-1)] @ self.parameters['W'+ str(layer)] + self.parameters['B'+ str(layer)]
                      if self.use_batch_norm:
                        Z = self.batch_norm_forward(Z, layer)
                      # store the (normalised) pre-activation so that the activation
                      # derivative is evaluated at the values the activation actually saw
                      self.activations['Z'+ str(layer)] = Z
                      self.activations['A'+ str(layer)] = self.activation(Z)
                    return self.activations['A'+ str(self.layers)]
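As a shape check, here is a minimal standalone version of the same loop (batch normalisation omitted) pushing a random batch through a [784, 100, 10] network:

```python
import numpy as np

np.random.seed(3)
dims = [784, 100, 10]
params = {}
for l in range(1, len(dims)):
    params['W' + str(l)] = np.random.randn(dims[l - 1], dims[l]) * 0.01
    params['B' + str(l)] = np.zeros(dims[l])

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

A = np.random.rand(64, 784)       # a batch of 64 flattened 28x28 images
for l in range(1, len(dims)):
    A = leaky_relu(A @ params['W' + str(l)] + params['B' + str(l)])

print(A.shape)   # (64, 10)
```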
Step 6: Loss function

Once the model has made a prediction we need a metric to measure its success. We’ll implement two possible loss functions depending on what kind of problem we are applying our model to. For a regression problem we will want to use mean square error (MSE) and for a classification problem we’ll want to use cross entropy with softmax. We will calculate the gradient of the loss ahead of time and store it ready for the backward pass. For more information on loss functions see this previous post.

Mean Square Error

Mean square error: L = (1/n) Σ (xᵢ − yᵢ)², averaged over every element in the batch
                def MSE_loss(self, x, y):
                    # gradient of the mean over every element, consistent with np.mean below
                    self.gradients['L'] = 2 * (x - y) / y.size
                    return np.mean((x-y) ** 2)

Cross Entropy with Softmax

Softmax is applied first to convert each activation into a probability. The cross entropy is then calculated to assess how similar the predicted distribution is to the target distribution. The derivative involves a number of steps but simplifies incredibly nicely. For a detailed overview, head here.

Cross entropy loss: L = −(1/N) Σᵢ Σⱼ yᵢⱼ log(sᵢⱼ), where s = softmax(x) and N is the batch size
                def cross_entropy(self, x, y):
                    # subtract the row-wise max before exponentiating for numerical stability
                    exps = np.exp(x - np.max(x, axis=1, keepdims=True))
                    s = exps / np.sum(exps,axis=1,keepdims=True)
                    loss = np.mean(np.sum( y * -np.log(s), axis=1))
                    self.gradients['L'] = (s - y) / y.shape[0]
                    return loss
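The neat closed form of the gradient (softmax output minus target, divided by the batch size) is easy to verify numerically; a standalone sketch:

```python
import numpy as np

def cross_entropy(x, y):
    exps = np.exp(x - np.max(x, axis=1, keepdims=True))   # shift for stability
    s = exps / np.sum(exps, axis=1, keepdims=True)
    loss = np.mean(np.sum(-y * np.log(s), axis=1))
    grad = (s - y) / y.shape[0]
    return loss, grad

np.random.seed(4)
x = np.random.randn(8, 10)
y = np.eye(10)[np.random.randint(0, 10, 8)]   # random one-hot targets

loss, grad = cross_entropy(x, y)

# finite-difference check of d(loss)/dx for one entry
eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[0, 0] += eps
xm[0, 0] -= eps
numeric = (cross_entropy(xp, y)[0] - cross_entropy(xm, y)[0]) / (2 * eps)
print(abs(numeric - grad[0, 0]))   # tiny, ~1e-10 or less
```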
Step 7: Backward pass

We now need to backpropagate through the network and calculate the gradients for each parameter. At each stage we apply the chain rule and multiply the local gradient by the gradient of the loss with respect to the current position to move one more step backwards in the network. We start by taking the gradient of the loss that we have already calculated. We have already built functions to calculate the derivative of the activation layers and the batch normalisation layers so these steps are relatively straightforward.

For the linear layer, we need to calculate the derivatives for the weights and biases, as well as for the input so that it can be passed to the next layer. The derivatives are stored so that we can use them to update the parameters later on. The derivatives are given as follows:

Backward pass for the linear layer Z = AW + B, with upstream gradient dZ: ∂L/∂W = Aᵀ·dZ, ∂L/∂B = Σ_batch dZ, ∂L/∂A = dZ·Wᵀ
                def backward(self):
                    dZ = self.gradients['L']
                    for i in range(self.layers):
                      layer = self.layers - i
                      dA = self.activation_backwards(dZ,self.activations['Z'+ str(layer)])
                      if self.use_batch_norm:
                        dA = self.batch_norm_backward(dA, layer)
                      self.gradients['B'+ str(layer)] += np.sum(dA,axis=0)
                      self.gradients['W'+ str(layer)] += self.activations['A'+ str(layer-1)].T @ dA
                      dZ = dA @ self.parameters['W'+ str(layer)].T
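The linear-layer derivatives can be checked the same way; a standalone sketch with a single layer and an MSE objective:

```python
import numpy as np

np.random.seed(5)
W = np.random.randn(3, 2) * 0.5
B = np.zeros(2)
x = np.random.randn(4, 3)
y = np.random.randn(4, 2)

def forward_loss(W, B):
    out = x @ W + B                     # single linear layer, MSE loss
    return np.mean((out - y) ** 2), out

loss, out = forward_loss(W, B)
dZ = 2 * (out - y) / out.size           # gradient of the mean over all elements
dW = x.T @ dZ                           # dL/dW = A^T . dZ
dB = dZ.sum(axis=0)                     # dL/dB = sum over the batch

# finite-difference check for one weight
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 0] += eps
Wm[1, 0] -= eps
numeric = (forward_loss(Wp, B)[0] - forward_loss(Wm, B)[0]) / (2 * eps)
print(abs(numeric - dW[1, 0]))   # tiny, ~1e-10 or less
```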
Step 8: Update the parameters

Once the gradients have been calculated, the next step is to perform gradient descent and update the parameters. We will use simple stochastic gradient descent (SGD) rather than any more advanced optimiser. This simply multiplies the gradient by a constant (the learning rate) and then updates the parameter by subtracting this value. Remember that each gradient is the gradient of the loss with respect to that parameter. Therefore by subtracting the gradient the parameter is updated in a way that should reduce the loss, hence improving the accuracy of the model.

Update step: θ ← θ − lr·∂L/∂θ for each parameter θ (W, B, γ and β)
                def step(self):
                    for layer in range(1,self.layers+1):
                      self.parameters['W'+ str(layer)] -= self.lr * self.gradients['W'+ str(layer)]
                      self.parameters['B'+ str(layer)] -= self.lr * self.gradients['B'+ str(layer)]
                      self.parameters['gamma'+ str(layer)] -= self.lr * self.gradients['gamma'+ str(layer)]
                      self.parameters['beta'+ str(layer)] -= self.lr * self.gradients['beta'+ str(layer)]
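The idea in miniature: gradient descent on a one-dimensional quadratic walks the parameter towards the minimum:

```python
# minimise f(w) = (w - 3)^2 with plain SGD: w -= lr * df/dw
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad
print(round(w, 6))   # 3.0
```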
Step 9: Training loop

By now the model is essentially built but we can add some additional helper functions to make training a bit easier. During training, we need to loop through the data loader (see below) to get each batch of training examples. The first step is to zero the gradients so that we are not adding the new gradients to the previous ones. Then we perform our forward pass, calculate the loss and perform our backward pass, then update the parameters and store the losses to plot later.

                def train_model(self, dataloader):
                    losses = []
                    self.train_mode()
                    iterator = iter(dataloader)
                    for batch in range(len(dataloader)):
                      x,y = next(iterator)
                      self.zero_grad()
                      out = self.forward(x)
                      loss = self.loss(out,y)
                      self.backward()
                      self.step()
                      losses.append(loss)
                    return losses

To test the accuracy of our model we need to use the test examples that we have set aside. Here, we complete the forward pass on each batch and then assess whether the prediction made by the model is correct. This can then be compiled to give an overall accuracy score.

                def test_model(self, dataloader):
                    count = 0
                    self.test_mode()
                    iterator = iter(dataloader)
                    for batch in range(len(dataloader)):
                      x,y = next(iterator)
                      out = self.forward(x)
                      count += np.where(np.argmax(out,axis=1) == np.argmax(y,axis=1), 1, 0).sum()
                    accuracy = 100* count / (len(dataloader)*x.shape[0])
                    return accuracy
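The accuracy calculation itself is just an argmax comparison; a tiny standalone example:

```python
import numpy as np

out = np.array([[0.1, 0.8, 0.1],   # predicted class 1
                [0.7, 0.2, 0.1],   # predicted class 0
                [0.2, 0.3, 0.5]])  # predicted class 2
y = np.eye(3)[[1, 2, 2]]           # true classes: 1, 2, 2

correct = (np.argmax(out, axis=1) == np.argmax(y, axis=1)).sum()
accuracy = 100 * correct / out.shape[0]
print(accuracy)   # 2 of 3 correct -> 66.7%
```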

Putting this all together, we can create an overall training program which first completes a training loop, then calculates the accuracy, and repeats for a given number of epochs. We can then plot the losses at the end.

                
                def train(self, train_dataloader, test_dataloader, epochs=1):
                    losses = []
                    for epoch in range(epochs):
                      epoch_losses = self.train_model(train_dataloader)
                      losses = np.concatenate((losses, epoch_losses))
                      accuracy = self.test_model(test_dataloader)
                      print(f'[Epoch {epoch+1}/{epochs}] Accuracy: {accuracy}%')
                    plt.plot(losses)
Step 10: Data

Now that the model is built, we’ll test it on a dataset. Using Google Colab, MNIST is easily accessible in the sample_data folder as a CSV file. Following the same paradigm as PyTorch, we can create a Dataset class to store our dataset and a Dataloader class to iterate through it during training.

The Dataset class separates the target value from the input data, converts the bytes to floats between 0 and 1 and returns the input and target as a tuple. We’ll also include a ‘view_sample’ method to visualise the incoming data.

                class Dataset():
                  def __init__(self, path):
                    self.path = path
                    self.samples = self.get_samples()

                  def get_samples(self):
                    samples = []
                    data = pd.read_csv(self.path).to_numpy()
                    for i in range(data.shape[0]):
                      y = np.zeros(10)
                      y[data[i][0]] = 1
                      x = data[i][1:] / 255
                      samples.append((x,y))
                    return samples

                  def __len__(self):
                    return len(self.samples)

                  def __getitem__(self, idx):
                    return self.samples[idx]

                  def view_sample(self, d=3):
                    fig = plt.figure(figsize=(5,5))
                    rows,cols = d,d
                    for i in range(1,rows*cols + 1):
                      idx = np.random.randint(len(self))
                      x,y = self[idx]
                      label = np.argmax(y)
                      img = np.reshape(x,(28,28))
                      fig.add_subplot(rows,cols,i)
                      plt.axis('off')
                      plt.title(label)
                      plt.imshow(img,cmap="gray")
                    plt.plot()

The Dataloader class takes the dataset and returns an iterator that yields a batch of inputs and a batch of targets as a tuple, according to the specified batch size.

                class Dataloader():
                  def __init__(self,dataset,bs=64):
                    self.dataset = dataset
                    np.random.shuffle(self.dataset.samples)
                    self.bs = bs

                  def __len__(self):
                    return len(self.dataset) // self.bs

                  def __iter__(self):
                    np.random.shuffle(self.dataset.samples)
                    self.n = 0
                    return self

                  def __next__(self):
                    if self.n < len(self):
                      start = self.n * self.bs
                      batch = [self.dataset[start + i] for i in range(self.bs)]
                      x = np.stack([sample[0] for sample in batch])
                      y = np.stack([sample[1] for sample in batch])
                      self.n += 1
                      return (x,y)
                    else:
                      raise StopIteration
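The batching step amounts to stacking individual (input, target) samples along a new batch dimension; a quick standalone illustration with a toy dataset:

```python
import numpy as np

# toy "dataset": 10 samples of (input vector, one-hot target)
samples = [(np.random.rand(784), np.eye(10)[i % 10]) for i in range(10)]
bs = 4

# assemble one batch, equivalently to the Dataloader above
batch = samples[0:bs]
x = np.stack([s[0] for s in batch])
y = np.stack([s[1] for s in batch])
print(x.shape, y.shape)   # (4, 784) (4, 10)
```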
Results

We can then specify our hyperparameters, create the data loaders, create an instance of our model and, finally, train it.

                DIMS = [28*28, 100, 50, 10]
                BS = 64
                LR = 0.01
                BATCHNORM = True
                ACTIVATION = "Leaky ReLU"           # 'Leaky ReLU' or 'Sigmoid'
                LOSS_FUNC = "Cross Entropy"         # 'Cross Entropy' or 'MSE'
                EPOCHS = 10

                train_data = Dataset(train_path)
                test_data = Dataset(test_path)
                train_dataloader = Dataloader(train_data, BS)
                test_dataloader = Dataloader(test_data, BS)

                train_data.view_sample()

                model = Model(DIMS, lr=LR, activation=ACTIVATION, loss=LOSS_FUNC, use_batch_norm=BATCHNORM)

                model.train(train_dataloader, test_dataloader, epochs=EPOCHS)
Model Results

Creating an identical model using PyTorch reveals almost identical results (see separate notebook), confirming our successful implementation, although unsurprisingly the PyTorch model is about 10x faster.

Analysis

Given that we have written a lot more code to get identical results in a lot more time, is there any benefit to what we have done? Well, apart from being a useful learning exercise, it gives us the opportunity to assess what is going on as our model trains. By storing the activations and gradients as the model learns, we can visualise what is happening. Machine learning models often fail silently and, even when it is clear that they are not performing as they should, it is not immediately obvious why. Finding ways to visualise the learning process helps open up the black box that is our machine learning model, giving us a better understanding of what it is doing and helping us to improve it.

Saving the gradients after each iteration allows us to see how the gradients change over time. For a deep network, the benefits of batch normalisation become immediately obvious when we visualise these gradients. Without normalisation, the deeper layers in the network simply do not train as the gradients deeper in the network are very small and tend towards zero. On the other hand, with normalisation, the gradients remain active throughout training.

Without Batch Normalisation
With Batch Normalisation

The top image shows how the gradients change during training without batch normalisation, and the bottom image shows the difference when batch normalisation is added. The images are first resized to 7x7 so that the input dimension is smaller, making the visualisation clearer. The forward pass proceeds from left to right; the larger boxes represent the weights and the narrow columns represent the biases.

We can also train the model backwards to understand what it is ‘seeing’. First, the parameters are fixed, then a batch of random input vectors are trained on an identical target, say the number 1. The resulting trained input is averaged across the batch dimension to create an image which represents the perfect 1 as far as the model is concerned. It gives an insight into what the model has identified as the key traits for that digit. Unsurprisingly, the area around the edges appears to be of little importance and the focus is instead on the digit itself and the space immediately around it. Most of the digits are fairly clear but some are incomplete (e.g. the bottom right hand corner of the 8) indicating that it has not identified that as a critical part.

Reverse Generated Digits

The digits from 1 to 9 reverse generated using the classification model. Whilst the digits are clearly recognisable, there are subtle differences, indicating that the domain mapping to each digit is not perfect and there are differences between what the model perceives as an 8 and what a human perceives as an 8. More varied training data would be needed to improve this.

The source code for this project can be viewed on GitHub. The code is split into two notebooks: Version 1 is the streamlined version which just contains the model. Version 2 contains additional code for visualisation.