The earliest neural networks (NNs) are the feed-forward neural networks or multilayer perceptrons (MLPs), in which input values flow forward through the network (the forward pass) and gradients/derivatives flow backward through it (backpropagation). Neurons are interconnected units/nodes arranged in layers. Each node in the input layer corresponds to a feature of the dataset. NNs can be wide, having many neurons in a given hidden layer, and/or deep, having many hidden layers. More neurons/nodes enable more complex learning, at the cost of a higher risk of overfitting and greater computational cost.
Forward propagation is when data moves from left (input layer) to right (output layer) in the network; backward propagation is when the gradient moves from right to left. Prominent NN architectures include RNNs, CNNs, GNNs, GANs, and transformers.
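To make the forward pass concrete, here is a minimal sketch of a one-hidden-layer MLP in NumPy; all shapes, weight values, and the choice of ReLU are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def relu(x):
    # Non-linear activation applied element-wise
    return np.maximum(0.0, x)

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer MLP: input -> hidden (ReLU) -> linear output."""
    h = relu(x @ W1 + b1)   # hidden layer activations
    return h @ W2 + b2      # output layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 samples, 3 features each
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)
y = mlp_forward(x, W1, b1, W2, b2)   # shape (4, 2)
```

A real network would also compute a loss and run a backward pass to update `W1`, `b1`, `W2`, and `b2`.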
Architectures of a neural network
📌 Recurrent Neural Networks (RNNs) process sequential data. Unlike feedforward NNs which process data in a single pass, RNNs process data across multiple time steps. This makes them well-suited for tasks like natural language processing (NLP) and time-series forecasting. They can learn patterns in sequences by connecting the output from one time step to the input of the next, remembering previous information.
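The "output of one time step feeds into the next" idea can be sketched as a single recurrent cell; the tanh cell and all dimensions below are illustrative assumptions.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # New hidden state mixes the previous state with the current input
    return np.tanh(h_prev @ W_h + x_t @ W_x + b)

rng = np.random.default_rng(1)
W_h = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(3, 4)) * 0.1   # input-to-hidden weights
b = np.zeros(4)

h = np.zeros(4)                       # initial hidden state ("memory")
sequence = rng.normal(size=(6, 3))    # 6 time steps, 3 features each
for x_t in sequence:                  # the same weights are reused at every step
    h = rnn_step(h, x_t, W_h, W_x, b)
```

Reusing `W_h` and `W_x` at every time step is what lets the network share what it learns across positions in the sequence.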
📌 Convolutional Neural Networks (CNNs) are specifically designed for processing spatial data, such as images. They use special convolutional layers to scan and identify local patterns within the input. This makes them particularly efficient for object detection and other computer vision tasks.
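As an illustration of what a convolutional layer computes, here is a naive 2-D convolution (strictly, cross-correlation) over a single-channel image; the image and kernel are toy values.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value summarizes one local patch of the input
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0]])  # responds to horizontal intensity changes
feat = conv2d(image, edge_kernel)      # feature map, shape (5, 4)
```

The same small kernel is applied at every position, which is why convolutional layers need far fewer parameters than fully connected ones.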
📌 Graph Neural Networks (GNNs) operate on graph-structured data. They are designed to learn and encode the relationships (edges) between nodes in a graph, making them useful for tasks such as social network analysis, molecular property prediction, and recommendation systems. Information in the form of scalars or embeddings can be stored at each graph node or edge.
GNNs on graphs with translational symmetry in all dimensions are CNNs; GNNs on a one-dimensional directed line graph are RNNs.
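A minimal sketch of one round of message passing on a small graph, assuming a simple neighbour-averaging update rule (one of many possible GNN schemes):

```python
import numpy as np

def message_pass(A, H, W):
    """One round: each node averages its neighbours' embeddings, then transforms."""
    deg = A.sum(axis=1, keepdims=True)            # number of neighbours per node
    neighbour_mean = (A @ H) / np.maximum(deg, 1) # aggregate messages along edges
    return np.tanh(neighbour_mean @ W)            # learned transformation

# 4-node undirected path graph 0-1-2-3, as an adjacency matrix
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(2)
H = rng.normal(size=(4, 3))        # a 3-dimensional embedding per node
W = rng.normal(size=(3, 3)) * 0.5
H_next = message_pass(A, H, W)     # updated node embeddings
```

Stacking several such rounds lets information propagate along longer paths in the graph.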
📌 Transformers are architectures that rely on a self-attention mechanism to process input data, allowing them to handle long-range dependencies effectively. Self-attention captures relationships within input sequences and weighs the importance of the different words/tokens of the sequence. For details of this mechanism, refer to the article by Sebastian Raschka. Transformers also incorporate feed-forward NNs as components of their architecture.
The self-attention mechanism in a decoder of the transformer architecture can be viewed as a GNN, that is, a neural network on a fully connected graph over all tokens of the context window. More precisely, it can be thought of as a directed graph in which each token is connected to all previous tokens in the context window. Transformers have been especially successful in tasks such as language translation and text summarization due to their ability to capture contextual information across long sequences.
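A sketch of causally masked scaled dot-product self-attention in NumPy, matching the "each token attends to itself and all previous tokens" view above; a single head with random weights, for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # pairwise token affinities
    # Causal mask: each token may only attend to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))          # 5 tokens, 4-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

The attention weights are exactly the (directed, weighted) edges of the token graph described above: the first token can attend only to itself, the last token to every token.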
📌 Generative Adversarial Networks (GANs) consist of two distinct neural networks, a generator and a discriminator, that compete against each other. The generator creates a data sample, and the discriminator determines whether that sample came from the observed training data or from the generator. By optimizing against each other, GANs learn to generate new samples. A GAN extracts representative latent embeddings of the observed data distribution and is known to approximate distributions very well.
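The adversarial objective can be illustrated with a toy 1-D setup: a logistic discriminator scores real samples against samples from a "generator" that simply shifts Gaussian noise. The distributions, the shift parameter, and the discriminator form are illustrative assumptions, not a real GAN.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
real = rng.normal(2.0, 1.0, size=256)        # observed data distribution
theta = 0.0                                  # generator parameter (a simple shift)
fake = rng.normal(0.0, 1.0, size=256) + theta

w, b = 1.0, 0.0                              # discriminator D(x) = sigmoid(w*x + b)
d_real = sigmoid(w * real + b)               # should be near 1 for real samples
d_fake = sigmoid(w * fake + b)               # should be near 0 for generated ones
# The discriminator maximizes this objective; the generator tries to fool it
d_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))
```

Training alternates gradient updates: the discriminator to decrease `d_loss`, the generator (here, `theta`) to increase the discriminator's error on fakes.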
Hyper-parameters of a neural network
A network is a structure of interconnected nodes: neurons arranged in layers, with one or more input nodes, function nodes (activation functions), and output nodes. Typical NN hyper-parameters are the activation function, dropout, number of epochs, early stopping, batch size, and learning rate.
📌 Activation function → The function nodes of a neural network, taken together, form one or more hidden layers, which we configure according to the mathematical operation performed. A hidden layer applies a weighted operation to the inputs it receives from the input layer and passes the result onward to the output layer. This mathematical operation is the activation function, and it must be non-linear in order for the network to learn the underlying complex patterns in the data and generalize from them.
An activation function decides how much signal to pass on to the next layer based on the input it receives. Chaining many weighted signals together in this way is what allows NNs to learn complex relationships in the dataset. Non-linear activation functions help solve intricate problems by adding layers of abstraction. Stacking layers of perceptrons, one feeding into the next, gives us MLPs; these layers are hidden and may be arbitrary in number. If we want to model a perceptron, we need a step function as the activation function.
For simplicity, bias terms are not shown in the perceptron. Weights are attached to the connections (edges) from the input nodes to the function nodes; as signals move forward through the network, these weights scale the input signals and are adjusted during training. There are different types of activation functions. ReLU reduces computational cost and mitigates the vanishing gradient problem, but it can lead to dead units or neurons (some of which never activate). The sigmoid (logistic) function can suffer from the vanishing gradient problem during backpropagation.
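The three activation functions mentioned above, sketched in NumPy:

```python
import numpy as np

def step(x):
    # The perceptron's original activation: fires (1) or not (0)
    return (x >= 0).astype(float)

def relu(x):
    # Cheap to compute; gradient is 1 for positive inputs,
    # but units stuck at 0 can "die"
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes to (0, 1); gradients vanish for large |x|
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
```

Applying each function to `x` shows the difference: `step` is all-or-nothing, `relu` passes positive values through unchanged, and `sigmoid` compresses everything into (0, 1).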
📌 Dropout → It refers to randomly dropping out (omitting) neurons from both hidden and visible layers while training a model, in order to improve the network’s performance. It’s a regularization technique to prevent overfitting.
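A sketch of (inverted) dropout, assuming a drop probability of 0.5; rescaling the surviving units keeps the expected activation unchanged, which is the common convention.

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero units with probability p_drop, rescale survivors."""
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(5)
h = np.ones((2, 10))                        # toy hidden-layer activations
h_train = dropout(h, p_drop=0.5, rng=rng)   # applied only during training
```

At inference time dropout is switched off and the full network is used.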
📌 Epoch → It is one full cycle (complete pass) of learning over the training data; in other words, an epoch is one iteration over the dataset. Too many epochs may lead to overfitting of the model. There needs to be just the right number of epochs to arrive at an optimally fitted model.
During each epoch, the algorithm updates the weights so as to move toward the minimum of the loss function: predictions are made, the errors are propagated backward through the network, and the weights are adjusted. This process is repeated, epoch after epoch, to produce optimal results.
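A toy training loop illustrating epochs, assuming full-batch gradient descent on a linear model with mean squared error (all data and numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)  # noisy targets

w = np.zeros(2)          # weights to learn
lr = 0.1                 # learning rate
losses = []
for epoch in range(50):                  # one epoch = one full pass over the data
    pred = X @ w
    loss = np.mean((pred - y) ** 2)      # mean squared error
    grad = 2 * X.T @ (pred - y) / len(y) # gradient of the loss w.r.t. w
    w -= lr * grad                       # one weight update per epoch here
    losses.append(loss)
```

The recorded losses decrease across epochs as `w` approaches `true_w`; with too many epochs on a richer model, the training loss would keep falling while validation loss rises, which is the overfitting scenario described above.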
📌 Early stopping → It’s an implicit form of regularization that provides guidance on how many iterations can be run before the learner network begins to overfit. Beyond a certain point, improving the learner’s fit to the training data comes at the expense of increased generalization error (also known as out-of-sample error).
The validation curve (error vs. epoch) shows that the early-stopping point is just the right point to stop training the learner, so it doesn’t overfit. The size of the dataset is also a crucial consideration while the model is trained.
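Early stopping can be sketched as watching the validation error and stopping once it has failed to improve for a few consecutive epochs; the `patience` parameter below is an illustrative convention, not from the text.

```python
def early_stop_epoch(val_errors, patience=3):
    """Return the best epoch to stop at: the epoch with the lowest
    validation error, detected once it fails to improve for
    `patience` consecutive epochs."""
    best = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation error has stopped improving
    return best_epoch

# Validation error dips, then rises again as the model starts to overfit
val = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
stop_at = early_stop_epoch(val, patience=3)
```

In practice one also keeps a copy of the weights from the best epoch and restores them when training stops.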
📌 Batch → It is either a subset of the training data (mini-batch), a single example from the training data, or the entire training data (full batch) used in an epoch while training a model. The weight updates happen once per batch. For details on how to choose the batch size, refer to the article by Sebastian Raschka.
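A sketch of mini-batch iteration: shuffle once per epoch, then yield fixed-size slices (the last batch may be smaller). The batch size of 4 is an arbitrary illustration.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the dataset, then yield successive mini-batches."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(7)
X = np.arange(20).reshape(10, 2)    # 10 examples, 2 features
y = np.arange(10)
batches = list(minibatches(X, y, batch_size=4, rng=rng))  # sizes 4, 4, 2
```

One weight update per yielded batch gives mini-batch gradient descent; `batch_size=1` gives the stochastic variant and `batch_size=len(X)` the full-batch variant.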
Gradient descent is the iterative method used to train NNs; the algorithm has to take just the right step (a step size neither too small nor too large) at each iteration until convergence. This step size is called the learning rate.
📌 Learning rate → It indicates the step size that the gradient descent optimization method takes to move towards a local optimum. If the learning rate is too small, it will take more time to reach the optimum; if it is too large, the updates might start to diverge and never reach the optimal point.
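The effect of the step size can be seen on the toy objective f(x) = x², whose gradient is 2x; the specific learning rates below are chosen only to illustrate the three regimes.

```python
def gradient_descent(lr, steps=25, x0=5.0):
    """Minimize f(x) = x**2 (gradient 2x) from x0 with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # one gradient descent update
    return x

x_small = gradient_descent(lr=0.01)   # creeps toward 0, still far away
x_good  = gradient_descent(lr=0.1)    # lands close to the optimum at 0
x_large = gradient_descent(lr=1.1)    # overshoots on every step and diverges
```

The same three regimes (too slow, just right, divergent) occur when training real networks, just in many dimensions at once.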
Reaching convergence while training NNs can be difficult; nonetheless, there are ways to control it, such as picking an optimizer. A popular choice is the Adam optimizer, wherein the per-parameter step sizes are not set manually: Adam adapts the learning rate for each parameter during training.
Important points while building a neural network
📌 Goal
An important consideration, besides the target value or performance benchmark, is the choice of performance metric. Several metrics may be used to measure the effectiveness of an application backed by neural networks, and they are usually different from the loss function used while training the NN. The metrics have to align with the business goal.
📌 Data
When deciding whether more data is required, it is also necessary to decide how much more to gather. It is helpful to plot learning curves showing the relationship between training-set size and model error; by extrapolating such curves, one can estimate how much additional training data would be needed to achieve a certain level of performance. Usually, adding a small fraction of the total dataset will not have a noticeable effect on generalization error, so it is recommended to experiment with training-set sizes on a logarithmic scale. If gathering more data is not feasible, the only other way to improve generalization error is to improve the learning algorithm itself, which is largely the domain of researchers rather than applied practitioners.
📌 Model Capacity
When deciding on adjusting hyperparameters for model improvement, there are two basic approaches - manual and automatic. Manual hyperparameter tuning can work very well when there’s a good starting point. For many applications however, these starting points are not available and in those cases, automated hyperparameter tuning helps find the optimal configuration.
When there are only a few hyperparameters to tune, the common practice is to perform a grid search. An alternative to grid search is random search, which is less exhaustive and converges faster to good hyperparameter values. The main reason random search finds a good configuration faster than grid search is that it has no wasted experimental runs.
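A sketch of random search over two hyperparameters; the `quality` function is a hypothetical stand-in for a real validation score (training a model per trial), and sampling the learning rate on a log scale is a common convention.

```python
import random

def quality(lr, dropout):
    """Stand-in for validation accuracy; a real run would train and evaluate
    a model with these hyperparameters. Peaks at lr=0.01, dropout=0.3."""
    return -((lr - 0.01) ** 2) * 1e4 - (dropout - 0.3) ** 2

random.seed(0)
best_score = float("-inf")
best_cfg = None
for _ in range(50):                      # 50 independent random configurations
    lr = 10 ** random.uniform(-4, -1)    # learning rate sampled on a log scale
    dropout = random.uniform(0.0, 0.6)
    score = quality(lr, dropout)
    if score > best_score:
        best_score, best_cfg = score, (lr, dropout)
```

Unlike a grid, every trial probes a fresh value of every hyperparameter, so no run is wasted repeating a value that turned out not to matter.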
A neural network with more layers and more hidden units per layer has higher representational capacity, that is, it is capable of representing more complicated functions. This does not mean it will necessarily learn the complex relationships: the training algorithm may fail to discover the functions that best minimize the cost function, or regularization terms such as weight decay may forbid some of these functions.
Hyper-parameters other than the learning rate require monitoring both training and test errors to diagnose whether the model is overfitting, and then adjusting the network’s capacity appropriately. The learning rate is perhaps the most important hyperparameter during optimization: it controls the effective model capacity in a more complicated way than the others, and for a given problem the effective capacity is highest when the learning rate is correct.
📌 Framework
Practitioners typically turn to existing frameworks to solve problems with deep learning. The most widely used and popular deep learning frameworks are TensorFlow and PyTorch.
In general, these points remain valid for the development and deployment of any predictive model.