Introduction to Deep Learning

Rhnyewale · The Startup · Nov 1, 2020

Neural networks (short for artificial neural networks) are models inspired by the structure of neurons in our brains (biological neural networks).

Each cell in a neural network is called a neuron and is connected to multiple other neurons. Neurons in human (and mammalian) brains communicate by sending electrical signals to one another.

But these are the only similarities between biological neural networks and artificial neural networks.

Deep Neural Network

A deep neural network is a specific type of neural network that excels at capturing nonlinear relationships in data. Deep neural networks have broken many benchmarks in audio and image classification. Previously, linear models were often used with nonlinear transformations that were discovered by hand through research.

Deep neural networks can, to some extent, discover how to structure these nonlinear transformations automatically during the training process, and they have grown into a helpful tool for many problems.

Graphs

Neural networks are usually represented as graphs. A graph is a data structure that consists of nodes (represented as circles) that are connected by edges (represented as lines).

Graphs are commonly used to represent how components of a system are related or linked. For example, the Facebook Social Graph describes how all of the users on Facebook are connected to each other (and this graph is changing constantly as friends are added and removed). Google Maps uses graphs to represent locations in the physical world as nodes and roads as edges.

Graphs are a highly flexible data structure; you can even represent a list of values as a graph. Graphs are often categorized by their properties, which act as constraints.

Graphs provide a mental model for thinking and reasoning about a specific class of models — those that consist of a series of functions that are executed in a specific order. In the context of neural networks, graphs let us compactly express a pipeline of functions that we want to be executed in succession.

This pipeline has two stages of functions that happen in sequence:

  • In the first stage, L1 is computed: L1 = X * a1
  • In the second stage, L2 is computed: L2 = L1 * a2

The second stage can’t happen without the first stage, because L1 is an input to the second stage. The heart of neural network models is the successive computation of functions. This is known as a computational graph. A computational graph uses nodes to describe variables and edges to describe how variables are combined.
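Here is a minimal numpy sketch of that two-stage pipeline (the shapes and values below are made up purely for illustration):

```python
import numpy as np

X = np.random.rand(5, 3)    # 5 observations, 3 features (hypothetical)
a1 = np.random.rand(3, 4)   # weights for the first stage
a2 = np.random.rand(4, 1)   # weights for the second stage

L1 = X @ a1   # first stage:  L1 = X * a1
L2 = L1 @ a2  # second stage: L2 = L1 * a2, which requires L1

print(L2.shape)  # (5, 1): one output value per observation
```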

Here’s a simple example:

y = a1x1 + a2x2

The computational graph is a powerful representation, as it allows us to compactly represent models with many layers of nesting. In fact, a decision tree is really a specific type of computational graph. There’s no compact way to express a decision tree model using just equations and standard algebraic notation.

Linear Algebra and Neural Network Representation

Linear regression is represented as:

y = a0 + a1x1 + a2x2 + … + anxn

Where:

  • a0 represents the intercept (also known as the bias)
  • a1 to an represent the trained model weights
  • x1 to xn represent the features
  • y represents the predicted value

The first step is to rewrite this model using linear algebra notation, as a product of two vectors:

Xa^T = y

Here X = [1, x1, …, xn] includes a leading 1 so that the intercept a0 folds into the weight vector a = [a0, a1, …, an].
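As a quick sanity check, here is a small numpy sketch of that vector product; the feature and weight values are made up:

```python
import numpy as np

# Prepending 1 to the feature vector lets the intercept a0 be handled
# by the same dot product as the other weights.
x = np.array([2.5, 0.3, 1.0])          # features x1, x2, x3 (hypothetical)
a = np.array([0.5, 1.2, -0.7, 2.0])    # weights a0, a1, a2, a3 (hypothetical)

X = np.concatenate(([1.0], x))         # [1, x1, x2, x3]
y = X @ a                              # a0 + a1*x1 + a2*x2 + a3*x3
print(y)
```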

Neural Network Representation

In the neural network representation of this model:

  • each feature column in a data set is represented as an input neuron
  • each weight value is represented as an arrow from the feature column it multiplies to the output neuron

The neurons and arrows act as a visual metaphor for the weighted sum, which is how the feature columns and weights are combined.

Inspired by biological neural networks, an activation function determines if the neuron fires or not. In a neural network model, the activation function transforms the weighted sum of the input values. For this network, the activation function is the identity function. The identity function returns the same value that was passed in:

f(x) = x
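To make this concrete, here is a tiny sketch of a single output neuron that applies the identity activation to its weighted sum; the weights, bias, and inputs are illustrative:

```python
def identity(x):
    # The identity activation returns its input unchanged.
    return x

def neuron(features, weights, bias):
    # Weighted sum of the inputs, followed by the activation function.
    weighted_sum = sum(w * f for w, f in zip(weights, features)) + bias
    return identity(weighted_sum)

print(neuron([2.0, 3.0], [0.5, -1.0], 0.1))  # 0.5*2 - 1.0*3 + 0.1 = -1.9
```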

While the activation function isn’t interesting for a network that performs linear regression, it’s useful for logistic regression and more complex networks. The equation and the neuron diagram are simply two representations of the same linear regression model.

Because the inputs from one layer of neurons feed into the next layer (here, a single output neuron), this is known as a feedforward network. In the language of graphs, a feedforward network is a directed, acyclic graph.

Fitting a Network

In the Linear Regression for Machine Learning course, we explored two different approaches to training a linear regression model: gradient descent and ordinary least squares. Gradient descent is the most common technique for fitting neural network models.
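As a rough sketch of the idea, here is gradient descent fitting a one-feature linear regression with numpy; the data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # column of ones + feature
y = np.array([3.0, 5.0, 7.0])                       # targets from y = 1 + 2x

a = np.zeros(2)          # weights [a0, a1], initialized to zero
learning_rate = 0.05

for _ in range(2000):
    predictions = X @ a
    gradient = 2 * X.T @ (predictions - y) / len(y)  # gradient of mean squared error
    a -= learning_rate * gradient                    # step against the gradient

print(a)  # approaches [1.0, 2.0]
```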

Activation function

The three most commonly used activation functions in neural networks are:

  • the sigmoid function
  • the ReLU function
  • the tanh function

ReLU activation function

The ReLU activation function is commonly used in neural networks for solving regression problems. ReLU stands for rectified linear unit and is defined as follows:

ReLU(x) = max(0,x)

The max(0,x) function call returns the maximum value between 0 and x. This means that:

  • when x is less than 0, the value 0 is returned
  • when x is greater than or equal to 0, the value x is returned

The ReLU function returns the positive component of the input value. Let’s visualize the expressivity of a model that performs a linear combination of the features and weights followed by the ReLU transformation:
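Here is a minimal sketch of that visualization, applying ReLU to a simple linear combination (the weight and bias values are just illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-3, 3, 200)
weighted_sum = 1.5 * x - 0.5       # a made-up linear combination
plt.plot(x, relu(weighted_sum))    # flat at 0, then a straight line
plt.xlabel("x")
plt.ylabel("ReLU(1.5x - 0.5)")
plt.show()
```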

tanh Function

Linking the unit circle with the Cartesian coordinate system, tan is just the ratio between the y value (opposite) and the x value (adjacent) for the point on the unit circle corresponding to the angle.

Plotting Tan

To plot the tangent function, we need to use radians on the x-axis instead of degrees. To describe a full trip around the circle, radians range from 0 to 2π while degrees range from 0 to 360.

The periodic sharp spikes that you see in the plot are known as vertical asymptotes. At those points, the value isn’t defined but the limit approaches either negative or positive infinity (depending on which direction you’re approaching the x value from).

The key takeaway from the plot is how the tangent function is a repeating, periodic function. A periodic function is one that returns the same value at regular intervals.

The tangent function repeats itself every π, which is known as the period. The tangent function isn’t known to be used as an activation function in neural networks (or any machine learning model really) because the periodic nature isn’t a pattern that’s found in real datasets.

While there have been some experiments with periodic functions as the activation function for neural networks, the general conclusion has been that periodic functions like tangent don’t offer any unique benefits for modeling.

Generally speaking, the activation functions that are used in neural networks are increasing functions. An increasing function f is a function where f(x) always stays the same or increases as x increases.

While the tangent function describes the ratio of the y and x values on the unit circle, the hyperbolic tangent function describes the ratio of y and x values on the unit hyperbola.

We can use the numpy.tanh() function to compute the hyperbolic tangent of a range of x values.
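For example, here is a minimal sketch over a made-up range of x values:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
y = np.tanh(x)            # numpy.tanh computes tanh element-wise

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("tanh(x)")
plt.show()
```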

You’ll notice that like the sigmoid function, the tanh function has horizontal asymptotes as x approaches negative or positive infinity. In addition, the tanh function also constrains the range (y) to between −1 and 1.

Because of this property, both the sigmoid and the tanh functions are commonly used in neural networks for classification tasks.

The ReLU function, on the other hand, is known to be more effective for regression tasks.

Hidden Layers

So far we have worked with single-layer neural networks. These networks had a single layer of neurons. To make a prediction, that single layer of neurons fed its results directly into the output neuron(s).

We’ll explore how multi-layer networks (also known as deep neural networks) are able to better capture nonlinearity in the data.

In a deep neural network, the first layer of input neurons feeds into a second, intermediate layer of neurons.

The intermediate layers are known as hidden layers, because they aren’t directly represented in the input data or the output predictions. Instead, we can think of each hidden layer as intermediate features that are learned during the training process.

Decision Tree Vs Deep Neural Network

Neural networks are actually structured very similarly to decision trees. In a decision tree, the branches and splits represent intermediate features that are useful for making predictions, and they are analogous to the hidden layers in a neural network.

Each of these hidden layers has its own set of weights and biases, which are discovered during the training process. In decision tree models, the intermediate features in the model represented something more concrete we can understand (feature ranges).

Decision tree models are referred to as white box models because their inner workings can be observed and understood (even if not easily altered). After we train a decision tree model, we can visualize the tree, interpret it, and come up with new ideas for tweaking the model.

Neural networks, on the other hand, are much closer to being a black box. In a black box model, we can understand the inputs and the outputs, but the intermediate features are difficult to interpret and understand. Harder still, and perhaps more importantly, it’s difficult to understand how to tweak a neural network based on these intermediate features.

We’ll learn how adding more layers to a network and adding more neurons in the hidden layers can improve the model’s ability to learn more complex relationships.

Train Neural Network in Scikit-Learn

Let’s learn how to train a neural network with a hidden layer using scikit-learn.

Scikit-learn contains two classes for working with neural networks:

  • MLPClassifier
  • MLPRegressor

We can specify the number of hidden neurons we want to use in each layer using the hidden_layer_sizes parameter. This parameter accepts a tuple where each index value corresponds to the number of neurons in that hidden layer. The parameter is set to the tuple (100,) by default, which corresponds to a hundred neurons in a single hidden layer.

We can specify the activation function we want used in all hidden layers using the activation parameter. This parameter accepts only the following string values (see the sketch after this list):

  • ‘identity’: the identity function
  • ‘logistic’: the sigmoid function
  • ‘tanh’: the hyperbolic tangent (tanh) function
  • ‘relu’: the ReLU function
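Here is a minimal sketch of how these parameters fit together; the synthetic dataset and the specific settings are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A made-up classification dataset with 3 features.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# One hidden layer of 10 neurons, using the ReLU activation.
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='relu',
                    max_iter=1000, random_state=1)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))  # accuracy on the test set
```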

The logistic regression model performed much better (accuracy of 88%) than the neural network model with one hidden layer and one neuron (48%). This network architecture doesn’t give the model much ability to capture nonlinearity in the data, unfortunately, which is why logistic regression performed much better.

This network has 3 input neurons, 6 neurons in the single hidden layer, and 1 output neuron. You’ll notice that there’s an arrow between every input neuron and every hidden neuron (3 x 6 = 18 connections), representing a weight that needs to be learned during the training process. You’ll notice that there’s also a weight that needs to be learned between every hidden neuron and the final output neuron (6 x 1 = 6 connections).

Because every neuron has a connection between itself and all of the neurons in the next layer, this is known as a fully connected network. Lastly, because the computation flows from left (input layer) to right (hidden layer then to output layer), we can call this network a fully connected, feedforward network.

There are two weight matrices (a1 and a2) that need to be learned during the training process, one for each stage of the computation.

The test set prediction accuracy improved to 0.86 when using ten or fifteen neurons in the hidden layer. As we increased the number of neurons in the hidden layer, the accuracy improved substantially from one model to the next.

As an example, consider a neural network with six neurons in the first hidden layer and four neurons in the second hidden layer.

The number of hidden layers and the number of neurons in each hidden layer are hyperparameters that act as knobs for the model behavior.
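For instance, the two-hidden-layer architecture described above can be sketched in scikit-learn with hidden_layer_sizes=(6, 4); the other settings are illustrative:

```python
from sklearn.neural_network import MLPClassifier

# Six neurons in the first hidden layer, four in the second.
mlp = MLPClassifier(hidden_layer_sizes=(6, 4), activation='relu',
                    max_iter=1000, random_state=1)
# mlp.fit(X_train, y_train) would train this deeper network on the same data
# used in the earlier example.
```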
