Introduction to Deep Learning Using Keras and Tensorflow — Part2

Published in

The Startup

10 min read · Nov 7, 2020

Part 1 — https://rhnyewale.medium.com/introduction-to-deep-learning-using-keras-and-tensorflow-part1-e3c6d342ada8
In this part, we’ll learn about the loss function, the optimizer (stochastic gradient descent), learning rate and batch size, and overfitting and underfitting.

3) Stochastic Gradient Descent

As with all machine learning tasks, we begin with a set of training data. Each example in the training data consists of some features (the inputs) together with an expected target (the output). Training the network means adjusting its weights in such a way that it can transform the features into the target. In the 80 Cereals dataset, for instance, we want a network that can take each cereal’s ‘sugar’, ‘fiber’, and ‘protein’ content and produce a prediction for that cereal’s ‘calories’. If we can successfully train a network to do that, its weights must represent in some way the relationship between those features and that target as expressed in the training data.

In addition to the training data, we need two more things:

A “loss function” that measures how good the network’s predictions are.
An “optimizer” that can tell the network how to change its weights.

The Loss Function

We’ve seen how to design an architecture for a network, but we haven’t seen how to tell a network what problem to solve. This is the job of the loss function.

The loss function measures the disparity between the target’s true value and the value the model predicts.

Different problems call for different loss functions. We have been looking at regression problems, where the task is to predict some numerical value — calories in 80 Cereals, rating in Red Wine Quality. Other regression tasks might be predicting the price of a house or the fuel efficiency of a car.

A common loss function for regression problems is the mean absolute error or MAE. For each prediction y_pred, MAE measures the disparity from the true target y_true by an absolute difference abs(y_true - y_pred).

The total MAE loss on a dataset is the mean of all these absolute differences.
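To make the definition concrete, here is a minimal NumPy sketch of MAE. The function name mean_absolute_error and the sample calorie values are made up for illustration; in Keras you would normally just pass loss="mae".

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences abs(y_true - y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical calorie targets and predictions for three cereals
y_true = [110.0, 70.0, 120.0]
y_pred = [100.0, 75.0, 120.0]

# Absolute differences are 10, 5 and 0, so the mean is 5.0
mae = mean_absolute_error(y_true, y_pred)
```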

Besides MAE, other loss functions you might see for regression problems are the mean-squared error (MSE) or the Huber loss (both available in Keras).
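For comparison, here is a rough NumPy sketch of the other two losses. It assumes the usual textbook definitions, with a Huber threshold delta of 1.0; the function names and numbers are illustrative and are not the Keras implementations themselves.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(error ** 2))

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))

# Made-up example with one large error (an "outlier")
y_true = [0.0, 0.0, 0.0]
y_pred = [0.5, 1.0, 10.0]

mse = mean_squared_error(y_true, y_pred)  # (0.25 + 1 + 100) / 3 = 33.75
huber = huber_loss(y_true, y_pred)        # (0.125 + 0.5 + 9.5) / 3 = 3.375
```

The single large error dominates MSE far more than it does the Huber loss, which is one reason you might pick one over the other.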

During training, the model will use the loss function as a guide for finding the correct values of its weights (lower loss is better). In other words, the loss function tells the network its objective.

The Optimizer — Stochastic Gradient Descent

We’ve described the problem we want the network to solve, but now we need to say how to solve it. This is the job of the optimizer. The optimizer is an algorithm that adjusts the weights to minimize the loss.

Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps. One step of training goes like this:

Sample some training data and run it through the network to make predictions.
Measure the loss between the predictions and the true values.
Finally, adjust the weights in a direction that makes the loss smaller.

Then just do this over and over until the loss is as small as you like (or until it won’t decrease any further).
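The three steps above can be sketched in plain NumPy for a one-feature linear model. This is only an illustration under some assumptions: the data is synthetic, the loss is MSE rather than MAE (its gradient is simpler to write down), and the learning rate, batch size, and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line: y = 2x + 1, plus a little noise
X = rng.uniform(-1.0, 1.0, size=1000)
y = 2.0 * X + 1.0 + 0.05 * rng.normal(size=1000)

w, b = 0.0, 0.0        # the weights we want to learn
learning_rate = 0.1
batch_size = 32

for step in range(500):
    # 1. Sample a minibatch and run it through the model
    idx = rng.choice(len(X), size=batch_size, replace=False)
    x_batch, y_batch = X[idx], y[idx]
    y_pred = w * x_batch + b

    # 2. Measure the loss between predictions and true values (MSE here)
    error = y_pred - y_batch
    loss = np.mean(error ** 2)

    # 3. Adjust the weights in the direction that makes the loss smaller
    grad_w = 2.0 * np.mean(error * x_batch)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

After the loop, w and b should sit close to the slope and intercept used to generate the data.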

Each iteration’s sample of training data is called a minibatch (or often just “batch”), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
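The relationship between examples, batches, and epochs is simple arithmetic. The sketch below assumes the final, smaller batch of each epoch is kept rather than dropped; the helper names are made up for this example.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of minibatches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

def total_steps(num_examples, batch_size, epochs):
    """Total number of weight updates over the whole training run."""
    return steps_per_epoch(num_examples, batch_size) * epochs

# Hypothetical dataset of 1000 rows, batches of 256, trained for 10 epochs
steps = steps_per_epoch(1000, 256)    # 4 batches (the last one has 232 rows)
updates = total_steps(1000, 256, 10)  # 40 weight updates in total
```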

The animation shows the linear model from Part 1 being trained with SGD. The pale red dots depict the entire training set, while the solid red dots are the minibatches. Every time SGD sees a new minibatch, it will shift the weights (w the slope and b the y-intercept) toward their correct values on that batch. Batch after batch, the line eventually converges to its best fit. You can see that the loss gets smaller as the weights get closer to their true values.

Learning Rate and Batch Size

Notice that the line only makes a small shift in the direction of each batch (instead of moving all the way). The size of these shifts is determined by the learning rate. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values.

The learning rate and the size of the minibatches are the two parameters that have the largest effect on how the SGD training proceeds. Their interaction is often subtle and the right choice for these parameters isn’t always obvious. (We’ll explore these effects in the exercise.)

Fortunately, for most work, it won’t be necessary to do an extensive hyperparameter search to get satisfactory results. Adam is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning (it is “self-tuning”, in a sense). Adam is a great general-purpose optimizer.

Adding the Loss and Optimizer

After defining a model, you can add a loss function and optimizer with the model’s compile method:

model.compile(optimizer="adam", loss="mae")
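Passing a string picks the optimizer with its default settings. If you ever want to set the learning rate yourself, you can pass an optimizer object instead. A minimal sketch, assuming a three-feature input and a single linear unit (both chosen only for illustration); 0.001 is Adam’s default learning rate in Keras:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny stand-in model: three input features, one linear output unit
model = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(1),
])

# Same optimizer as the string "adam", but with the learning rate spelled out
optimizer = keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss="mae")
```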

The gradient is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change fastest. We call our process gradient descent because it uses the gradient to descend the loss curve towards a minimum.

Stochastic means “determined by chance.” Our training is stochastic because the minibatches are random samples from the dataset. And that’s why it’s called SGD!

After defining the model, we compile in the optimizer and loss function.

model.compile(optimizer="adam", loss="mae")

Now we’re ready to start the training! We’ve told Keras to feed the optimizer 256 rows of the training data at a time (the batch_size) and to do that 10 times all the way through the dataset (the epochs).

Train the model

history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=256, epochs=10)

You can see that Keras will keep you updated on the loss as the model trains.

Often, though, a better way to view the loss is to plot it.

The fit method in fact keeps a record of the loss produced during training in a History object. We’ll convert the data to a Pandas dataframe, which makes the plotting easy.
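Here is a small sketch of that conversion. To keep it self-contained, it uses a made-up dictionary in place of history.history (the attribute on the History object that holds the per-epoch values); the numbers are invented.

```python
import pandas as pd

# Stand-in for history.history: one list of per-epoch values per metric
history_dict = {
    "loss": [0.90, 0.55, 0.40, 0.34, 0.33],
    "val_loss": [0.95, 0.60, 0.46, 0.41, 0.41],
}

# One row per epoch, one column per metric
history_df = pd.DataFrame(history_dict)
history_df.index.name = "epoch"

final_val_loss = history_df["val_loss"].iloc[-1]

# With matplotlib installed, this draws both curves on one chart:
# history_df.loc[:, ["loss", "val_loss"]].plot()
```

With a real training run, you would write pd.DataFrame(history.history) instead of building the dictionary by hand.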

Notice how the loss levels off as the epochs go by. When the loss curve becomes horizontal like that, it means the model has learned all it can and there would be no reason to continue for additional epochs.

Evaluate Training

If you trained the model longer, would you expect the loss to decrease further?

This depends on how the loss has evolved during training: if the learning curves have leveled off, there won’t usually be an advantage to training for additional epochs. Conversely, if the loss appears to still be decreasing, then training for longer could be advantageous.
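If you want something more mechanical than eyeballing the plot, a rough check like the one below can help. It is only a sketch: the function name, the window of 5 epochs, and the tolerance of 0.001 are all arbitrary choices made for this example.

```python
def has_leveled_off(losses, window=5, tolerance=0.001):
    """True if the loss improved by less than `tolerance` over the last `window` epochs."""
    if len(losses) <= window:
        return False  # not enough epochs to judge
    improvement = losses[-window - 1] - losses[-1]
    return improvement < tolerance

# Made-up loss curves
flat_curve = [1.0, 0.5, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
falling_curve = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

flat_result = has_leveled_off(flat_curve)        # True: no gain in the last 5 epochs
falling_result = has_leveled_off(falling_curve)  # False: still dropping
```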

With the learning rate and the batch size, you have some control over:

How long it takes to train a model
How noisy the learning curves are
How small the loss becomes

To get a better understanding of these two parameters, we’ll look at the linear model, our simplest neural network. Having only a single weight and a bias, it’s easier to see what effect a change of parameter has.

Change the values for learning_rate, batch_size, and num_examples.

Learning Rate and Batch Size

Smaller batch sizes give noisier weight updates and loss curves. This is because each batch is a small sample of the data, and smaller samples tend to give noisier estimates. Smaller batches can have an “averaging” effect, though, which can be beneficial.
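You can see the noise directly by measuring how much the gradient estimate varies from one minibatch to the next. The sketch below uses synthetic data and an MSE loss on a linear model; the function name, batch sizes, and number of trials are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, fairly noisy data around the line y = 2x + 1
X = rng.uniform(-1.0, 1.0, size=10_000)
y = 2.0 * X + 1.0 + 0.5 * rng.normal(size=10_000)

def gradient_spread(batch_size, w=0.0, b=0.0, trials=500):
    """Standard deviation of the minibatch gradient with respect to w (MSE loss)."""
    grads = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        error = (w * X[idx] + b) - y[idx]
        grads.append(2.0 * np.mean(error * X[idx]))
    return float(np.std(grads))

spread_small = gradient_spread(batch_size=8)    # noisy estimates
spread_large = gradient_spread(batch_size=256)  # much steadier estimates
```

The spread for the small batches should come out several times larger than for the large ones, which is the noise you see in the weight updates and loss curves.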

Smaller learning rates make the updates smaller and the training takes longer to converge.

Large learning rates can speed up training, but they don’t “settle in” to a minimum as well. When the learning rate is too large, the training can fail completely. Let’s check what happens with a high learning rate and a small number of batches.
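A toy example shows all three behaviours. The sketch below runs plain gradient descent on the loss (w - 3)**2, a function chosen only because its minimum (w = 3) and gradient are easy to write down; the learning rates and the step count are arbitrary.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize the toy loss (w - 3)**2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= learning_rate * grad
    return w

w_small = gradient_descent(0.01)  # too small: still well short of 3 after 50 steps
w_good = gradient_descent(0.1)    # reasonable: ends very close to 3
w_large = gradient_descent(1.1)   # too large: overshoots more each step and diverges
```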

4) Overfitting and Underfitting

In this part, we’re going to learn how to interpret these learning curves and how we can use them to guide model development. In particular, we’ll examine the learning curves for evidence of underfitting and overfitting and look at a couple of strategies for correcting them.

Interpreting the Learning Curves

You might think about the information in the training data as being of two kinds: signal and noise.

The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data: all of the random fluctuations that come from real-world data, and all of the incidental, non-informative patterns that can’t actually help the model make predictions. The noise is the part that might look useful but really isn’t.

We train a model by choosing weights or parameters that minimize the loss on a training set. You might know, however, that to accurately assess a model’s performance, we need to evaluate it on a new set of data, the validation data.

When we train a model, we’ve been plotting the loss on the training set epoch by epoch. To this, we’ll add a plot of the loss on the validation data too. These plots we call the learning curves. To train deep learning models effectively, we need to be able to interpret them.

Now, the training loss will go down either when the model learns the signal or when it learns noise. But the validation loss will go down only when the model learns the signal. (Whatever noise the model learned from the training set won’t generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned.

Ideally, we would create models that learn all of the signal and none of the noise. This will practically never happen. Instead, we make a trade. We can get the model to learn more signal at the cost of learning more noise. So long as the trade is in our favor, the validation loss will continue to decrease. After a certain point, however, the trade can turn against us, the cost exceeds the benefit, and the validation loss begins to rise.
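Both ideas, the gap between the curves and the point where the validation loss turns upward, are easy to read off the numbers. The learning curves below are invented for the illustration.

```python
import numpy as np

# Made-up learning curves, one value per epoch
train_loss = np.array([1.00, 0.70, 0.50, 0.40, 0.33, 0.28, 0.24, 0.21])
val_loss = np.array([1.05, 0.78, 0.60, 0.52, 0.49, 0.50, 0.53, 0.57])

# The gap grows as the model starts to fit noise in the training set
gap = val_loss - train_loss

# The epoch (0-based) where the validation loss was lowest;
# after this point the trade has turned against us
best_epoch = int(np.argmin(val_loss))
```

Here the training loss keeps falling the whole time, but the validation loss bottoms out at epoch 4 and the gap keeps widening afterwards.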

This trade-off indicates that there can be two problems that occur when training a model: not enough signal or too much noise. Underfitting the training set is when the loss is not as low as it could be because the model hasn’t learned enough signal. Overfitting the training set is when the loss is not as low as it could be because the model learned too much noise. The trick to training deep learning models is finding the best balance between the two.

We’ll look at a couple of ways of getting more signal out of the training data while reducing the amount of noise.

Capacity

A model’s capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.

You can increase the capacity of a network either by making it wider (adding more units to existing layers) or by making it deeper (adding more layers).

Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones.

Which is better just depends on the dataset.
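In Keras, the two options might look like the sketch below. The layer sizes (16 and 32 units) and the three input features are assumptions picked for illustration, not values prescribed by the article.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Baseline: one hidden layer with 16 units
base = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])

# Wider: same depth, more units in the hidden layer
wider = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])

# Deeper: same width, one extra hidden layer
deeper = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])

# Both changes add trainable parameters, which is one rough measure of capacity
param_counts = {
    "base": base.count_params(),
    "wider": wider.count_params(),
    "deeper": deeper.count_params(),
}
```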

Early Stopping

We mentioned that when a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn’t decreasing anymore. Interrupting the training this way is called early stopping.

Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred. This ensures that the model won’t continue to learn noise and overfit the data.

Training with early stopping also means we’re in less danger of stopping the training too early before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough. Just set your training epochs to some large number (more than you’ll need), and early stopping will take care of the rest.

Adding Early Stopping

In Keras, we include early stopping in our training through a callback. A callback is just a function you want run every so often while the network trains. The early stopping callback will run after every epoch. (Keras has a variety of useful callbacks pre-defined, but you can define your own, too.)
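A sketch of such a callback, using the built-in EarlyStopping class. The values for min_delta, patience, and restore_best_weights are the ones the next paragraph describes; the variable name is just a convenient label.

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    min_delta=0.001,            # minimum change that counts as an improvement
    patience=20,                # how many epochs to wait before stopping
    restore_best_weights=True,  # roll back to the best weights seen
)
```

By default the callback watches the validation loss (monitor="val_loss"), so validation data has to be supplied when training.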

These parameters say: “If there hasn’t been at least an improvement of 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found.” It can sometimes be hard to tell if the validation loss is rising due to overfitting or just due to random batch variation. The parameters allow us to set some allowances around when to stop.

As we’ll see in our example, we’ll pass this callback to the fit method, after compiling the model with the loss and optimizer.
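Putting the pieces together might look like the following. This is a sketch under stated assumptions rather than the article’s own example: the data is synthetic, and the network size, batch size, and epoch limit are arbitrary.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic regression data: three features, a linear target plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3)).astype("float32")
true_weights = np.array([1.5, -2.0, 0.5], dtype="float32")
y = (X @ true_weights + 0.1 * rng.normal(size=400)).astype("float32")
X_train, X_valid = X[:300], X[300:]
y_train, y_valid = y[:300], y[300:]

model = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")

early_stopping = keras.callbacks.EarlyStopping(
    min_delta=0.001,
    patience=20,
    restore_best_weights=True,
)

# Ask for more epochs than we expect to need; the callback may stop sooner
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=64,
    epochs=200,
    callbacks=[early_stopping],
    verbose=0,  # silence the per-epoch log
)

epochs_run = len(history.history["val_loss"])
```

If the validation loss stops improving, epochs_run will be smaller than the 200 epochs requested.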

Written by Rhnyewale