Implementing Adam Gradient Descent

We start by importing the necessary libraries. In this implementation, we only need the NumPy library for mathematical operations. The initialize_adam function initializes the moving average variables v and s as dictionaries based on the parameters of the model. It takes the parameters dictionary as input, which contains the weights and biases of the model.

v = 0 (Initialize first moment vector)
s = 0 (Initialize second moment vector)
t = 0 (Initialize time step)

Python3




def initialize_adam(parameters):
    L = len(parameters) // 2
    v = {}
    s = {}
 
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        s["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        s["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
 
    return v, s


The function initializes the moving average variables for both weights (dW) and biases (db) of each layer in the model. It returns the initialized v and s dictionaries.

In above code,

initialize_adam(): This function is responsible for initializing the moments s and v used in the Adam algorithm. The moments are initialized with zero values to ensure a proper start for the optimization process. By initializing these moments, Adam keeps track of the first and second moments of the gradients for each parameter, which helps in adapting the learning rate during optimization.

How to Implement Adam Gradient Descent from Scratch using Python?

Grade descent is an extensively used optimization algorithm in machine literacy and deep literacy. It’s used to minimize the cost or loss function of a model by iteratively confirming the model’s parameters grounded on the slants of the cost function with respect to those parameters. One variant of gradient descent that has gained popularity is the Adam optimization algorithm. Adam combines the benefits of AdaGrad and RMSProp to achieve effective and adaptive learning rates.

Similar Reads

Understanding the Adam Algorithm

Before going into the implementation, let’s have a brief overview of the Adam optimization algorithm. Adam stands for Adaptive Moment Estimation. It maintains two moving average variables:...

Terminologies related to Adam’s Algorithm

Gradient Descent: An iterative optimization algorithm used to find the minimum of a function by iteratively adjusting the parameters in the direction of the steepest descent of the gradient. Learning Rate: A hyperparameter that determines the step size at each iteration of gradient descent. It controls how much the parameters are updated based on the computed gradients. Objective Function: The function that we aim to minimize or maximize. In the context of optimization, it is usually a cost function or a loss function. Derivative: The derivative of a function represents its rate of change at a particular point. In the context of gradient descent, the derivative (or gradient) provides the direction and magnitude of the steepest ascent or descent. Local Minimum: A point in the optimization landscape where the objective function has the lowest value in the vicinity. It may not be the global minimum, which is the absolute lowest point in the entire optimization landscape. Global Minimum: The lowest point in the entire optimization landscape, indicating the optimal solution to the problem. Momentum: In the context of optimization algorithms, momentum refers to the inclusion of past gradients to determine the current update. It helps accelerate convergence and overcome local optima. Adam (Adaptive Moment Estimation): An optimization algorithm that extends gradient descent by utilizing adaptive learning rates and momentum. It maintains a running average of past gradients and squared gradients to compute adaptive learning rates for each parameter. Beta1 and Beta2: Hyperparameters in the Adam algorithm that control the exponential decay rates for the estimation of the first and second moments of the gradients, respectively. Beta1 controls the momentum-like effect, while Beta2 controls the influence of the squared gradients. Epsilon (eps): A small constant added to the denominator in the Adam algorithm to prevent division by zero and ensure numerical stability....

Implementing Adam Gradient Descent

We start by importing the necessary libraries. In this implementation, we only need the NumPy library for mathematical operations. The initialize_adam function initializes the moving average variables v and s as dictionaries based on the parameters of the model. It takes the parameters dictionary as input, which contains the weights and biases of the model....

Mathematical Intuition Behind Adam’s Gradient Descent

...

Visualization of Adam

The update_parameters_with_adam function performs the Adam gradient descent update for the parameters of the model. It takes the parameters dictionary, gradients grads, moving averages v and s, iteration number t, learning rate learning_rate, and hyperparameters beta1, beta2, and epsilon as inputs....

Contact Us