Parameter Estimation in Logistic Regression using Newton’s Method

Parameter Estimation in Logistic Regression

Logistic regression is a popular statistical method used for binary classification tasks. Given a dataset with input features X and corresponding binary labels y, logistic regression models the probability that an input belongs to a particular class. The logistic regression model is defined by the logistic (or sigmoid) function, which maps the linear combination of input features to a probability:

[Tex]p(y=1\mid x) = \sigma(w^T x) = \frac{1}{1+e^{-w^T x}}[/Tex]

where

  • x is a d×1 feature vector, i.e. one row of the N×d input feature matrix X, where N is the number of samples and d is the number of features.
  • w is the weight vector of size d×1.
  • σ(z) is the logistic (sigmoid) function.
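
As a quick illustration, here is a minimal NumPy sketch (with hypothetical values for the features and weights) showing the sigmoid applied row-wise to a feature matrix X, producing one probability per sample:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real-valued score to (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical example: N = 3 samples, d = 2 features
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [2.0, 0.0]])
w = np.array([0.8, -0.4])   # hypothetical weight vector of size d

print(sigmoid(X @ w))       # one probability per sample, each in (0, 1)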

Logistic Regression Model and Loss Function

The goal of logistic regression is to find the optimal weight vector w that maximizes the likelihood of the observed data. Equivalently, this is achieved by minimizing the negative log-likelihood, known as the cross-entropy loss function:

[Tex]L(w) = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log(\sigma(w^T x_i))+(1-y_i)\log(1-\sigma(w^T x_i))][/Tex]

where y_i is the label of the i-th sample and x_i is the feature vector of the i-th sample.
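
As a minimal sketch (again with hypothetical data), this loss can be computed directly from the formula above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_loss(w, X, y):
    # Mean negative log-likelihood over the N samples
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: N = 4 samples, d = 2 features, binary labels
X = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 0.5], [2.0, 2.5]])
y = np.array([0, 0, 1, 1])
w = np.array([0.1, -0.2])

print(cross_entropy_loss(w, X, y))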

Newton’s method can be applied to optimize the parameters w by iteratively updating the weight vector to minimize the logistic loss function. The steps to perform this are as follows:

  • Choose an initial guess for the weight vector w0.
  • For each iteration:
  • Compute the gradient vector ∇L(w_n) and the Hessian H(w_n) of the loss function with respect to w_n.
  • Update the weight vector using Newton’s formula:

[Tex]w_{n+1} = w_n - H^{-1}(w_n)\nabla L(w_n)[/Tex]

  • Check for convergence: if the change in the weight vector is sufficiently small or a maximum number of iterations is reached, terminate the iteration.
  • The final weight vector obtained at the end of the iterations is taken as the optimal solution.

The gradient vector and Hessian matrix can be computed using the first and second derivatives of the logistic loss function, respectively. The Hessian matrix represents the curvature of the loss function, and its inverse is used to correct the weight updates, taking into account the local curvature of the loss surface.
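
For the cross-entropy loss above, the gradient and Hessian have standard closed forms. Writing σ(Xw) for the vector of predicted probabilities and S for the diagonal matrix with entries σ(x_i^T w)(1 − σ(x_i^T w)):

[Tex]\nabla L(w) = \frac{1}{N}X^T(\sigma(Xw) - y), \qquad H(w) = \frac{1}{N}X^T S X[/Tex]

Because the entries of S are non-negative, H is positive semi-definite, which is why Newton’s method behaves well on this convex loss.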

We will now apply Newton’s Method to find the optimal parameters of a logistic regression model for binary classification.

Let us consider a simple binary classification problem with just one feature, where we aim to predict whether a student passes (y=1) or fails (y=0) an exam based on the number of hours they studied. We have the following dataset:

Hours Studied    Exam Result (Pass/Fail)
2                0
3                0
4                1
5                0
6                1
We shall use logistic regression to model the probability of the student passing the exam, and apply Newton’s Method to find the optimal weights of the model.

First, we define our model:

[Tex]\sigma(z) = \frac{1}{1+e^{-z}}[/Tex]

where z = w_0 + w_1 · hours_studied, and w_0 and w_1 are the weights to be optimized.

The logistic loss function is given by:

[Tex]L(w_0, w_1) = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log(\sigma(w_0 + w_1 x_i))+(1-y_i)\log(1-\sigma(w_0 + w_1 x_i))][/Tex]

where N is the number of samples and x_i is the number of hours studied by the i-th student.

To apply Newton’s method, we need to compute the gradient vector and Hessian matrix of the loss function with respect to the weights w0 and w1.

Let’s assume an initial guess for the weights w0 = 0 and w1 = 0. Using Newton’s Method, for each iteration, we perform the following steps.

  • Compute the gradient vector ∇L(w_0, w_1) and the Hessian matrix H(w_0, w_1); their explicit forms for this model are given after this list.
  • Update the weights as follows:

[Tex]\begin{pmatrix}w_0\\w_1\end{pmatrix} \leftarrow \begin{pmatrix}w_0\\w_1\end{pmatrix} - H^{-1}\nabla L(w_0, w_1)[/Tex]

  • Check for convergence and terminate once convergence is achieved.
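
For this single-feature model, writing σ_i = σ(w_0 + w_1 x_i) for the predicted probability of the i-th student, the gradient components and Hessian entries reduce to:

[Tex]\frac{\partial L}{\partial w_0} = \frac{1}{N}\sum_{i=1}^{N}(\sigma_i - y_i), \qquad \frac{\partial L}{\partial w_1} = \frac{1}{N}\sum_{i=1}^{N}(\sigma_i - y_i)\,x_i[/Tex]

[Tex]H(w_0, w_1) = \frac{1}{N}\begin{pmatrix}\sum_{i}\sigma_i(1-\sigma_i) & \sum_{i}x_i\,\sigma_i(1-\sigma_i)\\ \sum_{i}x_i\,\sigma_i(1-\sigma_i) & \sum_{i}x_i^2\,\sigma_i(1-\sigma_i)\end{pmatrix}[/Tex]

These are exactly the quantities computed with np.mean in the implementation below.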

Python Implementation

The Python implementation is given as follows:

Python3

import numpy as np

# Given dataset
hours_studied = np.array([2, 3, 4, 5, 6])
exam_result = np.array([0, 0, 1, 0, 1])

# Initialize weights
w0 = 0
w1 = 0

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Convergence criteria
max_iterations = 1000
tolerance = 1e-6
prev_delta_w = np.inf

# Iterative optimization
for iteration in range(max_iterations):
    # Compute z = w0 + w1 * hours_studied
    z = w0 + w1 * hours_studied

    # Compute predicted probabilities
    probabilities = sigmoid(z)

    # Compute error = predicted_prob - actual_result
    error = probabilities - exam_result

    # Compute gradient vector
    grad_w0 = np.mean(error)
    grad_w1 = np.mean(error * hours_studied)

    # Compute Hessian matrix
    hessian_w0_w0 = np.mean(probabilities * (1 - probabilities))
    hessian_w1_w1 = np.mean(hours_studied**2 * probabilities * (1 - probabilities))
    hessian_w0_w1 = np.mean(hours_studied * probabilities * (1 - probabilities))

    # Inverse of Hessian matrix
    hessian_inv = np.linalg.inv([[hessian_w0_w0, hessian_w0_w1],
                                 [hessian_w0_w1, hessian_w1_w1]])

    # Newton step: H^{-1} * gradient
    grad = np.array([grad_w0, grad_w1])
    delta_w = np.dot(hessian_inv, grad)

    # Check convergence
    if np.linalg.norm(delta_w) < tolerance:
        print("Converged after", iteration + 1, "iterations.")
        break

    # Update weights
    w0 -= delta_w[0]
    w1 -= delta_w[1]

    # Check for improvement in convergence
    if np.linalg.norm(delta_w - prev_delta_w) < tolerance:
        print("Converged (no significant change in weights) after", iteration + 1, "iterations.")
        break

    # Update previous delta_w for next iteration
    prev_delta_w = delta_w

# Output
print("Optimal weights:")
print("w0 =", w0)
print("w1 =", w1)

Output:

Converged after 6 iterations.
Optimal weights:
w0 = -4.984392306601187
w1 = 1.0904255602930342

  • In the above code, we use the NumPy library for the numerical computations. The dataset is stored in the NumPy arrays ‘hours_studied’ and ‘exam_result’, and the weights are initialized to (w0, w1) = (0, 0).
  • The convergence criteria are max_iterations = 1000 and a tolerance of 10^-6 on the norm of the Newton step. The rest of the code implements the algorithm as discussed and finally displays the optimal values of the weights.
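
Once the weights have converged, they can be plugged back into the sigmoid to score new inputs. A small follow-up sketch using the fitted values printed above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Fitted weights taken from the output above
w0_opt = -4.984392306601187
w1_opt = 1.0904255602930342

# Predicted probability of passing for a few study times
for hours in [2, 4, 6]:
    p = sigmoid(w0_opt + w1_opt * hours)
    print(f"hours = {hours}: P(pass) = {p:.3f}")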
