Implementing the damped Newton method using the BFGS formula, and implementing L-BFGS: Limited-memory BFGS.

This article focuses on mathematical derivation, assuming readers already have a solid understanding of Logistic Regression.

The main references for this article are:

Broyden-class Algorithms: Derivation of the BFGS Iteration Formula

12-lect-QN.pdf (pku.edu.cn)

The code is available in the repository:

mr-wuliu/MECHINELEARNING

Notation Conventions

\begin{aligned} &s^k = x^{k+1}-x^k\\ &y^k = \nabla f(x^{k+1}) - \nabla f(x^k)\\ &B^k = \nabla^2f(x^k)\\ &H^k = (B^k)^{-1} \end{aligned}

$\nabla$ denotes the gradient.

Basic Framework

Quasi-Newton method framework:

\begin{aligned} &\text{Initialize }x_{n\times 1},H^k_{n\times n}, k = 0 \\ &\text{while }\|\nabla f(x_k) \|> \varepsilon \text{ do}\\ &\ \ \ \ d^k = - H^k\nabla f(x_k)\\ &\ \ \ \ \text{Wolfe line search to compute step size }a_k, x^{k+1} = x^k +a_kd^k\\ &\ \ \ \ \text{update }H^{k+1} \\ &\ \ \ \ k\leftarrow k+1\\ &\text{end while} \end{aligned}

Computing the Gradient

\nabla\mathscr{J}(\theta)=\frac1m X^T(\bar y - y)

If a regularization term is added, the loss function takes the form:

\mathscr{J}(\theta) = -\sum^m_{i=1}(y^{(i)}\ln p_1 + (1-y^{(i)}\ln (1-p_1) ) +\frac{\lambda }{2m}\|\theta\|^2_2

Its gradient is:

\nabla\mathscr{J}(\theta)=\frac1m X^T(\bar y - y) +\frac{\lambda}{m}\theta

Computing the Hessian Matrix

Let $f(x)$ be twice continuously differentiable. The first-order Taylor approximation of $\nabla f(x)$ at $x^{k+1}$ is:

\nabla f(x) = \nabla f(x^{k+1}) +\nabla^2f(x^{k+1})(x-x^{k+1{}})\\ +O(\|x-x^{k+1}\|_2^2)

If we ignore the higher-order terms, we obtain:

B^{k+1}s^k = y^k \tag{1}

or equivalently,

H^{k+1} y^k = s^k\tag{2}

Equations (1) and (2) are called the secant equations.

Using the BFGS formula:

H^{k+1}=\left(I-\frac{s^k(y^k)^T}{(s^k)^Ty^k}\right)^TH^k \left( I-\frac{s^k(y^k)^T}{(s^k)^Ty^k} \right) +\frac{s^k(s^k)^T}{(s^k)^Ty^k)}

Its core idea is to perform a rank-two update on $H^k$ .

The issue is that $H^{k+1}$ is approximated from the secant equation, and from the very start of the iteration it is not an exact inverse of the second-order derivative matrix (the initial value is the identity matrix). However, as the iteration progresses, it quickly converges to the correct Hessian matrix. If the secant equation is derived from the formula with the regularization term, I believe $H^{k+1}$ does not need to include the regularization term separately.

Rank-Two Update

For the quasi-Newton matrix $B^k\in\mathbb{R}^{n\times n}$ , let $0\neq u,v\in\mathbb{R}^{n}$ , and $a,b\in\mathbb{R}$ be undetermined. The rank-two update takes the form:

B^{k+1} = B^k + auu^T +bvv^T

Substituting into the secant equation yields:

(au^Ts^k)u + (bv^Ts^k)v = y^k - B^ks^k

By choosing appropriate $a,b,u,v$ , we obtain:

B^{k+1} = B^k +\frac{y^k(y^k)^T}{(s^k)^Ty^k} - \frac{B^ks^k(B^ks^k)^T}{(s^k)^TB^ks^k}

Applying the Sherman-Morrison formula twice:

(A+\frac{uv^T}{t})^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{t+ v^TA^{-1}u}

After tedious simplification, we arrive at the BFGS formula based on $H^k$ updates.

The derivation can also be done using the SMW (Sherman-Morrison-Woodbury) formula, but I am less familiar with that formula.

Wolfe Conditions for Line Search

Sufficient conditions for the quasi-Newton matrix to be positive definite:

$H^k$ is positive definite

The curvature condition $(s^k)^Ty^k>0, \forall k \in \mathbb{N}$ is satisfied

Using the Wolfe conditions for line search ensures that the curvature condition is satisfied, guaranteeing the positive definiteness of the quasi-Newton matrix.

Definition

Let $d^k$ be a descent direction at $x^k$ . If:

f(x^k + \alpha d^k)\leq f(x^k) + c_1\alpha \nabla f(x^k)^Td^k\\ \nabla f(x^k+\alpha d^k)^Td^k \geq c_2\nabla f(x^k)^Td^k

then the step size $\alpha$ satisfies the Wolfe conditions, where $c_1 < c_2$ and $c_1,c_2 \in (0,1)$ .

Bisection can be used to quickly find a suitable $\alpha$ .

L-BFGS Method

The BFGS method is fast, but it still has shortcomings. For Hessian matrices with a large number of features, computation and storage become challenging. Often we can solve this by adding more memory, but purely relying on hardware improvements has an upper limit. We need to design better algorithms to address this problem.

We observe that the computation of $H^{k+1}$ is based on an initial $H^0$ and intermediate quantities $s^k,y^k$ . Moreover, what we ultimately need is just the Newton direction $d^k$ , not a complete Hessian matrix. If there exists an algorithm that can directly compute the Newton direction $d^k$ , it would greatly save memory. It could even further improve computational speed (the overhead of large-scale matrix operations is significant).

Indeed, such a remarkable algorithm exists: it can directly compute $d^k$ based on information such as $s^k,y^k$ . The matrix $H^{k}$ is no longer stored explicitly; instead, a limited number of $s^k,y^k$ pairs are retained.

"Computing" the Hessian Matrix

For convenience, let us rewrite the matrix used to compute $H^k$ in BFGS as:

H^{k+1} = (V^k)^TH^kV^k + \rho_ks^k(s^k)^T

where

V^k =I-\frac{s^k(y^k)^T}{(s^k)^Ty^k};\rho_k =\frac1{(s^k)^Ty^k)}

If we expand $H^{k+1}$ with a limited m steps, we obtain:

\begin{aligned} H^{k+1}&=(\prod_{i=k-m}^kV^i)^TH^{k-m}(\prod_{i=k-m}^kV^i)\\ &+\rho_{k-m}(\prod_{i=k-m+1}^kV^i)^Ts^{k-m}(s^{k-m})^T(\prod_{i=k-m+1}^kV^i)\\ &+\cdots+\\ &+ \rho_{k-1}s^{k-1}(s^{k-1})^T \end{aligned}

It appears that after expansion we still need to store $H^{k-m}$ . However, there exists a remarkable algorithm that allows us to compute $d^k$ directly without storing $H^{k-m}$ or $H^{k}$ .

The two-loop recursion algorithm employs an iterative structure to minimize memory overhead as much as possible. It also avoids matrix multiplication, greatly accelerating computation.

If we multiply both sides of the equation by $\nabla f(\theta^k)$ , we obtain the following:

\begin{aligned} -d^k&=(\prod_{i=k-m}^kV^i)^TH^{k-m}(\prod_{i=k-m}^kV^i\nabla f(\theta^k)\\ &+\rho_{k-m}(\prod_{i=k-m+1}^kV^i)^Ts^{k-m}(s^{k-m})^T(\prod_{i=k-m+1}^kV^i)\nabla f(\theta^k)\\ &+\cdots+\\ &+ \rho_{k-1}s^{k-1}(s^{k-1})^T\nabla f(\theta^k) \end{aligned}

The process involving $(\prod_{i=k-m+1}^kV^i)\nabla f(\theta^k)$ can be stored and computed iteratively, reducing computational cost. This is the first loop.

If we denote:

q =(\prod_{i=k-m}^{k-1}V^i)\nabla f(\theta^k)\\ \alpha_i=\rho_{k-j}s^{k-j}(\prod_{i=k-l+1}^{k-1}V^i)\nabla f(\theta^k)

then we have:

-d^k = (\prod_{i=k-m}^{k-1}V^i)H^{k-m}q + (\prod_{i=k-m+1}^{k-1}V^i)s^{k-m}\alpha_{k-m}+\cdots+s^{k-1}\alpha_{k-1}

The other loop is for computing the direction $d^k$ .

First compute $H^{k-m}q$ . Expanding $V^i$ and merging the subsequent terms yields:

H^{k-m} - s^{k-m}\alpha_{k-m} + s^{k-m}\rho_{k-m}(y^{k})^TH^{k-m}\\ =H^{k-m} - s^{k-m}(\alpha_{k-m}-\rho_{k-m}(y^{k-m})^TH^0q)

Recursing in this manner yields the final result $-d^k$ .

Thus, the L-BFGS algorithm only needs to modify the way $H^k$ is computed and retain a limited number $m$ of iteration histories to keep the algorithm running.

Implementation Code

The above process can be written as code. Experiments show that the BFGS method requires many more iterations than the standard Newton method, but still needs a reasonable number. However, since it leverages second-order matrix information — even though the matrix is approximate — its convergence rate remains superlinear.

First, let us establish a common utility library shared by both L-BFGS and BFGS.

import pandas as pd
import numpy as np

"""
Clementine Model
A module for data processing. Also partially serves as a shared library,
e.g., loss function, gradient, etc.
Clementine means little tangerine, a pun on 'juzi' (bureau), suggesting
that writing this thing feels like being in prison.
As for why it's not called tangerine: that library name already exists,
so we avoid the conflict.
"""

'''
Shared functions, including:
sigmoid() activation function
cost() compute loss
nabla() analytically compute gradient
scrubbing() add constant term to data, with optional normalization
wolfe() search for suitable alpha value
predict() make predictions
'''


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


def cost(X, theta, Y, c=False, Lambda=0):
    """
    Loss function using negative log-likelihood as the metric.
    The regularization term uses L2 regularization; default is zero.
    :param X: sample set
    :param theta: coefficients to be determined
    :param Y: label set
    :param c: function coefficient
    :param Lambda: regularization strength
    :return:
    """
    #h = sigmoid(X @ theta)
    if c:
        m, n = X.shape
        a = np.multiply(X @ theta, Y)
        b = np.log(1 + np.exp(X @ theta))
        loss = (1.0 / m) * np.sum(-a + b) \
               + Lambda * theta.T @ theta / 2
    else:
        a = np.multiply(X @  theta, Y)
        b = np.log(1 + np.exp(X @ theta))
        loss = np.sum( -a  + b )\
                +Lambda * theta.T @ theta / 2
    return loss


def nabla(X, theta, Y, Lambda=0, c=False):
    """
    Implement the nabla operator
    :param X:
    :param theta:
    :param Y:
    :param Lambda: regularization strength
    :param c: loss function coefficient
    :return:
    """
    grad = X.T @ (sigmoid(X @ theta) - Y) + Lambda * theta
    if c:
        m, n = X.shape
        return (1.0 / m) * grad
    return grad


def scrubbing(data, stander=True):
    """
    Normalize data and add a constant term in the first column
    :param data:
    :param stander:
    :return:
    """
    row, columns = data.shape
    ones = np.ones((row, 1))
    if stander:
        x = data
        data = (x - np.min(x, axis=0)) / (np.max(x, axis=0) - np.min(x, axis=0))
    data = np.append(ones, data, axis=1)
    return data


def wolfe(x, theta, y, d, c, Lambda, log=False):
    c1 = 0.1
    c2 = 0.9
    alpha = 25
    a = 0
    b = 50
    max_epoch = 40
    epoch = 0
    while a < b and epoch < max_epoch:
        if cost(x, theta + alpha * d, y, c=c, Lambda=Lambda) <= \
                cost(x, theta, y, c=c, Lambda=Lambda) + \
                c1 * alpha * nabla(x, theta, y, c=c, Lambda=Lambda).T @ d:
            if nabla(x, theta + alpha * d, y, c=c, Lambda=Lambda).T @ d >= \
                    c2 * nabla(x, theta, y, c=c, Lambda=Lambda).T @ d:
                break
            else:
                a = alpha
        else:
            b = alpha
        alpha = (a + b) / 2
        epoch += 1
    if log and epoch == max_epoch:
        alpha = 0.05
        print("Wolfe line search failed")
    return alpha

def predict(X, theta, y):
    m, n = X.shape
    pre = np.zeros((m, 1))

    for i in range(m):
        activate = sigmoid(np.dot(X[i, :], theta))
        if activate > 0.5:
            pre[i] = 1
    count = 0
    for i in range(m):
        if pre[i] == y[i]:
            count += 1
    return count / m

BFGS Algorithm Implementation

def bfgs_method(x_train, y_label, max_iter=5000,
                epsilon=1e-4, Lambda=0,
                c=False, log=False):
    """
    Implement the damped Newton method
    :param x_train:
    :param y_label:
    :param max_iter:
    :param epsilon: precision
    :param Lambda: regularization term
    :param c: loss function coefficient
    :param log: logging option
    :return:
    """

    m, n = x_train.shape
    theta_k = np.ones((n, 1))
    I = np.eye(n)
    Hk = I
    grad_k = ct.nabla(x_train, theta_k, y_label,
                      c=c, Lambda=Lambda)
    k = 0
    while np.linalg.norm(grad_k) > epsilon and k < max_iter:
        # Newton direction
        d = - Hk @ grad_k
        # Line search
        alpha_k = ct.wolfe(x_train, theta_k, y_label, d,
                        c, Lambda)
        # Iteration
        theta_k1 = theta_k + alpha_k * d
        # Update Hessian
        s_k = theta_k1 - theta_k
        grad_k1 = ct.nabla(x_train, theta_k1, y_label,
                           c=c, Lambda=Lambda)
        y_k = grad_k1 - grad_k
        if s_k.T @ y_k <= 0:
            print("Matrix is not positive definite, solution failed, exiting")
            print(s_k.T @ y_k)
            break
        # BFGS formula
        Hk = ((I - (s_k @ y_k.T) / (s_k.T @ y_k))).T @ Hk @ ((I - (s_k @ y_k.T) / (s_k.T @ y_k))) + ((s_k @ s_k.T) / (s_k.T @ y_k))
        grad_k = grad_k1
        theta_k = theta_k1
        k += 1
        if log and  k % 200 == 0:
            print(k)
    if log and max_iter == k:
        print("Iteration limit exceeded, |grad|=",np.linalg.norm(grad_k))
    if log:
        print("Loss:", ct.cost(x_train, theta_k, y_label, c=c, Lambda=Lambda))
    return theta_k

L-BFGS Algorithm Implementation

import numpy as np
import clementine as ct
import time


def two_loop(s, y, rho, gk,col):
    """
    Compute the d_k direction using sequences s, y, rho
    :param gk: current gradient
    :param s: dictionary, point difference set
    :param y: dictionary, gradient difference set
    :param rho: dictionary, scalar set
    :return: d_k, Newton direction
    """
    n = len(s)  # Compute list length

    # Choose approximate matrix

    a = np.ones((n, 1))  # a[i] is also a scalar
    q = gk.copy()  # n * 1 matrix

    # First loop
    for i in range(n - 1, -1, -1):
        a[i] = rho[i] @ s[i].T @ q
        q -= a[i] * y[i]

    # Choose approximate matrix
    h0 = np.eye(col)
    if n >= 1:
        h0 = (s[-1].T @ y[-1]) / (y[-1].T @ y[-1]) * h0
    # Second loop
    d = h0 @ q

    for i in range(n):
        b = rho[i] @ y[i].T @ d
        d += s[i] @ (a[i] - b)
    return d


def l_bfgs_method(x_train, y_label, max_iter=5000,
                  epsilon=1e-4, Lambda=0, m=5,
                  c=False, log=False):
    # Parameter initialization
    row, col = x_train.shape
    theta_k = np.zeros((col, 1))  # Initial position
    grad_k = ct.nabla(x_train, theta_k, y_label,
                      Lambda=Lambda, c=c)
    k = 0
    s, y, rho = [], [], []

    while np.linalg.norm(grad_k) > epsilon and k < max_iter:
        d_k = -two_loop(s, y, rho, grad_k,col)

        alpha = ct.wolfe(x_train, theta_k, y_label, d_k,
                      c=c, Lambda=Lambda)
        theta_k1 = theta_k + alpha * d_k

        s_k = theta_k1 - theta_k
        grad_k1 = ct.nabla(x_train, theta_k1, y_label,
                           Lambda=Lambda, c=c)
        y_k = grad_k1 - grad_k

        if s_k.T @ y_k <= 0:
            print("Matrix is not positive definite, solution failed, exiting")
            print(s_k.T @ y_k)
            break
        rho.append(1 / (s_k.T @ y_k))
        s.append(s_k)
        y.append(y_k)
        if (len(rho) > m):
            rho.pop(0)
            s.pop(0)
            y.pop(0)
        # Update
        k += 1
        if log and k % 10 == 0:
            print("Iteration count",k,"||grad||",np.linalg.norm(grad_k))
            print("loss", ct.cost(x_train, theta_k1, y_label,
                                  Lambda=Lambda, c=c))
        grad_k = grad_k1
        theta_k = theta_k1
    # end while

    return theta_k

Quasi-Newton Method for Logistic Binary Classification