Implementing the damped Newton method using the BFGS formula, and implementing L-BFGS: Limited-memory BFGS.

This article focuses on mathematical derivation, assuming readers already have a solid understanding of Logistic Regression.

The main references for this article are:

Broyden-class Algorithms: Derivation of the BFGS Iteration Formula

12-lect-QN.pdf (pku.edu.cn)

The code is available in the repository:

mr-wuliu/MECHINELEARNING

Notation Conventions

sk=xk+1xkyk=f(xk+1)f(xk)Bk=2f(xk)Hk=(Bk)1\begin{aligned} &s^k = x^{k+1}-x^k\\ &y^k = \nabla f(x^{k+1}) - \nabla f(x^k)\\ &B^k = \nabla^2f(x^k)\\ &H^k = (B^k)^{-1} \end{aligned}

\nabla denotes the gradient.

Basic Framework

Quasi-Newton method framework:

Initialize xn×1,Hn×nk,k=0while f(xk)>ε do    dk=Hkf(xk)    Wolfe line search to compute step size ak,xk+1=xk+akdk    update Hk+1    kk+1end while\begin{aligned} &\text{Initialize }x_{n\times 1},H^k_{n\times n}, k = 0 \\ &\text{while }\|\nabla f(x_k) \|> \varepsilon \text{ do}\\ &\ \ \ \ d^k = - H^k\nabla f(x_k)\\ &\ \ \ \ \text{Wolfe line search to compute step size }a_k, x^{k+1} = x^k +a_kd^k\\ &\ \ \ \ \text{update }H^{k+1} \\ &\ \ \ \ k\leftarrow k+1\\ &\text{end while} \end{aligned}

Computing the Gradient

J(θ)=1mXT(yˉy)\nabla\mathscr{J}(\theta)=\frac1m X^T(\bar y - y)

If a regularization term is added, the loss function takes the form:

J(θ)=i=1m(y(i)lnp1+(1y(i)ln(1p1))+λ2mθ22\mathscr{J}(\theta) = -\sum^m_{i=1}(y^{(i)}\ln p_1 + (1-y^{(i)}\ln (1-p_1) ) +\frac{\lambda }{2m}\|\theta\|^2_2

Its gradient is:

J(θ)=1mXT(yˉy)+λmθ\nabla\mathscr{J}(\theta)=\frac1m X^T(\bar y - y) +\frac{\lambda}{m}\theta

Computing the Hessian Matrix

Let f(x)f(x) be twice continuously differentiable. The first-order Taylor approximation of f(x)\nabla f(x) at xk+1x^{k+1} is:

f(x)=f(xk+1)+2f(xk+1)(xxk+1)+O(xxk+122)\nabla f(x) = \nabla f(x^{k+1}) +\nabla^2f(x^{k+1})(x-x^{k+1{}})\\ +O(\|x-x^{k+1}\|_2^2)

If we ignore the higher-order terms, we obtain:

Bk+1sk=yk(1)B^{k+1}s^k = y^k \tag{1}

or equivalently,

Hk+1yk=sk(2)H^{k+1} y^k = s^k\tag{2}

Equations (1) and (2) are called the secant equations.

Using the BFGS formula:

Hk+1=(Isk(yk)T(sk)Tyk)THk(Isk(yk)T(sk)Tyk)+sk(sk)T(sk)Tyk)H^{k+1}=\left(I-\frac{s^k(y^k)^T}{(s^k)^Ty^k}\right)^TH^k \left( I-\frac{s^k(y^k)^T}{(s^k)^Ty^k} \right) +\frac{s^k(s^k)^T}{(s^k)^Ty^k)}

Its core idea is to perform a rank-two update on HkH^k.

The issue is that Hk+1H^{k+1} is approximated from the secant equation, and from the very start of the iteration it is not an exact inverse of the second-order derivative matrix (the initial value is the identity matrix). However, as the iteration progresses, it quickly converges to the correct Hessian matrix. If the secant equation is derived from the formula with the regularization term, I believe Hk+1H^{k+1} does not need to include the regularization term separately.

Rank-Two Update

For the quasi-Newton matrix BkRn×nB^k\in\mathbb{R}^{n\times n}, let 0u,vRn0\neq u,v\in\mathbb{R}^{n}, and a,bRa,b\in\mathbb{R} be undetermined. The rank-two update takes the form:

Bk+1=Bk+auuT+bvvTB^{k+1} = B^k + auu^T +bvv^T

Substituting into the secant equation yields:

(auTsk)u+(bvTsk)v=ykBksk(au^Ts^k)u + (bv^Ts^k)v = y^k - B^ks^k

By choosing appropriate a,b,u,va,b,u,v, we obtain:

Bk+1=Bk+yk(yk)T(sk)TykBksk(Bksk)T(sk)TBkskB^{k+1} = B^k +\frac{y^k(y^k)^T}{(s^k)^Ty^k} - \frac{B^ks^k(B^ks^k)^T}{(s^k)^TB^ks^k}

Applying the Sherman-Morrison formula twice:

(A+uvTt)1=A1A1uvTA1t+vTA1u(A+\frac{uv^T}{t})^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{t+ v^TA^{-1}u}

After tedious simplification, we arrive at the BFGS formula based on HkH^k updates.

The derivation can also be done using the SMW (Sherman-Morrison-Woodbury) formula, but I am less familiar with that formula.

Wolfe Conditions for Line Search

Sufficient conditions for the quasi-Newton matrix to be positive definite:

  1. HkH^k is positive definite
  2. The curvature condition (sk)Tyk>0,kN(s^k)^Ty^k>0, \forall k \in \mathbb{N} is satisfied

Using the Wolfe conditions for line search ensures that the curvature condition is satisfied, guaranteeing the positive definiteness of the quasi-Newton matrix.

Definition

Let dkd^k be a descent direction at xkx^k. If:

f(xk+αdk)f(xk)+c1αf(xk)Tdkf(xk+αdk)Tdkc2f(xk)Tdkf(x^k + \alpha d^k)\leq f(x^k) + c_1\alpha \nabla f(x^k)^Td^k\\ \nabla f(x^k+\alpha d^k)^Td^k \geq c_2\nabla f(x^k)^Td^k

then the step size α\alpha satisfies the Wolfe conditions, where c1<c2c_1 < c_2 and c1,c2(0,1)c_1,c_2 \in (0,1).

Bisection can be used to quickly find a suitable α\alpha.

L-BFGS Method

The BFGS method is fast, but it still has shortcomings. For Hessian matrices with a large number of features, computation and storage become challenging. Often we can solve this by adding more memory, but purely relying on hardware improvements has an upper limit. We need to design better algorithms to address this problem.

We observe that the computation of Hk+1H^{k+1} is based on an initial H0H^0 and intermediate quantities sk,yks^k,y^k. Moreover, what we ultimately need is just the Newton direction dkd^k, not a complete Hessian matrix. If there exists an algorithm that can directly compute the Newton direction dkd^k, it would greatly save memory. It could even further improve computational speed (the overhead of large-scale matrix operations is significant).

Indeed, such a remarkable algorithm exists: it can directly compute dkd^k based on information such as sk,yks^k,y^k. The matrix HkH^{k} is no longer stored explicitly; instead, a limited number of sk,yks^k,y^k pairs are retained.

"Computing" the Hessian Matrix

For convenience, let us rewrite the matrix used to compute HkH^k in BFGS as:

Hk+1=(Vk)THkVk+ρksk(sk)TH^{k+1} = (V^k)^TH^kV^k + \rho_ks^k(s^k)^T

where

Vk=Isk(yk)T(sk)Tyk;ρk=1(sk)Tyk)V^k =I-\frac{s^k(y^k)^T}{(s^k)^Ty^k};\rho_k =\frac1{(s^k)^Ty^k)}

If we expand Hk+1H^{k+1} with a limited m steps, we obtain:

Hk+1=(i=kmkVi)THkm(i=kmkVi)+ρkm(i=km+1kVi)Tskm(skm)T(i=km+1kVi)+++ρk1sk1(sk1)T\begin{aligned} H^{k+1}&=(\prod_{i=k-m}^kV^i)^TH^{k-m}(\prod_{i=k-m}^kV^i)\\ &+\rho_{k-m}(\prod_{i=k-m+1}^kV^i)^Ts^{k-m}(s^{k-m})^T(\prod_{i=k-m+1}^kV^i)\\ &+\cdots+\\ &+ \rho_{k-1}s^{k-1}(s^{k-1})^T \end{aligned}

It appears that after expansion we still need to store HkmH^{k-m}. However, there exists a remarkable algorithm that allows us to compute dkd^k directly without storing HkmH^{k-m} or HkH^{k}.

The two-loop recursion algorithm employs an iterative structure to minimize memory overhead as much as possible. It also avoids matrix multiplication, greatly accelerating computation.

If we multiply both sides of the equation by f(θk)\nabla f(\theta^k), we obtain the following:

dk=(i=kmkVi)THkm(i=kmkVif(θk)+ρkm(i=km+1kVi)Tskm(skm)T(i=km+1kVi)f(θk)+++ρk1sk1(sk1)Tf(θk)\begin{aligned} -d^k&=(\prod_{i=k-m}^kV^i)^TH^{k-m}(\prod_{i=k-m}^kV^i\nabla f(\theta^k)\\ &+\rho_{k-m}(\prod_{i=k-m+1}^kV^i)^Ts^{k-m}(s^{k-m})^T(\prod_{i=k-m+1}^kV^i)\nabla f(\theta^k)\\ &+\cdots+\\ &+ \rho_{k-1}s^{k-1}(s^{k-1})^T\nabla f(\theta^k) \end{aligned}

The process involving (i=km+1kVi)f(θk)(\prod_{i=k-m+1}^kV^i)\nabla f(\theta^k) can be stored and computed iteratively, reducing computational cost. This is the first loop.

If we denote:

q=(i=kmk1Vi)f(θk)αi=ρkjskj(i=kl+1k1Vi)f(θk)q =(\prod_{i=k-m}^{k-1}V^i)\nabla f(\theta^k)\\ \alpha_i=\rho_{k-j}s^{k-j}(\prod_{i=k-l+1}^{k-1}V^i)\nabla f(\theta^k)

then we have:

dk=(i=kmk1Vi)Hkmq+(i=km+1k1Vi)skmαkm++sk1αk1-d^k = (\prod_{i=k-m}^{k-1}V^i)H^{k-m}q + (\prod_{i=k-m+1}^{k-1}V^i)s^{k-m}\alpha_{k-m}+\cdots+s^{k-1}\alpha_{k-1}

The other loop is for computing the direction dkd^k.

First compute HkmqH^{k-m}q. Expanding ViV^i and merging the subsequent terms yields:

Hkmskmαkm+skmρkm(yk)THkm=Hkmskm(αkmρkm(ykm)TH0q)H^{k-m} - s^{k-m}\alpha_{k-m} + s^{k-m}\rho_{k-m}(y^{k})^TH^{k-m}\\ =H^{k-m} - s^{k-m}(\alpha_{k-m}-\rho_{k-m}(y^{k-m})^TH^0q)

Recursing in this manner yields the final result dk-d^k.

Thus, the L-BFGS algorithm only needs to modify the way HkH^k is computed and retain a limited number mm of iteration histories to keep the algorithm running.

Implementation Code

The above process can be written as code. Experiments show that the BFGS method requires many more iterations than the standard Newton method, but still needs a reasonable number. However, since it leverages second-order matrix information — even though the matrix is approximate — its convergence rate remains superlinear.

First, let us establish a common utility library shared by both L-BFGS and BFGS.

import pandas as pd
import numpy as np

"""
Clementine Model
A module for data processing. Also partially serves as a shared library,
e.g., loss function, gradient, etc.
Clementine means little tangerine, a pun on 'juzi' (bureau), suggesting
that writing this thing feels like being in prison.
As for why it's not called tangerine: that library name already exists,
so we avoid the conflict.
"""

'''
Shared functions, including:
sigmoid() activation function
cost() compute loss
nabla() analytically compute gradient
scrubbing() add constant term to data, with optional normalization
wolfe() search for suitable alpha value
predict() make predictions
'''


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


def cost(X, theta, Y, c=False, Lambda=0):
    """
    Loss function using negative log-likelihood as the metric.
    The regularization term uses L2 regularization; default is zero.
    :param X: sample set
    :param theta: coefficients to be determined
    :param Y: label set
    :param c: function coefficient
    :param Lambda: regularization strength
    :return:
    """
    #h = sigmoid(X @ theta)
    if c:
        m, n = X.shape
        a = np.multiply(X @ theta, Y)
        b = np.log(1 + np.exp(X @ theta))
        loss = (1.0 / m) * np.sum(-a + b) \
               + Lambda * theta.T @ theta / 2
    else:
        a = np.multiply(X @  theta, Y)
        b = np.log(1 + np.exp(X @ theta))
        loss = np.sum( -a  + b )\
                +Lambda * theta.T @ theta / 2
    return loss


def nabla(X, theta, Y, Lambda=0, c=False):
    """
    Implement the nabla operator
    :param X:
    :param theta:
    :param Y:
    :param Lambda: regularization strength
    :param c: loss function coefficient
    :return:
    """
    grad = X.T @ (sigmoid(X @ theta) - Y) + Lambda * theta
    if c:
        m, n = X.shape
        return (1.0 / m) * grad
    return grad


def scrubbing(data, stander=True):
    """
    Normalize data and add a constant term in the first column
    :param data:
    :param stander:
    :return:
    """
    row, columns = data.shape
    ones = np.ones((row, 1))
    if stander:
        x = data
        data = (x - np.min(x, axis=0)) / (np.max(x, axis=0) - np.min(x, axis=0))
    data = np.append(ones, data, axis=1)
    return data


def wolfe(x, theta, y, d, c, Lambda, log=False):
    c1 = 0.1
    c2 = 0.9
    alpha = 25
    a = 0
    b = 50
    max_epoch = 40
    epoch = 0
    while a < b and epoch < max_epoch:
        if cost(x, theta + alpha * d, y, c=c, Lambda=Lambda) <= \
                cost(x, theta, y, c=c, Lambda=Lambda) + \
                c1 * alpha * nabla(x, theta, y, c=c, Lambda=Lambda).T @ d:
            if nabla(x, theta + alpha * d, y, c=c, Lambda=Lambda).T @ d >= \
                    c2 * nabla(x, theta, y, c=c, Lambda=Lambda).T @ d:
                break
            else:
                a = alpha
        else:
            b = alpha
        alpha = (a + b) / 2
        epoch += 1
    if log and epoch == max_epoch:
        alpha = 0.05
        print("Wolfe line search failed")
    return alpha

def predict(X, theta, y):
    m, n = X.shape
    pre = np.zeros((m, 1))

    for i in range(m):
        activate = sigmoid(np.dot(X[i, :], theta))
        if activate > 0.5:
            pre[i] = 1
    count = 0
    for i in range(m):
        if pre[i] == y[i]:
            count += 1
    return count / m

BFGS Algorithm Implementation

def bfgs_method(x_train, y_label, max_iter=5000,
                epsilon=1e-4, Lambda=0,
                c=False, log=False):
    """
    Implement the damped Newton method
    :param x_train:
    :param y_label:
    :param max_iter:
    :param epsilon: precision
    :param Lambda: regularization term
    :param c: loss function coefficient
    :param log: logging option
    :return:
    """

    m, n = x_train.shape
    theta_k = np.ones((n, 1))
    I = np.eye(n)
    Hk = I
    grad_k = ct.nabla(x_train, theta_k, y_label,
                      c=c, Lambda=Lambda)
    k = 0
    while np.linalg.norm(grad_k) > epsilon and k < max_iter:
        # Newton direction
        d = - Hk @ grad_k
        # Line search
        alpha_k = ct.wolfe(x_train, theta_k, y_label, d,
                        c, Lambda)
        # Iteration
        theta_k1 = theta_k + alpha_k * d
        # Update Hessian
        s_k = theta_k1 - theta_k
        grad_k1 = ct.nabla(x_train, theta_k1, y_label,
                           c=c, Lambda=Lambda)
        y_k = grad_k1 - grad_k
        if s_k.T @ y_k <= 0:
            print("Matrix is not positive definite, solution failed, exiting")
            print(s_k.T @ y_k)
            break
        # BFGS formula
        Hk = ((I - (s_k @ y_k.T) / (s_k.T @ y_k))).T @ Hk @ ((I - (s_k @ y_k.T) / (s_k.T @ y_k))) + ((s_k @ s_k.T) / (s_k.T @ y_k))
        grad_k = grad_k1
        theta_k = theta_k1
        k += 1
        if log and  k % 200 == 0:
            print(k)
    if log and max_iter == k:
        print("Iteration limit exceeded, |grad|=",np.linalg.norm(grad_k))
    if log:
        print("Loss:", ct.cost(x_train, theta_k, y_label, c=c, Lambda=Lambda))
    return theta_k

L-BFGS Algorithm Implementation

import numpy as np
import clementine as ct
import time


def two_loop(s, y, rho, gk,col):
    """
    Compute the d_k direction using sequences s, y, rho
    :param gk: current gradient
    :param s: dictionary, point difference set
    :param y: dictionary, gradient difference set
    :param rho: dictionary, scalar set
    :return: d_k, Newton direction
    """
    n = len(s)  # Compute list length

    # Choose approximate matrix

    a = np.ones((n, 1))  # a[i] is also a scalar
    q = gk.copy()  # n * 1 matrix

    # First loop
    for i in range(n - 1, -1, -1):
        a[i] = rho[i] @ s[i].T @ q
        q -= a[i] * y[i]

    # Choose approximate matrix
    h0 = np.eye(col)
    if n >= 1:
        h0 = (s[-1].T @ y[-1]) / (y[-1].T @ y[-1]) * h0
    # Second loop
    d = h0 @ q

    for i in range(n):
        b = rho[i] @ y[i].T @ d
        d += s[i] @ (a[i] - b)
    return d


def l_bfgs_method(x_train, y_label, max_iter=5000,
                  epsilon=1e-4, Lambda=0, m=5,
                  c=False, log=False):
    # Parameter initialization
    row, col = x_train.shape
    theta_k = np.zeros((col, 1))  # Initial position
    grad_k = ct.nabla(x_train, theta_k, y_label,
                      Lambda=Lambda, c=c)
    k = 0
    s, y, rho = [], [], []

    while np.linalg.norm(grad_k) > epsilon and k < max_iter:
        d_k = -two_loop(s, y, rho, grad_k,col)

        alpha = ct.wolfe(x_train, theta_k, y_label, d_k,
                      c=c, Lambda=Lambda)
        theta_k1 = theta_k + alpha * d_k

        s_k = theta_k1 - theta_k
        grad_k1 = ct.nabla(x_train, theta_k1, y_label,
                           Lambda=Lambda, c=c)
        y_k = grad_k1 - grad_k

        if s_k.T @ y_k <= 0:
            print("Matrix is not positive definite, solution failed, exiting")
            print(s_k.T @ y_k)
            break
        rho.append(1 / (s_k.T @ y_k))
        s.append(s_k)
        y.append(y_k)
        if (len(rho) > m):
            rho.pop(0)
            s.pop(0)
            y.pop(0)
        # Update
        k += 1
        if log and k % 10 == 0:
            print("Iteration count",k,"||grad||",np.linalg.norm(grad_k))
            print("loss", ct.cost(x_train, theta_k1, y_label,
                                  Lambda=Lambda, c=c))
        grad_k = grad_k1
        theta_k = theta_k1
    # end while

    return theta_k