Implementing the damped Newton method using the BFGS formula, and implementing L-BFGS: Limited-memory BFGS.
This article focuses on mathematical derivation, assuming readers already have a solid understanding of Logistic Regression.
The main references for this article are:
Broyden-class Algorithms: Derivation of the BFGS Iteration Formula
The code is available in the repository:
Notation Conventions
denotes the gradient.
Basic Framework
Quasi-Newton method framework:
Computing the Gradient
If a regularization term is added, the loss function takes the form:
Its gradient is:
Computing the Hessian Matrix
Let be twice continuously differentiable. The first-order Taylor approximation of at is:
If we ignore the higher-order terms, we obtain:
or equivalently,
Equations (1) and (2) are called the secant equations.
Using the BFGS formula:
Its core idea is to perform a rank-two update on .
The issue is that is approximated from the secant equation, and from the very start of the iteration it is not an exact inverse of the second-order derivative matrix (the initial value is the identity matrix). However, as the iteration progresses, it quickly converges to the correct Hessian matrix. If the secant equation is derived from the formula with the regularization term, I believe does not need to include the regularization term separately.
Rank-Two Update
For the quasi-Newton matrix , let , and be undetermined. The rank-two update takes the form:
Substituting into the secant equation yields:
By choosing appropriate , we obtain:
Applying the Sherman-Morrison formula twice:
After tedious simplification, we arrive at the BFGS formula based on updates.
The derivation can also be done using the SMW (Sherman-Morrison-Woodbury) formula, but I am less familiar with that formula.
Wolfe Conditions for Line Search
Sufficient conditions for the quasi-Newton matrix to be positive definite:
- is positive definite
- The curvature condition is satisfied
Using the Wolfe conditions for line search ensures that the curvature condition is satisfied, guaranteeing the positive definiteness of the quasi-Newton matrix.
Definition
Let be a descent direction at . If:
then the step size satisfies the Wolfe conditions, where and .
Bisection can be used to quickly find a suitable .
L-BFGS Method
The BFGS method is fast, but it still has shortcomings. For Hessian matrices with a large number of features, computation and storage become challenging. Often we can solve this by adding more memory, but purely relying on hardware improvements has an upper limit. We need to design better algorithms to address this problem.
We observe that the computation of is based on an initial and intermediate quantities . Moreover, what we ultimately need is just the Newton direction , not a complete Hessian matrix. If there exists an algorithm that can directly compute the Newton direction , it would greatly save memory. It could even further improve computational speed (the overhead of large-scale matrix operations is significant).
Indeed, such a remarkable algorithm exists: it can directly compute based on information such as . The matrix is no longer stored explicitly; instead, a limited number of pairs are retained.
"Computing" the Hessian Matrix
For convenience, let us rewrite the matrix used to compute in BFGS as:
where
If we expand with a limited m steps, we obtain:
It appears that after expansion we still need to store . However, there exists a remarkable algorithm that allows us to compute directly without storing or .
The two-loop recursion algorithm employs an iterative structure to minimize memory overhead as much as possible. It also avoids matrix multiplication, greatly accelerating computation.
If we multiply both sides of the equation by , we obtain the following:
The process involving can be stored and computed iteratively, reducing computational cost. This is the first loop.
If we denote:
then we have:
The other loop is for computing the direction .
First compute . Expanding and merging the subsequent terms yields:
Recursing in this manner yields the final result .
Thus, the L-BFGS algorithm only needs to modify the way is computed and retain a limited number of iteration histories to keep the algorithm running.
Implementation Code
The above process can be written as code. Experiments show that the BFGS method requires many more iterations than the standard Newton method, but still needs a reasonable number. However, since it leverages second-order matrix information — even though the matrix is approximate — its convergence rate remains superlinear.
First, let us establish a common utility library shared by both L-BFGS and BFGS.
import pandas as pd
import numpy as np
"""
Clementine Model
A module for data processing. Also partially serves as a shared library,
e.g., loss function, gradient, etc.
Clementine means little tangerine, a pun on 'juzi' (bureau), suggesting
that writing this thing feels like being in prison.
As for why it's not called tangerine: that library name already exists,
so we avoid the conflict.
"""
'''
Shared functions, including:
sigmoid() activation function
cost() compute loss
nabla() analytically compute gradient
scrubbing() add constant term to data, with optional normalization
wolfe() search for suitable alpha value
predict() make predictions
'''
def sigmoid(x):
return 1.0 / (1 + np.exp(-x))
def cost(X, theta, Y, c=False, Lambda=0):
"""
Loss function using negative log-likelihood as the metric.
The regularization term uses L2 regularization; default is zero.
:param X: sample set
:param theta: coefficients to be determined
:param Y: label set
:param c: function coefficient
:param Lambda: regularization strength
:return:
"""
#h = sigmoid(X @ theta)
if c:
m, n = X.shape
a = np.multiply(X @ theta, Y)
b = np.log(1 + np.exp(X @ theta))
loss = (1.0 / m) * np.sum(-a + b) \
+ Lambda * theta.T @ theta / 2
else:
a = np.multiply(X @ theta, Y)
b = np.log(1 + np.exp(X @ theta))
loss = np.sum( -a + b )\
+Lambda * theta.T @ theta / 2
return loss
def nabla(X, theta, Y, Lambda=0, c=False):
"""
Implement the nabla operator
:param X:
:param theta:
:param Y:
:param Lambda: regularization strength
:param c: loss function coefficient
:return:
"""
grad = X.T @ (sigmoid(X @ theta) - Y) + Lambda * theta
if c:
m, n = X.shape
return (1.0 / m) * grad
return grad
def scrubbing(data, stander=True):
"""
Normalize data and add a constant term in the first column
:param data:
:param stander:
:return:
"""
row, columns = data.shape
ones = np.ones((row, 1))
if stander:
x = data
data = (x - np.min(x, axis=0)) / (np.max(x, axis=0) - np.min(x, axis=0))
data = np.append(ones, data, axis=1)
return data
def wolfe(x, theta, y, d, c, Lambda, log=False):
c1 = 0.1
c2 = 0.9
alpha = 25
a = 0
b = 50
max_epoch = 40
epoch = 0
while a < b and epoch < max_epoch:
if cost(x, theta + alpha * d, y, c=c, Lambda=Lambda) <= \
cost(x, theta, y, c=c, Lambda=Lambda) + \
c1 * alpha * nabla(x, theta, y, c=c, Lambda=Lambda).T @ d:
if nabla(x, theta + alpha * d, y, c=c, Lambda=Lambda).T @ d >= \
c2 * nabla(x, theta, y, c=c, Lambda=Lambda).T @ d:
break
else:
a = alpha
else:
b = alpha
alpha = (a + b) / 2
epoch += 1
if log and epoch == max_epoch:
alpha = 0.05
print("Wolfe line search failed")
return alpha
def predict(X, theta, y):
m, n = X.shape
pre = np.zeros((m, 1))
for i in range(m):
activate = sigmoid(np.dot(X[i, :], theta))
if activate > 0.5:
pre[i] = 1
count = 0
for i in range(m):
if pre[i] == y[i]:
count += 1
return count / m
BFGS Algorithm Implementation
def bfgs_method(x_train, y_label, max_iter=5000,
epsilon=1e-4, Lambda=0,
c=False, log=False):
"""
Implement the damped Newton method
:param x_train:
:param y_label:
:param max_iter:
:param epsilon: precision
:param Lambda: regularization term
:param c: loss function coefficient
:param log: logging option
:return:
"""
m, n = x_train.shape
theta_k = np.ones((n, 1))
I = np.eye(n)
Hk = I
grad_k = ct.nabla(x_train, theta_k, y_label,
c=c, Lambda=Lambda)
k = 0
while np.linalg.norm(grad_k) > epsilon and k < max_iter:
# Newton direction
d = - Hk @ grad_k
# Line search
alpha_k = ct.wolfe(x_train, theta_k, y_label, d,
c, Lambda)
# Iteration
theta_k1 = theta_k + alpha_k * d
# Update Hessian
s_k = theta_k1 - theta_k
grad_k1 = ct.nabla(x_train, theta_k1, y_label,
c=c, Lambda=Lambda)
y_k = grad_k1 - grad_k
if s_k.T @ y_k <= 0:
print("Matrix is not positive definite, solution failed, exiting")
print(s_k.T @ y_k)
break
# BFGS formula
Hk = ((I - (s_k @ y_k.T) / (s_k.T @ y_k))).T @ Hk @ ((I - (s_k @ y_k.T) / (s_k.T @ y_k))) + ((s_k @ s_k.T) / (s_k.T @ y_k))
grad_k = grad_k1
theta_k = theta_k1
k += 1
if log and k % 200 == 0:
print(k)
if log and max_iter == k:
print("Iteration limit exceeded, |grad|=",np.linalg.norm(grad_k))
if log:
print("Loss:", ct.cost(x_train, theta_k, y_label, c=c, Lambda=Lambda))
return theta_k
L-BFGS Algorithm Implementation
import numpy as np
import clementine as ct
import time
def two_loop(s, y, rho, gk,col):
"""
Compute the d_k direction using sequences s, y, rho
:param gk: current gradient
:param s: dictionary, point difference set
:param y: dictionary, gradient difference set
:param rho: dictionary, scalar set
:return: d_k, Newton direction
"""
n = len(s) # Compute list length
# Choose approximate matrix
a = np.ones((n, 1)) # a[i] is also a scalar
q = gk.copy() # n * 1 matrix
# First loop
for i in range(n - 1, -1, -1):
a[i] = rho[i] @ s[i].T @ q
q -= a[i] * y[i]
# Choose approximate matrix
h0 = np.eye(col)
if n >= 1:
h0 = (s[-1].T @ y[-1]) / (y[-1].T @ y[-1]) * h0
# Second loop
d = h0 @ q
for i in range(n):
b = rho[i] @ y[i].T @ d
d += s[i] @ (a[i] - b)
return d
def l_bfgs_method(x_train, y_label, max_iter=5000,
epsilon=1e-4, Lambda=0, m=5,
c=False, log=False):
# Parameter initialization
row, col = x_train.shape
theta_k = np.zeros((col, 1)) # Initial position
grad_k = ct.nabla(x_train, theta_k, y_label,
Lambda=Lambda, c=c)
k = 0
s, y, rho = [], [], []
while np.linalg.norm(grad_k) > epsilon and k < max_iter:
d_k = -two_loop(s, y, rho, grad_k,col)
alpha = ct.wolfe(x_train, theta_k, y_label, d_k,
c=c, Lambda=Lambda)
theta_k1 = theta_k + alpha * d_k
s_k = theta_k1 - theta_k
grad_k1 = ct.nabla(x_train, theta_k1, y_label,
Lambda=Lambda, c=c)
y_k = grad_k1 - grad_k
if s_k.T @ y_k <= 0:
print("Matrix is not positive definite, solution failed, exiting")
print(s_k.T @ y_k)
break
rho.append(1 / (s_k.T @ y_k))
s.append(s_k)
y.append(y_k)
if (len(rho) > m):
rho.pop(0)
s.pop(0)
y.pop(0)
# Update
k += 1
if log and k % 10 == 0:
print("Iteration count",k,"||grad||",np.linalg.norm(grad_k))
print("loss", ct.cost(x_train, theta_k1, y_label,
Lambda=Lambda, c=c))
grad_k = grad_k1
theta_k = theta_k1
# end while
return theta_k