Implementing multiple linear regression using the normal equation and gradient descent respectively, with L2 regularization added for further optimization.

Multiple Linear Regression.jpg

Linear Algebra Fundamentals

This section reviews some necessary linear algebra fundamentals and wraps the code into reusable functions for future use.

Matrix transpose:

(A+B)T=AT+BT(AB)T=BTAT(A+B)^T=A^T+B^T\\ (AB)^T=B^TA^T

Iterative Methods

The cornerstone of iterative methods is the fixed-point theorem. In a complete metric space, if a function satisfies:

f(x)f(y)<αxy|f(x)-f(y)| < \alpha|x-y|

then ff must have a fixed point, and for any initial value within the space, it can be obtained through a finite number of iterative steps.

Therefore, the core of constructing a valid iterative formula is to formulate x=f(x)x=f(x), then iterate f(x0)f(x_0)

Multiple Linear Regression

Given a sample set XX and a target set yy, they have the following forms respectively:

X=[x11x12x1mx21x22x2mxn1xn2xnm]X = \left[\begin{matrix} x_{11} & x_{12} & \cdots &x_{1m}\\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & &&\vdots\\ x_{n1} & x_{n2} &\cdots & x_{nm}\\ \end{matrix} \right] y=[y1y2yn]Ty = \left[ \begin{matrix} y_1 & y_2 & \cdots &y_n \end{matrix} \right]^T

The so-called linear regression is essentially finding a certain plane such that the sum of distances from all sample points to this "plane" is minimized.

This so-called distance is the L2 norm, which defines the length of a vector, also known as the Euclidean distance. A more detailed explanation can be found in the [Regularization](#### Norms) section later; for now, a general understanding suffices.

x2=i=1nxi2\| x\|_2 = \sqrt{\sum^n_{i=1} x_i^2}

Usage example:

x2=((i=1Nx(i)2)1/2)2;x22=2x\|x\|^2 = ((\sum^N_{i=1}x^{(i)^2})^{1/2})^2;\nabla\|x\|^2_2 = 2x

The "plane" we are looking for is actually a hyperplane. In the mathematical sense, a hyperplane must pass through the origin, but using it directly may introduce some error. For example, in simple linear regression, if the regression function y=kx+by=k*x+b has its intercept term removed, the error will most likely increase. However, we can rewrite the dataset XX by adding one more dimension:

Xˉ=[1x11x12x1m1x21x22x2m1xn1xn2xnm]\bar X = \left[\begin{matrix} 1 &x_{11} & x_{12} & \cdots &x_{1m}\\ 1&x_{21} & x_{22} & \cdots & x_{2m}\\ \vdots&\vdots & &&\vdots\\ 1&x_{n1} & x_{n2} &\cdots & x_{nm}\\ \end{matrix} \right]

In this way, the hyperplane we seek can be written as y=θXˉy = \theta\bar X, where

θ=[θ0θ1θn]T\boldsymbol{\theta} = \left[ \begin{matrix} \theta_0 & \theta_1 &\cdots &\theta_n \end{matrix} \right]^T

Note 1: All letters are assumed to be matrices (vectors) by default, except where specifically noted as scalars in formulas. For example, θ\boldsymbol\theta in the above equation is a matrix, and the rest are scalar parameters.

Note 2: In fact, the so-called hyperplane in most documentation is actually an affine set of some hyperplane, but the distinction is generally not strictly enforced. For more details, refer to books on convex optimization.

The loss function can be defined using the norm form:

L(θ)=12mXTθy^2\mathscr L(\theta) =\frac1{2m}\|X^T\theta -\hat y\|^2

where mm is the number of rows of XX.

Observing the loss function, it is in some sense very similar to a quadratic function, and we can think of it as such. To find the minimum of a quadratic function, we simply take its derivative; the point where the derivative equals zero is the local minimum, and for a quadratic function, the local minimum is the global minimum. The same approach applies here: take the gradient of the loss function with respect to θ\theta, and the point where the gradient is zero is certainly the global minimum "point". Of course, in high-dimensional cases, the point that minimizes the function is likely a solution set rather than a single vector.

Therefore, taking the gradient of the loss function with respect to θ\theta, we get:

θ(L(θ))=1mX(XTθy)\nabla_\theta(\mathscr{L}(\theta))= \frac1{m}X(X^T\theta - y)

Setting it to zero and solving for θ\theta, we obtain:

θ=(XTX)1XTy\theta = (X^TX)^{-1} X^Ty

This method yields the analytical (closed-form) solution. Clearly, this requires XX to be non-singular. When the number of samples is very small, this method cannot be used. However, this does not mean we cannot find the minimum of the loss function; we can consider gradient descent.

As mentioned in [Iterative Methods](# Iterative Methods), construct an iterative function and overwrite the old value with the newly computed one, i.e.:

θj+1:=θjαθL(θj)\theta_{j+1} := \theta_j - \alpha\nabla_\theta\mathscr{L}(\theta_j)

where α\alpha is called the learning rate.

With an appropriate learning rate set, if we continue iterating, the gradient will eventually approach zero, and θ\theta will converge to the vicinity of the optimal solution.

In this way, as long as we can compute the gradient, we can obtain the optimal solution (which is likely not unique).\

The core code is as follows:

while θL(θ)>εθ=θαθL(θ)\begin{align} &\text{while}\ |\nabla_\theta\mathscr{L}(\theta)|>\varepsilon\\ &\quad\quad\theta=\theta - \alpha \nabla_\theta\mathscr{L}(\theta) \\ \end{align}

Detailed explanation of the numerical solution program using the gradient descent algorithm:

Importing libraries:

import numpy as np
import time # for timing

Specific function implementation:

def cal_gradient(theta=[], X=[],y=[]):
    # Calculate gradient
    rows , columns = X.shape
    '''
    Gradient formula:
    gradient = (1/n) * X * ( X * theta - y)
    '''
    diff = np.dot(X,theta) - y
    return (1./rows) * np.dot(np.transpose(X),diff)

def MLR_gradient_descent(X_0=[],y = [], tol = 1e-5,alpha = 0.01):
       rows , columns = X_0.shape
    X_add_one  = np.ones((rows,1))
    X = np.append(X_add_one,X_0,axis=1) # Add a column of 1s to X_0

    # theta target matrix
    theta = np.random.rand(columns+1,1)
    # Gradient at current point
    gradient = cal_gradient(theta,X,y)

    while not np.all(np.absolute(gradient) <= tol):
        theta = theta - alpha * gradient
        gradient = cal_gradient(theta,X,y)

    return theta

if __name__ == '__main__':

    X_0 = data_in.data_in_X()
    y = data_in.data_in_Y() # Data loading

    rows, columns = X_0.shape

    X_add_one = np.ones((rows, 1))
    X = np.append(X_add_one, X_0, axis=1)  # Add a column of 1s to X_0
    begin_time = time.time()

    theta = MLR_gradient_descent(X_0,y)

    end_time = time.time()

    print(np.dot(X,theta)) # Verify results
    print('Computation time:',end_time - begin_time,'s')

Regularization

When fitting with the above methods, various problems may arise. This fitting approach is very sensitive to data and may encounter issues such as overfitting and collinearity. Moreover, when the data is sparse but features are numerous, the number of θ\theta that minimize the loss is vast. Therefore, we introduce regularization terms to reduce model complexity, making the results more generalizable and stable. But before using regularization, let us first understand norms.

Norms

Norms can be seen as a generalization of the concept of distance. In plane geometry, given the coordinates of two points (x1,y1),(x2,y2)(x_1,y_1),(x_2,y_2), we can easily determine the distance formula:

d=(x1x2)2+(y1y2)2d = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}

Even in three-dimensional space, given two points (x1,y1,z1),(x2,y2,z2)(x_1,y_1,z_1),(x_2,y_2,z_2), their distance is

d=(x1x1)2+(y1y2)2+(z1z2)2d = \sqrt{(x_1-x_1)^2+(y_1-y_2)^2+(z_1-z_2)^2}

What about four-dimensional space? Five-dimensional or even higher dimensions? Although we cannot visualize these spaces, it is not hard to conjecture that the "distance" in these spaces has a unified definition similar to two and three dimensions. Its form is:

d=i=1n(AiBi)2d = \sqrt{\sum^n_{i=1}(A_i - B_i)^2}

where A,BA,B are both nn-dimensional vectors. It is easy to verify that this also holds in the one-dimensional case, confirming the reasonableness of this conjecture. In fact, any function satisfying the following three conditions can be called a "distance". The conjecture here is actually just the L2 norm, which belongs to the L-P norm family. The L-P norm is not a single norm but a family of norms satisfying the form:

Lp=Xp=i=1nxippX=(x1,x2,,xn)\begin{aligned} L_p = \|X\|_p& = \sqrt[p]{\sum^n_{i=1}x_i^p}\\ X = (x_1,&x_2,\cdots,x_n) \end{aligned}

Note that when p=0p=0, it does not actually satisfy the properties of distance (triangle inequality), so it is not a true norm.

When p=2p=2, we get the L2 norm, often referred to as the Euclidean norm.

In regularization, in addition to the L2 norm, the L1 norm is also commonly used, known as the Manhattan Distance.

L2 Regularization

In regression analysis, all features are used to establish the coefficients for the solution. However, using them directly will encounter the various problems mentioned earlier. We introduce regularization terms to reduce the influence of these uninformative features.

Thus, after adding the regularization term, the loss function becomes:

L(θ)=12i=1N(f(x(i))y(i))2+λ2θ22\mathscr{L}(\theta) = \frac{1}{2}\sum^N_{i=1}(f(x^{(i)}) - y^{(i)})^2 + \frac\lambda2\|\theta\|^2_2

Written in norm form:

Lθ=12XTθY22+λ2θ22\mathscr{L}\theta =\frac1{2} \|X^T\theta-Y\|^2_2 + \frac\lambda 2\|\theta\|^2_2

The parameter λ\lambda controls the complexity of the model. The smaller λ\lambda is, the closer it is to the original linear regression, and the more prone it is to overfitting and other issues.

Next, taking the gradient of the loss function, we get:

θL(θ)=(XTX+λI2)θXTY\nabla_{\theta}\mathscr{L}(\theta)=(X^TX + \frac{\lambda I}{2})\theta - X^TY

Setting it to zero and solving for θ\theta, we obtain:

θ=(XTX+λI2)1XTY\theta=(X^TX+\frac{\lambda I}{2})^{-1}X^TY

When XTXX^TX is a non-full-rank matrix (determinant is zero), adding λ2I\frac{\lambda}2I ensures the matrix is invertible.

However, note that the identity matrix II mentioned above is not a strict identity matrix. It is an (n+1)×(n+1)(n+1)\times(n+1) matrix of the form:

[0000010000100001]\left[ \begin{matrix} 0 & 0 &0\cdots &0\\ 0 & 1 &0\cdots &0\\ 0 & 0 &1\cdots &0\\ \vdots & & & \vdots\\ 0 & 0 &0 \cdots&1\\ \end{matrix} \right]

The reason is simply that the first coefficient does not need to be scaled.

If using gradient descent iteration, based on the original iterative formula

θ:=θαθLθ\theta := \theta- \alpha \cdot \nabla_\theta \mathscr{L}\theta

we add the regularization term:

θ:=θαθLθαλθ\theta := \theta- \alpha \cdot \nabla_\theta \mathscr{L}\theta-\alpha \lambda \cdot \theta

However, L2 regularization has difficulty converging to exactly zero, so additional constraints need to be added in the code:

iter = 500
Lambda = 0.5
while (not np.all(np.absolute(gradient) <= tol)) and iter > 0:
    theta = theta - alpha * gradient - (1./rows) *alpha*Lambda*theta
    gradient = cal_gradient(theta,X,y)
    iter -= 1

L1 Regularization

Compared to L2 regularization, L1 regularization only changes the form of the regularization term. Its loss function is as follows:

L(θ)=12XTθY22+λθ1\mathscr{L}(\theta)=\frac12\|X^T\theta-Y\|^2_2+\lambda\|\theta\|_1

However, when we attempt to compute the gradient, we find that the L1 norm is not differentiable, but a subgradient exists:

X1=sign(X)\nabla \|X\|_1 = \text{sign}(X)

Therefore, the gradient of the loss function can be written as:

θL(θ)=λ×sign(θ)+θJ(θ)\nabla_{\theta}\mathscr{L}(\theta) = \lambda\times\text{sign}(\theta) + \nabla_{\theta}\mathscr{J}(\theta)

where J(θ)=12XTθY\mathscr{J}(\theta)=\frac{1}{2}\|X^T\theta-Y\|

Implementing L1 regularization only requires a slight modification to the analytical solution code.

equ1 = np.linalg.inv(np.dot(X_t, X))

Change to:

I = Lambda/2 * np.identity(max(rows+1,columns+1))
I[0][0] = 0
equ1 = np.linalg.inv(np.dot(X_t,X) + I )

Comparison of L1 and L2 Regularization

Starting with the conclusion: L1 regularization tends to produce sparse fits, while L2 regularization tends to produce smooth fits.

Observing the L1 regularization approach, for any given value, if it is greater than zero, then λθ1\lambda \|\theta\|_1 will make it smaller; conversely, if it is less than zero, λθ1\lambda \|\theta\|_1 will make it larger. After several iterations, it drives more features to zero, making the fit result sparser.

L2 regularization, on the other hand, multiplies each parameter by a weight less than 1 at every iteration, then subtracts the descent gradient. A somewhat imprecise explanation is that L2 regularization reduces all parameters but applies a larger penalty to prominent features, ultimately making the fit smoother.