Implementing multiple linear regression using the normal equation and gradient descent respectively, with L2 regularization added for further optimization.

Linear Algebra Fundamentals
This section reviews some necessary linear algebra fundamentals and wraps the code into reusable functions for future use.
Matrix transpose:
Iterative Methods
The cornerstone of iterative methods is the fixed-point theorem. In a complete metric space, if a function satisfies:
then must have a fixed point, and for any initial value within the space, it can be obtained through a finite number of iterative steps.
Therefore, the core of constructing a valid iterative formula is to formulate , then iterate
Multiple Linear Regression
Given a sample set and a target set , they have the following forms respectively:
The so-called linear regression is essentially finding a certain plane such that the sum of distances from all sample points to this "plane" is minimized.
This so-called distance is the L2 norm, which defines the length of a vector, also known as the Euclidean distance. A more detailed explanation can be found in the [Regularization](#### Norms) section later; for now, a general understanding suffices.
Usage example:
The "plane" we are looking for is actually a hyperplane. In the mathematical sense, a hyperplane must pass through the origin, but using it directly may introduce some error. For example, in simple linear regression, if the regression function has its intercept term removed, the error will most likely increase. However, we can rewrite the dataset by adding one more dimension:
In this way, the hyperplane we seek can be written as , where
Note 1: All letters are assumed to be matrices (vectors) by default, except where specifically noted as scalars in formulas. For example, in the above equation is a matrix, and the rest are scalar parameters.
Note 2: In fact, the so-called hyperplane in most documentation is actually an affine set of some hyperplane, but the distinction is generally not strictly enforced. For more details, refer to books on convex optimization.
The loss function can be defined using the norm form:
where is the number of rows of .
Observing the loss function, it is in some sense very similar to a quadratic function, and we can think of it as such. To find the minimum of a quadratic function, we simply take its derivative; the point where the derivative equals zero is the local minimum, and for a quadratic function, the local minimum is the global minimum. The same approach applies here: take the gradient of the loss function with respect to , and the point where the gradient is zero is certainly the global minimum "point". Of course, in high-dimensional cases, the point that minimizes the function is likely a solution set rather than a single vector.
Therefore, taking the gradient of the loss function with respect to , we get:
Setting it to zero and solving for , we obtain:
This method yields the analytical (closed-form) solution. Clearly, this requires to be non-singular. When the number of samples is very small, this method cannot be used. However, this does not mean we cannot find the minimum of the loss function; we can consider gradient descent.
As mentioned in [Iterative Methods](# Iterative Methods), construct an iterative function and overwrite the old value with the newly computed one, i.e.:
where is called the learning rate.
With an appropriate learning rate set, if we continue iterating, the gradient will eventually approach zero, and will converge to the vicinity of the optimal solution.
In this way, as long as we can compute the gradient, we can obtain the optimal solution (which is likely not unique).\
The core code is as follows:
Detailed explanation of the numerical solution program using the gradient descent algorithm:
Importing libraries:
import numpy as np
import time # for timing
Specific function implementation:
def cal_gradient(theta=[], X=[],y=[]):
# Calculate gradient
rows , columns = X.shape
'''
Gradient formula:
gradient = (1/n) * X * ( X * theta - y)
'''
diff = np.dot(X,theta) - y
return (1./rows) * np.dot(np.transpose(X),diff)
def MLR_gradient_descent(X_0=[],y = [], tol = 1e-5,alpha = 0.01):
rows , columns = X_0.shape
X_add_one = np.ones((rows,1))
X = np.append(X_add_one,X_0,axis=1) # Add a column of 1s to X_0
# theta target matrix
theta = np.random.rand(columns+1,1)
# Gradient at current point
gradient = cal_gradient(theta,X,y)
while not np.all(np.absolute(gradient) <= tol):
theta = theta - alpha * gradient
gradient = cal_gradient(theta,X,y)
return theta
if __name__ == '__main__':
X_0 = data_in.data_in_X()
y = data_in.data_in_Y() # Data loading
rows, columns = X_0.shape
X_add_one = np.ones((rows, 1))
X = np.append(X_add_one, X_0, axis=1) # Add a column of 1s to X_0
begin_time = time.time()
theta = MLR_gradient_descent(X_0,y)
end_time = time.time()
print(np.dot(X,theta)) # Verify results
print('Computation time:',end_time - begin_time,'s')
Regularization
When fitting with the above methods, various problems may arise. This fitting approach is very sensitive to data and may encounter issues such as overfitting and collinearity. Moreover, when the data is sparse but features are numerous, the number of that minimize the loss is vast. Therefore, we introduce regularization terms to reduce model complexity, making the results more generalizable and stable. But before using regularization, let us first understand norms.
Norms
Norms can be seen as a generalization of the concept of distance. In plane geometry, given the coordinates of two points , we can easily determine the distance formula:
Even in three-dimensional space, given two points , their distance is
What about four-dimensional space? Five-dimensional or even higher dimensions? Although we cannot visualize these spaces, it is not hard to conjecture that the "distance" in these spaces has a unified definition similar to two and three dimensions. Its form is:
where are both -dimensional vectors. It is easy to verify that this also holds in the one-dimensional case, confirming the reasonableness of this conjecture. In fact, any function satisfying the following three conditions can be called a "distance". The conjecture here is actually just the L2 norm, which belongs to the L-P norm family. The L-P norm is not a single norm but a family of norms satisfying the form:
Note that when , it does not actually satisfy the properties of distance (triangle inequality), so it is not a true norm.
When , we get the L2 norm, often referred to as the Euclidean norm.
In regularization, in addition to the L2 norm, the L1 norm is also commonly used, known as the Manhattan Distance.
L2 Regularization
In regression analysis, all features are used to establish the coefficients for the solution. However, using them directly will encounter the various problems mentioned earlier. We introduce regularization terms to reduce the influence of these uninformative features.
Thus, after adding the regularization term, the loss function becomes:
Written in norm form:
The parameter controls the complexity of the model. The smaller is, the closer it is to the original linear regression, and the more prone it is to overfitting and other issues.
Next, taking the gradient of the loss function, we get:
Setting it to zero and solving for , we obtain:
When is a non-full-rank matrix (determinant is zero), adding ensures the matrix is invertible.
However, note that the identity matrix mentioned above is not a strict identity matrix. It is an matrix of the form:
The reason is simply that the first coefficient does not need to be scaled.
If using gradient descent iteration, based on the original iterative formula
we add the regularization term:
However, L2 regularization has difficulty converging to exactly zero, so additional constraints need to be added in the code:
iter = 500
Lambda = 0.5
while (not np.all(np.absolute(gradient) <= tol)) and iter > 0:
theta = theta - alpha * gradient - (1./rows) *alpha*Lambda*theta
gradient = cal_gradient(theta,X,y)
iter -= 1
L1 Regularization
Compared to L2 regularization, L1 regularization only changes the form of the regularization term. Its loss function is as follows:
However, when we attempt to compute the gradient, we find that the L1 norm is not differentiable, but a subgradient exists:
Therefore, the gradient of the loss function can be written as:
where
Implementing L1 regularization only requires a slight modification to the analytical solution code.
equ1 = np.linalg.inv(np.dot(X_t, X))
Change to:
I = Lambda/2 * np.identity(max(rows+1,columns+1))
I[0][0] = 0
equ1 = np.linalg.inv(np.dot(X_t,X) + I )
Comparison of L1 and L2 Regularization
Starting with the conclusion: L1 regularization tends to produce sparse fits, while L2 regularization tends to produce smooth fits.
Observing the L1 regularization approach, for any given value, if it is greater than zero, then will make it smaller; conversely, if it is less than zero, will make it larger. After several iterations, it drives more features to zero, making the fit result sparser.
L2 regularization, on the other hand, multiplies each parameter by a weight less than 1 at every iteration, then subtracts the descent gradient. A somewhat imprecise explanation is that L2 regularization reduces all parameters but applies a larger penalty to prominent features, ultimately making the fit smoother.