Computation of the intercept
This note explains how an intercept coefficient is handled within the skglm solvers.
Let the design matrix be `X in RR^{n times p}`, where `n` is the number of samples and `p` the number of features. We denote by `beta in RR^p` the coefficients of the Generalized Linear Model and by `beta_0` its intercept.
In many packages such as liblinear, the intercept is handled by adding an extra column of ones to the design matrix. This is costly in memory, and may lead to a different solution if all coefficients are penalized, since the intercept `beta_0` usually is not.
skglm follows a different route and solves directly:

`hat(beta), hat(beta)_0 in "argmin"_(beta in RR^p, beta_0 in RR) f(X beta + beta_0 bb"1"_n) + sum_(j=1)^p g_j(beta_j) ,   (1)`

where `bb"1"_n` is the vector of size `n` composed only of ones, `f` is the datafit term, and the `g_j` are the coordinate-wise penalties, which leave `beta_0` unpenalized.
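As a point of usage (not part of the derivation), this is exposed through the familiar scikit-learn-style fit_intercept argument of skglm's estimators. A minimal sketch, assuming the skglm.Lasso estimator and synthetic data:

```python
import numpy as np
from skglm import Lasso

# Synthetic data with a true intercept of 3; no column of ones is
# appended to X, and beta_0 is left unpenalized by the solver.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ rng.standard_normal(10) + 3.0

model = Lasso(alpha=0.01, fit_intercept=True).fit(X, y)
print(model.intercept_)  # close to 3
```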
The solvers of skglm update the intercept after each update of `beta` by doing a (one-dimensional) gradient descent step:

`beta_0^((k+1)) = beta_0^((k)) - 1/(L_0) nabla_(beta_0) f(X beta^((k+1)) + beta_0^((k)) bb"1"_n) ,   (2)`

where `L_0` is the Lipschitz constant associated to the intercept. The local Lipschitz constant `L_0` satisfies the following inequality:

`|nabla_(beta_0) f(X beta + (beta_0 + epsilon) bb"1"_n) - nabla_(beta_0) f(X beta + beta_0 bb"1"_n)| <= L_0 |epsilon| ,`

for all `beta in RR^p` and all `beta_0, epsilon in RR`.
This update rule should be implemented in the intercept_update_step method of the datafit class.
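As a minimal sketch of how such a method is consumed (the driver function below is illustrative, not skglm's actual solver code; it only assumes that intercept_update_step(y, Xw) returns the full step, i.e. the gradient divided by `L_0`):

```python
def run_intercept_steps(y, Xbeta, intercept_update_step, n_steps=10):
    """Illustrative only: apply Eq. 2 repeatedly with beta held fixed.

    `Xbeta` is X @ beta (without the intercept), and
    `intercept_update_step(y, Xw)` is assumed to return
    grad_{beta_0} f(Xw) / L_0, as in skglm's datafit classes.
    """
    beta0 = 0.0
    for _ in range(n_steps):
        Xw = Xbeta + beta0                     # predictions X beta + beta_0
        beta0 -= intercept_update_step(y, Xw)  # Eq. 2
    return beta0
```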
The convergence criterion computed for the intercept is then simply the absolute value of the gradient with respect to `beta_0`, since the intercept optimality condition, for a solution `beta^star, beta_0^star`, is:

`nabla_(beta_0) f(X beta^star + beta_0^star bb"1"_n) = 0 .   (3)`

Moreover, we have that

`nabla_(beta_0) f(X beta + beta_0 bb"1"_n) = bb"1"_n^T nabla f(X beta + beta_0 bb"1"_n) = sum_(i=1)^n [nabla f(X beta + beta_0 bb"1"_n)]_i .   (4)`
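A quick numerical sanity check of Eq. 4, using the quadratic datafit of the next section purely as an example (all names here are local to the snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
beta, beta0 = rng.standard_normal(p), 0.5

# Quadratic datafit: f(z) = ||y - z||^2 / (2 n), so grad f(z) = (z - y) / n
z = X @ beta + beta0
grad_f = (z - y) / n

# Eq. 4: the gradient w.r.t. beta_0 is the sum of the entries of grad f
grad_beta0 = grad_f.sum()

# A central finite difference of f in beta_0 recovers the same value
def f(b0):
    return np.sum((y - X @ beta - b0) ** 2) / (2 * n)

eps = 1e-6
fd = (f(beta0 + eps) - f(beta0 - eps)) / (2 * eps)
np.testing.assert_allclose(grad_beta0, fd, rtol=1e-5)
```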
We will now derive the update used in Equation 2 for three different datafit functions.
The Quadratic datafit
We define

`f(X beta + beta_0 bb"1"_n) = 1/(2n) ||y - X beta - beta_0 bb"1"_n||_2^2 .`

In this case `nabla f(z) = 1/n (z - y)`, hence Eq. 4 becomes:

`nabla_(beta_0) f(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (X_(i:) beta + beta_0 - y_i) .`
Finally, the Lipschitz constant is `L_0 = 1/n sum_(i=1)^n 1^2 = 1`.
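In code, the resulting step of Eq. 2 is just the mean residual. A minimal sketch (the function name and the toy check are illustrative, not skglm's exact implementation):

```python
import numpy as np

def quadratic_intercept_update_step(y, Xw):
    """Eq. 2 step for the quadratic datafit:
    grad_{beta_0} f = mean(Xw - y) and L_0 = 1."""
    return np.mean(Xw - y)

# Toy check: with beta = 0, a single step sends beta_0 to mean(y).
rng = np.random.default_rng(0)
y = rng.standard_normal(100) + 3.0
beta0 = 0.0
beta0 -= quadratic_intercept_update_step(y, np.full_like(y, beta0))
print(np.isclose(beta0, y.mean()))  # True
```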
The Logistic datafit
In this case,

`f(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n log(1 + exp(-y_i (X_(i:) beta + beta_0))) ,`

with labels `y_i in {-1, 1}`. We can then write

`nabla_(beta_0) f(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (-y_i)/(1 + exp(y_i (X_(i:) beta + beta_0))) .`
Finally, the Lipschitz constant is `L_0 = 1/(4n) sum_(i=1)^n 1^2 = 1/4`.
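A matching sketch for the logistic step, assuming labels `y_i in {-1, 1}` (again, the function name and toy check are illustrative):

```python
import numpy as np

def logistic_intercept_update_step(y, Xw):
    """Eq. 2 step for the logistic datafit:
    grad_{beta_0} f = mean(-y / (1 + exp(y * Xw))) and L_0 = 1/4."""
    grad_beta0 = np.mean(-y / (1.0 + np.exp(y * Xw)))
    return grad_beta0 / 0.25

# Toy check: with X beta = 0 and 75% positive labels, beta_0 converges
# to the log-odds log(0.75 / 0.25) ~ 1.0986.
y = np.array([1.0] * 75 + [-1.0] * 25)
beta0 = 0.0
for _ in range(100):
    beta0 -= logistic_intercept_update_step(y, np.full_like(y, beta0))
print(round(beta0, 4))  # 1.0986
```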
The Huber datafit
In this case,

`f(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n f_(delta)(y_i - X_(i:) beta - beta_0) ,`

where

`f_(delta)(x) = x^2 / 2` if `|x| <= delta`, and `f_(delta)(x) = delta |x| - delta^2 / 2` otherwise.

Let `r_i = y_i - X_(i:) beta - beta_0`. We can then write

`nabla_(beta_0) f(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (-r_i bbb"1"_({|r_i| <= delta}) - delta "sign"(r_i) bbb"1"_({|r_i| > delta})) ,`

where `bbb"1"_({|x| > delta})` is the classical indicator function.
Finally, the Lipschitz constant is `L_0 = 1/n sum_(i=1)^n 1^2 = 1`.
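Finally, a sketch for the Huber step; the two indicator cases above are exactly a residual clipped at `delta` (function name and toy check illustrative):

```python
import numpy as np

def huber_intercept_update_step(y, Xw, delta):
    """Eq. 2 step for the Huber datafit: the summand above equals
    -clip(r_i, -delta, delta), and L_0 = 1."""
    r = y - Xw
    return np.mean(-np.clip(r, -delta, delta))

# Toy check: with X beta = 0 and one gross outlier, beta_0 settles
# near the bulk of y rather than being dragged towards the outlier.
y = np.array([1.0, 1.1, 0.9, 1.0, 50.0])
beta0 = 0.0
for _ in range(200):
    beta0 -= huber_intercept_update_step(y, np.full_like(y, beta0), delta=1.0)
print(round(beta0, 2))  # 1.25
```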