Getting started#


This page provides a starter example to get familiar with skglm and explore some of its features.

In the first section, we fit a Lasso estimator on a high-dimensional toy dataset (the number of features is much larger than the number of samples). In this regime, plain linear models don’t generalize well to unseen data. By adding an `\ell_1` penalty, we can train an estimator that overcomes this drawback.

In the last section, we explore other combinations of datafit and penalty to create a custom estimator that achieves a lower prediction error, namely an `\ell_1` Huber regression. We show that skglm is perfectly adapted to these experiments thanks to its modular design.

Beforehand, make sure that you have installed skglm:

# Installing from PyPI using pip
pip install -U skglm

# Installing from conda-forge using conda
conda install -c conda-forge skglm
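As a quick sanity check, you can try importing skglm in a Python shell; the snippet below assumes the package exposes a `__version__` attribute.

# sanity check: importing skglm should succeed after installation
import skglm
print(skglm.__version__)  # assumed attribute; prints the installed version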

Fitting a Lasso estimator#

Let’s start by generating a toy dataset and splitting it into train and test sets. For that, we will use scikit-learn’s make_regression.

# imports
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# generate toy data
X, y = make_regression(n_samples=100, n_features=1000)

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y)

Then let’s fit a skglm Lasso estimator and print its score on the test set.

# import estimator
from skglm import Lasso

# init and fit
estimator = Lasso()
estimator.fit(X_train, y_train)

# compute R²
estimator.score(X_test, y_test)
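To see the benefit of the `\ell_1` penalty in this setting where the number of features exceeds the number of samples, we can compare against an unpenalized baseline. Below is a minimal sketch using scikit-learn’s LinearRegression; the exact scores depend on the randomly generated data.

# baseline: unpenalized least squares, which tends to overfit when n_features >> n_samples
from sklearn.linear_model import LinearRegression

baseline = LinearRegression().fit(X_train, y_train)
print("LinearRegression R²:", baseline.score(X_test, y_test))
print("Lasso R²           :", estimator.score(X_test, y_test))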

Note

  • The first fit after importing skglm has an overhead, as skglm uses Numba. Subsequent fits will achieve top speed since the Numba compilation is cached.

skglm has several other scikit-learn compatible estimators. Check the API for more information about the available estimators.
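For instance, assuming your skglm version exposes ElasticNet and MCPRegression, they follow the same fit/score API as the Lasso above; a quick sketch with default parameters:

# other skglm estimators follow the same scikit-learn API
# (ElasticNet and MCPRegression are assumed to be available in your version)
from skglm import ElasticNet, MCPRegression

enet = ElasticNet().fit(X_train, y_train)
mcp = MCPRegression().fit(X_train, y_train)
print(enet.score(X_test, y_test), mcp.score(X_test, y_test))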

Fitting `\ell_1` Huber regression#

Suppose that this dataset contains outliers and we would like to mitigate their effect on the learned coefficients while keeping an estimator that generalizes well to unseen data. Ideally, we would like to fit an `\ell_1` Huber regressor.

skglm offers high flexibility to compose custom estimators. Through a simple API, it is possible to combine any skglm datafit and penalty.

Note

  • `\ell_1` regularization is not supported in scikit-learn’s HuberRegressor.

Let’s explore how to achieve that.

Generate corrupt data#

We will use the same script as before, except that we will pick 10 samples at random and corrupt their values.

# imports
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# generate toy data
n_samples, n_features = 100, 1000
X, y = make_regression(n_samples=n_samples, n_features=n_features)

# select and corrupt 10 random samples
y[np.random.choice(n_samples, 10)] = 100 * y.max()

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y)
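Before introducing the robust datafit, it is instructive to fit the plain Lasso from the previous section on this corrupted data; a minimal sketch (the exact score depends on the random corruption):

# the quadratic datafit of the Lasso is sensitive to the corrupted samples
from skglm import Lasso

lasso = Lasso().fit(X_train, y_train)
print("Lasso R² on corrupted data:", lasso.score(X_test, y_test))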

Now let’s compose a custom estimator using GeneralizedLinearEstimator. It’s the go-to way to create a custom estimator by combining a datafit and a penalty.

# import penalty and datafit
from skglm.penalties import L1
from skglm.datafits import Huber

# import GLM estimator
from skglm import GeneralizedLinearEstimator

# build and fit estimator
estimator = GeneralizedLinearEstimator(
    Huber(1.),
    L1(alpha=1.)
)
estimator.fit(X_train, y_train)

Note

  • Here the arguments given to the datafit and penalty are arbitrary and provided just for the sake of illustration.
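We can then evaluate the custom estimator on the held-out data; the sketch below assumes the fitted estimator exposes the usual predict method and uses scikit-learn’s r2_score.

# evaluate the l1 Huber regressor on the test set
from sklearn.metrics import r2_score

y_pred = estimator.predict(X_test)  # predict is assumed to be available on the fitted estimator
print("l1 Huber R²:", r2_score(y_test, y_pred))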

GeneralizedLinearEstimator allows combining any penalty and datafit implemented in skglm. If you don’t find an estimator in the estimators module, you can build your own by combining the appropriate datafit and penalty and passing them to GeneralizedLinearEstimator. Explore the list of supported datafits and penalties.
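As an illustration of this modularity, here is another combination, a quadratic datafit with a non-convex MCP penalty; the Quadratic and MCPenalty classes and the parameter values below are given for illustration only.

# another datafit/penalty combination: least squares with an MCP penalty
from skglm.datafits import Quadratic
from skglm.penalties import MCPenalty
from skglm import GeneralizedLinearEstimator

mcp_estimator = GeneralizedLinearEstimator(
    Quadratic(),
    MCPenalty(alpha=1., gamma=3.)
)
mcp_estimator.fit(X_train, y_train)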

Important

Explore further advanced topics and get hands-on examples on the tutorials page.