
Utilities

Utility functions and resampling classes.

mapie.utils.train_conformalize_test_split

train_conformalize_test_split(
    X: NDArray,
    y: NDArray,
    train_size: Union[float, int],
    conformalize_size: Union[float, int],
    test_size: Union[float, int],
    random_state: Optional[int] = None,
    shuffle: bool = True,
) -> Tuple[
    NDArray, NDArray, NDArray, NDArray, NDArray, NDArray
]

Split arrays or matrices into train, conformalization and test subsets.

Utility similar to sklearn.model_selection.train_test_split for splitting data into 3 sets.

We advise allocating the majority of the data points to the train set, and at least 200 data points to the conformalization set.

PARAMETER DESCRIPTION
X

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

TYPE: indexable with the same type and length / shape[0] as "y"

y

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

TYPE: indexable with the same type and length / shape[0] as "X"

train_size

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.

TYPE: float or int

conformalize_size

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the conformalize split. If int, represents the absolute number of conformalize samples.

TYPE: float or int

test_size

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

TYPE: float or int

random_state

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

TYPE: int, RandomState instance or None DEFAULT: None

shuffle

Whether or not to shuffle the data before splitting.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
X_train, X_conformalize, X_test, y_train, y_conformalize, y_test :

Six array-like splits of the inputs. Output types are the same as the input types.

Examples:

>>> import numpy as np
>>> from mapie.utils import train_conformalize_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> (
...     X_train, X_conformalize, X_test,
...     y_train, y_conformalize, y_test
... ) = train_conformalize_test_split(
...     X, y, train_size=0.6, conformalize_size=0.2, test_size=0.2, random_state=1
... )
>>> X_train
array([[8, 9],
       [0, 1],
       [6, 7]])
>>> X_conformalize
array([[2, 3]])
>>> X_test
array([[4, 5]])
>>> y_train
[4, 0, 3]
>>> y_conformalize
[1]
>>> y_test
[2]
Source code in mapie/utils.py
def train_conformalize_test_split(
    X: NDArray,
    y: NDArray,
    train_size: Union[float, int],
    conformalize_size: Union[float, int],
    test_size: Union[float, int],
    random_state: Optional[int] = None,
    shuffle: bool = True,
) -> Tuple[NDArray, NDArray, NDArray, NDArray, NDArray, NDArray]:
    """Split arrays or matrices into train, conformalization and test subsets.

    Utility similar to sklearn.model_selection.train_test_split
    for splitting data into 3 sets.

    We advise allocating the majority of the data points to the train set,
    and at least 200 data points to the conformalization set.

    Parameters
    ----------
    X : indexable with the same type and length / shape[0] as "y"
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    y : indexable with the same type and length / shape[0] as "X"
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    train_size : float or int
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples.

    conformalize_size : float or int
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the conformalize split. If int, represents the
        absolute number of conformalize samples.

    test_size : float or int
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples.

    random_state : int, RandomState instance or None, default=None
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.

    shuffle : bool, default=True
        Whether or not to shuffle the data before splitting.

    Returns
    -------
    X_train, X_conformalize, X_test, y_train, y_conformalize, y_test :
        Six array-like splits of the inputs.
        Output types are the same as the input types.

    Examples
    --------
    >>> import numpy as np
    >>> from mapie.utils import train_conformalize_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    >>> (
    ...     X_train, X_conformalize, X_test,
    ...     y_train, y_conformalize, y_test
    ... ) = train_conformalize_test_split(
    ...     X, y, train_size=0.6, conformalize_size=0.2, test_size=0.2, random_state=1
    ... )
    >>> X_train
    array([[8, 9],
           [0, 1],
           [6, 7]])
    >>> X_conformalize
    array([[2, 3]])
    >>> X_test
    array([[4, 5]])
    >>> y_train
    [4, 0, 3]
    >>> y_conformalize
    [1]
    >>> y_test
    [2]
    """

    _check_train_conf_test_proportions(train_size, conformalize_size, test_size, len(X))

    X_train, X_conformalize_test, y_train, y_conformalize_test = train_test_split(
        X,
        y,
        train_size=train_size,
        random_state=random_state,
        shuffle=shuffle,
    )

    if isinstance(train_size, float):
        test_size_after_split = test_size / (1 - train_size)
    else:
        test_size_after_split = test_size

    X_conformalize, X_test, y_conformalize, y_test = train_test_split(
        X_conformalize_test,
        y_conformalize_test,
        test_size=test_size_after_split,
        random_state=random_state,
        shuffle=shuffle,
    )

    return X_train, X_conformalize, X_test, y_train, y_conformalize, y_test

mapie.subsample.Subsample

Subsample(
    n_resamplings: int = 30,
    n_samples: Optional[Union[int, float]] = None,
    replace: bool = True,
    random_state: Optional[Union[int, RandomState]] = None,
)

Bases: BaseCrossValidator

Generate a sampling method that resamples the training set, with possible bootstraps. It can be used as the cv argument in JackknifeAfterBootstrapRegressor.

PARAMETER DESCRIPTION
n_resamplings

Number of resamplings. By default 30.

TYPE: int DEFAULT: 30

n_samples

Number of samples in each resampling. By default None, the size of the training set. If it is between 0 and 1, it is interpreted as the fraction of training samples to draw.

TYPE: Optional[Union[int, float]] DEFAULT: None

replace

Whether to replace samples in resamplings or not. By default True.

TYPE: bool DEFAULT: True

random_state

int or RandomState instance. By default None.

TYPE: Optional[Union[int, RandomState]] DEFAULT: None

Examples:

>>> import numpy as np
>>> from mapie.subsample import Subsample
>>> cv = Subsample(n_resamplings=2, random_state=0)
>>> X = np.array([1,2,3,4,5,6,7,8,9,10])
>>> for train_index, test_index in cv.split(X):
...    print(f"train index is {train_index}, test index is {test_index}")
train index is [5 0 3 3 7 9 3 5 2 4], test index is [1 6 8]
train index is [7 6 8 8 1 6 7 7 8 1], test index is [0 2 3 4 5 9]
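The mechanics can be sketched with plain NumPy: draw n indices with replacement for the train set, and take the never-drawn (out-of-bag) indices as the test set (an illustrative sketch of the resampling logic, not the library code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
indices = np.arange(n)

# Bootstrap: sample n indices with replacement for training.
train_index = rng.choice(indices, size=n, replace=True)

# Out-of-bag indices (those never drawn) form the test set.
test_index = np.setdiff1d(indices, train_index)

print(train_index, test_index)
```

Because the train indices are drawn with replacement, some indices repeat and the out-of-bag remainder is non-empty with high probability, which is what makes jackknife-after-bootstrap residuals computable.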
Source code in mapie/subsample.py
def __init__(
    self,
    n_resamplings: int = 30,
    n_samples: Optional[Union[int, float]] = None,
    replace: bool = True,
    random_state: Optional[Union[int, RandomState]] = None,
) -> None:
    self.n_resamplings = n_resamplings
    self.n_samples = n_samples
    self.replace = replace
    self.random_state = random_state

split

split(
    X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]

Generate indices to split data into training and test sets.

PARAMETER DESCRIPTION
X

Training data.

TYPE: NDArray of shape (n_samples, n_features)

YIELDS DESCRIPTION
train

The training set indices for that split.

TYPE: NDArray of shape (n_indices_training,)

test

The testing set indices for that split.

TYPE: NDArray of shape (n_indices_test,)

Source code in mapie/subsample.py
def split(
    self, X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]:
    """
    Generate indices to split data into training and test sets.

    Parameters
    ----------
    X : NDArray of shape (n_samples, n_features)
        Training data.

    Yields
    ------
    train : NDArray of shape (n_indices_training,)
        The training set indices for that split.
    test : NDArray of shape (n_indices_test,)
        The testing set indices for that split.
    """
    indices = np.arange(_num_samples(X))
    n_samples = _check_n_samples(X, self.n_samples, indices)
    random_state = check_random_state(self.random_state)
    for k in range(self.n_resamplings):
        train_index = resample(
            indices,
            replace=self.replace,
            n_samples=n_samples,
            random_state=random_state,
            stratify=None,
        )
        test_index = np.setdiff1d(indices, train_index)
        yield train_index, test_index

get_n_splits

get_n_splits(*args: Any, **kargs: Any) -> int

Returns the number of splitting iterations in the cross-validator.

RETURNS DESCRIPTION
int

Returns the number of splitting iterations in the cross-validator.

Source code in mapie/subsample.py
def get_n_splits(self, *args: Any, **kargs: Any) -> int:
    """
    Returns the number of splitting iterations in the cross-validator.

    Returns
    -------
    int
        Returns the number of splitting iterations in the cross-validator.
    """
    return self.n_resamplings

mapie.subsample.BlockBootstrap

BlockBootstrap(
    n_resamplings: int = 30,
    length: Optional[int] = None,
    n_blocks: Optional[int] = None,
    overlapping: bool = False,
    random_state: Optional[Union[int, RandomState]] = None,
)

Bases: BaseCrossValidator

Generate a sampling method that block-bootstraps the training set. It can replace KFold, LeaveOneOut, or Subsample as the cv argument in the TimeSeriesRegressor class.

PARAMETER DESCRIPTION
n_resamplings

Number of resamplings. By default 30.

TYPE: int DEFAULT: 30

length

Length of the blocks. By default None, the length of the training set divided by n_blocks.

TYPE: Optional[int] DEFAULT: None

overlapping

Whether the blocks can overlap or not. By default False.

TYPE: bool DEFAULT: False

n_blocks

Number of blocks in each resampling. By default None, the size of the training set divided by length.

TYPE: Optional[int] DEFAULT: None

random_state

int or RandomState instance. By default None.

TYPE: Optional[Union[int, RandomState]] DEFAULT: None

RAISES DESCRIPTION
ValueError

If not exactly one of length and n_blocks is specified (both None, or both given).

Examples:

>>> import numpy as np
>>> from mapie.subsample import BlockBootstrap
>>> cv = BlockBootstrap(n_resamplings=2, length=3, random_state=0)
>>> X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> for train_index, test_index in cv.split(X):
...    print(f"train index is {train_index}, test index is {test_index}")
train index is [0 1 2 3 4 5 0 1 2 3 4 5], test index is [8 9 6 7]
train index is [3 4 5 6 7 8 0 1 2 6 7 8], test index is [9]
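The non-overlapping block construction can be sketched in NumPy: cut the index array into contiguous blocks, resample whole blocks with replacement, and concatenate them (a simplified sketch of the logic, not the library code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, length = 10, 3
indices = np.arange(n)

# Drop the trailing remainder so the array splits evenly into blocks.
n_blocks = n // length                                   # 3 blocks of length 3
blocks = np.split(indices[: n_blocks * length], n_blocks)

# Resample whole blocks with replacement, then concatenate.
# Sampling n_blocks + 1 blocks mirrors the source's (n // length) + 1.
picked = rng.integers(0, n_blocks, size=n_blocks + 1)
train_index = np.concatenate([blocks[k] for k in picked])

# Test set: indices never covered by the sampled blocks.
test_index = np.setdiff1d(indices, train_index)

print(train_index, test_index)
```

Resampling contiguous blocks rather than individual points preserves short-range temporal dependence within each block, which is the point of the block bootstrap for time series.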
Source code in mapie/subsample.py
def __init__(
    self,
    n_resamplings: int = 30,
    length: Optional[int] = None,
    n_blocks: Optional[int] = None,
    overlapping: bool = False,
    random_state: Optional[Union[int, RandomState]] = None,
) -> None:
    self.n_resamplings = n_resamplings
    self.length = length
    self.n_blocks = n_blocks
    self.overlapping = overlapping
    self.random_state = random_state

split

split(
    X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]

Generate indices to split data into training and test sets.

PARAMETER DESCRIPTION
X

Training data.

TYPE: NDArray of shape (n_samples, n_features)

YIELDS DESCRIPTION
train

The training set indices for that split.

TYPE: NDArray of shape (n_indices_training,)

test

The testing set indices for that split.

TYPE: NDArray of shape (n_indices_test,)

RAISES DESCRIPTION
ValueError

If length is not positive, or is greater than the training set size.

Source code in mapie/subsample.py
def split(
    self, X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]:
    """
    Generate indices to split data into training and test sets.

    Parameters
    ----------
    X : NDArray of shape (n_samples, n_features)
        Training data.

    Yields
    ------
    train : NDArray of shape (n_indices_training,)
        The training set indices for that split.
    test : NDArray of shape (n_indices_test,)
        The testing set indices for that split.

    Raises
    ------
    ValueError
        If `length` is not positive, or is greater than the training set size.
    """
    if (self.n_blocks is not None) + (self.length is not None) != 1:
        raise ValueError(
            "Exactly one of `length` and `n_blocks` must be not None"
        )

    n = len(X)

    if self.n_blocks is not None:
        length = self.length if self.length is not None else n // self.n_blocks
        n_blocks = self.n_blocks
    else:
        length = cast(int, self.length)
        n_blocks = (n // length) + 1

    indices = np.arange(n)
    if (length <= 0) or (length > n):
        raise ValueError(
            "The length of blocks is <= 0 or greater than the length "
            "of the training set."
        )

    if self.overlapping:
        blocks = sliding_window_view(indices, window_shape=length)
    else:
        if n % length == 0:
            indices_used_for_blocks = indices
        else:
            indices_used_for_blocks = indices[: -(n % length)]
        blocks_number = n // length
        blocks = np.asarray(
            np.split(indices_used_for_blocks, indices_or_sections=blocks_number)
        )

    random_state = check_random_state(self.random_state)

    for k in range(self.n_resamplings):
        block_indices = resample(
            range(len(blocks)),
            replace=True,
            n_samples=n_blocks,
            random_state=random_state,
            stratify=None,
        )
        train_index = np.concatenate([blocks[k] for k in block_indices], axis=0)
        test_index = np.array(list(set(indices) - set(train_index)), dtype=np.int64)
        yield train_index, test_index

get_n_splits

get_n_splits(*args: Any, **kargs: Any) -> int

Returns the number of splitting iterations in the cross-validator.

RETURNS DESCRIPTION
int

Returns the number of splitting iterations in the cross-validator.

Source code in mapie/subsample.py
def get_n_splits(self, *args: Any, **kargs: Any) -> int:
    """
    Returns the number of splitting iterations in the cross-validator.

    Returns
    -------
    int
        Returns the number of splitting iterations in the cross-validator.
    """
    return self.n_resamplings