
Utilities

Utility functions and resampling classes.

mapie.utils.train_conformalize_test_split

train_conformalize_test_split(
    X: NDArray,
    y: NDArray,
    train_size: Union[float, int],
    conformalize_size: Union[float, int],
    test_size: Union[float, int],
    random_state: Optional[int] = None,
    shuffle: bool = True,
) -> Tuple[
    NDArray, NDArray, NDArray, NDArray, NDArray, NDArray
]

Split arrays or matrices into train, conformalization and test subsets.

Utility similar to sklearn.model_selection.train_test_split for splitting data into 3 sets.

We advise allocating the majority of the data points to the train set, and at least 200 data points to the conformalization set.

PARAMETER DESCRIPTION
X

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

TYPE: indexable with the same type and length / shape[0] as "y"

y

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

TYPE: indexable with the same type and length / shape[0] as "X"

train_size

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.

TYPE: float or int

conformalize_size

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the conformalize split. If int, represents the absolute number of conformalize samples.

TYPE: float or int

test_size

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.

TYPE: float or int

random_state

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

TYPE: int, RandomState instance or None DEFAULT: None

shuffle

Whether or not to shuffle the data before splitting.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
X_train, X_conformalize, X_test, y_train, y_conformalize, y_test :

Six array-like splits of the inputs. Output types are the same as the input types.

Examples:

>>> import numpy as np
>>> from mapie.utils import train_conformalize_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> (
...     X_train, X_conformalize, X_test,
...     y_train, y_conformalize, y_test
... ) = train_conformalize_test_split(
...     X, y, train_size=0.6, conformalize_size=0.2, test_size=0.2, random_state=1
... )
>>> X_train
array([[8, 9],
       [0, 1],
       [6, 7]])
>>> X_conformalize
array([[2, 3]])
>>> X_test
array([[4, 5]])
>>> y_train
[4, 0, 3]
>>> y_conformalize
[1]
>>> y_test
[2]
Source code in mapie/utils.py
def train_conformalize_test_split(
    X: NDArray,
    y: NDArray,
    train_size: Union[float, int],
    conformalize_size: Union[float, int],
    test_size: Union[float, int],
    random_state: Optional[int] = None,
    shuffle: bool = True,
) -> Tuple[NDArray, NDArray, NDArray, NDArray, NDArray, NDArray]:
    """Split arrays or matrices into train, conformalization and test subsets.

    Utility similar to sklearn.model_selection.train_test_split
    for splitting data into 3 sets.

    We advise allocating the majority of the data points to the train set,
    and at least 200 data points to the conformalization set.

    Parameters
    ----------
    X : indexable with the same type and length / shape[0] as "y"
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    y : indexable with the same type and length / shape[0] as "X"
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    train_size : float or int
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples.

    conformalize_size : float or int
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the conformalize split. If int, represents the
        absolute number of conformalize samples.

    test_size : float or int
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples.

    random_state : int, RandomState instance or None, default=None
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.

    shuffle : bool, default=True
        Whether or not to shuffle the data before splitting.

    Returns
    -------
    X_train, X_conformalize, X_test, y_train, y_conformalize, y_test :
        Six array-like splits of the inputs.
        Output types are the same as the input types.

    Examples
    --------
    >>> import numpy as np
    >>> from mapie.utils import train_conformalize_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    >>> (
    ...     X_train, X_conformalize, X_test,
    ...     y_train, y_conformalize, y_test
    ... ) = train_conformalize_test_split(
    ...     X, y, train_size=0.6, conformalize_size=0.2, test_size=0.2, random_state=1
    ... )
    >>> X_train
    array([[8, 9],
           [0, 1],
           [6, 7]])
    >>> X_conformalize
    array([[2, 3]])
    >>> X_test
    array([[4, 5]])
    >>> y_train
    [4, 0, 3]
    >>> y_conformalize
    [1]
    >>> y_test
    [2]
    """

    _check_train_conf_test_proportions(train_size, conformalize_size, test_size, len(X))

    X_train, X_conformalize_test, y_train, y_conformalize_test = train_test_split(
        X,
        y,
        train_size=train_size,
        random_state=random_state,
        shuffle=shuffle,
    )

    if isinstance(train_size, float):
        test_size_after_split = test_size / (1 - train_size)
    else:
        test_size_after_split = test_size

    X_conformalize, X_test, y_conformalize, y_test = train_test_split(
        X_conformalize_test,
        y_conformalize_test,
        test_size=test_size_after_split,
        random_state=random_state,
        shuffle=shuffle,
    )

    return X_train, X_conformalize, X_test, y_train, y_conformalize, y_test

mapie.subsample.Subsample

Subsample(
    n_resamplings: int = 30,
    n_samples: Optional[Union[int, float]] = None,
    replace: bool = True,
    random_state: Optional[Union[int, RandomState]] = None,
)

Bases: BaseCrossValidator

Generate a sampling method that resamples the training set, with possible bootstraps. It can be used as the cv argument in JackknifeAfterBootstrapRegressor.

PARAMETER DESCRIPTION
n_resamplings

Number of resamplings. By default 30.

TYPE: int DEFAULT: 30

n_samples

Number of samples in each resampling. By default None, the size of the training set. If it is between 0 and 1, it is interpreted as the fraction of training samples to draw.

TYPE: Optional[Union[int, float]] DEFAULT: None

replace

Whether to replace samples in resamplings or not. By default True.

TYPE: bool DEFAULT: True

random_state

int or RandomState instance. By default None.

TYPE: Optional[Union[int, RandomState]] DEFAULT: None

Examples:

>>> import numpy as np
>>> from mapie.subsample import Subsample
>>> cv = Subsample(n_resamplings=2, random_state=0)
>>> X = np.array([1,2,3,4,5,6,7,8,9,10])
>>> for train_index, test_index in cv.split(X):
...    print(f"train index is {train_index}, test index is {test_index}")
train index is [5 0 3 3 7 9 3 5 2 4], test index is [1 6 8]
train index is [7 6 8 8 1 6 7 7 8 1], test index is [0 2 3 4 5 9]
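The mechanics can be sketched with plain NumPy: draw n indices with replacement for the train set, and take the never-drawn (out-of-bag) indices as the test set (an illustrative sketch of the resampling logic, not the library code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
indices = np.arange(n)

# Bootstrap: sample n indices with replacement for training.
train_index = rng.choice(indices, size=n, replace=True)

# Out-of-bag indices (those never drawn) form the test set.
test_index = np.setdiff1d(indices, train_index)

print(train_index, test_index)
```

Because the train indices are drawn with replacement, some indices repeat and the out-of-bag remainder is non-empty with high probability, which is what makes jackknife-after-bootstrap residuals computable.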
Source code in mapie/subsample.py
def __init__(
    self,
    n_resamplings: int = 30,
    n_samples: Optional[Union[int, float]] = None,
    replace: bool = True,
    random_state: Optional[Union[int, RandomState]] = None,
) -> None:
    self.n_resamplings = n_resamplings
    self.n_samples = n_samples
    self.replace = replace
    self.random_state = random_state

split

split(
    X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]

Generate indices to split data into training and test sets.

PARAMETER DESCRIPTION
X

Training data.

TYPE: NDArray of shape (n_samples, n_features)

YIELDS DESCRIPTION
train

The training set indices for that split.

TYPE: NDArray of shape (n_indices_training,)

test

The testing set indices for that split.

TYPE: NDArray of shape (n_indices_test,)

Source code in mapie/subsample.py
def split(
    self, X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]:
    """
    Generate indices to split data into training and test sets.

    Parameters
    ----------
    X : NDArray of shape (n_samples, n_features)
        Training data.

    Yields
    ------
    train : NDArray of shape (n_indices_training,)
        The training set indices for that split.
    test : NDArray of shape (n_indices_test,)
        The testing set indices for that split.
    """
    indices = np.arange(_num_samples(X))
    n_samples = _check_n_samples(X, self.n_samples, indices)
    random_state = check_random_state(self.random_state)
    for k in range(self.n_resamplings):
        train_index = resample(
            indices,
            replace=self.replace,
            n_samples=n_samples,
            random_state=random_state,
            stratify=None,
        )
        test_index = np.setdiff1d(indices, train_index)
        yield train_index, test_index

get_n_splits

get_n_splits(*args: Any, **kargs: Any) -> int

Returns the number of splitting iterations in the cross-validator.

RETURNS DESCRIPTION
int

Returns the number of splitting iterations in the cross-validator.

Source code in mapie/subsample.py
def get_n_splits(self, *args: Any, **kargs: Any) -> int:
    """
    Returns the number of splitting iterations in the cross-validator.

    Returns
    -------
    int
        Returns the number of splitting iterations in the cross-validator.
    """
    return self.n_resamplings

mapie.subsample.BlockBootstrap

BlockBootstrap(
    n_resamplings: int = 30,
    length: Optional[int] = None,
    n_blocks: Optional[int] = None,
    overlapping: bool = False,
    random_state: Optional[Union[int, RandomState]] = None,
)

Bases: BaseCrossValidator

Generate a sampling method that block-bootstraps the training set. It can replace KFold, LeaveOneOut, or Subsample as the cv argument in the TimeSeriesRegressor class.

PARAMETER DESCRIPTION
n_resamplings

Number of resamplings. By default 30.

TYPE: int DEFAULT: 30

length

Length of the blocks. By default None, the length of the training set divided by n_blocks.

TYPE: Optional[int] DEFAULT: None

overlapping

Whether the blocks can overlap or not. By default False.

TYPE: bool DEFAULT: False

n_blocks

Number of blocks in each resampling. By default None, the size of the training set divided by length.

TYPE: Optional[int] DEFAULT: None

random_state

int or RandomState instance. By default None.

TYPE: Optional[Union[int, RandomState]] DEFAULT: None

RAISES DESCRIPTION
ValueError

If not exactly one of length and n_blocks is specified (both None, or both given).

Examples:

>>> import numpy as np
>>> from mapie.subsample import BlockBootstrap
>>> cv = BlockBootstrap(n_resamplings=2, length=3, random_state=0)
>>> X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> for train_index, test_index in cv.split(X):
...    print(f"train index is {train_index}, test index is {test_index}")
train index is [0 1 2 3 4 5 0 1 2 3 4 5], test index is [8 9 6 7]
train index is [3 4 5 6 7 8 0 1 2 6 7 8], test index is [9]
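The non-overlapping block construction can be sketched in NumPy: cut the index array into contiguous blocks, resample whole blocks with replacement, and concatenate them (a simplified sketch of the logic, not the library code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, length = 10, 3
indices = np.arange(n)

# Drop the trailing remainder so the array splits evenly into blocks.
n_blocks = n // length                                   # 3 blocks of length 3
blocks = np.split(indices[: n_blocks * length], n_blocks)

# Resample whole blocks with replacement, then concatenate.
# Sampling n_blocks + 1 blocks mirrors the source's (n // length) + 1.
picked = rng.integers(0, n_blocks, size=n_blocks + 1)
train_index = np.concatenate([blocks[k] for k in picked])

# Test set: indices never covered by the sampled blocks.
test_index = np.setdiff1d(indices, train_index)

print(train_index, test_index)
```

Resampling contiguous blocks rather than individual points preserves short-range temporal dependence within each block, which is the point of the block bootstrap for time series.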
Source code in mapie/subsample.py
def __init__(
    self,
    n_resamplings: int = 30,
    length: Optional[int] = None,
    n_blocks: Optional[int] = None,
    overlapping: bool = False,
    random_state: Optional[Union[int, RandomState]] = None,
) -> None:
    self.n_resamplings = n_resamplings
    self.length = length
    self.n_blocks = n_blocks
    self.overlapping = overlapping
    self.random_state = random_state

split

split(
    X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]

Generate indices to split data into training and test sets.

PARAMETER DESCRIPTION
X

Training data.

TYPE: NDArray of shape (n_samples, n_features)

YIELDS DESCRIPTION
train

The training set indices for that split.

TYPE: NDArray of shape (n_indices_training,)

test

The testing set indices for that split.

TYPE: NDArray of shape (n_indices_test,)

RAISES DESCRIPTION
ValueError

If length is not positive, or is greater than the training set size.

Source code in mapie/subsample.py
def split(
    self, X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]:
    """
    Generate indices to split data into training and test sets.

    Parameters
    ----------
    X : NDArray of shape (n_samples, n_features)
        Training data.

    Yields
    ------
    train : NDArray of shape (n_indices_training,)
        The training set indices for that split.
    test : NDArray of shape (n_indices_test,)
        The testing set indices for that split.

    Raises
    ------
    ValueError
        If `length` is not positive, or is greater than the training set size.
    """
    if (self.n_blocks is not None) + (self.length is not None) != 1:
        raise ValueError(
            "Exactly one of `length` and `n_blocks` must be not None"
        )

    n = len(X)

    if self.n_blocks is not None:
        length = self.length if self.length is not None else n // self.n_blocks
        n_blocks = self.n_blocks
    else:
        length = cast(int, self.length)
        n_blocks = (n // length) + 1

    indices = np.arange(n)
    if (length <= 0) or (length > n):
        raise ValueError(
            "The length of blocks is <= 0 or greater than the length "
            "of the training set."
        )

    if self.overlapping:
        blocks = sliding_window_view(indices, window_shape=length)
    else:
        if n % length == 0:
            indices_used_for_blocks = indices
        else:
            indices_used_for_blocks = indices[: -(n % length)]
        blocks_number = n // length
        blocks = np.asarray(
            np.split(indices_used_for_blocks, indices_or_sections=blocks_number)
        )

    random_state = check_random_state(self.random_state)

    for k in range(self.n_resamplings):
        block_indices = resample(
            range(len(blocks)),
            replace=True,
            n_samples=n_blocks,
            random_state=random_state,
            stratify=None,
        )
        train_index = np.concatenate([blocks[k] for k in block_indices], axis=0)
        test_index = np.array(list(set(indices) - set(train_index)), dtype=np.int64)
        yield train_index, test_index

get_n_splits

get_n_splits(*args: Any, **kargs: Any) -> int

Returns the number of splitting iterations in the cross-validator.

RETURNS DESCRIPTION
int

Returns the number of splitting iterations in the cross-validator.

Source code in mapie/subsample.py
def get_n_splits(self, *args: Any, **kargs: Any) -> int:
    """
    Returns the number of splitting iterations in the cross-validator.

    Returns
    -------
    int
        Returns the number of splitting iterations in the cross-validator.
    """
    return self.n_resamplings