imblearn.ensemble.BalanceCascade

class imblearn.ensemble.BalanceCascade(ratio='auto', return_indices=False, random_state=None, n_max_subset=None, classifier=None, estimator=None, **kwargs)[source]

Create an ensemble of balanced sets by iteratively under-sampling the imbalanced dataset using an estimator.

This method iteratively selects subsets and builds an ensemble from the different sets. The selection is performed using a specific classifier.

Read more in the User Guide.

Parameters:

ratio : str, dict, or callable, optional (default='auto')

Ratio to use for resampling the data set.

  • If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class; (iii) 'not minority': resample all classes except the minority class; (iv) 'all': resample all classes; and (v) 'auto': corresponds to 'all' for over-sampling methods and 'not minority' for under-sampling methods. The targeted classes will be over-sampled or under-sampled to reach the same number of samples as the majority or minority class.
  • If dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples.
  • If callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples.
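As a minimal sketch of the dict and callable forms, assuming a binary target in which class 1 is the majority: the helper name `halve_majority` is hypothetical, and the code only builds the mapping that would be passed as `ratio`. It does not call the sampler itself.

```python
from collections import Counter

# Toy labels: 900 majority samples (class 1) and 100 minority samples (class 0).
y = [1] * 900 + [0] * 100

# Dict form: keys are the targeted classes, values the desired sample counts.
ratio_dict = {1: 100}


def halve_majority(y):
    """Hypothetical callable: target the largest class and keep half of it."""
    counts = Counter(y)
    majority_class, majority_count = counts.most_common(1)[0]
    return {majority_class: majority_count // 2}


ratio_from_callable = halve_majority(y)
print(ratio_from_callable)  # {1: 450}
```

Either object could then be supplied as the `ratio` argument when constructing the sampler.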

return_indices : bool, optional (default=False)

Whether or not to return the indices of the samples randomly selected from the majority class.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

n_max_subset : int or None, optional (default=None)

Maximum number of subsets to generate. By default, subsets are generated until all the training data has been selected, which can lead to a large number of subsets. A suitable value can probably be determined empirically.

classifier : str, optional (default=None)

The classifier used to compare the predictions with the true labels. The choices are the following: 'knn', 'decision-tree', 'random-forest', 'adaboost', 'gradient-boosting', and 'linear-svm'.

Deprecated since version 0.2: classifier is deprecated from 0.2 and will be replaced in 0.4. Use estimator instead.

estimator : object, optional (default=KNeighborsClassifier())

An estimator inheriting from sklearn.base.ClassifierMixin and providing a predict_proba method.

bootstrap : bool, optional (default=True)

Whether to bootstrap the data before each iteration.

**kwargs : keywords

The parameters associated with the classifier provided.

Deprecated since version 0.2: **kwargs has been deprecated from 0.2 and will be replaced in 0.4. Use the estimator object instead to pass the parameters associated with an estimator.

Notes

The method is described in [R5757].

Supports multi-class resampling. A one-vs.-rest scheme is used, as originally proposed in [R5757].

See Balance cascade.
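The general idea of building balanced subsets iteratively can be sketched as follows. This is a simplified illustration only, not the library's implementation and not the exact procedure of [R5757]: it assumes one-dimensional features, a toy nearest-centroid classifier, and a rule that drops from the pool every majority sample the current classifier already labels correctly. All function names here are hypothetical.

```python
import random
from collections import Counter


def nearest_centroid_fit(X, y):
    """Return the mean feature value of each class (toy 1-D classifier)."""
    sums, counts = Counter(), Counter()
    for value, label in zip(X, y):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}


def nearest_centroid_predict(centroids, value):
    """Predict the class whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def cascade_subsets(X, y, minority, majority, n_max_subset=10, seed=0):
    """Build up to n_max_subset balanced subsets (simplified sketch)."""
    rng = random.Random(seed)
    minority_X = [v for v, label in zip(X, y) if label == minority]
    pool = [v for v, label in zip(X, y) if label == majority]
    subsets = []
    while len(pool) >= len(minority_X) and len(subsets) < n_max_subset:
        # Draw as many majority samples as there are minority samples.
        drawn = rng.sample(pool, len(minority_X))
        X_sub = minority_X + drawn
        y_sub = [minority] * len(minority_X) + [majority] * len(drawn)
        subsets.append((X_sub, y_sub))
        # Fit on the balanced subset, then keep in the pool only the
        # majority samples that the classifier still gets wrong.
        centroids = nearest_centroid_fit(X_sub, y_sub)
        pool = [v for v in pool
                if nearest_centroid_predict(centroids, v) != majority]
    return subsets


# Toy data: 20 minority samples (class 0) and 180 majority samples (class 1).
data_rng = random.Random(10)
X = [data_rng.gauss(0, 1) for _ in range(20)]
X += [data_rng.gauss(1, 1) for _ in range(180)]
y = [0] * 20 + [1] * 180

subsets = cascade_subsets(X, y, minority=0, majority=1, n_max_subset=5)
for X_sub, y_sub in subsets:
    print(Counter(y_sub))
```

The cap on the number of subsets plays the role that n_max_subset plays above: without it, the loop in this sketch would only stop once the pool holds fewer majority samples than there are minority samples.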

References

[R5757] X. Y. Liu, J. Wu and Z. H. Zhou, “Exploratory Undersampling for Class-Imbalance Learning,” in IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550, April 2009.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.ensemble import BalanceCascade 
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape {}'.format(Counter(y)))
Original dataset shape Counter({1: 900, 0: 100})
>>> bc = BalanceCascade(random_state=42)
>>> X_res, y_res = bc.fit_sample(X, y)
>>> print('Resampled dataset shape {}'.format(Counter(y_res[0])))     
Resampled dataset shape Counter({...})

__init__(ratio='auto', return_indices=False, random_state=None, n_max_subset=None, classifier=None, estimator=None, **kwargs)[source]

fit(X, y)[source]

Compute the class statistics before performing the sampling.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:

self : object

Return self.

fit_sample(X, y)[source]

Fit the statistics and resample the data directly.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:

X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like, shape (n_samples_new,)

The corresponding labels of X_resampled.

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

sample(X, y)[source]

Resample the dataset.

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Matrix containing the data to be sampled.

y : array-like, shape (n_samples,)

Corresponding label for each sample in X.

Returns:

X_resampled : {ndarray, sparse matrix}, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : ndarray, shape (n_samples_new,)

The corresponding labels of X_resampled.

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
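The `<component>__<parameter>` naming convention can be illustrated with a small sketch. This is not scikit-learn's implementation; the helper `split_nested_params` is hypothetical, and the key `estimator__n_neighbors` assumes the estimator is a KNeighborsClassifier.

```python
def split_nested_params(params):
    """Group `<component>__<parameter>` keys by component name.

    Keys without a double underscore are grouped under None, meaning
    they belong to the outer object itself.
    """
    grouped = {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # (e.g. "a__b__c") is passed down to component "a" as "b__c".
        component, sep, sub_key = key.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_key] = value
        else:
            grouped.setdefault(None, {})[key] = value
    return grouped


grouped = split_nested_params({"n_max_subset": 5, "estimator__n_neighbors": 3})
print(grouped)  # {None: {'n_max_subset': 5}, 'estimator': {'n_neighbors': 3}}
```

Under this convention, a call such as `set_params(estimator__n_neighbors=3)` would be expected to update the nested estimator rather than the sampler itself.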

Returns:

self
