Wrappers

class category_encoders.wrapper.PolynomialWrapper(feature_encoder)[source]

Extend supervised encoders to n-class labels, where n >= 2.

The label can be numerical (e.g.: 0, 1, 2, 3,…,n), string or categorical (pandas.Categorical). The label is first encoded into n-1 binary columns. Subsequently, the inner supervised encoder is executed for each binarized label.

The names of the encoded features are suffixed with underscore and the corresponding class name (edge scenarios like ‘dog’+’cat_frog’ vs. ‘dog_cat’+’frog’ are not currently handled).

The implementation is experimental and the API may change in the future. The order of the returned features may change in the future.

Parameters
feature_encoder: Object

an instance of a supervised encoder.

Methods

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit

transform

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
X: {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

y: ndarray of shape (n_samples,), default=None

Target values.

**fit_params: dict

Additional fit parameters.

Returns
X_new: ndarray of shape (n_samples, n_features_new)

Transformed array.

class category_encoders.wrapper.NestedCVWrapper(feature_encoder, cv=5, shuffle=True, random_state=None)[source]

Extends supervised encoders with nested cross-validation on the training data to minimise overfitting.

For a validation or a test set, supervised encoders can be used as follows:

X_train_encoded = encoder.fit_transform(X_train, y_train)
X_valid_encoded = encoder.transform(X_valid)

However, the downstream model will overfit to the encoded training data due to target leakage. Using out-of-fold encodings is an effective way to prevent target leakage. This is equivalent to:

X_train_encoded = np.zeros(X.shape)
for trn, val in kfold.split(X, y):
    encoder.fit(X[trn], y[trn])
    X_train_encoded[val] = encoder.transform(X[val])

This can be used in place of the “inner folds” as discussed here:

https://sebastianraschka.com/faq/docs/evaluate-a-model.html

See README.md for a list of supervised encoders.

Discussion: Although the leave-one-out encoder internally performs leave-one-out cross-validation, it is actually the supervised encoder most prone to overfitting in our library. To illustrate the issue, imagine we have a totally unpredictive nominal feature and a perfectly balanced binary label. A supervised encoder should encode the feature into a constant vector, as the feature is unpredictive of the label. But with leave-one-out cross-validation, the label ratio ceases to be perfectly balanced, and the wrong class label always becomes the majority in the training fold. The leave-one-out encoder therefore returns a seemingly predictive feature, and the downstream model overfits to it. Unfortunately, even 10-fold cross-validation is not immune to this effect.

To decrease this effect, it is recommended to use a low number of folds, and that is why this wrapper uses 5 folds by default.

Based on the empirical results, only LeaveOneOutEncoder benefits greatly from this wrapper. The remaining encoders can be used without this wrapper.

Parameters
feature_encoder: Object

an instance of a supervised encoder.

cv: int or sklearn cv Object

if an int is given, StratifiedKFold is used by default, where the int is the number of folds.

shuffle: boolean, optional

whether to shuffle each class's samples before splitting into batches. Ignored if a CV method is provided.

random_state: int, RandomState instance or None, optional, default=None

if int, random_state is the seed used by the random number generator. Ignored if a CV method is provided.

Methods

fit(X, y, **kwargs)

Calls fit on the base feature_encoder without nested cross-validation.

fit_transform(X[, y, X_test, groups])

Creates unbiased encodings from a supervised encoder and also infers encodings on an optional test set.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Calls transform on the base feature_encoder without nested cross-validation.

fit(X, y, **kwargs)[source]

Calls fit on the base feature_encoder without nested cross-validation.

fit_transform(X, y=None, X_test=None, groups=None, **fit_params)[source]

Creates unbiased encodings from a supervised encoder and also infers encodings on an optional test set.

Parameters
X: array-like, shape = [n_samples, n_features]

Training vectors for the supervised encoder, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape = [n_samples]

Target values for the supervised encoder.

X_test: array-like, shape = [m_samples, n_features], or a tuple of array-likes (X_test, X_valid, …), optional

Vectors to be used for inference by an encoder (e.g. test or validation sets) trained on the full X and y sets. No nested folds are used here.

groups: array-like, optional

Groups to be passed to the cv method, e.g. for GroupKFold.

**fit_params: dict

Additional fit parameters.

Returns

array, shape = [n_samples, n_numeric + N]

Transformed values with encoding applied. Returns multiple arrays if X_test is not None.

transform(X)[source]

Calls transform on the base feature_encoder without nested cross-validation.