CatBoost Encoder

class category_encoders.cat_boost.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)[source]

CatBoost coding for categorical features.

This is very similar to leave-one-out encoding, but the values are calculated on the fly, each row being encoded using only the rows that precede it. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
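
The idea can be sketched as follows (a simplified illustration of the ordered target statistic, not the library's exact implementation; the function name is made up). Each row is encoded from the target statistics of the rows that precede it, smoothed towards the global target mean by the smoothing parameter a:

    import pandas as pd

    def ordered_target_encode(categories, targets, a=1.0):
        # Simplified CatBoost-style on-the-fly encoding: each row sees only
        # the rows that came before it, smoothed towards the global prior.
        prior = pd.Series(targets).mean()
        sums, counts = {}, {}
        encoded = []
        for cat, t in zip(categories, targets):
            s, n = sums.get(cat, 0.0), counts.get(cat, 0)
            encoded.append((s + prior * a) / (n + a))  # statistics of previous rows only
            sums[cat] = s + t                          # update running statistics afterwards
            counts[cat] = n + 1
        return pd.Series(encoded)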

Beware: the training data have to be randomly permuted, e.g.:

    import numpy as np

    # Random permutation
    perm = np.random.permutation(len(X))
    X = X.iloc[perm].reset_index(drop=True)
    y = y.iloc[perm].reset_index(drop=True)

This is necessary because some data sets are sorted by the target value, and this encoder processes the features on the fly in a single pass.

Parameters:
verbose: int

integer indicating verbosity of the output. 0 for none.

cols: list

a list of columns to encode; if None, all string columns will be encoded.

drop_invariant: bool

boolean for whether or not to drop columns with 0 variance.

return_df: bool

boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

handle_missing: str

options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

handle_unknown: str

options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

sigma: float

adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.

a: float

additive smoothing (the same variable as "m" in the m-probability estimate). Set to 1 by default.
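
For example (a minimal usage sketch; the data and column name are illustrative):

    import pandas as pd
    from category_encoders import CatBoostEncoder

    X = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'red']})
    y = pd.Series([1, 0, 1, 0, 1, 1])

    enc = CatBoostEncoder(cols=['color'], a=1, sigma=0.05)
    X_train = enc.fit_transform(X, y)   # training data: encoded on the fly using y (Gaussian noise added, sigma=0.05)
    X_new = enc.transform(X)            # unseen data: encoded from the fitted statistics, no target needed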

References

[1] Transforming categorical features to numerical features, from https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/

[2] CatBoost: unbiased boosting with categorical features, from https://arxiv.org/abs/1706.09516

Methods

fit(self, X, y, **kwargs)    Fit encoder according to X and y.
fit_transform(self, X[, y])    Fit the encoder, then transform the training data using the target.
get_feature_names(self)    Returns the names of all transformed / added columns.
get_params(self[, deep])    Get parameters for this estimator.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X[, y, override_return_df])    Perform the transformation to new categorical data.
fit(self, X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters:
X : array-like, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values.

Returns:
self : encoder

Returns self.

fit_transform(self, X, y=None, **fit_params)[source]

Encoders that utilize the target must make sure that the training data are transformed with transform(X, y) and not with transform(X).
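
For example (a brief sketch; the data and column name are illustrative):

    import pandas as pd
    from category_encoders import CatBoostEncoder

    X_train = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})
    y_train = pd.Series([1, 0, 1, 0])

    enc = CatBoostEncoder(cols=['color']).fit(X_train, y_train)
    train_encoded = enc.transform(X_train, y_train)  # target-aware, matches fit_transform(X_train, y_train)
    leaky_encoded = enc.transform(X_train)           # uses the full fitted statistics; not intended for training data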
get_feature_names(self)[source]

Returns the names of all transformed / added columns.

Returns:
feature_names: list

A list with all feature names transformed or added. Note: potentially dropped features are not included!

transform(self, X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters:
X : array-like, shape = [n_samples, n_features]
y : array-like, shape = [n_samples]

Target values; required when transforming training data in a leave-one-out fashion. None when transforming data without target information (such as a test set).

Returns:
p : array, shape = [n_samples, n_numeric + N]

Transformed values with encoding applied.