CatBoost Encoder¶

class category_encoders.cat_boost.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)[source]¶

CatBoost coding for categorical features.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is very similar to leave-one-out encoding, but calculates the values "on-the-fly". Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
Beware, the training data have to be randomly permuted. E.g.:

    # Random permutation
    perm = np.random.permutation(len(X))
    X = X.iloc[perm].reset_index(drop=True)
    y = y.iloc[perm].reset_index(drop=True)
This is necessary because some data sets are sorted based on the target value and this coder encodes the features on-the-fly in a single pass.
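The single-pass computation can be sketched in plain pandas (a simplified illustration, not the library's implementation; `catboost_encode_train` is a hypothetical helper): each row is encoded using only the target statistics of the rows that precede it, smoothed toward the global target mean by the additive-smoothing parameter `a`.

```python
import pandas as pd

def catboost_encode_train(x: pd.Series, y: pd.Series, a: float = 1.0) -> pd.Series:
    """Encode one categorical column on-the-fly in a single pass (sketch)."""
    prior = y.mean()   # global target mean
    sums = {}          # running sum of y per category (rows seen so far)
    counts = {}        # running count per category (rows seen so far)
    out = []
    for cat, target in zip(x, y):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Only previously seen rows contribute; 'a' smooths toward the prior.
        out.append((s + a * prior) / (n + a))
        sums[cat] = s + target
        counts[cat] = n + 1
    return pd.Series(out, index=x.index)

x = pd.Series(['red', 'blue', 'red', 'red'])
y = pd.Series([1, 0, 1, 0])
encoded = catboost_encode_train(x, y)
print(encoded.tolist())  # [0.5, 0.5, 0.75, 0.8333...]
```

Because each row only sees earlier rows, a target-sorted data set would produce systematically biased encodings, which is why the random permutation above is required.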
 Parameters
 verbose: int
integer indicating verbosity of the output. 0 for none.
 cols: list
a list of columns to encode; if None, all string columns will be encoded.
 drop_invariant: bool
boolean for whether or not to drop columns with 0 variance.
 return_df: bool
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
 handle_missing: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
 handle_unknown: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
 sigma: float
adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.
 a: float
additive smoothing (it is the same variable as "m" in m-probability estimate). By default set to 1.
References

[1] Transforming categorical features to numerical features, from https://tech.yandex.com/catboost/doc/dg/concepts/algorithmmainstages_cattonumbericdocpage/

[2] CatBoost: unbiased boosting with categorical features, from https://arxiv.org/abs/1706.09516
Methods

fit(X, y, **kwargs)
    Fit encoder according to X and y.
fit_transform(X[, y])
    Encoders that utilize the target must make sure that the training data are transformed with transform(X, y) and not with transform(X).
get_feature_names()
    Returns the names of all transformed / added columns.
get_params([deep])
    Get parameters for this estimator.
set_params(**params)
    Set the parameters of this estimator.
transform(X[, y, override_return_df])
    Perform the transformation to new categorical data.

fit(X, y, **kwargs)[source]¶

Fit encoder according to X and y.
 Parameters
 X: array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
 y: array-like, shape = [n_samples]
Target values.
 Returns
 self: encoder
Returns self.

get_feature_names()[source]¶

Returns the names of all transformed / added columns.
 Returns
 feature_names: list
A list with all feature names transformed or added. Note: potentially dropped features are not included!

transform(X, y=None, override_return_df=False)[source]¶

Perform the transformation to new categorical data.
 Parameters
 X: array-like, shape = [n_samples, n_features]
 y: array-like, shape = [n_samples], when transforming by leave-one-out
None, when transforming without target information (such as a test set)
 Returns
 p: array, shape = [n_samples, n_numeric + N]
Transformed values with encoding applied.
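At transform time there are no on-the-fly updates: each category is mapped using the full training statistics, and under the defaults handle_unknown='value' and handle_missing='value' an unseen category falls back to the target mean. A minimal sketch of this behavior, assuming the same additive smoothing as during training (`catboost_encode_test` is a hypothetical helper, not the library's API):

```python
import pandas as pd

def catboost_encode_test(x_new: pd.Series, x_train: pd.Series,
                         y_train: pd.Series, a: float = 1.0) -> pd.Series:
    """Encode new data using the full training statistics (sketch)."""
    prior = y_train.mean()
    # Per-category sum and count of the target over the whole training set.
    stats = y_train.groupby(x_train).agg(['sum', 'count'])

    def enc(cat):
        if cat in stats.index:
            s = stats.loc[cat, 'sum']
            n = stats.loc[cat, 'count']
            return (s + a * prior) / (n + a)
        return prior  # unseen category -> target mean (handle_unknown='value')

    return x_new.map(enc)

x_train = pd.Series(['red', 'blue', 'red'])
y_train = pd.Series([1, 0, 1])
x_test = pd.Series(['red', 'green'])
encoded_test = catboost_encode_test(x_test, x_train, y_train)
print(encoded_test.tolist())  # 'green' is unseen and maps to the prior
```

Because test rows are encoded from fixed, fully aggregated statistics, the same category always maps to the same value at transform time, unlike during the single-pass training encoding.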