CatBoost Encoder
- class category_encoders.cat_boost.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)[source]
CatBoost Encoding for categorical features.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
Beware: the training data have to be randomly permuted, e.g.:

    # Random permutation
    perm = np.random.permutation(len(X))
    X = X.iloc[perm].reset_index(drop=True)
    y = y.iloc[perm].reset_index(drop=True)

This is necessary because some data sets are sorted by the target value, and this encoder computes the encodings on-the-fly in a single pass over the data.
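The single-pass computation can be illustrated with plain pandas. This is an illustrative sketch of the ordered target statistic with additive smoothing, not the library's internal code; the column name and data are made up, and each row is encoded using only the rows that precede it:

```python
import pandas as pd

# Toy training data (column name and categories are illustrative)
X = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue', 'red']})
y = pd.Series([1, 0, 1, 1, 0])

a = 1                # additive smoothing, the encoder's `a` parameter
prior = y.mean()     # global target mean, used as the prior

# Sum and count of the *preceding* rows with the same category:
prev_sum = y.groupby(X['color']).cumsum() - y
prev_count = y.groupby(X['color']).cumcount()

encoded = (prev_sum + prior * a) / (prev_count + a)
# The first occurrence of each category gets the prior (0.6 here);
# later occurrences blend in the targets seen so far.
```

Because each row only sees earlier rows, a data set sorted by target would leak that ordering into the encodings, which is why the random permutation above matters.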
- Parameters
- verbose: int
integer indicating verbosity of the output. 0 for none.
- cols: list
a list of columns to encode; if None, all string columns will be encoded.
- drop_invariant: bool
boolean for whether or not to drop columns with 0 variance.
- return_df: bool
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
- handle_missing: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
- handle_unknown: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
- sigma: float
adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.
- a: float
additive smoothing (it is the same variable as “m” in m-probability estimate). By default set to 1.
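To see how `a` behaves, consider a category with one previous observation whose target was 1, against a prior of 0.5 (the numbers are illustrative; this uses the m-probability-estimate form mentioned above):

```python
prior = 0.5       # global target mean
prev_sum = 1.0    # sum of targets seen so far for this category
prev_count = 1    # number of rows seen so far for this category

for a in (1, 10, 100):
    estimate = (prev_sum + prior * a) / (prev_count + a)
    print(f"a={a}: {estimate:.4f}")
# Larger `a` shrinks the estimate toward the prior of 0.5
```

With `a=1` the estimate is 0.75; with `a=100` it is already close to the prior, so large values of `a` damp the influence of rarely seen categories.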
References
- [1] Transforming categorical features to numerical features, from https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
- [2] CatBoost: unbiased boosting with categorical features, from https://arxiv.org/abs/1706.09516
Methods
- fit(X, y, **kwargs): Fit encoder according to X and y.
- fit_transform(X[, y]): Fit the encoder and transform the training data. Encoders that utilize the target must make sure that the training data are transformed with transform(X, y), not transform(X).
- get_feature_names(): Returns the names of all transformed / added columns.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- transform(X[, y, override_return_df]): Perform the transformation to new categorical data.
- fit(X, y, **kwargs)[source]
Fit encoder according to X and y.
- Parameters
- X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : array-like, shape = [n_samples]
Target values.
- Returns
- self : encoder
Returns self.
- get_feature_names()[source]
Returns the names of all transformed / added columns.
- Returns
- feature_names: list
A list with all feature names transformed or added. Note: potentially dropped features are not included!
- transform(X, y=None, override_return_df=False)[source]
Perform the transformation to new categorical data.
- Parameters
- X : array-like, shape = [n_samples, n_features]
- y : array-like, shape = [n_samples], required when transforming training data leave-one-out style;
None when transforming without target information (such as the test set)
- Returns
- p : array, shape = [n_samples, n_numeric + N]
Transformed values with encoding applied.