Category Encoders
A set of scikit-learn-style transformers for encoding categorical variables as numeric features using different techniques. While the ordinal, one-hot, and hashing encoders have equivalents in scikit-learn itself, the transformers in this library all share a few useful properties:
First-class support for pandas dataframes as an input (and optionally as output)
Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type
Can optionally drop any columns with very low variance, based on the training set
Portability: train a transformer on data, pickle it, reuse it later and get the same thing out.
Full compatibility with sklearn pipelines: accepts an array-like dataset as input, like any other transformer (*)
(*) For full compatibility with Pipelines and ColumnTransformers, and consistent behaviour of get_feature_names_out, it’s recommended to upgrade sklearn to version 1.2.0 or later and to set the output to pandas:
import sklearn
sklearn.set_config(transform_output="pandas")
Usage
Install with:
pip install category_encoders
or
conda install -c conda-forge category_encoders
To use:
import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=[...])
encoder = ce.BaseNEncoder(cols=[...])
encoder = ce.BinaryEncoder(cols=[...])
encoder = ce.CatBoostEncoder(cols=[...])
encoder = ce.CountEncoder(cols=[...])
encoder = ce.GLMMEncoder(cols=[...])
encoder = ce.GrayEncoder(cols=[...])
encoder = ce.HashingEncoder(cols=[...])
encoder = ce.HelmertEncoder(cols=[...])
encoder = ce.JamesSteinEncoder(cols=[...])
encoder = ce.LeaveOneOutEncoder(cols=[...])
encoder = ce.MEstimateEncoder(cols=[...])
encoder = ce.OneHotEncoder(cols=[...])
encoder = ce.OrdinalEncoder(cols=[...])
encoder = ce.PolynomialEncoder(cols=[...])
encoder = ce.QuantileEncoder(cols=[...])
encoder = ce.RankHotEncoder(cols=[...])
encoder = ce.SumEncoder(cols=[...])
encoder = ce.TargetEncoder(cols=[...])
encoder = ce.WOEEncoder(cols=[...])
encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
All of these are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. If the cols parameter isn’t passed, every non-numeric column will be converted. See below for detailed documentation.
Known issues:
CategoryEncoders works internally with pandas DataFrames, as opposed to sklearn, which works with numpy arrays. This can cause problems in sklearn versions prior to 1.2.0. To ensure full compatibility, configure sklearn to output DataFrames as well. This can be done with
sklearn.set_config(transform_output="pandas")
for a whole project or just for a single pipeline using
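from sklearn.pipeline import Pipeline

# SomePreprocessor and SomeEncoder are placeholders for your own steps.
Pipeline(
    steps=[
        # set_output takes the keyword argument transform
        ("preprocessor", SomePreprocessor().set_output(transform="pandas")),
        ("encoder", SomeEncoder()),
    ]
)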
If you experience another bug, feel free to report it on [GitHub](https://github.com/scikit-learn-contrib/category_encoders/issues).
Contents:
- Backward Difference Coding
BackwardDifferenceEncoder
BackwardDifferenceEncoder.fit()
BackwardDifferenceEncoder.fit_contrast_coding()
BackwardDifferenceEncoder.fit_transform()
BackwardDifferenceEncoder.get_contrast_matrix()
BackwardDifferenceEncoder.get_feature_names()
BackwardDifferenceEncoder.get_feature_names_in()
BackwardDifferenceEncoder.get_feature_names_out()
BackwardDifferenceEncoder.get_metadata_routing()
BackwardDifferenceEncoder.get_params()
BackwardDifferenceEncoder.set_output()
BackwardDifferenceEncoder.set_params()
BackwardDifferenceEncoder.set_transform_request()
BackwardDifferenceEncoder.transform()
BackwardDifferenceEncoder.transform_contrast_coding()
- BaseN
BaseNEncoder
BaseNEncoder.basen_encode()
BaseNEncoder.basen_to_integer()
BaseNEncoder.calc_required_digits()
BaseNEncoder.col_transform()
BaseNEncoder.fit()
BaseNEncoder.fit_base_n_encoding()
BaseNEncoder.fit_transform()
BaseNEncoder.get_feature_names()
BaseNEncoder.get_feature_names_in()
BaseNEncoder.get_feature_names_out()
BaseNEncoder.get_metadata_routing()
BaseNEncoder.get_params()
BaseNEncoder.inverse_transform()
BaseNEncoder.number_to_base()
BaseNEncoder.set_inverse_transform_request()
BaseNEncoder.set_output()
BaseNEncoder.set_params()
BaseNEncoder.set_transform_request()
BaseNEncoder.transform()
- Binary
BinaryEncoder
BinaryEncoder.basen_encode()
BinaryEncoder.basen_to_integer()
BinaryEncoder.calc_required_digits()
BinaryEncoder.col_transform()
BinaryEncoder.fit()
BinaryEncoder.fit_base_n_encoding()
BinaryEncoder.fit_transform()
BinaryEncoder.get_feature_names()
BinaryEncoder.get_feature_names_in()
BinaryEncoder.get_feature_names_out()
BinaryEncoder.get_metadata_routing()
BinaryEncoder.get_params()
BinaryEncoder.inverse_transform()
BinaryEncoder.number_to_base()
BinaryEncoder.set_inverse_transform_request()
BinaryEncoder.set_output()
BinaryEncoder.set_params()
BinaryEncoder.set_transform_request()
BinaryEncoder.transform()
- CatBoost Encoder
CatBoostEncoder
CatBoostEncoder.fit()
CatBoostEncoder.fit_transform()
CatBoostEncoder.get_feature_names()
CatBoostEncoder.get_feature_names_in()
CatBoostEncoder.get_feature_names_out()
CatBoostEncoder.get_metadata_routing()
CatBoostEncoder.get_params()
CatBoostEncoder.set_output()
CatBoostEncoder.set_params()
CatBoostEncoder.set_transform_request()
CatBoostEncoder.transform()
- Count Encoder
CountEncoder
CountEncoder.combine_min_categories()
CountEncoder.fit()
CountEncoder.fit_transform()
CountEncoder.get_feature_names()
CountEncoder.get_feature_names_in()
CountEncoder.get_feature_names_out()
CountEncoder.get_metadata_routing()
CountEncoder.get_params()
CountEncoder.set_output()
CountEncoder.set_params()
CountEncoder.set_transform_request()
CountEncoder.transform()
- Generalized Linear Mixed Model Encoder
GLMMEncoder
GLMMEncoder.fit()
GLMMEncoder.fit_transform()
GLMMEncoder.get_feature_names()
GLMMEncoder.get_feature_names_in()
GLMMEncoder.get_feature_names_out()
GLMMEncoder.get_metadata_routing()
GLMMEncoder.get_params()
GLMMEncoder.set_output()
GLMMEncoder.set_params()
GLMMEncoder.set_transform_request()
GLMMEncoder.transform()
- Gray
GrayEncoder
GrayEncoder.basen_encode()
GrayEncoder.basen_to_integer()
GrayEncoder.calc_required_digits()
GrayEncoder.col_transform()
GrayEncoder.fit()
GrayEncoder.fit_base_n_encoding()
GrayEncoder.fit_transform()
GrayEncoder.get_feature_names()
GrayEncoder.get_feature_names_in()
GrayEncoder.get_feature_names_out()
GrayEncoder.get_metadata_routing()
GrayEncoder.get_params()
GrayEncoder.gray_code()
GrayEncoder.inverse_transform()
GrayEncoder.number_to_base()
GrayEncoder.set_inverse_transform_request()
GrayEncoder.set_output()
GrayEncoder.set_params()
GrayEncoder.set_transform_request()
GrayEncoder.transform()
- Hashing
HashingEncoder
HashingEncoder.fit()
HashingEncoder.fit_transform()
HashingEncoder.get_feature_names()
HashingEncoder.get_feature_names_in()
HashingEncoder.get_feature_names_out()
HashingEncoder.get_metadata_routing()
HashingEncoder.get_params()
HashingEncoder.hash_chunk()
HashingEncoder.hashing_trick()
HashingEncoder.hashing_trick_with_np_no_parallel()
HashingEncoder.hashing_trick_with_np_parallel()
HashingEncoder.set_output()
HashingEncoder.set_params()
HashingEncoder.set_transform_request()
HashingEncoder.transform()
- Helmert Coding
HelmertEncoder
HelmertEncoder.fit()
HelmertEncoder.fit_contrast_coding()
HelmertEncoder.fit_transform()
HelmertEncoder.get_contrast_matrix()
HelmertEncoder.get_feature_names()
HelmertEncoder.get_feature_names_in()
HelmertEncoder.get_feature_names_out()
HelmertEncoder.get_metadata_routing()
HelmertEncoder.get_params()
HelmertEncoder.set_output()
HelmertEncoder.set_params()
HelmertEncoder.set_transform_request()
HelmertEncoder.transform()
HelmertEncoder.transform_contrast_coding()
- James-Stein Encoder
JamesSteinEncoder
JamesSteinEncoder.fit()
JamesSteinEncoder.fit_transform()
JamesSteinEncoder.get_feature_names()
JamesSteinEncoder.get_feature_names_in()
JamesSteinEncoder.get_feature_names_out()
JamesSteinEncoder.get_metadata_routing()
JamesSteinEncoder.get_params()
JamesSteinEncoder.set_output()
JamesSteinEncoder.set_params()
JamesSteinEncoder.set_transform_request()
JamesSteinEncoder.transform()
- Leave One Out
LeaveOneOutEncoder
LeaveOneOutEncoder.fit()
LeaveOneOutEncoder.fit_leave_one_out()
LeaveOneOutEncoder.fit_transform()
LeaveOneOutEncoder.get_feature_names()
LeaveOneOutEncoder.get_feature_names_in()
LeaveOneOutEncoder.get_feature_names_out()
LeaveOneOutEncoder.get_metadata_routing()
LeaveOneOutEncoder.get_params()
LeaveOneOutEncoder.set_output()
LeaveOneOutEncoder.set_params()
LeaveOneOutEncoder.set_transform_request()
LeaveOneOutEncoder.transform()
LeaveOneOutEncoder.transform_leave_one_out()
- M-estimate
MEstimateEncoder
MEstimateEncoder.fit()
MEstimateEncoder.fit_transform()
MEstimateEncoder.get_feature_names()
MEstimateEncoder.get_feature_names_in()
MEstimateEncoder.get_feature_names_out()
MEstimateEncoder.get_metadata_routing()
MEstimateEncoder.get_params()
MEstimateEncoder.set_output()
MEstimateEncoder.set_params()
MEstimateEncoder.set_transform_request()
MEstimateEncoder.transform()
- One Hot
OneHotEncoder
OneHotEncoder.category_mapping
OneHotEncoder.fit()
OneHotEncoder.fit_transform()
OneHotEncoder.generate_mapping()
OneHotEncoder.get_dummies()
OneHotEncoder.get_feature_names()
OneHotEncoder.get_feature_names_in()
OneHotEncoder.get_feature_names_out()
OneHotEncoder.get_metadata_routing()
OneHotEncoder.get_params()
OneHotEncoder.inverse_transform()
OneHotEncoder.reverse_dummies()
OneHotEncoder.set_inverse_transform_request()
OneHotEncoder.set_output()
OneHotEncoder.set_params()
OneHotEncoder.set_transform_request()
OneHotEncoder.transform()
- Ordinal
OrdinalEncoder
OrdinalEncoder.category_mapping
OrdinalEncoder.fit()
OrdinalEncoder.fit_transform()
OrdinalEncoder.get_feature_names()
OrdinalEncoder.get_feature_names_in()
OrdinalEncoder.get_feature_names_out()
OrdinalEncoder.get_metadata_routing()
OrdinalEncoder.get_params()
OrdinalEncoder.inverse_transform()
OrdinalEncoder.ordinal_encoding()
OrdinalEncoder.set_inverse_transform_request()
OrdinalEncoder.set_output()
OrdinalEncoder.set_params()
OrdinalEncoder.set_transform_request()
OrdinalEncoder.transform()
- Polynomial Coding
PolynomialEncoder
PolynomialEncoder.fit()
PolynomialEncoder.fit_contrast_coding()
PolynomialEncoder.fit_transform()
PolynomialEncoder.get_contrast_matrix()
PolynomialEncoder.get_feature_names()
PolynomialEncoder.get_feature_names_in()
PolynomialEncoder.get_feature_names_out()
PolynomialEncoder.get_metadata_routing()
PolynomialEncoder.get_params()
PolynomialEncoder.set_output()
PolynomialEncoder.set_params()
PolynomialEncoder.set_transform_request()
PolynomialEncoder.transform()
PolynomialEncoder.transform_contrast_coding()
- Quantile Encoder
QuantileEncoder
QuantileEncoder.fit()
QuantileEncoder.fit_quantile_encoding()
QuantileEncoder.fit_transform()
QuantileEncoder.get_feature_names()
QuantileEncoder.get_feature_names_in()
QuantileEncoder.get_feature_names_out()
QuantileEncoder.get_metadata_routing()
QuantileEncoder.get_params()
QuantileEncoder.quantile_encode()
QuantileEncoder.set_output()
QuantileEncoder.set_params()
QuantileEncoder.set_transform_request()
QuantileEncoder.transform()
- RankHotEncoder
RankHotEncoder
RankHotEncoder.fit()
RankHotEncoder.fit_transform()
RankHotEncoder.generate_mapping()
RankHotEncoder.get_feature_names()
RankHotEncoder.get_feature_names_in()
RankHotEncoder.get_feature_names_out()
RankHotEncoder.get_metadata_routing()
RankHotEncoder.get_params()
RankHotEncoder.inverse_transform()
RankHotEncoder.set_inverse_transform_request()
RankHotEncoder.set_output()
RankHotEncoder.set_params()
RankHotEncoder.set_transform_request()
RankHotEncoder.transform()
- Sum Coding
SumEncoder
SumEncoder.fit()
SumEncoder.fit_contrast_coding()
SumEncoder.fit_transform()
SumEncoder.get_contrast_matrix()
SumEncoder.get_feature_names()
SumEncoder.get_feature_names_in()
SumEncoder.get_feature_names_out()
SumEncoder.get_metadata_routing()
SumEncoder.get_params()
SumEncoder.set_output()
SumEncoder.set_params()
SumEncoder.set_transform_request()
SumEncoder.transform()
SumEncoder.transform_contrast_coding()
- Summary Encoder
SummaryEncoder
SummaryEncoder.fit()
SummaryEncoder.fit_transform()
SummaryEncoder.get_feature_names()
SummaryEncoder.get_feature_names_in()
SummaryEncoder.get_feature_names_out()
SummaryEncoder.get_metadata_routing()
SummaryEncoder.get_params()
SummaryEncoder.set_params()
SummaryEncoder.set_transform_request()
SummaryEncoder.transform()
- Target Encoder
TargetEncoder
TargetEncoder.fit()
TargetEncoder.fit_target_encoding()
TargetEncoder.fit_transform()
TargetEncoder.get_feature_names()
TargetEncoder.get_feature_names_in()
TargetEncoder.get_feature_names_out()
TargetEncoder.get_metadata_routing()
TargetEncoder.get_params()
TargetEncoder.set_output()
TargetEncoder.set_params()
TargetEncoder.set_transform_request()
TargetEncoder.target_encode()
TargetEncoder.transform()
- Weight of Evidence
WOEEncoder
WOEEncoder.fit()
WOEEncoder.fit_transform()
WOEEncoder.get_feature_names()
WOEEncoder.get_feature_names_in()
WOEEncoder.get_feature_names_out()
WOEEncoder.get_metadata_routing()
WOEEncoder.get_params()
WOEEncoder.set_output()
WOEEncoder.set_params()
WOEEncoder.set_transform_request()
WOEEncoder.transform()
- Wrappers