Category Encoders
A set of scikit-learn-style transformers for encoding categorical variables as numeric features using a range of techniques. While the ordinal, one-hot, and hashing encoders have equivalents in scikit-learn itself, the transformers in this library all share a few useful properties:
First-class support for pandas dataframes as an input (and optionally as output)
Explicit configuration of which columns are encoded, by name or index, or automatic inference of non-numeric columns regardless of input type
Optional dropping of columns with very low variance, based on the training set
Portability: train a transformer on data, pickle it, reuse it later, and get the same output.
Full compatibility with sklearn pipelines, input an array-like dataset like any other transformer (*)
(*) For full compatibility with Pipelines and ColumnTransformers, and for consistent behaviour of get_feature_names_out, it is recommended to upgrade sklearn to version 1.2.0 or later and to set its output to pandas:
import sklearn
sklearn.set_config(transform_output="pandas")
Usage
Install with:
pip install category_encoders
or
conda install -c conda-forge category_encoders
To use:
import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=[...])
encoder = ce.BaseNEncoder(cols=[...])
encoder = ce.BinaryEncoder(cols=[...])
encoder = ce.CatBoostEncoder(cols=[...])
encoder = ce.CountEncoder(cols=[...])
encoder = ce.GLMMEncoder(cols=[...])
encoder = ce.GrayEncoder(cols=[...])
encoder = ce.HashingEncoder(cols=[...])
encoder = ce.HelmertEncoder(cols=[...])
encoder = ce.JamesSteinEncoder(cols=[...])
encoder = ce.LeaveOneOutEncoder(cols=[...])
encoder = ce.MEstimateEncoder(cols=[...])
encoder = ce.OneHotEncoder(cols=[...])
encoder = ce.OrdinalEncoder(cols=[...])
encoder = ce.PolynomialEncoder(cols=[...])
encoder = ce.QuantileEncoder(cols=[...])
encoder = ce.RankHotEncoder(cols=[...])
encoder = ce.SumEncoder(cols=[...])
encoder = ce.TargetEncoder(cols=[...])
encoder = ce.WOEEncoder(cols=[...])
encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
All of these are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. If the cols parameter isn’t passed, every non-numeric column will be converted. See below for detailed documentation.
Known issues:
category_encoders works internally with pandas DataFrames, whereas sklearn works with numpy arrays. This can cause problems in sklearn versions prior to 1.2.0. To ensure full compatibility, configure sklearn to output DataFrames as well. This can be done by
sklearn.set_config(transform_output="pandas")
for a whole project or just for a single pipeline using
Pipeline(
steps=[
("preprocessor", SomePreprocessor().set_output(transform="pandas")),
("encoder", SomeEncoder()),
]
)
If you encounter another bug, feel free to report it on [GitHub](https://github.com/scikit-learn-contrib/category_encoders/issues).
Contents:
- Backward Difference Coding
  BackwardDifferenceEncoder: fit(), fit_contrast_coding(), fit_transform(), get_contrast_matrix(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform(), transform_contrast_coding()
- BaseN
  BaseNEncoder: basen_encode(), basen_to_integer(), calc_required_digits(), col_transform(), fit(), fit_base_n_encoding(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), inverse_transform(), number_to_base(), set_inverse_transform_request(), set_output(), set_params(), set_transform_request(), transform()
- Binary
  BinaryEncoder: basen_encode(), basen_to_integer(), calc_required_digits(), col_transform(), fit(), fit_base_n_encoding(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), inverse_transform(), number_to_base(), set_inverse_transform_request(), set_output(), set_params(), set_transform_request(), transform()
- CatBoost Encoder
  CatBoostEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform()
- Count Encoder
  CountEncoder: combine_min_categories(), fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform()
- Generalized Linear Mixed Model Encoder
  GLMMEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform()
- Gray
  GrayEncoder: basen_encode(), basen_to_integer(), calc_required_digits(), col_transform(), fit(), fit_base_n_encoding(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), gray_code(), inverse_transform(), number_to_base(), set_inverse_transform_request(), set_output(), set_params(), set_transform_request(), transform()
- Hashing
  HashingEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), hash_chunk(), hashing_trick(), hashing_trick_with_np_no_parallel(), hashing_trick_with_np_parallel(), set_output(), set_params(), set_transform_request(), transform()
- Helmert Coding
  HelmertEncoder: fit(), fit_contrast_coding(), fit_transform(), get_contrast_matrix(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform(), transform_contrast_coding()
- James-Stein Encoder
  JamesSteinEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform()
- Leave One Out
  LeaveOneOutEncoder: fit(), fit_leave_one_out(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform(), transform_leave_one_out()
- M-estimate
  MEstimateEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform()
- One Hot
  OneHotEncoder: category_mapping, fit(), fit_transform(), generate_mapping(), get_dummies(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), inverse_transform(), reverse_dummies(), set_inverse_transform_request(), set_output(), set_params(), set_transform_request(), transform()
- Ordinal
  OrdinalEncoder: category_mapping, fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), inverse_transform(), ordinal_encoding(), set_inverse_transform_request(), set_output(), set_params(), set_transform_request(), transform()
- Polynomial Coding
  PolynomialEncoder: fit(), fit_contrast_coding(), fit_transform(), get_contrast_matrix(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform(), transform_contrast_coding()
- Quantile Encoder
  QuantileEncoder: fit(), fit_quantile_encoding(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), quantile_encode(), set_output(), set_params(), set_transform_request(), transform()
- RankHotEncoder
  RankHotEncoder: fit(), fit_transform(), generate_mapping(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), inverse_transform(), set_inverse_transform_request(), set_output(), set_params(), set_transform_request(), transform()
- Sum Coding
  SumEncoder: fit(), fit_contrast_coding(), fit_transform(), get_contrast_matrix(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform(), transform_contrast_coding()
- Summary Encoder
  SummaryEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_params(), set_transform_request(), transform()
- Target Encoder
  TargetEncoder: fit(), fit_target_encoding(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), target_encode(), transform()
- Weight of Evidence
  WOEEncoder: fit(), fit_transform(), get_feature_names(), get_feature_names_in(), get_feature_names_out(), get_metadata_routing(), get_params(), set_output(), set_params(), set_transform_request(), transform()
- Wrappers