Category Encoders

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques. While ordinal, one-hot, and hashing encoders have similar equivalents in the existing scikit-learn version, the transformers in this library all share a few useful properties:

  • First-class support for pandas dataframes as an input (and optionally as output)

  • Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type

  • Can drop any columns with very low variance based on training set optionally

  • Portability: train a transformer on data, pickle it, reuse it later and get the same thing out.

  • Full compatibility with sklearn pipelines, input an array-like dataset like any other transformer (*)

(*) For full compatibility with Pipelines and ColumnTransformers, and consistent behaviour of get_feature_names_out, it’s recommended to upgrade sklearn to a version at least ‘1.2.0’ and to set output as pandas:

import sklearn
sklearn.set_config(transform_output="pandas")

Usage

install as:

pip install category_encoders

or

conda install -c conda-forge category_encoders

To use:

import category_encoders as ce

encoder = ce.BackwardDifferenceEncoder(cols=[...])
encoder = ce.BaseNEncoder(cols=[...])
encoder = ce.BinaryEncoder(cols=[...])
encoder = ce.CatBoostEncoder(cols=[...])
encoder = ce.CountEncoder(cols=[...])
encoder = ce.GLMMEncoder(cols=[...])
encoder = ce.GrayEncoder(cols=[...])
encoder = ce.HashingEncoder(cols=[...])
encoder = ce.HelmertEncoder(cols=[...])
encoder = ce.JamesSteinEncoder(cols=[...])
encoder = ce.LeaveOneOutEncoder(cols=[...])
encoder = ce.MEstimateEncoder(cols=[...])
encoder = ce.OneHotEncoder(cols=[...])
encoder = ce.OrdinalEncoder(cols=[...])
encoder = ce.PolynomialEncoder(cols=[...])
encoder = ce.QuantileEncoder(cols=[...])
encoder = ce.RankHotEncoder(cols=[...])
encoder = ce.SumEncoder(cols=[...])
encoder = ce.TargetEncoder(cols=[...])
encoder = ce.WOEEncoder(cols=[...])

encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)

All of these are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. If the cols parameter isn’t passed, every non-numeric column will be converted. See below for detailed documentation

Known issues:

CategoryEncoders internally works with pandas DataFrames as apposed to sklearn which works with numpy arrays. This can cause problems in sklearn versions prior to 1.2.0. In order to ensure full compatibility with sklearn set sklearn to also output DataFrames. This can be done by

sklearn.set_config(transform_output="pandas")

for a whole project or just for a single pipeline using

Pipeline(
    steps=[
        ("preprocessor", SomePreprocessor().set_output("pandas"),
        ("encoder", SomeEncoder()),
    ]
)

If you experience another bug, feel free to report it on [github](https://github.com/scikit-learn-contrib/category_encoders/issues)

Contents:

Indices and tables