niklib.models.preprocessors package#

Submodules#

niklib.models.preprocessors.core module#

Contains core functionalities that are shared by all preprocessors.

class niklib.models.preprocessors.core.TrainTestEvalSplit(stratify=None, random_state=None)[source]#

Bases: object

Convert a pandas dataframe to numpy arrays with train, test, and eval splits

For conversion from pandas.DataFrame to numpy.ndarray, we use the same functionality as pandas.DataFrame.to_numpy(), but additionally separate the dependent and independent variables given the target column target_column.

Note

  • To obtain the eval set, the train set is split again, i.e. the eval set is a subset of the train set. This ensures the model never sees the test set.

  • args cannot be set directly and need to be provided using a JSON file. See set_configs() for more information.

  • You can explicitly override the following args by passing them as arguments to __init__():

    • random_state

    • stratify

Returns:

Order is (x_train, x_test, x_eval, y_train, y_test, y_eval)

Return type:

Tuple[numpy.ndarray, …]

__init__(stratify=None, random_state=None)[source]#
set_configs(path=None)[source]#

Defines and sets the config to be parsed

The keys of the configs are the attributes of this class which are:

  • test_ratio (float): Ratio of test data

  • eval_ratio (float): Ratio of eval data

  • shuffle (bool): Whether to shuffle the data

  • stratify (numpy.ndarray, optional): If not None, this is used to stratify the data

  • random_state (int, optional): Random state to use for shuffling

Note

You can explicitly override the following attributes by passing them as arguments to __init__():

  • random_state

  • stratify

The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.

Parameters:

path (Union[str, Path]) – path to the JSON file containing the configs

Returns:

A dictionary of str: Any pairs of configs as class attributes

Return type:

dict
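
For instance, a minimal sketch of loading the packaged example config (see EXAMPLE_TRAIN_TEST_EVAL_SPLIT under “Module contents”); the printed keys in the comment are an assumption based on the attribute list above:

from niklib.models import preprocessors
from niklib.models.preprocessors.core import TrainTestEvalSplit

splitter = TrainTestEvalSplit()
# parse the packaged example JSON config and set its keys as attributes
configs = splitter.set_configs(path=preprocessors.EXAMPLE_TRAIN_TEST_EVAL_SPLIT)
print(configs)  # e.g. {'test_ratio': ..., 'eval_ratio': ..., 'shuffle': ..., ...}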

as_mlflow_artifact(target_path)[source]#

Saves the configs to the MLFlow artifact directory

Parameters:

target_path (Union[str, Path]) – Path to the MLFlow artifact directory. The saved file will have the same name as the original config file, so only provide the path to the directory.

Return type:

None
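
Continuing the sketch above, a one-line usage example (the artifact directory is hypothetical):

# assumes `splitter.set_configs(...)` has already been called
splitter.as_mlflow_artifact(target_path='mlruns/artifacts/')  # hypothetical directory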

__call__(df, target_column, *args, **kwds)[source]#

Convert a pandas dataframe to numpy arrays with train, test, and eval splits

Parameters:
  • df (pandas.DataFrame) – Dataframe to convert

  • target_column (str) – Name of the target column

Returns:

Order is (x_train, x_test, x_eval, y_train, y_test, y_eval)

Return type:

Tuple[numpy.ndarray, …]
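
A minimal end-to-end sketch; the toy dataframe and the 'label' column are hypothetical:

import pandas as pd
from niklib.models import preprocessors
from niklib.models.preprocessors.core import TrainTestEvalSplit

df = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'label':     [0, 1, 0, 1, 0, 1],
})

splitter = TrainTestEvalSplit(random_state=42)
splitter.set_configs(path=preprocessors.EXAMPLE_TRAIN_TEST_EVAL_SPLIT)
# the dependent variable is separated via `target_column`, then train/test/eval splits are produced
x_train, x_test, x_eval, y_train, y_test, y_eval = splitter(df=df, target_column='label')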

class niklib.models.preprocessors.core.PandasTrainTestSplit(stratify=None, random_state=None)[source]#

Bases: object

Split a pandas dataframe into train and test splits

Note

This class is very similar to TrainTestEvalSplit, with the difference that it is specialized for pandas dataframes. Since augmentation is applied to pandas dataframes rather than numpy arrays, this class enables us to augment only the train split and leave the test split as it is.

Note

  • args cannot be set directly and need to be provided using a JSON file. See set_configs() for more information.

  • You can explicitly override the following args by passing them as arguments to __init__():

    • random_state

    • stratify

Returns:

A tuple of (data_train, data_test) which contains both dependent and independent variables

Return type:

Tuple[numpy.ndarray, …]

__init__(stratify=None, random_state=None)[source]#
set_configs(path=None)[source]#

Defines and sets the config to be parsed

The keys of the configs are the attributes of this class which are:

  • train_ratio (float): Ratio of train data

  • shuffle (bool): Whether to shuffle the data

  • stratify (numpy.ndarray, optional): If not None, this is used to stratify the data

  • random_state (int, optional): Random state to use for shuffling

Note

You can explicitly override the following attributes by passing them as arguments to __init__():

  • random_state

  • stratify

The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.

Parameters:

path (Union[str, Path]) – path to the JSON file containing the configs

Returns:

A dictionary of str: Any pairs of configs as class attributes

Return type:

dict

as_mlflow_artifact(target_path)[source]#

Saves the configs to the MLFlow artifact directory

Parameters:

target_path (Union[str, Path]) – Path to the MLFlow artifact directory. The saved file will have the same name as the original config file, so only provide the path to the directory.

Return type:

None

__call__(df, target_column, *args, **kwds)[source]#

Split a pandas dataframe into train and test splits

Parameters:
  • df (pandas.DataFrame) – Dataframe to convert

  • target_column (str) – Name of the target column

Returns:

Order is (data_train, data_test)

Return type:

Tuple[numpy.ndarray, …]
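
A minimal sketch mirroring the example above, but keeping both splits as dataframes so that augmentation can later be applied to the train split only (toy data and column names are hypothetical):

import pandas as pd
from niklib.models import preprocessors
from niklib.models.preprocessors.core import PandasTrainTestSplit

df = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0],
    'label':     [0, 1, 0, 1],
})

splitter = PandasTrainTestSplit(random_state=42)
splitter.set_configs(path=preprocessors.EXAMPLE_PANDAS_TRAIN_TEST_SPLIT)
data_train, data_test = splitter(df=df, target_column='label')
# augment `data_train` only; `data_test` stays untouched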

class niklib.models.preprocessors.core.ColumnSelector(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#

Bases: object

Selects columns based on regex pattern and dtype

The user can specify the dtype of columns to select and the dtype of columns to ignore. The user can also specify regex patterns for including and excluding columns, separately.

This is particularly useful when combined with sklearn.compose.ColumnTransformer to apply different sorts of transformers to different subsets of columns. E.g.:

# select columns that contain 'Country' in their name and are of type `np.float32`
columns = preprocessors.ColumnSelector(columns_type='numeric',
                                       dtype_include=np.float32,
                                       pattern_include='.*Country.*',
                                       pattern_exclude=None,
                                       dtype_exclude=None)(df=data)
# use a transformer for selected columns
ct = preprocessors.ColumnTransformer(
    [('some_name',                   # just a name
    preprocessors.StandardScaler(),  # the transformer
    columns),                        # the columns to apply the transformer to
    ],
)

ct.fit_transform(...)

Note

If the data that is passed to the ColumnSelector is a pandas.DataFrame, then you can skip calling the instance of this class and use it directly in the pipeline. E.g.:

# select columns that contain 'Country' in their name and are of type `np.float32`
columns = preprocessors.ColumnSelector(columns_type='numeric',
                                       dtype_include=np.float32,
                                       pattern_include='.*Country.*',
                                       pattern_exclude=None,
                                       dtype_exclude=None)  # THIS LINE
# use a transformer for selected columns
ct = preprocessors.ColumnTransformer(
    [('some_name',                   # just a name
    preprocessors.StandardScaler(),  # the transformer
    columns),                        # the columns to apply the transformer to
    ],
)

ct.fit_transform(...)

See also

sklearn.compose.make_column_selector as ColumnSelector follows the same semantics.

__init__(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#

Selects columns based on regex pattern and dtype

Parameters:
  • columns_type (str) –

    Type of columns:

    1. 'string': returns the name of the columns. Useful for pandas.DataFrame

    2. 'numeric': returns the index of the columns. Useful for numpy.ndarray

  • dtype_include (type) – Type of the columns to select. For more info see pandas.DataFrame.select_dtypes().

  • pattern_include (str) – Regex pattern to match columns to include

  • dtype_exclude (type) – Type of the columns to ignore. For more info see pandas.DataFrame.select_dtypes(). Defaults to None.

  • pattern_exclude (str) – Regex pattern to match columns to exclude

__call__(df, *args, **kwds)[source]#
Parameters:

df (pandas.DataFrame) – Dataframe to extract columns from

Returns:

List of names or indices of filtered columns

Return type:

Union[List[str], List[int]]

Raises:

ValueError – If df is not an instance of pandas.DataFrame
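
A small sketch of calling the selector directly on a dataframe; the toy columns are hypothetical and the commented output follows the behavior documented above:

import numpy as np
import pandas as pd
from niklib.models import preprocessors

df = pd.DataFrame({
    'HomeCountry': np.array([1.0, 2.0], dtype=np.float32),
    'DestCountry': np.array([3.0, 4.0], dtype=np.float32),
    'Age':         np.array([30, 40], dtype=np.int64),
})

selector = preprocessors.ColumnSelector(columns_type='string',
                                        dtype_include=np.float32,
                                        pattern_include='.*Country.*')
selector(df=df)  # expected: ['HomeCountry', 'DestCountry']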

class niklib.models.preprocessors.core.ColumnTransformerConfig[source]#

Bases: object

A helper class that parses configs for using the sklearn.compose.ColumnTransformer

The purpose of this class is to create the list of transformers to be used by the sklearn.compose.ColumnTransformer. Hence, one needs to define the configs by using the set_configs() method. Then use the generate_pipeline() method to create the list of transformers.

In the end, this class returns a list of tuples, where each tuple is in the form of (name, transformer, columns).

__init__()[source]#
set_configs(path=None)[source]#

Defines and sets the config to be parsed

The keys of the configs are the names of the transformers. They must include the API name of one of the available transforms at the end:

  • sklearn transformers: Any class that could be used for transformation that is importable as sklearn.preprocessing.API_NAME

  • custom transformers: Any class that is not a sklearn transformer and is importable as niklib.models.preprocessors.API_NAME

This naming convention is used to create the proper transformer for each type of data, e.g. in JSON format:

"age_StandardScaler": {
    "columns_type": "'numeric'",
    "dtype_include": "np.float32",
    "pattern_include": "'age'",
    "pattern_exclude": "None",
    "dtype_exclude": "None",
    "group": "False",
    "use_global": "False"
}

"sex_OneHotEncoder": {
    "columns_type": "'numeric'",
    "dtype_include": "'category'",
    "pattern_include": "'VisaResult'",
    "pattern_exclude": "None",
    "dtype_exclude": "None",
    "group": "True",
    "use_global": "True"
}

The values of the configs are the columns to be transformed. The columns can be obtained by using niklib.models.preprocessors.core.ColumnSelector, which requires the user to pass certain parameters. These parameters can be set manually or extracted from JSON config files by providing the path to the JSON file.

The group key is used to determine whether the transformer should be applied to a group of columns as a whole or not. If group is True, then the required values for the transformation are obtained from all columns rather than handling each column separately. For instance, one can use OneHotEncoding on a set of columns where, if group is True, all unique categories of all of those columns are extracted and then transformed. If group is False, then each column will be transformed based on its unique categories independently. (group cannot be passed to ColumnSelector)

The use_global key is used to determine whether the transformer should be fit considering all the data or only the train data (since fitting transformations such as normalization needs to be done only on the train data). If use_global is True, then the transformer will be applied on all data. This is particularly useful for one-hot encoding categorical features where some categories may be rare and only exist in the test and eval data.

Parameters:

path (Union[str, Path]) – path to the JSON file containing the configs

Returns:

A dictionary where keys are string names, values are tuple of niklib.models.preprocessors.core.ColumnSelector instance and a boolean control variable which will be passed to generate_pipeline().

Return type:

dict

as_mlflow_artifact(target_path)[source]#

Saves the configs to the MLFlow artifact directory

Parameters:

target_path (Union[str, Path]) – Path to the MLFlow artifact directory. The saved file will have the same name as the original config file, so only provide the path to the directory.

Return type:

None

static extract_selected_columns(selector, df)[source]#

Extracts the columns from the dataframe based on the selector

Note

This method is simply a wrapper around niklib.models.preprocessors.core.ColumnSelector that makes the call given a dataframe. I.e.:

# assuming same configs
selector = preprocessors.ColumnSelector(...)
A = ColumnTransformerConfig.extract_selected_columns(selector=selector, df=df)
B = selector(df)
A == B  # True

Also, this is a static method.

Parameters:
  • selector (niklib.models.preprocessors.core.ColumnSelector) – A ColumnSelector instance to be called on the dataframe

  • df (pandas.DataFrame) – Dataframe to extract columns from

Returns:

List of columns to be transformed

Return type:

Union[List[str], List[int]]

__check_arg_exists(callable, arg)#

Checks if the argument exists in the callable signature

Parameters:
  • callable (Callable) – Callable to check the argument in

  • arg (str) – Argument to check if exists in the callable signature

Raises:

ValueError – If the argument does not exist in the callable signature

Return type:

None

__get_df_column_unique(df, loc)#

Gets uniques of a column in a dataframe

Parameters:
  • df (pandas.DataFrame) – Dataframe to get uniques from

  • loc (Union[int, str]) – Column to locate on the dataframe

Returns:

List of unique values in the column. Values of the returned list can be anything that is supported by pandas.DataFrame

Return type:

list

calculate_params(df, columns, group, transformer_name)[source]#

Calculates the parameters for the group transformation w.r.t. the transformer name

Parameters:
  • df (pandas.DataFrame) – Dataframe to extract columns from

  • columns (List) – List of columns to be transformed

  • group (bool) – If True, then the columns will be grouped together and the parameters will be calculated over all columns passed in

  • transformer_name (str) – Name of the transformer. It is used to determine the type of params to be passed to the transformer. E.g. if transformer_name corresponds to OneHotEncoding, then params would be unique categories.

Raises:

ValueError – If the transformer name is supported but its parameter calculation is not implemented

Returns:

Parameters for the group transformation

Return type:

dict

_check_overlap_in_transformation_columns(transformers)[source]#

Checks if there are multiple transformers on the same columns and reports them

Throws an info message if the columns of different transformers overlap, i.e. at least one other transform is happening on a column that has already been transformed.

Note

This is not a bug or misbehavior since we should be able to pipe multiple transformers sequentially on the same column (e.g. add -> divide). The warning is for cases where the user did not mean to do so, since the output might look acceptable yet contain wrong values, and there is no way to find out except by manual inspection. Hence, this method makes the user aware that something might be wrong.

Parameters:

transformers (List[Tuple]) – A list of tuples, where each tuple is in the form of (name, transformer, columns) where name is the name of the transformer, transformer is the transformer object and columns is the list of column names to be transformed.

Return type:

None

generate_pipeline(df, df_all=None)[source]#

Generates the list of transformers to be used by the sklearn.compose.ColumnTransformer

Note

For more info about how the transformers are created, see methods set_configs(), extract_selected_columns() and calculate_params().

Parameters:
  • df (pandas.DataFrame) – Dataframe to extract columns from. If df_all is None, then this is interpreted as the train data.

  • df_all (Optional[pandas.DataFrame]) – Dataframe to extract columns from. If df_all is not None, then this is interpreted as the entire data. For more info see set_configs().

Raises:

ValueError – If the naming convention used for the keys in the configs (see set_configs()) is not followed.

Returns:

A list of tuples, where each tuple is in the form of (name, transformer, columns) where name is the name of the transformer, transformer is the transformer object and columns is the list of column names to be transformed.

Return type:

list
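
A minimal sketch of going from configs to a fitted sklearn.compose.ColumnTransformer; the toy dataframe, its column names, and the category values are hypothetical and only chosen to match the example config patterns shown in set_configs():

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from niklib.models import preprocessors
from niklib.models.preprocessors.core import ColumnTransformerConfig

train_df = pd.DataFrame({
    'age': np.array([25.0, 32.0, 41.0], dtype=np.float32),
    'VisaResult': pd.Categorical(['acc', 'rej', 'acc']),
})

config = ColumnTransformerConfig()
config.set_configs(path=preprocessors.EXAMPLE_COLUMN_TRANSFORMER_CONFIG_X)
transformers = config.generate_pipeline(df=train_df)  # df alone -> interpreted as train data
ct = ColumnTransformer(transformers)
xt_train = ct.fit_transform(train_df)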

niklib.models.preprocessors.core.get_transformed_feature_names(column_transformer, original_columns_names)[source]#

Gives feature names for transformed data via original feature names

This is super useful as the default sklearn.compose.ColumnTransformer.get_feature_names_out() uses meaningless names for features after transformation, which makes tracking the transformed features almost impossible as it uses f0[_category], f1[_category], ..., fn[_category] as feature names. This method, for example, extracts the name of the original column A (with categories [a, b]) before transformation, finds the new columns produced by transforming that column, and names them A_a and A_b, whereas the sklearn method gives x[num0]_a and x[num0]_b.

Parameters:
  • column_transformer (sklearn.compose.ColumnTransformer) – A fitted column transformer that has .transformers_ where each is a tuple as (name, transformer, in_columns). in_columns is used to detect the original index of the transformed columns.

  • original_columns_names (List[str]) – List of original columns names before transformation

Returns:

A list of transformed columns names prefixed with original columns names

Return type:

List[str]
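
A short sketch with a toy one-hot encoded column; the dataframe and the expected names in the comment are illustrative, following the naming behavior described above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from niklib.models.preprocessors.core import get_transformed_feature_names

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': [1.0, 2.0, 3.0]})
ct = ColumnTransformer([('onehot', OneHotEncoder(), ['A'])])
ct.fit(df)

names = get_transformed_feature_names(column_transformer=ct,
                                      original_columns_names=list(df.columns))
# expected to contain original-column prefixes such as 'A_a' and 'A_b'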

niklib.models.preprocessors.core.move_dependent_variable_to_end(df, target_column)[source]#

Move the dependent variable to the end of the dataframe

This is useful for some frameworks that require the dependent variable to be the last column, and in general, it is much easier to work with numpy.ndarray s when the dependent variable is the last one.

Note

This is particularly useful for us since we have multiple columns of the same type in our dataframe, and when we want to apply the same preprocessing to all members of a group of features, we can directly use the indices of those features from our pandas dataframe in the converted numpy array. E.g.:

df = pd.DataFrame(...)
x = df.to_numpy()
index = df.columns.get_loc(a_group_of_columns_with_the_same_logic)
x[:, index] = transform(x[:, index])
Parameters:
  • df (pandas.DataFrame) – Dataframe to convert

  • target_column (str) – Name of the target column

Returns:

Dataframe with the dependent variable at the end

Return type:

pandas.DataFrame
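
A tiny sketch (column names are hypothetical):

import pandas as pd
from niklib.models.preprocessors.core import move_dependent_variable_to_end

df = pd.DataFrame({'VisaResult': [0, 1], 'age': [25.0, 32.0]})
df = move_dependent_variable_to_end(df=df, target_column='VisaResult')
list(df.columns)  # ['age', 'VisaResult']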

niklib.models.preprocessors.helpers module#

niklib.models.preprocessors.helpers.preview_column_transformer(column_transformer, original, transformed, df, random_state=Generator(PCG64) at 0x7FF565FC8820, **kwargs)[source]#

Preview transformed data next to original one obtained via ColumnTransformer

When the transformation is not sklearn.preprocessing.OneHotEncoder, the transformed data is previewed next to the original data in a pandas dataframe.

But when the transformation is sklearn.preprocessing.OneHotEncoder, seeing only 0s and 1s is no longer clean or informative. So, previewing the transformed data is skipped entirely and the following information is reported instead:

  • The number of columns affected by transformation

  • The number of unique values in all of the affected columns

  • The number of newly produced columns

Parameters:
Raises:

ValueError – If original and transformed are not of the same shape

Yields:

pandas.DataFrame – Preview dataframe for each transformer in column_transformer.transformers_. The dataframe has twice as many columns as original and transformed, i.e. df.shape == (original.shape[0], 2 * original.shape[1])

Return type:

DataFrame

Module contents#

Contains preprocessing methods for preparing data solely for estimators in niklib.models.estimators

These preprocessors expect “already cleaned” data acquired via niklib.data and prepare it solely for use by machine learning models in the desired frameworks (say, changing dtypes or one-hot encoding for torch or sklearn, which is only useful for those frameworks)

The following members are available:
niklib.models.preprocessors.EXAMPLE_COLUMN_TRANSFORMER_CONFIG_X = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_column_transformer_config_x.json')#

Configs for transforming features data for Example

For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.ColumnTransformerConfig.

niklib.models.preprocessors.EXAMPLE_COLUMN_TRANSFORMER_CONFIG_Y = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_column_transformer_config_y.json')#

Configs for transforming target data for Example

For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.ColumnTransformerConfig.

niklib.models.preprocessors.EXAMPLE_TRAIN_TEST_EVAL_SPLIT = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_train_test_eval_split.json')#

Configs for splitting dataframe into numpy ndarray of train, test, eval for Example

For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.TrainTestEvalSplit.

niklib.models.preprocessors.EXAMPLE_PANDAS_TRAIN_TEST_SPLIT = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_pandas_train_test_split.json')#

Configs for splitting dataframe train and test for Example

For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.PandasTrainTestSplit.

niklib.models.preprocessors.TRANSFORMS = {'LabelBinarizer': <class 'sklearn.preprocessing._label.LabelBinarizer'>, 'LabelEncoder': <class 'sklearn.preprocessing._label.LabelEncoder'>, 'MaxAbsScaler': <class 'sklearn.preprocessing._data.MaxAbsScaler'>, 'MinMaxScaler': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'MultiLabelBinarizer': <class 'sklearn.preprocessing._label.MultiLabelBinarizer'>, 'OneHotEncoder': <class 'sklearn.preprocessing._encoders.OneHotEncoder'>, 'RobustScaler': <class 'sklearn.preprocessing._data.RobustScaler'>, 'StandardScaler': <class 'sklearn.preprocessing._data.StandardScaler'>}#

A dictionary of transforms and their names used to verify configs in niklib.models.preprocessors.core.ColumnTransformerConfig

This is used to verify that the configs are correct and that the transformers are available.

Note

All transforms, whether from a third-party library or our own, must be included in this dictionary to be usable by niklib.models.preprocessors.core.ColumnTransformerConfig.
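
A small sketch of how a config key's trailing API name can be checked against this dictionary (the key name is hypothetical; the actual resolution is done internally by ColumnTransformerConfig):

from niklib.models.preprocessors import TRANSFORMS

config_key = 'age_StandardScaler'      # hypothetical key following the naming convention
api_name = config_key.split('_')[-1]   # -> 'StandardScaler'
assert api_name in TRANSFORMS          # the transform is available
scaler = TRANSFORMS[api_name]()        # instantiates sklearn.preprocessing.StandardScaler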

class niklib.models.preprocessors.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)[source]#

Bases: TransformerMixin, _BaseComposition

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

Read more in the User Guide.

New in version 0.20.

Parameters:
  • transformers (list of tuples) –

    List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.

    name : str

    Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.

    transformer : {‘drop’, ‘passthrough’} or estimator

    Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

    columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable

    Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.

  • remainder ({'drop', 'passthrough'} or estimator, default='drop') – By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers, but present in the data passed to fit will be automatically passed through. This subset of columns is concatenated with the output of the transformers. For dataframes, extra columns not seen during fit will be excluded from the output of transform. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

  • sparse_threshold (float, default=0.3) – If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.

  • n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • transformer_weights (dict, default=None) – Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

  • verbose (bool, default=False) – If True, the time elapsed while fitting each transformer will be printed as it is completed.

  • verbose_feature_names_out (bool, default=True) –

    If True, get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out() will not prefix any feature names and will error if feature names are not unique.

    New in version 1.0.

Variables:
  • transformers (list) – The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In case there were no columns selected, this will be the unfitted transformer. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).

  • named_transformers (Bunch) – Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

  • sparse_output (bool) – Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.

  • output_indices (dict) –

    A dictionary from each transformer name to a slice, where the slice corresponds to indices in the transformed output. This is useful to inspect which transformer is responsible for which transformed feature(s).

    New in version 1.0.

  • n_features_in (int) –

    Number of features seen during fit. Only defined if the underlying transformers expose such an attribute when fit.

    New in version 0.24.

See also

make_column_transformer

Convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.

make_column_selector

Convenience function for selecting columns based on datatype or the columns name with a regex pattern.

Notes

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

Examples

>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

ColumnTransformer can be configured with a transformer that requires a 1d array by setting the column to a string:

>>> from sklearn.feature_extraction import FeatureHasher
>>> from sklearn.preprocessing import MinMaxScaler
>>> import pandas as pd   
>>> X = pd.DataFrame({
...     "documents": ["First item", "second one here", "Is this the last?"],
...     "width": [3, 4, 5],
... })  
>>> # "documents" is a string which configures ColumnTransformer to
>>> # pass the documents column as a 1d array to the FeatureHasher
>>> ct = ColumnTransformer(
...     [("text_preprocess", FeatureHasher(input_type="string"), "documents"),
...      ("num_preprocess", MinMaxScaler(), ["width"])])
>>> X_trans = ct.fit_transform(X)  
_required_parameters = ['transformers']#
_parameter_constraints: dict = {'n_jobs': [<class 'numbers.Integral'>, None], 'remainder': [<sklearn.utils._param_validation.StrOptions object>, <sklearn.utils._param_validation.HasMethods object>, <sklearn.utils._param_validation.HasMethods object>], 'sparse_threshold': [<sklearn.utils._param_validation.Interval object>], 'transformer_weights': [<class 'dict'>, None], 'transformers': [<class 'list'>, <sklearn.utils._param_validation.Hidden object>], 'verbose': ['verbose'], 'verbose_feature_names_out': ['boolean']}#
__init__(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)[source]#
property _transformers#

Internal list of transformer only containing the name and transformers, dropping the columns. This is for the implementation of get_params via BaseComposition._get_params which expects lists of tuples of len 2.

set_output(*, transform=None)[source]#

Set the output container when “transform” and “fit_transform” are called.

Calling set_output will set the output of all estimators in transformers and transformers_.

Parameters:

transform ({"default", "pandas"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:

self – Estimator instance.

Return type:

estimator instance
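
For example, to get pandas output (this short example is not part of the upstream docstring):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
ct = ColumnTransformer([('scale', StandardScaler(), ['a', 'b'])])
ct.set_output(transform='pandas')   # transform/fit_transform now return DataFrames
out = ct.fit_transform(df)          # a pandas.DataFrame with scaled 'a' and 'b'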

get_params(deep=True)[source]#

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**kwargs)[source]#

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.

Parameters:

**kwargs (dict) – Estimator parameters.

Returns:

self – This estimator.

Return type:

ColumnTransformer

_iter(fitted=False, replace_strings=False, column_as_strings=False)[source]#

Generate (name, trans, column, weight) tuples.

If fitted=True, use the fitted transformers, else use the user specified transformers updated with converted column names and potentially appended with transformer for remainder.

_validate_transformers()[source]#
_validate_column_callables(X)[source]#

Converts callable column specifications.

_validate_remainder(X)[source]#

Validates remainder and defines _remainder targeting the remaining columns.

property named_transformers_#

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

_get_feature_name_out_for_transformer(name, trans, column, feature_names_in)[source]#

Gets feature names of transformer.

Used in conjunction with self._iter(fitted=True) in get_feature_names_out.

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

_add_prefix_for_feature_names_out(transformer_with_feature_names_out)[source]#

Add prefix for feature names out that includes the transformer names.

Parameters:

transformer_with_feature_names_out (list of tuples of (str, array-like of str)) – The tuple consistent of the transformer’s name and its feature names out.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of shape (n_features,), dtype=str

_update_fitted_transformers(transformers)[source]#
_validate_output(result)[source]#

Ensure that the output of each transformer is 2D. Otherwise hstack can raise an error or produce incorrect results.

_record_output_indices(Xs)[source]#

Record which transformer produced which column.

_log_message(name, idx, total)[source]#
_fit_transform(X, y, func, fitted=False, column_as_strings=False)[source]#

Private function to fit and/or transform on demand.

Return value (transformers and/or transformed X data) depends on the passed function. fitted=True ensures the fitted transformers are used.

fit(X, y=None)[source]#

Fit all transformers using X.

Parameters:
  • X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.

  • y (array-like of shape (n_samples,...), default=None) – Targets for supervised learning.

Returns:

self – This estimator.

Return type:

ColumnTransformer

fit_transform(X, y=None)[source]#

Fit all transformers, transform the data and concatenate results.

Parameters:
  • X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.

  • y (array-like of shape (n_samples,), default=None) – Targets for supervised learning.

Returns:

X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type:

{array-like, sparse matrix} of shape (n_samples, sum_n_components)

transform(X)[source]#

Transform X separately by each transformer, concatenate results.

Parameters:

X ({array-like, dataframe} of shape (n_samples, n_features)) – The data to be transformed by subset.

Returns:

X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

Return type:

{array-like, sparse matrix} of shape (n_samples, sum_n_components)

_hstack(Xs)[source]#

Stacks Xs horizontally.

This allows subclasses to control the stacking behavior, while reusing everything else from ColumnTransformer.

Parameters:

Xs (list of {array-like, sparse matrix, dataframe}) –

_sk_visual_block_()[source]#
_abc_impl = <_abc._abc_data object>#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None, feature_name_combiner='concat')[source]#

Bases: _BaseEncoder

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter)

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters:
  • categories ('auto' or a list of array-like, default='auto') –

    Categories (unique values) per feature:

    • ‘auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

    New in version 0.20.

  • drop ({'first', 'if_binary'} or an array-like of shape (n_features,), default=None) –

    Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model.

    However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

    • None : retain all features (the default).

    • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

    • ‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

    • array : drop[i] is the category in feature X[:, i] that should be dropped.

    When max_categories or min_frequency is configured to group infrequent categories, the dropping behavior is handled after the grouping.

    New in version 0.21: The parameter drop was added in 0.21.

    Changed in version 0.23: The option drop=‘if_binary’ was added in 0.23.

    Changed in version 1.1: Support for dropping infrequent categories.

  • sparse (bool, default=True) –

    Will return sparse matrix if set True else will return an array.

    Deprecated since version 1.2: sparse is deprecated in 1.2 and will be removed in 1.4. Use sparse_output instead.

  • sparse_output (bool, default=True) –

    Will return sparse matrix if set True else will return an array.

    New in version 1.2: sparse was renamed to sparse_output

  • dtype (number type, default=float) – Desired dtype of output.

  • handle_unknown ({'error', 'ignore', 'infrequent_if_exist'}, default='error') –

    Specifies the way unknown categories are handled during transform().

    • ‘error’ : Raise an error if an unknown category is present during transform.

    • ‘ignore’ : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

    • ‘infrequent_if_exist’ : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted ‘infrequent’ if it exists. If the ‘infrequent’ category does not exist, then transform() and inverse_transform() will handle an unknown category as with handle_unknown=‘ignore’. Infrequent categories exist based on min_frequency and max_categories. Read more in the User Guide.

    Changed in version 1.1: ‘infrequent_if_exist’ was added to automatically handle unknown categories and infrequent categories.

  • min_frequency (int or float, default=None) –

    Specifies the minimum frequency below which a category will be considered infrequent.

    • If int, categories with a smaller cardinality will be considered infrequent.

    • If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.

    New in version 1.1: Read more in the User Guide.

  • max_categories (int, default=None) –

    Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

    New in version 1.1: Read more in the User Guide.

  • feature_name_combiner ("concat" or callable, default="concat") –

    Callable with signature def callable(input_feature, category) that returns a string. This is used to create feature names to be returned by get_feature_names_out().

    “concat” concatenates encoded feature name and category with feature + “_” + str(category). E.g. feature X with values 1, 6, 7 creates feature names X_1, X_6, X_7.

    New in version 1.3.

Variables:
  • categories (list of arrays) – The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

  • drop_idx (array of shape (n_features,)) –

    • drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

    • drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=’if_binary’ and the feature isn’t binary.

    • drop_idx_ = None if all the transformed features will be retained.

    If infrequent categories are enabled by setting min_frequency or max_categories to a non-default value and drop_idx[i] corresponds to an infrequent category, then the entire infrequent category is dropped.

    Changed in version 0.23: Added the possibility to contain None values.

  • infrequent_categories (list of ndarray) –

    Defined only if infrequent categories are enabled by setting min_frequency or max_categories to a non-default value. infrequent_categories_[i] are the infrequent categories for feature i. If the feature i has no infrequent categories infrequent_categories_[i] is None.

    New in version 1.1.

  • n_features_in (int) –

    Number of features seen during fit.

    New in version 1.0.

  • feature_names_in (ndarray of shape (n_features_in_,)) –

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    New in version 1.0.

  • feature_name_combiner (callable or None) –

    Callable with signature def callable(input_feature, category) that returns a string. This is used to create feature names to be returned by get_feature_names_out().

    New in version 1.3.

See also

OrdinalEncoder

Performs an ordinal (integer) encoding of the categorical features.

TargetEncoder

Encodes categorical features using the target.

sklearn.feature_extraction.DictVectorizer

Performs a one-hot encoding of dictionary items (also handles string-valued features).

sklearn.feature_extraction.FeatureHasher

Performs an approximate one-hot encoding of dictionary items or strings.

LabelBinarizer

Binarizes labels in a one-vs-all fashion.

MultiLabelBinarizer

Transforms between iterable of iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column for feature only having 2 categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])

One can change the way feature names are created.

>>> def custom_combiner(feature, category):
...     return str(feature) + "_" + type(category).__name__ + "_" + str(category)
>>> custom_fnames_enc = OneHotEncoder(feature_name_combiner=custom_combiner).fit(X)
>>> custom_fnames_enc.get_feature_names_out()
array(['x0_str_Female', 'x0_str_Male', 'x1_int_1', 'x1_int_2', 'x1_int_3'],
      dtype=object)

Infrequent categories are enabled by setting max_categories or min_frequency.

>>> import numpy as np
>>> X = np.array([["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3], dtype=object).T
>>> ohe = OneHotEncoder(max_categories=3, sparse_output=False).fit(X)
>>> ohe.infrequent_categories_
[array(['a', 'd'], dtype=object)]
>>> ohe.transform([["a"], ["b"]])
array([[0., 0., 1.],
       [1., 0., 0.]])
_parameter_constraints: dict = {'categories': [<sklearn.utils._param_validation.StrOptions object>, <class 'list'>], 'drop': [<sklearn.utils._param_validation.StrOptions object>, 'array-like', None], 'dtype': 'no_validation', 'feature_name_combiner': [<sklearn.utils._param_validation.StrOptions object>, <built-in function callable>], 'handle_unknown': [<sklearn.utils._param_validation.StrOptions object>], 'max_categories': [<sklearn.utils._param_validation.Interval object>, None], 'min_frequency': [<sklearn.utils._param_validation.Interval object>, <sklearn.utils._param_validation.Interval object>, None], 'sparse': [<sklearn.utils._param_validation.Hidden object>, 'boolean'], 'sparse_output': ['boolean']}#
__init__(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None, feature_name_combiner='concat')[source]#
_map_drop_idx_to_infrequent(feature_idx, drop_idx)[source]#

Convert drop_idx into the index for infrequent categories.

If there are no infrequent categories, then drop_idx is returned. This method is called in _set_drop_idx when the drop parameter is an array-like.

_set_drop_idx()[source]#

Compute the drop indices associated with self.categories_.

If self.drop is:

  • None, No categories have been dropped.

  • ‘first’, All zeros to drop the first category.

  • ‘if_binary’, All zeros if the category is binary and None otherwise.

  • array-like, The indices of the categories that match the categories in self.drop. If the dropped category is an infrequent category, then the index for the infrequent category is used. This means that the entire infrequent category is dropped.

This method defines a public drop_idx_ and a private _drop_idx_after_grouping.

  • drop_idx_: Public facing API that references the drop category in self.categories_.

  • _drop_idx_after_grouping: Used internally to drop categories after the infrequent categories are grouped together.

If there are no infrequent categories or drop is None, then drop_idx_=_drop_idx_after_grouping.

_compute_transformed_categories(i, remove_dropped=True)[source]#

Compute the transformed categories used for column i.

1. If there are infrequent categories, the category is named ‘infrequent_sklearn’.

2. Dropped columns are removed when remove_dropped=True.

_remove_dropped_categories(categories, i)[source]#

Remove dropped categories.

_compute_n_features_outs()[source]#

Compute the n_features_out for each input feature.

fit(X, y=None)[source]#

Fit OneHotEncoder to X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns:

Fitted encoder.

Return type:

self

transform(X)[source]#

Transform X using one-hot encoding.

If there are infrequent categories for a feature, the infrequent categories will be grouped into a single category.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to encode.

Returns:

X_out – Transformed input. If sparse_output=True, a sparse matrix will be returned.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)

inverse_transform(X)[source]#

Convert the data back to the original representation.

When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.

For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.

Returns:

X_tr – Inverse transformed array.

Return type:

ndarray of shape (n_samples, n_features)

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out – Transformed feature names.

Return type:

ndarray of str objects

_check_get_feature_name_combiner()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.LabelEncoder[source]#

Bases: TransformerMixin, BaseEstimator

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

Read more in the User Guide.

New in version 0.12.

Variables:

classes (ndarray of shape (n_classes,)) – Holds the label for each class.

See also

OrdinalEncoder

Encode categorical features using an ordinal encoding scheme.

OneHotEncoder

Encode categorical features as a one-hot numeric array.

Examples

LabelEncoder can be used to normalize labels.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
fit(y)[source]#

Fit label encoder.

Parameters:

y (array-like of shape (n_samples,)) – Target values.

Returns:

self – Fitted label encoder.

Return type:

returns an instance of self.

fit_transform(y)[source]#

Fit label encoder and return encoded labels.

Parameters:

y (array-like of shape (n_samples,)) – Target values.

Returns:

y – Encoded labels.

Return type:

array-like of shape (n_samples,)

transform(y)[source]#

Transform labels to normalized encoding.

Parameters:

y (array-like of shape (n_samples,)) – Target values.

Returns:

y – Labels as normalized encodings.

Return type:

array-like of shape (n_samples,)

inverse_transform(y)[source]#

Transform labels back to original encoding.

Parameters:

y (ndarray of shape (n_samples,)) – Target values.

Returns:

y – Original encoding.

Return type:

ndarray of shape (n_samples,)

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.StandardScaler(*, copy=True, with_mean=True, with_std=True)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

Read more in the User Guide.

Parameters:
  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

Variables:
  • scale (ndarray of shape (n_features,) or None) –

    Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False.

    New in version 0.17: scale_

  • mean (ndarray of shape (n_features,) or None) – The mean value for each feature in the training set. Equal to None when with_mean=False.

  • var (ndarray of shape (n_features,) or None) – The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

  • n_features_in (int) –

    Number of features seen during fit.

    New in version 0.24.

  • feature_names_in (ndarray of shape (n_features_in_,)) –

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    New in version 1.0.

  • n_samples_seen (int or ndarray of shape (n_features,)) – The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array of dtype int. If sample_weights are used it will be a float (if no missing data) or an array of dtype float that sums the weights seen so far. Will be reset on new calls to fit, but increments across partial_fit calls.

See also

scale

Equivalent function without the estimator API.

PCA

Further removes the linear correlation across features with ‘whiten=True’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
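
As a small sanity check (a sketch, not from the original docstring), scale_ on the example data above matches the biased standard deviation described in the Notes:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler().fit(data)
# scale_ is np.sqrt(var_), i.e. the biased std (ddof=0) of the training data
print(np.allclose(scaler.scale_, np.std(data, axis=0, ddof=0)))  # True
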
_parameter_constraints: dict = {'copy': ['boolean'], 'with_mean': ['boolean'], 'with_std': ['boolean']}#
__init__(*, copy=True, with_mean=True, with_std=True)[source]#
_reset()[source]#

Reset internal data-dependent state of the scaler, if necessary.

__init__ parameters are not touched.

fit(X, y=None, sample_weight=None)[source]#

Compute the mean and std to be used for later scaling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Individual weights for each sample.

    New in version 0.24: parameter sample_weight support to StandardScaler.

Returns:

self – Fitted scaler.

Return type:

object

partial_fit(X, y=None, sample_weight=None)[source]#

Online computation of mean and std on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to very large number of n_samples or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Individual weights for each sample.

    New in version 0.24: parameter sample_weight support to StandardScaler.

Returns:

self – Fitted scaler.

Return type:

object
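
For illustration only, a minimal sketch of fitting the scaler incrementally over chunks of data (the chunked "stream" below is a stand-in assumption):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
stream = np.array_split(rng.normal(size=(1000, 3)), 10)  # stand-in for a data stream

scaler = StandardScaler()
for chunk in stream:
    scaler.partial_fit(chunk)           # mean_ and var_ are updated incrementally
x_scaled = scaler.transform(stream[0])  # uses the accumulated statistics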

transform(X, copy=None)[source]#

Perform standardization by centering and scaling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • copy (bool, default=None) – Copy the input X or not.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

inverse_transform(X, copy=None)[source]#

Scale back the data to the original representation.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • copy (bool, default=None) – Copy the input X or not.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardScaler#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object
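
A hedged sketch of how such a request could be used inside a pipeline, assuming a scikit-learn version where Pipeline participates in metadata routing (>= 1.4) and routing is explicitly enabled:

import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sklearn.set_config(enable_metadata_routing=True)

pipe = Pipeline([
    ("scale", StandardScaler().set_fit_request(sample_weight=True)),
    ("clf", LogisticRegression().set_fit_request(sample_weight=True)),
])

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, size=20)
w = np.random.rand(20)
pipe.fit(X, y, sample_weight=w)  # sample_weight is routed to both steps' fit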

set_inverse_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardScaler#

Request metadata passed to the inverse_transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in inverse_transform.

Returns:

self – The updated object.

Return type:

object

set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardScaler#

Request metadata passed to the partial_fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to partial_fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in partial_fit.

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardScaler#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.

Returns:

self – The updated object.

Return type:

object

class niklib.models.preprocessors.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

Parameters:
  • feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.

  • copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

  • clip (bool, default=False) –

    Set to True to clip transformed values of held-out data to provided feature range.

    New in version 0.24.

Variables:
  • min (ndarray of shape (n_features,)) – Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_

  • scale (ndarray of shape (n_features,)) –

    Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

    New in version 0.17: scale_ attribute.

  • data_min (ndarray of shape (n_features,)) –

    Per feature minimum seen in the data

    New in version 0.17: data_min_

  • data_max (ndarray of shape (n_features,)) –

    Per feature maximum seen in the data

    New in version 0.17: data_max_

  • data_range (ndarray of shape (n_features,)) –

    Per feature range (data_max_ - data_min_) seen in the data

    New in version 0.17: data_range_

  • n_features_in (int) –

    Number of features seen during fit.

    New in version 0.24.

  • n_samples_seen (int) – The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.

  • feature_names_in (ndarray of shape (n_features_in_,)) –

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    New in version 1.0.

See also

minmax_scale

Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
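
As an additional sketch (not part of the original docstring), clip=True bounds out-of-range values of held-out data to the feature range:

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler(clip=True).fit(data)
print(scaler.transform([[2, 2]]))  # [[1. 0.]] instead of [[1.5 0. ]]
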
_parameter_constraints: dict = {'clip': ['boolean'], 'copy': ['boolean'], 'feature_range': [<class 'tuple'>]}#
__init__(feature_range=(0, 1), *, copy=True, clip=False)[source]#
_reset()[source]#

Reset internal data-dependent state of the scaler, if necessary.

__init__ parameters are not touched.

fit(X, y=None)[source]#

Compute the minimum and maximum to be used for later scaling.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

partial_fit(X, y=None)[source]#

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to very large number of n_samples or because X is read from a continuous stream.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

transform(X)[source]#

Scale features of X according to feature_range.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.

Returns:

Xt – Transformed data.

Return type:

ndarray of shape (n_samples, n_features)

inverse_transform(X)[source]#

Undo the scaling of X according to feature_range.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.

Returns:

Xt – Transformed data.

Return type:

ndarray of shape (n_samples, n_features)

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform() method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

New in version 0.17.

Read more in the User Guide.

Parameters:
  • with_centering (bool, default=True) – If True, center the data before scaling. This will cause transform() to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_scaling (bool, default=True) – If True, scale the data to interquartile range.

  • quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)) –

    Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quantile and q_max is the third quantile.

    New in version 0.18.

  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • unit_variance (bool, default=False) –

    If True, scale data so that normally distributed features have a variance of 1. In general, if the difference between the x-values of q_max and q_min for a standard normal distribution is greater than 1, the dataset will be scaled down. If less than 1, the dataset will be scaled up.

    New in version 0.24.

Variables:
  • center (array of floats) – The median value for each feature in the training set.

  • scale (array of floats) –

    The (scaled) interquartile range for each feature in the training set.

    New in version 0.17: scale_ attribute.

  • n_features_in (int) –

    Number of features seen during fit.

    New in version 0.24.

  • feature_names_in (ndarray of shape (n_features_in_,)) –

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    New in version 1.0.

See also

robust_scale

Equivalent function without the estimator API.

sklearn.decomposition.PCA

Further removes the linear correlation across features with ‘whiten=True’.

Notes

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range

Examples

>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
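
A small illustrative sketch (not from the original docstring) of why the median/IQR statistics are robust to an outlier:

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # last sample is an outlier
scaler = RobustScaler().fit(X)
print(scaler.center_)  # [3.] -- the median is unaffected by the outlier
print(scaler.scale_)   # IQR-based scale, also insensitive to the outlier
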
_parameter_constraints: dict = {'copy': ['boolean'], 'quantile_range': [<class 'tuple'>], 'unit_variance': ['boolean'], 'with_centering': ['boolean'], 'with_scaling': ['boolean']}#
__init__(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]#
fit(X, y=None)[source]#

Compute the median and quantiles to be used for scaling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the median and quantiles used for later scaling along the features axis.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns:

self – Fitted scaler.

Return type:

object

transform(X)[source]#

Center and scale the data.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the specified axis.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

inverse_transform(X)[source]#

Scale back the data to the original representation.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The rescaled data to be transformed back.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.MaxAbsScaler(*, copy=True)[source]#

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Scale each feature by its maximum absolute value.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

This scaler can also be applied to sparse CSR or CSC matrices.

New in version 0.17.

Parameters:

copy (bool, default=True) – Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).

Variables:
  • scale (ndarray of shape (n_features,)) –

    Per feature relative scaling of the data.

    New in version 0.17: scale_ attribute.

  • max_abs (ndarray of shape (n_features,)) – Per feature maximum absolute value.

  • n_features_in (int) –

    Number of features seen during fit.

    New in version 0.24.

  • feature_names_in (ndarray of shape (n_features_in_,)) –

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    New in version 1.0.

  • n_samples_seen (int) – The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

See also

maxabs_scale

Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MaxAbsScaler
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
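
As a brief sketch (not part of the original docstring), the scaler keeps sparse inputs sparse since no centering is performed:

import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X_sparse = sparse.csr_matrix(np.array([[1.0, -2.0], [0.0, 4.0]]))
X_scaled = MaxAbsScaler().fit_transform(X_sparse)  # stays a sparse CSR matrix
print(X_scaled.toarray())  # each column divided by its maximum absolute value
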
_parameter_constraints: dict = {'copy': ['boolean']}#
__init__(*, copy=True)[source]#
_reset()[source]#

Reset internal data-dependent state of the scaler, if necessary.

__init__ parameters are not touched.

fit(X, y=None)[source]#

Compute the maximum absolute value to be used for later scaling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature maximum absolute value used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

partial_fit(X, y=None)[source]#

Online computation of max absolute value of X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to very large number of n_samples or because X is read from a continuous stream.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature maximum absolute value used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted scaler.

Return type:

object

transform(X)[source]#

Scale the data.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be scaled.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

inverse_transform(X)[source]#

Scale back the data to the original representation.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be transformed back.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False)[source]#

Bases: TransformerMixin, BaseEstimator

Binarize labels in a one-vs-all fashion.

Several regression and binary classification algorithms are available in scikit-learn. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.

At learning time, this simply consists in learning one regressor or binary classifier per class. In doing so, one needs to convert multi-class labels to binary labels (belong or does not belong to the class). LabelBinarizer makes this process easy with the transform method.

At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform() method.

Read more in the User Guide.

Parameters:
  • neg_label (int, default=0) – Value with which negative labels must be encoded.

  • pos_label (int, default=1) – Value with which positive labels must be encoded.

  • sparse_output (bool, default=False) – True if the returned array from transform is desired to be in sparse CSR format.

Variables:
  • classes (ndarray of shape (n_classes,)) – Holds the label for each class.

  • y_type (str) – Represents the type of the target data as evaluated by type_of_target(). Possible types are ‘continuous’, ‘continuous-multioutput’, ‘binary’, ‘multiclass’, ‘multiclass-multioutput’, ‘multilabel-indicator’, and ‘unknown’.

  • sparse_input (bool) – True if the input data to transform is given as a sparse matrix, False otherwise.

See also

label_binarize

Function to perform the transform operation of LabelBinarizer with fixed classes.

OneHotEncoder

Encode categorical features using a one-hot aka one-of-K scheme.

Examples

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer()
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])

Binary targets transform to a column vector

>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit_transform(['yes', 'no', 'no', 'yes'])
array([[1],
       [0],
       [0],
       [1]])

Passing a 2D matrix for multilabel classification

>>> import numpy as np
>>> lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))
LabelBinarizer()
>>> lb.classes_
array([0, 1, 2])
>>> lb.transform([0, 1, 2, 1])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0]])
_parameter_constraints: dict = {'neg_label': [<class 'numbers.Integral'>], 'pos_label': [<class 'numbers.Integral'>], 'sparse_output': ['boolean']}#
__init__(*, neg_label=0, pos_label=1, sparse_output=False)[source]#
fit(y)[source]#

Fit label binarizer.

Parameters:

y (ndarray of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification.

Returns:

self – Returns the instance itself.

Return type:

object

fit_transform(y)[source]#

Fit label binarizer/transform multi-class labels to binary labels.

The output of transform is sometimes referred to as the 1-of-K coding scheme.

Parameters:

y ({ndarray, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification. Sparse matrix can be CSR, CSC, COO, DOK, or LIL.

Returns:

Y – Shape will be (n_samples, 1) for binary problems. Sparse matrix will be of CSR format.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_classes)

transform(y)[source]#

Transform multi-class labels to binary labels.

The output of transform is sometimes referred to by some authors as the 1-of-K coding scheme.

Parameters:

y ({array, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification. Sparse matrix can be CSR, CSC, COO, DOK, or LIL.

Returns:

Y – Shape will be (n_samples, 1) for binary problems. Sparse matrix will be of CSR format.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_classes)

inverse_transform(Y, threshold=None)[source]#

Transform binary labels back to multi-class labels.

Parameters:
  • Y ({ndarray, sparse matrix} of shape (n_samples, n_classes)) – Target values. All sparse matrices are converted to CSR before inverse transformation.

  • threshold (float, default=None) –

    Threshold used in the binary and multi-label cases.

    Use 0 when Y contains the output of decision_function (classifier). Use 0.5 when Y contains the output of predict_proba.

    If None, the threshold is assumed to be half way between neg_label and pos_label.

Returns:

y – Target values. Sparse matrix will be of CSR format.

Return type:

{ndarray, sparse matrix} of shape (n_samples,)

Notes

In the case when the binary labels are fractional (probabilistic), inverse_transform() chooses the class with the greatest value. Typically, this allows to use the output of a linear model’s decision_function method directly as the input of inverse_transform().
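
A minimal sketch of that use case (the scores below are hypothetical decision_function-like outputs, not from the original docstring):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer().fit(["no", "yes", "yes", "no"])
scores = np.array([[-0.3], [0.8], [0.1]])            # hypothetical decision scores
print(lb.inverse_transform(scores, threshold=0.0))   # ['no' 'yes' 'yes']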

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
set_inverse_transform_request(*, threshold: bool | None | str = '$UNCHANGED$') LabelBinarizer#

Request metadata passed to the inverse_transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

threshold (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for threshold parameter in inverse_transform.

Returns:

self – The updated object.

Return type:

object

class niklib.models.preprocessors.MultiLabelBinarizer(*, classes=None, sparse_output=False)[source]#

Bases: TransformerMixin, BaseEstimator

Transform between iterable of iterables and a multilabel format.

Although a list of sets or tuples is a very intuitive format for multilabel data, it is unwieldy to process. This transformer converts between this intuitive format and the supported multilabel format: a (samples x classes) binary matrix indicating the presence of a class label.

Parameters:
  • classes (array-like of shape (n_classes,), default=None) – Indicates an ordering for the class labels. All entries should be unique (cannot contain duplicate classes).

  • sparse_output (bool, default=False) – Set to True if output binary array is desired in CSR sparse format.

Variables:

classes (ndarray of shape (n_classes,)) – A copy of the classes parameter when provided. Otherwise it corresponds to the sorted set of classes found when fitting.

See also

OneHotEncoder

Encode categorical features using a one-hot aka one-of-K scheme.

Examples

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> mlb.classes_
array([1, 2, 3])
>>> mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
array([[0, 1, 1],
       [1, 0, 0]])
>>> list(mlb.classes_)
['comedy', 'sci-fi', 'thriller']

A common mistake is to pass in a list, which leads to the following issue:

>>> mlb = MultiLabelBinarizer()
>>> mlb.fit(['sci-fi', 'thriller', 'comedy'])
MultiLabelBinarizer()
>>> mlb.classes_
array(['-', 'c', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't',
    'y'], dtype=object)

To correct this, the list of labels should be passed in as:

>>> mlb = MultiLabelBinarizer()
>>> mlb.fit([['sci-fi', 'thriller', 'comedy']])
MultiLabelBinarizer()
>>> mlb.classes_
array(['comedy', 'sci-fi', 'thriller'], dtype=object)
_parameter_constraints: dict = {'classes': ['array-like', None], 'sparse_output': ['boolean']}#
__init__(*, classes=None, sparse_output=False)[source]#
fit(y)[source]#

Fit the label sets binarizer, storing classes_.

Parameters:

y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.

Returns:

self – Fitted estimator.

Return type:

object

fit_transform(y)[source]#

Fit the label sets binarizer and transform the given label sets.

Parameters:

y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.

Returns:

y_indicator – A matrix such that y_indicator[i, j] = 1 iff classes_[j] is in y[i], and 0 otherwise. Sparse matrix will be of CSR format.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_classes)

transform(y)[source]#

Transform the given label sets.

Parameters:

y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.

Returns:

y_indicator – A matrix such that y_indicator[i, j] = 1 iff classes_[j] is in y[i], and 0 otherwise.

Return type:

array or CSR matrix, shape (n_samples, n_classes)

_build_cache()[source]#
_transform(y, class_mapping)[source]#

Transforms the label sets with a given mapping.

Parameters:
  • y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.

  • class_mapping (Mapping) – Maps from label to column index in label indicator matrix.

Returns:

y_indicator – Label indicator matrix. Will be of CSR format.

Return type:

sparse matrix of shape (n_samples, n_classes)

inverse_transform(yt)[source]#

Transform the given indicator matrix into label sets.

Parameters:

yt ({ndarray, sparse matrix} of shape (n_samples, n_classes)) – A matrix containing only 1s and 0s.

Returns:

y – The set of labels for each sample such that y[i] consists of classes_[j] for each yt[i, j] == 1.

Return type:

list of tuples
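
A short round-trip sketch (not part of the original docstring):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer().fit([(1, 2), (3,)])  # classes_ becomes [1, 2, 3]
indicator = mlb.transform([(1, 3), (2,)])        # [[1, 0, 1], [0, 1, 0]]
print(mlb.inverse_transform(indicator))          # [(1, 3), (2,)]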

_more_tags()[source]#
_sklearn_auto_wrap_output_keys = {'transform'}#
class niklib.models.preprocessors.PandasTrainTestSplit(stratify=None, random_state=None)[source]#

Bases: object

Split a pandas dataframe with train and test

Note

This class is very similar to TrainTestEvalSplit, with the difference that it is specialized for pandas DataFrames. Since we are going to apply augmentation on pandas DataFrames rather than NumPy arrays, this class enables us to augment only the train split and leave the test split as it is.

Note

  • args cannot be set directly and need to be provided using a json file. See set_configs() for more information.

  • You can explicitly override following args by passing it as an argument to __init__():

    • random_state

    • stratify

Returns:

A tuple of (data_train, data_test) which contains both dependent and independent variables

Return type:

Tuple[numpy.ndarray, …]

__init__(stratify=None, random_state=None)[source]#
set_configs(path=None)[source]#

Defines and sets the config to be parsed

The keys of the configs are the attributes of this class which are:

  • train_ratio (float): Ratio of train data

  • shuffle (bool): Whether to shuffle the data

  • stratify (numpy.ndarray, optional): If not None, this is used to stratify the data

  • random_state (int, optional): Random state to use for shuffling

Note

You can explicitly override following attributes by passing it as an argument to __init__():

  • random_state

  • stratify

The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.

Parameters:

path (Union[str, Path]) – path to the JSON file containing the configs

Returns:

A dictionary of str: Any pairs of configs as class attributes

Return type:

dict

as_mlflow_artifact(target_path)[source]#

Saves the configs to the MLFlow artifact directory

Parameters:

target_path (Union[str, Path]) – Path to the MLFlow artifact directory. The name of the file will be same as original config file, hence, only provide path to dir.

Return type:

None

__call__(df, target_column, *args, **kwds)[source]#

Split a pandas dataframe into train and test splits

Parameters:
  • df (pandas.DataFrame) – Dataframe to convert

  • target_column (str) – Name of the target column

Returns:

Order is (data_train, data_test)

Return type:

Tuple[numpy.ndarray, …]
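
A minimal usage sketch; the dataframe, config path, and column name below are assumptions, and the JSON config is expected to provide the keys described in set_configs() (train_ratio, shuffle, ...):

import pandas as pd
from niklib.models.preprocessors import PandasTrainTestSplit

# hypothetical data and config path
df = pd.DataFrame({"age": [20, 30, 40, 50], "VisaResult": [0, 1, 0, 1]})
splitter = PandasTrainTestSplit(random_state=42)  # overrides random_state from the config
splitter.set_configs(path="configs/pandas_train_test_split.json")
data_train, data_test = splitter(df=df, target_column="VisaResult")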

class niklib.models.preprocessors.TrainTestEvalSplit(stratify=None, random_state=None)[source]#

Bases: object

Convert a pandas dataframe to a numpy array with train, test, and eval splits

For conversion from pandas.DataFrame to numpy.ndarray, we use the same functionality as pandas.DataFrame.to_numpy(), but it separates dependent and independent variables given the target column target_column.

Note

  • To obtain the eval set, we use the train set as the original data to be split, i.e. the eval set is a subset of the train set. This is of course to make sure the model by no means sees the test set.

  • args cannot be set directly and need to be provided using a json file. See set_configs() for more information.

  • You can explicitly override following args by passing it as an argument to __init__():

    • random_state

    • stratify

Returns:

Order is (x_train, x_test, x_eval, y_train, y_test, y_eval)

Return type:

Tuple[numpy.ndarray, …]

__init__(stratify=None, random_state=None)[source]#
set_configs(path=None)[source]#

Defines and sets the config to be parsed

The keys of the configs are the attributes of this class which are:

  • test_ratio (float): Ratio of test data

  • eval_ratio (float): Ratio of eval data

  • shuffle (bool): Whether to shuffle the data

  • stratify (numpy.ndarray, optional): If not None, this is used to stratify the data

  • random_state (int, optional): Random state to use for shuffling

Note

You can explicitly override following attributes by passing it as an argument to __init__():

  • random_state

  • stratify

The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.

Parameters:

path (Union[str, Path]) – path to the JSON file containing the configs

Returns:

A dictionary of str: Any pairs of configs as class attributes

Return type:

dict

as_mlflow_artifact(target_path)[source]#

Saves the configs to the MLFlow artifact directory

Parameters:

target_path (Union[str, Path]) – Path to the MLFlow artifact directory. The name of the file will be same as original config file, hence, only provide path to dir.

Return type:

None

__call__(df, target_column, *args, **kwds)[source]#

Convert a pandas dataframe to a numpy array with train, test, and eval splits

Parameters:
  • df (pandas.DataFrame) – Dataframe to convert

  • target_column (str) – Name of the target column

Returns:

Order is (x_train, x_test, x_eval, y_train, y_test, y_eval)

Return type:

Tuple[numpy.ndarray, …]
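
A minimal usage sketch; the dataframe and config path are assumptions, and the JSON config is expected to provide the keys described in set_configs() (test_ratio, eval_ratio, shuffle, ...):

import pandas as pd
from niklib.models.preprocessors import TrainTestEvalSplit

# hypothetical data and config path
df = pd.DataFrame({"age": [20, 30, 40, 50, 60, 70], "VisaResult": [0, 1, 0, 1, 0, 1]})
converter = TrainTestEvalSplit(random_state=42)
converter.set_configs(path="configs/train_test_eval_split.json")
x_train, x_test, x_eval, y_train, y_test, y_eval = converter(df=df, target_column="VisaResult")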

class niklib.models.preprocessors.ColumnTransformerConfig[source]#

Bases: object

A helper class that parses configs for using the sklearn.compose.ColumnTransformer

The purpose of this class is to create the list of transformers to be used by the sklearn.compose.ColumnTransformer. Hence, one needs to define the configs by using the set_configs() method. Then use the generate_pipeline() method to create the list of transformers.

This class, at the end, will return a list of tuples, where each tuple is in the form of (name, transformer, columns).

__init__()[source]#
set_configs(path=None)[source]#

Defines and sets the config to be parsed

The keys of the configs are the names of the transformers. They must include the API name of one of the available transforms at the end:

  • sklearn transformers: Any class that could be used for transformation that is importable as sklearn.preprocessing.API_NAME

  • custom transformers: Any class that is not a sklearn transformer and is importable as niklib.models.preprocessors.API_NAME

This naming convention is used to create proper transformers for each type of data, e.g. in JSON format:

"age_StandardScaler": {
    "columns_type": "'numeric'",
    "dtype_include": "np.float32",
    "pattern_include": "'age'",
    "pattern_exclude": "None",
    "dtype_exclude": "None",
    "group": "False",
    "use_global": "False"
}

"sex_OneHotEncoder": {
    "columns_type": "'numeric'",
    "dtype_include": "'category'",
    "pattern_include": "'VisaResult'",
    "pattern_exclude": "None",
    "dtype_exclude": "None",
    "group": "True",
    "use_global": "True"
}

The values of the configs are the columns to be transformed. The columns can be obtained by using niklib.models.preprocessors.core.ColumnSelector, which requires the user to pass certain parameters. These parameters can be set manually or extracted from JSON config files by providing the path to the JSON file.

The group key is used to determine if the transformer should be applied considering a group of columns or not. If group is True, then the values required for the transformation are obtained from all columns together rather than from each column separately. For instance, one can use OneHotEncoding on a set of columns where, if group is True, all unique categories of all of those columns are extracted and then transformed. If group is False, then each column will be transformed based on its own unique categories independently. (group cannot be passed to ColumnSelector)

The use_global key is used to determine if the transformer should be applied considering all data or only the train data (since fitting a transformation for normalization needs to be done only on train data). If use_global is True, then the transformer will be applied on all data. This is particularly useful for one hot encoding categorical features where some categories might be rare and might only exist in the test and eval data.

Parameters:

path (Union[str, Path]) – path to the JSON file containing the configs

Returns:

A dictionary where keys are string names and values are tuples of a niklib.models.preprocessors.core.ColumnSelector instance and a boolean control variable which will be passed to generate_pipeline().

Return type:

dict

as_mlflow_artifact(target_path)[source]#

Saves the configs to the MLFlow artifact directory

Parameters:

target_path (Union[str, Path]) – Path to the MLFlow artifact directory. The name of the file will be same as original config file, hence, only provide path to dir.

Return type:

None

static extract_selected_columns(selector, df)[source]#

Extracts the columns from the dataframe based on the selector

Note

This method is simply a wrapper around niklib.models.preprocessors.core.ColumnSelector that makes the call given a dataframe. I.e.:

# assuming same configs
selector = preprocessors.ColumnSelector(...)
A = ColumnTransformerConfig.extract_selected_columns(selector=selector, df=df)
B = selector(df)
A == B  # True

Also, this is a static method.

Parameters:
  • selector (niklib.models.preprocessors.core.ColumnSelector) – An instantiated column selector used to pick the columns

  • df (pandas.DataFrame) – Dataframe to extract columns from

Returns:

List of columns to be transformed

Return type:

Union[List[str], List[int]]

__check_arg_exists(callable, arg)#

Checks if the argument exists in the callable signature

Parameters:
  • callable (Callable) – Callable to check the argument in

  • arg (str) – Argument to check if exists in the callable signature

Raises:

ValueError – If the argument does not exist in the callable signature

Return type:

None

__get_df_column_unique(df, loc)#

Gets uniques of a column in a dataframe

Parameters:
  • df (pandas.DataFrame) – Dataframe to get uniques from

  • loc (Union[int, str]) – Column to locate on the dataframe

Returns:

List of unique values in the column. Values of the returned list can be anything that is supported by pandas.DataFrame

Return type:

list

calculate_params(df, columns, group, transformer_name)[source]#

Calculates the parameters for the group transformation w.r.t. the transformer name

Parameters:
  • df (pandas.DataFrame) – Dataframe to extract columns from

  • columns (List) – List of columns to be transformed

  • group (bool) – If True, then the columns will be grouped together and the parameters will be calculated over all columns passed in

  • transformer_name (str) – Name of the transformer. It is used to determine the type of params to be passed to the transformer. E.g. if transformer_name corresponds to OneHotEncoding, then params would be unique categories.

Raises:

ValueError – If the transformer name is supported but not implemented

Returns:

Parameters for the group transformation

Return type:

dict

_check_overlap_in_transformation_columns(transformers)[source]#

Checks if there are multiple transformers on the same columns and reports them

Throws an info message if columns of different transformers overlap, i.e. at least one other transform is applied to a column that has already been transformed.

Note

This is not a bug or misbehavior since we should be able to pipe multiple transformers sequentially on the same column (e.g. add -> divide). The warning is raised when the user did not mean to do so, since the output might look acceptable yet contain wrong values, and there is no way to find out except by manual inspection. Hence, this method will make the user aware that something might be wrong.

Parameters:

transformers (List[Tuple]) – A list of tuples, where each tuple is in the form of (name, transformer, columns), where name is the name of the transformer, transformer is the transformer object and columns is the list of column names to be transformed.

Return type:

None

generate_pipeline(df, df_all=None)[source]#

Generates the list of transformers to be used by the sklearn.compose.ColumnTransformer

Note

For more info about how the transformers are created, see methods set_configs(), extract_selected_columns() and calculate_params().

Parameters:
  • df (pandas.DataFrame) – Dataframe to extract columns from. If df_all is None, then this is interpreted as the train data.

  • df_all (Optional[pandas.DataFrame]) – Dataframe to extract columns from. If df_all is not None, then this is interpreted as the entire data. For more info see set_configs().

Raises:

ValueError – If the naming convention used for the keys in the configs (see set_configs()) is not followed.

Returns:

A list of tuples, where each tuple is in the form of (name, transformer, columns), where name is the name of the transformer, transformer is the transformer object and columns is the list of column names to be transformed.

Return type:

list
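
A hedged end-to-end sketch of wiring the generated transformers into sklearn.compose.ColumnTransformer; the config path and the dataframes data_train / data_all are assumptions:

from sklearn.compose import ColumnTransformer
from niklib.models.preprocessors import ColumnTransformerConfig

config = ColumnTransformerConfig()
config.set_configs(path="configs/column_transformer.json")  # keys follow the naming convention above

# data_train / data_all are hypothetical dataframes (train split and full data)
transformers = config.generate_pipeline(df=data_train, df_all=data_all)
ct = ColumnTransformer(transformers)
x_train_transformed = ct.fit_transform(data_train)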

class niklib.models.preprocessors.ColumnSelector(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#

Bases: object

Selects columns based on regex pattern and dtype

The user can specify the dtype of columns to select and the dtype of columns to ignore. Also, the user can specify the regex patterns for including and excluding columns separately.

This is particularly useful when combined with sklearn.compose.ColumnTransformer to apply different sorts of transformers to different subsets of columns. E.g:

# select columns that contain 'Country' in their name and are of type `np.float32`
columns = preprocessors.ColumnSelector(columns_type='numeric',
                                       dtype_include=np.float32,
                                       pattern_include='.*Country.*',
                                       pattern_exclude=None,
                                       dtype_exclude=None)(df=data)
# use a transformer for selected columns
ct = preprocessors.ColumnTransformer(
    [('some_name',                   # just a name
    preprocessors.StandardScaler(),  # the transformer
    columns),                        # the columns to apply the transformer to
    ],
)

ct.fit_transform(...)

Note

If the data that is passed to the ColumnSelector is a pandas.DataFrame, then you can ignore calling the instance of this class and directly use it in the pipeline. E.g:

# select columns that contain 'Country' in their name and are of type `np.float32`
columns = preprocessors.ColumnSelector(columns_type='numeric',
                                       dtype_include=np.float32,
                                       pattern_include='.*Country.*',
                                       pattern_exclude=None,
                                       dtype_exclude=None)  # THIS LINE
# use a transformer for selected columns
ct = preprocessors.ColumnTransformer(
    [('some_name',                   # just a name
    preprocessors.StandardScaler(),  # the transformer
    columns),                        # the columns to apply the transformer to
    ],
)

ct.fit_transform(...)

See also

sklearn.compose.make_column_selector as ColumnSelector follows the same semantics.

__init__(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#

Selects columns based on regex pattern and dtype

Parameters:
  • columns_type (str) –

    Type of columns:

    1. 'string': returns the name of the columns. Useful for pandas.DataFrame

    2. 'numeric': returns the index of the columns. Useful for numpy.ndarray

  • dtype_include (type) – Type of the columns to select. For more info see pandas.DataFrame.select_dtypes().

  • pattern_include (str) – Regex pattern to match columns to include

  • dtype_exclude (type) – Type of the columns to ignore. For more info see pandas.DataFrame.select_dtypes(). Defaults to None.

  • pattern_exclude (str) – Regex pattern to match columns to exclude

__call__(df, *args, **kwds)[source]#
Parameters:

df (pandas.DataFrame) – Dataframe to extract columns from

Returns:

List of names or indices of filtered columns

Return type:

Union[List[str], List[int]]

Raises:

ValueError – If the df is not instance of pandas.DataFrame

niklib.models.preprocessors.move_dependent_variable_to_end(df, target_column)[source]#

Move the dependent variable to the end of the dataframe

This is useful for some frameworks that require the dependent variable to be the last column, and in general it is much easier to work with numpy.ndarray s when the dependent variable is the last one.

Note

This is particularly useful for us since we have multiple columns of the same type in our dataframe, and when we want to apply the same preprocessing to all members of a group of features, we can directly use the index of those features from our pandas dataframe in the converted numpy array. E.g:

df = pd.DataFrame(...)
x = df.to_numpy()
index = df.columns.get_loc(a_group_of_columns_with_the_same_logic)
x[:, index] = transform(x[:, index])
Parameters:
  • df (pandas.DataFrame) – Dataframe to convert

  • target_column (str) – Name of the target column

Returns:

Dataframe with the dependent variable at the end

Return type:

pandas.DataFrame
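
A tiny usage sketch (the dataframe below is a made-up example):

import pandas as pd
from niklib.models.preprocessors import move_dependent_variable_to_end

df = pd.DataFrame({"target": [0, 1], "age": [20, 30], "income": [10.0, 20.0]})
df = move_dependent_variable_to_end(df, target_column="target")
print(df.columns.tolist())  # target is now the last column, e.g. ['age', 'income', 'target']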

niklib.models.preprocessors.preview_column_transformer(column_transformer, original, transformed, df, random_state=Generator(PCG64) at 0x7FF565FC8820, **kwargs)[source]#

Preview transformed data next to original one obtained via ColumnTransformer

When the transformation is not sklearn.preprocessing.OneHotEncoder, the transformed data is previewed next to the original data in a pandas dataframe.

But when the transformation is sklearn.preprocessing.OneHotEncoder, this is no longer clean or informative, since one would see only 0s and 1s. So, I just skip previewing the transformed data entirely and report the following information instead:

  • The number of columns affected by transformation

  • The number of unique values in all of affected columns

  • The number of newly produced columns

Parameters:
Raises:

ValueError – If original and transformed are not of the same shape

Yields:

pandas.DataFrame – Preview dataframe for each transformer in column_transformer.transformers_. The dataframe has twice as many columns as original and transformed, i.e. df.shape == (original.shape[0], 2 * original.shape[1])

Return type:

DataFrame

niklib.models.preprocessors.get_transformed_feature_names(column_transformer, original_columns_names)[source]#

Gives feature names for transformed data via original feature names

This is super useful as the default sklearn.compose.ColumnTransformer.get_feature_names_out() uses meaningless names for features after transformation, which makes tracking the transformed features almost impossible, since it uses f0[_category], f1[_category], ..., fn[_category] as feature names. This method, for example, extracts the name of the original column A (with categories [a, b]) before transformation, finds the new columns produced by transforming that column, and names them A_a and A_b, whereas the sklearn method gives x[num0]_a and x[num0]_b.

Parameters:
  • column_transformer (sklearn.compose.ColumnTransformer) – A fitted column transformer that has .transformers_ where each is a tuple as (name, transformer, in_columns). in_columns used to detect the original index of transformed columns.

  • original_columns_names (List[str]) – List of original columns names before transformation

Returns:

A list of transformed columns names prefixed with original columns names

Return type:

List[str]
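
A short usage sketch, assuming ct is a fitted sklearn.compose.ColumnTransformer and df is the original dataframe (both are assumptions, not defined here):

from niklib.models.preprocessors import get_transformed_feature_names

# ct: a fitted ColumnTransformer; df: the original dataframe before transformation
feature_names = get_transformed_feature_names(
    column_transformer=ct,
    original_columns_names=df.columns.tolist(),
)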