niklib.models.preprocessors package#
Submodules#
niklib.models.preprocessors.core module#
Contains core functionality that is shared by all preprocessors.
- class niklib.models.preprocessors.core.TrainTestEvalSplit(stratify=None, random_state=None)[source]#
Bases:
object
Convert a pandas dataframe to numpy arrays with train, test, and eval splits
For conversion from pandas.DataFrame to numpy.ndarray, we use the same functionality as pandas.DataFrame.to_numpy(), but dependent and independent variables are separated given the target column target_column.
Note
To obtain the eval set, we use the train set as the original data to be split, i.e. the eval set is a subset of the train set. This is of course to make sure the model by no means sees the test set.
Note
args cannot be set directly and need to be provided using a JSON file. See set_configs() for more information. You can explicitly override the following args by passing them as arguments to __init__():
- random_state
- stratify
- Returns:
Order is (x_train, x_test, x_eval, y_train, y_test, y_eval)
- Return type:
Tuple[numpy.ndarray, ...]
- set_configs(path=None)[source]#
Defines and sets the config to be parsed
The keys of the configs are the attributes of this class, which are:
- test_ratio (float): Ratio of test data
- eval_ratio (float): Ratio of eval data
- shuffle (bool): Whether to shuffle the data
- stratify (numpy.ndarray, optional): If not None, this is used to stratify the data
- random_state (int, optional): Random state to use for shuffling
Note
You can explicitly override the following attributes by passing them as arguments to __init__():
- random_state
- stratify
The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.
- __call__(df, target_column, *args, **kwds)[source]#
Convert a pandas dataframe to numpy arrays with train, test, and eval splits
- Parameters:
df (pandas.DataFrame) – Dataframe to convert
target_column (str) – Name of the target column
- Returns:
Order is (x_train, x_test, x_eval, y_train, y_test, y_eval)
- Return type:
Tuple[numpy.ndarray, ...]
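A minimal usage sketch, assuming the class is re-exported as niklib.models.preprocessors.TrainTestEvalSplit and that set_configs() accepts a path such as the EXAMPLE_TRAIN_TEST_EVAL_SPLIT constant documented below; the dataframe and target column are made up for illustration:

    import pandas as pd
    from niklib.models import preprocessors

    # hypothetical, already-cleaned data with a 'label' target column
    df = pd.DataFrame({'feature_a': range(10), 'label': [0, 1] * 5})

    splitter = preprocessors.TrainTestEvalSplit(random_state=42)  # overrides the config value
    splitter.set_configs(path=preprocessors.EXAMPLE_TRAIN_TEST_EVAL_SPLIT)
    x_train, x_test, x_eval, y_train, y_test, y_eval = splitter(df=df, target_column='label')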
- class niklib.models.preprocessors.core.PandasTrainTestSplit(stratify=None, random_state=None)[source]#
Bases:
object
Split a pandas dataframe into train and test sets
Note
This class is very similar to TrainTestEvalSplit, with the difference that it is specialized for pandas dataframes. Since we apply augmentation to pandas dataframes rather than numpy arrays, this class enables us to augment only the train split and leave the test split untouched.
Note
args cannot be set directly and need to be provided using a JSON file. See set_configs() for more information. You can explicitly override the following args by passing them as arguments to __init__():
- random_state
- stratify
- Returns:
A tuple of (data_train, data_test) which contains both dependent and independent variables
- Return type:
Tuple[numpy.ndarray, ...]
- set_configs(path=None)[source]#
Defines and sets the config to be parsed
The keys of the configs are the attributes of this class, which are:
- train_ratio (float): Ratio of train data
- shuffle (bool): Whether to shuffle the data
- stratify (numpy.ndarray, optional): If not None, this is used to stratify the data
- random_state (int, optional): Random state to use for shuffling
Note
You can explicitly override the following attributes by passing them as arguments to __init__():
- random_state
- stratify
The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.
- __call__(df, target_column, *args, **kwds)[source]#
Split a pandas dataframe into train and test splits
- Parameters:
df (pandas.DataFrame) – Dataframe to convert
target_column (str) – Name of the target column
- Returns:
Order is (data_train, data_test)
- Return type:
Tuple[numpy.ndarray, ...]
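A hedged sketch mirroring the TrainTestEvalSplit example above, assuming the same hypothetical dataframe df with a 'label' target column and the EXAMPLE_PANDAS_TRAIN_TEST_SPLIT config constant documented below:

    from niklib.models import preprocessors

    splitter = preprocessors.PandasTrainTestSplit(random_state=42)
    splitter.set_configs(path=preprocessors.EXAMPLE_PANDAS_TRAIN_TEST_SPLIT)
    data_train, data_test = splitter(df=df, target_column='label')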
- class niklib.models.preprocessors.core.ColumnSelector(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#
Bases:
object
Selects columns based on regex pattern and dtype
User can specify the dtype of columns to select, and the dtype of columns to ignore. Also, user can specify the regex pattern for including and excluding columns, separately.
This is particularly useful when combined with sklearn.compose.ColumnTransformer to apply different sorts of transformers to different subsets of columns. E.g.:

    # select columns that contain 'Country' in their name and are of type `np.float32`
    columns = preprocessors.ColumnSelector(columns_type='numeric',
                                           dtype_include=np.float32,
                                           pattern_include='.*Country.*',
                                           pattern_exclude=None,
                                           dtype_exclude=None)(df=data)
    # use a transformer for selected columns
    ct = preprocessors.ColumnTransformer(
        [('some_name',                        # just a name
          preprocessors.StandardScaler(),     # the transformer
          columns),                           # the columns to apply the transformer to
         ],
    )
    ct.fit_transform(...)
Note
If the data that is passed to the ColumnSelector is a pandas.DataFrame, then you can skip calling the instance of this class and directly use it in the pipeline. E.g.:

    # select columns that contain 'Country' in their name and are of type `np.float32`
    columns = preprocessors.ColumnSelector(columns_type='numeric',
                                           dtype_include=np.float32,
                                           pattern_include='.*Country.*',
                                           pattern_exclude=None,
                                           dtype_exclude=None)  # THIS LINE
    # use a transformer for selected columns
    ct = preprocessors.ColumnTransformer(
        [('some_name',                        # just a name
          preprocessors.StandardScaler(),     # the transformer
          columns),                           # the columns to apply the transformer to
         ],
    )
    ct.fit_transform(...)
See also
sklearn.compose.make_column_selector, as ColumnSelector follows the same semantics.
- __init__(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#
Selects columns based on regex pattern and dtype
- Parameters:
columns_type (str) –
Type of columns:
'string': returns the names of the columns. Useful for pandas.DataFrame
'numeric': returns the indices of the columns. Useful for numpy.ndarray
dtype_include (type) – Type of the columns to select. For more info see pandas.DataFrame.select_dtypes().
pattern_include (str) – Regex pattern to match columns to include
dtype_exclude (type) – Type of the columns to ignore. For more info see pandas.DataFrame.select_dtypes(). Defaults to None.
pattern_exclude (str) – Regex pattern to match columns to exclude
- __call__(df, *args, **kwds)[source]#
- Parameters:
df (pandas.DataFrame) – Dataframe to extract columns from
- Returns:
List of names or indices of filtered columns
- Return type:
- Raises:
ValueError – If df is not an instance of pandas.DataFrame
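For illustration, a minimal sketch of calling the selector directly; the dataframe and column names below are hypothetical:

    import numpy as np
    import pandas as pd
    from niklib.models import preprocessors

    df = pd.DataFrame({
        'HomeCountry': np.array([1.0, 2.0], dtype=np.float32),
        'Age': np.array([30, 40], dtype=np.int64),
    })

    selector = preprocessors.ColumnSelector(
        columns_type='string',      # 'string' -> column names, 'numeric' -> column indices
        dtype_include=np.float32,
        pattern_include='.*Country.*',
    )
    print(selector(df=df))  # expected to be ['HomeCountry'] under these assumptions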
- class niklib.models.preprocessors.core.ColumnTransformerConfig[source]#
Bases:
object
A helper class that parses configs for using the sklearn.compose.ColumnTransformer
The purpose of this class is to create the list of transformers to be used by the sklearn.compose.ColumnTransformer. Hence, one needs to define the configs by using the set_configs() method, then use the generate_pipeline() method to create the list of transformers.
In the end, this class returns a list of tuples, where each tuple is of the form (name, transformer, columns).
- set_configs(path=None)[source]#
Defines and sets the config to be parsed
The keys of the configs are the names of the transformers. They must include the API name of one of the available transforms at the end:
- sklearn transformers: any class that could be used for transformation and is importable as sklearn.preprocessing.API_NAME
- custom transformers: any class that is not a sklearn transformer and is importable as niklib.models.preprocessors.API_NAME
This naming convention is used to create proper transformers for each type of data. E.g. in JSON format:

    "age_StandardScaler": {
        "columns_type": "'numeric'",
        "dtype_include": "np.float32",
        "pattern_include": "'age'",
        "pattern_exclude": "None",
        "dtype_exclude": "None",
        "group": "False",
        "use_global": "False"
    }

    "sex_OneHotEncoder": {
        "columns_type": "'numeric'",
        "dtype_include": "'category'",
        "pattern_include": "'VisaResult'",
        "pattern_exclude": "None",
        "dtype_exclude": "None",
        "group": "True",
        "use_global": "True"
    }
The values of the configs are the columns to be transformed. The columns can be obtained by using niklib.models.preprocessors.core.ColumnSelector, which requires the user to pass certain parameters. These parameters can be set manually or extracted from JSON config files by providing the path to the JSON file.
The group key is used to determine whether the transformer should be applied considering a group of columns or not. If group is True, then the required values for transformation are obtained from all columns rather than handling each column separately. For instance, one can use OneHotEncoding on a set of columns where, if group is True, all unique categories of all of those columns are extracted and then transformed. If group is False, then each column will be transformed based on its own unique categories independently. (group cannot be passed to ColumnSelector.)
The use_global key is used to determine whether the transformer should be applied considering all data or only train data (since fitting transformations such as normalization needs to be done on train data only). If use_global is True, then the transformer will be applied on all data. This is particularly useful for one-hot encoding categorical features where some categories might be rare and might only exist in the test and eval data.
- Parameters:
path (Union[str, Path]) – path to the JSON file containing the configs
- Returns:
A dictionary where keys are string names and values are tuples of a niklib.models.preprocessors.core.ColumnSelector instance and a boolean control variable which will be passed to generate_pipeline().
- Return type:
- static extract_selected_columns(selector, df)[source]#
Extracts the columns from the dataframe based on the selector
Note
This method is simply a wrapper around niklib.models.preprocessors.core.ColumnSelector that makes the call given a dataframe, i.e.:

    # assuming same configs
    selector = preprocessors.ColumnSelector(...)
    A = ColumnTransformerConfig.extract_selected_columns(selector=selector, df=df)
    B = selector(df)
    A == B  # True

Also, this is a static method.
- Parameters:
selector (niklib.models.preprocessors.core.ColumnSelector) – Initialized selector object
df (pandas.DataFrame) – Dataframe to extract columns from
- Returns:
List of columns to be transformed
- Return type:
- __check_arg_exists(callable, arg)#
Checks if the argument exists in the callable signature
- Parameters:
callable (Callable) – Callable to check the argument in
arg (str) – Argument to check if exists in the callable signature
- Raises:
ValueError – If the argument does not exist in the callable signature
- Return type:
- __get_df_column_unique(df, loc)#
Gets uniques of a column in a dataframe
- Parameters:
df (pandas.DataFrame) – Dataframe to get uniques from
- Returns:
List of unique values in the column. Values of the returned list can be anything that is supported by pandas.DataFrame
- Return type:
- calculate_params(df, columns, group, transformer_name)[source]#
Calculates the parameters for the group transformation w.r.t. the transformer name
- Parameters:
df (pandas.DataFrame) – Dataframe to extract columns from
columns (List) – List of columns to be transformed
group (bool) – If True, then the columns will be grouped together and the parameters will be calculated over all columns passed in
transformer_name (str) – Name of the transformer. It is used to determine the type of params to be passed to the transformer. E.g. if transformer_name corresponds to OneHotEncoding, then params would be unique categories.
- Raises:
ValueError – If the transformer name is not implemented but supported
- Returns:
Parameters for the group transformation
- Return type:
- _check_overlap_in_transformation_columns(transformers)[source]#
Checks if there are multiple transformers on the same columns and reports them
Throws an info message if columns of different transformers overlap, i.e. at least one other transform is happening on a column that has already been transformed.
Note
This is not a bug or misbehavior, since we should be able to pipe multiple transformers sequentially on the same column (e.g. add -> divide). The warning is thrown in case the user did not mean to do so, since the output might look acceptable yet contain wrong values, and there is no way to find out except manual inspection. Hence, this method makes the user aware that something might be wrong.
- Parameters:
transformers (List[Tuple]) – A list of tuples, where each tuple is of the form (name, transformer, columns), where name is the name of the transformer, transformer is the transformer object, and columns is the list of column names to be transformed.
- Return type:
- generate_pipeline(df, df_all=None)[source]#
Generates the list of transformers to be used by the sklearn.compose.ColumnTransformer
Note
For more info about how the transformers are created, see the methods set_configs(), extract_selected_columns() and calculate_params().
- Parameters:
df (pandas.DataFrame) – Dataframe to extract columns from. If df_all is None, then this is interpreted as train data.
df_all (Optional[pandas.DataFrame]) – Dataframe to extract columns from. If df_all is not None, then this is interpreted as the entire data. For more info see set_configs().
- Raises:
ValueError – If the naming convention used for the keys in the configs (see set_configs()) is not followed.
- Returns:
A list of tuples, where each tuple is of the form (name, transformer, columns), where name is the name of the transformer, transformer is the transformer object, and columns is the list of column names to be transformed.
- Return type:
- niklib.models.preprocessors.core.get_transformed_feature_names(column_transformer, original_columns_names)[source]#
Gives feature names for transformed data via original feature names
This is super useful as the default sklearn.compose.ColumnTransformer.get_feature_names_out() uses meaningless names for features after transformation, which makes tracking the transformed features almost impossible, as it uses f0[_category], f1[_category], ..., fn[_category] as feature names. This method, for example, extracts the name of the original column A (with categories [a, b]) before transformation, finds the new columns produced by transforming that column, and names them A_a and A_b, whereas the sklearn method gives x[num0]_a and x[num0]_b.
- Parameters:
column_transformer (sklearn.compose.ColumnTransformer) – A fitted column transformer that has .transformers_, where each entry is a tuple of (name, transformer, in_columns). in_columns is used to detect the original index of the transformed columns.
original_columns_names (List[str]) – List of original column names before transformation
- Returns:
A list of transformed column names prefixed with the original column names
- Return type:
List[str]
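A hedged sketch of the intended call pattern; ct is assumed to be a fitted sklearn.compose.ColumnTransformer and df_train the dataframe it was fitted on:

    from niklib.models.preprocessors.core import get_transformed_feature_names

    feature_names = get_transformed_feature_names(
        column_transformer=ct,
        original_columns_names=list(df_train.columns),
    )
    print(feature_names)  # e.g. ['age', 'VisaResult_acc', 'VisaResult_rej', ...] (illustrative)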
- niklib.models.preprocessors.core.move_dependent_variable_to_end(df, target_column)[source]#
Move the dependent variable to the end of the dataframe
This is useful for some frameworks that require the dependent variable to be the last one, and in general it is much easier to work with numpy.ndarrays when the dependent variable is the last column.
Note
This is particularly useful for us since we have multiple columns of the same type in our dataframe, and when we want to apply the same preprocessing to all members of a group of features, we can directly use the index of those features from our pandas dataframe in the converted numpy array. E.g.:

    df = pd.DataFrame(...)
    x = df.to_numpy()
    index = df.columns.get_loc(a_group_of_columns_with_the_same_logic)
    x[:, index] = transform(x[:, index])

- Parameters:
df (pandas.DataFrame) – Dataframe to convert
target_column (str) – Name of the target column
- Returns:
Dataframe with the dependent variable at the end
- Return type:
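A minimal illustrative example; the column names are made up:

    import pandas as pd
    from niklib.models.preprocessors.core import move_dependent_variable_to_end

    df = pd.DataFrame({'label': [0, 1], 'age': [30, 40], 'country': ['a', 'b']})
    df = move_dependent_variable_to_end(df=df, target_column='label')
    print(list(df.columns))  # expected (under these assumptions): ['age', 'country', 'label']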
niklib.models.preprocessors.helpers module#
- niklib.models.preprocessors.helpers.preview_column_transformer(column_transformer, original, transformed, df, random_state=Generator(PCG64) at 0x7FF565FC8820, **kwargs)[source]#
Preview transformed data next to the original one obtained via ColumnTransformer
When the transformation is not sklearn.preprocessing.OneHotEncoder, the transformed data is previewed next to the original data in a pandas dataframe.
But when the transformation is sklearn.preprocessing.OneHotEncoder, this is no longer clean or informative, since one sees only 0s and 1s. So previewing the transformed data is skipped entirely and the following information is reported instead:
- The number of columns affected by the transformation
- The number of unique values in all of the affected columns
- The number of newly produced columns
- Parameters:
column_transformer (ColumnTransformer) – An instance of sklearn.compose.ColumnTransformer
original (numpy.ndarray) – Original data as a numpy.ndarray. Same shape as transformed
transformed (numpy.ndarray) – Transformed data as a numpy.ndarray. Same shape as original
df (pandas.DataFrame) – A dataframe that hosts the original and transformed data. Used to extract column names and unique values for logging information about the transformations done
random_state (Union[int, numpy.random.Generator], optional) – A seed value or an instance of numpy.random.Generator for sampling. Defaults to numpy.random.default_rng().
**kwargs – Additional arguments as follows:
- n_samples (int): Number of samples to draw. Defaults to 1.
- Raises:
ValueError – If original and transformed are not of the same shape
- Yields:
pandas.DataFrame – Preview dataframe for each transformer in column_transformer.transformers_. The dataframe has twice as many columns as original and transformed, i.e. df.shape == (original.shape[0], 2 * original.shape[1])
- Return type:
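A hedged sketch of previewing the transformations; ct, x, x_t, and df are assumed to come from earlier steps (a fitted ColumnTransformer, the original array, the transformed array, and the hosting dataframe, respectively):

    from niklib.models.preprocessors.helpers import preview_column_transformer

    for preview in preview_column_transformer(
        column_transformer=ct,
        original=x,
        transformed=x_t,
        df=df,
        n_samples=2,
    ):
        print(preview)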
Module contents#
Contains preprocessing methods for preparing data solely for estimators in niklib.models.estimators
These preprocessors expect "already cleaned" data acquired by niklib.data and prepare it solely for the machine learning models of the desired frameworks (say, changing dtypes or one-hot encoding that is only useful for torch or sklearn).
The following modules are available:
- niklib.models.preprocessors.core: contains implementations that could be shared between all other preprocessor modules defined here
- niklib.models.preprocessors.pytorch: contains implementations to be used solely for PyTorch preprocessing purposes, e.g. https://pytorch.org/docs/stable/data.html
- niklib.models.preprocessors.sklearn: contains implementations to be used solely for Scikit-Learn preprocessing purposes, e.g. https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
- niklib.models.preprocessors.EXAMPLE_COLUMN_TRANSFORMER_CONFIG_X = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_column_transformer_config_x.json')#
Configs for transforming features data for Example
For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.ColumnTransformerConfig.
- niklib.models.preprocessors.EXAMPLE_COLUMN_TRANSFORMER_CONFIG_Y = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_column_transformer_config_y.json')#
Configs for transforming target data for Example
For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.ColumnTransformerConfig.
- niklib.models.preprocessors.EXAMPLE_TRAIN_TEST_EVAL_SPLIT = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_train_test_eval_split.json')#
Configs for splitting dataframe into numpy ndarray of train, test, eval for Example
For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.TrainTestEvalSplit.
- niklib.models.preprocessors.EXAMPLE_PANDAS_TRAIN_TEST_SPLIT = PosixPath('/home/runner/work/niklib/niklib/niklib/models/preprocessors/data/example_pandas_train_test_split.json')#
Configs for splitting dataframe train and test for Example
For information about how to use it and what fields are expected, see niklib.models.preprocessors.core.PandasTrainTestSplit.
- niklib.models.preprocessors.TRANSFORMS = {'LabelBinarizer': <class 'sklearn.preprocessing._label.LabelBinarizer'>, 'LabelEncoder': <class 'sklearn.preprocessing._label.LabelEncoder'>, 'MaxAbsScaler': <class 'sklearn.preprocessing._data.MaxAbsScaler'>, 'MinMaxScaler': <class 'sklearn.preprocessing._data.MinMaxScaler'>, 'MultiLabelBinarizer': <class 'sklearn.preprocessing._label.MultiLabelBinarizer'>, 'OneHotEncoder': <class 'sklearn.preprocessing._encoders.OneHotEncoder'>, 'RobustScaler': <class 'sklearn.preprocessing._data.RobustScaler'>, 'StandardScaler': <class 'sklearn.preprocessing._data.StandardScaler'>}#
A dictionary of transforms and their names, used to verify configs in niklib.models.preprocessors.core.ColumnTransformerConfig
This is used to verify that the configs are correct and that the transformers are available.
Note
All transforms, whether from a third-party library or our own, must be included in this dictionary to be usable by niklib.models.preprocessors.core.ColumnTransformerConfig.
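For illustration, a sketch of resolving a transformer class from a config key; the key name here is hypothetical and follows the API-name-suffix convention described in set_configs():

    from niklib.models.preprocessors import TRANSFORMS

    config_key = 'age_StandardScaler'      # hypothetical config key
    api_name = config_key.split('_')[-1]   # -> 'StandardScaler'
    scaler = TRANSFORMS[api_name]()        # instantiates sklearn.preprocessing.StandardScaler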
- class niklib.models.preprocessors.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)[source]#
Bases: TransformerMixin, _BaseComposition
Applies transformers to columns of an array or pandas DataFrame.
This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
Read more in the User Guide.
New in version 0.20.
- Parameters:
transformers (list of tuples) –
List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.
- name : str
Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.
- transformer : {'drop', 'passthrough'} or estimator
Estimator must support fit and transform. Special-cased strings 'drop' and 'passthrough' are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.
- columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
remainder ({'drop', 'passthrough'} or estimator, default='drop') – By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers, but are present in the data passed to fit, will be automatically passed through. This subset of columns is concatenated with the output of the transformers. For dataframes, extra columns not seen during fit will be excluded from the output of transform. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.
sparse_threshold (float, default=0.3) – If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
transformer_weights (dict, default=None) – Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.
verbose (bool, default=False) – If True, the time elapsed while fitting each transformer will be printed as it is completed.
verbose_feature_names_out (bool, default=True) – If True, get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, get_feature_names_out() will not prefix any feature names and will error if feature names are not unique.
New in version 1.0.
- Variables:
transformers (list) – The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, 'drop', or 'passthrough'. In case there were no columns selected, this will be the unfitted transformer. If there are remaining columns, the final element is a tuple of the form ('remainder', transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).
named_transformers (Bunch) – Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
sparse_output (bool) – Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.
output_indices (dict) – A dictionary from each transformer name to a slice, where the slice corresponds to indices in the transformed output. This is useful to inspect which transformer is responsible for which transformed feature(s).
New in version 1.0.
n_features_in (int) – Number of features seen during fit. Only defined if the underlying transformers expose such an attribute when fit.
New in version 0.24.
See also
make_column_transformer
Convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.
make_column_selector
Convenience function for selecting columns based on datatype or the columns name with a regex pattern.
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Examples
    >>> import numpy as np
    >>> from sklearn.compose import ColumnTransformer
    >>> from sklearn.preprocessing import Normalizer
    >>> ct = ColumnTransformer(
    ...     [("norm1", Normalizer(norm='l1'), [0, 1]),
    ...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
    >>> X = np.array([[0., 1., 2., 2.],
    ...               [1., 1., 0., 1.]])
    >>> # Normalizer scales each row of X to unit norm. A separate scaling
    >>> # is applied for the two first and two last elements of each
    >>> # row independently.
    >>> ct.fit_transform(X)
    array([[0. , 1. , 0.5, 0.5],
           [0.5, 0.5, 0. , 1. ]])
ColumnTransformer can be configured with a transformer that requires a 1d array by setting the column to a string:

    >>> from sklearn.feature_extraction import FeatureHasher
    >>> from sklearn.preprocessing import MinMaxScaler
    >>> import pandas as pd
    >>> X = pd.DataFrame({
    ...     "documents": ["First item", "second one here", "Is this the last?"],
    ...     "width": [3, 4, 5],
    ... })
    >>> # "documents" is a string which configures ColumnTransformer to
    >>> # pass the documents column as a 1d array to the FeatureHasher
    >>> ct = ColumnTransformer(
    ...     [("text_preprocess", FeatureHasher(input_type="string"), "documents"),
    ...      ("num_preprocess", MinMaxScaler(), ["width"])])
    >>> X_trans = ct.fit_transform(X)
- _required_parameters = ['transformers']#
-
_parameter_constraints:
dict
= {'n_jobs': [<class 'numbers.Integral'>, None], 'remainder': [<sklearn.utils._param_validation.StrOptions object>, <sklearn.utils._param_validation.HasMethods object>, <sklearn.utils._param_validation.HasMethods object>], 'sparse_threshold': [<sklearn.utils._param_validation.Interval object>], 'transformer_weights': [<class 'dict'>, None], 'transformers': [<class 'list'>, <sklearn.utils._param_validation.Hidden object>], 'verbose': ['verbose'], 'verbose_feature_names_out': ['boolean']}#
- __init__(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)[source]#
- property _transformers#
Internal list of transformer only containing the name and transformers, dropping the columns. This is for the implementation of get_params via BaseComposition._get_params which expects lists of tuples of len 2.
- set_output(*, transform=None)[source]#
Set the output container when “transform” and “fit_transform” are called.
Calling set_output will set the output of all estimators in transformers and transformers_.
- Parameters:
transform ({"default", "pandas"}, default=None) –
Configure output of transform and fit_transform.
”default”: Default output format of a transformer
”pandas”: DataFrame output
None: Transform configuration is unchanged
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- get_params(deep=True)[source]#
Get parameters for this estimator.
Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.
- set_params(**kwargs)[source]#
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.
- Parameters:
**kwargs (dict) – Estimator parameters.
- Returns:
self – This estimator.
- Return type:
- _iter(fitted=False, replace_strings=False, column_as_strings=False)[source]#
Generate (name, trans, column, weight) tuples.
If fitted=True, use the fitted transformers, else use the user specified transformers updated with converted column names and potentially appended with transformer for remainder.
- _validate_remainder(X)[source]#
Validates remainder and defines _remainder targeting the remaining columns.
- property named_transformers_#
Access the fitted transformer by name.
Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
- _get_feature_name_out_for_transformer(name, trans, column, feature_names_in)[source]#
Gets feature names of transformer.
Used in conjunction with self._iter(fitted=True) in get_feature_names_out.
- get_feature_names_out(input_features=None)[source]#
Get output feature names for transformation.
- Parameters:
input_features (array-like of str or None, default=None) –
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns:
feature_names_out – Transformed feature names.
- Return type:
ndarray of str objects
- _add_prefix_for_feature_names_out(transformer_with_feature_names_out)[source]#
Add prefix for feature names out that includes the transformer names.
- _validate_output(result)[source]#
Ensure that the output of each transformer is 2D. Otherwise hstack can raise an error or produce incorrect results.
- _fit_transform(X, y, func, fitted=False, column_as_strings=False)[source]#
Private function to fit and/or transform on demand.
Return value (transformers and/or transformed X data) depends on the passed function. fitted=True ensures the fitted transformers are used.
- fit(X, y=None)[source]#
Fit all transformers using X.
- Parameters:
X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.
y (array-like of shape (n_samples,...), default=None) – Targets for supervised learning.
- Returns:
self – This estimator.
- Return type:
- fit_transform(X, y=None)[source]#
Fit all transformers, transform the data and concatenate results.
- Parameters:
X ({array-like, dataframe} of shape (n_samples, n_features)) – Input data, of which specified subsets are used to fit the transformers.
y (array-like of shape (n_samples,), default=None) – Targets for supervised learning.
- Returns:
X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
- Return type:
{array-like, sparse matrix} of shape (n_samples, sum_n_components)
- transform(X)[source]#
Transform X separately by each transformer, concatenate results.
- Parameters:
X ({array-like, dataframe} of shape (n_samples, n_features)) – The data to be transformed by subset.
- Returns:
X_t – Horizontally stacked results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
- Return type:
{array-like, sparse matrix} of shape (n_samples, sum_n_components)
- _hstack(Xs)[source]#
Stacks Xs horizontally.
This allows subclasses to control the stacking behavior, while reusing everything else from ColumnTransformer.
- Parameters:
Xs (list of {array-like, sparse matrix, dataframe}) –
- _abc_impl = <_abc._abc_data object>#
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None, feature_name_combiner='concat')[source]#
Bases:
_BaseEncoder
Encode categorical features as a one-hot numeric array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).
By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
Read more in the User Guide.
- Parameters:
categories ('auto' or a list of array-like, default='auto') –
Categories (unique values) per feature:
’auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
New in version 0.20.
drop ({'first', 'if_binary'} or an array-like of shape (n_features,), default=None) –
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model.
However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
None : retain all features (the default).
’first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
’if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
array : drop[i] is the category in feature X[:, i] that should be dropped.
When max_categories or min_frequency is configured to group infrequent categories, the dropping behavior is handled after the grouping.
New in version 0.21: The parameter drop was added in 0.21.
Changed in version 0.23: The option drop=’if_binary’ was added in 0.23.
Changed in version 1.1: Support for dropping infrequent categories.
sparse (bool, default=True) –
Will return sparse matrix if set True else will return an array.
Deprecated since version 1.2: sparse is deprecated in 1.2 and will be removed in 1.4. Use sparse_output instead.
sparse_output (bool, default=True) –
Will return sparse matrix if set True else will return an array.
New in version 1.2: sparse was renamed to sparse_output
dtype (number type, default=float) – Desired dtype of output.
handle_unknown ({'error', 'ignore', 'infrequent_if_exist'}, default='error') –
Specifies the way unknown categories are handled during transform().
'error' : Raise an error if an unknown category is present during transform.
'ignore' : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
'infrequent_if_exist' : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted 'infrequent' if it exists. If the 'infrequent' category does not exist, then transform() and inverse_transform() will handle an unknown category as with handle_unknown='ignore'. Infrequent categories exist based on min_frequency and max_categories. Read more in the User Guide.
Changed in version 1.1: ‘infrequent_if_exist’ was added to automatically handle unknown categories and infrequent categories.
min_frequency (int or float, default=None) –
Specifies the minimum frequency below which a category will be considered infrequent.
If int, categories with a smaller cardinality will be considered infrequent.
If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.
New in version 1.1: Read more in the User Guide.
max_categories (int, default=None) –
Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.
New in version 1.1: Read more in the User Guide.
feature_name_combiner ("concat" or callable, default="concat") –
Callable with signature def callable(input_feature, category) that returns a string. This is used to create feature names to be returned by get_feature_names_out(). "concat" concatenates encoded feature name and category with feature + "_" + str(category). E.g. feature X with values 1, 6, 7 creates feature names X_1, X_6, X_7.
New in version 1.3.
- Variables:
categories (list of arrays) – The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).
drop_idx (array of shape (n_features,)) – drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn't binary. drop_idx_ = None if all the transformed features will be retained.
If infrequent categories are enabled by setting min_frequency or max_categories to a non-default value and drop_idx[i] corresponds to a infrequent category, then the entire infrequent category is dropped.
Changed in version 0.23: Added the possibility to contain None values.
infrequent_categories (list of ndarray) –
Defined only if infrequent categories are enabled by setting min_frequency or max_categories to a non-default value. infrequent_categories_[i] are the infrequent categories for feature i. If the feature i has no infrequent categories infrequent_categories_[i] is None.
New in version 1.1.
n_features_in (int) –
Number of features seen during fit.
New in version 1.0.
feature_names_in (ndarray of shape (n_features_in_,)) –
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
feature_name_combiner (callable or None) –
Callable with signature def callable(input_feature, category) that returns a string. This is used to create feature names to be returned by
get_feature_names_out()
.New in version 1.3.
See also
OrdinalEncoder
Performs an ordinal (integer) encoding of the categorical features.
TargetEncoder
Encodes categorical features using the target.
sklearn.feature_extraction.DictVectorizer
Performs a one-hot encoding of dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher
Performs an approximate one-hot encoding of dictionary items or strings.
LabelBinarizer
Binarizes labels in a one-vs-all fashion.
MultiLabelBinarizer
Transforms between iterable of iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.
>>> from sklearn.preprocessing import OneHotEncoder
One can discard categories not seen during fit:
    >>> enc = OneHotEncoder(handle_unknown='ignore')
    >>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
    >>> enc.fit(X)
    OneHotEncoder(handle_unknown='ignore')
    >>> enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
    array([[1., 0., 1., 0., 0.],
           [0., 1., 0., 0., 0.]])
    >>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
    array([['Male', 1],
           [None, 2]], dtype=object)
    >>> enc.get_feature_names_out(['gender', 'group'])
    array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
One can always drop the first column for each feature:
    >>> drop_enc = OneHotEncoder(drop='first').fit(X)
    >>> drop_enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
    array([[0., 0., 0.],
           [1., 1., 0.]])
Or drop a column for feature only having 2 categories:
    >>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
    >>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
    array([[0., 1., 0., 0.],
           [1., 0., 1., 0.]])
One can change the way feature names are created.
    >>> def custom_combiner(feature, category):
    ...     return str(feature) + "_" + type(category).__name__ + "_" + str(category)
    >>> custom_fnames_enc = OneHotEncoder(feature_name_combiner=custom_combiner).fit(X)
    >>> custom_fnames_enc.get_feature_names_out()
    array(['x0_str_Female', 'x0_str_Male', 'x1_int_1', 'x1_int_2', 'x1_int_3'],
          dtype=object)
Infrequent categories are enabled by setting max_categories or min_frequency.
    >>> import numpy as np
    >>> X = np.array([["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3], dtype=object).T
    >>> ohe = OneHotEncoder(max_categories=3, sparse_output=False).fit(X)
    >>> ohe.infrequent_categories_
    [array(['a', 'd'], dtype=object)]
    >>> ohe.transform([["a"], ["b"]])
    array([[0., 0., 1.],
           [1., 0., 0.]])
-
_parameter_constraints:
dict
= {'categories': [<sklearn.utils._param_validation.StrOptions object>, <class 'list'>], 'drop': [<sklearn.utils._param_validation.StrOptions object>, 'array-like', None], 'dtype': 'no_validation', 'feature_name_combiner': [<sklearn.utils._param_validation.StrOptions object>, <built-in function callable>], 'handle_unknown': [<sklearn.utils._param_validation.StrOptions object>], 'max_categories': [<sklearn.utils._param_validation.Interval object>, None], 'min_frequency': [<sklearn.utils._param_validation.Interval object>, <sklearn.utils._param_validation.Interval object>, None], 'sparse': [<sklearn.utils._param_validation.Hidden object>, 'boolean'], 'sparse_output': ['boolean']}#
- __init__(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None, feature_name_combiner='concat')[source]#
- _map_drop_idx_to_infrequent(feature_idx, drop_idx)[source]#
Convert drop_idx into the index for infrequent categories.
If there are no infrequent categories, then drop_idx is returned. This method is called in _set_drop_idx when the drop parameter is an array-like.
- _set_drop_idx()[source]#
Compute the drop indices associated with self.categories_.
If self.drop is:
- None, no categories have been dropped.
- 'first', all zeros to drop the first category.
- 'if_binary', all zeros if the category is binary and None otherwise.
- array-like, the indices of the categories that match the categories in self.drop. If the dropped category is an infrequent category, then the index for the infrequent category is used. This means that the entire infrequent category is dropped.
This method defines a public drop_idx_ and a private _drop_idx_after_grouping.
drop_idx_: Public facing API that references the drop category in self.categories_.
_drop_idx_after_grouping: Used internally to drop categories after the infrequent categories are grouped together.
If there are no infrequent categories or drop is None, then drop_idx_=_drop_idx_after_grouping.
- _compute_transformed_categories(i, remove_dropped=True)[source]#
Compute the transformed categories used for column i.
1. If there are infrequent categories, the category is named 'infrequent_sklearn'.
2. Dropped columns are removed when remove_dropped=True.
- fit(X, y=None)[source]#
Fit OneHotEncoder to X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (None) – Ignored. This parameter exists only for compatibility with
Pipeline
.
- Returns:
Fitted encoder.
- Return type:
self
- transform(X)[source]#
Transform X using one-hot encoding.
If there are infrequent categories for a feature, the infrequent categories will be grouped into a single category.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data to encode.
- Returns:
X_out – Transformed input. If sparse_output=True, a sparse matrix will be returned.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)
- inverse_transform(X)[source]#
Convert the data back to the original representation.
When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.
For a given input feature, if there is an infrequent category, 'infrequent_sklearn' will be used to represent the infrequent category.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.
- Returns:
X_tr – Inverse transformed array.
- Return type:
ndarray of shape (n_samples, n_features)
- get_feature_names_out(input_features=None)[source]#
Get output feature names for transformation.
- Parameters:
input_features (array-like of str or None, default=None) –
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns:
feature_names_out – Transformed feature names.
- Return type:
ndarray of str objects
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.LabelEncoder[source]#
Bases: TransformerMixin, BaseEstimator
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. y, and not the input X.
Read more in the User Guide.
New in version 0.12.
- Variables:
classes (ndarray of shape (n_classes,)) – Holds the label for each class.
See also
OrdinalEncoder
Encode categorical features using an ordinal encoding scheme.
OneHotEncoder
Encode categorical features as a one-hot numeric array.
Examples
LabelEncoder can be used to normalize labels.
    >>> from sklearn import preprocessing
    >>> le = preprocessing.LabelEncoder()
    >>> le.fit([1, 2, 2, 6])
    LabelEncoder()
    >>> le.classes_
    array([1, 2, 6])
    >>> le.transform([1, 1, 2, 6])
    array([0, 0, 1, 2]...)
    >>> le.inverse_transform([0, 0, 1, 2])
    array([1, 1, 2, 6])
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
    >>> le = preprocessing.LabelEncoder()
    >>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
    LabelEncoder()
    >>> list(le.classes_)
    ['amsterdam', 'paris', 'tokyo']
    >>> le.transform(["tokyo", "tokyo", "paris"])
    array([2, 2, 1]...)
    >>> list(le.inverse_transform([2, 2, 1]))
    ['tokyo', 'tokyo', 'paris']
- fit(y)[source]#
Fit label encoder.
- Parameters:
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self – Fitted label encoder.
- Return type:
returns an instance of self.
- fit_transform(y)[source]#
Fit label encoder and return encoded labels.
- Parameters:
y (array-like of shape (n_samples,)) – Target values.
- Returns:
y – Encoded labels.
- Return type:
array-like of shape (n_samples,)
- transform(y)[source]#
Transform labels to normalized encoding.
- Parameters:
y (array-like of shape (n_samples,)) – Target values.
- Returns:
y – Labels as normalized encodings.
- Return type:
array-like of shape (n_samples,)
- inverse_transform(y)[source]#
Transform labels back to original encoding.
- Parameters:
y (ndarray of shape (n_samples,)) – Target values.
- Returns:
y – Original encoding.
- Return type:
ndarray of shape (n_samples,)
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.StandardScaler(*, copy=True, with_mean=True, with_std=True)[source]#
Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
Read more in the User Guide.
- Parameters:
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
- Variables:
scale (ndarray of shape (n_features,) or None) –
Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False.
New in version 0.17: scale_
mean (ndarray of shape (n_features,) or None) – The mean value for each feature in the training set. Equal to None when with_mean=False.
var (ndarray of shape (n_features,) or None) – The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
n_features_in (int) – Number of features seen during fit.
New in version 0.24.
feature_names_in (ndarray of shape (n_features_in_,)) – Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
n_samples_seen (int or ndarray of shape (n_features,)) – The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array of dtype int. If sample_weights are used it will be a float (if no missing data) or an array of dtype float that sums the weights seen so far. Will be reset on new calls to fit, but increments across partial_fit calls.
See also
scale
Equivalent function without the estimator API.
PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
    >>> from sklearn.preprocessing import StandardScaler
    >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
    >>> scaler = StandardScaler()
    >>> print(scaler.fit(data))
    StandardScaler()
    >>> print(scaler.mean_)
    [0.5 0.5]
    >>> print(scaler.transform(data))
    [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
    >>> print(scaler.transform([[2, 2]]))
    [[3. 3.]]
-
_parameter_constraints:
dict
= {'copy': ['boolean'], 'with_mean': ['boolean'], 'with_std': ['boolean']}#
- _reset()[source]#
Reset internal data-dependent state of the scaler, if necessary.
__init__ parameters are not touched.
- fit(X, y=None, sample_weight=None)[source]#
Compute the mean and std to be used for later scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.
New in version 0.24: parameter sample_weight support to StandardScaler.
- Returns:
self – Fitted scaler.
- Return type:
- partial_fit(X, y=None, sample_weight=None)[source]#
Online computation of mean and std on X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
The algorithm for incremental mean and std is given in Equation 1.5a,b of Chan, Tony F., Gene H. Golub, and Randall J. LeVeque, "Algorithms for computing the sample variance: Analysis and recommendations." The American Statistician 37.3 (1983): 242-247.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.
New in version 0.24: parameter sample_weight support to StandardScaler.
- Returns:
self – Fitted scaler.
- Return type:
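A minimal sketch of the incremental fitting described above, assuming the data arrives as a list of NumPy batches (the batches here are made up):
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
batches = [rng.normal(size=(100, 5)) for _ in range(10)]

scaler = StandardScaler()
for batch in batches:
    scaler.partial_fit(batch)      # mean/var statistics are updated per batch

print(scaler.n_samples_seen_)      # 1000
X_scaled = scaler.transform(batches[0])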
- transform(X, copy=None)[source]#
Perform standardization by centering and scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
- inverse_transform(X, copy=None)[source]#
Scale back the data to the original representation.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
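A small sketch of the scaling round trip (transform followed by inverse_transform), using made-up numbers:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)              # centered to zero mean, unit variance
X_back = scaler.inverse_transform(X_scaled)

print(np.allclose(X, X_back))               # True: scaling is reversible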
- _sklearn_auto_wrap_output_keys = {'transform'}#
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardScaler #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
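A hedged sketch of how this request interacts with a meta-estimator, assuming a scikit-learn version in which Pipeline supports metadata routing (the data and weights here are made up):
import numpy as np
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

set_config(enable_metadata_routing=True)

# Route sample_weight to the scaler's fit, and explicitly decline it for the
# classifier so the router knows what to do with the metadata.
scaler = StandardScaler().set_fit_request(sample_weight=True)
clf = LogisticRegression().set_fit_request(sample_weight=False)

X = np.random.rand(20, 3)
y = np.random.randint(0, 2, size=20)
w = np.random.rand(20)

pipe = make_pipeline(scaler, clf)
pipe.fit(X, y, sample_weight=w)   # sample_weight is forwarded only to StandardScaler.fit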
- set_inverse_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardScaler #
Request metadata passed to the inverse_transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardScaler #
Request metadata passed to the partial_fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardScaler #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- class niklib.models.preprocessors.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)[source]#
Bases:
OneToOneFeatureMixin
,TransformerMixin
,BaseEstimator
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
The transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) X_scaled = X_std * (max - min) + min
where min, max = feature_range.
This transformation is often used as an alternative to zero mean, unit variance scaling.
Read more in the User Guide.
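A small sketch verifying the transformation formula above on a toy array, assuming the default feature_range=(0, 1):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
f_min, f_max = 0, 1                     # feature_range

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual = X_std * (f_max - f_min) + f_min

auto = MinMaxScaler(feature_range=(f_min, f_max)).fit_transform(X)
print(np.allclose(manual, auto))        # True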
- Parameters:
feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.
copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).
clip (bool, default=False) –
Set to True to clip transformed values of held-out data to provided feature range.
New in version 0.24.
- Variables:
min (ndarray of shape (n_features,)) – Per feature adjustment for minimum. Equivalent to
min - X.min(axis=0) * self.scale_
scale (ndarray of shape (n_features,)) –
Per feature relative scaling of the data. Equivalent to
(max - min) / (X.max(axis=0) - X.min(axis=0))
New in version 0.17: scale_ attribute.
data_min (ndarray of shape (n_features,)) –
Per feature minimum seen in the data
New in version 0.17: data_min_
data_max (ndarray of shape (n_features,)) –
Per feature maximum seen in the data
New in version 0.17: data_max_
data_range (ndarray of shape (n_features,)) –
Per feature range (data_max_ - data_min_) seen in the data
New in version 0.17: data_range_
n_features_in (int) –
Number of features seen during fit.
New in version 0.24.
n_samples_seen (int) – The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.
feature_names_in (ndarray of shape (n_features_in_,)) –
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
See also
minmax_scale
Equivalent function without the estimator API.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
-
_parameter_constraints:
dict
= {'clip': ['boolean'], 'copy': ['boolean'], 'feature_range': [<class 'tuple'>]}#
- _reset()[source]#
Reset internal data-dependent state of the scaler, if necessary.
__init__ parameters are not touched.
- fit(X, y=None)[source]#
Compute the minimum and maximum to be used for later scaling.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.
- Returns:
self – Fitted scaler.
- Return type:
- partial_fit(X, y=None)[source]#
Online computation of min and max on X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
- Returns:
self – Fitted scaler.
- Return type:
- transform(X)[source]#
Scale features of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- inverse_transform(X)[source]#
Undo the scaling of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]#
Bases:
OneToOneFeatureMixin
,TransformerMixin
,BaseEstimator
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform() method.
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
New in version 0.17.
Read more in the User Guide.
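A rough sketch of what the scaler computes, assuming the default quantile_range=(25.0, 75.0) with both centering and scaling enabled (the toy data is made up):
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100.0 is an outlier

median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)                     # center by median, scale by IQR

auto = RobustScaler().fit_transform(X)
print(np.allclose(manual, auto))                      # True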
- Parameters:
with_centering (bool, default=True) – If True, center the data before scaling. This will cause transform() to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_scaling (bool, default=True) – If True, scale the data to interquartile range.
quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)) –
Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quantile and q_max is the third quantile.
New in version 0.18.
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
unit_variance (bool, default=False) –
If True, scale data so that normally distributed features have a variance of 1. In general, if the difference between the x-values of q_max and q_min for a standard normal distribution is greater than 1, the dataset will be scaled down. If less than 1, the dataset will be scaled up.
New in version 0.24.
- Variables:
center (array of floats) – The median value for each feature in the training set.
scale (array of floats) –
The (scaled) interquartile range for each feature in the training set.
New in version 0.17: scale_ attribute.
n_features_in (int) –
Number of features seen during fit.
New in version 0.24.
feature_names_in (ndarray of shape (n_features_in_,)) –
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
See also
robust_scale
Equivalent function without the estimator API.
sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range
Examples
>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
-
_parameter_constraints:
dict
= {'copy': ['boolean'], 'quantile_range': [<class 'tuple'>], 'unit_variance': ['boolean'], 'with_centering': ['boolean'], 'with_scaling': ['boolean']}#
- __init__(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]#
- fit(X, y=None)[source]#
Compute the median and quantiles to be used for scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the median and quantiles used for later scaling along the features axis.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns:
self – Fitted scaler.
- Return type:
- transform(X)[source]#
Center and scale the data.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the specified axis.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
- inverse_transform(X)[source]#
Scale back the data to the original representation.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The rescaled data to be transformed back.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.MaxAbsScaler(*, copy=True)[source]#
Bases:
OneToOneFeatureMixin
,TransformerMixin
,BaseEstimator
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
This scaler can also be applied to sparse CSR or CSC matrices.
New in version 0.17.
- Parameters:
copy (bool, default=True) – Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).
- Variables:
scale (ndarray of shape (n_features,)) –
Per feature relative scaling of the data.
New in version 0.17: scale_ attribute.
max_abs (ndarray of shape (n_features,)) – Per feature maximum absolute value.
n_features_in (int) –
Number of features seen during fit.
New in version 0.24.
feature_names_in (ndarray of shape (n_features_in_,)) –
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
n_samples_seen (int) – The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across
partial_fit
calls.
See also
maxabs_scale
Equivalent function without the estimator API.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> from sklearn.preprocessing import MaxAbsScaler
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
- _reset()[source]#
Reset internal data-dependent state of the scaler, if necessary.
__init__ parameters are not touched.
- fit(X, y=None)[source]#
Compute the maximum absolute value to be used for later scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.
- Returns:
self – Fitted scaler.
- Return type:
- partial_fit(X, y=None)[source]#
Online computation of max absolute value of X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
- Returns:
self – Fitted scaler.
- Return type:
- transform(X)[source]#
Scale the data.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be scaled.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
- inverse_transform(X)[source]#
Scale back the data to the original representation.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be transformed back.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_features)
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False)[source]#
Bases:
TransformerMixin
,BaseEstimator
Binarize labels in a one-vs-all fashion.
Several regression and binary classification algorithms are available in scikit-learn. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.
At learning time, this simply consists in learning one regressor or binary classifier per class. In doing so, one needs to convert multi-class labels to binary labels (belong or does not belong to the class). LabelBinarizer makes this process easy with the transform method.
At prediction time, one assigns the class for which the corresponding model gave the greatest confidence. LabelBinarizer makes this easy with the inverse_transform() method.
Read more in the User Guide.
- Parameters:
neg_label (int, default=0) – Value with which negative labels must be encoded.
pos_label (int, default=1) – Value with which positive labels must be encoded.
sparse_output (bool, default=False) – True if the returned array from transform is desired to be in sparse CSR format.
- Variables:
classes (ndarray of shape (n_classes,)) – Holds the label for each class.
y_type (str) – Represents the type of the target data as evaluated by type_of_target(). Possible types are ‘continuous’, ‘continuous-multioutput’, ‘binary’, ‘multiclass’, ‘multiclass-multioutput’, ‘multilabel-indicator’, and ‘unknown’.
sparse_input (bool) – True if the input data to transform is given as a sparse matrix, False otherwise.
See also
label_binarize
Function to perform the transform operation of LabelBinarizer with fixed classes.
OneHotEncoder
Encode categorical features using a one-hot aka one-of-K scheme.
Examples
>>> from sklearn import preprocessing >>> lb = preprocessing.LabelBinarizer() >>> lb.fit([1, 2, 6, 4, 2]) LabelBinarizer() >>> lb.classes_ array([1, 2, 4, 6]) >>> lb.transform([1, 6]) array([[1, 0, 0, 0], [0, 0, 0, 1]])
Binary targets transform to a column vector
>>> lb = preprocessing.LabelBinarizer() >>> lb.fit_transform(['yes', 'no', 'no', 'yes']) array([[1], [0], [0], [1]])
Passing a 2D matrix for multilabel classification
>>> import numpy as np >>> lb.fit(np.array([[0, 1, 1], [1, 0, 0]])) LabelBinarizer() >>> lb.classes_ array([0, 1, 2]) >>> lb.transform([0, 1, 2, 1]) array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
-
_parameter_constraints:
dict
= {'neg_label': [<class 'numbers.Integral'>], 'pos_label': [<class 'numbers.Integral'>], 'sparse_output': ['boolean']}#
- fit(y)[source]#
Fit label binarizer.
- Parameters:
y (ndarray of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification.
- Returns:
self – Returns the instance itself.
- Return type:
- fit_transform(y)[source]#
Fit label binarizer/transform multi-class labels to binary labels.
The output of transform is sometimes referred to as the 1-of-K coding scheme.
- Parameters:
y ({ndarray, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification. Sparse matrix can be CSR, CSC, COO, DOK, or LIL.
- Returns:
Y – Shape will be (n_samples, 1) for binary problems. Sparse matrix will be of CSR format.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_classes)
- transform(y)[source]#
Transform multi-class labels to binary labels.
The output of transform is sometimes referred to by some authors as the 1-of-K coding scheme.
- Parameters:
y ({array, sparse matrix} of shape (n_samples,) or (n_samples, n_classes)) – Target values. The 2-d matrix should only contain 0 and 1, represents multilabel classification. Sparse matrix can be CSR, CSC, COO, DOK, or LIL.
- Returns:
Y – Shape will be (n_samples, 1) for binary problems. Sparse matrix will be of CSR format.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_classes)
- inverse_transform(Y, threshold=None)[source]#
Transform binary labels back to multi-class labels.
- Parameters:
Y ({ndarray, sparse matrix} of shape (n_samples, n_classes)) – Target values. All sparse matrices are converted to CSR before inverse transformation.
threshold (float, default=None) –
Threshold used in the binary and multi-label cases.
Use 0 when Y contains the output of decision_function (classifier). Use 0.5 when Y contains the output of predict_proba.
If None, the threshold is assumed to be half way between neg_label and pos_label.
- Returns:
y – Target values. Sparse matrix will be of CSR format.
- Return type:
{ndarray, sparse matrix} of shape (n_samples,)
Notes
In the case when the binary labels are fractional (probabilistic), inverse_transform() chooses the class with the greatest value. Typically, this allows using the output of a linear model’s decision_function method directly as the input of inverse_transform().
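A hedged sketch of that round trip, assuming a linear classifier whose decision_function output is thresholded at 0 (the toy data and labels are made up):
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(['no', 'no', 'yes', 'yes'])

lb = LabelBinarizer()
Y = lb.fit_transform(y)                     # binary targets become a 0/1 column vector

clf = LogisticRegression().fit(X, Y.ravel())
scores = clf.decision_function(X)           # real-valued margins

labels = lb.inverse_transform(scores, threshold=0)   # 0 for decision_function outputs
print(labels)                                         # e.g. ['no' 'no' 'yes' 'yes']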
- _sklearn_auto_wrap_output_keys = {'transform'}#
- set_inverse_transform_request(*, threshold: bool | None | str = '$UNCHANGED$') LabelBinarizer #
Request metadata passed to the inverse_transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- class niklib.models.preprocessors.MultiLabelBinarizer(*, classes=None, sparse_output=False)[source]#
Bases:
TransformerMixin
,BaseEstimator
Transform between iterable of iterables and a multilabel format.
Although a list of sets or tuples is a very intuitive format for multilabel data, it is unwieldy to process. This transformer converts between this intuitive format and the supported multilabel format: a (samples x classes) binary matrix indicating the presence of a class label.
- Parameters:
classes (array-like of shape (n_classes,), default=None) – Indicates an ordering for the class labels. All entries should be unique (cannot contain duplicate classes).
sparse_output (bool, default=False) – Set to True if output binary array is desired in CSR sparse format.
- Variables:
classes (ndarray of shape (n_classes,)) – A copy of the classes parameter when provided. Otherwise it corresponds to the sorted set of classes found when fitting.
See also
OneHotEncoder
Encode categorical features using a one-hot aka one-of-K scheme.
Examples
>>> from sklearn.preprocessing import MultiLabelBinarizer >>> mlb = MultiLabelBinarizer() >>> mlb.fit_transform([(1, 2), (3,)]) array([[1, 1, 0], [0, 0, 1]]) >>> mlb.classes_ array([1, 2, 3])
>>> mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}]) array([[0, 1, 1], [1, 0, 0]]) >>> list(mlb.classes_) ['comedy', 'sci-fi', 'thriller']
A common mistake is to pass in a list, which leads to the following issue:
>>> mlb = MultiLabelBinarizer() >>> mlb.fit(['sci-fi', 'thriller', 'comedy']) MultiLabelBinarizer() >>> mlb.classes_ array(['-', 'c', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'y'], dtype=object)
To correct this, the list of labels should be passed in as:
>>> mlb = MultiLabelBinarizer() >>> mlb.fit([['sci-fi', 'thriller', 'comedy']]) MultiLabelBinarizer() >>> mlb.classes_ array(['comedy', 'sci-fi', 'thriller'], dtype=object)
- fit(y)[source]#
Fit the label sets binarizer, storing classes_.
- Parameters:
y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.
- Returns:
self – Fitted estimator.
- Return type:
- fit_transform(y)[source]#
Fit the label sets binarizer and transform the given label sets.
- Parameters:
y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.
- Returns:
y_indicator – A matrix such that y_indicator[i, j] = 1 iff classes_[j] is in y[i], and 0 otherwise. Sparse matrix will be of CSR format.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_classes)
- transform(y)[source]#
Transform the given label sets.
- Parameters:
y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.
- Returns:
y_indicator – A matrix such that y_indicator[i, j] = 1 iff classes_[j] is in y[i], and 0 otherwise.
- Return type:
array or CSR matrix, shape (n_samples, n_classes)
- _transform(y, class_mapping)[source]#
Transforms the label sets with a given mapping.
- Parameters:
y (iterable of iterables) – A set of labels (any orderable and hashable object) for each sample. If the classes parameter is set, y will not be iterated.
class_mapping (Mapping) – Maps from label to column index in label indicator matrix.
- Returns:
y_indicator – Label indicator matrix. Will be of CSR format.
- Return type:
sparse matrix of shape (n_samples, n_classes)
- inverse_transform(yt)[source]#
Transform the given indicator matrix into label sets.
- Parameters:
yt ({ndarray, sparse matrix} of shape (n_samples, n_classes)) – A matrix containing only 1s and 0s.
- Returns:
y – The set of labels for each sample such that y[i] consists of classes_[j] for each yt[i, j] == 1.
- Return type:
list of tuples
- _sklearn_auto_wrap_output_keys = {'transform'}#
- class niklib.models.preprocessors.PandasTrainTestSplit(stratify=None, random_state=None)[source]#
Bases:
object
Split a pandas dataframe with train and test
Note
This class is very similar to TrainTestEvalSplit, with the difference that it is specialized for pandas DataFrames; since augmentation is applied to pandas DataFrames rather than NumPy arrays, this class enables us to augment only the train split and leave the test split as it is.
Note
args cannot be set directly and need to be provided using a JSON file. See set_configs() for more information.
You can explicitly override the following args by passing them as arguments to __init__():
random_state
stratify
- Returns:
A tuple of
(data_train, data_test)
which contains both dependent and independent variables- Return type:
Tuple[
numpy.ndarray
, …]
- set_configs(path=None)[source]#
Defines and sets the config to be parsed
The keys of the configs are the attributes of this class which are:
train_ratio
(float): Ratio of train datashuffle
(bool): Whether to shuffle the datastratify
(numpy.ndarray
, optional): If not None, this is used to stratify the datarandom_state
(int, optional): Random state to use for shuffling
Note
You can explicitly override following attributes by passing it as an argument to
__init__()
:random_state
stratify
The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.
- __call__(df, target_column, *args, **kwds)[source]#
Split a pandas dataframe into train and test splits
- Parameters:
df (pandas.DataFrame) – Dataframe to convert
target_column (str) – Name of the target column
- Returns:
Order is
(data_train, data_test)
- Return type:
Tuple[
numpy.ndarray
, …]
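A hedged usage sketch; the JSON config path and column name below are hypothetical, and the config keys are the ones listed under set_configs() (train_ratio, shuffle, stratify, random_state):
import pandas as pd
from niklib.models.preprocessors import PandasTrainTestSplit

df = pd.DataFrame({'feature': [1, 2, 3, 4], 'label': [0, 1, 0, 1]})

splitter = PandasTrainTestSplit(random_state=42)             # overrides the config value
splitter.set_configs(path='configs/train_test_split.json')   # hypothetical path
data_train, data_test = splitter(df, target_column='label')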
- class niklib.models.preprocessors.TrainTestEvalSplit(stratify=None, random_state=None)[source]#
Bases:
object
Convert a pandas dataframe to a numpy array with train, test, and eval splits
For conversion from pandas.DataFrame to numpy.ndarray, we use the same functionality as pandas.DataFrame.to_numpy(), but it separates dependent and independent variables given the target column target_column.
Note
To obtain the eval set, we use the train set as the original data to be split, i.e. the eval set is a subset of the train set. This is of course to make sure the model by no means sees the test set.
args cannot be set directly and need to be provided using a JSON file. See set_configs() for more information.
You can explicitly override the following args by passing them as arguments to __init__():
random_state
stratify
- Returns:
Order is
(x_train, x_test, x_eval, y_train, y_test, y_eval)
- Return type:
Tuple[
numpy.ndarray
, …]
- set_configs(path=None)[source]#
Defines and sets the config to be parsed
The keys of the configs are the attributes of this class which are:
test_ratio
(float): Ratio of test dataeval_ratio
(float): Ratio of eval datashuffle
(bool): Whether to shuffle the datastratify
(numpy.ndarray
, optional): If not None, this is used to stratify the datarandom_state
(int, optional): Random state to use for shuffling
Note
You can explicitly override following attributes by passing it as an argument to
__init__()
:random_state
stratify
The values of the configs are parameters and can be set manually or extracted from JSON config files by providing the path to the JSON file.
- __call__(df, target_column, *args, **kwds)[source]#
Convert a pandas dataframe to a numpy array with train, test, and eval splits
- Parameters:
df (pandas.DataFrame) – Dataframe to convert
target_column (str) – Name of the target column
- Returns:
Order is
(x_train, x_test, x_eval, y_train, y_test, y_eval)
- Return type:
Tuple[
numpy.ndarray
, …]
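A hedged usage sketch; the JSON config path and column name are hypothetical, and the config keys are the ones listed under set_configs() (test_ratio, eval_ratio, shuffle, stratify, random_state):
import pandas as pd
from niklib.models.preprocessors import TrainTestEvalSplit

df = pd.DataFrame({'feature': [1, 2, 3, 4, 5, 6], 'label': [0, 1, 0, 1, 0, 1]})

splitter = TrainTestEvalSplit(random_state=42)
splitter.set_configs(path='configs/train_test_eval_split.json')  # hypothetical path
x_train, x_test, x_eval, y_train, y_test, y_eval = splitter(df, target_column='label')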
- class niklib.models.preprocessors.ColumnTransformerConfig[source]#
Bases:
object
A helper class that parses configs for using the sklearn.compose.ColumnTransformer
The purpose of this class is to create the list of transformers to be used by the sklearn.compose.ColumnTransformer. Hence, one needs to define the configs by using the set_configs() method, then use the generate_pipeline() method to create the list of transformers.
In the end, this class returns a list of tuples, where each tuple is in the form of (name, transformer, columns).
- set_configs(path=None)[source]#
Defines and sets the config to be parsed
The keys of the configs are the names of the transformers. They must include the API name of one of the available transforms at the end:
sklearn transformers: any class that could be used for transformation and is importable as sklearn.preprocessing.API_NAME
custom transformers: any class that is not a sklearn transformer and is importable as niklib.models.preprocessors.API_NAME
This naming convention is used to create proper transformers for each type of data. E.g. in JSON format:
"age_StandardScaler": { "columns_type": "'numeric'", "dtype_include": "np.float32", "pattern_include": "'age'", "pattern_exclude": "None", "dtype_exclude": "None", "group": "False", "use_global": "False" } "sex_OneHotEncoder": { "columns_type": "'numeric'", "dtype_include": "'category'", "pattern_include": "'VisaResult'", "pattern_exclude": "None", "dtype_exclude": "None", "group": "True", "use_global": "True" }
The values of the configs are the columns to be transformed. The columns can be obtained by using niklib.models.preprocessors.core.ColumnSelector, which requires the user to pass certain parameters. These parameters can be set manually or extracted from JSON config files by providing the path to the JSON file.
The group key is used to determine whether the transformer should be applied to a group of columns or not. If group is True, then the values required for the transformation are obtained from all columns together rather than from each column separately. For instance, one can use OneHotEncoding on a set of columns where, if group is True, all unique categories of all of those columns are extracted and then transformed; if group is False, each column is transformed based on its own unique categories independently. (group cannot be passed to ColumnSelector.)
The use_global key is used to determine whether the transformer should be fitted on the entire data or only the train data (since fitting transformations for normalization must only be done on train data). If use_global is True, then the transformer will be applied on all data. This is particularly useful for one-hot encoding categorical features where some categories might be rare and only exist in the test and eval data.
- Parameters:
path (Union[str, Path]) – Path to the JSON file containing the configs
- Returns:
A dictionary where keys are string names and values are a tuple of a niklib.models.preprocessors.core.ColumnSelector instance and a boolean control variable which will be passed to generate_pipeline().
- Return type:
- static extract_selected_columns(selector, df)[source]#
Extracts the columns from the dataframe based on the selector
Note
This method is simply a wrapper around niklib.models.preprocessors.core.ColumnSelector that makes the call given a dataframe. I.e.:
# assuming same configs
selector = preprocessors.ColumnSelector(...)
A = ColumnTransformerConfig.extract_selected_columns(selector=selector, df=df)
B = selector(df)
A == B  # True
Also, this is a static method.
- Parameters:
selector (niklib.models.preprocessors.core.ColumnSelector) – Initialized selector object
df (pandas.DataFrame) – Dataframe to extract columns from
- Returns:
List of columns to be transformed
- Return type:
- __check_arg_exists(callable, arg)#
Checks if the argument exists in the callable signature
- Parameters:
callable (Callable) – Callable to check the argument in
arg (str) – Argument to check if exists in the callable signature
- Raises:
ValueError – If the argument does not exist in the callable signature
- Return type:
- __get_df_column_unique(df, loc)#
Gets uniques of a column in a dataframe
- Parameters:
df (
pandas.DataFrame
) – Dataframe to get uniques from
- Returns:
List of unique values in the column. Values of the returned list can be anything that is supported by
pandas.DataFrame
- Return type:
- calculate_params(df, columns, group, transformer_name)[source]#
Calculates the parameters for the group transformation w.r.t. the transformer name
- Parameters:
df (pandas.DataFrame) – Dataframe to extract columns from
columns (List) – List of columns to be transformed
group (bool) – If True, then the columns will be grouped together and the parameters will be calculated over all columns passed in
transformer_name (str) – Name of the transformer. It is used to determine the type of params to be passed to the transformer. E.g. if transformer_name corresponds to OneHotEncoding, then params would be the unique categories.
- Raises:
ValueError – If the transformer name is not implemented but supported
- Returns:
Parameters for the group transformation
- Return type:
- _check_overlap_in_transformation_columns(transformers)[source]#
Checks if there are multiple transformers on the same columns and reports them
Logs info if columns of different transformers overlap, i.e. at least one other transform is happening on a column that has already been transformed.
Note
This is not a bug or misbehavior, since we should be able to pipe multiple transformers sequentially on the same column (e.g. add -> divide). The warning is thrown for the case where the user didn't mean to do so, since the output might look acceptable but contain wrong values, and there is no way to find out except by manual inspection. Hence, this method makes the user aware that something might be wrong.
- Parameters:
transformers (List[Tuple]) – A list of tuples, where each tuple is in the form of (name, transformer, columns) where name is the name of the transformer, transformer is the transformer object and columns is the list of column names to be transformed.
- generate_pipeline(df, df_all=None)[source]#
Generates the list of transformers to be used by the
sklearn.compose.ColumnTransformer
Note
For more info about how the transformers are created, see the methods set_configs(), extract_selected_columns() and calculate_params().
- Parameters:
df (pandas.DataFrame) – Dataframe to extract columns from; if df_all is None, then this is interpreted as the train data
df_all (Optional[pandas.DataFrame]) – Dataframe to extract columns from; if df_all is not None, then this is interpreted as the entire data. For more info see set_configs().
- Raises:
ValueError – If the naming convention used for the keys in the configs (see set_configs()) is not followed.
- Returns:
A list of tuples, where each tuple is in the form of (name, transformer, columns) where name is the name of the transformer, transformer is the transformer object and columns is the list of column names to be transformed.
- Return type:
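A hedged end-to-end sketch; the config path is hypothetical, the toy columns mirror the JSON example in set_configs() ('age' numeric, 'VisaResult' categorical), and the call order assumes set_configs() is invoked before generate_pipeline() as described above:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from niklib.models.preprocessors import ColumnTransformerConfig

df_all = pd.DataFrame({
    'age': np.array([22.0, 35.0, 47.0, 51.0], dtype=np.float32),
    'VisaResult': pd.Categorical(['acc', 'rej', 'acc', 'rej']),
})
df_train = df_all.iloc[:3]          # pretend train split

config = ColumnTransformerConfig()
config.set_configs(path='configs/column_transformer.json')   # hypothetical path

transformers = config.generate_pipeline(df=df_train, df_all=df_all)
ct = ColumnTransformer(transformers)
x_train = ct.fit_transform(df_train)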
- class niklib.models.preprocessors.ColumnSelector(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#
Bases:
object
Selects columns based on regex pattern and dtype
User can specify the dtype of columns to select, and the dtype of columns to ignore. Also, user can specify the regex pattern for including and excluding columns, separately.
This is particularly useful when combined with sklearn.compose.ColumnTransformer to apply different sorts of transformers to different subsets of columns. E.g:
# select columns that contain 'Country' in their name and are of type `np.float32`
columns = preprocessors.ColumnSelector(columns_type='numeric',
                                       dtype_include=np.float32,
                                       pattern_include='.*Country.*',
                                       pattern_exclude=None,
                                       dtype_exclude=None)(df=data)

# use a transformer for selected columns
ct = preprocessors.ColumnTransformer(
    [('some_name',                     # just a name
      preprocessors.StandardScaler(),  # the transformer
      columns),                        # the columns to apply the transformer to
     ],
)
ct.fit_transform(...)
Note
If the data that is passed to the ColumnSelector is a pandas.DataFrame, then you can ignore calling the instance of this class and directly use it in the pipeline. E.g:
# select columns that contain 'Country' in their name and are of type `np.float32`
columns = preprocessors.ColumnSelector(columns_type='numeric',
                                       dtype_include=np.float32,
                                       pattern_include='.*Country.*',
                                       pattern_exclude=None,
                                       dtype_exclude=None)  # THIS LINE

# use a transformer for selected columns
ct = preprocessors.ColumnTransformer(
    [('some_name',                     # just a name
      preprocessors.StandardScaler(),  # the transformer
      columns),                        # the columns to apply the transformer to
     ],
)
ct.fit_transform(...)
See also
sklearn.compose.make_column_selector
asColumnSelector
follows the same semantics.- __init__(columns_type, dtype_include, pattern_include=None, dtype_exclude=None, pattern_exclude=None)[source]#
Selects columns based on regex pattern and dtype
- Parameters:
columns_type (str) –
Type of columns:
'string': returns the names of the columns. Useful for pandas.DataFrame
'numeric': returns the indices of the columns. Useful for numpy.ndarray
dtype_include (type) – Type of the columns to select. For more info see pandas.DataFrame.select_dtypes().
pattern_include (str) – Regex pattern to match columns to include
dtype_exclude (type) – Type of the columns to ignore. For more info see pandas.DataFrame.select_dtypes(). Defaults to None.
pattern_exclude (str) – Regex pattern to match columns to exclude
- __call__(df, *args, **kwds)[source]#
- Parameters:
df (
pandas.DataFrame
) – Dataframe to extract columns from- Returns:
List of names or indices of filtered columns
- Return type:
- Raises:
ValueError – If df is not an instance of pandas.DataFrame
- niklib.models.preprocessors.move_dependent_variable_to_end(df, target_column)[source]#
Move the dependent variable to the end of the dataframe
This is useful for some frameworks that require the dependent variable to be the last column; in general, it is much easier to work with numpy.ndarrays when the dependent variable is the last one.
Note
This is particularly useful for us since we have multiple columns of the same type in our dataframe, and when we want to apply the same preprocessing to all members of a group of features, we can directly use the index of those features from our pandas dataframe in the converted numpy array. E.g:
df = pd.DataFrame(...)
x = df.to_numpy()
index = df.columns.get_loc(a_group_of_columns_with_the_same_logic)
x[:, index] = transform(x[:, index])
- Parameters:
df (pandas.DataFrame) – Dataframe to convert
target_column (str) – Name of the target column
- Returns:
Dataframe with the dependent variable at the end
- Return type:
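A short usage sketch with a toy dataframe; 'label' is a hypothetical target column:
import pandas as pd
from niklib.models.preprocessors import move_dependent_variable_to_end

df = pd.DataFrame({'label': [0, 1], 'a': [1.0, 2.0], 'b': [3.0, 4.0]})
df = move_dependent_variable_to_end(df, target_column='label')
print(list(df.columns))   # expected: ['a', 'b', 'label']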
- niklib.models.preprocessors.preview_column_transformer(column_transformer, original, transformed, df, random_state=Generator(PCG64) at 0x7FF565FC8820, **kwargs)[source]#
Preview transformed data next to original one obtained via
ColumnTransformer
When the transformation is not sklearn.preprocessing.OneHotEncoder, the transformed data is previewed next to the original data in a pandas dataframe.
But when the transformation is sklearn.preprocessing.OneHotEncoder, this is no longer clean or informative, as one only sees 0s and 1s. So, I just skip previewing the transformed data entirely and report the following information:
The number of columns affected by the transformation
The number of unique values in all of the affected columns
The number of newly produced columns
- Parameters:
column_transformer (ColumnTransformer) – An instance of sklearn.compose.ColumnTransformer
original (numpy.ndarray) – Original data as a numpy.ndarray. Same shape as transformed
transformed (numpy.ndarray) – Transformed data as a numpy.ndarray. Same shape as original
df (pandas.DataFrame) – A dataframe that hosts the original and transformed data. Used to extract column names and unique values for logging information about the transformations done
random_state (Union[int, numpy.random.Generator], optional) – A seed value or instance of numpy.random.Generator for sampling. Defaults to numpy.random.default_rng().
**kwargs –
Additional arguments as follows:
n_samples (int): Number of samples to draw. Defaults to 1.
- Raises:
ValueError – If original and transformed are not of the same shape
- Yields:
pandas.DataFrame – Preview dataframe for each transformer in column_transformer.transformers_. The dataframe has twice as many columns as original and transformed, i.e. df.shape == (original.shape[0], 2 * original.shape[1])
- Return type:
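A hedged usage sketch with a toy dataframe and a single StandardScaler transformer; the exact layout of each yielded preview is whatever the function produces:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from niklib.models.preprocessors import preview_column_transformer

df = pd.DataFrame({'age': [20.0, 30.0, 40.0], 'income': [1.0, 2.0, 3.0]})
original = df.to_numpy()

ct = ColumnTransformer([('scale_all', StandardScaler(), [0, 1])])
transformed = ct.fit_transform(original)

for preview in preview_column_transformer(ct, original=original,
                                          transformed=transformed, df=df,
                                          n_samples=1):
    print(preview)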
- niklib.models.preprocessors.get_transformed_feature_names(column_transformer, original_columns_names)[source]#
Gives feature names for transformed data via original feature names
This is super useful as the default sklearn.compose.ColumnTransformer.get_feature_names_out() uses meaningless names for features after transformation, which makes tracking the transformed features almost impossible, as it uses f0[_category], f1[_category], ... fn[_category] as feature names. This method, for example, extracts the name of the original column A (with categories [a, b]) before transformation, finds the new columns produced by transforming that column, and names them A_a and A_b, whereas the sklearn method gives x[num0]_a and x[num0]_b
.- Parameters:
column_transformer (
sklearn.compose.ColumnTransformer
) – A fitted column transformer that has.transformers_
where each is a tuple as(name, transformer, in_columns)
.in_columns
used to detect the original index of transformed columns.original_columns_names (List[str]) – List of original columns names before transformation
- Returns:
A list of transformed columns names prefixed with original columns names
- Return type:
List[str]
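A hedged usage sketch with a toy one-hot-encoded dataframe; the expected output is inferred from the description above:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from niklib.models.preprocessors import get_transformed_feature_names

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['x', 'x', 'y']})

ct = ColumnTransformer([('onehot', OneHotEncoder(), [0, 1])])
ct.fit(df)

names = get_transformed_feature_names(ct, original_columns_names=list(df.columns))
print(names)   # expected something like ['A_a', 'A_b', 'B_x', 'B_y']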