niklib.data package#
Submodules#
niklib.data.constant module#
- niklib.data.constant.EXAMPLE_FINANCIAL_RATIOS = {'deposit2rent': 0.03, 'deposit2worth': 5.0, 'income2tax': 0.15, 'income2worth': 15.0, 'rent2deposit': 33.333333333333336, 'tax2income': 6.666666666666667, 'worth2deposit': 0.2, 'worth2income': 0.06666666666666667}#
Ratios used to convert rent, deposit, and total worth to each other
Note
This is one of the dictionaries containing factors used in heuristic calculations based on domain knowledge.
Note
Although this was created as a code example, the values chosen here come from basic rules of thumb and can actually be used if no other reliable information is available.
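The key names read as `<source>2<target>`. The following is a minimal sketch of how such a table might be used, assuming that multiplying a `<source>` amount by the factor yields the `<target>` amount (the direction is not stated explicitly above). The helper names `convert` and `reciprocal_pairs_ok` are hypothetical and not part of niklib.

```python
import math

# Values copied from EXAMPLE_FINANCIAL_RATIOS as documented above.
RATIOS = {
    'deposit2rent': 0.03, 'deposit2worth': 5.0,
    'income2tax': 0.15, 'income2worth': 15.0,
    'rent2deposit': 33.333333333333336, 'tax2income': 6.666666666666667,
    'worth2deposit': 0.2, 'worth2income': 0.06666666666666667,
}


def convert(amount: float, source: str, target: str) -> float:
    """Hypothetical helper: assumes `<source>2<target>` is a multiplicative factor."""
    return amount * RATIOS[f'{source}2{target}']


def reciprocal_pairs_ok() -> bool:
    """Check that every `a2b` factor is the reciprocal of its `b2a` counterpart."""
    for key, factor in RATIOS.items():
        source, target = key.split('2')
        if not math.isclose(factor * RATIOS[f'{target}2{source}'], 1.0):
            return False
    return True
```

The reciprocal check only confirms that the documented values are internally consistent; it says nothing about whether they suit a particular dataset.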
- class niklib.data.constant.ExampleFillna(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
Values used to fill None values, depending on the form structure
Members follow the <field_name>_<form_name> naming convention. The value has been extracted by manually inspecting the documents. Hence, for each form, the user must find and set this value manually.
Note
We do not use any heuristics here; we simply follow what the form uses and add one more option that represents the None state, i.e. None as a separate feature in categorical mode.
- CHD_M_STATUS_5645E = 9#
- class niklib.data.constant.ExampleDocTypes(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
Contains all document types, which can be used to customize the ETL steps for each document type
Members follow the <country_name>_<document_type> naming convention. The values and their order are meaningless.
- CANADA = 1#
- CANADA_5257E = 2#
- CANADA_5645E = 3#
- CANADA_LABEL = 4#
- class niklib.data.constant.ExampleMarriageStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
States of marriage in (some specific) form
Note
Values for the members are the values used in the original forms. Hence, they should not be modified by any means, as they are tied to the dataset, transformations, and other domain-specific values.
Note
These values have been chosen for demonstration purposes in this class and do not carry any meaning or information (El No Sabe). For real-world use, you must use meaningful ones.
- COMMON_LAW = 69#
- DIVORCED = 3#
- SEPARATED = 4#
- MARRIED = 0#
- SINGLE = 7#
- WIDOWED = 85#
- UNKNOWN = 9#
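Because the member values mirror the raw values found in the forms, a raw value can be turned back into a member by value lookup. The sketch below uses a plain `enum.Enum` as a stand-in, since `CustomNamingEnum` is not reproduced here. The `decode_status` helper and its fallback to `UNKNOWN` are assumptions made for illustration.

```python
from enum import Enum


class MarriageStatus(Enum):
    """Stand-in for ExampleMarriageStatus built on plain Enum (an assumption)."""
    COMMON_LAW = 69
    DIVORCED = 3
    SEPARATED = 4
    MARRIED = 0
    SINGLE = 7
    WIDOWED = 85
    UNKNOWN = 9


def decode_status(raw_value: int) -> MarriageStatus:
    """Map a raw form value to a member; unseen values fall back to UNKNOWN."""
    try:
        return MarriageStatus(raw_value)
    except ValueError:
        # Hypothetical policy: treat anything the form does not define as UNKNOWN.
        return MarriageStatus.UNKNOWN
```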
- class niklib.data.constant.ExampleSex(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
Sex types in general
Note
The values of the enum members are not important, hence no explicit values are assigned.
Note
The names of the members had to be customized because of bad preprocessing (or, in some cases, domain-specific knowledge); hence, name has been overridden.
- FEMALE = 1#
- MALE = 2#
niklib.data.functional module#
Contains implementations of functions that can be used for processing data anywhere and are not necessarily bound to a class.
- niklib.data.functional.dict_to_csv(d, path)[source]#
Takes a flattened dictionary and writes it to a CSV file.
- niklib.data.functional.dump_directory_structure_csv(src, shallow=True)[source]#
Saves the tree structure of a directory to a CSV file
Takes a src directory path, creates a tree of the directory structure, and writes it to a CSV file named 'label.csv' with a default value of '0' for each path.
Note
This has been used to manually extract and record labels.
- niklib.data.functional.create_directory_structure_tree(src, shallow=False)[source]#
Takes a path to directory and creates a dictionary of its directory structure tree
- Parameters:
- Returns:
Dictionary of all dirs (and subdirs) where keys are path and values are
0
- Return type:
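A directory tree of this shape can be built with `os.walk`. The following is a rough sketch, not niklib's implementation. The function name `directory_tree` is hypothetical, the placeholder is written as the string `'0'`, and `shallow` is assumed to mean "first level only".

```python
import os
from typing import Dict


def directory_tree(src: str, shallow: bool = False) -> Dict[str, str]:
    """Map each sub-directory path under `src` to the placeholder '0'."""
    tree: Dict[str, str] = {}
    for root, dirs, _files in os.walk(src):
        for name in sorted(dirs):
            tree[os.path.join(root, name)] = '0'
        if shallow:
            # Assumption: "shallow" stops after the direct children of `src`.
            break
    return tree
```

Writing such a mapping to a CSV file then only requires one row per key, which is how a label file with a default value per path could be produced.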
- niklib.data.functional.flatten_dict(d)[source]#
Takes a (nested) multilevel dictionary and flattens it
- Parameters:
d (dict) – A dictionary (could be multilevel)
- Returns:
- Flattened dictionary where keys and values of returned dict are:
new_keys[i] = f'{old_keys[level]}.{old_keys[level+1]}.[...].{old_keys[level+n]}'
new_value = old_value
- Return type:
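The key-joining rule above can be illustrated with a short recursive function. This is a sketch under assumptions: the name `flatten`, the `sep` parameter, and the handling of empty nested dictionaries (they simply disappear) are illustrative choices, not confirmed behavior of `flatten_dict`.

```python
from typing import Any, Dict


def flatten(d: Dict[str, Any], parent_key: str = '', sep: str = '.') -> Dict[str, Any]:
    """Flatten nested dicts, joining the keys along each path with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in d.items():
        new_key = f'{parent_key}{sep}{key}' if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries; empty ones contribute nothing.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat
```

For example, `flatten({'a': {'b': 1, 'c': {'d': 2}}, 'e': 3})` gives `{'a.b': 1, 'a.c.d': 2, 'e': 3}`.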
- niklib.data.functional.xml_to_flattened_dict(xml)[source]#
Takes a (nested) XML and flattens it to a dict via flatten_dict()
- niklib.data.functional.process_directory(src_dir, dst_dir, compose, file_pattern='*', manager=None)[source]#
Transforms all files that match pattern in given dir and saves new files preserving dir structure
Note
A method used for moving files from the manually processed dataset to the raw dataset. See FileTransform for more information.
References
- Parameters:
src_dir (str) – Source directory to be processed
dst_dir (str) – Destination directory to write processed files
compose (FileTransformCompose) – An instance of the transform composer. See niklib.data.preprocessor.FileTransformCompose.
file_pattern (str, optional) – Pattern used to match files. Defaults to '*' (all files).
manager (Optional[Manager], optional) – enlighten.Manager for the progress bar. Defaults to None.
- Return type:
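The idea of transforming matching files while preserving the directory layout can be sketched with `pathlib`. Everything below is an assumption for illustration: the name `process_tree`, the recursive matching via `rglob`, and the treatment of `compose` as any callable taking `(src, dst)`. The progress-bar `manager` is omitted.

```python
from pathlib import Path
from typing import Callable, List


def process_tree(src_dir: str, dst_dir: str,
                 compose: Callable[[str, str], object],
                 file_pattern: str = '*') -> List[Path]:
    """Apply `compose(src, dst)` to every matching file, mirroring sub-directories."""
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    written: List[Path] = []
    for src in sorted(src_root.rglob(file_pattern)):
        if not src.is_file():
            continue
        # Rebuild the same relative path under the destination root.
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        compose(str(src), str(dst))
        written.append(dst)
    return written
```

Passing `shutil.copyfile` as `compose` turns the sketch into a filtered directory copy, which is a convenient way to try it out.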
- niklib.data.functional.extended_dict_get(string, dic, if_nan, condition=None)[source]#
Takes a string and, if the condition is satisfied, looks it up in a dictionary, falling back to a default value
- Parameters:
Examples
>>> d = {'1': 'a', '2': 'b', '3': 'c'}
>>> extended_dict_get('1', d, 'z', str.isnumeric)
'a'
>>> extended_dict_get('x', d, 'z', str.isnumeric)
'x'
- Returns:
Substituted value instead of string
- Return type:
Any
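One way to obtain the behavior shown in the examples is sketched below. This is not the library's code. The name `dict_get_if` is hypothetical, and two points are assumptions: a missing key returns `if_nan` when the condition holds, and `condition=None` means "always look up".

```python
from typing import Any, Callable, Optional


def dict_get_if(string: str, dic: dict, if_nan: Any,
                condition: Optional[Callable[[str], bool]] = None) -> Any:
    """Look `string` up in `dic` only when `condition(string)` holds."""
    # Assumption: without a condition, every string is looked up.
    if condition is None or condition(string):
        return dic.get(string, if_nan)
    # Condition not satisfied: the input is passed through unchanged.
    return string
```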
- niklib.data.functional.config_csv_to_dict(path)[source]#
Takes a config CSV and returns a dictionary of keys and values
Note
Configs for our use case can be found in niklib.configs
niklib.data.logic module#
- class niklib.data.logic.Logics(dataframe=None)[source]#
Bases:
object
Applies logic to different types of data, resulting in summarized, expanded, or transformed data
Methods here are implemented so that they can be used as a pandas agg function over pandas.Series using functools.reduce.
Note
This is constructed based on domain knowledge hence is designed for a specific purpose based on application.
For demonstration purposes, see the following methods of this class:
These methods have to be implemented by their superclass. See:
- __init__(dataframe=None)[source]#
Init class by setting dataframe globally
- Parameters:
dataframe (pandas.DataFrame, optional) – The dataframe whose series the functions of this class will be used over, i.e. Logics.*(series). Defaults to None.
- __check_df(func)#
Checks that df is initialized when the function named func is being called
- reset_dataframe(dataframe)[source]#
Takes a new dataframe and replaces the old one
Note
This should be used when the dataframe is modified outside of the functions provided in this class. E.g.:
my_df: pd.DataFrame = ...
logics = Logics(dataframe=my_df)
my_df = third_party_tools(my_df)
# now update df in logics
logics.reset_dataframe(dataframe=my_df)
- Parameters:
dataframe (pandas.DataFrame) – The new dataframe
- Return type:
- add_agg_column(aggregator, agg_column_name, columns)[source]#
Aggregates columns using an aggregator function and adds the result to the original dataframe
- Parameters:
Note
Although this function updates, in place, the dataframe the class was initialized with, the user must also update the main dataframe outside of this class to make sure it can be used with different tools. Simply put:
my_df: pd.DataFrame = ...
logics = Logics(dataframe=my_df)
my_df = logics.add_agg_column(...)
my_df = third_party_tools(my_df)
# now update df in logics
logics.reset_dataframe(dataframe=my_df)
# aggregate again...
my_df = logics.add_agg_column(...)
- Returns:
Updated dataframe that contains aggregated data
- Return type:
- count_previous_residency_country(series)[source]#
Counts the number of previous countries of residence
- Parameters:
series (pandas.Series) – Pandas Series to be processed
- Returns:
Result of counting
- Return type:
- count_rel(series)[source]#
Counts the number of items for the given relationship
- Parameters:
series (pandas.Series) – Pandas Series to be processed
- Returns:
Result of counting
- Return type:
- count_foreign_family_resident(series)[source]#
Counts the number of family members that are living in a foreign country
- Parameters:
series (pandas.Series) – Pandas Series to be processed
- Returns:
Result of counting
- Return type:
- class niklib.data.logic.ExampleLogics(dataframe=None)[source]#
Bases:
Logics
Customizes and extends the logics defined in Logics for an Example (Canada) dataset
- __init__(dataframe=None)[source]#
Init class by setting dataframe globally
- Parameters:
dataframe (pandas.DataFrame, optional) – The dataframe whose series the functions of this class will be used over, i.e. Logics.*(series). Defaults to None.
- count_previous_residency_country(series)[source]#
Counts the number of previous residencies by counting non-zero periods of residency
When *.Period == 0, we can say that the person has no residency. This way, one just needs to count the non-zero periods.
- Parameters:
series (pandas.Series) – Pandas Series to be processed, containing residency periods
- Returns:
Result of counting
- Return type:
- count_rel(series)[source]#
Counts the number of people for the given relationship, e.g. siblings.
- Parameters:
series (pandas.Series) – Pandas Series to be processed
- Returns:
Result of counting
- Return type:
- count_foreign_family_resident(series)[source]#
Counts the number of family members that are long-distance residents
This is done by only checking for the literal value 'foreign' in the '*Addr' columns (address columns).
- Parameters:
series (pandas.Series) – Pandas Series to be processed, containing the residency state/province as a string. In practice, any string different from the applicant’s province will be counted as a difference.
Examples
>>> import pandas as pd
>>> from niklib.data.logic import ExampleLogics
>>> f = ExampleLogics().count_foreign_family_resident
>>> s = pd.Series(['alborz', 'alborz', 'alborz', None, 'foreign', None, 'gilan', 'isfahan', None])
>>> f(s)
1
>>> s1 = pd.Series(['foreign', 'foreign', 'alborz', 'fars'])
>>> f(s1)
2
>>> s2 = pd.Series([None, None, 'alborz', 'fars'])
>>> f(s2)
0
- Returns:
Result of counting
- Return type:
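The counting rule for the literal `'foreign'` can be expressed as a fold with `functools.reduce`, in the spirit of the reduce-style aggregators this class describes. The sketch below works on plain Python iterables rather than `pandas.Series` to stay dependency-free. The name `count_foreign` is hypothetical and this is not the library's implementation.

```python
from functools import reduce
from typing import Iterable, Optional


def count_foreign(values: Iterable[Optional[str]]) -> int:
    """Count entries equal to the literal 'foreign'; None and other strings add 0."""
    # `v == 'foreign'` is a bool, which adds to the accumulator as 0 or 1.
    return reduce(lambda acc, v: acc + (v == 'foreign'), values, 0)
```

Applied to the three inputs from the examples above, it gives 1, 2, and 0 respectively.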
niklib.data.pdf module#
- class niklib.data.pdf.PDFIO[source]#
Bases:
object
Base class for dealing with PDF files
For each kind of PDF, say XFA files, one needs to extend this class and implement abstract methods like extract_raw_content() to generate a string of the content of the PDF in a format that can be used by the other classes (e.g. XML). For instance, see XFAPDF for an extension of this class.
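The extension pattern described here resembles a standard abstract base class. The sketch below shows that pattern with Python's `abc` module. Whether `PDFIO` itself is built on `abc` is not stated in this page, and the class names `PDFReader` and `FakeXFAReader` are hypothetical.

```python
from abc import ABC, abstractmethod


class PDFReader(ABC):
    """Stand-in for a PDFIO-like base class (hypothetical name)."""

    @abstractmethod
    def extract_raw_content(self, pdf_path: str) -> str:
        """Return the content of the PDF as a string (e.g. XML)."""


class FakeXFAReader(PDFReader):
    """Toy subclass: returns a fixed XML string instead of parsing a real PDF."""

    def extract_raw_content(self, pdf_path: str) -> str:
        return f'<form source="{pdf_path}"/>'
```

With this layout, instantiating the base class directly raises `TypeError`, so a missing implementation is caught early.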
- class niklib.data.pdf.XFAPDF[source]#
Bases:
PDFIO
Contains functions and utility tools for dealing with XFA PDF documents.
Note
Developers should subclass this class and override clean_xml_for_csv() for their own specific XFA data.
- extract_raw_content(pdf_path)[source]#
Extracts the raw content of XFA PDF files, which is in XML format
- Parameters:
pdf_path (str) – path to the pdf file
- Reference:
- Returns:
XFA object of the pdf file in XML format
- Return type:
- clean_xml_for_csv(xml, mode)[source]#
Cleans the XML file extracted from XFA forms
Since each form has its own format and issues, this method needs to be implemented separately for each unique file/form, which is specified using the mode argument that can be populated from niklib.data.constant.ExampleDocTypes.
- Parameters:
xml (str) – XML content
mode (Enum) – mode of the document defined in niklib.data.constant.ExampleDocTypes
- Returns:
cleaned XML content to be used in CSV file
- Return type:
- class niklib.data.pdf.ExampleXFA[source]#
Bases:
XFAPDF
Handles Canada XFA PDF files
- clean_xml_for_csv(xml, mode)[source]#
Hardcoded cleaning of Example XFA XML files so that the XML is compatible with CSV
- Parameters:
xml (str) – XML content
mode (Enum) – mode of the document defined in niklib.data.constant.ExampleDocTypes
- Returns:
cleaned XML content to be used in CSV file
- Return type:
niklib.data.preprocessor module#
- class niklib.data.preprocessor.FileTransform[source]#
Bases:
object
A base class for applying transforms as a composable object over files.
Any behavior over the files themselves (not the content of the files) must extend this class.
- class niklib.data.preprocessor.CopyFile(mode)[source]#
Bases:
FileTransform
Only copies a file; a wrapper around shutil’s copying methods
Default is set to 'cf', i.e. shutil.copyfile(). For more info, see the shutil documentation.
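A mode-to-function mapping is one simple way to wrap `shutil`'s copy functions. In the sketch below only the `'cf'` entry comes from the description above. The `'c'` and `'c2'` keys, the class name `CopyFileSketch`, and the `ValueError` on unknown modes are assumptions.

```python
import shutil
from typing import Callable, Dict

# 'cf' -> shutil.copyfile is documented above; the other keys are assumed.
COPY_MODES: Dict[str, Callable] = {
    'cf': shutil.copyfile,
    'c': shutil.copy,
    'c2': shutil.copy2,
}


class CopyFileSketch:
    """Callable that copies `src` to `dst` with the shutil function chosen by `mode`."""

    def __init__(self, mode: str = 'cf') -> None:
        if mode not in COPY_MODES:
            raise ValueError(f'unknown copy mode: {mode!r}')
        self.copy = COPY_MODES[mode]

    def __call__(self, src: str, dst: str) -> str:
        # All three shutil functions return the destination path.
        return self.copy(src, dst)
```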
- class niklib.data.preprocessor.MakeContentCopyProtectedMachineReadable[source]#
Bases:
FileTransform
Reads a 'content-copy' protected PDF and removes this restriction
Removing the protection is done by saving a “printed” version of the file via pikepdf
References
- class niklib.data.preprocessor.FileTransformCompose(transforms)[source]#
Bases:
object
Composes several transforms operating on files together
Each transform should be tied to files through a keyword, so that it is applied only to files that match that keyword. This mapping is expressed as a dictionary.
Transformation dictionary over files in the following structure:
{ FileTransform: 'filter_str', ..., }
Note
Transforms will be applied in order of the keys in the dictionary
- __init__(transforms)[source]#
- Parameters:
transforms (dict) – a dictionary of transforms, where the key is an instance of FileTransform and the value is the keyword that the transform will be applied to
- Raises:
ValueError – if the keyword is not a string
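The composition rule (a dictionary from transform to keyword, applied in key order, with a `ValueError` for non-string keywords) can be sketched as follows. The class name `ComposeSketch`, the `(src, dst)` call signature, and the "keyword is a substring of the file name" matching rule are assumptions made for illustration.

```python
import os
from typing import Callable, Dict, List


class ComposeSketch:
    """Apply each transform to a file only if its keyword matches the file name."""

    def __init__(self, transforms: Dict[Callable[[str, str], object], str]) -> None:
        for keyword in transforms.values():
            if not isinstance(keyword, str):
                raise ValueError(f'keyword must be a string, got {type(keyword).__name__}')
        self.transforms = transforms

    def __call__(self, src: str, dst: str) -> List[str]:
        applied: List[str] = []
        # Dicts keep insertion order, so transforms run in the order they were given.
        for transform, keyword in self.transforms.items():
            # Assumption: a keyword matches when it appears in the file name.
            if keyword in os.path.basename(src):
                transform(src, dst)
                applied.append(getattr(transform, '__name__', type(transform).__name__))
        return applied
```

Returning the names of the applied transforms is a convenience of this sketch that makes the filtering easy to inspect.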