niklib.data package#

Submodules#

niklib.data.constant module#

niklib.data.constant.EXAMPLE_FINANCIAL_RATIOS = {'deposit2rent': 0.03, 'deposit2worth': 5.0, 'income2tax': 0.15, 'income2worth': 15.0, 'rent2deposit': 33.333333333333336, 'tax2income': 6.666666666666667, 'worth2deposit': 0.2, 'worth2income': 0.06666666666666667}#

Ratios used to convert rent, deposit, and total worth to each other

Note

This is part of a set of dictionaries containing factors used in heuristic calculations based on domain knowledge.

Note

Although this was created as a code example, the values chosen here come from basic rules of thumb and can actually be used if no other reliable information is available.
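As a rough sketch of how such factors might be chained (the helper name and the rent figure below are made up for illustration, and only a subset of the dictionary is reproduced):

```python
# Subset of EXAMPLE_FINANCIAL_RATIOS from above
EXAMPLE_FINANCIAL_RATIOS = {
    'rent2deposit': 33.333333333333336,
    'deposit2worth': 5.0,
}

def estimate_worth_from_rent(monthly_rent):
    """Chain the heuristic factors: rent -> deposit -> total worth."""
    deposit = monthly_rent * EXAMPLE_FINANCIAL_RATIOS['rent2deposit']
    return deposit * EXAMPLE_FINANCIAL_RATIOS['deposit2worth']
```

For example, a monthly rent of 300 would be converted to a deposit of roughly 10000 and a total worth of roughly 50000 under these factors.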

class niklib.data.constant.ExampleFillna(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

Values used to fill None entries, depending on the form structure

Members follow the <field_name>_<form_name> naming convention. The values have been extracted by manually inspecting the documents. Hence, for each form, the user must find and set this value manually.

Note

We do not use any heuristics here; we just follow what the form used and only add another option to represent the None state, i.e. None as a separate feature in categorical mode.

CHD_M_STATUS_5645E = 9#
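A minimal sketch of how such a member might be used in a pandas workflow (the series values are made up; a local stand-in enum is defined so the snippet is self-contained):

```python
import pandas as pd
from enum import Enum

# Local stand-in for niklib.data.constant.ExampleFillna
class ExampleFillna(Enum):
    CHD_M_STATUS_5645E = 9

# Hypothetical 5645E column with a missing child marital status
s = pd.Series([1.0, None, 3.0])
filled = s.fillna(ExampleFillna.CHD_M_STATUS_5645E.value)
```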
class niklib.data.constant.ExampleDocTypes(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

Contains all document types which can be used to customize ETL steps for each document type

Members follow the <country_name>_<document_type> naming convention. The value and its order are meaningless.

CANADA = 1#
CANADA_5257E = 2#
CANADA_5645E = 3#
CANADA_LABEL = 4#
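One way such document types might drive ETL customization (a sketch; the cleaner names are hypothetical):

```python
from enum import Enum, auto

# Local stand-in mirroring the members above; values and order are meaningless
class ExampleDocTypes(Enum):
    CANADA = auto()
    CANADA_5257E = auto()
    CANADA_5645E = auto()
    CANADA_LABEL = auto()

def pick_cleaner(doc_type):
    """Dispatch a (hypothetical) cleaning step based on document type."""
    if doc_type is ExampleDocTypes.CANADA_5257E:
        return 'clean_5257e'
    if doc_type is ExampleDocTypes.CANADA_5645E:
        return 'clean_5645e'
    return 'clean_generic'
```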
class niklib.data.constant.ExampleMarriageStatus(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

States of marriage in (some specific) form

Note

Values for the members are the values used in the original forms. Hence, they should not be modified by any means, as they are tied to the dataset, transformations, and other domain-specific values.

Note

These values have been chosen for demonstration purposes in this class and do not carry any meaning or information. For real-world use, you must use meaningful values.

COMMON_LAW = 69#
DIVORCED = 3#
SEPARATED = 4#
MARRIED = 0#
SINGLE = 7#
WIDOWED = 85#
UNKNOWN = 9#
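Because member values equal the values used in the original forms, a raw form value resolves directly to a member (a sketch with a local stand-in enum holding a subset of the members above):

```python
from enum import Enum

# Local stand-in with a subset of the members above
class ExampleMarriageStatus(Enum):
    MARRIED = 0
    DIVORCED = 3
    COMMON_LAW = 69

# A raw value read from the original form maps directly to a member
status = ExampleMarriageStatus(69)
```

This is exactly why the values must not be modified: changing them would break the mapping from raw form values to members.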
class niklib.data.constant.ExampleSex(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

Sex types in general

Note

The values of the enum members are not important, hence no explicit values are assigned

Note

The names of the members have to be customized because of bad preprocessing (or, in some cases, domain-specific knowledge); hence, name has been overridden.

FEMALE = 1#
MALE = 2#

niklib.data.functional module#

Contains implementations of functions that can be used for processing data everywhere and are not necessarily bound to a class.

niklib.data.functional.dict_to_csv(d, path)[source]#

Takes a flattened dictionary and writes it to a CSV file.

Parameters:
  • d (dict) – A dictionary

  • path (str) – Path to the output file (will be created if it does not exist)

Return type:

None
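A plausible implementation of this behavior using only the standard library (a sketch, not necessarily the library's actual code): keys become the header row and values a single data row.

```python
import csv

def dict_to_csv(d, path):
    """Write a flattened dict as a two-row CSV: header (keys) and values."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(d))
        writer.writeheader()
        writer.writerow(d)
```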

niklib.data.functional.dump_directory_structure_csv(src, shallow=True)[source]#

Saves the tree structure of a directory in a CSV file

Takes a src directory path, creates a tree of the directory structure, and writes it to a CSV file named 'label.csv' with a default value of '0' for each path

Note

This has been used to manually extract and record labels.

Parameters:
  • src (str) – Source directory path

  • shallow (bool, optional) – Whether to dive only one level deep (False: recursive). Defaults to True.

Return type:

None

niklib.data.functional.create_directory_structure_tree(src, shallow=False)[source]#

Takes a path to a directory and creates a dictionary of its directory structure tree

Parameters:
  • src (str) – Path to source directory

  • shallow (bool, optional) – Whether to dive only into the root directory's immediate subdirectories. Defaults to False.

Reference:
  1. https://stackoverflow.com/a/25226267/18971263

Returns:

Dictionary of all dirs (and subdirs) where keys are path and values are 0

Return type:

dict

niklib.data.functional.flatten_dict(d)[source]#

Takes a (nested) multilevel dictionary and flattens it

Parameters:

d (dict) – A dictionary (could be multilevel)

Reference:
  1. https://stackoverflow.com/a/67744709/18971263

Returns:

Flattened dictionary where keys and values of returned dict are:
  • new_keys[i] = f'{old_keys[level]}.{old_keys[level+1]}.[...].{old_keys[level+n]}'

  • new_value = old_value

Return type:

dict
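A sketch of the described flattening (not necessarily the library's actual implementation), producing the dotted keys shown above:

```python
def flatten_dict(d, parent_key=''):
    """Recursively flatten a nested dict, joining key levels with '.'."""
    items = {}
    for key, value in d.items():
        new_key = f'{parent_key}.{key}' if parent_key else key
        if isinstance(value, dict):
            items.update(flatten_dict(value, new_key))
        else:
            items[new_key] = value
    return items
```

For example, `{'a': {'b': 1, 'c': {'d': 2}}, 'e': 3}` flattens to `{'a.b': 1, 'a.c.d': 2, 'e': 3}`.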

niklib.data.functional.xml_to_flattened_dict(xml)[source]#

Takes a (nested) XML and flattens it to a dict via flatten_dict()

Parameters:

xml (str) – An XML string

Returns:

A flattened dictionary of given XML

Return type:

dict
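A self-contained sketch of the idea using xml.etree.ElementTree (the real function delegates to flatten_dict() and may use a different parser):

```python
import xml.etree.ElementTree as ET

def xml_to_flattened_dict(xml):
    """Walk the XML tree and emit dotted-path keys for leaf text values."""
    def walk(elem, prefix):
        key = f'{prefix}.{elem.tag}' if prefix else elem.tag
        children = list(elem)
        if not children:
            return {key: elem.text}
        out = {}
        for child in children:
            out.update(walk(child, key))
        return out
    return walk(ET.fromstring(xml), '')
```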

niklib.data.functional.process_directory(src_dir, dst_dir, compose, file_pattern='*', manager=None)[source]#

Transforms all files that match a pattern in the given directory and saves the new files, preserving the directory structure

Note

A method used for moving files from the manually processed dataset to the raw dataset; see FileTransform for more information.

References

  1. https://stackoverflow.com/a/24041933/18971263

Parameters:
  • src_dir (str) – Source directory to be processed

  • dst_dir (str) – Destination directory to write processed files

  • compose (FileTransformCompose) – An instance of transform composer. see niklib.data.preprocessor.FileTransformCompose.

  • file_pattern (str, optional) – Pattern to match files. Defaults to '*' for all files.

  • manager (Optional[Manager], optional) – enlighten.Manager for progressbar. Defaults to None.

Return type:

None
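A simplified sketch of the traversal (omitting the progress-bar manager); any callable with a (src, dst) signature, such as a FileTransformCompose instance, works as compose:

```python
from pathlib import Path

def process_directory(src_dir, dst_dir, compose, file_pattern='*'):
    """Mirror src_dir's tree under dst_dir, transforming matching files."""
    src, dst = Path(src_dir), Path(dst_dir)
    for path in src.rglob(file_pattern):
        if path.is_file():
            # Preserve the relative directory structure in the destination
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            compose(str(path), str(target))
```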

niklib.data.functional.extended_dict_get(string, dic, if_nan, condition=None)[source]#

Takes a string and looks it up inside a dictionary, with a default value for failed lookups, only if a condition is satisfied

Parameters:
  • string (str) – the string to look for inside dictionary dic

  • dic (dict) – the dictionary that string is expected to be

  • if_nan (str) – the value returned if string could not be found in dic

  • condition (Callable, optional) – Look up string in dic only if condition(string) is True. Defaults to None.

Examples

>>> d = {'1': 'a', '2': 'b', '3': 'c'}
>>> extended_dict_get('1', d, 'z', str.isnumeric)
'a'
>>> extended_dict_get('x', d, 'z', str.isnumeric)
'x'
Returns:

Substituted value instead of string

Return type:

Any

niklib.data.functional.config_csv_to_dict(path)[source]#

Takes a config CSV and returns a dictionary of keys and values

Note

Configs of our use case can be found in niklib.configs

Parameters:

path (str) – string path to config file

Returns:

A dictionary of converted csv

Return type:

dict
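A sketch assuming the config CSV is a simple two-column key,value file without a header (the actual config layout may differ):

```python
import csv

def config_csv_to_dict(path):
    """Read a two-column CSV into a {key: value} dict."""
    with open(path, newline='') as f:
        return {row[0]: row[1] for row in csv.reader(f) if row}
```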

niklib.data.logic module#

class niklib.data.logic.Logics(dataframe=None)[source]#

Bases: object

Applies logics on different types of data, resulting in summarized, expanded, or transformed data

Methods here are implemented in such a way that they can be used as a pandas agg function over a pandas.Series, using functools.reduce.

Note

This is constructed based on domain knowledge, hence it is designed for a specific purpose depending on the application.

For demonstration purposes, see the following methods of this class, which are meant to be implemented (overridden) by subclasses such as ExampleLogics.

__init__(dataframe=None)[source]#

Init class by setting dataframe globally

Parameters:

dataframe (pandas.DataFrame, optional) – The dataframe whose series the functions of this class will be used on, i.e. Logics.*(series). Defaults to None.

__check_df(func)#

Checks that the dataframe is initialized when the function named func is called

Parameters:

func (str) – The name of the function that operates over df

Raises:

TypeError – If df is not initialized

Return type:

None

reset_dataframe(dataframe)[source]#

Takes a new dataframe and replaces the old one

Note

This should be used when the dataframe is modified outside of functions provided in this class. E.g.:

my_df: pd.DataFrame = ...
logics = Logics(dataframe=my_df)
my_df = third_party_tools(my_df)
# now update df in logics
logics.reset_dataframe(dataframe=my_df)
Parameters:

dataframe (pandas.DataFrame) – The new dataframe

Return type:

None

add_agg_column(aggregator, agg_column_name, columns)[source]#

Aggregates multiple columns into a new column and adds it to the original dataframe using an aggregator function

Parameters:
  • aggregator (Callable) – A function that takes multiple columns of a series and reduces them

  • agg_column_name (str) – The name of new aggregated column

  • columns (list) – Name of columns to be aggregated (i.e. input to aggregator)

Note

Although this function updates the dataframe the class was initialized with in place, the user must update the main dataframe outside of this class to be able to use it with other tools. Simply put:

my_df: pd.DataFrame = ...
logics = Logics(dataframe=my_df)
my_df = logics.add_agg_column(...)
my_df = third_party_tools(my_df)
# now update df in logics
logics.reset_dataframe(dataframe=my_df)
# aggregate again...
my_df = logics.add_agg_column(...)
Returns:

Updated dataframe that contains aggregated data

Return type:

pandas.DataFrame
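A minimal sketch of such an aggregation in plain pandas (the column names are hypothetical); the aggregator follows the agg-plus-functools.reduce style mentioned above:

```python
import functools
import pandas as pd

# Hypothetical dataframe with residency-period columns
df = pd.DataFrame({'p1': [0, 12], 'p2': [24, 0], 'p3': [0, 6]})

def count_nonzero(series):
    """Reduce a row of period columns to the count of non-zero entries."""
    return functools.reduce(lambda acc, v: acc + int(v != 0), series, 0)

# Aggregate the selected columns row-wise into a new column
df['n_nonzero'] = df[['p1', 'p2', 'p3']].agg(count_nonzero, axis=1)
```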

count_previous_residency_country(series)[source]#

Counts the number of previous countries of residence

Parameters:

series (pandas.Series) – Pandas Series to be processed

Returns:

Result of counting

Return type:

int

count_rel(series)[source]#

Counts the number of items for the given relationship

Parameters:

series (pandas.Series) – Pandas Series to be processed

Returns:

Result of counting

Return type:

int

count_foreign_family_resident(series)[source]#

Counts the number of family members that are living in a foreign country

Parameters:

series (pandas.Series) – Pandas Series to be processed

Returns:

Result of counting

Return type:

int

class niklib.data.logic.ExampleLogics(dataframe=None)[source]#

Bases: Logics

Customize and extend logics defined in Logics for an Example (Canada) dataset

__init__(dataframe=None)[source]#

Init class by setting dataframe globally

Parameters:

dataframe (pandas.DataFrame, optional) – The dataframe whose series the functions of this class will be used on, i.e. Logics.*(series). Defaults to None.

count_previous_residency_country(series)[source]#

Counts the number of previous residency by counting non-zero periods of residency

When *.Period == 0, we can say that the person has no residency; hence, one just needs to count the non-zero periods.

Parameters:

series (pandas.Series) – Pandas Series to be processed containing residency periods

Returns:

Result of counting

Return type:

int
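The described counting can be sketched in a couple of lines (the period values below are made up):

```python
import pandas as pd

# Hypothetical *.Period values for one applicant; 0 means no residency
periods = pd.Series([0.0, 24.0, 0.0, 6.0])
n_previous = int((periods != 0).sum())
```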

count_rel(series)[source]#

Counts the number of people for the given relationship, e.g. siblings.

Parameters:

series (pandas.Series) – Pandas Series to be processed

Returns:

Result of counting

Return type:

int

count_foreign_family_resident(series)[source]#

Counts the number of family members that are long-distance residents

This is being done by only checking the literal value 'foreign' in the '*Addr' columns (address columns).

Parameters:

series (pandas.Series) – Pandas Series to be processed, containing the residency state/province as a string. In practice, any string different from the applicant's province would be counted as different.

Examples

>>> import pandas as pd
>>> from niklib.data.logic import ExampleLogics
>>> f = ExampleLogics().count_foreign_family_resident
>>> s = pd.Series(['alborz', 'alborz', 'alborz', None, 'foreign', None, 'gilan', 'isfahan', None])
>>> f(s)
1
>>> s1 = pd.Series(['foreign', 'foreign', 'alborz', 'fars'])
>>> f(s1)
2
>>> s2 = pd.Series([None, None, 'alborz', 'fars'])
>>> f(s2)
0
Returns:

Result of counting

Return type:

int

niklib.data.pdf module#

class niklib.data.pdf.PDFIO[source]#

Bases: object

Base class for dealing with PDF files

For each mode of PDF, say XFA files, one needs to extend this class and implement abstract methods like extract_raw_content() to generate a string of the PDF's content in a format that can be used by the other classes (e.g. XML). For instance, see XFAPDF for an extension of this class.

__init__()[source]#
extract_raw_content(pdf_path)[source]#

Extracts unprocessed data from a PDF file

Parameters:

pdf_path (str) – Path to the pdf file

Return type:

str

find_in_dict(needle, haystack)[source]#

Looks for the value of a key inside a nested dictionary

Parameters:
  • needle (Any) – Key to look for

  • haystack (Any) – Dictionary to look in. Can be a dict inside another dict

Returns:

The value of key needle

Return type:

Any
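A sketch of the recursive lookup (returning None when the key is absent; not necessarily the library's exact semantics):

```python
def find_in_dict(needle, haystack):
    """Depth-first search for the first occurrence of key `needle`."""
    if isinstance(haystack, dict):
        if needle in haystack:
            return haystack[needle]
        for value in haystack.values():
            # Recurse into nested dictionaries
            found = find_in_dict(needle, value)
            if found is not None:
                return found
    return None
```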

class niklib.data.pdf.XFAPDF[source]#

Bases: PDFIO

Contains functions and utility tools for dealing with XFA PDF documents.

Note

Developers should subclass this and override clean_xml_for_csv() for their own specific XFA data.

__init__()[source]#
extract_raw_content(pdf_path)[source]#

Extracts RAW content of XFA PDF files which are in XML format

Parameters:

pdf_path (str) – path to the pdf file

Returns:

XFA object of the pdf file in XML format

Return type:

str

clean_xml_for_csv(xml, mode)[source]#

Cleans the XML file extracted from XFA forms

Since each form has its own format and issues, this method needs to be implemented separately for each unique file/form, specified using the argument mode, which can be populated from niklib.data.constant.ExampleDocTypes.

Parameters:
  • xml (str) – The XML content extracted from the XFA form

  • mode – Document type used to select the cleaning procedure; see niklib.data.constant.ExampleDocTypes
Returns:

cleaned XML content to be used in CSV file

Return type:

str

flatten_dict(d)[source]#

Takes a (nested) multilevel dictionary and flattens it

The final keys are key.key... and values are the leaf values of dictionary

Parameters:

d (dict) – A dictionary


Returns:

A flattened dictionary

Return type:

dict

xml_to_flattened_dict(xml)[source]#

Takes a (nested) XML and converts it to a flattened dictionary

The final keys are key.key... and values are the leaf values of XML tree

Parameters:

xml (str) – An XML string

Returns:

A flattened dictionary

Return type:

dict

class niklib.data.pdf.ExampleXFA[source]#

Bases: XFAPDF

Handles Example (Canada) XFA PDF files

__init__()[source]#
clean_xml_for_csv(xml, mode)[source]#

Hardcoded cleaning of Example XFA XML files to be XML compatible with CSV

Parameters:
  • xml (str) – The XML content extracted from the XFA form

  • mode – Document type used to select the cleaning procedure; see niklib.data.constant.ExampleDocTypes
Returns:

cleaned XML content to be used in CSV file

Return type:

str

niklib.data.preprocessor module#

class niklib.data.preprocessor.FileTransform[source]#

Bases: object

A base class for applying transforms as a composable object over files.

Any behavior over the files themselves (not the content of the files) must extend this class.

__init__()[source]#
__call__(src, dst, *args, **kwds)[source]#
Parameters:
  • src (str) – source file to be processed

  • dst (str) – the path where the processed file will be saved

Return type:

Any

class niklib.data.preprocessor.CopyFile(mode)[source]#

Bases: FileTransform

Only copies a file; a wrapper around shutil's copying methods

Default is set to 'cf', i.e. shutil.copyfile(). For more info see shutil documentation.

Reference:
  1. https://stackoverflow.com/a/30359308/18971263

__init__(mode)[source]#
__call__(src, dst, *args, **kwds)[source]#
Parameters:
  • src (str) – source file to be processed

  • dst (str) – the path where the processed file will be saved

Return type:

Any

__check_mode(mode)#

Checks copying mode to be available in shutil

Parameters:

mode (str) – copying mode in shutil, one of 'c', 'cf', 'c2'
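The mode strings plausibly map to shutil functions like so (a sketch; the helper name and mapping are assumptions based on the mode names above):

```python
import shutil

# Plausible mapping implied by the mode names above
COPY_MODES = {'c': shutil.copy, 'cf': shutil.copyfile, 'c2': shutil.copy2}

def check_mode(mode):
    """Validate a copying mode and return the matching shutil function."""
    if mode not in COPY_MODES:
        raise ValueError(f'Unknown copy mode: {mode!r}')
    return COPY_MODES[mode]
```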

class niklib.data.preprocessor.MakeContentCopyProtectedMachineReadable[source]#

Bases: FileTransform

Reads a 'content-copy' protected PDF and removes this restriction

Removing the protection is done by saving a “printed” version of it via pikepdf

References

  1. https://www.reddit.com/r/Python/comments/t32z2o/simple_code_to_unlock_all_readonly_pdfs_in/

  2. https://pikepdf.readthedocs.io/en/latest/

__init__()[source]#
__call__(src, dst, *args, **kwds)[source]#
Parameters:
  • src (str) – source file to be processed

  • dst (str) – destination to save the processed file

Returns:

None

Return type:

Any

class niklib.data.preprocessor.FileTransformCompose(transforms)[source]#

Bases: object

Composes several transforms operating on files together

Each transform is tied to files via a keyword; a transform is applied only to files that match its keyword, as specified by a dictionary

Transformation dictionary over files in the following structure:

{
    FileTransform: 'filter_str',
    ...,
}

Note

Transforms will be applied in order of the keys in the dictionary
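A sketch of the composition idea, matching the filter string against the source file name as a substring (the real class may use a different matching rule):

```python
class FileTransformCompose:
    """Apply each transform whose filter string matches the source path."""
    def __init__(self, transforms):
        # Validate that every filter keyword is a string
        for filter_str in transforms.values():
            if not isinstance(filter_str, str):
                raise ValueError('filter must be a string')
        self.transforms = transforms

    def __call__(self, src, dst):
        # Apply transforms in dictionary order to matching files
        for transform, filter_str in self.transforms.items():
            if filter_str in src:
                transform(src, dst)
```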

__init__(transforms)[source]#
Parameters:

transforms (dict) – a dictionary of transforms, where each key is an instance of FileTransform and the value is the keyword specifying which files the transform will be applied to

Raises:

ValueError – if the keyword is not a string

__call__(src, dst, *args, **kwds)[source]#

Applies transforms in order

Parameters:
  • src (str) – source file path to be processed

  • dst (str) – destination to save the processed file

Return type:

Any

Module contents#