cvfe.data package#

Submodules#

cvfe.data.constant module#

cvfe.data.constant.CANADA_5257E_KEY_ABBREVIATION = {'Address': 'Addr', 'BGI2.VisaChoice1': 'noAuthStay', 'BGI2.VisaChoice2': 'refuseDeport', 'BGI3.Choice': 'criminalRec', 'BackgroundInfo': 'BGI', 'Contact': 'cntct', 'ContactInformation': 'CI', 'CountryWhereApplying': 'CWA', 'Current': 'Curr', 'Details.VisaChoice3': 'PrevApply', 'DetailsOfVisit': 'DOV', 'Education': 'Edu', 'GovPosition.Choice': 'witnessIllTreat', 'HowLongStay': 'HLS', 'Language': 'Lang', 'MaritalStatus': 'MS', 'Marriage': 'Marr', 'Married': 'Marr', 'Number': 'Num', 'Occ.Choice': 'politicViol', 'Occupation': 'Occ', 'Page': 'P', 'PageWrapper': 'PW', 'Passport': 'Psprt', 'PersonalDetails': 'PD', 'Phone': 'Phn', 'Previous': 'Prev', 'Previously': 'Prev', 'Purpose': 'Prps', 'Resident': 'Resi', 'Section': 'Sec', 'Signature': 'Sign', 'backgroundInfoCalc': 'otherThanMedic', 'contact': 'cntct'}#: Dict of abbreviations used to shorten KEYS in the XML to CSV conversion

cvfe.data.constant.CANADA_5645E_KEY_ABBREVIATION = {'Address': 'Addr', 'Applicant': 'App', 'Child': 'Chd', 'Father': 'Fa', 'Mother': 'Mo', 'Occupation': 'Occ', 'Relationship': 'Rel', 'Section': 'Sec', 'Spouse': 'Sps', 'Yes': 'Accomp', 'page': 'p'}#: Dict of abbreviations used to shorten KEYS in the XML to CSV conversion

cvfe.data.constant.CANADA_5257E_VALUE_ABBREVIATION = {'045': 'TURKEY', '223': 'IRAN', 'BIOMETRIC ENROLMENT': 'Bio'}#: Dict of abbreviations used to shorten VALUES in the XML to CSV conversion

cvfe.data.constant.CANADA_5257E_DROP_COLUMNS = ['ns0:datasets.@xmlns:ns0', 'P1.Header.CRCNum', 'P1.FormVersion', 'P1.PD.UCIClientID', 'P1.PD.SecHeader.@ns0:dataNode', 'P1.PD.CurrCOR.Row1.@ns0:dataNode', 'P1.PD.PrevCOR.Row1.@ns0:dataNode', 'P1.PD.CWA.Row1.@ns0:dataNode', 'P1.PD.ApplicationValidatedFlag', 'P2.MS.SecA.SecHeader.@ns0:dataNode', 'P2.MS.SecA.PsprtSecHeader.@ns0:dataNode', 'P2.MS.SecA.Langs.languagesHeader.@ns0:dataNode', 'P2.natID.SecHeader.@ns0:dataNode', 'P2.USCard.SecHeader.@ns0:dataNode', 'P2.USCard.SecHeader.@ns0:dataNode', 'P2.CI.cntct.cntctInfoSecHeader.@ns0:dataNode', 'P3.SecHeader_DOV.@ns0:dataNode', 'P3.Edu.Edu_SecHeader.@ns0:dataNode', 'P3.Occ.SecHeader_CurrOcc.@ns0:dataNode', 'P3.BGI_SecHeader.@ns0:dataNode', 'P3.Sign.Consent0.Choice', 'P3.Sign.hand.@ns0:dataNode', 'P3.Sign.TextField2', 'P3.Disclosure.@ns0:dataNode', 'P3.ReaderInfo', 'Barcodes.@ns0:dataNode']#: List of columns to be dropped before doing any preprocessing

Note

This list has been determined manually.

cvfe.data.constant.CANADA_5645E_DROP_COLUMNS = {'formNum', 'p1.SecA.SecAdate', 'p1.SecA.SecAsignature', 'p1.SecA.Title.@xfa:dataNode', 'p1.SecB.SecBdate', 'p1.SecB.SecBsignature', 'p1.SecB.Title.@xfa:dataNode', 'p1.SecC.SecCsignature', 'p1.SecC.Subform2.@xfa:dataNode', 'p1.SecC.Title.@xfa:dataNode', 'xfa:datasets.@xmlns:xfa'}#: List of columns to be dropped before doing any preprocessing

Note

This list has been determined manually.

class cvfe.data.constant.DocTypes(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Contains all document types which can be used to customize ETL steps for each document type

Members follow the <country_name>_<document_type> naming convention. The values and their order carry no meaning.

CANADA = 1#

CANADA_5257E = 2#

CANADA_5645E = 3#

CANADA_LABEL = 4#

class cvfe.data.constant.CanadaCutoffTerms[source]#

Bases: object

Cutoff terms for different files that can be used with cvfe.data.functional.dict_summarizer()

CA5645E = 'IMM_5645'#

CA5257E = 'form1'#

class cvfe.data.constant.CanadaFillna[source]#

Bases: object

Values used to fill None entries, depending on the form structure

Members follow the <field_name>_<form_name> naming convention. The values have been extracted by manually inspecting the documents. Hence, for each form, the user must find and set these values manually.

Note

We do not use any heuristics here. We follow the options that the form itself uses and only add one extra option that represents the None state, i.e. None is treated as a separate category in categorical features.

COUNTRY_CODE_5257E = 'Unknown'#

VISA_TYPE_5257E = 'OTHER'#

PLACE_BIRTH_CITY_5257E = 'OTHER'#

COUNTRY_5257E = 'IRAN'#

CITIZENSHIP_5257E = 'IRAN'#

RESIDENCY_STATUS_5257E = 6#

OTHER_DESCRIPTION_INDICATOR_5257E = False#

PREVIOUS_COUNTRY_5257E = 'OTHER'#

COUNTRY_WHERE_APPLYING_5257E = 'OTHER'#

MARRIAGE_TYPE_5257E = 'OTHER'#

PASSPORT_COUNTRY_5257E = 'OTHER'#

NATIVE_LANG_5257E = 'IRAN'#

LANGUAGES_ABLE_TO_COMMUNICATE_5257E = 'NEITHER'#

ID_COUNTRY_5257E = 'IRAN'#

PURPOSE_OF_VISIT_5257E = 7#

CONTACT_TYPE_5257E = 'OTHER'#

OCCUPATION_5257E = 'OTHER'#

INDICATOR_FIELD_5257E = False#

VISA_APPLICATION_TYPE_5645E = '0'#

CHILD_MARRIAGE_STATUS_5645E = 9#

CHILD_RELATION_5645E = 'OTHER'#

VISA_RESULT = 0#

cvfe.data.constant.DATEUTIL_DEFAULT_DATETIME = {'day': 1, 'month': 1, 'year': 1}#: A default date for the dateutil.parser.parse function when some part of date is not provided

cvfe.data.constant.T0 = '19000202T000000'#: a default meaningless time to fill the `None`s

class cvfe.data.constant.CustomNamingEnum(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Extends base enum.Enum to support custom naming for members

Note

The name attribute has been overridden to return a member name that matches the dataset rather than the Enum naming convention of Python. For instance, COMMON_LAW -> common-law in the case of Canada forms.

Note

Devs should subclass this class and add their desired members in newly created classes. E.g. see CanadaMarriageStatus

Note

Subclasses should use enum.auto for the values of their members to signal that the chosen value is not domain-specific. Conversely, any explicit value given to a member implies a domain-specific value (e.g. one extracted from the dataset). Explicitly provided values are the values used in the original data, so they must not be modified, as they are tied to the dataset, transformations, and other domain-specific values. E.g. compare the values in CanadaMarriageStatus and SiblingRelation.

classmethod get_member_names()[source]#
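
The notes above can be illustrated with a minimal sketch. It is not the library's implementation: NamedEnum, MarriageStatus, and Relation are hypothetical stand-ins, and the naming rule (lower-case, underscores turned into hyphens) is an assumption inferred from the COMMON_LAW -> common-law example.

```python
from enum import Enum, auto
from types import DynamicClassAttribute


class NamedEnum(Enum):
    """Hypothetical stand-in for CustomNamingEnum."""

    @DynamicClassAttribute
    def name(self) -> str:
        # Assumed rule: lower-case the member name and turn '_' into '-'
        return self._name_.lower().replace("_", "-")

    @classmethod
    def get_member_names(cls) -> list:
        # Collect the customized names of all members, in definition order
        return [member.name for member in cls]


class MarriageStatus(NamedEnum):
    # Explicit values: assumed to come from the data, so they stay fixed
    COMMON_LAW = 2
    SINGLE = 7


class Relation(NamedEnum):
    # enum.auto(): the values carry no domain meaning
    BROTHER = auto()
    SISTER = auto()


print(MarriageStatus.COMMON_LAW.name)
print(MarriageStatus.get_member_names())
```

Lookup by value, such as MarriageStatus(7), is unaffected by the custom name, which is why explicit values can stay tied to the dataset.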

class cvfe.data.constant.CanadaMarriageStatus(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

States of marriage in Canada forms

Note

Values for the members are the values used in the original Canada forms. Hence, they must not be modified, as they are tied to the dataset, transformations, and other domain-specific values.

COMMON_LAW = 2#

DIVORCED = 3#

SEPARATED = 4#

MARRIED = 5#

SINGLE = 7#

WIDOWED = 8#

UNKNOWN = 9#

class cvfe.data.constant.CanadaContactRelation(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

Contact relation in Canada data

F1 = 1#

F2 = 2#

HOTEL = 3#

WORK = 4#

FRIEND = 5#

UKN = 6#

class cvfe.data.constant.CanadaResidencyStatus(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

Residency status in a country in Canada data

CITIZEN = 1#

VISITOR = 3#

OTHER = 6#

class cvfe.data.constant.Sex(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: CustomNamingEnum

Sex types in general

FEMALE = 1#

MALE = 2#

cvfe.data.functional module#

cvfe.data.functional.dict_summarizer(data_dict, cutoff_term, KEY_ABBREVIATION_DICT=None, VALUE_ABBREVIATION_DICT=None)[source]#

Takes a flattened dictionary and shortens its keys

Parameters:

data_dict (dict[str, Any]) – The dictionary to be shortened
cutoff_term (str) – The string to search for in keys; anything behind it is removed
KEY_ABBREVIATION_DICT (dict, optional) – A dictionary containing abbreviation mapping for keys. Defaults to None.
VALUE_ABBREVIATION_DICT (dict, optional) – A dictionary containing abbreviation mapping for values. Defaults to None.

Returns:

A dict with shortened keys, obtained by throwing away part of each key and applying an abbreviation dictionary to both keys and values.

Return type:

dict[str, Any]
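
As a rough illustration of the behaviour described above, the following sketch shortens keys in three steps. It is a hypothetical re-implementation, not the library's code: summarize_keys, the sample key, and the three rules marked as assumptions in the comments are all guesses based on the parameter descriptions.

```python
from typing import Any, Optional


def summarize_keys(
    data: dict,
    cutoff_term: str,
    key_abbreviations: Optional[dict] = None,
    value_abbreviations: Optional[dict] = None,
) -> dict:
    """Hypothetical sketch of key/value shortening."""
    key_abbreviations = key_abbreviations or {}
    value_abbreviations = value_abbreviations or {}
    result: dict = {}
    for key, value in data.items():
        # Assumption: drop everything up to and including the cutoff term
        if cutoff_term in key:
            key = key.split(cutoff_term, 1)[1].lstrip(".")
        # Assumption: key abbreviations are plain substring replacements
        for long_form, short_form in key_abbreviations.items():
            key = key.replace(long_form, short_form)
        # Assumption: values are abbreviated only on an exact match
        if isinstance(value, str):
            value = value_abbreviations.get(value, value)
        result[key] = value
    return result


# Made-up flattened key, used only for illustration
flat = {"xfa:data.form1.Page1.PersonalDetails.Country": "223"}
short = summarize_keys(
    flat,
    cutoff_term="form1",
    key_abbreviations={"Page": "P", "PersonalDetails": "PD"},
    value_abbreviations={"223": "IRAN"},
)
print(short)
```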

cvfe.data.functional.dict_to_csv(data_dict, path)[source]#

Takes a flattened dictionary and writes it to a CSV file.

Parameters:

data_dict (dict[str, Any]) – A dictionary to be saved
path (str) – Path to the output file (will be created if it does not exist)

Return type:

None

cvfe.data.functional.key_dropper(data_dict, string, exclude=None, regex=False, inplace=True)[source]#

Takes a dictionary and drops keys matching a pattern

Parameters:

data_dict (dict[str, Any]) – Dictionary to be processed
string (str) – string to look for in data_dict keys
exclude (Optional[str], optional) – string to exclude a subset of keys from being dropped. Defaults to None.
regex (bool, optional) – compile string as regex. Defaults to False.
inplace (bool, optional) – whether or not to use an in-place operation. Defaults to True.

Returns:

Searches for keys containing string, either as a raw string or as a regex (in the latter case, use regex=True), and after excluding the subset that matches exclude, drops the remaining keys in-place.

Return type:

Optional[dict[str, Any]]
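
A minimal sketch of this kind of key dropping is shown below. drop_keys is a hypothetical helper, and two details are assumptions rather than documented behaviour: exclude is treated as a plain substring, and None is returned when the operation is in-place.

```python
import re
from typing import Any, Optional


def drop_keys(
    data: dict,
    string: str,
    exclude: Optional[str] = None,
    regex: bool = False,
    inplace: bool = True,
) -> Optional[dict]:
    """Hypothetical sketch: drop keys that match `string`."""
    target = data if inplace else dict(data)

    def matches(key: str) -> bool:
        if regex:
            return re.search(string, key) is not None
        return string in key

    # Iterate over a copy of the keys because the dict is mutated in the loop
    for key in list(target):
        if not matches(key):
            continue
        # Assumption: `exclude` is a substring that protects a key from removal
        if exclude is not None and exclude in key:
            continue
        del target[key]
    # Assumption: nothing is returned for in-place operations
    return None if inplace else target


row = {"P1.Sec.Title": 1, "P1.Sec.Date": 2, "P2.Name": 3}
kept = drop_keys(row, "Sec", exclude="Date", inplace=False)
print(kept)
```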

cvfe.data.functional.fillna_datetime(data_dict, key_base_name, date, doc_type, one_sided=False, inplace=False)[source]#

Takes the names of two keys with date values (start, end) and fills them with a predefined value

Parameters:

data_dict (dict[str, Any]) – A dictionary to be processed
key_base_name (str) – Base key name that accepts 'From' and 'To' for extracting dates of same category
date (str) – The desired date
doc_type (DocTypes) – Document type used to select the rules for matching tags and filling them appropriately. See DocTypes.
one_sided (str | bool, optional) –
Different ways of filling empty date keys. Defaults to False. Could be one of the following:
1. 'right': Uses the current_date as the final time
2. 'left': Uses the reference_date as the starting time
inplace (bool, optional) – whether or not to use an in-place operation. Defaults to False.

Note

In transformation operations such as the aggregate_datetime() function, this is converted to a period of zero. It is useful for filling periods of non-existing items (e.g. the age of children for a single person).

Returns:: A dictionary in which the two date keys that had no value (None) are both filled with the exact same date, given by date.
Return type:: dict[str, Any]

cvfe.data.functional.aggregate_datetime(data_dict, key_base_name, new_key_name, doc_type, if_nan='skip', one_sided=None, reference_date=None, current_date=None, **kwargs)[source]#

Takes two keys of dates in string form and calculates the period between them

Parameters:

data_dict (dict[str, Any]) – A dictionary to be processed
key_base_name (str) – Base key name that accepts 'From' and 'To' for extracting dates of same category
new_key_name (str) – The key name that extends key_base_name and will be the final key containing the period.
doc_type (DocTypes) – document type used to use rules for matching tags and filling appropriately. See DocTypes.
if_nan (str | Callable, optional) –
What to do with None values (NaN). Could be a function or one of the following predefined states. Defaults to 'skip'.
1. 'skip': do nothing (i.e. ignore None values)
one_sided (Optional[str], optional) –
Different ways of filling empty date keys. Defaults to None. Could be one of the following:
1. 'right': Uses the current_date as the final time
2. 'left': Uses the reference_date as the starting time
reference_date (Optional[str], optional) – Assumed reference_date (t0<t1). Defaults to None.
current_date (Optional[str], optional) – Assumed current_date (t1>t0). Defaults to None.
default_datetime – accepts datetime.datetime to set default date for dateutil.parser.parse.

Returns:

A new dictionary that contains a key holding the period between the two date keys, represented as an integer. The two keys used for the calculation are dropped.

Return type:

dict[str, Any]
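
The core idea, replacing a pair of date keys with a single period, can be sketched as follows. This is an illustration under assumptions, not the library's logic: period_in_days is hypothetical, dates are assumed to be ISO strings (the real function relies on dateutil.parser.parse), the 'From'/'To' suffixes are appended directly to the base name, and the new key is formed as key_base_name + new_key_name.

```python
from datetime import date


def period_in_days(data: dict, key_base_name: str, new_key_name: str) -> dict:
    """Hypothetical sketch: replace '<base>From' / '<base>To' with a day count."""
    result = dict(data)  # work on a copy and return a new dictionary
    start = result.pop(key_base_name + "From")
    end = result.pop(key_base_name + "To")
    if start is None or end is None:
        # Assumed reading of if_nan='skip': leave the period undefined
        result[key_base_name + new_key_name] = None
    else:
        delta = date.fromisoformat(end) - date.fromisoformat(start)
        result[key_base_name + new_key_name] = delta.days
    return result


row = {"Edu.From": "2020-01-01", "Edu.To": "2020-01-31", "Name": "x"}
print(period_in_days(row, "Edu.", "Period"))

# Two identical dates, as produced by a fillna step, give a period of zero
filled = {"Child.From": "1900-02-02", "Child.To": "1900-02-02"}
print(period_in_days(filled, "Child.", "Period"))
```

The second call shows why filling both keys of a missing pair with the same date is convenient: the aggregated period becomes zero without any special casing.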

cvfe.data.functional.tag_to_regex_compatible(string, doc_type)[source]#

Takes a string and makes it regex compatible for XML parsed string

Note

This is a specialized method, and it may be better to override it for your own case.

Parameters:

string (str) – input string to get manipulated
doc_type (DocTypes) – specified DocTypes to determine regex rules

Returns:

A modified string

Return type:

str

cvfe.data.functional.change_dtype(data_dict, key_name, dtype, if_nan='skip', **kwargs)[source]#

Changes the data type of a key, with the ability to fill None values (fillna)

Parameters:

data_dict (dict[str, Any]) – A dictionary that key_name will be searched on
key_name (str) – Desired key name of the dictionary
dtype (Callable) – target data type as a function e.g. float
if_nan (str | Callable, optional) –
What to do with None values (NaN). Defaults to 'skip'. Could be a function or one of the following predefined states:
1. 'skip': do nothing (i.e. ignore None values)
2. 'value': fill the None with the value argument passed via kwargs
default_datetime (optional) – accepts datetime.datetime to set default date for dateutil.parser.parse

Raises:

ValueError – if string mode passed to if_nan does not exist. It won’t raise if if_nan is Callable.

Returns:

A dictionary in which the value of key_name has been converted to dtype, with None handled according to if_nan.

Return type:

dict[str, Any]
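
The if_nan modes above can be sketched like this. cast_key is a hypothetical helper; whether the library casts the fill value, or what it passes to a callable if_nan, is not documented here, so both choices in the sketch are assumptions.

```python
from typing import Any, Callable, Union


def cast_key(
    data: dict,
    key_name: str,
    dtype: Callable,
    if_nan: Union[str, Callable] = "skip",
    **kwargs: Any,
) -> dict:
    """Hypothetical sketch of dtype conversion with None handling."""
    result = dict(data)
    value = result[key_name]
    if value is not None:
        result[key_name] = dtype(value)
    elif callable(if_nan):
        # Assumption: a callable receives the missing value and returns the fill
        result[key_name] = if_nan(value)
    elif if_nan == "skip":
        pass  # leave the None untouched
    elif if_nan == "value":
        # Assumption: the fill value is used as given, without casting
        result[key_name] = kwargs["value"]
    else:
        raise ValueError(f"Unknown if_nan mode: {if_nan!r}")
    return result


print(cast_key({"age": "42"}, "age", int))
print(cast_key({"age": None}, "age", int, if_nan="value", value=0))
```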

cvfe.data.functional.flatten_dict(dictionary)[source]#

Takes a (nested) multilevel dictionary and flattens it

Parameters:: dictionary (dict[str, Any]) – A dictionary (could be multilevel)

References

https://stackoverflow.com/a/67744709/18971263

Returns:

Flattened dictionary where keys and values of returned dict are:

new_keys[i] = f'{old_keys[level]}.{old_keys[level+1]}.[...].{old_keys[level+n]}'

new_value = old_value

Return type:

dict[str, Any]
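
The key scheme above corresponds to a recursive flattening along these lines. This is a generic sketch rather than the library's code: flatten is a hypothetical name, and leaving non-dict values such as lists untouched is an assumption.

```python
def flatten(nested: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Hypothetical sketch: join nested keys with `sep`, keep leaf values."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            # Leaf value (anything that is not a dict) is copied unchanged
            flat[full_key] = value
    return flat


nested = {"P1": {"PD": {"Name": "A"}, "Age": 3}}
print(flatten(nested))
```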

cvfe.data.functional.xml_to_flattened_dict(xml)[source]#

Takes a (nested) XML and flattens it to a dict via flatten_dict()

Parameters:: xml (str) – An XML string
Returns:: A flattened dictionary of given XML
Return type:: dict

cvfe.data.functional.process_directory(src_dir, dst_dir, compose, file_pattern='*')[source]#

Transforms all files that match pattern in given dir and saves new files preserving dir structure

Note

A method used for handling files when going from the manually processed dataset to the raw dataset. See FileTransform for more information.

References

https://stackoverflow.com/a/24041933/18971263

Parameters:

src_dir (str) – Source directory to be processed
dst_dir (str) – Destination directory to write processed files
compose (FileTransformCompose) – An instance of the transform composer. See FileTransformCompose.
file_pattern (str, optional) – Pattern to match files. Defaults to '*', which matches all files.

Return type:

None
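
A directory walk that preserves the relative layout might look like the sketch below. process_tree is hypothetical, any callable taking (src, dst) stands in for the composer, and recursive matching via pathlib's rglob is an assumption about how file_pattern is applied.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def process_tree(
    src_dir: str, dst_dir: str, transform: Callable, file_pattern: str = "*"
) -> None:
    """Hypothetical sketch: apply `transform` to matching files, keep the layout."""
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    for src_path in sorted(src_root.rglob(file_pattern)):
        if not src_path.is_file():
            continue  # skip directories that happen to match the pattern
        # Mirror the path relative to the source root under the destination root
        dst_path = dst_root / src_path.relative_to(src_root)
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        transform(str(src_path), str(dst_path))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src")
    (src / "a").mkdir(parents=True)
    (src / "a" / "f.txt").write_text("hi")
    dst = Path(tmp, "dst")
    # shutil.copyfile plays the role of the transform in this demo
    process_tree(str(src), str(dst), shutil.copyfile, "*.txt")
    copied = (dst / "a" / "f.txt").read_text()
    print(copied)
```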

cvfe.data.pdf module#

class cvfe.data.pdf.PDFIO[source]#

Bases: object

Base class for dealing with PDF files

For each type of PDF, let’s say XFA files, one needs to extend this class and implement abstract methods like extract_raw_content() to generate a string of the content of the PDF in a format that can be used by the other classes (e.g. XML). For instance, see XFAPDF for an extension of this class.

__init__()[source]#

extract_raw_content(pdf_path)[source]#

Extracts unprocessed data from a PDF file

Parameters:: pdf_path (str) – Path to the pdf file
Return type:: str

find_in_dict(needle, haystack)[source]#

Looks for the value of a key inside a nested dictionary

Parameters:

needle (Any) – Key to look for
haystack (Any) – Dictionary to look in. Can be a dict inside another dict

Returns:

The value of key needle

Return type:

Any
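
A recursive lookup of this kind can be sketched as below. find_key is a hypothetical helper; descending into lists and returning None when the key is absent are assumptions, and the sample keys are made up.

```python
from typing import Any


def find_key(needle: Any, haystack: Any) -> Any:
    """Hypothetical sketch: depth-first search for the first value under `needle`."""
    if isinstance(haystack, dict):
        if needle in haystack:
            return haystack[needle]
        children = list(haystack.values())
    elif isinstance(haystack, (list, tuple)):
        children = list(haystack)
    else:
        return None  # scalars cannot contain keys
    for child in children:
        found = find_key(needle, child)
        if found is not None:
            return found
    return None


document = {"Root": {"Form": [{"Other": 1}, {"XFA": ["template", "datasets"]}]}}
print(find_key("XFA", document))
```

One limitation of this sketch is that a key whose stored value is None is indistinguishable from a missing key.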

class cvfe.data.pdf.XFAPDF[source]#

Bases: PDFIO

Contains functions and utility tools for dealing with XFA PDF documents.

__init__()[source]#

extract_raw_content(pdf_path)[source]#

Extracts RAW content of XFA PDF files which are in XML format

Parameters:: pdf_path (str) – path to the pdf file

Reference:

https://towardsdatascience.com/how-to-extract-data-from-pdf-forms-using-python-10b5e5f26f70

Returns:: XFA object of the pdf file in XML format
Return type:: str

clean_xml_for_csv(xml, type)[source]#

Cleans the XML file extracted from XFA forms

Since each form has its own format and issues, this method needs to be implemented separately for each file/form. The form is specified via the type argument, which takes a member of DocTypes.

Parameters:

xml (str) – XML content
type (Enum) – type of the document defined in DocTypes

Returns:

cleaned XML content to be used in CSV file

Return type:

str

flatten_dict_basic(d)[source]#

Takes a (nested) dictionary and flattens it

Parameters:: d (dict) – A dictionary

References

https://stackoverflow.com/questions/38852822/how-to-flatten-xml-file-in-python

Returns:: An ordered dict
Return type:: dict

flatten_dict(d)[source]#

Takes a (nested) multilevel dictionary and flattens it

The final keys are key.key... and values are the leaf values of dictionary

Parameters:: d (dict) – A dictionary

References

https://stackoverflow.com/a/67744709/18971263

Returns:: A flattened dictionary
Return type:: dict

xml_to_flattened_dict(xml)[source]#

Takes a (nested) XML and converts it to a flattened dictionary

The final keys are key.key... and values are the leaf values of XML tree

Parameters:: xml (str) – An XML string
Returns:: A flattened dictionary
Return type:: dict

class cvfe.data.pdf.CanadaXFA[source]#

Bases: XFAPDF

Handles Canada XFA PDF files

__init__()[source]#

clean_xml_for_csv(xml, type)[source]#

Hardcoded cleaning of Canada XFA XML files to make the XML compatible with CSV

Parameters:

xml (str) – XML content
type (Enum) – type of the document defined in DocTypes

Returns:

cleaned XML content to be used in CSV file

Return type:

str

cvfe.data.preprocessor module#

class cvfe.data.preprocessor.DataDictPreprocessor(data_dict=None)[source]#

Bases: object

A set of utilities over a dictionary of data that make data preprocessing easier

A class that contains methods for dealing with dictionaries regarding transformation of data such as filling missing values, dropping keys, or aggregating multiple keys into a single more meaningful one.

This class needs to be extended for file specific preprocessing where tags are unique and need to be done entirely manually. In this case, file_specific_basic_transform() needs to be implemented.

__init__(data_dict=None)[source]#

Parameters:: data_dict (Optional[dict[str, Any]], optional) – Main dictionary of data to be preprocessed. Defaults to None.

key_dropper(string, exclude=None, regex=False, inplace=True)[source]#

See cvfe.data.functional.key_dropper() for more information

Return type:: Optional[dict[str, Any]]

file_specific_basic_transform(doc_type, path)[source]#

Takes a specific file then does data type fixing, missing value filling, discretization, etc.

Note

Since each file has its own unique tags and requirements, all of these transformations are expected to be hardcoded for each file. Hence, this method exists only to improve readability, without any generalization to other problems or even other files.

Parameters:

doc_type (DocTypes) – The input document type (see DocTypes)
path (str) – Path to the input document

Return type:

dict[str, Any]

change_dtype(key_name, dtype, if_nan='skip', **kwargs)[source]#: See cvfe.data.functional.change_dtype() for more details

config_csv_to_dict(path)[source]#

Takes a config CSV and returns a dictionary of keys and values

Parameters:: path (str) – string path to config file
Return type:: dict

class cvfe.data.preprocessor.CanadaDataDictPreprocessor(data_dict=None)[source]#

Bases: DataDictPreprocessor

__init__(data_dict=None)[source]#

Parameters:: data_dict (Optional[dict[str, Any]], optional) – Main dictionary of data to be preprocessed. Defaults to None.

convert_country_code_to_name(string)[source]#

Converts the (custom and non-standard) code of a country to its name given the XFA docs LOV section.

Parameters:: string (str) – input code string
Return type:: str

file_specific_basic_transform(doc_type, path)[source]#

Takes a specific file then does data type fixing, missing value filling, discretization, etc.

Parameters:

doc_type (DocTypes) – The input document type (see DocTypes)
path (str) – Path to the input document

Return type:

dict[str, Any]

class cvfe.data.preprocessor.FileTransformCompose(transforms)[source]#

Bases: object

Composes several transforms operating on files together

Each transform should be tied to files through a keyword; a transform is applied only to files that match its keyword, as specified in a dictionary.

Transformation dictionary over files in the following structure:

{
    FileTransform: 'filter_str',
    ...,
}

Note

Transforms will be applied in order of the keys in the dictionary

__init__(transforms)[source]#

Parameters:: transforms (dict[FileTransform, str]) – a dictionary of transforms, where the key is the instance of FileTransform and the value is the keyword that the transform will be applied to
Raises:: ValueError – if the keyword is not a string

__call__(src, dst, *args, **kwds)[source]#

Applies transforms in order

Parameters:

src (str) – source file path to be processed
dst (str) – destination to save the processed file

Return type:

Any
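
The composition described above can be sketched as follows. Compose is a hypothetical stand-in for FileTransformCompose, plain functions stand in for FileTransform instances, and matching the keyword as a substring of the source path is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


class Compose:
    """Hypothetical sketch of a keyword-filtered transform composer."""

    def __init__(self, transforms: dict) -> None:
        for keyword in transforms.values():
            if not isinstance(keyword, str):
                raise ValueError("Each keyword must be a string")
        self.transforms = transforms

    def __call__(self, src: str, dst: str) -> None:
        # Dictionaries keep insertion order, so transforms run in key order
        for transform, keyword in self.transforms.items():
            # Assumption: a transform applies when its keyword occurs in the path
            if keyword in src:
                transform(src, dst)


compose = Compose({shutil.copyfile: "5257E"})

with tempfile.TemporaryDirectory() as tmp:
    matching = Path(tmp, "form_5257E.txt")
    other = Path(tmp, "notes.txt")
    matching.write_text("form")
    other.write_text("notes")
    compose(str(matching), str(Path(tmp, "out_form.txt")))
    compose(str(other), str(Path(tmp, "out_notes.txt")))
    produced = sorted(p.name for p in Path(tmp).glob("out_*"))
    print(produced)
```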

class cvfe.data.preprocessor.FileTransform[source]#

Bases: object

A base class for applying transforms as a composable object over files.

Any operation on the files themselves (not on the content of files) must extend this class.

__init__()[source]#

__call__(src, dst, *args, **kwds)[source]#

Parameters:

src (str) – source file to be processed
dst (str) – the path where the processed file will be saved

Return type:

Any

class cvfe.data.preprocessor.CopyFile(mode)[source]#

Bases: FileTransform

Only copies a file, a wrapper around shutil’s copying methods

Default is set to ‘cf’, i.e. shutil.copyfile. For more info see shutil documentation.

Reference:

https://stackoverflow.com/a/30359308/18971263

__init__(mode)[source]#

__call__(src, dst, *args, **kwds)[source]#

Parameters:

src (str) – source file to be processed
dst (str) – the path where the processed file will be saved

Return type:

Any

__check_mode(mode)#

Checks copying mode to be available in shutil

Parameters:: mode (str) – copying mode in shutil, one of ‘c’, ‘cf’, ‘c2’
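
One way such a mode check could work is a lookup table, sketched below. Copy and COPY_MODES are hypothetical; only the 'cf' to shutil.copyfile mapping is stated above, while 'c' to shutil.copy and 'c2' to shutil.copy2 are assumptions based on the naming.

```python
import shutil
import tempfile
from pathlib import Path

# 'cf' is documented above; the other two mappings are assumed
COPY_MODES = {
    "c": shutil.copy,
    "cf": shutil.copyfile,
    "c2": shutil.copy2,
}


class Copy:
    """Hypothetical sketch of a copy transform with a validated mode."""

    def __init__(self, mode: str = "cf") -> None:
        if mode not in COPY_MODES:
            raise ValueError(f"Unknown mode {mode!r}, expected one of {sorted(COPY_MODES)}")
        self.copy_function = COPY_MODES[mode]

    def __call__(self, src: str, dst: str) -> str:
        # All three shutil functions return the destination path
        return self.copy_function(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "in.txt")
    source.write_text("data")
    Copy("c2")(str(source), str(Path(tmp, "out.txt")))
    content = Path(tmp, "out.txt").read_text()
    print(content)
```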

class cvfe.data.preprocessor.MakeContentCopyProtectedMachineReadable[source]#

Bases: FileTransform

Reads a ‘content-copy’ protected PDF and removes this restriction

Removing the protection is done by saving a “printed” version of the file via pikepdf

__init__()[source]#

__call__(src, dst, *args, **kwds)[source]#

Parameters:

src (str) – source file to be processed
dst (str) – destination to save the processed file

Returns:

None

Return type:

Any

Module contents#