cvfe.data package#
Submodules#
cvfe.data.constant module#
- cvfe.data.constant.CANADA_5257E_KEY_ABBREVIATION = {'Address': 'Addr', 'BGI2.VisaChoice1': 'noAuthStay', 'BGI2.VisaChoice2': 'refuseDeport', 'BGI3.Choice': 'criminalRec', 'BackgroundInfo': 'BGI', 'Contact': 'cntct', 'ContactInformation': 'CI', 'CountryWhereApplying': 'CWA', 'Current': 'Curr', 'Details.VisaChoice3': 'PrevApply', 'DetailsOfVisit': 'DOV', 'Education': 'Edu', 'GovPosition.Choice': 'witnessIllTreat', 'HowLongStay': 'HLS', 'Language': 'Lang', 'MaritalStatus': 'MS', 'Marriage': 'Marr', 'Married': 'Marr', 'Number': 'Num', 'Occ.Choice': 'politicViol', 'Occupation': 'Occ', 'Page': 'P', 'PageWrapper': 'PW', 'Passport': 'Psprt', 'PersonalDetails': 'PD', 'Phone': 'Phn', 'Previous': 'Prev', 'Previously': 'Prev', 'Purpose': 'Prps', 'Resident': 'Resi', 'Section': 'Sec', 'Signature': 'Sign', 'backgroundInfoCalc': 'otherThanMedic', 'contact': 'cntct'}#
Dictionary of abbreviations used to shorten KEYS in the XML-to-CSV conversion
- cvfe.data.constant.CANADA_5645E_KEY_ABBREVIATION = {'Address': 'Addr', 'Applicant': 'App', 'Child': 'Chd', 'Father': 'Fa', 'Mother': 'Mo', 'Occupation': 'Occ', 'Relationship': 'Rel', 'Section': 'Sec', 'Spouse': 'Sps', 'Yes': 'Accomp', 'page': 'p'}#
Dictionary of abbreviations used to shorten KEYS in the XML-to-CSV conversion
- cvfe.data.constant.CANADA_5257E_VALUE_ABBREVIATION = {'045': 'TURKEY', '223': 'IRAN', 'BIOMETRIC ENROLMENT': 'Bio'}#
Dictionary of abbreviations used to shorten VALUES in the XML-to-CSV conversion
- cvfe.data.constant.CANADA_5257E_DROP_COLUMNS = ['ns0:datasets.@xmlns:ns0', 'P1.Header.CRCNum', 'P1.FormVersion', 'P1.PD.UCIClientID', 'P1.PD.SecHeader.@ns0:dataNode', 'P1.PD.CurrCOR.Row1.@ns0:dataNode', 'P1.PD.PrevCOR.Row1.@ns0:dataNode', 'P1.PD.CWA.Row1.@ns0:dataNode', 'P1.PD.ApplicationValidatedFlag', 'P2.MS.SecA.SecHeader.@ns0:dataNode', 'P2.MS.SecA.PsprtSecHeader.@ns0:dataNode', 'P2.MS.SecA.Langs.languagesHeader.@ns0:dataNode', 'P2.natID.SecHeader.@ns0:dataNode', 'P2.USCard.SecHeader.@ns0:dataNode', 'P2.USCard.SecHeader.@ns0:dataNode', 'P2.CI.cntct.cntctInfoSecHeader.@ns0:dataNode', 'P3.SecHeader_DOV.@ns0:dataNode', 'P3.Edu.Edu_SecHeader.@ns0:dataNode', 'P3.Occ.SecHeader_CurrOcc.@ns0:dataNode', 'P3.BGI_SecHeader.@ns0:dataNode', 'P3.Sign.Consent0.Choice', 'P3.Sign.hand.@ns0:dataNode', 'P3.Sign.TextField2', 'P3.Disclosure.@ns0:dataNode', 'P3.ReaderInfo', 'Barcodes.@ns0:dataNode']#
List of columns to be dropped before doing any preprocessing
Note
This list has been determined manually.
- cvfe.data.constant.CANADA_5645E_DROP_COLUMNS = {'formNum', 'p1.SecA.SecAdate', 'p1.SecA.SecAsignature', 'p1.SecA.Title.@xfa:dataNode', 'p1.SecB.SecBdate', 'p1.SecB.SecBsignature', 'p1.SecB.Title.@xfa:dataNode', 'p1.SecC.SecCsignature', 'p1.SecC.Subform2.@xfa:dataNode', 'p1.SecC.Title.@xfa:dataNode', 'xfa:datasets.@xmlns:xfa'}#
Set of columns to be dropped before doing any preprocessing
Note
This set has been determined manually.
- class cvfe.data.constant.DocTypes(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Contains all document types which can be used to customize ETL steps for each document type
Members follow the <country_name>_<document_type> naming convention. The value and its order are meaningless.
- CANADA = 1#
- CANADA_5257E = 2#
- CANADA_5645E = 3#
- CANADA_LABEL = 4#
- class cvfe.data.constant.CanadaCutoffTerms[source]#
Bases:
object
Cutoff terms for different files, which can be used with cvfe.data.functional.dict_summarizer()
- CA5645E = 'IMM_5645'#
- CA5257E = 'form1'#
- class cvfe.data.constant.CanadaFillna[source]#
Bases:
object
Values used to fill None values depending on the form structure
Members follow the <field_name>_<form_name> naming convention. The values have been extracted by manually inspecting the documents. Hence, for each form, the user must find and set these values manually.
Note
We do not use any heuristics here; we simply follow the options the form uses and add one more option that represents the None state, i.e. None as a separate feature in categorical mode.
- COUNTRY_CODE_5257E = 'Unknown'#
- VISA_TYPE_5257E = 'OTHER'#
- PLACE_BIRTH_CITY_5257E = 'OTHER'#
- COUNTRY_5257E = 'IRAN'#
- CITIZENSHIP_5257E = 'IRAN'#
- RESIDENCY_STATUS_5257E = 6#
- OTHER_DESCRIPTION_INDICATOR_5257E = False#
- PREVIOUS_COUNTRY_5257E = 'OTHER'#
- COUNTRY_WHERE_APPLYING_5257E = 'OTHER'#
- MARRIAGE_TYPE_5257E = 'OTHER'#
- PASSPORT_COUNTRY_5257E = 'OTHER'#
- NATIVE_LANG_5257E = 'IRAN'#
- LANGUAGES_ABLE_TO_COMMUNICATE_5257E = 'NEITHER'#
- ID_COUNTRY_5257E = 'IRAN'#
- PURPOSE_OF_VISIT_5257E = 7#
- CONTACT_TYPE_5257E = 'OTHER'#
- OCCUPATION_5257E = 'OTHER'#
- INDICATOR_FIELD_5257E = False#
- VISA_APPLICATION_TYPE_5645E = '0'#
- CHILD_MARRIAGE_STATUS_5645E = 9#
- CHILD_RELATION_5645E = 'OTHER'#
- VISA_RESULT = 0#
- cvfe.data.constant.DATEUTIL_DEFAULT_DATETIME = {'day': 1, 'month': 1, 'year': 1}#
A default date for the dateutil.parser.parse function when some part of the date is not provided
- class cvfe.data.constant.CustomNamingEnum(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
Extends base enum.Enum to support custom naming for members
Note
Class attribute name has been overridden to return the name of a member (for example, a marital status) as it appears in the dataset rather than in Python's Enum naming convention. For instance, COMMON_LAW -> common-law in the case of Canada forms.
Note
Devs should subclass this class and add their desired members in the newly created classes. E.g. see CanadaMarriageStatus.
Note
Subclasses should use enum.auto for member values to indicate that the chosen value is not domain-specific. Any explicit value given to a member implies a domain-specific value (e.g. one extracted from the dataset). Explicitly provided values are the values used in the original data; hence, they should not be modified by any means, as they are tied to the dataset, transformations, and other domain-specific values. E.g. compare the values in CanadaMarriageStatus and SiblingRelation.
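The naming behavior described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the class names HyphenNamingEnum, MarriageStatusSketch and RelationSketch are hypothetical, and the lower-case, hyphenated naming rule is an assumption based on the COMMON_LAW -> common-law example.

```python
from enum import Enum, auto


class HyphenNamingEnum(Enum):
    """Hypothetical base class: reports member names in the dataset's spelling."""

    @property
    def name(self) -> str:
        # Assumed rule: lower-case and hyphenated, e.g. COMMON_LAW -> common-law
        return self._name_.lower().replace("_", "-")


class MarriageStatusSketch(HyphenNamingEnum):
    # Explicit values: assumed to come from the data, so they must not change
    COMMON_LAW = 2
    MARRIED = 5


class RelationSketch(HyphenNamingEnum):
    # auto(): the value itself carries no domain meaning
    BROTHER = auto()
    SISTER = auto()


print(MarriageStatusSketch.COMMON_LAW.name)
print(MarriageStatusSketch(5))
```

Lookup by value still works as usual, so a code read from a form can be mapped to a member and then to its dataset-style name.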
- class cvfe.data.constant.CanadaMarriageStatus(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
States of marriage in Canada forms
Note
Values for the members are the values used in original Canada forms. Hence, it should not be modified by any means as it is tied to dataset, transformation, and other domain-specific values.
- COMMON_LAW = 2#
- DIVORCED = 3#
- SEPARATED = 4#
- MARRIED = 5#
- SINGLE = 7#
- WIDOWED = 8#
- UNKNOWN = 9#
- class cvfe.data.constant.CanadaContactRelation(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
Contact relation in Canada data
- F1 = 1#
- F2 = 2#
- HOTEL = 3#
- WORK = 4#
- FRIEND = 5#
- UKN = 6#
- class cvfe.data.constant.CanadaResidencyStatus(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
Residency status in a country in Canada data
- CITIZEN = 1#
- VISITOR = 3#
- OTHER = 6#
- class cvfe.data.constant.Sex(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
CustomNamingEnum
Sex types in general
- FEMALE = 1#
- MALE = 2#
cvfe.data.functional module#
- cvfe.data.functional.dict_summarizer(data_dict, cutoff_term, KEY_ABBREVIATION_DICT=None, VALUE_ABBREVIATION_DICT=None)[source]#
Takes a flattened dictionary and shortens its keys
- Parameters:
cutoff_term (str) – The string to look for in keys; anything behind it is removed
KEY_ABBREVIATION_DICT (dict, optional) – A dictionary containing abbreviation mapping for keys. Defaults to None.
VALUE_ABBREVIATION_DICT (dict, optional) – A dictionary containing abbreviation mapping for values. Defaults to None.
- Returns:
A dict whose keys are shortened by discarding part of each key and by applying abbreviation dictionaries to both keys and values.
- Return type:
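A rough sketch of this kind of key shortening is given below. It is not the package's code: summarize_keys is a hypothetical name, and the sketch assumes that everything up to and including the cutoff term is discarded, that key abbreviations are applied as substring replacements (longest first), and that value abbreviations match whole values only.

```python
def summarize_keys(data, cutoff_term, key_abbr=None, value_abbr=None):
    """Hypothetical helper: shorten the keys (and values) of a flat dict."""
    key_abbr = key_abbr or {}
    value_abbr = value_abbr or {}
    result = {}
    for key, value in data.items():
        # Assumption: keep only the part of the key after the cutoff term
        if cutoff_term in key:
            key = key.split(cutoff_term, 1)[1].lstrip(".")
        # Replace longer terms first so 'PersonalDetails' wins over 'Page'-like prefixes
        for long, short in sorted(key_abbr.items(), key=lambda kv: -len(kv[0])):
            key = key.replace(long, short)
        # Assumption: values are abbreviated only on an exact match
        if isinstance(value, str):
            value = value_abbr.get(value, value)
        result[key] = value
    return result


flat = {
    "xfa:datasets.form1.Page1.PersonalDetails.Sex": "F",
    "xfa:datasets.form1.Page1.PersonalDetails.CountryCode": "223",
}
short = summarize_keys(
    flat,
    cutoff_term="form1",
    key_abbr={"Page": "P", "PersonalDetails": "PD"},
    value_abbr={"223": "IRAN"},
)
print(short)
```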
- cvfe.data.functional.dict_to_csv(data_dict, path)[source]#
Takes a flattened dictionary and writes it to a CSV file.
- cvfe.data.functional.key_dropper(data_dict, string, exclude=None, regex=False, inplace=True)[source]#
Takes a dictionary and drops keys matching a pattern
- Parameters:
string (str) – string to look for in data_dict keys
exclude (Optional[str], optional) – string to exclude a subset of keys from being dropped. Defaults to None.
regex (bool, optional) – compile string as regex. Defaults to False.
inplace (bool, optional) – whether or not to use an inplace operation. Defaults to True.
- Returns:
Takes a dictionary and searches for keys containing string, either as a raw string or as a regex (in the latter case, use regex=True), and after excluding a subset of them, drops the remaining keys in-place.
- Return type:
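The matching and exclusion logic described above might look roughly like this. The sketch is illustrative only: drop_keys is a hypothetical name, it returns a new dictionary instead of working in place, and it assumes exclude is always a plain substring even when regex=True.

```python
import re


def drop_keys(data, string, exclude=None, regex=False):
    """Hypothetical helper: return a copy of `data` without keys matching `string`."""

    def matches(key):
        # Either a regex search or a plain substring test
        return bool(re.search(string, key)) if regex else string in key

    return {
        key: value
        for key, value in data.items()
        # Keep keys that do not match, plus matching keys rescued by `exclude`
        if not matches(key) or (exclude is not None and exclude in key)
    }


record = {"P1.Sign.hand": 1, "P1.Sign.date": 2, "P1.PD.Name": 3}
print(drop_keys(record, "Sign", exclude="date"))
print(drop_keys(record, r"^P1\.Sign\.", regex=True))
```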
- cvfe.data.functional.fillna_datetime(data_dict, key_base_name, date, doc_type, one_sided=False, inplace=False)[source]#
Takes the names of two keys holding date values (start, end) and fills them with a predefined value
- Parameters:
key_base_name (str) – Base key name that accepts 'From' and 'To' for extracting dates of the same category
date (str) – The desired date
doc_type (DocTypes) – DocTypes used to apply rules for matching tags and filling appropriately.
one_sided (str | bool, optional) – Different ways of filling empty date keys. Defaults to False.
'right': Uses the current_date as the final time
'left': Uses the reference_date as the starting time
inplace (bool, optional) – whether or not to use an inplace operation. Defaults to False.
Note
In transformation operations such as the aggregate_datetime() function, this would be converted to a period of zero. It is useful for filling periods of non-existing items (e.g. age of children for a single person).
- cvfe.data.functional.aggregate_datetime(data_dict, key_base_name, new_key_name, doc_type, if_nan='skip', one_sided=None, reference_date=None, current_date=None, **kwargs)[source]#
Takes two keys of dates in string form and calculates the period between them
- Parameters:
key_base_name (str) – Base key name that accepts 'From' and 'To' for extracting dates of the same category
new_key_name (str) – The key name that extends key_base_name and will be the final key containing the period.
doc_type (DocTypes) – document type used to apply rules for matching tags and filling appropriately. See DocTypes.
if_nan (str | Callable, optional) – What to do with None values (NaN). Could be a function or one of the following predefined states. Defaults to 'skip'.
'skip': do nothing (i.e. ignore None values)
one_sided (Optional[str], optional) – Different ways of filling empty date keys. Defaults to None. Could be one of the following:
'right': Uses the current_date as the final time
'left': Uses the reference_date as the starting time
reference_date (Optional[str], optional) – Assumed reference_date (t0<t1). Defaults to None.
current_date (Optional[str], optional) – Assumed current_date (t1>t0). Defaults to None.
default_datetime – accepts datetime.datetime to set the default date for dateutil.parser.parse.
- Returns:
A new dictionary that contains a key holding the period between the two date keys, represented as an integer. The two keys used for the calculation are dropped.
- Return type:
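The aggregation described above can be sketched with the standard library alone. This is not the package's implementation: period_in_days is a hypothetical name, the 'From' and 'To' suffixes are assumed to be appended directly to the base key, dates are assumed to be ISO formatted, and only a 'right'-style fill (a missing end date replaced by current_date) is shown.

```python
from datetime import date


def period_in_days(data, base, new_key, current_date=None):
    """Hypothetical helper: replace '<base>From'/'<base>To' with their distance in days."""
    start_key, end_key = base + "From", base + "To"
    start = data.get(start_key)
    # 'right'-style one-sided fill: fall back to current_date when the end is missing
    end = data.get(end_key) or current_date
    # The two source keys are dropped from the result
    out = {k: v for k, v in data.items() if k not in (start_key, end_key)}
    if start is None or end is None:
        out[new_key] = None  # 'skip'-style handling: leave the period empty
    else:
        out[new_key] = (date.fromisoformat(end) - date.fromisoformat(start)).days
    return out


visit = {"Visit.From": "2020-01-01", "Visit.To": "2020-01-31", "Name": "x"}
print(period_in_days(visit, "Visit.", "Visit.Period"))
```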
- cvfe.data.functional.tag_to_regex_compatible(string, doc_type)[source]#
Takes a string and makes it regex compatible for XML-parsed strings
Note
This is a specialized method, and it may be better to override it for your own case.
- cvfe.data.functional.change_dtype(data_dict, key_name, dtype, if_nan='skip', **kwargs)[source]#
Changes the data type of a key with the ability to fill None values (fillna)
- Parameters:
data_dict (dict[str, Any]) – A dictionary in which key_name will be searched
key_name (str) – Desired key name of the dictionary
dtype (Callable) – target data type as a function, e.g. float
if_nan (str, Callable, optional) – What to do with None values (NaN). Defaults to 'skip'. Could be a function or one of the following predefined states:
'skip': do nothing (i.e. ignore None values)
'value': fill the None with the value argument passed via kwargs
default_datetime (optional) – accepts datetime.datetime to set the default date for dateutil.parser.parse
- Raises:
ValueError – if the string mode passed to if_nan does not exist. It won’t raise if if_nan is Callable.
- Returns:
A dictionary in which the value under key_name has been converted with dtype.
- Return type:
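The if_nan modes listed above could be handled along these lines. The sketch is an assumption-laden illustration rather than the package's code: convert_key is a hypothetical name, it returns a copy of the dictionary, and a value filled in 'value' mode is stored as given without being passed through dtype.

```python
def convert_key(data, key, dtype, if_nan="skip", **kwargs):
    """Hypothetical helper: convert data[key] with `dtype`, handling None."""
    # Unknown string modes are rejected; callables are always accepted
    if not callable(if_nan) and if_nan not in ("skip", "value"):
        raise ValueError(f"unknown if_nan mode: {if_nan!r}")
    out = dict(data)
    current = out.get(key)
    if current is None:
        if callable(if_nan):
            out[key] = if_nan(current)
        elif if_nan == "value":
            out[key] = kwargs["value"]  # assumed: stored as given
        # 'skip': leave the None untouched
        return out
    out[key] = dtype(current)
    return out


print(convert_key({"age": "42"}, "age", int))
print(convert_key({"age": None}, "age", int, if_nan="value", value=0))
```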
- cvfe.data.functional.flatten_dict(dictionary)[source]#
Takes a (nested) multilevel dictionary and flattens it
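A minimal sketch of such flattening is shown below. The function name flatten and the dot separator are assumptions; the separator is chosen only because the column names listed earlier (such as 'P1.PD.UCIClientID') are dot-joined.

```python
def flatten(nested, parent="", sep="."):
    """Hypothetical helper: flatten a nested dict into dot-joined keys."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, carrying the key path along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


form = {"P1": {"PD": {"Name": "A", "Sex": "F"}}, "formNum": 1}
print(flatten(form))
```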
- cvfe.data.functional.xml_to_flattened_dict(xml)[source]#
Takes a (nested) XML and flattens it to a dict via flatten_dict()
- cvfe.data.functional.process_directory(src_dir, dst_dir, compose, file_pattern='*')[source]#
Transforms all files that match a pattern in the given directory and saves the new files, preserving the directory structure
Note
A method used for handling files when going from the manually processed dataset to the raw dataset; see FileTransform for more information.
- Parameters:
src_dir (str) – Source directory to be processed
dst_dir (str) – Destination directory to write processed files
compose (FileTransformCompose) – An instance of the transform composer. See FileTransformCompose.
file_pattern (str, optional) – pattern to match files. Defaults to '*', which matches all files.
- Return type:
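The directory walk described above can be approximated with pathlib. This is a sketch under assumptions, not the package's code: mirror_directory is a hypothetical name, and the transform is assumed to be any callable taking a source path and a destination path (shutil.copyfile is used as a stand-in).

```python
import shutil
from pathlib import Path


def mirror_directory(src_dir, dst_dir, transform, file_pattern="*"):
    """Hypothetical helper: apply `transform(src, dst)` to matching files,
    recreating the source directory layout under `dst_dir`."""
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    written = []
    for src in sorted(src_root.rglob(file_pattern)):
        if not src.is_file():
            continue
        # Same relative path, but rooted at the destination directory
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        transform(str(src), str(dst))
        written.append(dst)
    return written
```

For example, `mirror_directory("manual", "raw", shutil.copyfile, "*.pdf")` would copy only the PDF files, keeping each one in the same sub-directory; the directory names here are placeholders.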
cvfe.data.pdf module#
- class cvfe.data.pdf.PDFIO[source]#
Bases:
object
Base class for dealing with PDF files
For each type of PDF, for example XFA files, one needs to extend this class and implement abstract methods like extract_raw_content() to generate a string of the content of the PDF in a format that can be used by the other classes (e.g. XML). For instance, see XFAPDF for an extension of this class.
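The extension pattern described above resembles a standard abstract base class. The sketch below is illustrative only: RawContentReader and EchoReader are hypothetical names, and the toy subclass returns a fixed XML-like string instead of reading a real PDF.

```python
from abc import ABC, abstractmethod


class RawContentReader(ABC):
    """Hypothetical stand-in for a PDF base class."""

    @abstractmethod
    def extract_raw_content(self, pdf_path: str) -> str:
        """Return the document content as a string (e.g. XML)."""


class EchoReader(RawContentReader):
    """Toy subclass: wraps the path in an XML element instead of parsing a PDF."""

    def extract_raw_content(self, pdf_path: str) -> str:
        return f"<document path={pdf_path!r}/>"


print(EchoReader().extract_raw_content("a.pdf"))
```

The base class cannot be instantiated until the abstract method is implemented, which makes the per-format requirement explicit.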
- class cvfe.data.pdf.XFAPDF[source]#
Bases:
PDFIO
Contains functions and utility tools for dealing with XFA PDF documents.
- extract_raw_content(pdf_path)[source]#
Extracts RAW content of XFA PDF files which are in XML format
- Parameters:
pdf_path (str) – path to the pdf file
- Returns:
XFA object of the pdf file in XML format
- Return type:
- clean_xml_for_csv(xml, type)[source]#
Cleans the XML file extracted from XFA forms
Since each form has its own format and issues, this method needs to be implemented separately for each unique file/form, which is specified using the argument type that can be populated from DocTypes.
- flatten_dict_basic(d)[source]#
Takes a (nested) dictionary and flattens it
ref: https://stackoverflow.com/questions/38852822/how-to-flatten-xml-file-in-python
- Parameters:
d (dict) – A dictionary
- Returns:
An ordered dict
- Return type:
cvfe.data.preprocessor module#
- class cvfe.data.preprocessor.DataDictPreprocessor(data_dict=None)[source]#
Bases:
object
A set of utilities over dictionary of data to make it easier for data preprocessing
A class that contains methods for dealing with dictionaries regarding transformation of data such as filling missing values, dropping keys, or aggregating multiple keys into a single more meaningful one.
This class needs to be extended for file-specific preprocessing, where tags are unique and the work needs to be done entirely manually. In this case, file_specific_basic_transform() needs to be implemented.
- key_dropper(string, exclude=None, regex=False, inplace=True)[source]#
See cvfe.data.functional.key_dropper() for more information
- file_specific_basic_transform(doc_type, path)[source]#
Takes a specific file then does data type fixing, missing value filling, discretization, etc.
Note
Since each file has its own unique tags and requirements, all of these transformations are expected to be hardcoded for each file. Hence, this method exists only to improve readability, without any generalization to other problems or even other files.
- change_dtype(key_name, dtype, if_nan='skip', **kwargs)[source]#
See cvfe.data.functional.change_dtype() for more details
- class cvfe.data.preprocessor.CanadaDataDictPreprocessor(data_dict=None)[source]#
Bases:
DataDictPreprocessor
- convert_country_code_to_name(string)[source]#
Converts the (custom and non-standard) code of a country to its name given the XFA docs LOV section.
- file_specific_basic_transform(doc_type, path)[source]#
Takes a specific file then does data type fixing, missing value filling, discretization, etc.
Note
Since each file has its own unique tags and requirements, all of these transformations are expected to be hardcoded for each file. Hence, this method exists only to improve readability, without any generalization to other problems or even other files.
- class cvfe.data.preprocessor.FileTransformCompose(transforms)[source]#
Bases:
object
Composes several transforms operating on files together
Each transform is tied to files by a keyword, given through a dictionary, and is applied only to files that match that keyword
Transformation dictionary over files in the following structure:
{ FileTransform: 'filter_str', ..., }
Note
Transforms will be applied in order of the keys in the dictionary
- __init__(transforms)[source]#
- Parameters:
transforms (dict[FileTransform, str]) – a dictionary of transforms, where the key is the instance of FileTransform and the value is the keyword that the transform will be applied to
- Raises:
ValueError – if the keyword is not a string
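The keyword-filtered composition described above might be sketched as follows. None of this is the package's code: TransformSketch, RecordCall and ComposeSketch are hypothetical names, the (src, dst) call signature is an assumption, and "matching" is assumed to mean that the keyword is a substring of the source path.

```python
class TransformSketch:
    """Hypothetical base class for a file-level transform."""

    def __call__(self, src, dst):
        raise NotImplementedError


class RecordCall(TransformSketch):
    """Toy transform: records each call instead of touching the file system."""

    def __init__(self, label, log):
        self.label, self.log = label, log

    def __call__(self, src, dst):
        self.log.append((self.label, src, dst))


class ComposeSketch:
    """Applies each transform only to files whose path contains its keyword."""

    def __init__(self, transforms):
        for keyword in transforms.values():
            if not isinstance(keyword, str):
                raise ValueError(f"keyword must be a string, got {keyword!r}")
        self.transforms = transforms

    def __call__(self, src, dst):
        # dicts preserve insertion order, so transforms run in key order
        for transform, keyword in self.transforms.items():
            if keyword in src:
                transform(src, dst)


log = []
compose = ComposeSketch({RecordCall("pdf", log): ".pdf", RecordCall("any", log): ""})
compose("a/form.pdf", "b/form.pdf")
compose("a/x.csv", "b/x.csv")
print(log)
```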
- class cvfe.data.preprocessor.FileTransform[source]#
Bases:
object
A base class for applying transforms as a composable object over files.
Any behavior over the files themselves (not the content of the files) must extend this class.
- class cvfe.data.preprocessor.CopyFile(mode)[source]#
Bases:
FileTransform
Only copies a file, a wrapper around shutil’s copying methods
Default is set to ‘cf’, i.e. shutil.copyfile. For more info see shutil documentation.
- class cvfe.data.preprocessor.MakeContentCopyProtectedMachineReadable[source]#
Bases:
FileTransform
Reads a ‘content-copy’ protected PDF and removes this restriction
Removing the protection is done by saving a “printed” version of the file via pikepdf