AWS Glue PII detector job taking too much time

I have an AWS Glue PII data detector job, and it's taking around 47 minutes to complete for a 17.9 MB file, which is a very long time for any Spark job.

Here is the code snippet used in the job:

# Imports assumed from the standard Glue job preamble (not shown in the original snippet);
# input_location and file_name are assumed to come from the job arguments.
from awsglue.context import GlueContext
from awsglueml.transforms import EntityDetector

# Read the input CSV from S3 into a DynamicFrame
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": [f'{input_location}{file_name}']},
    transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "PERSON_NAME",
        "EMAIL",
        "CREDIT_CARD",
        "IP_ADDRESS",
        "MAC_ADDRESS",
        "PHONE_NUMBER",
        "USA_PASSPORT_NUMBER",
        "USA_SSN",
        "USA_ITIN",
        "BANK_ACCOUNT",
        "USA_DRIVING_LICENSE",
        "USA_HCPCS_CODE",
        "USA_NATIONAL_DRUG_CODE",
        "USA_NATIONAL_PROVIDER_IDENTIFIER",
        "USA_DEA_NUMBER",
        "USA_HEALTH_INSURANCE_CLAIM_NUMBER",
        "USA_MEDICARE_BENEFICIARY_IDENTIFIER",
        "JAPAN_BANK_ACCOUNT",
        "JAPAN_DRIVING_LICENSE",
        "JAPAN_MY_NUMBER",
        "JAPAN_PASSPORT_NUMBER",
        "UK_BANK_ACCOUNT",
        "UK_BANK_SORT_CODE",
        "UK_DRIVING_LICENSE",
        "UK_ELECTORAL_ROLL_NUMBER",
        "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
        "UK_NATIONAL_INSURANCE_NUMBER",
        "UK_PASSPORT_NUMBER",
        "UK_PHONE_NUMBER",
        "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER",
        "UK_VALUE_ADDED_TAX",
        "CANADA_SIN",
        "CANADA_PASSPORT_NUMBER",
        "GENDER",
    ],
    1.0,   # sample fraction: scan 100% of the rows
    0.55,  # detection threshold: fraction of sampled rows that must match for the column to be flagged
)

I have the Spark application log file as well, but I can't attach it to this question.

What is the root cause of this job taking so much time?

1 Answer
Accepted Answer

Hi,

It could be because you are checking a huge list of entity types against all of the rows.

Do your use case requirements allow you to reduce the sample portion (which defines the percentage of rows scanned for each PII entity) as well as the detection threshold (which defines the percentage of rows that must contain the PII entity for the entire column to be identified as having that PII entity)?
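For example, something along these lines (a minimal sketch only, reusing the DynamicFrame from your snippet; the shortened entity list and the 0.1 sample fraction are illustrative values to adapt to your requirements, not recommendations):

from awsglueml.transforms import EntityDetector

entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    ["PERSON_NAME", "EMAIL", "USA_SSN"],  # illustrative: keep only the entity types you really need
    0.1,   # sample fraction: scan 10% of the rows instead of 100%
    0.55,  # detection threshold, kept as in your current job
)

Both a shorter entity list and a smaller sample fraction reduce the amount of pattern matching the detector does per row, which is the likely driver of the long run time here.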

answered a year ago
  • In my use case, I am getting files from different sources such as customers, bank loans, credit risks, and many more. I am profiling the data files and at the same time trying to detect PII data. If I reduce the number of entities, I might miss some of the PII columns. Since I am using the Spark environment in Glue, parallel processing should happen and the job should complete within a few minutes, not in 47 minutes.
