AWS Glue PII detector job taking too much time


I have an AWS Glue PII data detector job that takes around 47 minutes to complete for a 17.9 MB file, which is a very long time for any Spark job.

Here is the code snippet used in the job:

# Imports and context setup assumed from the standard Glue job boilerplate
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglueml.transforms import EntityDetector

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read the CSV file from S3 into a DynamicFrame
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": [f'{input_location}{file_name}']},
    transformation_ctx="S3bucket_node1",
)
# Detect PII columns with the Glue EntityDetector transform
entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "PERSON_NAME",
        "EMAIL",
        "CREDIT_CARD",
        "IP_ADDRESS",
        "MAC_ADDRESS",
        "PHONE_NUMBER",
        "USA_PASSPORT_NUMBER",
        "USA_SSN",
        "USA_ITIN",
        "BANK_ACCOUNT",
        "USA_DRIVING_LICENSE",
        "USA_HCPCS_CODE",
        "USA_NATIONAL_DRUG_CODE",
        "USA_NATIONAL_PROVIDER_IDENTIFIER",
        "USA_DEA_NUMBER",
        "USA_HEALTH_INSURANCE_CLAIM_NUMBER",
        "USA_MEDICARE_BENEFICIARY_IDENTIFIER",
        "JAPAN_BANK_ACCOUNT",
        "JAPAN_DRIVING_LICENSE",
        "JAPAN_MY_NUMBER",
        "JAPAN_PASSPORT_NUMBER",
        "UK_BANK_ACCOUNT",
        "UK_BANK_SORT_CODE",
        "UK_DRIVING_LICENSE",
        "UK_ELECTORAL_ROLL_NUMBER",
        "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
        "UK_NATIONAL_INSURANCE_NUMBER",
        "UK_PASSPORT_NUMBER",
        "UK_PHONE_NUMBER",
        "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER",
        "UK_VALUE_ADDED_TAX",
        "CANADA_SIN",
        "CANADA_PASSPORT_NUMBER",
        "GENDER",
    ],
    1.0,   # sample portion: fraction of rows scanned for PII entities
    0.55,  # detection threshold: fraction of scanned rows that must match for a column to be flagged
)

I have the Spark application log file as well, but I can't attach it to this question.

What is the root cause of the long run time for this job?

Asked 1 year ago · Viewed 421 times
1 Answer
Accepted Answer

Hi,

It could be because you are checking a very large list of entity types against all of the rows.

Do your use case requirements allow you to reduce the sample portion (which defines the percentage of rows scanned for PII entities) as well as the detection threshold (which defines the percentage of rows that must contain the PII entity for the entire column to be identified as having that PII entity)?
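For illustration, a minimal sketch of that suggestion, reusing the classify_columns call from the question; the narrower entity list and the 0.1 values are hypothetical placeholders, not recommendations:

# Hypothetical example: fewer entity types, smaller sample portion, adjusted threshold
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "EMAIL",
        "CREDIT_CARD",
        "USA_SSN",
    ],    # only the entity types the use case actually requires
    0.1,  # sample portion: scan 10% of the rows instead of 100%
    0.1,  # detection threshold, reduced as suggested above
)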

Expert
Answered 1 year ago
  • In my use case, I am getting files from different sources such as customers, bank loans, credit risks, and many more. I am profiling the data files and at the same time trying to detect PII data. If I reduce the number of entities, I might miss some of the PII columns. Since I am using the Spark environment in Glue, processing should happen in parallel and complete within a few minutes, not 47 minutes.
