AWS Glue PII detector job taking too much time


I have an AWS Glue PII data detector job that takes around 47 minutes to complete for a 17.9 MB file, which is a very long time for any Spark job.

Here is the code snippet used in the job:

# EntityDetector comes from AWS Glue's ML transforms; glueContext,
# input_location and file_name are defined earlier in the job script.
from awsglueml.transforms import EntityDetector

S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": [f'{input_location}{file_name}']},
    transformation_ctx="S3bucket_node1",
)
# Detect PII entities across all columns of the DynamicFrame
entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    [
        "PERSON_NAME",
        "EMAIL",
        "CREDIT_CARD",
        "IP_ADDRESS",
        "MAC_ADDRESS",
        "PHONE_NUMBER",
        "USA_PASSPORT_NUMBER",
        "USA_SSN",
        "USA_ITIN",
        "BANK_ACCOUNT",
        "USA_DRIVING_LICENSE",
        "USA_HCPCS_CODE",
        "USA_NATIONAL_DRUG_CODE",
        "USA_NATIONAL_PROVIDER_IDENTIFIER",
        "USA_DEA_NUMBER",
        "USA_HEALTH_INSURANCE_CLAIM_NUMBER",
        "USA_MEDICARE_BENEFICIARY_IDENTIFIER",
        "JAPAN_BANK_ACCOUNT",
        "JAPAN_DRIVING_LICENSE",
        "JAPAN_MY_NUMBER",
        "JAPAN_PASSPORT_NUMBER",
        "UK_BANK_ACCOUNT",
        "UK_BANK_SORT_CODE",
        "UK_DRIVING_LICENSE",
        "UK_ELECTORAL_ROLL_NUMBER",
        "UK_NATIONAL_HEALTH_SERVICE_NUMBER",
        "UK_NATIONAL_INSURANCE_NUMBER",
        "UK_PASSPORT_NUMBER",
        "UK_PHONE_NUMBER",
        "UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER",
        "UK_VALUE_ADDED_TAX",
        "CANADA_SIN",
        "CANADA_PASSPORT_NUMBER",
        "GENDER",
    ],
    1.0,   # sample portion: fraction of rows scanned for each entity type
    0.55,  # detection threshold: fraction of rows that must match before a column is flagged
)

I have the Spark application log file as well, but I can't attach it to this question.

What is the root cause of this job taking so long?

Asked 1 year ago · 421 views
1 Answer
Accepted Answer

Hi,

It could be because you are scanning for a very large list of entity types across all of the rows.

Do your use case requirements allow you to reduce the sample portion (which defines the percentage of rows scanned for each PII entity) as well as the detection threshold (which defines the percentage of rows that must contain the PII entity in order for the entire column to be identified as having that PII entity)?
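For example, a minimal sketch of that tuning (the shortened entity list and the 0.1 values below are only illustrative assumptions to adapt to your accuracy requirements, not recommended settings):

entity_detector = EntityDetector()
classified_map = entity_detector.classify_columns(
    S3bucket_node1,
    ["PERSON_NAME", "EMAIL", "CREDIT_CARD", "USA_SSN"],  # trimmed list of entity types
    0.1,  # sample portion: scan roughly 10% of rows per column
    0.1,  # detection threshold: ~10% of sampled rows must match to flag the column
)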

Expert
Answered 1 year ago
  • In my use case, I am getting files from different sources like customer, bank-loans, credit-risks, and many more. I am profiling the data files and, at the same time, trying to detect PII data. If I reduce the number of entities, I might miss some of the PII columns. Since I am using the Spark environment in Glue, parallel processing should happen and the job should complete within a few minutes, not 47 minutes.
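For reference, one quick way to check how much parallelism a single 17.9 MB CSV actually gives the job is to look at the partition count of the frame; a small file often lands in one or very few partitions, so the runtime is dominated by the detector's per-row scanning rather than by idle executors. A sketch reusing the names from the snippet above (the repartition call and its value are only an illustration, not a confirmed fix):

# How many partitions did the CSV read actually produce?
print(S3bucket_node1.toDF().rdd.getNumPartitions())

# Optionally spread the rows over more partitions before running the detector
repartitioned_frame = S3bucket_node1.repartition(8)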
