How to fix the unknown schema datatype of AWS Glue Table?

0

There was a data source (JSON files) in S3. The JSON structure is as follows. I used AWS Glue Crawler to build the Glue table based on this S3 data source. I think the "data" column should be "Struct" type. It seems that this may have to do with the fact that the key contains commas in the "data" column. So how can I fix the unknown schema datatype caused by the JSON key containing commas?

Glue table schema

{
  "meta": {
    "request": {
      "domains": "test.com,test.org",
      "format": "json",
      "start_date": "2023-10-01",
      "end_date": "2023-10-31",
      "country": "world"
    },
    "status": "Success",
    "last_updated": "2023-10-31"
  },
  "data": {
    "test.com": 6161.0497,
    "test.org": 3129.9708,
    "test.org, test.com": 1127.10123112
  },
  "total": 10161.9689
}
질문됨 4달 전345회 조회
2개 답변
1

Hi there, Using space in name of field can cause some bug with AWS Glue’s JSON classifier because it doesn’t handle nested properties that have spaces in them well.

If you want the field data should be struct, you can use any other character but space to separate words.

Enter image description here

Here is some ref for you: https://stackoverflow.com/questions/54203398/aws-glue-ignoring-spaces-in-json-properties

profile picture
답변함 4달 전
  • These JSON files are obtained from API responses, so I can't change the JSON structure. Does adding custom classifiers for the "data" column when building the Glue crawler fix my problems?

0
수락된 답변

You might define the data column in a more structured way:

"data": {
  "domains": [
    {
      "name": "test.com",
      "value": 6161.0497
    },
    {
      "name": "test.org",
      "value": 3129.9708
    },
    {
      "name": "test.org, test.com",
      "value": 1127.10123112
    }
  ]
}

This format removes the commas from the keys and presents the data in a way that should be more straightforward for AWS Glue to parse.

If changing the JSON structure is not an option, then writing a custom ETL script would be the best way to ensure the data types are correctly identified.

If this has resolved your issue or was helpful, accepting the answer would be greatly appreciated. Thank you!

profile picture
전문가
답변함 4달 전
  • These JSON files are obtained from API responses, so I can't change the JSON structure. And Sorry I'm not very familiar with the AWS Glue ETL job script, do you know how can I define the ETL script to parse the "data" column to the correct datatype? Here are some of my examples of JSON

    {
      "meta": {
        "request": {
          "domains": "test.com,aaa.com",
          "format": "json",
          "start_date": "2023-11-01",
          "end_date": "2023-11-30",
          "country": "world"
        },
        "status": "Success",
        "last_updated": "2023-11-30"
      },
      "data": {
        "test.com": 58743300.39000001,
        "aaa.com": 2820.25,
        "aaa.com,test.com": 1444.4570719168455
      },
      "total": 58744676.18292809
    }
    
    {
      "meta": {
        "request": {
          "domains": "test.com,bbb.com",
          "format": "json",
          "start_date": "2023-11-01",
          "end_date": "2023-11-30",
          "country": "world"
        },
        "status": "Success",
        "last_updated": "2023-11-30"
      },
      "data": {
        "test.com": 58743300.39000001,
        "bbb.com": 0.0,
        "bbb.com,test.com": 0.0
      },
      "total": 58743300.39000001
    }
    

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠