
Glue Crawler cannot determine data structure from XML file


Hi,

I've created a Glue Crawler to determine the data structure of an XML file uploaded to S3 and write a table into the Data Catalog. I tried two approaches:

  1. Use the Glue default classifier - this is the preferred option, as I might get different XML files with different structures and don't want to create a custom classifier for each of them.
  2. Use a Glue custom classifier (with Feed set as the row tag).

With both approaches, the table is created with the schema below:

[Screenshot: table schema created by the crawler]

Neither approach seems to reflect the real structure of the XML file. I would expect to see something similar to the schema extracted by the crawler in https://aws.amazon.com/blogs/big-data/process-and-analyze-highly-nested-and-large-xml-files-using-aws-glue-and-amazon-athena/, with each tag from my file represented in the table.

I'm not sure whether I misconfigured something or whether Glue has limitations when discovering a schema from XML files.

The XML file has the structure below (file size is under 1 KB):

<Feed>
  <Document Type="RESULTS Latest">
    <Data id="t123">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Data id="t456">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Team id="44">
      <Name>Bradford City</Name>
      <OfficialName>Bradford City</OfficialName>
    </Team>
    <Team id="12">
      <Name>Mansfield Town</Name>
      <OfficialName>Mansfield Town</OfficialName>
    </Team>
  </Document>
</Feed>
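
To make the effect of the row tag concrete, here is a minimal local sketch using only Python's standard library. It is not Glue's inference algorithm; the helper name rows_for_tag and the "<struct>" marker are hypothetical and only illustrate how the chosen row tag changes what counts as a row and which columns appear.

```python
import xml.etree.ElementTree as ET

# The sample file from the question.
SAMPLE = """<Feed>
  <Document Type="RESULTS Latest">
    <Data id="t123">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Data id="t456">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Team id="44">
      <Name>Bradford City</Name>
      <OfficialName>Bradford City</OfficialName>
    </Team>
    <Team id="12">
      <Name>Mansfield Town</Name>
      <OfficialName>Mansfield Town</OfficialName>
    </Team>
  </Document>
</Feed>"""


def rows_for_tag(xml_text, row_tag):
    """Return one dict per element named row_tag (hypothetical helper).

    Attributes of the row element become columns. Child elements that are
    plain leaves keep their text; children with their own attributes or
    nested elements are only marked as "<struct>".
    """
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter(row_tag):  # iter() also yields the root itself
        row = dict(elem.attrib)
        for child in elem:
            if len(child) == 0 and not child.attrib:
                row[child.tag] = (child.text or "").strip()
            else:
                row[child.tag] = "<struct>"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for tag in ("Feed", "Data", "Team"):
        print(tag, rows_for_tag(SAMPLE, tag))
```

In this sketch, using Feed as the row tag yields a single row whose only column is a nested Document value, while a tag such as Team yields one flat row per team.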

asked 7 months ago, 309 views
1 Answer

Hello,

I tried to replicate the issue in my environment. When I used S3 as the source and the XML classifier with "Feed/feed" as the row tag, I got a schema containing Document with a struct type. Please find screenshots of the configuration and schema below, and compare your configuration against them.

Schema Screenshot

Struct content

Classifier

Crawler configuration
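
For anyone who prefers to set this up in code rather than the console, a custom XML classifier like the one in the screenshots could be created roughly as follows. This is a sketch under assumptions: the classifier name feed-xml-classifier and both helper functions are hypothetical, and the request shape follows the XMLClassifier parameter of the boto3 Glue client's create_classifier call.

```python
def xml_classifier_request(name, row_tag):
    """Build keyword arguments for glue.create_classifier (hypothetical helper)."""
    if not name or not row_tag:
        raise ValueError("both name and row_tag are required")
    return {
        "XMLClassifier": {
            "Classification": "xml",  # classification label for matched files
            "Name": name,             # classifier name, e.g. "feed-xml-classifier"
            "RowTag": row_tag,        # element that delimits one row, e.g. "Feed"
        }
    }


def create_xml_classifier(glue_client, name, row_tag):
    """Create the classifier using a client from boto3.client("glue")."""
    return glue_client.create_classifier(**xml_classifier_request(name, row_tag))


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# glue = boto3.client("glue")
# create_xml_classifier(glue, "feed-xml-classifier", "Feed")
```

The classifier still has to be attached to the crawler afterwards for it to take effect.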

Thank you, and have a great day.

answered 7 months ago by AWS
reviewed 6 months ago by AWS Support Engineer
reviewed 7 months ago by Expert
