Hi,
I've created a Glue crawler to determine the data structure of an XML file uploaded to S3 and write a table into the Data Catalog.
I tried 2 approaches:
- Use the Glue default classifier. This is the preferred option, as I might get different XML files with different structures and don't want to create a custom classifier for each of them.
- Use a Glue custom classifier (with Feed set as the row tag).
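
For reference, the custom classifier from the second approach could be created with boto3 roughly as sketched below. The classifier name `feed-xml-classifier` is a made-up placeholder, and the call itself needs AWS credentials and Glue permissions, so only the request payload is built when the script runs directly.

```python
def xml_classifier_request(name, row_tag):
    """Build the keyword arguments for Glue's create_classifier call."""
    return {
        "XMLClassifier": {
            "Name": name,             # hypothetical classifier name
            "Classification": "xml",  # classification label written to the table
            "RowTag": row_tag,        # element that the crawler treats as one row
        }
    }


def create_xml_classifier(name, row_tag):
    """Send the request to Glue (requires AWS credentials and permissions)."""
    import boto3  # imported lazily so building the payload works without boto3

    glue = boto3.client("glue")
    return glue.create_classifier(**xml_classifier_request(name, row_tag))


if __name__ == "__main__":
    # Only prints the payload; call create_xml_classifier() to actually create it.
    print(xml_classifier_request("feed-xml-classifier", "Feed"))
```

As far as I understand the Glue documentation, the row tag names the element that holds one record, so with `Feed` the whole file would be read as a single row.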
With both approaches, the table is created with the schema below:
Neither approach seems to reflect the real structure of the XML file. I would expect something similar to the schema extracted by the crawler in https://aws.amazon.com/blogs/big-data/process-and-analyze-highly-nested-and-large-xml-files-using-aws-glue-and-amazon-athena/, where each tag from my file is represented in the table.
I'm not sure whether I misconfigured something or whether Glue has limitations on discovering schemas from XML files.
The XML file with the data has the structure below (file size is under 1 KB):
<Feed>
  <Document Type="RESULTS Latest">
    <Data id="t123">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Data id="t456">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Team id="44">
      <Name>Bradford City</Name>
      <OfficialName>Bradford City</OfficialName>
    </Team>
    <Team id="12">
      <Name>Mansfield Town</Name>
      <OfficialName>Mansfield Town</OfficialName>
    </Team>
  </Document>
</Feed>
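
To make concrete what I mean by "each tag should be represented", here is a small local sketch (plain Python, no Glue involved) that lists every distinct element path in a shortened copy of the file together with its attribute names. The function name `element_paths` is just something I made up for illustration, and this is not how Glue infers schemas; it only shows the structure I would expect to see reflected in the table.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample file (one Data and one Team element).
SAMPLE = """<Feed>
  <Document Type="RESULTS Latest">
    <Data id="t123">
      <Info MatchDay="1">
        <Date>2023-08-05 15:00:00</Date>
      </Info>
    </Data>
    <Team id="44">
      <Name>Bradford City</Name>
      <OfficialName>Bradford City</OfficialName>
    </Team>
  </Document>
</Feed>"""


def element_paths(xml_text):
    """Map every distinct element path to the sorted list of its attribute names."""
    root = ET.fromstring(xml_text)
    paths = {}

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        # Collect attribute names seen on any element at this path.
        paths.setdefault(path, set()).update(elem.attrib)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return {path: sorted(attrs) for path, attrs in paths.items()}


if __name__ == "__main__":
    for path, attrs in element_paths(SAMPLE).items():
        print(path, attrs)
```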