- Newest
- Most votes
- Most comments
Since the files up to 2024 were generated by Lambda and the 2025 file was created manually, I recommend checking the following:
- Delimiter Consistency: Ensure the 2025 file uses the same delimiter as the previous files (e.g., if earlier files used commas and the 2025 file uses pipes, Athena may misinterpret the data).
- File Encoding: Verify that the encoding (e.g., UTF-8) is consistent across all files.
- Column Order and Data Types: Confirm that the column order and data types in the 2025 file match those of earlier files.
- Junk Characters: Check for hidden or junk characters in the 2025 file that could cause parsing issues.
- Correct File Format: Ensure the 2025 file is saved in the CSV format. Saving it in Excel format could cause parsing errors, as Glue classifiers don't support MS Excel.
- Quoted Strings: If the data contains quoted strings, the default SerDe(serial/deserialiser) may not parse them correctly. Consider using the OpenCSV SerDe(https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html).
I suggest you find a single SQL query whose result set contains mostly healthy rows and at least one broken one. Add a column defined as "$path"
to the SELECT list of your query and rerun the query. This will show which files/objects in S3 each row came from.
The broken file is probably one of those that you uploaded. If you aren't sure, you can check the creation time of the offending file/object in S3.
Download the healthy file and the broken one, and find the lines in the file that correspond with the two rows you chose to compare above. You might want to use a competent text editor for that, such as Notepad++, which also allows you to show the two files side-by-side in two panes. It's likely that you can visually identify the difference in structures, such as a superfluous separator character or an empty space that appears on every line of one file and not the other.
If the lines seem identical in structure, even when you copy the two lines in a new file to see the differences character by character, you could expose whitespace under the View / Show symbol menu in Notepad++ to see if there's a hidden difference, like one file containing a space and the other containing a tab character.
Another possible difference could be that there are non-ASCII characters that could be getting encoded differently in different character encodings, such as ü or é. If that seems possible, you could try producing the files with a different encoding than you used previously. If you used UTF-8, try switching it to ANSI or ISO Latin-1 or Latin-15, or vice versa. You might also discover the encoding Notepad++ chose to use to interpret the file you downloaded in the Encoding menu of Notepad++.
Thank you for your response and advice! I managed to find the main difference between the healthy and unhealthy lines was an inclusion of commas, so I removed them and that resolved the problem.
The issue you're experiencing with misaligned columns for your 2025 data in AWS Glue and Athena queries could be caused by a few factors:
-
Inconsistent data format: Even if the column names match, there might be inconsistencies in the data format between your manually created 2025 CSV and the existing 2015-2024 data. This could include differences in delimiters, quote characters, or escape characters.
-
Schema inference: When you upload new data to S3, AWS Glue may not automatically detect the changes or update the schema. The crawler might be using an outdated schema definition, causing misalignment.
-
Partition mismatch: If your table is partitioned, ensure that the new 2025 data is correctly placed in the appropriate partition structure within S3.
To resolve this issue:
-
Verify data consistency: Double-check that your 2025 CSV file exactly matches the format of the existing files, including delimiters, quote characters, and escape characters.
-
Update the schema: After uploading new data, run the AWS Glue crawler again to update the schema. Alternatively, you can manually update the table definition in the AWS Glue Data Catalog.
-
Check partition alignment: Ensure that the new data is correctly placed within the existing partition structure in S3.
-
Validate data types: Confirm that the data types for each column in your 2025 data match those defined in the AWS Glue Data Catalog. Inconsistent data types can cause misalignment.
-
Use CREATE TABLE statement: If the issue persists, consider manually defining the table schema using a CREATE TABLE statement in Athena. This gives you more control over the schema definition and can help ensure consistency across all your data.
By following these steps, you should be able to resolve the column misalignment issue and ensure that your 2025 data is correctly integrated with the existing dataset when querying in Athena.
Sources
Troubleshoot Athena queries that have new partitions | AWS re:Post
Using files in Amazon S3 for the data source - AWS Glue
Integrating with other AWS services - AWS Glue
Relevant content
- asked 6 months ago
- Accepted Answerasked a year ago
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated 4 months ago
- AWS OFFICIALUpdated 10 months ago
- AWS OFFICIALUpdated 9 months ago
Thank you for the response! I actually found that the delimiter issue was the problem, but not in the way it was described. All of the "clean" files up until 2024 had removed any and all commas, which were the delimiters Glue used. I removed the commas for the 2025 data, and that worked in resolving the issue.