How to handle schema changes and versions in the Glue Data Catalog?

0

Given a dataset with a temporally changing schema, how do you:

  1. Specify a schema version when running a job/query? i.e. I want to specify version X of a table because I know that the data I am querying against was created when version X was current.
  2. Query data across versions? i.e. Data in January has slightly different schema version than data in February and I want to run a query across both.

For scenario 2, I've run some tests with an evolving schema and adding or dropping columns creates no issues. If the column type changes though (i.e a TIMESTAMP changes to a STRING), then queries against data that don't match the catalog fail. I've tested this behavior in both Athena and Glue (DynamicFrame can handle some data type inconsistencies).

My thought would be to do separate queries and then UNION the results, but that requires the answer to scenario 1...

질문됨 5년 전4332회 조회
1개 답변
1
수락된 답변

Version by itself is not available as a column to Athena. In Athena/Presto, you can use the typeOf function to determine the type of a column and choose how you want to process that. A Case statement would also work. e.g.

SELECT  x, typeof(x)  
from (select timestamp '2012-10-31 01:00 UTC' as x)
where typeof(x) like 'timestamp%'
UNION
SELECT date_parse(x,'%Y-%m-%d %h:%i:%s'), typeof(x)
from (select '2012-10-31 01:00:00' as x)
where typeof(x) like 'varchar%'

But ideally even if you have a varying schema in your raw data tier, you should be resolving that schema in the etl layer so that in the analytics tier you have a single schema with perhaps 2 columns - one for the original column, one for the changed column. Then at the serving layer it is easy to build a view to coalesce the columns if needed e.g.: if a string column is being changed to timestamp

SELECT coalesce(new_column, date_parse(old_column,'%m/%d/%Y %h:%i:%s %p')) as timestamp_column
AWS
답변함 5년 전
profile picture
전문가
검토됨 한 달 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠