
Questions tagged with AWS Glue


Glue Hudi get the freshly added or updated records

Hello, I'm using the Hudi connector in Glue. First I bulk inserted the initial dataset into a Hudi table; I'm adding daily incremental records and I can query them with Athena. What I'm trying to do is get the newly added, updated, or deleted records into a separate Parquet file. I've tried different approaches and configurations using both copy-on-write and merge-on-read tables, but I still can't get the updates into a separate file. I used these configurations in different combinations:

```
'className': 'org.apache.hudi',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.write.precombine.field': 'ts',
'hoodie.datasource.write.recordkey.field': 'uuid',
'hoodie.payload.event.time.field': 'ts',
'hoodie.table.name': 'table_name',
'hoodie.datasource.hive_sync.database': 'hudi_db',
'hoodie.datasource.hive_sync.table': 'table_name',
'hoodie.datasource.hive_sync.enable': 'false',
# 'hoodie.datasource.write.partitionpath.field': 'date:SIMPLE',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'path': 's3://path/to/output/',
# 'hoodie.datasource.write.operation': 'bulk_insert',
'hoodie.datasource.write.operation': 'upsert',
# 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
# 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
# 'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.OverwriteWithLatestAvroPayload',
# 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.delete.bootstrap.base.file': 'true',
"hoodie.index.type": "GLOBAL_BLOOM",
'hoodie.file.index.enable': 'true',
'hoodie.bloom.index.update.partition.path': 'true',
'hoodie.bulkinsert.shuffle.parallelism': 1,
# 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
```

Thank you.
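For anyone reading later: one pattern for pulling only the latest changes out of a Hudi table is an incremental query on the Spark side. This is only a minimal sketch, not the poster's code; it assumes an existing `spark` session, and the table path and begin instant below are placeholders.

```python
# Minimal sketch: read only records committed after a given instant with a Hudi
# incremental query, then write them out as a separate Parquet file.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20220601000000",  # last commit already processed
}

incremental_df = (
    spark.read.format("org.apache.hudi")
    .options(**incremental_options)
    .load("s3://path/to/output/")          # Hudi table base path (placeholder)
)

incremental_df.write.mode("overwrite").parquet("s3://path/to/daily-changes/")  # placeholder
```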
1
answers
0
votes
11
views
asked 17 hours ago

Hudi Clustering

I am using EMR 6.6.0, which has Hudi 0.10.1. I am trying to bulk insert and do inline clustering using Hudi, but it doesn't seem to be clustering the files according to the file size specified; it is still producing files that are only KBs in size. I tried the configuration below:

```
hudi_clusteringopt = {
    'hoodie.table.name': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'my_hudi_db',
    'hoodie.datasource.hive_sync.table': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "bulk_insert",
}
# "hoodie.datasource.write.operation": "bulk_insert",

try:
    inputDF.write.format("org.apache.hudi"). \
        options(**hudi_clusteringopt). \
        option("hoodie.parquet.small.file.limit", "0"). \
        option("hoodie.clustering.inline", "true"). \
        option("hoodie.clustering.inline.max.commits", "0"). \
        option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). \
        option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). \
        option("hoodie.clustering.plan.strategy.sort.columns", "pk_col"). \
        mode('append'). \
        save("s3://xxxxxxxxxxxxxx")
except Exception as e:
    print(e)
```

Here is the data set if someone wants to regenerate it:

```
inputDF = spark.createDataFrame(
    [
        ("1001", 1001, "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("1011", 1011, "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("1021", 1021, "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("1031", 1031, "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("1041", 1041, "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("1051", 1051, "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "id_val", "creation_date", "last_update_time"]
)
```
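One quick diagnostic, separate from the config itself: clustering shows up in the table's timeline as `*.replacecommit` instants under the `.hoodie/` folder, so listing that prefix tells you whether clustering ever ran at all. A minimal boto3 sketch, with the bucket and prefix as placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Placeholders: bucket and prefix of the Hudi table's base path.
response = s3.list_objects_v2(Bucket="my-hudi-bucket", Prefix="my/table/path/.hoodie/")
replace_commits = [
    obj["Key"]
    for obj in response.get("Contents", [])
    if obj["Key"].endswith(".replacecommit")   # clustering commits
]
print("clustering commits found:", replace_commits or "none")
```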
1
answers
0
votes
7
views
asked 3 days ago

AWS GLUE - java.sql.SQLIntegrityConstraintViolationException: Duplicate entry Error

I was doing a POC using Glue to migrate data from RDS MySQL to RDS PostgreSQL. I created connectors to both source and target, and a crawler connected to the source. Then I created a job and tried to migrate the data without any transformation, and started getting a `java.sql.SQLIntegrityConstraintViolationException: Duplicate entry` error:

```
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.cj.util.Util.handleNewInstance(Util.java:192)
	at com.mysql.cj.util.Util.getInstance(Util.java:167)
	at com.mysql.cj.util.Util.getInstance(Util.java:174)
	at com.mysql.cj.jdbc.exceptions.SQLError.createBatchUpdateException(SQLError.java:224)
	at com.mysql.cj.jdbc.ServerPreparedStatement.executeBatchSerially(ServerPreparedStatement.java:385)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:435)
	at com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:794)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.saveBatch(GlueJDBCSink.scala:400)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$4(GlueJDBCSink.scala:77)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$4$adapted(GlueJDBCSink.scala:77)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.savePartition(GlueJDBCSink.scala:261)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$3(GlueJDBCSink.scala:77)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$3$adapted(GlueJDBCSink.scala:76)
	at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
	at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.sql.SQLIntegrityConstraintViolationException: Duplicate entry '00002b0b-1a34-3319-b003-fb073fb8248d' for key 'transformer_fragment.PRIMARY'
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:117)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
	at com.mysql.cj.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:637)
	at com.mysql.cj.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:418)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1092)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1040)
	at com.mysql.cj.jdbc.ServerPreparedStatement.executeBatchSerially(ServerPreparedStatement.java:357)
	... 19 more
```

The failed task entry from the Spark event log:

```
"Accumulator Updates":[{"ID":130,"Update":"105","Internal":false,"Count Failed Values":true},{"ID":132,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":139,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":140,"Update":"1","Internal":false,"Count Failed Values":true},{"ID":141,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":142,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":143,"Update":"1135","Internal":false,"Count Failed Values":true},{"ID":144,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":145,"Update":"1","Internal":false,"Count Failed Values":true}]},"Task Info":{"Task ID":13,"Index":5,"Attempt":1,"Launch Time":1655403050036,"Executor ID":"1","Host":"172.31.22.88","Locality":"NODE_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1655403050162,"Failed":true,"Killed":false,"Accumulables":
```

These are the values I used while creating the job:

* Type: Spark
* Glue version: Glue 3.0 (supports Spark 3.1, Scala 2, Python 3)
* Language: Python 3
* Worker type: G.1X
* Automatically scale the number of workers: False
* Requested number of workers: 2
* Generate job insights: True
* Job bookmark: Enable
* Number of retries: 1
* Job timeout: 2880

Really appreciate any help!
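The `Caused by` line shows the batch insert tripping over an existing primary key (`transformer_fragment.PRIMARY`). Without seeing the job script, one common mitigation is to deduplicate on the key before the JDBC write. A minimal sketch only, assuming the usual `glueContext`; `source_dyf` and the key column `id` are placeholders for whatever the job actually uses:

```python
from awsglue.dynamicframe import DynamicFrame

# Drop duplicate primary-key values before writing over JDBC.
deduped_df = source_dyf.toDF().dropDuplicates(["id"])          # placeholder key column
deduped_dyf = DynamicFrame.fromDF(deduped_df, glueContext, "deduped_dyf")
```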
0
answers
0
votes
19
views
asked 12 days ago

AWS parameterized Glue Concurrent jobs using step functions with enabled bookmarks - throws Version mismatch exception

I have a parameterized Glue job that is called in parallel (25 Glue jobs) through Step Functions. When bookmarks are enabled, a version mismatch exception is thrown; when they are disabled, it runs fine. Below are the inputs to the step function:

```
{
    "JobName": "gluejobname3",
    "Arguments": {
        "--cdb": "catalogdbname",
        "--ctbl": "tablename",
        "--dest": "s3://thepath/raw/",
        "--extra-py-files": "s3://thepath/mytransformation.py.zip"
    }
}
```

When bookmarks are disabled, the step function calls the parameterized Glue job and loads the data into the different S3 locations. Below is the Glue job script:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import mytransformation

args = getResolvedOptions(sys.argv, ["JOB_NAME", "cdb", "ctbl", "dest"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# this was given so that while it was initialized each job would have their unique id
job.init(args["JOB_NAME"] + args["ctbl"], args)

# Custom parameters that will be dynamically passed through with the call to the parameterized glue job
db = args["cdb"]
tbl = args["ctbl"]
dest = args["dest"]

# Script generated for node MySQL table
# dynamically creating a variable name so that transformation_ctx would be unique for each glue job
globals()[f'MySQLtable_sourceDf{tbl}'] = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx='"' + f'MySQLtable_sourceDf{tbl}' + '"'
)

# passing the same transformation_ctx into the destination frame
destination_Df = mytransformation.tansform(globals()[f'MySQLtable_sourceDf{tbl}'])

# Script generated for node Amazon S3
# : "s3://anvil-houston-preprod"
# creating the dynamic unique transformation ctx name since the jobs will be run concurrently
globals()[f'S3bucket_node{tbl}'] = glueContext.write_dynamic_frame.from_options(
    frame=destination_Df,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": dest, "partitionKeys": []},
    format_options={"compression": "snappy"},
    transformation_ctx='"' + f'S3bucket_node{tbl}' + '"'
)

job.commit()
```

The above runs fine: when the execution is started through Step Functions (25 parallel parameterized Glue jobs), the jobs run and load to 25 different locations. When bookmarks are enabled, the job fails with a version mismatch:

```
An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit. Continuation update failed due to version mismatch. Expected version 1 but found version 2 (Service: AWSGlueJobExecutor; Status Code: 400; Error Code: VersionMismatchException;
```

Please help.
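One thing worth noting, separate from the bookmark question itself: the script embeds literal double-quote characters inside the `transformation_ctx` strings. Below is a minimal sketch of the same source read with a plain, per-table context name; whether this alone resolves the VersionMismatchException is not something I can confirm.

```python
# Plain, unique transformation_ctx per table, without embedded quote characters.
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx=f"MySQLtable_sourceDf_{tbl}",
)
```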
0
answers
0
votes
20
views
asked 13 days ago

Generating Parquet files from Glue Data Catalog

I have a Glue job that writes to a Data Catalog table. In the Data Catalog I originally set it up as CSV, and all works fine. Now I would like to try Parquet for the Data Catalog table. I thought I would just have to re-create the table and select Parquet instead of CSV, so I did it like so:

```
CREATE EXTERNAL TABLE `gp550_load_database_beta.gp550_load_table_beta`(
  `vid` string,
  `altid` string,
  `vtype` string,
  `time` timestamp,
  `timegmt` timestamp,
  `value` float,
  `filename` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ds905-load-forecast/data_store_beta/'
TBLPROPERTIES (
  'classification'='parquet')
```

I left my Glue job unchanged. I have it sending its output to the Data Catalog table like so:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]

# Data Catalog WRITE
DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamicDF,
    database = db_name,
    table_name = tbl_name,
    additional_options=additionalOptions,
    transformation_ctx = "DataCatalogtable_node2",
)
```

When I checked the files being created by the Data Catalog in s3://ds905-load-forecast/data_store_beta/, they look to just be CSV. What do I need to do to use Parquet? Can I just change the sink routine to use glueContext.write_dynamic_frame.from_options()?
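For reference, a Glue sink can write Parquet and keep the catalog table updated in one step. This is only a minimal sketch using the documented `getSink` pattern, reusing the path, database, and table names from the question and assuming `dynamicDF` is the frame being written; it is not a confirmed drop-in replacement for the job above.

```python
# Write glueparquet to S3 and record the partitions in the catalog from the job itself.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://ds905-load-forecast/data_store_beta/",
    enableUpdateCatalog=True,
    updateBehavior="LOG",
    partitionKeys=["year", "month", "day"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=db_name, catalogTableName=tbl_name)
sink.writeFrame(dynamicDF)
```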
0
answers
0
votes
16
views
asked 19 days ago

MSCK REPAIR TABLE behaves differently when executed via Spark Context vs Athena Console/boto3

I have a Glue ETL job which creates partitions during the job:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]
```

I don't have it update the Data Catalog because doing so changes all my table data types. So after I am done, the way I get the Data Catalog updated with the correct partition information is to run MSCK REPAIR TABLE. If I do this inside the Glue ETL job using the Spark context like so:

```
spark.sql("use gp550_load_database_beta")
spark.sql("msck repair table gp550_load_table_beta").show()
```

then the following happens:

* The Serde properties of my table are updated with `serialization.format: 1`
* The table properties are updated with `EXTERNAL: TRUE` and `spark.sql.partitionProvider: catalog`
* ALL data types in my table are set to "string" with a comment of "from deserializer"

Basically it makes a mess. If I instead run MSCK REPAIR TABLE from boto3, or if I manually run it from the Athena console, then there are no issues: no Serde properties are changed, no table properties, no data types; it simply adds my partitions to the Data Catalog like I want. I do it like so in my ETL job:

```
client = boto3.client('athena')
sql = 'MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta'
context = {'Database': 'gp550_load_database_beta'}
client.start_query_execution(QueryString=sql, QueryExecutionContext=context,
    ResultConfiguration={'OutputLocation': 's3://aws-glue-assets-977465404123-us-east-1/temporary/'})
```

Why does it behave so differently? Is it because somehow I need to tell Spark to work with Hive? I had thought that since I already had a Spark context it would be easy to use it to kick off the MSCK REPAIR TABLE, but obviously I was surprised by the result!
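One side note on the boto3 approach: `start_query_execution` is asynchronous, so the ETL job can finish before the repair completes. A minimal sketch of waiting for the query to reach a terminal state, reusing the `client`, `sql`, and `context` variables from the snippet above; the polling loop is the only addition.

```python
import time

query_id = client.start_query_execution(
    QueryString=sql,
    QueryExecutionContext=context,
    ResultConfiguration={"OutputLocation": "s3://aws-glue-assets-977465404123-us-east-1/temporary/"},
)["QueryExecutionId"]

# Poll Athena until the query reaches a terminal state.
while True:
    state = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
```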
0
answers
0
votes
14
views
asked 21 days ago

Quicksight Athena - analysis error: "HIVE_UNSUPPORTED_FORMAT: Unable to create input format"

Hello. I'm trying to create an analysis from my DocumentDB instance. I'm using the AWS services Glue, Athena and QuickSight. In Glue I have created a connection to the DocumentDB and a crawler for auto-creating tables. This works as expected and tables are created and displayed in Glue. Even though I specify that the crawler should not give the tables any prefix, it does add the database name as a prefix.

When I look at the Glue catalog in Athena (the default AwsDataCatalog) I do see the database that was created in Glue, however it does not show any tables. If I click on edit, it takes me to the correct database in Glue, which displays the tables that have been created by the previously mentioned crawler. So my first question is: **Why don't the tables show up in Athena?** This is blocking me from performing queries in Athena.

When I go to QuickSight and select the default Athena Glue catalog ("AwsDataCatalog") I DO get the tables created by the crawler, and I can create datasets. However, when I try to create an analysis using these datasets, I get the error:

```
sourceErrorCode: 100071
sourceErrorMessage: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. HIVE_UNSUPPORTED_FORMAT: Unable to create input format
```

I have looked around a bit and some people said that this error is due to the table properties **"Input format"** and **"Output format"** being empty (which they indeed are for me). I have tried entering almost all the different formats on the table, but I keep getting the QuickSight error above, only now it has the input format at the end:

```
HIVE_UNSUPPORTED_FORMAT: Unable to create input format json
```

**So my second question is:** I do not see anywhere in the crawler where I can specify input or output format. Does it have to be done manually? And what are the correct input and output formats for my setup?
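On the second question: the crawler UI does not expose input/output formats directly, but they can be set on the catalog table afterwards (manually in the console, or via the API). A minimal, hedged sketch using boto3 and the SerDe combination typically used for JSON data in Athena; the database and table names are placeholders, and whether this matches the schema your DocumentDB crawler produced is something you would need to verify.

```python
import boto3

glue = boto3.client("glue")

database, table_name = "my_docdb_db", "my_table"   # placeholders
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

# Typical Athena settings for JSON-classified tables.
sd = table["StorageDescriptor"]
sd["InputFormat"] = "org.apache.hadoop.mapred.TextInputFormat"
sd["OutputFormat"] = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
sd["SerdeInfo"] = {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"}

# update_table only accepts TableInput fields, so copy over the allowed keys.
allowed = ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters")
glue.update_table(DatabaseName=database,
                  TableInput={k: v for k, v in table.items() if k in allowed})
```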
0
answers
0
votes
30
views
asked 22 days ago

How to create dynamic dataframe from AWS Glue catalog in local environment?

I have performed some AWS Glue version 3.0 job testing using Docker containers as detailed [here](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/). The following code outputs two lists, one per connection, with the names of the tables in a database:

```python
import boto3

db_name_s3 = "s3_connection_db"
db_name_mysql = "glue_catalog_mysql_connection_db"

def retrieve_tables(database_name):
    session = boto3.session.Session()
    glue_client = session.client("glue")
    response_get_tables = glue_client.get_tables(DatabaseName=database_name)
    return response_get_tables

s3_tables_list = [table_dict["Name"] for table_dict in retrieve_tables(db_name_s3)["TableList"]]
mysql_tables_list = [table_dict["Name"] for table_dict in retrieve_tables(db_name_mysql)["TableList"]]

print(f"These are the tables from {db_name_s3} db: {s3_tables_list}\n")
print(f"These are the tables from {db_name_mysql} db {mysql_tables_list}")
```

Now, I try to create a dynamic dataframe with the *from_catalog* method in this way:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

source_activities = glueContext.create_dynamic_frame.from_catalog(
    database=db_name,
    table_name=table_name
)
```

When `database="s3_connection_db"`, everything works fine. However, when I set `database="glue_catalog_mysql_connection_db"`, I get the following error:

```python
Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
```

I understand the issue is related to the fact that I am trying to fetch data from a MySQL table, but I am not sure how to solve this. By the way, the job runs fine on the Glue console. I would really appreciate some help, thanks!
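The `ClassNotFoundException: com.mysql.cj.jdbc.Driver` suggests the local container simply doesn't have a MySQL JDBC driver on Spark's classpath, whereas the Glue service supplies one through the connection. A minimal sketch, assuming a Connector/J jar has been downloaded into the mounted workspace; the jar path and version are placeholders.

```python
from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext

# Placeholder path: a MySQL Connector/J jar copied into the mounted workspace.
conf = SparkConf().set(
    "spark.jars",
    "/home/glue_user/workspace/jars/mysql-connector-java-8.0.28.jar",
)
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
```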
0
answers
0
votes
22
views
asked a month ago

AWS Glue problems reading from PostgreSQL DB that has uppercased table and column names

I have an RDS PostgreSQL database that has table names and column names with upper-case characters. I have created a Glue crawler that connects to the database via JDBC and populates the Glue Data Catalog with the database schemas, but in that translation the upper-case characters are converted to lower case when stored in the Data Catalog tables. When I run a Glue job to query, I get this error:

```
An error occurred while calling o94.getDynamicFrame. ERROR: column "id" does not exist
```

I made a copy of this table and changed the table names and column names to all lower-case characters, and the same Glue jobs and queries run successfully. Changing the table names and column names to lower case in our production environment is just not an option due to the extensive work it would require.

I found the 'Edit Schema' option in the Glue UI where you can change column names and data types and thought for a moment the solution had been found. However, when you change a character to upper case and then select "Save", it is reverted to lower case as it is saved.

I have edited the PySpark script directly and worked with the glueContext.create_dynamic_frame.from_catalog method, using the additional_options parameter to build my select statement with upper- and lower-case characters, but that still fails with the error message noted above:

```
# Script generated for node PostgreSQL table
PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydatabase",
    table_name="mytable",
    additional_options={"query": "SELECT id from mytable;"},
    transformation_ctx="PostgreSQLtable_node1",
)
```

I believe the failure is because the schema as it is stored in the Data Catalog contains lower-case characters while the actual schema in the database has upper-case characters, so when Glue tries to work with the table it is looking for "id" while the actual column is "ID", and so "not found" is returned. I have read about the CaseSensitive option and am looking in that direction next for a solution. I have not seen any recent posts (less than a couple of years old) about this issue, so I'm not sure if I'm missing something. Any assistance would be greatly appreciated.
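If the failure really is the case mismatch between the catalog (lower case) and PostgreSQL (mixed case), quoting the identifiers in the pushed-down query is one thing to try, since unquoted identifiers are folded to lower case by PostgreSQL. This is a minimal sketch only; the table and column names are placeholders, and I can't confirm that the catalog-backed connection honours the quoted query.

```python
# Quoted identifiers preserve the original case in PostgreSQL.
PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydatabase",
    table_name="mytable",
    additional_options={"query": 'SELECT "ID" FROM "MyTable"'},
    transformation_ctx="PostgreSQLtable_node1",
)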
2
answers
0
votes
27
views
asked a month ago

Data Catalog schema table getting modified when I run my Glue ETL job

I created a Data Catalog with a table that I manually defined. I run my ETL job and all works well. I added partitions to both the table in the Data Catalog and the ETL job. It creates the partitions and I see the folders being created in S3 as well. But my table data types change. I originally had:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | timestamp |
| timegmt | timestamp |
| value | float |
| filename | string |
| year | int |
| month | int |
| day | int |

But now, after the ETL job with partitions, my table ends up like so:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | bigint |
| timegmt | bigint |
| value | float |
| filename | string |
| year | bigint |
| month | bigint |
| day | bigint |

Before this change of data types, I could do queries in Athena, including a query like this:

```
SELECT * FROM "gp550-load-database"."gp550-load-table-beta" WHERE vid IN ('F_NORTH', 'F_EAST', 'F_WEST', 'F_SOUTH', 'F_SEAST') AND vtype='LOAD' AND time BETWEEN TIMESTAMP '2021-05-13 06:00:00' and TIMESTAMP '2022-05-13 06:00:00'
```

But now with the data type change, I get an error when trying to run a query like the above:

```
"SYNTAX_ERROR: line 1:154: Cannot check if bigint is BETWEEN timestamp and timestamp
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 2a5287bc-7ac2-43a8-b617-bf01c63b00d5"
```

So then if I go into the table and change the data type back to "timestamp", I run the query and get a different error:

```
"HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'time' in table 'gp550-load-database.gp550-load-table-beta' is declared as type 'timestamp', but partition 'year=2022/month=2/day=2' declared column 'time' as type 'bigint'.
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: f788ea2b-e274-43fe-a3d9-22d80a2bbbab"
```

With Glue crawlers, you can influence how the partitions are created. You can have the crawler modify the Data Catalog table, or not make changes to the table schema other than adding new partitions:

```
{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
}
```

Basically, this *InheritFromTable* behavior is what I am looking for with ETL jobs. Does anyone know what is happening?
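If the immediate goal is just to keep querying with TIMESTAMP semantics, one workaround is to cast the epoch columns inside the ETL job before writing, so the data on S3 matches the manually defined schema. A minimal sketch that assumes the bigint values are epoch milliseconds (adjust the division if they are seconds); `dynamicDF` and `glueContext` stand in for the job's own frame and context.

```python
from pyspark.sql import functions as F
from awsglue.dynamicframe import DynamicFrame

df = dynamicDF.toDF()
# Assumption: 'time'/'timegmt' hold epoch milliseconds. Casting a numeric to timestamp
# in Spark interprets the value as seconds, hence the division by 1000.
df = df.withColumn("time", (F.col("time") / 1000).cast("timestamp"))
df = df.withColumn("timegmt", (F.col("timegmt") / 1000).cast("timestamp"))
dynamicDF = DynamicFrame.fromDF(df, glueContext, "dynamicDF_casted")
```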
0
answers
0
votes
51
views
asked a month ago

Can't get Partitions to work with my Glue Data Catalog

I have S3 files that are uploaded to a single bucket. There are no folders or anything like that; it's just one file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog table stored in a different bucket. I can then query that Data Catalog with Athena, and that works.

What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted YEAR, MONTH, and DAY from the files during Glue ETL, created columns in my Data Catalog table accordingly, and marked them as partitions:

* Partition 1: YEAR
* Partition 2: MONTH
* Partition 3: DAY

The proper values are in these columns, and I have verified that. After creating the partitions I ran MSCK REPAIR TABLE on the table, and it came back with "Query Ok." I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created; I just see regular r-part files. When I click on the table schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows "year month day — No partitions found".

What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3; it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned. Yet I can go into the Data Catalog bucket, pull up one of the r-part files, and the data is there.
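For what it's worth, the year=/month=/day= folder layout only appears when the sink itself is given `partitionKeys`; marking the columns as partitions in the catalog is not enough on its own. A minimal sketch of a partitioned S3 write, with the frame name and path as placeholders rather than the job's actual code:

```python
# Writing with partitionKeys produces year=/month=/day= prefixes under the target path.
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyf,                                # placeholder for the job's output frame
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://my-data-catalog-bucket/table-prefix/",  # placeholder path
        "partitionKeys": ["year", "month", "day"],
    },
    transformation_ctx="partitioned_sink",
)
```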
1
answers
0
votes
33
views
asked a month ago

aws-glue-libs:glue_libs_3.0.0_image_01 image issue

I am running into issues with the aws-glue-libs:glue_libs_3.0.0_image_01 Docker image. I start it with:

```
docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" -v C:/Docker/jupyter_workspace:/home/glue_user/workspace/jupyter_workspace/ --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
```

The container starts locally, but when I try to read a CSV file stored locally it gives this error:

```
An error was encountered:
Path does not exist: file:/home/glue_user/workspace/employees.csv
Traceback (most recent call last):
  File "/home/glue_user/spark/python/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Path does not exist: file:/home/glue_user/workspace/employees.csv
```

Or, when I try to start it with:

```
docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" -v C:/Docker/jupyter_workspace:/home/glue_user/workspace --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
```

then the container does not start and I get the following error:

```
Bad config encountered during initialization: No such directory: ''/home/glue_user/workspace/jupyter_workspace''
```
1
answers
0
votes
38
views
asked 2 months ago

AWS trigger EventBatchingCondition/BatchWindow is not optional

Hi team, I have a Glue workflow: trigger (type = "EVENT") => trigger a Glue job (to take data from S3 and push it to MySQL RDS). I configured the Glue triggering criteria to kick off the Glue job after 5 events are received. In the console it says:

> Specify the number of events received or maximum elapsed time before firing this trigger.
> Time delay in seconds (optional)

The AWS documentation also says it's not required:

```
BatchWindow
Window of time in seconds after which EventBridge event trigger fires. Window starts when first event is received.
Type: Integer
Valid Range: Minimum value of 1. Maximum value of 900.
Required: No
```

I want my trigger to fire only after 5 events are received, not depending on "Time delay in seconds (optional)". Currently the time delay is set to 900 by default and my job is started after 900 s even if 5 events have not been received. That's not the behaviour we want; we want the job to start ONLY after x events are received.

I tried to edit the trigger via the console and remove the 900 s from the "Time delay in seconds (optional)" input, but I can't save it until I put a value in it. It says it's optional, but it doesn't seem to be.

Is there a workaround to make the trigger ignore "Time delay in seconds (optional)" and fire only when it has received x events and nothing else? Right now my job is triggered after 900 s; we want to eliminate this case and have the job triggered only if x events are received. How can I make the "Time delay in seconds (optional)" input truly optional, given that the console forces me to put a value there? Thank you.
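For reference, the batching condition can also be set through the API, where `BatchWindow` is documented as optional but, as observed above, effectively defaults to 900 seconds when omitted; I'm not aware of a supported way to disable the window entirely. A minimal boto3 sketch with a placeholder trigger name and the job name from the question:

```python
import boto3

glue = boto3.client("glue")

# Placeholder trigger name. BatchSize is required; omitting BatchWindow leaves the
# service default (900 seconds) in place rather than disabling the window.
glue.update_trigger(
    Name="my-event-trigger",
    TriggerUpdate={
        "Actions": [{"JobName": "gluejobname3"}],
        "EventBatchingCondition": {"BatchSize": 5, "BatchWindow": 900},
    },
)
```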
1
answers
0
votes
16
views
asked 2 months ago

Is there a way to create a Redshift Table from a Glue table's schema?

Athena tables can be created from Glue tables, which can have schemas based on crawlers. **Is it also possible to use the schema of a Glue table to generate a *Redshift-compatible* `CREATE TABLE` statement?**

I tried `SHOW CREATE TABLE encounter;` in Athena. And then I tried plugging in the resulting `CREATE TABLE` statement in Redshift, but got an error:

```
ERROR: syntax error at or near "`" Position: 23.
```

I can go through the statement Athena generated and clean it up to fit Redshift requirements, like taking out the back-ticks, but I'm wondering if there's any more direct way to generate a table based on a Glue table? This is the `CREATE TABLE` statement that Athena generated:

```
CREATE EXTERNAL TABLE `encounter`(
  `resourcetype` string COMMENT 'from deserializer',
  `id` string COMMENT 'from deserializer',
  `meta` struct<lastupdated:string,profile:array<string>> COMMENT 'from deserializer',
  `identifier` array<struct<use:string,system:string,value:string>> COMMENT 'from deserializer',
  `status` string COMMENT 'from deserializer',
  `class` struct<system:string,code:string> COMMENT 'from deserializer',
  `type` array<struct<coding:array<struct<system:string,code:string,display:string>>,text:string>> COMMENT 'from deserializer',
  `subject` struct<reference:string,display:string> COMMENT 'from deserializer',
  `participant` array<struct<type:array<struct<coding:array<struct<system:string,code:string,display:string>>,text:string>>,period:struct<start:string,end:string>,individual:struct<reference:string,display:string>>> COMMENT 'from deserializer',
  `period` struct<start:string,end:string> COMMENT 'from deserializer',
  `location` array<struct<location:struct<reference:string,display:string>>> COMMENT 'from deserializer',
  `serviceprovider` struct<reference:string,display:string> COMMENT 'from deserializer',
  `reasoncode` array<struct<coding:array<struct<system:string,code:string,display:string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='class,id,identifier,location,meta,participant,period,reasonCode,resourceType,serviceProvider,status,subject,type')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket/Encounter/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='healthlake-export-crawler',
  'averageRecordSize'='1561',
  'classification'='json',
  'compressionType'='none',
  'objectCount'='14',
  'recordCount'='53116',
  'sizeKey'='83059320',
  'typeOfData'='file')
```

Here's an example of what the original data looks like (it's synthetic data, so not PHI):

```
{
  "period": { "start": "2019-11-18T13:53:49-08:00", "end": "2019-11-18T14:23:49-08:00" },
  "subject": { "reference": "Patient/92e36d1e-66a2-4e77-9f50-155f7edf819c", "display": "Cyndi533 Bogan287" },
  "serviceProvider": { "reference": "Organization/3ecb1bdd-03d7-3fd2-b52d-8df2a04f5b0a", "display": "SOUTH SHORE SKIN CENTER, LLC" },
  "id": "b39745ae-14dd-46b3-9345-2916efa759ad",
  "type": [{ "coding": [{ "system": "http://snomed.info/sct", "code": "410620009", "display": "Well child visit (procedure)" }], "text": "Well child visit (procedure)" }],
  "class": { "system": "http://terminology.hl7.org/CodeSystem/v3-ActCode", "code": "AMB" },
  "participant": [{ "period": { "start": "2019-11-18T13:53:49-08:00", "end": "2019-11-18T14:23:49-08:00" }, "individual": { "reference": "Practitioner/c51e847b-fcd0-3f98-98a7-7e4274a2e6f3", "display": "Dr. Jacquelyne425 O'Reilly797" }, "type": [{ "coding": [{ "system": "http://terminology.hl7.org/CodeSystem/v3-ParticipationType", "code": "PPRF", "display": "primary performer" }], "text": "primary performer" }] }],
  "resourceType": "Encounter",
  "status": "finished",
  "meta": { "lastUpdated": "2022-04-08T15:40:39.926Z" }
}
```
2
answers
0
votes
41
views
asked 2 months ago

Data transformation not taken into account in AWS Glue

I have an S3 bucket with folders in which we have files. I want to make a database to be able to query these documents on a few keys with an API based on Lambda, but for that I need to normalize the data. For example, I need to transform all the files in the folder `/jomalone/` from the following shape:

```
{ "data": { "products": { "items": [ { "default_category": { "id": "25956", "value": "Bath & Body" }, "description": "London's Covent Garden early morning market. Succulent nectarine, peach and cassis and delicate spring flowers melt into the note of acacia honey. Sweet and delightfully playful. Our luxuriously rich Body Crème with its conditioning oils of jojoba seed, cocoa seed and sweet almond, help to hydrate, nourish and protect the skin, while delicious signature fragrances leave your body scented all over.", "display_name": "Nectarine Blossom & Honey Body Crème", "is_hazmat": false, "meta": { "description": "The Jo Malone&trade; Nectarine Blossom & Honey Body Crème leaves skin beautifully scented with fruity notes of nectarine and peach sweetened with acacia honey." }, ... { "currency": "EUR", "is_discounted": false, "include_tax": { "price": 68, "original_price": 68, "price_per_unit": 38.86, "price_formatted": "€68.00", "original_price_formatted": "€68.00", "price_per_unit_formatted": "€38.86 / 100ML" } } ], "sizes": [ { "value": "175ML", "key": 1 } ], "shades": [ { "name": "", "description": "", "hex_val": "" } ], "sku_id": "L4P801", "sku_badge": null, "unit_size_formatted": "100ML", "upc": "690251040254", "is_engravable": null, "perlgem": { "SKU_BASE_ID": 63584 }, "media": { "large": [ { "src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_0.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 1000, "width": 1000 }, { "src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_1.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 1000, "width": 1000 } ], "medium": [ { "src": "/media/export/cms/products/670x670/jo_sku_L4P801_670x670_0.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 670, "width": 670 } ], "small": [ { "src": "/media/export/cms/products/100x100/jo_sku_L4P801_100x100_0.png", "alt": "Nectarine Blossom & Honey Body Crème", "height": 100, "width": 100 } ] }, "collection": null, "recipient": [ { "key": "mom-recipient", "value": "mom_recipient" }, { "key": "bride-recipient", "value": "bride_recipient" }, { "key": "host-recipient", "value": "host_recipient" }, { "key": "me-recipient", "value": "me_recipient" }, { "key": "her-recipient", "value": "her_recipient" } ], "occasion": [ { "key": "thankyou-occasion", "value": "thankyou_occasion" }, { "key": "birthday-occasion", "value": "birthday_occasion" }, { "key": "treat-occasion", "value": "treat_occasion" } ], "location": [ { "key": "bathroom-location", "value": "bathroom_location" } ] } ] } } }
```

into a JSON with the following schema:

```
brandName        String
productName      String
productLink      String
productType      ?
maleFemale       Male/Female
price            float
unitPrice        String
size             float
ingredients      String
notes            String
numReviews       Int
userIDs          float
locations        float
dates            Date
ages             int
sexes            M/F
ratings          Int
reviews          Array of String
sources          String
characteristics  String
specificRatings  String
```

So I have tried AWS Glue, but I don't know how to get rid of the nested data, like the keys at the beginning:

```
"data": { "products": { "items": [ ...
```

Indeed, I used to test the modifications on the names: [![enter image description here][1]][1] But it doesn't seem to have had any of the consequences I was looking for, if I am to believe the Preview tab: [![enter image description here][2]][2] I had indeed deleted the first and last underlined fields and modified the others, but none of this seems to have been taken into account in the Preview. Indeed, there doesn't seem to be anything like even a mapping in the related script:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://datahubpredicity/JoMalone/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[("data.products.items", "array", "data.products.items", "array")],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://datahubpredicity/merged/", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)

job.commit()
```

[1]: https://i.stack.imgur.com/xYHOu.png
[2]: https://i.stack.imgur.com/nWNrn.png
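To strip away the `data.products.items` nesting, Glue's `Relationalize` transform (or a DataFrame-level `explode`) is usually more effective than renaming fields in the visual mapping. A minimal sketch that reuses the `S3bucket_node1` frame from the script above; the staging path is a placeholder.

```python
from awsglue.transforms import Relationalize

# Relationalize flattens nested structs and pivots arrays out into child tables,
# returning a DynamicFrameCollection keyed by table name.
collection = Relationalize.apply(
    frame=S3bucket_node1,
    staging_path="s3://datahubpredicity/tmp/relationalize/",  # placeholder staging location
    name="root",
    transformation_ctx="relationalized",
)
print(collection.keys())          # inspect the generated child frames
flattened = collection.select("root")
```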
0
answers
0
votes
9
views
asked 2 months ago

AWS:InstanceInformation folder created in s3 by Resource Data Sync cannot be queried by Athena because it has an invalid schema with duplicate columns.

[After resolving my first issue](https://repost.aws/questions/QUXOInFRr1QrKfR0Bh9wVglA/aws-glue-not-properly-crawling-s-3-bucket-populated-by-resource-data-sync-specifically-aws-instance-information-is-not-made-into-a-table) with getting a resource data sync set up, I've now run into another issue with the same folder. When a resource data sync is created, it creates a folder structure with 13 folders following a layout like `s3://resource-data-sync-bucket/AWS:*/accountid=*/regions=*/resourcetype=*/instance.json`.

When running the Glue crawler over this, a schema is created where partitions are made for each sub-path with an `=` in it. This works fine for most of the data, except for the path starting with `AWS:InstanceInformation`. The instance information JSON files ALSO contain a "resourcetype" field, as can be seen here:

```
{"PlatformName":"Microsoft Windows Server 2019 Datacenter","PlatformVersion":"10.0.17763","AgentType":"amazon-ssm-agent","AgentVersion":"3.1.1260.0","InstanceId":"i","InstanceStatus":"Active","ComputerName":"computer.name","IpAddress":"10.0.0.0","ResourceType":"EC2Instance","PlatformType":"Windows","resourceId":"i-0a6dfb4f042d465b2","captureTime":"2022-04-22T19:27:27Z","schemaVersion":"1.0"}
```

As a result, there are now two "resourcetype" columns in the "aws_instanceinformation" table schema. Attempts to query that table result in the error:

```
HIVE_INVALID_METADATA: Hive metadata for table is invalid: Table descriptor contains duplicate columns
```

I've worked around this issue by removing the offending field and setting the crawler to ignore schema updates, but this doesn't seem like a great long-term solution, since any changes made by AWS to the schema will be ignored. Is this a known issue with using this solution? Are there any plans to change how the AWS:InstanceInformation documents are structured so duplicate columns aren't created?
0
answers
0
votes
8
views
asked 2 months ago

AWS Glue not properly crawling s3 bucket populated by "Resource Data Sync" -- specifically, "AWS: InstanceInformation" is not made into a table

I set up an S3 bucket that collects inventory data from multiple AWS accounts using the Systems Manager "Resource Data Sync". I was able to set up the data syncs to feed into the single bucket without issue, and the Glue crawler was created automatically. Now that I'm trying to query the data in Athena, I noticed there is an issue with how the crawler is parsing the data in the bucket.

The folder "AWS:InstanceInformation" is not being turned into a table. Instead, the crawler is turning all of the "region=us-east-1/" and "test.json" sub-items into tables, which are, obviously, not queryable. To illustrate further, each of the following paths is being turned into its own table:

* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/test.json
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/test.json
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/test.json

This is ONLY happening with the "AWS:InstanceInformation" folder. All of the other folders (e.g. "AWS:DetailedInstanceInformation") are being properly turned into tables. Since all of this data was populated automatically, I'm assuming we are dealing with a bug? Is there anything I can do to fix this?
1
answers
0
votes
10
views
asked 2 months ago