
Analytics

AWS provides the broadest selection of analytics services to fit all your data analytics needs and enables organizations of all sizes and in all industries to reinvent their business with data. From data movement, data storage, and data lakes to big data analytics, machine learning, and everything in between, AWS offers purpose-built services that deliver the best price performance and scalability at the lowest cost.

Recent questions


Not able to abort Redshift connection - statement stuck in waiting state

At a certain point in time, all Java threads that abort the Redshift DB connections get blocked in the service. Thread dump:

```
thread-2" #377 prio=5 os_prio=0 cpu=23073.41ms elapsed=1738215.53s tid=0x00007fd1c413a000 nid=0x5a1f waiting for monitor entry [0x00007fd193dfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.amazon.jdbc.common.SStatement.close(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	- waiting to lock <0x00000006086ac800> (a com.amazon.redshift.core.jdbc42.PGJDBC42Statement)
	at com.amazon.jdbc.common.SConnection.closeChildStatements(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.common.SConnection.closeChildObjects(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.common.SConnection.abortInternal(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	- locked <0x0000000607941af8> (a com.amazon.redshift.core.jdbc42.S42NotifiedConnection)
	at com.amazon.jdbc.jdbc41.S41Connection.access$000(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.jdbc41.S41Connection$1.run(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.9.1/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.9.1/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.9.1/Thread.java:829)
```

These threads are blocked on the threads that are still running a statement on those connections:

```
thread-366" #23081 daemon prio=5 os_prio=0 cpu=972668.98ms elapsed=1553882.44s tid=0x00007fd1642b3000 nid=0x73ff waiting on condition [0x00007fd1920ac000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.9.1/Native Method)
	- parking to wait for <0x00000006086ae350> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.9.1/LockSupport.java:234)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.9.1/AbstractQueuedSynchronizer.java:2123)
	at java.util.concurrent.ArrayBlockingQueue.poll(java.base@11.0.9.1/ArrayBlockingQueue.java:432)
	at com.amazon.jdbc.communications.InboundMessagesPipeline.validateCurrentContainer(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.redshift.client.PGMessagingContext.getReadyForQuery(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.redshift.client.PGMessagingContext.closeOperation(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.redshift.dataengine.PGAbstractQueryExecutor.close(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.common.SStatement.replaceQueryExecutor(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.common.SStatement.executeNoParams(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	at com.amazon.jdbc.common.SStatement.execute(com.foo.drivers.redshift@1.2.43.1067/Unknown Source)
	- locked <0x00000006086ac800> (a com.amazon.redshift.core.jdbc42.PGJDBC42Statement)
```

The statement executed in these threads is `statement.execute("SHOW SEARCH_PATH");`. Once the Java service is restarted, it works fine, but after a few days the issue comes up again.

Q1a. Why is the connection-close thread blocked even though its child statement is in a queued (waiting) state?
Q1b. Is there a way to force-close the connection?
Q2. Why are the child statements in the waiting state?
0 answers · 0 votes · 6 views · asked 15 hours ago

Kinesis Analytics for SQL Application Issue

Hello, I am having trouble properly handling a query with a tumbling window. My application sends 15 sensor data messages per second to a Kinesis Data Stream, which is used as the input stream for a Kinesis Analytics application. I am trying to run an aggregation query with a GROUP BY clause to process rows in a tumbling window at a 60-second interval. The output stream then sends data to a Lambda function. I expect the messages to arrive at the Lambda function every 60 seconds, but instead they arrive much faster, almost every second, and the aggregations don't work as expected. Here is the CloudFormation template that I am using:

```
ApplicationCode:
  CREATE OR REPLACE STREAM "SENSORCALC_STREAM" (
      "name"         VARCHAR(16),
      "facilityId"   INTEGER,
      "processId"    BIGINT,
      "sensorId"     INTEGER NOT NULL,
      "min_value"    REAL,
      "max_value"    REAL,
      "stddev_value" REAL);

  CREATE OR REPLACE PUMP "SENSORCALC_STREAM_PUMP" AS
  INSERT INTO "SENSORCALC_STREAM"
  SELECT STREAM "name", "facilityId", "processId", "sensorId",
         MIN("sensorData")         AS "min_value",
         MAX("sensorData")         AS "max_value",
         STDDEV_SAMP("sensorData") AS "stddev_value"
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "facilityId", "processId", "sensorId", "name",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);

KinesisAnalyticsSensorApplicationOutput:
  Type: "AWS::KinesisAnalytics::ApplicationOutput"
  DependsOn: KinesisAnalyticsSensorApplication
  Properties:
    ApplicationName: !Ref KinesisAnalyticsSensorApplication
    Output:
      Name: "SENSORCALC_STREAM"
      LambdaOutput:
        ResourceARN: !GetAtt SensorStatsFunction.Arn
        RoleARN: !GetAtt KinesisAnalyticsSensorRole.Arn
      DestinationSchema:
        RecordFormatType: "JSON"
```

I would really appreciate your help in pointing out what I am missing. Thank you, Serge
0 answers · 0 votes · 7 views · asked 16 hours ago
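For the tumbling-window question above, a minimal Python sketch of a Lambda handler that logs what each invocation actually delivers can help confirm whether results are emitted once per 60-second window or continuously. It assumes the standard event and response shape of the Kinesis Data Analytics SQL Lambda output (a `records` list with base64-encoded `data`, acknowledged per `recordId`); the handler itself is illustrative and not taken from the original template.

```
import base64
import json
import time


def lambda_handler(event, context):
    """Log what the analytics application delivers on each invocation."""
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    records = event.get("records", [])
    print(f"{now}: invocation with {len(records)} record(s)")

    responses = []
    for record in records:
        # Each record carries one output row of SENSORCALC_STREAM as base64-encoded JSON.
        payload = json.loads(base64.b64decode(record["data"]))
        # With a 60-second tumbling window, each group should appear roughly once a
        # minute; rows showing up every second point at the aggregation being emitted
        # per input row rather than per window.
        print(payload)
        responses.append({"recordId": record["recordId"], "result": "Ok"})

    # Acknowledge every record so the application does not treat the delivery as failed.
    return {"records": responses}
```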

Glue Hudi: get the freshly added or updated records

Hello, I'm using the Hudi connector in Glue. First, I bulk-inserted the initial dataset into the Hudi table; I'm adding daily incremental records and I can query them using Athena. What I'm trying to do is get the newly added, updated, or deleted records into a separate Parquet file. I've tried different approaches and configurations using both copy-on-write and merge-on-read tables, but I still can't get the updates into a separate file. I used these configurations in different combinations:

```
'className': 'org.apache.hudi',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.write.precombine.field': 'ts',
'hoodie.datasource.write.recordkey.field': 'uuid',
'hoodie.payload.event.time.field': 'ts',
'hoodie.table.name': 'table_name',
'hoodie.datasource.hive_sync.database': 'hudi_db',
'hoodie.datasource.hive_sync.table': 'table_name',
'hoodie.datasource.hive_sync.enable': 'false',
# 'hoodie.datasource.write.partitionpath.field': 'date:SIMPLE',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'path': 's3://path/to/output/',
# 'hoodie.datasource.write.operation': 'bulk_insert',
'hoodie.datasource.write.operation': 'upsert',
# 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
# 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
# 'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.OverwriteWithLatestAvroPayload',
# 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.delete.bootstrap.base.file': 'true',
'hoodie.index.type': 'GLOBAL_BLOOM',
'hoodie.file.index.enable': 'true',
'hoodie.bloom.index.update.partition.path': 'true',
'hoodie.bulkinsert.shuffle.parallelism': 1,
# 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
```

Thank you.
1 answer · 0 votes · 10 views · asked 16 hours ago
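One pattern that may apply to the Hudi question above is an incremental query: read only the records committed after a given instant and write them out as a separate Parquet file. The sketch below is an outline under stated assumptions (the extract path, the begin instant, and an existing `spark` session are placeholders), and it covers inserts and updates; hard-deleted rows are generally not returned by an incremental read.

```
# Assumes a SparkSession named `spark` (e.g. inside the Glue job) and the Hudi
# table path from the question; extract_path is a hypothetical target location.
hudi_path = "s3://path/to/output/"
extract_path = "s3://path/to/daily-delta/"

# Commit instant (yyyyMMddHHmmss) after which changes should be picked up,
# for example the last instant handled by the previous daily run.
begin_instant = "20240101000000"

incremental_df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_instant)
    .load(hudi_path)
)

# Only the records added or updated after begin_instant land here, written as
# ordinary Parquet files separate from the Hudi table itself.
incremental_df.write.mode("overwrite").parquet(extract_path)
```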

Hudi Clustering

I am using EMR 6.6.0, which has Hudi 0.10.1. I am trying to bulk insert and do inline clustering with Hudi, but it does not seem to be clustering the files to the configured file size; it still produces files that are only a few KB in size. I tried the configuration below:

```
hudi_clusteringopt = {
    'hoodie.table.name': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'my_hudi_db',
    'hoodie.datasource.hive_sync.table': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "bulk_insert",
}
# "hoodie.datasource.write.operation": "bulk_insert",

try:
    inputDF.write.format("org.apache.hudi"). \
        options(**hudi_clusteringopt). \
        option("hoodie.parquet.small.file.limit", "0"). \
        option("hoodie.clustering.inline", "true"). \
        option("hoodie.clustering.inline.max.commits", "0"). \
        option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). \
        option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). \
        option("hoodie.clustering.plan.strategy.sort.columns", "pk_col"). \
        mode('append'). \
        save("s3://xxxxxxxxxxxxxx")
except Exception as e:
    print(e)
```

Here is the dataset if someone wants to reproduce the issue:

```
inputDF = spark.createDataFrame(
    [
        ("1001", 1001, "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("1011", 1011, "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("1021", 1021, "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("1031", 1031, "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("1041", 1041, "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("1051", 1051, "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "id_val", "creation_date", "last_update_time"],
)
```
1 answer · 0 votes · 7 views · asked 3 days ago
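For the clustering question above, one hedged variant worth trying (same `inputDF` and `hudi_clusteringopt` as in the post, untested here): `hoodie.clustering.inline.max.commits` controls after how many commits an inline clustering plan is scheduled, so a value of `0` may mean a plan is never scheduled, and the sort column must exist in the data (the sample frame has no `pk_col`).

```
(
    inputDF.write.format("org.apache.hudi")
    .options(**hudi_clusteringopt)
    .option("hoodie.parquet.small.file.limit", "0")
    .option("hoodie.clustering.inline", "true")
    # schedule a clustering plan after every commit instead of "0"
    .option("hoodie.clustering.inline.max.commits", "1")
    .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
    .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
    # sort on a column that actually exists in the sample dataset
    .option("hoodie.clustering.plan.strategy.sort.columns", "id")
    .mode("append")
    .save("s3://xxxxxxxxxxxxxx")  # elided bucket path kept as in the question
)
```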

