Questions tagged with AWS Glue

I was looking at the Glue Crawler resource creation docs for the DynamoDB Target object ([https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-dynamodbtarget.html](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-crawler-dynamodbtarget.html)): the only allowed parameter for a DynamoDB target of an AWS Glue Crawler resource is 'Path'. Interestingly, when I deployed my crawler, I noticed that **the 'data sampling' setting was automatically enabled** for my DDB data source. This is NOT the setting I want, so I am looking for a way to specify that the crawler should scan the **entire** data source (the DDB table).
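For context, a rough boto3 equivalent of the crawler definition being described; the crawler name, role ARN, database, and table name below are placeholders. As in the CloudFormation property the question links to, only the table name ('Path') is set on the DynamoDB target in this sketch.

```python
import boto3

glue = boto3.client("glue")

# Placeholder names and ARN; mirrors the CloudFormation setup described above,
# where the DynamoDB target carries nothing but the table name ('Path').
glue.create_crawler(
    Name="ddb-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_catalog_db",
    Targets={"DynamoDBTargets": [{"Path": "my-ddb-table"}]},
)
```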
0
answers
0
votes
11
views
asked 19 hours ago
I have an array stored in an S3 bucket that looks like this:

```json
[
  {
    "bucket_name": "ababa",
    "bucket_creation_date": "130999",
    "additional_data": {
      "bucket_acl": [
        {
          "Grantee": {
            "DisplayName": "abaabbb",
            "ID": "abaaaa",
            "Type": "CanonicalUser"
          },
          "Permission": "FULL_CONTROL"
        }
      ],
      "bucket_policy": {
        "Version": "2012-10-17",
        "Id": "abaaa",
        "Statement": [
          {
            "Sid": "iddd",
            "Effect": "Allow",
            "Principal": { "Service": "logging.s3.amazonaws.com" },
            "Action": "s3:PutObject",
            "Resource": "aarnnn"
          },
          {
            "Effect": "Deny",
            "Principal": "*",
            "Action": [ "s3:GetBucket*", "s3:List*", "s3:DeleteObject*" ],
            "Resource": [ "arn:aws:s3:::1111-aaa/*", "arn:aws:s3:::1111-bbb" ],
            "Condition": { "Bool": { "aws_SecureTransport": "false" } }
          }
        ]
      },
      "public_access_block_configuration": {
        "BlockPublicAcls": true,
        "IgnorePublicAcls": true,
        "BlockPublicPolicy": true,
        "RestrictPublicBuckets": true
      },
      "website_hosting": {},
      "bucket_tags": [
        { "Key": "keyyy", "Value": "valueee" }
      ]
    },
    "processed_data": {}
  },
  .......................
]
```

NOTE: some of the fields may be a string, array, or struct depending on the data we get (e.g. Action can be an array or a string).

END GOAL: I want to query this data for multiple conditions and then create a field inside processed_data and set it to true/false based on the query, using AWS Glue.

Example: for each object inside the array, I want to check:

```
1- if bucket_acl has Grantee.Type=CanonicalUser and Permission=FULL_CONTROL
AND
2- if bucket_policy has a statement that contains Effect=Allow and Principal=* and Action = ...... and Resources = ...... and Condition is empty
AND
3- website_hosting is empty

and then create a field inside processed_data and set it to true if the above conditions are satisfied, e.g. processed_data: { isPublic: True }
```

Approaches I tried:

1- I tried saving the data in an S3 bucket in Parquet format using aws-wrangler/aws-pandas for faster querying, then reading it in AWS Glue using a Glue dynamic frame:

```python
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://abaabbb/abaaaaa/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
S3bucket_node1.printSchema()
S3bucket_node1.show()
```

Output:

```
root
|-- bucket_name: string
|-- bucket_creation_date: string
|-- additional_data: string
|-- processed_data: string

{"bucket_name": "abaaaa", "bucket_creation_date": "139999", "additional_data": "{'bucket_acl': [{'Grantee': {'DisplayName': 'abaaaaaa', 'ID': 'abaaa', 'Type': 'CanonicalUser'}, 'Permission': 'FULL_CONTROL'}], 'bucket_policy': {}, 'public_access_block_configuration': {'BlockPublicAcls': True, 'IgnorePublicAcls': True, 'BlockPublicPolicy': True, 'RestrictPublicBuckets': True}, 'website_hosting': {}, 'bucket_tags': []}", "processed_data": "{}"}
```

Everything comes back as a string; it seems most of these libraries don't support nested data types.

2- Tried saving the data as-is (in JSON) using the PutObject API and then reading it in AWS Glue using a Glue dynamic frame:

```python
piece1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": True},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://raghav-test-df/raghav3.json"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
piece1.printSchema()
piece1.show()
piece1.count()
```

Output:

```
root

0
```

I get no schema and a count of 0.

3- Tried reading the data with a Spark data frame:

```python
sparkDF = spark.read.option("inferSchema", "true").option("multiline", "true").json("s3://ababa/abaa.json")
sparkDF.printSchema()
sparkDF.count()
sparkDF.show()
```

Output:

```
root
 |-- additional_data: struct (nullable = true)
 |    |-- bucket_acl: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Grantee: struct (nullable = true)
 |    |    |    |    |-- DisplayName: string (nullable = true)
 |    |    |    |    |-- ID: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Permission: string (nullable = true)
 |    |-- bucket_policy: struct (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Statement: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Action: string (nullable = true)
 |    |    |    |    |-- Condition: struct (nullable = true)
 |    |    |    |    |    |-- Bool: struct (nullable = true)
 |    |    |    |    |    |    |-- aws:SecureTransport: string (nullable = true)
 |    |    |    |    |    |-- StringEquals: struct (nullable = true)
 |    |    |    |    |    |    |-- AWS:SourceAccount: string (nullable = true)
 |    |    |    |    |    |    |-- AWS:SourceArn: string (nullable = true)
 |    |    |    |    |    |    |-- aws:PrincipalAccount: string (nullable = true)
 |    |    |    |    |    |    |-- s3:x-amz-acl: string (nullable = true)
 |    |    |    |    |-- Effect: string (nullable = true)
 |    |    |    |    |-- Principal: string (nullable = true)
 |    |    |    |    |-- Resource: string (nullable = true)
 |    |    |    |    |-- Sid: string (nullable = true)
 |    |    |-- Version: string (nullable = true)
 |    |-- bucket_tags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Key: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)
 |    |-- public_access_block_configuration: struct (nullable = true)
 |    |    |-- BlockPublicAcls: boolean (nullable = true)
 |    |    |-- BlockPublicPolicy: boolean (nullable = true)
 |    |    |-- IgnorePublicAcls: boolean (nullable = true)
 |    |    |-- RestrictPublicBuckets: boolean (nullable = true)
 |-- bucket_creation_date: string (nullable = true)
 |-- bucket_name: string (nullable = true)
```

This gives the schema and the correct count, but some of the fields have different data types across records (e.g. Action can be a string or an array) and Spark defaults them to string; I think querying the data on multiple conditions using SQL will be too complex. Do I need to change the approach, or something else? I am stuck here. Can someone please help in achieving the end goal?
0
answers
0
votes
18
views
asked a day ago
Are there any ways to create a custom DeltaTargetProperties with CDK, since it doesn't exist? I see the bug is that DeltaTargetProperties is not supported in CloudFormation yet.

Attempt 1, using the docs code examples:

```python
cfn_crawler = glue.CfnCrawler(self, "MyCfnCrawler",
    role=glue_job_crawler_arn,
    targets=glue.CfnCrawler.TargetsProperty(
        delta_targets=glue.CfnCrawler.TargetsProperty(
            delta_table=[
                "s3://bucket/dataset_name/table_name/",
            ],
            write_manifest=True
        )
    ),
    ...
)
```

I can run `aws glue get-crawler --name cralwer_name` and get:

```json
"Crawler": {
    "Name": "cralwer_name",
    "Role": "cralwer_role",
    "Targets": {
        "S3Targets": [],
        "JdbcTargets": [],
        "MongoDBTargets": [],
        "DynamoDBTargets": [],
        "CatalogTargets": [],
        "DeltaTargets": [
            {
                "DeltaTables": [
                    "s3://bucket/dataset_name/table_name/",
                ],
                "WriteManifest": true
            }
        ]
    },
    "DatabaseName": "db_name",
    "Classifiers": [],
    "RecrawlPolicy": {
        "RecrawlBehavior": "CRAWL_EVERYTHING"
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG"
    },
    "LineageConfiguration": {
        "CrawlerLineageSettings": "DISABLE"
    },
    "State": "READY",
    ...
    "Version": 2,
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": false,
        "AccountId": ""
    }
}
```

Attempt 2, with .... code examples. This loops through a dict with crawler_name as key and crawler_details as value.

```python
crawlers[crawler_name]['crawler'] = glue.CfnCrawler(self, "CfnCrawler_" + crawler_name,
    name=crawler_name,
    database_name=crawler_details['database_name'],
    description=crawler_details['description'] + ' Delta Lake Crawler',
    role=self.glue_crawler_role,
    targets={},
    # 'DeltaTargets': [
    #     {
    #         'DeltaTables': crawler_details['delta_tables'],
    #         'WriteManifest': crawler_details['write_manifest']
    #     }
    # ]
    # },
    schema_change_policy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    }
)

crawlers[crawler_name]['crawler'].add_property_override("Targets.DeltaTargets", [{
    "DeltaTables": crawler_details['delta_tables'],
    "WriteManifest": crawler_details['write_manifest']
}])
```

```
❌  CrawlersStack failed: Error: The stack named CrawlersStack failed to deploy: UPDATE_ROLLBACK_COMPLETE: Property validation failure: [Encountered unsupported properties in {/Targets}: [DeltaTargets]]
    at FullCloudFormationDeployment.monitorDeployment (/usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:505:13)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at deployStack2 (/usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:265:24)
    at /usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/aws-cdk/lib/deploy.ts:39:11
    at run (/usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/p-queue/dist/index.js:163:29)

❌  Deployment failed: Error: Stack Deployments Failed: Error: The stack named CrawlersStack failed to deploy: UPDATE_ROLLBACK_COMPLETE: Property validation failure: [Encountered unsupported properties in {/Targets}: [DeltaTargets]]
    at deployStacks (/usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/aws-cdk/lib/deploy.ts:61:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at CdkToolkit.deploy (/usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:339:7)
    at exec4 (/usr/local/Cellar/aws-cdk/2.61.1/libexec/lib/node_modules/aws-cdk/lib/cli.ts:384:12)
```
1
answers
0
votes
19
views
asked 2 days ago
The Glue Marketplace Connector "Google BigQuery Connector for AWS Glue" is currently at 0.24.2. Can you please update this connector to 0.28? The newer versions provide direct write using gRPC streaming writes, as well as more forgiving schema handling when using that method. Additionally, the current 0.24.2 seems to have issues with passing credentials properly when using indirect write (via Cloud Storage), which is currently the only option. Thanks!

https://github.com/GoogleCloudDataproc/spark-bigquery-connector/releases

The current connector is: https://709825985650.dkr.ecr.us-east-1.amazonaws.com/amazon-web-services/glue/bigquery:0.24.2-glue3.0
1
answers
0
votes
4
views
Chris
asked 3 days ago
![Enter image description here](/media/postImages/original/IMfjzobBhMTeCb_xjirxp9Iw)

Is there a limit on using a particular connection in Glue? I have existing crawlers that work fine with the connector, but when I try to update one or create a new one, I get the error shown above. Any ideas?
Accepted Answer
AWS Glue
1
answers
0
votes
44
views
asked 3 days ago
How can I add an ODBC driver to my Glue **Python shell** job? I am trying to use the pyodbc library and can see with `pyodbc.drivers()` that the MySQL and PostgreSQL drivers are available; however, I would like to add a different driver. Note that I am specifically asking about Python shell jobs, not Spark Glue jobs.
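For reference, a minimal sketch of the check mentioned above plus how a driver would be selected once it exists; the driver name and connection-string details are placeholders, not a known-good Glue setup:

```python
import pyodbc

# List the ODBC drivers the environment's driver manager knows about.
# In the situation described above this returns only MySQL and PostgreSQL entries.
print(pyodbc.drivers())

# Once an additional driver is installed and registered, it is chosen by name in the
# connection string (everything below is a placeholder, not a verified configuration).
conn = pyodbc.connect(
    "DRIVER={Some ODBC Driver};SERVER=example-host;PORT=1433;DATABASE=exampledb;UID=user;PWD=secret"
)
```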
1
answers
0
votes
14
views
asked 4 days ago
We provision our AWS Glue Crawler with CloudFormation and ran into a bug while doing so. I create a DB connection like this:

```yaml
GlueConnectionPostgres:
  Type: AWS::Glue::Connection
  Properties:
    CatalogId: !Ref AWS::AccountId
    ConnectionInput:
      Name: !Sub '${AWS::StackName}-${Environment}-connection'
      Description: "Connection to database."
      ConnectionType: "JDBC"
      PhysicalConnectionRequirements:
        SubnetId: !Ref DBSubnetId
        SecurityGroupIdList:
          - !Ref DBSecurityGroup
      ConnectionProperties: {
        "JDBC_CONNECTION_URL": !Ref JDBCConnectionString,
        "JDBC_ENFORCE_SSL": "true",
        "USERNAME": !Ref DBUsername,
        "PASSWORD": !Ref DBPassword
      }
```

The AWS console shows the Glue connection, and it has the property "Require SSL connection" set to true. When I then start a crawler using that connection, it ends with the following error:

```
ERROR : Crawler cannot be started. Verify the permissions in the policies attached to the IAM role defined in the crawler.
```

If I go to the Glue connection, click edit, change "Require SSL connection" to "false", save it, and then switch it back to true, my crawler works. When I delete my CloudFormation stack and recreate it, I can reproduce that behavior. I guess that is a bug.

P.S.: Tried it as a boolean (`"JDBC_ENFORCE_SSL": true`) as well, same effect.
0
answers
0
votes
14
views
asked 4 days ago
I'm trying to play with the new Athena enhancement that provides support for Spark notebooks. I have data in S3 that follows a partition scheme like this:

```
s3://my_bucket/path1/p1=1642/p2=431/p3=2023-02-02 00:00:00/*.parquet
```

I want to read it into a Spark data frame by prefix, something like:

```python
df = spark.read.option("recursiveFileLookup", "true").parquet('s3://my_bucket/path1/p1=1642/p2=431/*')
```

When I try that, I get an error that the "path doesn't exist". Even if I put the full path to a .parquet file, I still get an error related to the p3 partition:

```
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: fdatehb=2023-02-02 00:00:00
```

I even tried to simply read a file from S3 using straight Python:

```python
def read_file(bucket, key, encoding="utf-8") -> str:
    file_obj = io.BytesIO()
    bucket.download_fileobj(key, file_obj)
    wrapper = io.TextIOWrapper(file_obj, encoding=encoding)
    file_obj.seek(0)
    return wrapper.read()
```

And I still get errors related to permissions, which indicates to me that I have to set up some kind of relationship between Athena and the source bucket. I can already query this same data using "traditional" Athena (SQL), so it must be something additional (?). But I can't find any documentation about this. Can someone point me in the right direction?
1
answers
0
votes
16
views
zack
asked 5 days ago
On AWS Glue 2.0, I would like to try out the Pandas API on Spark. I follow this tutorial: https://spark.apache.org/docs/3.2.1/api/python/getting_started/quickstart_ps.html

```python
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession

s = ps.Series([1, 3, 5, np.nan, 6, 8])
```

I followed the instructions on how to package the library pyspark_pandas (https://pypi.org/project/pyspark-pandas/) in a .zip file, copied it to an S3 bucket, and added it in AWS Glue under *Job Details* > *Python library path*. It does not matter whether I put the *pyspark_pandas* directory inside the root folder of the *pyspark_pandas.zip* file or place the files directly in the root of the .zip file; it does not change anything.

Can someone please advise how I can import pyspark.pandas in a Glue job, or tell me whether what I am trying to do makes sense at all?
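As a side note, offered only as a hedged diagnostic: the pandas API on Spark (`pyspark.pandas`) is bundled with PySpark itself starting with Spark 3.2, so a tiny check like the sketch below, run inside the Glue job, shows whether the module is present in the runtime at all. Nothing in it depends on the zip-file setup described above.

```python
import pyspark

# Print the Spark version the job is actually running; the pandas API on Spark is
# bundled with PySpark only from version 3.2 onward.
print("PySpark version:", pyspark.__version__)

try:
    import pyspark.pandas as ps  # noqa: F401
    print("pyspark.pandas is importable in this runtime")
except ImportError as exc:
    print("pyspark.pandas is not available in this runtime:", exc)
```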
2
answers
0
votes
18
views
wkl3nk
asked 7 days ago
Hi, I am trying to perform an upsert of an Iceberg table. The script below creates a table with raw data stored in Parquet format in an S3 bucket. Then it creates an empty Iceberg table to be populated and eventually updated. When trying to insert data, it fails; please see the error further down.

The script:

```python
import pandas as pd
import awswrangler as wr
import boto3

database = "test"
iceberg_database = "iceberg_mid_sized_demo"
bucket_name = "test-bucket"
folder_name = "iceberg_mid_sized/raw_input"
path = f"s3://{bucket_name}/{folder_name}"

session = boto3.Session()
glue_client = session.client('glue')

try:
    glue_client.create_database(DatabaseInput={'Name': database})
    print('Database created')
except glue_client.exceptions.AlreadyExistsException as e:
    print("The database already exists")

# Create external table on input parquet files.
create_raw_data_table_query = """
CREATE EXTERNAL TABLE test.raw_input(
    op string,
    id bigint,
    name string,
    city string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
STORED AS PARQUET
LOCATION 's3://test-bucket/iceberg_mid_sized/raw_input/'
tblproperties ("parquet.compress"="SNAPPY");
"""

create_raw_data_table_query_exec_id = wr.athena.start_query_execution(sql=create_raw_data_table_query, database=database)

# create iceberg tables database
try:
    glue_client.create_database(DatabaseInput={'Name': iceberg_database})
    print('Database created')
except glue_client.exceptions.AlreadyExistsException as e:
    print("The database already exists")

# Create output Iceberg table with partitioning. Replace the S3 bucket name with your bucket name
create_output_iceberg_query = """
CREATE TABLE iceberg_mid_sized_demo.iceberg_output (
    id bigint,
    name string,
    city string
)
LOCATION 's3://test-bucket/iceberg-mid_sized/iceberg_output/'
TBLPROPERTIES (
    'table_type'='ICEBERG',
    'format'='parquet'
)
"""

create_iceberg_table_query_exec_id = wr.athena.start_query_execution(sql=create_output_iceberg_query, database=iceberg_database)

primary_key = ['id']
wr.s3.merge_upsert_table(delta_df=val_df, database='iceberg_mid_sized_demo', table='iceberg_output', primary_key=primary_key)
```

This last line returns the following traceback and error:

```
ArrowInvalid                              Traceback (most recent call last)
/var/folders/y8/11mxbknn1sxbbq7vvhd14frr0000gn/T/ipykernel_17075/2358353780.py in <module>
      1 primary_key = ['id']
----> 2 wr.s3.merge_upsert_table(delta_df=val_df, database='iceberg_mid_sized_demo', table='iceberg_output', primary_key=primary_key)

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/s3/_merge_upsert_table.py in merge_upsert_table(delta_df, database, table, primary_key, boto3_session)
    111     if wr.catalog.does_table_exist(database=database, table=table, boto3_session=boto3_session):
    112         # Read the existing table into a pandas dataframe
--> 113         existing_df = wr.s3.read_parquet_table(database=database, table=table, boto3_session=boto3_session)
    114         # Check if data quality inside dataframes to be merged are sufficient
    115         if _is_data_quality_sufficient(existing_df=existing_df, delta_df=delta_df, primary_key=primary_key):

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/_config.py in wrapper(*args_raw, **kwargs)
    448             del args[name]
    449         args = {**args, **keywords}
--> 450         return function(**args)
    451
    452     wrapper.__doc__ = _inject_config_doc(doc=function.__doc__, available_configs=available_configs)

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py in read_parquet_table(table, database, filename_suffix, filename_ignore_suffix, catalog_id, partition_filter, columns, validate_schema, categories, safe, map_types, chunked, use_threads, boto3_session, s3_additional_kwargs)
    969         use_threads=use_threads,
    970         boto3_session=boto3_session,
--> 971         s3_additional_kwargs=s3_additional_kwargs,
    972     )
    973     partial_cast_function = functools.partial(

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py in read_parquet(path, path_root, path_suffix, path_ignore_suffix, version_id, ignore_empty, ignore_index, partition_filter, columns, validate_schema, chunked, dataset, categories, safe, map_types, use_threads, last_modified_begin, last_modified_end, boto3_session, s3_additional_kwargs, pyarrow_additional_kwargs)
    767     if len(paths) == 1:
    768         return _read_parquet(
--> 769             path=paths[0], version_id=versions[paths[0]] if isinstance(versions, dict) else None, **args
    770         )
    771     if validate_schema is True:

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py in _read_parquet(path, version_id, columns, categories, safe, map_types, boto3_session, dataset, path_root, s3_additional_kwargs, use_threads, pyarrow_additional_kwargs)
    538             use_threads=use_threads,
    539             version_id=version_id,
--> 540             pyarrow_additional_kwargs=pyarrow_args,
    541         ),
    542         categories=categories,

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py in _pyarrow_parquet_file_wrapper(source, read_dictionary, coerce_int96_timestamp_unit)
     41     try:
     42         return pyarrow.parquet.ParquetFile(
---> 43             source=source, read_dictionary=read_dictionary, coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
     44         )
     45     except TypeError as ex:

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata, read_dictionary, memory_map, buffer_size, pre_buffer, coerce_int96_timestamp_unit)
    232                 buffer_size=buffer_size, pre_buffer=pre_buffer,
    233                 read_dictionary=read_dictionary, metadata=metadata,
--> 234                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
    235             )
    236         self.common_metadata = common_metadata

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.open()

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```

I have also tried to run the script replacing the following line:

```python
wr.s3.merge_upsert_table(delta_df=val_df, database='iceberg_mid_sized_demo', table='iceberg_output', primary_key=primary_key)
```

with these:

```python
merge_into_query = """
MERGE INTO iceberg_mid_sized_demo.iceberg_output t
USING test.raw_input s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.city = s.city
WHEN NOT MATCHED THEN INSERT (id, name, city) VALUES (s.id, s.name, s.city)
;
"""

merge_into_query_id = wr.athena.start_query_execution(sql=merge_into_query,
    database="iceberg_mid_sized_demo",
    workgroup='wgname'
)
```

However, now I am getting:

```
---------------------------------------------------------------------------
InvalidRequestException                   Traceback (most recent call last)
/var/folders/y8/11mxbknn1sxbbq7vvhd14frr0000gn/T/ipykernel_17075/2112489404.py in <module>
      1 merge_into_query_id = wr.athena.start_query_execution(sql=merge_into_query,
      2     database="iceberg_mid_sized_demo",
----> 3     workgroup='athena3'
      4 )

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/_config.py in wrapper(*args_raw, **kwargs)
    448             del args[name]
    449         args = {**args, **keywords}
--> 450         return function(**args)
    451
    452     wrapper.__doc__ = _inject_config_doc(doc=function.__doc__, available_configs=available_configs)

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/athena/_utils.py in start_query_execution(sql, database, s3_output, workgroup, encryption, kms_key, params, boto3_session, max_cache_seconds, max_cache_query_inspections, max_remote_cache_entries, max_local_cache_entries, data_source, wait)
    494         encryption=encryption,
    495         kms_key=kms_key,
--> 496         boto3_session=session,
    497     )
    498     if wait:

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/athena/_utils.py in _start_query_execution(sql, wg_config, database, data_source, s3_output, workgroup, encryption, kms_key, boto3_session)
    101         ex_code="ThrottlingException",
    102         max_num_tries=5,
--> 103         **args,
    104     )
    105     return cast(str, response["QueryExecutionId"])

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/awswrangler/_utils.py in try_it(f, ex, ex_code, base, max_num_tries, **kwargs)
    341     for i in range(max_num_tries):
    342         try:
--> 343             return f(**kwargs)
    344         except ex as exception:
    345             if ex_code is not None and hasattr(exception, "response"):

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    389                 "%s() only accepts keyword arguments." % py_operation_name)
    390             # The "self" in this scope is referring to the BaseClient.
--> 391             return self._make_api_call(operation_name, kwargs)
    392
    393         _api_call.__name__ = str(py_operation_name)

/opt/anaconda3/envs/data_analysis/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    717             error_code = parsed_response.get("Error", {}).get("Code")
    718             error_class = self.exceptions.from_code(error_code)
--> 719             raise error_class(parsed_response, operation_name)
    720         else:
    721             return parsed_response

InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 5:31: mismatched input '.'. Expecting: '='
```

How do you perform an UPSERT of Athena tables? Thanks
1
answers
0
votes
50
views
asked 8 days ago
Hi Team, I am getting the error below while creating a crawler in AWS Glue using the root account, although the same functionality was working until November 2022.

```
One crawler failed to create
The following crawler failed to create: "TestDb"
Here is the most recent error message: Account <RootUser> is denied access.
```

Please reply asap.
4
answers
0
votes
25
views
asked 8 days ago
I am building a pipeline that covers everything from ETL to the complete ML lifecycle. I have used AWS Glue notebooks for most of the ETL jobs and have created a pipeline with Step Functions. The rest of the machine learning work is done in SageMaker notebook instances across multiple notebooks. Now I want to use Step Functions to build an end-to-end pipeline. Unfortunately, I can't find any Step Functions action to run a SageMaker notebook instance, the way there is one for running an AWS Glue job. Am I missing something here? Please help me close this gap.

![Enter image description here](/media/postImages/original/IMMpsITAdUQcG8v5ab4inuwg)
0
answers
0
votes
10
views
asked 8 days ago