
Questions tagged with Amazon Athena



Describe table in Athena fails with insufficient Lake Formation permissions

When I try to run the following query via the Athena JDBC driver:

```sql
describe gitlab.issues
```

I get the following error:

> [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. FAILED: SemanticException Unable to fetch table gitlab. Insufficient Lake Formation permission(s) on gitlab (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: be6aeb1b-fc06-410d-9723-2df066307b35; Proxy: null) [Execution ID: a2534d22-c4df-49e9-8515-80224779bf01]

The following query works:

```sql
select * from gitlab.issues limit 10
```

The role that is used has the `DESCRIBE` permission on the `gitlab` database and `DESCRIBE, SELECT` permissions on the table `issues`. It also has the following IAM permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "athena:BatchGetNamedQuery", "athena:BatchGetQueryExecution", "athena:CreatePreparedStatement",
        "athena:DeletePreparedStatement", "athena:GetDataCatalog", "athena:GetDatabase", "athena:GetNamedQuery",
        "athena:GetPreparedStatement", "athena:GetQueryExecution", "athena:GetQueryResults",
        "athena:GetQueryResultsStream", "athena:GetTableMetadata", "athena:GetWorkGroup", "athena:ListDatabases",
        "athena:ListNamedQueries", "athena:ListPreparedStatements", "athena:ListDataCatalogs",
        "athena:ListEngineVersions", "athena:ListQueryExecutions", "athena:ListTableMetadata",
        "athena:ListTagsForResource", "athena:ListWorkGroups", "athena:StartQueryExecution",
        "athena:StopQueryExecution", "athena:UpdatePreparedStatement"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "glue:BatchGetCustomEntityTypes", "glue:BatchGetPartition", "glue:GetCatalogImportStatus",
        "glue:GetColumnStatisticsForPartition", "glue:GetColumnStatisticsForTable", "glue:GetCustomEntityType",
        "glue:GetDatabase", "glue:GetDatabases", "glue:GetPartition", "glue:GetPartitionIndexes",
        "glue:GetPartitions", "glue:GetSchema", "glue:GetSchemaByDefinition", "glue:GetSchemaVersion",
        "glue:GetSchemaVersionsDiff", "glue:GetTable", "glue:GetTableVersion", "glue:GetTableVersions",
        "glue:GetTables", "glue:GetUserDefinedFunction", "glue:GetUserDefinedFunctions",
        "glue:ListCustomEntityTypes", "glue:ListSchemaVersions", "glue:ListSchemas",
        "glue:QuerySchemaVersionMetadata", "glue:SearchTables"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Condition": { "ForAnyValue:StringEquals": { "aws:CalledVia": "athena.amazonaws.com" } },
      "Action": [
        "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload", "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::aws-athena-query-results-123456789012-eu-west-1",
        "arn:aws:s3:::aws-athena-query-results-123456789012-eu-west-1/*",
        "arn:aws:s3:::aws-athena-federation-spill-123456789012-eu-west-1",
        "arn:aws:s3:::aws-athena-federation-spill-123456789012-eu-west-1/*"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "lakeformation:CancelTransaction", "lakeformation:CommitTransaction", "lakeformation:DescribeResource",
        "lakeformation:DescribeTransaction", "lakeformation:ExtendTransaction", "lakeformation:GetDataAccess",
        "lakeformation:GetQueryState", "lakeformation:GetQueryStatistics", "lakeformation:GetTableObjects",
        "lakeformation:GetWorkUnitResults", "lakeformation:GetWorkUnits", "lakeformation:StartQueryPlanning",
        "lakeformation:StartTransaction"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Condition": { "ForAnyValue:StringEquals": { "aws:CalledVia": "athena.amazonaws.com" } },
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:*:*:function:athena-federation-*",
      "Effect": "Allow"
    },
    {
      "Condition": { "ForAnyValue:StringEquals": { "aws:CalledVia": "athena.amazonaws.com" } },
      "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket"],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
```

Even if I make the role a Lake Formation Admin and Database Creator, assign Super permissions to the table and database, and add the AdministratorAccess IAM policy to the role, it still fails.
0
answers
0
votes
20
views
asked 6 days ago

Athena federated query on PostgreSQL

Hi, I am trying to execute queries on a PostgreSQL database I created in AWS. I added a data source to Athena, created the data source for PostgreSQL, and created the Lambda function. In the Lambda function I set:

* the default connection string
* spill_bucket and spill_prefix (I set the same value for both: 'athena-spill'; on the S3 page I cannot see any athena-spill bucket)
* the security group --> I set the security group I created to access the DB
* the subnet --> I set one of the database subnets

I deployed the Lambda function but received an error, and I had to add a new environment variable created with the connection string but named 'dbname_connection_string'. After adding this new env variable I am able to see the database in Athena, but when I try to execute any query on this database, such as:

```
select * from tests_summary limit 10;
```

I receive this error:

```
GENERIC_USER_ERROR: Encountered an exception[com.amazonaws.SdkClientException] from your LambdaFunction[arn:aws:lambda:eu-central-1:449809321626:function:data-production-athena-connector-nina-lambda] executed in context[retrieving meta-data] with message[Unable to execute HTTP request: Connect to s3.eu-central-1.amazonaws.com:443 [s3.eu-central-1.amazonaws.com/52.219.170.25] failed: connect timed out]
This query ran against the "public" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 3366bd80-143e-459c-a4da-5350b5ab4a77
```

What could be causing the problem? Thanks a lot!
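To make the setup concrete, here is a hedged boto3 sketch of the connector Lambda configuration the question describes. The function name, subnet, security group, and connection string are placeholders, and the exact environment-variable keys expected by the connector depend on the connector version, so treat the key names as assumptions:

```python
import boto3

lambda_client = boto3.client("lambda", region_name="eu-central-1")

# Placeholder connection string; substitute your own host, database, and credentials.
CONN = "postgres://jdbc:postgresql://mydb.example.eu-central-1.rds.amazonaws.com:5432/mydb?user=...&password=..."

lambda_client.update_function_configuration(
    FunctionName="data-production-athena-connector-nina-lambda",
    Environment={
        "Variables": {
            "default": CONN,                    # default connection string (assumed key name)
            "dbname_connection_string": CONN,   # per-catalog variable the question had to add manually
            "spill_bucket": "athena-spill",
            "spill_prefix": "athena-spill",
        }
    },
    VpcConfig={
        "SubnetIds": ["subnet-0123456789abcdef0"],       # one of the database subnets
        "SecurityGroupIds": ["sg-0123456789abcdef0"],    # security group with access to the DB
    },
)
```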
2
answers
0
votes
42
views
asked 12 days ago

Preventing SQL Injection in Athena queries in the JS SDK

Using the [v3 js sdk](https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-athena/index.html) Athena Client, I'm trying to safely execute a query that is constructed from user input. I want to avoid doing anything like this:

```js
const queryString = `SELECT * from mytable where field = ${userSuppliedValue}`;
```

It seems like the proper solution to this would be using a [prepared statement](https://docs.aws.amazon.com/athena/latest/ug/querying-with-prepared-statements.html). The SDK provides a way to create prepared statements like this:

```js
await client.send(
  new CreatePreparedStatementCommand({
    Description: 'My example query',
    QueryStatement: 'SELECT * FROM mytable WHERE ?',
    StatementName: 'MyPreparedStatement',
    WorkGroup: 'primary',
  }),
)
```

So, there should be a way to execute this prepared statement, providing user input safely to replace the `?` parameter. However, I can't find any way in the SDK to execute the statement, except to build a raw query:

```js
const data = await client.send(
  new StartQueryExecutionCommand({
    QueryString: `EXECUTE MyPreparedStatement USING ${userSuppliedValue}`,
    ResultConfiguration: {
      OutputLocation: 's3://my-example-bucket',
    },
  }),
);
```

Since this still involves building a raw query string with user input, it seems to leave me just as vulnerable to SQL injection as if I had not used a prepared statement. The documentation on prepared statements says:

> Prepared statements enable Athena queries to take parameters directly and help to prevent SQL injection attacks.

Is there some other programmatic way to execute prepared statements that I'm overlooking?
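For comparison, a hedged Python sketch: newer versions of the Athena `StartQueryExecution` API accept an `ExecutionParameters` list intended for exactly this situation. Whether your SDK version exposes it, and how string parameters must be quoted, should be verified against the Athena documentation, so treat this as an assumption rather than a confirmed answer:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical user input; with ExecutionParameters it is passed as data,
# not spliced into the SQL text. Check the Athena docs for how string
# parameters must be quoted for your engine version.
user_supplied_value = "'some user value'"

response = athena.start_query_execution(
    QueryString="EXECUTE MyPreparedStatement",     # assumes the statement created above
    ExecutionParameters=[user_supplied_value],     # assumes the API version supports this field
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-example-bucket"},
)
print(response["QueryExecutionId"])
```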
1
answers
0
votes
24
views
asked 15 days ago

MSCK REPAIR TABLE behaves differently when executed via Spark Context vs Athena Console/boto3

I have a Glue ETL job which creates partitions during the job:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]
```

I don't have it update the Data Catalog because doing so changes all my table data types. So after I am done, the way I get the Data Catalog updated with the correct partition information is to run MSCK REPAIR TABLE. If I do this inside the Glue ETL job using the Spark context like so:

```
spark.sql("use gp550_load_database_beta")
spark.sql("msck repair table gp550_load_table_beta").show()
```

the following happens:

* The Serde properties of my table are updated with "serialization.format : 1"
* The table properties are updated with EXTERNAL : TRUE and spark.sql.partitionProvider : catalog
* ALL data types in my table are set to "string" with a comment of "from deserializer"

Basically it makes a mess. If I instead run MSCK REPAIR TABLE from boto3, or if I manually run it from the Athena console, then there are no issues. No Serde properties are changed, no table properties, no data types are changed; it simply adds my partitions to the Data Catalog like I want. I do it like so in my ETL job:

```
client = boto3.client('athena')
sql = 'MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta'
context = {'Database': 'gp550_load_database_beta'}
client.start_query_execution(QueryString=sql,
                             QueryExecutionContext=context,
                             ResultConfiguration={'OutputLocation': 's3://aws-glue-assets-977465404123-us-east-1/temporary/'})
```

Why does it behave so differently? Is it because somehow I need to tell Spark to work with Hive? I had thought that since I already had a Spark context it would be easy to use that to kick off the MSCK REPAIR TABLE, but obviously I was surprised at the result!
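As a side note on the boto3 path shown above, here is a small hedged sketch of waiting for the MSCK REPAIR TABLE statement to finish before the job continues; the helper name and polling interval are arbitrary choices, not from the question:

```python
import time
import boto3

def run_athena_statement(sql, database, output_location):
    """Start an Athena statement and block until it succeeds, fails, or is cancelled."""
    athena = boto3.client("athena")
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)  # arbitrary polling interval

# Example usage mirroring the question's call:
# run_athena_statement(
#     "MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta",
#     "gp550_load_database_beta",
#     "s3://aws-glue-assets-977465404123-us-east-1/temporary/",
# )
```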
0
answers
0
votes
14
views
asked 21 days ago

Quicksight Athena - analysis error: "HIVE_UNSUPPORTED_FORMAT: Unable to create input format"

Hello. I'm trying to create an analysis from my DocumentDB instance. I'm using the AWS services Glue, Athena and QuickSight. In Glue I have created a connection to the DocumentDB and a crawler for auto-creating tables. This works as expected and tables are created and displayed in Glue. Even though I specify that the crawler should not give the tables any prefixes, it does add the database name as a prefix.

When I look at the Glue catalog in Athena (the default AwsDataCatalog) I do see the database that was created in Glue, however it does not show any tables. If I click on edit, it takes me to the correct database in Glue, which displays the tables that have been created by the previously mentioned crawler. So my first question is: **Why don't the tables show up in Athena?** This is blocking me from performing queries in Athena.

When I go to QuickSight and select the default Athena Glue catalog ("AwsDataCatalog") I DO get the tables created by the crawler, and I can create datasets. However, when I try to create an analysis using these datasets, I get the error:

```
sourceErrorCode: 100071
sourceErrorMessage: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. HIVE_UNSUPPORTED_FORMAT: Unable to create input format
```

I have looked around a bit and some people said that this error is due to the table properties **"Input format"** and **"Output format"** being empty (which they indeed are for me). I have tried entering almost all the different formats on the table, but I keep getting the QuickSight error above, only now it has the input format at the end:

```
HIVE_UNSUPPORTED_FORMAT: Unable to create input format json
```

**So my second question is:** I do not see anywhere in the crawler where I can specify input or output format. Does it have to be done manually? And what are the correct input and output formats for my setup?
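A hedged sketch of what filling in those table format properties programmatically could look like. The database and table names are placeholders, and the class names shown are one common combination for JSON-formatted tables, not something confirmed by the question:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names.
DATABASE = "documentdb_catalog"
TABLE = "my_crawled_table"

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
storage = table["StorageDescriptor"]

# A common combination for JSON data (assumption; verify against your data format):
storage["InputFormat"] = "org.apache.hadoop.mapred.TextInputFormat"
storage["OutputFormat"] = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
storage["SerdeInfo"] = {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"}

# TableInput accepts only a subset of the fields returned by get_table,
# so build a trimmed structure before updating.
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": storage,
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": table.get("Parameters", {}),
}
glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```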
0
answers
0
votes
30
views
asked 22 days ago

Simple Join query errors out with "table not found"

I'm trying to join a simple "spine" table with a "feature" table. The spine table may or may not have additional columns besides the join key. If there are additional columns (in this case, random_column), the following query works fine (please note that these queries are self-sufficient; they don't depend on any external tables - all tables are created inline by the query itself):

```
with spine_table as (select 'abc' as user_id, 'random_value' as random_column),
feature_table as (select 'abc' as user_id, '123' as feature_column)
select user_id, spine_table.*, feature_column
from spine_table
join feature_table using (user_id)
```

If, however, there's no additional column, the query throws an exception:

```
with spine_table as (select 'abc' as user_id),
feature_table as (select 'abc' as user_id, '123' as feature_column)
select user_id, spine_table.*, feature_column
from spine_table
join feature_table using (user_id)
```

`Error: SYNTAX_ERROR: line 5:17: Table 'spine_table' not found`

The second query works fine if I omit `spine_table.*`:

```
with spine_table as (select 'abc' as user_id),
feature_table as (select 'abc' as user_id, '123' as feature_column)
select user_id, feature_column
from spine_table
join feature_table using (user_id)
```

The problem is that my application dynamically generates the query, and it doesn't know ahead of time whether there are additional columns in the spine_table besides the join keys.
0
answers
0
votes
22
views
asked a month ago

Data Catalog schema table getting modified when I run my Glue ETL job

I created a Data Catalog with a table that I manually defined. I run my ETL job and all works well. I added partitions to both the table in the Data Catalog and the ETL job. It creates the partitions and I see the folders being created in S3 as well. But my table data types change. I originally had:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | timestamp |
| timegmt | timestamp |
| value | float |
| filename | string |
| year | int |
| month | int |
| day | int |

But now after the ETL job with partitions, my table ends up like so:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | bigint |
| timegmt | bigint |
| value | float |
| filename | string |
| year | bigint |
| month | bigint |
| day | bigint |

Before this change of data types, I could do queries in Athena, including a query like this:

```
SELECT * FROM "gp550-load-database"."gp550-load-table-beta"
WHERE vid IN ('F_NORTH', 'F_EAST', 'F_WEST', 'F_SOUTH', 'F_SEAST')
AND vtype='LOAD'
AND time BETWEEN TIMESTAMP '2021-05-13 06:00:00' and TIMESTAMP '2022-05-13 06:00:00'
```

But now with the data type change, I get an error when trying to do a query like the above:

```
"SYNTAX_ERROR: line 1:154: Cannot check if bigint is BETWEEN timestamp and timestamp
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 2a5287bc-7ac2-43a8-b617-bf01c63b00d5"
```

So then if I go into the table and change the data type back to "timestamp", I run the query and get a different error:

```
"HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'time' in table 'gp550-load-database.gp550-load-table-beta' is declared as type 'timestamp', but partition 'year=2022/month=2/day=2' declared column 'time' as type 'bigint'.
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: f788ea2b-e274-43fe-a3d9-22d80a2bbbab"
```

With Glue crawlers, you can influence how the partitions are created. You can have the crawler modify the Data Catalog table, or not make changes to the table schema other than adding new partitions:

```
{
  "Version": 1.0,
  "CrawlerOutput": {
    "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
    "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
  }
}
```

Basically, this *InheritFromTable* behavior is what I am looking for with ETL jobs. Does anyone know what is happening?
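A hedged sketch of one way to keep the job's output types aligned with the manually defined catalog schema, by casting explicitly before writing. This is written against the standard awsglue `ApplyMapping` transform; the frame name is a placeholder, and whether a direct bigint-to-timestamp cast does what you want depends on how the source values are encoded:

```python
from awsglue.transforms import ApplyMapping

# 'dyf' is a placeholder for the DynamicFrame produced earlier in the job.
# Cast columns to the types declared in the Data Catalog, so partitions are
# not registered with bigint where the table says timestamp/int.
dyf_casted = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("vid", "string", "vid", "string"),
        ("altid", "string", "altid", "string"),
        ("vtype", "string", "vtype", "string"),
        ("time", "bigint", "time", "timestamp"),
        ("timegmt", "bigint", "timegmt", "timestamp"),
        ("value", "float", "value", "float"),
        ("filename", "string", "filename", "string"),
        ("year", "bigint", "year", "int"),
        ("month", "bigint", "month", "int"),
        ("day", "bigint", "day", "int"),
    ],
)
```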
0
answers
0
votes
51
views
asked a month ago

Can't get Partitions to work with my Glue Data Catalog

I have S3 files that are uploaded to a single bucket. There are no folders or anything like that; it's just one file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog stored in a different bucket. I can then query that Glue Data Catalog with Athena, and that works.

What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted the YEAR, MONTH, DAY from the files during Glue ETL, and created columns in my Data Catalog table accordingly and marked them as partitions:

* Partition 1: YEAR
* Partition 2: MONTH
* Partition 3: DAY

The proper values are in these columns, and I have verified that. After creating the partitions I ran MSCK REPAIR TABLE on the table, and it came back with "Query Ok." I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created. I just see regular r-part files. When I click on the table schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on View Partitions it just shows "year month day - No partitions found".

What do I need to do? These are just CSV files. I can't control the process that is uploading the raw data to S3; it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, there is no data returned. Yet I can go into the Data Catalog bucket and pull up one of the r-part files and the data is there.
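A hedged sketch of writing the output with partition folders from a Glue job, assuming a DynamicFrame named `dyf` that already carries `year`, `month` and `day` columns; the bucket path is a placeholder:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Writing with partitionKeys produces year=.../month=.../day=... folders in S3
# instead of flat r-part files at the top of the prefix.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,  # placeholder DynamicFrame from earlier in the job
    connection_type="s3",
    connection_options={
        "path": "s3://my-data-catalog-bucket/my-table/",  # placeholder path
        "partitionKeys": ["year", "month", "day"],
    },
    format="csv",
)
```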
1
answers
0
votes
33
views
asked a month ago

Athena Error: Permission Denied on S3 Path.

I am trying to execute Athena queries from a Lambda function but I am getting this error:

`Athena Query Failed to run with Error Message: Permission denied on S3 path: s3://bkt_logs/apis/2020/12/16/14`

The bucket `bkt_logs` is the bucket which is used by AWS Glue crawlers to crawl through all the sub-folders and populate the Athena table I am querying. Also, `bkt_logs` is an encrypted bucket. These are the policies that I have assigned to the Lambda:

```
[
  {
    "Action": ["s3:Get*", "s3:List*", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::athena-query-results/*",
    "Effect": "Allow",
    "Sid": "AllowS3AccessToSaveAndReadQueryResults"
  },
  {
    "Action": ["s3:*"],
    "Resource": "arn:aws:s3:::bkt_logs/*",
    "Effect": "Allow",
    "Sid": "AllowS3AccessForGlueToReadLogs"
  },
  {
    "Action": [
      "athena:GetQueryExecution", "athena:StartQueryExecution", "athena:StopQueryExecution",
      "athena:GetWorkGroup", "athena:GetDatabase", "athena:BatchGetQueryExecution",
      "athena:GetQueryResults", "athena:GetQueryResultsStream", "athena:GetTableMetadata"
    ],
    "Resource": ["*"],
    "Effect": "Allow",
    "Sid": "AllowAthenaAccess"
  },
  {
    "Action": ["glue:GetTable", "glue:GetDatabase", "glue:GetPartitions"],
    "Resource": ["*"],
    "Effect": "Allow",
    "Sid": "AllowGlueAccess"
  },
  {
    "Action": ["kms:CreateGrant", "kms:DescribeKey", "kms:Decrypt"],
    "Resource": ["*"],
    "Effect": "Allow",
    "Sid": "AllowKMSAccess"
  }
]
```

What seems to be wrong here? What should I do to resolve this issue?
1
answers
0
votes
128
views
asked a month ago

_temp AWS Lake Formation blueprint pipeline tables appear to an IAM user in the Athena editor although I didn't give this user permission on them

_temp Lake Formation blueprint pipeline tables appear to an IAM user in the Athena editor, although I didn't give this user permission on them. Below is the policy granted to this IAM user; also, in the Lake Formation permissions, I didn't give this user any permissions on the _temp tables:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1652364721496",
      "Action": [
        "athena:BatchGetNamedQuery", "athena:BatchGetQueryExecution", "athena:GetDataCatalog",
        "athena:GetDatabase", "athena:GetNamedQuery", "athena:GetPreparedStatement",
        "athena:GetQueryExecution", "athena:GetQueryResults", "athena:GetQueryResultsStream",
        "athena:GetTableMetadata", "athena:GetWorkGroup", "athena:ListDataCatalogs",
        "athena:ListDatabases", "athena:ListEngineVersions", "athena:ListNamedQueries",
        "athena:ListPreparedStatements", "athena:ListQueryExecutions", "athena:ListTableMetadata",
        "athena:ListTagsForResource", "athena:ListWorkGroups", "athena:StartQueryExecution",
        "athena:StopQueryExecution"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase", "glue:GetDatabases", "glue:BatchDeleteTable", "glue:GetTable",
        "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:BatchGetPartition"
      ],
      "Resource": [ "*" ]
    },
    {
      "Sid": "Stmt1652365282568",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::queryresults-all",
        "arn:aws:s3:::queryresults-all/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [ "lakeformation:GetDataAccess" ],
      "Resource": [ "*" ]
    }
  ]
}
```
1
answers
0
votes
14
views
asked 2 months ago

Sync DynamoDB to S3

What is the best way to sync my DynamoDB tables to S3, so that I can perform serverless 'big data' queries using Athena? The data must be kept in sync without any intervention. The frequency of sync would depend on the cost; ideally daily but perhaps weekly. I have had this question a long time. I will cover what I have considered, and why I don't like the options.

1) AWS Glue Elastic Views. Sounds like this will do the job with no code, but it was announced 18 months ago and there have been no updates since. It's not generally available, and there is no information on when it might be.

2) Use DynamoDB native backup following this blog: https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/. I actually already use this method for 'one-off' data transfers that I kick off manually and then configure in Athena. I have two issues with this option. The first is that, to my knowledge, the export cannot be scheduled natively. The blog suggests using the CLI to kick off exports, and I assume the writer intends that the CLI would need scheduling on a cron job somewhere. I don't run any servers for this. I imagine I could do it via a scheduled Lambda with an SDK (a sketch of that idea follows below). The second issue is that the export path in S3 always includes a unique export ID. This means I can't configure the Athena table to point to a static location for the data and just switch over to the new data after a scheduled export. Perhaps I could write another Lambda to move the data around to a static location after the export has finished, but it seems a shame to have to do so much work, and I've not seen that covered anywhere before.

3) I can use Data Pipeline as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html. This post is more about backing data up than making it accessible to Athena.

I feel like this use case must be so common, and yet none of the ideas I've seen online are really complete. I was wondering if anyone had any ideas or experiences that would be useful here?
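A minimal sketch of the scheduled-Lambda idea from option 2, using the DynamoDB export API via boto3. The table ARN and bucket are placeholders, and the EventBridge schedule that would trigger this handler is assumed and not shown:

```python
import boto3

# Placeholders; substitute your table ARN and destination bucket.
TABLE_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/MyTable"
EXPORT_BUCKET = "my-dynamodb-exports"

def handler(event, context):
    """Scheduled Lambda handler that kicks off a point-in-time export to S3."""
    dynamodb = boto3.client("dynamodb")
    response = dynamodb.export_table_to_point_in_time(
        TableArn=TABLE_ARN,
        S3Bucket=EXPORT_BUCKET,
        S3Prefix="exports/",          # DynamoDB still appends a unique export ID under this prefix
        ExportFormat="DYNAMODB_JSON",
    )
    return response["ExportDescription"]["ExportArn"]
```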
2
answers
0
votes
11
views
asked 2 months ago

ErrorCode: INTERNAL_ERROR_QUERY_ENGINE

I am trying to create a view using a table that has more than 5 billion rows, using the following query:

```sql
with date_month as (
select distinct as_on_month from prod_edp_db_trusted.keyword_metrics_serp_dataforseo_v2_details where as_on_date >= date_add('Month',-2,current_date)
),
sim_data as (
select *,date as sim_date,substring(date,1,4)||substring(date,6,2) as sim_yr_month from ( select tenant,locale,search_engine,device_type,url,keyword,traffic,cast(date as varchar(255)) as date,cast(organic_rank as decimal(38,0)) as organic_rank,create_date ,row_number() over(partition by tenant,locale,search_engine,device_type,url,keyword order by date desc,create_date) as rn from prod_edp_db_trusted.simulated_score_v3_4 WHERE save_type='simulation' ) where rn=1
),
serp_data as (
select * from( select sim.tenant,sim.locale,sim.search_engine,sim.device_type,sim.url,sim.keyword,sim.traffic,sim.organic_rank as sim_rank,sim.sim_date as simulation_date,sim.sim_yr_month,srp.position as rank, srp_mn.as_on_month as serp_year_month, srp.as_on_date as serp_as_on_date, row_number() over(partition by sim.tenant,sim.locale,sim.search_engine,sim.device_type,sim.url,sim.keyword,srp_mn.as_on_month order by srp.as_on_date desc,srp.created_date) as rn from sim_data sim join date_month as srp_mn on srp_mn.as_on_month<=sim.sim_yr_month or srp_mn.as_on_month>=sim.sim_yr_month left join prod_edp_db_trusted.keyword_metrics_serp_dataforseo_v2_details srp on srp.locale=sim.locale and srp.search_engine=sim.search_engine and srp.device_type=sim.device_type and srp.url=sim.url and srp.keyword=sim.keyword
--and srp.as_on_date>=sim.sim_date
and srp.as_on_month=srp_mn.as_on_month and srp.as_on_date>=date_add('Month',-2,date_trunc('Month',current_date)) ) where rn=1
),
sv_data AS (
select tenant,locale,url,search_engine,keyword,search_volume,sv_yr_month from ( select LKP.tenant ,MAIN.locale ,lkp.url ,MAIN.search_engine ,MAIN.keyword ,MAIN.count as search_volume ,MAIN.as_on_month as sv_yr_month ,row_number() over(partition by lkp.tenant,main.locale,lkp.url,main.search_engine,main.keyword ,as_on_month order by as_on_date desc,created_date) rn from prod_edp_db_trusted.keyword_metrics_search_volume_dataforseo_v2 MAIN JOIN (SELECT tenant,locale,url,search_engine,keyword,Min(SIM_YR_MONTH) AS SIM_YR_MONTH FROM sim_data GROUP BY 1,2,3,4,5) LKP ON MAIN.locale=LKP.locale AND MAIN.search_engine=LKP.search_engine and main.keyword=LKP.keyword where MAIN.as_on_month IN(SELECT * FROM date_month ) ) where rn=1
),
base_dataset AS (
select srp.tenant,srp.locale,srp.search_engine,srp.device_type,srp.url,srp.keyword,srp.sim_rank,srp.rank,srp.serp_year_month,srp.traffic ,srp.serp_as_on_date,srp.simulation_date,srp.SIM_YR_MONTH ,sv.search_volume from serp_data srp left join sv_data sv on srp.tenant=sv.tenant and srp.locale=sv.locale and srp.search_engine=sv.search_engine and srp.keyword=sv.keyword and srp.url=sv.url and srp.serp_year_month=sv.sv_yr_month
),
sim_wavg as (
(select * from base_dataset where sim_yr_month=serp_year_month)
union
(select tenant,locale,search_engine,device_type,url,keyword,sim_rank,rank,serp_year_month,traffic ,serp_as_on_date,simulation_date,SIM_YR_MONTH ,search_volume from (select *,row_number() over(partition by tenant,locale,search_engine,device_type,url,keyword,simulation_date, SIM_YR_MONTH order by serp_year_month) base_rn from base_dataset where sim_yr_month<(select min(as_on_month) from date_month))t3 where base_rn=1 )
),
sim_latest_mnth as (
select * FROM ( select *,row_number() over(partition by tenant,locale,search_engine,device_type,url,keyword ,simulation_date, SIM_YR_MONTH order by serp_year_month desc) base_rn from base_dataset ) t3 where base_rn=1
),
final_base AS (
select sim.tenant,sim.locale,sim.search_engine,sim.device_type,sim.url,sim.keyword,sim.sim_rank,sim.search_volume ,lkp.serp_year_month,lkp.serp_as_on_date,lkp.RANK as SERP_Rank,lkp.traffic ,sim.simulation_date,sim.SIM_YR_MONTH,lkp.search_volume as month_sv ,lst.search_volume as latest_month_sv,sim.sim_rank-lst.rank as rank_diff from sim_wavg sim left join base_dataset lkp on sim.tenant=lkp.tenant and sim.locale=lkp.locale and sim.search_engine=lkp.search_engine and sim.device_type=lkp.device_type and sim.url=lkp.url and sim.keyword=lkp.keyword left join sim_latest_mnth lst on sim.tenant=lst.tenant and sim.locale=lst.locale and sim.search_engine=lst.search_engine and sim.device_type=lst.device_type and sim.url=lst.url and sim.keyword=lst.keyword
)
select tenant,locale,search_engine,device_type,url,keyword ,search_volume as "Simulation Month Search Volume",simulation_date,latest_month_sv,traffic as "SIMULATION MONTH TRAFFIC" ,sim_rank as "Rank as on Last Simulation",rank_diff as "Rank Difference With Latest Month" ,kw_imp['202204'] as Current_Month ,kw_imp['202203'] as Last_Month ,kw_imp['202202'] as "2nd_Last_Month" ,kw_imp['202201'] as "3rd_Last_Month"
FROM (
select tenant,locale,search_engine,device_type,url,keyword,search_volume,latest_month_sv,simulation_date,traffic,sim_rank ,rank_diff ,map_agg(serp_year_month,SERP_Rank) as kw_imp from final_base group by tenant,locale,search_engine,device_type,url,keyword,search_volume,latest_month_sv,simulation_date,traffic,sim_rank ,rank_diff
)
order by tenant,locale,search_engine,device_type,url,keyword;
```
0
answers
0
votes
14
views
asked 2 months ago

AWS:InstanceInformation folder created in s3 by Resource Data Sync cannot be queried by Athena because it has an invalid schema with duplicate columns.

[After resolving my first issue](https://repost.aws/questions/QUXOInFRr1QrKfR0Bh9wVglA/aws-glue-not-properly-crawling-s-3-bucket-populated-by-resource-data-sync-specifically-aws-instance-information-is-not-made-into-a-table) with getting a resource data sync set up, I've now run into another issue with the same folder. When a resource data sync is created, it creates a folder structure with 13 folders following a layout like:

`s3://resource-data-sync-bucket/AWS:*/accountid=*/regions=*/resourcetype=*/instance.json`

When running the Glue crawler over this, a schema is created where partitions are made for each subpath with an `=` in it. This works fine for most of the data, except for the path starting with `AWS:InstanceInformation`. The instance information JSON files ALSO contain a "resourcetype" field, as can be seen here:

```
{"PlatformName":"Microsoft Windows Server 2019 Datacenter","PlatformVersion":"10.0.17763","AgentType":"amazon-ssm-agent","AgentVersion":"3.1.1260.0","InstanceId":"i","InstanceStatus":"Active","ComputerName":"computer.name","IpAddress":"10.0.0.0","ResourceType":"EC2Instance","PlatformType":"Windows","resourceId":"i-0a6dfb4f042d465b2","captureTime":"2022-04-22T19:27:27Z","schemaVersion":"1.0"}
```

As a result, there are now two "resourcetype" columns in the "aws_instanceinformation" table schema. Attempts to query that table result in the error:

`HIVE_INVALID_METADATA: Hive metadata for table is invalid: Table descriptor contains duplicate columns`

I've worked around this issue by removing the offending field and setting the crawler to ignore schema updates, but this doesn't seem like a great long-term solution, since any changes made by AWS to the schema will be ignored. Is this a known issue with using this solution? Are there any plans to change how the AWS:InstanceInformation documents are structured so that duplicate columns aren't created?
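A hedged sketch of the workaround described above (keeping the crawler from pushing schema updates), expressed with boto3; the crawler name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="resource-data-sync-crawler",  # placeholder crawler name
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",  # record schema changes in logs instead of applying them
        "DeleteBehavior": "LOG",
    },
)
```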
0
answers
0
votes
8
views
asked 2 months ago

AWS Glue not properly crawling s3 bucket populated by "Resource Data Sync" -- specifically, "AWS: InstanceInformation" is not made into a table

I set up an S3 bucket that collects inventory data from multiple AWS accounts using the Systems Manager "Resource Data Sync". I was able to set up the data syncs to feed into the single bucket without issue and the Glue crawler was created automatically. Now that I'm trying to query the data in Athena, I noticed there is an issue with how the crawler is parsing the data in the bucket.

The folder "AWS:InstanceInformation" is not being turned into a table. Instead, it is turning all of the "region=us-east-1/" and "test.json" sub-items into tables, which are, obviously, not queryable. To illustrate further, each of the following paths is being turned into its own table:

* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=12345679012/test.json
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=23456790123/test.json
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/region=us-east-1
* s3://resource-data-sync-bucket/AWS:InstanceInformation/accountid=34567901234/test.json

This is ONLY happening with the "AWS:InstanceInformation" folder. All of the other folders (e.g. "AWS:DetailedInstanceInformation") are being properly turned into tables. Since all of this data was populated automatically, I'm assuming that we are dealing with a bug? Is there anything I can do to fix this?
1
answers
0
votes
10
views
asked 2 months ago

Athena returns "FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null"

Following the Well-Architected Labs 200: Cost and usage analysis, I get the following error when adding partitions in the Athena query editor:

```
MSCK REPAIR TABLE `cost_optimization_10XXXXXXXX321`;
```

It returned the following error:

> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
> This query ran against the "costfubar" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 856e146a-8b13-4175-8cd8-692eef6d3fa5

The table was created correctly in Glue with:

```
Name                    cost_optimization_10XXXXXXXXX21
Description
Database                costfubar
Classification          parquet
Location                s3://cost-optimization-10XXXXXXX321//
Connection
Deprecated              No
Last updated            Wed Apr 20 16:46:28 GMT-500 2022
Input format            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Serde parameters        serialization.format 1
Table properties        sizeKey 4223322, objectCount 4, UPDATED_BY_CRAWLER costfubar, CrawlerSchemaSerializerVersion 1.0, recordCount 335239, averageRecordSize 27, exclusions ["s3://cost-optimization-107457606321/**.json","s3://cost-optimization-1XXXXXXXX21/**.csv","s3://cost-optimization-107457606321/**.sql","s3://cost-optimization-1XXXXXXXX321/**.gz","s3://cost-optimization-107457606321/**.zip","s3://cost-optimization-107457606321/**/cost_and_usage_data_status/*","s3://cost-optimization-107457606321/**.yml"], CrawlerSchemaDeserializerVersion 1.0, compressionType none, typeOfData file
```

and has the following partitions shown in Glue:

| partition_0 | partition_1 | year | month |
| --- | --- | --- | --- |
| detailed-cur-1XXXXXXXX57 | detailed-cur-1XXXXXXXX57 | 2018 | 12 |
| detailed-cur-1XXXXXXXXX57 | detailed-cur-1XXXXXXXXX57 | 2022 | 4 |
| detailed-cur-1XXXXXXXXX57 | detailed-cur-1XXXXXXXXX57 | 2018 | 11 |
| detailed-cur-1XXXXXXXX57 | detailed-cur-1XXXXXXXX57 | 2018 | 10 |
2
answers
0
votes
240
views
asked 2 months ago

Best way to overcome HIVE_PARTITION_SCHEMA_MISMATCH error in Athena while preserving structure of structs?

I ran the following Amazon Athena query on a table created by AWS Glue, which had crawled an Amazon S3 export of Synthea data from Amazon HealthLake:

```
SELECT * FROM "glue01234567890_fhir_export_abcdefghijklmnopqrst";
```

That resulted in this error:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'code' in table 'demo.glue01234567890_fhir_export_abcdefghijklmnopqrst' is declared as type 'struct<coding:array<struct<system:string,code:string,display:string>>,text:string>', but partition 'partition_0=ImagingStudy' declared column 'id' as type 'string'. This query ran against the "demo" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id

I saw [one answer](https://repost.aws/questions/QU1vPk1069Q5qg4iyuQWTk6Q/while-importing-s-3-data-into-quicksight-from-glue-database-getting-the-following-error) suggesting to edit the crawler, go to output --> configuration, and select "Update all new and existing partitions with metadata from the table." However, that didn't resolve the error for me.

When I edit the schema manually and change column 'code' from `struct` to `string`, the error goes away, and Athena brings my attention to the next mismatch:

> partition 'partition_0=Immunization' declared column 'patient' as type 'struct<reference:string>'.

I don't think, however, that manually replacing all structs with strings will get me the results I want, because I need Glue and Athena to be aware of the fields inside the structs. What is the best approach to overcoming this error while preserving the structure of the data, in this situation where I want to make the HealthLake export queryable by Athena? I understand I'll want to add a step to the ETL in Glue to first convert to Parquet or ORC for Athena performance reasons, but right now I'm dealing with a small sample dataset and just want to focus on getting a minimum viable end-to-end flow going.
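For reference, a hedged sketch of setting that crawler option programmatically. The `InheritFromTable` configuration string is the same one quoted in the Data Catalog question earlier on this page, and the crawler name here is a placeholder:

```python
import json
import boto3

glue = boto3.client("glue")

crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        # Corresponds to "Update all new and existing partitions with metadata from the table"
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
    },
}

glue.update_crawler(
    Name="healthlake-fhir-export-crawler",  # placeholder crawler name
    Configuration=json.dumps(crawler_configuration),
)
```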
1
answers
0
votes
51
views
asked 3 months ago

Athena query consistently fails with HIVE_CURSOR_ERROR: Failed to read ORC file

I am seeing Athena queries over a bucket containing ORC files fail with the error message 'HIVE_CURSOR_ERROR: Failed to read ORC file'. Any query over the entirety of the data in the bucket fails. A specific example query has been:

```
SELECT * FROM reachcounts_outbound WHERE calculation='a8d9458d-83e2-4e94-b272-3dbcd91296a0'
```

where calculation is set up as a partition in the reachcounts_outbound table (which is backed by an S3 bucket unscoreit-reachcounts-outbound).

I've validated that the file referenced by the error message is a valid ORC file by downloading it and running `orc-tools data` on it, and the contents are what I'd expect. I've downloaded other ORC files in the bucket and compared them. They have the same schema and that schema is what I'd expect it to be; it matches the schema I've defined for the table.

I've tried deleting the individual file referenced when the error message first appeared. However, it continues to fail with the same message with a different file in the bucket. However, if I specify a limit clause of any number under 1597894 on the query above, it will succeed. I've tried running MSCK REPAIR TABLE on the reachcounts_outbound table. This did not change anything.

The query id of a request that caused a failure is `54480f27-1992-40f7-8240-17cc622f91db`. Thanks!

Update: The ORC files that are rejected all appear to have exactly 10,000 rows, which is the stride size for the file.
0
answers
0
votes
16
views
asked 3 months ago

Advice for best database/data storage for historical data

Hi, I'm doing some research to find the best place to centralize lots of data logs generated by my application, considering pricing, performance and scalability. Today all my application data, including logs, is stored in an Oracle database, but I'm thinking of moving all the LOG-related data out of it to reduce its size and not worry about storage, performance, etc. - just put everything on an "infinite" storage apart from my actual database, using CDC or a regular batch process.

**Below are some needs:**

- Only inserts are necessary (no updates or deletes)
- Customers will need access to this historic data
- Well-defined pattern of access (one or two indexes at maximum)
- Latency of a few seconds is OK
- Avoid infrastructure, DBA, and performance bottlenecks long term
- Infinite retention period (meaning I don't want to worry about performance issues or storage size in the long term, but something that can handle a few terabytes of data)

**Use case example:** Historical sales orders by item ( id_item | id_customer | qty_sold | date_inserted ... ), approx. 50 million records per day, where I would need to see the historical data by item and by customer, for example (two dimensions).

I've done some research with the options below:

**S3 + Athena** -> Put everything on S3, no worries about infrastructure or performance issues; however, as I need to query by item and customer, it would probably be necessary to break files by item or customer and generate millions of partitions to avoid high costs searching every file, etc.

**PostgreSQL** -> Not sure if it could become a performance bottleneck once tables get too big, even with partition strategies.

**DynamoDB** -> Not sure if it's a good alternative for historical data regarding pricing, given that latency of seconds is OK.

**MongoDB / DocumentDB** -> Not very familiar with it (I'd prefer an SQL-type query language), but I know it has good scalability.

**Cassandra** -> Don't know it very well.

**Time-series DBs such as InfluxDB, Timestream, etc.** -> Don't know them very well, but they seem appropriate for time series.

What option would you choose? Sorry in advance if I'm saying something wrong or impossible :) Thank you!
1
answers
0
votes
10
views
asked 3 months ago

HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

I am reading files from S3 and using a Glue ETL job to populate a Data Catalog table. My S3 files look like this:

```
VID,ALTID,VTYPE,TIME,TIMEGMT,VALUE
ABC, ABC, DATAFEED,31-10-2021 22:00:00,01-11-2021 02:00:00, 11775.685
```

The schema read in looks like this:

```
root
|-- VID: string
|-- ALTID: string
|-- VTYPE: string
|-- TIME: string
|-- TIMEGMT: string
|-- VALUE: string
```

I am changing fields 3 and 4 from strings to timestamps, which matches the schema of my Data Catalog table. I am also doing a few other transformations. I am transforming them like so:

```
df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))
df = df.withColumn("timegmt", to_timestamp("timegmt", 'dd-MM-yyyy HH:mm:ss'))
```

When I try to read the data with Athena, I get an error:

`HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"`

The Data Catalog table schema looks like this:

| Column Name | Data Type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | timestamp |
| timegmt | timestamp |
| value | int |
| filename | string |

And the line in the `run-1647806179090-part-r-00000` file it is choking on looks like this:

```
vid,altid,vtype,time,timegmt,value,filename
ABC,ABC, DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"lf_buf_20211101_005904.csv"
```

Does anyone know why it would throw this error? I believe according to the documentation, this is the correct timestamp format.
1
answers
0
votes
127
views
asked 3 months ago

DataSourceArn error for Athena QuickSight in CFN template

```
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Creating QuickSight data source'
Resources:
  QuickSightDataSource:
    Type: AWS::QuickSight::DataSource
    Properties:
      AwsAccountId: !Ref AWS::AccountId
      Name: Testing Data Source
      Type: ATHENA
      DataSourceId: testing-data-source
      DataSourceParameters:
        AthenaParameters:
          Workgroup: primary
      Permissions:
        - Actions:
            - quicksight:DescribeDataSource
            - quicksight:DescribeDataSourcePermissions
            - quicksight:PassDataSource
          Principal: !Sub
            - 'arn:aws:quicksight:us-east-1:${Account}:user/default/my-user-name'
            - Account: !Ref AWS::AccountId
  QSDataSet:
    DependsOn: QuickSightDataSource
    Type: AWS::QuickSight::DataSet
    Properties:
      AwsAccountId: !Ref 'AWS::AccountId'
      DataSetId: 'QSDataSet-test'
      ImportMode: SPICE
      Name: Dataset-test
      Permissions:
        - Actions:
            - 'quicksight:CreateDataSet'
            - 'quicksight:DeleteDataSet'
            - 'quicksight:DescribeDataSet'
            - 'quicksight:DescribeDataSetPermissions'
            - 'quicksight:PassDataSet'
            - 'quicksight:UpdateDataSet'
            - 'quicksight:UpdateDataSetPermissions'
          Principal: !Sub
            - 'arn:aws:quicksight:us-east-1:${AWS::AccountId}:user/default/my-user-name'
            - Account: !Ref 'AWS::AccountId'
      PhysicalTableMap:
        downtime_data_json:
          DataSourceArn: !GetAtt QuickSightDataSource.Arn
          InputColumns:
            - Name: downtime
              Type: INTEGER
          Schema: XXX???
```

I have a data source and data tables in Athena and want to reference them in the above template. I am not finding DataSourceArn; the above template throws the error "Model validation failed (#: extraneous key [DataSourceArn] is not permitted)". Can anyone please help me hash this out?
0
answers
0
votes
7
views
asked 4 months ago