Questions tagged with Data Lakes


Hi folks, I'm curious to know whether it is possible to expose an AWS-hosted Snowflake data lake/warehouse directly through API Gateway. I found documentation on Snowflake's site that mentions using a Lambda function between API Gateway and Snowflake. Is this Lambda function really necessary, or does Snowflake offer an API that can be exposed directly via API Gateway? Thank you.
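If a Lambda function does sit between API Gateway and Snowflake, a minimal sketch of what it might do is shown below; the account, credentials, warehouse, and query are placeholders rather than anything from the question, and a real function would read secrets from Secrets Manager or environment variables.

```python
# Hypothetical Lambda handler bridging API Gateway and Snowflake.
# Assumes snowflake-connector-python is packaged with the function.
import json
import snowflake.connector

def handler(event, context):
    conn = snowflake.connector.connect(
        account="my_account",   # placeholder Snowflake account locator
        user="api_reader",      # placeholder service user
        password="***",         # placeholder; use Secrets Manager in practice
        warehouse="API_WH",
        database="LAKE_DB",
    )
    try:
        cur = conn.cursor()
        # The query would normally be derived from the API Gateway request payload.
        cur.execute("SELECT * FROM my_table LIMIT 100")
        rows = cur.fetchall()
    finally:
        conn.close()
    return {"statusCode": 200, "body": json.dumps(rows, default=str)}
```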
1 answer, 0 votes, 25 views
EG83
asked 23 days ago
Hi, Is there a way to perform time travel with Athena queries on Apache Hudi tables, similar to what is described in [Implement a CDC-based upsert in a data lake using Apache Iceberg and AWS Glue](https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/)? Or is the *spark.sql* library the only way to do this at the moment? Thanks
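For the Spark route mentioned in the question, a minimal sketch of Hudi time travel through the DataFrame reader (rather than `spark.sql`) could look like the following; the table path and timestamp are placeholders, and the `as.of.instant` option assumes a reasonably recent Hudi release.

```python
# Hedged sketch: read a Hudi table as of a past instant with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-time-travel").getOrCreate()

df = (
    spark.read.format("hudi")
    .option("as.of.instant", "2022-08-01 00:00:00")  # placeholder commit time
    .load("s3://my-bucket/my_hudi_table/")           # placeholder table base path
)
df.show()
```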
2 answers, 0 votes, 43 views
asked 2 months ago
I am having trouble connecting MWAA to Snowflake. I used the MWAA UI to automatically create a VPC, security group, and IAM role for my Airflow environment. I cannot get Snowflake to show up as an option in the connection type list, even with the requirements and constraints files set up. I tried following this tutorial: https://techholding.co/blog/airflow-mwaa-integrations-pt1/

The web server logs indicate that `requirements.txt` ran successfully:

```
--constraint "/usr/local/airflow/dags/constraints-3.7-mod.txt"
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-snowflake==3.0.0
snowflake-connector-python==2.7.8
snowflake-sqlalchemy==1.3.4
```

Since the VPC, security group, and IAM roles are all the defaults that MWAA auto-generates, is it possible that they were not configured properly for external connections (services outside AWS, like Snowflake)?
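As a quick check that the provider package itself is importable in the environment (independent of the connection-type dropdown), a small DAG sketch like the one below could be dropped into the DAGs folder; the connection id and SQL are placeholders, and this assumes the listed `apache-airflow-providers-snowflake` version installed cleanly.

```python
# Hedged smoke-test DAG: if this file parses in MWAA, the Snowflake provider
# is importable even when the connection type is missing from the UI.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="snowflake_smoke_test",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    SnowflakeOperator(
        task_id="select_one",
        snowflake_conn_id="snowflake_default",  # placeholder connection id
        sql="SELECT 1",
    )
```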
1 answer, 0 votes, 116 views
asked 2 months ago
I am building an ETL pipeline using primarily state machines, Athena, and the Glue catalog. In general, things work in the following way:

1. A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
2. A step function (managed by some other process) executes "INSERT INTO" Athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended into the table defined in (1). The table contains all "versions" - it's a historical table that grows over time.

My question is: what is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table?

I've looked into other table types AWS offers, including governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:

1. Governed tables do not support writes via Athena insert queries. Only Glue ETL/Spark seems to be supported at the moment.
2. Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access).
3. Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a very small handful of rows into a given Iceberg table. I'd worry about future performance when we want to insert a million rows.

Any guidance would be appreciated!
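One possible shape for the "latest version only" view, sketched with boto3 and a scalar subquery over the partition column; the database, table, and query result location are placeholders, and whether the subquery prunes partitions efficiently on real data would need to be verified.

```python
# Hedged sketch: publish an Athena view that always reads the newest "version".
import boto3

athena = boto3.client("athena")

create_view_sql = """
CREATE OR REPLACE VIEW my_db.my_table_latest AS
SELECT *
FROM my_db.my_table
WHERE version = (SELECT max(version) FROM my_db.my_table)
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "my_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```

Readers of the view never need to know the current version; the trade-off is that the max() subquery scans the partition column, so checking query cost on the real table is worthwhile.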
1 answer, 0 votes, 74 views
asked 5 months ago
I completed the tutorial from this [link](https://aws.amazon.com/pt/blogs/big-data/apply-record-level-changes-from-relational-databases-to-amazon-s3-data-lake-using-apache-hudi-on-amazon-emr-and-aws-database-migration-service/) successfully, but when I try to do the same thing with other data and another table I don't have success. I receive this error:
```
hadoop@ip-10-99-2-111 bin]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 --master yarn --deploy-mode cluster --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar --table-type COPY_ON_WRITE --source-ordering-field dms_received_ts --props s3://hudi-test-tt/properties/dfs-source-health-care-full.properties --source-class org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path s3://hudi-test-tt/hudi/health_care --target-table hudiblogdb.health_care --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer --payload-class org.apache.hudi.payload.AWSDmsAvroPayload --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --enable-hive-sync Ivy Default Cache set to: /home/hadoop/.ivy2/cache The jars for the packages stored in: /home/hadoop/.ivy2/jars :: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml org.apache.hudi#hudi-utilities-bundle_2.11 added as a dependency org.apache.spark#spark-avro_2.11 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c;1.0 confs: [default] found org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating in central found org.apache.spark#spark-avro_2.11;2.4.5 in central found org.spark-project.spark#unused;1.0.0 in central :: resolution report :: resolve 270ms :: artifacts dl 7ms :: modules in use: org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating from central in [default] org.apache.spark#spark-avro_2.11;2.4.5 from central in [default] org.spark-project.spark#unused;1.0.0 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 3 | 0 | 0 | 0 || 3 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c confs: [default] 0 artifacts copied, 3 already retrieved (0kB/7ms) 22/08/25 21:39:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 22/08/25 21:39:38 INFO RMProxy: Connecting to ResourceManager at ip-10-99-2-111.us-east-2.compute.internal/10.99.2.111:8032 22/08/25 21:39:38 INFO Client: Requesting a new application from cluster with 1 NodeManagers 22/08/25 21:39:38 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container) 22/08/25 21:39:38 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead 22/08/25 21:39:38 INFO Client: Setting up container launch context for our AM 22/08/25 21:39:38 INFO Client: Setting up the launch environment for our AM container 22/08/25 21:39:39 INFO Client: Preparing resources for our AM container 22/08/25 21:39:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 22/08/25 21:39:41 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_libs__5969710364624957851.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_libs__5969710364624957851.zip 22/08/25 21:39:41 INFO Client: Uploading resource file:/usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hudi-utilities-bundle_2.11-0.5.2-incubating.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.5.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.spark_spark-avro_2.11-2.4.5.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.spark-project.spark_unused-1.0.0.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hive-site.xml 22/08/25 21:39:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_conf__6985991088000323368.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_conf__.zip 22/08/25 21:39:42 INFO SecurityManager: Changing view acls to: hadoop 22/08/25 21:39:42 INFO SecurityManager: Changing modify acls to: hadoop 22/08/25 21:39:42 INFO SecurityManager: Changing view acls groups to: 22/08/25 21:39:42 INFO SecurityManager: Changing modify acls groups to: 22/08/25 21:39:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() 22/08/25 21:39:43 INFO Client: Submitting application application_1661296163923_0003 to ResourceManager 22/08/25 21:39:43 INFO YarnClientImpl: Submitted application 
application_1661296163923_0003 22/08/25 21:39:44 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:44 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:39:45 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:46 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:47 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:48 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:49 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:50 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:50 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal ApplicationMaster RPC port: 33179 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:39:51 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:52 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:53 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:54 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:55 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:55 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:39:56 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:57 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:58 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:59 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:40:00 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:40:00 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal ApplicationMaster RPC port: 34591 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:40:01 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:40:02 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:40:03 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 
21:40:04 INFO Client: Application report for application_1661296163923_0003 (state: FINISHED) 22/08/25 21:40:04 INFO Client: client token: N/A diagnostics: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685) Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89) at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:99) ... 9 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78) ... 11 more Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing at org.apache.hudi.DataSourceUtils.lambda$checkRequiredProperties$1(DataSourceUtils.java:173) at java.util.Collections$SingletonList.forEach(Collections.java:4824) at org.apache.hudi.DataSourceUtils.checkRequiredProperties(DataSourceUtils.java:171) at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:55) ... 
16 more ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal ApplicationMaster RPC port: 34591 queue: default start time: 1661463583358 final status: FAILED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:40:04 ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685) Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80) at org.apache.hudi.common.util.ReflectionUtils.loadClass(Refl ```
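The last exception in the log ("Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing") suggests the `--props` file lacks the entries `FilebasedSchemaProvider` needs. A hedged sketch of adding them is below; the schema file locations under `s3://hudi-test-tt/schemas/` are assumptions, not paths from the original setup.

```python
# Hedged sketch: append the schema-provider properties that the stack trace
# reports as missing to the DeltaStreamer --props file.
import boto3

s3 = boto3.client("s3")
bucket = "hudi-test-tt"
key = "properties/dfs-source-health-care-full.properties"

props = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
props += (
    "\nhoodie.deltastreamer.schemaprovider.source.schema.file="
    "s3://hudi-test-tt/schemas/health_care.avsc"   # assumed source Avro schema
    "\nhoodie.deltastreamer.schemaprovider.target.schema.file="
    "s3://hudi-test-tt/schemas/health_care.avsc"   # assumed target Avro schema
)
s3.put_object(Bucket=bucket, Key=key, Body=props.encode("utf-8"))
```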
1 answer, 0 votes, 44 views
asked 5 months ago
I'm a beginner in data engineering tasks. I have a task to create a data lakehouse and I'm trying to understand how to do it using these tools: DMS, S3, Glue, and Hudi. I have already created a simple data lake without much difficulty, but building a data lakehouse is very hard for me because I couldn't find any simple example of this. My environment is as follows: a PostgreSQL database, and I need to refresh the data in the data lake daily, but now as an incremental update rather than a full copy. Does AWS have an example of how to build this?
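A minimal sketch of the daily "update instead of full copy" step, written in PySpark against a Hudi table and assuming DMS has already landed change files in S3; the paths, record key, precombine field, and partition column are placeholders.

```python
# Hedged sketch: upsert a day's DMS change files into a Hudi table on S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-hudi-upsert").getOrCreate()

changes = spark.read.parquet("s3://my-bucket/dms-output/public/customers/")  # placeholder

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "customer_id",       # placeholder primary key
    "hoodie.datasource.write.precombine.field": "updated_at",       # placeholder ordering column
    "hoodie.datasource.write.partitionpath.field": "created_date",  # placeholder partition column
}

(
    changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lakehouse/customers/")  # placeholder Hudi base path
)
```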
1 answer, 0 votes, 172 views
asked 6 months ago
Hello! I am looking for an AWS equivalent to the solution Microsoft has been promoting called the Intelligent Data Platform (IDP); it combines governance, operations, and analytics in one offering. They showcase Synapse together with Azure ML, Purview, and other services integrated into a single platform. I know we have RDS and SageMaker, but what about an equivalent to Purview, and how can we make the AWS side more cohesive?
0 answers, 0 votes, 54 views
AWS
asked 6 months ago
I'm investigating and deploying https://docs.aws.amazon.com/solutions/latest/data-lake-solution/welcome.html. Looking at the GitHub repo https://github.com/aws-solutions/aws-data-lake-solution, it appears this solution is somewhat outdated. There is also a broken link to the underlying documentation, https://github.com/aws-solutions/aws-data-lake-solution, from this page: https://docs.aws.amazon.com/solutions/latest/data-lake-solution/deployment.html. Before I invest a lot of time in this solution, my question is whether it is still up to date and still worthwhile to deploy and explore for a production use case. Alternatively, what is a comparable and current solution to look at? Thanks in advance for any advice.
Accepted Answer | Serverless | Data Lakes
3 answers, 0 votes, 61 views
asked 6 months ago
Hi, I have data in Delta format in S3 and created an external table with the following query:

```
CREATE EXTERNAL TABLE delta_mongo.transactions (
  `_id` string,
  account_id bigint,
  bucket_end_date string,
  bucket_start_date string,
  transaction_count bigint,
  time string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://****/test/lambda-mongodb/atlas/simple-analytics/transactions/curated/full-load/_symlink_format_manifest/';
```

I am able to create the external table, but when I query it in the ap-south-1 region I get the following error:

> HIVE_UNKNOWN_ERROR: com.amazonaws.services.lakeformation.model.InvalidInputException: Unsupported vendor for Glue supported principal: arn:aws:iam::054709220113:root (Service: AWSLakeFormation; Status Code: 400; Error Code: InvalidInputException; Request ID: 24c65ff9-e6dd-4393-9f01-2168b2849849; Proxy: null)

When I test in us-east-1, I am able to query the data. How do I resolve this in ap-south-1?
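Since Lake Formation settings and grants are maintained per region, one possible first step is to compare what the two regions have configured; the sketch below uses standard boto3 calls for that comparison and does not imply a specific fix.

```python
# Hedged diagnostic sketch: compare Lake Formation data lake settings between
# the region that works (us-east-1) and the one that fails (ap-south-1).
import boto3

for region in ("us-east-1", "ap-south-1"):
    lf = boto3.client("lakeformation", region_name=region)
    settings = lf.get_data_lake_settings()["DataLakeSettings"]
    print(region)
    print("  Admins:", settings.get("DataLakeAdmins"))
    print("  Default DB permissions:", settings.get("CreateDatabaseDefaultPermissions"))
    print("  Default table permissions:", settings.get("CreateTableDefaultPermissions"))
```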
0 answers, 0 votes, 70 views
asked 6 months ago
From time to time a CSV file arrives with a single row, and it breaks the Glue crawler because of the at-least-two-rows requirement for a file to be classified as CSV. Is there a way I can provide a custom CSV classifier so I don't need to write a grok pattern for 300+ columns? I can't figure out what to use as the Quote symbol, since I don't have any quotes in the CSV. This is a snippet of the data:

```
SSR|CCE|34|BBB||1
```

I can write something like

```
(?<col0>^.*)(?:[^|]*\|)
```

as a grok pattern and repeat it 300 times, but it would be good to make it less ugly.
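One option worth testing is a custom CSV classifier created through the Glue API; whether it sidesteps the two-row heuristic for single-row files should be verified, and the classifier name, column list, and quote symbol below are placeholders.

```python
# Hedged sketch: a custom pipe-delimited CSV classifier with an explicit header.
import boto3

glue = boto3.client("glue")

glue.create_classifier(
    CsvClassifier={
        "Name": "pipe-delimited-no-quotes",  # placeholder name
        "Delimiter": "|",
        "QuoteSymbol": '"',                  # optional; harmless when the data has no quotes
        "ContainsHeader": "ABSENT",
        "Header": ["col0", "col1", "col2"],  # placeholder; supply all 300+ column names
        "AllowSingleColumn": False,
    }
)
```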
0 answers, 0 votes, 64 views
asked 6 months ago
We have an incoming file with fixed-length fields (.dat). For example:

```
|2|123 |AWS |0505 |3
```

When the Glue crawler crawls the file, it ignores all the int/long values that have trailing spaces. In this case the value 123 won't appear in the result when querying the created table in Athena. I have tried to create a custom classifier to remove the white space, however there is no specific custom **Quote symbol** that I can provide in the config to make the crawler pick it up instead of the default CSV one. Any help would be appreciated.
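One hedged workaround, if the classifier route stays stubborn, is to leave the crawled columns as strings and expose a trimmed, typed Athena view on top; the database, table, and column names below are placeholders.

```python
# Hedged sketch: trim and cast padded fixed-width fields in an Athena view.
import boto3

athena = boto3.client("athena")

view_sql = """
CREATE OR REPLACE VIEW my_db.my_table_typed AS
SELECT
    CAST(TRIM(col_a) AS BIGINT) AS col_a,  -- e.g. the '123 ' value from the sample row
    TRIM(col_b) AS col_b,
    TRIM(col_c) AS col_c
FROM my_db.my_table_raw
"""

athena.start_query_execution(
    QueryString=view_sql,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```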
0 answers, 0 votes, 47 views
asked 6 months ago
Hello, according to the step-by-step guide in the official AWS Athena user guide (link at the end of the question), it should be possible to connect Tableau Desktop to Athena and Lake Formation via the Simba Athena JDBC driver using Okta as the IdP. The challenge I am facing right now is that, although I followed each step as documented in the Athena user guide, I cannot make the connection work. The error message that I receive whenever I try to connect from Tableau Desktop states:

> [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. The security token included in the request is invalid. [Execution ID not available] Invalid Username or Password.

My athena.properties file used to configure the driver in Tableau via the connection string URL looks as follows (user name and password are masked):

```
jdbc:awsathena://AwsRegion=eu-central-1;
S3OutputLocation=s3://athena-query-results;
AwsCredentialsProviderClass=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider;
idp_host=1234.okta.com;
User=*****.*****@example.com;
Password=******************;
app_id=****************************;
ssl_insecure=true;
okta_mfa_type=oktaverifywithpush;
LakeFormationEnabled=true;
```

The configuration settings used here are from the official Simba Athena JDBC driver documentation (version 2.0.31). Furthermore, I assigned the required permissions for my users and groups inside Lake Formation as stated in the step-by-step guide linked below. Right now I am not able to work out why the connection fails, so I would be very grateful for any support or ideas on this topic.

Best regards

Link: https://docs.aws.amazon.com/athena/latest/ug/security-athena-lake-formation-jdbc-okta-tutorial.html#security-athena-lake-formation-jdbc-okta-tutorial-step-1-create-an-okta-account
0 answers, 0 votes, 105 views
asked 7 months ago