Questions tagged with Extract Transform & Load Data

I converted a CSV (from S3) to Parquet (to S3) using AWS Glue, and the resulting Parquet file was named randomly. How do I choose the name of the Parquet file? ![Enter image description here](/media/postImages/original/IMUQds6rTFS8i2Yv9YENybcQ) When I append data.parquet to the S3 target path without a trailing '/', AWS Glue creates a subfolder in the bucket named data.parquet instead of using it as the file name, and the new Parquet file is still created with a name like "run-1678983665978-part-block-0-r-00000-snappy.parquet". Where should I specify a name for the Parquet file?
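A minimal sketch of one common workaround, assuming the goal is a single output object with a fixed key: Glue/Spark always generates its own part-file names, so the job can write to a temporary prefix (coalesced to one partition) and a small post-step can copy the part file to the desired key. The bucket, prefixes, and final file name below are placeholders, not anything Glue provides.

```
import boto3

# Assumption: the Glue job has already written a single part file, e.g.:
#   df.coalesce(1).write.mode("overwrite").parquet("s3://my-bucket/tmp-output/")
bucket = "my-bucket"                # placeholder bucket
tmp_prefix = "tmp-output/"          # prefix the job wrote to
final_key = "curated/data.parquet"  # desired object name

s3 = boto3.client("s3")

# Find the auto-named part file under the temporary prefix.
resp = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)
part_keys = [obj["Key"] for obj in resp.get("Contents", [])
             if obj["Key"].endswith(".parquet")]

if part_keys:
    # Copy it to the fixed name and remove the temporary object.
    s3.copy_object(Bucket=bucket, Key=final_key,
                   CopySource={"Bucket": bucket, "Key": part_keys[0]})
    s3.delete_object(Bucket=bucket, Key=part_keys[0])
```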
1 answer · 0 votes · 16 views · asked a day ago
Macie provides detailed positions of sensitive data in its output. I want to extract that data from the source files using those positions. Also, Macie reveals only 10 samples. Is there any way to get more than 10 samples from the Macie "reveal samples" feature? If not, is there any other solution (like a SQL query or something similar) to extract the complete sensitive data from the affected files?
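A rough sketch of one possible approach, assuming the finding's occurrences include line ranges: fetch the finding with boto3 and slice the referenced lines out of the S3 object directly, instead of relying on the 10-sample reveal. The field names below reflect my understanding of the Macie finding layout and should be treated as assumptions to verify against your own findings; the finding ID is a placeholder.

```
import boto3

macie = boto3.client("macie2")
s3 = boto3.client("s3")

# Placeholder finding ID from a sensitive data discovery job.
finding = macie.get_findings(findingIds=["example-finding-id"])["findings"][0]

# Assumed layout: resourcesAffected points at the object, classificationDetails
# holds the sensitive-data detections with their occurrences.
bucket = finding["resourcesAffected"]["s3Bucket"]["name"]
key = finding["resourcesAffected"]["s3Object"]["key"]
result = finding["classificationDetails"]["result"]

body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
lines = body.splitlines()

# Walk the reported occurrences and print the matching lines.
for category in result.get("sensitiveData", []):
    for detection in category.get("detections", []):
        for rng in detection.get("occurrences", {}).get("lineRanges", []):
            start, end = rng["start"], rng["end"]  # assumed 1-based line numbers
            print(detection["type"], lines[start - 1:end])
```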
0 answers · 0 votes · 10 views · asked 2 days ago
I'm writing partitioned Parquet data using a Spark DataFrame and mode=overwrite to update stale partitions. I have this set: `spark.conf.set('spark.sql.sources.partitionOverwriteMode','dynamic')`. The data is being written correctly with all the partitioning set correctly, but I am also getting empty files created at each level of the path, named <path_level>_$folder$. Removing mode=overwrite eliminates this strange behavior. Is there any way to prevent these zero-byte files from being created? Have I misconfigured something?
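For reference, a minimal sketch of that write pattern plus a hedged cleanup of the markers; the `_$folder$` objects are zero-byte directory placeholders left by the Hadoop/EMRFS S3 committer, and the snippet below only removes the markers, not the data. Bucket, prefixes, and partition columns are placeholders.

```
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Placeholder source and target: overwrite only the partitions present in df.
df = spark.read.parquet("s3://my-bucket/staging/")
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("s3://my-bucket/curated/table/"))

# Optional cleanup of the zero-byte "<prefix>_$folder$" marker objects.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="curated/table/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket="my-bucket", Key=obj["Key"])
```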
1 answer · 0 votes · 14 views · asked 2 days ago
How can one set Execution Class = FLEX on a Jupyter job run? I'm using the %%configure magic in my cell as below, and also setting the input arguments with --execution_class = FLEX, but the jobs still kick off as STANDARD.

```
%%configure
{
    "region": "us-east-1",
    "idle_timeout": "480",
    "glue_version": "3.0",
    "number_of_workers": 10,
    "execution_class": "FLEX",
    "worker_type": "G.1X"
}
```

![Enter image description here](/media/postImages/original/IMgaPRfCicTAKewOu41SXTqw)
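As a cross-check that is independent of the notebook magics, the execution class can also be set when a run is started through the Glue API; a minimal boto3 sketch with a placeholder job name (this shows the API parameter, not necessarily a fix for the notebook behavior):

```
import boto3

glue = boto3.client("glue")

# Placeholder job name; ExecutionClass on start_job_run applies to this run.
run = glue.start_job_run(
    JobName="my-notebook-job",
    ExecutionClass="FLEX",
)
print(run["JobRunId"])
```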
2 answers · 0 votes · 43 views · asked 4 days ago
Hi, I'd appreciate AWS Athena support for the TIMESTAMP data type with microsecond precision for all row formats and table engines. Currently, the support is very inconsistent. See the SQL script below.

```
drop table if exists test_csv;
create external table if not exists test_csv (
    id int,
    created_time timestamp
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ('separatorChar'=',', 'quoteChar'='"', 'escapeChar'='\\')
location 's3://my-bucket/tmp/timestamp_csv_test/';
-- result: OK

drop table if exists test_parquet;
create external table if not exists test_parquet (
    id int,
    created_time timestamp
)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored as
    inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/tmp/timestamp_parquet_test/'
tblproperties ('parquet.compress' = 'snappy');
-- result: OK

drop table if exists test_iceberg;
create table if not exists test_iceberg (
    id int,
    created_time timestamp
)
location 's3://my-bucket/tmp/timestamp_iceberg_test/'
tblproperties ('table_type' = 'iceberg');
-- result: OK

insert into test_csv values (1, timestamp '2023-03-22 11:00:00.123456');
/* result: ERROR [HY000][100071] [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client.
   GENERIC_INTERNAL_ERROR: class org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector
   cannot be cast to class org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector
   (org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector and
   org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector are in unnamed module of loader
   io.trino.server.PluginClassLoader @1df1bd44). If a data manifest file was generated at
   's3://my-bucket/athena_results/ad44adee-2a80-4f41-906a-17aa5dc27730-manifest.csv', you may need to manually
   clean the data from locations specified in the manifest. Athena will not delete data in your account.
   [Execution ID: ***] */

insert into test_parquet values (1, timestamp '2023-03-22 11:00:00.123456');
-- result: OK

select * from test_parquet;
-- result: OK, DATA: 1, 2023-03-22 11:00:00.123000
-- BUT THE TIMESTAMP VALUE IS TRUNCATED TO MILLISECONDS!

insert into test_iceberg values (1, timestamp '2023-03-22 11:00:00.123456');
-- result: OK

select * from test_csv;
select * from test_iceberg;
-- result: OK, DATA: 1, 2023-03-22 11:00:00.123456 -- THIS IS FINE
```
0 answers · 0 votes · 24 views · asked 7 days ago
Started getting this error today when querying data from Athena, in a table created from Parquet files in our S3 bucket: ![Enter image description here](/media/postImages/original/IMOlLNCZ5pR3qHzWlY8F4OXQ) I suspect a bad file somewhere but have been unable to narrow it down. Any steps or tips to resolve this would be much appreciated. Thanks!
0 answers · 0 votes · 19 views · asked 7 days ago
Hi community, I am trying to perform an ETL job using AWS Glue. Our data is stored in MongoDB Atlas, inside a VPC, and our AWS account is connected to MongoDB Atlas using VPC peering. To perform the ETL job in AWS Glue, I first created a connection using the VPC details and the MongoDB Atlas URI along with the username and password. The connection is used by the AWS Glue crawlers to extract the schema into AWS Data Catalog tables, and this connection works! However, I am then attempting to perform the actual ETL job using the following PySpark code:

```
# My temp variables
source_database = "d*********a"
source_table_name = "main_businesses"
source_mongodb_db_name = "main"
source_mongodb_collection = "businesses"

glueContext.create_dynamic_frame.from_catalog(
    database=source_database,
    table_name=source_table_name,
    additional_options={"database": source_mongodb_db_name,
                        "collection": source_mongodb_collection})
```

However, the connection times out, and for some reason MongoDB Atlas is blocking the connection from the ETL job. It's as if the ETL job uses the connection differently than the crawler does. Maybe the ETL job is not able to run inside our AWS VPC that is peered with the MongoDB Atlas VPC (is VPC peering not possible here?). Does anyone have any idea what might be going on or how I can fix this? Thank you!
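For comparison, a minimal sketch of reading the collection directly with `from_options`; the URI, credentials, and names below are placeholders, and this bypasses the catalog while using Glue's documented MongoDB connection options. Separately, it may be worth confirming that the Glue connection is attached to the job itself (not only to the crawler), since a Glue job only runs inside the peered VPC when a connection with those network settings is attached to it.

```
# Assumes a GlueContext is already available as glueContext (as in the job above).
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "uri": "mongodb://<atlas-host>:27017",  # placeholder URI
        "database": "main",
        "collection": "businesses",
        "username": "<user>",                   # placeholder credentials
        "password": "<password>",
        "ssl": "true",
        "ssl.domain_match": "false",
    },
)
print(dyf.count())
```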
1 answer · 0 votes · 21 views · asked 8 days ago
In Redshift, I'm trying to update a table using another table from another database. The error details:

```
SQL Error [XX000]: ERROR: Assert
  Detail:
  -----------------------------------------------
  error:    Assert
  code:     1000
  context:  scan->m_src_id == table_id -
  query:    17277564
  location: xen_execute.cpp:5251
  process:  padbmaster [pid=30866]
```

The context is not helpful. I have used a similar join-based approach for other tables, and there the update statement has been working fine. Update syntax used:

```
UPDATE ods.schema.tablename
SET "TimeStamp" = GETDATE(),
    "col" = S."col"
FROM ods.schema.tablename T
INNER JOIN stg.schema.tablename S
    ON T.Col = S.Col;
```
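For reference, the usual Redshift pattern joins the source table in the FROM/WHERE clause without repeating the target table. A minimal sketch run through redshift_connector, with placeholder host, credentials, and the same table names as above; this shows the general form, not a confirmed fix for the assert error.

```
import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
    database="ods",
    user="etl_user",
    password="***",
)

# Typical UPDATE ... FROM form: the target table appears only once and the
# staging table is joined in the WHERE clause.
update_sql = """
UPDATE schema.tablename
SET "TimeStamp" = GETDATE(),
    "col" = S."col"
FROM stg.schema.tablename S
WHERE schema.tablename.Col = S.Col;
"""

cur = conn.cursor()
cur.execute(update_sql)
conn.commit()
conn.close()
```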
1 answer · 0 votes · 21 views · asked 13 days ago
I'm attempting to use AWS Data Pipeline to move a CSV file from my computer to an AWS data lake as a Parquet file. I'm unable to find the exact template to select for migrating from my local computer. Please help me in choosing the source. ![.](/media/postImages/original/IMcwPjZ5cqQrKctVYzb3sm3A)
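A hedged note and sketch: the Data Pipeline templates generally assume sources AWS can reach (S3, RDS, DynamoDB, or on-premises hosts running Task Runner), so one common first step, if it fits your setup, is to upload the local CSV to S3 and let the pipeline or a Glue job take over from there. The local path, bucket, and key below are placeholders.

```
import boto3

s3 = boto3.client("s3")

# Upload the local CSV into the data lake's landing prefix (placeholders).
s3.upload_file(
    Filename="C:/data/sales.csv",       # local file on the laptop
    Bucket="my-datalake-bucket",
    Key="raw/sales/sales.csv",
)
```

The same upload can be done with the AWS CLI (`aws s3 cp`) if a script is not needed.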
0 answers · 0 votes · 14 views · asked 14 days ago
I want to directly move a CSV file from my laptop to an AWS data lake using AWS Data Pipeline. Is it possible to do so? If yes, how?
1 answer · 0 votes · 22 views · asked 14 days ago
Hey, my Glue ETL job:

1. reads from the Data Catalog (S3 based),
2. selects specific fields from the input file (which is JSON),
3. does some mapping,
4. saves the output data to a Postgres Aurora Data Catalog table.

I also want to save things like the S3 input file name and date of creation, and I'm currently using the input_file_name and current_timestamp functions from pyspark.sql.functions:

`select1_to_df = select1.toDF().withColumn("input_file_name", input_file_name()).withColumn("current_timestamp", current_timestamp())`
`select1_enriched = select1.fromDF(select1_to_df, glueContext, "select1_enriched")`

This works fine for small datasets, but for large ones I see empty input_file_name and current_timestamp columns in my DB. Is there a better way to do this for large datasets? It's possible that in the future I'll need to append more columns with custom calculations based on what is inside a given file, so this functionality is crucial for me. Thanks
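A minimal sketch of one alternative, assuming the source is plain JSON on S3: read with Spark directly and attach the file name and load timestamp immediately after the read, before any joins or other wide transformations, then convert back to a DynamicFrame so the rest of the Glue job is unchanged. The path and frame name are placeholders.

```
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import current_timestamp, input_file_name

# Assumes spark (SparkSession) and glueContext already exist in the job script.
df = (spark.read.json("s3://my-bucket/input/")          # placeholder path
          .withColumn("input_file_name", input_file_name())
          .withColumn("load_timestamp", current_timestamp()))

# Back to a DynamicFrame so the existing mappings and sinks stay the same.
enriched = DynamicFrame.fromDF(df, glueContext, "enriched")
```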
2 answers · 0 votes · 31 views · pkantor · asked 15 days ago
Hello, I am running a job to apply an ETL to a semicolon-separated CSV on S3. However, when I read the file using the AWS Glue DynamicFrame feature and try to use any method like `printSchema` or `toDF`, I get the following error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o77.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (52bff5da55da executor driver): com.amazonaws.services.glue.util.FatalException: Unable to parse file: s3://my-bucket/my-file.csv
```

I have already verified the encoding; it is UTF-8, so that should not be the problem. When I read the CSV using `spark.read.csv` it works fine, and the crawlers can also recognize the schema. The data has some special characters that shouldn't be there, and removing them is part of the ETL I am looking to perform. Neither the `from_catalog` nor the `from_options` function from AWS Glue works, and the problem is the same whether I run the job locally in Docker or in Glue Studio. My data has a date folder partition, so I would prefer to avoid reading it directly with Spark and instead take advantage of the Glue Data Catalog. Thanks in advance.
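A minimal sketch of telling the CSV reader about the delimiter explicitly, in case the parse failure comes from the default comma separator; `separator` and `withHeader` are documented Glue CSV format options, while the path and the header assumption below are placeholders:

```
# Assumes a GlueContext is already available as glueContext.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"],
                        "recurse": True},
    format="csv",
    format_options={
        "separator": ";",     # semicolon-separated input
        "withHeader": True,   # assumption: first row holds column names
        "quoteChar": '"',
    },
)
dyf.printSchema()
```

When reading through `from_catalog` instead, the delimiter comes from the table's SerDe properties, so the crawler-detected separator may need the same correction there.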
1 answer · 0 votes · 37 views · Aftu · asked 16 days ago