Questions tagged with Extract Transform & Load Data


COPY command ignoring defaults if some values are present in specific fields

I am trying to load a Redshift table from a .csv file in an S3 bucket. When I created the table, three columns were set to have default values:

```
,"current_row_ind" CHAR(1) DEFAULT 'Y'
,"EFFECTIVE_DATE" DATE DEFAULT TO_DATE(CURRENT_DATE, 'YYYY-MM-DD')
,"EXPIRATION_DATE" DATE DEFAULT TO_DATE('9999-12-31', 'YYYY-MM-DD')
```

Some of the rows in my file have values for these fields and some do not. When I use the COPY command to load the table, Redshift seems to ignore the defaults for the fields that are left empty. If I don't specify those fields in the column list but leave them in my file, I get the error "Extra column(s) found". If I do specify those fields in the column list, the rows in my file that lack values are loaded with NULLs rather than the defaults. If I leave those fields off the COPY column list and remove them from my file, they load with the defaults, but that won't work because some of the rows have legitimate values for them. These are the parameters I'm using in my COPY command:

```
FORMAT AS CSV
DELIMITER ','
QUOTE '"'
ACCEPTINVCHARS '?'
TIMEFORMAT 'auto'
IGNOREHEADER 1
ACCEPTANYDATE
BLANKSASNULL
DATEFORMAT 'auto'
EMPTYASNULL
EXPLICIT_IDS
FILLRECORD
IGNOREBLANKLINES
TRIMBLANKS
REGION AS 'us-west-2'
```

My question is: how can I use the COPY command to load this table so that it picks up the values for these fields when they are present in the file, but uses the defaults when they are not? I have researched this here on re:Post as well as in other forums such as Stack Overflow but have not found anything that addresses this specific scenario. Any help would be appreciated.
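Since COPY applies column defaults only for columns omitted from the load entirely, a common workaround is to COPY into a staging table that has no defaults and then INSERT into the target, substituting the defaults for NULLs. Below is a minimal sketch using the Redshift Data API from Python; the schema, table, bucket, IAM role, and workgroup names are placeholders, and it assumes the batch runs as a single session so the temp table stays visible across statements:

```
import boto3

# Placeholder identifiers -- substitute your own workgroup, database, schema, bucket, and role.
WORKGROUP = "my-workgroup"
DATABASE = "dev"

sql_statements = [
    # 1) Staging table shaped like the target, but explicitly without defaults,
    #    so empty fields simply arrive as NULL.
    "CREATE TEMP TABLE stage_my_table (LIKE my_schema.my_table EXCLUDING DEFAULTS)",
    # 2) COPY the file into the staging table (options trimmed to the essentials here).
    """
    COPY stage_my_table
    FROM 's3://my-bucket/my-file.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS CSV
    IGNOREHEADER 1
    EMPTYASNULL
    BLANKSASNULL
    """,
    # 3) Move rows into the real table, replacing NULLs with the intended defaults.
    #    The column list is illustrative only.
    """
    INSERT INTO my_schema.my_table
    SELECT id,
           other_col,
           NVL(current_row_ind, 'Y'),
           NVL(effective_date, CURRENT_DATE),
           NVL(expiration_date, TO_DATE('9999-12-31', 'YYYY-MM-DD'))
    FROM stage_my_table
    """,
]

client = boto3.client("redshift-data")
resp = client.batch_execute_statement(
    WorkgroupName=WORKGROUP,   # or ClusterIdentifier/DbUser for a provisioned cluster
    Database=DATABASE,
    Sqls=[s.strip() for s in sql_statements],
)
print(resp["Id"])  # statement id; poll describe_statement() for completion
```

The same staging-plus-INSERT pattern works just as well run directly in the query editor or from an orchestration tool; the Data API wrapper here is only for illustration.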
0 answers · 0 votes · 16 views · asked 4 days ago

Redshift Serverless Data API SQL calls

Hi, I am new to Redshift Serverless and am working on a Lambda function that connects to a serverless database using the Python Redshift Data API and executes a bunch of SQL statements and stored procedures stored in a Nexus repository artifact. I am seeing errors when I try to execute SQL statements read from a file as a string. Here is an example of the DDL from one of the scripts:

```
-- Table Definitions
-- ----------------------------------
-- test1 -----------------------
-- ----------------------------------
DROP TABLE IF EXISTS test1;
CREATE TABLE test1 (
    id varchar(32),
    name varchar(64) DEFAULT NULL,
    grade varchar(64) DEFAULT NULL,
    zip varchar(5) DEFAULT NULL
);

-- test2 --------------------------
-- ----------------------------------
DROP TABLE IF EXISTS test2;
CREATE TABLE test2 (
    id varchar(32),
    user_id varchar(32) DEFAULT NULL,
    hnum varchar(6),
    ts_created timestamp DEFAULT NULL,
    ts_updated timestamp DEFAULT NULL
);

-- and a few other tables in the same script
```

The function runs fine if I hard-code the SQL query in the code, and I don't see any syntax or other errors in the SQL file contents, since I can run the DDLs in the Redshift query editor by manually copying and pasting them. Am I missing anything, or is the Data API not the right approach for this use case? Error and traceback from the Lambda function execution:

```
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/runtime/bootstrap.py", line 60, in <module>
    main()
  File "/var/runtime/bootstrap.py", line 57, in main
    awslambdaricmain.main([os.environ["LAMBDA_TASK_ROOT"], os.environ["_HANDLER"]])
  File "/var/runtime/awslambdaric/__main__.py", line 21, in main
    bootstrap.run(app_root, handler, lambda_runtime_api_addr)
  File "/var/runtime/awslambdaric/bootstrap.py", line 405, in run
    handle_event_request(
  File "/var/runtime/awslambdaric/bootstrap.py", line 165, in handle_event_request
    xray_fault = make_xray_fault(etype.__name__, str(value), os.getcwd(), tb_tuples)
FileNotFoundError: [Errno 2] No such file or directory
```
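One likely issue is that the Data API's `execute_statement` accepts a single SQL statement, so passing a whole multi-statement script as one string fails; splitting the script and using `batch_execute_statement` is one way around that. A minimal sketch, assuming a serverless workgroup named `my-workgroup` and the script already read into `script_text` (both placeholders); the naive semicolon split is only illustrative and would need refinement for stored-procedure bodies that contain semicolons:

```
import time
import boto3

client = boto3.client("redshift-data")

def run_script(script_text, workgroup="my-workgroup", database="dev"):
    # Naive split on ';' -- fine for plain DDL/DML, but procedure bodies
    # (e.g. delimited with $$ ... $$) need a smarter splitter.
    statements = [s.strip() for s in script_text.split(";") if s.strip()]

    resp = client.batch_execute_statement(
        WorkgroupName=workgroup,
        Database=database,
        Sqls=statements,
    )

    # The Data API is asynchronous: poll until the batch finishes.
    while True:
        desc = client.describe_statement(Id=resp["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)

    if desc["Status"] != "FINISHED":
        raise RuntimeError(desc.get("Error", "SQL batch did not finish"))
    return desc
```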
1 answer · 0 votes · 53 views · asked 10 days ago

Creating a table in the Glue Catalog with Parquet format fails, even with enableUpdateCatalog

I am trying to modify a PySpark/Glue job to write a DynamicFrame to a new table in the Glue Catalog, in Parquet format. The database exists; the table does not yet. I am following the instructions posted here: https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html

I am getting the error:

```
  File "/tmp/resource_logs_with_attributes.py", line 443, in write_transformed_data_to_s3
    format="glueparquet"
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 386, in write_dynamic_frame_from_catalog
    makeOptions(self._sc, additional_options), catalog_id)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o91.getCatalogSink.
: com.amazonaws.services.glue.model.EntityNotFoundException: Table api_logs_core not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; Request ID: 4ed5f8f4-e8bf-4764-afb8-425804d3bcd5; Proxy: null)
```

This is the code:

```
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://mybucket/test",
    enableUpdateCatalog=True,
    format="glueparquet"
)
sink.setCatalogInfo(
    catalogDatabase='myexistinggluedatabase',
    catalogTableName='newtable'
)
newdyf = glueContext.write_dynamic_frame_from_catalog(
    frame=my_dyf,
    database='myexistinggluedatabase',
    table_name='newtable',
    format="glueparquet"
)
```

I have tried variants including:

* including the partition key array in the sink (`partitionKeys=["dt_pk"]`)
* `format="parquet"` + `format_options={"useGlueParquetWriter": True}`
* including `"updateBehavior": "UPDATE_IN_DATABASE"` in additional_options
* `sink.writeFrame()` with options as documented
* `my_dyf.write()` with connection_options and format + format_options

This is generally functional code that has worked like this:

* reads data from a source into a DynamicFrame
* does basic transformations to a new schema
* works in the context of a Glue Catalog
* the existing code uses `write_dynamic_frame.with_options`, specifying the S3 location and Parquet format, partitioned by date, etc., writing to a table whose schema was discovered and written to the Glue Catalog by a Glue Crawler; this works in two phases: the Glue job runs and writes Parquet files to S3, then the Glue Crawler updates the Glue Catalog
* writes to a partitioned S3 data set in Parquet format
* works with either standard Parquet or the Glue Parquet Writer

My sole change is that I have the same schema, partitions, and data format as before, but I now want to write to a new S3 location and have the Glue job create the table and update the Data Catalog.
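For reference, the table-creation path described on the linked docs page writes through the sink itself with `sink.writeFrame()`, whereas `write_dynamic_frame_from_catalog` looks up an existing catalog table, which is consistent with the EntityNotFoundException above when the table is new. A minimal sketch of that shape, reusing the names from the question as placeholders:

```
# Sketch of the catalog-updating sink pattern from
# https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
# Database, table, bucket, and partition key names are placeholders.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://mybucket/test",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["dt_pk"],
)
sink.setFormat("glueparquet")      # per the docs example, the format is set with setFormat()
sink.setCatalogInfo(
    catalogDatabase="myexistinggluedatabase",
    catalogTableName="newtable",
)
sink.writeFrame(my_dyf)            # writes the data and creates/updates the catalog table
```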
0 answers · 0 votes · 14 views · asked 20 days ago