Redshift: increment identity key by 1 when loading a table


I have a serverless database set up in Redshift, created a table, and am now trying to load that table from a .csv file I have uploaded to an S3 bucket. When I created the table I set the primary key as an identity column: customerid integer NOT NULL identity(0,1)

When I load the table using the COPY command, the key increments by 128 starting at 64 rather than by 1 starting at 1. For example, my customerid field has values of 64, 192, 320, 448, etc. I've read in numerous articles that this is due to compression and parallelism. I tried adding COMPUPDATE OFF to my COPY command, but that did not change the results. I've truncated my table before each reload to reset the seed.
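For reference, the setup described above looks roughly like this (a sketch; the table name, column list, S3 path, and IAM role are placeholders, not from the original post):

```sql
CREATE TABLE customers (
    customerid    INTEGER NOT NULL IDENTITY(0,1),  -- seed 0, step 1 as described
    customer_name VARCHAR(100)
);

COPY customers (customer_name)
FROM 's3://my-bucket/customers.csv'                          -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'   -- placeholder role
FORMAT AS CSV
COMPUPDATE OFF;  -- tried per the question, but it does not remove identity gaps
```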

How can I load a table and have the identity key start with 1 and increment by 1?

Asked 1 year ago · 2729 views
4 Answers
Accepted Answer

Hi PurpleHaze,

Yes indeed there will be some gaps in the values. I would do the following:

Let's assume that the source file contains 3 fields: employee_id, employee_name and hire_date.

  1. Create a table (let's call it tbl_staging) with these three columns and load the data into it normally using the COPY command.
  2. Create another table (let's call it tbl_work) with the same structure as tbl_staging plus one extra INTEGER or BIGINT column.
  3. Do an insert/select from tbl_staging into tbl_work using the ROW_NUMBER function. For example: insert into tbl_work (....) select row_number() over (), .... from tbl_staging;

Note that PARTITION BY and ORDER BY are optional in ROW_NUMBER, so if you don't need any particular sorting or partitioning you can leave them out.

And do not forget to use the TRUNCATE command whenever you want to purge the tables ;) https://docs.aws.amazon.com/redshift/latest/dg/r_TRUNCATE.html.
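The three steps above can be sketched in SQL as follows (a sketch only; the column layout follows the employee example above, while the S3 path and IAM role are placeholders):

```sql
-- Step 1: staging table matching the source file, loaded via COPY.
CREATE TABLE tbl_staging (
    employee_id   INTEGER,
    employee_name VARCHAR(100),
    hire_date     DATE
);

COPY tbl_staging
FROM 's3://my-bucket/employees.csv'                          -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'   -- placeholder role
FORMAT AS CSV;

-- Step 2: work table with the same structure plus one extra BIGINT column.
CREATE TABLE tbl_work (
    employee_id   INTEGER,
    employee_name VARCHAR(100),
    hire_date     DATE,
    row_id        BIGINT
);

-- Step 3: generate gap-free IDs with ROW_NUMBER while copying the rows over.
INSERT INTO tbl_work (employee_id, employee_name, hire_date, row_id)
SELECT employee_id, employee_name, hire_date,
       ROW_NUMBER() OVER ()   -- no PARTITION BY / ORDER BY needed here
FROM tbl_staging;
```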

Thanks.

AWS
Expert
Ziad
Answered 1 year ago
AWS
Expert
Reviewed 1 year ago

Ziad, thank you again for the response. This method would work well for a one- or two-time load to a table, but I will be building out a process to automate ingestion of .csv files on a regular basis, and I don't see how the process you outlined would be efficient for that. Seeing as you answered the scope of my initial question, though, I will consider this closed. Thank you again for your assistance. Greatly appreciated!

Answered 1 year ago
  • Thank you. Yes, I understand your concern. If the target tables are not purged and data is just inserted (without updates), then the max value (or we can call it the next value) should be stored somewhere, in a table for example, so it can be used during the next load. If the target tables are not purged and data can be updated or modified (based on the generated ID), then in this case I would consider surrogate keys instead of identity values.
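For recurring loads, the "store the next value" idea above can be sketched without a separate bookkeeping table by reading the current maximum directly from the target (a sketch; table and column names follow the hypothetical example above):

```sql
-- Incremental load: offset the new ROW_NUMBER values by the current max ID,
-- so IDs keep increasing by 1 across loads without gaps.
INSERT INTO tbl_work (employee_id, employee_name, hire_date, row_id)
SELECT s.employee_id, s.employee_name, s.hire_date,
       ROW_NUMBER() OVER () + m.max_id
FROM tbl_staging s
CROSS JOIN (SELECT COALESCE(MAX(row_id), 0) AS max_id FROM tbl_work) m;
```

Note this assumes only one load runs at a time; concurrent loads would need the "next value" stored and locked in a control table instead.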


Hi PurpleHaze,

Amazon Redshift is a massively parallel processing system; data is loaded in parallel when an insert/select or a COPY statement is used. Redshift skips a number of values to make sure that the identity values are unique across slices, so the values are not generated consecutively.

You can use the ROW_NUMBER() function to generate the values in order after loading the data to Redshift (insert ... select row_number() over () as id_value from ...).

Here's the syntax of the ROW_NUMBER function https://docs.aws.amazon.com/redshift/latest/dg/r_WF_ROW_NUMBER.html.
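As a concrete sketch of that pattern (table and column names are hypothetical; an ORDER BY is included so the numbering is deterministic across runs):

```sql
-- Generate ordered, gap-free IDs while moving rows out of a staging table.
INSERT INTO customers_final (customerid, customer_name)
SELECT ROW_NUMBER() OVER (ORDER BY customer_name) AS id_value,
       customer_name
FROM customers_staging;
```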

Thanks.

AWS
Expert
Ziad
Answered 1 year ago
AWS
Expert
Reviewed 1 year ago

Thank you Ziad. The IDs are generating in order; they just jump by 128 each time, so the ID numbers will get huge in a short time. To change the ID field after loading the table, would I do an insert or an update? i.e.: update tablename set customerid = ROW_NUMBER() OVER ([PARTITION BY customerid])

Answered 1 year ago
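On the insert-vs-update point raised above: Redshift does not allow a window function directly in an UPDATE's SET clause, so one workaround is to compute ROW_NUMBER in a subquery and join back on a unique natural key (a sketch; the customers table and the customer_email key are hypothetical, and Redshift restricts writing explicit values into an IDENTITY column, so this pattern is normally applied to a plain INTEGER/BIGINT column instead):

```sql
-- Renumber an existing (non-identity) ID column in place.
UPDATE customers
SET customerid = numbered.new_id
FROM (
    SELECT customer_email,
           ROW_NUMBER() OVER (ORDER BY customer_email) AS new_id
    FROM customers
) numbered
WHERE customers.customer_email = numbered.customer_email;
```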
