Amazon Redshift not maintaining uniqueness in primary key column

0

Hi, I created a table in an Amazon Redshift database:

```sql
create table sales(
    salesid integer not null,
    listid integer not null,
    sellerid integer not null,
    primary key(salesid)
);
```

After creating it successfully, I inserted values into the table, but the table accepts duplicate values for salesid, so it is not maintaining any primary key rules.
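For illustration (these exact values are assumed; the original post showed a screenshot instead), the behavior looks like this:

```sql
-- Both inserts succeed even though they share the same salesid,
-- because Redshift does not enforce the PRIMARY KEY constraint
INSERT INTO sales VALUES (1, 10, 100);
INSERT INTO sales VALUES (1, 20, 200);

SELECT * FROM sales;  -- returns both rows with salesid = 1
```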

asked a year ago · 3324 views
4 Answers
0
Accepted Answer

This is by design - the Amazon Redshift documentation on defining constraints (https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html) says:

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

Best practice is to have an ETL process that cleans your data. One way to do this is to load data into staging tables, then run SQL statements to upsert into a set of main tables, de-duplicating the data in the process.
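A minimal sketch of that staging/upsert pattern, using the sales table from the question (the staging table name, S3 path, and IAM role are placeholders, not from the original answer):

```sql
-- Stage the incoming batch in a temporary table with the same shape as sales
CREATE TEMP TABLE sales_staging (LIKE sales);

COPY sales_staging
FROM 's3://your-bucket/sales/'                                 -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'   -- placeholder role
CSV;

BEGIN;

-- Remove main-table rows that the incoming batch will replace
DELETE FROM sales
USING sales_staging
WHERE sales.salesid = sales_staging.salesid;

-- Insert the batch, keeping exactly one row per salesid
INSERT INTO sales
SELECT salesid, listid, sellerid
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY salesid ORDER BY listid) AS rn
    FROM sales_staging
) t
WHERE rn = 1;

COMMIT;
```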

HTH

Nick

AWS
Nick
answered a year ago
EXPERT reviewed 12 days ago
0

Yes, as was already mentioned, Redshift does not guarantee uniqueness. You have to control it on your own. Some time ago, this was a surprise for me as well :)

If you already have a lot of data in your table and some of it is duplicated, you can use window functions (PostgreSQL and, therefore, Redshift support that functionality) to number the duplicate records and remove the ones you do not need. You can just google it, or if you need a script, ping me; I think I have one, because I had the same issue with duplicate records in Redshift and had to remove them.
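For example, a quick way to see how many duplicates you have before cleaning them up (a sketch against the sales table from the question):

```sql
-- List every salesid that appears more than once
SELECT salesid, COUNT(*) AS occurrences
FROM sales
GROUP BY salesid
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
```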

Max
answered a year ago
0

In addition to the answer from Max, here's a stored procedure to log or fix records with duplicate primary keys:

https://github.com/awslabs/amazon-redshift-utils/blob/master/src/StoredProcedures/sp_check_primary_key.sql

AWS
answered a year ago
0

Duplicate rows in Amazon Redshift can occur for various reasons, such as data ingestion issues, application errors, or the lack of enforced primary key constraints. To deal with duplicate rows in Redshift, follow these best practices:

Deduplicate data before ingestion: If possible, clean and deduplicate your data before loading it into Redshift. This can be done using ETL (Extract, Transform, Load) tools or custom scripts that identify and remove duplicates before importing the data.

Use primary keys: Although Redshift doesn't enforce primary key constraints, defining primary keys in your table schema makes it easy to identify duplicate rows. You can use these primary keys in your deduplication queries.

Use window functions for deduplication: To remove duplicate rows from a table in Redshift, you can use window functions like ROW_NUMBER() in combination with a DELETE statement. Here's an example:

```sql
-- Number the rows within each group sharing the same key;
-- this assumes id uniquely identifies each physical row
WITH duplicates AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY primary_key_column1, primary_key_column2
               ORDER BY id
           ) AS row_num
    FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM duplicates WHERE row_num > 1);
```

Replace your_table with the table name and primary_key_column1, primary_key_column2 with the columns that together represent the primary key or unique identifier for the rows. This query removes all duplicate rows, keeping one record for each unique combination of the specified key columns.

Use CTAS (Create Table As Select) for large tables: If you need to deduplicate a large table, it can be more efficient to create a new table with deduplicated data and then replace the original table. Here's an example:

```sql
-- Create a new table with deduplicated data; list the real columns in the
-- outer SELECT so the helper row_num column is not carried into the new table
CREATE TABLE your_table_deduplicated AS
SELECT id, primary_key_column1, primary_key_column2  -- ...plus the remaining columns
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY primary_key_column1, primary_key_column2
               ORDER BY id
           ) AS row_num
    FROM your_table
) t
WHERE t.row_num = 1;

-- Replace the original table with the deduplicated table
BEGIN;
DROP TABLE your_table;
ALTER TABLE your_table_deduplicated RENAME TO your_table;
COMMIT;
```

Schedule regular deduplication: If your use case and data ingestion process are prone to creating duplicate rows, schedule periodic deduplication jobs to maintain data quality and query performance.

Monitor for duplicates: Set up monitoring and alerts to notify you when the number of duplicate rows in your tables exceeds a certain threshold. This helps you address data quality issues proactively.

https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

AWS
answered a year ago
