Recommend for designing tables in Redshift

0

While designing tables for Enterprise Data Warehouse (EDW), I found some instructions, for example https://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html. But, I concern about the table should be designed in one of two options: many columns or many rows , and then how can I evaluate which option is best, and also the reference link for this practice. The important expection is the query performance is the fastest, to ensure downstream systems can extract data from EDW as quickly as possible. I give the following example for the table design:

  1. Option 1: Many columns:

    account_numbermeasure1measure2measure3measure4measure5
    1100200300400500
  2. Option 2: Many rows:

    account_numbermeasure_typemeasure
    1'type1'100
    1'type2'200
    1'type3'300
    1'type4'400
    1'type5'500

And, 2 common methods of data extracting from EDW for downstream systems (like Data Mart on star-schema design) are as follows:

  1. Extract and transform data to push data to Detail Data Mart with many measure columns on each account numer.
  2. Extract and sum (total) on measure1 to calculate derrivative measure in Aggregated Data Mart. The expectation is to push data into the Aggregated Data Mart, for example Total Measure 1 by Segment.

Hope your help and Thanks in advanced.

Rafael
asked 8 months ago395 views
1 Answer
0

Hello Rafael,

I would say that you should never ever implement option 2 on Redshift.

Redshift is a columnar database and have no row indexes like OLTP DBs you might be used to. That means that it's hard for Redshift to find specific rows. On the other hand it stores data separately for each column, which makes it easy to access columns independently.

To speed up queries you can order your data with SORT KEYs. If you have no sort key in your table then you will be most of the time doing a full scan. If your query include JOINs you must also take in account DIST KEYs, which will tell Redshift how to distribute data across cluster's nodes and slices.

It's also important to choose the right DATA TYPE and COMPRESSION ENCODING for each column to achieve maximum performance.

Is this what you shared your real data ? Or there are more columns ? I could help you more if I had a full overview of your data schema.

Hope this gives you a start, feel free to comment if you have any doubts.

answered 8 months ago
  • Thanks for your idea. It's make sense to use the option 1 which Redshift is columar database.

    But, the data scheme just is an example, and in actual, there're more measure column than that, means that the table in option 1 will many more columns and also, the table in option 2 with have many value in 'balance_type' column. You can imagine that, in a banking business domain, there will be many types of balances of account (closing balance, in-due balance, past-due balance, interest balance, fee amount, ...), which can be added during the implementation process and future.

    I concern if using option 2, and set 'SORTKEY' for table in 'balance_type' column, and almose query behaviour using EDW data must filter in certain values in balance_type column. If so, could it be equivalent to option 1? Thanks.

  • I see that you are trying to overcome the challenges of schema evolution on a relational database by having this approach. If you do what you are proposing you lose type enforcement for each column, which means that if you have different data types you would need to choose the most "permissive" what means that if you have a string this column will have to be VARCHAR no matter what and every operation will have to cast data types based on measure_type to achieve desired output. Not to mention that you would lose optimal compression for each type too.

  • Thanks for your comment. But in this case, all columns have same data type numeric(23,2) which represents account balance. I understand that, in case of different data types, the design should be kept separate columns. Please help me understand in case the columns have the same data type, and I use sortkey to improve performance.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions