Formatting CSV table data before upload to Redshift


Pretty basic newbie Redshift question here. We want to upload historical test data to a Redshift database. The data as we have it is in multiple CSVs, formatted in a typical wide table layout with one column per variable, like this:

| Time | Var1 | Var2 | Var3 |
|------|------|------|------|
| 0    | 2    | 3    | 4    |
| 1    | 4    | 5    | 6    |
| 2    | 6    | 7    | 8    |

In the past, for other projects, we've imported data to Redshift using COPY from S3, and the data was already formatted like below, which I've read is more efficient for querying in Redshift:

| Time | Var  | Value |
|------|------|-------|
| 0    | Var1 | 2     |
| 0    | Var2 | 3     |
| 0    | Var3 | 4     |
| 1    | Var1 | 4     |
| 1    | Var2 | 5     |
| 1    | Var3 | 6     |
| 2    | Var1 | 6     |
| 2    | Var2 | 7     |
| 2    | Var3 | 8     |

For this new data, do we need to transform it to match that key-value layout before uploading to Redshift? Are there situations where this reformatting is not necessary?

  • I would experiment both ways. That'd be a fun exercise regardless! Plus, would readability be a factor in deciding how you'd store the data?

2 Answers

The idea of a columnar-storage database like Redshift for OLAP is that you can have tables with many columns and still efficiently query just a small subset of those columns across many rows. So I'd say there's no need to reformat as you're suggesting. However, I don't have broad enough experience in this area to say it's NEVER a good idea; you may have special requirements.
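For illustration, a minimal sketch of loading the wide CSVs as-is — the table name, column types, S3 path, and IAM role ARN are all placeholders, and the time column is renamed since TIME is a reserved word in Redshift:

```sql
-- Hypothetical table matching the wide CSV layout; adjust names/types to your data.
CREATE TABLE test_data_wide (
    time_s INTEGER,
    var1   DOUBLE PRECISION,
    var2   DOUBLE PRECISION,
    var3   DOUBLE PRECISION
);

-- Load every CSV under the prefix straight from S3, skipping the header row.
COPY test_data_wide
FROM 's3://your-bucket/historical-tests/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
```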

Expert
answered a year ago

The CSV file columns should match the table columns. From your example, the second file seems like a pivot of the first file. Since the layouts differ, you can create a second staging table for the second file, and after loading the data with the COPY command you can use PIVOT/UNPIVOT to transform it into the first layout.
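For instance, if you want the new wide files to end up in the long (Time/Var/Value) layout you used before, one way this could look is below; all object names are placeholders:

```sql
-- Staging table whose columns match the wide CSV exactly.
CREATE TABLE staging_wide (
    time_s INTEGER,
    var1   DOUBLE PRECISION,
    var2   DOUBLE PRECISION,
    var3   DOUBLE PRECISION
);

COPY staging_wide
FROM 's3://your-bucket/new-tests/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- UNPIVOT emits one (time, var, value) row per variable column.
INSERT INTO test_data_long (time_s, var, value)
SELECT time_s, var, value
FROM staging_wide
UNPIVOT (value FOR var IN (var1, var2, var3));
```

Truncating the staging table between loads keeps each run from re-transforming old files.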

If the two files are independent, you can leave them as two separate tables. When running queries that join the two tables, you can use PIVOT/UNPIVOT to transform at query execution time.
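A query-time UNPIVOT joining a wide table against a long one might look like this sketch (table and column names are assumptions):

```sql
-- Unpivot the wide table on the fly, then join on (time, variable name).
SELECT w.time_s, w.var, w.value, l.value AS prior_value
FROM (
    SELECT time_s, var, value
    FROM test_data_wide
    UNPIVOT (value FOR var IN (var1, var2, var3))
) AS w
JOIN test_data_long AS l
  ON l.time_s = w.time_s
 AND l.var = w.var;  -- watch case: UNPIVOT yields the column names as strings
```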

Having data as key-value pairs gives you the flexibility of a varying schema, but when running queries you often need to transform it into a tabular column format so it is easy to join with other tables. If your originating system provides data as key-value pairs, use a staging schema that matches the source file format, which will give you fast performance for write queries, i.e., data ingestion. Then apply your transformations in an intermediate or target schema, which will give you fast join performance for your read queries.
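As a sketch of that read-side transform (names and the aggregate choice are assumptions; Redshift's PIVOT requires an aggregate, and MAX works when each time/variable pair is unique):

```sql
-- Pivot the key-value rows back into one column per variable for easy joins.
SELECT *
FROM (SELECT time_s, var, value FROM test_data_long)
PIVOT (MAX(value) FOR var IN ('var1' AS var1, 'var2' AS var2, 'var3' AS var3));
```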

AWS
answered a year ago
