Formatting CSV table data before upload to Redshift


Pretty basic newbie Redshift question here. We want to upload historical test data to a Redshift database. The data as we have it is in multiple CSVs, formatted in a typical wide table layout with one column per variable, like this:

| Time | Var1 | Var2 | Var3 |
|------|------|------|------|
| 0    | 2    | 3    | 4    |
| 1    | 4    | 5    | 6    |
| 2    | 6    | 7    | 8    |

In the past, for other projects, we've imported data to Redshift using COPY from S3, and the data was already formatted like below, which I've read is more efficient for querying in Redshift:

| Time | Var  | Value |
|------|------|-------|
| 0    | Var1 | 2     |
| 0    | Var2 | 3     |
| 0    | Var3 | 4     |
| 1    | Var1 | 4     |
| 1    | Var2 | 5     |
| 1    | Var3 | 6     |
| 2    | Var1 | 6     |
| 2    | Var2 | 7     |
| 2    | Var3 | 8     |

For this new data, do we need to transform it to match that key-value layout before uploading to Redshift? Are there situations where this reformatting is not necessary?

  • I would experiment both ways. That'd be a fun exercise regardless! Plus, would readability be a factor in deciding how you'd store the data?

2 Answers

The idea of a columnar-storage database like Redshift for OLAP is that you can have tables with many columns and still efficiently query just a small subset of those columns across many rows. So I'd say there's no need to reformat as you're suggesting. However, I don't have broad enough experience in this area to say it's NEVER a good idea; you may have special requirements.
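For illustration, a minimal sketch of loading the wide CSVs as-is — the table name, column types, S3 path, and IAM role ARN are all placeholders, and the time column is renamed since TIME is a reserved word in Redshift:

```sql
-- Hypothetical table matching the wide CSV layout; adjust names/types to your data.
CREATE TABLE test_data_wide (
    time_s INTEGER,
    var1   DOUBLE PRECISION,
    var2   DOUBLE PRECISION,
    var3   DOUBLE PRECISION
);

-- Load every CSV under the prefix straight from S3, skipping the header row.
COPY test_data_wide
FROM 's3://your-bucket/historical-tests/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
```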

Expert
answered a year ago

The CSV file columns should match the table columns. From your example, the second file seems like a pivot of the first file. Since the layouts differ, you can create a second staging table for the second file, and after loading the data with the COPY command you can use PIVOT/UNPIVOT to transform it into the first layout.
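For instance, if you want the new wide files to end up in the long (Time/Var/Value) layout you used before, one way this could look is below; all object names are placeholders:

```sql
-- Staging table whose columns match the wide CSV exactly.
CREATE TABLE staging_wide (
    time_s INTEGER,
    var1   DOUBLE PRECISION,
    var2   DOUBLE PRECISION,
    var3   DOUBLE PRECISION
);

COPY staging_wide
FROM 's3://your-bucket/new-tests/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- UNPIVOT emits one (time, var, value) row per variable column.
INSERT INTO test_data_long (time_s, var, value)
SELECT time_s, var, value
FROM staging_wide
UNPIVOT (value FOR var IN (var1, var2, var3));
```

Truncating the staging table between loads keeps each run from re-transforming old files.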

If the two files are independent, you can leave them as two separate tables. When running queries that join the two tables, you can use PIVOT/UNPIVOT to transform at query execution time.
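A query-time UNPIVOT joining a wide table against a long one might look like this sketch (table and column names are assumptions):

```sql
-- Unpivot the wide table on the fly, then join on (time, variable name).
SELECT w.time_s, w.var, w.value, l.value AS prior_value
FROM (
    SELECT time_s, var, value
    FROM test_data_wide
    UNPIVOT (value FOR var IN (var1, var2, var3))
) AS w
JOIN test_data_long AS l
  ON l.time_s = w.time_s
 AND l.var = w.var;  -- watch case: UNPIVOT yields the column names as strings
```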

Having data as key-value pairs gives you the flexibility of a varying schema, but when running queries you often need to transform it into a tabular column format so it is easy to join with other tables. If your originating system provides data as key-value pairs, use a staging schema that matches the source file format, which will give you fast performance for write queries, i.e., data ingestion. Then apply your transformations in an intermediate or target schema, which will give you fast join performance for your read queries.
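As a sketch of that read-side transform (names and the aggregate choice are assumptions; Redshift's PIVOT requires an aggregate, and MAX works when each time/variable pair is unique):

```sql
-- Pivot the key-value rows back into one column per variable for easy joins.
SELECT *
FROM (SELECT time_s, var, value FROM test_data_long)
PIVOT (MAX(value) FOR var IN ('var1' AS var1, 'var2' AS var2, 'var3' AS var3));
```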

AWS
answered a year ago
