Hi,
No, not directly. Spark parallelises the processing of your DataFrame through partitioning, and each partition is written out as a separate CSV file. In theory, if you force your DataFrame to use only n partitions, you can "control" the file size; however, this is not recommended, because repartitioning is a relatively expensive operation. One way to control Spark partitioning is to force a repartition(), which triggers a full reshuffle of the data. Another way is coalesce(), which can reduce (but not increase) the number of partitions without a full reshuffle (see the sketch below). For your problem, though, I wouldn't use either option. Instead, I would merge those files at a later stage, after Spark has finished processing. Directly from our documentation: "One remedy to solve your small file problem is to use the S3DistCP utility on Amazon EMR. You can use it to combine smaller files into larger objects. You can also use S3DistCP to move large amounts of data in an optimized fashion from HDFS to Amazon S3, Amazon S3 to Amazon S3, and Amazon S3 to HDFS."
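A minimal PySpark sketch of both partitioning approaches; the input path, output paths, and partition counts are hypothetical placeholders you would adapt to your job:

```python
from pyspark.sql import SparkSession

# Hypothetical session and input -- substitute your own job's DataFrame.
spark = SparkSession.builder.appName("csv-output-sizing").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")

# repartition(n) lets you pick any partition count, at the cost of a full shuffle.
df.repartition(16).write.mode("overwrite").csv("s3://my-bucket/output-repartitioned/")

# coalesce(n) can only reduce the partition count, but avoids a full shuffle.
df.coalesce(4).write.mode("overwrite").csv("s3://my-bucket/output-coalesced/")
```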
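And for the merge-after-the-fact route, a hedged sketch of submitting an S3DistCp step to an existing EMR cluster with boto3; the cluster ID, bucket paths, and --groupBy pattern are assumptions, not values from the thread:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster ID and S3 paths -- replace with your own.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "Merge small CSV files with S3DistCp",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-bucket/output-coalesced/",
                "--dest", "s3://my-bucket/merged/",
                # Assumed pattern: concatenate all objects whose key contains "csv".
                "--groupBy", ".*(csv)",
                # Aim for merged files of roughly 128 MiB each.
                "--targetSize", "128",
            ],
        },
    }],
)
```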
It is not an optimal file size for Athena; you are right about the 128 MB/256 MB range. Please have a look at the following links regarding Athena and Redshift Spectrum optimisations.