Hi Kevin,
If you are using the visual editor, there's no out-of-the-box transform to order your data, so you must create your own. The simplest way is with a custom SQL Query transform.
Click (+) to add a node, select "SQL Query", then just write the very simple query "SELECT * FROM .... ORDER BY <column>".
This generates the following script, which runs a Spark SQL query:
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)


# Script generated for node SQL Query
SqlQuery0 = """
select * from myDataSource order by <myDataSourceColumn>
"""
SQLQuery_node1692843137953 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": ChangeSchema_node2},
    transformation_ctx="SQLQuery_node1692843137953",
)
Additionally, if you are writing your own script, you can convert your DynamicFrame to a Spark DataFrame and then sort the data using the Spark API [1]:
sorted_df = myframe.toDF().orderBy(["mycolumn"])
Hope this helps; if you have further questions, please let me know.
Reference: [1] https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.orderBy.html
i = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": [input_loc],
        "recurse": True,
        "compressionType": "gzip",
        "groupFiles": "inPartition",
        "groupSize": "104857600",
    },
    format="json",
)
I plan on loading the JSON with the group settings above ^
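As an aside on those settings: Glue's `groupSize` option is a byte count, and the value used above works out to exactly 100 MiB:

```python
# AWS Glue's "groupSize" connection option is specified in bytes;
# 104857600 bytes is exactly 100 MiB.
group_size_bytes = 100 * 1024 * 1024
print(group_size_bytes)  # 104857600
```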
If I sort by a column after converting the DynamicFrame to a DataFrame:
sorted_df = i.toDF().orderBy(["col"])
Then write it out to Parquet, will each Parquet file be sorted by that column only within the file? I would instead like the column to be sorted "across" the Parquet files, if that makes sense.
Something like "z-ordering"?
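For what it's worth, Spark's `orderBy` is a global sort (it range-partitions the data before sorting within each partition), so the property being asked for falls out of it. A plain-Python sketch of that property, not Spark code: after a global sort, splitting the rows into successive chunks ("files") gives non-overlapping, ordered key ranges, i.e. the column is sorted across the files, not just within each one:

```python
# Conceptual sketch (plain Python, not Spark): a global sort followed by a
# sequential split into chunks ("files") yields disjoint, ordered ranges.
rows = [9, 2, 7, 1, 8, 3, 6, 0, 5, 4]
sorted_rows = sorted(rows)                       # global sort, like df.orderBy("col")
chunk = 4
files = [sorted_rows[i:i + chunk] for i in range(0, len(sorted_rows), chunk)]
print(files)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Every value in file N is <= every value in file N+1: sorted "across" files.
for a, b in zip(files, files[1:]):
    assert max(a) <= min(b)
```

True z-ordering (interleaving bits of several columns into one sort key) is a different, multi-dimensional layout; for a single column, a global sort like the above is what gives cross-file ordering.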
Are you using the visual editor, or do you have a script?