This post is part of my series on Spark performance tuning on AWS Glue:
Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)
Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)
Glue DynamicFrame vs Spark DataFrame
Let's compare them using the Parquet files I created in part 1.
Data Read Speed Comparison
We will read a single large Parquet file and a highly partitioned Parquet file.
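# timer is assumed to be the timing context manager from part 1;
# glueContext is the job's GlueContext.
with timer('df'):
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {
            "paths": [
                "s3://.../parquet-chunk-high/"
            ]
        },
        "parquet",
    )
    # count() forces the read, so it must run inside the timed block
    print(dyf.count())
with timer('df partition'):
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {
            "paths": [
                "s3://.../parquet-partition-high/"
            ]
        },
        "parquet",
    )
    print(dyf.count())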
324917265
[df] done in 125.9965 s
324917265
[df partition] done in 55.9798 s
DynamicFrame is too slow...
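The timer helper used in the snippets above is not defined in this post. For readers who want to reproduce the measurements, here is a minimal sketch of what such a helper might look like, assuming it is a context manager that prints a line in the same "[name] done in N s" format as the output above. The names format_elapsed and sink are hypothetical and only there to keep the sketch easy to test; the real helper from part 1 may differ.

```python
import time
from contextlib import contextmanager


def format_elapsed(name, seconds):
    """Build a '[name] done in N.NNNN s' line like the ones in the output above."""
    return f"[{name}] done in {seconds:.4f} s"


@contextmanager
def timer(name, sink=print):
    """Time the enclosed block and report the elapsed wall-clock time.

    Hypothetical reconstruction; `sink` is an assumption added so the
    message can be captured instead of printed.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the block raises, so failed reads are still timed.
        sink(format_elapsed(name, time.perf_counter() - start))
```

Because Spark evaluates lazily, the action that triggers the read (count() in the snippets above) has to sit inside the with block, otherwise only the time needed to build the plan is measured.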
Summary
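- In part 1 (Reading Speed Comparison), spark.read took 27.1 s for the single large file and 36.3 s for the highly partitioned file. DynamicFrame took 126.0 s and 56.0 s for the same data, so it is about 4.6 times and 1.5 times slower, respectively.
- Interestingly, DynamicFrame read the highly partitioned data faster than the single large Parquet file, which is the opposite of what spark.read showed.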
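To put a number on "slow", the ratios can be worked out directly from the timings. The sketch below uses only the figures quoted in this post (spark.read from part 1, DynamicFrame from the run above); the timings dict and the slowdown function are illustrative names, not part of any Glue or Spark API.

```python
# Timings in seconds: spark.read figures come from part 1,
# DynamicFrame figures from the run in this post.
timings = {
    "single large file": {"spark.read": 27.1, "DynamicFrame": 125.9965},
    "highly partitioned": {"spark.read": 36.3, "DynamicFrame": 55.9798},
}


def slowdown(case):
    """Return how many times slower DynamicFrame was than spark.read."""
    t = timings[case]
    return t["DynamicFrame"] / t["spark.read"]


for case in timings:
    print(f"{case}: DynamicFrame is {slowdown(case):.2f}x slower")
# single large file: DynamicFrame is 4.65x slower
# highly partitioned: DynamicFrame is 1.54x slower
```

These are single runs, so the ratios should be read as rough indications rather than precise benchmarks.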