Situation:
When I only use partition clause, there are so many files in S3 bucket which is <1MB, this affect to the query speed and I want to make those become a bigger file.
Solution:
Solution 1: Use Athena "bucketing" method to custom the number of output file.
You can see this AWS blog for more information:
How can I set the number or size of files when I run a CTAS query in Athena?
However, there is one drawback if you use bucketing: Bucketed table do not support INSERT INTO query. Here comes the solution 2.
Solution 2: Use Glue repartition
The context is the same but now I want to use INSERT INTO query.
You can refer to this AWS blog for the procedure:
Build a Data Lake Foundation with AWS Glue and Amazon S3
Note that in the step numbered "13. View the job", we add the following code into the job:
datasource_df = dropnullfields3.repartition(<number of output file you want here>)
right after the line:
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
and edit the code:
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
to:
datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
If you want to know more about Glue repartition:
Try querying with Athena
Create table:
CREATE EXTERNAL TABLE IF NOT EXISTS demo_query (
dispatching_base_num string,
pickup_date string,
locationid bigint)
STORED AS PARQUET
LOCATION 's3://athena-examples/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
Try to insert:
insert into demo_query ("dispatching_base_num", "pickup_date", "locationid") values ('aa23dtgt', '2020-12-03', 1234);
The insert query now should work. Success!
Top comments (1)
Industrial underfill epoxy adhesive glue
, circuit board-level adhesives, and adhesives for electronic goods have all been created by DeepMaterial.