DEV Community

Yen Trinh
Yen Trinh

Posted on

Glue – Athena custom output fixed number Of Files

Situation:

When I only use partition clause, there are so many files in S3 bucket which is <1MB, this affect to the query speed and I want to make those become a bigger file.

Solution:

Solution 1: Use Athena "bucketing" method to custom the number of output file.

You can see this AWS blog for more information:
How can I set the number or size of files when I run a CTAS query in Athena?

However, there is one drawback if you use bucketing: Bucketed table do not support INSERT INTO query. Here comes the solution 2.

Solution 2: Use Glue repartition

The context is the same but now I want to use INSERT INTO query.

You can refer to this AWS blog for the procedure:
Build a Data Lake Foundation with AWS Glue and Amazon S3

Note that in the step numbered "13. View the job", we add the following code into the job:

datasource_df = dropnullfields3.repartition(<number of output file you want here>)
Enter fullscreen mode Exit fullscreen mode

right after the line:

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
Enter fullscreen mode Exit fullscreen mode

and edit the code:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
Enter fullscreen mode Exit fullscreen mode

to:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
Enter fullscreen mode Exit fullscreen mode

If you want to know more about Glue repartition:

Try querying with Athena
Create table:

CREATE EXTERNAL TABLE IF NOT EXISTS demo_query (
  dispatching_base_num string,
  pickup_date string,
  locationid bigint)
STORED AS PARQUET
LOCATION 's3://athena-examples/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
Enter fullscreen mode Exit fullscreen mode

Try to insert:

insert into demo_query ("dispatching_base_num", "pickup_date", "locationid") values ('aa23dtgt', '2020-12-03', 1234);
Enter fullscreen mode Exit fullscreen mode

The insert query now should work. Success!

Oldest comments (1)

Collapse
 
sasuke002a profile image
Sasuke Shelby

Industrial underfill epoxy adhesive glue
, circuit board-level adhesives, and adhesives for electronic goods have all been created by DeepMaterial.