DEV Community

Uri Shaked
Uri Shaked

Posted on

Exporting Google Cloud Storage File List to BigQuery (CSV)

Sometimes you need to run complex queries against the list of files storage in your Google Cloud Storage. In my case, I wanted find out which files were no longer referenced by my database and clean them up.

Here is how I did it:

Exporting the File List to a CSV file

This simple 1-line script will allow you to the entire file list from Google Cloud Storage to a simple CSV format that can be then loaded into Google BigQuery:

gsutil ls -l gs://bucket-name/** | head -n-1 | awk 'BEGIN{print "Size,Modified,Path"}{print $1","$2",\""$3"\""}' > filelist.csv

Make sure to replace bucket-name with your actual GCS bucket name.

Loading the CSV file to BigQuery

Before you can load the CSV File to Google's BigQuery, you first need to upload it to Google Cloud storage. You can do it by running:

gsutil cp filelist.csv gs://some-bucket/filelist.csv

Then, you can load it right into bigquery:

bq load --autodetect --source_format=CSV mydataset.mytable gs://some-bucket/filelist.csv

Make sure you change some-bucket to your actual bucket name, and mydataset and mytable to the target BigQuery dataset and table, respectively.

I tested this process for a bucket containing about 350,000 files and it worked flawlessly.

Top comments (0)