
Discussion on: Beyond CSV files: using Apache Parquet columnar files with Dask to reduce storage and increase performance. Try it now!

Paddy3118

If you are worried about space, wouldn't you work from a gzipped CSV file? I wonder what the size of the CSV file is when gzipped?

zom-pro Author

I believe Dask doesn't support reading from zip. See here: github.com/dask/dask/issues/2554#i....

Looking around, Dask does seem to be able to read a single gzipped file, but it doesn't look straightforward. If you need to decompress the file first, any storage gains are nullified.
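To illustrate what I mean, here is a minimal sketch of reading a single gzipped CSV with Dask (the file name is made up): because gzip isn't splittable, you have to pass blocksize=None, so the whole file ends up in one partition and you lose most of the parallel reads.

```python
import dask.dataframe as dd

# Hypothetical file name. gzip isn't a splittable format, so Dask requires
# blocksize=None, which puts all the data into a single partition.
df = dd.read_csv("data.csv.gz", compression="gzip", blocksize=None)

print(df.npartitions)  # 1 -- no parallelism across the file
```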

I would be very interested in trying it if you know a way, and comparing size and performance. Normally storage is cheap, at least a lot cheaper than other resources, so performance is in most cases the priority and compression is a nice-to-have (it depends on the use case, of course).

Paddy3118

Sadly I don't use Dask, but in the past I have used zcat on the Linux command line to stream a CSV to stdin for a script to process, without needing the whole of the data uncompressed in memory or on disk.
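As a rough sketch of what I mean (the file and column names are made up), the script on the receiving end of zcat just reads rows from stdin, e.g. `zcat big.csv.gz | python sum_column.py`:

```python
import csv
import sys

# Read CSV rows streamed from stdin and aggregate them without ever
# holding the full uncompressed file in memory or on disk.
total = 0.0
reader = csv.DictReader(sys.stdin)
for row in reader:
    total += float(row["amount"])  # "amount" is a hypothetical column

print(f"total amount: {total}")
```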

zom-pro Author

Cool, I can totally see a use case for that, streaming into something like Apache Kafka. I will prototype a couple of things and see if it can become another little article. Thanks!