
Discussion on: Beyond CSV files: using Apache Parquet columnar files with Dask to reduce storage and increase performance. Try it now!

Paddy3118

If you are worried about space, wouldn't you work from a gzipped CSV file? I wonder what the size of the CSV file would be when gzipped with -9?

Jorge PM • Edited

I believe Dask doesn't support reading from zip. See here: github.com/dask/dask/issues/2554#i....

Looking around, Dask does seem to be able to read a single gzipped file, but it doesn't look straightforward. And if you need to unzip the file first, any gains in storage are nullified.
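For reference, here is a minimal sketch of what reading a single gzipped CSV with Dask might look like (the file name is just an example, not from this thread):

```python
# Minimal sketch, assuming a hypothetical file "data.csv.gz".
import dask.dataframe as dd

# gzip is not a splittable format, so Dask cannot break the file into
# chunks; blocksize=None loads it as a single partition instead.
df = dd.read_csv("data.csv.gz", compression="gzip", blocksize=None)

print(df.head())
```

That single-partition limitation is part of why it isn't straightforward: you save disk space, but you give up the parallel, chunked reads you get with plain CSV or Parquet.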

I would be very interested in trying it and comparing size and performance if you know a way. Normally storage is cheap, or at least a lot cheaper than other resources, so performance is the priority in most cases and compression is a nice-to-have (it depends on the use case, of course).

Paddy3118 • Edited

Sadly I don't use Dask, but in the past I have used zcat on the Linux command line to stream a CSV to stdin for a script to process, without needing the whole of the data uncompressed in memory or on disk.
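A minimal sketch of that pattern (the file name and script name are hypothetical, not from this thread):

```python
# process_csv.py - consume a CSV streamed over stdin, e.g.:
#   zcat data.csv.gz | python process_csv.py
# The uncompressed data never has to exist on disk or fit in memory at once.
import csv
import sys

row_count = 0
for row in csv.DictReader(sys.stdin):
    # Replace this with real per-row processing.
    row_count += 1

print(f"processed {row_count} rows", file=sys.stderr)
```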

Jorge PM

Cool, I can totally see a use case for that, streaming into something like Apache Kafka. I will prototype a couple of things and see if it can become another little article. Thanks!