DEV Community

Discussion on: Are you still using Python 2?

 
Rob Hoelz

Sure - I can fork it and submit my changes there after work!

rhymes

Thank you!

Rob Hoelz

Thanks for offering to take a look; here's my fork with the Python 3 change: github.com/hoelzro/WikiCorpusExtra...

...and here's a gist using it: gist.github.com/hoelzro/80561443fe...

After digging in a little, I noticed that the bz2.BZ2File class is substantially slower in Python 3; using bz2.decompress or bz2.BZ2Decompressor is a lot faster! The former, of course, requires enough memory to hold both the compressed and uncompressed contents at once. The latter is more cumbersome to use, and one might reintroduce the same overhead that bz2.BZ2File has by implementing the line-splitting logic on top of it. I'm curious why BZ2File is slower in Python 3; maybe I'll have a chance to dig in further tomorrow!
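
For illustration, here is a rough sketch of what line splitting on top of bz2.BZ2Decompressor might look like. The names iter_lines and chunk_size are made up for this example and do not come from the fork or the gist, and it has not been benchmarked against bz2.BZ2File, so treat it as a starting point rather than a proven fix. It reads the compressed file in fixed-size chunks, so memory use stays bounded by the chunk size plus the longest line.

```python
import bz2


def iter_lines(fileobj, chunk_size=1 << 20):
    """Yield decompressed lines (as bytes) from a bz2-compressed binary file.

    Hypothetical helper: `fileobj` is any object with a binary read() method,
    and `chunk_size` is how many compressed bytes to read per iteration.
    """
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # trailing partial line carried over between chunks
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            pending += decompressor.decompress(chunk)
            if decompressor.eof:
                # A bz2 file may hold several concatenated streams; each one
                # needs a fresh decompressor, fed with the leftover bytes.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                chunk = b""
        # Everything before the last newline is complete; keep the rest.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line + b"\n"
    if pending:
        # The final line had no trailing newline.
        yield pending
```

Usage would be along the lines of opening the dump with open(path, "rb") and looping over iter_lines(f). Whether this actually beats bz2.BZ2File depends on how much time goes into the splitting itself, which is exactly the overhead concern mentioned above.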