Thanks for offering to take a look; here's my fork with the Python 3 change: github.com/hoelzro/WikiCorpusExtra...
...and here's a gist using it: gist.github.com/hoelzro/80561443fe...
After digging in a little bit, I noticed that the bz2.BZ2File class is substantially slower in Python 3; using bz2.decompress or bz2.BZ2Decompressor is a lot faster! The former, of course, requires enough memory to hold both the compressed and uncompressed contents, and the latter is more cumbersome to use: one might reintroduce the same overhead that bz2.BZ2File has by implementing the line-splitting logic on top of it. I'm curious why BZ2File is slower in Python 3 - maybe I'll have a chance to dig in further tomorrow!
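
For anyone who wants to try the bz2.BZ2Decompressor route, here is a minimal sketch of what line splitting on top of it could look like. The function name iter_bz2_lines and the chunk_size default are made up for illustration, not taken from the fork above. It yields lines as bytes and starts a fresh decompressor whenever one compressed stream ends, on the assumption that the input may contain several concatenated streams. No timings are claimed here; whether it actually beats BZ2File on a given dump would need measuring.

```python
import bz2


def iter_bz2_lines(fileobj, chunk_size=1 << 20):
    """Yield decompressed lines (bytes, trailing newline kept) from a binary file object.

    Hypothetical helper: reads compressed data in chunks, so only one chunk
    plus one partial line is held in memory at a time.
    """
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # tail of the last decompressed piece that has no newline yet

    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            if decompressor.eof:
                # Previous stream finished; a BZ2Decompressor cannot be reused.
                decompressor = bz2.BZ2Decompressor()
            data = decompressor.decompress(chunk)
            # Bytes after the end of a stream belong to the next stream.
            chunk = decompressor.unused_data if decompressor.eof else b""
            if not data:
                continue
            # Everything before the last b"\n" is complete; the rest waits.
            *complete, pending = (pending + data).split(b"\n")
            for line in complete:
                yield line + b"\n"

    if pending:
        # Final line without a trailing newline.
        yield pending
```

Usage would be something like `with open(path, "rb") as f: for line in iter_bz2_lines(f): ...`, where path points at the compressed dump. The lines come back as bytes, so decoding (for example line.decode("utf-8")) is left to the caller.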