So I did a little bit of profiling on a subset of my wikidump data - a quick glance showed that bz2 decompression was way worse on Python 3. When I tried on already-decompressed files, the Python 3 slowdown was only 20% rather than 100% - so that's progress! I'll probably take a deeper look tomorrow.
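A minimal sketch of the kind of comparison described above: iterating lines through bz2.BZ2File versus iterating the same data already decompressed. The sample data is generated in memory so the snippet is self-contained; the real measurement would of course use the wikidump file instead.

```python
# Sketch: compare line iteration through bz2.BZ2File vs. over
# already-decompressed bytes. The sample "dump" is synthetic.
import bz2
import io
import time

def make_sample(n_lines=10000):
    """Build an in-memory bz2 'dump' so the sketch is self-contained."""
    raw = b"".join(b"<page>line %d</page>\n" % i for i in range(n_lines))
    return raw, bz2.compress(raw)

def count_lines_bz2(compressed):
    """Iterate lines through bz2.BZ2File, as a dump reader would."""
    with bz2.BZ2File(io.BytesIO(compressed)) as f:
        return sum(1 for _ in f)

def count_lines_plain(raw):
    """Iterate lines over the decompressed bytes, for comparison."""
    return sum(1 for _ in io.BytesIO(raw))

raw, compressed = make_sample()
for name, fn, arg in [("bz2", count_lines_bz2, compressed),
                      ("plain", count_lines_plain, raw)]:
    start = time.perf_counter()
    n = fn(arg)
    print(name, n, "%.3fs" % (time.perf_counter() - start))
```

Running this under both interpreters (python2 vs. python3) is enough to see whether the gap shows up in BZ2File itself.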
Also, if you want to try this out, I'm using a dump file from the Russian Wikipedia (such as dumps.wikimedia.org/ruwiki/2018062...), and just extracting the list of documents via WikiXMLDumpFile(filename).getWikiDocuments() illustrates the difference in timing. You'll need to patch the code to behave with Python 3, though!
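A small timing harness one could wrap around that call. The WikiXMLDumpFile class comes from the WikiCorpusExtractor project and isn't stubbed here, so the runnable demo times a stand-in function instead; the real invocation is shown in a comment.

```python
# Hypothetical timing helper for the extraction call above.
import time

def timed(fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("%s took %.3fs" % (getattr(fn, "__name__", "call"), elapsed))
    return result

# Real usage (requires the patched library and a downloaded dump):
#   docs = timed(lambda: WikiXMLDumpFile(filename).getWikiDocuments())
# Stand-in demonstration:
result = timed(sorted, range(100000), reverse=True)
```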
After digging in a little bit, I noticed that the bz2.BZ2File class is substantially slower in Python 3; using bz2.decompress or bz2.BZ2Decompressor is a lot faster! The former, of course, requires enough memory to hold both the uncompressed and compressed contents, and the latter is more cumbersome to use - one might reintroduce the same overhead that bz2.BZ2File has when implementing the line-splitting logic on top of it. I'm curious why BZ2File is slower in Python 3 - maybe I'll have a chance to dig in further tomorrow!
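To make the trade-off concrete, here's a sketch of both alternatives. The one-shot version needs memory for the compressed and uncompressed data at once; the incremental version re-implements line splitting on top of bz2.BZ2Decompressor, which is exactly the cumbersome part mentioned above.

```python
import bz2

def lines_via_decompress(compressed):
    """One-shot: holds compressed + decompressed bytes in memory at once."""
    return bz2.decompress(compressed).splitlines(keepends=True)

def iter_lines_incremental(chunks):
    """Streaming: feed compressed chunks to BZ2Decompressor, split lines."""
    decomp = bz2.BZ2Decompressor()
    buf = b""
    for chunk in chunks:
        buf += decomp.decompress(chunk)
        # Everything before the last newline is a run of complete lines.
        *complete, buf = buf.split(b"\n")
        for line in complete:
            yield line + b"\n"
    if buf:  # trailing data without a final newline
        yield buf

data = b"alpha\nbeta\ngamma\n"
compressed = bz2.compress(data)
assert lines_via_decompress(compressed) == [b"alpha\n", b"beta\n", b"gamma\n"]
# Feed the compressed stream in small pieces, as one would from disk:
chunks = [compressed[i:i + 8] for i in range(0, len(compressed), 8)]
assert list(iter_lines_incremental(chunks)) == [b"alpha\n", b"beta\n", b"gamma\n"]
```

Whether this hand-rolled splitting stays faster than BZ2File's own buffering is exactly the open question, so it would need measuring on the real dump.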
Good idea, that's where you want to start for performance improvement anyway :-)
It seems you mostly use parts of the standard library that haven't changed with Python 3, so it should be easy to test with Python 3.6 as well.
Let me know if you decide to do it and what results you get!
I should clarify that I'm on Python 2.7.15 and 3.6.5 - I wonder if the newly released 3.7 would help?
Yeah, and if you can isolate the issue with a gist that I can take a look at I'm happy to do so!
I see nothing related to bz2/bzip2 on the Python 3.7 what's-new page: docs.python.org/3.7/whatsnew/3.7.html
Are you using Linux, macOS, or Windows?
Linux
Can you maybe just commit your branch for Python 3? You'll save me some work ;)
Sure - I can fork it and submit my changes there after work!
Thank you!
Thanks for offering to take a look; here's my fork with the Python 3 change: github.com/hoelzro/WikiCorpusExtra...
...and here's a gist using it: gist.github.com/hoelzro/80561443fe...