Discussion on: Are you still using Python 2?

View post

Replies for: It was a third-party wikipedia parsing library (github.com/joaoventura/WikiCorpusE...) - maybe I should profile it and see where the bottleneck is!

Good idea, that's where you want to start for performance improvement anyway :-)

It seems you mostly use part of the standard library that hasn't changed with Python 3, it should be easy to test with Python 3.6 as well.

Let me know if you decide to do it and what results you get!

Rob Hoelz • Jun 28 '18

So I did a little bit of profiling on a subset of my wikidump data - a quick glance showed that bz2 decompression was way worse on Python 3. When I tried on decompressed files, the Python 3 cost was only 20% rather than 100% - so that's progress! I'll probably take a deeper look tomorrow.

Rob Hoelz • Jun 28 '18

I should clarify that I'm on Python 2.7.15 and 3.6.5 - I wonder if the newly released 3.7 would help?

rhymes • Jun 28 '18

When I tried on decompressed files, the Python 3 cost was only 20% rather than 100% - so that's progress! I'll probably take a deeper look tomorrow.

Yeah, and if you can isolate the issue with a gist that I can take a look at I'm happy to do so!

I wonder if the newly released 3.7 would help?

I see nothing related to bz2/bzip2 in Python 3.7 what's new page: docs.python.org/3.7/whatsnew/3.7.html

Are you using Linux, MacOS or Windows?

Rob Hoelz • Jun 28 '18

Linux

Rob Hoelz • Jun 28 '18

Also, if you want to try this out, I'm using a dump file from the Russian Wikipedia (such as dumps.wikimedia.org/ruwiki/2018062...), and just extracting the list of documents via WikiXMLDumpFile(filename).getWikiDocuments() illustrates the difference in timing. You'll need to patch the code to behave with Python 3, though!

rhymes • Jun 28 '18

Can you maybe just commit your branch for Python 3? You'll save me some work ;)

Rob Hoelz • Jun 28 '18

Sure - I can fork it and submit my changes there after work!

rhymes • Jun 28 '18

Thank you!

Rob Hoelz • Jun 29 '18

Thanks for offering to take a look; here's my fork with the Python 3 change: github.com/hoelzro/WikiCorpusExtra...

...and here's a gist using it: gist.github.com/hoelzro/80561443fe...

After digging in a little bit, I noticed that the bz2.BZ2File class is substantially slower in Python 3; using bz2.decompress or bz2.BZ2Decompressor is a lot faster! The former, of course, requires enough memory to hold both the uncompressed and compressed contents, and trying to use the latter is a more cumbersome and one might reintroduce the same overhead that bz2.BZ2File has if one implements the line splitting logic on top of it. I'm curious why BZ2File is slower in Python 3 - maybe I'll have a chance to dig in further tomorrow!