DEV Community

loading...

Mashpack - Beating Messagepack at its own Game

Seth Michael Larson
Blogging about Python and the Internet (He / Him)
・6 min read

Mashpack is a JSON-object serialization format that is a 'dialect' of the popular format Messagepack. I read Messagepacks specification and noticed a few details that I thought could be improved so I challenged myself to 'beat' Messagepack. Here are my results:

Key Design Differences

  • Single-Byte Prefix Reordering
  • Typed Arrays
  • 8-bit Array Type
  • Extensions

Single-byte Header Data Types

Both Mashpack and Messagepack use bit-prefix tricks to pack certain small data types into a single byte while still allowing enough prefix space to support all the other data types we need to ensure an efficient use of space.

Both specifications use of single-byte data types matters quite a bit as it is the hope that most serialized objects will end up as one of the single-byte header data types due to their small header footprint so having a large and common range of values is invaluable.

Spec Single-byte Header Data Types
Mashpack MAPP, STRP, MARRAYP, INTP, NINTP
Messagepack positive fixint, fixstr, negative fixint, fixmap, fixarray

In Mashpack the string and map data types are prioritized with
the largest amount of space. Compare this to Messagepack which prioritizes
positive integers quite heavily and all other data types have smaller ranges as a result.

See the tables below comparing the two specs:

Data Type Range in Mashpack Range in Messagepack
MAPP/fixmap 0 to 63 key-value pairs 0 to 15 key-value pairs
STRP/fixstr 0 to 63 encoded bytes 0 to 31 encoded bytes
MARRAYP/fixarray 0 to 31 elements 0 to 15 elements
NINTP/negative fixint -1 to -32 -1 to -32
INTP/positive fixint 0 to 31 0 to 127

See all Mashpack's data types in one table for more information.

Typed and Mixed Arrays

Mashpack adds a new data type that isn't present in Messagepack called the 'typed array' or just ARRAY*. This in contrast to a 'mixed array' (MARRAY*) requires that all objects within the array be the same data type. Objects which have different headers but are in the same family can be 'upgraded' to a larger data type long as the resulting typed array is smaller than having a mixed array.

An example is a list of integers, one of which would be encoded into an INTP type and all others being converted into INT8 would be converted into an ARRAY8[INT8]. See below:

object = [1, 255, 255]
could be naively encoded into:
MARRAYP([INTP(1), INT8(255), INT8(255)]) = 7 bytes

but converting the INTP(1) into INT8(1) we can take
advantage of typed arrays to end up with this:

ARRAY8[INT8]([1, 255, 255]) = 6 bytes

Adding an 8-bit Array Type

Despite having an unused data type id (0xC1) Messagepack does not define
an 8-bit array type. This in addition to the smaller range of fixarray
being only 15 elements means that arrays with 16 elements in Messagepack
use 3 bytes of header rather than 1 in Mashpack using MARRAYP.

Larger Strings means Unicode Friendly

Unicode brought a lot of harmony to what used to be a really hard problem with universally rendering bytes into strings. One of the necessary evils of unicode is that not all characters are 8 bits meaning that every string that's not ASCII-only requires additional space to store. Mashpack being able to store over double the number of encoded bytes within a one-byte string helps with this problem.

Messagepack's Python implementation also seems to have had a strange relationship with its string data types due to Python 2. The current Mashpack Python implementation supports Python 3+ only and only allows UTF-8 encoding for its string type.

Larger Strings means URLs are Mostly Short Too!

A lot of APIs nowadays have helpful 'explorative' URLS to hint API users about additional content. This will almost always take the form of a complete URL which are typically quite long! Almost always longer than 31 bytes. Having a larger one-byte string helps solve this problem. See the compression performance comparison section.

Extensions

Mashpack defines three extension data type sizes (EXT8, EXT16, and EXT32) while Messagepack defines 8 (fixint 1, fixint 2, fixint 4, fixint 8, fixint 16, ext8, ext16, and ext32)

Extensions work very similarly in Messagepack and Mashpack in that they have an extension code and data section. The extension code is an unsigned integer between 0 and 255 that denotes what extension is being used. The difference between Messagepack and Mashpack's extensions is that the fixed extension type for Messagepack defines 'fixed' extensions that are exactly 1, 2, 4, 8, or 16 bytes long. This allows for 2-bytes of header for extensions with smaller data sections.

Mashpack only allows a 3-byte header minimum, so small data extensions suffer slightly storage-wise under Mashpack. When using a data section 17 bytes or longer extensions pack the same under Mashpack and Messagepack.

Fewer Prefix Tiers

Mashpack's prefix tiering (2x2->3x3->31x8) having only three layers compared to Messagepack's prefix tiering (1x1->2x3->2x4->31x8) means Mashpack will only use a maximum of 3 comparisons to determine a packed objects type compared to 4 for Messagepack.

Fast Unpacking of Positive Fixint

Messagepack's positive fixint type being prefixed as a prefix of 0xxxxxxx allows the integer to be read directly from the byte it resides in without masking.

Mashpack doesn't have this property but only requires one bit-mask operation to make up for it. This difference was a necessary casualty in ensuring better bit prefixes and ranges for other objects.

Both Mashpack and Messagepack's NINTP and negative fixint have this property for negative numbers with a prefix of 111xxxxx.

Performance

Compression Differences

Here's a few small, medium, and large objects for comparing compression between Mashpack and Messagepack. Lower is better.

Object Mashpack Size Messagepack Size Difference
127 2 1 +1 (+50%)
[1] * 16 17 19 -2 (-10.5%)
URL 46 48 -2 (-4.1%)
[127] * 100 103 101 +2 (+1.9%)
[255] * 100 103 203 -100 (-49.2%)
GitHub Push Event 6091 6192 -101 (-1.6%)

Speed Differences

Here's a few small and large objects that were processed 10,000 times to test differences in packing and unpacking speed between Mashpack and Messagepack. Lower is better.

NOTE: These speed comparisons were done on an Intel i5-6400 4x2.7GHz with 16 GB RAM and using the Python 'fallback' version of Messagepack as there is no Cython implementation of Mashpack yet. More benchmarks will be coming to compare Cython implementations later.

Packing Objects

Object Mashpack Time Messagepack Time Difference
[1, 1, 1, 1, 1] 0.22 seconds 0.31 seconds -0.09 seconds
GitHub Push Event 42.9 seconds 59.4 seconds -16.5 seconds

Unpacking Objects

Object Mashpack Time Messagepack Time Difference
[1, 1, 1, 1, 1] 0.15 seconds 0.15 seconds 0 seconds
GitHub Push Event 8.39 seconds 9.72 seconds -1.33 seconds

How can I Help or Contribute?

Mashpack is a brand new specification! Its got plenty left to do before it's cemented into a standard. This is always the most fun part of projects, so I want to share that with everyone who's interested. :)

Everyone that contributes to the Mashpack repository will receive Collaborator for the repository as I would like this to be a project for the community. All changes require code review by one other Collaborator and any Code Owners.

New to Open Source or Python?

There are some issues that I've marked with good first issue that I'm willing to put extra time, effort, and resources into helping you through. If you want to tackle one of these issues but feel the need for additional support you can message me on Twitter to ask questions. :-)

Ready to Tackle Tougher Issues?

There are a lot of issues to contribute to in this category all marked with the help wanted label. I'll get around to these eventually but if you'd like to contribute and get your name out there this is a great place to start.

Are you familiar or interested in learning Cython?

As stated above there is no Cython implementation of the Mashpack Packer or Unpacker objects. Having a Cython implementation would increase the speed of the operations ten-fold and are necessary to be a good competitor to Messagepack. I've marked all issues related to Cython with a label.

Discussion (0)