Mashpack is a JSON-object serialization format that is a 'dialect' of the popular format Messagepack. I read Messagepacks specification and noticed a few details that I thought could be improved so I challenged myself to 'beat' Messagepack. Here are my results:
- Single-Byte Prefix Reordering
- Typed Arrays
- 8-bit Array Type
Both Mashpack and Messagepack use bit-prefix tricks to pack certain small data types into a single byte while still allowing enough prefix space to support all the other data types we need to ensure an efficient use of space.
Both specifications use of single-byte data types matters quite a bit as it is the hope that most serialized objects will end up as one of the single-byte header data types due to their small header footprint so having a large and common range of values is invaluable.
|Spec||Single-byte Header Data Types|
In Mashpack the string and map data types are prioritized with
the largest amount of space. Compare this to Messagepack which prioritizes
positive integers quite heavily and all other data types have smaller ranges as a result.
See the tables below comparing the two specs:
|Data Type||Range in Mashpack||Range in Messagepack|
||0 to 63 key-value pairs||0 to 15 key-value pairs|
||0 to 63 encoded bytes||0 to 31 encoded bytes|
||0 to 31 elements||0 to 15 elements|
||-1 to -32||-1 to -32|
||0 to 31||0 to 127|
See all Mashpack's data types in one table for more information.
Mashpack adds a new data type that isn't present in Messagepack called the 'typed array' or just
ARRAY*. This in contrast to a 'mixed array' (
MARRAY*) requires that all objects within the array be the same data type. Objects which have different headers but are in the same family can be 'upgraded' to a larger data type long as the resulting typed array is smaller than having a mixed array.
An example is a list of integers, one of which would be encoded into an
INTP type and all others being converted into
INT8 would be converted into an
ARRAY8[INT8]. See below:
object = [1, 255, 255] could be naively encoded into: MARRAYP([INTP(1), INT8(255), INT8(255)]) = 7 bytes but converting the INTP(1) into INT8(1) we can take advantage of typed arrays to end up with this: ARRAY8[INT8]([1, 255, 255]) = 6 bytes
Despite having an unused data type id (
0xC1) Messagepack does not define
an 8-bit array type. This in addition to the smaller range of fixarray
being only 15 elements means that arrays with 16 elements in Messagepack
use 3 bytes of header rather than 1 in Mashpack using
Unicode brought a lot of harmony to what used to be a really hard problem with universally rendering bytes into strings. One of the necessary evils of unicode is that not all characters are 8 bits meaning that every string that's not ASCII-only requires additional space to store. Mashpack being able to store over double the number of encoded bytes within a one-byte string helps with this problem.
Messagepack's Python implementation also seems to have had a strange relationship with its string data types due to Python 2. The current Mashpack Python implementation supports Python 3+ only and only allows UTF-8 encoding for its string type.
A lot of APIs nowadays have helpful 'explorative' URLS to hint API users about additional content. This will almost always take the form of a complete URL which are typically quite long! Almost always longer than 31 bytes. Having a larger one-byte string helps solve this problem. See the compression performance comparison section.
Mashpack defines three extension data type sizes (
EXT32) while Messagepack defines 8 (
Extensions work very similarly in Messagepack and Mashpack in that they have an extension code and data section. The extension code is an unsigned integer between 0 and 255 that denotes what extension is being used. The difference between Messagepack and Mashpack's extensions is that the fixed extension type for Messagepack defines 'fixed' extensions that are exactly 1, 2, 4, 8, or 16 bytes long. This allows for 2-bytes of header for extensions with smaller data sections.
Mashpack only allows a 3-byte header minimum, so small data extensions suffer slightly storage-wise under Mashpack. When using a data section 17 bytes or longer extensions pack the same under Mashpack and Messagepack.
Mashpack's prefix tiering (2x2->3x3->31x8) having only three layers compared to Messagepack's prefix tiering (1x1->2x3->2x4->31x8) means Mashpack will only use a maximum of 3 comparisons to determine a packed objects type compared to 4 for Messagepack.
Messagepack's positive fixint type being prefixed as a prefix of
0xxxxxxx allows the integer to be read directly from the byte it resides in without masking.
Mashpack doesn't have this property but only requires one bit-mask operation to make up for it. This difference was a necessary casualty in ensuring better bit prefixes and ranges for other objects.
Both Mashpack and Messagepack's
NINTP and negative fixint have this property for negative numbers with a prefix of
Here's a few small, medium, and large objects for comparing compression between Mashpack and Messagepack. Lower is better.
|Object||Mashpack Size||Messagepack Size||Difference|
| * 16||17||19||-2 (-10.5%)|
| * 100||103||101||+2 (+1.9%)|
| * 100||103||203||-100 (-49.2%)|
|GitHub Push Event||6091||6192||-101 (-1.6%)|
Here's a few small and large objects that were processed 10,000 times to test differences in packing and unpacking speed between Mashpack and Messagepack. Lower is better.
NOTE: These speed comparisons were done on an Intel i5-6400 4x2.7GHz with 16 GB RAM and using the Python 'fallback' version of Messagepack as there is no Cython implementation of Mashpack yet. More benchmarks will be coming to compare Cython implementations later.
|Object||Mashpack Time||Messagepack Time||Difference|
|[1, 1, 1, 1, 1]||0.22 seconds||0.31 seconds||-0.09 seconds|
|GitHub Push Event||42.9 seconds||59.4 seconds||-16.5 seconds|
|Object||Mashpack Time||Messagepack Time||Difference|
|[1, 1, 1, 1, 1]||0.15 seconds||0.15 seconds||0 seconds|
|GitHub Push Event||8.39 seconds||9.72 seconds||-1.33 seconds|
Mashpack is a brand new specification! Its got plenty left to do before it's cemented into a standard. This is always the most fun part of projects, so I want to share that with everyone who's interested. :)
Everyone that contributes to the Mashpack repository will receive Collaborator for the repository as I would like this to be a project for the community. All changes require code review by one other Collaborator and any Code Owners.
There are some issues that I've marked with
good first issue that I'm willing to put extra time, effort, and resources into helping you through. If you want to tackle one of these issues but feel the need for additional support you can message me on Twitter to ask questions. :-)
There are a lot of issues to contribute to in this category all marked with the
help wanted label. I'll get around to these eventually but if you'd like to contribute and get your name out there this is a great place to start.
As stated above there is no Cython implementation of the Mashpack Packer or Unpacker objects. Having a Cython implementation would increase the speed of the operations ten-fold and are necessary to be a good competitor to Messagepack. I've marked all issues related to Cython with a label.