Have you ever wanted to create a custom file format but didn't know where to start or found the task to be somewhat daunting? I've been there. Knowing how most binary file formats are generally structured certainly helps.
However, when making a new binary file format, everyone seems to go through the same set of exercises over and over again. These steps are roughly:
1. Looking around to see what other people have done when making a file format. (Sometimes this is the last step.)
2. Choosing a magic string for the first few bytes of the file.
3. Deciding what the file header consists of.
4. Designing what each object/structure in the file looks like. The more compact each structure is, the smaller the final file will be, but there are tradeoffs.
5. Calculating the size of objects/structures and their placement in the final file, including any extras such as padding, CRC-32/hashes, etc.
6. Returning 6 months to several years later only to realize that the original plans didn't leave room to expand the format in the way it now needs to expand.
7. Repeatedly returning to the implementation of the format over many years as a variety of security vulnerabilities are discovered.
Steps 2-4 are theoretically easy. The hardest part to get right in a structured binary file is position tracking (step 5). The size of each structure has to be known before it can be written (or read), and the header or something else in the file has to point at the location of at least one structure. One structure must never overwrite part or all of another, so the position of every structure has to be planned out before anything is actually written. The more layers of structures there are, the more of a nightmare these "bookkeeping bits" become.
Two other difficulties never go away: planning for future expansion without bloating the format is tough to get right in advance, and bad actors can supply malformed files that trigger security vulnerabilities such as buffer overflows and privilege escalation.
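To make the bookkeeping concrete, here is a minimal Python sketch of a hand-rolled format. Everything in it is an assumption made up for illustration (the `DEMO` magic string, the header layout, and the `pack_file`/`unpack_file` names); it is not how IFDS lays out a file. It plans every record's offset before writing a single byte, appends a CRC-32, and refuses to follow offsets that point outside the file when reading.

```python
import struct
import zlib

MAGIC = b"DEMO"                     # hypothetical 4-byte magic string
HEADER = struct.Struct("<4sHI")     # magic, format version, record count
ENTRY = struct.Struct("<II")        # (offset, size) of one record


def pack_file(records):
    """Plan the position of every record, then serialize the whole file."""
    offset = HEADER.size + ENTRY.size * len(records)  # first byte after the table
    table = b""
    for rec in records:
        table += ENTRY.pack(offset, len(rec))
        offset += len(rec)
    body = HEADER.pack(MAGIC, 1, len(records)) + table + b"".join(records)
    return body + struct.pack("<I", zlib.crc32(body))  # trailing CRC-32


def unpack_file(blob):
    """Read the records back, rejecting malformed or out-of-bounds input."""
    if len(blob) < HEADER.size + 4:
        raise ValueError("file too short")
    body = blob[:-4]
    (crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    magic, version, count = HEADER.unpack_from(body, 0)
    if magic != MAGIC or version != 1:
        raise ValueError("unrecognized magic string or version")
    data_start = HEADER.size + ENTRY.size * count
    if data_start > len(body):
        raise ValueError("record count exceeds file size")
    records = []
    for i in range(count):
        offset, size = ENTRY.unpack_from(body, HEADER.size + ENTRY.size * i)
        if offset < data_start or offset + size > len(body):
            raise ValueError("record points outside the data area")
        records.append(body[offset:offset + size])
    return records
```

Even in this toy, most of the code is offset arithmetic and bounds checking rather than anything specific to the data being stored, and adding a second layer of structures would multiply it.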
However, in general, a structured binary file format is not all that different from data structures in RAM. In fact, most binary files share several common patterns such as "fixed arrays" where each entry in the array is of fixed size. If these patterns were implemented in a way that eliminated the difficult parts of file format design, then anyone could rapidly design their own file format.
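The "fixed array" pattern is easy to sketch. Because every entry has the same size, the position of entry `i` is simple arithmetic and no per-entry offset table is needed. The entry layout and function names below are hypothetical and only illustrate the general pattern, not the IFDS fixed array structure.

```python
import io
import struct

ENTRY = struct.Struct("<IH")   # hypothetical entry: 32-bit ID + 16-bit flags (6 bytes)


def write_entries(f, base, entries):
    """Write fixed-size entries back to back, starting at byte offset `base`."""
    f.seek(base)
    for ident, flags in entries:
        f.write(ENTRY.pack(ident, flags))


def read_entry(f, base, index):
    """Random access: the position is computed, never searched for."""
    if index < 0:
        raise IndexError("entry index out of range")
    f.seek(base + index * ENTRY.size)
    data = f.read(ENTRY.size)
    if len(data) != ENTRY.size:
        raise IndexError("entry index out of range")
    return ENTRY.unpack(data)


f = io.BytesIO(b"\x00" * 16)           # pretend the first 16 bytes are a header
write_entries(f, 16, [(10, 1), (20, 2), (30, 3)])
print(read_entry(f, 16, 2))            # (30, 3)
```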
Enter the Incredibly Flexible Data Storage (IFDS) file format:
cubiclesoft / ifds
Easily create your own custom file format with the Incredibly Flexible Data Storage (IFDS) file format. Repository contains: The IFDS specification (CC0) and the official PHP reference implementation of IFDS, a paging file cache class, and example usage classes (MIT or LGPL).
Incredibly Flexible Data Storage (IFDS) File Format
The Incredibly Flexible Data Storage (IFDS) file format enables the rapid creation of highly scalable and flexible custom file formats. Create your own customized binary file format with ease with IFDS.
Implementations of the IFDS file format specification internally handle all of the difficult bookkeeping bits that occur when inventing a brand new file format, which lets software developers focus on more important things like high level design and application development. See the use-cases and examples below that use the PHP reference implementation (MIT or LGPL, your choice) of the specification to understand what is possible with IFDS!
Features
- Custom magic strings (up to 123 bytes) and semantic file versioning.
- Included default data structures: Raw storage, binary key-value/key-ID map, fixed array, linked list.
- Extensible. Add your own low level data structures (supports up to 31 custom structures such as trees) and format encoders/decoders…
The primary purpose of IFDS is to provide data structure (and associated data) storage and retrieval while, at the same time, handling all of the bookkeeping bits internally. That is all it does.
IFDS is fairly unique. Almost every file format is single-purpose: JPEG, PDF, ZIP, SQLite database files, etc. Accessing each format's structure and data requires its own separate library. Any file format with IFDS as its base, on the other hand, only needs a single IFDS library to access the data structures and data, allowing such files to be verified and even optimized using any IFDS library. Understanding the data still needs a library layer to interpret the data but lower level structure/data storage and retrieval can be handled by any IFDS implementation.
IFDS is also fairly unique in that there is no preset "magic string." The official reference implementation defaults to "IFDS" but it is recommended to set your own magic string for your own file format.
IFDS is also fairly unique in that it handles tracking position information in a file mostly via object IDs. Each individual data structure is stored in an "object," which tracks the location of the data structure plus either internal object data or DATA CHUNKS. With very few exceptions, every object has an object ID associated with it. The "object ID to position" lookup table allows an object to move freely within the file while its ID is used to reference the object itself. The closest comparison here would be a database table with an auto-incrementing ID. IFDS is not a database though.
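The idea behind an "object ID to position" table can be shown with a toy in-memory model. This is only a sketch of the general indirection technique under my own assumptions; the `ObjectStore` class and its methods are hypothetical and say nothing about how IFDS actually encodes its table on disk.

```python
class ObjectStore:
    """Toy model: object IDs are stable, byte positions are not."""

    def __init__(self):
        self.data = bytearray()   # stands in for the file contents
        self.positions = {}       # object ID -> (offset, size)
        self.next_id = 1          # auto-incrementing, like a database row ID

    def create(self, payload):
        obj_id = self.next_id
        self.next_id += 1
        self.positions[obj_id] = (len(self.data), len(payload))
        self.data += payload
        return obj_id

    def read(self, obj_id):
        offset, size = self.positions[obj_id]
        return bytes(self.data[offset:offset + size])

    def rewrite(self, obj_id, payload):
        """Relocate the object to the end; only its table entry changes."""
        self.positions[obj_id] = (len(self.data), len(payload))
        self.data += payload


store = ObjectStore()
first = store.create(b"abc")
second = store.create(b"defg")
store.rewrite(first, b"a much longer payload")   # the object moves...
print(store.read(first))                         # ...but its ID still resolves
```

Anything that refers to `first` by ID keeps working after the move. The bytes left behind at the old position are exactly the kind of hole that a free space tracker exists to reclaim.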
IFDS is fairly unique in that it is extremely scalable. It supports file sizes from a couple hundred bytes all the way up to 18EB (2^64 bytes). Each object supports storing seekable (i.e. random access) data up to 280TB in size. IFDS also supports up to 4.2 billion objects. In general, maximum file size limits imposed by the operating system will be encountered long before reaching the limitations of IFDS...assuming you've even got that kind of storage available. Internally, IFDS is somewhat similar to how some file systems work. It even has a free space tracker to track free space as objects and data are created, deleted, moved, and resized. IFDS is not a file system though.
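For a sense of scale, the headline numbers can be checked with a few lines of Python. The 2^64 figure comes from the text above; treating the object limit as a 32-bit ID space (2^32) is my assumption based on the "4.2 billion" figure, not something stated here.

```python
MAX_FILE_BYTES = 2**64   # stated above as the maximum file size
MAX_OBJECTS = 2**32      # assumption: a 32-bit object ID space


def exabytes(n):
    """Decimal exabytes (10^18 bytes)."""
    return n / 10**18


def exbibytes(n):
    """Binary exbibytes (2^60 bytes)."""
    return n / 2**60


print(f"{exabytes(MAX_FILE_BYTES):.2f} EB")           # 18.45 EB
print(f"{exbibytes(MAX_FILE_BYTES):.0f} EiB")         # 16 EiB
print(f"{MAX_OBJECTS / 10**9:.2f} billion objects")   # 4.29 billion objects
```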
There are over a dozen use cases provided as ideas for what IFDS can be used for, and three of those use cases are explored in depth in the main IFDS repository README. I won't rehash those here (you can read those yourself), but any file format that has ever existed could be redone in an IFDS-based universe and be better off in the long term. As a possible example of this claim, let's take the average JavaScript file. We all know that minification and compression are "important" for content delivery on the Internet. I'm not a fan of minifying JavaScript as it makes debugging much more difficult (and sometimes impossible). However, I don't have any issues with lossless data compression.
So hypothetically speaking: What if your web server was IFDS TEXT-aware, your web browser was IFDS TEXT-aware, and your favorite text editor was also IFDS TEXT-aware? JavaScript files could be set up to be transparently compressed by the text editor, the web server could extract the MIME type from the metadata section and also not waste a bunch of CPU cycles compressing content, and the web browser could transparently decompress the file. You, the developer, would see, edit, and debug in plain text as you do now, but the content would be transparently compressed when stored to disk. Any IFDS TEXT-enabled tool would also know to transparently decompress the file as it reads in the data.
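The workflow is simple enough to sketch with nothing but the Python standard library. The container below (a length-prefixed JSON metadata block followed by a zlib-compressed body) is invented purely for illustration and is not the IFDS TEXT format; `save_text`, `read_mime`, and `load_text` are hypothetical names.

```python
import json
import struct
import zlib


def save_text(text, mime_type):
    """Editor side: compress on save and record the MIME type as metadata."""
    meta = json.dumps({"mime": mime_type, "encoding": "utf-8"}).encode("utf-8")
    body = zlib.compress(text.encode("utf-8"))
    return struct.pack("<I", len(meta)) + meta + body


def read_mime(blob):
    """Server side: read the metadata without touching the compressed body."""
    (meta_len,) = struct.unpack_from("<I", blob, 0)
    return json.loads(blob[4:4 + meta_len])["mime"]


def load_text(blob):
    """Editor or browser side: decompress transparently on load."""
    (meta_len,) = struct.unpack_from("<I", blob, 0)
    meta = json.loads(blob[4:4 + meta_len])
    return zlib.decompress(blob[4 + meta_len:]).decode(meta["encoding"])


source = "function add(a, b) {\n    return a + b;\n}\n" * 50
blob = save_text(source, "text/javascript")
print(read_mime(blob), len(source), len(blob))
assert load_text(blob) == source
```

In this sketch the server only ever calls `read_mime` and ships the stored bytes as they are, which is where the saved CPU cycles would come from.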
Transparent compression/decompression and better metadata support for text files just scratch the surface of what IFDS is capable of. The text editors we currently use are vastly inferior to what they could and should be because they are editing file formats stuck in the 1970s. As a direct result, many hacky workarounds have been invented in the last 45+ years instead of fixing the root problems.
Binary formats like IFDS, of course, have their caveats with the main argument against them being that "text editors can't read them and viewing them in hex editors is cumbersome." However, imagine a world, for just one moment, where IFDS (or something similar) was the primary baseline file format. We would have a couple of common tools for every OS that could read any file to determine if it is partially corrupt and offer the user visual options to repair the partially corrupted portions. The ability to open IFDS (or similar) files in a wide variety of tools would exist and those tools could interoperate with any software application. Hex editors would support IFDS natively and show structure and object breakdowns visually. Text editors might even do rational things with IFDS files that aren't even in the IFDS TEXT format. Applications would load configurations from IFDS CONF files and simple, small OS-level configuration tools could configure any application with a single, shared point-and-click interface. The list of benefits goes on and on and the world would be a better place. Having every binary file format do its own thing is fundamentally problematic. There are more significant problems to solve than producing and parsing approximately the same data structures on repeat for each individual file format. Okay, now we can return to reality.
The reality is that we can't replace everything overnight with IFDS or something similar. For example, replacing every JPEG image on Earth would be an impossible endeavor. New file formats, however, have an opportunity to start with something highly scalable and skip a lot of the frustrating steps when making a new file format.
Note that while I'm obviously fond of and waxing poetic about IFDS, my goal here is to get people thinking about the files we create, the hacky workarounds that we use due to format limitations, and maybe start solving those problems. I'm not saying IFDS is the "best" option for new file formats, but I did pour the past 6 months of my life into designing it: Three complete rewrites of the specification + countless minor revisions + the official reference implementation + more revisions to the spec + the test suite and example classes + the documentation. I'd just hate to see that significant effort go to waste.
IFDS is almost certainly unique. I haven't seen a file format before whose stated goal is to be scalable up to 18EB while being efficient on system resources, transparently handle the tough bookkeeping bits, and provide common data structures and data storage specifically for creating custom file formats. However, there are thousands of different file formats floating around out there, so it is of course possible that I overlooked something similar.
The Incredibly Flexible Data Storage (IFDS) specification is released under CC0 and the official IFDS reference implementation is released under MIT or LGPL. IFDS is not tied to CubicleSoft or me other than just being a place for responsible custodianship and is intended to be used by the larger community of software developers.