mp4ff - beyond MP4 boxes

mp4ff consists of an open-source library and tools for handling MP4 files. It is written in Go, and I started writing it because I discovered that Go is a great language for writing back-end services and command-line tools, but I didn't find a performant library that could handle fragmented files in all the aspects I needed. Over time, my needs have grown, and so has mp4ff. Most of the code is still written by me, but there have been a number of contributions from colleagues as well as external contributors. More are welcome!

mp4ff can handle the usual tasks of parsing and writing most MP4 boxes, the building blocks of the file format, but it goes beyond that by including code and tools to analyse and arrange the actual media samples. After a brief introduction to the file format, I will dive more into the features that I think are unique, or at least rare, for this library and the tools.

Background

Digital video and audio both consist of sampled or generated data, like pixels, which is compressed using various standards or proprietary algorithms. The resulting binary data needs complementary metadata like time stamps, language, resolution, etc. Since that is a common scenario for a lot of different media, there are container formats that can be used to store or transport such media data together with its metadata.

One very widespread container format is the MPEG-4 (MP4) file format. Its building blocks are boxes, which all have a simple structure: a 4-byte size, a 4-byte type, and a payload of length size-8 (plus an escape mechanism for bigger boxes). It can be depicted as:

[Figure: an MP4 box consisting of a 4-byte size, a 4-byte type, and a payload of size-8 bytes]

Such boxes can be stacked inside each other, and there is a multitude of box types to handle a lot of different metadata and use cases. Most of the boxes are specified in MPEG standards, but there are also boxes from other standards, as well as proprietary extensions. The mp4ff library currently supports more than 110 different boxes.
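
To make the header layout concrete, here is a minimal sketch in plain Go (not using mp4ff) that reads the first box header of a file; the largesize escape mechanism for boxes whose size does not fit in 32 bits is left out:

package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// readBoxHeader reads the 8-byte header of one MP4 box: a 4-byte
// big-endian size followed by a 4-byte type. A size value of 1 would
// signal a 64-bit largesize field, which this sketch does not handle.
func readBoxHeader(r io.Reader) (size uint32, boxType string, err error) {
	var hdr [8]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return 0, "", err
	}
	return binary.BigEndian.Uint32(hdr[0:4]), string(hdr[4:8]), nil
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()
	size, boxType, err := readBoxHeader(f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("first box: [%s], size %d bytes (payload %d bytes)\n", boxType, size, size-8)
}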

The history of the format goes back to the QuickTime file format invented by Apple. In the 1990s, MPEG chose to build its new file format with QuickTime as a basis. Somewhat later, the file format specification was split into multiple parts of the ISO/IEC 14496 MPEG-4 standard, where part 12 (ISO/IEC 14496-12) is the base part (ISO Base Media File Format, ISOBMFF), part 14 contains the MPEG-4-specific parts, part 15 defines the MPEG video codec encapsulations, and part 30 describes subtitle wrappings.

There have already been a number of nice introductions to the MP4/ISOBMFF file format in this forum, so the focus of this post is not the file format and the box structures. Instead, we look at data about the media, the tools provided with mp4ff, and some optimisations needed to write efficient MP4 file format applications.

Progressive vs Segmented files

The original design of MP4 files has two main parts: one with metadata and one with media data. The metadata is structured inside a [moov] box and the media data resides in an [mdat] box. The [moov] box contains information about the tracks and the codecs, as well as detailed information about the size, offset, and timing of every single MP4 sample. An MP4 sample is typically a video picture, an audio frame consisting of a group of compressed audio sampling values, or some subtitle text. Tracks provide timed sequences of samples and can be interleaved into a progressive MP4 file, which can be played in a media player while the file is being downloaded.

The box structure of a (progressive) MP4 file can be investigated with the mp4ff tool mp4ff-info. It prints all the box names and sizes together with an overview of their data. With command-line parameters, it is possible to get more detailed information about a specific box type or about all boxes.
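
The same kind of inspection can be done from Go code. The sketch below decodes a file with mp4ff and prints a one-line summary per top-level box; it assumes that the decoded File exposes its top-level boxes in a Children slice and that every box reports Type() and Size():

package main

import (
	"fmt"
	"os"

	"github.com/Eyevinn/mp4ff/mp4"
)

func main() {
	f, err := os.Open("prog_8s.mp4") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	parsedFile, err := mp4.DecodeFile(f)
	if err != nil {
		panic(err)
	}
	// Print the type and size of each top-level box,
	// similar to a condensed mp4ff-info listing.
	for _, box := range parsedFile.Children {
		fmt.Printf("[%s] size=%d\n", box.Type(), box.Size())
	}
}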

Cropping a progressive file using mp4ff-crop

When working with media, it soon becomes obvious that there is a lot of variation in encoding and packaging. It is therefore important to gather a catalogue of test assets that can be used in regression tests. However, video assets are often quite long and require a lot of storage, so having an "exact" shortened version of an asset is very convenient.

That is the reason I developed the mp4ff-crop tool. It shortens a progressive MP4 file and keeps every box of the original file untouched as far as possible. The actual sample information is of course truncated, but this is done in such a way that the interleaving is left intact as far as possible.

MP4 fragments

A progressive file is suited for a single-bitrate movie, but is too rigid for live content and adaptive bitrate (ABR) switching in HTTP streaming systems such as HLS and MPEG-DASH. These require a further evolution of the MP4 file format called fragments.

In the fragmented case, there is still an initial [moov] metadata box, but it does not contain any information about the individual samples. The sample metadata and media data are instead available as MP4 fragments, each consisting of a [moof] (movie fragment) box followed by an [mdat] box containing the corresponding media samples. A typical duration of such a fragment is a couple of seconds, but it can be as short as a single sample.

Low-latency and multi-fragment segments

HTTP streaming solved the problem of efficiently distributing video over general IP infrastructure using existing HTTP-based Content Distribution Networks. However, for efficiency reasons, the segment durations are most often chosen to be several seconds long, e.g. the 6 s that HLS recommends. This goes against the desire to distribute live video at lower latency. The HLS and MPEG-DASH solutions for achieving such lower latency are both based on fragmented segments.

Fragmented, or chunked, segments essentially mean that each streaming segment consists not of one but of multiple MP4 fragments. A complete segment could look like

[Figure: a segment consisting of an [styp] box followed by several [moof][mdat] fragment pairs]

where the [styp] box signals the beginning of a new segment, and only the first [mdat] box is required to start with a key frame. Each [moof][mdat] pair is a new fragment. With this structure, it is possible to produce and distribute a segment fragment by fragment, and thereby avoid waiting until the whole segment has been produced.

Repackaging media

From the above, it should be clear that there is often a need to repackage media data. Progressive MP4 files are a good way of distributing full movies, while fragmented MP4 files are good for ABR streaming of both VoD and live content.

The mp4ff library can not only parse and write boxes, but also supports changing the structures. For this purpose, mp4ff adds an extra layer inside fragmented files that groups boxes into structures of the types InitSegment and MediaSegment, where the latter in turn contains a list of Fragment structures.
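
To see that layer in action, here is a small sketch that decodes a fragmented file and walks it segment by segment. It assumes that the decoded File exposes the extra layer through Init, Segments, and Fragments fields; adjust the names to the current API if needed:

package main

import (
	"fmt"
	"os"

	"github.com/Eyevinn/mp4ff/mp4"
)

func main() {
	f, err := os.Open("video.mp4") // placeholder fragmented file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	parsedFile, err := mp4.DecodeFile(f)
	if err != nil {
		panic(err)
	}
	if parsedFile.Init != nil {
		fmt.Println("found an init segment")
	}
	// Walk the segment/fragment layer that mp4ff adds on top of the raw boxes.
	for i, seg := range parsedFile.Segments {
		fmt.Printf("segment %d has %d fragments\n", i+1, len(seg.Fragments))
	}
}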

Two code examples are included with the library: the first one segments a progressive file into a segmented file, and the second changes the segment durations of an already segmented file. These are just examples, but they provide the basic core for a production implementation.

Another complexity is multiple tracks in the same file, for example combining video with one or more audio tracks. That is in general not a good idea for ABR streaming, since it requires a lot of track duplication to provide all combinations. Still, that use case can be supported by mp4ff, as shown in examples/multitrack, which reads a simple test asset with a video track and a QuickTime CEA-608 caption track and extracts the caption data.

Video codecs, NAL Units, and mp4ff-nallister

The mp4ff library has support for handling metadata and samples of the two video codecs H.264/AVC and H.265/HEVC. These codecs have double names since they were developed jointly by ITU-T and MPEG. The H.264 and H.265 names are from ITU-T, and the AVC and HEVC names are from MPEG. In the mp4ff code, the codecs are referred to as AVC and HEVC.

Starting with the AVC standard, a new layer called the Network Abstraction Layer was introduced in the video bitstream syntax. It consists of NALUs (Network Abstraction Layer Units). The idea was to make the media bitstream data independent of the transport, which was not the case for earlier standards like MPEG-2 video, H.263, etc. With this concept, a video frame (picture) consists of one or multiple NALUs, and each NALU starts with a NALU header signaling its type. The carriage of these NALUs differs between systems, but the actual NALU content is always the same.

In MP4 files, a video frame is one MP4 sample, and each such sample has a time, duration, size, and offset. There is therefore no need for start codes to find the start and length of a frame. An MP4 video sample instead consists of one or more NALUs, each preceded by a fixed-size length field, and one can find the next NALU by jumping forward that length inside the sample. This is in contrast to MPEG-2 Transport Streams, which have no position or length information and therefore rely on start codes to find the start of NALUs.
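
As a sketch of how that works, the following plain Go program splits a sample payload into NALUs, assuming the common 4-byte length field (the actual length-field size is signaled in the sample description):

package main

import (
	"encoding/binary"
	"fmt"
)

// splitNALUs splits an MP4 (AVC/HEVC) video sample payload into its NALUs,
// assuming a 4-byte big-endian length field before each NALU.
func splitNALUs(sample []byte) ([][]byte, error) {
	var nalus [][]byte
	for pos := 0; pos < len(sample); {
		if pos+4 > len(sample) {
			return nil, fmt.Errorf("truncated length field at offset %d", pos)
		}
		naluLen := int(binary.BigEndian.Uint32(sample[pos : pos+4]))
		pos += 4
		if pos+naluLen > len(sample) {
			return nil, fmt.Errorf("NALU of %d bytes overruns the sample", naluLen)
		}
		nalus = append(nalus, sample[pos:pos+naluLen])
		pos += naluLen
	}
	return nalus, nil
}

func main() {
	// A fabricated sample holding two tiny NALUs of 2 and 3 bytes.
	sample := []byte{0, 0, 0, 2, 0x65, 0x88, 0, 0, 0, 3, 0x06, 0x05, 0x10}
	nalus, err := splitNALUs(sample)
	if err != nil {
		panic(err)
	}
	for i, nalu := range nalus {
		// For AVC, the NALU type is the low 5 bits of the first header byte.
		fmt.Printf("NALU %d: type %d, %d bytes\n", i+1, nalu[0]&0x1f, len(nalu))
	}
}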

The NALU types provide information about the picture type, but can also reveal metadata information. The tool mp4ff-nallister was created to display information about NALUs inside the video samples of an MP4 file.

It can be run like

> mp4ff-nallister 0.m4s
Sample 1, pts=0 (20762B): SEI_6 (9B), SEI_6 (756B), SEI_6 (7B), IDR_5 [I] (19974B)
Sample 2, pts=1536 (172590B): SEI_6 (7B), NonIDR_1 [P] (172575B)
Sample 3, pts=512 (7123B): SEI_6 (7B), NonIDR_1 [B] (7108B)
...

and prints one line for each video sample. Here one can see that the first video sample has presentation time (pts) 0 and has three SEI (Supplemental Enhancement Information, type 6) NALUs before the actual video starts with an IDR (type 5) NALU.

The SEI NAL units can contain a lot of different information. Some examples are timing, encoding settings for x264, HDR information for HEVC, and CEA-608 closed captions. With the parameter -sei 1 the tool provides more information about the SEI NALUs. The mp4ff mapping of NALU types to names should be complete, but parsing is restricted to a much smaller set.

Parameter Sets and mp4ff-pslister

A video decoder decodes video given a context. The highest-level context, such as the resolution and encoding settings, is given by a hierarchy of parameter sets. It is essential that the parameter sets used when decoding are aligned with the actual encoding parameters. When stitching video, it is therefore important to know if any essential parameter has changed. In that case, another previously transmitted parameter set must be used (they are numbered), or one must send a new parameter set in-band in the video stream. The parameter sets are always carried in NALUs of specific types.

I have often found that it is important to get detailed information about the parameter sets, and there is therefore a tool included with the mp4ff library called mp4ff-pslister. It decodes and prints the parameter set information. For H.264/AVC it should be complete, but that is not yet the case for H.265/HEVC.

The parameter sets can either be conveyed deep in the moov box or as NALUs in the video samples themselves. The tool extracts them from both places and can also interpret hex strings.
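The parsers behind the tool can also be used directly from Go code. The sketch below decodes an SPS given as a hex string; the ParseSPSNALUnit function in the avc package and the Profile/Level fields on its SPS struct are assumptions about the API, so check the package documentation for the exact names:

package main

import (
	"encoding/hex"
	"fmt"

	"github.com/Eyevinn/mp4ff/avc"
)

func main() {
	// An SPS NALU given as a hex string (the same form the tool accepts).
	spsHex := "67640028acd940780227e5ffffffffffda808080980800026e8f00004dd1e8a4c01e30632c"
	spsBytes, err := hex.DecodeString(spsHex)
	if err != nil {
		panic(err)
	}
	// Assumed signature: the second argument controls how much of the VUI is parsed.
	sps, err := avc.ParseSPSNALUnit(spsBytes, true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("profile=%d level=%d\n", sps.Profile, sps.Level)
}
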

As an example, running the tool on an init segment (containing a moov box with parameter sets) yields:

> mp4ff-pslister -v -i init.mp4
Video avc track ID=1
SPS 1 len 37B: 67640028acd940780227e5ffffffffffda808080980800026e8f00004dd1e8a4c01e30632c
{
  "Profile": 100,
  "ProfileCompatibility": 0,
  "Level": 40,
  "ParameterID": 0,
  "ChromaFormatIDC": 1,
  "SeparateColourPlaneFlag": false,
  "BitDepthLumaMinus8": 0,
  "BitDepthChromaMinus8": 0,
  ...
}
PPS 1 len 5B: 68ebecb22c
{
  "PicParameterSetID": 0,
  "SeqParameterSetID": 0,
  "EntropyCodingModeFlag": true,
  ...
}
Codecs parameter (assuming avc1) from SPS id 0: avc1.640028

where the hex strings after the SPS and PPS lines are the binary payload data and the blocks following them show all the parameters carried in the parameter sets.

Subtitles in MP4 files and mp4ff-wvttlister

Part 30 of the MPEG-4 standard describes how TTML and WebVTT subtitles shall be carried in MP4 files. The approaches are very different. For TTML, the original XML structure is used, and there is typically a single sample per segment. That sample is a complete TTML XML document with internal time stamps. Beyond text, it can also provide image subtitles and carry such images as subsamples. The codecs attribute for such a file is stpp, and there is an [stpp] box that should be included in the track sample descriptor. mp4ff supports all relevant boxes.

WebVTT content is stored in a way that is much closer to video and audio. Every change in the subtitle state must be a new sample, and the full interval must be covered with either empty (no subtitle) or cue (text with styling) samples. The samples themselves are MP4 boxes, like [vtte] for an empty sample and [vttc] for a text cue. The codecs attribute for fragmented WebVTT is wvtt, and there is a [wvtt] sample description box. To extract the subtitles with timing and box structure, there is a tool mp4ff-wvttlister included with mp4ff. I found it quite useful when working with wvtt files.

Performance and optimisations

For small files and infrequent parsing, any Go code is probably fast enough to handle MP4 files. However, when working with big files and on-the-fly repackaging, there are typically bottlenecks that need optimisation.

Memory allocations

In Go, a general performance area to consider is heap allocations. Benchmarking and profiling showed that there were indeed a lot of memory allocations when parsing the boxes. Part of this was due to the use of an io.LimitReader for each box; that layer introduced extra slice allocations. Since MP4 boxes have a size field that reveals their size before they are read, one can go one step further and allocate a single big slice that contains a top-level box and all its child boxes. That has been implemented using a structure called SliceReader. The performance improvement in this area was done in version v0.27, and the gain can be seen in the benchmark table below:

name \ time/op               v0.26    v0.27    v0.27-sr
DecodeFile/1.m4s-16          21.9µs   6.7µs    2.6µs
DecodeFile/prog_8s.mp4-16    143µs    48µs     16µs

The execution time per file read is around a factor of 3 shorter when going from v0.26 to v0.27, and another factor of 3 shorter when using the SliceReader. The library now has two decode functions for each box type Fooo: one DecodeFooo that uses an io.Reader and one DecodeFoooSR that reads from a SliceReader. The recommendation is to use the latter variant.

Lazy decoding

Another optimisation is to avoid reading the media data into memory. That data can be huge (even beyond what the 32-bit size field can hold, requiring the 64-bit variant). There is therefore an option when decoding a file that just reads the size of the [mdat] box and later uses byte-range requests to read the actual media data.

A typical call looks like:

    parsedFile, err := DecodeFile(file, WithDecodeMode(DecModeLazyMdat))

where file must implement the io.ReadSeeker interface.
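
A slightly more complete sketch of the lazy mode, with the surrounding file handling (the file name is just a placeholder):

package main

import (
	"fmt"
	"os"

	"github.com/Eyevinn/mp4ff/mp4"
)

func main() {
	// os.File implements io.ReadSeeker, which lazy decoding needs so that
	// the [mdat] payload can be fetched later with byte-range reads.
	file, err := os.Open("big_progressive.mp4") // placeholder input file
	if err != nil {
		panic(err)
	}
	defer file.Close()

	parsedFile, err := mp4.DecodeFile(file, mp4.WithDecodeMode(mp4.DecModeLazyMdat))
	if err != nil {
		panic(err)
	}
	fmt.Println("decoded the box structure without loading the media data into memory")
	_ = parsedFile
}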

Reading multiple samples

Another way to increase performance when repackaging data is to fetch data for multiple samples at the same time. A useful function in this area is

   func (f *Fragment) GetSampleInterval(trex *TrexBox, startSampleNr, endSampleNr uint32) (SampleInterval, error)

which returns an interval of sample data.
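
A usage sketch, where the field paths used to reach the trex box and the first fragment are assumptions about the decoded structure:

package main

import (
	"fmt"
	"os"

	"github.com/Eyevinn/mp4ff/mp4"
)

func main() {
	f, err := os.Open("video_segments.mp4") // placeholder fragmented file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	parsedFile, err := mp4.DecodeFile(f)
	if err != nil {
		panic(err)
	}
	// The field paths below (Init.Moov.Mvex.Trex, Segments[0].Fragments[0])
	// are assumptions; adjust them to the structure of your file.
	trex := parsedFile.Init.Moov.Mvex.Trex
	frag := parsedFile.Segments[0].Fragments[0]
	// Fetch timing and data for samples 1..100 of the fragment in one call.
	interval, err := frag.GetSampleInterval(trex, 1, 100)
	if err != nil {
		panic(err)
	}
	fmt.Printf("got a %T covering samples 1..100\n", interval)
}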

Non-optimal MP4 structures

A final area I'd like to bring up is that the data structures of the MP4 boxes are not always optimal. As an example, let's discuss the "Composition Time to Sample box" [ctts].

This box is part of progressive MP4 files and contains a list of pairs (sample_count, sample_offset) that tell how much the presentation timestamp should differ from the decode timestamp for an interval of sample_count samples. These values tend to vary from sample to sample, so there is essentially one table entry per sample. For a 2-hour movie at 25 frames/s, there are therefore around 180 000 entries in the table. If one stores the data using the same structure as in the standard, it is very inefficient to find the value for a specific sample, since the sample_count values must be accumulated in a linear search. In fact, looking up values in [ctts] was the major CPU bottleneck when converting a big progressive MP4 file to a segmented one. By rewriting the code to store the sample_offset values individually and use a bisection algorithm to find the correct value, the CPU usage of the whole program went down by a factor of 5.
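
As a sketch of the idea (not necessarily the exact layout used in mp4ff), one can keep the cumulative sample count per [ctts] run and binary search over it with sort.Search:

package main

import (
	"fmt"
	"sort"
)

// cttsLookup keeps one entry per [ctts] run together with the cumulative
// number of samples covered up to and including that run.
type cttsLookup struct {
	endSampleNr []uint32 // cumulative sample count after each run
	offset      []int32  // sample_offset value for each run
}

// offsetForSample returns the composition time offset for a 1-based
// sample number using binary search instead of a linear scan.
func (c *cttsLookup) offsetForSample(sampleNr uint32) int32 {
	i := sort.Search(len(c.endSampleNr), func(i int) bool {
		return c.endSampleNr[i] >= sampleNr
	})
	if i == len(c.endSampleNr) {
		return 0 // sample number beyond the table
	}
	return c.offset[i]
}

func main() {
	c := cttsLookup{
		endSampleNr: []uint32{3, 5, 10}, // runs of 3, 2, and 5 samples
		offset:      []int32{1024, 0, 2048},
	}
	fmt.Println(c.offsetForSample(4)) // prints 0 (sample 4 is in the second run)
}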

Buffered I/O

A general thing to keep in mind is that I/O operations are expensive, so buffered I/O is often beneficial to reduce the system resources needed. For this, the bufio standard library package is a good starting point.
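
A minimal sketch of wrapping file I/O in buffered readers and writers:

package main

import (
	"bufio"
	"os"
)

func main() {
	in, err := os.Open("input.mp4")
	if err != nil {
		panic(err)
	}
	defer in.Close()
	out, err := os.Create("output.mp4")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Wrap the raw files in buffered readers/writers so that the many small
	// box reads and writes do not each turn into a system call.
	r := bufio.NewReader(in)
	w := bufio.NewWriter(out)
	defer w.Flush() // make sure buffered bytes reach the file before it is closed

	// ... decode from r and encode to w, e.g. with mp4ff ...
	_ = r
}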

Conclusion

mp4ff is a Go library bundled with a set of useful tools:

  • mp4ff-crop crops progressive files with minimal changes
  • mp4ff-info explores the box structure of an MP4 file
  • mp4ff-pslister extracts parameter set information
  • mp4ff-nallister provides information about NALUs
  • mp4ff-wvttlister lists subtitle samples in a wvtt file

If you have Go version 1.16 or later installed, you can easily install any of the tools, like mp4ff-info, by typing:

> go install github.com/Eyevinn/mp4ff/cmd/mp4ff-info@latest

mp4ff is under active development and can hopefully be of help to anyone developing a media service using MP4 files in Go, especially services using fragmented MP4 files such as MPEG-DASH and HLS.

Go is a great language that is known for its readability. I hope that you find the mp4ff code easy to work with.
