DEV Community

Working with VideoToolbox for more control over video encoding and decoding - Part 2.

In the last article, I presented the structure of a macOS app that encodes and decodes video using VideoToolbox. This time, I would like to focus on the actual encoding process. I will also look at ways to improve and restructure the existing code, bearing in mind the various ways you could build such a project.

As we left the project in the last article, we had a working macOS app that captured video from the camera and then sent it off directly to be encoded. In that naive implementation, we are essentially doing the same thing as - or even less than, really - what AVFoundation can do for you. AVFoundation provides access to hardware-accelerated compression and decompression by default. What you don't get there, though, is the ability to fine-tune and customise the encoding and decoding. That is the whole point of VideoToolbox. So let's see how to access the details of the encoding process. The changes I have made in conjunction with this article were merged in from my branch encoder-improvements. Once again, I have referred to existing projects for some of the approach I have taken. In this case, I have updated the encoder code in this iOS project from Objective-C to Swift 5.

The first improvement to make is to set up our encoder before we actually start using it.

func prepareToEncodeFrames() {
        let encoderSpecification = [
            kVTVideoEncoderSpecification_RequireHardwareAcceleratedVideoEncoder: true as CFBoolean
        ] as CFDictionary
        let status = VTCompressionSessionCreate(allocator: kCFAllocatorDefault, width: self.width, height: self.height, codecType: kCMVideoCodecType_H264, encoderSpecification: encoderSpecification, imageBufferAttributes: nil, compressedDataAllocator: nil, outputCallback: outputCallback, refcon: Unmanaged.passUnretained(self).toOpaque(), compressionSessionOut: &session)
        print("H264Coder init \(status == noErr) \(status)")
        // This demonstrates setting a property after the session has been created
        guard let compressionSession = session else { return }
        VTSessionSetProperty(compressionSession, key: kVTCompressionPropertyKey_RealTime, value: kCFBooleanTrue)
        VTSessionSetProperty(compressionSession, key: kVTCompressionPropertyKey_ProfileLevel, value: kVTProfileLevel_H264_Main_AutoLevel)
        VTSessionSetProperty(compressionSession, key: kVTCompressionPropertyKey_AllowFrameReordering, value: kCFBooleanFalse)
        VTSessionSetProperty(compressionSession, key: kVTCompressionPropertyKey_ExpectedFrameRate, value: CFNumberCreate(kCFAllocatorDefault, CFNumberType.intType, &self.fps))
        VTCompressionSessionPrepareToEncodeFrames(compressionSession)
}

Here we are wrapping the method VTCompressionSessionPrepareToEncodeFrames(compressionSession) which you can read more about in the Apple docs.
We've also taken the opportunity to set some session properties before we begin encoding. Most are set inline here, but frames-per-second is an example of a property exposed in the class, set in appDelegate when creating the encoder:

    // Create encoder here (at the expense of dynamic setting of height and width)
    encoder = H264Encoder(width: 1280, height: 720, callback: { encodedBuffer in
      // self.sampleBufferNoOpProcessor(encodedBuffer) // Logs the buffers to the console for inspection
      // self.decodeCompressedFrame(encodedBuffer) // Uncomment to see decoded video
    })
    encoder?.delegate = self
    encoder?.fps = 15

(As commented here, in the previous code we could dynamically set the width and height based on the incoming buffer of data. I sacrificed that here for the sake of other demonstrations, but you may need to find a way to keep that functionality in another application.)
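Other session properties can be tuned in the same way. For instance - the keys below are standard VideoToolbox constants, but the values are illustrative choices of my own, not taken from the project - you could cap the average bit rate and the keyframe interval:

```swift
import VideoToolbox

// Assumes `compressionSession` is the VTCompressionSession created above.
// Example values only; tune these for your own use case.
VTSessionSetProperty(compressionSession,
                     key: kVTCompressionPropertyKey_AverageBitRate,
                     value: 1_000_000 as CFNumber)  // target roughly 1 Mbit/s
VTSessionSetProperty(compressionSession,
                     key: kVTCompressionPropertyKey_MaxKeyFrameInterval,
                     value: 60 as CFNumber)         // force a keyframe at least every 60 frames
```

As with the properties set in prepareToEncodeFrames(), these must be set before encoding begins (or before the next call to VTCompressionSessionPrepareToEncodeFrames) to take effect reliably.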

I want to receive the compressed data in my appDelegate, so I can do something with it later. To this end, I created an extension implementing two delegate functions defined in the encoder. First, the encoder protocol:

protocol H264EncoderDelegate: AnyObject {
    func dataCallBack(_ data: Data!, frameType: FrameType)
    func spsppsDataCallBack(_ sps: Data!, pps: Data!)
}

and the extension:

extension AppDelegate : H264EncoderDelegate {
    func dataCallBack(_ data: Data!, frameType: FrameType) {
        let byteHeader: [UInt8] = [0, 0, 0, 1]
        var byteHeaderData = Data(byteHeader)
        byteHeaderData.append(data)
        // Could decode here
        // H264Decoder.decode(byteHeaderData)
    }

    func spsppsDataCallBack(_ sps: Data!, pps: Data!) {
        let spsbyteHeader: [UInt8] = [0, 0, 0, 1]
        var spsbyteHeaderData = Data(spsbyteHeader)
        var ppsbyteHeaderData = Data(spsbyteHeader)
        spsbyteHeaderData.append(sps)
        ppsbyteHeaderData.append(pps)
        // Could decode here
        // H264Decoder.decode(spsbyteHeaderData)
        // H264Decoder.decode(ppsbyteHeaderData)
    }
}

We'll discuss those byte headers shortly ;)

While we're there, to keep things tidy, we may as well make the existing AVManagerDelegate into an extension too:

// MARK: - AVManagerDelegate
extension AppDelegate : AVManagerDelegate {
    func onSampleBuffer(_ sampleBuffer: CMSampleBuffer) {
        // ... existing buffer-to-buffer encoding ...
    }
}

In short, we are going to keep the existing buffer-to-buffer encoding for now and extend the callback method such that it will also call the above callbacks each time. And just to keep an eye on the overall plan, what we are doing here is preparing (encoding) the data as an elementary stream. The data we receive in our sampleBuffer is in AVCC format, whereas the format we want out is an elementary stream in the so-called Annex B format. Everything we do in the callback has to do with converting from the AVCC format to the Annex B format, while allowing us to tweak the details of that process in various ways.
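Before diving into the callback itself, the conversion can be pictured in isolation. Here is a minimal, self-contained sketch - the function name and structure are my own illustration, not the project's code - that walks a buffer of AVCC NAL units (each prefixed with a 4-byte big-endian length) and emits the same units in Annex B form (each prefixed with the start code [0,0,0,1]):

```swift
import Foundation

/// Illustrative sketch: convert AVCC-formatted data (4-byte big-endian
/// length prefixes) to Annex B format (4-byte start codes).
func avccToAnnexB(_ avcc: Data) -> Data {
    let headerLength = 4
    let startCode: [UInt8] = [0, 0, 0, 1]
    var output = Data()
    var offset = 0
    while offset + headerLength <= avcc.count {
        // Read the big-endian 4-byte length prefix
        var nalLength: UInt32 = 0
        for i in 0..<headerLength {
            nalLength = (nalLength << 8) | UInt32(avcc[avcc.startIndex + offset + i])
        }
        let nalStart = offset + headerLength
        let nalEnd = nalStart + Int(nalLength)
        guard nalEnd <= avcc.count else { break }
        // Replace the length prefix with a start code and copy the payload
        output.append(contentsOf: startCode)
        output.append(avcc.subdata(in: (avcc.startIndex + nalStart)..<(avcc.startIndex + nalEnd)))
        offset = nalEnd
    }
    return output
}
```

The real callback below does the same walk, but over the raw block buffer pointer and with the delegate callbacks in the loop body.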

If a sample buffer contains a keyframe, we also know it will contain parameter data describing how the decoder should handle these frames when it receives them.

So the first part of our callback looks like this:

let outputCallback: VTCompressionOutputCallback = { refcon, sourceFrameRefCon, status, infoFlags, sampleBuffer in
        guard let refcon = refcon,
              status == noErr,
              let sampleBuffer = sampleBuffer else {
            print("H264Coder outputCallback sampleBuffer NULL or status: \(status)")
            return
        }
        if (!CMSampleBufferDataIsReady(sampleBuffer)) {
            print("didCompressH264 data is not ready...")
            return
        }
        let encoder: H264Encoder = Unmanaged<H264Encoder>.fromOpaque(refcon).takeUnretainedValue()
        if (encoder.shouldUnpack) {
            var isKeyFrame: Bool = false

            // Attempting to get keyFrame
            guard let attachmentsArray: CFArray = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, createIfNecessary: false) else { return }
            if (CFArrayGetCount(attachmentsArray) > 0) {
                let cfDict = CFArrayGetValueAtIndex(attachmentsArray, 0)
                let dictRef: CFDictionary = unsafeBitCast(cfDict, to: CFDictionary.self)

                let value = CFDictionaryGetValue(dictRef, unsafeBitCast(kCMSampleAttachmentKey_NotSync, to: UnsafeRawPointer.self))
                if (value == nil) {
                    isKeyFrame = true
                }
            }

(Note the encoder property shouldUnpack, which simply wraps all this unpacking code in an if-statement so you can activate it as required).
The callback receives a sample buffer on each call, which we need to check is ready to be processed. Next we need to know what kind of data is in the buffer. To do this, we need to take a look at what are known as "attachments" on the buffer - essentially an array of dictionaries providing information about the data. You can see a list of the many attachment keys used here. The one we need is kCMSampleAttachmentKey_NotSync, the absence of which indicates that the data we are dealing with is a keyframe. (And as you can see, it can get a bit messy working with CFDictionaries in Swift...)
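As an aside, bridging the attachments array into Swift collections can make that check considerably less messy. A sketch - the helper name is my own, not part of the project:

```swift
import CoreMedia

// Hypothetical helper: returns true when the sample buffer holds a keyframe.
// The absence of kCMSampleAttachmentKey_NotSync marks a sync sample (keyframe).
func bufferIsKeyFrame(_ sampleBuffer: CMSampleBuffer) -> Bool {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, createIfNecessary: false) as? [[CFString: Any]],
          let first = attachments.first else {
        return false
    }
    return first[kCMSampleAttachmentKey_NotSync] == nil
}
```

The pointer-based version in the callback above is kept deliberately close to the Objective-C original it was translated from, but either form works.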

So to the data contained in a keyframe sample buffer: once we know that we are dealing with a keyframe, we can extract two sets of data from our buffer - the Sequence Parameter Set (SPS) and the Picture Parameter Set (PPS). A buffer containing a keyframe will carry both.

This is how we go about extracting these and sending them to the SPS and PPS callback:

            if (isKeyFrame) {
                let description: CMFormatDescription = CMSampleBufferGetFormatDescription(sampleBuffer)!
                // First, get SPS (parameter set index 0)
                var sparamSetCount: size_t = 0
                var sparamSetSize: size_t = 0
                var sparameterSetPointer: UnsafePointer<UInt8>?
                let spsStatusCode: OSStatus = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(description, parameterSetIndex: 0, parameterSetPointerOut: &sparameterSetPointer, parameterSetSizeOut: &sparamSetSize, parameterSetCountOut: &sparamSetCount, nalUnitHeaderLengthOut: nil)

                if (spsStatusCode == noErr) {
                    // Then, get PPS (parameter set index 1)
                    var pparamSetCount: size_t = 0
                    var pparamSetSize: size_t = 0
                    var pparameterSetPointer: UnsafePointer<UInt8>?
                    let ppsStatusCode: OSStatus = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(description, parameterSetIndex: 1, parameterSetPointerOut: &pparameterSetPointer, parameterSetSizeOut: &pparamSetSize, parameterSetCountOut: &pparamSetCount, nalUnitHeaderLengthOut: nil)
                    if (ppsStatusCode == noErr) {
                        let sps = NSData(bytes: sparameterSetPointer, length: sparamSetSize)
                        let pps = NSData(bytes: pparameterSetPointer, length: pparamSetSize)
                        encoder.delegate?.spsppsDataCallBack(sps as Data, pps: pps as Data)
                    }
                }
            }

The decoder will know how to handle these correctly once it receives them (the idea being that we call the decoder from said callback, as indicated earlier). We can also come back to the byte headers I referred to earlier - in an elementary stream, every so-called NAL unit (each packet of data, basically) must begin with the four-byte start code [0,0,0,1]. Thus, we prepend the data with that header.

After that we can handle the actual image data, which we send to our callback with an indication of whether it is a keyframe (an I-frame) or not (a P-frame):

            let dataBuffer: CMBlockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer)!
            var length: size_t = 0
            var totalLength: size_t = 0
            var bufferDataPointer: UnsafeMutablePointer<Int8>?
            let statusCodePtr: OSStatus = CMBlockBufferGetDataPointer(dataBuffer, atOffset: 0, lengthAtOffsetOut: &length, totalLengthOut: &totalLength, dataPointerOut: &bufferDataPointer)
            if (statusCodePtr == noErr) {
                var bufferOffset: size_t = 0
                let AVCCHeaderLength: Int = 4
                while (bufferOffset < totalLength - AVCCHeaderLength) {
                    // Read the NAL unit length
                    var NALUnitLength: UInt32 = 0
                    memcpy(&NALUnitLength, bufferDataPointer! + bufferOffset, AVCCHeaderLength)
                    // Big-endian to host (little-endian) byte order
                    NALUnitLength = CFSwapInt32BigToHost(NALUnitLength)

                    let data = NSData(bytes: (bufferDataPointer! + bufferOffset + AVCCHeaderLength), length: Int(NALUnitLength))
                    var frameType: FrameType = .FrameType_PFrame
                    let dataBytes = Data(bytes: data.bytes, count: data.length)
                    if ((dataBytes[0] & 0x1F) == 5) {
                        // I-frame
                        print("is IFrame")
                        frameType = .FrameType_IFrame
                    }
                    encoder.delegate?.dataCallBack(data as Data, frameType: frameType)
                    // Move to the next NAL unit in the block buffer
                    bufferOffset += AVCCHeaderLength + size_t(NALUnitLength)
                }
            }
        }
        // ... the existing buffer-to-buffer handling continues here ...
}

There are a few important details here. In the AVCC format, each NAL unit is prefixed with its length, and we need that length to walk through the buffer - see the comment "Read the NAL unit length" in the above code to see how we grab it with memcpy. The length is stored big-endian, so we also need to convert it to the host's (little-endian) byte order, for which Core Foundation provides the function CFSwapInt32BigToHost(NALUnitLength). Further down (the last line) we use that length to move to the next NAL unit in the buffer.

One final detail is the inspection of the dataBytes variable as a convenient way to know whether we are dealing with an I-frame or not - information we use to set the frameType passed to the dataCallBack. The low five bits of the first byte of a NAL unit give its type, and type 5 is an IDR slice, i.e. an I-frame.
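To make that bit-twiddling concrete, here is a small stand-alone sketch of reading the NAL unit type - the helper function is my own illustration, not part of the project:

```swift
import Foundation

// The NAL unit type is the low five bits of the first byte (H.264).
// Common types: 1 = non-IDR slice (P-frame), 5 = IDR slice (I-frame),
// 7 = SPS, 8 = PPS.
func nalUnitType(of data: Data) -> UInt8? {
    guard let firstByte = data.first else { return nil }
    return firstByte & 0x1F
}
```

So a NAL unit whose payload starts with 0x65 has type 5 and is an IDR slice, while one starting with 0x41 has type 1 and is an ordinary P-frame slice.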

So at this stage, we have acquired everything we need to process the elementary stream of data. We could send this data to a decoder of our choice now. In the next article we will decode the data using an updated version of our decoder class. Part 3 is on its way...

Those of you keen to dig deeper may find this SO discussion and this one useful.

Here, again, is a link to the previous article on this subject.

The repository accompanying this post is available here.

Alan Allard is a developer at Eyevinn Technology, the European leading independent consultancy firm specializing in video technology and media distribution.

If you need assistance in the development and implementation of this, our team of video developers are happy to help out. If you have any questions or comments just drop us a line in the comments section to this post.
